2018.10.01

STAT 7008 - Assignment 1

Due date: 5 Oct 2018

The use of numpy and pandas in this assignment is prohibited. You will

receive zero marks for any problem in this assignment that you solve

using these packages.

Question 1

1. Write code to read the data file TrainingData.csv.

The first row is the header (variable names). Data are stored in

subsequent rows.

2. Determine the number of variables and the number of records in this

dataset.

3. Store the variable names in a list.

4. Determine whether there are any missing values in the data set. If so,

report the total number of missing values.

5. Find the number of distinct LCID values in the data set.

6. Find the variable with the most missing values.

7. Convert the variable hour_id to datetime format.

8. What is the time duration of the entire data set?

9. Determine the number of records per day.

10. Using the median function from the statistics package (from statistics

import median) or otherwise, do the following:

(a) Divide the entire data set by distinct values of LCID.

(b) For each distinct LCID value, determine the median of each

variable in the divided data set.

11. Determine the number of Complaint cases and Non-complaint cases

in the entire data set.

12. Determine the top 10 LCIDs with the most complaint cases.

13. Calculate the median value of each variable per day over the entire data

set.

14. Use the first 5 digits of the LCID values to define a new variable Region.

15. Determine the region with the most complaint cases found in the data

set.
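
As a rough illustration of the pandas-free approach that Question 1 calls for, the sketch below reads a CSV with the standard csv module, counts missing values, counts records per day, and computes per-LCID medians. The sample rows, the column name kpi_a, the hour_id format string, and the assumption that missing values appear as empty cells are all hypothetical, since the layout of TrainingData.csv is not shown here.

```python
import csv
import io
from collections import Counter, defaultdict
from datetime import datetime
from statistics import median

# Hypothetical stand-in for TrainingData.csv. Only LCID and hour_id are
# names taken from the assignment; kpi_a and all values are invented.
SAMPLE = """LCID,hour_id,kpi_a
1000101,2018-01-01 00:00,1.5
1000101,2018-01-01 01:00,
1000102,2018-01-01 00:00,4.0
1000101,2018-01-02 00:00,2.5
"""


def read_table(handle):
    """Return (header, rows); each row is a dict keyed by variable name."""
    reader = csv.reader(handle)
    header = next(reader)  # the first row holds the variable names
    rows = [dict(zip(header, line)) for line in reader if line]
    return header, rows


def count_missing(header, rows):
    """Count empty cells per variable (assumed encoding of 'missing')."""
    return {name: sum(1 for row in rows if row[name].strip() == "")
            for name in header}


def records_per_day(rows, fmt="%Y-%m-%d %H:%M"):
    """Count records per calendar day; fmt is an assumed hour_id format."""
    return Counter(datetime.strptime(row["hour_id"], fmt).date()
                   for row in rows)


def median_by_lcid(rows, variable):
    """Median of one numeric variable within each LCID, skipping empty cells."""
    groups = defaultdict(list)
    for row in rows:
        if row[variable].strip():
            groups[row["LCID"]].append(float(row[variable]))
    return {lcid: median(values) for lcid, values in groups.items()}


header, rows = read_table(io.StringIO(SAMPLE))
missing = count_missing(header, rows)
per_day = records_per_day(rows)
medians = median_by_lcid(rows, "kpi_a")
```

For the real file, io.StringIO(SAMPLE) would be replaced by an open file handle, for example open("TrainingData.csv", newline=""). The variable with the most missing values can then be found with max(missing, key=missing.get).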

Question 2

The objective of this part is to employ the provided data sets u.data and

u.item to develop a movie recommender.

u.data consists of user ratings on a set of movies. The last column

holds timestamps in seconds since 1 Jan 1970. Column names for

u.data are ["userid","movieid","rating","timestamp"].

u.item represents the set of movies defined in u.data. The column names for

u.item are ["movieid", "title", "release", "url", "unknown", "Action",

"Adventure", "Animation", "Children", "Comedy", "Crime", "Documentary",

"Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery", "Romance",

"Sci-Fi", "Thriller", "War", "Western"].

You can also download the two data files from

http://grouplens.org/datasets/movielens/.
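
The file layout described above can be read with the csv module roughly as follows. This is a sketch under assumptions: in the MovieLens 100K distribution u.data is tab-separated and u.item is pipe-separated, but the copies provided with the assignment may differ. The helper read_delimited and the in-memory sample rows are hypothetical, and only the first three u.item columns are used to keep the sample short.

```python
import csv
import io
from datetime import datetime, timezone

RATING_COLUMNS = ["userid", "movieid", "rating", "timestamp"]

# Hypothetical two-line stand-ins for the real files.
U_DATA = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"
U_ITEM = "1|Toy Story (1995)|01-Jan-1995\n2|GoldenEye (1995)|01-Jan-1995\n"


def read_delimited(handle, delimiter, columns):
    """Read a header-less delimited file into a list of dicts."""
    reader = csv.reader(handle, delimiter=delimiter)
    return [dict(zip(columns, fields)) for fields in reader if fields]


ratings = read_delimited(io.StringIO(U_DATA), "\t", RATING_COLUMNS)
for row in ratings:
    row["rating"] = int(row["rating"])
    # The timestamp counts seconds since 1 Jan 1970. UTC is chosen here so
    # the result does not depend on the local time zone (an assumption,
    # not something the assignment specifies).
    row["timestamp"] = datetime.fromtimestamp(int(row["timestamp"]),
                                              tz=timezone.utc)

movies = read_delimited(io.StringIO(U_ITEM), "|",
                        ["movieid", "title", "release"])
```

When reading the real u.item from disk, an explicit encoding such as open("u.item", encoding="latin-1") may be needed, because the MovieLens 100K copy of that file is not UTF-8.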

1. Import the two data files with an appropriate separator. Do the

following:

(a) Convert the timestamp variable to datetime format using the

datetime.fromtimestamp() method.

(b) Add leading zeros to the movieid and userid with the zfill(4) method,

e.g. "0023".

2. Remove movies with title = 'unknown'.

3. Find the average ratings and the number of reviews for all movies in

u.item.

4. Write a function to list the n (e.g. 10) top-rated movies, with their

titles and number of reviews.

5. Considering that a movie with a higher number of reviews should be

given a higher weight, we adjust the average rating formula by

incorporating c hypothetical users. These users rate each movie with

rating m. Using c = 59 and m = 3, write a function to list the n top-rated

movies, with their titles and number of reviews, using the adjusted

average formula. Compare the listing with that found in question 4.

Which one is more reasonable?

6. For two distinct users A and B, find the set of movies common to both

users, that is, the set of movies that both users have rated. Apply

the Euclidean distance formula to the two sets of ratings to determine

a "distance" between user A and user B. Write a distance function with

the userid of A and the userid of B as input. The output of the function is

1/(1+d(A,B)), where d(A,B) is the distance between user A and user B.

7. Given a user, write a function to determine and output a list of

distances between the given user and others. Mark the distances with

their users.

8. Write a function that takes a given user and a given movie. If the movie was

rated by the user, output the rating provided. If the movie was not

rated by the user, output the average of all other users' ratings of that

movie, weighted by their distances from the given user.

9. Hence, given a user, write a function to suggest 10 movies.
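
The adjusted average, the distance-based similarity, and the weighted prediction described in items 5 to 9 can be sketched as follows. This is one possible reading, not a reference solution. The nested-dict layout of RATINGS and its values are hypothetical. The sketch assumes that the adjusted average is (sum of ratings + c*m) / (number of ratings + c), that users with no movies in common get similarity 0, and that the "distance" weights in item 8 are the 1/(1+d(A,B)) values from item 6.

```python
from math import sqrt

# Hypothetical ratings: {userid: {movieid: rating}}, with ids zero-padded
# as in item 1(b).
RATINGS = {
    "0001": {"0010": 5, "0020": 3, "0030": 4},
    "0002": {"0010": 4, "0020": 3},
    "0003": {"0010": 1, "0030": 2, "0040": 5},
}


def adjusted_average(movie_ratings, c=59, m=3):
    """Mean rating after adding c hypothetical users who each rate m."""
    return (sum(movie_ratings) + c * m) / (len(movie_ratings) + c)


def similarity(ratings, user_a, user_b):
    """Return 1 / (1 + d(A, B)), d being Euclidean distance on common movies."""
    common = ratings[user_a].keys() & ratings[user_b].keys()
    if not common:
        return 0.0  # assumption: no overlap gives no evidence of similarity
    d = sqrt(sum((ratings[user_a][mv] - ratings[user_b][mv]) ** 2
                 for mv in common))
    return 1 / (1 + d)


def predict(ratings, user, movie):
    """Return the user's own rating, or a similarity-weighted average."""
    if movie in ratings[user]:
        return ratings[user][movie]
    weighted_sum = 0.0
    weight_total = 0.0
    for other, their in ratings.items():
        if other == user or movie not in their:
            continue
        w = similarity(ratings, user, other)
        weighted_sum += w * their[movie]
        weight_total += w
    # None signals that no other user with nonzero weight rated the movie.
    return weighted_sum / weight_total if weight_total else None


def suggest(ratings, user, n=10):
    """List up to n unrated movies, highest predicted rating first."""
    all_movies = {mv for their in ratings.values() for mv in their}
    candidates = all_movies - set(ratings[user])
    scored = [(predict(ratings, user, mv), mv) for mv in candidates]
    scored = [(score, mv) for score, mv in scored if score is not None]
    return [mv for score, mv in sorted(scored, reverse=True)[:n]]
```

With c = 59 and m = 3, a movie with a single 5-star review is pulled close to 3, while a movie with many reviews stays near its raw average. That pull is the intended effect of the hypothetical users.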