STAT 7008 - Assignment 1
Due date: 5 Oct 2018
The use of numpy and pandas in this assignment is prohibited. You will
receive zero marks for any problem in this assignment that you solve using
these packages.
Question 1
1. Write code to read the data file TrainingData.csv.
The first row is the header (variable names). Data are stored in
subsequent rows.
2. Determine the number of variables and the number of records in this
dataset.
3. Store the variable names in a list.
4. Determine whether there are any missing values in the data set. If yes,
report the total number of missing values.
5. Find the number of distinct LCID values in the data set.
6. Find the variable with the most missing values.
7. Convert the variable hour_id to datetime format.
8. What is the time duration of the entire data set?
9. Determine the number of records per day.
10. Using the median function in the statistics package (from statistics
import median) or otherwise, do the following:
(a) Divide the entire data set by distinct value of LCID.
(b) For each distinct LCID value, determine the median of each
variable in the divided data set.
(c) Package the result in (b) in a dictionary.
11. Determine the number of Complaint cases and Non-complaint cases
in the entire data set.
12. Determine the top 10 LCIDs with the most complaint cases.
13. Calculate the daily median of each variable in the entire data
set.
14. Use the first 5 digits of the LCID values to define a new variable Region.
15. Determine the region with the most complaint cases found in the data
set.
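Several of the Question 1 steps above can be sketched with the standard library only. The block below is a minimal illustration, not a model answer: the sample rows, the variable names var_a and var_b, the hour_id format string, and the rule that an empty cell counts as missing are all assumptions, since the handout does not show the contents of TrainingData.csv. The complaint-related steps are left out because the handout does not name the complaint column.

```python
import csv
import io
from collections import Counter, defaultdict
from datetime import datetime
from statistics import median

# Hypothetical stand-in for TrainingData.csv (real columns are not shown).
SAMPLE = """LCID,hour_id,var_a,var_b
1000123,2018-09-01 00:00,1.0,
1000123,2018-09-01 01:00,3.0,5.0
1000245,2018-09-02 00:00,2.0,
1000245,2018-09-02 01:00,,7.0
"""
HOUR_FMT = "%Y-%m-%d %H:%M"  # assumed layout of hour_id; adjust to the real file


def read_table(handle):
    """Return (header, rows); each row is a dict keyed by variable name."""
    reader = csv.reader(handle)
    header = next(reader)  # first row holds the variable names
    rows = [dict(zip(header, rec)) for rec in reader]
    return header, rows


def count_missing(header, rows):
    """Count empty cells per variable (assumes missing means empty string)."""
    return {name: sum(1 for r in rows if r[name].strip() == "")
            for name in header}


def median_by_lcid(header, rows, skip=("LCID", "hour_id")):
    """Map each LCID to {variable: median of its non-missing values}."""
    groups = defaultdict(lambda: defaultdict(list))
    for r in rows:
        for name in header:
            if name not in skip and r[name].strip() != "":
                groups[r["LCID"]][name].append(float(r[name]))
    return {lcid: {name: median(vals) for name, vals in cols.items()}
            for lcid, cols in groups.items()}


# With the real file, replace io.StringIO(SAMPLE) by
# open("TrainingData.csv", newline="").
header, rows = read_table(io.StringIO(SAMPLE))
missing = count_missing(header, rows)
medians = median_by_lcid(header, rows)

# Convert hour_id to datetime and derive Region from the first 5 LCID digits.
for r in rows:
    r["hour_id"] = datetime.strptime(r["hour_id"], HOUR_FMT)
    r["Region"] = r["LCID"][:5]

per_day = Counter(r["hour_id"].date() for r in rows)
duration = max(r["hour_id"] for r in rows) - min(r["hour_id"] for r in rows)

print(len(header), len(rows))         # number of variables, number of records
print(sum(missing.values()))          # total missing values
print(max(missing, key=missing.get))  # variable with the most missing values
print(len({r["LCID"] for r in rows})) # number of distinct LCIDs
print(duration, dict(per_day))
```

Keeping every cell as a string until it is needed makes the missing-value count straightforward; numeric conversion happens only inside median_by_lcid.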
Question 2
The objective of this part is to employ the provided data sets u.data and
u.item to develop a movie recommender.
u.data consists of user ratings on a set of movies. The last column
holds timestamps in seconds since 1 Jan 1970. Column names for
u.data are ["userid","movieid","rating","timestamp"].
u.item represents the set of movies defined in u.data. The column names for
u.item are ["movieid", "title", "release", "url", "unknown", "Action",
"Adventure", "Animation", "Children", "Comedy", "Crime", "Documentary",
"Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery", "Romance",
"Sci-Fi", "Thriller", "War", "Western"].
You can also download the two data files from
http://grouplens.org/datasets/movielens/.
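As a rough illustration of the u.data layout described above, the sketch below parses rating rows into dictionaries. It assumes the file is tab-separated, which is how the MovieLens 100K distribution ships u.data; check your copy before relying on it. The function name load_ratings and the two sample rows are illustrative, and UTC is used for the timestamp conversion only to keep the output independent of the local time zone.

```python
import csv
import io
from datetime import datetime, timezone

DATA_COLS = ["userid", "movieid", "rating", "timestamp"]

# Two illustrative rows in the assumed tab-separated layout of u.data.
SAMPLE_DATA = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"


def load_ratings(handle, sep="\t"):
    """Parse rating rows into dicts keyed by DATA_COLS."""
    ratings = []
    for rec in csv.reader(handle, delimiter=sep):
        row = dict(zip(DATA_COLS, rec))
        row["rating"] = int(row["rating"])
        # The raw value counts seconds since 1 Jan 1970.
        row["timestamp"] = datetime.fromtimestamp(
            int(row["timestamp"]), tz=timezone.utc)
        ratings.append(row)
    return ratings


# With the real file, pass open("u.data", newline="") instead.
ratings = load_ratings(io.StringIO(SAMPLE_DATA))
print(ratings[0]["userid"], ratings[0]["rating"], ratings[0]["timestamp"])
```

u.item can be read the same way with a different separator and the longer column list given above.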
1. Import the two data files with an appropriate separator. Then do the
following:
(a) Set the timestamp variable to its datetime format using the
datetime.fromtimestamp() method.
(b) Add leading zeros to the movieid and userid with the zfill(4) method,
e.g. "0023".
2. Remove movies with title = 'unknown'.
3. Find the average ratings and the number of reviews for all movies in
u.item.
4. Write a function to list the top n (e.g. 10) rated movies, title names
and their number of reviews.
5. Considering that a movie with a higher number of reviews should be
given a higher weight, we adjust the average rating formula by
incorporating c hypothetical users. These users rate each movie with
rating m. Using c = 59 and m = 3, write a function to list the top n rated
movies, title names and their number of reviews using the adjusted
average formula. Compare the listing with that found in question 4.
Which one is more reasonable?
6. For two distinct users A and B, find the set of movies common to both
users, that is, the set of movies that both users have rated. Apply
the Euclidean distance formula to the two sets of ratings to determine
a "distance" between user A and user B. Write a distance function with
the userid of A and the userid of B as input. The output of the function is
1/(1+d(A,B)), where d(A,B) is the distance between user A and user B.
7. Given a user, write a function to determine and output a list of
distances between the given user and all other users. Label each distance
with the corresponding user.
8. Write a function that takes a given user and a given movie. If the movie
was rated by the user, output the rating provided. If the movie was not
rated by the user, output the weighted average of the ratings of all
other users, weighted by their distances to the given user.
9. Hence, given a user, write a function to suggest 10 movies.
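The scoring steps in items 5 to 9 can be sketched as follows. This is one possible reading, not the official solution: the adjusted average is taken to be (sum of ratings + c*m) / (number of ratings + c), the weights in item 8 are taken to be the 1/(1+d) values from item 6, and users with no movies in common are given weight 0. The toy RATINGS dictionary and all function names are made up for illustration.

```python
from math import sqrt

# Hypothetical toy ratings: {userid: {movieid: rating}}.
RATINGS = {
    "0001": {"0010": 5, "0020": 3, "0030": 4},
    "0002": {"0010": 4, "0020": 3},
    "0003": {"0010": 1, "0030": 2, "0040": 5},
}


def adjusted_average(scores, c=59, m=3):
    """Average after adding c hypothetical users who each give rating m."""
    return (sum(scores) + c * m) / (len(scores) + c)


def similarity(a, b, ratings=RATINGS):
    """Return 1 / (1 + Euclidean distance) over movies both users rated."""
    common = ratings[a].keys() & ratings[b].keys()
    if not common:
        return 0.0  # assumption: no shared movies means zero weight
    d = sqrt(sum((ratings[a][mv] - ratings[b][mv]) ** 2 for mv in common))
    return 1 / (1 + d)


def predict(user, movie, ratings=RATINGS):
    """Return the user's own rating, else a similarity-weighted average."""
    if movie in ratings[user]:
        return ratings[user][movie]
    pairs = [(similarity(user, other, ratings), r[movie])
             for other, r in ratings.items()
             if other != user and movie in r]
    total = sum(w for w, _ in pairs)
    return sum(w * s for w, s in pairs) / total if total else None


def recommend(user, n=10, ratings=RATINGS):
    """Suggest up to n unseen movies, highest predicted rating first."""
    seen = set(ratings[user])
    candidates = {mv for r in ratings.values() for mv in r} - seen
    scored = [(predict(user, mv, ratings), mv) for mv in candidates]
    scored = [(s, mv) for s, mv in scored if s is not None]
    return [mv for s, mv in sorted(scored, reverse=True)[:n]]


print(adjusted_average([5]))         # one 5-star review is pulled toward m
print(adjusted_average([4] * 100))   # many reviews keep the average near 4
print(similarity("0001", "0002"))
print(recommend("0002"))
```

Under this reading, a movie with a single 5-star review scores close to m = 3, while a movie with a hundred 4-star reviews stays well above it, which is the effect item 5 asks you to compare against the plain average.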