Analytics 512: Take Home Final Exam 2019
200 points in ten problems.
This is the take-home portion of the exam. You may use your notes, your books, all material on the course
website, and your computer or any computer in the departmental computer lab. You may also use official
documentation for R, built-in or on https://cran.r-project.org/, but no other material on the Internet. Provide
proper attribution for all such sources. You may not use any human help, except whatever help is provided
by me.
Your solution should consist of two files: An .Rmd file that loads all data and all packages, makes all plots,
and contains all comments and explanation, and an .html or .pdf file that is produced by the .Rmd file.
Return your solutions by Friday, 5/10/19, 11:59 PM: submit them in Canvas, hand in printed copies of both
files, or fax both files to 202.687.6067.
Part I: Bikeshare Ridership
The first part of the exam uses data on hourly ridership counts for the Capital Bikeshare system in Washington,
DC for the years 2011 and 2012. Use the data frame cabi. The data frame contains time related variables
and weather related variables, plus two numerical target variables. Each observation contains data for
one hour during these two years, with a few gaps.
The data have been adapted from a set at the UCI repository. Link to the original data set:
https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
System data of the Capital Bikeshare system are here: https://www.capitalbikeshare.com/system-data
Time related variables
season a categorical variable with values 1 (for January - March), 2 (April - June), 3 (July - September),
4 (October - December)
year with values 2011 and 2012
month a categorical variable with values 1, 2, ..., 12
wday which is 0 for weekends and holidays and 1 otherwise
hr a numerical variable with values 0, 1, ..., 23
Weather related variables
temp scaled temperature
atemp scaled perceived temperature
hum scaled humidity
windspeed scaled wind speed
weather a categorical variable with values 1 (e.g. clear or few clouds or partly cloudy), 2 (e.g. cloudy
or broken clouds or foggy), 3 (e.g. snow or rain or thunderstorm)
Target variables
The bikeshare system has casual riders who rent bicycles on the spot (e.g., tourists) and registered riders
who have a subscription (e.g., commuters).
casual Total number of casual riders during this hour
registered Total number of registered riders during this hour
Problem 1 (20)
Use numerical summaries, graphs, etc. to answer the following questions. No model fitting or other statistical
procedures are required for this. Each graph should help answer one or more of these questions and should
be accompanied by explanations.
(a) How do ridership counts depend on the year? The month? The hour of the day? How do casual and
registered riders differ in this respect?
(b) How are casual and registered ridership counts related? Does this depend on the year? Does it depend
on the type of day (working day or not)?
(c) Is there an association between the weather situation and ridership counts? For casual riders? For
registered riders?
(d) There are relations between time related predictors and weather related predictors. Demonstrate this
with a few suitable graphs.
For problems 2-4, split the data into a training set (70%) and a test set (30%).
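One way to make such a split is sketched below in base R. It is only an illustration: `cabi_demo` is a small synthetic stand-in for `cabi` (the real data are not reproduced here), and the seed and the names `cabi_train` and `cabi_test` are assumptions, not part of the exam.

```r
# Synthetic stand-in for the cabi data frame (hypothetical values, for illustration only).
set.seed(512)  # arbitrary seed so that the split is reproducible
n <- 1000
cabi_demo <- data.frame(
  hr = sample(0:23, n, replace = TRUE),
  temp = runif(n),
  registered = rpois(n, lambda = 150)
)

# Draw 70% of the row indices at random without replacement for training.
train_idx <- sample(seq_len(nrow(cabi_demo)), size = round(0.7 * nrow(cabi_demo)))

# The remaining 30% of the rows form the test set.
cabi_train <- cabi_demo[train_idx, ]
cabi_test  <- cabi_demo[-train_idx, ]
```

Setting a seed before sampling keeps the same split every time the .Rmd file is knitted.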
Problem 2 (25)
(a) Fit a multiple regression to predict registered ridership from the other variables (excluding casual
ridership), using the training data. Identify the significant variables and comment on their coefficients.
(b) Estimate the RMS prediction error of this model using the test set.
(c) Does the RMS prediction error depend on the month? Answer this question using the test data and
suitable tables or graphs.
(d) Make copies of the training and test data in which hr is a categorical variable. Fit a multiple regression
model. Compare the summary of this model to the one from part (a). Also estimate the RMS prediction
error from the test set.
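The RMS prediction error in parts (b) to (d) can be computed along the following lines. This is a minimal sketch on synthetic data, not a solution: the data frame `demo`, the helper `rmse`, and the simulated dependence of `registered` on `hr` are all assumptions made for illustration.

```r
# Synthetic data (hypothetical): ridership depends nonlinearly on the hour.
set.seed(512)
n <- 2000
demo <- data.frame(
  month = sample(1:12, n, replace = TRUE),
  hr = sample(0:23, n, replace = TRUE),
  temp = runif(n)
)
demo$registered <- 100 + 80 * sin(pi * demo$hr / 12) + 50 * demo$temp + rnorm(n, sd = 20)

train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- demo[train_idx, ]
test  <- demo[-train_idx, ]

# Root mean squared prediction error.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Multiple regression with hr as a numerical predictor.
fit_num  <- lm(registered ~ hr + temp, data = train)
pred_num <- predict(fit_num, newdata = test)
rmse_num <- rmse(test$registered, pred_num)

# RMS error separately for each month of the test data.
rmse_by_month <- sapply(
  split(seq_len(nrow(test)), test$month),
  function(i) rmse(test$registered[i], pred_num[i])
)

# Copies of the data in which hr is categorical, then the same model formula.
train_f <- transform(train, hr = factor(hr, levels = 0:23))
test_f  <- transform(test,  hr = factor(hr, levels = 0:23))
fit_fac  <- lm(registered ~ hr + temp, data = train_f)
rmse_fac <- rmse(test_f$registered, predict(fit_fac, newdata = test_f))
```

With `hr` as a factor the model estimates one coefficient per hour instead of a single slope, so it can follow a daily pattern that is not linear in the hour.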
Problem 3 (30)
Use the original cabi data for this problem.
(a) Train artificial neural networks with various numbers of nodes in the hidden layer to predict registered
ridership. Use the training data and only weather related variables. Recommend a suitable number of
nodes, with explanation.
(b) Repeat part (a), using only time related variables.
(c) Repeat part (a), using two time related and two weather related variables. Explain your choice of
variables.
Problem 4 (10)
What do you think are six useful predictors? Use any method you want to answer this question.
Part II: Vegetation Cover
Problems 5 - 8 use data on vegetation cover. Use the data frames covtype.train and covtype.test. The
original data are at https://archive.ics.uci.edu/ml/datasets/Covertype
Each data set contains 10,000 observations of 55 variables. These have been collected on 30m × 30m patches
of hilly forest land by the US Forest Service.
elev = elevation in meters, slope = slope of the terrain in degrees, aspect = direction of the slope in
degrees
h_dist_hydro, v_dist_hydro = Horizontal and vertical distance to nearest water feature in meters
h_dist_road = Horizontal distance to nearest roadway in meters
hillshade_9, hillshade_12, hillshade_3 = Index for hill shade at 9 AM, 12 noon, 3 PM, at
summer solstice
h_dist_fire = Horizontal distance to nearest wildfire ignition point in meters
wild1, ..., wild4 = binary indicator variables for wilderness designation
soil1, ..., soil40 = binary indicator variables for soil type
cover = Target variable (type of forest cover), with values 2 and 3.
Problem 5 (20)
Fit a logistic model to the training data in order to separate the classes. Choose a classification threshold
so that sensitivity and specificity are approximately the same on the training data. Then report sensitivity,
specificity, and overall error rate for the test data.
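The threshold choice can be sketched as follows, on synthetic data. Everything here is an assumption for illustration: the predictors `x1` and `x2`, the helper `sens_spec`, the grid of thresholds, and the decision to treat cover type 3 as the "positive" class.

```r
# Synthetic two-class data (hypothetical), with classes labeled 2 and 3.
set.seed(512)
n <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
cover <- ifelse(runif(n) < plogis(-1 + 2 * x1 - x2), 3, 2)
demo <- data.frame(x1, x2, is3 = as.integer(cover == 3))  # 1 = cover type 3
train <- demo[1:1400, ]
test  <- demo[1401:2000, ]

# Logistic regression fitted to the training data.
fit <- glm(is3 ~ x1 + x2, data = train, family = binomial)
p_train <- predict(fit, type = "response")

# Sensitivity and specificity for a given classification threshold.
sens_spec <- function(truth, prob, threshold) {
  pred <- as.integer(prob >= threshold)
  c(sensitivity = mean(pred[truth == 1] == 1),
    specificity = mean(pred[truth == 0] == 0))
}

# Pick the threshold where the two rates are closest on the training data.
thresholds <- seq(0.01, 0.99, by = 0.01)
gap <- sapply(thresholds, function(t) {
  s <- sens_spec(train$is3, p_train, t)
  unname(abs(s["sensitivity"] - s["specificity"]))
})
best <- thresholds[which.min(gap)]

# Apply the chosen threshold to the test data.
p_test <- predict(fit, newdata = test, type = "response")
test_metrics <- sens_spec(test$is3, p_test, best)
test_error <- mean(as.integer(p_test >= best) != test$is3)
```

The threshold is selected from the training data only, so the test data stay untouched until the final assessment.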
Problem 6 (25)
Fit a support vector machine with radial kernels in order to separate the classes. Tune the cost and gamma
parameters so that cross validation gives the best performance on the training data. Then assess the resulting
model on the test data. Report sensitivity, specificity, and overall error rate for training and test data.
Problem 7 (10)
Fit a decision tree to the training data in order to separate the two classes. Prune the tree using cross
validation and make sure that there are no redundant splits (i.e. splits that lead to leaves with the same
classification). Then estimate the classification error rate for the pruned tree from the test data.
Problem 8 (20)
Fit a random forest model to the training data in order to separate the classes. Identify the ten most
important variables and fit another random forest model, using only these variables. Use the test data to
decide which model has better performance.
Part III: MNIST Digit Data
Problems 9 and 10 use the MNIST image classification data, available as mnist_all.RData in Canvas. We
use only the test data (10,000 images).
Problem 9 (20)
(a) Select a random subset of 1000 digits. Use hierarchical clustering with complete linkage on these images
and visualize the dendrogram.
(b) Does the dendrogram provide compelling evidence about the “correct” number of clusters? Explain
your answer.
(c) Cut the dendrogram to generate a set of clusters that appears to be reasonable. There should be
between 5 and 15 clusters. Then find a way to create a visual representation (i.e. a typical image) of
each cluster. Explain and describe your approach.
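The clustering steps can be sketched in base R as below. The matrix `images` is a small synthetic stand-in (300 rows of 16 "pixels"), not the MNIST data, and using the pixel-wise mean of each cluster as its typical image is just one possible choice, named here as an assumption.

```r
# Synthetic stand-in for image data (hypothetical): noisy copies of 3 prototypes.
set.seed(512)
prototypes <- matrix(runif(3 * 16), nrow = 3)
labels <- sample(1:3, 300, replace = TRUE)
images <- prototypes[labels, ] + matrix(rnorm(300 * 16, sd = 0.05), nrow = 300)

# Hierarchical clustering with complete linkage on Euclidean distances.
hc <- hclust(dist(images), method = "complete")
# plot(hc, labels = FALSE)  # draws the dendrogram

# Cut the dendrogram into a chosen number of clusters.
clusters <- cutree(hc, k = 3)

# One possible "typical image" per cluster: the pixel-wise mean of its members.
# rowsum() adds up the rows within each cluster; dividing by the cluster sizes
# turns the sums into means (row i is divided by the size of cluster i).
typical <- rowsum(images, clusters) / as.vector(table(clusters))
# image(matrix(typical[1, ], nrow = 4))  # displays the first typical image
```

For the exam data each row would hold the pixels of one digit image, so each row of `typical` can be reshaped into an image in the same way.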
Problem 10 (20)
Use Principal Component Analysis on the MNIST images.
(a) Make a plot of the proportion of variance explained vs. number of principal components. What fraction
of the variance is explained by the first two principal components? What fraction is explained by the
first ten principal components?
(b) Plot the scores of the first two principal components of all digits against each other, color coded by the
digit that is represented. Comment on the plot. Does it appear that digits may be separated by these
scores?
(c) Find three digits which are reasonably well separated by the plot that you made in part (b). Illustrate
this with a color coded plot like the one in (b) for just these three digits. Don’t expect perfect separation.
(d) Find three other digits which are not well separated by the plot that you made in part (b). Illustrate
this with another color coded plot like the one in (b) for just these three digits.
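The proportion of variance explained and the score plot can be obtained from `prcomp` as sketched below. The matrix `x` is synthetic (500 rows, 20 columns) and stands in for the image matrix; its column scales are an assumption chosen only so that the components differ in variance.

```r
# Synthetic stand-in for the image matrix (hypothetical): columns with decreasing spread.
set.seed(512)
x <- matrix(rnorm(500 * 20), nrow = 500) %*% diag(seq(5, 0.25, length.out = 20))

# PCA on centered, unscaled data.
pca <- prcomp(x, center = TRUE, scale. = FALSE)

# Proportion of variance explained by each component, and cumulatively.
pve <- pca$sdev^2 / sum(pca$sdev^2)
cum_pve <- cumsum(pve)
# plot(cum_pve, type = "b", xlab = "Number of components", ylab = "Cumulative PVE")

# Fractions explained by the first two and the first ten components.
frac_two <- cum_pve[2]
frac_ten <- cum_pve[10]

# Scores of the first two principal components, one row per observation.
scores <- pca$x[, 1:2]
# plot(scores, col = digit + 1)  # 'digit' would be the vector of class labels
```

Scaling is left off in this sketch because `prcomp` with `scale. = TRUE` stops with an error when a column is constant, which can happen for pixels that are blank in every image.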