R Programming: Final Exam
Reminder:
• Submit your exam via Brightspace before the deadline: 3:00 pm on Sunday, April 8th.
• ABSOLUTELY NO COLLABORATION FOR THE EXAM! YOU MUST DO THE EXAM COMPLETELY
ON YOUR OWN!
• Show the outputs and results clearly.
• Provide the executable R code.
• Explain your results in detail using your own words.
• Comment your code for major steps.
Part 1 - Regression
Use the Carseats data set in package ISLR (column Sales is the response; the other columns are predictors)
to answer the following questions.
[1 point] (a) Briefly describe what this data set is about. (hint: ?Carseats)
Before your analysis, set a random seed using the last 3 digits of your student ID.
[1 point] (b) Split the data set into a training data set (80%) and a test data set (20%).
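One common way to make an 80/20 split like the one in (b) is to fix the seed and then sample row indices. The sketch below is an illustration only. It uses the built-in mtcars data as a stand-in for Carseats, and the seed 123 and the names train_idx, train and test are placeholders, not something the exam prescribes.

```r
# Stand-in data: mtcars ships with base R (the exam uses Carseats instead).
dat <- mtcars

# Placeholder seed; the exam asks for the last 3 digits of your student ID.
set.seed(123)

# Draw 80% of the row indices without replacement for the training set.
n <- nrow(dat)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

train <- dat[train_idx, ]   # 80% of the rows
test  <- dat[-train_idx, ]  # the remaining rows

dim(train)
dim(test)
```

Setting the seed before sample() is what makes the split reproducible. The same train_idx can then be reused for every model so that all test errors are computed on the same held-out rows.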
[5 points] (c) Fit a multiple linear regression model on the training data using all predictors. Use the
summary() function to print the results and calculate the test error. Which predictors appear to have a
statistically significant relationship with the response? What does the coefficient for Advertising suggest?
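The workflow in (c) (fit on the training rows, inspect summary(), score on the test rows) can be sketched as follows. This is a hedged illustration on simulated data, not a solution for Carseats. The variables x1, x2 and y are hypothetical, and taking "test error" to mean the mean squared error is an assumption.

```r
# Simulated stand-in data (hypothetical); the exam uses Carseats with Sales as response.
set.seed(123)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 + rnorm(n)  # x2 has no true effect on y

# 80/20 split by sampling row indices.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# "y ~ ." regresses y on every other column of the training data.
fit <- lm(y ~ ., data = train)
summary(fit)  # coefficient table with standard errors, t-tests and p-values

# Test error, taken here to be the mean squared error on the held-out rows.
pred <- predict(fit, newdata = test)
test_mse <- mean((test$y - pred)^2)
test_mse
```

In the summary() table, each slope estimates the expected change in the response for a one-unit increase in that predictor with the other predictors held fixed. The p-value column is what is usually read to judge statistical significance.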
[2 points] (d) Use bootstrapping to estimate the interquartile range (IQR = Q3 - Q1) of the Advertising
column and provide the 95% CI (use 1000 bootstrap replicates). (hint: quantile() function. Q1: 25th
percentile; Q3: 75th percentile)
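The bootstrap in (d) resamples the observed values with replacement, recomputes the statistic on each resample, and summarises the resulting distribution. A minimal sketch in base R follows. The vector x is simulated as a stand-in for the Advertising column, iqr_stat is a hypothetical helper name, and the percentile interval is only one of several accepted ways to form a 95% CI (estimate ± 2 bootstrap standard errors is another).

```r
# Simulated stand-in vector (hypothetical); the exam uses Carseats$Advertising.
set.seed(123)
x <- rexp(400, rate = 0.2)

# IQR = Q3 - Q1, computed with quantile().
iqr_stat <- function(v) {
  q <- quantile(v, probs = c(0.25, 0.75), names = FALSE)
  q[2] - q[1]
}

B <- 1000  # number of bootstrap replicates

# Each replicate resamples the data with replacement and recomputes the IQR.
boot_iqr <- replicate(B, iqr_stat(sample(x, size = length(x), replace = TRUE)))

iqr_hat <- iqr_stat(x)                                   # estimate on the original sample
boot_se <- sd(boot_iqr)                                  # bootstrap standard error
ci_95   <- quantile(boot_iqr, probs = c(0.025, 0.975))   # percentile 95% interval

iqr_hat
boot_se
ci_95
```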
[5 points] (e) Choose one of the following methods (best subset, forward/backward stepwise selection, the
Lasso) to do the model selection.
Apply the method you chose to the training data and use 10-fold cross-validation to find the optimal parameter.
Provide the outputs and plots that show which predictors are selected in your best model. What are their
coefficients? Use your best model to make predictions on the test data and calculate the test error. (hint:
The parameter is the number of predictors selected in the model for the subset approaches and the
tuning parameter (λ) for the Lasso.)
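To make the idea of choosing a model size by 10-fold cross-validation concrete, here is a rough base-R sketch for forward stepwise selection. It is one possible illustration and not the required approach. The data are simulated, forward_order is a hypothetical helper written for this example, and dedicated packages (for instance leaps for subset selection or glmnet for the Lasso) are the more usual tools.

```r
# Simulated stand-in data (hypothetical): only x1 and x2 truly affect y.
set.seed(123)
n <- 200
p <- 6
X <- as.data.frame(matrix(rnorm(n * p), n, p))
names(X) <- paste0("x", 1:p)
dat <- cbind(y = 1 + 2 * X$x1 - 1.5 * X$x2 + rnorm(n), X)

# Hypothetical helper: greedy forward selection by residual sum of squares.
# Returns the predictor names in the order in which they enter the model.
forward_order <- function(d) {
  remaining <- setdiff(names(d), "y")
  chosen <- character(0)
  while (length(remaining) > 0) {
    rss <- sapply(remaining, function(v) {
      f <- reformulate(c(chosen, v), response = "y")
      sum(resid(lm(f, data = d))^2)
    })
    best <- remaining[which.min(rss)]
    chosen <- c(chosen, best)
    remaining <- setdiff(remaining, best)
  }
  chosen
}

k <- 10
folds <- sample(rep(1:k, length.out = n))  # random fold label for every row
cv_err <- matrix(NA, nrow = k, ncol = p)   # rows: folds, columns: model sizes

for (j in 1:k) {
  tr <- dat[folds != j, ]
  va <- dat[folds == j, ]
  ord <- forward_order(tr)  # selection is redone inside each fold
  for (m in 1:p) {
    fit <- lm(reformulate(ord[1:m], response = "y"), data = tr)
    cv_err[j, m] <- mean((va$y - predict(fit, newdata = va))^2)
  }
}

mean_cv   <- colMeans(cv_err)       # CV error for each model size
best_size <- which.min(mean_cv)     # size with the smallest CV error
best_vars <- forward_order(dat)[1:best_size]

mean_cv
best_vars
```

Redoing the selection inside every fold matters: if the predictors were chosen once on all rows and only the final fits were cross-validated, the validation rows would already have influenced the selection and the CV error would be too optimistic.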
[5 points] (f) Fit a decision tree to the training data. Use 10-fold cross-validation to find the best tree size.
Create a plot with tree size on the x-axis and deviance on the y-axis. Use the decision tree of the best size to
predict the test data and calculate the test error. Plot this tree with labels. Are the predictors chosen in (e) also
used as splits in the tree?
[6 points] (g) Compare the methods you used for this regression problem. Based on all the results you have
above, write a short conclusion about your data.
Part 2 - Classification
Use the frogs data set in the DAAG package (column pres.abs is the response; the other columns are predictors)
to answer the following questions. If you don’t have the DAAG package in R, you need to install it first.
[1 point] (a) Describe the data set briefly. (hint: ?frogs)
[1 point] (b) Set the random seed first. Split the data set into a training data set (80%) and a test data set
(20%).
[4 points] (c) Fit a logistic regression model on the training data using all predictors. Use the summary()
function to print the results. Make predictions on the test data and calculate the test error. What does the
coefficient for distance suggest?
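The logistic regression steps in (c) can be sketched in base R as below. This is an illustration on simulated data rather than on frogs. The variables x1, x2 and y are hypothetical, and both the 0.5 probability cutoff and the use of the misclassification rate as the test error are assumptions.

```r
# Simulated stand-in data (hypothetical); the exam uses frogs with pres.abs as response.
set.seed(123)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * dat$x1))

# 80/20 split by sampling row indices.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# family = binomial gives logistic regression; "y ~ ." uses all other columns.
fit <- glm(y ~ ., data = train, family = binomial)
summary(fit)

# Coefficients are on the log-odds scale; exponentiating gives odds ratios.
exp(coef(fit))

# Predicted probabilities on the test rows, turned into 0/1 labels.
prob <- predict(fit, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)  # 0.5 cutoff is an assumption

# Confusion matrix and misclassification rate on the test set.
table(predicted = pred, actual = test$y)
test_err <- mean(pred != test$y)
test_err
```

A positive coefficient means the log-odds of the response being 1 increase with that predictor, holding the others fixed, and a negative coefficient means they decrease.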
[4 points] (d) Choose one of the Bayes classifiers (Naive Bayes, LDA, QDA) to redo this question. Fit the
model on the training data, make predictions on the test data and calculate the test error.
[3 points] (e) Briefly describe the differences between the procedures of Bagging, Random Forests and Boosting
(you only need to introduce the general idea of these methods). What are the parameters of these three
methods, respectively?
[4 points] (f) Choose Random Forests OR Boosting to do this question. Apply the method you chose
to the training data. Make predictions on the test data and calculate the test error. (hint: Use 1000 as
the number of trees to build for Random Forests OR Boosting. For other parameters, just use the
default settings. You need to convert the response to a factor first (as.factor()); otherwise regression trees will
be built.)
[2 points] (g) Create a plot of the importance (or, for boosting, influence) measures. Which predictors do you
think are more important?
[6 points] (h) Compare these three different methods you used for this classification problem. Based on all
the results you have above, write a short conclusion about your data.