FS19 STT481: Homework 5
(Due: Wednesday, Dec. 4th, at the beginning of class.)
100 points total
1. (20 pts) We now fit a GAM to predict Salary in the Hitters dataset.
First, we remove the observations for which the salary information is unknown, and then we split the data
set into a training set and a test set by using the following command lines.
library(ISLR)
data("Hitters")
Hitters <- Hitters[!is.na(Hitters$Salary),]
set.seed(10)
train <- sample(nrow(Hitters), 200)
Hitters.train <- Hitters[train, ]
Hitters.test <- Hitters[-train, ]
(a) Using log(Salary) (the log-transformation of Salary) as the response and the other variables as the predictors,
perform forward stepwise selection on the training set in order to identify a satisfactory model that
uses just a subset of the predictors.
(b) Fit a GAM on the training data, using log(Salary) as the response and the features selected in the
previous step as the predictors. Plot the results, and explain your findings.
(c) Evaluate the model obtained on the test set. Try different tuning parameters (if you are using
smoothing splines s(), try different values of df; if you are using local regression lo(), try different
values of span) and explain the results obtained.
(d) For which variables, if any, is there evidence of a non-linear relationship with the response?
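One possible workflow for this question, sketched in R. This is only a sketch, not the required solution: it assumes the leaps and gam packages, and the predictors passed to gam() below (CAtBat, Hits, Division) are purely illustrative — use whichever variables your forward stepwise selection actually picks.

```r
library(leaps)   # regsubsets() for forward stepwise selection
library(gam)     # gam() and s() for smoothing splines

# (a) Forward stepwise selection on the training set
fwd <- regsubsets(log(Salary) ~ ., data = Hitters.train,
                  nvmax = 19, method = "forward")
fwd.summary <- summary(fwd)
which.min(fwd.summary$bic)   # e.g., choose the model size minimizing BIC

# (b) Fit a GAM on a few selected predictors (names are illustrative)
gam.fit <- gam(log(Salary) ~ s(CAtBat, df = 4) + s(Hits, df = 4) + Division,
               data = Hitters.train)
par(mfrow = c(1, 3))
plot(gam.fit, se = TRUE)     # one panel per term, with standard-error bands

# (c) Test MSE for one choice of df; repeat with other df values
pred <- predict(gam.fit, newdata = Hitters.test)
mean((log(Hitters.test$Salary) - pred)^2)
```

For (d), compare the plotted smooth terms against straight lines (or fit the same GAM with linear terms and compare via anova()) to judge which relationships appear non-linear.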
2. (40 pts) This question relates to the Credit data set. (Regression problem).
First, we split the data set into a training set and a test set by using the following command lines.
library(ISLR)
data("Credit")
set.seed(15)
Credit <- Credit[,-1] # remove ID column
train <- sample(nrow(Credit), 300)
Credit.train <- Credit[train, ]
Credit.test <- Credit[-train, ]
(a) Fit a tree to the training data, with Balance as the response and the other variables as predictors. Use the summary()
function to produce summary statistics about the tree, and describe the results obtained. What is the
training MSE? How many terminal nodes does the tree have?
(b) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal
nodes, and interpret the information displayed.
(c) Create a plot of the tree, and interpret the results.
(d) Predict the response on the test data. What is the test MSE?
(e) Apply the cv.tree() function to the training set in order to determine the optimal tree size.
(f) Produce a plot with tree size on the x-axis and cross-validated error on the y-axis.
(g) Which tree size corresponds to the lowest cross-validated error?
(h) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If
cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal
nodes.
(i) Compare the training MSEs between the pruned and unpruned trees. Which is higher?
(j) Compare the test MSEs between the pruned and unpruned trees. Which is higher?
(k) Fit a bagging model to the training set with Balance as the response and the other variables as predictors. Use
1,000 trees (ntree = 1000). Use the importance() function to determine which variables are most
important.
(l) Use the bagging model to predict the response on the test data. Compute the test MSE.
(m) Fit a random forest model to the training set with Balance as the response and the other variables as predictors. Use
1,000 trees (ntree = 1000). Use the importance() function to determine which variables are most
important.
(n) Use the random forest to predict the response on the test data. Compute the test MSE.
(o) Fit a boosting model to the training set with Balance as the response and the other variables as predictors. Use 1,000
trees, and a shrinkage value of 0.01 (λ = 0.01). Which predictors appear to be the most important?
(p) Use the boosting model to predict the response on the test data. Compute the test MSE.
(q) Fit a GAM to the training set with Balance as the response and the other variables as predictors, and use the GAM
to predict the response on the test data. Compute the test MSE.
(r) Compare the test MSEs between the unpruned trees, pruned trees, bagging, random forest, boosting,
and GAM. Which performs the best?
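The tree-based parts of this question follow the pattern sketched below in R. The sketch assumes the tree, randomForest, and gbm packages; the tree size passed to prune.tree() is illustrative and should come from your cross-validation results.

```r
library(tree)
library(randomForest)
library(gbm)

# (a)-(d) Regression tree, training summary, and test MSE
tree.credit <- tree(Balance ~ ., data = Credit.train)
summary(tree.credit)                       # residual deviance, terminal nodes
pred <- predict(tree.credit, Credit.test)
mean((Credit.test$Balance - pred)^2)       # test MSE

# (e)-(h) Cross-validation and pruning
cv.credit <- cv.tree(tree.credit)
plot(cv.credit$size, cv.credit$dev, type = "b",
     xlab = "Tree size", ylab = "CV error")
pruned <- prune.tree(tree.credit, best = 5)   # size 5 is illustrative

# (k)-(l) Bagging is a random forest with mtry = number of predictors
bag.credit <- randomForest(Balance ~ ., data = Credit.train,
                           mtry = ncol(Credit.train) - 1,
                           ntree = 1000, importance = TRUE)
importance(bag.credit)
# For (m)-(n), drop the mtry argument to get the random-forest default.

# (o)-(p) Boosting with 1,000 trees and shrinkage 0.01
boost.credit <- gbm(Balance ~ ., data = Credit.train,
                    distribution = "gaussian",
                    n.trees = 1000, shrinkage = 0.01)
summary(boost.credit)   # relative influence of each predictor
```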
3. (40 pts) This question relates to the OJ data set. (Classification problem).
First, we split the data set into a training set and a test set by using the following command lines.
library(ISLR)
data("OJ")
set.seed(10)
train <- sample(nrow(OJ), 800)
OJ.train <- OJ[train, ]
OJ.test <- OJ[-train, ]
(a) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the
summary() function to produce summary statistics about the tree, and describe the results obtained.
What is the training error rate? How many terminal nodes does the tree have?
(b) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal
nodes, and interpret the information displayed.
(c) Create a plot of the tree, and interpret the results.
(d) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the
predicted test labels. What is the test error rate?
(e) Apply the cv.tree() function to the training set in order to determine the optimal tree size.
(f) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
(g) Which tree size corresponds to the lowest cross-validated classification error rate?
(h) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If
cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal
nodes.
(i) Compare the training error rates between the pruned and unpruned trees. Which is higher?
(j) Compare the test error rates between the pruned and unpruned trees. Which is higher?
(k) Fit a bagging model to the training set with Purchase as the response and the other variables as
predictors. Use 1,000 trees (ntree = 1000). Use the importance() function to determine which
variables are most important.
(l) Use the bagging model to predict the response on the test data. Compute the test error rate.
(m) Fit a random forest model to the training set with Purchase as the response and the other variables
as predictors. Use 1,000 trees (ntree = 1000). Use the importance() function to determine which
variables are most important.
(n) Use the random forest to predict the response on the test data. Compute the test error rate.
(o) Fit a boosting model to the training set with Purchase as the response and the other variables as
predictors. Use 1,000 trees, and a shrinkage value of 0.01 (λ = 0.01). Which predictors appear to be
the most important?
(p) Use the boosting model to predict the response on the test data. Compute the test error rate.
(q) Fit a logistic regression to the training set with Purchase as the response and the other variables as
predictors, and predict on the test data. Compute the test error rate.
(r) Rank the significance of the coefficients of the logistic regression. Is the result consistent with (k)?
(s) Compare the test error rates between the unpruned trees, pruned trees, bagging, random forest, boosting,
and logistic regression. Which performs the best?
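The classification versions of these steps differ from Question 2 in a few places, sketched below in R (again assuming the tree, randomForest, and gbm packages; the pruned tree size is illustrative). Note that cv.tree() needs FUN = prune.misclass to cross-validate on error rate, and gbm() with distribution = "bernoulli" expects a 0/1 response rather than a factor.

```r
library(tree)
library(randomForest)
library(gbm)

# (a)-(d) Classification tree, confusion matrix, test error rate
tree.oj <- tree(Purchase ~ ., data = OJ.train)
summary(tree.oj)
pred <- predict(tree.oj, OJ.test, type = "class")
table(pred, OJ.test$Purchase)                 # confusion matrix
mean(pred != OJ.test$Purchase)                # test error rate

# (e)-(h) Cross-validate on misclassification error, then prune
cv.oj <- cv.tree(tree.oj, FUN = prune.misclass)
plot(cv.oj$size, cv.oj$dev, type = "b",
     xlab = "Tree size", ylab = "CV misclassifications")
pruned <- prune.misclass(tree.oj, best = 5)   # size 5 is illustrative

# (o) Boosting: recode the factor response as 0/1 for "bernoulli"
oj.boost.train <- OJ.train
oj.boost.train$Purchase <- as.numeric(oj.boost.train$Purchase == "MM")
boost.oj <- gbm(Purchase ~ ., data = oj.boost.train,
                distribution = "bernoulli",
                n.trees = 1000, shrinkage = 0.01)
summary(boost.oj)

# (q) Logistic regression; glm() models P(Purchase = "MM"),
# the second factor level
glm.oj <- glm(Purchase ~ ., data = OJ.train, family = binomial)
probs <- predict(glm.oj, OJ.test, type = "response")
glm.pred <- ifelse(probs > 0.5, "MM", "CH")
mean(glm.pred != OJ.test$Purchase)            # test error rate
```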