IIMT2641
Introduction to Business Analytics Due November 7
Fall 2019
Assignment 4
In this problem, we will practice building CART models with a continuous outcome, using the dataset
StateData.csv, which contains data from the 1970s on all fifty US states. A description of the variables in the
dataset is given in Table 1.
Variable        Description
Population      Population estimate of the state in 1975.
Income          Per capita income in the state in 1974.
Illiteracy      Illiteracy rate in 1970, as a percentage of the state’s population.
LifeExp         The life expectancy in years of residents of the state in 1970.
Murder          The murder and non-negligent manslaughter rate per 100,000 population in 1976.
HighSchoolGrad  The high-school graduation rate in the state in 1970.
Frost           The mean number of days with minimum temperature below freezing from 1931 to 1960 in the capital or a large city of the state.
Area            The land area (in square miles) of the state.
Longitude       The longitude of the center of the state.
Latitude        The latitude of the center of the state.
Region          The region (Northeast, South, North Central, or West) that the state belongs to.
Table 1: Variables in the dataset StateData.csv.
(a) Let us start by building a linear regression model. Randomly split the dataset into a training set (70%)
and a test set (30%).
(i) First, build a linear regression model to predict LifeExp using the following seven variables
as the independent variables: Population, Murder, Frost, Income, Illiteracy, Area, and
HighSchoolGrad. Use the training dataset to build the model. What is the R² of the model on
the test set?
(ii) Now, build a linear regression model to predict LifeExp using the following four variables as the
independent variables: Population, Murder, Frost, and HighSchoolGrad. Again, use the
training dataset to build the model. What is the R² of the model on the test set?
(iii) Compare these two models. What do we achieve by removing independent variables? What
is the equivalent procedure in a CART model?
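The random split and the test-set R² asked for in part (a) can be sketched in R as follows. This is only a sketch under stated assumptions: the helper names `split_train_test` and `test_r2` and the seed value are hypothetical, StateData.csv is assumed to sit in the working directory, and the baseline for the total sum of squares is taken to be the training-set mean of LifeExp (one common convention; using the test-set mean instead gives a slightly different value).

```r
# Randomly split a data frame into training and test sets.
# The seed only makes the split reproducible; its value is arbitrary.
split_train_test <- function(df, train_frac = 0.7, seed = 1) {
  set.seed(seed)
  n_train <- round(train_frac * nrow(df))
  train_idx <- sample(seq_len(nrow(df)), size = n_train)
  list(train = df[train_idx, , drop = FALSE],
       test  = df[-train_idx, , drop = FALSE])
}

# Out-of-sample R^2 = 1 - SSE / SST, where SST is the squared error of a
# baseline that always predicts the training-set mean (assumed convention).
test_r2 <- function(actual, predicted, train_mean) {
  sse <- sum((actual - predicted)^2)
  sst <- sum((actual - train_mean)^2)
  1 - sse / sst
}

# Usage, assuming StateData.csv is in the working directory.
if (file.exists("StateData.csv")) {
  states <- read.csv("StateData.csv")
  parts <- split_train_test(states)
  # Seven-variable linear regression fitted on the training set only.
  lin_reg <- lm(LifeExp ~ Population + Murder + Frost + Income +
                  Illiteracy + Area + HighSchoolGrad, data = parts$train)
  pred <- predict(lin_reg, newdata = parts$test)
  print(test_r2(parts$test$LifeExp, pred, mean(parts$train$LifeExp)))
}
```

The four-variable model in (ii) only changes the formula passed to `lm`; the same `test_r2` call then makes the two models directly comparable on the same test set.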
(b) Now, build a CART model to predict LifeExp using the following seven variables as the independent
variables: Population, Murder, Frost, Income, Illiteracy, Area, and HighSchoolGrad. Set
the parameter minbucket to 5. Make sure that you are building a regression tree, and not a
classification tree, by setting the argument method to “anova” instead of “class”.
(i) Plot the tree. Which of the independent variables appear in the tree? Do you find the linear
regression model or the CART model easier to interpret?
(ii) Compute the predicted life expectancies for the test dataset using the CART model, and calculate
the R² of the predictions.
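For part (b), a regression tree with these settings could be fitted roughly as below. This sketch assumes the rpart package is installed and that StateData.csv is in the working directory; the helper names `fit_life_exp_tree` and `tree_variables`, the seed, and the choice of the training-set mean as the R² baseline are assumptions made for illustration.

```r
library(rpart)

# Fit a regression tree (method = "anova") with at least 5 observations
# in every leaf (minbucket = 5).
fit_life_exp_tree <- function(train) {
  rpart(LifeExp ~ Population + Murder + Frost + Income +
          Illiteracy + Area + HighSchoolGrad,
        data = train,
        method = "anova",
        control = rpart.control(minbucket = 5))
}

# Names of the variables actually used for splits in a fitted tree.
tree_variables <- function(tree) {
  vars <- unique(as.character(tree$frame$var))
  vars[vars != "<leaf>"]
}

# Usage, assuming StateData.csv is in the working directory.
if (file.exists("StateData.csv")) {
  states <- read.csv("StateData.csv")
  set.seed(1)  # arbitrary seed, only for a reproducible split
  train_idx <- sample(seq_len(nrow(states)), size = round(0.7 * nrow(states)))
  train <- states[train_idx, ]
  test <- states[-train_idx, ]

  tree <- fit_life_exp_tree(train)
  if (nrow(tree$frame) > 1) {  # plotting fails for a root-only tree
    plot(tree, margin = 0.1)
    text(tree)
  }
  print(tree_variables(tree))

  # Test-set R^2 with the training-set mean as the baseline prediction.
  pred <- predict(tree, newdata = test)
  sse <- sum((test$LifeExp - pred)^2)
  sst <- sum((test$LifeExp - mean(train$LifeExp))^2)
  print(1 - sse / sst)
}
```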
(c) Now, build a random forest model to predict LifeExp using the same seven variables as the
independent variables. Set the parameter nodesize to 5. Compute the predicted life expectancies for
the test dataset using the random forest model, and calculate the R² of the predictions.
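A corresponding sketch for part (c) is given below. It assumes the randomForest package is installed (it is not part of base R) and that StateData.csv is in the working directory; the helper name `fit_life_exp_forest` and the seed values are hypothetical. All other forest settings are left at the package defaults.

```r
library(randomForest)

# Fit a regression forest; a numeric response makes randomForest build
# regression trees. nodesize = 5 sets the minimum size of terminal nodes.
fit_life_exp_forest <- function(train, seed = 1) {
  set.seed(seed)  # forests are randomized; the seed value is arbitrary
  randomForest(LifeExp ~ Population + Murder + Frost + Income +
                 Illiteracy + Area + HighSchoolGrad,
               data = train,
               nodesize = 5)
}

# Usage, assuming StateData.csv is in the working directory.
if (file.exists("StateData.csv")) {
  states <- read.csv("StateData.csv")
  set.seed(1)  # arbitrary seed, only for a reproducible split
  train_idx <- sample(seq_len(nrow(states)), size = round(0.7 * nrow(states)))
  train <- states[train_idx, ]
  test <- states[-train_idx, ]

  forest <- fit_life_exp_forest(train)

  # Test-set R^2 with the training-set mean as the baseline prediction.
  pred <- predict(forest, newdata = test)
  sse <- sum((test$LifeExp - pred)^2)
  sst <- sum((test$LifeExp - mean(train$LifeExp))^2)
  print(1 - sse / sst)
}
```

Because each run draws different bootstrap samples, the reported R² changes with the seed; comparing models on the same train/test split keeps the comparison fair.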
(d) Which of the four models you built do you think is best if out-of-sample accuracy is the
most important criterion? What about if interpretability is the most important?