MSBA7027 Machine Learning
Homework 2
Due 11:59 pm Dec. 28, 2022
Notes:
- You are required to submit 1) the original R Markdown file and 2) a knitted HTML or PDF file via
Moodle. Please provide comments for R code wherever you see fit. In general, be as
concise as possible while giving a fully complete answer. Nice formatting of the assignment will
receive extra points.
- Remember that the Class Policy strictly applies to homework. You are encouraged to work in
groups and discuss with fellow students. However, each student has to know how to answer the
questions on her/his own.
- Please allow some buffer time and do not submit homework at the last moment. You will have
points deducted if you submit the above two items late.
Question. Load the dataset from HW2_house_dataset.csv. You will implement some tree-based
methods to predict housing prices. Basic characteristics of the dataset are given as follows:
- Problem type: supervised learning, regression
- Response variable: selling price of houses (in log10 units)
- Data variable name in R: “price”
- Number of features: 17
- Number of observations: 21,613
- Task: use house attributes to predict sale price of a house
Please perform the following tasks:
(1) Set seed
(2) Perform stratified sampling, use 80% as training and 20% as testing. Do not touch the testing data
until the last problem (7).
(3) Fit a random forest (RF) on the training data. Find the best tuning parameters and describe
how you found them, then report the smallest cross-validated RMSE on the training data.
Which four predictors are the most important? Obtain partial dependence plots (PDPs) for these
four predictors, describe them and provide possible explanations.
(4) Repeat (3) for the basic GBM algorithm.
(5) (Optional; completing this part will earn you up to 5 bonus points) Repeat (3) for the XGBoost
algorithm.
(6) Are the four most important variables different in (3)-(4) (or (3)-(5), if you have done (5))?
(7) Among RF and GBM (and XGBoost, if you have done (5)), each with its own best tuning parameters,
which one has the smallest cross-validated RMSE on the training data? Choose that method, refit
the model on all of the training data, use that model to make predictions on the testing data, and
report the RMSE for the testing data. Is the obtained RMSE smaller or larger than the cross-validated
RMSE?
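The workflow in tasks (1), (2) and (7) can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame `toy` is synthetic (the real data come from HW2_house_dataset.csv), the seed value 2022 is arbitrary, stratification is done on quartile bins of the response, and `lm()` is a stand-in learner used purely to keep the example self-contained. It is not one of the tree-based methods the assignment asks for.

```r
set.seed(2022)  # task (1): arbitrary seed value, chosen for illustration

# Synthetic stand-in data (assumption: NOT the real housing dataset)
n <- 1000
toy <- data.frame(sqft_living = runif(n, 500, 5000))
toy$price <- 5 + 0.0002 * toy$sqft_living + rnorm(n, sd = 0.1)

# Task (2): stratify on quartile bins of the numeric response so that
# every price range is sampled at the same 80% rate
bins <- cut(toy$price,
            breaks = quantile(toy$price, probs = seq(0, 1, by = 0.25)),
            include.lowest = TRUE)
train_idx <- unlist(lapply(split(seq_len(n), bins), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}), use.names = FALSE)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]  # set aside until the final evaluation

# Root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# 5-fold cross-validated RMSE computed on the training data only
k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(train)))
cv_rmse <- mean(sapply(seq_len(k), function(i) {
  fit <- lm(price ~ sqft_living, data = train[fold != i, ])  # stand-in learner
  rmse(train$price[fold == i], predict(fit, newdata = train[fold == i, ]))
}))

# Task (7): refit on all training data, then evaluate once on the test set
final_fit <- lm(price ~ sqft_living, data = train)
test_rmse <- rmse(test$price, predict(final_fit, newdata = test))

cat("CV RMSE:", cv_rmse, " Test RMSE:", test_rmse, "\n")
```

The test set is used exactly once, after model selection, so the comparison between `cv_rmse` and `test_rmse` stays honest.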
Appendix: Description of Features
price (numeric): sale price (log10 units)
bedrooms (numeric): number of bedrooms
bathrooms (numeric): number of bathrooms
sqft_living (numeric): size of living space
sqft_lot (numeric): size of property
floors (numeric): number of floors
waterfront (numeric): binary indicator for a waterfront view
view (numeric): rating of the quality of the view
condition (factor): condition of the house (poor to very good)
sqft_above (numeric): size of living space above ground
sqft_basement (numeric): size of living space below ground
yr_built (numeric): year built
year_renovated (numeric): year renovated and, if not renovated, the year built
zip_code (factor): zip code
latitude (numeric): latitude
longitude (numeric): longitude
nn_sqft_living (numeric): size of living space of 15 neighbors
nn_sqft_lot (numeric): size of lot of 15 neighbors
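Since the appendix marks condition and zip_code as factors, it may help to coerce them explicitly after loading, because `read.csv()` would otherwise read zip codes as integers. The sketch below is a hedged illustration only: the three-row file is made up, it contains just four of the listed columns, and the condition labels are assumptions about how the levels are spelled.

```r
# Hypothetical three-row extract written to a temporary file
# (assumption: stands in for HW2_house_dataset.csv; values are invented)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "price,bedrooms,condition,zip_code",
  "5.65,3,average,98178",
  "5.73,4,good,98125",
  "5.26,2,very good,98028"
), tmp)

house <- read.csv(tmp, stringsAsFactors = FALSE)

# Coerce the two columns the appendix describes as factors
house$condition <- factor(house$condition)
house$zip_code  <- factor(house$zip_code)

str(house)  # inspect the resulting column types
```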