辅导IE 7275、讲解Engineering留学生、辅导R程序语言、讲解R设计
- 首页 >> Algorithm 算法Homework 5
IE 7275: Data Mining in Engineering
Read the material “Tutorial on CART with R.pdf”. Read the book chapter on “Logistic
and Poisson Regression with R.pdf.” Ignore the Poisson Regression part of the tutorial.
Problem 1 (Predicting Price of Used Car, CART) [35 points]
The file ToyotaCorolla.xlsx contains the data on used cars (Toyota Corolla) on sale
during late summer of 2004 in The Netherlands. It has 1436 records containing details on
38 attributes, including Price, Age, Kilometers, HP, and other specifications. The goal is to
predict the price of a used Toyota Corolla based on its specifications.
Data Preprocessing: Create dummy variables for the categorical predictors (Fuel Type
and Color). Split the data into training (50%), validation (30%), and test (20%) datasets.
a. Run a regression tree (RT) with the output variable Price and input variables
Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfg_Guarantee,
Guarantee_Period, Airco, Automatic_Airco, CD Player, Powered_Windows, Sport_Model,
and Tow_Bar.
i. Which appear to be the three or four most important car specifications for
predicting the car’s price?
ii. Compare the prediction errors of the training, validation, and test sets by
examining their RMS error and by plotting the three boxplots. What is happening
with the training set predictions? How does the predictive performance of the test
set compare to the other two? Why does this occur?
iv. If we used the full tree instead of the best pruned tree to score the validation set,
how would this affect the predictive performance for the validation set? (Hint:
Does the full tree use the validation data?)
b. Let us see the effect of turning the price variable into a categorical variable. First,
create a new variable that categorizes price into 20 bins of equal counts. Now
repartition the data keeping Binned Price instead of Price. Run a classification tree
(CT) with the same set of input variables as in the RT, and with Binned Price as the
output variable.
i. Compare the tree generated by the CT with the one generated by the RT. Are they
different? (Look at structure, the top predictors, size of tree, etc.) Why?
ii. Predict the price, using the RT and the CT, of a used Toyota Corolla with the
specifications listed in Table below.
Table: Specifications for a particular Toyota Corolla
Variable Value
Age_08_04 77
KM 117,000
Fuel_Type Petrol
HP 110
Automatic No
Doors 5
Quarterly_Tax 100
Mfg_Garantee No
Guarantee_Period 3
Airco Yes
Automatic_Airco No
CD_Player No
Powered_Windows No
Sport_Model No
Tow_Bar Yes
iii. Compare the predictions in terms of the predictors that were used, the magnitude
of the difference between the two predictions, and the advantages and
disadvantages of the two methods.
Problem 2 (Financial condition of banks, Logistic Regression) [30 points]
The file Banks.xlsx includes data on a sample of 20 banks. The Financial Condition (Y)
column records the judgment of an expert on the financial condition of each bank. This
dependent variable takes one of two possible values -- weak or strong -- according to the
financial condition of the bank. The predictors are two ratios used in the financial
analysis of banks: TotLns&Lses/Assets (X1) is the ratio of total loans and leases to total
assets and TotExp/Assets (X2) is the ratio of total expenses to total assets. The target is to
use the two ratios for classifying the financial condition of a new bank.
Run a logistic regression model (on the entire dataset) that models the status of a bank
as a function of the two financial measures provided. Specify the success class as weak
(this is similar to creating a dummy that is 1 for financially weak banks and 0 otherwise),
and use the default cutoff value of 0.5.
a. Write the estimated equation that associates the financial condition of a bank
with its two predictors in three formats:
i. The logit as a function of the predictors
ii. The odds as a function of the predictors
iii. The probability as a function of the predictors
b. Consider a new bank whose total loans and leases/assets ratio = 0.6 and total
expenses/assets ratio = 0.11. From your logistic regression model, estimate the
following four quantities for this bank: the logit, the odds, the probability of
being financially weak, and the classification of the bank.
c. The cutoff value of 0.5 is used in conjunction with the probability of being
financially weak. Compute the threshold that should be used if we want to make
a classification based on the odds of being financially weak, and the threshold for
the corresponding logit.
d. Interpret the estimated coefficient for the total loans & leases to total assets ratio
(TotLns&Lses/Assets) in terms of the odds of being financially weak.
e. When a bank that is in poor financial condition is misclassified as financially
strong, the misclassification cost is much higher than when a financially strong
bank is misclassified as weak. To minimize the expected cost of misclassification,
should the cutoff value for classification (which is currently at 0.5) be increased
or decreased?
Problem 3 (Identifying good system administrators, Logistic Regression)
[35 points]
A management consultant is studying the roles played by experience and training in a
system administrator's ability to complete a set of tasks in a specified amount of time. In
particular, she is interested in discriminating between administrators who are able to
complete given tasks within a specified time and those who are not. Data are collected
on the performance of 75 randomly selected administrators. They are stored in the file
System Administrators.xlsx.
The variable Experience (X1) measures months of full-time system administrator
experience, while Training (X2) measures the number of relevant training credits. The
dependent variable Completed (Y) is either Yes or No, according to whether or not the
administrator completed the tasks.
a. Create a scatterplot of Experience versus Training using color or symbol to
differentiate programmers who complete the task from those who did not
complete it. Which predictor(s) appear(s) potentially useful for classifying task
completion?
b. Run a logistic regression model with both predictors using the entire dataset as
training data. Among those who complete the task, what is the percentage of
programmers who are incorrectly classified as failing to complete the task?
c. To decrease the percentage in part (b), should the cutoff probability be increased
or decreased?
d. How much experience must be accumulated by a programmer with 4 years of
training before his or her estimated probability of completing the task exceeds
50%?
Files Included in the Folder:
1. Homework 5.pdf
2. Tutorial on CART with R.pdf
3. Logistic and Poisson Regression with R.pdf
4. ToyotaCorolla.xlsx
5. Banks.xlsx
6. System Administrators.xlsx