辅导STAT 3032、讲解R设计、讲解Canvas留学生、辅导R编程语言
- 首页 >> 其他 STAT 3032 Homework5 Instruction (Spring 2019)
Due Wednesday, April 17 @ 11:59pm in Canvas
50 points in total
Please show your work on each problem for full credit. A correct answer, unsupported by the
necessary explanation, R code or output will receive very little if any credit. Your work needs
to be organized in a reasonably neat and coherent way, and submitted as a pdf file on
Canvas.
You are welcome to discuss with your classmates, but you must write up your homework
individually!
Problem 1
This problem is based on the Bridge Construction dataset (bridge.txt) from the textbook
website: http://gattonweb.uky.edu/sheather/book/docs/datasets/bridge.txt
Before construction begins, a bridge project goes through the design stage. Predicting the
design time is helpful for budgeting and scheduling purposes. Information on 45 bridge projects
was compiled. We will use the following variables (Note that the dataset includes more than
these 3 variables):
Response Variable Time = design time in person-days
Predictor Variables CCost = Construction cost (in $1000)
Dwgs = Number of structural drawings
(a).Import the data into R and provide a scatterplot matrix for the variables (the response and
predictor variables) mentioned above.
(b).Fit the model Time ~ 1 + CCost + Dwgs and provide the diagnostic plots. Do you see
any violation of the linear regression assumptions?
(c). You can see from the scatterplot matrix in (a) that the predictor variables are not linearly
related. This indicates that we need to transform the predictor variables. Find appropriate
transformations for the predictor variables. Hint: Use the R function powerTransform( ) from
the library alr4 or car.
1(d). After transforming the predictor variables based on (c), find an appropriate transformation
for the response variable, Time. Hint: use the R function boxcox( ) from the library MASS.
(e). Regardless of your answers in (c) and (d), we will use log transformation for the response
variable and the predictor variables. Draw the scatterplot matrix for the transformed variables.
Now the predictor variables should look linearly related.
(f). Fit the model log(Time) ~ 1 + log(CCost) + log(Dwgs)and interpret the slope
coefficient of log(CCost)by filling in the blanks below.
When the construction cost increase by 1%, the design time _________________________ .
(g). Common sense says that design time should increase as the construction cost goes up. Is
your estimated coefficient value consistent with this common sense? Please answer yes or no.
________ .
(h).Use R to calculate the VIF values for the model log(Time) ~ 1 + log(CCost) +
log(Dwgs). Use 5 as the threshold, does the model has collinearity issue? Hint: use the R
function vif( ).
(i). Draw the diagnostic plots of the model log(Time) ~ 1 + log(CCost) + log(Dwgs).
Do you see any violation of the linear regression assumptions? Compare and contrast with the
diagnostic plots in (b).
(j). Use model log(Time) ~ 1 + log(CCost) + log(Dwgs) to predict the design time
for the next bridge to be built in Hennepin County. This bridge has 6 structural drawings and
has a budget of 300 thousand dollars for construction cost. Estimate the design time and
provide the 95% prediction interval for the estimated design time.
2Problem 2:
For logistic regression with p predictor variables, the model is specified as
log( ) x x .. x .
E(Y |X)1E(Y |X) = β0 + β1 1 + β2 2 + . + βp p
Derive the formula to show that(β +β x +β x +...+β x ) 0 1 1 2 2 p p
Problem 3:
On April 15, 1912, during her maiden voyage, the ship Titanic sank after colliding with an
iceberg, killing many passengers and crew. Here, we will use a subset of the data to analyze the
survival rates for different groups of people. Please download the dataset
TitanicPartial.csv from Canvas and work through the following questions.
Variables:
Survival: 1 (survived) or 0 (dead)
Sex: “male” or “female”
Pclass: passenger class, 1, 2 or 3.
Age: age in years.
(a). Based on the scatterplots above, which group (combination of Sex and Pclass) has the
lowest odds for survival? Hint: look at the relative number of survival and non-survival in each
group.
3(b). Fit the following two logistic regression models and provide the summary output for each
model. Hint: Use R function glm( )
mod1: Survived ~ 1 + as.factor(Pclass)
mod2: Survived ~ 1 + as.factor(Pclass) + Age + Sex
(c) Write down the fitted model of mod1. You may use either format (log odds or probability).
(d). What happens if we don’t apply the as.factor( ) function to Pclass? Try fitting mod1
without as.factor( ). You can call this new model mod3. How do the summary outputs of
mod1 and mod3 differ?
(e) According to mod2, holding Age and Sex constant, which passenger class has the highest
odds of survival and which passenger class has the lowest odds of survival? Please explain.
(f). Interpret the estimated slope of Age in mod2 with respect to the probability of survival.
(g). Interpret the estimated coefficient of Sexmale in mod2 with respect to the probability of
survival.
4(h). In the 1997 movie Titanic, the lead character Jack was a 20-year-old male passenger in the
3rd class. Please predict his probability of survival based on mod2. Please use the formula of
the fitted model to compute the probability.
(i).Redo part (h). This time, please use the predict( ) function. You should get a very similar
answer if not identical.
(j). Since mod1 is nested within mod2, we can compare these two models using the deviance.
Which model do you prefer? Hint: use R function anova( )
(k) Redo Part (j). But this time, you are not allowed to use anova( ) function! Instead, reply
on the summary output of mod1 and mod2 in (a), and the pchisq( ) function. You should get
a very similar p value if not identical.
5
Due Wednesday, April 17 @ 11:59pm in Canvas
50 points in total
Please show your work on each problem for full credit. A correct answer, unsupported by the
necessary explanation, R code or output will receive very little if any credit. Your work needs
to be organized in a reasonably neat and coherent way, and submitted as a pdf file on
Canvas.
You are welcome to discuss with your classmates, but you must write up your homework
individually!
Problem 1
This problem is based on the Bridge Construction dataset (bridge.txt) from the textbook
website: http://gattonweb.uky.edu/sheather/book/docs/datasets/bridge.txt
Before construction begins, a bridge project goes through the design stage. Predicting the
design time is helpful for budgeting and scheduling purposes. Information on 45 bridge projects
was compiled. We will use the following variables (Note that the dataset includes more than
these 3 variables):
Response Variable Time = design time in person-days
Predictor Variables CCost = Construction cost (in $1000)
Dwgs = Number of structural drawings
(a).Import the data into R and provide a scatterplot matrix for the variables (the response and
predictor variables) mentioned above.
(b).Fit the model Time ~ 1 + CCost + Dwgs and provide the diagnostic plots. Do you see
any violation of the linear regression assumptions?
(c). You can see from the scatterplot matrix in (a) that the predictor variables are not linearly
related. This indicates that we need to transform the predictor variables. Find appropriate
transformations for the predictor variables. Hint: Use the R function powerTransform( ) from
the library alr4 or car.
1(d). After transforming the predictor variables based on (c), find an appropriate transformation
for the response variable, Time. Hint: use the R function boxcox( ) from the library MASS.
(e). Regardless of your answers in (c) and (d), we will use log transformation for the response
variable and the predictor variables. Draw the scatterplot matrix for the transformed variables.
Now the predictor variables should look linearly related.
(f). Fit the model log(Time) ~ 1 + log(CCost) + log(Dwgs)and interpret the slope
coefficient of log(CCost)by filling in the blanks below.
When the construction cost increase by 1%, the design time _________________________ .
(g). Common sense says that design time should increase as the construction cost goes up. Is
your estimated coefficient value consistent with this common sense? Please answer yes or no.
________ .
(h).Use R to calculate the VIF values for the model log(Time) ~ 1 + log(CCost) +
log(Dwgs). Use 5 as the threshold, does the model has collinearity issue? Hint: use the R
function vif( ).
(i). Draw the diagnostic plots of the model log(Time) ~ 1 + log(CCost) + log(Dwgs).
Do you see any violation of the linear regression assumptions? Compare and contrast with the
diagnostic plots in (b).
(j). Use model log(Time) ~ 1 + log(CCost) + log(Dwgs) to predict the design time
for the next bridge to be built in Hennepin County. This bridge has 6 structural drawings and
has a budget of 300 thousand dollars for construction cost. Estimate the design time and
provide the 95% prediction interval for the estimated design time.
2Problem 2:
For logistic regression with p predictor variables, the model is specified as
log( ) x x .. x .
E(Y |X)1E(Y |X) = β0 + β1 1 + β2 2 + . + βp p
Derive the formula to show that(β +β x +β x +...+β x ) 0 1 1 2 2 p p
Problem 3:
On April 15, 1912, during her maiden voyage, the ship Titanic sank after colliding with an
iceberg, killing many passengers and crew. Here, we will use a subset of the data to analyze the
survival rates for different groups of people. Please download the dataset
TitanicPartial.csv from Canvas and work through the following questions.
Variables:
Survival: 1 (survived) or 0 (dead)
Sex: “male” or “female”
Pclass: passenger class, 1, 2 or 3.
Age: age in years.
(a). Based on the scatterplots above, which group (combination of Sex and Pclass) has the
lowest odds for survival? Hint: look at the relative number of survival and non-survival in each
group.
3(b). Fit the following two logistic regression models and provide the summary output for each
model. Hint: Use R function glm( )
mod1: Survived ~ 1 + as.factor(Pclass)
mod2: Survived ~ 1 + as.factor(Pclass) + Age + Sex
(c) Write down the fitted model of mod1. You may use either format (log odds or probability).
(d). What happens if we don’t apply the as.factor( ) function to Pclass? Try fitting mod1
without as.factor( ). You can call this new model mod3. How do the summary outputs of
mod1 and mod3 differ?
(e) According to mod2, holding Age and Sex constant, which passenger class has the highest
odds of survival and which passenger class has the lowest odds of survival? Please explain.
(f). Interpret the estimated slope of Age in mod2 with respect to the probability of survival.
(g). Interpret the estimated coefficient of Sexmale in mod2 with respect to the probability of
survival.
4(h). In the 1997 movie Titanic, the lead character Jack was a 20-year-old male passenger in the
3rd class. Please predict his probability of survival based on mod2. Please use the formula of
the fitted model to compute the probability.
(i).Redo part (h). This time, please use the predict( ) function. You should get a very similar
answer if not identical.
(j). Since mod1 is nested within mod2, we can compare these two models using the deviance.
Which model do you prefer? Hint: use R function anova( )
(k) Redo Part (j). But this time, you are not allowed to use anova( ) function! Instead, reply
on the summary output of mod1 and mod2 in (a), and the pchisq( ) function. You should get
a very similar p value if not identical.
5