STA 138 Winter 2019
Homework 5 - Due Friday, Feb 22nd
Book Portion (does not require R)
Note: This may be hand written or typed. Answers
should be clearly marked. Please put your name in
the upper right corner.
1. A study is trying to predict if someone will get the flu shot
or not, with the following dataset:
Column 1: shot (Y): Whether the subject got a flu shot (y = 1)
or not (y = 0).
Column 2: age (X1): The age of the subject in years.
Column 3: aware (X2): The health awareness score, where
a higher score indicates a higher level of awareness.
Column 4: gender (X3): M or F
The estimated logistic regression function is:
logit(π̂) = -1.1772 + 0.0728 X1 - 0.0990 X2 + 0.4340 X3,M
where X3,M = 1 if the subject is male and 0 otherwise.
(a) Interpret the exponential of the β associated with
awareness score.
(b) Interpret the exponential of the β associated with gender.
(c) Estimate the probability that a male subject aged 50
with awareness score 60 would not get a flu shot.
(d) Estimate the odds that a female subject aged 30 with
awareness score 50 would get the flu shot.
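For reference, the arithmetic behind questions of this kind can be sketched in Python. This is a minimal illustration, not part of the assignment: the coefficients are the ones reported in the coefficient table of problem 2, while the function names and the subject used at the bottom (a female aged 40 with awareness score 55) are hypothetical and chosen only to show the mechanics.

```python
import math

# Coefficients as reported in the coefficient table of problem 2.
COEF = {"intercept": -1.1772, "age": 0.0728, "aware": -0.0990, "genderM": 0.4340}


def linear_predictor(age, aware, male):
    """Estimated log-odds of getting the flu shot for one subject."""
    return (COEF["intercept"]
            + COEF["age"] * age
            + COEF["aware"] * aware
            + COEF["genderM"] * (1 if male else 0))


def odds_from_log_odds(eta):
    """Odds = exp(log-odds)."""
    return math.exp(eta)


def prob_from_log_odds(eta):
    """Probability = exp(eta) / (1 + exp(eta)), written in a stable form."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical subject (not one of the subjects asked about above).
eta = linear_predictor(age=40, aware=55, male=False)
odds_shot = odds_from_log_odds(eta)
p_shot = prob_from_log_odds(eta)

print(f"log-odds = {eta:.4f}")
print(f"odds of a shot = {odds_shot:.4f}")
print(f"P(shot) = {p_shot:.4f}, P(no shot) = {1 - p_shot:.4f}")
# exp(beta) is the multiplicative change in the odds per one-unit increase.
print(f"exp(beta_aware) = {math.exp(COEF['aware']):.4f}")
```

The probability of not getting the shot is one minus the probability of getting it, and the odds and the probability are linked by odds = p / (1 - p).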
2. Continue with problem 1. The estimated standard errors
for the β coefficients follow:
              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)    -1.1772      2.9824  -0.3947    0.6931
age             0.0728      0.0304   2.3959    0.0166
aware          -0.0990      0.0335  -2.9567    0.0031
genderM         0.4340      0.5218   0.8317    0.4056
(a) Based on the above, with an α of 0.05, does it appear
that gender was a significant predictor for the probability
of getting the flu shot? Explain your answer.
(b) Which coefficient appears to be the most useful in predicting
if a subject gets the flu shot? Explain your
answer.
(c) Find the 95% corrected confidence interval for the β
associated with age, assuming you are making g = 3
confidence intervals.
(d) What does your interval from (c) suggest about retaining
or removing the age variable from the model?
Explain your answer.
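A corrected interval of this kind can be sketched as follows. The sketch assumes that "corrected" means a Bonferroni correction with a normal (Wald) multiplier, which the assignment does not state explicitly; the function names and the estimate and standard error used in the example call are hypothetical.

```python
from statistics import NormalDist


def wald_z(estimate, std_error):
    """Wald test statistic for H0: beta = 0."""
    return estimate / std_error


def bonferroni_ci(estimate, std_error, level=0.95, g=1):
    """Wald interval with a Bonferroni correction for g simultaneous intervals.

    Assumption: the multiplier is the standard normal quantile at
    1 - alpha / (2 * g), where alpha = 1 - level.
    """
    alpha = 1.0 - level
    z = NormalDist().inv_cdf(1.0 - alpha / (2.0 * g))
    half_width = z * std_error
    return estimate - half_width, estimate + half_width


# Hypothetical coefficient and standard error, g = 3 intervals at 95%.
lower, upper = bonferroni_ci(0.50, 0.20, level=0.95, g=3)
print(f"z = {wald_z(0.50, 0.20):.4f}")
print(f"corrected 95% CI: ({lower:.4f}, {upper:.4f})")
```

If the resulting interval contains zero, the data are consistent with the coefficient being zero at the chosen family-wise level.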
3. Continue with problem 1. The error matrix for this model
is (where the cutoff used was 0.50):
               Predicted: y = 0   Predicted: y = 1
Truth: y = 0         130                  5
Truth: y = 1          18                  6
(a) Estimate the sensitivity, specificity, and overall error
rate.
(b) The 95% confidence interval for AUC is
(0.7308, 0.9139). Do you believe that the model
is predicting Y = 1 well? Explain your answer.
(c) Explain why you might be interested in AUC over the
error matrix.
(d) Explain why a standardized residual with a value over
3 may be concerning.
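The three summaries of an error matrix can be computed as in the following Python sketch. The layout assumed is the one used in this assignment (rows are the truth, columns are the prediction); the function name and the counts in the example call are hypothetical.

```python
def classification_summary(tn, fp, fn, tp):
    """Summaries of a 2x2 error matrix laid out as

                    Predicted 0   Predicted 1
        Truth 0         tn            fp
        Truth 1         fn            tp
    """
    total = tn + fp + fn + tp
    return {
        # Proportion of true Y = 1 cases predicted as 1.
        "sensitivity": tp / (tp + fn),
        # Proportion of true Y = 0 cases predicted as 0.
        "specificity": tn / (tn + fp),
        # Proportion of all cases that were misclassified.
        "error_rate": (fp + fn) / total,
    }


# Hypothetical counts, for illustration only.
summary = classification_summary(tn=80, fp=20, fn=10, tp=40)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```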
4. A study was performed to examine what affects the probability
of using birth control among married women.
Column 1: con (Y): Whether the subject uses birth control,
where Y = 1 indicates they do, and Y = 0
indicates they do not.
Column 2: age (X1): The age of the subject in years.
Column 3: edu (X2): The level of education of the subject,
with A (advanced), G (graduate or above), M
(high school), L (below high school).
Column 4: working (X3): N (they are not working) or Y
(they are working).
The estimated coefficients (β’s) and their standard errors
are:
              Estimate  Std. Error
(Intercept)     0.3392      0.5364
X1             -0.0095      0.0151
X2,G            0.8300      0.2964
X2,L           -0.7679      0.4669
X2,M           -0.1119      0.3370
X3,Y           -0.0320      0.2888
(a) Write down the model for each of the categories corresponding
to X2. This should give four models.
(b) Estimate the probability that a subject with an advanced
degree who is not working and is age 30 uses
birth control.
(c) Interpret the value exp(0.8300) in terms of the problem.
(d) The log-likelihood for the model that includes all X
variables is: -195.3582 and the log-likelihood for the
model which includes only X1 and X2 is: -195.3644.
Use these to test to see if X3 can be dropped from
the model. State the null, alternative, test-statistic,
p-value, and conclusion.
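A likelihood ratio test for dropping a single coefficient can be sketched in Python as below. The sketch covers only the one-degree-of-freedom case, where the chi-square tail probability can be written with the complementary error function; the function name and the two log-likelihood values in the example call are hypothetical.

```python
import math


def lr_test_1df(loglik_full, loglik_reduced):
    """Likelihood ratio test when the reduced model drops ONE coefficient.

    Test statistic: G^2 = -2 * (logLik_reduced - logLik_full).
    Under H0 (the dropped coefficient is zero) G^2 is approximately
    chi-square with 1 degree of freedom, and for 1 df
    P(chi2 > x) = erfc(sqrt(x / 2)).
    """
    stat = -2.0 * (loglik_reduced - loglik_full)
    # Guard against a tiny negative value caused by rounding.
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods, for illustration only.
stat, p_value = lr_test_1df(loglik_full=-120.5, loglik_reduced=-123.0)
print(f"G^2 = {stat:.4f}, p-value = {p_value:.4f}")
```

A large p-value means the reduced model fits about as well as the full one, so the dropped variable is not needed; a small p-value means it should be kept.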
5. Continue with problem 4. The estimated, corrected 95%
confidence intervals for the model with X1 and X2 in it
follow:
Coefficient      Lower     Upper
β1 (age)       -0.0474    0.0280
β2,G (edu)      0.0938    1.5791
β2,L (edu)     -1.9989    0.3666
β2,M (edu)     -0.9586    0.7315
(a) Does this suggest a significant difference in the odds
of success for education level L vs. A? Explain your
answer.
(b) Does this suggest a significant difference in the odds
of success for education level G vs. A? Explain your
answer.
(c) What would adding an interaction term between age
and education level do? What would be the practical
effect, in other words?
(d) What would your recommendation for the final model
for this data be? Explain your answer.
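Intervals for β can be moved to the odds-ratio scale by exponentiating both endpoints, as in this short Python sketch. The function names and the two intervals in the example are hypothetical and are not the intervals reported above.

```python
import math


def odds_ratio_interval(lower, upper):
    """Exponentiate a CI for beta to get a CI for the odds ratio exp(beta)."""
    return math.exp(lower), math.exp(upper)


def excludes_no_effect(lower, upper):
    """True when the CI for beta excludes 0, i.e. the odds-ratio CI excludes 1."""
    return not (lower <= 0.0 <= upper)


# Hypothetical intervals for a dummy-coded level versus the baseline level.
for name, (lo, hi) in {"level P": (-0.40, 0.90), "level Q": (0.20, 1.10)}.items():
    or_lo, or_hi = odds_ratio_interval(lo, hi)
    print(f"{name}: OR CI = ({or_lo:.4f}, {or_hi:.4f}), "
          f"differs from baseline: {excludes_no_effect(lo, hi)}")
```

Because each dummy coefficient compares one level with the baseline level, an interval that contains zero gives no evidence that the odds differ between that level and the baseline.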
6. Continue with problem 4. Assume we are using the model
with both X1 and X2 in it.
(a) The five-number summary for the standardized residuals
is below:
Min       First Quartile   Median    Third Quartile   Max
-2.0220   -0.9455          -0.0008   0.9874           2.1746
Does this suggest there may be outliers in the data?
Explain why or why not.
(b) The error matrix with the cutoff of 0.50 follows:
               Predicted: Y=0   Predicted: Y=1
Truth: Y=0           63               68
Truth: Y=1           50              119
Estimate the sensitivity, specificity, and overall error
rate.
(c) The error matrix with the cutoff of 0.70 follows:
               Predicted: Y=0   Predicted: Y=1
Truth: Y=0          108               23
Truth: Y=1          130               39
Estimate the sensitivity, specificity, and overall error
rate.
(d) Which cutoff would you suggest using, and why?
7. Answer the following questions as True or False:
(a) In logistic regression, the larger the value of DFbeta,
the more influential the corresponding row of your data
was.
(b) In logistic regression, the intercept does not always
have a practical interpretation.
(c) In logistic regression, the larger the absolute value of
βi, the more the corresponding X affects π̂.
R Portion (requires some use of R)
Note: You do not have to use R Markdown to turn
in the homework, but the homework must be turned
in in a reasonable format. The answers to the questions
should be in the body of the homework, and the
code used to obtain those answers should be in an appendix.
There should be no code in the body of the
homework. You can accomplish this in R, Word, LaTeX,
Google Docs, etc. This portion should be printed
out and turned in with the hand-written portion.
I. Online under “Files” you will find the dataset
internet.csv, which has the following columns:
Column 1. Newbie: 1 if the subject identified themselves
as “new to the Internet”, 0 otherwise.
Column 2. Age: The age of the subject.
Column 3. Gender: 1 indicates the subject was male, 0
indicates female.
Column 4. Educational.Attainment: With levels
“High School”, “College”, “Masters”, and
“Doctoral”.
Column 5. score: The corresponding score for the
Educational.Attainment column, where 1
= High School, 2 = College, 3 = Masters,
and 4 = Doctoral.
The goal is to predict whether someone considers themselves
“new to the Internet”.
(a) Fit a logistic regression model with Newbie as your
response variable, and Age, Gender, and score as
your explanatory variables. Write down the estimated
logistic regression function.
(b) Interpret the value of exp(β) associated with Age in
terms of the problem.
(c) Interpret the value of exp(β) associated with Gender
in terms of the problem.
(d) Interpret the value of exp(β) associated with score
in terms of the problem.
II. Continue with problem I.
(a) Find and report the 99% profile likelihood confidence
intervals for all values of β.
(b) Using (a), which of your explanatory variables do
you believe significantly affect whether someone identifies
themselves as “new to the Internet”? Explain.
(c) Predict the probability that a female, aged 28, with
a doctoral degree identifies themselves as “new to
the Internet”.
(d) Are there any unusual observations in your dataset?
Explain.
III. Online under “Resources” you will find the dataset
work.csv, which has the following columns:
Column 1. obese: 1 if the subject was obese, 0 otherwise.
Column 2. gender: With levels male, female.
Column 3. age: The age of the subject.
Column 4. marriage: With levels married, widowed, divorced,
never married.
Column 5. min: Minutes of sedentary activity per
week.
The goal is to predict whether a subject is obese or not.
(a) Fit and report the estimated logistic regression
model with coefficients for gender, age, and the categories
for the marriage variable.
(b) Write down the estimated logistic regression model
for people who have never been married.
(c) Write down the estimated logistic regression model
for people who are divorced.
IV. Continue with problem III.
(a) Display the Wald Test-statistics and p-values for
testing if each coefficient is zero or not.
(b) Based on the above, which variables would you retain
in your model, and why? Assume α = 0.10.
(c) Fit the estimated logistic regression model with only
the variables you chose from (b).
(d) Interpret the coefficients of the estimated regression
model you chose in (c).
V. Continue with problems III and IV.
(a) Predict the probability that a married woman aged
28 who has 400 sedentary minutes per week is obese
using the full model (all first order predictors, no
interactions).
(b) Predict the probability that a married woman aged
28 who has 400 sedentary minutes per week is obese
using the model suggested in IV(c).
(c) Using the likelihood ratio (LR) test, test whether you can drop
the coefficient for gender from the model. Assume
the “full model” is: logit(π) = α + β1 x_gender + β2 x_min.
Assume a significance level of 0.05.
Report the test-statistic, the conclusion of
the test, and the p-value.
(d) Using the likelihood ratio (LR) test, test whether you can drop
the coefficient for min from the model. Assume the
“full model” is: logit(π) = α + β1 x_gender + β2 x_min.
Assume a significance level of 0.05.
Report the test-statistic, the conclusion of
the test, and the p-value.
VI. Continue with problem IV, and use the “best model”
suggested.
(a) Find the value of AUC, the 95% confidence interval
for AUC, and plot the ROC.
(b) Does this value of AUC suggest that the model has
fit the data well? Explain your answer.
(c) Fit the full model (including all predictors) and repeat
(a) for the full model.
(d) What does (c) suggest about AUC and adding predictors,
if anything?
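As a reminder of what AUC measures, the following Python sketch computes it directly from its pairwise definition: the proportion of (Y = 1, Y = 0) pairs in which the Y = 1 case receives the higher fitted probability, with ties counted as one half. The function name and the toy labels and scores are hypothetical; the sketch does not produce the confidence interval or the ROC plot asked for in (a).

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels: 0/1 outcomes; scores: fitted probabilities from the model.
    Ties between a positive and a negative score count as 1/2.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one Y = 1 and one Y = 0 case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, for illustration only.
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.10, 0.35, 0.40, 0.60, 0.70, 0.90]
toy_auc = auc(toy_labels, toy_scores)
print(f"AUC = {toy_auc:.4f}")
```

An AUC near 0.5 means the fitted probabilities rank the two groups no better than chance, while a value near 1 means they separate the groups almost perfectly.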