Statistics 272 Take Home Exam #2
PLEDGE
I pledge my honor that during this exam I have neither given nor received assistance and that I have seen no
dishonest work.
Signed:
I have intentionally NOT signed the pledge.
For this exam, you may use our course textbook (Stat2), class notes, material on Moodle, material on the R
server, and R. No other resources (books, electronic resources, other students, etc.) may be used. You may
not discuss any aspect of this exam with any person other than me. Ask me if you have any questions.
This assignment is due at 4:00 PM on November 20. Turn in a printed copy of your knitted RMarkdown file
to my office by the deadline, along with the considered pledge at the top of this page.
Your submission should be well-formatted, well-written, and easy to read. The RMarkdown file that you
used for your exam should be made available in the Submit folder on the R server. Do not edit this file
after the submission deadline.
The Pima are a group of Native Americans living in Arizona and Mexico. The Pima have one of the highest
prevalences of type 2 diabetes in the world. You will be working with the dataset pimas.csv, which contains
measurements on 768 Pima women. This file is on the R server. Answer each question clearly and concisely,
justify each answer, and check assumptions where appropriate.
Variable Name   Description
pregnant        number of pregnancies (0 = 0-1, 1 = 2 or more)
glucose         plasma glucose concentration (glucose tolerance test)
pressure        diastolic blood pressure (mm Hg)
triceps         triceps skin fold thickness (mm)
mass            body mass index ((weight in kg)/(height in m)^2)
age             age (years)
diabetes        diabetes status (neg = no diabetes, pos = diabetes)
We are interested in using the Pimas dataset to construct a model that predicts the probability of diabetes.
1. Perform an exploratory data analysis to determine how each explanatory variable is related to the
response. For quantitative variables, produce conditional density plots and summary statistics by
diabetes status. For categorical explanatory variables, produce an appropriate table showing the
relationship with diabetes status. Summarize each relationship in a brief sentence.
2. Construct a two-way table of diabetes status by pregnancy category (pregnant). Find the (unadjusted)
odds ratio and provide an interpretation in the context of the problem.
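As a minimal sketch of how an unadjusted odds ratio can be read off a two-way table, the R code below uses made-up counts (they are not taken from pimas.csv). The object names `tab`, `odds`, and `odds_ratio` are illustrative assumptions.

```r
# Hypothetical counts for illustration only; these are NOT from pimas.csv.
tab <- matrix(c(60, 40,
                30, 70),
              nrow = 2, byrow = TRUE,
              dimnames = list(pregnant = c("0", "1"),
                              diabetes = c("neg", "pos")))

# Odds of diabetes (pos versus neg) within each pregnancy category
odds <- tab[, "pos"] / tab[, "neg"]

# Odds ratio comparing pregnant = 1 to pregnant = 0
odds_ratio <- unname(odds["1"] / odds["0"])
odds_ratio
```

With the real data, a table of the same shape could come from something like `table(pimas$pregnant, pimas$diabetes)`, assuming the file has been read into a data frame named `pimas` (a hypothetical name).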
Next, fit logistic regression models with diabetes as the response and the following sets of explanatory
variables.
1: pregnant
2: pregnant, mass, age, triceps
3: pregnant, mass, age
4: pregnant, mass, age, pressure, glucose
5: pregnant, mass(centered), age(centered), glucose(centered)
6: pregnant, mass, age, glucose, pregnant:age
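The model forms listed above can be sketched in R with `glm()`. The example below is only an illustration on simulated data: the data frame `sim`, the coefficients used to generate it, and the model object names are all assumptions, and none of the values come from pimas.csv. It shows a main-effects fit, a fit with mean-centered predictors, and a fit with an interaction term.

```r
set.seed(272)
n <- 400

# Synthetic stand-in data. Column names mirror the variable table,
# but every value is simulated; nothing here comes from pimas.csv.
sim <- data.frame(
  pregnant = rbinom(n, 1, 0.5),
  mass     = rnorm(n, mean = 32, sd = 6),
  age      = round(runif(n, 21, 70)),
  glucose  = rnorm(n, mean = 120, sd = 30)
)
eta <- -8 + 0.5 * sim$pregnant + 0.08 * sim$mass +
  0.03 * sim$age + 0.03 * sim$glucose
sim$diabetes <- factor(ifelse(rbinom(n, 1, plogis(eta)) == 1, "pos", "neg"),
                       levels = c("neg", "pos"))

# With a two-level factor response, glm() models the probability of the
# second level ("pos").
m_main <- glm(diabetes ~ pregnant + mass + age + glucose,
              data = sim, family = binomial)

# Centered predictors: subtract each sample mean before fitting
sim$mass_c    <- sim$mass    - mean(sim$mass)
sim$age_c     <- sim$age     - mean(sim$age)
sim$glucose_c <- sim$glucose - mean(sim$glucose)
m_centered <- glm(diabetes ~ pregnant + mass_c + age_c + glucose_c,
                  data = sim, family = binomial)

# Interaction between pregnant and age
m_inter <- glm(diabetes ~ pregnant + mass + age + glucose + pregnant:age,
               data = sim, family = binomial)

# Side-by-side coefficients: centering changes the intercept, not the slopes
round(cbind(main = coef(m_main), centered = coef(m_centered)), 4)
```

Centering leaves the slopes unchanged and moves the intercept to the log-odds for a woman with `pregnant = 0` and average values of the centered predictors.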
1. Use Model 1 to find the unadjusted odds ratio. Compare this answer to the odds ratio you found from
the two-way table (Question 2 above) and explain any differences or similarities.
2. Construct the 95% confidence interval for the odds ratio in Model 1 and provide an interpretation in
the context of the study.
3. Is there evidence that we can remove triceps from the models? Justify your answer using an appropriate
test.
4. The predictors pregnant, mass, and age can all be obtained from medical records, but the predictors
glucose and pressure require an in-person measurement. Should these in-person explanatory variables
be included in the model? Conduct an appropriate test and justify your answer.
5. Provide an interpretation of all of the coefficients in Model 3.
6. Considering Model 3, for a woman with no previous pregnancies and with the average age, what does
her mass need to be in order to have a predicted probability of diabetes greater than 0.50? Justify your
answer by hand (not using R).
7. Using Model 4, find the predicted probability of diabetes for a 55-year-old woman who has had 3
pregnancies and has a mass of 32.
8. Provide an interpretation of all of the coefficients in Model 5.
9. Provide an interpretation of the interaction term in Model 6.
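Several of the questions above involve comparing nested logistic models, forming a confidence interval for an odds ratio, and computing a predicted probability. The sketch below shows one common way to do each in R, on simulated data (not pimas.csv). It uses a drop-in-deviance (likelihood ratio) test and a Wald interval; the exam does not prescribe these, and other choices such as a profile-likelihood interval are possible. All object names and generating coefficients are assumptions.

```r
set.seed(1)
n <- 300

# Synthetic stand-in data; values are simulated, not from pimas.csv.
sim <- data.frame(
  pregnant = rbinom(n, 1, 0.5),
  mass     = rnorm(n, mean = 32, sd = 6),
  age      = round(runif(n, 21, 70)),
  triceps  = rnorm(n, mean = 29, sd = 10)
)
eta <- -5 + 0.6 * sim$pregnant + 0.1 * sim$mass + 0.02 * sim$age
sim$diabetes <- rbinom(n, 1, plogis(eta))  # 1 = diabetes, 0 = no diabetes

reduced <- glm(diabetes ~ pregnant + mass + age,
               data = sim, family = binomial)
full    <- glm(diabetes ~ pregnant + mass + age + triceps,
               data = sim, family = binomial)

# Drop-in-deviance test: the difference in residual deviances is compared
# to a chi-square distribution with df = number of extra terms (1 here).
lrt     <- anova(reduced, full, test = "Chisq")
G       <- deviance(reduced) - deviance(full)
p_value <- pchisq(G, df = 1, lower.tail = FALSE)

# Wald 95% interval for the odds ratio of pregnant in the reduced model:
# build the interval on the log-odds scale, then exponentiate.
est   <- coef(summary(reduced))["pregnant", "Estimate"]
se    <- coef(summary(reduced))["pregnant", "Std. Error"]
or_ci <- exp(est + c(-1, 1) * qnorm(0.975) * se)

# Predicted probability for a hypothetical new observation
new_obs <- data.frame(pregnant = 1, mass = 32, age = 55)
p_hat   <- predict(reduced, newdata = new_obs, type = "response")
```

Because the logistic function equals 0.50 when its argument is 0, a fitted probability exceeds 0.50 exactly when the linear predictor is positive, which is what makes a by-hand threshold calculation possible.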
The next few questions are on the bootstrap. We will continue with the Pima data. Suppose we want to
construct a 95% confidence interval for the IQR of glucose. Generate a bootstrap distribution of IQRs for
samples of size n from the original sample.
1. In a brief paragraph, describe how to use the bootstrap, why it works, and why it is useful.
2. Plot a histogram of the bootstrap distribution and report the standard deviation. What does this
distribution allow us to do?
3. Using your bootstrap distribution, select an appropriate method to construct a bootstrap confidence
interval for the population IQR. Interpret your interval in a brief sentence.
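A minimal sketch of a bootstrap distribution for the IQR follows. It resamples a simulated vector standing in for the glucose column (the values are not from pimas.csv), and it uses a percentile interval as one possible method. The names `glucose`, `B`, `boot_iqr`, `boot_se`, and `ci_percentile`, and the choice of 5000 resamples, are assumptions made for illustration.

```r
set.seed(2023)

# Synthetic stand-in for the glucose measurements (NOT the real data)
glucose <- rnorm(768, mean = 120, sd = 30)
n <- length(glucose)
B <- 5000  # number of bootstrap resamples (an arbitrary choice)

# Each resample draws n values with replacement from the original sample
# and records the IQR of that resample.
boot_iqr <- replicate(B, IQR(sample(glucose, size = n, replace = TRUE)))

# The standard deviation of the bootstrap distribution estimates the
# standard error of the sample IQR.
boot_se <- sd(boot_iqr)

# Percentile interval: the middle 95% of the bootstrap IQRs
ci_percentile <- quantile(boot_iqr, probs = c(0.025, 0.975))
```

Calling `hist(boot_iqr)` would display the bootstrap distribution. Whether a percentile interval is appropriate depends on the shape of that distribution, so it is worth checking the histogram for skewness before choosing a method.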