Stat 203V
Final Exam.
Instructions: Solutions should be uploaded to Gradescope before Saturday, August
17 at 11:59 p.m. The deadline is firm: to avoid technical glitches, please start early.
There are 4 problems of equal point value. The datasets binge.txt along with
survey{.csv,.Rmd} and pima.Rmd are in the directory Final on the course website.
In writing solutions to the data problems, R Markdown is highly recommended. Include
only the pieces of output (and plots) needed for your answer. Readability of solutions
is important. Poorly organized presentation and/or errors in additional plots/analyses
may be penalized. For each question, as appropriate, explain clearly (i) your objectives, (ii)
any hypotheses that you are testing, (iii) the statistical procedure, and (iv) the statistical
support for your conclusions. An example (from a HW problem) is given in the files
sample-solution.{Rmd,pdf}.
Honor Code: Please respect the honor code in completing this exam. You can use
books, computers and the internet, but not other people.
1. [25 pts.] Some years back, the Centers for Disease Control issued a report on binge
drinking that received national attention. The data set binge.txt contains data for 48
states (no data for South Dakota and Tennessee) on the age-adjusted prevalence of binge
drinking (as a percentage of adults responding to a telephone survey).
The CDC article stated “Overall, states with the highest age-adjusted prevalence of
adult binge drinking were in the Midwest and New England, and included Alaska and
Hawaii.”
A question of interest might be whether the variation in binge drinking was associated
with climate, in particular the depth of winters. The file binge.txt includes columns with
the average winter temperature (degrees Celsius) and the state population (in millions).
Investigate the relation between prevalence of binge drinking and the predictors. For
example, can the regional variation be ascribed to differences in climate? Summarize your
findings.
2. [25 pts.] The file survey203.csv contains the survey data on lecture attendance and
Freedman practice problems studied (variables Lectures and Practice, with NA used for
no-response cases) merged with the midterm Scores. The R Markdown file survey203.Rmd
preprocesses the data to reorder the levels of the factors. You should add your answers in
your copy of this file.
(a) Create a variable that indicates whether the case (i.e. row) contains a missing value.
Is non-response to the survey associated with the midterm scores?
Use na.omit to create a data frame with no missing values for the rest of this question.
(b) Assess whether the levels of the factor Practice have a significant effect on the midterm
scores, both via ANOVA and by pairwise comparison of means. Summarize your conclusions.
(c) Now consider both factors and construct an ANOVA table with main effects and
interactions and interpret your conclusions.
(d) Use the function as.numeric() to “coerce” each of the factors to numeric variables
PracticeN, LectureN. Is there an association between these (coerced) variables? Fit a linear
model with Score as response and the numeric variables plus interaction as predictors,
and summarize your conclusions, including a comparison with the results from (c).
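The data-handling steps named in parts (a) and (d) can be sketched as follows. This is only an illustration on a made-up data frame: the column names follow the problem text, but the values, the factor levels, and the names `Missing` and `survey_cc` are assumptions, not the contents of the course files.

```r
# Synthetic stand-in for the survey data (values and levels are invented).
set.seed(1)
survey <- data.frame(
  Score    = round(rnorm(12, mean = 70, sd = 10)),
  Lectures = factor(c("few", "most", NA, "all", "most", "all",
                      "few", NA, "all", "most", "few", "all")),
  Practice = factor(c("none", "some", "some", NA, "many", "many",
                      "none", NA, "some", "many", "none", "some"))
)

# TRUE when any column in the row is NA (hypothetical variable name).
survey$Missing <- !complete.cases(survey)

# Complete-case data frame for the later parts.
survey_cc <- na.omit(survey)

# as.numeric() on a factor returns its integer level codes,
# so the result depends on the order of the levels.
survey_cc$PracticeN <- as.numeric(survey_cc$Practice)
survey_cc$LectureN  <- as.numeric(survey_cc$Lectures)

table(survey$Missing)
```

Because the codes follow the level order, the coerced variables are only meaningful once the levels have been put in a sensible order, which is what the preprocessing in the R Markdown file is described as doing.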
3. [25 pts.] (a) Dropping a predictor that is orthogonal to the others doesn’t change the
coefficient estimates. More specifically, suppose that the $n \times p$ design matrix $X$, assumed
to be of full rank, is partitioned into $[X_A \; X_B]$ with $X_A$ being $n \times p_A$ and $X_B$ being $n \times p_B$,
with $p = p_A + p_B$. Let $\hat\beta$ be the least squares estimate using $X$ and $\hat\beta_A$ that using $X_A$.
Suppose that $X_B$ is orthogonal to $X_A$: $X_B^\top X_A = 0$. Show that
$$\hat\beta_i = \hat\beta_{A,i} \quad \text{for } i = 1, \ldots, p_A.$$
(b) Consider a one way ANOVA model
$$y_{ij} = \mu + \alpha_i + \epsilon_{ij}, \quad i = 1, \ldots, I, \; j = 1, \ldots, n_i.$$
Suppose that the design is balanced, $n_i \equiv n_1$ for all $i$. Consider the design matrices
corresponding to “treatment” and “sum” contrasts. In each of the two cases, is the intercept
column orthogonal to the factor columns? What if the design is unbalanced? Explain.
(c) Now consider the coefficient differences $\alpha_i - \alpha_j$ in the balanced one way ANOVA
model. Do their estimates $\hat\alpha_i - \hat\alpha_j$ depend on whether treatment or sum contrasts are
chosen? Explain.
(d) In question 2(c), do the sums of squares in the ANOVA table change if the order in
which Practice and Lectures appear is switched? Can you explain briefly in words (i.e.
without detailed mathematical argument) why this might be?
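For part (b), the design matrices under the two codings can be inspected numerically. The sketch below uses a made-up factor `g` with three groups; the group sizes and object names are assumptions chosen for illustration, and looking at the numbers does not replace the explanation the question asks for.

```r
# Balanced layout: 3 groups of 4 observations (invented design).
g <- factor(rep(c("a", "b", "c"), each = 4))

X_trt <- model.matrix(~ g, contrasts.arg = list(g = "contr.treatment"))
X_sum <- model.matrix(~ g, contrasts.arg = list(g = "contr.sum"))

# Inner products of the intercept column with each factor column.
ip_trt <- crossprod(X_trt[, 1], X_trt[, -1])
ip_sum <- crossprod(X_sum[, 1], X_sum[, -1])

# Unbalanced layout with group sizes 2, 4, 6 for comparison.
g_u <- factor(rep(c("a", "b", "c"), times = c(2, 4, 6)))
X_sum_u  <- model.matrix(~ g_u, contrasts.arg = list(g_u = "contr.sum"))
ip_sum_u <- crossprod(X_sum_u[, 1], X_sum_u[, -1])

ip_trt
ip_sum
ip_sum_u
```

An inner product of zero means the corresponding factor column is orthogonal to the intercept column in that design.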
4. [25 pts.] The National Institute of Diabetes and Digestive and Kidney Diseases
conducted a study on 768 adult female Pima Indians living near Phoenix. The purpose
of the study was to investigate factors related to diabetes. The data may be found in the
dataset pima in library(faraway). See also pima.Rmd in which some preprocessing is
done: it creates a factor version of the test results. And, as discussed in Ch. 1 of Faraway,
the zero values for variables diastolic, glucose, triceps, insulin and bmi in fact
seem to be missing values, so those are set to NA.
(a) Fit a model with the result of the diabetes test as the response and all the other
variables as predictors. How many observations were used in the model fitting?
(b) Refit the model but now without the insulin and triceps predictors. How many
observations were used in fitting this model? Devise a test to compare this model with
that in the previous question and report your conclusion. Hint: use na.omit() to create a
data frame with no missing values.
(c) Use AIC via the function step() to select a model. You will need to take account
of the missing values. Which predictors are selected? How many cases are used in your
selected model?
(d) Create a variable that indicates whether the case contains a missing value. Use this
variable as a predictor of the test result. Is missingness associated with the test result?
Refit the selected model from (c), but now using as much of the data as is feasible. Explain
why it is appropriate to do this.
(e) Using the last fitted model of the previous question, what is the ratio of the odds of
testing positive for diabetes for a woman with a BMI at the first quartile compared with
a woman at the third quartile, assuming that all other factors are held constant? Give a
confidence interval for this ratio.
(f) Do women who test positive have higher diastolic blood pressures? Is the diastolic
blood pressure significant in the regression model? Explain the distinction between the two
questions and discuss why the answers are only apparently contradictory.
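The mechanics behind parts (a) and (e), counting the cases a fit actually used and turning a logistic-regression coefficient into an odds ratio between two quartiles, can be sketched on simulated data. Everything below is an assumption made for illustration: the data frame `sim`, its coefficients and sample size are invented, and the interval shown is a Wald interval on the log-odds scale, which is one possible choice rather than the required one.

```r
# Simulated stand-in data; not the pima dataset.
set.seed(203)
n <- 300
sim <- data.frame(bmi = rnorm(n, 32, 7), age = round(runif(n, 21, 70)))
sim$test <- rbinom(n, 1, plogis(-6 + 0.1 * sim$bmi + 0.04 * sim$age))
sim$bmi[sample(n, 20)] <- NA   # inject some missing values

fit <- glm(test ~ bmi + age, family = binomial, data = sim)
nobs(fit)  # number of rows used after glm() drops incomplete cases

# Quartiles of the predictor, ignoring missing values.
q <- quantile(sim$bmi, c(0.25, 0.75), na.rm = TRUE)
d <- unname(q[1] - q[2])   # first quartile minus third quartile

b  <- coef(fit)["bmi"]
se <- sqrt(vcov(fit)["bmi", "bmi"])

# Odds ratio (first vs. third quartile) and a 95% Wald interval.
# d is negative, so adding the margin to b gives the lower limit.
or    <- exp(d * b)
or_ci <- exp(d * (b + c(1, -1) * qnorm(0.975) * se))

or
or_ci
```

The interval is built on the coefficient scale and then exponentiated, so it is not symmetric around the estimated ratio.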