PSTAT 126 Final Project Option 1: the CDI Data
1 Description
The following description can be found in Appendix C.2 of Applied Linear Regression Models, fourth
edition, by Kutner, Nachtsheim, and Neter:
This data set provides selected county demographic information (CDI) for 440 of the most
populous counties in the United States. Each line of the data set has an identification number with
a county name and state abbreviation and provides information on 14 variables for a single county.
Counties with missing data were deleted from the data set. The information generally pertains to
the years 1990 and 1992. The 16 variables (county name, state abbreviation, and the 14 measured variables) are
Variable Name   Description
County          County name
State           Two-letter state abbreviation
LandArea        Land area (square miles)
TotalPop        Estimated 1990 population
Pop18           Percent of 1990 CDI population aged 18–34
Pop65           Percent of 1990 CDI population aged 65 years old and older
Physicians      Number of professionally active nonfederal physicians during 1990
Beds            Total number of beds, cribs, and bassinets during 1990
Crimes          Total number of serious crimes in 1990, including murder, rape, robbery, aggravated assault, burglary, larceny-theft, and motor vehicle theft, as reported by law enforcement agencies
HSGrad          Percent of adult population (persons 25 years old or older) who completed 12 or more years of school
Bachelor        Percent of adult population (persons 25 years old or older) with bachelor’s degree
Poverty         Percent of 1990 CDI population with income below poverty level
Unemp           Percent of 1990 CDI labor force that is unemployed
IncPerCap       Per capita income of 1990 CDI population (dollars)
PersonalInc     Total personal income of 1990 CDI population (in millions of dollars)
Region          Geographic region classification is that used by the U.S. Bureau of the Census, where: 1 = NE, 2 = NC, 3 = S, 4 = W
The file CDI.rds contains these data and is available on Gauchospace.
2 Project Components
The overall project consists of a thorough investigation of two regression models that combine concepts and
methods of linear regression used throughout the quarter.
2.1 Part I
You will investigate the model
Physicians ~ log(TotalPop) + LandArea + IncPerCap (1)
by answering the following questions.
a) What relationships do you expect to see between the response and each of the predictors, and why? What
kind of associations, if any, do you expect will be present between the three predictors, and why? Do
some exploratory analysis (e.g. plots and/or numerical summaries) to test your intuition.
b) Fit the model in (1) and provide interpretations of the estimated coefficients. Report the value of R² and
explain its meaning.
c) Do diagnostic checks to assess whether or not the linear regression assumptions seem to hold. If the
model assumptions do not hold in your view, investigate possible transformations for predictors and/or
response. Once suitable transformations are found, repeat b) for this new model and use this model for
the remainder of Part I. Otherwise, move on to d).
d) Using your fitted model, compute 95% confidence intervals for each of the coefficients in the model, and
provide an interpretation for each. Conduct a test for the existence of a linear relationship between the
predictors and response at α = 0.01. Give the null and alternative hypotheses (defining any notation
that you use), value of the test statistic and its null distribution, the p-value or critical value, and your
decision.
e) Does the variance increase or decrease with log(TotalPop)? Perform a test to support your conclusion. If
you conclude that the variance is not constant, refit the model using weighted least squares and comment
on any differences in the fitted coefficients or their standard errors.
f) Summarize your analysis and comment on any interesting or unexpected findings.
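The weighted least squares refit mentioned in e) can be illustrated with a small sketch. It is written in plain Python and assumes a single predictor and weights proportional to the inverse of the error variance. The function name `weighted_least_squares` and the toy data are hypothetical. The project itself would apply the same idea to model (1) with all of its predictors, and with weights chosen from the diagnostics.

```python
def weighted_least_squares(x, y, w):
    """Fit y = b0 + b1 * x by minimising sum(w_i * (y_i - b0 - b1 * x_i)**2).

    Hypothetical helper for illustration only. Each weight w_i is assumed
    to be proportional to 1 / Var(error_i), so noisier observations count less.
    """
    total_w = sum(w)
    # Weighted means of the predictor and the response.
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    # Weighted cross-product and sum of squares about the weighted means.
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Toy data (made up, not taken from CDI.rds): y = 2 + 3x exactly.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.0, 8.0, 11.0, 14.0, 17.0]
w = [1.0, 0.5, 0.25, 0.2, 0.1]  # assumed weights: smaller where variance is larger
b0, b1 = weighted_least_squares(x, y, w)
```

With equal weights the function reduces to ordinary least squares, so fitting the same data both ways is a quick way to see how much the weighting changes the coefficients.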
2.2 Part II
You will investigate the model
Physicians ~ TotalPop + Region (2)
a) Fit the model in (2), and check the diagnostics. Find transformations if necessary.
b) Using your transformations from a), refit the new model. For each region separately, write out an equation
that expresses the estimated mean number of physicians as a function of total population and personal
income. Based on these equations, explain why this model is called a parallel regression model.
c) Does the geographic region have a significant effect on the number of physicians in a county? Explain
your answer. If geographic region is not important, remove it from the model from now on.
d) Using model selection techniques from class, build on your current model by selecting relevant predictors
from Pop65, Crimes, Bachelor, Poverty, and PersonalInc. Perform a partial F-test to assess whether
the improvement from adding these predictors compared to the first model is statistically significant at
α = 0.05.
e) Using the model chosen in d), identify any influential points. For any data points with large influence,
use leverages and/or residuals (standardized or studentized) to explain why they are influential.
f) Summarize your analysis and comment on any interesting or unexpected findings.
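The partial F-test in d) compares a reduced model with a larger model that contains it. The sketch below computes only the test statistic from the two residual sums of squares. The function name `partial_f_statistic` and the numbers passed to it are hypothetical and do not come from the CDI data. In R, `anova(reduced, full)` on two nested `lm` fits carries out the same comparison and also reports the p-value.

```python
def partial_f_statistic(rss_reduced, rss_full, df_reduced, df_full):
    """F statistic for comparing two nested linear models.

    rss_* are residual sums of squares and df_* are residual degrees of
    freedom (n minus the number of estimated coefficients). The reduced
    model is nested in the full model, so df_reduced > df_full.
    """
    if df_reduced <= df_full:
        raise ValueError("the reduced model must have more residual df")
    # Drop in RSS per added coefficient ...
    numerator = (rss_reduced - rss_full) / (df_reduced - df_full)
    # ... relative to the error variance estimated from the full model.
    denominator = rss_full / df_full
    return numerator / denominator


# Made-up values for illustration: two predictors added to a model fitted
# on 440 observations, so the residual df go from 438 to 436.
f_stat = partial_f_statistic(rss_reduced=120.0, rss_full=100.0,
                             df_reduced=438, df_full=436)
```

Under the null hypothesis that the added coefficients are all zero, and given the usual model assumptions, the statistic follows an F distribution with (df_reduced - df_full, df_full) degrees of freedom. The decision at α = 0.05 comes from comparing it with the corresponding critical value or p-value.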