RESEARCH SCHOOL OF FINANCE, ACTUARIAL STUDIES AND STATISTICS
REGRESSION MODELLING
(STAT2008/STAT4038/STAT6014/STAT6038)
Assignment 2 for Semester 1, 2019
INSTRUCTIONS:
This assignment is worth 20% of your overall marks for this course.
Please submit your assignment on Wattle. When uploading to Wattle you must submit the following,
combined into a single document:
1. Your assignment/report as a PDF or Word document.
2. The R code you have used for the assignment as an appendix. Failure to upload the R code
will result in a penalty.
Assignments should be typed. Scanned PDF files will not be marked and will result in a penalty. Your
assignment may include some carefully edited computer output (e.g. graphs, tables) showing the
results of your data analysis and a discussion of these results, as well as some carefully selected code.
Please be selective about what you present and only include as many pages and as much computer
output as necessary to justify your solution. It is important to be concise in your discussion of
the results. Clearly label each part of your report with the part of the question that it refers to.
Unless otherwise advised, use a significance level of 5% and two decimal places for all answers.
Marks may be deducted if these instructions are not strictly adhered to, and marks will certainly be
deducted if the total report is of an unreasonable length, i.e. more than 10 pages including graphs
and tables. You may include an appendix that is in addition to the above page limits; however the
appendix will not be assessed. It will only be used if there is some question about what you have
actually done.
You may ask me (Abhinav Mehta) questions about this assignment up to 24 hours before the
submission time. This will allow me enough time to respond to your questions.
Late submissions will attract a penalty of 5% of your mark for each day of delay. No assignments
will be accepted 10 days beyond the due date.
Extensions will usually be granted on medical or compassionate grounds on production of appropriate
evidence, but must have my permission by no later than 24 hours before the submission
date. If you are granted an extension and submit your assignment after the extended deadline then
the late submission penalty will still apply.
Assignment 2 - Sem 1, 2019 Page 1 of 3
Question 1 [40 Marks]
A group of researchers in the US attempted to look at the pollution-related factors affecting mortality.
Sixty US cities were sampled. Total age-adjusted mortality (mortality), from all causes, in
deaths per 100,000 population, was measured, along with the following covariates: mean annual
precipitation (in inches) (precipitation); median number of school years completed for persons
aged 25 years or older (education); percentage of population that is non-white (nonwhite); relative
pollution potential of oxides of nitrogen (nox); and relative pollution potential of sulphur
dioxide (so2). “Relative pollution potential” is the product of tons emitted per day per square kilometre
and a factor correcting for the city dimension and exposure. The data is available in a .csv file,
pollution.
(a) [6 marks] Fit a multiple linear regression (MLR) model with mortality as the response variable
and all other covariates as predictors. Is the regression model significant?
(b) [8 marks] What are the estimated coefficients of the MLR model in part (a) and the standard
errors associated with these coefficients? Interpret the values of these estimated coefficients with
regard to model specification.
(c) [8 marks] There is a t-test associated with each of these coefficients. Briefly explain what these
tests can and cannot be used for. In your answer, be sure to mention the appropriate hypotheses
that can be assessed using these t-tests.
(d) [6 marks] Construct an appropriate test of the hypothesis that education and nox are not
significant contributors to the model. That is, test β_education = β_nox = 0.
(e) [6 marks] A researcher from this group suggested that a model with coefficients β_precipitation = 2,
β_education = 10, β_nonwhite = 3, β_nox = 0, and β_so2 = 1 may be a better model. Can you
test whether this new model is significant? How would you fit such a model and what would
be the estimate of the intercept term with these coefficients?
(f) [6 marks] One of the researchers is from the city of San Antonio, and has recorded a new set
of measurements on each of the predictors. The precipitation is 33, education is 11.5,
nonwhite is 17.2 and nox and so2 are each 1. What do you predict the mortality rate to be?
Find a 99% interval for this prediction.
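The techniques named in parts (a) to (f) can be sketched in R as follows. This is a minimal illustration only, not a model answer: the data frame below is synthetic (simulated with arbitrary, made-up coefficients) and merely borrows the variable names from the question, so none of its numbers relate to the real pollution data. Object names such as full, reduced and fixed are hypothetical.

```r
# Synthetic stand-in for the pollution data (ASSUMPTION: simulated values,
# arbitrary coefficients; the real pollution.csv is not reproduced here).
set.seed(1)
n <- 60
pollution <- data.frame(
  precipitation = runif(n, 10, 60),
  education     = runif(n, 9, 12.5),
  nonwhite      = runif(n, 1, 40),
  nox           = rexp(n, rate = 1 / 20),
  so2           = rexp(n, rate = 1 / 50)
)
pollution$mortality <- with(pollution,
  900 + 2 * precipitation - 10 * education + 3 * nonwhite + 0.3 * so2 +
    rnorm(n, sd = 35))

# (a)-(c): full model. summary() reports the coefficients, their standard
# errors, the marginal t-tests and the overall F-test.
full <- lm(mortality ~ precipitation + education + nonwhite + nox + so2,
           data = pollution)
fit_summary <- summary(full)
f <- fit_summary$fstatistic
overall_p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# (d): partial F-test of H0: beta_education = beta_nox = 0, comparing the
# reduced model (without education and nox) against the full model.
reduced <- lm(mortality ~ precipitation + nonwhite + so2, data = pollution)
partial_test <- anova(reduced, full)

# (e): slopes held fixed at the proposed values through offset(), so that
# only the intercept is estimated; then compared against the full model.
fixed <- lm(mortality ~ 1 + offset(2 * precipitation + 10 * education +
                                     3 * nonwhite + 0 * nox + 1 * so2),
            data = pollution)
fixed_vs_full <- anova(fixed, full)

# (f): point prediction with a 99% prediction interval for a new city.
new_city <- data.frame(precipitation = 33, education = 11.5,
                       nonwhite = 17.2, nox = 1, so2 = 1)
pred <- predict(full, newdata = new_city,
                interval = "prediction", level = 0.99)
```

With all slopes fixed, the least-squares intercept in the offset model is simply the mean of the response minus the offset, which is one way to check coef(fixed) by hand. Note that interval = "prediction" gives an interval for a new observation, whereas interval = "confidence" would give one for the mean response.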
Question 2 [60 Marks]
The data for this question comprises measurements on breeding pairs of land-bird species collected
from 16 islands around Britain over the course of several decades, available in a .csv file, bird. For
each species, the data set contains an average time to extinction (extinct) on those islands where
the species appeared (this is actually the reciprocal of the average of 1/T, where T is the length of
time the species remained on the island and 1/T is taken to be zero if the species did not become
extinct on the island); the average number of nesting pairs per year, over all islands where the species
appeared (nest.pair); the size (size) of the species (S = Small, L = Large); and the migratory
status (mig.status) of the species (R = Resident, M = Migrant). It is expected that species
with large numbers of nesting pairs will tend to remain longer before becoming extinct. Of particular
interest is whether, after accounting for the number of nesting pairs, size or migratory status has any
effect.
(a) [10 marks] Fit a multiple linear regression (MLR) model with extinct as the response variable
and all other covariates as predictors. Is the regression model significant? Interpret the coefficients
for the categorical variables in this model. Does the coefficient of nest.pair support the expectation
that large numbers of nesting pairs tend to delay extinction?
(b) [6 marks] As the question indicates, of particular interest is whether, after accounting for the
number of nesting pairs, size or migratory status has any effect. Conduct a formal test of the
hypothesis that β_Size = β_MigStatus = 0 using an appropriate anova table. Evaluate the F-statistic
and the corresponding p-value.
(c) [6 marks] The Red-crested Periwinkle is a small, migratory species of bird, while the Great
Plover is a large, resident species of bird. Assuming that the number of nesting pairs is the same
for each species over the period, based on the model in part (a), what would you predict the
difference in extinction times to be for these two species?
(d) [8 marks] A noted theory suggests that Size and Migratory Status should contribute equally
to the extinction time. Test whether the coefficients of size and mig.status are the same.
Construct an appropriate model to test this hypothesis.
(e) [20 marks] Produce the appropriate diagnostic plots for the model fitted in part (a) and assess
the model assumptions. Produce the relevant influence diagnostics for this model. Which
data points appear to be influential in the analysis, and in what sense would you consider them
influential? Also, do any points appear to be outliers? If so, to which species do these points
correspond?
(f) [10 marks] Two transformations are suggested for the response variable, log(extinct) and
1/extinct. Investigate whether using these transformations improves on the model fit. Comment
on the assumptions of MLR for these models as compared to your original model. Which
of the three models would you choose based on your analysis?
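A minimal R sketch of the tools these parts call for is given below. It rests on stated assumptions: the bird data frame is simulated (the sample size of 40 and all coefficients are made up), only the variable names come from the question, and R's default treatment coding is assumed, so L and M are the baseline levels. The equality test in part (d) is done here by replacing the two dummy variables with their sum, which is one common way to impose the constraint, and its meaning depends on that choice of baseline levels.

```r
# Synthetic stand-in for the bird data (ASSUMPTION: simulated values only;
# n = 40 is hypothetical and the real bird.csv is not reproduced here).
set.seed(2)
n <- 40
bird <- data.frame(
  nest.pair  = round(runif(n, 1, 12), 1),
  size       = factor(sample(c("L", "S"), n, replace = TRUE)),
  mig.status = factor(sample(c("M", "R"), n, replace = TRUE))
)
bird$extinct <- exp(0.5 + 0.25 * bird$nest.pair - 0.6 * (bird$size == "S") +
                      0.4 * (bird$mig.status == "R") + rnorm(n, sd = 0.5))

# (a): full model; coefficients are named sizeS and mig.statusR because
# L and M are the (alphabetical) baseline levels.
fit <- lm(extinct ~ nest.pair + size + mig.status, data = bird)

# (b): partial F-test of H0: beta_Size = beta_MigStatus = 0.
size_mig_test <- anova(lm(extinct ~ nest.pair, data = bird), fit)

# (c): small migrant minus large resident, at equal nest.pair.
diff_small_migrant <- unname(coef(fit)["sizeS"] - coef(fit)["mig.statusR"])

# (d): under H0 the two dummies share one coefficient, so their sum enters
# the constrained model as a single regressor.
bird$both <- (bird$size == "S") + (bird$mig.status == "R")
equal_fit  <- lm(extinct ~ nest.pair + both, data = bird)
equal_test <- anova(equal_fit, fit)

# (e): standard diagnostic plots and influence measures.
if (interactive()) plot(fit, which = 1:4)
p <- length(coef(fit))
lev   <- hatvalues(fit)        # leverage
cooks <- cooks.distance(fit)   # overall influence on the fitted values
rstud <- rstudent(fit)         # externally studentised residuals
high_leverage <- which(lev > 2 * p / n)   # a common rule of thumb
possible_outliers <- which(abs(rstud) > 2)

# (f): the two suggested transformations of the response.
fit_log <- lm(log(extinct) ~ nest.pair + size + mig.status, data = bird)
fit_inv <- lm(I(1 / extinct) ~ nest.pair + size + mig.status, data = bird)
```

Because the three models in part (f) have responses on different scales, their residual standard errors are not directly comparable, so the comparison is usually made through the diagnostic plots of each fit rather than a single summary number. The cut-offs 2p/n and 2 used above are conventions, not rules set by the assignment.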