辅导MATH4007、辅导R语言编程
- 首页 >> OS编程 MATH4007/G14CST COMPUTATIONAL STATISTICS
Assessed Coursework 2 — 2021/2022
Your work should be submitted electronically via the module’s Moodle page by 15:00 Wednes-
day 4th May 2022. Since this work is assessed, your submission must be entirely your own
work (see the University’s policy on Academic Misconduct). Submissions up to five working days
late will be subject to a penalty of 5% of the maximum mark per working day.
Submission requirements
The submission should be uploaded electronically via the submission box on Moodle, and contain:
1. A pdf file containing any computational results (plots/relevant output) and discussion. This
can be produced using e.g. R Markdown, or by copying output into a Word document.
Please convert any documents to pdf for uploading.
2. A pdf of any theoretical working required. A scan of handwritten work is fine, but you
could also typeset using Latex if you prefer. If it’s more convenient, you can combine this
and the above part into one, e.g. if you wish to put everything in one Latex document, but
this is not required.
3. An R script file, i.e. with a .r extension containing your R code. This should be clearly
formatted, and include brief comments so that a reader can understand what it is doing.
The code should also be ready to run without any further modification by the user, and
should reproduce your results (approximately, for simulation-based results).
Please make sure that all required working, results, details of implementation and discussion are
contained in components 1 and 2 of the above list and not in the script file. The work will
be assessed based on the working, output and discussion in these components, and the script file
will only be used for verification of results. The exception is for the R code itself, whereby it is
sucient to say “refer to script file” where a question asks you to write R code.
A complete submission consists of all the files in your final submission. The submission time of
the work will be based on the time at which the submission is complete, i.e. all files are uploaded.
Please carefully check after uploading your work that the files you upload are the correct ones.
Updates to any part of the submission after the deadline will be considered a new submission and
late penalties will be applicable.
Questions
1. The whiteside data in the R library MASS contain the measurements of gas consumed
and external temperature before and after insulation of a house. We will consider the 26
measurements before insulation, and look at the relationship between gas and temperature.
library(MASS)
X=whiteside[1:26,c(2,3)]
(a) Compute the sample correlation between Gas and Temperature (you may use the R
function for this). Generate 10000 (non-parametric) bootstrap samples to compute
the standard error of this estimate. (You must code the bootstrap algorithm, do not
use boot or similar.) Compute the bias of the estimate using both the bootstrap and
the jackknife, and compare.
(b) Consider a linear regression model
Gasi = a+ bTempi + ?i, i = 1, · · · , 26.
Compute the least squares estimate of b. (You may use the lm command.) Use
model-based resampling to produce a 95% confidence interval for b, again coding the
bootstrap algorithm yourself.
(c) Compute the usual 95% confidence interval for b which results from assuming
?i
iid? N(0, 2), i = 1, · · · , 26. Again, you can use the inbuilt R functions for this.
Compare this interval with the bootstrap interval. Plot a density estimate of the
sampling distribution of b?, based on your bootstrap samples from part (b), and hence
comment on the agreement between the two intervals.
[15]
2. In a study to monitor glucose levels of patients with diabetes, 10 patients were fitted
with continuous glucose monitoring sensors to track glucose levels. If the sensors detect
abnormal glucose levels, this triggers an automatic intervention to change the rate of insulin
infusion. The following data show the number of days the patients were monitored for (the
follow-up time) and the number of instances where abnormal glucose levels were detected
so that an automatic infusion intervention was required.
Unfortunately, patients 9 and 10 had faulty sensors which failed at some point before the
follow-up time, so patients had to manage insulin levels manually. Therefore, all that is
known is that the true number of abnormal glucose level instances (counts) at the follow
up time is at least as large as the count in the table (indicated by ?), which is the number
of counts at the point when the sensor failed.
Patient 1 2 3 4 5 6 7 8 9 10
Follow-up time (xi) 94 16 63 126 5 1 1 2 10 31
Count (yi) 5 1 5 14 3 1 1 4 15 11
2
The following hierarchical model is proposed to model these data.
yi|xi,i Poisson(ixi), i = 1, . . . , 10
i|, Gamma(, ) i = 1, . . . , 10
|a, b Gamma(a, b)
using the parameterization that the density of a random variable Z which follows a Gamma
distribution with parameters a and b is
p(z) =
ba
(a)
za1 exp{bz},
where (·) is the Gamma function. Hence, i represents the rate of abnormal glucose level
instances per day for the ith individual. The yi are assumed independent given i and the i
are assumed independent given and . After consultation with the medical professionals,
the parameters = 2, a = 0.01 and b = 1 can be considered as fixed. The parameter
therefore controls the distribution of the rates of abnormal glucose level instances over the
larger population from which these patients were taken, and hence is the main object of
inference.
Let = (1, . . . ,10) and denote all the observed counts by yobs. Also let
y = (y1, y2, . . . , y9, y10) denote the vector of all the true counts, including the unknown y9
and y10, and let x = (x1, . . . , x10).
(a) By augmenting the observed data with the 2 missing true counts for patients 9 and
10, give full details of the Gibbs sampler to sample from the posterior distribution
?(, |yobs).
HINT: The overall target distribution is
?(y9, y10,, |x, yobs) / ?(y|, ,x)?(, ), y9 15, y10 11
where the proportionality means as a function of all the unknown variables y9, y10,, .
(b) Implement your Gibbs sampler in R. You may use any inbuilt R functions to sample
from the full conditional distributions, but you must write the main algorithm yourself.
Describe what steps you take to be confident that the MCMC has converged and is
giving reliable estimates.
(c) Use your samples to plot a density estimate of the marginal posterior density of ,
(|yobs).
(d) Produce a 95% posterior probability interval for from your samples using
(i) A normal approximation of ?(|yobs) with mean and standard deviation estimated
from your samples.
(ii) The 2.5% and 97.5% quantiles of your samples. You can use the quantile
function for this.
With reference to your plot in (c), comment on any di?erences between the two
intervals.
(e) A new individual is about to monitor their glucose levels for 50 days using a sensor.
Let y be the count of abnormal glucose levels in 50 days for the new individual.
HINT: You do not need to attempt to work out any integrals theoretically. This is
purely about simulation. using the given hierarchical model, and how does this help?
(f) Implement this to obtain samples of y? from ?(y?|yobs), and use your samples to
estimate the probability that the new individual will need 50 or more automatic
interventions over the 50 days of monitoring (assuming their sensor doesn’t fail).
Assessed Coursework 2 — 2021/2022
Your work should be submitted electronically via the module’s Moodle page by 15:00 Wednes-
day 4th May 2022. Since this work is assessed, your submission must be entirely your own
work (see the University’s policy on Academic Misconduct). Submissions up to five working days
late will be subject to a penalty of 5% of the maximum mark per working day.
Submission requirements
The submission should be uploaded electronically via the submission box on Moodle, and contain:
1. A pdf file containing any computational results (plots/relevant output) and discussion. This
can be produced using e.g. R Markdown, or by copying output into a Word document.
Please convert any documents to pdf for uploading.
2. A pdf of any theoretical working required. A scan of handwritten work is fine, but you
could also typeset using Latex if you prefer. If it’s more convenient, you can combine this
and the above part into one, e.g. if you wish to put everything in one Latex document, but
this is not required.
3. An R script file, i.e. with a .r extension containing your R code. This should be clearly
formatted, and include brief comments so that a reader can understand what it is doing.
The code should also be ready to run without any further modification by the user, and
should reproduce your results (approximately, for simulation-based results).
Please make sure that all required working, results, details of implementation and discussion are
contained in components 1 and 2 of the above list and not in the script file. The work will
be assessed based on the working, output and discussion in these components, and the script file
will only be used for verification of results. The exception is for the R code itself, whereby it is
sucient to say “refer to script file” where a question asks you to write R code.
A complete submission consists of all the files in your final submission. The submission time of
the work will be based on the time at which the submission is complete, i.e. all files are uploaded.
Please carefully check after uploading your work that the files you upload are the correct ones.
Updates to any part of the submission after the deadline will be considered a new submission and
late penalties will be applicable.
Questions
1. The whiteside data in the R library MASS contain the measurements of gas consumed
and external temperature before and after insulation of a house. We will consider the 26
measurements before insulation, and look at the relationship between gas and temperature.
library(MASS)
X=whiteside[1:26,c(2,3)]
(a) Compute the sample correlation between Gas and Temperature (you may use the R
function for this). Generate 10000 (non-parametric) bootstrap samples to compute
the standard error of this estimate. (You must code the bootstrap algorithm, do not
use boot or similar.) Compute the bias of the estimate using both the bootstrap and
the jackknife, and compare.
(b) Consider a linear regression model
Gasi = a+ bTempi + ?i, i = 1, · · · , 26.
Compute the least squares estimate of b. (You may use the lm command.) Use
model-based resampling to produce a 95% confidence interval for b, again coding the
bootstrap algorithm yourself.
(c) Compute the usual 95% confidence interval for b which results from assuming
?i
iid? N(0, 2), i = 1, · · · , 26. Again, you can use the inbuilt R functions for this.
Compare this interval with the bootstrap interval. Plot a density estimate of the
sampling distribution of b?, based on your bootstrap samples from part (b), and hence
comment on the agreement between the two intervals.
[15]
2. In a study to monitor glucose levels of patients with diabetes, 10 patients were fitted
with continuous glucose monitoring sensors to track glucose levels. If the sensors detect
abnormal glucose levels, this triggers an automatic intervention to change the rate of insulin
infusion. The following data show the number of days the patients were monitored for (the
follow-up time) and the number of instances where abnormal glucose levels were detected
so that an automatic infusion intervention was required.
Unfortunately, patients 9 and 10 had faulty sensors which failed at some point before the
follow-up time, so patients had to manage insulin levels manually. Therefore, all that is
known is that the true number of abnormal glucose level instances (counts) at the follow
up time is at least as large as the count in the table (indicated by ?), which is the number
of counts at the point when the sensor failed.
Patient 1 2 3 4 5 6 7 8 9 10
Follow-up time (xi) 94 16 63 126 5 1 1 2 10 31
Count (yi) 5 1 5 14 3 1 1 4 15 11
2
The following hierarchical model is proposed to model these data.
yi|xi,i Poisson(ixi), i = 1, . . . , 10
i|, Gamma(, ) i = 1, . . . , 10
|a, b Gamma(a, b)
using the parameterization that the density of a random variable Z which follows a Gamma
distribution with parameters a and b is
p(z) =
ba
(a)
za1 exp{bz},
where (·) is the Gamma function. Hence, i represents the rate of abnormal glucose level
instances per day for the ith individual. The yi are assumed independent given i and the i
are assumed independent given and . After consultation with the medical professionals,
the parameters = 2, a = 0.01 and b = 1 can be considered as fixed. The parameter
therefore controls the distribution of the rates of abnormal glucose level instances over the
larger population from which these patients were taken, and hence is the main object of
inference.
Let = (1, . . . ,10) and denote all the observed counts by yobs. Also let
y = (y1, y2, . . . , y9, y10) denote the vector of all the true counts, including the unknown y9
and y10, and let x = (x1, . . . , x10).
(a) By augmenting the observed data with the 2 missing true counts for patients 9 and
10, give full details of the Gibbs sampler to sample from the posterior distribution
?(, |yobs).
HINT: The overall target distribution is
?(y9, y10,, |x, yobs) / ?(y|, ,x)?(, ), y9 15, y10 11
where the proportionality means as a function of all the unknown variables y9, y10,, .
(b) Implement your Gibbs sampler in R. You may use any inbuilt R functions to sample
from the full conditional distributions, but you must write the main algorithm yourself.
Describe what steps you take to be confident that the MCMC has converged and is
giving reliable estimates.
(c) Use your samples to plot a density estimate of the marginal posterior density of ,
(|yobs).
(d) Produce a 95% posterior probability interval for from your samples using
(i) A normal approximation of ?(|yobs) with mean and standard deviation estimated
from your samples.
(ii) The 2.5% and 97.5% quantiles of your samples. You can use the quantile
function for this.
With reference to your plot in (c), comment on any di?erences between the two
intervals.
(e) A new individual is about to monitor their glucose levels for 50 days using a sensor.
Let y be the count of abnormal glucose levels in 50 days for the new individual.
HINT: You do not need to attempt to work out any integrals theoretically. This is
purely about simulation. using the given hierarchical model, and how does this help?
(f) Implement this to obtain samples of y? from ?(y?|yobs), and use your samples to
estimate the probability that the new individual will need 50 or more automatic
interventions over the 50 days of monitoring (assuming their sensor doesn’t fail).