G14CST辅导、讲解R语言、讲解COMPUTATIONAL STATISTICS、R设计辅导
- 首页 >> Algorithm 算法G14CST COMPUTATIONAL STATISTICS
Assessed Coursework 1 — 2018/2019
The submission for this coursework consists of two components: a written submission and an
R code submission. The deadline for both components is 3pm, Wednesday 12 December
2018. Please hand in your written work at a Student Service Centre. Please submit electronic
files containing the R code that you write using the assignment submission link “Coursework 1
R Code Submission” on Moodle, which is located directly below the coursework question sheet
and data. A complete submission consists of both the written work and the code,
and both must be submitted by the deadline for a submission to be considered
as on time.
The written work should contain all relevant theoretical working, discussion, parameter settings
used, plots and results, as necessary to answer the questions asked. For example, it is not
sufficient to only provide the R code to produce a plot or result, or provide details of input
parameters in the script file. The exception is for the R code itself, whereby it is sufficient to
say “refer to script file” where a question asks you to write R code. So you do not have to
include R code in your written work, though you can if you wish. The code in your script files
should be neatly edited and include brief comments so that a reader can understand what it is
doing. The code should also be ready to run without any further modification by the user, and
should reproduce the results presented in your written work (approximately, for simulation-based
results). For the written work, you are free to structure the submission in whatever way is most
convenient, provided all necessary detail and results are included. For example, neat handwritten
solutions are acceptable for any theoretical working and discussion, and plots can be attached
separately (sufficiently referenced).
Unauthorised late submission will be penalised by 5% of the full mark per day. Work submitted
more than one week late will receive zero marks. You are reminded to familiarise yourself with the
guidelines concerning plagiarism in assessed coursework (see the student handbook), and note
that this applies equally to computer code as it does to written work.
1. A Bayesian analysis is to be conducted for the mean of a normal distribution. Specifically,
we have the following model:
Xi ~ N(θ, 1), i = 1, . . . , n,
where the Xi are independent given θ. A robust prior on θ is a Cauchy distribution, whereby
π(θ) = 1
π(1 + θ
2
)
, ∞ < θ < ∞.
Given observed data x = (x1, . . . , xn), the posterior distribution is
π(θ|x) = π(x|θ)π(θ)
π(x)
.
Five data points are observed, with values 0.02,-0.18,-1.37,-0.60,0.29.
(a) Describe a quadrature method to estimate the normalising constant π(x).
(b) Code this method in R and give your estimate of π(x).
(c) Use quadrature again to calculate E[θ|x] and V ar[θ|x], using your value of π(x)
from the previous question.
(d) Describe a rejection sampler to simulate samples of θ from π(θ|x), with π(θ) as the
proposal distribution.
(e) Code your rejection sampler in R, and use it to generate 10, 000 samples from the
posterior distribution. You may use the R function rcauchy to simulate from π(θ).
(f) Plot a histogram of your samples (scaled to have area 1), and overlay the true
(normalized) posterior density function on the same plot. Comment on the agreement.
(g) Use your samples to estimate E[θ|x] and V ar[θ|x] and compare with those obtained
from quadrature.
(h) Interest lies in whether θ is in the range (?0.5, 0.5). Use your samples to estimate
the probability θ lies in this interval.
[20]
2. The file CW2Data.txt on Moodle contains the two variables ldl and case for 100 individuals.
The variable ldl is a measurement of the low density lipoprotein cholesterol of the
individual, and case is a binary variable taking the values 1 (if the individual has suffered a
heart attack, i.e. a case) and 0 if the individual has not suffered a heart attack (a control).
You can read the data into a data frame in R using the command
read.table(“CW2Data.txt”, header = TRUE),
after first downloading the file to your working directory.
A two-component mixture of normal distributions is to be used to model cholesterol levels.
Specifically, the model is
f(x; μ0, μ1, p, σ2
0
, σ2
1
) = (1 p)f0(x; μ0, σ2
0
) + pf1(x; μ1, σ2
1
), p ∈ (0, 1),
and fj
is the density of a Normal distribution with mean μj and variance σ
2
j
for j = 0, 1.
Let xi denote the ldl of individual i. To facilitate the fitting of this model, each observation
xi
is augmented with a label zi to indicate which component it comes from, where zi = j
if observation i comes from component j. Hence, the vector of unknown parameters is
θ = (μ0, μ1, p, σ2
0
, σ2
1
) and the “missing” data is z = (z1, . . . , z100), the vector of unknown
labels.
(a) Plot a histogram of the sample data, and discuss why a mixture model may be useful.
(You should refer to the features of the plot, not use the knowledge about the source
of the data.)
(b) Describe an EM algorithm to fit the mixture model, using only the observed xi values
as observed data, and with the true labels zi as the missing data. Interpret the forms
of the estimators in the M step in terms of the usual maximum likelihood estimators
if there were no missing data.
(c) Code your algorithm in R.
(d) Use your algorithm to estimate the components of θ. You should be careful to ensure
you have found the true global maximum.
(e) Plot a histogram (scaled to area 1) of the sample data, and overlay the density of
your fitted mixture model.
2
(f) A product of your algorithm will be that you have obtained estimates of P(zi = 1)
for each individual. Use the following rule to classify individuals: classify observation i
to group j if P?(zi = j) ≥ 0.5. Compare your predicted classifications from the fitted
model with the true classes. How many are correctly classified by your algorithm?
(Please read the note below.)
[20]
NOTE: there is a certain non-identifiability present in the model, in that the group labels
have no meaning other than to distinguish the model components. All that really matters
is the allocation of individuals to components; the labelling of components is arbitrary.
Mathematically, the likelihood is the same if the group labels are swapped round (i.e.
0 → 1 and 1 → 0), and the component parameters are exchanged accordingly.
This issue does not affect the model fit, but it does affect the comparison with the true
labels in part (f). (I.e. it is possible that in your model fit, j = 0 corresponds to the
heart attack cases.) Since the group of heart attack cases have a higher sample mean, a
sensible rule for relabelling the groups (if necessary) is to label the group with the larger
estimated mean as group 1. Therefore, once you have fitted your model, if you find that
j = 0 corresponds to the group with the larger estimated mean, then you should swap
round your predicted labels in order to compare with the true cases.