G14CST辅导、讲解R语言、讲解COMPUTATIONAL STATISTICS、R设计辅导

- 首页 >> Algorithm 算法

G14CST COMPUTATIONAL STATISTICS

Assessed Coursework 1 — 2018/2019

The submission for this coursework consists of two components: a written submission and an

R code submission. The deadline for both components is 3pm, Wednesday 12 December

2018. Please hand in your written work at a Student Service Centre. Please submit electronic

files containing the R code that you write using the assignment submission link “Coursework 1

R Code Submission” on Moodle, which is located directly below the coursework question sheet

and data. A complete submission consists of both the written work and the code,

and both must be submitted by the deadline for a submission to be considered

as on time.

The written work should contain all relevant theoretical working, discussion, parameter settings

used, plots and results, as necessary to answer the questions asked. For example, it is not

sufficient to only provide the R code to produce a plot or result, or provide details of input

parameters in the script file. The exception is for the R code itself, whereby it is sufficient to

say “refer to script file” where a question asks you to write R code. So you do not have to

include R code in your written work, though you can if you wish. The code in your script files

should be neatly edited and include brief comments so that a reader can understand what it is

doing. The code should also be ready to run without any further modification by the user, and

should reproduce the results presented in your written work (approximately, for simulation-based

results). For the written work, you are free to structure the submission in whatever way is most

convenient, provided all necessary detail and results are included. For example, neat handwritten

solutions are acceptable for any theoretical working and discussion, and plots can be attached

separately (sufficiently referenced).

Unauthorised late submission will be penalised by 5% of the full mark per day. Work submitted

more than one week late will receive zero marks. You are reminded to familiarise yourself with the

guidelines concerning plagiarism in assessed coursework (see the student handbook), and note

that this applies equally to computer code as it does to written work.

1. A Bayesian analysis is to be conducted for the mean of a normal distribution. Specifically,

we have the following model:

Xi ~ N(θ, 1), i = 1, . . . , n,

where the Xi are independent given θ. A robust prior on θ is a Cauchy distribution, whereby

π(θ) = 1

π(1 + θ

2

)

, ∞ < θ < ∞.

Given observed data x = (x1, . . . , xn), the posterior distribution is

π(θ|x) = π(x|θ)π(θ)

π(x)

.

Five data points are observed, with values 0.02,-0.18,-1.37,-0.60,0.29.

(a) Describe a quadrature method to estimate the normalising constant π(x).

(b) Code this method in R and give your estimate of π(x).

(c) Use quadrature again to calculate E[θ|x] and V ar[θ|x], using your value of π(x)

from the previous question.

(d) Describe a rejection sampler to simulate samples of θ from π(θ|x), with π(θ) as the

proposal distribution.

(e) Code your rejection sampler in R, and use it to generate 10, 000 samples from the

posterior distribution. You may use the R function rcauchy to simulate from π(θ).

(f) Plot a histogram of your samples (scaled to have area 1), and overlay the true

(normalized) posterior density function on the same plot. Comment on the agreement.

(g) Use your samples to estimate E[θ|x] and V ar[θ|x] and compare with those obtained

from quadrature.

(h) Interest lies in whether θ is in the range (?0.5, 0.5). Use your samples to estimate

the probability θ lies in this interval.

[20]

2. The file CW2Data.txt on Moodle contains the two variables ldl and case for 100 individuals.

The variable ldl is a measurement of the low density lipoprotein cholesterol of the

individual, and case is a binary variable taking the values 1 (if the individual has suffered a

heart attack, i.e. a case) and 0 if the individual has not suffered a heart attack (a control).

You can read the data into a data frame in R using the command

read.table(“CW2Data.txt”, header = TRUE),

after first downloading the file to your working directory.

A two-component mixture of normal distributions is to be used to model cholesterol levels.

Specifically, the model is

f(x; μ0, μ1, p, σ2

0

, σ2

1

) = (1 p)f0(x; μ0, σ2

0

) + pf1(x; μ1, σ2

1

), p ∈ (0, 1),

and fj

is the density of a Normal distribution with mean μj and variance σ

2

j

for j = 0, 1.

Let xi denote the ldl of individual i. To facilitate the fitting of this model, each observation

xi

is augmented with a label zi to indicate which component it comes from, where zi = j

if observation i comes from component j. Hence, the vector of unknown parameters is

θ = (μ0, μ1, p, σ2

0

, σ2

1

) and the “missing” data is z = (z1, . . . , z100), the vector of unknown

labels.

(a) Plot a histogram of the sample data, and discuss why a mixture model may be useful.

(You should refer to the features of the plot, not use the knowledge about the source

of the data.)

(b) Describe an EM algorithm to fit the mixture model, using only the observed xi values

as observed data, and with the true labels zi as the missing data. Interpret the forms

of the estimators in the M step in terms of the usual maximum likelihood estimators

if there were no missing data.

(c) Code your algorithm in R.

(d) Use your algorithm to estimate the components of θ. You should be careful to ensure

you have found the true global maximum.

(e) Plot a histogram (scaled to area 1) of the sample data, and overlay the density of

your fitted mixture model.

2

(f) A product of your algorithm will be that you have obtained estimates of P(zi = 1)

for each individual. Use the following rule to classify individuals: classify observation i

to group j if P?(zi = j) ≥ 0.5. Compare your predicted classifications from the fitted

model with the true classes. How many are correctly classified by your algorithm?

(Please read the note below.)

[20]

NOTE: there is a certain non-identifiability present in the model, in that the group labels

have no meaning other than to distinguish the model components. All that really matters

is the allocation of individuals to components; the labelling of components is arbitrary.

Mathematically, the likelihood is the same if the group labels are swapped round (i.e.

0 → 1 and 1 → 0), and the component parameters are exchanged accordingly.

This issue does not affect the model fit, but it does affect the comparison with the true

labels in part (f). (I.e. it is possible that in your model fit, j = 0 corresponds to the

heart attack cases.) Since the group of heart attack cases have a higher sample mean, a

sensible rule for relabelling the groups (if necessary) is to label the group with the larger

estimated mean as group 1. Therefore, once you have fitted your model, if you find that

j = 0 corresponds to the group with the larger estimated mean, then you should swap

round your predicted labels in order to compare with the true cases.


站长地图