辅导MTHM017、R编程设计辅导

- 首页 >> OS编程
MTHM017 Advanced Topics in Statistics
Assignment
Please make sure that the submitted work is your own. This is NOT a group assignment,
therefore approaches and solutions shouldn’t be discussed with other students. Plagiarism
and collusion with other students are examples of academic misconduct and will be reported.
More information on academic honesty can be found here.
The assignment has two main parts. Part A involves (i) fitting a Poisson regression model to assess the
effect of using different priors, and (ii) fitting a fixed effects and a random effects model in order to compare
posterior distributions under different models. Part B involves using different methods for classification of
data into two groups.
A. Bayesian Inference [66 marks]
1. The first question of part A involves fitting a Poisson regression model using the Scotland dataset, which
contains the observed and expected counts of lip cancer cases for the local authority administrative areas in
Scotland between 1975 and 1986.
a. [3 marks] Calculate the Standard Mortality Ratios (SMR=observed/expected) for each administrative
area and plot the distribution of the SMRs. Then plot a map of the SMRs by administrative area. To
map the SMRs you may use the ScotlandMap function readily available on the ELE page.
The code below uses random numbers to demonstrate how the ScotlandMap function works. Note
that in order for this to run you need to add the shapefiles for Scotland (.shx, .shp, .prj, .dbf) to your
working directory. The ScotlandMap.R file and the shapefiles are available on the ELE page.
library(tidyverse)
library(RColorBrewer)
library(sf)
library(rgdal)
source("ScotlandMap.R") # need to read in the ScotlandMap function
testdat <- runif(56) # generate random numbers to use as observations
ScotlandMap(testdat,figtitle="Scotland random numbers")
We are interested in estimating the relative risk (RR) for each area. The relative risk is the parameter
that quantifies whether an area i has a higher (RRi > 1) or lower (RRi < 1) occurrence of cases than
what we would expect from the reference rates. In order to estimate the relative risk we are going to fit
a Poisson model of the following form:
Obsi ~ Pois(μi)
log(μi) = log(Expi) + β0 + θi
RRi = exp(β0)exp(θi)
Where θi ~ Normal(0, σ2). Here, the Exp(ected) numbers are an ‘offset’, which are treated as fixed
data (i.e. no coefficient is estimated for this term).
b. [3 marks] Describe the role (interpretation) of β0 and the set of θ’s in this model and how they contribute
to the estimation of RR. Focus on the meaning of these parameters in the estimation of the relative risk.
1
c. [14 marks] Code up this Poisson(-Lognormal) model in JAGS to analyse the Scotland data. For the
parameter β0 use a Normal prior with mean 0 and variance 1000. And for the precision parameter
τ(= 1/σ2) of θi use a prior τ ~ Gamma(1, 0.05) (which is parametrised so as to have a mean of 20).
Initialise 2 chains and run the model with these two chains. You will have to decide on the appropriate
values of n.iter and burnin. Produce trace plots for the chains and summaries of all the model
parameters. Investigate whether the chains for all the parameters have converged.
d. [4 marks] Extract the posterior means for the RR and map them. Compare the posterior mean relative
risk to the SMR for each area. Note that we can think about the SMR as the maximum likelihood
estimate of the relative risk.
e. [7 marks] Now calculate the posterior probabilities that the relative risk in each area exceeds 1. Extract
these probabilities and map them. For which areas are you at least 95% certain that there is an excess
risk of lip cancer (i.e. relative risk > 1)?
f. [6 marks] Repeat the analysis with different priors for β0 and τ . The exact choice is yours, but explain
why you have chosen them and what they mean. Map the two sets of RRs and explain any differences
you see in the map and the summaries of the posteriors for the parameters of the model.
2. In this question we will analyse the effect different models have on the posterior distribution. We will use
the surgical.csv file, which contains data on 12 hospitals, where the column n gives the total number of
heart surgery operations carried out on children under 1 year old by each centre between April 1991 and
March 1995, and the column r gives the number of these operations where the patient died within 30 days of
the operation.
A binomial model seems reasonable for these data, so we will focus on comparing different models for the
hospital specific mortality rates.
We will first fit a logistic random effects model of the following form:
ri ~ Binom(ni, θi),
logit(θi) ~ N(μ, σ2).
This model assumes that failure rates (θi) across hospitals are similar in some way. This similarity is
represented by the assumption that the mortality rates for the hospitals are drawn from the same (parent)
distribution.
a. [9 marks] Code this model in JAGS to analyse the surgical data. Use vague priors for the parameters
of the shared normal distribution. In particular you can use the following model definition:
jags.mod <- function(){
for(i in 1:12){
r[i] ~ dbin(theta[i],n[i])
logit(theta[i]) <- logit.theta[i]
logit.theta[i] ~ dnorm(mu, tau)
}
mu ~ dnorm(0,1.0E-3)
tau <- 1/sigma?2
sigma ~ dunif(0,100)
}
Run the model for 10,000 iterations, with 2 chains, discarding the first 5,000 as ‘burn-in’. Produce
traceplots for the chains and summaries for the fitted parameters. Comment on whether the chains for
all the parameters have converged.
b. [8 marks] We want to know whether the assumptions of the above random effects model are reasonable,
or whether there is any evidence that the mortality rates for one or more of the hospitals are not drawn
from the same parent distribution. We will assess this by comparing the observed and predicted number
of deaths in each hospital.
2
Edit your JAGS model so that it (i) finds the posterior prediction of the number of deaths (rpredi ) in
each hospital, and (ii) computes the following posterior probability for each hospital:
pi = P (rpredi > ri) +
1
2P (r
pred
i = ri).
The above probability is called the ‘mid p-value’. This allows us to summarise conflict between discrete
quantities.
Examining the mid p-values, are there any hospitals that appear to have an unusual mortality rate?
Confirm your conclusion by producing kernel density plots of the predicted number of deaths in each
hospital, and visually comparing these to the observed number of deaths in each hospital.
c. [8 marks] As an alternative model we will consider the independent (fixed) effects model, which has the
following hospital-specific mortality rates:
θi ~ Unif(0, 1),
for i = 1, 2, . . . , 12.
Fit this alternative model, and compare the posterior distributions of the hospital mortality rates under
the random effects and the fixed effects models. Interpret the differences. (Hint: the comparison is
easiest to do by producing box plots of the mortality rates θ under each model).
d. [4 mark] Compute the mid p-values for the fixed effects model and compare these to the mid p-values
of the random effects model. Does the fixed effects model better explain the unusual mortality rates?
B. Classification [34 marks]
The following figure shows the information in the dataset Classification.csv - it shows two different
groups, plotted against two explanatory variables. This is simulated data - the aim is to find a suitable
method for classifying the 1000 datapoints into two groups from a selection of possible approaches.
1. [5 marks] Summarise the two groups (separately) in terms of the variables X1 and X2, and for each
group plot the distribution of both variables. Describe your findings.
3
2. [4 marks] Considering the plot showing the observations, your density plots and the numerical summaries,
which of the following classification methods do you think are suitable for classifying this data? For each
method, you should explain the reasons behind your answer (i.e. why is the given method suitable/why
it might not be suitable)?
a. Linear discriminant analysis.
b. Quadratic discriminant analysis.
c. K-nearest neighbour classification.
d. Support vector machines.
e. Random forests.
3. [1 marks] Select 80% of the data to act as a training set, with the remaining 20% for testing/evaluation.
4. [22 marks] Choose three of the methods listed in Q2 that might be suitable to classify the data.
Perform classification using these methods. (If you use more than three of the listed methods, only the
first three will be considered for marking). In each case, briefly describe how the given classification
method classifies the data, present the results of an evaluation of the model fit (highlighting different
aspects of the model performance) and describe your findings. Where appropriate optimise the
(hyper)parameters of the method.
5. [2 marks] Compare the results from your chosen three approaches and select what you think is the best
method for classification in this case, explaining your reasoning.
Total for paper = 100 marks
The deadline for submission is Noon (12pm), 31st March. Note that late submissions will be
penalised.
You should submit a pdf that contains your answers (and relevant output/plots!) to the
questions via eBart. In Part A you should use the R programming language, but in Part B
you can choose to use R or Python (or both).

站长地图