MTH6102代写、辅导R设计编程
- 首页 >> DatabaseBayesian Statistical Methods
Queen Mary, Autumn 2022
1
Contents
1 Likelihood 4
1.1 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Standard error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Bayesian inference 10
2.1 Bayes* theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Bayes* theorem and Bayesian inference . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Conjugate prior distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Point estimates and credible intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3 Specifying prior distributions 21
3.1 Informative prior distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Less informative prior distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4 Markov chain Monte Carlo methods 23
4.1 The Metropolis algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5 Predictive distributions 28
5.1 Simulating the posterior predictive distribution . . . . . . . . . . . . . . . . . . . . . 30
6 Hierarchical models 31
6.1 Inference for hierarchical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.2 Advantages of hierarchical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
7 Tests and model comparisions 36
2
Introduction
So far at Queen Mary, the statistics modules have taught the classical or frequentist approach which is
based on the idea that probability represents a long run limiting frequency. In the Bayesian approach,
any uncertain quantity is described by a probability distribution, and so probability represents a
degree of belief in an event which is conditional on the knowledge of the person concerned.
This course will introduce you to Bayesian statistics. These notes are self-contained but you may
want to read other accounts of Bayesian statistics as well. A useful introductory textbook is:
Bayesian Statistics: An Introduction (4th ed.) by P M Lee. (This is available as an e-book
from the library)
Parts of the following are also useful:
Bayesian Inference for Statistical Analysis by G E P Box and G C Tiao.
Bayesian Data Analysis by A Gelman, J B Carlin, H S Stern and D B Rubin.
Probability and Statistics from a Bayesian viewpoint (Vol 2) by D V Lindley.
3
Chapter 1
Likelihood
First we review the concept of likelihood, which is essential for Bayesian theory, but can also be used
in frequentist methods. Let y be the data that we observe, which is usually a vector. We assume
that y was generated by some probability model which we can specify. Suppose that this probability
model depends on one or more parameters 牟, which we want to estimate.
Definition 1.1. If the components of y are continuous, then the likelihood is defined as the joint
probability density function of y; if y is discrete, then the likelihood is defined as the joint probability
mass function of y. In either case, we denote the likelihood as
p(y | 牟).
Example 1.1. Let y = y1, . . . , yn be a random sample from a normal distribution with unknown
parameters 米 and 考. Then 牟 is the vector (米, 考), and the likelihood is the joint probability density
function
p(y | 米, 考) =
n﹊
i=1
?(yi | 米, 考).
Note that
﹊
here is the symbol for the product:
n﹊
i=1
ai = a1 ℅ a2 ℅ ﹞ ﹞ ﹞ ℅ an.
is the normal probability density function with parameters 米 and 考
(yi | 米, 考) = e(yi?米)2/2考2﹟2羽 考2.
4
Example 1.2. Suppose we observe k successes in n independent trials, where each trial has proba-
bility of success q. Now 牟 = q, the observed data is k, and the likelihood is the binomial probability
mass function
p(k | q) =(nk)
qk(1 q)nk.
It is also possible to construct likelihoods which combine probabilities and probability density func-
tions, for example if the observed data contains both discrete and continuous components. Alter-
natively, probabilities may appear in the likelihood if continuous data is only observed to lie within
some interval.
Example 1.3. Assume that the time until failure for a certain type of light bulb is exponentially
distributed with parameter 竹, and we observe n bulbs, with failure times t = t1, . . . , tn.
The likelihood contribution for a single observation ti is the exponential probability density function
竹e竹ti .
So the overall likelihood is the joint probability density function
p(t | 竹) =
n﹊
i=1
竹e?竹ti .
Suppose instead that we observe the failure time for the first m light bulbs with m < n, but for the
remaining n ?m bulbs we only observe that they have not failed by time ti. Then for i ≒ m, the
likelihood contributions are as before.
For i > m, the likelihood is the probability of what we have observed. Denoting the random variable
for the failure time by Ti, we have observed that Ti > ti, so the likelihood contribution is
p(Ti > ti) = e竹ti .
This is because the cumulative distribution function is p(Ti ≒ ti) = 1? e?竹ti .
Hence the overall likelihood is now
p(t | 竹) =
m﹊
i=1
竹e竹ti
n﹊
i=m+1
e竹ti .
1.1 Maximum likelihood estimation
In example 1.1, the parameters 米 and 考 of the normal distribution which generated the data are
known as population parameters. An estimator is defined as a function of the observed data which
5
we use as an estimate of a population parameter. For example the sample mean and variance may
be used as estimators of the population mean and variance, respectively.
To use the likelihood to estimate parameters 牟, we find the vector 牟? which maximizes the likelihood
p(y | 牟), given the data x that we have observed. This is the method of maximum likelihood, and
the estimator 牟? is known as the maximum likelihood estimator, or MLE.
Usually the parameters 牟 are continuous, and those are the only examples we cover, but the idea of
likelihood also makes sense if the unknown quantity is discrete.
When finding the MLE it is usually more convenient to work with the log of the likelihood: as the
log is a monotonically increasing function, the same 牟 will maximize p(y | 牟) and log(p(y | 牟)). The
log-likelihood is denoted by
?(牟; y) = log(p(y | 牟)).
Since the likelihood is typically a product of terms for independent observations, the log-likelihood
is a sum of terms, so using the log greatly simplifies finding the derivatives in order to find the
maximum.
Returning to the binomial example 1.2, the log-likelihood is
We also do not cover confidence intervals, but we do cover the Bayesian version, which are called
credible intervals.
9
Chapter 2
Bayesian inference
2.1 Bayes* theorem
Bayes* theorem is a formula from probability theory that is central to Bayesian inference. It is
named after Rev. Thomas Bayes, a nonconformist minister who lived in England in the first half of
the eighteenth century. The theorem states that:
Theorem 2.1. Let ? be a sample space and A1, A2, . . . , Am be mutually exclusive and exhaustive
events in ? (i.e. Ai ﹎ Aj = ?, i ?= j ,﹍k i=1Ai = ?; the Ai form a partition of ?.) Let B be any event
with p(B) > 0. Then
p(Ai | B) = p(Ai)p(B | Ai)
p(B)
=
p(Ai)p(B | Ai)﹉m
j=1 p(Aj)p(B | Aj)
The proof follows from the definition of conditional probabilities and the law of total probabilities.
Example 2.1. Suppose a test for an infection has 90% sensitivity and 95% specificity, and the
proportion of the population with the infection is q = 1/2000. Sensitivity is the probability of
detecting a genuine infection, and specificity is the probability of being correct about a non-infection.
So p(+ve test | infected) = 0.9 and p(-ve test | not infected) = 0.95. What is the probability that
someone who tests positive is infected?
Let the events be as follows:
B: test positive
A1: is infected
A2: is not infected
We want to find p(A1 | B). The probabilities we have are p(A1) = 1/2000, p(B | A1) = 0.9 and
p(B | A2) = 1? (BC | A2) = 1? 0.95 = 0.05.
10
Applying Bayes* theorem,
p(A1 | B) = p(A1)p(B | A1)﹉2
j=1 p(Aj)p(B | Aj)
p(A1)p(B | A1) = 1/2000℅ 0.9 = 0.00045
p(A2)p(B | A2) = (1? 1/2000)℅ 0.05 = .05
Hence
p(A1 | B) = 0.00045
0.00045 + .05
= 0.0089
So there is a less than 1% chance that the person is infected if they test positive.
Bayes* Theorem is also applicable to probability densities.
Theorem 2.2. Let X , Y be two continuous r.v.*s (possibly multivariate) and let f(x, y) be the joint
probability density function (pdf), f(x | y) the conditional pdf etc. then
f(x | y) = f(y | x) f(x)
f(y)
=
f(y | x) f(x)÷
f(y | x∩) f(x∩) dx∩
Alternatively, Y may be discrete, in which case f(y | x) and f(y) are probability mass functions.
2.2 Bayes* theorem and Bayesian inference
In the Bayesian framework, all uncertainty is specified by probability distributions. This includes
uncertainty about the unknown parameters. So we need to start with a probability distribution for
the parameters p(牟).
We then update the probability distribution for 牟 using the observed data y. This updating is done
using Bayes* theorem
p(牟 | y) = p(牟) p(y | 牟)
p(y)
.
p(y | 牟) is the likelihood for parameters 牟 given the observed data y.
We don*t normally need to find the normalizing constant p(y), which is given by
p(y) =
÷
p(牟) p(y | 牟) d牟 or
﹉
牟
p(牟) p(y | 牟)
11
So the procedure is as follows:
Start with a distribution p(牟) - this is known as the prior distribution.
Combine the prior distribution with the likelihood p(y | 牟) using Bayes* theorem.
The resulting probability distribution p(牟 | y) is known as the posterior distribution.
We base our inferences about 牟 on this posterior distribution.
The use of Bayes* theorem can be summarized as
Posterior distribution ≦ prior distribution ℅ likelihood
Example 2.2. Suppose a biased coin has probability of heads q, and we observe k heads in n
independent coin tosses. We saw the binomial likelihood for this problem:
p(k | q) =
(
n
k
)
qk(1? q)n?k
For Bayesian inference, we need to specify a prior distribution for q. As q is a continuous quantity
between 0 and 1, the family of beta distributions is a reasonable choice.
If X ‵ Beta(汐 , 汕), its probability density function is
f(x) =
x汐?1(1? x)汕?1
B(汐 , 汕)
, 0 ≒ x ≒ 1
where B is the beta function and 汐 , 汕 > 0 are parameters.
B(汐 , 汕) =
÷ 1
0
x汐?1(1? x)汕?1 dx, also B(汐 , 汕) = 忙(汐)忙(汕)
忙(汐 + 汕)
.
The uniform distribution on [0, 1] is a special case of the beta distribution with 汐 = 1, 汕 = 1.
Combining prior distribution and likelihood, we have the posterior distribution p(q | k) given by
p(q | k) ≦ p(q) p(k | q) = q
汐?1(1? q)汕?1
B(汐 , 汕)
(
n
k
)
qk(1? q)n?k ≦ qk+汐?1(1? q)n?k+汕?1
We can recognize this posterior distribution as being proportional to the pdf of a Beta(k+汐 , n?k+汕)
distribution. Hence we do not need to explicitly work out the normalizing constant for p(q | k), we
can immediately see that q | k has a Beta(k + 汐 , n? k + 汕) distribution.
12
For a Bayesian point estimate for q, we summarize the posterior distribution, for example by its
mean. A Beta(汐 , 汕) rv has expected value
汐
汐 + 汕
. Hence our posterior mean for q is
E(q | k) = k + 汐
n+ 汐 + 汕
.
Recall that the maximum likelihood estimator is
q? =
k
n
.
So for large values of k and n, the Bayesian estimate and MLE will be similar, whereas they differ
more for smaller sample sizes.
In the special case that the prior distribution is uniform on [0, 1], the posterior distribution is Beta(k+
1, n? k + 1). and the posterior mean value for q is
E(q | k) = k + 1
n+ 2
.
2.3 Conjugate prior distributions
In the binomial example 2.2 with a beta prior distribution, we saw that the posterior distribution is
also a beta distribution. When we have the same family of distributions for the prior and posterior
for one or more parameters, this is known as a conjugate family of distributions. We say that the
family of beta distributions is conjugate to the binomial likelihood.
The binomial likelihood is
p(牟1, 牟2, 牟3 | y) d牟2 d牟3
In practical Bayesian inference, Markov Chain Monte Carlo methods, covered later in these notes,
are the most common means of approximating the posterior distribution. These methods produce an
approximate sample from the joint posterior density, and once this is done the marginal distribution
of each parameter is immediately available.
Example 2.6. Generating samples from the posterior distribution may also be helpful even if we
can calculate the exact posterior distribution. Suppose that the data are the outcome of a clinical
trial of two treatments for a serious illness, the number of deaths after each treatment. Let the data
be ki deaths out of ni patients, i = 1, 2 for the two treatments, and the two unknown parameters
are q1 and q2, the probability of death with each treatment. Assuming a binomial model for each
outcome, and independent Beta(汐i, 汕i) priors for each parameter, the posterior distribution is
qi | ki ‵ Beta(ki + 汐i, ni ? ki + 汕i), i = 1, 2.
We have independent prior distributions and likelihood, so the posterior distributions are also inde-
pendent.
p(q1, q2 | k1, k2) = p(q1 | k1) p(q2 | k2) ≦ p(k1 | q1)p(q1) p(k2 | q2)p(q2)
However, it is useful to think in terms of the joint posterior density for q1 and q2, as then we can
make probability statements involving both parameters. In this case, one quantity of interest is the
probability P (q2 < q1), i.e. does the second treatment have a lower death rate than the first. To
find this probability, we need to integrate the joint posterior density over the relevant region, which
is not possible to do exactly when it is a product of beta pdfs.
We can approximate the probability by generating a sample of (q1, q2) pairs from the joint density.
To generate the sample, we just need to generate each parameter from its beta posterior distribution,
which can be done in R using the rbeta command. Then once we have the sample, we just count
what proportion of pairs has q2 < q1 to estimate P (q2 < q1).
20
Chapter 3
Specifying prior distributions
The posterior distribution depends on both the observed data via the likelihood, and also on the
prior distribution. So far, we have taken the prior distribution as given, but now we look at how to
specify a prior.
3.1 Informative prior distributions
An informative prior distribution is one in which the probability mass is concentrated in some subset
of the possible range for the parameter(s), and is usually based on some specific information. There
may be other data that is relevant, and we might want to use this information without including all
previous data in our current model. In that case, we can use summaries of the data to find a prior
distribution.
Example 3.1. In example 2.3, the data was the lifetimes of light bulbs, t = (t1, . . . , tn), assumed to
be exponentially distributed with parameter 竹 (the failure rate, reciprocal of the mean lifetime). The
gamma distribution provides a conjugate prior distribution for 竹. Suppose that we had information
from several other similar types of light bulbs, which had observed failure rates r1, . . . , rk. Let the
sample mean and variance of these rates be m and v.
A Gamma(汐 , 汕) distribution has mean