MATH 475 Final Exam - Wednesday December 19, 2pm CST Dr. Syring
We are familiar with linear regression models for a conditional mean. In
these models, the conditional mean of response Y ∈ ℝ given covariate
X = x ∈ ℝ is modeled as a linear combination
E(Y |X = x) = θ0 + θ1x.
In quantile regression, we model a conditional quantile instead of a conditional
mean. There are several practical reasons to model quantiles.
An extreme quantile may be very informative, such as the 99th quantile
of insurance losses, which helps to describe how extreme losses (such
as those due to catastrophes like hurricanes) can imperil the financial
health of insurance companies. Modeling the conditional median (the
50th quantile) may be more helpful than modeling the conditional mean
if the data are prone to outliers (to which the median is robust).
We write the quantile regression model as
Qτ (Y |X = x) = θ0 + θ1x,
where Qτ(Y |X = x) denotes the 100τ-th conditional quantile of Y given
X = x, for τ ∈ (0, 1).
Traditionally, quantile regression models are fit using numerical optimization.
Point estimates for θ are taken to be
θ̂ = arg min_θ Σ_{i=1}^{n} |(Yi − θ0 − θ1xi)(τ − 1{Yi < θ0 + θ1xi})|.
Details can be found in the textbook Quantile Regression (Koenker,
2005). Note that this procedure does not require any probability model
for the data, hence it is non-parametric. Special methods are needed
to perform the minimization of the above loss function because it is not
everywhere differentiable.
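As a rough illustration of the loss above, the following sketch writes it as a base-R function and minimizes it with the derivative-free Nelder-Mead method in optim. The function name check_loss, the toy data, and the starting values are assumptions made for illustration only; dedicated solvers such as those in quantreg are the usual choice in practice.

```r
# Summed loss: sum_i |(y_i - t0 - t1*x_i) * (tau - 1{y_i < t0 + t1*x_i})|
check_loss <- function(theta, x, y, tau = 0.5) {
  r <- y - theta[1] - theta[2] * x      # residuals
  sum(abs(r * (tau - (r < 0))))         # (r < 0) plays the role of the indicator
}

# Toy data (hypothetical, for illustration only)
set.seed(1)
x <- runif(100, 0, 5)
y <- 1 + 2 * x + rnorm(100)

# Nelder-Mead uses no derivatives, so the kinks in the loss are not an obstacle
fit <- optim(par = c(0, 0), fn = check_loss, x = x, y = y, tau = 0.5,
             method = "Nelder-Mead")
theta_hat <- fit$par                    # approximate minimizer (theta0, theta1)
```

Changing tau in the call targets a different conditional quantile with the same code.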
We will explore how to make statistical inferences about the model parameters
θ = (θ0, θ1).
1. (5 points) Set up a quantile regression model in the following way:
Yi = θ0 + θ1xi + εi
where εi ~ N(0, σ² = 4), θ = (2, 1), and Xi ~ χ²(2). Then, consider
modeling the conditional median
Q0.5(Y |X = x) = θ0 + θ1x.
In R, set the seed to 12345 and simulate 50 pairs (X, Y ) from the above
model. First, simulate X, then Y .
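One possible way to set up the simulation is sketched below. It assumes the covariates follow a chi-square distribution with 2 degrees of freedom and that the error standard deviation is 2 (since σ² = 4); the variable names are illustrative.

```r
set.seed(12345)                      # seed given in the problem statement
n     <- 50
theta <- c(2, 1)                     # true (theta0, theta1)
sigma <- 2                           # error sd, because sigma^2 = 4

x <- rchisq(n, df = 2)               # covariates first (assumed chi-square, 2 df)
y <- theta[1] + theta[2] * x + rnorm(n, mean = 0, sd = sigma)   # then responses
```

Because the normal errors are symmetric about zero, the conditional median of Y coincides with its conditional mean, so θ = (2, 1) is also the true parameter of the median regression.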
2. (5 points) Maximum Likelihood
Suppose we know the true probability model of the data, i.e. we know
that Y |X = x has a normal distribution with mean θ0 + θ1x, and a
common, unknown variance σ². Use this knowledge to fit the parameters
by maximum likelihood and give the MLE-based confidence intervals for
θ at α = 0.05.
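A minimal sketch of the likelihood-based fit follows. Under normal errors the MLE of θ equals the least-squares estimate, so lm can be used for the point estimates. The Wald-type intervals below use the MLE of σ² (dividing by n) and normal quantiles; this is one reading of "MLE-based", and the t-based intervals from confint are shown alongside for comparison. The data are re-simulated under the same assumptions as in Problem 1.

```r
set.seed(12345)
x <- rchisq(50, df = 2)
y <- 2 + x + rnorm(50, sd = 2)

fit       <- lm(y ~ x)                 # least squares = MLE of theta under normal errors
theta_mle <- coef(fit)

n          <- length(y)
sigma2_mle <- sum(resid(fit)^2) / n    # MLE of sigma^2 divides by n, not n - 2
X          <- model.matrix(fit)
se         <- sqrt(sigma2_mle * diag(solve(crossprod(X))))   # asymptotic standard errors

z      <- qnorm(0.975)                 # alpha = 0.05, two-sided
ci_mle <- cbind(lower = theta_mle - z * se, upper = theta_mle + z * se)

ci_t <- confint(fit, level = 0.95)     # exact t-based intervals, slightly wider
```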
3. (20 points) Bootstrap
The classical point estimates of the quantile regression parameters are
defined as the minimizers of a certain loss function. Since the quantile
regression problem is defined non-parametrically, it is not realistic to
assume accurate knowledge of the probability model of the data. But,
without the sampling distribution of these estimators, it is hard to determine
their variance (uncertainty), hence, it is not straightforward to
determine confidence intervals for the estimates.
One way to find confidence intervals for θ is to use the regression bootstrap.
Considering the covariates as fixed, perform bootstrap on the
quantile regression model. Use the quantreg R library along with the
function rq which works just like lm in order to fit the model for each
bootstrap resampled data set. Produce 95% confidence intervals for θ
and compare them to those previously calculated using maximum likelihood.
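The fixed-covariate (residual) bootstrap described above might be sketched as follows. The number of replicates B, the use of percentile intervals, and the variable names are assumptions; the quantreg package is assumed to be installed, and the data are re-simulated as in Problem 1.

```r
library(quantreg)                      # provides rq(); assumed to be installed

set.seed(12345)
x <- rchisq(50, df = 2)
y <- 2 + x + rnorm(50, sd = 2)

fit0 <- rq(y ~ x, tau = 0.5)           # median regression on the observed data
b0   <- coef(fit0)
fhat <- b0[1] + b0[2] * x              # fitted conditional medians
res  <- y - fhat                       # residuals to be resampled

B <- 1000                              # number of bootstrap replicates (assumed)
boot_theta <- matrix(NA_real_, nrow = B, ncol = 2,
                     dimnames = list(NULL, c("theta0", "theta1")))
for (b in seq_len(B)) {
  y_star <- fhat + sample(res, replace = TRUE)   # x stays fixed; residuals are resampled
  boot_theta[b, ] <- coef(rq(y_star ~ x, tau = 0.5))
}

# Percentile intervals: rows are theta0 and theta1, columns are 2.5% and 97.5%
ci_boot <- t(apply(boot_theta, 2, quantile, probs = c(0.025, 0.975)))
```

Other interval constructions (for example, basic or studentized bootstrap intervals) could be substituted in the last step.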
4. (20 points) Bayesian
The quantile regression model does not specify a probability distribution
of the data Y |X = x, and so it is not immediately clear how to define
a posterior distribution for θ, since that would require a likelihood.
However, it would be nice to have a Bayesian method for modeling the
quantile regression parameters because the Bayesian approach provides
straightforward interval estimates (simply use the posterior quantiles).
A “solution” is to use an asymmetric Laplace likelihood for the data,
L(θ; Y, x) = exp[ −(1/2) Σ_{i=1}^{n} (Yi − θ0 − θ1xi) sign(Yi − θ0 − θ1xi) ].
It is not hard to see that the quantity in the exponential is negative
one half multiplied by the loss function defining the quantile regression
estimates. Precisely for this reason, an asymmetric Laplace likelihood is
used to define a Bayesian posterior for the quantile regression model.
Use a “flat” prior π(θ) = 1 and MCMC to sample from the Bayesian
posterior for θ using the above asymmetric Laplace likelihood. Pay careful
attention to your proposal distributions to achieve an acceptance rate
of about 1/3 for each parameter, and thin your samples as needed to
mitigate autocorrelation. Take the 2.5% and 97.5% quantiles of the posterior
samples to form your interval estimates of θ.
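One way to organize the sampler is a random-walk Metropolis update applied to one parameter at a time, sketched below. The log posterior is taken to be −(1/2) Σ |Yi − θ0 − θ1xi|, which is an assumed reading of the likelihood for τ = 0.5. The proposal standard deviations, chain length, burn-in, and thinning interval are placeholder values that would need tuning against the observed acceptance rates.

```r
set.seed(12345)
x <- rchisq(50, df = 2)
y <- 2 + x + rnorm(50, sd = 2)

# Log posterior under the flat prior (assumed form: -1/2 * sum of absolute residuals)
log_post <- function(theta) -0.5 * sum(abs(y - theta[1] - theta[2] * x))

n_iter  <- 20000
prop_sd <- c(1.0, 0.4)                 # proposal sds (placeholders; tune toward ~1/3)
draws   <- matrix(NA_real_, n_iter, 2,
                  dimnames = list(NULL, c("theta0", "theta1")))
accept  <- c(0, 0)
theta   <- c(0, 0)                     # arbitrary starting value
lp      <- log_post(theta)

for (t in seq_len(n_iter)) {
  for (j in 1:2) {                     # update one coordinate at a time
    prop    <- theta
    prop[j] <- theta[j] + rnorm(1, 0, prop_sd[j])
    lp_prop <- log_post(prop)
    if (log(runif(1)) < lp_prop - lp) {   # Metropolis rule for a symmetric proposal
      theta <- prop
      lp    <- lp_prop
      accept[j] <- accept[j] + 1
    }
  }
  draws[t, ] <- theta
}

acc_rate <- accept / n_iter            # inspect; adjust prop_sd if far from 1/3
keep     <- seq(2001, n_iter, by = 10) # drop burn-in, then keep every 10th draw
ci_bayes <- t(apply(draws[keep, ], 2, quantile, probs = c(0.025, 0.975)))
```

Plotting acf(draws[keep, 1]) and acf(draws[keep, 2]) is a simple way to judge whether the thinning interval is large enough.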
5. (10 points) Monte Carlo Experiment (Warning: this may take 20 minutes
to run...)
We have two competing methods for producing interval estimates for θ,
the bootstrap and the Bayesian approach. We will use Monte Carlo to
estimate the length and coverage probability of the interval estimates
produced by these approaches. For reference, we will also compare the
MLE-based interval estimates, which utilize the true distribution of the
data.
First, fix the covariate values at X = x, the same values you used previously.
Then, for i = 1, ..., M ≈ 100, sample a response data set of Y1, ..., Y50,
and produce the interval estimates of θ using each of the three methods.
Keep track of the length of each interval estimate and whether or not it
“covers” (contains the true parameter value). Then, use the proportion
of interval estimates covering the true value to compare which method
was most accurate. The coverage proportion should be close to 95%.
You may also compare the average lengths of the interval estimates. Did
the various methods achieve a coverage probability of 95%?
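The experiment can be organized around a single loop that accepts any interval-producing function, as in the sketch below. The helper names run_mc and ci_lm are hypothetical, and only the t-based interval from lm is plugged in here; bootstrap and Bayesian interval functions with the same input and output shape could be passed in the same way.

```r
set.seed(12345)
n     <- 50
theta <- c(2, 1)
x     <- rchisq(n, df = 2)             # covariates drawn once and then held fixed

# Any function(x, y) returning a 2 x 2 matrix (rows theta0, theta1;
# columns lower, upper) can be used as ci_fun
ci_lm <- function(x, y) confint(lm(y ~ x), level = 0.95)

run_mc <- function(ci_fun, M = 100) {
  covers  <- matrix(NA, M, 2)
  lengths <- matrix(NA_real_, M, 2)
  for (m in seq_len(M)) {
    y  <- theta[1] + theta[2] * x + rnorm(n, sd = 2)   # new responses, same x
    ci <- ci_fun(x, y)
    covers[m, ]  <- ci[, 1] <= theta & theta <= ci[, 2]
    lengths[m, ] <- ci[, 2] - ci[, 1]
  }
  rbind(coverage = colMeans(covers), mean_length = colMeans(lengths))
}

res_lm <- run_mc(ci_lm)                # columns correspond to theta0 and theta1
```

With M = 100 the Monte Carlo standard error of a coverage estimate near 0.95 is about 0.02, so observed proportions a few points away from 95% are not by themselves evidence of miscalibration.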