MATH 475 Final Exam - Wednesday December 19, 2pm CST Dr. Syring

We are familiar with linear regression models for a conditional mean. In these models, the conditional mean of the response Y ∈ ℝ given the covariate X = x ∈ ℝ is modeled as a linear combination

E(Y | X = x) = θ0 + θ1x.

In quantile regression, we model a conditional quantile instead of a conditional mean. There are several practical reasons to model quantiles. An extreme quantile may be very informative: for example, the 99th percentile of insurance losses helps to describe how extreme losses (such as those due to catastrophes like hurricanes) can imperil the financial health of insurance companies. Modeling the conditional median (the 50th percentile) may be more helpful than modeling the conditional mean if the data are prone to outliers, to which the median is robust.

We write the quantile regression model as

Qτ(Y | X = x) = θ0 + θ1x,

where Qτ(Y | X = x) denotes the conditional τ-quantile (the 100τth percentile) of Y given X = x, with τ ∈ (0, 1).

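Traditionally, quantile regression models are fit using numerical optimization. Point estimates for θ are taken to be

$$\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n} \left| (Y_i - \theta_0 - \theta_1 x_i)\,\bigl(\tau - 1\{Y_i < \theta_0 + \theta_1 x_i\}\bigr) \right|.$$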

Details can be found in the textbook Quantile Regression (Koenker, 2005). Note that this procedure does not require any probability model for the data, hence it is non-parametric. Special methods are needed to minimize the above loss function because it is not everywhere differentiable.
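To make the loss concrete, here is a minimal sketch in base R. It assumes the intended loss is the standard check loss, in which each residual is weighted by τ − 1{residual < 0}. The function name `check_loss`, the toy data, and the choice of Nelder-Mead are illustrative assumptions and not part of the exam.

```r
# Check (pinball) loss summed over the data for the line theta[1] + theta[2] * x.
check_loss <- function(theta, x, y, tau = 0.5) {
  r <- y - theta[1] - theta[2] * x
  sum(abs(r * (tau - (r < 0))))
}

# Toy data (illustrative values only).
set.seed(1)
x <- runif(200, 0, 5)
y <- 2 + x + rnorm(200)

# Nelder-Mead is derivative-free, so the kinks in the loss do not break it.
# Dedicated linear-programming solvers are the usual choice in practice.
fit <- optim(c(0, 0), check_loss, x = x, y = y, tau = 0.5,
             method = "Nelder-Mead")
fit$par
```

A general-purpose optimizer like this is only a rough stand-in: it can stall on non-smooth objectives, which is one reason specialized routines are preferred.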

We will explore how to make statistical inferences about the model parameters θ = (θ0, θ1).

1. (5 points) Set up a quantile regression model in the following way:

$$Y_i = \theta_0 + \theta_1 X_i + \varepsilon_i,$$

where εi ~ N(0, σ² = 4), θ = (2, 1), and Xi ~ χ²(2), the chi-squared distribution with 2 degrees of freedom. Then, consider modeling the conditional median

Q0.5(Y | X = x) = θ0 + θ1x.

In R, set the seed to 12345 and simulate 50 pairs (X, Y) from the above model. First simulate X, then Y.
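One way the simulation might look in R is sketched below. It assumes the covariate is chi-squared with 2 degrees of freedom and that the errors have standard deviation 2 (variance 4). The variable names are illustrative.

```r
set.seed(12345)

n <- 50
theta <- c(2, 1)                 # (theta0, theta1)

# Simulate X first, then Y, so the random number stream follows the stated order.
x <- rchisq(n, df = 2)
y <- theta[1] + theta[2] * x + rnorm(n, mean = 0, sd = 2)   # sd = sqrt(4)

dat <- data.frame(x = x, y = y)
head(dat)
```

Because the errors are symmetric about zero, the conditional median of Y coincides with the conditional mean θ0 + θ1x in this setup.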

2. (5 points) Maximum Likelihood

Suppose we know the true probability model of the data, i.e., we know that Y | X = x has a normal distribution with mean θ0 + θ1x and a common, unknown variance σ². Use this knowledge to fit the parameters by maximum likelihood and give the MLE-based confidence intervals for θ at α = 0.05.
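A possible sketch of the maximum likelihood fit follows. Under the normal model the MLE of θ is the least-squares estimate and the MLE of σ² divides the residual sum of squares by n. The intervals here are large-sample Wald intervals, which is one reading of "MLE-based"; `confint(lm(y ~ x))` would give slightly wider t-based intervals. The helper name `mle_ci` is hypothetical.

```r
set.seed(12345)
x <- rchisq(50, df = 2)
y <- 2 + x + rnorm(50, sd = 2)

mle_ci <- function(x, y, alpha = 0.05) {
  X <- cbind(1, x)                                  # design matrix with intercept
  n <- length(y)
  XtX_inv <- solve(crossprod(X))
  theta_hat <- drop(XtX_inv %*% crossprod(X, y))    # MLE of theta (least squares)
  sigma2_hat <- sum((y - X %*% theta_hat)^2) / n    # MLE of sigma^2 divides by n
  se <- sqrt(sigma2_hat * diag(XtX_inv))
  z <- qnorm(1 - alpha / 2)
  cbind(estimate = theta_hat,
        lower = theta_hat - z * se,
        upper = theta_hat + z * se)
}

ci_mle <- mle_ci(x, y)
rownames(ci_mle) <- c("theta0", "theta1")
ci_mle
```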

3. (20 points) Bootstrap

The classical point estimates of the quantile regression parameters are defined as the minimizers of a certain loss function. Since the quantile regression problem is defined non-parametrically, it is not realistic to assume accurate knowledge of the probability model of the data. But without the sampling distribution of these estimators, it is hard to determine their variance (uncertainty), and hence it is not straightforward to determine confidence intervals for the estimates.

One way to find confidence intervals for θ is to use the regression bootstrap. Considering the covariates as fixed, perform the bootstrap on the quantile regression model. Use the quantreg R library along with the function rq, which works just like lm, to fit the model for each bootstrap resampled data set. Produce 95% confidence intervals for θ and compare them to those previously calculated using maximum likelihood.
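The following is a minimal sketch of a fixed-covariate bootstrap, interpreted here as a residual bootstrap: the covariates stay fixed and residuals from the fitted median regression are resampled with replacement. The percentile interval, the number of resamples `B`, and the helper name `boot_ci` are assumptions, since the problem does not specify them.

```r
library(quantreg)   # provides rq()

set.seed(12345)
x <- rchisq(50, df = 2)
y <- 2 + x + rnorm(50, sd = 2)

boot_ci <- function(x, y, tau = 0.5, B = 1000, alpha = 0.05) {
  fit <- rq(y ~ x, tau = tau)
  y_hat <- fitted(fit)
  res <- resid(fit)
  boot <- matrix(NA_real_, nrow = B, ncol = 2)
  for (b in seq_len(B)) {
    # Keep x fixed; build a new response from resampled residuals.
    y_star <- y_hat + sample(res, replace = TRUE)
    boot[b, ] <- coef(rq(y_star ~ x, tau = tau))
  }
  # Percentile intervals from the bootstrap distribution of each coefficient.
  ci <- t(apply(boot, 2, quantile, probs = c(alpha / 2, 1 - alpha / 2)))
  dimnames(ci) <- list(c("theta0", "theta1"), c("lower", "upper"))
  ci
}

ci_boot <- boot_ci(x, y)
ci_boot
```

rq may warn that a solution is non-unique on some resamples. That is common with tied residuals and does not stop the loop.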

4. (20 points) Bayesian

The quantile regression model does not specify a probability distribution of the data Y | X = x, and so it is not immediately clear how to define a posterior distribution for θ, since that would require a likelihood. However, it would be nice to have a Bayesian method for modeling the quantile regression parameters because the Bayesian approach provides straightforward interval estimates (simply use the posterior quantiles).

A “solution” is to use an asymmetric Laplace likelihood for the data,

$$L(\theta; Y, x) = \exp\left( -\frac{1}{2} \sum_{i=1}^{n} \bigl[ (Y_i - \theta_0 - \theta_1 x_i)\,\mathrm{sign}(Y_i - \theta_0 - \theta_1 x_i) \bigr] \right).$$

It is not hard to see that the quantity in the exponential is negative one half multiplied by the sum of absolute residuals, which is (up to a constant factor) the loss function defining the quantile regression estimates at τ = 0.5. Precisely for this reason, an asymmetric Laplace likelihood is used to define a Bayesian posterior for the quantile regression model.

Use a “flat” prior π(θ) = 1 and MCMC to sample from the Bayesian posterior for θ using the above asymmetric Laplace likelihood. Pay careful attention to your proposal distributions to achieve an acceptance rate of about 1/3 for each parameter, and thin your samples as needed to mitigate autocorrelation. Take the 2.5th and 97.5th percentiles of the posterior samples to form your interval estimates of θ.
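A sketch of one possible sampler is given below: a random-walk Metropolis algorithm that updates one parameter at a time, so that an acceptance rate can be tracked per parameter. It assumes the log-likelihood is minus one half times the sum of absolute residuals, as the text describes. The proposal standard deviations, burn-in, and thinning interval are starting guesses to be tuned, and the function names are hypothetical.

```r
set.seed(12345)
x <- rchisq(50, df = 2)
y <- 2 + x + rnorm(50, sd = 2)

# Log posterior under a flat prior: assumed asymmetric Laplace form at tau = 0.5.
log_post <- function(theta, x, y) {
  -0.5 * sum(abs(y - theta[1] - theta[2] * x))
}

mh_sampler <- function(x, y, n_iter = 20000, init = c(0, 0),
                       prop_sd = c(1.0, 0.4)) {
  draws <- matrix(NA_real_, nrow = n_iter, ncol = 2)
  accept <- c(0, 0)
  theta <- init
  lp <- log_post(theta, x, y)
  for (t in seq_len(n_iter)) {
    for (j in 1:2) {
      prop <- theta
      prop[j] <- theta[j] + rnorm(1, mean = 0, sd = prop_sd[j])
      lp_prop <- log_post(prop, x, y)
      # Symmetric proposal, so the acceptance ratio is the posterior ratio.
      if (log(runif(1)) < lp_prop - lp) {
        theta <- prop
        lp <- lp_prop
        accept[j] <- accept[j] + 1
      }
    }
    draws[t, ] <- theta
  }
  list(draws = draws, accept_rate = accept / n_iter)
}

out <- mh_sampler(x, y)
out$accept_rate                                   # adjust prop_sd toward about 1/3

keep <- seq(2001, nrow(out$draws), by = 10)       # drop burn-in, then thin
ci_bayes <- t(apply(out$draws[keep, ], 2, quantile, probs = c(0.025, 0.975)))
dimnames(ci_bayes) <- list(c("theta0", "theta1"), c("lower", "upper"))
ci_bayes
```

If an acceptance rate is well above 1/3, increase the corresponding entry of `prop_sd`; if it is well below, decrease it. `acf(out$draws[keep, 1])` is one way to check whether the thinning is sufficient.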

5. (10 points) Monte Carlo Experiment (Warning: this may take 20 minutes to run...)

We have two competing methods for producing interval estimates for θ, the bootstrap and the Bayesian approach. We will use Monte Carlo to estimate the length and coverage probability of the interval estimates produced by these approaches. For reference, we will also compare the MLE-based interval estimates, which utilize the true distribution of the data.

First, fix the covariate values at X = x, the same values you used previously. Then, for i = 1, ..., M ≈ 100, sample a response data set Y1, ..., Y50 and produce the interval estimates of θ using each of the three methods. Keep track of the length of each interval estimate and whether or not it “covers” (contains) the true parameter value. Then, use the proportion of interval estimates covering the true value to compare which method was most accurate. The coverage proportion should be close to 95%. You may also compare the average lengths of the interval estimates. Did the various methods achieve a coverage probability of 95%?
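The bookkeeping for the experiment can be sketched as a small harness that accepts any interval method. Only the MLE-based Wald interval is plugged in here to keep the example short and fast. The names `coverage_study` and `mle_interval` are hypothetical, and a bootstrap or MCMC interval function with the same signature (returning a 2 x 2 matrix of lower and upper bounds) could be passed in the same way.

```r
set.seed(12345)
x <- rchisq(50, df = 2)          # covariates fixed across all replications
theta_true <- c(2, 1)

# Example interval method: Wald intervals from the normal-model MLE.
mle_interval <- function(x, y, alpha = 0.05) {
  X <- cbind(1, x)
  XtX_inv <- solve(crossprod(X))
  est <- drop(XtX_inv %*% crossprod(X, y))
  s2 <- sum((y - X %*% est)^2) / length(y)
  half <- qnorm(1 - alpha / 2) * sqrt(s2 * diag(XtX_inv))
  cbind(lower = est - half, upper = est + half)
}

coverage_study <- function(interval_fn, x, theta, M = 100, sd = 2) {
  covers <- matrix(NA, nrow = M, ncol = 2)
  lengths <- matrix(NA_real_, nrow = M, ncol = 2)
  for (m in seq_len(M)) {
    # New responses each replication; x is never resampled.
    y <- theta[1] + theta[2] * x + rnorm(length(x), sd = sd)
    ci <- interval_fn(x, y)
    covers[m, ] <- ci[, 1] <= theta & theta <= ci[, 2]
    lengths[m, ] <- ci[, 2] - ci[, 1]
  }
  res <- rbind(coverage = colMeans(covers), mean_length = colMeans(lengths))
  colnames(res) <- c("theta0", "theta1")
  res
}

res_mle <- coverage_study(mle_interval, x, theta_true)
res_mle
```

With M = 100 the Monte Carlo standard error of a coverage estimate near 0.95 is roughly 0.02, so differences of a few percentage points between methods should be read cautiously.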