STAT 464/864 - Assignment #1

Due: Thursday, January 24th, 2019

Please read the following instructions carefully, as they must be followed to meet the requirements of the assignment.

Format

Standard formatting requirements must be met.

1. Be aware of the instructions and consequences listed in “Policies” and “Academic

Integrity” in the course outline.

2. All code must be included in an appendix at the end of the document. The

solutions cannot include any of the code. Furthermore, all results should be

clearly stated at the end of a calculation. This means that the final solution is

written in a sentence. If several results are obtained, they should be presented

in a table or a graph, depending on which is more useful in the context of the

problem.

3. Discussions must include full sentences in paragraph structure, and explain in

full what has been discovered from the result of the calculations or analyses.

Whenever a result is obtained, take a moment to consider what the point of

the question is regarding the course content. Given what has been taught in

the course, why was the question chosen for the assignment? This should be a

necessary philosophy when answering the discussion questions.

4. Ensure that all graphs contain axis labels which clearly define what is being plotted. If possible, write “Temperature in degrees Celsius”, not “T (°C)”. Legends should not interfere with the lines of the plot, so set the intervals of the plotting axis accordingly. Graphs should not require colour to distinguish between different curves. The distinction between curves should be made using different line styles or point types.

References

When part of a solution is based on an external reference to the lecture notes

or the typed notes, the reference must be cited in an appendix called “References”.

The important aspect of assignments is that what is written must be in the words of

the student, and that the solution must be unique; it cannot be a straight copy of

the solution of the reference, but, rather, the solution which the student understands

from that research.

Question 1

For this problem, the R software is required. To install R Studio, follow the directions

of the file, “Instructions for R Setup”, in the “Course Readings and Resources” page

of “Content” in OnQ. For the first question, read “US_Population_Example_1.R”

and make the requested changes in the code so that you can run the code on your

own computer.

Part a)

Look at the output file and plots of “US_Population_Example_1.R”. The following

files should appear in the directory in which you saved the R code.

1. US_Population_Time_Series.pdf

2. US_Population_Fitting_Comparison.pdf

3. QQ_Plot.pdf

List answers to the following two questions.

1. In words, state what the measurement process is.

2. Specify the time-series models used in the code.

Part b)

Are the time-series models strictly stationary? Are they weakly stationary?

Use the mathematical definitions in your justification.
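For reference, the standard textbook definitions of the two notions are sketched below. The symbols $\mu$, $\gamma$, $h$ and $k$ are notation assumed here for illustration; the course notes may state the definitions in a slightly different form, and that form is the one to cite.

```latex
% Strict stationarity: every finite-dimensional distribution is shift-invariant.
\text{For all } k \ge 1,\ \text{all indices } n_1, \dots, n_k \text{ and all shifts } h:
\quad
(X_{n_1}, \dots, X_{n_k}) \overset{d}{=} (X_{n_1 + h}, \dots, X_{n_k + h}).

% Weak stationarity: finite second moments, constant mean,
% and a covariance that depends only on the lag h.
E\left[X_n^2\right] < \infty, \qquad
E\left[X_n\right] = \mu \ \text{ for all } n, \qquad
\operatorname{Cov}(X_n, X_{n+h}) = \gamma(h) \ \text{ for all } n, h.
```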

Part c)

Generate polynomial fits to the data with orders increasing from 1 to 6. Create

a figure and include in it the data and the different trends. Use the example

of the R code to generate the curves of the different fits, and include the figure

in the assignment paper. If one or more of the colours is difficult to see, make an

adjustment so that that colour is easily legible. Based on a visual inspection, is there

an order beyond which including higher-order terms in the polynomial model might

appear to be redundant?
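A minimal sketch of one way to generate and overlay such fits is given here. It does not reproduce “US_Population_Example_1.R”: the data are synthetic, and the names `year`, `t_scaled`, `pop` and `fits` are hypothetical. It assumes that `lm()` with `poly()` is an acceptable way to build the polynomial models, and it separates the curves by line type so that the figure does not rely on colour.

```r
# Synthetic stand-in for the population series (NOT the course data).
set.seed(1)
year <- seq(1790, 1990, by = 10)
t_scaled <- (year - min(year)) / 10            # rescaled time index 0, 1, ..., 20
pop <- 4 + 0.3 * t_scaled + 0.6 * t_scaled^2 + rnorm(length(year), sd = 2)

# Fit polynomials of order 1 to 6; poly() uses an orthogonal basis by default.
orders <- 1:6
fits <- lapply(orders, function(p) lm(pop ~ poly(t_scaled, p)))

# Plot the data and overlay each trend with its own line type.
pdf(file.path(tempdir(), "Polynomial_Fits_Sketch.pdf"))
plot(year, pop, pch = 16,
     xlab = "Year", ylab = "Population (synthetic units)",
     ylim = range(pop) * c(1, 1.25))           # extra head-room above the curves
for (p in orders) lines(year, fitted(fits[[p]]), lty = p)
legend("topleft", legend = paste("Order", orders), lty = orders)
dev.off()
```

The orthogonal basis changes the individual coefficients but not the fitted curves, so the figure is the same as with raw powers of the time index.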

Part d)

Let $\beta_j \in \mathbb{R}$ denote the $j$'th coefficient in the fitting model, where $j \in \{0, \cdots, p\}$ and $\beta_0$ is the coefficient corresponding to the intercept term. Let $\hat{\beta}_j : \Omega \to \mathbb{R}$ denote the corresponding linear-regression estimator. In “US_Population_Output.txt”, read the “Coefficients” table under the section, “Summary of the lm.object2 R fitting object”. In that table, refer to the column, “Pr(>|t|)”. The $j$'th entry of this column lists the probability that a realization of the absolute value, $|T^{(t)}_j| \propto |\hat{\beta}_j|$, of the standardized coefficient estimator exceeds the value, $|t|$, corresponding to $\hat{\beta}_j = \beta^{(0)}$. This probability is obtained from the distribution of the test statistic, $T^{(t)}_j$, under the null hypothesis,

$$H_0 : \beta_j = 0. \tag{1}$$

Construct a table with all the p-values from the model fits, and include this in the assignment paper. Rows should be ordered by polynomial order. Write all the entries of the table in scientific notation ($m \times 10^{l}$ or m e l) and then round the coefficients (the $m$) of those numbers to one decimal place. Based on the set of p-values of a row, list the models for which all the polynomial terms must be included at the 95% level of the hypothesis test.
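The table construction can be sketched as follows, under stated assumptions: the data are synthetic, only orders 1 to 3 are shown to keep the output short, and raw polynomial terms (`raw = TRUE`) are assumed so that each column corresponds to one power of the regressor. The names `x`, `y`, `max_order` and `p_table` are hypothetical. The call `formatC(..., format = "e", digits = 1)` writes each p-value in the "m e l" form with one decimal place.

```r
set.seed(2)
x <- 0:20
y <- 1 + 0.5 * x + 0.2 * x^2 + rnorm(length(x))   # synthetic data, not the course data

max_order <- 3
p_table <- matrix(NA_character_, nrow = max_order, ncol = max_order + 1,
                  dimnames = list(paste("Order", 1:max_order),
                                  paste0("beta", 0:max_order)))

for (p in 1:max_order) {
  fit <- lm(y ~ poly(x, p, raw = TRUE))
  # The "Pr(>|t|)" column of the coefficient table holds the p-values.
  p_vals <- summary(fit)$coefficients[, "Pr(>|t|)"]
  p_table[p, 1:(p + 1)] <- formatC(p_vals, format = "e", digits = 1)
}

# Rows are ordered by polynomial order; unused cells are left blank.
print(p_table, quote = FALSE, na.print = "")
```

At the 95% level, an entry below 0.05 rejects the null hypothesis in equation (1) for that coefficient.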

Part e)

In order to accurately model a specific dataset, an analysis of variance (ANOVA) can be used to determine if a linear model should include more terms than the current number. It tests whether a linear model with $p$ terms has a residual variance which is distinguishable from that of a model with $J - p$ additional terms. In “US_Population_Output.txt”, read the “Analysis of Variance Table” under the section, “ANOVA of the lm.object2 R fitting object”. In that table, refer to the column, “Pr(>F)”. The $j$'th entry of this column lists the probability that a realization of the variance-ratio estimator, $T^{(F)}_j$, exceeds the value, $F$, corresponding to the variance-ratio estimate. This probability is obtained from the distribution of the test statistic, $T^{(F)}_j$, under the null hypothesis,

$$H_0 : \{\beta_j\}_{j=p+1}^{J} = 0. \tag{2}$$

Based on the ANOVA p-values, include a list of models where all the terms must be included. Given the three lm objects, “fit1”, “fit2” and “fit3”, the line,

“anova(fit1, fit2, fit3)”,

of R code generates a table with a column of p-values associated with the variance ratios of the linear model to the respective polynomial models of orders $\{2, \cdots, 6\}$. Construct a table with all the p-values from the five model fits for $p \in \{2, \cdots, 6\}$, and include this in the assignment paper. Write all the entries of the table in scientific notation ($m \times 10^{l}$ or m e l) and then round the coefficients (the $m$) of those numbers to one decimal place. Based on the set of p-values of a row, list the models for which all the polynomial terms must be included at the 95% level of the hypothesis test.
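A short sketch of reading the “Pr(>F)” column from an `anova()` call is shown here, using synthetic data and three nested fits only. The names `x`, `y`, `aov_table` and `anova_p` are hypothetical. Note that when several fits are passed to `anova()`, R compares each model with the one listed before it; to compare the linear model with one particular order directly, pass just those two fits.

```r
set.seed(3)
x <- 0:20
y <- 2 + 0.4 * x + 0.15 * x^2 + rnorm(length(x))   # synthetic quadratic trend

fit1 <- lm(y ~ poly(x, 1))
fit2 <- lm(y ~ poly(x, 2))
fit3 <- lm(y ~ poly(x, 3))

# One row per model; the first row has no p-value because nothing precedes it.
aov_table <- anova(fit1, fit2, fit3)
anova_p <- aov_table[["Pr(>F)"]]
names(anova_p) <- paste("Order", 1:3)

# Scientific notation with the mantissa rounded to one decimal place.
print(formatC(anova_p[-1], format = "e", digits = 1), quote = FALSE)
```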

Part f)

Based on the coefficient-estimator p-values and the ANOVA p-values, declare

which order the best-fitting polynomial has, and explain why. Compute and list

sample estimates of the mean and variance of the distribution of the discrete-time

process associated with the residual time series of the best fit.
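The sample estimates can be computed directly from the residuals of the chosen fit, as in the sketch below. The data are synthetic and the order 2 model is used purely for illustration; `best_fit`, `res`, `res_mean` and `res_var` are hypothetical names. Note that `var()` divides by $n - 1$, which may differ from the estimator defined in the course notes.

```r
set.seed(4)
x <- 0:20
y <- 3 + 0.2 * x^2 + rnorm(length(x))   # synthetic data, not the course data

best_fit <- lm(y ~ poly(x, 2))           # order chosen for illustration only
res <- residuals(best_fit)

res_mean <- mean(res)                    # sample mean of the residual series
res_var <- var(res)                      # sample variance, denominator n - 1

cat(sprintf("Residual mean: %.1e, residual variance: %.3f\n", res_mean, res_var))
```

For a least-squares fit with an intercept, the residual mean is zero up to floating-point error.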

Part g)

Include in the assignment paper a quantile-quantile plot of the residuals of

the best polynomial fit, where the theoretical quantiles are from the standard normal

distribution. An example of a quantile-quantile plot is included in

“US_Population_Example_1.R”. The residual series which is entered as a parameter

to the “qqnorm” function must be standardized prior to running the function

in the plotting code. Include a diagonal line segment in the plot. Explain whether

or not the residuals appear to approximately constitute the realization of a Gaussian

white-noise process.
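The standardization and plotting steps can be sketched as follows. This is not the plotting code of “US_Population_Example_1.R”: the data are synthetic, the order 2 fit is for illustration only, and `res_std` is a hypothetical name. The residuals are centred and scaled before being passed to `qqnorm()`, and `abline(a = 0, b = 1)` draws the diagonal on which standard normal quantiles would fall.

```r
set.seed(5)
x <- 0:20
y <- 3 + 0.2 * x^2 + rnorm(length(x))    # synthetic data, not the course data

best_fit <- lm(y ~ poly(x, 2))            # order chosen for illustration only
res <- residuals(best_fit)

# Standardize: subtract the sample mean and divide by the sample standard deviation.
res_std <- (res - mean(res)) / sd(res)

pdf(file.path(tempdir(), "QQ_Plot_Sketch.pdf"))
qq <- qqnorm(res_std,
             main = "Normal Q-Q plot of standardized residuals",
             xlab = "Theoretical standard normal quantiles",
             ylab = "Standardized residual quantiles")
abline(a = 0, b = 1, lty = 2)             # diagonal reference line
dev.off()
```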

Part h) Graduate students only

Compute the time series associated with the discrete-time processes defined by

Xn, 2Xn and ?3Xn, where the Xn are elements of $X = (X_n)_{n=0}^{N-1}$, the time-series

model chosen in Part f). Include in the assignment paper a single plot with each

of the three time series overlaid on the residual series. Do these series support the

choice of the polynomial order in Part f)? Explain why or why not.

Question 2

In this problem,

1. $T_D := \{t_n\}_{n=0}^{N-1}$.

2. $t_m, t_{m-1} \in T_D$.

3. $\Delta t := t_m - t_{m-1}$.

4. $T := N \Delta t$.

5. $T_C := \{X(t)\}_{t \in [0,T)}$.

For each of the following discretizations, $X = (X_n)_{n \in T_D}$, of $X = \{X(t)\}_{t \in T_C}$, determine if the process is strictly stationary and if it is weakly stationary.

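a) $X(t_m) = \mu_0 + X^{(GWN)}(t_m)$, where:

1. $\mu_0 \in \mathbb{R}$ and

2. $$X^{(GWN)}(t_m) \overset{\text{IID}}{\sim} \mathrm{Normal}(0, \sigma^2). \tag{3}$$

b) $X(t_m) = A\cos(2\pi f_0 t_m) + X^{(IID)}(t_m)$, where:

1. $A, f_0 \in \mathbb{R}$.

2. $$X^{(IID)}(t_m) \sim \mathrm{IID}(0, \sigma^2). \tag{4}$$

3. $\Delta t \neq 2\pi f_0$.

c) Graduate students only

Same as Part b), but with $\Delta t = 2\pi f_0$.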
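To build intuition before working through the definitions, sample paths of the processes in Parts a) and b) can be simulated, as in the sketch below. All numerical values (`N`, `dt`, `mu0`, `sigma`, `A`, `f0`) are hypothetical choices. Because IID(0, σ²) noise need not be Gaussian, uniform noise with matching variance is assumed for Part b). A simulation does not replace the justification from the mathematical definitions.

```r
set.seed(6)
N <- 200
dt <- 0.05                          # sampling interval (hypothetical value)
t_n <- (0:(N - 1)) * dt             # sampling times t_0, ..., t_{N-1}
mu0 <- 1.5; sigma <- 0.8            # hypothetical mean and noise scale
A <- 2; f0 <- 0.5                   # hypothetical amplitude and frequency

# Part a): constant mean plus Gaussian white noise.
x_a <- mu0 + rnorm(N, mean = 0, sd = sigma)

# Part b): cosine plus IID noise with mean 0 and variance sigma^2.
# Uniform on [-sqrt(3) sigma, sqrt(3) sigma] has variance sigma^2.
x_iid <- runif(N, min = -sqrt(3) * sigma, max = sqrt(3) * sigma)
mean_b <- A * cos(2 * pi * f0 * t_n)   # deterministic mean function of Part b)
x_b <- mean_b + x_iid

cat(sprintf("Sample mean of a): %.3f; range of mean function of b): [%.2f, %.2f]\n",
            mean(x_a), min(mean_b), max(mean_b)))
```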

Question 3

Let $X = (X_n)_{n \in T_D}$ denote a discrete-time process used in the modelling of the discrete-time series of the wins and losses of the National League team against the American League team in the annual All-Star American Baseball game. The process is defined by the relation,

$$X_n = \begin{cases} +1, & \text{National League team wins} \\ -1, & \text{American League team wins} \end{cases} \tag{5}$$

In the case of a fair game, it is straightforward to see that the distribution of the process is $X_n \overset{\text{IID}}{\sim} \mathrm{Bernoulli}(p)$, where $p = P(X_0 = 1)$. For each $n$ in $T_D$, then, the ensemble of $X_n$ (here called the state space of $X_n$) is $\{-1, +1\}$. Let $S = \{S_n\}_{n \in T_D}$ denote a random walk, where

$$S_n = \begin{cases} 0, & n = 0 \\ \sum_{m=0}^{n} X_m, & n > 0 \end{cases} \tag{6}$$

a) Is $S$ strictly stationary? Is it weakly stationary?

b) For $n > m$, derive an expression for the best linear predictor, $\hat{S}_n(S_m)$, of $S_n$ given $S_m$. The expression must be in terms of $\{m, n, p, S_m\}$ only and simplified as far as possible.

c) Graduate students only

Use the central-limit theorem to approximate the probability distribution of Sn.
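A simulated realization of the random walk in equation (6) can help in checking a derived result numerically. The sketch below assumes hypothetical values for `N` and `p`, and the names `x` and `s` are illustrative. It follows equation (6) as written: $S_0 = 0$ and, for $n > 0$, $S_n$ is the sum of $X_0$ through $X_n$.

```r
set.seed(7)
N <- 100
p <- 0.5                                   # P(X_n = +1), hypothetical value

# Steps X_0, ..., X_{N-1}, each +1 with probability p and -1 otherwise.
x <- sample(c(1, -1), size = N, replace = TRUE, prob = c(p, 1 - p))

# cumsum(x)[k] equals X_0 + ... + X_{k-1}, i.e. the sum in equation (6) for n = k - 1.
# Overwrite the first entry to impose S_0 = 0.
s <- cumsum(x)
s[1] <- 0

cat(sprintf("S_1 = %d, S_%d = %d\n", s[2], N - 1, s[N]))
```

Repeating the simulation many times and comparing the sample mean and variance of $S_n$ across realizations at several values of $n$ gives an empirical check on the stationarity arguments in Part a).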

