MULTIPLE CHOICE [25 marks]
Question 1
A survey of deer was performed in a catchment based on surveying in a number of 1km by 1km
parcels of land. Based on exploratory data analysis you have the following information.
$\bar{y} = 14.3$ deer/ha; $s = 3.0$ deer/ha; $n = 10$; $t^{0.025}_{10} = 2.228$; $t^{0.025}_{9} = 2.262$.
The 95% confidence interval around the mean is?
(a) [2.65, 28.65]
(b) [12.15, 16.45]
(c) [8.03, 24.65]
(d) [3.24, 21.32]
(e) none of the above
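The interval in Question 1 has the form $\bar{y} \pm t \times s/\sqrt{n}$. A minimal Python sketch of that arithmetic follows; the function name `t_confidence_interval` is illustrative, and the code assumes the critical value is read at n - 1 = 9 degrees of freedom, the usual choice for a one-sample interval.

```python
import math

def t_confidence_interval(mean, sd, n, t_crit):
    """Return (lower, upper) for mean +/- t_crit * sd / sqrt(n)."""
    standard_error = sd / math.sqrt(n)
    half_width = t_crit * standard_error
    return mean - half_width, mean + half_width

# Values from Question 1; 2.262 is the tabulated t for 9 df (assumed df = n - 1).
lower, upper = t_confidence_interval(mean=14.3, sd=3.0, n=10, t_crit=2.262)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```

Swapping in 2.228 (the 10 df value) shows how sensitive the interval is to the choice of degrees of freedom.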
Question 2
A survey of deer was performed in a catchment based on surveying in a number of 1km by 1km
parcels of land. Based on exploratory data analysis you have the following information.
$\bar{y} = 14.3$ deer/ha; $s = 3.0$ deer/ha; $n = 10$.
For next year's survey you want a standard error of the mean equal to 0.75.
Based on the statistics above, how many parcels of land will you survey next year?
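Since the standard error of the mean is $s/\sqrt{n}$, a target standard error can be turned into a sample size by rearranging to $n = (s/SE)^2$. The Python sketch below assumes simple random sampling and that this year's s is a fair guess for next year's; `parcels_needed` is an illustrative name.

```python
import math

def parcels_needed(sd, target_se):
    """Smallest n with sd / sqrt(n) <= target_se, i.e. n >= (sd / target_se)^2."""
    return math.ceil((sd / target_se) ** 2)

# s = 3.0 and the target standard error of 0.75 come from Question 2.
n_next_year = parcels_needed(sd=3.0, target_se=0.75)
print(n_next_year)
```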
Question 3
When designing a monitoring scheme which of the following statements is correct?
(a) If we have a large covariance between the 2 sampling periods then it is most likely best that
we resample the same locations or sampling units.
(b) If we have a small covariance between the 2 sampling periods then it is most likely best that
we resample the same locations or sampling units.
(c) If we have zero covariance between the 2 sampling periods then it is most likely best that
we resample the same locations or sampling units.
(d) It does not matter whether we resample the same units.
(e) None of the above.
Question 4
Soil carbon was measured in a field at the start of a season and at the end of the season. The
aim is to estimate the change in mean carbon for the field and see if the change is statistically
significant. You sampled different locations between the 2 sampling events.
The outputs from 2 different analyses are shown below.
Two Sample t-test
data: init and fin
t = 0.28987, df = 10, p-value = 0.7778
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-28.97550 37.64217
sample estimates:
mean of x mean of y
42.83333 38.50000
Paired t-test
data: init and fin
t = 6.0613, df = 5, p-value = 0.001764
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.495572 6.171095
sample estimates:
mean of the differences
4.333333
Which of the following statements is true?
(a) The most appropriate analysis is the paired t-test and we can state that there was a
significant change in soil carbon over the season.
(b) The most appropriate analysis is the paired t-test and we can state that there wasn’t a
significant change in soil carbon over the season.
(c) The most appropriate analysis is the two sample t-test and we can state that there was a
significant change in soil carbon over the season.
(d) The most appropriate analysis is the two sample t-test and we can state that there wasn’t a
significant change in soil carbon over the season.
(e) None of the above.
Question 5
When performing a survey, stratified random sampling is better than simple random sampling
because it is likely to
(a) give a more representative sample.
(b) give a better estimate of the mean and variance.
(c) give more precise estimates of the mean.
(d) all of a, b and c.
(e) none of these/insufficient information.
Questions 6-7 relate to the analysis of the following dataset.
The protein content of milk (%) of two breeds of cattle was compared, with a random sample of
20 of each breed being selected. The following table of descriptive statistics was obtained.
Breed     n    Mean    SD
Breed 1   20   3.352   0.212
Breed 2   20   3.681   0.233
The data were analysed using a pooled (i.e. equal variance) two-sample t-test, and a t-value
of 4.67 was obtained. The data were subsequently re-analysed using an analysis of variance.
Answer the following questions:
Question 6
In this re-analysis, what is the F-value?
Question 7
In this re-analysis, what are the Breed df and Residual df?
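With only two groups, a one-way ANOVA and a pooled two-sample t-test are the same test: the F-value equals the square of the t-value, with k - 1 and N - k degrees of freedom. The Python sketch below rebuilds the pooled t from the summary table and squares it; the function name is illustrative, and small differences from the quoted 4.67 come from rounding in the table.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled (equal variance) two-sample t statistic for mean2 - mean1."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean2 - mean1) / se_diff

# Summary statistics from the table for Questions 6-7.
t_value = pooled_t(3.352, 0.212, 20, 3.681, 0.233, 20)
f_value = t_value ** 2          # one-way ANOVA F for two groups

groups, total_n = 2, 40
breed_df = groups - 1           # treatment (Breed) df
residual_df = total_n - groups  # residual df

print(round(t_value, 2), round(f_value, 2), breed_df, residual_df)
```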
Question 8
You are establishing a plant breeding trial at a site with five soil types which would impact on the
yield of the different varieties of wheat you intend to use. The experimental design you would
use is a
(a) completely randomised design.
(b) paired design.
(c) randomised complete block design.
(d) factorial treatment design.
(e) none of these/insufficient information.
Question 9
A field experiment was conducted to compare yields of 10 varieties of wheat. A randomised
complete block design was used for the experiment with four blocks being used, each block containing
all 10 varieties. The data generated from this design were to be analysed using ANOVA.
What are the residual degrees of freedom?
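The degrees of freedom in a randomised complete block design follow directly from the number of blocks b and treatments t, as in the "ANOVA table with Block" on the formula sheet. A small Python sketch, with an illustrative function name:

```python
def rcbd_degrees_of_freedom(blocks, treatments):
    """Degrees of freedom for each line of an RCBD ANOVA table."""
    return {
        "block": blocks - 1,
        "treatment": treatments - 1,
        "residual": (blocks - 1) * (treatments - 1),
        "total": blocks * treatments - 1,
    }

# Question 9: four blocks, each containing all 10 varieties.
df = rcbd_degrees_of_freedom(blocks=4, treatments=10)
print(df)
```

The block, treatment and residual entries always sum to the total, which is a quick check on a hand-built table.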
Question 10
A field experiment was conducted to compare weight gains of sheep under 2 pasture systems. 40
sheep were available for the experiment and 20 sheep were randomly allocated to each of 2 paddocks,
each representing one pasture system. Each sheep was weighed before and after the experiment
to estimate weight gain. Which statement is true?
(a) This is an example of confounding as we cannot disentangle whether it is the pasture system
causing differences in weight gain or other factors that may vary between each paddock, e.g.
soil.
(b) The experimental unit is the paddock and sampling unit is a sheep.
(c) Both the experimental and sampling unit are a sheep.
(d) a and b are True.
(e) a, b and c are True.
Question 11
An experiment was being planned, and R was used to generate a randomisation for the experimental
design. The following output was obtained:
> library(agricolae)
> (Trt <- LETTERS[1:5])
[1] "A" "B" "C" "D" "E"
> design.crd(trt = Trt, r = 4)$book
Based on this code and output, this would be appropriate for the following experimental design:
(a) completely randomised design with four treatments and five replicates per treatment.
(b) completely randomised design with five treatments and four replicates per treatment.
(c) randomised complete block design with four treatments and five blocks.
(d) randomised complete block design with five treatments and four blocks.
(e) none of these.
Questions 12-15 relate to the analysis of the following dataset.
An experiment was designed to assess the usefulness of synthetic protein dietary supplements
in cattle, involving a comparison of three different supplements (Supplement A, Supplement B,
Supplement C), with a Control (no supplement). The level of total protein content (g/100 ml)
was measured in the blood of cows. A total of 40 similar cows was used for this experiment,
with ten cows being randomly allocated to each of the four treatments. The following analysis
was undertaken in R. However, not all of the output from the LSD.test function is shown below.
Df Sum Sq Mean Sq F value Pr(>F)
Supplement 3 2.651 0.8838 4.347 0.0103 *
Residuals 36 7.319 0.2033
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
test p.ajusted name.t ntr alpha
Fisher-LSD none Supplement 4 0.05
MSerror Df Mean CV t.value LSD
0.2033186 36 7.24625 6.222648 2.028094 0.4089702
        Protein       std  r      LCL      UCL  Min  Max    Q25   Q50    Q75
Control   7.017 0.3652716 10 6.727814 7.306186 6.32 7.55 6.8675 7.015 7.2200
Supp A    7.683 0.4541671 10 7.393814 7.972186 7.00 8.57 7.3950 7.645 7.8975
Supp B    7.130 0.4212943 10 6.840814 7.419186 6.34 7.77 6.9550 7.180 7.3275
Supp C    7.155 0.5441456 10 6.865814 7.444186 6.53 8.27 6.6700 7.225 7.4225
Question 12
The proportion of variation in protein explained by Supplement is:
(a) 0.266
(b) 0.362
(c) 0.638
(d) 0.734
(e) none of the above.
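The proportion of variation explained is the treatment sum of squares divided by the total sum of squares, where the total is the sum of the two lines in the ANOVA table. A short Python sketch using the Sum Sq column above (the function name is illustrative):

```python
def proportion_explained(ss_model, ss_residual):
    """R-squared for a one-way ANOVA: model SS over total SS."""
    return ss_model / (ss_model + ss_residual)

# Sum Sq values from the Supplement ANOVA table.
r_squared = proportion_explained(ss_model=2.651, ss_residual=7.319)
print(round(r_squared, 3))
```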
Question 13
Assuming that Residuals ~ N(0, σ²), what is the correct value of σ²?
(a) 2.651
(b) 7.319
(c) 0.8838
(d) 0.2033
(e) none of the above.
Question 14
From the output, which treatment(s) show significantly (α = 0.05) the highest blood protein
content?
(a) Control + Supp B + Supp C
(b) Supp A
(c) Supp B
(d) Supp C
(e) none of the above.
Question 15
From the output, which treatment(s) show significantly (α = 0.05) the lowest blood protein
content?
(a) Control + Supp B + Supp C
(b) Supp A
(c) Supp B
(d) Supp C
(e) none of the above.
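Questions 14 and 15 can be checked by comparing every pair of treatment means against the LSD, computed as $t \times \sqrt{2 \times MSerror / r}$. The Python sketch below uses the means, MSerror, t.value and r printed in the output; the variable names are illustrative.

```python
import math

# Treatment means and LSD ingredients from the LSD.test output above.
means = {"Control": 7.017, "Supp A": 7.683, "Supp B": 7.130, "Supp C": 7.155}
ms_error, reps, t_crit = 0.2033186, 10, 2.028094

lsd = t_crit * math.sqrt(2 * ms_error / reps)

# Two means differ at alpha = 0.05 when their gap exceeds the LSD.
names = list(means)
significant_pairs = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(means[a] - means[b]) > lsd
]

print(round(lsd, 4))
print(significant_pairs)
```

Every pair flagged here involves the same treatment, which is what separates it from the other three.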
Questions 16-20 relate to the analysis of the following dataset.
Researchers are interested in what explains the relative abundance of C3 and C4 plants at 73
sites in North America. The data contains the following:
Response variable (y)
C3: relative abundance of C3 plants at 73 sites
C4: relative abundance of C4 plants at 73 sites
Predictor variables
1. MAP: Mean annual rainfall (mm) at the site.
2. MAT: Mean annual temperature (degrees C) at the site.
3. JJAMAP: proportion of mean annual rainfall in June, July and August (summer rainfall).
4. DJFMAP: proportion of mean annual rainfall in December, January and February (winter
rainfall).
5. LAT: Latitude in centesimal degrees.
6. LONG: Longitude in centesimal degrees.
The researchers were first interested in C3 plants before also looking at C4 plants. After initial
inspection of the data, the researchers decided to log10 transform both the response variables
(C3 and C4). Part of the output of the model predicting the abundance of C3 plants gives the
coefficients and the statistical detail. Based on this, answer the following 2 questions.
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.986574e-01 3.329192e-01 -2.0985795 0.039684448
MAP 7.784743e-05 5.843685e-05 1.3321634 0.187388322
MAT 1.637150e-03 3.123558e-03 0.5241298 0.601944104
JJAMAP -9.622251e-02 1.276384e-01 -0.7538678 0.453609777
DJFMAP -1.854179e-01 1.924485e-01 -0.9634680 0.338829626
LONG 2.931004e-03 2.695546e-03 1.0873509 0.280836839
LAT 1.243240e-02 2.667459e-03 4.6607671 0.000015784
Question 16
How many degrees of freedom are there to calculate the significance of this model?
(a) 7
(b) 73
(c) 66
(d) you can’t work that out from this data.
Question 17
Is the model significant?
(a) no, because all βi ≠ 0
(b) yes, because at least one βi > 0
(c) yes, because all βi ≠ 0
(d) you can’t tell this from this information.
The researchers subsequently did the same analysis on the C4 plants and found the following
results:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.886485e-01 2.963954e-01 2.32341153 0.02324746
MAP 8.551407e-05 5.202588e-05 1.64368332 0.10499743
MAT 3.492573e-03 2.780880e-03 1.25592384 0.21357141
JJAMAP 2.952668e-01 1.136355e-01 2.59836713 0.01154167
DJFMAP -1.495957e-02 1.713354e-01 -0.08731162 0.93068818
LONG -5.033666e-03 2.399823e-03 -2.09751523 0.03978172
LAT -5.316684e-03 2.374818e-03 -2.23877505 0.02854770
Question 18
Which of the following statements is TRUE?
(a) With each unit increase in JJAMAP rainfall the abundance of C4 plants increases by 1.97
units, with all other variables held constant.
(b) With each unit increase in JJAMAP rainfall the abundance of C4 plants increases by 0.2953
units, with all other variables held constant.
(c) With each unit increase in JJAMAP rainfall the abundance of C4 plants decreases by 1.97
units.
(d) You cannot calculate this from this data.
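Because the response was log10-transformed, a coefficient acts additively on the log10 scale and multiplicatively on the original scale. Assuming the C4 coefficient table above is for log10(C4), the JJAMAP estimate can be back-transformed as sketched below in Python.

```python
# JJAMAP estimate from the C4 coefficient table (log10 scale).
jjamap_coefficient = 0.2952668

# A one-unit increase in JJAMAP multiplies the original-scale response by 10**b.
multiplicative_change = 10 ** jjamap_coefficient
print(round(multiplicative_change, 2))
```

Note that JJAMAP is a proportion, so a full one-unit increase spans its entire possible range.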
Question 19
Rather than including all the variables, the researchers decided to look at simpler models that
included fewer variables. The first model they tried was based only on Longitude and Latitude.
Call:
lm(formula = Log10C4 ~ LONG + LAT, data = C4data1)
Residuals:
Min 1Q Median 3Q Max
-0.120517 -0.043905 -0.007991 0.046580 0.161236
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.475395 0.127024 11.615 < 2e-16 ***
LONG -0.010037 0.001126 -8.915 3.79e-13 ***
LAT -0.007724 0.001366 -5.653 3.17e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.06119 on 70 degrees of freedom
Multiple R-squared: 0.636, Adjusted R-squared: 0.6256
F-statistic: 61.16 on 2 and 70 DF, p-value: 4.347e-16
Explain from a statistical model point of view why the p-value and the estimate of LONG and
LAT have changed compared to the earlier full model.
(a) Because it is a totally different model and data.
(b) Because LONG and LAT individually have a lot more explaining power with fewer variables
in the model.
(c) Because the true value of the estimates of LONG and LAT is only visible in a simple linear
regression.
(d) Because the estimates of LONG and LAT are partial regression coefficients and both the estimate
and the p-value change with the number of variables in the model.
Question 20
The researchers also looked at a different model, which also includes JJAMAP, and then ran a
partial F-test to see which model was the best.
Call:
lm(formula = Log10C4 ~ LONG + LAT + JJAMAP, data = C4data1)
Residuals:
Min 1Q Median 3Q Max
-0.12407 -0.03799 -0.01356 0.03388 0.14974
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.228906 0.144208 8.522 2.22e-12 ***
LONG -0.008134 0.001230 -6.614 6.60e-09 ***
LAT -0.008287 0.001303 -6.359 1.89e-08 ***
JJAMAP 0.230981 0.074997 3.080 0.00297 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.05779 on 69 degrees of freedom
Multiple R-squared: 0.68, Adjusted R-squared: 0.6661
F-statistic: 48.87 on 3 and 69 DF, p-value: < 2.2e-16
Analysis of Variance Table
Model 1: Log10C4 ~ LONG + LAT + JJAMAP
Model 2: Log10C4 ~ LONG + LAT
Res.Df RSS Df Sum of Sq F Pr(>F)
1 69 0.23042
2 70 0.26209 -1 -0.031676 9.4855 0.002973 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Which model did the researchers choose as their final model and why?
- Model 1: Model with Latitude, Longitude and JJAMAP.
- Model 2: Model with only Latitude and Longitude.
(a) Model 2, as this model has fewer variables and based on the principle of parsimony this is
the best model.
(b) Model 1, as this has the higher r-squared.
(c) Model 1 as the F-test says the difference between the models is significant and therefore
adding a variable is warranted.
(d) Model 2 as the F-test says the difference between the models is significant and therefore
adding a variable is not warranted.
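The F value in the Analysis of Variance Table can be rebuilt from the two residual sums of squares: the drop in RSS per extra parameter, divided by the residual mean square of the larger model. A Python sketch with an illustrative function name; the result differs from the printed 9.4855 only through rounding of the RSS values.

```python
def partial_f(rss_reduced, df_reduced, rss_full, df_full):
    """Partial F statistic comparing a reduced model with a fuller nested model."""
    numerator = (rss_reduced - rss_full) / (df_reduced - df_full)
    denominator = rss_full / df_full
    return numerator / denominator

# Res.Df and RSS from the table: Model 2 is the reduced model, Model 1 the full one.
f_stat = partial_f(rss_reduced=0.26209, df_reduced=70, rss_full=0.23042, df_full=69)
print(round(f_stat, 2))
```

With a single added term, this F equals the square of that term's t value (3.080 for JJAMAP), so the two p-values agree.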
Question 21
The researchers redid the analysis with a smaller dataset (they removed 20% of the data) so
they could validate the best model. They verified that they got similar results with the reduced
calibration data set as with the full data set by looking at the coefficients in the following table:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.225171731 0.158891238 7.710757 2.586135e-10
LONG -0.007966020 0.001385858 -5.748077 4.091923e-07
LAT -0.008301983 0.001501368 -5.529612 9.134469e-07
JJAMAP 0.206330437 0.089569087 2.303590 2.505033e-02
They then calculated the correlation, r, between the calibration data and the validation data, to
check the performance of the models. They also calculated Lin’s concordance.
Calibration_r Validation_r
1 0.809256 0.8798249
Calibration_Lins.est Calibration_Lins.lower Calibration_Lins.upper
1 0.7914643 0.6811801 0.8666261
Validation_Lins.est Validation_Lins.lower Validation_Lins.upper
1 0.7494391 0.4662542 0.8933096
Based on this, which one of the following conclusions can be drawn?
(a) The model does not perform well as the validation results are considerably lower than the
calibration results.
(b) The correlation coefficient alone is sufficient to indicate how well the model performs in
validation and calibration.
(c) Lin’s concordance indicates the relationship between validated and calibrated predictions does
not follow the 1:1 line and therefore the model performs poorly.
(d) Lin’s concordance indicates that both validation and calibration results follow the 1:1 line
between predicted and observed quite well and therefore the model performs well.
Question 22
Which of the following are multivariate methods?
(a) Regression.
(b) Classification.
(c) Clustering.
(d) Ordination.
(e) All of the above.
Question 23
You are going to run an nMDS on a data set which contains numbers of ant species collected at
different sites. Some of the ant species are much higher in abundance than other ant species,
and there are many species absent from some sites. Which of the following statements is true?
(a) You should use a Bray-Curtis similarity matrix.
(b) You should use a 4th root transformation.
(c) You should not use a Euclidean distance matrix.
(d) All of a, b and c.
(e) None of a, b, c.
Question 24
Which of the following is required by K-means clustering?
(a) defined distance metric.
(b) number of clusters.
(c) initial guess as to cluster centroids.
(d) All a, b, and c.
(e) None of a, b, and c.
Question 25
In cluster analysis, objects with larger distances between them are more similar to each other
than are those at smaller distances.
(a) True.
(b) False.
End of multiple choice questions
SHORT ANSWER [25 marks]
Question A [6 marks]
An experiment was performed to examine the impact of different insecticide treatments on the
number of living insect larvae in a rice crop. The experiment had 9 insecticide treatments arranged
in a randomised complete block design with 4 blocks. The control (no insecticide used)
was coded T9. The aim is to have the smallest number of larvae after treatment.
The R output below shows the results of an analysis on the data.
Analysis of Variance Table
Response: larvae
Df Sum Sq Mean Sq F value Pr(>F)
rep 3 385.64 128.546 3.7550 0.024209 *
trt 8 1255.50 156.937 4.5843 0.001719 **
Residuals 24 821.61 34.234
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
test p.ajusted name.t ntr alpha
Fisher-LSD none trt 9 0.05
MSerror Df Mean CV t.value LSD
34.2338 24 7.916667 73.90693 2.063899 8.538879
[Figure: normal Q-Q plot of the residuals (x-axis: Theoretical Quantiles; y-axis: Sample Quantiles)]
The R output below shows the results of an analysis on the data after log transformation.
Analysis of Variance Table
Response: log_larvae
Df Sum Sq Mean Sq F value Pr(>F)
rep 3 0.9567 0.31889 3.6511 0.0267223 *
trt 8 3.9823 0.49779 5.6995 0.0004092 ***
Residuals 24 2.0961 0.08734
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
test p.ajusted name.t ntr alpha
Fisher-LSD none trt 9 0.05
MSerror Df Mean CV t.value LSD
0.08733941 24 0.7547053 39.15865 2.063899 0.431299
log_larvae groups
Choose and justify the appropriate output based on meeting the model assumptions. Use this
output to identify which insecticide treatments are better at controlling insects than the control
(T9). Explain how you made this decision.
Answer
Question B [7 marks]
An experiment was conducted to compare the yields of leys grown from four mixtures
(labelled A, B, C and D) and three seed rates (4, 6 and 8 g per unit area). A randomised
complete block design was used with three blocks. An analysis of the original data detected
an unstable variance and to remedy this a natural log transformation was used. The following
ANOVA was obtained together with means on the log-scale. Some of the entries have been
omitted in the ANOVA table.
(i) Fill in the values of the cells in the analysis of variance table marked with a *.
Source            df   SS        MS       F-ratio   P-value
Block             *    0.12469   0.0623   4.75
Rate              *    0.34706   0.1735   13.22     <0.001
Mixture           *    0.76693   0.2556   19.47     <0.001
Rate × Mixture    *    0.21969   *        *         0.036
Residual          *    0.28879   *
Total             *    1.74715
Seed rate   A       B       C       D       Overall
4 g         2.960   2.462   2.586   2.739   2.700
6 g         2.965   2.486   2.863   2.940   2.807
8 g         2.974   2.788   3.048   2.951   2.940
Overall     2.966   2.579   2.823   2.895   2.816
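The starred cells in part (i) follow from the design (3 blocks, 3 seed rates, 4 mixtures) and the factorial ANOVA layout on the formula sheet. The Python sketch below works through them; the variable names are illustrative, and the last line checks the derived residual mean square against the Block F-ratio already printed in the table.

```python
blocks, rates, mixtures = 3, 3, 4
treatments = rates * mixtures  # 12 rate-by-mixture combinations

df = {
    "Block": blocks - 1,
    "Rate": rates - 1,
    "Mixture": mixtures - 1,
    "Rate x Mixture": (rates - 1) * (mixtures - 1),
    "Residual": (blocks - 1) * (treatments - 1),
    "Total": blocks * treatments - 1,
}

# Sums of squares taken from the ANOVA table.
ss_interaction, ss_residual = 0.21969, 0.28879

ms_interaction = ss_interaction / df["Rate x Mixture"]
ms_residual = ss_residual / df["Residual"]
f_interaction = ms_interaction / ms_residual

# Consistency check: Block MS / Residual MS should reproduce the printed 4.75.
block_f_check = 0.0623 / ms_residual

print(df)
print(round(ms_interaction, 4), round(ms_residual, 5), round(f_interaction, 2))
```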
(ii) What is the least significant difference (LSD) to compare a pair of means of different
mixtures but the same seed rate? The LSD is calculated as $t_{crit} \times SED$, where $SED = \sqrt{ResMS \times 2/rep}$.
The following selected critical values from the t-tables will be useful:
df    t (one tailed P = 0.025; two tailed P = 0.05)
6     2.447
22    2.074
30    2.042
40    2.021
Answer
(iii) Prior to the experiment, the experimenters were particularly interested in comparing seed
mixtures B and C at a seed rate of 6 g per unit area. Provide an estimate of this difference and
the 95% confidence interval for this comparison, initially on the log-scale, then on the original
scale of measurement.
Answer
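Parts (ii) and (iii) share the same arithmetic: the LSD is the half-width of the 95% confidence interval for a difference between two rate-by-mixture means. The Python sketch below assumes each such mean is based on 3 replicates (one per block), uses the t value tabulated for 22 residual df, and takes the difference as C minus B, which is a choice of direction. Because the analysis used a natural log, exponentiating turns the difference into a ratio on the original scale.

```python
import math

# Residual line of the ANOVA table (residual df of 22 derived from the design).
ss_residual, df_residual = 0.28879, 22
reps = 3        # assumed: one plot per block for each rate-by-mixture mean
t_crit = 2.074  # tabulated two tailed 0.05 value for 22 df

res_ms = ss_residual / df_residual
sed = math.sqrt(res_ms * 2 / reps)
lsd = t_crit * sed

# Log-scale means at 6 g from the table of means.
mean_b, mean_c = 2.486, 2.863
diff_log = mean_c - mean_b
ci_log = (diff_log - lsd, diff_log + lsd)

# Back-transform: a difference of natural logs is a ratio on the original scale.
ratio = math.exp(diff_log)
ci_ratio = (math.exp(ci_log[0]), math.exp(ci_log[1]))

print(round(lsd, 3), [round(x, 3) for x in ci_log])
print(round(ratio, 2), [round(x, 2) for x in ci_ratio])
```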
Question C [7 marks]
Data on water quality parameters and algae counts were collected in many different Queensland
lakes and rivers for different algae species. Here we concentrate on the data for
Cylindrospermopsis raciborskii (CR). The water quality data considered were Sulfide (mg/L),
Total N(itrogen) (mg/L), Total P(hosphorus) (mg/L), Dissolved Oxygen (DO, mg/L), Turbidity
(NTU), Temperature (degrees C), pH, and electrical conductivity (Cond) (µS/cm). The
researchers are interested in developing the best possible model to predict the occurrence of CR
in Queensland waters.
Here is a snippet of the data.
| | SulfidesmgL| TotalNmgL| TotalPmgL| ConduScm| DOmgL|
|:--|-----------:|---------:|---------:|---------:|---------:|
|23 | 0.02| 1.450| 0.085| 788.9091| 1.308182|
|26 | 0.00| 0.455| 0.040| 235.4000| 14.428000|
|38 | 0.02| 1.600| 0.095| 1009.5000| 3.662500|
|54 | 0.02| 0.650| 0.045| 399.0000| 5.240000|
|61 | 0.01| 0.295| 0.024| 153.5000| 7.500000|
(i) On inspection of the data, the researchers decide to log10 transform most of the data
columns, namely SulfidesmgL, TotalNmgL, TotalPmgL, ConduScm, DOmgL, TurbidityNTU,
and CR. Subsequently a correlation table was generated. Based on the correlation matrix, explain
which independent variable you suspect will be the best predictor for CR in a single variable
linear regression (simple linear regression).
SulfidesmgL TotalNmgL TotalPmgL ConduScm DOmgL
SulfidesmgL 1.000000000 0.011520667 0.009485123 0.123405309 0.022352894
TotalNmgL 0.011520667 1.000000000 0.999325874 -0.014672749 -0.007831802
TotalPmgL 0.009485123 0.999325874 1.000000000 -0.017052459 -0.009546349
ConduScm 0.123405309 -0.014672749 -0.017052459 1.000000000 -0.173858384
DOmgL 0.022352894 -0.007831802 -0.009546349 -0.173858384 1.000000000
pH 0.103157999 -0.147361880 -0.158040039 0.239738159 0.109014015
Temperature 0.091086576 -0.016655411 -0.012593383 0.212635719 -0.118745738
TurbidityNTU 0.140494654 -0.027861083 -0.013408112 -0.001087102 -0.070986845
CR 0.022925155 -0.056605979 -0.055010325 -0.232874072 0.109641652
pH Temperature TurbidityNTU CR
SulfidesmgL 0.10315800 0.09108658 0.140494654 0.02292515
TotalNmgL -0.14736188 -0.01665541 -0.027861083 -0.05660598
TotalPmgL -0.15804004 -0.01259338 -0.013408112 -0.05501032
ConduScm 0.23973816 0.21263572 -0.001087102 -0.23287407
DOmgL 0.10901402 -0.11874574 -0.070986845 0.10964165
pH 1.00000000 0.22301849 -0.087793362 -0.12148631
Temperature 0.22301849 1.00000000 0.204382392 -0.29034452
TurbidityNTU -0.08779336 0.20438239 1.000000000 -0.11048542
CR -0.12148631 -0.29034452 -0.110485416 1.00000000
Answer
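In a simple linear regression, R² equals the squared correlation between predictor and response, so the strongest single predictor is the one with the largest absolute correlation with CR. The Python sketch below scans the CR row of the correlation matrix; the dictionary name is illustrative.

```python
# CR row of the correlation matrix above.
correlation_with_cr = {
    "SulfidesmgL": 0.022925155,
    "TotalNmgL": -0.056605979,
    "TotalPmgL": -0.055010325,
    "ConduScm": -0.232874072,
    "DOmgL": 0.109641652,
    "pH": -0.12148631,
    "Temperature": -0.29034452,
    "TurbidityNTU": -0.110485416,
}

# Rank by absolute correlation; the sign only gives the direction of the slope.
best_predictor = max(correlation_with_cr, key=lambda name: abs(correlation_with_cr[name]))
r_squared = correlation_with_cr[best_predictor] ** 2

print(best_predictor, round(r_squared, 3))
```

Even the best single predictor explains under a tenth of the variation here, which is worth noting in the answer.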
(ii) The researchers continued with a multiple regression analysis on all the data for CR. The
model output is given below. Explain whether this is a satisfactory statistical relationship.
Answer
Call:
lm(formula = log10CR ~ ., data = CR_QAlgae_tr)
Residuals:
Min 1Q Median 3Q Max
-2.28110 -0.69332 -0.00444 0.64203 2.07150
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.91824 1.84634 4.289 4.39e-05 ***
pH -0.26345 0.23046 -1.143 0.2559
Temperature -0.01908 0.02783 -0.686 0.4946
log10SulfidesmgL -3.07906 12.90546 -0.239 0.8120
log10TotalNmgL 4.87464 1.04979 4.643 1.12e-05 ***
log10TotalPmgL -11.31783 2.34518 -4.826 5.44e-06 ***
log10ConduScm -0.79975 0.38771 -2.063 0.0419 *
log10DOmgL 0.27440 0.53338 0.514 0.6082
log10TurbidityNTU -0.55953 0.21525 -2.599 0.0109 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8497 on 93 degrees of freedom
Multiple R-squared: 0.3331, Adjusted R-squared: 0.2758
F-statistic: 5.807 on 8 and 93 DF, p-value: 5.251e-06
(iii) Why is the adjusted $r^2$ value important for model selection? Relate this to the principle
of parsimony.
Answer
(iv) The following residual plots were obtained for the full multi-linear regression model with three
variables. Based on the residual plots, argue whether there are any concerns about continuing
with the regression analysis.
Answer
[Figure: residual plots, including a normal Q-Q plot (x-axis: Theoretical Quantiles; y-axis: Standardized residuals)]
(v) Backward elimination was performed and the following output was obtained. Using the
output explain in some statistical detail what the best final model for predicting log10(CR)
is and how the output of the variable selection procedure informs you of this (Indicate what
information you would be looking at).
Answer
Start: AIC=-24.64
log10CR ~ pH + Temperature + log10SulfidesmgL + log10TotalNmgL +
log10TotalPmgL + log10ConduScm + log10DOmgL + log10TurbidityNTU
Df Sum of Sq RSS AIC
- log10SulfidesmgL 1 0.0411 67.191 -26.5783
- log10DOmgL 1 0.1911 67.341 -26.3509
- Temperature 1 0.3394 67.489 -26.1264
- pH 1 0.9436 68.093 -25.2175
<none> 67.150 -24.6407
- log10ConduScm 1 3.0722 70.222 -22.0777
- log10TurbidityNTU 1 4.8790 72.029 -19.4865
- log10TotalNmgL 1 15.5683 82.718 -5.3725
- log10TotalPmgL 1 16.8166 83.966 -3.8447
Step: AIC=-26.58
log10CR ~ pH + Temperature + log10TotalNmgL + log10TotalPmgL +
log10ConduScm + log10DOmgL + log10TurbidityNTU
Df Sum of Sq RSS AIC
- log10DOmgL 1 0.1766 67.368 -28.3106
- Temperature 1 0.3536 67.545 -28.0430
- pH 1 0.9782 68.169 -27.1040
<none> 67.191 -26.5783
- log10ConduScm 1 3.1269 70.318 -23.9386
- log10TurbidityNTU 1 5.0982 72.289 -21.1186
- log10TotalNmgL 1 15.5567 82.748 -7.3361
- log10TotalPmgL 1 16.8151 84.006 -5.7966
Step: AIC=-28.31
log10CR ~ pH + Temperature + log10TotalNmgL + log10TotalPmgL +
log10ConduScm + log10TurbidityNTU
Df Sum of Sq RSS AIC
- Temperature 1 0.5775 67.945 -29.4399
- pH 1 0.8772 68.245 -28.9909
<none> 67.368 -28.3106
- log10ConduScm 1 3.6333 71.001 -24.9526
- log10TurbidityNTU 1 5.2317 72.599 -22.6819
- log10TotalNmgL 1 15.3830 82.751 -9.3325
- log10TotalPmgL 1 16.6387 84.006 -7.7963
Step: AIC=-29.44
log10CR ~ pH + log10TotalNmgL + log10TotalPmgL + log10ConduScm +
log10TurbidityNTU
Df Sum of Sq RSS AIC
- pH 1 1.2354 69.180 -29.6020
<none> 67.945 -29.4399
- log10ConduScm 1 4.6170 72.562 -24.7341
- log10TurbidityNTU 1 6.1241 74.069 -22.6372
- log10TotalNmgL 1 18.6416 86.587 -6.7104
- log10TotalPmgL 1 19.8658 87.811 -5.2783
Step: AIC=-29.6
log10CR ~ log10TotalNmgL + log10TotalPmgL + log10ConduScm + log10TurbidityNTU
Df Sum of Sq RSS AIC
<none> 69.180 -29.6020
- log10TurbidityNTU 1 5.6547 74.835 -23.5879
- log10ConduScm 1 6.2639 75.444 -22.7609
- log10TotalNmgL 1 18.5837 87.764 -7.3326
- log10TotalPmgL 1 19.1970 88.377 -6.6223
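The AIC column printed by the elimination procedure can be reproduced from the RSS column. For linear models, R's `step()` reports $n \log(RSS/n) + 2k$, where k is the number of estimated coefficients. The Python sketch below assumes n = 102, inferred from the 93 residual degrees of freedom plus 9 coefficients in the full model; `step_aic` is an illustrative name.

```python
import math

def step_aic(rss, n, n_coefficients):
    """AIC as printed by R's step() for lm fits: n*log(RSS/n) + 2k."""
    return n * math.log(rss / n) + 2 * n_coefficients

n_obs = 93 + 9  # residual df plus coefficients in the full model (inferred)

aic_full = step_aic(rss=67.150, n=n_obs, n_coefficients=9)   # intercept + 8 predictors
aic_final = step_aic(rss=69.180, n=n_obs, n_coefficients=5)  # intercept + 4 predictors

print(round(aic_full, 2), round(aic_final, 2))
```

The RSS rises slightly as terms are dropped, but the penalty of 2 per coefficient falls faster, so the AIC keeps decreasing until no removal lowers it further.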
Question D [5 marks]
The climate dataset contains various sea surface temperature anomalies (AO and NPI) as well
as rainfall data, ice cover data, temperature, and the year each of the average measurements
were made. Explore the R output and then answer questions i, ii, and iii.
Arctic Oscillation AO - Annual
North Pacific Index NPI - Annual
Ice - Annual, January to July, October to December, coverage, ice free days
Temp - Annual, summer and winter
Rain - Annual, summer and winter
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.7825 1.6162 1.4399 1.3051 1.00640 0.90898 0.67192
Proportion of Variance 0.2444 0.2009 0.1595 0.1310 0.07791 0.06356 0.03473
Cumulative Proportion 0.2444 0.4453 0.6048 0.7358 0.81373 0.87729 0.91202
PC8 PC9 PC10 PC11 PC12 PC13
Standard deviation 0.62628 0.50743 0.48147 0.38410 0.29745 0.16192
Proportion of Variance 0.03017 0.01981 0.01783 0.01135 0.00681 0.00202
Cumulative Proportion 0.94219 0.96200 0.97983 0.99118 0.99798 1.00000
PC1 PC2 PC3 PC4 PC5
AO 0.156287502 -0.50815718 0.12983848 -0.008177511 0.14421247
NPI 0.250939630 -0.42144515 -0.27943937 0.050866762 0.10421175
Temp 0.165175832 0.20243032 -0.01502539 -0.474768071 -0.56365852
SummerTemp 0.180231809 -0.29613178 -0.25368167 -0.195303153 -0.16756395
WinterTemp -0.266893813 0.14433019 -0.14234741 -0.428832652 0.34726092
Rain 0.213236238 -0.24402773 0.44640553 -0.314164836 0.06113113
SummerRain -0.001514271 -0.13388458 0.58070456 -0.229915571 0.21086146
WinterRain 0.149930136 -0.24867925 -0.47822831 -0.113466132 0.09667562
Ice -0.480852604 -0.23695449 0.01721970 -0.093068314 -0.20416734
Ice_JanJul -0.442093681 -0.16548521 -0.04652799 -0.151069775 -0.35809493
Ice_OctDec -0.271927458 -0.42494690 0.06788669 0.026653770 -0.21552433
IceCover 0.153102429 -0.05717217 0.21768767 0.532971801 -0.39965508
IceFreeDays 0.435433891 0.08921713 0.02359594 -0.263081329 -0.26139430
[Figure: Screeplot climate PCA (y-axis: Variances)]
[Figure: biplot of the climate PCA, with variable labels such as IceFreeDays]
(i) According to Kaiser’s criterion, how many principal components would you consider in your analysis?
Answer
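Kaiser's criterion retains components whose eigenvalue (variance) exceeds 1, which is the average variance when the PCA is run on standardised variables. The Python sketch below squares the standard deviations from the "Importance of components" table; it assumes the PCA was scaled, which the variances summing to about 13 (the number of variables) supports.

```python
# Standard deviations of PC1..PC13 from the "Importance of components" table.
standard_deviations = [
    1.7825, 1.6162, 1.4399, 1.3051, 1.00640, 0.90898, 0.67192,
    0.62628, 0.50743, 0.48147, 0.38410, 0.29745, 0.16192,
]

# Eigenvalue of each component is its variance, i.e. the squared standard deviation.
eigenvalues = [sd ** 2 for sd in standard_deviations]

# Kaiser's criterion: keep components with eigenvalue greater than 1.
retained = [i + 1 for i, ev in enumerate(eigenvalues) if ev > 1.0]

print(retained)
print(round(sum(eigenvalues), 2))
```

The last retained component sits only just above the cut-off, so it is worth checking this count against the scree plot.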
(ii) Consider the biplot of the climate data and report three relationships of interest.
Answer
(iii) Summarise the loadings of the first three principal components.
Answer
End of Short Answer Questions
Equations
Sample variance
Confidence interval for mean, given unknown standard deviation
$95\%\,CI = \bar{y} \pm t^{0.025}_{n-1} \times SE(\bar{y})$
Variance of the mean for simple random sampling (SiR)
Degrees of freedom for simple random sampling
Mean for stratified random sampling (StR)
Variance of the mean for stratified random sampling (StR)
Degrees of freedom for stratified random sampling
df = n - H
Variance of the change in mean
Covariance between 2 sets of observations
Total sum of squares
TotalSS = TreatmentSS + ResidualSS
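Several of the sampling headings above appear without a formula. Assuming the usual textbook notation (n sampled units, H strata with stratum weights $w_h$ and stratum means $\bar{y}_h$, and subscripts 1 and 2 for the two sampling periods; these symbols are assumptions, not taken from the sheet), the standard forms are:

```latex
\begin{align*}
\text{Sample variance:} \quad
  s^2 &= \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2 \\
\text{Variance of the mean (SiR):} \quad
  \widehat{\mathrm{Var}}(\bar{y}) &= \frac{s^2}{n}, \qquad df = n-1 \\
\text{Mean (StR):} \quad
  \bar{y}_{StR} &= \sum_{h=1}^{H} w_h\,\bar{y}_h \\
\text{Variance of the mean (StR):} \quad
  \widehat{\mathrm{Var}}(\bar{y}_{StR}) &= \sum_{h=1}^{H} w_h^2\,\widehat{\mathrm{Var}}(\bar{y}_h) \\
\text{Variance of the change in mean:} \quad
  \mathrm{Var}(\bar{y}_2-\bar{y}_1) &= \mathrm{Var}(\bar{y}_1)+\mathrm{Var}(\bar{y}_2)
      -2\,\mathrm{Cov}(\bar{y}_1,\bar{y}_2) \\
\text{Covariance:} \quad
  \mathrm{Cov}(y_1,y_2) &= \frac{1}{n-1}\sum_{i=1}^{n}
      \left(y_{1i}-\bar{y}_1\right)\left(y_{2i}-\bar{y}_2\right)
\end{align*}
```

The covariance term explains the monitoring question earlier in the paper: a large positive covariance between periods, obtained by revisiting the same units, reduces the variance of the estimated change.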
One way ANOVA table
Source      df        SS      MS              F-ratio
Treatment   (t - 1)   TrtSS   TrtSS/(t - 1)   TrtMS/ResMS
Residual    (N - t)   ResSS   ResSS/(N - t)
Total       (N - 1)   TotSS
ANOVA table with Block
Source      df               SS      MS                       F-ratio
Block       b - 1            BlkSS   BlkSS/(b - 1)
Treatment   t - 1            TrtSS   TrtSS/(t - 1)            TrtMS/ResMS
Residual    (b - 1)(t - 1)   ResSS   ResSS/((b - 1)(t - 1))
Total       bt - 1           TotSS
Full factorial ANOVA table with Block
Source         df                 SS        MS                           F-ratio
Block          b - 1              BlkSS     BlkSS/(b - 1)
Treatment A    tA - 1             TrtASS    TrtASS/(tA - 1)              TrtAMS/ResMS
Treatment B    tB - 1             TrtBSS    TrtBSS/(tB - 1)              TrtBMS/ResMS
Treatment AB   (tA - 1)(tB - 1)   TrtABSS   TrtABSS/((tA - 1)(tB - 1))   TrtABMS/ResMS
Residual       (b - 1)(tAB - 1)   ResSS     ResSS/((b - 1)(tAB - 1))
Total          bt - 1             TotSS
Treatment SS
TreatmentSS = TreatmentASS + TreatmentBSS + TreatmentABSS
A.1 Some probabilities for the cumulative standard normal distribution
The distribution tabulated is for the normal distribution with mean 0 and standard deviation 1.
For each value of z, the table gives the proportion, P, of the distribution less than z, P(Z < z).
ENVX1001 Introductory Statistical Methods
A.2 Some right-tail critical values for the Student’s T distribution
The distribution tabulated is that of Student’s t. The first column is the degrees of freedom (df).
The remaining columns give either the one tailed (upper tail) critical values so that $P(T_{df} > t) = P$, or the two tailed critical values so that $P(T_{df} > t \text{ or } T_{df} < -t) = P$, where P is the probability
shown at the top of the columns.
A.3 Some right-tail critical values for the Chi-Squared ($\chi^2$) Distribution
The distribution tabulated is that of $\chi^2$. The first column is the degrees of freedom (df). The
remaining columns give the upper tail critical values so that $P(\chi^2_{df} > x^2) = P$, where P is the
probability shown at the top of the columns.
For larger degrees of freedom than tabulated here, use the normal approximation to the $\chi^2$, and refer z to the “Table of Probabilities for the Standard Normal
Distribution”.
A.4 Table of Probabilities of Fisher’s F Distribution
The distribution tabulated is that of Fisher’s F. The numerator degrees of freedom ($\nu_1$) are given by the column position and the denominator degrees of freedom
($\nu_2$) are given by the row position. The values in the body of the table are the upper tail critical values so that $P(F > f) = P$, where P is the probability shown
(0.10, 0.05, 0.01).
Question 1
A survey of deer was performed in a catchment based on surveying in a number of 1km by 1km
parcels of land. Based on exploratory data analysis you have the following information.
y = 14.3 deer/ha; s = 3.0 deer/ha; n = 10; t
0.025
10 = 2.228; t
0.025
9 = 2.262.
The 95% confidence interval around the mean is?
(a) [2.65, 28.65]
(b) [12.15, 16.45]
(c) [8.03, 24.65]
(d) [3.24, 21.32]
(e) none of the above
Question 2
A survey of deer was performed in a catchment based on surveying in a number of 1km by 1km
parcels of land. Based on exploratory data analysis you have the following information.
$\bar{y}$ = 14.3 deer/ha; s = 3.0 deer/ha; n = 10.
When you survey next year, you want the survey to have a standard error of the mean equal to
0.75. Based on the statistics above, how many parcels of land will you survey next year?
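One way to approach this, sketched below, is to rearrange the standard error formula SE = s / sqrt(n) for n. This assumes that next year's standard deviation will be similar to this year's; the variable names are illustrative.

```r
s         <- 3.0    # sample standard deviation from this year's survey (deer/ha)
target_se <- 0.75   # desired standard error of the mean

# SE = s / sqrt(n), so n = (s / SE)^2; round up to a whole number of parcels
n_required <- ceiling((s / target_se)^2)
n_required   # 16
```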
Question 3
When designing a monitoring scheme which of the following statements is correct?
(a) If we have a large covariance between the 2 sampling periods then it is most likely best that
we resample the same locations or sampling units.
(b) If we have a small covariance between the 2 sampling periods then it is most likely best that
we resample the same locations or sampling units.
(c) If we have zero covariance between the 2 sampling periods then it is most likely best that
we resample the same locations or sampling units.
(d) It does not matter whether we resample the same units.
(e) None of the above.
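The link between covariance and the choice of sampling units follows from the standard identity for the variance of a difference between two means. As a sketch, writing $\bar{y}_1$ and $\bar{y}_2$ for the means at the two sampling periods (symbols chosen here for illustration):

```latex
\mathrm{Var}(\bar{y}_2 - \bar{y}_1)
  = \mathrm{Var}(\bar{y}_1) + \mathrm{Var}(\bar{y}_2)
  - 2\,\mathrm{Cov}(\bar{y}_1, \bar{y}_2)
```

A large positive covariance, which can only be exploited when the same units are revisited, shrinks the variance of the estimated change. With independently chosen units the covariance term is zero.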
Question 4
Soil carbon was measured in a field at the start of a season and at the end of the season. The
aim is to estimate the change in mean carbon for the field and see if the change is statistically
significant. You sampled different locations between the 2 sampling events.
The outputs from 2 different analyses are shown below.
Two Sample t-test
data: init and fin
t = 0.28987, df = 10, p-value = 0.7778
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-28.97550 37.64217
sample estimates:
mean of x mean of y
42.83333 38.50000
Paired t-test
data: init and fin
t = 6.0613, df = 5, p-value = 0.001764
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.495572 6.171095
sample estimates:
mean of the differences
4.333333
Which of the following statements is true?
(a) The most appropriate analysis is the paired t-test and we can state that there was a significant
change in soil carbon over the season.
(b) The most appropriate analysis is the paired t-test and we can state that there wasn’t a
significant change in soil carbon over the season.
(c) The most appropriate analysis is the two sample t-test and we can state that there was a
significant change in soil carbon over the season.
(d) The most appropriate analysis is the two sample t-test and we can state that there wasn’t a
significant change in soil carbon over the season.
(e) None of the above.
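The two outputs above correspond to two different calls to `t.test`. The sketch below uses hypothetical carbon values (not the exam data) with six samples per event, and assumes the two-sample test used equal variances, which is consistent with the 10 degrees of freedom reported.

```r
# Hypothetical soil carbon values, six samples per sampling event
init <- c(45, 30, 62, 38, 51, 31)
fin  <- c(41, 25, 58, 35, 46, 26)

# Independent samples: different locations are sampled at each event
two_sample <- t.test(init, fin, var.equal = TRUE)

# Paired samples: only valid when the same locations are revisited
paired <- t.test(init, fin, paired = TRUE)

two_sample$parameter   # df = n1 + n2 - 2 = 10
paired$parameter       # df = n - 1 = 5
```

The choice between the two calls is set by the sampling design, not by which p-value is smaller.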
Question 5
When performing a survey, stratified random sampling is better than simple random sampling
because it is likely to
(a) give a more representative sample.
(b) give a better estimate of the mean and variance.
(c) give more precise estimates of the mean.
(d) all of a, b and c.
(e) none of these/insufficient information.
Questions 6-7 relate to the analysis of the following dataset.
The protein content of milk (%) of two breeds of cattle was compared, with a random sample of
20 of each breed being selected. The following table of descriptive statistics was obtained.
Breed n Mean SD
Breed 1 20 3.352 0.212
Breed 2 20 3.681 0.233
The data were analysed using a pooled (i.e. equal variance) two-sample t-test, and a t-value
of 4.67 was obtained. The data were subsequently re-analysed using an analysis of variance.
Answer the following questions:
Question 6
In this re-analysis, what is the F-value?
Question 7
In this re-analysis, what are the Breed df and Residual df?
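As a sketch, the relationship between the pooled t-test and a one-way ANOVA with two groups can be checked from the summary table alone. The variable names are illustrative.

```r
n1 <- 20; mean1 <- 3.352; sd1 <- 0.212   # Breed 1
n2 <- 20; mean2 <- 3.681; sd2 <- 0.233   # Breed 2

# Pooled variance and the pooled two-sample t statistic
sp2    <- ((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2)
t_stat <- (mean2 - mean1) / sqrt(sp2 * (1 / n1 + 1 / n2))

# With two groups, the one-way ANOVA F statistic equals t^2
f_stat      <- t_stat^2
df_breed    <- 2 - 1          # number of groups - 1
df_residual <- n1 + n2 - 2    # total observations - number of groups
c(t = t_stat, F = f_stat, df_breed = df_breed, df_residual = df_residual)
```

The t statistic reproduces the quoted 4.67 to two decimals; squaring the rounded value gives an F of about 21.8.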
Question 8
You are establishing a plant breeding trial at a site with five soil types which would impact on the
yield of the different varieties of wheat you intend to use. The experimental design you would
use is a
(a) completely randomised design.
(b) paired design.
(c) randomised complete block design.
(d) factorial treatment design.
(e) none of these/insufficient information.
Question 9
A field experiment was conducted to compare yields of 10 varieties of wheat. A randomised
complete block design was used for the experiment with four blocks being used, each block containing
all 10 varieties. The data generated from this design were to be analysed using ANOVA.
What are the residual degrees of freedom?
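The residual degrees of freedom for a randomised complete block design are (b - 1)(t - 1). A quick way to confirm this, sketched below, is to fit the model to a placeholder response; the random `yield` values are an assumption used only so that `aov` can be run.

```r
n_blocks <- 4
n_trt    <- 10

# One plot per block-by-variety combination
design <- expand.grid(block = factor(1:n_blocks), variety = factor(1:n_trt))
set.seed(1)
design$yield <- rnorm(nrow(design))   # placeholder response, not real data

fit <- aov(yield ~ block + variety, data = design)
df.residual(fit)                  # residual df from the fitted model
(n_blocks - 1) * (n_trt - 1)      # formula from the ANOVA table with Block
```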
Question 10
A field experiment was conducted to compare weight gains of sheep under 2 pasture systems. 40
sheep were available for the experiment and 20 sheep were randomly allocated to 2 paddocks,
each representing one pasture system. Each sheep was weighed before and after the experiment
to estimate weight gain. Which statement is true?
(a) This is an example of confounding as we cannot disentangle whether it is the pasture system
causing differences in weight gain or other factors that may vary between each paddock, e.g.
soil.
(b) The experimental unit is the paddock and sampling unit is a sheep.
(c) Both the experimental and sampling unit are a sheep.
(d) a and b are True.
(e) a, b and c are True.
Question 11
An experiment was being planned, and R was used to generate a randomisation for the experimental
design. The following output was obtained:
> library(agricolae)
> (Trt <- LETTERS[1:5])
[1] "A" "B" "C" "D" "E"
> design.crd(trt = Trt, r = 4)$book
Based on this code and output, this would be appropriate for the following experimental design:
(a) completely randomised design with four treatments and five replicates per treatment.
(b) completely randomised design with five treatments and four replicates per treatment.
(c) randomised complete block design with four treatments and five blocks.
(d) randomised complete block design with five treatments and four blocks.
(e) none of these.
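For readers without `agricolae`, the same kind of randomisation can be sketched in base R. This is not how `design.crd` is implemented; it only illustrates the idea of allocating each treatment to `r` plots at random, and the `layout` data frame is a hypothetical name.

```r
set.seed(42)            # fixed seed so the layout is reproducible
trt <- LETTERS[1:5]     # five treatments, as in the output above
r   <- 4                # replicates per treatment

# Shuffle the 20 treatment labels across 20 plots with no blocking
layout <- data.frame(plot = seq_len(length(trt) * r),
                     trt  = sample(rep(trt, times = r)))
table(layout$trt)       # each treatment appears r = 4 times
```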
Questions 12-15 relate to the analysis of the following dataset.
An experiment was designed to assess the usefulness of synthetic protein dietary supplements
in cattle, involving a comparison of three different supplements (Supplement A, Supplement B,
Supplement C), with a Control (no supplement). The level of total protein content (g/100 ml)
was measured in the blood of cows. A total of 40 similar cows was used for this experiment,
with ten cows being randomly allocated to each of the four treatments. The following analysis
was undertaken in R. However, not all of the output from the LSD.test function is shown below.
Df Sum Sq Mean Sq F value Pr(>F)
Supplement 3 2.651 0.8838 4.347 0.0103 *
Residuals 36 7.319 0.2033
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
test p.ajusted name.t ntr alpha
Fisher-LSD none Supplement 4 0.05
MSerror Df Mean CV t.value LSD
0.2033186 36 7.24625 6.222648 2.028094 0.4089702
Protein std r LCL UCL Min Max Q25 Q50 Q75
Control 7.017 0.3652716 10 6.727814 7.306186 6.32 7.55 6.8675 7.015 7.2200
Supp A 7.683 0.4541671 10 7.393814 7.972186 7.00 8.57 7.3950 7.645 7.8975
Supp B 7.130 0.4212943 10 6.840814 7.419186 6.34 7.77 6.9550 7.180 7.3275
Supp C 7.155 0.5441456 10 6.865814 7.444186 6.53 8.27 6.6700 7.225 7.4225
Question 12
The proportion of variation in protein explained by Supplement is:
(a) 0.266
(b) 0.362
(c) 0.638
(d) 0.734
(e) none of the above.
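The proportion of variation explained is the treatment sum of squares divided by the total sum of squares. A minimal check using the values in the ANOVA table above (variable names are illustrative):

```r
ss_supplement <- 2.651   # Supplement Sum Sq
ss_residual   <- 7.319   # Residuals Sum Sq

r_squared <- ss_supplement / (ss_supplement + ss_residual)
round(r_squared, 3)      # 0.266
```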
Question 13
Assuming that Residuals ~ N(0, $\sigma^2$), what is the correct value of $\sigma^2$?
(a) 2.651
(b) 7.319
(c) 0.8838
(d) 0.2033
(e) none of the above.
Question 14
From the output, which treatment(s) show significantly (α = 0.05) the highest blood protein
content?
(a) Control + Supp B + Supp C
(b) Supp A
(c) Supp B
(d) Supp C
(e) none of the above.
Question 15
From the output, which treatment(s) show significantly (α = 0.05) the lowest blood protein
content?
(a) Control + Supp B + Supp C
(b) Supp A
(c) Supp B
(d) Supp C
(e) none of the above.
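Since the grouping letters from `LSD.test` are not shown, the comparison can be reconstructed from the printed quantities. The sketch below recomputes the LSD and flags which pairs of treatment means differ by more than it; the object names are illustrative.

```r
ms_error <- 0.2033186   # MSerror from the output
df_error <- 36          # residual degrees of freedom
r        <- 10          # cows per treatment

# LSD = t critical value x standard error of a difference
lsd <- qt(0.975, df_error) * sqrt(2 * ms_error / r)

means <- c(Control = 7.017, SuppA = 7.683, SuppB = 7.130, SuppC = 7.155)
diffs <- abs(outer(means, means, "-"))   # all pairwise absolute differences
lsd                                      # about 0.409, matching the output
diffs > lsd                              # TRUE where a pair differs significantly
```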
Questions 16-20 relate to the analysis of the following dataset.
Researchers are interested in what explains the relative abundance of C3 and C4 plants at 73
sites in North America. The data contains the following:
Response variable (y)
C3: relative abundance of C3 plants at 73 sites
C4: relative abundance of C4 plants at 73 sites
Predictor variables
1. MAP: Mean annual rainfall (mm) at the site.
2. MAT: Mean annual temperature (degrees C) at the site.
3. JJAMAP: proportion of mean annual rainfall in June, July and August (summer rainfall).
4. DJFMAP: proportion of mean annual rainfall in December, January and February (winter
rainfall).
5. LAT: Latitude in centesimal degrees.
6. LONG: Longitude in centesimal degrees.
The researchers were first interested in C3 plants before also looking at C4 plants. After initial
inspection of the data, the researchers decided to log10 transform both the response variables
(C3 and C4). Part of the output of the model predicting the abundance of C3 plants gives the
coefficients and the statistical detail. Based on this, answer the following 2 questions.
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.986574e-01 3.329192e-01 -2.0985795 0.039684448
MAP 7.784743e-05 5.843685e-05 1.3321634 0.187388322
MAT 1.637150e-03 3.123558e-03 0.5241298 0.601944104
JJAMAP -9.622251e-02 1.276384e-01 -0.7538678 0.453609777
DJFMAP -1.854179e-01 1.924485e-01 -0.9634680 0.338829626
LONG 2.931004e-03 2.695546e-03 1.0873509 0.280836839
LAT 1.243240e-02 2.667459e-03 4.6607671 0.000015784
Question 16
How many degrees of freedom are there to calculate the significance of this model?
(a) 7
(b) 73
(c) 66
(d) you can’t work that out from this data.
Question 17
Is the model significant?
(a) no, because all $\beta_i \neq 0$
(b) yes, because at least one $\beta_i > 0$
(c) yes, because all $\beta_i \neq 0$
(d) you can’t tell this from this information.
The researchers subsequently did the same analysis on the C4 plants and found the following
results:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.886485e-01 2.963954e-01 2.32341153 0.02324746
MAP 8.551407e-05 5.202588e-05 1.64368332 0.10499743
MAT 3.492573e-03 2.780880e-03 1.25592384 0.21357141
JJAMAP 2.952668e-01 1.136355e-01 2.59836713 0.01154167
DJFMAP -1.495957e-02 1.713354e-01 -0.08731162 0.93068818
LONG -5.033666e-03 2.399823e-03 -2.09751523 0.03978172
LAT -5.316684e-03 2.374818e-03 -2.23877505 0.02854770
Question 18
Which of the following statements is TRUE?
(a) With each unit increase in JJAMAP rainfall the abundance of C4 plants increases by 1.97
units, with all other variables held constant.
(b) With each unit increase in JJAMAP rainfall the abundance of C4 plants increases by 0.2953
units, with all other variables held constant.
(c) With each unit increase in JJAMAP rainfall the abundance of C4 plants decreases by 1.97
units.
(d) You cannot calculate this from this data.
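Because the response was log10 transformed, a coefficient acts additively on the log10 scale and multiplicatively on the original scale. The short sketch below back-transforms the JJAMAP estimate; the variable names are illustrative.

```r
b_jjamap <- 2.952668e-01   # JJAMAP estimate from the C4 model (log10 scale)

# A one-unit increase in JJAMAP adds b_jjamap to log10(C4),
# which multiplies C4 itself by 10^b_jjamap
multiplier <- 10^b_jjamap
round(multiplier, 2)       # about 1.97
```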
Question 19
Rather than including all the variables, the researchers decided to look at simpler models that
included fewer variables. The first model they tried was based only on Longitude and Latitude.
Call:
lm(formula = Log10C4 ~ LONG + LAT, data = C4data1)
Residuals:
Min 1Q Median 3Q Max
-0.120517 -0.043905 -0.007991 0.046580 0.161236
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.475395 0.127024 11.615 < 2e-16 ***
LONG -0.010037 0.001126 -8.915 3.79e-13 ***
LAT -0.007724 0.001366 -5.653 3.17e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.06119 on 70 degrees of freedom
Multiple R-squared: 0.636, Adjusted R-squared: 0.6256
F-statistic: 61.16 on 2 and 70 DF, p-value: 4.347e-16
Explain from a statistical model point of view why the p-value and the estimate of LONG and
LAT have changed compared to the earlier full model.
(a) Because it is a totally different model and data.
(b) Because LONG and LAT individually have a lot more explaining power with fewer variables
in the model.
(c) Because the true value of the estimates of LONG and LAT is only visible in a simple linear
regression.
(d) Because the estimates of LONG and LAT are partial regression coefficients and both estimate
and p-value changes with the number of variables in the model.
Question 20
The researchers also looked at a different model, which also includes JJAMAP, and then ran a
partial F-test to see which model was the best.
Call:
lm(formula = Log10C4 ~ LONG + LAT + JJAMAP, data = C4data1)
Residuals:
Min 1Q Median 3Q Max
-0.12407 -0.03799 -0.01356 0.03388 0.14974
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.228906 0.144208 8.522 2.22e-12 ***
LONG -0.008134 0.001230 -6.614 6.60e-09 ***
LAT -0.008287 0.001303 -6.359 1.89e-08 ***
JJAMAP 0.230981 0.074997 3.080 0.00297 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.05779 on 69 degrees of freedom
Multiple R-squared: 0.68, Adjusted R-squared: 0.6661
F-statistic: 48.87 on 3 and 69 DF, p-value: < 2.2e-16
Analysis of Variance Table
Model 1: Log10C4 ~ LONG + LAT + JJAMAP
Model 2: Log10C4 ~ LONG + LAT
Res.Df RSS Df Sum of Sq F Pr(>F)
1 69 0.23042
2 70 0.26209 -1 -0.031676 9.4855 0.002973 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Which model did the researchers choose as their final model and why?
- Model 1: Model with Latitude, Longitude and JJAMAP.
- Model 2: Model with only Latitude and Longitude.
(a) Model 2, as this model has fewer variables and based on the principle of parsimony this is
the best model.
(b) Model 1, as this has the higher r-squared.
(c) Model 1 as the F-test says the difference between the models is significant and therefore
adding a variable is warranted.
(d) Model 2 as the F-test says the difference between the models is significant and therefore
adding a variable is not warranted.
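The partial F statistic in the table can be reproduced, up to rounding, from the two residual sums of squares. This is a sketch using the rounded values printed above, so the result differs slightly in the third decimal.

```r
rss_full    <- 0.23042; df_full    <- 69   # Model 1: LONG + LAT + JJAMAP
rss_reduced <- 0.26209; df_reduced <- 70   # Model 2: LONG + LAT

# Extra sum of squares per extra parameter, relative to the full-model residual MS
f_stat  <- ((rss_reduced - rss_full) / (df_reduced - df_full)) / (rss_full / df_full)
p_value <- pf(f_stat, df_reduced - df_full, df_full, lower.tail = FALSE)
c(F = f_stat, p = p_value)   # F is about 9.48, p is about 0.003
```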
Question 21
The researchers redid the analysis with a smaller dataset (they removed 20% of the data) so
they could validate the best model. They verified that they got similar results with the reduced
calibration data set as with the full data set by looking at the coefficients in the following table:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.225171731 0.158891238 7.710757 2.586135e-10
LONG -0.007966020 0.001385858 -5.748077 4.091923e-07
LAT -0.008301983 0.001501368 -5.529612 9.134469e-07
JJAMAP 0.206330437 0.089569087 2.303590 2.505033e-02
They then calculated the correlation, r, between the calibration data and the validation data, to
check the performance of the models. They also calculated Lin’s concordance.
Calibration_r Validation_r
1 0.809256 0.8798249
Calibration_Lins.est Calibration_Lins.lower Calibration_Lins.upper
1 0.7914643 0.6811801 0.8666261
Validation_Lins.est Validation_Lins.lower Validation_Lins.upper
1 0.7494391 0.4662542 0.8933096
Based on this, which one of the following conclusions can be drawn?
(a) The model does not perform well as the validation results are considerably lower than the
calibration results.
(b) The correlation coefficient alone is sufficient to indicate how well the model performs in
validation and calibration.
(c) Lin’s concordance indicates the relationship between validated and calibrated predictions does
not follow the 1:1 line and therefore the model performs poorly.
(d) Lin’s concordance indicates that both validation and calibration results follow the 1:1 line
between predicted and observed quite well and therefore the model performs well.
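To see why correlation and Lin's concordance can disagree, the sketch below implements the concordance coefficient directly from its definition. The function `lins_ccc` and the data are hypothetical; this is not the function that produced the output above.

```r
# Lin's concordance correlation coefficient, using 1/n variances and covariance
lins_ccc <- function(obs, pred) {
  n    <- length(obs)
  s_xy <- sum((obs - mean(obs)) * (pred - mean(pred))) / n
  s_x2 <- sum((obs - mean(obs))^2) / n
  s_y2 <- sum((pred - mean(pred))^2) / n
  2 * s_xy / (s_x2 + s_y2 + (mean(obs) - mean(pred))^2)
}

obs          <- c(1.0, 1.2, 1.5, 1.9, 2.3)   # hypothetical observations
pred_shifted <- obs + 0.5                    # predictions with a constant bias

cor(obs, pred_shifted)        # 1: perfectly correlated
lins_ccc(obs, pred_shifted)   # below 1: penalised for departing from the 1:1 line
lins_ccc(obs, obs)            # 1: perfect agreement
```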
Question 22
Which of the following are multivariate methods?
(a) Regression.
(b) Classification.
(c) Clustering.
(d) Ordination.
(e) All of the above.
Question 23
You are going to run an nMDS on a data set which contains numbers of ant species collected at
different sites. Some of the ant species are much higher in abundance than other ant species,
and there are many species absent from some sites. Which of the following statements is true?
(a) You should use a Bray-Curtis similarity matrix.
(b) You should use a 4th root transformation.
(c) You should not use a Euclidean distance matrix.
(d) All of a, b and c.
(e) None of a, b, c.
Question 24
Which of the following is required by K-means clustering?
(a) defined distance metric.
(b) number of clusters.
(c) initial guess as to cluster centroids.
(d) All a, b, and c.
(e) None of a, b, and c.
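As an illustration of the inputs K-means needs in practice, the sketch below runs base R's `kmeans` on simulated data (an assumption, not exam data). The `centers` argument takes either the number of clusters, in which case starting centroids are picked at random, or a matrix of initial centroids.

```r
set.seed(1)
# Two simulated groups of 20 points in two dimensions
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

# Number of clusters supplied; nstart repeats the random initialisation
km <- kmeans(x, centers = 2, nstart = 10)
km$centers          # fitted centroids
table(km$cluster)   # cluster sizes
```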
Question 25
In cluster analysis, objects with larger distances between them are more similar to each other
than are those at smaller distances.
(a) True.
(b) False.
End of multiple choice questions
SHORT ANSWER [25 marks]
Question A [6 marks]
An experiment was performed to examine the impact of different insecticide treatments on the
number of living insect larvae in a rice crop. The experiment had 9 insecticide treatments arranged
in a randomised complete block design with 4 blocks. The control (no insecticide used)
was coded T9. The aim is to have the smallest amount of larvae after treatment.
The R output below shows the results of an analysis on the data.
Analysis of Variance Table
Response: larvae
Df Sum Sq Mean Sq F value Pr(>F)
rep 3 385.64 128.546 3.7550 0.024209 *
trt 8 1255.50 156.937 4.5843 0.001719 **
Residuals 24 821.61 34.234
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
test p.ajusted name.t ntr alpha
Fisher-LSD none trt 9 0.05
MSerror Df Mean CV t.value LSD
34.2338 24 7.916667 73.90693 2.063899 8.538879
[Figure: normal Q-Q plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]
The R output below shows the results of an analysis on the data after log transformation.
Analysis of Variance Table
Response: log_larvae
Df Sum Sq Mean Sq F value Pr(>F)
rep 3 0.9567 0.31889 3.6511 0.0267223 *
trt 8 3.9823 0.49779 5.6995 0.0004092 ***
Residuals 24 2.0961 0.08734
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
test p.ajusted name.t ntr alpha
Fisher-LSD none trt 9 0.05
MSerror Df Mean CV t.value LSD
0.08733941 24 0.7547053 39.15865 2.063899 0.431299
log_larvae groups
Choose and justify the appropriate output based on meeting the model assumptions. Use this
output to identify which insecticide treatments are better at controlling insects than the control
(T9). Explain how you made this decision.
Answer
Question B [7 marks]
An experiment was conducted to compare the yields of leys grown from four mixtures
(labelled A, B, C and D) and three seed rates (4, 6 and 8 g per unit area). A randomised
complete block design was used with three blocks. An analysis of the original data detected
an unstable variance and to remedy this a natural log transformation was used. The following
ANOVA was obtained together with means on the log-scale. Some of the entries have been
omitted in the ANOVA table.
(i) Fill in the values of the cells in the analysis of variance table marked with a *.
Source df SS MS F-ratio P-value
Block * 0.12469 0.0623 4.75
Rate * 0.34706 0.1735 13.22 <0.001
Mixture * 0.76693 0.2556 19.47 <0.001
Rate × Mixture * 0.21969 * * 0.036
Residual * 0.28879 *
Total * 1.74715
A B C D Overall
4 g 2.960 2.462 2.586 2.739 2.700
6 g 2.965 2.486 2.863 2.940 2.807
8 g 2.974 2.788 3.048 2.951 2.940
Overall 2.966 2.579 2.823 2.895 2.816
(ii) What is the least significant difference (LSD) to compare a pair of means of different
mixtures but the same seed rate? The LSD is calculated as $t_{crit} \times SED$, where $SED = \sqrt{ResMS \times 2/rep}$.
The following selected critical values from the t-tables will be useful:
df    P = 0.025 (one tailed), P = 0.05 (two tailed)
6     2.447
22    2.074
30    2.042
40    2.021
Answer
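One way to organise the calculation is sketched below. It assumes that `rep` in the SED formula is the number of blocks (3), since each rate by mixture combination appears once per block, and it derives the residual degrees of freedom from the design described in the question.

```r
n_blocks <- 3; n_rates <- 3; n_mix <- 4

# Degrees of freedom for the factorial ANOVA with blocks
df_total <- n_blocks * n_rates * n_mix - 1
df_block <- n_blocks - 1
df_rate  <- n_rates - 1
df_mix   <- n_mix - 1
df_int   <- df_rate * df_mix
df_res   <- df_total - df_block - df_rate - df_mix - df_int

res_ms <- 0.28879 / df_res             # residual mean square
sed    <- sqrt(res_ms * 2 / n_blocks)  # standard error of a difference
lsd    <- 2.074 * sed                  # tabulated two-tailed t for df_res = 22
c(df_res = df_res, ResMS = res_ms, LSD = lsd)
```

Under these assumptions the LSD is about 0.194 on the log scale.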
(iii) Prior to the experiment, the experimenters were particularly interested in comparing seed
mixtures B and C at a seed rate of 6 g per unit area. Provide an estimate of this difference and
the 95% confidence interval for this comparison, initially on the log-scale, then on the original
scale of measurement.
Answer
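A sketch of the interval calculation follows. It takes the B and C means at 6 g from the table of means, assumes 22 residual degrees of freedom and 3 replicates as derived from the design, and back-transforms with `exp` because a natural log was used.

```r
mean_b <- 2.486   # mixture B at 6 g (log scale)
mean_c <- 2.863   # mixture C at 6 g (log scale)

res_ms <- 0.28879 / 22          # residual mean square on 22 df
sed    <- sqrt(res_ms * 2 / 3)  # 3 replicates (blocks) per mean
t_crit <- 2.074                 # tabulated two-tailed 5% value for 22 df

diff_log <- mean_b - mean_c
ci_log   <- diff_log + c(-1, 1) * t_crit * sed

# Back-transforming a difference of logs gives a ratio of yields (B relative to C)
ratio    <- exp(diff_log)
ci_ratio <- exp(ci_log)
round(c(diff_log, ci_log), 3)   # about -0.377 (-0.571, -0.183)
round(c(ratio, ci_ratio), 3)    # about 0.686 (0.565, 0.833)
```

On the original scale the result is a ratio, not a difference: under these assumptions mixture B yields roughly 57% to 83% of mixture C at this seed rate.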
Question C [7 marks]
Data on water quality parameters and algae counts were collected in many different Queensland
lakes and rivers for different algae species. Here we concentrate on the data for
Cylindrospermopsis raciborskii (CR). The water quality data considered were Sulfide (mg/L),
Total N(itrogen) (mg/L), Total P(hosphorus) (mg/L), Dissolved Oxygen (DO, mg/L), Turbidity
(NTU), Temperature (degrees C), pH, and electrical conductivity (Cond) (uS/m). The
researchers are interested in developing the best possible model to predict the occurrence of CR
in Queensland waters.
Here is a snippet of the data.
| | SulfidesmgL| TotalNmgL| TotalPmgL| ConduScm| DOmgL|
|:--|-----------:|---------:|---------:|---------:|---------:|
|23 | 0.02| 1.450| 0.085| 788.9091| 1.308182|
|26 | 0.00| 0.455| 0.040| 235.4000| 14.428000|
|38 | 0.02| 1.600| 0.095| 1009.5000| 3.662500|
|54 | 0.02| 0.650| 0.045| 399.0000| 5.240000|
|61 | 0.01| 0.295| 0.024| 153.5000| 7.500000|
(i) On inspection of the data, the researchers decide to log10 transform most of the data
columns, namely SulfidesmgL, TotalNmgL, TotalPmgL, ConduScm, DOmgL, TurbidityNTU,
and CR. Subsequently a correlation table was generated. Based on the correlation matrix, explain
which independent variable you suspect will be the best predictor for CR in a single variable
linear regression (simple linear regression)
SulfidesmgL TotalNmgL TotalPmgL ConduScm DOmgL
SulfidesmgL 1.000000000 0.011520667 0.009485123 0.123405309 0.022352894
TotalNmgL 0.011520667 1.000000000 0.999325874 -0.014672749 -0.007831802
TotalPmgL 0.009485123 0.999325874 1.000000000 -0.017052459 -0.009546349
ConduScm 0.123405309 -0.014672749 -0.017052459 1.000000000 -0.173858384
DOmgL 0.022352894 -0.007831802 -0.009546349 -0.173858384 1.000000000
pH 0.103157999 -0.147361880 -0.158040039 0.239738159 0.109014015
Temperature 0.091086576 -0.016655411 -0.012593383 0.212635719 -0.118745738
TurbidityNTU 0.140494654 -0.027861083 -0.013408112 -0.001087102 -0.070986845
CR 0.022925155 -0.056605979 -0.055010325 -0.232874072 0.109641652
pH Temperature TurbidityNTU CR
SulfidesmgL 0.10315800 0.09108658 0.140494654 0.02292515
TotalNmgL -0.14736188 -0.01665541 -0.027861083 -0.05660598
TotalPmgL -0.15804004 -0.01259338 -0.013408112 -0.05501032
ConduScm 0.23973816 0.21263572 -0.001087102 -0.23287407
DOmgL 0.10901402 -0.11874574 -0.070986845 0.10964165
pH 1.00000000 0.22301849 -0.087793362 -0.12148631
Temperature 0.22301849 1.00000000 0.204382392 -0.29034452
TurbidityNTU -0.08779336 0.20438239 1.000000000 -0.11048542
CR -0.12148631 -0.29034452 -0.110485416 1.00000000
Answer
(ii) The researchers continued with a multiple regression analysis on all the data for CR. The
model output is given below. Explain whether this is a satisfactory statistical relationship.
Answer
Call:
lm(formula = log10CR ~ ., data = CR_QAlgae_tr)
Residuals:
Min 1Q Median 3Q Max
-2.28110 -0.69332 -0.00444 0.64203 2.07150
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.91824 1.84634 4.289 4.39e-05 ***
pH -0.26345 0.23046 -1.143 0.2559
Temperature -0.01908 0.02783 -0.686 0.4946
log10SulfidesmgL -3.07906 12.90546 -0.239 0.8120
log10TotalNmgL 4.87464 1.04979 4.643 1.12e-05 ***
log10TotalPmgL -11.31783 2.34518 -4.826 5.44e-06 ***
log10ConduScm -0.79975 0.38771 -2.063 0.0419 *
log10DOmgL 0.27440 0.53338 0.514 0.6082
log10TurbidityNTU -0.55953 0.21525 -2.599 0.0109 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8497 on 93 degrees of freedom
Multiple R-squared: 0.3331, Adjusted R-squared: 0.2758
F-statistic: 5.807 on 8 and 93 DF, p-value: 5.251e-06
(iii) Why is the adjusted $r^2$ value important for model selection, and how does this relate to the principle
of parsimony?
Answer
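The penalty that the adjusted $r^2$ applies for each extra predictor can be seen by recomputing it from the model summary in part (ii). In this sketch the sample size is inferred as 93 residual df plus 9 estimated coefficients; the variable names are illustrative.

```r
r2 <- 0.3331     # Multiple R-squared from the output
p  <- 8          # number of predictors
n  <- 93 + p + 1 # residual df + predictors + intercept = 102 observations

# Adjusted R-squared shrinks R-squared according to the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 3)   # about 0.276, in line with the reported 0.2758
```

Adding a predictor always raises the multiple R-squared, but it raises the adjusted value only if the improvement in fit outweighs the lost degree of freedom.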
(iv) The following residual plots were obtained for the full multi-linear regression model with three
variables. Based on the residual plots, argue whether there are any concerns about continuing
with the regression analysis.
Answer
[Figure: residual plots, including a normal Q-Q plot; x-axis: Theoretical Quantiles, y-axis: Standardized residuals]
(v) Backward elimination was performed and the following output was obtained. Using the
output explain in some statistical detail what the best final model for predicting log10(CR)
is and how the output of the variable selection procedure informs you of this (Indicate what
information you would be looking at).
Answer
Start: AIC=-24.64
log10CR ~ pH + Temperature + log10SulfidesmgL + log10TotalNmgL +
log10TotalPmgL + log10ConduScm + log10DOmgL + log10TurbidityNTU
Df Sum of Sq RSS AIC
- log10SulfidesmgL 1 0.0411 67.191 -26.5783
- log10DOmgL 1 0.1911 67.341 -26.3509
- Temperature 1 0.3394 67.489 -26.1264
- pH 1 0.9436 68.093 -25.2175
- log10ConduScm 1 3.0722 70.222 -22.0777
- log10TurbidityNTU 1 4.8790 72.029 -19.4865
- log10TotalNmgL 1 15.5683 82.718 -5.3725
- log10TotalPmgL 1 16.8166 83.966 -3.8447
Step: AIC=-26.58
log10CR ~ pH + Temperature + log10TotalNmgL + log10TotalPmgL +
log10ConduScm + log10DOmgL + log10TurbidityNTU
Df Sum of Sq RSS AIC
- log10DOmgL 1 0.1766 67.368 -28.3106
- Temperature 1 0.3536 67.545 -28.0430
- pH 1 0.9782 68.169 -27.1040
- log10ConduScm 1 3.1269 70.318 -23.9386
- log10TurbidityNTU 1 5.0982 72.289 -21.1186
- log10TotalNmgL 1 15.5567 82.748 -7.3361
- log10TotalPmgL 1 16.8151 84.006 -5.7966
Step: AIC=-28.31
log10CR ~ pH + Temperature + log10TotalNmgL + log10TotalPmgL +
log10ConduScm + log10TurbidityNTU
Df Sum of Sq RSS AIC
- Temperature 1 0.5775 67.945 -29.4399
- pH 1 0.8772 68.245 -28.9909
- log10ConduScm 1 3.6333 71.001 -24.9526
- log10TurbidityNTU 1 5.2317 72.599 -22.6819
- log10TotalNmgL 1 15.3830 82.751 -9.3325
- log10TotalPmgL 1 16.6387 84.006 -7.7963
Step: AIC=-29.44
log10CR ~ pH + log10TotalNmgL + log10TotalPmgL + log10ConduScm +
log10TurbidityNTU
Df Sum of Sq RSS AIC
- pH 1 1.2354 69.180 -29.6020
- log10ConduScm 1 4.6170 72.562 -24.7341
- log10TurbidityNTU 1 6.1241 74.069 -22.6372
- log10TotalNmgL 1 18.6416 86.587 -6.7104
- log10TotalPmgL 1 19.8658 87.811 -5.2783
Step: AIC=-29.6
log10CR ~ log10TotalNmgL + log10TotalPmgL + log10ConduScm + log10TurbidityNTU
Df Sum of Sq RSS AIC
- log10TurbidityNTU 1 5.6547 74.835 -23.5879
- log10ConduScm 1 6.2639 75.444 -22.7609
- log10TotalNmgL 1 18.5837 87.764 -7.3326
- log10TotalPmgL 1 19.1970 88.377 -6.6223
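Output of this form is produced by R's `step` function. Since the algae data are not available here, the sketch below runs the same procedure on the built-in `mtcars` dataset purely as a stand-in; the model and variables are unrelated to the exam.

```r
# Full model with every available predictor (stand-in data, not the algae data)
full <- lm(mpg ~ ., data = mtcars)

# Backward elimination: drop the term whose removal lowers AIC the most,
# and stop when no single removal lowers AIC any further
final <- step(full, direction = "backward", trace = 0)

formula(final)          # terms retained in the final model
extractAIC(full)[2]     # AIC of the starting model
extractAIC(final)[2]    # AIC of the final model (never higher than the start)
```

In the exam output, the last step lists only removals that would increase AIC from -29.6, which is the signal that elimination has stopped.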
Question D [5 marks]
The climate dataset contains various sea surface temperature anomalies (AO and NPI) as well
as rainfall data, ice cover data, temperature, and the year each of the average measurements
were made. Explore the R output and then answer questions i, ii, and iii.
Arctic Oscillation AO - Annual
North Pacific Index NPI - Annual
Ice - Annual, January to July, October to December, coverage, ice free days
Temp - Annual, summer and winter
Rain - Annual, summer and winter
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.7825 1.6162 1.4399 1.3051 1.00640 0.90898 0.67192
Proportion of Variance 0.2444 0.2009 0.1595 0.1310 0.07791 0.06356 0.03473
Cumulative Proportion 0.2444 0.4453 0.6048 0.7358 0.81373 0.87729 0.91202
PC8 PC9 PC10 PC11 PC12 PC13
Standard deviation 0.62628 0.50743 0.48147 0.38410 0.29745 0.16192
Proportion of Variance 0.03017 0.01981 0.01783 0.01135 0.00681 0.00202
Cumulative Proportion 0.94219 0.96200 0.97983 0.99118 0.99798 1.00000
PC1 PC2 PC3 PC4 PC5
AO 0.156287502 -0.50815718 0.12983848 -0.008177511 0.14421247
NPI 0.250939630 -0.42144515 -0.27943937 0.050866762 0.10421175
Temp 0.165175832 0.20243032 -0.01502539 -0.474768071 -0.56365852
SummerTemp 0.180231809 -0.29613178 -0.25368167 -0.195303153 -0.16756395
WinterTemp -0.266893813 0.14433019 -0.14234741 -0.428832652 0.34726092
Rain 0.213236238 -0.24402773 0.44640553 -0.314164836 0.06113113
SummerRain -0.001514271 -0.13388458 0.58070456 -0.229915571 0.21086146
WinterRain 0.149930136 -0.24867925 -0.47822831 -0.113466132 0.09667562
Ice -0.480852604 -0.23695449 0.01721970 -0.093068314 -0.20416734
Ice_JanJul -0.442093681 -0.16548521 -0.04652799 -0.151069775 -0.35809493
Ice_OctDec -0.271927458 -0.42494690 0.06788669 0.026653770 -0.21552433
IceCover 0.153102429 -0.05717217 0.21768767 0.532971801 -0.39965508
IceFreeDays 0.435433891 0.08921713 0.02359594 -0.263081329 -0.26139430
[Figure: Screeplot climate PCA; y-axis: Variances]
[Figure: biplot of the climate data (only the label IceFreeDays survives)]
(i) According to Kaiser’s criterion, how many components would you consider in your analysis?
Answer
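Kaiser's criterion retains components whose eigenvalue (the squared standard deviation) exceeds 1. A sketch using the standard deviations printed under "Importance of components"; the vector name `sdev` is illustrative.

```r
# Standard deviations of PC1 to PC13, copied from the output above
sdev <- c(1.7825, 1.6162, 1.4399, 1.3051, 1.00640, 0.90898, 0.67192,
          0.62628, 0.50743, 0.48147, 0.38410, 0.29745, 0.16192)

eigenvalues <- sdev^2       # variance explained by each component
sum(eigenvalues > 1)        # 5 components have an eigenvalue above 1
round(eigenvalues[1:6], 3)  # PC5 is only just above 1; PC6 falls below
```

The eigenvalues sum to about 13, the number of standardised variables, so an eigenvalue above 1 means a component explains more than a single original variable would.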
(ii) Consider the biplot of the climate data and report three relationships of interest.
Answer
(iii) Summarise the loadings of the first three principal components.
Answer
End of Short Answer Questions
Equations
Sample variance
Confidence interval for mean, given unknown standard deviation
95% CI = $\bar{y} \pm t^{0.025}_{df} \times SE(\bar{y})$
Variance of the mean for simple random sampling (SiR)
Degrees of freedom for simple random sampling
Mean for stratified random sampling (StR)
Variance of the mean for stratified random sampling (StR)
Degrees of freedom for stratified random sampling
df = n - H
Variance of the change in mean
Covariance between 2 sets of observations
Total sum of squares
TotalSS = TreatmentSS + ResidualSS
One way ANOVA table
Source df SS MS F-ratio
Treatment (t - 1) TrtSS TrtSS/(t - 1) TrtMS/ResMS
Residual (N - t) ResSS ResSS/(N - t)
Total (N - 1) TotSS
ANOVA table with Block
Source df SS MS F-ratio
Block b - 1 BlkSS BlkSS/(b - 1)
Treatment t - 1 TrtSS TrtSS/(t - 1) TrtMS/ResMS
Residual (b - 1)(t - 1) ResSS ResSS/((b - 1)(t - 1))
Total bt - 1 TotSS
Full factorial ANOVA table with Block
Source df SS MS F-ratio
Block b - 1 BlkSS BlkSS/(b - 1)
Treatment A tA - 1 TrtASS TrtASS/(tA - 1) TrtAMS/ResMS
Treatment B tB - 1 TrtBSS TrtBSS/(tB - 1) TrtBMS/ResMS
Treatment AB (tA - 1)(tB - 1) TrtABSS TrtABSS/((tA - 1)(tB - 1)) TrtABMS/ResMS
Residual (b - 1)(tAB - 1) ResSS ResSS/((b - 1)(tAB - 1))
Total bt - 1 TotSS
Treatment SS
TreatmentSS = TreatmentASS + TreatmentBSS + TreatmentABSS
A.1 Some probabilities for the cumulative standard normal distribution
The distribution tabulated is for the normal distribution with mean 0 and standard deviation 1.
For each value of z, the table gives the proportion, P, of the distribution less than z, P(Z < z).
A.2 Some right-tail critical values for the Student’s t distribution
The distribution tabulated is that of Student’s t. The first column is the degrees of freedom (df).
The remaining columns give either the one-tailed (upper tail) critical values so that $P(T_{df} > t) = P$, or the two-tailed critical values so that $P(T_{df} > t \text{ or } T_{df} < -t) = P$, where P is the probability shown at the top of the columns.
A.3 Some right-tail critical values for the Chi-Squared ($\chi^2$) distribution
The distribution tabulated is that of $\chi^2$. The first column is the degrees of freedom (df). The
remaining columns give the upper tail critical values so that $P(\chi^2_{df} > x^2) = P$, where P is the
probability shown at the top of the columns.
For larger degrees of freedom than tabulated here, use the normal approximation to the $\chi^2$, and refer z to the “Table of Probabilities for the Standard Normal Distribution”.
A.4 Table of Probabilities of Fisher’s F Distribution
The distribution tabulated is that of Fisher’s F. The numerator degrees of freedom ($\nu_1$) are given by the column position and the denominator degrees of freedom ($\nu_2$) are given by the row position. The values in the body of the table are the upper tail critical values so that $P(F > f) = P$, where P is the probability shown (0.10, 0.05, 0.01).