ENEN90032 - Environmental Analysis Tools
Assignment 1
Dongryeol Ryu, Naveen Joseph, Shuci Liu, Chihchung Chou, Jie Jian
17 August 2018
Submit an electronic copy (in PDF format) to the Turnitin Submission
Link on LMS by 12pm (NOON) on Wednesday 5 September 2018. Make sure you
meet the Infrastructure Engineering submission requirements (include the
cover sheet with signatures of team members). Include appropriate graphs and
tables in your solution. The report should contain no more than 2500 words,
excluding figures. Submit your Matlab code and the corresponding input data via
the Matlab Code Submission Link on LMS. Give your code files self-explanatory
names (e.g., Q1.m, Q3.m) and comment the code properly. Compress your code and
datasets into one archive and name it with your group number
(i.e., Assignment 01 Group 01.zip).
For the Hypothesis Test problems, please explain the rationale behind your choice
of null hypothesis, test statistic and alternative hypothesis, and behind your
conclusions. 30% of the marks for each section will be based on the quality of
the report and of the figures and tables, wherever they are required. Figures
should be properly labeled and self-explanatory. For each question, please
describe individual members’ contributions briefly. You are expected to work
TOGETHER on all questions, so no assistance will be given with issues that arise
from splitting assignment questions between group members.
1 Exploratory Data Analysis - Meteorological Datasets (20 marks)
Go to Climate Data Online [1] on the Bureau of Meteorology website and choose a
weather station in Victoria. Download the daily maximum temperature and daily
rainfall data collected at that station in 2017. The station should have fewer
than 10 missing values in 2017. Do NOT remove zero rainfall values from the daily
rainfall dataset.
[1] http://www.bom.gov.au/climate/data/
1. Make a table that summarizes the location (sample mean, median and trimean),
spread (sample standard deviation, IQR and median absolute deviation) and
symmetry (sample skewness and Yule-Kendall index) of the datasets. Can
you infer the skewness of the datasets by comparing the mean with the median?
Based on the shape of the distribution (refer to the figures produced in the
next question), discuss the robustness of the summary statistics calculated
above.
2. For the ‘daily maximum temperature data’, fit i) a Gaussian, ii) a Gamma
and iii) a Weibull [2] distribution to the dataset and compare the histogram of
the data (or a Gaussian kernel estimate, which gives a smoothed distribution
of the dataset) with the fitted PDF curves. Also, compare its empirical
cumulative distribution with the fitted CDFs (Gaussian, Gamma and Weibull)
and make a Q-Q plot for evaluation. Which model fits the data best based
on the graphical examinations?
3. For the ‘daily rainfall data’, fit i) a Gaussian and ii) a Gamma distribution.
As in the previous question, compare its empirical cumulative distribution
with the fitted CDFs (Gaussian and Gamma) and make Q-Q plots for evaluation.
Which model fits the data best based on the graphical examinations?
4. For the above fits, calculate the log-likelihood values and use them to
support your judgement above quantitatively.
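As an illustration of the summary statistics requested in item 1, the following is a minimal sketch in Python (standard library only), not a Matlab solution. It assumes the skewness definition that divides the third central moment by n - 1, and quartiles obtained by linear interpolation between order statistics; other conventions give slightly different numbers. The input values are made up and are not station data.

```python
import statistics


def summarize(values):
    """Location, spread and symmetry measures for one sample (missing values removed)."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    # Quartiles by linear interpolation between order statistics (an assumed convention)
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    trimean = (q1 + 2 * med + q3) / 4
    # Median absolute deviation about the median
    mad = statistics.median([abs(v - med) for v in values])
    # Sample skewness: third central moment (divided by n - 1) over sd cubed
    skewness = sum((v - mean) ** 3 for v in values) / (n - 1) / sd ** 3
    # Yule-Kendall index: a quartile-based, outlier-resistant symmetry measure
    yule_kendall = (q1 - 2 * med + q3) / iqr
    return {
        "mean": mean, "median": med, "trimean": trimean,
        "sd": sd, "iqr": iqr, "mad": mad,
        "skewness": skewness, "yule_kendall": yule_kendall,
    }


# Made-up, rainfall-like values (mostly zeros with one large event)
stats_skewed = summarize([0, 0, 0, 1, 9])
stats_symmetric = summarize([1, 2, 3, 4, 5])
```

In the made-up skewed sample the mean lies above the median and both symmetry measures are positive, while the MAD is zero because more than half of the values equal the median. That is the kind of behaviour worth discussing for a rainfall series that keeps its zero values.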
2 One Sample Test - Newcomb-Michelson Velocity of Light Experiments (10 marks)
Simon Newcomb of the Nautical Almanac Office (NAO), U.S., published the velocity
of light [Newcomb, 1883] [3] based on a series of experiments he conducted with
Albert Michelson until 1882. The dataset ‘NewcombLight.txt’ contains 66 samples
(time in seconds taken for light to travel 7442 meters at sea level) Newcomb
collected in 1882. Conduct the t-test and the bootstrap-based one-sample tests
and provide the population mean of light velocity (m/s) and its 99% confidence
interval. Do the estimates show any systematic difference? If so, provide possible
reasons based on the sampling distributions used by the two approaches.
[2] https://en.wikipedia.org/wiki/Weibull_distribution; you may use the Matlab
functions weibcdf, weibfit, weibpdf
[3] http://vigo.ime.unicamp.br/~fismat/newcomb.pdf
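The bootstrap half of this question can be sketched as follows. This is a minimal Python illustration (standard library only) of a percentile bootstrap interval for the mean, not a Matlab solution. The function name `bootstrap_mean_ci`, the number of resamples and the travel times are all assumptions made for the example; the times are made up and are not the NewcombLight.txt values.

```python
import random
import statistics

DISTANCE_M = 7442.0  # path length stated in the question


def bootstrap_mean_ci(sample, n_boot=10000, level=0.99, seed=0):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    # Resample with replacement and record the mean of each resample
    boot_means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_boot)
    )
    alpha = 1.0 - level
    lower = boot_means[round(alpha / 2 * n_boot)]
    upper = boot_means[round((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(sample), lower, upper


# Made-up travel times in seconds, NOT the assignment data
times = [2.4828e-5, 2.4826e-5, 2.4833e-5, 2.4824e-5, 2.4834e-5,
         2.4756e-5, 2.4827e-5, 2.4816e-5, 2.4840e-5, 2.4798e-5]

t_mean, t_lo, t_hi = bootstrap_mean_ci(times)
# Velocity is distance over time, so the time bounds swap when converted
v_mean = DISTANCE_M / t_mean
v_lo, v_hi = DISTANCE_M / t_hi, DISTANCE_M / t_lo
```

The sketch converts the mean travel time to a velocity; averaging the individual velocities instead is an alternative that gives a slightly different number. The percentile interval does not assume a Gaussian sampling distribution, which is one place to look when comparing it with the t-based interval.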
3 Hypothesis Test - Space Shuttle O-Ring Failures (10 marks)
On 27 January 1986, the night before the space shuttle Challenger exploded,
engineers at the company that built the shuttle warned NASA scientists that the
shuttle should not be launched because of predicted cold weather. Fuel seal
problems, which had been encountered in earlier flights, were suspected of being
associated with low temperatures. It was argued, however, that the evidence was
inconclusive. The decision was made to launch, even though the temperature at
launch time was 29 °F.
The dataset ‘O Ring Data.XLS’ summarizes the number of O-ring incidents on
24 space shuttle flights prior to the Challenger disaster. Launch temperature was
below 65 °F for data labeled ‘COOL’ and above 65 °F for data labeled ‘WARM’.
Conduct a permutation test to determine whether the number of O-ring incidents
was associated with the temperature. Use a 99% confidence level with your choice
of a one-sided or two-sided test, and use 10,000 permutations to draw your
conclusion. Justify your choice and show your null distribution as a histogram
with the test statistic marked on it. Make your final recommendation about the
launch of the space shuttle on the day of the accident, based on the quantitative
evidence.
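The mechanics of a permutation test can be sketched as follows. This is a minimal Python illustration (standard library only), not a Matlab solution. It assumes a difference in group means as the test statistic and a one-sided alternative in which the first group has more incidents; both choices are part of what the question asks you to justify. The incident counts are made up and are not the O-ring data.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_perm=10000, seed=0):
    """One-sided permutation test; H1: mean(group_a) > mean(group_b)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = statistics.fmean(group_a) - statistics.fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    null_stats = []
    for _ in range(n_perm):
        # Under H0 the labels are exchangeable, so reshuffle and split again
        rng.shuffle(pooled)
        stat = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        null_stats.append(stat)
    # Fraction of permuted statistics at least as extreme as the observed one
    p_value = sum(s >= observed for s in null_stats) / n_perm
    return observed, null_stats, p_value


# Made-up incident counts, NOT the assignment data
cool = [3, 2, 3, 4]
warm = [0, 0, 0, 1, 0, 0, 1, 0]
observed, null_stats, p_value = permutation_test(cool, warm)
```

The list `null_stats` is what would be drawn as the null-distribution histogram, with `observed` marked on it. For a two-sided test, the comparison would use absolute values on both sides.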
4 Hypothesis Test - Effect of Nitrogen Removal in Wetland (20 marks)
Natural and constructed wetlands are thought to retain and remove organic and
inorganic pollutants. The dataset ‘Nitrogen Removal Wetland.xlsx’ contains water
quality samples collected from a constructed hybrid wetland in northern Kaohsiung
in 2007. The sampling scheme aims to test the hypothesis that the physical,
chemical and biological processes in constructed wetlands can reduce the
concentration of nutrients (i.e., NH3-N, TKN, NO3-N, NO2-N) in water bodies.
Water quality samples were collected from the influent and effluent on 52
consecutive days (column 1: day number; column 2: nutrient concentrations;
column 3: sampling location).
1. Using a parametric method, conduct a test to identify whether the nutrient
concentration was reduced significantly by the constructed wetland. Use
both 95% and 99% confidence intervals and provide your inferences. Choose
between one-sided and two-sided tests with proper justification.
2. Repeat the above assessment, now using a permutation test. Use 10,000
permutations to draw your conclusion and show your resampled data in a
histogram. Compare your results with those from the parametric test above
and explain the differences identified based on the pros and cons of the two
methods.
3. The nutrient concentrations provided are skewed. Transform the nutrient
concentrations using a logarithm function (with a base of 10) and repeat
the parametric test under the same conditions used for Q.1 above. Does
the transformation change your conclusion? Discuss the potential influence
of asymmetric data on the test.
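One possible parametric statistic for items 1 and 3 is sketched below in Python (standard library only), not as a Matlab solution. It assumes that influent and effluent samples from the same day are treated as pairs, which is a design choice; a two-sample test is the alternative. The helper name `paired_t_statistic` and the concentrations are made up for the example.

```python
import math
import statistics


def paired_t_statistic(before, after):
    """Return (t, degrees of freedom) for the paired differences before - after."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.fmean(diffs) / standard_error, n - 1


# Made-up concentrations, NOT the assignment data
influent = [4.0, 5.0, 6.0, 8.0]
effluent = [2.0, 4.0, 3.0, 5.0]

t_raw, dof = paired_t_statistic(influent, effluent)

# Same test after a base-10 log transform (requires strictly positive values)
t_log, _ = paired_t_statistic(
    [math.log10(v) for v in influent],
    [math.log10(v) for v in effluent],
)
```

The statistic is then compared with the Student t critical value for `dof` degrees of freedom at the chosen level. After the log transform, the differences are logs of influent-to-effluent ratios, so the hypothesis being tested concerns relative rather than absolute reduction.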
5 Hypothesis Test - Surface runoff decrease during ‘Millennium Drought’ in VIC and NSW (10 marks)
South-eastern Australia experienced the Millennium Drought in 1997-2009 (CSIRO,
2012 [4]), which reportedly led to a decrease in stream discharge. The annual
discharge data for 1976-2009 are provided for two Water Information sites in VIC
(Site no. 405217) and NSW (Site no. 412002) in the dataset
‘Discharge two stations.mat’.
1. Using either a parametric or a non-parametric method, examine whether the
discharge in the VIC and the NSW catchments underwent significant decreases
during the Millennium Drought.
2. Quantitatively show which site (VIC or NSW) experienced the more significant
change in discharge. Show and explain the null hypothesis, test statistic,
alternative hypothesis and conclusion at the 95% and 99% confidence levels
(you may use CIs, p-values, or both). Please remove NaN values before your
calculation.
6 Exploratory Data Analysis and Linear Regression (30 marks)
Nutrient/sediment concentration vs. stream discharge relationships have been
widely used as a clue to explore the hydro-chemical processes that control runoff
chemistry. Here we examine sediment concentration vs. stream discharge
relationships using linear regression. The Paddock-to-Reef Integrated Monitoring
Program of the Queensland Government maintains water quality measurements
collected during major rainfall events in a number of catchments near the Great
Barrier Reef. This question investigates the correlation between instantaneous
stream discharge and the Total Suspended Solids (TSS) concentration collected
from the site "116001F, Herbert River at Ingham" located in the Wet Tropics
region of Queensland. The data ‘Q TSS.mat’ contains two columns: stream
discharge (m³/s, column 1) and TSS (mg/L, column 2). Discharge and TSS were
measured at this site from 2009 to 2016 (missing measurements are filled
with NaN, and you might need to remove these missing values before
performing the linear regression).
[4] CSIRO (2012), Climate and water availability in south-eastern Australia: A
synthesis of findings from Phase 2 of the South Eastern Australian Climate
Initiative (SEACI), 41 pp., CSIRO, Australia.
1. Calculate the Pearson correlation coefficient, Spearman’s rank correlation
coefficient and Kendall’s τ for the paired Q vs TSS data. Then calculate
the same correlation coefficients for the paired log(Q) and log(TSS), using
the natural logarithm (base e). What can you infer from these values about the
relationship between Q and TSS? Suggest which paired dataset is more suitable
for constructing a linear regression, and justify your selection.
2. Based on your selection of the paired data, plot Q vs TSS concentrations
and fit a simple linear regression. Calculate the 95% confidence intervals for
i) the conditional mean and ii) prediction. Construct a figure showing the linear
model and the confidence intervals with the observed data values.
3. According to the simple linear model developed, what TSS concentration is
expected when discharge reaches 1000 m³/s? Also report the 95%
prediction interval at Q = 1000 m³/s.
4. What is the pattern of the residuals of your Q-TSS model? Provide
your assessment of the residuals.
5. Based on the plot you created in question 2, check what fraction of
the observed data values falls within the 95% prediction interval
(e.g., you can create a FOR loop in Matlab to check whether each observation
falls within the 95% interval). Based on the results and the pattern and
distribution of the residuals, do you recommend applying this linear model to
predict further TSS concentrations?
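The calculations behind items 2, 3 and 5 can be sketched as follows. This is a minimal Python illustration (standard library only) of an ordinary least-squares fit with a prediction interval, not a Matlab solution. The function names and the data are made up for the example. The Student t critical value `t_crit` (n - 2 degrees of freedom) is assumed to be supplied by the caller, for example from statistical tables or Matlab's tinv. If the log-transformed pairs are selected, the same code would be applied to log(Q) and log(TSS).

```python
import math
import statistics


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard error with n - 2 degrees of freedom
    s_e = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return {"slope": slope, "intercept": intercept, "s_e": s_e,
            "x_bar": x_bar, "sxx": sxx, "n": n}


def prediction_interval(x0, fit, t_crit):
    """Return (lower, predicted, upper) for a new observation at x0."""
    y0 = fit["intercept"] + fit["slope"] * x0
    spread = 1 + 1 / fit["n"] + (x0 - fit["x_bar"]) ** 2 / fit["sxx"]
    half_width = t_crit * fit["s_e"] * math.sqrt(spread)
    return y0 - half_width, y0, y0 + half_width


def coverage(x, y, fit, t_crit):
    """Fraction of observations inside their own prediction interval."""
    inside = 0
    for xi, yi in zip(x, y):
        lower, _, upper = prediction_interval(xi, fit, t_crit)
        if lower <= yi <= upper:
            inside += 1
    return inside / len(x)


# Made-up data, NOT the Q TSS.mat values
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
T_CRIT_95_DF3 = 3.182  # two-sided 95% Student t value for 3 degrees of freedom

fit = fit_line(x, y)
low, predicted, high = prediction_interval(6.0, fit, T_CRIT_95_DF3)
fraction_inside = coverage(x, y, fit, T_CRIT_95_DF3)
```

Dropping the leading 1 inside `spread` gives the narrower interval for the conditional mean. The interval widens as `x0` moves away from the mean of the observed x values, which matters when predicting at a discharge near or beyond the edge of the observed range.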