代做7SSGN110 Environmental Data Analysis | Practical 1 | Introduction to Excel & data exploration帮
- 首页 >> Web7SSGN110 Environmental Data Analysis | Practical 1 | Introduction to Excel & data exploration
1. Introduction
1.1. About this practical
This practical will introduce the manipulation and exploration of data using Microsoft Excel and the R statistical environment. The result of your work in Excel can contribute to the formative coursework assignment.
We will investigate changes in water height (‘water level’) in the River Frome in Dorset, linked to rainfall in the upstream catchment. Field data are from a pressure transducer placed on the riverbed at a site near Frampton, which recorded water pressure from May 2003 to April 2005 (data used in the practical are from Moggridge and Goodson, 2005). Pressure Transducers (PTs) measure the pressure produced by a column of water above a sensing element and output a voltage. The higher the pressure recorded by the PT, the higher the water level sitting above it. The voltage output readings were taken every 15 minutes for the duration of the study and recorded in a data logger.
This practical will introduce (or maybe refresh) you to the following Excel functions: simple formulae, scatter plots, simple linear regression, time and date formats and conversions, text to columns, pivot tables, time-series graphs and graph formatting. Some knowledge of Excel is assumed. This practical will also introduce the R statistical environment and show how it can be used to perform. the same functions.
For the formative coursework you should present plots and graphs created in Excel only. The instructions for R are to get you started learning how to use that software environment – we will focus more on R in future weeks.
1.2. Download the data
Data from (Moggridge and Goodson, 2005) are provided in an Excel spreadsheet “Frome_Data.xlsx” on KEATS. Download this file to your computer and open it in the Desktop App (do not use the online web-version because this lacks some of the functionality we’ll be using).
1.3. Where to save the data?
At King’s, your personal computing storage is in the ‘cloud’, referred to as (Microsoft) OneDrive. It is accessible through the same suite of services as your Outlook email, Calendar, and the online versions of MS Office Apps (Word, Excel, PowerPoint), from any laptop or desktop computer, at home or at King’s.
Open your OneDrive to see what kind of directory structure it has and to create a new Folder for your EDA practical exercises. It is also good practice to always have a USB memory stick with you for saving your work, so that you can access your files when offline. Furthermore, when doing GIS and Remote Sensing (other modules), many files are extremely large (many GB) and a USB memory stick can be more efficient that uploading and downloading from the ‘cloud’ .
In whichever file managing method you use, you will need to be able to identify full paths for files and folders when you start working in R. Paths may look like: ‘E:/My Documents/EDA/etc’ (PC) or like ‘~/users/yourusername/desktop/etc’ (Mac).
• Q1: After doing some research about your machine, what is the path (address) you would use to access the folder you have just created? Type the path here:
Make sure to start saving your work in this folder.
2. Data familiarization
Open the Frome data in the Excel spreadsheet and familiarise yourself with the data. This workbook contains three worksheets:
1. Metadata: information on the data in the other worksheets
2. Logger: the data from the logger
3. Calibration Data: a small set of data points where the relative water height (also called water level) in the river was manually measured and matched against the recorded PT voltage output
Go to the worksheet Logger. This is typical of what a data logger output may look like. These data show the PT voltages over the duration of the study. However, for purpose of this study, we need water level information, not PT voltage.
In order to decipher water level from a water pressure measurement, at certain points in the study the water level was recorded manually and the corresponding PT voltage was noted. This is recorded in the Calibration Data worksheet. By doing this we can convert the voltages to relative water heights.
Check you understand what the values in each column of each worksheet refer to by reviewing the Metadata worksheet. You should always go through this process of checking what the fields and values refer to when encountering a new dataset.
3. Converting PT Output to Water Levels
Using the Calibration Data, we can establish the relationship between voltage and water height and apply this relationship to all the Logger data, to obtain a full time-series of water heights. We will do this using a technique called linear regression, which we will cover in more detail later on in the module. One way to do a linear regression in Excel is to plot data as a scatterplot and then add a simple linear regression line. To do this:
a) Go to the Calibration worksheet.
b) Select the PT and WaterLevel data. Create a scatterplot of these data. If you do not know how to create the plot: with data selected goto Insert Charts X Y (Scatter)
You plot should appear as a window within the worksheet. It is often convenient to move plots (known as charts in Excel) to their own worksheet. To do this:
c) Right-click on an empty area near the top of the plot Move chart Select new sheet
d) Rename the new sheet created (from ‘Chart1’ to something sensible) by right-clicking on the tab at bottom of the window, clicking Rename, and then typing the new name.
Putting your chart in a new sheet will make it easier to edit the chart.
You should see a clear pattern in the data. You can now perform. a simple linear regression on these data, which will create a line of best fit. We can then use the equation of this line to convert the PT voltages to water levels. To perform a simple linear regression on data in a chart in Excel:
e) Right-click on the data series and select Add Trendline. A menu should appear.
o In Trendline options, make sure Linear is selected
o Check Display equation on chart and Display R-squared value on chart
o Close the menu by clicking the cross in the top right
You should have something similar to the image below (note, this figure deliberately does not have a figure caption). This is not a well-formatted chart and you should now edit the chart to improve how it communicates the data. For example:
f) Add chart elements such as axis titles by clicking Add Chart Element at the top left of the Design tab
g) Label the X and Y axes with the variable name and its measurement units in ( )
Think about why the image below does not communicate well and what edits you could make to present the data more clearly.
Figure 3 Example of generated graph
The R2 value shown on the plot gives an indication of the ‘Goodness of Fit’ and ranges from 0 to 1 (with 1 being a perfect linear relationship, 0 no relationship). Does your regression line have a good fit? The equation of the simple linear regression (Trendline) you have created allows you to predict Water Level from a known value of PT. To do this:
h) Open the Logger worksheet
i) Label a new column as “WaterLevel”
j) Use the equation to calculate the water levels from the PT output:
i. Select acell in your new column, type ‘=’ in the formula bar at the top of the screen below the menu and then type the remainder of your equation (which columns contain the relevant variables?). Type Enter.
ii. Copy the contents of your cell down to apply the equation to all rows of data
Save your Excel workbook!
4. Converting Date and Time Formats
The date and times recorded in the Logger sheet are not in a format that Excel recognises. Examine how the time and date are displayed, what is unusual about this?
Currently, the date is portrayed as days past since the first of the year, as in the first data point was collected on the 126th day of the year 2003.
To convert the dates and times from the data logger to Excel format you will first need to find out what the date was on the 126th day of the year in 2003. This can easily be done with a quick internet search, for example try this site
https://www.epochconverter.com/days/2003
Establishing the date and time of the first data point
a) Create a new column and call it “DateTime”
b) Type in the correct date and time for the first data row using this format dd/mm/yyyy hh:mm
c) Correctly format the column in Excel as follows: a. Select the column
b. Right-click on the selected cells and click Format Cells
c. A new window should appear
i. In the list to the left, select Custom
ii. Select dd/mm/yyyy hh:mm (if this is not in the list you can type it in yourself). iii. Click OK
Next you will need to add 15-minute intervals to the first cell in order to fill the entire column accurately
1. Select the cell below the one you just filled in
2. Type in a formula that adds 15 minutes to the cell above it, e.g.: “=E2+TIME(0,15,0)” . Note,if you are not working in column E you will need to adjust the formula accordingly.
3. Copy this formula to the rest of the column
Check Point, is this correct? Confirm with the GTA available that you have formatted the dates correctly.
You should now have a column which contains data in the following format 01/01/2004 00:00 (for example).
Compare the values in your new DateTime column with those in the respective Year, Day and Time columns to check the conversion makes sense, especially at the end of the dataset(!). This is a habit you should get into, to always verify that your data processing and manipulations are working correctly throughout the whole dataset.
• Q2: What do you notice when you check whether the conversion has gone correctly all the way to the end of the dataset? How are you going to rectify this?
Do not proceed until you have fixed the problem, because otherwise your time-series will be incorrect.
Save your Excel workbook!
5. Daily Summary Data
As there are very many data points from the Logger (collected every 15 minutes) we can create daily summaries of the water level to reduce the number of values we plot in a time-series. To do this, we can use a Pivot Tables, which is a means of summarising (aggregating) data in Excel. First, we need to convert our Date and Time formats so they just show the date (pivot tables require replicate measurements).
5.1. Splitting Date and Time Data
To convert formats, we will split the DateTime data, using the “Text to Columns” function (this is a useful function you should bear in mind in the future).
To split date and time data using the “Text to Columns” function:
a) Select the DateTime column and copy it.
b) Select the next column to the right
c) Right-click, goto Paste Special and select the icon Values. (Make sure you format the column as Date Time in the same way as above)
d) Click OK. This has duplicated the DateTime column, but the data are simply text without formulas. See this by clicking on values in each of the columns and looking at the formula bar. How are they different between the original and duplicated columns?
e) Select this new column.
f) Click on Data → Text to Column. A new window should appear
i. Select Delimited and click Next
ii. Check “space” as a delimiter (as there is a space between the date and the time) and click Next.
iii. You can see in the Data preview that we have split the previous column into 2 new columns. With the first columnselected (which has the dates), select “Date DMY” as the column data format and the destination as first cell in the column to the right of DateTime (likely K1).
iv. In the data preview, click on the second column (General). Check the box under column data format which says Do not import column (skip)
v. Click Finish
g) Label this new column “Date”
You should now have a new column which just has the date that the water level was taken (if times are still showing in this column, Format Cells todd/mm/yyyy as instep 3c above).
Save your Excel workbook!
5.2. Using Pivot Tables/Charts
A pivot table is a data summarization tool used in spreadsheet software to automatically count, sum or average data based on some aggregating factor. Usually a new table is created that summarisesan existing one. In this case we will create a daily summary of our data.
a) Highlight all the columns of data on your sheet
b) Click on Insert → Pivot Table → Pivot Table. All data should be automatically selected and a new window should appear.
c) Make sure that the destination of the Pivot Table is a New Worksheet and click “OK”
d) A new worksheet will appear, with a blank table
e) On the right, drag the “Date” field into the section labelled “ Rows” at the bottom of the window
f) Drag the “WaterLevel” field into the section labelled Values at the bottom of the window
g) In the Values section, click on “Sum of WaterLevel”, click Value Field Settings, and change the function from “Sum” to “Average”, the click OK
a. Right click in your Pivot Table on one of the years in the Row Labels column and click on Ungroup
h) Right click in your Pivot Table (i.e. on one of the values that has been calculated) and select Pivot Table Options
Totals & Filters
o Uncheck the “Show grand totals for columns” and “Show grand totals for rows”
Think about what has happened here. We have gone from data measured every 15 to a summary of those data for each day in the dataset. What has this done to the amount of data in the new summary data?
Pivot tables are interactive, so it is best to copy and paste the data from the pivot tables into a new worksheet:
i) Create a new worksheet and give it an appropriate label
j) Copy the data in the Pivot Table
k) In your new worksheet Paste Special with Values and Number Formatting
Through this section you have created several new variables (columns of data) and worksheets. You should consider addingto the metadata worksheet to ensure these are described properly. This will be very helpful when you come back to your data in future.
Save your Excel workbook!
6. Plotting a Time-Series
Using your new complete data set of predicted average daily water level, create a line plot (time series) of the data (e.g., select your data thengo to Insert → Line →2D Line → Line).
Format your chart in a manner which you think is suitable, so that it is clear and easy to interpret. Most of this can be done through the Design menu, which appears on the top of the screen when you select a chart. Remember, you may want to move the chart to a new window before you start editing.
Consider the following:
Titles and labels of axes
Suitable formats of the axis labels (right click on each axis to change this or use Chart Tools). For example, the date should be clear
Are the axes a suitable length? Do they allow easy interpretation of patterns? (this can also be changed by right- clicking on the axis or through Chart Tools)
Suitable format of data series: is it clear? (Right click on the data series to change this)
The background of the chart (can be changed by right-clicking in the chart and selecting Format PlotArea)
Gridlines (which can be changed via the Add Chart Elements button):
o Do you need gridlines on the x andy axis?
o The format of gridlines – they should not dominate the graph! Save your Excel workbook!
7. Download NRFA Data
As rainfall is the key driver behind the river flow we will download daily rainfall amounts for the catchment area upstream of Frampton from the UK National River Flow Archive (NRFA). Moggridge and Goodson (2005) collected their data at Frampton, a little distance upstream of gauging station 44004 ‘Fromeat Dorchester’ . Information about stations and the data collected at them are available online from the NRFA website which is managed by the Centre for Ecology and
Hydrology (CEH). The NRFA website URL is:http://nrfa.ceh.ac.uk To download daily rainfall data:
a) Go to the link above and click on Data → Search for Data
b) In the Search Table box, enter the station ID number
c) Click on the station ID number in the filtered table
d) Click on the Daily Flow Data tab
e) Click Download catchment daily rainfall data
f) Select appropriate responses, agree to the terms and conditions and click Download
Depending on what browser you are using the data will either be automatically downloaded or you will received a dialogue box asking if you want open or save the data. When downloading data from the Internet you almost always want to save the file to disc. If you click open you are at high risk of losing the data later. It is much easier to save the data to disk, then open the data in your desired software.
The data you have downloaded are in comma separated values format. You can tell this because the file suffix is ‘ .csv’ . A succinct description of csv files from file.org is as follows:
“Files that contain the .csv file extension are comma delimited files that contain separated database fields. These database fields have been exported into a format that contains a single line for each database record. The record is then divided and each field of the record that has been exported into a single line is separated by a comma.” [http://file.org/extension/csv]
Many environmental datasets you will download and use are in csv format. Excel can open csv files for viewing, and save to this format. However, remember that csv format is a basic text file which will not contain, for example, Excel formulas and other formatting that you might later add to it in Excel. If you plan to work with data in Excel in the future you should save the data in .xlsx format. Let’s do that for the file you just downloaded:
a) Open the .csv file in Excel
b) ‘save as’ an .xlsx file
The data you have download is for the entire data collection period available. For the formative coursework assignment you will need to reduce this down to match the period of the field data from Frampton, and follow the instructions and requirements for the coursework.
8. Data Analysis in R
The remainder of this practical is intended to introduce you tothe R statistical programming language and environment. R is a freely available powerful language/platform for dealing with data both statistically and graphically. RStudio is an Integrated Development Environment (IDE) that enables more efficient and flexible use of R.
We will use R and RStudio throughout the remainder of the module so it will be useful for you to get up to speed soon. Initially, the focus is on R, but in the remainder of the module you will likely find it useful to use RStudio. But remember, for the first coursework you need to use Excel (not R).
The introductory R activities are hosted online on two pages:
• StartR:http://bit.ly/KCL_StartR
• First R Analysis:http://bit.ly/KCL_FirstR
Work through each of the pages in turn. The StartR page provides some instructions for installing the R and RStudio software on your own computer (should you wish) and then an overview of how R and RStudio work in general. The First R Analysis page then repeats the analysis presented above (in Excel) in R – if you get a bit lost as to what the R code is doing, compare it to what you did in Excel.
The data (and script) files you will need for the activities are linked to from the pages, except for the Frome River data which are slightly different from those used above in that they are in .csv format, not Excel format, and are found on the module KEATS page (week 1):
• LoggerData.csv
• Calibration.csv