7SSGN110 Environmental Data Analysis | Practical 2 | Introduction to R & data exploration

1. Introduction

1.1. About this practical

This practical is focused on introducing you to the basics of R, RStudio and R Markdown. The aim of the practical is to advance your learning of new technical tools (R and RStudio) for data exploration, description and data visualisation. During the session we will be investigating the differences in characteristics in annual rainfall from four locations in the UK. The aim is to determine what impact longitude and altitude have on the amount and seasonal distribution of rainfall.

The practical uses annual rainfall totals from four locations across northern England for the 50-year period from 1941 to 1990 (MetOffice, 1993). The attributes of the four sites that we will be investigating are given in Table 1 and plan/cross-sectional profiles of the four locations are in Figures 1a and 1b.

Table 1. Summary of the spatial characteristics of the four sites used in this practical

SITE            Denton    Redmires    Sheffield    Kirk Bramwith
Altitude (m)    93        305         131          7
Latitude        53.45     53.37       53.38        53.60
Longitude       2.21      1.57        1.47         1.07

Figure 1a. The spatial location of the four climate stations used in today’s practical. Note that this map was created using Digimap (2013), which allows access to, and customization and annotation of, Ordnance Survey maps (e.g. for reports).

Figure 1b. Altitude of the four study sites. This is one method for visualizing altitude vs. distance for the four climate stations: the stations are projected onto a straight line, with relative distance along the line indicated as Eastings (km).

1.2. Practical structure

The practical session comprises 4 parts:

1. Data familiarisation & producing descriptive statistics using MS Excel

2. Start R and First R Analysis – if you haven’t completed this already 

3. Rainfall data exploration using R

4. Additional, optional exercise – air quality data exploration

Associated with the practical are 15 questions for you to answer. These test your understanding of key concepts in the practical. Answers to the questions and an example script will be posted on KEATS a few days after the practical session.

1.3. Required files & saving your data

You can access the files for this practical, RainfallData.xlsx and RainfallData.csv via KEATS. You may remember that last week we put particular emphasis on establishing the correct place to save your data. Save your data to an appropriate working directory (folder) titled “EDA_Practical2”.

2. Data familiarization & descriptive statistics using Excel

Before attempting any form of analysis, it is important to know what our objective is, what data we have available to us and where this data was collected. Assuming most of you are more comfortable with Excel than R, we’ll start with some brief data exploration in Excel. This way you’ll have some familiarity with the data before we introduce R.

Calculate some summary statistics in Excel:

1. Download the worksheet ‘RainfallData.xlsx’

2. In the worksheet containing your data click on View -> Freeze Panes -> Freeze Top Row. Now you will always be able to see the column labels if you scroll down the data in this view.

3. Scroll down to the bottom of the data.

4. Calculate the mean, median, maximum, minimum, standard deviation and range for the four locations. To do this enter the following formulas in different cells, selecting the appropriate data for ‘Data range’:

• =AVERAGE(‘Data range’)

• =MEDIAN(‘Data range’)

• =MAX(‘Data range’)

• =MIN(‘Data range’)

• =STDEV(‘Data range’)

Note: you’ll need to use some combination of these formulas to calculate the range.

Using the values you have calculated answer the following questions:

• Q1: Which site has the greatest inter-annual variation in rainfall?

• Q2: Rank the locations in order of ‘wetness’

When did you last save your Excel Workbook? If you haven’t already, you should get in the habit of saving your work (in Excel, Word, etc.) frequently. Do so now in Excel (.xlsx) format.

We’ll also save the data in Comma Separated Values (.csv) format. This can be done using Save As and selecting the appropriate file format. See online help or ask the GTAs if you need further assistance. This is good practice as csv format is generally what we will use with R (as we will now see).

3. Start R and First R Analysis

The practical instructions assume you have read and followed the instructions in the Start R and First R Analysis activities online. If you have not yet worked through these activities, STOP here and complete them before you start the rainfall data analysis.

Once you have completed Start R and First R Analysis answer the following questions:

• Q3: What command do you use to view the contents of an object?

• Q4: What is wrong with the following line of code?

TreeDiameters(mean)

• Q5: What is the difference between the source and the console pane?

4. Rainfall data analysis using R

4.1. Getting started in R

This document contains code. In places this code is annotated. You can annotate your own code by placing ‘#’ before any annotation you make; R ignores everything after the ‘#’ on that line. This is very useful for remembering what you have done and why!

answer <- 1 + 2 # Sums 1 and 2
print(answer) # Prints the answer

4.1.1. Setting the working directory

Open RStudio. First we need to set the working directory to wherever we have saved the data files (so that R knows where to look for them). This can be done in one of two ways: (1) through the RStudio user interface, or (2) through code.

Setting the working directory through the user interface: the easiest way to set the working directory is through RStudio’s user interface, as shown in Figure 2.

4.1.2. Figure 2. Setting the working directory through the user interface

Setting the working directory through code: if you know the path to your data you can use the set working directory command, setwd(), altering the path according to your preferred working directory.

setwd("X:/My Documents/EDA_Practical2") # Sets the working directory
getwd() # Prints the working directory
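A quick way to confirm that the working directory is right is to ask R whether it can see the data file. The lines below are a minimal sketch, assuming the file is named RainfallData.csv as in Section 1.3; the variable names data_file and found are only illustrative.

```r
# Check whether the data file is visible from the current working directory
data_file <- "RainfallData.csv"   # file name taken from Section 1.3
found <- file.exists(data_file)   # TRUE if R can see the file, FALSE otherwise

if (found) {
  message("Found ", data_file, " in ", getwd())
} else {
  message("Cannot find ", data_file, " in ", getwd(), "; check setwd()")
}
```

If the second message appears, the path given to setwd() probably does not point at the folder where the .csv file was saved.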

4.1.3. Loading packages

Packages are optional parcels of software that are downloaded and installed directly into R. There are thousands of packages, which allow us to undertake a range of different analyses. If it is the first time you are using a package you will need to install it in RStudio. Once it is installed, you will then only need to load it when you start RStudio. The StartR page gives you guidance on the ways you can do this.

The packages we will need today are:

• ‘tidyr’

• ‘ggplot2’

A good habit to get into is loading the packages before you start running any code. You can do this by using the library function:

library(tidyr)
library(ggplot2)
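If library() reports that a package does not exist, it has not been installed yet. The following sketch checks which of today’s packages are missing without installing anything; the names needed, installed and missing_pkgs are illustrative, and the install.packages() call is left as a comment because it needs an internet connection.

```r
# Packages required for this practical
needed <- c("tidyr", "ggplot2")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install (needs internet)
} else {
  message("All required packages are installed")
}
```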

4.1.4. Reading a .csv file

Once the working directory is set correctly, you can read the data into R. This can be done with the following command:

rainfall.data <- read.csv("RainfallData.csv", header = T)

Specifically, what read.csv() does is create a data frame called rainfall.data from the csv file RainfallData.csv. You can name the data frame anything you want; in R tutorials “my_data” is frequently used. However, when you are dealing with lots of data frames in one script, it is good to name them something intuitive, because my_data1, my_data2, my_data3 can get a bit confusing!

The header argument tells R whether the first row of data contains the names of the columns (in this case T, shorthand for TRUE, indicates that it does). If you use csv files with R it is generally a good idea for the first line of the file to contain column headers.

To print all the data to screen, you could enter the following:

print(rainfall.data)

Alternatively, you can also just enter the name of the data frame.

rainfall.data

Usually we don’t want to look at the whole data frame, rather just check to see if the data have been imported correctly. We can use the ‘head’ function to view the first few lines of the data frame. In this case, ‘head’ is the function and ‘rainfall.data’ is the object.

head (rainfall.data)

##   Year Denton Redmires Sheffield Kirk..Bramwith
## 1 1941    750     1139       881            532
## 2 1942    909      938       679            475
## 3 1943    960      977       660            411
## 4 1944   1103     1236       847            661
## 5 1945    944      979       682            436
## 6 1946   1091     1400       985            701
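Besides head(), a few other base R functions are handy for checking an import. The sketch below rebuilds the six rows printed above as a small data frame (called rain6 here purely for illustration, so the example runs without the .csv file) and inspects its size, column names and column types.

```r
# The first six rows of the rainfall data, copied from the head() output above
rain6 <- data.frame(
  Year           = 1941:1946,
  Denton         = c(750, 909, 960, 1103, 944, 1091),
  Redmires       = c(1139, 938, 977, 1236, 979, 1400),
  Sheffield      = c(881, 679, 660, 847, 682, 985),
  Kirk..Bramwith = c(532, 475, 411, 661, 436, 701)
)

dim(rain6)      # number of rows, then number of columns
names(rain6)    # column names
str(rain6)      # type of each column plus the first few values
tail(rain6, 2)  # the last two rows
```

Running the same functions on rainfall.data should report 50 rows if all years from 1941 to 1990 were read in.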

4.1.5. Dataframes, vectors & index numbers

A data frame essentially works like a table. In the data frame, each column contains the values of one variable and each row contains one value from each column. From this you can note that our data frame has five columns of data: Year, Denton, Redmires, Sheffield, and Kirk Bramwith. Note how this last column has had its header changed by R (to Kirk..Bramwith) to remove spaces.
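The renaming of headers can be reproduced on a tiny example. This is a sketch under an assumption: the header text "Kirk Bramwith" (with a single space) is made up for illustration and may not match the header in the actual file. By default read.csv() converts headers into syntactically valid R names; setting check.names = FALSE keeps them as written.

```r
# A two-row csv held in a string; the header spelling is an assumption
csv_text <- "Year,Kirk Bramwith\n1941,532\n1942,475"

# Default behaviour: the space in the header is replaced by a dot
default_names <- names(read.csv(text = csv_text, header = TRUE))

# check.names = FALSE keeps the header exactly as it appears in the file
raw_names <- names(read.csv(text = csv_text, header = TRUE, check.names = FALSE))

print(default_names)
print(raw_names)
```

Names containing spaces are awkward to use with ‘$’, which is why the default conversion is usually the more convenient choice.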

These columns of data are ‘vectors’: sequences of items of the same type. In this case our vectors are numeric (rainfall values and year). In other cases they could be strings (characters or classes of data) or logical values (TRUE or FALSE).

So you can think of a data frame as a list of vectors (i.e. a list of columns), each of which has a name or numerical index.

We can access a vector in a data frame by using the ‘$’ symbol after the data frame name followed by the name of the vector. Let’s say we want to access the rainfall data from Sheffield:

print(rainfall.data$Sheffield) # Data frame name $ vector name

##  [1]  881  679  660  847  682  985  771  735  712  773  938  693  597  982  699
## [16]  900  733  963  609 1035  717  688  736  642 1037  968  802  876  947  787
## [31]  770  810  763  775  560  675  884  824  961  879  940  833  906  867  675
## [46]  998  790  916  737  764

We can also access a vector using index numbers. Every row and column in a data frame has a number assigned. These are sequential, so the first column in the data frame will have a value of ‘1’ and so on. The same is true for rows. However, if your data frame has column headings (like our rainfall.data) then row number ‘1’ will be the first row that contains observations. In our case this will be the row containing the 1941 rainfall data.

Index numbers are written in the format [row, column], so to find the values in row ‘1’ the format would be [1,]. To find the values in column 3, the format would be [,3].

So to access the rainfall data for Sheffield using the index numbers, the code would be:

print(rainfall.data[, 4]) # Data frame name [row, column]; column 4 is Sheffield

##  [1]  881  679  660  847  682  985  771  735  712  773  938  693  597  982  699
## [16]  900  733  963  609 1035  717  688  736  642 1037  968  802  876  947  787
## [31]  770  810  763  775  560  675  884  824  961  879  940  833  906  867  675
## [46]  998  790  916  737  764

We can combine these index numbers to access specific observations in a data frame. Let’s say we wanted to access the rainfall value for Denton (column 2) in 1946 (row number 6):

print(rainfall.data[6, 2]) # Data frame name [row, column]

## [1] 1091
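Counting rows by hand becomes error-prone in longer tables, so it can help to select by column name or by a condition instead of by position. The sketch below uses the six rows shown in the head() output (rebuilt as rain6, an illustrative name) and retrieves the same Denton 1946 value in three ways.

```r
# The first six rows of the rainfall data, copied from the head() output
rain6 <- data.frame(
  Year           = 1941:1946,
  Denton         = c(750, 909, 960, 1103, 944, 1091),
  Redmires       = c(1139, 938, 977, 1236, 979, 1400),
  Sheffield      = c(881, 679, 660, 847, 682, 985),
  Kirk..Bramwith = c(532, 475, 411, 661, 436, 701)
)

by_position  <- rain6[6, 2]                         # row 6, column 2
by_name      <- rain6[6, "Denton"]                  # row 6, column called Denton
by_condition <- rain6[rain6$Year == 1946, "Denton"] # row where Year is 1946

print(c(by_position, by_name, by_condition))
```

All three return 1091. The conditional form has the advantage that it still works if rows are added, removed or reordered.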

• Q6. Adapting the code above, find the rainfall value for Redmires in 1950

Hopefully you have an intuitive understanding of this now, but it will certainly become clearer with practice. Now, back to the descriptives…

4.2. Descriptive statistics

4.2.1. The ‘summary’ function

The ‘summary’ function is a simple function for exploring a data frame. It computes summary statistics of data and model objects.

summary (rainfall.data)

##       Year          Denton          Redmires        Sheffield     
##  Min.   :1941   Min.   : 650.0   Min.   : 740.0   Min.   : 560.0  
##  1st Qu.:1953   1st Qu.: 827.2   1st Qu.: 977.5   1st Qu.: 713.2  
##  Median :1966   Median : 917.5   Median :1055.0   Median : 788.5  
##  Mean   :1966   Mean   : 916.3   Mean   :1077.9   Mean   : 808.0  
##  3rd Qu.:1978   3rd Qu.: 976.5   3rd Qu.:1190.2   3rd Qu.: 904.5  
##  Max.   :1990   Max.   :1370.0   Max.   :1400.0   Max.   :1037.0  
##  Kirk..Bramwith
##  Min.   :401.0  
##  1st Qu.:491.5  
##  Median :572.0  
##  Mean   :577.0  
##  3rd Qu.:650.2  
##  Max.   :812.0

With these simple statistics we can compare the upper and lower limits of the measurements (min and max), central tendency (mean and median) and some indicators of the dispersion of the data around the central tendency (1st & 3rd quartiles).

Check that the values you have just calculated in R match those in Excel. If they don’t you’ve gone wrong somewhere…
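The range that you calculated in Excel is not reported by summary(), but it follows directly from the minimum and maximum. A minimal sketch, using only the six Denton values shown in the head() output (stored as denton6, an illustrative name), is:

```r
# Denton rainfall for 1941-1946, copied from the head() output
denton6 <- c(750, 909, 960, 1103, 944, 1091)

# Range as maximum minus minimum: 1103 - 750
denton_range <- max(denton6) - min(denton6)

# range() returns c(min, max); diff() subtracts the first from the second
denton_range_alt <- diff(range(denton6))

print(denton_range)
print(denton_range_alt)

# Quartiles, as reported in the 1st Qu. and 3rd Qu. rows of summary()
print(quantile(denton6, c(0.25, 0.75)))
```

Both range calculations give 353 for these six years; the value for the full 50-year record will differ.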

4.2.2. Summaries with ‘apply()’

Two useful measures of dispersion of data not provided by summary() are the standard deviation and the interquartile range (IQR). The functions to calculate standard deviation and IQR are sd() and IQR() respectively. However, these functions can only be used on vectors. In this case, we need to specify the vector we want to run the analysis on.

For example, suppose we want to calculate the standard deviation of rainfall data from Denton. We can use either the vector name or the numerical index approach described above:

sd(rainfall.data$Denton)

## [1] 132.7956

sd(rainfall.data [,2])

## [1] 132.7956
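It can be useful to see what sd() is doing. It computes the sample standard deviation, dividing by n - 1 rather than n, which is also what Excel’s STDEV does, so the two programs should agree. The sketch below checks this by hand on the six Denton values from the head() output (denton6, manual_sd and builtin_sd are illustrative names).

```r
# Denton rainfall for 1941-1946, copied from the head() output
denton6 <- c(750, 909, 960, 1103, 944, 1091)

n <- length(denton6)                        # number of observations
deviations <- denton6 - mean(denton6)       # distance of each value from the mean
manual_sd <- sqrt(sum(deviations^2) / (n - 1))  # sample standard deviation by hand
builtin_sd <- sd(denton6)                   # the same quantity from sd()

print(manual_sd)
print(builtin_sd)
print(all.equal(manual_sd, builtin_sd))     # TRUE if the two agree
```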

• Q7. Calculate the IQR for the Denton rainfall data

Calculating statistics for each individual vector in a data frame using the approach above becomes more time-consuming when you have a larger dataset. To overcome this we can use the apply() function. apply() runs through the vectors of a data frame, applying a function to each as it goes. For example, here’s how to use apply() to calculate the standard deviation for each of the columns in our data frame.

apply(rainfall.data, 2, sd) # apply(data, margin (2 = columns), function)

##           Year         Denton       Redmires      Sheffield Kirk..Bramwith
##       14.57738      132.79560      155.58496      122.54070      107.75085

Working backwards through the arguments, the command above tells R we want to apply the sd() function (sd argument: the function name without parentheses) to the columns (2 argument) of the rainfall.data data frame (rainfall.data argument). Note that because we used the name of the data frame as an argument we have been given the standard deviation of all columns, including the year (but think about why the standard deviation of the year column is not really very useful). To calculate values for only some columns we can use indexing:

apply(rainfall.data[, 2:5], 2, sd) # apply(data, margin (2 = columns), function)

##         Denton       Redmires      Sheffield Kirk..Bramwith
##       132.7956       155.5850       122.5407       107.7508

To calculate a different function on the columns, we simply change the final argument of the apply() function.
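As an illustration of swapping the final argument, the sketch below computes column means instead of standard deviations on the six rows from the head() output (rebuilt as rain6, an illustrative name). It also shows sapply(), a base R alternative that loops over the columns of a data frame directly; it is not part of the practical, but it gives the same result here.

```r
# The first six rows of the rainfall data, copied from the head() output
rain6 <- data.frame(
  Year           = 1941:1946,
  Denton         = c(750, 909, 960, 1103, 944, 1091),
  Redmires       = c(1139, 938, 977, 1236, 979, 1400),
  Sheffield      = c(881, 679, 660, 847, 682, 985),
  Kirk..Bramwith = c(532, 475, 411, 661, 436, 701)
)

# Same call as before, with mean in place of sd
col_means <- apply(rain6[, 2:5], 2, mean)

# Column standard deviations computed two ways
col_sds_apply  <- apply(rain6[, 2:5], 2, sd)
col_sds_sapply <- sapply(rain6[, 2:5], sd)

print(col_means)
print(all.equal(col_sds_apply, col_sds_sapply))  # TRUE if both approaches agree
```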

• Q8. Calculate the IQR for each variable

The apply() function can also be used across rows of a data frame by changing the second argument from 2 to 1:

apply(rainfall.data[, 2:5], 1, sd) # apply(data, margin (1 = rows), function)

##  [1] 253.7748 217.0182 269.9593 257.4120 253.5592 288.7183 244.1407 249.1270
##  [9] 226.8098 227.6597 234.6591 271.7799 121.3342 357.4624 208.9186 260.6524
## [17] 208.3979 235.1430 173.8246 258.7058 214.6678 207.1077 181.0359 213.1531
## [25] 233.0186 221.3271 178.6531 172.0136 167.1175 203.4296 137.5645 210.0625
## [33] 208.6097 202.9540 147.1198 157.1143 218.2583 183.0799 215.2059 243.4124
## [41] 265.9536 225.9002 220.6913 199.8639 149.3751 279.8083 178.9010 243.1726
## [49] 181.2530 247.2259

In this case we are calculating the standard deviation of the rainfall data across ALL sites in each year. Note: in the code above it is very important that we tell R to calculate values only for columns 2 to 5. Otherwise, we would be including the Year value in the calculation, along with the rainfall values.
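The effect of accidentally including the Year column can be seen on the six rows from the head() output. In this sketch (rain6, row_sd_sites and row_sd_with_year are illustrative names) the row-wise standard deviation is calculated once for the four sites only and once for all five columns.

```r
# The first six rows of the rainfall data, copied from the head() output
rain6 <- data.frame(
  Year           = 1941:1946,
  Denton         = c(750, 909, 960, 1103, 944, 1091),
  Redmires       = c(1139, 938, 977, 1236, 979, 1400),
  Sheffield      = c(881, 679, 660, 847, 682, 985),
  Kirk..Bramwith = c(532, 475, 411, 661, 436, 701)
)

# Correct: standard deviation across the four sites in each year
row_sd_sites <- apply(rain6[, 2:5], 1, sd)

# Incorrect: the Year value (e.g. 1941) is treated as a fifth rainfall value
row_sd_with_year <- apply(rain6, 1, sd)

# Side-by-side comparison, one row per year
print(data.frame(Year = rain6$Year, row_sd_sites, row_sd_with_year))
```

The first value of row_sd_sites matches the first value in the output above (253.7748), while every value in row_sd_with_year is inflated because the year is much larger than any of the rainfall totals.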

