Tutorial and Laboratories

11464 - 11524 AR/VR for Data Analysis and Communication

Data Structures and how to load data into R - Week 1

1. Introduction

In this tutorial you will continue practising basic operations in R. In particular, you will learn about vectors and data frames in R: how to create them, access their elements and modify them in your program, as well as the basics of data visualisation.

Skills Covered in this tutorial include:

•    Data Structures

•    Obtaining specific information from data frames

•    Reading files (txt and csv)

Note: Do not copy-paste the commands. As you type each line, you will make mistakes and correct them, which makes you think as you go along. Remember that the objective is to understand the commands and master the concepts, so that you can apply them on your own later.

2. Data Structures

R has five basic data structures: atomic vectors, matrices, arrays, lists, and data frames. These structures have specific requirements in terms of their dimension. Figure 1 presents a graphical representation of these data structures.

•    One dimension: Atomic vectors and lists

•    Two dimensions: Matrices and data frames

•    N dimensions: Arrays

Figure 1. Basic data structures in R. Different colours represent different data types (e.g., numeric, character, Boolean).

For today’s lab, we will practice using vectors and data frames only.

2.1. Vectors

Vectors are the most basic data structure in R. A vector contains elements of the same type. The data types can be logical, integer, double, character or complex. A vector’s type can be checked with the typeof() function. Another important property of a vector is its length. This is the number of elements in the vector and can be checked with the function length().

Exercise 1. Create atomic vectors with different data types and observe their type and length.

int_var <- c(10L, 2L, 5L)

num_var <- c(0.4, 3.7, 2)

typeof(int_var)

length(int_var)

coe_var <- c(5L, 3.5, "A")    # mixed types are coerced to a single type

typeof(coe_var)

animals <- c("mouse", "rat", "dog", "bear")

x <- seq(0, 10, 2)

y <- 2:-2      # integer sequence from 2 down to -2
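Because a vector can hold only one type, c() silently converts mixed inputs to the most general type present. The lines below are a minimal sketch of that rule; the names mixed_num and mixed_chr are made up for illustration and are not part of the exercise.

```r
# Coercion order in c(): logical < integer < double < character
mixed_num <- c(TRUE, 2L, 3.5)   # logical and integer are promoted to double
mixed_chr <- c(5L, 3.5, "A")    # one character element makes everything character

typeof(mixed_num)   # "double"
typeof(mixed_chr)   # "character"
mixed_chr           # "5" "3.5" "A"

# The colon operator builds an integer vector
y <- 2:-2
typeof(y)           # "integer"
length(y)           # 5
```

Checking typeof() after building a vector is a quick way to notice that numbers have been turned into text by accident.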

Elements of a vector can be accessed using vector indexing. The vector used for indexing can be a logical, integer or character vector. Note: Vector indexing in R starts from 1, unlike most programming languages, where indexing starts from 0.

x[3]           # access 3rd element

x[c(2, 4)]     # access 2nd and 4th element

x[-1]          # access all but 1st element

x[c(2, -4)]    # error: positive and negative indices cannot be mixed

x[c(2.4, 3.54)]    # real numbers are truncated to integers
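The examples above use integer indices only. Indexing with a logical vector, which the text also mentions, could look like the following sketch; the name big is made up for illustration.

```r
x <- seq(0, 10, 2)     # 0 2 4 6 8 10

# A logical index keeps the positions marked TRUE
x[c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)]   # 0 4 8

# A comparison produces a logical vector of the same length as x
x > 4                  # FALSE FALSE FALSE TRUE TRUE TRUE

# Using the comparison as an index filters the vector
big <- x[x > 4]        # 6 8 10
x[x > 2 & x < 10]      # 4 6 8
```

This filtering pattern is the same one used later to select rows of a data frame.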

A character vector can also be used as an index. This type of indexing is useful when dealing with named vectors, in which each element of the vector has a name.

x <- c("first"=3, "second"=0, "third"=9) #create vector

names(x)                       #print names of each element in the vector

x["second"]                    #access value of “second” element

x[c("first", "third")]          #access the 1st and 3rd element
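Names can also be attached after a vector has been created, and elements can be modified by name. A small sketch, using a made-up vector called scores:

```r
scores <- c(3, 0, 9)
names(scores) <- c("first", "second", "third")   # attach names afterwards

scores["second"] <- 5    # modify an element by name
scores["second"]         # single brackets keep the name
scores[["third"]]        # double brackets return the bare value: 9
unname(scores)           # 3 5 9
```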

2.2. Data Frames

A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column. The characteristics of a data frame are as follows:

•    The column names should be non-empty.

•    The row names should be unique.

•    The data stored in a data frame can be of numeric, factor or character type.

•    Each column should contain the same number of data items.

We can create a data frame using the data.frame() function.

Exercise 2. Create the data frame df from a group of three vectors n, s, b.

n = c(2, 3, 5)

s = c("aa", "bb", "cc")

b = c(TRUE, FALSE, TRUE)

df = data.frame("var1"=n, "var2"=s, "var3"=b, stringsAsFactors=FALSE)       # df is a data frame.

str(df)                  # check structure of df

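Note: In versions of R before 4.0.0, columns that contain characters are converted into factors by default when a data frame is built. If you want to keep those columns as characters, you need to pass the argument stringsAsFactors=FALSE. Since R 4.0.0 this is the default, but passing the argument explicitly keeps the code consistent across versions.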
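The effect of stringsAsFactors can be seen by building the same column both ways. This is a minimal sketch; df_chr and df_fac are names made up for the comparison.

```r
s <- c("aa", "bb", "cc")

# Set the argument explicitly so the result does not depend on the R version
df_chr <- data.frame(var2 = s, stringsAsFactors = FALSE)
df_fac <- data.frame(var2 = s, stringsAsFactors = TRUE)

class(df_chr$var2)    # "character"
class(df_fac$var2)    # "factor"
levels(df_fac$var2)   # "aa" "bb" "cc"
```

Factors store categories as integer codes with labels, which matters when you later compare or modify the values as text.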

You can also visualize the structure of the data frame by clicking on its name in the Environment Pane. The top line of the table, called the header, contains the column names (variable names). Each horizontal line afterward denotes a data row, which begins with the name of the row, followed by the actual data. Each data member of a row is called a cell.

We can use the [, [[ or $ operators to access columns of a data frame. Practice using these operators to access data from a data frame and observe the results.

df[[1]]                  # access data in column 1

df$var2               #access column 2 by name, returns a vector

df[2]                    #access data in column 2, returns a data frame.

sum(df$var1 == 5)           #returns the number of observations in variable 1 that are equal to 5

max(df$var1)                   # returns the maximum value in variable 1

which(df$var1==3)         # returns the position (not the value) of the elements of var1 equal to 3

df[c("var1", "var3")]       #access column 1 and column 3 simultaneously

df[2,1]                                #access cell (2,1)

df$var4 <- c(1.5, 3.5, 5.5) # add variable named “var4”

df <- rbind(df,list(7, "dd",FALSE,7.7)) # add a row

df$var3 <- NULL #delete variable 3

df <- df[-1,] #delete first row
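The difference between [ and [[ (or $) is easiest to see by checking the class of what each returns. The sketch below rebuilds a small data frame like the one in Exercise 2; big_rows is a made-up name.

```r
df <- data.frame(var1 = c(2, 3, 5),
                 var2 = c("aa", "bb", "cc"),
                 stringsAsFactors = FALSE)

class(df["var1"])     # "data.frame": single brackets keep the table structure
class(df[["var1"]])   # "numeric": double brackets extract the column itself
class(df$var1)        # "numeric": $ behaves like [[ with a name

# A logical condition in the row position selects matching rows
big_rows <- df[df$var1 > 2, ]
nrow(big_rows)        # 2
```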

You can also use the subset() function to access data in a data frame. An advantage of subset() is that rows for which the condition evaluates to NA (missing) are dropped, whereas logical indexing with [ returns them as rows of NA. In addition, many summary functions can ignore missing values when given the argument na.rm = TRUE. We will practice using subset() in the next section.
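The handling of missing values can be illustrated with a toy data frame. The data frame obs and its values are made up for this sketch.

```r
obs <- data.frame(id = 1:4, value = c(10, NA, 30, 5))

# subset() treats a missing comparison as FALSE and drops that row
kept_subset <- subset(obs, value > 8)
nrow(kept_subset)     # 2

# Logical indexing keeps the NA comparison as an extra row of NAs
kept_index <- obs[obs$value > 8, ]
nrow(kept_index)      # 3

# Summary functions return NA unless missing values are removed
mean(obs$value)                  # NA
mean(obs$value, na.rm = TRUE)    # 15
```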

Exercise 3. Use built-in data frames in R. You can get the list of all the datasets by using data() function. For this exercise, we will use a built-in data frame called mtcars (for a complete description of the dataset visit: https://rpubs.com/neros/61800). Have a look at the following instructions.

rm(list=ls())

data()               # get the list of built-in data frames

data("mtcars")   #select mtcars data frame.

head(mtcars)     #visualise only the first 6 rows

rownames(mtcars)     #returns the name of each row

Now, imagine that you are required to obtain some specific information from the mtcars data frame.

# How big is the data frame? Use the function dim() or the functions nrow() and ncol().

dim(mtcars)                     # dimension of data frame (rows, columns)

nrow(mtcars)                   #number of rows

ncol(mtcars)                     #number of columns

# What is the cell value from the first row, second column of mtcars?

mtcars[1,2]

#Could you get the same value by using row and column names instead? Which names?

mtcars["Mazda RX4", "cyl"]         # using row and column names instead

#Are there more automatic (0) or manual (1) transmission-type cars in the dataset? Hint: use the sum() function to sum each type of transmission in the am (automatic/manual) variable.

sum(mtcars$am == 1)                   # get total number of manual transmission-type cars

sum(mtcars$am == 0)                   # get total number of automatic transmission-type cars
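As an alternative to calling sum() once per category, table() counts every distinct value in one step. This is only a sketch of another route to the same numbers; am_counts and am_labelled are made-up names.

```r
data("mtcars")

# Count how many cars fall in each transmission category
am_counts <- table(mtcars$am)
am_counts

# Individual counts can be picked out by the category value (as a string)
am_counts[["0"]]      # automatic
am_counts[["1"]]      # manual

# Optional: attach readable labels before counting
am_labelled <- factor(mtcars$am, levels = c(0, 1),
                      labels = c("automatic", "manual"))
table(am_labelled)
```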

3. Reading Data from Files

In this section you will learn how to read data from different sources. It is assumed that you are now familiar with different data types.

The data file is in the format called comma-separated values (CSV). In other words, each line contains a row of values which can be letters or numbers, and each value is separated by a comma. Generally, the very first row in the file contains the labels to refer to each column of values.

The data file that we need for this example is available in Canvas. Download the file and save it in your working directory. The labels of the three columns are: trial, mass, velocity. The values in each row come from an observation during one of two experimental conditions, labelled A and B.

For this tutorial we will use two commands to input the same data, read.table() and read.csv().

Exercise 4. Read data from structured files using the read.table() function.

read.table() reads a file containing structured (table-like) text into a data frame. The file can be delimited by commas, tabs, or any other delimiter specified with the sep argument. If header = TRUE, the first row is used as the column names.

The sep argument can be used to specify different separators. Some of the most common separators are: tab ("\t"), a single space (" "), single backslash ("\\"), comma (","), and the default (""), which treats any white space (one or more spaces, tabs or newlines) as the separator.

Now, let's use read.table() to get the data from the csv file.

# read data

data_csv <- read.table("simple.csv",sep="\t")

# check if it is a data frame.

is.data.frame(data_csv)

How many columns and rows did you obtain?

What is the name of each column?

Did you get the right number of columns?

How can you correctly read the column names and the rows?
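To compare the effect of the sep and header arguments without depending on the Canvas file, the sketch below writes a small stand-in file to a temporary location. The column labels follow the description above, but the values are made up and are not the contents of simple.csv.

```r
# Hypothetical stand-in for simple.csv (values invented for illustration)
tmp <- tempfile(fileext = ".csv")
writeLines(c("trial,mass,velocity",
             "A,10.0,12.5",
             "A,11.0,13.1",
             "B,19.5,8.2"), tmp)

# Wrong separator: each line is read as one single field
wrong <- read.table(tmp, sep = "\t")
dim(wrong)       # 4 1

# Matching separator and header = TRUE: first line becomes the column names
right <- read.table(tmp, sep = ",", header = TRUE)
dim(right)       # 3 3
names(right)     # "trial" "mass" "velocity"

unlink(tmp)      # remove the temporary file
```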

If R cannot find the file you are trying to read, it might be looking in the wrong folder. You can change the working directory from the menu bar: click on “Session”, then “Set Working Directory” and “Choose Directory”. If you are not sure what files are in the current working directory, you can use the dir() command to list the files and the getwd() command to determine the current directory.

# list files in working directory

dir()

# obtain location of current working directory

getwd()
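The same checks can be done from the console instead of the menu. A short sketch, in which the folder path is a placeholder to be replaced with your own:

```r
# Is the file visible from the current working directory?
file.exists("simple.csv")

# List only the csv files in the working directory
dir(pattern = "\\.csv$")

# Change the working directory from code (placeholder path, not a real one)
# setwd("path/to/your/folder")

# Build the full path to the file from the working directory
csv_path <- file.path(getwd(), "simple.csv")
csv_path
```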

Exercise 5. Read data from structured files using the read.csv() function.

We will now use another example, which is also a csv file. In this case, we will create the file using Windows Notepad by copying and pasting the data below. Save the file as input_data.csv, using the "Save as type: All Files (*)" option in Notepad. If you are using a Mac, follow these instructions:

https://help.sharpspring.com/hc/en-us/articles/115001068588-Saving-CSV-Files

id,name,salary,start_date,dept

1,Rick,623.3,2012-01-01,IT

2,Dan,515.2,2013-09-23,Operations

3,Michelle,611,2014-11-15,IT

4,Ryan,729,2014-05-11,HR

5,Gary,843.25,2015-03-27,Finance

6,Nina,578,2013-05-21,IT

7,Simon,632.8,2013-07-30,Operations

8,Richard,722.5,2014-06-17,Finance

Once the data has been saved, we can read it and store it in a data frame using the read.csv() function.

# read data

data <- read.csv(file="input_data.csv",header=TRUE,sep=",");

# print data

print(data)

Remember that by default the read.csv() and read.table() functions return their output as a data frame. Once we load the data and save it in a data frame, we can apply all the functions available for data frames, as explained in the previous section.
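After importing, it is worth checking how each column was interpreted. The sketch below reads the first three rows of the data above directly from a string, using the text argument of read.csv(), so that it runs without the file; the name staff is made up for illustration.

```r
csv_text <- "id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT"

staff <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

str(staff)    # shows the type of each column

# Dates are read as text; convert them to compare them as dates
staff$start_date <- as.Date(staff$start_date)
class(staff$start_date)    # "Date"

# Names of staff who started after 1 January 2013
staff[staff$start_date > as.Date("2013-01-01"), "name"]   # "Dan" "Michelle"
```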

Now, obtain the following:

# get the max salary from the data frame.

sal <- max(data$salary)

print(sal)

# get the details of the person with the highest salary

details <- subset(data, salary == sal)

print(details)

# get all the staff members working in IT

peopleIT <- subset(data, dept == "IT")

print(peopleIT)

# get the staff members in IT whose salary is greater than 600

richIT <- subset(data, salary > 600 & dept == "IT")

print(richIT)

Who gets the lowest salary in the Operations department?

4. Take Home Exercises

4.1 Titanic dataset. This dataset contains the survival status of passengers on the Titanic. The dataset is a tab-separated file saved as a txt file. Information included in the dataset: names, passenger class, age, gender, and survival status. More information about this dataset can be obtained from:

http://www.statsci.org/data/general/titanic.html.

Inspect the data set and answer the following questions:

1.    How many passengers are in the dataset?

2.    Create two new data frames, one with male survivors and one with female survivors.

3.    Using the newly created data frames, who was the oldest surviving male? What was his age?

4.    In what passenger class was the youngest surviving female?

5.    How many female and male passengers survived?

6.    What is the average age of those who survived and those who did not?

7.    What is the name of the oldest survivor?

4.2 Rainfall dataset. This dataset contains 52 years (1968-2020) of daily rainfall amounts as measured in Canberra. Source: BOM (http://www.bom.gov.au/climate/data/).

Inspect the data set and answer the following questions:

1.    Calculate the mean and standard deviation of the rainfall variable.

2.    Which date (day,month,year) saw the highest rainfall? (use a loop)

3.    Obtain a subset of the rainfall data where rain is larger than 20mm.

4.    Find the mean rainfall for the days where the rainfall was at least 30mm.

5.    How many days (which dates) were recorded where the rainfall was exactly 40.4mm?

6.    Obtain the average rainfall for each year in the dataset. What years got the highest and lowest rainfall in the dataset? (use a loop)

7.    Obtain the average rainfall for each month in the dataset. On average, what months are the driest and wettest in Canberra? (use a loop)
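Several of the questions above ask for a loop. The general pattern of looping over groups and storing one result per group might look like the sketch below. It uses a toy data frame with invented values and assumed column names (year, rainfall), which may not match the real file.

```r
# Toy data, values invented for illustration only
rain <- data.frame(year     = c(2001, 2001, 2002, 2002, 2003),
                   rainfall = c(0, 12.4, 3.2, 40.4, 7.0))

years <- unique(rain$year)

# Pre-allocate one slot per year and label the slots
yearly_mean <- numeric(length(years))
names(yearly_mean) <- years

# Compute the mean rainfall of each year in turn
for (i in seq_along(years)) {
  in_year <- rain$year == years[i]
  yearly_mean[i] <- mean(rain$rainfall[in_year], na.rm = TRUE)
}

yearly_mean
names(yearly_mean)[which.max(yearly_mean)]   # "2002" for this toy data
```

The same structure applies to any grouping variable: replace the year column with the one you want to group by.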

5. Summary of some functions useful for this tutorial.



