代做Overview of Quantitative Methods Lab 2 Notes代做SQL语言
- 首页 >> C/C++编程Lab 2 Notes
Overview of Quantitative Methods
Data manipulation & Tidyverse
We can make our R experience much more pleasant with user-made libraries:
#install.packages("readxl")
#install.packages("tidyverse")
library(readxl)
#(this is considered bad taste btw, and it's better to load only the tidyverse packages you need)
library(tidyverse)
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr 1.1.4 v readr 2.1.4
## v forcats 1.0.0 v stringr 1.5.1
## v ggplot2 3.4.4 v tibble 3.2.1
## v lubridate 1.9.3 v tidyr 1.3.0
## v purrr 1.0.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## i Use the conflicted package (
We will work with the data from the American coffee preference test this time.
gact <- read_xlsx("GACTT_RESULTS_ANONYMIZED_LAB2.xlsx")
gact_codebook <- read_xlsx("GACTT_RESULTS_ANONYMIZED_LAB2.xlsx",
sheet = "codebook") #to read particular xlsx sheet
head(gact_codebook)
## # A tibble: 6 x 2
## variable_name survey_question
##
## 1 submission_id Submission ID
## 2 zip What is your ZIP code?
## 3 age What is your age?
## 4 gender Gender
## 5 cups How many cups of coffee do you typically drink per day?
## 6 where Where do you typically drink coffee?
glimpse(gact)
## Rows: 3,280
## Columns: 6
## $ submission_id
## $ zip
## $ age
## $ gender
## $ cups
## $ where
Tidyverse is the set of libraries which make R lot cooler. For example, it has the pipe %>% which applies function to object:
gact %>% head(3)
## # A tibble: 3 x 6
## submission_id zip age gender cups where
##
## 1 gMR29l
## 2 BkPN0e
## 3 W5G8jj
head(gact, 3) #equivalent
## # A tibble: 3 x 6
## submission_id zip age gender cups where
##
## 1 gMR29l
## 2 BkPN0e
## 3 W5G8jj
Tidyverse also introduces a ton of functions for data manipulation (with pipe we don’t change the original object):
select(gact, submission_id)
## # A tibble: 3,280 x 1
## submission_id
##
## 1 gMR29l
## 2 BkPN0e
## 3 W5G8jj
## 4 4xWgGr
## 5 QD27Q8
## 6 V0LPeM
## 7 V0Gaxg
## 8 AdzRL0
## 9 EXQLWN
## 10 xpa2K9
## # i 3,270 more rows
gact %>% select(submission_id)
## # A tibble: 3,280 x 1
## submission_id
##
## 1 gMR29l
## 2 BkPN0e
## 3 W5G8jj
## 4 4xWgGr
## 5 QD27Q8
## 6 V0LPeM
## 7 V0Gaxg
## 8 AdzRL0
## 9 EXQLWN
## 10 xpa2K9
## # i 3,270 more rows
gact %>% rename(submission_ID = submission_id)
## # A tibble: 3,280 x 6
## submission_ID zip age gender cups where
##
## 1 gMR29l
## 2 BkPN0e
## 3 W5G8jj
## 4 4xWgGr
## 5 QD27Q8
## 6 V0LPeM
## 7 V0Gaxg 10023 18-24 years old
## 8 AdzRL0
## 9 EXQLWN
## 10 xpa2K9
## # i 3,270 more rows
gact %>% filter(age == "18-24 years old")
## # A tibble: 399 x 6
## submission_id zip age gender cups where
##
## 1 gMR29l
## 2 V0Gaxg 10023 18-24 years old
## 3 jyDqva 2312 18-24 years old
## 4 eB7QoJ 12345 18-24 years old
## 5 24xzlg 11221 18-24 years old
## 6 AlbYao 85225 18-24 years old
## 7 42NYWO
## 8 J2kj9J 110034 18-24 years old
## 9 8dpDNA 61761 18-24 years old
## 10 vNx6A4 61600 18-24 years old
gact %>% filter(age == "18-24 years old" & cups >= 3.0) #what's the problem here?
## # A tibble: 99 x 6
## submission_id zip age gender cups where
##
## 1 eB7QoJ 12345 18-24 years old
## 2 8dpDNA 61761 18-24 years old
## 3 qbEQPO 61600 18-24 years old
## 4 BGboZR 92008 18-24 years old
## 5 aBrp0W 14456 18-24 years old
## 6 EPbVxA 90039 18-24 years old Female Less than 1 At a cafe, At home
## 7 WApa9N 90036 18-24 years old Male Less than 1 At the office
## 8 zKQA18 95834 18-24 years old Non-binary Less than 1 On the go, At home
## 9 eBxO5q 90036 18-24 years old Non-binary Less than 1 At home
## 10 MAdj6k 83709 18-24 years old Male 3 At home, At a cafe
## # i 89 more rows
table(gact$cups) #`cups` has weird categories while substantially it is a number
##
## 1 2 3 4 Less than 1 More than 4
## 1005 1378 404 101 246 60
#how to recode strings:
gact$cups[15:30] %>% recode("Less than 1" = '0', "More than 4" = '5')
## [1] "3" "1" "2" "2" "3" NA NA "5" "1" "2" "1" NA "3" NA "1" "2"
gact$cups_number <- gact$cups %>% recode("Less than 1" = '0', "More than 4" = '5')
#time to change type:
gact %>% mutate(cups_num = as.numeric(cups_number),
cups_norm = cups_num - mean(cups_num, na.rm = T)) #remove NA = TRUE
## # A tibble: 3,280 x 9
## submission_id zip age gender cups where cups_number cups_num cups_norm
##
## 1 gMR29l
## 2 BkPN0e
## 3 W5G8jj
## 4 4xWgGr
## 5 QD27Q8
## 6 V0LPeM
## 7 V0Gaxg 10023 18-24 ~
## 8 AdzRL0
## 9 EXQLWN
## 10 xpa2K9
## # i 3,270 more rows
gact$cups_num <- as.numeric(gact$cups_number)
gact %>% arrange(cups_num, desc = T)
## # A tibble: 3,280 x 8
## submission_id zip age gender cups where cups_number cups_num
##
## 1 bl41Zg 34322 45-54 years old Male Less t~ At a~ 0 0
## 2 KMBroK 11768 25-34 years old Female Less t~ At h~ 0 0
## 3 6dQ0eO 11201 25-34 years old Male Less t~ At t~ 0 0
## 4 eBxgQO 94043 25-34 years old Male Less t~ None~ 0 0
## 5 EPbVxA 90039 18-24 years old Female Less t~ At a~ 0 0
## 6 dY2GGD 80234 25-34 years old Female Less t~ On t~ 0 0
## 7 WApa9N 90036 18-24 years old Male Less t~ At t~ 0 0
## 8 BGbJE5 80234 25-34 years old Male Less t~ None~ 0 0
## 9 VZ1WVN 15206 25-34 years old Male Less t~ At h~ 0 0
## 10 RzBq4P 89148 35-44 years old Male Less t~ At h~ 0 0
## # i 3,270 more rows
With pipe we can also beautifully chain functions (you can think of the syntax as and SQL query)
#let's also create the numeric column for age
table(gact$age)
##
## <18 years="" ld="">65 years old 18-24 years old 25-34 years old 35-44 years old
## 13 63 399 1652 760
## 45-54 years old 55-64 years old
## 232 130
gact %>%
select(submission_id, cups_num, age) %>%
filter(cups_num >= 3) %>%
mutate(age_num = str_remove(age, " years old"))
## # A tibble: 565 x 4
## submission_id cups_num age age_num
##
## 1 Zd694B 3 <18 years old <18
## 2 QAeYZY 3 >65 years old >65
## 3 eB7QoJ 3 18-24 years old 18-24
## 4 1r2YJQ 5 25-34 years old 25-34
## 5 aBr18q 3 25-34 years old 25-34
## 6 8dpDNA 3 18-24 years old 18-24
## 7 qbEQPO 3 18-24 years old 18-24
## 8 BGboZR 3 18-24 years old 18-24
## 9 RzBeZj 5 45-54 years old 45-54
## 10 5dqajd 5 25-34 years old 25-34
## # i 555 more rows
#equivalent to, but more readable!
mutate(filter(select(gact, submission_id, cups_num, age), cups_num >= 3),
age_num = str_remove(age, " years old"))
## # A tibble: 565 x 4
## submission_id cups_num age age_num
##
## 1 Zd694B 3 <18 years old <18
## 2 QAeYZY 3 >65 years old >65
## 3 eB7QoJ 3 18-24 years old 18-24
## 4 1r2YJQ 5 25-34 years old 25-34
## 5 aBr18q 3 25-34 years old 25-34
## 6 8dpDNA 3 18-24 years old 18-24
## 7 qbEQPO 3 18-24 years old 18-24
## 8 BGboZR 3 18-24 years old 18-24
## 9 RzBeZj 5 45-54 years old 45-54
## 10 5dqajd 5 25-34 years old 25-34
## # i 555 more rows
gact$age_num <- gact$age %>%
recode("<18 years old" = 18,
">65 years old" = 65,
"18-24 years old" = 21,
"25-34 years old" = 30,
"35-44 years old" = 40,
"45-54 years old" = 50,
"55-64 years old" = 60)
Also, we can make summaries:
gact %>%
summarize(mean_cups = mean(cups_num),
mean_cups_narm = mean(cups_num, na.rm = T),
median_cups = median(cups_num, na.rm = T))
## # A tibble: 1 x 3
## mean_cups mean_cups_narm median_cups
##
## 1 NA 1.78 2
gact %>%
group_by(gender) %>% #same, but for each gender
summarize(mean_cups = mean(cups_num, na.rm = T),
median_cups = median(cups_num, na.rm = T),
n_obs = n())
## # A tibble: 6 x 4
## gender mean_cups median_cups n_obs
##
## 1 Female 1.44 1 579
## 2 Male 1.87 2 2134
## 3 Non-binary 1.48 1 96
## 4 Other (please specify) 1.78 2 9
## 5 Prefer not to say 1.68 2 28
## 6
gact %>%
select(gender, age, cups_num) %>%
filter(age %in% c("<18 years old", "18-24 years old", "25-34 years old")) %>%
group_by(gender, age) %>% #we can group by many columns
summarize(mean_cups = mean(cups_num, na.rm = T),
median_cups = median(cups_num, na.rm = T),
n_obs = n())
## ‘summarise()‘ has grouped output by ’gender’. You can override using the
## ‘.groups‘ argument.
## # A tibble: 16 x 5
## # Groups: gender [6]
## gender age mean_cups median_cups n_obs
##
## 1 Female 18-24 years old 1.14 1 59
## 2 Female 25-34 years old 1.36 1 296
## 3 Female <18 years old 2 2 1
## 4 Male 18-24 years old 1.59 2 254
## 5 Male 25-34 years old 1.78 2 1105
## 6 Male <18 years old 1.6 2 5
## 7 Non-binary 18-24 years old 1 1 12
## 8 Non-binary 25-34 years old 1.46 1 65
## 9 Other (please specify) 18-24 years old 1.25 1.5 4
## 10 Other (please specify) 25-34 years old 2 2 3
## 11 Other (please specify) <18 years old 3 3 1
## 12 Prefer not to say 18-24 years old 0.5 0.5 4
## 13 Prefer not to say 25-34 years old 1.67 2 12
## 14
## 15
## 16
Now to some serious things. Question “where do you drink coffee” (variable where in our data) is multiple-choice and the result is total mess, let’s make nice dummies from it.
gact %>% select(where) %>% table() %>% head()
## where
## At a cafe
## 49
## At a cafe, At home
## 70
## At a cafe, At home, At the office
## 14
## At a cafe, At home, At the office, On the go
## 5
## At a cafe, At home, On the go
## 11
## At a cafe, At home, On the go, At the office
## 6
gact %>%
drop_na(where) %>% #we don't need missing observations for this
select(submission_id, where) %>%
separate_rows(where, sep = ",\\s*") %>% #that's regex and it's black magic
# we flattened our data so each submission now has as many rows as there were choices
mutate(dummy = 1)
## # A tibble: 5,748 x 3
## submission_id where dummy
##
## 1 V0Gaxg At a cafe 1
## 2 V0Gaxg At the office 1
## 3 V0Gaxg At home 1
## 4 V0Gaxg On the go 1
## 5 7WWB4A On the go 1
## 6 7WWB4A At a cafe 1
## 7 7WWB4A At home 1
## 8 7WWB4A At the office 1
## 9 pKL8aB At the office 1
## 10 Zd694B At home 1
## # i 5,738 more rows
gact %>%
#drop_na(where) %>% #this time let's not drop the missing data
select(submission_id, where) %>%
separate_rows(where, sep = ",\\s*") %>%
mutate(dummy = 1) %>%
#pivot_* is a VERY useful function for transforming your data from long to wide format or vice versa
pivot_wider(names_from = where, values_from = dummy, values_fill = 0)
## # A tibble: 3,280 x 7
## submission_id ‘NA‘ ‘At a cafe‘ ‘At the office‘ ‘At home‘ ‘On the go‘
##
## 1 gMR29l 1 0 0 0 0
## 2 BkPN0e 1 0 0 0 0
## 3 W5G8jj 1 0 0 0 0
## 4 4xWgGr 1 0 0 0 0
## 5 QD27Q8 1 0 0 0 0
## 6 V0LPeM 1 0 0 0 0
## 7 V0Gaxg 0 1 1 1 1
## 8 AdzRL0 1 0 0 0 0
## 9 EXQLWN 1 0 0 0 0
## 10 xpa2K9 1 0 0 0 0
## # i 3,270 more rows
## # i 1 more variable: ‘None of these‘
We can of course just mutate the original dataset, but let’s also practice merging data:
gact_where <- gact %>%
select(submission_id, where) %>%
separate_rows(where, sep = ",\\s*") %>%
mutate(dummy = 1) %>%
pivot_wider(names_from = where, values_from = dummy, values_fill = 0) %>%
select(-`NA`) %>% #drop `NA` column
rename("where_cafe" = "At a cafe", #rename to get rid of whitespaces
"where_office" = "At the office",
"where_home" = "At home",
"where_go" = "On the go",
"where_other" = "None of these")
gact <- left_join(gact, #identical to SQL left join
gact_where, #to each row in gact add data from gact_where
by = "submission_id") #matching by `submission_id` value