STA 4373 – Computational Methods in Statistics
Fall 2022
STA 4373 Assignment 2
Instructions.
In this assignment you’ll analyze a COVID dataset and create a PDF of your results using the same Quarto
template I posted to Canvas. As before, when you turn in the file, the filename of the turn-in should be
last names separated by dashes and terminated with -2.pdf. For example, if Joe Shmo, Jane Doe, and Mickey
Mouse worked together, they would turn in shmo-doe-mouse-2.pdf.
Again, you may use your text and work in groups of size up to three. Only one delegate of your team
will submit the resulting PDF on Canvas. The PDF should have the names of each of the collaborators on
top. The main advantage to working in a group is that you can bounce ideas off one another, and hopefully
uncover more interesting features of the data.
You may use the internet to access the text’s webpage, other websites directly linked in this document, and
general-purpose resources on data science in R. However, you may not read or use any analyses of this
or related datasets you find online. Failure to follow this rule may be considered a violation of this course’s
academic integrity policy. If you have any questions about this, please contact me.
Please put a new page break before each question so each question starts on its own page (this will
facilitate grading) and never provide output that runs over more than one page if you can help it. Be sure
to echo all your code!
The COVID-19 pandemic in Texas.
The Texas Department of State Health Services (DSHS) is the primary government body in the state that
tracks the spread of the COVID-19 pandemic and makes information available to the public. To that end it
has two dashboards, one that monitors case counts, available here, and another that focuses on testing and
hospitalization, available here; these were set up in the early days and weeks of the pandemic shutdown in
March and April 2020. DSHS also provides web endpoints for related datasets, a listing of which can be
found at https://dshs.texas.gov/coronavirus/AdditionalData.aspx. Until relatively recently, this data was
updated daily, typically in the 3pm–5pm range.
Questions.
1. Read in the data "Cases over Time by County" ("TexasCOVID-19NewCasesOverTimebyCounty.xlsx")
into a variable called new_cases, but don’t clean it yet (that will come in the next steps). Then run
the code below to show you’ve succeeded.
Note: You may look at the file you have downloaded in another application, but do not edit it; all
manipulations of the file must be done in R.
Hint: Be sure to look at the whole dataset before reading it in. I encourage you to use readxl::cell_limits()
with the ul and lr arguments to get the reading right.
new_cases |> select(1:5) |> glimpse()
# Rows: 254
# Columns: 5
# $ County                 <chr> "Anderson", "Andrews", "Angelina", "Aransas", "~
# $ `New Cases 03-04-2020` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
# $ `New Cases 03-05-2020` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
# $ `New Cases 03-06-2020` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
# $ `New Cases 03-07-2020` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
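To see how the ul and lr arguments of readxl::cell_limits() pick out a rectangle of cells, here is a minimal sketch. It uses the datasets.xlsx example workbook that ships with readxl, not the DSHS file, and the row and column indices are chosen only for this example. The rectangle for the COVID workbook has to be found by inspecting that file.

```r
library(readxl)

# Example workbook bundled with readxl (not the DSHS file).
path <- readxl_example("datasets.xlsx")

# ul and lr are c(row, column) pairs. Here they describe the rectangle
# B1:D5 of the "mtcars" sheet: the header row plus four data rows.
demo <- read_excel(
  path,
  sheet = "mtcars",
  range = cell_limits(ul = c(1, 2), lr = c(5, 4))
)

demo
```

Cells outside the rectangle, such as title rows above the header or notes below the data, are simply not read.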
2. Clean the names to match the naming conventions listed below. Run the code below to show you’ve
succeeded.
new_cases |> select(1:5) |> glimpse()
# Rows: 254
# Columns: 5
# $ county       <chr> "Anderson", "Andrews", "Angelina", "Aransas", "Archer", "~
# $ `03_04_2020` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
# $ `03_05_2020` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
# $ `03_06_2020` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
# $ `03_07_2020` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
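One possible way to get from the raw headers to the cleaned ones, sketched on a two-row toy tibble whose column names copy the pattern in the Question 1 output. The helper clean_name() is a hypothetical name and the counts are made up. The sketch assumes every count column begins with the literal text "New Cases ", which should be checked against the real headers.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the raw data; the counts are invented.
toy <- tibble(
  County = c("Anderson", "Andrews"),
  `New Cases 03-04-2020` = c(0, 1),
  `New Cases 03-05-2020` = c(2, 0)
)

# Hypothetical helper: drop the prefix, swap dashes for underscores,
# and lower-case whatever is left.
clean_name <- function(nm) {
  nm |>
    str_remove("^New Cases ") |>
    str_replace_all("-", "_") |>
    str_to_lower()
}

toy_clean <- toy |> rename_with(clean_name)
names(toy_clean)
```

rename_with() applies the function to every column name at once, so the same call covers all of the date columns without listing them.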
3. Change all count columns to integers (instead of doubles). Run the code below to show you’ve
succeeded.
new_cases |> select(1:5) |> glimpse()
# Rows: 254
# Columns: 5
# $ county       <chr> "Anderson", "Andrews", "Angelina", "Aransas", "Archer", "~
# $ `03_04_2020` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
# $ `03_05_2020` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
# $ `03_06_2020` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
# $ `03_07_2020` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
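A small sketch of converting every column except the county name in one step, using dplyr::across() on a toy tibble with made-up counts. It assumes the county column is the only non-count column, as in the output above.

```r
library(dplyr)

# Toy stand-in with double-typed counts; the values are invented.
toy <- tibble(
  county = c("Anderson", "Andrews"),
  `03_04_2020` = c(0, 1),
  `03_05_2020` = c(2, 0)
)

# Apply as.integer() to every column except county.
toy_int <- toy |>
  mutate(across(-county, as.integer))

glimpse(toy_int)
```

In the glimpse() output the only visible change is the type tag of the count columns, which switches from dbl to int.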
4. Reshape new_cases to have columns county, date, and new_cases, and convert the dates into date objects.
Run the code below to show you’ve succeeded.
new_cases
# # A tibble: 158,496 x 3
#    county   date       new_cases
#
#  1 Anderson 2020-03-04         0
#  2 Anderson 2020-03-05         0
#  3 Anderson 2020-03-06         0
#  4 Anderson 2020-03-07         0
#  5 Anderson 2020-03-08         0
#  6 Anderson 2020-03-09         0
#  7 Anderson 2020-03-10         0
#  8 Anderson 2020-03-11         0
#  9 Anderson 2020-03-12         0
# 10 Anderson 2020-03-13         0
# # ... with 158,486 more rows
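The wide-to-long reshape can be sketched on a toy tibble as follows. The counts are made up, and parsing the names with as.Date() and the format "%m_%d_%Y" is one option among several; it assumes the column names follow the month_day_year pattern from Question 2.

```r
library(dplyr)
library(tidyr)

# Toy wide table: one row per county, one column per day (invented counts).
wide <- tibble(
  county = c("Anderson", "Andrews"),
  `03_04_2020` = c(0L, 1L),
  `03_05_2020` = c(2L, 0L)
)

long <- wide |>
  # Stack the date columns: old names go to "date", cell values to "new_cases".
  pivot_longer(-county, names_to = "date", values_to = "new_cases") |>
  # The former column names are still strings, so parse them into Date objects.
  mutate(date = as.Date(date, format = "%m_%d_%Y"))

long
```

Two counties times two days gives four rows here, in the same way that 254 counties times the number of reported days gives the 158,496 rows above.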
5. I have included along with this document on Canvas another data file containing the population of
each Texas county (this data came from DSHS as well). The file name is county-populations.csv.
Read this file and merge its information into new_cases. After merging the population information in, run
the code below to show you’ve succeeded.
new_cases
# # A tibble: 158,496 x 4
#    county   date       new_cases population
#
#  1 Anderson 2020-03-04         0      58199
#  2 Anderson 2020-03-05         0      58199
#  3 Anderson 2020-03-06         0      58199
#  4 Anderson 2020-03-07         0      58199
#  5 Anderson 2020-03-08         0      58199
#  6 Anderson 2020-03-09         0      58199
#  7 Anderson 2020-03-10         0      58199
#  8 Anderson 2020-03-11         0      58199
#  9 Anderson 2020-03-12         0      58199
# 10 Anderson 2020-03-13         0      58199
# # ... with 158,486 more rows
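A sketch of the merge on toy data, assuming the population table has one row per county with columns named county and population (the real CSV's column names may differ). The Andrews row is deliberately left out of the toy population table to show what an unmatched county looks like after the join.

```r
library(dplyr)

# Toy long-format cases table.
cases <- tibble(
  county = c("Anderson", "Anderson", "Andrews"),
  date = as.Date(c("2020-03-04", "2020-03-05", "2020-03-04")),
  new_cases = c(0L, 0L, 0L)
)

# Assumed layout of the population table. Only Anderson is included;
# its value is the one shown in the output above.
populations <- tibble(county = "Anderson", population = 58199)

# Keep every case row and attach the matching population where one exists.
merged <- cases |> left_join(populations, by = "county")

# Rows that found no population, for example because of a spelling mismatch.
unmatched <- merged |> filter(is.na(population))
unmatched
```

A left join never drops case rows, so the row count stays the same and any county that failed to match shows up with NA in population.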
6. Create a line chart showing the incident cases (daily new cases) for the top 9 counties in Texas by
population. Plot these on the same graph, distinguishing counties by color. Polish the graphic.
Note: The plot will be overplotted.
Hint: What are the aesthetics in this plot?
Hint 2: Determine the top population counties first, then filter new_cases by checking whether the
county is in that top list (in a pipeline, don’t re-save new_cases). Then make the plot.
Hint 3: Consider using scale_x_date()!
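The pattern in Hint 2 can be sketched on a toy table with three made-up counties, keeping the top 2 by population instead of the top 9. The county names, counts, and populations are all invented, and the scale_x_date() settings are placeholders rather than recommended values.

```r
library(dplyr)
library(ggplot2)

# Toy long-format table with invented values.
toy <- tibble(
  county = rep(c("A", "B", "C"), each = 3),
  date = rep(as.Date("2020-03-04") + 0:2, times = 3),
  new_cases = c(1L, 2L, 3L, 0L, 1L, 0L, 5L, 4L, 6L),
  population = rep(c(100, 50, 300), each = 3)
)

# Step 1: one row per county, then keep the names of the largest ones.
top_counties <- toy |>
  distinct(county, population) |>
  slice_max(population, n = 2) |>
  pull(county)

# Step 2: filter inside the pipeline (toy itself is not overwritten) and plot.
p <- toy |>
  filter(county %in% top_counties) |>
  ggplot(aes(x = date, y = new_cases, color = county)) +
  geom_line() +
  scale_x_date(date_breaks = "1 day", date_labels = "%b %d")

p
```

Mapping county to color inside aes() is what gives each county its own line and legend entry.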
7. Instead of using color, facet the graphic. Free the scales in the faceting function to allow for easier
visibility of the curves. Again, polish the graphic.
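A minimal sketch of faceting with freed scales, on two made-up counties whose counts differ by two orders of magnitude. Freeing only the y axis with scales = "free_y" is one choice; scales = "free" would free both axes.

```r
library(ggplot2)

# Toy data: county B has counts about 100 times larger than county A.
toy <- data.frame(
  county = rep(c("A", "B"), each = 4),
  date = rep(as.Date("2020-03-04") + 0:3, times = 2),
  new_cases = c(1, 3, 2, 5, 100, 300, 250, 400)
)

# One panel per county; each panel gets its own y-axis range.
p <- ggplot(toy, aes(x = date, y = new_cases)) +
  geom_line() +
  facet_wrap(~ county, scales = "free_y")

p
```

With a shared y axis the curve for county A would be squashed flat against zero; with free scales its shape is visible, at the cost of panels that can no longer be compared by height alone.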
8. The slider package allows you to compute windowed functions; here we’ll use it for computing moving
averages. Look at this (and think about it!) to see how it works.
library("slider")
x <- 1:5
slide_dbl(x, mean, .before = 1)
# [1] 1.0 1.5 2.5 3.5 4.5
slide_dbl(x, mean, .before = 2)
# [1] 1.0 1.5 2.0 3.0 4.0
Instead of looking at daily new cases, re-create the graphic above using 7-day moving averages.
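As a sketch of how the same idea extends to grouped data, the example below computes a trailing 7-day average separately for each of two made-up counties. Reading "7-day moving average" as the current day plus the six days before it (.before = 6) is an assumption; a centered window is another reasonable reading. The column name avg_7 is hypothetical.

```r
library(dplyr)
library(slider)

# Toy data: county A rises 0, 1, ..., 7; county B is constant at 7.
toy <- tibble(
  county = rep(c("A", "B"), each = 8),
  date = rep(as.Date("2020-03-04") + 0:7, times = 2),
  new_cases = c(0:7, rep(7L, 8))
)

smoothed <- toy |>
  group_by(county) |>
  # The window slides over rows, so rows must be in date order within a county.
  arrange(date, .by_group = TRUE) |>
  # Current row plus the 6 rows before it; early rows use shorter windows.
  mutate(avg_7 = slide_dbl(new_cases, mean, .before = 6)) |>
  ungroup()

smoothed
```

Grouping before sliding keeps one county's first days from being averaged with the previous county's last days. As in the 1:5 example above, the first six rows of each county are averages over fewer than seven days.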
9. Re-stack the curves onto a single graphic, colored by county, using the smoothed 7-day moving averages.
Note: This will just be copy/paste and add one line from the last code chunk.
10. Note that the above graphics don’t necessarily communicate how bad community spread is in each
county, since the county sizes differ. Make a line chart of the new cases per 10,000 individuals by
county, by dividing the 7-day moving average of daily new cases by the population size and
multiplying by 10,000. Again, color the lines by county.
What do you notice in this graphic that you couldn’t see in the last one?
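The arithmetic behind the per-10,000 rate, as a small worked sketch. The helper per_10k() is a hypothetical name, the 7-day averages are made-up inputs, and the second county is invented; only the population 58199 comes from the output shown earlier.

```r
# Hypothetical helper: cases per 10,000 residents.
per_10k <- function(avg_cases, population) {
  avg_cases / population * 10000
}

# Population 58199 with a made-up average of 10 cases per day.
small <- per_10k(10, 58199)      # roughly 1.72 per 10,000

# An invented county of 5 million with a made-up average of 100 cases per day.
large <- per_10k(100, 5000000)   # 0.2 per 10,000

c(small = small, large = large)
```

The larger county has ten times the raw count but a much lower rate, which is the kind of reversal the per-capita graphic can reveal.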