STA 4373 – Computational Methods in Statistics
Fall 2022
STA 4373 Assignment 1
Instructions.
In this assignment you’ll analyze a college sports dataset and create a PDF of your results using the Quarto
template I’ve posted to Canvas. Quarto essentially functions equivalently to R Markdown but will likely
become a kind of successor to that technology. Like R Markdown, Quarto interweaves code with text using
Markdown syntax, and presently most R Markdown files (file extension .Rmd) can be changed to Quarto files
(file extension .qmd) and compile in exactly the same way. You will need to download and install Quarto on
your machine; see the above link.
When you turn in the file, the filename of the turn-in should be last names separated by dashes and
terminated with -1.pdf. For example, if Joe Shmo, Jane Doe, and Mickey Mouse worked together, they would
turn in shmo-doe-mouse-1.pdf.
You may use your text and work in groups of size up to three. Only one delegate of your team will submit
the resulting PDF on Canvas. The PDF should have the names of each of the collaborators on top. The
main advantage to working in a group is that you can bounce ideas off one another, and hopefully uncover
more interesting features of the data.
You may use the internet to access the text’s webpage, other websites directly linked in this document, and
other general-purpose data science in R questions. However, you may not read or use any analyses of this
or related datasets you find online. Failure to follow this rule may be considered a violation of this course’s
academic integrity policy. If you have any questions about this, please contact me.
Please put a new page break before each question so each question starts on its own page (this will
facilitate grading) and never provide output that runs over more than one page!
College athletics.
According to Data is Plural, a well-known data science blog:
The Equity in Athletics Disclosure Act (EADA) requires thousands of US colleges to provide
annual data on athletic participation, staffing, and finances by team gender and sport. School-
and team-level datasets are available through the Department of Education for the academic
years ending 2003–19.
The good folks at TidyTuesday, a popular online data-science education community, have scraped and
aggregated some of this data and posted it to GitHub. You can download the dataset using the tidytuesdayR
package like this:
tuesdata <- tidytuesdayR::tt_load(2022, week = 13)
sports <- tuesdata$sports
A brief explanation of the data, along with the code used to pull it off of the DoE’s webpage, can be seen at
the GitHub link above.
Note: This assignment involves the analysis of an interesting dataset that can easily lead to controversy.
As is so often the case, such data tend to speak to phenomena that are surprisingly nuanced. In such
situations, try to be slow to form opinions, skeptical of the ones you do form, and limit any conclusions you
draw to analyses you yourself have conducted on the data at hand.
Questions.
1. Describe the dataset: How many observations does it contain? What does each represent? How many
variables are present?
2. classification_name refers to the league that the team plays in. Count how many observations
pertain to each level of classification_name.
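As a rough sketch of the kind of tally meant here, the following uses a small made-up tibble in place of sports. The rows are invented for illustration only, and dplyr::count() is one of several ways to do this.

```r
library(dplyr)

# Toy stand-in for the real data; these rows are invented for illustration.
toy <- tibble(
  classification_name = c(
    "NCAA Division I-FBS",
    "NCAA Division II",
    "NCAA Division I-FBS"
  )
)

# count() returns one row per level together with the number of rows (n).
level_counts <- count(toy, classification_name)
level_counts
```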
3. NCAA Division I-FBS corresponds to the Football Bowl Subdivision of Division 1 NCAA sports; read
the top blurb here for a brief introduction.
Make a dataset div_1_fbs that contains only the observations from Division 1 FBS schools. glimpse()
the dataset to show you’ve succeeded.
4. How many Division 1 FBS schools are in the dataset?
5. Presumably the revenues of a team are realized after the expenditures. Make a scatterplot of the
revenues (y) vs expenditures (x) colored by sex. Polish the graphic so that the axes are clearly legible
and understood. Use geom_function(fun = ~ .x, linetype = 2) to add a reference break-even line
to the graphic.
Hint 1: See slide 52 in the data visualization slides. Also, check out scale_color_manual().
Hint 2: In your axes scales, use labels = label_dollar(suffix = "mil", scale = 1e-6); it’ll
make them look much better!
Hint 3: Use na.rm = TRUE in your geom calls to suppress warnings about missing values, which come from,
e.g., not having female football teams.
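To see how the pieces in these hints fit together, here is a minimal sketch on invented data. The long-format layout, the column names expenditure and revenue, and the colors are all assumptions made for the example; the real dataset is laid out differently and will need some preparation first.

```r
library(ggplot2)
library(scales)

# Invented long-format data: one row per team, amounts in dollars.
toy <- data.frame(
  sex = c("men", "men", "women", "women"),
  expenditure = c(2.0e6, 3.5e7, 1.5e6, NA),
  revenue = c(1.8e6, 6.0e7, 1.5e6, NA)
)

# Labeller that rescales dollars to millions and appends "mil".
dollar_mil <- label_dollar(suffix = "mil", scale = 1e-6)

p <- ggplot(toy, aes(expenditure, revenue, color = sex)) +
  # na.rm = TRUE drops the incomplete row without a warning.
  geom_point(na.rm = TRUE) +
  # Dashed break-even line y = x; a fixed color keeps it out of the legend.
  geom_function(fun = ~ .x, linetype = 2, color = "black") +
  scale_x_continuous(name = "Expenditures", labels = dollar_mil) +
  scale_y_continuous(name = "Revenues", labels = dollar_mil) +
  scale_color_manual(
    name = "Sex",
    values = c(men = "steelblue", women = "firebrick")
  )
p
```

Points above the dashed line are teams whose revenues exceed their expenditures.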
6. Comment on the graphic according to the standard criteria: 1) general trend, 2) local behavior, 3)
outliers. Be sure to comment on the difference between the two groups in the process.
7. Re-create the previous graphic using log base 10 scales. Address overplotting with alpha blending
and shrinking point size (to the extent reasonably possible). Remove the break-even line.
Note: Start the limits of the axes at $100k in order to eliminate lower-value amounts.
Hint 1: Setting aesthetics for point layers will percolate into the legend, so that if you make the points
very transparent and small, the legend will be hard to see. You can override those set aesthetic values
in the legend by adding this:
guides(
color = guide_legend(override.aes = list(size = 3, alpha = 1))
)
Hint 2: On log axes, values in the thousands and values in the millions can be distinguished
simultaneously, so the axis labels need to handle both ranges. A clean way to address this is to set
scale_cut = c(0, k = 1e3, m = 1e6) in label_dollar(); see the documentation of that function
for details. Be sure to put in many breaks (not just 3, say) to illustrate the scale.
Hint 3: Try label_dollar(scale_cut = c(0, k = 1e3, m = 1e6)) here.
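The following sketch shows one way the log scales, limits, breaks, and labels might be combined. The data are invented, the particular set of breaks is just one reasonable choice, and scale_cut requires a reasonably recent version of the scales package (1.2.0 or later, as far as I know).

```r
library(ggplot2)
library(scales)

# Labeller that abbreviates thousands with "k" and millions with "m".
dollar_km <- label_dollar(scale_cut = c(0, k = 1e3, m = 1e6))

# Hand-picked breaks (an assumption; any reasonably dense set would do).
log_breaks <- c(1e5, 3e5, 1e6, 3e6, 1e7, 3e7, 1e8)
break_labels <- dollar_km(log_breaks)

# Invented data, in dollars.
toy <- data.frame(
  sex = c("men", "women", "men", "women"),
  expenditure = c(2.5e5, 4.0e5, 3.0e7, 2.0e6),
  revenue = c(2.0e5, 4.0e5, 7.0e7, 1.0e6)
)

p_log <- ggplot(toy, aes(expenditure, revenue, color = sex)) +
  # Small, transparent points reduce overplotting on the real data.
  geom_point(alpha = 0.3, size = 1, na.rm = TRUE) +
  # limits = c(1e5, NA) starts each axis at $100k and leaves the top open.
  scale_x_log10(
    name = "Expenditures", limits = c(1e5, NA),
    breaks = log_breaks, labels = dollar_km
  ) +
  scale_y_log10(
    name = "Revenues", limits = c(1e5, NA),
    breaks = log_breaks, labels = dollar_km
  )
p_log
```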
8. What proportion of men’s sports’ expenditures are the same as or within $2 (say) of their revenues?
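The arithmetic behind such a proportion can be sketched in base R on invented numbers. The vectors rev_men and exp_men below are made up for the example, and how missing values should be treated is a judgment call you should state in your write-up.

```r
# Invented men's revenues and expenditures in dollars (NA = no team).
rev_men <- c(1000, 5000, 12000, NA, 700)
exp_men <- c(1000, 5001, 9000, NA, 900)

# TRUE where revenue and expenditure differ by at most $2.
within_two <- abs(rev_men - exp_men) <= 2

# The mean of a logical vector is the proportion of TRUEs;
# na.rm = TRUE leaves the missing pair out of the denominator.
prop_within_two <- mean(within_two, na.rm = TRUE)
prop_within_two
```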
9. If we assume expenditures are investments intended to generate revenue, the return on investment
(ROI) of each team on a per-dollar-spent basis is simply the ratio of revenues to expenditures. Create
new variables roi_men and roi_women and add them to the div_1_fbs dataset.
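As a small worked example of the ratio, consider the toy table below. The column names rev_men, exp_men, rev_women, and exp_women are assumptions here and should be checked against the real data before reuse.

```r
library(dplyr)

# Invented wide-format rows; amounts are in dollars, NA = no team.
toy <- tibble(
  rev_men = c(200, 50, NA),
  exp_men = c(100, 50, NA),
  rev_women = c(30, 80, 10),
  exp_women = c(60, 80, 40)
)

# ROI per dollar spent = revenue / expenditure; NA inputs give NA.
toy <- mutate(
  toy,
  roi_men = rev_men / exp_men,
  roi_women = rev_women / exp_women
)
toy
```

An ROI above 1 means the team brought in more than it spent, and an ROI of exactly 1 is break-even.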
10. Investigate and compare the distribution of men’s teams’ and women’s teams’ ROIs. How are they
similar? How are they different? Please present no more than 3 graphics in your write-up (not including
combinations of graphics made with patchwork). Comment briefly to explain your line of reasoning.
Note: this is not asking you to look at their joint distribution.
Hint: Remember you can use the patchwork package to put more than one graphic in a figure!
11. Which sports seem to garner the highest ROI for each sex? Provide graphics to support your
conclusion.
Hint: You can use drop_na(roi_men) (for example) to drop out levels for sports that aren’t played by
one sex or the other.
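A quick illustration of what drop_na() does, on an invented table (the column name sports and all values are assumptions for the example):

```r
library(tidyr)

# Invented rows; NA marks a sport with no team for that sex.
toy <- data.frame(
  sports = c("Football", "Softball", "Basketball"),
  roi_men = c(1.8, NA, 1.1),
  roi_women = c(NA, 0.9, 0.7)
)

# Keep only the rows where roi_men is not missing.
men_only <- drop_na(toy, roi_men)
men_only
```

Naming a column restricts the check to that column, so rows with a missing roi_women are kept as long as roi_men is present.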
12. Restricting our attention to basketball, where the team sizes and game requirements are more or less
the same, compare the expenditures of Division 1 FBS schools for men and women by visualizing the
joint distribution of the two. Comment briefly.