Maths Skills 2 (Statistics) assessment
Spring Term 2019
This assignment consists of two parts: in the first one you are asked to revisit the data on
UK fishing vessels under 10 metres long; the second part instead concerns data on loans
(granted as well as rejected) from LendingClub, one of the largest U.S. peer-to-peer financial
companies. For each part, you will download data from a website, read them into R, perform
some computations on them, and produce graphical displays.
The process and results of the analysis should be documented in an R Markdown file, producing
an output document in PDF format. Code alone is not acceptable: at each step, you need to
explain what you are doing and comment (briefly!) on the results. Your submission should
consist of a single zipped folder containing only:
1. the R Markdown file (.Rmd file) and
2. the output PDF document.
Absence of one of the files will incur a very high penalty. Inclusion in the zipped folder of
other unrequested files, including the data files, will also attract a penalty.
The maximum number of pages of the PDF document is 10, all inclusive. Because of this, it
is best to do without a table of contents; however, sectioning of the document is strongly
encouraged. In particular, separate sections should be devoted to each of the two parts of
the assignment.
You are allowed to use and adapt for your purposes all the materials presented in the lectures
and posted on Moodle, including the R Markdown input files, without any need of referencing
them. However, you should provide, at the end of your report, references to the R packages
you are using and to the data sources.
You may informally discuss with fellow students how to perform a task. However, you are
not allowed to share your code, and should write the report on your own.
The following is a more detailed description of what the assignment entails.
Part A: UK Fishing vessels
The goal of this part is to provide a graphical representation of the geographical distribution of
UK fishing vessels under 10 metres in length.
1. Use get_map() in package ggmap to obtain from Stamen Maps a map of the UK and
save it into an object UKmap. Make sure that Northern Ireland and the Shetland Islands
are included; this will require some experimentation to find an appropriate bounding box.
2. Read into R the vessel data using read_csv(), then change the variables’ names as
done in the lectures.
3. Produce a tibble AdminPortCount containing a tally of the numbers of vessels in each
Administrative Port.
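A tally of this kind can be sketched with count() from dplyr. The tibble Vessels and the column name AdminPort below are made-up stand-ins for the real vessel data and for the names used in the lectures, so treat this as an illustration only.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the vessel data; the column name AdminPort is an assumption
Vessels <- tibble(
  AdminPort = c("Brixham", "Newlyn", "Brixham", "Lerwick", "Newlyn", "Brixham")
)

# One row per Administrative Port, with the number of vessels in column n
AdminPortCount <- Vessels %>%
  count(AdminPort, sort = TRUE)

print(AdminPortCount)
```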
4. Use geocode() on the first column of AdminPortCount to obtain a tibble with the
coordinates of each port. Then, with bind_cols(), bind together this tibble and
AdminPortCount, and call the resulting tibble AdminPortCLL.
Check that the coordinates of the ports in AdminPortCLL are all within the map’s
bounding box; filter() may be useful for this purpose.
If any of the coordinates are outside the map, use geocode() again for the corresponding
ports, adding ", UK" to the name of the port. Then replace the wrong coordinates in
AdminPortCLL with the correct ones. Print the tibble AdminPortCLL.
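The bounding-box check and the replacement of wrong coordinates might look roughly as follows. Everything here is illustrative: the bounding box values, the port names, the approximate coordinates and the tibble fixed (which stands in for the result of a second geocode() call) are all assumptions, chosen so that the sketch runs without any network access.

```r
library(dplyr)
library(tibble)

# Hypothetical bounding box in degrees; the real one comes from step 1
bbox <- c(left = -11, bottom = 49.5, right = 3, top = 61.5)

# Toy geocoding result; "Newlyn" is deliberately given wrong coordinates
AdminPortCLL <- tibble(
  AdminPort = c("Brixham", "Newlyn", "Lerwick"),
  n   = c(3L, 2L, 1L),
  lon = c(-3.52, -70.30, -1.15),
  lat = c(50.39, 43.70, 60.15)
)

# Ports whose coordinates fall outside the map
outside <- AdminPortCLL %>%
  filter(lon < bbox[["left"]]   | lon > bbox[["right"]] |
         lat < bbox[["bottom"]] | lat > bbox[["top"]])

# Stand-in for the output of geocode() on paste0(outside$AdminPort, ", UK")
fixed <- tibble(AdminPort = "Newlyn", lon = -5.55, lat = 50.10)

# Overwrite only the rows that were geocoded again, matching on the port name
AdminPortCLL <- AdminPortCLL %>%
  rows_update(fixed, by = "AdminPort")

print(AdminPortCLL)
```

rows_update() requires dplyr 1.0.0 or later; other ways of replacing the rows are equally acceptable.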
5. Plot UKmap with ggmap() and use geom_point() to overlay on it red filled circles
centered at the Administrative ports and with area proportional to the number of fishing
vessels in each port. The size aesthetic is useful in this regard. Experiment with a few
values of alpha to control the transparency of the circles. What is the effect of adding
scale_size_area() to the specification of your plot?
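The following sketch builds the same point layer with and without scale_size_area(), so that the two versions can be compared side by side. It uses a plain ggplot() instead of ggmap(UKmap) to stay offline, and the coordinates and counts are invented. According to the ggplot2 documentation, scale_size_area() maps a value of 0 to a circle of area 0, whereas the default size scale starts from a non-zero minimum size.

```r
library(ggplot2)
library(tibble)

# Toy port data with approximate coordinates; the counts are invented
ports <- tibble(
  lon = c(-3.52, -5.55, -1.15),
  lat = c(50.39, 50.10, 60.15),
  n   = c(10, 40, 160)
)

# In the assignment the base layer would be ggmap(UKmap), with the data
# passed to geom_point(); a plain ggplot() keeps this sketch self-contained
p_default <- ggplot(ports, aes(x = lon, y = lat, size = n)) +
  geom_point(colour = "red", alpha = 0.5)

# Same layer, now with circle area proportional to n
p_area <- p_default + scale_size_area(max_size = 12)

print(p_default)
print(p_area)
```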
Part B: LendingClub’s data on Loans and Rejections
The general aim of this part is to produce some plots to compare the loans granted by
LendingClub to the loan applications that were rejected. In order to do that, the data for
Loans and Rejects, which are available in different formats, need to undergo preliminary
transformations.
1. Pick a State in the U.S.; a list is available at https://en.wikipedia.org/wiki/U.S._state.
Send me an email at agostino.nobile@york.ac.uk with subject Maths Skills 2 -
US State, to inform me of your choice: I may ask you to change it, if it is unsuitable or
too popular. By sending the email you are not committing yourself to that choice of
State (email me again to change it), or indeed to doing the Statistics assignment of
Maths Skills 2.
2. Go to the LendingClub website https://www.lendingclub.com/info/download-data.action
and download the Loans data and the Rejections data for the four quarters of 2018.
Also download the Data Dictionary available at the bottom of the same page. It may
be useful to explore the LendingClub site a bit, to get an idea of their business and of
peer-to-peer lending more generally.
The eight zipped csv files downloaded have a total size of about 170MB. Place them in
a subfolder, say called DATA, of the folder where you keep the Rmd file for this project.
Depending on how much disk space you have available, you may not wish to unzip all
the files at once, as their inflated size is about 1.1GB.
3. For each of the zipped Rejects files:
Unzip the file from within R using something like
system("unzip DATA/filename.csv.zip")
Read the data in the unzipped file into R, using read_csv().
Remove the unzipped file (to save disk space), with something like
system("rm filename.csv")
Filter the data in the resulting tibble, to keep only the records where the variable State
equals the two letter code of your chosen state.
Finally, use the function bind_rows() to collect all the data for your chosen state into
a single tibble called Rejects.
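One possible way to organise the unzip, read, remove and filter sequence is to wrap it in a small function and apply it to each file. In this sketch the function name read_rejects_for_state, the file names and the state code "NY" are hypothetical, and base R's unzip() and file.remove() are used as portable substitutes for the system() calls shown above.

```r
library(dplyr)
library(readr)

# Read one zipped Rejects file and keep only the rows for one state.
# The function name and its arguments are illustrative, not prescribed.
read_rejects_for_state <- function(zip_path, state_code) {
  # unzip() returns the path(s) of the extracted file(s)
  csv_path <- unzip(zip_path, exdir = tempdir())
  # Delete the extracted csv when the function exits, to save disk space
  on.exit(file.remove(csv_path), add = TRUE)
  read_csv(csv_path) %>%
    filter(State == state_code)
}

# Hypothetical file names; adjust them to the files actually downloaded
zip_files <- file.path("DATA", paste0("RejectStats_2018Q", 1:4, ".csv.zip"))

# Only run when the files are present, then stack the four quarters
if (all(file.exists(zip_files))) {
  Rejects <- bind_rows(lapply(zip_files, read_rejects_for_state,
                              state_code = "NY"))
}
```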
4. Read the Loans data into R, following a procedure similar to the one for the Rejects.
For the Loans data, you will need to deal with two additional complications:
the csv files contain two summary rows at the bottom which must not be read; the
argument n_max in read_csv() is helpful in this respect;
the algorithm used by read_csv() to guess the type of each variable (chr, dbl, etc.)
fails for these data, if one uses the default value of the argument guess_max (more
details can be found in § 11.4 of Wickham and Grolemund, 2017). To fix this, in the
call to read_csv(), set guess_max to the same value as n_max.
As for Rejects, collect all the Loans data for your chosen state in a single tibble called
Loans.
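A minimal sketch of the n_max and guess_max idea, using a tiny made-up file with a header, three data rows and two summary rows. The contents of the summary rows and the way the number of data rows is obtained (counting lines with read_lines()) are assumptions made for the illustration; any other way of finding the number of data rows would do.

```r
library(readr)

# Toy file mimicking the layout: header, three data rows, two summary rows
path <- tempfile(fileext = ".csv")
writeLines(c(
  "loan_amnt,addr_state,dti",
  "10000,NY,12.5",
  "5000,CA,30.1",
  "20000,NY,8.0",
  "Total amount funded in policy code 1: 35000",
  "Total amount funded in policy code 2: 0"
), path)

# Data rows = all lines minus the header minus the two summary rows
n_rows <- length(read_lines(path)) - 3

# Stop reading before the summary rows, and let read_csv() look at
# every data row when guessing the column types
Loans_q <- read_csv(path, n_max = n_rows, guess_max = n_rows)

print(Loans_q)
```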
5. The Loans tibble should contain 145 variables. Use select() to keep only the variables
loan_amnt, title, dti, zip_code, addr_state and emp_length, renamed respectively
as Amount, Title, Debt2IncomeRatio, ZipCode, State and EmploymentLength.
Perform the same operation on the Rejects data, by selecting the same six variables
(note the different names!) and renaming them as done for the Loans data.
The variable Debt2IncomeRatio for the Rejects data is of type chr: remove the
trailing % symbols and transform the variable to type dbl.
Replace any instance of "n/a" in the character variables in the two tibbles with NA.
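The cleaning steps for the Rejects data could be sketched as below on a two-row toy tibble. The original column names used here (such as `Amount Requested` and `Debt-To-Income Ratio`) are an assumption about how the Rejects files are laid out and should be checked against the downloaded data; the values are invented.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy Rejects data; the original column names are assumed, the values invented
Rejects <- tibble(
  `Amount Requested`     = c(1000, 5000),
  `Loan Title`           = c("debt_consolidation", "n/a"),
  `Debt-To-Income Ratio` = c("10.5%", "100%"),
  `Zip Code`             = c("100xx", "104xx"),
  State                  = c("NY", "NY"),
  `Employment Length`    = c("< 1 year", "n/a")
)

Rejects <- Rejects %>%
  # select() can rename while selecting: new_name = old_name
  select(Amount = `Amount Requested`,
         Title = `Loan Title`,
         Debt2IncomeRatio = `Debt-To-Income Ratio`,
         ZipCode = `Zip Code`,
         State,
         EmploymentLength = `Employment Length`) %>%
  # Replace "n/a" with NA in every character column
  mutate(across(where(is.character), ~ na_if(.x, "n/a"))) %>%
  # Drop the trailing % and convert to a number
  mutate(Debt2IncomeRatio = as.numeric(str_remove(Debt2IncomeRatio, "%$")))

print(Rejects)
```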
6. Use bind_rows() to collect together the Loans and Rejects data into a single tibble
named LoanApps, with an additional character variable Granted that is equal to "Yes"
for Loans, and equal to "No" for Rejects. The argument .id of bind_rows() is of
help in this regard.
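As a small illustration of the .id argument: when the tibbles are passed to bind_rows() as named arguments, the names become the values of the new identifier column. The two toy tibbles below are invented.

```r
library(dplyr)
library(tibble)

# Toy versions of the two cleaned tibbles
Loans   <- tibble(Amount = c(10000, 5000),      State = "NY")
Rejects <- tibble(Amount = c(1000, 2000, 3000), State = "NY")

# The argument names "Yes" and "No" fill the new character column Granted
LoanApps <- bind_rows(Yes = Loans, No = Rejects, .id = "Granted")

print(LoanApps)
```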
Use mutate(), fct_recode() and fct_relevel() to replace, in the tibble LoanApps,
the variable EmploymentLength with a factor EmploymentYears having levels "< 1",
"1", ..., "9", "10+".
7. Make a barplot to explore whether the proportion of granted loans is related to the loan
title. Which type of loans are granted more (or less) often?
Consider next how the proportion of granted loans varies across the levels of the factor
EmploymentYears. Make a plot and comment on the results.
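A sketch of how the factor EmploymentYears from step 6 might be built and then used in a bar chart of proportions. The raw labels ("< 1 year", "1 year", ..., "10+ years") are an assumption about how employment length is coded in the data, and the toy tibble is invented. With position = "fill" each bar is rescaled to height 1, so the coloured segments show proportions rather than counts; the same construction applies with Title on the x axis.

```r
library(dplyr)
library(forcats)
library(ggplot2)
library(tibble)

# Toy data with one application per employment length (labels are assumed)
LoanApps <- tibble(
  Granted = c("Yes", "No", "No", "Yes", "No", "No",
              "Yes", "No", "No", "Yes", "No"),
  EmploymentLength = c("10+ years", "< 1 year", "1 year", "2 years",
                       "3 years", "4 years", "5 years", "6 years",
                       "7 years", "8 years", "9 years")
)

LoanApps <- LoanApps %>%
  mutate(
    # Shorten the labels: new label on the left, old label on the right
    EmploymentYears = fct_recode(EmploymentLength,
      "< 1" = "< 1 year", "1" = "1 year",  "2" = "2 years",
      "3" = "3 years",    "4" = "4 years", "5" = "5 years",
      "6" = "6 years",    "7" = "7 years", "8" = "8 years",
      "9" = "9 years",    "10+" = "10+ years"),
    # Put the levels in their natural order instead of alphabetical order
    EmploymentYears = fct_relevel(EmploymentYears,
      "< 1", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10+")
  ) %>%
  select(-EmploymentLength)

# Proportion of granted and rejected applications at each level
p_emp <- ggplot(LoanApps, aes(x = EmploymentYears, fill = Granted)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion of applications")

print(p_emp)
```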
8. Use boxplots (and an appropriate scale) to compare the distributions of the variable
Amount between granted and rejected loans. Are the variables Amount and Granted
related, and if so how?
Compare next the distributions of the variable Debt2IncomeRatio between granted and
rejected loans.
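Side-by-side boxplots on a logarithmic scale are one way to make such a comparison when the amounts are strongly skewed. The data below are invented, and the choice of scale_y_log10() is only one candidate for an "appropriate scale".

```r
library(ggplot2)
library(tibble)

# Toy data: five granted and five rejected applications
LoanApps <- tibble(
  Granted = rep(c("Yes", "No"), each = 5),
  Amount  = c(5000, 10000, 12000, 20000, 35000,
              500, 1000, 3000, 15000, 100000)
)

# One box per value of Granted, with Amount on a log10 axis
p_amount <- ggplot(LoanApps, aes(x = Granted, y = Amount)) +
  geom_boxplot() +
  scale_y_log10()

print(p_amount)
```

The same construction, with Debt2IncomeRatio in place of Amount, can be used for the second comparison.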
9. Use geom_bar() to investigate whether the proportion of granted loans varies across
ZipCodes. As some ZipCodes have small counts, use fct_lump() to restrict attention
to the most common (30 or fewer) ZipCodes.
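The lumping step can be sketched as follows on invented data, keeping the 2 most common ZipCodes instead of 30 so that the effect is visible on a small example. Dropping the "Other" level before plotting is one reading of "restrict attention"; keeping it as an extra bar is another.

```r
library(dplyr)
library(forcats)
library(ggplot2)
library(tibble)

# Toy data: two frequent ZipCodes and three rare ones
LoanApps <- tibble(
  ZipCode = c(rep("100xx", 4), rep("112xx", 3), "104xx", "113xx", "117xx"),
  Granted = c("Yes", "No", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No")
)

# Keep the n most common ZipCodes; all the others are merged into "Other"
Lumped <- LoanApps %>%
  mutate(ZipCode = fct_lump(ZipCode, n = 2))

# Proportion of granted applications for the common ZipCodes only
p_zip <- Lumped %>%
  filter(ZipCode != "Other") %>%
  ggplot(aes(x = ZipCode, fill = Granted)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion of applications")

print(p_zip)
```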