FIT5145 S1 2019 Assignment 3
Semester 1, 2019
Due: Week 10, Sunday 19th May, 11:50pm.
Student Name:
Student ID:
Tutor name:
Hand in Requirements:
1) Please hand in a PDF file containing your answers to all the questions, numbered correspondingly.
- You can use Word or other word processing software to format your submission. Just save the final copy to a PDF before submitting. We recommend modifying the word version of these assignment instructions using the format provided.
- Make sure to include screenshots/images of the graphs you generate in order to justify your answers to all the questions.
- Make sure to include copies of all the bash command lines and R scripts you use. Screenshots of code are not acceptable. You must copy and paste the actual text. If your answer is wrong, you may still get half marks if your command line or script is close to correct.
2) Please hand in the text files that you create for Parts B and C (i.e., fuel_data.txt and california.txt).
3) Please submit the PDF file and txt files separately. (.zip, .rar or similar file formats are NOT accepted. Zip file submission will incur a penalty of 10%.)
4) Late submissions will incur a penalty of 5% per day, including weekends and public holidays.
NOTE: Two data sets for this assignment are in the Google shared drive:
https://drive.google.com/drive/folders/1QUJ6aFqgkIaefnjRpj2sO3bdoQOOow5A?usp=sharing
Both are large, so your best bet is to download them while in the lab/studio and do the assignment on the lab machines. You will need a Linux machine, a Mac terminal, or Cygwin on a Windows machine. Ideally you would run R from the bash shell/terminal.
Use case-sensitive parsing when processing data in all parts.
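As a small illustration of what case-sensitive parsing means in practice, the sketch below uses made-up sample text (not the assignment data). grep matches case-sensitively by default; the -i flag makes matching case-insensitive, which is what this requirement rules out.

```shell
# Hypothetical sample text, not taken from the assignment data sets.
sample=$(printf 'Solar plant\nsolar farm\nSOLAR array\n')

# grep is case-sensitive by default: only the line with lower-case "solar" matches.
sensitive=$(printf '%s\n' "$sample" | grep -c 'solar')

# With -i, matching ignores case, so all three lines match.
insensitive=$(printf '%s\n' "$sample" | grep -c -i 'solar')

echo "case-sensitive: $sensitive, case-insensitive: $insensitive"
```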
Part A: Investigating the data in the Shell
Download the file ELEC.zip. Use a Unix shell to manipulate the file and answer the following questions.
1) Decompress the file ELEC.zip. How big is the file ELEC.txt that is obtained after unzipping?
Linux command:
Answer:
2) Based on visual inspection of parts of the file, what is the most common delimiter used in the file? What units are used to quantify “electric fuel consumption”?
Linux command:
Answer:
3) Almost every line in the file that begins with the field “series_id” provides data on “electric fuel consumption” for a specific power plant in the USA. Apart from the field “series_id”, what are the other fields provided in a line containing the field “series_id”?
Answer:
4) How many lines are there in the file?
Linux command:
Answer:
5) For each line containing the field “series_id” and a given power plant, what does the field “f” represent? For the first power plant in the file (“Arlington Wind Power Project (56855)” for “all fuels” and “all primemovers”), what date range does the monthly electric fuel consumption data span, counting only the dates for which there is actually electric fuel consumption data?
Linux command:
Answer:
6) How many lines in the file contain the field “series_id”?
Linux command:
Answer:
7) How many unique power plants are named in the file in the lines containing the field “series_id”? Note that some power plants occur on multiple lines, because each line containing the field “series_id” may provide different information for the same power plant.
Linux command:
Answer:
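The difference between counting lines and counting distinct values can be seen on a toy example. This is only a sketch on invented names, not the assignment data, and extracting the plant name from each line is left out.

```shell
# Invented sample: one name per line, with "Plant A" repeated.
names=$(printf 'Plant A\nPlant B\nPlant A\nPlant C\n')

# Total number of lines (tr strips the padding some wc versions add).
total=$(printf '%s\n' "$names" | wc -l | tr -d ' ')

# Number of distinct values: sort -u sorts and drops duplicate lines.
unique=$(printf '%s\n' "$names" | sort -u | wc -l | tr -d ' ')

echo "total: $total, unique: $unique"
```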
8) In which month and year was “electric fuel consumption” the highest for the “12 Applegate Solar LLC (59371)” power plant when considering “solar” fuel and “all primemovers”? What was the amount of electric fuel consumption at this time? (Hint: “electric fuel consumption” data is captured in the “data” field)
Linux command:
Answer:
9) How many times has the “126 Grove Solar LLC (60858)” power plant been listed in the file? Is this number equal to the number of lines containing “126 Grove Solar LLC (60858)” in the file?
Linux command:
Answer:
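The distinction this question draws, between occurrences of a string and lines containing it, can be illustrated on a toy example. The sample lines and plant names below are invented. As a sketch: grep -c counts matching lines, while grep -o prints each match on its own line so that wc -l counts occurrences.

```shell
# Invented sample: the first line mentions the name twice, the second once.
sample=$(printf 'Plant X (1) : solar : Plant X (1)\nPlant X (1) : wind\nPlant Y (2) : wind\n')

# Number of lines containing the string (-F treats the pattern as fixed text,
# so the parentheses are matched literally).
lines=$(printf '%s\n' "$sample" | grep -c -F 'Plant X (1)')

# Number of occurrences: -o emits one output line per match.
occurrences=$(printf '%s\n' "$sample" | grep -o -F 'Plant X (1)' | wc -l | tr -d ' ')

echo "lines: $lines, occurrences: $occurrences"
```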
10) Do you think we would be able to compute correlations (e.g. Pearson’s correlation) in electric fuel consumption between power plants using the data provided here? What problems might we face in doing so? If instead we wanted to make predictions about tomorrow’s electric fuel consumption at a given power plant, what problems would we face?
Answer:
Part B: Graphing the data in R
1) We want to visualize, analyse and make future predictions about the “electric fuel consumption” for the “12 Applegate Solar LLC (59371)” power plant when considering “solar” fuel and “all primemovers”. First we need to extract the data corresponding to the “data” field using Bash so that we can save it as the file “fuel_data.txt” and load it into R. Provide the Bash command you used to do this and provide a table with two columns (date, electric fuel consumption) containing the annual values of electric fuel consumption for the month of December. (Note 1: the “fuel_data.txt” file should contain the data for all months in order to answer the remaining questions for this part of the assignment. We only ask for the December values here since they are easier to mark. Note 2: remember to submit the “fuel_data.txt” file with your assignment.)
Linux Command:
submitted file:
Answer:
2) Load in the file you created, “fuel_data.txt”, and plot a histogram of the electric fuel consumption with labels on the axes and a title. Does the data follow a Gaussian distribution?
3) Now plot the monthly electric fuel consumption data as a function of time, with time increasing in the rightward direction of the plot and with labels on the axes and a title. Based on this time series, give a reason for your answer to question 2 above as to whether or not the electric fuel consumption data follows a Gaussian distribution. What issues would you face if you tried to fit a linear regression to this data to predict fuel consumption at times just after the data provided?
R scripts:
Plot:
Answer:
Part C: Investigating Chronic Disease Indicators Data
Download the file U.S._Chronic_Disease_Indicators__CDI_.zip. Use a Unix shell to manipulate the file and answer the following questions.
1) Decompress the file U.S._Chronic_Disease_Indicators__CDI_.zip. Unzipping this file outputs U.S._Chronic_Disease_Indicators__CDI_.csv. View the start of this csv file. What information does the first line in the file provide about the remaining lines in the file?
Linux command:
Answers:
2) The file provides information about Chronic Diseases in the United States by referring to disease “topic”s and “questions” (i.e. indicator measures) that relate to these topics. Use an awk script to extract only the lines providing “Crude Prevalences” for disease topic questions for the state of California and save them to a file called “california.txt”.
Linux command:
submitted file:
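For readers unfamiliar with awk, the general pattern of filtering rows on field values is sketched below on an invented CSV. The column names, column positions and values are hypothetical and do not reflect the real CDI file. A plain comma separator also mis-splits quoted fields that themselves contain commas, which a real CSV may have.

```shell
# Invented CSV with a header row; the column layout is hypothetical.
csv=$(printf 'Year,State,Topic,ValueType,Value\n2015,California,Cancer,Crude Prevalence,10.5\n2015,Texas,Cancer,Crude Prevalence,9.1\n2016,California,Alcohol,Mean,2.0\n')

# -F',' splits each line on commas; a pattern with no action prints matching lines.
# Keep rows whose 2nd and 4th fields match exactly (string comparison is case-sensitive).
result=$(printf '%s\n' "$csv" | awk -F',' '$2 == "California" && $4 == "Crude Prevalence"')

echo "$result"
```

Redirecting the awk output with > would save the matching lines to a file instead of a variable.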
3) Considering the new file “california.txt”:
A. How many lines are associated with the disease “topic”s “Alcohol” and “Cancer”?
Linux command:
Answers:
B. What is the highest Crude Prevalence value for the “Cancer” “topic” in this file? What “year” and “question” is associated with this highest crude prevalence value for “Cancer”? What type of cancer does this “question”/indicator measure most likely relate to? Why would Crude Prevalence be high for this “question”/indicator measure?
Linux command:
Answers:
GOOD LUCK!