Python tutoring: Python data cleaning program, English-language Python homework explained
Attached Files:
• ps1_3.py (896 B)
• ps1_3_data.csv (155 B)
• ps1_4a_data.csv (151 B)
• ps1_4b_data.csv (147 B)
This describes a 'data cleaning' activity for a hypothetical survey of train passengers. Data collected from customer surveys often contains missing or erroneous entries. Data cleaning refers to the processing that identifies and fixes these entries. Attached are spreadsheet (CSV) files with similar data on travel times for rail transit users from various suburbs, ranging from clean data to data with errors. You will develop a Python program to process the data and report the average train travel time by suburb.
Below is an example of calculating the average of a set of data in Python; your task is to do this by suburb (using a dictionary), with the data read from the spreadsheet (CSV) file.
data = [5, 10, 5, 10]  # i.e. data for east suburb
count = 0    # initialise count
average = 0  # initialise average
for x in data:
    count = count + 1
    average = average + x  # using average here to sum data
average = average / count  # from summed values above calc actual average
print(count, average)
• ps1_3.py (896 B) is the partially completed file provided by the instructor:
(Same content as above.)
import csv
# program to calculate average travel time for survey data
# input is a csv file with two columns, column 1 has suburb, column 2 a travel time
# assume 'clean' input with suburbs below and valid data
# Dictionary with suburbs key to store summary average and count
suburbs_average = {'North':0,'South':0,'East':0,'West':0}
suburbs_count = {'North':0,'South':0,'East':0,'West':0}
# open csv file
csvfile = open("ps1_3_data.csv")
# loop to read data
csv_reader = csv.reader(csvfile, delimiter=',')
for row in csv_reader:
    # input data is in row variable, i.e. row[0] is column 1 and row[1] is column 2
    # but data read in as strings, so row[1] needs to be converted to float
    print (row[0],row[1])
    # your code goes here to sum and count values by suburb
csvfile.close()
# your code goes here to compute averages by suburb
print(suburbs_average)
Part 1 - Level of difficulty is moderate
You are provided with a CSV file with clean data. A Python dictionary is used to store the suburbs as keys, and average travel time as the value. You may assume the input data is ‘clean’ with one row for each survey response containing a valid suburb name (string) and a travel time (number).
You are provided with a partially completed script named ps1_3.py, to which you will add code and submit as your answer. Type in your code where the comments say 'your code goes here'. A CSV file, ps1_3_data.csv, is also provided. Save it to the same folder as the script; if you open the script in IDLE and run it, it will print the rows in the CSV file.
Example output for provided CSV file:
{'South': 7.5, 'North': 11.666666666666666, 'East': 8.0, 'West': 15.0}
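The sum-and-count step described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the expected solution: the sample rows are made up for illustration (they are not the contents of ps1_3_data.csv), and io.StringIO stands in for the opened file so that the block runs without any file on disk.

```python
import csv
import io

# Hypothetical sample rows; NOT the contents of ps1_3_data.csv.
sample = io.StringIO("North,10\nSouth,5\nNorth,14\nEast,8\nWest,15\nSouth,10\n")

suburbs_average = {'North': 0, 'South': 0, 'East': 0, 'West': 0}
suburbs_count = {'North': 0, 'South': 0, 'East': 0, 'West': 0}

for row in csv.reader(sample, delimiter=','):
    suburb = row[0]
    travel_time = float(row[1])             # values are read in as strings
    suburbs_average[suburb] += travel_time  # running sum, held in the 'average' dictionary
    suburbs_count[suburb] += 1

for suburb in suburbs_average:
    if suburbs_count[suburb] > 0:           # skip suburbs with no responses
        suburbs_average[suburb] /= suburbs_count[suburb]

print(suburbs_average)
```

The guard on the count is a design choice for this sketch: it leaves a suburb's value at 0 rather than raising ZeroDivisionError when no rows mention that suburb.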
Note: If you have problems opening the CSV file in Python, you may need to provide the full path to where the file is stored. This can usually be copied from the address bar at the top of Windows Explorer, but note that Python conventionally uses ‘/’ as the folder separator, whereas Windows uses ‘\’. You can either swap the separators or put the r (raw string) prefix in front of the Python string literal, i.e. a file with the pathname C:\Work\file.txt in Windows would be assigned to a variable in Python as: pathname='C:/Work/file.txt' … this is equivalent to: pathname=r'C:\Work\file.txt'
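The equivalence of the two spellings can be checked with a short sketch like the one below. The path is illustrative only, and pathlib.PureWindowsPath is used here (an addition, not part of the assignment) because it compares Windows-style paths without touching the file system, so the check runs on any operating system.

```python
from pathlib import PureWindowsPath

# Illustrative path only; substitute the folder where your CSV file is stored.
forward = 'C:/Work/file.txt'     # forward slashes
raw = r'C:\Work\file.txt'        # raw string: backslashes are kept literally
escaped = 'C:\\Work\\file.txt'   # ordinary string with escaped backslashes

# The raw string and the escaped string are the same sequence of characters.
assert raw == escaped

# Interpreted as Windows paths, both separator styles name the same file.
same_path = PureWindowsPath(forward) == PureWindowsPath(raw)
print(same_path)
```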
Part 2 - Level of difficulty is high
Following on from the problem above, you are provided with survey responses, but as with many surveys the data is not ‘clean’. There are two problems you can fix in the data:
i) In some cases respondents used abbreviations and inconsistent letter case for the suburb names; the common abbreviations are given in this Python dictionary:
suburbs_abbrevations = {'Nth':'North', 'Est':'East', 'Wst':'West', 'Sth':'South'}
Add the above dictionary to your code and fix the inconsistencies in letter case and the abbreviations that occur in the provided test CSV file ps1_4_data1.csv.
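One possible way to standardise a suburb name is sketched below. The helper name normalise_suburb and the sample inputs are hypothetical, and stripping surrounding whitespace is an extra assumption not stated in the task; treat this as an illustration of the idea rather than the required solution.

```python
suburbs_abbrevations = {'Nth': 'North', 'Est': 'East', 'Wst': 'West', 'Sth': 'South'}

def normalise_suburb(name):
    """Return the standard suburb name for a raw survey entry (hypothetical helper)."""
    # Assumption: stray spaces may surround the name, so strip them first.
    # capitalize() upper-cases the first character and lower-cases the rest.
    name = name.strip().capitalize()
    # Expand a known abbreviation; otherwise keep the capitalised name.
    return suburbs_abbrevations.get(name, name)

# Made-up inputs for illustration only.
for raw_name in ['north', 'NTH', 'Est', 'wst', 'South']:
    print(raw_name, '->', normalise_suburb(raw_name))
```

Capitalising before the dictionary lookup means an entry such as 'NTH' or 'nth' still matches the key 'Nth'.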
ii) In some cases respondents did not enter their travel time. The missing value may be imputed using a strategy called mean substitution, where you replace the missing value with the mean calculated for the suburb. See https://en.wikipedia.org/wiki/Imputation_(statistics) . Missing data is recognised by a blank in the data file. You are provided with a second test CSV file, ps1_4_data2.csv, to summarise.
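Mean substitution can be sketched as a two-pass process: first compute each suburb's mean from the values that are present, then fill each blank with that mean. The sample rows below are made up for illustration (they are not the contents of the provided file), and io.StringIO stands in for the opened file; this is one possible approach under those assumptions, not the required solution.

```python
import csv
import io

# Hypothetical sample rows with blank travel times; NOT the provided data file.
sample = io.StringIO("North,10\nNorth,\nNorth,14\nSouth,6\nSouth,\n")
rows = list(csv.reader(sample, delimiter=','))

# Pass 1: per-suburb mean of the values that are present.
totals, counts = {}, {}
for suburb, value in rows:
    if value.strip() != '':               # a blank marks a missing travel time
        totals[suburb] = totals.get(suburb, 0.0) + float(value)
        counts[suburb] = counts.get(suburb, 0) + 1
means = {suburb: totals[suburb] / counts[suburb] for suburb in totals}

# Pass 2: replace each blank with the mean for that suburb.
imputed = [(suburb, float(value) if value.strip() != '' else means[suburb])
           for suburb, value in rows]
print(imputed)
```

Because each blank is filled with the suburb's own mean, the suburb average after substitution is the same as the mean of the values that were present. This sketch raises KeyError if a suburb has no recorded values at all; how to handle that case is left open.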
Rename the Python file above to ps1_4.py, add code to it to address problems i) and ii), and submit it as your answer; the two problems will be tested separately for grading.
QUESTION 3 Exercise content
1. You are provided with a CSV file with survey data on travel times for rail transit users from various suburbs. You are expected to summarise this data and report the average travel time for each suburb (see the example calculation from the Week 2 practical). Use a Python dictionary to store the suburbs as keys and the average travel time as the value. You may assume the input data is ‘clean’, with one row for each survey response containing a valid suburb name and a travel time.
From the Week 2 practical you are provided with a partially completed script named ps1_3.py, to which you will add code and submit as your answer. Type in your code where the comments say 'your code goes here'. An input CSV file, ps1_3_data.csv, is also provided. Save it to the same folder as the script. You can run the partially completed script in IDLE and it will print the rows in the CSV file. Below is an example of the expected output from the completed script for the provided CSV file.
{'South': 7.5, 'North': 11.666666666666666, 'East': 8.0, 'West': 15.0}
Attach the modified script as your answer. Partial marks are given for incomplete answers.
QUESTION 4 Exercise content
1. Following on from the problem above, you are provided with other data files of survey responses, but as with many surveys the data is not ‘clean’. There are two problems you need to deal with in the data:
a) Respondents sometimes used inconsistent capitalization and abbreviations for the suburb name; processing the survey effectively requires suburb names to have a standard form. The common abbreviations found are given in this Python dictionary:
suburbs_abbrevations = {'Nth':'North', 'Est':'East', 'Wst':'West', 'Sth':'South'}
Fix the inconsistencies in suburb names by capitalizing the names (first character upper case) and using the above dictionary in your code to correct the abbreviations that occur in the test CSV file ps1_4a_data.csv (see Week 2 practical). (2 marks)
b) In some cases respondents did not enter their travel time. The missing value may be imputed using a strategy called mean substitution, where you replace the missing value with the mean calculated for the suburb. See en.wikipedia.org/wiki/Imputation_(statistics). Missing data is recognised by a blank in the data file. You are provided with a second test CSV file, ps1_4b_data.csv (see Week 2 practical), to summarise. (2 marks)
Rename the Python file from the last question to ps1_4.py, add the code needed to deal with the unclean data, and attach the script as your answer. Partial marks are given for incomplete answers.