Here are additional alternative versions of the Crime_Data_Large_Dirty and Crime_Data_Small_Dirty files, provided in MAC, MSDOS and standard CDV (Comma Delimited Value) formats. Please try these if you are experiencing problems with the original data files. Please ensure you have supplied the correct filename and location for the file you are attempting to load. I have tested each and found them to work without issue.
Large Data Set:
COMP90059_CrimeData_Large_Dirty_MSDOS.csv
COMP90059_CrimeData_Large_Dirty_Mac.csv
COMP90059_CrimeData_Large_Dirty_CDV.csv
Small Data Set:
COMP90059_CrimeData_Small_Dirty_CDV.csv
COMP90059_CrimeData_Small_Dirty_Mac.csv
COMP90059_CrimeData_Small_Dirty_MSDOS.csv
Files required for this assignment:
Assignment2.py - A starting program to help you load the data into Python
COMP90059_CrimeData_Large_Clean.csv - A clean version of the main data to help you with the functions other than cleaning... if you need it.
COMP90059_CrimeData_Large_Dirty.csv - The main data you will need to work on for your final submission
COMP90059_CrimeData_Small_Clean.csv - Small version of the clean data to let you work on data navigation... if you need it.
COMP90059_CrimeData_Small_Dirty.csv - Small version of the data to help you clean and navigate the data... if you need it
There are FIVE (5) questions in this assignment. The fifth question will require you to call the functions you wrote in the first four questions.
Things to look out for in solving the questions are:
Never be afraid to create extra variables, e.g. to break up the code into conceptual sub-parts, improve readability, or avoid redundancy in your code.
You are encouraged to write helper functions to simplify your code; you can write as many functions as you like, as long as one of them is the function you are asked to write.
Commenting of code is something you will be marked on; get some practice writing comments in your code, focusing on:
o Adding a header block, providing the developer's ID
o Describing key variables when they are first defined (but not things like index variables in for loops)
o Describing what "chunks" of code do (i.e. not every line, but chunks of code that perform a particular operation), such as:
    # find the maximum value in the list
  or
    # count the number of vowels
The Australian crime statistics database holds crime statistical data that is freely available on the Australian government website: data.gov.au/dataset. This data indicates trends in crime covering the whole of Australia, which is separated between counties and Local Government Authority areas (LGA), over a number of years. The information held in these databases highlights the number of crimes committed, from Trespass to Homicide, in a number of geographical locations.
Your task for this assignment will be to take on a contract as a software developer/analyst and respond to a realistic project requirement.
------
Congratulations! You have been appointed by the MUC (Made Up Company) through the Australian Government to help them ascertain various information, from within the online crime dataset. The dataset has been vandalised by high tech criminals and will need your skills to help clean it up, prior to providing basic analysis.
As part of your task, you are asked to develop an application that can cater for this request. You will code four (4) functions that perform specific tasks. In addition, you will write a “main” function that will utilise all the new functions you have developed.
The Crime Statistics data is provided to you in one or more comma-separated values (CSV) files. You will find this in the Assignment 2 folder on LMS.
CSV is a simple file format which is widely used for storing tabular data (data that consists of columns and rows). In a CSV file, columns are separated by commas, and rows are separated by newlines (so every line of the text file corresponds to a row of the data). Usually, the first row of the file is a header row which gives names for the columns.
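As a small illustration of the format described above (not the assignment data), the following sketch parses a short CSV string with Python's built-in csv module. The column names and values are made up purely for the example.

```python
import csv
import io

# Hypothetical CSV text: a header row followed by two data rows.
csv_text = "ID,LGA,2002\n1,Area A,1\n2,Area B,0\n"

# csv.reader yields one list of strings per line of the input.
rows = list(csv.reader(io.StringIO(csv_text)))
header = rows[0]    # the column names
records = rows[1:]  # the data rows

print(header)
print(records)
```

Note that every field comes back as a string, including the numeric ones.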
The Crime Statistics data contains the following columns: ID, Statistical Division or Subdivision, LGA, Offence category, Subcategory, Year statistics.
ID (An integer unique ID assigned to each row of data)
Statistical Division or Subdivision (The broad area the crimes were committed)
LGA (The Local Governance Area that managed the crime area)
Offence category (A title of the crime category)
Subcategory (A breakdown of the crime within each category area)
Year statistics (from 2002 through to 2012) (Holding a tally corresponding to each crime that took place in that year)
Supplied is a sample of the CSV data, as provided to you by the MUC. In fact, we have provided 4 data samples: two large and two small. One sample of each size (large and small) is contaminated (vandalised), and the other two (large and small) are clean. Both the small and clean samples are provided to assist you in developing your code. Small samples are processed quicker. Clean samples allow you to progress without completing the required cleaning task. Your final code should work on the large-dirty data-set.
In order to clean up and analyse the data, you need a way to take data from a CSV file and put it into a Python data structure. Fortunately, Python has a built-in csv library which can do most of the work for you.
In this assignment, you won't have to use the csv library directly, though. We will provide you with a helper function called read_data which uses the csv library to read the data and turn it into a dictionary of dictionaries. For example, suppose the data above was stored in a file called CrimeDataSet.csv. To work with this data in Python, we would call (from within the read_data.py file’s working directory) the following:
read_data("CrimeDataSet.csv")
which would return the following Python dictionary:
{'1' :{'Division': 'Inner Sydney', 'LGA': 'Botany Bay', 'Offence': 'Homicide','Subcategory': 'Murder (a)', '2002': '1','2003': '0', '2004': 'zero', '2005': '1', '2006': '2', '2007': '1', '2008': '0', '2009': '1', '2010': '0', '2011': '0', '2012': '1'}}
Note
Notice that all of the values in the nested dictionaries are strings, even the numeric values. If you want to use the values in numerical calculations, you will have to typecast them yourself.
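To see why the typecast matters, here is a minimal sketch using a made-up row in the same shape as the nested dictionaries above. Adding the raw strings concatenates them, while converting with int() first gives the arithmetic result.

```python
# A hypothetical nested row; the values are strings, as read_data returns them.
row = {'2002': '1', '2003': '0', '2006': '2'}

# Without a cast, + joins the two strings together.
as_text = row['2002'] + row['2006']

# With int(), + performs integer addition.
as_number = int(row['2002']) + int(row['2006'])

print(as_text, as_number)
```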
Nested dictionaries can be confusing. Here are some simple examples of how to access data in a nested dictionary:
# save the data in a variable
data = {'1':{'Division': 'Inner Sydney', 'LGA': 'Botany Bay', 'Offence': 'Homicide','Subcategory': 'Murder (a)', '2002': '1', '2003': '0', '2004': '0', '2005': '1', '2006':'2', '2007': '1', '2008': '0', '2009': '1', '2010': '0', '2011': '0', '2012': '1'}}
# What is the Division for ID '1'?
print(data["1"]["Division"])
# What is the Subcategory for ID '2'? (this one-row sample has no ID '2', so check first)
if "2" in data:
    print(data["2"]["Subcategory"])
# What is the sum over all years for ID '1'?
total = 0  # named total to avoid shadowing the built-in sum()
for year in range(2002, 2012 + 1):
    total += int(data['1'][str(year)])
You have been provided with large CSV files containing Crime data within Australia. Unfortunately, the data has been contaminated and is considered "dirty": some criminal-hackers have attacked the data, and intentionally entered incorrect data values, to subvert the clear understanding of the crimes committed.
Your first task as a programmer-analyst is to clean up the dirty data and fix any issues caused by the criminals, for later analysis.
The errors in this data-set consist of the following changes, peppered or scattered throughout:
They have entered the word zero instead of the integer value 0
They have also entered NULL instead of the integer value 0
They have converted positive numbers to negative numbers (e.g. -10 instead of 10).
And they have replaced all entries for Trespass (within the Subcategory column) with a cruel capitalised string of text: MUC-SUCK!
To clarify, in the data set, any value given as zero should be an integer 0 (zero), any value with a NULL reference should also be an integer 0 (zero), all integer values should be positive (e.g. minus 20 should be positive 20), and any derogatory remarks about the MUC within the Subcategory field should be read as Trespass.
Task 1 (Clean data)
Write a function called clean which takes one argument called data. It should be utilised like this:
clean(data)
The data value consists of a dictionary of data; which is the format type returned by read_data. This data has been read directly from a CSV file and is presumed contaminated or dirty! Your function should construct and return a new data dictionary which is identical to the input dictionary, except that invalid data values must be replaced; as described above. You should not need to modify the argument dictionary variable data. The cleaning process should keep a count of all the data samples it cleans and also return this summated value, along with the new cleaned dictionary data set.
Let’s look at the data contained in CrimeDataSetDirty.csv:
{'1' :{'Division': 'Inner Sydney', 'LGA': 'Botany Bay', 'Offence': 'Homicide','Subcategory': 'Murder (a)', '2002': '1', '2003': '0', '2004': 'zero', '2005':'1', '2006': '2', '2007':'-1' , '2008': '0', '2009': '1', '2010': '0', '2011': '0', '2012':'1'}}
Clearly some of the values are invalid! Calling clean on this data would yield the following result:
{'1' :{'Division': 'Inner Sydney', 'LGA': 'Botany Bay', 'Offence': 'Homicide','Subcategory': 'Murder (a)', '2002': '1', '2003': '0', '2004': '0', '2005': '1', '2006':'2', '2007': '1', '2008': '0', '2009': '1', '2010': '0', '2011': '0', '2012': '1'}}
Notice that the 2004 and 2007 values in the nested dictionary of the cleaned data are now '0' and '1'; they were previously 'zero' and '-1'. Don’t forget the other repair alterations too, from the list above!
You can assume the following:
The final (cleaned) data dictionary should not contain zero, NULL or negative values;
All year-column data entry statistics (once cleaned) are strings that can be cast to ints;
Any entries within the Subcategory column containing derogatory remarks about the MUC should be renamed from the derogatory remark back to Trespass.
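The cleaning rules above could be sketched roughly as follows. This is one possible shape, not the required solution. It assumes that year columns are exactly the all-digit keys, that "zero" and "NULL" should be matched case-insensitively, that a vandalised Subcategory contains the text MUC-SUCK, that cleaned values stay as strings (as in the example output), and that the dictionary and count are returned together as a tuple.

```python
def clean(data):
    """Return (cleaned_copy, repair_count) without modifying data."""
    cleaned = {}
    repairs = 0
    for row_id, row in data.items():
        new_row = {}
        for column, value in row.items():
            fixed = value
            if column.isdigit():  # assumption: year columns such as '2002'
                text = value.strip()
                if text.lower() in ('zero', 'null'):
                    fixed = '0'
                elif text.startswith('-'):
                    fixed = text[1:]  # drop the minus sign
            elif column == 'Subcategory' and 'MUC-SUCK' in value:
                fixed = 'Trespass'
            if fixed != value:
                repairs += 1  # count each value that was changed
            new_row[column] = fixed
        cleaned[row_id] = new_row
    return cleaned, repairs


# Made-up dirty row for illustration only.
dirty = {'1': {'Division': 'Inner Sydney', 'Subcategory': 'MUC-SUCK!',
               '2004': 'zero', '2005': 'NULL', '2007': '-1', '2008': '3'}}
cleaned, repair_count = clean(dirty)
print(cleaned, repair_count)
```

Building a fresh dictionary row by row means the argument is left untouched, which makes it easy to compare the dirty and cleaned data afterwards.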
Task 2 (Worst year)
Write a function called countCrimes that takes in two arguments: data and key. The data value will be the dictionary containing crime data and the key value will be a suitable value representing the year statistics data. The function sums all the values within the key column and returns the total (representing that year). For example, calling it with the key ‘2012’ should return an integer representing the total sum of crimes for that column.
Using this function, and inside your main method, you must calculate the worst year for crime (i.e. the year with the maximum total crimes), then store and display the year and the total number of crimes in that year. You may assume the crime data in data is "clean", after all invalid values have been replaced. So, all values are non-negative integers, and all data values have been repaired. A clean_data_set.csv file containing clean data is supplied to allow you to test this method, in case you have not yet been able to clean the dirty data-set.
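A minimal sketch of this task, assuming the data is already clean and each year value is a string that casts to int. The two-row sample and the list of years are made up for illustration; the real data covers 2002 to 2012.

```python
def countCrimes(data, key):
    """Sum the values in one year column (key) across every row."""
    return sum(int(row[key]) for row in data.values())


# Hypothetical clean data with only two year columns.
sample = {
    '1': {'2011': '4', '2012': '1'},
    '2': {'2011': '3', '2012': '5'},
}

years = ['2011', '2012']
totals = {year: countCrimes(sample, year) for year in years}

# max() with key=totals.get picks the year whose total is largest.
worst_year = max(totals, key=totals.get)
print(worst_year, totals[worst_year])
```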
Task 3 (Worst area)
The MUC are interested in the distribution of crime throughout the different Statistical Subdivision areas. One way to establish this is to divide each Subdivision into unique bins or dictionary keys; where a key holds a summation of all the crimes for all the years within that subdivision area.
Write a function called worstCrime, which takes the argument data and adds up the values of each Subdivision of each year then returns a new dictionary, where the key is the Subdivision name and the value is the summated total of all crimes within that area over all years. From within the main function, store and display the number of Subdivisions found and present the area with the highest overall crime values as the ‘Worst Area’ and the area with the lowest overall crime values as the ‘Best Area’.
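The binning idea can be sketched as below. It assumes the subdivision is stored under the 'Division' key (as in the read_data example) and that year columns are the all-digit keys; the sample rows and area names are invented.

```python
def worstCrime(data):
    """Return {subdivision name: total crimes over all years}."""
    totals = {}
    for row in data.values():
        # Add up every year column in this row.
        row_total = sum(int(v) for k, v in row.items() if k.isdigit())
        area = row['Division']
        totals[area] = totals.get(area, 0) + row_total
    return totals


# Hypothetical clean data.
sample = {
    '1': {'Division': 'Area A', '2011': '4', '2012': '1'},
    '2': {'Division': 'Area B', '2011': '3', '2012': '5'},
    '3': {'Division': 'Area A', '2011': '2', '2012': '2'},
}

area_totals = worstCrime(sample)
worst_area = max(area_totals, key=area_totals.get)
best_area = min(area_totals, key=area_totals.get)
print(len(area_totals), worst_area, best_area)
```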
Task 4 (Most active criminal activity)
The MUC are also interested in learning which crime is performed the most throughout the whole dataset. By acquiring this information, they will be able to focus on reinforcing security levels, targeting that type of crime more robustly.
Write a function called mostActiveCrime(data), which returns a dictionary of the different crime types, holding a tally (count) of how often those crimes were committed overall. That is, each key within the dictionary will be the name of a crime (such as Homicide, or Robbery, etc.), and the corresponding value will be the tally for that particular crime throughout all years. Finally, within your main function, from the returned dictionary, store and display the most active/performed crime type and present it as the ‘Most active Crime overall’, and include its summed value.
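One way this tally might look is sketched below. It assumes the crime type is taken from the 'Offence' key (the brief could also be read as meaning 'Subcategory') and that year columns are the all-digit keys. The sample rows are made up.

```python
from collections import defaultdict


def mostActiveCrime(data):
    """Return {crime type: total count over all rows and years}."""
    tally = defaultdict(int)
    for row in data.values():
        for column, value in row.items():
            if column.isdigit():  # assumption: year columns only
                tally[row['Offence']] += int(value)
    return dict(tally)


# Hypothetical clean data.
sample = {
    '1': {'Offence': 'Homicide', '2011': '1', '2012': '0'},
    '2': {'Offence': 'Robbery', '2011': '3', '2012': '5'},
    '3': {'Offence': 'Homicide', '2011': '2', '2012': '2'},
}

crime_totals = mostActiveCrime(sample)
most_active = max(crime_totals, key=crime_totals.get)
print(most_active, crime_totals[most_active])
```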
Task 5 (Providing a final report)
The government has asked the MUC to produce a report on the final status and crime situation within their supplied dataset. The MUC have asked you to help them locate the appropriate data for this report.
Write a function called report which takes a filename (called datafile) as an argument. This function reads the original crime data contained under that filename, then uses your function (clean) to clean the data. Following this, you will use your newly created functions (1 to 4) to present some facts about crime in Australia.
You should assume that the data in datafile is noisy. Your function should calculate and return the following data-facts as a list:
The total number of rows in the data file
The total number of Subdivisions examined in the data
The total number of Offence Categories
The worst area and best area for crime (most and least crime counts respectively)
The most active type of crime.
Note
You will probably find it useful to call read_data, clean, etc. in your main function, to ensure you are testing your function outputs.
Present your analysis in a formatted text output, with the statistical data values embedded. Here is an example of what the output might look like; CAPITALISATION has been used to represent your analytical data. At the end of the text, after "Sincerely,", please add your name and student ID. The values of the CAPITALISED placeholders will be provided by the results of your functions and analyses.
‘On behalf of the MUC (Made Up Company), I have analysed TOTAL_DATA_ROWS units of the crime statistics data, over a 10-year period. I have repaired REPAIR_COUNT corrupt data values. This data-set covered NUMBER_OF_UNIQUE_CRIME_SUBDIVISIONS Subdivisions and found UNIQUE_CRIMES types of crimes. I conclude that the worst area for crime is WORST_AREA, the safest area is BEST_AREA and that the most active category of crime is MOST_ACTIVE. Sincerely,’
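The formatted output could be produced with str.format, as in the sketch below. The function name format_report, the dictionary keys and all the values are hypothetical stand-ins; in the real program the values would come from the functions written for Tasks 1 to 4.

```python
TEMPLATE = (
    "On behalf of the MUC (Made Up Company), I have analysed {rows} units of "
    "the crime statistics data, over a 10-year period. I have repaired "
    "{repairs} corrupt data values. This data-set covered {subdivisions} "
    "Subdivisions and found {crimes} types of crimes. I conclude that the "
    "worst area for crime is {worst}, the safest area is {best} and that the "
    "most active category of crime is {most_active}. Sincerely, {author}"
)


def format_report(facts, author):
    """Fill the report template with the computed facts (hypothetical helper)."""
    return TEMPLATE.format(author=author, **facts)


# Made-up values for illustration only.
facts = {
    'rows': 3, 'repairs': 4, 'subdivisions': 2, 'crimes': 2,
    'worst': 'Area A', 'best': 'Area B', 'most_active': 'Robbery',
}
text = format_report(facts, "YourName (StudentID)")
print(text)
```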
Save your Python file as YourName_StudentID_Assignment2.py, where YourName is your actual name and StudentID is your actual student ID. To submit your work, please upload your YourName_StudentID_Assignment2.py file, containing your complete source code (with all 5 functions), through the GROK system and onto the LMS Turnitin platform provided.