代写program、代做Python设计编程
- 首页 >> Matlab编程 Assignment 3
Due Monday by 6p.m. Points 0 Available Mar 22 at 12a.m. - Apr 8 at 11:59p.m.
Hypertension and Low Income in Toronto Neighbourhoods
Goals of this Assignment
In this assignment, you will practise working with files, building and using dictionaries,
designing functions using the Function Design Recipe, reading documentation, and writing
unit tests.
No Extensions
Please, note that NO EXTENSIONS are allowed for this assignment, since it is due on the
last day of classes. Only AccessAbility requests and illness- or emergency-related requests
will be accepted. Please, submit these by email to Purva, the course coordinator, with
appropriate documentation.
Background
A commonly-held belief is that an individual's health is largely influenced by the choices they
make. However, there is lots of evidence that health is affected by systemic factors.
Health researchers often study the relationships between an individual's health outcomes
and factors related to their physical environment, social and economic situations, and
geographic location. Studies such as this oneLinks to an external site. investigate how a
particular health outcome (living with hypertension) are tied to a systemic factor (the income
level of a country).
In this assignment, you will write code to assist with analyzing data on the relationship
between hypertension (also known as high blood pressure) and income levels in Toronto
neighbourhoods. The data you will work with is real data, however we have simplified it
somewhat to make this assignment clearer for you.
A note on math and stats
The data analysis that your code will do will include some statistical analysis that we have
not talked about in the course. You do NOT need to understand the underlying statistics to
complete this assignment. The code you write will do some simple mathematical operations,
like adding up some numbers, or finding ratios using division. We will use Pearson
correlation for the more advanced analysis and you will use existing functions that we have
imported for you.
You will need to take a look at the examples of these functions in order to figure out what
arguments you need to pass to them, and what types of data they return, but you do not
need to understand how they work in any detail.
Correlation is a single coefficient expressing the tendency of one set of data to grow linearly,
in the same or opposite direction, with another set of data. This is done by comparing
whether points that have been paired between the two sets are similarly greater or less than
than their set's respective averages.
For example, if we wanted to compare whether for students in the class, age is correlated
with height, we would have two sets of data, birth date (which we could express as, say,
number of weeks old for finer granularity), and heights.
Numbers from each set are ordered in the same way so that each height value corresponds
to the age value for the same student. What is nice about the correlation metric we are
using, is that it is normalized to be between -1 and 1, with these values giving us a nice
human interpretation. A value of 1 means that the points make a straight line. In our
example, this means, for some increase in age, we have a consistent increase in height.
Similarly, a value of -1 is the same relationship but with a flip of direction, where older
students would be shorter than younger ones. Finally, a value of 0 would say that there is no
consistent increase or decrease in height for a change in age. We will use this to investigate
the relationship between low income rates and hypertension, for any tendency to increase or
decrease together.
If you are a statistics person, keep in mind that the learning goals of the assignment are
about writing code using what we've learned in the course, not about doing a proper
statistical analysis.
Dataset descriptions
This assignment uses data files related to one of the two variables of interest (i.e.,
hypertension data or income data). The files are CSV (comma separated values) files, where
each column in a line is separated by a comma. You can assume there are no commas
anywhere else in the files, other than to separate columns, and that any file given is in the
correct format. The two file types are described below.
Neighbourhood hypertension data files
The first row in a neighbourhood hypertension file contain header information, and the
remaining rows each contain data relating to hypertension prevalence in a particular Toronto
neighbourhood.
Here is a description of the different columns of the dataset. Notice the use of constants and
carefully study the starter file constants.py.
Neighbourhood hypertension dataset
Column index Description
HT_ID_COL An ID that uniquely identifies each neighbourhood.
HT_NBH_NAME_COL The name of the neighbourhood. Neighbourhood names are
unique.
HT_20_44_COL The number of people aged 20 to 44 with hypertension in the
neighbourhood.
NBH_20_44_COL The total number of people aged 20 to 44 in the neighbourhood.
HT_45_64_COL The number of people aged 45 to 64 with hypertension in the
neighbourhood.
NBH_45_64_COL The total number of people aged 45 to 64 in the neighbourhood.
HT_65_UP_COL The number of people aged 65 and older with hypertension in the
neighbourhood.
NBH_65_UP_COL The total number of people aged 65 and older in the neighbourhood.
Neighbourhood income data files
The first row in a neighbourhood income data file contains header information, and the
remaining rows each contain data about low income status.
Here is a description of the different columns of the dataset. Notice the use of constants and
carefully study the starter file constants.py.
Neighbourhood income dataset
Column index Description
LI_ID_COL An ID that uniquely identifies each neighbourhood.
LI_NBH_NAME_COL The name of the neighbourhood. Neighbourhood names are unique.
POP_COL The total population in the neighbourhood.
LI_POP_COL The number of people in the neighbourhood with low income status.
Neighbourhood names and ids are the same between our hypertension data files and our
low income data files. However, the total population of a neighbourhood can be different
between the two data files, as they were collected at different times.
The CityData Type
The code you will write for this assignment will build and then use a dictionary that contains
hypertension and low income data about neighbourhoods in a city. This section describes
the format of that dictionary.
Key/value pairs in a CityData dictionary
Each key in a CityData dictionary is a string representing the name of a neighbourhood. As
is necessary for dictionary keys, all neighbourhood names will be unique.
The values in a CityData dictionary are dictionaries containing information about a
neighbourhood. These neighbourhood data dictionaries contain specific keys that label a
neighbourhood's data.
Format of the inner dictionaries
A dictionary that is a value in a dictionary of type CityData has the following key/value pairs.
Notice the use of constants and carefully study the starter file constants.py.
CityData dictionary
Key (Type) Value
ID (int) The id number of this neighbourhood.
TOTAL (int) The total population of this neighbourhood, as given in the low income data file.
LOW_INCOME (int) The number of people in this neighbourhood who are classified as
low income.
HT (list[int]) A list of the hypertension data of this neighbourhood. This list will have
length exactly 6, and the values will be the numbers from columns HT_20_44_COL,
NBH_20_44_COL, HT_45_64_COL, NBH_45_64_COL, HT_65_UP_COL, and
NBH_65_UP_COL stored at indices HT_20_44_IDX, NBH_20_44_IDX, HT_45_64_IDX,
NBH_45_64_IDX, HT_65_UP_IDX, and NBH_65_UP_IDX of the list, correspondingly. See
the section above on neighbourhood hypertension data files.
An example CityData dictionary
The following is an example of a CityData dictionary. We have also provided this dictionary
for you to use in your docstring examples and other testing in the starter code file. Note that
we have formatted the dictionary below for easier reading, however you will not see this
formatting in your code.
{'West Humber-Clairville': {
'id': 1,
'hypertension': [703, 13291, 3741, 9663, 3959, 5176],
'total': 33230,
'low_income': 5950},
'Mount Olive-Silverstone-Jamestown': {
'id': 2,
'hypertension': [789, 12906, 3578, 8815, 2927, 3902],
'total': 32940,
'low_income': 9690},
'Thistletown-Beaumond Heights': {
'id': 3,
'hypertension': [220, 3631, 1047, 2829, 1349, 1767],
'total': 10365,
'low_income': 2005},
'Rexdale-Kipling': {
'id': 4,
'hypertension': [201, 3669, 1134, 3229, 1393, 1854],
'total': 10540,
'low_income': 2140},
'Elms-Old Rexdale': {
'id': 5,
'hypertension': [176, 3353, 1040, 2842, 948, 1322],
'total': 9460,
'low_income': 2315}}
The sample CityData dictionary above represents hypertension and low income data for five
neighbourhoods: West Humber-Clairville, Mount Olive-Silverstone-Jamestown,
Thistletown-Beaumond Heights, Rexdale-Kipling, and Elms-Old Rexdale.
Let's take a closer look at the data for Elms-Old Rexdale. This neighbourhood is represented
by the key/value pair where the key is 'Elms-Old Rexdale'. The id of this neighbourhood is 5.
The hypertension data for this neighbourhood is as follows: 3353 people are between the
ages of 20 and 44, 176 of whom have hypertension. There are 2842 people between the
ages of 45 and 64, 1040 of whom have hypertension, and there are 1322 people aged 65
and up, 948 of whom have hypertension. The low income data for this neighbourhood is that
2315 people are classified as low income, from a total population of 9460 people.
Note that the totals do not match between the low income and the hypertension data — this
is because the low income data was collected before the hypertension data, and the size of
the neighbourhoods changed. For the purposes of this assignment, we will assume the
collection of these two datasets is close enough in time to compare them to each other. You
do not need to do anything about these differing totals, other than to make sure you are
using the correct total when computing rates, as described later.
Age standardisation
This section describes the process of age standardisation that we will use in this assignment
to perform a more accurate analysis. Note that we have given you a function that computes
the age standardised rate from the raw rate (described in Task 3). This section is for your
information only; we have already implemented this for you.
Our dataset will let us calculate the rate of hypertension in each Toronto neighbourhood.
One complicating factor is that different neighbourhoods have different age demographics.
For example, the Henry Farm neighbourhood has a significantly lower proportion of 65+
residents than Hillcrest Village. And because people aged 65+ have a higher overall rate of
hypertension, this demographic difference alone would cause us to expect to see a
difference in the overall hypertension between these neighbourhoods.
So because we care about the impact of low income status on hypertension rates, we want
to remove the impact of different age demographics between the neighbourhoods. To do so,
we will use a process called age standardisation to calculate an adjusted hypertension rate
that ignores differences in ages. This process involves the following steps for each
neighbourhood:
First, we'll calculate the hypertension rate within each of the following age groups: 20-44,
45-64, and 65+. We'll report these rates as percentages, which you can think of as being the
number of cases of hypertension per 100 people aged 20-44 / 45-64 / 65 and up.
Then, we'll pick one standard population with certain numbers of people in these age groups.
For the purpose of this assignment, we'll use the total Canadian population from the 1991
census:Links to an external site.
Population by age group data
Age Group Population
20-44 11,199,830
45-64 5,365,865
65+ 3,169,970
Total (20+) 19,735,665
Then, we'll use the neighbourhood rates to calculate the hypothetical number of people in
the standard population who would have hypertension. For example, if the rates for
neighbourhood X were 20% of 20-44, 30% of 45-64, and 66% of 65+, the total number of
people with hypertension in the standard population would be 2,239,966 + 1,609,760 +
2,092,180 = 5,941,906.
Finally, divide this number of people with hypertension by the total size of the standard
population, yielding a final percentage 5,941,906 / 19,735,665 x 100 or approximately 30%.
This percentage is the age standardised rate for the neighbourhood.
If you are interested, you can read more about age standardised rates hereLinks to an
external site..
Required Functions
In the starter code file a3.py, follow the Function Design Recipe to complete the functions
described below.
You will need helper functions (i.e., functions you define yourself to be called in other
functions) for some of the required functions, but likely not for all of them. Helper functions
also require complete docstrings with doctests. We strongly recommend you also follow any
suggestions about helper functions in the table below; we give you these hints to make your
programming task easier.
Some indicators that you should consider writing a new helper function, or using something
you've already written as a helper are:
Rewriting code to solve a task you have already solved in another function
Getting a warning from the checker that your function is too long
Getting a warning from the checker that your function has too many nested blocks or too
many branches
Realizing that your function can be broken down into smaller sub-problems (with a helper
function for each)
For each of the functions below, other than the file reading functions in Task 1, write at least
two examples in the docstring. You can use the provided SAMPLE_DATA dictionary, and you
should also create another small CityData dictionary for examples and testing. If your helper
function takes an open file as an argument, you do NOT need to write any examples in that
function's docstring. Otherwise, for any helper functions you add, write at least two examples
in the docstring.
Your functions should not mutate their arguments, unless the description says that is what
they do.
Assumptions
Assume the following about the data:
All neighbourhood ids and names are unique, and will appear the same in all data files. That
is, no neighbourhood will have a different id between files, or a different name.
In all tasks except Task 1, the dictionary argument will have both hypertension and low
income data for every neighbourhood. That is, it will be a valid CityData dictionary.
All float values should be left as is; do not round any of them.
Using Constants
The starter code contains constants in the file constants.py that you should use in your
solution for the list indices and key identifiers for the CityData dictionary as well as the
column numbers for the input files. You may add other constants if you wish, but DO NOT
place them in the file constants.py: instead put them in the a3.py file.
Task 1: Building the data dictionary
In this task, you will write functions that read in files and build the dictionary of
neighbourhood data. You will write two functions — one that adds hypertension data to a
dictionary, and one that adds low income data. You will almost certainly also need to define
one or more helper functions to help you solve this task.
These functions will be used to build a CityData dictionary, however the dictionary that is
passed to the functions may not yet contain all of the data.
To illustrate this, we have provided two small data files. After passing the same dictionary to
both functions with each of those small files, the dictionary should be a CityData dictionary
that contains the same information as the provided SAMPLE_DATA dictionary. Using the
small hypertension file and an empty dictionary as arguments to get_hypertension_data, the
result should be that the dictionary now contains the hypertension data as in
SAMPLE_DATA, but not the low income data.
{'West Humber-Clairville':
{'id': 1, 'hypertension': [703, 13291, 3741, 9663, 3959, 5176]},
'Mount Olive-Silverstone-Jamestown':
{'id': 2, 'hypertension': [789, 12906, 3578, 8815, 2927, 3902]},
'Thistletown-Beaumond Heights':
{'id': 3, 'hypertension': [220, 3631, 1047, 2829, 1349, 1767]},
'Rexdale-Kipling':
{'id': 4, 'hypertension': [201, 3669, 1134, 3229, 1393, 1854]},
'Elms-Old Rexdale':
{'id': 5, 'hypertension': [176, 3353, 1040, 2842, 948, 1322]}}
Similarly, using the small low income file and an empty dictionary as arguments to
get_low_income_data, the result should be that the dictionary now contains the low income
data as in SAMPLE_DATA, but not the hypertension data.
{'West Humber-Clairville':
{'id': 1, 'total': 33230, 'low_income': 5950},
'Mount Olive-Silverstone-Jamestown':
{'id': 2, 'total': 32940, 'low_income': 9690},
'Thistletown-Beaumond Heights':
{'id': 3, 'total': 10365, 'low_income': 2005},
'Rexdale-Kipling':
{'id': 4, 'total': 10540, 'low_income': 2140},
'Elms-Old Rexdale':
{'id': 5, 'total': 9460, 'low_income': 2315}}
A complete CityData dictionary will have been passed to both functions. See the sample
usage at the end of the starter code file for an example of how both functions are used to
build a CityData dictionary.
Note: While this is the first task, it is not necessarily the easiest. If you are stuck while
working on this task, we suggest moving on to other tasks and coming back to this later.
Recall that TextIO as the parameter type means the file is already open.
Functions: Task 1
Function name:
(Parameter types) -> Return type Full Description (paraphrase to get a proper docstring
description)
get_hypertension_data:
(dict, TextIO) -> None
The first parameter is a dictionary representing hypertension and/or low income data for a
neighbourhood and the second parameter is a hypertension data file that is open for reading.
This function should modify the dictionary so that it contains the hypertension data in the file.
If a neighbourhood with data in the file is already in the dictionary then its hypertension data
should be updated. Otherwise it should be added to the dictionary with its hypertension data.
After this function is called, the dictionary should contain key/value pairs whose keys are the
names of every neighbourhood in the hypertension data file, and whose values are
dictionaries which contain at least the keys ID and HT for those neighbourhoods. After both
functions get_hypertension_data and get_low_income_data are called with a dict as the first
argument, this argument will be a complete CityData type.
get_low_income_data:
(dict, TextIO) -> None
The first parameter is a dictionary representing hypertension and/or low income data for a
neighbourhood and the second parameter is a low income data file that is open for reading.
This function should modify the dictionary so that it contains the low income data in the file.
If a neighbourhood with data in the file is already in the dictionary then its low income data
should be updated. Otherwise it should be added to the dictionary with its low income data.
After this function is called, the dictionary should contain key/value pairs whose keys are the
names of every neighbourhood in the low income data file, and whose values are
dictionaries which contain at least the keys ID, TOTAL, and LOW_INCOME for those
neighbourhoods. After both functions get_hypertension_data and get_low_income_data are
called with a dict as the first argument, this argument will be a complete CityData type.
Task 2: Neighbourhood-level Analysis
Functions: Task 2
Function name:
(Parameter types) -> Return type Full Description (paraphrase to get a proper docstring
description)
get_bigger_neighbourhood:
(CityData, str, str) -> str
The first parameter is a CityData dictionary, and the second and third parameters are strings
representing the names of neighbourhoods. The function returns the name of the
neighbourhood that has a higher population, according to the low income data.
Assume that the two neighbourhood names are different. If a name is not in the dictionary,
assume it has a population of 0. If the two neighbourhoods are the same size, return the first
name (i.e., the leftmost one in the parameters list, not alphabetically).
get_high_hypertension_rate:
(CityData, float) -> list[tuple[str, float]]
The first parameter is a CityData dictionary, and the second parameter is a number between
0.0 and 1.0 (inclusive) that represents a threshold. This function should return a list of tuples
representing all neighbourhoods with a hypertension rate greater than or equal to the
threshold. In each tuple, the first item is the neighbourhood name and the second item is the
hypertension rate in that neighbourhood.
Compute the overall hypertension rate for a neighbourhood by dividing the total number of
people with hypertension by the total number of adults in the neighbourhood. You may
assume that no neighbourhood has 0 population.
If this function was called with the provided SAMPLE_DATA dictionary and a threshold of 0.3
as arguments, then the returned value would be [('Thistletown-Beaumond Heights',
0.31797739151574084), ('Rexdale-Kipling', 0.3117001828153565)]. The order of the tuples
in the list does not matter.
get_ht_to_low_income_ratios:
(CityData) -> dict[str, float]
The parameter is a CityData dictionary. This function should return a dictionary where the
keys are the same as in the parameter, and the values are the ratio of the hypertension rate
to the low income rate for that neighbourhood.
For the denominators for each rate, use the total number of people as given in the
corresponding data file. That is, for calculating the low income rate, use the total population
in the neighbourhood from the low-income data file; and for the hypertension rate, use the
sum of the total people in all three age groups in the hypertension data. You may assume
that no neighbourhood has 0 population.
For example, if this function was called with the provided SAMPLE_DATA dictionary as an
argument, then the returned dictionary should include the key/value pairs 'West
Humber-Clairville': 1.6683148168616895 and 'Mount Olive-Silverstone-Jamestown':
0.9676885451091314, as well as the pairs for the other three neighbourhoods.
You will find that writing a helper function would be useful here.
calculate_ht_rates_by_age_group:
(CityData, str) -> tuple[float, float, float]
The first parameter is a CityData dictionary, and the second parameter represents a
neighbourhood name that is a key in the dictionary. The function returns a tuple of three
values, representing the hypertension rate for each of the three age groups in the
neighbourhood as a percentage. (Note that this is different from the previous two functions,
where the rate was calculated using the total numbers, and was not expressed as a
percentage.)
For example, consider the neighbourhood with the name 'Elms-Old Rexdale' in the provided
SAMPLE_DATA dictionary. The rate of hypertension for the 20-44 age group in the
neighbourhood is computed by dividing the number of people aged 20-44 with hypertension
by the total number of people aged 20-44. For this neighbourhood, that is 176 / 3353. To get
this rate as a percentage, we then multiply by 100, for a rate of 5.24903071875932. The rate
is calculated in the same way for the 45-64 and 65+ age groups. Thus, calling this function
with the provided SAMPLE_DATA dictionary and the string 'Elms-Old Rexdale' should return
(5.24903071875932, 36.593947923997185, 71.70953101361573).
You may assume that no neighbourhood has a 0 population. Notice that this function is used
as a helper in the get_age_standardized_ht_rate function that we have provided for you.
Task 3: Finding the Correlation
Functions: Task 3
Function name:
(Parameter types) -> Return type Full Description (paraphrase to get a proper docstring
description)
get_correlation:
(CityData) -> float
The parameter for this function is a CityData dictionary. This function returns the correlation
between age standardised hypertension rates and low income rates across all
neighbourhoods.
To complete this function, you will need to use the correlation function in the module
statistics. Refer to the documentation for the function to determine what arguments to pass
to the correlation function, and how to use its returned value. You can find the documentation
online hereLinks to an external site. or by using help(statistics.correlation). Remember that
to call help on a function from another module, you first need to import the module.
You will need to use the provided function get_age_standardized_ht_rate as a helper to get
the age-standardised rate for each neighbourhood.
Task 4: Order by Ratio
Functions: Task 4
Function name:
(Parameter types) -> Return type Full Description (paraphrase to get a proper docstring
description)
order_by_ht_rate:
(CityData) -> list[str]
The parameter is a CityData dictionary. This function will return a list of the names of the
neighbourhoods, ordered from lowest to highest age-standardised hypertension rate. We
use the age-standardised rate because we are comparing across neighbourhoods.
Assume every neighbourhood has a unique hypertension rate; i.e., that there are no ties.
For example, if this function is called with the CityData dictionary provided in the starter
code, it will return ['Elms-Old Rexdale', 'Rexdale-Kipling', 'Thistletown-Beaumond Heights',
'West Humber-Clairville', 'Mount Olive-Silverstone-Jamestown']
There are multiple ways to solve this problem. You may choose to solve this problem by
writing your own sorting code, but you do not have to do this. You can also use list.sort as
part of your solution, if you choose.
Task 5: Required Testing (unittest)
Write and submit a unittest file for the get_bigger_neighbourhood function. We have
provided starter code in the test_a3.py file. We have included one test that you can use as a
template to write your other test methods. For each test method, include a brief docstring
description specifying what is being tested. Do not write examples in the docstrings. Your set
of tests should all pass on correct code, and your tests should be thorough enough that at
least one of them will fail on a buggy version of the function. There is no required number of
tests; we will mark your tests by running them on the correct code as well as several buggy
versions.
Files to Download
Download a3.zip Download a3.zip which contains starter code (a3.py and test_a3.py), the
checker (a3_checker.py together with the helper file checker.py and folder pyta), and two
sizes of each type of data file.
Marking
These are the aspects of your work that will be marked for Assignment 3:
Correctness (70%): Your functions should perform as specified. Correctness, as measured
by our tests, will count for the largest single portion of your marks. Once your assignment is
submitted, we will run additional tests, not provided in the checker. Passing the checker does
not mean that your code will earn full marks for correctness.
Testing (15%): Your test suite will be checked by running it on incorrect/broken
implementations. Your tests should all pass on a correct version of the function, and at least
one should fail on each of our broken implementations.
Coding style (15%):
Make sure that you follow Python style guidelines that we have introduced and the Python
coding conventions that we have been using throughout the semester. Although we don't
provide an exhaustive list of style rules, the checker tests for style are complete, so if your
code passes the checker, then it will earn full marks for coding style with one exception:
docstrings may be evaluated separately. For each occurrence of a PyTA error, one mark (out
of 20) deduction will be applied. For example, if a C0301 (line-too-long) error occurs 3 times,
then 3 marks will be deducted.
If you encounter PyTA error R0915 (too-many-statements), that indicates that your function
is too long (more than 20 statements long). In that case, introduce helper functions to do
some of the work — even if the helpers will only be called once. Your program should be
broken down into functions, both to avoid repetitive code and to make the program easier to
read.
All functions, including helper functions, should have complete docstrings including
preconditions when you think they are necessary.
Also, your variable names and names of your helper functions should be meaningful. Your
code should be as simple and clear as possible.
What to Hand In
The very last thing you do before submitting should be to run the checker program one last
time.
Otherwise, you could make a small error in your final changes before submitting that causes
your code to receive zero for correctness.
Submit a3.py and test_a3.py on MarkUs by following the instructions on the course website.
Remember that spelling of filenames, including case, counts: your file must be named
exactly as above.
Due Monday by 6p.m. Points 0 Available Mar 22 at 12a.m. - Apr 8 at 11:59p.m.
Hypertension and Low Income in Toronto Neighbourhoods
Goals of this Assignment
In this assignment, you will practise working with files, building and using dictionaries,
designing functions using the Function Design Recipe, reading documentation, and writing
unit tests.
No Extensions
Please, note that NO EXTENSIONS are allowed for this assignment, since it is due on the
last day of classes. Only AccessAbility requests and illness- or emergency-related requests
will be accepted. Please, submit these by email to Purva, the course coordinator, with
appropriate documentation.
Background
A commonly-held belief is that an individual's health is largely influenced by the choices they
make. However, there is lots of evidence that health is affected by systemic factors.
Health researchers often study the relationships between an individual's health outcomes
and factors related to their physical environment, social and economic situations, and
geographic location. Studies such as this oneLinks to an external site. investigate how a
particular health outcome (living with hypertension) are tied to a systemic factor (the income
level of a country).
In this assignment, you will write code to assist with analyzing data on the relationship
between hypertension (also known as high blood pressure) and income levels in Toronto
neighbourhoods. The data you will work with is real data, however we have simplified it
somewhat to make this assignment clearer for you.
A note on math and stats
The data analysis that your code will do will include some statistical analysis that we have
not talked about in the course. You do NOT need to understand the underlying statistics to
complete this assignment. The code you write will do some simple mathematical operations,
like adding up some numbers, or finding ratios using division. We will use Pearson
correlation for the more advanced analysis and you will use existing functions that we have
imported for you.
You will need to take a look at the examples of these functions in order to figure out what
arguments you need to pass to them, and what types of data they return, but you do not
need to understand how they work in any detail.
Correlation is a single coefficient expressing the tendency of one set of data to grow linearly,
in the same or opposite direction, with another set of data. This is done by comparing
whether points that have been paired between the two sets are similarly greater or less than
than their set's respective averages.
For example, if we wanted to compare whether for students in the class, age is correlated
with height, we would have two sets of data, birth date (which we could express as, say,
number of weeks old for finer granularity), and heights.
Numbers from each set are ordered in the same way so that each height value corresponds
to the age value for the same student. What is nice about the correlation metric we are
using, is that it is normalized to be between -1 and 1, with these values giving us a nice
human interpretation. A value of 1 means that the points make a straight line. In our
example, this means, for some increase in age, we have a consistent increase in height.
Similarly, a value of -1 is the same relationship but with a flip of direction, where older
students would be shorter than younger ones. Finally, a value of 0 would say that there is no
consistent increase or decrease in height for a change in age. We will use this to investigate
the relationship between low income rates and hypertension, for any tendency to increase or
decrease together.
If you are a statistics person, keep in mind that the learning goals of the assignment are
about writing code using what we've learned in the course, not about doing a proper
statistical analysis.
Dataset descriptions
This assignment uses data files related to one of the two variables of interest (i.e.,
hypertension data or income data). The files are CSV (comma separated values) files, where
each column in a line is separated by a comma. You can assume there are no commas
anywhere else in the files, other than to separate columns, and that any file given is in the
correct format. The two file types are described below.
Neighbourhood hypertension data files
The first row in a neighbourhood hypertension file contain header information, and the
remaining rows each contain data relating to hypertension prevalence in a particular Toronto
neighbourhood.
Here is a description of the different columns of the dataset. Notice the use of constants and
carefully study the starter file constants.py.
Neighbourhood hypertension dataset
Column index Description
HT_ID_COL An ID that uniquely identifies each neighbourhood.
HT_NBH_NAME_COL The name of the neighbourhood. Neighbourhood names are
unique.
HT_20_44_COL The number of people aged 20 to 44 with hypertension in the
neighbourhood.
NBH_20_44_COL The total number of people aged 20 to 44 in the neighbourhood.
HT_45_64_COL The number of people aged 45 to 64 with hypertension in the
neighbourhood.
NBH_45_64_COL The total number of people aged 45 to 64 in the neighbourhood.
HT_65_UP_COL The number of people aged 65 and older with hypertension in the
neighbourhood.
NBH_65_UP_COL The total number of people aged 65 and older in the neighbourhood.
Neighbourhood income data files
The first row in a neighbourhood income data file contains header information, and the
remaining rows each contain data about low income status.
Here is a description of the different columns of the dataset. Notice the use of constants and
carefully study the starter file constants.py.
Neighbourhood income dataset
Column index Description
LI_ID_COL An ID that uniquely identifies each neighbourhood.
LI_NBH_NAME_COL The name of the neighbourhood. Neighbourhood names are unique.
POP_COL The total population in the neighbourhood.
LI_POP_COL The number of people in the neighbourhood with low income status.
Neighbourhood names and ids are the same between our hypertension data files and our
low income data files. However, the total population of a neighbourhood can be different
between the two data files, as they were collected at different times.
The CityData Type
The code you will write for this assignment will build and then use a dictionary that contains
hypertension and low income data about neighbourhoods in a city. This section describes
the format of that dictionary.
Key/value pairs in a CityData dictionary
Each key in a CityData dictionary is a string representing the name of a neighbourhood. As
is necessary for dictionary keys, all neighbourhood names will be unique.
The values in a CityData dictionary are dictionaries containing information about a
neighbourhood. These neighbourhood data dictionaries contain specific keys that label a
neighbourhood's data.
Format of the inner dictionaries
A dictionary that is a value in a dictionary of type CityData has the following key/value pairs.
Notice the use of constants and carefully study the starter file constants.py.
CityData dictionary
Key (Type) Value
ID (int) The id number of this neighbourhood.
TOTAL (int) The total population of this neighbourhood, as given in the low income data file.
LOW_INCOME (int) The number of people in this neighbourhood who are classified as
low income.
HT (list[int]) A list of the hypertension data of this neighbourhood. This list will have
length exactly 6, and the values will be the numbers from columns HT_20_44_COL,
NBH_20_44_COL, HT_45_64_COL, NBH_45_64_COL, HT_65_UP_COL, and
NBH_65_UP_COL stored at indices HT_20_44_IDX, NBH_20_44_IDX, HT_45_64_IDX,
NBH_45_64_IDX, HT_65_UP_IDX, and NBH_65_UP_IDX of the list, correspondingly. See
the section above on neighbourhood hypertension data files.
An example CityData dictionary
The following is an example of a CityData dictionary. We have also provided this dictionary
for you to use in your docstring examples and other testing in the starter code file. Note that
we have formatted the dictionary below for easier reading, however you will not see this
formatting in your code.
{'West Humber-Clairville': {
'id': 1,
'hypertension': [703, 13291, 3741, 9663, 3959, 5176],
'total': 33230,
'low_income': 5950},
'Mount Olive-Silverstone-Jamestown': {
'id': 2,
'hypertension': [789, 12906, 3578, 8815, 2927, 3902],
'total': 32940,
'low_income': 9690},
'Thistletown-Beaumond Heights': {
'id': 3,
'hypertension': [220, 3631, 1047, 2829, 1349, 1767],
'total': 10365,
'low_income': 2005},
'Rexdale-Kipling': {
'id': 4,
'hypertension': [201, 3669, 1134, 3229, 1393, 1854],
'total': 10540,
'low_income': 2140},
'Elms-Old Rexdale': {
'id': 5,
'hypertension': [176, 3353, 1040, 2842, 948, 1322],
'total': 9460,
'low_income': 2315}}
The sample CityData dictionary above represents hypertension and low income data for five
neighbourhoods: West Humber-Clairville, Mount Olive-Silverstone-Jamestown,
Thistletown-Beaumond Heights, Rexdale-Kipling, and Elms-Old Rexdale.
Let's take a closer look at the data for Elms-Old Rexdale. This neighbourhood is represented
by the key/value pair where the key is 'Elms-Old Rexdale'. The id of this neighbourhood is 5.
The hypertension data for this neighbourhood is as follows: 3353 people are between the
ages of 20 and 44, 176 of whom have hypertension. There are 2842 people between the
ages of 45 and 64, 1040 of whom have hypertension, and there are 1322 people aged 65
and up, 948 of whom have hypertension. The low income data for this neighbourhood is that
2315 people are classified as low income, from a total population of 9460 people.
Note that the totals do not match between the low income and the hypertension data — this
is because the low income data was collected before the hypertension data, and the size of
the neighbourhoods changed. For the purposes of this assignment, we will assume the
collection of these two datasets is close enough in time to compare them to each other. You
do not need to do anything about these differing totals, other than to make sure you are
using the correct total when computing rates, as described later.
Age standardisation
This section describes the process of age standardisation that we will use in this assignment
to perform a more accurate analysis. Note that we have given you a function that computes
the age standardised rate from the raw rate (described in Task 3). This section is for your
information only; we have already implemented this for you.
Our dataset will let us calculate the rate of hypertension in each Toronto neighbourhood.
One complicating factor is that different neighbourhoods have different age demographics.
For example, the Henry Farm neighbourhood has a significantly lower proportion of 65+
residents than Hillcrest Village. And because people aged 65+ have a higher overall rate of
hypertension, this demographic difference alone would cause us to expect to see a
difference in the overall hypertension between these neighbourhoods.
So because we care about the impact of low income status on hypertension rates, we want
to remove the impact of different age demographics between the neighbourhoods. To do so,
we will use a process called age standardisation to calculate an adjusted hypertension rate
that ignores differences in ages. This process involves the following steps for each
neighbourhood:
First, we'll calculate the hypertension rate within each of the following age groups: 20-44,
45-64, and 65+. We'll report these rates as percentages, which you can think of as being the
number of cases of hypertension per 100 people aged 20-44 / 45-64 / 65 and up.
Then, we'll pick one standard population with certain numbers of people in these age groups.
For the purpose of this assignment, we'll use the total Canadian population from the 1991
census:Links to an external site.
Population by age group data
Age Group Population
20-44 11,199,830
45-64 5,365,865
65+ 3,169,970
Total (20+) 19,735,665
Then, we'll use the neighbourhood rates to calculate the hypothetical number of people in
the standard population who would have hypertension. For example, if the rates for
neighbourhood X were 20% of 20-44, 30% of 45-64, and 66% of 65+, the total number of
people with hypertension in the standard population would be 2,239,966 + 1,609,760 +
2,092,180 = 5,941,906.
Finally, divide this number of people with hypertension by the total size of the standard
population, yielding a final percentage 5,941,906 / 19,735,665 x 100 or approximately 30%.
This percentage is the age standardised rate for the neighbourhood.
If you are interested, you can read more about age standardised rates hereLinks to an
external site..
Required Functions
In the starter code file a3.py, follow the Function Design Recipe to complete the functions
described below.
You will need helper functions (i.e., functions you define yourself to be called in other
functions) for some of the required functions, but likely not for all of them. Helper functions
also require complete docstrings with doctests. We strongly recommend you also follow any
suggestions about helper functions in the table below; we give you these hints to make your
programming task easier.
Some indicators that you should consider writing a new helper function, or using something
you've already written as a helper are:
Rewriting code to solve a task you have already solved in another function
Getting a warning from the checker that your function is too long
Getting a warning from the checker that your function has too many nested blocks or too
many branches
Realizing that your function can be broken down into smaller sub-problems (with a helper
function for each)
For each of the functions below, other than the file reading functions in Task 1, write at least
two examples in the docstring. You can use the provided SAMPLE_DATA dictionary, and you
should also create another small CityData dictionary for examples and testing. If your helper
function takes an open file as an argument, you do NOT need to write any examples in that
function's docstring. Otherwise, for any helper functions you add, write at least two examples
in the docstring.
Your functions should not mutate their arguments, unless the description says that is what
they do.
Assumptions
Assume the following about the data:
All neighbourhood ids and names are unique, and will appear the same in all data files. That
is, no neighbourhood will have a different id between files, or a different name.
In all tasks except Task 1, the dictionary argument will have both hypertension and low
income data for every neighbourhood. That is, it will be a valid CityData dictionary.
All float values should be left as is; do not round any of them.
Using Constants
The starter code contains constants in the file constants.py that you should use in your
solution for the list indices and key identifiers for the CityData dictionary as well as the
column numbers for the input files. You may add other constants if you wish, but DO NOT
place them in the file constants.py: instead put them in the a3.py file.
Task 1: Building the data dictionary
In this task, you will write functions that read in files and build the dictionary of
neighbourhood data. You will write two functions — one that adds hypertension data to a
dictionary, and one that adds low income data. You will almost certainly also need to define
one or more helper functions to help you solve this task.
These functions will be used to build a CityData dictionary, however the dictionary that is
passed to the functions may not yet contain all of the data.
To illustrate this, we have provided two small data files. After passing the same dictionary to
both functions with each of those small files, the dictionary should be a CityData dictionary
that contains the same information as the provided SAMPLE_DATA dictionary. Using the
small hypertension file and an empty dictionary as arguments to get_hypertension_data, the
result should be that the dictionary now contains the hypertension data as in
SAMPLE_DATA, but not the low income data.
{'West Humber-Clairville':
{'id': 1, 'hypertension': [703, 13291, 3741, 9663, 3959, 5176]},
'Mount Olive-Silverstone-Jamestown':
{'id': 2, 'hypertension': [789, 12906, 3578, 8815, 2927, 3902]},
'Thistletown-Beaumond Heights':
{'id': 3, 'hypertension': [220, 3631, 1047, 2829, 1349, 1767]},
'Rexdale-Kipling':
{'id': 4, 'hypertension': [201, 3669, 1134, 3229, 1393, 1854]},
'Elms-Old Rexdale':
{'id': 5, 'hypertension': [176, 3353, 1040, 2842, 948, 1322]}}
Similarly, using the small low income file and an empty dictionary as arguments to
get_low_income_data, the result should be that the dictionary now contains the low income
data as in SAMPLE_DATA, but not the hypertension data.
{'West Humber-Clairville':
{'id': 1, 'total': 33230, 'low_income': 5950},
'Mount Olive-Silverstone-Jamestown':
{'id': 2, 'total': 32940, 'low_income': 9690},
'Thistletown-Beaumond Heights':
{'id': 3, 'total': 10365, 'low_income': 2005},
'Rexdale-Kipling':
{'id': 4, 'total': 10540, 'low_income': 2140},
'Elms-Old Rexdale':
{'id': 5, 'total': 9460, 'low_income': 2315}}
A complete CityData dictionary will have been passed to both functions. See the sample
usage at the end of the starter code file for an example of how both functions are used to
build a CityData dictionary.
Note: While this is the first task, it is not necessarily the easiest. If you are stuck while
working on this task, we suggest moving on to other tasks and coming back to this later.
Recall that TextIO as the parameter type means the file is already open.
Functions: Task 1
Function name:
(Parameter types) -> Return type Full Description (paraphrase to get a proper docstring
description)
get_hypertension_data:
(dict, TextIO) -> None
The first parameter is a dictionary representing hypertension and/or low income data for a
neighbourhood and the second parameter is a hypertension data file that is open for reading.
This function should modify the dictionary so that it contains the hypertension data in the file.
If a neighbourhood with data in the file is already in the dictionary then its hypertension data
should be updated. Otherwise it should be added to the dictionary with its hypertension data.
After this function is called, the dictionary should contain key/value pairs whose keys are the
names of every neighbourhood in the hypertension data file, and whose values are
dictionaries which contain at least the keys ID and HT for those neighbourhoods. After both
functions get_hypertension_data and get_low_income_data are called with a dict as the first
argument, this argument will be a complete CityData type.
get_low_income_data:
(dict, TextIO) -> None
The first parameter is a dictionary representing hypertension and/or low income data for a
neighbourhood and the second parameter is a low income data file that is open for reading.
This function should modify the dictionary so that it contains the low income data in the file.
If a neighbourhood with data in the file is already in the dictionary then its low income data
should be updated. Otherwise it should be added to the dictionary with its low income data.
After this function is called, the dictionary should contain key/value pairs whose keys are the
names of every neighbourhood in the low income data file, and whose values are
dictionaries which contain at least the keys ID, TOTAL, and LOW_INCOME for those
neighbourhoods. After both functions get_hypertension_data and get_low_income_data are
called with a dict as the first argument, this argument will be a complete CityData type.
Task 2: Neighbourhood-level Analysis
Functions: Task 2
Function name:
(Parameter types) -> Return type Full Description (paraphrase to get a proper docstring
description)
get_bigger_neighbourhood:
(CityData, str, str) -> str
The first parameter is a CityData dictionary, and the second and third parameters are strings
representing the names of neighbourhoods. The function returns the name of the
neighbourhood that has a higher population, according to the low income data.
Assume that the two neighbourhood names are different. If a name is not in the dictionary,
assume it has a population of 0. If the two neighbourhoods are the same size, return the first
name (i.e., the leftmost one in the parameters list, not alphabetically).
get_high_hypertension_rate:
(CityData, float) -> list[tuple[str, float]]
The first parameter is a CityData dictionary, and the second parameter is a number between
0.0 and 1.0 (inclusive) that represents a threshold. This function should return a list of tuples
representing all neighbourhoods with a hypertension rate greater than or equal to the
threshold. In each tuple, the first item is the neighbourhood name and the second item is the
hypertension rate in that neighbourhood.
Compute the overall hypertension rate for a neighbourhood by dividing the total number of
people with hypertension by the total number of adults in the neighbourhood. You may
assume that no neighbourhood has 0 population.
If this function was called with the provided SAMPLE_DATA dictionary and a threshold of 0.3
as arguments, then the returned value would be [('Thistletown-Beaumond Heights',
0.31797739151574084), ('Rexdale-Kipling', 0.3117001828153565)]. The order of the tuples
in the list does not matter.
get_ht_to_low_income_ratios:
(CityData) -> dict[str, float]
The parameter is a CityData dictionary. This function should return a dictionary where the
keys are the same as in the parameter, and the values are the ratio of the hypertension rate
to the low income rate for that neighbourhood.
For the denominators for each rate, use the total number of people as given in the
corresponding data file. That is, for calculating the low income rate, use the total population
in the neighbourhood from the low-income data file; and for the hypertension rate, use the
sum of the total people in all three age groups in the hypertension data. You may assume
that no neighbourhood has 0 population.
For example, if this function was called with the provided SAMPLE_DATA dictionary as an
argument, then the returned dictionary should include the key/value pairs 'West
Humber-Clairville': 1.6683148168616895 and 'Mount Olive-Silverstone-Jamestown':
0.9676885451091314, as well as the pairs for the other three neighbourhoods.
You will find that writing a helper function would be useful here.
calculate_ht_rates_by_age_group:
(CityData, str) -> tuple[float, float, float]
The first parameter is a CityData dictionary, and the second parameter represents a
neighbourhood name that is a key in the dictionary. The function returns a tuple of three
values, representing the hypertension rate for each of the three age groups in the
neighbourhood as a percentage. (Note that this is different from the previous two functions,
where the rate was calculated using the total numbers, and was not expressed as a
percentage.)
For example, consider the neighbourhood with the name 'Elms-Old Rexdale' in the provided
SAMPLE_DATA dictionary. The rate of hypertension for the 20-44 age group in the
neighbourhood is computed by dividing the number of people aged 20-44 with hypertension
by the total number of people aged 20-44. For this neighbourhood, that is 176 / 3353. To get
this rate as a percentage, we then multiply by 100, for a rate of 5.24903071875932. The rate
is calculated in the same way for the 45-64 and 65+ age groups. Thus, calling this function
with the provided SAMPLE_DATA dictionary and the string 'Elms-Old Rexdale' should return
(5.24903071875932, 36.593947923997185, 71.70953101361573).
You may assume that no neighbourhood has a 0 population. Notice that this function is used
as a helper in the get_age_standardized_ht_rate function that we have provided for you.
Task 3: Finding the Correlation
Functions: Task 3
Function name:
(Parameter types) -> Return type Full Description (paraphrase to get a proper docstring
description)
get_correlation:
(CityData) -> float
The parameter for this function is a CityData dictionary. This function returns the correlation
between age standardised hypertension rates and low income rates across all
neighbourhoods.
To complete this function, you will need to use the correlation function in the module
statistics. Refer to the documentation for the function to determine what arguments to pass
to the correlation function, and how to use its returned value. You can find the documentation
online hereLinks to an external site. or by using help(statistics.correlation). Remember that
to call help on a function from another module, you first need to import the module.
You will need to use the provided function get_age_standardized_ht_rate as a helper to get
the age-standardised rate for each neighbourhood.
Task 4: Order by Ratio
Functions: Task 4
Function name:
(Parameter types) -> Return type Full Description (paraphrase to get a proper docstring
description)
order_by_ht_rate:
(CityData) -> list[str]
The parameter is a CityData dictionary. This function will return a list of the names of the
neighbourhoods, ordered from lowest to highest age-standardised hypertension rate. We
use the age-standardised rate because we are comparing across neighbourhoods.
Assume every neighbourhood has a unique hypertension rate; i.e., that there are no ties.
For example, if this function is called with the CityData dictionary provided in the starter
code, it will return ['Elms-Old Rexdale', 'Rexdale-Kipling', 'Thistletown-Beaumond Heights',
'West Humber-Clairville', 'Mount Olive-Silverstone-Jamestown']
There are multiple ways to solve this problem. You may choose to solve this problem by
writing your own sorting code, but you do not have to do this. You can also use list.sort as
part of your solution, if you choose.
Task 5: Required Testing (unittest)
Write and submit a unittest file for the get_bigger_neighbourhood function. We have
provided starter code in the test_a3.py file. We have included one test that you can use as a
template to write your other test methods. For each test method, include a brief docstring
description specifying what is being tested. Do not write examples in the docstrings. Your set
of tests should all pass on correct code, and your tests should be thorough enough that at
least one of them will fail on a buggy version of the function. There is no required number of
tests; we will mark your tests by running them on the correct code as well as several buggy
versions.
Files to Download
Download a3.zip Download a3.zip which contains starter code (a3.py and test_a3.py), the
checker (a3_checker.py together with the helper file checker.py and folder pyta), and two
sizes of each type of data file.
Marking
These are the aspects of your work that will be marked for Assignment 3:
Correctness (70%): Your functions should perform as specified. Correctness, as measured
by our tests, will count for the largest single portion of your marks. Once your assignment is
submitted, we will run additional tests, not provided in the checker. Passing the checker does
not mean that your code will earn full marks for correctness.
Testing (15%): Your test suite will be checked by running it on incorrect/broken
implementations. Your tests should all pass on a correct version of the function, and at least
one should fail on each of our broken implementations.
Coding style (15%):
Make sure that you follow Python style guidelines that we have introduced and the Python
coding conventions that we have been using throughout the semester. Although we don't
provide an exhaustive list of style rules, the checker tests for style are complete, so if your
code passes the checker, then it will earn full marks for coding style with one exception:
docstrings may be evaluated separately. For each occurrence of a PyTA error, one mark (out
of 20) deduction will be applied. For example, if a C0301 (line-too-long) error occurs 3 times,
then 3 marks will be deducted.
If you encounter PyTA error R0915 (too-many-statements), that indicates that your function
is too long (more than 20 statements long). In that case, introduce helper functions to do
some of the work — even if the helpers will only be called once. Your program should be
broken down into functions, both to avoid repetitive code and to make the program easier to
read.
All functions, including helper functions, should have complete docstrings including
preconditions when you think they are necessary.
Also, your variable names and names of your helper functions should be meaningful. Your
code should be as simple and clear as possible.
What to Hand In
The very last thing you do before submitting should be to run the checker program one last
time.
Otherwise, you could make a small error in your final changes before submitting that causes
your code to receive zero for correctness.
Submit a3.py and test_a3.py on MarkUs by following the instructions on the course website.
Remember that spelling of filenames, including case, counts: your file must be named
exactly as above.