ECON 2041 Problem set
Objectives
Demonstrate your mastery of topics covered in ECON2041 so far, including multivariate regression, statistical inference, and model diagnostics.
Practice producing a professional memo that applies econometric techniques to a business problem and communicates the results to a non-technical audience.
Employers (whether in government, consulting, finance, or research) value graduates who can turn careful analysis into clear, non-technical language that supports decision-making. This assessment is designed to give you practice with this highly sought-after skill, which is often tested in job applications and interviews for junior roles.
Introduction
Insurance companies face uncertainty in predicting how much different customers will cost them. You will use a dataset of 1,338 insured individuals that includes demographics (age, sex, number of children), health behaviors (smoking status), body composition (BMI), geographic region, and annual medical charges billed to the insurance system. These charges effectively represent what the insured individual cost the insurance company in the previous year.
Your objective is to analyze which observable characteristics drive differences in annual medical costs and communicate your findings in a memo for an insurance company's research division.
Deliverables
1. Analysis notebook (60 marks)
The analysis notebook (.ipynb) should contain all of the technical Python work that underpins the memo, following the technical requirements below.
The notebook should be clear, reproducible, and fully commented. Include the sections described below. It must run from start to finish without errors and reproduce the results you report in the memo.
(A) Exploratory data analysis (EDA) [15 marks]
Complete the following EDA tasks:
1. Summary statistics
Create a summary statistics table for charges, age, bmi, and children
Create a frequency table for smoker, region, and sex
2. Distribution of the dependent variable
Create a histogram of charges
3. Smokers vs. non-smokers
Calculate and report the mean and standard deviation of charges for smokers and non-smokers
4. Joint relationships
Create a scatterplot of charges vs age
Create a scatterplot of charges vs bmi
Create another scatterplot grouped by smoker status, with points colored differently for smokers and non-smokers (Hint: Remember that we use the option hue to apply different colors to different groups)
(B) Model estimation [20 marks]
1. Model 1: Baseline
Estimate a simple regression using OLS, with charges as the dependent variable and smoker as the independent variable. This is your baseline model.
2. Model 2: Binary BMI
Create a binary variable, obese, that equals 1 when BMI > 30. Estimate an OLS model with charges as the dependent variable. Explanatory variables: smoker, obese, plus the demographic variables (age, sex, children) and the geographical variable (region). Make sure that your regression specification accounts for the fact that region is categorical.
3. Model 3: Your preferred model
Estimate an OLS model with all the same explanatory variables as Model 2. In addition:
you must include the interaction between smoker and obese
you may include non-linear transformations of age, such as age² and age³, depending on the results of the relevant hypothesis test (see below)
you may include the interaction between age and smoker, depending on the results of the relevant hypothesis test (see below)
Remember: if you include an interaction term between two variables x_j and x_k, you must always include the variables themselves as well
Include all the regressions you try in the notebook, even though you will only report your final regression in the memo
Clearly identify in the notebook which is your preferred regression (based on the hypothesis tests)
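One possible way to set up the three specifications with the statsmodels formula interface is sketched below. The data are synthetic stand-ins generated in the block (an assumption, not the real dataset). In a formula, `C(region)` marks region as categorical, and `smoker * obese` expands to both main effects plus their interaction, which satisfies the rule about including the variables themselves.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (assumption): replace with the real dataset.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["female", "male"], n),
    "bmi": rng.normal(30, 6, n).round(1),
    "children": rng.integers(0, 6, n),
    "smoker": rng.choice(["no", "yes"], n, p=[0.8, 0.2]),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
})
smokes = (df["smoker"] == "yes").astype(int)

# Binary indicator for BMI > 30
df["obese"] = (df["bmi"] > 30).astype(int)

df["charges"] = (2000 + 250 * df["age"] + 13000 * smokes
                 + 20000 * smokes * df["obese"] + rng.normal(0, 3000, n))

# Model 1: baseline, smoker only
m1 = smf.ols("charges ~ smoker", data=df).fit()

# Model 2: smoker, obese, demographics, and region as a categorical variable
m2 = smf.ols(
    "charges ~ smoker + obese + age + sex + children + C(region)", data=df
).fit()

# Model 3 (one candidate): Model 2 plus the smoker x obese interaction.
# "smoker * obese" is shorthand for smoker + obese + smoker:obese.
m3 = smf.ols(
    "charges ~ smoker * obese + age + sex + children + C(region)", data=df
).fit()

print(m3.summary())
```

Whether age² and age³ or an age-by-smoker interaction belong in the preferred model depends on the hypothesis tests, so the formula for `m3` here is only a starting point.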
(C) Diagnostic tests [10 marks]
1. Conduct a Breusch-Pagan test for heteroskedasticity on all three models (Baseline, Model 2, Preferred model)
2. If any regression shows evidence of heteroskedasticity at α = 0.1, then re-estimate the model using robust standard errors
include the results with robust standard errors in your memo
conduct the hypothesis tests below with robust standard errors
Note: you may have to iterate a bit back and forth between (C) and (D), as you use the findings from the F-tests to choose your preferred model.
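The diagnostic step might look like the sketch below, which runs `het_breuschpagan` from statsmodels on each model and refits with robust standard errors when the test rejects at α = 0.1. The data are synthetic stand-ins (an assumption), the "Preferred" formula is just one candidate, and the choice of the HC1 estimator is also an assumption, since the brief does not say which robust estimator to use.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic stand-in data (assumption): replace with the real dataset.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["female", "male"], n),
    "bmi": rng.normal(30, 6, n).round(1),
    "children": rng.integers(0, 6, n),
    "smoker": rng.choice(["no", "yes"], n, p=[0.8, 0.2]),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
})
smokes = (df["smoker"] == "yes").astype(int)
df["obese"] = (df["bmi"] > 30).astype(int)
# Noise is larger for smokers, so these fake data are heteroskedastic by design
df["charges"] = (2000 + 250 * df["age"] + 13000 * smokes
                 + 20000 * smokes * df["obese"]
                 + rng.normal(0, 3000, n) * (1 + smokes))

formulas = {
    "Baseline": "charges ~ smoker",
    "Model 2": "charges ~ smoker + obese + age + sex + children + C(region)",
    "Preferred": "charges ~ smoker * obese + age + sex + children + C(region)",
}

ALPHA = 0.10
bp_pvalues = {}
results = {}
for name, formula in formulas.items():
    fit = smf.ols(formula, data=df).fit()
    # Returns (LM statistic, LM p-value, F statistic, F p-value)
    lm_stat, lm_pvalue, _, _ = het_breuschpagan(fit.resid, fit.model.exog)
    bp_pvalues[name] = lm_pvalue
    if lm_pvalue < ALPHA:
        # Same coefficients, heteroskedasticity-robust (HC1) standard errors
        fit = smf.ols(formula, data=df).fit(cov_type="HC1")
    results[name] = fit
    print(f"{name}: BP LM = {lm_stat:.2f}, p = {lm_pvalue:.4f}, "
          f"cov_type = {fit.cov_type}")
```

Robust standard errors change only the standard errors and test statistics, not the coefficient estimates.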
(D) Hypothesis tests [15 marks]
1. t-tests:
1. Test whether smoking significantly affects medical costs in Model 1
2. Test whether the difference in costs between smokers and non-smokers differs significantly between obese and non-obese individuals in your preferred model
2. F-tests:
1. Test whether the demographic variables (age, children, sex) and region are jointly significant using an F-test in Model 2
2. Test whether age has a non-linear relationship with charges by comparing Model 2 to a regression that also includes age² and age³. Test their joint significance using an F-test
For each test, state the null and alternative hypotheses in your notebook and report the test statistic & p-value
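The tests above could be sketched as follows. The data are again synthetic stand-ins (an assumption), and `joint_f_test` is a hypothetical helper written for this illustration, not a statsmodels function. It builds a restriction matrix and passes it to the results object's `f_test` method, which uses whichever covariance estimator the model was fitted with. The sketch fits every model with HC1 robust errors (an assumption that presumes the Breusch-Pagan test rejected).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (assumption): replace with the real dataset.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["female", "male"], n),
    "bmi": rng.normal(30, 6, n).round(1),
    "children": rng.integers(0, 6, n),
    "smoker": rng.choice(["no", "yes"], n, p=[0.8, 0.2]),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
})
smokes = (df["smoker"] == "yes").astype(int)
df["obese"] = (df["bmi"] > 30).astype(int)
df["charges"] = (2000 + 250 * df["age"] + 13000 * smokes
                 + 20000 * smokes * df["obese"]
                 + rng.normal(0, 3000, n) * (1 + smokes))


def joint_f_test(result, prefixes):
    """Hypothetical helper: F-test that all coefficients whose names start
    with one of `prefixes` are zero. Returns (F, p-value, restrictions)."""
    names = list(result.params.index)
    rows = [i for i, nm in enumerate(names) if nm.startswith(tuple(prefixes))]
    restrictions = np.zeros((len(rows), len(names)))
    restrictions[np.arange(len(rows)), rows] = 1.0
    test = result.f_test(restrictions)
    return float(np.squeeze(test.fvalue)), float(np.squeeze(test.pvalue)), len(rows)


robust = dict(cov_type="HC1", use_t=True)  # assumption: robust errors needed
base = "charges ~ smoker + obese + age + sex + children + C(region)"
m1 = smf.ols("charges ~ smoker", data=df).fit(**robust)
m2 = smf.ols(base, data=df).fit(**robust)
m2_cubic = smf.ols(base + " + I(age**2) + I(age**3)", data=df).fit(**robust)
m3 = smf.ols("charges ~ smoker * obese + age + sex + children + C(region)",
             data=df).fit(**robust)

# t-test 1. H0: smoker coefficient = 0 in Model 1
t_smoker = (m1.tvalues["smoker[T.yes]"], m1.pvalues["smoker[T.yes]"])
# t-test 2. H0: smoker x obese interaction = 0 in the preferred model
t_inter = (m3.tvalues["smoker[T.yes]:obese"], m3.pvalues["smoker[T.yes]:obese"])

# F-test 1. H0: age, children, sex, and all region coefficients = 0 in Model 2
f_demo = joint_f_test(m2, ["age", "children", "sex", "C(region)"])
# F-test 2. H0: coefficients on age^2 and age^3 are both 0
f_age = joint_f_test(m2_cubic, ["I(age"])

print("smoker t-test:", t_smoker)
print("interaction t-test:", t_inter)
print("demographics + region F-test:", f_demo)
print("age polynomial F-test:", f_age)
```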
(E) Save your results
1. Save your regression output in an Excel file (instructions in the notebook) for easy copying into a Word table
2. As always, save your .ipynb file after running everything and include this with your submission
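The notebook provides its own instructions for the Excel export, so the following is only a rough sketch of one way to assemble a three-column table and write it out. The data are synthetic stand-ins, and both the file name `regression_table.xlsx` and the helper `table_column` are hypothetical. `DataFrame.to_excel` needs an Excel writer such as openpyxl to be installed, so the sketch falls back to CSV if it is missing.

```python
from pathlib import Path

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (assumption): replace with the real dataset.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["female", "male"], n),
    "bmi": rng.normal(30, 6, n).round(1),
    "children": rng.integers(0, 6, n),
    "smoker": rng.choice(["no", "yes"], n, p=[0.8, 0.2]),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
})
smokes = (df["smoker"] == "yes").astype(int)
df["obese"] = (df["bmi"] > 30).astype(int)
df["charges"] = (2000 + 250 * df["age"] + 13000 * smokes
                 + 20000 * smokes * df["obese"]
                 + rng.normal(0, 3000, n) * (1 + smokes))

fit_kwargs = dict(cov_type="HC1")  # assumption: robust errors were needed
m1 = smf.ols("charges ~ smoker", data=df).fit(**fit_kwargs)
m2 = smf.ols("charges ~ smoker + obese + age + sex + children + C(region)",
             data=df).fit(**fit_kwargs)
m3 = smf.ols("charges ~ smoker * obese + age + sex + children + C(region)",
             data=df).fit(**fit_kwargs)


def table_column(result):
    """Hypothetical helper: one table column as 'coefficient (std. error)'
    strings, followed by the number of observations and R-squared."""
    cells = {name: f"{result.params[name]:.1f} ({result.bse[name]:.1f})"
             for name in result.params.index}
    cells["Observations"] = f"{int(result.nobs)}"
    cells["R-squared"] = f"{result.rsquared:.3f}"
    return pd.Series(cells)


# Row order follows the largest model, with fit statistics at the bottom
row_order = list(m3.params.index) + ["Observations", "R-squared"]
table = pd.DataFrame({
    "Baseline": table_column(m1),
    "Model 2": table_column(m2),
    "Preferred": table_column(m3),
}).reindex(row_order).fillna("")

out_path = Path("regression_table.xlsx")  # hypothetical file name
try:
    table.to_excel(out_path)
except ImportError:
    # No Excel writer installed: fall back to CSV, which Word/Excel can open
    table.to_csv(out_path.with_suffix(".csv"))

print(table)
```

Variable names such as `smoker[T.yes]` would normally be relabeled in plain language before the table goes into the memo.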
2. Research memo, max 3 pages (40 marks)
Written for a well-informed, non-econometrician audience, the memo should be concise, focused, and free of econometric jargon. You should write in complete sentences and paragraphs, not bullet points.
Please view the memo template for the recommended structure!
The memo must include exactly one figure and one regression table:
The figure should illustrate your most important finding (e.g., the relationship between a key predictor and medical costs).
The table should report the results from 3 different regression models. For each, it should report the estimated coefficients, the appropriate standard errors (i.e. robust if there is heteroskedasticity), the number of observations, and the R² value. Format the table clearly and report an appropriate number of decimal places.
1. The first column should report the results from the baseline model.
2. The second column should report results from Model 2.
3. The last column of the table should report the results from your preferred model.
Memo marking criteria
Technical accuracy (15 marks): Correct interpretation of coefficients, standard errors and confidence intervals, and model fit statistics
Business communication (10 marks): Presents findings in clear, jargon-free language that a non-econometrician can understand, while maintaining a professional tone and style.
Critical analysis (5 marks): Careful discussion of limitations and realistic recommendations
Professional presentation (10 marks): The memo includes one clear figure with appropriate labels and one well-formatted regression table, uses consistent and appropriate formatting (e.g., not too many significant digits), and respects the page limit. This component also covers compliance with the assessment's genAI policy.