ECON 2041 Problem set

Objectives

Demonstrate your mastery of topics covered in ECON2041 so far, including multivariate regression, statistical inference, and model diagnostics.

Practice producing a professional memo that applies econometric techniques to a business problem and communicates the results to a non-technical audience.

Employers (whether in government, consulting, finance, or research) value graduates who can turn careful analysis into clear, non-technical language that supports decision-making. This assessment is designed to give you practice with this highly sought-after skill, which is often tested in job applications and interviews for junior roles.

Introduction

Insurance companies face uncertainty in predicting how much different customers will cost them. You will use a dataset of 1,338 insured individuals that includes demographics (age, sex, number of children), health behaviors (smoking status), body composition (BMI), geographic region, and annual medical charges billed to the insurance system. These charges effectively represent what the insured individual cost the insurance company in the previous year.

Your objective is to analyze which observable characteristics drive differences in annual medical costs and communicate your findings in a memo for an insurance company's research division.

Deliverables

1. Analysis notebook (60 marks)

The analysis notebook (.ipynb) should contain all of the technical Python work that underpins the memo, following the technical requirements below.

The notebook should be clear, reproducible, and fully commented. Include the sections as described below. The notebook should be possible to run from start to finish without errors and must reproduce the results you report in the memo.

(A) Exploratory data analysis (EDA) [15 marks]

Complete the following EDA tasks:

1. Summary statistics

Create a summary statistics table for charges, age, bmi, and children

Create a frequency table for smoker, region, and sex
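A minimal sketch of the two tables, assuming the data sit in a pandas DataFrame whose columns carry the variable names listed above. The DataFrame below is synthetic (made-up values, with assumed category labels such as "yes"/"no") so that the example runs on its own; in the notebook you would load the real dataset instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the insurance data (all values are made up).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["female", "male"], n),
    "bmi": rng.normal(30, 6, n).round(1),
    "children": rng.integers(0, 6, n),
    "smoker": rng.choice(["no", "yes"], n, p=[0.8, 0.2]),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
})
df["charges"] = (3000 + 250 * df["age"]
                 + 20000 * (df["smoker"] == "yes")
                 + rng.normal(0, 3000, n))

# Summary statistics for the numeric variables (one row per variable).
summary_table = df[["charges", "age", "bmi", "children"]].describe().T

# Frequency tables for the categorical variables.
frequency_tables = {col: df[col].value_counts() for col in ["smoker", "region", "sex"]}

print(summary_table.round(2))
for col, table in frequency_tables.items():
    print(table, "\n")
```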

2. Distribution of the dependent variable

Create a histogram of charges

3. Smokers vs. non-smokers

Calculate and report the mean and standard deviation of charges for smokers and non-smokers
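One possible way to get both statistics in a single table is a grouped aggregation. The sketch below uses synthetic data with assumed "yes"/"no" labels for smoker; the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (made-up values, assumed "yes"/"no" labels).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"smoker": rng.choice(["no", "yes"], n, p=[0.8, 0.2])})
df["charges"] = 8000 + 20000 * (df["smoker"] == "yes") + rng.normal(0, 3000, n)

# Mean, standard deviation, and group size of charges by smoking status.
charges_by_smoker = df.groupby("smoker")["charges"].agg(["mean", "std", "count"])
print(charges_by_smoker.round(2))
```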

4. Joint relationships

Create a scatterplot of charges vs age

Create a scatterplot of charges vs bmi

Create another scatterplot with points colored by smoker status (Hint: remember that we use the option hue to apply different colors to different groups)
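The three plots can be sketched with seaborn, whose scatterplot function takes the hue option mentioned in the hint. The data are synthetic, and putting bmi on the horizontal axis of the grouped plot is an assumption made only for illustration; the task does not say which variables to use there.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (made-up values).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n).round(1),
    "smoker": rng.choice(["no", "yes"], n, p=[0.8, 0.2]),
})
df["charges"] = (3000 + 250 * df["age"]
                 + 20000 * (df["smoker"] == "yes")
                 + rng.normal(0, 3000, n))

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.scatterplot(data=df, x="age", y="charges", ax=axes[0])
sns.scatterplot(data=df, x="bmi", y="charges", ax=axes[1])
# hue draws each smoker group in its own color and adds a legend.
sns.scatterplot(data=df, x="bmi", y="charges", hue="smoker", ax=axes[2])
fig.tight_layout()
```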

(B) Model estimation [20 marks]

1. Model 1: Baseline

Estimate a simple regression using OLS, with charges as the dependent variable and smoker as the independent variable. This is your baseline model.
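A sketch of the baseline regression using the statsmodels formula interface, on synthetic data with assumed "yes"/"no" labels. With a single binary regressor, the OLS intercept is the mean of charges for the reference group and the slope is the difference in group means, which the last lines check.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (made-up values).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"smoker": rng.choice(["no", "yes"], n, p=[0.8, 0.2])})
df["charges"] = 8000 + 20000 * (df["smoker"] == "yes") + rng.normal(0, 3000, n)

# smoker is a string column, so the formula interface dummy-codes it;
# "no" sorts first and becomes the reference category.
model1 = smf.ols("charges ~ smoker", data=df).fit()
print(model1.params.round(2))

# Compare the slope with the raw difference in group means.
group_means = df.groupby("smoker")["charges"].mean()
slope = model1.params["smoker[T.yes]"]
print(round(slope, 2), round(group_means["yes"] - group_means["no"], 2))
```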

2. Model 2: Binary BMI

Create a binary variable for BMI > 30 (obese). Estimate an OLS model with charges as the dependent variable. Explanatory variables: smoker, obese, plus the demographic variables (age, sex, children) and geographical variable (region). Make sure that your regression specification accounts for the fact that region is categorical.
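As an illustration, the obese dummy and the categorical treatment of region could look like the sketch below. The helper make_demo_data and its values are hypothetical stand-ins for the real dataset. Wrapping region in C(...) tells the formula interface to expand it into dummies with one category left out as the reference.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def make_demo_data(n=400, seed=0):
    """Made-up stand-in for the insurance data; not the real dataset."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "age": rng.integers(18, 65, n),
        "sex": rng.choice(["female", "male"], n),
        "bmi": rng.normal(30, 6, n).round(1),
        "children": rng.integers(0, 6, n),
        "smoker": rng.choice(["no", "yes"], n, p=[0.8, 0.2]),
        "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
    })
    smokes = (df["smoker"] == "yes").astype(int)
    df["charges"] = 3000 + 250 * df["age"] + 20000 * smokes + rng.normal(0, 3000, n)
    return df

df = make_demo_data()

# Binary indicator: 1 if BMI is above 30, 0 otherwise.
df["obese"] = (df["bmi"] > 30).astype(int)

model2 = smf.ols(
    "charges ~ smoker + obese + age + sex + children + C(region)", data=df
).fit()
print(model2.params.round(1))
```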

3. Model 3: Your preferred model

Estimate an OLS model with all the same explanatory variables as Model 2, plus the following:

you must include the interaction between smoker and obese

you may include non-linear transformations of age, such as age² and age³, depending on the results of the relevant hypothesis test (see below)

you may include the interaction between age and smoker, depending on the results of the relevant hypothesis test (see below)

Remember: if you include an interaction term between two variables x_j and x_k, you always have to include the variables themselves as well

Include all the regressions you try in the notebook, even though you will only report your final regression in the memo

Clearly identify in the notebook which is your preferred regression (based on the hypothesis tests)
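The candidate specifications for Model 3 can be written compactly with formula strings. This is only a sketch on synthetic data: make_demo_data, the column names age2 and age3, and the labels in the candidates dictionary are all hypothetical. In a formula, smoker * obese expands to smoker + obese + smoker:obese, so the main effects are included automatically alongside the interaction.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def make_demo_data(n=400, seed=0):
    """Made-up stand-in for the insurance data; not the real dataset."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "age": rng.integers(18, 65, n),
        "sex": rng.choice(["female", "male"], n),
        "bmi": rng.normal(30, 6, n).round(1),
        "children": rng.integers(0, 6, n),
        "smoker": rng.choice(["no", "yes"], n, p=[0.8, 0.2]),
        "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
    })
    smokes = (df["smoker"] == "yes").astype(int)
    heavy = (df["bmi"] > 30).astype(int)
    df["charges"] = (3000 + 250 * df["age"] + 15000 * smokes
                     + 15000 * smokes * heavy + rng.normal(0, 3000, n))
    return df

df = make_demo_data()
df["obese"] = (df["bmi"] > 30).astype(int)
df["age2"] = df["age"] ** 2
df["age3"] = df["age"] ** 3

controls = "age + sex + children + C(region)"
candidates = {
    "interaction only": f"charges ~ smoker * obese + {controls}",
    "plus age polynomial": f"charges ~ smoker * obese + age2 + age3 + {controls}",
    "plus age x smoker": f"charges ~ smoker * obese + smoker:age + {controls}",
}

# Fit every candidate so that all attempted regressions stay in the notebook.
fits = {label: smf.ols(formula, data=df).fit() for label, formula in candidates.items()}
for label, fit in fits.items():
    print(label, "| adj. R2:", round(fit.rsquared_adj, 3))
```

Which candidate ends up as the preferred model depends on the hypothesis tests, not on this sketch.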

(C) Diagnostic tests [10 marks]

1. Conduct a Breusch-Pagan test for heteroskedasticity on all three models (Baseline, Model 2, Preferred model)

2. If any regression shows evidence of heteroskedasticity at α = 0.1, then re-estimate the model using robust standard errors

include the robust-standard error results in your memo

conduct the hypothesis tests below with robust standard errors

Note: you may have to iterate a bit back and forth between (C) and (D), as you use the findings from the F-tests to choose your preferred model.
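A sketch of the test-then-re-estimate loop, using het_breuschpagan from statsmodels. The data are synthetic (the hypothetical make_demo_data deliberately gives smokers noisier charges), the third formula is only a candidate for the preferred model, and the choice of the HC1 robust covariance is an assumption; use whichever robust estimator the course specifies.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

def make_demo_data(n=400, seed=0):
    """Made-up stand-in for the insurance data; not the real dataset."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "age": rng.integers(18, 65, n),
        "sex": rng.choice(["female", "male"], n),
        "bmi": rng.normal(30, 6, n).round(1),
        "children": rng.integers(0, 6, n),
        "smoker": rng.choice(["no", "yes"], n, p=[0.8, 0.2]),
        "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
    })
    smokes = (df["smoker"] == "yes").astype(int)
    heavy = (df["bmi"] > 30).astype(int)
    noise = rng.normal(0, 1, n) * (2000 + 6000 * smokes)  # error variance differs by group
    df["charges"] = (3000 + 250 * df["age"] + 15000 * smokes
                     + 15000 * smokes * heavy + noise)
    return df

df = make_demo_data()
df["obese"] = (df["bmi"] > 30).astype(int)

controls = "age + sex + children + C(region)"
formulas = {
    "Model 1": "charges ~ smoker",
    "Model 2": f"charges ~ smoker + obese + {controls}",
    "Model 3 (candidate)": f"charges ~ smoker * obese + {controls}",
}

ALPHA = 0.1
bp_pvalues, reported = {}, {}
for label, formula in formulas.items():
    classical = smf.ols(formula, data=df).fit()
    # Returns (LM statistic, LM p-value, F statistic, F p-value).
    lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(
        classical.resid, classical.model.exog
    )
    bp_pvalues[label] = lm_pvalue
    if lm_pvalue < ALPHA:
        # Same coefficients, heteroskedasticity-robust standard errors.
        reported[label] = smf.ols(formula, data=df).fit(cov_type="HC1")
    else:
        reported[label] = classical
    print(f"{label}: LM = {lm_stat:.2f}, p = {lm_pvalue:.4f}, "
          f"cov_type = {reported[label].cov_type}")
```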

(D) Hypothesis tests [15 marks]

1. t-tests:

1. Test whether smoking significantly affects medical costs in Model 1

2. Test whether the difference in costs between smokers and non-smokers differs significantly between obese and non-obese individuals in your preferred model

2. F-tests:

1. Test whether the demographic variables (age, children, sex) and region are jointly significant using an F-test in Model 2

2. Test whether age has a non-linear relationship with charges by comparing Model 2 to a regression that also includes age2 and age3. Test their joint significance using an F-test

For each test, state the null and alternative hypotheses in your notebook and report the test statistic & p-value
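The tests above can be sketched as follows, on synthetic data. The helper names make_demo_data and joint_f_test are hypothetical, and fitting with cov_type="HC1" assumes that the Breusch-Pagan tests pointed to heteroskedasticity. Each joint test is run as a restriction on the larger model, so the F statistic uses that model's (here robust) covariance matrix.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def make_demo_data(n=400, seed=0):
    """Made-up stand-in for the insurance data; not the real dataset."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "age": rng.integers(18, 65, n),
        "sex": rng.choice(["female", "male"], n),
        "bmi": rng.normal(30, 6, n).round(1),
        "children": rng.integers(0, 6, n),
        "smoker": rng.choice(["no", "yes"], n, p=[0.8, 0.2]),
        "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
    })
    smokes = (df["smoker"] == "yes").astype(int)
    heavy = (df["bmi"] > 30).astype(int)
    noise = rng.normal(0, 1, n) * (2000 + 6000 * smokes)
    df["charges"] = (3000 + 250 * df["age"] + 15000 * smokes
                     + 15000 * smokes * heavy + noise)
    return df

def joint_f_test(result, tested):
    """F-test of H0: every coefficient named in `tested` equals zero."""
    names = list(result.params.index)
    R = np.zeros((len(tested), len(names)))
    for row, name in enumerate(tested):
        R[row, names.index(name)] = 1.0
    test = result.f_test(R)
    return float(np.squeeze(test.fvalue)), float(np.squeeze(test.pvalue))

df = make_demo_data()
df["obese"] = (df["bmi"] > 30).astype(int)
df["age2"] = df["age"] ** 2
df["age3"] = df["age"] ** 3
controls = "age + sex + children + C(region)"
fit_kwargs = {"cov_type": "HC1", "use_t": True}  # robust SEs, t and F distributions

m1 = smf.ols("charges ~ smoker", data=df).fit(**fit_kwargs)
m2 = smf.ols(f"charges ~ smoker + obese + {controls}", data=df).fit(**fit_kwargs)
m2_poly = smf.ols(f"charges ~ smoker + obese + age2 + age3 + {controls}",
                  data=df).fit(**fit_kwargs)
m3 = smf.ols(f"charges ~ smoker * obese + {controls}", data=df).fit(**fit_kwargs)

# t-tests. H0: the coefficient is zero; H1: it is not.
print("smoker, Model 1:", m1.tvalues["smoker[T.yes]"], m1.pvalues["smoker[T.yes]"])
print("smoker x obese, Model 3:",
      m3.tvalues["smoker[T.yes]:obese"], m3.pvalues["smoker[T.yes]:obese"])

# F-test 1. H0: age, children, sex, and all region coefficients are zero.
demo_names = [name for name in m2.params.index
              if name in ("age", "children") or name.startswith(("sex", "C(region)"))]
f_demo, p_demo = joint_f_test(m2, demo_names)

# F-test 2. H0: the coefficients on age2 and age3 are both zero.
f_poly, p_poly = joint_f_test(m2_poly, ["age2", "age3"])
print("demographics + region:", f_demo, p_demo)
print("age polynomial:", f_poly, p_poly)
```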

(E) Save your results

1. Save your regression output in an Excel file (instructions in the notebook) for easy copying into a Word table

2. As always, save your .ipynb file after running everything and include this with your submission

2. Research memo, max 3 pages (40 marks)

Written for a well-informed, non-econometrician audience, the memo should be concise, focused, and free of econometric jargon. You should write in complete sentences and paragraphs, not bullet points.

Please view the memo template for the recommended structure!

The memo must include exactly one figure and one regression table:

The figure should illustrate your most important finding (e.g., the relationship between a key predictor and medical costs).

The table should report the results from 3 different regression models. For each, it should report the estimated coefficients, the appropriate standard errors (i.e. robust if there is heteroskedasticity), the number of observations, and R² values. Format the table clearly and report an appropriate number of decimal places.

1. The first column should report the results from the baseline model.

2. The second column should report results from Model 2.

3. The last column of the table should report the results from your preferred model.

Memo marking criteria

Technical accuracy (15 marks): Correct interpretation of coefficients, standard errors and confidence intervals, and model fit statistics

Business communication (10 marks): Presents findings in clear, jargon-free language that a non-econometrician can understand, while maintaining a professional tone and style.

Critical analysis (5 marks): Careful discussion of limitations and realistic recommendations

Professional presentation (10 marks): The memo includes one clear figure with appropriate labels and one well-formatted regression table, uses consistent and appropriate formatting (e.g., not too many significant digits), and respects the page limit. This component also covers compliance with the assessment's genAI policy.