
ARE 106 Summer Session II
Homework 3
This homework will be due on August 29th at 2pm
SSID: 916184515
Please put your name and SSID in the corresponding cells above.
The homework is worth 13.5 points.
For each of the following questions, show as many of your steps as you can (without going overboard). If you
end up getting the wrong answer but we can spot where you made a mistake in the algebra, partial credit will be
more readily given. If you only put the final answer, you will be marked either right or wrong.
Answer questions in the correct cell. For problems where you have to input math, make sure the cell is a
markdown cell (it won't have an In: [] on the left) and make sure you run the cell by either pressing
Ctrl + Enter or going to Cell -> Run Cell. Alternatively, write all your answers and then go to Cell ->
Run All Cells after you're done.
Please ignore cells that read \pagebreak. They are there so that your document converts to a PDF that
can be graded. Only write your answers where it is specified.
When you are finished, export your homework to a PDF by going to File -> Download as -> PDF.
Exercise 1: Single Regression
Please don't forget to comment your code. Failure to do so will result in a loss of points.
Also remember that all code that is required here (unless otherwise stated) can be found in the lecture Jupyter
Notebooks or the coding notebooks from class.
Here are three models for the median starting salary of law school graduates in 1985.
Each observation represents a school.
\pagebreak2019/8/24 HW3
localhost:8888/notebooks/HW3.ipynb 2/8
The variables in the dataset are:
| | Variable | Description |
|---------|---------------|---------------|
| 1. | rank | law school ranking |
| 2. | salary | median starting salary |
| 3. | cost | law school cost |
| 4. | LSAT | median LSAT score |
| 5. | GPA | median college GPA |
| 6. | libvol | no. volumes in lib., 1000s |
| 7. | faculty | no. of faculty |
| 8. | age | age of law sch., years |
| 9. | clsize | size of entering class |
| 10. | north | =1 if law sch in north |
| 11. | south | =1 if law sch in south |
| 12. | east | =1 if law sch in east |
| 13. | west | =1 if law sch in west |
| 14. | studfac | student-faculty ratio |
| 15. | top10 | =1 if ranked in top 10 |
| 16. | r11_25 | =1 if ranked 11-25 |
| 17. | r26_40 | =1 if ranked 26-40 |
| 18. | r41_60 | =1 if ranked 41-60 |
a. In the code cell below, write the appropriate imports you will need for this question (we will need pandas ,
numpy and statsmodels.formula.api ). You can do an abbreviated import if you wish (but the standard for
pandas is pd , statsmodels.formula.api is smf , and numpy is np ). Afterwards, load in the data from
here:
https://raw.githubusercontent.com/lordflaron/ARE106data/master/lawsch85.csv
This can be done using the read_csv() function. Name this dataset raw_df . After loading in the data, show
the first 10 observations in the output.
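As a rough sketch of the loading step, the snippet below reads a CSV and shows its first rows. To stay runnable offline it reads a small made-up CSV from a string rather than the course URL; the values and the name `csv_text` are invented for illustration, and passing the URL above to `read_csv()` follows the same pattern.

```python
import io

import pandas as pd

# Hypothetical stand-in for the course CSV (values invented for illustration).
csv_text = """rank,salary,LSAT,GPA
1,70000,170,3.8
2,68000,168,3.7
3,,165,3.6
"""

# read_csv() accepts a URL, a file path, or a file-like object.
raw_df = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (fewer if the data are shorter).
first_rows = raw_df.head(10)
print(first_rows)
```

The empty salary field in the third row is read as a missing value (NaN), which matters later when missing observations are dropped.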
In [1]:
b. Use the describe() method on raw_df to show a table of summary statistics for each variable in the
dataset. How many observations does raw_df have? Write this in a print statement. (Hint: This is in the "count"
row of the summary table.)
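A small sketch of how the "count" row behaves, using made-up data (the frame `toy_df` and its values are assumptions for illustration). Note that `describe()` counts non-missing values per column, so the count can differ from the number of rows when a column has missing entries.

```python
import numpy as np
import pandas as pd

# Made-up data; one salary is missing on purpose.
toy_df = pd.DataFrame({
    "salary": [70000.0, 68000.0, np.nan, 52000.0],
    "LSAT": [170, 168, 165, 160],
})

summary = toy_df.describe()  # one column of statistics per numeric variable
n_rows = len(toy_df)         # total number of rows, missing or not
n_salary = int(summary.loc["count", "salary"])  # non-missing salary values only

print(summary)
print(f"rows: {n_rows}, non-missing salary values: {n_salary}")
```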
c. Since we'll need a log-transformed version of salary for all our models, use assign() to create a new
variable which is the log of salary. Name this new variable log_salary .
Hints:
Remember that assign is not an inplace operation!
Remember to use a lambda function in this case. To log a variable, you can use np.log()
Remember the syntax for assign() :
my_df.assign(new_variable = expression)
\pagebreak
## a. Put your answer in this cell.
\pagebreak
## b. Put your answer in this cell.
\pagebreak
After this we now need to also drop any observations that are missing. This isn't actually how econometricians
deal with missing data, but this is good enough for us for now.
You can do this by chaining the dropna() method after the assign() method.
Warning: Do not do dropna BEFORE assign
The end result should look something like this:
df = raw_df.assign(log_salary= expression).dropna()
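The chaining pattern above can be sketched on a tiny made-up frame (the name `raw_toy` and its values are assumptions for illustration). It shows two things: `assign()` leaves the original frame untouched, and `dropna()` removes any row with a missing value in any column.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing salary and one missing LSAT score.
raw_toy = pd.DataFrame({
    "salary": [70000.0, np.nan, 52000.0],
    "LSAT": [170.0, 165.0, np.nan],
    "GPA": [3.8, 3.6, 3.4],
})

# assign() returns a new DataFrame; raw_toy itself is unchanged.
# The lambda receives the DataFrame that assign() is called on.
# dropna() then drops every row that has a missing value in any column.
toy = raw_toy.assign(log_salary=lambda d: np.log(d["salary"])).dropna()

print(toy)
```

Only the first row survives here, because each of the other two rows has at least one missing value.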
In [3]:
d. Before estimating the model, explain how to interpret $\beta_1$ in Model 1.
Please write your answer for d here. If you need to use more than one line, you may do so.
e. Before estimating the model, explain how to interpret $\beta_1$ in Model 2.
Please write your answer for e here. If you need to use more than one line, you may do so.
f. Before estimating the model, do you expect $\beta_1$ and $\beta_2$ to be positive or negative in Model 2? Explain. (Hint:
I'm not asking for any rigorous mathematical way to answer this question. Just use your economic intuition and
reasoning skills to write an argument.)
Please write your answer for f here. If you need to use more than one line, you may do so.
g. Estimate Model 1. Show the regression output.
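The general pattern for estimating a single-regressor model with the formula API looks roughly like this. The data are simulated, and the names `y`, `x` and `sim` are placeholders rather than the homework variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with a known slope of 0.5 (for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
sim = pd.DataFrame({"x": x, "y": 1.0 + 0.5 * x + rng.normal(scale=0.1, size=200)})

# "y ~ x" regresses y on x; an intercept is included by default.
results = smf.ols("y ~ x", data=sim).fit()
print(results.summary())  # full regression table

slope = results.params["x"]  # estimated coefficient on x
r2 = results.rsquared        # R-squared (not the adjusted one)
```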
In [4]:
h. What is the effect of a one unit increase in LSAT score on the log of median salary?
\pagebreak
## c. Put your answer in this cell.
\pagebreak
\pagebreak
## g. Put your answer in this cell.
\pagebreak
Please write your answer for h here. If you need to use more than one line, you may do so.
i. What does the $R^2$ measure in the regression? What is the $R^2$ in this case? (Not the adjusted $R^2$.)
Please write your answer for i here. If you need to use more than one line, you may do so.
Exercise 2: Multiple Regression
This is a continuation of what we were doing in Exercise 1.
For this exercise, observe the expression for $\hat{\beta}_1$ when there are two regressors in the equation:
Hint: Notice that each of these terms in the equation look similar to either covariances or variances (in fact if
you multiply the denominator and numerator by then they are in fact variances and covariances without
changing the value of the coefficient (since is 1).
Also notice that the covariance is like an un-normalized correlation coefficient. So if you calculate the
correlation between two variables, you won't know the covariance between the two, but you'll know the direction
and strength of their relationship.
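The expression itself did not survive in this copy of the notebook. As a sketch, assuming the usual textbook form for the first slope with two regressors, written in deviations from sample means,

$$\hat{\beta}_1 = \frac{\sum x_{2i}^2 \sum x_{1i} y_i - \sum x_{1i} x_{2i} \sum x_{2i} y_i}{\sum x_{1i}^2 \sum x_{2i}^2 - \left(\sum x_{1i} x_{2i}\right)^2},$$

the check below computes it on simulated data and compares it with a generic least-squares solver. All variable names and values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)  # correlated with x1 on purpose
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(size=n)

# Deviations from sample means.
d1, d2, dy = x1 - x1.mean(), x2 - x2.mean(), y - y.mean()

# Sums of squares and cross-products (variance- and covariance-like terms
# before dividing by the sample size).
s11, s22, s12 = d1 @ d1, d2 @ d2, d1 @ d2
s1y, s2y = d1 @ dy, d2 @ dy

beta1_formula = (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)

# Cross-check against a generic least-squares fit with an intercept.
X = np.column_stack([np.ones(n), x1, x2])
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta1_formula, beta_lstsq[1])
```

The term `s12` is what links the two regressors: when it is zero, the formula collapses to `s1y / s11`, the single-regressor slope.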
a. Estimate Model 2. Show the regression output.
In [5]:
b. Calculate the correlations between log_salary, LSAT and GPA.
Use the slicing notation to first make a subset of the data with only log_salary, LSAT and GPA.
Then use the corr() method to get the correlation for those variables, i.e. it will look something like this:
df[['log_salary', 'GPA', 'LSAT']].corr()
This will give a matrix where you can see correlation between variables. (Note: correlation of a variable with
itself is always 1).
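To illustrate the hint that a correlation is a normalized covariance, here is a short sketch on simulated data (the columns `a` and `b` are made up). Dividing the covariance by the product of the two standard deviations reproduces the correlation, so the two always share a sign.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
a = rng.normal(size=300)
toy = pd.DataFrame({"a": a, "b": 2.0 * a + rng.normal(size=300)})

cov_ab = toy["a"].cov(toy["b"])    # sample covariance
corr_ab = toy["a"].corr(toy["b"])  # Pearson correlation

# Normalizing the covariance by both standard deviations gives the correlation.
rescaled = cov_ab / (toy["a"].std() * toy["b"].std())

corr_matrix = toy[["a", "b"]].corr()  # full correlation matrix
print(corr_matrix)
```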
\pagebreak
\pagebreak
## a. Put your answer in this cell.
\pagebreak
In [7]:
c. Using your answer from (b) and the expression for $\hat{\beta}_1$ above, answer this question:
Why is $\hat{\beta}_1$ in Model 2 different from $\hat{\beta}_1$ in Model 1?
Please write your answer for c here. If you need to use more than one line, you may do so.
d. Why is the $R^2$ in Model 2 higher than in Model 1? (Not the adjusted $R^2$.)
Please write your answer for d here. If you need to use more than one line, you may do so.
e. Estimate Model 3. Show the regression output.
Hint: One of the extra regressors in Model 3 is log-transformed. Instead of doing another assign() call, run
this regression by explicitly logging the variable in the patsy formula. Use np.log() to do this.
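A sketch of logging a regressor inside the formula, on simulated data (the names `y`, `z` and `sim` are placeholders, not the Model 3 variables). It also fits the same model with a pre-computed logged column to show that the two routes give the same coefficient.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
z = rng.uniform(1.0, 100.0, size=300)
sim = pd.DataFrame({"z": z, "y": 2.0 + 0.8 * np.log(z) + rng.normal(scale=0.1, size=300)})

# patsy evaluates np.log(z) while building the design matrix,
# so no extra column has to be created beforehand.
in_formula = smf.ols("y ~ np.log(z)", data=sim).fit()

# Equivalent route: create the logged column first with assign().
logged = sim.assign(log_z=lambda d: np.log(d["z"]))
pre_logged = smf.ols("y ~ log_z", data=logged).fit()

print(in_formula.params)
```

With the in-formula route the coefficient is labeled `np.log(z)` in the output, which is the key to use when reading it from `params`.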
In [8]:
f. Suppose School A and School B have the same values for all the variables on the right hand side in Model 3,
except School A is ranked 10 places higher than School B. What is the predicted difference in log median salary
between the two schools?
This question can be answered by simply printing out the math you did in a print statement using an f-string .
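For the mechanics of showing arithmetic in an f-string, a minimal sketch follows. The numbers are hypothetical and unrelated to the regression output; only the formatting pattern is the point.

```python
# Hypothetical values, chosen only to illustrate the formatting.
coef_per_unit = 0.25
change_in_x = 4

# Predicted change = coefficient times the change in the regressor.
predicted_change = coef_per_unit * change_in_x

# Expressions inside {} are evaluated; :.3f rounds to three decimals.
message = f"{coef_per_unit} * {change_in_x} = {predicted_change:.3f}"
print(message)
```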
In [9]:
Exercise 3: Multicollinearity
a. Re-estimate Model 1, except add north, south, east, and west as the additional right hand side variables.
## b. Put your answer in this cell.
\pagebreak
\pagebreak
## e. Put your answer in this cell.
\pagebreak
## f. Put your answer in this cell.
\pagebreak
In [12]:
b. What is wrong with this regression? What happens when you estimate it? How could you fix this problem?
Hint: Look at the warnings underneath the regression.
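One way to check a design matrix for exact linear dependence is to compare its rank with its number of columns. The sketch below does this on made-up region labels, assuming every row belongs to exactly one of four regions; whether that holds in the homework data is something to verify rather than take from this example.

```python
import numpy as np

# Made-up region labels; each row belongs to exactly one region (an assumption).
region = np.array([0, 1, 2, 3, 0, 1, 2, 3])
dummies = np.eye(4)[region]            # one 0/1 column per region
intercept = np.ones((len(region), 1))

X_all = np.hstack([intercept, dummies])          # intercept + all four dummies
X_drop = np.hstack([intercept, dummies[:, 1:]])  # intercept + three dummies

# A rank below the number of columns means some column is an exact
# linear combination of the others.
rank_all = np.linalg.matrix_rank(X_all)
rank_drop = np.linalg.matrix_rank(X_drop)

print(X_all.shape[1], rank_all)    # columns vs. rank with all four dummies
print(X_drop.shape[1], rank_drop)  # columns vs. rank with one dummy left out
```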
Please write your answer for b here. If you need to use more than one line, you may do so.
Exercise 4: Auxiliary Regression
Consider the following two regressions:
a. Estimate . This is a two-step process. First, you need to estimate the first regression model and save the
errors. Then, you regress on those errors ( ). Compare your estimate of to the estimate you
found from Model 2. Explain the similarity or difference.
In order to do this, you need to save the errors (also called residuals) after you run the first stage. In order to do
this, after fitting the first stage, the results variable will have an attribute resid . So to call the residuals all
you need to do is type this: results.resid .
You can then run the second stage in one of two ways:
1. assign a new variable to your data, called "residuals", and run a regression with it like any other
variable, or
2. directly call results.resid in your second stage's patsy formula, i.e., 'log_salary ~ results.resid'
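The two regressions referred to above are not shown in this copy, so the sketch below assumes the usual partialling-out setup: regress one regressor on the other, keep the residuals, then regress the outcome on those residuals. It uses simulated data and plain NumPy; the names `x1`, `x2`, `y` and the helper `ols` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)  # x1 is correlated with x2
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)


def ols(X, y):
    """Return least-squares coefficients for y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


const = np.ones(n)

# Stage 1: regress x1 on a constant and x2, keep the residuals.
Z = np.column_stack([const, x2])
resid = x1 - Z @ ols(Z, x1)

# Stage 2: regress y on a constant and the stage-1 residuals.
two_step = ols(np.column_stack([const, resid]), y)[1]

# Benchmark: the coefficient on x1 in the full multiple regression.
full = ols(np.column_stack([const, x1, x2]), y)[1]

print(two_step, full)
```

With statsmodels, `results.resid` from the first stage plays the role of `resid` here.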
b. What do you notice from the coefficient on this regression, versus the one in Model 1?
## a. Put your answer in this cell.
\pagebreak
\pagebreak
\pagebreak
## a. Put your answer in this cell.
\pagebreak
Please write your answer for b here. If you need to use more than one line, you may do so.
Exercise 5: Back to
Suppose that we have an estimated regression model , where are estimated OLS
coefficients. Let , so that:
Let's look at the next step of solving this problem in order to finally get at solving a mystery we've had during the
class.
If we wanted to solve for the , we would use the fact that a way to understand the variability in is to look at
its variance. And we already know that:
Up until now, we've just assumed it to be true that was 0 and it allowed us to finish the proof. But all
along, we've been implicitly assuming a Gauss-Markov assumption in order to make that claim.
Which of the Gauss-Markov assumptions do we need in order to say that ?
Hint: Don't forget that you can express the covariance in terms of expectations.
Hint: Try plugging in into this expression and seeing what you end up with.
Hint: Don't forget that
Please write your answer for exercise 5 here. If you need to use more than one line, you may do so.
Exercise 6: Data Types
Let's say that we have a population model:
The subscripts for the variables have been purposely omitted. For each part, rewrite the model so that it
corresponds to each data type and explain why you wrote it that way.
a. Cross-section
b. Time Series
c. Panel
\pagebreak
\pagebreak
\pagebreak
Please write your answer for exercise 6 here. If you need to use more than one line, you may do so.
\pagebreak