EC 508, Econometrics Jean-Jacques Forneron
Boston University
Problem Set 2: Testing Hypotheses with OLS
due Monday February 22, 2021
Instructions: Submissions are individual. R code must be readable, commented, and attached
at the end of your problem set. Plots, tables, and other outputs should be given in the answers
or at the end of the problem set.
Problem 1: Suits
The data set in lawsch85.dta contains information on the 1985 cohort of the top 156 law schools in
the US. Variables in the dataset include rank (law school ranking), salary (median starting salary),
and cost (law school cost).
i. Remove the NA values using the command:
data = subset(data,!is.na(salary) & !is.na(cost))
where data is the name of the dataset you’ve loaded into R. Compute the average starting
salary across law schools in the sample. Do you think it coincides with the average starting
salary across law students? [1]
ii. Regress starting salaries on the law school’s ranking:
$\text{salary}_i = \beta_0 + \beta_1 \times \text{rank}_i + u_i$
Compute standard errors and a 95% confidence interval for $\beta_1$. Report your results.
iii. What is the expected difference in starting salary between the 20th-ranked and the
40th-ranked law school? Construct a 95% confidence interval for the difference. Report your
results. [2]
iv. Now regress the cost of attending law school on the school’s ranking:
$\text{cost}_i = \beta_0 + \beta_1 \times \text{rank}_i + u_i$
Compute standard errors and a 95% confidence interval for $\beta_1$. Report your results.
[1] Hint: think about the size of different schools and the law of iterated expectations.
[2] Hint: the standard error for $2\hat{\beta}_1$ is $2\,se(\hat{\beta}_1)$. More generally, for any number $\Delta$, the standard error for $\Delta\hat{\beta}_1$ is
$|\Delta|\,se(\hat{\beta}_1)$; standard errors cannot be negative.
v. What is the expected difference in cost between the 20th-ranked and the 40th-ranked
law school? Construct a 95% confidence interval for the difference. Report your results.
vi. Given the results in ii-iii and iv-v, discuss the relative benefits and costs of attending a
more prestigious program.
vii. Construct a plot with rank on the x-axis and cost on the y-axis. Do you believe Least-Squares
Assumptions (LSA) 1-3 are reasonable assumptions in this setting? Plot rank against salary
in the same manner and comment on LSA 1-3.
viii. Construct a plot with rank on the x-axis and log(salary) on the y-axis. [3] Comment on LSA
1-3.
ix. Repeat ii. but this time regressing log(salary) on rank:
$\log(\text{salary}_i) = \beta_0 + \beta_1 \times \text{rank}_i + u_i$,
and compute standard errors and a 95% confidence interval for $\beta_1$.
Remark: This is still a linear model, as we saw in class, so everything we have seen so far
applies to this regression. The only difference is in the interpretation of $\beta_1$. When x is a
continuous regressor:
$\beta_1 = \frac{d \log(y)}{dx} = \frac{dy/y}{dx}$
because $d \log(y) = dy/y$. This means that $100 \times \beta_1$ is (roughly) the percentage increase in y
when x changes by one unit. Economists often look at log(salary) instead of salary to make
statements in terms of percentage increases/decreases. Here x is discrete, so $100 \times \beta_1$ is just
the approximate percent change in salary when we change rank by one unit.
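The confidence-interval steps in parts ii-v and the log specification in part ix can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not a solution. The data frame sim is simulated, and its coefficients are invented rather than taken from lawsch85.dta. The names hc1_vcov and ci_scaled_slope are hypothetical helpers. The robust variance is the HC1 form written out by hand; whether the course expects HC1 or another vcovHC type is an assumption.

```r
# Simulated stand-in for the law school data. Every number here is made up
# for illustration and is NOT taken from lawsch85.dta.
set.seed(1)
n       <- 150
rank    <- sample(1:175, n)                        # hypothetical rankings
lsalary <- 11 - 0.004 * rank + rnorm(n, sd = 0.1)  # invented log-salary model
sim     <- data.frame(rank = rank, salary = exp(lsalary), lsalary = lsalary)

# Heteroskedasticity-robust (HC1) covariance matrix, written out in base R.
hc1_vcov <- function(fit) {
  X <- model.matrix(fit)
  e <- residuals(fit)
  n <- nrow(X)
  k <- ncol(X)
  bread <- solve(crossprod(X))   # (X'X)^{-1}
  meat  <- crossprod(X * e)      # sum over i of e_i^2 * x_i x_i'
  bread %*% meat %*% bread * n / (n - k)
}

# Large-sample CI for delta * beta1 in a single-regressor model:
# the estimate is delta * b1 and the standard error is |delta| * se(b1).
ci_scaled_slope <- function(fit, delta = 1, level = 0.95) {
  b1   <- unname(coef(fit)[2])
  se1  <- unname(sqrt(diag(hc1_vcov(fit)))[2])
  z    <- qnorm(1 - (1 - level) / 2)
  est  <- delta * b1
  half <- z * abs(delta) * se1
  c(estimate = est, lower = est - half, upper = est + half)
}

fit_level <- lm(salary ~ rank, data = sim)
fit_log   <- lm(lsalary ~ rank, data = sim)

ci_beta1 <- ci_scaled_slope(fit_level)               # CI for beta1 itself
ci_diff  <- ci_scaled_slope(fit_level, delta = -20)  # rank 20 minus rank 40

# Approximate percent change in salary per one-unit change in rank.
pct_per_rank <- 100 * coef(fit_log)[["rank"]]
```

The negative delta illustrates the hint in footnote 2: the point estimate changes sign, but the interval half-width scales with the absolute value of delta. With the real data, sim would be replaced by the cleaned data frame from part i.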
Problem 2: Real Estate
The data set hprice1.dta contains observations on the selling price, in thousands of dollars, and
features of houses sold in a given area, including bdrms, the number of bedrooms, and sqrft, the
size of the house in square feet. For more details on the variables in the dataset, see hprice1.des.
i. Estimate the following regression model:
$\text{price}_i = \beta_0 + \beta_1 \text{sqrft}_i + \beta_2 \text{bdrms}_i + u_i$,
and report the estimated coefficients and standard errors.
ii. What is the estimated increase in price for a house with one more bedroom, holding square
footage constant? Compare this number to the average selling price and discuss the magnitude
of this increase.
[3] log(salary) is already present in the dataset as lsalary, but you could also construct it using data$lsalary =
log(data$salary).
iii. Using a 95% confidence interval, determine whether this increase is statistically significant.
Explain why this result is, or is not, intuitive.
iv. What is the estimated increase in price for a house with an additional bedroom that is 140
square feet in size? Compare this to your answer in part (ii).
v. Is the effect of the size of the house alone statistically significant? Explain why this result is, or
is not, intuitive.
vi. The first house in the sample has 2,438 square feet and 4 bedrooms. Find the predicted
selling price for this house from the OLS regression line.
vii. The actual selling price of the first house in the sample was $300,000 (so price is 300 in the
data). Find the residual for this house. Does it suggest that the buyer underpaid or overpaid
for the house?
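The calculations in parts ii, iv, vi and vii can be sketched as follows. This is an illustration under stated assumptions only: the data frame houses is simulated with invented coefficients and does not reproduce hprice1.dta, so none of the numbers it yields are answers to the problem.

```r
# Simulated stand-in for the housing data. All values are invented and
# do NOT come from hprice1.dta.
set.seed(2)
n      <- 100
sqrft  <- round(runif(n, 1000, 4000))
bdrms  <- sample(2:6, n, replace = TRUE)
price  <- 20 + 0.12 * sqrft + 10 * bdrms + rnorm(n, sd = 40)  # in $1000s
houses <- data.frame(price, sqrft, bdrms)

fit <- lm(price ~ sqrft + bdrms, data = houses)
b   <- coef(fit)

# One more bedroom, holding square footage constant.
effect_bdrm_only <- b[["bdrms"]]

# One more bedroom that also adds 140 square feet to the house.
effect_bdrm_140sqft <- b[["bdrms"]] + 140 * b[["sqrft"]]

# Predicted price and residual (actual minus predicted) for the first house.
pred_first  <- predict(fit, newdata = houses[1, ])
resid_first <- houses$price[1] - pred_first
```

A positive residual means the actual price lies above the fitted regression line, and a negative residual means it lies below.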
Problem 3: Omitted Variables
Consider the true population model:
$y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + u_i$ (1)
where $u_i$ has mean zero and is independent of both $x_i$ and $z_i$. Some notation: $var(x_i) = \sigma_x^2$, $var(z_i) = \sigma_z^2$ and $cov(x_i, z_i) = \sigma_{xz}$. $(y_i, x_i, z_i)$ are iid and have finite fourth moments. Assume
$x_i$ and $z_i$ have mean zero.
i. Suppose an economist regresses $y_i$ on $x_i$ only, omitting $z_i$. Should she/he be concerned about
the validity of the Least-Squares Assumptions? Explain.
ii. He/she decides to proceed regardless of your previous answer and estimates the following
model:
$y_i = \beta_0 + \beta_1 x_i + e_i$, (2)
with $e_i$ as the error term in the regression. Note that $e_i = \beta_2 z_i + u_i$. Write down
the OLS formula for $\hat{\beta}_1$ with only $x_i$ as a regressor. Substitute $y_i$ in this formula using (2).
Express $\hat{\beta}_1$ as the sum of $\beta_1$ and another term.
iii. Express the probability limit of $\hat{\beta}_1 - \beta_1$ using the law of large numbers. The limit depends
on the following terms: $\sigma_x^2$, $\sigma_{xz}$ and $\beta_2$. This is the so-called omitted variable bias.
iv. Suppose the economist finds a positive effect: $\hat{\beta}_1 > 0$. You know that $\sigma_{xz} > 0$ and $\beta_2 < 0$.
What can you tell him/her about the true $\beta_1$ using this information?
v. You will now conduct a numerical experiment to see the effect of omitted variable bias on
the coefficients. To fix the random numbers, so that everyone gets identical results, type
set.seed(123) at the beginning of your R code. [4] Then, using rnorm and setting $n = 1{,}000$,
draw $u_i \sim N(0, 1)$, $x_i \sim N(0, 1)$ and compute $z_i = x_i + v_i$, $v_i \sim N(0, 1)$ for $i = 1, \dots, n$.
This implies that $\sigma_x^2 = 1$ and $\sigma_{xz} = 1$. Now generate:
$y_i = 0 + x_i - z_i + u_i$.
With the lm function, compute the OLS estimates when regressing $y_i$ only on $x_i$. Use coeftest
to test $H_0: \beta_1 = 0$ using the single regressor specification. [5]
vi. Explain your result above in light of your earlier findings. To do this, you should compute
the omitted variable bias using the formula you derived by hand in iii.
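The numerical experiment in part v can be sketched as follows. This is a hedged sketch under stated assumptions, not the official solution. The draw order (u, then x, then v) is an assumption, and a different order would give different numbers under the same seed. The robust standard error is computed by hand in the HC1 form so that the block runs in base R, and the choice of HC1 is an assumption. The helper name hc1_vcov is hypothetical. The commented line shows the packaged route that the problem text points to.

```r
set.seed(123)
n <- 1000
u <- rnorm(n)          # regression error
x <- rnorm(n)          # included regressor
v <- rnorm(n)          # noise used to build z
z <- x + v             # omitted regressor, correlated with x
y <- 0 + x - z + u     # true model: beta1 = 1, beta2 = -1

short <- lm(y ~ x)      # omits z
long  <- lm(y ~ x + z)  # correctly specified, for comparison

# Heteroskedasticity-robust (HC1) covariance matrix in base R.
hc1_vcov <- function(fit) {
  X <- model.matrix(fit)
  e <- residuals(fit)
  bread <- solve(crossprod(X))
  meat  <- crossprod(X * e)
  bread %*% meat %*% bread * nrow(X) / (nrow(X) - ncol(X))
}

# Robust t-test of H0: beta1 = 0 in the short regression.
b_short <- coef(short)[["x"]]
se_x    <- unname(sqrt(diag(hc1_vcov(short)))[2])
t_x     <- b_short / se_x
p_x     <- 2 * pnorm(-abs(t_x))
# Packaged equivalent, assuming lmtest and sandwich are installed:
# lmtest::coeftest(short, vcov. = sandwich::vcovHC(short, type = "HC1"))

# Sample analogue of the omitted variable bias term beta2 * sigma_xz / sigma_x^2,
# using the true beta2 = -1 from the data-generating process.
bias_hat <- -1 * cov(x, z) / var(x)
```

In any sample, the short-regression slope equals the long-regression slope on x plus the long-regression slope on z times cov(x, z)/var(x), which can be used to check the formula derived in part iii against the simulated data.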
[4] Every time you run set.seed(123) in R, it resets the random numbers to the same sequence. There is nothing
special about 123; set.seed(666) would set another deterministic sequence.
[5] Do not forget to use vcovHC.