Problem Set 1
Econometrics Review
EC 421: Introduction to Econometrics
Due before noon (11:59am) on July 1st, 2020 (on Canvas)
To make grading slightly easier, please include all of your R code at the end of your Word doc with your
written answers.
OBJECTIVE This problem set has three purposes: (1) reinforce the econometrics topics we reviewed in
class; (2) build your R toolset; (3) build your intuition for newer topics like heteroskedasticity and
consistency.
Problem 0: Inference
For this question I used data from a survey conducted by the Department of Education in 1980. A
description of the data can be found here. We want to test the effect of an extra year of education on
wages. For this question, we have $n = 4739$ observations and $k = 2$ parameters. This means that we have
$n - k = 4737$ degrees of freedom. For a significance level of $\alpha = 5\%$, this gives us a t-critical
value of $t_{0.025,\,4737} = 1.96$.
You can use this information throughout the rest of the problem.
We are interested in the following regression:
$$wage_i = \beta_0 + \beta_1 educ_i + \epsilon_i$$
where $wage_i$ is the individual hourly wage, and $educ_i$ is the individual years of education. This regression
yields the following parameter estimates:
[estimated equation with standard errors not shown]
where standard errors for each parameter are given below in parentheses.
0a. Conduct the appropriate statistical test to determine whether or not education has a statistically
significant impact on wages. Write out all steps and be clear with your conclusion.
0b. Now write out the formula for the standard error of $\hat{\beta}_1$. Is the standard error increasing or
decreasing in the sample size $n$? Next write out the formula for the test statistic you calculated in 0a. Is
the test statistic increasing or decreasing in the sample size? Lastly, use your answers to this question to
determine whether the probability that you reject the null hypothesis is increasing or decreasing in the
sample size. Hint: You don't have to write out this probability explicitly. Just explain the intuition behind
what the test statistic is telling you and how this helps you answer the question.
0c. Use the information provided combined with the regression output to construct a 95% confidence
interval for the parameter $\beta_1$. Write out the steps you took to get to the lower and upper bounds. Provide a
careful interpretation of what this confidence interval tells you.
0d. Now suppose we think we omitted an important variable: gender. State the two conditions this variable
must meet (in the context of this example) for it to cause omitted variables bias. Would increasing the
sample size (working with "big data") alleviate the issues caused by omitting gender from this regression?
0e. Luckily, our data contains information on whether individuals in our data are male or female. We
now include two indicators in our regression, one for male and one for female, and drop the intercept.
We have the following coefficient estimates and standard errors.
[coefficient estimates and standard errors not shown]
You don't need to calculate the next test (I have not given you enough information to do so), but write out
how you would use this model to statistically test whether wages for males and females are different from
each other (the null hypothesis being that they are equal). Write out each step.
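The mechanics behind 0a and 0c can be sketched in a few lines of R. This is only an illustration: `b1_hat` and `se_b1` are hypothetical placeholder values, not the estimates from the regression output, while `n`, `k`, and the 5% level come from the problem text.

```r
# Values stated in the problem
n     <- 4739          # observations
k     <- 2             # estimated parameters
alpha <- 0.05          # significance level
df    <- n - k         # degrees of freedom

# Two-sided critical value from the t distribution (about 1.96)
t_crit <- qt(1 - alpha / 2, df)

# HYPOTHETICAL estimate and standard error, for illustration only
b1_hat <- 0.50
se_b1  <- 0.10

# t statistic for H0: beta_1 = 0 against H1: beta_1 != 0
t_stat      <- (b1_hat - 0) / se_b1
reject_null <- abs(t_stat) > t_crit

# 95% confidence interval: estimate +/- critical value * standard error
ci <- b1_hat + c(-1, 1) * t_crit * se_b1

print(c(t_crit = t_crit, t_stat = t_stat))
print(reject_null)
print(ci)
```

Swapping in the actual estimate and standard error from the regression output gives the test and interval the problem asks for; the written interpretation is still up to you.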
Problem 1: Bias and variance
1a. Throughout this course, we will use the OLS estimator $\hat{\beta}$ to estimate $\beta$. Explain what it means
for $\hat{\beta}$ to be biased for $\beta$.
Figure 1 [image not reproduced]
Note: This figure shows the distributions of three estimators (A, B, and C) that each estimate the unknown
parameter. The note also lists E[A], E[B], and E[C]; their values are not reproduced here.
1b. Which of the estimators in Figure 1 (above) are unbiased? Hint: There may be more than one.
1c. Which of the estimators in Figure 1 (above) has the minimum variance?
1d. Which of the estimators in Figure 1 (above) is the best (minimum variance) unbiased estimator?
1e. Suppose we want to estimate the effect of advertising on sales. Explain what bias would mean in this
context.
1f. What does the term "standard error" mean?
1g. What does it mean for an estimator to be more efficient than another estimator? Of the unbiased
estimators, which one is efficient?
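A small simulation can help build intuition for bias, variance, and efficiency. The sketch below is an assumption-laden illustration: the three estimators (`full_mean`, `short_mean`, `shifted_mean`) and the true value `mu` are made up for this example and are not the estimators A, B, and C from Figure 1.

```r
set.seed(421)

mu     <- 2       # true parameter (hypothetical)
n      <- 50      # sample size per simulated sample
n_sims <- 5000    # number of simulated samples

# Each row holds the three estimates computed from one simulated sample
draws <- t(replicate(n_sims, {
  x <- rnorm(n, mean = mu, sd = 1)
  c(full_mean    = mean(x),         # uses all observations
    short_mean   = mean(x[1:5]),    # uses only the first 5 observations
    shifted_mean = mean(x) + 0.5)   # sample mean plus a constant
}))

# Bias = E[estimator] - true parameter, approximated by the simulation average
bias     <- colMeans(draws) - mu
variance <- apply(draws, 2, var)

print(round(bias, 3))
print(round(variance, 3))
```

Comparing the two printed vectors shows which of these made-up estimators are centered on `mu` and which one is most tightly concentrated.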
Problem 2: Getting Started with R
Problems 2 - 6 will use data from the 2018 American Community Survey, which I downloaded
from IPUMS. You can find this data on Canvas.
2a. Load packages. You will probably want to load the tidyverse and here packages. Maybe some others
as well.
2b. Load the data. The data can be found on Canvas. To accomplish this, use the read_csv() command.
2c. Check your dataset. How many observations and variables do you have? Hint: Try dim(), ncol(), and
nrow().
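The steps in 2a through 2c might look roughly like the sketch below. So that it runs on its own, it writes a tiny made-up CSV to a temporary file and reads that back; the object name `ps_df` and the path shown in the comment are hypothetical, and the real file name on Canvas may differ.

```r
library(tidyverse)

# Stand-in for the course data: write a tiny CSV to a temporary file so the
# example runs on its own. With the real file you would pass its path instead,
# e.g. read_csv(here::here("data", "ps1.csv")) -- a HYPOTHETICAL path and name.
tmp_path <- tempfile(fileext = ".csv")
write_csv(
  tibble(hh_size = c(1, 2, 4), hh_income = c(3.0, 5.5, 8.2)),
  tmp_path
)

# Load the data
ps_df <- read_csv(tmp_path)

# Rows are observations, columns are variables
print(dim(ps_df))
print(nrow(ps_df))
print(ncol(ps_df))
```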
Problem 3: Getting to know your data
3a. Plot a histogram of household income (hh_income) using ggplot2.
Remember: the hh_income variable is measured in tens of thousands (meaning a value of 3
means the household's income is $30,000)
This link provides a few good examples of how to create a histogram using ggplot2.
3b. What are the mean and median levels of household income? Based upon this answer and the previous
histogram, is household income (fairly) evenly distributed or is it skewed? Explain.
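As a rough sketch of 3a and 3b, the code below draws a histogram and compares the mean with the median. The data are simulated (a log-normal draw chosen only because it is right-skewed), so the numbers say nothing about the actual survey; `sim_df` is a hypothetical name.

```r
library(ggplot2)

set.seed(421)
# SIMULATED right-skewed incomes in $10,000s; not the ACS data
sim_df <- data.frame(hh_income = rlnorm(1000, meanlog = 1.5, sdlog = 0.8))

# Histogram of household income
p <- ggplot(sim_df, aes(x = hh_income)) +
  geom_histogram(bins = 40) +
  labs(x = "Household income ($10,000s)", y = "Number of households")
print(p)

# Compare the mean and the median of the simulated incomes
inc_mean   <- mean(sim_df$hh_income)
inc_median <- median(sim_df$hh_income)
print(c(mean = inc_mean, median = inc_median))
```

In this simulated sample the long right tail pulls the mean above the median; whether the real data behave the same way is what 3b asks you to check.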
3c. Run a regression summarizing the relationship between household income and household size.
Interpret the results of the regression -- e.g. tell me what the coefficients mean and comment on their
statistical significance.
3d. Explain why you chose the specification that you did in the previous question.
Was it linear, log-linear, log-log?
What was the outcome variable?
What was the explanatory variable?
Why did you make these choices?
Problem 4: Regression Refresher
4a. Regress average commute time (time_commuting) on household income (hh_income). Interpret the
coefficient and comment on its statistical significance.
4b. Regress the log of average commute time on household income. Interpret the coefficient and
comment on its statistical significance.
4c. Regress the log of average commute time on the log of household income. Interpret the coefficient and
comment on its statistical significance.
4d. If you had to pick one of the above specifications to show your boss at work, which one would you pick?
Why? (There is no right answer to this question; I just want you to start thinking about model specification.)
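The three specifications in 4a through 4c differ only in where the logs go. The sketch below fits each one to simulated data, which are strictly positive by construction so that the logs are defined. With the real data you would need to check for zeros before taking logs, which is an assumption this example sidesteps.

```r
set.seed(421)
n <- 500

# SIMULATED data with strictly positive values so that logs are defined
sim_df <- data.frame(hh_income = rlnorm(n, meanlog = 1.5, sdlog = 0.6))
sim_df$time_commuting <- exp(3 + 0.1 * log(sim_df$hh_income) + rnorm(n, sd = 0.3))

# Linear: a one-unit ($10,000) rise in income changes commute time by b1 units
fit_lin <- lm(time_commuting ~ hh_income, data = sim_df)

# Log-linear: a one-unit rise in income changes commute time by about 100 * b1 percent
fit_loglin <- lm(log(time_commuting) ~ hh_income, data = sim_df)

# Log-log: a 1 percent rise in income changes commute time by about b1 percent
fit_loglog <- lm(log(time_commuting) ~ log(hh_income), data = sim_df)

# Coefficients, standard errors, t statistics, and p-values
print(summary(fit_lin)$coefficients)
print(summary(fit_loglin)$coefficients)
print(summary(fit_loglog)$coefficients)
```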
Problem 5: Multiple Linear Regression
We will now add some covariates to our regression model.
5a. Regress average commute time on household income and the share of individuals in the household
who are non-white ethnicities (hh_share_nonwhite). Interpret the coefficients and comment on their
statistical significance.
Also compare your results to 4a. Has anything changed?
5b. Regress average commute time on the indicator variable for whether a household moved in the last
year (i_moved). Interpret the coefficients and comment on their statistical significance.
5c. Add the share of the household that represents a non-white ethnicity (hh_share_nonwhite) to the
regression from 5b. Note: Your outcome variable is still average household commute time, but you should
now have two explanatory variables. Interpret the coefficients and comment on their statistical significance.
5d. Did adding this second explanatory variable change the coefficient of the first variable at all? What does
that tell you? Explain your answer.
5e. One variable that we potentially omitted from our regression is an indicator for whether or not the
individual lives in an urban or rural area. Does this variable (which we don't have) meet the criteria for an
omitted variable? Specifically state both conditions it needs to meet for us to have classic omitted variables
bias. Sign the bias on hh_income that results from omitting urban/rural status.
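To see how a coefficient can move when a covariate is added or dropped (as in 5d and 5e), here is a minimal simulation. Everything in it is hypothetical: `x`, `z`, `y`, and the coefficients 2 and 3 are invented, and `z` merely plays the role of a variable that may or may not be included.

```r
set.seed(421)
n <- 10000

# SIMULATED data: z enters the equation for y and is also related to x
z <- rnorm(n)
x <- 0.5 * z + rnorm(n)
y <- 1 + 2 * x + 3 * z + rnorm(n)

fit_long  <- lm(y ~ x + z)   # includes z
fit_short <- lm(y ~ x)       # leaves z out

b_long  <- unname(coef(fit_long)["x"])
b_short <- unname(coef(fit_short)["x"])

# Textbook approximation of the gap between the two slopes:
# (coefficient on z in the long model) * cov(x, z) / var(x)
implied_gap <- 3 * cov(x, z) / var(x)

print(c(long = b_long, short = b_short, implied_gap = implied_gap))
```

The sign of `implied_gap` depends on the signs of the two ingredients in the formula, which is the same reasoning 5e asks you to apply to urban/rural status.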
Problem 6: Heteroskedasticity
6a. Suppose we are interested in the relationship between a household's housing costs and its time spent
commuting. Plot a scatter plot using ggplot2 with housing cost (cost_housing) on the y axis and commute
time (time_commuting) on the x axis. Make sure to label your axes.
This link provides an example if you need help.
6b. Based on your plot in 6a, if we regress cost_housing on time_commuting, do you think we would have an
issue with heteroskedasticity? Explain your answer.
6c. What issues can heteroskedasticity cause? (Hint: There are at least two main issues.)
6d. Time for a regression. Regress cost_housing on time_commuting and hh_income. Report your results --
interpret the coefficients and comment on their statistical significance.
Be careful with your language here.
Remember: the hh_income variable is measured in tens of thousands (meaning a value of 3
means the household's income is $30,000)
6e. Use the residuals from your regression in 6d to conduct a Breusch-Pagan test for heteroskedasticity. Do
you find significant evidence of heteroskedasticity? Justify your answer. Note: I will post an additional video
that will help you write the code for this question. There is also sample code in the slides.
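One common way to carry out a Breusch-Pagan test by hand is the LM form: regress the squared residuals on the explanatory variables and compare n times the R-squared with a chi-squared distribution. The sketch below follows that recipe on simulated data. It may differ in detail from the sample code in the slides, and `x1`, `x2`, and `y` are placeholder names.

```r
set.seed(421)
n <- 2000

# SIMULATED data whose error spread grows with x1 (heteroskedastic by design)
x1 <- runif(n, min = 1, max = 10)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + x2 + rnorm(n, sd = x1)

# Step 1: estimate the main regression and square its residuals
fit  <- lm(y ~ x1 + x2)
e_sq <- resid(fit)^2

# Step 2: regress the squared residuals on the explanatory variables
aux <- lm(e_sq ~ x1 + x2)

# Step 3: LM statistic = n * R^2 from the auxiliary regression
lm_stat <- n * summary(aux)$r.squared

# Step 4: compare with a chi-squared distribution; df = number of slopes in aux
p_value <- pchisq(lm_stat, df = 2, lower.tail = FALSE)

print(c(lm_stat = lm_stat, p_value = p_value))
```

A small p-value leads to rejecting the null hypothesis of homoskedasticity.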
6f. Now conduct a Goldfeld-Quandt test for heteroskedasticity. Do you find significant evidence of
heteroskedasticity? Here are some hints:
We are still interested in the same regression (regressing the cost of housing on commute time
and household income).
Sort the dataset on time_commuting. This can be done with the arrange() function.
Create two groups for the GQ test by using the first 8,000 and the last 8,000 observations (after
sorting on commute time). The head() and tail() functions will help here.
When you construct the GQ stat, put the larger SSE value in the numerator.
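The hints above can be strung together as in the following sketch. It runs on simulated data with the same variable names as the course data, so the resulting statistic is meaningless for the actual problem. The helper `sse()` and the object names are hypothetical, and the degrees of freedom assume three estimated parameters (intercept plus two slopes) in each group.

```r
library(dplyr)

set.seed(421)
n <- 20000

# SIMULATED stand-in for the course data (error spread grows with commute time)
sim_df <- data.frame(
  time_commuting = runif(n, min = 1, max = 10),
  hh_income      = rlnorm(n, meanlog = 1.5, sdlog = 0.6)
)
sim_df$cost_housing <- 5 + 2 * sim_df$time_commuting + sim_df$hh_income +
  rnorm(n, sd = sim_df$time_commuting)

# Sort on commute time, then take the first and last 8,000 rows
sorted_df <- arrange(sim_df, time_commuting)
n_group   <- 8000
low_df    <- head(sorted_df, n_group)
high_df   <- tail(sorted_df, n_group)

# Same regression in each group; SSE = sum of squared residuals
sse <- function(df) {
  sum(resid(lm(cost_housing ~ time_commuting + hh_income, data = df))^2)
}
sse_low  <- sse(low_df)
sse_high <- sse(high_df)

# GQ statistic with the larger SSE in the numerator
gq_stat <- max(sse_low, sse_high) / min(sse_low, sse_high)

# F test with (n_group - k) degrees of freedom in both groups, k = 3 parameters
df_group <- n_group - 3
p_value  <- pf(gq_stat, df1 = df_group, df2 = df_group, lower.tail = FALSE)

print(c(gq_stat = gq_stat, p_value = p_value))
```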
6g. Use the lm_robust() command from the estimatr package to calculate heteroskedasticity-robust standard
errors. How do these standard errors compare to the plain OLS standard errors you previously found?
Hint: lm_robust(y ~ x, data = some_df, se_type = "HC2") will calculate heteroskedasticity-robust
standard errors.
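A side-by-side comparison makes the difference between the two sets of standard errors easier to see. The sketch below uses simulated data and placeholder names (`sim_df`, `y`, `x`); only the `lm_robust()` call itself is taken from the hint.

```r
library(estimatr)

set.seed(421)
n <- 2000

# SIMULATED heteroskedastic data; y and x are placeholder names
sim_df   <- data.frame(x = runif(n, min = 1, max = 10))
sim_df$y <- 1 + 2 * sim_df$x + rnorm(n, sd = sim_df$x)

# Plain OLS, and OLS with heteroskedasticity-robust (HC2) standard errors
fit_ols    <- lm(y ~ x, data = sim_df)
fit_robust <- lm_robust(y ~ x, data = sim_df, se_type = "HC2")

# Coefficient estimates and both sets of standard errors, side by side
comparison <- cbind(
  coef_ols    = coef(fit_ols),
  coef_robust = fit_robust$coefficients,
  se_ols      = summary(fit_ols)$coefficients[, "Std. Error"],
  se_robust   = fit_robust$std.error
)
print(comparison)
```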
6h. Why did your coefficients remain the same in 6g -- even though your standard errors changed?
Problem 7: Unbiasedness and consistency
Throughout this course, we will use the OLS estimator $\hat{\beta}$ to estimate $\beta$. We will continue to discuss
situations in which the estimator (or other estimators) is (1) unbiased or (2) consistent.
7a. What is the formal (mathematical) definition of bias?
7b. Why do we care if the OLS estimator (or any estimator) is biased?
7c. What does it mean for an estimator to be consistent?
7d. True/False: Unbiasedness is a property for finite-sized samples, while consistency refers to an estimator's
behavior as the sample size approaches infinity.
7e. Which of the following two estimators would you choose? Explain your reasoning.
Estimator A is unbiased and inconsistent.
Estimator B is biased and consistent.
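The distinction between unbiasedness and consistency can be explored with a simulation over increasing sample sizes. The two estimators below (`first_obs` and `shrunk`) are hypothetical examples chosen for this sketch; they are not defined anywhere in the problem set, and `mu` is an invented true value.

```r
set.seed(421)
mu     <- 2      # true parameter (hypothetical)
n_sims <- 2000   # simulated samples per sample size

simulate_estimators <- function(n) {
  draws <- replicate(n_sims, {
    x <- rnorm(n, mean = mu, sd = 1)
    c(first_obs = x[1],               # only ever uses one observation
      shrunk    = sum(x) / (n + 1))   # divides the sum by n + 1 instead of n
  })
  c(n           = n,
    bias_first  = mean(draws["first_obs", ]) - mu,
    sd_first    = sd(draws["first_obs", ]),
    bias_shrunk = mean(draws["shrunk", ]) - mu,
    sd_shrunk   = sd(draws["shrunk", ]))
}

# One row per sample size
results <- t(sapply(c(10, 100, 1000), simulate_estimators))
print(round(results, 3))
```

Reading down the columns shows how the bias and spread of each estimator change, or fail to change, as the sample size grows.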
Description of variables and names
Variable            Description
fips                County FIPS code
hh_size             Household size (number of people)
hh_income           Household total income (in $10,000s)
cost_housing        Household's total reported cost of housing
n_vehicles          Household's number of vehicles
hh_share_nonwhite   Share of household members identifying as non-white ethnicities
i_renter            Binary indicator for whether any household members are renters
i_moved             Binary indicator for whether a household member moved in the prior year
i_foodstamp         Binary indicator for whether any household member participates in food stamps
i_smartphone        Binary indicator for whether a household member owns a smartphone
i_internet          Binary indicator for whether the household has access to the internet
time_commuting      Average time spent commuting per day by household member
In general, I've tried to stick with a naming convention. Variables that begin with i_ denote binary indicator
variables (taking on the value of 0 or 1). Variables that begin with n_ are numeric variables.