Homework #4
Due 4/22/2020, 11:59pm (but flexible)
4/14/2020
Front Matter
As with HW3, you will write your task code and answer the questions in a separate template file.
As you write your code, run your work “from the top” when you finish each chunk. This will ensure that any
errors in rendering are caught before you move on. The top right corner of each gray chunk has a button
that will run all of your code from the beginning up to that chunk, and another button that will run that
chunk. Use these buttons.
Question 1: Instrumental Variables
In the previous homework, we used 2SLS to estimate a model with an instrumental variable. We used
constructed data where we knew all the parts, including unobserved errors and parameters, and saw that
2SLS got us an estimate that was close to the true value, while naive OLS did not.
In this exercise, we will briefly use the AER package’s ivreg to analyze the same data.
Task 1.1 - Setup (4 points)
1. (2 points) Use require(...) to load the wooldridge package, the AER package, and the lmtest and
sandwich packages for robust standard errors. Do not re-install the packages (unless you are working
from a new computer). Remember, you never include install.packages(...) in your code chunks.
2. (2 points) Save the template with your name (LastFirst) in an appropriately named folder on your
drive. This is good file management and is super important to keeping your work organized. Note: this
is not something to write into your code chunks.
Questions 1.1 (2 points)
1. (2 points) Where on your computer are you saving your .Rmd file? Does EC420 have its own folder?
Task 1.2 - Loading data (0 points)
I have included the data construction in your template already. Do not change the data. It is identical to
HW3.
Questions 1.2 (6 points)
1. (2 points) How many observations does the data have?
2. (2 points) Refreshing your memory from HW3, what variable is the outcome variable? What is the
variable of interest? Which variables are endogenous?
3. (2 points) Which variables were our instruments?
4. (2 points) What was your 2SLS estimate from the final sections of HW3?
5. (1 point) What was the true parameter value for βD in HW3? This can also be inferred from the data
construction in Task 1.2.
Task 1.3 - Estimating using ivreg (15 points)
An important R skill is being able to figure out the syntax of an R function. We will use the AER package’s
ivreg function to estimate the same instrumental variables model as HW3. Use ?ivreg directly in your
console (not in your code chunk) to see the syntax for the function. This will tell you how to specify
your formula, and what other inputs the ivreg function needs. If you did not follow Task 1.1 and did not
require(...) the AER package, then you will not see anything when you type ?ivreg.
1. (3 points) The first input ivreg needs is the formula. We can write our formula and save it as an R
object (which we can then pass to the call of ivreg). To do this, simply use as.formula(y ~ x +
... | z + ...). You don’t need to put quotation marks around the formula when you code it up. It is up
to you to figure out how to specify the endogenous and instrumental variables in the formula. See the
arguments section of the ivreg help for instructions, and then look at your data to see what to put in
which place. Use the "recommended" three-part formula format from the help.
Remember that we included our exogenous variables X1, X2 in both stages of our 2SLS in HW3.
This is because our exogenous variables "instrument for themselves". That means we specify them as
instruments. Keep this in mind when writing your formula.
2. (12 points) Run the ivreg command using your formula and the P2 data.frame. Use the robust standard
errors by wrapping the command in coeftest as you did in HW3.
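The formula syntax described above can be sketched on simulated data. This is only an illustration under stated assumptions: the data frame toy and its columns (y, d, x1, z) are hypothetical stand-ins and are not the P2 data, the true coefficient on d is set to 2 by construction, and HC1 is just one possible choice of robust variance type.

```r
require(AER)       # provides ivreg
require(lmtest)    # provides coeftest
require(sandwich)  # provides vcovHC

set.seed(420)
n <- 1000
toy <- data.frame(x1 = rnorm(n), z = rnorm(n))   # hypothetical exogenous variable and instrument
u <- rnorm(n)                                    # unobserved confounder
toy$d <- 0.8 * toy$z + 0.5 * toy$x1 + u + rnorm(n)      # endogenous regressor
toy$y <- 1 + 2 * toy$d + toy$x1 + 2 * u + rnorm(n)      # true coefficient on d is 2

# Regressors go left of the bar, instruments right of it.
# The exogenous x1 appears on both sides because it instruments for itself.
toy_formula <- as.formula(y ~ d + x1 | z + x1)

iv_fit <- ivreg(toy_formula, data = toy)
iv_robust <- coeftest(iv_fit, vcov. = vcovHC(iv_fit, type = "HC1"))
print(iv_robust)

# Naive OLS for comparison; it is biased here because d is correlated with u
ols_fit <- lm(y ~ d + x1, data = toy)
print(coef(ols_fit))
```

In this toy setup the ivreg estimate on d should land near 2, while the OLS estimate is pushed upward by the confounder.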
Question 1.3 - Interpreting ivreg (15 points)
1. (4 points) What is your estimate for βD, the coefficient of interest?
2. (3 points) What is the interpretation of βD, the coefficient on D?
3. (3 points) If D is our treatment variable, what type of treatment effect does βD represent? Is it the
average treatment effect (ATE)?
4. (5 points) We had a way of establishing whether or not our model met the relevant first stage requirement
(see our notes on Instrumental Variables and 2SLS). What was the criterion (hint: it has to do with an
F-test)?
Task 1.4 - Testing (10 points)
1. (5 points) To test the relevant first stage assumption, we will use lm(...) to regress D on Z, the first
stage of our 2SLS (leaving aside the exogenous variables). This will tell us whether or not Z has an
effect on D. Run this simple regression.
2. (5 points) Naturally, ivreg has the ability to output some important tests, including one for the relevant
first stage. We can get this by using summary(ivreg(myFormula, data=P2), diagnostics = TRUE) (robust
standard errors are not necessary here as we won’t be looking at the standard errors of the coefficients).
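The two checks in this task can be sketched side by side on simulated data. The data frame toy and its columns are hypothetical and are not the P2 data; the example only shows where each F-statistic lives in the output.

```r
require(AER)

set.seed(420)
n <- 1000
toy <- data.frame(x1 = rnorm(n), z = rnorm(n))   # hypothetical exogenous variable and instrument
u <- rnorm(n)                                    # unobserved confounder
toy$d <- 0.8 * toy$z + 0.5 * toy$x1 + u + rnorm(n)
toy$y <- 1 + 2 * toy$d + toy$x1 + 2 * u + rnorm(n)

# Simple first stage: endogenous variable on the instrument only
first_stage <- lm(d ~ z, data = toy)
f_simple <- summary(first_stage)$fstatistic["value"]

# ivreg diagnostics: the "Weak instruments" row reports a first-stage F-test
iv_fit <- ivreg(y ~ d + x1 | z + x1, data = toy)
diag_table <- summary(iv_fit, diagnostics = TRUE)$diagnostics
f_weak <- diag_table["Weak instruments", "statistic"]

print(f_simple)
print(diag_table)
```

The two statistics need not match exactly, because the ivreg diagnostic also accounts for the exogenous variable in the first stage while the simple regression leaves it out.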
Question 1.4 - Testing (10 points)
1. (4 points) Using the output from Task 1.4.1 lm(...), the first-stage regression of D on Z, what can
we say about the relevant first stage assumption based on these results? If our instruments are
not relevant to the endogenous variable D, we say we have "weak instruments". Do we have weak
instruments?
2. (4 points) The output from summary(ivreg(...), diagnostics = TRUE) gives us some additional statistics,
including one which tells us about the answer to the previous question. What is the value here, and what
does this tell us about our instruments’ relevant first stage?
3. (2 points) Are the results from Task 1.4.1 using lm(...) and the diagnostic result from
summary(ivreg(...), diagnostics = TRUE) similar?
Question 2 - Difference-in-Differences
Here we will employ our Difference-in-Differences estimator and compare it to some “naive” estimates. We
will first do a little data manipulation and get some practice in R with merging and creating variables.
Task 2.1 (18 points)
1. (1 point) Install (by typing install.packages("...") directly in the console) the zoo and lubridate
packages. These are excellent packages for working with time-indexed values like dates and times. They
will help us move between date and month.
2. (1 point) Add code to your chunk for this task that uses require(...) to load the zoo and lubridate
packages. Also load the lmtest and sandwich packages as you’ll be using HC-robust errors (they are
already installed from the last homework, unless you’ve changed computers.)
3. (0 points) We will use the hw4.csv file posted on my gitHub. The line that loads the .csv directly
from gitHub is already in your template. Note that you can load data directly from a .csv on the web.
Handy!
This is solar installation data from California. Each observation is aggregated publicly-available data
about the total quarterly solar watts installed in a city (cecptrating), the sum total cost of all
installations in that city-quarter (totalcost), the average incentive received (incentiverate), the
average base cost of electricity in that city and quarter (PWRPRICE), the location (city), and our
outcome variable of interest: watts installed per owner-occupied household (WPOOH). You will be doing an
analysis similar to the Kirkpatrick and Bennear paper we read, but will not be using the exact same
data, nor will you get the exact same answer.
4. (6 points) We will use the as.yearqtr(...) function from zoo to manage the time variable, yq. This
is quite easy - just use hw4$yq = as.yearqtr(hw4$yq). The column will now be recognizable to R as
a time series.
5. (2 points) Try this: Look at head(hw4$yq) and head(hw4$yq + .75). Do you see what this does?
Don’t overwrite your yq column with this, just take a look at how R manages adding time periods.
6. (2 points) Use table(...) on the new column called yq to see how many observations (one per city) we
have in each quarter.
7. (4 points) Choose one city in the data and make a new data.frame with just that city by subsetting.
Use plot(...) to plot WPOOH on the Y-axis and yq on the X-axis.
8. (2 points) We often want to see a simple line of best fit. This is easy to add after we have plotted our
points. On the next line immediately after your plot(...) command, add an a-b line with a lm(...)
call like this:
abline(lm(WPOOH ~ yq, mySubsetData), col="gray50")
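The year-quarter steps above can be tried on a small made-up example first. The data frame toy, its cities, and the "2010 Q1" string format are assumptions for illustration; the raw format of hw4$yq may differ.

```r
require(zoo)

# Hypothetical city-quarter data; city B is missing one quarter
toy <- data.frame(
  city = c("A", "A", "A", "A", "B", "B", "B"),
  yq = c("2010 Q1", "2010 Q2", "2010 Q3", "2010 Q4",
         "2010 Q1", "2010 Q2", "2010 Q4"),
  stringsAsFactors = FALSE
)

# Convert the character column to zoo's yearqtr class
toy$yq <- as.yearqtr(toy$yq)

# A yearqtr is stored as year + (quarter - 1) / 4, so adding 0.75 moves three quarters ahead
print(head(toy$yq))
print(head(toy$yq + 0.75))
shifted <- format(toy$yq[1] + 0.75)

# Number of rows observed in each quarter
counts <- table(toy$yq)
print(counts)

# Subset to a single city
one_city <- toy[toy$city == "A", ]
```

Adding 0.25 moves forward one quarter and adding 1 moves forward one year, which is why the printed values shift the way they do.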
Question 2.1 (8 points)
• (1 point) Do all quarters have the same number of cities observed?
• (1 point) What is the mean value for total watts installed, cecptrating, in the data?
• (2 points) What is T, the number of time periods (hint: unique(...) returns a vector of all of the
unique values of a column)
• (2 points) What is N, the number of cities (hint: unique(...) returns a vector of all of the unique
values of a column)
• (2 points) What city did you choose for your plot? Does there seem to be a time trend in the outcome
variable, WPOOH?
Task 2.2 (11 points)
The treatment we are interested in, PACE, is not in our data. We have to merge in information about each
city’s treatment status by quarter. Merging data in R is not terribly hard. You just have to have the data
you’d like to merge in another R data.frame and know which field(s) are the key fields. “Key” fields are the
fields that will be matched up - here, it will be the city. The TREATMENT data has the city, the county, and
the date that treatment (PACE) started, if any.
1. (2 points) We will merge in some TREATMENT data on each city in the data. It is located on my
gitHub as well and the code is in your template. Use the same read.csv function as in Task 2.1 to
load this in. Call the R object holding this data TREATMENT.
2. (2 points) Since city is the "key" field in TREATMENT, we need to check that it is unique. The R
function duplicated(...) tells us which values in whatever column we give it are duplicated. If we
sum(duplicated(...)) we can see how many duplicated values there are. Use duplicated(...) and
sum(...) to see that TREATMENT$city has no duplicates. Duplicate values in the merge will multiply
the number of rows in our data, which would be bad!
3. (5 points) Now, we want to merge the city level data in TREATMENT to the city-quarter data in hw4. We
will use merge(...) for this and call the new object PAN.merged (PAN is for PANel). Merge takes the
following function inputs:
(a) x = hw4 (the data you start with)
(b) y = TREATMENT (the data being merged)
(c) by = c("city") The "keys" on which we are merging
(d) all = F This tells R to do an "inner join", which keeps only those cities that are in both hw4 and
TREATMENT
Put it all together: PAN.merged = merge(x = hw4, y = TREATMENT, by = c("city"), all=F)
4. (2 points) Use names(PAN.merged) to show the names of the columns we now have merged and
NROW(PAN.merged) to see how many observations we have.
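The duplicate check and the merge can be rehearsed on two tiny made-up data frames before running them on the real data. The objects panel and treat and their contents are hypothetical; only base R functions are used.

```r
# Hypothetical city-quarter panel: 3 cities x 2 quarters
panel <- data.frame(
  city = rep(c("A", "B", "C"), each = 2),
  quarter = rep(1:2, times = 3),
  stringsAsFactors = FALSE
)

# Hypothetical city-level table: one row per city, C is absent and D is extra
treat <- data.frame(
  city = c("A", "B", "D"),
  start = c("2010-05-15", NA, "2011-01-01"),
  stringsAsFactors = FALSE
)

# The key must be unique in the city-level table, so this should be 0
n_dup <- sum(duplicated(treat$city))

# all = FALSE keeps only cities found in both tables (C and D are dropped)
inner <- merge(x = panel, y = treat, by = c("city"), all = FALSE)

# For contrast, all.x = TRUE keeps every row of x and fills start with NA for city C
left <- merge(x = panel, y = treat, by = c("city"), all.x = TRUE)

print(n_dup)
print(NROW(inner))
print(NROW(left))
```

Comparing NROW before and after a merge is a quick way to notice when rows were dropped or multiplied unexpectedly.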
Question 2.2 (4 points)
1. (2 points) How many observations do you have now? Hint: it should be 2,250
2. (2 points) View() the data to see what we have. Is this panel, time series, or cross-sectional data? Why?
Task 2.3 (16 points)
The last thing we need is to create our time-varying treatment variable from the treatment start date in the
data.
1. (2 points) We are going to set the PACE variable based on whether or not yq>=PACE.start. PACE.start
is NA for all untreated cities, but contains the date of PACE start for the treated cities. First, convert
PAN.merged$PACE.start to an R-recognizable date. Do this by using PAN.merged$PACE.startdate
= ymd(PAN.merged$PACE.start).
2. (2 points) Next convert PAN.merged$PACE.startdate to a year-quarter using as.yearqtr(...) as
before.
3. (4 points) Create a column in PAN.merged called PACE that is TRUE if yq>=PACE.startdate and FALSE
otherwise.
4. (2 points) R has trouble comparing anything to an NA. It’s likely that your PAN.merged$PACE column is
a mix of TRUE, FALSE, and a lot of NA. Using a subset, assign a value of FALSE to any PACE that is NA:
• PAN.merged[is.na(PAN.merged$PACE), "PACE"] = FALSE
• Note that we don’t have to assign this to a new object or column - we are updating the NA’s in
PAN.merged "in place".
• When we compare things using <, >, or ==, we get a type of data that is called a logical. R recognizes
the words FALSE and TRUE as the two values of a logical object.
Now that we have our time-varying treatment variable, PACE, we need a non-time-varying indicator for
all the treated cities. Just like in lecture, we will call this TMT and it will be TRUE for all cities who are
ever treated. We will use an R shortcut function %in% for this.
The shortcut %in% takes whatever is to the left of it and tells you, item by item, if it is anywhere in
the thing on the right. So c(1,2,3) %in% c(2,3,4) will return FALSE,TRUE,TRUE because the last
two entries in c(1,2,3), 2 and 3, also appear in c(2,3,4), while 1 does not.
5. (3 points) Make a new object called treated.cities = PAN.merged[PAN.merged$PACE==T,"city"].
This will be a vector of all of the treatment cities. Then, make a new column in PAN.merged$TMT =
PAN.merged$city %in% treated.cities.
The new column in PAN.merged will be TRUE if the city is ever treated, and FALSE otherwise.
6. (3 points) We should probably compare the before-treatment levels of the outcome variable between
the two groups. We can use a boxplot to compare the median and distribution:
• boxplot(WPOOH ~ TMT, data = PAN.merged[PAN.merged$PACE==F,])
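The NA handling and the %in% step can be seen on a small made-up panel. The data frame toy and its column names are hypothetical, and base R's as.Date is used here as a stand-in for lubridate's ymd, on the assumption that the dates are written as year-month-day.

```r
require(zoo)

# Hypothetical panel: city A starts treatment in May 2010, city B never does
toy <- data.frame(
  city = rep(c("A", "B"), each = 3),
  yq = rep(c("2010 Q1", "2010 Q2", "2010 Q3"), times = 2),
  start = rep(c("2010-05-15", NA), each = 3),
  stringsAsFactors = FALSE
)
toy$yq <- as.yearqtr(toy$yq)

# Date -> year-quarter; NA start dates stay NA
toy$startq <- as.yearqtr(as.Date(toy$start))

# Comparing against NA gives NA, not FALSE
toy$treated_now <- toy$yq >= toy$startq
print(toy$treated_now)

# Replace the NA values in place
toy[is.na(toy$treated_now), "treated_now"] <- FALSE

# Cities with at least one treated quarter, then a city-level indicator
treated_cities <- unique(toy[toy$treated_now == TRUE, "city"])
toy$ever_treated <- toy$city %in% treated_cities

print(c(1, 2, 3) %in% c(2, 3, 4))
print(toy)
```

Wrapping the city vector in unique(...) is optional, since %in% gives the same answer when the right-hand side contains repeats.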
Question 2.3 (4 points)
1. (2 points) Use table(...) to see how many treated city-quarter observations we have.
2. (2 points) The boxplot shows the median (the horizontal bar) and the 25th-75th percentile range (the
edges of the box) for the variable WPOOH for each group before PACE starts. The dots are the “outliers”.
From this boxplot, does it look like there is a systematic difference before treatment between the
treatment and the control?
Task 2.4 (5 points)
Let’s run some regressions. I’m not going to write out each regression. It’s up to you to construct the right
regression formula. Make sure you always use HC-robust errors as usual.
First, let’s be very naive and just compare the before-after amongst the treated cities only. Run a regression
of WPOOH on PACE, PWRPRICE, and incentiverate on a subset of the data consisting only
of the treatment group. That is, only on the data for the treated. You can subset in the lm(...,
data=PAN.merged[PAN.merged$TMT==T,]) call.
• PWRPRICE is the average cost per kWh of electricity for the city.
• incentiverate is the per-watt subsidy given by the state under the California Solar Initiative. It was
designed to “step down” over time as more people install solar (it eventually hit zero in early 2014).
• PACE is our treatment variable of interest. It is the presence of a PACE program in that city during
that quarter. It varies by city-quarter.
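The mechanics of subsetting inside lm(...) and then requesting HC-robust errors can be sketched on simulated data. The data frame toy, its columns, and the HC1 variance type are assumptions for illustration and are not the homework regression.

```r
require(lmtest)    # provides coeftest
require(sandwich)  # provides vcovHC

set.seed(420)
# Hypothetical data with a grouping flag and heteroskedastic noise
toy <- data.frame(group = rep(c(TRUE, FALSE), each = 100), x = rnorm(200))
toy$y <- 1 + 0.5 * toy$x + rnorm(200, sd = 1 + abs(toy$x))

# Subset in the data argument so only group == TRUE rows are used
fit <- lm(y ~ x, data = toy[toy$group == TRUE, ])

# Same coefficients as summary(fit), but with heteroskedasticity-consistent standard errors
robust <- coeftest(fit, vcov. = vcovHC(fit, type = "HC1"))
print(robust)
```

The coefficient estimates are unchanged by the robust correction; only the standard errors, t values, and p-values differ from the default summary(fit) output.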
Question 2.4 (11 points)
1. (1 point) By including PWRPRICE and incentiverate, what are we controlling for and why?
2. (3 points) Can you think of anything that isn’t controlled for that could cause bias? Hint: there are
lots of possible answers since we are only looking at the treatment group, but make sure you explain
why yours could cause bias!
3. (1 point) What is the coefficient on incentiverate and what does it mean?
4. (2 points) Does this comport with your prior expectation? That is, does it make sense? Why or why
not?
5. (2 points) What is the coefficient on PACE and what does it mean? Make sure you state your answer
including units.
6. (2 points) What is the std. error on the coefficient for PACE, and is it statistically significant?
Task 2.5 (5 points)
Run the regression from 2.4 again, but keep the whole sample. PACE is yq- and city-specific, so you don’t
need to interact it with anything.
Question 2.5 (6 points)
1. (2 points) If we call the cities with PACE programs the “treatment”, what do we call the cities without
treatment?
2. (1 point) What is the coefficient on PACE in this specification?
3. (1 point) Is it significant? Why or why not?
4. (2 points) Is there anything missing that we have not controlled for?
Task 2.6 (5 points)
Finally, let’s do a Difference-in-differences specification. PACE is already the interaction between TMT and
POST (unlike in lecture, we have time-varying treatment times, so we don’t really have a POST), so all we
need to do is add time fixed effects yq and city-level fixed effects city. Run the DID specification.
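One way to see how fixed effects enter an lm(...) formula is a simulated panel with staggered treatment timing. Everything below is hypothetical: the data frame toy, the start quarters, and the true effect of 2 are made up, and wrapping the city and quarter columns in factor(...) is just one way to get a separate intercept for each.

```r
set.seed(420)
n_city <- 20
n_q <- 12

# Hypothetical balanced panel: every city observed in every quarter
toy <- expand.grid(city = paste0("city", 1:n_city), q = 1:n_q,
                   stringsAsFactors = FALSE)
city_id <- as.integer(sub("city", "", toy$city))

# Cities 1-10 start treatment between quarters 4 and 7; cities 11-20 never do
start_q <- ifelse(city_id <= 10, 4 + city_id %% 4, NA)
toy$pace <- !is.na(start_q) & toy$q >= start_q

# Outcome: true treatment effect of 2, plus city effects, a common time trend, and noise
city_effect <- rnorm(n_city)[city_id]
toy$y <- 2 * toy$pace + city_effect + 0.3 * toy$q + rnorm(nrow(toy), sd = 0.5)

# factor(city) and factor(q) add one dummy per city and per quarter
did <- lm(y ~ pace + factor(city) + factor(q), data = toy)
print(coef(did)["paceTRUE"])
```

Because the simulated effect is the same for every city and quarter, the coefficient on pace should come out close to 2 here. Robust standard errors can be added with coeftest as in the earlier tasks.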
Question 2.6 (20 points)
1. (3 points) By adding the fixed effects, what have we controlled for?
2. (5 points) What is the identifying assumption for this regression?
3. (2 points) What is the coefficient on PACE and what does it mean (note: it won’t equal the coefficient
in the Kirkpatrick and Bennear paper)? Is this the ATE?
4. (2 points) Is it statistically significant and why?
5. (2 points) What is the mean of the outcome in the data (use mean(...)), and is the effect of PACE
economically meaningful or not? That is, compared to the average value of WPOOH, is the effect big or
small?
6. (2 points) Which specification do you feel is the least biased? Why?
7. (4 points) Scroll down to see the time fixed effects. Do they follow a pattern? Does that pattern match
the pattern you saw in the plot in Task 2.1? Note that, depending on what size window you have open,
the p-value column might appear down below.