forked from NickCH-K/CausalitySlides
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathLecture_08_Fixed_Effects.Rmd
364 lines (283 loc) · 15 KB
/
Lecture_08_Fixed_Effects.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
---
title: "Lecture 8 Fixed Effects"
author: "Nick Huntington-Klein"
date: "`r Sys.Date()`"
output:
revealjs::revealjs_presentation:
theme: solarized
transition: slide
self_contained: true
smart: true
fig_caption: true
reveal_options:
slideNumber: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
library(tidyverse)
library(stringr)
library(dagitty)
library(ggdag)
library(gganimate)
library(ggthemes)
library(Cairo)
library(gapminder)
library(modelsummary)
library(fixest)
library(ggpubr)
theme_set(theme_gray(base_size = 15))
```
## Recap
- Last time we talked about how controlling is a common way of blocking back doors to identify an effect
- We can control for a variable `W` by using our method of using `W` to explain our other variables, then take the residuals
- Another form of controlling is using a sample that has only observations with similar values of `W`
- Some variables you want to be careful NOT to control for - you don't want to close front doors, or open back doors by controlling for colliders
## Today
- Today we'll be starting on our path for the rest of the class, where we'll be talking about standard *methods* for performing causal inference
- Different ways of getting identification once we have a diagram!
- Our goal here will be to understand these methods *conceptually* and to also figure out some good statistical practices for their use
- Our goal is to *understand* these methods and be able to apply a straightforward version of them
## Today
- In particular we'll be talking about a method that is commonly used to identify causal effects, called fixed effects
- We'll be discussing the *kind* of causal diagram that fixed effects can identify
- All of the methods we'll be discussing are like this - they'll only apply to particular diagrams
- And so knowing our diagrams will be key to knowing when to use a given method
## The Problem
- One problem we ran into last time is that we can't really control for things if we can't measure them
- And there are lots of things we can't measure or don't have data for!
- So what can we do?
## The Solution
- If we observe each person/firm/country *multiple times*, then we can forget about controlling for the actual back-door variable we're interested in
- And just control for *person/firm/country identity* instead!
- This will control for EVERYTHING unique to that individual, whether we can measure it or not!
## In Practice
- Let's do this on the data from the "gapminder" package
- This data tracks life expectancy and GDP per capita in many countries over time
```{r, echo=TRUE, eval=FALSE}
library(gapminder)
data(gapminder)
cor(gapminder$lifeExp,log(gapminder$gdpPercap))
```
```{r, echo=FALSE, eval=TRUE}
data(gapminder)
cor(gapminder$lifeExp,log(gapminder$gdpPercap))
```
```{r, echo=TRUE}
gapminder <- gapminder %>% group_by(country) %>%
mutate(lifeExp.r = lifeExp - mean(lifeExp),
logGDP.r = log(gdpPercap) - mean(log(gdpPercap))) %>% ungroup()
cor(gapminder$lifeExp.r,gapminder$logGDP.r)
```
## So What?
- This isn't any different, mechanically, from any other time we've controlled for something
- So what's different here?
- Let's think about what we're doing conceptually
## What's the Diagram?
- Why are we controlling for things in this gapminder analysis?
- Because there are LOTS of things that might be back doors between GDP per capita and life expectancy
- War, disease, political institutions, trade relationships, health of the population, economic institutions...
## What's the Diagram?
```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=6}
dag <- dagify(LifeEx~GDPpc+A+B+C+D+E+F+G+H,
GDPpc~A+B+C+D+E+F+G+H,
coords=list(
x=c(LifeEx=4,GDPpc=2,A=1,B=2,C=3,D=4,E=1,F=2,G=3,H=4),
y=c(LifeEx=2,GDPpc=2,A=3,B=3,C=3,D=3,E=1,F=1,G=1,H=1)
)) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## What's the Diagram?
- There's no way we can identify this
- The list of back doors is very long
- And likely includes some things we can't measure!
## What's the Diagram?
- HOWEVER! If we think that these things are likely to be constant within country...
- Then we don't really have a big long list of back doors, we just have one: "country"
```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=3.5}
dag <- dagify(LifeEx~GDPpc+Coun,
GDPpc~Coun,
coords=list(
x=c(LifeEx=4,GDPpc=2,Coun=3),
y=c(LifeEx=2,GDPpc=2,Coun=3)
)) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## What We Get
- So what we get out of this is that we can identify our effect even if some of our back doors include variables that we can't actually measure
- When we do this, we're basically comparing countries *to themselves* at different time periods!
- Pretty good way to do an apples-to-apples comparison!
## Graphically
```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5}
lgdpmean <- mean(log(gapminder$gdpPercap))
lemean <- mean(gapminder$lifeExp)
ggplot(gapminder)+geom_point(aes(x=log(gdpPercap)-lgdpmean,y=lifeExp-lemean,color="Raw"),alpha=.6)+
geom_point(aes(x=logGDP.r,y=lifeExp.r,color="After Fixed Effects"))+
geom_point(data=filter(gapminder,country=="Pakistan"),
aes(x=log(gdpPercap)-lgdpmean,y=lifeExp-lemean,color="Raw Pakistan"),size=3)+
labs(x="log(GDP Per Capita)",y="Life Expectancy")+
theme(axis.text.x=element_blank(),
axis.text.y=element_blank())+
scale_color_manual(name="Values",values=c("After Fixed Effects" = "blue","Raw" = "black","Raw Pakistan" = "red"))
```
## Graphically
- The post-fixed-effects dots are basically a bunch of "Raw Country X" pasted together.
- Imagine taking "Raw Pakistan" and moving it to the center, then taking "Raw Britain" and moving it to the center, etc.
- Ignoring the baseline differences between Pakistan, Britain, China, etc., in their GDP per capita and life expectancy, and just looking within each country.
- We are ignoring all differences *between* countries (since that way back doors lie!) and looking only at differences *within* countries.
- Fixed Effects is sometimes also referred to as the "within" estimator
## In Action
```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5}
df <- data.frame(Person = rep(1:4,50)) %>%
mutate(X = .5+.5*(Person-2.5) + rnorm(200)) %>%
mutate(Y = -.5*X + (Person-2.5) + 1 + rnorm(200),time="1") %>%
group_by(Person) %>%
mutate(mean_X=mean(X),mean_Y=mean(Y)) %>%
ungroup()
#Calculate correlations
before_cor <- paste("1. Start with raw data. Correlation between X and Y: ",round(cor(df$X,df$Y),3),sep='')
after_cor <- paste("6. Within-Individual Correlation Between X and Y: ",round(cor(df$X-df$mean_X,df$Y-df$mean_Y),3),sep='')
#Add step 2 in which X is demeaned, and 3 in which both X and Y are, and 4 which just changes label
dffull <- rbind(
#Step 1: Raw data only
df %>% mutate(mean_X=NA,mean_Y=NA,time=before_cor),
#Step 2: Add x-lines
df %>% mutate(mean_Y=NA,time='2. Figure out between-Individual differences in X'),
#Step 3: X de-meaned
df %>% mutate(X = X - mean_X,mean_X=0,mean_Y=NA,time="3. Remove all between-Individual differences in X"),
#Step 4: Remove X lines, add Y
df %>% mutate(X = X - mean_X,mean_X=NA,time="4. Figure out between-Individual differences in Y"),
#Step 5: Y de-meaned
df %>% mutate(X = X - mean_X,Y = Y - mean_Y,mean_X=NA,mean_Y=0,time="5. Remove all between-Individual differences in Y"),
#Step 6: Raw demeaned data only
df %>% mutate(X = X - mean_X,Y = Y - mean_Y,mean_X=NA,mean_Y=NA,time=after_cor))
p <- ggplot(dffull,aes(y=Y,x=X,color=as.factor(Person)))+geom_point()+
geom_vline(aes(xintercept=mean_X,color=as.factor(Person)))+
geom_hline(aes(yintercept=mean_Y,color=as.factor(Person)))+
guides(color=guide_legend(title="Individual"))+
scale_color_colorblind()+
labs(title = 'The Relationship between Y and X, with Individual Fixed Effects \n{next_state}')+
transition_states(time,transition_length=c(12,32,12,32,12,12),state_length=c(160,100,75,100,75,160),wrap=FALSE)+
ease_aes('sine-in-out')+
exit_fade()+enter_fade()
animate(p,nframes=200)
```
## Notably
- This does assume, of course, that all those back door variables CAN be described by country
- In other words, that these back doors operate by things that are *fixed* within country
- If something is a back door and changes over time in that country, fixed effects won't help!
## Varying Over Time
- For example, earlier we mentioned war... that's not fixed within country! A given country is at war sometimes and not other times.
```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=4}
dag <- dagify(LifeEx~GDPpc+Coun+War,
GDPpc~Coun+War,
coords=list(
x=c(LifeEx=4,GDPpc=2,Coun=3.5,War=2.5),
y=c(LifeEx=2,GDPpc=2,Coun=3,War=3)
)) %>% tidy_dagitty()
ggdag_classic(dag,node_size=20) +
theme_dag_blank()
```
## Varying Over Time
- Of course, in this case, we could control for War as well and be good!
- Time-varying things doesn't mean that fixed effects doesn't work, it just means you need to control for that stuff too
- It always comes down to thinking carefully about your diagram
- Fixed effects mainly works as a convenient way of combining together lots of different constant-within-country back doors into something that lets us identify the model even if we can't measure them all
## Fixed Effects in Regression
- We can just do fixed effects as we did-subtract out the group means and analyze (perhaps with regression) what's left
- We can also include *dummy variables for each group/individual*, which accomplishes the same thing
$$ Y = \beta_0 + \beta_1Group1 + \beta_2Group2 + ... + $$
$$ \beta_NGroupN + \beta_{N+1}X + \varepsilon $$
$$ Y = \beta_i + \beta_1X + \varepsilon $$
## Fixed Effects in Regression
- Why does that work?
- We want to "control for group/individual" right? So... just... put in a control for group/individual
- Of course, like all categorical variables as predictors, we leave out a reference group
- But here, unlike with, say, a binary predictor, we're rarely interested in the FE coefficients themselves. Most software works with the mean-subtraction approach (or a variant) and don't even report them!
## Fixed Effects in Regression: Variation
- Remember we are isolating *within variation*
- If an individual *has* no within variation, say their treatment never changes, they basically get washed out entirely!
- A fixed-effects regression wouldn't represent them. And can't use FE to study things that are fixed over time
- And in general if there's not a lot of within variation, FE is going to be very noisy. Make sure there's variation to study!
## Fixed Effects in Regression: Notes
- It's common to *cluster standard errors* at the level of the fixed effects, since it seems likely that errors would be correlated over time (autocorrelated errors)
- It's possible to have *more than one set* of fixed effects. $Y = \beta_i + \beta_j + \beta_1X + \varepsilon$
- But interpretation gets tricky - think through what variation in $X$ you're looking at at that point!
## Coding up Fixed Effects
- We will use the **fixest** package
- It's very fast, and can be easily adjusted to do FE with other regression methods like logit, or combined with instrumental variables
- Clusters at the first listed fixed effect by default
```{r, echo = TRUE, eval = FALSE}
library(fixest)
m1 <- feols(outcome ~ predictors | FEs, data = data)
msummary(m1)
```
## Example: Sentencing
- What effect do sentencing reforms have on crime?
- One purpose of punishment for crime is to deter crime
- If sentences are more clear and less risky, that may reduce a deterrent to crime and so increase crime
- Marvell & Moody study this using data on reforms in US states from 1969-1989
## Example: Sentencing
```{r, echo=FALSE}
mm <- as.data.frame(readLines('marvel_moody_sentencing.txt'))
mm1 <- as.data.frame(mm[rep(c(TRUE,FALSE),2100/2),]) %>%
rename(mm1 = `mm[rep(c(TRUE, FALSE), 2100/2), ]`)
mm2 <- as.data.frame(mm[rep(c(FALSE,TRUE),2100/2),]) %>%
rename(mm2 = `mm[rep(c(FALSE, TRUE), 2100/2), ]`)
mmdata <- tibble(
state = substr(mm1$mm1,5,8),
year = as.numeric(substr(mm1$mm1,11,12)),
assault = as.numeric(substr(mm1$mm1,44,49)),
robbery = as.numeric(substr(mm1$mm1,50,55)),
pop1000 = as.numeric(substr(mm1$mm1,56,61)),
sentreform = as.numeric(str_sub(trimws(mm2$mm2),-3))
) %>%
mutate(sentreform = ceiling(sentreform)) %>%
na.omit
```
- I've omitted code reading in the data
- But in our data we have multiple observations per state
```{r, echo=TRUE}
head(mmdata)
mmdata <- mmdata %>% mutate(assaultper1000 = assault/pop1000,
robberyper1000 = robbery/pop1000)
```
## Fixed Effects
- We can see how robbery rates evolve in each state over time as states implement reform
```{r, dev='CairoPNG', echo=FALSE, fig.width=8, fig.height=4.5}
ggplot(mmdata,aes(x=year,y=robberyper1000,
group=state,color=factor(sentreform)))+
geom_line(size=1)+scale_color_colorblind(name="Reform")+
labs(x="Year",y="Robberies per 1000 Population") +
theme_pubr() +
theme(legend.position = c(.8,.9))
```
## Fixed Effects
- You can tell that states are more or less likely to implement reform in a way that's correlated with the level of robbery they already had
- So SOMETHING about the state is driving both the level of robberies AND the decision to implement reform
- Who knows what!
- Our diagram has `reform -> robberies` and `reform <- state -> robberies`, which is something we can address with fixed effects.
## Fixed Effects
```{r, echo=TRUE}
sentencing_ols <- lm(robberyper1000 ~ sentreform, data = mmdata)
sentencing_fe <- feols(robberyper1000 ~ sentreform | state, data = mmdata)
msummary(list('OLS' = sentencing_ols, 'FE' = sentencing_fe), stars = TRUE, gof_omit = 'AIC|BIC|F|Lik|Adj|Pseudo')
```
## Example
- The `r scales::number(coef(sentencing_ols), accuracy = .001)` included the fact that different kinds of states tend to institute reform
- The `r scales::number(coef(sentencing_fe), accuracy = .001)` doesn't!
- Looks like the deterrent effect was real! Although important to consider if there might be time-varying back doors too, we don't account for those in our analysis
- What things might change within state over time that would be related to robberies and to sentencing reform?
## Practice
- We want to know the effect of your `teacher` on the `test score`s of high school students
- Some potential back doors might go through: `parents' intelligence`, `age`, `demographics`, `school`, `last year's teacher`
- Draw a diagram including all these variables, plus maybe some unobservables where appropriate
- If you used fixed effects for students, what back doors would still be open?
- What would the `feols()` command for this regression look like?
## Practice Answers
- Fixed effects would close your back doors for `parents' intelligence`, `demographics`, and `school`, but leave open `age` and `last year's teacher`
```{r, eval = FALSE, echo = TRUE}
m <- feols(TestScore ~ Teacher + Age + LastYearsTeacher |
Student, data = data)
```