---
title: "R Growth Curve Analysis & Eyetracking Workshop: Tutorial 2: General Linear Models"
author: "Brock Ferguson"
date: "July 1, 2015"
output:
  html_document:
    toc: true
    theme: readable
---
Load packages we'll be using:
```{r}
library(ggplot2)
```
Load in our vocabulary data again.
```{r}
vocab <- read.csv('data-vocab.csv')
```
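Before modeling, it can help to confirm the structure of the dataframe. This quick check is just a sketch; the columns it should show (`age`, `gender`, `median`, and the confidence-interval bounds `ci.l` and `ci.h`) are the ones used throughout this tutorial:
```{r}
str(vocab)  # column types: gender should be a factor, the rest numeric
head(vocab) # first few rows
```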
# Introduction to lm()
Re-run our correlation as a basic linear model:
```{r}
model <- lm(median ~ age, data=subset(vocab, gender == 'F'))
```
What methods are available to us on this object?
```{r}
class(model)
model
print(model)
summary(model)
plot(model)
```
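Beyond printing, summarizing, and plotting, you can pull individual pieces out of the fitted model object with standard accessor functions:
```{r}
coef(model)      # the fitted coefficients
residuals(model) # one residual per observation
fitted(model)    # the model's predicted value for each observation
```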
# lm() and aov() comparison
Let's do some analyses with our vocab dataset to compare the different linear modeling
methods made available by R.
First, let's begin by visualizing the raw data again:
```{r}
ggplot(vocab, aes(x=age, y=median, color=gender)) +
geom_pointrange(aes(ymin=median-ci.l, ymax=median+ci.h), position=position_dodge(.4)) +
scale_x_continuous(name="Age (Months)") +
scale_y_continuous(name="Productive Vocabulary")
```
## Single factor models
Let's compare `lm()` (Linear Model) and `aov()` (Analysis of Variance) in modeling these data.
Note: For simplicity, we are going to treat each datapoint here as if it came from a different subject, as if there were 1 subject of each gender at each age.
```{r}
summary(lm(median ~ age, data=subset(vocab, gender == 'F')))
summary(aov(median ~ age, data=subset(vocab, gender == 'F')))
```
What we expect is that a single-term linear model should give us the exact same results as a one-way ANOVA. This is indeed what we see when we compare these models. Although the ANOVA does not return a precise estimate of the slope (as `lm()` does), the p-values are identical and the ANOVA's F-value is exactly the linear model's t-value, squared:
```{r}
sqrt(630.2) # the square root of the ANOVA's F-value recovers the linear model's t
```
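Rather than retyping the F-value by hand, we can also verify the t² = F relationship programmatically. This is just a sketch that re-fits the two models and pulls the statistics out of their summary objects:
```{r}
fit_lm  <- lm(median ~ age, data=subset(vocab, gender == 'F'))
fit_aov <- aov(median ~ age, data=subset(vocab, gender == 'F'))
# the t for the age term in the linear model...
t_age <- summary(fit_lm)$coefficients['age', 't value']
# ...squared, should equal the F for age in the ANOVA table
t_age^2
summary(fit_aov)[[1]][1, 'F value']
```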
## Two factor models, no interaction
What happens when we add in a second main effect (but no interaction)?
```{r}
summary(lm(median ~ age + gender, data=vocab))
summary(aov(median ~ age + gender, data=vocab))
```
Perfect again! We get the exact same results from both methods, which is what we would expect when comparing a two-way ANOVA without an interaction to a two-parameter linear model.
## Two factor models, with interactions
What happens when we allow our two factors to interact in predicting vocabulary?
```{r}
summary(lm(median ~ age*gender, data=vocab))
summary(aov(median ~ age*gender, data=vocab))
```
Now we see (a) vastly different results between the two models and (b) strange results, especially within the linear model. Let's break this down.
In our linear model, we still have our significant effect of age (which is good) but now no reliable effect of gender (which is bad) and a positive estimate of being male (which is very bad).
In our ANOVA, we see a wildly significant effect of age, and a wildly significant effect of gender.
What's going on?
## Why are lm() and aov() different?
The answer to this question lies in (a) the way the models are fitted, and (b) the way our parameters are coded before running the model. We normally don't have to think about either of these things in SPSS, for example, because they are handled for us automatically.
### lm()
Let's first look at the linear model which, at first glance, seems the most questionable.
Specifically, (1) why did it not return a main effect of gender?
And (2) why is its estimate of the effect of being male positive, when each male "subject" scores lower than the female at the same age?
Let's re-fit our model and save its results in a variable.
```{r}
model <- lm(median ~ age*gender, data=vocab)
```
We want to know what this model fit looks like. We can do this by retrieving the predictions of our model and storing them in our dataframe.
```{r}
vocab$predictions <- predict(model, vocab)
```
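A quick peek at a few rows confirms the predictions are now sitting alongside the observed values:
```{r}
head(vocab[, c('age', 'gender', 'median', 'predictions')])
```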
Rather than comparing our predictions and actual values line-by-line, it's best to visualize them on a single plot.
Here we will plot the data as we did before; however, we will now add a new "layer" to the plot containing our predicted values.
```{r}
ggplot(vocab, aes(x=age, y=median, color=gender)) +
geom_point() + # the raw data, plots the 'y' aesthetic above
geom_line(aes(y=predictions)) + # our predictions, plotted as a line
scale_x_continuous(name="Age (Months)") +
scale_y_continuous(name="Predicted Vocabulary")
```
Looks like a good fit! Indeed, our model fit maps exactly onto the line of best fit we can generate using ggplot:
```{r}
ggplot(vocab, aes(x=age, y=median, color=gender)) +
geom_point() + # the raw data, plots the 'y' aesthetic above
stat_smooth(method="lm", se = FALSE) + # add a line of best fit, without SE
scale_x_continuous(name="Age (Months)") +
scale_y_continuous(name="Predicted Vocabulary")
```
So what's wrong with our model?
Let's think about what we want to know with each parameter:
For Age, we want to know what the effect of Age is while controlling for Gender.
For Gender, we want to know what the effect of Gender is while controlling for Age.
When we say "controlling" in a linear model, what we actually mean is "while holding the other variable constant at zero."
If we set Age to 0 in our current model, as the variables are currently coded, we are essentially asking for the effect of Gender on a newborn's vocabulary.
We can extrapolate our model's predictions down to a newborn to take the model's perspective on this question...
```{r}
vocab_extrapolations <- data.frame(
  age = rep(0:30, each=2),
  gender = rep(c('M','F'), times=31),
  prediction = 0
)
vocab_extrapolations$prediction <- predict(model, vocab_extrapolations)
ggplot(vocab_extrapolations, aes(x=age, y=prediction, color=gender)) +
geom_line() +
scale_x_continuous(name="Age (Months)") +
scale_y_continuous(name="Predicted Vocabulary")
```
Now everything should begin to make sense. Our non-significant, positive effect of being male is the model telling us that it thinks male newborns have a slight advantage over female newborns. This obviously doesn't make sense.
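You can see this directly in the model's coefficients: under the default treatment coding, the `genderM` estimate is exactly the predicted male-female gap at age = 0.
```{r}
coef(model) # 'genderM' is the male-female difference when age is held at 0
```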
This wasn't a problem for us before adding the interaction term because, without the interaction, the model's estimate of the effect of Gender is constant and therefore equally reliable at all ages.
To fix this problem, let's *center* the age variable so that when we tell the model to give us the effect of gender while controlling for Age, we are holding Age constant at the *mean* age for our sample.
```{r}
vocab$ageC <- scale(vocab$age, center=T, scale=F)
summary(lm(median ~ ageC*gender, data=vocab))
```
This is better: a significant, negative effect for gender.
Now, we have to think about our effect of Age. As things stand, when we hold Gender constant at zero, we are actually fixing the gender parameter at "Female." This "treatment coding" of variables is the default for factor columns in R dataframes.
We will fix this in the exact same way as before: centering. To center a non-numeric factor variable with only 2 levels, we will re-code it as a numeric variable where our two groups differ by a total of 1 unit. This is called deviation coding:
```{r}
vocab$genderC <- ifelse(vocab$gender == 'F', -.5, .5)
vocab$genderC <- scale(vocab$genderC, center=T, scale=F) # make sure it's centered
summary(lm(median ~ ageC*genderC, data=vocab))
```
Now we have a proper model!
The moral here is that we need to pay close attention to the way our variables are treated (e.g., centered, coded, standardized) because these choices change the way the model's effects are estimated and interpreted. This is more of an issue in R than elsewhere because, unlike SPSS and other packages that handle all of this automatically, R won't assume you want these transformations done in the background.
I see mistakes with this all of the time. Here are my recommendations when fitting any linear model in R:
(1) Deviation code and center all variables, by default.
(2) Ask yourself, do the model's estimates make sense intuitively?
(3) Visualize your raw data and model predictions.
(4) Try to replicate each main effect's estimate using simple arithmetic. For example:
```{r}
# our intercept should equal the grand mean (both predictors are centered)
mean(vocab$median)
# with genderC coded F = -.5 and M = .5, the gender estimate should equal
# the male mean minus the female mean
mean(vocab[which(vocab$gender=='M'), 'median']) - mean(vocab[which(vocab$gender=='F'), 'median'])
```
The reason we have to be so careful with `lm()` is that what it gives us are *simple* effects: estimates of the effect of a single variable while the other independent variables are held constant at zero.
What we usually want are *main effects*: estimates of the effect of a single variable while all other independent variables are held at their average.
By centering and deviation-coding, we can get main effects from `lm()`.
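As an aside, R's built-in contrast machinery offers an equivalent route: assign sum-to-zero contrasts to the factor itself rather than building a numeric column by hand. A sketch (`genderS` is a new column just for illustration; note that `contr.sum` codes F = +1 and M = -1, so the estimate will be half the size of, and opposite in sign to, the one from our -.5/.5 coding):
```{r}
vocab$genderS <- vocab$gender
contrasts(vocab$genderS) <- contr.sum(2) # F = +1, M = -1
summary(lm(median ~ ageC*genderS, data=vocab))
```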
### aov()
But what about `aov()`? If you remember, it wasn't so bad. It already knew there was a reliable main effect of gender (though we don't know in which direction, because we haven't plotted its predictions).
Is it better than `lm()`? No!
Here's why:
```{r}
# set a seed so these random draws (and the point below) are reproducible
set.seed(1)
# generate three random vectors
random_dv <- rnorm(30,5,5)
random_iv1 <- rnorm(30,5,5)
random_iv2 <- rnorm(30,5,5)
# fit a model using these random vectors
summary(aov(random_dv ~ random_iv1 + random_iv2))
# now enter them in the opposite order:
summary(aov(random_dv ~ random_iv2 + random_iv1))
```
We got different answers depending on the order we entered the variables. This is because `aov()` uses Type I (sequential) sums of squares. No amount of re-coding or centering can overcome this property and give us the answers we (probably) want.
Notice, however, that changing the order of age and gender in our model has no effect:
```{r}
summary(aov(median ~ age + gender, data=vocab))
summary(aov(median ~ gender + age, data=vocab))
```
This is because these predictors are perfectly uncorrelated (because we have a balanced design):
```{r}
cor.test(vocab$age, as.numeric(vocab$gender))
```
With even the slightest correlation, our estimates are going to change based on order of entry in the model.
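If you want order-independent, ANOVA-style tests from a model like this, the usual tool is `Anova()` from the `car` package, which computes Type II or Type III sums of squares. A sketch, assuming `car` is installed (note that Type III tests are only meaningful with centered, deviation-coded predictors, which we have):
```{r}
# install.packages("car") # if not already installed
library(car)
Anova(lm(median ~ ageC*genderC, data=vocab), type="III") # order-independent tests
```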
## When should I use aov()?
There is only one time you should use `aov()`: repeated-measures ANOVAs. In Tutorial 3, we will begin using mixed-effects models which, by way of random effects, serve much the same function. However, sometimes you just want a good old-fashioned repeated-measures ANOVA and, *provided your design is balanced and everything is coded properly*, you will get an appropriate answer from `aov()`.
Here's a quick example:
```{r}
# build a quick dataframe
dv <- c(1,3,4,2,2,3,2,5,6,3,4,4,3,5,6)
subject <- factor(rep(paste0('s',1:5),each=3))
myfactor <- factor(rep(paste0('f',1:3),times=5))
mydata <- data.frame(dv, subject, myfactor)
# the Error() term tells aov() that myfactor varies within subjects,
# so its F-test uses the subject-by-factor error stratum
model <- aov(dv ~ myfactor + Error(subject/myfactor), data=mydata)
summary(model)
```
## Other helpful lm() functions
Here are some other helpful functions for linear models:
```{r}
model <- lm(median ~ ageC*genderC, data=vocab)
# confint(): get confidence intervals for estimates
confint(model)
# pairwise.t.test(): run pairwise.t.tests between groups (ignoring other factors)
pairwise.t.test(vocab$median, vocab$gender)
# nonsignificant, because we aren't controlling for Age
# drop1(): drop each parameter and retrieve an F test, for reporting like an ANOVA
drop1(model, ~., test="F")
```
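One more that comes up constantly: `anova()` can compare two nested models with an F-test, which is a handy way to ask whether the interaction term earns its keep:
```{r}
model_reduced <- lm(median ~ ageC + genderC, data=vocab)
model_full <- lm(median ~ ageC*genderC, data=vocab)
anova(model_reduced, model_full) # does adding the interaction significantly improve fit?
```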
Clean up our workspace.
```{r}
ls()
rm(list=ls())
```