-
Notifications
You must be signed in to change notification settings - Fork 33
/
lec10-multivariate-stats.Rmd
807 lines (550 loc) · 30.7 KB
/
lec10-multivariate-stats.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
---
title: "Multivariate statistics"
author: Amber Gigi Hoi
output: pdf_document
---
# Lesson preamble:
> ### Lesson objectives:
>
> - Most things in EEB are correlated - what does that do to our analyses?
> - Use multivariate techniques to reduce complex situtations into more manageable ones
> - Run multivariate multiple regressions with structural equation models
> - Understand that a **very** long list of caveats come with using each of these techniques
> - Bottomline: integrity is the number one virtue to being a good researcher
>
> ### Lesson outline:
> Total lesson time: 2 hours
>
> - Introduction to dataset and hypotheses (10 min)
> - Performing ordination on species data (30 min)
> - Inspect multicollinearity (20 min)
> - A very brief introduction to structural equation models (40 min)
---
# Introduction
## Correlations - a double edge sword
Ecologists are obsessed with patterns in nature. It is thus great news that most
things in nature _are_ correlated with one another - that's why we all still
have jobs. However, this is also bad news, because when things are correlated,
it becomes difficult to tease part which effects came from where.
Take this simple example of something we can all relate to -- food.
You love spicy food. To satisfy your craving, you made this very spicy dish with
five different kinds of chili peppers. It was a very delicious dish! It has this
real depth of flavor, and needless to say, spicy! Then you find yourselves
wondering where is the heat actually coming from anyway? Was it the habanero?
The Scotch bonnet? There was a little bit of smokiness, perhaps that was
chipotle? That was a difficult question because all of the ingredients you put
in have a similar flavor profile – spicy.
In today’s lecture, we’ll be working through a real-life –- read: it’s gonna be
messy –- example of how different multivariate techniques can be used to
~~distinguish between the flavor profile of different chili peppers~~
disentangle the effects of highly correlated variables.
## Setup
The overarching question we will be tackling today is whether vector diversity
mattered to malaria transmission. Did you know that malaria parasites,
_Plasmodium spp._, can infect up to 70 mosquito species? These vector species
were not created equal in terms of their capacity to transmit disease as they
often differ in traits such as life history, phenology, and feeding preference.
Thus, neglected details of vector diversity can have important consequences for
disease dynamics. There are many ways in which "diversity" can be measured:
abundance, species richness, and composition. Which one of these axes of
diversity really mattered, and how much does it matter?
Here are the packages we will be using today:
```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(MASS)
library(PerformanceAnalytics)
library(MuMIn)
library(ggfortify)
library(lavaan)
library(viridis)
```
In addition, we will also be _sourcing_ some functions provided by Zuur _et al._
(2009). The full reference can be found at the end of this document.
```{r}
# Original link http://highstat.com/Books/BGS/GAMM/RCodeP2/HighstatLibV6.R
source("https://raw.githubusercontent.com/UofTCoders/rcourse/master/resources/HighstatLibV6.R")
```
And here is the data we will be working with:
```{r eval=FALSE}
download.file("https://uoftcoders.github.io/rcourse/data/kenya.wide",
"kenya.wide")
```
```{r message=FALSE, warning=FALSE}
kenya.wide <- read_csv("data/kenya.wide.csv")
```
The data used in this analysis was collected by Mbogo and colleagues in 1998 and
published in their 2003 manuscript. In brief, these researchers conducted
entomological and epidemiological surveys at 30 villages along the eastern
Kenyan coast in 1998. Overnight human-landing catches were performed in ten huts
per village every two months over the course of the year-long study. Mosquito
species were identified via molecular methods, and their abundances were
recorded. At the end of the vector survey period (May 1998), 76-100 school
children (ages 6-12) were sampled for malaria at each village. Malaria
prevalence was estimated as the proportion of samples that showed blood-stage
parasites in stained blood smears. _P. falciparum_ was found to be responsible
for more than 95% of the malaria cases in this region, and thus is the only
parasite species considered here. The longitudinal mosquito diversity data was
aggregated to match the cross-sectional nature of the malaria prevalence data.
The data relevant to our analysis thus consisted of the following:
1. Malaria prevalence (number positive divided by number sampled at each village),
2. Mosquitoes observed at each village (relative and total abundance of each species),
3. Latitudinal location of each village as a proxy for unavailable climate variables,
4. Distance to closest big city in the region (Mombasa) as a proxy for access to health care.
The unit of replication in this study, i.e. the smallest scale at which
information on variation is available, is “village”, bringing the total sample
size to 30.
Let's have a look at the data, and start to organize our analysis.
```{r}
head(kenya.wide)
```
* The **response** variable is "PfPR" which stands for _Plasmodium falciparum_
parasite rate.
* The **predictor** variables are "total.abundance", "SR" which
stands for species richness, and the four columns "arabiensis", "funestus",
"gambiae", and "merus", which recorded relative abundances of each of the
vector species of the genus _Anopheles_ found in this area.
* We will also be
considering "lat" and "distance" (to closest big city) as potential
**confounders**.
## Creating a proxy for community structure
At the moment, we have four predictors that relate to "vector community
structure" (i.e., those four columns containing species relative abundance
info), and that is a lot! How can we reduce this to something simpler to use in
our regressions?
The first multivariate technique we are going to work with today is ordination.
As the name may suggest, ordination is a way of organizing things into a
particular order, typically based on how similar things are in some aspect. In
this case, we will be organizing our sites based on how similar they are in
vector species composition.
One of the most common ordination techniques used by ecologists is the Principle
Components Analysis (PCA). A PCA seeks to find a new coordinate system (PC axes)
that best accounts for the variation in multivariate data space, thus reducing
dimensionality in the process.
The procudure to conduct a PCA is as follows:
1. Imagine a 4-dimensional object with each of our species relative abundances as an axis (k=4).
2. Plot our sites (n=30) in this 4-D space, and then find the distance between
those points (do not despair, R will do the actual calculations for us).
3. After you have the distances between points, we can use processes that are
similar (in concept) to least squares to find a "best-fit line" through our
data points. This will be our first PC axes and it explains the most variance
in our data.
4. Find another best-fit line through the remainder of the data under the
condition that it has to be orthogonal to the first axes (i.e., at a 90°
angle, meaning that these two axes are as unrelated to each other as
possible).
5. This process continues until we have k PC axes.
Easy peasy, right? PCA _is_ in fact a fairly straightforward and automatic
analysis in R. We just have to give R the appropriate object, do a little of bit
of transformation to ensure our results are not disproportionately affected by
any outliers (i.e., rare species), and we are good to go!
```{r}
kenya.pca <- kenya.wide %>%
dplyr::select(arabiensis, gambiae, funestus, merus) %>% # Choose relevant columns
mutate_all(sqrt) %>% # Hellinger transformation
prcomp(.) # Pipe directly into base R function for PCA
```
These lines of code returns the result of our simple PCA. Use _str()_ and _summary()_ for a quick initial inspection of this object.
```{r}
summary(kenya.pca)
```
This gives us a table _summary_ of variance explained by each axes. R even did
the math for us and showed cumulative variance explained as we move to the
higher dimensions, so convenient! We see that the first two PC axes already
explained close to 90% of the variation of the data, that's pretty good! We will
therefore only be using these two axes in subsequent analyses, and have
successfully reduced our community composition measure from four variables to
two!
```{r}
str(kenya.pca)
```
We see that this object is *str*uctured as a five-item list. We mostly care
about two of these items:
1. Species loadings - appropriately named _rotation_.
2. Site scores - lazily named _x_.
To inspect these items, use the $ notation.
```{r}
kenya.pca$rotation
```
Basically, species loadings show how much of the original species info
contributed to each new axes. For example, we see that the species _An.
funestus_ loaded the most strongly on to PC1, followed by _An. arabiensis_ and
_An. gambiae_ in the opposite direction. This means that sites with relatively
more _An. funestus_ will tend to have relatively few _An. arabiensis_ and _An.
gambiae_, but this antagonism is not balanced (due to weaker loading of
_arabiensis_ and _gambiae_ onto this axes). As a mini challenge, interpret PC2
loadings.
```{r}
kenya.pca$x
```
Site scores, on the other hand, tells us the coordinates of each site following
projection onto our newly made PC axes.
The best way to understand what all of that meant is by visualizing the PCA
results in a _biplot_. This type of graph is called a *bi*plot because we will
be plotting the species loadings and site scores simultaneously. Just like our
PCA, R has made it extremely easy for us to make biplots, and there are many
options available in terms of packages and functions. We will be using a new
function -- _autoplot()_ -- from a new package, _ggfortify_. This package (and
function) takes the convenience afforded by ggplot to the next level (hence
"fortify") -- we don't even have to tell it which type of plot or give it any
aesthetics!
```{r}
autoplot(kenya.pca)
```
But wait, that's not a biplot though!? Turns out _autoplot()_ isn't that auto after all...
```{r}
autoplot(kenya.pca, loadings=TRUE, loadings.label=TRUE)
```
Some notes on interpreting this biplot:
* PC1 is typically plotted on the x-axis, and PC2 on the y-axis. The percent
variance explain by these axes have been included in the labels automatically
for our convenience.
* The arrows (rotations = species loadings) remind us what the axes mean --
_An. funestus_ loaded strong onto PC1 whereas PC2 mainly represent the
antagoism between _An. arabiensis_ and _An. gambiae_.
* The clustering of sites along the axes (and rotations) tell us which sites
are similar in terms of their species composition and how they are similar.
For example, sites 1, 9, 11, and 17 were all similar in that they are
dominated by _An. funestus_.
We can add some simple modifications to include more information on the plot,
such as other community attributes.
```{r tidy=FALSE}
autoplot(kenya.pca, loadings=TRUE, loadings.label=TRUE, data=kenya.wide, colour="total.abundance")
# Remember to tell R where the original data is coming from!
# Also, ggplot will refuse to understand you if you spell colour the non-Canadian way
```
To make the scale more readable, we can manually change the colour bar to other
colors, or, we could use a pre-made colour palette -- brace yourselves for the
package _viridis_ because it is the BEST. PACKAGE. EVERRRRRRRRR.
```{r}
autoplot(kenya.pca, loadings=TRUE, loadings.label=TRUE,
data=kenya.wide, colour="total.abundance", size=3) +
scale_colour_viridis(option="D", direction=-1) +
theme_classic()
# Setting direction = -1 flips the colour scale
```
Lastly, we'll add our PC axes to the main dataframe for subsequent analyses.
```{r}
axes <- data.frame(kenya.pca$x)
kenya.wide <- bind_cols(kenya.wide, axes) # Note order of rows!
```
_Sidebar:_ there are many ways to run ordinations, PCA is but one of them, and
even then there are a couple of ways you can run a PCA. It will take this entire
course and some more to walk you through all of the ordination techniques so we
will refrain. Before you ordinate, you should always consult textbooks and
and/or experts to make sure you are chosing the best method for your case.
# Investigating the contributions of different axes of vector diversity to malaria prevalence
## Initial data exploration
Before we actually do our analysis, let's have another look at our data.
Specifically, let's inspect the relationships between predictors using this tool
called a correlation chart:
```{r}
kenya.wide %>%
dplyr::select(c(total.abundance, SR, PC1, PC2, distance, lat)) %>%
chart.Correlation(.)
```
We see that there is fairly good correlation between almost all of our
predictors. This is bad news -- remember chili peppers? This phenomena is termed
_multicollinearity_, and we can investigate this in a more objective and
quantitative manner with the help of variance inflation factors (VIF). VIFs are
so called because correlations amongst predictors inflates standard errors in
parameter estimates, which means the p-value will also be inflated, making it
difficult to detect an effect. The function we will use to calculate VIFs was
sourced from Zuur _et al._ (2009).
```{r}
kenya.wide %>%
dplyr::select(c(total.abundance, SR, PC1, PC2, distance, lat)) %>%
corvif(.)
```
A general rule of thumb is that VIFs over 3 will cause problems to your
regressions, and we should sequentially drop the variable with the highest VIF,
and recalculate VIFs until all are less than 3. If this is the case in your
analysis, you **must** report which variables were dropped **and why**, ideally
including the before and after VIF values. Of course, sometimes there are
variables that are important to include and we cannot just drop them (e.g., kind
of important to keep mosquito abundance in our model). We will learn about
methods to handle those situations in just a sec.
Back to our VIF analysis. We see that both *lat*itude and _distance_ to closest
big city both have VIFs higher than 3. Check out Figure 1 in the original Mbogo
_et al._ (2003) publication. We see that the closest big city in the area,
Mombasa, is situated close to the south. No wonder latitude and distance were so
highly correlated! Here, we make the executive decision to drop latitude,
because we are not actually asking any climate-related question at the moment,
and also because we have means to obtain climate variables directly (next
lecture) and thus there is no need to use a proxy for that.
Calculate VIFs again, this time without latitude.
```{r}
kenya.wide %>%
dplyr::select(c(total.abundance, SR, PC1, PC2, distance)) %>%
corvif(.)
```
There, much better!
Another thing we should inspect before running our linear models is the
assmption of normality. Remember how the correlation chart we were working with
earlier showed us histograms alongside correlations?
```{r warning=FALSE}
kenya.wide %>%
dplyr::select(c(total.abundance, SR, PC1, PC2, distance)) %>%
chart.Correlation(.) # Not very normal
kenya.wide %>%
dplyr::select(c(total.abundance, SR, PC1, PC2, distance)) %>%
mutate_all(log) %>% # Natural-log transformation. For regular log, use log10()
chart.Correlation(., histogram=TRUE, pch=19) # Looks better
```
We make our second executive decision of the day - we will be fitting our linear
models in log-log space.
## Finally, we are going to fit our models!
We should be quite familiar with the syntax of linear models by now. Let's go
ahead and write out one model each for our three predictors of interest,
mosquito abundance (total.abundance), species richness (SR), and composition
(PC1 and PC2).
```{r}
mod01 <- lm(log(PfPR)~log(total.abundance), data=kenya.wide); summary(mod01)
mod02 <- lm(log(PfPR)~log(SR), data=kenya.wide); summary(mod02)
mod03 <- lm(log(PfPR)~PC1+PC2, data=kenya.wide); summary(mod03)
```
Basically, we see that both mosquito abundance and species richness have strong
positive effects, whereas PCs 1&2 were not statistically significant. What about
our confounder, distance?
```{r}
mod04 <- lm(log(PfPR)~log(total.abundance)+log(distance), data=kenya.wide); summary(mod04)
mod05 <- lm(log(PfPR)~log(SR)+log(distance), data=kenya.wide); summary(mod05)
```
From these models, we see the effects of total abundance and species richness on
malaria prevalence differed _quantitatively_ from the previous models without
the confounder, but not _qualitatively_ (i.e., the interpretation of the effect
is the same -- significant positive association). We can thus make yet another
executive decision to drop distance from subsequent analyses. Again, you
**must** disclose this to your readers, ideally providing a table showing
regression coefficients from both models with and without the confounder.
Let's inspect the effects of abundance and species richness more closely. We can
regress them against malaria prevalence simltaneously in a multiple
regression...
```{r}
mod06 <- lm(log(PfPR)~log(total.abundance)+log(SR), data=kenya.wide); summary(mod06)
```
...but then the whole thing flatlined. What is going on? Did the effect just disappear?
```{r}
ggplot(kenya.wide, aes(x=SR, y=PfPR, colour=total.abundance)) +
geom_point(size=3) +
scale_colour_viridis(option="D", direction=-1) +
labs(y="Malaria prevalence", x="Number of mosquito species", fill="Mosquito\nabundance") +
theme_classic()
```
From this plot, we can see clearly that the effects did not in fact disappear (I
mean, how _could_ it? Silly me...). The positive associations between abundance,
species richness, and malaria prevalence are very pronounced. Could
multicollinearity _still_ be obscuring our ability to find the "truth", even
though we ensured VIFs looked good? To add even more confusion to the situation,
when we compete all four predictors together in the model, we see that all of a
sudden PC2 came back as significant...??!?
```{r}
mod07 <- lm(log(PfPR)~log(total.abundance)+log(SR)+PC1+PC2, data=kenya.wide); summary(mod07)
```
**Disclaimer: Don't do this at home!** This exercise felt a lot like data mining
-- it probably was! We showed you this in class only to demonstrate what
multicollinearity can do to your regressions.
## Structural equation models to the rescue
Structural equation models (SEM) is a scientific framework for the modelling and
interpretation of complex networks of relationships as a whole system. SEMs are
particularly well-suited for disentangling processess that operate
simultaneously in systems, due to its focus on the many direct and indirect
pathways through which a predictor may influence an outcome. This is acheived by
explicitly modelling the relationships _between_ predictors (those darn
correlations!) in addition to their associations with the focal response. This
also means that SEMs have the capacity to become multivariate multiple
regressions with great flexibility, where variables can be simultaneously
predictors and responses.
The first and absolute most important step in constructing an SEM is model
specification. It is absolutely crucial that this step is done appropriately.
There are a couple types of relationships (i.e., paths) between variables one
can specify:
1. Direct causal effect -- strong words!
2. Correlation -- less strong words
3. No relationship -- also strong words!!!
4. Latent variable -- represents abstract theoretical ideas unmeasured or
unmeasurable, these are uncommon in ecology and won't be discussed further
We totally sound like broken records at this point, but you **must** have solid
rationale for the inclusion of each variable and for the specification of each
path.
The model we will be working with today takes the following form.
![Path diagram representing hypothesized relationships between characteristics of vector communities and malaria prevalence. Paths (i.e., arrows) represent the direction of the hypothesized causal relationship between variables, with the arrow pointing from the predictor to the response. Double-headed arrows represent correlations between the pair of variables. Paths are labeled with letters to facilitate discussion.](image/SEMfig.png)
Once you are satisfied with the specification of your model, we can then
evaluate the model and inspect model fit.
There are many ways one can evaluate SEMs, and the R package _lavaan_ is perhaps
the most popular amongst ecologists. _Lavaan_ stands for *la*tent *va*riable
*an*alysis. As the name suggests, this package is particularly suited to
handling SEMs with latent variables. Inspite of the fact that latent variable
analyses (a.k.a. factor analyses) are not that common in ecology, _lavaan_ is
popular because it is still powerful for non-factor-analyses, and also because
it is extremely well documented.
We need to do a little data preparation before we can specify the model in R.
Specifically, we need to create the natural-log-transformed predictors (total
abundance and species richness) and response (malaria prevalence) before hand,
because _lavaan_ does not like it when we do it in the equations itself.
```{r}
kenya.wide <- kenya.wide %>%
mutate(l.abun = log(total.abundance),
l.sr = log(SR),
l.pfpr = log(PfPR))
```
The _lavaan_ model specification syntax is slightly different from the linear
models we are familiar with. The biggest difference is perhaps that, because we
are dealing with multiple pathways in a system, we have to specify each path
individuals and then include all of them as part of our model in a pair of
quotations. Direct casual relationships are specified with the good old ~, as we
do in regressions, and correlations are specified with ~~.
```{r}
sem01 <- '
# Regressions
l.pfpr ~ l.sr + l.abun + PC2
# Correlations
l.sr ~~ l.abun
PC2 ~~ l.sr
'
```
We will then call the _sem()_ function to evaluate this model. Remember to give
it data! Given that each path in this model is expected to conform to
assumptions of normality, and we can already foresee that this will not always
be the case, we will use bootstrapping to obtain more robust standard error and
p-value estiamtes. Since bootstrapping is a randomization procedure, we will set
a seed for the random number generator to ensure that we can reproduce the same
results when we rerun this code or share this with a collaborator.
```{r results="hide"}
set.seed(42)
sem01.fit <- sem(sem01, data=kenya.wide, se="boot", bootstrap=1000, verbose=TRUE)
# verbose is unnecessary, but it makes you look cool
```
It is always tempting to jump straight to results (those darn p-values!) -- BUT
-- let's refrain for now and instead inspect model fit. Depending on how you
chose to evaluate your SEM (e.g., which package you used), there will be
different ways to inspect model fit. _Lavaan_ provides a whole suite of model
fit indices, and we will focus on the Chi-sq test as it is often considered to
be the gold standard.
To understand how the Chi-sq test works, we need to back track a little bit and
think about how SEMs are evaluated. Basically, _lavaan_ calculates a matrix of
covariances based on the observed data, and compares this to one generated from
your hypothesized model structure. The null hypothesis of the Chi-sq test is
that these two sets of values are not different from each other -- which is a
great thing because that means your hypothesized model _is consistent with_ the
observed data. In other words, a large p-value in the Chi-sq test for goodness
of fit of an SEM is desirable. This is in contrast to traditional hypotheses
testing approaches (e.g., in linear models) where the researcher hopes to find a
large difference between the observed data and the null model which is a
zero-effect model.
**Disclaimer 2: A model that fits well doesn't mean that it is correct _per se_,
or that it reflects the "truth" or the "reality"!** Whether an SEM is "good" is
determined by the robustness of the theory that went into constucting it. We can
never guarantee that we have found the "global best model", and we must never
claim that we have found "the truth". The appropriate phrase to use is (and the
correct way to think about SEM model fit) is whether "the data is consistent
with what we’ve hypothesized/proposed". You **must** report model fit statistics
when you work with SEMs.
To inspect model fit statistics, we can call our best friend _summary()_. This
generates a very long summary, most are irrelevant at this point. Alternatively,
we can use the self-explanatory function _fitMeasures()_.
```{r}
fitMeasures(sem01.fit, c("df", "pvalue"))
# Retrieves df (make sure this is not zero) and p-value of chi-sq test
```
Significant p-value! Yay! ... WAIT, we are working with SEMs, this is not good!
So our hypothesized model structure is inconsistent with the observed data.
Bummer! There are actually very good theoretical reasons why this model
structure is inadequate, however, for the purposes of this lecture, we will
pretend that this is not an issue and continue with our analysis.
We can use another self-explanatory function to pull path coefficients.
```{r}
parameterEstimates(sem01.fit, boot.ci.type="bca.simple", standardized = TRUE)
```
* The first three rows gave the results of the hypothesized direct causal
relationships. These results are congurent with the last linear model (mod07)
we ran in that species richness and PC2 both have significant positive
associations with malaria prevalence, but not total abundance.
* The next two rows gave the results of hypoethesized correlations between
species richness and total abundance and PC2, and as suspected, they were
indeed correlated.
* The last four rows showed relationships we did not specify, and they were
very strange ones as well -- what does the correlation between malaria
prevalence and malaria prevalence mean? Turns out, this represents the
residual variance in this variable. In other words, 1 minus this value gives
the R^2^ value we are familiar with from working with linear models. This only
works for direct causual relationships and not correlations, however, so you
see that the last three gave estimates of 1, which means none of the variation
is explained.
_Note:_ Standardizing path coefficients (i.e., expressed in units of standard
deviation) allow us to compare the effect of different predictors in a model,
whereas raw path coefficients should be used when comparing the effect of that
same predictor across different models.
So we see that the code used to specify, evulate, and inspect elements of SEMs
is pretty straight forward. To visualize SEM, however, is quite another story.
There is an R package, _semPlot_, that can be used to draw path diagrams such as
the one shown in Figure 1. Figures generated with this package is quite
difficult to customize, though, so it is often recommended that you make your
path diagrams in your favorite imaging processing software such as MS Paint,
Inkscape, or even Powerpoint.
Lastly, one simply cannot work with SEMs without calculating indirect and total effects:
* An indirect effect of predictor A to response B can be obtained by
multiplying all of the path coefficients that are in the way of A and B in one
straight path. For instance, there are two ways in which total abundance is
exerting indirect effects on malaria prevalence: through species richness and
through PC2. The former is obtained with b$\times$d and the latter,
d$\times$e$\times$c.
* Summing all the direct and indirect effects from A to B gives the total effect A has on B.
We will add indirect and total effects of total abundance to our models first by
labelling each path according to Figure 1, and using the symbol ":=".
```{r results="hide"}
sem02 <- '
# regressions
l.pfpr ~ a*l.sr + b*l.abun + c*PC2
# correlations
l.sr ~~ d*l.abun
PC2 ~~ e*l.sr
# defined parameters
indirect.abun := (b*d) #indirect effect of abundance via SR
indirect.abun2 := (d*e*c) #indirect effect of abundance via PC2
total.abun := b + (a*d) + (d*e*c)
'
set.seed(42)
sem02.fit <- sem(sem02, data=kenya.wide, se="boot", bootstrap=1000, verbose=TRUE)
```
```{r}
parameterEstimates(sem02.fit, boot.ci.type="bca.simple", standardized = TRUE)
```
The total effect of total abundance was statistically significant, but none of
the rest of the compenents are. Think about what these results mean, especially
in relation to what our linear model results suggested.
## Bottomline
* SEM has the potential to provide much greater insight regarding how systems
work, and thus also our ability to predict its behavior under perturbation, than
traditional univariate regression approaches **but only when applied
appropriately**. Specifically, it can reveal pathways and mechanisms that are
not apparent (and often impossible to find) via classic univariate approaches
**but only when applied appropriately**.
* We write this again here because this is extremely important: A model that
fits well doesn’t necessarily mean that it is a good model, or that it reflects
truth or reality. The theory behind the model is the most crucial to deciding
whether or not a model is any good. For any given set of variables, there can be
many different models that fit well, and they may even contradict each other or
suggest nonsensical relationships. Thus, the model needs to be solid
theoretically before you evaluate it and get too excited over the results.
* To ensure transparency and accountability, always provide a table summary of
pre-hoc rationale for the specification of each path, and a post-hoc
interpretation of the results of model fit. Know that there is absolutely
nothing wrong with "getting it wrong" the first time, in fact, those case often
turn out to be the most interesting! There is also nothing wrong with adjusting
your hypotheses as you work through your data and gain better understanding of
it -- we’d all be out of jobs if the world always work exactly the way we
thought it would. Again, transparency is key!
## Additional readings
Grace, J. B., Anderson, T. M., Olff, H. & Scheiner, S. M. (2010). On the
specification of structural equation models for ecological systems. Ecol.
Monogr. 80, 67-87.
Mbogo, C. M. et al (2003). Spatial and temporal heterogeneity of Anopheles
mosquitoes and Plasmodium falciparum transmission along the Kenyan coast. Am. J.
Trop. Med. Hyg. 68, 734–742.
Shipley, B. (2000). Cause and correlation in biology: a user’s guide to path
analysis, structural equations, and causal inference. Cambridge University
Press, London, UK.
Zuur, A. F., Ieno, E. N., Walker, N., Saveliev, A. A. & Smith, G. M. (2009).
Mixed effects models and extensions in ecology with R. Springer, New York, USA.
Zuur, A. F., Ieno, E. N. & Elphick, C. S. (2010). A protocol for data
exploration to avoid common statistical problems. Methods Ecol. Evol. 1, 3–14.