---
title: "Nested and Functional Programming For Case Control Analysis"
author: "Avery Richards"
date: "12/9/2021"
output:
  prettydoc::html_pretty:
    theme: hpstr
    highlight: github
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
In epidemiology, outbreak investigations often rely on case control studies to test hypotheses about a probable source of contagion among many exposures. In this blog I walk through a tidy approach to case control analysis, using a nesting and many-models approach I picked up from the [R for Data Science book](https://r4ds.had.co.nz/many-models.html) and the de-identified results of a CDC survey used during a classic investigation of an E. coli O157:H7 outbreak in the United States, circa 2009.
To begin, we will need to install and load a variety of libraries before we import our raw dataset.
```{r, warning = F, message = F}
# installing and loading packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
  rio,                 # import data
  tidyverse, janitor,  # shape data
  survival,            # statistics
  broom,               # model evaluation
  kableExtra,          # table output
  prettydoc            # knitting
)
# turn off scientific notation in output
options(scipen = 999)
# import dataset, clean names
ecoli_cc <- import("case_control_study_readacted.xlsx") %>%
janitor::clean_names()
```
Information from this study does not come to us in a pristine form. There are cases *(people who became ill)* observed in the dataset who do not have an equivalent control *(a matched person who did not become ill)*. Having identified the cases without controls, we can make a list of those `cdcid` values and filter them out, producing an updated line list, or `dataframe`.
```{r}
## first we make list of cases without matched controls
unmatched_cases <- c("CDC001", "CDC006", "CDC009", "CDC012",
"CDC015", "CDC022", "CDC023", "CDC024",
"CDC037", "CDC043", "CDC045", "CDC046",
"CDC048", "CDC049", "CDC067", "CDC068")
```
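As an aside, a list like this could also be assembled programmatically rather than by hand. Below is a minimal sketch, assuming every fully matched `cdcid` should appear at least twice in the data, once as a case and once as a control:
```{r, eval = FALSE}
# a sketch for finding unmatched cases: count rows per cdcid and
# keep the ids that appear only once (i.e., no matched control)
ecoli_cc %>%
  count(cdcid) %>%
  filter(n < 2) %>%
  pull(cdcid)
```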
One case has multiple controls assigned to it. We need to pluck the extra control from the data frame before we can continue.
```{r}
# some cases have multiple controls assigned.
ecoli_cc %>%
select(cdcid, case, controlletter) %>%
filter(cdcid == "CDC047")
```
```{r}
# filter out unmatched cases.
ecoli_cc_match <-
ecoli_cc %>%
filter(!(str_detect(cdcid,
paste(unmatched_cases, collapse = "|")))) %>%
filter(!(cdcid == "CDC047" & cdccaseid == "CDC047_B")) %>%
# create a strata number from cdcid strings
mutate(strata_num = as.numeric(stringr::str_extract_all(cdcid,"(..$)")))
# count to verify the case numbers match, 36 strata
ecoli_cc_match %>%
distinct(strata_num) %>% tally()
```
After verifying the 1:1 matching pattern of cases and controls present in the dataset, we select the food exposure variables, renaming the more cryptic values into human-friendly identifiers. We must also recode the `99` and `3` values sprinkled throughout the exposure data. These represent missing data in the survey responses as well as the always troubling *"I'm not sure"* answers received from participants.
```{r}
# Select food exposure variables from CDC questionnaire
cc_exposures <- ecoli_cc_match %>%
select(strata_num, case, strawberry,
apples, rollup, gb, rawcd, milk,
smoothie, cchip, carrot, cucumber,
raspberry, watermelon, nocdfzndes,
shopsmoothie, cantaloupe, mandarin,
grapes, bologna, hotdog, bacon) %>%
# rename variables
rename(ground_beef = gb, raw_cookie_dough = rawcd,
fruit_rollup = rollup,
chocolate_chips = cchip,
frozen_dessert = nocdfzndes,
storebought_smoothie = shopsmoothie)
# replace 99 and 3 with NA throughout the dataframe
cc_exposures <- map_df(cc_exposures, ~ na_if(.,"99"))
cc_exposures <- map_df(cc_exposures, ~ na_if(.,"3"))
# evaluate structure of new dataframe
str(cc_exposures)
```
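The two `map_df` calls above rebuild the whole dataframe twice. An equivalent, more current dplyr idiom is sketched below; it assumes the survey columns were imported as character, as the quoted `"99"` and `"3"` above suggest:
```{r, eval = FALSE}
# a sketch of the same recode in a single mutate(), using across();
# assumes the exposure columns are character vectors
cc_exposures <- cc_exposures %>%
  mutate(across(where(is.character), ~ na_if(na_if(.x, "99"), "3")))
```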
Here we are with a cleaned dataset of food exposures for case and control survey participants. Our next step is to fit a statistical model of some sort and generate odds ratios that estimate the odds of disease given exposure to each food item. There are 20 distinct food exposures to analyze. *So how can we conduct our tests in an organized and comparable way without repeating a stat function 20 times, then wrestling and herding the outputs together someplace to observe?* The best answer I can think of is using the tidyverse `nest()` function with `purrr::map`.
To do that, we first need to restructure the data so we can `group_by` the individual food exposures. Pivoting the wide exposure matrix into a long format works well.
```{r}
# pivot longer to create category for each food.
cc_pivot <- cc_exposures %>%
pivot_longer(cols = 3:22, values_to = "exposure",
names_to = "food")
dim(cc_pivot)
```
After pivoting, the dataframe contains identical data expressed differently: each observation, one per exposure and participant, is treated as a row in the long format. __72 (participants) * 20 (exposures) = 1440 rows__. Now we can `group_by` the distinct exposures and `nest()` our data via those groups.
```{r}
# group and nest by food category.
cc_nested <- cc_pivot %>%
group_by(food) %>%
nest()
head(cc_nested)
```
After pivoting, grouping, and nesting up our data we have a *dataframe of dataframes* sort of object, with each value in the `cc_nested$data` column being a `tibble` in itself that contains the strata, case status, and exposure values for each `food` item we categorized with `group_by`.
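For instance, pulling a single element out of the list column shows one of these per-food tibbles (a quick inspection, not a required step):
```{r, eval = FALSE}
# peek at the nested tibble for the first food group:
# 72 rows of strata_num, case, and exposure values
cc_nested$data[[1]]
```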
The next step is to put together a function that runs a conditional logistic regression model, so we can operate on all our nested dataframes without repeating the process for each exposure.
```{r}
# function to map clogit model on dataframe.
clogit_model <- function(data){
survival::clogit(case ~ exposure +
strata(strata_num), data = data) }
```
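Before mapping the function over all twenty exposures, we could sanity-check it on a single nested tibble. This is just a sketch of that check (output not shown):
```{r, eval = FALSE}
# fit the conditional logistic model to one food exposure
clogit_model(cc_nested$data[[1]])
```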
Once the function is put together, we can use a `mutate` call to create a new column that maps the `clogit` operation onto all the nested dataframes in one go. But wait, not so fast...
```{r}
# map clogit onto the nested "data" column to create a new column of model outputs.
cc_logit <- cc_nested %>%
mutate(model = map(data, clogit_model))
```
...we are getting a warning from R after running the models.
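If it is not obvious which of the twenty fits triggered the warning, one way to track it down is sketched below, using `purrr::quietly()` to capture any warnings alongside each model:
```{r, eval = FALSE}
# a sketch for locating the model(s) that warned during fitting
cc_nested %>%
  mutate(quiet_fit = map(data, purrr::quietly(clogit_model)),
         warnings = map_chr(quiet_fit, ~ paste(.x$warnings, collapse = "; "))) %>%
  filter(warnings != "") %>%
  pull(food)
```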
Due to the structure and limited amount of exposure data, one of our 20 models was unable to converge, and it returns wild OR outputs that do not seem plausible (*because they aren't*). There is a variation of the conditional logistic model, an "exact" method, that is unavailable in the R computing environment at this time. Having done our homework, we learn that the [original analysis of this data](https://academic.oup.com/cid/article-pdf/54/4/511/1105929/cir831.pdf) used STATA software, which does have an exact method available; it converged and returned the values below, so we can assemble those values to insert into our final tables.
```{r}
# outputs from a conditional exact model for the dataframe that did not converge.
added_output <- data.frame(food = "raw_cookie_dough",
estimate = 41.34,
p.value = .001,
conf.low = 7.37,
conf.high = Inf)
```
Once we have made a new column with the 20 model outputs, one for each nested food exposure (*including the erroneous one*), we can extract outputs using the `broom` package. Exponentiation is key to getting correct odds ratios from a logistic regression model, and confidence intervals are also necessary to evaluate the strength of our estimates beyond p-value thresholds, so we are sure to include those arguments in the `broom::tidy` function that we `map` onto our nested data frames.
```{r}
# summary of outputs.
cc_summary <- cc_logit %>%
mutate(outputs = map(model, broom::tidy,
# exponentiate and add confidence intervals.
exponentiate = TRUE, conf.int = TRUE))
head(cc_summary)
```
With the information from our models at the ready, we `unnest()` the model outputs, select the relevant columns, and organize them in a table object, replacing the results from the model that was unable to converge with `survival::clogit()`.
```{r, message = F}
# table of output summary of models
cc_summary %>%
unnest(outputs) %>%
# select the outputs relevant to our analysis
select(estimate, p.value, conf.low, conf.high) %>%
# round these values for legibility's sake
mutate(p.value = round(p.value, 3),
estimate = round(estimate, 2),
conf.low = round(conf.low, 3),
conf.high = round(conf.high, 3)) %>%
# remove the row (or model) that was unable to converge
filter(food != "raw_cookie_dough") %>%
# add fixed outputs from published stata analysis
bind_rows(added_output) %>%
arrange(estimate) %>%
# create print quality table object for ease of viewing
kbl(caption = "Output of Conditional Logistic Regression Models",
col.names = c("Food Consumed",
"Estimated OR",
"p-value",
"CI-low",
"CI-high")) %>%
kable_classic_2(full_width = F) %>% column_spec(2, bold = T)
```
It is also useful to explore counts and proportions of cases and controls who consumed the food items. Here we filter our pivoted data by case or control status, group again on the food exposure, then count and compute a proportion.
```{r}
# count and proportions of controls per food exposure
controls_tab <- cc_pivot %>%
filter(case == 0) %>%
group_by(food) %>%
summarise(controls = sum(exposure, na.rm = T)) %>%
# 36 matched controls in the data, so divide counts by 36
mutate(controls_percent = round(controls / 36, 2))
```
Now that we have created a table for the controls, we repeat the process for cases and `inner_join` the tables together. Finally, we add a similar wrapper for a print-quality table object.
```{r}
# join controls with cases
cc_pivot %>%
filter(case == 1) %>%
group_by(food) %>%
summarise(cases = sum(exposure, na.rm = T)) %>%
mutate(cases_percent = round(cases / 36, 2)) %>%
inner_join(controls_tab, by = "food") %>%
arrange(cases_percent) %>%
kbl(caption = "Counts and Percentages of Exposed Cases and Controls",
col.names = c("Food Consumed",
"Cases (No.)",
"Cases (%)",
"Controls (No.)",
"Controls (%)")) %>%
kable_classic_2(full_width = F)
```
You may notice, at the bottom of the table, the high percentage of cases who reported eating raw cookie dough. This is evidence that further supports raw cookie dough being the source of the outbreak. Another way to explore the summary data is with visualizations. We can visualize the estimated OR values with confidence intervals fairly simply in a `ggplot` bar graph.
```{r, fig.align="center"}
cc_summary %>%
# we must unnest and ungroup our data object
unnest(outputs) %>%
filter(food != "raw_cookie_dough") %>%
ungroup() %>%
# can't neglect the exact test measurements
add_row(food = "raw_cookie_dough", estimate = 41.34,
conf.low = 7.37, conf.high = Inf) %>%
## plotting ###
# plot with theme and titles
ggplot(aes(food, estimate)) +
geom_bar(stat = "identity", fill = "steelblue",
alpha = .6) +
# use conf values to create error bars (color and alpha are fixed
# values, so they belong outside aes rather than mapped inside it)
geom_errorbar(aes(x = food, ymin = conf.low, ymax = conf.high),
color = "green", alpha = .9) +
ylim(0, 42) +
coord_flip() +
# add a reference line at OR = 1
geom_hline(yintercept = 1, color = "purple", linetype = "dotdash") +
# themes and title
theme_bw() +
theme(axis.text.x = element_text(angle = 90),
legend.position = "none") +
# title and labels
ylab("Estimated OR with Confidence Interval") +
xlab("Food Exposures") +
ggtitle("OR Estimates of Disease Given Food Exposure Status")
```
Even though the scale of this visualization is stretched by the `raw_cookie_dough` estimate, we can add a `geom_hline` at the OR = 1 intercept, giving us a reference and the ability to assess multiple exposures for statistical significance visually as well as numerically. Here we see that `ground_beef` and `strawberry` are *close* to significance, but the confidence interval for every other exposure passes through that non-significant OR = 1 line, adding even more support to the hypothesis, and [indeed the true-life situation](https://www.cdc.gov/ecoli/2009/cookie-dough-6-30-2009.html), where raw cookie dough was the source of this E. coli outbreak.
You can find the raw data and code for this blog at [my personal GitHub](https://github.com/Averysaurus/epi_cc_analysis_many_models). Thank you for reading this far!