---
title: "Nested and Functional Programming For Case Control Analysis"
author: "Avery Richards"
date: "12/9/2021"
output:
  prettydoc::html_pretty:
    theme: hpstr
    highlight: github
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
In epidemiology, outbreak investigations often rely on case control studies to test hypotheses about a probable source of contagion among many exposures. In this blog I walk through a tidy approach to case control analysis, using a nesting and many-models approach I picked up from the [R for Data Science book](https://r4ds.had.co.nz/many-models.html) and the de-identified results of a CDC survey used during a classic investigation of an E. coli O157:H7 outbreak in the United States, circa 2009.
To begin, we will need to install and load a variety of libraries before we import our raw dataset.
```{r, warning = F, message = F}
# installing and loading packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
  rio,                 # import data
  tidyverse, janitor,  # shape data
  survival,            # statistics
  broom,               # model evaluation
  kableExtra,          # table output
  prettydoc            # knitting
)
# turn off scientific notation in output
options(scipen = 999)
# import dataset, clean names
ecoli_cc <- import("case_control_study_readacted.xlsx") %>%
janitor::clean_names()
```
Information from this study does not come to us in a pristine form. There are cases *(people who became ill)* observed in the dataset who do not have an equivalent control *(a matched person who did not become ill)*. Having identified the cases without controls, we can make a list of those `cdcid` values and filter them out, producing an updated line list, or `dataframe`.
```{r}
## first we make list of cases without matched controls
unmatched_cases <- c("CDC001", "CDC006", "CDC009", "CDC012",
"CDC015", "CDC022", "CDC023", "CDC024",
"CDC037", "CDC043", "CDC045", "CDC046",
"CDC048", "CDC049", "CDC067", "CDC068")
```
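As an aside, a list like this could also be assembled programmatically rather than by hand. Below is a minimal sketch, assuming every fully matched `cdcid` should appear at least twice in the data, once as a case and once as a control:
```{r, eval = FALSE}
# a sketch for finding unmatched cases: count rows per cdcid and
# keep the ids that appear only once (i.e., no matched control)
ecoli_cc %>%
  count(cdcid) %>%
  filter(n < 2) %>%
  pull(cdcid)
```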
One case has multiple controls assigned to it. We need to pluck the extra control from the data frame before we can continue.
```{r}
# some cases have multiple controls assigned.
ecoli_cc %>%
select(cdcid, case, controlletter) %>%
filter(cdcid == "CDC047")
```
```{r}
# filter out unmatched cases.
ecoli_cc_match <-
ecoli_cc %>%
filter(!(str_detect(cdcid,
paste(unmatched_cases, collapse = "|")))) %>%
filter(!(cdcid == "CDC047" & cdccaseid == "CDC047_B")) %>%
# create a strata number from cdcid strings
mutate(strata_num = as.numeric(stringr::str_extract_all(cdcid,"(..$)")))
# count to verify the case numbers match, 36 strata
ecoli_cc_match %>%
distinct(strata_num) %>% tally()
```
After verifying the 1:1 matching pattern of cases and controls present in the dataset, we select the food exposure variables, renaming the more cryptic values into human-friendly identifiers. We must also recode the `99` and `3` values sprinkled throughout the exposure data. These represent missing data in the survey responses as well as the always troubling *"I'm not sure"* answers received from participants.
```{r}
# Select food exposure variables from CDC questionnaire
cc_exposures <- ecoli_cc_match %>%
select(strata_num, case, strawberry,
apples, rollup, gb, rawcd, milk,
smoothie, cchip, carrot, cucumber,
raspberry, watermelon, nocdfzndes,
shopsmoothie, cantaloupe, mandarin,
grapes, bologna, hotdog, bacon) %>%
# rename variables
rename(ground_beef = gb, raw_cookie_dough = rawcd,
fruit_rollup = rollup,
chocolate_chips = cchip,
frozen_dessert = nocdfzndes,
storebought_smoothie = shopsmoothie)
# replace 99 and 3 with NA throughout the dataframe
cc_exposures <- map_df(cc_exposures, ~ na_if(.,"99"))
cc_exposures <- map_df(cc_exposures, ~ na_if(.,"3"))
# evaluate structure of new dataframe
str(cc_exposures)
```
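The two `map_df` calls above rebuild the whole dataframe twice. An equivalent, more current dplyr idiom is sketched below; it assumes the survey columns were imported as character, as the quoted `"99"` and `"3"` above suggest:
```{r, eval = FALSE}
# a sketch of the same recode in a single mutate(), using across();
# assumes the exposure columns are character vectors
cc_exposures <- cc_exposures %>%
  mutate(across(where(is.character), ~ na_if(na_if(.x, "99"), "3")))
```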
Here we are with a cleaned dataset of food exposures for case and control survey participants. Our next step is to fit a statistical model of some sort and generate odds ratios that estimate the odds of disease given exposure to each food item. There are 20 distinct food exposures to analyze. *So how can we conduct our tests in an organized and comparable way without repeating a stat function 20 times, then wrestling and herding the outputs together someplace to observe?* The best answer I can think of is using the tidyverse `nest()` function with `purrr::map`.
To do that, we first need to restructure the data so we can `group_by` the individual food exposures. Pivoting the wide exposure matrix into a long format works well.
```{r}
# pivot longer to create category for each food.
cc_pivot <- cc_exposures %>%
pivot_longer(cols = 3:22, values_to = "exposure",
names_to = "food")
dim(cc_pivot)
```
After pivoting, the dataframe contains identical data expressed differently: each observation, one per exposure and participant, is treated as a row in the long format. __72 (participants) * 20 (exposures) = 1440 rows__. Now we can `group_by` the distinct exposures and `nest()` our data via those groups.
```{r}
# group and nest by food category.
cc_nested <- cc_pivot %>%
group_by(food) %>%
nest()
head(cc_nested)
```
After pivoting, grouping, and nesting up our data we have a *dataframe of dataframes* sort of object, with each value in the `cc_nested$data` column being a `tibble` in itself that contains the strata, case status, and exposure values for each `food` item we categorized with `group_by`.
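For instance, pulling a single element out of the list column shows one of these per-food tibbles (a quick inspection, not a required step):
```{r, eval = FALSE}
# peek at the nested tibble for the first food group:
# 72 rows of strata_num, case, and exposure values
cc_nested$data[[1]]
```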
The next step is to put together a function that runs a conditional logistic regression model, so we can operate on all our nested dataframes without repeating the process for each exposure.
```{r}
# function to map clogit model on dataframe.
clogit_model <- function(data){
survival::clogit(case ~ exposure +
strata(strata_num), data = data) }
```
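Before mapping the function over all twenty exposures, we could sanity-check it on a single nested tibble. This is just a sketch of that check (output not shown):
```{r, eval = FALSE}
# fit the conditional logistic model to one food exposure
clogit_model(cc_nested$data[[1]])
```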
Once the function is put together, we can use a `mutate` call to create a new column that maps the `clogit` operation onto all the nested dataframes in one go. But wait, not so fast...
```{r}
# map clogit onto the nested "data" column to create a new column of model outputs.
cc_logit <- cc_nested %>%
mutate(model = map(data, clogit_model))
```
...we are getting a warning from R after running the models.
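If it is not obvious which of the twenty fits triggered the warning, one way to track it down is sketched below, using `purrr::quietly()` to capture any warnings alongside each model:
```{r, eval = FALSE}
# a sketch for locating the model(s) that warned during fitting
cc_nested %>%
  mutate(quiet_fit = map(data, purrr::quietly(clogit_model)),
         warnings = map_chr(quiet_fit, ~ paste(.x$warnings, collapse = "; "))) %>%
  filter(warnings != "") %>%
  pull(food)
```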
Due to the structure and limited amount of exposure data, one of our 20 models was unable to converge, and it returns wild OR outputs that do not seem plausible (*because they aren't*). There is a variation of the conditional logistic model, an "exact" method, that is unavailable in the R computing environment at this time. Having done our homework, we learn that the [original analysis of this data](https://academic.oup.com/cid/article-pdf/54/4/511/1105929/cir831.pdf) used STATA software, which does have an exact method available; it converged and returned the values below, so we can assemble those values to insert into our final tables.
```{r}
# outputs from a conditional exact model for the dataframe that did not converge.
added_output <- data.frame(food = "raw_cookie_dough",
estimate = 41.34,
p.value = .001,
conf.low = 7.37,
conf.high = Inf)
```
Once we have made a new column with the 20 model outputs, one for each nested food exposure (*including the erroneous one*), we can extract outputs using the `broom` package. Exponentiation is key to getting correct odds ratios from a logistic regression model, and confidence intervals are also necessary to evaluate the strength of our estimates beyond p-value thresholds, so we are sure to include those arguments in the `broom::tidy` function that we `map` onto our nested data frames.
```{r}
# summary of outputs.
cc_summary <- cc_logit %>%
mutate(outputs = map(model, broom::tidy,
# exponentiate and add confidence intervals.
exponentiate = TRUE, conf.int = TRUE))
head(cc_summary)
```
With the information from our models at the ready, we `unnest()` the model outputs, select the relevant columns, and organize them in a table object, replacing the results from the model that was unable to converge with `survival::clogit()`.
```{r, message = F}
# table of output summary of models
cc_summary %>%
unnest(outputs) %>%
# select the outputs relevant to our analysis
select(estimate, p.value, conf.low, conf.high) %>%
# round these values for legibility's sake
mutate(p.value = round(p.value, 3),
estimate = round(estimate, 2),
conf.low = round(conf.low, 3),
conf.high = round(conf.high, 3)) %>%
# remove the row (or model) that was unable to converge
filter(food != "raw_cookie_dough") %>%
# add fixed outputs from published stata analysis
bind_rows(added_output) %>%
arrange(estimate) %>%
# create print quality table object for ease of viewing
kbl(caption = "Output of Conditional Logistic Regression Models",
col.names = c("Food Consumed",
"Estimated OR",
"p-value",
"CI-low",
"CI-high")) %>%
kable_classic_2(full_width = F) %>% column_spec(2, bold = T)
```
It is also useful to explore counts and proportions of cases and controls who consumed the food items. Here we filter our pivoted data by case or control status, group again on the food exposure, then count and compute a proportion.
```{r}
# count and proportions of controls per food exposure
controls_tab <- cc_pivot %>%
filter(case == 0) %>%
group_by(food) %>%
summarise(controls = sum(exposure, na.rm = T)) %>%
# 36 matched controls in the data, so divide counts by 36
mutate(controls_percent = round(controls / 36, 2))
```
Now that we have created a table for the controls, we repeat the process for cases and `inner_join` the tables together. Finally, we add a similar wrapper for a print-quality table object.
```{r}
# join controls with cases
cc_pivot %>%
filter(case == 1) %>%
group_by(food) %>%
summarise(cases = sum(exposure, na.rm = T)) %>%
mutate(cases_percent = round(cases / 36, 2)) %>%
inner_join(controls_tab, by = "food") %>%
arrange(cases_percent) %>%
kbl(caption = "Counts and Percentages of Exposed Cases and Controls",
col.names = c("Food Consumed",
"Cases (No.)",
"Cases (%)",
"Controls (No.)",
"Controls (%)")) %>%
kable_classic_2(full_width = F)
```
You may notice, at the bottom of the table, the high percentage of cases who reported eating raw cookie dough. This is evidence that further supports raw cookie dough being the source of the outbreak. Another way to explore the summary data is with visualizations. We can visualize the estimated OR values with confidence intervals fairly simply in a `ggplot` bar graph.
```{r, fig.align="center"}
cc_summary %>%
# we must unnest and ungroup our data object
unnest(outputs) %>%
filter(food != "raw_cookie_dough") %>%
ungroup() %>%
# can't neglect the exact test measurements
add_row(food = "raw_cookie_dough", estimate = 41.34,
conf.low = 7.37, conf.high = Inf) %>%
## plotting ###
# plot with theme and titles
ggplot(aes(food, estimate)) +
geom_bar(stat = "identity", fill = "steelblue",
alpha = .6) +
# use conf values to create error bars (color and alpha are fixed
# values, so they belong outside aes rather than mapped inside it)
geom_errorbar(aes(x = food, ymin = conf.low, ymax = conf.high),
color = "green", alpha = .9) +
ylim(0, 42) +
coord_flip() +
# add a reference line at OR = 1
geom_hline(yintercept = 1, color = "purple", linetype = "dotdash") +
# themes and title
theme_bw() +
theme(axis.text.x = element_text(angle = 90),
legend.position = "none") +
# title and labels
ylab("Estimated OR with Confidence Interval") +
xlab("Food Exposures") +
ggtitle("OR Estimates of Disease Given Food Exposure Status")
```
Even though the scale of this visualization is stretched by the `raw_cookie_dough` estimate, we can add a `geom_hline` at the OR = 1 intercept, giving us a reference and the ability to assess multiple exposures for statistical significance visually as well as numerically. Here we see that `ground_beef` and `strawberry` are *close* to significance, but the confidence interval for every other exposure passes through that non-significant OR = 1 line, adding even more support to the hypothesis, and [indeed the true-life situation](https://www.cdc.gov/ecoli/2009/cookie-dough-6-30-2009.html), where raw cookie dough was the source of this E. coli outbreak.
You can find the raw data and code for this blog at [my personal GitHub](https://github.com/Averysaurus/epi_cc_analysis_many_models). Thank you for reading this far!