-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathtutorial.Rmd
804 lines (662 loc) · 32.8 KB
/
tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
---
title: "Prostate Cancer Case Study"
author: Tomas Bencomo and Kyle W. Singleton
output:
html_document:
df_print: paged
toc: yes
toc_depth: '3'
toc_float:
collapsed: yes
html_notebook:
toc: yes
toc_depth: 3
toc_float:
collapsed: yes
---
## Introduction
The presentation covered
concepts we should be thinking about as
we design studies and analyze survival data.
This notebook will put these ideas into practice
on real data using `R`.
Before reading this notebook, you should already know
`R` basics like calling functions,
loading data, manipulating dataframes, and fitting models.
It's also important to understand
concepts like hypothesis testing, p-values, confidence
intervals, linear regression, and hazard ratios. If you're unsure
of your `R` skills, Grolemund and Wickham have a good
[book](https://r4ds.had.co.nz)
that explains how to use `R` for data analysis.
Brown University's
[Seeing Theory](https://seeing-theory.brown.edu/#firstPage)
website is a good source for anyone needing to refresh
their statistics.
In this notebook we'll analyze survival data from
a randomized clinical trial (RCT) of estrogen therapy for
prostate cancer from Green and Byar
(1980, Bulletin Cancer, Paris, 67, 477-488). The dataset
has information on 502 patients treated with different
doses of estrogen and includes several clinical
measurements (also known as covariates).
Before we begin, remember that statistics is complicated
and it's always good to ask for help from a statistician!
## Defining Analytical Questions
Before we begin our analysis, we should have already prespecified what
questions we want to answer. Designing questions to investigate
beforehand helps us plan the statistical analysis and mitigates bias.
If we're conducting an exploratory analysis or data mining investigation,
we should remember this as we weigh any significant findings.
In this tutorial we'll focus on the following questions:
1. What effect does treatment have on survival?
2. How do we estimate treatment effects adjusting for age?
3. Does bone metastases effect treatment efficacy?
## Data Preparation
First, load the necessary packages.
```{r, warning=FALSE, message=FALSE}
library(rms)
library(rpart)
library(dplyr)
library(ggplot2)
library(mice)
library(stringr)
library(tidyr)
```
`rms` provides tools for regression modeling. `dplyr` manipulate dataframes.
`ggplot2` makes data visualization easy. `rpart`, `mice`, and `tidyr` help handle
missing data. `stringr` makes it easier to work
with strings.
Next, we load the `prostate` dataset, included in the `rms` package.
```{r}
getHdata(prostate)
head(prostate)
```
Before modeling it's important
to conduct Exploratory Data Analysis (EDA) to become familiar
with the data. This often includes looking
at the distribution of each variable, understanding patterns
between variables, and checking for outliers. For simplicity
of discussion, we do not perform EDA in this tutorial.
Instead, see Grolemund and Wickham's
[R For Data Science](https://r4ds.had.co.nz/exploratory-data-analysis.html)
for details on conducting EDA in `R`.
### Data Cleaning
During EDA, we would have noticed some problems with the dataset
that we'll need to fix.
`status` is encoded with the cause of death. Left unchanged,
this will break our models because they are built to only
handle either a dead or censored status.
Let's treat all causes of death as equal and encode
status so 1 means death and 0 means censored.
Note: Be careful
that you're aware how death is encoded! If death
was encoded as 0 and censored as 1, this would
flip the interpretation of the hazard ratios.
```{r}
prostate <- prostate %>%
mutate(status = case_when(
str_detect(status, "dead") ~ 1,
str_detect(status, "alive") ~ 0
))
head(prostate)
```
In addition, categorical variables need to be converted to factors
```{r}
prostate <- prostate %>%
mutate(stage = factor(stage),
hx = factor(stage),
bm = factor(bm))
```
Fixing `status` and encoding categorical variables as factors
are the only fixes we need in this example. In the real
world, data is almost never this clean. Be careful to check that everything
is encoded properly before starting analysis. This includes making
sure categorical variables are properly encoded as factors.
Check the reference level for each factor.
The reference level is the "category" that is used to compare all other categories
against. For example, `rx` has 4 categories:
```{r}
levels(prostate$rx)
```
"placebo" is the first level, meaning it's the reference level. This makes sense,
as we'll want to compare estrogen therapy with the placebo. Use the `relevel()` command
if you need to reorder your factor. Failure to make sure
all appropriate variables are properly encoded can cause errors in
the analysis.
### Missing Data
EDA often finds missing data.
Understanding and addressing missing data is crucial for
modeling success. Excluding patients due to missing data can
decrease sample size, reducing power. It also risks biasing
the analysis if missing values are missing for a systematic reason.
For example, say blood pressure is only measured after surgery.
If some patients die during surgery, they'll be missing
post-operation blood pressure measurements. Excluding these patients from
analysis will cause selection bias as we are only analyzing patients
that managed to survive surgery.
There are three types of missing data:
* Missing at Completely Random (MCAR) - Data is missing due
to errors like a dropped test tube in the hospital's lab.
Because these data are missing due to random chance, it's
impossible to predict their missingness.
* Missing at Random (MAR) - Data is not missing completely at
random, but the probability of data being missing depends on the
other measured variables. We can try to guess these missing values.
* Informative Missing (MI) - Missing data is dependent
on some piece of information we haven't measured. This is the
most difficult type of missing data to account for, and special
modeling techniques are often needed. If your data is MI, go
find a statistician.
It's important to characterize what data is missing
and why. It's best to consult with a domain expert
to understand why data is missing. Make sure to describe missing
data patterns when reporting results so readers can understand
limitations of your analysis.
To help uncover these patterns, `rms` and `mice` provide functions
to identify missing data patterns. First let's see which variables
are missing values in our dataset.
```{r}
result <- md.pattern(prostate, rotate.names = TRUE)
```
This tells us that 475 patients have complete data (blue indicates no missing values).
11 patients are missing `sg` measurements (red indicates missingness) and so on.
No patients are missing multiple measurements.
`naclus` and `naplot` also summarize info on data missingness.
See the `naplot()` documentation for more info.
```{r}
na.patterns <- naclus(prostate)
naplot(na.patterns)
```
From these plots, we can see `age`, `wt`, `sz`, `ekg`, and `sg` all have missing values.
We can also use decision trees to identify which types of patients were likely
to have `sg` missing. This builds a model to predict which patients will be missing
`sg` based on other patient characteristics. Note the formula syntax in the `rpart`
call.
```{r}
who.na <- rpart(is.na(sg) ~ stage + rx + pf + hx + sbp + dbp +
hg + ap + bm, data = prostate, minbucket=15)
plot(who.na, margin = .1)
text(who.na)
```
It looks like `dbp` and `hg` are predictors of `sg` missingness.
Now would be a good time to consult the clinician to ask why `dbp`
and `hg` are predictive of missing `sg`.
It's also good to know which variables are missing simultaneously from
patients. The code below demonstrates this function
```{r}
plot(na.patterns)
```
In this case, no patients are missing multiple values so the plot looks empty.
If more data was missing, the plot would show a clustering map that
groups variables missing together most often. This plot can help us identify
which variables may be dependent on each other (they may be measured at the same time).
The empty plot indicates there isn't a systematic issue with missing data. To see
more examples with `plot(na.patterns)`, refer to Frank Harrell's
[Regression Modeling Strategies (RMS)](http://hbiostat.org/doc/rms.pdf)
Chapter 12.4 for a case study on missing data. You'll see a proper
clustering example there.
After we've characterized missing data patterns, it's time to decide
how to tackle our missing data.
Harrell proposes a general rule of thumb when deciding how to approach missing data:
| Amount Missing | What to do |
|----------------------------------------|--------------------------------------------|
| Less than 3% | Median imputation or case-wise deletion |
| More than 3% | MICE with max(5, 100x) imputations |
| Multiple predictors frequently missing | Sensitivity analysis with more imputations |
Median imputation is where we compute the median value for a variable, and replace
all missing values with this median. Case-wise deletion means exlcuding patients
with any missing values from analysis. Multiple Imputation By Chained
Equation (MICE) is an algorithm that tries to predict
missing data values. If a sensitivity analysis is needed,
consult a statistician.
MICE has several assumptions that must be met for the algorithm to work
properly. Using MICE requires two key decisions:
1. Is it plausible to assume our missing data is MAR? If we can't assume MAR,
time to find a statistician.
2. Assuming MAR, how will we model our missing data? What variables
should we use to guess missing values?
Decision 1) ultimately comes down to whether we think
there are unmeasured variables that **significantly** correlate with
missingness. Unmeasured variables that are only slightly correlated with
missingness should not prevent us from assuming MAR. We can also gather
additional information outside of our survival analysis variables
to build the MICE model. For example, we may not be interested in analyzing
the effect of hospital location on survival, but which hospital the patient
is from may determine if they have missing data. We can include this variable
in our imputation model if it will help predict missingness even though
we don't plan to analyze the effect of hospital on survival. See section
6.2 in Stef van Burren's
[Flexible Imputation of Missing Data (FIMD)](https://stefvanbuuren.name/fimd/sec-whenignorable.html)
for more details on deciding about MAR assumptions.
For decision 2), we can model missing data using MICE. We tell MICE which variables
affect missingness, and then MICE uses these variables to
generate a dataset with predicted values for missing data. Each predicted
value has a random error component. To properly use imputation, we generate
several datasets with MICE, analyze each dataset separately, and then
average our results across datasets. Using multiple datasets
retains the uncertainty from data being missing.
To use MICE, we need to specify our imputation model: which variables to include,
what imputation method to use,
and what order we will impute variables. MICE chooses
robust defaults for all of these decisions. Predictive
Mean Matching `pmm` is the default imputation method.
When we pass `mice` a dataframe it will use all variables
in the dataframe for the imputation model. If there are more than
20-30 variables in the imputation model, MICE will slow down and
may not run. In this case, we can tell MICE to only use some
variables. Make sure to include all variables
that you'll be analyzing (anything that is going in the Cox regression)
and any variables that you suspect play a significant role in missingness.
See FIMD
[section 6.3](https://stefvanbuuren.name/fimd/sec-modelform.html)
for how to set which variables to include and more info
on choosing variables.
Now let's actually impute our missing data with MICE!
```{r echo=T, eval=FALSE}
impute_transform <- prostate %>%
select(-patno, sdate) %>%
mice(m = 5, method = 'pmm')
impute_fit <- fit.mult.impute(Surv(dtime, status) ~ rx + stage + rcs(age, 3),
cph, impute_transform, data = prostate)
```
This code does two things. First, we use `mice()` to impute our missing data.
The first argument `prostate` should be the dataframe
with missing values and any extra variables to predict missing values.
Notice we removed `patno` and `sdate` before using `mice()` because they
don't provide any information about missingness. Including them would only have
slowed down or worsened the imputation.
`m` is the number of datasets to generate. `method = 'pmm'` tells MICE
which imputation algorithm to use.
After mice has run, `fit.mult.impute()` will build a model for each
imputed dataset and average the models for us. The first argument is
the regression formula you desire - don't worry about the formula above,
this will be explained in more detail later. `cph` is the type of regression
model you want to use. `impute_transform` is the imputation object `mice`
created for us.
Missing data can be a complex and important topic when modeling
survival data. For more info, check out RMS Chapter 3 or van Burren's
FIMD (both referenced above). Because so few values are missing, we'll opt
for casewise completion for the rest of this tutorial.
```{r}
prostate <- prostate %>%
drop_na()
```
## Model Fitting
With our data clean, we can move on to modeling.
Cox Proportional Hazards (PH) regression is one of the
most popular forms of survival analysis because
we can analyze many variables at once with assumptions
that are often reasonable for clinical practice.
If you're not familiar with Generalized Linear Models
(GLMs), it's helpful to think of Cox regression as
a special form of linear regression. Unlike
linear regression, the response variable $Y$
in Cox regression is censored survival outcome and
the explanatory variable $\beta$'s are log hazard ratios.
The R packages `survival` and `rms` provide Cox
regression functionality. In `survival`, users
call `coxph()` while `cph()` is used in `rms`.
We'll focus on `cph()` because both work similarly
and `cph` adds some extra functionality.
Before we can start modeling, we need
to tell `rms` how the data is formatted.
```{r}
ddist <- datadist(prostate)
options(datadist="ddist")
```
We're trying to evaluate effect of estrogen therapy `rx`
on survival. Because patients were randomized to treatment
or no treatment, we just need to include treatment in our Cox
model to estimate the effect of estrogen therapy at different doses.
```{r}
rx.fit <- cph(Surv(dtime, status) ~ rx, data = prostate, x=T, y=T)
rx.fit
```
`cph` uses the formula notation to specify regression models. In
formula notation, the response variable (usually denoted `Y`) is placed
on the left hand side of the `~` and explanatory variables go
on the right hand side (usually denoted `x1+x2+...`).
Our response variable `Y` is censored survival time outcomes,
represented by `Surv(dtime, status)`. `Surv` uses time to event (`dtime`)
and the outcome (`status`) to convert this info into a format
readable by `R`. On the right hand side we've included `rx`
as our explanatory variable.
Printing `fit` shows the log(Hazard Ratio) under `Coef` and p-value for
each explanatory variable. `rx` shows up multiple times because
it is a factor with multiple levels. Each coefficent for `rx` is
for that dose compared against control (`placebo`).
The arguments `x=T, y=T` tells `R` to store the dataset
for plotting explanatory variable relationships.
We can plot the results from our model to visualize the hazard ratios
```{r}
ggplot(Predict(rx.fit, rx = c("0.2 mg estrogen", "1.0 mg estrogen", "5.0 mg estrogen"))) +
geom_hline(aes(yintercept = 0), color = "red")
```
Remember that log(HR) < 0 indicates improved survival and log(HR) > 0
means worse survival.
It's important to consider
the uncertainty surrounding the effects estimates when deciding
if a treatment works. A small p-value and decently large treatment
effect suggests that 1.0 mg estrogen improves
treatment survival. Large p-values and smaller effects estimates
mean we don't have enough evidence to claim 0.2 mg or 5.0 mg
estrogen actually cause harm. This lack of evidence shouldn't rule out
the idea that 0.2 mg or 5.0 mg of estrogen can cause harm though - absence
of evidence isn't evidence of absence. It just means we can't be certain
either way and more data is needed. Although we are not doctors, some hypothesize
there is an estrogen dosage "sweet spot" that causes 1.0 mg estrogen to improve
survival. The idea is that too little estrogen (0.2 mg) isn't strong enough
to treat the cancer, while too much estrogen (5.0 mg) increases the risk of
cardiovascular disease. Although this may be true, our results above indicate
we'd need more evidence to make these conclusions.
Finally, results are typically presented as HR in papers, instead of
log(HR). So let's reformat the model coefficients to evaluate the effect of
the treatments on change in hazard.
```{r}
exp(rx.fit$coefficients)
```
Remember that HR < 1 indicates improved survival and HR > 1
means worse survival.
Given the treatment effect and significance above for 1.0 mg of estrogen,
the results indicate 1.0 mg of estrogen cause a `r 1 - exp(rx.fit$coefficients[2])`
reduction in hazard.Other doses of estrogen actually had small increases in
effect on risk of death in the study population, but those effects were
not conclusively proven to be different than treatment with placebo.
### Covariate Adjustment
Because RCTs randomize patients to treatment or an alternative
(placebo or other intervention), simply estimating the treatment's hazard
ratio is enough to determine treatment effect because the randomization
deals with confounding. This is called an unadjusted analysis because we
didn't adjust for any factors.
Although unadjusted analyses are acceptable for RCTs, there are benefits
to adjusted analyses. Adjusting for strong prognostic factors in RCTs can
increase power and improve effects estimates
([1](http://www.sciencedirect.com/science/article/pii/S0895435616001190)),
([2](https://lesslikely.com/statistics/equal-covariates/)),
and ([3](https://discourse.datamethods.org/t/reference-collection-to-push-back-against-common-statistical-myths/1787)).
To adjust for other variables, we add them as explanatory variables to our
Cox model.
In observational studies, it's critical to adjust for confounding variables
because there is no randomization to control confounding and covariate imbalance.
Observational studies usually adjust for more variables than RCTs because
without randomization there are infinitely many variables that can confound
a study. Researchers worry about lurking variables which are variables
that confound the results but don't get adjusted for because we don't know
they exist. Uncertainty about lurking variables is why we should interpret
observational studies with considerable doubt compared to RCTs.
`age` is often a strong prognostic factor, especially in oncology; the older
we get, the harder it is to tolerate and recover from harsh treatments like
chemotherapy. We can adjust for age by adding `age` to our regression
```{r}
fit <- cph(Surv(dtime, status) ~ rx + age, data = prostate, x=T, y=T)
fit
```
Age is treated as a continuous variable, with the assumption
that age is linearly related to survival. With our data,
this means that for every additional year patients hazard will increase
by $1 - e^{0.0322} =$ `r 1 - exp(0.0322)`.
To visualize age's relation to survival, we plot
age vs hazard
```{r}
ggplot(Predict(fit, age)) +
labs(x = "Age in Years") +
ggtitle("Relationship between age and hazard")
```
Remember that log(HR) > 0 means an increased risk of death.
To plot hazard ratio instead, set `fun = exp` inside `Predict()`. The plot
shows that patients over the age of 70 have an increased risk of death from
prostate cancer.
A naive reader might see this hazard ratio and the p-value for
age (p=.003) and conclude age causes an increased risk of death. Careful! Although age
may be causally linked to death, this RCT wasn't designed to test age's causal
effect on survival, only its association. Hazard ratios and p-values alone
can never determine causality for a variable - causality is determined
by the experimental design. Avoid making causal statements just because
of small p-values!
### Relaxing the Linearity Assumption
Some continuous variables may not be linearly related to survival.
When we have sufficient sample size, it's optimal to treat
these variables as nonlinear. This will improve power
to detect a relationship and can improve effect estimation.
`rms` uses `rcs()` to model nonlinear variable with
restricted cubic splines. We can use splines to treat
age as if it's nonlinearly related to survival.
```{r}
nonlinear.fit <- cph(Surv(dtime, status) ~ rx + rcs(age,5), data = prostate, x=T, y=T)
nonlinear.fit
```
We must choose how many knots to use when modeling nonlinear
variables. The number of knots determines how closely the
regression fits the data. It's recommended to use 3 to 5 knots
for most problems: 3 for small datasets (~30 samples),
4 for medium datasets (~60 samples), and 5 for large datasets
(~100 samples). These sample sizes are for uncensored data,
so in most real world scenarios more samples will be needed
when dealing with censored data. If these rules
seem arbitrary to the reader and you'd like to choose the number
of knots based on the data, Harrell recommends choosing the number
of knots that maximizes model likelihood $\chi^2 - 2k$ where $k$
is the number of model parameters and $\chi^2$ is the $\chi^2$
statistic calculated from the model's maximum likelihood estimate.
This is `LR chi2` displayed in the `cph()` output under `Model Tests`.
See RMS 2.4.6 for more information
on restricted cubic splines and knot selection.
Multiple parameters are needed to model a nonlinear variable
(`age` and `age'`, `age''`, and `age'''`). This makes interpretting
results more complicated.
Instead of just looking at the coefficient for `age`, we need to plot
age vs hazard.
```{r}
ggplot(Predict(nonlinear.fit, age)) +
labs(x = "Age in Years") +
ggtitle("Relationship between age and hazard")
```
A one year increase in age no longer has a simple effect on survival.
It seems patients younger than 70 are all at a relatively equal
lower probability of death from prostate cancer. After about 70 years of age, patients
begin to experience increasing risk of death. P-values are also not as straightforward to
interpret. Because age can have a non-monotonic relationship with
survival, we need a p-value to test if the nonlinear age function composed of several
variables is associated with survival. Do
not try to interpret the individual p-values for the age parameters
(`age` and `age'` etc).
Using a chunk ANOVA test, we can compute p-values
testing whether each predictor is associated with survival.
```{r}
anova(nonlinear.fit)
```
Both `rx` and `age` have strong evidence
suggesting they're associated with survival.
The nonlinear p-value subsection under `age` tests
whether there is evidence of a nonlinear relationship
between age and survival. If the p-value is large,
this may mean age is linearly related to survival.
Even if adjustment variables appear "nonsignificant" (p > .05),
don't remove them from the model! Adjusting for variables
with p > .05 still improves our estimates. Removing "insignificant"
variables would also violate p-value inference rules.
It's best to prespecify the adjustment variables included in the model before
analyzing data and then reporting the full model with all variables.
See Tomas' Cross Validated
[post](https://stats.stackexchange.com/questions/405832/interpretting-cox-regression-anova)
for more details on interpretting the ANOVA.
### Subgroup Analysis Using Interactions
Precision medicine emphasizes treatment effect
heterogeneity: due to differing patient characteristics,
a drug that may work for one person won't necessarily
work for another. These concerns often drive researchers
to investigate subgroup differences. Subgroups can
be derived from any stratifying feature; male/female or mutation/no
mutation are common stratifying factors.
To investigate subgroup differences, it's common for researchers
to divide patients into separate cohorts based on the stratifying
factor, perform a separate analysis for each cohort, and then
compare findings between cohorts. For a survival analysis, this
could mean dividing patients into male and female cohorts,
fitting Cox regressions for each cohort, and then comparing
the hazard ratios and p-values between males and females. If any of
the hazard ratios or p-values differ, investigators will often claim
effect heterogeneity between sexes.
Unfortunately there are problems with this analysis. First,
splitting patients into separate cohorts reduces sample size,
making it more difficult to detect effects. This is especially
problematic when the subgroups are imbalanced. If there were 4x
more males than females, it will be easier to find a "significant"
hazard ratio for the males than the females because there are more
samples. Finding a significant hazard ratio in one cohort but not
the other doesn't necessarily mean the effect sizes are different;
a smaller sample size in the female cohort may be making the p-value
larger even though the hazard ratios are the same.
Second, subgroup analysis doesn't provide any means to test or
estimate the effect difference between cohorts. Although males may have
a different hazard ratio than females, how can we be sure this difference
isn't due to sampling error? There also isn't an easy
way to build a confidence interval for the hazard ratio difference
between sexes. Without a p-value or confidence interval, it's hard
to be confident that the male/female difference we're observing
is truly real.
To remedy these problems, we should use interaction terms
in our regression formulas. An interaction term models how the hazard
ratio of one variable changes for different values of another variable.
An interaction term between treatment and male/female could
be used to assess whether female patients experience a
better response than male patients. We can also compute p-values
and confidence intervals
for interaction terms, helping us understand the uncertainty around
the treatment effect heterogeneity.
Interactions are easy to include in `R` formulas
```{r}
interact.fit <- cph(Surv(dtime, status) ~ rx*bm, data = prostate, x=T, y=T)
interact.fit
```
Above we've included an interaction term between treatment `rx` and
bone metastasis `bm`,
useful if we suspect treatment efficacy differs between patients
with and without bone metastases. In `R`, `x1*x2` specifies an interaction
term between `x1` and `x2` and will also include `x1` and `x2` as separate
terms. It's equivalent to writing `x1 + x2 + x1*x2`. We can look at the hazard ratios
and p-values
to weigh the evidence for treatment differences among bone metastases groups.
0.2 mg and 1.0 mg of estrogen have large p-values, suggesting there may be no
difference in treatment effect between patients with/without bone metastasis.
Although this seems logical from the p-values, be cautious interpretting
interactions; most interactions are underpowered because sample
sizes are often halved (or worse) due to the nature of comparing two
separate, smaller subgroups. Interactions often require
[16x the number of patients](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3193873/)
to achieve the same power as normal variables, so unless
the analysis was powered for an interaction, interpret
these results carefully.
Meanwhile 5.0 mg estrogen's small p-value suggests evidence for
treatment effect heterogeneity. `bm` is an indicator variable
with a status of 0 (no metastasis) or 1 (bone metastasis).
We interpret the interaction as follows: if `bm = 1`, the coefficient
would be -.8701, a rather large hazard reduction. If `bm = 0`,
the coefficient is 0, because $0 * -.8701 = 0$. It seems
patients with bone metastases show a much better response
to 5.0 mg of estrogen than patients without bone metastases.
`anova()` will also tell us the overall effect of the
interaction
```{r}
anova(interact.fit)
```
Make sure to check the overall interaction's p-value with
anova before looking at individual level p-values. If the overall
interaction isn't meaningfully small, the individual terms should be considered
weak evidence. If at least one of your interaction terms includes a nonlinear
continuous variable, make sure to use ANOVA to compute p-values - don't
look at the individual p-values!
A final note on interpretting interaction terms. If there is enough
evidence to believe the interaction term is non-null, to calculate
the final hazard ratio for a variable, we must add all log hazard ratios
that relate to that variable.
Seeing the Cox regression formula helps us understand what coefficients
we need to add. For our model with `rx`, `bm` and their interaction,
the regression looks like
$$
\lambda(t) = \lambda_0(t)\text{exp}({\beta_1x_1 + \beta_2x_2+\beta_3x_3 + \beta_4x_4 + \beta_5x_1x_4 + \beta_6x_2x_4 + \beta_7x_3x_4})
$$
where $x_1$ indicates 0.2 mg estrogen, $x_2$ indicates 1.0 mg estrogen, $x_3$ indicates
5.0 mg estrogen, and $x_4$ indicates bone metastasis. $\beta_i$ is the log hazard ratio
for the $i$th variable. $\beta_5$, $\beta_6$, and $\beta_7$
represent the interaction term coefficients. To compute the full hazard
for bone metastasis ($x_4 = 1$) patients treated with 5.0 mg
estrogen ($x_3 = 1$ and $x_1, x_2 = 0$) we'd compute
$$
\text{exp}({\beta_3 + \beta_4 + \beta_7}) = \text{HR}\\
\text{exp}({0.1112 + 1.0804 -0.8701}) = 1.379195
$$
This is confirmed when we plot the hazard for bone metastasis for
patients treated with 5.0 mg estrogen.
```{r}
ggplot(Predict(interact.fit, rx=c("5.0 mg estrogen"), bm, fun = exp)) +
labs(y = "Relative Hazard Ratio") +
ggtitle("Relationship between bone metastasis and survival")
```
The Analysis Factor provides more information on interpretting
interaction terms
[here](https://www.theanalysisfactor.com/interpreting-interactions-in-regression/).
### Checking Model Assumptions
Once we've fitted our model, we need to check our
assumptions before accepting our results. The important
assumption in Cox regression is the Proportional
Hazards (PH) assumption. This means that the hazard
ratio for any variables in the model stays constant over time.
We can test these assumptions with `cox.zph()`
```{r}
cox.zph(fit)
```
This code tests the proportional hazards
assumption for individual variables and the overall model.
Above we tested our first model that included
treatment and linear age. There's evidence that the
hazard ratio for age changes over time, and this
is problematic for our model (GLOBAL p=.02159).
A small p-value suggests the PH assumption is broken.
When continuous variables break the PH assumption,
modeling them as nonlinear variables can correct
the PH problems.
```{r}
cox.zph(nonlinear.fit)
```
The GLOBAL p-value now suggests the overall PH assumption
is satisfied.
When categorical variables break the PH assumption,
stratification can fix these issues. Stratification
will usually eliminate PH problems at the loss of
parameter estimation for that variable; you'll no
longer have a hazard ratio for the stratification
variable. Pretend `bm` was having PH problems, to
stratify by `bm` we'd code
```{r}
cph(Surv(dtime, status) ~ rx + strat(bm), data = prostate, x=T, y=T)
```
Stratification can also be used to adjust for variables whose hazard
ratio isn't of interest but we'd still like to adjust for differences
caused by this variable. Multicenter clinical trials will often
stratify by hospital in their analysis as investigators want
to adjust for care differences attributed to each hospital but
usually aren't interested in the actual hazard ratio for each hosptial.
If you're still having trouble with PH assumptions
after trying nonlinear modeling and stratification,
Cox may not be appropriate. Accelerated
time failure (AFT) models may be more appropriate for your data.
See RMS Chapter 18 for information on AFT models.
## Conclusion
During this tutorial we've covered data preparation,
missing data, and survival analysis with Cox regression.
Our explanations for each topic are just the tip of the iceberg.
Missing data is a complex topic with considerable research to
support imputation methods. See [Flexible Imputation
For Missing Data](https://stefvanbuuren.name/fimd/) by Stef
van Burren for an indepth look at imputation.
[Sterne et al](https://www.bmj.com/content/338/bmj.b2393) also have
a good summary of imputation for the non-statistician.
Although we focused on effect estimation with the Cox model,
regression modeling is also used for prediction. Frank Harrell's
[Regression Modeling Strategies](http://hbiostat.org/doc/rms.pdf)
covers regression and its applications in much more depth.
Finally, Sashegyi and Ferry have a good explanation of hazard
ratios and how to communicate their meaning with clinicians
[here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5388384/).