forked from epiforecasts/covid
-
Notifications
You must be signed in to change notification settings - Fork 0
/
methods.Rmd
68 lines (41 loc) · 8.65 KB
/
methods.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
---
title: "Methods"
description: |
Data sources, accounting for report delays and statistical methods.
bibliography:: library.bib
output:
distill::distill_article:
self_contained: true
toc: true
toc_depth: 2
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
## Data
### National data
We used daily counts of confirmed cases reported by the European Centre for Disease Control for all analyses conducted at the national level[@ecdc_data; @NCoVUtils]. To estimate the delay from case onset date until confirmation date we used all cases from a publicly available linelist for which onset and confirmation dates are available[@kraemer2020epidemiological; @NCoVUtils]. Countries are only included in the reported estimates if at least 100 cases have been reported in a single day.
### Subnational data sources
For sub-national analyses, the source of the data is reported on each page, the data are fetched from government departments or from individuals who maintain a data source if no official data are available. Regions within countries are only reported if at least 40 cases have been reported in a single day.
## Adjusting for reporting delays
To estimate the reporting delay, we fit exponential and gamma distributions to the number of days between symptom onset and case confirmation, accounting for left and right censoring occurring in the data as each date is rounded to the nearest day. We fit each model in the statistical modelling program stan[@rstan] and compared to goodness-of-fit of each distribution to the data by comparing the approximate leave-one-out cross-validation information criterion (LOOIC)[@Vehtari2016].
The distribution that gives the lowest LOOIC was selected as the most appropriate and 1000 samples of the fitted distribution parameters were then drawn (rate for the exponential distribution, and shape and scale for the gamma distribution). For a given country, we used sample $i$ from the posterior distribution of delay distribution parameters, $\Theta_i$, to draw a sample of delays, $d_i$, to transform each observed confirmation date, $c_i$, into a sample onset date, $o_i$, as follows:
$o_i = c_i - d_i$,
where $d_i \sim exp(\Theta_i)$ or $gamma(\Theta_i)$
This resulted in 1000 date of onset samples for each confirmed case.
## Adjusting for right-censoring of confirmation dates
When moving from confirmation dates to onset dates it is important to consider that the total number of confirmed cases lags behind the number of cases that have onset, since there is a delay between onset occurring and the case being counted upon confirmation. To account for this right censoring, we used binomial upscaling to increase the estimated numbers of case onsets close to the present. After transforming the observed confirmation dates to onset dates and tallying case onset numbers by day, we then drew a sample of the number of case onsets that occurred but have not yet been confirmed.
If $t$ is the present date, then the number of onsets on day $t-k$, $o_{t-k}$ is used to draw the number of unconfirmed onsets on day $t-k$, $o^{*}_{t-k}$ from a negative binomial distribution as follows:
$o^{*}_{t-k} \sim negbin(size = o_{t-k} + 1, prob = f(t-k, \Theta_i))$
where $f(t-k, \Theta_i))$ is the cumulative distribution function for the given delay distribution, it gives the proportion of onset cases from $t-k$ days ago that are expected to have been confirmed over the k days from that time until the present. The final numbers of case onsets that were used to estimate the time varying reproduction numbers for day $t$ are given by $o_t + o^{*}_{t}$. As our approach could not fully reconstruct unreported cases without bias we truncated our results and did not use estimates from the last 4 days. Finally, the date of infection for each case was estimated to be 5 days prior to the date of symptom onset [@10.7326/M20-0504].
## Estimating the time-varying reproduction number
We used the *EpiEstim* R package[@EpiEstim; @R] to fit a model that estimated the time-varying reproduction number from the daily number of case onsets and a specified serial interval distribution [@cori2013; @wallinga2004]. We used a gamma prior for the reproduction number with mean 2.6 and standard deviation 2. This is based on an estimate of the R0 from the initial stages of the outbreak in Wuhan[@Imai:webreport3] but also has long tails to allow for differences in the reproduction number between countries. Where data was available, we also used *EpiEstim* to adjust for imported cases[@THOMPSON2019100356].
We incorporated uncertainty in the serial interval distribution by providing *EpiEstim* with 1000 samples from the serial interval mean of 4.7 days (95% CrI: 3.7, 6.0) and the standard deviation of 2.9 days (95% CrI: 1.9, 4.9) from previous analysis[@Nishiura2020.02.03.20019497]. We evaluated window lengths from 1 to 7 days, running *EpiEstim* separately for each window choice. We therefore fitted the *EpiEstim* model 1000 times overall for each window, each time using a different combination of sampled onset dates and serial interval distribution. The optimal window was selected by first estimating the one day ahead number of cases implied by each time-varying reproduction number estimate[@NOUVELLET201829] and then scoring this nowcast against the observed number of cases using the RPS score[@gneiting_strictly_2007; @jordan_evaluating_2019]. For each sample the window with the lowest median RPS score was selected.
The estimates of the time-varying reproduction number at each time point were combined over all 1000 samples, using the optimal window for each sample, to give a credible interval that incorporates uncertainty from the delay from case onset to confirmation and the length of the serial interval.
## Estimated change in daily cases
We defined the estimated change in daily cases to correspond to the proportion of reproduction number estimates for the current day that are below 1 (the value at which an outbreak is in decline). It was assumed that if less than 5% of samples were subcritical then an increase in cases was definite, if less than 20% of samples were subcritical then an increase in cases was likely, if more than 80% of samples were subcritical then a decrease in cases was likely and if more than 95% of samples were subcritical then a decrease in cases was definite. For countries/regions with between 20% and 80% of samples being subcritical we could not make a statement about the likely change in cases (defined as unsure).
We estimated the rate of spread ($r$) using linear regression with time as the only exposure and the logarithm of cases as the outcome for the overall course of the outbreak [@Park2019]. The adjusted $R^2$ value of the regression fit was then used to assess the goodness-of-fit. In order to account for potential changes in the rate of spread over the course of the outbreak we used a 7-day sliding window to produce time-varying estimates of the rate of spread and the corresponding adjusted $R^2$. The doubling time was then estimated by calculating $\text{ln}(2) \frac{1}{r}$ for each estimate of the rate of spread.
## The effect of changes in testing procedure
The results presented here are sensitive to changes in COVID-19 testing practices and the level of effort put into detecting COVID-19 cases. If a country expands their testing capacity and begins to report a higher proportion of cases then the model will fit a higher reproduction number value to account for this becasue it only understands new cases in terms of the infectiousness of previously reported cases and not as a result of better testing. On the other hand, if a country reduces their testing effort (for example, hitting their test capacity or running out of tests) then the model will estimate a drop in the reproduction number that may not be a true reduction. What is important for these results to be unbiased by testing is consistency in the level of testing effort rather than the actual level of testing effort. This means that whilst a change in testing effort will initially introduce bias, this will be reduced over time as long as the testing effort remains consistent from this point on.
## Code and data availability
We report the median and 90% highest density intervals for all measures. The analysis was conducted independently for all regions and is updated regularly as new data becomes available. Confidence in our estimates is shown using the proportion of data that were derived using binomial upscaling, confidence decreases as estimates get closer to now since there is more uncertainty in the upscaled number of cases. Code and results from this analysis can be found [here](https://epiforecasts.io/EpiNow) and [here](https://github.com/epiforecasts/covid).