From 8f623f5a1dc3a895c98bf563dbc5d9b674aa2f12 Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Tue, 28 Jan 2025 12:06:37 +0000 Subject: [PATCH 01/24] add epinowcast package data and dependency to the getting started vignette --- DESCRIPTION | 3 ++ vignettes/baselinenowcast.Rmd | 74 ++++++++++++++++++++++++++++++++++- 2 files changed, 75 insertions(+), 2 deletions(-) diff --git a/DESCRIPTION b/DESCRIPTION index 0f305d7..6852f92 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -32,6 +32,7 @@ Depends: R (>= 4.0.0) Suggests: dplyr, + epinowcast, bookdown, ggplot2, spelling, @@ -44,3 +45,5 @@ Encoding: UTF-8 Roxygen: list(markdown = TRUE) RoxygenNote: 7.3.2 VignetteBuilder: knitr +Remotes: + epinowcast=epinowcast/epinowcast@v0.3.0 diff --git a/vignettes/baselinenowcast.Rmd b/vignettes/baselinenowcast.Rmd index 8bb2a5f..7b7564b 100644 --- a/vignettes/baselinenowcast.Rmd +++ b/vignettes/baselinenowcast.Rmd @@ -20,12 +20,82 @@ vignette: > # Introduction +Incomplete reporting of epidemiological data at recent times can result in case count data that is right-truncated. +Right-truncated case counts can be misleading to interpret at face-value, as they will typically show a decline in the number of reported observations in the most recent time points. +These are the time points where the highest proportion of the data has yet to be observed in the dataset. +The imputation of what will eventually be observed up until the current time is referred to as a nowcast. +A number of methods have been developed to nowcast epidemiological case count data. + +The purpose of `baselinenowcast` is to provide a nowcast using the most recent observations and a relatively straightforward multiplicative method for adjusting incomplete reports. +The goal is to use this method as a baseline for comparing to potentially more complex model-based approaches which may have additional features that are potentially specific to the dataset or problem context. + +In the below section, we will describe an example of a nowcasting problem, and demonstrate how to use `baselinenowcast` to estimate a delay distribution from the data and apply that estimate to generate a probabilistic nowcast. + +# Packages + +As well as the `baselinenowcast` package this vignette also uses `epinowcast`, `ggplot2`, and `dplyr`. +The installation of `epinowcast` is not required for using the package, however, its pre and post-processing functions provide a lot of the data wrangling needed to set up the nowcasting problem. +We note that no components of the vignette require installing `CmdStan`, which is a downstream dependency of `epinowcast`. +We will just be using the `R` components of `epinowcast`, which can be installed using the example lines of code below. ```{r setup, message = FALSE} +# Installing epinowcast +# install.packages( #nolint +# "epinowcast", repos = "https://epinowcast.r-universe.dev" #nolint +# ) #nolint # Load packages -library(baselinenowcast) # nolint -library(ggplot2) # nolint +library(baselinenowcast) +library(epinowcast) +library(ggplot2) +library(dplyr) # Set seed for reproducibility set.seed(123) ``` + +# Data + +Nowcasting of right-truncated case counts involves the estimation of reporting delays for recently reported data. +For this, we need case counts both by when they were diagnosed (e.g. when someone tests positive; often called "reference date") and by when they were reported (i.e. when administratively recorded via public health surveillance; often called "report date"). 
The difference between the reference date and the report date is the reporting delay. +For this quick start, we use daily level data from the [Robert Koch Institute via the Germany Nowcasting hub](https://github.com/KITmetricslab/hospitalization-nowcast-hub/wiki/Truth-data#role-an-definition-of-the-seven-day-hospitalization-incidence). +These data represent hospitalisation counts by date of positive test and date of test report in Germany up to October 1, 2021. + +# Filtering + +We will filter the data to just look at the national-level data, for all age groups. +We will pretend that we are making a nowcast as of July 1, 2021, therefore we will exclude all reference dates and report dates after that date. +`germany_covid19_hosp` is provided as package data from `epinowcast +```{r load-data} +observed_long <- epinowcast::germany_covid19_hosp |> + dplyr::filter(location == "DE", age_group == "00+") +``` + +Let's start by plotting all of the values reported on the day of admissions, +and then compare that to the total admissions observed on each reference date. + +```{r make-some-exploratory-plots} +ggplot() + + geom_line( + # Plot data only for 0 day delay + data = observed_long |> + filter(reference_date == report_date), + aes(x = reference_date, y = confirm), color = "gray" + ) + + theme_bw() + + xlab("Reference date") + + ylab("Confirmed admissions on date of admissions") + + ggtitle("Number of cases confirmed on the reference date") +ggplot() + + geom_line( + # Plot the data summed across reporting days + data = observed_long |> + group_by(reference_date) |> + summarise(total_confirmed = sum(confirm)), + aes(x = reference_date, y = total_confirmed), color = "darkred" + ) + + theme_bw() + + xlab("Reference date") + + ylab("Total confirmed admissions") + + ggtitle("Total cases by reference date") +``` +` From ae1d0921ff689488f5da99643666a5bf9b6ab3e4 Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Tue, 28 Jan 2025 15:41:31 +0000 Subject: [PATCH 02/24] add a very rough first pass function to estimate the delay, with no tests or checks --- NAMESPACE | 2 + R/estimate_delay.R | 65 ++++++++++++++++++++++++ man/estimate_delay.Rd | 56 +++++++++++++++++++++ vignettes/baselinenowcast.Rmd | 95 ++++++++++++++++++++++++++++------- 4 files changed, 201 insertions(+), 17 deletions(-) create mode 100644 R/estimate_delay.R create mode 100644 man/estimate_delay.Rd diff --git a/NAMESPACE b/NAMESPACE index e651b94..8c1d0fe 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -1 +1,3 @@ # Generated by roxygen2: do not edit by hand + +export(estimate_delay) diff --git a/R/estimate_delay.R b/R/estimate_delay.R new file mode 100644 index 0000000..6708592 --- /dev/null +++ b/R/estimate_delay.R @@ -0,0 +1,65 @@ +#' Estimate a delay distribution from a reporting triangle +#' @description Provides an estimate of the reporting delay as a function +#' of the delay, based on the reporting triangle and the specified maximum +#' delay and number of reference date observations to be used in the estimation. +#' This point estimate of the delay is computed empirically, using an +#' iterative algorithm starting from the most recent observations. 
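As an illustrative aside (not part of the patches themselves): to make the reference date / report date distinction above concrete, the reporting delay can be tabulated directly from the long-format data this vignette uses. The sketch below assumes the `reference_date`, `report_date`, `location`, and `age_group` columns of `epinowcast::germany_covid19_hosp`; the object name `delays` is made up for illustration.

```r
# The reporting delay is the gap (in days) between the reference date and the
# report date; tabulate it for the national, all-ages data.
library(epinowcast)
library(dplyr)

delays <- epinowcast::germany_covid19_hosp |>
  dplyr::filter(location == "DE", age_group == "00+") |>
  dplyr::mutate(delay = as.integer(report_date - reference_date))

summary(delays$delay)
```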
It was +#' modified from the code originally developed by the Karlsruhe Institute +#' of Technology RESPINOW German Hospitalization Nowcasting Hub, +#' Modified from: https://github.com/KITmetricslab/RESPINOW-Hub/blob/7cce3ae2728116e8c8cc0e4ab29074462c24650e/code/baseline/functions.R#L55 #nolint +#' @param triangle dataframe of the reporting triangle, with one column for the +#' reference date and the remaining columns specifying the delay (units not +#' specified). Values indicate the number of new observations assigned to +#' the reference date at the specified delay. +#' @param max_delay integer indicating the maximum delay to estimate, in units +#' of the delay. The default is to use the whole reporting triangle, +#' `ncol(triangle) -1`. +#' @param n_history integer indicating the number of reference dates to be +#' used in the estimate of the reporting delay, always starting from the most +#' recent reporting delay. The default is to use the whole reporting triangle, +#' so `nrow(triangle)-1` +#' @returns delay_df dataframe of length `max_delay` with columns `delay` +#' and `pmf`, indicating the point estimate of the empirical probability +#' mass on each delay +#' @export +#' @examples +#' library(epinowcast) +#' nat_germany_hosp <- +#' germany_covid19_hosp[location == "DE"][age_group == "00+"] +#' nat_germany_hosp <- enw_filter_report_dates( +#' nat_germany_hosp, +#' latest_date = "2021-10-01" +#' ) +#' pobs <- enw_preprocess_data(nat_germany_hosp, max_delay = 21) +#' triangle_raw<- pobs$reporting_triangle[[1]] +#' delay_df <- estimate_delay(triangle_raw[, 2:], +#' max_delay = 20, +#' n_history = 30) +estimate_delay <- function(triangle, + max_delay = ncol(triangle) - 1, + n_history = nrow(triangle) - 1) { + # Filter the triangle down to nrow = n_history + 1, ncol = max_delay + rt <- triangle[, 1:(max_delay + 2)] |> + dplyr::filter(reference_date >= max(reference_date) - n_history) |> + dplyr::select(-reference_date) + + n_delays <- ncol(rt) + n_dates <- nrow(rt) + rt <- as.data.table(rt) + factor <- matrix(nrow = max_delay - 1, ncol = 1) + expectation <- as.matrix(rt) # Make a matrix so we can add a decimal value + for (co in 2:(n_delays)){ + block_top_left <- rt[1:(n_dates - co + 1), 1:(co - 1), drop = FALSE] + block_top <- rt[1:(n_dates - co + 1), ..co, drop = FALSE] + factor[co - 1] <- sum(block_top) / max(sum(block_top_left), 1) + block_bottom_left <- expectation[(n_dates - co + 2):n_dates, 1:(co - 1), + drop = FALSE] + expectation[(n_dates - co + 2):n_dates, co] <- factor[co - 1] * rowSums( + block_bottom_left) + } + # Not sure if this is right, as it might be overly weighting the + # estimate data (but I do think that is the point...) + delay_df <- data.frame(delay = 0:max_delay, + pmf = colSums(expectation) / sum(expectation)) + return(delay_df) +} diff --git a/man/estimate_delay.Rd b/man/estimate_delay.Rd new file mode 100644 index 0000000..05a6090 --- /dev/null +++ b/man/estimate_delay.Rd @@ -0,0 +1,56 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/estimate_delay.R +\name{estimate_delay} +\alias{estimate_delay} +\title{Estimate a delay distribution from a reporting triangle} +\usage{ +estimate_delay( + triangle, + max_delay = ncol(triangle) - 1, + n_history = nrow(triangle) - 1 +) +} +\arguments{ +\item{triangle}{dataframe of the reporting triangle, with one column for the +reference date and the remaining columns specifying the delay (units not +specified). 
Values indicate the number of new observations assigned to +the reference date at the specified delay.} + +\item{max_delay}{integer indicating the maximum delay to estimate, in units +of the delay. The default is to use the whole reporting triangle, +\code{ncol(triangle) -1}.} + +\item{n_history}{integer indicating the number of reference dates to be +used in the estimate of the reporting delay, always starting from the most +recent reporting delay. The default is to use the whole reporting triangle, +so \code{nrow(triangle)-1}} +} +\value{ +delay_df dataframe of length \code{max_delay} with columns \code{delay} +and \code{pmf}, indicating the point estimate of the empirical probability +mass on each delay +} +\description{ +Provides an estimate of the reporting delay as a function +of the delay, based on the reporting triangle and the specified maximum +delay and number of reference date observations to be used in the estimation. +This point estimate of the delay is computed empirically, using an +iterative algorithm starting from the most recent observations. It was +modified from the code originally developed by the Karlsruhe Institute +of Technology RESPINOW German Hospitalization Nowcasting Hub, +Modified from: https://github.com/KITmetricslab/RESPINOW-Hub/blob/7cce3ae2728116e8c8cc0e4ab29074462c24650e/code/baseline/functions.R#L55 #nolint +} +\examples{ +library(epinowcast) +nat_germany_hosp <- + germany_covid19_hosp[location == "DE"][age_group == "00+"] +nat_germany_hosp <- enw_filter_report_dates( + nat_germany_hosp, + latest_date = "2021-10-01" +) +pobs <- enw_preprocess_data(nat_germany_hosp, max_delay = 21) +triangle_raw<- pobs$reporting_triangle[[1]] +delay_df <- estimate_delay(triangle_raw[, 2:], + max_delay = 20, + n_history = 30) +} diff --git a/vignettes/baselinenowcast.Rmd b/vignettes/baselinenowcast.Rmd index 7b7564b..6ab5c2b 100644 --- a/vignettes/baselinenowcast.Rmd +++ b/vignettes/baselinenowcast.Rmd @@ -64,27 +64,25 @@ These data represent hospitalisation counts by date of positive test and date of We will filter the data to just look at the national-level data, for all age groups. We will pretend that we are making a nowcast as of July 1, 2021, therefore we will exclude all reference dates and report dates after that date. -`germany_covid19_hosp` is provided as package data from `epinowcast +`germany_covid19_hosp` is provided as package data from `epinowcast` ```{r load-data} observed_long <- epinowcast::germany_covid19_hosp |> + enw_filter_report_dates(latest_date = "2021-07-01") |> dplyr::filter(location == "DE", age_group == "00+") ``` Let's start by plotting all of the values reported on the day of admissions, -and then compare that to the total admissions observed on each reference date. +and then compare that to what we will eventually observe as of the final date +in the complete dataset. -```{r make-some-exploratory-plots} -ggplot() + - geom_line( - # Plot data only for 0 day delay - data = observed_long |> - filter(reference_date == report_date), - aes(x = reference_date, y = confirm), color = "gray" - ) + - theme_bw() + - xlab("Reference date") + - ylab("Confirmed admissions on date of admissions") + - ggtitle("Number of cases confirmed on the reference date") +# Plotting + +The red line shows the cumulative number of confirmed admissions on each report date, across all delays +It demonstrates the characteristic behavior of right-truncation. 
+This is because we have not yet observed the data that will become available for the longer delays at recent time points. + +Our task will be to estimate what the "final" cumulative number of cases will at each reference date, observed as of the "fully observed" data on October 2021. +```{r plot-the-data-by-reference-date} ggplot() + geom_line( # Plot the data summed across reporting days @@ -93,9 +91,72 @@ ggplot() + summarise(total_confirmed = sum(confirm)), aes(x = reference_date, y = total_confirmed), color = "darkred" ) + + geom_line( + data = epinowcast::germany_covid19_hosp |> + dplyr::filter(location == "DE", age_group == "00+") |> + group_by(reference_date) |> + summarise(total_confirmed = sum(confirm)) |> + dplyr::filter(reference_date <= "2021-07-01"), + aes(x = reference_date, y = total_confirmed), color = "black" + )+ theme_bw() + xlab("Reference date") + - ylab("Total confirmed admissions") + - ggtitle("Total cases by reference date") + ylab("Confirmed admissions") + + scale_y_continuous(trans = "log10") + + ggtitle("Comparing all admissions by reference date and those with no delay") +``` +Here the black line represents the quantity we will evaluate our nowcast against, +and the red line represents the data we have available to us up until July 1st, 2021. + +# Pre-processing + +In order to compute a nowcast for this data, we will need to start by creating what we call a reporting triangle. +The reporting triangle will be used to estimate the delay distribution, or the proportion of all counts reported on a particular delay. +This is a wide formatted dataframe with one column for the reference date and another column for each of the delays, from 0 to the maximum delay. +The entries represent the number of new cases assigned to that reference date with a particular delay. +Since this data is both reported and referenced daily, we will use the time scale of days to create the reporting triangle, but the delay and the reference date can have any temporal granularity. + +In this example, we will both fit our delay distribution and apply it to generate a nowcast using the same data, the national level data from Germany for all age groups. +However, these components can be separated, so for example, we could use the national level data for all age groups to estimate a delay distribution, and then we could apply that to the data from just a single age group. +This type of "borrowing" from another training dataset can be really useful when you have low counts or relatively sparse data, which is likely to be the case for smaller populations. +```{r user-specificatons} +nowcast_date <- "2021-07-01" +# Specify the maximum delay, which will determine the length of your delay distribution. +# Empirical data outside this delay window will not be used for training, so we'd +# want to do some EDA of the data before deciding on this. +max_delay <- 40 +# Specify the number of reference dates to use to estimate the delay distribution. +# Note this assumes you want the most recent observations (though we can consider changing this) +n_history <- 60 +``` + +Next we will use the `epinowcast` function, `enw_preprocess_data()`, to +generate a reporting triangle. +```{r use-epinowcast-preprocessing} +# Get the reporting triangle, adding an additional day because epinowcast assumes +# uses this to get the number of columns of the triangle, and we want the max_delay + 1 +# since 0 is a valid delay. 
+pobs <- enw_preprocess_data( + # Noting that this is the only way epinowcast preprocessing would work -- + # return to this later. IDate was throwing errors if we used the dplyr processed + # observed long above. + epinowcast::germany_covid19_hosp[location == "DE"][age_group == "00+"] |> + enw_filter_report_dates(latest_date = nowcast_date) |> + enw_filter_reference_dates(include_days = n_obs_training), + max_delay = max_delay + 1) + +triangle_full <- pobs$reporting_triangle[[1]] +head(triangle_full) +``` + +Now that we have the reporting triangle, we are now ready to pass it in to the `baselinenowcast` package to estimate the delay distribution. +We will tell the function the maximum delay and the number of observations we want to use, though the default will be to use the whole reporting triangle. +If the reporting triangle is too small for the user-specified delays and number of training observations, the function will throw an error. +We only want to pass in the reporting triangle (for a single group!) to this function. +If reference date are repeted because the reporting triangle contains multiple strata, the function will throw an error. +```{r estimate-delay} +triangle <- triangle_full[, -1] +delay_df <- estimate_delay(triangle = triangle, + max_delay = max_delay, + n_history = n_history) ``` -` From a1d1c9bc7b7ac4525c70889e5de3a6eaf867401c Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Tue, 28 Jan 2025 16:34:08 +0000 Subject: [PATCH 03/24] remove dplyr dependence internally --- R/estimate_delay.R | 28 +++++++++++++++------------- vignettes/baselinenowcast.Rmd | 17 +++++++++++++++++ 2 files changed, 32 insertions(+), 13 deletions(-) diff --git a/R/estimate_delay.R b/R/estimate_delay.R index 6708592..d5ff2f1 100644 --- a/R/estimate_delay.R +++ b/R/estimate_delay.R @@ -31,35 +31,37 @@ #' latest_date = "2021-10-01" #' ) #' pobs <- enw_preprocess_data(nat_germany_hosp, max_delay = 21) -#' triangle_raw<- pobs$reporting_triangle[[1]] -#' delay_df <- estimate_delay(triangle_raw[, 2:], -#' max_delay = 20, -#' n_history = 30) +#' triangle_raw <- pobs$reporting_triangle[[1]] +#' delay_df <- estimate_delay(triangle_raw[, -1], +#' max_delay = 20, +#' n_history = 30 +#' ) estimate_delay <- function(triangle, max_delay = ncol(triangle) - 1, n_history = nrow(triangle) - 1) { # Filter the triangle down to nrow = n_history + 1, ncol = max_delay - rt <- triangle[, 1:(max_delay + 2)] |> - dplyr::filter(reference_date >= max(reference_date) - n_history) |> - dplyr::select(-reference_date) + rt <- triangle[, 1:(max_delay + 2)][reference_date >= max(reference_date) - n_history][, -1] # nolint n_delays <- ncol(rt) n_dates <- nrow(rt) - rt <- as.data.table(rt) factor <- matrix(nrow = max_delay - 1, ncol = 1) expectation <- as.matrix(rt) # Make a matrix so we can add a decimal value - for (co in 2:(n_delays)){ + for (co in 2:(n_delays)) { block_top_left <- rt[1:(n_dates - co + 1), 1:(co - 1), drop = FALSE] block_top <- rt[1:(n_dates - co + 1), ..co, drop = FALSE] factor[co - 1] <- sum(block_top) / max(sum(block_top_left), 1) block_bottom_left <- expectation[(n_dates - co + 2):n_dates, 1:(co - 1), - drop = FALSE] + drop = FALSE + ] expectation[(n_dates - co + 2):n_dates, co] <- factor[co - 1] * rowSums( - block_bottom_left) + block_bottom_left + ) } # Not sure if this is right, as it might be overly weighting the # estimate data (but I do think that is the point...) 
- delay_df <- data.frame(delay = 0:max_delay, - pmf = colSums(expectation) / sum(expectation)) + delay_df <- data.frame( + delay = 0:max_delay, + pmf = colSums(expectation) / sum(expectation) + ) return(delay_df) } diff --git a/vignettes/baselinenowcast.Rmd b/vignettes/baselinenowcast.Rmd index 6ab5c2b..d55ed39 100644 --- a/vignettes/baselinenowcast.Rmd +++ b/vignettes/baselinenowcast.Rmd @@ -149,6 +149,8 @@ triangle_full <- pobs$reporting_triangle[[1]] head(triangle_full) ``` +# Estimate delay + Now that we have the reporting triangle, we are now ready to pass it in to the `baselinenowcast` package to estimate the delay distribution. We will tell the function the maximum delay and the number of observations we want to use, though the default will be to use the whole reporting triangle. If the reporting triangle is too small for the user-specified delays and number of training observations, the function will throw an error. @@ -159,4 +161,19 @@ triangle <- triangle_full[, -1] delay_df <- estimate_delay(triangle = triangle, max_delay = max_delay, n_history = n_history) + + +``` + +# Apply delay to generate point nowcast + +The next step in our workflow is to take the estimated delay distribution and apply it to the partially observed reporting triangle, generating an estimate of the number of new cases confirmed at each reference date and delay. +This will generate a point estimate of what we can call the reporting square, which is the complete set of reference dates and delays. +In this case, we will be applying the delay to the same reporting triangle we used to generate the estimate, but this doesn't always have to be the case. +The reporting triangle we are applying it to must have the same `max_delay` as the delay estimate. + +```{r} +reporting_square <- apply_delay(triangle_to_nowcast = triangle, + delay_f ) + ``` From fcfa8aaaa8faa57f4bf220cd64bc8873ada29345 Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Tue, 28 Jan 2025 16:35:24 +0000 Subject: [PATCH 04/24] update vignette --- vignettes/baselinenowcast.Rmd | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/vignettes/baselinenowcast.Rmd b/vignettes/baselinenowcast.Rmd index d55ed39..84753ec 100644 --- a/vignettes/baselinenowcast.Rmd +++ b/vignettes/baselinenowcast.Rmd @@ -98,7 +98,7 @@ ggplot() + summarise(total_confirmed = sum(confirm)) |> dplyr::filter(reference_date <= "2021-07-01"), aes(x = reference_date, y = total_confirmed), color = "black" - )+ + ) + theme_bw() + xlab("Reference date") + ylab("Confirmed admissions") + @@ -143,7 +143,8 @@ pobs <- enw_preprocess_data( epinowcast::germany_covid19_hosp[location == "DE"][age_group == "00+"] |> enw_filter_report_dates(latest_date = nowcast_date) |> enw_filter_reference_dates(include_days = n_obs_training), - max_delay = max_delay + 1) + max_delay = max_delay + 1 +) triangle_full <- pobs$reporting_triangle[[1]] head(triangle_full) @@ -158,11 +159,11 @@ We only want to pass in the reporting triangle (for a single group!) to this fun If reference date are repeted because the reporting triangle contains multiple strata, the function will throw an error. 
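For intuition, a toy reporting triangle with the shape `estimate_delay()` expects might look like the sketch below. The counts and dates are made up and `toy_triangle` is a hypothetical name, not package code.

```r
# A toy reporting triangle: one row per reference date, one column per delay
# (0 through 3 here); NA marks counts that have not yet been reported.
toy_triangle <- data.frame(
  reference_date = as.Date("2021-06-27") + 0:3,
  `0` = c(10, 12, 9, 11),
  `1` = c(5, 6, 4, NA),
  `2` = c(2, 3, NA, NA),
  `3` = c(1, NA, NA, NA),
  check.names = FALSE
)
toy_triangle
```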
```{r estimate-delay} triangle <- triangle_full[, -1] -delay_df <- estimate_delay(triangle = triangle, - max_delay = max_delay, - n_history = n_history) - - +delay_df <- estimate_delay( + triangle = triangle, + max_delay = max_delay, + n_history = n_history +) ``` # Apply delay to generate point nowcast @@ -173,7 +174,8 @@ In this case, we will be applying the delay to the same reporting triangle we us The reporting triangle we are applying it to must have the same `max_delay` as the delay estimate. ```{r} -reporting_square <- apply_delay(triangle_to_nowcast = triangle, - delay_f ) - +reporting_square <- apply_delay( + triangle_to_nowcast = triangle, + delay_f +) ``` From eb9e712ee0950255197d7f8bbc039a1a380f0019 Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Wed, 29 Jan 2025 11:24:22 +0000 Subject: [PATCH 05/24] add the preprocess function from respinow to handle 0s --- DESCRIPTION | 2 + R/estimate_delay.R | 21 +++++++--- R/preprocess_reporting_triangle.R | 55 +++++++++++++++++++++++++++ man/estimate_delay.Rd | 9 +++-- man/preprocess_reporting_triangle.Rd | 33 ++++++++++++++++ vignettes/baselinenowcast.Rmd | 57 +++++++++++++++++++++------- 6 files changed, 154 insertions(+), 23 deletions(-) create mode 100644 R/preprocess_reporting_triangle.R create mode 100644 man/preprocess_reporting_triangle.Rd diff --git a/DESCRIPTION b/DESCRIPTION index 6852f92..951c328 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -30,6 +30,8 @@ URL: https://github.com/epinowcast/baselinenowcast/ BugReports: https://github.com/epinowcast/baselinenowcast/issues Depends: R (>= 4.0.0) +Imports: + data.table Suggests: dplyr, epinowcast, diff --git a/R/estimate_delay.R b/R/estimate_delay.R index d5ff2f1..f3560d1 100644 --- a/R/estimate_delay.R +++ b/R/estimate_delay.R @@ -1,5 +1,6 @@ #' Estimate a delay distribution from a reporting triangle -#' @description Provides an estimate of the reporting delay as a function +#' @description +#' Provides an estimate of the reporting delay as a function #' of the delay, based on the reporting triangle and the specified maximum #' delay and number of reference date observations to be used in the estimation. #' This point estimate of the delay is computed empirically, using an @@ -39,16 +40,24 @@ estimate_delay <- function(triangle, max_delay = ncol(triangle) - 1, n_history = nrow(triangle) - 1) { - # Filter the triangle down to nrow = n_history + 1, ncol = max_delay - rt <- triangle[, 1:(max_delay + 2)][reference_date >= max(reference_date) - n_history][, -1] # nolint - + # User might put in data.frame or a data.table, we will convert to a data.table #nolint + # internally to avoid using dplyr in package functions + triangle <- data.table::as.data.table(triangle) + # Filter the triangle down to nrow = n_history + 1, ncol = max_delay + trunc_triangle <- preprocess_reporting_triangle(triangle, max_delay)[reference_date >= max(reference_date) - n_history] # nolint + # Make the date the rowname, so the matrix is just the entries + integer_cols <- which(colnames(trunc_triangle) %in% (grep("^\\d+$", names(trunc_df), value = TRUE))) #nolint + # the `..` is because its a data.table, we probably don't want to expect this from users. 
#nolint + rt <- as.matrix(trunc_triangle[, ..integer_cols]) + dates <- as.character(trunc_triangle$reference_date) + rownames(matr) <- dates n_delays <- ncol(rt) n_dates <- nrow(rt) factor <- matrix(nrow = max_delay - 1, ncol = 1) - expectation <- as.matrix(rt) # Make a matrix so we can add a decimal value + expectation <- rt for (co in 2:(n_delays)) { block_top_left <- rt[1:(n_dates - co + 1), 1:(co - 1), drop = FALSE] - block_top <- rt[1:(n_dates - co + 1), ..co, drop = FALSE] + block_top <- rt[1:(n_dates - co + 1), co, drop = FALSE] factor[co - 1] <- sum(block_top) / max(sum(block_top_left), 1) block_bottom_left <- expectation[(n_dates - co + 2):n_dates, 1:(co - 1), drop = FALSE diff --git a/R/preprocess_reporting_triangle.R b/R/preprocess_reporting_triangle.R new file mode 100644 index 0000000..4f940cf --- /dev/null +++ b/R/preprocess_reporting_triangle.R @@ -0,0 +1,55 @@ +#' Pre-process the reporting triangle +#' @description +#' Takes in a reporting triangle with the reference date as the first column, +#' and all subsequent columns named by their delay (unitless, so for example a +#' delay of 3 days or weeks would have the column name `3`), and values +#' indicating the number of new confirmed cases assigned to that reference +#' date with that particular delay. It returns a dataframe in the same format +#' as the input triangle, but truncated to include only the maximum delay +#' number of columns and with negative values of reporting handled via passing +#' them to the subsequent days delay. +#' Modified from https://github.com/KITmetricslab/RESPINOW-Hub/blob/main/code/baseline/functions.R #nolint +#' @param triangle the reporting triangle as a data.frame, with rows as the reference +#' date and columns as the delay, in any units. Assumes that the delays will +#' be in the format `{delay}` with no suffix for the unit +#' @param max_delay the maximum delay, in the delay units, to filter the +#' reporting triangle +#' @return trunc_triangle a dataframe in the same format as `triangle`, but +#' truncated to include only the maximum delay number of delay columns +#' and with negative values of reporting handled via passing them to the +#' subsequent days delay +preprocess_reporting_triangle <- function(triangle, max_delay) { + # restrict to the first columns in the max + cols_subset <- which(colnames(triangle) == max_delay) + trunc_triangle <- triangle[, 1:cols_subset] + # columns containing integer names: + integer_cols <- which(colnames(trunc_triangle) %in% (grep("^\\d+$", names(trunc_df), value = TRUE))) # nolint + # Loop over each row + for (i in 1:nrow(trunc_triangle)) { + to_subtract <- 0 + row <- trunc_triangle[i, ] + # Loop over the columns starting from the last column back to max delay + # column, and if there is a negative value, we add this to the + # next day and set that one as 0. 
+ for (j in rev(integer_cols)) { + value <- row[[j]] + if (!is.na(value)) { + # Either adds 0 or the previous days negative value + value <- value + to_subtract + if (value < 0) { + # Want to subtract from subsequent day + to_subtract <- value + trunc_triangle[i, j] <- 0 # Set the negative value in the RT to 0 + } else { + trunc_triangle[i, j] <- value + to_subtract <- 0 + } + } + } + } + # Convert 'value' columns to integer type + for (col in integer_cols) { + trunc_triangle[[col]] <- as.integer(trunc_triangle[[col]]) + } + return(trunc_triangle) +} diff --git a/man/estimate_delay.Rd b/man/estimate_delay.Rd index 05a6090..202b8ce 100644 --- a/man/estimate_delay.Rd +++ b/man/estimate_delay.Rd @@ -49,8 +49,9 @@ nat_germany_hosp <- enw_filter_report_dates( latest_date = "2021-10-01" ) pobs <- enw_preprocess_data(nat_germany_hosp, max_delay = 21) -triangle_raw<- pobs$reporting_triangle[[1]] -delay_df <- estimate_delay(triangle_raw[, 2:], - max_delay = 20, - n_history = 30) +triangle_raw <- pobs$reporting_triangle[[1]] +delay_df <- estimate_delay(triangle_raw[, -1], + max_delay = 20, + n_history = 30 +) } diff --git a/man/preprocess_reporting_triangle.Rd b/man/preprocess_reporting_triangle.Rd new file mode 100644 index 0000000..3752989 --- /dev/null +++ b/man/preprocess_reporting_triangle.Rd @@ -0,0 +1,33 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/preprocess_reporting_triangle.R +\name{preprocess_reporting_triangle} +\alias{preprocess_reporting_triangle} +\title{Pre-process the reporting triangle} +\usage{ +preprocess_reporting_triangle(triangle, max_delay) +} +\arguments{ +\item{triangle}{the reporting triangle as a data.frame, with rows as the reference +date and columns as the delay, in any units. Assumes that the delays will +be in the format \code{{delay}} with no suffix for the unit} + +\item{max_delay}{the maximum delay, in the delay units, to filter the +reporting triangle} +} +\value{ +trunc_triangle a dataframe in the same format as \code{triangle}, but +truncated to include only the maximum delay number of delay columns +and with negative values of reporting handled via passing them to the +subsequent days delay +} +\description{ +Takes in a reporting triangle with the reference date as the first column, +and all subsequent columns named by their delay (unitless, so for example a +delay of 3 days or weeks would have the column name \code{3}), and values +indicating the number of new confirmed cases assigned to that reference +date with that particular delay. It returns a dataframe in the same format +as the input triangle, but truncated to include only the maximum delay +number of columns and with negative values of reporting handled via passing +them to the subsequent days delay. +Modified from https://github.com/KITmetricslab/RESPINOW-Hub/blob/main/code/baseline/functions.R #nolint +} diff --git a/vignettes/baselinenowcast.Rmd b/vignettes/baselinenowcast.Rmd index 84753ec..4ca7689 100644 --- a/vignettes/baselinenowcast.Rmd +++ b/vignettes/baselinenowcast.Rmd @@ -111,14 +111,17 @@ and the red line represents the data we have available to us up until July 1st, # Pre-processing In order to compute a nowcast for this data, we will need to start by creating what we call a reporting triangle. -The reporting triangle will be used to estimate the delay distribution, or the proportion of all counts reported on a particular delay. 
This is a wide formatted dataframe with one column for the reference date and another column for each of the delays, from 0 to the maximum delay. The entries represent the number of new cases assigned to that reference date with a particular delay. +The reporting triangle will be used to estimate the delay distribution, or the proportion of the final number of cases reported on a particular delay. Since this data is both reported and referenced daily, we will use the time scale of days to create the reporting triangle, but the delay and the reference date can have any temporal granularity. -In this example, we will both fit our delay distribution and apply it to generate a nowcast using the same data, the national level data from Germany for all age groups. -However, these components can be separated, so for example, we could use the national level data for all age groups to estimate a delay distribution, and then we could apply that to the data from just a single age group. +In this example, we will both fit our delay distribution, and apply it to generate a nowcast using the same data, the national level data from Germany for all age groups. +However, these components can be separated, so for example, we could use the national level data for all age groups to estimate a delay distribution, and then we could apply that elsewhere, for example to the data stratified by age group and location. This type of "borrowing" from another training dataset can be really useful when you have low counts or relatively sparse data, which is likely to be the case for smaller populations. + +In the below sections, we will specify our nowcast date, the maximum delay, and the number of observations by reference date that we want to use to estimate the delay distribution. +We recommend choosing the maximum delay and number of historical observations based on an exploratory data analysis, as these specifications will change significantly depending on the dataset. ```{r user-specificatons} nowcast_date <- "2021-07-01" # Specify the maximum delay, which will determine the length of your delay distribution. @@ -130,19 +133,20 @@ max_delay <- 40 n_history <- 60 ``` -Next we will use the `epinowcast` function, `enw_preprocess_data()`, to -generate a reporting triangle. +Next we will use the `epinowcast` function, `enw_preprocess_data()` and the data in the form a of a long tidy dataframe indexed by reference date and report date and filtered to the strata we are interested in, to generate a reporting triangle. ```{r use-epinowcast-preprocessing} +# Noting that this is the only way epinowcast preprocessing would work -- +# return to this later. IDate was throwing errors if we used the dplyr processed +# observed long above. +observed_long <- epinowcast::germany_covid19_hosp[location == "DE"][age_group == "00+"] |> + enw_filter_report_dates(latest_date = nowcast_date) |> + enw_filter_reference_dates(include_days = n_history) +head(observed_long) # Get the reporting triangle, adding an additional day because epinowcast assumes # uses this to get the number of columns of the triangle, and we want the max_delay + 1 # since 0 is a valid delay. pobs <- enw_preprocess_data( - # Noting that this is the only way epinowcast preprocessing would work -- - # return to this later. IDate was throwing errors if we used the dplyr processed - # observed long above. 
- epinowcast::germany_covid19_hosp[location == "DE"][age_group == "00+"] |> - enw_filter_report_dates(latest_date = nowcast_date) |> - enw_filter_reference_dates(include_days = n_obs_training), + obs = observed_long, max_delay = max_delay + 1 ) @@ -156,14 +160,41 @@ Now that we have the reporting triangle, we are now ready to pass it in to the ` We will tell the function the maximum delay and the number of observations we want to use, though the default will be to use the whole reporting triangle. If the reporting triangle is too small for the user-specified delays and number of training observations, the function will throw an error. We only want to pass in the reporting triangle (for a single group!) to this function. -If reference date are repeted because the reporting triangle contains multiple strata, the function will throw an error. +If reference date are repeated because the reporting triangle contains multiple strata, the function will throw an error. + +The `estimate_delay()` function expects the following inputs: + - `triangle`: the reporting triangle for a single strata, with the left column indicating the `reference_date` and the subsequent columns indexed starting at 0 through the number of observed delays. There should be no repeated reference dates or additional columns of metadata. + - `max_delay`: an integer indicating the maximum delay to estimate. This must be less than or equal to the number of delays in `triangle`. + - `n_history`: an integer indicating the number of observations by reference date to use to fit the delay distribution. This must be less than the number of rows in `triangle`. +`enw_preoprocess()` returns a `triangle` with an additional column `.group` which will not be used in our estimate of the delay distribution, as `estimate_delay()` performs the estimate on a single strata only. +If your data contains multiple groups, we recommend either aggregating across groups (for a fully pooled estimate) or estimating the delay individually for the strata of your choice (no pooling). +`baselinenowcast` does not support a hierarchical, or partially pooled, estimate of the delay by design. 
```{r estimate-delay} -triangle <- triangle_full[, -1] +triangle <- triangle_full |> + dplyr::select(-`.group`) |> # remove unecessary `.group` column + as.data.frame() # baseline nowcast expects dataframe as an input +head(triangle) + + delay_df <- estimate_delay( triangle = triangle, max_delay = max_delay, n_history = n_history ) + +ggplot(delay_df) + + geom_line(aes(x = delay, y = cumsum(pmf))) + + xlab("Delay") + + ylab("Cumulativee proportion reported") + + ggtitle("Empirical point estimate of cumulative proportion reported by delay") + + theme_bw() + +ggplot(delay_df) + + geom_line(aes(x = delay, y = pmf)) + + xlab("Delay") + + ylab("Proportion reported") + + ggtitle("Empirical point estimate of proportion reported by delay") + + theme_bw() ``` # Apply delay to generate point nowcast From e590458fbf845464c7c537821e186e6e812a1239 Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Wed, 29 Jan 2025 11:25:14 +0000 Subject: [PATCH 06/24] enforce using data.table --- R/estimate_delay.R | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/R/estimate_delay.R b/R/estimate_delay.R index f3560d1..85288f9 100644 --- a/R/estimate_delay.R +++ b/R/estimate_delay.R @@ -43,10 +43,10 @@ estimate_delay <- function(triangle, # User might put in data.frame or a data.table, we will convert to a data.table #nolint # internally to avoid using dplyr in package functions triangle <- data.table::as.data.table(triangle) - # Filter the triangle down to nrow = n_history + 1, ncol = max_delay + # Filter the triangle down to nrow = n_history + 1, ncol = max_delay trunc_triangle <- preprocess_reporting_triangle(triangle, max_delay)[reference_date >= max(reference_date) - n_history] # nolint # Make the date the rowname, so the matrix is just the entries - integer_cols <- which(colnames(trunc_triangle) %in% (grep("^\\d+$", names(trunc_df), value = TRUE))) #nolint + integer_cols <- which(colnames(trunc_triangle) %in% (grep("^\\d+$", names(trunc_df), value = TRUE))) # nolint # the `..` is because its a data.table, we probably don't want to expect this from users. #nolint rt <- as.matrix(trunc_triangle[, ..integer_cols]) dates <- as.character(trunc_triangle$reference_date) From ae56d32eda363adb6dd034731c364562e3a50880 Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Wed, 29 Jan 2025 11:57:08 +0000 Subject: [PATCH 07/24] fix linting and bug --- R/preprocess_reporting_triangle.R | 10 +- vignettes/baselinenowcast.Rmd | 202 ++++++++++++++++++++---------- 2 files changed, 142 insertions(+), 70 deletions(-) diff --git a/R/preprocess_reporting_triangle.R b/R/preprocess_reporting_triangle.R index 4f940cf..8d0cc83 100644 --- a/R/preprocess_reporting_triangle.R +++ b/R/preprocess_reporting_triangle.R @@ -9,9 +9,9 @@ #' number of columns and with negative values of reporting handled via passing #' them to the subsequent days delay. #' Modified from https://github.com/KITmetricslab/RESPINOW-Hub/blob/main/code/baseline/functions.R #nolint -#' @param triangle the reporting triangle as a data.frame, with rows as the reference -#' date and columns as the delay, in any units. Assumes that the delays will -#' be in the format `{delay}` with no suffix for the unit +#' @param triangle the reporting triangle as a data.frame, with rows as the +#' reference date and columns as the delay, in any units. 
Assumes that the +#' delays will be in the format `{delay}` with no suffix for the unit #' @param max_delay the maximum delay, in the delay units, to filter the #' reporting triangle #' @return trunc_triangle a dataframe in the same format as `triangle`, but @@ -23,9 +23,9 @@ preprocess_reporting_triangle <- function(triangle, max_delay) { cols_subset <- which(colnames(triangle) == max_delay) trunc_triangle <- triangle[, 1:cols_subset] # columns containing integer names: - integer_cols <- which(colnames(trunc_triangle) %in% (grep("^\\d+$", names(trunc_df), value = TRUE))) # nolint + integer_cols <- which(colnames(trunc_triangle) %in% (grep("^\\d+$", names(trunc_triangle), value = TRUE))) # nolint # Loop over each row - for (i in 1:nrow(trunc_triangle)) { + for (i in seq_len(nrow(trunc_triangle))) { to_subtract <- 0 row <- trunc_triangle[i, ] # Loop over the columns starting from the last column back to max delay diff --git a/vignettes/baselinenowcast.Rmd b/vignettes/baselinenowcast.Rmd index 4ca7689..a524c11 100644 --- a/vignettes/baselinenowcast.Rmd +++ b/vignettes/baselinenowcast.Rmd @@ -20,25 +20,42 @@ vignette: > # Introduction -Incomplete reporting of epidemiological data at recent times can result in case count data that is right-truncated. -Right-truncated case counts can be misleading to interpret at face-value, as they will typically show a decline in the number of reported observations in the most recent time points. -These are the time points where the highest proportion of the data has yet to be observed in the dataset. - -The imputation of what will eventually be observed up until the current time is referred to as a nowcast. - -A number of methods have been developed to nowcast epidemiological case count data. - -The purpose of `baselinenowcast` is to provide a nowcast using the most recent observations and a relatively straightforward multiplicative method for adjusting incomplete reports. -The goal is to use this method as a baseline for comparing to potentially more complex model-based approaches which may have additional features that are potentially specific to the dataset or problem context. - -In the below section, we will describe an example of a nowcasting problem, and demonstrate how to use `baselinenowcast` to estimate a delay distribution from the data and apply that estimate to generate a probabilistic nowcast. +Incomplete reporting of epidemiological data at recent times can result in case +count data that is right-truncated. +Right-truncated case counts can be misleading to interpret at face-value, as +they will typically show a decline in the number of reported observations in +the most recent time points. +These are the time points where the highest proportion of the data has yet to +be observed in the dataset. + +The imputation of what will eventually be observed up until the current time is +referred to as a nowcast. + +A number of methods have been developed to nowcast epidemiological case count +data. + +The purpose of `baselinenowcast` is to provide a nowcast using the most recent +observations and a relatively straightforward multiplicative method for +adjusting incomplete reports. +The goal is to use this method as a baseline for comparing to potentially more +complex model-based approaches which may have additional features that are +potentially specific to the dataset or problem context. 
+ +In the below section, we will describe an example of a nowcasting problem, and +demonstrate how to use `baselinenowcast` to estimate a delay distribution from +the data and apply that estimate to generate a probabilistic nowcast. # Packages -As well as the `baselinenowcast` package this vignette also uses `epinowcast`, `ggplot2`, and `dplyr`. -The installation of `epinowcast` is not required for using the package, however, its pre and post-processing functions provide a lot of the data wrangling needed to set up the nowcasting problem. -We note that no components of the vignette require installing `CmdStan`, which is a downstream dependency of `epinowcast`. -We will just be using the `R` components of `epinowcast`, which can be installed using the example lines of code below. +As well as the `baselinenowcast` package this vignette also uses `epinowcast`, +`ggplot2`, and `dplyr`. +The installation of `epinowcast` is not required for using the package, +however, its pre and post-processing functions provide a lot of the data +wrangling needed to set up the nowcasting problem. +We note that no components of the vignette require installing `CmdStan`, +which is a downstream dependency of `epinowcast`. +We will just be using the `R` components of `epinowcast`, which can be +installed using the example lines of code below. ```{r setup, message = FALSE} # Installing epinowcast # install.packages( #nolint @@ -55,15 +72,24 @@ set.seed(123) # Data -Nowcasting of right-truncated case counts involves the estimation of reporting delays for recently reported data. -For this, we need case counts both by when they were diagnosed (e.g. when someone tests positive; often called "reference date") and by when they were reported (i.e. when administratively recorded via public health surveillance; often called "report date"). The difference between the reference date and the report date is the reporting delay. -For this quick start, we use daily level data from the [Robert Koch Institute via the Germany Nowcasting hub](https://github.com/KITmetricslab/hospitalization-nowcast-hub/wiki/Truth-data#role-an-definition-of-the-seven-day-hospitalization-incidence). -These data represent hospitalisation counts by date of positive test and date of test report in Germany up to October 1, 2021. +Nowcasting of right-truncated case counts involves the estimation of reporting +delays for recently reported data. +For this, we need case counts both by when they were diagnosed (e.g. when +someone tests positive; often called "reference date") and by when they were +reported (i.e. when administratively recorded via public health surveillance; +often called "report date"). The difference between the reference date and the +report date is the reporting delay. +For this quick start, we use daily level data from the +[Robert Koch Institute via the Germany Nowcasting hub](https://github.com/KITmetricslab/hospitalization-nowcast-hub/wiki/Truth-data#role-an-definition-of-the-seven-day-hospitalization-incidence). +These data represent hospitalisation counts by date of positive test and date +of test report in Germany up to October 1, 2021. # Filtering -We will filter the data to just look at the national-level data, for all age groups. -We will pretend that we are making a nowcast as of July 1, 2021, therefore we will exclude all reference dates and report dates after that date. +We will filter the data to just look at the national-level data, for all age +groups. 
+We will pretend that we are making a nowcast as of July 1, 2021, therefore we +will exclude all reference dates and report dates after that date. `germany_covid19_hosp` is provided as package data from `epinowcast` ```{r load-data} observed_long <- epinowcast::germany_covid19_hosp |> @@ -77,11 +103,15 @@ in the complete dataset. # Plotting -The red line shows the cumulative number of confirmed admissions on each report date, across all delays +The red line shows the cumulative number of confirmed admissions on each report +date, across all delays It demonstrates the characteristic behavior of right-truncation. -This is because we have not yet observed the data that will become available for the longer delays at recent time points. +This is because we have not yet observed the data that will become available +for the longer delays at recent time points. -Our task will be to estimate what the "final" cumulative number of cases will at each reference date, observed as of the "fully observed" data on October 2021. +Our task will be to estimate what the "final" cumulative number of cases will +at each reference date, observed as of the "fully observed" data on October +2021. ```{r plot-the-data-by-reference-date} ggplot() + geom_line( @@ -105,46 +135,67 @@ ggplot() + scale_y_continuous(trans = "log10") + ggtitle("Comparing all admissions by reference date and those with no delay") ``` -Here the black line represents the quantity we will evaluate our nowcast against, -and the red line represents the data we have available to us up until July 1st, 2021. +Here the black line represents the quantity we will evaluate our nowcast +against,and the red line represents the data we have available to us up until +July 1st, 2021. # Pre-processing -In order to compute a nowcast for this data, we will need to start by creating what we call a reporting triangle. -This is a wide formatted dataframe with one column for the reference date and another column for each of the delays, from 0 to the maximum delay. -The entries represent the number of new cases assigned to that reference date with a particular delay. -The reporting triangle will be used to estimate the delay distribution, or the proportion of the final number of cases reported on a particular delay. -Since this data is both reported and referenced daily, we will use the time scale of days to create the reporting triangle, but the delay and the reference date can have any temporal granularity. - -In this example, we will both fit our delay distribution, and apply it to generate a nowcast using the same data, the national level data from Germany for all age groups. -However, these components can be separated, so for example, we could use the national level data for all age groups to estimate a delay distribution, and then we could apply that elsewhere, for example to the data stratified by age group and location. -This type of "borrowing" from another training dataset can be really useful when you have low counts or relatively sparse data, which is likely to be the case for smaller populations. - -In the below sections, we will specify our nowcast date, the maximum delay, and the number of observations by reference date that we want to use to estimate the delay distribution. -We recommend choosing the maximum delay and number of historical observations based on an exploratory data analysis, as these specifications will change significantly depending on the dataset. 
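One way to do that exploratory check is sketched below: for older reference dates that should be close to fully reported, look at how much of the eventual total falls within a candidate maximum delay. This is illustrative only; it assumes `confirm` in `germany_covid19_hosp` is the cumulative count for a reference date as of each report date, and the names `candidate_max_delay` and `coverage_check` are made up.

```r
# Share of the eventual total reported within 40 days, for reference dates
# that are at least 80 days old and therefore close to fully observed.
candidate_max_delay <- 40
coverage_check <- epinowcast::germany_covid19_hosp |>
  dplyr::filter(location == "DE", age_group == "00+") |>
  dplyr::mutate(delay = as.integer(report_date - reference_date)) |>
  dplyr::filter(reference_date <= max(report_date) - 80) |>
  dplyr::group_by(reference_date) |>
  dplyr::summarise(
    prop_reported = max(confirm[delay <= candidate_max_delay]) / max(confirm),
    .groups = "drop"
  )
summary(coverage_check$prop_reported)
```

If most reference dates have a proportion close to 1, the candidate maximum delay is probably long enough.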
+In order to compute a nowcast for this data, we will need to start by creating +what we call a reporting triangle. +This is a wide formatted dataframe with one column for the reference date +and another column for each of the delays, from 0 to the maximum delay. +The entries represent the number of new cases assigned to that reference +date with a particular delay. +The reporting triangle will be used to estimate the delay distribution, or the +proportion of the final number of cases reported on a particular delay. +Since this data is both reported and referenced daily, we will use the time +scale of days to create the reporting triangle, but the delay and the reference +date can have any temporal granularity. + +In this example, we will both fit our delay distribution, and apply it to +generate a nowcast using the same data, the national level data from Germany +for all age groups. +However, these components can be separated, so for example, we could use the +national level data for all age groups to estimate a delay distribution, and +then we could apply that elsewhere, for example to the data stratified by +age group and location. +This type of "borrowing" from another training dataset can be really useful +when you have low counts or relatively sparse data, which is likely to be the +case for smaller populations. + +In the below sections, we will specify our nowcast date, the maximum delay, +and the number of observations by reference date that we want to use to +estimate the delay distribution. +We recommend choosing the maximum delay and number of historical observations +based on an exploratory data analysis, as these specifications will change +significantly depending on the dataset. ```{r user-specificatons} nowcast_date <- "2021-07-01" -# Specify the maximum delay, which will determine the length of your delay distribution. -# Empirical data outside this delay window will not be used for training, so we'd -# want to do some EDA of the data before deciding on this. +# Specify the maximum delay, which will determine the length of your delay +# distribution. Empirical data outside this delay window will not be used for +# training. max_delay <- 40 -# Specify the number of reference dates to use to estimate the delay distribution. -# Note this assumes you want the most recent observations (though we can consider changing this) +# Specify the number of reference dates to use to estimate the delay +# distribution.Note this assumes you want the most recent observations +# (though we can consider changing this) n_history <- 60 ``` -Next we will use the `epinowcast` function, `enw_preprocess_data()` and the data in the form a of a long tidy dataframe indexed by reference date and report date and filtered to the strata we are interested in, to generate a reporting triangle. +Next we will use the `epinowcast` function, `enw_preprocess_data()` and the +data in the form a of a long tidy dataframe indexed by reference date and +report date and filtered to the strata we are interested in, to generate a +reporting triangle. ```{r use-epinowcast-preprocessing} # Noting that this is the only way epinowcast preprocessing would work -- # return to this later. IDate was throwing errors if we used the dplyr processed # observed long above. 
-observed_long <- epinowcast::germany_covid19_hosp[location == "DE"][age_group == "00+"] |> +observed_long <- epinowcast::germany_covid19_hosp[location == "DE"][age_group == "00+"] |> # nolint enw_filter_report_dates(latest_date = nowcast_date) |> enw_filter_reference_dates(include_days = n_history) head(observed_long) -# Get the reporting triangle, adding an additional day because epinowcast assumes -# uses this to get the number of columns of the triangle, and we want the max_delay + 1 -# since 0 is a valid delay. +# Get the reporting triangle, adding an additional day because epinowcast +# we want the max_delay + 1 entries since 0 is a valid delay. pobs <- enw_preprocess_data( obs = observed_long, max_delay = max_delay + 1 @@ -156,19 +207,35 @@ head(triangle_full) # Estimate delay -Now that we have the reporting triangle, we are now ready to pass it in to the `baselinenowcast` package to estimate the delay distribution. -We will tell the function the maximum delay and the number of observations we want to use, though the default will be to use the whole reporting triangle. -If the reporting triangle is too small for the user-specified delays and number of training observations, the function will throw an error. -We only want to pass in the reporting triangle (for a single group!) to this function. -If reference date are repeated because the reporting triangle contains multiple strata, the function will throw an error. +Now that we have the reporting triangle, we are now ready to pass it in to the +`baselinenowcast` package to estimate the delay distribution. +We will tell the function the maximum delay and the number of observations we +want to use, though the default will be to use the whole reporting triangle. +If the reporting triangle is too small for the user-specified delays and number +of training observations, the function will throw an error. +We only want to pass in the reporting triangle (for a single group!) to this +function. +If reference date are repeated because the reporting triangle contains multiple +strata, the function will throw an error. The `estimate_delay()` function expects the following inputs: - - `triangle`: the reporting triangle for a single strata, with the left column indicating the `reference_date` and the subsequent columns indexed starting at 0 through the number of observed delays. There should be no repeated reference dates or additional columns of metadata. - - `max_delay`: an integer indicating the maximum delay to estimate. This must be less than or equal to the number of delays in `triangle`. - - `n_history`: an integer indicating the number of observations by reference date to use to fit the delay distribution. This must be less than the number of rows in `triangle`. -`enw_preoprocess()` returns a `triangle` with an additional column `.group` which will not be used in our estimate of the delay distribution, as `estimate_delay()` performs the estimate on a single strata only. -If your data contains multiple groups, we recommend either aggregating across groups (for a fully pooled estimate) or estimating the delay individually for the strata of your choice (no pooling). -`baselinenowcast` does not support a hierarchical, or partially pooled, estimate of the delay by design. + - `triangle`: the reporting triangle for a single strata, with the left column + indicating the `reference_date` and the subsequent columns indexed starting + at 0 through the number of observed delays. 
There should be no repeated + reference dates or additional columns of metadata. + - `max_delay`: an integer indicating the maximum delay to estimate. This must + be less than or equal to the number of delays in `triangle`. + - `n_history`: an integer indicating the number of observations by reference + date to use to fit the delay distribution. This must be less than the number + of rows in `triangle`. +`enw_preoprocess()` returns a `triangle` with an additional column `.group` +which will not be used in our estimate of the delay distribution, as +`estimate_delay()` performs the estimate on a single strata only. +If your data contains multiple groups, we recommend either aggregating across +groups (for a fully pooled estimate) or estimating the delay individually for +the strata of your choice (no pooling). +`baselinenowcast` does not support a hierarchical, or partially pooled, +estimate of the delay by design. ```{r estimate-delay} triangle <- triangle_full |> dplyr::select(-`.group`) |> # remove unecessary `.group` column @@ -186,7 +253,7 @@ ggplot(delay_df) + geom_line(aes(x = delay, y = cumsum(pmf))) + xlab("Delay") + ylab("Cumulativee proportion reported") + - ggtitle("Empirical point estimate of cumulative proportion reported by delay") + + ggtitle("Empirical point estimate of cumulative proportion reported by delay") + # nolint theme_bw() ggplot(delay_df) + @@ -199,10 +266,15 @@ ggplot(delay_df) + # Apply delay to generate point nowcast -The next step in our workflow is to take the estimated delay distribution and apply it to the partially observed reporting triangle, generating an estimate of the number of new cases confirmed at each reference date and delay. -This will generate a point estimate of what we can call the reporting square, which is the complete set of reference dates and delays. -In this case, we will be applying the delay to the same reporting triangle we used to generate the estimate, but this doesn't always have to be the case. -The reporting triangle we are applying it to must have the same `max_delay` as the delay estimate. +The next step in our workflow is to take the estimated delay distribution and +apply it to the partially observed reporting triangle, generating an estimate +of the number of new cases confirmed at each reference date and delay. +This will generate a point estimate of what we can call the reporting square, +which is the complete set of reference dates and delays. +In this case, we will be applying the delay to the same reporting triangle we +used to generate the estimate, but this doesn't always have to be the case. +The reporting triangle we are applying it to must have the same `max_delay` +as the delay estimate. 
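
To make the arithmetic of this step concrete before calling the package function, here is a minimal sketch of one way the completion could be done. It is illustrative only: it assumes not-yet-reported cells are coded as `NA`, that the `pmf` column of `delay_df` sums to one, and the helper name `complete_triangle_sketch()` is invented for this sketch rather than being part of the package interface used below.

```{r illustrative-point-nowcast-sketch}
# Illustrative sketch only: complete a reporting triangle using a delay PMF.
# triangle_mat: matrix with rows = reference dates and columns = delays
#               0 to max_delay, with NA for cells not yet reported.
# pmf:          estimated delay probability mass function (length max_delay + 1).
complete_triangle_sketch <- function(triangle_mat, pmf) {
  out <- triangle_mat
  for (i in seq_len(nrow(triangle_mat))) {
    observed <- !is.na(triangle_mat[i, ])
    if (all(observed)) next
    # Proportion of the final total expected to be reported so far
    prop_reported <- sum(pmf[observed])
    if (prop_reported <= 0) next
    # Scale the partial count up to an expected final total
    expected_total <- sum(triangle_mat[i, observed]) / prop_reported
    # Spread the expected remainder over the not-yet-observed delays
    out[i, !observed] <- expected_total * pmf[!observed]
  }
  out
}
```

The chunk that follows shows the intended package interface for this step.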
```{r} reporting_square <- apply_delay( From 880b92652de2d2431aff2f6bdae1187620f380e8 Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Wed, 29 Jan 2025 13:10:00 +0000 Subject: [PATCH 08/24] format internally as data.table --- R/estimate_delay.R | 4 ++-- R/preprocess_reporting_triangle.R | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/R/estimate_delay.R b/R/estimate_delay.R index 85288f9..00d691c 100644 --- a/R/estimate_delay.R +++ b/R/estimate_delay.R @@ -46,11 +46,11 @@ estimate_delay <- function(triangle, # Filter the triangle down to nrow = n_history + 1, ncol = max_delay trunc_triangle <- preprocess_reporting_triangle(triangle, max_delay)[reference_date >= max(reference_date) - n_history] # nolint # Make the date the rowname, so the matrix is just the entries - integer_cols <- which(colnames(trunc_triangle) %in% (grep("^\\d+$", names(trunc_df), value = TRUE))) # nolint + integer_cols <- which(colnames(trunc_triangle) %in% (grep("^\\d+$", names(trunc_triangle), value = TRUE))) # nolint # the `..` is because its a data.table, we probably don't want to expect this from users. #nolint rt <- as.matrix(trunc_triangle[, ..integer_cols]) dates <- as.character(trunc_triangle$reference_date) - rownames(matr) <- dates + rownames(rt) <- dates n_delays <- ncol(rt) n_dates <- nrow(rt) factor <- matrix(nrow = max_delay - 1, ncol = 1) diff --git a/R/preprocess_reporting_triangle.R b/R/preprocess_reporting_triangle.R index 8d0cc83..e14b62f 100644 --- a/R/preprocess_reporting_triangle.R +++ b/R/preprocess_reporting_triangle.R @@ -51,5 +51,5 @@ preprocess_reporting_triangle <- function(triangle, max_delay) { for (col in integer_cols) { trunc_triangle[[col]] <- as.integer(trunc_triangle[[col]]) } - return(trunc_triangle) + return(data.table::as.data.table(trunc_triangle)) } From 0699f42ee44a4e96a7f4e490440caef2d30e201c Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Wed, 29 Jan 2025 13:47:02 +0000 Subject: [PATCH 09/24] remove data.table bc isnt recognizing column name... --- R/estimate_delay.R | 6 ++++-- vignettes/baselinenowcast.Rmd | 8 ++++---- 2 files changed, 8 insertions(+), 6 deletions(-) diff --git a/R/estimate_delay.R b/R/estimate_delay.R index 00d691c..2a049f3 100644 --- a/R/estimate_delay.R +++ b/R/estimate_delay.R @@ -44,11 +44,13 @@ estimate_delay <- function(triangle, # internally to avoid using dplyr in package functions triangle <- data.table::as.data.table(triangle) # Filter the triangle down to nrow = n_history + 1, ncol = max_delay - trunc_triangle <- preprocess_reporting_triangle(triangle, max_delay)[reference_date >= max(reference_date) - n_history] # nolint + max_date <- max(triangle$reference_date, na.rm = TRUE) - n_history + trunc_triangle <- preprocess_reporting_triangle(triangle, max_delay) |> + dplyr::filter(reference_date >= max_date) # nolint # Make the date the rowname, so the matrix is just the entries integer_cols <- which(colnames(trunc_triangle) %in% (grep("^\\d+$", names(trunc_triangle), value = TRUE))) # nolint # the `..` is because its a data.table, we probably don't want to expect this from users. 
#nolint - rt <- as.matrix(trunc_triangle[, ..integer_cols]) + rt <- as.matrix(trunc_triangle[, integer_cols]) dates <- as.character(trunc_triangle$reference_date) rownames(rt) <- dates n_delays <- ncol(rt) diff --git a/vignettes/baselinenowcast.Rmd b/vignettes/baselinenowcast.Rmd index a524c11..76a5988 100644 --- a/vignettes/baselinenowcast.Rmd +++ b/vignettes/baselinenowcast.Rmd @@ -277,8 +277,8 @@ The reporting triangle we are applying it to must have the same `max_delay` as the delay estimate. ```{r} -reporting_square <- apply_delay( - triangle_to_nowcast = triangle, - delay_f -) +# reporting_square <- apply_delay( #nolint +# triangle_to_nowcast = triangle,#nolint +# delay_f#nolint +# )#nolint ``` From 81b10615d1c4545e4cb4241e3b5419ae7d7f1c19 Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Wed, 29 Jan 2025 14:26:04 +0000 Subject: [PATCH 10/24] test adding tests --- NAMESPACE | 1 + R/estimate_delay.R | 9 ++-- man/estimate_delay.Rd | 2 +- man/preprocess_reporting_triangle.Rd | 6 +-- tests/testthat/test-estimate_delay.R | 61 ++++++++++++++++++++++++++++ 5 files changed, 70 insertions(+), 9 deletions(-) create mode 100644 tests/testthat/test-estimate_delay.R diff --git a/NAMESPACE b/NAMESPACE index 8c1d0fe..eaa2632 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -1,3 +1,4 @@ # Generated by roxygen2: do not edit by hand export(estimate_delay) +importFrom(dplyr,filter) diff --git a/R/estimate_delay.R b/R/estimate_delay.R index 2a049f3..a7b2a1e 100644 --- a/R/estimate_delay.R +++ b/R/estimate_delay.R @@ -23,6 +23,7 @@ #' and `pmf`, indicating the point estimate of the empirical probability #' mass on each delay #' @export +#' @importFrom dplyr filter #' @examples #' library(epinowcast) #' nat_germany_hosp <- @@ -38,15 +39,13 @@ #' n_history = 30 #' ) estimate_delay <- function(triangle, - max_delay = ncol(triangle) - 1, + max_delay = ncol(triangle) - 2, n_history = nrow(triangle) - 1) { - # User might put in data.frame or a data.table, we will convert to a data.table #nolint - # internally to avoid using dplyr in package functions - triangle <- data.table::as.data.table(triangle) # Filter the triangle down to nrow = n_history + 1, ncol = max_delay max_date <- max(triangle$reference_date, na.rm = TRUE) - n_history trunc_triangle <- preprocess_reporting_triangle(triangle, max_delay) |> - dplyr::filter(reference_date >= max_date) # nolint + dplyr::filter(reference_date >= max_date) |> + as.data.frame() # nolint # Make the date the rowname, so the matrix is just the entries integer_cols <- which(colnames(trunc_triangle) %in% (grep("^\\d+$", names(trunc_triangle), value = TRUE))) # nolint # the `..` is because its a data.table, we probably don't want to expect this from users. #nolint diff --git a/man/estimate_delay.Rd b/man/estimate_delay.Rd index 202b8ce..b6da473 100644 --- a/man/estimate_delay.Rd +++ b/man/estimate_delay.Rd @@ -6,7 +6,7 @@ \usage{ estimate_delay( triangle, - max_delay = ncol(triangle) - 1, + max_delay = ncol(triangle) - 2, n_history = nrow(triangle) - 1 ) } diff --git a/man/preprocess_reporting_triangle.Rd b/man/preprocess_reporting_triangle.Rd index 3752989..fc3009f 100644 --- a/man/preprocess_reporting_triangle.Rd +++ b/man/preprocess_reporting_triangle.Rd @@ -7,9 +7,9 @@ preprocess_reporting_triangle(triangle, max_delay) } \arguments{ -\item{triangle}{the reporting triangle as a data.frame, with rows as the reference -date and columns as the delay, in any units. 
Assumes that the delays will -be in the format \code{{delay}} with no suffix for the unit} +\item{triangle}{the reporting triangle as a data.frame, with rows as the +reference date and columns as the delay, in any units. Assumes that the +delays will be in the format \code{{delay}} with no suffix for the unit} \item{max_delay}{the maximum delay, in the delay units, to filter the reporting triangle} diff --git a/tests/testthat/test-estimate_delay.R b/tests/testthat/test-estimate_delay.R new file mode 100644 index 0000000..fd4a84e --- /dev/null +++ b/tests/testthat/test-estimate_delay.R @@ -0,0 +1,61 @@ +test_that("estimate_delay function works correctly", { + set.seed(123) + triangle <- data.table::data.table( + reference_date = as.Date("2023-01-01") + 0:29, + `0` = rpois(30, 100), + `1` = rpois(30, 50), + `2` = rpois(30, 25), + `3` = rpois(30, 10), + `4` = rpois(30, 5) + ) + + # Test 1: Basic functionality + result <- estimate_delay(triangle) + expect_is(result, "data.frame") + expect_equal(nrow(result), ncol(triangle) - 1) + expect_equal(colnames(result), c("delay", "pmf")) + expect_true(all(result$delay == 0:(ncol(triangle) - 2))) + expect_true(all(result$pmf >= 0 & result$pmf <= 1)) + expect_equal(sum(result$pmf), 1, tolerance = 1e-6) + + # Test 2: Custom max_delay + result_max_delay <- estimate_delay(triangle, max_delay = 3) + expect_equal(nrow(result_max_delay), 4) + + # Test 3: Custom n_history + result_n_history <- estimate_delay(triangle, n_history = 20) + expect_is(result_n_history, "data.frame") + + # Test 4: Input validation *These should be more useful error messages* + expect_error(estimate_delay(triangle, max_delay = 0)) + expect_error(estimate_delay(triangle, n_history = 0)) + expect_error(estimate_delay(triangle, max_delay = 10, n_history = 40)) + + # Test 5: Handling of missing values + triangle_with_na <- triangle + triangle_with_na[1, 2] <- NA + result_with_na <- estimate_delay(triangle_with_na) + expect_is(result_with_na, "data.frame") + + # Test 6: Consistency of results + result1 <- estimate_delay(triangle) + result2 <- estimate_delay(triangle) + expect_equal(result1, result2) + + # Test 7: Handling different input types + result_df <- estimate_delay(as.data.frame(triangle)) + result_dt <- estimate_delay(data.table::as.data.table(triangle)) + expect_equal(result_dt, result_df) + + # Test 8: Check if the function handles edge cases correctly + triangle_single_day <- triangle[1, ] + expect_warning(estimate_delay(triangle_single_day)) + + # Test 9: Check that function throws error if column names are incorrect + triangle_renamed <- as.data.frame(triangle) + triangle_renamed <- stats::setNames( + triangle_renamed, + c("date", paste0("delay_", 0:4)) + ) + expect_error(estimate_delay(triangle_renamed)) +}) From 6df8dea2e47897470fe7cf1e5148c4bf09962d88 Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Wed, 29 Jan 2025 14:39:11 +0000 Subject: [PATCH 11/24] fix tests to expect errors for now --- R/estimate_delay.R | 2 +- tests/testthat/test-estimate_delay.R | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/R/estimate_delay.R b/R/estimate_delay.R index a7b2a1e..a1a2013 100644 --- a/R/estimate_delay.R +++ b/R/estimate_delay.R @@ -40,7 +40,7 @@ #' ) estimate_delay <- function(triangle, max_delay = ncol(triangle) - 2, - n_history = nrow(triangle) - 1) { + n_history = nrow(triangle)) { # Filter the triangle down to nrow = n_history + 1, ncol = max_delay max_date <- max(triangle$reference_date, na.rm = TRUE) - n_history trunc_triangle <- 
preprocess_reporting_triangle(triangle, max_delay) |> diff --git a/tests/testthat/test-estimate_delay.R b/tests/testthat/test-estimate_delay.R index fd4a84e..3e96a97 100644 --- a/tests/testthat/test-estimate_delay.R +++ b/tests/testthat/test-estimate_delay.R @@ -47,9 +47,9 @@ test_that("estimate_delay function works correctly", { result_dt <- estimate_delay(data.table::as.data.table(triangle)) expect_equal(result_dt, result_df) - # Test 8: Check if the function handles edge cases correctly + # Test 8: Check that the function errors if its not passed a triangle triangle_single_day <- triangle[1, ] - expect_warning(estimate_delay(triangle_single_day)) + expect_error(estimate_delay(triangle_single_day)) # Test 9: Check that function throws error if column names are incorrect triangle_renamed <- as.data.frame(triangle) From 87ce427a6ac7008009fec47c8c16b05d72fe451d Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Wed, 29 Jan 2025 15:12:09 +0000 Subject: [PATCH 12/24] add a check for the triangle containing the expected column name --- DESCRIPTION | 4 +++- R/estimate_delay.R | 3 +++ R/validate.R | 25 +++++++++++++++++++++++++ man/estimate_delay.Rd | 2 +- man/validate_triangle.Rd | 15 +++++++++++++++ tests/testthat/test-estimate_delay.R | 4 +++- 6 files changed, 50 insertions(+), 3 deletions(-) create mode 100644 R/validate.R create mode 100644 man/validate_triangle.Rd diff --git a/DESCRIPTION b/DESCRIPTION index 3120950..bc2f706 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -31,7 +31,9 @@ BugReports: https://github.com/epinowcast/baselinenowcast/issues Depends: R (>= 4.0.0) Imports: - data.table + data.table, + cli, + checkmate Suggests: dplyr, epinowcast, diff --git a/R/estimate_delay.R b/R/estimate_delay.R index a1a2013..73374a7 100644 --- a/R/estimate_delay.R +++ b/R/estimate_delay.R @@ -41,6 +41,9 @@ estimate_delay <- function(triangle, max_delay = ncol(triangle) - 2, n_history = nrow(triangle)) { + # Check that the input reporting triangle is formatted properly. + validate_triangle(triangle) + # Filter the triangle down to nrow = n_history + 1, ncol = max_delay max_date <- max(triangle$reference_date, na.rm = TRUE) - n_history trunc_triangle <- preprocess_reporting_triangle(triangle, max_delay) |> diff --git a/R/validate.R b/R/validate.R new file mode 100644 index 0000000..09e7c51 --- /dev/null +++ b/R/validate.R @@ -0,0 +1,25 @@ +#' Validate triangle +#' @description +#' Various checks to make sure that the reporting triangle passed in to +#' `estimate_delay()` is formatted properly. 
+#' @param triangle a dataframe formatted as a reporting triangle +#' +#' @returns +validate_triangle <- function(triangle) { + # check column names includes reference date + checkmate::assert_names(names(triangle), must.include = "reference_date") + # check that the columns are integer digits + checkmate::assert_integerish( + as.numeric(grep("^\\d+$", names(triangle), value = TRUE)) + ) + # Check that this returns integers not decimals + checkmate::assert_integerish( + as.numeric(grep("^\\d+$", names(triangle), value = TRUE)) + ) + # Check that grep returns something of length greater than 1 + if (length(as.numeric(grep("^\\d+$", names(triangle), value = TRUE))) < 1) { + cli::cli_abort( + message = "Attempt to convert column names to integers failed" + ) + } +} diff --git a/man/estimate_delay.Rd b/man/estimate_delay.Rd index b6da473..d5c0ac4 100644 --- a/man/estimate_delay.Rd +++ b/man/estimate_delay.Rd @@ -7,7 +7,7 @@ estimate_delay( triangle, max_delay = ncol(triangle) - 2, - n_history = nrow(triangle) - 1 + n_history = nrow(triangle) ) } \arguments{ diff --git a/man/validate_triangle.Rd b/man/validate_triangle.Rd new file mode 100644 index 0000000..9c45501 --- /dev/null +++ b/man/validate_triangle.Rd @@ -0,0 +1,15 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/validate.R +\name{validate_triangle} +\alias{validate_triangle} +\title{Validate triangle} +\usage{ +validate_triangle(triangle) +} +\arguments{ +\item{triangle}{a dataframe formatted as a reporting triangle} +} +\description{ +Various checks to make sure that the reporting triangle passed in to +\code{estimate_delay()} is formatted properly. +} diff --git a/tests/testthat/test-estimate_delay.R b/tests/testthat/test-estimate_delay.R index 3e96a97..c08ef34 100644 --- a/tests/testthat/test-estimate_delay.R +++ b/tests/testthat/test-estimate_delay.R @@ -57,5 +57,7 @@ test_that("estimate_delay function works correctly", { triangle_renamed, c("date", paste0("delay_", 0:4)) ) - expect_error(estimate_delay(triangle_renamed)) + expect_error(estimate_delay(triangle_renamed), + regexp = "Names must include the elements" + ) }) From bdfa23e401b2c35f0e06cf671abaabcc09883409 Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Wed, 29 Jan 2025 15:26:17 +0000 Subject: [PATCH 13/24] chamge to expect identical --- tests/testthat/test-estimate_delay.R | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/tests/testthat/test-estimate_delay.R b/tests/testthat/test-estimate_delay.R index c08ef34..9b65e25 100644 --- a/tests/testthat/test-estimate_delay.R +++ b/tests/testthat/test-estimate_delay.R @@ -12,15 +12,15 @@ test_that("estimate_delay function works correctly", { # Test 1: Basic functionality result <- estimate_delay(triangle) expect_is(result, "data.frame") - expect_equal(nrow(result), ncol(triangle) - 1) - expect_equal(colnames(result), c("delay", "pmf")) + expect_identical(nrow(result), ncol(triangle) - 1) + expect_identical(colnames(result), c("delay", "pmf")) expect_true(all(result$delay == 0:(ncol(triangle) - 2))) expect_true(all(result$pmf >= 0 & result$pmf <= 1)) expect_equal(sum(result$pmf), 1, tolerance = 1e-6) # Test 2: Custom max_delay result_max_delay <- estimate_delay(triangle, max_delay = 3) - expect_equal(nrow(result_max_delay), 4) + expect_identical(nrow(result_max_delay), 4) # Test 3: Custom n_history result_n_history <- estimate_delay(triangle, n_history = 20) @@ -40,12 +40,12 @@ test_that("estimate_delay function works correctly", { # Test 6: Consistency of results 
result1 <- estimate_delay(triangle) result2 <- estimate_delay(triangle) - expect_equal(result1, result2) + expect_identical(result1, result2) # Test 7: Handling different input types result_df <- estimate_delay(as.data.frame(triangle)) result_dt <- estimate_delay(data.table::as.data.table(triangle)) - expect_equal(result_dt, result_df) + expect_identical(result_dt, result_df) # Test 8: Check that the function errors if its not passed a triangle triangle_single_day <- triangle[1, ] From 7cee97c53b19bba40b92ee43f889b8c05a61b813 Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Wed, 29 Jan 2025 16:21:30 +0000 Subject: [PATCH 14/24] use as integer to get identical --- tests/testthat/test-estimate_delay.R | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tests/testthat/test-estimate_delay.R b/tests/testthat/test-estimate_delay.R index 9b65e25..a37fa20 100644 --- a/tests/testthat/test-estimate_delay.R +++ b/tests/testthat/test-estimate_delay.R @@ -12,7 +12,7 @@ test_that("estimate_delay function works correctly", { # Test 1: Basic functionality result <- estimate_delay(triangle) expect_is(result, "data.frame") - expect_identical(nrow(result), ncol(triangle) - 1) + expect_identical(as.integer(nrow(result)), as.integer(ncol(triangle) - 1)) expect_identical(colnames(result), c("delay", "pmf")) expect_true(all(result$delay == 0:(ncol(triangle) - 2))) expect_true(all(result$pmf >= 0 & result$pmf <= 1)) @@ -20,7 +20,7 @@ test_that("estimate_delay function works correctly", { # Test 2: Custom max_delay result_max_delay <- estimate_delay(triangle, max_delay = 3) - expect_identical(nrow(result_max_delay), 4) + expect_identical(as.integer(nrow(result_max_delay)), as.integer(4)) # Test 3: Custom n_history result_n_history <- estimate_delay(triangle, n_history = 20) From 05d1b43756c92a3fb98013f76029d09902d8f28b Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Wed, 29 Jan 2025 16:53:05 +0000 Subject: [PATCH 15/24] fix description, temporarily adds dplyr --- .github/workflows/R-CMD-check.yaml | 2 +- DESCRIPTION | 2 +- tests/testthat.R | 1 - tests/testthat/test-estimate_delay.R | 2 +- 4 files changed, 3 insertions(+), 4 deletions(-) diff --git a/.github/workflows/R-CMD-check.yaml b/.github/workflows/R-CMD-check.yaml index 991bda2..4564b37 100644 --- a/.github/workflows/R-CMD-check.yaml +++ b/.github/workflows/R-CMD-check.yaml @@ -49,7 +49,7 @@ jobs: - uses: r-lib/actions/setup-r-dependencies@v2 with: - extra-packages: any::rcmdcheck + extra-packages: any::rcmdcheck, any::testthat needs: check - uses: r-lib/actions/check-r-package@v2 diff --git a/DESCRIPTION b/DESCRIPTION index bc2f706..c32d8d6 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -34,8 +34,8 @@ Imports: data.table, cli, checkmate + dplyr Suggests: - dplyr, epinowcast, bookdown, ggplot2, diff --git a/tests/testthat.R b/tests/testthat.R index 484df2d..a8230d0 100644 --- a/tests/testthat.R +++ b/tests/testthat.R @@ -1,5 +1,4 @@ library(testthat) -library(withr) library(baselinenowcast) test_results <- test_check("baselinenowcast") diff --git a/tests/testthat/test-estimate_delay.R b/tests/testthat/test-estimate_delay.R index a37fa20..3e9baf5 100644 --- a/tests/testthat/test-estimate_delay.R +++ b/tests/testthat/test-estimate_delay.R @@ -20,7 +20,7 @@ test_that("estimate_delay function works correctly", { # Test 2: Custom max_delay result_max_delay <- estimate_delay(triangle, max_delay = 3) - expect_identical(as.integer(nrow(result_max_delay)), as.integer(4)) + expect_identical(as.integer(nrow(result_max_delay)), 4L) # Test 3: 
Custom n_history result_n_history <- estimate_delay(triangle, n_history = 20) From 27cf8acb82eb5729c2c9d94292603f58a78eef96 Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Wed, 29 Jan 2025 16:56:11 +0000 Subject: [PATCH 16/24] fix bug --- DESCRIPTION | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/DESCRIPTION b/DESCRIPTION index c32d8d6..f007f32 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -33,7 +33,7 @@ Depends: Imports: data.table, cli, - checkmate + checkmate, dplyr Suggests: epinowcast, From 505133fd973a7130269a1760cc5ac728443497e9 Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Thu, 30 Jan 2025 12:23:25 +0000 Subject: [PATCH 17/24] both ways of computing the delay are equivalent --- R/estimate_delay.R | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/R/estimate_delay.R b/R/estimate_delay.R index 73374a7..ffdf295 100644 --- a/R/estimate_delay.R +++ b/R/estimate_delay.R @@ -70,11 +70,22 @@ estimate_delay <- function(triangle, block_bottom_left ) } - # Not sure if this is right, as it might be overly weighting the - # estimate data (but I do think that is the point...) + # I think this should just be estimated using the lower half of the reporting + # triangle to get the latest delay estimate? + pmf <- vector(length = n_delays) + pmf[1] <- expectation[n_dates, 1] / sum(expectation[n_dates, 1:n_delays]) + for (i in 2:n_delays) { + pmf[i] <- sum(expectation[(n_dates - i + 2):n_dates, i]) / sum(expectation[(n_dates - i + 2):n_dates, 1:n_delays]) + } + + # Or use the whole thing. This might actually be the same... + alternate_pmf <- colSums(expectation) / sum(expectation) + + delay_df <- data.frame( delay = 0:max_delay, - pmf = colSums(expectation) / sum(expectation) + pmf = pmf, + atlernate_pmf = alternate_pmf ) return(delay_df) } From 34dd383cd9fe53f419ef80e35ce0c39f7929e154 Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Thu, 30 Jan 2025 14:46:06 +0000 Subject: [PATCH 18/24] change name of function --- NAMESPACE | 2 +- R/{estimate_delay.R => get_delay_estimate.R} | 8 +++--- R/validate.R | 2 +- ...stimate_delay.Rd => get_delay_estimate.Rd} | 10 +++---- man/validate_triangle.Rd | 3 ++ ...mate_delay.R => test-get_delay_estimate.R} | 28 +++++++++---------- vignettes/baselinenowcast.Rmd | 9 +++--- 7 files changed, 32 insertions(+), 30 deletions(-) rename R/{estimate_delay.R => get_delay_estimate.R} (94%) rename man/{estimate_delay.Rd => get_delay_estimate.Rd} (92%) rename tests/testthat/{test-estimate_delay.R => test-get_delay_estimate.R} (65%) diff --git a/NAMESPACE b/NAMESPACE index eaa2632..36debea 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -1,4 +1,4 @@ # Generated by roxygen2: do not edit by hand -export(estimate_delay) +export(get_delay_estimate) importFrom(dplyr,filter) diff --git a/R/estimate_delay.R b/R/get_delay_estimate.R similarity index 94% rename from R/estimate_delay.R rename to R/get_delay_estimate.R index ffdf295..17a9f53 100644 --- a/R/estimate_delay.R +++ b/R/get_delay_estimate.R @@ -34,13 +34,13 @@ #' ) #' pobs <- enw_preprocess_data(nat_germany_hosp, max_delay = 21) #' triangle_raw <- pobs$reporting_triangle[[1]] -#' delay_df <- estimate_delay(triangle_raw[, -1], +#' delay_df <- get_delay_estimate(triangle_raw[, -1], #' max_delay = 20, #' n_history = 30 #' ) -estimate_delay <- function(triangle, - max_delay = ncol(triangle) - 2, - n_history = nrow(triangle)) { +get_delay_estimate <- function(triangle, + max_delay = ncol(triangle) - 2, + n_history = nrow(triangle)) { # Check that the input reporting triangle 
is formatted properly. validate_triangle(triangle) diff --git a/R/validate.R b/R/validate.R index 09e7c51..0637476 100644 --- a/R/validate.R +++ b/R/validate.R @@ -4,7 +4,7 @@ #' `estimate_delay()` is formatted properly. #' @param triangle a dataframe formatted as a reporting triangle #' -#' @returns +#' @returns NULL, invisibly validate_triangle <- function(triangle) { # check column names includes reference date checkmate::assert_names(names(triangle), must.include = "reference_date") diff --git a/man/estimate_delay.Rd b/man/get_delay_estimate.Rd similarity index 92% rename from man/estimate_delay.Rd rename to man/get_delay_estimate.Rd index d5c0ac4..9957f74 100644 --- a/man/estimate_delay.Rd +++ b/man/get_delay_estimate.Rd @@ -1,10 +1,10 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/estimate_delay.R -\name{estimate_delay} -\alias{estimate_delay} +% Please edit documentation in R/get_delay_estimate.R +\name{get_delay_estimate} +\alias{get_delay_estimate} \title{Estimate a delay distribution from a reporting triangle} \usage{ -estimate_delay( +get_delay_estimate( triangle, max_delay = ncol(triangle) - 2, n_history = nrow(triangle) @@ -50,7 +50,7 @@ nat_germany_hosp <- enw_filter_report_dates( ) pobs <- enw_preprocess_data(nat_germany_hosp, max_delay = 21) triangle_raw <- pobs$reporting_triangle[[1]] -delay_df <- estimate_delay(triangle_raw[, -1], +delay_df <- get_delay_estimate(triangle_raw[, -1], max_delay = 20, n_history = 30 ) diff --git a/man/validate_triangle.Rd b/man/validate_triangle.Rd index 9c45501..d04fc75 100644 --- a/man/validate_triangle.Rd +++ b/man/validate_triangle.Rd @@ -9,6 +9,9 @@ validate_triangle(triangle) \arguments{ \item{triangle}{a dataframe formatted as a reporting triangle} } +\value{ +NULL, invisibly +} \description{ Various checks to make sure that the reporting triangle passed in to \code{estimate_delay()} is formatted properly. 
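
As an aside for readers following the series: at this stage the renamed `get_delay_estimate()` still expects a data frame or data.table with a `reference_date` column and delay-named columns (the matrix interface only arrives in a later commit). A self-contained usage sketch, mirroring the simulated triangle used in the new test file below, might look like the following; it is illustrative only and assumes the simulated counts behave like a fully reported triangle.

```r
library(data.table)
library(baselinenowcast)

set.seed(123)
# Simulated, fully reported triangle: 30 reference dates, delays 0 to 4
triangle <- data.table(
  reference_date = as.Date("2023-01-01") + 0:29,
  `0` = rpois(30, 100),
  `1` = rpois(30, 50),
  `2` = rpois(30, 25),
  `3` = rpois(30, 10),
  `4` = rpois(30, 5)
)

# Empirical point estimate of the delay PMF from the 20 most recent dates
delay_df <- get_delay_estimate(triangle, max_delay = 4, n_history = 20)
head(delay_df)
```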
diff --git a/tests/testthat/test-estimate_delay.R b/tests/testthat/test-get_delay_estimate.R similarity index 65% rename from tests/testthat/test-estimate_delay.R rename to tests/testthat/test-get_delay_estimate.R index 3e9baf5..7aa9433 100644 --- a/tests/testthat/test-estimate_delay.R +++ b/tests/testthat/test-get_delay_estimate.R @@ -1,4 +1,4 @@ -test_that("estimate_delay function works correctly", { +test_that("get_delay_estimate function works correctly", { set.seed(123) triangle <- data.table::data.table( reference_date = as.Date("2023-01-01") + 0:29, @@ -10,7 +10,7 @@ test_that("estimate_delay function works correctly", { ) # Test 1: Basic functionality - result <- estimate_delay(triangle) + result <- get_delay_estimate(triangle) expect_is(result, "data.frame") expect_identical(as.integer(nrow(result)), as.integer(ncol(triangle) - 1)) expect_identical(colnames(result), c("delay", "pmf")) @@ -19,37 +19,37 @@ test_that("estimate_delay function works correctly", { expect_equal(sum(result$pmf), 1, tolerance = 1e-6) # Test 2: Custom max_delay - result_max_delay <- estimate_delay(triangle, max_delay = 3) + result_max_delay <- get_delay_estimate(triangle, max_delay = 3) expect_identical(as.integer(nrow(result_max_delay)), 4L) # Test 3: Custom n_history - result_n_history <- estimate_delay(triangle, n_history = 20) + result_n_history <- get_delay_estimate(triangle, n_history = 20) expect_is(result_n_history, "data.frame") # Test 4: Input validation *These should be more useful error messages* - expect_error(estimate_delay(triangle, max_delay = 0)) - expect_error(estimate_delay(triangle, n_history = 0)) - expect_error(estimate_delay(triangle, max_delay = 10, n_history = 40)) + expect_error(get_delay_estimate(triangle, max_delay = 0)) + expect_error(get_delay_estimate(triangle, n_history = 0)) + expect_error(get_delay_estimate(triangle, max_delay = 10, n_history = 40)) # Test 5: Handling of missing values triangle_with_na <- triangle triangle_with_na[1, 2] <- NA - result_with_na <- estimate_delay(triangle_with_na) + result_with_na <- get_delay_estimate(triangle_with_na) expect_is(result_with_na, "data.frame") # Test 6: Consistency of results - result1 <- estimate_delay(triangle) - result2 <- estimate_delay(triangle) + result1 <- get_delay_estimate(triangle) + result2 <- get_delay_estimate(triangle) expect_identical(result1, result2) # Test 7: Handling different input types - result_df <- estimate_delay(as.data.frame(triangle)) - result_dt <- estimate_delay(data.table::as.data.table(triangle)) + result_df <- get_delay_estimate(as.data.frame(triangle)) + result_dt <- get_delay_estimate(data.table::as.data.table(triangle)) expect_identical(result_dt, result_df) # Test 8: Check that the function errors if its not passed a triangle triangle_single_day <- triangle[1, ] - expect_error(estimate_delay(triangle_single_day)) + expect_error(get_delay_estimate(triangle_single_day)) # Test 9: Check that function throws error if column names are incorrect triangle_renamed <- as.data.frame(triangle) @@ -57,7 +57,7 @@ test_that("estimate_delay function works correctly", { triangle_renamed, c("date", paste0("delay_", 0:4)) ) - expect_error(estimate_delay(triangle_renamed), + expect_error(get_delay_estimate(triangle_renamed), regexp = "Names must include the elements" ) }) diff --git a/vignettes/baselinenowcast.Rmd b/vignettes/baselinenowcast.Rmd index 76a5988..228d1d0 100644 --- a/vignettes/baselinenowcast.Rmd +++ b/vignettes/baselinenowcast.Rmd @@ -218,11 +218,10 @@ function. 
If reference date are repeated because the reporting triangle contains multiple strata, the function will throw an error. -The `estimate_delay()` function expects the following inputs: - - `triangle`: the reporting triangle for a single strata, with the left column - indicating the `reference_date` and the subsequent columns indexed starting - at 0 through the number of observed delays. There should be no repeated - reference dates or additional columns of metadata. +The `get_delay_estimate()` function expects the following inputs: + - `triangle`: a matrix with the reporting triangle for a single strata. Here + the rows represent the time points and the columns represent the observed + delays, starting at 0. - `max_delay`: an integer indicating the maximum delay to estimate. This must be less than or equal to the number of delays in `triangle`. - `n_history`: an integer indicating the number of observations by reference From 1e64384ce7692e18914ca3c185f705befe4daf77 Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Thu, 30 Jan 2025 15:56:56 +0000 Subject: [PATCH 19/24] change to using matrix as method --- DESCRIPTION | 6 +-- R/get_delay_estimate.R | 54 ++++++++++------------- R/handle_neg_vals.R | 41 ++++++++++++++++++ R/preprocess_reporting_triangle.R | 55 ------------------------ R/validate.R | 54 ++++++++++++++++------- man/get_delay_estimate.Rd | 15 ++++--- man/handle_neg_vals.Rd | 23 ++++++++++ man/validate_triangle.Rd | 14 +++++- tests/testthat/test-get_delay_estimate.R | 29 ++----------- vignettes/baselinenowcast.Rmd | 23 +++++----- 10 files changed, 161 insertions(+), 153 deletions(-) create mode 100644 R/handle_neg_vals.R delete mode 100644 R/preprocess_reporting_triangle.R create mode 100644 man/handle_neg_vals.Rd diff --git a/DESCRIPTION b/DESCRIPTION index f007f32..fc1cc94 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -31,13 +31,11 @@ BugReports: https://github.com/epinowcast/baselinenowcast/issues Depends: R (>= 4.0.0) Imports: - data.table, - cli, - checkmate, - dplyr + cli Suggests: epinowcast, bookdown, + dplyr, ggplot2, spelling, rmarkdown, diff --git a/R/get_delay_estimate.R b/R/get_delay_estimate.R index 17a9f53..f23820e 100644 --- a/R/get_delay_estimate.R +++ b/R/get_delay_estimate.R @@ -8,10 +8,9 @@ #' modified from the code originally developed by the Karlsruhe Institute #' of Technology RESPINOW German Hospitalization Nowcasting Hub, #' Modified from: https://github.com/KITmetricslab/RESPINOW-Hub/blob/7cce3ae2728116e8c8cc0e4ab29074462c24650e/code/baseline/functions.R#L55 #nolint -#' @param triangle dataframe of the reporting triangle, with one column for the -#' reference date and the remaining columns specifying the delay (units not -#' specified). Values indicate the number of new observations assigned to -#' the reference date at the specified delay. +#' @param triangle matrix of the reporting triangle, with rows representing +#' the time points of reference and columns representing the delays +#' indexed at 0 #' @param max_delay integer indicating the maximum delay to estimate, in units #' of the delay. The default is to use the whole reporting triangle, #' `ncol(triangle) -1`. 
@@ -33,31 +32,30 @@ #' latest_date = "2021-10-01" #' ) #' pobs <- enw_preprocess_data(nat_germany_hosp, max_delay = 21) -#' triangle_raw <- pobs$reporting_triangle[[1]] -#' delay_df <- get_delay_estimate(triangle_raw[, -1], +#' triangle_raw <- pobs$reporting_triangle |> +#' dplyr::select(-`.group`, -reference_date) |> +#' as.matrix() |> +#' unname() +#' delay_df <- get_delay_estimate(triangle_raw, #' max_delay = 20, #' n_history = 30 #' ) get_delay_estimate <- function(triangle, - max_delay = ncol(triangle) - 2, + max_delay = ncol(triangle) - 1, n_history = nrow(triangle)) { # Check that the input reporting triangle is formatted properly. - validate_triangle(triangle) + validate_triangle(triangle, + max_delay = max_delay, + n_history = n_history + ) # Filter the triangle down to nrow = n_history + 1, ncol = max_delay - max_date <- max(triangle$reference_date, na.rm = TRUE) - n_history - trunc_triangle <- preprocess_reporting_triangle(triangle, max_delay) |> - dplyr::filter(reference_date >= max_date) |> - as.data.frame() # nolint - # Make the date the rowname, so the matrix is just the entries - integer_cols <- which(colnames(trunc_triangle) %in% (grep("^\\d+$", names(trunc_triangle), value = TRUE))) # nolint - # the `..` is because its a data.table, we probably don't want to expect this from users. #nolint - rt <- as.matrix(trunc_triangle[, integer_cols]) - dates <- as.character(trunc_triangle$reference_date) - rownames(rt) <- dates + nr0 <- nrow(triangle) + trunc_triangle <- triangle[(nr0 - n_history + 1):nr0, 1:(max_delay + 1)] + rt <- handle_neg_vals(trunc_triangle) n_delays <- ncol(rt) n_dates <- nrow(rt) - factor <- matrix(nrow = max_delay - 1, ncol = 1) + factor <- vector(length = max_delay - 1) expectation <- rt for (co in 2:(n_delays)) { block_top_left <- rt[1:(n_dates - co + 1), 1:(co - 1), drop = FALSE] @@ -66,26 +64,18 @@ get_delay_estimate <- function(triangle, block_bottom_left <- expectation[(n_dates - co + 2):n_dates, 1:(co - 1), drop = FALSE ] + # We compute the expectation so that we can get the delay estimate expectation[(n_dates - co + 2):n_dates, co] <- factor[co - 1] * rowSums( block_bottom_left ) } - # I think this should just be estimated using the lower half of the reporting - # triangle to get the latest delay estimate? - pmf <- vector(length = n_delays) - pmf[1] <- expectation[n_dates, 1] / sum(expectation[n_dates, 1:n_delays]) - for (i in 2:n_delays) { - pmf[i] <- sum(expectation[(n_dates - i + 2):n_dates, i]) / sum(expectation[(n_dates - i + 2):n_dates, 1:n_delays]) - } - - # Or use the whole thing. This might actually be the same... - alternate_pmf <- colSums(expectation) / sum(expectation) - + # Use the completed reporting square to get the point estimate of the delay + # distribution + pmf <- colSums(expectation) / sum(expectation) delay_df <- data.frame( delay = 0:max_delay, - pmf = pmf, - atlernate_pmf = alternate_pmf + pmf = pmf ) return(delay_df) } diff --git a/R/handle_neg_vals.R b/R/handle_neg_vals.R new file mode 100644 index 0000000..b954a35 --- /dev/null +++ b/R/handle_neg_vals.R @@ -0,0 +1,41 @@ +#' Handle negative values in the reporting triangle +#' @description +#' Takes in a reporting triangle and returns a matrix in the same format +#' as the input triangle, but with negative values of reporting handled via +#' passing them to the subsequent days delay. 
+#' Modified from https://github.com/KITmetricslab/RESPINOW-Hub/blob/main/code/baseline/functions.R #nolint +#' @param triangle the reporting triangle as a matrix, where rows are the +#' time points and columns are the delays, already truncated to the maximum +#' delay and the number of historical observations +#' @return pos_triangle a positive integer matrix with negative values of +#' reporting handled via passing them to the subsequent days delay +handle_neg_vals <- function(triangle) { + integer_cols <- seq_len(ncol(triangle)) + pos_triangle <- triangle + for (i in seq_len(nrow(triangle))) { + to_subtract <- 0 + row <- pos_triangle[i, ] + # Loop over the columns starting from the last column back to max delay + # column, and if there is a negative value, we add this to the + # next day and set that one as 0. + for (j in rev(integer_cols)) { + value <- row[[j]] + if (!is.na(value)) { + # Either adds 0 or the previous days negative value + value <- value + to_subtract + if (value < 0) { + # Want to subtract from subsequent day + to_subtract <- value + pos_triangle[i, j] <- 0 # Set the negative value in the RT to 0 + } else { + pos_triangle[i, j] <- value + to_subtract <- 0 + } + } + } + } + for (col in integer_cols) { + pos_triangle[[col]] <- as.integer(pos_triangle[[col]]) + } + return(pos_triangle) +} diff --git a/R/preprocess_reporting_triangle.R b/R/preprocess_reporting_triangle.R deleted file mode 100644 index e14b62f..0000000 --- a/R/preprocess_reporting_triangle.R +++ /dev/null @@ -1,55 +0,0 @@ -#' Pre-process the reporting triangle -#' @description -#' Takes in a reporting triangle with the reference date as the first column, -#' and all subsequent columns named by their delay (unitless, so for example a -#' delay of 3 days or weeks would have the column name `3`), and values -#' indicating the number of new confirmed cases assigned to that reference -#' date with that particular delay. It returns a dataframe in the same format -#' as the input triangle, but truncated to include only the maximum delay -#' number of columns and with negative values of reporting handled via passing -#' them to the subsequent days delay. -#' Modified from https://github.com/KITmetricslab/RESPINOW-Hub/blob/main/code/baseline/functions.R #nolint -#' @param triangle the reporting triangle as a data.frame, with rows as the -#' reference date and columns as the delay, in any units. Assumes that the -#' delays will be in the format `{delay}` with no suffix for the unit -#' @param max_delay the maximum delay, in the delay units, to filter the -#' reporting triangle -#' @return trunc_triangle a dataframe in the same format as `triangle`, but -#' truncated to include only the maximum delay number of delay columns -#' and with negative values of reporting handled via passing them to the -#' subsequent days delay -preprocess_reporting_triangle <- function(triangle, max_delay) { - # restrict to the first columns in the max - cols_subset <- which(colnames(triangle) == max_delay) - trunc_triangle <- triangle[, 1:cols_subset] - # columns containing integer names: - integer_cols <- which(colnames(trunc_triangle) %in% (grep("^\\d+$", names(trunc_triangle), value = TRUE))) # nolint - # Loop over each row - for (i in seq_len(nrow(trunc_triangle))) { - to_subtract <- 0 - row <- trunc_triangle[i, ] - # Loop over the columns starting from the last column back to max delay - # column, and if there is a negative value, we add this to the - # next day and set that one as 0. 
- for (j in rev(integer_cols)) { - value <- row[[j]] - if (!is.na(value)) { - # Either adds 0 or the previous days negative value - value <- value + to_subtract - if (value < 0) { - # Want to subtract from subsequent day - to_subtract <- value - trunc_triangle[i, j] <- 0 # Set the negative value in the RT to 0 - } else { - trunc_triangle[i, j] <- value - to_subtract <- 0 - } - } - } - } - # Convert 'value' columns to integer type - for (col in integer_cols) { - trunc_triangle[[col]] <- as.integer(trunc_triangle[[col]]) - } - return(data.table::as.data.table(trunc_triangle)) -} diff --git a/R/validate.R b/R/validate.R index 0637476..ec6e04f 100644 --- a/R/validate.R +++ b/R/validate.R @@ -2,24 +2,48 @@ #' @description #' Various checks to make sure that the reporting triangle passed in to #' `estimate_delay()` is formatted properly. -#' @param triangle a dataframe formatted as a reporting triangle +#' @param triangle a matrix of values with rows indicating the time points and +#' columns indicating the delays +#' @inheritParams get_delay_estimate #' #' @returns NULL, invisibly -validate_triangle <- function(triangle) { - # check column names includes reference date - checkmate::assert_names(names(triangle), must.include = "reference_date") - # check that the columns are integer digits - checkmate::assert_integerish( - as.numeric(grep("^\\d+$", names(triangle), value = TRUE)) - ) - # Check that this returns integers not decimals - checkmate::assert_integerish( - as.numeric(grep("^\\d+$", names(triangle), value = TRUE)) - ) - # Check that grep returns something of length greater than 1 - if (length(as.numeric(grep("^\\d+$", names(triangle), value = TRUE))) < 1) { +validate_triangle <- function(triangle, + max_delay, + n_history) { + if (nrow(triangle) <= ncol(triangle)) { cli::cli_abort( - message = "Attempt to convert column names to integers failed" + message = c( + "Number of observations must be greater than the maximum", + "delay" + ) + ) + } + + if (nrow(triangle) < n_history) { + cli::cli_abort( + message = c( + "Number of observations in input data not sufficient for", + "user specified number of historical observations to use", + "for estimaton." + ) + ) + } + + if (ncol(triangle) < (max_delay + 1)) { + cli::cli_abort( + message = c( + "Number of delays in input data not sufficient for", + "user specified maximum delay" + ) + ) + } + + if ((max_delay < 1 || n_history < 1)) { + cli::cli_abort( + message = c( + "Insufficient `max_delay` or `n_history`, must be greater than or", + " equal to 1." + ) ) } } diff --git a/man/get_delay_estimate.Rd b/man/get_delay_estimate.Rd index 9957f74..036359c 100644 --- a/man/get_delay_estimate.Rd +++ b/man/get_delay_estimate.Rd @@ -6,15 +6,14 @@ \usage{ get_delay_estimate( triangle, - max_delay = ncol(triangle) - 2, + max_delay = ncol(triangle) - 1, n_history = nrow(triangle) ) } \arguments{ -\item{triangle}{dataframe of the reporting triangle, with one column for the -reference date and the remaining columns specifying the delay (units not -specified). Values indicate the number of new observations assigned to -the reference date at the specified delay.} +\item{triangle}{matrix of the reporting triangle, with rows representing +the time points of reference and columns representing the delays +indexed at 0} \item{max_delay}{integer indicating the maximum delay to estimate, in units of the delay. 
The default is to use the whole reporting triangle, @@ -49,8 +48,10 @@ nat_germany_hosp <- enw_filter_report_dates( latest_date = "2021-10-01" ) pobs <- enw_preprocess_data(nat_germany_hosp, max_delay = 21) -triangle_raw <- pobs$reporting_triangle[[1]] -delay_df <- get_delay_estimate(triangle_raw[, -1], +triangle_raw <- pobs$reporting_triangle |> + dplyr::select(- `.group`, -reference_date) |> + as.matrix() |> unname() +delay_df <- get_delay_estimate(triangle_raw, max_delay = 20, n_history = 30 ) diff --git a/man/handle_neg_vals.Rd b/man/handle_neg_vals.Rd new file mode 100644 index 0000000..2e027d3 --- /dev/null +++ b/man/handle_neg_vals.Rd @@ -0,0 +1,23 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/handle_neg_vals.R +\name{handle_neg_vals} +\alias{handle_neg_vals} +\title{Handle negative values in the reporting triangle} +\usage{ +handle_neg_vals(triangle) +} +\arguments{ +\item{triangle}{the reporting triangle as a matrix, where rows are the +time points and columns are the delays, already truncated to the maximum +delay and the number of historical observations} +} +\value{ +pos_triangle a positive integer matrix with negative values of +reporting handled via passing them to the subsequent days delay +} +\description{ +Takes in a reporting triangle and returns a matrix in the same format +as the input triangle, but with negative values of reporting handled via passing +them to the subsequent days delay. +Modified from https://github.com/KITmetricslab/RESPINOW-Hub/blob/main/code/baseline/functions.R #nolint +} diff --git a/man/validate_triangle.Rd b/man/validate_triangle.Rd index d04fc75..89055b7 100644 --- a/man/validate_triangle.Rd +++ b/man/validate_triangle.Rd @@ -4,10 +4,20 @@ \alias{validate_triangle} \title{Validate triangle} \usage{ -validate_triangle(triangle) +validate_triangle(triangle, max_delay, n_history) } \arguments{ -\item{triangle}{a dataframe formatted as a reporting triangle} +\item{triangle}{a matrix of values with rows indicating the time points and +columns indicating the delays} + +\item{max_delay}{integer indicating the maximum delay to estimate, in units +of the delay. The default is to use the whole reporting triangle, +\code{ncol(triangle) -1}.} + +\item{n_history}{integer indicating the number of reference dates to be +used in the estimate of the reporting delay, always starting from the most +recent reporting delay. 
The default is to use the whole reporting triangle, +so \code{nrow(triangle)-1}} } \value{ NULL, invisibly diff --git a/tests/testthat/test-get_delay_estimate.R b/tests/testthat/test-get_delay_estimate.R index 7aa9433..08ddb3c 100644 --- a/tests/testthat/test-get_delay_estimate.R +++ b/tests/testthat/test-get_delay_estimate.R @@ -1,20 +1,12 @@ test_that("get_delay_estimate function works correctly", { set.seed(123) - triangle <- data.table::data.table( - reference_date = as.Date("2023-01-01") + 0:29, - `0` = rpois(30, 100), - `1` = rpois(30, 50), - `2` = rpois(30, 25), - `3` = rpois(30, 10), - `4` = rpois(30, 5) - ) + triangle <- matrix(rpois(30 * 5, lambda = 25), nrow = 30, ncol = 5) # Test 1: Basic functionality result <- get_delay_estimate(triangle) expect_is(result, "data.frame") - expect_identical(as.integer(nrow(result)), as.integer(ncol(triangle) - 1)) + expect_identical(as.integer(nrow(result)), as.integer(ncol(triangle))) expect_identical(colnames(result), c("delay", "pmf")) - expect_true(all(result$delay == 0:(ncol(triangle) - 2))) expect_true(all(result$pmf >= 0 & result$pmf <= 1)) expect_equal(sum(result$pmf), 1, tolerance = 1e-6) @@ -42,22 +34,7 @@ test_that("get_delay_estimate function works correctly", { result2 <- get_delay_estimate(triangle) expect_identical(result1, result2) - # Test 7: Handling different input types - result_df <- get_delay_estimate(as.data.frame(triangle)) - result_dt <- get_delay_estimate(data.table::as.data.table(triangle)) - expect_identical(result_dt, result_df) - - # Test 8: Check that the function errors if its not passed a triangle + # Test 7: Check that the function errors if its not passed a triangle triangle_single_day <- triangle[1, ] expect_error(get_delay_estimate(triangle_single_day)) - - # Test 9: Check that function throws error if column names are incorrect - triangle_renamed <- as.data.frame(triangle) - triangle_renamed <- stats::setNames( - triangle_renamed, - c("date", paste0("delay_", 0:4)) - ) - expect_error(get_delay_estimate(triangle_renamed), - regexp = "Names must include the elements" - ) }) diff --git a/vignettes/baselinenowcast.Rmd b/vignettes/baselinenowcast.Rmd index 228d1d0..5aeffc9 100644 --- a/vignettes/baselinenowcast.Rmd +++ b/vignettes/baselinenowcast.Rmd @@ -192,7 +192,7 @@ reporting triangle. # observed long above. observed_long <- epinowcast::germany_covid19_hosp[location == "DE"][age_group == "00+"] |> # nolint enw_filter_report_dates(latest_date = nowcast_date) |> - enw_filter_reference_dates(include_days = n_history) + enw_filter_reference_dates(include_days = n_history - 1) head(observed_long) # Get the reporting triangle, adding an additional day because epinowcast # we want the max_delay + 1 entries since 0 is a valid delay. @@ -227,22 +227,21 @@ The `get_delay_estimate()` function expects the following inputs: - `n_history`: an integer indicating the number of observations by reference date to use to fit the delay distribution. This must be less than the number of rows in `triangle`. -`enw_preoprocess()` returns a `triangle` with an additional column `.group` -which will not be used in our estimate of the delay distribution, as -`estimate_delay()` performs the estimate on a single strata only. -If your data contains multiple groups, we recommend either aggregating across -groups (for a fully pooled estimate) or estimating the delay individually for -the strata of your choice (no pooling). -`baselinenowcast` does not support a hierarchical, or partially pooled, -estimate of the delay by design. 
+`enw_preoprocess()` returns a `triangle` with the columns `.group` and +`.reference_date` and delays indicated by the column names. While we will +eventually write methods that will map from other input formats such as this, +we will start by demonstrating the functionality on the raw matrix of the +reporting triangle. + ```{r estimate-delay} triangle <- triangle_full |> - dplyr::select(-`.group`) |> # remove unecessary `.group` column - as.data.frame() # baseline nowcast expects dataframe as an input + dplyr::select(-`.group`, -reference_date) |> # remove unnecessary columns + as.matrix() |> + unname() head(triangle) -delay_df <- estimate_delay( +delay_df <- get_delay_estimate( triangle = triangle, max_delay = max_delay, n_history = n_history From bb252ecf3154042088bb6cc3d99fd80e0068a300 Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Thu, 30 Jan 2025 16:04:11 +0000 Subject: [PATCH 20/24] fix docs, remove dplyr --- NAMESPACE | 1 - R/get_delay_estimate.R | 1 - man/get_delay_estimate.Rd | 5 +++-- man/handle_neg_vals.Rd | 4 ++-- man/preprocess_reporting_triangle.Rd | 33 ---------------------------- 5 files changed, 5 insertions(+), 39 deletions(-) delete mode 100644 man/preprocess_reporting_triangle.Rd diff --git a/NAMESPACE b/NAMESPACE index 36debea..92938d6 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -1,4 +1,3 @@ # Generated by roxygen2: do not edit by hand export(get_delay_estimate) -importFrom(dplyr,filter) diff --git a/R/get_delay_estimate.R b/R/get_delay_estimate.R index f23820e..170eec7 100644 --- a/R/get_delay_estimate.R +++ b/R/get_delay_estimate.R @@ -22,7 +22,6 @@ #' and `pmf`, indicating the point estimate of the empirical probability #' mass on each delay #' @export -#' @importFrom dplyr filter #' @examples #' library(epinowcast) #' nat_germany_hosp <- diff --git a/man/get_delay_estimate.Rd b/man/get_delay_estimate.Rd index 036359c..c0ec941 100644 --- a/man/get_delay_estimate.Rd +++ b/man/get_delay_estimate.Rd @@ -49,8 +49,9 @@ nat_germany_hosp <- enw_filter_report_dates( ) pobs <- enw_preprocess_data(nat_germany_hosp, max_delay = 21) triangle_raw <- pobs$reporting_triangle |> - dplyr::select(- `.group`, -reference_date) |> - as.matrix() |> unname() + dplyr::select(-`.group`, -reference_date) |> + as.matrix() |> + unname() delay_df <- get_delay_estimate(triangle_raw, max_delay = 20, n_history = 30 diff --git a/man/handle_neg_vals.Rd b/man/handle_neg_vals.Rd index 2e027d3..e88940d 100644 --- a/man/handle_neg_vals.Rd +++ b/man/handle_neg_vals.Rd @@ -17,7 +17,7 @@ reporting handled via passing them to the subsequent days delay } \description{ Takes in a reporting triangle and returns a matrix in the same format -as the input triangle, but with negative values of reporting handled via passing -them to the subsequent days delay. +as the input triangle, but with negative values of reporting handled via +passing them to the subsequent days delay. 
Modified from https://github.com/KITmetricslab/RESPINOW-Hub/blob/main/code/baseline/functions.R #nolint } diff --git a/man/preprocess_reporting_triangle.Rd b/man/preprocess_reporting_triangle.Rd deleted file mode 100644 index fc3009f..0000000 --- a/man/preprocess_reporting_triangle.Rd +++ /dev/null @@ -1,33 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/preprocess_reporting_triangle.R -\name{preprocess_reporting_triangle} -\alias{preprocess_reporting_triangle} -\title{Pre-process the reporting triangle} -\usage{ -preprocess_reporting_triangle(triangle, max_delay) -} -\arguments{ -\item{triangle}{the reporting triangle as a data.frame, with rows as the -reference date and columns as the delay, in any units. Assumes that the -delays will be in the format \code{{delay}} with no suffix for the unit} - -\item{max_delay}{the maximum delay, in the delay units, to filter the -reporting triangle} -} -\value{ -trunc_triangle a dataframe in the same format as \code{triangle}, but -truncated to include only the maximum delay number of delay columns -and with negative values of reporting handled via passing them to the -subsequent days delay -} -\description{ -Takes in a reporting triangle with the reference date as the first column, -and all subsequent columns named by their delay (unitless, so for example a -delay of 3 days or weeks would have the column name \code{3}), and values -indicating the number of new confirmed cases assigned to that reference -date with that particular delay. It returns a dataframe in the same format -as the input triangle, but truncated to include only the maximum delay -number of columns and with negative values of reporting handled via passing -them to the subsequent days delay. -Modified from https://github.com/KITmetricslab/RESPINOW-Hub/blob/main/code/baseline/functions.R #nolint -} From 65e60bead974561b227ea5c3705e4a792d9b4ac5 Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Thu, 30 Jan 2025 16:18:22 +0000 Subject: [PATCH 21/24] fix example --- R/get_delay_estimate.R | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/R/get_delay_estimate.R b/R/get_delay_estimate.R index 170eec7..eca844b 100644 --- a/R/get_delay_estimate.R +++ b/R/get_delay_estimate.R @@ -31,7 +31,7 @@ #' latest_date = "2021-10-01" #' ) #' pobs <- enw_preprocess_data(nat_germany_hosp, max_delay = 21) -#' triangle_raw <- pobs$reporting_triangle |> +#' triangle_raw <- pobs$reporting_triangle[[1]] |> #' dplyr::select(-`.group`, -reference_date) |> #' as.matrix() |> #' unname() From e62423a6691471b46adec0be40d133e359338950 Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Thu, 30 Jan 2025 16:31:27 +0000 Subject: [PATCH 22/24] add docs --- man/get_delay_estimate.Rd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/man/get_delay_estimate.Rd b/man/get_delay_estimate.Rd index c0ec941..1b4c740 100644 --- a/man/get_delay_estimate.Rd +++ b/man/get_delay_estimate.Rd @@ -48,7 +48,7 @@ nat_germany_hosp <- enw_filter_report_dates( latest_date = "2021-10-01" ) pobs <- enw_preprocess_data(nat_germany_hosp, max_delay = 21) -triangle_raw <- pobs$reporting_triangle |> +triangle_raw <- pobs$reporting_triangle[[1]] |> dplyr::select(-`.group`, -reference_date) |> as.matrix() |> unname() From b8fac19acfc38c3e3616ca158aebced100a37cf4 Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Thu, 30 Jan 2025 17:10:32 +0000 Subject: [PATCH 23/24] add a bit more text to outline --- vignettes/baselinenowcast.Rmd | 84 +++++++++++++++++++++++++---------- 1 
file changed, 60 insertions(+), 24 deletions(-)

diff --git a/vignettes/baselinenowcast.Rmd b/vignettes/baselinenowcast.Rmd
index 5aeffc9..0b39662 100644
--- a/vignettes/baselinenowcast.Rmd
+++ b/vignettes/baselinenowcast.Rmd
@@ -28,18 +28,15 @@ the most recent time points.
 These are the time points where the highest proportion of the data has yet to
 be observed in the dataset.
 
-The imputation of what will eventually be observed up until the current time is
-referred to as a nowcast.
+The imputation of the cases that will eventually be observed up until the
+current time is referred to as a nowcast.
 A number of methods have been developed to nowcast epidemiological case count
 data.
 
-The purpose of `baselinenowcast` is to provide a nowcast using the most recent
-observations and a relatively straightforward multiplicative method for
-adjusting incomplete reports.
-The goal is to use this method as a baseline for comparing to potentially more
-complex model-based approaches which may have additional features that are
-potentially specific to the dataset or problem context.
+The purpose of `baselinenowcast` is to provide a baseline nowcast: it uses the
+most recent observations to estimate a delay distribution empirically, and
+applies that estimate to the partially observed data to generate a nowcast.
 
 In the below section, we will describe an example of a nowcasting problem, and
 demonstrate how to use `baselinenowcast` to estimate a delay distribution from
@@ -55,7 +52,8 @@ wrangling needed to set up the nowcasting problem.
 We note that no components of the vignette require installing `CmdStan`, which
 is a downstream dependency of `epinowcast`.
 We will just be using the `R` components of `epinowcast`, which can be
-installed using the example lines of code below.
+installed using the example lines of code below, so there is no need to
+additionally install `CmdStan`.
 ```{r setup, message = FALSE}
 # Installing epinowcast
 # install.packages( #nolint
@@ -74,8 +72,8 @@ set.seed(123)
 
 # Data
 
 Nowcasting of right-truncated case counts involves the estimation of reporting
 delays for recently reported data.
-For this, we need case counts both by when they were diagnosed (e.g. when
-someone tests positive; often called "reference date") and by when they were
+For this, we need case counts both by when they were diagnosed (often
+called "reference date") and by when they were
 reported (i.e. when administratively recorded via public health surveillance;
 often called "report date").
 The difference between the reference date and the report date is the reporting
 delay.
@@ -104,7 +102,7 @@ in the complete dataset.
 # Plotting
 
 The red line shows the cumulative number of confirmed admissions on each report
-date, across all delays
+date, across all delays, using the data available as of July 1, 2021.
 It demonstrates the characteristic behavior of right-truncation.
 This is because we have not yet observed the data that will become available
 for the longer delays at recent time points.
@@ -136,17 +134,18 @@ ggplot() +
   ggtitle("Comparing all admissions by reference date and those with no delay")
 ```
 Here the black line represents the quantity we will evaluate our nowcast
-against,and the red line represents the data we have available to us up until
+against, and the red line represents the data we have available to us up until
 July 1st, 2021.
 
 # Pre-processing
 
 In order to compute a nowcast for this data, we will need to start by creating
 what we call a reporting triangle.
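Purely as an illustration of the structure described just below (a made-up toy
example, not output from `baselinenowcast` or `epinowcast`), a reporting
triangle for four reference dates and a maximum delay of two days might look
like this:

```r
# Toy reporting triangle: rows are reference dates, columns are delays 0-2,
# and entries are the new cases reported for that reference date at that delay.
# The lower-right corner is NA because those reports have not yet arrived as of
# the most recent reference date.
toy_triangle <- matrix(
  c(
    5, 3, 1,
    6, 2, 2,
    7, 4, NA,
    8, NA, NA
  ),
  nrow = 4,
  byrow = TRUE,
  dimnames = list(
    as.character(as.Date("2021-06-28") + 0:3), # reference dates
    as.character(0:2) # delays in days
  )
)
toy_triangle
```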
-This is a wide formatted dataframe with one column for the reference date
-and another column for each of the delays, from 0 to the maximum delay.
+This is a wide formatted dataframe where each row represents one of the time
+points being referenced and each column represents the delay, starting from 0
+up until the maximum delay.
 The entries represent the number of new cases assigned to that reference
-date with a particular delay.
+time point with a particular delay.
 The reporting triangle will be used to estimate the delay distribution, or the
 proportion of the final number of cases reported on a particular delay.
 Since this data is both reported and referenced daily, we will use the time
@@ -223,14 +222,15 @@ The `get_delay_estimate()` function expects the following inputs:
   the rows represent the time points and the columns represent the observed
   delays, starting at 0.
 - `max_delay`: an integer indicating the maximum delay to estimate. This must
-  be less than or equal to the number of delays in `triangle`.
+  be less than or equal to the number of rows in `triangle` minus 1, since we
+  assume `triangle` is indexed at 0.
 - `n_history`: an integer indicating the number of observations by reference
-  date to use to fit the delay distribution. This must be less than the number
-  of rows in `triangle`.
+  date to use to fit the delay distribution. This must be less than or equal to
+  the number of rows in `triangle`.
-`enw_preoprocess()` returns a `triangle` with the columns `.group` and
+`enw_preprocess()` returns a `triangle` with the columns `.group` and
 `.reference_date` and delays indicated by the column names. While we will
 eventually write methods that will map from other input formats such as this,
-we will start by demonstrating the functionality on the raw matrix of the
+we will start by demonstrating the functionality using only the matrix of the
 reporting triangle.
 
 ```{r estimate-delay}
@@ -240,7 +240,6 @@ triangle <- triangle_full |>
   unname()
 
 head(triangle)
-
 delay_df <- get_delay_estimate(
   triangle = triangle,
   max_delay = max_delay,
@@ -275,8 +274,45 @@ The reporting triangle we are applying it to must have
 the same `max_delay` as the delay estimate.
 
 ```{r}
-# reporting_square <- apply_delay( #nolint
+# point_reporting_square <- apply_delay( #nolint
 #   triangle_to_nowcast = triangle,#nolint
-#   delay_f#nolint
+#   delay_df#nolint
 # )#nolint
 ```
+
+# Estimate uncertainty
+
+So far, we've demonstrated how to generate a point estimate of a nowcast.
+We would like to generate a probabilistic nowcast, which will require
+estimating the empirical error between the expected observations and the
+actual observations when applying the delay distribution.
+This assumes a negative binomial observation model.
+
+*Note, we could probably create objects of different classes for the outputs
+of each of these functions, but I am not sure this is necessary.*
+
+```{r}
+# disp_params <- estimate_uncertainty( #nolint
+#   triangle_to_nowcast = triangle, #nolint
+#   delay_pmf = delay_df$pmf #nolint
+# ) #nolint
+```
+
+# Generate probabilistic nowcast
+
+Now that we have estimated the dispersion, we can use the delay distribution,
+the reporting triangle that we want to nowcast, and the dispersion parameters to
+generate a probabilistic nowcast.
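The commented-out chunks in this outline are placeholders for functions that do
not exist yet. Purely to illustrate the idea, and not the `baselinenowcast`
API, the hand-rolled sketch below shows how a negative binomial observation
model can turn a delay point estimate and an assumed dispersion parameter into
nowcast draws for a single reference date; all names and numbers here are made
up.

```r
# Illustration only: one reference date with delays 0 and 1 reported and
# delay 2 still unobserved.
delay_pmf <- c(0.5, 0.3, 0.2) # point estimate of the delay distribution
observed <- c(12, 7, NA) # reported counts by delay for one reference date
disp <- 5 # assumed negative binomial dispersion (size) parameter

# Scale up what has been reported so far by the probability of having reported
# by now, then take the expected count still to come at delay 2.
prop_reported <- sum(delay_pmf[!is.na(observed)])
expected_missing <- sum(observed, na.rm = TRUE) / prop_reported * delay_pmf[3]

# Draw the missing reports from a negative binomial observation model and add
# them to what was already observed to get nowcast samples of the final total.
set.seed(123)
nowcast_draws <- sum(observed, na.rm = TRUE) +
  rnbinom(1000, size = disp, mu = expected_missing)
quantile(nowcast_draws, probs = c(0.05, 0.5, 0.95))
```

In the outline above, the dispersion parameters are intended to be estimated
from the empirical errors rather than assumed, but the sampling step is
conceptually similar.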
+ +```{r} +# nowcast <- generate_nowcast( #nolint +# triangle_to_nowcast = triangle, #nolint +# delay_df = delay_df, #nolint +# disp_params = disp_params #nolint +# ) #nolint +``` + +In the above, we might want to make sure we include the necessary metadata if +the user passes it in, so we could write methods that dispatch on objects of +a particular type so that they spit back out things with dates if those were +inputted. From 315e0bd6d05bebbe231249c828e0951a45d1fe52 Mon Sep 17 00:00:00 2001 From: Kaitlyn Johnson Date: Thu, 30 Jan 2025 17:15:14 +0000 Subject: [PATCH 24/24] add to changelog" --- NEWS.md | 1 + 1 file changed, 1 insertion(+) diff --git a/NEWS.md b/NEWS.md index 7b180ec..3691513 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,3 +1,4 @@ # baselinenowcast 0.0.0.1000 +- Introduced first function to get the delay estimate and an outline of the Getting Started vignette - Added package skeleton.
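As a rough sketch of the dispatch idea noted at the end of the vignette outline
above, one option is an S3 class that carries reference dates as an attribute,
so that methods can return dated output whenever dates were supplied.
Everything below is hypothetical; none of these functions or classes are part
of `baselinenowcast`.

```r
# Hypothetical sketch only: illustrative class and function names.
new_reporting_triangle <- function(mat, reference_dates = NULL) {
  structure(
    mat,
    reference_dates = reference_dates,
    class = c("reporting_triangle", class(mat))
  )
}

summarise_nowcast <- function(x, ...) UseMethod("summarise_nowcast")

# Default method: summarise totals by row index only.
summarise_nowcast.default <- function(x, ...) {
  data.frame(time = seq_len(nrow(x)), total = rowSums(x, na.rm = TRUE))
}

# Method for the classed triangle: add reference dates back if they were given.
summarise_nowcast.reporting_triangle <- function(x, ...) {
  out <- summarise_nowcast.default(x, ...)
  dates <- attr(x, "reference_dates")
  if (!is.null(dates)) out$reference_date <- dates
  out
}

tri <- new_reporting_triangle(
  matrix(c(5, 3, 6, NA), nrow = 2, byrow = TRUE),
  reference_dates = as.Date(c("2021-06-30", "2021-07-01"))
)
summarise_nowcast(tri)
```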