From 3f5e46386cb5a2ee3500d4d47858b27238088f05 Mon Sep 17 00:00:00 2001 From: Sebastian Funk Date: Wed, 26 Jul 2023 13:42:29 +0100 Subject: [PATCH 1/7] add vignette with theoretical background --- vignettes/theoretical_background.Rmd | 148 +++++++++++++++++++++++++++ 1 file changed, 148 insertions(+) create mode 100644 vignettes/theoretical_background.Rmd diff --git a/vignettes/theoretical_background.Rmd b/vignettes/theoretical_background.Rmd new file mode 100644 index 0000000..b2c0267 --- /dev/null +++ b/vignettes/theoretical_background.Rmd @@ -0,0 +1,148 @@ +--- +title: "bpmodels: Theoretical background" +author: "Sebastian Funk" +output: + bookdown::html_vignette2: + fig_caption: yes + code_folding: show +pkgdown: + as_is: true +bibliography: references.json +link-citations: true +vignette: > + %\VignetteEncoding{UTF-8} + %\VignetteIndexEntry{bpmodels: Theoretical background} + %\VignetteEngine{knitr::rmarkdown} +editor_options: + chunk_output_type: console +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set( + echo = TRUE, + message = FALSE, + warning = FALSE, + collapse = TRUE, + comment = "#>" +) +``` + +_bpmodels_ provides methods to analyse and simulate the size and length of branching processes with an arbitrary offspring distribution. +In this vignette we lay out the mathematical concepts behind the functionality available +in the package. + +# Branching processes + +[Branching processes](https://en.wikipedia.org/wiki/Branching_process) are a class of models that are used to model the growth of populations where the number of offspring $Z$ that each member of the population has is a random variable following a given probability distribution with probability mass function $p(Z = z | \theta)$, the _offspring distribution_. +Their use has a long history in epidemiology, where the population is interpreted as a pathogen, and the offspring as new hosts that it infects [@farrington2003]. +Below we will call these infected individuals _cases_ but the methods could be applied in other contexts where branching processes are to be used. + +# Simulation + +To simulate from a branching process, we start with a single case and proceed in discrete steps or generations, drawing from $p(Z=z | \theta)$ to generate new cases from each case. + +If we additionally define the distribution of times $T$ as a random variable with distribution $f(T = t; \theta)$ we can assign each case $j$ a time $t_{j}$ which, if case $j$ has been affected by case $i$ is given by $f(t_{j} - t_{i} | \theta)$. +If we identify the timing cases by the time of their symptom onset this is the [serial interval](https://en.wikipedia.org/wiki/Serial_interval), but depending on case definitions this could be another interval. + +## Summary statistics + +To summarise the simulation of a branching process after it has ended (either because it has gone extinct or because of some stopping criterion) we either study the _size_ or _length_ of the resulting _chain_ of cases. +The size $S$ of a chain is the number of cases that have occurred over the course of the simulation including the initial case so that $S \geq 1$. +The length $L$ of a chain is the number of generations that have been simulated including the initial case so that $L \geq 1$. + +# Inference + +By characterising a chains of cases by their size of length we can conduct inference to learn about the underlying parameters $\theta$ from observation of chain sizes or chain lengths [@blumberg2013]. +In general this is only possible for _subcritical_ branching processes, i.e. ones where the mean number of offspring is less than 1, as otherwise the branching process could grow forever. +However, we can expand the theory to _supercritical_ branching processes, i.e. ones where the mean number of offspring is greater than 1, by defining a cutoff of chain size or length beyond which we treat the chain as if it had infinite size or length, respectively. + +## Size and length distributions for some offspring distributions + +We show the equations for the size and length distributions for some offspring distributions where they can be derived analytically: +[Poisson](https://en.wikipedia.org/wiki/Poisson_distribution), +[negative binomial](https://en.wikipedia.org/wiki/Negative_binomial_distribution), +[geometric](https://en.wikipedia.org/wiki/Geometric_distribution), +and a [gamma](https://en.wikipedia.org/wiki/Gamma_distribution)-[Borel](https://en.wikipedia.org/wiki/Borel_distribution) mixture. + +### Negative binomial and special cases + +If the offspring distribution is a Poisson distribution, we can interpret its rate parameter $\lambda$ as the basic reproduction number $R_{0}$ of the pathogen. +In the more general case where the offspring definition can be _overdispersed_ leading to _superspreading_ we can use a negative binomial offspring distribution with mean $\mu$ and overdispersion $k$ [@lloyd-smith2005, @blumberg2013]. +In that case, the mean parameter $\mu$ is interpreted as the basic reproduction number $R_0$ of the pathogen. +The negative binomial distribution arises from a Poisson-gamma mixture and thus a branching process with negative binomial distributed offspring can be interpreted as one with Poisson distributed offspring where the basic reproduction number $R_0$ itself varies according to a gamma distribution. +The amount of variation in $R_0$ is then interpreted as individual-level variation in transmission representing overdispersion or superspreading, and the degree to which this happens is given by the overdispersion parameter $k$. + +#### Size distributions + +The probability $p$ of a chain of size $S$ given $R_0$ and $k$ in a branching process with negative binomial offspring distribution is given in Eq. 9 of @blumberg2013 + +$$ +p(S|R_0, k) = \frac{\Gamma(kS + S - 1)}{\Gamma(kS)\Gamma(S + 1)} \frac{\left(\frac{R_0}{k}\right)^{S - 1}}{\left( 1 + \frac{R_0}{k} \right)^{kS + S - 1}} +$$ + +where $\Gamma$ is the gamma function. +In order to estimate $S$ from a given $R_0$ and $k$ we can define a likelihood function $L(S) = p(S|R_0, k)$. +The corresponding log-likelihood is + +\begin{align} +\mathrm{LL}(S) = &\log\Gamma(kS + S - 1) - \left(\log\Gamma(kS) + \log\Gamma(S - 1) \right) \\ +& + (S-1) \log \frac{R_0}{k} - (SR_0 + (S - 1)) \log \left(1 + \frac{R_0}{k}\right) +\end{align} + +The log-likelihood for Poisson distributed offspring follows from this where $k$ tends to infinity (corresponding to Eq. 2.2 in @farrington2003) + +$$ +\mathrm{LL}(S) = (S - 1) \log R_0 - S R_0 + (S - 2) \log S - \log\Gamma(S) +$$ + +In all cases the point estimate for the basic reproduction number $\hat{R_0}$ is related to the mean chain size $\bar{S}$ by + +$$ +\hat{R_0} = 1 - \frac{1}{\bar{S}} +$$ + +#### Length distributions + +The cumulative mass function $F(L)$ of observing a chain of length $L$ when offspring is Poisson distributed is given by Eq. (2.5) in @farrington2003 (there called "outbreak duration"): + +$$ +F(L) = e^{-R_0} E_L \left( e^{R_0 e^{-R_0} } \right) +$$ + +where $E_L(x)$ is the iterated exponential function, $E_0(x) = 1$, $E_{L + 1}(x) = x^{E_L(x)}$. + +For geometric distributed offspring (corresponding to a negative Binomial with $k=1$) this function is given by + +$$ +F(L) = \frac{ 1- R_0^{L + 1} } {1 - R_0^{L - 2}} +$$ + +In both cases $f(L)$ denotes cumulative mass functions and therefore the probability of observing a chain of length $L$ is therefore $f(L) - f(L - 1)$. + +## Gamma-Borel mixture + +The probability distribution of outbreak sizes from a branching process with a Poisson offspring distribution (Eq. 2.2 in @farrington2003) is a special case of the [Borel-Tanner distribution](https://en.wikipedia.org/wiki/Borel_distribution#Borel%E2%80%93Tanner_distribution) starting with 1 individual. +An alternative to the negative binomial offspring distribution which represents a Poisson-gamma mixture is a Borel-gamma mixture. +This could represent situations where the variation is not at the _individual level_ but at the _chain level_, i.e. transmission chains is homogeneous but there is heterogeneity between chains. +In that case, it can be shown that the resulting log-likelihood of chain sizes is + +\begin{align} +\mathrm{LL}(S) = &\log\Gamma(k + S - 1) - \left(\log\Gamma(k) + \log\Gamma(S + 1) \right) \\ +& + (S-1) \log S - k \log \left(S + \frac{R_0}{k}\right) +\end{align} + +## Size and length distributions for arbitrary offspring distributions + +The analytic likelihoods stated above are included in the package and used when the corresponding offspring distributions are given. +In order to do this, the simulation functionality is be used to generate $n$ simulated chains and the value of the cumulative mass function $P(S|\theta)$ at the observed $S$ approximated by the empirical cumulative distribution function: +$$ +P(S|\theta) \approx \sum_i \mathbf{1}(x_i <= S) +$$ +where $\mathbf{1}$ is the indicator function and $x_i$ the i-th observed chain size (or length, if the interest is in $L$). +In order to improve this approximation a linear approximation is applied to the values of the empirical distribution function (at the expense of normalisation to 1). + +The (unnormalised) probability of observing $S$ is then given by +$$ +p(S|\theta) = P(S|\theta) - P(S - 1|\theta) +$$ +and a an equivalent relationship is used for $L$. From 77630baf34aae2b3bb6241b6ab3cfde4fe1a3a3c Mon Sep 17 00:00:00 2001 From: Sebastian Funk Date: Wed, 26 Jul 2023 13:49:11 +0100 Subject: [PATCH 2/7] add pkgdown link --- _pkgdown.yml | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/_pkgdown.yml b/_pkgdown.yml index 4c40335..d248f6a 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -36,6 +36,10 @@ reference: - covid19_sa articles: +- title: Background + navbar: Modelling + contents: + - theoretical_background - title: Resources navbar: Literature contents: From e329daee912e9ea05533e3731866cb8f51dbb310 Mon Sep 17 00:00:00 2001 From: Tim Taylor Date: Fri, 28 Jul 2023 11:25:03 +0100 Subject: [PATCH 3/7] Title consistency across vignettes --- vignettes/theoretical_background.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/vignettes/theoretical_background.Rmd b/vignettes/theoretical_background.Rmd index b2c0267..c697989 100644 --- a/vignettes/theoretical_background.Rmd +++ b/vignettes/theoretical_background.Rmd @@ -1,5 +1,5 @@ --- -title: "bpmodels: Theoretical background" +title: "Theoretical background" author: "Sebastian Funk" output: bookdown::html_vignette2: @@ -11,7 +11,7 @@ bibliography: references.json link-citations: true vignette: > %\VignetteEncoding{UTF-8} - %\VignetteIndexEntry{bpmodels: Theoretical background} + %\VignetteIndexEntry{Theoretical background} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console From bd0d3e6146ec488a912c77e2c7f5c1f09ae7e816 Mon Sep 17 00:00:00 2001 From: Sebastian Funk Date: Fri, 1 Sep 2023 10:20:37 +0100 Subject: [PATCH 4/7] improve wording Co-authored-by: James Azam --- vignettes/theoretical_background.Rmd | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/vignettes/theoretical_background.Rmd b/vignettes/theoretical_background.Rmd index c697989..dca5c56 100644 --- a/vignettes/theoretical_background.Rmd +++ b/vignettes/theoretical_background.Rmd @@ -1,5 +1,5 @@ --- -title: "Theoretical background" +title: "An overview of the theoretical background for bpmodels" author: "Sebastian Funk" output: bookdown::html_vignette2: @@ -33,20 +33,20 @@ in the package. # Branching processes -[Branching processes](https://en.wikipedia.org/wiki/Branching_process) are a class of models that are used to model the growth of populations where the number of offspring $Z$ that each member of the population has is a random variable following a given probability distribution with probability mass function $p(Z = z | \theta)$, the _offspring distribution_. +[Branching processes](https://en.wikipedia.org/wiki/Branching_process) are a class of models that are used to model the growth of populations. They assume that each member of the population produces a number of offspring, $Z$, that is a random variable with probability mass function $p(Z = z | \theta)$, called the _offspring distribution_. Their use has a long history in epidemiology, where the population is interpreted as a pathogen, and the offspring as new hosts that it infects [@farrington2003]. Below we will call these infected individuals _cases_ but the methods could be applied in other contexts where branching processes are to be used. # Simulation -To simulate from a branching process, we start with a single case and proceed in discrete steps or generations, drawing from $p(Z=z | \theta)$ to generate new cases from each case. +To simulate from a branching process, we start with a single case and proceed in discrete steps or generations, drawing from the offspring distribution $p(Z=z | \theta)$ to generate new cases from each case. -If we additionally define the distribution of times $T$ as a random variable with distribution $f(T = t; \theta)$ we can assign each case $j$ a time $t_{j}$ which, if case $j$ has been affected by case $i$ is given by $f(t_{j} - t_{i} | \theta)$. -If we identify the timing cases by the time of their symptom onset this is the [serial interval](https://en.wikipedia.org/wiki/Serial_interval), but depending on case definitions this could be another interval. +Given an infector $i$ and infectee $j$, we can additionally assign them a distribution of times $T$ that approximates when the infection event occurred. If we define $T$ as a random variable with distribution $f(T = t; \theta)$ we can assign each case $j$ a time $t_{j}$ which, if case $j$ has been affected by case $i$ is given by $f(t_{j} - t_{i} | \theta)$. +If we identify the timing of cases by the time of their symptom onset this is the [serial interval](https://en.wikipedia.org/wiki/Serial_interval), but depending on case definitions this could be another interval. ## Summary statistics -To summarise the simulation of a branching process after it has ended (either because it has gone extinct or because of some stopping criterion) we either study the _size_ or _length_ of the resulting _chain_ of cases. +Branching process simulations end when they have gone extinct, that is, no more offspring are being produced, or because of some stopping criterion. To summarise the simulations, we either study the _size_ or _length_ of the resulting _chain_ of cases. The size $S$ of a chain is the number of cases that have occurred over the course of the simulation including the initial case so that $S \geq 1$. The length $L$ of a chain is the number of generations that have been simulated including the initial case so that $L \geq 1$. @@ -119,7 +119,7 @@ $$ In both cases $f(L)$ denotes cumulative mass functions and therefore the probability of observing a chain of length $L$ is therefore $f(L) - f(L - 1)$. -## Gamma-Borel mixture +### Gamma-Borel mixture The probability distribution of outbreak sizes from a branching process with a Poisson offspring distribution (Eq. 2.2 in @farrington2003) is a special case of the [Borel-Tanner distribution](https://en.wikipedia.org/wiki/Borel_distribution#Borel%E2%80%93Tanner_distribution) starting with 1 individual. An alternative to the negative binomial offspring distribution which represents a Poisson-gamma mixture is a Borel-gamma mixture. @@ -131,7 +131,7 @@ In that case, it can be shown that the resulting log-likelihood of chain sizes i & + (S-1) \log S - k \log \left(S + \frac{R_0}{k}\right) \end{align} -## Size and length distributions for arbitrary offspring distributions +## Numerical approximations of chain size and length distributions The analytic likelihoods stated above are included in the package and used when the corresponding offspring distributions are given. In order to do this, the simulation functionality is be used to generate $n$ simulated chains and the value of the cumulative mass function $P(S|\theta)$ at the observed $S$ approximated by the empirical cumulative distribution function: From 1b586b076553d50cee03b517aa9edae42da3fdc8 Mon Sep 17 00:00:00 2001 From: Sebastian Funk Date: Fri, 1 Sep 2023 10:28:29 +0100 Subject: [PATCH 5/7] reword --- vignettes/theoretical_background.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vignettes/theoretical_background.Rmd b/vignettes/theoretical_background.Rmd index dca5c56..d50cc99 100644 --- a/vignettes/theoretical_background.Rmd +++ b/vignettes/theoretical_background.Rmd @@ -133,7 +133,7 @@ In that case, it can be shown that the resulting log-likelihood of chain sizes i ## Numerical approximations of chain size and length distributions -The analytic likelihoods stated above are included in the package and used when the corresponding offspring distributions are given. +When analytic likelihoods are not available a numerical approximation is used to derive the distributions. In order to do this, the simulation functionality is be used to generate $n$ simulated chains and the value of the cumulative mass function $P(S|\theta)$ at the observed $S$ approximated by the empirical cumulative distribution function: $$ P(S|\theta) \approx \sum_i \mathbf{1}(x_i <= S) From f3e746b3b56117955269658128f86a0093c0b578 Mon Sep 17 00:00:00 2001 From: Sebastian Funk Date: Fri, 1 Sep 2023 10:28:50 +0100 Subject: [PATCH 6/7] update title --- vignettes/theoretical_background.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vignettes/theoretical_background.Rmd b/vignettes/theoretical_background.Rmd index d50cc99..266eb85 100644 --- a/vignettes/theoretical_background.Rmd +++ b/vignettes/theoretical_background.Rmd @@ -11,7 +11,7 @@ bibliography: references.json link-citations: true vignette: > %\VignetteEncoding{UTF-8} - %\VignetteIndexEntry{Theoretical background} + %\VignetteIndexEntry{An overview of the theoretical background for bpmodels} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console From 5cd7a7d27ba9233d001cfdaabc9e43f63010cac6 Mon Sep 17 00:00:00 2001 From: Sebastian Funk Date: Fri, 1 Sep 2023 10:29:08 +0100 Subject: [PATCH 7/7] add author --- vignettes/theoretical_background.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vignettes/theoretical_background.Rmd b/vignettes/theoretical_background.Rmd index 266eb85..02a3e61 100644 --- a/vignettes/theoretical_background.Rmd +++ b/vignettes/theoretical_background.Rmd @@ -1,6 +1,6 @@ --- title: "An overview of the theoretical background for bpmodels" -author: "Sebastian Funk" +author: "Sebastian Funk and James Azam" output: bookdown::html_vignette2: fig_caption: yes