count-regression.qmd

---
subtitle: "Poisson regression and variations"
---

# Models for Count Outcomes

---

## Acknowledgements {.unnumbered}

This content is adapted from:

-   @dobson4e, Chapter 9
-   @vittinghoff2e, Chapter 8

---

{{< include shared-config.qmd >}}

## Introduction

::: notes
This chapter presents models for 
[count data](probability.qmd#sec-count-vars) outcomes.
With covariates,
the event rate $\lambda$ 
becomes a function of the covariates
$\vX = (X_1, \dots,X_n)$.
Typically, count data models use
a $\log{}$ link function,
and thus an $\exp{}$ inverse-link function. 
That is:
:::

$$
\begin{aligned}
\Expp[Y | \vX = \vx, T = t] &= \mu(\vx,t)
\\ \mu(\vx,t) &= \lambda(\vx)\cdot t
\\ \lambda(\vx) &= \exp{\eta(\vx)}
\\ \eta(\vx) &= \vx'\tilde \beta = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p
\end{aligned}
$$

Therefore,
$$
\begin{aligned}
\logf{\Expp[Y | \vX = \vx,T=t]}
&= \logf{\mu(\vx)}\\
&=\logf{\lambda(\vx) \cdot t}\\
&=\logf{\lambda(\vx)} + \log{t}\\
&=\logf{\exp{\eta(\vx)}} + \log{t}\\
&=\eta(\vx) + \log{t}\\
&=\vx'\tilde\beta + \log{t}\\
&=(\beta_0 +\beta_1 x_1+\dots + \beta_p x_p) + \log{t}\\
\end{aligned}
$$

In contrast with the other covariates (represented by $\vX$), $T$ enters this expression with a $\log{}$ transformation and without a corresponding $\beta$ coefficient. 

:::{.callout-note}
Terms that enter the linear component of a model without a coefficient, such as $\log{t}$ here, are called **offsets**.
:::

### Rate ratios {.smaller}

::: notes
Differences on the log-rate scale become ratios on the rate scale, 
because
:::

$$\exp{a-b} = \frac{\exp{a}}{\exp{b}}$$

(recall from [Algebra 2](math-prereqs.qmd#cor-exp-sum))

Therefore, according to this model, 
**differences of $\delta$ in covariate $x_j$ correspond to rate ratios of $\exp{\beta_j \cdot \delta}$**.

That is, letting $\vX_{-j}$ denote vector $\vX$ with element $j$ removed: 

$$
\begin{aligned}
&{
\left\{
    \log{\Expp[Y |{\color{red}{X_j = a}}, \vX_{-j}=\vx_{-j},T=t]}
    \atop 
    {-\log{\Expp[Y |{\color{red}{X_j = b}}, \vX_{-j}=\vx_{-j},T=t]}}
    \right\}
}\\
&= 
{\left\{
\log{t} + \beta_0 + \beta_1 x_1 + ... + {\color{red}{\beta_j (a)}} + ...+\beta_p x_p 
\atop 
{-\log{t} + \beta_0 + \beta_1 x_1 + ... + {\color{red}{\beta_j (b)}} + ...+\beta_p x_p}
\right\}}\\
&= \color{red}{\beta_j(a-b)}
\end{aligned}
$$

And accordingly,

$$
\begin{aligned}
\frac
{\mathbb{E}[Y |{\color{red}{X_j = a}}, \vX_{-j} = \vx_{-j}, T = t]
}
{
\Expp[Y |{\color{red}{X_j = b}}, \vX_{-j}=\vx_{-j},T=t]
}
= 
\exp{{\color{red}{\beta_j(a-b)}}}
\end{aligned}
$$

## Inference for count regression models

### Confidence intervals for regression coefficients and rate ratios

As usual:

$$
\beta \in \left[\ci \right]
$$

Rate ratios: exponentiate CI endpoints

$$
\exp{\beta} \in \left[\exp{\ci} \right]
$$

### Hypothesis tests for regression coefficients

$$
t = \frac{\hat \beta - \beta_0}{\hse{\hb}}
$$

Compare $t$ or $|t|$ to the tails of the standard Gaussian distribution, according to the null hypothesis.

### Comparing nested models

log(likelihood ratio) tests, as usual.

## Prediction

$$
\begin{aligned}
\hat y 
&\eqdef \hat{\Expp}[Y|\vX= \vx,T=t]\\
&=\hat\mu(\vx, t)\\
&=\hat\lambda(\vx) \cdot t\\
&=\exp{\hat\eta(\vx)} \cdot t\\
&=\exp{\vx'\hat{\boldsymbol{\beta}}} \cdot t
\end{aligned}
$$

## Diagnostics

### Residuals

#### Observation residuals

$$e \eqdef y - \hat y$$

#### Pearson residuals

$$r = \frac{e}{\hse{e}} \approx \frac{e}{\sqrt{\hat y}}$$

#### Standardized Pearson residuals

$$r_p = \frac{r}{\sqrt{1-h}}$$
where $h$ is the "leverage" (which we will continue to leave undefined).

#### Deviance residuals

$$
d_k = \text{sign}(y - \hat y)\left\{\sqrt{2[\ell_{\text{full}}(y) - \ell(\hat\beta; y)]}\right\}
$$

:::{.callout-note}

$$\text{sign}(x) \eqdef \frac{x}{|x|}$$
In other words:

* $\text{sign}(x) = -1$ if $x < 0$
* $\text{sign}(x) = 0$ if $x = 0$
* $\text{sign}(x) = 1$ if $x > 0$

::::{.content-hidden}

```{r}
plot(sign,xlim = c(-1,1), xlab = "x", ylab = "sign(x)")
```

::::

:::


## Zero-inflation

### Models for zero-inflated counts

We assume a latent (unobserved) binary variable, $Z$, which we model using logistic regression:

$$P(Z=1|X=x) = \pi(x) = \expit(\gamma_0 + \gamma_1 x_1 +...)$$

According to this model, if $Z=1$, then $Y$ will always be zero, regardless of $X$ and $T$:

$$P(Y=0|Z=1,X=x,T=t) = 1$$

Otherwise (if $Z=0$), $Y$ will have a Poisson distribution, conditional on $X$ and $T$, as above.

Even though we never observe $Z$, we can estimate the parameters $\gamma_0$-$\gamma_p$, via maximum likelihood:

$$
\begin{aligned}
P(Y=y|X=x,T=t) &= P(Y=y,Z=1|...) + P(Y=y,Z=0|...)
\end{aligned}
$$
(by the Law of Total Probability)

where 
$$
\begin{aligned}
P(Y=y,Z=z|...) 
&= P(Y=y|Z=z,...)P(Z=z|...)
\end{aligned}
$$

---

::: {#exr-zinf-pmf}
Expand $P(Y=0|X=x,T=t)$, $P(Y=1|X=x,T=t)$ and $P(Y=y|X=x,T=t)$ into expressions involving $P(Z=1|X=x,T=t)$ and $P(Y=y|Z=0,X=x,T=t)$.
:::

---

::: {#exr-zinf-moments}

Derive the expected value and variance of $Y$, conditional on $X$ and $T$, as functions of $P(Z=1|X=x,T=t)$ and $\Expp[Y|Z=0,X=x,T=t]$.
:::

## Over-dispersion

---

::: notes

The Poisson distribution model **forces** the variance to equal the mean. 
In practice, many count distributions will have a variance substantially larger than the mean (or occasionally, smaller).

:::

:::: {#def-overdispersion}
#### Overdispersion

A random variable $X$ is **overdispersed** 
relative to a model $\p(X=x)$ if
if its empirical variance in a dataset is larger than 
the value is predicted by the fitted model $\hat{\p}(X=x)$.

::::

::: notes

c.f. 
@dobson4e §3.2.1, 7.7, 9.8; 
@vittinghoff2e §8.1.5; 
and <https://en.wikipedia.org/wiki/Overdispersion>.

When we encounter overdispersion, 
we can try to reduce the residual variance 
by adding more covariates. 

:::

### Negative binomial models

There are alternatives to the Poisson model.
Most notably, 
the [negative binomial model](probability.qmd#sec-nb-dist).

We can still model $\mu$ as a function of $X$ and $T$ as before, 
and we can combine this model with zero-inflation 
(as the conditional distribution for the non-zero component).

### Quasipoisson

An alternative to Negative binomial is the "quasipoisson" distribution. I've never used it, but it seems to be a method-of-moments type approach rather than maximum likelihood. It models the variance as $\Var{Y} = \mu\theta$, and estimates $\theta$ accordingly.

See `?quasipoisson` in R for more.

## Example: needle-sharing

(adapted from @vittinghoff2e, §8)

{{< include exr-needle-sharing.qmd >}}

## More on count regression

- <https://bookdown.org/roback/bookdown-BeyondMLR/ch-poissonreg.html>