05-mcmc.Rmd

# Markov Chain Monte Carlo Methods

Previously, we keep trying to compute

$$E h(X)$$

by generating random numbers. It is based on the law of large numbers that says

$$E h(X) \approx \sum_{i = 1}^N \frac{h(X_i)}{N}$$

The question is, when this convergence happens. Some random numbers might require expremely large $N$, while others needs affordable size. It is known that if this $\{ X_1, \ldots, X_N \}$ is *generated from Markov chain, the series converges quite fast*.

## Limiting Distribution of Markov Chain

Definition \@ref(def:dmc) presents the definition of markov chain and *markov property*.

$$P(X_{n + 1} = j \mid X_n = i, X_{n - 1} = i_{n - 1}, \ldots, X_0 = i_0) = P(X_{n + 1} = j \mid X_n = i) = P_{ij}$$

Consider discrete state space $S$.

```{definition, mctran, name = "Transition kernel"}
One-step transition matrix for discrete time markov chain on $S$ is

$$P = \begin{bmatrix} P_{ij} \end{bmatrix}$$

$n$-stem transition matrix is written as

$$P^{(n)} = \begin{bmatrix} P_{ij}^{(n)} = P(X_{n + k} = j \mid X_k = j) \end{bmatrix}$$
```

```{theorem, cke, name = "Chapmen-Kolmogorov Equation"}
For every $n, m \in \mathbb{Z}$,

$$P^{(n + m)} = P^{(n)} P^{(m)}$$
```

```{corollary, cke2}
By the Chapmen-Kolmogorov equation,

$$\forall n \in \{ 0, 1, 2, \ldots \} : \: P^{(n)} = P^n$$
```

Does Markov chain converge to same state after time has passed much enough?

$$P(\text{starts at}\: i \: \text{and ends at}\: j \: \text{state}) = \lim_{n \rightarrow \infty} P(X_n = j \mid X_0 = i) = \lim_{n \rightarrow \infty} P_{ij}^n$$

This holds when the process satisfies some conditions.

### Ergodic theorem

Let $S$ be the state of MC.

```{definition, mcperiod, name = "Aperiodicity"}
Let $i \in S$ be a state.

\begin{itemize}
  \item Period $d(i) := \gcd \{ n : P_{ii}^(n) > 0, n \in \mathbb{N} \}$
  \item A state $i$ is said to be \textbf{\textit{periodic}} $$: \Leftrightarrow d(i) > 1$$
  \item A state $i$ is said to be \textbf{\textit{aperiodic}} $$: \Leftrightarrow d(i) = 1$$
\end{itemize}
```

It is obvious that if a chain has a period, it won't be convergent.

```{definition, mcreduc, name = "Irreducibility"}
Markov chain is \textbf{\textit{irreducible}} iff it is possible to go from any state to any other state. Otherwise, it is called \textbf{\textit{reducible}}.
```

Intuitively, the states must be a *single closed communicating* class for convergence.

```{definition, mcrecc, name = "Positive recurrence"}
Markov chain is \textbf{\textit{recurrent}} iff

$$\forall i \in S : \: \text{chain starts at}\: i \: \text{and it will eventually return to}\: i \: \text{with probabbility}\: 1$$
```

When these properties - aperiodicity, irreducibility, and positive recurrent - MC can be guaranteed to be convergent provided finite moment.

```{theorem, ergodic, name = "Ergodic theorem"}
Suppose that $\{ X_i \} \sim MC$ is aperiodic, irreducible and positive recurrent with $E \lvert h(X_j) \rvert < \infty$. Then

$$\frac{1}{N} \sum_{i = 1}^N h(X_i) \stackrel{a.s}{\rightarrow} \int_{\Omega} h(X_i) \pi(X_i) dP$$

as $N \rightarrow \infty$
```

This ergodic theorem \@ref(thm:ergodic) is an MC analog to the strong law of large numbers.

### Stationary limiting distribution

Using transition kernel, we might get the limiting distribution. For example,

\begin{equation}
  \begin{split}
    \boldsymbol\pi^{(1)} & = \boldsymbol\pi^{(0)} P \\
    & = \begin{bmatrix}
      \pi_1 & \pi_2 & \pi_3
    \end{bmatrix} \begin{bmatrix}
      \pi_{11} & \pi_{12} & \pi_{13} \\
      \pi_{21} & \pi_{22} & \pi_{23} \\
      \pi_{31} & \pi_{32} & \pi_{33}
    \end{bmatrix}
  \end{split}
  (\#eq:transiter)
\end{equation}

Recursively,

\begin{equation}
  \begin{split}
    \boldsymbol\pi^{(t)} & = \boldsymbol\pi^{(t - 1)} P \\
    & = \boldsymbol\pi^{(0)} P^t
  \end{split}
  (\#eq:transiter2)
\end{equation}

```{theorem, station, name = "Stationary probabilities"}
Suppose that $\{ X_i \} \sim MC$ is aperiodic, irreducible and positive recurrent with $E \lvert h(X_j) \rvert < \infty$. Then there exists an invariant distribution $\boldsymbol\pi$ uniquely s.t.

$$
\begin{cases}
  \boldsymbol\pi = \boldsymbol\pi P \\
  \boldsymbol\pi \mathbf{1}^T = 1
\end{cases}
$$

Denote that every vector is a row vector here.
```

### Burn-in period

This kind of convergence is usually gauranted for any starting distribution, but the time varies according to its starting point. Thus, we *throw out a certain number of the first draws* so that stationarity less dependends on the starting point. It is called burn-in period.

### Thinning

Denote that MC has a dependency structure. So we jump the chain, i.e. break the dependence. However, this process is unnecessary with Ergodic theorem and increases the variance of MC estiamtes.


## Gibbs Sampler

*Markov Chain Monte Carlo (MCMC) Methods* includes in gibbs sampler and metropolis-hastings algorithm. In fact, gibbs sampler is a special form of the latter. Here we follow the notation of @Chib:1995de.

### Concept of gibbs sampler

We are given the joint density. For this joint density, the following theorem can be proven.

```{theorem, ham, name = "Hammersley-Clifford Theorem"}
Suppose that $(X, Y)^T \sim f(x, y)$. Then

$$f(x, y) = \frac{f(y \mid x)}{\int_{\R} \frac{f(y \mid x)}{f(x \mid y)} dy}$$
```

By definition, $f(x, y) \propto f(y \mid x)$. However, the above theorem gives that this joint density is proportional to both conditional densities, i.e. also to $f(x \mid y)$.

```{corollary, hamcor}
Theorem \@ref(thm:ham) implies the second 

\begin{itemize}
  \item $f(x, y) \propto f(y \mid x)$
  \item $f(x, y) \propto f(x \mid y)$
\end{itemize}
```

This can be extended to cases more than two blocks.

```{definition, fullcond, name = "Full conditional distribution"}
Let $\mathbf{X} = (X_1, \ldots, X_p)^T \in \R^p$ be a $p$-dimensional random vector. Then the \textbf{\textit{full conditional distribution}} of $X_j$ is

$$f(X_j \mid \mathbf{X}_{(-j)})$$

where $\mathbf{X}_{(-j)} = (X_1, \ldots, X_{j - 1}, X_{j + 1}, \ldots, X_p)^T$.
```

Gibbs sampler iterate to generate a number from each full conditional distribution so that we finally get the joint density, i.e.

$$X_j \sim f(X_j \mid \mathbf{X}_{(-j)})$$

For instance, for $p = 3$,

$$
\begin{cases}
  X^{(1)} \sim f(x \mid y^{(0)}, z^{(0)}) \\
  Y^{(1)} \sim f(y \mid \color{blue}{x^{(1)}}, z^{(0)}) \\
  Z^{(1)} \sim f(z \mid \color{blue}{x^{(1)}}, \color{blue}{y^{(1)}})
\end{cases}
$$

and so $(X^{(1)}, Y^{(1)}, Z^{(1)})^T \sim f(x, y, z)$. Next,

$$
\begin{cases}
  X^{(2)} \sim f(x \mid \color{blue}{y^{(1)}}, \color{blue}{z^{(1)}}) \\
  Y^{(2)} \sim f(y \mid \color{red}{x^{(2)}}, \color{blue}{z^{(1)}}) \\
  Z^{(2)} \sim f(z \mid \color{red}{x^{(2)}}, \color{red}{y^{(2)}})
\end{cases}
$$

so that $(X^{(2)}, Y^{(2)}, Z^{(2)})^T \sim f(x, y, z)$, and so on.

### Full conditional distributions

Suppose that we only have 

Here, of course, we should know $f(\cdot \mid \cdot)$. In some cases, the closed form can be given. Otherwise, there are some calculation methods.

1. normalized posterior
2. drop the irrelevant terms
3. closed form
4. Repeat 2 and 3 for all parameter blocks

```{example, bivgibbs, name = "Bivariate normal distribution"}
Generate

$$
(X_1, X_2) \mid \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho \sim N_2 \bigg( (\mu_1, \mu_2)^T, \begin{bmatrix}
  \sigma_1^2 & \rho \\
  \rho & \sigma_2^2
\end{bmatrix}
\bigg)
$$
```

In this problem, its closed can easily calculated.

$$
\begin{cases}
  X_1 \mid X_2, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho \sim N \Big( \mu_1 + \rho \frac{\sigma_1}{\sigma_2} (X_2 - \mu_2), (1 - \rho^2) \sigma_1^2 \Big) \\
  X_2 \mid X_1, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho \sim N \Big( \mu_2 + \rho \frac{\sigma_2}{\sigma_1} (X_1 - \mu_1), (1 - \rho^2) \sigma_2^2 \Big)
\end{cases}
$$

Hence, we just iterate the above set of process until gaining $N$ draws.

### Gibbs sampler step

\begin{algorithm}[H] \label{alg:gibbalg}
  \SetAlgoLined
  \SetKwInOut{Input}{input}
  \SetKwInOut{Output}{output}
  \KwData{Full conditional distribution $f$}
  \Input{Starting values $\mathbf{x}^{(0)} = (x_1^{(0)}, \ldots, x_p^{(0)})^T$, burn-in period $b$}
  \For{$i \leftarrow 1$ \KwTo $N$}{
    $\mathbf{x}^{\ast} = \mathbf{x}^{(i - 1)}$\;
    \For{$j \leftarrow 1$ \KwTo $p$}{
      Generate $x_j^{(i)} \sim f(x_j \mid \mathbf{x}_{(-j)} = \mathbf{x}_{(-j)}^{\ast})$\;
      Set or update $x_j^{\ast} = x_j^{(i)}$\;
    }
  }
  Draw out the first $b \; \mathbf{x}^{(j)}$ (Burn-in)\;
  \Output{$\mathbf{x}^{(b + 1)}, \ldots, \mathbf{x}^{(N)}$}
  \caption{Gibbs-sampler steps}
\end{algorithm}

Denote that Gibbs sampler accepts every candidate, while metropolis-hastings in the next section chooses one.

Sometimes Gibbs sampler algorithm $\ref{alg:gibbalg}$ requires nested loop, whose efficiency becomes quite awful. In `R`, `C++` implementation can be a solution [@Wickham:2019aa]. The following code is executed in `Rcpp` environment in `rmd` document. In practice, this should be placed in `cpp` file. Or `cppFunction()` can also be used.

```{Rcpp}
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix gibbs_bvn(int N, double x, double y, int burn, 
                        double mu1, double mu2, double sig1, double sig2, double rho) {
  NumericMatrix mat(N - burn, 2);
  
  for(int i = 0; i < burn; i++) {
    x = rnorm(1, mu1 + rho * sig1 / sig2 * (y - mu2), (1 - pow(rho, 2)) * pow(sig1, 2))[0];
    y = rnorm(1, mu2 + rho * sig2 / sig1 * (x - mu1), (1 - pow(rho, 2)) * pow(sig2, 2))[0];
  }
  
  for(int i = burn; i < N; i++) {
    x = rnorm(1, mu1 + rho * sig1 / sig2 * (y - mu2), (1 - pow(rho, 2)) * pow(sig1, 2))[0];
    y = rnorm(1, mu2 + rho * sig2 / sig1 * (x - mu1), (1 - pow(rho, 2)) * pow(sig2, 2))[0];
    mat(i - burn, 0) = x;
    mat(i - burn, 1) = y;
  }
  
  return(mat);
}
```

By executing above code, `gibbs_bvn(N, x, y, burn, mu1, mu2, sig1, sig2, rho)` function is created.

```{r}
bvn <- 
  gibbs_bvn(5000, 0, 0, 1000, 0, 2, 1, .5, -.75) %>% 
  data.table()
setnames(bvn, c("x", "y"))
```

We have generated bivariate normal random numbers. See Figure \@ref(fig:bvnscatter). Compare with our $\boldsymbol\mu$ and $\Sigma$.

```{r bvnscatter, fig.cap="Bivariate normal chain by the gibbs sampler"}
gg_scatter(bvn, aes(x, y))
```


## Metropolis-Hastings Algorithm

We try to generate a random numer from $f(x) \propto \pi(x)$, i.e.

$$f(x) = k \pi(x)$$

but we do not know the normalized constant

$$k = \frac{1}{\int \pi(x) dx}$$

```{definition, mcnote, name = "Density"}
In Metropolis-hastings (M-H) algorithm, we take care about the following two densities. Denote that terms and process are similar to A-R process.

\begin{enumerate}
  \item (normalized) Target density $\pi(\cdot)$ density that we try to generate sample from
  \item Candidate-generating density $q(\cdot \mid \cdot)$ density that we will actually generate random sample from
\end{enumerate}
```

Using these two target and candidate densities, we apply **A-R method** $\ref{alg:algar}$. Since this only uses $\frac{\pi(x^{\ast})}{\pi(x^{(j - 1)})}$, we do not need to know $k$.

### Metropolis-hastings sampler

One proceeds M-H algorithm as follows.

\begin{algorithm}[H] \label{alg:mhalg}
  \SetAlgoLined
  \SetKwInOut{Input}{input}
  \SetKwInOut{Output}{output}
  \Input{Starting point $x_0$, burn-in period $b$}
  \For{$i \leftarrow 1$ \KwTo $N$}{
    Draw a candidate distribution $Y \sim q(\cdot \mid x^{(i)})$\;
    $U \sim unif(0, 1) \ind Y$\;
    Acceptance rate $$\alpha(x^{(i)}, y) := \min \bigg(\frac{\pi(y) q(x^{(i)} \mid y)}{\pi(x^{(i)}) q(y \mid x^{(i)})}, 1 \bigg)$$\;
    \eIf{$U \le \alpha(x^{(i)}, y)$}{
      Accept so that $x^{(i + 1)} = y$\;
    }{
      Reject so that $x^{(i + 1)} = x^{(i)}$\;
    }
  }
  Draw out the first $b \; x^{(i)}$ (Burn-in)\;
  \Output{$x^{(b + 1)}, \ldots, x^{(N)}$}
  \caption{Metropolis-Hastings algorithm with burn-in period}
\end{algorithm}

Why $\alpha(x^{(j)}, y) := \min \bigg(\frac{\pi(y) q(x^{(j)} \mid y)}{\pi(x^{(j)}) q(y \mid x^{(j)})}, 1 \bigg)$ for A-R method? It produces *time reversible MC*, whose reversed process is also a MC.

\begin{equation}
  \pi_i P_{ij} = \pi_j P_{ji} \quad \text{if DTMC}
  (\#eq:revdtmc)
\end{equation}

We have seen DTMC, but for more generality consider continuous state space. Let $P(\mathbf{x}, A)$ be the transition kernel for $\mathbf{x} \in \R^d$ and $A \in \mathcal{B}$ borel set. There exists an invariant distribution

$$\pi^{\ast}(d\mathbf{y}) = \int_{\R^d} P(\mathbf{x}, d\mathbf{y}) \pi(\mathbf{x}) d\mathbf{x}$$

This $P(\mathbf{x}, A)$ can be re-expressed by some other function $p(\mathbf{x}, \mathbf{y})$

\begin{equation}
  P(\mathbf{x}, d\mathbf{y}) = p(\mathbf{x}, \mathbf{y}) d \mathbf{y} + r(\mathbf{x}) \delta_{\mathbf{x}}(d \mathbf{y})
  (\#eq:mhpxy)
\end{equation}

Then Equation \@ref(eq:revdtmc) becomes

\begin{equation}
  \pi(\mathbf{x}) p(\mathbf{x}, \mathbf{y}) = \pi(\mathbf{y}) p(\mathbf{y}, \mathbf{x})
  (\#eq:revmc)
\end{equation}

Given $\alpha(\mathbf{x}, \mathbf{y})$, now claim that $\pi(\mathbf{x}) p(\mathbf{x}, \mathbf{y}) = \pi(\mathbf{y}) p(\mathbf{y}, \mathbf{x})$. By construction,

\begin{equation}
  p(\mathbf{x}, \mathbf{y}) = \begin{cases}
    q(\mathbf{y} \mid \mathbf{x}) \alpha(\mathbf{x}, \mathbf{y}) & \mathbf{x} \neq \mathbf{y} \\
    1 - \int_{\R^d} q(\mathbf{y} \mid \mathbf{x}) \alpha(\mathbf{x}, \mathbf{y}) d \mathbf{y} & \mathbf{x} = \mathbf{y}
  \end{cases}
  (\#eq:mhtransprob)
\end{equation}

**Case 1**: $\alpha = 1$

From Equation \@ref(eq:mhpxy),

$$
\begin{cases}
  p(\mathbf{x}, \mathbf{y}) = q(\mathbf{y} \mid \mathbf{x}) \\
  p(\mathbf{y}, \mathbf{x}) = q(\mathbf{x} \mid \mathbf{y}) \frac{\pi(\mathbf{x}) q(\mathbf{y} \mid \mathbf{x})}{\pi(\mathbf{y}) q(\mathbf{x} \mid \mathbf{y})} = \frac{\pi(\mathbf{x}) q(\mathbf{y} \mid \mathbf{x})}{\pi(\mathbf{y})}
\end{cases}
$$

Therefore,

$$\pi(\mathbf{x}) p(\mathbf{x}, \mathbf{y}) = \pi(\mathbf{y}) p(\mathbf{y}, \mathbf{x})$$

i.e. time reversible.

**Case 2**: $\alpha < 1$

By symmetry,

$$
\begin{cases}
  p(\mathbf{x}, \mathbf{y}) = q(\mathbf{y} \mid \mathbf{x}) \frac{\pi(\mathbf{y}) q(\mathbf{x} \mid \mathbf{y})}{\pi(\mathbf{x}) q(\mathbf{y} \mid \mathbf{x})} = \frac{\pi(\mathbf{y}) q(\mathbf{x} \mid \mathbf{y})}{\pi(\mathbf{x})} \\
  p(\mathbf{y}, \mathbf{x}) = q(\mathbf{x} \mid \mathbf{y})
\end{cases}
$$

and hence,

$$\pi(\mathbf{x}) p(\mathbf{x}, \mathbf{y}) = \pi(\mathbf{y}) p(\mathbf{y}, \mathbf{x})$$

Finally, $\alpha(\mathbf{x}, \mathbf{y})$ of Algorithm $\ref{alg:mhalg}$ results in time reversible MC.


```{example, mhray, name = "Rayleigh density"}
Generate a sample from a Rayleigh density

$$f(x) = \frac{x}{\sigma^2} e^{- \frac{x^2}{2 \sigma^2}}$$
```

```{r}
dray <- function(x, sd) {
  if (sd <= 0 ) stop(gettextf("%s should be positive", expression(sd)))
  ifelse(
    x >= 0,
    x / sd^2 * exp(- x^2 / (2 * sd^2)),
    0
  )
}
```

Consider $\chi^2(x^{(j)})$ as candidate. The following function calcuates acceptance rate.

```{r}
acc_mc <- function(x, y, sd = 4) {
  ( (dray(y, sd) * dchisq(x, df = y)) / (dray(x, sd) * dchisq(y, df = x)) ) %>% 
    min(1)
}
```

<!-- To enhance the speed, we register parallel backends. -->

<!-- ```{r, message=FALSE} -->
<!-- MC_CORES <- future::availableCores() - 1 -->
<!-- cl <- parallel::makeCluster(MC_CORES) -->
<!-- doParallel::registerDoParallel(cl, cores = MC_CORES) -->
<!-- parallel::clusterExport(cl, c("acc_mc", "dray")) -->
<!-- parallel::clusterEvalQ(cl, c(library(dplyr), library(data.table))) -->
<!-- ``` -->

```{r}
mc_ray <- function(N = 10000, x0, sd = 4, burn = 1000) {
  x <- numeric(N)
  x[1] <- x0
  y <- numeric(1L)
  acc <- logical(N)
  acc[1] <- TRUE
  for (i in seq_len(N)[-1]) {
    y[1] <- rchisq(1, df = x[i - 1])
    acc[i] <- ( runif(1) <= acc_mc(x[i - 1], y, sd) )
    x[i] <- ifelse(acc[i], y, x[i - 1])
  }
  data.table(
    draw = seq_len(N),
    acc = acc,
    x = x
  )[(burn + 1):(.N)]
}
```

<!-- ```{r} -->
<!-- foreach(i = seq_len(N), .combine = rbind, .inorder = TRUE) %dopar% { -->
<!--     y[1] <- rchisq(1, df = x) -->
<!--     acc[1] <- ( runif(1) <= acc_mc(x, y, sd) ) -->
<!--     x[1] <- ifelse(acc, y, x) -->
<!--     data.table( -->
<!--       draw = i, -->
<!--       acc = acc, -->
<!--       x = x -->
<!--     ) -->
<!--   } %>% -->
<!--     .[(burn + 1):(.N)] -->
<!-- ``` -->

For a better result, try *burn-in period* 2000.

```{r}
ray <- mc_ray(N = 10000, x0 = 1, sd = 4, burn = 2000)
```

<!-- ```{r} -->
<!-- ray <- mc_ray(N = 10000, x0 = 1, sd = 4, burn = 2000) -->
<!-- #--------------------------------------------------- -->
<!-- parallel::stopCluster(cl) -->
<!-- ``` -->

Among 8000 chain, `r ray[, .N, by = acc][2, N]` candidiate points are rejected.

```{r}
ray[,
    .N,
    by = acc]
```

Recall that A-R method have tried to elevate the acceptance rate for efficiency.

```{r raymhpath, fig.cap="M-H sampling from Chisq to target Rayleigh"}
ray %>% 
  ggplot(aes(x = draw, y = x)) +
  geom_path(aes(colour = acc, group = 1)) +
  labs(
    x = "Draw",
    colour = "Acceptance"
  ) +
  theme(legend.position = "bottom")
```

In Figure \@ref(fig:raymhpath), the short horizontal paths might be represented as rejection points.

```{r raypathpart, fig.cap="Part of a chain from M-H sampling"}
ray[3000:3500] %>% 
  ggplot(aes(x = draw, y = x)) +
  geom_path(aes(colour = acc, group = 1)) +
  labs(
    x = "Draw",
    colour = "Acceptance"
  ) +
  theme(legend.position = "bottom")
```

Now we can see how the chain is mixed.

```{r raymix, fig.cap="Metropolis-Hastings sampling mixing"}
ray %>% 
  ggplot(aes(x = draw, y = x)) +
  geom_jitter(aes(colour = x, alpha = abs(x)), show.legend = FALSE) +
  scale_colour_gradient(low = "#0091ff", high = "#f0650e") +
  xlab("Draw")
```

See Figure \@ref(fig:raymix). We can see that the random numbers are mixed well.

### Jumping distribution

Candidiate distribution is also called in that it decides where will be the chain move in the next iteration. As in A-R, we should choose this candidate $q$ such that

$$spt \pi \subseteq spt q$$

```{r jumpdist, echo=FALSE, fig.cap="Choice of candidate distribution - Rayleigh"}
tibble(x = seq(0, 20, by = .01)) %>% 
  mutate_all(
    .funs = list(
      ~dray(., sd = 4),
      ~dchisq(., df = 6)
    )
  ) %>% 
  gather(-x, key = "jumping", value = "density") %>% 
  ggplot(aes(x = x, y = density, colour = jumping)) +
  geom_path()
```

According to this jumping distribution, M-H sampler becomes *random walk M-H and independent M-H*. Theses are the famous examples among M-H samplers.

### Random walk M-H

Let the candidate distribution be a *symmetric random walk*. Then this is called random walk M-H.

\begin{equation}
  q(y \mid x) = q_1(\lvert y - x \rvert)
  (\#eq:symjump)
\end{equation}

Then

$$q(y \mid x) = q(x \mid y)$$

and hence the acceptance ratio becomes

\begin{equation}
  \alpha(x^{(j)}, y) := \min \bigg(\frac{\pi(y)}{\pi(x^{(j)})}, 1 \bigg)
  (\#eq:symacc)
\end{equation}

Here, candidate number $y$ is generated in the form of

\begin{equation}
  y = x + z
  (\#eq:symcandpt)
\end{equation}

with increment $z \sim q(\lvert z \rvert)$.

```{example, mct, name = "Random walk metropolis"}
Generate $t(\nu = 4)$ using the random walk M-H.
```

Use the proposal distribution $N(X^{(j)}, \sigma^2)$. Denote that normal distribution is symmetric. Then

$$q(x \mid y) = q(y \mid x)$$

Thus,

$$\alpha(x^{(j)}, y) = \min \bigg(\frac{\pi(y)}{\pi(x^{(j)})}, 1 \bigg) = \min \bigg( \frac{t(y)}{t(x^{(j)})}, 1 \bigg)$$

When building the ratio, we do not need to multiply candidate in each numerator and denominator.

```{r}
acc_walk <- function(x, y, nu = 4) {
  ( dt(y, 4) / dt(x, 4) ) %>% 
    min(1)
}
```

Just change a few part of `mc_ray()` though it is annoying.

```{r}
mc_dt <- function(N = 10000, x0, nu = 4, sd, burn = 1000) {
  x <- numeric(N)
  x[1] <- x0
  y <- numeric(1L)
  acc <- logical(N)
  acc[1] <- TRUE
  for (i in seq_len(N)[-1]) {
    y[1] <- rnorm(1, x[i - 1], sd = sd) # changed here
    acc[i] <- ( runif(1) <= acc_walk(x[i - 1], y, nu = nu) ) # changed here
    x[i] <- ifelse(acc[i], y, x[i - 1])
  }
  data.table(
    draw = seq_len(N),
    acc = acc,
    x = x
  )[(burn + 1):(.N)]
}
```

Try various $\sigma^2$.

```{r}
tchain <- 
  lapply(
    c(.05, .5, 2, 16), # sd
    function(i) {
      mc_dt(N = 2000, x0 = 25, sd = i, burn = 100)[, chain := i]
    }
  ) %>% 
    rbindlist()
```

```{r}
tchain[,
       .N,
       by = .(acc, chain)]
```

As $\sigma^2$ increases, acceptance rate decreases.

```{r rwmhpath, fig.cap="Random walk M-H with different variances"}
tchain %>% 
  ggplot(aes(x = draw, y = x)) +
  geom_path(aes(colour = acc, group = 1)) +
  facet_grid(
    chain ~., scales = "free_y"
  ) +
  labs(
    x = "Draw",
    colour = "Acceptance"
  ) +
  theme(legend.position = "bottom")
```

See Figure \@ref(fig:rwmhpath). Large $\sigma$ shows frequent horizontal section.

### Independence M-H

When the candidate distribution does not depend on the previous value of the chain, it is called the *independence sampler*.

\begin{equation}
  q(y \mid x) = q(y)
  (\#eq:indepjump)
\end{equation}

Thus, the acceptance ratio is

\begin{equation}
  \alpha(x^{(j)}, y) := \min \bigg(\frac{\pi(y) q(x^{(j)})}{\pi(x^{(j)}) q(y)}, 1 \bigg)
  (\#eq:indepacc)
\end{equation}

The independence sampler is simple. Also, it gives a nice result provided that the jumping distribution is closed to the target. Otherwise, it does not perform well, which is the most case.

```{example, indepmhmix, name = "Independence sampler"}
Consider a mixture

$$f(x) = p N(0, 1) + (1 - p) N(5, 1)$$

Generate a chain following the posterior distribution of $p$ (which is the target). Use an independence sampler.
```

Note that $p \in (0, 1)$. It might be reasonable to use

$$Beta(1, 1) \stackrel{d}{=} unif(0, 1)$$

Now consider jumping distribution as

$$p \sim q(\cdot) \equiv Beta(a, b)$$

Then the acceptance rate is

$$\alpha(x^{(j)}, y) := \min \bigg(\frac{f(y) q(x^{(j)})}{f(x^{(j)}) q(y)}, 1 \bigg)$$

Given observed sample $x_i \iid f$,

```{r, ref.label="mixturecode"}
```

```{r}
x <- 
  mix_norm( # see chapter 1
    n = 30,
    p1 = .2,
    mean1 = 0,
    sd1 = 1,
    mean2 = 5,
    sd2 = 1
  )
```

$$\frac{f(y) q(x^{(j)})}{f(x^{(j)}) q(y)} = \frac{(x^{(j)})^{a - 1} (1 - x^{(j)})^{b - 1} \prod\limits_i \Big(y N(x_i \mid \mu_1, \sigma_1^2) + (1 - y) N(x_i \mid \mu_2, \sigma_2^2) \Big)}{y^{a - 1} (1 - y)^{b - 1} \prod\limits_i \Big(x^{(j)} N(x_i \mid \mu_1, \sigma_1^2) + (1 - x^{(j)}) N(x_i \mid \mu_2, \sigma_2^2) \Big)}$$

```{r}
acc_indep <- function(x, xj, y, a = 1, b = 1, mu, sig) {
  # x - observed sample
  fyi <- y * dnorm(x, mu[1], sig[1]) + (1 - y) * dnorm(x, mu[2], sig[2])
  fxi <- xj * dnorm(x, mu[1], sig[1]) + (1 - xj) * dnorm(x, mu[2], sig[2])
  fyx <- prod(fyi / fxi)
  ( fyx * ( xj^(a - 1) * (1 - xj)^(b - 1) / (y^(a -1) * (1 - y)^(b - 1)) ) ) %>% 
    min(1)
}
#----------------------------------------------------------
mc_mixture <- function(N = 10000, x, x0, a = 1, b = 1, mu = c(0, 5), sig = c(1, 1), burn = 1000) {
  xj <- numeric(N)
  xj[1] <- x0
  y <- numeric(1L)
  acc <- logical(N)
  acc[1] <- TRUE
  for (i in seq_len(N)[-1]) {
    y[1] <- rbeta(1, a, b) # changed here
    acc[i] <- ( runif(1) <= acc_indep(x, xj[i - 1], y, a, b, mu, sig) ) # changed here
    xj[i] <- ifelse(acc[i], y, xj[i - 1])
  }
  data.table(
    draw = seq_len(N),
    acc = acc,
    x = xj
  )[(burn + 1):(.N)]
}
#--------------------------------------------------------
mixchain <- 
  apply(
    matrix(
      c(1, 1, 5, 2),
      nrow = 2
    ),
    2,
    function(ab) {
      mc_mixture(x = x, x0 = .5, a = ab[1], b = ab[2])[, chain := paste(ab[1], ab[2], sep = ",")]
    }
  ) %>% 
  rbindlist()
```

Like previous section, try several hyperparameter $(a, b)$.

```{r mixpathmh, fig.cap="Mixture chain from M-H sampling for different $(a, b)$"}
mixchain %>% 
  ggplot(aes(x = draw, y = x)) +
  geom_path(aes(colour = acc, group = 1)) +
  facet_grid(
    chain ~., scales = "free_y"
  ) +
  labs(
    x = "Draw",
    colour = "Acceptance"
  ) +
  theme(legend.position = "bottom")
```


## Monitoring Convergence

See Figure \@ref(fig:raymhpath), a chain generated by M-H sampling. Is this convergent?

### Gelman-Rubin method

Geolman-Rubin method monitors convergence of a M-H chain. It requires *multiple chains and compare the behavior of them with respect to the variance of one or more scalar summary statistics*. The estimates of variance are similar to between- and within- mean squared erros in **one-way ANOVA**.

Consider $k$ chains of length $n$, saying $\{ X_{ij} : 1 \le i \le n, 1 \le j \le k \}$. Let $\psi$ be a scalar summary statistic that estimates some parameter of the target distribution. Compute scalar summary statistic for each chain $\{ \psi_{nj} = \psi(X_{1j}, \ldots, X_{nj}) \}$. If the chains are converging to the target distribution, then the sampling distribution of $\{ \psi_{nj} \}$ should be *converging to a common distribution*.

```{definition, seqvar, name="Sequence mean and variance"}
Given $\{ X_{ij} : 1 \le i \le n, 1 \le j \le k \}$ and corresponding $\{ \psi_{nj} \}$,

\begin{itemize}
  \item \textit{Overall mean} $$\overline{\psi}_{..} := \frac{1}{nk} \sum_{i,j} \psi_{ij}$$
  \item \textit{Within-sequence mean} $$\overline{\psi}_{.j} := \frac{1}{n} \sum_i \psi_{ij}$$
  \item \textit{Within-sequence variance} $$s_j^2 := \frac{1}{n - 1} \sum_{i = 1}^n (\psi_{ij} - \overline{\psi}_{.j})^2$$
\end{itemize}
```

Using Notations in Definition \@ref(def:seqvar), we can get the **Gelman-rubin statistic** step-by-step as follows [@gimdalw:2013aa].

\begin{algorithm}[H] \label{alg:gelmanalg}
  \SetAlgoLined
  \SetKwInOut{Input}{input}
  \SetKwInOut{Output}{output}
  \KwData{$k$ chains $\{ X_{ij} : 1 \le i \le n, 1 \le j \le k \}$ and corresponding statistic $\{ \psi_{nj} \}$}
  Between-sequence variance $$B := \frac{n}{k -1}\sum_{j = 1}^k (\overline{\psi}_{.j} - \overline{\psi}_{..})^2$$\;
  Within-sequence variance $$W := \frac{1}{k} \sum_{j = 1}^k s_j^2$$\;
  Compute $$V = \frac{n - 1}{n} W + \frac{1}{n} B$$\; \label{alg:gelpostalg}
  Potential scale reduction factor $$\hat{R} = \sqrt{\frac{V}{W}}$$\; \label{alg:gelpotalg}
  \Output{$\hat{R}$}
  \caption{Gelman-Rubin method}
\end{algorithm}

What does this mean? Each $B$ and $W$ computes between- and within- sequence variance as in one-way ANOVA. See Line $\ref{alg:gelpostalg}$. This estimates the marginal posterior variance of $\psi$ by a weighted average of $W$ and $B$.

\begin{equation}
  \widehat{Var}(\psi \mid x) = \frac{n - 1}{n} W + \frac{1}{n} B
  (\#eq:gelmarg)
\end{equation}

Assuming that the starting distribution $q(\cdot \mid x^{(0)})$ is overdispersed, i.e. variance is larger than expectation, $V$ overestimates $Var(\psi \mid x)$. Under stationariy, however, starting distribution is indentical to the target, so it is unbiased. It is same for large $n$.

For any $n \in \R$, $W$ underestimates $Var(\psi \mid x)$. This is because a single chain has less variability. As $n \rightarrow \infty$, its expectation finally goes to $Var(\psi \mid x)$ [@Gelman:2013aa].

```{definition, gelstat, name="Gelman-Rubin statistic"}
Gelman-Rubin statistic is the \textit{estimated potential scale reduction} by

$$\hat{R} = \sqrt{\frac{V}{W}}$$

which declines to $1$ as $n \rightarrow \infty$.
```

If this $\hat{R}$ is large, we might say that further simulations will improve the inference about target. On the other hand, we get $\hat{R}$ less than $1.1$ or $1.2$, the chain can be said to be converged to the target.

```{example, bvnconverge, name = "Gelman-rubin monitoring for normal jumping kernel"}
Generate standard normal random numbers using Normal candidate distribution.

$$
\begin{cases}
  \pi(x) = \phi(x \mid 0, 1) \\
  q(y \mid x) = \phi(y \mid \mu = x, \sigma^2)
\end{cases}
$$
```

Just change `acc_mc()` and a few line of `mc_ray()` above.

```{r}
acc_norm <- function(x, y, sd) {
  ( (dnorm(y) * dnorm(x, y, sd)) / (dnorm(x) * dnorm(y, x, sd)) ) %>% 
    min(1)
}
#---------------------------------------------------------
mc_norm <- function(N = 15000, x0, sd = .2, burn = 1000) {
  x <- numeric(N)
  x[1] <- x0
  y <- numeric(1L)
  acc <- logical(N)
  acc[1] <- TRUE
  for (i in seq_len(N)[-1]) {
    y[1] <- rnorm(1, x[i - 1], sd) # change this
    acc[i] <- ( runif(1) <= acc_norm(x[i - 1], y, sd) ) # change this
    x[i] <- ifelse(acc[i], y, x[i - 1])
  }
  data.table(
    draw = seq_len(N),
    acc = acc,
    x = x
  )[(burn + 1):(.N)]
}
nchain <- mc_norm(x0 = 2)
```

```{r npathmh, fig.cap="Standard normal by M-H"}
nchain %>% 
  ggplot(aes(x = draw, y = x)) +
  geom_path(aes(colour = acc, group = 1)) +
  labs(
    x = "Draw",
    colour = "Acceptance"
  ) +
  theme(legend.position = "bottom")
```

As we can see in Figure \@ref(fig:npathmh), many are accepted.

```{r nmixmh, fig.cap="Standard normal Mixing"}
nchain %>% 
  ggplot(aes(x = draw, y = x)) +
  geom_jitter(aes(colour = x, alpha = abs(x)), show.legend = FALSE) +
  scale_colour_gradient(low = "#0091ff", high = "#f0650e") +
  xlab("Draw")
```

Is Figure \@ref(fig:nmixmh) mixed well? Not as previous example, but we cannot say it is awkward. Now we check gelman-rubin statistic. First make a custom function to compute this.

1. `gen_chain`: generate multiple chains denoted column `chain`
2. `compute_gelman`: input $\{ \psi_{nj} \}$, `data.table` with group `j`

```{r}
gen_chain <- function(N = 15000, x0 = c(-10, -5, 5, 10), sd = .2, burn = 1000, k = 4) {
  lapply(
    x0, 
    function(i) mc_norm(N = N, x0 = i, sd = sd, burn = burn)[, chain := i]
  ) %>% 
    rbindlist() %>% 
    .[,
      .(draw, acc, x, psi = cumsum(x) / seq_along(x)), # scalar summary statistic
      by = chain]
}
```

Next, `compute_gelman()` function is applied to

$$\psi(X_{1j}, \ldots, X_{nj}) = \frac{1}{n} \sum_{i = 1}^n X_{ij}$$

```{r psimh, fig.cap="$\\psi$ generated by M-H - different initial values"}
zmc <- gen_chain()
#-----------------
zmc %>% 
  ggplot(aes(x = draw, y = psi)) +
  geom_path(aes(colour = acc, group = 1)) +
  facet_grid(
    chain ~.
  ) +
  labs(
    x = "Draw",
    y = expression(psi),
    colour = "Acceptance"
  ) +
  theme(legend.position = "bottom")
```


```{r}
compute_gelman <- function(mc) {
  nk <- 
    mc[,
       .N,
       by = chain]
  n <- nrow(nk)
  k <- unique(nk[,N])
  pbar <- mean(mc[,psi])
  w <- 
    mc[,
       .(ave = mean(psi), s = var(psi)),
       by = chain]
  B <- n / (k - 1) * sum((w[,ave] - pbar)^2)
  W <- sum(w[,s]^2) / k
  V <- W * (n - 1) / n + B / n
  sqrt(V / W)
}
```

Put `zmc` into this function.

```{r}
(rhat <- compute_gelman(zmc))
```

We get $\hat{R} = `r rhat`$. It is not that far from $1$. Actually, we can get each $\hat{R}_i$ for each $i$ until $n$ and observe its convergence.

```{r}
update_gelman <- function(mc) {
  k <- 
    mc[,
       .N,
       by = chain][,N] %>% 
    unique()
  parallel::mclapply(
    seq_len(k)[-1],
    function(i) {
      data.table(
        draw = i,
        r = compute_gelman(mc[, .SD[seq_len(i)], by = chain])
      )
    },
    mc.cores = 2
  ) %>% 
    rbindlist() %>% 
    .[order(draw)]
}
#--------------------------
rhatk <- update_gelman(zmc)
```

Figure \@ref(fig:gelmanseq) presents that the value becomes almost 1 quite fast.

```{r gelmanseq, fig.cap="Sequence of the gelman-rubin $\\hat{R}$"}
rhatk %>% 
  ggplot(aes(x = draw, y = r)) +
  modelr::geom_ref_line(h = 1) + # close to 1
  geom_path() +
  labs(
    x = "Draw",
    y = expression(hat(R))
  )
```

<!-- Actually, there is a package called `coda` which includes in a function caculating potential scale reduction factor. -->

<!-- ```{r} -->
<!-- library(coda) -->
<!-- ``` -->

<!-- In this library, we will use -->

<!-- - `coda::as.mcmc()` -->
<!-- - `coda::gelman.diag()` -->
<!-- - `coda::gelman.plot()` -->

<!-- `gelman.diag()` computes $\hat{R}$. This function requires input `mcmc` object, so we need `as.mcmc()`. -->

<!-- ```{r} -->
<!-- gelman_mc <-  -->
<!--   zmc[, -->
<!--       .(psi_mcmc = as.mcmc(psi)), -->
<!--       by = chain] %>%  -->
<!--   split(by = "chain") %>%  -->
<!--   lapply(function(x) x[,psi_mcmc]) -->
<!-- ``` -->

<!-- ```{r} -->
<!-- gelman.diag(gelman_mc) -->
<!-- ``` -->

<!-- ```{r gelpt, fig.cap="$\\hat{R}$ using coda package"} -->
<!-- gelman.plot(gelman_mc) -->
<!-- ``` -->