% Created 2020-07-16 Thu 16:28
% Intended LaTeX compiler: pdflatex
\documentclass[draft,usenatbib]{mnras}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage{hyperref}
\usepackage{ulem}
\usepackage{grffile}
\usepackage{longtable}
\usepackage{capt-of}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{pgf}
\usepackage{pgfplots}
\usepackage{pdflscape}
\usepackage{varioref}
\usepackage{bm}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{natbib}
\usepackage[nameinlink,capitalize,noabbrev]{cleveref}
\DeclareMathOperator{\TopHat}{TH}
\DeclareMathOperator{\CDF}{CDF}
\date{Accepted XXX. Received YYY; in original form ZZZ}
\pubyear{2020}
\hypersetup{
pdfauthor={},
pdftitle={},
pdfkeywords={},
pdfsubject={},
pdfcreator={Emacs 28.0.50 (Org mode 9.3)},
pdflang={English}}
\begin{document}
\begin{abstract}
We develop a method for improving the performance, accuracy and precision of nested sampling, called consistent posterior repartitioning. If a user has any knowledge of the covariance structure and location of the posterior peak(s), it may be used to dramatically improve the performance of nested sampling without biasing the evidence or posterior samples.
This knowledge is analogous to a pre-computed proposal matrix and Metropolis-Hastings start point, and importantly, if an incorrect repartitioning is mistakenly applied, has no impact on the performance or accuracy relative to the unaccelerated run.
We demonstrate this on toy and cosmological examples, and show a real-world speed-up for a concordance cosmological run by a factor of \(XX\). When repartitioned, nested sampling spends less time in the tails of the distribution and consequently accumulates less error in evidence and posterior estimation. When normalised for precision, the performance uplift is three orders of magnitude.
This work may be viewed as the natural continuation of previous research by \citet{chen-ferroz-hobson}, may be applied with minimal modification to any existing scripts, and opens up a new evolutionary class of nested sampling algorithms.
Our code is released as an open-source and extendable Python package, \texttt{SuperNest}.\footnote{GitHub/GitLab repository here}
\end{abstract}
\title[\texttt{SuperNest}]{\texttt{SuperNest}: accelerated nested sampling with applications to astrophysics and cosmology}
\hypersetup{
pdftitle={SuperNest: accelerated nested sampling with applications to astrophysics and cosmology},
pdflang={English}}
\author[Petrosyan \& Handley]{
Aleksandr Petrosyan,$^{1,2,3}$\thanks{E-mail: ap886@cam.ac.uk}
Will Handley$^{1,2,4}$\thanks{E-mail: wh260@cam.ac.uk}
\\
$^{1}$Astrophysics Group, Cavendish Laboratory, J. J. Thomson Avenue, Cambridge, CB3 0HE, UK\\
$^{2}$Kavli Institute for Cosmology, Madingley Road, Cambridge CB3 0HA, UK \\
$^{3}$Queens' College, Silver Street, Cambridge, CB3 9ET, UK \\
$^{4}$Gonville \& Caius College, Trinity Street, Cambridge, CB2 1TA, UK
}
\section{Introduction}
\label{sec:org0ed8441}
The standard model of the universe and its evolution in modern
cosmology is the \(\Lambda\)CDM model \citep{Condon2018}, so named
after the main components of the universe: the cosmological constant
\(\Lambda\) and cold dark matter. It has six major
independent\footnote{There can be other equivalent parameter
sextuplets.} parameters \citep{Condon2018}: the physical baryon density
\(\Omega_\mathrm{b}h^{2}\); the physical (cold) dark matter density
\(\Omega_\mathrm{c}h^{2}\); the angular parameter
\(100\theta_\mathrm{s}\); the re-ionisation optical depth
\(\tau_\text{reio}\); and the power spectrum slope \(n_\mathrm{s}\) and
amplitude \(\ln (10^{10}A_\mathrm{s})\) \citep{Cosmology}.
The task of the present study is to develop better tools for
evaluating the agreement of our observations from the Planck mission
with \(\Lambda\)CDM, estimating the parameters in the process. In
the language of Bayesian statistics\footnote{See \cite{xkcd} for a
comparison with frequentist statistics.}, our goal is efficient
Bayesian inference.
While said inference can in principle be executed analytically, it
is often intractable even when performed numerically. For context, a
standard \(\Lambda\)CDM inference run requires an HPC cluster
with at least 128 nodes, each with at least 6\,GB of memory, and the
equivalent of three full days of operation. To add insult to injury,
the error margins on the parameters and the evidence, if computed at
all, are staggering. Even then, such a result requires judicious
tuning and careful consideration of the model, which at present
cannot be automated. Equivalent inference for any model other than
\(\Lambda\)CDM is thus out of reach of most cosmologists: at
present, for example, work towards resolving the Hubble tension
\citep{tension} is impractical for most professional cosmologists.
The aim of the present study is to correct these circumstances.
Multiple numerical algorithms exist to perform Bayesian inference:
Metropolis-Hastings \citep{Metropolis} in conjunction with the Gibbs
sampler \citep{Metropolis-Hastings-Gibbs}; Hybrid (Hamiltonian)
Monte Carlo \citep{1701.02434,Duane_1987}, and nested sampling
\citep{Skilling2006}. Each of these algorithms has different
advantages: Metropolis-Hastings is one of the fastest at estimating
the model parameters, at the cost of not evaluating the evidence,
which is a universal metric of model fitness.
It is also well-known that most Bayesian inference methods, like
Metropolis-Hastings, can benefit from extra information provided at
inference time. This information we call a \emph{proposal}, as it
usually encodes what we expect the posterior distribution to be
for each parameter. It has become standard practice to provide
proposals along with cosmological inference packages, such as
\texttt{CosmoMC}. However, to date there has been no such mechanism for
nested sampling.
We shall explore a method of injecting said information into the
inference process, based on an idea first pointed
out by \cite{chen-ferroz-hobson}. Whilst \emph{automatic posterior
repartitioning} was originally used to resolve the issue of
unrepresentative priors, the idea can be exploited to create a much
more potent methodology for conducting nested sampling.
In the present paper we outline the method of \emph{stochastic
superpositional mixing} of \emph{consistent partitions} and demonstrate its
efficacy at utilising proposal information. We do so by
providing a brief overview of Bayesian inference, highlighting the
peculiarities of said inference method vital for our method to work.
Afterwards we discuss the important evolution of our method,
and the infrastructure one would need in order to support the
widespread use of proposal distributions. Finally, we explore an
evolution of the nested sampling algorithm and the approaches one
would take to Bayesian inference.
\begin{table}
\centering
\caption{A non-exhaustive list of major implementations of nested sampling.}
\begin{tabular}{lr}
\textbf{Name} & \textbf{Publication}\\
\hline
\texttt{MultiNest} & \cite{Feroz2009MultiNestAE} \\
\texttt{PolyChord} & \cite{polychord} \\
\texttt{nestle} & \cite{nestle} \\
\texttt{dyNesty} & \cite{Speagle_2020}
\end{tabular}
\end{table}
\section{Theoretical background}
\label{sec:orgb009956}
In this section we shall primarily focus on previous work in the
area. We must outline the key elements of Bayesian inference via
nested sampling in order to understand why using more informative
priors directly is not a good solution, and why one needs to perform
consistent re-partitioning.
\subsection{Bayesian inference}
\label{sec:org16a4244}
Bayesian inference, as the name suggests, is based on a
remarkable result in the theory of probability, called Bayes'
theorem. It is mainly concerned with the relationships between
conditional probabilities, which map neatly onto the concepts of
known facts and assumptions, while retaining the ability to
reason about objective truths.
We must first set up the terminology used. A model \({\cal M}\) of
a physical process, is parameterised by \(\bm{\theta} =
(\theta_{1}, \theta_{2}, \ldots , \theta_{n})\). New empirical
observations of said process are encapsulated in the
\textbf{\emph{dataset}} \(\mathfrak{D}\). The
\textbf{\emph{likelihood}} \({\cal L}\) of the parameters
\(\bm{\theta}\) is the probability of observing \(\mathfrak{D}\),
conditional on the configuration \(\bm{\theta}\) and the model \({\cal
M}\). The prior \(\pi(\bm{\theta})\) is the probability of
\(\bm{\theta}\) assuming \({\cal M}\). It can be obtained from both
previous datasets as well as constraints inherent to the model. The
posterior is a probability of \(\bm{\theta}\) that is conditional on
\({\cal M}\) and the dataset \({\mathfrak D}\). The locus of all
\(\bm{\theta}\) for which the prior is both defined and non-zero
defines the \textbf{\emph{prior space}} \(\Psi\). Finally, the
\textbf{\emph{evidence}} is the probability of the data
\({\mathfrak D}\) assuming the model.
The interaction of the probabilities in \cref{tab:defs} is governed by
\citeauthor{1763}'s theorem:
\begin{equation}\label{eq:bayes}
{\cal L}(\bm{\theta}) \times \pi (\bm{\theta}) = {\cal Z} \times {\cal P} (\bm{\theta}).
\end{equation}
Bayesian inference is the process of reconciling the model \({\cal M}\)
represented in \({\cal L}\) and \(\pi\), with observations
\(\mathfrak{D}\) represented in \({\cal L}\). A numerical algorithm that
obtains \({\cal Z}\) and \({\cal P}\) from \(\pi\) and \({\cal L}\), is called
a \emph{\textbf{sampling algorithm}} or \textbf{\emph{sampler}}.
The convenient representation of \(\pi\) and \({\cal L}\) depends on the
particulars of the sampler. For \textbf{\emph{nested sampling}}
(e.g. \texttt{PolyChord}, \texttt{MultiNest}) we delineate them
indirectly: with the logarithm of the likelihood probability-density
function \(\ln {\cal L}(\bm{\theta})\), and the \textbf{\emph{prior
quantile}} \(C\{\pi\}(\bm{\theta})\). The latter is a coordinate
transformation \(C: \bm{u} \mapsto \bm{\theta}\) that maps a uniform
distribution of \(\bm{u}\) in a unit hypercube to \(\pi(\bm{\theta})\) in
\(\Psi\). It is often obtained by inverting the cumulative distribution
function of the prior.
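As a concrete illustration of this transformation, the sketch below (our own toy example, not part of any named package; the Gaussian means and widths are arbitrary assumptions) maps uniform unit-hypercube samples to samples from a product of independent Gaussian priors via the inverse CDF:

```python
import numpy as np
from scipy.stats import norm

# Prior quantile function C: u -> theta for independent Gaussian priors.
# Each coordinate of u is uniform on (0, 1); norm.ppf is the inverse CDF.
def prior_quantile(u, mu, sigma):
    return mu + sigma * norm.ppf(u)

rng = np.random.default_rng(0)
u = rng.uniform(size=(10_000, 2))               # points in the unit hypercube
theta = prior_quantile(u, mu=np.array([0.0, 1.0]),
                       sigma=np.array([1.0, 0.5]))

# The transformed samples are distributed according to the prior.
print(theta.mean(axis=0))   # close to [0.0, 1.0]
print(theta.std(axis=0))    # close to [1.0, 0.5]
```

The same pattern generalises to any prior whose cumulative distribution function can be inverted per coordinate.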
\begin{table}
\caption{Definitions of main quantities in Bayesian inference. \label{tab:defs}}
\centering
\begin{tabular}{llr}
\textbf{Term} & \textbf{Symbol} & \textbf{Definition}\\
\hline
Prior & \(\pi(\bm{\theta})\) & \(P ( \bm{\theta} \vert {\cal M})\) \\
Likelihood & \({\cal L}(\bm{\theta})\) & \(P ( \bm{\mathfrak{D}} \vert \bm{\theta} \cap {\cal M})\) \\
Posterior & \({\cal P}(\bm{\theta})\) & \(P ( \bm{\theta} \vert \bm{\mathfrak{D}} \cap {\cal M})\) \\
Evidence & \({\cal Z}\) & \(P ( \bm{\mathfrak{D}} \vert {\cal M})\) \\
\end{tabular}
\end{table}
\citeauthor{1763}'s theorem holds on the set intersection of the
domains of all probability density functions. Let \(D\{f\}\) denote the
domain of the probability density function \(f\), i.e.~where \(f\) is
both defined and \textbf{non-zero}. Hence
\begin{equation}
D \{ \pi \} \cap D \{ {\cal L} \} = D \{ {\cal P} \} \subset \Psi,
\end{equation}
meaning the inference is possible only on the intersection of the
domains of prior and likelihood.\label{domain-discussion}
For each choice of \({\cal L}\) and \(\pi\), there is a unique choice
of \({\cal Z}\) and \({\cal P}\); equivalently they represent the same
unique model \({\cal M}\), or partition it consistently. That
correspondence is \emph{surjective}, but not \emph{injective}: many
choices of \({\cal L}(\bm{\theta})\) and \(\pi (\bm{\theta})\) may
correspond to the same \({\cal P} (\bm{\theta})\) and \({\cal Z}\)
\citep{chen-ferroz-hobson}. This remark is the cornerstone of the
automatic posterior repartitioning, which we shall exploit to
obtain a speedup.
\subsection{Nested Sampling}
\label{sec:org8d5afbf}
There are very few methods that perform full Bayesian
inference. The chief reason is that one rarely needs more than a
very crude approximation of what full inference can achieve.
More often than not, one reduces the evidence to either 1 (the model
fits the data) or 0, hence removing the need for a complex routine
to evaluate \({\cal Z}\). Similarly, the model parameters' posterior
distribution is often reduced to an uncorrelated, symmetric normal
distribution. Such crude approximations allow one to perform
inference exceptionally quickly. One needs to be in a situation
where resolving differences in evidence of a few per cent may
be a deciding factor before investing in evaluating the evidence to
that precision. There are very few reasons to concern oneself with
evaluating \({\cal Z}\) at all, hence very few invest the resources
into doing it.
Nested sampling is a full inference algorithm, in that it evaluates
both the evidence and the posterior. It is rarely used outside of
circumstances where model comparison is a necessity, because of the
additional overhead associated with evaluating the evidence. While
this is not the only advantage that nested sampling has over other
MCMC methods, it is the chief qualitative difference that can
justify the time investment. To understand why evaluating the
evidence is so valuable, consider how one might do it.
\({\cal P}\) is a probability, therefore normalised, which combined
with \cref{eq:bayes} yields
\begin{equation}
\label{eq:def-z}
{\cal Z} = \int_{\Psi} {\cal L}(\bm{\theta}) \pi(\bm{\theta}) d\bm{\theta}.
\end{equation}
Thus, \citeauthor{1763}'s theorem reduces parameter estimation ---
obtaining \({\cal P}\) from \(\pi\) and \({\cal L}\) --- to
integration~\citep{bayes-integration}. The naïve approach of obtaining
\({\cal Z}\) by uniformly rasterising \(\Psi\) is intractable for
hypotheses with \(O(30)\) parameters \citep{Caflisch_1998}; this is
the problem that nested sampling resolves.
Having motivated the utility of nested sampling, we should provide
an outline of its execution. The following is a short description
of nested sampling \citep{Skilling2006}. We begin by picking
\(n_\text{live}\) \textbf{\emph{live points}} at random in
\(\Psi\). During each subsequent iteration, the point with the lowest
likelihood is declared \textbf{\emph{dead}}, and a new live point
\(\bm{\theta}\in\Psi\) with a higher likelihood is drawn, based on
the prior \(\pi\) and an implementation-dependent principle. Live
points are thus gradually moved into regions of high likelihood. By
tracking their locations and likelihoods, a statistical argument
lets us approximate \({\cal Z}\) and its error at each
iteration and, by \cref{eq:bayes}, \({\cal P}(\bm{\theta})\). We
continue until only a pre-determined fraction of the evidence
associated with \(\Psi\) remains unaccounted for.
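The loop above can be sketched as follows (a deliberately naïve toy of our own, assuming a uniform prior on the unit square and drawing the likelihood-constrained replacement point by brute-force rejection; production samplers such as \texttt{PolyChord} or \texttt{MultiNest} replace that step with far more efficient schemes):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.05

def log_like(theta):
    # 2D Gaussian likelihood centred at (0.5, 0.5) with width sigma,
    # normalised so that the true evidence is very close to 1.
    return (-0.5 * np.sum(((theta - 0.5) / sigma) ** 2)
            - np.log(2 * np.pi * sigma ** 2))

n_live, n_iter = 100, 700
live = rng.uniform(size=(n_live, 2))             # draws from the uniform prior
live_logl = np.array([log_like(t) for t in live])

log_z, log_x = -np.inf, 0.0                      # running evidence, prior volume
for i in range(n_iter):
    worst = np.argmin(live_logl)
    log_x_new = -(i + 1) / n_live                # E[ln X] shrinks by 1/n_live
    log_w = np.log(np.exp(log_x) - np.exp(log_x_new))   # shell volume
    log_z = np.logaddexp(log_z, live_logl[worst] + log_w)
    log_x = log_x_new
    # Replace the dead point with a prior draw at higher likelihood.
    while True:
        t = rng.uniform(size=2)
        if log_like(t) > live_logl[worst]:
            live[worst], live_logl[worst] = t, log_like(t)
            break

# Account for the evidence still enclosed by the final live points.
log_z = np.logaddexp(log_z, log_x + live_logl.max()
                     + np.log(np.mean(np.exp(live_logl - live_logl.max()))))
print(log_z)   # near ln Z = 0 for this normalised toy problem
```

The statistical argument mentioned above enters through the geometric shrinkage estimate \(\mathrm{E}[\ln X_i] = -i/n_\text{live}\) of the enclosed prior volume.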
Recall that not all parameter inference methods require obtaining
\({\cal Z}\). Some methods, such as Hamiltonian Monte Carlo
\citep{1701.02434}, allow obtaining a normalised \({\cal P}\)
directly. For such approaches, any consistent specification of
\(\pi\) and \({\cal L}\) will lead to identically the same posterior,
barring numerical errors. This is also true of methods that
evaluate \({\cal Z}\) exactly. However, nested sampling allows
uncertainty in \({\cal Z}\), which is controlled by \(\pi\) and \({\cal
L}\). Thus nested sampling, unlike e.g.~Metropolis-Hastings
\citep{Metropolis-Hastings-Gibbs}, is sensitive to the concrete
definitions of the prior and likelihood. While many choices of \(\pi\)
and \({\cal L}\) correspond to the same \({\cal P}\) and \({\cal Z}\),
the errors and nested sampling's time complexity depend on
the specification of \(\pi\) \citep{Skilling2006}. Specifically, more
\emph{informative} priors are preferable.
In the following section we shall discuss how information content
is measured.
\subsection{Metrics and informativity}
\label{sec:org0abc6cc}
An important quantity for measuring the correctness of the obtained
posterior is the \textbf{\emph{Kullback-Leibler divergence}} \({\cal
D}\) \citep{Kullback_1951}. For probability distributions
\(f(\bm{\theta})\) and \(g(\bm{\theta})\), it is defined as:
\begin{equation}
\label{eq:kl-def}
{\cal D}\{f, g \} = \int_{\Psi}f(\bm{\theta}) \ln \frac{f(\bm{\theta})}{g(\bm{\theta})} d \bm{\theta}.
\end{equation}
It is a pre-metric on the space of probability distributions: it is
nil if and only if \(f(\bm{\theta}) = g(\bm{\theta})\) (albeit not
symmetric), which is convenient for defining a representation
hierarchy. The statement ``\(f\) represents \(g\) better than \(h\)'' is
equivalent to
\begin{equation}
\label{eq:hierarchy}
{\cal D}\{f, g\} < {\cal D}\{h, g\}.
\end{equation}
Specifically, distribution \(h\) is said to be unrepresentative of \(g\)
if a uniform distribution \(f\) represents \(g\) better than \(h\).
A probability density function \(f(\bm{\theta})\) is said to be
more \textbf{\emph{informative}} than \(g(\bm{\theta})\) if:
\begin{equation}
\label{eq:informative}
{\cal D}\{ f, g \} > {\cal D}\{ g, f\}.
\end{equation}
This also highlights that the Kullback-Leibler divergence is not a
metric on the space of distributions. However, its asymmetry
lends itself well to considerations where such an asymmetry is
natural: e.g.~priors are not equivalent to posteriors, one comes
after the other, and so \({\cal D}\) can be used to quantify the
``surprise'' information obtained during inference.
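The asymmetry is easy to exhibit numerically. The sketch below (illustrative only; the two Gaussians are arbitrary choices of ours) evaluates \cref{eq:kl-def} on a grid for a narrow and a broad Gaussian and shows that the two orderings of the arguments give different divergences:

```python
import numpy as np

theta = np.linspace(-10, 10, 100_001)
d_theta = theta[1] - theta[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kl(f, g):
    # D{f, g} evaluated as a Riemann sum on the grid.
    mask = f > 0
    return np.sum(f[mask] * np.log(f[mask] / g[mask])) * d_theta

f = gauss(theta, 0.0, 0.5)   # narrow distribution
g = gauss(theta, 0.0, 2.0)   # broad distribution

# For zero-mean Gaussians the closed form is ln(s2/s1) + s1^2/(2 s2^2) - 1/2.
print(kl(f, g))   # ~0.92
print(kl(g, f))   # ~6.11, so D{f,g} != D{g,f}
```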
The time complexity \(T\) of nested sampling is
\begin{equation}\label{eq:complexity}
T \propto n_\text{live}\ \langle {\cal T}\{{\cal L}(\bm{\theta})\} \rangle \ \langle {\cal N}\{{\cal L}(\bm{\theta})\} \rangle,
\end{equation}
where \({\cal T}\{f(\bm{\theta})\}\) is the time complexity of
evaluating \(f(\bm{\theta})\) and \({\cal N}\{f(\bm{\theta})\}\) ---
the quantity of such evaluations. Reducing \(n_\text{live}\) reduces
the resolution of nested sampling, while \({\cal T}\{{\cal
L}(\bm{\theta})\}\) is model-dependent. We can, however, reduce the
number of likelihood evaluations by providing a more informative
prior, though doing so carries an associated risk, which we shall
address later.
Choosing the correct representations of \({\cal P}\) and \(\pi\) is
crucial for nested sampling's correctness and performance. For
example, assuming the same likelihood, if \(\pi_{0}\) and \(\pi_{1}\)
are equally informative, but \(\pi_{0}\) is more representative of
\({\cal P}\), then the inference with \(\pi_{0}\) will terminate more
quickly than with \(\pi_{1}\) (and the result will also be more accurate).
Similarly, if \(\pi_{1}\) is more informative than \(\pi_{2}\), but
equally as representative, nested sampling will terminate with
\(\pi_{1}\) faster than with \(\pi_{2}\), and the result will be more
precise. In detail, if \(\pi_{1} (\bm{\theta})\) is more similar to
\({\cal P} (\bm{\theta})\), then points drawn with PDF \(\pi_{1}
(\bm{\theta})\) are more likely to lie in \(\bm{\theta}\) regions of
high \({\cal P} (\bm{\theta})\), leading to fewer iterations.
Posteriors \({\cal P}_{1}\) and \({\cal P}_{2}\) obtained with the
priors \(\pi_{1}\) and \(\pi_{2}\) are different, because of
\cref{eq:bayes}. In fact, the posterior \({\cal P}_{1}\) will be more
informative than \({\cal P}_{2}\), and more similar to
\(\pi_{1}\). This effect we call \textbf{\emph{prior imprinting}}.
Imprinting is desirable if the informative prior \(\pi_{1}\) is the
result of inference over another dataset. Nonetheless, imprinting
limits the information obtainable from \(\mathfrak{D}\). There is a
considerable risk of getting no usable data from the inference,
which makes one prefer uniform priors even when more information is
available.
The problem is exacerbated in the case of proposals. The issue is that
the algorithm has no room to consult the proposal distributions
outside of the prior, and using a prior taken out of ``thin air'' with
nested sampling is a recipe for disaster. However, in the next
section we discuss how one can mitigate these issues and use
a proposal as an aspect of a prior.
\subsection{Power posterior repartitioning and unrepresentative priors}
\label{sec:orgd933e18}
\textbf{NB:} From this section onward we shall adopt the following
notation. \(\pi\) and \({\cal L}\) with similar decorations (index,
diacritics), belong to the same specification of the model. Models
using the uniform prior are special, in that they obtain the most
accurate posterior and evidence, thus they are represented with an
over-bar (the plot of a uniform prior in 1D is a horizontal
line). Hats delineate the consistent partitions that incorporate
the proposal (the hat represents the peak(s) often present in
informative proposals).
We are working under the assumption that \(\pi(\bm{\theta})\) is an
informative, unrepresentative prior. We want to obtain the correct
posterior \(\bar{\cal P}\) without using the uniform, universally
representative reference prior \(\bar{\pi}\), because it is often the
least informative. To avoid loss of precision and mitigate prior
imprinting, \cite{chen-ferroz-hobson} have proposed introducing the
parameter \(\beta\) to control the breadth of the informative prior:
\begin{equation}
\label{eq:autopr-prior}
\hat{\pi}(\bm{\theta};\beta) = \cfrac{\pi(\bm{\theta})^{\beta}}{Z(\beta)\{\pi\}},
\end{equation}
(see \cref{fig:ppr}) where \(Z(\beta)\{\pi\}\), a functional of
\(\pi (\bm{\theta})\), is the normalisation factor of
\(\hat{\pi}(\bm{\theta};\beta)\), i.e.
\begin{equation}
Z(\beta)\{\pi\} = \int_{\Psi} \pi(\bm{\theta})^{\beta}d\bm{\theta}.
\end{equation}
In their prescription, the likelihood changes to
\begin{equation}
\hat{\cal L}(\bm{\theta}; \beta) = {\cal L}(\bm{\theta}) Z(\beta)\{\pi\} \cdot \pi^{1-\beta}(\bm{\theta}).
\end{equation}
The new parameter \(\beta\) is treated as any other non-derived
parameter of the original theory.
\begin{figure}
\input{./illustrations/ppr.tex}
\caption{\label{fig:ppr} Demonstration of
\(\hat{\pi}(\theta; \beta)\) for different values of \(\beta\) in
one dimension. We have assumed that the original
\( \pi (\bm{\theta})\) distribution is a truncated Gaussian,
i.e.~zero outside the region \((-1, 1)\). Numerical instability,
which manifests as changes in curvature at the boundaries, is
exaggerated. The area under the curves for each \(\beta\) is
normalised to unity as in \cref{eq:autopr-prior}.}
\end{figure}
Note that
\({\cal L}(\bm{\theta})\pi (\bm{\theta}) = \hat{\cal L}(\bm{\theta})
\hat{\pi} (\bm{\theta})\) by construction. Thus, from \cref{eq:bayes}
the posterior and evidence corresponding to
\(\hat{\cal L}(\bm{\theta};\beta)\) and
\(\hat{\pi} (\bm{\theta};\beta)\) will be the same as
\({\cal P} (\bm{\theta})\) and \({\cal Z}\), which correspond to the
original \(\pi(\bm{\theta})\) and \({\cal L}(\bm{\theta})\).
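This invariance is straightforward to verify numerically. The sketch below (our own illustration, with an arbitrary truncated-Gaussian prior and Gaussian likelihood on a one-dimensional grid) constructs \(\hat{\pi}\) and \(\hat{\cal L}\) as above and checks that the product \({\cal L}\pi\) is unchanged for several values of \(\beta\):

```python
import numpy as np

theta = np.linspace(-1, 1, 20_001)           # prior support
d_theta = theta[1] - theta[0]

pi = np.exp(-0.5 * (theta / 0.3) ** 2)       # truncated Gaussian prior
pi /= np.sum(pi) * d_theta                   # normalise on (-1, 1)
L = np.exp(-0.5 * ((theta - 0.4) / 0.1) ** 2)

for beta in (0.0, 0.5, 1.0, 2.0):
    Z_beta = np.sum(pi ** beta) * d_theta    # normalisation Z(beta){pi}
    pi_hat = pi ** beta / Z_beta             # repartitioned prior
    L_hat = L * Z_beta * pi ** (1 - beta)    # repartitioned likelihood
    assert np.allclose(L_hat * pi_hat, L * pi)   # product is invariant
print("L*pi is invariant under repartitioning")
```

The invariance is exact by construction; only the split between prior and likelihood, and hence the sampler's behaviour, changes with \(\beta\).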
If the informative prior \(\pi (\bm{\theta})\) is less
representative of the posterior \(\bar{\cal P} (\bm{\theta})\), the
error in \(\hat{\cal Z}\) is larger. Hence, while we do not violate
\cref{eq:bayes} directly, \(\bar{\cal Z}\) can differ more from
\({\cal Z}\) while remaining within the margin of error, and similarly
\({\cal P}(\bm{\theta}) \ne \bar{\cal P}(\bm{\theta})\). This is
where the new parameter comes into play: \(\hat{\pi}\) may become
representative for some value of \(\beta = \beta_{R}\). Values of
\(\beta\) close to \(\beta_{R}\) correlate with higher likelihoods,
so the sampler prefers them. Hence, the system will converge to a
state where \({\cal P} (\bm{\theta})\) is represented in
\(\hat{\pi} (\bm{\theta};\beta)\)\footnote{Technically we obtain \(\hat{\cal P} (\bm{\theta};\beta)\) which, when marginalised over
\(\beta\), yields \({\cal P} (\bm{\theta}) = \int \hat{\cal P}
(\bm{\theta};\beta) d \beta\) --- the correct posterior.}. As a
consequence, we reduce the errors and obtain the same result as
we would have with a less informative but more representative
prior.
\cite{chen-ferroz-hobson} dubbed this scheme
\textbf{\emph{automatic power posterior repartitioning}} (PPR),
because the choice of \(\beta\rightarrow\beta_{R}\) is automatic. It
mitigates the loss of precision, and thus accuracy, for unrepresentative
informative priors \(\pi\), by sacrificing performance.
\section{Discoveries}
\label{sec:org85f6b5d}
\subsection{The trouble with proposals}
\label{sec:org8bc2bd5}
Nested sampling is different from Metropolis-Hastings-Gibbs and
many other Markov-Chain Monte Carlo methods. Often, such algorithms
are designed with a separate input that is the proposal: an initial
guess that guides the algorithm towards the right answer. For
nested sampling no such provisions are in place. The only input
where such information can be used is the prior. Thus, to
understand why one can't use proposals directly, we must first
address why informative priors are avoided.
From \cref{eq:bayes}, we can see that changing only the prior \(\pi\)
necessarily leads to changes in both \({\cal P}\) and \({\cal Z}\). For
example if \(\pi\) is a Gaussian centered at
\(\bm{\theta}=\bm{\mu}_{\pi}\) and \({\cal L}\) is a Dirac
\(\delta\)-function peaked at \(\bm{\theta}=\bm{\mu}_{{\cal L}}\), with
\(\bm{\mu}_{\pi}\) sufficiently far from \(\bm{\mu}_{{\cal L}}\) then
the posterior will necessarily have peaks at both \(\bm{\mu}_{\pi}\)
and \(\bm{\mu}_{{\cal L}}\). This is an example of prior imprinting
and is a necessary part of a Bayesian view of statistics. For a
Bayesian, the prior information is no less valuable than the
information inferred from the dataset \(\mathfrak{D}\), and the
posterior represents \emph{all} of our best knowledge.
The problem however, is the \emph{prejudiced sampler}. Because
nested sampling chooses live points with probability proportional
to the prior, the probability of a point being drawn from the
likelihood peak can be made arbitrarily small. In fact, if
\(\bm{\mu}_{{\cal L}}\) and \(\bm{\mu}_{\pi}\) are separated by more
than five standard deviations of the prior Gaussian, thirty million
samples will be drawn from \(\bm{\mu}_{\pi}\) before a single point
is drawn on the circle containing \(\bm{\theta} = \bm{\mu}_{{\cal
L}}\).
An apt analogy can be drawn with the Venera-14 mission
\citep{siddiqi2018beyond}. Upon landing, due to a number of
unfortunate coincidences, the lander took its one and only
measurement of Venusian soil from one of its own lens caps. As a
result, we have obtained objectively correct information from
Venus: a sample of an object on its surface. However, the value of
a measurement that in effect gauged the compressibility of
Earth-made rubber leaves much to be desired.
Before \cite{chen-ferroz-hobson} the best solution was to use a
uniform prior that included both \(\bm{\mu}_{\pi}\) and
\(\bm{\mu}_{{\cal L}}\). The computational cost of inference is so
high that the risk of gaining nothing from a dataset is
untenable. Thus discarding all prior information in hopes of
inferring some from the dataset is preferable to using the
information in \(\pi\).
Thus, proposals are not even considered for use with nested
sampling. Since proposals may be crude approximations, we may
obtain far worse than no new information. Any potential benefit in
performance or precision is far outweighed by the unreliable
posterior. We do, however, have one method of mitigating these
problems --- automatic posterior repartitioning
\citep{chen-ferroz-hobson}. Though the connection may seem unclear
at this stage, automatic posterior repartitioning schematically
allows one to represent infinitely many pairs of \(\pi\) and
\({\cal L}\) which all produce the same evidence and
posterior. If one can encode the proposal as a prior that obtains
the same evidence and posterior as the prior one started with,
one could, in theory, obtain all the benefits of a more
informative prior, while also obtaining information that pertains to
the model in question rather than merely repeating the information
provided as a guess.
\subsection{How intuitive proposals accelerate convergence}
\label{sec:orgcc977c0}
Consider the following premise: we are given a model \({\cal M}\),
for which our prior \(\pi\) is not the uniform
\(\bar{\pi}(\bm{\theta})\). Instead, usually from other
sources, e.g.~other inferences or physical reasoning, we know that
\begin{equation}
\pi (\bm{\theta}) = f(\bm{\theta}; \bm{\mu}, \bm{\Sigma}),
\label{eq:bias}
\end{equation}
which is representative of the posterior \(\bar{\cal
P}(\bm{\theta})\). Here, the probability density function \(f\) is
parameterised by \(\bm{\mu}\) in its location and \(\bm{\Sigma}\)
in its breadth. In order to obtain the same result as one would have
with the less informative uniform prior \(\bar{\pi}(\bm{\theta})\),
one needs to correct the likelihood \({\cal L}\). Recall that the
reason why PPR obtains the same posterior \(\bar{\cal
P}(\bm{\theta})= \hat{\cal P}(\bm{\theta})\) as one would have
using
\(\bar{\pi} (\bm{\theta}) = \text{Const.}\)
is because \(\hat{\cal
L} (\bm{\theta};\beta)\) and \(\hat{\pi} (\bm{\theta};\beta)\) are
a \textbf{\emph{consistent (re)partitioning}} of \(\bar{\cal Z}\)
and \({\cal P}(\bm{\theta})\). That is:
\begin{equation}
\label{eq:partitioning}
\int_{\Psi} \hat{\cal L} (\hat{\bm{\theta}}) \hat{\pi} (\bm{\hat{\theta}}) d\hat{\bm{\theta}} = \int_{\Psi}\bar{\pi} (\bm{\theta}) \bar{\cal L} (\bm{\theta}) d\bm{\theta} = \bar{\cal Z},
\end{equation}
where in the case of PPR
\(\hat{\bm{\theta}} = (\theta_{1}, \theta_{2}, \ldots, \theta_{n},
\beta)\). \Cref{eq:partitioning} holds if
\begin{equation}
\label{eq:partitioning-p}
\hat{\cal L}(\bm{\theta};\beta) \hat{\pi}(\bm{\theta};\beta) = \bar{\cal L}(\bm{\theta})\bar{\pi}(\bm{\theta})
\end{equation}
for all \(\beta\), by \cref{eq:bayes}. Note that
\cite{chen-ferroz-hobson} have used \cref{eq:partitioning-p} as the
primary expression. Following their convention, we shall sometimes
refer to consistent partitions as posterior repartitioning, rather
than evidence repartitioning.
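As a concrete sanity check, the consistency condition can be verified numerically on a one-dimensional toy problem (a sketch with our own illustrative names; the tempered Gaussian stands in for a PPR-style \(\hat{\pi}\)):

```python
# Numerical sketch of a consistent (re)partitioning: a pair (pi_hat, L_hat)
# constructed so that L_hat * pi_hat = L_bar * pi_bar pointwise yields the
# same evidence as the reference pair. The 1-D setup is illustrative only.
from math import exp, pi, sqrt

def gaussian(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

LO, HI, N = 0.0, 10.0, 20_000
xs = [LO + (HI - LO) * i / N for i in range(N + 1)]
dx = (HI - LO) / N

pi_bar = lambda x: 1.0 / (HI - LO)          # uniform reference prior
L_bar = lambda x: gaussian(x, 7.0, 0.3)     # reference likelihood

# A PPR-style tempered proposal, normalised to a proper density.
BETA = 0.8
norm = sum(gaussian(x, 2.0, 1.0) ** BETA for x in xs) * dx
pi_hat = lambda x: gaussian(x, 2.0, 1.0) ** BETA / norm
L_hat = lambda x: L_bar(x) * pi_bar(x) / pi_hat(x)   # consistency condition

Z_bar = sum(L_bar(x) * pi_bar(x) for x in xs) * dx
Z_hat = sum(L_hat(x) * pi_hat(x) for x in xs) * dx
assert abs(Z_bar - Z_hat) < 1e-9            # the evidence is preserved
```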
By using a more informative prior in this way, we accelerate
convergence, because each iteration obtains a larger evidence
estimate, so fewer are needed to reach the termination point
(see~\cref{fig:benchmark}). There is a competing mechanism: the
evidence estimates accumulate fewer errors, so inference proceeds
longer before the precision loss triggers termination
(\cref{fig:higson}). Thus repartitioning reaches a more precise
result more quickly. Better still, the obtained precision can be
sacrificed to further accelerate inference.
\subsubsection{Example: Intuitive proposal posterior repartitioning}
\label{sec:org17469c3}
Suppose that one has obtained the posterior \({\cal
P}(\bm{\theta})\) from a different inference, which could be
nested sampling with a uniform prior, or Hamiltonian Monte Carlo
(or a theoretical approximation). Thus,
\begin{subequations}
\begin{equation}
\label{eq:iPPR}
\hat{\pi}(\bm{\theta}) = f(\bm{\theta}; \bm{\mu}, \bm{\Sigma}) = {\cal P}(\bm{\theta}),
\end{equation}
is an informative prior that represents our knowledge, but might not
represent the posterior. We call it an \textbf{\emph{(intuitive)
proposal}}. However, we wish to avoid prejudicing the sampler and
use the (uniform) reference prior $\bar{\pi}(\bm{\theta})$, with
reference likelihood $\bar{\cal L}(\bm{\theta})$.
To obtain with $\hat{\pi}(\bm{\theta})$ the same posterior and
evidence as one would have with $\bar{\pi}(\bm{\theta})$ and
$\bar{\cal L}(\bm{\theta})$, the partitioning of the evidence needs
to be \textbf{\emph{consistent}} with the reference. Specifically:
\begin{equation}
\label{eq:ippr-l}
\hat{\cal L}(\bm{\theta}) = \frac{\bar{\pi}(\bm{\theta}) \bar{\cal L}(\bm{\theta})}{ f(\bm{\theta}; \bm{\mu}, \bm{\Sigma})}.
\end{equation}
\end{subequations}
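To illustrate the limiting behaviour (a toy sketch under our own assumptions): if the proposal \(f\) happens to equal the true posterior, the repartitioned likelihood collapses to the constant \(\bar{\cal Z}\), leaving the sampler nothing left to learn:

```python
# Toy 1-D sketch: with a uniform reference prior and a normalised Gaussian
# reference likelihood, the true posterior equals L_bar itself. Taking the
# iPPR proposal f equal to that posterior makes L_hat = pi_bar * L_bar / f
# constant and equal to the evidence. Names are illustrative only.
from math import exp, pi, sqrt

def gaussian(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

LO, HI = -50.0, 50.0
pi_bar = 1.0 / (HI - LO)                   # uniform reference prior density
L_bar = lambda x: gaussian(x, 3.0, 1.0)    # normalised reference likelihood

# Evidence: Z_bar = pi_bar * (integral of L_bar) ~= pi_bar, by normalisation.
Z_bar = pi_bar

f = lambda x: gaussian(x, 3.0, 1.0)        # proposal == true posterior here
L_hat = lambda x: pi_bar * L_bar(x) / f(x)

for x in (-1.0, 3.0, 10.0):
    assert abs(L_hat(x) - Z_bar) < 1e-15   # L_hat is flat at the evidence
```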
We call this scheme \textbf{\emph{intuitive proposal posterior
repartitioning}}\footnote{More accurately evidence repartitioning,
which is equivalent in simple cases.} (iPPR). It is the fastest
possible and the least robust consistent partitioning
scheme. While we have technically addressed the change in \({\cal
P}\) due to a different prior, we have not addressed the problem of
\(\hat{\pi}\) being (potentially) unrepresentative of \(\bar{\cal
P}\). In the example already considered in \cref{sec:prejudice}, we
will have reduced prior imprinting, but not addressed the
prejudice at all: the probability of sampling from the true likelihood
peak is still minuscule. By contrast, we have seen that automatic
power posterior repartitioning can mitigate both issues. What iPPR
lacks is a mechanism for extending its representation. Rather
than attempt a modification akin to power partitioning, in
\cref{sec:isomixtures} we shall provide this mechanism as
completely external to iPPR and unleash its potential.
\subsubsection{General automatic posterior repartitioning}
\label{sec:orgd059e47}
In this section, we look at the family of prescriptions similar to PPR
and iPPR, called consistent partitionings, and note which schemes are
more useful for the task of accelerating nested sampling without
biasing the posterior. We begin by noting that \cref{eq:partitioning}
alone does not guarantee the correct posterior and evidence.
We shall consider a general consistent partitioning
\(\hat{\pi}, \hat{\cal L}\) with re-parametrisation
\(\hat{\bm{\theta}}\). Because \(\bm{\theta} \ne \hat{\bm{\theta}}\)
in general, the posterior \(\hat{\cal P}(\hat{\bm{\theta}})\) would not have
the same functional form as \(\bar{\cal
P}(\bm{\theta})\). Nonetheless, if inverting the parametrisation
from \(\hat{\bm{\theta}}\) to \(\bm{\theta}\) is possible, and under that
procedure \(\hat{\cal P}\) maps to \({\cal P}\), we shall say that
\(\hat{\cal P}\) marginalises to \({\cal P}\). Thus, the correct
posterior is one that marginalises to \(\bar{\cal P}\). We shall often
use \(\hat{\cal P}(\hat{\bm{\theta}})\) interchangeably with the
\({\cal P}(\bm{\theta})\) that it marginalises to.
We can rigorously prove\footnote{in a later publication}, that the
following conditions are necessary for a consistent partitioning
to yield the correct posterior and evidence through Bayesian
inference.
\begin{enumerate}
\item \textbf{Consistency}. The partitioning is consistent,
i.e.~it satisfies \cref{eq:partitioning}. \label[Property]{norm-prop}
\item \textbf{Representation}. In prior hyperspace
$\hat{\Psi} \supset \Psi$ there exists a subspace
$\Psi_{R} \subset \hat{\Psi}$, such that for all
$\hat{\bm{\theta}}\in \Psi_{R}$, \( {\cal P}(\bm{\theta})\) is
represented in \( \hat{\pi} (\bm{\hat{\theta}})\). In other words,
the re-parameterised prior includes a representative
configuration. \label[Property]{spec-prop}
\item \textbf{Convergence}. The sampling favours representative
configurations $\hat{\bm{\theta}} \in
\Psi_{R}$. \label[Property]{vconv-prop}
\item \textbf{Objectivity}. The prior bias (towards
\(\hat{\pi}(\bm{\hat{\theta}})\)) is weaker than the posterior bias
(towards \(\hat{\cal P}(\bm{\hat{\theta}})\)). \label[Property]{obj-prop}
\end{enumerate}
Note that these properties are sensitive to the sampling algorithm. For
example, for inference by uniform-rasterised integration of
\({\cal Z}\), all properties follow from \cref{eq:partitioning-p}. Not so
for a class of algorithms that estimate \({\cal Z}\) by controlled error
propagation and approximation, e.g.~nested sampling. Thus,
understanding the circumstances wherein these conditions are violated
may clarify the conditions under which both PPR and iPPR fail to produce
the expected result.
Firstly, both schemes satisfy \cref{norm-prop} by construction. iPPR satisfies
\cref{spec-prop} if and only if \(\hat{\pi} (\bm{\theta})\)
represented the correct posterior to begin with, in which case
\(\Psi_{R} = \Psi\). \Cref{vconv-prop} follows from the correctness
proof of nested sampling \citep{Skilling2006} and
\cref{spec-prop}. In \cref{sec:autopr} we have shown that PPR
satisfies \cref{spec-prop}, where
\(\Psi_{R} = \{ \beta = \beta_{R} = \text{Const.}\}\), if \(\beta_{R}\)
exists. There is always at least one:
\(\Psi_{R} = \text{Locus} \{ \beta_{R}=0 \} \cap \Psi\), but we are
interested in values of \(\beta_{R} > 0\), as such priors are more
informative. In that section we have also provided an intuitive explanation
for why PPR satisfies \cref{vconv-prop}.
However, consistency alone does not guarantee the correct posterior:
in \cref{fig:convergence}, we see that both the \(\theta_{0}\) and \(\theta_{2}\)
marginalised posteriors are offset from the correct result obtained
using \(\bar{\pi}(\bm{\theta})=\text{Const.}\) This illustrates the
importance of \cref{obj-prop}, as the test case of \cref{fig:convergence}
was constructed specifically to violate it.
\subsection{Isometric mixtures of repartitioning schemes}
\label{sec:org7b95d8a}
In this section we consider two methods of combining several
proposals (consistent partitions) into one consistent
partition. Identifying, via \cref{eq:bayes}, the posterior to which
points in \(\Psi\) correspond as a metric, we name these
\emph{\textbf{isometric}} mixtures.
\subsubsection{Additive isometric mixtures}
\label{sec:org3881130}
Consider \(m\) consistent repartitioning schemes of the same
posterior \(\bar{\cal P}(\bm{\theta})\):
\begin{equation}
\label{eq:collection-of-models}
\bar{\cal L}(\bm{\theta}) \bar{\pi}(\bm{\theta})= \hat{\cal L}_{1}(\bm{\theta}) \hat{\pi}_{1}(\bm{\theta}) = \ldots =\hat{\cal L}_{m}(\bm{\theta}) \hat{\pi}_{m}(\bm{\theta}).
\end{equation}
Their \textbf{\emph{isometric mixture}} is a consistent
partitioning that involves information from each constituent prior,
but preserves the posterior and evidence of its component partitions.
For example: an \textbf{\emph{additive mixture}} (\cref{fig:additive}),
defined as
\begin{subequations}
\begin{alignat}{2}
\hat{\pi}(\bm{\theta}; \bm{\beta}) = &\sum_{i} \beta_{i} \hat{\pi}_{i}(\bm{\theta}),\label{eq:additive-mix}\\
\hat{{\cal L}}(\bm{\theta}; \bm{\beta}) = &\frac{\sum_{i} \beta_{i} \hat{\pi}_{i}(\bm{\theta}) \hat{\cal L}_{i}(\bm{\theta})}{\sum_{i} \beta_{i} \hat{\pi}_{i}(\bm{\theta})},
\end{alignat}
\end{subequations}
parameterised by
\(\bm{\beta} = (\beta_{1}, \beta_{2}, \ldots, \beta_{m})\) where each
\(\beta_{i} \in [0,1]\). It is itself a consistent partitioning,
i.e.~\emph{\textbf{isometric}}, if and only if
\(\sum_{i} \beta_{i} = 1\).
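The ``if'' direction can be checked numerically (a toy sketch with two components and our own naming):

```python
# Sketch: the additive mixture of two consistent partitions of the same
# product L_bar * pi_bar preserves that product pointwise exactly when the
# mixture weights sum to one. Toy 1-D components; names are illustrative.
from math import exp, pi, sqrt

def gaussian(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

pi_bar = lambda x: 0.1                     # uniform reference prior on [0, 10]
L_bar = lambda x: gaussian(x, 5.0, 1.0)    # reference likelihood

pi_1, L_1 = pi_bar, L_bar                  # partition 1: the reference itself
pi_2 = lambda x: gaussian(x, 5.0, 2.0)     # partition 2: a Gaussian proposal
L_2 = lambda x: L_bar(x) * pi_bar(x) / pi_2(x)

def mixture_product(x, b1, b2):
    """L_hat * pi_hat of the additive mixture with weights (b1, b2)."""
    pi_hat = b1 * pi_1(x) + b2 * pi_2(x)
    L_hat = (b1 * pi_1(x) * L_1(x) + b2 * pi_2(x) * L_2(x)) / pi_hat
    return L_hat * pi_hat

x = 4.2
target = L_bar(x) * pi_bar(x)
assert abs(mixture_product(x, 0.3, 0.7) - target) < 1e-12  # weights sum to 1
assert abs(mixture_product(x, 0.3, 0.5) - target) > 1e-3   # weights sum to 0.8
```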
\begin{figure}
\input{illustrations/additive_mixtures.tex}
\caption{\label{fig:additive} An additive isometric mixture of a
Gaussian proposal and a uniform reference. Power-Gaussian added
for comparison.}
\end{figure}
Isometric mixtures are an attempt to relax some of the limitations
imposed by power posterior repartitioning. Firstly, all proposals in
PPR have to be linked by a power relation. This class always includes
a uniform prior, but not, for example, a ``wedding cake'' prior
(stepped uniform prior). Additive mixtures permit such
proposals. Moreover, in additive isometric mixtures, any consistent
partitions are compatible provided the set union of their domains
matches \(\Psi\).
However, additive mixtures have limited utility: they are slow,
difficult to implement, and more susceptible to numerical instability
than any other consistent partitioning\footnote{These claims shall
be substantiated in a more detailed publication.}. We can,
however, do much better.
\subsubsection{Stochastic superpositional isometric mixtures}
\label{sec:orgae3ae62}
One major problem with additive mixtures lies in the definition of
\(\hat{\cal L}\). Instead of having to evaluate only one of the
constituent likelihoods, we are forced to evaluate all of them. Hence,
a lower bound on time complexity:
\begin{equation}
{\cal T}\{\hat{\cal L}\} = \Omega \left( \max_{i} {\cal T}\{ {\cal L}_{i}\} \right), \label{eq:hard-cap}
\end{equation}
which is attained when the likelihoods \({\cal L}_{i}\) are all
related to the same reference (e.g.~\(\bar{\cal L}\)) with only minor
corrections, computed asynchronously to account for the different
proposals. If \({\cal L}_{i}\) and \({\cal L}_{j}\) have no common
computations to re-use, the time complexity grows to
\(\Omega\left[{\cal T}({\cal L}_{i}) + {\cal T}({\cal L}_{j})\right]\).
Another issue is that the overall likelihood depends on the prior PDFs
of the constituents. This is problematic since nested sampling
requires specification of the prior via its quantile
\citep{Skilling2006,polychord,multinest}. Function inversion is not
linear with respect to addition, so the quantile of the weighted sum
needs to be evaluated for each type of mixture individually. For a
linear combination of uniform priors, evaluating the quantile can be
performed analytically, but not in the case of two Gaussians or a Gaussian
mixed with a uniform. By contrast, the quantile of PPR with an
uncorrelated\footnote{Not so for a correlated Gaussian. Nonetheless,
every covariance matrix can be diagonalised, and this can be included
in the re-parametrisation.} Gaussian proposal is found in closed
form.
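The contrast can be sketched as follows (toy parameters and helper names of our own): the Gaussian prior quantile is available in closed form, whereas the quantile of a Gaussian/uniform additive mixture must be obtained by numerically inverting its CDF:

```python
# Sketch: closed-form quantile for a Gaussian prior versus numerical
# inversion (bisection) for a Gaussian/uniform additive mixture.
from statistics import NormalDist

prior = NormalDist(mu=0.0, sigma=1.0)

def gaussian_quantile(u):
    """Closed form, via the inverse error function under the hood."""
    return prior.inv_cdf(u)

LO, HI, BETA = -10.0, 10.0, 0.5

def mixture_cdf(x):
    """CDF of BETA * N(0, 1) + (1 - BETA) * Uniform(LO, HI)."""
    return BETA * prior.cdf(x) + (1.0 - BETA) * (x - LO) / (HI - LO)

def mixture_quantile(u, tol=1e-12):
    """No closed form: invert the CDF by bisection on [LO, HI]."""
    lo, hi = LO, HI
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mixture_cdf(mid) < u else (lo, mid)
    return 0.5 * (lo + hi)

assert abs(gaussian_quantile(0.975) - 1.95996) < 1e-4
assert abs(mixture_quantile(0.5)) < 1e-9   # the median is 0 by symmetry
```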
We thus try to avoid mathematical operations that require evaluation
of all of the constituents' priors and likelihoods. One operation
that avoids this is deterministic prior branching. This scheme has the
benefit of trivially determining the quantile of the mixture from the
component quantiles. The probability of branch choice can be tuned
using a parameter, which can be made part of \(\hat{\bm{\theta}}\)
similarly to \(\beta\) in PPR. This parametrisation provides the
mechanism needed for \cref{vconv-prop}.
Hence, we assert that a \textbf{\emph{superpositional mixture}}, defined via
the following parametrisation:
\begin{subequations}
\begin{equation}
\hat{\pi}(\bm{\theta}; \bm{\beta}) =
\begin{cases}
\hat{\pi}_{1}(\bm{\theta}) & \text{with probability } \beta_{1},\\
& \vdots\\
\hat{\pi}_{m}(\bm{\theta}) & \text{with probability } 1- \sum_{i=1}^{m-1}\beta_{i},
\end{cases}
\end{equation}
\begin{equation}
\hat{\cal L}(\bm{\theta}; \bm{\beta}) =
\begin{cases}
\hat{\cal L}_{1}(\bm{\theta}) & \text{with probability } \beta_{1},\\
&\vdots\\
\hat{\cal L}_{m}(\bm{\theta}) & \text{with probability } 1- \sum_{i=1}^{m-1}\beta_{i},
\end{cases}
\end{equation}
is isometric, if and only if
\begin{equation}
\label{eq:sspr}
\hat{\pi}(\bm{\theta}; \bm{\beta}) = \hat{\pi}_{i}(\bm{\theta}) \Leftrightarrow \hat{\cal L}(\bm{\theta}; \bm{\beta}) = \hat{\cal L}_{i}(\bm{\theta}),
\end{equation}
\end{subequations}
that is, the branches are chosen consistently.
\Cref{spec-prop} is satisfied if any of the priors \(\hat{\pi}_{i}\)
represented the posterior. \Cref{vconv-prop} is satisfied
similarly to PPR: the likelihood is determined by
\(\hat{\bm{\theta}} \supset \bm{\beta}\), so values of \(\bm{\beta}\) that lead to
higher likelihoods are favoured, ergo configurations representing
\({\cal P}\) are preferred.
Superpositional mixtures have multiple advantages when compared with
additive mixtures. Crucially, only one of \({\cal L}_{i}\) is evaluated
each time \(\hat{\cal L}\) is evaluated. As a result, ignoring the
overhead of branch choice, the worst-case time complexity is the same
if not better than the best case for additive mixtures, which has vast
implications discussed in \cref{sec:applications}.
The superpositional mixture's branch choice must be external to, and
independent from, the likelihoods and priors. For example, the prior
quantile of the mixture must branch into either of the component prior
quantiles. As a result, the end user does not need to perform any
calculations beyond the proposal quantiles themselves.
There can be many implementations of a superpositional mixture. A
natural first choice would be a quantum computer, where
\(\hat{\pi}\) and \(\hat{\cal L}\) are represented by \(m\)-level systems
entangled with each other (consistent branching) and with a classical
computer (to evaluate \({\cal L}\) and \(\pi\)). However, we can also
attain an implementation using only classical computational methods, via
a deterministic pseudo-random choice based on \(\bm{\theta}\).
The \textbf{\emph{stochastic superpositional (isometric) mixture}} of
consistent partitionings (SSIM) ensures branch consistency by requiring
\begin{equation}
\hat{\pi}(\bm{\theta}; \bm{\beta}) = \hat{\pi}_{F(\bm{\theta};
\bm{\beta})}(\bm{\theta};\bm{\beta}),
\end{equation}
where
\(F: (\bm{\theta}, \bm{\beta}) \mapsto i \in \{1, 2, \ldots, m\}\). In
our implementation it is a niche-apportionment random number generator
(sometimes called the broken stick model), seeded with the numerical
\texttt{hash} of the vector \(\bm{\theta}\), illustrated in
\cref{fig:mixture}.
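A minimal sketch of such a branch-choice function \(F\) (hash-seeded broken-stick selection; the code and names are our illustration of the idea, not the exact implementation used here):

```python
# Sketch of deterministic branch choice for SSIM: seeding a pseudo-random
# generator with hash(theta) makes the branch a pure function of theta, so
# the prior and the likelihood evaluated at the same theta always agree.
import random

def choose_branch(theta, betas):
    """Broken-stick selection of a branch index for parameters theta.

    betas holds the probabilities of branches 0..m-2; the final branch
    receives the remaining probability 1 - sum(betas).
    """
    rng = random.Random(hash(theta))   # deterministic in theta
    u = rng.random()
    acc = 0.0
    for i, beta in enumerate(betas):
        acc += beta
        if u < acc:
            return i
    return len(betas)                  # the remainder branch

theta = (0.3, 1.7, -2.2)
branch = choose_branch(theta, [0.25, 0.25])
assert branch == choose_branch(theta, [0.25, 0.25])  # consistent branching
assert branch in (0, 1, 2)
```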
Superpositional mixtures are superior in robustness and ease of
implementation. They do, nevertheless, come with one drawback. As a
result of branching, the likelihood \(\hat{\cal L}\) visible to the
sampler is no longer continuous (\cref{fig:mixture-3d}). Thus a
nested sampling implementation that relies on said continuity will
have undefined behaviour. \texttt{PolyChord}'s slice sampling appears
unaffected by the discontinuity, but other samplers may not be.
\begin{figure}
\input{./illustrations/mixture_2.tex}
\input{./illustrations/mixture_3.tex}
\input{./illustrations/mixture_4.tex}
\caption{An example of mixture repartitioning. The mixture is not
normalised to emphasise the coincidence of values with both the
uniform distribution and a Gaussian. $\beta$ controls the
probability of belonging to the Gaussian in the stochastic
mixture. Additionally, the resolution is deliberately reduced, to
contrast behaviour of all three at the truncation
boundary. \label{fig:mixture}}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.99\columnwidth]{./illustrations/SSIM_3d.pdf}
\caption{An illustration of SSIM in two dimensions. Colour represents the value of $\pi(\bm{\theta})$. As a result of nested sampling, nucleation of the representative phase is dynamically favoured.}
\label{fig:mixture-3d}
\end{figure}
\subsection{On notation and mental models}
\label{sec:org9e7115b}
It is an opportune time to discuss a subtlety that we have previously
neglected. \cite{chen-ferroz-hobson} originally named the technique
automatic posterior repartitioning, which evokes a clear mental
model. Assuming that the original definitions of \(\pi\) and
\(\mathcal{L}\) were a partitioning of only the posterior, a new value
of \(\beta\) produces a new partitioning, thus it re-partitions the
posterior. The extra parameter is a time-like object, with a clear
direction of evolution, in that any change to its value causes a
re-partitioning of the model.
While this mental model had served well for the purposes of solving
the unrepresentative prior problem, it is severely limiting to the
effect of introducing proposals.
The first shortcoming of this mental model is that the expression
``re-partitioning'' implies the mutability of the posterior. It is not
mutable. In fact, the posterior obtained via re-partitioning
has a strict functional dependence on the extra parameter, and is thus
strictly a different function. Meaningful information is lost when we project
the repartitioned result back to the original prior space, albeit only a
Bayesian would regard it as such.
A second, deeper problem is that the notation inherently places the
emphasis on the posterior. In reality, automatic posterior repartitioning is a
necessary, but insufficient, condition for consistent partitioning. As
long as no coordinate transformation is performed, the difference is
negligible. However, for more complicated cases, e.g.~re-sizeable
prior space schemes, the posterior repartitioning is
under-determined: a naive extension of the form
\begin{equation}
\label{eq:naive-extension}
\pi(\theta) \mathcal{L}(\theta) = \hat{\pi}(\theta) \mathcal{\hat{L}}(\theta)
\end{equation}
does not, and indeed cannot, produce the expected result. One can prove (by considering a reference
prior space from which all prior spaces of the same dimensionality
derive via coordinate transformation), that the correct expression is
actually one that preserves the evidence differential element.
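Schematically (our own formulation of this claim, in the notation of \cref{eq:partitioning}), the condition reads
\begin{equation}
\hat{\pi}(\hat{\bm{\theta}})\, \hat{\cal L}(\hat{\bm{\theta}})\, d\hat{\bm{\theta}} = \bar{\pi}(\bm{\theta})\, \bar{\cal L}(\bm{\theta})\, d\bm{\theta},
\end{equation}
so that under a coordinate transformation the product \(\hat{\pi}\hat{\cal L}\) must absorb the Jacobian \(\left|\partial \bm{\theta}/\partial \hat{\bm{\theta}}\right|\), rather than remain pointwise equal as in \cref{eq:naive-extension}.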
What we propose is a much more general world-view: a more accurate
and expressive model. A consistent partitioning involves specifying a
hyperspace that includes the original prior space. The partitioning
into \(\pi\) and \(\mathcal{L}\) is done once only, when the Bayesian
inference problem is set up. The original posterior is a function in
the original prior (sub)space. The posterior we obtain as a result
marginalises to the original in the appropriate projections, and the
evidence to which it corresponds is the same as the original.
One might object that this is not a good model for the superpositional
mixture, for which a dynamical analogy would be much more appropriate, as
the parameters really only control the partitioning. This point is
partially valid. We would advocate seeing superpositions as an
extension into a Hilbert space of vectors that are themselves
spaces: not easy to imagine, but not a challenge to someone fluent in
quantum theory. A better analogy would be to imagine the spaces for
each individual prior side by side, with a few parameters that
control the relative ``heights'' of these spaces, or the activation energy
for diffusion between them. This is a middle ground that retains the generality of
treating the entire problem in a hyperspace, but also admits a dynamical
analogy.
Arguments can be made either way, but an important consideration is to
have a model that gives accurate predictions first, and is easy to
imagine second.
\section{Methodology of Measurements}
\label{sec:org46183d2}