Introduction

The standard model of the universe and its evolution in modern cosmology is the \(Λ\)CDM model \citep{Condon2018}, so named after the main components of the universe: the cosmological constant \(Λ\) and cold dark matter. It has six major independent[fn::there can be other equivalent parameter sextuplets. ] parameters: the physical baryon density \(Ω_\mathrm{b}h^2\); the physical (cold) dark matter density \(Ω_\mathrm{c}h^2\); the angular parameter \(100θ_\mathrm{s}\); the re-ionisation optical depth \(τ_\text{reio}\); and the power spectrum slope \(n_\mathrm{s}\) and amplitude \(\ln (10^{10}A_\mathrm{s})\) \citep{Cosmology}.

The task of the present study is to develop better tools for evaluating the agreement of our observations from the Planck mission with \(Λ\)CDM, estimating the parameters in the process. In the language of Bayesian statistics [fn::See \cite{xkcd} for comparison to frequentist statistics.], our goal is efficient Bayesian inference.

While said inference can be executed analytically in principle, it is often intractable even when performed numerically. For context, the standard \(Λ\)CDM inference run requires an HPC cluster with at least 128 nodes, each with at least 6 GB of memory, and the equivalent of three full days of operation. To add insult to injury, the error margins on the parameters and the evidence, if computed at all, are staggering. Even then, such a result requires judicious tuning and careful consideration of the model, which at present cannot be automated. Equivalent inference for any model other than \(Λ\)CDM is thus out of reach of most cosmologists: at present, attempting to resolve the Hubble tension \citep{tension} seems impossible for most professional cosmologists. The aim of the present study is to correct these circumstances.

Multiple numerical algorithms exist to perform Bayesian inference: Metropolis-Hastings \citep{Metropolis} in conjunction with the Gibbs sampler \citep{Metropolis-Hastings-Gibbs}; Hybrid (Hamiltonian) Monte Carlo \citep{1701.02434,Duane_1987}; and nested sampling \citep{Skilling2006}. Each of these algorithms has different advantages: Metropolis-Hastings is one of the fastest at estimating the model parameters, at the cost of not evaluating the evidence, which is a universal metric of model fitness.

It is also well-known that most Bayesian inference methods, like Metropolis-Hastings, can benefit from extra information provided at inference time. We call this extra information proposals, as it usually encodes what we expect the posterior distribution of each parameter to be. It has become standard practice to provide the proposals along with cosmological inference packages, such as CosmoMC. However, to date there has been no such mechanism for nested sampling.

We shall explore a method of injecting said information into the inference process, based on an idea first pointed out by \cite{chen-ferroz-hobson}. Whilst automatic posterior repartitioning was originally used to resolve the issue of unrepresentative priors, the idea can be exploited to create a much more potent methodology for conducting nested sampling.

In the present paper we shall outline the method of stochastic superpositional mixing of consistent partitions and demonstrate its efficacy at utilising proposal information. We shall do so by providing a brief overview of Bayesian inference, highlighting the peculiarities of said inference method vital for our method to work. Afterwards we shall discuss the important evolution of our method, and the infrastructure one would need in order to support the widespread use of proposal distributions. Finally, an evolution of the nested sampling algorithm and the approaches one would take to Bayesian inference shall be explored.

Theoretical background

In this section we shall primarily focus on previous work in the area. We must outline the key elements of Bayesian inference via nested sampling in order to understand why using more informative priors directly is not a good solution, and why one needs to perform consistent re-partitioning.

Bayesian inference.

Bayesian inference, as the name suggests, is based around a remarkable result in the theory of probability, called Bayes’ theorem. It is mainly concerned with the relationship of conditional probabilities, which neatly map onto the concepts of known facts and assumptions, while retaining the ability to reason about objective truths.

We must first set up the terminology used. A model \({\cal M}\) of a physical process is parameterised by \(\bm{θ} = (θ_1, θ_2, \ldots , θ_n)\). New empirical observations of said process are encapsulated in the /dataset/ \(\mathfrak{D}\). The /likelihood/ ${\cal L}$ of the parameters $\bm{θ}$ is the probability of observing \(\mathfrak{D}\), conditional on the configuration $\bm{θ}$ and the model ${\cal M}$. The prior \(π(\bm{θ})\) is the probability of $\bm{θ}$ assuming ${\cal M}$. It can be obtained from previous datasets as well as from constraints inherent to the model. The posterior is the probability of $\bm{θ}$ conditional on ${\cal M}$ and the dataset ${\mathfrak D}$. The locus of all $\bm{θ}$ for which the prior is both defined and non-zero defines the \textbf{\emph{prior space}} \(Ψ\). Finally, the \textbf{\emph{evidence}} is the probability of the data \({\mathfrak D}\) assuming the model.

The interactions of the probabilities of cref:tab:defs are governed by citeauthor:1763 's theorem: \begin{equation}\label{eq:bayes} {\cal L}(\bm{θ}) × π (\bm{θ}) = {\cal Z} × {\cal P} (\bm{θ}). \end{equation}

Bayesian inference is the process of reconciling the model ${\cal M}$ represented in ${\cal L}$ and $π$, with observations \(\mathfrak{D}\) represented in ${\cal L}$. A numerical algorithm that obtains ${\cal Z}$ and ${\cal P}$ from $π$ and ${\cal L}$, is called a *sampling algorithm* or /sampler/.

The convenient representation of $π$ and ${\cal L}$ depends on the particulars of the sampler. For \textbf{\emph{nested sampling}} (e.g. \texttt{PolyChord}, \texttt{MultiNest}) we delineate them indirectly: with the logarithm of the likelihood probability-density function \(\ln {\cal L} (\bm{θ})\), and the \textbf{\emph{prior quantile}} \(C\{π\}(\bm{θ})\). The latter is a coordinate transformation $C: \bm{u} \mapsto \bm{θ}$ that maps a uniform distribution of $\bm{u}$ in a unit hypercube to $π(\bm{θ})$ in $Ψ$. It is often obtained by inverting the cumulative distribution function of the prior.
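As a minimal illustration (a sketch only; the function names and default bounds are ours, not those of any particular sampler), a separable prior quantile is obtained by inverting each one-dimensional cumulative distribution function:

#+begin_src python
# Illustrative prior quantiles C: u -> theta, mapping the unit hypercube to
# the prior.  Names and bounds are examples only.
import numpy as np
from scipy.stats import norm

def uniform_quantile(u, lo=-1.0, hi=1.0):
    """Map u ~ U[0,1]^n to a uniform prior on [lo, hi]^n."""
    return lo + (hi - lo) * np.asarray(u)

def gaussian_quantile(u, mu=0.0, sigma=1.0):
    """Map u ~ U[0,1]^n to an uncorrelated Gaussian prior N(mu, sigma^2)
    by inverting the cumulative distribution function."""
    return mu + sigma * norm.ppf(np.asarray(u))
#+end_src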

citeauthor:1763 's theorem holds on the intersection of the domains of all the probability density functions involved. Let \(D(f)\) denote the domain of the probability density function \(f\), i.e.~where \(f\) is both defined and \textbf{non-zero}. Hence \begin{equation} D \{ π \} ∩ D \{ {\cal L} \} = D \{ {\cal P} \} ⊂ Ψ, \end{equation} meaning that inference is possible only on the intersection of the domains of the prior and the likelihood.\label{domain-discussion}

For each choice of ${\cal L}$ and $π$, there is a unique choice of ${\cal Z}$ and ${\cal P}$; equivalently they represent the same unique model ${\cal M}$, or partition it consistently. That correspondence is \emph{surjective}, but not \emph{injective}: many choices of \({\cal L}(\bm{θ})\) and \(π (\bm{θ})\) may correspond to the same \( {\cal P} (\bm{θ})\) and \({\cal Z}\) citep:chen-ferroz-hobson. This remark is the cornerstone of the automatic posterior repartitioning, which we shall exploit to obtain a speedup.

Nested Sampling

There are very few methods of performing full Bayesian inference. The chief reason is that one rarely needs more than a very crude approximation of the full posterior and evidence.

More often than not, one reduces the evidence to either 1 (the model fits the data) or 0, hence removing the need for a complex routine to evaluate \({\cal Z}\). Similarly, the model parameters’ posterior distribution is often reduced to an uncorrelated symmetric Normal distribution. Such crude approximations allow one to perform inference exceptionally quickly. Only when resolving differences in evidence of a few percent may be a deciding factor is it worth investing in evaluating \({\cal Z}\) to that precision. There are very few reasons to concern oneself with evaluating \({\cal Z}\) at all, hence very few invest the resources into doing it.

Nested sampling is a full inference algorithm, in that it evaluates both the evidence and the posterior. It is rarely used outside of circumstances where model comparison is a necessity, because of the additional overhead associated with evaluating the evidence. While this is not the only advantage that nested sampling has over other MCMC methods, it is the chief qualitative difference that can justify the time investment. To understand why evaluating the evidence is so valuable, consider how one might do it.

${\cal P}$ is a probability, therefore normalised, which combined with cref:eq:bayes yields

\begin{equation} \label{eq:def-z} {\cal Z} = ∫_Ψ {\cal L}(\bm{θ}) π(\bm{θ}) \, d\bm{θ}. \end{equation} Thus, citeauthor:1763 's theorem reduces parameter estimation (obtaining ${\cal P}$ from $π$ and ${\cal L}$) to integration~citep:bayes-integration. The naïve approach of obtaining \({\cal Z}\) by uniformly rasterising \(Ψ\) is intractable for hypotheses with \(O(30)\) parameters citep:Caflisch_1998, which is a problem that nested sampling resolves.
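For a sense of scale (a standard back-of-the-envelope estimate rather than a figure from the cited works): rasterising \(Ψ\) with \(k\) grid points per dimension costs \(k^n\) likelihood evaluations, so even a coarse grid of \(k = 100\) points over \(n = 30\) parameters requires \(100^{30} = 10^{60}\) evaluations of \({\cal L}\), far beyond any conceivable computing budget.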

Having motivated the utility of nested sampling, we should provide an outline of its execution. The following is a short description of nested sampling citep:Skilling2006. We begin by picking \(n_\text{live}\) \textbf{\emph{live points}} at random in $Ψ$. During each subsequent iteration, the point with the lowest likelihood is declared \emph{\textbf{dead}}, and another live point $\bm{θ}∈\Psi$ with a higher likelihood is drawn, based on the prior $π$ and an implementation-dependent principle. Live points are thus gradually moved into regions of high likelihood. By tracking their locations and likelihoods, a statistical argument lets us approximate ${\cal Z}$ and its error at each iteration, and, by cref:eq:bayes, ${\cal P}(\bm{θ})$. We continue until only a pre-determined fraction of the evidence associated with $Ψ$ remains unaccounted for.
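The bookkeeping above can be made explicit in a few dozen lines. The following toy sketch (ours, not \texttt{PolyChord} or \texttt{MultiNest}; the dead point is replaced by naive rejection sampling, whereas production samplers use far more sophisticated schemes) tracks the prior-volume shrinkage and the running evidence estimate:

#+begin_src python
# A deliberately naive nested-sampling sketch, following the outline above.
import numpy as np

def nested_sampling(log_l, prior_quantile, ndims, n_live=100, tol=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.array([prior_quantile(rng.uniform(size=ndims)) for _ in range(n_live)])
    logl = np.array([log_l(t) for t in theta])

    log_z = -np.inf          # accumulated log-evidence
    log_x = 0.0              # log of the remaining prior volume
    while True:
        i = int(np.argmin(logl))                      # lowest-likelihood point dies
        # weight of the dead point: Delta X = X * (1 - e^{-1/n_live})
        log_w = log_x + np.log1p(-np.exp(-1.0 / n_live))
        log_z = np.logaddexp(log_z, log_w + logl[i])
        # replace it with a new prior sample of higher likelihood
        # (rejection sampling: fine only for toy problems)
        while True:
            t = prior_quantile(rng.uniform(size=ndims))
            ll = log_l(t)
            if ll > logl[i]:
                break
        theta[i], logl[i] = t, ll
        log_x -= 1.0 / n_live                         # expected shrinkage per iteration
        # stop once the live points can no longer change log_z appreciably
        if log_x + logl.max() < log_z + np.log(tol):
            break
    # contribution of the remaining live points, each with weight X / n_live
    live = log_x - np.log(n_live) + logl.max() + np.log(np.sum(np.exp(logl - logl.max())))
    return np.logaddexp(log_z, live)
#+end_src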

Recall that not all parameter inference methods require obtaining ${\cal Z}$. Some methods, such as Hamiltonian Monte Carlo citep:1701.02434, allow obtaining a normalised ${\cal P}$ directly. For such approaches, any consistent specification of $π$ and ${\cal L}$ will lead to identically the same posterior, barring numerical errors. This is also true of methods that evaluate ${\cal Z}$ exactly. However, nested sampling allows uncertainty in ${\cal Z}$, which is controlled by $π$ and ${\cal L}$. Thus nested sampling, unlike, e.g.~Metropolis-Hastings citep:Metropolis-Hastings-Gibbs, is sensitive to the concrete definitions of the prior and likelihood. While many choices of $π$ and ${\cal L}$ correspond to the same ${\cal P}$ and ${\cal Z}$, the errors and nested sampling’s time complexity depend on the specification of $π$ citep:Skilling2006. Specifically, more informative priors are preferable.

In the following section we shall discuss how information content is measured.

Metrics and informativity

An important quantity for measuring the correctness of the obtained posterior is the \textbf{\emph{Kullback-Leibler divergence}} ${\cal D}$ citep:Kullback_1951. For probability distributions \(f(\bm{θ})\) and \(g(\bm{θ})\), it is defined as: \begin{equation} \label{eq:kl-def} {\cal D}\{f, g \} = ∫_Ψ f(\bm{θ}) \ln \frac{f(\bm{θ})}{g(\bm{θ})} d \bm{θ}. \end{equation} It is a pre-metric on the space of probability distributions: it is nil if and only if $f(\bm{θ}) = g(\bm{θ})$ (albeit not symmetric), which is convenient for defining a representation hierarchy. The statement “$f$ represents $g$ better than $h$” is equivalent to \begin{equation} \label{eq:hierarchy} {\cal D}\{f, g\} < {\cal D}\{h, g\}. \end{equation} Specifically, a distribution $h$ is said to be unrepresentative of $g$ if a uniform distribution $f$ represents $g$ better than $h$.
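As an illustration (a grid-based sketch; in practice \({\cal D}\) would be estimated from weighted posterior samples), the divergence and its asymmetry can be computed numerically:

#+begin_src python
# Numerical Kullback-Leibler divergence D{f, g} on a one-dimensional grid.
import numpy as np
from scipy.stats import norm

theta, dx = np.linspace(-15, 15, 10_001, retstep=True)
f = norm.pdf(theta, 0.0, 1.0)      # e.g. a narrow posterior
g = norm.pdf(theta, 0.0, 3.0)      # e.g. a broad prior

def kl(f, g, dx):
    mask = (f > 0) & (g > 0)       # restrict to the common domain
    return np.sum(f[mask] * np.log(f[mask] / g[mask])) * dx

print(kl(f, g, dx), kl(g, f, dx))  # asymmetric: D{f, g} != D{g, f}
#+end_src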

A probability density function \(f(\bm{θ})\) is said to be more \emph{\textbf{informative}} than \(g(\bm{θ})\) if: \begin{equation} \label{eq:informative} {\cal D}\{ f, g \} > {\cal D}\{ g, f\}. \end{equation} This also highlights that the Kullback-Leibler divergence is not a metric on the space of distributions. However, its asymmetry lends itself well to considerations where such an asymmetry is natural: e.g.~priors are not equivalent to posteriors, one comes after the other, and so ${\cal D}$ can be used to quantify the “surprise” information obtained during inference.

The time complexity $T$ of nested sampling is \begin{equation}\label{eq:complexity} T \propto n_\text{live}\ \langle {\cal T}\{{\cal L}(\bm{θ})\} \rangle \ \langle {\cal N}\{{\cal L}(\bm{θ}) \} \rangle, \end{equation} where ${\cal T}\{f(\bm{θ})\}$ is the time complexity of evaluating $f(\bm{θ})$ and ${\cal N}\{f(\bm{θ})\}$ — the quantity of such evaluations. Reducing $n_\text{live}$ reduces the resolution of nested sampling, while \({\cal T}\{{\cal L}(\bm{θ})\}\) is model-dependent. We can, however, reduce the number of likelihood evaluations, by providing a more informative prior. However, there is an associated risk, which we shall address later.

Choosing the correct representations of ${\cal P}$ and $π$ is crucial for nested sampling’s correctness and performance. For example, assuming the same likelihood, if $π_0$ and $π_1$ are equally informative, but \(π_0\) is more representative of ${\cal P}$, then the inference with \(π_0\) will terminate more quickly than with $π_1$ (and would also be more accurate).

Similarly, if $π_1$ is more informative than $π_2$, but equally as representative, nested sampling will terminate with $π_1$ faster than with $π_2$, and the result will be more precise. In detail, if \(π_1 (\bm{θ})\) is more similar to \( {\cal P} (\bm{θ})\), then points drawn with PDF \(π_1 (\bm{θ})\) are more likely to lie in $\bm{θ}$ regions of high \( {\cal P} (\bm{θ})\), leading to fewer iterations.

Posteriors ${\cal P}_1$ and ${\cal P}_2$ obtained with the priors $π_1$ and $π_2$ are different, because of cref:eq:bayes. In fact, the posterior ${\cal P}_1$ will be more informative than ${\cal P}_2$, and more similar to $π_1$. This effect we call \textbf{\emph{prior imprinting}}.

Imprinting is desirable if the informative prior $π1$ is the result of inference over another dataset. Nonetheless, imprinting limits the information obtainable from \(\mathfrak{D}\). There is a considerable risk of getting no usable data from the inference, which makes one prefer uniform priors even when more information is available.

The problem is exacerbated in the case of proposals. The issue is that the algorithm has no room to consult the proposal distributions outside of the prior. Using a prior taken out of “thin air” with nested sampling is a recipe for disaster. However, in the next section we shall discuss how one can mitigate these issues, and use a proposal as an aspect of a prior.

Power posterior repartitioning and unrepresentative priors

NB: From this section onward we shall adopt the following notation. $π$ and \({\cal L}\) with similar decorations (index, diacritics) belong to the same specification of the model. Models using the uniform prior are special, in that they obtain the most accurate posterior and evidence; thus they are represented with an over-bar (the plot of a uniform prior in 1D is a horizontal line). Hats delineate the consistent partitions that incorporate the proposal (the hat represents the peak(s) often present in informative proposals).

We are working under the assumption that \(π(\bm{θ})\) is an informative, unrepresentative prior. We want to obtain the correct posterior \(\bar{\cal P}\) but without using a uniform, universally representative reference prior \(\bar{π}\), because it is often the least informative. To avoid loss of precision and mitigate prior imprinting, cite:chen-ferroz-hobson have proposed introducing the parameter \(β\) to control the breadth of the informative prior: \begin{equation} \label{eq:autopr-prior} \hat{π}(\bm{θ};β) = \cfrac{π(\bm{θ})^β}{Z(β)\{π\}}, \end{equation} (see cref:fig:ppr) where \(Z(β)\{π\}\) (a functional of \(π (\bm{θ})\)) is a normalisation factor for \(\hat{π}(\bm{θ};β)\), i.e. \begin{equation} Z(β)\{π\} = ∫_Ψ π(\bm{θ})^β \, d\bm{θ}. \end{equation} In their prescription, the likelihood changes to \begin{equation} \hat{\cal L}(\bm{θ}; β) = {\cal L}(\bm{θ})\, Z(β)\{π\} ⋅ π^{1-β}(\bm{θ}). \end{equation} The new parameter $β$ is treated as any other non-derived parameter of the original theory.

Note that \({\cal L}(\bm{θ})π (\bm{θ}) = \hat{\cal L}(\bm{θ};β) \hat{π} (\bm{θ};β)\) by construction. Thus, from cref:eq:bayes, the posterior and evidence corresponding to \(\hat{\cal L}(\bm{θ};β)\) and \(\hat{π} (\bm{θ};β)\) will be the same as \( {\cal P} (\bm{θ})\) and \({\cal Z}\), which correspond to the original $π(\bm{θ})$ and ${\cal L}(\bm{θ})$.
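Explicitly, substituting cref:eq:autopr-prior into \(\hat{\cal L}(\bm{θ};β)\) gives \begin{equation} \hat{\cal L}(\bm{θ};β)\,\hat{π}(\bm{θ};β) = \left[{\cal L}(\bm{θ})\, Z(β)\{π\}\, π^{1-β}(\bm{θ})\right] \frac{π^{β}(\bm{θ})}{Z(β)\{π\}} = {\cal L}(\bm{θ})\, π(\bm{θ}) \end{equation} for every value of \(β\).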

If the informative prior \(π (\bm{θ})\) is less representative of the posterior \( \bar{\cal P} (\bm{θ})\), the error in \(\hat{\cal Z}\) is larger. Hence, while we do not violate cref:eq:bayes directly, $\bar{\cal Z}$ can differ more from \({\cal Z}\) while remaining within the margin of error, and similarly ${\cal P}(\bm{θ}) ≠ \bar{\cal P}(\bm{θ})$. This is where the new parameter comes into play. $\hat{π}$ may become representative for some value of $β = β_R$. Values of $β$ close to $β_R$ correlate with higher likelihoods, thus the sampler prefers them. Hence, the system will converge to a state where \( {\cal P} (\bm{θ})\) is represented in \(\hat{π} (\bm{θ};β)\)[fn::Technically we obtain \( \hat{\cal P} (\bm{θ};β)\) which, when marginalised over $β$, yields \( {\cal P} (\bm{θ}) = ∫ \hat{\cal P} (\bm{θ};β) d β\) — the correct posterior.]. As a consequence, we reduce the errors and obtain the same result as we would have with a less informative but more representative prior.

cite:chen-ferroz-hobson dubbed this scheme \textbf{\emph{automatic power posterior repartitioning}} (PPR) because the choice of $β \rightarrow β_R$ is automatic. It mitigates the loss of precision, and thus accuracy, for unrepresentative informative priors $π$, by sacrificing performance.

Discoveries

The trouble with proposals

Nested sampling is different from Metropolis-Hastings-Gibbs and many other Markov-Chain Monte Carlo methods. Often, such algorithms are designed with a separate input that is the proposal: an initial guess that guides the algorithm towards the right answer. For nested sampling no such provisions are in place. The only input where such information can be used is the prior. Thus, to understand why one can’t use proposals directly, we must first address why informative priors are avoided.

From cref:eq:bayes, we can see that changing only the prior $π$ necessarily leads to changes in both ${\cal P}$ and ${\cal Z}$. For example, if $π$ is a Gaussian centered at \(\bm{θ}=\bm{μ}_π\) and \({\cal L}\) is a Dirac \(δ\)-function peaked at $\bm{θ}=\bm{μ}_{\cal L}$, with $\bm{μ}_π$ sufficiently far from $\bm{μ}_{\cal L}$, then the posterior will necessarily have peaks at both $\bm{μ}_π$ and $\bm{μ}_{\cal L}$. This is an example of prior imprinting and is a necessary part of a Bayesian view of statistics. For a Bayesian, the prior information is no less valuable than the information inferred from the dataset \(\mathfrak{D}\), and the posterior represents \emph{all} of our best knowledge.

The problem, however, is the \emph{prejudiced sampler}. Because nested sampling chooses live points with probability proportional to the prior, the probability of a point being drawn from the likelihood peak can be made arbitrarily small. In fact, if $\bm{μ}_{\cal L}$ and $\bm{μ}_π$ are separated by more than five standard deviations of the prior Gaussian, thirty million samples will be drawn from $\bm{μ}_π$ before a single point is drawn on the circle containing $\bm{θ} = \bm{μ}_{{\cal L}}$.

An apt analogy can be drawn with the Venera-14 mission citep:siddiqi2018beyond. Upon landing, due to a number of unfortunate coincidences, the lander took its one and only measurement of Venusian soil from one of its own lens caps. As a result, we have obtained objectively correct information from Venus: a sample of an object on its surface. However, the efficiency of said measurement of the compressibility of Earth rubber leaves much to be desired.

Before cite:chen-ferroz-hobson, the best solution was to use a uniform prior that included both $\bm{μ}_π$ and $\bm{μ}_{\cal L}$. The computational cost of inference is so high that the risk of gaining nothing from a dataset is untenable. Thus discarding all prior information in hopes of inferring some from the dataset is preferable to using the information in $π$.

Thus, proposals are not even considered for use with nested sampling. Since proposals may be crude approximations, we may end up with far worse than no new information. Any potential benefit in performance or precision is far outweighed by an unreliable posterior. We do, however, have one method of mitigating these problems — automatic posterior repartitioning citep:chen-ferroz-hobson. Though the connection may seem unclear at this stage, schematically, automatic posterior repartitioning allows one to represent infinitely many pairs of \(π\) and \({\cal L}\) which all produce the same result: evidence and posterior. If one can encode the proposal as a prior that obtains the same evidence and posterior as the prior one started with, one could, in theory, obtain all of the benefits of a more informative prior, while also obtaining information that pertains to the model in question rather than repeating the information provided as a guess.

How intuitive proposals accelerate convergence

Consider the following premise: we’re given a model \({\cal M}\), for which our prior $π$ is not the uniform \(\bar{π}(\bm{θ})\). So, usually from other sources, e.g.~other inferences, physical reasoning, etc., we know that \begin{equation} π (\bm{θ}) = f(\bm{θ}; \bm{μ}, \bm{Σ}), \label{eq:bias} \end{equation} which is representative of the posterior \(\bar{\cal P}(\bm{θ})\). Here, the probability density function $f$ is parameterised by \(\bm{μ}\) in its location and \(\bm{Σ}\) in its breadth. In order to obtain the same result as one would have with the less informative uniform prior \(\bar{π}(\bm{θ})\), one needs to correct the likelihood ${\cal L}$. Recall that the reason why PPR obtains the same posterior \(\bar{\cal P}(\bm{θ})= \hat{\cal P}(\bm{θ})\) as one would have using

\(\bar{π} (\bm{θ}) = Const. \)

is because \(\hat{\cal L} (\bm{θ};β)\) and \( \hat{π} (\bm{θ};β)\) are a \textbf{\emph{consistent (re)partitioning}} of \( \bar{\cal Z}\) and \({\cal P}(\bm{θ})\). That is: \begin{equation} \label{eq:partitioning} ∫_Ψ \hat{\cal L} (\hat{\bm{θ}}) \hat{π} (\hat{\bm{θ}}) d\hat{\bm{θ}} = ∫_Ψ \bar{π} (\bm{θ}) \bar{\cal L} (\bm{θ}) d\bm{θ} = \bar{\cal Z}, \end{equation} where in the case of PPR $\hat{\bm{θ}} = (θ_1, θ_2, \ldots, θ_n, β)$. Cref:eq:partitioning holds if \begin{equation} \label{eq:partitioning-p} \hat{\cal L}(\bm{θ};β) \hat{π}(\bm{θ};β) = \bar{\cal L}(\bm{θ})\bar{π}(\bm{θ}) \end{equation}

for all $β$, by cref:eq:bayes. Note that cite:chen-ferroz-hobson have used cref:eq:partitioning-p as the primary expression. Following their convention, we shall sometimes refer to consistent partitions as posterior repartitioning, rather than evidence repartitioning.

By using a more informative prior in this way, we accelerate convergence, because each iteration obtains a larger evidence estimate, so fewer iterations are needed to reach the termination point (see~cref:fig:benchmark). There is a competing mechanism: the evidence estimates accumulate fewer errors, so inference proceeds for longer before the precision loss triggers termination (cref:fig:higson). Thus repartitioning reaches a more precise result more quickly. Better still, the obtained precision can be sacrificed to further accelerate inference.

Example: Intuitive proposal posterior repartitioning

Suppose that one has obtained the posterior \({\cal P}(\bm{θ})\) from a different inference, which could be nested sampling with a uniform prior, or Hamiltonian Monte Carlo (or a theoretical approximation). Thus, \begin{subequations} \begin{equation} \label{eq:iPPR} \hat{π}(\bm{θ}) = f(\bm{θ}, \bm{μ}, \bm{Σ}) = {\cal P}(\bm{θ}), \end{equation} is an informative prior that represents our knowledge, but might not represent the posterior. We call it an \textbf{\emph{(intuitive) proposal}}. However, we wish to avoid prejudicing the sampler and use the (uniform) reference prior $\bar{π}(\bm{θ})$, with reference likelihood $\bar{\cal L}(\bm{θ})$.

To obtain with $\hat{π}(\bm{θ})$ the same posterior and evidence as one would have with $\bar{π}(\bm{θ})$ and $\bar{\cal L}(\bm{θ})$, the partitioning of the evidence needs to be \textbf{\emph{consistent}} with the reference. Specifically: \begin{equation} \label{eq:ippr-l} \hat{\cal L}(\bm{θ}) = \frac{\bar{π}(\bm{θ}) \bar{\cal L}(\bm{θ})}{ f(\bm{θ}; \bm{μ}, \bm{Σ})}. \end{equation} \end{subequations} We call this scheme \textbf{\emph{intuitive proposal posterior [fn::More accurately evidence repartitioning, which is equivalent in simple cases.] repartitioning}} (iPPR). It is the fastest possible and the least robust consistent partitioning scheme. While we have technically addressed the change in ${\cal P}$ due to a different prior, we have not addressed the problem of $\hat{π}$ being (potentially) unrepresentative of $\bar{\cal P}$. In the example already considered in cref:sec:prejudice, we will have reduced prior imprinting, but not addressed the prejudice at all: the probability of sampling from the true likelihood peak is still minuscule. By contrast, we have seen that automatic power posterior repartitioning can mitigate both issues. What iPPR lacks is a mechanism for extending its representation. Rather than attempt a modification akin to power partitioning, in cref:sec:isomixtures we shall provide this mechanism as completely external to iPPR and unleash its potential.
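In code, the iPPR prescription amounts to a one-line correction of the log-likelihood (a sketch only, working in log space; the helper names are ours, and the Gaussian quantile shown assumes a diagonal \(\bm{Σ}\)):

#+begin_src python
# Intuitive proposal posterior repartitioning (iPPR), cf. cref:eq:ippr-l.
import numpy as np
from scipy.stats import multivariate_normal, norm

def make_ippr(log_ref_l, log_ref_prior, mu, sigma):
    """Return (proposal quantile, corrected log-likelihood) for a Gaussian
    proposal f(theta; mu, sigma).  The quantile assumes a diagonal sigma."""
    proposal = multivariate_normal(mean=mu, cov=sigma)

    def quantile(u):
        return mu + np.sqrt(np.diag(sigma)) * norm.ppf(u)

    def log_l_hat(theta):
        # log L-hat = log(pi-bar) + log(L-bar) - log f(theta; mu, sigma)
        return log_ref_prior(theta) + log_ref_l(theta) - proposal.logpdf(theta)

    return quantile, log_l_hat
#+end_src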

General automatic posterior repartitioning

In this section, we look at the family of prescriptions similar to PPR and iPPR, called consistent partitioning. We note which schemes are more useful for the task of accelerating nested sampling without biasing the posterior. We begin by noting that cref:eq:partitioning alone does not guarantee the correct posterior and evidence.

We shall consider a general consistent partitioning \(\hat{π}, \hat{\cal L}\) with re-parametrisation \(\hat{\bm{θ}}\). Because $\bm{θ} ≠ \hat{\bm{θ}}$ in general, the posterior \(\hat{\cal P}(\hat{\bm{θ}})\) would not have the same functional form as \(\bar{\cal P}(\bm{θ})\). Nonetheless, if inverting the parametrisation from $\hat{\bm{θ}}$ to $\bm{θ}$ is possible, and under that procedure $\hat{\cal P}$ maps to ${\cal P}$, we shall say that $\hat{\cal P}$ is marginalised to ${\cal P}$. Thus, the correct posterior is one that marginalises to $\bar{\cal P}$. We shall often use $\hat{\cal P}(\hat{\bm{θ}})$ interchangeably with the ${\cal P}(\bm{θ})$ that it marginalises to.

We can rigorously prove[fn::in a later publication] that the following conditions are necessary for a consistent partitioning to yield the correct posterior and evidence through Bayesian inference.

Note that these properties are sensitive to the sampling algorithm. For example, for inference by uniform-rasterised integration of ${\cal Z}$, all properties follow from cref:eq:partitioning-p. Not so for a class of algorithms that estimate ${\cal Z}$ by controlled error propagation and approximation, e.g.~nested sampling. Thus, understanding the circumstances wherein these conditions are violated, may clarify the conditions for which both PPR and iPPR fail to produce the expected result.

Firstly, they satisfy cref:norm-prop by construction. iPPR satisfies cref:spec-prop if and only if \( \hat{π} (\bm{θ})\) represented the correct posterior to begin with, in which case $Ψ_R = Ψ$. Cref:vconv-prop follows from the correctness proof of nested sampling citep:Skilling2006, and cref:spec-prop. In cref:sec:autopr we have shown that PPR satisfies cref:spec-prop, where $Ψ_R = \{ β = β_R = \text{Const.}\}$, if $β_R$ exists. There is always at least one: \(Ψ_R = \text{Locus} \{ β_R=0 \} ∩ Ψ\), but we are interested in values of \(β_R > 0 \), as such priors are more informative. In that section we have provided an intuitive explanation for why PPR has cref:vconv-prop.

However, consistency alone does not guarantee the correct posterior; indeed, in cref:fig:convergence we see that both the $θ_0$ and $θ_2$ marginalised posteriors are offset from the correct result obtained using \( \bar{π}(\bm{θ})=\text{Const.}\). This is an illustration of the importance of cref:obj-prop, as the test case of cref:fig:convergence was constructed to violate it specifically.

Isometric mixtures of repartitioning schemes

In this section we consider two methods of combining several proposals (consistent partitions) into one (consistent partition). Identifying the posterior to which points in $Ψ$ correspond by cref:eq:bayes as a metric, we name these \emph{\textbf{isometric}} mixtures.

Additive isometric mixtures

Consider \(m\) consistent repartitioning schemes of the same posterior \(\bar{\cal P}(\bm{θ})\): \begin{equation} \label{eq:collection-of-models} \bar{\cal L}(\bm{θ}) \bar{π}(\bm{θ})= \hat{\cal L}_1(\bm{θ}) \hat{π}_1(\bm{θ}) = \ldots =\hat{\cal L}_m(\bm{θ}) \hat{π}_m(\bm{θ}). \end{equation} Their \textbf{\emph{isometric mixture}} is a consistent partitioning that involves information from each constituent prior, but preserves the posterior and evidence of its component partitions.

For example: an \textbf{\emph{additive mixture}} cref:fig:additive, defined as

parameterised by $\bm{β} = (β_1, β_2, \ldots, β_m)$ where each $β_i ∈ [0,1]$. It is itself a consistent partitioning, i.e.~\emph{\textbf{isometric}}, if and only if $∑_i β_i = 1$.

Isometric mixtures are an attempt to relax some of the limitations imposed by power posterior repartitioning. Firstly, all proposals in PPR have to be linked by a power relation. This class always includes a uniform prior, but not, for example, a “wedding cake” prior (stepped uniform prior). Additive mixtures permit such proposals. Moreover, in additive isometric mixtures, any consistent partitions are compatible provided the set union of their domains matches $Ψ$.

However, additive mixtures have limited utility: they are slow, difficult to implement and susceptible to numerical instability more than any other consistent partitioning[fn::These claims shall be substantiated in a more detailed publication.]. We can, however do much better.

Stochastic superpositional isometric mixtures

One major problem with additive mixtures lies in the definition of $\hat{\cal L}$. Instead of having to evaluate only one of the constituent likelihoods, we are forced to evaluate all of them. Hence, a lower bound on the time complexity: \begin{equation} {\cal T}\{\hat{\cal L}\} = o \left( \max_i {\cal T}\{ {\cal L}_i\} \right), \label{eq:hard-cap} \end{equation} which is the average case when the likelihoods ${\cal L}_i$ are all related to the same reference (e.g.~$\bar{\cal L}$) with only minor corrections computed asynchronously to account for different proposals. If ${\cal L}_i$ and ${\cal L}_j$ have no common computations to re-use, the average case time complexity is \(o\left[{\cal T}({\cal L}_i) + {\cal T}({\cal L}_j)\right]\).

Another issue is that the overall likelihood depends on the prior PDFs of the constituents. This is problematic since nested sampling requires specification of the prior via its quantile citep:Skilling2006,polychord,multinest. Function inversion is not linear with respect to addition, so the quantile of the weighted sum needs to be evaluated for each type of mixture individually. For a linear combination of uniform priors, evaluating the quantile can be performed analytically, but not in case of two Gaussians or a Gaussian mixed with a uniform. By contrast, the quantile of PPR with an uncorrelated[fn::not so for a correlated Gaussian. Nonetheless, every correlated covariance matrix can be diagonalised, and included in the re-parametrisation. ] Gaussian proposal is found in closed form.
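For instance, raising a Gaussian \(f(\bm{θ}; \bm{μ}, \bm{Σ})\) to the power \(β\) and re-normalising yields another Gaussian with covariance \(\bm{Σ}/β\), so the repartitioned prior quantile is available in closed form (a sketch for the uncorrelated case; the function name is illustrative):

#+begin_src python
# Closed-form quantile of the power-repartitioned Gaussian prior
# pi^beta / Z(beta){pi} = N(mu, sigma^2 / beta), for diagonal covariance.
import numpy as np
from scipy.stats import norm

def ppr_gaussian_quantile(u, mu, sigma, beta):
    """u in (0,1)^n; mu, sigma are per-dimension mean and std; beta > 0."""
    return mu + (sigma / np.sqrt(beta)) * norm.ppf(u)
#+end_src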

We thus try to avoid mathematical operations that require evaluation of all of the constituents’ priors and likelihoods. An operation that avoids this is deterministic prior branching. This scheme has the benefit of trivially determining the quantile of the mixture from the component quantiles. The probability of branch choice can be tuned using a parameter, which can be made part of $\hat{\bm{θ}}$ similarly to $β$ in PPR. This parametrisation provides the mechanism needed for cref:vconv-prop.

Hence, we propose the \textbf{\emph{superpositional mixture}}, defined via the following parametrisation:

that is, the branches are chosen consistently.

The~cref:spec-prop is satisfied if any of the constituent priors $\hat{π}_i$ represented the posterior. The~cref:vconv-prop is satisfied similarly to PPR: the likelihood is determined by \(\hat{\bm{θ}} ⊃ \bm{β}\), so values of \(\bm{β}\) that lead to higher likelihoods are favoured, ergo configurations representing ${\cal P}$ are preferred.

Superpositional mixtures have multiple advantages when compared with additive mixtures. Crucially, only one of \({\cal L}i\) is evaluated each time $\hat{\cal L}$ is evaluated. As a result, ignoring the overhead of branch choice, the worst-case time complexity is the same if not better than the best case for additive mixtures, which has vast implications discussed in cref:sec:applications.

The superpositional mixture’s branch choice must be external to and independent from the likelihoods and priors. For example, the prior quantile of the mixture must branch into either of the component prior quantiles. As a result, the end user doesn’t need to perform any calculations beyond the proposal quantiles themselves.

There can be many implementations of a superpositional mixture. A natural first choice would be a quantum computer, where $\hat{π}$ and $\hat{\cal L}$ are represented by \(m\)-level systems entangled with each other (consistent branching) and a classical computer (to evaluate ${\cal L}$ and $π$). However, we can also attain an implementation using only classical computation, via a stochastic choice made deterministic by seeding it on $\bm{θ}$.

The \textbf{\emph{stochastic superpositional (isometric) mixture}} of consistent partitionings (SSIM) ensures branch consistency by requiring \begin{equation} \hat{π}(\bm{θ}; \bm{β}) = \hat{π}_{F(\bm{θ}, \bm{β})}(\bm{θ};\bm{β}), \end{equation} where $F: (\bm{θ}, \bm{β}) \mapsto i ∈ \{1, 2, \ldots, m-1\}$. In our implementation it is a niche-apportionment random number generator (sometimes called the broken stick model), seeded with the numerical \texttt{hash} of the vector $\bm{θ}$, illustrated in cref:fig:mixture.
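A sketch of this branch choice is given below (the niche-apportionment generator is reduced here to a single categorical draw over the stick lengths \(\bm{β}\), seeded by a deterministic checksum of \(\bm{θ}\); the names are illustrative and not the \texttt{supernest} interface):

#+begin_src python
# Stochastic-yet-deterministic branch choice F(theta, beta) for SSIM: a
# pseudo-random draw seeded with a checksum of theta, so the prior density
# and the likelihood always agree on which component a given theta belongs to.
import zlib
import numpy as np

def branch(theta, beta):
    beta = np.asarray(beta, dtype=float)
    beta = beta / beta.sum()                                   # stick lengths
    seed = zlib.crc32(np.asarray(theta, dtype=float).tobytes())
    return int(np.random.default_rng(seed).choice(len(beta), p=beta))

def mixture_log_l(theta, beta, log_ls):
    """Only the chosen component's likelihood is ever evaluated."""
    return log_ls[branch(theta, beta)](theta)

def mixture_log_prior(theta, beta, log_priors):
    """The prior density branches consistently with the likelihood."""
    return log_priors[branch(theta, beta)](theta)
#+end_src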

Superpositional mixtures are superior in robustness and ease of implementation. They do, nevertheless, come with one drawback. As a result of branching, the likelihood $\hat{\cal L}$ visible to the sampler, is no longer continuous (cref:fig:mixture-3d). Thus a nested sampling implementation that relied on said continuity will have undefined behaviour. \texttt{PolyChord}’s slice sampling seems not affected by the discontinuity, but there may be other samplers that are.

On notation and mental models

It is an opportune time to discuss a subtlety that we have previously neglected. cite:chen-ferroz-hobson originally named the technique automatic posterior repartitioning, which evokes a clear mental model. Assuming that the original definitions of \( π \) and \(\mathcal{L}\) were a partitioning of only the posterior, a new value of \(β\) produces a new partitioning, thus it re-partitions the posterior. The extra parameter is a time-like object, with a clear direction of evolution, in that any change to its value causes a re-partitioning of the model.

While this mental model had served well for the purposes of solving the unrepresentative prior problem, it is severely limiting to the effect of introducing proposals.

The first inadequacy of the mental model is that the expression “re-partitioning” implies the mutability of the posterior. It is not mutable. In fact, the posterior that we obtain via re-partitioning has a strict functional dependence on the extra parameter, and is strictly a different function. Meaningful information is lost when we project the repartitioned result onto the original prior space, albeit only a Bayesian would regard it as such.

A second, deeper problem is that the notation inherently places the emphasis on the posterior. In reality, automatic posterior repartitioning is a necessary, but insufficient, condition for consistent partitioning. As long as no coordinate transformation is performed, the difference is negligible. However, for more complicated cases, e.g.~re-sizeable prior space schemes, the posterior repartitioning is under-determined. A naive extension similar to \begin{equation} \label{eq:naive-extension} π(θ) \mathcal{L}(θ) = \hat{π}(θ) \mathcal{\hat{L}}(θ) \end{equation} does not, and indeed cannot, produce the expected result. One can prove (by considering a reference prior space from which all prior spaces of the same dimensionality derive via coordinate transformation) that the correct expression is actually one that preserves the evidence differential element.

What we propose is a much more general world-view and a more accurate and expressive model. A consistent partitioning involves specifying a hyperspace that includes the original prior space. The partitioning into $π$ and $\mathcal{L}$ is done once only, when the Bayesian inference problem is set up. The original posterior is a function in the original prior (sub)space. The posterior we obtain as a result is the original in some projections, and the evidence to which it corresponds is also the same as the original.

One might object that this is not a good model for the superpositional mixture, as the dynamical analogy would be much more appropriate, since the parameters really only control the partitioning. This point is partially valid. We would advocate seeing superpositions as an extension into a Hilbert space of vectors that are themselves spaces. Not easy to imagine, but to someone fluent in quantum theory, not a challenge. A better analogy would be to imagine the spaces for each individual prior side by side, and have a few parameters that control the relative “heights” of these spaces, or the activation energy for diffusion. This is a middle-ground that retains the generality of treating the entire problem in a hyperspace, but also has a dynamical analogy.

Arguments can be made either way, but an important consideration is to have a model that gives accurate predictions first, and is easy to imagine second.

Methodology of Measurements

Our measurements have to ascertain three key points. First we must prove that the consistent partitions obtain sensible estimates of ${\cal P}$ and ${\cal Z}$ and document the circumstances when they don’t. We shall then need to measure the performance uplift that can be attained while preserving the accuracy and precision of the sampling. Lastly, we shall test our machinery when applied to a real-world example: Cosmological parameter estimation.

For performance, we shall adopt the weighted accounting approach citep:Cormen for measuring time complexity in units of \({\cal N}\{{\cal L}\}\), and reducing all quantities to their long-run averages. Consequently, all of the partitions’ overheads associated with internal implementation details are ignored. This is to ensure fairness in comparing power repartitioning to a stochastic mixture[fn::SSIM has far less overhead].

We shall use the Kullback-Leibler divergence in two contexts. First, ${\cal D}\{π, {\cal P}\}$, a measure of the information obtained from the dataset beyond the prior, is used to gauge performance (as seen in cref:fig:kl-scaling).

We also need a method of comparing posteriors to determine their accuracy. The second divergence, ${\cal D}\{ {\cal P}, \bar{\cal P} \}$, quantifies the correctness of the obtained posterior, where $\bar{\cal P}$ is the posterior obtained using $\bar{π}(\bm{θ}) = \text{Const}$. In conjunction with ${\cal Z}$, these form our correctness criteria.

From cref:eq:bayes, errors in ${\cal P}$ are necessarily linked to errors in estimating ${\cal Z}$, which is the pivotal reason why nested sampling is sensitive to the partitioning in the first instance. Moreover, the character of the error in ${\cal Z}$ indicates the type of error in ${\cal P}$. A greater-than-expected evidence ${\cal Z}$ indicates inconsistent partitioning, where the likelihood was not re-scaled to accommodate a more informative prior. A less-than-expected ${\cal Z}$ is a sign that the regions of high ${\cal L}$ were not probed sufficiently, often accompanied by prior imprinting (PPR in cref:fig:convergence).

When constructing the test cases, we use models of no more than three dimensions with Gaussian likelihoods, as they are sufficiently general to share similarities with cosmological inference, while also being practical to investigate under small perturbations. For this purpose, we use a uniform baseline prior and a Gaussian likelihood \begin{equation} {\cal L}(\bm{θ}) = {\cal L}^\text{max} \exp\left[ -\frac{1}{2} (\bm{θ} - \bm{μ})^\mathrm{T} \bm{Σ}^{-1} (\bm{θ} - \bm{μ}) \right], \end{equation}

where the covariance matrix \(\bm{Σ}\) specifies the extent of the peak, and the vector \(\bm{μ}\) — the location. \({\cal L}^\text{max}\) is the normalisation factor, which we keep implicit, for convenience.

$\bm{Σ}$ is assumed diagonal, without loss of generality. $\bm{Σ}$ can be singular, which usually signals a redundancy in the parametrisation and can be fixed (by turning the strongly correlated parameters into derived ones). Otherwise it is symmetric and positive semi-definite, meaning that it can be diagonalised by a change into its eigen-basis. Counter-intuitively, this basis change must not be made part of the quantile. It is applied before computations involving correlated Gaussians, and reversed afterwards. This is a consequence of the extra Jacobian brought on by the difference between cref:eq:partitioning and cref:eq:partitioning-p. Essentially, by applying the transformation globally the unit hypercube becomes a parallelepiped, which is the result of neglecting the Jacobian associated with the linear transformation.
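A sketch of the basis change described above (the rotation wraps the correlated-Gaussian computation and is reversed afterwards; it is never folded into the prior quantile):

#+begin_src python
# Work with a correlated Gaussian by changing into the eigen-basis of Sigma
# and rotating back afterwards.
import numpy as np

def correlated_gaussian_logpdf(theta, mu, sigma):
    evals, evecs = np.linalg.eigh(sigma)            # Sigma = V diag(evals) V^T
    x = evecs.T @ (np.asarray(theta) - mu)          # into the eigen-basis
    return -0.5 * np.sum(x**2 / evals) - 0.5 * np.sum(np.log(2 * np.pi * evals))

def correlated_gaussian_sample(z_std_normal, mu, sigma):
    evals, evecs = np.linalg.eigh(sigma)
    x = np.sqrt(evals) * z_std_normal               # uncorrelated coordinates
    return mu + evecs @ x                           # rotate back afterwards
#+end_src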

To simulate imperfections we consider translational offsets between the proposal prior and the model likelihood. The main trial posterior is thus

truncated to a cube of side length[fn::The value \(1.2\) was chosen because it is the shortest floating point number that is not exactly machine-representable and whose inverse is also not exactly machine-representable. This causes numerical instability in the uniform prior probability density function and quantile (at the boundaries). The value was chosen for tests of boundary effects, which had to be removed from the project because of volume constraints.] \(a = 1.2 ⋅ 10^9\). The corresponding evidence (cref:eq:def-z) is \(\ln \mathcal{Z} ≈ -62.7\). The quantile of this Gaussian distribution is the one that enters iPPR’s and PPR’s priors as well as the reference likelihood. All other test cases are derived from the same Gaussian either via re-scaling, deformation (off-diagonal covariance and anisotropic scaling), or translation.
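As a sanity check, if \({\cal L}^\text{max}\) is chosen so that the Gaussian integrates to unity and the peak lies well inside the cube, then in the three-dimensional case the evidence is simply the inverse prior volume: \begin{equation} {\cal Z} = ∫_Ψ {\cal L}(\bm{θ})\, \bar{π}(\bm{θ})\, d\bm{θ} ≈ \frac{1}{a^3} ∫ {\cal L}(\bm{θ})\, d\bm{θ} = a^{-3}, \qquad \ln {\cal Z} = -3 \ln \left(1.2 ⋅ 10^{9}\right) ≈ -62.7. \end{equation}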

The choice of the prior scale, \(a = O(10^9)\), is to ensure that the series are not affected by run-to-run variance, even with a reduced number of live points. This has the added benefit of simulating an unbounded uniform prior numerically, as it is near the numerical limits. Also, any error in re-scaling the likelihood (e.g.~cref:fig:hist) leading to an inconsistent partition would not be as obvious or as clean with a smaller prior boundary. Furthermore, this choice allowed us to test the hypothesis that both stochastic mixtures and power posterior repartitioning can effectively remove the burn-in stage altogether. Last but not least, with such preconditions, stochastic mixtures are put at the greatest disadvantage. In the average case, approximately half the original live points are drawn from the proposal distribution and half from the uniform. The probability of finding the offset posterior peak is thus minuscule for large offsets. By contrast, in the average case the original live points with a Gaussian power posterior are drawn from a Gaussian twice as broad.

Results and Discussion[sec:results]

The first test was to ensure that the repartitioning was implemented correctly. For this goal, we produced coinciding Gaussian likelihoods and prior components. The results of the test are shown in cref:tab:hist and cref:fig:hist.

The second class of tests involved deforming the prior Gaussians. Both SSIM (iPPR and uniform) and PPR were resilient with respect to re-scaling and anisotropic deformation of the likelihood, obtaining \({\cal D}\{ {\cal P}, \bar{\cal P}\} \leq 0.03\). iPPR coped with situations where \({\cal P}\) was narrower than $π$, while failing in the opposite case: \({\cal D}\{ {\cal P}, \bar{\cal P}\} \geq 5.5\), when \({\cal D}\{ π, {\cal P} \} = 5.5\) and \(Σ = 0.3 × \mathrm{1}_3\).

The final test was with regards to translational offsets. The results are shown in cref:fig:kl-d,fig:convergence,fig:drift. In cref:fig:kl-d, we see that the amount of information extracted from PPR increases with increased offset. However, it does so sub-linearly, which combined with cref:fig:convergence, renders suspect the validity of the posteriors obtained using PPR and SSIM. However, cref:fig:drift shows that only PPR is adversely affected.

The posterior-to-posterior Kullback-Leibler divergence remained stable and less than \(0.3\) for the stochastic mixture and the reference. Power repartitioning fluctuated considerably, to the extent that no suitable plot could be produced. This suggests instability with respect to perturbations, and unpredictability in the accuracy of the posterior. However, none of the values reached the prior-to-posterior divergence, suggesting that at no offset was the posterior entirely obtained from the prior. As a result, power repartitioning may still be useful for unrepresentative informative priors that are not proposals, as cite:chen-ferroz-hobson have shown.

A special case is that shown in cref:fig:convergence, in a reduced-size bounding box \(a=2× 10^3\). The main notable feature is the inaccuracy of the posterior obtained by PPR. If the offset is small, \(O(2σ)\), the posterior is shifted. With a larger offset, e.g. \(O(4σ)\), two peaks can be resolved. Both errors are caused by incorrect evidence (see cref:fig:drift): PPR gives \(\ln {\cal Z}≈ -25.4 ± 2\), vs the uniform reference \(\ln {\cal Z} = -22.7 ± 0.4\) and SSIM \(\ln {\cal Z} = -22.5 ± 0.3\). There are two key observations to be made: the evidence is still within reasonable variance of the reference, and its estimated error is large. As a result, while we have not obtained the right information, we know that something went wrong.

This result is not at variance with cite:chen-ferroz-hobson 's observations, as they do not have a comparable test case. All of their numerical test cases were restricted to no more than two physical parameters, while we extended it to three. The example given required considerable fine-tuning to be reproducible\footnote{Too much free time in quarantine. }, as larger or smaller offsets often lead to correct convergence some of the time. Another hint at why power repartitioning may have been affected more than a stochastic mixture can be gleaned from cref:fig:hist. Noting that the correct evidence is still within one standard deviation of the estimate obtained using power repartitioning, we suggest that the result is less precise rather than outright wrong: the unusual shape of the marginalised posterior is the result of loss of precision, and the inaccurate posterior is within the margin of error of the analytical result.

It is worthwhile to consider the impact of such a scenario occurring during practical use of Bayesian inference. If either of the posteriors looks like PPR’s marginalised posteriors in cref:fig:convergence, the researcher performing the inference has the following options:

Cref:opt:accept-with-err is a last resort. Cref:opt:accept is adequate for low accuracy applications, provided errors are properly estimated using e.g.~\texttt{nestcheck} citep:higson2018nestcheck. From cref:fig:benchmark, we see that the performance uplift allows for cref:opt:shift to be more efficient than~\ref{opt:uniform}, albeit marginally so.

This is where our technique is most useful: one obtains, as we’ve shown in~cref:fig:convergence, a more accurate \({\cal P}(\bm{θ})\) by using PPR from within SSIM. Hence we have a repartitioning scheme that is on average slower than PPR (by approximately \(18\%\) extra \({\cal L}\) evaluations, within PPR’s run-to-run variance of approximately \(20\%\)\footnote{Comparison with cref:fig:benchmark may be misleading, as the error margins there correspond to exact coincidence, while the case in question involves an offset of $6μ$. }), and that costs an order of magnitude less than~\vref{opt:uniform,opt:shift} would afford. That said, using the proposal directly is faster still (cref:fig:benchmark).

Lastly, \textbf{\emph{posterior mass}} — a measure of convergence speed citep:higson2018nestcheck, is often used in diagnosing nested sampling. Typical examples of posterior mass for a run with $π=\text{Const.}$ and runs accelerated by posterior repartitioning are given in cref:fig:higson. Notice that the re-partitioned series has a longer extinction phase, as a result of introducing extra nuisance parameters. Also, the confidence intervals on each parameter between the uniform and the re-partitioned run are identical, signifying that we have not lost precision.

Cosmological Simulations

After an initial run of \texttt{Cobaya} citep:cobaya, we have obtained the marginalised posteriors of all the key parameters of the \(Λ\)CDM model, as well as the nuisance parameters.

First, we have performed an inference using the Planck citep:Planck dataset, with the \(Λ\)CDM model. The results of our initial run are presented in cref:fig:cosmology. From these data, under the assumption that the parameters’ posteriors are a correlated Gaussian distribution, we extract the means $\bm{μ}$ and the covariance matrix \(\bm{Σ}\).

We use a stochastic mixture of a uniform prior and a single Gaussian obtained from the posterior samples of a run with a uniform prior, which we patch into \texttt{Cobaya}’s interface to \texttt{PolyChord} citep:code. The posteriors of two runs with identical settings (save live point number) are given in cref:fig:cosmology.

Firstly, notice that the posteriors have a significant overlap. Each plot on the diagonal of cref:fig:cosmology is a Gaussian, agreeing with the results of the reference run to within less than one tenth of a standard deviation. However, SSIM predicts a deformed (non-ellipsoidal) covariance of the \(Λ\)CDM parameters.

The deformations are present in all posteriors that used a Gaussian proposal, which indicates that the deformations are systematic. The deformities are not caused by finite grain size in the stochastic mixture, as the Gaussian proposal has them too, and to a greater extent. The mixing proportion parameter $β$ has converged to a mean of $\langle β \rangle = 0.82$, which indicates that the Gaussian proposal was not fully representative, but also that the later stages of sampling were dominated by the Gaussian proposal. Despite the appearance, however, cref:tab:cosmo-accuracy shows that the posteriors of the SSIM and non-SSIM runs are not significantly different (${\cal D} < 0.3$). Moreover, the evidence is within one standard deviation, and is more precise with SSIM by a factor of \(8\).

While this might indicate a higher accuracy than is obtainable with a pure uniform prior, one must exercise caution: although we can eliminate some potential systematic errors, a more conclusive analysis is needed.

With accuracy out of the way, cref:tab:cosmo-performance highlights a significant improvement in performance. Using SSIM offers a reduction of run-time by a factor of \(19\). By exploiting the increased precision one can reduce the number of live points, and gain a further reduction of run-time by a factor of \(37\). Further improvements are attainable by reducing the precision criterion and terminating early. Conversely, to obtain precision similar to SSIM’s, assuming sub-linear scaling with \(n_\text{live}\), one would need to extend the duration of the inference to 912 hours \(≈\) 40 days. Assuming that errors in evidence scale as \(n_\text{live}^{-1/2}\), the time would then be of the order of a year.

Conclusions

Results

The project’s purpose has been to investigate the performance increase attainable by algorithmic optimisations of the inputs to nested sampling. We have identified a class of methods based on work by cite:chen-ferroz-hobson, called consistent partitions, fit for this purpose. We have shown that each consistent partition can accelerate nested sampling when given an informative proposal. We have developed stochastic superpositional isometric mixing (SSIM) to combine several proposals into one. When used with nested sampling, SSIM produces more precise and accurate posteriors, faster than any individual consistent partition.

We have established the following advantages in using SSIM over PPR. SSIM admits multiple types of proposal priors, while PPR admits only one; it permits a broader class of proposals, for example with differing domains, while PPR requires the domains of the proposals to coincide. SSIM is abstract: the prior quantile is a superposition of the constituent priors’ quantiles. By contrast, PPR’s prior quantile needs to be calculated by the end user for each type of proposal, and the calculation is non-trivial for non-Gaussian proposals. SSIM supports an unbiased reference (uniform) prior exactly; PPR tends to an unbiased reference as \(β\rightarrow 0 \), but is only truly unbiased if $β=0$, which occurs with negligible probability. SSIM, like PPR, prefers the prior that leads to a higher likelihood, but unlike PPR, this does not lead to the total exclusion of less-representative priors.

As a result, faster but more fragile consistent partitions (e.g.~iPPR), in conjunction with a standard uniform prior, can exceed the more robust but slower PPR in precision, accuracy, and speed. When applied to real-world cosmological parameter estimation, our strategy of using an SSIM of uniform and iPPR resulted in a significant performance increase, reducing the run-time requirements of \texttt{Cobaya} by a factor of 30.

Further Refinements

As of now, the best way to use SSIM is via the Python package \texttt{supernest}.

pip install supernest

This package includes all the code used to produce the plots, except ref:fig:cosmology-pc, which is upstream in \texttt{Cobaya}. For SSIM to work, the proposal has to be provided in the form of a consistent partition. While for an arbitrary combination of proposal prior quantile and likelihood this task may seem daunting, one often deals with a complex likelihood function and a uniform prior. In that case, it is rather simple to create a proposal in the form of a Gaussian and produce the corresponding correction to the likelihood.
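For example, one might fit a Gaussian to the samples of an earlier run and build the pair of proposal quantile and corrected likelihood along the lines of the iPPR prescription (the helper below is a hypothetical sketch, not the \texttt{supernest} interface; consult the package documentation for the actual entry points):

#+begin_src python
# Fit a Gaussian proposal to earlier posterior samples, assuming a uniform
# reference prior on a box [lo, hi]^n, and correct the likelihood accordingly.
import numpy as np
from scipy.stats import multivariate_normal, norm

def gaussian_proposal_from_samples(samples, lo, hi, log_ref_l):
    mu = np.mean(samples, axis=0)
    sigma = np.cov(samples, rowvar=False)
    proposal = multivariate_normal(mean=mu, cov=sigma)
    log_ref_prior = -np.sum(np.log(np.asarray(hi) - np.asarray(lo)))  # constant

    def quantile(u):                      # diagonal approximation of sigma
        return mu + np.sqrt(np.diag(sigma)) * norm.ppf(u)

    def log_l(theta):                     # iPPR-style likelihood correction
        return log_ref_l(theta) + log_ref_prior - proposal.logpdf(theta)

    return quantile, log_l
#+end_src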

It is important to remember that, since the proposals are put into a superposition with a reference, the proposals can be tighter than the expected posterior distribution, or indeed contain very little if any covariance information. As the safety net of SSIM ensures that at least some of the points are sampled from the original distribution, and \texttt{PolyChord} is capable of exploring multi-modality in the posteriors, there are very few reasons not to provide a proposal.

Applications

The obtained results are general. They can be applied in any area of any science that relies on Bayesian inference using nested sampling, e.g.~particle physics citep:multinest, astronomy citep:Casado_2016, medicine, psychology, et cetera. SSIM should be considered for high-performance computing applications in COVID-19 research (e.g.~cite:Covid1,Covid2), as inference in this field is both time- and resource-intensive, while also being time-critical. It may prove useful for agent-based simulations with complex likelihood functions citep:Covid2, similar to cosmology. Identifying causal links between policies and the incidence of COVID-19 cases, for example, is a problem described by 49 parameters.

Note that the asymptotic worst-case time complexity of superpositional mixtures liberates one to use as many complex models as one likes. For example: consider two libraries providing a likelihood for \(Λ\)CDM, one which makes multiple approximations (fast), and one which performs the full calculation (slow). By using the two in a superpositional mixture, one shall obtain a speedup compared to the slow run of nested sampling. This is due to the slow likelihood being evaluated only some of the time. It will only be comparable to the pure slow run if the fast prior were utterly unrepresentative of the results, which itself is a valuable insight. Our findings may be of particular interest for further refining \texttt{CLASS} and \texttt{Cobaya}, as the time complexity of computing the likelihood is the bottleneck of modern cosmological code.

Nested sampling can also be applied to inference-related problems, such as reinforcement learning citep:javid2020. The process of training a neural network involves estimating connection strengths between nodes of said network. Normally, this end is achieved via negative feedback: connections correlated with the desired behaviour are reinforced, and vice versa citep:Kaelbling_1996. Machine learning maps neatly onto Bayesian inference when identifying connection strengths as parameters of a model, and likelihood — correlation with desired behaviour. Most neural networks are trained with uniform priors.

We may also extend Bayesian analysis to \textbf{\emph{consistent Bayesian meta-analysis}}. Consider data obtained from multiple physical processes that are described in one theory with an overlapping set of parameters $θ$. As of now, we only perform separate analyses of each experiment. However, SSIM allows us to combine these models, and naturally represents consistency in the posteriors of the shared parameters. As an example, all of the estimates of the age of the universe may be obtained in one fell swoop from all the available models and data. This scheme will have the bonus of highlighting datasets that are incompatible with the overall conclusion, allowing us to re-evaluate the experimental data as needed\footnote{Additional, more detailed explanations shall be published in a paper submitted to the \textbf{\emph{Monthly Notices of the Royal Astronomical Society}}.}.

Bayesian meta-inference is related to the issue of discordant datasets citep:tension, and to the Bayes factor as a method of combining datasets. The idea is not new: usage of the evidence as the sole judge of consistency between a model and a dataset has been discussed for as long as the subject of Bayesian inference has existed. Multiple metrics have been proposed, e.g.~cite:Marshall_2006.

However, we propose a different delineation of datasets. Instead of considering the results of some early experiments as parts of the prior, and considering their agreement with newer observations only, we propose clearing the prior of anything but the theoretical constraints, violation of which would lead to the theory being disproved. For example, if our theory predicts no negative-mass dark matter, our prior is uniform in the positive \(Ω_\mathrm{c}\). The data that used to be an inextricable part of the prior are now considered proposals. In Bayesian meta-analysis, our prior is a stochastic mixture of all previous observations of dark matter and the aforementioned constrained uniform prior. To clarify, our scheme does not imply a mixture of just two priors. If the existence of dark matter can be (and was) inferred from \(n\) datasets, then our mixture is of as many as \(n+1\) priors, and would consist of the posteriors of the analyses of the experiments used as proposals. The joint likelihood is suitably programmed. Due to the consistent branching, there is no “cross-talk” between likelihoods. However, the marginalised posteriors would indicate the best-fit parameter distributions and take the consistency and precision of different observations into account. Effectively, this method synthesises data into a coherent model, without artificially splitting the model into different experimental datasets and requiring manual reconciliation.

The posteriors for the branch probabilities would be a measure of the consistency of specific experiments. If nested sampling chose to ignore e.g.~the Type IA supernova datasets, it may suggest that such experiments are systematically inconsistent with other observations. It is much better than attempting to reconcile the discrepant datasets manually, as people are prone to fallacies. Moreover, experiments for which the data are still preserved can be continuously integrated into a joint posterior. Meta-inference may reveal cases where data were doctored to fit a particular conclusion. In such cases, the marginalised posteriors will show unusual covariances, and be outliers in the analysis.

In conclusion, the new methodology of combining information from many priors shows great promise in the field of Bayesian inference. It has demonstrably reduced the run-time of some of the most complex problems: that of cosmological parameter estimation. A rich field of research awaits those courageous enough to follow. It is ours but to point the way.

Code

All code used to generate the plots, the framework for systematising consistent partitions as well as the configurations of \texttt{Cobaya} for cosmological simulations can be found on GitHub citep:sspr. In a separate repository~citep:code is the version of Cobaya with our modifications, which was used to produce the figures overleaf.

Footnotes