% Created 2020-07-16 Thu 16:28
% Intended LaTeX compiler: pdflatex
\documentclass[draft,usenatbib]{mnras}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage{hyperref}
\usepackage{ulem}
\usepackage{grffile}
\usepackage{longtable}
\usepackage{capt-of}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{pgf}
\usepackage{pgfplots}
\usepackage{pdflscape}
\usepackage{varioref}
\usepackage{bm}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{natbib}
\usepackage[nameinlink,capitalize,noabbrev]{cleveref}
\DeclareMathOperator{\TopHat}{TH}
\DeclareMathOperator{\CDF}{CDF}
\date{Accepted XXX. Received YYY; in original form ZZZ}
\pubyear{2020}
\hypersetup{
pdfauthor={},
pdftitle={},
pdfkeywords={},
pdfsubject={},
pdfcreator={Emacs 28.0.50 (Org mode 9.3)},
pdflang={English}}
\begin{document}
\begin{abstract}
We develop a method for improving the performance, accuracy and precision of nested sampling, called consistent posterior repartitioning. If a user has any knowledge of the covariance structure and location of the posterior peak(s), it may be used to dramatically improve the performance of nested sampling without biasing the evidence or posterior samples.
This knowledge is analogous to a pre-computed proposal matrix and Metropolis-Hastings start point, and importantly, if an incorrect repartitioning is mistakenly applied, has no impact on the performance or accuracy relative to the unaccelerated run.
We demonstrate this on toy and cosmological examples, and show a real-world speed-up for a concordance cosmological run by a factor of \(XX\). When repartitioned, nested sampling spends less time in the tails of the distribution and consequently accumulates less error in evidence and posterior estimation. When normalised for precision, the performance uplift is three orders of magnitude.
This work may be viewed as the natural continuation of previous research by \citet{chen-ferroz-hobson}, may be applied with minimal modification to any existing scripts, and opens up a new evolutionary class of nested sampling algorithms.
Our code is released as an open-source and extendable Python package, \texttt{SuperNest}.\footnote{GitHub/GitLab repository here}
\end{abstract}
\title[\texttt{SuperNest}]{\texttt{SuperNest}: accelerated nested sampling with applications to astrophysics and cosmology}
\hypersetup{
pdftitle={SuperNest: accelerated nested sampling with applications to astrophysics and cosmology},
pdflang={English}}
\author[Petrosyan \& Handley]{
Aleksandr Petrosyan,$^{1,2,3}$\thanks{E-mail: ap886@cam.ac.uk}
Will Handley$^{1,2,4}$\thanks{E-mail: wh260@cam.ac.uk}
\\
$^{1}$Astrophysics Group, Cavendish Laboratory, J. J. Thomson Avenue, Cambridge, CB3 0HE, UK\\
$^{2}$Kavli Institute for Cosmology, Madingley Road, Cambridge CB3 0HA, UK \\
$^{3}$Queens' College, Silver Street, Cambridge, CB3 9ET, UK \\
$^{4}$Gonville \& Caius College, Trinity Street, Cambridge, CB2 1TA, UK
}
\section{Introduction}
\label{sec:org0ed8441}
The standard model of the universe and its evolution in modern
cosmology is the \(\Lambda\)CDM model \citep{Condon2018}, so named
after the main components of the universe: the cosmological constant
\(\Lambda\) and cold dark matter. It has six major
independent\footnote{There can be other equivalent parameter
sextuplets.} parameters \citep{Condon2018}: the physical baryon density
\(\Omega_\mathrm{b}h^{2}\); the physical (cold) dark matter density
\(\Omega_\mathrm{c}h^{2}\); the angular parameter
\(100\theta_\mathrm{s}\); the re-ionisation optical depth
\(\tau_\text{reio}\); and the power spectrum slope \(n_\mathrm{s}\) and
amplitude \(\ln (10^{10}A_\mathrm{s})\) \citep{Cosmology}.
The task of the present study is to develop better tools for
evaluating the agreement of our observations from the Planck mission
with \(\Lambda\)CDM, estimating the parameters in the process. In
the language of Bayesian statistics\footnote{See \cite{xkcd} for a
comparison with frequentist statistics.}, our goal is efficient
Bayesian inference.
While said inference can in principle be executed analytically, it
is often intractable even when performed numerically. For context, a
standard \(\Lambda\)CDM inference run requires an HPC cluster
with at least 128 nodes, each with at least 6\,GB of memory, and the
equivalent of three full days of operation. To add insult to injury,
the error margins on the parameters and the evidence, if computed at
all, are staggering. Even then, such a result requires judicious
tuning and careful consideration of the model, which at present
cannot be automated. Equivalent inference for any model other than
\(\Lambda\)CDM is thus out of reach of most cosmologists: at
present, for example, work towards resolving the Hubble tension
\citep{tension} is impractical for most professional cosmologists.
The aim of the present study is to correct these circumstances.
Multiple numerical algorithms exist to perform Bayesian inference:
Metropolis-Hastings \citep{Metropolis} in conjunction with the Gibbs
sampler \citep{Metropolis-Hastings-Gibbs}; Hybrid (Hamiltonian)
Monte Carlo \citep{1701.02434,Duane_1987}, and nested sampling
\citep{Skilling2006}. Each of these algorithms has different
advantages: Metropolis-Hastings is one of the fastest at estimating
the model parameters, at the cost of not evaluating the evidence,
which is a universal metric of model fitness.
It is also well-known that most Bayesian inference methods, like
Metropolis-Hastings, can benefit from extra information provided at
inference time. This information we call a \emph{proposal}, as it
usually encodes what we expect the posterior distribution to be
for each parameter. It has become standard practice to provide
proposals along with cosmological inference packages, such as
\texttt{CosmoMC}. However, to date there has been no such mechanism for
nested sampling.
We shall explore a method of injecting said information into the
inference process, based on an idea first pointed
out by \cite{chen-ferroz-hobson}. Whilst \emph{automatic posterior
repartitioning} was originally used to resolve the issue of
unrepresentative priors, the idea can be exploited to create a much
more potent methodology for conducting nested sampling.
In the present paper we outline the method of \emph{stochastic
superpositional mixing} of \emph{consistent partitions} and demonstrate its
efficacy at utilising proposal information. We do so by
providing a brief overview of Bayesian inference, highlighting the
peculiarities of said inference method vital for our method to work.
Afterwards we discuss the important evolution of our method,
and the infrastructure one would need in order to support the
widespread use of proposal distributions. Finally, we explore an
evolution of the nested sampling algorithm and the approaches one
would take to Bayesian inference.
\begin{table}
\centering
\caption{A non-exhaustive list of major implementations of nested sampling.}
\begin{tabular}{lr}
\textbf{Name} & \textbf{Publication}\\
\hline
\texttt{MultiNest} & \cite{Feroz2009MultiNestAE} \\
\texttt{PolyChord} & \cite{polychord} \\
\texttt{nestle} & \cite{nestle} \\
\texttt{dyNesty} & \cite{Speagle_2020}
\end{tabular}
\end{table}
\section{Theoretical background}
\label{sec:orgb009956}
In this section we shall primarily focus on previous work in the
area. We must outline the key elements of Bayesian inference via
nested sampling in order to understand why using more informative
priors directly is not a good solution, and why one needs to perform
consistent re-partitioning.
\subsection{Bayesian inference}
\label{sec:org16a4244}
Bayesian inference, as the name suggests, is based on a
remarkable result in the theory of probability, called Bayes'
theorem. It is mainly concerned with the relationships between
conditional probabilities, which map neatly onto the concepts of
known facts and assumptions, while retaining the ability to
reason about objective truths.
We must first set up the terminology used. A model \({\cal M}\) of
a physical process, is parameterised by \(\bm{\theta} =
(\theta_{1}, \theta_{2}, \ldots , \theta_{n})\). New empirical
observations of said process are encapsulated in the
\textbf{\emph{dataset}} \(\mathfrak{D}\). The
\textbf{\emph{likelihood}} \({\cal L}\) of the parameters
\(\bm{\theta}\) is the probability of observing \(\mathfrak{D}\),
conditional on the configuration \(\bm{\theta}\) and the model \({\cal
M}\). The prior \(\pi(\bm{\theta})\) is the probability of
\(\bm{\theta}\) assuming \({\cal M}\). It can be obtained from both
previous datasets as well as constraints inherent to the model. The
posterior is a probability of \(\bm{\theta}\) that is conditional on
\({\cal M}\) and the dataset \({\mathfrak D}\). The locus of all
\(\bm{\theta}\) for which the prior is both defined and non-zero
defines the \textbf{\emph{prior space}} \(\Psi\). Finally, the
\textbf{\emph{evidence}} is the probability of the data
\({\mathfrak D}\) assuming the model.
The interaction of the probabilities in \cref{tab:defs} is governed by
\citeauthor{1763}'s theorem:
\begin{equation}\label{eq:bayes}
{\cal L}(\bm{\theta}) \times \pi (\bm{\theta}) = {\cal Z} \times {\cal P} (\bm{\theta}).
\end{equation}
Bayesian inference is the process of reconciling the model \({\cal M}\)
represented in \({\cal L}\) and \(\pi\), with observations
\(\mathfrak{D}\) represented in \({\cal L}\). A numerical algorithm that
obtains \({\cal Z}\) and \({\cal P}\) from \(\pi\) and \({\cal L}\), is called
a \emph{\textbf{sampling algorithm}} or \textbf{\emph{sampler}}.
The convenient representation of \(\pi\) and \({\cal L}\) depends on the
particulars of the sampler. For \textbf{\emph{nested sampling}}
(e.g. \texttt{PolyChord}, \texttt{MultiNest}) we delineate them
indirectly: with the logarithm of the likelihood probability-density
function \(\ln {\cal L}(\bm{\theta})\), and the \textbf{\emph{prior
quantile}} \(C\{\pi\}(\bm{\theta})\). The latter is a coordinate
transformation \(C: \bm{u} \mapsto \bm{\theta}\) that maps a uniform
distribution of \(\bm{u}\) in a unit hypercube to \(\pi(\bm{\theta})\) in
\(\Psi\). It is often obtained by inverting the cumulative distribution
function of the prior.
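As a concrete illustration of this transformation, the sketch below (our own toy example, not part of any named package; the Gaussian means and widths are arbitrary assumptions) maps uniform unit-hypercube samples to samples from a product of independent Gaussian priors via the inverse CDF:

```python
import numpy as np
from scipy.stats import norm

# Prior quantile function C: u -> theta for independent Gaussian priors.
# Each coordinate of u is uniform on (0, 1); norm.ppf is the inverse CDF.
def prior_quantile(u, mu, sigma):
    return mu + sigma * norm.ppf(u)

rng = np.random.default_rng(0)
u = rng.uniform(size=(10_000, 2))               # points in the unit hypercube
theta = prior_quantile(u, mu=np.array([0.0, 1.0]),
                       sigma=np.array([1.0, 0.5]))

# The transformed samples are distributed according to the prior.
print(theta.mean(axis=0))   # close to [0.0, 1.0]
print(theta.std(axis=0))    # close to [1.0, 0.5]
```

The same pattern generalises to any prior whose cumulative distribution function can be inverted per coordinate.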
\begin{table}
\caption{Definitions of main quantities in Bayesian inference. \label{tab:defs}}
\centering
\begin{tabular}{llr}
\textbf{Term} & \textbf{Symbol} & \textbf{Definition}\\
\hline
Prior & \(\pi(\bm{\theta})\) & \(P ( \bm{\theta} \vert {\cal M})\) \\
Likelihood & \({\cal L}(\bm{\theta})\) & \(P ( \bm{\mathfrak{D}} \vert \bm{\theta} \cap {\cal M})\) \\
Posterior & \({\cal P}(\bm{\theta})\) & \(P ( \bm{\theta} \vert \bm{\mathfrak{D}} \cap {\cal M})\) \\
Evidence & \({\cal Z}\) & \(P ( \bm{\mathfrak{D}} \vert {\cal M})\) \\
\end{tabular}
\end{table}
\citeauthor{1763}'s theorem holds on the set intersection of the
domains of all probability density functions. Let \(D\{f\}\) denote the
domain of the probability density function \(f\), i.e.~where \(f\) is
both defined and \textbf{non-zero}. Hence
\begin{equation}
D \{ \pi \} \cap D \{ {\cal L} \} = D \{ {\cal P} \} \subset \Psi,
\end{equation}
meaning the inference is possible only on the intersection of the
domains of prior and likelihood.\label{domain-discussion}
For each choice of \({\cal L}\) and \(\pi\), there is a unique choice
of \({\cal Z}\) and \({\cal P}\); equivalently they represent the same
unique model \({\cal M}\), or partition it consistently. That
correspondence is \emph{surjective}, but not \emph{injective}: many
choices of \({\cal L}(\bm{\theta})\) and \(\pi (\bm{\theta})\) may
correspond to the same \({\cal P} (\bm{\theta})\) and \({\cal Z}\)
\citep{chen-ferroz-hobson}. This remark is the cornerstone of the
automatic posterior repartitioning, which we shall exploit to
obtain a speedup.
\subsection{Nested Sampling}
\label{sec:org8d5afbf}
There are very few methods that perform full Bayesian
inference. The chief reason is that one rarely needs more than a
very crude approximation of what full inference can achieve.
More often than not, one reduces the evidence to either 1 (the model
fits the data) or 0, hence removing the need for a complex routine
to evaluate \({\cal Z}\). Similarly, the model parameters' posterior
distribution is often reduced to an uncorrelated, symmetric normal
distribution. Such crude approximations allow one to perform
inference exceptionally quickly. One needs to be in a situation
where resolving differences in evidence of a few per cent may
be a deciding factor before investing in evaluating the evidence to
that precision. There are very few reasons to concern oneself with
evaluating \({\cal Z}\) at all, hence very few invest the resources
into doing it.
Nested sampling is a full inference algorithm, in that it evaluates
both the evidence and the posterior. It is rarely used outside of
circumstances where model comparison is a necessity, because of the
additional overhead associated with evaluating the evidence. While
this is not the only advantage that nested sampling has over other
MCMC methods, it is the chief qualitative difference that can
justify the time investment. To understand why evaluating the
evidence is so valuable, consider how one might do it.
\({\cal P}\) is a probability, therefore normalised, which combined
with \cref{eq:bayes} yields
\begin{equation}
\label{eq:def-z}
{\cal Z} = \int_{\Psi} {\cal L}(\bm{\theta}) \pi(\bm{\theta}) d\bm{\theta}.
\end{equation}
Thus, \citeauthor{1763}'s theorem reduces parameter estimation ---
obtaining \({\cal P}\) from \(\pi\) and \({\cal L}\) --- to
integration~\citep{bayes-integration}. The naïve approach of obtaining
\({\cal Z}\) by uniformly rasterising \(\Psi\) is intractable for
hypotheses with \(O(30)\) parameters \citep{Caflisch_1998}; this is
the problem that nested sampling resolves.
Having motivated the utility of nested sampling, we should provide
an outline of its execution. The following is a short description
of nested sampling \citep{Skilling2006}. We begin by picking
\(n_\text{live}\) \textbf{\emph{live points}} at random in
\(\Psi\). During each subsequent iteration, the point with the lowest
likelihood is declared \textbf{\emph{dead}}, and a new live point
\(\bm{\theta}\in\Psi\) with a higher likelihood is drawn, based on
the prior \(\pi\) and an implementation-dependent principle. Live
points are thus gradually moved into regions of high likelihood. By
tracking their locations and likelihoods, a statistical argument
lets us approximate \({\cal Z}\) and its error at each
iteration and, by \cref{eq:bayes}, \({\cal P}(\bm{\theta})\). We
continue until only a pre-determined fraction of the evidence
associated with \(\Psi\) remains unaccounted for.
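The loop above can be sketched as follows (a deliberately naïve toy of our own, assuming a uniform prior on the unit square and drawing the likelihood-constrained replacement point by brute-force rejection; production samplers such as \texttt{PolyChord} or \texttt{MultiNest} replace that step with far more efficient schemes):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.05

def log_like(theta):
    # 2D Gaussian likelihood centred at (0.5, 0.5) with width sigma,
    # normalised so that the true evidence is very close to 1.
    return (-0.5 * np.sum(((theta - 0.5) / sigma) ** 2)
            - np.log(2 * np.pi * sigma ** 2))

n_live, n_iter = 100, 700
live = rng.uniform(size=(n_live, 2))             # draws from the uniform prior
live_logl = np.array([log_like(t) for t in live])

log_z, log_x = -np.inf, 0.0                      # running evidence, prior volume
for i in range(n_iter):
    worst = np.argmin(live_logl)
    log_x_new = -(i + 1) / n_live                # E[ln X] shrinks by 1/n_live
    log_w = np.log(np.exp(log_x) - np.exp(log_x_new))   # shell volume
    log_z = np.logaddexp(log_z, live_logl[worst] + log_w)
    log_x = log_x_new
    # Replace the dead point with a prior draw at higher likelihood.
    while True:
        t = rng.uniform(size=2)
        if log_like(t) > live_logl[worst]:
            live[worst], live_logl[worst] = t, log_like(t)
            break

# Account for the evidence still enclosed by the final live points.
log_z = np.logaddexp(log_z, log_x + live_logl.max()
                     + np.log(np.mean(np.exp(live_logl - live_logl.max()))))
print(log_z)   # near ln Z = 0 for this normalised toy problem
```

The statistical argument mentioned above enters through the geometric shrinkage estimate \(\mathrm{E}[\ln X_i] = -i/n_\text{live}\) of the enclosed prior volume.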
Recall that not all parameter inference methods require obtaining
\({\cal Z}\). Some methods, such as Hamiltonian Monte Carlo
\citep{1701.02434}, allow obtaining a normalised \({\cal P}\)
directly. For such approaches, any consistent specification of
\(\pi\) and \({\cal L}\) will lead to identically the same posterior,
barring numerical errors. This is also true of methods that
evaluate \({\cal Z}\) exactly. However, nested sampling allows
uncertainty in \({\cal Z}\), which is controlled by \(\pi\) and \({\cal
L}\). Thus nested sampling, unlike e.g.~Metropolis-Hastings
\citep{Metropolis-Hastings-Gibbs}, is sensitive to the concrete
definitions of the prior and likelihood. While many choices of \(\pi\)
and \({\cal L}\) correspond to the same \({\cal P}\) and \({\cal Z}\),
the errors and nested sampling's time complexity depend on
the specification of \(\pi\) \citep{Skilling2006}. Specifically, more
\emph{informative} priors are preferable.
In the following section we shall discuss how information content
is measured.
\subsection{Metrics and informativity}
\label{sec:org0abc6cc}
An important quantity for measuring the correctness of the obtained
posterior is the \textbf{\emph{Kullback-Leibler divergence}} \({\cal
D}\) \citep{Kullback_1951}. For probability distributions
\(f(\bm{\theta})\) and \(g(\bm{\theta})\), it is defined as:
\begin{equation}
\label{eq:kl-def}
{\cal D}\{f, g \} = \int_{\Psi}f(\bm{\theta}) \ln \frac{f(\bm{\theta})}{g(\bm{\theta})} d \bm{\theta}.
\end{equation}
It is a pre-metric on the space of probability distributions: it is
nil if and only if \(f(\bm{\theta}) = g(\bm{\theta})\) (albeit not
symmetric), which is convenient for defining a representation
hierarchy. The statement ``\(f\) represents \(g\) better than \(h\)'' is
equivalent to
\begin{equation}
\label{eq:hierarchy}
{\cal D}\{f, g\} < {\cal D}\{h, g\}.
\end{equation}
Specifically, distribution \(h\) is said to be unrepresentative of \(g\)
if a uniform distribution \(f\) represents \(g\) better than \(h\).
A probability density function \(f(\bm{\theta})\) is said to be
more \textbf{\emph{informative}} than \(g(\bm{\theta})\) if:
\begin{equation}
\label{eq:informative}
{\cal D}\{ f, g \} > {\cal D}\{ g, f\}.
\end{equation}
This also highlights that the Kullback-Leibler divergence is not a
metric on the space of distributions. However, its asymmetry
lends itself well to considerations where such an asymmetry is
natural: e.g.~priors are not equivalent to posteriors, one comes
after the other, and so \({\cal D}\) can be used to quantify the
``surprise'' information obtained during inference.
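The asymmetry is easy to exhibit numerically. The sketch below (illustrative only; the two Gaussians are arbitrary choices of ours) evaluates \cref{eq:kl-def} on a grid for a narrow and a broad Gaussian and shows that the two orderings of the arguments give different divergences:

```python
import numpy as np

theta = np.linspace(-10, 10, 100_001)
d_theta = theta[1] - theta[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kl(f, g):
    # D{f, g} evaluated as a Riemann sum on the grid.
    mask = f > 0
    return np.sum(f[mask] * np.log(f[mask] / g[mask])) * d_theta

f = gauss(theta, 0.0, 0.5)   # narrow distribution
g = gauss(theta, 0.0, 2.0)   # broad distribution

# For zero-mean Gaussians the closed form is ln(s2/s1) + s1^2/(2 s2^2) - 1/2.
print(kl(f, g))   # ~0.92
print(kl(g, f))   # ~6.11, so D{f,g} != D{g,f}
```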
The time complexity \(T\) of nested sampling is
\begin{equation}\label{eq:complexity}
T \propto n_\text{live}\ \langle {\cal T}\{{\cal L}(\bm{\theta})\} \rangle \ \langle {\cal N}\{{\cal L}(\bm{\theta})\} \rangle,
\end{equation}
where \({\cal T}\{f(\bm{\theta})\}\) is the time complexity of
evaluating \(f(\bm{\theta})\) and \({\cal N}\{f(\bm{\theta})\}\) ---
the quantity of such evaluations. Reducing \(n_\text{live}\) reduces
the resolution of nested sampling, while \({\cal T}\{{\cal
L}(\bm{\theta})\}\) is model-dependent. We can, however, reduce the
number of likelihood evaluations by providing a more informative
prior, though doing so carries an associated risk, which we shall
address later.
Choosing the correct representations of \({\cal P}\) and \(\pi\) is
crucial for nested sampling's correctness and performance. For
example, assuming the same likelihood, if \(\pi_{0}\) and \(\pi_{1}\)
are equally informative, but \(\pi_{0}\) is more representative of
\({\cal P}\), then the inference with \(\pi_{0}\) will terminate more
quickly than with \(\pi_{1}\) (and the result will also be more accurate).
Similarly, if \(\pi_{1}\) is more informative than \(\pi_{2}\), but
equally as representative, nested sampling will terminate with
\(\pi_{1}\) faster than with \(\pi_{2}\), and the result will be more
precise. In detail, if \(\pi_{1} (\bm{\theta})\) is more similar to
\({\cal P} (\bm{\theta})\), then points drawn with PDF \(\pi_{1}
(\bm{\theta})\) are more likely to lie in \(\bm{\theta}\) regions of
high \({\cal P} (\bm{\theta})\), leading to fewer iterations.
Posteriors \({\cal P}_{1}\) and \({\cal P}_{2}\) obtained with the
priors \(\pi_{1}\) and \(\pi_{2}\) are different, because of
\cref{eq:bayes}. In fact, the posterior \({\cal P}_{1}\) will be more
informative than \({\cal P}_{2}\), and more similar to
\(\pi_{1}\). This effect we call \textbf{\emph{prior imprinting}}.
Imprinting is desirable if the informative prior \(\pi_{1}\) is the
result of inference over another dataset. Nonetheless, imprinting
limits the information obtainable from \(\mathfrak{D}\). There is a
considerable risk of getting no usable data from the inference,
which makes one prefer uniform priors even when more information is
available.
The problem is exacerbated in the case of proposals. The issue is that
the algorithm has no room to consult the proposal distributions
outside of the prior, and using a prior taken out of ``thin air'' with
nested sampling is a recipe for disaster. However, in the next
section we discuss how one can mitigate these issues and use
a proposal as an aspect of a prior.
\subsection{Power posterior repartitioning and unrepresentative priors}
\label{sec:orgd933e18}
\textbf{NB:} From this section onward we shall adopt the following
notation. \(\pi\) and \({\cal L}\) with similar decorations (index,
diacritics), belong to the same specification of the model. Models
using the uniform prior are special, in that they obtain the most
accurate posterior and evidence, thus they are represented with an
over-bar (the plot of a uniform prior in 1D is a horizontal
line). Hats delineate the consistent partitions that incorporate
the proposal (the hat represents the peak(s) often present in
informative proposals).
We are working under the assumption that \(\pi(\bm{\theta})\) is an
informative, unrepresentative prior. We want to obtain the correct
posterior \(\bar{\cal P}\) without using the uniform, universally
representative reference prior \(\bar{\pi}\), because it is often the
least informative. To avoid loss of precision and mitigate prior
imprinting, \cite{chen-ferroz-hobson} have proposed introducing the
parameter \(\beta\) to control the breadth of the informative prior:
\begin{equation}
\label{eq:autopr-prior}
\hat{\pi}(\bm{\theta};\beta) = \cfrac{\pi(\bm{\theta})^{\beta}}{Z(\beta)\{\pi\}},
\end{equation}
(see \cref{fig:ppr}) where \(Z(\beta)\{\pi\}\), a functional of
\(\pi (\bm{\theta})\), is the normalisation factor of
\(\hat{\pi}(\bm{\theta};\beta)\), i.e.
\begin{equation}
Z(\beta)\{\pi\} = \int_{\Psi} \pi(\bm{\theta})^{\beta}d\bm{\theta}.
\end{equation}
In their prescription, the likelihood changes to
\begin{equation}
\hat{\cal L}(\bm{\theta}; \beta) = {\cal L}(\bm{\theta}) Z(\beta)\{\pi\} \cdot \pi^{1-\beta}(\bm{\theta}).
\end{equation}
The new parameter \(\beta\) is treated as any other non-derived
parameter of the original theory.
\begin{figure}
\input{./illustrations/ppr.tex}
\caption{\label{fig:ppr} Demonstration of
\(\hat{\pi}(\theta; \beta)\) for different values of \(\beta\) in
one dimension. We have assumed that the original
\( \pi (\bm{\theta})\) distribution is a truncated Gaussian,
i.e.~zero outside the region \((-1, 1)\). Numerical instability,
which manifests as changes in curvature at the boundaries, is
exaggerated. The area under the curves for each \(\beta\) is
normalised to unity as in \cref{eq:autopr-prior}.}
\end{figure}
Note that
\({\cal L}(\bm{\theta})\pi (\bm{\theta}) = \hat{\cal L}(\bm{\theta})
\hat{\pi} (\bm{\theta})\) by construction. Thus, from \cref{eq:bayes}
the posterior and evidence corresponding to
\(\hat{\cal L}(\bm{\theta};\beta)\) and
\(\hat{\pi} (\bm{\theta};\beta)\) will be the same as
\({\cal P} (\bm{\theta})\) and \({\cal Z}\), which correspond to the
original \(\pi(\bm{\theta})\) and \({\cal L}(\bm{\theta})\).
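This invariance is straightforward to verify numerically. The sketch below (our own illustration, with an arbitrary truncated-Gaussian prior and Gaussian likelihood on a one-dimensional grid) constructs \(\hat{\pi}\) and \(\hat{\cal L}\) as above and checks that the product \({\cal L}\pi\) is unchanged for several values of \(\beta\):

```python
import numpy as np

theta = np.linspace(-1, 1, 20_001)           # prior support
d_theta = theta[1] - theta[0]

pi = np.exp(-0.5 * (theta / 0.3) ** 2)       # truncated Gaussian prior
pi /= np.sum(pi) * d_theta                   # normalise on (-1, 1)
L = np.exp(-0.5 * ((theta - 0.4) / 0.1) ** 2)

for beta in (0.0, 0.5, 1.0, 2.0):
    Z_beta = np.sum(pi ** beta) * d_theta    # normalisation Z(beta){pi}
    pi_hat = pi ** beta / Z_beta             # repartitioned prior
    L_hat = L * Z_beta * pi ** (1 - beta)    # repartitioned likelihood
    assert np.allclose(L_hat * pi_hat, L * pi)   # product is invariant
print("L*pi is invariant under repartitioning")
```

The invariance is exact by construction; only the split between prior and likelihood, and hence the sampler's behaviour, changes with \(\beta\).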
If the informative prior \(\pi (\bm{\theta})\) is less
representative of the posterior \(\bar{\cal P} (\bm{\theta})\), the
error in \(\hat{\cal Z}\) is larger. Hence, while we do not violate
\cref{eq:bayes} directly, \(\bar{\cal Z}\) can differ more from
\({\cal Z}\) while remaining within the margin of error, and similarly
\({\cal P}(\bm{\theta}) \ne \bar{\cal P}(\bm{\theta})\). This is
where the new parameter comes into play: \(\hat{\pi}\) may become
representative for some value of \(\beta = \beta_{R}\). Values of
\(\beta\) close to \(\beta_{R}\) correlate with higher likelihoods,
so the sampler prefers them. Hence, the system will converge to a
state where \({\cal P} (\bm{\theta})\) is represented in
\(\hat{\pi} (\bm{\theta};\beta)\)\footnote{Technically we obtain \(\hat{\cal P} (\bm{\theta};\beta)\) which, when marginalised over
\(\beta\), yields \({\cal P} (\bm{\theta}) = \int \hat{\cal P}
(\bm{\theta};\beta) d \beta\) --- the correct posterior.}. As a
consequence, we reduce the errors and obtain the same result as
we would have with a less informative but more representative
prior.
\cite{chen-ferroz-hobson} dubbed this scheme
\textbf{\emph{automatic power posterior repartitioning}} (PPR),
because the choice of \(\beta\rightarrow\beta_{R}\) is automatic. It
mitigates the loss of precision, and thus accuracy, for unrepresentative
informative priors \(\pi\), by sacrificing performance.
\section{Discoveries}
\label{sec:org85f6b5d}
\subsection{The trouble with proposals}
\label{sec:org8bc2bd5}
Nested sampling is different from Metropolis-Hastings-Gibbs and
many other Markov-Chain Monte Carlo methods. Often, such algorithms
are designed with a separate input that is the proposal: an initial
guess that guides the algorithm towards the right answer. For
nested sampling no such provisions are in place. The only input
where such information can be used is the prior. Thus, to
understand why one can't use proposals directly, we must first
address why informative priors are avoided.
From \cref{eq:bayes}, we can see that changing only the prior \(\pi\)
necessarily leads to changes in both \({\cal P}\) and \({\cal Z}\). For
example if \(\pi\) is a Gaussian centered at
\(\bm{\theta}=\bm{\mu}_{\pi}\) and \({\cal L}\) is a Dirac
\(\delta\)-function peaked at \(\bm{\theta}=\bm{\mu}_{{\cal L}}\), with
\(\bm{\mu}_{\pi}\) sufficiently far from \(\bm{\mu}_{{\cal L}}\) then
the posterior will necessarily have peaks at both \(\bm{\mu}_{\pi}\)
and \(\bm{\mu}_{{\cal L}}\). This is an example of prior imprinting
and is a necessary part of a Bayesian view of statistics. For a
Bayesian, the prior information is no less valuable than the
information inferred from the dataset \(\mathfrak{D}\), and the
posterior represents \emph{all} of our best knowledge.
The problem however, is the \emph{prejudiced sampler}. Because
nested sampling chooses live points with probability proportional
to the prior, the probability of a point being drawn from the
likelihood peak can be made arbitrarily small. In fact, if
\(\bm{\mu}_{{\cal L}}\) and \(\bm{\mu}_{\pi}\) are separated by more
than five standard deviations of the prior Gaussian, thirty million
samples will be drawn from \(\bm{\mu}_{\pi}\) before a single point
is drawn on the circle containing \(\bm{\theta} = \bm{\mu}_{{\cal
L}}\).
An apt analogy can be drawn with the Venera-14 mission
\citep{siddiqi2018beyond}. Upon landing, due to a number of
unfortunate coincidences, the lander took its one and only
measurement of Venusian soil from one of its own lens caps. As a
result, we have obtained objectively correct information from
Venus: a sample of an object on its surface. However, the value of
a measurement that in effect gauged the compressibility of
Earth-made rubber leaves much to be desired.
Before \cite{chen-ferroz-hobson} the best solution was to use a
uniform prior that included both \(\bm{\mu}_{\pi}\) and
\(\bm{\mu}_{{\cal L}}\). The computational cost of inference is so
high that the risk of gaining nothing from a dataset is
untenable. Thus discarding all prior information in hopes of
inferring some from the dataset is preferable to using the
information in \(\pi\).
Thus, proposals are not even considered for use with nested
sampling. Since proposals may be crude approximations, we may
obtain far worse than no new information. Any potential benefit in
performance or precision is far outweighed by the unreliable
posterior. We do, however, have one method of mitigating these
problems --- automatic posterior repartitioning
\citep{chen-ferroz-hobson}. Though the connection may seem unclear
at this stage, automatic posterior repartitioning schematically
allows one to represent infinitely many pairs of \(\pi\) and
\({\cal L}\) which all produce the same evidence and
posterior. If one can encode the proposal as a prior that obtains
the same evidence and posterior as the prior one started with,
one could, in theory, obtain all the benefits of a more
informative prior, while also obtaining information that pertains to
the model in question rather than merely repeating the information
provided as a guess.
\subsection{How intuitive proposals accelerate convergence}
\label{sec:orgcc977c0}
Consider the following premise: we are given a model \({\cal M}\),
for which our prior \(\pi\) is not the uniform
\(\bar{\pi}(\bm{\theta})\). Instead, usually from other
sources, e.g.~other inferences or physical reasoning, we know that
\begin{equation}
\pi (\bm{\theta}) = f(\bm{\theta}; \bm{\mu}, \bm{\Sigma}),
\label{eq:bias}
\end{equation}
which is representative of the posterior \(\bar{\cal
P}(\bm{\theta})\). Here, the probability density function \(f\) is
parameterised by \(\bm{\mu}\) in its location and \(\bm{\Sigma}\)
in its breadth. In order to obtain the same result as one would have
with the less informative uniform prior \(\bar{\pi}(\bm{\theta})\),
one needs to correct the likelihood \({\cal L}\). Recall that the
reason why PPR obtains the same posterior \(\bar{\cal
P}(\bm{\theta})= \hat{\cal P}(\bm{\theta})\) as one would have
using
\(\bar{\pi} (\bm{\theta}) = \text{Const.}\)
is because \(\hat{\cal
L} (\bm{\theta};\beta)\) and \(\hat{\pi} (\bm{\theta};\beta)\) are
a \textbf{\emph{consistent (re)partitioning}} of \(\bar{\cal Z}\)
and \({\cal P}(\bm{\theta})\). That is:
\begin{equation}
\label{eq:partitioning}
\int_{\Psi} \hat{\cal L} (\hat{\bm{\theta}}) \hat{\pi} (\bm{\hat{\theta}}) d\hat{\bm{\theta}} = \int_{\Psi}\bar{\pi} (\bm{\theta}) \bar{\cal L} (\bm{\theta}) d\bm{\theta} = \bar{\cal Z},
\end{equation}
where in the case of PPR
\(\hat{\bm{\theta}} = (\theta_{1}, \theta_{2}, \ldots, \theta_{n},
\beta)\). \Cref{eq:partitioning} holds if
\begin{equation}
\label{eq:partitioning-p}
\hat{\cal L}(\bm{\theta};\beta) \hat{\pi}(\bm{\theta};\beta) = \bar{\cal L}(\bm{\theta})\bar{\pi}(\bm{\theta})
\end{equation}
for all \(\beta\), by \cref{eq:bayes}. Note that
\cite{chen-ferroz-hobson} have used \cref{eq:partitioning-p} as the
primary expression. Following their convention, we shall sometimes
refer to consistent partitions as posterior repartitioning, rather
than evidence repartitioning.
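As a concrete sanity check, the consistency condition can be verified numerically on a one-dimensional toy problem (a sketch with our own illustrative names; the tempered Gaussian stands in for a PPR-style \(\hat{\pi}\)):

```python
# Numerical sketch of a consistent (re)partitioning: a pair (pi_hat, L_hat)
# constructed so that L_hat * pi_hat = L_bar * pi_bar pointwise yields the
# same evidence as the reference pair. The 1-D setup is illustrative only.
from math import exp, pi, sqrt

def gaussian(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

LO, HI, N = 0.0, 10.0, 20_000
xs = [LO + (HI - LO) * i / N for i in range(N + 1)]
dx = (HI - LO) / N

pi_bar = lambda x: 1.0 / (HI - LO)          # uniform reference prior
L_bar = lambda x: gaussian(x, 7.0, 0.3)     # reference likelihood

# A PPR-style tempered proposal, normalised to a proper density.
BETA = 0.8
norm = sum(gaussian(x, 2.0, 1.0) ** BETA for x in xs) * dx
pi_hat = lambda x: gaussian(x, 2.0, 1.0) ** BETA / norm
L_hat = lambda x: L_bar(x) * pi_bar(x) / pi_hat(x)   # consistency condition

Z_bar = sum(L_bar(x) * pi_bar(x) for x in xs) * dx
Z_hat = sum(L_hat(x) * pi_hat(x) for x in xs) * dx
assert abs(Z_bar - Z_hat) < 1e-9            # the evidence is preserved
```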
By using a more informative prior in this way, we accelerate
convergence, because each iteration obtains a larger evidence
estimate, so fewer are needed to reach the termination point
(see~\cref{fig:benchmark}). There is a competing mechanism: the
evidence estimates accumulate fewer errors, so inference proceeds
longer before the precision loss triggers termination
(\cref{fig:higson}). Thus repartitioning reaches a more precise
result more quickly. Better still, the obtained precision can be
sacrificed to further accelerate inference.
\subsubsection{Example: Intuitive proposal posterior repartitioning}
\label{sec:org17469c3}
Suppose that one has obtained the posterior \({\cal
P}(\bm{\theta})\) from a different inference, which could be
nested sampling with a uniform prior, or Hamiltonian Monte Carlo
(or a theoretical approximation). Thus,
\begin{subequations}
\begin{equation}
\label{eq:iPPR}
\hat{\pi}(\bm{\theta}) = f(\bm{\theta}; \bm{\mu}, \bm{\Sigma}) = {\cal P}(\bm{\theta}),
\end{equation}
is an informative prior that represents our knowledge, but might not
represent the posterior. We call it an \textbf{\emph{(intuitive)
proposal}}. However, we wish to avoid prejudicing the sampler and
use the (uniform) reference prior $\bar{\pi}(\bm{\theta})$, with
reference likelihood $\bar{\cal L}(\bm{\theta})$.
To obtain with $\hat{\pi}(\bm{\theta})$ the same posterior and
evidence as one would have with $\bar{\pi}(\bm{\theta})$ and
$\bar{\cal L}(\bm{\theta})$, the partitioning of the evidence needs
to be \textbf{\emph{consistent}} with the reference. Specifically:
\begin{equation}
\label{eq:ippr-l}
\hat{\cal L}(\bm{\theta}) = \frac{\bar{\pi}(\bm{\theta}) \bar{\cal L}(\bm{\theta})}{ f(\bm{\theta}; \bm{\mu}, \bm{\Sigma})}.
\end{equation}
\end{subequations}
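To illustrate the limiting behaviour (a toy sketch under our own assumptions): if the proposal \(f\) happens to equal the true posterior, the repartitioned likelihood collapses to the constant \(\bar{\cal Z}\), leaving the sampler nothing left to learn:

```python
# Toy 1-D sketch: with a uniform reference prior and a normalised Gaussian
# reference likelihood, the true posterior equals L_bar itself. Taking the
# iPPR proposal f equal to that posterior makes L_hat = pi_bar * L_bar / f
# constant and equal to the evidence. Names are illustrative only.
from math import exp, pi, sqrt

def gaussian(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

LO, HI = -50.0, 50.0
pi_bar = 1.0 / (HI - LO)                   # uniform reference prior density
L_bar = lambda x: gaussian(x, 3.0, 1.0)    # normalised reference likelihood

# Evidence: Z_bar = pi_bar * (integral of L_bar) ~= pi_bar, by normalisation.
Z_bar = pi_bar

f = lambda x: gaussian(x, 3.0, 1.0)        # proposal == true posterior here
L_hat = lambda x: pi_bar * L_bar(x) / f(x)

for x in (-1.0, 3.0, 10.0):
    assert abs(L_hat(x) - Z_bar) < 1e-15   # L_hat is flat at the evidence
```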
We call this scheme \textbf{\emph{intuitive proposal posterior
repartitioning}}\footnote{More accurately evidence repartitioning,
which is equivalent in simple cases.} (iPPR). It is the fastest
possible and the least robust consistent partitioning
scheme. While we have technically addressed the change in \({\cal
P}\) due to a different prior, we have not addressed the problem of
\(\hat{\pi}\) being (potentially) unrepresentative of \(\bar{\cal
P}\). In the example already considered in \cref{sec:prejudice}, we
will have reduced prior imprinting, but not addressed the
prejudice at all: the probability of sampling from the true likelihood
peak is still minuscule. By contrast, we have seen that automatic
power posterior repartitioning can mitigate both issues. What iPPR
lacks is a mechanism for extending its representation. Rather
than attempt a modification akin to power partitioning, in
\cref{sec:isomixtures} we shall provide this mechanism as
completely external to iPPR and unleash its potential.
\subsubsection{General automatic posterior repartitioning}
\label{sec:orgd059e47}
In this section, we look at the family of prescriptions similar to PPR
and iPPR, called consistent partitionings, and note which schemes are
more useful for the task of accelerating nested sampling without
biasing the posterior. We begin by noting that \cref{eq:partitioning}
alone does not guarantee the correct posterior and evidence.
We shall consider a general consistent partitioning
\(\hat{\pi}, \hat{\cal L}\) with re-parametrisation
\(\hat{\bm{\theta}}\). Because \(\bm{\theta} \ne \hat{\bm{\theta}}\)
in general, the posterior \(\hat{\cal P}(\hat{\bm{\theta}})\) would not have
the same functional form as \(\bar{\cal
P}(\bm{\theta})\). Nonetheless, if inverting the parametrisation
from \(\hat{\bm{\theta}}\) to \(\bm{\theta}\) is possible, and under that
procedure \(\hat{\cal P}\) maps to \({\cal P}\), we shall say that
\(\hat{\cal P}\) marginalises to \({\cal P}\). Thus, the correct
posterior is one that marginalises to \(\bar{\cal P}\). We shall often
use \(\hat{\cal P}(\hat{\bm{\theta}})\) interchangeably with the
\({\cal P}(\bm{\theta})\) that it marginalises to.
We can rigorously prove\footnote{in a later publication}, that the
following conditions are necessary for a consistent partitioning
to yield the correct posterior and evidence through Bayesian
inference.
\begin{enumerate}
\item \textbf{Consistency}. The partitioning is consistent,
i.e.~it satisfies \cref{eq:partitioning}. \label[Property]{norm-prop}
\item \textbf{Representation}. In prior hyperspace
$\hat{\Psi} \supset \Psi$ there exists a subspace
$\Psi_{R} \subset \hat{\Psi}$, such that for all
$\hat{\bm{\theta}}\in \Psi_{R}$, \( {\cal P}(\bm{\theta})\) is
represented in \( \hat{\pi} (\bm{\hat{\theta}})\). In other words,
the re-parameterised prior includes a representative
configuration. \label[Property]{spec-prop}
\item \textbf{Convergence}. The sampling favours representative
configurations $\hat{\bm{\theta}} \in
\Psi_{R}$. \label[Property]{vconv-prop}
\item \textbf{Objectivity}. The prior bias (towards
\(\hat{\pi}(\bm{\hat{\theta}})\)) is weaker than the posterior bias
(towards \(\hat{\cal P}(\bm{\hat{\theta}})\)). \label[Property]{obj-prop}
\end{enumerate}
Note that these properties are sensitive to the sampling algorithm. For
example, for inference by uniform-rasterised integration of
\({\cal Z}\), all properties follow from \cref{eq:partitioning-p}. Not so
for a class of algorithms that estimate \({\cal Z}\) by controlled error
propagation and approximation, e.g.~nested sampling. Thus,
understanding the circumstances wherein these conditions are violated
may clarify the conditions under which both PPR and iPPR fail to produce
the expected result.
Firstly, both schemes satisfy \cref{norm-prop} by construction. iPPR satisfies
\cref{spec-prop} if and only if \(\hat{\pi} (\bm{\theta})\)
represented the correct posterior to begin with, in which case
\(\Psi_{R} = \Psi\). \Cref{vconv-prop} follows from the correctness
proof of nested sampling \citep{Skilling2006} and
\cref{spec-prop}. In \cref{sec:autopr} we have shown that PPR
satisfies \cref{spec-prop}, where
\(\Psi_{R} = \{ \beta = \beta_{R} = \text{Const.}\}\), if \(\beta_{R}\)
exists. There is always at least one:
\(\Psi_{R} = \text{Locus} \{ \beta_{R}=0 \} \cap \Psi\), but we are
interested in values of \(\beta_{R} > 0\), as such priors are more
informative. In that section we have also provided an intuitive explanation
for why PPR satisfies \cref{vconv-prop}.
However, consistency alone does not guarantee the correct posterior:
in \cref{fig:convergence}, we see that both the \(\theta_{0}\) and \(\theta_{2}\)
marginalised posteriors are offset from the correct result obtained
using \(\bar{\pi}(\bm{\theta})=\text{Const.}\) This illustrates the
importance of \cref{obj-prop}, as the test case of \cref{fig:convergence}
was constructed specifically to violate it.
\subsection{Isometric mixtures of repartitioning schemes}
\label{sec:org7b95d8a}
In this section we consider two methods of combining several
proposals (consistent partitions) into one consistent
partition. Identifying, via \cref{eq:bayes}, the posterior to which
points in \(\Psi\) correspond as a metric, we name these
\emph{\textbf{isometric}} mixtures.
\subsubsection{Additive isometric mixtures}
\label{sec:org3881130}
Consider \(m\) consistent repartitioning schemes of the same
posterior \(\bar{\cal P}(\bm{\theta})\):
\begin{equation}
\label{eq:collection-of-models}
\bar{\cal L}(\bm{\theta}) \bar{\pi}(\bm{\theta})= \hat{\cal L}_{1}(\bm{\theta}) \hat{\pi}_{1}(\bm{\theta}) = \ldots =\hat{\cal L}_{m}(\bm{\theta}) \hat{\pi}_{m}(\bm{\theta}).
\end{equation}
Their \textbf{\emph{isometric mixture}} is a consistent
partitioning that involves information from each constituent prior,
but preserves the posterior and evidence of its component partitions.
For example: an \textbf{\emph{additive mixture}} (\cref{fig:additive}),
defined as
\begin{subequations}
\begin{alignat}{2}
\hat{\pi}(\bm{\theta}; \bm{\beta}) = &\sum_{i} \beta_{i} \hat{\pi}_{i}(\bm{\theta}),\label{eq:additive-mix}\\
\hat{{\cal L}}(\bm{\theta}; \bm{\beta}) = &\frac{\sum_{i} \beta_{i} \hat{\pi}_{i}(\bm{\theta}) \hat{\cal L}_{i}(\bm{\theta})}{\sum_{i} \beta_{i} \hat{\pi}_{i}(\bm{\theta})},
\end{alignat}
\end{subequations}
parameterised by
\(\bm{\beta} = (\beta_{1}, \beta_{2}, \ldots, \beta_{m})\) where each
\(\beta_{i} \in [0,1]\). It is itself a consistent partitioning,
i.e.~\emph{\textbf{isometric}}, if and only if
\(\sum_{i} \beta_{i} = 1\).
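The ``if'' direction can be checked numerically (a toy sketch with two components and our own naming):

```python
# Sketch: the additive mixture of two consistent partitions of the same
# product L_bar * pi_bar preserves that product pointwise exactly when the
# mixture weights sum to one. Toy 1-D components; names are illustrative.
from math import exp, pi, sqrt

def gaussian(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

pi_bar = lambda x: 0.1                     # uniform reference prior on [0, 10]
L_bar = lambda x: gaussian(x, 5.0, 1.0)    # reference likelihood

pi_1, L_1 = pi_bar, L_bar                  # partition 1: the reference itself
pi_2 = lambda x: gaussian(x, 5.0, 2.0)     # partition 2: a Gaussian proposal
L_2 = lambda x: L_bar(x) * pi_bar(x) / pi_2(x)

def mixture_product(x, b1, b2):
    """L_hat * pi_hat of the additive mixture with weights (b1, b2)."""
    pi_hat = b1 * pi_1(x) + b2 * pi_2(x)
    L_hat = (b1 * pi_1(x) * L_1(x) + b2 * pi_2(x) * L_2(x)) / pi_hat
    return L_hat * pi_hat

x = 4.2
target = L_bar(x) * pi_bar(x)
assert abs(mixture_product(x, 0.3, 0.7) - target) < 1e-12  # weights sum to 1
assert abs(mixture_product(x, 0.3, 0.5) - target) > 1e-3   # weights sum to 0.8
```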
\begin{figure}
\input{illustrations/additive_mixtures.tex}
\caption{\label{fig:additive} An additive isometric mixture of a
Gaussian proposal and a uniform reference. Power-Gaussian added
for comparison.}
\end{figure}
Isometric mixtures are an attempt to relax some of the limitations
imposed by power posterior repartitioning. Firstly, all proposals in
PPR have to be linked by a power relation. This class always includes
a uniform prior, but not, for example, a ``wedding cake'' prior
(stepped uniform prior). Additive mixtures permit such
proposals. Moreover, in additive isometric mixtures, any consistent
partitions are compatible provided the set union of their domains
matches \(\Psi\).
However, additive mixtures have limited utility: they are slow,
difficult to implement, and more susceptible to numerical instability
than any other consistent partitioning\footnote{These claims shall
be substantiated in a more detailed publication.}. We can,
however, do much better.
\subsubsection{Stochastic superpositional isometric mixtures}
\label{sec:orgae3ae62}
One major problem with additive mixtures lies in the definition of
\(\hat{\cal L}\). Instead of having to evaluate only one of the
constituent likelihoods, we are forced to evaluate all of them. Hence,
a lower bound on time complexity:
\begin{equation}
{\cal T}\{\hat{\cal L}\} = \Omega \left( \max_{i} {\cal T}\{ {\cal L}_{i}\} \right), \label{eq:hard-cap}
\end{equation}
which is attained when the likelihoods \({\cal L}_{i}\) are all
related to the same reference (e.g.~\(\bar{\cal L}\)) with only minor
corrections, computed asynchronously to account for the different
proposals. If \({\cal L}_{i}\) and \({\cal L}_{j}\) have no common
computations to re-use, the time complexity grows to
\(\Omega\left[{\cal T}({\cal L}_{i}) + {\cal T}({\cal L}_{j})\right]\).
Another issue is that the overall likelihood depends on the prior PDFs
of the constituents. This is problematic since nested sampling
requires specification of the prior via its quantile
\citep{Skilling2006,polychord,multinest}. Function inversion is not
linear with respect to addition, so the quantile of the weighted sum
needs to be evaluated for each type of mixture individually. For a
linear combination of uniform priors, evaluating the quantile can be
performed analytically, but not in the case of two Gaussians or a Gaussian
mixed with a uniform. By contrast, the quantile of PPR with an
uncorrelated\footnote{Not so for a correlated Gaussian. Nonetheless,
every covariance matrix can be diagonalised, and this can be included
in the re-parametrisation.} Gaussian proposal is found in closed
form.
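The contrast can be sketched as follows (toy parameters and helper names of our own): the Gaussian prior quantile is available in closed form, whereas the quantile of a Gaussian/uniform additive mixture must be obtained by numerically inverting its CDF:

```python
# Sketch: closed-form quantile for a Gaussian prior versus numerical
# inversion (bisection) for a Gaussian/uniform additive mixture.
from statistics import NormalDist

prior = NormalDist(mu=0.0, sigma=1.0)

def gaussian_quantile(u):
    """Closed form, via the inverse error function under the hood."""
    return prior.inv_cdf(u)

LO, HI, BETA = -10.0, 10.0, 0.5

def mixture_cdf(x):
    """CDF of BETA * N(0, 1) + (1 - BETA) * Uniform(LO, HI)."""
    return BETA * prior.cdf(x) + (1.0 - BETA) * (x - LO) / (HI - LO)

def mixture_quantile(u, tol=1e-12):
    """No closed form: invert the CDF by bisection on [LO, HI]."""
    lo, hi = LO, HI
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mixture_cdf(mid) < u else (lo, mid)
    return 0.5 * (lo + hi)

assert abs(gaussian_quantile(0.975) - 1.95996) < 1e-4
assert abs(mixture_quantile(0.5)) < 1e-9   # the median is 0 by symmetry
```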
We thus try to avoid mathematical operations that require evaluation
of all of the constituents' priors and likelihoods. One operation
that avoids this is deterministic prior branching. This scheme has the
benefit of trivially determining the quantile of the mixture from the
component quantiles. The probability of branch choice can be tuned
using a parameter, which can be made part of \(\hat{\bm{\theta}}\)
similarly to \(\beta\) in PPR. This parametrisation provides the
mechanism needed for \cref{vconv-prop}.
Hence, we assert that a \textbf{\emph{superpositional mixture}}, defined via
the following parametrisation:
\begin{subequations}
\begin{equation}
\hat{\pi}(\bm{\theta}; \bm{\beta}) =
\begin{cases}
\hat{\pi}_{1}(\bm{\theta}) & \text{with probability } \beta_{1},\\
& \vdots\\
\hat{\pi}_{m}(\bm{\theta}) & \text{with probability } 1- \sum_{i=1}^{m-1}\beta_{i},
\end{cases}
\end{equation}
\begin{equation}
\hat{\cal L}(\bm{\theta}; \bm{\beta}) =
\begin{cases}
\hat{\cal L}_{1}(\bm{\theta}) & \text{with probability } \beta_{1},\\
&\vdots\\
\hat{\cal L}_{m}(\bm{\theta}) & \text{with probability } 1- \sum_{i=1}^{m-1}\beta_{i},
\end{cases}
\end{equation}
is isometric, if and only if
\begin{equation}
\label{eq:sspr}
\hat{\pi}(\bm{\theta}; \bm{\beta}) = \hat{\pi}_{i}(\bm{\theta}) \Leftrightarrow \hat{\cal L}(\bm{\theta}; \bm{\beta}) = \hat{\cal L}_{i}(\bm{\theta}),
\end{equation}
\end{subequations}
that is, the branches are chosen consistently.
\Cref{spec-prop} is satisfied if any of the priors \(\hat{\pi}_{i}\)
represented the posterior. \Cref{vconv-prop} is satisfied
similarly to PPR: the likelihood is determined by
\(\hat{\bm{\theta}} \supset \bm{\beta}\), so values of \(\bm{\beta}\) that lead to
higher likelihoods are favoured, ergo configurations representing
\({\cal P}\) are preferred.
Superpositional mixtures have multiple advantages when compared with
additive mixtures. Crucially, only one of \({\cal L}_{i}\) is evaluated
each time \(\hat{\cal L}\) is evaluated. As a result, ignoring the
overhead of branch choice, the worst-case time complexity is the same
if not better than the best case for additive mixtures, which has vast
implications discussed in \cref{sec:applications}.
The superpositional mixture's branch choice must be external to, and
independent from, the likelihoods and priors. For example, the prior
quantile of the mixture must branch into either of the component prior
quantiles. As a result, the end user does not need to perform any
calculations beyond the proposal quantiles themselves.
There can be many implementations of a superpositional mixture. A
natural first choice would be a quantum computer, where
\(\hat{\pi}\) and \(\hat{\cal L}\) are represented by \(m\)-level systems
entangled with each other (consistent branching) and with a classical
computer (to evaluate \({\cal L}\) and \(\pi\)). However, we can also
attain an implementation using only classical computational methods, via
a deterministic pseudo-random choice based on \(\bm{\theta}\).
The \textbf{\emph{stochastic superpositional (isometric) mixture}} of
consistent partitionings (SSIM) ensures branch consistency by requiring
\begin{equation}
\hat{\pi}(\bm{\theta}; \bm{\beta}) = \hat{\pi}_{F(\bm{\theta};
\bm{\beta})}(\bm{\theta};\bm{\beta}),
\end{equation}
where
\(F: (\bm{\theta}, \bm{\beta}) \mapsto i \in \{1, 2, \ldots, m\}\). In
our implementation it is a niche-apportionment random number generator
(sometimes called the broken stick model), seeded with the numerical
\texttt{hash} of the vector \(\bm{\theta}\), illustrated in
\cref{fig:mixture}.
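A minimal sketch of such a branch-choice function \(F\) (hash-seeded broken-stick selection; the code and names are our illustration of the idea, not the exact implementation used here):

```python
# Sketch of deterministic branch choice for SSIM: seeding a pseudo-random
# generator with hash(theta) makes the branch a pure function of theta, so
# the prior and the likelihood evaluated at the same theta always agree.
import random

def choose_branch(theta, betas):
    """Broken-stick selection of a branch index for parameters theta.

    betas holds the probabilities of branches 0..m-2; the final branch
    receives the remaining probability 1 - sum(betas).
    """
    rng = random.Random(hash(theta))   # deterministic in theta
    u = rng.random()
    acc = 0.0
    for i, beta in enumerate(betas):
        acc += beta
        if u < acc:
            return i
    return len(betas)                  # the remainder branch

theta = (0.3, 1.7, -2.2)
branch = choose_branch(theta, [0.25, 0.25])
assert branch == choose_branch(theta, [0.25, 0.25])  # consistent branching
assert branch in (0, 1, 2)
```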
Superpositional mixtures are superior in robustness and ease of
implementation. They do, nevertheless, come with one drawback. As a
result of branching, the likelihood \(\hat{\cal L}\) visible to the
sampler is no longer continuous (\cref{fig:mixture-3d}). Thus a
nested sampling implementation that relies on said continuity will
have undefined behaviour. \texttt{PolyChord}'s slice sampling appears
unaffected by the discontinuity, but other samplers may not be.
\begin{figure}
\input{./illustrations/mixture_2.tex}
\input{./illustrations/mixture_3.tex}
\input{./illustrations/mixture_4.tex}
\caption{An example of mixture repartitioning. The mixture is not
normalised to emphasise the coincidence of values with both the
uniform distribution and a Gaussian. $\beta$ controls the
probability of belonging to the Gaussian in the stochastic
mixture. Additionally, the resolution is deliberately reduced, to
contrast behaviour of all three at the truncation
boundary. \label{fig:mixture}}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.99\columnwidth]{./illustrations/SSIM_3d.pdf}
\caption{An illustration of SSIM in two dimensions. Colour represents the value of $\pi(\bm{\theta})$. As a result of nested sampling, nucleation of the representative phase is dynamically favoured.}
\label{fig:mixture-3d}
\end{figure}
\subsection{On notation and mental models}
\label{sec:org9e7115b}
It is an opportune time to discuss a subtlety that we have previously
neglected. \cite{chen-ferroz-hobson} originally named the technique
automatic posterior repartitioning, which evokes a clear mental
model. Assuming that the original definitions of \(\pi\) and
\(\mathcal{L}\) were a partitioning of only the posterior, a new value
of \(\beta\) produces a new partitioning, thus it re-partitions the
posterior. The extra parameter is a time-like object, with a clear
direction of evolution, in that any change to its value causes a
re-partitioning of the model.
While this mental model had served well for the purposes of solving
the unrepresentative prior problem, it is severely limiting to the
effect of introducing proposals.
The first shortcoming of this mental model is that the expression
``re-partitioning'' implies the mutability of the posterior. It is not
mutable. In fact, the posterior obtained via re-partitioning
has a strict functional dependence on the extra parameter, and is thus
strictly a different function. Meaningful information is lost when we project
the repartitioned result back to the original prior space, albeit only a
Bayesian would regard it as such.
A second, deeper problem is that the notation inherently places the
emphasis on the posterior. In reality, automatic posterior repartitioning is a
necessary, but insufficient, condition for consistent partitioning. As
long as no coordinate transformation is performed, the difference is
negligible. However, for more complicated cases, e.g.~re-sizeable
prior space schemes, the posterior repartitioning is
under-determined: a naive extension of the form
\begin{equation}
\label{eq:naive-extension}
\pi(\theta) \mathcal{L}(\theta) = \hat{\pi}(\theta) \mathcal{\hat{L}}(\theta)
\end{equation}
does not, and indeed cannot, produce the expected result. One can prove (by considering a reference
prior space from which all prior spaces of the same dimensionality
derive via coordinate transformation), that the correct expression is
actually one that preserves the evidence differential element.
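Schematically (our own formulation of this claim, in the notation of \cref{eq:partitioning}), the condition reads
\begin{equation}
\hat{\pi}(\hat{\bm{\theta}})\, \hat{\cal L}(\hat{\bm{\theta}})\, d\hat{\bm{\theta}} = \bar{\pi}(\bm{\theta})\, \bar{\cal L}(\bm{\theta})\, d\bm{\theta},
\end{equation}
so that under a coordinate transformation the product \(\hat{\pi}\hat{\cal L}\) must absorb the Jacobian \(\left|\partial \bm{\theta}/\partial \hat{\bm{\theta}}\right|\), rather than remain pointwise equal as in \cref{eq:naive-extension}.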
What we propose is a much more general world-view: a more accurate
and expressive model. A consistent partitioning involves specifying a
hyperspace that includes the original prior space. The partitioning
into \(\pi\) and \(\mathcal{L}\) is done once only, when the Bayesian
inference problem is set up. The original posterior is a function in
the original prior (sub)space. The posterior we obtain as a result
marginalises to the original in the appropriate projections, and the
evidence to which it corresponds is the same as the original.
One might object that this is not a good model for the superpositional
mixture, for which a dynamical analogy would be much more appropriate, as
the parameters really only control the partitioning. This point is
partially valid. We would advocate seeing superpositions as an
extension into a Hilbert space of vectors that are themselves
spaces: not easy to imagine, but not a challenge to someone fluent in
quantum theory. A better analogy would be to imagine the spaces for
each individual prior side by side, with a few parameters that
control the relative ``heights'' of these spaces, or the activation energy
for diffusion between them. This is a middle ground that retains the generality of
treating the entire problem in a hyperspace, but also admits a dynamical
analogy.
Arguments can be made either way, but an important consideration is to
have a model that gives accurate predictions first, and is easy to
imagine second.
\section{Methodology of Measurements}
\label{sec:org46183d2}