forked from avehtari/BDA_course_Aalto
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathBDA_notes_ch10.tex
260 lines (214 loc) · 11.1 KB
/
BDA_notes_ch10.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
\documentclass[a4paper,11pt,english]{article}
\usepackage{babel}
\usepackage[latin1]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{newtxtext} % times
\usepackage{amsmath}
\usepackage[varqu,varl]{inconsolata} % typewriter
\usepackage{microtype}
\usepackage{url}
\urlstyle{same}
\usepackage{color}
\usepackage[bookmarks=false]{hyperref}
\hypersetup{%
bookmarksopen=true,
bookmarksnumbered=true,
pdftitle={Bayesian data analysis},
pdfsubject={Comments},
pdfauthor={Aki Vehtari},
pdfkeywords={Bayesian probability theory, Bayesian inference, Bayesian data analysis},
pdfstartview={FitH -32768},
colorlinks=true,
linkcolor=blue,
citecolor=blue,
filecolor=blue,
urlcolor=blue
}
% if not draft, smaller printable area makes the paper more readable
\topmargin -4mm
\oddsidemargin 0mm
\textheight 225mm
\textwidth 160mm
%\parskip=\baselineskip
\DeclareMathOperator{\E}{E}
\DeclareMathOperator{\Var}{Var}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\Sd}{Sd}
\DeclareMathOperator{\sd}{sd}
\DeclareMathOperator{\Bin}{Bin}
\DeclareMathOperator{\Beta}{Beta}
\DeclareMathOperator{\Invchi2}{Inv-\chi^2}
\DeclareMathOperator{\NInvchi2}{N-Inv-\chi^2}
\DeclareMathOperator{\logit}{logit}
\DeclareMathOperator{\N}{N}
\DeclareMathOperator{\U}{U}
\DeclareMathOperator{\tr}{tr}
%\DeclareMathOperator{\Pr}{Pr}
\DeclareMathOperator{\trace}{trace}
\definecolor{navyblue}{rgb}{0,0,0.5}
\renewcommand{\emph}[1]{\textcolor{navyblue}{#1}}
\definecolor{darkgreen}{rgb}{0,0.3922,0}
\pagestyle{empty}
\begin{document}
\thispagestyle{empty}
\section*{Bayesian data analysis -- reading instructions ch 10}
\smallskip
{\bf Aki Vehtari}
\smallskip
\subsection*{Chapter 10}
Outline of the chapter 10
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item 10.1 Numerical integration (overview)
\item 10.2 Distributional approximations (overview, more in Chapter 4 and 13)
\item 10.3 Direct simulation and rejection sampling (overview)
\item 10.4 Importance sampling (used in PSIS-LOO discussed later)
\item 10.5 How many simulation draws are needed? (Important! Ex 10.1 and 10.2)
\item 10.6 Software (can be skipped)
\item 10.7 Debugging (can be skipped)
\end{list}
Sections 10.1-10.4 give overview of different computational
methods. Some of then have been already used in the book.
Section 10.5 is very important and related to the exercises.\\
R and Python demos at \url{https://avehtari.github.io/BDA_course_Aalto/demos.html}
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item demo10\_1: Rejection sampling
\item demo10\_2: Importance sampling
\item demo10\_3: Sampling-importance resampling
\end{list}
Find all the terms and symbols listed below. When reading the chapter,
write down questions related to things unclear for you or things you
think might be unclear for others.
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item unnormalized density
\item target distribution
\item log density
\item overflow and underflow
\item numerical integration
\item quadrature
\item simulation methods
\item Monte Carlo
\item stochastic methods
\item deterministic methods
\item distributional approximations
\item crude estimation
\item direct simulation
\item grid sampling
\item rejection sampling
\item importance sampling
\item importance ratios/weights
\end{list}
\subsection*{Numerical accuracy of computer arithmetic}
Many models use continuous real valued parameters. Computers have finite memory and thus the continuous values are also presented with finite number of bits and thus with finite accuracy. Most commonly used presentations are floating-point presentations that try to have balanced accuracy over the range of values where it mostly matters. As the the presentation has finite accuracy there are limitations, for example, with IEC 60559 floating-point (double precision) arithmetic used in current R
\begin{itemize}
\item the smallest positive floating-point number $x$ such that $1 + x \neq 1$ is $2.220446\cdot 10^{-16}$
\item the smallest non-zero normalized floating-point number is $2.225074\cdot 10^{-308}$
\item the largest normalized floating-point number $1.797693\cdot 10^{308}$
\item the largest integer which can be represented is $2^{31} - 1 = 2147483647$
\item see more at \url{https://stat.ethz.ch/R-manual/R-devel/library/base/html/zMachine.html}
\end{itemize}
Article by Goldberg (1991) "What Every Computer Scientist Should Know About Floating-Point Arithmetic" \url{https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html} provides nice overvew of floating-point arithmetic and how the computations should be arranged for improved accuracy.
Lecture notes by Geyer (2020) "Stat 3701 Lecture Notes: Computer Arithmetic" \url{https://stat.umn.edu/geyer/3701/notes/arithmetic.html} provide code examples in R illustrating the most common issues in floating-point arithmetic including examples similar shown in the BDA course lecture.
Stan User Guide Chapter 15 \url{https://mc-stan.org/docs/2_26/stan-users-guide/floating-point-arithmetic.html} discusses floating point arithmetic in context of Stan.
\subsection*{Draws and sample}
A group of draws is a sample. A sample can consist of one draw, and
thus some people use the word sample for both single item and for the
group. For clarity, we prefer separate words for a single item (draw)
and for the group (sample).
\subsection*{How many digits should be displayed}
\begin{itemize}
\item Too many digits make reading of the results slower and give
false impression of the accuracy.
\item Don't show digits which are just random noise. You can use
Monte Carlo standard error estimates to check how many digits are
likely to stay the same if the sampling would be continued.
\item Show meaningful digits given the posterior uncertainty. You can compare posterior standard error or posterior intervals to the mean value. Posterior interval length can be used to determine also how many digits to show for the interval endpoints.
\item Example: The mean and 90\% central posterior interval for
temperature increase C$^\circ$/century (see the slides for the
example) based on posterior draws:
\begin{itemize}
\item {\color{red} 2.050774 and $[$0.7472868 3.3017524$]$} (too many digits)
\item {\color{darkgreen} 2.1 and $[$0.7 3.3$]$} (good
compared to the interval length)
\item {\color{navyblue} 2 and $[$1 3$]$} (depends on the context)
\end{itemize}
\item Example: The probability that temp increase is positive
\begin{itemize}
\item {\color{red} 0.9960000} (too many digits)
\item {\color{navyblue} 1.00} (depends on the context. 1.00 hints
it's not exactly 1, but larger than 0.99)
\item With 4000 draws MCSE $\approx$ 0.002. We could report that
probability is {\color{darkgreen} very likely larger than
0.99}, or sample more to justify reporting three digits
\item For probabilities close to 0 or 1, consider also when the
model assumption justify certain accuracy
\end{itemize}
\item When reporting many numbers in table, for aesthetics reasons,
it may be sometimes better for some numbers to show one extra or
one too few digits compared to the ideal.
\item Often it's better to plot the whole posterior density in
addition of any summaries, as summaries always loose some
information content.
\item For your reports: Don't be lazy and settle for the default
number of digits in R or Python. Think for each reported value
how many digits is sensible.
\end{itemize}
\subsection*{Quadrature}
Sometimes `quadrature' is used to refer generically to any numerical
integration method (including Monte Carlo), sometimes it is used to
refer just to deterministic numerical integration methods.
\subsection*{Rejection sampling}
Rejection sampling is mostly used as a part of fast methods for
univariate sampling. For example, sampling from the normal
distribution is often made using Ziggurat method, which uses a
proposal distribution resembling stairs.
Rejection sampling is also commonly used for truncated distributions,
in which case all draws from the truncated part are rejected.
\subsection*{Importance sampling}
Popularity of importance sampling is increasing. It is used, for
example, as part of other methods as particle filters and pseudo
marginal likelihood approaches, and to improve distributional
approximations (including variational inference in machine learning).
Importance sampling is useful in importance sampling leave-one-out
cross-validation. Cross-validation is discussed in Chapter 7 and
importance sampling leave-one-out cross-validation is discussed in
the article
\begin{itemize}
\item Aki Vehtari, Andrew Gelman and Jonah Gabry (2016). Practical
Bayesian model evaluation using leave-one-out cross-validation and
WAIC. In Statistics and Computing, 27(5):1413--1432. arXiv preprint
arXiv:1507.04544 \url{<http://arxiv.org/abs/1507.04544>}
\end{itemize}
After the book was published, we have developed Pareto smoothed
importance sampling which is more stable than plain importance
sampling and has very useful Pareto-$k$ diagnostic to check the
reliability
\begin{itemize}
\item Aki Vehtari, Daniel Simpson, Andrew Gelman, Yuling Yao, and
Jonah Gabry (2019). Pareto smoothed importance sampling. arXiv
preprint arXiv:1507.02646. \url{<http://arxiv.org/abs/1507.02646>}
\end{itemize}
\subsection*{Importance resampling with or without replacement}
BDA3 p. 266 recommends importance resampling without replacement. At
the time of writing that in 2013, we had less experience with
importance sampling and there were some reasonable papers showing
reduced variance doing resampling without replacement. We don't
recommend this anymore as Pareto smoothed importance sampling works
better and is also applicable when the resample sample size is equal
to the original sample size.
\subsection*{Importance sampling effective sample size}
{\color{red} BDA3 1st (2013) and 2nd (2014) printing have an error
for $\tilde{w}(\theta^s)$ used in the effective sample size
equation 10.4. The normalized weights equation should not have the
multiplier S (the normalized weights should sum to one). Errata for
the book
\url{http://www.stat.columbia.edu/~gelman/book/errata_bda3.txt}.}\\
The effective sample size estimate mentioned in the book is generic approximation, and more accurate effective sample size estimate would take into account also the functional. For example, importance sampling effective sample size can be different when estimating $\E[\theta]$ or $\E[\theta]^2$. If you are interested see more details, for example, in our Pareto importance sampling paper \url{https://arxiv.org/abs/1507.02646}.
The derivation for the effective sample size and Monte Carlo standard error (MCSE) for importance sampling can be found, for example, in Chapter 9 of \textit{Monte Carlo theory, methods and examples} by Art B. Owen \url{https://statweb.stanford.edu/~owen/mc/}.
\subsection*{Buffon's needles}
Computer simulation of Buffon's needle dropping method for estimating
the value of $\pi$ \url{https://mste.illinois.edu/activity/buffon/}.
\end{document}
%%% Local Variables:
%%% TeX-PDF-mode: t
%%% TeX-master: t
%%% End: