---
title: "Probability"
description: |
  Notes on Probability.
author:
  - name: Alex Stephenson
output:
  distill::distill_article:
    toc: true
    toc_depth: 3
    highlight: haddock
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
The notes and notation come predominantly from Aronow and Miller (2019), Pitman (1993), and Casella and Berger (2002).
## Fundamentals
### Sample Spaces, Event Spaces, Probability Axioms
A set *S* of subsets of a sample space $\Omega$ is an event space if it satisfies:
- Non-empty: $S \neq \emptyset$
- Closed under complements: $A \in S \implies A^C \in S$
- Closed under countable unions: $A_1, A_2, \ldots \in S \implies A_1 \cup A_2 \cup \cdots \in S$
A set *S* of subsets of some other set $\Omega$ that satisfies these properties is known as a $\sigma$-algebra on $\Omega$.
A **probability measure** is a function $P: S \rightarrow \mathbb{R}$ that assigns a probability to every event in the event space.
### Kolmogorov's Axioms
Let $\Omega, S, P$ be defined as above. Then $(\Omega, S, P)$ is a *probability space* if it satisfies:
- Non-negativity: $\forall A \in S, P(A) \geq 0$ and $P(A)$ is finite and real.
- Unitarity: $P(\Omega) = 1$
- Countable additivity: if $A_1, A_2, \ldots \in S$ are pairwise disjoint ($A_i \cap A_j = \emptyset$ for all $i \neq j$), then $P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots = \sum_i P(A_i)$
### Basic Properties of Probability
Let $(\Omega, S, P)$ be a probability space. The following properties hold:
- Monotonicity: $\forall A,B \in S$, $A \subseteq B \implies P(A) \leq P(B)$
- Subtraction Rule: $\forall A,B \in S$, $A \subseteq B \implies P(B \setminus A) = P(B) - P(A)$
- Zero Probability of the empty set: $P(\emptyset) = 0$
- Probability Bounds: $\forall A \in S, 0 \leq P(A) \leq 1$
- Complement Rule: $\forall A \in S, P(A^C) = 1 - P(A)$
### Joint and Continuous Probability
For $A,B \in S$, the **joint probability** of A and B is $P(A \cap B)$
#### The Addition Rule
For $A,B \in S, P(A \cup B) = P(A) + P(B) - P(A \cap B)$. In words, the probability that at least one of two events occurs equals the sum of the probabilities of each occurring minus the probability of both occurring.
#### Conditional Probability
For $A,B \in S$ the **conditional probability** of A given B is $P(A|B) = \frac{P(A \cap B)}{P(B)}$ where $P(B) > 0$.
The Multiplicative Law of Probability is a rearrangement: $P(A|B)P(B) = P(A \cap B)$
#### Bayes Rule
For $A,B \in S$ with $P(A) > 0$ and $P(B) > 0$:
$$\begin{aligned}
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
\end{aligned}$$
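As a minimal numerical sketch (the probabilities below are made up for illustration):
```{r}
# Hypothetical inputs: P(A), P(B), and P(B|A).
p_A <- 0.3
p_B <- 0.5
p_B_given_A <- 0.8

# Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)
p_A_given_B <- p_B_given_A * p_A / p_B
p_A_given_B  # 0.48
```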
### The Law of Total Probability
To understand the definition (and the utility) of the Law of Total Probability, first define a **partition**: if $A_1, A_2, \ldots \in S$ are nonempty and pairwise disjoint and $\Omega = A_1 \cup A_2 \cup \cdots$, then $\{A_1, A_2, \ldots\}$ is a partition of $\Omega$.
Partitions divide the sample space into mutually exclusive and exhaustive categories. Every outcome in the sample space is contained in exactly one event, so exactly one event in the partition occurs for any draw from the probability space.
The **Law of Total Probability** states:
If $\{A_1, A_2, \ldots\}$ is a partition of $\Omega$ and $B \in S$, then $P(B) = \sum_i P(B \cap A_i)$. Equivalently, $P(B) = \sum_i P(B|A_i)P(A_i)$ provided $P(A_i) > 0$ for all $i$.
The probability of an event B is thus a weighted average of the conditional probabilities of that event. The weights are the probabilities of the events that are being conditioned on.
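A small sketch of this weighted average, with a hypothetical three-event partition:
```{r}
# Hypothetical partition {A_1, A_2, A_3} and conditional probabilities of B.
p_A <- c(0.2, 0.5, 0.3)          # P(A_i); must sum to 1
p_B_given_A <- c(0.9, 0.4, 0.1)  # P(B | A_i)

# Law of Total Probability: P(B) = sum_i P(B|A_i) P(A_i)
sum(p_B_given_A * p_A)  # 0.41
```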
### Independence
Given a random generative process, events $A,B \in S$ are **independent** if $P(A \cap B) = P(A)P(B)$. This also means that for $A,B \in S$ with $P(B) > 0$, A and B are independent if and only if $P(A|B) = P(A)$.
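A quick simulated check, using two coin flips generated independently by construction:
```{r}
# Two independent fair coin flips; the joint probability should factor.
set.seed(7)
a <- sample(0:1, 1e5, replace = TRUE)
b <- sample(0:1, 1e5, replace = TRUE)
mean(a == 1 & b == 1)        # empirical P(A and B), ~0.25
mean(a == 1) * mean(b == 1)  # empirical P(A) P(B), ~0.25
```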
## Random Variables
A **random variable** is a variable that takes on a real value that is determined by a random generative process. Formally, it is a function $X: \Omega \rightarrow \mathbb{R}$ such that $\forall r \in \mathbb{R}, \{\omega \in \Omega : X(\omega) \leq r \} \in S$.
We can think of random variables as mapping states of the world to a real number. There are two ways to apply a function to random variables. The first is to use the value of the random variable as an input to another function $g$. Formally, let $g: U \rightarrow \mathbb{R}$ be a function where $X(\Omega) \subseteq U \subseteq \mathbb{R}$. If $g \circ X: \Omega \rightarrow \mathbb{R}$ is a random variable, then we say that $g$ is a function of $X$. The second is to operate on the function itself.^[All random variables are implicitly defined within some probability space.]
### Discrete Random Variables
A random variable X is **discrete** if its range $X(\Omega)$ is a countable set.
Most variables in practice are discrete, though it is often mathematically simpler to represent discrete measurements as continuous so that we can use results from calculus.
#### Probability Mass Function (pmf)
Given some discrete random variable X, we summarize the probability of each outcome x occurring with the **probability mass function**. The probability mass function of X is $f(x) = Pr[X=x], \forall x \in \mathbb{R}$
As an example of how a pmf works, suppose that we have a fair coin and we flip it repeatedly. Over many coin flips, we expect heads and tails each to come up half of the time. The pmf of the coin flip random variable is then:
$$\begin{aligned}
f(x) =\begin{cases}
\frac{1}{2} & x \in \{H, T\} \\
0 & \text{otherwise}
\end{cases}
\end{aligned}$$
A pmf tells us everything about a random variable's distribution.
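A simulated version of this example; empirical frequencies should approach the pmf values:
```{r}
# Simulate fair coin flips and compare empirical frequencies to f(x) = 1/2.
set.seed(42)
flips <- sample(c("H", "T"), size = 1e5, replace = TRUE)
table(flips) / length(flips)  # both entries close to 0.5
```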
#### Discrete PMF properties
For a discrete random variable X with pmf $f$:
- $\forall x \in \mathbb{R}, f(x) \geq 0$
- $\sum_{x \in X(\Omega)} f(x) = 1$
More generally, for any set $D \subseteq \mathbb{R}$, if $A = \{\omega \in \Omega : X(\omega) \in D\}$, then $P(A) = Pr[X \in D] = \sum_{x \in D \cap X(\Omega)} f(x)$.
#### Cumulative Distribution Function (cdf)
For a random variable X, the **cumulative distribution function** of X is $F(x) = Pr[X \leq x], \forall x \in \mathbb{R}$
The cdf returns the probability that an outcome for a random variable will be less than or equal to some given value. The cdf of X tells us everything that there is to know about X.
#### Properties of cdf
For a random variable X with cdf $F$:
- $F$ is nondecreasing: $\forall x_1, x_2 \in \mathbb{R}$, $x_1 < x_2 \implies F(x_1) \leq F(x_2)$
- $\lim_{x \rightarrow -\infty} F(x) = 0$
- $\lim_{x \rightarrow \infty} F(x) = 1$
- $\forall x \in \mathbb{R}, 1- F(x) = Pr[X > x]$
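These properties can be checked numerically; a sketch using a Binomial(10, 0.5) cdf:
```{r}
# The cdf of X ~ Binomial(10, 0.5) illustrates the properties above.
x <- 0:10
F_x <- pbinom(x, size = 10, prob = 0.5)
all(diff(F_x) >= 0)     # nondecreasing
pbinom(10, 10, 0.5)     # F(x) reaches 1 at the top of the support
1 - pbinom(4, 10, 0.5)  # Pr[X > 4] via the complement property
```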
### Continuous Random Variables
If we are interested in some measurement that can take on an arbitrary degree of precision (e.g. time), we often represent this variable for mathematical convenience as a **continuous** random variable. Formally, a random variable X is continuous if there exists a non-negative function $f: \mathbb{R} \rightarrow \mathbb{R}$ such that the cdf of X is $F(x) = Pr[X \leq x] = \int_{-\infty}^x f(u)du, \forall x \in \mathbb{R}$
#### Probability Density Function (pdf)
The function $f$ is the **probability density function**. Due to the Fundamental Theorem of Calculus, the pdf is unique for any continuous random variable.
For a continuous random variable X with cdf $F$, the **probability density function** of X is $f(x) = \frac{dF(u)}{du}\big|_{u=x}, \forall x \in \mathbb{R}$.
The pdf is the continuous analog to the pmf and consequently the properties of the pdf are analogous to the pmf.
- $\forall x \in \mathbb{R}, f(x) \geq 0$
- $\int_{-\infty}^\infty f(x)dx = 1$
Because a continuous random variable is evaluated with an integral, $\forall x \in \mathbb{R}, Pr[X = x] = 0$. Furthermore:
- $\forall x \in \mathbb{R}, Pr[X < x] = Pr[X \leq x] = F(x) = \int_{-\infty}^x f(u)du$
- $\forall x \in \mathbb{R}, Pr[X > x] = Pr[X \geq x] = 1 - F(x) = \int_{x}^\infty f(u)du$
- $\forall a,b \in \mathbb{R}, a \leq b$, $Pr[a < X < b] = Pr[a \leq X < b] = Pr[a < X \leq b] = Pr[a \leq X \leq b] = F(b) - F(a) = \int_{a}^b f(x)dx$
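A numerical check of the last property for the standard normal:
```{r}
# Pr[-1 <= X <= 1] = F(1) - F(-1) for X ~ N(0, 1), via integration and the cdf.
a <- -1; b <- 1
integrate(dnorm, lower = a, upper = b)$value  # ~0.6827
pnorm(b) - pnorm(a)                           # same quantity from the cdf
```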
### Support
Generally, we are interested in the subset of outcomes that could actually be observed for a given random variable. For example, no person has a negative height. The set of values at which the pmf/pdf of a random variable is positive is called its **support**. Formally, $Supp[X] = \{x \in \mathbb{R}: f(x) > 0 \}$
For discrete random variables, the support is the set of values that the random variable takes with nonzero probability. For continuous random variables, the support is the set of values over which the random variable has nonzero probability density.
## Bivariate Relationships
In the same way that univariate random variables can be completely described by their pmf/pdf, bivariate distributions are completely described by their joint pmf/pdf.
### Joint PMF
For discrete random variables X and Y, the **joint pmf** of X and Y is $f(x,y) = Pr[X=x, Y=y], \forall x,y \in \mathbb{R}$
### Joint CDF
For random variables X and Y, the **joint cdf** of X and Y is $F(x,y) = Pr[X \leq x, Y \leq y], \forall x,y \in \mathbb{R}$
### Marginal PMF
For discrete random variables X and Y with joint pmf $f$, the **marginal pmf** of Y is $f_y(y) = Pr[Y=y] = \sum_{x \in Supp[X]}f(x,y), \forall y \in \mathbb{R}$
The marginal pmf summarizes that variable on its own, ignoring its relationship with other variables.
### Conditional PMF
For discrete random variables X and Y with joint pmf $f$, the conditional pmf of Y given X = x is: $f_{Y|X}(y|x) = Pr[Y=y|X=x] = \frac{Pr[X=x, Y=y]}{Pr[X=x]} = \frac{f(x,y)}{f_X(x)}, \forall y \in \mathbb{R}$ and $\forall x \in Supp[X]$
The conditional pmf tells us the probability that a given value of Y occurs given that a certain value of X occurs.
### Multiplicative Law for PMFs
Let X and Y be two discrete random variables with joint pmf $f$. Then, $\forall x \in \mathbb{R}$ and $\forall y \in Supp[Y]$, $f_{X|Y}(x|y)f_Y(y) = f(x,y)$
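A small sketch tying the joint, marginal, and conditional pmfs together, using a made-up joint pmf:
```{r}
# Hypothetical joint pmf for binary X (rows) and Y (columns).
joint <- matrix(c(0.10, 0.20,
                  0.30, 0.40),
                nrow = 2, byrow = TRUE,
                dimnames = list(x = c("0", "1"), y = c("0", "1")))

f_X <- rowSums(joint)  # marginal pmf of X
f_Y <- colSums(joint)  # marginal pmf of Y

# Conditional pmf of Y given X = 0: f(x, y) / f_X(x)
joint["0", ] / f_X["0"]

# Multiplicative law: f_{X|Y}(x|y) f_Y(y) recovers the joint pmf
cond_X_given_Y <- sweep(joint, 2, f_Y, "/")
all.equal(sweep(cond_X_given_Y, 2, f_Y, "*"), joint)
```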
### Jointly Continuous Random Variables
Two random variables X and Y are jointly continuous if there exists a non-negative function $f: \mathbb{R}^2 \rightarrow \mathbb{R}$ such that the joint cdf of X and Y is
$$\begin{aligned}
F(x,y) = Pr[X \leq x, Y \leq y] = \int_{-\infty}^x \int_{-\infty}^y f(u,v) dvdu, \forall x,y \in \mathbb{R}
\end{aligned}$$
We refer to the function $f$ as the joint probability density function. Exactly as in the univariate case, we can derive the joint pdf by taking the mixed second-order partial derivative of the joint cdf.
### Joint PDF
For jointly continuous random variables X and Y with joint cdf $F$, the **joint pdf** of X and Y is $f(x,y) = \frac{\partial^2F(u,v)}{\partial u \partial v}\big|_{u=x, v=y}, \forall x,y \in \mathbb{R}$
For any jointly continuous random variables X and Y with a joint pdf, if $D \subseteq \mathbb{R}^2$ then $Pr[(X,Y) \in D] = \iint_D f(x,y)\,dy\,dx$
For example, the probability that (X,Y) falls within $\{0 \leq X \leq 5, Y \leq X\}$ is $\int_{0}^5\int_{0}^x f(x,y)\,dy\,dx$, as the numerical sketch below illustrates.
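Here is that sketch, assuming (purely for illustration) that X and Y are independent standard exponentials, so $f(x,y) = e^{-x-y}$ for $x,y \geq 0$:
```{r}
# Evaluate Pr[0 <= X <= 5, Y <= X] for the hypothetical joint pdf above
# by nesting one-dimensional numeric integrals.
f <- function(x, y) exp(-x - y)
inner <- function(x) {
  sapply(x, function(xi) integrate(function(y) f(xi, y), 0, xi)$value)
}
integrate(inner, lower = 0, upper = 5)$value  # ~0.493
```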
### Marginal Joint PDF
For jointly continuous random variables X and Y with joint pdf $f$ the **marginal pdf** of Y is $f_Y(y) = \int_{-\infty}^\infty f(x,y)dx, \forall y \in \mathbb{R}$
### Conditional Joint PDF
For jointly continuous random variables X and Y with joint pdf $f$ the **conditional pdf** of Y given X=x is $f_{Y|X}(y|x) = \frac{f(x,y)}{f_X(x)}, \forall y \in \mathbb{R}, \forall x \in Supp[X]$
## Summarizing Distributions
### Expected Values
For a discrete random variable *X* with pmf *f*, if $\sum_x |x|f(x) < \infty$, then the expected value of X is:
$$E[X] = \sum_x xf(x)$$
For a continuous random variable *X* with pdf *f*, if $\int_{-\infty}^\infty |x|f(x)dx < \infty$, the expected value of X is:
$$E[X] = \int_{-\infty}^\infty xf(x)dx$$
A common distribution both for explication and for binary data is the *Bernoulli Distribution*. Let $X \sim Bernoulli(p)$. The expected value of this distribution is:
$$E[X] = \sum_{x=0}^1xf(x) = 0(1-p) + 1(p) = p$$
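A simulated check that the Bernoulli mean is $p$:
```{r}
# Draws from Bernoulli(0.3); the sample mean should be close to p = 0.3.
set.seed(1)
x <- rbinom(1e5, size = 1, prob = 0.3)
mean(x)
```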
Another example is the standard normal distribution.
$$\begin{aligned}E[X] &= \int_{-\infty}^\infty x \frac{e^{\frac{-x^2}{2}}}{\sqrt{2\pi}}dx \\
E[X] &= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty xe^{\frac{-x^2}{2}}dx \\
E[X] &= \frac{1}{\sqrt{2\pi}}(-e^{\frac{-x^2}{2}})|_{-\infty}^\infty \\
E[X] &= 0
\end{aligned}$$
See the [Calculus](optim.html) notes for more details on the integration process.
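The same conclusion can be checked numerically:
```{r}
# Numeric check that the standard normal has expected value zero.
integrate(function(x) x * dnorm(x), lower = -Inf, upper = Inf)$value
```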
### Expectation of a Function of a Random Variable
Functions of random variables are themselves random variables. If X is a discrete random variable with pmf *f* and *g* is a function of X, then:
$$E[g(X)] = \sum_x g(x)f(x)$$
Similarly the continuous case is:
$$E[g(X)] = \int_{-\infty}^\infty g(x)f(x)dx$$
An immediate implication of this theorem is that whenever you see a regression of any kind, it's a type of average.
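A quick discrete check of the formula, computing $E[X^2]$ for a Binomial(10, 0.5) directly from its pmf:
```{r}
# E[g(X)] = sum_x g(x) f(x) with g(x) = x^2; should equal V[X] + E[X]^2 = 27.5.
x <- 0:10
f <- dbinom(x, size = 10, prob = 0.5)
sum(x^2 * f)
```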
### Properties of Expected values
For a random variable X
- $\forall c \in \mathbb{R}, E[c] = c$
- $\forall a \in \mathbb{R}, E[aX] = aE[X]$
To prove the first, note that any constant can be considered a degenerate distribution with a pmf equaling one if x is the constant and 0 otherwise. Then:
$$\begin{aligned}
E[c] &= \sum_x xf(x) \\
E[c] &= cf(c) \\
E[c] &= c(1) \\
E[c] &= c
\end{aligned}$$
For the second, in the discrete case:
$$\begin{aligned}
E[aX] &= E[g(X)] \\
E[aX] &= \sum_x g(x)f(x) \\
E[aX] &= \sum_x axf(x) \\
E[aX] &= a \sum_x xf(x) \\
E[aX] &= aE[X]
\end{aligned}$$
The continuous proof is analogous just with integrals instead of sum operators.
### Linearity of Expectation
Let X and Y be random variables. $\forall a,b,c \in \mathbb{R}$
$$E[aX + bY + c] = aE[X] + bE[Y] + c$$
The proof for linearity takes advantage of the fact that the expectation of a function $h$ of two random variables is $\sum_x \sum_y h(x,y)f(x,y)$; here $h(x,y) = ax + by + c$.
$$\begin{aligned}
E[aX + bY + c] &= \sum_x \sum_y h(x,y)f(x,y) \\
E[aX + bY + c] &= \sum_x \sum_y (ax + by + c)f(x,y) \\
E[aX + bY + c] &= a \sum_x \sum_y xf(x,y) + b \sum_x \sum_y yf(x,y) + c \sum_x \sum_y f(x,y) \\
E[aX + bY + c] &= a \sum_x x \sum_y f(x,y) + b \sum_y y \sum_x f(x,y) + c \sum_x \sum_y f(x,y) \\
E[aX + bY + c] &= a \sum_x x f_X(x) + b \sum_y yf_Y(y) + c \sum_x f_X(x) \\
E[aX + bY + c] &= aE[X] + bE[Y] + c(1) \\
E[aX + bY + c] &= aE[X] + bE[Y] + c
\end{aligned}$$
The continuous proof is again analogous with appropriate replacement of integrals for summations.
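Linearity holds even when the variables are dependent; a simulated sketch:
```{r}
# E[2X + 3Y + 1] = 2 E[X] + 3 E[Y] + 1, with Y deliberately dependent on X.
set.seed(2)
x <- rnorm(1e5)
y <- x + rnorm(1e5)  # dependent on x
mean(2 * x + 3 * y + 1)
2 * mean(x) + 3 * mean(y) + 1  # should nearly match
```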
### Raw and Central Moments
For a random variable X and $j \in \mathbb{N}$, the $j^{th}$ raw moment of X is $\mu'_j = E[X^j]$
The expected value is the first raw moment. Raw moments provide summary information about a distribution's shape and location. We often want summary measures that reflect the shape and spread of a distribution and do not depend on the expected value of the distribution. The $j^{th}$ central moment is
$$\mu_j = E[(X - E[X])^j]$$
### Variance
The variance is the second central moment and characterizes the variability or spread of a distribution. Higher variance implies greater unpredictability. The variance of a random variable X is defined as:
$$V[X] = E[(X - E[X])^2]$$
We will often see the alternative formula $V[X] = E[X^2] - E[X]^2$, and it is common for this to be proved. To do so:
$$\begin{aligned}
V[X] &= E[(X - E[X])^2] \\
V[X] &= E[X^2 - 2XE[X] + E[X]^2] \\
V[X] &= E[X^2] - 2E[XE[X]] + E[E[X]^2] \\
V[X] &= E[X^2] - 2E[X]E[X] + E[X]^2 \\
V[X] &= E[X^2] - 2E[X]^2 + E[X]^2 \\
V[X] &= E[X^2] - E[X]^2
\end{aligned}$$
We are able to do this because $E[X]$ is itself a constant, and therefore we can treat it as such whenever we apply the linearity of expectations.
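A simulated check of the alternative formula:
```{r}
# For X ~ Exponential(2), V[X] = 1/4; compare E[X^2] - E[X]^2 to that value.
set.seed(3)
x <- rexp(1e5, rate = 2)
mean(x^2) - mean(x)^2
var(x)  # sample variance; nearly identical here (it divides by n - 1)
```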
### Properties of variance
For a random variable X
- $\forall c \in \mathbb{R}, V[X + c] = V[X]$
- $\forall a \in \mathbb{R}, V[aX] = a^2V[X]$
Proof for the first one:
$$\begin{aligned}
V[X + c] &= E[(X + c - E[X + c])^2] \\
V[X + c] &= E[(X + c - E[X] - c)^2] \\
V[X + c] &= E[(X - E[X])^2] \\
V[X + c] &= V[X]
\end{aligned}$$
To prove the second:
$$\begin{aligned}
V[aX] &= E[(aX - E[aX])^2] \\
V[aX] &= E[(aX - aE[X])^2] \\
V[aX] &= E[a^2(X - E[X])^2] \\
V[aX] &= a^2E[(X - E[X])^2] \\
V[aX] &= a^2V[X]
\end{aligned}$$
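Both properties are easy to check by simulation:
```{r}
# Shifting by a constant leaves the variance alone; scaling by a multiplies it by a^2.
set.seed(4)
x <- rnorm(1e5, mean = 1, sd = 2)
var(x + 10)  # ~ var(x)
var(3 * x)   # ~ 9 * var(x)
9 * var(x)
```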
### The standard deviation
The standard deviation is just the square root of the variance $\sigma[X] = \sqrt{V[X]}$
### Properties of the standard deviation
For a random variable X:
- $\forall c \in \mathbb{R}, \sigma[X + c] = \sigma[X]$
- $\forall a \in \mathbb{R}, \sigma[aX] = |a|\sigma[X]$
A reason to prefer the standard deviation is that the variance is in units squared. The standard deviation is on the same scale as the random variable of interest.
### Chebyshev's Inequality
Knowing the mean and the standard deviation allows us to infer other features of a distribution. Chebyshev's inequality puts an upper bound on the probability that a draw from a distribution will be more than a given number of standard deviations from the mean.
Let X be a random variable with finite variance. Then $\forall \epsilon > 0$, $Pr[|X - E[X]| \geq \epsilon \sigma[X]] \leq \frac{1}{\epsilon^2}$
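The bound is often loose; a sketch comparing it to the exact tail probability for a standard normal with $\epsilon = 2$:
```{r}
# Chebyshev's upper bound vs. the exact two-sided normal tail probability.
eps <- 2
1 / eps^2        # Chebyshev bound: 0.25
2 * pnorm(-eps)  # exact Pr[|X - E[X]| >= 2 sd] for a normal: ~0.046
```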
### The Normal Distribution
A continuous random variable X follows a normal distribution if it has pdf $f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-(x-\mu)^2}{2\sigma^2}}, \forall x \in \mathbb{R}$
for some constants $\mu, \sigma \in \mathbb{R}$. If $X \sim N(\mu, \sigma^2)$ then:
- $E[X] = \mu$
- $\sigma[X] = \sigma$
### Properties of the Normal Distribution
Suppose $X \sim N(\mu_X, \sigma^2_X), Y \sim N(\mu_Y, \sigma^2_Y)$. Then:
- $\forall a,b \in \mathbb{R}, a \neq 0$, $W = aX + b \implies W \sim N(a\mu_X + b, a^2\sigma^2_X)$
- $X \perp \!\!\! \perp Y$ and $Z = X + Y \implies Z \sim N(\mu_X + \mu_Y, \sigma^2_X + \sigma^2_Y)$
An implication of these properties is that any linear combination of any number of mutually independent normal random variables is itself a normally distributed random variable, as the sketch below suggests.
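```{r}
# X ~ N(1, 4) and Y ~ N(-1, 1), independent.
set.seed(5)
x <- rnorm(1e5, mean = 1, sd = 2)
y <- rnorm(1e5, mean = -1, sd = 1)
c(mean(2 * x + 1), var(2 * x + 1))  # ~ (3, 16): W = 2X + 1
c(mean(x + y), var(x + y))          # ~ (0, 5):  Z = X + Y
```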
### Mean Squared Error
For a random variable X and $c \in \mathbb{R}$ the MSE of X about c is $E[(X-c)^2]$. The RMSE (root mean squared error) is just the square root of this quantity.
There is an alternative formula for the MSE which is useful for other proofs.
$$\begin{aligned}
E[(X - c)^2] &= E[X^2 - 2cX + c^2] \\
E[(X - c)^2] &= E[X^2] - 2cE[X] + c^2 \\
E[(X - c)^2] &= E[X^2] - E[X]^2 + E[X]^2 - 2cE[X] + c^2 \\
E[(X - c)^2] &= (E[X^2] - E[X]^2) + (E[X]^2 - 2cE[X] + c^2) \\
E[(X - c)^2] &= V[X] + (E[X] - c)^2
\end{aligned}$$
Using this formula, we can see that the expected value of X minimizes the MSE. To prove this, note that the first term in the decomposition does not depend on our choice of $c$. The second term is minimized when $(E[X] - c)^2$ equals 0, which occurs when $c = E[X]$.
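A simulated illustration: the empirical MSE over a grid of candidate values of $c$ is minimized near the mean.
```{r}
# For X ~ Gamma(2, 1), E[X] = 2; the empirical MSE is smallest near c = 2.
set.seed(6)
x <- rgamma(1e5, shape = 2, rate = 1)
c_grid <- seq(0, 4, by = 0.1)
mse <- sapply(c_grid, function(c) mean((x - c)^2))
c_grid[which.min(mse)]  # ~2
```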
### Covariance and Correlation