%
\chapter{BASIC STATISTICAL CONCEPTS}
\label{ch:basics}
\epigraph{``The most important questions of life are, for the most part, really only problems of probability.''}{\textit{Pierre Simon de Laplace, Mathematician}}
Probability is mostly organized common sense. However, being able to be specific about what probability is
enables us to more accurately calculate probabilities and to employ theoretical statistical distributions to
address confidence limits on data-derived quantities.
\section{Probability Basics}
\index{Probability!basics|(}
In data analysis and hypothesis testing we are concerned with separating the probable from
the possible. First, let us have a look at possibilities. In many situations we can either list all the
possibilities or say how many such outcomes there are. In evaluating possibilities, we are often
concerned with finding all the possible choices that are offered. Studying these choices leads us
to the ``multiplication of choices'' rule:
\index{Multiplication of choices}
\begin{quote}
If a choice consists of $k$ steps, of which the first can be made in $n_1$ ways and
the $k^{th}$ in $n_k$ ways, the total number of choices is $\prod_{i=1}^{k} n_i$.
\end{quote}
This can often be seen most clearly with a tree diagram (Figure~\ref{fig:Fig1_choices}).
The number of choices here is $3 \times 4 = 12$.
\PSfig[h]{Fig1_choices}{Tree diagram for illustrating all possible choices.}
\subsection{Permutations}
\index{Permutations}
How many ways can we arrange $r$ objects selected from a set of $n$ distinct objects?
This question applies to numerous statistical and probabilistic situations. We will first
consider a simple example.
\begin{example}
We have a tray with 20 water samples. How many ways can you select three samples
from the 20? The first sample can be any of 20, the second will be any of the remaining 19, while the third is one of
the remaining 18. The total ways must therefore be $20 \times 19 \times 18 = 6840$.
\end{example}
We can write the number of choices as $20 \times (20-1) \times (20-2)$, and by induction we find
\begin{equation}
\mbox{ways} = n(n-1)(n-2)\ldots(n - r + 1 ) = {}_n P_r.
\label{eq:choices}
\end{equation}
It is convenient to introduce the factorial $n!$, defined as
\begin{equation}
n! = \prod^n_{i=1} i.
\end{equation}
For convenience, we also define $0!$ to equal 1. We can then rewrite (\ref{eq:choices}) as
\begin{equation}
_{n}P_{r} = \frac{n (n-1) (n-2) \ldots (n-r + 1) (n-r) (n-r -1) \dots 1}{(n-r) (n-r-1) \ldots 1} = \frac{n!}{(n-r)!}.
\end{equation}
This quantity is called the number of \emph{permutations} of $r$ objects selected from a set of $n$ distinct
objects.
\begin{example}
We wish to determine how many different hands one can be dealt in a game of poker.
With $n = 52$ (total number of cards in the deck) and $r = 5$ (number of cards in a hand), we find
\begin{equation}
_{52} P_5 = \frac{52!}{(52-5)!} = \frac{52!}{47!} = 48 \cdot 49 \cdot 50 \cdot 51 \cdot 52 \approx
3.1 \cdot 10^8.
\end{equation}
However, this calculation assumes that the \emph{order} in which you receive the cards is important.
\end{example}
\subsection{Combinations}
\index{Combinations}
In many situations we do not care about the exact ordering of the $r$ objects, i.e., $abc$ is the
same choice as $acb$ for our purpose. In general, $r$ objects can be arranged in $r!$ different ways
($_{r}P_r = r!$). Since we are only concerned about \emph{which} $r$ objects have been selected and not their
order, we can use ${}_nP_r$ but must now normalize the result by $r!$, i.e.,
\begin{equation}
_{n} C_{r} = \frac{_{n} P_{r}}{r!} = \frac{n!}{r!(n-r)!} = \binom{n}{r}.
\end{equation}
The quantity $_{n} C_{r}$ is called the number of \emph{combinations}, and
the factors $\binom{n}{r}$ are called the
\emph{binomial coefficients}.
\index{Binomial coefficients}
After picking the $r$ objects, $n - r$ objects are left, so consequently there are as many ways of
selecting $n - r$ objects from $n$ as there are of selecting $r$ objects, i.e.,
\begin{equation}
\binom{n}{r} = \binom{n}{n-r}.
\label{eq:binom_inverse}
\end{equation}
\begin{example}
How many ways can you select three tide gauge records from 10 available stations?
This is a question of combinations:
\begin{equation}
{}_{10} C_3 = \binom{10}{3} =
\frac{10!}{3!7!} = \frac{8\cdot 9 \cdot 10}{1\cdot 2 \cdot 3} = 8 \cdot 3 \cdot 5 = 120.
\end{equation}
Likewise, per (\ref{eq:binom_inverse}), there are also 120 ways to select 7 tide gauge records from the same 10 stations.
\end{example}
\subsection{Probability}
\index{Probability}
So far we have studied only what is \emph{possible} in a given situation. We have listed all
possibilities or determined how many possibilities there are. However, to be of use to us we
need to be able to judge which of the possibilities are \emph{probable} and which are \emph{improbable}.
The basic concept of probability can be stated thus: If there are $n$ possible outcomes or
results, and $s$ of those are regarded as favorable (or as ``successes''), then the probability of
success is given by
\begin{equation}
P = s/n.
\end{equation}
This classical definition applies only when all possible outcomes are \emph{equally likely}.
\begin{example}
What is the probability of drawing an ace from a deck of cards?
\emph{Answer}: $P = 4/52 = 1/13 \approx 7.7\%$.
How about getting a 3 \emph{or} a 4 with a balanced die?
\emph{Answer}: $s = 2$ and $n = 6$, so $P = 2/6 \approx 33\%$.
\end{example}
While equally likely possibilities are found mostly in games of chance, the classical probability
concept also applies to random selections, such as making selections to reduce a large set of data
down to a manageable quantity without introducing sampling bias.
\begin{example}
If three of 20 water samples have been
contaminated and you select four random samples, what is the probability of picking exactly one of the
bad samples?
\emph{Answer}: We have
$\binom{20}{4} = 3 \cdot 5 \cdot 17 \cdot 19 = 4845$
ways of making the selection of our four samples. The number of
``favorable'' outcomes is $\binom{17}{3}$ [we pick three good samples of the 17 good ones] times $\binom{3}{1}$
[we pick one of the three bad samples] = 2040. It then follows that the probability is
$P = s/n = 2040/4845 = 42\%$.
Here we used the rule of multiplicative choices.
\end{example}
Obviously, the classical probability concept will not be useful when some outcomes are more
likely than others. A better definition would then be
\begin{quote}
\emph{The probability of an event is the proportion of the time that events of the same
kind will occur in the long run.}
\end{quote}
So, when the National Weather Service says that the chance of rain on any day in June is 0.2, it is based
on past records showing that June has had, on average, 6 rainy days out of 30. Another important probability
theorem is the \emph{law of large numbers}, which states
\index{Law of large numbers}
\begin{quote}
\emph{If a situation, trial, or experiment is repeated again and again, the proportion of
successes will tend to approach the probability that any one outcome will be a
success.}
\end{quote}
which is basically our probability concept in reverse.
Coin tosses illustrate the law of large numbers nicely. We toss the coin and keep track of how many
times we get ``heads'' versus the total number of tosses. For a nice symmetric coin we expect the
proportion of heads to total tosses to approach 0.5 over the long haul, but initially we are not surprised
that there can be large departures from this expectation. Figure~\ref{fig:Fig1_coin} shows how the
proportion may oscillate for a small number of tosses but eventually it will approach the expected value.
\PSfig[h]{Fig1_coin}{Proportion of heads in a series of coin tosses. The more tosses we complete,
the closer the ratio of heads to total tosses will approach 0.5. Shown are five separate sequences.
They differ considerably for small numbers but all converge on the expected proportion.}
\subsection{Some rules of probability}
\index{Probability!rules}
\index{Event}
In statistics, the set of all possible outcomes of an experiment is called the \emph{sample space},
usually denoted by the letter $S$. Any subset of $S$ is called an \emph{event}. An event may contain more
than one item. Sample spaces may be finite or infinite. Two events that have no elements in
common are said to be \emph{mutually exclusive}, meaning they cannot both occur at the same time.
There are only positive (or zero) probabilities, symbolically written
\begin{equation}
P(A) \geq 0
\end{equation}
for any event $A$.
Every sample space has probability 1, so that
\begin{equation}
P(S) = 1,
\end{equation}
where $P = 1$ means absolute certainty.
If two events are mutually exclusive, the probability that one \emph{or} the other will occur equals the
sum of their probabilities
\begin{equation}
P(A \cup B) = P (A) + P(B).
\label{eq:add_probe}
\end{equation}
Regarding the notation, $\cup$ means \emph{union} (which we read as ``OR''), $\cap$ means \emph{intersection} (``AND''), and $'$
(the prime symbol) means \emph{complement} (``NOT''). We can furthermore state that
\begin{equation}
P(A)\leq 1,
\end{equation}
since absolute certainty is the most we can ask for. Also,
\begin{equation}
P(A) + P(A') = 1,
\end{equation}
since it is certain that an event either will or will not occur.
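\begin{example}
As a quick illustration of the complement rule, consider again the tray of 20 water samples of which three are
contaminated. The numbers are those of the earlier example; the question posed here is an added illustration.
What is the probability that a random selection of four samples contains \emph{at least one} bad sample?
\emph{Answer}: The complement event is that all four samples are good, which can happen in
$\binom{17}{4} = 2380$ of the $\binom{20}{4} = 4845$ possible ways. Hence
\begin{equation}
P(\mbox{at least one bad}) = 1 - P(\mbox{none bad}) = 1 - \frac{2380}{4845} \approx 1 - 0.49 = 0.51.
\end{equation}
\end{example}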
\subsection{Probabilities and odds}
\index{Odds}
\index{Probability!odds}
Bookmakers in London use a slightly different system of reporting probabilities.
If the probability of an event is $p$, then the \emph{odds} for its occurrence are
\begin{equation}
a : b = \frac{p}{1-p}.
\end{equation}
The inverse relation gives
\begin{equation}
p = \frac{a}{a+b}.
\end{equation}
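\begin{example}
As a sketch of how the two relations work together, take the rain probability $p = 0.2$ quoted earlier
(reused here purely for illustration). The odds for rain are then
\begin{equation}
a : b = \frac{0.2}{1 - 0.2} = \frac{1}{4},
\end{equation}
i.e., 1 to 4. Going the other way with $a = 1$ and $b = 4$ returns $p = 1/(1+4) = 0.2$, as it should.
\end{example}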
If you are still reading this book then odds are you will pass this course!
\subsection{Addition rules}
\index{Probability!addition}
\index{Probability!Venn}
\index{Venn diagram}
\index{Plot!Venn}
\PSfig[h]{Fig1_Venn}{A Venn diagram illustrating the probabilities of finding hydrocarbons. The overlapping magenta
wedge graphically represents the probabilities of finding \emph{both} oil and gas.}
The addition rule demonstrated above only holds for \emph{mutually exclusive events}. Let us now
consider a more general case.
The sketch in Figure~\ref{fig:Fig1_Venn} is a \emph{Venn diagram}, a handy graphical way of illustrating the various
combinations of possibilities and probabilities. The diagram illustrates the probabilities
associated with finding hydrocarbons during a hypothetical exploration campaign. We see from
the diagram that
\begin{equation}
\begin{array}{rcl}
P(\mbox{oil}) & = &0.18 + 0.12 = 0.3,\\
P(\mbox{gas}) & = & 0.24 + 0.12 = 0.36, \\
P(\mbox{oil} \cup \mbox{gas} ) & = & 0.18 + 0.12 + 0.24 = 0.54.
\end{array}
\end{equation}
Now, if we used the simple addition rule (\ref{eq:add_probe}), we would find
\begin{equation}
P(\mbox{oil} \cup \mbox{gas} ) = P (\mbox{oil}) + P \mbox{(gas)} = 0.3 + 0.36 = 0.66.
\end{equation}
This value overestimates the probability, because finding oil and finding gas are \emph{not} mutually
exclusive since we might find both. We can correct the equation by writing
\begin{equation}
P (\mbox{oil} \cup \mbox{gas}) = P\mbox{(oil)} + P \mbox{(gas)} - P(\mbox{oil} \cap \mbox{gas}) = 0.3 + 0.36 - 0.12 = 0.54.
\end{equation}
The general addition rule for probabilities thus becomes
\begin{equation}
P(A\cup B) = P(A) + P(B) - P(A \cap B).
\end{equation}
Note that if the events \emph{are} mutually exclusive then
$P(A \cap B) = 0$ and we recover the original rule.
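\begin{example}
The general rule can be checked with a deck of cards (an illustration chosen here for convenience). Let $A$ be the
event of drawing an ace and $B$ the event of drawing a heart. These events are not mutually exclusive since the ace
of hearts belongs to both, so
\begin{equation}
P(A \cup B) = \frac{4}{52} + \frac{13}{52} - \frac{1}{52} = \frac{16}{52} \approx 0.31.
\end{equation}
Direct counting agrees: there are 13 hearts plus the three other aces, i.e., 16 favorable cards.
\end{example}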
\subsection{Conditional probability and Bayes basic theorem}
\index{Probability!conditional}
\index{Conditional probability}
We must sometimes evaluate the probability of an event \emph{given that another event already has occurred}.
We write the probability that $A$ will occur given that $B$ already has occurred as
\begin{equation}
P(A | B) = \frac{P(A \cap B)}{P(B)}.
\label{eq:cond_prob}
\end{equation}
In our exploration example, we can find the probability of finding oil given that gas already has
been found as
\begin{equation}
P(\mbox{oil}|\mbox{gas}) = \frac{P(\mbox{oil} \cap \mbox{gas})}{P(\mbox{gas})} = \frac{0.12}{0.36} = \frac{1}{3}.
\end{equation}
We can now derive a general multiplication rule from (\ref{eq:cond_prob}) by multiplying it by $P(B)$ and
then exchanging \emph{A} and \emph{B}, which gives
\begin{equation}
\begin{array}{rcl}
P(A \cap B) & = & P(B) P (A | B)\\
P(A \cap B) & = & P(A) P (B | A)
\end{array}
\label{eq:Bayes_basic}
\end{equation}
and implies that the probability of both events $A$ and $B$ occurring is given by the probability of
one event occurring multiplied by the probability that the other event will occur given that the first one
already has occurred (occurs, or will occur). This rule is called the \emph{joint probability} or \emph{Bayes
basic theorem}.
\index{Probability!joint}
\index{Joint probability}
\index{Bayes basic theorem}
\index{Probability!Bayes basic theorem}
Now, if the events $A$ and $B$ are independent events, then the probability that $A$ will take place
is not influenced by whether $B$ has taken place or not, i.e.
\begin{equation}
P(A|B) = P(A).
\end{equation}
Substituting this expression into (\ref{eq:Bayes_basic}) we obtain
\begin{equation}
P(A\cap B) = P(A) \cdot P(B).
\label{eq:jointindependent}
\end{equation}
That is, the probability that two independent events $A$ and $B$ both will occur equals the product of their probabilities. In
general, for $n$ independent events with individual probability $p_i$, the probability that all $n$ events
occur is
\begin{equation}
P = \prod ^n_{i=1} p_i.
\end{equation}
\begin{example}
What is the probability of rolling three ones in a row with a balanced die?
\emph{Answer}: With $n = 3$ and $p =1/6$,
\begin{equation}
P = \frac{1}{6} \cdot \frac{1}{6} \cdot \frac{1}{6} \approx 0.005.
\end{equation}
\end{example}
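\begin{example}
The product rule (\ref{eq:jointindependent}) also offers a simple test for independence. Using the values
from the hypothetical exploration campaign, we find
\begin{equation}
P(\mbox{oil}) \cdot P(\mbox{gas}) = 0.3 \cdot 0.36 = 0.108 \neq P(\mbox{oil} \cap \mbox{gas}) = 0.12,
\end{equation}
so finding oil and finding gas are \emph{not} independent events in that campaign. This agrees with the
conditional probability computed earlier: knowing that gas has been found raises the probability of
finding oil from 0.3 to $1/3$.
\end{example}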
While $P(A | B)$ and $P(B|A)$ may look similar, they can be vastly different. For example, let $A$ be the event
of a death on the Bay Bridge connecting San Francisco and Oakland, and $B$ the event of a magnitude 8 earthquake in the area.
Then, $P(A|B)$ is the probability of a fatality on the Bay Bridge \emph{given} that a large earthquake has
taken place nearby, while $P(B|A)$ is the probability that we will have a magnitude 8 quake \emph{given} that a
death has been reported on the bridge. Clearly $P(A|B)$ seems more likely than $P(B|A)$ since we know the former to
have happened in the past. On the other hand, we can list many causes of fatalities on the bridge other than
earthquakes (e.g., traffic accidents, heart attacks, old age, road rage, talk radio rants, and so on).
We can arrive at a relation between $P(B|A)$ and $P(A|B)$ by equating the two expressions for $P(A\cap B)$ in
(\ref{eq:Bayes_basic}). We obtain $P(A) \cdot P (B|A) = P (B) \cdot P (A|B)$, or
\begin{equation}
P(B | A) = \frac{P(B) \cdot P (A | B)}{P(A)}.
\label{eq:relate_cond_prob}
\end{equation}
This is a useful relation since we may sometimes know one conditional probability but are
interested in the inverse relationship. For example, we may know that salt domes
(known as potential traps for hydrocarbons) often are associated with
large curvatures in the gravity field. However, we may be more interested in the converse:
Given that large curvatures in the gravity field exist, what is the probability that salt domes are
the cause of such anomalies?
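\begin{example}
As a check on (\ref{eq:relate_cond_prob}), we can reuse the hydrocarbon probabilities from the exploration
example. With $A$ being oil and $B$ being gas we have $P(A) = 0.3$, $P(B) = 0.36$, and $P(A|B) = 1/3$, so that
\begin{equation}
P(\mbox{gas}|\mbox{oil}) = \frac{P(\mbox{gas}) \cdot P(\mbox{oil}|\mbox{gas})}{P(\mbox{oil})} =
\frac{0.36 \cdot \frac{1}{3}}{0.3} = \frac{0.12}{0.3} = 0.4.
\end{equation}
The same value follows directly from (\ref{eq:cond_prob}) as $P(\mbox{oil} \cap \mbox{gas})/P(\mbox{oil}) = 0.12/0.3$.
Note that $P(\mbox{gas}|\mbox{oil}) = 0.4$ differs from $P(\mbox{oil}|\mbox{gas}) = 1/3$.
\end{example}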
\subsection{Bayes general theorem}
\index{Bayes general theorem}
\index{Probability!Bayes general theorem}
If there is more than one event $B_i$ (all mutually exclusive and together covering all possibilities) that is conditionally related to an
event $A$, then $P(A)$ is simply the sum of the conditional probabilities of $A$ given each $B_i$ times the individual probabilities of the $B_i$, i.e.,
\begin{equation}
P(A) = \sum^n_{i=1} P (A|B_i) \cdot P (B_i).
\label{eq:cond_prob_sum}
\end{equation}
Substituting (\ref{eq:cond_prob_sum}) into (\ref{eq:relate_cond_prob}) gives, for any of the $n$ events $B_i$,
\begin{equation}
P(B_i |A) = \frac{P (B_i) \cdot P (A|B_i)}{\displaystyle \sum ^n _{j=1} P (A|B_j)\cdot P(B_j)}.
\label{eq:Bayes_theorem}
\end{equation}
\PSfig[h]{Fig1_fossil_site}{Location of a fossil discovery with respect to the two drainage basins from which it
must have originated. Bayes theorem provides a formal way to assign likelihood to the possible origins.}
\noindent
This is the general \emph{Bayes theorem}.
\begin{example}
Let us assume that an unknown marine fossil
fragment was found in a dry stream bed in northern Sahara. Excited, a paleontologist would like to send out an
expendable graduate student field party to search for a more complete specimen of the unknown species.
Unfortunately, the source of the
fragment cannot be identified uniquely since it was found several kilometers below the junction of two dry stream
tributaries (Figure~\ref{fig:Fig1_fossil_site}). The drainage basin $B_1$ of the larger stream covers
407.5 km$^2$, while the other basin ($B_2$) covers only 207.5
km$^2$. Based on this difference in basin size alone we might expect the probabilities that the fragment came from one of
the basins are
\begin{equation}
\begin{array}{c}
P(B_1) = \frac{407.5} {615} = 0.66,\\*[1ex]
P(B_2) = \frac{207.5} {615} = 0.34,
\end{array}
\end{equation}
based solely on the proportion of each basin's area to the combined area. However, inspecting an ancient British-produced geological map
reveals that only 31\% of the outcropping rocks in the larger basin $B_1$ are marine, whereas almost 85\% of
the outcrops in basin $B_2$ are marine. We can now state two conditional probabilities:\\
$P(A|B_1) = 0.31$ (Probability of a marine fossil, given it was derived from basin $B_1$.)\\
$P(A|B_2) = 0.85$ (Probability of a marine fossil, given it was derived from basin $B_2$.)\\
\noindent
With these probabilities and Bayes general theorem (\ref{eq:Bayes_theorem}) we can find the conditional probability that
the fossil came from basin $B_1$ given that the fossil is marine:
\begin{equation}
P(B_1|A) =
\frac{P(A|B_1) \cdot P (B_1)} {P(A|B_1) \cdot P (B_1) + P (A|B_2) \cdot P(B_2)}
=
\frac{0.31 \cdot 0.66}{0.31 \cdot 0.66 + 0.85 \cdot 0.34} = 0.41.
\end{equation}
Consequently, the probability of the fossil coming from the smaller basin $B_2$ is the complementary probability
\begin{equation}
P(B_2|A) = 0.59.
\end{equation}
It therefore seems somewhat more likely that the smaller basin was the source of the fossil and that this area should be
the initial target for the student-led expedition.
However, $P(B_1|A)$ and $P(B_2|A)$ are not dramatically different and depend to some extent on the assumptions used to
select $P(B_i)$ and $P(A|B_i)$ in the first place.
Bayes general theorem is extensively used in such search-and-find scenarios, and the probabilities that go into
the procedure are constantly being revised as more is learned during the search.
\end{example}
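\noindent
The arithmetic in the fossil example can be verified step by step, using the rounded values given there. The denominator of
(\ref{eq:Bayes_theorem}) is the total probability (\ref{eq:cond_prob_sum}) of finding a marine fossil,
\begin{equation}
P(A) = 0.31 \cdot 0.66 + 0.85 \cdot 0.34 = 0.2046 + 0.2890 = 0.4936,
\end{equation}
so that
\begin{equation}
P(B_1|A) = \frac{0.2046}{0.4936} \approx 0.41, \quad P(B_2|A) = \frac{0.2890}{0.4936} \approx 0.59.
\end{equation}
The two conditional probabilities sum to one, as they must if the fragment came from one of the two basins.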
\index{Probability!basics|)}
\section{The M\&M's of Statistics}
When discussing exploratory data analysis we mentioned that it is useful to be able to present
large data sets using just a few parameters. We saw that the box-and-whisker diagram graphically
summarizes a data distribution. However, it is often desirable to represent a data set by
a \emph{single} number which, in its way, is descriptive of the entire data set. We will see there are
several ways to select this ``representative'' value. We will mostly be concerned with measures
that somehow describe the center or middle of the data set. These are called estimates of
\emph{central location}\index{Central location}.
\subsection{Population and samples}
\index{Data!population}
\index{Population}
\index{Data!sample}
\index{Sample}
If a data set consists of all conceivably possible (or hypothetically possible) observations of a
certain phenomenon then we call it a \emph{population}. A population can be finite or infinite. Any subset
of the population is called a \emph{sample}. Thus, a series of 12 coin-tosses is a sample of the potentially
unlimited number of tosses in the population. We will most often find that we are analyzing
samples taken from a much larger population, and our aim will be to learn something about the
population by studying the smaller sample set (Figure~\ref{fig:Fig1_outcrop}).
\PSfig[h]{Fig1_outcrop}{We must always try to select an unbiased sample from the population. In this example we are sampling
the weathered outcrop of a sedimentary layer, which most likely is not representative of the entire formation.}
\subsection{Measures of central location (mean, median, mode)}
The best known estimate of central location is called the \emph{arithmetic mean}, defined as
\begin{equation}
\index{Sample!mean}
\index{Mean}
\index{Arithmetic mean}
\bar{x} = \frac{1}{n} \sum^n_{i=1} x_i.
\label{eq:arith_mean}
\end{equation}
The mean is also loosely called the ``average.'' Resist being that sloppy! When reporting the mean value, always say ``mean'' and
not ``average'' so that the reader knows exactly what you have done. We call $\bar{x}$ the \emph{sample mean} to
distinguish it from the true mean of the population, denoted
\begin{equation}
\mu = \frac{1}{N} \sum^N_{i=1} x_i,
\end{equation}
which likely will remain unknown to us. The mean has many useful properties, which explains its common use:
\begin{itemize}
\item It can always be calculated for any numerical data, i.e., it always exists.
\item It is unique and straightforward to calculate.
\item It is relatively stable and does not fluctuate much from sample to sample taken from the same
population.
\item It lends itself to further statistical treatment: several $\bar{x}$ estimates from subgroups can later be combined into an overall grand
mean.
\item It takes into account every data value.
\end{itemize}
However, the last property can sometimes be a liability. Should a few points deviate excessively
from the bulk of the data then it does not make sense to include them in the sample. A better
estimate for the central location may then be the \emph{sample median}:
\begin{equation}
\index{Sample!median}
\index{Median}
\mbox{median } x_i = \tilde{x} = \left \{ \begin{array}{cl}
x_{(n+1)/2}, & n \mbox{ is odd}\\*[1ex]
\displaystyle \frac{1}{2} (x_{n/2 + 1} + x_{n/2} ), & n \mbox{ is even}
\end{array} \right.
\end{equation}
Here, the data first must be sorted into ascending (or descending) order. We then choose the middle
value (or mean of the two middle values for even $n$) as our median estimate.
\index{Robust estimation}
Consider this sample of sandstone densities: \{2.30, 2.20, 2.35, 2.25, 2.30, 23.0, 2.25\}, $n = 7$.
The median density can be found to be $\tilde{x} = 2.30$, a reasonable value, while the mean density $\bar{x} = 5.24$,
which is a rather useless estimate since it is clearly far outside the bulk of the data \emph{and} outside
the range of known sandstone densities anywhere. For this reason we
say that the median is a \emph{robust} estimate of central location. Here it is rather obvious that the value
23.0, which probably is a clerical error, threw off the mean and we could correct for that by excluding
it from the calculation and find $\bar{x} = 2.28$
instead. However, in many cases our data set will be very large and we must anticipate that some
values may be erroneous.
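\begin{example}
The numbers quoted for the sandstone densities are easily reproduced. Sorting the seven values gives
\{2.20, 2.25, 2.25, 2.30, 2.30, 2.35, 23.0\}, so the median is the fourth value, $\tilde{x} = 2.30$.
The sum of all seven values is 36.65, hence
\begin{equation}
\bar{x} = \frac{36.65}{7} \approx 5.24,
\end{equation}
whereas dropping the suspect value 23.0 leaves a sum of 13.65 and a mean of $13.65/6 \approx 2.28$.
A single bad value thus shifts the mean by almost 3 but leaves the median unchanged.
\end{example}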
The disadvantage of the median is the need to sort the data, which can be slow. (Do you think this is
a valid reason not to use it?) However,
like the mean, the median always exists and is unique.
\index{Sample!mode}
\index{Mode}
Our final traditional estimate for central location is the \emph{mode}. The mode is defined as
the observation that occurs the most frequently. For defining the central location the mode is at a
disadvantage since it may not exist (perhaps no two values are the same) or it may not be unique (our
densities actually have two modes). Of course, if our data set is expected to have more than one ``peak,''
modal estimates are important, and we will return to that later. The mode will be denoted as $\hat{x}$.
The mean, median and mode of a distribution typically are related as indicated in
Figure~\ref{fig:Fig1_mmm}.
\PSfig[h]{Fig1_mmm}{The relationship between the mean, median, and mode estimates of central location for
a skewed data distribution. These
estimates will all coincide for a perfectly symmetric and unimodal distribution.}
Returning to the mean, it is occasionally the case that some measurements are considered
more important than others. It could be that some observations were made with a more precise
instrument, or simply that some values are not as well documented as others. These are
examples of situations where we should use a \emph{weighted mean}
\index{Mean!weighted}
\index{Weighted mean}
\begin{equation}
\bar{x} = \sum^n_{i=1} w_i x_i \left / \sum^n_{i=1} w_i \right.,
\label{eq:weighted_mean}
\end{equation}
where $w_i$ is the weight of the $i$'th data value. If all $w_i = 1$ then we recover the original definition for the
mean (\ref{eq:arith_mean}). This general equation is also convenient when we need to compute the overall, or
\emph{grand mean} based on the individual means from several data sets. The grand mean based on $m$ data sets may be
written as
\index{Grand mean}
\index{Mean!grand}
\begin{equation}
\bar{\bar{x}} = \frac{\sum^m _{i=1} n_i \bar{x}_i}{\sum^m_{i=1} n_i},
\end{equation}
where the sample sizes $n_i$ take the place of the weights in (\ref{eq:weighted_mean}).
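\begin{example}
As a small illustration with made-up numbers (they are hypothetical and not taken from any data set in this
book), suppose one data set has $n_1 = 10$ values with mean $\bar{x}_1 = 2.0$ and another has $n_2 = 30$ values with
mean $\bar{x}_2 = 3.0$. The grand mean is then
\begin{equation}
\bar{\bar{x}} = \frac{10 \cdot 2.0 + 30 \cdot 3.0}{10 + 30} = \frac{110}{40} = 2.75,
\end{equation}
which is pulled toward the mean of the larger data set and differs from the unweighted value
$(2.0 + 3.0)/2 = 2.5$.
\end{example}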
\subsection{Measures of variation}
While a measure of central location is an important attribute of our data, it says little
about how the data are distributed. We need some way of representing the \emph{variation} of our
observations about the central location. In the EDA section, we used the \emph{range} and \emph{hinges} to
indicate data variability. Another way to define the variability would be to compute the
deviations from the mean,
\begin{equation}
\Delta x_i = x_i - \bar{x},
\end{equation}
and take their average, $\frac{1}{n}\displaystyle \sum^n_{i=1} \Delta x_i$.
Sadly, it turns out that the sum of the deviations is
always zero, which makes it rather useless for our purposes. A more useful quantity might be the
mean of the absolute value ($AD$) of the deviations:
\begin{equation}
\index{AD (Absolute value of deviation)}
\index{Deviation!absolute value}
\index{Absolute value of deviation (AD)}
AD = \frac{1}{n} \sum^n _{i=1} | \Delta x_i |.
\label{eq:AD}
\end{equation}
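\begin{example}
To make these definitions concrete, consider the small hypothetical sample \{1, 2, 3, 4, 10\} (chosen only for easy
arithmetic). Its mean is $\bar{x} = 20/5 = 4$, so the deviations are $\Delta x_i = \{-3, -2, -1, 0, 6\}$. Their sum is
zero, as it is for any sample, whereas the mean absolute deviation is
\begin{equation}
AD = \frac{1}{5} \left ( 3 + 2 + 1 + 0 + 6 \right ) = \frac{12}{5} = 2.4.
\end{equation}
\end{example}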
Because of the absolute value sign this function is nonanalytic and is often ignored by
statisticians; you will find only superficial treatment of medians and absolute deviations in most
elementary statistics books. However, when dealing with real data that include occasional bad
values, the $AD$ is useful, just as the median can be more useful than the mean. Still, the most
common way to describe variation of a population is to define it as the average \emph{squared}
deviation. Hence, the population \emph{variance} is
\begin{equation}
\index{Variance}
\index{Data!variance}
\index{Population!variance}
\sigma^2 = \frac{1}{N} \sum^N _{i=1} (x_i - \mu)^2,
\end{equation}
and the population \emph{standard deviation} is therefore
\index{Standard deviation}
\begin{equation}
\sigma = \sqrt{ \frac{1}{N} \sum^N_{i=1} (x_i - \mu) ^2}.
\end{equation}
Most often we will be working with samples rather than entire populations, and we hope (and will later test)
that the sample is representative of the population.
The sample standard deviation $s$ is given by
\begin{equation}
\index{Sample!variance}
s = \sqrt{ \frac{1}{n-1} \displaystyle \sum^n_{i=1} (x_i - \bar{x}) ^2}.
\label{eq:stdev}
\end{equation}
Note that we are dividing by $n - 1$ rather than by $n$. This is done because $\bar{x}$ must first be \emph{estimated} from the
sample rather than being a \emph{given} parameter of the population, such as $\mu$ and $N$. This reduces the degrees
of freedom by one; hence we divide by $n - 1$ (we will have more to say about degrees of freedom in Section~\ref{sec:freedom}).
We can now show one property of the mean: It is clear that $s^2$ depends on the choice for $\bar{x}$.
Let us find the value for $\bar{x}$ in (\ref{eq:stdev}) that gives the smallest value for $s^2$. Consider
\begin{equation}
f(\bar{x}) = s^2 = \frac{1}{n-1} \sum^n_{i=1} (x_i - \bar{x})^2.
\end{equation}
The function $f$ has a minimum where $df/d\bar{x}= 0$ and $d^2f/d\bar{x}^2 > 0$, so we find
\begin{equation}
\frac{df}{d\bar{x}} = \displaystyle \frac{\displaystyle \sum ^n_{i=1} - 2 (x_i - \bar{x})} {n-1} =
\frac{-2}{n-1} \sum ^n _{i=1} (x_i - \bar{x}) = 0,
\end{equation}
which gives
\begin{equation}
\sum^n_{i=1} (x_i - \bar{x}) = 0.
\end{equation}
We can solve this equation and find
\begin{equation}
\bar{x} = \frac{1}{n} \sum^n _{i=1} x_i.
\end{equation}
Since
\begin{equation}
\frac{d^2f}{d\bar{x}^2} = \frac{2n}{n-1} > 0,
\end{equation}
we know that $f$ has a minimum for this value of $\bar{x}$. Thus, we have shown that the value $\bar{x}$
that minimizes the standard deviation equals
the mean we defined earlier in (\ref{eq:arith_mean}). This is a very useful and important property of the mean.
Because $\bar{x}$
minimizes the squared ``misfit'', it is also called the \emph{least-squares estimate} of central location
(or L$_2$ estimate for short). When computing the mean and standard deviation on a computer we
do not normally use (\ref{eq:stdev}) since it requires two passes through the data: one to compute $\bar{x}$ and
another to accumulate the squared deviations in (\ref{eq:stdev}). Rather, we rearrange (\ref{eq:stdev}) to give
\begin{equation}
\begin{array}{ll}
s & = \displaystyle \sqrt{ \sum^n _{i=1} \frac{(x_i - \bar{x})^2} {n-1} } =
\sqrt{ \sum^n_{i=1} \frac{x^2_i - 2x_i \bar{x} + \bar{x}^2} {n-1} }\\[3ex]
\ & = \displaystyle \sqrt{\frac{n \displaystyle \sum x^2_i - 2 n \bar{x} \displaystyle \sum x_i + n \displaystyle \sum \bar{x}^2}
{n(n-1)} } = \sqrt{\frac{n \displaystyle \sum x^2_i - \left(\displaystyle \sum x_i\right)^2}{n(n-1)} }.
\end{array}
\end{equation}
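As a quick check of the rearranged expression (our own arithmetic, assuming the seven sandstone densities
$\rho = \{2.2, 2.25, 2.25, 2.3, 2.3, 2.35, 2.3\}$ with the bad value already corrected), we have $\sum x_i = 15.95$
and $\sum x^2_i = 36.3575$, so that
\begin{equation}
s = \sqrt{\frac{7 \times 36.3575 - 15.95^2}{7 \times 6}} = \sqrt{\frac{254.5025 - 254.4025}{42}} = \sqrt{\frac{0.1000}{42}} \approx 0.049,
\end{equation}
in agreement with a direct two-pass evaluation of (\ref{eq:stdev}). Note that the numerator is a small difference
between two large, nearly equal numbers. In finite-precision arithmetic such a subtraction can lose significant digits
when the mean is large compared to the spread, so the one-pass formula should be used with some care.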
\subsection{Robust estimation}
\label{sec:zscore}
\index{Robust!estimation|(}
We found that the arithmetic mean is the value that minimizes the sum of the squared deviations from the
central value. Can we apply the same argument to the mean absolute deviation and find what the
best value for $\tilde{x}$ may be? In other words, let
\index{Median}
\begin{equation}
\frac{d}{d\tilde{x}} \left( \frac{1}{n}\displaystyle \sum^n_{i=1} \left |x_i - \tilde{x}\right | \right) = -\frac{1}{n} \sum^n_{i=1} \frac{x_i - \tilde{x}}{|x_i - \tilde{x}|} = 0.
\label{eq:dAD-d*}
\end{equation}
The term inside the summation can only take on the values $-1$ or $+1$ (or $0$, by convention, when $x_i = \tilde{x}$). Thus, the only $\tilde{x}$ that can
satisfy (\ref{eq:dAD-d*}) is a value chosen such that as many $x_i$ are smaller (giving $-1$) as are
larger (giving $+1$); for odd sample sizes the middle observation coincides with $\tilde{x}$ and contributes zero. Thus, we
have shown that the median is the location estimate that minimizes
the mean absolute deviation. The
median is also called the L$_1$-estimate of central location.
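As a numerical illustration (our own arithmetic, assuming the contaminated sandstone densities
$\rho = \{2.2, 2.25, 2.25, 2.3, 2.3, 2.35, 23.0\}$ used in this chapter), the mean absolute deviation about the
median $\tilde{\rho} = 2.3$ is
\begin{equation}
\frac{1}{7} \sum^7_{i=1} |\rho_i - \tilde{\rho}| = \frac{0.1 + 0.05 + 0.05 + 0 + 0 + 0.05 + 20.7}{7} = \frac{20.95}{7} \approx 2.99,
\end{equation}
whereas about the mean $\bar{\rho} \approx 5.24$ it is
\begin{equation}
\frac{1}{7} \sum^7_{i=1} |\rho_i - \bar{\rho}| \approx \frac{35.53}{7} \approx 5.08,
\end{equation}
which is larger, as we would expect if the median is the minimizer.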
The discussion of mean and median brings up the general issue of \emph{robust estimation}: How to
calculate a stable and reasonable estimate of central location in the presence of contaminated
data? As an indicator of how robust a method is, we will introduce the concept of ``breakdown
point.'' It is the \emph{smallest fraction} of the observations that must be replaced by outliers in order to throw
the estimator outside reasonable bounds.
We have already seen that even a single bad value is enough to throw the mean way off. For our
densities of sandstone, we had $\rho = \{2.2, 2.25, 2.25, 2.3, 2.3, 2.35, 23.0\}$, with $n = 7$. If we
realized that 23.0 should be 2.3, we find $\bar{\rho} = 2.28 \pm 0.05$, while if we included $\rho_7 = 23.0$ we
would find $\bar{\rho} = 5.24 \pm 7.8$. The second estimate is obviously far outside the 2.20--2.35 range we first
determined. We can therefore say that the least-squares estimate (i.e., the mean) has a breakdown point of
$1/n$; it only takes one outlier to ruin our day. On the other hand, note that the median is $2.3$ in
both cases, well inside the acceptable interval. The breakdown point of the median is
50\%: We would have to replace half the data with bad outliers to move the estimate of
the median outside the range of the original (good) data values.
Apart from the central location estimator, we also want a robust estimate of the spread of the
data. Clearly, the classical standard deviation is problematic since a single bad value can inflate it
without bound because the deviations are squared. From the success of taking the median of a set of numbers rather
than summing them up, could we do something similar with the deviations? A natural choice is to
measure the deviations from our old friend the median, $\tilde{x}$, and then take the median, rather than the mean,
of their absolute values $\{|x_i - \tilde{x}|\}$. Because of the robustness of the median operator, we will often use
the quantity called the \emph{median absolute deviation} ($MAD$) as our robust estimate of ``spread'' or variation.
Note: Many textbooks and software packages (such as MATLAB) use $MAD$ to indicate \emph{mean absolute deviation}
instead, as defined in (\ref{eq:AD}) and called $AD$ in these notes. Thus, we define
\index{Mean absolute deviation}
\begin{equation}
\index{MAD}
\index{Median absolute deviation}
MAD = 1.4826 \mbox{ median } |x_i - \tilde{x} |,
\end{equation}
where the factor 1.4826 is a correction term that makes the $MAD$ equal to the standard deviation
of normally distributed data\footnote{This factor equals $1/P^{-1}_c(0.75)$, where $z = P^{-1}_c(p)$ is
the inverse cumulative normal distribution; see Table~\ref{tbl:Critical_z}.}. Like the median, the $MAD$ has a breakdown point of 50\%. The $MAD$
for our example was 0.07 and it remained unchanged by using the contaminated value.
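To make the recipe concrete (our own arithmetic, assuming the contaminated densities
$\rho = \{2.2, 2.25, 2.25, 2.3, 2.3, 2.35, 23.0\}$ with median $\tilde{\rho} = 2.3$), the sorted absolute deviations from
the median are $\{0, 0, 0.05, 0.05, 0.05, 0.1, 20.7\}$. Their median is the fourth value, 0.05, so
\begin{equation}
MAD = 1.4826 \times 0.05 \approx 0.074.
\end{equation}
Replacing 23.0 by 2.3 changes the sorted deviations to $\{0, 0, 0, 0.05, 0.05, 0.05, 0.1\}$, whose median is still 0.05,
which is why the $MAD$ is unaffected by the bad value.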
Having robust estimates of central location and scale, we can attempt to identify \emph{outliers}. We may
compute the robust \emph{standard units}
\begin{equation}
\index{Normal scores!robust}
\index{Standard scores!robust}
z_i = \frac{x_i - \tilde{x}} {MAD}
\end{equation}
and compare them to a cutoff value: If $|z_i| > z_{cut}$ we say we have detected an outlier. The choice
for $z_{cut}$ is to a certain extent arbitrary. It is, however, quite standard to choose $z_{cut} = 2.5$. The chance
that any given $|z_i|$ will exceed this $z_{cut}$ is small (about 1.2\%) if the $z_i$'s came from a normal distribution.
Normalizing our densities (including the contaminated value) using $\bar{x}$ and $s$ to compute $z_i$ gives
\begin{equation}
z_{\scriptscriptstyle L_{\scriptscriptstyle 2}} = \left \{ -0.39, -0.38, -0.38, -0.377, -0.377, -0.37, 2.28\right \},
\end{equation}
where none of the values qualifies as an outlier. Using the median and $MAD$ instead, we find
\begin{equation}
z_{\scriptscriptstyle L_{\scriptscriptstyle 1}}
= \left \{ -1.35, -0.67, -0.67, 0.0, 0.0, 0.67, 279.2 \right \},
\end{equation}
and we see that the bad observation gives a huge $z$-value two orders of magnitude larger than any other.
Clearly, the least-squares technique alone is not trustworthy when it comes to detecting bad
points. The outlier-detecting scheme presents us with an elegant two-step technique: First find and remove
the outliers from the data, then use classical \emph{least-squares} techniques on the remaining data
points. The resulting statistics are called the \emph{least trimmed squares} estimates (LTS). We
will return to the concept of robustness when discussing regression in Chapter~\ref{ch:regression}.
\index{Least trimmed squares (LTS)}
\index{Robust!estimation|)}
\subsection{Central limit theorem}
How well does our sample mean, $\bar{x}$, compare to the true population mean, $\mu$? An important
theorem, called the \emph{central limit theorem}, states
\begin{quote}
\index{Central limit theorem}
\emph{If $n$ (the sample size) is large, the theoretical sampling distribution of the mean
can be approximated closely with a normal distribution.}
\end{quote}
This is rather important since it justifies the use of the normal distribution in a wide range of
situations. In essence, the sample mean $\bar{x}$ is an \emph{unbiased estimate} of the population
mean and, for large $n$, its scatter about $\mu$ is approximately \emph{normally distributed}, whatever the shape of the
population distribution (provided its variance is finite). It can be shown that the standard
deviation of the sample mean, $s_{\bar{x}}$, is related to the population standard deviation, $\sigma$, by
\begin{equation}
s_{\bar{x}} = \frac{\sigma} {\sqrt{n}}
\label{eq:samp_dev_int}
\end{equation}
or
\begin{equation}
\index{Sample!mean}
s_{\bar{x}} = \frac{\sigma} {\sqrt{n}} \sqrt{\frac{N-n}{N-1}}
\label{eq:samp_dev_int2}
\end{equation}
depending on whether the population is infinite (\ref{eq:samp_dev_int}) or finite of size $N$ (\ref{eq:samp_dev_int2}). Thus, as $n$
grows large, $s_{\bar{x}} \rightarrow 0$. Furthermore, the sample variance $s^2$ has the mean value $\sigma^2$, and for
a normally distributed population its variance is
\begin{equation}
\index{Sample!variance}
\sigma^2_s = \frac{2\sigma^4 }{n-1},
\end{equation}
which also $\rightarrow 0$ for large $n$. For our analysis we will substitute the sample standard deviation
$s$ \emph{in lieu} of the unknown population standard deviation $\sigma$, since $s^2$ is an \emph{unbiased estimator} of $\sigma^2$.
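As a rough illustration (our own arithmetic; it assumes the seven corrected sandstone densities, for which
$s \approx 0.0488$, and treats the population as infinite), substituting $s$ for $\sigma$ in (\ref{eq:samp_dev_int}) gives
\begin{equation}
s_{\bar{x}} \approx \frac{0.0488}{\sqrt{7}} \approx 0.018.
\end{equation}
With only $n = 7$ values the normal approximation should be taken with a grain of salt, but the calculation shows
the $1/\sqrt{n}$ scaling: We would need four times as many measurements to halve $s_{\bar{x}}$.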
\subsection{Covariance and correlation}
\label{sc:cc}
We found earlier that the sample variance was defined as
\begin{equation}
s^2_x = \frac{\displaystyle \sum^n_{i=1} (x_i - \bar{x})^2}{n-1} =
\frac{\displaystyle \sum^n_{i=1} (x_i - \bar{x})(x_i - \bar{x}) }{n-1}.
\end{equation}
It is often the case that our data set consists of pairs of properties, such as sets of (depth, pressure),
(time, temperature), concentrations of two elements, and more. Denoting the paired properties by $x$ and $y$,
we can compute the variance of each quantity separately. For instance, for $y$ we find
\begin{equation}
s^2_y = \frac{\displaystyle \sum ^n_{i=1} (y_i - \bar{y})^2} {n-1}=
\frac{\displaystyle \sum^n_{i=1} (y_i - \bar{y}) (y_i - \bar{y})} {n-1}.
\end{equation}
We can now define the \emph{covariance} between $x$ and $y$ in a similar way as
\begin{equation}
\index{Sample!covariance}
\index{Covariance}
s_{xy} = \frac{\displaystyle \sum^n_{i=1} (x_i - \bar{x})(y_i - \bar{y})} {n-1}.
\end{equation}
While $s_x$ and $s_y$ tell us how the $x$ and $y$ values are distributed \emph{individually}, $s_{xy}$ tells us how
the $x$ and $y$ values vary \emph{together}.
Because the value of the covariance clearly depends on the units of $x$ and $y$, it is difficult to
state what covariance values are meaningful. This difficulty is overcome by defining the Pearson
\emph{correlation coefficient} $r$, which normalizes the covariance to yield correlations in the $\pm 1$ range, i.e.,
\begin{equation}
\index{Sample!correlation}
\index{Correlation}
r = \frac{s_{xy}} {s_x s_y}.
\end{equation}
If $|r|$ is close to 1, then the variables are strongly correlated or anti-correlated. Values of $r$ close to 0 mean that there
is little or no linear correlation between the data pairs. Figure~\ref{fig:Fig1_correlations} shows some examples of
data pairs and their correlations.
We see that in general, $r$ will tell us how well the data are ``clustered'' in some direction. Note in
particular example (f), which presents data that are clearly correlated (i.e., all pairs lie on a circle), yet $r
= 0$. This occurs because $r$ is a measure of a \emph{linear} relationship between values; a nonlinear
relationship may not register a significant correlation. Thus, we must be careful with how we use $r$ to draw conclusions
about the interdependency of paired values. For example, if our ($x,y$) data are governed by a $y = \sqrt{x}$ law then we
may find a fairly good correlation between $x$ and $y$, but we would be wrong to conclude that $x$ and $y$
have a \emph{linear} relationship (plotting $y$ versus $\sqrt{x}$ \emph{would give} a linear relationship and a much higher
value of $r$). We will return to correlation under the rubrics of curve fitting and multiple regression in
Chapter~\ref{ch:regression}.
\PSfig[h]{Fig1_correlations}{Some examples of data sets and their correlation coefficients. Note that the perfect
circular correlation in (f) gives a zero linear correlation coefficient. While clearly $x$ and $y$ are correlated,
their relationship is not \emph{linear}.}
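\begin{example}
As a small numerical illustration of the $y = \sqrt{x}$ case (the five data pairs are made up for this
purpose and are not taken from any data set in these notes), let $x = \{0, 1, 4, 9, 16\}$ and
$y = \sqrt{x} = \{0, 1, 2, 3, 4\}$, so that $\bar{x} = 6$ and $\bar{y} = 2$. Then
\begin{equation}
s_{xy} = \frac{40}{4} = 10, \quad s^2_x = \frac{174}{4} = 43.5, \quad s^2_y = \frac{10}{4} = 2.5,
\end{equation}
and hence
\begin{equation}
r = \frac{10}{\sqrt{43.5 \times 2.5}} \approx 0.96.
\end{equation}
The correlation is high but not perfect, even though $y$ is an exact function of $x$. Correlating $y$ with
$\sqrt{x}$ instead gives $r = 1$, since the two sets of values are then identical.
\end{example}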
\subsection{Moments}
Returning to the L$_2$ estimates, we will briefly introduce the concept of \emph{moments}. In general,
the $r$'th moment is defined as
\begin{equation}
\index{Moments}
m_r = \frac{1}{n} \sum^n_{i = 1} (x_i - \mu)^r,
\end{equation}
except for $r = 1$ where it is customary to use the ``raw moment'' about zero instead. From this definition it can be seen
that the mean and variance are the first (raw) and second (central) moments,
respectively. We will look at two higher order (central) moments that one may encounter in the literature.
The first is called the \emph{skewness} ($SK$) and it is the third central moment, given by
\begin{equation}
\index{Skewness}
\index{Data!skewness}
SK = \frac{1}{n} \sum ^n_{i=1} \left ( \frac{x_i - \bar{x}} {s} \right) ^3 = \frac{1}{n} \sum ^n_{i=1} z_i^3,
\end{equation}
where we normalize by $s$ to get dimensionless values for $SK$. The skewness is used to investigate
our data sets' \emph{degree of symmetry} about the mean. A positive $SK$ means we have a longer tail
to the right of the mean than to the left, and vice versa for a negative $SK$ (Figure~\ref{fig:Fig1_skewness}).
\PSfig[h]{Fig1_skewness}{Examples of data distributions with positive and negative skewness. The
sign of the skewness indicates which side of the distribution is long-tailed.}
\noindent
Unfortunately, if the data contain outliers then the $SK$ will be very sensitive to these values and
consequently be of little use to us. A more robust estimate of skewness is the \emph{Pearson
coefficient of skewness},
\begin{equation}
\index{Pearson skewness}
\index{Skewness!Pearson}
SK_p = \frac{3(\bar{x} - \tilde{x})} {s},
\end{equation}
where we basically compare the mean and the median. An even higher-order central moment is the
\emph{kurtosis},
\begin{equation}
\index{Kurtosis}
\index{Data!kurtosis}
K = \left \{ \frac{1}{n} \sum^n_{i=1} \left ( \frac{x_i - \bar{x}}{s} \right) ^4 \right \} -3 = \left \{ \frac{1}{n} \sum ^n_{i=1} z_i^4 \right \} - 3.
\end{equation}
The correction term $-3$ makes $K = 0$ for a normal distribution, which we will discuss shortly. The kurtosis $K$ attempts to
quantify a data distribution's ``sharpness'' ($K > 0$) or ``flatness'' ($K < 0$; Figure~\ref{fig:Fig1_kurtosis}).
However, because the deviations are raised to the fourth power, $K$ is extremely sensitive to outliers and should be used only with
``well-behaved'' data.
\PSfig[h]{Fig1_kurtosis}{Examples of distributions with different kurtosis. Distributions with negative $K$ are
called \emph{platykurtic}\index{Platykurtic}, while those with positive $K$ are called \emph{leptokurtic}\index{Leptokurtic}. You will of course be immensely pleased to learn
that the intermediate case ($K = 0$) is called \emph{mesokurtic}\index{Mesokurtic}.}
\section{Discrete Probability Distributions}
\index{Probability distributions|(}
\index{Probability distribution!discrete}
An important concept in statistics and probability is the notion of a \emph{probability distribution}. It is a
function $P(x)$ that gives the probability that the event $x$ will take place. $P(x)$ can be a
discrete or continuous function. As an example of a discrete function, consider the function $P(x)$, $x=1, 2, \ldots, 6$, that gives
the probability of throwing an $x$ with a balanced die:
\begin{equation}
P(x) = 1/6,\quad x = 1,2, \ldots, 6,
\end{equation}
or for flipping a coin:
\begin{equation}
P(x) = 1/2,\quad x = \left \{ H, T \right \}.
\end{equation}
Staying with the throws of the die, we can relate $P(x)$ to the area under the curve in Figure~\ref{fig:Fig1_die_probability}.
\PSfig[h]{Fig1_die_probability}{Probability of throwing any number on a die is a constant $1/6$, unless
the die is ``loaded''.}
Two important properties shared by all discrete probability distributions are
\begin{equation}
0 \leq P (x_i) \leq 1, \mbox{ for all } x_i,
\end{equation}
\begin{equation}
\sum^n_{i=1} P(x_i) = 1.
\label{eq:Pdiscretesum}
\end{equation}
\subsection{Binomial probability distribution}
\label{sec:binom}
\index{Probability distribution!binomial}
\index{Binomial probability distribution}
Often we are more interested in knowing the probability of a certain outcome after $n$ repeated
tries, such as ``what is the probability of receiving junk mail three days in one week?'' To derive such a
function, we will assume that each event is independent and has the same probability, $p$. Then, the
probability that an event \emph{does not} occur is the complement, $q = 1 - p$. Consequently, the probability of
getting $x$ successes in $n$ tries (and thus $n - x$ failures) is
\begin{equation}
P_1(x) = p^x q^{n-x}.
\end{equation}
However, this probability applies to one \emph{specific order} of the outcomes. Since we may not care about
the order in which the $x$ successful events occurred, we must scale $P_1(x)$ by the number of
possible combinations of $x$ successes in $n$ tries. We already know this number to be given by $\binom{n}{x}$,
so our discrete probability function becomes
\begin{equation}
P_{n,p}(x) = \binom{n}{x} p^x q^{n-x} = \binom{n}{x} p^x (1 - p)^{n-x}, \quad x =0, 1, \ldots, n.
\label{eq:binomial_dist}
\end{equation}
This expression is known as the binomial probability distribution or simply the \emph{binomial distribution}
(Figure~\ref{fig:Fig1_binom_dist}) and it is used to predict the probability that $x$ out of $n$
tries will be successful, given that each independent try has the probability $p$ of success.
\PSfig[h]{Fig1_binom_dist}{Binomial probability distribution $P_{n,p}(x)$, which shows the probability of having $x$ successful
outcomes out of a total of $n$ tries, when each try has the probability $p$ of success (and $q = 1 - p$ of failure).
Here, $p = 0.25$ and $n = 8$.}
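For reference, evaluating (\ref{eq:binomial_dist}) for the case plotted in Figure~\ref{fig:Fig1_binom_dist}
($n = 8$, $p = 0.25$) gives the following values (our own arithmetic, rounded to three decimals):
\begin{center}
\begin{tabular}{|l||c|c|c|c|c|c|c|c|c|} \hline
$x$ & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ \hline
$P_{8,0.25}(x)$ & 0.100 & 0.267 & 0.311 & 0.208 & 0.087 & 0.023 & 0.004 & 0.000 & 0.000 \\ \hline
\end{tabular}
\end{center}
The probabilities sum to one, as required by (\ref{eq:Pdiscretesum}), and the most likely outcome is $x = 2$, which
here coincides with $np$.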
\begin{example}
What are the chances of drawing three red cards in six tries from a deck (assuming we place the card back
into the deck after each try)? Here $p = 1/2$, so
\begin{equation}
P_{6,0.5}(3) = \frac {6!}{3!3!} \left ( \frac{1}{2} \right ) ^3 \left ( \frac{1}{2} \right )^{6-3} = 0.31.
\end{equation}
One might have thought that getting half red and half black cards would have a higher probability, but
remember that we require \emph{exactly} 3 reds. If we computed the probabilities of getting 1, 2, or 3 reds
separately and used the summation rule to find the probability that we would draw 1, 2, or 3
red cards, then $P$ would be much higher ($41/64 \approx 0.64$).
\end{example}
The binomial probability distribution can also be used to assess the likelihood of more serious scenarios, as the
next example shows.
\begin{example}
A silver-tongued con artist approaches you on a street in New York City with a simple proposition: He has
10 beads --- 9 black and one white. You get to pick one bead from his bag. You are
given six opportunities to draw a bead (the bead is returned to the bag after each try), and
if anytime during the six tries you pick the white bead then you have won and he will give you \$20.
However, if you
have not picked the white bead after six tries then you owe him \$20 instead. Is this a good deal?
Answer: Clearly, the probability of picking the white bead is fixed at $p = 0.1$. To lose
the bet you will have to come up empty-handed six times in a row. For $n = 6$ and $x = 0$ the
chance of that is simply
\begin{equation}
P_{6,0.1}(0) = \binom{6}{0} 0.1^0(1-0.1)^6 = 0.53.
\end{equation}
So while it is close to 50--50, the con artist will most likely win, at least in the long run.
You should probably also be concerned that something else might be going on, such as sleight-of-hand
removal of the white bead before each try\ldots
\end{example}
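To quantify ``in the long run'' for the bead game above (our own arithmetic, assuming the game is played exactly as
described), the average gain per game is
\begin{equation}
(+\$20) \times 0.4686 + (-\$20) \times 0.5314 \approx -\$1.26,
\end{equation}
where $0.4686 = 1 - 0.9^6$ is the probability of drawing the white bead at least once. On average you would thus
lose a little more than a dollar per game.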
\subsection{The Poisson distribution}
\index{Poisson distribution}
\index{Probability distribution!Poisson}
\index{Rare events}
\index{Binomial probability distribution!approximation}
In some situations, the binomial distribution can be approximated by simpler expressions.
One such case arises when the probability $p$ for one event is
very small and $n$ is large. Such events are called \emph{rare}, and the discrete distribution may then be approximated by
\index{Rate of occurrence}
\begin{equation}
P(x) = \frac{\lambda^x e^{-\lambda}}{x!},\quad x = 0, 1, 2, \ldots, n,
\end{equation}
where $\lambda = np$ is the \emph{rate of occurrence}. The Poisson distribution can be used to evaluate the
probabilities for the occurrence of rare events such as large earthquakes, volcanic eruptions, and reversals of the
geomagnetic field. For instance, the number of floods occurring in a 50-year period has been shown to
follow a Poisson distribution with $\lambda = 2.2$. What is the probability that we will have at least
one flood in the next 50-year period? Here, $P = 1 - P_0$, where $P_0$ is the probability of having no flood.
Plugging in $x = 0$ and $\lambda = 2.2$ we find $P_0 = e^{-2.2} = 0.1108$, so $P = 0.8892$.
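The same expression gives the probabilities of the other outcomes in this flood scenario (our own arithmetic,
using $e^{-2.2} \approx 0.1108$):
\begin{equation}
P(1) = \frac{2.2^1 e^{-2.2}}{1!} \approx 0.244, \quad P(2) = \frac{2.2^2 e^{-2.2}}{2!} \approx 0.268,
\end{equation}
so exactly two floods in a 50-year period is slightly more likely than exactly one.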
\begin{example}
A student is monitoring the radioactive decay of a certain sample that is expected to
undergo three decays per minute. The student observes the number of decays over 100
individual one-minute periods and constructs the summary shown in Table~\ref{tbl:decay1}.
\begin{table}[h]
\centering
\begin{tabular}{|l||c|c|c|c|c|c|c|c|c|c|} \hline
\textbf{Decays} & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ \hline
\textbf{Observed} & 5 & 19 & 23 & 21 & 14 & 12 & 3 & 2 & 1 & 0 \\ \hline
\end{tabular}
\caption{Number of decays observed in one-minute intervals.}
\label{tbl:decay1}
\end{table}
Do the data support the expected decay rate? We make a histogram of the data
by normalizing the observed frequencies by the total count
and superimposing the Poisson distribution for the expected rate. The result (Figure~\ref{fig:Fig1_poisson})
shows a very good fit.
\PSfig[h]{Fig1_poisson}{Histogram of observed decay rate frequencies (bars) and
the theoretical Poisson distribution (circles) for the expected rate $\lambda = 3$.}
\end{example}
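To put numbers on the comparison in the decay example above, we can tabulate the counts expected from a Poisson
distribution with $\lambda = 3$, i.e., $100 \cdot P(x)$, next to the observed counts (our own arithmetic, rounded
to one decimal):
\begin{center}
\begin{tabular}{|l||c|c|c|c|c|c|c|c|c|c|} \hline
\textbf{Decays} & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ \hline
\textbf{Observed} & 5 & 19 & 23 & 21 & 14 & 12 & 3 & 2 & 1 & 0 \\ \hline
\textbf{Expected} & 5.0 & 14.9 & 22.4 & 22.4 & 16.8 & 10.1 & 5.0 & 2.2 & 0.8 & 0.3 \\ \hline
\end{tabular}
\end{center}
The observed mean rate is $284/100 = 2.84$ decays per minute, close to the expected value of 3.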
\section{Continuous Probability Distributions}
While many populations are of a discrete nature (e.g., outcomes of coin tosses, numbers of
microfossils in a core, etc.), we are very often dealing with observations of a phenomenon that
can take on any of a continuous spectrum of values. We may sample the phenomenon at certain
points in space-time and thus have discrete observations. Nevertheless, the underlying probability
distribution is continuous (e.g., Figure~\ref{fig:Fig1_cont_pdf}).
\PSfig[h]{Fig1_cont_pdf}{Example of a continuous probability density function (pdf). The area under any pdf
must equal 1. The finite probability identified in (\ref{eq:probfinite}) is indicated in dark gray.}
Continuous distributions can be thought of as the limit for discrete distributions when the
``spacing'' between events shrinks to zero. Hence, we must replace the summation in (\ref{eq:Pdiscretesum}) with the integral
\index{Probability distribution!continuous}
\index{Continuous probability distribution}
\index{pdf (probability density function)}
\index{Probability density function (pdf)}
\begin{equation}
\int^\infty _{-\infty} p (x) d x = 1.
\label{eq:pdf}
\end{equation}
Because of their continuous nature, functions such as $p(x)$ in (\ref{eq:pdf}) are called \emph{probability
density functions} (pdf). The probability of an event is still defined by the area under the curve, but
now we must integrate to find the area and hence the probability.
For example, the probability that a random variable will take on a value between $a - \Delta$ and $a +\Delta$ is
\begin{equation}
P(a\pm \Delta) = \int ^{a+\Delta} _{a - \Delta} p(x) dx.
\label{eq:probfinite}
\end{equation}
As $\Delta \rightarrow 0$ we find that the probability goes to zero. Thus, the probability of getting exactly $x = a$
is zero.
The \emph{cumulative distribution function} (cdf) gives the probability that an observation less than or
equal to $a$ will occur. We obtain the integral expression for this distribution by replacing the
lower limit by $-\infty$ and the upper limit by $a$, finding
\begin{equation}
\index{Probability distribution!cumulative}
\index{Cumulative probability distribution}
P_c(a) = \int^a _{-\infty} p (x) dx.
\end{equation}
Obviously, as $a \rightarrow \infty, P_c(a)\rightarrow 1$. Given the cumulative distribution function we can
revisit (\ref{eq:probfinite}) and instead state
\begin{equation}
P(a\pm \Delta) = P_c(a+\Delta) - P_c(a - \Delta).
\label{eq:probfinite2}
\end{equation}
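\begin{example}
To illustrate (\ref{eq:probfinite}) and (\ref{eq:probfinite2}) with a concrete pdf, assume (purely for illustration;
this choice of pdf is ours) that $p(x) = e^{-x}$ for $x \geq 0$ and $p(x) = 0$ for $x < 0$. This function
has unit area, and its cdf is
\begin{equation}
P_c(a) = \int^a_0 e^{-x} dx = 1 - e^{-a}, \quad a \geq 0.
\end{equation}
The probability of observing a value in the interval $1 \pm 0.5$ is then
\begin{equation}
P(1 \pm 0.5) = P_c(1.5) - P_c(0.5) = e^{-0.5} - e^{-1.5} \approx 0.6065 - 0.2231 = 0.3834.
\end{equation}
\end{example}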
\subsection{The normal distribution}
\index{Normal distribution|(}
\index{Gaussian distribution|(}
So far the function $p(x)$ has been arbitrary. Any non-negative function with unit area under
the curve, i.e., satisfying (\ref{eq:pdf}), would qualify. We will now turn our attention to the best known and most frequently
used pdf: the \emph{normal distribution}. Its study dates back to 18th-century
investigations into the nature of experimental error. It was found that repeat
measurements of the same quantity displayed a surprising degree of regularity. In particular, the German scientist K.
F. Gauss played a major role in developing the theoretical foundations for the normal distribution,
hence its other name: the \emph{Gaussian} distribution. It is given by
\begin{equation}
p(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{- \frac{1}{2} \left( \frac{x-\mu}{\sigma} \right) ^2 },
\label{eq:gnorm}
\end{equation}
where $\mu$ and $\sigma$ have been defined previously. The constant term before the exponential normalizes the
area under the curve to unity (Figure~\ref{fig:Fig1_normal_pdf}). As discussed in Section~\ref{sec:zscore},
it is often convenient to transform your data into so-called
\emph{standard scores}:
\begin{equation}
\index{Standard scores}
\index{Normal scores}
z_i = \frac{x_i - \mu} {\sigma},
\end{equation}
in which case (\ref{eq:gnorm}) reduces to