[Statistical Inference]
Definition: "Generating conclusions about a population from a noisy sample."
-- Week 1: Why Statistical Inference?
[Probability]
- Given a random experiment (say rolling a die) a probability measure is a population quantity that summarizes the randomness.
- Specifically, probability takes a possible outcome from the experiment and:
+ Assigns it a number between 0 and 1
+ So that the probability that something occurs is 1 (the die must be rolled) and
+ So that the probability of the union of any two sets of outcomes that have nothing in common (mutually exclusive) is the sum of their respective probabilities
Rules probability must follow
- The probability that nothing occurs is 0
- The probability that something occurs is 1
- The probability of something is 1 minus the probability that the opposite occurs
- The probability of at least one of two (or more) things that can not simultaneously occur (mutually exclusive) is the sum of their respective probabilities
- If an event A implies the occurrence of event B, then the probability of A occurring is less than the probability that B occurs
- For any two events the probability that at least one occurs is the sum of their probabilities minus their intersection (else you sum the intersection twice)
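The last rule can be sanity-checked by simulation; a minimal sketch with one die, taking A = {1, 2} and B = {2, 4, 6} (the events are arbitrary choices for illustration):
rolls <- sample(1:6, 10000, replace = TRUE) # simulate 10000 die rolls
A <- rolls %in% c(1, 2); B <- rolls %in% c(2, 4, 6)
mean(A | B) # P(A U B), approximately 4/6
mean(A) + mean(B) - mean(A & B) # same quantity via the sum minus the intersection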
Discrete values without an upper bound (e.g. web traffic hits) are usually modeled with the Poisson distribution (the canonical PMF for counts).
[PMF] #assigns probabilities to specific values for discrete random variables
- A probability mass function (pmf) evaluated at a value corresponds to the probability that a discrete random variable takes that value. To be a valid pmf, a function p must satisfy:
1 - It must always be larger than or equal to 0.
2 - The sum of p over all possible values that the random variable can take must add up to one.
- Bernoulli distribution
X = 0 represents tails
X = 1 represents heads
p(x) = (1/2)^x * (1/2)^(1-x) # p(0) = 0.5 & p(1) = 0.5
p(x) = θ^x * (1-θ)^(1-x) # p(0) = 1 - θ & p(1) = θ
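A quick check that this is a valid pmf (θ = 0.3 is an arbitrary choice); dbinom with size = 1 gives the same Bernoulli probabilities:
theta <- 0.3
p <- function(x) theta^x * (1 - theta)^(1 - x)
c(p(0), p(1)) #=> 0.7 0.3 # nonnegative
sum(p(0:1)) #=> 1 # sums to one
dbinom(0:1, size = 1, prob = theta) #=> 0.7 0.3 # built-in equivalent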
[PDF]
- A probability density function (pdf), is a function associated with a continuous random variable. It must satisfy:
1 - It must be larger than or equal to zero everywhere.
2 - The total area under it must be one.
x <- c(-0.5, 0, 1, 1, 1.5)
y <- c(0, 0, 2, 0, 0)
plot(x, y, lwd = 3, frame = FALSE, type = 'l')
What is the probability that 75% or fewer of calls get addressed? # The density forms a right triangle, so we can calculate the area underneath as 0.5 * base * height
1.5 * 0.75 / 2 #=> 0.5625
pbeta(0.75, 2, 1) #=> 0.5625 // the Beta(2, 1) distribution has exactly this triangular density; pbeta returns the probability of being less than or equal to x
CDF and survival function
- The cumulative distribution function (CDF) of a random variable, X, returns the probability that the random variable is less than or equal to the value x.
F(x) = P(X <= x)
- The survival function of a random variable X is defined as the probability that the random variable is greater than the value of x
S(x) = P(X > x)
Notice that S(x) = 1 - F(x)
pbeta(c(0.4, 0.5, 0.6), 2, 1) # arguments: quantiles, then the two Beta shape parameters; returns F(x) = x^2 at each point
#=> 0.16 0.25 0.36
Quantiles
- The αth quantile of a distribution with distribution function F is the point xα so that F(xα) = α
+ A percentile is simply a quantile with α expressed as a percent
+ The median is the 50th percentile
We want to solve 0.5 = F(x) = x^2, so x = sqrt(0.5) # 0.7071 // F(x) = area of the triangle up to x = 1/2 * x * 2x = x^2
qbeta(0.5, 2, 1) #=> 0.7071
Summary
- You might be wondering at this point "I've heard of a median before, it didn't require integration. Where's the data?"
- What we are referring to are population quantities. Therefore, the median being discussed is the population median.
- A probability model connects the data to the population using assumptions.
- Therefore the median we are discussing is the estimand, the sample median will be the estimator.
Conditional Probability
- Let B be an event so that P(B) > 0
P(A | B) == probability of A given that event B has occurred
P(A & B) == intersection # & is the cap symbol
P(A | B) = P(A & B) / P(B)
A = {1}
B = {1, 3, 5}
P(one given that roll is odd) = P(A|B) = (1/6) / (3/6) = 1/3
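A simulation sketch of the same conditional probability:
rolls <- sample(1:6, 10000, replace = TRUE)
odd <- rolls %% 2 == 1
mean(rolls[odd] == 1) # P(roll = 1 | roll is odd), approximately 1/3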
P(A U B) = P(A) + P(B) - P(A & B) # P(A U B) == union probability (A or B; at least one)
QUIZ
P(F U M) == 0.17
P(F) == 0.12
P(F & M) == 0.06
0.17 = 0.12 + P(M) - 0.06 --> P(M) = 0.11
Bayes Rule
P(B | A) = [P(A | B) * P(B)] / [P(A | B) * P(B) + P(A | B^c) * P(B^c)]
Diagnostic tests
- Let + and - be the events that the result of a diagnostic test is positive or negative respectively
- Let D and D^c (D-complement) be the event that the subject of the test has or does not have the disease respectively
Sensitivity = P(+|D)
Specificity = P(-|D^c)
Positive predictive value = P(D|+)
Negative predictive value = P(D^c|-)
Prevalence of disease = P(D)
Example:
- Sensitivity of 99.7% # P(+|D)
- Specificity of 98.5% # P(-|D^c)
- Population with a .1% prevalence of HIV # P(D)
What is the associated positive predictive value?
P(D|+) = [P(+|D) * P(D)] / [P(+|D) * P(D) + P(+|D^c) * P(D^c)]
P(D|+) = [P(+|D) * P(D)] / [P(+|D) * P(D) + {1 - P(-|D^c)} * {1 - P(D)}]
P(+ | D^c) # Chances that the test is positive when subject does not have disease
==
1 - P(- | D^c) # 1 - the chance test is negative when subject does not have disease
Similar vein: P(D^c) == 1 - P(D)
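Plugging the example numbers into Bayes' rule in R:
sens <- 0.997; spec <- 0.985; prev <- 0.001
(sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
#=> 0.06238 # despite the accurate test, a positive result gives only a ~6% chance of disease, because prevalence is so low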
QUIZ
Sensitivity = 0.75 # P(+|D)
Specificity = 0.52 # P(-|D^c)
Prevalence = 0.30 # P(D)
Want to know: P(D|+)
P(D|+) = [0.75 * 0.3] / [0.75 * 0.3 + (1 - 0.52) * (1 - 0.3)] = 0.225 / [0.225 + 0.336] = 0.4011
[Independence]
Event A is independent of event B if
P(A | B) = P(A) where P(B) > 0
P(A & B) = P(A) * P(B)
IID random variables
iid = independent and identically distributed
Random variables are said to be iid if they are independent and identically distributed
- Independent: statistically unrelated from one another
- Identically distributed: all having been drawn from the same population distribution
- iid random variables are the default model for random samples
[Expected Values]
- Expected values characterize a distribution: the expected value (mean) is its center of mass, and sample expected values (like the sample mean) estimate the corresponding population quantities.
library(manipulate); library(ggplot2)
library(UsingR); data(galton) # Galton's parent/child height data
myHist <- function(mu){
  g <- ggplot(galton, aes(x = child))
  g <- g + geom_histogram(fill = "salmon", binwidth = 1, aes(y = ..density..), colour = "black")
  g <- g + geom_density(size = 2)
  g <- g + geom_vline(xintercept = mu, size = 2)
  mse <- round(mean((galton$child - mu)^2), 3) # mean squared error around the candidate mu
  g <- g + labs(title = paste('mu = ', mu, ' MSE = ', mse))
  g
}
manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))
Example of a population mean
- Suppose a coin is flipped and X is declared 0 or 1 corresponding to a tail or a head, respectively.
- What is the expected value of X?
E[X] = 0.5 * 0 + 0.5 * 1 = 0.5
What about a biased coin?
Suppose that a random variable, X, is so that P(X = 1) = p and P(X = 0) = (1 - p)
E[X] = 0 * (1 - p) + 1 * p = p
[Continuous Random Variables]
- For a continuous variable, X, with density, f, the expected value is again exactly the center of mass of the density
Facts about expected values
- Recall that expected values are properties of distributions (center of mass of distribution)
- Note that the average of random variables is itself a random variable and its associated distribution has an expected value
- The center of this distribution is the same as that of the original distribution
-- Week 2
[Introduction to Variability]
- The variance of a random variable is a measure of spread
Var(X) = E[(X-μ)^2] = E[X^2] - E[X]^2
- The square root of the variance is called the standard deviation
Example:
- E[X] = 3.5 (when X is the toss of a die)
- E[X^2] = 1^2 * 1/6 + 2^2 * 1/6 + ... + 6^2 * 1/6 = 15.17
- Var(X) = E[X^2] - E[X]^2 = 2.92
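Checking the die arithmetic in R:
x <- 1:6
mean(x^2) - mean(x)^2 #=> 2.9167 # E[X^2] - E[X]^2 for a fair die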
What is the variance from the result of the toss of a coin with probability of heads (1) of p?
E[X] = 0 * (1 - p) + 1 * p = p
E[X^2] = E[X] = p # since 0^2 = 0 & 1^2 = 1, they are the same in this case!
Var(X) = E[X^2] - E[X]^2 = p - p^2 = p(1 - p)
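A simulation sketch of the coin result (p = 0.3 is an arbitrary choice):
p <- 0.3
flips <- rbinom(100000, size = 1, prob = p)
var(flips) # approximately p(1 - p)
p * (1 - p) #=> 0.21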
Recall the mean
- Recall that the average of a random sample from a population is itself a random variable
+ It has its own population mean (which is the same as the original/actual population mean) and population variance
E[X] = μ
To relate the random sample variance back to the original/actual variance: Var(X) = σ^2/n # hence the variance of the sample mean goes towards zero as n increases, since the distribution centers more and more around the actual population mean it is trying to estimate
+ This is useful because in real life you only get one sample mean, not many as we did here with these many experiments
X = sample mean
μ = population mean
σ = population stdev
S = sample stdev
σ^2 = population variance
S^2 = sample variance
standard error of the mean = σ / n^0.5 # estimated by S / n^0.5
Var(X) = σ^2/n # variance of the sample mean = population variance / n
- We call the standard deviation of a statistic (e.g. the mean) a standard error # "error" refers to its (potential) difference from the true value of the statistic
To Summarize
- The sample variance, S^2, estimates the population variance, σ^2
- The distribution of S^2 is centered around σ^2
+ And it gets more centered around σ^2 with more observations
- The distribution of X is centered around μ
+ And it gets more centered around μ with more observations
- The variance of the sample mean, Var(X), is σ^2/n
+ Its logical estimate is S^2 / n
+ Hence the logical estimate of the standard error = S/(n^0.5)
* S, the sample standard deviation, talks about how variable the population/sample is
* S/(n^0.5), the standard error, talks about how variable averages of random samples of size n from the population are
Simulation Example
- Normal distribution
# Standard normals have variance 1 (and stdev 1); means of a random sample of n standard normals have stdev 1/(n^0.5) # standard error formula
nosim <- 1000
n <- 10
sd(apply(matrix(rnorm(nosim * n), nosim), 1, mean))
#=> 0.3156
1 / sqrt(n)
#=> 0.3162
- Uniform Distribution
# Standard uniforms have variance 1/12, means of a random sample of n uniforms have stdev 1/((12*n)^0.5)
nosim <- 1000
n <- 10
sd(apply(matrix(runif(nosim * n), nosim), 1, mean))
#=> 0.09017
1 / sqrt(12*n)
#=> 0.09129
- Poisson Distribution
# Poisson(4) have variance 4; means of a random sample of n Poisson have stdev 2/(n^0.5)
nosim <- 1000
n <- 10
sd(apply(matrix(rpois(nosim * n, lambda = 4), nosim), 1, mean))
#=> 0.6219
2 / sqrt(n)
#=> 0.6325
- Fair coin flips
# Fair coin flips have variance 0.25, means of a random sample of n coin flips have stdev 1/(2*n^0.5)
nosim <- 1000
n <- 10
sd(apply(matrix(sample(0 : 1, nosim * n, replace = TRUE), nosim), 1, mean))
#=> 0.1587
1 / (2 * sqrt(n))
#=> 0.1581
The Sample Variance
Population mean --> center of mass of the population
Sample mean --> center of mass of the observed data (sample)
Population variance --> Var(X) = E[(X-μ)^2] = E[X^2] - E[X]^2 # the expected squared distance of the random variable from the population mean
Sample variance --> S^2 = [Σ(Xi - X)^2] / [n - 1] # the average squared distance of the observations from the sample mean (dividing by n - 1 makes the estimate unbiased)
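Note that R's var() already divides by n - 1; a one-line check:
x <- rnorm(10)
c(var(x), sum((x - mean(x))^2) / (length(x) - 1)) # identical values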
Variance of the Sample Variance
- The sample variance is a function of data, it is also a random variable, so:
+ It has an associate population distribution
+ Its expected value is the population variance
+ As you collect more data, the distribution of the sample variance is going to get more concentrated around the population variance
Data Example
library(UsingR); data(father.son);
x <- father.son$sheight # son's height
n <- length(x) # number of observations
round(c(var(x), var(x) / n, sd(x), sd(x) / sqrt(n)), 2)
#=> [1] 7.29 0.01 2.81 0.09
- 7.29 = variance (in inches^2) # variability in children's heights
- 0.01 = the variability of the average of children's heights (in inches^2) # no longer about variability in individual heights
- 2.81 = standard deviation (in inches) # variability in children's heights
- 0.09 = the standard error (in inches) == the standard deviation of the distribution of averages of n children's heights (an estimate)
Summarizing what we know about variances
- The sample variance is an estimate of the population variance
- The distribution of the sample variance is centered at what it is estimating
- It gets more concentrated around the population variance with larger sample sizes
- The variance of the sample mean is the population variance divided by n: Var(X) = σ^2/n, so its standard deviation (the standard error) is σ/n^0.5
[Some Common Distributions]
-- The Bernoulli distribution
- The Bernoulli distribution arises as the result of a binary outcome (say a coinflip)
+ Bernoulli random variables take (only) the values 1 and 0 with probabilities of (say) p and 1 - p
+ The mean of a Bernoulli random variable is p, the variance is p(1 - p)
* 1 = success
* 0 = failure
P(X = x) = (p^x) * (1-p)^(1-x)
Binomial Distribution
- A binomial random variable is obtained as the sum of a bunch of iid Bernoulli variables
+ In specific, let X1, ..., Xn be iid Bernoulli(p);
+ then X = ΣXi is a binomial random variable
- The total number of heads on the flips of a potentially biased coin
P(X = x) = ( n ) * p^x * (1-p)^(n-x)
( x )
( n ) = n! / (x!(n-x)!)
( x )
Example
Suppose a friend has 8 children (oh my!), 7 of which are girls.
- If each gender has an independent 50% probability for each birth, what
is the probability of getting 7 or more girls out of 8 births?
( 8 ) * 0.5^7 * (1 - 0.5)^1 + ( 8 ) * 0.5^8 * (1 - 0.5)^0 = 0.04
( 7 ) ( 8 )
# ( 8 ) = 8! / (7!(8-7)!)
# ( 7 )
Same as:
choose(8, 7) * 0.5^8 + choose(8, 8) * 0.5^8
#=> 0.03516
pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE) # greater than 6 == 7 & 8
#=> 0.03516
-- Normal Distribution
- Normal/Gaussian distribution with mean (E[X]) μ and variance (Var(X)) σ^2 # X~N(μ,σ^2)
+ Bell curve, density: (2πσ^2)^(-1/2) * exp(-(x-μ)^2 / (2σ^2))
- When μ = 0 and σ = 1, we have the standard normal distribution # standard random normals are often labeled Z
If X~N(μ,σ^2) # X is from a normal distribution with mean μ and variance σ^2
Z = (X - μ) / σ ~ N(0,1)
+ We convert the units of X to standard deviations from the mean, the resulting random variable Z is a standard normal
+ Conversely, if we have a standard normal (Z), we can convert it back to a non-standard normal:
X = μ + σZ ~ N(μ,σ^2)
More facts about the normal density
1 - Approximately 68%, 95% and 99% of the normal density lies within 1, 2 and 3 standard deviations from the mean, respectively.
2 - -1.28, -1.645, -1.96 and -2.33 are the 10th, 5th, 2.5th and 1st percentiles of the standard normal distribution respectively.
3 - 1.28, 1.645, 1.96 and 2.33 are the 90th, 95th, 97.5th and 99th percentiles of the standard normal distribution respectively.
+ In units of sigma.
+ For a non-standard normal, the cut points are re-calculated as μ - 1.28*σ & μ + 1.28*σ (everything between these two points = 80%, between the 10th and 90th percentiles), etc.
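These reference quantiles can be recovered with qnorm:
qnorm(c(0.10, 0.05, 0.025, 0.01)) #=> -1.282 -1.645 -1.960 -2.326
qnorm(c(0.90, 0.95, 0.975, 0.99)) #=> 1.282 1.645 1.960 2.326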
Question
- What is the 95th percentile of a N(μ,σ^2) distribution (X.95)?
+ Quick answer in R: qnorm(.95, mean = mu, sd = sd)
+ Since we know 1.645 is the 95th percentile of the standard normal, answer = μ + 1.645*σ
- What is the probability that a N(μ,σ^2) random variable (RV) is larger than x?
+ Quick answer in R: pnorm(x, mean = μ, sd = sigma, lower.tail = FALSE) or 1 - pnorm(x, mean = μ, sd = sigma)
+ A conceptual way is to calculate how many standard deviations away x is from the mean --> (x - μ) / σ # x expressed as a number of standard deviations from the mean
+ Example:
* Assume that the number of daily ad clicks for a company is (approximately) normally distributed with a mean of 1020 and a standard deviation of 50.
What is the probability of getting more than 1160 clicks in a day?
- It is not very likely, since 1160 is 2.8 standard deviations away from the mean.
+ pnorm(1160, mean = 1020, sd = 50, lower.tail = FALSE)
#=> 0.002555
+ pnorm(2.8, lower.tail = FALSE) # defaults to the standard normal when no mean/sd are supplied
#=> 0.002555
* Assume that the number of daily ad clicks for a company is (approximately) normally distributed with a mean of 1020 and a standard deviation of 50.
What number of daily ad clicks would represent the one where 75% of days have fewer clicks (assuming the days are independent and identically distributed)?
- qnorm(0.75, mean = 1020, sd = 50)
#=> 1054
-- Poisson
P(X = x; λ) = (λ^x * e^-λ) / x! # mass function
+ The mean of this distribution is λ
+ The variance of the distribution is also λ
+ x = strictly non-negative integer # something cannot have occurred less than 0 times
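A simulation check that the mean and variance agree (λ = 4 is an arbitrary choice):
x <- rpois(100000, lambda = 4)
c(mean(x), var(x)) # both approximately 4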
Some uses for the Poisson distribution
- Modeling count data
- Modeling event-time or survival data
- Modeling contingency tables (again counts, multi-dimensional, grouped by more than 2 categories)
- Approximating binomials when n is large and p is small # for instance number of people (n), and how many people contract a certain disease (p)
Rates and Poisson random variables
- Poisson random variables are used to model rates
- X~Poisson(λt) where
+ λ = E[X/t] is the expected count per unit of time # the average/expected number of occurrences for a given unit of time t
- t is the total monitoring time
Example
- The number of people that show up at a bus stop is Poisson with a mean of 2.5 per hour.
If watching the bus stop for 4 hours, what is the probability that 3 or fewer people show up for the whole time?
+ ppois(3, lambda = 2.5 * 4)
#=> 0.01034
Poisson approximation to the binomial
- When n is large and p is small, the Poisson distribution is an accurate approximation to the binomial distribution
- Notation
+ X~Binomial(n,p)
+ λ = n*p
- n gets large
- p gets small
Example
We flip a coin with success probability 0.01 five hundred times.
What is the probability of 2 or fewer successes?
pbinom(2, size = 500, prob = 0.01)
#=> 0.1234
ppois(2, lambda = 500 * 0.01)
#=> 0.1247
[A Trip to Asymptopia]
Asymptotics
- Asymptotics is the term for the behavior of statistics as the sample size (or some other relevant quantity) limits
to infinity (or some other relevant number)
- Asymptotics are incredibly useful for simple statistical inference and approximations
- Asymptotics form the basis for the frequency interpretation of probabilities (the long-run proportion of times an event occurs)
Limits of random variables
- These results allow us to talk about the large sample distribution of sample means of a collection of iid observations
- The first of these results, the Law of Large Numbers, we intuitively know
+ It says that the average limits to what it's estimating, the population mean
* Example: Xn could be the average of the result of n coin flips (i.e. the sample proportion of heads)
* As we flip a fair coin over and over, it eventually converges to the true probability of a head
Law of Large numbers (LLN) in action
n <- 1000
means <- cumsum(rnorm(n))/(1:n)
# as the number of simulations increases, we get closer and closer to the actual mean
means <- cumsum(sample(0:1, n, replace = TRUE)) / (1:n)
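Plotting the running mean makes the convergence visible (a minimal base-R sketch):
plot(1:n, means, type = "l", xlab = "n", ylab = "cumulative mean")
abline(h = 0.5, lty = 2) # true mean of a fair coin flip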
Discussion
- An estimator is consistent if it converges to what you want to estimate
- The LLN says that the sample mean of iid samples is consistent for the population mean
- Typically, good estimators are consistent; it is not too much to ask that if we go through
the trouble of collecting an infinite amount of data that we get the right answer
- The sample variance and sample standard deviation of iid random variables are consistent as well
The Central Limit Theorem (CLT)
- For our purposes, the CLT states that the distribution of averages of iid variables (properly normalized) becomes
that of a standard normal as the sample size increases.
(Xn - μ) / (σ/n^0.5) = (Estimate - Mean of estimate) / Std. error of estimate
Xn = sample average
+ Has a distribution like that of a standard normal for large n
+ The useful way to think about the CLT is that Xn is approximately N(μ,σ^2/n) # mean is the population mean; the variance is the squared standard error of the mean
Example
- Simulate a standard normal random variable by rolling n (six sided) dice
+ Let Xi be the outcome for die i
+ Then note that μ = E[Xi] = 3.5
+ Var(Xi) = 2.92
+ SE (2.92/n)^0.5 = 1.71 / n^0.5
+ Let's roll n dice, take their mean, subtract off 3.5, and divide by 1.71/n^0.5 # scaling by the standard error; simulated below
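A simulation sketch of this recipe (1000 averages of 10 dice each):
nosim <- 1000; n <- 10
rolls <- matrix(sample(1:6, nosim * n, replace = TRUE), nosim)
z <- (apply(rolls, 1, mean) - 3.5) / (1.71 / sqrt(n)) # standardize each average
c(mean(z), sd(z)) # approximately 0 and 1
hist(z) # approximately bell shaped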
Coin CLT
- Let Xi be the 0 or 1 result of the i-th flip of a possibly unfair coin
+ The sample proportion, say ^p, is the average of the coin flips
+ E[Xi] = p and Var(Xi) = p(1 - p)
+ SE of the mean is (p(1-p)/n)^0.5
+ (^p - p) / (p(1-p)/n)^0.5 # should be normally distributed
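The analogous simulation check for coin flips (fair coin, 10 flips per average):
nosim <- 1000; n <- 10; p <- 0.5
phats <- rbinom(nosim, size = n, prob = p) / n
z <- (phats - p) / sqrt(p * (1 - p) / n)
c(mean(z), sd(z)) # approximately 0 and 1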
-- Asymptotics and Confidence Intervals
- X (sample mean) is approximately normally distributed with mean μ and sd σ/n^0.5 # SE
- Probability that X is bigger than μ + 2σ/n^0.5 or smaller than μ - 2σ/n^0.5 is 5% (technically 1.96 standard errors)
- X +- 2σ / n^0.5 is called a 95% interval for μ
Example: Give a confidence interval for the average height of simulations
library(UsingR)
data(father.son)
x <- father.son$sheight
(mean(x) + c(-1, 1) * qnorm(0.975) * sd(x)/sqrt(length(x)))/12 # dividing by 12 so that the confidence interval is in feet rather than inches
#=> 5.710 5.738 --> a 95% confidence interval for the average height (in feet)
-- Sample proportions
- In the event that each Xi is 0 or 1 with common success probability p then σ^2 = p(1-p)
- The interval takes the form:
^p +- z[1-α/2] * (p(1-p)/n)^0.5
- Replacing p by ^p in the standard error results in what is called a Wald confidence interval for p
- For 95% intervals:
^p +- 1/n^0.5
is a quick estimate for p
Example
- Your campaign advisor told you that in a random sample of 100 likely voters, 56 intend to vote for you.
+ Can you relax? Do you have this race in the bag?
+ Without access to a computer or calculator, how precise is this estimate?
- 1/sqrt(100) = 0.1, so a back of the envelope calculation gives an approximate 95% interval of (0.46, 0.66)
+ Not enough for you to relax, better go do more campaigning!
- Rough guidelines, 100 for 1 decimal place, 10000 for 2, 1000000 for 3
round(1/sqrt(10^(1:6)), 3)
#=> 0.316 0.100 0.032 0.010 0.003 0.001
Binomial Interval
0.56 + c(-1, 1) * qnorm(0.975) * sqrt(0.56 * 0.44 / 100)
#=> 0.4627 0.6573
binom.test(56, 100)$conf.int
#=> 0.4572 0.6592
#=> attr(,"conf.level")
#=> 0.95
Simulation
n <- 20 # change to 100 for better results
pvals <- seq(0.1, 0.9, by = 0.05)
nosim <- 1000
coverage <- sapply(pvals, function(p) {
  phats <- rbinom(nosim, prob = p, size = n)/n
  ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n)
  ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n)
  mean(ll < p & ul > p) # proportion of Wald intervals that cover the true p
})
What is happening?
- n is not large enough (20) for the CLT to be applicable for many of the values of p
- Quick fix, form the interval with (X + 2)/(n + 4)
+ Add two successes and failures, Agresti/Coull interval
n <- 20 # change to 100 for better results
pvals <- seq(0.1, 0.9, by = 0.05)
nosim <- 1000
coverage <- sapply(pvals, function(p) {
  phats <- (rbinom(nosim, prob = p, size = n) + 2) / (n + 4) # phat = ^p
  ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n)
  ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n)
  mean(ll < p & ul > p)
})
Poisson Interval
- A nuclear pump failed 5 times in 94.32 days. Give a 95% confidence interval for the failure rate per day.
+ X ~ Poisson(λt)
+ Estimate ^λ = X/t # λ = failure rate, t = number of days
+ Var(^λ) = λ/t
+ ^λ/t is our estimate of Var(^λ) # plug in the estimate ^λ for λ
x <- 5
t <- 94.32
lambda <- x/t
round(lambda + c(-1, 1) * qnorm(0.975) * sqrt(lambda/t), 3)
#=> 0.007 0.099
poisson.test(x, T = 94.32)$conf
#=> 0.01721 0.12371
#=> attr(,"conf.level")
#=> 0.95
Summary
- The LLN states that averages of iid samples converge to the population means that they are estimating
- The CLT states that averages are approximately normal, with distributions:
+ centered at the population mean
+ with standard deviation equal to the standard error of the mean
+ CLT gives no guarantee that n is large enough
- Taking the mean and adding and subtracting the relevant normal quantile times the SE yields a confidence interval for the mean
+ Adding and subtracting 2 SEs works for 95% CIs
- CIs get wider as the coverage increases (why?)
+ The more sure you want to be the CI contains the parameter, the wider the procedure makes the interval
- The Poisson and binomial case have exact intervals that do not require the CLT
+ A quick fix for small sample size binomial calculations is to add 2 successes and failures
-- Week 3: Confidence Intervals and Testing
[T Confidence Intervals]
- In the previous lectures, we discussed creating a confidence interval using the CLT:
+ They took the form Estimate +- ZQ * SE
* Z quantile of the normal distribution
* SE of the estimate
- We will now also look at the T distribution:
+ Est +- TQ * SE
* We change the Z-quantile to a T-quantile, using the T-distribution
- This distribution has heavier tails than the normal distribution (so these intervals will be wider)
- Whenever in doubt between Z/T, use T:
+ As you collect more data, T will become like Z anyway
Gosset's t distribution
- Invented by William Gosset (under pseudonym 'Student') in 1908
- Has thicker tails than the normal distribution
- Is indexed by a degrees of freedom; gets more like a standard normal as df gets larger
- It assumes that the underlying data are iid Gaussian with the result that (X - μ) / (S / n^0.5)
+ Follows Gosset's t distribution with n - 1 degrees of freedom
+ Interval is X +- t(n-1) * S/n^0.5 where t(n-1) is the relevant quantile
+ degrees of freedom = n-1
Notes about the t interval
- The t interval technically assumes that the data are iid normal, though it is robust to this assumption
- It works well whenever the distribution of the data is roughly symmetric and mound shaped
- Paired observations (Xt vs Xt+1) are often analyzed using the t interval by taking differences
- For large degrees of freedom, t quantiles become the same as standard normal quantiles; therefore this
interval converges to the same interval as the CLT yielded
- For skewed distributions, the spirit of the t interval assumptions are violated
+ Also, for skewed distributions, it does not make a lot of sense to center the interval at the mean
+ In this case, consider taking logs or using a different summary like the median
- For highly discrete data, like binary, other intervals are available
Example: Dependent Groups / Paired
Sleep Data
- In R typing data(sleep) brings up the sleep data originally analyzed in Gosset's Biometrika paper,
which shows the increase in hours of sleep for 10 patients on two soporific drugs.
R treats the data as two groups rather than paired.
Results:
g1 <- sleep$extra[1 : 10]; g2 <- sleep$extra[11 : 20]
difference <- g2 - g1
m <- mean(difference); s <- sd(difference); n <- 10
m + c(-1, 1) * qt(.975, n-1) * s / sqrt(n)
t.test(difference)
t.test(g2, g1, paired = TRUE)
t.test(extra ~ I(relevel(group, 2)), paired = TRUE, data = sleep)
Example: Independent Group t confidence intervals
- Suppose that we want to compare the mean blood pressure between two groups in a
randomized trial; those who received the treatment to those who received a placebo.
+ We cannot use the paired t test because the groups are independent and may have different sample sizes
+ We now present methods for comparing independent groups
Confidence Interval
- Therefore a (1 - α) * 100% confidence interval for μy - μx is:
Y - X +- t(nx+ny-2, 1-α/2) * Sp * (1/nx + 1/ny)^0.5
+ Sample mean Y - sample mean X
+ t-quantile:
* DF: nx + ny - 2
* 1-α/2
* Sp^2 = {(nx - 1)*Sx^2 + (ny - 1)*Sy^2} / (nx + ny - 2) # pooled variance; its square root is the pooled stdev
# You simply weigh them according to sample sizes, if sizes are the same, it is equal to the average
- Remember that this interval is assuming a constant variance across the two groups
- If there is some doubt, assume a different variance per group, which we will cover later on
+ Note that as we collect more data, 1/nx & 1/ny become small and the interval narrows
Example: Comparing SBP for 8 oral contraceptive users versus 21 controls
- Xoc = 132.86 mmHg with soc = 15.34 mmHg # n = 8
- Xc = 127.44 mmHg with sc = 18.23 mmHg # n = 21
- Pooled variance estimate
sp <- sqrt((7 * 15.34^2 + 20 * 18.23^2) / (8 + 21 - 2))
132.86 - 127.44 + c(-1, 1) * qt(0.975, 27) * sp * (1/8 + 1/21)^0.5
# OC - C, not the other way around: the question is whether OC users have higher SBP than controls
Another Example: Mistakenly treating the Sleep Data as Grouped
n1 <- length(g1); n2 <- length(g2)
sp <- sqrt(((n1-1) * sd(g1)^2 + (n2-1) * sd(g2)^2) / (n1 + n2 - 2))
md <- mean(g2) - mean(g1)
semd <- sp * sqrt(1 / n1 + 1 / n2) # this is the standard error (of the mean difference)!
rbind(
  md + c(-1, 1) * qt(0.975, n1 + n2 - 2) * semd,
  t.test(g2, g1, paired = FALSE, var.equal = TRUE)$conf,
  t.test(g2, g1, paired = TRUE)$conf
)
#=> -0.2039 3.364
#=> -0.2039 3.364
#=> 0.7001 2.460
ChickWeight data in R
library(datasets); data(ChickWeight); library(reshape2)
## define weight gain or loss
wideCW <- dcast(ChickWeight, Diet + Chick ~ Time, value.var = "weight")
names(wideCW)[-(1: 2)] <- paste("time", names(wideCW)[-(1 : 2)], sep = "")
library(dplyr)
wideCW <- mutate(wideCW, # this creates a new column, gain, in df wideCW
gain = time21 - time0
)
# Comparing diet 1 to diet 4
wideCW14 <- subset(wideCW, Diet %in% c(1,4))
rbind(
  t.test(gain ~ Diet, paired = FALSE, var.equal = TRUE, data = wideCW14)$conf,
  t.test(gain ~ Diet, paired = FALSE, var.equal = FALSE, data = wideCW14)$conf
)
Unequal Variances
- When the group variances are unequal, the statistic no longer follows a t-distribution
+ It can be approximated by a t-distribution, by changing the formula to calculate the degrees of freedom (no longer nx + ny - 2):
df = (Sx^2/nx + Sy^2/ny)^2 / ((Sx^2/nx)^2 / (nx - 1) + (Sy^2/ny)^2 / (ny - 1))
--> When in doubt whether or not the variances of the groups are equal, just use var.equal = FALSE.
Example
Comparing SBP for 8 oral contraceptive users versus 21 controls
- Xoc = 132.86 mmHg with soc = 15.34 mmHg
- Xc = 127.44 mmHg with sc = 18.23 mmHg
- df = 15.04, t(15.04, 0.975) = 2.13 # df calculated with the new df formula
- Interval:
132.86 - 127.44 +- 2.13 * (15.34^2/8 + 18.23^2/21)^0.5
- In R, t.test(..., var.equal = FALSE)
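A by-hand check of the df formula and the interval (same numbers as above):
sx <- 15.34; nx <- 8; sy <- 18.23; ny <- 21
df <- (sx^2/nx + sy^2/ny)^2 / ((sx^2/nx)^2 / (nx - 1) + (sy^2/ny)^2 / (ny - 1))
df #=> 15.04
132.86 - 127.44 + c(-1, 1) * qt(0.975, df) * sqrt(sx^2/nx + sy^2/ny)
#=> approximately -8.91 19.75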
Comparing other kinds of data
For binomial data, there are lots of ways to compare two groups
- Relative risk, risk difference, odds ratio.
- Chi-squared tests (expected vs observed), normal approximations, exact tests.
For count data, there is also Chi-squared tests and exact tests.
[Hypothesis Testing]
- Hypothesis testing is concerned with making decisions using data
- A null hypothesis is specified that represents the status quo, usually labeled H0
+ The null hypothesis is assumed true and statistical evidence is required to reject it in favor of a research or alternative hypothesis
Example - Respiratory Disturbance Index
- A respiratory disturbance index (RDI) of more than 30 events / hour, say, is considered evidence of severe sleep disordered breathing (SDB).
- Suppose that in a sample of 100 overweight subjects with other risk factors for sleep disordered breathing at a sleep clinic, the mean RDI
was 32 events / hour with a standard deviation of 10 events / hour.
- We might want to test the hypothesis that:
+ H0: μ = 30
+ Ha: μ > 30
- where μ is the population mean RDI
- The alternative hypotheses are typically of the form <, > or !=
- Note that there are four possible outcomes of our statistical decision process:
1 - Truth: H0 | Decide: H0 | Result: Correctly accept null
2 - Truth: H0 | Decide: Ha | Result: Type I error (FP)
3 - Truth: Ha | Decide: Ha | Result: Correctly reject null
4 - Truth: Ha | Decide: H0 | Result: Type II error (FN)
Type I & Type II rates are related: If Type I rate increases, Type II rate decreases, and vice-versa.
- Consider a court of law; the null hypothesis is that the defendant is innocent
- We require a standard on the available evidence to reject to null hypothesis (convict)
- If we set a low standard, then we would increase the percentage of innocent people convicted (type I errors - FPs);
however we would also increase the percentage of guilty people convicted (correctly, rejecting the null)
- If we set a high standard, then we increase the percentage of innocent people let free (correctly accepting the null)
while we would also increase the percentage of guilty people let free (type II - FNs)
Example - Respiratory Disturbance Index (cntd..)
- A reasonable strategy would reject the null hypothesis if X was larger than some constant C
- Typically, C is chosen so that the probability of a Type I error (FP), α, is 0.05 (or some other relevant constant)
- α = Type I error rate = Probability of rejecting the null hypothesis when, in fact, the null hypothesis is correct # you do not want this, but you also don't want to set α too low, hence you'd never reject the null hypothesis
- Standard error of the mean = 10 / 100^0.5 = 1
- Under H0 X ~ N(30, 1)
- We want to choose C so that the P(X > C; H0) is 5%
- The 95th percentile of a normal distribution is 1.645 standard deviations from the mean
- If C = 30 + 1 * 1.645 = 31.645 # 1 is the standard error
+ Then the probability that a N(30,1) is larger than it is 5%
+ So the rule "Reject H0 when X >= 31.645" has the property that the probability of rejection is 5% when H0 is true (for the μ, σ and n given)
- In general we do not convert C back to the original scale
- We would just reject because the Z-score, which is how many standard errors the sample mean is above the hypothesized mean, is greater than 1.645:
(32 - 30) / (10/100^0.5) = 2 > 1.645 # take the difference between the sample and hypothesized means, then divide by the SE
- Or, in general, whenever n^0.5 * (X - μ0)/s > Z(1-α)
Example reconsidered (T-tests)
# One-sided test
- Consider our example again. Suppose that n = 16 (rather than 100)
- The statistic (X - 30) / (s/16^0.5) follows a T distribution with 15 df under H0.
+ Under H0 the probability that it is larger than the 95th percentile of the T distribution is 5%
+ The 95th percentile of the T distribution with 15 df is 1.7531 (obtained via qt(.95, 15))
+ Our test statistic is now: 16^0.5(32 - 30)/10 = 0.8
* We fail to reject H0 since 0.8 < 1.7531
# Two sided test
- Suppose that we would reject the null hypothesis if in fact the mean was too large or too small
- That is, we want to test the alternative Ha: μ != 30
- We will reject H0 if the test statistic, 0.8, is either too large or too small
+ Since the statistic is positive, we only need to consider the large side (positive side)
- Then we want the probability of rejecting under the null to be 5%, split equally as 2.5% in the upper tail and 2.5% in the lower tail
- Thus we reject if our test statistic is larger than qt(.975, 15) or smaller than qt(0.025, 15)
+ This is the same as saying: reject if the absolute value of our statistic is larger than qt(0.975, 15)
+ So we fail to reject the two sided test as well
+ (If you fail to reject the one-sided test, you know that you will fail to reject the two sided test)
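Checking the cutoffs in R:
qt(0.95, 15) #=> 1.753 # one sided cutoff; 0.8 < 1.753, fail to reject
qt(0.975, 15) #=> 2.131 # two sided cutoff; |0.8| < 2.131, fail to reject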
T-test in R
One-Group Testing
library(UsingR)
data(father.son)
t.test(father.son$sheight - father.son$fheight) # paired same as t.test(father.son$sheight, father.son$fheight, paired = TRUE)
#=> t = 11.79 # if true mean = 0, our sample would fall at this test statistic, very unlikely, hence H0 rejected
#=> df = 1077 # 1078 observations - 1
#=> p-value < 2.2e-16
#=> conf interval: 0.831 1.163 --> a 95% confidence interval for the true mean difference
#=> mean of x: 0.997
Connections with Confidence Intervals
- Consider testing H0: μ = μ0 versus Ha: μ != μ0 # μ0 = "mu naught", the hypothesized value
- Take the set of all possible values for which you fail to reject H0, this set is a (1-α)100% confidence interval for μ
- The same works in reverse; if a (1-α)100% interval contains μ0, then we fail to reject H0
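A minimal illustration of this duality (simulated data; μ0 = 0 is an arbitrary choice):
x <- rnorm(20, mean = 0.2)
tt <- t.test(x, mu = 0)
tt$p.value > 0.05 # fail to reject at α = 0.05 ...
tt$conf.int[1] < 0 & tt$conf.int[2] > 0 # ... exactly when the 95% interval contains 0 (the two logicals agree)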
Two-Group Testing
- First, now you know how to do two group T tests since we already covered independent group T intervals
- Rejection rules are the same
- Test H0 : μ1 = μ2
Example
library(datasets); data(ChickWeight); library(reshape2)
## Define weight gain or loss
wideCW <- dcast(ChickWeight, Diet + Chick ~ Time, value.var = "weight")
names(wideCW)[-(1:2)] <- paste("time", names(wideCW)[-(1:2)], sep = "")
library(dplyr)
wideCW <- mutate(wideCW,
gain = time21 - time0
)
# Equal/Unequal Variance T test comparing diets 1 and 4
wideCW14 <- subset(wideCW, Diet %in% c(1,4))
t.test(gain ~ Diet, paired = FALSE, var.equal = TRUE, data = wideCW14)
#=> t = -2.725
#=> df = 23
#=> p-value = 0.01207
#=> alternative hypothesis: true difference in means is not equal to 0
#=> 95% conf interval: -108.15 -14.81
#=> sample estimates: mean group 1: 136.2, mean group 4: 197.7
Exact Binomial Test
- Recall this problem, "Suppose a friend has 8 children, 7 of which are girls and none are twins"
- Perform the relevant hypothesis test: H0: p = 0.5, Ha: p > 0.5
+ What is the relevant rejection region, so that the probability of rejecting (H0 incorrectly; FP) is less than 5%?
Rejection Region Type I Error Rate
[0 : 8] - 1
[1 : 8] - 0.9961
[2 : 8] - 0.9648
[3 : 8] - 0.8555
[4 : 8] - 0.6367
[5 : 8] - 0.3633
[6 : 8] - 0.1445
[7 : 8] - 0.0352 # so at 7:8, < 0.05
[8 : 8] - 0.0039
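The error-rate column is the binomial upper tail, reproducible with:
round(pbinom((0:8) - 1, size = 8, prob = 0.5, lower.tail = FALSE), 4) # P(X >= k) for k = 0..8
#=> 1.0000 0.9961 0.9648 0.8555 0.6367 0.3633 0.1445 0.0352 0.0039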
- It is impossible to get an exact 5% level test for this case due to the discreteness of the binomial.
+ The closest rejection region is [7 : 8]
+ Any alpha level lower than 0.0039 is not attainable
- For larger sample sizes, we could do a normal approximation
- Two sided test is not obvious
+ Given a way to do two sided tests, we could take the set of values of p0 for which we fail to reject
to get an exact binomial confidence interval (called the Clopper/Pearson interval)
* So at 5%, this would be [0-1: 8] and [8 : 8] # the confidence interval (when fail to reject) is the inverse of this
- For these problems, people always create a P-Value rather than computing the rejection region
[P-Values]
- Most common measure of statistical significance
- Their ubiquity, along with concern over interpretation and use makes them controversial among statisticians
What is a P-value?
- Idea: Suppose nothing is going on - how unusual is it to see the estimate we got? # How unusual is the result we got if H0 is true?
- Approach:
1 - Define the hypothetical distribution of a data summary (statistic) when "nothing is going on" (null hypothesis)
2 - Calculate the summary/statistic with the data we have (test statistic)
3 - Compare what we calculated (2) to our hypothetical distribution (1) and see if the value is "extreme" (p-value)
- The P-value is the probability under the null hypothesis of obtaining evidence (test-statistic) as extreme or more extreme than that obtained
+ If the P-value is small, then either H0 is true and we have observed a rare event, or H0 is false
- Suppose that you get a T statistic of 2.5 for 15 df testing H0: μ = μ0 versus Ha: μ > μ0
+ What is the probability of getting a T statistic as large as 2.5?
pt(2.5, 15, lower.tail = FALSE)
#=> 0.01225
The attained significance level
- Our test statistic was 2 for H0: μ = 30 versus Ha: μ > 30
+ Notice that we rejected the one sided test when α = 0.05 (since the cutoff, 1.645 [assuming a normal distribution, not the t-distribution], is below 2);
would we reject if α = 0.01, how about α = 0.001?
+ The P-value therefore is the smallest value of α for which you would still reject H0
Notes
- By reporting a P-value the reader can perform the hypothesis test at whatever α level he or she chooses
- If the P-value is less than α you can reject the null hypothesis at that significance level
- For two sided hypothesis tests, double the smaller of the two one-sided hypothesis test P-values
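For the T statistic of 2.5 with 15 df above, the two-sided P-value doubles the one-sided tail:
2 * pt(2.5, 15, lower.tail = FALSE) #=> 0.02451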
Revisiting an earlier example
- Suppose a friend has 8 children, 7 of which are girls and none are twins
- If each gender has an independent 50% probability for each birth, what is the probability of getting 7 or more girls out of 8 births?
H0: p = 0.5 versus Ha: p > 0.5
choose(8,7) * 0.5^8 + choose(8,8) * 0.5^8
#=> 0.03516
pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE)
Poisson Example
- Suppose that a hospital has an infection rate of 10 infections per 100 person-days at risk (rate of 0.1) during the last monitoring period
- Assume that an infection rate of 0.05 is an important benchmark
- Given the model, could the observed rate being larger than 0.05 be attributed to chance?
+ Under H0: λ = 0.05, so that λ0 * 100 = 5
+ Consider Ha: λ > 0.05
ppois(9, 5, lower.tail = FALSE) # P(X >= 10) given a rate of 5
#=> 0.03183 # unlikely to see this result under H0
ppois(9, 5) # P(X <= 9); lower.tail = TRUE is the default (ppois has no upper.tail argument)
Appending to a vector in a loop:
smns <- NULL # start with an empty vector
for (i in 1 : 4000) {
  sa <- rnorm(40) # placeholder sampling step; any per-iteration sample works here
  smns <- c(smns, mean(sa))
}
https://en.wikipedia.org/wiki/Q–Q_plot --> to check whether distributions of differing populations are the same
-- Week 4: Power, multiple comparisons, resampling
[Power]