# Testing Hypothesis {#ChapTesting}
## Student Learning Objectives
Hypothesis testing emerges as a crucial component in decision making
where one of two competing options needs to be selected. Statistical
hypothesis testing provides formal guidelines for making such a
selection. This chapter deals with the formulation of statistical
hypothesis testing and describes the associated decision rules.
Specifically, we consider hypothesis testing in the context of the
expectation of a measurement and in the context of the probability of an
event. In subsequent chapters we deal with hypothesis testing in the
context of other parameters as well. By the end of this chapter, the
student should be able to:
- Formulate statistical hypotheses for testing.
- Test, based on a sample, hypotheses regarding the expectation of the
measurement and the probability of an event.
- Identify the limitations of statistical hypothesis testing and the
danger of misinterpretation of the test’s conclusions.
## The Theory of Hypothesis Testing
Statistical inference is used in order to detect and characterize
meaningful phenomena that may be hidden in an environment contaminated
by random noise. Hypothesis testing is an important step, typically the
first, in the process of making inferences. In this step one tries to
answer the question: “Is there a phenomenon at all?". The basic approach
is to determine whether the observed data can or cannot be reasonably
explained by a model of randomness that does not involve the phenomenon.
In this section we introduce the structure and characteristics of
statistical hypothesis testing. We start with an informal application of
a statistical test and proceed with formal definitions. In the next
section we discuss in more detail the testing of hypotheses on the
expectation of a measurement and the testing of hypotheses on the
probability of an event. More examples are considered in subsequent
chapters.
### An Example of Hypothesis Testing
The variable “`price`" in the file “`cars.csv`" contains data on the
prices of different types of cars that were sold in the United States
during 1985. The average price of a car back then — the average of the
variable “`price`" — was \$13,207. One may be interested in the
question: Do Americans pay today for cars a different price than what
they used to pay in the 80’s? Has the price of cars changed
significantly since 1985?
The average price of a car in the United States in 2009 was
\$27,958[^12_1]. Clearly, this figure is higher than \$13,207. However, in
order to produce a fair answer to the question we have to take into
account that, due to inflation, the prices of all products went up
during these years. A more meaningful comparison will involve the
current prices of cars in terms of 1985 Dollars. Indeed, if we take into
account inflation then we get that, on the average, the cost of today’s
cars corresponds to an average price of \$13,662 in 1985 values[^12_2].
This price is still higher than the 1985 price, but not by as much.
The question we are asking is: “Is the difference between \$13,207 and
\$13,662 significant or is it not so?".
In order to give a statistical answer to this question we carry out a
statistical test. The specific test is conducted with the aid of the
function “`t.test`". Later we will discuss in more details some of the
arguments that may be used in this function. Currently, we simply apply
it to the data stored in the variable “`price`" to test that the
expected price is different than the \$13,662, the average price of a
car in 2009, adjusted for inflation:
```{r}
cars <- read.csv("_data/cars.csv")
t.test(cars$price,mu=13662)
```
The data in the file “`cars.csv`" is read into a data frame that is
given the name “`cars`". Afterwards, the data on prices of car types in
1985 is entered as the first argument to the function “`t.test`". The
other argument is the expected value that we want to test, the current
average price of cars, given in terms of 1985 Dollar value. The output
of the function is reported under the title: “`One Sample t-test`".
Let us read the report from the bottom up. The bottom part of the report
describes the confidence interval and the point estimate of the expected
price of a car in 1985, based on the given data. Indeed, the last line
reports the sample average of the price, which is equal to 13,207.13.
This number, the average of the 201 non-missing values of the variable
“`price`", serves as the estimate of the expected price of a car in
1985. The 95% confidence interval of the expectation, the interval
\[12101.80, 14312.46\], is presented on the 4th line from the bottom.
This is the confidence interval for the expectation that was computed in
Subsection \@ref(subsec:CImean)[^12_3].
The information relevant to conducting the statistical test itself is
given in the upper part of the report. Specifically, it is reported that
the data in “`cars$price`" is used in order to carry out the test. Based
on this data a test statistic is computed and obtains the value of
“`t = -0.8115`". This statistic is associated with the $t$-distribution
with “`df = 200`" degrees of freedom. The last quantity that is being
reported is denoted the *$p$-value* and it obtains the value
“`p-value = 0.4181`". The test may be carried out with the aid of the
value of the $t$ statistic or, more directly, using the $p$-value.
Currently we will use the $p$-value.
The test itself examines the hypothesis that the expected price of a car
in 1985 was equal to \$13,662, the average price of a car in 2009, given
in 1985 values. This hypothesis is called the null hypothesis. The
alternative hypothesis is that the expected price of a car in 1985 was
not equal to that figure. The specification of the alternative
hypothesis is reported on the third line of the output of the function
“`t.test`".
One may decide between the two hypotheses on the basis of the size of
the $p$-value. The rule of thumb is to reject the null hypothesis, and
thus accept the alternative hypothesis, if the $p$-value is less than
0.05. In the current example the $p$-value is equal to 0.4181 and is
larger than 0.05. Consequently, we may conclude that the expected price
of a car in 1985 was not significantly different from the current price
of a car.
In the rest of this section we give a more rigorous explanation of the
theory and practice of statistical hypothesis testing.
### The Structure of a Statistical Test of Hypotheses
The initial step in statistical inference in general, and in statistical
hypothesis testing in particular, is the formulation of the statistical
model and the identification of the parameter(s) that should be
investigated. In the current situation the statistical model may
correspond to the assumption that the data in the variable “`price`" are
an instance of a *random* sample (of size $n=201$). The parameter that
we want to investigate is the expectation of the measurement that
produced the sample. The variance of the measurement is also relevant
for the investigation.
After the statistical model has been set, one may split the process of
testing a statistical hypothesis into three steps: (i) formulation of
the hypotheses, (ii) specification of the test, and (iii) reaching the
final conclusion. The first two steps are carried out on the basis of
the probabilistic characteristics of the statistical model and in the
context of the sampling distribution. In principle, the first two steps
may be conducted in the planning stage prior to the collection of the
observations. Only the third step involves the actual data. In the
example that was considered in the previous subsection the third step
was applied to the data in the variable “`price`" using the function
“`t.test`".
[**(i) Formulating the hypotheses**]{}: A statistical model involves a
parameter that is the target of the investigation. In principle, this
parameter may obtain any value within a range of possible values. The
formulation of the hypotheses corresponds to splitting the range of
values into two sub-collections: a sub-collection that emerges in
response to the presence of the phenomenon and a sub-collection that
emerges in response to the situation where the phenomenon is absent. The
sub-collection of parameter values where the phenomenon is absent is
called the *null hypothesis* and is marked as “$H_0$". The other
sub-collection, the one reflecting the presence of the phenomena, is
denoted the *alternative hypothesis* and is marked “$H_1$".
For example, consider the price of cars. Assume that the phenomenon one
wishes to investigate is the change in the relative price of a car in
the 80’s as compared to prices today. The parameter of interest is the
expected price of cars back then, which we denote by $\Expec(X)$. The
formulation of the statement that the expected price of cars has changed
is “$\Expec(X) \not = 13,662$". This statement corresponds to the
*presence* of a phenomenon, to a change, and is customarily defined as
the alternative hypothesis. On the other hand, the situation
“$\Expec(X) = 13,662$" corresponds to not having any change in the price
of cars. Hence, this situation corresponds to the *absence* of the
phenomenon and is denoted the null hypothesis. In summary, in order to
investigate the change in the relative price of cars we may consider the
null hypothesis “$H_0:\Expec(X) = 13,662$" and test it against the
alternative hypothesis “$H_1: \Expec(X)\not = 13,662$".
A variation in the formulation of the phenomena can change the
definition of the null and alternative hypotheses. For example, if the
intention is to investigate the *rise* in the price of cars then the
phenomenon will correspond to the expected price in 1985 being less than
\$13,662. Accordingly, the alternative hypothesis should be defined as
$H_1: \Expec(X) < 13,662$, with the null hypothesis defined as
$H_0: \Expec(X) \geq 13,662$. Observe that in this case an expected
price larger than \$13,662 relates to the phenomenon of rising (relative)
prices *not* taking place.
On the other hand, if one wants to investigate a *decrease* in the
price then one should define the alternative hypothesis to be
$H_1: \Expec(X) > 13,662$, with the null hypothesis being
$H_0: \Expec(X) \leq 13,662$.
The type of alternative that was considered in the example,
$H_1: \Expec(X) \not = 13,662$, is called a *two-sided* alternative. The
other two types of alternative hypotheses that were considered
thereafter, $H_1: \Expec(X) < 13,662$ and $H_1: \Expec(X) > 13,662$, are
both called *one-sided* alternatives.
In summary, the formulation of the hypotheses is a reflection of the
phenomenon one wishes to examine. The setting associated with the
presence of the phenomenon is denoted the alternative hypothesis and the
complementary setting, the setting where the phenomenon is absent, is
denoted the null hypothesis.
[**(ii) Specifying the test:**]{} The second step in hypothesis testing
involves the selection of the decision rule, i.e. the statistical test,
to be used in order to decide between the two hypotheses. The decision
rule is composed of a statistic and a subset of values of the statistic
that correspond to the rejection of the null hypothesis. The statistic
is called the *test statistic* and the subset of values is called the
*rejection region*. The decision is to reject the null hypothesis (and
consequently choose the alternative hypothesis) if the test statistic
falls in the rejection region. Otherwise, if the test statistic does not
fall in the rejection region then the null hypothesis is selected.
Return to the example in which we test between $H_0:\Expec(X) = 13,662$
and $H_1:\Expec(X) \not= 13,662$. One may compute the statistic:
$$T = \frac{\bar X - 13,662}{S/\sqrt{n}}\;,$$ where $\bar X$ is the
sample average (of the variable “`price`"), $S$ is the sample standard
deviation, and $n$ is the sample size ($n = 201$ in the current
example).
The sample average $\bar X$ is an estimator of the expected price of a
car. In essence, the statistic $T$ measures the discrepancy between
the estimated value of the expectation ($\bar X$) and the expected value
under the null hypothesis ($\Expec(X) = 13,662$). This discrepancy is
measured in units of the (estimated) standard deviation of the sample
average[^12_4].
If the null hypothesis $H_0:\Expec(X) = 13,662$ is true then the
sampling distribution of the sample average $\bar X$ should be
concentrated about the value 13,662. Values of the sample average much
larger or much smaller than this value may serve as evidence against the
null hypothesis.
Correspondingly, if the null hypothesis holds true then the values of
the sampling distribution of the statistic $T$ should tend to be in the
vicinity of 0. Values with a relatively small absolute value are
consistent with the null hypothesis. On the other hand, extremely
positive or extremely negative values of the statistic indicate that the
null hypothesis is probably false.
It is natural to set a value $c$ and to reject the null hypothesis
whenever the absolute value of the statistic $T$ is larger than $c$. The
resulting rejection region is of the form $\{|T| > c\}$. The rule of
thumb, again, is to take the threshold $c$ to be equal to the 0.975-percentile
of the $t$-distribution on $n-1$ degrees of freedom, where $n$ is the
sample size. In the current example, the sample size is $n=201$ and the
percentile of the $t$-distribution is `qt(0.975,200)` = 1.971896.
Consequently, the subset $\{|T| > 1.971896\}$ is the rejection region of
the test.
A change in the hypotheses that are being tested may lead to a change in
the test statistic and/or the rejection region. For example, for testing
$H_0: \Expec(X) \geq 13,662$ versus $H_1: \Expec(X) < 13,662$ one may
still use the same test statistic $T$ as before. However, only very
negative values of the statistic are inconsistent with the null
hypothesis. It turns out that the rejection region in this case is of
the form $\{T < -1.652508\}$, where `qt(0.05,200)` = -1.652508 is the
0.05-percentile of the $t$-distribution on 200 degrees of freedom. On
the other hand, the rejection region for testing between
$H_0: \Expec(X) \leq 13,662$ and $H_1: \Expec(X) > 13,662$ is
$\{T > 1.652508\}$. In this case, `qt(0.95,200)` = 1.652508 is the
0.95-percentile of the $t$-distribution on 200 degrees of freedom.
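For reference, the three thresholds may be computed directly with the
function “`qt`", which evaluates percentiles of the $t$-distribution:
```{r}
qt(0.975, 200)  # two-sided alternative: reject when |T| > 1.971896
qt(0.05, 200)   # alternative E(X) < 13,662: reject when T < -1.652508
qt(0.95, 200)   # alternative E(X) > 13,662: reject when T > 1.652508
```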
Selecting the test statistic and deciding what rejection region to use
specifies the statistical test and completes the second step.
[**(iii) Reaching a conclusion:**]{} After the stage is set, all that is
left is to apply the test to the observed data. This is done by
computing the observed value of the test statistic and checking whether
or not the observed value belongs to the rejection region. If it does
belong to the rejection region then the decision is to reject the null
hypothesis. Otherwise, if the statistic does not belong to the rejection
region, then the decision is to accept the null hypothesis.
Return to the example of testing the price of car types. The observed
value of the $T$ statistic is part of the output of the application of
the function “`t.test`" to the data. The value is “`t = -0.8115`". As an
exercise, let us recompute directly from the data the value of the $T$
statistic:
```{r}
x.bar <- mean(cars$price,na.rm=TRUE)
x.bar
s <- sd(cars$price,na.rm=TRUE)
s
```
The observed value of the sample average is $\bar x = 13207.13$ and the
observed value of the sample standard deviation is $s = 7947.066$. The
sample size (due to having 4 missing values) is $n=201$. The formula for
the computation of the test statistic in this example is
$t = [\bar x - 13,662]/[s/\sqrt{n}]$. Plugging in this formula the
sample size and the computed values of the sample average and standard
deviation produces:
```{r}
(x.bar - 13662)/(s/sqrt(201))
```
This value, after rounding, is equal to the value “`t = -0.8115`"
that is reported in the output of the function “`t.test`".
The critical threshold for the absolute value of the $T$ statistic on
$201-1 = 200$ degrees of freedom is `qt(0.975,200)` = 1.971896. Since
the absolute observed value ($|t| = 0.8114824$) is less than the
threshold we get that the value of the statistic does not belong to the
rejection region (which is composed of absolute values larger than the
threshold). Consequently, we accept the null hypothesis. This null
hypothesis declares that the expected price of a car was equal to the
current expected price (after adjusting for the change in Consumer Price
Index)[^12_5].
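One may also let `R` carry out the comparison between the statistic and
the threshold. Here is a short sketch that reuses the objects “`x.bar`"
and “`s`" computed above (the name “`t.stat`" is introduced only for
this illustration):
```{r}
t.stat <- (x.bar - 13662)/(s/sqrt(201))
abs(t.stat) > qt(0.975, 200)  # FALSE: the statistic is outside the rejection region
```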
### Error Types and Error Probabilities
The $T$ statistic was proposed for testing a change in the price of a
car. This statistic measures the discrepancy between the sample average
price of a car and the expected value of the sample average, where the
expectation is computed under the null hypothesis. The structure of the
rejection region of the test is $\{|T| > c\}$, where $c$ is an
appropriate threshold. In the current example the value of the threshold
$c$ was set to be equal to `qt(0.975,200)` = 1.971896. In general, the
specification of the threshold $c$ depends on the error probabilities
that are associated with the test. In this section we describe these
error probabilities.
The process of making decisions may involve errors. In the case of
hypothesis testing one may specify two types of error. On the one hand,
the case may be that the null hypothesis is correct (in the example,
$\Expec(X) = 13,662$). However, the data is such that the null
hypothesis is rejected (here, $|T| > 1.971896$). This error is called a
*Type I* error.
A different type of error occurs when the alternative hypothesis holds
($\Expec(X) \not= 13,662$) but the null hypothesis is not rejected
($|T| \leq 1.971896$). This other type of error is called *Type II*
error. A summary of the types of errors can be found in
Table \@ref(tab:Testing1):
                                     $H_0: \Expec(X) = 13,662$   $H_1:\Expec(X) \not= 13,662$
----------------------------------- --------------------------- ------------------------------
Accept $H_0$: $|T| \leq 1.971896$                                Type II Error
Reject $H_0$: $|T| > 1.971896$      Type I Error
: (\#tab:Testing1)Error Types[]{data-label="tab:Testing1"}
In statistical testing of hypotheses the two types of error are not
treated symmetrically. Rather, making a Type I error is considered more
severe than making a Type II error. Consequently, the test’s decision
rule is designed so as to assure an acceptable probability of making a
Type I error. Reducing the probability of a Type II error is desirable,
but is of secondary importance.
Indeed, in the example that deals with the price of car types the
threshold was set as high as `qt(0.975,200)` = 1.971896 in order to
reject the null hypothesis. It is not sufficient that the sample average
is not equal to 13,662 (which would correspond to a threshold of 0);
rather, it has to be significantly different from the expectation under
the null hypothesis. The distance between the sample average and the
null expectation should be relatively large in order to exclude $H_0$ as
an option.
The significance level of the evidence for rejecting the null hypothesis
is based on the probability of the Type I error. The probabilities
associated with the different types of error are presented in
Table \@ref(tab:Testing2):
                      $H_0: \Expec(X) = 13,662$   $H_1:\Expec(X) \not= 13,662$
--------------------- --------------------------- ------------------------------
$\Prob(|T| \leq c)$                               Prob. of Type II Error
$\Prob(|T| > c)$      Significance Level          Statistical Power
: (\#tab:Testing2)Error Probabilities[]{data-label="tab:Testing2"}
Observe that the probability of a Type I error is called the
significance level. The significance level is set at some pre-specified
level such as 5% or 1%, with 5% being the most widely used level. In
particular, setting the threshold in the example to be equal to
`qt(0.975,200)` = 1.971896 produces a test with a 5% significance level.
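Indeed, one may verify the significance level that is associated with
this threshold by computing the probability, under the null distribution
of the statistic, that the threshold is exceeded in absolute value:
```{r}
c.threshold <- qt(0.975, 200)  # the threshold of the two-sided test
2*(1 - pt(c.threshold, 200))   # the probability of a Type I error: 0.05
```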
This lack of symmetry between the two hypotheses suggests another
interpretation of the difference between them. According to
this interpretation the null hypothesis is the one in which the cost of
making an error is greater. Thus, when one separates the collection of
parameter values into two subsets then the subset that is associated
with a more severe error is designated as the null hypothesis and the
other subset becomes the alternative.
For example, a new drug must pass a sequence of clinical trials before
it is approved for distribution. In these trials one may want to test
whether the new drug produces a beneficial effect in comparison to the
current treatment. Naturally, the null hypothesis in this case would be
that the new drug is no better than the current treatment and the
alternative hypothesis would be that it is better. Only if the clinical
trials demonstrate a significant beneficial effect of the new drug
would it be released for marketing.
In scientific research, in general, the currently accepted theory, the
conservative explanation, is designated as the null hypothesis. A claim
for novelty in the form of an alternative explanation requires strong
evidence in order for it to be accepted and be favored over the
traditional explanation. Hence, the novel explanation is designated as
the alternative hypothesis. It replaces the current theory only if the
empirical data clearly supports it. The test statistic is a summary of
the empirical data. The rejection region corresponds to values that are
unlikely to be observed according to the current theory. Obtaining a
value in the rejection region is an indication that the current theory
is probably not adequate and should be replaced by an explanation that
is more consistent with the empirical evidence.
The second type of error probability in Table \@ref(tab:Testing2) is the
probability of a Type II error. Instead of dealing directly with this
probability the tradition is to consider the complementary probability
that corresponds to the probability of *not* making a Type II error.
This complementary probability is called the *statistical power*:
$$\mbox{Statistical Power} = 1 - \mbox{Probability of Type II Error}$$
The statistical power is the probability of rejecting the null
hypothesis when the state of nature is the alternative hypothesis. (In
comparison, the significance level is the probability of rejecting the
null hypothesis when the state of nature is the null hypothesis.) When
comparing two decision rules for testing hypothesis, both having the
same significance level, the one that possesses a higher statistical
power should be favored.
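The statistical power may be approximated with a simulation. The
following sketch assumes, purely for illustration, Normally distributed
measurements with a standard deviation of 8,000 (a value in the vicinity
of the sample standard deviation of the price); the expectation 12,500
is likewise an arbitrary choice of an alternative state of nature:
```{r}
set.seed(33)
mu.0 <- 13662  # the expectation under the null hypothesis
prob.reject <- function(mu, sig, n = 201) {
  T.stat <- replicate(10^4, {
    x <- rnorm(n, mu, sig)            # a sample from the assumed model
    (mean(x) - mu.0)/(sd(x)/sqrt(n))  # the T statistic
  })
  mean(abs(T.stat) > qt(0.975, n - 1))  # the proportion of rejections
}
prob.reject(13662, 8000)  # null holds: approximately the 5% significance level
prob.reject(12500, 8000)  # alternative holds: the statistical power at this point
```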
### $p$-Values
The $p$-value is another test statistic. It is associated with a
specific test statistic and a structure of the rejection region. The
$p$-value is equal to the significance level of the test in which the
observed value of the statistic serves as the threshold. In the current
example, where $T$ is the underlying test statistic and the structure
of the rejection region is of the form $\{|T| > c\}$, the $p$-value is
equal to the probability of rejecting the null hypothesis
in the case where the threshold $c$ is equal to the observed absolute
value of the $T$ statistic. In other words:
$$\mbox{$p$-value} = \Prob(|T| > |t|) = \Prob(|T| > |-0.8114824|)= \Prob(|T| > 0.8114824)\;,$$
where $t=-0.8114824$ is the observed value of the $T$ statistic and the
computation of the probability is conducted under the null hypothesis.
Specifically, under the null hypothesis $H_0: \Expec(X) = 13,662$ we get
that the distribution of the statistic
$T = [\bar X - 13,662]/[S/\sqrt{n}]$ is the $t$-distribution on
$n-1 = 200$ degrees of freedom. The probability of the event
$\{|T| > 0.8114824\}$ corresponds to the sum of the probabilities of
both tails of the distribution. By the symmetry of the $t$-distribution
this equals twice the probability of the upper tail:
$$\Prob(|T| > 0.8114824) = 2\cdot \Prob(T> 0.8114824) = 2\cdot [1-\Prob(T \leq 0.8114824)]\;.$$
When we compute this probability in `R` we get:
```{r}
2*(1-pt(0.8114824,200))
```
This probability is equal, after rounding, to the probability
“`p-value = 0.4181`" that is reported in the output of the function
“`t.test`".
The $p$-value is a function of the data. In the particular data set the
computed value of the $T$ statistic was -0.8114824. For a different data
set the evaluation of the statistic would have produced a different
value. As a result, the threshold that would have been used in the
computation would have been different, thereby changing the numerical
value of the $p$-value. Being a function of the data, we conclude that
the $p$-value is a statistic.
The $p$-value is used as a test statistic by comparing its value to the
pre-defined significance level. If the significance level is 1% then the
null hypothesis is rejected for $p$-values less than 0.01. Likewise, if
the significance level is set at the 5% level then the null hypothesis
is rejected for $p$-values less than 0.05.
The statistical test that is based directly on the $T$ statistic and the
statistical test that is based on the $p$-value are equivalent to each
other. The one rejects the null hypothesis if, and only if, the other
does so. The advantage of using the $p$-value as the test statistic is
that no further probabilistic computations are required. The $p$-value
is compared directly to the significance level we seek. For the test
that examines the $T$ statistic we still need to identify the threshold
associated with the given significance level.
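The equivalence may be demonstrated in the current example by evaluating
the two decision rules side by side:
```{r}
t.obs <- -0.8114824                 # the observed value of the T statistic
abs(t.obs) > qt(0.975, 200)         # the decision based on the T statistic
2*(1 - pt(abs(t.obs), 200)) < 0.05  # the same decision based on the p-value
```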
In the next two sections we extend the discussion of the $t$-test and
give further examples of the use of the function “`t.test`". We also deal
with tests on probabilities of events and introduce the function
“`prop.test`" for conducting such tests.
## Testing Hypothesis on Expectation
Let us consider the variable “`dif.mpg`" that contains the difference in
fuel consumption between highway and city conditions. This variable was
considered in Chapter \@ref(ChapConfidence). Examine the distribution of
this variable:
```{r, out.width = '60%', fig.align = "center"}
dif.mpg <- cars$highway.mpg - cars$city.mpg
summary(dif.mpg)
plot(table(dif.mpg))
```
In the first expression we created the variable “`dif.mpg`" that
contains the difference in miles-per-gallon. The difference is computed
for each car type between highway driving conditions and urban driving
condition. The summary of this variable is produced in the second
expression. Observe that the values of the variable range between 0 and
11, with 50% of the distribution concentrated between 5 and 7. The
median is 6 and the mean is 5.532. The last expression produces the bar
plot of the distribution. It turns out that the variable “`dif.mpg`"
obtains integer values.
In this section we test hypotheses regarding the expected difference in
fuel consumption between highway and city conditions.
Energy is required in order to move cars. For heavier cars more energy
is required. Consequently, one may conjecture that the mileage per gallon
for heavier cars is less than the mileage per gallon for lighter cars.
The relation between the weight of the car and the difference between
the mileage-per-gallon in highway and city driving conditions is less
clear. On the one hand, urban traffic involves frequent changes in speed
in comparison to highway conditions. One may presume that this change in
speed is a cause for reduced efficiency in fuel consumption. If this is
the case then one may predict that heavier cars, which require more
energy for acceleration, will be associated with a bigger difference
between highway and city driving conditions in comparison to lighter
cars.
On the other hand, heavier cars get fewer miles per gallon overall. The
difference between two smaller numbers (the mileage per gallon in highway
and in city conditions for heavier cars) may tend to be smaller than the
difference between two larger numbers (the mileage per gallon in highway
and in city conditions for lighter cars). If this is the case then one
may predict that heavier cars will be associated with a smaller
difference between highway and city driving conditions in comparison to
lighter cars.
The average difference between highway and city conditions is
approximately 5.53 for all cars. Divide the cars into two groups of
equal size: One group is composed of the heavier cars and the other
group is composed of the lighter cars. We will examine the relation
between the weight of the car and difference in miles per gallon between
the two driving conditions by testing hypotheses separately for each
weight group[^12_6]. For each such group we start by testing the null
hypothesis $H_0:\Expec(X) = 5.53$ against the two-sided alternative
$H_1:\Expec(X) \not = 5.53$, where $X$ is the difference between highway
and city miles-per-gallon in cars that belong to the given weight group.
After carrying out the test for the two-sided
alternative we will discuss results of the application of tests for
one-sided alternatives.
We start with the definition of the weight groups. The variable
“`curb.weight`" measures the weight of the cars in the data frame
“`cars`". Let us examine the summary of the content of this variable:
```{r}
summary(cars$curb.weight)
```
Half of the cars in the data frame weigh less than 2,414 lb and half the
cars weigh more. The average weight of a car is 2,556 lb. Let us take
2,414 as a threshold and denote cars below this weight as “light" and
cars above this threshold as “heavy":
```{r}
heavy <- cars$curb.weight > 2414
table(heavy)
```
The variable “`heavy`" indicates for each car type whether its weight is
above or below the threshold weight of 2,414 lb. The variable is
composed of a sequence with as many components as the number of
observations in the data frame “`cars`" ($n = 205$). Each component is a
logical value: “`TRUE`" if the car is heavier than the threshold and
“`FALSE`" if it is not. When we apply the function “`table`" to this
sequence we get that 102 of the cars are heavier than the threshold and
103 are not.
We would like to apply the $t$-test first to the subset of all cars with
weight above 2,414 lb (cars that are associated with the value “`TRUE`"
in the variable “`heavy`"), and then to all cars with weights not
exceeding the threshold (cars associated with value “`FALSE`"). In the
past we showed that one may address components of a sequence using its
position in the sequence[^12_7]. Here we demonstrate an alternative
approach for addressing specific locations by using a sequence with
logical components.
In order to illustrate this second approach consider the two sequences:
```{r}
w <- c(5,3,4,6,2,9)
d <- c(13,22,0,12,6,20)
```
Say we want to select the components of the sequence “`d`" in all the
locations where the components of the sequence “`w`" obtain values
larger than 5. Consider the code:
```{r}
w > 5
d[w > 5]
```
The expression “`w > 5`" is a sequence of logical components, with the
value “`TRUE`" at the positions where “`w`" is above the threshold and
the value “`FALSE`" at the positions where “`w`" is below the threshold.
We may use the sequence with logical components as an index to the
sequence of the same length “`d`". The relevant expression is
“`d[w > 5]`". The output of this expression is the sub-sequence of
elements from “`d`" that are associated with the “`TRUE`" values of the
logical sequence. Indeed, “`TRUE`" values are present at the 4th and the
6th positions of the logical sequence. Consequently, the output of the
expression “`d[w > 5]`" contains the 4th and the 6th components of the
sequence “`d`".
The operator “`!`", when applied to a logical value, reverses the value.
A “`TRUE`" becomes “`FALSE`" and a “`FALSE`" becomes “`TRUE`". Consider
the code:
```{r}
!(w > 5)
d[!(w > 5)]
```
Observe that the sequence “`!(w > 5)`" obtains a value of “`TRUE`" at
the positions where “`w`" is less than or equal to 5. Consequently, the
output of the expression “`d[!(w > 5)]`" consists of all the values of
“`d`" that are associated with components of “`w`" that are less than or
equal to 5.
The variable “`dif.mpg`" contains data on the difference in
miles-per-gallon between highway and city driving conditions for all the
car types. The sequence “`heavy`" identifies the car types with curb
weight above the threshold of 2,414 lb. The components of this sequence
are logical with the value “`TRUE`" at positions associated with the
heavier car types and the “`FALSE`" at positions associated with the
lighter car types. Observe that the output of the expression
“`dif.mpg[heavy]`" is the subsequence of differences in miles-per-gallon
for the cars with curb weight above the given threshold. We apply the
function “`t.test`" to this expression in order to conduct the $t$-test
on the expectation of the variable “`dif.mpg`" for the heavier cars:
```{r}
t.test(dif.mpg[heavy],mu=5.53)
```
The target population is the heavier car types. Notice that we test the
null hypothesis that the expected difference among the heavier cars is equal
to 5.53 against the alternative hypothesis that the expected difference
among heavier cars is not equal to 5.53. The null hypothesis is not
rejected at the 5% significance level since the $p$-value, which is
equal to 0.1735, is larger than 0.05. Consequently, based on the data at
hand, we cannot conclude that the expected difference in
miles-per-gallon for heavier cars is significantly different from the
average difference for all cars.
Observe also that the estimate of the expectation, the sample mean, is
equal to 5.254902, with a confidence interval of the form
$[4.900198, 5.609606]$.
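As a sketch, the reported confidence interval may be reproduced directly
from the summaries of the data on the heavier cars:
```{r}
m.h <- mean(dif.mpg[heavy])  # the sample average for the heavier cars
s.h <- sd(dif.mpg[heavy])    # the sample standard deviation
n.h <- sum(heavy)            # the number of heavier cars (102)
m.h + qt(c(0.025, 0.975), n.h - 1)*s.h/sqrt(n.h)  # the 95% confidence interval
```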
Next, let us apply the same test to the lighter cars. The expression
“`dif.pmg[!heavy]`" produces the subsequence of differences in
miles-per-gallon for the cars with curb weight below the given
threshold. The application of the function “`t.test`" to this
subsequence gives:
```{r}
t.test(dif.mpg[!heavy],mu=5.53)
```
Again, the null hypothesis is not rejected at the 5% significance level
since a $p$-value of 0.05164 is still larger than 0.05. However, unlike
the case of the heavier cars, where the $p$-value was well above the
threshold, here it is much closer to the threshold of 0.05.
Consequently, we may almost conclude that the expected difference in
miles-per-gallon for lighter cars is significantly different from the
average difference for all cars.
Why did we not reject the null hypothesis for the heavier cars but
almost did so for the lighter cars? Both tests are based on the $T$
statistic, which measures the ratio between the deviation of the sample
average from its expectation under the null, divided by the estimate of
the standard deviation of the sample average. The value of this
statistic is “`t = -1.5385`" for heavier cars and it is “`t = 1.9692`"
for lighter cars, an absolute value about 28% higher.
The deviation of the sample average for the heavier cars from the
expectation under the null is $5.254902 - 5.53 = -0.275098$. On the
other hand, the deviation of the sample average for the lighter cars
from the expectation under the null is $5.805825 - 5.53 = 0.275825$. The
two deviations are practically equal to each other in absolute value.
The estimator of the standard deviation of the sample average is
$S/\sqrt{n}$, where $S$ is the sample standard deviation and $n$ is the
sample size. The sample sizes, 103 for lighter cars and 102 for heavier
cars, are almost equal. Therefore, the reason for the difference in the
values of the $T$ statistics for both weight groups must be differences
in the sample standard deviations. Indeed, when we compute the sample
standard deviation for lighter and heavier cars[^12_8] we get that the
standard deviation for lighter cars (1.421531) is much smaller than the
standard deviation for heavier cars (1.805856):
```{r}
tapply(dif.mpg,heavy,sd)
```
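To see the effect of the standard deviations numerically, one may
recompute the two $T$ statistics directly. The sketch below applies the
function “`tapply`" with “`mean`" and “`length`" in order to obtain the
group averages and the group sample sizes:
```{r}
m.g <- tapply(dif.mpg, heavy, mean)    # the group averages
s.g <- tapply(dif.mpg, heavy, sd)      # the group standard deviations
n.g <- tapply(dif.mpg, heavy, length)  # the group sample sizes
(m.g - 5.53)/(s.g/sqrt(n.g))  # the T statistics: light (FALSE) and heavy (TRUE)
```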
The important lesson to learn from this exercise is that the
simple-minded notion of significance and statistical significance are
not the same. A simple-minded assessment of the discrepancy from the
null hypothesis
will put the evidence from the data on lighter cars and the evidence
from the data on heavier cars on the same level. In both cases the
estimated value of the expectation is the same distance away from the
null value.
However, statistical assessment conducts the analysis in the context of
the sampling distribution. The deviation of the sample average from the
expectation is compared to the standard deviation of the sample average.
Consequently, in statistical testing of hypotheses a smaller deviation
of the sample average from the expectation under the null may be more
significant than a larger one if the sampling variability of the former
is much smaller than the sampling variability of the latter.
Let us proceed with the demonstration of the application of the $t$-test
by the testing of one-sided alternatives in the context of the lighter
cars. One may test the one-sided alternative $H_1:\Expec(X) > 5.53$ that
the expected value of the difference in miles-per-gallon among cars with
curb weight no more than 2,414 lb is *greater* than 5.53 by the
application of the function “`t.test`" to the data on lighter cars. This
data is the output of the expression “`dif.mpg[!heavy]`". As before, we
specify the null value of the expectation by the introduction of the
expression “`mu=5.53`". The fact that we are interested in the testing
of the specific alternative is specified by the introduction of a new
argument of the form: “`alternative=greater`". The default value of the
argument “`alternative`" is “`two.sided`", which produces a test of a
two-sided alternative. By changing the value of the argument to
“`greater`" we produce a test for the appropriate one-sided alternative:
```{r}
t.test(dif.mpg[!heavy],mu=5.53,alternative="greater")
```
The value of the test statistic (`t = 1.9692`) is the same as for the
test of the two-sided alternative and so is the number of degrees of
freedom associated with the statistic (`df = 102`). However, the
$p$-value is smaller (`p-value = 0.02582`), compared to the $p$-value in
the test for the two-sided alternative (`p-value = 0.05164`). The
$p$-value for the one-sided test is the probability under the sampling
distribution that the test statistic obtains values larger than the
observed value of 1.9692. The $p$-value for the two-sided test is twice
that figure since it also involves the probability of being less than
the negative of the observed value.
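The relation between the two $p$-values may be verified directly with
the function “`pt`":
```{r}
1 - pt(1.9692, 102)      # the one-sided p-value: approximately 0.0258
2*(1 - pt(1.9692, 102))  # the two-sided p-value: approximately 0.0516
```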
The estimated value of the expectation, the sample average, is
unchanged. However, instead of producing a confidence interval for the
expectation the report produces a *one-sided* confidence interval of the
form $[5.573323, \infty)$. Such an interval corresponds to the
*smallest* value that the expectation may reasonably obtain on the basis
of the observed data.
Finally, consider the test of the other one-sided alternative
$H_1:\Expec(X) < 5.53$:
```{r}
t.test(dif.mpg[!heavy],mu=5.53,alternative="less")
```
The alternative here is determined by the expression
“`alternative=less`". The $p$-value is equal to 0.9742, which is the
probability that the test statistic obtains values less than the
observed value of 1.9692. Clearly, the null hypothesis is not rejected
in this test.
## Testing Hypothesis on Proportion {#TestFrac}
Consider the problem of testing hypothesis on the probability of an
event. Recall that a probability $p$ of some event can be estimated by
the observed relative frequency of the event in the sample, denoted
$\hat P$. The estimation is associated with the Bernoulli random
variable $X$, that obtains the value 1 when the event occurs and the
value 0 when it does not. The statistical model states that $p$ is the
expectation of $X$. The estimator $\hat P$ is the sample average of this
measurement.
With this formulation we may relate the problem of testing hypotheses
formulated in terms of $p$ to the problem of tests associated with the
expectation of a measurement. For the latter problem we applied the
$t$-test. A similar, though not identical, test is used for the problem
of testing hypotheses on proportions.
Assume that one is interested in testing the null hypothesis that the
probability of the event is equal to some specific value, say one half,
versus the alternative hypothesis that the probability is not equal to
this value. These hypotheses are formulated as $H_0:p = 0.5$ and
$H_1:p\not = 0.5$.
The sample proportion of the event $\hat P$ is the basis for the
construction of the test statistic. Recall that the variance of the
estimator $\hat P$ is given by $\Var(\hat P) = p(1-p)/n$. Under the null
hypothesis we get that the variance is equal to
$\Var(\hat P) = 0.5(1-0.5)/n$. A natural test statistic is the
standardized sample proportion:
$$Z = \frac{\hat P - 0.5}{\sqrt{0.5 (1-0.5)/n}}\;,$$ that measures the
ratio between the deviation of the estimator from its null expected
value and the standard deviation of the estimator. The standard
deviation of the sample proportion, computed under the null hypothesis,
is used in the denominator of the ratio.
If the null hypothesis that $p=0.5$ holds true then one gets that the
value 0 is the center of the sampling distribution of the test statistic
$Z$. Values of the statistic that are much larger or much smaller than 0
indicate that the null hypothesis is unlikely. Consequently, one may
consider a rejection region of the form $\{|Z| > c\}$, for some
threshold value $c$. The threshold $c$ is set at a high enough level to
assure the required significance level, namely the probability under the
null hypothesis of obtaining a value in the rejection region.
Equivalently, the rejection region can be written in the form
$\{Z^2 > c^2\}$.
As a result of the Central Limit Theorem one may conclude that the
distribution of the test statistic is approximately Normal. Hence,
Normal computations may be used in order to produce an approximate
threshold or in order to compute an approximation for the $p$-value.
Specifically, if $Z$ has the standard Normal distribution then $Z^2$ has
a chi-square distribution on one degree of freedom.
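The quality of the approximation may be examined with a small
simulation. The sketch below draws Binomial counts of size $n = 20$
under the null probability $p = 0.5$ and computes the relative frequency
with which $Z^2$ exceeds the 0.95-percentile of the chi-square
distribution on one degree of freedom:
```{r}
set.seed(33)
P.hat <- rbinom(10^5, 20, 0.5)/20          # simulated sample proportions
Z <- (P.hat - 0.5)/sqrt(0.5*(1 - 0.5)/20)  # the standardized proportions
mean(Z^2 > qchisq(0.95, 1))  # close to, but below, 0.05 due to the
                             # discreteness of the Binomial distribution
```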
In order to illustrate the application of hypothesis testing for
proportion consider the following problem: In the previous section we
obtained the curb weight of 2,414 lb as the sample median. The weights
of half the cars in the sample were above that level and the weights of
half the cars were below this level. If this level was actually the
population median then the probability that the weight of a random car
does not exceed this level would be equal to 0.5.
Let us test the hypothesis that the median weight of cars that run on
diesel is also 2,414 lb. Recall that 20 out of the 205 car types in the
sample have diesel engines. Let us use the weights of these cars in
order to test the hypothesis.
The variable “`fuel.type`" is a factor with two levels “`diesel`" and
“`gas`" that identify the fuel type of each car. The variable “`heavy`"
identifies for each car whether its weight is above the level of 2,414 lb or
not. Let us produce a $2 \times 2$ table that summarizes the frequency
of each combination of weight group and the fuel type:
```{r}
fuel <- cars$fuel.type
table(fuel,heavy)
```
Originally the function “`table`" was applied to a single factor and
produced a sequence with the frequencies of each level of the factor. In
the current application the inputs to the function are two factors[^12_9].
The output is a table of frequencies. Each entry to the table
corresponds to the frequency of a combination of levels, one from the
first input factor and the other from the second input factor. In this
example we obtain that 6 cars use diesel and their curb weight is below
the threshold. There are 14 cars that use diesel and their curb weight
is above the threshold. Likewise, there are 97 light cars that use gas
and 88 heavy cars with gas engines.
The function “`prop.test`" produces statistical tests for proportions.
The relevant information for the current application of the function is
the fact that the frequency of light diesel cars is 6 among a total number
of 20 diesel cars. The first entry to the function is the frequency of
the occurrence of the event, 6 in this case, and the second entry is the
relevant sample size, the total number of diesel cars which is 20 in the
current example:
```{r}
prop.test(6,20)
```
The function produces a report that is printed on the screen. The title
identifies the test as a one-sample test of proportions. In later
chapters we will apply the same function to more complex data structures
and the title will change accordingly. The title also identifies the
fact that a continuity correction is used in the computation of the test
statistic.
The line under the title indicates the frequency of the event in the
sample and the sample size. (In the current example, 6 diesel cars with
weights below the threshold among a total of 20 diesel cars.) The
probability of the event, under the null hypothesis, is described. The
default value of this probability is “`p = 0.5`", which is the proper
value in the current example. This default value can be modified by
replacing the value 0.5 by the appropriate probability.
The next line presents the information relevant for the test itself. The
test statistic, which is essentially the square of the $Z$ statistic
described above[^12_10], obtains the value 2.45. The sampling distribution
of this statistic under the null hypothesis is, approximately, the
chi-square distribution on 1 degree of freedom. The $p$-value, which is
the probability that the chi-square distribution on 1 degree of freedom
obtains a value above 2.45, is equal to 0.1175. Consequently, the null
hypothesis is not rejected at the 5% significance level.
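As a check, the reported statistic and $p$-value may be reproduced by
hand. For $p = 0.5$, the continuity correction amounts to subtracting
0.5 from the absolute deviation between the count and its null
expectation:
```{r}
(abs(6 - 20*0.5) - 0.5)^2/(20*0.5*(1 - 0.5))  # the test statistic: 2.45
1 - pchisq(2.45, 1)                           # the p-value: approximately 0.1175
```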
The bottom part of the report provides the confidence interval and the
point estimate for the probability of the event. The confidence interval
for the given data is $[0.1283909, 0.5433071]$ and the point estimate is
$\hat p = 6/20 = 0.3$.
It is interesting to note that although the deviation between the
estimated proportion $\hat p = 0.3$ and the null value of the
probability $p = 0.5$ is relatively large, the null hypothesis was still
not rejected. The reason for that is the small size of the sample,
$n = 20$, that was used in order to test the hypothesis. Indeed, as an
exercise let us examine the application of the same test to a setting
where $n = 200$ and the number of occurrences of the event is 60:
```{r}
prop.test(60,200)
```
The estimated value of the probability is the same as before since
$\hat p = 60/200 = 0.3$. However, the $p$-value is
$2.322 \times 10^{-8}$, which is way below the significance threshold of
0.05. In this scenario the null hypothesis is rejected with flying
colors.
This last example is yet another demonstration of the basic
characteristic of statistical hypothesis testing. The consideration is
based not on the discrepancy of the estimator of the parameter from the
value of the parameter under the null. Instead, it is based on the
*relative* discrepancy in comparison to the sampling variability of the
estimator. When the sample size is larger the variability is smaller.
Hence, the chances of rejecting the null hypothesis for the same
discrepancy increase.
## Exercises
```{exercise}
Consider a medical condition that does not have a
standard treatment. The recommended design of a clinical trial for a new
treatment for such a condition involves using a placebo treatment as a
control. A placebo treatment is a treatment that externally looks
identical to the actual treatment but, in reality, it does not have the
active ingredients. The reason for using placebo for control is the
“placebo effect". Patients tent to react to the fact that they are being
treated regardless of the actual beneficial effect of the treatment.
As an example, consider the trial for testing magnets as a treatment for
pain that was described in Question \[ex:Inference.1\]. The patients
that were randomly assigned to the control (the last 21 observations in
the file “`magnets.csv`") were treated with devices that looked like
magnets but actually were not. The goal in this exercise is to test for
the presence of a placebo effect in the case study “Magnets and Pain
Relief" of Question \[ex:Inference.1\] using the data in the file
“`magnets.csv`".
1. Let $X$ be the measurement of change, the difference between the
score of pain before the treatment and the score after the
treatment, for patients that were treated with the inactive placebo.
Express, in terms of the expected value of $X$, the null hypothesis
and the alternative hypothesis for a statistical test to determine
the presence of a placebo effect. The null hypothesis should reflect
the situation that the placebo effect is absent.
2. Identify the observations that can be used in order to test the
hypotheses.
3. Carry out the test and report your conclusion. (Use a significance
level of 5%.)
```
```{exercise}
It is assumed, when constructing the $t$-test, that the
measurements are Normally distributed. In this exercise we examine the
robustness of the test to divergence from the assumption. You are
required to compute the significance level of a two-sided $t$-test of
$H_0:\Expec(X)=4$ versus $H_1: \Expec(X) \not = 4$. Assume there are
$n=20$ observations and use a $t$-test with a nominal 5% significance
level.
1. Consider the case where $X \sim \mathrm{Exponential}(1/4)$.
2. Consider the case where $X \sim \mathrm{Uniform}(0,8)$.
```
```{exercise}
Assume that you are interested in testing
$H_0:\Expec(X) = 20$ versus $H_1:\Expec(X)\not = 20$ with a significance
level of 5% using the $t$-test. Let the sample average, of a sample of
size $n=55$, be equal to $\bar x = 22.7$ and the sample standard
deviation be equal to $s = 5.4$.
1. Do you reject the null hypothesis?
2. Use the same information. Only now you are interested in a
significance level of 1%. Do you reject the null hypothesis?
3. Use the information in the presentation of the exercise. But now you
are interested in testing $H_0:\Expec(X) = 24$ versus
$H_1:\Expec(X)\not = 24$ (with a significance level of 5%). Do you
reject the null hypothesis?
```