---
jupyter:
  jupytext:
    notebook_metadata_filter: all,-language_info
    split_at_heading: true
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.13.7
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---
# A worked example of the general linear model
See: [introduction to the general linear model](https://matthew-brett.github.io/teaching/glm_intro.html).
Here we go through the matrix formulation of the general linear model with the
simplest possible example – a t-test for the difference of the sample mean
from zero.
Let us say we have the hypothesis that men are more psychopathic than women.
We find 10 male-female couples, and we give them each a psychopathy
questionnaire. We score the questionnaires, and then, for each couple, we
subtract the woman’s score from the man’s score, to get a difference score.
We have 10 difference scores:
```{python}
#: Our standard imports
import numpy as np
import numpy.linalg as npl
# Only show 4 decimals when printing arrays
np.set_printoptions(precision=4)
# Plotting
import matplotlib.pyplot as plt
```
```{python}
differences = np.array([1.5993, -0.13,   2.3806,  1.3761, -1.3595,
                        1.0286,  0.8466, 1.6669, -0.8241,  0.4469])
n = len(differences)
print('Mean', np.sum(differences) / n)
print('STD', np.std(differences))
```
Our hypothesis is that the females in the couple have a lower psychopathy
score. We therefore predict that, in the whole population of male-female
couples, the difference measure will, on average, be positive (men have higher
scores than women).
We test this hypothesis by testing whether the sample mean is far enough from
zero to make it unlikely that the population mean is actually zero.
To rush ahead, a t-value is always a parameter of interest (in our case, the mean difference) divided by an estimate of the expected variation of that parameter. Specifically, in our case, the value we are expecting for the t-statistic is:
```{python}
expected_t = np.mean(differences) / (np.std(differences, ddof=1) / np.sqrt(n))
expected_t
```
Here we confirm with the Scipy `ttest_1samp` function:
```{python}
import scipy.stats as sps
sps.ttest_1samp(differences, popmean=0)
```
## The standard error of the mean
As you have seen above, the t-statistic requires that we divide the mean value by the expected standard deviation of the parameter. The expected standard deviation of the mean is the [standard error of
the mean](https://en.wikipedia.org/wiki/Standard_error#Standard_error_of_the_mean_2).
The [standard error](https://en.wikipedia.org/wiki/Standard_error) of a statistic (such as a mean) is the standard
deviation of the sampling distribution of the statistic. The sampling
distribution is the distribution of the statistic that we expect when taking
a very large number of samples and calculating the statistic for each one.
For example, the sampling distribution of the mean is the expected
distribution of the sample means generated by taking a very large number of
samples.
### What is the standard error of the mean?
To be more concrete, let's generate a *population* with a mean of 0 and a standard deviation of 5.
```{python}
# The default random number generator in Numpy.
rng = np.random.default_rng()
population = rng.normal(0, 5, size=10000000)
plt.hist(population, bins=100);
```
As usual, define the *mean* of a vector of values $\vec{x}$ as:
$$
\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i
$$
```{python}
# The population mean is near-enough that we'll call it 0
np.sum(population) / len(population)
```
```{python}
# The population standard deviation is near-enough that we'll call it 5
np.std(population)
```
```{python}
pop_mean = 0
pop_std = 5
```
Now imagine we took a sample of size `n` from this population:
```{python}
n
```
```{python}
sample = rng.choice(population, size=n, replace=False)
sample
```
The sample has a *sample mean*:
```{python}
sample_mean = np.mean(sample) # or np.sum(sample) / n
sample_mean
```
If I draw a new sample, it will have a different mean. The *mean* of the sample is itself somewhat random:
```{python}
sample = rng.choice(population, size=n, replace=False)
np.mean(sample)
```
Let us do that many times, to get an idea of the *distribution* of the means from samples of size `n`. This is called the *sampling distribution* of the mean:
```{python}
n_samples = 100000
sample_means = np.zeros(n_samples)
for i in range(n_samples):
    sample = rng.choice(population, size=n, replace=False)
    sample_means[i] = np.mean(sample)
sample_means[:10]
```
We now have the *sampling distribution* of the mean:
```{python}
plt.hist(sample_means, bins=50);
```
The *standard error* of the mean is nothing but an estimate, using some assumptions, of the standard deviation of this distribution. For example, for our particular `n_samples` samples, we got a sampling distribution standard deviation of:
```{python}
# Standard deviation of sampling distribution of mean.
np.std(sample_means)
```
The *standard error of the mean* (SEM) is an estimate of this value using some assumptions, but without having to draw multiple samples from the population. It is given by the population standard deviation divided by the square root of `n`:
```{python}
pop_sem = pop_std / np.sqrt(n)
pop_sem
```
Notice that the SEM is very close to our estimate by drawing lots of samples. Our estimate will, of course, be a little random, because it depends on the exact random samples we drew from the population.
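In fact, if we repeat the whole simulation, we get a slightly different answer each time. Here is a minimal re-run, re-using `population`, `rng` and `n` from above:
```{python}
# Repeat the sampling experiment to show the estimate varies a little.
more_means = np.zeros(10000)
for i in range(10000):
    more_means[i] = np.mean(rng.choice(population, size=n, replace=False))
# Close to pop_sem, but not exactly equal to our first estimate.
np.std(more_means)
```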
But wait! I hear you cry. We usually don't know the standard deviation of the population. Often, even usually, all we have to go on is the sample. We need an *estimate* of the standard deviation, to give an *estimate* of the standard error of the mean.
One estimate could be the standard deviation of the sample we have, as in:
```{python}
np.std(sample)
```
As you may remember, the Numpy calculation for the standard deviation is:
$$
v = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 \\
\sigma = \sqrt{v}
$$
In code, this is:
```{python}
# Numpy calculation
variance = np.sum((sample - np.mean(sample)) ** 2) / n
sigma = np.sqrt(variance)
sigma
```
Unfortunately, this is not the best number to use, because it turns out this calculation of the standard deviation tends to be *lower* than the population standard deviation, for reasons related to the fact that we used the *sample mean* to calculate the standard deviation. In technical terms, this calculation is *biased* to be too low, on average, compared to the population standard deviation.
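We can see this bias with a quick simulation, a sketch re-using `population`, `rng` and `n` from above. Averaged over many samples, the divide-by-$n$ variance estimate comes out below the true population variance of $5^2 = 25$:
```{python}
# Average the divide-by-n variance estimate over many samples.
n_trials = 10000
var_ests = np.zeros(n_trials)
for i in range(n_trials):
    samp = rng.choice(population, size=n, replace=False)
    var_ests[i] = np.sum((samp - np.mean(samp)) ** 2) / n
# Compare to the population variance of pop_std ** 2 == 25.
np.mean(var_ests)
```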
We have to apply a small correction for this effect, by dividing the sum of squared deviations by $n - 1$ rather than $n$. This version of the calculation gives an *unbiased* estimate of the population variance.
So, define the *unbiased estimate of the standard deviation* of $\vec{x}$ as:
$$
v_u = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \\
s_u = \sqrt{v_u}
$$
In code:
```{python}
v_u = np.sum((sample - np.mean(sample)) ** 2) / (n - 1)
s_u = np.sqrt(v_u)
s_u
```
Numpy does this calculation if we pass `ddof=1` to the `std` function. `ddof` stands for Delta Degrees of Freedom, and refers to the number of parameters we had to estimate to calculate the standard deviation. In our current case, this is 1, for the mean.
```{python}
np.std(sample, ddof=1)
```
The *unbiased* Standard Error of the Mean (SEM) from the *sample* uses the *unbiased* standard deviation from the sample, and is:
$$
SE_{\bar{x}} = \frac{s_u}{\sqrt{n}}
$$
where $n$ is the number of values in the sample, in our case:
```{python}
n
```
Therefore:
```{python}
SEM_u = s_u / np.sqrt(n)
SEM_u
```
Notice that this isn't far from the SEM calculated from the population standard deviation, or from the standard deviation of the sampling distribution that we calculated by taking many samples. However, this estimate of the SEM will vary from sample to sample, as the sample mean and unbiased standard deviation vary.
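For example, here is a minimal sketch, re-using `population`, `rng` and `n` from the simulation above, that computes the sample-based SEM for a few fresh samples; each one comes out a little different:
```{python}
# The sample-based SEM varies from sample to sample.
for i in range(5):
    samp = rng.choice(population, size=n, replace=False)
    print(np.std(samp, ddof=1) / np.sqrt(n))
```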
## The t statistic and the SEM
A t statistic is given by a value divided by the *standard error* of that
value. The t statistic for $\bar{x}$ is:
$$
t = \frac{\bar{x}}{SE_{\bar{x}}}
$$
In our case:
```{python}
x = differences
n = len(x)
x_bar = np.mean(x)
x_bar
```
```{python}
unbiased_var_x = 1. / (n - 1) * np.sum((x - x_bar) ** 2)
s_x = np.sqrt(unbiased_var_x)
s_x
```
```{python}
# Unbiased SEM uses the unbiased standard deviation
SEM = s_x / np.sqrt(n)
SEM
```
```{python}
t = x_bar / SEM
t
```
# Testing using the general linear model
It is overkill for us to use the general linear model for this, but it does
show how the machinery works in the simplest case.
$\newcommand{Xmat}{\boldsymbol X} \newcommand{\bvec}{\vec{\beta}}$
$\newcommand{\yvec}{\vec{y}} \newcommand{\xvec}{\vec{x}} \newcommand{\evec}{\vec{\varepsilon}}$
The matrix expression of the general linear model is:
$$
\yvec = \Xmat \bvec + \evec
$$
Define our design matrix $\Xmat$ to have a single column of ones:
```{python}
X = np.ones((n, 1))
X
```
$\newcommand{\bhat}{\hat{\bvec}} \newcommand{\yhat}{\hat{\yvec}}$
$\bhat$ is the least squares estimate for $\bvec$, and is given by:
$$
\bhat = (\Xmat^T \Xmat)^{-1} \Xmat^T \yvec
$$
Because $\Xmat$ is just a column of ones, $\Xmat^T \yvec = \sum_i{y_i}$.
$\Xmat^T \Xmat = n$, so $(\Xmat^T \Xmat)^{-1} = \frac{1}{n}$.
Thus:
$$
\bhat = (\Xmat^T \Xmat)^{-1} \Xmat^T \yvec \\
= \frac{1}{n} \sum_i{y_i} \\
= \bar{y}
$$
```{python}
pX = npl.pinv(X)
pX
```
```{python}
B = pX @ differences
assert np.isclose(B, np.mean(differences))
```
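We can also check $\bhat$ against the explicit formula above; this is a quick sketch using `npl.inv` rather than the pseudoinverse, which is valid here because $\Xmat^T \Xmat$ is invertible:
```{python}
# Same estimate via the explicit least-squares formula.
B_direct = npl.inv(X.T @ X) @ X.T @ differences
assert np.isclose(B_direct, B)
```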
The Student’s t statistic from the general linear model is:
$$
t = \frac{c^T \hat\beta}{\sqrt{\hat{\sigma}^2 c^T (\Xmat^T \Xmat)^+ c}}
$$
where $\hat{\sigma}^2$ is our estimate of variance in the residuals, $c$ is a
contrast vector to select some combination of the design columns, and
$(\Xmat^T \Xmat)^+$ is the *pseudoinverse* of $\Xmat^T \Xmat$.
In our case we have only one design column, so $c = [1]$ and we can omit it.
$\hat{\sigma}^2 = s_x^2$ for $s_x$ defined above. $\Xmat^T \Xmat$ is
invertible, and we know the inverse already: $\frac{1}{n}$.
```{python}
iXtX = npl.inv(X.T @ X)
assert np.isclose(iXtX, 1 / n)
```
Therefore:
$$
t = \frac{\bar{y}}{s_x \sqrt{\frac{1}{n}}} \\
= \bar{y} \Big/ \frac{s_x}{\sqrt{n}} \\
= \frac{\bar{y}}{SE_{\bar{y}}}
$$
```{python}
s_x = np.sqrt(np.sum((differences - np.mean(differences)) ** 2) / (n - 1))
t = B / (s_x * np.sqrt(iXtX))
t
```
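As a final check, the t statistic from the general linear model machinery matches the value we calculated directly from the mean and the SEM, and the statistic from Scipy's `ttest_1samp` above:
```{python}
# The GLM t statistic matches our direct calculation and Scipy's result.
assert np.isclose(t, expected_t)
assert np.isclose(t, sps.ttest_1samp(differences, popmean=0).statistic)
```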