---
jupyter:
  jupytext:
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
    jupytext_version: 1.11.5
  kernelspec:
    display_name: Python 3 (ipykernel)
    language: python
    name: python3
---
# A worked example of the general linear model

See: [introduction to the general linear model](https://matthew-brett.github.io/teaching/glm_intro.html).

Here we go through the matrix formulation of the general linear model with the
simplest possible example – a t-test for the difference of the sample mean
from zero.

Let us say we have the hypothesis that men are more psychopathic than women.
We find 10 male-female couples, and we give them each a psychopathy
questionnaire. We score the questionnaires, and then, for each couple, we
subtract the woman’s score from the man’s score, to get a difference score.
We have 10 difference scores:
```{python}
#: Our standard imports
import numpy as np
import numpy.linalg as npl
# Only show 4 decimals when printing arrays
np.set_printoptions(precision=4)
```
```{python}
differences = np.array([ 1.5993, -0.13  ,  2.3806,  1.3761, -1.3595,
                         1.0286,  0.8466,  1.6669, -0.8241,  0.4469])
```
Our hypothesis is that the females in the couple have a lower psychopathy
score. We therefore predict that, in the whole population of male-female
couples, the difference measure will, on average, be positive (men have higher
scores than women).

We test this hypothesis by asking whether the sample mean is far enough from
zero to make it unlikely that the population mean is actually zero.
# The standard error of the mean

One way to test this is to compare the mean value to the [standard error of
the mean](https://en.wikipedia.org/wiki/Standard_error#Standard_error_of_the_mean_2).

The [standard error](https://en.wikipedia.org/wiki/Standard_error) of a statistic (such as a mean) is the standard
deviation of the sampling distribution of the statistic. The sampling
distribution of a statistic is the distribution of values we expect for the
statistic when taking a very large number of samples. For example, the
sampling distribution of the mean is the expected distribution of the sample
means generated by taking a very large number of samples.

As usual, define the *mean* of a vector of values $\vec{x}$ as:
$$
\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i
$$
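As a quick check, the formula is just the sum of the values divided by the number of values, and should match NumPy's built-in `np.mean`:

```{python}
# The mean: sum of the values divided by the number of values.
sum_div_n = np.sum(differences) / len(differences)
# Should match NumPy's built-in mean.
print(sum_div_n, np.mean(differences))
```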
Define the *unbiased estimate of the standard deviation* of $\vec{x}$ as:
$$
s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}
$$
Notice the $\frac{1}{n-1}$. We have previously used $n$ rather than $n-1$ as
the divisor for the *sample standard deviation*. With $n$ as the divisor, the
*sample standard deviation* is a biased estimate of the population standard
deviation. Using $n-1$ as the divisor gives an unbiased estimate of the
population standard deviation.
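NumPy's `np.std` can calculate either version: its `ddof` (delta degrees of freedom) argument gives the value subtracted from $n$ to form the divisor. For our difference scores:

```{python}
# Biased estimate: divisor is n (ddof=0 is the default).
print(np.std(differences, ddof=0))
# Unbiased estimate: divisor is n - 1.
print(np.std(differences, ddof=1))
```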
The standard error of the mean is:
$$
SE_{\bar{x}} = \frac{s_x}{\sqrt{n}}
$$
where $n$ is the number of samples, 10 in our case.

A t statistic is given by a value divided by the *standard error* of that
value. The t statistic for $\bar{x}$ is:
$$
t = \frac{\bar{x}}{SE_{\bar{x}}}
$$
In our case:
```{python}
x = differences
n = len(x)
x_bar = np.mean(x)
x_bar
```
```{python}
unbiased_var_x = 1. / (n - 1) * np.sum((x - x_bar) ** 2)
s_x = np.sqrt(unbiased_var_x)
s_x
```
```{python}
SEM = s_x / np.sqrt(n)
SEM
```
```{python}
t = x_bar / SEM
t
```
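As a cross-check, and assuming SciPy is installed, `scipy.stats.ttest_1samp` runs the same one-sample t-test against a population mean of zero, so its statistic should match the t value above:

```{python}
# Assumes SciPy is available in this environment.
import scipy.stats as sps
# One-sample t-test of the differences against a population mean of 0.
result = sps.ttest_1samp(differences, 0)
# The statistic should match our hand-calculated t.
print(result.statistic, t)
```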
# The standard error as an estimator

Imagine that we repeat the following procedure many times, collecting each
mean into a vector $\vec{m}$:

* take $n$ samples from some distribution;
* take the mean of these $n$ samples.

The standard error of the mean is an estimate of the standard deviation of the
values in $\vec{m}$.

We can simulate this process when we know the exact distribution of the values
in the population. For example, let us imagine that we know that the
population distribution is a normal distribution with population mean 5 ($\mu
= 5$) and population standard deviation 1.5 ($\sigma = 1.5$). We can use
NumPy's random number generator to draw samples from this distribution:
```{python}
# 10 samples from normal distribution with mean 5 and sd 1.5
rng = np.random.default_rng()
sample = rng.normal(5, 1.5, size=10)
sample
```
We know the exact mean ($\mu$) and standard deviation ($\sigma$). If we draw
a near-infinite number of samples like this, then the standard deviation of
the means of these samples will be $\frac{\sigma}{\sqrt{n}}$:
```{python}
1.5 / np.sqrt(n)
```
We can compare this number to an estimate of the same thing from a very large
number of means from samples of size 10:
```{python}
n_samples = 100000
pop_mean = 5
pop_sd = 1.5
means_of_samples = np.zeros(n_samples)
for i in range(n_samples):
    sample = rng.normal(pop_mean, pop_sd, size=n)
    means_of_samples[i] = sample.mean()
sq_deviations = (means_of_samples - pop_mean) ** 2
# With lots of samples, this value is close to the exact number
means_std_dev = np.sqrt(1. / n_samples * np.sum(sq_deviations))
means_std_dev
```
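The loop above follows the procedure step by step, but NumPy can draw all the samples in one call, as a two-dimensional array with one sample per row. A sketch of the same estimate without the loop:

```{python}
# Draw all the samples at once: one row per sample of size n.
all_samples = rng.normal(pop_mean, pop_sd, size=(n_samples, n))
# The mean of each row is the mean of one sample of size n.
row_means = all_samples.mean(axis=1)
# Standard deviation of the means around the population mean.
np.sqrt(np.mean((row_means - pop_mean) ** 2))
```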
Normally we do not know the exact standard deviation of the population; the
standard error of the mean is an estimate for that situation. First we use
the sample that we have to get an *unbiased estimate* of the population
standard deviation:
```{python}
n = 10
y = sample
y_bar = y.mean()
unbiased_var_estimate = 1. / (n - 1) * np.sum((y - y_bar) ** 2)
unbiased_sd_est = np.sqrt(unbiased_var_estimate)
```
Of course, this estimate will not be exactly the same as the population
standard deviation ($\sigma = 1.5$):
```{python}
unbiased_sd_est
```
We use the unbiased standard deviation estimate to give an unbiased estimate
for the standard error of the mean. This estimate will be close to the
standard deviation of means from many samples of size $n$:
```{python}
# Standard deviation of means from population
print(1.5 / np.sqrt(n))
# Our estimate for the standard error of mean
print(unbiased_sd_est / np.sqrt(n))
```
# Testing using the general linear model
It is overkill for us to use the general linear model for this, but it does
show how the machinery works in the simplest case.
$\newcommand{\Xmat}{\boldsymbol X} \newcommand{\bvec}{\vec{\beta}}$
$\newcommand{\yvec}{\vec{y}} \newcommand{\xvec}{\vec{x}} \newcommand{\evec}{\vec{\varepsilon}}$
The matrix expression of the general linear model is:
$$
\yvec = \Xmat \bvec + \evec
$$
$\newcommand{\Xmat}{\boldsymbol X} \newcommand{\bvec}{\vec{\beta}}$
Define our design matrix $\Xmat$ to have a single column of ones:
```{python}
X = np.ones((n, 1))
X
```
$\newcommand{\bhat}{\hat{\bvec}} \newcommand{\yhat}{\hat{\yvec}}$
$\bhat$ is the least squares estimate for $\bvec$, and is given by:
$$
\bhat = (\Xmat^T \Xmat)^{-1} \Xmat^T \yvec
$$
Because $\Xmat$ is just a column of ones, $\Xmat^T \yvec = \sum_i{y_i}$.
$\Xmat^T \Xmat = n$, so $(\Xmat^T \Xmat)^{-1} = \frac{1}{n}$.
Thus:
$$
\bhat = (\Xmat^T \Xmat)^{-1} \Xmat^T \yvec \\
= \frac{1}{n} \sum_i{y_i} \\
= \bar{y}
$$
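We can check this numerically, taking $\yvec$ to be our vector of difference scores (the data this test is about):

```{python}
# Take y to be our vector of differences.
y = differences
# The least squares estimate from the matrix formula.
B = npl.inv(X.T @ X) @ X.T @ y
# This should equal the sample mean.
print(B, np.mean(y))
```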
The Student’s t statistic from the general linear model is:
$$
t = \frac{c^T \hat\beta}{\sqrt{\hat{\sigma}^2 c^T (\Xmat^T \Xmat)^+ c}}
$$
where $\hat{\sigma}^2$ is our estimate of variance in the residuals, $c$ is a
contrast vector to select some combination of the design columns, and
$(\Xmat^T \Xmat)^+$ is the *pseudoinverse* of $\Xmat^T \Xmat$.
In our case we have only one design column, so $c = [1]$ and we can omit it.
$\hat{\sigma}^2 = s_x^2$ for $s_x$ defined above. $\Xmat^T \Xmat$ is
invertible, and we know the inverse already: $\frac{1}{n}$. Therefore:
$$
t = \frac{\bar{y}}{s_x \sqrt{\frac{1}{n}}} \\
= \bar{y} \Big/ \frac{s_x}{\sqrt{n}} \\
= \frac{\bar{y}}{SE_{\bar{y}}}
$$
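Putting the pieces together in code, the GLM formula should reproduce the t statistic we calculated at the start, with $c = [1]$ and $\hat{\sigma}^2$ estimated from the residuals using $n - 1$ degrees of freedom:

```{python}
c = np.array([1])
# Fitted values and residuals from the estimated model.
fitted = X @ B
residuals = y - fitted
# Unbiased variance estimate: divide by n minus the number of columns in X.
sigma_2_hat = np.sum(residuals ** 2) / (n - 1)
# The GLM t statistic, using the pseudoinverse of X.T @ X.
t_glm = (c @ B) / np.sqrt(sigma_2_hat * (c @ npl.pinv(X.T @ X) @ c))
# Should match our earlier t value.
print(t_glm, t)
```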