---
jupyter:
  jupytext:
    notebook_metadata_filter: all,-language_info
    split_at_heading: true
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.13.7
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---
# A worked example of the general linear model
See: [introduction to the general linear model](https://matthew-brett.github.io/teaching/glm_intro.html).
Here we go through the matrix formulation of the general linear model with the
simplest possible example – a t-test for the difference of the sample mean
from zero.
Let us say we have the hypothesis that men are more psychopathic than women.
We find 10 male-female couples, and we give them each a psychopathy
questionnaire. We score the questionnaires, and then, for each couple, we
subtract the woman’s score from the man’s score, to get a difference score.
We have 10 difference scores:
```{python}
#: Our standard imports
import numpy as np
import numpy.linalg as npl
# Only show 4 decimals when printing arrays
np.set_printoptions(precision=4)
# Plotting
import matplotlib.pyplot as plt
```
```{python}
differences = np.array([1.5993, -0.13,   2.3806,  1.3761, -1.3595,
                        1.0286,  0.8466, 1.6669, -0.8241,  0.4469])
n = len(differences)
print('Mean', np.sum(differences) / n)
print('STD', np.std(differences))
```
Our hypothesis is that the females in the couple have a lower psychopathy
score. We therefore predict that, in the whole population of male-female
couples, the difference measure will, on average, be positive (men have higher
scores than women).
We test this hypothesis by testing whether the sample mean is far enough from
zero to make it unlikely that the population mean is actually zero.
To rush ahead, a t-value is always a parameter of interest (in our case, the mean difference) divided by an estimate of the expected variation of that parameter. Specifically, in our case, the value we are expecting for the t-statistic is:
```{python}
expected_t = np.mean(differences) / (np.std(differences, ddof=1) / np.sqrt(n))
expected_t
```
Here we confirm with the Scipy `ttest_1samp` function:
```{python}
import scipy.stats as sps
sps.ttest_1samp(differences, popmean=0)
```
## The standard error of the mean
As you have seen above, the t-statistic requires that we divide the mean value by the expected standard deviation of the parameter. The expected standard deviation of the mean is the [standard error of
the mean](https://en.wikipedia.org/wiki/Standard_error#Standard_error_of_the_mean_2).
The [standard error](https://en.wikipedia.org/wiki/Standard_error) of a statistic (such as a mean) is the standard
deviation of the sampling distribution of the statistic. The sampling
distribution is the distribution of the statistic that we expect when taking
a very large number of samples and calculating the statistic for each one.
For example, the sampling distribution of the mean is the expected
distribution of the sample means generated by taking a very large number of
samples.
### What is the standard error of the mean?
To be more concrete, let's generate a *population* with a mean of 0 and a standard deviation of 5.
```{python}
# The default random number generator in Numpy.
rng = np.random.default_rng()
population = rng.normal(0, 5, size=10000000)
plt.hist(population, bins=100);
```
As usual, define the *mean* of a vector of values $\vec{x}$ as:
$$
\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i
$$
```{python}
# The population mean is near-enough that we'll call it 0
np.sum(population) / len(population)
```
```{python}
# The population standard deviation is near-enough that we'll call it 5
np.std(population)
```
```{python}
pop_mean = 0
pop_std = 5
```
Now imagine we took a sample of size `n` from this population:
```{python}
n
```
```{python}
sample = rng.choice(population, size=n, replace=False)
sample
```
The sample has a *sample mean*:
```{python}
sample_mean = np.mean(sample) # or np.sum(sample) / n
sample_mean
```
If I draw a new sample, it will have a different mean. The *mean* of the sample is itself somewhat random:
```{python}
sample = rng.choice(population, size=n, replace=False)
np.mean(sample)
```
Let us do that many times, to get an idea of the *distribution* of the means from samples of size `n`. This is called the *sampling distribution* of the mean:
```{python}
n_samples = 100000
sample_means = np.zeros(n_samples)
for i in range(n_samples):
    sample = rng.choice(population, size=n, replace=False)
    sample_means[i] = np.mean(sample)
sample_means[:10]
```
We now have the *sampling distribution* of the mean:
```{python}
plt.hist(sample_means, bins=50);
```
The *standard error* of the mean is nothing but an estimate, using some assumptions, of the standard deviation of this distribution. For example, for our particular `n_samples` samples, we got a sampling distribution standard deviation of:
```{python}
# Standard deviation of sampling distribution of mean.
np.std(sample_means)
```
The *standard error of the mean* (SEM) is an estimate of this value using some assumptions, but without having to draw multiple samples from the population. It is given by the population standard deviation divided by the square root of `n`:
```{python}
pop_sem = pop_std / np.sqrt(n)
pop_sem
```
Notice that the SEM is very close to our estimate by drawing lots of samples. Our estimate will, of course, be a little random, because it depends on the exact random samples we drew from the population.
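In fact, if we repeat the whole simulation, we get a slightly different answer each time. Here is a minimal re-run, re-using `population`, `rng` and `n` from above:
```{python}
# Repeat the sampling experiment to show the estimate varies a little.
more_means = np.zeros(10000)
for i in range(10000):
    more_means[i] = np.mean(rng.choice(population, size=n, replace=False))
# Close to pop_sem, but not exactly equal to our first estimate.
np.std(more_means)
```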
But wait! I hear you cry. We usually don't know the standard deviation of the population. Often, even usually, all we have to go on is the sample. We need an *estimate* of the standard deviation, to give an *estimate* of the standard error of the mean.
One estimate could be the standard deviation of the sample we have, as in:
```{python}
np.std(sample)
```
As you may remember, the Numpy calculation for the standard deviation is:
$$
v = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 \\
\sigma = \sqrt{v}
$$
In code, this is:
```{python}
# Numpy calculation
variance = np.sum((sample - np.mean(sample)) ** 2) / n
sigma = np.sqrt(variance)
sigma
```
Unfortunately, this is not the best number to use, because it turns out this calculation of the standard deviation tends to be *lower* than the population standard deviation, for reasons related to the fact that we used the *sample mean* to calculate the standard deviation. In technical terms, this calculation is *biased* to be too low, on average, compared to the population standard deviation.
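We can see this bias with a quick simulation, a sketch re-using `population`, `rng` and `n` from above. Averaged over many samples, the divide-by-$n$ variance estimate comes out below the true population variance of $5^2 = 25$:
```{python}
# Average the divide-by-n variance estimate over many samples.
n_trials = 10000
var_ests = np.zeros(n_trials)
for i in range(n_trials):
    samp = rng.choice(population, size=n, replace=False)
    var_ests[i] = np.sum((samp - np.mean(samp)) ** 2) / n
# Compare to the population variance of pop_std ** 2 == 25.
np.mean(var_ests)
```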
We have to apply a small correction for this effect, by dividing the sum of squared deviations by $n - 1$ rather than $n$. This version of the calculation gives an *unbiased* estimate of the population variance.
So, define the *unbiased estimate of the standard deviation* of $\vec{x}$ as:
$$
v_u = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \\
s_u = \sqrt{v_u}
$$
In code:
```{python}
v_u = np.sum((sample - np.mean(sample)) ** 2) / (n - 1)
s_u = np.sqrt(v_u)
s_u
```
Numpy does this calculation if we pass `ddof=1` to the `std` function. `ddof` stands for Delta Degrees of Freedom, and refers to the number of parameters we had to estimate to calculate the standard deviation. In our current case, this is 1, for the mean.
```{python}
np.std(sample, ddof=1)
```
The *unbiased* Standard Error of the Mean (SEM) from the *sample* uses the *unbiased* standard deviation from the sample, and is:
$$
SE_{\bar{x}} = \frac{s_u}{\sqrt{n}}
$$
where $n$ is the number of values in the sample, in our case:
```{python}
n
```
Therefore:
```{python}
SEM_u = s_u / np.sqrt(n)
SEM_u
```
Notice that this isn't far from the SEM calculated from the population standard deviation, or from the standard deviation of the sampling distribution that we calculated by taking many samples. However, this estimate of the SEM will vary from sample to sample, as the sample mean and unbiased standard deviation vary.
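For example, here is a minimal sketch, re-using `population`, `rng` and `n` from the simulation above, that computes the sample-based SEM for a few fresh samples; each one comes out a little different:
```{python}
# The sample-based SEM varies from sample to sample.
for i in range(5):
    samp = rng.choice(population, size=n, replace=False)
    print(np.std(samp, ddof=1) / np.sqrt(n))
```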
## The t statistic and the SEM
A t statistic is given by a value divided by the *standard error* of that
value. The t statistic for $\bar{x}$ is:
$$
t = \frac{\bar{x}}{SE_{\bar{x}}}
$$
In our case:
```{python}
x = differences
n = len(x)
x_bar = np.mean(x)
x_bar
```
```{python}
unbiased_var_x = 1. / (n - 1) * np.sum((x - x_bar) ** 2)
s_x = np.sqrt(unbiased_var_x)
s_x
```
```{python}
# Unbiased SEM uses the unbiased standard deviation
SEM = s_x / np.sqrt(n)
SEM
```
```{python}
t = x_bar / SEM
t
```
# Testing using the general linear model
It is overkill for us to use the general linear model for this, but it does
show how the machinery works in the simplest case.
$\newcommand{Xmat}{\boldsymbol X} \newcommand{\bvec}{\vec{\beta}}$
$\newcommand{\yvec}{\vec{y}} \newcommand{\xvec}{\vec{x}} \newcommand{\evec}{\vec{\varepsilon}}$
The matrix expression of the general linear model is:
$$
\yvec = \Xmat \bvec + \evec
$$
Define our design matrix $\Xmat$ to have a single column of ones:
```{python}
X = np.ones((n, 1))
X
```
$\newcommand{\bhat}{\hat{\bvec}} \newcommand{\yhat}{\hat{\yvec}}$
$\bhat$ is the least squares estimate for $\bvec$, and is given by:
$$
\bhat = (\Xmat^T \Xmat)^{-1} \Xmat^T \yvec
$$
Because $\Xmat$ is just a column of ones, $\Xmat^T \yvec = \sum_i{y_i}$.
$\Xmat^T \Xmat = n$, so $(\Xmat^T \Xmat)^{-1} = \frac{1}{n}$.
Thus:
$$
\bhat = (\Xmat^T \Xmat)^{-1} \Xmat^T \yvec \\
= \frac{1}{n} \sum_i{y_i} \\
= \bar{y}
$$
```{python}
pX = npl.pinv(X)
pX
```
```{python}
B = pX @ differences
assert np.isclose(B, np.mean(differences))
```
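We can also check $\bhat$ against the explicit formula above; this is a quick sketch using `npl.inv` rather than the pseudoinverse, which is valid here because $\Xmat^T \Xmat$ is invertible:
```{python}
# Same estimate via the explicit least-squares formula.
B_direct = npl.inv(X.T @ X) @ X.T @ differences
assert np.isclose(B_direct, B)
```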
The Student’s t statistic from the general linear model is:
$$
t = \frac{c^T \hat\beta}{\sqrt{\hat{\sigma}^2 c^T (\Xmat^T \Xmat)^+ c}}
$$
where $\hat{\sigma}^2$ is our estimate of variance in the residuals, $c$ is a
contrast vector to select some combination of the design columns, and
$(\Xmat^T \Xmat)^+$ is the *pseudoinverse* of $\Xmat^T \Xmat$.
In our case we have only one design column, so $c = [1]$ and we can omit it.
$\hat{\sigma}^2 = s_x^2$ for $s_x$ defined above. $\Xmat^T \Xmat$ is
invertible, and we know the inverse already: $\frac{1}{n}$.
```{python}
iXtX = npl.inv(X.T @ X)
assert np.isclose(iXtX, 1 / n)
```
Therefore:
$$
t = \frac{\bar{y}}{s_x \sqrt{\frac{1}{n}}} \\
= \bar{y} \Big/ \frac{s_x}{\sqrt{n}} \\
= \frac{\bar{y}}{SE_{\bar{y}}}
$$
```{python}
s_x = np.sqrt(np.sum((differences - np.mean(differences)) ** 2) / (n - 1))
t = B / (s_x * np.sqrt(iXtX))
t
```
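As a final check, the t statistic from the general linear model machinery matches the value we calculated directly from the mean and the SEM, and the statistic from Scipy's `ttest_1samp` above:
```{python}
# The GLM t statistic matches our direct calculation and Scipy's result.
assert np.isclose(t, expected_t)
assert np.isclose(t, sps.ttest_1samp(differences, popmean=0).statistic)
```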