---
jupyter:
  jupytext:
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
    jupytext_version: 1.11.5
  kernelspec:
    display_name: Python 3 (ipykernel)
    language: python
    name: python3
---
# A worked example of the general linear model

See: [introduction to the general linear model](https://matthew-brett.github.io/teaching/glm_intro.html).

Here we go through the matrix formulation of the general linear model with the
simplest possible example – a t-test for the difference of the sample mean
from zero.

Let us say we have the hypothesis that men are more psychopathic than women.
We find 10 male-female couples, and we give them each a psychopathy
questionnaire. We score the questionnaires, and then, for each couple, we
subtract the woman’s score from the man’s score, to get a difference score.
We have 10 difference scores:
```{python}
#: Our standard imports
import numpy as np
import numpy.linalg as npl
# Only show 4 decimals when printing arrays
np.set_printoptions(precision=4)
```
```{python}
differences = np.array([ 1.5993, -0.13  ,  2.3806,  1.3761, -1.3595,
                         1.0286,  0.8466,  1.6669, -0.8241,  0.4469])
```
Our hypothesis is that the females in the couple have a lower psychopathy
score. We therefore predict that, in the whole population of male-female
couples, the difference measure will, on average, be positive (men have higher
scores than women).

We test this hypothesis by asking whether the sample mean is far enough from
zero to make it unlikely that the population mean is actually zero.
# The standard error of the mean

One way to test this is to compare the mean value to the [standard error of
the mean](https://en.wikipedia.org/wiki/Standard_error#Standard_error_of_the_mean_2).

The [standard error](https://en.wikipedia.org/wiki/Standard_error) of a statistic (such as a mean) is the standard
deviation of the sampling distribution of the statistic. The sampling
distribution of a statistic is the distribution of values we expect for the
statistic when taking a very large number of samples. For example, the
sampling distribution of the mean is the expected distribution of the sample
means generated by taking a very large number of samples.

As usual, define the *mean* of a vector of values $\vec{x}$ as:
$$
\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i
$$
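As a quick check, the formula is just the sum of the values divided by the number of values, and should match NumPy's built-in `np.mean`:

```{python}
# The mean: sum of the values divided by the number of values.
sum_div_n = np.sum(differences) / len(differences)
# Should match NumPy's built-in mean.
print(sum_div_n, np.mean(differences))
```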
Define the *unbiased estimate of the standard deviation* of $\vec{x}$ as:
$$
s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}
$$
Notice the $\frac{1}{n-1}$. We have previously used $n$ rather than $n-1$ as
the divisor for the *sample standard deviation*. With $n$ as the divisor, the
*sample standard deviation* is a biased estimate of the population standard
deviation. Using $n-1$ as the divisor gives an unbiased estimate of the
population standard deviation.
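NumPy's `np.std` can calculate either version: its `ddof` (delta degrees of freedom) argument gives the value subtracted from $n$ to form the divisor. For our difference scores:

```{python}
# Biased estimate: divisor is n (ddof=0 is the default).
print(np.std(differences, ddof=0))
# Unbiased estimate: divisor is n - 1.
print(np.std(differences, ddof=1))
```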
The standard error of the mean is:
$$
SE_{\bar{x}} = \frac{s_x}{\sqrt{n}}
$$
where $n$ is the number of samples, 10 in our case.

A t statistic is given by a value divided by the *standard error* of that
value. The t statistic for $\bar{x}$ is:
$$
t = \frac{\bar{x}}{SE_{\bar{x}}}
$$
In our case:
```{python}
x = differences
n = len(x)
x_bar = np.mean(x)
x_bar
```
```{python}
unbiased_var_x = 1. / (n - 1) * np.sum((x - x_bar) ** 2)
s_x = np.sqrt(unbiased_var_x)
s_x
```
```{python}
SEM = s_x / np.sqrt(n)
SEM
```
```{python}
t = x_bar / SEM
t
```
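As a cross-check, and assuming SciPy is installed, `scipy.stats.ttest_1samp` runs the same one-sample t-test against a population mean of zero, so its statistic should match the t value above:

```{python}
# Assumes SciPy is available in this environment.
import scipy.stats as sps
# One-sample t-test of the differences against a population mean of 0.
result = sps.ttest_1samp(differences, 0)
# The statistic should match our hand-calculated t.
print(result.statistic, t)
```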
# The standard error as an estimator

Imagine that we repeat the following procedure many times, collecting each
mean into a vector $\vec{m}$:

* take $n$ samples from some distribution;
* take the mean of these $n$ samples.

The standard error of the mean is an estimate of the standard deviation of the
values in $\vec{m}$.

We can simulate this process when we know the exact distribution of the values
in the population. For example, let us imagine that we know that the
population distribution is a normal distribution with population mean 5 ($\mu
= 5$) and population standard deviation 1.5 ($\sigma = 1.5$). We can use
NumPy's random number generator to draw samples from this distribution:
```{python}
# 10 samples from normal distribution with mean 5 and sd 1.5
rng = np.random.default_rng()
sample = rng.normal(5, 1.5, size=10)
sample
```
We know the exact mean ($\mu$) and standard deviation ($\sigma$). If we draw
a near-infinite number of samples like this, then the standard deviation of
the means of these samples will be $\frac{\sigma}{\sqrt{n}}$:
```{python}
1.5 / np.sqrt(n)
```
We can compare this number to an estimate of the same thing from a very large
number of means from samples of size 10:
```{python}
n_samples = 100000
pop_mean = 5
pop_sd = 1.5
means_of_samples = np.zeros(n_samples)
for i in range(n_samples):
    sample = rng.normal(pop_mean, pop_sd, size=n)
    means_of_samples[i] = sample.mean()
sq_deviations = (means_of_samples - pop_mean) ** 2
# With lots of samples, this value is close to the exact number
means_std_dev = np.sqrt(1. / n_samples * np.sum(sq_deviations))
means_std_dev
```
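The loop above follows the procedure step by step, but NumPy can draw all the samples in one call, as a two-dimensional array with one sample per row. A sketch of the same estimate without the loop:

```{python}
# Draw all the samples at once: one row per sample of size n.
all_samples = rng.normal(pop_mean, pop_sd, size=(n_samples, n))
# The mean of each row is the mean of one sample of size n.
row_means = all_samples.mean(axis=1)
# Standard deviation of the means around the population mean.
np.sqrt(np.mean((row_means - pop_mean) ** 2))
```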
Normally we do not know the exact standard deviation of the population; the
standard error of the mean is an estimate for that situation. First we use
the sample that we have to get an *unbiased estimate* of the population
standard deviation:
```{python}
n = 10
y = sample
y_bar = y.mean()
unbiased_var_estimate = 1. / (n - 1) * np.sum((y - y_bar) ** 2)
unbiased_sd_est = np.sqrt(unbiased_var_estimate)
```
Of course, this estimate will not be exactly the same as the population
standard deviation ($\sigma = 1.5$):
```{python}
unbiased_sd_est
```
We use the unbiased standard deviation estimate to give an unbiased estimate
for the standard error of the mean. This estimate will be close to the
standard deviation of means from many samples of size $n$:
```{python}
# Standard deviation of means from population
print(1.5 / np.sqrt(n))
# Our estimate for the standard error of mean
print(unbiased_sd_est / np.sqrt(n))
```
# Testing using the general linear model
It is overkill for us to use the general linear model for this, but it does
show how the machinery works in the simplest case.
$\newcommand{\Xmat}{\boldsymbol X} \newcommand{\bvec}{\vec{\beta}}$
$\newcommand{\yvec}{\vec{y}} \newcommand{\xvec}{\vec{x}} \newcommand{\evec}{\vec{\varepsilon}}$
The matrix expression of the general linear model is:
$$
\yvec = \Xmat \bvec + \evec
$$
$\newcommand{\Xmat}{\boldsymbol X} \newcommand{\bvec}{\vec{\beta}}$
Define our design matrix $\Xmat$ to have a single column of ones:
```{python}
X = np.ones((n, 1))
X
```
$\newcommand{\bhat}{\hat{\bvec}} \newcommand{\yhat}{\hat{\yvec}}$
$\bhat$ is the least squares estimate for $\bvec$, and is given by:
$$
\bhat = (\Xmat^T \Xmat)^{-1} \Xmat^T \yvec
$$
Because $\Xmat$ is just a column of ones, $\Xmat^T \yvec = \sum_i{y_i}$.
$\Xmat^T \Xmat = n$, so $(\Xmat^T \Xmat)^{-1} = \frac{1}{n}$.
Thus:
$$
\bhat = (\Xmat^T \Xmat)^{-1} \Xmat^T \yvec \\
= \frac{1}{n} \sum_i{y_i} \\
= \bar{y}
$$
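We can check this numerically, taking $\yvec$ to be our vector of difference scores (the data this test is about):

```{python}
# Take y to be our vector of differences.
y = differences
# The least squares estimate from the matrix formula.
B = npl.inv(X.T @ X) @ X.T @ y
# This should equal the sample mean.
print(B, np.mean(y))
```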
The Student’s t statistic from the general linear model is:
$$
t = \frac{c^T \hat\beta}{\sqrt{\hat{\sigma}^2 c^T (\Xmat^T \Xmat)^+ c}}
$$
where $\hat{\sigma}^2$ is our estimate of variance in the residuals, $c$ is a
contrast vector to select some combination of the design columns, and
$(\Xmat^T \Xmat)^+$ is the *pseudoinverse* of $\Xmat^T \Xmat$.
In our case we have only one design column, so $c = [1]$ and we can omit it.
$\hat{\sigma}^2 = s_x^2$ for $s_x$ defined above. $\Xmat^T \Xmat$ is
invertible, and we know the inverse already: $\frac{1}{n}$. Therefore:
$$
t = \frac{\bar{y}}{s_x \sqrt{\frac{1}{n}}} \\
= \bar{y} \Big/ \frac{s_x}{\sqrt{n}} \\
= \frac{\bar{y}}{SE_{\bar{y}}}
$$
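Putting the pieces together in code, the GLM formula should reproduce the t statistic we calculated at the start, with $c = [1]$ and $\hat{\sigma}^2$ estimated from the residuals using $n - 1$ degrees of freedom:

```{python}
c = np.array([1])
# Fitted values and residuals from the estimated model.
fitted = X @ B
residuals = y - fitted
# Unbiased variance estimate: divide by n minus the number of columns in X.
sigma_2_hat = np.sum(residuals ** 2) / (n - 1)
# The GLM t statistic, using the pseudoinverse of X.T @ X.
t_glm = (c @ B) / np.sqrt(sigma_2_hat * (c @ npl.pinv(X.T @ X) @ c))
# Should match our earlier t value.
print(t_glm, t)
```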