HERS-example.qmd

## Example: hormone therapy study

::: notes

Now, we're going to analyze some real-world data using a Gaussian model, and then we're going to do a simulation to examine the properties of maximum likelihood estimation for that Gaussian model.

Here we look at the "heart and estrogen/progestin study" (HERS), a clinical 
trial of hormone therapy for prevention of recurrent heart attacks and 
death among 2,763 post-menopausal women with existing coronary heart disease 
(CHD) (Hulley et al. 1998).

We are going to model the distribution of fasting glucose among nondiabetics who don't exercise.

:::

```{r}
#| eval: false
# load the data directly from a UCSF website
hers = haven::read_dta(
  paste0( # I'm breaking up the url into two chunks for readability
    "https://regression.ucsf.edu/sites/g/files",
    "/tkssra6706/f/wysiwyg/home/data/hersdata.dta"
  )
)

```

```{r}
#| include: false
library(haven)
hers = read_stata("Data/hersdata.dta")
```

```{r}
#| tbl-cap: "HERS dataset"
#| label: tbl-HERS
hers |> head()
```

---

```{r}

n.obs = 100 # we're going to take a small subset of the data to look at; 
# if we took the whole data set, the likelihood function would be hard to 
# graph nicely

library(dplyr)
data1 = 
  hers |> 
  filter(
    diabetes == 0,
    exercise == 0) |> 
  head(n.obs)

glucose_data = 
  data1 |> 
  pull(glucose)

library(ggplot2)
library(ggeasy)
plot1 = 
  data1 |> 
  ggplot(aes(x = glucose)) +
  geom_histogram(aes(x = glucose, after_stat(density))) +
  theme_classic() +
  easy_labs()

print(plot1)

```

Looks somewhat plausibly Gaussian. Good enough for this example!

### Find the MLEs

```{r}

mu_hat = mean(glucose_data)
sigma_sq_hat = mean((glucose_data - mean(glucose_data))^2)

```

Our MLEs are:

* $\hat\mu = `r mu_hat`$

* $\hat\sigma^2 = `r sigma_sq_hat`$

Here's the estimated distribution, superimposed on our histogram:

```{r}

plot1 +
  geom_function(
    fun = function(x) dnorm(x, mean = mu_hat, sd = sqrt(sigma_sq_hat)),
    col = "red"
  )


```

Looks like a somewhat decent fit? We could probably do better, but that's for another time.

### Construct the likelihood and log-likelihood functions

::: notes
it's often computationally more effective to construct the log-likelihood first and then exponentiate it to get the likelihood
:::

```{r}

loglik = function(
    mu, # I'm assigning default values, which the function will use 
    # unless we tell it otherwise
    sigma = sd(x), # note that you can define some defaults based on other arguments
    x = glucose_data, 
    n = length(x)
)
{
  
  normalizing_constants = -n/2 * log((sigma^2) * 2 * pi) 
  
  likelihood_kernel = - 1/(2 * sigma^2) * 
    {
      # I have to do this part in a somewhat complicated way
      # so that we can pass in vectors of possible values of mu
      # and get the likelihood for each value;
      # for the binomial case it's easier
      sum(x^2) - 2 * sum(x) * mu + n * mu^2
    }
  
  answer = normalizing_constants + likelihood_kernel
  
  return(answer)
  
}

# `...` means pass any inputs to lik() along to loglik()
lik = function(...) exp(loglik(...))


```

### Graph the Likelihood as a function of $\mu$

(fixing $\sigma^2$ at $\hat\sigma^2 = `r sigma_sq_hat`$)

```{r}

ggplot() + 
  geom_function(fun = function(x) lik(mu = x, sigma = sigma_sq_hat)) + 
  xlim(mean(glucose_data) + c(-1,1) * sd(glucose_data)) +
  xlab("possible values of mu") +
  ylab("likelihood") + 
  geom_vline(xintercept = mean(glucose_data), col = "red")


```

### Graph the Log-likelihood as a function of $\mu$

(fixing $\sigma^2$ at $\hat\sigma^2 = `r sigma_sq_hat`$)

```{r}

ggplot() + 
  geom_function(fun = function(x) loglik(mu = x, sigma = sigma_sq_hat)) + 
  xlim(mean(glucose_data) + c(-1,1) * sd(glucose_data)) +
  xlab("possible values of mu") +
  ylab('log(likelihood)') + 
  geom_vline(xintercept = mean(glucose_data), col = "red")


```

### Likelihood and log-likelihood for $\sigma$, conditional on $\mu = \hat\mu$:


```{r}


ggplot() + 
  geom_function(fun = function(x) lik(sigma = x, mu = mean(glucose_data))) + 
  xlim(sd(glucose_data) * c(.9,1.1)) + 
  geom_vline(
    xintercept = sd(glucose_data) * sqrt(n.obs - 1)/sqrt(n.obs), 
    col = "red") +
  xlab("possible values for sigma") +
  ylab('Likelihood')

ggplot() + 
  geom_function(
    fun = function(x) loglik(sigma = x, mu = mean(glucose_data))
  ) + 
  xlim(sd(glucose_data) * c(0.9, 1.1)) +
  geom_vline(
    xintercept = 
      sd(glucose_data) * sqrt(n.obs - 1) / sqrt(n.obs), 
    col = "red") +
  xlab("possible values for sigma") +
  ylab("log(likelihood)")


```

### Standard errors by sample size:

```{r}

se.mu.hat = function(n, sigma = sd(glucose_data)) sigma/sqrt(n)
ggplot() + 
  geom_function(fun = se.mu.hat) + 
  scale_x_continuous(trans = "log10", limits = c(10, 10^5), name = "Sample size") +
  ylab("Standard error of mu (mg/dl)") +
  theme_classic()
```

### Simulations

#### Create simulation framework

Here's a function that performs a single simulation of a Gaussian modeling analysis:

```{r}

do_one_sim = function(
    n = 100,
    mu = mean(glucose_data),
    mu0 = mean(glucose_data) * 0.9,
    sigma2 = var(glucose_data), 
    return_data = FALSE # if this is set to true, we will create a list() containing both 
    # the analytic results and the vector of simulated data
)
{
  
  # generate data
  x = rnorm(n = 100, mean = mu, sd = sqrt(sigma2))
  
  # analyze data
  mu_hat = mean(x)
  sigmahat = sd(x)
  se_hat = sigmahat/sqrt(n)
  confint = mu_hat + c(-1, 1) * se_hat * qt(.975, df = n - 1)
  tstat = abs(mu_hat - mu0) / se_hat
  pval = pt(df = n - 1, q = tstat, lower = FALSE) * 2
  confint_covers = between(mu, confint[1], confint[2])
  test_rejects = pval < 0.05
  
  # if you want spaces, hyphens, or characters in your column names, use "", '', or ``:
  to_return = tibble(
    "mu-hat" = mu_hat,
    "sigma-hat" = sigmahat,
    "se_hat" = se_hat,
    "confint_left" = confint[1],
    "confint_right" = confint[2],
    "tstat" = tstat,
    "pval" = pval, 
    "confint covers true mu" = confint_covers,
    "test rejects null hypothesis" = test_rejects
  )
  
  if(return_data)
  {
    return(
      list(
        data = x, 
        results = to_return))
  } else
  {
    return(to_return)
  }
  
}

```

Let's see what this function outputs for us:

```{r}

do_one_sim()

```

Looks good!

Now let's check it against the `t.test()` function from the `stats` package:

```{r}

set.seed(1)
mu = mean(glucose_data)
mu0 = 80
sim.output = do_one_sim(mu0 = mu0, return_data = TRUE)
our_results = 
  sim.output$results |> 
  mutate(source = "`do_one_sim()`")

results_t.test = t.test(sim.output$data, mu = mu0)

results2 = 
  tibble(
    source = "`stats::t.test()`",
    "mu-hat" = results_t.test$estimate,
    "sigma-hat" = results_t.test$stderr*sqrt(length(sim.output$data)),
    "se_hat" = results_t.test$stderr,
    confint_left = results_t.test$conf.int[1],
    confint_right = results_t.test$conf.int[2],
    tstat = results_t.test$statistic,
    pval = results_t.test$p.value,
    "confint covers true mu" = between(mu, confint_left, confint_right),
     `test rejects null hypothesis` = pval < 0.05
  )

comparison = 
  bind_rows(
    our_results,
    results2
  ) |> 
  relocate(
    "source",
    .before = everything()
  )

comparison

```

Looks like we got it right!

#### Run 1000 simulations

Here's a function that calls the previous function `n_sims` times and summarizes the results:

```{r}

do_n_sims = function(
    n_sims = 1000,
    ... # this symbol means "allow additional arguments to be passed on to the `do_sim_once` function
)
{
  
  sim_results = NULL # we're going to create a "tibble" of results, 
  # row by row (slightly different from the hint on the homework)
  
  for (i in 1:n_sims)
  {
    
    set.seed(i)
    
    current_results = 
      do_one_sim(...) |> # here's where the simulation actually gets run
      mutate(
        sim_number = i
      ) |> 
      relocate(sim_number, .before = everything())
    
    sim_results = 
      sim_results |> 
      bind_rows(current_results)
    
  }
  
  return(sim_results)
}

```

```{r}

sim_results = do_n_sims(
  n_sims = 100,
  mu = mean(glucose_data),
  sigma2 = var(glucose_data),  
  n = 100 # this is the number of samples per simulated data set
)

sim_results |> head(10)
```

The simulation results are in! Now we have to analyze them. 

#### Analyze simulation results

To do that, we write another function:

```{r}

summarize_sim = function(
    sim_results, 
    mu = mean(glucose_data),
    sigma2 = var(glucose_data), 
    n = 100)
{
  
  
  # calculate the true standard error based on the data-generating parameters:
  `se(mu-hat)` = sqrt(sigma2/n)
  
  sim_results |> 
    summarize(
      `bias[mu-hat]` = mean(`mu-hat`) - mu,
      `SE(mu-hat)` = sd(`mu-hat`),
      `bias[SE-hat]` = mean(se_hat) - `se(mu-hat)`,
      `SE(SE-hat)` = sd(se_hat),
      coverage = mean(`confint covers true mu`),
      power = mean(`test rejects null hypothesis`)
    )
  
}

```

Let's try it out:

```{r}

sim_summary = summarize_sim(
  sim_results, 
  mu = mean(glucose_data), 
  # this function needs to know the true parameter values in order to assess bias
  sigma2 = var(glucose_data), 
  n = 100)

sim_summary
```

From this simulation, we observe that our estimate of $\mu$, $\hat\mu$, has minimal bias,
and so does our estimate of $SE(\hat\mu)$, $\hat{SE}(\hat\mu)$.

The confidence intervals captured the true value even more often than they were supposed to, and the hypothesis test always rejected the null hypothesis.

I wonder what would happen with a different sample size, a different true $\mu$ value, or a different $\sigma^2$ value...