# Linear regression models
```{r}
#| label: setup
#| file: "_common.R"
#| include: true
#| message: false
```
## Linear models for factorial designs
If the levels of a factor measure a continuous quantity, we may alternatively consider a statistical model that describes the response as a curve or a straight line, rather than only estimating the mean at the levels measured in the experiment. These types of experiments are rather common in engineering.
:::{#exm-sokolova}
## Is additional paper wrapping viewed as more eco-friendly?
@Sokolova.Krishna.Doring:2023 consider consumer bias when assessing how eco-friendly packages are. Items such as cereal are packaged in plastic bags, which are themselves placed inside a box. They conjecture (and find) that, paradoxically, consumers tend to view the packaging as more eco-friendly when the amount of cardboard or paper wrapping is larger, relative to a plastic-only package. We consider in the sequel the data from Study 2A, which measures the perceived environmental friendliness (PEF, variable `pef`) as a function of the `proportion` of paper wrapping (none, half the area of the plastic, equal to it, or twice it).
The linear model we can envision describes `pef` as a linear function of `proportion`, with
\begin{align*}
\mathsf{E}(\texttt{pef} \mid \texttt{proportion}) = \beta_0 + \beta_1 \texttt{proportion}
\end{align*}
and with homoscedastic observations. More general models would include polynomials (up to degree $K-1$ for a factor with $K$ levels).
If we fit the simple linear regression model, we can extract the estimated coefficients and the $p$-values of the $t$-tests for $\beta_0$ and $\beta_1$. The test for the intercept is of no interest since the data are measured on a scale from 1 to 7, so the mean response when `proportion=0` cannot be zero. The coefficient for `proportion` suggests an increase of roughly 0.5 point per unit increase of the paper-to-plastic ratio, and this slope is significantly different from zero, indicating that the `pef` score changes with the ratio.
```{r}
#| eval: true
#| echo: true
data(SKD23_S2A, package = "hecedsm") # load data
linmod <- lm(pef ~ proportion, data = SKD23_S2A)
# fit simple linear regression
coef(linmod) # extract intercept and slope
```
Let $\mu_{0}, \mu_{0.5}, \mu_{1}, \mu_2$ denote the true mean of the PEF score as a function of the proportion of paper. There are several tests that could be of interest here, but we focus on the contrasts performed by the authors and a hypothesis test of linearity as a function of the proportion of paper. For the latter, we could compare the linear regression model (in which the PEF score increases linearly with the proportion of paper to plastic) against the analysis of variance model, which allows each of the four groups to have a different mean.
If we use $\boldsymbol{\alpha} \in \mathbb{R}^4$ to denote the parameter vector of the analysis of variance model using the treatment parametrization and $\boldsymbol{\beta} \in \mathbb{R}^2$ for the simple linear regression model, then we have
\begin{align*}
\mu_0 &= \beta_0=\alpha_0 \\
\mu_{0.5} &= \beta_0 + 0.5 \beta_1 = \alpha_0 + \alpha_1\\
\mu_1 &= \beta_0 + \beta_1 = \alpha_0 + \alpha_2 \\
\mu_2 &= \beta_0 + 2 \beta_1= \alpha_0 + \alpha_3.
\end{align*}
The test comparing the simple linear regression with the analysis of variance imposes two simultaneous restrictions, with $\mathscr{H}_0: \alpha_3 = 2\alpha_2= 4\alpha_1$, so the null distribution is $\mathsf{Fisher}(2, 798)$ or roughly $\chi^2_2$.
```{r}
#| eval: true
#| echo: true
anovamod <- lm(pef ~ factor(proportion), # one-way ANOVA
data = SKD23_S2A)
# Compare simple linear regression with ANOVA
anova(linmod, anovamod) # is the change in PEF linear?
# Equivalent test via linear restrictions on the ANOVA coefficients
# (rows act on the coefficient vector, so they need not sum to zero)
car::linearHypothesis(model = anovamod,
hypothesis = rbind(c(0, -2, 1, 0),
c(0, 0, -2, 1)))
```
We see from the output that the $F$ tests and the $p$-values are identical, whether we impose the constraints manually or simply feed the two nested models to the `anova` method.
The authors were interested in comparing none with other choices: we are interested in pairwise differences, but only relative to the reference $\mu_{0}$:
\begin{align*}
\mu_0 = \mu_{0.5} & \iff 1\mu_0 - 1\mu_{0.5} + 0\mu_{1} + 0 \mu_{2} = 0\\
\mu_0 = \mu_{1} & \iff 1\mu_0 + 0\mu_{0.5} -1\mu_{1} + 0 \mu_{2} = 0\\
\mu_0 = \mu_{2} & \iff 1\mu_0 + 0\mu_{0.5} + 0\mu_{1} -1 \mu_{2} = 0
\end{align*}
so the contrast vectors $(1, -1, 0, 0)$, $(1, 0, -1, 0)$ and $(1, 0, 0, -1)$ for the marginal means allow one to test these hypotheses.
```{r}
#| eval: true
#| echo: true
margmean <- anovamod |>
emmeans::emmeans(specs = "proportion") # group means
contrastlist <- list( # specify contrast vectors
refvshalf = c(1, -1, 0, 0),
refvsone = c(1, 0, -1, 0),
refvstwo = c(1, 0, 0, -1))
# compute contrasts relative to reference
margmean |> emmeans::contrast(method = contrastlist)
```
```{r}
#| label: SKD23S2A-save
#| eval: true
#| echo: false
data(SKD23_S2A, package = "hecedsm") # load data
linmod <- lm(pef ~ proportion, data = SKD23_S2A) # fit simple linear regression
anovamod <- lm(pef ~ factor(proportion), data = SKD23_S2A) # one-way ANOVA
# anova(linmod, anovamod) # compare models (two restrictions)
# group means
margmean <- anovamod |> emmeans::emmeans(specs = "proportion")
contrastlist <- list( # specify contrast vectors
refvshalf = c(1, -1, 0, 0),
refvsone = c(1, 0, -1, 0),
refvstwo = c(1, 0, 0, -1))
contrast <- margmean |> emmeans::contrast(
method = contrastlist)
```
The group averages reported in @tbl-print-groupmeans-PEF match those reported by the authors in the paper. They suggest an increase in perceived environmental friendliness as the amount of paper used in the wrapping increases. We can fit a simple regression model to assess the average change, treating the proportion as a continuous explanatory variable. The estimated slope for the change in PEF score, which ranges from 1 to 7 in increments of 0.25, is `r round(coef(linmod)[2],2)` point per unit of the paper-to-plastic ratio. There is however strong evidence, given the data, that the change isn't quite linear, as the fit of the simple linear regression model is significantly worse than that of the analysis of variance model.
```{r}
#| label: tbl-print-groupmeans-PEF
#| echo: false
#| eval: true
#| tbl-cap: "Estimated group averages of PEF per proportion with standard errors"
knitr::kable(margmean,
digits = c(2,2,3,0,2,4),
booktabs = TRUE,
col.names = c("proportion",
"marg. mean",
"std. err.",
"dof",
"lower (CI)",
"upper (CI)")) |>
kableExtra::kable_styling()
```
```{r}
#| label: tbl-print-contrast-PEF
#| echo: false
#| eval: true
#| tbl-cap: "Estimated contrasts for differences of PEF to no paper."
knitr::kable(contrast,
booktabs = TRUE,
digits = c(2,2,2,0,2,2),
col.names = c("contrast",
"estimate",
"std. err.",
"dof",
"stat",
"p-value")) |>
kableExtra::kable_styling()
```
All differences reported in @tbl-print-contrast-PEF are significant and positive, in line with the researchers' hypothesis.
:::
## Analysis of covariance
The previous chapter dealt with factorial experiments in which all experimental factors are of interest. It is also possible to use measurements concomitant to the data collection (for example, the score on a test taken before group assignment for the manipulation) to get a measure of the relative ability of participants. The more correlated these measures are with the response, the more variability in the data we can explain. We then proceed with the random assignment of our experimental units to the different conditions.
Including covariates should in principle increase power and our ability to detect real differences due to experimental manipulations, provided the variables used as controls are correlated with the response. Generally, covariates are not needed for valid inference, which is guaranteed by randomization, and they should not be used to assign treatment. Such designs are meant to reduce the residual error.
We can include continuous covariates in the analysis of variance; their slopes govern the relationship with the response. Their inclusion is not strictly necessary to draw valid causal conclusions, but adding these terms again helps reduce the residual variability. Such a design was historically called **analysis of covariance**, although, like analysis of variance models, these are nothing but linear regression models. The ANCOVA model assumes that the slopes are parallel: if there is rather an interaction, then the experimental manipulation induces changes that depend on the value of the continuous explanatory variable. This case is termed **moderation** in the management literature.
In an analysis of covariance, we include a linear component for a (continuous) covariate, with the purpose again of reducing the residual error and increasing power. A prime example is prior/post experiment measurements, whereby we monitor the change in outcome due to the manipulation. This post by Solomon Kurz [[link]](https://solomonkurz.netlify.app/blog/2023-04-12-boost-your-power-with-baseline-covariates/) nicely illustrates the added benefits of using covariates when they are strongly correlated with the response.
In such settings, it may seem logical to take the difference between the post and prior scores as the response: this is showcased in @exm-vanStek and in @Baumann:1992, an analysis of which is presented on [the course website](https://edsm.rbind.io/example/06-ancova/).
When we add a covariate, the latter needs to have a strong linear correlation with the response for the inclusion to make sense. We can assess graphically whether the relationship is linear, and whether the slopes for each experimental condition are the same.^[If not, this implies that the covariate interacts with the experimental condition.]
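A minimal sketch of such checks, using simulated data (all object names below are illustrative): we plot the response against the covariate with a separate least-squares line per group, then compare the parallel-slope model with the model including an interaction.

```{r}
#| eval: false
#| echo: true
# Sketch with simulated data: graphical and formal checks of
# linearity and of equality of slopes across conditions
set.seed(42)
toy <- data.frame(
  group = factor(rep(c("control", "treatment"), each = 30)),
  covariate = runif(n = 60, min = 0, max = 10))
toy$response <- 1 + 0.5 * toy$covariate +
  2 * (toy$group == "treatment") + rnorm(n = 60)
library(ggplot2)
ggplot(toy, aes(x = covariate, y = response, color = group)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
# Compare parallel slopes (ANCOVA) with group-specific slopes (interaction)
anova(lm(response ~ covariate + group, data = toy),
      lm(response ~ covariate * group, data = toy))
```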
```{r}
#| label: fig-ancovadifftrend
#| echo: false
#| eval: true
#| cache: true
#| fig-cap: "Simulated data from two groups with an analysis of covariance model. "
set.seed(123)
ng <- 24
x1 <- rgamma(n = ng, shape = 10, scale = 5)
x2 <- rgamma(n = ng, shape = 7, scale = 3)
dat <- data.frame(
covariate = c(x1, x2),
response = c(2 + 0.2*x1 + rnorm(ng, sd = 4), 5 + 0.4*x2 + rnorm(ng, sd = 4)),
group = factor(rep(c("treatment", "control"), each = ng)))
g1 <- ggplot(data = dat, aes(x = covariate, y = response, group = group, color = group)) +
geom_point() +
geom_rug(sides = "b") +
geom_abline(data = data.frame(group = factor(c("treatment",
"control")),
intercept = c(2,5),
slope = c(0.2,0.4)),
mapping = aes(intercept = intercept,
slope = slope,
color = group)) +
theme_classic()
set.seed(123)
ng <- 24
x1 <- rgamma(n = ng, shape = 10, scale = 5)
x2 <- rgamma(n = ng, shape = 10, scale = 5)
dat <- data.frame(
covariate = c(x1, x2),
response = c(2 + 0.2*x1 + rnorm(ng, sd = 4), 5 + 0.2*x2 + rnorm(ng, sd = 4)),
group = factor(rep(c("treatment", "control"),
each = ng)))
g2 <- ggplot(data = dat,
aes(x = covariate,
y = response,
group = group,
color = group)) +
geom_point() +
geom_rug(sides = "b") +
geom_abline(data = data.frame(group = factor(c("treatment",
"control")),
intercept = c(2,5),
slope = c(0.2,0.2)),
mapping = aes(intercept = intercept,
slope = slope,
color = group)) +
theme_classic()
g2 + g1 +
plot_layout(guides = 'collect') &
theme(legend.position = "bottom")
```
The left panel of @fig-ancovadifftrend shows the ideal situation for an analysis of covariance: the relationship between response and covariate is linear with strong correlation, the slopes are the same in both groups, and the supports overlap. Since the slopes are the same, the difference in averages (the vertical distance between the two lines) is constant at any level of the covariate, so a single number summarizes the group difference. By contrast, the right-hand panel of @fig-ancovadifftrend shows an interaction between the covariate and the experimental groups, i.e., different slopes: there, the effect of the experimental condition increases with the level of the covariate. One may also note that the lack of overlap in the support (the set of values taken by the covariate) for the two experimental conditions makes comparison hazardous at best in the right-hand panel.
@fig-ancovaresid shows that, due to the strong correlation, the variability of the measurements is smaller on the right-hand panel (corresponding to the analysis of covariance model) than for the centred response on the left-hand panel; note that the $y$-axes have different scales.
```{r}
#| label: fig-ancovaresid
#| echo: false
#| eval: true
#| cache: true
#| fig-cap: "Response after subtracting mean (left) and after detrending (right)."
g3 <- ggplot(data = dat,
aes(color = group,
x = group,
y = response - mean(response))) +
geom_jitter() +
scale_y_continuous(limits = c(-10,10), breaks = seq(-10, 10, by = 5)) +
labs(y = "centered response",
subtitle = "ANOVA")+
theme_classic()
dat$resid <- resid(lm(response ~ covariate,
data = dat))
g4 <- ggplot(data = dat,
aes(color = group,
x = group,
y = resid)) +
geom_jitter() +
scale_y_continuous(limits = c(-10,10), breaks = seq(-10, 10, by = 5)) +
labs(y = "residuals (detrended observations)",
subtitle = "ANCOVA") +
theme_classic()
g3 + g4 +
plot_layout(guides = 'collect') &
theme(legend.position = "bottom")
```
We present two examples of analysis of covariance, showing how the inclusion of covariates helps disentangle differences between experimental conditions.
::: {#exm-leechoi}
## Inconsistency of product description and image in online retailing
@Lee.Choi:2019 measured the impact of discrepancies between the description and the visual depiction of items in online retail. They performed an experiment in which participants were presented with the description and an image of a product (a set of six toothbrushes) that were either consistent or inconsistent with one another. The authors postulated that a discrepancy could lead to lower appreciation scores, measured using three Likert scales. They also suspected that familiarity with the product brand would impact ratings, and controlled for the latter using another question.
One way to account for familiarity when comparing the mean is to use a linear regression with familiarity as another explanatory variable. The expected value of the product evaluation is
$$
\mathsf{E}(\texttt{prodeval}) = \beta_0 + \beta_1 \texttt{familiarity} + \beta_2 \texttt{consistency},
$$ {#eq-vS}
where $\texttt{familiarity}$ is the score from 1 to 7 and $\texttt{consistency}$ is a binary indicator equal to one if the output is inconsistent and zero otherwise.
The coefficient $\beta_2$ thus measures the difference between product evaluation rating for consistent vs inconsistent displays, for the same familiarity score.
```{r}
#| label: LC19-linreg
#| eval: true
#| echo: false
# Load data
data(LC19_S1, package = "hecedsm")
# Fit a linear regression
ancova <- lm(prodeval ~ familiarity + consistency,
data = LC19_S1)
# Output coefficients and t-tests
#summary(ancova)
# Anova table
#car::Anova(ancova, type = 3)
```
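A minimal sketch of the corresponding fit, using the `LC19_S1` data from the `hecedsm` package:

```{r}
#| eval: false
#| echo: true
data(LC19_S1, package = "hecedsm")
# Analysis of covariance: a slope for familiarity plus the consistency effect
ancova <- lm(prodeval ~ familiarity + consistency, data = LC19_S1)
summary(ancova)$coefficients # coefficients, standard errors and t-tests
car::Anova(ancova, type = 3) # analysis of variance table
```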
We can look at coefficient (standard error) estimates $\widehat{\beta}_2 = `r round(coef(ancova)[3], 2)` (`r round( as.numeric(sqrt(diag(vcov(ancova)))[3]),3)`)$.
No difference between groups would mean $\beta_2=0$ and we can build a test statistic by looking at the standardized regression coefficient $t = \widehat{\beta}_2/\mathsf{se}(\widehat{\beta}_2)$. The result output is `r papaja::apa_print(ancova)$full_result$consistencyinconsistent`. We reject the null hypothesis of equal product evaluation for both displays at level 5%: there is evidence of a small difference, with people giving on average a score that is 0.64 points smaller (on a scale of 1 to 9) when presented with conflicting descriptions and images.
We can compare the analysis of variance tables obtained by fitting the model with and without $\texttt{familiarity}$. @tbl-anovatabLC19S1 shows that, without the covariate, the effect of consistency is small and not significant; a two-sample _t_-test also shows no evidence of a difference in average familiarity score between the two experimental conditions ($p$-value of $`r papaja::apa_p(broom::tidy(anova(aov(familiarity ~ consistency, data = LC19_S1)))$p.value)[1]`$). However, familiarity with the brand explains roughly one fifth of the residual variability (see the sum of squares in @tbl-anovatabLC19S1): filtering it out leads to a higher signal-to-noise ratio for the impact of consistency, at the expense of the loss of one degree of freedom. Thus, it appears that the manipulation was successful.
```{r}
#| label: tbl-anovatabLC19S1
#| echo: false
#| eval: true
#| tbl-cap: "Analysis of variance tables"
data(LC19_S1, package = "hecedsm")
anovaMod <- lm(prodeval ~ consistency,
data = LC19_S1)
ancovaMod <- lm(prodeval ~ familiarity + consistency,
data = LC19_S1)
options(knitr.kable.NA = '')
tab_anova <- broom::tidy(
car::Anova(anovaMod, type = 3)[-1,])
tab_anova$p.value <- papaja::apa_p(tab_anova$p.value)
tab_ancova <- broom::tidy(
car::Anova(ancovaMod, type = 3)[-1,])
tab_ancova$p.value <- papaja::apa_p(tab_ancova$p.value)
# Output coefficients and t-tests
knitr::kable(tab_anova,
col.names = c("term","sum. sq.", "df","stat","p-value"),
digits = c(0,2,1,2,3,3), booktabs = TRUE,
caption = "model without familiarity")
knitr::kable(tab_ancova,
col.names = c("term","sum. sq.", "df","stat","p-value"),
digits = c(0,2,1,2,3,3), booktabs = TRUE,
caption = "model with familiarity")
```
```{r}
#| label: fig-ANCOVA-demo
#| fig-cap: "Scatterplot of product evaluation as a function of the familiarity score, split by experimental manipulation."
#| eval: true
#| echo: false
#| out-width: '70%'
#| fig-width: 8
#| fig-height: 6
#| message: false
#| warning: false
set.seed(1234)
data(LC19_S1, package = "hecedsm")
ggplot(data = LC19_S1,
mapping = aes(y = prodeval,
x = familiarity,
col = consistency)) +
geom_point(position = "jitter") +
geom_smooth(method = "lm",
formula = y ~ x,
show.legend = FALSE,
fullrange = TRUE,
se = FALSE) +
scale_y_continuous(limits = c(1,9),
breaks = c(3L, 6L, 9L)) +
scale_x_continuous(limits = c(3,7),
breaks = 1:7) +
theme_classic() +
theme(legend.position = "bottom") +
labs(y = "product evaluation",
col = "display")
```
@fig-ANCOVA-demo shows that people more familiar with the product or brand tend to have a more positive product evaluation, as postulated by the authors. The graph also shows two straight lines corresponding to the fit of a linear model with a different intercept and slope for each display group: there is little material difference, but one needs to assess formally whether a single linear relationship such as the one postulated in @eq-vS can adequately characterize the relation in both groups.
To this effect, we fit a linear model with different slopes in each group, and compare the fit of the latter with the analysis of covariance model that includes a single slope for both groups: we can then test if the slopes are the same, or alternatively if the difference between the slopes is zero. The _t_-statistic indicates no difference in slope ($p$-value of $`r papaja::apa_p(broom::tidy(anova(lm(prodeval ~ familiarity + consistency, data = LC19_S1), lm(prodeval ~ familiarity*consistency, data = LC19_S1)))$p.value)[2]`$), thus the assumption is reasonable. Levene's test for homogeneity of variance indicates no discernible difference between groups. Thus, it appears there is a difference in perception of product quality due to the manipulation.
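A sketch of these two checks with the same data: the equality of slopes is assessed by comparing the nested models, and the homogeneity of variance with Levene's test on the residuals.

```{r}
#| eval: false
#| echo: true
data(LC19_S1, package = "hecedsm")
mod_ancova <- lm(prodeval ~ familiarity + consistency, data = LC19_S1)
mod_inter <- lm(prodeval ~ familiarity * consistency, data = LC19_S1)
# Test of equality of slopes (difference between slopes equal to zero)
anova(mod_ancova, mod_inter)
# Levene's test for homogeneity of variance across display groups
car::leveneTest(resid(mod_ancova) ~ consistency, data = LC19_S1)
```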
:::
::: {#exm-vanStek}
## Effect of scientific consensus on false beliefs
We consider Study 3 of @vanStekelenburg:2021, who studied changes in the perception of people holding false beliefs or denying (to some extent) the scientific consensus, by presenting them with news articles showcasing information about various phenomena. The experimental manipulation consisted of boosting, a form of training to help readers identify and establish whether scientists were truly experts in the domain of interest, how strong the consensus was, etc.^[The article is interesting because a lack of planning led the authors to adapt the design from Experiment 1 to Experiment 3 until they found something. Without preregistration, it is unlikely such findings would have been publishable.]
The third and final experiment of the paper focused on genetically modified organisms: it is a replication of Study 2, but with a control group (since there were no detectable differences between the experimental conditions `Boost` and `BoostPlus`) and a larger sample size (because Study 2 was underpowered).
The data include 854 observations with `prior`, the negative of the participant's prior belief score, and `post`, the post-experiment score for the veracity of the claim. Both were measured using a visual scale ranging from -100 (I am 100% certain this is false) to 100 (I am 100% certain this is true), with 0 (I don't know) in the middle. Only people with negative prior beliefs were recruited to the study. The three experimental conditions were `BoostPlus`, `consensus` and a `control` group. Note that the scores in the data have been negated, meaning that negative posterior scores indicate agreement with the consensus on GMO.
```{r}
#| label: SSVB21S3-fit
#| eval: true
#| echo: false
library(emmeans)
library(nlme)
data(SSVB21_S3, package = "hecedsm")
# ANOVA model
SSVBmod0 <- lm(post ~ condition,
data = SSVB21_S3)
# ANCOVA model (with prior as covariate)
SSVBmod1 <- lm(post ~ prior + condition,
data = SSVB21_S3)
# Check equality of slope (ok)
SSVBmod2 <- lm(post ~ prior*condition,
data = SSVB21_S3)
#anova(SSVBmod1, SSVBmod2)
# Check randomization (ok)
SSVBmod3 <- lm(prior ~ condition,
data = SSVB21_S3)
# Check equal variance per group (nope)
levene <-
car::leveneTest(resid(SSVBmod1) ~ condition,
data = SSVB21_S3)
m1 <- nlme::gls(
post ~ prior + condition,
weights = varIdent(form = ~1 | condition),
data = SSVB21_S3,
method = "ML")
m2 <- nlme::gls(
post ~ prior + condition,
data = SSVB21_S3,
method = "ML")
# anova(m1, m2)
# Clear heteroscedasticity...
# Fit a model with different variances
library(nlme)
SSVBmod4 <- nlme::gls(
post ~ prior + condition,
weights = varIdent(form = ~1 | condition),
data = SSVB21_S3)
# Compare model with unequal variance with
# model with equal variance (bis),
# this time with likelihood ratio test
SSVBmod5 <- gls(
post ~ condition,
weights = varIdent(form = ~ 1 |condition),
data = SSVB21_S3)
# Contrasts
emm4 <- emmeans(SSVBmod4, specs = "condition")
emm5 <- emmeans(SSVBmod5, specs = "condition")
# Note order: Boost, consensus, control
# Not comparable:
# emm4 uses detrended data,
# emm5 uses group averages
#
# Compute planned contrasts
contrast_list <- list(
"consensus vs control" = c(0, 1, -1),
"consensus vs BoostPlus" = c(-1, 1, 0),
"BoostPlus vs control " = c(1, 0, -1))
c4 <- contrast(emm4,
method = contrast_list,
adjust = "holm")
# Without controlling for prior beliefs
c5 <- contrast(emm5,
method = contrast_list,
adjust = "holm")
```
Preliminary checks suggest that, although the slopes for prior beliefs could plausibly be the same in each group and the data are properly randomized, there is evidence of unequal variance for the changes in score. As such, we fit a model with mean
\begin{align*}
\mathsf{E}(\texttt{post}) &= \begin{cases}
\beta_0 + \beta_1 \texttt{prior} + \alpha_1 & \texttt{condition} = \texttt{BoostPlus}\\
\beta_0 + \beta_1 \texttt{prior} + \alpha_2 &\texttt{condition} = \texttt{consensus}\\
\beta_0 + \beta_1 \texttt{prior} + \alpha_3 &\texttt{condition} = \texttt{control}
\end{cases}
\end{align*}
with $\alpha_1 + \alpha_2 + \alpha_3=0$, using the sum-to-zero parametrization, and with different variance for each experimental condition,
\begin{align*}
\mathsf{Va}(\texttt{post}) = \begin{cases}
\sigma^2_1, & \texttt{condition} = \texttt{BoostPlus},\\
\sigma^2_2, & \texttt{condition} = \texttt{consensus},\\
\sigma^2_3, & \texttt{condition} = \texttt{control}.
\end{cases}
\end{align*}
Because of the unequal variances, we cannot use multiple testing procedures reserved for analysis of variance and resort instead to Holm--Bonferroni to control the familywise error rate. We here look only at pairwise differences between conditions.^[In Study 2, the interest was comparing manipulation vs control and the Boost vs BoostPlus conditions, two orthogonal contrasts.]
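A sketch of this specification, fitted by generalized least squares with a separate variance per condition, together with the Holm-adjusted pairwise contrasts (using the `SSVB21_S3` data from `hecedsm`):

```{r}
#| eval: false
#| echo: true
data(SSVB21_S3, package = "hecedsm")
# Linear model with prior as covariate and a different variance per condition
gls_mod <- nlme::gls(
  post ~ prior + condition,
  weights = nlme::varIdent(form = ~ 1 | condition),
  data = SSVB21_S3)
# Pairwise differences between conditions, Holm-Bonferroni adjustment
# (order of levels: BoostPlus, consensus, control)
emm <- emmeans::emmeans(gls_mod, specs = "condition")
contrast_list <- list(
  "consensus vs control"   = c(0, 1, -1),
  "consensus vs BoostPlus" = c(-1, 1, 0),
  "BoostPlus vs control"   = c(1, 0, -1))
emmeans::contrast(emm, method = contrast_list, adjust = "holm")
```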
```{r}
#| label: tbl-anovatabSSVB
#| echo: false
#| eval: true
#| tbl-cap: "Analysis of variance tables"
options(knitr.kable.NA = '')
tab_anova <- broom::tidy(
car::Anova(SSVBmod5, type = 3)[-1,])
tab_anova$p.value <- papaja::apa_p(tab_anova$p.value)
tab_ancova <- broom::tidy(
car::Anova(SSVBmod4, type = 3)[-1,])
tab_ancova$p.value <- papaja::apa_p(tab_ancova$p.value)
# Output coefficients and t-tests
knitr::kable(tab_anova,
booktabs = TRUE,
caption = "ANOVA model (without prior belief)",
col.names = c("term","df","stat","p-value"),
digits = c(0,2,1,2))
knitr::kable(tab_ancova,
booktabs = TRUE,
caption = "ANCOVA model (with prior belief)",
col.names = c("term","df","stat","p-value"),
digits = c(0,2,1,2))
```
Repeating the exercise of comparing the amount of evidence with and without inclusion of the covariate shows that the value of the test statistic is larger (@tbl-anovatabSSVB), indicative of stronger evidence with the analysis of covariance model, although the conclusion would be unaffected with such large sample sizes. We of course care very little for the global $F$-test of equality of means, as the previous study had shown large differences. What is more interesting here is quantifying the change between conditions.
```{r}
#| label: tbl-contraststabSSVB
#| echo: false
#| tbl-cap: "Pairwise contrasts with _p_-values adjusted using Holm--Bonferroni"
#| eval: true
options(knitr.kable.NA = '')
tab_anova <- broom::tidy(c5) |> dplyr::select(!c("term","null.value"))
tab_anova$p.value <- papaja::apa_p(tab_anova$p.value)
tab_ancova <- broom::tidy(c4) |> dplyr::select(!c("term","null.value"))
tab_ancova$p.value <- papaja::apa_p(tab_ancova$p.value)
# Output coefficients and t-tests
knitr::kable(tab_anova,
booktabs = TRUE,
digits = c(0,2,1,2,3,3),
caption = "ANOVA model (without prior belief score).")
knitr::kable(tab_ancova,
booktabs = TRUE,
digits = c(0,2,1,2,3,3),
caption = "ANCOVA model (with prior belief score).")
```
@tbl-contraststabSSVB shows the pairwise contrasts, which measure two different things: the analysis of variance model compares the group averages, whereas the analysis of covariance (the linear model with `prior`) uses detrended values and focuses on the change in perception. Because the data are unbalanced and we estimate the group means and variances separately, the degrees of freedom change from one pairwise comparison to the next. Again, using the covariate `prior`, which is quite strongly correlated with `post` as seen in @fig-vanStekS3, helps decrease background noise.
```{r}
#| label: tbl-vanStekS3
#| eval: true
#| echo: false
#| message: false
#| warning: false
#| tbl-cap: "Summary statistics of belief as a function of time of measurement and experimental condition."
# Transform data so that all observations
# measurements are in a single column
# post/prior become labels of a categorical variable
SSVB21_S3_long <- SSVB21_S3 |>
tidyr::pivot_longer(cols = c("prior","post"),
names_to = "time",
values_to = "belief") |>
dplyr::mutate(time = relevel(factor(time),
ref = "prior"))
# Compute summary statistics by group
# average and its standard error = std.dev / sqrt(n)
SSVB21_S3_summary <- SSVB21_S3_long |>
dplyr::group_by(time, condition) |>
dplyr::summarize(mean = mean(belief),
se = sd(belief)/sqrt(dplyr::n()))
knitr::kable(SSVB21_S3_summary,
digits = 2,
booktabs = TRUE,
linesep = "")
```
```{r}
#| label: fig-vanStekS3
#| eval: true
#| echo: false
#| fig-cap: "Difference between prior and post experiment beliefs on genetically engineered food."
# Create a plot of data by timing
# jitter observations to avoid overlay
ggplot(data = SSVB21_S3_summary,
# Aesthetics common to all
aes(x = time,
group = condition)) +
geom_point(data = SSVB21_S3_long,
mapping = aes(x = time,
y = belief,
col = condition,
group = condition),
position = position_jitterdodge(jitter.width = 0.25)) +
# Add plus or one standard error for the mean
geom_linerange(data = SSVB21_S3_summary,
aes(ymin = mean - se,
ymax = mean + se),
position = position_dodge(width = 0.75)) +
# Add a line for each group mean
geom_errorbar(data = SSVB21_S3_summary,
aes(x = time,
ymax = mean,
ymin = mean,
color = condition),
position = position_dodge(width = 0.75)) +
# Add lines for each group (prior to post)
geom_line(data = SSVB21_S3_summary,
mapping =
aes(x = time,
group = condition,
color = condition,
y = mean),
position = position_dodge(0.75)
) +
theme_classic() +
theme(legend.position = "bottom")
```
:::
:::{.callout-caution}
@vanStekelenburg:2021 split their data to do pairwise comparisons two groups at a time (thus taking roughly two-thirds of the data to perform a two-sample _t_-test for each pair). Although it does not impact their conclusions, this approach is conceptually incorrect: if the variances were equal, we would want to use all observations to estimate them (so their approach would be suboptimal, since we would estimate the variance three times with smaller samples).
Conversely, using a model that assumes equal variance when this is not the case leads to distortion: the variance we estimate will be some sort of average of the variabilities $\sigma_i$ and $\sigma_j$ in experimental conditions $i$ and $j$. With large samples, this may be inconsequential, but it illustrates the caveats of subsample analyses.
:::
:::{.callout-caution}
@fig-vanStekS3f1 shows the relationship between prior and posterior scores. The data show clear differences between individuals: many start out completely disbelieving of genetically engineered food and change their minds (sometimes drastically), many do not change their opinion at all and have similar scores, and many give a posterior score of zero. This heterogeneity in the data illustrates the danger of only looking at summary statistics and comparing averages: they do not tell the whole picture. One could investigate whether the strength of religious or political beliefs, or how much participants trust scientists, could explain some of the observed differences.
:::
```{r}
#| label: fig-vanStekS3f1
#| echo: false
#| eval: true
#| out-width: '80%'
#| fig-cap: "Scatterplot of negated prior and posterior belief score."
# Scatterplot of data
ggplot(data = SSVB21_S3,
aes(x = prior,
y = post)) +
geom_point(aes(col = condition)) +
# geom_smooth(method = "lm",
# se = FALSE,
# col = "black") +
theme_classic() +
theme(legend.position = "bottom")
```
## Moderation and interactions
In a randomized experiment, we can assess the average effect of a manipulation by comparing groups: assuming random sampling, these conclusions can be broadly generalized to the population of interest from which the sample is drawn. However, it may be that the effect of the treatment depends on other variables: cultural differences, gender or education may alter the effect of the manipulation.
The causal effect on $Y$ of the experimental manipulation, say $X$, may be a misleading summary if another variable modifies the relation $X \to Y$: for example, the perception of gender discrimination or racism may depend on a person's background and experience, and this may impact the effect of the manipulation. Such variables, say $W$, thus have an interactive effect with the experimental factor $X$, and are termed moderators in psychology. In the statistical model, the inclusion of interaction terms between $X$ and $W$ (typically via the product of the so-called moderator variable with the factor encoding the experimental sub-condition) allows us to estimate those differences.
In an analysis of covariance, covariates that have an impact on the outcome are included to filter out variability, under the assumption that they do not influence the effect of the treatment (no interaction). With moderators, we rather include and test for the interactions. If we have an experimental factor $X$ which is binary or categorical, the resulting model is a simple analysis of variance (or analysis of covariance) model, and we can test the significance of the interaction term to assess the moderating effect of $W$.
If $W$ is a mean-centered continuous variable and $X$ a categorical variable with $k=1, \ldots, K$ levels using the sum-to-zero parametrization, the linear model with mean
$$
\underset{\text{average response at $W$ in group $k$}}{\mathsf{E}\{Y \mid \mathsf{do}(X) = k, W\}} = \underset{\text{intercept for group $k$}}{\mu + \alpha_k} + \underset{\text{slope for group $k$}}{(\beta + \gamma_k)}W
$$
includes different slopes for $W$ in each experimental group, as well as different intercepts for each group.
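A minimal sketch of this parametrization using simulated data (all names below are illustrative): we mean-center $W$, use sum-to-zero contrasts for $X$, and test the interaction to assess moderation.

```{r}
#| eval: false
#| echo: true
# Sketch with simulated data: K = 3 groups and a continuous moderator W
set.seed(2023)
n <- 90
sim <- data.frame(
  X = factor(rep(c("a", "b", "c"), each = n / 3)),
  W = rnorm(n, mean = 4, sd = 1))
sim$Y <- 1 + 0.5 * sim$W + (sim$X == "b") * (1 - 0.3 * sim$W) + rnorm(n)
sim$W <- sim$W - mean(sim$W) # mean-center the moderator
# Sum-to-zero parametrization for the factor; the interaction
# gives a different slope for W in each group
mod_sim <- lm(Y ~ X * W, data = sim, contrasts = list(X = "contr.sum"))
coef(mod_sim) # mu, alpha_1, alpha_2, beta, gamma_1, gamma_2
car::Anova(mod_sim, type = 3) # test of the interaction (moderation)
```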
:::{#exm-moderation1}
## Gender discrimination
We consider data from a study on gender discrimination [@Garcia:2010]. Participants were presented with a file in which a woman was turned down for a promotion in favour of a male colleague, despite being clearly more experienced and qualified. The authors manipulated the reaction of the fictional candidate to this decision: either not challenging the decision (no protest), a request to reconsider based on the individual qualities of the applicants (individual protest), or a request to reconsider based on the abilities of women in general (collective protest). All items were measured using Likert scales ranging from strongly disagree (1) to strongly agree (7).
The postulated moderator variable is `sexism`, the average of six Likert scales from the Modern Sexism Scale assessing the pervasiveness of gender discrimination. We consider participants' evaluation of the appropriateness of the response of the fictional character.
We fit the linear model with the interaction and display the estimated slopes.
```{r}
#| eval: false
#| echo: true
data(GSBE10, package = "hecedsm")
lin_moder <- lm(respeval ~ protest*sexism,
data = GSBE10)
summary(lin_moder) # coefficients
car::Anova(lin_moder, type = 3) # tests
```
```{r}
#| label: tbl-testsmoder
#| echo: false
#| eval: true
#| message: false
#| warning: false
#| tbl-cap: "Analysis of variance table for the linear moderation model of the @Garcia:2010 data."
data(GSBE10, package = "hecedsm")
linmoder <- lm(respeval ~ protest*sexism,
data = GSBE10)
# summary(linmoder) # coefficients
options(knitr.kable.NA = '')
tab_anova <- broom::tidy(
car::Anova(linmoder, type = 3)[-1,])
tab_anova$p.value <- papaja::apa_p(tab_anova$p.value)
knitr::kable(tab_anova,
booktabs = TRUE,
col.names = c("term", "sum of squares","df","stat","p-value"),
digits = c(0,2,1,2,3,3))
```
```{r}
#| eval: true
#| echo: false
#| warning: false
#| message: false
#| out-width: '80%'
ggplot(data = GSBE10,
aes(x = sexism,
y = respeval,
color = protest,
group = protest)) +
geom_point() +
geom_smooth(se = FALSE, method = "lm", formula = y ~ x) +
labs(subtitle = "evaluation of response",
y = "",
color = "experimental condition") +
theme_classic() +
theme(legend.position = "bottom")
```
Because of the interaction, comparing the levels of the experimental factor only makes sense for a fixed value of sexism (since the slopes are not parallel), and the comparison won't necessarily be reliable outside the range of observed values of sexism. We could look at differences between conditions at selected quantiles, at the mean of sexism,^[This is the default with `emmeans`.] or at one standard deviation above and below the mean.
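A sketch of such comparisons with `emmeans`, fixing `sexism` at its mean and at one standard deviation below and above it (variable names as in the chunk above):

```{r}
#| eval: false
#| echo: true
data(GSBE10, package = "hecedsm")
lin_moder <- lm(respeval ~ protest * sexism, data = GSBE10)
# Compare experimental conditions at fixed values of the moderator
sexism_vals <- mean(GSBE10$sexism) + c(-1, 0, 1) * sd(GSBE10$sexism)
emmeans::emmeans(lin_moder,
                 specs = "protest",
                 by = "sexism",
                 at = list(sexism = sexism_vals)) |>
  emmeans::contrast(method = "pairwise", adjust = "tukey")
```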
:::
If the treatment is binary, we may be interested in the range of values of the moderator $W$ for which the difference between treatments is not statistically significant. The Johnson--Neyman method [@Johnson.Neyman:1936] considers this range, but it leads to multiple testing problems since we probe the model repeatedly; @Esarey.Sumner:2018 offer a method that controls the false discovery rate.
To illustrate the method, we dichotomize the manipulation by pooling the individual and collective protests, since these are the most similar conditions.
```{r}
#| eval: true
#| echo: true
#| fig-cap: "Johnson--Neyman plot for difference between protest and no protest as a function of sexism."
#| label: fig-jn
library(interactions)
db <- GSBE10 |>
dplyr::mutate(
protest = as.integer(protest != "no protest"))
lin_moder2 <- lm(respeval ~ protest*sexism, data = db)
jn <- johnson_neyman(
model = lin_moder2, # linear model
pred = protest, # binary experimental factor
modx = sexism, # moderator
control.fdr = TRUE,
mod.range = range(db$sexism))
jn$plot
```
The cutoff value is 4.20 with control for the false discovery rate and 4.15 without. The interval is not extended beyond the range of values for sexism, as values outside it are not possible given the Likert scale (which starts at 1). In this example, the moderator is not experimentally manipulated, but it could be. More complicated mediation models could include interactions between treatment effects or moderators and covariates, or with external variables, leading to moderated mediation. Interactions can be considered for pretty much any statistical model, but the usual assumptions need to hold for inference to be approximately valid.
:::{#exm-moderation2}
## Ghosting
@Leckfor:2023 consider the impact of ghosting someone (i.e., not responding to communications and cutting bridges with a person) on personal satisfaction, uncertainty and the need for closure (explanations). In their Study 3, they postulated that the type of rejection someone experienced would moderate the relationship between the need for closure and need satisfaction. The study comprises 545 participants who were asked to reflect on a situation in their own life, in scenarios where
> Participants in the included (control) condition were asked to write about a time when someone “expressed that they wanted to continue and/or maintain a friendship, romantic relationship, or casual dating situation.” Participants in the ghosted condition wrote about a time when someone ended a relationship by “suddenly cutting off all communication without explanation.” Participants in the directly rejected condition wrote about a time when someone “directly communicate[d] that they wanted to end the relationship.” Nearly half (49%) of participants wrote about a friendship, followed by a romantic relationship (29%), and then a casual dating situation (22%).
The moderator variable is `cond`, one of included (control), ghosted or directly rejected. It should moderate the effect of the need for closure on need satisfaction. The authors postulated that individuals would have lower need satisfaction after being ghosted.
```{r}
#| label: moderation2
#| eval: false
#| echo: true
# Moderation analysis
# This time, the factor of interest is continuous
# and the moderator is categorical (K=3 levels)
data(LWSH23_S3, package = "hecedsm")
mod <- lm(data = LWSH23_S3, needsatis ~ needclosure * cond)
anova(mod) # interaction is significant
# Compute estimated marginal means, weighting the levels of other factors by their relative frequencies
emmeans::emmeans(mod, specs = "needclosure",
by = "cond",
at = list(needclosure = 2:7),
weights = "prop")
# Without 'at', predictions are given at the average value of needclosure
```
```{r}
#| label: tbl-lmmoder2
#| echo: false
#| eval: true
#| message: false
#| warning: false
#| tbl-cap: "Linear moderation model coefficients of Study 3 of @Leckfor:2023. Least square coefficients, standard errors, Wald tests and $p$-values for $\\beta=0$ and 95\\% confidence intervals."
data(LWSH23_S3, package = "hecedsm")
mod <- lm(data = LWSH23_S3, needsatis ~ needclosure * cond)
options(knitr.kable.NA = '')
tab_coefs <- broom::tidy(summary(mod))
tab_coefs$term = c("(Intercept)", "need for closure", "directly rejected [cond]", "ghosted [cond]",
"need for closure * directly rejected [cond]", "need for closure * ghosted [cond]")
# tab_coefs$p.value <- papaja::apa_p(tab_coefs$p.value)
tab_coefs$p.value<- NULL
tab_coefs$statistic <- NULL
tab_coefs$coef <- paste0("$\\beta_",0:5,"$")
tab_coefs$lower <- confint(mod)[,1]
tab_coefs$upper <- confint(mod)[,2]
tab_coefs <- tab_coefs |> dplyr::select(term, coef, estimate, std.error, lower, upper)
knitr::kable(tab_coefs,
booktabs = TRUE,
col.names = c("term", "coef.","estimate", "std. error", "lower CL", "upper CL"),
digits = c(0,2,1,2,2,2), align = "lccccc")
```
```{r}
#| label: tbl-testsmoder2
#| echo: false
#| eval: true
#| message: false
#| warning: false
#| tbl-cap: "Analysis of variance table for the linear moderation model of Study 3 of @Leckfor:2023."
data(LWSH23_S3, package = "hecedsm")
mod <- lm(data = LWSH23_S3, needsatis ~ needclosure * cond)
options(knitr.kable.NA = '')
tab_anova <- broom::tidy(anova(mod)) |> dplyr::select(!meansq,)
tab_anova$p.value <- papaja::apa_p(tab_anova$p.value)
knitr::kable(tab_anova,
booktabs = TRUE,
col.names = c("term", "df","sum of squares", "stat","p-value"),
digits = c(0,2,1,2,3,3))
```
Since the interaction is significant, we focus on the different slopes. We can of course simply report the equation of the linear regression for each group. Alternatively, `emmeans` can also return the predictions for different values of `needclosure` per condition, along with standard errors. The predictions are more precise (smaller standard errors) for values of `needclosure` close to the subgroup averages.
The postulated theoretical mean equation is
\begin{align*}
\mathsf{E}(\texttt{needsatis}) &= \beta_0 + \beta_1\texttt{needclosure} + \beta_2 \mathsf{1}_{\texttt{cond=directly rejected}} +\beta_3 \mathsf{1}_{\texttt{cond=ghosted}} \\&\quad+ \beta_4 \texttt{needclosure} \times \mathsf{1}_{\texttt{cond=directly rejected}} +\beta_5 \texttt{needclosure} \times\mathsf{1}_{\texttt{cond=ghosted}}.
\end{align*}
The estimated model mean $\widehat{\mathsf{E}}(\texttt{needsatis} \mid \texttt{needclosure}=x)$ when `needclosure` takes the value $x \in [1,7]$ is thus, for each `cond`,
\begin{align*}
\widehat{Y} & = \begin{cases}
\widehat{\beta}_0 + \widehat{\beta}_1x, & \texttt{control};\\
(\widehat{\beta}_0 + \widehat{\beta}_2) + (\widehat{\beta}_1 + \widehat{\beta}_4)x , & \texttt{directly rejected};\\
(\widehat{\beta}_0 + \widehat{\beta}_3) + (\widehat{\beta}_1 + \widehat{\beta}_5)x, & \texttt{ghosted}.
\end{cases} \\&= \begin{cases}
5.015 + 0.181 x, & \texttt{control};\\
3.86 -0.222x, & \texttt{directly rejected};\\
3.52 -0.234x, & \texttt{ghosted}.
\end{cases}
\end{align*}
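These group-specific intercepts and slopes can be recovered directly from the fitted coefficients; a minimal sketch, assuming the coefficient ordering shown in @tbl-lmmoder2:

```{r}
#| eval: false
#| echo: true
data(LWSH23_S3, package = "hecedsm")
mod <- lm(needsatis ~ needclosure * cond, data = LWSH23_S3)
beta <- unname(coef(mod))
# Per-condition intercepts and slopes (treatment parametrization,
# reference level is the 'included' control condition)
data.frame(
  condition = c("included (control)", "directly rejected", "ghosted"),
  intercept = c(beta[1], beta[1] + beta[3], beta[1] + beta[4]),
  slope     = c(beta[2], beta[2] + beta[5], beta[2] + beta[6]))
```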
```{r}
#| eval: true
#| echo: false
#| message: false
#| warning: false
#| label: fig-moderation2-slopes
#| fig-cap: "Scatterplot and linear regression slopes for the need for satisfaction as a function of the moderating situation and the need for closure."
ggplot(data = LWSH23_S3,
mapping = aes(x = needclosure, y = needsatis, color = cond)) +
geom_smooth(method = "lm", se = FALSE) +
geom_point() + MetBrewer::scale_color_met_d("Hiroshige") +
labs(x = "need for closure",
y = "need for satisfaction",
color = NULL) +
theme_classic() +
theme(legend.position = "bottom")
```
We can see clearly in @fig-moderation2-slopes that the impact of the need for closure depends on the situation, with downward trends for the ghosted and rejected groups.
We can also fit this model with the PROCESS macro, which only understands numeric values for factors. The output (omitted) includes model coefficients with confidence intervals, the $F$-test statistic for the ANOVA, and output for the average need for closure.
```{r}
#| echo: true
#| eval: false
process(data = LWSH23_S3 |>
dplyr::mutate(condind = as.integer(cond)), # cast factor to numeric integer levels
y = "needsatis", # response variable
x = "needclosure", # explanatory variable (not manipulated)
w = "condind", # postulated moderator
mcw = 1, # dummy coding for moderator w (so compare to base level, here 'included')
model = 1) # number of model in Hayes (simple moderation)
```
:::
:::{.callout-important}
## **Summary**
* We can treat factor levels measuring quantities as continuous covariates, and reduce the number of parameters.
* Inclusion of continuous covariates may help filtering out unwanted variability.
* These are typically concomitant variables (measured alongside the response variable).
* These designs reduce the residual error, leading to an increase in power (more ability to detect differences in averages between experimental conditions).
* We are only interested in differences due to experimental condition (marginal effects).
* In general, there should be no interaction between covariates/blocking factors and experimental conditions.
* This hypothesis can be assessed by comparing the models with and without interaction, if there are enough units (e.g., equality of slope for ANCOVA).
* Moderators are variables that interact with the experimental factor. We assess their presence by testing for an interaction in a linear regression model.
* In the presence of a moderator, we need to consider slopes or groups separately for subgroups.
:::