SEM_fit_index.qmd

---
title: "Bonus: Fit indices in SEM"
format: html
author: Moritz Fischer
execute:
  cache: true
project-type: website
---

*Reading/working time: ~15 min.*

```{r r install packages, message=FALSE, warning=FALSE}

#install.packages(c("future.apply", "ggplot2", "lavaan", "MPsychoR"), dependencies = TRUE)

require(future.apply)
require(ggplot2)
require(lavaan)
require(MPsychoR)

```

In this chapter, we will turn to simulation-based power analysis for fit indices in the context of SEM. We will build on the model we have introduced in Chapter 5 ([Structural Equation Modelling (SEM)](SEM.qmd)), it is therefore recommendable to read this chapter first.  The following chunk specifies this model, in which (latent) generalized prejudice is predicted by (latent) openness to experience (as conceptualized in the big five personality traits). We fit this model to the data set by Bergh et al. (2016) in order to get a first impression of the model fit.

```{r specify model}

data("Bergh")

model_sem <- "generalized_prejudice =~ EP + DP + SP + HP
              openness =~ O1 + O2 + O3
              generalized_prejudice ~ openness"

#fit the SEM to the pilot data set
fit <- sem(model_sem, data = Bergh)

summary(fit, fit.measures = TRUE)

```

There are many different fit indices displayed in this output, for example the  Comparative Fit Index (CFI), the Root Mean Square Error of Approximation (RMSEA) and the Standardized Root Mean Square Residual (SRMR). We can not go into the details of the interpretations of the fit indices here, but it is important to know that many of these indices are not very sensitive to sample size. Therefore, running a power analysis for these fit indices is not really meaningful. But instead of analyzing how one of these indices varies as a function of sample size, we can optimize the precision of one of these indices. For example, the `lavaan` output from above displays the 90% confidence interval for the RMSEA index. We could, for example, plan to find a certain sample size that ensures that the confidence interval around the RMSEA estimate has a certain maximum size, that is, the RMSEA estimate is sufficiently precise. This what we will learn to do in this chapter. 

As a starting point, we again define the population parameters (i.e., the means, variances, and co-variances of all measured variables). We use the study by Bergh et al. (2016) to estimate these parameters (just as we did in Chapter 5).

```{r define population}

attach(Bergh)

#store means
means_vector <- c(mean(EP), mean(SP), mean(HP), mean(DP), mean(O1), mean(O2), mean(O3)) |> round(2)

#store covariances
cov_mat <- cov(cbind(EP, SP, HP, DP, O1, O2, O3)) |> round(2)

```

With these parameters, we can simulate data using the `rmvnorm` function from the `Rfast` package. The only difference to the simulation described in Chapter 5 is that here, we do not calculate and store the p-value of the regression coefficient, but rather, we compute the width of the RMSEA confidence interval and store it in a vector. We then count the number of simulations that yield a confidence interval with a maximum size of, say, .10. The next chunk shows how this is done. 

```{r power analysis, warning=FALSE}

set.seed(9875234)

#test ns between 50 and 200
ns <- seq(50, 200, by=10) 

#prepare empty vector to store results
result <- data.frame()

#set number of iterations
iterations <- 1000

#write function
sim_sem <- function(n, model, mu, sigma) {
  

  simulated_data <- Rfast::rmvnorm(n = n, mu = mu, sigma = sigma) |> as.data.frame()
  fit_sem_simulated <- sem(model_sem, data = simulated_data)
  rmsea_ci_width <- as.numeric(fitMeasures(fit_sem_simulated)["rmsea.ci.upper"] - fitMeasures(fit_sem_simulated)["rmsea.ci.lower"])
  return(rmsea_ci_width)
  
    }


#replicate function with varying ns
for (n in ns) {  
  
rmsea_ci_width <- future_replicate(iterations, sim_sem(n = n, model = model_sem, mu = means_vector, sigma = cov_mat), future.seed=TRUE)  
result <- rbind(result, data.frame(
    n = n,
    power = sum(rmsea_ci_width < .1)/iterations)
  )

}
```

Let's plot this. 

```{r plot power curve}

ggplot(result, aes(x=n, y=power)) + geom_point() + geom_line() + scale_x_continuous(n.breaks = 18, limits = c(30,200)) + scale_y_continuous(n.breaks = 10, limits = c(0,1)) + geom_hline(yintercept= 0.8, color = "red")

```

This analysis suggests that approx. 168 participants are needed to obtain a 90%-confidence interval around the RMSEA coefficient that is not larger than .10 in 80% of the cases.