Code in R and Stata to code, run, and analyse a simulation study is included in this repository. The specific simulation study is described down below.
This repository is organised as follows:
-
R code is included in the folder named
R
, with three scripts: one for coding the simulation study, one for analysing the results, and one for creating plots and tables with the results; -
Stata code is included in the folder named
Stata
, including three analogous scripts. -
Simulated data with the results obtained by the above-mentioned code in R and Stata are included in the folder named
data
. The R dataset has extension*.RDS
, while the Stata dataset has extension*.DTA
.
The aim of this simulation study is pedagogical.
Say we have a study which can be summarised by the following DAG:
library(ggdag)
library(ggplot2)
confounder_triangle() %>%
ggdag() +
theme_dag_blank(base_size = 12) +
coord_cartesian(xlim = c(-0.5, 2.5), ylim = c(-0.5, 1.5))
where X is the exposure, Y the outcome, and Z a confounder.
Note that the exposure and the outcome are independent, there is no edge between X and Y.
We know that there should be no observed association between X and Y once we adjust for Z in our analysis.
However:
-
Is it true that once we adjust for Z we observe no association between X and Y?
-
What happens if we fail to adjust for a confounder in our analysis?
This is what we are trying to learn from this simulation study.
X is a binary treatment variable, Z is a continuous confounder, and Y is a continuous outcome.
Z is simulated from a standard normal distribution, while X depends on Z according to a logistic model:
Y depends on Z according to a linear model:
with
We fix the following parameters:
-
$\gamma_0 = 1$ , -
$\gamma_1 = 3$ , -
$\alpha_0 = 10$ .
For
-
$\alpha_1 = 5$ : in this scenario, Z is actually a confounder; -
$\alpha_1 = 0$ : in the second scenario, we remove the edge between Z and Y. Here Z is not a counfounder anymore.
Finally, we generate N = 200 independent subjects per each simulated dataset.
We have a single estimand of interest here, the regression coefficient that quantifies the association between X and Y.
Remember that, according to our data-generating mechanisms, we expect no association between X and Y.
We estimate the regression coefficient of interest using linear regression. Specifically, we fit the following two models:
- A model that fails to adjust for the observed confounder Z (model 1):
- A model that properly adjusts for Z (model 2):
The coefficient of interest is
The key performance measure of interest is bias, defined as
as we know that — according to our DGMs — the true
We might also be interested in estimating the power of a significance
test for
This work is licensed under a Creative Commons Attribution 4.0 International License.