The Impact of MRI Image Quality on Statistical and Predictive Analysis on Voxel-Based Morphology

About

This library is currently being developed and maintained by the Applied Machine Learning group at Forschungszentrum Juelich, Germany.

Overview

This repository contains the scripts, data, and analysis code required to reproduce the results presented in the paper "The impact of MRI image quality on statistical and predictive analysis on voxel-based morphology." The study investigates how MRI image quality affects univariate statistical analyses and machine learning-based predictions in voxel-based morphology (VBM). By leveraging three large, publicly available datasets, the paper highlights the importance of image quality and sample size in neuroimaging research.

Paper Abstract:

MRI brain scans are affected by image artifacts caused by head motion, influencing derived measures such as brain volume and cortical thickness. This study examines the role of automated image quality assessment (IQA) in controlling for the effects of poor-quality images on statistical and predictive analyses. Key findings include:

Image quality significantly impacts the detection of sex/gender differences in univariate group comparisons, especially for smaller samples.
Increasing sample size and image quality improves statistical power in univariate analyses but has a marginal effect on classification accuracy for machine learning approaches.
For univariate methods, higher image quality is crucial, while machine learning benefits more from larger sample sizes.

Paper Link: https://arxiv.org/abs/2411.01268

Replicate the obtained results

Clone the repository:

git clone https://github.com/N-Nieto/QC.git
cd QC

Repository Structure

data/: Anonymized features are shared as X_(site).csv and Y_(site).csv.
code/statistics/: Python scripts for univariate statistical tests, machine learning experiments, and IQA evaluations.
code/sex_classification/: Python scripts for machine learning sex classification.
output/statistics: Directory to store analysis outputs. Contains the experiment results used in the plots.
output/ML: Directory to store analysis outputs. Contains the experiment results used in the plots.
plot/: Jupyter notebooks providing step-by-step workflows for reproducing key results and figures from the paper.

Create the environment to reproduce the code

Create conda environment

conda env create -f environment_QC.yml

Activate conda environment

conda activate QC_env

Data Exploration

To get familiar with the data, let's first plot the QC distribution of the data

Lib exploration

The sampling by QC

The core of this work is to sample the site data using the QC information. Low, high, or random QC samples are generated with the function balance_data_age_gender_Qsampling from lib/data_processing.py, which uses age bins to balance the data.

The logic of this function is as follows:

For a given age bin, count how many images exist for each gender.
Repeat this for all age bins and find the minimum count.
This minimum is the number of images to sample for each gender in each age bin.
For each age bin, sort the images according to QC (inverse, direct, or random order).
Retain that minimum number of images from the sorted list.

For example, assume we have 3 age bins. The first bin contains 20 images of males and 15 of females, the second 40 and 10, and the last 50 and 100. The minimum possible sample is then 10, the number of females in bin 2.

We therefore sample 10 males and 10 females in each of the three bins. For the first age bin, we select 10 of the 20 males according to the QC of the images.

For the females in the first bin, we select 10 of the 15 available. Whether we sample with low QC or high QC, 5 images will appear in both samples. This is why participants can be shared between different QC sampling strategies.

For the females in the second bin, it does not matter how we sample: we always get the same 10 images.
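The sampling logic and the worked example above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code of balance_data_age_gender_Qsampling: the function name sample_by_qc, the record keys ("id", "age", "sex", "qc"), the equal-width age bins, and the convention that a larger "qc" value means a better image are all hypothetical.

```python
import random
from collections import defaultdict


def sample_by_qc(records, n_bins, strategy, seed=0):
    """Return an age- and sex-balanced subset chosen by QC order.

    records: dicts with hypothetical keys "id", "age", "sex", "qc".
    strategy: "high", "low", or "random".
    """
    if strategy not in {"high", "low", "random"}:
        raise ValueError(f"unknown strategy: {strategy}")

    ages = [r["age"] for r in records]
    youngest = min(ages)
    # Equal-width age bins; fall back to width 1 if all ages are equal.
    width = (max(ages) - youngest) / n_bins or 1.0

    groups = defaultdict(list)
    for r in records:
        age_bin = min(int((r["age"] - youngest) / width), n_bins - 1)
        groups[(age_bin, r["sex"])].append(r)

    # The smallest (age bin, sex) group sets the sample size for every group.
    sexes = sorted({r["sex"] for r in records})
    keys = [(b, s) for b in range(n_bins) for s in sexes]
    n_keep = min(len(groups[k]) for k in keys)

    rng = random.Random(seed)
    sample = []
    for k in keys:
        group = list(groups[k])
        if strategy == "random":
            rng.shuffle(group)
        else:
            # Assumption: a larger "qc" value means a better image.
            group.sort(key=lambda r: r["qc"], reverse=(strategy == "high"))
        sample.extend(group[:n_keep])
    return sample


def make_group(prefix, n, age, sex):
    # Synthetic images with distinct QC values 0..n-1.
    return [{"id": f"{prefix}{i}", "age": age, "sex": sex, "qc": i}
            for i in range(n)]


# The three age bins from the example: 20/15, 40/10, and 50/100 (male/female).
records = (
    make_group("m1_", 20, 20, "M") + make_group("f1_", 15, 20, "F")
    + make_group("m2_", 40, 50, "M") + make_group("f2_", 10, 50, "F")
    + make_group("m3_", 50, 80, "M") + make_group("f3_", 100, 80, "F")
)
high_ids = {r["id"] for r in sample_by_qc(records, 3, "high")}
low_ids = {r["id"] for r in sample_by_qc(records, 3, "low")}
shared = high_ids & low_ids
print(len(high_ids), len(shared))
```

With this synthetic data each strategy keeps 60 images (10 per gender per bin), and 15 images are shared between the high and low QC samples: 5 females from the first bin and all 10 females from the second bin.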

If you select a lower number of age bins, more images will be selected, but the samples will overlap more. If you select a higher number of age bins, fewer images will be selected, but the cohorts will differ more from each other.

For example, for eNKI

For 10 age bins

For 3 age bins

hint: you can generate these plots with

python code/plots/count_shared_participants.py

Statistical analysis

The statistical analysis is performed in the following fashion.

Load the data
Sample the data according to different QC
For each feature, separate the data for each sex/gender
Perform a t-test between the two feature distributions
Save the p-value
A small p-value indicates that the feature distribution differs between the two groups.
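The per-feature test described in these steps could look like the sketch below. It is an assumption-laden illustration rather than the content of statistics_univariate.py: the function name univariate_sex_tests, the 0/1 coding of sex, and the synthetic data are hypothetical, and since the text does not say which t-test variant is used, the sketch relies on SciPy's default equal-variance two-sample test.

```python
import numpy as np
from scipy import stats


def univariate_sex_tests(X, sex):
    """Return one two-sample t-test p-value per feature (column of X)."""
    X = np.asarray(X, dtype=float)
    sex = np.asarray(sex)
    group_a = X[sex == 0]
    group_b = X[sex == 1]
    # With axis=0, ttest_ind tests every column independently.
    result = stats.ttest_ind(group_a, group_b, axis=0)
    return result.pvalue


# Synthetic stand-in for a sampled site: 200 images, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
sex = np.repeat([0, 1], 100)
X[sex == 1, 0] += 1.0  # only the first feature differs between groups

pvalues = univariate_sex_tests(X, sex)
print(pvalues)
```

Only the first feature was shifted between the groups, so its p-value should be very small, while the p-values of the other two features carry no real group difference.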

For high and low QC, you should run

python code/statistics/statistics_univariate.py

For random QC, the sampling is repeated N=20 times. You can set this value in the script and then run

python code/statistics/statistics_univariate_randomQ_repeated.py

Sex classification using Machine Learning

There are two main ways to run the classification

For a single site, the data is loaded, sampled, and split into train/test folds using a 5-times repeated 5-fold cross-validation

For high and low QC, you should run

python code/sex_classification/sex_classification_single_site.py

For random QC, the sampling is repeated N=20 times. You can set this value in the script and then run

python code/statistics/sex_classification_single_site_randomQ_repeated.py

For pooled data, the data is loaded, sampled, and split into train/test folds using a 5-times repeated 5-fold cross-validation

For high and low QC, you should run

python code/sex_classification/sex_classification_pooled_data.py

For random QC, the sampling is repeated N=20 times. You can set this value in the script and then run

python code/statistics/sex_classification_pooled_data_randomQ_repeated.py
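The 5-times repeated 5-fold cross-validation used in both the single-site and pooled settings can be sketched with scikit-learn as follows. The classifier (a standardized logistic regression), the stratified splitting, the accuracy metric, and the synthetic data are assumptions made for illustration; the scripts in this repository may use a different model and configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a sampled site: 200 images, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.repeat([0, 1], 100)
X[y == 1, 0] += 1.0  # make the classes partly separable

# 5 folds repeated 5 times gives 25 train/test splits.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)

# Placeholder model: scaling is fit on each training fold only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(len(scores), scores.mean())
```

Wrapping the scaler and the classifier in one pipeline keeps the preprocessing inside each training fold, which avoids leaking test-fold information into the model.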
