Project Description and Data Processing Workflow
Group 2 Data Science Spring School 2022 (Anders Isaksen, Hugo Fitipaldi, Sam Ghatan, Sedrah Butt)
This is the repository for Group 2 of the Data Science Spring School & Challenge, notorious Kahoot quiz winners!


This document is best viewed on GitHub Pages.

This project is fully open-source from raw data to output and presentation, so please join and help us improve the model. Sources are contained in the GitHub repository.

A questionable model unfit for deployment is deployed here.

This repository contains the R code used to process data, and also contains four datasets from PhysioNet with data on diabetes and neuropathy status. While ECG data is not contained in the repository, it is available for download on PhysioNet for all datasets, although a substantial proportion of participants are missing ECG or tabular data.

Throughout this document, R code to reproduce data processing and population flow is provided in folded chunks below the section of the text where these are mentioned.

The final processing of ECG data and neural network training is performed using Python in a Google Colab Notebook here and here. The filtered ECG data output from the readme.Rmd R script and used in the notebook can be found on Google Drive (or locally in /ecg_data/ after running the readme.Rmd script).

To generate/update the GitHub Pages landing page, run index.R after knitting readme.Rmd.

Aims and summary

  • This project aimed to train a neural network to predict the risk of an individual having prevalent diabetic neuropathy, using nothing but a 10-second ECG recorded from two non-standard leads (V1/V2 and V5/V6).
  • By combining four publicly available PhysioNet datasets, data are available on a total of roughly 100 ECGs from 90 individuals with diabetes who have provided data on neuropathy status. However, the ECG data from two of these datasets (a third of the total ECG records) were recorded during vasoregulatory stress testing experiments and were deemed inappropriate for our use. Thus, the final dataset consisted of 60 individuals with diabetes from two PhysioNet datasets, with 24 cases of prevalent diabetic neuropathy among these individuals.
  • Data augmentation was mainly done by splitting the ECG signals into many images of 10-second snippets, in addition to item and batch transforms (random resizing and cropping, etc.).
  • We trained the same model on different training/validation splits to account for potential data leakage on the individual and experiment dataset level.
  • A ResNet18 computer vision model was transferred and trained on these data, but model performance was poor and very prone to over-fitting, with a validation loss > 0.80 that increased with every epoch of training beyond the first. Validation loss was highest in the model with the training/validation split on experiment dataset level, indicating that data leakage on this level may be worth accounting for.
  • Further work: Particularly challenging was the high level of noise in the ECG data. Had more time been available, the next logical step to improve model performance would be to filter out noisy ECG snippets, or even better, train the model as a 3-label classifier (neuropathy, healthy, noise).
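
The snippet-based augmentation mentioned in the list above can be illustrated with a minimal base R sketch. The actual splitting is done in the Python notebook, so this is only an illustration: the sampling rate (`fs_hz = 500`), the simulated signal and the helper name `split_into_snippets` are assumptions, not taken from the project.

```r
# Hypothetical helper: cut a signal into non-overlapping fixed-length snippets.
# `fs_hz` (sampling rate) and `snippet_seconds` are illustrative assumptions.
split_into_snippets <- function(signal, fs_hz, snippet_seconds = 10) {
  snippet_len <- fs_hz * snippet_seconds
  n_snippets <- length(signal) %/% snippet_len   # drop the incomplete tail
  if (n_snippets == 0) return(list())
  idx <- rep(seq_len(n_snippets), each = snippet_len)
  unname(split(signal[seq_len(n_snippets * snippet_len)], idx))
}

# Simulated 65-second recording at an assumed 500 Hz:
set.seed(1)
ecg <- rnorm(65 * 500)
snippets <- split_into_snippets(ecg, fs_hz = 500)

length(snippets)        # number of complete 10-second snippets
length(snippets[[1]])   # samples per snippet
```

Because every snippet from one recording shares the same participant, snippets must be assigned to training or validation by participant, not individually.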

Poster presentation: img

Data sources


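# Required packages:
library(here)        # project-relative paths
library(data.table)  # fread(), fwrite() and data.table syntax
library(dplyr)       # case_when()
library(snakecase)   # to_snake_case()
library(stringr)     # str_sub()
library(fs)          # file system helpers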
# Load tabular data:

# From "Cerebromicrovascular Disease in Elderly with Diabetes" ("GE-79"):
cded_data <-
  lapply(list.files(here("raw_csv_data", "GE-79"), full.names = T), fread, stringsAsFactors = F)

# From "Cerebral perfusion and cognitive decline in type 2 diabetes" ("GE-75"):
cpd_data <-
  lapply(list.files(here("raw_csv_data", "GE-75"), full.names = T), fread, stringsAsFactors = F)

# From "Cerebral Vasoregulation in Diabetes" ("GE-71"):
cvd_data <-
  lapply(list.files(here("raw_csv_data", "GE-71"), full.names = T), fread, stringsAsFactors = F)

# From: "Cerebral Vasoregulation in Elderly with Stroke" ("GE-72"):
cves_data <-
  fread(list.files(here("raw_csv_data", "GE-72"), full.names = T), stringsAsFactors = F)

Actual size of usable data

The above contents are what the documentation describes. That does not match the size of the tabular data actually in the datasets, and some individuals may be present in more than one dataset.

Unique subjects in each dataset and in a combined dataset:

  • CDED: 82
  • CPD: 88
  • CVES: 172
  • CVD: 86
  • Combined: 391
# Unique subjects in each dataset:

# CDED:
length(unique(cded_data[[3]]$`Subject ID`))

# CPD:
length(unique(cpd_data[[4]]$`Subject ID`))

# CVES:
length(unique(cves_data$subject_number))

# CVD:
length(unique(cvd_data[[4]]$`Subject ID`))

# Unique subjects in total:
length(unique(c(
    cded_data[[3]]$`Subject ID`,
    cpd_data[[4]]$`Subject ID`,
    cves_data$subject_number,
    cvd_data[[4]]$`Subject ID`
)))

Unique subjects with ECG data available

All four datasets include data on whether ECG data is missing or not. In the CDED and CPD datasets, this is described with an explicit variable. In the CVES and CVD datasets, we're making a qualified guess based on whether that person completed the visit where ECGs were performed:

  • CDED: 47
  • CPD: 51
  • CVES: 91
  • CVD: 57
  • Combined: 220
# Unique subjects in each dataset with ECG data:

# CDED:
length(unique(cded_data[[3]][ECG == 1]$`Subject ID`))

# CPD:
length(unique(cpd_data[[4]][ECG == 1]$`Subject ID`))

# CVES:
length(unique(cves_data[completed_visit_status == "COMPLETED"]$subject_number))

# CVD:
length(unique(cvd_data[[4]][`Head Up Tilt D2` == 1]$`Subject ID`))

# Unique subjects in total:
length(unique(c(
    cded_data[[3]][ECG == 1]$`Subject ID`,
    cpd_data[[4]][ECG == 1]$`Subject ID`,
    cves_data[completed_visit_status == "COMPLETED"]$subject_number,
    cvd_data[[4]][`Head Up Tilt D2` == 1]$`Subject ID`
)))

Unique subjects in each dataset with ECG data, who have diabetes

All four datasets provide data on diabetes status. Note that these individuals may provide more than one ECG, e.g. if ECGs are performed at baseline and at follow-up:

  • CDED: 22
  • CPD: 45
  • CVES: 2
  • CVD: 29
  • Combined: 90
# Unique subjects in each dataset with ECG data, who have diabetes:

# CDED:
length(unique(cded_data[[3]][ECG == 1 & toupper(`Subject ID`) %in% toupper(cded_data[[6]][`DM PATIENT MEDICAL HISTORY` == "YES"]$`patient ID`)]$`Subject ID`))

# CPD:
length(unique(cpd_data[[4]][ECG == 1 & Group == "DM"]$`Subject ID`))

# CVES:
length(unique(cves_data[completed_visit_status == "COMPLETED" & `DM PATIENT MEDICAL HISTORY` %in% c("yes", "YES")]$subject_number))

# CVD:
length(unique(cvd_data[[4]][`Head Up Tilt D2` == 1 & Group %in% c("DM", "DMOH")]$`Subject ID`))

# Unique subjects in total:
length(unique(c(
    cded_data[[3]][ECG == 1 &
                     toupper(`Subject ID`) %in% toupper(cded_data[[6]][`DM PATIENT MEDICAL HISTORY` == "YES"]$`patient ID`)]$`Subject ID`,
    cpd_data[[4]][ECG == 1 & Group == "DM"]$`Subject ID`,
    cves_data[completed_visit_status == "COMPLETED" &
                `DM PATIENT MEDICAL HISTORY` %in% c("yes", "YES")]$subject_number,
    cvd_data[[4]][`Head Up Tilt D2` == 1 &
                    Group %in% c("DM", "DMOH")]$`Subject ID`
)))

Participant overlap between datasets

Overlap in participants (with ECG data and diabetes) between the datasets is limited to 8 participants in CDED, who are also present in CPD (7) and CVD (1).

### Overlap between cded and cpd/cves/cvd:
# (This could have been done more elegantly, but bear with me)

# cded vs cpd: 7: ("S0296" "S0301" "S0308" "S0314" "S0318" "S0372" "S0430"):
cded_data[[3]][ECG == 1 &
                 toupper(`Subject ID`) %in% toupper(cded_data[[6]][`DM PATIENT MEDICAL HISTORY` == "YES"]$`patient ID`) &
                 toupper(`Subject ID`) %in% toupper(cpd_data[[4]][ECG == 1 &
                                                                    Group == "DM"]$`Subject ID`)]

# cded vs. cves: 0:
nrow(cded_data[[3]][ECG == 1 &
                      toupper(`Subject ID`) %in% toupper(cded_data[[6]][`DM PATIENT MEDICAL HISTORY` == "YES"]$`patient ID`) &
                      toupper(`Subject ID`) %in% toupper(cves_data[completed_visit_status == "COMPLETED" &
                                                                     `DM PATIENT MEDICAL HISTORY` %in% c("yes", "YES")]$subject_number)])

# cded vs. cvd: 1 ("S0105"):
cded_data[[3]][ECG == 1 &
                 toupper(`Subject ID`) %in% toupper(cded_data[[6]][`DM PATIENT MEDICAL HISTORY` == "YES"]$`patient ID`) &
                 toupper(`Subject ID`) %in% toupper(cvd_data[[4]][`Head Up Tilt D2` == 1 &
                                                                    Group %in% c("DM", "DMOH")]$`Subject ID`)]

### No overlap between cpd and cves/cvd

# cpd vs. cves: 0:
nrow(cpd_data[[4]][ECG == 1 &
                     Group == "DM" &
                     toupper(`Subject ID`) %in% toupper(cves_data[completed_visit_status == "COMPLETED" &
                                                                    `DM PATIENT MEDICAL HISTORY` %in% c("yes", "YES")]$subject_number)])

# cpd vs. cvd: 0
nrow(cpd_data[[4]][ECG == 1 &
                     Group == "DM" &
                     toupper(`Subject ID`) %in% toupper(cvd_data[[4]][`Head Up Tilt D2` == 1 &
                                                                        Group %in% c("DM", "DMOH")]$`Subject ID`)])

### No overlap between cves and cvd: 0
nrow(cves_data[completed_visit_status == "COMPLETED" &
                 `DM PATIENT MEDICAL HISTORY` %in% c("yes", "YES") &
                 toupper(subject_number) %in% toupper(cvd_data[[4]][`Head Up Tilt D2` == 1 &
                                                                      Group %in% c("DM", "DMOH")]$`Subject ID`)])
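
The pairwise comparisons above can also be written more compactly. The following base R sketch uses toy ID vectors (hypothetical, not the real study IDs) and computes the overlap for every pair of datasets with `intersect()` after normalising case. It also shows the consistency check behind the reported numbers: 22 + 45 + 2 + 29 = 98 subjects counted per dataset, minus 8 overlapping subjects, gives the combined 90.

```r
# Toy ID vectors (hypothetical, not the real study IDs):
ids <- list(
  cded = c("s0001", "S0002", "S0003"),
  cpd  = c("S0002", "S0004"),
  cvd  = c("S0003", "S0005")
)

# Normalise case before comparing, as in the chunks above:
ids <- lapply(ids, function(x) unique(toupper(x)))

# Overlap counts for every pair of datasets:
pairs <- combn(names(ids), 2, simplify = FALSE)
overlap <- vapply(pairs,
                  function(p) length(intersect(ids[[p[1]]], ids[[p[2]]])),
                  integer(1))
names(overlap) <- vapply(pairs, paste, character(1), collapse = "_vs_")
overlap

# Consistency check (valid when no subject is in three datasets at once):
# per-dataset counts minus pairwise overlaps equals the combined count
sum(lengths(ids)) - sum(overlap) == length(unique(unlist(ids)))
```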

Study dataset: Individuals in CDED and CPD with diabetes, and data on ECG and neuropathy

For the models, we combined the CDED (data from baseline visit) and CPD datasets, excluding the CDED records of the 7 individuals who are also present in the CPD dataset. This leaves a final study population of 60 individuals.

Prevalence of diabetic neuropathy

The protocol states that neuropathy in the CDED dataset was diagnosed at some point using the validated symptom scale neuropathy total symptom score-6, but the available variables do not correspond to this.

Both CDED and CPD contain questionnaire data on numbness and painful sensations of the feet. The CPD dataset also contains an item on autonomic neuropathy symptoms, although it is unclear what specific symptoms this item covers.

We defined diabetic neuropathy as a binary variable on the individual level as the presence of at least one of these symptoms. Individuals with missing data on all neuropathy items were excluded, while cases with missing data on only some items were interpreted as having no symptoms of these types.
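
As a minimal sketch of this rule, the following base R example applies it to a toy symptom table. The column names and values are hypothetical, and the repository's own code uses data.table syntax instead.

```r
# Toy symptom table (hypothetical values); NA marks a missing answer.
symptoms <- data.frame(
  numbness = c(TRUE,  FALSE, NA,   NA),
  pain     = c(FALSE, FALSE, TRUE, NA)
)

n_missing   <- rowSums(is.na(symptoms))              # missing items per person
any_symptom <- rowSums(symptoms, na.rm = TRUE) >= 1  # partial NAs count as "no symptom"

# At least one symptom -> TRUE; no data on any item -> NA; otherwise FALSE
neuropathy_outcome <- unname(ifelse(n_missing == ncol(symptoms), NA, any_symptom))
neuropathy_outcome
```

The third row shows the partial-missingness case: one missing item is ignored, and the reported symptom alone makes the person a case.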

Final dataset

Using the above method, 24 cases of neuropathy were identified among the 60 individuals in the study population, corresponding to a prevalence around 40% in both datasets (6 of 15 individuals from CDED, 19 of 45 from CPD).

## Define neuropathy in each dataset:

### CPD:
# Clean column names and subject ID's:
names(cpd_data[[2]]) <- to_snake_case(names(cpd_data[[2]]))
cpd_data[[2]]$patient_id <- toupper(cpd_data[[2]]$patient_id)

# Filtering to only id, diabetes status and the three neuropathy variables on numbness and pain :
cpd_data_vars <-
  cpd_data[[2]][, .(

# Recode string data to binary and NAs:
binary_converter_function <- function(x) {
  case_when(x == "N/A" ~ NA,
            x == "YES" | x == "yes" | x == "Yes" ~ TRUE,
            x == "NO" | x == "no" | x == "No" ~ FALSE)
}

# Apply the converter to the diabetes and neuropathy columns:
mod_cols = names(cpd_data_vars)[2:5]
cpd_data_vars[, (mod_cols) := lapply(.SD, binary_converter_function), .SDcols = mod_cols]

# Create the simpler neuropathy outcome variable:
# neuropathy is defined as the presence of either neuropathy, or numbness or pain in the feet:
cpd_data_vars[, neuropathy_outcome := apply(cpd_data_vars[, 3:5], 1, function(x)
  sum(x, na.rm = T)) >= 1]

# Set individuals with completely missing data to NA:
cpd_data_vars[, no_neuropathy_data := apply(cpd_data_vars[, 3:5], 1, function(x)
  sum(is.na(x))) == 3]

cpd_data_vars[, neuropathy_outcome := fifelse(no_neuropathy_data == T, NA, neuropathy_outcome)]

# rename diabetes variable and dataset variable for convenience:
names(cpd_data_vars)[2] <- "diabetes"
cpd_data_vars[, dataset := "cpd"]

# Clean CPD dataset (individuals with diabetes, and ECG/neuropathy-data):
cpd_clean <-
  cpd_data_vars[diabetes == T &
                  !is.na(neuropathy_outcome) &
                  patient_id %in% toupper(cpd_data[[4]][ECG == 1]$`Subject ID`), c(1, 6, 8)]

### CDED:
#### Make column names prettier for future use and clean case inconsistency in ID variable:
names(cded_data[[6]]) <- to_snake_case(names(cded_data[[6]]))
cded_data[[6]]$patient_id <-  toupper(cded_data[[6]]$patient_id)

# Filter to variables needed:
cded_survey <-
  cded_data[[6]][, .(

# Select columns to be modified
mod_cols = names(cded_survey)[3:5]
cded_survey[, (mod_cols) := lapply(.SD, binary_converter_function), .SDcols = mod_cols]

# Create a simple neuropathy variable:
# neuropathy is defined as the presence of either numbness or pain in the feet:
# The few cases of missing data in a symptom variable are treated as no symptom of this kind.
cded_survey[, neuropathy_outcome := apply(cded_survey[, 4:5], 1, function(x)
  sum(x, na.rm = T)) >= 1]

# Set individuals with completely missing data to NA:
cded_survey[, no_neuropathy_data := apply(cded_survey[, 4:5], 1, function(x)
  sum(is.na(x))) == 2]

cded_survey[, neuropathy_outcome := fifelse(no_neuropathy_data == T, NA, neuropathy_outcome)]

# Rename diabetes variable for convenience:
names(cded_survey)[3] <- "diabetes"

# Add variable to keep track of which dataset overlapping individuals came from:
cded_survey[, dataset := "cded"]

# Clean CDED dataset (visit 2 data from individuals with diabetes and ECG/neuropathy-data, not in CPD):
cded_clean <-
  cded_survey[visit == 2 &
                diabetes == T &
                !is.na(neuropathy_outcome) &
                patient_id %in% toupper(cded_data[[3]][ECG == 1]$`Subject ID`) &
                !patient_id %in% cpd_clean$patient_id, c(1, 6, 8)]

# Merge to one dataset and count neuropathy cases:
neuropathy_final <- rbind(cded_clean, cpd_clean)

nrow(neuropathy_final[dataset == "cded"])
nrow(neuropathy_final[dataset == "cded" & neuropathy_outcome == T])
nrow(neuropathy_final[dataset == "cpd"])
nrow(neuropathy_final[dataset == "cpd" & neuropathy_outcome == T])

Export tabular data and ECG data

Final cleaning of tabular data:

Append patient id variable to match ECG data file names: 'S' + ID + 'ECG'

The final dataset looks like this before exporting to a csv file:

# Append ID's:
study_dataset <-
  neuropathy_final[, .(
    patient_id = paste0(patient_id, "ECG"),
    dataset = factor(dataset),
    neuropathy_outcome = neuropathy_outcome
  )]

# Export dataset
fwrite(study_dataset, file = here("output_data", "study_dataset.csv"))

# Summary and contents:
Filter, split and export ECG files:

To save space and computation time, we filter the ECGs to only the ones we need, and export them to different folders for labelling purposes. We'll also split the ECGs into training and validation parent folders, so ECGs from the same experiment or individual cannot be present in both the training and validation datasets (we'll be splitting the ECGs into small snippets later, so each individual will contribute multiple ECGs). Otherwise we risk data leakage between the training and validation datasets, and the model might learn to identify individuals or experiments rather than signals of neuropathy, which would erode model performance on external data. A somewhat famous example of this mistake is Andrew Ng's random split of 112,120 x-ray images from 30,805 individuals, which was subsequently corrected.

Fortunately, the CPD and CDED datasets are similarly balanced in terms of neuropathy prevalence, and their relative sizes are suitable for use as a training/validation split (the CDED participants make up 25% of the study population). Due to the limited data available, we do not set aside a test dataset, but expect performance on external datasets to be relatively stable due to the different sources of training and validation set.

# Specify local source folder of CDED and CPD ECG data:
cded_ecg_folder <-

cpd_ecg_folder <- "C:/physionet/cpd/data/ecg"

# List ECG files of all patients in study population
cded_files <- list.files(cded_ecg_folder,
                         full.names = T)

cpd_files <- list.files(cpd_ecg_folder,
                        full.names = T)

# Filter files of each dataset to only subjects in study population and split into groups based on neuropathy status:

# CDED:
cded_healthy <-
  cded_files[str_sub(cded_files, -12, -5) %in% study_dataset[dataset == "cded" &
                                                               neuropathy_outcome == FALSE]$patient_id]

cded_neuropathy <-
  cded_files[str_sub(cded_files, -12, -5) %in% study_dataset[dataset == "cded" &
                                                               neuropathy_outcome == TRUE]$patient_id]

# CPD:
cpd_healthy <-
  cpd_files[str_sub(cpd_files, -12, -5) %in% study_dataset[dataset == "cpd" &
                                                             neuropathy_outcome == FALSE]$patient_id]

cpd_neuropathy <-
  cpd_files[str_sub(cpd_files, -12, -5) %in% study_dataset[dataset == "cpd" &
                                                             neuropathy_outcome == TRUE]$patient_id]
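
The filtering above relies on the ID occupying a fixed position at the end of each file path. As a sketch of what `str_sub(x, -12, -5)` extracts, the base R example below uses two hypothetical file paths: the `.dat` and `.hea` extensions are an assumption based on the WFDB format, and the folder is only illustrative. It also shows an alternative that does not depend on the length of the extension.

```r
# Hypothetical file paths following the 'S' + ID + 'ECG' naming pattern:
files <- c("C:/physionet/cpd/data/ecg/S0296ECG.dat",
           "C:/physionet/cpd/data/ecg/S0301ECG.hea")

# Positional extraction, equivalent to stringr::str_sub(files, -12, -5):
# the 8 characters before a 4-character extension such as ".dat"
n <- nchar(files)
ids_positional <- substr(files, n - 11, n - 4)

# Alternative that works for any extension length:
ids_by_name <- tools::file_path_sans_ext(basename(files))

ids_positional
ids_by_name
```

Both approaches give the same IDs here. The positional version would silently return wrong IDs if a file had a longer extension or a differently sized ID.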

In this fashion, we end up with a training set containing 45 individuals, and a validation set containing 15 individuals.

Training set:

Since the proportion of neuropathy is balanced between the two datasets, we also train a model on a random 20/80 split for comparison (due to the balanced proportions, the risk of data leakage inflating performance when training the model across both datasets shouldn't be critical).

# Alternative random split:
# Sample 1 in 5 of all participants to validation dataset:
all_participant_files <- c(cded_healthy, cded_neuropathy, cpd_healthy, cpd_neuropathy)

all_healthy <- all_participant_files[str_sub(all_participant_files, -12, -5) %in% study_dataset[neuropathy_outcome == FALSE]$patient_id]

all_neuropathy <- all_participant_files[str_sub(all_participant_files, -12, -5) %in% study_dataset[neuropathy_outcome == TRUE]$patient_id]

# Set group 2 seed for reproducibility:
set.seed(2)

valid_healthy <- all_participant_files[str_sub(all_participant_files, -12, -5) %in% sample(study_dataset[neuropathy_outcome == FALSE]$patient_id, 0.20 * nrow(study_dataset[neuropathy_outcome == FALSE]))]

valid_neuropathy <- all_participant_files[str_sub(all_participant_files, -12, -5) %in% sample(study_dataset[neuropathy_outcome == TRUE]$patient_id, 0.20 * nrow(study_dataset[neuropathy_outcome == TRUE]))]

# And remove the validation individuals from the training set:
train_healthy <- all_healthy[!all_healthy %in% valid_healthy]

train_neuropathy <- all_neuropathy[!all_neuropathy %in% valid_neuropathy]
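
The same idea can be sketched on the participant level with toy data. This base R example is an illustration under assumptions, not the repository's code: the IDs, the group sizes and the seed are hypothetical. It draws 20% of the participants within each outcome group for validation and then checks that no participant ends up on both sides of the split.

```r
# Toy participant table (hypothetical IDs and labels):
participants <- data.frame(
  patient_id = sprintf("S%04dECG", 1:20),
  neuropathy_outcome = rep(c(TRUE, FALSE), times = c(8, 12)),
  stringsAsFactors = FALSE
)

set.seed(2)  # arbitrary seed for this sketch

# Draw 20% of the participants within each outcome group for validation:
valid_ids <- unlist(lapply(
  split(participants$patient_id, participants$neuropathy_outcome),
  function(ids) sample(ids, size = floor(0.20 * length(ids)))
), use.names = FALSE)
train_ids <- setdiff(participants$patient_id, valid_ids)

# Leakage check: no participant may appear on both sides of the split
leaked <- intersect(train_ids, valid_ids)
length(valid_ids)
length(leaked)
```

The check is trivial here because `train_ids` is built with `setdiff()`. It becomes useful when the two ID sets are parsed from the exported file lists instead.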

For labeling purposes, training set ECGs from individuals with neuropathy go to the /ecg_wfdb/train/neuropathy/ folder, and those without neuropathy go to the /ecg_wfdb/train/healthy/ folder. Conversely, validation set ECGs go to their respective /ecg_wfdb/valid/neuropathy/ and /ecg_wfdb/valid/healthy/ folders. Like so:

├── /train
│   ├── /healthy/
│   └── /neuropathy/
└── /valid
    ├── /healthy/
    └── /neuropathy/

Note that the ECG data files aren't tracked in Git, so you'll have to download the datasets from PhysioNet to reproduce this.
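
As a sketch of how the eight label folders can be generated, the base R example below builds every combination of parent folder, split and label with `expand.grid()` and creates them under a temporary directory. The temporary root and the `ecg_sketch` folder name are assumptions for illustration; the repository's own code builds the paths with `here()`.

```r
# Build every combination of parent folder, split and label:
grid <- expand.grid(
  label = c("healthy", "neuropathy"),
  split = c("train", "valid"),
  ecg   = c("ecg_wfdb", "ecg_wfdb_randomsplit"),
  stringsAsFactors = FALSE
)

# A temporary directory stands in for the project root in this sketch:
root  <- file.path(tempdir(), "ecg_sketch")
paths <- file.path(root, grid$ecg, grid$split, grid$label)

# recursive = TRUE also creates any missing parent folders
created <- vapply(paths, dir.create, logical(1),
                  recursive = TRUE, showWarnings = FALSE)

length(paths)
all(dir.exists(paths))
```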

# Copy these files to either /healthy/ or /neuropathy/ folders based on neuropathy status:

# Create the folders
ecg_folders <- c("ecg_wfdb", "ecg_wfdb_randomsplit")
split_folders <- c(rep("train", 2), rep("valid", 2))
label_folders <- c(rep("healthy", 4), rep("neuropathy", 4))

path <- here(ecg_folders, split_folders, label_folders)
dir_create(path)

# Split by source dataset:

# Training set:
# Healthy:
file.copy(from = cpd_healthy, to = here("ecg_wfdb", "train", "healthy"))

# Neuropathy:
file.copy(from = cpd_neuropathy, to = here("ecg_wfdb", "train", "neuropathy"))

# Validation set:
# Healthy:
file.copy(from = cded_healthy, to = here("ecg_wfdb", "valid", "healthy"))

# Neuropathy:
file.copy(from = cded_neuropathy, to = here("ecg_wfdb", "valid", "neuropathy"))

# Random split:

# Training set:
# Healthy:
file.copy(from = train_healthy, to = here("ecg_wfdb_randomsplit", "train", "healthy"))

# Neuropathy:
file.copy(from = train_neuropathy, to = here("ecg_wfdb_randomsplit", "train", "neuropathy"))

# Validation set:
# Healthy:
file.copy(from = valid_healthy, to = here("ecg_wfdb_randomsplit", "valid", "healthy"))

# Neuropathy:
file.copy(from = valid_neuropathy, to = here("ecg_wfdb_randomsplit", "valid", "neuropathy"))

Off to Python and Google Colab!

Now, we have no further use of the tabular data file, since all the information needed to run the model is contained in the filename and path of the ECG data itself at this point (neuropathy label in the folder name, ECG ID in the file name).

The rest of the data processing is carried out in Python on Google Colab: preprocessing here (GitHub copy here) and transfer learning here (GitHub copy here). It involves reading the ECG waveform signals, extracting the two ECG leads and splitting them into hundreds of 10-second snippets saved as separate image files, which are then loaded into fastai DataLoader objects to train a ResNet model.

See you on the other side!