Adding details to run a MIMIC-IV example (and some other small fixes) #6

Merged: 17 commits, May 26, 2024
154 changes: 154 additions & 0 deletions MIMIC-IV_Example/README.md
@@ -0,0 +1,154 @@
# MIMIC-IV Example

This is an example of how to extract a MEDS dataset from MIMIC-IV. All scripts in this README are assumed to
be run **not** from this directory but from the root directory of this entire repository (i.e., one directory
up from this one).

**Status**: This is a work in progress. The code is not yet functional. Remaining work includes:

- [x] Implementing the pre-MEDS processing step.
- [x] Implement the joining of discharge times.
- [x] Implement the conversion of the DOB to a more usable format.
- [x] Implement the joining of death times.
- [ ] Testing the pre-MEDS processing step on live MIMIC-IV.
- [x] Test that it runs at all.
- [ ] Test that the output is as expected.
- [ ] Check the installation instructions on a fresh client.
- [x] Testing the `configs/event_configs.yaml` configuration on MIMIC-IV
- [x] Testing the MEDS extraction ETL runs on MIMIC-IV (this should be expected to work, but needs
live testing).
- [x] Sub-sharding
- [x] Patient split gathering
- [x] Event extraction
- [x] Merging
- [ ] Validating the output MEDS cohort
- [x] Basic validation (even though file sizes are weird, the number of rows looks consistent).
- [ ] Debug and remove rows with null codes! (there are a lot of them)
- [ ] Detailed validation

## Step 0: Installation

Download this repository and install the requirements:

```bash
git clone git@github.com:mmcdermott/MEDS_polars_functions.git
cd MEDS_polars_functions
git checkout MIMIC_IV
conda create -n MEDS python=3.12
conda activate MEDS
pip install .[mimic]
```
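
Optionally, you can sanity check the installation by confirming that the configs and extraction entry points referenced later in this README are present. This is only a sketch; it assumes you are still in the repository root:

```bash
# Optional check from the repository root: these paths are the ones used later in this README.
ls MIMIC-IV_Example/pre_MEDS.py MIMIC-IV_Example/configs/event_configs.yaml
ls scripts/extraction/
```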

## Step 1: Download MIMIC-IV

Download the MIMIC-IV dataset from https://physionet.org/content/mimiciv/2.2/ following the instructions on

Contributor review comment: Consider using a more descriptive hyperlink text instead of a bare URL.

Suggested change:
- Download the MIMIC-IV dataset from https://physionet.org/content/mimiciv/2.2/ following the instructions on
+ Download the MIMIC-IV dataset from [PhysioNet MIMIC-IV](https://physionet.org/content/mimiciv/2.2/) following the instructions on

that page. You will need the raw `.csv.gz` files for this example. We will use `$MIMICIV_RAW_DIR` to denote
the root directory where the resulting _core data files_ are stored -- e.g., there should be `hosp` and
`icu` subdirectories of `$MIMICIV_RAW_DIR`.
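
For the rest of this README, it may help to set the relevant environment variables up front. The specific paths below are placeholders (pick whatever locations suit your system); the variable names are the ones used throughout the commands that follow:

```bash
# Placeholder paths -- substitute your own locations.
export MIMICIV_RAW_DIR=/path/to/mimiciv/2.2          # raw download, with hosp/ and icu/ subdirectories
export MIMICIV_PREMEDS_DIR=/path/to/mimiciv_premeds  # output of the pre-MEDS step (Step 2)
export MIMICIV_MEDS_DIR=/path/to/mimiciv_meds        # output of the MEDS extraction ETL (Step 3)

# The raw directory should contain the hosp/ and icu/ subdirectories described above.
ls "$MIMICIV_RAW_DIR/hosp" "$MIMICIV_RAW_DIR/icu" | head
```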

## Step 2: Get the data ready for base MEDS extraction

This step has a few parts:

1. Join a few tables by `hadm_id` to get the right timestamps in the right rows for processing. In
particular, we need to join:
- TODO
2. Convert the patient's static data to a more parseable form. This entails:
- Get the patient's DOB in a format that is usable for MEDS, rather than the integral `anchor_year` and
`anchor_offset` fields.
- Merge the patient's `dod` with the `deathtime` from the `admissions` table.

After these steps, modified files or symlinks to the original files will be written in a new directory which
will be used as the input to the actual MEDS extraction ETL. We'll use `$MIMICIV_PREMEDS_DIR` to denote this
directory.

To run this step, you can use the following script (assumed to be run **not** from this directory but from the
root directory of this repository):

```bash
./MIMIC-IV_Example/pre_MEDS.py raw_cohort_dir=$MIMICIV_RAW_DIR output_dir=$MIMICIV_PREMEDS_DIR
```

In practice, on a machine with 150 GB of RAM and 10 cores, this step takes less than 5 minutes in total.
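
Before moving on, a quick (hedged) check is to list the top of the output tree. The exact filenames are determined by `pre_MEDS.py`, but you should see the modified files and/or symlinks described above, presumably mirroring the raw `hosp`/`icu` layout:

```bash
# The exact contents depend on pre_MEDS.py; this just confirms the step wrote something sensible.
find "$MIMICIV_PREMEDS_DIR" -maxdepth 2 | sort | head -n 25
```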

## Step 3: Run the MEDS extraction ETL

We will assume you want to output the final MEDS dataset into a directory we'll denote as `$MIMICIV_MEDS_DIR`.

Contributor review comment: Consider adding a comma after "Note" for better readability.

Suggested change:
- Note this is a different directory than the pre-MEDS directory
+ Note, this is a different directory than the pre-MEDS directory

Note this is a different directory than the pre-MEDS directory (though, of course, they can both be
subdirectories of the same root directory).

This step has 4 parts:

1. Sub-shard the raw files. Run this command as many times simultaneously as you would like to have workers
performing this sub-sharding step.

```bash
./scripts/extraction/shard_events.py \
raw_cohort_dir=$MIMICIV_PREMEDS_DIR \
MEDS_cohort_dir=$MIMICIV_MEDS_DIR \
event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml
```

In practice, on a machine with 150 GB of RAM and 10 cores, this step takes approximately 20 minutes in total.
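
As a minimal sketch of the "run it multiple times" pattern above, you can launch several workers on one machine as background jobs; the worker count of 4 here is arbitrary:

```bash
# Sketch: 4 simultaneous sub-sharding workers; each invocation of the same command acts as one worker.
for _ in 1 2 3 4; do
  ./scripts/extraction/shard_events.py \
    raw_cohort_dir=$MIMICIV_PREMEDS_DIR \
    MEDS_cohort_dir=$MIMICIV_MEDS_DIR \
    event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml &
done
wait
```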

2. Extract and form the patient splits and sub-shards.

```bash
./scripts/extraction/split_and_shard_patients.py \
raw_cohort_dir=$MIMICIV_PREMEDS_DIR \
MEDS_cohort_dir=$MIMICIV_MEDS_DIR \
event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml
```

In practice, on a machine with 150 GB of RAM and 10 cores, this step takes less than 5 minutes in total.

3. Extract patient sub-shards and convert to MEDS events.

```bash
./scripts/extraction/convert_to_sharded_events.py \
raw_cohort_dir=$MIMICIV_PREMEDS_DIR \
MEDS_cohort_dir=$MIMICIV_MEDS_DIR \
event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml
```

In practice, serially, this also takes around 20 minutes or more. However, it can be trivially parallelized to
cut the time down by a factor of the number of workers processing the data by simply running the command
multiple times (though this will, of course, consume more resources). If your filesystem is distributed, these
commands can also be launched as separate slurm jobs, for example. For MIMIC-IV, this level of parallelization
and performance is not necessary; however, for larger datasets, it can be.
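
For the Slurm route mentioned above, a minimal sketch is to submit each worker as its own job; a real submission would need site-specific options (partition, memory, time limits) that are omitted here:

```bash
# Sketch only: submit 4 workers as separate Slurm jobs; add partition/memory/time options as needed.
for i in 1 2 3 4; do
  sbatch --job-name="convert_worker_${i}" --wrap="./scripts/extraction/convert_to_sharded_events.py \
    raw_cohort_dir=$MIMICIV_PREMEDS_DIR \
    MEDS_cohort_dir=$MIMICIV_MEDS_DIR \
    event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml"
done
```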

4. Merge the MEDS events into a single file per patient sub-shard.

```bash
./scripts/extraction/merge_to_MEDS_cohort.py \
raw_cohort_dir=$MIMICIV_PREMEDS_DIR \
MEDS_cohort_dir=$MIMICIV_MEDS_DIR \
event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml
```
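
Once the merge finishes, a final (hedged) check is simply to look at what was produced under the MEDS cohort directory; the exact layout and file format are determined by the extraction scripts rather than documented here:

```bash
# List the merged outputs; layout and format are whatever merge_to_MEDS_cohort.py writes.
find "$MIMICIV_MEDS_DIR" -type f | sort | head -n 25
```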

## Limitations / TO-DOs:

Currently, some tables are ignored, including:

1. `hosp/emar_detail`
2. `hosp/microbiologyevents`
3. `hosp/services`
4. `icu/datetimeevents`
5. `icu/ingredientevents`

Contributor review comment: Consider using a more formal tone in documentation.

Suggested change:
- Lots of questions remain about how to appropriately handle timestamps of the data
+ Several questions remain about how to appropriately handle timestamps of the data

Lots of questions remain about how to appropriately handle timestamps of the data -- e.g., things like HCPCS
events are stored at the level of the _date_, not the _datetime_. How should those be slotted into the

Contributor review comment: Consider adding a comma after "timeline" for better readability.

Suggested change:
- How should those be slotted into the timeline which is otherwise stored at the _datetime_ resolution?
+ How should those be slotted into the timeline, which is otherwise stored at the _datetime_ resolution?

timeline which is otherwise stored at the _datetime_ resolution?

Other questions:

1. How to handle merging the deathtimes between the hosp table and the patients table?
2. How to handle the dob nonsense MIMIC has?

## Future Work

### Pre-MEDS Processing

If you wanted, some other processing could also be done here, such as:

1. Converting the patient's dynamically recorded race into a static, most commonly recorded race field.
215 changes: 215 additions & 0 deletions MIMIC-IV_Example/configs/event_configs.yaml
@@ -0,0 +1,215 @@
patient_id_col: subject_id
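# Note on structure (our reading of this config; the extraction scripts are the authoritative
# reference): each remaining top-level key names a raw table, as a path relative to the raw data
# root without the file extension. Each entry under a table defines one event type to extract:
# `code` lists the literal and/or column-derived (`col(...)`) components of the event code,
# `timestamp` (with `timestamp_format`) gives the event time, and any other keys map additional
# output fields to source columns.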
hosp/admissions:
ed_registration:
code: ED_REGISTRATION
timestamp: col(edregtime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
ed_out:
code: ED_OUT
timestamp: col(edouttime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
admission:
code:
- HOSPITAL_ADMISSION
- col(admission_type)
- col(admission_location)
timestamp: col(admittime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
insurance: insurance
language: language
marital_status: marital_status
race: race
hadm_id: hadm_id
discharge:
code:
- HOSPITAL_DISCHARGE
- col(discharge_location)
timestamp: col(dischtime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
hadm_id: hadm_id
# We omit the death event here as it is joined to the data in the patients table in the pre-MEDS step.
#death:
# code: DEATH
# timestamp: col(deathtime)
# timestamp_format: "%Y-%m-%d %H:%M:%S"
# death_location: death_location
# death_type: death_type

hosp/diagnoses_icd:
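# Assumption: `hadm_discharge_time` is not a raw MIMIC-IV column; per Step 2 of the README, it is
# presumably joined onto this table (and onto hosp/drgcodes below) from the admissions table's
# discharge time in the pre-MEDS step.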
diagnosis:
code:
- DIAGNOSIS
- ICD
- col(icd_version)
- col(icd_code)
hadm_id: hadm_id
timestamp: col(hadm_discharge_time)
timestamp_format: "%Y-%m-%d %H:%M:%S"

hosp/drgcodes:
drg:
code:
- DRG
- col(drg_type)
- col(drg_code)
- col(description)
hadm_id: hadm_id
timestamp: col(hadm_discharge_time)
timestamp_format: "%Y-%m-%d %H:%M:%S"
drg_severity: drg_severity
drg_mortality: drg_mortality

hosp/emar:
medication:
code:
- MEDICATION
- col(medication)
- col(event_txt)
timestamp: col(charttime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
hadm_id: hadm_id
emar_id: emar_id
emar_seq: emar_seq

hosp/hcpcsevents:
hcpcs:
code:
- HCPCS
- col(short_description)
hadm_id: hadm_id
timestamp: col(chartdate)
timestamp_format: "%Y-%m-%d"

hosp/labevents:
lab:
code:
- LAB
- col(itemid)
- col(valueuom)
hadm_id: hadm_id
timestamp: col(charttime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
numerical_value: valuenum
text_value: value
priority: priority

hosp/omr:
omr:
code: col(result_name)
text_value: col(result_value)
timestamp: col(chartdate)
timestamp_format: "%Y-%m-%d"

hosp/patients:
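# Assumption: per Step 2 of the README, `year_of_birth` is derived in the pre-MEDS step from the
# raw anchor_* fields (hence the year-only "%Y" format below), and `dod` has been merged with the
# `deathtime` column from hosp/admissions.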
gender:
code:
- GENDER
- col(gender)
timestamp: null
dob:
code: DOB
timestamp: col(year_of_birth)
timestamp_format: "%Y"
death:
code: DEATH
timestamp: col(dod)
timestamp_format:
- "%Y-%m-%d %H:%M:%S"
- "%Y-%m-%d"

hosp/pharmacy:
medication_start:
code:
- MEDICATION
- START
- col(medication)
timestamp: col(starttime)
route: route
frequency: frequency
doses_per_24_hrs: doses_per_24_hrs
poe_id: poe_id
timestamp_format:
- "%Y-%m-%d %H:%M:%S"
- "%Y-%m-%d"
medication_stop:
code:
- MEDICATION
- STOP
- col(medication)
timestamp: col(stoptime)
poe_id: poe_id
timestamp_format:
- "%Y-%m-%d %H:%M:%S"
- "%Y-%m-%d"

hosp/procedures_icd:
procedure:
code:
- PROCEDURE
- ICD
- col(icd_version)
- col(icd_code)
hadm_id: hadm_id
timestamp: col(chartdate)
timestamp_format: "%Y-%m-%d"

hosp/transfers:
transfer:
code:
- TRANSFER_TO
- col(eventtype)
- col(careunit)
timestamp: col(intime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
hadm_id: hadm_id

icu/icustays:
icu_admission:
code:
- ICU_ADMISSION
- col(first_careunit)
timestamp: col(intime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
hadm_id: hadm_id
icustay_id: stay_id
icu_discharge:
code:
- ICU_DISCHARGE
- col(last_careunit)
timestamp: col(outtime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
hadm_id: hadm_id
icustay_id: stay_id

icu/chartevents:
event:
code:
- LAB
- col(itemid)
- col(valueuom)
timestamp: col(charttime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
numerical_value: valuenum
text_value: value
hadm_id: hadm_id
icustay_id: stay_id

icu/procedureevents:
start:
code:
- PROCEDURE
- START
- col(itemid)
timestamp: col(starttime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
hadm_id: hadm_id
icustay_id: stay_id
end:
code:
- PROCEDURE
- END
- col(itemid)
timestamp: col(endtime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
hadm_id: hadm_id
icustay_id: stay_id
11 changes: 11 additions & 0 deletions MIMIC-IV_Example/configs/pre_MEDS.yaml
@@ -0,0 +1,11 @@
raw_cohort_dir: ???
output_dir: ???
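# "???" is Hydra/OmegaConf's marker for a mandatory value: both directories must be supplied on
# the command line, e.g. `raw_cohort_dir=$MIMICIV_RAW_DIR output_dir=$MIMICIV_PREMEDS_DIR`.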

# Hydra
hydra:
job:
name: pre_MEDS_${now:%Y-%m-%d_%H-%M-%S}
run:
dir: ${output_dir}/.logs/${hydra.job.name}
sweep:
dir: ${output_dir}/.logs/${hydra.job.name}