Adding details to run a MIMIC-IV example (and some other small fixes) #6

Merged: 17 commits, May 26, 2024
154 changes: 154 additions & 0 deletions MIMIC-IV_Example/README.md
@@ -0,0 +1,154 @@
# MIMIC-IV Example

This is an example of how to extract a MEDS dataset from MIMIC-IV. All scripts in this README are assumed to
be run **not** from this directory but from the root directory of this entire repository (i.e., one directory
up from this one).

**Status**: This is a work in progress. The code is not yet functional. Remaining work includes:

- [x] Implementing the pre-MEDS processing step.
- [x] Implement the joining of discharge times.
- [x] Implement the conversion of the DOB to a more usable format.
- [x] Implement the joining of death times.
- [ ] Testing the pre-MEDS processing step on live MIMIC-IV.
- [x] Test that it runs at all.
- [ ] Test that the output is as expected.
- [ ] Check the installation instructions on a fresh client.
- [x] Testing the `configs/event_configs.yaml` configuration on MIMIC-IV
- [x] Testing the MEDS extraction ETL runs on MIMIC-IV (this should be expected to work, but needs
live testing).
- [x] Sub-sharding
- [x] Patient split gathering
- [x] Event extraction
- [x] Merging
- [ ] Validating the output MEDS cohort
- [x] Basic validation (even though file sizes are weird, the number of rows looks consistent).
- [ ] Debug and remove rows with null codes! (there are a lot of them)
- [ ] Detailed validation

## Step 0: Installation

Download this repository and install the requirements:

```bash
git clone git@github.com:mmcdermott/MEDS_polars_functions.git
cd MEDS_polars_functions
git checkout MIMIC_IV
conda create -n MEDS python=3.12
conda activate MEDS
pip install .[mimic]
```
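
Optionally, you can sanity check the installation by confirming that the configs and extraction entry points referenced later in this README are present. This is only a sketch; it assumes you are still in the repository root:

```bash
# Optional check from the repository root: these paths are the ones used later in this README.
ls MIMIC-IV_Example/pre_MEDS.py MIMIC-IV_Example/configs/event_configs.yaml
ls scripts/extraction/
```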

## Step 1: Download MIMIC-IV

Download the MIMIC-IV dataset from https://physionet.org/content/mimiciv/2.2/ following the instructions on

Contributor review comment: Consider using a more descriptive hyperlink text instead of a bare URL.

Suggested change:
- Download the MIMIC-IV dataset from https://physionet.org/content/mimiciv/2.2/ following the instructions on
+ Download the MIMIC-IV dataset from [PhysioNet MIMIC-IV](https://physionet.org/content/mimiciv/2.2/) following the instructions on

that page. You will need the raw `.csv.gz` files for this example. We will use `$MIMICIV_RAW_DIR` to denote
the root directory where the resulting _core data files_ are stored -- e.g., there should be `hosp` and
`icu` subdirectories of `$MIMICIV_RAW_DIR`.
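
For the rest of this README, it may help to set the relevant environment variables up front. The specific paths below are placeholders (pick whatever locations suit your system); the variable names are the ones used throughout the commands that follow:

```bash
# Placeholder paths -- substitute your own locations.
export MIMICIV_RAW_DIR=/path/to/mimiciv/2.2          # raw download, with hosp/ and icu/ subdirectories
export MIMICIV_PREMEDS_DIR=/path/to/mimiciv_premeds  # output of the pre-MEDS step (Step 2)
export MIMICIV_MEDS_DIR=/path/to/mimiciv_meds        # output of the MEDS extraction ETL (Step 3)

# The raw directory should contain the hosp/ and icu/ subdirectories described above.
ls "$MIMICIV_RAW_DIR/hosp" "$MIMICIV_RAW_DIR/icu" | head
```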

## Step 2: Get the data ready for base MEDS extraction

This step has a few parts:

1. Join a few tables by `hadm_id` to get the right timestamps in the right rows for processing. In
particular, we need to join:
- TODO
2. Convert the patient's static data to a more parseable form. This entails:
- Get the patient's DOB in a format that is usable for MEDS, rather than the integral `anchor_year` and
`anchor_offset` fields.
- Merge the patient's `dod` with the `deathtime` from the `admissions` table.

After these steps, modified files or symlinks to the original files will be written in a new directory which
will be used as the input to the actual MEDS extraction ETL. We'll use `$MIMICIV_PREMEDS_DIR` to denote this
directory.

To run this step, you can use the following script (assumed to be run **not** from this directory but from the
root directory of this repository):

```bash
./MIMIC-IV_Example/pre_MEDS.py raw_cohort_dir=$MIMICIV_RAW_DIR output_dir=$MIMICIV_PREMEDS_DIR
```

In practice, on a machine with 150 GB of RAM and 10 cores, this step takes less than 5 minutes in total.
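
Before moving on, a quick (hedged) check is to list the top of the output tree. The exact filenames are determined by `pre_MEDS.py`, but you should see the modified files and/or symlinks described above, presumably mirroring the raw `hosp`/`icu` layout:

```bash
# The exact contents depend on pre_MEDS.py; this just confirms the step wrote something sensible.
find "$MIMICIV_PREMEDS_DIR" -maxdepth 2 | sort | head -n 25
```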

## Step 3: Run the MEDS extraction ETL

We will assume you want to output the final MEDS dataset into a directory we'll denote as `$MIMICIV_MEDS_DIR`.

Contributor review comment: Consider adding a comma after "Note" for better readability.

Suggested change:
- Note this is a different directory than the pre-MEDS directory
+ Note, this is a different directory than the pre-MEDS directory

Note this is a different directory than the pre-MEDS directory (though, of course, they can both be
subdirectories of the same root directory).

This step has 4 parts:

1. Sub-shard the raw files. Run this command as many times simultaneously as you would like to have workers
performing this sub-sharding step.

```bash
./scripts/extraction/shard_events.py \
raw_cohort_dir=$MIMICIV_PREMEDS_DIR \
MEDS_cohort_dir=$MIMICIV_MEDS_DIR \
event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml
```

In practice, on a machine with 150 GB of RAM and 10 cores, this step takes approximately 20 minutes in total.
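
As a minimal sketch of the "run it multiple times" pattern above, you can launch several workers on one machine as background jobs; the worker count of 4 here is arbitrary:

```bash
# Sketch: 4 simultaneous sub-sharding workers; each invocation of the same command acts as one worker.
for _ in 1 2 3 4; do
  ./scripts/extraction/shard_events.py \
    raw_cohort_dir=$MIMICIV_PREMEDS_DIR \
    MEDS_cohort_dir=$MIMICIV_MEDS_DIR \
    event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml &
done
wait
```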

2. Extract and form the patient splits and sub-shards.

```bash
./scripts/extraction/split_and_shard_patients.py \
raw_cohort_dir=$MIMICIV_PREMEDS_DIR \
MEDS_cohort_dir=$MIMICIV_MEDS_DIR \
event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml
```

In practice, on a machine with 150 GB of RAM and 10 cores, this step takes less than 5 minutes in total.

3. Extract patient sub-shards and convert to MEDS events.

```bash
./scripts/extraction/convert_to_sharded_events.py \
raw_cohort_dir=$MIMICIV_PREMEDS_DIR \
MEDS_cohort_dir=$MIMICIV_MEDS_DIR \
event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml
```

In practice, serially, this also takes around 20 minutes or more. However, it can be trivially parallelized to
cut the time down by a factor of the number of workers processing the data by simply running the command
multiple times (though this will, of course, consume more resources). If your filesystem is distributed, these
commands can also be launched as separate slurm jobs, for example. For MIMIC-IV, this level of parallelization
and performance is not necessary; however, for larger datasets, it can be.
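
For the Slurm route mentioned above, a minimal sketch is to submit each worker as its own job; a real submission would need site-specific options (partition, memory, time limits) that are omitted here:

```bash
# Sketch only: submit 4 workers as separate Slurm jobs; add partition/memory/time options as needed.
for i in 1 2 3 4; do
  sbatch --job-name="convert_worker_${i}" --wrap="./scripts/extraction/convert_to_sharded_events.py \
    raw_cohort_dir=$MIMICIV_PREMEDS_DIR \
    MEDS_cohort_dir=$MIMICIV_MEDS_DIR \
    event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml"
done
```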

4. Merge the MEDS events into a single file per patient sub-shard.

```bash
./scripts/extraction/merge_to_MEDS_cohort.py \
raw_cohort_dir=$MIMICIV_PREMEDS_DIR \
MEDS_cohort_dir=$MIMICIV_MEDS_DIR \
event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml
```
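
Once the merge finishes, a final (hedged) check is simply to look at what was produced under the MEDS cohort directory; the exact layout and file format are determined by the extraction scripts rather than documented here:

```bash
# List the merged outputs; layout and format are whatever merge_to_MEDS_cohort.py writes.
find "$MIMICIV_MEDS_DIR" -type f | sort | head -n 25
```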

## Limitations / TO-DOs:

Currently, some tables are ignored, including:

1. `hosp/emar_detail`
2. `hosp/microbiologyevents`
3. `hosp/services`
4. `icu/datetimeevents`
5. `icu/ingredientevents`

Contributor review comment: Consider using a more formal tone in documentation.

Suggested change:
- Lots of questions remain about how to appropriately handle timestamps of the data
+ Several questions remain about how to appropriately handle timestamps of the data

Lots of questions remain about how to appropriately handle timestamps of the data -- e.g., things like HCPCS
events are stored at the level of the _date_, not the _datetime_. How should those be slotted into the

Contributor review comment: Consider adding a comma after "timeline" for better readability.

Suggested change:
- How should those be slotted into the timeline which is otherwise stored at the _datetime_ resolution?
+ How should those be slotted into the timeline, which is otherwise stored at the _datetime_ resolution?

timeline which is otherwise stored at the _datetime_ resolution?

Other questions:

1. How to handle merging the deathtimes between the hosp table and the patients table?
2. How to handle the dob nonsense MIMIC has?

## Future Work

### Pre-MEDS Processing

If you wanted, some other processing could also be done here, such as:

1. Converting the patient's dynamically recorded race into a static, most commonly recorded race field.
215 changes: 215 additions & 0 deletions MIMIC-IV_Example/configs/event_configs.yaml
@@ -0,0 +1,215 @@
patient_id_col: subject_id
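# Note on structure (our reading of this config; the extraction scripts are the authoritative
# reference): each remaining top-level key names a raw table, as a path relative to the raw data
# root without the file extension. Each entry under a table defines one event type to extract:
# `code` lists the literal and/or column-derived (`col(...)`) components of the event code,
# `timestamp` (with `timestamp_format`) gives the event time, and any other keys map additional
# output fields to source columns.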
hosp/admissions:
ed_registration:
code: ED_REGISTRATION
timestamp: col(edregtime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
ed_out:
code: ED_OUT
timestamp: col(edouttime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
admission:
code:
- HOSPITAL_ADMISSION
- col(admission_type)
- col(admission_location)
timestamp: col(admittime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
insurance: insurance
language: language
marital_status: marital_status
race: race
hadm_id: hadm_id
discharge:
code:
- HOSPITAL_DISCHARGE
- col(discharge_location)
timestamp: col(dischtime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
hadm_id: hadm_id
# We omit the death event here as it is joined to the data in the patients table in the pre-MEDS step.
#death:
# code: DEATH
# timestamp: col(deathtime)
# timestamp_format: "%Y-%m-%d %H:%M:%S"
# death_location: death_location
# death_type: death_type

hosp/diagnoses_icd:
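# Assumption: `hadm_discharge_time` is not a raw MIMIC-IV column; per Step 2 of the README, it is
# presumably joined onto this table (and onto hosp/drgcodes below) from the admissions table's
# discharge time in the pre-MEDS step.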
diagnosis:
code:
- DIAGNOSIS
- ICD
- col(icd_version)
- col(icd_code)
hadm_id: hadm_id
timestamp: col(hadm_discharge_time)
timestamp_format: "%Y-%m-%d %H:%M:%S"

hosp/drgcodes:
drg:
code:
- DRG
- col(drg_type)
- col(drg_code)
- col(description)
hadm_id: hadm_id
timestamp: col(hadm_discharge_time)
timestamp_format: "%Y-%m-%d %H:%M:%S"
drg_severity: drg_severity
drg_mortality: drg_mortality

hosp/emar:
medication:
code:
- MEDICATION
- col(medication)
- col(event_txt)
timestamp: col(charttime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
hadm_id: hadm_id
emar_id: emar_id
emar_seq: emar_seq

hosp/hcpcsevents:
hcpcs:
code:
- HCPCS
- col(short_description)
hadm_id: hadm_id
timestamp: col(chartdate)
timestamp_format: "%Y-%m-%d"

hosp/labevents:
lab:
code:
- LAB
- col(itemid)
- col(valueuom)
hadm_id: hadm_id
timestamp: col(charttime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
numerical_value: valuenum
text_value: value
priority: priority

hosp/omr:
omr:
code: col(result_name)
text_value: col(result_value)
timestamp: col(chartdate)
timestamp_format: "%Y-%m-%d"

hosp/patients:
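# Assumption: per Step 2 of the README, `year_of_birth` is derived in the pre-MEDS step from the
# raw anchor_* fields (hence the year-only "%Y" format below), and `dod` has been merged with the
# `deathtime` column from hosp/admissions.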
gender:
code:
- GENDER
- col(gender)
timestamp: null
dob:
code: DOB
timestamp: col(year_of_birth)
timestamp_format: "%Y"
death:
code: DEATH
timestamp: col(dod)
timestamp_format:
- "%Y-%m-%d %H:%M:%S"
- "%Y-%m-%d"

hosp/pharmacy:
medication_start:
code:
- MEDICATION
- START
- col(medication)
timestamp: col(starttime)
route: route
frequency: frequency
doses_per_24_hrs: doses_per_24_hrs
poe_id: poe_id
timestamp_format:
- "%Y-%m-%d %H:%M:%S"
- "%Y-%m-%d"
medication_stop:
code:
- MEDICATION
- STOP
- col(medication)
timestamp: col(stoptime)
poe_id: poe_id
timestamp_format:
- "%Y-%m-%d %H:%M:%S"
- "%Y-%m-%d"

hosp/procedures_icd:
procedure:
code:
- PROCEDURE
- ICD
- col(icd_version)
- col(icd_code)
hadm_id: hadm_id
timestamp: col(chartdate)
timestamp_format: "%Y-%m-%d"

hosp/transfers:
transfer:
code:
- TRANSFER_TO
- col(eventtype)
- col(careunit)
timestamp: col(intime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
hadm_id: hadm_id

icu/icustays:
icu_admission:
code:
- ICU_ADMISSION
- col(first_careunit)
timestamp: col(intime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
hadm_id: hadm_id
icustay_id: stay_id
icu_discharge:
code:
- ICU_DISCHARGE
- col(last_careunit)
timestamp: col(outtime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
hadm_id: hadm_id
icustay_id: stay_id

icu/chartevents:
event:
code:
- LAB
- col(itemid)
- col(valueuom)
timestamp: col(charttime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
numerical_value: valuenum
text_value: value
hadm_id: hadm_id
icustay_id: stay_id

icu/procedureevents:
start:
code:
- PROCEDURE
- START
- col(itemid)
timestamp: col(starttime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
hadm_id: hadm_id
icustay_id: stay_id
end:
code:
- PROCEDURE
- END
- col(itemid)
timestamp: col(endtime)
timestamp_format: "%Y-%m-%d %H:%M:%S"
hadm_id: hadm_id
icustay_id: stay_id
11 changes: 11 additions & 0 deletions MIMIC-IV_Example/configs/pre_MEDS.yaml
@@ -0,0 +1,11 @@
raw_cohort_dir: ???
output_dir: ???
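# "???" is Hydra/OmegaConf's marker for a mandatory value: both directories must be supplied on
# the command line, e.g. `raw_cohort_dir=$MIMICIV_RAW_DIR output_dir=$MIMICIV_PREMEDS_DIR`.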

# Hydra
hydra:
job:
name: pre_MEDS_${now:%Y-%m-%d_%H-%M-%S}
run:
dir: ${output_dir}/.logs/${hydra.job.name}
sweep:
dir: ${output_dir}/.logs/${hydra.job.name}