Adding details to run a MIMIC-IV example (and some other small fixes) #6
Changes from 15 commits
fd93950
1395f85
2de69ad
c52b937
551074c
2b15e06
19e39fe
9009f73
f65eb24
101dd33
a0c8203
c3eb004
6dde01b
de4b12c
04626d2
3a89351
7e72489
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,151 @@ | ||||||
# MIMIC-IV Example | ||||||
|
||||||
This is an example of how to extract a MEDS dataset from MIMIC-IV. All scripts in this README are assumed to | ||||||
be run **not** from this directory but from the root directory of this entire repository (i.e., one directory | ||||||
up from this one). | ||||||
|
||||||
**Status**: This is a work in progress. The code is not yet functional. Remaining work includes: | ||||||
|
||||||
- [x] Implementing the pre-MEDS processing step. | ||||||
  - [x] Implement the joining of discharge times. | ||||||
  - [x] Implement the conversion of the DOB to a more usable format. | ||||||
  - [x] Implement the joining of death times. | ||||||
- [ ] Testing the pre-MEDS processing step on live MIMIC-IV. | ||||||
  - [x] Test that it runs at all. | ||||||
  - [ ] Test that the output is as expected. | ||||||
- [ ] Check the installation instructions on a fresh client. | ||||||
- [x] Testing the `configs/event_configs.yaml` configuration on MIMIC-IV | ||||||
- [ ] Testing the MEDS extraction ETL runs on MIMIC-IV (this should be expected to work, but needs | ||||||
  live testing). | ||||||
  - [x] Sub-sharding | ||||||
  - [x] Patient split gathering | ||||||
  - [x] Event extraction | ||||||
  - [x] Merging | ||||||
- [ ] Validating the output MEDS cohort | ||||||
|
||||||
## Step 0: Installation | ||||||
|
||||||
Download this repository and install the requirements: | ||||||
|
||||||
```bash | ||||||
git clone git@github.com:mmcdermott/MEDS_polars_functions.git | ||||||
cd MEDS_polars_functions | ||||||
git checkout MIMIC_IV | ||||||
conda create -n MEDS python=3.12 | ||||||
conda activate MEDS | ||||||
pip install .[mimic] | ||||||
``` | ||||||
|
||||||
## Step 1: Download MIMIC-IV | ||||||
|
||||||
Download the MIMIC-IV dataset from https://physionet.org/content/mimiciv/2.2/ following the instructions on | ||||||
that page. You will need the raw `.csv.gz` files for this example. We will use `$MIMICIV_RAW_DIR` to denote | ||||||
the root directory of where the resulting _core data files_ are stored -- e.g., there should be a `hosp` and | ||||||
`icu` subdirectory of `$MIMICIV_RAW_DIR`. | ||||||
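
Before moving on, it can help to confirm that the download has the layout described above. The following is a minimal sketch and is not part of this repository: the helper name `missing_mimic_subdirs` is hypothetical, and it checks only for the `hosp` and `icu` subdirectories, not for any individual `.csv.gz` file.

```python
import os
from pathlib import Path

# Subdirectories this README expects directly under $MIMICIV_RAW_DIR.
EXPECTED_SUBDIRS = ("hosp", "icu")


def missing_mimic_subdirs(raw_dir):
    """Return the expected subdirectories that are absent from raw_dir (hypothetical helper)."""
    root = Path(raw_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    raw_dir = os.environ.get("MIMICIV_RAW_DIR")
    if raw_dir is None:
        print("MIMICIV_RAW_DIR is not set")
    else:
        missing = missing_mimic_subdirs(raw_dir)
        print("missing:", missing if missing else "none")
```

An empty list means both subdirectories were found; anything else names what still needs to be downloaded or unpacked.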
|
||||||
## Step 2: Get the data ready for base MEDS extraction | ||||||
|
||||||
This is a step in a few parts: | ||||||
|
||||||
1. Join a few tables by `hadm_id` to get the right timestamps in the right rows for processing. In | ||||||
   particular, we need to join: | ||||||
   - TODO | ||||||
2. Convert the patient's static data to a more parseable form. This entails: | ||||||
   - Getting the patient's DOB in a format that is usable for MEDS, rather than the integral `anchor_year` and | ||||||
     `anchor_age` fields. | ||||||
   - Merging the patient's `dod` with the `deathtime` from the `admissions` table. | ||||||
|
||||||
After these steps, modified files or symlinks to the original files will be written in a new directory which | ||||||
will be used as the input to the actual MEDS extraction ETL. We'll use `$MIMICIV_PREMEDS_DIR` to denote this | ||||||
directory. | ||||||
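
The static-data conversions in part 2 can be sketched in plain Python. This is only an illustration under stated assumptions, not the logic of `pre_MEDS.py`: it assumes the `patients` table provides the integer fields `anchor_year` and `anchor_age` (as in MIMIC-IV v2.2), that `dod` is a date string and `deathtime` a datetime string, and that the higher-resolution `deathtime` should win when both are present. Both function names are hypothetical.

```python
from datetime import datetime


def year_of_birth(anchor_year, anchor_age):
    """Approximate birth year, given that the patient is anchor_age years old in anchor_year."""
    return anchor_year - anchor_age


def merge_death_time(dod, deathtime):
    """Prefer the datetime-resolution deathtime; fall back to the date-only dod.

    The preference order is an assumption made for this sketch.
    """
    if deathtime:
        return datetime.strptime(deathtime, "%Y-%m-%d %H:%M:%S")
    if dod:
        return datetime.strptime(dod, "%Y-%m-%d")
    return None
```

A birth year produced this way is what a `"%Y"` timestamp format can parse, and a merged death time may carry either date or datetime resolution depending on which source was available.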
|
||||||
To run this step, you can use the following script (assumed to be run **not** from this directory but from the | ||||||
root directory of this repository): | ||||||
|
||||||
```bash | ||||||
./MIMIC-IV_Example/pre_MEDS.py raw_cohort_dir=$MIMICIV_RAW_DIR output_dir=$MIMICIV_PREMEDS_DIR | ||||||
``` | ||||||
|
||||||
In practice, on a machine with 150 GB of RAM and 10 cores, this step takes less than 5 minutes in total. | ||||||
|
||||||
## Step 3: Run the MEDS extraction ETL | ||||||
|
||||||
We will assume you want to output the final MEDS dataset into a directory we'll denote as `$MIMICIV_MEDS_DIR`. | ||||||
Review comment: Consider adding a comma after "Note" for better readability.
- Note this is a different directory than the pre-MEDS directory
+ Note, this is a different directory than the pre-MEDS directory
|
||||||
Note this is a different directory than the pre-MEDS directory (though, of course, they can both be | ||||||
subdirectories of the same root directory). | ||||||
|
||||||
This is a step in 4 parts: | ||||||
|
||||||
1. Sub-shard the raw files. Run this command as many times simultaneously as you would like to have workers | ||||||
performing this sub-sharding step. | ||||||
|
||||||
```bash | ||||||
./scripts/extraction/shard_events.py \ | ||||||
raw_cohort_dir=$MIMICIV_PREMEDS_DIR \ | ||||||
MEDS_cohort_dir=$MIMICIV_MEDS_DIR \ | ||||||
event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml | ||||||
``` | ||||||
|
||||||
In practice, on a machine with 150 GB of RAM and 10 cores, this step takes approximately 20 minutes in total. | ||||||
|
||||||
2. Extract and form the patient splits and sub-shards. | ||||||
|
||||||
```bash | ||||||
./scripts/extraction/split_and_shard_patients.py \ | ||||||
raw_cohort_dir=$MIMICIV_PREMEDS_DIR \ | ||||||
MEDS_cohort_dir=$MIMICIV_MEDS_DIR \ | ||||||
event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml | ||||||
``` | ||||||
|
||||||
In practice, on a machine with 150 GB of RAM and 10 cores, this step takes less than 5 minutes in total. | ||||||
|
||||||
3. Extract patient sub-shards and convert to MEDS events. | ||||||
|
||||||
```bash | ||||||
./scripts/extraction/convert_to_sharded_events.py \ | ||||||
raw_cohort_dir=$MIMICIV_PREMEDS_DIR \ | ||||||
MEDS_cohort_dir=$MIMICIV_MEDS_DIR \ | ||||||
event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml | ||||||
``` | ||||||
|
||||||
In practice, serially, this also takes around 20 minutes or more. However, it can be trivially parallelized to | ||||||
cut the time down by a factor of the number of workers processing the data by simply running the command | ||||||
multiple times (though this will, of course, consume more resources). If your filesystem is distributed, these | ||||||
commands can also be launched as separate slurm jobs, for example. For MIMIC-IV, this level of parallelization | ||||||
and performance is not necessary; however, for larger datasets, it can be. | ||||||
|
||||||
4. Merge the MEDS events into a single file per patient sub-shard. | ||||||
|
||||||
```bash | ||||||
./scripts/extraction/merge_to_MEDS_cohort.py \ | ||||||
raw_cohort_dir=$MIMICIV_PREMEDS_DIR \ | ||||||
MEDS_cohort_dir=$MIMICIV_MEDS_DIR \ | ||||||
event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml | ||||||
``` | ||||||
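
The four commands above share the same arguments, so they can be driven from one small script. The following is a rough sketch, not a tool shipped with this repository: the function names are hypothetical, and it launches multiple workers only for the sub-sharding and event-conversion stages, which are the two stages this README describes as safe to run several times simultaneously. Whether the other two stages tolerate concurrent workers is not assumed here.

```python
import subprocess

# Stage scripts in the order given in this README.
STAGES = [
    "shard_events.py",
    "split_and_shard_patients.py",
    "convert_to_sharded_events.py",
    "merge_to_MEDS_cohort.py",
]
# Stages the README describes as runnable by several simultaneous workers.
PARALLEL_STAGES = {"shard_events.py", "convert_to_sharded_events.py"}
CONFIG_FP = "./MIMIC-IV_Example/configs/event_configs.yaml"


def build_command(script, premeds_dir, meds_dir):
    """Build the argument list for one extraction stage."""
    return [
        f"./scripts/extraction/{script}",
        f"raw_cohort_dir={premeds_dir}",
        f"MEDS_cohort_dir={meds_dir}",
        f"event_conversion_config_fp={CONFIG_FP}",
    ]


def run_stage(script, premeds_dir, meds_dir, n_workers=1):
    """Launch n_workers copies of one stage and wait for all of them to finish."""
    cmd = build_command(script, premeds_dir, meds_dir)
    procs = [subprocess.Popen(cmd) for _ in range(n_workers)]
    exit_codes = [p.wait() for p in procs]
    if any(code != 0 for code in exit_codes):
        raise RuntimeError(f"{script} failed with exit codes {exit_codes}")


def run_pipeline(premeds_dir, meds_dir, n_workers=1):
    """Run the four stages in order, using extra workers only where the README allows it."""
    for script in STAGES:
        workers = n_workers if script in PARALLEL_STAGES else 1
        run_stage(script, premeds_dir, meds_dir, workers)
```

Calling `run_pipeline(premeds_dir, meds_dir, n_workers=4)` from the repository root would then run each stage to completion before starting the next.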
|
||||||
## Limitations / TO-DOs: | ||||||
|
||||||
Currently, some tables are ignored, including: | ||||||
|
||||||
1. `hosp/emar_detail` | ||||||
2. `hosp/microbiologyevents` | ||||||
3. `hosp/services` | ||||||
4. `icu/datetimeevents` | ||||||
5. `icu/ingredientevents` | ||||||
|
||||||
Review comment: Consider using a more formal tone in documentation.
- Lots of questions remain about how to appropriately handle timestamps of the data
+ Several questions remain about how to appropriately handle timestamps of the data
|
||||||
Lots of questions remain about how to appropriately handle timestamps of the data -- e.g., things like HCPCS | ||||||
events are stored at the level of the _date_, not the _datetime_. How should those be slotted into the | ||||||
Review comment: Consider adding a comma after "timeline" for better readability.
- How should those be slotted into the timeline which is otherwise stored at the _datetime_ resolution?
+ How should those be slotted into the timeline, which is otherwise stored at the _datetime_ resolution?
|
||||||
timeline which is otherwise stored at the _datetime_ resolution? | ||||||
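
To make the resolution problem concrete, here is a small sketch of parsing a value against a list of candidate formats, trying the datetime format first and the date-only format second. It is an assumption about one reasonable behavior, not a description of how the ETL treats a list-valued `timestamp_format`; the function name `parse_timestamp` is hypothetical.

```python
from datetime import datetime

# Candidate formats, most specific first.
FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d"]


def parse_timestamp(raw, formats=FORMATS):
    """Try each format in order; return (parsed, matched_format) or (None, None)."""
    for fmt in formats:
        try:
            return datetime.strptime(raw, fmt), fmt
        except ValueError:
            continue
    return None, None
```

With this approach a date-only value is parsed as midnight of that day, so it sorts before every datetime-resolution event recorded on the same date. That ordering is an artifact of parsing, not a clinical fact, which is why the question above is still open.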
|
||||||
Other questions: | ||||||
|
||||||
1. How to handle merging the deathtimes between the hosp table and the patients table? | ||||||
2. How to handle the dob nonsense MIMIC has? | ||||||
|
||||||
## Future Work | ||||||
|
||||||
### Pre-MEDS Processing | ||||||
|
||||||
If you wanted, some other processing could also be done here, such as: | ||||||
|
||||||
1. Converting the patient's dynamically recorded race into a static, most commonly recorded race field. |
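
As a sketch of what that conversion could look like, the snippet below picks the most frequently recorded race from the values seen across a patient's admissions. It is purely illustrative and is not implemented in this repository: the function name is hypothetical, missing values are simply skipped, and tie-breaking is left unspecified.

```python
from collections import Counter


def most_common_race(recorded_races):
    """Return the most frequently recorded non-null race, or None if there is none."""
    counts = Counter(r for r in recorded_races if r is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

The result could then be emitted as a single static event per patient instead of being repeated on every admission.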
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,215 @@ | ||
patient_id_col: subject_id | ||
hosp/admissions: | ||
  ed_registration: | ||
    code: ED_REGISTRATION | ||
    timestamp: col(edregtime) | ||
    timestamp_format: "%Y-%m-%d %H:%M:%S" | ||
  ed_out: | ||
    code: ED_OUT | ||
    timestamp: col(edouttime) | ||
    timestamp_format: "%Y-%m-%d %H:%M:%S" | ||
  admission: | ||
    code: | ||
      - HOSPITAL_ADMISSION | ||
      - col(admission_type) | ||
      - col(admission_location) | ||
    timestamp: col(admittime) | ||
    timestamp_format: "%Y-%m-%d %H:%M:%S" | ||
    insurance: insurance | ||
    language: language | ||
    marital_status: marital_status | ||
    race: race | ||
    hadm_id: hadm_id | ||
  discharge: | ||
    code: | ||
      - HOSPITAL_DISCHARGE | ||
      - col(discharge_location) | ||
    timestamp: col(dischtime) | ||
    timestamp_format: "%Y-%m-%d %H:%M:%S" | ||
    hadm_id: hadm_id | ||
  # We omit the death event here as it is joined to the data in the patients table in the pre-MEDS step. | ||
  #death: | ||
  #  code: DEATH | ||
  #  timestamp: col(deathtime) | ||
  #  timestamp_format: "%Y-%m-%d %H:%M:%S" | ||
  #  death_location: death_location | ||
  #  death_type: death_type | ||
|
||
hosp/diagnoses_icd: | ||
  diagnosis: | ||
    code: | ||
      - DIAGNOSIS | ||
      - ICD | ||
      - col(icd_version) | ||
      - col(icd_code) | ||
    hadm_id: hadm_id | ||
    timestamp: col(hadm_discharge_time) | ||
    timestamp_format: "%Y-%m-%d %H:%M:%S" | ||
|
||
hosp/drgcodes: | ||
  drg: | ||
    code: | ||
      - DRG | ||
      - col(drg_type) | ||
      - col(drg_code) | ||
      - col(description) | ||
    hadm_id: hadm_id | ||
    timestamp: col(hadm_discharge_time) | ||
    timestamp_format: "%Y-%m-%d %H:%M:%S" | ||
    drg_severity: drg_severity | ||
    drg_mortality: drg_mortality | ||
|
||
hosp/emar: | ||
  medication: | ||
    code: | ||
      - MEDICATION | ||
      - col(medication) | ||
      - col(event_txt) | ||
    timestamp: col(charttime) | ||
    timestamp_format: "%Y-%m-%d %H:%M:%S" | ||
    hadm_id: hadm_id | ||
    emar_id: emar_id | ||
    emar_seq: emar_seq | ||
|
||
hosp/hcpcsevents: | ||
  hcpcs: | ||
    code: | ||
      - HCPCS | ||
      - col(short_description) | ||
    hadm_id: hadm_id | ||
    timestamp: col(chartdate) | ||
    timestamp_format: "%Y-%m-%d" | ||
|
||
hosp/labevents: | ||
  lab: | ||
    code: | ||
      - LAB | ||
      - col(itemid) | ||
      - col(valueuom) | ||
    hadm_id: hadm_id | ||
    timestamp: col(charttime) | ||
    timestamp_format: "%Y-%m-%d %H:%M:%S" | ||
    numerical_value: valuenum | ||
    text_value: value | ||
    priority: priority | ||
|
||
hosp/omr: | ||
  omr: | ||
    code: col(result_name) | ||
    text_value: col(result_value) | ||
    timestamp: col(chartdate) | ||
    timestamp_format: "%Y-%m-%d" | ||
|
||
hosp/patients: | ||
  gender: | ||
    code: | ||
      - GENDER | ||
      - col(gender) | ||
    timestamp: null | ||
  dob: | ||
    code: DOB | ||
    timestamp: col(year_of_birth) | ||
    timestamp_format: "%Y" | ||
  death: | ||
    code: DEATH | ||
    timestamp: col(dod) | ||
    timestamp_format: | ||
      - "%Y-%m-%d %H:%M:%S" | ||
      - "%Y-%m-%d" | ||
|
||
hosp/pharmacy: | ||
  medication_start: | ||
    code: | ||
      - MEDICATION | ||
      - START | ||
      - col(medication) | ||
    timestamp: col(starttime) | ||
    route: route | ||
    frequency: frequency | ||
    doses_per_24_hrs: doses_per_24_hrs | ||
    poe_id: poe_id | ||
    timestamp_format: | ||
      - "%Y-%m-%d %H:%M:%S" | ||
      - "%Y-%m-%d" | ||
  medication_stop: | ||
    code: | ||
      - MEDICATION | ||
      - STOP | ||
      - col(medication) | ||
    timestamp: col(stoptime) | ||
    poe_id: poe_id | ||
    timestamp_format: | ||
      - "%Y-%m-%d %H:%M:%S" | ||
      - "%Y-%m-%d" | ||
|
||
hosp/procedures_icd: | ||
  procedure: | ||
    code: | ||
      - PROCEDURE | ||
      - ICD | ||
      - col(icd_version) | ||
      - col(icd_code) | ||
    hadm_id: hadm_id | ||
    timestamp: col(chartdate) | ||
    timestamp_format: "%Y-%m-%d" | ||
|
||
hosp/transfers: | ||
  transfer: | ||
    code: | ||
      - TRANSFER_TO | ||
      - col(eventtype) | ||
      - col(careunit) | ||
    timestamp: col(intime) | ||
    timestamp_format: "%Y-%m-%d %H:%M:%S" | ||
    hadm_id: hadm_id | ||
|
||
icu/icustays: | ||
  icu_admission: | ||
    code: | ||
      - ICU_ADMISSION | ||
      - col(first_careunit) | ||
    timestamp: col(intime) | ||
    timestamp_format: "%Y-%m-%d %H:%M:%S" | ||
    hadm_id: hadm_id | ||
    icustay_id: stay_id | ||
  icu_discharge: | ||
    code: | ||
      - ICU_DISCHARGE | ||
      - col(last_careunit) | ||
    timestamp: col(outtime) | ||
    timestamp_format: "%Y-%m-%d %H:%M:%S" | ||
    hadm_id: hadm_id | ||
    icustay_id: stay_id | ||
|
||
icu/chartevents: | ||
  event: | ||
    code: | ||
      - LAB | ||
      - col(itemid) | ||
      - col(valueuom) | ||
    timestamp: col(charttime) | ||
    timestamp_format: "%Y-%m-%d %H:%M:%S" | ||
    numerical_value: valuenum | ||
    text_value: value | ||
    hadm_id: hadm_id | ||
    icustay_id: stay_id | ||
|
||
icu/procedureevents: | ||
  start: | ||
    code: | ||
      - PROCEDURE | ||
      - START | ||
      - col(itemid) | ||
    timestamp: col(starttime) | ||
    timestamp_format: "%Y-%m-%d %H:%M:%S" | ||
    hadm_id: hadm_id | ||
    icustay_id: stay_id | ||
  end: | ||
    code: | ||
      - PROCEDURE | ||
      - END | ||
      - col(itemid) | ||
    timestamp: col(endtime) | ||
    timestamp_format: "%Y-%m-%d %H:%M:%S" | ||
    hadm_id: hadm_id | ||
    icustay_id: stay_id |
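
One way to read the `code` entries in the event configuration above: each entry is either a literal string or a `col(...)` reference to a column of the source row, and a list of entries is combined into a single code. The sketch below illustrates that reading only. The `//` separator, the helper names, and the example row values are assumptions made for illustration, not taken from this PR or from the ETL's implementation.

```python
import re

# Matches references of the form col(column_name).
COL_PATTERN = re.compile(r"^col\((\w+)\)$")


def resolve_part(part, row):
    """Return the row's value for a col(...) reference, or the literal string itself."""
    match = COL_PATTERN.match(part)
    return str(row[match.group(1)]) if match else part


def resolve_code(code_spec, row, sep="//"):
    """Resolve a string or list-valued code spec against one row (sep is an assumed separator)."""
    parts = [code_spec] if isinstance(code_spec, str) else code_spec
    return sep.join(resolve_part(p, row) for p in parts)
```

Under these assumptions, the `admission` event's three-part code would combine the literal `HOSPITAL_ADMISSION` with the row's `admission_type` and `admission_location` values.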
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
raw_cohort_dir: ??? | ||
output_dir: ??? | ||
|
||
# Hydra | ||
hydra: | ||
  job: | ||
    name: pre_MEDS_${now:%Y-%m-%d_%H-%M-%S} | ||
  run: | ||
    dir: ${output_dir}/.logs/${hydra.job.name} | ||
  sweep: | ||
    dir: ${output_dir}/.logs/${hydra.job.name} |
Review comment: Consider using a more descriptive hyperlink text instead of a bare URL.