Merge branch 'dev' into 69_add_reducer_logic
mmcdermott committed Sep 1, 2024
2 parents 973a09d + 3519769 commit 1ebda61
Showing 39 changed files with 2,233 additions and 1,383 deletions.
155 changes: 53 additions & 102 deletions MIMIC-IV_Example/README.md
@@ -6,33 +6,34 @@ up from this one).

## Step 0: Installation

Download this repository and install the requirements:
If you want to install via PyPI, use the following (note that, for now, you still need to copy some files
locally even with a PyPI installation, as covered below, so make sure you are in a suitable directory):

```bash
conda create -n MEDS python=3.12
conda activate MEDS
pip install "MEDS_transforms[local_parallelism]"
mkdir MIMIC-IV_Example
cd MIMIC-IV_Example
wget https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/main/MIMIC-IV_Example/joint_script.sh
wget https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/main/MIMIC-IV_Example/joint_script_slurm.sh
wget https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/main/MIMIC-IV_Example/pre_MEDS.py
chmod +x joint_script.sh
chmod +x joint_script_slurm.sh
chmod +x pre_MEDS.py
cd ..
pip install "MEDS_transforms[local_parallelism,slurm_parallelism]"
```

If you want to install locally, use:
If you want to profile the time and memory costs of your ETL, also install `hydra-profiler` (`pip install hydra-profiler`).

## Step 0.5: Set-up

Set some environment variables and download the necessary files:
```bash
git clone git@github.com:mmcdermott/MEDS_transforms.git
cd MEDS_transforms
conda create -n MEDS python=3.12
conda activate MEDS
pip install .[local_parallelism]
export MIMICIV_RAW_DIR=??? # set to the directory in which you want to store the raw MIMIC-IV data
export MIMICIV_PRE_MEDS_DIR=??? # set to the directory in which you want to store the intermediate, pre-MEDS data
export MIMICIV_MEDS_COHORT_DIR=??? # set to the directory in which you want to store the final MEDS cohort

export VERSION=0.0.6 # or whatever version you want
export URL="https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/$VERSION/MIMIC-IV_Example"

wget $URL/run.sh
wget $URL/pre_MEDS.py
wget $URL/local_parallelism_runner.yaml
wget $URL/slurm_runner.yaml
mkdir configs
cd configs
wget $URL/configs/extract_MIMIC.yaml
cd ..
chmod +x run.sh
chmod +x pre_MEDS.py
```

## Step 1: Download MIMIC-IV
@@ -46,101 +47,51 @@ the root directory of where the resulting _core data files_ are stored -- e.g.,

```bash
cd $MIMICIV_RAW_DIR
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/d_labitems_to_loinc.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/inputevents_to_rxnorm.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/lab_itemid_to_loinc.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/meas_chartevents_main.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/meas_chartevents_value.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/numerics-summary.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/outputevents_to_loinc.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/proc_datetimeevents.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/proc_itemid.csv
wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/waveforms-summary.csv
export MIMIC_URL=https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map
wget $MIMIC_URL/d_labitems_to_loinc.csv
wget $MIMIC_URL/inputevents_to_rxnorm.csv
wget $MIMIC_URL/lab_itemid_to_loinc.csv
wget $MIMIC_URL/meas_chartevents_main.csv
wget $MIMIC_URL/meas_chartevents_value.csv
wget $MIMIC_URL/numerics-summary.csv
wget $MIMIC_URL/outputevents_to_loinc.csv
wget $MIMIC_URL/proc_datetimeevents.csv
wget $MIMIC_URL/proc_itemid.csv
wget $MIMIC_URL/waveforms-summary.csv
```

## Step 2: Run the basic MEDS ETL

This step contains several sub-steps; luckily, they can all be run via a single script: either
`joint_script.sh`, which uses the Hydra `joblib` launcher to run things with local parallelism (make sure
you enable this feature by including the `[local_parallelism]` option during installation), or
`joint_script_slurm.sh`, which uses the Hydra `submitit` launcher to run things through slurm (make sure you
enable this feature by including the `[slurm_parallelism]` option during installation). This script entails
several steps:

### Step 2.1: Get the data ready for base MEDS extraction

This is a step in a few parts:

1. Join a few tables by `hadm_id` to get the right times in the right rows for processing. In
particular, we need to join:
- the `hosp/diagnoses_icd` table with the `hosp/admissions` table to get the `dischtime` for each
`hadm_id`.
- the `hosp/drgcodes` table with the `hosp/admissions` table to get the `dischtime` for each `hadm_id`.
2. Convert the subject's static data to a more parseable form. This entails:
   - Getting the subject's DOB in a format that is usable for MEDS, rather than the integral `anchor_year`
     and `anchor_offset` fields.
   - Merging the subject's `dod` with the `deathtime` from the `admissions` table.

After these steps, modified files or symlinks to the original files will be written in a new directory which
will be used as the input to the actual MEDS extraction ETL. We'll use `$MIMICIV_PREMEDS_DIR` to denote this
directory.
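
For illustration, here is a minimal polars sketch of the first of these joins. The file paths and the exact
columns kept are assumptions made for this sketch; `pre_MEDS.py` is the authoritative implementation.

```python
import polars as pl

# Discharge times live in hosp/admissions; hosp/diagnoses_icd rows lack a
# usable timestamp of their own.
admissions = pl.scan_csv("hosp/admissions.csv").select("hadm_id", "dischtime")
diagnoses = pl.scan_csv("hosp/diagnoses_icd.csv")

# Attach each diagnosis row's discharge time via its admission ID.
out = diagnoses.join(admissions, on="hadm_id", how="left")
out.collect().write_parquet("pre_meds_dir/hosp/diagnoses_icd.parquet")
```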
## Step 2: Run the MEDS ETL

This step is run by either the `joint_script.sh` script or the `joint_script_slurm.sh` script, but in either
case the base command that is run is as follows (assumed to be run **not** from this directory but from the
root directory of this repository):
To run the MEDS ETL, use the following command:

```bash
./MIMIC-IV_Example/pre_MEDS.py raw_cohort_dir=$MIMICIV_RAW_DIR output_dir=$MIMICIV_PREMEDS_DIR
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true
```

In practice, on a machine with 150 GB of RAM and 10 cores, this step takes less than 5 minutes in total.
If you do not want to unzip the `.csv.gz` files, set `do_unzip=false` instead of `do_unzip=true`.

### Step 2.2: Run the MEDS extraction ETL
To use a specific stage runner file (e.g., to set different parallelism options), you can specify it as an
additional argument:

We will assume you want to output the final MEDS dataset into a directory we'll denote as `$MIMICIV_MEDS_DIR`.
Note this is a different directory than the pre-MEDS directory (though, of course, they can both be
subdirectories of the same root directory).

This is a step in 4 parts:

1. Sub-shard the raw files. Run this command as many times simultaneously as you would like to have workers
performing this sub-sharding step. See below for how to automate this parallelism using hydra launchers.

This step uses the `./scripts/extraction/shard_events.py` script. See `joint_script*.sh` for the expected
format of the command.

2. Extract and form the subject splits and sub-shards. The `./scripts/extraction/split_and_shard_subjects.py`
script is used for this step. See `joint_script*.sh` for the expected format of the command.

3. Extract subject sub-shards and convert to MEDS events. The
`./scripts/extraction/convert_to_sharded_events.py` script is used for this step. See `joint_script*.sh` for
the expected format of the command.

4. Merge the MEDS events into a single file per subject sub-shard. The
`./scripts/extraction/merge_to_MEDS_cohort.py` script is used for this step. See `joint_script*.sh` for the
expected format of the command.

5. (Optional) Generate preliminary code statistics and merge to external metadata. This is not currently
performed in the `joint_script*.sh` scripts.
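
For reference, here is a hedged sketch of how these steps are invoked. The authoritative commands, including
the exact Hydra `joblib`/`submitit` launcher flags, live in `joint_script*.sh`; the `worker` sweep key, the
`N_PARALLEL_WORKERS` variable name, and the argument names below are assumptions based on the config keys
used elsewhere in this pipeline.

```bash
# Sketch only -- see joint_script*.sh for the real commands.
# Step 1, with local joblib parallelism via a Hydra multirun sweep:
./scripts/extraction/shard_events.py --multirun \
    worker="range(0,$N_PARALLEL_WORKERS)" \
    hydra/launcher=joblib \
    input_dir="$MIMICIV_PREMEDS_DIR" \
    cohort_dir="$MIMICIV_MEDS_DIR" \
    event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml

# Steps 2-4, run serially with the same arguments:
for script in split_and_shard_subjects convert_to_sharded_events merge_to_MEDS_cohort; do
    "./scripts/extraction/${script}.py" \
        input_dir="$MIMICIV_PREMEDS_DIR" \
        cohort_dir="$MIMICIV_MEDS_DIR" \
        event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml
done
```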

## Limitations / TO-DOs:

Currently, some tables are ignored, including:
```bash
export N_WORKERS=5
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true \
stage_runner_fp=slurm_runner.yaml
```

1. `hosp/emar_detail`
2. `hosp/microbiologyevents`
3. `hosp/services`
4. `icu/datetimeevents`
5. `icu/ingredientevents`
The `N_WORKERS` environment variable, set before the command, controls the maximum number of parallel
workers that will be used.

Lots of questions remain about how to appropriately handle timestamps in the data -- e.g., things like HCPCS
events are stored at the level of the _date_, not the _datetime_. How should those be slotted into the
timeline, which is otherwise stored at the _datetime_ resolution?
The `slurm_runner.yaml` file (downloaded above) runs each stage across several workers on separate slurm
worker nodes using the `submitit` launcher. _**You will need to customize this file to your own slurm system
so that the partition names are correct before use.**_ The memory and time costs are viable in the current
configuration, but if your nodes are sufficiently different you may need to adjust those as well.
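
As a purely hypothetical sketch of the kind of entry you will be editing (the downloaded file is
authoritative and its layout may differ; the launcher parameter names below are standard `submitit` launcher
options), a single stage's settings might look like:

```yaml
# Hypothetical sketch only -- check the downloaded slurm_runner.yaml.
shard_events:
  parallelize:
    n_workers: ${oc.env:N_WORKERS}
    launcher: "submitit_slurm"
    launcher_params:
      partition: "short" # <-- must be a partition that exists on YOUR cluster
      timeout_min: 60
      cpus_per_task: 10
      mem_gb: 25
```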

Other questions:
The `local_parallelism_runner.yaml` file (downloaded above) runs each stage via separate processes on the
launching machine. There are no additional arguments needed for this runner beyond the `N_WORKERS`
environment variable, and there is nothing to customize in this file.
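
For example, a usage sketch that simply re-points the `stage_runner_fp` argument shown above at this file:

```bash
export N_WORKERS=5
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true \
    stage_runner_fp=local_parallelism_runner.yaml
```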

1. How to handle merging the deathtimes between the hosp table and the subjects table?
2. How to handle the DOB anchoring (`anchor_year`) nonsense MIMIC has?
To profile the time and memory costs of your ETL, add the `do_profile=true` flag at the end.
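
For example (assuming `hydra-profiler` was installed in Step 0):

```bash
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true do_profile=true
```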

## Notes

10 changes: 5 additions & 5 deletions MIMIC-IV_Example/configs/event_configs.yaml
@@ -42,7 +42,7 @@ hosp/diagnoses_icd:
_metadata:
hosp/d_icd_diagnoses:
description: "long_title"
parent_codes: "ICD{icd_version}CM/{icd_code}" # Single strings are templates of columns.
parent_codes: "ICD{icd_version}CM/{norm_icd_code}" # Single strings are templates of columns.

hosp/drgcodes:
drg:
@@ -109,7 +109,7 @@ hosp/omr:
time: col(chartdate)
time_format: "%Y-%m-%d"

hosp/subjects:
hosp/patients:
gender:
code:
- GENDER
@@ -165,8 +165,8 @@ hosp/procedures_icd:
hosp/d_icd_procedures:
description: "long_title"
parent_codes: # List of objects are string labels mapping to filters to be evaluated.
- "ICD{icd_version}Proc/{icd_code}": { icd_version: 9 }
- "ICD{icd_version}PCS/{icd_code}": { icd_version: 10 }
- "ICD{icd_version}Proc/{norm_icd_code}": { icd_version: "9" }
- "ICD{icd_version}PCS/{norm_icd_code}": { icd_version: "10" }

hosp/transfers:
transfer:
@@ -303,7 +303,7 @@ icu/inputevents:
- KG
time: col(starttime)
time_format: "%Y-%m-%d %H:%M:%S"
numeric_value: subjectweight
numeric_value: patientweight

icu/outputevents:
output:
36 changes: 36 additions & 0 deletions MIMIC-IV_Example/configs/extract_MIMIC.yaml
@@ -0,0 +1,36 @@
defaults:
- _extract
- _self_

description: |-
This pipeline extracts the MIMIC-IV dataset in longitudinal, sparse form from an input dataset meeting
select criteria and converts them to the flattened, MEDS format. You can control the key arguments to this
pipeline by setting environment variables:
```bash
export EVENT_CONVERSION_CONFIG_FP=??? # Path to your event conversion config
export MIMICIV_PRE_MEDS_DIR=??? # Path to the output dir of the pre-MEDS step
export MIMICIV_MEDS_COHORT_DIR=??? # Path to where you want the dataset to live
```
# The event conversion configuration file is used throughout the pipeline to define the events to extract.
event_conversion_config_fp: ${oc.env:EVENT_CONVERSION_CONFIG_FP}

input_dir: ${oc.env:MIMICIV_PRE_MEDS_DIR}
cohort_dir: ${oc.env:MIMICIV_MEDS_COHORT_DIR}

etl_metadata:
dataset_name: MIMIC-IV
dataset_version: 2.2

stage_configs:
shard_events:
infer_schema_length: 999999999

stages:
- shard_events
- split_and_shard_subjects
- convert_to_sharded_events
- merge_to_MEDS_cohort
- extract_code_metadata
- finalize_MEDS_metadata
- finalize_MEDS_data
12 changes: 8 additions & 4 deletions MIMIC-IV_Example/configs/pre_MEDS.yaml
@@ -1,11 +1,15 @@
raw_cohort_dir: ???
output_dir: ???
input_dir: ${oc.env:MIMICIV_RAW_DIR}
cohort_dir: ${oc.env:MIMICIV_PRE_MEDS_DIR}

do_overwrite: false

log_dir: ${cohort_dir}/.logs

# Hydra
hydra:
job:
name: pre_MEDS_${now:%Y-%m-%d_%H-%M-%S}
run:
dir: ${output_dir}/.logs/${hydra.job.name}
dir: ${log_dir}
sweep:
dir: ${output_dir}/.logs/${hydra.job.name}
dir: ${log_dir}
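
With this change, `pre_MEDS.py` resolves its paths from the environment via `oc.env` rather than from CLI
overrides, so a usage sketch (paths illustrative; `run.sh` normally invokes this for you) is:

```bash
export MIMICIV_RAW_DIR=/path/to/raw_mimic          # illustrative
export MIMICIV_PRE_MEDS_DIR=/path/to/pre_meds_out  # illustrative
./pre_MEDS.py  # input_dir and cohort_dir now come from the environment
```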