
Merge pull request #17 from mmcdermott/dev
IN PROGRESS general dev contributions. Currently tracking:
mmcdermott authored Jun 24, 2024
2 parents aec23c8 + 410e6ce commit e1e6260
Showing 31 changed files with 2,118 additions and 369 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tests.yaml
@@ -33,7 +33,7 @@ jobs:
#----------------------------------------------
- name: Run tests
run: |
-          pytest -v --doctest-modules --cov
+          pytest -v --doctest-modules --cov=src -s
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v4.0.1
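(For reference: `--cov=src` scopes coverage measurement to the `src` tree rather than to every imported
module, and `-s` disables pytest's output capture so test logs stream to the console during CI runs.)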
151 changes: 69 additions & 82 deletions MIMIC-IV_Example/README.md
that page. You will need the raw `.csv.gz` files for this example. We will use `$MIMICIV_RAW_DIR` to denote
the root directory in which the resulting _core data files_ are stored -- e.g., there should be a `hosp` and
`icu` subdirectory of `$MIMICIV_RAW_DIR`.
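
For orientation, the expected layout looks roughly like this (a sketch; the exact file list varies by
MIMIC-IV release, and only a few files are shown):

```
$MIMICIV_RAW_DIR/
├── hosp/
│   ├── admissions.csv.gz
│   ├── diagnoses_icd.csv.gz
│   ├── drgcodes.csv.gz
│   ├── patients.csv.gz
│   └── ...
└── icu/
    ├── chartevents.csv.gz
    ├── icustays.csv.gz
    └── ...
```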

## Step 2: Run the basic MEDS ETL

This step contains several sub-steps; luckily, all of these sub-steps can be run via a single script: either
`joint_script.sh`, which uses the Hydra `joblib` launcher to run things with local parallelism (make sure
you enable this feature by including the `[local_parallelism]` option during installation), or
`joint_script_slurm.sh`, which uses the Hydra `submitit` launcher to run things through slurm (make sure you
enable this feature by including the `[slurm_parallelism]` option during installation). An example invocation
is sketched below; the script's sub-steps are then described in the following subsections.
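
A hypothetical invocation of the local-parallelism variant follows. The four positional arguments are an
assumption based on the script's argument handling (it shifts four positional arguments; see the script
excerpt at the end of this page), so check the usage message in `joint_script.sh` before relying on this
order:

```bash
# Hypothetical invocation -- the argument order is an assumption, not confirmed here.
./MIMIC-IV_Example/joint_script.sh \
    "$MIMICIV_RAW_DIR" "$MIMICIV_PREMEDS_DIR" "$MIMICIV_MEDS_DIR" "$N_PARALLEL_WORKERS"
```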

### Step 2.1: Get the data ready for base MEDS extraction

This is a step in a few parts:

1. Join a few tables by `hadm_id` to get the right timestamps in the right rows for processing. In
   particular, we need to join:
   - the `hosp/diagnoses_icd` table with the `hosp/admissions` table to get the `dischtime` for each
     `hadm_id`.
   - the `hosp/drgcodes` table with the `hosp/admissions` table to get the `dischtime` for each `hadm_id`.
2. Convert the patient's static data to a more parseable form. This entails:
   - Get the patient's DOB in a format that is usable for MEDS, rather than the integral `anchor_year` and
     `anchor_age` fields.
After these steps, modified files or symlinks to the original files will be written to a separate
directory, which will be used as the input to the actual MEDS extraction ETL. We'll use
`$MIMICIV_PREMEDS_DIR` to denote this directory.
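
For example, one might define the relevant directories as environment variables up front (paths are
placeholders; adjust to your filesystem):

```bash
export MIMICIV_RAW_DIR=/data/mimiciv/raw
export MIMICIV_PREMEDS_DIR=/data/mimiciv/pre_meds
export MIMICIV_MEDS_DIR=/data/mimiciv/meds
```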

This step is run in the `joint_script.sh` script or the `joint_script_slurm.sh` script, but in either case the
base command that is run is as follows (assumed to be run **not** from this directory but from the
root directory of this repository):

```bash
./MIMIC-IV_Example/pre_MEDS.py raw_cohort_dir="$MIMICIV_RAW_DIR" output_dir="$MIMICIV_PREMEDS_DIR"
```

In practice, on a machine with 150 GB of RAM and 10 cores, this step takes less than 5 minutes in total.

### Step 2.2: Run the MEDS extraction ETL

We will assume you want to output the final MEDS dataset into a directory we'll denote as `$MIMICIV_MEDS_DIR`.
Note this is a different directory than the pre-MEDS directory (though, of course, they can both be
subdirectories of the same root directory).

This is a step in 4 parts (the fifth listed below is optional):
1. Sub-shard the raw files. Run this command as many times simultaneously as you would like to have workers
performing this sub-sharding step. See below for how to automate this parallelism using hydra launchers.

This step uses the `./scripts/extraction/shard_events.py` script. See `joint_script*.sh` for the expected
format of the command; a general invocation pattern is also sketched after this list.

In practice, on a machine with 150 GB of RAM and 10 cores, this step takes approximately 20 minutes in total.
2. Extract and form the patient splits and sub-shards. The `./scripts/extraction/split_and_shard_patients.py`
script is used for this step. See `joint_script*.sh` for the expected format of the command.

3. Extract patient sub-shards and convert to MEDS events. The
`./scripts/extraction/convert_to_sharded_events.py` script is used for this step. See `joint_script*.sh` for
the expected format of the command.

4. Merge the MEDS events into a single file per patient sub-shard. The
`./scripts/extraction/merge_to_MEDS_cohort.py` script is used for this step. See `joint_script*.sh` for the
expected format of the command.

In practice, on a machine with 150 GB of RAM and 10 cores, this step takes less than 5 minutes in total.
5. (Optional) Generate preliminary code statistics and merge to external metadata. This is not performed
currently in the `joint_script*.sh` scripts.
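
Though `joint_script*.sh` is the canonical reference for the exact arguments, the extraction steps above all
share a common invocation pattern, sketched here for the sub-sharding step with the `joblib` launcher (the
worker count and launcher choice are illustrative):

```bash
# A sketch of the shared pattern; the same flags apply to split_and_shard_patients.py,
# convert_to_sharded_events.py, and merge_to_MEDS_cohort.py.
./scripts/extraction/shard_events.py \
    --multirun \
    worker="range(0,$N_PARALLEL_WORKERS)" \
    hydra/launcher=joblib \
    input_dir="$MIMICIV_PREMEDS_DIR" \
    cohort_dir="$MIMICIV_MEDS_DIR" \
    event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml
```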

## Pre-processing for a model

To run the pre-processing steps for a model, consider the sample commands below (shown with the `joblib`
launcher; the `submitit_slurm` launcher can be substituted as in the extraction steps):

1. Filter patients to only those with at least 32 events (unique timepoints):

```bash
./scripts/preprocessing/filter_patients.py --multirun worker="range(0,3)" hydra/launcher=joblib \
    input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" \
    code_modifier_columns=null stage_configs.filter_patients.min_events_per_patient=32
```
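
Each command in this section follows the same Hydra multirun pattern: `worker="range(0,3)"` launches three
parallel workers via the `joblib` launcher, `input_dir` and `cohort_dir` point at the input MEDS data and the
output cohort directory, and `stage_configs.*` keys override per-stage settings.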

2. Add time-derived measurements (age and time-of-day):

```bash
./scripts/preprocessing/add_time_derived_measurements.py --multirun worker="range(0,3)" hydra/launcher=joblib \
    input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" \
    code_modifier_columns=null stage_configs.add_time_derived_measurements.age.DOB_code="DOB"
```
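
(The `DOB_code` override presumably tells the age computation which code in the MEDS data encodes date of
birth; the assumption here is that the pre-MEDS step emitted that measurement under the code `DOB`.)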

3. Get preliminary counts for code filtering:

```bash
./scripts/preprocessing/collect_code_metadata.py --multirun worker="range(0,3)" hydra/launcher=joblib \
    input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" \
    code_modifier_columns=null stage="preliminary_counts"
```

4. Filter codes:

```bash
./scripts/preprocessing/filter_codes.py --multirun worker="range(0,3)" hydra/launcher=joblib \
    input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" \
    code_modifier_columns=null stage_configs.filter_codes.min_patients_per_code=128 \
    stage_configs.filter_codes.min_occurrences_per_code=256
```

5. Get outlier detection params:

```bash
./scripts/preprocessing/collect_code_metadata.py --multirun worker="range(0,3)" hydra/launcher=joblib \
    input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" \
    code_modifier_columns=null stage=fit_outlier_detection
```

6. Filter outliers:

```bash
./scripts/preprocessing/filter_outliers.py --multirun worker="range(0,3)" hydra/launcher=joblib \
    input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" \
    code_modifier_columns=null
```

7. Fit normalization parameters:

```bash
./scripts/preprocessing/collect_code_metadata.py --multirun worker="range(0,3)" hydra/launcher=joblib \
    input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" \
    code_modifier_columns=null stage=fit_normalization
```

8. Fit vocabulary:

```bash
./scripts/preprocessing/fit_vocabulary_indices.py \
    input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" \
    code_modifier_columns=null
```

9. Normalize:

```bash
./scripts/preprocessing/normalize.py --multirun worker="range(0,3)" hydra/launcher=joblib \
    input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" \
    code_modifier_columns=null
```

## Limitations / TO-DOs:
98 changes: 49 additions & 49 deletions MIMIC-IV_Example/joint_script_slurm.sh
shift 4
# this doesn't fall back on running anything locally in a setting where only slurm worker nodes have
# sufficient computational resources to run the actual jobs.

# echo "Running pre-MEDS conversion on one worker."
# ./MIMIC-IV_Example/pre_MEDS.py \
# --multirun \
# worker="range(0,1)" \
# hydra/launcher=submitit_slurm \
# hydra.launcher.timeout_min=60 \
# hydra.launcher.cpus_per_task=10 \
# hydra.launcher.mem_gb=50 \
# hydra.launcher.partition="short" \
# raw_cohort_dir="$MIMICIV_RAW_DIR" \
# output_dir="$MIMICIV_PREMEDS_DIR"
echo "Running pre-MEDS conversion on one worker."
./MIMIC-IV_Example/pre_MEDS.py \
    --multirun \
    worker="range(0,1)" \
    hydra/launcher=submitit_slurm \
    hydra.launcher.timeout_min=60 \
    hydra.launcher.cpus_per_task=10 \
    hydra.launcher.mem_gb=50 \
    hydra.launcher.partition="short" \
    raw_cohort_dir="$MIMICIV_RAW_DIR" \
    output_dir="$MIMICIV_PREMEDS_DIR"

echo "Trying submitit launching with $N_PARALLEL_WORKERS jobs."

./scripts/extraction/shard_events.py \
    --multirun \
    worker="range(0,$N_PARALLEL_WORKERS)" \
    hydra/launcher=submitit_slurm \
    hydra.launcher.timeout_min=60 \
    hydra.launcher.cpus_per_task=10 \
    hydra.launcher.mem_gb=50 \
    hydra.launcher.partition="short" \
    input_dir="$MIMICIV_PREMEDS_DIR" \
    cohort_dir="$MIMICIV_MEDS_DIR" \
    event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml \
    stage=shard_events

# echo "Splitting patients on one worker"
# ./scripts/extraction/split_and_shard_patients.py \
# --multirun \
# worker="range(0,1)" \
# hydra/launcher=submitit_slurm \
# hydra.launcher.timeout_min=60 \
# hydra.launcher.cpus_per_task=10 \
# hydra.launcher.mem_gb=50 \
# hydra.launcher.partition="short" \
# input_dir="$MIMICIV_PREMEDS_DIR" \
# cohort_dir="$MIMICIV_MEDS_DIR" \
# event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml "$@"
#
# echo "Converting to sharded events with $N_PARALLEL_WORKERS workers in parallel"
# ./scripts/extraction/convert_to_sharded_events.py \
# --multirun \
# worker="range(0,$N_PARALLEL_WORKERS)" \
# hydra/launcher=submitit_slurm \
# hydra.launcher.timeout_min=60 \
# hydra.launcher.cpus_per_task=10 \
# hydra.launcher.mem_gb=50 \
# hydra.launcher.partition="short" \
# input_dir="$MIMICIV_PREMEDS_DIR" \
# cohort_dir="$MIMICIV_MEDS_DIR" \
# event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml "$@"
#
# echo "Merging to a MEDS cohort with $N_PARALLEL_WORKERS workers in parallel"
# ./scripts/extraction/merge_to_MEDS_cohort.py \
# --multirun \
# worker="range(0,$N_PARALLEL_WORKERS)" \
# hydra/launcher=submitit_slurm \
# hydra.launcher.timeout_min=60 \
# hydra.launcher.cpus_per_task=10 \
# hydra.launcher.mem_gb=50 \
# hydra.launcher.partition="short" \
# input_dir="$MIMICIV_PREMEDS_DIR" \
# cohort_dir="$MIMICIV_MEDS_DIR" \
# event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml "$@"
echo "Splitting patients on one worker"
./scripts/extraction/split_and_shard_patients.py \
    --multirun \
    worker="range(0,1)" \
    hydra/launcher=submitit_slurm \
    hydra.launcher.timeout_min=60 \
    hydra.launcher.cpus_per_task=10 \
    hydra.launcher.mem_gb=50 \
    hydra.launcher.partition="short" \
    input_dir="$MIMICIV_PREMEDS_DIR" \
    cohort_dir="$MIMICIV_MEDS_DIR" \
    event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml "$@"

echo "Converting to sharded events with $N_PARALLEL_WORKERS workers in parallel"
./scripts/extraction/convert_to_sharded_events.py \
    --multirun \
    worker="range(0,$N_PARALLEL_WORKERS)" \
    hydra/launcher=submitit_slurm \
    hydra.launcher.timeout_min=60 \
    hydra.launcher.cpus_per_task=10 \
    hydra.launcher.mem_gb=50 \
    hydra.launcher.partition="short" \
    input_dir="$MIMICIV_PREMEDS_DIR" \
    cohort_dir="$MIMICIV_MEDS_DIR" \
    event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml "$@"

echo "Merging to a MEDS cohort with $N_PARALLEL_WORKERS workers in parallel"
./scripts/extraction/merge_to_MEDS_cohort.py \
    --multirun \
    worker="range(0,$N_PARALLEL_WORKERS)" \
    hydra/launcher=submitit_slurm \
    hydra.launcher.timeout_min=60 \
    hydra.launcher.cpus_per_task=10 \
    hydra.launcher.mem_gb=50 \
    hydra.launcher.partition="short" \
    input_dir="$MIMICIV_PREMEDS_DIR" \
    cohort_dir="$MIMICIV_MEDS_DIR" \
    event_conversion_config_fp=./MIMIC-IV_Example/configs/event_configs.yaml "$@"
