Merge pull request #42 from mmcdermott/improved_documentation
Adding an improved documentation base and (eventually) readthedocs support.
mmcdermott authored Jul 26, 2024
2 parents a9c87f3 + d4f01eb commit 9a3ce89
Showing 3 changed files with 297 additions and 109 deletions.
102 changes: 6 additions & 96 deletions MIMIC-IV_Example/README.md
@@ -4,31 +4,6 @@
This is an example of how to extract a MEDS dataset from MIMIC-IV. All scripts in this directory should
be run **not** from this directory but from the root directory of this entire repository (e.g., one directory
up from this one).

**Status**: This is a work in progress. The code is not yet functional. Remaining work includes:

- [x] Implementing the pre-MEDS processing step.
- [x] Implement the joining of discharge times.
- [x] Implement the conversion of the DOB to a more usable format.
- [x] Implement the joining of death times.
- [ ] Testing the pre-MEDS processing step on live MIMIC-IV.
- [x] Test that it runs at all.
- [ ] Test that the output is as expected.
- [ ] Check the installation instructions on a fresh client.
- [x] Testing the `configs/event_configs.yaml` configuration on MIMIC-IV
- [x] Testing the MEDS extraction ETL runs on MIMIC-IV (this should be expected to work, but needs
live testing).
- [x] Sub-sharding
- [x] Patient split gathering
- [x] Event extraction
- [x] Merging
- [ ] Validating the output MEDS cohort
- [x] Basic validation (even though file sizes are weird, the number of rows looks consistent).
- [ ] Debug and remove rows with null codes! (there are a lot of them)
- [ ] Detailed validation

Note: If you use a Slurm cluster and launch the Hydra submitit jobs from an interactive Slurm node, you
may need to run `unset SLURM_CPU_BIND` in your terminal first to avoid errors.

## Step 0: Installation

Download this repository and install the requirements:
@@ -38,7 +13,7 @@

```bash
git clone git@github.com:mmcdermott/MEDS_transforms.git
cd MEDS_transforms
conda create -n MEDS python=3.12
conda activate MEDS
pip install .[examples]
pip install .[examples,local_parallelism]
```
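
A quick way to confirm the environment is working (the `__version__` attribute is an assumption here; a clean import is the real check):

```bash
python -c "import MEDS_transforms; print(getattr(MEDS_transforms, '__version__', 'ok'))"
```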

## Step 1: Download MIMIC-IV
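
For reference, a credentialed PhysioNet download of MIMIC-IV v2.2 typically looks like the sketch below; the username is a placeholder, and the version and destination should be adapted to your setup:

```bash
# Placeholder username; requires credentialed PhysioNet access to MIMIC-IV.
wget -r -N -c -np --user YOUR_PHYSIONET_USERNAME --ask-password \
  https://physionet.org/files/mimiciv/2.2/
```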
@@ -129,76 +104,6 @@
This is a step in 4 parts:
5. (Optional) Generate preliminary code statistics and merge to external metadata. This step is not currently
   performed in the `joint_script*.sh` scripts.

## Pre-processing for a model

To run the pre-processing steps for a model, consider the sample sequence of steps below:

1. Filter patients to only those with at least 32 events (unique timepoints):

```bash
./scripts/preprocessing/filter_patients.py --multirun worker="range(0,3)" hydra/launcher=joblib input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" code_modifier_columns=null stage_configs.filter_patients.min_events_per_patient=32
```

2. Add time-derived measurements (age and time-of-day):

```bash
./scripts/preprocessing/add_time_derived_measurements.py --multirun worker="range(0,3)" hydra/launcher=joblib input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" code_modifier_columns=null stage_configs.add_time_derived_measurements.age.DOB_code="DOB"
```

3. Get preliminary counts for code filtering:

```bash
./scripts/preprocessing/collect_code_metadata.py --multirun worker="range(0,3)" hydra/launcher=joblib input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" code_modifier_columns=null stage="preliminary_counts"
```

4. Filter codes:

```bash
./scripts/preprocessing/filter_codes.py --multirun worker="range(0,3)" hydra/launcher=joblib input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" code_modifier_columns=null stage_configs.filter_codes.min_patients_per_code=128 stage_configs.filter_codes.min_occurrences_per_code=256
```

5. Get outlier detection params:

```bash
./scripts/preprocessing/collect_code_metadata.py --multirun worker="range(0,3)" hydra/launcher=joblib input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" code_modifier_columns=null stage=fit_outlier_detection
```

6. Filter outliers:

```bash
./scripts/preprocessing/filter_outliers.py --multirun worker="range(0,3)" hydra/launcher=joblib input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" code_modifier_columns=null
```

7. Fit normalization parameters:

```bash
./scripts/preprocessing/collect_code_metadata.py --multirun worker="range(0,3)" hydra/launcher=joblib input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" code_modifier_columns=null stage=fit_normalization
```

8. Fit vocabulary:

```bash
./scripts/preprocessing/fit_vocabulary_indices.py input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" code_modifier_columns=null
```

9. Normalize:

```bash
./scripts/preprocessing/normalize.py --multirun worker="range(0,3)" hydra/launcher=joblib input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" code_modifier_columns=null
```

## Limitations / TO-DOs:

Currently, some tables are ignored, including:
@@ -218,6 +123,11 @@
Other questions:
1. How should death times be merged between the `hosp` tables and the `patients` table? (See the sketch below.)
2. How should MIMIC's date-of-birth quirks (anchor years/ages rather than true dates of birth) be handled?
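
As a sketch of one possible approach to both questions (not the pipeline's actual logic), assuming the standard MIMIC-IV `hosp` column names (`anchor_year`, `anchor_age`, `dod`, `deathtime`) and using polars:

```python
import polars as pl

# Assumed MIMIC-IV v2.2 hosp-module file layout and column names.
patients = pl.read_csv("hosp/patients.csv.gz", try_parse_dates=True)
admissions = pl.read_csv("hosp/admissions.csv.gz", try_parse_dates=True)

# (2) DOB: MIMIC only provides anchor_year/anchor_age, so the most usable
# "date of birth" is a year, pinned here to January 1st.
patients = patients.with_columns(
    pl.date(pl.col("anchor_year") - pl.col("anchor_age"), 1, 1).alias("dob")
)

# (1) Death times: take the earliest admission-level deathtime per subject,
# falling back to patients.dod when no in-hospital death time exists.
hosp_death = admissions.group_by("subject_id").agg(
    pl.col("deathtime").min().alias("hosp_deathtime")
)
patients = patients.join(hosp_death, on="subject_id", how="left").with_columns(
    pl.coalesce(pl.col("hosp_deathtime"), pl.col("dod").cast(pl.Datetime)).alias(
        "death_datetime"
    )
)
```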

## Notes

Note: If you use a Slurm cluster and launch the Hydra submitit jobs from an interactive Slurm node, you
may need to run `unset SLURM_CPU_BIND` in your terminal first to avoid errors.

## Future Work

### Pre-MEDS Processing
40 changes: 29 additions & 11 deletions README.md
@@ -6,26 +6,25 @@
MEDS-formatted data.
Completed functionality includes scripts and utilities for extraction of various forms of raw data into the MEDS
format in a scalable, parallelizable manner, as well as general configuration management utilities for complex
pipelines over MEDS data. In-progress functionality includes more model-specific pre-processing steps for MEDS
data. See the "Roadmap" section below and [this google
doc](https://docs.google.com/document/d/14NKaIPAMKC1bXWV_IVJ7nQfMo09PpUQCRrqqVY6qVT4/edit?usp=sharing) for
more information.
data.

Examples of these capabilities in action can be seen in the `MIMIC-IV_Example` and `eICU_Example` directories,
which contain working, end-to-end examples to extract MEDS formatted data from MIMIC-IV v2.2 and eICU v2.0.
These directories also have `README.md` files with more detailed information on how to run the scripts in
those directories.
Examples of these capabilities in action can be seen in the `MIMIC-IV_Example` directory,
which contains a working, end-to-end example to extract MEDS-formatted data from MIMIC-IV v2.2. A working
example for eICU v2.0 is also present, though it needs to be adapted to recent interface improvements. These
directories also have `README.md` files with more detailed information on how to run the scripts in those
directories.

## Installation

- For a base installation, clone this repository and run `pip install .` from the repository root.
- For running the MIMIC-IV example, install the optional MIMIC dependencies as well with
`pip install .[mimic]`.
`pip install .[examples]`.
- To support same-machine, process-based parallelism, install the optional joblib dependencies with `pip install .[local_parallelism]`.
- To support cluster-based parallelism, install the optional submitit dependencies with
`pip install .[slurm_parallelism]`.
- For working on development, install the optional development dependencies with `pip install .[dev,tests]`.
- Optional dependencies can be installed together by combining their names with commas in
the square brackets, e.g., `pip install .[mimic,local_parallelism]`.
the square brackets, e.g., `pip install .[examples,local_parallelism]`.

## Design Philosophy

@@ -117,7 +116,7 @@
run with the `--help` flag, provide a detailed description of the script's purpose.
E.g.,

```bash
./scripts/extraction/shard_events.py --help
MEDS_extract-shard_events --help
== MEDS/shard_events ==
MEDS/shard_events is a command line tool that provides an interface for running MEDS pipelines.

This stage shards the raw input events into smaller files for easier processing.
files are csvs)
```
Note that these stage scripts can be used for either a full pipeline or just as a component of a larger,
user-defined process -- it is up to the user to decide how to leverage these scripts in their own work.
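
For instance, here is a minimal sketch of using one stage script as a component of a larger shell pipeline. `MEDS_extract-shard_events` is taken from the help text above; the directories and the follow-on command are hypothetical placeholders:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Run a single MEDS extraction stage in isolation (placeholder directories).
MEDS_extract-shard_events input_dir="$RAW_DIR" cohort_dir="$COHORT_DIR"

# ...then hand the sharded output to your own, user-defined tooling.
my_custom_qc_tool --input "$COHORT_DIR"  # hypothetical user-defined step
```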
### As an importable library
To use this repository as an importable library, the user should follow these steps:
1. Install the repository as a package.
2. Design your own transform function in your own codebase and leverage `MEDS_transforms` utilities such as
   `MEDS_transforms.mapreduce.mapper.map_over` to easily apply your transform over a sharded MEDS dataset.
3. Leverage the `MEDS_transforms` configuration schema to enable easy configuration of your pipeline, by
importing the MEDS transforms configs via your hydra search path and using them as a base for your own
configuration files, enabling you to intermix your new stage configuration with the existing MEDS
transform stages.
4. Note that, if your transformations are sufficiently general, you can also submit a PR to add new
transformations to this repository, enabling others to leverage your work as well.
See [this PR](https://github.com/mmcdermott/MEDS_transforms/pull/48) for an in-progress example of how to do
this.
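
As a minimal sketch of steps 1 and 2 (assuming `map_over` accepts the Hydra config plus a per-shard compute function; confirm the actual signature in the source before relying on this):

```python
"""A hypothetical custom MEDS transform script (sketch only)."""
import hydra
import polars as pl
from omegaconf import DictConfig

from MEDS_transforms.mapreduce.mapper import map_over


def uppercase_codes(df: pl.LazyFrame) -> pl.LazyFrame:
    """Toy per-shard transform: upper-cases every code in the shard."""
    return df.with_columns(pl.col("code").str.to_uppercase())


@hydra.main(version_base=None, config_path="configs", config_name="my_pipeline")
def main(cfg: DictConfig):
    # Assumed interface: map_over applies the compute function to each shard
    # named in the (MEDS_transforms-schema) config.
    map_over(cfg, compute_fn=uppercase_codes)


if __name__ == "__main__":
    main()
```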
### As a template
To use this repository as a template, the user should follow these steps:
@@ -429,5 +447,5 @@
to run anything):
```bash
./src/MEDS_transforms/scripts/preprocessing/normalize.py input_dir=foo cohort_dir=bar 'stages=["normalize", "tensorize"]' --cfg job --resolve
MEDS_transform-normalization input_dir=foo cohort_dir=bar 'stages=["normalization", "tensorization"]' --cfg job --resolve
```