Merge pull request #42 from mmcdermott/improved_documentation
Adding an improved documentation base and (eventually) readthedocs support.
mmcdermott authored Jul 26, 2024
2 parents a9c87f3 + d4f01eb commit 9a3ce89
Showing 3 changed files with 297 additions and 109 deletions.
102 changes: 6 additions & 96 deletions MIMIC-IV_Example/README.md
@@ -4,31 +4,6 @@
This is an example of how to extract a MEDS dataset from MIMIC-IV. All scripts in this directory should
be run **not** from this directory but from the root directory of this entire repository (e.g., one directory
up from this one).

**Status**: This is a work in progress. The code is not yet functional. Remaining work includes:

- [x] Implementing the pre-MEDS processing step.
- [x] Implement the joining of discharge times.
- [x] Implement the conversion of the DOB to a more usable format.
- [x] Implement the joining of death times.
- [ ] Testing the pre-MEDS processing step on live MIMIC-IV.
- [x] Test that it runs at all.
- [ ] Test that the output is as expected.
- [ ] Check the installation instructions on a fresh client.
- [x] Testing the `configs/event_configs.yaml` configuration on MIMIC-IV
- [x] Testing the MEDS extraction ETL runs on MIMIC-IV (this should be expected to work, but needs
live testing).
- [x] Sub-sharding
- [x] Patient split gathering
- [x] Event extraction
- [x] Merging
- [ ] Validating the output MEDS cohort
- [x] Basic validation (even though file sizes are weird, the number of rows looks consistent).
- [ ] Debug and remove rows with null codes! (there are a lot of them)
- [ ] Detailed validation

Note: If you use a Slurm cluster and launch the Hydra submitit jobs from an interactive Slurm node, you
may need to run `unset SLURM_CPU_BIND` in your terminal first to avoid errors.

## Step 0: Installation

Download this repository and install the requirements:
@@ -38,7 +13,7 @@

```bash
git clone git@github.com:mmcdermott/MEDS_transforms.git
cd MEDS_transforms
conda create -n MEDS python=3.12
conda activate MEDS
pip install .[examples]
pip install .[examples,local_parallelism]
```
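
A quick way to confirm the environment is working (the `__version__` attribute is an assumption here; a clean import is the real check):

```bash
python -c "import MEDS_transforms; print(getattr(MEDS_transforms, '__version__', 'ok'))"
```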

## Step 1: Download MIMIC-IV
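
For reference, a credentialed PhysioNet download of MIMIC-IV v2.2 typically looks like the sketch below; the username is a placeholder, and the version and destination should be adapted to your setup:

```bash
# Placeholder username; requires credentialed PhysioNet access to MIMIC-IV.
wget -r -N -c -np --user YOUR_PHYSIONET_USERNAME --ask-password \
  https://physionet.org/files/mimiciv/2.2/
```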
@@ -129,76 +104,6 @@
This is a step in 4 parts:
5. (Optional) Generate preliminary code statistics and merge to external metadata. This step is not currently
   performed in the `joint_script*.sh` scripts.

## Pre-processing for a model

To run the pre-processing steps for a model, consider the sample sequence of steps below:

1. Filter patients to only those with at least 32 events (unique timepoints):

```bash
./scripts/preprocessing/filter_patients.py --multirun worker="range(0,3)" hydra/launcher=joblib input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" code_modifier_columns=null stage_configs.filter_patients.min_events_per_patient=32
```

2. Add time-derived measurements (age and time-of-day):

```bash
./scripts/preprocessing/add_time_derived_measurements.py --multirun worker="range(0,3)" hydra/launcher=joblib input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" code_modifier_columns=null stage_configs.add_time_derived_measurements.age.DOB_code="DOB"
```

3. Get preliminary counts for code filtering:

```bash
./scripts/preprocessing/collect_code_metadata.py --multirun worker="range(0,3)" hydra/launcher=joblib input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" code_modifier_columns=null stage="preliminary_counts"
```

4. Filter codes:

```bash
./scripts/preprocessing/filter_codes.py --multirun worker="range(0,3)" hydra/launcher=joblib input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" code_modifier_columns=null stage_configs.filter_codes.min_patients_per_code=128 stage_configs.filter_codes.min_occurrences_per_code=256
```

5. Get outlier detection params:

```bash
./scripts/preprocessing/collect_code_metadata.py --multirun worker="range(0,3)" hydra/launcher=joblib input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" code_modifier_columns=null stage=fit_outlier_detection
```

6. Filter outliers:

```bash
./scripts/preprocessing/filter_outliers.py --multirun worker="range(0,3)" hydra/launcher=joblib input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" code_modifier_columns=null
```

7. Fit normalization parameters:

```bash
./scripts/preprocessing/collect_code_metadata.py --multirun worker="range(0,3)" hydra/launcher=joblib input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" code_modifier_columns=null stage=fit_normalization
```

8. Fit vocabulary:

```bash
./scripts/preprocessing/fit_vocabulary_indices.py input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" code_modifier_columns=null
```

9. Normalize:

```bash
./scripts/preprocessing/normalize.py --multirun worker="range(0,3)" hydra/launcher=joblib input_dir="$MIMICIV_MEDS_DIR/3workers_slurm" cohort_dir="$MIMICIV_MEDS_PROC_DIR/test" code_modifier_columns=null
```

## Limitations / TO-DOs:

Currently, some tables are ignored, including:
@@ -218,6 +123,11 @@
Other questions:
1. How should death times be merged between the `hosp` tables and the `patients` table? (See the sketch below.)
2. How should MIMIC's date-of-birth quirks (anchor years/ages rather than true dates of birth) be handled?
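
As a sketch of one possible approach to both questions (not the pipeline's actual logic), assuming the standard MIMIC-IV `hosp` column names (`anchor_year`, `anchor_age`, `dod`, `deathtime`) and using polars:

```python
import polars as pl

# Assumed MIMIC-IV v2.2 hosp-module file layout and column names.
patients = pl.read_csv("hosp/patients.csv.gz", try_parse_dates=True)
admissions = pl.read_csv("hosp/admissions.csv.gz", try_parse_dates=True)

# (2) DOB: MIMIC only provides anchor_year/anchor_age, so the most usable
# "date of birth" is a year, pinned here to January 1st.
patients = patients.with_columns(
    pl.date(pl.col("anchor_year") - pl.col("anchor_age"), 1, 1).alias("dob")
)

# (1) Death times: take the earliest admission-level deathtime per subject,
# falling back to patients.dod when no in-hospital death time exists.
hosp_death = admissions.group_by("subject_id").agg(
    pl.col("deathtime").min().alias("hosp_deathtime")
)
patients = patients.join(hosp_death, on="subject_id", how="left").with_columns(
    pl.coalesce(pl.col("hosp_deathtime"), pl.col("dod").cast(pl.Datetime)).alias(
        "death_datetime"
    )
)
```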

## Notes

Note: If you use a Slurm cluster and launch the Hydra submitit jobs from an interactive Slurm node, you
may need to run `unset SLURM_CPU_BIND` in your terminal first to avoid errors.

## Future Work

### Pre-MEDS Processing
40 changes: 29 additions & 11 deletions README.md
@@ -6,26 +6,25 @@
MEDS-formatted data.
Completed functionality includes scripts and utilities for extraction of various forms of raw data into the MEDS
format in a scalable, parallelizable manner, as well as general configuration management utilities for complex
pipelines over MEDS data. In-progress functionality includes more model-specific pre-processing steps for MEDS
data. See the "Roadmap" section below and [this google
doc](https://docs.google.com/document/d/14NKaIPAMKC1bXWV_IVJ7nQfMo09PpUQCRrqqVY6qVT4/edit?usp=sharing) for
more information.
data.

Examples of these capabilities in action can be seen in the `MIMIC-IV_Example` and `eICU_Example` directories,
which contain working, end-to-end examples to extract MEDS formatted data from MIMIC-IV v2.2 and eICU v2.0.
These directories also have `README.md` files with more detailed information on how to run the scripts in
those directories.
Examples of these capabilities in action can be seen in the `MIMIC-IV_Example` directory,
which contains a working, end-to-end example to extract MEDS-formatted data from MIMIC-IV v2.2. A working
example for eICU v2.0 is also present, though it needs to be adapted to recent interface improvements. These
directories also have `README.md` files with more detailed information on how to run the scripts in those
directories.

## Installation

- For a base installation, clone this repository and run `pip install .` from the repository root.
- For running the MIMIC-IV example, install the optional MIMIC dependencies as well with
`pip install .[mimic]`.
`pip install .[examples]`.
- To support same-machine, process-based parallelism, install the optional joblib dependencies with `pip install .[local_parallelism]`.
- To support cluster-based parallelism, install the optional submitit dependencies with
`pip install .[slurm_parallelism]`.
- For working on development, install the optional development dependencies with `pip install .[dev,tests]`.
- Optional dependencies can be installed together by combining their names with commas in
the square brackets, e.g., `pip install .[mimic,local_parallelism]`.
the square brackets, e.g., `pip install .[examples,local_parallelism]`.

## Design Philosophy

@@ -117,7 +116,7 @@
run with the `--help` flag, provide a detailed description of the script's purpose.
E.g.,

```bash
./scripts/extraction/shard_events.py --help
MEDS_extract-shard_events --help
== MEDS/shard_events ==
MEDS/shard_events is a command line tool that provides an interface for running MEDS pipelines.

This stage shards the raw input events into smaller files for easier processing.
files are csvs)
```
Note that these stage scripts can be used for either a full pipeline or just as a component of a larger,
user-defined process -- it is up to the user to decide how to leverage these scripts in their own work.
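
For instance, here is a minimal sketch of using one stage script as a component of a larger shell pipeline. `MEDS_extract-shard_events` is taken from the help text above; the directories and the follow-on command are hypothetical placeholders:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Run a single MEDS extraction stage in isolation (placeholder directories).
MEDS_extract-shard_events input_dir="$RAW_DIR" cohort_dir="$COHORT_DIR"

# ...then hand the sharded output to your own, user-defined tooling.
my_custom_qc_tool --input "$COHORT_DIR"  # hypothetical user-defined step
```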
### As an importable library
To use this repository as an importable library, the user should follow these steps:
1. Install the repository as a package.
2. Design your own transform function in your own codebase and leverage `MEDS_transforms` utilities such as
   `MEDS_transforms.mapreduce.mapper.map_over` to easily apply your transform over a sharded MEDS dataset.
3. Leverage the `MEDS_transforms` configuration schema to enable easy configuration of your pipeline, by
importing the MEDS transforms configs via your hydra search path and using them as a base for your own
configuration files, enabling you to intermix your new stage configuration with the existing MEDS
transform stages.
4. Note that, if your transformations are sufficiently general, you can also submit a PR to add new
transformations to this repository, enabling others to leverage your work as well.
See [this PR](https://github.com/mmcdermott/MEDS_transforms/pull/48) for an in-progress example of how to do
this.
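
As a minimal sketch of steps 1 and 2 (assuming `map_over` accepts the Hydra config plus a per-shard compute function; confirm the actual signature in the source before relying on this):

```python
"""A hypothetical custom MEDS transform script (sketch only)."""
import hydra
import polars as pl
from omegaconf import DictConfig

from MEDS_transforms.mapreduce.mapper import map_over


def uppercase_codes(df: pl.LazyFrame) -> pl.LazyFrame:
    """Toy per-shard transform: upper-cases every code in the shard."""
    return df.with_columns(pl.col("code").str.to_uppercase())


@hydra.main(version_base=None, config_path="configs", config_name="my_pipeline")
def main(cfg: DictConfig):
    # Assumed interface: map_over applies the compute function to each shard
    # named in the (MEDS_transforms-schema) config.
    map_over(cfg, compute_fn=uppercase_codes)


if __name__ == "__main__":
    main()
```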
### As a template
To use this repository as a template, the user should follow these steps:
@@ -429,5 +447,5 @@
to run anything):
```bash
./src/MEDS_transforms/scripts/preprocessing/normalize.py input_dir=foo cohort_dir=bar 'stages=["normalize", "tensorize"]' --cfg job --resolve
MEDS_transform-normalization input_dir=foo cohort_dir=bar 'stages=["normalization", "tensorization"]' --cfg job --resolve
```