Add a "reshard_by_split" stage that reshards a MEDS dataset into shards that subdivide splits via metadata/patient_splits.parquet #134

Closed
mmcdermott opened this issue Aug 10, 2024 · 2 comments
Labels
Blocking External Tools: For issues actively blocking external tools, such as ACES, MEDS-torch, MEDS-tab, etc.
MEDS Formal Compatability: For efforts to ensure formal compatibility with the MEDS schema
MEDS-Transform: Issues for the data pre-processing transformations in MEDS_transforms
New Transformation: Requests for a new transformation function that can be used in MEDS pipelines
priority:critical: A critical priority issue that should be solved and pushed to a new minor version release ASAP.
Release Blocking

Comments

@mmcdermott
Owner

This is not necessary for most applications, but applications that need to reliably load entire files while staying within a single split (such as the current implementation of MEDS-Tab) would benefit from it. In the current pipeline this may work best as two stages: a "sub-shard" stage followed by a "merge-shards" stage (both of which could be shared or merged with the extract stages of similar names), but I'm not fully sure yet.
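
To make the two-stage idea concrete, here is a rough sketch in plain Python. It is an illustration only, not code from MEDS_transforms: the function names `sub_shard` and `merge_shards`, the in-memory dict-of-rows representation, and the `patient_id` and `time` keys are all assumptions made for the example. A real stage would operate on parquet files on disk.

```python
from collections import defaultdict


def sub_shard(input_shards, patient_to_new_shard):
    """Stage 1 (hypothetical): split each input shard's rows by target output shard."""
    sub_shards = {}
    for in_name, rows in input_shards.items():
        for row in rows:
            out_name = patient_to_new_shard[row["patient_id"]]
            # Key each piece by (output shard, input shard) so pieces stay independent.
            sub_shards.setdefault((out_name, in_name), []).append(row)
    return sub_shards


def merge_shards(sub_shards):
    """Stage 2 (hypothetical): concatenate all pieces that target the same output shard."""
    merged = defaultdict(list)
    for (out_name, _in_name), rows in sorted(sub_shards.items()):
        merged[out_name].extend(rows)
    # Sort so each patient's events are contiguous and ordered by time.
    return {
        name: sorted(rows, key=lambda r: (r["patient_id"], r["time"]))
        for name, rows in merged.items()
    }


# Two input shards that mix splits; patients 1 and 3 are train, patient 2 is held out.
input_shards = {
    "0": [
        {"patient_id": 1, "time": 2},
        {"patient_id": 1, "time": 1},
        {"patient_id": 2, "time": 1},
    ],
    "1": [{"patient_id": 3, "time": 5}],
}
patient_to_new_shard = {1: "train/0", 2: "held_out/0", 3: "train/0"}

resharded = merge_shards(sub_shard(input_shards, patient_to_new_shard))
```

Keying the intermediate pieces by both output and input shard means each input shard can be processed independently in the first stage, which is the main appeal of splitting the work in two.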

@mmcdermott added the MEDS Formal Compatability, priority:critical, Release Blocking, New Transformation, MEDS-Transform, and Blocking External Tools labels on Aug 10, 2024
@mmcdermott
Owner Author

The issue in MEDS-tab being blocked by this: mmcdermott/MEDS_Tabular_AutoML#58

@mmcdermott
Owner Author

This can use extract.split_and_shard_patients.py to generate the new shards, with patients fixed to the splits defined in metadata/patient_splits.parquet, which is possible thanks to #124. These changes may also eventually inform #130.
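
As a minimal sketch of what "fixing the patients to the splits" could look like, the snippet below cuts each split's patients into shards so that no shard crosses a split boundary. It does not use or reproduce the actual extract.split_and_shard_patients.py logic. The function name `shards_within_splits`, the `n_patients_per_shard` parameter, the `split/index` shard naming, and the assumption that metadata/patient_splits.parquet reduces to (patient_id, split) pairs are all hypothetical.

```python
from collections import defaultdict


def shards_within_splits(patient_splits, n_patients_per_shard):
    """Map shard names such as "train/0" to patient IDs, never mixing splits.

    `patient_splits` is an iterable of (patient_id, split) pairs, standing in
    for the rows of metadata/patient_splits.parquet (assumed layout).
    """
    if n_patients_per_shard < 1:
        raise ValueError("n_patients_per_shard must be at least 1")

    # Group patients by their predefined split, preserving input order.
    by_split = defaultdict(list)
    for patient_id, split in patient_splits:
        by_split[split].append(patient_id)

    # Cut each split's patient list into consecutive chunks, one chunk per shard.
    shards = {}
    for split, patients in by_split.items():
        for start in range(0, len(patients), n_patients_per_shard):
            shard_name = f"{split}/{start // n_patients_per_shard}"
            shards[shard_name] = patients[start : start + n_patients_per_shard]
    return shards


# Illustrative rows only; the split names here are placeholders.
rows = [(1, "train"), (2, "train"), (3, "tuning"), (4, "train"), (5, "held_out")]
shards = shards_within_splits(rows, n_patients_per_shard=2)
```

With shard names prefixed by split, a downstream tool can load every file under one prefix and be sure it only sees patients from that split.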

Additionally, looking at the code, I think the best plan for now is to keep these stages separate from the extract.shard_events.py and extract.merge_to_MEDS_cohort.py stages, but to generalize the shard iterator function of the extract.merge_to_MEDS_cohort.py stage, since it can be used here as well.
