Add a "reshard_by_split" stage that reshards a MEDS dataset into shards that subdivide splits via metadata/patient_splits.parquet
#134
Labels
Blocking External Tools
For issues actively blocking external tools, such as ACES, MEDS-torch, MEDS-tab, etc.
MEDS Formal Compatibility
For efforts to ensure formal compatibility with the MEDS schema
MEDS-Transform
Issues for the data pre-processing transformations in MEDS_transforms
New Transformation
Requests for a new transformation function that can be used in MEDS pipelines
priority:critical
A critical priority issue that should be solved and pushed to a new minor version release ASAP.
Release Blocking
Most applications do not need this, but applications that need to reliably load entire files while staying within a single split (such as the current implementation of MEDS-Tab) would benefit from it. In the current pipeline this may work best as two stages: a "sub-shard" stage followed by a "merge-shards" stage (both of which could be shared or merged with the extract stages of similar names), but I'm not fully sure yet.
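To make the two-stage idea concrete, here is a minimal pure-Python sketch that works on patient IDs only, not on the actual MEDS_transforms stage API. The function names, the `n_patients_per_shard` parameter, the `split/index` shard naming, and the in-memory dicts standing in for the shard files and for metadata/patient_splits.parquet are all assumptions made for illustration.

```python
from collections import defaultdict


def sub_shard(shards, patient_split):
    """Stage 1 (hypothetical): cut each input shard into per-split pieces.

    shards: dict mapping shard name -> list of patient IDs in that shard.
    patient_split: dict mapping patient ID -> split name, standing in for
        the contents of metadata/patient_splits.parquet (assumed layout).
    """
    pieces = defaultdict(list)  # split name -> list of patient-ID lists
    for shard_name in sorted(shards):
        by_split = defaultdict(list)
        for pid in shards[shard_name]:
            # Raises KeyError if a patient has no assigned split.
            by_split[patient_split[pid]].append(pid)
        for split, pids in by_split.items():
            pieces[split].append(pids)
    return pieces


def merge_shards(pieces, n_patients_per_shard):
    """Stage 2 (hypothetical): merge pieces of one split into new shards."""
    out = {}
    for split in sorted(pieces):
        pids = [pid for piece in pieces[split] for pid in piece]
        for start in range(0, len(pids), n_patients_per_shard):
            idx = start // n_patients_per_shard
            out[f"{split}/{idx}"] = pids[start:start + n_patients_per_shard]
    return out


def reshard_by_split(shards, patient_split, n_patients_per_shard):
    """Run both stages so that every output shard lies within one split."""
    return merge_shards(sub_shard(shards, patient_split), n_patients_per_shard)


# Toy input: two shards that mix patients from three splits.
shards = {"0": [1, 2, 3], "1": [4, 5, 6]}
patient_split = {
    1: "train", 2: "tuning", 3: "train",
    4: "held_out", 5: "train", 6: "tuning",
}
resharded = reshard_by_split(shards, patient_split, n_patients_per_shard=2)
```

With this toy input, each output shard name starts with its split, so a consumer can load a whole file without ever crossing a split boundary. A real stage would apply the same grouping to the event rows of each shard rather than to bare patient IDs.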