Make it such that external_splits specification can point to a patient_splits.parquet file or a prior splits.json file from MEDS-extract to match the cohort.
#130
Labels: documentation, MEDS Formal Compatability, MEDS-Extract, priority:medium, Usability / Interface
Right now, if you point external_splits to a prior dataset's splits.json file, it will treat the shard name as part of the split. This should be fixed such that you can point to a single "splits" file and have it reload the right splits, not the shards part.

Tagging @prenc for tracking
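To make the problem concrete, here is a minimal sketch of collapsing shard-level keys back into split names. It assumes that the prior splits.json maps keys of the form "<split>/<shard index>" (for example "train/0") to lists of patient IDs; that shape, the helper name collapse_shards_to_splits, and the sample data are illustrative assumptions, not the existing MEDS-extract implementation.

```python
import re
from collections import defaultdict

# Assumed shard-key shape: "<split>/<shard index>", e.g. "train/0".
SHARD_KEY = re.compile(r"^(?P<split>.+)/\d+$")


def collapse_shards_to_splits(shards):
    """Merge a shard -> patient IDs map into a split -> patient IDs map."""
    splits = defaultdict(list)
    for shard_name, patient_ids in shards.items():
        match = SHARD_KEY.match(shard_name)
        # Keys without a numeric shard suffix are treated as plain split names.
        split = match.group("split") if match else shard_name
        splits[split].extend(patient_ids)
    return dict(splits)


# Hypothetical contents of a prior dataset's splits.json.
prior = {"train/0": [1, 2], "train/1": [3], "tuning/0": [4], "held_out/0": [5]}
collapsed = collapse_shards_to_splits(prior)
```

With this kind of collapsing, the reloaded cohort depends only on the split assignment, so a new run is free to choose its own shard sizes.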
My current thoughts as to what should change about this:
- splits.json should be renamed to .shards.json (note it is being made a hidden file). It should rarely be used. This also conforms with issue #129 ("Pipelines should automatically determine shards from the input directory rather than relying on the splits.json file") about how this should be standardized.
- external_splits should be made to work with dataframe files (like patient_splits.parquet).
- external_splits should throw a warning if it seems to be given a shards.json file and default to collapsing splits down to standardized names (though this should be controllable with an option in the stage_cfg).
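The proposed behavior could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name load_external_splits, the collapse_shards flag (standing in for the stage_cfg option), the "patient_id" and "split" column names, and the "<split>/<shard index>" key pattern are all hypothetical and are not taken from the MEDS-extract code base.

```python
import json
import re
import warnings
from collections import defaultdict
from pathlib import Path

SHARD_KEY = re.compile(r"^(.+)/\d+$")  # assumed shard-key shape, e.g. "train/0"


def load_external_splits(path, collapse_shards=True):
    """Return a split -> patient IDs map from a parquet or JSON file (hypothetical helper)."""
    path = Path(path)
    if path.suffix == ".parquet":
        import polars as pl  # imported lazily; only needed for dataframe files

        df = pl.read_parquet(path)
        splits = defaultdict(list)
        # Column names "patient_id" and "split" are assumptions.
        for row in df.iter_rows(named=True):
            splits[row["split"]].append(row["patient_id"])
        return dict(splits)

    raw = json.loads(path.read_text())
    looks_like_shards = bool(raw) and all(SHARD_KEY.match(key) for key in raw)
    if not looks_like_shards:
        return raw  # already keyed by split name

    warnings.warn(f"{path} looks like a shards file, not a splits file.")
    if not collapse_shards:
        return raw  # caller opted out of collapsing

    splits = defaultdict(list)
    for shard_name, patient_ids in raw.items():
        splits[SHARD_KEY.match(shard_name).group(1)].extend(patient_ids)
    return dict(splits)
```

Defaulting collapse_shards to True mirrors the suggestion that collapsing be the default, while the warning keeps it visible that a shards file was supplied where a splits file was expected.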