
We need a more robust interface for ways of (a) processing numerical and categorical values and (b) normalizing output data in light of those modes. #177

Open
1 of 3 tasks
Tracked by #35
mmcdermott opened this issue Aug 25, 2024 · 5 comments
Labels
Blocking External Tools For issues actively blocking external tools, such as ACES, MEDS-torch, MEDS-tab, etc. MEDS-Transform Issues for the data pre-processing transformations in MEDS_transforms Needs Clarification This issue needs further clarification before it can be operationalized New Transformation Requests for a new transformation function that can be used in MEDS pipelines priority:high A high priority issue. Release Blocking

Comments

@mmcdermott
Owner

mmcdermott commented Aug 25, 2024

This includes both some discussion around how we should structure these stages and some key implementation details. More details forthcoming.

@mmcdermott mmcdermott mentioned this issue Aug 25, 2024
24 tasks
@mmcdermott mmcdermott added priority:high A high priority issue. Release Blocking New Transformation Requests for a new transformation function that can be used in MEDS pipelines MEDS-Transform Issues for the data pre-processing transformations in MEDS_transforms Needs Clarification This issue needs further clarification before it can be operationalized Blocking External Tools For issues actively blocking external tools, such as ACES, MEDS-torch, MEDS-tab, etc. labels Aug 25, 2024
@mmcdermott
Owner Author

As of now, I'm thinking we need the following capabilities:

  1. Numeric values are extracted from text (e.g., blood pressure recorded as "120/80"). This is part of #121 ("Adds an 'extract_values' transform to extract values and retype them from input MEDS data").
  2. Numeric values are converted to part of the code in raw form (e.g., Glasgow Coma Score is recorded numerically, but we want to embed it categorically). This is also part of #121.
  3. Numeric values are normalized to have zero mean and unit variance. This is currently supported, and revisions to that API are handled in #65 ("Normalize transform should be separated and configurable for different normalization modes, between normalizing numerical values and codes, etc."). I suspect the right way to do this is to have one stage that computes means and variances from the metadata.parquet file alone, then another stage that uses them to do the normalization. Right now both happen in the same stage. Other normalization strategies (e.g., min-max) should likewise be added and are also covered in #65.
  4. Numeric values are binned into buckets. The binned values are then one of:
    • Combined with the code (this is used). The codes.parquet metadata file must be updated when this happens or in an immediately subsequent stage.
    • Added as a new event with just a value-specific code (this is used in ETHOS). The codes.parquet metadata file must be updated when this happens or in an immediately subsequent stage.
    • Kept as a numerical value? (I don't know if this is used.)
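To make items 3 and 4 concrete, here is a minimal sketch in plain Python, not the MEDS_transforms API. It assumes the code metadata already holds per-code aggregates; the field names `n`, `sum`, and `sum_sqd`, the function names, and the `//bin_k` code suffix are all hypothetical choices for illustration. The first stage derives means and standard deviations from metadata alone, the second applies them to events, and the last function folds a bin index into the code and reports the new codes that a metadata update would need to cover.

```python
from bisect import bisect_right
from math import sqrt

# Toy events as (code, numeric_value); illustrative data only.
events = [("HR", 60.0), ("HR", 80.0), ("HR", 100.0), ("TEMP", 36.0), ("TEMP", 38.0)]

# Hypothetical per-code aggregates, standing in for rows of a metadata file.
code_metadata = {
    "HR": {"n": 3, "sum": 240.0, "sum_sqd": 20000.0},
    "TEMP": {"n": 2, "sum": 74.0, "sum_sqd": 2740.0},
}


def fit_stats_from_metadata(metadata):
    """Stage A: derive mean and std per code using only the aggregates."""
    stats = {}
    for code, agg in metadata.items():
        mean = agg["sum"] / agg["n"]
        # Population variance: E[x^2] - E[x]^2, clamped at 0 against round-off.
        var = max(agg["sum_sqd"] / agg["n"] - mean**2, 0.0)
        stats[code] = {"mean": mean, "std": sqrt(var)}
    return stats


def apply_normalization(events, stats):
    """Stage B: z-score each value with the statistics fitted in stage A."""
    out = []
    for code, value in events:
        s = stats[code]
        z = (value - s["mean"]) / s["std"] if s["std"] > 0 else 0.0
        out.append((code, z))
    return out


def bin_into_code(events, bin_edges):
    """Item 4, first option: fold the bin index into the code string.

    Returns the rewritten events plus the sorted set of new codes, which is
    what a codes metadata update would have to include.
    """
    binned = []
    for code, value in events:
        idx = bisect_right(bin_edges[code], value)  # 0 .. len(edges)
        binned.append((f"{code}//bin_{idx}", None))  # value is dropped
    return binned, sorted({c for c, _ in binned})
```

Keeping stage A free of any event data is what allows other strategies (e.g., min-max) to be swapped in by changing only the fitted statistics, provided the metadata carries the aggregates they need.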

@mmcdermott
Owner Author

@Oufattole @prenc does this sound right to you?

@Oufattole
Collaborator

Thanks Matthew, great summary!

Looks good to me. My only question is how (4) will preserve the requirement that time-derived features precede all other features, or the lexicographic ordering computed in a previous vocab_indices stage. Would a user be expected to run a vocab_indices stage again afterwards?
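One way to picture the concern: if binning introduces new codes, previously assigned indices no longer cover the vocabulary, so the ordering has to be recomputed. The sketch below is only an illustration under assumptions, not what the vocab_indices stage actually does. The function name, the time-derived prefixes `AGE` and `TIME_OF_DAY`, and the choice to start indices at 1 (as if 0 were reserved) are all hypothetical.

```python
def assign_vocab_indices(codes, time_derived_prefixes=("AGE", "TIME_OF_DAY")):
    """Assign indices so time-derived codes come first, then lexicographic order.

    The prefixes and the 1-based start are assumptions for illustration.
    """

    def is_time_derived(code):
        # str.startswith accepts a tuple of candidate prefixes.
        return code.startswith(time_derived_prefixes)

    # Sort key: time-derived codes (False sorts first), then alphabetical.
    ordered = sorted(set(codes), key=lambda c: (not is_time_derived(c), c))
    return {code: idx for idx, code in enumerate(ordered, start=1)}


# Codes after a hypothetical binning stage has added "//bin_k" suffixes.
codes_after_binning = ["HR//bin_1", "AGE", "HR//bin_0", "TIME_OF_DAY//[06,12)"]
vocab = assign_vocab_indices(codes_after_binning)
```

Under these assumptions, rerunning the index assignment after binning restores both properties, which is the behaviour the question asks about.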

@mmcdermott
Owner Author

mmcdermott commented Aug 25, 2024 via email

@mmcdermott
Owner Author

FYI @Oufattole and @prenc: #57 has been closed, and in MEDS v0.0.6 on PyPI you can now run an extract_values stage, which lets you convert numeric values to categorical values and vice versa as desired. See an example (I know, documentation is forthcoming) here: https://github.com/mmcdermott/MEDS_transforms/blob/main/tests/test_extract_values.py
