What should we do about additional extracted data outside the scope of the MEDS label schema? #97

Open
mmcdermott opened this issue Aug 11, 2024 · 2 comments
Labels
MEDS Compatibility: Compatibility with the Medical Event Data Standard (MEDS) data schema
Needs Clarification: For issues that need clarification before they can be addressed.
priority:medium: Things that are medium priority, and warrant attention for next larger version updates

Comments

@mmcdermott
Collaborator

Options are:

  1. Write that info to an additional file.
  2. Write those columns anyways, and not be truly compliant.
  3. See if MEDS can expand the label_schema to include additional columns much as data does.

@justin13601 added the priority:low, Needs Clarification, MEDS Compatibility, and priority:medium labels and removed the priority:low label on Aug 12, 2024

@Oufattole
Contributor

The use case I have in mind for additional data is that users may wish to extract windows of data for contrastive learning tasks. For example, I may wish to extract one window of data before an event and one after it (an inpatient admission, say). For each window you need a start and end time, and that is it, right?

So the label_schema is:

label = pa.schema(
    [
        ("subject_id", pa.int64()),
        ("prediction_time", pa.timestamp("us")),
        ("boolean_value", pa.bool_()),
        ("integer_value", pa.int64()),
        ("float_value", pa.float64()),
        ("categorical_value", pa.string()),
    ]
)

It seems the intention of the label schema is for supervised tasks where you just need a prediction time and label, so it doesn't seem appropriate to add window information to that. I would advocate for an additional file with window-based data using a struct per window as you used in v0.3.2 of aces (I think).
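As a rough sketch of the additional-file idea (option 1 above), extracted columns could be partitioned into the ones named in the label schema and everything else. The helper name `split_label_columns`, the `JOIN_KEYS` choice, and the example extra column are all hypothetical, and plain dicts stand in for real tables here:

```python
# Column names copied from the label schema quoted above.
LABEL_COLUMNS = (
    "subject_id",
    "prediction_time",
    "boolean_value",
    "integer_value",
    "float_value",
    "categorical_value",
)

# Hypothetical choice: repeat these keys in the extras file so it can be
# joined back to the label file later.
JOIN_KEYS = ("subject_id", "prediction_time")


def split_label_columns(row):
    """Split one extracted row into (label_row, extras_row)."""
    # Label columns missing from the input are filled with None.
    label_row = {col: row.get(col) for col in LABEL_COLUMNS}
    extras_row = {key: row.get(key) for key in JOIN_KEYS}
    # Everything outside the label schema goes to the extras row.
    extras_row.update({k: v for k, v in row.items() if k not in LABEL_COLUMNS})
    return label_row, extras_row
```

With this split, the label file stays strictly within the schema, and the second file carries whatever else was extracted.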

Let's suppose for my example I use this config:

predicates:
    admission:
        code: ADMISSION

trigger: admission

windows:
    pre:
        start: null
        end: trigger
        start_inclusive: True
        end_inclusive: False
    post:
        start: pre.end
        end: null
        start_inclusive: True
        end_inclusive: True

Maybe you could store each window in one file with an event index, and the file schema would be:

window = pa.schema(
    [
        ("subject_id", pa.int64()),
        ("start_time", pa.timestamp("us")),
        ("end_time", pa.timestamp("us")),
    ]
)
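
To make the idea concrete, here is a minimal sketch of how rows for such a window file might be produced for the config above. It assumes that a `null` start or end means the window runs to the first or last event in the subject's record (my reading, not confirmed here), and the function name and the `window` column are hypothetical:

```python
from datetime import datetime


def windows_for_admission(subject_id, event_times, admission_time):
    """Return one row per window ("pre", "post") for a single trigger event."""
    first, last = min(event_times), max(event_times)
    return [
        # pre: start of record (start: null) up to the trigger; the config
        # marks this end as exclusive.
        {"subject_id": subject_id, "window": "pre",
         "start_time": first, "end_time": admission_time},
        # post: from pre.end (the trigger) to the end of the record (end: null).
        {"subject_id": subject_id, "window": "post",
         "start_time": admission_time, "end_time": last},
    ]


events = [datetime(2024, 1, 1), datetime(2024, 1, 5), datetime(2024, 1, 9)]
rows = windows_for_admission(7, events, admission_time=datetime(2024, 1, 5))
```

The inclusive/exclusive flags are not stored in these rows; they would have to come from the config or from extra columns.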

@mmcdermott
Collaborator Author

I don't think we need a formal pa schema for this extra information -- we can just use the old output format with the structs, right?
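
For illustration only, a struct-per-window row could look roughly like the sketch below: one row per trigger, with a nested record for each window. The field names are hypothetical and are not taken from the actual ACES output format:

```python
from datetime import datetime


def to_struct_row(subject_id, prediction_time, windows):
    """Build one row with a nested {start_time, end_time} record per window."""
    row = {"subject_id": subject_id, "prediction_time": prediction_time}
    for name, (start, end) in windows.items():
        # Each window becomes its own struct-like column, keyed by window name.
        row[name] = {"start_time": start, "end_time": end}
    return row


admission = datetime(2024, 1, 5)
row = to_struct_row(
    subject_id=7,
    prediction_time=admission,
    windows={
        "pre": (datetime(2024, 1, 1), admission),
        "post": (admission, datetime(2024, 1, 9)),
    },
)
```

Because the set of window columns depends on the task config, a layout like this is hard to pin down in a single fixed schema, which fits the point about not needing a formal one.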
