Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MEDS 0.3 Release Candidate #32

Merged
merged 21 commits into from
Jul 30, 2024
Merged

MEDS 0.3 Release Candidate #32

merged 21 commits into from
Jul 30, 2024

Conversation

mmcdermott
Copy link
Contributor

No description provided.

@mmcdermott
Copy link
Contributor Author

For the label schema, I think we need a bit more clarity. In particular,

  1. Is prediction_time the time at which the prediction should be made? The last timestamp at which point data is allowed to be read in (and if so, inclusive or exclusive)? The inclusivity/exclusivity point we may need to add another column to specify.
  2. Are we comfortable with this omitting the possibility of specifying tasks that occlude some parts of an event from the input window but not others? E.g., given everything up to a visit and the data of that visit that aren't meds, predict the meds prescribed.
  3. Do we maybe want to replace "_value" with "_label"?

@mmcdermott
Copy link
Contributor Author

I assume there is no reason we don't want a TypedDict version of the split schema and will add one.

@EthanSteinberg
Copy link
Collaborator

Is prediction_time the time at which the prediction should be made? The last timestamp at which point data is allowed to be read in (and if so, inclusive or exclusive)?

Yep. The idea is that it's legal to use all features up to and including the timepoint of the prediction time (inclusive). In my experience inclusive makes things a bit easier.

The documentation for this could perhaps be improved.

The inclusivity/exclusivity point we may need to add another column to specify.

I think that's unnecessary complexity at this point.

Are we comfortable with this omitting the possibility of specifying tasks that occlude some parts of an event from the input window but not others? E.g., given everything up to a visit and the data of that visit that aren't meds, predict the meds prescribed.

I think we should avoid trying to cover those sorts of complex labels with this schema.

Do we maybe want to replace "_value" with "_label"?

I don't think that's a good idea. I think that would lead to confusion since then the word "label" would be overloaded a bit. I think we want "label" to be the pair of a value + prediction_time.

@mmcdermott
Copy link
Contributor Author

I have a slight concern about omitting an explicit quantifier for inclusive or exclusive. If we don't allow people to specify it, and someone wants to use that, the next default would be to add or subtract a small amount from the prediction time, and that feels brittle to me if timestamps are binned or the granularity is reduced at all for a dataset.

I also think there are reasonable settings where inclusivity or exclusivity matters. E.g., if you are trying to predict something about a patient at a particular time without observing that value, then the natural way to express that is to say that you have an exclusive prediction time of the time at which you want the thing predicted, which enables models to leverage both the observations up to that prediction time without including the answer and the knowledge of when the target value that is being predicted is projected to happen -- this is much harder to express without a notion of exclusivity.

@EthanSteinberg
Copy link
Collaborator

I think the simplicity gains of only having one type of label outweigh the minor annoyance of people having to subtract 1 microsecond every now and then for labels that need exclusivity.

I have a slight concern about omitting an explicit quantifier for inclusive or exclusive. If we don't allow people to specify it, and someone wants to use that, the next default would be to add or subtract a small amount from the prediction time, and that feels brittle to me if timestamps are binned or the granularity is reduced at all for a dataset.

Our datasets have a specified granularity of "[us]" so brittleness shouldn't be a problem. Subtracting one microsecond is always correct.

Exclusivity does matter, but I think it's easier to just let labelers subtract the microsecond when they need to rather than force a lot more complexity throughout the entire setup.

@mmcdermott
Copy link
Contributor Author

mmcdermott commented Jul 30, 2024

I suspect we'll eventually want to revisit this but for now I'm fne dropping the issue of label exclusivity. WIth the documentation changes in #35, and once tests are made to pass, I'm happy with this to go in.

@mmcdermott mmcdermott marked this pull request as ready for review July 30, 2024 19:21
@mmcdermott mmcdermott merged commit 8ba7cb1 into main Jul 30, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants