[Proposal] Usability: Change from double -> single nesting #20
I would 100% agree, if only for the conceptual simplicity. The initial reason for double nesting was to improve performance. However, this seems to not matter much anymore: it only seems to be a 4% improvement for double nesting. Here are benchmark numbers for double nesting vs single nesting when counting the number of codes in the MIMIC-IV demo dataset:

- Single nested: 131 MB RAM, 6.7 MB disk, 38.74 seconds
- Double nested: 125 MB RAM, 7.1 MB disk, 37.21 seconds
The benchmarking code for this was quite simple:
I ran this 3 times on an isolated machine and took the median time.
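The actual benchmark script isn't reproduced in this thread, but as a hypothetical sketch (the record layouts and field names below are illustrative stand-ins, not the MEDS schema), counting codes under the two representations differs only by one loop level:

```python
import collections

# Illustrative stand-ins for the two nesting styles (not the real MEDS schema):
# double nesting groups measurements under per-timestamp events, while single
# nesting attaches the timestamp to each measurement directly.
double_nested = [
    {"patient_id": 1, "events": [
        {"time": "2020-01-01", "measurements": [{"code": "ICD10/E11"}, {"code": "LAB/HbA1c"}]},
    ]},
]
single_nested = [
    {"patient_id": 1, "measurements": [
        {"time": "2020-01-01", "code": "ICD10/E11"},
        {"time": "2020-01-01", "code": "LAB/HbA1c"},
    ]},
]

def count_codes_double(patients):
    counts = collections.defaultdict(int)
    for patient in patients:
        for event in patient["events"]:  # one extra loop level vs single nesting
            for m in event["measurements"]:
                counts[m["code"]] += 1
    return counts

def count_codes_single(patients):
    counts = collections.defaultdict(int)
    for patient in patients:
        for m in patient["measurements"]:
            counts[m["code"]] += 1
    return counts

# Both representations yield the same counts; only the iteration overhead differs.
assert count_codes_double(double_nested) == count_codes_single(single_nested)
```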
#23 is an example of how double nesting results in more convoluted code than single nesting would.
Here are some counterarguments to consider as we come to a final decision:
Before we come to a final decision here, there are a few things I'd advocate we assess:
My thinking is that both of these changes would be good for MEDS overall and would help make this issue clearer, so I would advocate that we do those before we finalize this decision, if others think that sounds reasonable.
Thanks for the response. I have some comments about some of those points.
I think there are actually really strong arguments that single nesting is conceptually simpler. It's one less field and one less conceptual entity in general. Even more importantly, it's one less invariant in the schema that people need to track. Right now our documentation tells people that there should only really be one event per timestamp, but we can't easily enforce that invariant and it can get easily confused / lost. You can see this confusion in practice with #23.
The conversion functions in both directions are O(n) functions that can be done in a single utility function call. The "easiness" of conversion is not important because conversion is just a cheap function call away.
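To make the "cheap function call" point concrete, here is a hypothetical sketch of the two O(n) converters (the record layout and field names are my own illustration, not the MEDS schema; the single-to-double direction assumes measurements are already sorted by time):

```python
import itertools

def double_to_single(patient):
    """Flatten events: attach each event's time to its measurements. O(n)."""
    return {
        "patient_id": patient["patient_id"],
        "measurements": [
            {"time": event["time"], **m}
            for event in patient["events"]
            for m in event["measurements"]
        ],
    }

def single_to_double(patient):
    """Group measurements sharing a timestamp back into events. O(n),
    assuming measurements are already sorted by time."""
    events = [
        {"time": time,
         "measurements": [{k: v for k, v in m.items() if k != "time"} for m in group]}
        for time, group in itertools.groupby(
            patient["measurements"], key=lambda m: m["time"])
    ]
    return {"patient_id": patient["patient_id"], "events": events}

# Round trip: converting down and back up recovers the original record.
patient = {"patient_id": 1, "events": [
    {"time": "t1", "measurements": [{"code": "A"}, {"code": "B"}]},
    {"time": "t2", "measurements": [{"code": "C"}]},
]}
assert single_to_double(double_to_single(patient)) == patient
```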
In practice, it seems to have minimal impact because of the additional storage considerations of other metadata. Single digit increases in efficiency are not worth huge complexity costs. Not to mention that double nesting decreases storage efficiency of compressed data (as it's a bit harder to compress a double nested format well compared to a single nested format). If we care about efficiency here, we should focus on at least double digit performance increases like better code handling.
I don't see why that would be necessary. Operations that have to touch every code in the timeline are the most expensive (and most common) operations, and counting the codes is a best-case scenario for double nesting vs single nesting: it is a cheap operation, so the runtime is fully driven by the overhead of the representations. If double nesting's improvement there is so tiny, I don't see how you are going to see much benefit in other settings.
I think we have already done this? The concrete use cases we care about are running labeling functions, generating tabular features, and generating features for neural networks.
@EthanSteinberg, are you really opposed to having a set of benchmarks we can use persistently, independently of this issue? Or are you just saying that you don't want to block the closing of this issue on the establishment of such benchmarks? I think (a) benchmarks would be valuable in general, and in addition I think (b) that under the assumption that we will have benchmarks, waiting to close this issue until benchmarks are complete so we can evaluate the strategies under those benchmarks makes sense. But (a) and (b) are separate points, and I'm not sure whether you're objecting to only one or to both.
@mmcdermott I'm saying that we already have the benchmark we need for this issue. And that we have a strong incentive to move quicker rather than slower on a major schema thing like this.
If we need additional benchmarking here to make a conclusion, let's write down a precise list of what benchmarks we need and I'll run them.
I created #24 to track the question of if and what benchmarking would be needed. Ultimately I feel more advanced benchmarking would be helpful here. Of the things discussed in #24, I think critical would be:

(a) assessing what you have assessed already under different dataset properties (e.g., # of measurements per event, as I think that is the parameter that controls the possible efficiency penalty here) and dataset sizes, and

(b) assessing the impact on more realistic usage patterns, in particular characterizing the cost of transforming data between single and double nested formats, and the cost of other operations on single or double nested formats (e.g., I would expect operations like adding time-derived measures like age and time-of-day to favor the double nested format, and operations like ontology expansion to maybe favor the single-nested format, but I don't have a good sense of the expected magnitudes).

I'd also be very interested in (c) understanding how this impacts eventual tensorization and GPU iteration costs, but I understand that is less relevant ultimately to this decision, because that is always going to be model-dictated, not standard-dictated, whereas the implication on other pre-processing steps is something where the raw format will have a bigger impact.
I agree and disagree a bit with your proposed experiment list. Let's simplify things a bit by numbering them.

a)
- I agree with this one; I think it is valuable.
- This one doesn't make sense to me: all of our costs should be proportional to the size of the dataset, so there is no way this could have an impact here.

b)
- I don't think we need to measure the cost of going from double to single nested; we only need to measure single to double, as we already know double is slightly more efficient.
- We only need to measure this if the transformation cost (b3) is significant, and I highly doubt that cost will be significant, as you can always transform back to double and run the double algorithm if that one is faster.

c)
- Same answer as 4: irrelevant if b3 is a minor cost.

From this I would propose the following additional experiments to prove that single nesting causes only a marginal decrease in efficiency. Measure two things:

Under two experimental conditions:
Ok. I now have all the results of the experiments you requested, and they appear to back my claim that the overhead from single nesting is very minimal.

https://github.com/EthanSteinberg/meds_benchmark is my benchmark code repository. I have taken the opportunity to make a couple of improvements from before. In particular, I removed the Python startup time from my measurements and increased the run count to 30 runs per program to reduce the influence of noise.

https://docs.google.com/spreadsheets/d/1sYn7RGkk8O1ra_NKgNcOUN2oeX2wPKwtTMBmqJLH3_E/edit#gid=0 contains my result table. We are measuring on two datasets: the real MIMIC demo, and a semi-artificial version of the MIMIC demo where every event has the same timestamp.

Summary: Single nested has a RAM overhead of about 5%, with no (or negative) disk overhead compared to double nested, for both the real data and the data with only one timestamp value. (The disk values are different from before because I used a different library to generate the compressed files, but the single nested vs double nested comparison should be valid.)

The results are a bit more complex for the timing experiments.

Real data:

Semi-artificial data with only one timestamp:

I believe this addresses all your concerns from a), b), and c) that you listed in your desired experiments. In particular, for b) and c) we now know that the overhead is bounded between 5% and 7% for real data. Even in a very extreme case, where everything has the same timestamp, the overhead is still only 12%.

It's also important to note that these overheads are overestimates. In real code there will be more compute in the inner for loops, so the influence of this overhead will be much smaller in practice.

Are there any other experiments you want to see?
Thanks for expanding the benchmarks and humoring my desire for greater investigation here @EthanSteinberg. It is definitely looking like moving to single nesting would make more sense, but there are two additional things I'd want to see first.
As a different note -- given the minimal impact on performance you are seeing here, is it worth actually moving to no nesting? E.g., having a totally flat structure of events which contain patient IDs directly alongside timestamps and code/value? To be clear, I still see the double nested format as the simplest, conceptually, but I recognize that of those on the issue so far at least (and I suspect more broadly) I am in the minority, so perhaps it is worth considering moving to an even more flat structure?

Finally, independent of this issue, for the benchmark system in general I think there is some merit in controlling for dataset size and such in a more granular way than you have done here. This need not gate our decision here, to be clear, but I wanted to highlight that it is fully possible that performance will differ both qualitatively and quantitatively at different data scales, and we should be wary of that.
Also as we approach finalizing a decision here, tagging @tompollard @rvandewater @Jwoo5 so you all can weigh in on this too. @Miking98 and @jason-fries obviously more comments from you would be welcome but I took your prior indications on this thread to indicate support for this more generally. |
I also want to make an important point to emphasize why adding double nesting for a 5-7% improvement isn't worth it. There are so many other and better opportunities for improving performance than paying the complexity price for double nesting. Let's give an example: all the experiments above were run using huggingface datasets. If we use femr's optimized data loader instead, the runtime drops by about 40 times.

```python
import collections

import femr_cpp

code_counts = collections.defaultdict(int)

def process_patient(patient):
    for measurement in patient['measurements']:
        code_counts[measurement['code']] += 1

femr_cpp.process_patients("mimic_demo", process_patient)
print(sorted(list(code_counts.items()))[:10])
```

The time of that operation goes down to 0.77 seconds (instead of 30.63 for double nested with huggingface datasets).
Well, that is a much more significant improvement, and I see and agree with your point that there are higher priority things to focus on in terms of performance. But, as MEDS doesn't include
I vote for single nested.
@jason-fries do we have numbers for the ETL complications? If those are still present, that would both (1) be something that should definitely be included in benchmarks and (2) be conclusive evidence in favor of single nesting here. To my knowledge based on a hazy recollection from one of our calls, I thought that the polars concerns that @Miking98 mentioned in the issue description were not fully accurate, so I was assuming that those slowdowns were not in play here -- if they are, that would definitely impact my recommendation. Also, should we go even farther, if the impact on ETLs is that large -- to a fully unnested format, where we collapse out over patients as well? |
Right now, I also favor single nesting. I have to add that I have not yet worked extensively with FEMR or MEDS in my pipeline. This might change soon, and then I can weigh in more. |
I've been heavily using both MEDS and FEMR on STARR-OMOP (3M patients) and private data (~50k patients at a time, on the order of MIMIC-scale).
I would strongly advocate for the fully un-nested format, where we effectively have a single (potentially sharded) table with each row as a measurement. Ignoring theoretical O(*) arguments here, the fact is that many tools (duckdb, polars, SQL, snowflake, PySpark) are all built and optimized for this kind of data representation. Take, for example, duckdb's larger-than-memory out-of-core processing capabilities. They explicitly state that aggregate functions such as

Additionally, if you're using a data warehouse backend like Snowflake or BigQuery, then wrangling fully un-nested data and just trading money for speed is easy. If, instead, conversion to MEDS leads to a list type in a column (at the level of patients for single-nesting, and additionally at the level of events for double-nesting), you essentially throttle the speed of any post-processing steps by potentially orders of magnitude.

Strongly in favor of completely un-nested.
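To illustrate the point about flat tables playing well with SQL engines, here is a minimal sketch using the stdlib's sqlite3 in place of duckdb/Snowflake (the table layout and column names are illustrative, not a finalized MEDS Flat schema): per-patient aggregations like measurement counts and last-measurement timestamps are each a single GROUP BY.

```python
import sqlite3

# Illustrative flat MEDS-style table: one row per measurement,
# with patient_id / time / code / value columns.
rows = [
    (1, "2020-01-01", "ICD10/E11", None),
    (1, "2020-01-02", "LAB/HbA1c", 7.2),
    (2, "2020-03-05", "ICD10/E11", None),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (patient_id INT, time TEXT, code TEXT, value REAL)")
conn.executemany("INSERT INTO measurements VALUES (?, ?, ?, ?)", rows)

# Measurements per patient and the timestamp of each patient's last
# measurement (useful for censoring times): one GROUP BY each.
per_patient = conn.execute(
    "SELECT patient_id, COUNT(*), MAX(time) FROM measurements "
    "GROUP BY patient_id ORDER BY patient_id").fetchall()
print(per_patient)  # [(1, 2, '2020-01-02'), (2, 1, '2020-03-05')]
```

With nested list columns, the same aggregations would require exploding the lists first; on a flat table any SQL backend handles them natively.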
I would be equally fine with any of the three formats in the case that major performance discrepancies exist favoring that format, and it sounds like completely unnested fits that bill, given that ETLs (which are likely the most expensive processes) will favor that representation. It sounds like the decision steps here are to get some ETL examples in the shared benchmark referenced in #24 so that those numbers can contribute here?

I'll note that I think going to completely unnested would complicate our planned model for some other processing steps. But given that we can easily provide converters between any of the formats, which folks can use where desired, that does not seem like it should be higher priority than the implications the default format has on high-impact performance scenarios. So it doesn't dissuade me from thinking that completely unnested is the way to go here. @EthanSteinberg do you have any objections to completely unnested?
Just to add some quick thoughts:
@scottfleming if we go fully unnested, how would you recommend storing the static_measurements? Should those just be as normal measurements but with timestamp set to null? That seems simplest to me, though (as with other things in the fully de-nested state) it makes some aspects later more complex and less conceptually aligned. |
The main issue with a fully unnested format is that it's difficult to write labeling functions and neural network featurization on a completely unnested schema. And, unlike double vs single nesting, the transformation between fully unnested and single nested is very expensive.

We also already have a completely unnested schema, MEDS-Flat, and it is important for certain use cases where the overhead due to nesting is unacceptable (or nesting makes certain technology impossible to use). We might want to move MEDS Flat into the main standard (into this repository as another pyarrow schema), but I think regardless we would want a nested schema of some sort to make it possible to write labeling functions and neural network featurization in a straightforward manner.

I hate having multiple schemas (MEDS Flat vs MEDS), but I think this is unfortunately a case where both are necessary.
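To illustrate why a nested view helps for labeling, here is a hypothetical labeling function on a single-nested record (the record layout, field names, and death-code value are my own illustration, not the MEDS schema): the label is a simple per-patient pass over the timeline, whereas on a fully unnested table the same logic would first need a group-by/shuffle over the whole dataset.

```python
# Illustrative single-nested patient record (not the real MEDS schema).
patient = {
    "patient_id": 1,
    "measurements": [
        {"time": "2020-01-01", "code": "ICD10/E11"},
        {"time": "2020-02-10", "code": "SNOMED/419620001"},  # hypothetical death code
    ],
}

DEATH_CODES = {"SNOMED/419620001"}  # hypothetical code set for this sketch

def death_label(patient):
    """Return the time of the first death code in the patient's timeline,
    or None if no death code occurs. One pass over a single patient."""
    for m in patient["measurements"]:
        if m["code"] in DEATH_CODES:
            return m["time"]
    return None

print(death_label(patient))  # 2020-02-10
```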
I am tentatively thinking I agree @EthanSteinberg, but I think at this point we really need to nail down the kinds of usage we intend to support (in writing, as @tompollard suggested above), as the scope of what we are proposing to support is rapidly becoming unclear here, and which standard would be most useful in which setting is likewise getting challenging to pin down. We'll also need conversion functions between both views, and in that case we should also discuss whether the expectation is that ETLs will write out both schemas or just one (and if so which), with conversions used as needed.

I've made #25 to track the usage document. I'll re-iterate that I think #25 is especially important as we consider having two primary schemas, because it is clear even in this conversation (and from #23) that we are all envisioning different kinds of usage patterns which require different data arrangements, and it isn't clear which of those patterns we want to principally support and, if so, how we intend to do that. Some concrete guidance would be helpful for us in designing the tools for this schema and its core data format, as well as for users in understanding how to best use MEDS.
I guess one key clarification would be: what is the conceptual purpose and form of MEDS Flat? Specifically, by "MEDS Flat" do we mean an un-nested data format where all measurements across all sources (labs, medications, flows) are integrated/joined for each patient? Or do we mean an un-nested data format for source-specific tables? So e.g., we have a labs table, a flows table, a medications table, all of which are in "MEDS Flat" format, but the key conceptual difference is that these tables aren't joined.

The un-joined "MEDS Flat" could be useful in its own right, but inherits a lot of the challenges that PyHealth has, where they treat each channel independently. This un-joined MEDS Flat is currently the intermediary for a lot of the ETLs we have in

The joined "MEDS Flat" is very useful and allows a lot of important transformations to be performed much more quickly than if there were nesting involved. Counting the number of measurements per patient, finding the timestamp of the last measurement for each patient (important for calculating censoring times), filtering in/out all codes of a particular type -- all of these are painfully slow with nesting and lightning fast without.
Yeah, probably I would store these just as normal measurements but with timestamp set to null. That also seems conceptually clear to me, but violates the assumption that the

To @EthanSteinberg's point, there may still be value in having the single-nested representation for the purpose of e.g. defining labels based on a particular horizon (though you can arguably still do this in polars using

In general, I'd be interested in having two concepts supported by the schema, with ETL tools to generate them and "recipes" for common ETL steps on top of them. The first would be the joined "MEDS Flat" discussed above. The second would be single-nested MEDS. For purposes of e.g. cohort selection and code filtering, those should just be done on MEDS Flat, and potentially there's a layer of tools to support that (or we just expect people to do the ETL into MEDS Flat on their end and define high-level cohort-selection and code-filtering operations using whatever backend they want, such as duckdb, polars, Snowflake, BigQuery, etc.). Then for defining labels and modeling, I imagine that could happen on single-nested MEDS.

But given how much faster operating on flat data vs. nested data is, I would imagine almost everything would be done on MEDS Flat. For simple labeling tasks (i.e., find the time of death based on if/when a death code appears in the patient's timeline), users can construct those on top of MEDS Flat. For more complicated labels, they can use our provided tools for nesting and then define those labels using the single-nested data as a very last step. But at least the user then has control over the tradeoff between engineering complexity and time, especially at large scales.
@scottfleming -- just as a quick clarification, I would only be in support of a "joined" flattened view in which we would basically take the structure of the nested schema and just flatten it out entirely -- so a single (potentially sharded) table with four columns: patient ID, timestamp [can be null], code, and value. Obviously I'm skipping measurement properties and different value types here, but in essence that's what I'm picturing. Separate tables would in my opinion defeat the simplicity that MEDS brings to bear by unifying everything together into the "measurements" idea of code/value structs (though these would now be completely flat) |
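The four-column flat view described above could be sketched as follows (a minimal illustration assuming the null-timestamp convention for static measurements discussed earlier; field names and types are my own, not a finalized schema, and measurement properties and value typing are omitted as in the comment):

```python
from typing import NamedTuple, Optional

class FlatMeasurement(NamedTuple):
    """One row of the hypothetical fully flat view: exactly four core columns."""
    patient_id: int
    time: Optional[str]    # null for static measurements, per the earlier suggestion
    code: str
    value: Optional[float]

# A single (potentially sharded) table is then just a sequence of such rows.
rows = [
    FlatMeasurement(1, None, "Gender/F", None),           # static: null timestamp
    FlatMeasurement(1, "2020-01-02", "LAB/HbA1c", 7.2),   # timestamped lab value
]
```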
Perfect, I think we're in agreement here. Though @EthanSteinberg brought up a good point that I overlooked, which is that if you're lazy-loading tables anyway and, let's say, labs are stored separately from vitals, then "join" and "concatenation" are the same thing as long as the metadata columns are aligned. And concatenation is "free". Additionally, because with parquet's columnar storage you pay minimal extra cost for nulls (basically just the bit-map for remembering where the nulls are), getting to a single table that can be read into polars from a collection of "MEDS Flat"-compliant files (one for labs, one for medications, one for vitals, etc.) that all have the core

Speaking from the industry side where we store all our data in Snowflake, for example, it also means that we can easily create a "MEDS Flat" view of our data and keep it in Snowflake for storage and analytics. It also means that doing things like filtering in/out measurements based on metadata columns (e.g., visit type) is trivial, fast, and can easily be done using basically any backend the user prefers.
The current structure of MEDS is `Patient => Event => Measurements`. I would propose changing it to just `Patient => Events`.
Arguments in favor: