Add proposal for ol facet tables #2076
Conversation
Codecov Report

| Coverage Diff | main | #2076 | +/- |
|---|---:|---:|---:|
| Coverage | 76.72% | 76.72% | |
| Complexity | 1177 | 1177 | |
| Files | 222 | 222 | |
| Lines | 5354 | 5354 | |
| Branches | 429 | 429 | |
| Hits | 4108 | 4108 | |
| Misses | 768 | 768 | |
| Partials | 478 | 478 | |
Nice. Can you add specifics on how conflicts will be dealt with? Are these facet tables append-only? E.g., if two facets with the same name (I assume the `name` column in the tables refers to the facet name) but different contents are received for the same dataset version (e.g., two different GreatExpectations suites on the same data), will both records be added to the tables? If not, which one wins? First received? Last received? If both records are inserted, will they be merged at query time? Or just appended one after the other? Are there indexes on these tables? In particular, the `runs` table can get very, very large. The same will be true of the `runs_facets` table eventually.
@collado-mike: All great questions that I'll elaborate on in the proposal!
OpenLineage's core model is extensible via _facets_. A `facet` is user-defined metadata that enables entity enrichment. Initially, returning dataset, job, and run facets via the REST API was not supported; support was eventually added in release [`0.14.0`](https://github.com/MarquezProject/marquez/compare/0.13.1...0.14.0). The implementation was simple: when querying the `datasets`, `jobs`, or `runs` tables, also query the `lineage_events` table for facets.
We knew the initial implementation would eventually have to be revisited: OpenLineage events can easily exceed `10MB`, resulting in out-of-memory (OOM) errors, since facet queries require loading the raw `event` into memory and then filtering for the relevant facets. This proposal outlines how we can optimize query performance for OpenLineage facets.
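For context, a minimal sketch of the current read path is shown below. The `lineage_events` table and the `run_uuid`, `event`, and `event_time` columns are assumptions about the Marquez schema, not part of this excerpt; the key point is that the whole event payload is fetched and facets are filtered out afterwards in application code.

```sql
-- Sketch only: fetch the full OpenLineage event payloads for a run, then filter
-- for the relevant facets in memory (this is what can trigger OOM on >10MB events).
SELECT event
  FROM lineage_events
 WHERE run_uuid = :runUuid
 ORDER BY event_time ASC;
```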
From what I understand, we load the events into memory in Marquez and do the facet filtering there. If so, the solution should be to offload the filtering to the database. The short-term solution could be filtering JSON content in Postgres as described here (note that indexing JSON content is also possible). In this particular example, we could select output dataset facets from the event JSON within PostgreSQL instead of selecting whole events.

But `lineage_events` will grow over time, and querying for datasets' facets will slow down. Normalizing JSON facets, as described within this proposal, is a good way to go.

The only problem with separate tables is the backfill procedure, which is a heavy operation. I am not sure whether lazy migration would work: the existence of some facets in the new tables does not mean we have collected all the existing facets, including those from the `lineage_events` table.
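As a rough sketch of the idea above (not part of the proposal; the JSON paths follow the OpenLineage event structure, while the `lineage_events` column names and the `jsonb` column type are assumptions about the Marquez schema), output dataset facets could be extracted inside PostgreSQL instead of shipping whole events to Marquez:

```sql
-- Illustrative only: extract output dataset facets from the raw event JSON in-database,
-- so only the facets (not the full, potentially multi-MB event) reach the application.
SELECT ds ->> 'name'  AS dataset_name,
       ds -> 'facets' AS dataset_facets
  FROM lineage_events AS le,
       jsonb_array_elements(le.event -> 'outputs') AS ds
 WHERE le.run_uuid = :runUuid;

-- Indexing JSON content is also possible, e.g. a GIN index on the event column:
-- CREATE INDEX lineage_events_event_idx ON lineage_events USING gin (event);
```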
Very reasonable point, @pawel-big-lebowski. But to get ahead of the issue, as you stated, we'll want to normalize facets.
That looks great; I left some comments.
I think it's a great proposal that can have a huge impact on Marquez performance 🚀
I've put in some comments that are worth considering (if not already considered).
Note, facet tables will be:

* Append-only, mirroring the current insertion pattern of the `lineage_events` table, thereby avoiding facet conflicts
* Merged in _first-to-last_ received order; that is, facet rows will be merged post-query using [`MapperUtils.toFacetsOrNull()`](https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/mappers/MapperUtils.java#L50), mirroring the current logic (i.e., newer facets will be added, or will override older facet values, based on when the OpenLineage event was received); see the sketch below
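To make the first-to-last semantics concrete, here is a sketch of the query-time ordering. The `run_facets` table and its columns are placeholders for illustration, not the proposal's final DDL:

```sql
-- Placeholder names: rows come back ordered by when the originating event was received,
-- and the application-side merge (MapperUtils.toFacetsOrNull()) applies them in order,
-- so a facet received later overrides an earlier value with the same facet name.
SELECT name, facet
  FROM run_facets
 WHERE run_uuid = :runUuid
 ORDER BY lineage_event_time ASC;
```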
In some cases it might make sense to accumulate content inside the facets themselves.
For example, a streaming Flink job might report how many records were processed per checkpoint, sometimes sending multiple results per OL event. The result facet should return a list of those reports.
I don't think we need to address this within this proposal, though.
I agree, certain facets can be handled differently (e.g., accumulated) based on some context. I'll make a note on the proposal's scope / limitations.
@wslulciuc, let's also create a follow-up issue for this problem; then this will be good to go.
2. Using the facet tables instead of the `lineage_events` table to query for facets.
3. Lazy migration: the facet tables will be queried first and, if no facets are returned, the `lineage_events` table will be queried as a fallback; this approach avoids an immediate backfill, but one will still eventually be needed (see the sketch below).
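A minimal sketch of the lazy-migration read path in option 3, reusing the placeholder `run_facets` table from above; the `lineage_events` column names are assumptions:

```sql
-- Illustrative only: try the new, normalized facet table first ...
SELECT name, facet
  FROM run_facets
 WHERE run_uuid = :runUuid;

-- ... and only when it returns no rows, fall back to extracting facets from the
-- raw events in lineage_events (data ingested before the facet tables existed).
SELECT event -> 'run' -> 'facets' AS run_facets
  FROM lineage_events
 WHERE run_uuid = :runUuid
 ORDER BY event_time ASC;
```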
## Migration procedure
Please also refer to this:
https://github.com/MarquezProject/marquez/pull/2152/files?short_path=17843f7#diff-17843f7d4567ca029ee0d63f56e6b75a000b384768d9e0352badffad66eeea3c
to see the migration from a user's perspective.
This PR adds the proposal: Optimize query performance for OpenLineage facets