Add proposal for ol facet tables #2076

wslulciuc · 2022-08-18T19:58:46Z

This PR adds the proposal: Optimize query performance for OpenLineage facets

codecov · 2022-08-18T20:04:17Z

Codecov Report

Merging #2076 (28a07e9) into main (1d28adf) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##               main    #2076   +/-   ##
=========================================
  Coverage     76.72%   76.72%           
  Complexity     1177     1177           
=========================================
  Files           222      222           
  Lines          5354     5354           
  Branches        429      429           
=========================================
  Hits           4108     4108           
  Misses          768      768           
  Partials        478      478

proposals/2078-optimization-ol-facets.md

collado-mike

Nice. Can you add specifics on how conflicts will be dealt with? Are these facet tables append-only? E.g., if two facets with the same name (I assume the name column in the tables refers to the facet name) but different contents are received for the same dataset version (e.g., two different GreatExpectations suites on the same data), will both records be added to the tables? If not, which one wins? First received? Last received? If both records are inserted, will they be merged at query time? Or just appended one after the other? Are there indexes on these tables? In particular, the runs table can get very, very large. The same will be true of the runs_facets table eventually.

wslulciuc · 2022-08-22T18:31:16Z

@collado-mike: All great questions that I'll elaborate on in the proposal!

pawel-big-lebowski · 2022-08-23T07:32:59Z

proposals/2078-optimization-ol-facets.md

+
+OpenLineage's core model is extensible via _facets_. A `facet` is user-defined metadata that enriches an entity. Initially, returning dataset, job, and run facets via the REST API was not supported; support was eventually added in release [`0.14.0`](https://github.com/MarquezProject/marquez/compare/0.13.1...0.14.0). The implementation was simple: when querying the `datasets`, `jobs`, or `runs` tables, also query the `lineage_events` table for facets.
+
+We knew the initial implementation would eventually have to be revisited: OpenLineage events can easily exceed **`10MB`**, resulting in out-of-memory (OOM) errors, because facet queries require loading the raw `event` into memory and then filtering for relevant facets. This proposal outlines how we can optimize query performance for OpenLineage facets.


From what I understand, we load the events into memory in Marquez and do the facet filtering there.
If so, the solution should be to offload the filtering to the database. The short-term solution could be filtering the JSON content in Postgres as described here (please note that indexing JSON content is also possible). In this particular example, we could select output dataset facets from the event JSON within PostgreSQL instead of selecting whole events.
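
As a rough illustration of pushing the facet filter into the database, the sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL. It assumes the bundled SQLite was built with JSON support; the two-column table layout, the sample event, and the `output_dataset_facets` helper are simplified assumptions, not Marquez's actual schema or code. In PostgreSQL the same projection would be written with the JSON operators (for example `event -> 'outputs'`).

```python
import json
import sqlite3

# Simplified stand-in for the lineage_events table (assumed layout).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineage_events (run_uuid TEXT, event TEXT)")

# A tiny sample event; real events can be many megabytes.
event = {
    "run": {"runId": "run-1", "facets": {}},
    "outputs": [
        {
            "namespace": "db",
            "name": "orders",
            "facets": {"schema": {"fields": [{"name": "id", "type": "INT"}]}},
        }
    ],
    "padding": "x" * 1000,  # stands in for the bulk of the payload
}
conn.execute(
    "INSERT INTO lineage_events VALUES (?, ?)", ("run-1", json.dumps(event))
)


def output_dataset_facets(conn, run_uuid):
    """Return {dataset name: facets} for a run, extracted inside the database.

    Only the facets of each output dataset cross the connection; the rest of
    the event (the padding above) is never loaded by the application.
    """
    rows = conn.execute(
        """
        SELECT json_extract(o.value, '$.name'),
               json_extract(o.value, '$.facets')
        FROM lineage_events AS e, json_each(e.event, '$.outputs') AS o
        WHERE e.run_uuid = ?
        """,
        (run_uuid,),
    ).fetchall()
    return {name: json.loads(facets) if facets else {} for name, facets in rows}
```

The query still has to parse each stored event on the database side, which is why this is only a short-term measure.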

But `lineage_events` will grow over time, and querying for datasets' facets will slow down. Normalizing the JSON facets, as described in this proposal, is a good way to go.
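
To make the normalization idea concrete, here is a minimal sketch that flattens one event into per-entity facet rows at write time. The function name, the `run_facets` and `job_facets` keys, and the row fields (`run_id`, `dataset`, `name`, `facet`) are hypothetical and only illustrate the shape of the data; the proposal itself defines the real tables and columns.

```python
from typing import Any, Dict, List


def normalize_facets(event: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Split an OpenLineage event into one row per facet, grouped by entity."""
    run_id = event["run"]["runId"]
    rows: Dict[str, List[Dict[str, Any]]] = {
        "run_facets": [],
        "job_facets": [],
        "dataset_facets": [],
    }
    # Run-level facets.
    for name, facet in event["run"].get("facets", {}).items():
        rows["run_facets"].append({"run_id": run_id, "name": name, "facet": facet})
    # Job-level facets.
    for name, facet in event.get("job", {}).get("facets", {}).items():
        rows["job_facets"].append({"run_id": run_id, "name": name, "facet": facet})
    # Dataset-level facets, for both inputs and outputs.
    for dataset in event.get("inputs", []) + event.get("outputs", []):
        for name, facet in dataset.get("facets", {}).items():
            rows["dataset_facets"].append(
                {
                    "run_id": run_id,
                    "dataset": dataset["name"],
                    "name": name,
                    "facet": facet,
                }
            )
    return rows
```

With rows like these, a facet query reads a few small records instead of the whole stored event.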

The only problem with separate tables is the backfill procedure, which is a heavy operation. I am not sure whether lazy migration would work: the existence of some facets in the new tables does not mean we have collected all the existing facets, including those still in the `lineage_events` table.

wslulciuc · 2022-08-27

Very reasonable point, @pawel-big-lebowski. But to get ahead of the issue, as you stated, we'll want to normalize facets.

julienledem

That looks great, I left some comments

proposals/2078-optimization-ol-facets.md

pawel-big-lebowski

I think it's a great proposal which can have huge impact on Marquez performance 🚀
I've put some comments which are worth considering (if not already considered).

proposals/2078-optimization-ol-facets.md

mobuchowski · 2022-08-30T12:38:14Z

proposals/2078-optimization-ol-facets.md

+Note, facet tables will be:
+
+* Append-only, mirroring the current insertion pattern of the `lineage_events` table and therefore avoiding facet conflicts
+* Merged in _first-to-last_ received order; that is, facet rows will be merged post-query using [`MapperUtils.toFacetsOrNull()`](https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/mappers/MapperUtils.java#L50), mirroring the current logic (i.e. newer facets will be added to, or override, older facet values based on when the OpenLineage event was received)
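
The first-to-last merge described in the list above can be sketched as follows. This is an illustration only, not a port of `MapperUtils.toFacetsOrNull()`: the row shape and the `merge_facets` name are hypothetical, timestamps are assumed to be ISO-8601 strings in one time zone (so they sort lexicographically), and a newer facet is assumed to replace an older facet of the same name as a whole.

```python
from typing import Any, Dict, Iterable, Optional, Tuple

# (received_at, facet name, facet body); a hypothetical row shape.
FacetRow = Tuple[str, str, Dict[str, Any]]


def merge_facets(rows: Iterable[FacetRow]) -> Optional[Dict[str, Any]]:
    """Merge append-only facet rows in the order their events were received."""
    merged: Dict[str, Any] = {}
    # Oldest first, so a later row overrides an earlier one with the same name.
    for _, name, body in sorted(rows, key=lambda row: row[0]):
        merged[name] = body
    # Return None when there is nothing to merge.
    return merged or None
```

Because the tables are append-only, both versions of a same-named facet are stored; the choice of which one a reader sees is made only here, at query time.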


In some cases it might make sense to accumulate content inside the facets themselves.

For example, a streaming Flink job might report how many records were processed per checkpoint, sometimes sending multiple results per OL event. The resulting facet should return a list of those reports.

I don't think we need to address this within this proposal though.

wslulciuc · 2022-11-15

I agree, certain facets can be handled differently (i.e. accumulated) based on some context. I'll make a note of this in the proposal's scope / limitations.

mobuchowski · 2022-11-16

@wslulciuc let's also create a follow-up issue; then we'll be good to go on this problem.

Signed-off-by: wslulciuc <willy@datakin.com>

pawel-big-lebowski · 2022-12-21T08:48:55Z

proposals/2078-optimization-ol-facets.md

+2. Using the facet tables instead of the `lineage_events` table to query for facets.
+3. Lazy migration: the facet tables will be queried first and, if no facets are returned, the `lineage_events` table will be queried instead; this approach avoids an upfront backfill, but one will still be needed.
+
+## Migration procedure
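
The lazy-migration read path from the implementation steps above can be sketched like this. The two query callables are hypothetical placeholders for "read from the facet tables" and "read from `lineage_events`"; they are not Marquez APIs.

```python
from typing import Any, Callable, Dict, Optional

Facets = Dict[str, Any]
# A hypothetical query function: run id in, facets (or None) out.
FacetQuery = Callable[[str], Optional[Facets]]


def lazy_facets(
    run_id: str,
    query_facet_tables: FacetQuery,
    query_lineage_events: FacetQuery,
) -> Optional[Facets]:
    """Prefer the facet tables; fall back to lineage_events when they are empty."""
    facets = query_facet_tables(run_id)
    if facets:
        return facets
    # Nothing migrated for this run yet, so use the slower legacy path.
    return query_lineage_events(run_id)
```

Note that the fallback only triggers when the facet tables return nothing. A run whose facets were migrated only in part would be served incomplete results, which is the concern raised earlier in this thread and the reason a backfill is still needed.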


Please also refer to this:
https://github.com/MarquezProject/marquez/pull/2152/files?short_path=17843f7#diff-17843f7d4567ca029ee0d63f56e6b75a000b384768d9e0352badffad66eeea3c

to see migration from a user's perspective.

proposals/2078-optimization-ol-facets.md

Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>

wslulciuc added the proposal label Aug 19, 2022

wslulciuc marked this pull request as ready for review August 19, 2022 07:30

wslulciuc requested review from collado-mike and fm100 August 19, 2022 08:13

fm100 reviewed Aug 19, 2022

collado-mike reviewed Aug 19, 2022

pawel-big-lebowski reviewed Aug 23, 2022

julienledem reviewed Aug 23, 2022

wslulciuc mentioned this pull request Aug 26, 2022

Add metadata cmd #2091

Merged

7 tasks

wslulciuc requested review from fm100, pawel-big-lebowski, collado-mike and julienledem August 27, 2022 03:17

pawel-big-lebowski reviewed Aug 30, 2022

mobuchowski reviewed Aug 30, 2022

pawel-big-lebowski mentioned this pull request Sep 12, 2022

Add facet tables to avoid querying lineage_events table #2078

Closed

mobuchowski mentioned this pull request Sep 12, 2022

add raw OpenLineage get event API #2070

Merged

wslulciuc mentioned this pull request Sep 30, 2022

Add OL facet tables #2152

Closed

7 tasks

boring-cyborg bot added the docs label Nov 15, 2022

wslulciuc requested review from pawel-big-lebowski and mobuchowski November 15, 2022 23:34

pawel-big-lebowski self-assigned this Dec 12, 2022

pawel-big-lebowski force-pushed the proposal/separate-facets-tables-for-ol-events branch 2 times, most recently from e7e553b to 5b95335 Compare December 13, 2022 08:56

wslulciuc added 4 commits December 13, 2022 09:59

Add proposal for ol facet tables

acb97f6

Signed-off-by: wslulciuc <willy@datakin.com>

continued: Add proposal for ol facet tables

764abbc

Signed-off-by: wslulciuc <willy@datakin.com>

Add implementation step for lazy migration

087b8a7

Signed-off-by: wslulciuc <willy@datakin.com>

Link issue

5a2912e

Signed-off-by: wslulciuc <willy@datakin.com>

wslulciuc added 6 commits December 13, 2022 09:59

continued: Link issue

f703a43

Signed-off-by: wslulciuc <willy@datakin.com>

Add index details for facet tables

0892cb0

Signed-off-by: wslulciuc <willy@datakin.com>

Add uuid as pk for facet tables

bf99a85

Signed-off-by: wslulciuc <willy@datakin.com>

Update tables and expand overview section

bd3b308

Signed-off-by: wslulciuc <willy@datakin.com>

Add info on facet tables

c359744

Signed-off-by: wslulciuc <willy@datakin.com>

Add type column in dataset_facets table

ad1e2cd

Signed-off-by: wslulciuc <willy@datakin.com>

pawel-big-lebowski force-pushed the proposal/separate-facets-tables-for-ol-events branch from 5b95335 to d7f5f4b Compare December 13, 2022 08:59

harels removed request for julienledem, collado-mike, fm100 and pawel-big-lebowski December 20, 2022 16:38

pawel-big-lebowski reviewed Dec 21, 2022

mobuchowski approved these changes Dec 21, 2022

pawel-big-lebowski force-pushed the proposal/separate-facets-tables-for-ol-events branch from d7f5f4b to 48bb84c Compare December 21, 2022 15:34

include proposal for migration procedure

3c5ebed

Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>

pawel-big-lebowski force-pushed the proposal/separate-facets-tables-for-ol-events branch from 48bb84c to 3c5ebed Compare December 21, 2022 15:43

Merge branch 'main' into proposal/separate-facets-tables-for-ol-events

28a07e9

pawel-big-lebowski merged commit a854702 into main Dec 21, 2022

pawel-big-lebowski deleted the proposal/separate-facets-tables-for-ol-events branch December 21, 2022 16:25

pawel-big-lebowski mentioned this pull request Jan 26, 2023

OL facets - PR2 - read facets from views based on lineage_events table #2355

Merged

7 tasks
