add lifecycleStateChange support #1847

pawel-big-lebowski · 2022-01-25T17:09:27Z

Signed-off-by: Pawel Leszczynski leszczynski.pawel@gmail.com

Problem

Lacking support for table operations reported from OpenLineage

Closes: #1800

Solution

Store stateChange field in backend database and expose the property over API.

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

You've signed-off your work
Your changes are accompanied by tests (if relevant)
Your change contains a small diff and is self-contained
You've updated any relevant documentation (if relevant)
You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)

codecov · 2022-01-26T13:45:45Z

Codecov Report

Merging #1847 (56424c3) into main (db01c48) will increase coverage by 0.18%.
The diff coverage is 100.00%.

❗ Current head 56424c3 differs from pull request most recent head d4b2e85. Consider uploading reports for the commit d4b2e85 to get more accurate results

@@             Coverage Diff              @@
##               main    #1847      +/-   ##
============================================
+ Coverage     77.90%   78.09%   +0.18%     
- Complexity      937      944       +7     
============================================
  Files           193      193              
  Lines          5218     5249      +31     
  Branches        418      418              
============================================
+ Hits           4065     4099      +34     
+ Misses          706      705       -1     
+ Partials        447      445       -2

Impacted Files	Coverage Δ
...pi/src/main/java/marquez/db/DatasetVersionDao.java	`95.83% <ø> (ø)`
...pi/src/main/java/marquez/db/models/DatasetRow.java	`33.33% <ø> (ø)`
...main/java/marquez/db/models/DatasetVersionRow.java	`100.00% <ø> (ø)`
.../src/main/java/marquez/service/models/DbTable.java	`100.00% <ø> (ø)`
...in/java/marquez/service/models/DbTableVersion.java	`100.00% <ø> (ø)`
...i/src/main/java/marquez/service/models/Stream.java	`75.00% <ø> (ø)`
...ain/java/marquez/service/models/StreamVersion.java	`75.00% <ø> (ø)`
api/src/main/java/marquez/common/Utils.java	`86.15% <100.00%> (+0.21%)`	⬆️
api/src/main/java/marquez/db/Columns.java	`81.81% <100.00%> (+0.86%)`	⬆️
api/src/main/java/marquez/db/DatasetDao.java	`98.52% <100.00%> (ø)`
... and 15 more

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

pawel-big-lebowski · 2022-01-26T16:26:13Z

wslulciuc · 2022-02-09T08:17:31Z

@pawel-big-lebowski: though I agree we should handle the TableStateChangeFacet, I'm not sure the facet is applicable to all dataset types (ex: streams, files, etc). That is, displaying the table state change in the web UI would be relevant for tables, but not for a file in s3.

After looking over the initial issue on adding table operation support (and the facet definition), Marquez should perform certain actions based on the table state change. That is, for a given state change below the following actions are defined:

CREATE: the dataset would be created, versioned, and associated with the runID; but, based on the CreateTableCommandVisitor this state change would be a no-op as the table is registered as an output of the job run and would automatically be created by Marquez.
DROP: the dataset would be a soft delete; that is, Marquez would created a new deleted version of the dataset and no longer add the dataset to the current lineage graph view.
TRUNCATE / OVERWRITE: this essentially is an update on the dataset; on a completed job run, a new dataset version will be created automatically and associated with the runID.

If we refer back to issue #1800, @collado-mike mentions:

We should record these operations in Marquez and reflect the new state of the table after the operation has completed.

The actions defined above would reflect the new state of the table, which is the intended goal. As for displaying or recording the state change as proposed in this PR?

If we decide not to display the state change in the web UI as proposed here (as least not initially), we still record all the events in the lineage_events table, and reflect the new state of the table after the action mapped to state change is applied to the dataset.
If we decide to display the state change in the web UI, we would need to think in terms of dataset lifecycle states (the lifecycle of a dataset is something that Marquez currently manages, though not explicitly): CREATED, DELETED, UPDATED, etc. These lifecycle states would be generic enough to be associated with streams, files, etc. To make it a bit more clear that a given state change is associated with a given lifecycle state, I would use the column names lifecycle_state and latest_lifecycle_state in the tables dataset_versions and datasets respectively. I personally would prefer standardizing on the lifecycle states (defined as an enum) and mapping incoming state changes to them.

pawel-big-lebowski · 2022-02-10T14:27:18Z

Thanks @wslulciuc for the extensive comment. I think the issue is that we started with table state changes in Spark and wanted to implement it in Marquez, while not being aware of a whole dataset context within Marquez. Like you said, the approach may not fit well other dataset types like files or streams. I agree with the column names lifecycle_state and latest_lifecycle_state which are more generic.

Based on the knowledge collected, I think we should start with: OpenLineage/OpenLineage#518
to define global facets' spec that matches tables, files, streams etc. Then we should implement it in Spark and later revisit this Marquez PR.

pawel-big-lebowski · 2022-03-10T08:50:29Z

I renamed the fields in JSON and backend DB according to naming convention introduced in spec (lifecycleStateChange).
I implemented soft-delete mechanism, so that in case of drop operation, a dataset is marked with a is_deleted flag equal to true.

The still valid open question is: what do we want to do with this information on the frontend? There are at least two options:

Store dropped datasets in DB and do not show it on Web UI. This is implemented within current version. We collect the pretty useful information but we do not make any use of it at the moment.
Display dropped datasets in UI while marking them deleted. The potential drawback is the amount of deleted datasets can increase over time and blur the graph. Perhaps we could assume configurable 7 days expiry time and just show datasets dropped within the expiry time.

@julienledem @wslulciuc What do you think?

pawel-big-lebowski · 2022-03-21T11:04:49Z

We decided to proceed in a following way:

Datasets marked as deleted should be returned by the API.
Datasets marked as deleted should not be presented in UI. Filtering has to be done on the frontend side.
This approach is good enough for now and can be extended further on the frontend side later on.

wslulciuc

Great work, @pawel-big-lebowski! I have a few minor suggestions before we can merge this sweet feature

Mind using just latest_lifecycle_state and lifecycle_state for the column names to represent states? In OpenLineage, the facet does capture the state change (so the naming is appropriate), but in Marquez the column will represent the current state at a given time in the dataset lifecycle. You can think about how the run states are stored in the runs table.
I think we should also add the column latest_lifecycle_state to the datasets table?
With Add migration script to remove size limits from namespaces, dataset n… #1925 merged, you'll need to update you sql migration file to use V41 and above
Mind also updating the changelog?

Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>

pawel-big-lebowski · 2022-04-01T14:04:11Z

Great comments @wslulciuc. I've made changes for 1., 3. and 4. of your comments.

I am not sure whether adding column latest_lifecycle_state to datasets table is necessary. First, this does not require any extra join in postgres as whenever dataset is fetched, it is already joined with the latest dataset_version. Secondly, there are other columns obtained from dataset_version table in a similar way like fields:

marquez/api/src/main/java/marquez/db/DatasetDao.java

Line 162 in d8a83bc

+ "SELECT d.*, dv.fields, sv.schema_location, t.tags, facets\n"

Let me know if you had other argument for doing that.

wslulciuc · 2022-04-01T20:55:09Z

First, this does not require any extra join in postgres as whenever dataset is fetched, it is already joined with the latest dataset_version. Secondly, there are other columns obtained from dataset_version table in a similar way like fields

Totally agree, thanks for pointing this out! The datasets table already has a reference to the current version (=current_version_uuid ), which will also contain the current lifecycle state of the table. So, there's a stronger argument not to add the latest_lifecycle_state to datasets given that it can be derived for dataset_versions 👍

wslulciuc

Great work, @pawel-big-lebowski 💯 💯 💯

pawel-big-lebowski changed the title ~~add operation column~~ add tableStateChange support Jan 25, 2022

pawel-big-lebowski force-pushed the feature/table-operation-support branch from 4854583 to 89717e2 Compare January 26, 2022 12:31

pawel-big-lebowski force-pushed the feature/table-operation-support branch from 89717e2 to 83b8204 Compare January 26, 2022 16:17

pawel-big-lebowski marked this pull request as ready for review January 26, 2022 16:31

collado-mike requested a review from wslulciuc January 27, 2022 17:26

wslulciuc added the review Ready for review label Feb 1, 2022

pawel-big-lebowski mentioned this pull request Feb 15, 2022

add DatasetLifecycleStateDatasetFacet to spec OpenLineage/OpenLineage#531

Merged

9 tasks

pawel-big-lebowski force-pushed the feature/table-operation-support branch 2 times, most recently from d04951f to 3fd3222 Compare March 10, 2022 08:30

pawel-big-lebowski changed the title ~~add tableStateChange support~~ add lifecycleStateChange support Mar 10, 2022

pawel-big-lebowski force-pushed the feature/table-operation-support branch from 3fd3222 to 56424c3 Compare March 21, 2022 10:50

wslulciuc requested changes Mar 30, 2022

View reviewed changes

wslulciuc added this to the 0.22.0 milestone Mar 31, 2022

add operation column

0bbfc0a

Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>

pawel-big-lebowski force-pushed the feature/table-operation-support branch from d4b2e85 to 0bbfc0a Compare April 1, 2022 13:51

wslulciuc removed the review Ready for review label Apr 1, 2022

wslulciuc approved these changes Apr 1, 2022

View reviewed changes

wslulciuc merged commit c1325c5 into MarquezProject:main Apr 1, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add lifecycleStateChange support #1847

add lifecycleStateChange support #1847

pawel-big-lebowski commented Jan 25, 2022 •

edited

Loading

codecov bot commented Jan 26, 2022 •

edited

Loading

pawel-big-lebowski commented Jan 26, 2022

wslulciuc commented Feb 9, 2022 •

edited

Loading

pawel-big-lebowski commented Feb 10, 2022

pawel-big-lebowski commented Mar 10, 2022

pawel-big-lebowski commented Mar 21, 2022

wslulciuc left a comment •

edited

Loading

pawel-big-lebowski commented Apr 1, 2022

wslulciuc commented Apr 1, 2022

wslulciuc left a comment

add lifecycleStateChange support #1847

add lifecycleStateChange support #1847

Conversation

pawel-big-lebowski commented Jan 25, 2022 • edited Loading

Problem

Solution

Checklist

codecov bot commented Jan 26, 2022 • edited Loading

Codecov Report

pawel-big-lebowski commented Jan 26, 2022

wslulciuc commented Feb 9, 2022 • edited Loading

pawel-big-lebowski commented Feb 10, 2022

pawel-big-lebowski commented Mar 10, 2022

pawel-big-lebowski commented Mar 21, 2022

wslulciuc left a comment • edited Loading

Choose a reason for hiding this comment

pawel-big-lebowski commented Apr 1, 2022

wslulciuc commented Apr 1, 2022

wslulciuc left a comment

Choose a reason for hiding this comment

pawel-big-lebowski commented Jan 25, 2022 •

edited

Loading

codecov bot commented Jan 26, 2022 •

edited

Loading

wslulciuc commented Feb 9, 2022 •

edited

Loading

wslulciuc left a comment •

edited

Loading