Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add lifecycleStateChange support #1847

Conversation

pawel-big-lebowski
Copy link
Collaborator

@pawel-big-lebowski pawel-big-lebowski commented Jan 25, 2022

Signed-off-by: Pawel Leszczynski leszczynski.pawel@gmail.com

Problem

Lacking support for table operations reported from OpenLineage

Closes: #1800

Solution

Store stateChange field in backend database and expose the property over API.

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)

@pawel-big-lebowski pawel-big-lebowski changed the title add operation column add tableStateChange support Jan 25, 2022
@pawel-big-lebowski pawel-big-lebowski force-pushed the feature/table-operation-support branch from 4854583 to 89717e2 Compare January 26, 2022 12:31
@codecov
Copy link

codecov bot commented Jan 26, 2022

Codecov Report

Merging #1847 (56424c3) into main (db01c48) will increase coverage by 0.18%.
The diff coverage is 100.00%.

❗ Current head 56424c3 differs from pull request most recent head d4b2e85. Consider uploading reports for the commit d4b2e85 to get more accurate results

@@             Coverage Diff              @@
##               main    #1847      +/-   ##
============================================
+ Coverage     77.90%   78.09%   +0.18%     
- Complexity      937      944       +7     
============================================
  Files           193      193              
  Lines          5218     5249      +31     
  Branches        418      418              
============================================
+ Hits           4065     4099      +34     
+ Misses          706      705       -1     
+ Partials        447      445       -2     
Impacted Files Coverage Δ
...pi/src/main/java/marquez/db/DatasetVersionDao.java 95.83% <ø> (ø)
...pi/src/main/java/marquez/db/models/DatasetRow.java 33.33% <ø> (ø)
...main/java/marquez/db/models/DatasetVersionRow.java 100.00% <ø> (ø)
.../src/main/java/marquez/service/models/DbTable.java 100.00% <ø> (ø)
...in/java/marquez/service/models/DbTableVersion.java 100.00% <ø> (ø)
...i/src/main/java/marquez/service/models/Stream.java 75.00% <ø> (ø)
...ain/java/marquez/service/models/StreamVersion.java 75.00% <ø> (ø)
api/src/main/java/marquez/common/Utils.java 86.15% <100.00%> (+0.21%) ⬆️
api/src/main/java/marquez/db/Columns.java 81.81% <100.00%> (+0.86%) ⬆️
api/src/main/java/marquez/db/DatasetDao.java 98.52% <100.00%> (ø)
... and 15 more

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

@pawel-big-lebowski pawel-big-lebowski force-pushed the feature/table-operation-support branch from 89717e2 to 83b8204 Compare January 26, 2022 16:17
@pawel-big-lebowski
Copy link
Collaborator Author

Zrzut ekranu 2022-01-26 o 17 15 42

@pawel-big-lebowski pawel-big-lebowski marked this pull request as ready for review January 26, 2022 16:31
@wslulciuc wslulciuc added the review Ready for review label Feb 1, 2022
@wslulciuc
Copy link
Member

wslulciuc commented Feb 9, 2022

@pawel-big-lebowski: though I agree we should handle the TableStateChangeFacet, I'm not sure the facet is applicable to all dataset types (ex: streams, files, etc). That is, displaying the table state change in the web UI would be relevant for tables, but not for a file in s3.

After looking over the initial issue on adding table operation support (and the facet definition), Marquez should perform certain actions based on the table state change. That is, for a given state change below the following actions are defined:

  • CREATE: the dataset would be created, versioned, and associated with the runID; but, based on the CreateTableCommandVisitor this state change would be a no-op as the table is registered as an output of the job run and would automatically be created by Marquez.
  • DROP: the dataset would be a soft delete; that is, Marquez would created a new deleted version of the dataset and no longer add the dataset to the current lineage graph view.
  • TRUNCATE / OVERWRITE: this essentially is an update on the dataset; on a completed job run, a new dataset version will be created automatically and associated with the runID.

If we refer back to issue #1800, @collado-mike mentions:

We should record these operations in Marquez and reflect the new state of the table after the operation has completed.

The actions defined above would reflect the new state of the table, which is the intended goal. As for displaying or recording the state change as proposed in this PR?

  1. If we decide not to display the state change in the web UI as proposed here (as least not initially), we still record all the events in the lineage_events table, and reflect the new state of the table after the action mapped to state change is applied to the dataset.
  2. If we decide to display the state change in the web UI, we would need to think in terms of dataset lifecycle states (the lifecycle of a dataset is something that Marquez currently manages, though not explicitly): CREATED, DELETED, UPDATED, etc. These lifecycle states would be generic enough to be associated with streams, files, etc. To make it a bit more clear that a given state change is associated with a given lifecycle state, I would use the column names lifecycle_state and latest_lifecycle_state in the tables dataset_versions and datasets respectively. I personally would prefer standardizing on the lifecycle states (defined as an enum) and mapping incoming state changes to them.

@pawel-big-lebowski
Copy link
Collaborator Author

Thanks @wslulciuc for the extensive comment. I think the issue is that we started with table state changes in Spark and wanted to implement it in Marquez, while not being aware of a whole dataset context within Marquez. Like you said, the approach may not fit well other dataset types like files or streams. I agree with the column names lifecycle_state and latest_lifecycle_state which are more generic.

Based on the knowledge collected, I think we should start with: OpenLineage/OpenLineage#518
to define global facets' spec that matches tables, files, streams etc. Then we should implement it in Spark and later revisit this Marquez PR.

@pawel-big-lebowski pawel-big-lebowski force-pushed the feature/table-operation-support branch 2 times, most recently from d04951f to 3fd3222 Compare March 10, 2022 08:30
@pawel-big-lebowski
Copy link
Collaborator Author

  • I renamed the fields in JSON and backend DB according to naming convention introduced in spec (lifecycleStateChange).
  • I implemented soft-delete mechanism, so that in case of drop operation, a dataset is marked with a is_deleted flag equal to true.

The still valid open question is: what do we want to do with this information on the frontend? There are at least two options:

  • Store dropped datasets in DB and do not show it on Web UI. This is implemented within current version. We collect the pretty useful information but we do not make any use of it at the moment.
  • Display dropped datasets in UI while marking them deleted. The potential drawback is the amount of deleted datasets can increase over time and blur the graph. Perhaps we could assume configurable 7 days expiry time and just show datasets dropped within the expiry time.

@julienledem @wslulciuc What do you think?

@pawel-big-lebowski pawel-big-lebowski changed the title add tableStateChange support add lifecycleStateChange support Mar 10, 2022
@pawel-big-lebowski pawel-big-lebowski force-pushed the feature/table-operation-support branch from 3fd3222 to 56424c3 Compare March 21, 2022 10:50
@pawel-big-lebowski
Copy link
Collaborator Author

We decided to proceed in a following way:

  • Datasets marked as deleted should be returned by the API.
  • Datasets marked as deleted should not be presented in UI. Filtering has to be done on the frontend side.
  • This approach is good enough for now and can be extended further on the frontend side later on.

Copy link
Member

@wslulciuc wslulciuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, @pawel-big-lebowski! I have a few minor suggestions before we can merge this sweet feature

  1. Mind using just latest_lifecycle_state and lifecycle_state for the column names to represent states? In OpenLineage, the facet does capture the state change (so the naming is appropriate), but in Marquez the column will represent the current state at a given time in the dataset lifecycle. You can think about how the run states are stored in the runs table.
  2. I think we should also add the column latest_lifecycle_state to the datasets table?
  3. With Add migration script to remove size limits from namespaces, dataset n… #1925 merged, you'll need to update you sql migration file to use V41 and above
  4. Mind also updating the changelog?

@wslulciuc wslulciuc added this to the 0.22.0 milestone Mar 31, 2022
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
@pawel-big-lebowski pawel-big-lebowski force-pushed the feature/table-operation-support branch from d4b2e85 to 0bbfc0a Compare April 1, 2022 13:51
@pawel-big-lebowski
Copy link
Collaborator Author

Great comments @wslulciuc. I've made changes for 1., 3. and 4. of your comments.

I am not sure whether adding column latest_lifecycle_state to datasets table is necessary. First, this does not require any extra join in postgres as whenever dataset is fetched, it is already joined with the latest dataset_version. Secondly, there are other columns obtained from dataset_version table in a similar way like fields:

+ "SELECT d.*, dv.fields, sv.schema_location, t.tags, facets\n"

Let me know if you had other argument for doing that.

@wslulciuc
Copy link
Member

First, this does not require any extra join in postgres as whenever dataset is fetched, it is already joined with the latest dataset_version. Secondly, there are other columns obtained from dataset_version table in a similar way like fields

Totally agree, thanks for pointing this out! The datasets table already has a reference to the current version (=current_version_uuid ), which will also contain the current lifecycle state of the table. So, there's a stronger argument not to add the latest_lifecycle_state to datasets given that it can be derived for dataset_versions 👍

@wslulciuc wslulciuc removed the review Ready for review label Apr 1, 2022
Copy link
Member

@wslulciuc wslulciuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, @pawel-big-lebowski 💯 💯 💯

@wslulciuc wslulciuc merged commit c1325c5 into MarquezProject:main Apr 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Add support for table operations reported from OpenLineage
2 participants