Fix lineage for orphaned datasets #2314

Merged 2 commits on Dec 12, 2022

Conversation

collado-mike
Collaborator

Signed-off-by: Michael Collado <collado.mike@gmail.com>

Problem

Sometimes a dataset is generated by a job whose current version no longer writes to that dataset. Because the lineage logic for a dataset always starts from a job that has written to or read from the dataset, we generate the lineage for the current version of that job, which may not include the dataset we started from.

Solution

This change validates that the selected dataset node is always present in the lineage results returned from the database. If the dataset is not in the set of nodes returned from the database, we assume that it's no longer connected to the original job and return a lineage graph containing only the original dataset node.
(Note that we always select the latest job that has written to or read from the dataset, so if a newer job now writes to that dataset, it will not be treated as an orphaned dataset.)
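
For reviewers, the check described above can be sketched roughly as follows. This is a minimal illustration, not the code in `LineageService.java`: the class name `OrphanedLineageSketch`, the method `resolveLineageNodes`, and the use of plain string node ids are all assumptions made for the example.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; names and types are illustrative, not Marquez's actual API.
class OrphanedLineageSketch {

  /**
   * Returns the node ids to include in the lineage graph.
   *
   * @param datasetNodeId the dataset the caller asked about
   * @param nodesFromDb node ids returned by the lineage query for the latest job
   *     that wrote to or read from the dataset
   */
  static Set<String> resolveLineageNodes(String datasetNodeId, Set<String> nodesFromDb) {
    if (nodesFromDb.contains(datasetNodeId)) {
      // The dataset is still connected to the job's current version.
      return nodesFromDb;
    }
    // Orphaned: the job's current version no longer touches this dataset,
    // so return a graph containing only the requested dataset node.
    return Collections.singleton(datasetNodeId);
  }

  public static void main(String[] args) {
    // Lineage of a job whose current version writes to a different dataset.
    Set<String> currentJobLineage = new HashSet<>();
    currentJobLineage.add("job:etl");
    currentJobLineage.add("dataset:new_output");

    System.out.println(resolveLineageNodes("dataset:old_output", currentJobLineage));
    System.out.println(resolveLineageNodes("dataset:new_output", currentJobLineage));
  }
}
```

In this sketch the orphaned case returns a single-node graph rather than an error, mirroring the behavior the description states; the connected case passes the database result through unchanged.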

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@boring-cyborg boring-cyborg bot added the api API layer changes label Dec 12, 2022
Signed-off-by: Michael Collado <collado.mike@gmail.com>
@collado-mike collado-mike force-pushed the fix/include_orphaned_dataset_lineage branch from 2e760fb to b302008 Compare December 12, 2022 21:08
codecov bot commented Dec 12, 2022

Codecov Report

Merging #2314 (95aac6f) into main (b1ff80e) will increase coverage by 0.06%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##               main    #2314      +/-   ##
============================================
+ Coverage     77.01%   77.07%   +0.06%     
- Complexity     1166     1170       +4     
============================================
  Files           222      222              
  Lines          5307     5317      +10     
  Branches        424      425       +1     
============================================
+ Hits           4087     4098      +11     
  Misses          747      747              
+ Partials        473      472       -1     
Impacted Files Coverage Δ
.../src/main/java/marquez/service/LineageService.java 86.77% <100.00%> (+2.09%) ⬆️

@wslulciuc wslulciuc enabled auto-merge (squash) December 12, 2022 21:28
@wslulciuc wslulciuc disabled auto-merge December 12, 2022 21:38
@wslulciuc wslulciuc merged commit 3212c8f into main Dec 12, 2022
@wslulciuc wslulciuc deleted the fix/include_orphaned_dataset_lineage branch December 12, 2022 21:38