fix column lineage returning multiple entries for job run multiple times #2176

Merged (1 commit) on Oct 10, 2022

pawel-big-lebowski
Collaborator

@pawel-big-lebowski commented Oct 10, 2022

Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>

Problem

Each column lineage row contains the columns output_dataset_version_uuid, output_dataset_field_uuid, input_dataset_version_uuid, and input_dataset_field_uuid. As a result, when a single job that reads one column and writes to another is run twice, the column-lineage table ends up with two rows. Storing both rows is fine, since we want to track this information, but the endpoint should not return the same dependency twice.

Solution

Return each column dependency only once, even if the job has been run several times.
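
To illustrate the intended behaviour, here is a minimal in-memory sketch. It is not the change made in ColumnLineageDao.java, which is not shown in this conversation. The class name `ColumnLineageDedup`, the `FieldDependency` record, and the choice to identify a dependency by its (output field, input field) pair while ignoring dataset versions are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

public class ColumnLineageDedup {

  /** One stored row, using the four columns named in the problem description. */
  record ColumnLineageRow(
      UUID outputDatasetVersionUuid,
      UUID outputDatasetFieldUuid,
      UUID inputDatasetVersionUuid,
      UUID inputDatasetFieldUuid) {}

  /** Hypothetical key: a dependency identified by field UUIDs only (assumption). */
  record FieldDependency(UUID outputDatasetFieldUuid, UUID inputDatasetFieldUuid) {}

  /** Collapses rows that differ only in their dataset version UUIDs. */
  static List<FieldDependency> distinctDependencies(List<ColumnLineageRow> rows) {
    // LinkedHashSet drops duplicates and keeps first-seen order;
    // records supply value-based equals() and hashCode().
    Set<FieldDependency> seen = new LinkedHashSet<>();
    for (ColumnLineageRow row : rows) {
      seen.add(new FieldDependency(row.outputDatasetFieldUuid(), row.inputDatasetFieldUuid()));
    }
    return new ArrayList<>(seen);
  }

  public static void main(String[] args) {
    UUID inputField = UUID.randomUUID();
    UUID outputField = UUID.randomUUID();

    // Two runs of the same job: identical fields, fresh version UUIDs per run.
    List<ColumnLineageRow> rows = List.of(
        new ColumnLineageRow(UUID.randomUUID(), outputField, UUID.randomUUID(), inputField),
        new ColumnLineageRow(UUID.randomUUID(), outputField, UUID.randomUUID(), inputField));

    // Prints: 2 rows stored, 1 dependency returned
    System.out.println(
        rows.size() + " rows stored, " + distinctDependencies(rows).size() + " dependency returned");
  }
}
```

Both rows stay in storage, so the per-run history is preserved; only the view returned to the caller is collapsed. In the actual service the same effect would more likely be achieved in the query itself rather than in application code.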

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
codecov bot commented Oct 10, 2022

Codecov Report

Merging #2176 (5e496ec) into main (496566e) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##               main    #2176   +/-   ##
=========================================
  Coverage     76.33%   76.33%           
  Complexity     1099     1099           
=========================================
  Files           214      214           
  Lines          5139     5139           
  Branches        407      407           
=========================================
  Hits           3923     3923           
  Misses          762      762           
  Partials        454      454           
Impacted Files Coverage Δ
api/src/main/java/marquez/db/ColumnLineageDao.java 100.00% <ø> (ø)

@pawel-big-lebowski merged commit aa7a47d into main Oct 10, 2022
@pawel-big-lebowski deleted the fix-column-lineaege-mulitple-runs branch October 10, 2022 11:26