Model and store column lineage in Marquez DB #2096

Merged on Sep 30, 2022 (30 commits)

mzareba382 (Contributor) commented Aug 30, 2022

Problem & solution

This PR adds a column-level lineage representation and an API endpoint to retrieve column-level lineage data from Marquez's database. It is based on OpenLineage's column-level lineage facet.

Column-level lineage data is stored in a separate table with the following fields:

CREATE TABLE column_lineage (
  output_dataset_version_uuid   uuid REFERENCES dataset_versions(uuid), -- allows join to run_id
  output_dataset_field_uuid     uuid REFERENCES dataset_fields(uuid),
  input_dataset_version_uuid    uuid REFERENCES dataset_versions(uuid), -- speeds up column lineage graph traversal
  input_dataset_field_uuid      uuid REFERENCES dataset_fields(uuid),
  transformation_description    VARCHAR(255) NOT NULL,
  transformation_type           VARCHAR(255) NOT NULL,
  created_at                    TIMESTAMP NOT NULL,
  updated_at                    TIMESTAMP NOT NULL,
  UNIQUE (output_dataset_version_uuid, output_dataset_field_uuid, input_dataset_version_uuid, input_dataset_field_uuid)
);
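The UNIQUE constraint over the four UUID columns means each output-field/input-field pair can appear at most once per pair of dataset versions, so re-delivering the same lineage should update a row rather than duplicate it. A minimal in-memory sketch of that behavior follows. It is not Marquez's DAO: the class `ColumnLineageUpsertSketch`, its `Key` and `Row` records, and the choice to preserve `created_at` while refreshing `updated_at` and the transformation columns on conflict are all assumptions made for illustration.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical in-memory stand-in for the column_lineage table (not Marquez code).
final class ColumnLineageUpsertSketch {

  // Mirrors the UNIQUE (...) constraint: the four UUIDs identify one row.
  record Key(UUID outputDatasetVersionUuid, UUID outputDatasetFieldUuid,
             UUID inputDatasetVersionUuid, UUID inputDatasetFieldUuid) {}

  record Row(Key key, String transformationDescription, String transformationType,
             Instant createdAt, Instant updatedAt) {}

  private final Map<Key, Row> rows = new LinkedHashMap<>();

  // Inserts a new row, or on a key conflict refreshes the mutable columns.
  // Assumption: created_at is kept from the existing row on conflict.
  Row upsert(Key key, String description, String type, Instant now) {
    return rows.merge(
        key,
        new Row(key, description, type, now, now),
        (existing, incoming) -> new Row(
            key,
            incoming.transformationDescription(),
            incoming.transformationType(),
            existing.createdAt(),
            incoming.updatedAt()));
  }

  int size() {
    return rows.size();
  }

  public static void main(String[] args) {
    ColumnLineageUpsertSketch table = new ColumnLineageUpsertSketch();
    Key key = new Key(UUID.randomUUID(), UUID.randomUUID(),
                      UUID.randomUUID(), UUID.randomUUID());
    table.upsert(key, "identity", "IDENTITY", Instant.parse("2022-08-30T00:00:00Z"));
    table.upsert(key, "masked", "MASKED", Instant.parse("2022-09-30T00:00:00Z"));
    System.out.println(table.size()); // prints 1: the second call updated the row
  }
}
```

In a real PostgreSQL-backed DAO the same effect would typically come from an insert with a conflict clause on the unique columns; the map here only illustrates the keying.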

Relevant tickets are:

Solution:

  • Column-level lineage DB table with its DAO class
  • Update the relevant methods in the OpenLineageDao class to upsert into the table when OpenLineage metadata arrives.
  • output_column_name and input_field are linked via UUIDs from the dataset_fields table. The table is normalized so that lineage is represented as references in the database.
  • Implement tests in OpenLineageDaoTests to check that column-level lineage data is correctly extracted from facets and saved to the database.
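The linking step in the list above can be sketched as follows. This is an illustration only, not the code in OpenLineageDao: it assumes the facet arrives as nested maps shaped like OpenLineage's column-level lineage facet (a `fields` map keyed by output column, each entry holding `inputFields` with `namespace`, `name`, and `field`, plus `transformationDescription` and `transformationType`), and it assumes a hypothetical `fieldUuids` lookup standing in for the dataset_fields table. The names `ColumnLineageFacetSketch`, `FieldRef`, `Edge`, and `toEdges` are invented for the example, and skipping unresolved fields is one possible policy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch: flatten a column-lineage facet into UUID-to-UUID edges.
final class ColumnLineageFacetSketch {

  // Identifies a column by dataset namespace, dataset name, and field name.
  record FieldRef(String namespace, String dataset, String field) {}

  // One input-to-output edge expressed as dataset_fields UUIDs.
  record Edge(UUID outputFieldUuid, UUID inputFieldUuid,
              String transformationDescription, String transformationType) {}

  @SuppressWarnings("unchecked")
  static List<Edge> toEdges(String outNamespace, String outDataset,
                            Map<String, Object> facet, Map<FieldRef, UUID> fieldUuids) {
    List<Edge> edges = new ArrayList<>();
    Map<String, Map<String, Object>> fields =
        (Map<String, Map<String, Object>>) facet.get("fields");
    for (Map.Entry<String, Map<String, Object>> entry : fields.entrySet()) {
      UUID outUuid = fieldUuids.get(new FieldRef(outNamespace, outDataset, entry.getKey()));
      Map<String, Object> spec = entry.getValue();
      List<Map<String, String>> inputs =
          (List<Map<String, String>>) spec.get("inputFields");
      for (Map<String, String> in : inputs) {
        UUID inUuid = fieldUuids.get(
            new FieldRef(in.get("namespace"), in.get("name"), in.get("field")));
        if (outUuid == null || inUuid == null) {
          continue; // assumption: fields missing from the lookup are skipped
        }
        edges.add(new Edge(outUuid, inUuid,
            (String) spec.get("transformationDescription"),
            (String) spec.get("transformationType")));
      }
    }
    return edges;
  }

  public static void main(String[] args) {
    Map<String, Object> spec = new HashMap<>();
    spec.put("inputFields",
        List.of(Map.of("namespace", "ns", "name", "orders", "field", "price")));
    spec.put("transformationDescription", "copied as-is");
    spec.put("transformationType", "IDENTITY");
    Map<String, Object> facet = Map.of("fields", Map.of("total", spec));

    Map<FieldRef, UUID> fieldUuids = new HashMap<>();
    fieldUuids.put(new FieldRef("ns", "report", "total"), UUID.randomUUID());
    fieldUuids.put(new FieldRef("ns", "orders", "price"), UUID.randomUUID());

    System.out.println(toEdges("ns", "report", facet, fieldUuids));
  }
}
```

Each resulting edge corresponds to one candidate row in column_lineage once the dataset version UUIDs are attached.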

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

Mariusz Zaręba and others added 3 commits August 29, 2022 19:10
Signed-off-by: mzareba <mzareba382@gmail.com>

boring-cyborg bot commented Aug 30, 2022

Thanks for opening your first pull request in the Marquez project! Please check out our contributing guidelines (https://github.com/MarquezProject/marquez/blob/main/CONTRIBUTING.md).

mzareba382 and others added 8 commits August 30, 2022 17:25
…o DatasetRecord, write test for createLineageRow() invocation

Signed-off-by: mzareba <mzareba382@gmail.com>

codecov bot commented Sep 9, 2022

Codecov Report

Merging #2096 (21dac22) into main (2909864) will increase coverage by 0.29%.
The diff coverage is 93.90%.

@@             Coverage Diff              @@
##               main    #2096      +/-   ##
============================================
+ Coverage     75.49%   75.78%   +0.29%     
- Complexity     1045     1061      +16     
============================================
  Files           206      209       +3     
  Lines          4925     5006      +81     
  Branches        399      403       +4     
============================================
+ Hits           3718     3794      +76     
  Misses          763      763              
- Partials        444      449       +5     
Impacted Files                                         Coverage Δ
api/src/main/java/marquez/db/Columns.java              81.81% <ø> (ø)
api/src/main/java/marquez/db/DatasetFieldDao.java      100.00% <ø> (ø)
.../main/java/marquez/db/models/UpdateLineageRow.java  100.00% <ø> (ø)
.../main/java/marquez/db/mappers/FieldDataMapper.java  88.88% <88.88%> (ø)
...ava/marquez/db/mappers/ColumnLineageRowMapper.java  90.90% <90.90%> (ø)
api/src/main/java/marquez/db/OpenLineageDao.java       95.21% <93.75%> (-0.21%) ⬇️
api/src/main/java/marquez/db/ColumnLineageDao.java     100.00% <100.00%> (ø)
...main/java/marquez/service/models/LineageEvent.java  85.07% <100.00%> (+0.94%) ⬆️


@mzareba382 mzareba382 changed the title from "WIP: Add column level lineage representation and endpoint" to "WIP: Add column level lineage representation" Sep 12, 2022
@boring-cyborg boring-cyborg bot added the docs label Sep 14, 2022
@pawel-big-lebowski pawel-big-lebowski marked this pull request as ready for review September 14, 2022 08:26
@pawel-big-lebowski pawel-big-lebowski marked this pull request as draft September 15, 2022 15:19
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
@pawel-big-lebowski pawel-big-lebowski marked this pull request as ready for review September 16, 2022 07:51
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
wslulciuc (Member) commented

@mzareba382 I was wondering if this PR is now outdated? /cc @mobuchowski, @pawel-big-lebowski

Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>

wslulciuc (Member) left a comment

Great work on modeling the column_lineage table and separating the column-level lineage facet into a separate table (similar to the design proposed in #2076). Also great to see extensive testing around this feature! 💯 🥇

@wslulciuc wslulciuc merged commit b6544ec into main Sep 30, 2022
@wslulciuc wslulciuc deleted the add-column-level-lineage branch September 30, 2022 09:23

boring-cyborg bot commented Sep 30, 2022

Great job! Congrats on your first merged pull request in the Marquez project!
