Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Runless events - refactor job_versions_io_mapping #2654

Merged
merged 2 commits into from
Dec 13, 2023

Conversation

pawel-big-lebowski
Copy link
Collaborator

@pawel-big-lebowski pawel-big-lebowski commented Oct 17, 2023

Problem

This is currently a draft PR which is far from being merged. It is missing few tests related to schema changes which are marked with todo within the code. I've created a PR to have a better discussion on adding job_id to job_versions_io_mapping. This PR should be a follow-up of #2641.

The assumption was that it should be helpful in optimising get-lineage query. I would like first to clarify how are we going to make benefit of this extra column.

Solution

Please describe your change as it relates to the problem, or bug fix, as well as any dependencies. If your change requires a database schema migration, please describe the schema modification(s) and whether it's a backwards-incompatible or backwards-compatible change.

Note: All database schema changes require discussion. Please link the issue for context.

One-line summary:

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've included a one-line summary of your change for the CHANGELOG.md (Depending on the change, this may not be necessary).
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@boring-cyborg boring-cyborg bot added api API layer changes docs labels Oct 17, 2023
@netlify
Copy link

netlify bot commented Oct 17, 2023

Deploy Preview for peppy-sprite-186812 canceled.

Name Link
🔨 Latest commit a8cdbe0
🔍 Latest deploy log https://app.netlify.com/sites/peppy-sprite-186812/deploys/65795f0ad206920009a4ac89

@codecov
Copy link

codecov bot commented Oct 17, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (a5a0e55) 84.08% compared to head (a8cdbe0) 84.15%.

Files Patch % Lines
...rations/V67_2_JobVersionsIOMappingBackfillJob.java 81.81% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2654      +/-   ##
============================================
+ Coverage     84.08%   84.15%   +0.06%     
- Complexity     1379     1390      +11     
============================================
  Files           248      249       +1     
  Lines          6295     6322      +27     
  Branches        286      286              
============================================
+ Hits           5293     5320      +27     
- Misses          849      850       +1     
+ Partials        153      152       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

api/src/main/java/marquez/db/DatasetFacetsDao.java Outdated Show resolved Hide resolved
@NonNull Instant lineageEventTime,
@NonNull String lineageEventType,
String lineageEventType,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use DATASET or JOB as the lineageEventType? I think it would be helpful to know the event type.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, this column holds run states and its name is kind of misleading. Its name fits static lineage scenario well. However, I don't think we should store in a single column run-state and event-type. Renaming this column would require significant amount of work witch changes all over the project, including the spec.

Assuming this column contains run state update, then keeping it to null for DatasetEvent makes sense to me.

Copy link
Member

@wslulciuc wslulciuc Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it's misleading, but Marquez doesn't necessarily have to adhere exactly to OpenLineage concepts and can redefine them. We can set lineage_event_type as null , but would prefer we (eventually) remap eventType to runState (just not within this PR).

@pawel-big-lebowski pawel-big-lebowski changed the title Runless events - consume dataset event Runless events - refactor job_versions_io_mapping Oct 18, 2023
@pawel-big-lebowski pawel-big-lebowski changed the base branch from main to static/dataset-event October 18, 2023 09:57
@pawel-big-lebowski pawel-big-lebowski force-pushed the static/job-version-mapping branch 5 times, most recently from 1607104 to 9f1dedd Compare October 23, 2023 10:57
@pawel-big-lebowski pawel-big-lebowski force-pushed the static/dataset-event branch 3 times, most recently from 2213b35 to 40bfe6b Compare October 24, 2023 12:24
@pawel-big-lebowski pawel-big-lebowski force-pushed the static/job-version-mapping branch 2 times, most recently from 46e26da to 7592059 Compare October 24, 2023 13:06
@pawel-big-lebowski
Copy link
Collaborator Author

@wslulciuc I've added another commit to the PR keeping in mind upcoming streaming job support and our offline discussion.

Based on that, I think we should rename job_versions_io_mapping to job_io_mapping and make job_version_uuid nullable to be set at the end of the job.

@julienledem
Copy link
Member

That's sounds fine to me. I'm curious to hear what Willy thinks.

Base automatically changed from static/dataset-event to main November 6, 2023 07:16
@pawel-big-lebowski pawel-big-lebowski force-pushed the static/job-version-mapping branch from ca5f6cd to 79acd2d Compare November 8, 2023 07:42
@pawel-big-lebowski pawel-big-lebowski changed the base branch from main to static/job-event November 8, 2023 07:42
@pawel-big-lebowski pawel-big-lebowski force-pushed the static/job-version-mapping branch 3 times, most recently from 2110951 to 4aa0900 Compare November 8, 2023 13:15
@pawel-big-lebowski pawel-big-lebowski force-pushed the static/job-version-mapping branch 4 times, most recently from b6c0213 to 554f90c Compare November 10, 2023 07:39
@pawel-big-lebowski pawel-big-lebowski marked this pull request as ready for review November 10, 2023 07:40
Base automatically changed from static/job-event to main November 16, 2023 06:57
@pawel-big-lebowski pawel-big-lebowski force-pushed the static/job-version-mapping branch 2 times, most recently from d7166cc to 17810f2 Compare November 16, 2023 07:17
ARRAY_AGG(DISTINCT io.dataset_uuid) FILTER (WHERE io.io_type='INPUT') AS inputs,
ARRAY_AGG(DISTINCT io.dataset_uuid) FILTER (WHERE io.io_type='OUTPUT') AS outputs
FROM job_io_mapping io
WHERE io.is_job_version_current = TRUE
Copy link
Member

@wslulciuc wslulciuc Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of using a current version check, I'd be interesting to see the query plan and how the query may be optimized with the removal of join on that jobs table. Do we have any numbers on this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getLineage method is commonly used as it is entry point for a user to Marquez. The method is recursive and the purpose of this refactor is to make each recursion step to be computed within a single table with no joins required.

After this change, a whole lineage graph can be computed based on job_versions_io_mapping table. jobs_view is used only to enrich the returned graph nodes. Before this change, a join to jobs_view was needed in each recursion step to make sure if a row in job_versions_io_mapping represents current job version.

@@ -0,0 +1,13 @@
ALTER TABLE job_versions_io_mapping ADD COLUMN job_uuid uuid REFERENCES jobs(uuid) ON DELETE CASCADE;
ALTER TABLE job_versions_io_mapping ADD COLUMN symlink_target_job_uuid uuid REFERENCES jobs(uuid) ON DELETE CASCADE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: symlink_target_job_uuid -> job_symlink_target_uuid since it's defined in the jobs table as jobs.symlink_target_uuid

INSERT INTO job_io_mapping (
job_version_uuid, dataset_uuid, io_type, job_uuid, symlink_target_job_uuid, is_job_version_current)
VALUES (:jobVersionUuid, :datasetUuid, :ioType, :jobUuid, :symlinkTargetJobUuid, TRUE)
ON CONFLICT (job_version_uuid, dataset_uuid, io_type, job_uuid) DO UPDATE SET is_job_version_current = TRUE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We set is_job_version_current = TRUE as a noop? i.e. just to fulfill the ON CONFLICT? Should we just use DO NOTHING instead?

Copy link
Collaborator Author

@pawel-big-lebowski pawel-big-lebowski Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right we can go with do nothing.
markVersionIOMappingObsolete marks obsolete rows with job version different that a given one, so we don't need to implement conflict scenario here.

@pawel-big-lebowski pawel-big-lebowski force-pushed the static/job-version-mapping branch from 17810f2 to c00b988 Compare December 5, 2023 11:34
@@ -0,0 +1,11 @@
ALTER TABLE job_versions_io_mapping ADD COLUMN job_uuid uuid REFERENCES jobs(uuid) ON DELETE CASCADE;
ALTER TABLE job_versions_io_mapping ADD COLUMN job_symlink_target_uuid uuid REFERENCES jobs(uuid) ON DELETE CASCADE;
ALTER TABLE job_versions_io_mapping ADD COLUMN is_current_job_version boolean DEFAULT FALSE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a made_current_at column?

Copy link
Member

@wslulciuc wslulciuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking forward to the lineage query perf improvements and follow up analysis!

@wslulciuc wslulciuc added this to the 0.43.0 milestone Dec 13, 2023
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
@pawel-big-lebowski pawel-big-lebowski force-pushed the static/job-version-mapping branch from c00b988 to a8cdbe0 Compare December 13, 2023 07:36
@pawel-big-lebowski pawel-big-lebowski merged commit b73fb15 into main Dec 13, 2023
16 checks passed
@pawel-big-lebowski pawel-big-lebowski deleted the static/job-version-mapping branch December 13, 2023 07:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api API layer changes docs
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants