Update lineage query to only look at jobs with inputs or outputs #2068

Merged: 2 commits, Aug 10, 2022

collado-mike
Collaborator

@collado-mike collado-mike commented Aug 10, 2022

Signed-off-by: Michael Collado <collado.mike@gmail.com>

Problem

In many environments, a large number of the jobs that report events have no inputs or outputs, e.g., PythonOperators in an Airflow deployment. If a Marquez installation has many of these, the lineage query spends much of its time searching for overlaps with jobs that have no inputs or outputs. In one installation, we have more than 200K jobs, but only ~7,000 of them have any inputs or outputs at all.

Solution

This changes the lineage query to read from the job_versions_io_mapping table and INNER JOIN it with jobs_view, so that only jobs that have inputs or outputs are present in the jobs_io CTE. As a result, the jobs_io CTE becomes very small and the recursive join in the lineage CTE is very fast.
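As a rough illustration of the idea (not the actual Marquez query), the sketch below uses SQLite through Python's standard library. The two tables borrow the names jobs_view and job_versions_io_mapping, but their columns (name, job_name, dataset_name, io_type) and the sample jobs are assumptions made up for this example; the real schema and query run on PostgreSQL and differ in detail.

```python
import sqlite3

# Hypothetical, heavily simplified stand-ins for the real Marquez objects.
SCHEMA = """
CREATE TABLE jobs_view (name TEXT PRIMARY KEY);
CREATE TABLE job_versions_io_mapping (
    job_name     TEXT NOT NULL,
    dataset_name TEXT NOT NULL,
    io_type      TEXT NOT NULL  -- 'INPUT' or 'OUTPUT'
);
"""

# Jobs with no row in the mapping table are dropped by the INNER JOIN.
JOBS_IO = """
SELECT j.name AS job_name, io.dataset_name, io.io_type
FROM job_versions_io_mapping AS io
INNER JOIN jobs_view AS j ON j.name = io.job_name
"""

# Recursive walk: start at one job, then repeatedly add any job that
# shares a dataset with a job already in the result. UNION (not UNION ALL)
# removes duplicates, which is what lets the recursion terminate.
LINEAGE = f"""
WITH RECURSIVE
jobs_io AS ({JOBS_IO}),
lineage(job_name) AS (
    SELECT ?
    UNION
    SELECT b.job_name
    FROM lineage AS l
    JOIN jobs_io AS a ON a.job_name = l.job_name
    JOIN jobs_io AS b ON b.dataset_name = a.dataset_name
                     AND b.job_name <> a.job_name
)
SELECT job_name FROM lineage ORDER BY job_name
"""


def build_demo_db(idle_jobs=100):
    """Create an in-memory database with 3 connected jobs and many idle ones."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    names = ["extract", "transform", "load"]
    names += [f"py_task_{i}" for i in range(idle_jobs)]  # jobs without I/O
    conn.executemany("INSERT INTO jobs_view VALUES (?)", [(n,) for n in names])
    conn.executemany(
        "INSERT INTO job_versions_io_mapping VALUES (?, ?, ?)",
        [
            ("extract", "raw", "OUTPUT"),
            ("transform", "raw", "INPUT"),
            ("transform", "clean", "OUTPUT"),
            ("load", "clean", "INPUT"),
        ],
    )
    return conn


def count_jobs_io(conn):
    """Number of distinct jobs that survive the INNER JOIN."""
    sql = f"SELECT COUNT(DISTINCT job_name) FROM ({JOBS_IO})"
    return conn.execute(sql).fetchone()[0]


def lineage(conn, start_job):
    """Sorted names of all jobs reachable from start_job via shared datasets."""
    return [row[0] for row in conn.execute(LINEAGE, (start_job,))]


if __name__ == "__main__":
    db = build_demo_db()
    print(count_jobs_io(db))
    print(lineage(db, "transform"))
```

In this toy data set, 103 jobs exist but only the three with mapping rows enter jobs_io, so the recursive step never has to compare against the idle py_task_* jobs.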

It is probably worth noting that the missing inputs/outputs are largely due to insufficient coverage by the OpenLineage integrations, e.g., those PythonOperators are likely reading data from somewhere. This is, at best, a short-term fix until OL coverage increases, at which point the query will have to be revisited.

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

Signed-off-by: Michael Collado <collado.mike@gmail.com>
@collado-mike collado-mike requested a review from wslulciuc August 10, 2022 20:50
@collado-mike collado-mike enabled auto-merge (squash) August 10, 2022 22:21
codecov bot commented Aug 10, 2022

Codecov Report

Merging #2068 (4af5216) into main (476e472) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##               main    #2068   +/-   ##
=========================================
  Coverage     78.79%   78.79%           
  Complexity     1011     1011           
=========================================
  Files           200      200           
  Lines          5574     5574           
  Branches        422      422           
=========================================
  Hits           4392     4392           
  Misses          730      730           
  Partials        452      452           

Member

@wslulciuc wslulciuc left a comment

@collado-mike, though you consider this a quick "hack", I view it as a reasonable optimization to improve lineage query performance. As you pointed out, it is a short-term optimization: as OpenLineage coverage improves, more jobs will report inputs or outputs, which will, in turn, slow the lineage query down again. But we have spoken briefly about ways to keep improving lineage query performance, such as introducing materialized views, caching, etc. Anyways, great work 💯 🥇

@collado-mike collado-mike merged commit 98f3114 into main Aug 10, 2022
@collado-mike collado-mike deleted the fix/lineage_query_perf branch August 10, 2022 22:38