Update lineage query to only look at jobs with inputs or outputs #2068
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Signed-off-by: Michael Collado collado.mike@gmail.com
Problem
In many environments a large number of jobs reporting events have no inputs or outputs - e.g., PythonOperators in an Airflow deployment. If a Marquez installation has a lot of these, the lineage query spends a lot of its time searching for overlaps with jobs that have no inputs or outputs. In one installation, we have > 200K jobs, but only ~7000 jobs that have any inputs or outputs at all.
Solution
This changes the lineage query to query the
job_versions_io_mapping
table and INNER join with thejobs_view
so that only jobs that have inputs or outputs are present in thejobs_io
CTE. The impact of this is that table becomes very small and the recursive join in thelineage
CTE is very fast.Probably notable that the missing inputs/outputs are largely due to insufficient coverage by the OpenLineage integrations - e.g., those PythonOperators are likely reading data from somewhere. This is, at best, a short term fix until OL coverage increases, at which point, the query will have to be revisited again.
Checklist
CHANGELOG.md
with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary).sql
database schema migration according to Flyway's naming convention (if relevant)