Optimize Column Lineage Query Performance #2821

@vinhnemo vinhnemo commented May 24, 2024

Problem

The dataset_fields_view CTE in ColumnLineageDao.java joins dataset_fields to every dataset in datasets_view. It has no filter that narrows the dataset fields down to those linked with the version UUIDs identified in selected_column_lineage.
As a result, the query processes more rows than necessary, which can increase execution time, especially on large datasets.

Closes: #2802

Solution

I propose adding a filter condition to the CTE dataset_fields_view in ColumnLineageDao.java:
From:

        dataset_fields_view AS (
          SELECT d.namespace_name as namespace_name, d.name as dataset_name, df.name as field_name, df.type, df.uuid
          FROM dataset_fields df
          INNER JOIN datasets_view d ON d.uuid = df.dataset_uuid
        )

To:

dataset_fields_view AS (
  SELECT 
    d.namespace_name as namespace_name, 
    d.name as dataset_name, 
    df.name as field_name, 
    df.type, 
    df.uuid 
  FROM 
    dataset_fields df 
    INNER JOIN (
      SELECT
        *
      FROM
        datasets_view
      WHERE
        current_version_uuid IN (
          SELECT 
            DISTINCT output_dataset_version_uuid 
          FROM 
            selected_column_lineage 
          UNION 
          SELECT 
            DISTINCT input_dataset_version_uuid 
          FROM 
            selected_column_lineage
        )
    ) d ON d.uuid = df.dataset_uuid
)

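This filter restricts the join to datasets whose current_version_uuid appears as an input or output dataset version in selected_column_lineage, so only the relevant dataset fields are processed and the query does less work.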
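The effect of the filter can be checked on a toy database. The sketch below is an illustration only: it uses SQLite from Python rather than the PostgreSQL database Marquez runs on, the table definitions are simplified stand-ins with assumed columns, and selected_column_lineage is modeled as a plain table instead of a CTE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins for the real tables/views (assumed columns only).
    CREATE TABLE datasets_view (
        uuid TEXT, namespace_name TEXT, name TEXT, current_version_uuid TEXT);
    CREATE TABLE dataset_fields (
        uuid TEXT, dataset_uuid TEXT, name TEXT, type TEXT);
    -- A CTE in the real query; a table keeps this sketch short.
    CREATE TABLE selected_column_lineage (
        output_dataset_version_uuid TEXT, input_dataset_version_uuid TEXT);

    INSERT INTO datasets_view VALUES
        ('d1', 'ns', 'orders',    'v1'),
        ('d2', 'ns', 'customers', 'v2'),
        ('d3', 'ns', 'unrelated', 'v3');
    INSERT INTO dataset_fields VALUES
        ('f1', 'd1', 'order_id',    'INT'),
        ('f2', 'd2', 'customer_id', 'INT'),
        ('f3', 'd3', 'noise',       'TEXT');
    -- The selected lineage touches only versions v1 (output) and v2 (input).
    INSERT INTO selected_column_lineage VALUES ('v1', 'v2');
""")

# Body of the original CTE: every field of every dataset.
UNFILTERED = """
    SELECT d.name, df.name
    FROM dataset_fields df
    INNER JOIN datasets_view d ON d.uuid = df.dataset_uuid
    ORDER BY df.uuid
"""

# Body of the proposed CTE: only datasets referenced by the selected lineage.
FILTERED = """
    SELECT d.name, df.name
    FROM dataset_fields df
    INNER JOIN (
        SELECT * FROM datasets_view
        WHERE current_version_uuid IN (
            SELECT DISTINCT output_dataset_version_uuid FROM selected_column_lineage
            UNION
            SELECT DISTINCT input_dataset_version_uuid FROM selected_column_lineage
        )
    ) d ON d.uuid = df.dataset_uuid
    ORDER BY df.uuid
"""

unfiltered_rows = conn.execute(UNFILTERED).fetchall()
filtered_rows = conn.execute(FILTERED).fetchall()
print(len(unfiltered_rows), unfiltered_rows)
print(len(filtered_rows), filtered_rows)
```

In this toy data the unfiltered body returns the field of the unrelated dataset as well, while the filtered body returns only the fields of the two datasets referenced by the lineage. Note that the filter matches on current_version_uuid, so a dataset is kept only when its current version is the one referenced.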

One-line summary: Add a filter condition to the dataset_fields_view CTE in ColumnLineageDao.java.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've included a one-line summary of your change for the CHANGELOG.md (Depending on the change, this may not be necessary).
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@boring-cyborg boring-cyborg bot added the api API layer changes label May 24, 2024

boring-cyborg bot commented May 24, 2024

Thanks for opening your first pull request in the Marquez project! Please check out our contributing guidelines (https://github.com/MarquezProject/marquez/blob/main/CONTRIBUTING.md).

netlify bot commented May 24, 2024

Deploy Preview for peppy-sprite-186812 canceled.

Name Link
🔨 Latest commit a10360e
🔍 Latest deploy log https://app.netlify.com/sites/peppy-sprite-186812/deploys/665ded01517e610008b0be57

@vinhnemo vinhnemo force-pushed the feature/optimize-column-lineage-query-perf branch 6 times, most recently from f5325c6 to 536457a May 27, 2024 08:49

codecov bot commented May 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.55%. Comparing base (e54ffca) to head (e14f060).

Current head e14f060 differs from pull request most recent head a10360e

Please upload reports for the commit a10360e to get more accurate results.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2821      +/-   ##
============================================
- Coverage     84.56%   84.55%   -0.01%     
+ Complexity     1441     1440       -1     
============================================
  Files           251      251              
  Lines          6504     6501       -3     
  Branches        303      302       -1     
============================================
- Hits           5500     5497       -3     
  Misses          851      851              
  Partials        153      153              

@vinhnemo vinhnemo force-pushed the feature/optimize-column-lineage-query-perf branch from 536457a to e106288 May 27, 2024 09:03
Signed-off-by: Vinh Nguyen <phuvinh97ag@gmail.com>
@vinhnemo vinhnemo force-pushed the feature/optimize-column-lineage-query-perf branch from e106288 to d17a469 May 27, 2024 09:09
@wslulciuc wslulciuc added this to the 0.48.0 milestone May 28, 2024
@wslulciuc wslulciuc added the db.perf This issue or pull request improves DB performance label May 28, 2024
- Format query
- replace select * with uuid, namespace_name, name

Signed-off-by: Vinh Nguyen <phuvinh97ag@gmail.com>
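
The commit message above mentions replacing select * with uuid, namespace_name, name in the inner subquery. The sketch below is a reconstruction from that message, not the merged code: it uses SQLite from Python with simplified, assumed table definitions (including a hypothetical extra description column) to check that the explicit column list yields the same dataset_fields_view rows as select *.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins; 'description' is a hypothetical extra column
    -- that SELECT * would carry along but the outer query never reads.
    CREATE TABLE datasets_view (
        uuid TEXT, namespace_name TEXT, name TEXT,
        current_version_uuid TEXT, description TEXT);
    CREATE TABLE dataset_fields (
        uuid TEXT, dataset_uuid TEXT, name TEXT, type TEXT);
    CREATE TABLE selected_column_lineage (
        output_dataset_version_uuid TEXT, input_dataset_version_uuid TEXT);

    INSERT INTO datasets_view VALUES
        ('d1', 'ns', 'orders',    'v1', 'order facts'),
        ('d2', 'ns', 'customers', 'v2', 'customer dimension'),
        ('d3', 'ns', 'unrelated', 'v3', 'not in the lineage');
    INSERT INTO dataset_fields VALUES
        ('f1', 'd1', 'order_id',    'INT'),
        ('f2', 'd2', 'customer_id', 'INT'),
        ('f3', 'd3', 'noise',       'TEXT');
    INSERT INTO selected_column_lineage VALUES ('v1', 'v2');
""")


def dataset_fields_view(inner_columns: str) -> str:
    """Build the CTE body with the given inner column list.

    inner_columns is a hard-coded literal in this sketch, never user input.
    """
    return f"""
        SELECT d.namespace_name, d.name, df.name, df.type, df.uuid
        FROM dataset_fields df
        INNER JOIN (
            SELECT {inner_columns}
            FROM datasets_view
            WHERE current_version_uuid IN (
                SELECT DISTINCT output_dataset_version_uuid FROM selected_column_lineage
                UNION
                SELECT DISTINCT input_dataset_version_uuid FROM selected_column_lineage
            )
        ) d ON d.uuid = df.dataset_uuid
        ORDER BY df.uuid
    """


star_rows = conn.execute(dataset_fields_view("*")).fetchall()
explicit_rows = conn.execute(
    dataset_fields_view("uuid, namespace_name, name")
).fetchall()
print(star_rows == explicit_rows)
```

The outer query only reads uuid, namespace_name, and name from the subquery, so listing those three columns leaves the result unchanged while making the subquery's dependencies explicit.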
@vinhnemo vinhnemo force-pushed the feature/optimize-column-lineage-query-perf branch from 2805ad8 to e14f060 June 3, 2024 09:32
@phixMe phixMe enabled auto-merge (squash) June 3, 2024 16:19
@phixMe phixMe merged commit 7d0b290 into MarquezProject:main Jun 3, 2024
15 checks passed

boring-cyborg bot commented Jun 3, 2024

Great job! Congrats on your first merged pull request in the Marquez project!

Labels
api: API layer changes
db.perf: This issue or pull request improves DB performance
Projects
Status: Done
3 participants