Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize Column Lineage Query Performance #2802

Closed
vinhnemo opened this issue Apr 22, 2024 · 3 comments · Fixed by #2821
Closed

Optimize Column Lineage Query Performance #2802

vinhnemo opened this issue Apr 22, 2024 · 3 comments · Fixed by #2821
Labels
db.perf This issue or pull request improves DB performance
Milestone

Comments

@vinhnemo
Copy link
Contributor

Description

While working with the column lineage queries within the Marquez project, I noticed that a particular query was performing suboptimally. Specifically, the query associated with the dataset fields view could take up to 4.5 seconds to execute under certain conditions. After investigating potential causes, I identified a missing filter in the Common Table Expression (CTE) that, when included, significantly improved performance by reducing the query time to approximately 1 second.

Issue

The current implementation of the dataset fields view query in ColumnLineageDao.java does not include a filter to narrow down the dataset fields to only those linked with the version UUIDs identified in selected_column_lineage. This results in processing a larger dataset than necessary.

Performance Impact

The lack of this filter can cause the query execution time to increase, especially when dealing with large datasets. In my testing environment, the execution time was observed at around 4.5 seconds.

Proposed Change

I propose adding a filter condition to the CTE dataset_fields_view in ColumnLineageDao.java:
From:

        dataset_fields_view AS (
          SELECT d.namespace_name as namespace_name, d.name as dataset_name, df.name as field_name, df.type, df.uuid
          FROM dataset_fields df
          INNER JOIN datasets_view d ON d.uuid = df.dataset_uuid
        )

To

dataset_fields_view AS (
  SELECT 
    d.namespace_name as namespace_name, 
    d.name as dataset_name, 
    df.name as field_name, 
    df.type, 
    df.uuid 
  FROM 
    dataset_fields df 
    INNER JOIN (
      select 
        * 
      from 
        datasets_view 
      where 
        current_version_uuid IN (
          SELECT 
            DISTINCT output_dataset_version_uuid 
          FROM 
            selected_column_lineage 
          UNION 
          SELECT 
            DISTINCT input_dataset_version_uuid 
          FROM 
            selected_column_lineage
        )
    ) d ON d.uuid = df.dataset_uuid
)

This filter will ensure that only relevant dataset fields are processed, improving the overall efficiency of the query.

Expected Outcome

The expected outcome of this change is a reduction in the execution time of the dataset fields view query, as evidenced by a decrease from 4.5 seconds to 1 second in tests. This improvement should translate to a better performance for all users interacting with this aspect of the Marquez API.

Steps to Reproduce

  1. Run the existing dataset fields view query on a large dataset.
  2. Note the execution time.
  3. Apply the proposed filter to the query.
  4. Re-run the query and compare the execution time.
Copy link

boring-cyborg bot commented Apr 22, 2024

Thanks for opening your first issue in the Marquez project! Please be sure to follow the issue template!

@wslulciuc wslulciuc added the db.perf This issue or pull request improves DB performance label May 22, 2024
@wslulciuc wslulciuc added this to the 0.48.0 milestone May 22, 2024
@wslulciuc
Copy link
Member

Woah! A huge oversight on our part. @vinhnemo want to contribute the patch?

@vinhnemo
Copy link
Contributor Author

Yes @wslulciuc ! Do I just need to create a PR or is there anything else I should do?

@wslulciuc wslulciuc moved this to In Progress in Marquez May 25, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Marquez Jun 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
db.perf This issue or pull request improves DB performance
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

2 participants