Optimize Column Lineage Query Performance #2802

vinhnemo · 2024-04-22T10:44:03Z

Description

While working with the column lineage queries within the Marquez project, I noticed that a particular query was performing suboptimally. Specifically, the query associated with the dataset fields view could take up to 4.5 seconds to execute under certain conditions. After investigating potential causes, I identified a missing filter in the Common Table Expression (CTE) that, when included, significantly improved performance by reducing the query time to approximately 1 second.

Issue

The current implementation of the dataset fields view query in ColumnLineageDao.java does not include a filter to narrow down the dataset fields to only those linked with the version UUIDs identified in selected_column_lineage. As a result, the join processes the fields of every dataset instead of only the fields of the datasets that appear in the selected lineage.

Performance Impact

The lack of this filter can cause the query execution time to increase, especially when dealing with large datasets. In my testing environment, the execution time was observed at around 4.5 seconds.

Proposed Change

I propose adding a filter condition to the CTE dataset_fields_view in ColumnLineageDao.java:
From:

        dataset_fields_view AS (
          SELECT d.namespace_name as namespace_name, d.name as dataset_name, df.name as field_name, df.type, df.uuid
          FROM dataset_fields df
          INNER JOIN datasets_view d ON d.uuid = df.dataset_uuid
        )

To:

dataset_fields_view AS (
  SELECT 
    d.namespace_name as namespace_name, 
    d.name as dataset_name, 
    df.name as field_name, 
    df.type, 
    df.uuid 
  FROM 
    dataset_fields df 
    INNER JOIN (
      SELECT 
        * 
      FROM 
        datasets_view 
      WHERE 
        current_version_uuid IN (
          SELECT 
            DISTINCT output_dataset_version_uuid 
          FROM 
            selected_column_lineage 
          UNION 
          SELECT 
            DISTINCT input_dataset_version_uuid 
          FROM 
            selected_column_lineage
        )
    ) d ON d.uuid = df.dataset_uuid
)

This filter will ensure that only relevant dataset fields are processed, improving the overall efficiency of the query.
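To illustrate what the filter changes, here is a minimal sketch that runs both versions of the CTE against an in-memory SQLite database. The table layouts and sample rows are assumptions made up for this example (in Marquez, selected_column_lineage is itself a CTE and the database is PostgreSQL), so this only shows which rows each variant returns, not real performance.

```python
import sqlite3

# Hypothetical, heavily simplified stand-ins for the real Marquez relations.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE datasets_view (
  uuid TEXT, namespace_name TEXT, name TEXT, current_version_uuid TEXT);
CREATE TABLE dataset_fields (
  uuid TEXT, dataset_uuid TEXT, name TEXT, type TEXT);
-- A plain table standing in for the selected_column_lineage CTE.
CREATE TABLE selected_column_lineage (
  output_dataset_version_uuid TEXT, input_dataset_version_uuid TEXT);

INSERT INTO datasets_view VALUES
  ('d1', 'ns', 'orders',    'v1'),
  ('d2', 'ns', 'customers', 'v2'),
  ('d3', 'ns', 'unrelated', 'v3');
INSERT INTO dataset_fields VALUES
  ('f1', 'd1', 'order_id',    'INTEGER'),
  ('f2', 'd2', 'customer_id', 'INTEGER'),
  ('f3', 'd3', 'other',       'TEXT');
-- Only versions v1 (output) and v2 (input) take part in the lineage.
INSERT INTO selected_column_lineage VALUES ('v1', 'v2');
""")

UNFILTERED = """
WITH dataset_fields_view AS (
  SELECT d.namespace_name AS namespace_name, d.name AS dataset_name,
         df.name AS field_name, df.type, df.uuid
  FROM dataset_fields df
  INNER JOIN datasets_view d ON d.uuid = df.dataset_uuid
)
SELECT field_name FROM dataset_fields_view ORDER BY field_name
"""

FILTERED = """
WITH dataset_fields_view AS (
  SELECT d.namespace_name AS namespace_name, d.name AS dataset_name,
         df.name AS field_name, df.type, df.uuid
  FROM dataset_fields df
  INNER JOIN (
    SELECT * FROM datasets_view
    WHERE current_version_uuid IN (
      SELECT DISTINCT output_dataset_version_uuid FROM selected_column_lineage
      UNION
      SELECT DISTINCT input_dataset_version_uuid FROM selected_column_lineage
    )
  ) d ON d.uuid = df.dataset_uuid
)
SELECT field_name FROM dataset_fields_view ORDER BY field_name
"""

unfiltered_fields = [row[0] for row in conn.execute(UNFILTERED)]
filtered_fields = [row[0] for row in conn.execute(FILTERED)]

print(unfiltered_fields)  # ['customer_id', 'order_id', 'other']
print(filtered_fields)    # ['customer_id', 'order_id']
```

In this toy data the field of the dataset named unrelated drops out of the filtered view because its current version does not appear in selected_column_lineage, which is the reduction in processed rows the proposal relies on.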

Expected Outcome

The expected outcome of this change is a reduction in the execution time of the dataset fields view query, as evidenced by a decrease from roughly 4.5 seconds to approximately 1 second in my tests. This improvement should translate to better performance for all users interacting with this part of the Marquez API.

Steps to Reproduce

1. Run the existing dataset fields view query on a large dataset.
2. Note the execution time.
3. Apply the proposed filter to the query.
4. Re-run the query and compare the execution time.
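The comparison in these steps can be sketched as a small timing harness. Everything below is an assumption for illustration: the helper names build_db and median_runtime, the synthetic data sizes, the integer UUIDs, and the use of SQLite instead of the PostgreSQL database Marquez runs on. Timings from it will not match the numbers reported above; against a real deployment, the same before/after comparison would be run on the actual queries.

```python
import sqlite3
import statistics
import time

CTE_BEFORE = """
WITH dataset_fields_view AS (
  SELECT d.namespace_name, d.name AS dataset_name,
         df.name AS field_name, df.type, df.uuid
  FROM dataset_fields df
  INNER JOIN datasets_view d ON d.uuid = df.dataset_uuid
)
SELECT COUNT(*) FROM dataset_fields_view
"""

CTE_AFTER = """
WITH dataset_fields_view AS (
  SELECT d.namespace_name, d.name AS dataset_name,
         df.name AS field_name, df.type, df.uuid
  FROM dataset_fields df
  INNER JOIN (
    SELECT * FROM datasets_view
    WHERE current_version_uuid IN (
      SELECT DISTINCT output_dataset_version_uuid FROM selected_column_lineage
      UNION
      SELECT DISTINCT input_dataset_version_uuid FROM selected_column_lineage
    )
  ) d ON d.uuid = df.dataset_uuid
)
SELECT COUNT(*) FROM dataset_fields_view
"""


def build_db(n_datasets=2000, fields_per_dataset=5, n_selected=10):
    """Create synthetic tables; only the first n_selected datasets have lineage."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
      CREATE TABLE datasets_view (
        uuid INTEGER PRIMARY KEY, namespace_name TEXT, name TEXT,
        current_version_uuid INTEGER);
      CREATE TABLE dataset_fields (
        uuid INTEGER PRIMARY KEY, dataset_uuid INTEGER, name TEXT, type TEXT);
      CREATE TABLE selected_column_lineage (
        output_dataset_version_uuid INTEGER, input_dataset_version_uuid INTEGER);
    """)
    # For simplicity each dataset's current version id equals its own id.
    conn.executemany(
        "INSERT INTO datasets_view VALUES (?, 'ns', ?, ?)",
        [(i, f"dataset_{i}", i) for i in range(n_datasets)],
    )
    conn.executemany(
        "INSERT INTO dataset_fields VALUES (?, ?, ?, 'TEXT')",
        [(i * fields_per_dataset + j, i, f"field_{j}")
         for i in range(n_datasets) for j in range(fields_per_dataset)],
    )
    # Pair up versions (0, 1), (2, 3), ... as output/input of a lineage edge.
    conn.executemany(
        "INSERT INTO selected_column_lineage VALUES (?, ?)",
        [(k, k + 1) for k in range(0, n_selected, 2)],
    )
    return conn


def median_runtime(conn, sql, runs=5):
    """Run the query several times and return the median wall-clock seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    db = build_db()
    for label, sql in (("before", CTE_BEFORE), ("after", CTE_AFTER)):
        rows = db.execute(sql).fetchone()[0]
        print(f"{label}: {rows} rows, {median_runtime(db, sql):.6f}s median")
```

Taking the median of several runs rather than a single measurement reduces the influence of caching and other one-off effects on the comparison.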


boring-cyborg · 2024-04-22T10:44:05Z

Thanks for opening your first issue in the Marquez project! Please be sure to follow the issue template!

wslulciuc · 2024-05-22T16:13:34Z

Woah! A huge oversight on our part. @vinhnemo want to contribute the patch?

vinhnemo · 2024-05-24T03:17:33Z

Yes @wslulciuc ! Do I just need to create a PR or is there anything else I should do?

wslulciuc added the db.perf label May 22, 2024

wslulciuc added this to Marquez May 22, 2024

wslulciuc added this to the 0.48.0 milestone May 22, 2024

vinhnemo mentioned this issue May 24, 2024: Optimize Column Lineage Query Performance #2821 (Merged)

wslulciuc moved this to In Progress in Marquez May 25, 2024

phixMe closed this as completed in #2821 Jun 3, 2024

github-project-automation bot moved this from In Progress to Done in Marquez Jun 3, 2024
