Description
While working with the column lineage queries in the Marquez project, I noticed that one query was slow. Specifically, the query that builds the dataset fields view could take up to 4.5 seconds to execute under certain conditions. After investigating, I found that the Common Table Expression (CTE) was missing a filter. Adding it reduced the query time to approximately 1 second.
Issue
The current implementation of the dataset fields view query in ColumnLineageDao.java does not restrict the dataset fields to those linked with the version UUIDs identified in selected_column_lineage. As a result, the CTE joins every row of dataset_fields against datasets_view, which is more data than the query needs.
Performance Impact
Without this filter, query execution time increases, especially with large datasets. In my testing environment, the query took around 4.5 seconds.
Proposed Change
I propose adding a filter condition to the CTE dataset_fields_view in ColumnLineageDao.java:
From:
dataset_fields_view AS (
    SELECT
        d.namespace_name AS namespace_name,
        d.name AS dataset_name,
        df.name AS field_name,
        df.type,
        df.uuid
    FROM dataset_fields df
    INNER JOIN datasets_view d ON d.uuid = df.dataset_uuid
)
To:
dataset_fields_view AS (
    SELECT
        d.namespace_name AS namespace_name,
        d.name AS dataset_name,
        df.name AS field_name,
        df.type,
        df.uuid
    FROM dataset_fields df
    INNER JOIN (
        SELECT *
        FROM datasets_view
        WHERE current_version_uuid IN (
            SELECT DISTINCT output_dataset_version_uuid
            FROM selected_column_lineage
            UNION
            SELECT DISTINCT input_dataset_version_uuid
            FROM selected_column_lineage
        )
    ) d ON d.uuid = df.dataset_uuid
)
This filter will ensure that only relevant dataset fields are processed, improving the overall efficiency of the query.
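To illustrate what the filter does, here is a minimal sketch that runs both versions of the CTE against a toy schema. It uses Python's built-in sqlite3 module rather than Marquez's PostgreSQL database. The table definitions, sample rows, and the use of plain tables in place of the datasets_view view and the selected_column_lineage CTE are all assumptions made for illustration, not the actual Marquez schema.

```python
import sqlite3

# Toy stand-ins (assumptions): real Marquez uses PostgreSQL, datasets_view is a
# view, and selected_column_lineage is a CTE defined earlier in the same query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE datasets_view (uuid TEXT, namespace_name TEXT, name TEXT, current_version_uuid TEXT);
CREATE TABLE dataset_fields (uuid TEXT, dataset_uuid TEXT, name TEXT, type TEXT);
CREATE TABLE selected_column_lineage (output_dataset_version_uuid TEXT, input_dataset_version_uuid TEXT);
INSERT INTO datasets_view VALUES
    ('d1', 'ns', 'orders', 'v1'),
    ('d2', 'ns', 'customers', 'v2'),
    ('d3', 'ns', 'unrelated', 'v3');
INSERT INTO dataset_fields VALUES
    ('f1', 'd1', 'order_id', 'INT'),
    ('f2', 'd2', 'customer_id', 'INT'),
    ('f3', 'd3', 'misc', 'TEXT');
INSERT INTO selected_column_lineage VALUES ('v1', 'v2');
""")

# Original CTE: joins every dataset field, whether or not it is in the lineage.
UNFILTERED = """
WITH dataset_fields_view AS (
    SELECT d.namespace_name AS namespace_name, d.name AS dataset_name,
           df.name AS field_name, df.type, df.uuid
    FROM dataset_fields df
    INNER JOIN datasets_view d ON d.uuid = df.dataset_uuid
)
SELECT dataset_name, field_name FROM dataset_fields_view ORDER BY dataset_name
"""

# Proposed CTE: only datasets whose current version appears in the lineage.
FILTERED = """
WITH dataset_fields_view AS (
    SELECT d.namespace_name AS namespace_name, d.name AS dataset_name,
           df.name AS field_name, df.type, df.uuid
    FROM dataset_fields df
    INNER JOIN (
        SELECT * FROM datasets_view
        WHERE current_version_uuid IN (
            SELECT DISTINCT output_dataset_version_uuid FROM selected_column_lineage
            UNION
            SELECT DISTINCT input_dataset_version_uuid FROM selected_column_lineage
        )
    ) d ON d.uuid = df.dataset_uuid
)
SELECT dataset_name, field_name FROM dataset_fields_view ORDER BY dataset_name
"""

unfiltered_rows = conn.execute(UNFILTERED).fetchall()
filtered_rows = conn.execute(FILTERED).fetchall()
print("unfiltered:", unfiltered_rows)
print("filtered:  ", filtered_rows)
```

In this toy data the unfiltered CTE returns fields for all three datasets, while the filtered one drops the field of the dataset that does not appear in selected_column_lineage.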
Expected Outcome
This change should reduce the execution time of the dataset fields view query. In my tests, it dropped from 4.5 seconds to about 1 second. This improvement should translate to better performance for all users interacting with this part of the Marquez API.
Steps to Reproduce
1. Run the existing dataset fields view query on a large dataset.
2. Note the execution time.
3. Apply the proposed filter to the query.
4. Re-run the query and compare the execution time.
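One way to compare execution times is a small timing helper such as the sketch below. The helper name median_query_seconds, the sample table, and the two sample queries are hypothetical and only demonstrate the measurement approach; the in-memory sqlite3 database stands in for a populated Marquez database. Against PostgreSQL itself, EXPLAIN ANALYZE on each version of the query is another option.

```python
import sqlite3
import statistics
import time


def median_query_seconds(conn, sql, repeats=5):
    """Run `sql` `repeats` times and return the median wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql).fetchall()  # fetch all rows so the whole query runs
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Hypothetical stand-in data; replace with a connection to a real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dataset_fields (uuid TEXT, dataset_uuid TEXT, name TEXT, type TEXT)"
)
conn.executemany(
    "INSERT INTO dataset_fields VALUES (?, ?, ?, ?)",
    [(f"f{i}", f"d{i % 100}", f"col_{i}", "TEXT") for i in range(10_000)],
)

baseline = median_query_seconds(conn, "SELECT * FROM dataset_fields")
filtered = median_query_seconds(
    conn, "SELECT * FROM dataset_fields WHERE dataset_uuid IN ('d1', 'd2')"
)
print(f"baseline: {baseline:.4f}s, filtered: {filtered:.4f}s")
```

Taking the median over several runs reduces the effect of caching and one-off spikes on the comparison.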