You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
WITH selected_column_lineage AS (
SELECT DISTINCT ON (cl.output_dataset_field_uuid, cl.input_dataset_field_uuid) cl.*
FROM column_lineage cl
JOIN dataset_fields df ON df.uuid = cl.output_dataset_field_uuid
JOIN datasets_view dv ON dv.uuid = df.dataset_uuid
WHERE ARRAY[($1,$2)]::DATASET_NAME[] && dv.dataset_symlinks -- array of string pairs is cast onto array of DATASET_NAME types to be checked if it has non-empty intersection with dataset symlinks
ORDER BY output_dataset_field_uuid, input_dataset_field_uuid, updated_at DESC, updated_at
),
dataset_fields_view AS (
SELECT d.namespace_name as namespace_name, d.name as dataset_name, df.name as field_name, df.type, df.uuid
FROM dataset_fields df
INNER JOIN datasets_view d ON d.uuid = df.dataset_uuid
)
SELECT
output_fields.namespace_name,
output_fields.dataset_name,
output_fields.field_name,
output_fields.type,
ARRAY_AGG(DISTINCT ARRAY[
input_fields.namespace_name,
input_fields.dataset_name,
CAST(c.input_dataset_version_uuid AS VARCHAR),
input_fields.field_name,
c.transformation_description,
c.transformation_type
]) AS inputFields,
null as dataset_version_uuid
FROM selected_column_lineage c
INNER JOIN dataset_fields_view output_fields ON c.output_dataset_field_uuid = output_fields.uuid
LEFT JOIN dataset_fields_view input_fields ON c.input_dataset_field_uuid = input_fields.uuid
GROUP BY
output_fields.namespace_name,
output_fields.dataset_name,
output_fields.field_name,
output_fields.type
I suspect that the query has joins with views at multiple places and as views can't and don't have indexes which is causing the issues. datasets_view while creating derived table selected_column_lineage and dataset_fields_view in the main select query.
Also, the derived table (selected_column_lineage) without indexes. It is getting joined against dataset_fields_view in the main select query.
We have one database that has 62981 rows in dataset_view and one that has 549987. Things work just fine in the former case with less rows.
I believe the issue happens at scale.
With recent test with larger data sizes queries don't tend to finish. We have seen examples of query running for 3+ days.
+-----+----------------------------------------------------+
|pid | query_duration |
+-----+----------------------------------------------------+
|18258 | 3 days 8 hours 6 mins 11.680846 secs |
|14029 | 3 days 7 hours 55 mins 57.796614 secs|
|25136 | 3 days 7 hours 55 mins 57.796616 secs|
|12053 | 3 days 8 hours 6 mins 11.680867 secs |
|12054 | 3 days 8 hours 6 mins 11.680869 secs |
|14028 | 3 days 7 hours 55 mins 57.796674 secs|
+-----+----------------------------------------------------+
The text was updated successfully, but these errors were encountered:
Marquez version:
0.42.0
Related issue: #2418
Related PR: #2419 for the previous merge.
Discussion thread: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1701766646176309?thread_ts=1676320609.289789&cid=C01E8MQGJP7
Current query:
I suspect that the query has joins with views at multiple places and as views can't and don't have indexes which is causing the issues.
datasets_view
while creating derived tableselected_column_lineage
anddataset_fields_view
in the main select query.Also, the derived table (
selected_column_lineage
) without indexes. It is getting joined againstdataset_fields_view
in the main select query.We have one database that has
62981
rows indataset_view
and one that has549987
. Things work just fine in the former case with less rows.I believe the issue happens at scale.
With recent test with larger data sizes queries don't tend to finish. We have seen examples of query running for
3+
days.The text was updated successfully, but these errors were encountered: