Optimization for ColumnLineageDao.getLineageRowsForDataset #2697

ddave09 · 2023-12-05T17:53:37Z

Marquez version: 0.42.0
Related issue: #2418
Related PR: #2419 for the previous merge.
Discussion thread: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1701766646176309?thread_ts=1676320609.289789&cid=C01E8MQGJP7

Current query:

WITH selected_column_lineage AS (
    SELECT DISTINCT ON (cl.output_dataset_field_uuid, cl.input_dataset_field_uuid) cl.*
    FROM column_lineage cl
    JOIN dataset_fields df ON df.uuid = cl.output_dataset_field_uuid
    JOIN datasets_view dv ON dv.uuid = df.dataset_uuid
    WHERE ARRAY[($1,$2)]::DATASET_NAME[] && dv.dataset_symlinks -- array of string pairs is cast onto array of DATASET_NAME types to be checked if it has non-empty intersection with dataset symlinks
    ORDER BY output_dataset_field_uuid, input_dataset_field_uuid, updated_at DESC, updated_at
  ),
  dataset_fields_view AS (
    SELECT d.namespace_name as namespace_name, d.name as dataset_name, df.name as field_name, df.type, df.uuid
    FROM dataset_fields df
    INNER JOIN datasets_view d ON d.uuid = df.dataset_uuid
  )
  SELECT
    output_fields.namespace_name,
    output_fields.dataset_name,
    output_fields.field_name,
    output_fields.type,
    ARRAY_AGG(DISTINCT ARRAY[
      input_fields.namespace_name,
      input_fields.dataset_name,
      CAST(c.input_dataset_version_uuid AS VARCHAR),
      input_fields.field_name,
      c.transformation_description,
      c.transformation_type
    ]) AS inputFields,
    null as dataset_version_uuid
  FROM selected_column_lineage c
  INNER JOIN dataset_fields_view output_fields ON c.output_dataset_field_uuid = output_fields.uuid
  LEFT JOIN dataset_fields_view input_fields ON c.input_dataset_field_uuid = input_fields.uuid
  GROUP BY
    output_fields.namespace_name,
    output_fields.dataset_name,
    output_fields.field_name,
    output_fields.type

I suspect that the query has joins with views at multiple places and as views can't and don't have indexes which is causing the issues. datasets_view while creating derived table selected_column_lineage and dataset_fields_view in the main select query.

Also, the derived table (selected_column_lineage) without indexes. It is getting joined against dataset_fields_view in the main select query.

We have one database that has 62981 rows in dataset_view and one that has 549987. Things work just fine in the former case with less rows.

I believe the issue happens at scale.

With recent test with larger data sizes queries don't tend to finish. We have seen examples of query running for 3+ days.

+-----+----------------------------------------------------+
|pid       | query_duration                                      |
+-----+----------------------------------------------------+
|18258 | 3 days 8 hours 6 mins 11.680846 secs |
|14029 | 3 days 7 hours 55 mins 57.796614 secs|
|25136 | 3 days 7 hours 55 mins 57.796616 secs|
|12053 | 3 days 8 hours 6 mins 11.680867 secs |
|12054 | 3 days 8 hours 6 mins 11.680869 secs |
|14028 | 3 days 7 hours 55 mins 57.796674 secs|
+-----+----------------------------------------------------+

The text was updated successfully, but these errors were encountered:

wslulciuc added this to Marquez Dec 7, 2023

wslulciuc added this to the Roadmap milestone Dec 7, 2023

wslulciuc added column-level-lineage db.perf This issue or pull request improves DB performance labels Jan 30, 2024

wslulciuc moved this to Todo in Marquez Apr 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimization for ColumnLineageDao.getLineageRowsForDataset #2697

Optimization for ColumnLineageDao.getLineageRowsForDataset #2697

ddave09 commented Dec 5, 2023

Optimization for ColumnLineageDao.getLineageRowsForDataset #2697

Optimization for ColumnLineageDao.getLineageRowsForDataset #2697

Comments

ddave09 commented Dec 5, 2023