Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization for ColumnLineageDao.getLineageRowsForDataset #2697

Open
ddave09 opened this issue Dec 5, 2023 · 0 comments
Open

Optimization for ColumnLineageDao.getLineageRowsForDataset #2697

ddave09 opened this issue Dec 5, 2023 · 0 comments
Labels
column-level-lineage db.perf This issue or pull request improves DB performance
Milestone

Comments

@ddave09
Copy link

ddave09 commented Dec 5, 2023

Marquez version: 0.42.0
Related issue: #2418
Related PR: #2419 for the previous merge.
Discussion thread: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1701766646176309?thread_ts=1676320609.289789&cid=C01E8MQGJP7

Current query:

WITH selected_column_lineage AS (
    SELECT DISTINCT ON (cl.output_dataset_field_uuid, cl.input_dataset_field_uuid) cl.*
    FROM column_lineage cl
    JOIN dataset_fields df ON df.uuid = cl.output_dataset_field_uuid
    JOIN datasets_view dv ON dv.uuid = df.dataset_uuid
    WHERE ARRAY[($1,$2)]::DATASET_NAME[] && dv.dataset_symlinks -- array of string pairs is cast onto array of DATASET_NAME types to be checked if it has non-empty intersection with dataset symlinks
    ORDER BY output_dataset_field_uuid, input_dataset_field_uuid, updated_at DESC, updated_at
  ),
  dataset_fields_view AS (
    SELECT d.namespace_name as namespace_name, d.name as dataset_name, df.name as field_name, df.type, df.uuid
    FROM dataset_fields df
    INNER JOIN datasets_view d ON d.uuid = df.dataset_uuid
  )
  SELECT
    output_fields.namespace_name,
    output_fields.dataset_name,
    output_fields.field_name,
    output_fields.type,
    ARRAY_AGG(DISTINCT ARRAY[
      input_fields.namespace_name,
      input_fields.dataset_name,
      CAST(c.input_dataset_version_uuid AS VARCHAR),
      input_fields.field_name,
      c.transformation_description,
      c.transformation_type
    ]) AS inputFields,
    null as dataset_version_uuid
  FROM selected_column_lineage c
  INNER JOIN dataset_fields_view output_fields ON c.output_dataset_field_uuid = output_fields.uuid
  LEFT JOIN dataset_fields_view input_fields ON c.input_dataset_field_uuid = input_fields.uuid
  GROUP BY
    output_fields.namespace_name,
    output_fields.dataset_name,
    output_fields.field_name,
    output_fields.type

I suspect that the query has joins with views at multiple places and as views can't and don't have indexes which is causing the issues. datasets_view while creating derived table selected_column_lineage and dataset_fields_view in the main select query.

Also, the derived table (selected_column_lineage) without indexes. It is getting joined against dataset_fields_view in the main select query.

We have one database that has 62981 rows in dataset_view and one that has 549987. Things work just fine in the former case with less rows.

I believe the issue happens at scale.

With recent test with larger data sizes queries don't tend to finish. We have seen examples of query running for 3+ days.

+-----+----------------------------------------------------+
|pid       | query_duration                                      |
+-----+----------------------------------------------------+
|18258 | 3 days 8 hours 6 mins 11.680846 secs |
|14029 | 3 days 7 hours 55 mins 57.796614 secs|
|25136 | 3 days 7 hours 55 mins 57.796616 secs|
|12053 | 3 days 8 hours 6 mins 11.680867 secs |
|12054 | 3 days 8 hours 6 mins 11.680869 secs |
|14028 | 3 days 7 hours 55 mins 57.796674 secs|
+-----+----------------------------------------------------+
@wslulciuc wslulciuc added this to Marquez Dec 7, 2023
@wslulciuc wslulciuc added this to the Roadmap milestone Dec 7, 2023
@wslulciuc wslulciuc added column-level-lineage db.perf This issue or pull request improves DB performance labels Jan 30, 2024
@wslulciuc wslulciuc moved this to Todo in Marquez Apr 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
column-level-lineage db.perf This issue or pull request improves DB performance
Projects
Status: Todo
Development

No branches or pull requests

2 participants