Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Discussion] Evaluating Query Performance: Z-order Indexing and Full Table Scans in Delta Lake with Trino #24724

Open
RohanDudeja opened this issue Jan 16, 2025 · 0 comments

Comments

@RohanDudeja
Copy link

Issue Description:

We are experiencing full table scans in a query that operates on a Delta Lake dataset using a Trino connection. The query involves multiple joins and filters on partitioned and Z-order indexed columns.

Despite the optimizations in place:

  1. The created_date column is partitioned.
  2. A Z-order index has been applied to the columns used in the join (b_id, c_id) and filter (entity_id, created_date).

The query still results in full table scans, regardless of the created_date range specified (e.g., 1 day, 1 week, or 2 weeks). This raises concerns about the effectiveness of Z-order indexing in our setup and the overall query performance.


Sample Query:

SELECT
  table_a.id,
  table_a.b_id,
  table_a.c_id,
  table_a.created_date,
  table_b.col1,
  table_b.col2,
  table_c.col1,
  table_c.col2
FROM
  database.table_a AS table_a
  LEFT JOIN database.table_b AS table_b 
    ON table_b.entity_id = 'x' AND table_a.b_id = table_b.id
  LEFT JOIN database.table_c AS table_c 
    ON table_c.entity_id = 'x' AND table_a.c_id = table_c.id
WHERE
  table_a.entity_id IN ('x') 
  AND table_a.created_date BETWEEN '2025-01-01' AND '2025-01-02'
ORDER BY
  table_a.id ASC;

Problem Statement:

We need to determine whether Z-order indexing is providing any performance benefits and identify alternative approaches to optimize the query execution plan and avoid full table scans.


Questions for Discussion:

  1. Z-order Indexing Impact:

    • Does Z-order indexing improve query performance in this specific use case with Delta Lake and Trino?
    • Are there scenarios where Z-order indexing might not provide noticeable benefits?
  2. Query Optimization:

    • Are there alternative approaches to optimize the query execution plan to prevent full table scans?
  3. Trino-Specific Behaviors:

    • Does Trino’s query execution engine fully leverage Z-order indexing for data pruning in Delta Lake?
    • Are there specific configurations or best practices for Trino to better utilize Z-order indexes?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

1 participant