Exclude column schema when we fetch Glue partitions based on filter #14206

Merged
Praveen2112 merged 1 commit into master from praveen/minor_glue_improvement on Sep 20, 2022

Praveen2112
Member

@Praveen2112 Praveen2112 commented Sep 20, 2022

Description

getPartitionNamesByFilter needs only the partition values, so including the column schema in the result is unnecessary overhead. This change excludes it and also avoids an additional call to fetch the table information. This could improve planning time for queries on tables with a very large number of columns (1000+).
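
As a rough illustration (not code from this PR), the sketch below shows why partition values alone are enough to produce partition names: a Hive-style name is just the partition column names joined with their values. The class `PartitionNameSketch` and the helper `toPartitionName` are hypothetical names chosen for this example, and escaping of special characters is left out for brevity. The Glue `GetPartitions` API exposes an `ExcludeColumnSchema` option, which is what lets the response omit the per-partition column list when only values like these are needed.

```java
import java.util.ArrayList;
import java.util.List;

public final class PartitionNameSketch {
    private PartitionNameSketch() {}

    // Hypothetical helper: joins partition column names with their values
    // into a Hive-style partition name such as "year=2022/month=09".
    // Assumption: escaping of special characters is deliberately omitted.
    public static String toPartitionName(List<String> partitionColumns, List<String> partitionValues) {
        if (partitionColumns.size() != partitionValues.size()) {
            throw new IllegalArgumentException("column and value counts differ");
        }
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < partitionColumns.size(); i++) {
            parts.add(partitionColumns.get(i) + "=" + partitionValues.get(i));
        }
        return String.join("/", parts);
    }

    public static void main(String[] args) {
        // Three partition columns, mirroring the shape of the test table in this PR;
        // the values are made up for illustration.
        List<String> columns = List.of("part_column_1", "part_column_2", "part_column_3");
        List<String> values = List.of("a", "b", "c");
        System.out.println(toPartitionName(columns, values));
    }
}
```

Note that none of the 1000 data columns appear anywhere in this computation, which is why shipping their schema with every partition is wasted work for this code path.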

We tested locally with a Glue table that has 1000 data columns, 3 partition columns, and 1000 partitions.

Query, run with table_statistics disabled: `EXPLAIN SELECT count(*) FROM GLUE_TABLE group by part_column_2 LIMIT 1`

Overall execution time before this change: 7-8s (multiple runs)

Overall execution time after this change: 1.1-1.7s (multiple runs)

Non-technical explanation

Improves planning time for Glue tables.

Release notes

( ) This is not user-visible and no release notes are required.
(x) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Improve planning time for wide Glue tables

@findepi
Member

findepi commented Sep 20, 2022

@Praveen2112 see TestIcebergGlueCatalogAccessOperations failure.

@findepi
Member

findepi commented Sep 20, 2022

cc @alexjo2144 @findinpath @homar

@Praveen2112
Member Author

I think the TestIcebergGlueCatalogAccessOperations failure will be fixed by PR #14207.

Member

@skrzypo987 skrzypo987 left a comment

lgtm

`getPartitionNamesByFilter` requires only partition values, so including the column schema in the result is unnecessary overhead.
An additional call to get the table information is also avoided.
@Praveen2112 Praveen2112 force-pushed the praveen/minor_glue_improvement branch from 20d8dea to e669529 Compare September 20, 2022 10:56
@Praveen2112 Praveen2112 merged commit 5e066e2 into master Sep 20, 2022
@Praveen2112 Praveen2112 deleted the praveen/minor_glue_improvement branch September 20, 2022 15:35
@github-actions github-actions bot added this to the 397 milestone Sep 20, 2022
5 participants