Batch the metastore call used to retrieve the partitions by names #12527

Merged


findinpath
Contributor

@findinpath findinpath commented May 24, 2022

Description

Is this change a fix, improvement, new feature, refactoring, or other?

Fix

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Hive connector

How would you describe this change to a non-technical end user or system administrator?

Batch the retrieval of partition metadata by partition name in `sync_partition_metadata`, so that a single call neither times out in the metastore nor returns a payload that is too large for the client.
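
For illustration only, the batching idea can be sketched as below. This is not the code merged in this PR: `PartitionBatchingSketch`, `fetchInBatches`, and the `fetchPage` callback (a stand-in for the metastore call that looks up partitions by name) are hypothetical names, and only the page size of 1000 comes from the diff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch; not the class touched by this PR.
final class PartitionBatchingSketch
{
    // Same value as BATCH_GET_PARTITIONS_BY_NAMES_MAX_PAGE_SIZE in the diff.
    static final int MAX_PAGE_SIZE = 1000;

    private PartitionBatchingSketch() {}

    // Calls fetchPage once per slice of at most pageSize names and concatenates
    // the results in order, so no single request carries every partition name.
    static <T> List<T> fetchInBatches(
            List<String> partitionNames,
            int pageSize,
            Function<List<String>, List<T>> fetchPage)
    {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<T> result = new ArrayList<>(partitionNames.size());
        for (int start = 0; start < partitionNames.size(); start += pageSize) {
            int end = Math.min(start + pageSize, partitionNames.size());
            // subList is a view, so slicing does not copy the names.
            result.addAll(fetchPage.apply(partitionNames.subList(start, end)));
        }
        return result;
    }
}
```

With 2,500 partition names and a page size of `MAX_PAGE_SIZE`, the callback would be invoked three times, with 1000, 1000, and 500 names. The trade-off is more round trips in exchange for bounded request and response sizes.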

Related issues, pull requests, and links

Fixes #12525

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Hive
* Avoid overloading the Hive metastore (HMS) when running the `sync_partition_metadata` procedure on a table with many partitions

@guyco33
Member

guyco33 commented May 24, 2022

I confirm that the fix works. The sync now takes about twice as long as before, but at least it no longer fails.

@@ -67,6 +68,8 @@
ADD, DROP, FULL
}

private static final int BATCH_GET_PARTITIONS_BY_NAMES_MAX_PAGE_SIZE = 1000;

Playing with the batch size, it seems that larger batches make it faster.
With 1000 it took 3.59m
With 5000 it took 2.73m
I think that BATCH_GET_PARTITIONS_BY_NAMES_MAX_PAGE_SIZE should be configurable.

@findinpath findinpath requested a review from ebyhr May 24, 2022 13:38
@ebyhr ebyhr requested a review from findepi May 25, 2022 03:07
@findinpath findinpath force-pushed the sync_partition_metadata_batch_call branch from 2844275 to e356790 Compare May 25, 2022 15:27
@findinpath findinpath requested a review from findepi May 25, 2022 15:28
@findepi findepi merged commit c58d9ea into trinodb:master May 25, 2022
@findepi findepi mentioned this pull request May 25, 2022
@github-actions github-actions bot added this to the 382 milestone May 25, 2022
Development

Successfully merging this pull request may close these issues.

Hive sync_partition_metadata fails with timeout when table has many partitions
4 participants