[SUPPORT] High number of REST.GET.BUCKET when writing over a hudi table #8715

Open
alexone95 opened this issue May 15, 2023 · 15 comments
Labels
aws-support high-fs-calls Label tracking github issues for high unexpected number of FS calls mainly S3.. priority:critical production down; pipelines stalled; Need help asap.

Comments

@alexone95

Hello, we were facing a problem where Hudi spent a lot of time requesting files in the /archived directory. To reduce it, we built a solution that deletes the files in the archive daily. The solution works well for reducing commit latency, but since we deployed it the number of REST.GET.BUCKET requests has increased a lot. In particular, a single table generates about 1 million requests per day, of which 900k are GET requests for this path: /hudiTable/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile/.
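One way to arrive at per-path numbers like these is to aggregate S3 server access logs by the `prefix` query parameter of each REST.GET.BUCKET request. The following is a minimal sketch under assumptions: it only parses the leading fields of the documented access log record, and the sample lines, the bucket name `example-bucket`, and the helper names `count_list_prefixes` and `_sample` are made up for illustration, not taken from the reporter's setup.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Leading fields of an S3 server access log record, up to the quoted Request-URI.
LOG_RE = re.compile(
    r'^(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)"'
)


def count_list_prefixes(lines):
    """Count REST.GET.BUCKET (LIST) requests per decoded `prefix` parameter."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if match is None or match.group("operation") != "REST.GET.BUCKET":
            continue  # unparsable line or not a LIST request
        query = parse_qs(urlsplit(match.group("uri")).query)
        counts[query.get("prefix", [""])[0]] += 1
    return counts


def _sample(operation, key, uri):
    """Build a fabricated log line (illustration only, not real log data)."""
    return (
        f"ownerid example-bucket [15/May/2023:10:00:00 +0000] 192.0.2.10 "
        f'requester REQ0001 {operation} {key} "GET {uri} HTTP/1.1" '
        f'200 - 512 - 12 11 "-" "agent" -'
    )


BOOTSTRAP_URI = "/?list-type=2&delimiter=%2F&prefix=hudiTable%2F.hoodie%2F.aux%2F.bootstrap%2F.partitions%2F"
METADATA_URI = "/?list-type=2&delimiter=%2F&prefix=hudiTable%2F.hoodie%2Fmetadata%2F.hoodie%2F"

SAMPLE_LINES = [
    _sample("REST.GET.BUCKET", "-", BOOTSTRAP_URI),
    _sample("REST.GET.BUCKET", "-", BOOTSTRAP_URI),
    _sample("REST.GET.BUCKET", "-", METADATA_URI),
    _sample("REST.GET.OBJECT", "hudiTable/.hoodie/hoodie.properties",
            "/hudiTable/.hoodie/hoodie.properties"),
]

if __name__ == "__main__":
    for prefix, n in count_list_prefixes(SAMPLE_LINES).most_common():
        print(n, prefix)
```

In practice the lines would come from the access log objects delivered to the logging bucket; sorting the counter by count shows which prefixes dominate the LIST traffic.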

We read INSERT, UPDATE and DELETE operations from a Kafka topic and replicate them into a target Hudi table stored on Hive via a PySpark job running 24/7.

Why do I get this behavior? Is there something I can do to reduce the number of requests?

Environment Description

Hudi version : 0.12.1-amzn-0
Spark version : 3.3.0
Hive version : 3.1.3
Hadoop version : 3.3.3 amz
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no (EMR 6.9.0)

Additional context

HOODIE TABLE PROPERTIES:
'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
'hoodie.datasource.write.hive_style_partitioning':'true',
'hoodie.index.type':'GLOBAL_BLOOM',
'hoodie.simple.index.update.partition.path':'true',
'hoodie.datasource.hive_sync.enable': 'true',
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
'hoodie.datasource.hive_sync.use_jdbc': 'false',
'hoodie.datasource.hive_sync.mode': 'hms',
'hoodie.copyonwrite.record.size.estimate':285,
'hoodie.parquet.small.file.limit': 104857600,
'hoodie.parquet.max.file.size': 120000000,
'hoodie.cleaner.commits.retained': 1

@alexone95
Author

The second-highest path by number of requests is: /hudiTable/.hoodie/metadata/.hoodie

@codope codope added the priority:major degraded perf; unable to move forward; potential bugs label May 15, 2023
@github-project-automation github-project-automation bot moved this to ⏳ Awaiting Triage in Hudi Issue Support May 15, 2023
@ad1happy2go
Collaborator

@alexone95
We made some fixes in Hudi versions 0.12.3 and 0.13.0 to remove unnecessary calls to the FS (#7561), which should bring down your S3 calls.

Can you try these versions and let us know if you still face the issue?

@codope codope moved this from ⏳ Awaiting Triage to 👤 User Action in Hudi Issue Support May 16, 2023
@alexone95
Author

> @alexone95 We made some fixes to hudi versions 0.12.3 and 0.13.0 on fixing unnecessary calls to FS #7561 can you try them and let us know. should bring down your S3 calls.
>
> Can you try with these versions if you still face the issue.

The problem is that we are using Hudi on an EMR release (6.9), and even if we upgraded to 6.10 we would not be able to use version 0.12.3 or 0.13.0 (6.10 ships with 0.12.2-amzn). Is there a way to solve the issue without upgrading the Hudi version?

@alexone95
Author

I still want to stress that the archive-related problem you mentioned was already resolved by manually cleaning the files (with a daily script), and that the S3 requests we are seeing now should not be related to that problem.

Thanks in advance

@mzheng-plaid

I believe this is caused by https://hudi.apache.org/docs/0.12.3/configurations#hoodiebootstrapindexenable . My understanding is that by default Hudi assumes every table is a bootstrapped table and does a check for the bootstrap index.

It looks like this setting was removed in 0.13.0

@alexone95
Author

> I believe this is caused by https://hudi.apache.org/docs/0.12.3/configurations#hoodiebootstrapindexenable . My understanding is that by default Hudi assumes every table is a bootstrapped table and does a check for the bootstrap index.
>
> It looks like this setting was removed in 0.13.0

I tried setting this config to false, but nothing changed in terms of REST.GET.BUCKET requests; I am still seeing nearly 1 million requests per day.

@alexone95
Author

Would it be a good idea to set this parameter to true: https://hudi.apache.org/docs/configurations/#hoodiecleanerdeletebootstrapbasefile ? I expect this config to delete this type of file: hudiTable/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile/. Am I wrong?

Thanks in advance

@alexone95
Author

Is there a way to delete those files safely, or to prevent Hudi from creating them?

@ad1happy2go
Collaborator

@alexone95 The property you mentioned only deletes the stale bootstrap base files; we can't delete the bootstrap index files.

PR #7404 also reduces the unnecessary calls to the HFileBootstrapIndex. Can you please try it?

Can you tell us more about this table, such as how many partitions it has and its size?

@yihua
Contributor

yihua commented May 25, 2023

Hi @alexone95 The issue of Hudi spending a lot of time requesting files in the archived directory should be fixed in #7561. There was a regression where the archived timeline was always loaded in the meta sync, regardless of whether it was needed. Have you reached out to EMR support to get a patched Hudi jar to solve the issue? cc @umehrot2 @rahil-c @CTTY

@mzheng-plaid

> #7404

Ah, that's interesting. Does this mean we need to set https://hudi.apache.org/docs/0.12.3/configurations/#hoodiebootstrapindexclass to https://github.com/apache/hudi/blob/c859ea4cd23bc4ae56fca4914f34bbf5858bfde5/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/NoOpBootstrapIndex.java ? We just want to avoid the fs.exists call in HFileBootstrapIndex, because none of our tables are bootstrapped anyway. Disabling https://hudi.apache.org/docs/0.12.3/configurations#hoodiebootstrapindexenable doesn't seem to do anything (as @alexone95 pointed out).
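For reference, the two settings discussed here could be expressed as writer options roughly as follows. This is only a sketch of the question being asked, not a confirmed fix: the thread reports that disabling the index has no effect, and nothing in it confirms that the no-op index class avoids the check. The key names `hoodie.bootstrap.index.enable` and `hoodie.bootstrap.index.class` are assumed from the linked configuration anchors, the class name is derived from the linked source path, and `with_overrides` is a hypothetical helper.

```python
# Two of the writer options reported in the issue (abbreviated for the example).
base_options = {
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.index.type": "GLOBAL_BLOOM",
}

# Candidate overrides discussed in the thread. Assumption: neither is confirmed
# to remove the bootstrap index existence check on 0.12.x.
bootstrap_overrides = {
    "hoodie.bootstrap.index.enable": "false",
    "hoodie.bootstrap.index.class":
        "org.apache.hudi.common.bootstrap.index.NoOpBootstrapIndex",
}


def with_overrides(base, overrides):
    """Return a new options dict; the inputs are left unmodified."""
    merged = dict(base)
    merged.update(overrides)
    return merged


hudi_options = with_overrides(base_options, bootstrap_overrides)

# In a PySpark job the merged dict would typically be passed to the writer:
# df.write.format("hudi").options(**hudi_options).mode("append").save(path)
```

Comparing the REST.GET.BUCKET count before and after such a change is the only way to tell whether it has any effect on a given Hudi version.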

(We do not have the metadata table enabled, so #7404 would not work for us.)

@ad1happy2go
Collaborator

Tracking JIRA created - https://issues.apache.org/jira/browse/HUDI-6284

@xushiyan xushiyan moved this from 👤 User Action to 🏁 Triaged in Hudi Issue Support May 31, 2023
@xushiyan xushiyan added priority:critical production down; pipelines stalled; Need help asap. and removed priority:major degraded perf; unable to move forward; potential bugs labels May 31, 2023
@mzheng-plaid

@ad1happy2go thanks for opening the ticket, but is there a way to disable this in the current version? None of the tables we have are bootstrapped anyway.

@ad1happy2go
Collaborator

@mzheng-plaid Sorry for the delayed response. I don't think there is a way to disable it at the moment. @jonvex is working on a fix and will send the patch out soon.

Also, can you clarify what you mean by "tables are not bootstrapped", given that you have a bootstrap index file: /hudiTable/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile

@codope codope added the high-fs-calls Label tracking github issues for high unexpected number of FS calls mainly S3.. label Sep 18, 2023
@mzheng-plaid

(@ad1happy2go I forgot to respond last year and just circled back to this problem now.)

> Also, can you clarify what do you mean by "tables are not bootstrapped", as you have bootstrap index file

No, the bootstrap index file does not exist. Hudi does a LIST operation (REST.GET.BUCKET) to check whether the bootstrap index file exists:

https://github.com/apache/hudi/blob/release-0.12.2/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java#L105C1-L108C76
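To illustrate why an existence check on a missing path shows up as LIST traffic, here is a toy model. It is not Hudi or S3A code: `CountingStore`, `exists` and `probe_many` are hypothetical names, and the HEAD-then-LIST probe sequence is an assumption about how an object-store filesystem client may resolve `exists()` for a path that could be either a file or a directory.

```python
class CountingStore:
    """Toy in-memory object store that counts HEAD and LIST requests."""

    def __init__(self, keys):
        self.keys = set(keys)
        self.head_calls = 0
        self.list_calls = 0

    def head(self, key):
        self.head_calls += 1
        return key in self.keys

    def list_prefix(self, prefix):
        self.list_calls += 1
        return [k for k in self.keys if k.startswith(prefix)]


def exists(store, path):
    """Probe the path as an object first, then as a directory prefix."""
    if store.head(path):
        return True
    # No object with that exact key: fall back to listing "path/".
    return bool(store.list_prefix(path.rstrip("/") + "/"))


def probe_many(store, path, n):
    """Run the existence check n times and return how many probes succeeded."""
    return sum(1 for _ in range(n) if exists(store, path))


# Path from the issue; the store deliberately does not contain it.
INDEX_PATH = (
    "hudiTable/.hoodie/.aux/.bootstrap/.partitions/"
    "00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile"
)

store = CountingStore(["hudiTable/.hoodie/hoodie.properties"])
found = probe_many(store, INDEX_PATH, 1000)

if __name__ == "__main__":
    print("found:", found, "HEAD:", store.head_calls, "LIST:", store.list_calls)
```

Under this model every probe of the absent index file costs one LIST with the prefix ending in `.hfile/`, which is consistent with the trailing slash in the path reported at the top of the issue.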
