[SUPPORT] High number of REST.GET.BUCKET when writing over a hudi table #8715

alexone95 · 2023-05-15T13:57:01Z

Hello, we were facing a problem where Hudi spent a lot of time requesting files in the /archived directory. To reduce it, we built a solution that deletes the files in the archive daily. The solution works well at reducing commit latency, but since we deployed it the number of REST.GET.BUCKET requests has increased a lot. In particular, a single table gets about 1 million requests per day, of which 900k are GET requests for this path: /hudiTable/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile/.

We read INSERT, UPDATE and DELETE operations from a Kafka topic and replicate them to a target Hudi table in Hive via a PySpark job running 24/7.

Why am I getting this behavior? Is there something I can do to reduce the number of requests?

Environment Description

Hudi version : 0.12.1-amzn-0
Spark version : 3.3.0
Hive version : 3.1.3
Hadoop version : 3.3.3 amz
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no (EMR 6.9.0)

Additional context

HOODIE TABLE PROPERTIES:
'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
'hoodie.datasource.write.hive_style_partitioning':'true',
'hoodie.index.type':'GLOBAL_BLOOM',
'hoodie.simple.index.update.partition.path':'true',
'hoodie.datasource.hive_sync.enable': 'true',
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
'hoodie.datasource.hive_sync.use_jdbc': 'false',
'hoodie.datasource.hive_sync.mode': 'hms',
'hoodie.copyonwrite.record.size.estimate':285,
'hoodie.parquet.small.file.limit': 104857600,
'hoodie.parquet.max.file.size': 120000000,
'hoodie.cleaner.commits.retained': 1

alexone95 · 2023-05-15T14:10:42Z

The second-highest path by number of requests is: /hudiTable/.hoodie/metadata/.hoodie
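A per-prefix breakdown like the one reported above can be obtained by aggregating S3 server access logs. The following is a minimal sketch, not the method the reporter used: it assumes server access logging is enabled on the bucket and that records follow the standard space-delimited layout (owner, bucket, bracketed time, remote IP, requester, request ID, operation, key, quoted request URI). The function name `list_prefix_counts` is made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Leading fields of an S3 server access log record (assumed layout):
# owner, bucket, [time], remote IP, requester, request ID, operation, key, "request URI"
LOG_RE = re.compile(
    r'^\S+ \S+ \[[^\]]+\] \S+ \S+ \S+ (?P<op>\S+) (?P<key>\S+) "(?P<uri>[^"]*)"'
)

def list_prefix_counts(lines):
    """Count REST.GET.BUCKET (LIST) requests per requested prefix."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m is None or m.group("op") != "REST.GET.BUCKET":
            continue  # skip malformed records and non-LIST operations
        parts = m.group("uri").split(" ")  # e.g. ["GET", "/?prefix=...", "HTTP/1.1"]
        if len(parts) < 2:
            continue
        query = parse_qs(urlsplit(parts[1]).query)
        # LIST requests carry the listed prefix as a URL-encoded query parameter
        prefix = query.get("prefix", [""])[0]
        counts[prefix] += 1
    return counts
```

Calling `list_prefix_counts(open(path))` on a downloaded log file and then `.most_common(10)` on the result would show which prefixes dominate the LIST traffic.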

ad1happy2go · 2023-05-16T17:05:18Z

@alexone95
We made some fixes in Hudi versions 0.12.3 and 0.13.0 to remove unnecessary calls to the FS (#7561). They should bring down your S3 calls.

Can you try these versions and let us know if you still face the issue?

alexone95 · 2023-05-16T17:32:05Z

> @alexone95 We made some fixes to hudi versions 0.12.3 and 0.13.0 on fixing unnecessary calls to FS #7561 can you try them and let us know. should bring down your S3 calls.
>
> Can you try with these versions if you still face the issue.

The problem is that we are using Hudi on an EMR release (6.9), and even upgrading to 6.10 would not let us use version 0.12.3 or 0.13.0 (6.10 ships with 0.12.2-amzn). Is there a way to solve the issue without upgrading the Hudi version?

alexone95 · 2023-05-17T17:01:53Z

I would still like to point out that the problem you mentioned, related to the archive, was resolved by manually cleaning the files (with a daily script), so the S3 requests we are getting now should not be related to that problem.

Thanks in advance

mzheng-plaid · 2023-05-17T20:30:41Z

I believe this is caused by https://hudi.apache.org/docs/0.12.3/configurations#hoodiebootstrapindexenable . My understanding is that by default Hudi assumes every table is a bootstrapped table and does a check for the bootstrap index.

It looks like this setting was removed in 0.13.0

alexone95 · 2023-05-19T08:30:15Z

> I believe this is caused by https://hudi.apache.org/docs/0.12.3/configurations#hoodiebootstrapindexenable . My understanding is that by default Hudi assumes every table is a bootstrapped table and does a check for the bootstrap index.
>
> It looks like this setting was removed in 0.13.0

I tried setting this config to False, but nothing changed in terms of REST.GET.BUCKET requests; I am still seeing nearly 1 million requests per day.
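For reference, the setting discussed here would typically be passed alongside the other writer options. This is a minimal sketch, not the reporter's actual job: the key `hoodie.bootstrap.index.enable` is taken from the linked configuration page, the base options are a subset of the table properties listed in this issue, and the helper names are made up for illustration. A real write also needs options not shown in this thread (table name, record key, and so on). As reported above, this override did not reduce the LIST requests on 0.12.1-amzn-0.

```python
# Subset of the writer options listed in this issue.
hudi_options = {
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.index.type": "GLOBAL_BLOOM",
    "hoodie.cleaner.commits.retained": "1",
}

# Setting under discussion; per the report it had no visible effect on request counts.
bootstrap_override = {"hoodie.bootstrap.index.enable": "false"}

def build_write_options(base, overrides):
    """Return a new dict with the overrides applied on top of the base options."""
    merged = dict(base)
    merged.update(overrides)
    return merged

def write_batch(df, table_path, options):
    """Append a Spark DataFrame to a Hudi table.

    Not invoked here: it needs a Spark session with the Hudi bundle on the classpath.
    """
    df.write.format("hudi").options(**options).mode("append").save(table_path)

options = build_write_options(hudi_options, bootstrap_override)
```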

alexone95 · 2023-05-19T10:25:43Z

Would it be a good idea to set this parameter to true: https://hudi.apache.org/docs/configurations/#hoodiecleanerdeletebootstrapbasefile ? I would expect this config to delete this type of file: hudiTable/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile/. Am I wrong?

Thanks in advance

alexone95 · 2023-05-22T15:10:25Z

Is there a way to delete those files safely, or to prevent Hudi from creating them?

ad1happy2go · 2023-05-25T17:05:41Z

@alexone95 The property you mentioned deletes only the stale bootstrap base files; we can't delete the bootstrap index files.

PR #7404 also reduces the unnecessary calls to HFileBootstrapIndex. Can you please try it?

Can you tell us more about this table, such as how many partitions it has and its size?

yihua · 2023-05-25T17:10:33Z

Hi @alexone95 The issue of Hudi spending a lot of time requesting files in the archived directory should be fixed in #7561. There was a regression where the archived timeline was always loaded in the meta sync regardless of whether it was needed. Have you reached out to EMR support to get a patched Hudi jar to solve the issue? cc @umehrot2 @rahil-c @CTTY

mzheng-plaid · 2023-05-25T18:34:11Z

> #7404

Ah, that's interesting - does this mean we need to set https://hudi.apache.org/docs/0.12.3/configurations/#hoodiebootstrapindexclass to https://github.com/apache/hudi/blob/c859ea4cd23bc4ae56fca4914f34bbf5858bfde5/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/NoOpBootstrapIndex.java ? We just want to avoid the fs.exists call in HFileBootstrapIndex, because none of our tables are bootstrapped anyway. Disabling https://hudi.apache.org/docs/0.12.3/configurations#hoodiebootstrapindexenable doesn't seem to do anything (as @alexone95 pointed out).

(We do not have metadata table enabled so #7404 would not work for us)

ad1happy2go · 2023-05-30T05:53:58Z

Tracking JIRA created - https://issues.apache.org/jira/browse/HUDI-6284

mzheng-plaid · 2023-05-31T16:59:12Z

@ad1happy2go thanks for opening the ticket, but is there a way to disable this in the current version? None of the tables we have are bootstrapped anyway.

ad1happy2go · 2023-06-14T15:55:44Z

@mzheng-plaid Sorry for the delayed response. I don't think there is a way to disable it at the moment. @jonvex is working on a fix and will send the patch out soon.

Also, can you clarify what you mean by "tables are not bootstrapped", given that you have a bootstrap index file: /hudiTable/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile

mzheng-plaid · 2024-07-12T21:01:00Z

(@ad1happy2go I forgot to respond last year and just circled back to this problem now)

> Also, can you clarify what do you mean by "tables are not bootstrapped", as you have bootstrap index file

No, the bootstrap index file does not exist. Hudi does a LIST operation (REST.GET.BUCKET) to check whether the bootstrap index file exists:

https://github.com/apache/hudi/blob/release-0.12.2/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java#L105C1-L108C76
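To illustrate why probing for a file that does not exist can show up as LIST traffic, here is a toy model. It is a sketch under stated assumptions, not Hudi or connector code: `CountingStore` and `exists` are hypothetical, and the probing sequence (object HEAD first, then a LIST under `path/` to detect a directory) is a simplification. The actual sequence depends on the filesystem connector and its version. The data file name is also made up.

```python
class CountingStore:
    """Toy in-memory object store that counts HEAD and LIST requests."""

    def __init__(self, keys):
        self.keys = set(keys)
        self.head_requests = 0
        self.list_requests = 0

    def head(self, key):
        self.head_requests += 1
        return key in self.keys

    def list_prefix(self, prefix):
        self.list_requests += 1
        return [k for k in self.keys if k.startswith(prefix)]


def exists(store, path):
    """Assumed existence probe: HEAD the object, then LIST 'path/' as a directory check."""
    if store.head(path):
        return True
    return len(store.list_prefix(path + "/")) > 0


# Hypothetical table with one data file and no bootstrap index file.
store = CountingStore({"hudiTable/part-0001.parquet"})
index_path = (
    "hudiTable/.hoodie/.aux/.bootstrap/.partitions/"
    "00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile"
)
for _ in range(3):
    exists(store, index_path)
print(store.head_requests, store.list_requests)
```

In this model every probe of the missing file falls through to a LIST on the path with a trailing slash, which is consistent with the trailing `/` on the path reported at the top of the issue.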

codope added the priority:major degraded perf; unable to move forward; potential bugs label May 15, 2023

codope added this to Hudi Issue Support May 15, 2023

github-project-automation bot moved this to ⏳ Awaiting Triage in Hudi Issue Support May 15, 2023

codope moved this from ⏳ Awaiting Triage to 👤 User Action in Hudi Issue Support May 16, 2023

codope added the aws-support label May 16, 2023

xushiyan moved this from 👤 User Action to 🏁 Triaged in Hudi Issue Support May 31, 2023

xushiyan added priority:critical production down; pipelines stalled; Need help asap. and removed priority:major degraded perf; unable to move forward; potential bugs labels May 31, 2023

codope added the high-fs-calls Label tracking github issues for high unexpected number of FS calls mainly S3.. label Sep 18, 2023
