[SUPPORT] High number of REST.GET.BUCKET when writing over a hudi table #8715
The second one in terms of number of requests is: /hudiTable/.hoodie/metadata/.hoodie
@alexone95 Can you try with these versions and see if you still face the issue?
The problem is that we are using Hudi on an EMR release (6.9), and even after upgrading to 6.10 we would not be able to use version 0.12.3 or 0.13.0 (6.10 ships with 0.12.2-amzn). Is there a way to solve the issue without upgrading the Hudi version?
I still want to point out that the problem you mentioned, related to the archive, was resolved by manually cleaning files (via a daily script), and that the S3 requests we are getting now should not be related to that problem. Thanks in advance.
I believe this is caused by https://hudi.apache.org/docs/0.12.3/configurations#hoodiebootstrapindexenable . My understanding is that by default Hudi assumes every table is a bootstrapped table and does a check for the bootstrap index. It looks like this setting was removed in 0.13.0 |
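If the bootstrap-index check is the cause, the flag would be passed alongside the other Hudi write options; a minimal sketch (note that a later comment in this thread reports this did not change the request count, so treat it as an experiment, not a fix):

```python
# Sketch: disabling the bootstrap index check on Hudi 0.12.x.
# These are plain Hudi write options; they would be merged into the
# options dict already used by the PySpark job.
hudi_options = {
    "hoodie.bootstrap.index.enable": "false",
}

# In the actual job (requires a SparkSession with the Hudi bundle):
# df.write.format("hudi").options(**hudi_options).mode("append").save(table_path)
```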
I tried setting this config to false, but nothing changed in terms of REST.GET.BUCKET requests; I am still seeing nearly 1 million requests per day.
Would it be a good idea to set this parameter https://hudi.apache.org/docs/configurations/#hoodiecleanerdeletebootstrapbasefile to true? I expect this config to delete this type of file: hudiTable/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile/ Am I wrong? Thanks in advance.
Is there a way to delete those files safely, or to prevent Hudi from creating them?
@alexone95 The property you mentioned deletes only the stale bootstrap base files; we can't delete the bootstrap index files. This PR #7404 also reduces the unnecessary calls to the HFileBootstrapIndex. Can you please try it? Can you also let us know more about this table, e.g. how many partitions it has and its size.
Hi @alexone95 The issue of Hudi spending a lot of time requesting files in the archived directory should be fixed in #7561. There was a regression where the archived timeline was always loaded in the meta sync, regardless of whether it was needed. Have you reached out to EMR support to get a patched Hudi jar to solve the issue? cc @umehrot2 @rahil-c @CTTY
Ah, that's interesting - does this mean we need to set https://hudi.apache.org/docs/0.12.3/configurations/#hoodiebootstrapindexclass to https://github.com/apache/hudi/blob/c859ea4cd23bc4ae56fca4914f34bbf5858bfde5/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/NoOpBootstrapIndex.java ? We just want to avoid the bootstrap index lookups. (We do not have the metadata table enabled, so #7404 would not work for us.)
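A sketch of what pointing the bootstrap index at the no-op implementation might look like, assuming NoOpBootstrapIndex is present in your Hudi build (it exists in hudi-common at the commit linked above; whether a given EMR build includes it is an assumption, and the thread does not confirm this works):

```python
# Hypothetical: route bootstrap index lookups to the no-op implementation,
# so no HFile-backed index is consulted on each write.
hudi_options = {
    "hoodie.bootstrap.index.class":
        "org.apache.hudi.common.bootstrap.index.NoOpBootstrapIndex",
}
```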
Tracking JIRA created - https://issues.apache.org/jira/browse/HUDI-6284
@ad1happy2go thanks for opening the ticket, but is there a way to disable this in the current version? None of the tables we have are bootstrapped anyway.
@mzheng-plaid Sorry for the delayed response. I don't think there is a way to disable it at the moment. @jonvex is working on a fix and will soon send the patch out. Also, can you clarify what you mean by "tables are not bootstrapped", as you have a bootstrap index file -
(@ad1happy2go forgot to respond last year, just circled back to this problem now)
Hello, we were facing the problem that Hudi spends a lot of time requesting files in the /archived directory, so to reduce this problem we built a solution that deletes the files in the archive daily. The solution works fine in terms of reducing commit latency, but since we deployed it we are facing the problem that REST.GET.BUCKET requests have increased a lot. In particular, a single table generates 1 million requests per day, of which 900k are GET requests for this path: /hudiTable/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile/.
We read INSERT, UPDATE and DELETE operations from a Kafka topic and replicate them into a target Hudi table stored on Hive via a PySpark job running 24/7.
Why do I get this behavior? Is there something I can do to reduce the number of requests?
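The daily archive-cleanup script mentioned above is not shown in the thread; a minimal sketch of the selection logic such a script might use (the helper name and the retention window are hypothetical, and in a real job the key/timestamp pairs would come from an S3 listing, e.g. boto3 list_objects_v2):

```python
from datetime import datetime, timedelta, timezone

def archived_keys_to_delete(keys_with_mtimes, retention_days=7, now=None):
    """Select keys under .hoodie/archived/ whose last-modified time
    is older than the retention window. Hypothetical helper."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [
        key for key, mtime in keys_with_mtimes
        if "/.hoodie/archived/" in key and mtime < cutoff
    ]

# Hard-coded listing for illustration only.
now = datetime(2023, 6, 1, tzinfo=timezone.utc)
listing = [
    ("hudiTable/.hoodie/archived/.commits_.archive.1_1-0-1",
     datetime(2023, 5, 1, tzinfo=timezone.utc)),   # old -> selected
    ("hudiTable/.hoodie/archived/.commits_.archive.2_1-0-1",
     datetime(2023, 5, 31, tzinfo=timezone.utc)),  # recent -> kept
    ("hudiTable/data/part-00000.parquet",
     datetime(2023, 4, 1, tzinfo=timezone.utc)),   # not archived -> kept
]
stale = archived_keys_to_delete(listing, retention_days=7, now=now)
```

Note that, as this thread shows, deleting archived timeline files can shift the cost elsewhere, so any such script should be treated with caution.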
Environment Description
Additional context
HOODIE TABLE PROPERTIES:
'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
'hoodie.datasource.write.hive_style_partitioning':'true',
'hoodie.index.type':'GLOBAL_BLOOM',
'hoodie.simple.index.update.partition.path':'true',
'hoodie.datasource.hive_sync.enable': 'true',
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
'hoodie.datasource.hive_sync.use_jdbc': 'false',
'hoodie.datasource.hive_sync.mode': 'hms',
'hoodie.copyonwrite.record.size.estimate':285,
'hoodie.parquet.small.file.limit': 104857600,
'hoodie.parquet.max.file.size': 120000000,
'hoodie.cleaner.commits.retained': 1
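Properties like the ones above are typically collected in a Python dict and passed to the DataFrame writer; a hedged sketch with a subset of the listed options (the table path and DataFrame are hypothetical):

```python
# Subset of the write options listed in the issue.
hudi_options = {
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.index.type": "GLOBAL_BLOOM",
    "hoodie.cleaner.commits.retained": "1",
}

# In the actual job (requires a SparkSession with the Hudi bundle):
# df.write.format("hudi").options(**hudi_options).mode("append").save("s3://bucket/hudiTable")
```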