[HUDI-5234] Streaming read skip clustering #7296
public static void writeDataAsInsert(
    List<RowData> dataBuffer,
    Configuration conf) throws Exception {
  InsertFunctionWrapper<RowData> funcWrapper = new InsertFunctionWrapper<>(
TODO: would refactor the code with writer wrapper.
.booleanType()
.defaultValue(false)
.withDescription("Whether to skip clustering instants for streaming read,\n"
    + "to avoid reading duplicates");
Could you add the cases which cause reading duplicates in the description?
Yeah, that's needed. Can you add more cases for clustering here? You can add UTs in TestInputFormat.
@danny0405, ok. I will add the UTs.
Co-authored-by: zhuanshenbsj1 <zhuanshen_bsj@163.com>
Change Logs
Supports skipping clustering commits during streaming read. This makes streaming read on a table with clustering more efficient, because the files written by clustering only contain rewritten copies of the original records and no new data.
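To illustrate the idea, here is a minimal sketch of filtering clustering instants out of the set of instants a streaming read would consume. It is not the code from this PR: the `Instant` class, its `clustering` flag, and the `instantsToRead` helper are hypothetical stand-ins for the real timeline types, and the `skipClustering` argument plays the role of the `read.streaming.skip_clustering` option.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipClusteringSketch {

    /** Hypothetical, simplified stand-in for a completed instant on the timeline. */
    static final class Instant {
        final String timestamp;
        final boolean clustering; // assumption: true if the instant was produced by clustering

        Instant(String timestamp, boolean clustering) {
            this.timestamp = timestamp;
            this.clustering = clustering;
        }
    }

    /**
     * Returns the timestamps of the instants to read.
     * When skipClustering is true, clustering instants are dropped, since they
     * only rewrite records that were already emitted by earlier commits.
     */
    static List<String> instantsToRead(List<Instant> timeline, boolean skipClustering) {
        List<String> result = new ArrayList<>();
        for (Instant instant : timeline) {
            if (skipClustering && instant.clustering) {
                continue; // skip rewritten data to avoid emitting duplicates
            }
            result.add(instant.timestamp);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Instant> timeline = Arrays.asList(
            new Instant("001", false),  // regular commit
            new Instant("002", true),   // clustering rewrite of the data from 001
            new Instant("003", false)); // regular commit

        System.out.println(instantsToRead(timeline, true));  // prints [001, 003]
        System.out.println(instantsToRead(timeline, false)); // prints [001, 002, 003]
    }
}
```

With the flag off, the reader would also consume instant `002`, whose files hold the same records as `001`; that is the duplicate-read case the option description refers to.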
Impact
No
Risk level (write none, low, medium or high below)
none
Documentation Update
Should update the doc with this new option: read.streaming.skip_clustering
Contributor's checklist