[HUDI-5234] Streaming read skip clustering #7296

Merged
1 commit merged into apache:master on Nov 25, 2022


danny0405
Contributor

Change Logs

Supports skipping clustering commits during streaming read. This makes streaming reads on a table with clustering more efficient: clustering only rewrites the original records into new files, so reading those files again yields no new data.
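
To illustrate the idea, here is a minimal sketch of the filtering step, not the code in this PR. The `Instant` class and the `instantsToRead` helper are hypothetical stand-ins rather than Hudi classes, and the `clustering` flag is assumed to be already known for each instant.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SkipClusteringSketch {

    /** Hypothetical stand-in for a timeline instant; not a Hudi class. */
    static final class Instant {
        final String timestamp;
        // Assumption: true when the instant only rewrote existing records (clustering).
        final boolean clustering;

        Instant(String timestamp, boolean clustering) {
            this.timestamp = timestamp;
            this.clustering = clustering;
        }
    }

    /** Returns the timestamps a streaming reader would consume, in timeline order. */
    static List<String> instantsToRead(List<Instant> timeline, boolean skipClustering) {
        List<String> result = new ArrayList<>();
        for (Instant instant : timeline) {
            // A clustering instant carries no new records, so it can be skipped.
            if (skipClustering && instant.clustering) {
                continue;
            }
            result.add(instant.timestamp);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Instant> timeline = Arrays.asList(
            new Instant("001", false),
            new Instant("002", true),
            new Instant("003", false));
        System.out.println(instantsToRead(timeline, false));
        System.out.println(instantsToRead(timeline, true));
    }
}
```

With skipping enabled, the reader in this sketch never touches instant `002`, so the records it rewrote are not emitted a second time.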

Impact

No

Risk level (write none, low, medium or high below)

none

Documentation Update

The docs should be updated with the new option read.streaming.skip_clustering.
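
As a rough sketch of how the option could be documented, the snippet below resolves it from a plain map of table options. The class and helper names are hypothetical, and a `Map<String, String>` stands in for the real configuration object. Only the key name and the `false` default are taken from this PR.

```java
import java.util.HashMap;
import java.util.Map;

class SkipClusteringOptionSketch {

    // Option key introduced by this PR.
    static final String SKIP_CLUSTERING_KEY = "read.streaming.skip_clustering";

    /** Hypothetical helper: reads the option, falling back to the default of false. */
    static boolean skipClustering(Map<String, String> tableOptions) {
        return Boolean.parseBoolean(tableOptions.getOrDefault(SKIP_CLUSTERING_KEY, "false"));
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        System.out.println(skipClustering(options)); // option not set, default applies

        options.put(SKIP_CLUSTERING_KEY, "true");
        System.out.println(skipClustering(options)); // explicitly enabled
    }
}
```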

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

public static void writeDataAsInsert(
    List<RowData> dataBuffer,
    Configuration conf) throws Exception {
  InsertFunctionWrapper<RowData> funcWrapper = new InsertFunctionWrapper<>(
Contributor Author

TODO: refactor this code to use the writer wrapper.

    .booleanType()
    .defaultValue(false)
    .withDescription("Whether to skip clustering instants for streaming read,\n"
        + "to avoid reading duplicates");
Member

Could you add to the description the cases that cause duplicates to be read?

Contributor Author

Yeah, that's needed. Can you add more cases for clustering here? You can add UTs in TestInputFormat.

Member

@danny0405, OK. I will add the UTs.

@hudi-bot

CI report:

@danny0405 danny0405 merged commit 431ade0 into apache:master Nov 25, 2022
satishkotha pushed a commit to satishkotha/incubator-hudi that referenced this pull request Dec 12, 2022
Co-authored-by: zhuanshenbsj1 <zhuanshen_bsj@163.com>
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
Co-authored-by: zhuanshenbsj1 <zhuanshen_bsj@163.com>
5 participants