Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-5007] Prevent Hudi from reading the entire timeline's when perf… #6920

Merged
merged 1 commit into from
Nov 29, 2022

Conversation

voonhous
Copy link
Member

@voonhous voonhous commented Oct 11, 2022

…orming a LATEST streaming read

Change Logs

Prevent Hudi from performing unnecessary file scans when performing a stream read from the latest instant.

Impact

No public API changed, changing the access level of a private method from private to public so that it can be used for testing

Risk level (write none, low medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@danny0405
Copy link
Contributor

danny0405 commented Oct 14, 2022

Nice catch, i have reviewed and applied a patch here:
5007.patch.zip

Can you also add a test class in TestInputFormat and test the input splits logic for streaming read in this case ?

You can take a reference from TestInputFormat#testReadSkipCompaction

@danny0405 danny0405 self-assigned this Oct 14, 2022
@danny0405 danny0405 added flink Issues related to flink reader-core labels Oct 14, 2022
@voonhous
Copy link
Member Author

voonhous commented Oct 21, 2022

Can you also add a test class in TestInputFormat and test the input splits logic for streaming read in this case ?

You can take a reference from TestInputFormat#testReadSkipCompaction

Apologies for the late reply, given that the read behaviour is correct, the only way we can test this fix is to make the IncrementalInputSplits#getInstantRange method public.

This bug is caused by IncrementalInputSplits#getInstantRange not producing the correct result, causing unnecessary file checks, blocking the first read cycle.

The IncrementalInputSplits.inputSplits method is still able to produce the right result, allowing the right records to be read out.

Hence, I believe a test should be performed on the generation of HoodieInstants to read via the IncrementalInputSplits#getInstantRange method instead of the read behaviour like the rest of the tests in the TestInputFormat class are doing.

To do this, one will have to make the method public or use Reflection to access this class, I am not sure if loading the class in via reflection is the right way to approach this test.

Do you have any suggestions?

@voonhous voonhous force-pushed the HUDI-5007 branch 3 times, most recently from 907782e to bdf960d Compare November 25, 2022 11:58
@voonhous
Copy link
Member Author

@danny0405 Added the required tests to validate this fix, can you please to take a look at it again?

Thank you.

@voonhous voonhous force-pushed the HUDI-5007 branch 2 times, most recently from 058d52e to c9060e7 Compare November 26, 2022 09:13
@voonhous
Copy link
Member Author

@danny0405 Added the changes that you have suggested. Thank you.

@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@danny0405
Copy link
Contributor

The failed test case in module hudi-utilities should not be affected by this patch, would merge it soon ~

Copy link
Contributor

@danny0405 danny0405 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@danny0405 danny0405 merged commit 6baf733 into apache:master Nov 29, 2022
satishkotha pushed a commit that referenced this pull request Dec 13, 2022
satishkotha pushed a commit that referenced this pull request Dec 13, 2022
danny0405 pushed a commit to danny0405/hudi that referenced this pull request Dec 15, 2022
…orming a LATEST streaming read (apache#6920)

(cherry picked from commit 6baf733)
nsivabalan pushed a commit that referenced this pull request Dec 16, 2022
* [HUDI-5007] Prevent Hudi from reading the entire timeline's when performing a LATEST streaming read (#6920)

(cherry picked from commit 6baf733)

* [HUDI-5228] Flink table service job fs view conf overwrites the one of writing job (#7214)

(cherry picked from commit dc5cc08)

Co-authored-by: voonhous <voonhousu@gmail.com>
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
@voonhous voonhous deleted the HUDI-5007 branch May 15, 2023 08:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flink Issues related to flink reader-core
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants