[HUDI-4287] Optimize Flink checkpoint meta mechanism to fix mistaken pending instants #5913

chenshzh · 2022-06-20T13:39:52Z

Change Logs

Details added in https://issues.apache.org/jira/projects/HUDI/issues/HUDI-4287

Optimize the mechanism by giving the CANCELLED CkpMessage state the highest priority; it corresponds to the DELETE of an instant during a rollback action.

Revert to re-sending pending instants to ckp_meta on bootstrap.

Cover the compaction and clustering scenarios.
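
To illustrate the priority rule above, here is a minimal sketch and not the code in this PR. The class `CkpStateSketch`, the `Message` type, the helper methods, and the relative order of the non-CANCELLED states are assumptions made for illustration; only the idea that CANCELLED outranks every other state comes from the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class CkpStateSketch {
    // Hypothetical states. Declaration order encodes priority: a later constant wins.
    // Only "CANCELLED is highest" is taken from the PR description.
    enum State { INFLIGHT, ABORTED, COMPLETED, CANCELLED }

    // Hypothetical marker: one (instant, state) pair as it might be listed from ckp_meta.
    static final class Message {
        final String instant;
        final State state;

        Message(String instant, State state) {
            this.instant = instant;
            this.state = state;
        }
    }

    // Collapse all markers of the same instant to the highest-priority state.
    static Map<String, State> resolve(List<Message> messages) {
        Map<String, State> resolved = new HashMap<>();
        for (Message m : messages) {
            resolved.merge(m.instant, m.state, (a, b) -> a.compareTo(b) >= 0 ? a : b);
        }
        return resolved;
    }

    // An instant is pending only if its resolved state is still INFLIGHT,
    // so a rolled-back (CANCELLED) instant is never mistaken for a pending one.
    static List<String> pendingInstants(List<Message> messages) {
        List<String> pending = new ArrayList<>();
        resolve(messages).forEach((instant, state) -> {
            if (state == State.INFLIGHT) {
                pending.add(instant);
            }
        });
        pending.sort(Comparator.naturalOrder());
        return pending;
    }

    // The latest instant whose resolved state is COMPLETED, if any.
    static Optional<String> lastCompletedInstant(List<Message> messages) {
        return resolve(messages).entrySet().stream()
            .filter(e -> e.getValue() == State.COMPLETED)
            .map(Map.Entry::getKey)
            .max(Comparator.naturalOrder());
    }

    public static void main(String[] args) {
        List<Message> messages = Arrays.asList(
            new Message("20220620133952", State.INFLIGHT),
            new Message("20220620133952", State.COMPLETED),
            new Message("20220620134500", State.INFLIGHT),
            new Message("20220620134500", State.CANCELLED), // rolled back
            new Message("20220620135000", State.INFLIGHT));

        System.out.println(pendingInstants(messages));
        System.out.println(lastCompletedInstant(messages).orElse("none"));
    }
}
```

In this sketch the rolled-back instant resolves to CANCELLED even though an INFLIGHT marker for it still exists, so it is excluded from the pending list, while the last completed instant stays available for a recommit.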

Impact

The last successful instant is kept when recovering, so that it can be recommitted.

Risk level (write none, low medium or high below)

medium. It can be verified quickly with a data quality check.

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

wxplovecc · 2022-06-22T12:30:05Z

+1 for fix

chenshzh · 2022-06-29T06:17:35Z

Fixed conflicts with https://issues.apache.org/jira/projects/HUDI/issues/HUDI-4311, which also attempted to fix the data loss on rollback in the batch write scenario.

This PR might still be needed, because a rollback can also happen at runtime, for example in StreamWriteOperatorCoordinator#notifyCheckpointComplete during commitInstant, and that does not crash the job, so CkpMeta is not re-bootstrapped.

chenshzh · 2022-07-01T12:57:35Z

@danny0405 updated; please review and see whether it's OK.

danny0405 · 2022-07-03T05:19:09Z

And this pr might be still needed because rollback could happen during runtime, such as StreamWriteOperatorCoordinator#notifyCheckpointComplete when commitInstant, and it won't crash the job fo

We already added stats for the write functions to recover and recommit if the job fails during the instant commit.

leesf · 2022-12-09T06:39:04Z

@chenshzh would you please rebase to latest master first?

chenshzh · 2022-12-12T03:16:41Z

@chenshzh would you please rebase to latest master first?

@leesf updated.

alexeykudinkin · 2022-12-20T02:44:46Z

@chenshzh can you please check on the CI failures?

chenshzh · 2022-12-26T06:25:13Z

@hudi-bot run azure

hudi-bot · 2022-12-26T09:30:37Z

CI report:

fbb9ed3 Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

chenshzh · 2022-12-27T05:00:13Z

FAILURE

@alexeykudinkin could you help review the CI failure once more?

I have rebased onto the latest master, and the failure does not seem to be caused by this PR's changes.

It failed in hudi-utilities, and the pipelines of quite a few recent PRs have failed in this same module.

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.22.2:test (default-test) on project hudi-utilities_2.11: There are test failures.
[ERROR] 
[ERROR] Please refer to /home/vsts/work/1/s/hudi-utilities/target/surefire-reports for the individual test results.
[ERROR] Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
[ERROR] The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
[ERROR] Command was /bin/sh -c cd /home/vsts/work/1/s/hudi-utilities && /usr/lib/jvm/temurin-8-jdk-amd64/jre/bin/java -Xmx2g org.apache.maven.surefire.booter.ForkedBooter /home/vsts/work/1/s/hudi-utilities/target/surefire 2022-12-26T06-54-12_966-jvmRun1 surefire3510960863127007832tmp surefire_87691856527070078480tmp
[ERROR] Error occurred in starting fork, check output in log
[ERROR] Process Exit Code: 255
[ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
[ERROR] Command was /bin/sh -c cd /home/vsts/work/1/s/hudi-utilities && /usr/lib/jvm/temurin-8-jdk-amd64/jre/bin/java -Xmx2g org.apache.maven.surefire.booter.ForkedBooter /home/vsts/work/1/s/hudi-utilities/target/surefire 2022-12-26T06-54-12_966-jvmRun1 surefire3510960863127007832tmp surefire_87691856527070078480tmp
[ERROR] Error occurred in starting fork, check output in log
[ERROR] Process Exit Code: 255
[ERROR] 	at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1183)
[ERROR] 	at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1011)
[ERROR] 	at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:857)
[ERROR] 	at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
[ERROR] 	at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2(MojoExecutor.java:370)
[ERROR] 	at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute(MojoExecutor.java:351)
[ERROR] 	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:215)
[ERROR] 	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:171)
[ERROR] 	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:163)
[ERROR] 	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
[ERROR] 	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
[ERROR] 	at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56)
[ERROR] 	at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
[ERROR] 	at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:294)
[ERROR] 	at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192)
[ERROR] 	at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105)
[ERROR] 	at org.apache.maven.cli.MavenCli.execute(MavenCli.java:960)
[ERROR] 	at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:293)
[ERROR] 	at org.apache.maven.cli.MavenCli.main(MavenCli.java:196)
[ERROR] 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[ERROR] 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[ERROR] 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[ERROR] 	at java.lang.reflect.Method.invoke(Method.java:498)
[ERROR] 	at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:282)
[ERROR] 	at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:225)
[ERROR] 	at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:406)
[ERROR] 	at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:347)
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <args> -rf :hudi-utilities_2.11

danny0405 · 2022-12-28T04:08:36Z

Hello @chenshzh, do you think this is still an issue here? We can have an offline talk if possible. Are you in the Hudi DingTalk group now?

chenshzh · 2022-12-28T06:34:13Z

Hello @chenshzh, do you think this is still an issue here? We can have an offline talk if possible. Are you in the Hudi DingTalk group now?

@danny0405 Yes, I'm in the group now, and would be glad to discuss it.

danny0405 · 2022-12-28T06:57:41Z

Hello @chenshzh, do you think this is still an issue here? We can have an offline talk if possible. Are you in the Hudi DingTalk group now?

@danny0405 Yes, I'm in the group now, and would be glad to discuss it.

You can add me as a friend and we can have a talk.

nsivabalan · 2023-02-08T01:21:26Z

@danny0405: is this still a valid patch? Can you follow up?

danny0405 · 2023-02-08T03:13:01Z

No, it should have been fixed in master.

chenshzh force-pushed the csz/HUDI-4287 branch from e7f9858 to 0125f05 Compare June 29, 2022 06:00

chenshzh force-pushed the csz/HUDI-4287 branch from 0125f05 to bb42bb2 Compare June 29, 2022 06:48

yihua added the priority:major, writer-core, and flink labels Jul 5, 2022

chenshzh mentioned this pull request Dec 9, 2022

[HUDI-4311] Fix Flink lose data on some rollback scene #5950

Merged

5 tasks

chenshzh force-pushed the csz/HUDI-4287 branch from bb42bb2 to 59a5e7c Compare December 9, 2022 12:43

alexeykudinkin added the priority:blocker label and removed the priority:major label Dec 20, 2022

chenshzh added 3 commits December 23, 2022 14:59

[HUDI-4287] Optimize Flink checkpoint meta mechanism to fix mistaken pending instants

3792c85

add clustering condition

b39281f

fix checkstyle

fbb9ed3

chenshzh force-pushed the csz/HUDI-4287 branch from 66cadb8 to fbb9ed3 Compare December 23, 2022 07:00

nsivabalan added the priority:critical label and removed the priority:blocker label Jan 24, 2023

danny0405 closed this Feb 8, 2023
