[HUDI-4287] Optimize Flink checkpoint meta mechanism to fix mistaken pending instants #5913

Closed
wants to merge 3 commits into from

chenshzh
Contributor

@chenshzh chenshzh commented Jun 20, 2022

Change Logs

Details added in https://issues.apache.org/jira/projects/HUDI/issues/HUDI-4287

Optimize the mechanism by giving the CANCELLED CkpMessage state the highest priority; it corresponds to the DELETE of an instant during a rollback action.

Revert to re-sending pending instants to ckp_meta on bootstrap.

Cover the compaction and clustering scenarios.
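
A minimal sketch of the "highest priority" idea described above, not the code in this PR. It assumes that several checkpoint messages can be observed for one instant and that the effective state is the one with the highest priority. Only the CANCELLED state comes from the description; the class name `CkpStateSketch`, the helper methods, the other state names, and their relative ordering are hypothetical and chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class CkpStateSketch {

    // Hypothetical state set; only CANCELLED is named in the PR description.
    // The numeric priorities of the other states are an assumption.
    enum State {
        INFLIGHT(0), COMPLETED(1), ABORTED(2), CANCELLED(3);

        final int priority;

        State(int priority) {
            this.priority = priority;
        }
    }

    // Returns the observed state with the highest priority for one instant.
    static State resolve(List<State> observed) {
        if (observed.isEmpty()) {
            throw new IllegalArgumentException("no messages for this instant");
        }
        State winner = observed.get(0);
        for (State s : observed) {
            if (s.priority > winner.priority) {
                winner = s;
            }
        }
        return winner;
    }

    // An instant counts as pending only if nothing overrides INFLIGHT.
    static boolean isPending(List<State> observed) {
        return resolve(observed) == State.INFLIGHT;
    }

    public static void main(String[] args) {
        // An instant that was started and then deleted by a rollback.
        List<State> rolledBack = Arrays.asList(State.INFLIGHT, State.CANCELLED);
        System.out.println(resolve(rolledBack));   // prints CANCELLED
        System.out.println(isPending(rolledBack)); // prints false
    }
}
```

With this ordering, an instant removed by a rollback is never reported as pending, even if an earlier in-flight message for it is still present.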

Impact

The last successful instant is kept when recovering for recommit.

Risk level (write none, low, medium or high below)

Medium. It can be verified quickly with a data quality check.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@wxplovecc
Contributor

+1 for fix

@chenshzh
Contributor Author

chenshzh commented Jun 29, 2022

Fixed conflicts with https://issues.apache.org/jira/projects/HUDI/issues/HUDI-4311, which also attempted to fix the data loss on rollback in the batch write scenario.

This PR might still be needed because a rollback can happen at runtime, for example in StreamWriteOperatorCoordinator#notifyCheckpointComplete during commitInstant, and in that case the job does not crash, so CkpMeta is not re-bootstrapped.

@chenshzh
Contributor Author

chenshzh commented Jul 1, 2022

@danny0405 updated already, please take a look and see whether it's OK.

@danny0405
Contributor

And this pr might be still needed because rollback could happen during runtime, such as StreamWriteOperatorCoordinator#notifyCheckpointComplete when commitInstant, and it won't crash the job fo

We already added stats for the write functions to recover and recommit if the job fails while committing the instant.

@yihua added the priority:major, writer-core, and flink labels on Jul 5, 2022
@leesf
Contributor

leesf commented Dec 9, 2022

@chenshzh would you please rebase to latest master first?

@chenshzh
Contributor Author

@chenshzh would you please rebase to latest master first?

@leesf updated.

@alexeykudinkin added the priority:blocker label and removed the priority:major label on Dec 20, 2022
@alexeykudinkin
Contributor

@chenshzh can you please check on the CI failures?

@chenshzh
Contributor Author

@hudi-bot run azure

@hudi-bot

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@chenshzh
Contributor Author

  • FAILURE

@alexeykudinkin would you help review the CI failure once more?

I have rebased to the latest, and the failure does not seem to be caused by this PR's specific changes.

It failed in hudi-utilities, and quite a few recent PRs' pipeline failures are in this same module.

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.22.2:test (default-test) on project hudi-utilities_2.11: There are test failures.
[ERROR] 
[ERROR] Please refer to /home/vsts/work/1/s/hudi-utilities/target/surefire-reports for the individual test results.
[ERROR] Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
[ERROR] The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
[ERROR] Command was /bin/sh -c cd /home/vsts/work/1/s/hudi-utilities && /usr/lib/jvm/temurin-8-jdk-amd64/jre/bin/java -Xmx2g org.apache.maven.surefire.booter.ForkedBooter /home/vsts/work/1/s/hudi-utilities/target/surefire 2022-12-26T06-54-12_966-jvmRun1 surefire3510960863127007832tmp surefire_87691856527070078480tmp
[ERROR] Error occurred in starting fork, check output in log
[ERROR] Process Exit Code: 255
[ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
[ERROR] Command was /bin/sh -c cd /home/vsts/work/1/s/hudi-utilities && /usr/lib/jvm/temurin-8-jdk-amd64/jre/bin/java -Xmx2g org.apache.maven.surefire.booter.ForkedBooter /home/vsts/work/1/s/hudi-utilities/target/surefire 2022-12-26T06-54-12_966-jvmRun1 surefire3510960863127007832tmp surefire_87691856527070078480tmp
[ERROR] Error occurred in starting fork, check output in log
[ERROR] Process Exit Code: 255
[ERROR] 	at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1183)
[ERROR] 	at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1011)
[ERROR] 	at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:857)
[ERROR] 	at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
[ERROR] 	at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2(MojoExecutor.java:370)
[ERROR] 	at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute(MojoExecutor.java:351)
[ERROR] 	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:215)
[ERROR] 	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:171)
[ERROR] 	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:163)
[ERROR] 	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
[ERROR] 	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
[ERROR] 	at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56)
[ERROR] 	at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
[ERROR] 	at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:294)
[ERROR] 	at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192)
[ERROR] 	at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105)
[ERROR] 	at org.apache.maven.cli.MavenCli.execute(MavenCli.java:960)
[ERROR] 	at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:293)
[ERROR] 	at org.apache.maven.cli.MavenCli.main(MavenCli.java:196)
[ERROR] 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[ERROR] 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[ERROR] 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[ERROR] 	at java.lang.reflect.Method.invoke(Method.java:498)
[ERROR] 	at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:282)
[ERROR] 	at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:225)
[ERROR] 	at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:406)
[ERROR] 	at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:347)
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <args> -rf :hudi-utilities_2.11

@danny0405
Contributor

Hello @chenshzh, do you think this is still an issue? We can talk offline if possible. Are you in the Hudi DingTalk group now?

@chenshzh
Contributor Author

chenshzh commented Dec 28, 2022

Hello @chenshzh, do you think this is still an issue? We can talk offline if possible. Are you in the Hudi DingTalk group now?

@danny0405 Yes, I'm in the group now, and would be glad to discuss it.

@danny0405
Contributor

Hello @chenshzh, do you think this is still an issue? We can talk offline if possible. Are you in the Hudi DingTalk group now?

@danny0405 Yes, I'm in the group now, and would be glad to discuss it.

You can add me as a friend and we can have a talk.

@nsivabalan added the priority:critical label and removed the priority:blocker label on Jan 24, 2023
@nsivabalan
Contributor

@danny0405: is this still a valid patch? Can you follow up?

@danny0405
Contributor

No, it should have been fixed in master.

@danny0405 danny0405 closed this Feb 8, 2023
Labels
flink, priority:critical, writer-core
Projects
Status: 🚧 Needs Repro
Archived in project
8 participants