
[SPARK-48260][SQL] Disable output committer coordination in one test of ParquetIOSuite #46560

Closed

Conversation

cloud-fan (Contributor) commented on May 13, 2024:

What changes were proposed in this pull request?

Recently I noticed a flaky test in ParquetIOSuite: "SPARK-7837 Do not close output writer twice when commitTask() fails".

It turns out to be a race condition. The test injects an error into the task-commit step, and the job may fail in one of two ways:

  1. The task got the driver's permission to commit, but the commit failed and thus the task failed. This triggers a stage failure because it implies possible data duplication; see #36564 ([SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status).
  2. The test disables task retries, so TaskSetManager aborts the stage.

Both failures are reported by sending an event to the DAGScheduler, so the job's final error depends on which event is processed first. That is normally harmless, but this ParquetIOSuite test asserts on the error class. This PR fixes the flakiness by disabling output committer coordination in the test; the necessary changes are added to allow disabling it per query. A toy model of the race is sketched below.
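
To make the race concrete, here is a minimal, self-contained toy model of a single-queue scheduler event loop; the event names and the setup are illustrative assumptions, not Spark's real DAGScheduler event classes:

```scala
import java.util.concurrent.LinkedBlockingQueue

// Toy model of the race: two components report the same task failure to a
// single event queue, and whichever event arrives first decides the error.
object EventRaceDemo extends App {
  sealed trait Event
  case class StageFailed(reason: String) extends Event    // from OutputCommitCoordinator
  case class TaskSetAborted(reason: String) extends Event // from TaskSetManager

  val events = new LinkedBlockingQueue[Event]()

  // Both failure paths fire concurrently after the injected commitTask error.
  new Thread(() => events.put(StageFailed("authorized committer failed"))).start()
  new Thread(() => events.put(TaskSetAborted("task failed and retries are disabled"))).start()

  // The scheduler acts on whichever event it dequeues first, so the job's
  // final error class is nondeterministic, which is why the test was flaky.
  println(s"Job failed with: ${events.take()}")
}
```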

Why are the changes needed?

Fix a flaky test.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

N/A

Was this patch authored or co-authored using generative AI tooling?

No

cloud-fan (Contributor, Author) commented:

cc @gengliangwang @viirya

viirya (Member) commented on May 13, 2024:

> The task got the driver's permission to commit, but the commit failed and thus the task failed. This triggers a stage failure because it implies possible data duplication; see #36564.

I read #36564; it seems to handle the case where a commit succeeds but the task fails, which means data duplication is possible.

But here both the task and the commit failed, so it should not cause data duplication. Does that also trigger a stage failure?

cloud-fan (Contributor, Author) commented:

@viirya I think #36564's PR description is clear about the details. The driver never knows whether a task commit succeeded. It only knows: 1) whether the "ask for commit" request was approved, and 2) whether the task completed successfully.

The flaky test triggers exactly this: 1) it injects a failure into commitTask, which happens after the task receives the driver's permission to commit, and 2) the task eventually fails. A toy sketch of the coordinator's bookkeeping follows.
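
To illustrate the protocol described here, below is a small self-contained toy model of the coordinator's bookkeeping; the object and method names are illustrative assumptions, not the real (private) OutputCommitCoordinator API:

```scala
import scala.collection.mutable

// Toy model of the coordinator's bookkeeping, for illustration only; the
// real OutputCommitCoordinator is a private Spark class with RPC plumbing.
object ToyCommitCoordinator {
  // partition -> task attempt that was authorized to commit
  private val authorized = mutable.Map.empty[Int, Int]

  // 1) The driver knows whether an "ask for commit" request was approved.
  def canCommit(partition: Int, attempt: Int): Boolean = synchronized {
    authorized.get(partition) match {
      case Some(winner) => winner == attempt // only the first asker wins
      case None => authorized(partition) = attempt; true
    }
  }

  // 2) The driver knows whether the task completed successfully. It never
  // learns whether an authorized commit actually wrote its output, so a
  // failed authorized committer forces a stage failure (SPARK-39195).
  def taskCompleted(partition: Int, attempt: Int, successful: Boolean): Option[String] =
    synchronized {
      if (!successful && authorized.get(partition).contains(attempt)) {
        Some("authorized committer failed; possible duplication, abort the stage")
      } else {
        None
      }
    }
}
```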

viirya (Member) commented on May 13, 2024:

> In this PR, we do the following:
> When a task commit succeeds, the executor sends a CommitOutputSuccess message to the OutputCommitCoordinator.
> When the OutputCommitCoordinator handles taskComplete, if the task failed but the commit succeeded, data duplication will happen, so we should fail the job.

@cloud-fan Thanks for explaining; that sounds reasonable. Although, from the description in #36564 quoted above, it seems that when the OutputCommitCoordinator handles taskComplete, it checks whether the task failed but the commit succeeded, and then decides whether data duplication happened.

viirya (Member) commented on May 13, 2024:

I got a clearer picture after reading the change in #36564 and the code of OutputCommitCoordinator. The PR description of #36564 is not actually accurate and can cause confusion if you don't read the actual code in OutputCommitCoordinator.

See https://github.com/apache/spark/pull/36564/files#r1598660630

viirya (Member) commented on May 13, 2024:

> The task got the driver's permission to commit, but the commit failed and thus the task failed. This triggers a stage failure because it implies possible data duplication.

This is more accurate. The OutputCommitCoordinator only knows that a commit was authorized (the task got the driver's permission), but in #36565 that is not distinguished from a commit success. It would be better to update the reason string in OutputCommitCoordinator, although that is unrelated to this PR.

```diff
@@ -276,8 +276,14 @@ class HadoopMapReduceCommitProtocol(
   override def commitTask(taskContext: TaskAttemptContext): TaskCommitMessage = {
     val attemptId = taskContext.getTaskAttemptID
     logTrace(s"Commit task ${attemptId}")
+    val disableCommitCoordination =
```

Review comment (Member) on `val disableCommitCoordination =`, with a suggested change:

```diff
-    val disableCommitCoordination =
+    val disableCommitCoordinationInTest =
```

A second review comment (Member): Also, let's add a comment about why we need to disable it in tests.

```diff
@@ -42,7 +42,8 @@ object SparkHadoopMapRedUtil extends Logging {
       committer: MapReduceOutputCommitter,
       mrTaskContext: MapReduceTaskAttemptContext,
       jobId: Int,
-      splitId: Int): Unit = {
+      splitId: Int,
+      disableCommitCoordination: Boolean): Unit = {
```
Review comment (Member): nit: shall we rename this to disableCommitCoordinationInTest as well?
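
For context, here is a condensed sketch of how SparkHadoopMapRedUtil.commitTask could consume the new parameter. It is an illustration of the idea, not the exact PR diff, and it assumes the code lives inside Spark's own packages (OutputCommitCoordinator and CommitDeniedException are private[spark]):

```scala
import org.apache.hadoop.mapreduce.{OutputCommitter, TaskAttemptContext}
import org.apache.spark.{SparkEnv, TaskContext}
import org.apache.spark.executor.CommitDeniedException

object CommitSketch {
  // Sketch only: condensed from SparkHadoopMapRedUtil.commitTask, with
  // logging and error handling omitted; the flag plumbing is illustrative.
  def commitTask(
      committer: OutputCommitter,
      ctx: TaskAttemptContext,
      jobId: Int,
      splitId: Int,
      disableCommitCoordination: Boolean): Unit = {
    if (committer.needsTaskCommit(ctx)) {
      // Coordinate with the driver unless the caller opted out for this
      // query, or the global escape-hatch config turned coordination off.
      val coordinate = !disableCommitCoordination &&
        SparkEnv.get.conf.getBoolean("spark.hadoop.outputCommitCoordination.enabled", true)
      if (coordinate) {
        val tc = TaskContext.get()
        val canCommit = SparkEnv.get.outputCommitCoordinator.canCommit(
          tc.stageId(), tc.stageAttemptNumber(), splitId, tc.attemptNumber())
        if (canCommit) {
          committer.commitTask(ctx)
        } else {
          // The driver denied the commit; abort so another attempt can win.
          committer.abortTask(ctx)
          throw new CommitDeniedException(
            "driver did not authorize commit", tc.stageId(), splitId, tc.attemptNumber())
        }
      } else {
        // No coordination: commit directly, leaving a single failure path.
        committer.commitTask(ctx)
      }
    }
  }
}
```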

gengliangwang (Member) commented:

@cloud-fan I created #46562 for this, which avoids changing the production code to fix the flaky test.

cloud-fan (Contributor, Author) commented:

Closing in favor of #46562.

cloud-fan closed this on May 13, 2024.