[SPARK-23408][SS] Synchronize successive AddData actions in Streaming*JoinSuite #20650
Conversation
@@ -543,6 +543,15 @@ abstract class StreamExecution(
    Option(name).map(_ + "<br/>").getOrElse("") +
      s"id = $id<br/>runId = $runId<br/>batch = $batchDescription"
  }
+  private[sql] def withProgressLocked(f: => Unit): Unit = {

Review comment: TODO: Add docs.
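A minimal, self-contained sketch of what a progress lock like this can look like. `FakeStreamExecution`, the `awaitProgressLock` field, and `tryStartBatch` are illustrative stand-ins, not the actual Spark internals: the idea is that `withProgressLocked` holds the same lock the microbatch thread must take before starting a batch.

```scala
import java.util.concurrent.locks.ReentrantLock

// Hypothetical, simplified stand-in for StreamExecution's progress lock.
// While `withProgressLocked` holds the lock, the (simulated) microbatch
// thread cannot begin a new batch.
class FakeStreamExecution {
  private val awaitProgressLock = new ReentrantLock()

  // Runs `f` while the streaming query is blocked from making progress.
  def withProgressLocked(f: => Unit): Unit = {
    awaitProgressLock.lock()
    try f finally awaitProgressLock.unlock()
  }

  // The batch loop would take the same lock before starting a batch.
  def tryStartBatch(): Boolean = {
    if (awaitProgressLock.tryLock()) {
      awaitProgressLock.unlock()
      true
    } else false
  }
}

object ProgressLockDemo {
  def main(args: Array[String]): Unit = {
    val exec = new FakeStreamExecution
    var startedWhileLocked = true
    exec.withProgressLocked {
      // Probe from another thread: ReentrantLock is reentrant, so the
      // owning thread itself would be able to re-acquire it.
      val t = new Thread(new Runnable {
        def run(): Unit = { startedWhileLocked = exec.tryStartBatch() }
      })
      t.start(); t.join()
    }
    assert(!startedWhileLocked) // no batch can start while the lock is held
    assert(exec.tryStartBatch()) // the lock is released afterwards
  }
}
```

The `try/finally` shape matters: actions passed to `withProgressLocked` may throw, and the query must not stay blocked if they do.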
@@ -102,6 +102,14 @@ trait StreamTest extends QueryTest with SharedSQLContext with TimeLimits with Be
    AddDataMemory(source, data)
  }
+  object MultiAddData {

Review comment: TODO: add docs.
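As an illustration of the shape of this helper, here is a toy model. The types are simplified stand-ins (plain strings instead of `MemoryStream` sources, `Seq[Any]` instead of `Seq[StreamAction]`); only the curried-apply structure mirrors the PR:

```scala
// Toy stand-ins for StreamTest's DSL types (hypothetical simplification).
case class AddDataMemory[A](source: String, data: Seq[A])
case class StreamProgressLockedActions(actions: Seq[Any], desc: String = null)

// MultiAddData(src1, data1...)(src2, data2...) bundles two AddData actions
// into one action that the harness runs under the query's progress lock.
object MultiAddData {
  def apply[A](source1: String, data1: A*)(source2: String, data2: A*): StreamProgressLockedActions = {
    val actions = Seq(AddDataMemory(source1, data1), AddDataMemory(source2, data2))
    StreamProgressLockedActions(actions, desc = actions.mkString("[ ", " | ", " ]"))
  }
}

object MultiAddDataDemo {
  def main(args: Array[String]): Unit = {
    val a = MultiAddData("left", 1, 2, 3)("right", 3, 4, 5)
    assert(a.actions.length == 2) // both AddData actions travel together
    assert(a.desc != null)
  }
}
```

Because `MultiAddData` is just an alias over the locked-actions primitive, existing single-source `AddData` tests are untouched.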
@@ -217,6 +225,14 @@ trait StreamTest extends QueryTest with SharedSQLContext with TimeLimits with Be
    s"ExpectFailure[${causeClass.getName}, isFatalError: $isFatalError]"
  }
+  case class StreamProgressLockedActions(actions: Seq[StreamAction], desc: String = null)

Review comment: TODO: add docs.
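Conceptually, the harness handles this action by running every nested action while holding the progress lock. A sketch with hypothetical names (actions reduced to strings; the real harness dispatches `StreamAction` values):

```scala
import java.util.concurrent.locks.ReentrantLock
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch of the harness side: all nested actions execute
// under one lock hold, so the next batch observes their effects together.
object LockedActionsDemo {
  private val progressLock = new ReentrantLock()
  private val executed = ArrayBuffer[String]()

  private def executeAction(action: String): Unit = executed += action

  def executeStreamProgressLocked(actions: Seq[String]): Unit = {
    progressLock.lock()
    try actions.foreach(executeAction)
    finally progressLock.unlock()
  }

  def main(args: Array[String]): Unit = {
    executeStreamProgressLocked(Seq("AddData(left, 1, 2)", "AddData(right, 3, 4)"))
    // Both actions ran in order, within a single lock acquisition.
    assert(executed.toList == List("AddData(left, 1, 2)", "AddData(right, 3, 4)"))
  }
}
```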
Test build #87585 has finished for PR 20650 at commit
I'm not sure I agree with all the comments on the previous PR, but I agree that this also works. As discussed, the downside to this approach is that people in the future can continue to write the same kind of flaky tests this PR fixes. Ideally I'd like to see some kind of story for how people will know they must use MultiAddData.
Yes, this is indeed a slight downside. The only time people should choose to use it is when they want to add data to multiple sources that must be visible in the same batch. In our case, we need to add data to multiple sources in the same batch because we want to verify the number of state rows changed.
@zsxwing can you also take a look?
All the failures above can be attributed to other flakiness unrelated to the flakiness this PR is trying to address.
LGTM |
lgtm
The streaming join suites in Spark 2.3 are pretty flaky. Do you think we can backport this to 2.3?
+1 for @gatorsmile's opinion. It would be very helpful if we can.
+1 too, the tests are too flaky.
Any luck with this backport? Branch 2.3 is still very flaky (see #23450).
…*JoinSuite

**The best way to review this PR is to ignore whitespace/indent changes. Use this link - https://github.com/apache/spark/pull/20650/files?w=1**

The stream-stream join tests add data to multiple sources and expect it all to show up in the next batch. But there's a race condition; the new batch might trigger when only one of the AddData actions has been reached. A prior attempt to solve this issue by jose-torres in apache#20646 synchronized on all memory sources together whenever consecutive AddData actions were found. However, this carries the risk of deadlock as well as unintended modification of stress tests (see the above PR for a detailed explanation). Instead, this PR does the following.

- A new action called `StreamProgressBlockedActions` allows multiple actions to be executed while the streaming query is blocked from making progress. This allows data to be added to multiple sources that are made visible simultaneously in the next batch.
- An alias of `StreamProgressBlockedActions` called `MultiAddData` is explicitly used in the `Streaming*JoinSuites` to add data to two memory sources simultaneously. This should avoid unintentional modification of the stress tests (or any other test for that matter) while making sure that the flaky tests are deterministic.

Modified test cases in `Streaming*JoinSuites` where there are consecutive `AddData` actions.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes apache#20650 from tdas/SPARK-23408.

NOTE: Modified a bit by Jungtaek Lim <kabhwan@gmail.com> to cover a DSv2 incompatibility between Spark 2.3 and 2.4:

* StreamingDataSourceV2Relation is a class for 2.3, whereas it is a case class for 2.4
…in Streaming*JoinSuite

## What changes were proposed in this pull request?

**The best way to review this PR is to ignore whitespace/indent changes. Use this link - https://github.com/apache/spark/pull/20650/files?w=1**

The stream-stream join tests add data to multiple sources and expect it all to show up in the next batch. But there's a race condition; the new batch might trigger when only one of the AddData actions has been reached. A prior attempt to solve this issue by jose-torres in #20646 synchronized on all memory sources together whenever consecutive AddData actions were found. However, this carries the risk of deadlock as well as unintended modification of stress tests (see the above PR for a detailed explanation). Instead, this PR does the following.

- A new action called `StreamProgressBlockedActions` allows multiple actions to be executed while the streaming query is blocked from making progress. This allows data to be added to multiple sources that are made visible simultaneously in the next batch.
- An alias of `StreamProgressBlockedActions` called `MultiAddData` is explicitly used in the `Streaming*JoinSuites` to add data to two memory sources simultaneously. This should avoid unintentional modification of the stress tests (or any other test for that matter) while making sure that the flaky tests are deterministic.

NOTE: This patch is modified a bit from the original PR (#20650) to cover a DSv2 incompatibility between Spark 2.3 and 2.4: StreamingDataSourceV2Relation is a class for 2.3, whereas it is a case class for 2.4.

## How was this patch tested?

Modified test cases in `Streaming*JoinSuites` where there are consecutive `AddData` actions.

Closes #23757 from HeartSaVioR/fix-streaming-join-test-flakiness-branch-2.3.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Co-authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
The best way to review this PR is to ignore whitespace/indent changes. Use this link - https://github.com/apache/spark/pull/20650/files?w=1
What changes were proposed in this pull request?
The stream-stream join tests add data to multiple sources and expect it all to show up in the next batch. But there's a race condition; the new batch might trigger when only one of the AddData actions has been reached.
A prior attempt to solve this issue by @jose-torres in #20646 synchronized on all memory sources together whenever consecutive AddData actions were found. However, this carries the risk of deadlock as well as unintended modification of stress tests (see that PR for a detailed explanation). Instead, this PR does the following.

- A new action called `StreamProgressBlockedActions` allows multiple actions to be executed while the streaming query is blocked from making progress. This allows data to be added to multiple sources that are made visible simultaneously in the next batch.
- An alias of `StreamProgressBlockedActions` called `MultiAddData` is explicitly used in the `Streaming*JoinSuites` to add data to two memory sources simultaneously.

This should avoid unintentional modification of the stress tests (or any other test for that matter) while making sure that the flaky tests are deterministic.
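The race being fixed, and why the lock removes it, can be shown with a toy model. This is illustrative only (plain buffers instead of memory sources, a tuple snapshot instead of a real batch): a "batch" snapshots whatever both sources currently hold, so adding to both sources under one lock hold guarantees the next batch sees both additions together, never just one.

```scala
import java.util.concurrent.locks.ReentrantLock
import scala.collection.mutable.ArrayBuffer

// Toy model of the race (illustrative names, not Spark code).
object BatchVisibilityDemo {
  private val progressLock = new ReentrantLock()
  private val left = ArrayBuffer[Int]()
  private val right = ArrayBuffer[Int]()

  // The microbatch thread takes the lock before planning a batch,
  // then snapshots both sources atomically.
  def runBatch(): (List[Int], List[Int]) = {
    progressLock.lock()
    try (left.toList, right.toList) finally progressLock.unlock()
  }

  // MultiAddData-style update: both sources change under one lock hold,
  // so no batch can observe only the first addition.
  def multiAddData(l: Seq[Int], r: Seq[Int]): Unit = {
    progressLock.lock()
    try { left ++= l; right ++= r } finally progressLock.unlock()
  }

  def main(args: Array[String]): Unit = {
    multiAddData(Seq(1, 2, 3), Seq(3, 4, 5))
    val (lSeen, rSeen) = runBatch()
    assert(lSeen == List(1, 2, 3) && rSeen == List(3, 4, 5))
  }
}
```

With two separate `AddData` actions and no lock, a batch could run between the two additions and snapshot `left` populated but `right` empty, which is exactly the nondeterminism the suites were hitting.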
How was this patch tested?
Modified test cases in `Streaming*JoinSuites` where there are consecutive `AddData` actions.