
[SPARK-5345][DEPLOY] Fix unstable test case in FsHistoryProviderSuite #4133

Closed
wants to merge 1 commit

Conversation

sarutak
Member

@sarutak sarutak commented Jan 21, 2015

In FsHistoryProviderSuite, the test "Parse new and old application logs" sometimes fails and sometimes succeeds. It's unstable.

@SparkQA

SparkQA commented Jan 21, 2015

Test build #25880 has finished for PR 4133 at commit 77678fe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jan 21, 2015

spark.testing should already be set for all tests. Is that needed? What is the problem, briefly, and what does this fix?

@vanzin
Contributor

vanzin commented Jan 21, 2015

Can you explain how this fixes the problem? You're changing the order of the checks, but that doesn't match what the code does. For example, run this in a Scala shell:

scala> Seq((3L, 2L), (2L, 1L), (-1L, 2L), (-1L, 1L)).sortBy(x => (-x._1, -x._2))
res0: Seq[(Long, Long)] = List((3,2), (2,1), (-1,2), (-1,1))

That emulates what checkForLogs does in the sortBy call (sort first by end time descending, then by start time descending). As you can see, the entry with start time 2 should come before the entry with start time 1.

@vanzin
Contributor

vanzin commented Jan 21, 2015

In fact, your test change uncovered a bug. In FsHistoryProvider.scala, L214:

          if (newIterator.head.endTime > oldIterator.head.endTime) {

This will not merge the lists correctly for apps that have not finished yet. It should do a check similar to the sortBy clause above. Doesn't explain the original flakiness, though.
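For illustration, a comparison consistent with that sortBy clause could look roughly like this (a sketch with hypothetical names, not the actual FsHistoryProvider code):

case class AppInfo(startTime: Long, endTime: Long)

// Order entries by end time descending, then start time descending -- the same
// ordering the sortBy above produces -- so apps that have not finished yet are
// still merged consistently with the listing.
def newerThan(a: AppInfo, b: AppInfo): Boolean = {
  if (a.endTime != b.endTime) a.endTime > b.endTime
  else a.startTime > b.startTime
}

The merge at the line quoted above would then pick from whichever iterator's head is newer, e.g. if (newerThan(newIterator.head, oldIterator.head)) { ... } else { ... }.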

@andrewor14
Contributor

I haven't investigated the test in detail, but could the flakiness have been caused by the check interval? IIRC the history server doesn't actually start checking for logs until one full check interval has elapsed. If we assert before that happens, then obviously we're not gonna find the logs we expect. Could that be the problem?
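If the check interval is the culprit, one sketch of how the assertion could tolerate it is to poll instead of asserting right after the log files are written (this assumes the suite's provider field and a getListing() accessor on it, and is meant to run inside the suite body, which already mixes in Matchers):

import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// Wait for the background scan to pick up the logs before asserting.
// The expected count (2) is just the number of applications in this example.
eventually(timeout(10.seconds), interval(100.millis)) {
  provider.getListing().size should be (2)
}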

@vanzin
Contributor

vanzin commented Jan 26, 2015

The flakiness might be because the polling timer is running in test mode, while the tests expect it not to run. @sarutak's patch seems to solve that, but that exposes a different bug that I mentioned above. I'll take a quick look at the code to see if that theory holds.
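For context, the guard being discussed follows roughly this pattern (a self-contained illustrative sketch, not the actual FsHistoryProvider code):

import java.util.concurrent.{Executors, TimeUnit}

// Sketch only: the periodic log scan is scheduled only when "spark.testing"
// is absent, so if that property is missing while the suite runs, the
// background timer starts and races with the test's own assertions.
object PollingGuardSketch {
  def startIfNotTesting(conf: Map[String, String], intervalMs: Long)(scan: () => Unit): Unit = {
    if (!conf.contains("spark.testing")) {
      val pool = Executors.newScheduledThreadPool(1)
      val task = new Runnable { override def run(): Unit = scan() }
      pool.scheduleAtFixedRate(task, 0, intervalMs, TimeUnit.MILLISECONDS)
    }
  }
}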

@@ -43,7 +43,7 @@ class FsHistoryProviderSuite extends FunSuite with BeforeAndAfter with Matchers
     testDir = Utils.createTempDir()
     provider = new FsHistoryProvider(new SparkConf()
       .set("spark.history.fs.logDirectory", testDir.getAbsolutePath())
-      .set("spark.history.fs.updateInterval", "0"))
+      .set("spark.testing", "true"))
Contributor
Hmm... as Sean mentioned, this should already be defined here. Can you double check that it's really not set, and if not, what's causing it?

@vanzin
Contributor

vanzin commented Jan 30, 2015

Hi @sarutak ,

Check out https://github.com/vanzin/spark/tree/SPARK-5345. That fixes the sort problem and cleans up the code a bit. It doesn't explain how the thread might be started during testing, though. But that doesn't seem to be happening with this patch (I even added an exception - not in the patch - to check that's the case, and it wasn't triggered).

@vanzin
Contributor

vanzin commented Jan 30, 2015

I hit the exception I added for testing when running all tests, but never when running the test in isolation. It looks like some test is clearing the system properties (or that particular one), which is bad for other reasons and which I don't think should be worked around here.
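One way to keep a suite from leaking that kind of change (a hypothetical helper, not something this PR adds):

import java.util.Properties
import org.scalatest.{BeforeAndAfterEach, Suite}

// Snapshot the system properties before each test and restore them afterwards,
// so a suite cannot clear a property such as "spark.testing" for the suites
// that run after it.
trait PreservesSystemProperties extends BeforeAndAfterEach { this: Suite =>
  private var saved: Properties = _

  override def beforeEach(): Unit = {
    saved = new Properties()
    saved.putAll(System.getProperties)
    super.beforeEach()
  }

  override def afterEach(): Unit = {
    try super.afterEach() finally System.setProperties(saved)
  }
}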

@vanzin
Contributor

vanzin commented Jan 30, 2015

BTW it's very likely this is caused by the issue described in #4220 (comment).

@JoshRosen
Contributor

Jenkins, retest this please.

@JoshRosen
Contributor

I think we've observed this test's flakiness even after fixing #4220, so we should continue investigating it.

@SparkQA

SparkQA commented Feb 3, 2015

Test build #26640 has finished for PR 4133 at commit 77678fe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Feb 3, 2015

Even if the flakiness hasn't been fixed yet, this patch is still not correct. The test itself is correct; the FsHistoryProvider implementation has a bug. See the patch I linked to above (https://github.com/vanzin/spark/tree/SPARK-5345).

@srowen
Member

srowen commented Feb 6, 2015

Shall we close this PR in favor of #4370?

@vanzin
Contributor

vanzin commented Feb 6, 2015

This PR cannot be pushed as is, since it makes the test exercise the buggy sort behavior instead of correcting it. Still, if these tests keep failing after #4220, there's something left to be done for the bug.

@JoshRosen
Contributor

I think that we should close this for the time being, since we can always re-open if we see the bug again.

@sarutak
Member Author

sarutak commented Feb 8, 2015

OK, I'll close this PR for now.

@sarutak sarutak closed this Feb 8, 2015
@sarutak sarutak deleted the SPARK-5345 branch April 11, 2015 05:24