
[SPARK-5345][DEPLOY] Fix unstable test case in FsHistoryProviderSuite #4133

Closed
wants to merge 1 commit

Conversation

sarutak
Member

@sarutak sarutak commented Jan 21, 2015

In FsHistoryProviderSuite, the test "Parse new and old application logs" sometimes fails and sometimes succeeds. It's unstable.

@SparkQA

SparkQA commented Jan 21, 2015

Test build #25880 has finished for PR 4133 at commit 77678fe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jan 21, 2015

spark.testing should already be set for all tests. Is that needed? What is the problem, briefly, and what does this fix?

@vanzin
Contributor

vanzin commented Jan 21, 2015

Can you explain how this fixes the problem? You're changing the order of the checks, but that doesn't match what the code does. For example, run this in a Scala shell:

scala> Seq((3L, 2L), (2L, 1L), (-1L, 2L), (-1L, 1L)).sortBy(x => (-x._1, -x._2))
res0: Seq[(Long, Long)] = List((3,2), (2,1), (-1,2), (-1,1))

That emulates what checkForLogs does in the sortBy call (sort first by end time descending, then by start time descending). As you can see, the entry with start time 2 should come before the entry with start time 1.

@vanzin
Contributor

vanzin commented Jan 21, 2015

In fact, your test change uncovered a bug. In FsHistoryProvider.scala, L214:

          if (newIterator.head.endTime > oldIterator.head.endTime) {

This will not merge the lists correctly for apps that have not finished yet. It should do a check similar to the sortBy clause above. Doesn't explain the original flakiness, though.
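For illustration, a comparison consistent with that sortBy clause could look roughly like this (a sketch with hypothetical names, not the actual FsHistoryProvider code):

case class AppInfo(startTime: Long, endTime: Long)

// Order entries by end time descending, then start time descending -- the same
// ordering the sortBy above produces -- so apps that have not finished yet are
// still merged consistently with the listing.
def newerThan(a: AppInfo, b: AppInfo): Boolean = {
  if (a.endTime != b.endTime) a.endTime > b.endTime
  else a.startTime > b.startTime
}

The merge at the line quoted above would then pick from whichever iterator's head is newer, e.g. if (newerThan(newIterator.head, oldIterator.head)) { ... } else { ... }.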

@andrewor14
Contributor

I haven't investigated the test in detail, but could the flakiness have been caused by the check interval? IIRC the history server doesn't actually start checking for logs until one full check interval has elapsed. If we assert before that happens, then obviously we're not gonna find the logs we expect. Could that be the problem?
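If the check interval is the culprit, one sketch of how the assertion could tolerate it is to poll instead of asserting right after the log files are written (this assumes the suite's provider field and a getListing() accessor on it, and is meant to run inside the suite body, which already mixes in Matchers):

import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// Wait for the background scan to pick up the logs before asserting.
// The expected count (2) is just the number of applications in this example.
eventually(timeout(10.seconds), interval(100.millis)) {
  provider.getListing().size should be (2)
}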

@vanzin
Contributor

vanzin commented Jan 26, 2015

The flakiness might be because the polling timer is running in test mode, while the tests expect it not to run. @sarutak's patch seems to solve that, but that exposes a different bug that I mentioned above. I'll take a quick look at the code to see if that theory holds.
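For context, the guard being discussed follows roughly this pattern (a self-contained illustrative sketch, not the actual FsHistoryProvider code):

import java.util.concurrent.{Executors, TimeUnit}

// Sketch only: the periodic log scan is scheduled only when "spark.testing"
// is absent, so if that property is missing while the suite runs, the
// background timer starts and races with the test's own assertions.
object PollingGuardSketch {
  def startIfNotTesting(conf: Map[String, String], intervalMs: Long)(scan: () => Unit): Unit = {
    if (!conf.contains("spark.testing")) {
      val pool = Executors.newScheduledThreadPool(1)
      val task = new Runnable { override def run(): Unit = scan() }
      pool.scheduleAtFixedRate(task, 0, intervalMs, TimeUnit.MILLISECONDS)
    }
  }
}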

@@ -43,7 +43,7 @@ class FsHistoryProviderSuite extends FunSuite with BeforeAndAfter with Matchers
     testDir = Utils.createTempDir()
     provider = new FsHistoryProvider(new SparkConf()
       .set("spark.history.fs.logDirectory", testDir.getAbsolutePath())
-      .set("spark.history.fs.updateInterval", "0"))
+      .set("spark.testing", "true"))
Contributor
Hmm... as Sean mentioned, this should already be defined here. Can you double check that it's really not set, and if not, what's causing it?

@vanzin
Contributor

vanzin commented Jan 30, 2015

Hi @sarutak ,

Check out https://github.com/vanzin/spark/tree/SPARK-5345. That fixes the sort problem and cleans up the code a bit. It doesn't explain how the thread might be started during testing, though. But that doesn't seem to be happening with this patch (I even added an exception - not in the patch - to check that's the case, and it wasn't triggered).

@vanzin
Contributor

vanzin commented Jan 30, 2015

I hit the exception I added for testing when running all tests, but never when running the test in isolation. It looks like some test is clearing the system properties (or that particular one), which is bad for other reasons and which I don't think should be worked around here.
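One way to keep a suite from leaking that kind of change (a hypothetical helper, not something this PR adds):

import java.util.Properties
import org.scalatest.{BeforeAndAfterEach, Suite}

// Snapshot the system properties before each test and restore them afterwards,
// so a suite cannot clear a property such as "spark.testing" for the suites
// that run after it.
trait PreservesSystemProperties extends BeforeAndAfterEach { this: Suite =>
  private var saved: Properties = _

  override def beforeEach(): Unit = {
    saved = new Properties()
    saved.putAll(System.getProperties)
    super.beforeEach()
  }

  override def afterEach(): Unit = {
    try super.afterEach() finally System.setProperties(saved)
  }
}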

@vanzin
Contributor

vanzin commented Jan 30, 2015

BTW it's very likely this is caused by the issue described in #4220 (comment).

@JoshRosen
Contributor

Jenkins, retest this please.

@JoshRosen
Contributor

I think we've observed this test's flakiness even after fixing #4220, so we should continue investigating it.

@SparkQA

SparkQA commented Feb 3, 2015

Test build #26640 has finished for PR 4133 at commit 77678fe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Feb 3, 2015

Even if the flakiness hasn't been fixed yet, this patch is still not correct. The test itself is correct; the FsHistoryProvider implementation has a bug. See the patch I linked to above (https://github.com/vanzin/spark/tree/SPARK-5345).

@srowen
Member

srowen commented Feb 6, 2015

Shall we close this PR in favor of #4370?

@vanzin
Contributor

vanzin commented Feb 6, 2015

This PR cannot be pushed as is, since it makes the test exercise the buggy sort behavior instead of correcting it. Still, if these tests keep failing after #4220, there's something left to be done for the bug.

@JoshRosen
Contributor

I think that we should close this for the time being, since we can always re-open if we see the bug again.

@sarutak
Member Author

sarutak commented Feb 8, 2015

OK, I'll close this PR for now.

@sarutak sarutak closed this Feb 8, 2015
@sarutak sarutak deleted the SPARK-5345 branch April 11, 2015 05:24