[SPARK-26275][PYTHON][ML] Increases timeout for StreamingLogisticRegressionWithSGDTests.test_training_and_prediction test #23236

Closed
Conversation

HyukjinKwon
Member

What changes were proposed in this pull request?

Looks like this test is flaky:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console

```
======================================================================
FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
Test that the model improves on toy data with no. of batches
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 367, in test_training_and_prediction
    self._eventually(condition)
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 78, in _eventually
    % (timeout, lastValue))
AssertionError: Test failed due to timeout after 30 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74

----------------------------------------------------------------------
Ran 13 tests in 185.051s

FAILED (failures=1, skipped=1)
```

This seems to have started happening after the Jenkins parallelism was increased to speed up builds in #23111. I am able to reproduce it manually when resource usage is heavy (by manually decreasing the timeout).
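
For context, the timeout in question lives in a retry helper that keeps polling a condition until it returns True or a deadline passes. Below is a minimal, hypothetical sketch of that pattern; the real `_eventually` in test_streaming_algorithms.py may differ, e.g. in its polling interval and extra arguments.

```python
import time

def _eventually(condition, timeout=30.0):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    On failure, the last value returned by `condition` (here, the latest
    error rates) is included in the assertion message.
    """
    start = time.time()
    last_value = None
    while time.time() - start < timeout:
        last_value = condition()
        if last_value is True:
            return
        time.sleep(0.01)
    raise AssertionError(
        "Test failed due to timeout after %g sec, with last condition "
        "returning: %s" % (timeout, last_value))
```

This PR simply raises the timeout used for this test; the exact new value is in the diff.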

How was this patch tested?

Manually tested by

```
cd python
./run-tests --testnames 'pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction' --python-executables=python
```

@HyukjinKwon
Member Author

cc @BryanCutler and @viirya

@HyukjinKwon
Member Author

retest this please

@HyukjinKwon
Member Author

HyukjinKwon commented Dec 5, 2018

(cc @squito as well since it's a bit related to #23111)

@SparkQA

SparkQA commented Dec 5, 2018

Test build #99721 has finished for PR 23236 at commit 3c4ee75.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 5, 2018

Test build #99724 has finished for PR 23236 at commit 3c4ee75.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 5, 2018

Test build #99726 has finished for PR 23236 at commit 3c4ee75.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 5, 2018

Test build #99727 has finished for PR 23236 at commit 3c4ee75.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 5, 2018

Test build #99729 has finished for PR 23236 at commit 3c4ee75.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor

squito commented Dec 5, 2018

lgtm

@BryanCutler
Member

Seems ok to me, but there are a few silly things with this test that, if fixed, might also help:

  • why is the stepSize so low at 0.01? I think it would be fine at 0.1, and even a conservative 0.03 would still help it converge a little faster

  • why is it comparing the current error with the second error value (errors[1] - errors[-1])? Maybe because errors[1] is bigger, but it should just use errors[0] and make the values more sensible, e.g. check that the improvement is > 0.2

  • having 2 if statements to check basically the same condition is a bit redundant; just assert len(errors) < len(predict_batches) before the last return

  • nit: true_predicted = [] on line 350 isn't used

If you don't feel comfortable changing any of these, just increasing the timeout is probably fine (see the sketch below for roughly what these suggestions amount to).
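
For reference, here is a hedged sketch of roughly what those suggestions could look like; this is not the actual code in test_streaming_algorithms.py, and `errors` / `predict_batches` are stand-ins for the state the real test maintains.

```python
from pyspark.mllib.classification import StreamingLogisticRegressionWithSGD

# Hypothetical stand-ins for the lists the real test keeps.
errors = []           # error rate observed after each prediction batch
predict_batches = []  # the queued prediction batches

# 1. A larger step size than 0.01 should converge a bit faster.
model = StreamingLogisticRegressionWithSGD(stepSize=0.1)

# 2. Compare against the first error value with a simple improvement
#    threshold, and keep a single sanity check instead of two similar ifs.
def condition():
    if len(errors) > 1 and errors[0] - errors[-1] > 0.2:
        return True
    assert len(errors) < len(predict_batches)
    return "Latest errors: " + ", ".join(map(str, errors))
```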

@HyukjinKwon
Member Author

Thanks for the close look, @BryanCutler. I just increased the timeout because I wanted to keep the test as is while fixing the flakiness. The test itself doesn't seem to take that long, but it looks like it can be flaky when resource usage is heavy. If you feel this can be resolved in other ways, would you mind taking a look at it? I'd be happy to close this one. Actually, I am not that familiar with the ML tests.

@BryanCutler
Member

True, the test is not that long under light resource usage. Locally, I saw a difference of a couple of seconds with the changes I mentioned. The weird thing is that the unmodified test completes after the 11th batch with these errors:

['0.67', '0.71', '0.78', '0.7', '0.75', '0.74', '0.73', '0.69', '0.62', '0.71', '0.31']

Compared to the error values from the test failures above, they match up through the 10th batch, but the failing runs then continue until the 16th batch, where the test times out:

0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74

I would expect the seed to produce the same values (or all different values if the random function differs), which makes me think something else is going on.

Anyway, I think it's fine to increase the timeout, and if it's still flaky, we can look at making changes.

@viirya
Member

viirya commented Dec 6, 2018

I agree with @BryanCutler's analysis; a few things about this test do look a bit weird. I also think it is fine to increase the timeout for now.

@HyukjinKwon
Member Author

Thanks, guys. I will try to take a look at it as well (although it will probably take a relatively long while). Let me get this in for now anyway.

@HyukjinKwon
Member Author

Merged to master.

@asfgit closed this in ab76900 on Dec 6, 2018
@viirya
Member

viirya commented Dec 6, 2018

Compared to the error values from the test failures above, they match up through the 10th batch, but the failing runs then continue until the 16th batch, where the test times out.

I suspect that might be because, when resource usage is heavy, StreamingLogisticRegressionWithSGD's training on the input batch stream can't always keep up with the prediction batch stream, so the model hasn't reached the expected improvement in error yet.
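
To make that concrete, here is a hedged, illustrative sketch (not the actual test code) of the two-stream setup: the model is updated from one DStream via trainOn while errors are measured on a separate DStream via predictOnValues, so under heavy load predictions can run against a model that has not caught up yet.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.classification import StreamingLogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext("local[2]", "streaming-lr-sketch")
ssc = StreamingContext(sc, batchDuration=1)

# Two independent queues of batches: one for training, one for prediction.
train_batches = [sc.parallelize(
    [LabeledPoint(1.0, [1.5]), LabeledPoint(0.0, [-1.5])])]
predict_batches = [sc.parallelize([(1.0, [1.2]), (0.0, [-1.2])])]

model = StreamingLogisticRegressionWithSGD(stepSize=0.1)
model.setInitialWeights([0.0])

# Model updates come from the training stream...
model.trainOn(ssc.queueStream(train_batches))
# ...while error is measured on a separate prediction stream; if the
# training batches are processed slowly, predictions use a stale model.
predictions = model.predictOnValues(ssc.queueStream(predict_batches))
predictions.pprint()

ssc.start()
ssc.awaitTerminationOrTimeout(5)
ssc.stop(stopSparkContext=True, stopGraceFully=True)
```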

@BryanCutler
Member

I suspect that might be because, when resource usage is heavy, StreamingLogisticRegressionWithSGD's training on the input batch stream can't always keep up with the prediction batch stream, so the model hasn't reached the expected improvement in error yet.

Yeah, that sounds likely, and the timeout increase should help with that.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
@HyukjinKwon deleted the SPARK-26275 branch on March 3, 2020