-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[dask] Random failures in Dask tests during teardown #3829
Comments
To avoid merge conflicts with #3855, this is the next Dask issue I'm picking up. Will try some strategies on a PR on my fork. |
I've run the All of them succeeded.
I can't use the Since everything in the job is dockerized, and since I just saw 40 consecutive successes, my theories about the source of flakiness are now related to the differences between the Microsoft-hosted runners and the
@StrikerRUS or @guolinke , are you able to share the specs for the |
I'm testing on #3869 , but want to add here that I have another idea for helping with this. We use |
I saw |
Ok, I can try again with If that still doesn't reveal anything, we can try changing Python version. If you notice the error from this issue in other builds, please leave a commit here with the TASK, operating system, and Python version. I'll do the same. But I just ran 50 of them (25 Linux, 25 Linux_latest) consecutively on |
just saw this timeout error in a build from #3866 https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=8823&view=logs&j=fb919173-48b9-522d-5342-9a59a13eb10b&t=7fcbeee5-d4c1-571c-8072-1d5874591254, on the I am almost done with testing on #3869, and will share some information here soon about how often this seems to happen and how we might fix it. |
Yeah, those errors are not very frequent, so it may require more runs and time to prepare any statistics. |
Alright, so! On #3869 , I've run 300 tests of the Python CI jobs. I did this 10 at a time, in batches with 5 running on 296 / 300 of these succeeded, and all 4 failures were the timeout error described in this issue. See the description of #3869 for build logs. All 4 failures were on
Proposed SolutionI've decided against the idea of using the My plan next is to try just setting a much higher timeout (like 5 minutes) in each of the My plan is to make this change and then test it with another 300 builds. This time, I'll split them up between only |
@jameslamb Does adding |
Yes, that's the absolute worst-case scenario. But to be clear, it doesn't mean "wait 5 minutes, then exit". It means "wait until the client has been closed successfully OR 5 minutes have elapsed". I think that worst-case scenario is very very unlikely, given the results above where only 4 out of 300 jobs failed. And by the way, I realized that two of those failures came from |
Maybe we can go in a more naive and resource-saving way? |
It's fine with me. I just anticipated that you would want more proof before a change like that. I'm happy to stop testing right now and just make a PR that sets the timeout to 60. |
No, I'm totally fine having a place where we can simply increment a value and consequence of such actions will lead us to eliminating the error! BTW, what is default value for |
Sorry, I shouldn't have made assumptions 😬
It's hard to nail down, because it depends on your specific version of from dask.distributed import Client
c = Client()
print(c._timeout)
# 10 For me, this shows the timeout on my default When I added a |
Wow, how complicated!
|
The timeout I'm proposing that we change is not related to how long it takes the rest of the test to run. It specifically controls how long it takes to close down the cluster with Like I mentioned above, we're talking about an absolute-worst-case addition of 1 minute per test. In my tests on #3869 I saw that only 4 out of 300 builds failed, and on those 4 builds each had only 1 test case in |
OK, got your position. |
@trivialfis or @hcho3 could you help us with something? No need to read all of the discussion above in this issue. I noticed that in https://github.com/dmlc/xgboost/blob/c3c8e66fc9d285c265b07d78fb6866770a65347b/tests/python/test_with_dask.py, you use a mix of three strategies for creating Dask clients:
Could you explain to us what about the tests there made you choose one or the other? I've read https://distributed.dask.org/en/latest/develop.html#writing-tests but I have a feeling you discovered other things that aren't explained there, that could help us to make better decisions in LightGBM's Dask module. Thank you so much! |
Moving #3866 (comment) over here to keep all conversation in this issue. Even with the attempted fix from #3869 , another "dask teardown timeout" failure was just observed on a CI run for #3866 : https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=8861&view=logs&j=c28dceab-947a-5848-c21f-eef3695e5f11&t=fa158246-17e2-53d4-5936-86070edbaacf I see this in the log, which I didn't notice before
That It looks like that value is coming from https://github.com/dask/distributed/blob/1297b18ff09276f6ad1553d40ab3ce77acf0fc0e/distributed/utils_test.py#L728-L736. I'm reopening this issue, and can try to work through other options some time in the near future. Maybe there is a way to influence that specific timeout, or maybe we need to try a totally different approach for providing a Dask cluster in tests, that we have tighter control over. |
Just a remark: I haven't seen this error for a while. |
I haven't seen this in a while either. I'm ok with closing it for now if you'd like. |
Sorry I missed the previous notification. Our tests are not quite stable either, during teardown there could be a number of issues from race between scheduler and workers to the timeout you are seeing. |
I think the synchronization during shutdown still have room for improvement, that's one of the reasons we keep using the Python context in our demos/tests: with LocalCluster() as cluster:
with Client(cluster) as client:
main() Otherwise something like this might result into even weirder error dask/distributed#3081: cluster = LocalCluster()
client = Client(cluster) |
I'll defer the decision to you. |
Since our CI time has been increasing with the additions of
I'd like to get some feedback about using this approach. This only builds one cluster that gets reused by all the tests so the time is spent actually doing work and not waiting for setup/teardown. I already tried running all of the tests with this approach and it takes 218 seconds on my machine, the only failure I saw was in LightGBM/tests/python_package_test/test_dask.py Lines 1145 to 1149 in 1ce4b22
client.restart there does the job.
|
Really interesting, thanks for the research! I would definitely support the change you're proposing! I think it's ok to have all of the tests depend on the same cluster, since we run the tests serially and since that dependency might help us catch problems where LightGBM doesn't clean up after itself. HOWEVER, I think we should avoid merging such a PR for at least a week. We recently merged some stability fixes for So you could submit a pull request whenever you have time, but I'd want to wait at least a week before it's merged. |
Thank you, I agree with that. I actually thought of mentioning in the dask issue that seanlaw's approach should be in those docs. Related to #4099, I actually found something similar when playing around with the single cluster approach. I was using the branch for the voting algorithm, which includes the test that checks that if you try feature parallel you get an error, however it wasn't always raising it. I investigated and found that even though the data had two partitions sometimes they ended up in the same worker and the training basically ignored |
I'm torn on this. So far I've been against using As I type that, I had a thought...I would support a PR that uses
I think you should! It's important that the things we learn here in LightGBM make it back to the broader Dask community. |
I see what you mean. I usually don't persist my collections when using I think another (easier) approach would be just changing the defaults in |
I totally agree with that. I'd strongly support a PR to change the default to 10 or 20 partitions if you want to make one. |
Let's try the "change default partitioning" first (not adding |
I'm running all the tests with |
Ok cool, thank you. These sources of instability are so tough to track down because they're transient, so I'd rather change things slowly and evaluate after each one how often we see the tests failing. |
I think the client context manager on #4159 helped with this issue and I'm +1 on closing it. #4149's purpose was to make sure that we were always doing distributed training and not having some tests train on a single "machine" out of bad luck. I believe that's still a nice thing to have, however that's blocked by what I believe is a bug in classification for distributed training (#4220, #5025). I can close it and come back to it once that's been resolved. |
This issue has been automatically locked since there has not been any recent activity since it was closed. |
Need to make Dask tests less flaky and ideally remove all randomness from them.
Log from one of failed job:
The text was updated successfully, but these errors were encountered: