-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tests][dask] fix hangs after exception in worker #4772
Conversation
With added
But I guess it means that this workaround cannot be treated as reliable enough and it brings flakiness into our tests. |
I agree, it seems like this doesn't totally fix the issue 😕 we'll have to keep investigating in #4771 |
Seems like it should work, I don't see what could be hanging with fresh workers. I can maybe try increasing the timeout in the restart or add a sleep after it. What do you guys think? |
Sure, try it! We'd probably learn something from that exercise either way. Feel free to use this PR to test strategies. One trick I like to use, when I need to use CI to test, is to have a single commit that removes all the other CI jobs that aren't relevant to what I'm working on, and then bring those CI configs back in later with So for you, that might be like git rm .appveyor.yaml
git rm .github/workflows/r_*.yaml
git add .
git commit -m "remove unrelated CI configs" |
Wow this job really broke |
Closing this as |
whoa! Ok thanks for investigating. We'll have to try to find another approach. |
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
As described in #4771, with
dask==2021.10.0
anddistributed==2021.10.0
when a worker encounters an exception in a test as happens inLightGBM/tests/python_package_test/test_dask.py
Lines 1309 to 1310 in 2394b41
the following tests hang indefinitely.
This adds a
client.restart
to restart the workers and get rid of the raised exception, since as of #4159 all of LightGBM's Dask unit tests share the same cluster.