[ci] [dask] CI jobs failing with Dask 2022.7.1 #5390
Comments
Linking possible source: dask/distributed#6714 |
Great find @jmoralez !!! I just read through that PR, and I totally agree with you that it looks very relevant and is probably the root cause of this. I wonder if the way that the failing test leaves the cluster with an error on one worker results in a situation where the worker can't shut down cleanly. I believe the breaking change in the PR you linked is that client.restart() now waits for workers to come back (and raises if they don't), which affects this part of the test: LightGBM/tests/python_package_test/test_dask.py Lines 1643 to 1652 in f94050a
Based on https://docs.dask.org/en/stable/_modules/distributed/client.html#Client.restart, I guess we could try something like the following to recover the old behavior?

client.restart(wait_for_workers=False)
client.wait_for_workers(n_workers=1)

Or, alternatively, we can try to figure out why raising an error the way that this test intentionally does creates issues for Client.restart(). @jmoralez do you have time to look into this this week? I'd be happy to back off and let you take it if you're interested. @gjoseph92 , also tagging you since you were the author of dask/distributed#6714, and maybe whatever we're doing here in LightGBM is weird/unexpected and might be informative to you as you continue working on that part of distributed. |
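A minimal, self-contained sketch of that suggestion (not LightGBM's actual test code; assumes a recent distributed release in which Client.restart() accepts wait_for_workers):

# sketch only: restore the pre-change behavior by not waiting inside restart() itself
from distributed import Client, LocalCluster

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=1, threads_per_worker=1))
    # ... run work that might leave a worker in a bad state ...
    client.restart(wait_for_workers=False)  # don't block (or raise) inside restart()
    client.wait_for_workers(n_workers=1)    # wait for the worker separately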
Sorry about that. Yeah, I don't understand what you're doing with the socket, but presumably you're doing something that makes the worker block during shutdown.
Regardless, what @jameslamb said is what you'll want to match the previous behavior here:

client.restart(wait_for_workers=False)
client.wait_for_workers(n_workers=1) |
Also, at some point soon-ish, I'm hoping to finish up and merge dask/distributed#6427. This might break your test yet again. Currently, the restart process asks workers to shut down gracefully (via SIGTERM), which a stuck worker can block or ignore.
After that PR: workers that don't shut down promptly will be killed with SIGKILL instead.
The key difference is that maybe what you have right now is managing to block/ignore the SIGTERM, preventing the stuck worker from shutting down. Since you can't block a SIGKILL, after this PR is merged I would expect (and hope) that restart will work even if a worker gets stuck. |
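A toy illustration of that difference (POSIX-only, unrelated to Dask or LightGBM internals): a process can block or ignore SIGTERM, but SIGKILL cannot be caught or ignored.

# illustration only: SIGTERM can be ignored, SIGKILL cannot
import signal
import subprocess
import sys
import time

child = subprocess.Popen([
    sys.executable,
    "-c",
    "import signal, time; signal.signal(signal.SIGTERM, signal.SIG_IGN); time.sleep(60)",
])
time.sleep(1)                                          # let the child install its handler
child.send_signal(signal.SIGTERM)                      # ignored by the child
time.sleep(1)
print("alive after SIGTERM:", child.poll() is None)    # True
child.send_signal(signal.SIGKILL)                      # cannot be blocked
child.wait()
print("return code after SIGKILL:", child.returncode)  # -9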
Thanks very much for that @gjoseph92 ! And no need to apologize...
LightGBM has its own distributed computing setup (e.g. to perform collective operations across multiple processes) that communicates directly over sockets between the worker processes, outside of Dask. This test checks that the expected error is raised in the situation where someone specifies a port that is already in use at the time that training starts.
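A rough sketch of the idea behind that test (an assumption about the approach, not the actual test code): occupy a port before training starts so that LightGBM's network setup cannot bind it.

# sketch only: hold a port open so distributed training fails to bind it
import socket

blocker = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
blocker.bind(("127.0.0.1", 0))            # let the OS pick a free port
blocker.listen(1)
occupied_port = blocker.getsockname()[1]
# ... start LightGBM distributed training configured to use occupied_port,
#     and assert that an informative "port in use" error is raised ...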
I suspect (and have suspected for a while) that the way we are raising errors on the Dask worker processes (LightGBM/python-package/lightgbm/dask.py Lines 664 to 671 in f94050a) is part of the problem.
I wouldn't be surprised to learn that the process is leaked and keeps running. Since our only use of sockets here is to set up LightGBM's own network for training, we definitely are not intentionally blocking or ignoring SIGTERM, and the new behavior you've described sounds like a nice improvement to distributed's restart behavior. |
It's not the error itself; it's actually the other worker that gets stuck. Since the distributed training expects all machines to be ready, when one fails to bind the port the other ones just wait forever. That's what I thought; it'd be nice to have a timeout for that wait. |
oh interesting! LightGBM does support that by the way --> https://lightgbm.readthedocs.io/en/v3.3.2/Parameters.html#time_out but it defaults to 120 minutes 😬 |
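For reference, a hedged sketch of passing that parameter through the Dask estimator (assumes a local cluster; time_out is in minutes, and extra keyword arguments on the estimator are forwarded to LightGBM's params):

# sketch only: shrink the network-setup timeout so a bad machine fails fast
import dask.array as da
from distributed import Client, LocalCluster
from lightgbm import DaskLGBMClassifier

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=2, threads_per_worker=1))
    X = da.random.random((1_000, 10), chunks=(250, 10))
    y = (da.random.random((1_000,), chunks=(250,)) > 0.5).astype(int)
    clf = DaskLGBMClassifier(n_estimators=5, time_out=1)  # 1 minute instead of 120
    clf.fit(X, y)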
Nice! That should work, I'll try it. |
It seems that timeout is for individual sockets; we need one for the whole network setup. This is the relevant part: LightGBM/src/network/linkers_socket.cpp Lines 196 to 216 in f94050a
So it doesn't wait forever; it has a finite number of retries, and each time it waits longer. I added a log there to see how many retries it would make, and there were 20. However, after those 20 tries, instead of failing the setup it assumed everything was ok and then segfaulted. I'll keep investigating this. |
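For illustration only, a Python sketch of the retry-with-increasing-wait pattern described above (the real logic is the C++ code in linkers_socket.cpp), with the key difference that this version raises once the retries are exhausted instead of assuming success:

# sketch only: bounded retries with a growing wait, failing loudly at the end
import socket
import time

def connect_with_retries(host, port, retries=20, base_wait=0.1):
    for attempt in range(retries):
        try:
            return socket.create_connection((host, port), timeout=5)
        except OSError:
            time.sleep(base_wait * (attempt + 1))  # wait a bit longer each time
    raise ConnectionError(f"could not connect to {host}:{port} after {retries} attempts")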
Had no luck haha. I removed the |
Current
|
Just checked the diff between dependencies in the last successful CI run and the current failing one. The only difference is in scipy.
|
Thanks for capturing that before the logs expire @StrikerRUS ! Unless anyone gets to it sooner, I'll investigate this tonight. |
It's quite strange that only |
Seems that pinning scipy to the previous version makes the jobs pass. |
The only item there that in my opinion can be related to our issue is Specifically,
|
That could definitely be related, especially if something on the scipy side changed which shared libraries it bundles or how they get loaded. |
I was able to reproduce this in Docker tonight, using the same image and environment variables as the Linux bdist CI job.

# Linux bdist
docker run \
    --rm \
    --env AZURE='true' \
    --env BUILD_DIRECTORY=/opt/LightGBM \
    --env CONDA_ENV='test-env' \
    --env LGB_VER=$(head -n 1 VERSION.txt) \
    --env PYTHON_VERSION='3.8' \
    --env TASK=bdist \
    --workdir=/opt/LightGBM \
    -v $(pwd):/opt/LightGBM \
    -it lightgbm/vsts-agent:ubuntu-14.04 \
    /bin/bash

export PATH="$CONDA/bin":${PATH}
${BUILD_DIRECTORY}/.ci/setup.sh
${BUILD_DIRECTORY}/.ci/test.sh

With those commands, as of latest master, I could not reproduce the failure. Then, I upgraded to scipy 1.9.0:

conda install \
    --yes \
    --name test-env \
    'scipy==1.9.0'

source activate test-env

# just target a single failing test, for faster feedback
pytest \
    -v \
    "$BUILD_DIRECTORY/tests/python_package_test/test_dask.py::test_classifier[data-rf-binary-classification-scipy_csr_matrix]"

Running it like that revealed the underlying error!
We've seen that issue in the Linux jobs on Ubuntu 14.04 before:
And as noted in conda-forge/conda-forge.github.io#1551
but, on the other hand, in pytorch/pytorch#2575 (comment) (which is linked from that conda-forge issue)
(supported by pytorch/pytorch#2575 (comment)) It's suggested in those linked issues (and in @StrikerRUS 's comment) that this error can be avoided by using a version of glibc > 2.21. Unfortunately, it looks like the Ubuntu 14.04 image we have uses glibc 2.19.

docker run \
    --rm \
    -it lightgbm/vsts-agent:ubuntu-14.04 \
    ldd --version
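Just as an aside (not something from this thread), the same check can be done from inside Python:

# prints the glibc version the interpreter is running against
import platform
print(platform.libc_ver())  # on that image, expected to be something like ('glibc', '2.19')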
So given that... maybe #5022 "worked" for so long because it resulted in different versions of some dependencies, which resulted in a different library-loading order when importing them... and now the new scipy release changes that loading order again. |
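One rough way to poke at that hypothesis (an assumed debugging approach, not something from this thread; Linux-only): import the libraries in a chosen order, then list which shared objects ended up mapped into the process.

# sketch only: show which .so files are loaded after a given import order
import os

import scipy     # noqa: F401  (the import order is the thing being inspected)
import lightgbm  # noqa: F401

with open(f"/proc/{os.getpid()}/maps") as f:
    libs = sorted({line.split()[-1] for line in f if ".so" in line})
for lib in libs:
    print(lib)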
I tried running the tests with LD_DEBUG=libs set, to record which shared libraries get loaded and in what order. Like this:

LD_DEBUG=libs \
pytest \
    -v \
    "$BUILD_DIRECTORY/tests/python_package_test/test_dask.py::test_classifier[data-rf-binary-classification-scipy_csr_matrix]" \
    > ./out.txt 2>&1

Here's a sample of what is produced:
Haven't had a chance to compare those outputs yet (they require a bit of modification to compare in a text differ), but wanted to put that command up here in case it's useful for others investigating this or finding this issue from search engines. |
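A small sketch of the kind of modification mentioned there (hypothetical file names): LD_DEBUG prefixes every line with the PID, which differs between runs, so stripping that prefix makes two logs comparable in a normal differ.

# sketch only: normalize LD_DEBUG=libs output and diff two runs
import difflib
import re
import sys

def normalize(path):
    with open(path) as f:
        # drop the leading "   12345:" PID prefix from each line
        return [re.sub(r"^\s*\d+:\s*", "", line) for line in f]

if __name__ == "__main__":
    old_log, new_log = sys.argv[1], sys.argv[2]  # e.g. out_old_scipy.txt out_new_scipy.txt
    sys.stdout.writelines(
        difflib.unified_diff(normalize(old_log), normalize(new_log), old_log, new_log)
    )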
@jameslamb Thank you very much for your investigation! Maybe it's time to bump the minimum required glibc version.
UPD: Or even better, switch to one of the standard manylinux images.
@StrikerRUS sorry it took me so long to respond to this! I agree with the proposal to bump the minimum required glibc version.
@StrikerRUS I just returned from traveling and have some time to work on this. But I saw that you've been pushing commits directly to Would you like me to wait until you're done with that effort to support the aarch64 wheels? And then would you like me to make a PR into |
Nice to hear! 🙂
It depends on how long it will take to merge #5252. I believe we need at least one more review there. I think I can configure pushes to Docker Hub for a new branch, let's say |
@jameslamb You can check the diff for the |
BTW, I guess we need to create a separate issue for discussing the migration to official manylinux images. |
opened #5514 |
I'm working on this in #5597. |
I've stopped working on this. Will post here if I start investigating it again in the future. |
Just noting that even though this discussion is very old and references an old version of Dask, it shouldn't be closed until we've restored this unit test: LightGBM/tests/python_package_test/test_dask.py Line 1339 in 1443548
Contributions welcome if anyone with Dask knowledge wants to investigate that! |
Description
Created from #5388 (comment).
All CUDA CI jobs and several Linux jobs are failing with the following.

The client.restart() calls in that test are resulting in the following traceback:

It looks like those jobs are getting dask and distributed 2022.7.1, which hit conda-forge 3 days ago.

Reproducible example

Here's an example: https://github.com/microsoft/LightGBM/runs/7522939980?check_suite_focus=true

I don't believe the failure is related to anything specific on the PR that that failed build came from.

Additional Comments

Note that this should not be a concern for jobs using Python < 3.8, as dask / distributed have dropped support for those Python versions.

Logs from an example build on #5388 where I tried to pin to exactly dask==2022.7.0 (build link):