Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ci] [dask] CI jobs failing with Dask 2022.7.1 #5390

Open
jameslamb opened this issue Jul 26, 2022 · 29 comments
Open

[ci] [dask] CI jobs failing with Dask 2022.7.1 #5390

jameslamb opened this issue Jul 26, 2022 · 29 comments

Comments

@jameslamb
Copy link
Collaborator

Description

Created from #5388 (comment).

All CUDA CI jobs and several Linux jobs are failing with the following.

FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[binary-classification]
FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[multiclass-classification]
FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[regression]
FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[ranking]
= 4 failed, 700 passed, 10 skipped, 2 xfailed, 395 warnings in 655.51s (0:10:55) =

client.restart() calls in that test are resulting in the following:

raise TimeoutError(f"{len(bad_nannies)}/{len(nannies)} nanny worker(s) did not shut down within {timeout}s")
E asyncio.exceptions.TimeoutError: 1/2 nanny worker(s) did not shut down within 120s

traceback (click me)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/client.py:3360: in restart
    return self.sync(
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:338: in sync
    return sync(
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:405: in sync
    raise exc.with_traceback(tb)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:378: in f
    result = yield future
/root/miniforge/envs/test-env/lib/python3.9/site-packages/tornado/gen.py:762: in run
    value = future.result()
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/client.py:3329: in _restart
    await self.scheduler.restart(timeout=timeout, wait_for_workers=wait_for_workers)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:1153: in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:943: in send_recv
    raise exc.with_traceback(tb)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:769: in _handle_comm
    result = await result
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:778: in wrapper
    return await func(*args, **kwargs)

It looks like those jobs are getting dask and distributed 2022.7.1

    dask-2022.7.1              |     pyhd8ed1ab_0           5 KB  conda-forge
    dask-core-2022.7.1         |     pyhd8ed1ab_0         840 KB  conda-forge
    dbus-1.13.6                |       h5008d03_3         604 KB  conda-forge
    distributed-2022.7.1       |     pyhd8ed1ab_0         735 KB  conda-forge

which hit conda-forge 3 days ago.

Screen Shot 2022-07-26 at 1 34 56 PM

Reproducible example

Here's an example: https://github.com/microsoft/LightGBM/runs/7522939980?check_suite_focus=true

I don't believe the failure is related to anything specific on the PR that that failed build came from.

Additional Comments

Note that this should not be a concern for jobs using Python < 3.8, as dask / distributed have dropped support for those Python versions.

Logs from an example build on #5388 where I tried to pin to exactly dask==2022.7.0 (build link):

UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:

Specifications:

  - dask==2022.7.0 -> python[version='>=3.8']
  - distributed==2022.7.0 -> python[version='>=3.8']
@jmoralez
Copy link
Collaborator

Linking possible source: dask/distributed#6714

@jameslamb
Copy link
Collaborator Author

Great find @jmoralez !!!

I just read through that PR, totally agree with you that it looks very relevant and is probably the root cause for this.

I wonder if the way that the failing test leaves the cluster with an error on one worker results in a situation where the worker can't shut down cleanly.

I believe the breaking change in that PR you linked is that restart() will now, by default, block until all workers shut down and restart successfully.

# test that "machines" is actually respected by creating a socket that uses
# one of the ports mentioned in "machines"
error_msg = f"Binding port {open_ports[0]} failed"
with pytest.raises(lgb.basic.LightGBMError, match=error_msg):
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
s.bind((workers_hostname, open_ports[0]))
dask_model.fit(dX, dy, group=dg)
# The above error leaves a worker waiting
client.restart()

Based on https://docs.dask.org/en/stable/_modules/distributed/client.html#Client.restart, I guess we could try something like the following to recover the old behavior?

client.restart(wait_for_workers=False)
client.wait_for_workers(n_workers=1)

Or, alternatively, we can try to figure out why raising an error the way that this test intentionally does creates issues for Client.restart().

@jmoralez do you have time to look into this this week? I'd be happy to back off and let you take it if you're interested.


@gjoseph92 , also tagging you since you were the author of dask/distributed#6714, and maybe whatever we're doing here in LightGBM is weird/unexpected and might be informative to you as you continue working on distributed.

@gjoseph92
Copy link

Sorry about that. Yeah, client.restart is now stricter and verifies that the restart was actually successful.

I don't understand what you're doing with the socket, but presumably you're doing something that makes the worker block during Worker.close. I'm kinda curious what the motivation for that is / what you're trying to test (and if this is something LightGBM users would encounter and have issues with like this test does)? Another important question: what happens to the stuck worker after the test? Does it ever actually shut down, or is the process leaked and just keeps running?

Regardless, what @jameslamb said is what you'll want to match the previous behavior here:

client.restart(wait_for_workers=False)
client.wait_for_workers(n_workers=1)

@gjoseph92
Copy link

Also, at some point soon-ish, I'm hoping to finish up and merge dask/distributed#6427. This might break your test yet again.

Currently, the Nanny.kill process goes:

  1. Call Worker.close
  2. If that doesn't shut it down, send a SIGTERM to the worker. I think with your current test setup this will actually kill the worker process. But if you were running dask-worker (like users would be), it would just trigger Worker.close, so effectively a no-op.
  3. If it's still not closed after the timeout... 🤷

After that PR:

  1. Call Worker.close
  2. If that doesn't shut it down, send a SIGKILL to the worker. SIGKILL cannot be caught so the process will shut down nearly immediately (modulo it being suspended or currently waiting on a system call).
  3. If it's still not closed after the timeout, raise an error.

The key difference is that maybe what you have right now is managing to block/ignore the SIGTERM, preventing the stuck worker from shutting down. Since you can't block a SIGKILL, after this PR is merged I would expect (and hope) that client.restart would result in both workers restarting successfully, not just 1. If you're counting on the stuck worker to survive the restart, I don't think it would (it's a bug that it's able to right now).

@jameslamb
Copy link
Collaborator Author

Thanks very much for that @gjoseph92 ! And no need to apologize... my @ wasn't a passive-aggressive "hey why did you break our tests", I promise 😂

I'm kinda curious what the motivation for that is / what you're trying to test

LightGBM has its own distributed computing setup (e.g. to perform collective operations across multiple processes) that lightgbm.dask just wraps. In that setup, worker processes communicate over TCP sockets. For users in highly-restricted environments (e.g. behind a strict firewall), we provide the option to explicitly list the ports used for those sockets.

This test checks that the expected error is raised in the situation where someone specifies a port that is already in use at the time that training starts.

what happens to the stuck worker after the test? Does it ever actually shut down, or is the process leaked and just keeps running?

I suspect (and have suspected for a while) that the way we are raising errors on lightgbm.dask could be improved, and might result in leaving workers in a bad state.

# Start computation in the background
parts = list(map(delayed, parts))
parts = client.compute(parts)
wait(parts)
for part in parts:
if part.status == 'error': # type: ignore
return part # trigger error locally

I wouldn't be surprised to learn that the process is leaked and keeps running. Since our only use of Client.restart() is in a test and since that happens only on remote machines during CI, it would be easy for us to miss a leaked process.

If you're counting on the stuck worker to survive the restart

We definitely are not, and the new behavior you've described sounds like a nice improvement to Client.restart() 😁

@jameslamb jameslamb added bug and removed blocking labels Jul 27, 2022
@jmoralez
Copy link
Collaborator

jmoralez commented Jul 27, 2022

Or, alternatively, we can try to figure out why raising an error the way that this test intentionally does creates issues for Client.restart()

It's not the error itself, it's actually the other worker that gets stuck. Since the distributed training expects all machines to be ready, when one fails to bind the port the other ones just wait forever. That's what I thought restart fixed but I'm not sure anymore haha. I think one better fix would be to add a timeout to the distributed training, so that if one of the machines fails to connect, the process is interrupted in the rest of them. I'll give that a shot and let you know.

@jameslamb
Copy link
Collaborator Author

add a timeout to the distributed training, so that if one of the machines fails to connect, the process is interrupted in the rest of them

oh interesting! LightGBM does support that by the way --> https://lightgbm.readthedocs.io/en/v3.3.2/Parameters.html#time_out

but it defaults to 120 minutes 😬

@jmoralez
Copy link
Collaborator

Nice! That should work, I'll try it.

@jmoralez
Copy link
Collaborator

It seems that timeout is for individual sockets, we need to have one for the whole network setup. This is the relevant part:

for (auto it = need_connect.begin(); it != need_connect.end(); ++it) {
int out_rank = it->first;
// let smaller rank connect to larger rank
if (out_rank > rank_) {
int connect_fail_delay_time = connect_fail_retry_first_delay_interval;
for (int i = 0; i < connect_fail_retry_cnt; ++i) {
TcpSocket cur_socket;
if (cur_socket.Connect(client_ips_[out_rank].c_str(), client_ports_[out_rank])) {
// send local rank
cur_socket.Send(reinterpret_cast<const char*>(&rank_), sizeof(rank_));
SetLinker(out_rank, cur_socket);
break;
} else {
Log::Warning("Connecting to rank %d failed, waiting for %d milliseconds", out_rank, connect_fail_delay_time);
cur_socket.Close();
std::this_thread::sleep_for(std::chrono::milliseconds(connect_fail_delay_time));
connect_fail_delay_time = static_cast<int>(connect_fail_delay_time * connect_fail_retry_delay_factor);
}
}
}
}

So it doesn't wait forever, it does have a finite amount of retries and each time it waits longer. I added a log there to see how many retries it would make and there were 20, however after these 20 tries instead of failing the setup it assumed everything was ok and then segfaulted.
image

I'll keep investigating this

@jmoralez
Copy link
Collaborator

Had no luck haha. I removed the restart from the test and I raised an exception when the network setup took too long but the test gets stuck because there are pending tasks and the worker seems to be dead.

@StrikerRUS
Copy link
Collaborator

Current master fails in Linux regular and Linux bdist jobs after 1h:

##[error]The job running on agent lgbubb5e0000SJI ran longer than the maximum time of 60 minutes. For more information, see https://go.microsoft.com/fwlink/?linkid=2077134



...


============================= test session starts ==============================
platform linux -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0
rootdir: /__w/1/s
collected 719 items

../tests/c_api_test/test_.py ..                                          [  0%]
../tests/python_package_test/test_basic.py ............................. [  4%]
............................................                             [ 10%]
../tests/python_package_test/test_callback.py ............               [ 12%]
../tests/python_package_test/test_consistency.py ......                  [ 12%]
../tests/python_package_test/test_dask.py FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF [ 17%]
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF...FFFFFFFFFFFFFFFFF [ 27%]
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF [ 37%]
FFFFFFFFFFFFFFFF
##[error]The operation was canceled.

https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=13193&view=logs&j=c28dceab-947a-5848-c21f-eef3695e5f11

@StrikerRUS
Copy link
Collaborator

Just checked the diff between dependencies in the last successful CI run and the current failing one. The only difference is in scipy

   scipy-1.8.1                |   py39h8ba3f38_2        25.0 MB  conda-forge

    scipy-1.9.0                |   py39h8ba3f38_0        26.2 MB  conda-forge

@jameslamb
Copy link
Collaborator Author

jameslamb commented Aug 2, 2022

Thanks for capturing that before the logs expire @StrikerRUS !

Unless anyone gets to it sooner, I'll investigate this tonight.

@StrikerRUS
Copy link
Collaborator

It's quite strange that only Linux regular and Linux bdist are affected. scipy has been upgraded in Linux mpi_source, for example, as well...

@StrikerRUS
Copy link
Collaborator

Seems that pinning scipy version works.

@StrikerRUS
Copy link
Collaborator

scipy 1.9.0 changelog: http://scipy.github.io/devdocs/release.1.9.0.html

The only item there that in my opinion can be related to our issue is

SciPy switched to Meson as its build system

Specifically,

The build defaults to using OpenBLAS

@jameslamb
Copy link
Collaborator Author

That could definitely be related, especially if something on the dask-specific code paths leads to conflicting shared-library-loading (like some of the issues we've talked about in #5106).

@jameslamb
Copy link
Collaborator Author

I was able to reproduce this in Docker tonight, using the same image and environment variables as the Linux bdist job.

# Linux bdist
docker run \
    --rm \
    --env AZURE='true' \
    --env BUILD_DIRECTORY=/opt/LightGBM \
    --env CONDA_ENV='test-env' \
    --env LGB_VER=$(head -n 1 VERSION.txt) \
    --env PYTHON_VERSION='3.8' \
    --env TASK=bdist \
    --workdir=/opt/LightGBM \
    -v $(pwd):/opt/LightGBM \
    -it lightgbm/vsts-agent:ubuntu-14.04 \
    /bin/bash

export PATH="$CONDA/bin":${PATH}
${BUILD_DIRECTORY}/.ci/setup.sh
${BUILD_DIRECTORY}/.ci/test.sh

With those commands, as of latest master, scipy==1.8.1 is installed and all Dask tests succeed.

Then, I upgraded to scipy==1.9.0 and the Dask tests failed.

conda install \
    --yes \
    --name test-env \
    'scipy==1.9.0'
source activate test-env

# just target a single failing test, for faster feedback
pytest \
    -v \
    "$BUILD_DIRECTORY/tests/python_package_test/test_dask.py::test_classifier[data-rf-binary-classification-scipy_csr_matrix]"

Running it like that revealed the underlying error!

>   self._handle = _dlopen(self._name, mode)
E   OSError: dlopen: cannot load any more object with static TLS

We've seen that issue in the Linux jobs on Ubuntu 14.04 before:

And as noted in conda-forge/conda-forge.github.io#1551

This has some unfortunate side effects like how changing the import order between libraries makes the error appear / go away. Such kinds of accidents lead to the proliferation of unfortunate (because: randomly working or not) advice...

but, on the other hand, in pytorch/pytorch#2575 (comment) (which is linked from that conda-forge issue)

importing something that gets libgomp early enough should be the recommended workaround.

(supported by pytorch/pytorch#2575 (comment))

It's suggested in those linked issues (and in @StrikerRUS 's comment) that this error can be avoided by using a version of glibc > 2.21.

Unfortunately, it looks like the Ubuntu 14.04 image we have uses glibc 2.19.

docker run \
    --rm \
    -it lightgbm/vsts-agent:ubuntu-14.04 \
    ldd --version
ldd (Ubuntu EGLIBC 2.19-0ubuntu6.15) 2.19

So given that...maybe #5022 "worked" for so long because it resulted in different versions of some dependencies, which resulted in different library-loading order when importing them....and now the new scipy release has caused a slightly different library-loading order, which causes this issue to reoccur?

@jameslamb
Copy link
Collaborator Author

jameslamb commented Aug 3, 2022

I tried running the tests with LD_DEBUG=libs set (see here for docs on that variable), to see what was being searched.

Like this.

LD_DEBUG=libs \
pytest \
    -v \
    "$BUILD_DIRECTORY/tests/python_package_test/test_dask.py::test_classifier[data-rf-binary-classification-scipy_csr_matrix]" \
> ./out.txt 2>&1

Here's a sample of what is produced:

      2622:	find library=libpthread.so.0 [0]; searching
      2622:	 search path=/opt/miniforge/envs/test-env/bin/../lib/tls/x86_64:/opt/miniforge/envs/test-env/bin/../lib/tls:/opt/miniforge/envs/test-env/bin/../lib/x86_64:/opt/miniforge/envs/test-env/bin/../lib		(RPATH from file /opt/miniforge/envs/test-env/bin/python)
      2622:	  trying file=/opt/miniforge/envs/test-env/bin/../lib/tls/x86_64/libpthread.so.0
      2622:	  trying file=/opt/miniforge/envs/test-env/bin/../lib/tls/libpthread.so.0
      2622:	  trying file=/opt/miniforge/envs/test-env/bin/../lib/x86_64/libpthread.so.0
      2622:	  trying file=/opt/miniforge/envs/test-env/bin/../lib/libpthread.so.0
      2622:	 search cache=/etc/ld.so.cache
      2622:	  trying file=/lib/x86_64-linux-gnu/libpthread.so.0
      2622:	
      2622:	find library=libdl.so.2 [0]; searching
      2622:	 search path=/opt/miniforge/envs/test-env/bin/../lib		(RPATH from file /opt/miniforge/envs/test-env/bin/python)
      2622:	  trying file=/opt/miniforge/envs/test-env/bin/../lib/libdl.so.2
      2622:	 search cache=/etc/ld.so.cache
      2622:	  trying file=/lib/x86_64-linux-gnu/libdl.so.2

Haven't had a chance to compare those outputs yet (they require a bit of modification to compare in a text differ), but wanted to put that command up here in case its useful for others investigating this or finding this issue from search engines.

@StrikerRUS
Copy link
Collaborator

StrikerRUS commented Aug 3, 2022

@jameslamb Thank you very much for your investigation!

Maybe it's time to bump minimum required glibc version (2.14 -> 2.23) in LightGBM 4.0.0 and switch to Ubuntu 16.04 image for artifacts production?
https://en.wikipedia.org/wiki/Glibc#Version_history

UPD: Or even better to one of the standard manylinux PyPI images (linking #4484 (comment) and #3421 (comment)).

@jameslamb
Copy link
Collaborator Author

@StrikerRUS sorry it took me so long to respond to this!

I agree with the proposal to bump the minimum glibc version and to start publishing wheels with using one of the standard manylinux images.

@jameslamb
Copy link
Collaborator Author

start publishing wheels with using one of the standard manylinux images

@StrikerRUS I just returned from traveling and have some time to work on this.

But I saw that you've been pushing commits directly to dev: https://github.com/guolinke/lightgbm-ci-docker/compare/dev (noticed because of #5252 (comment)).

Would you like me to wait until you're done with that effort to support the aarch64 wheels? And then would you like me to make a PR into dev in that repo with proposed changes to the image?

@StrikerRUS
Copy link
Collaborator

StrikerRUS commented Aug 30, 2022

@jameslamb

have some time to work on this.

Nice to hear! 🙂

Would you like me to wait until you're done with that effort to support the aarch64 wheels?

It depends on how long it will take to merge #5252. I believe we need at least one more review there.

I think I can configure pushes to Docker Hub for a new branch, let's say dev-2, and you can test your ideas in that branch right now.

@StrikerRUS
Copy link
Collaborator

@jameslamb
Done! Feel free to do your experiments in the dev2 branch.

You can check the diff for the .github/workflows/publish_image.yml file for how to make it possible to push multiple Dockerfiles into Docker Hub.

@StrikerRUS
Copy link
Collaborator

BTW, I guess we need to create a separate issue for discussing the migration to official manylinux image.

@jameslamb
Copy link
Collaborator Author

separate issue for discussing migration to official manylinux image

opened #5514

@jameslamb
Copy link
Collaborator Author

I'm working on this in #5597.

@jameslamb jameslamb removed their assignment Feb 1, 2023
@jameslamb
Copy link
Collaborator Author

I've stopped working on this. Will post here if I start investigating it again in the future.

@jameslamb
Copy link
Collaborator Author

Just noting that even though this discussion is very old and references an old version of Dask, it shouldn't be closed until we've restored this unit test:

pytest.skip("skipping due to timeout issues discussed in https://github.com/microsoft/LightGBM/issues/5390")

Contributions welcome if anyone with Dask knowledge wants to investigate that!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants