[dask] run one training task on each worker #4132
Conversation
Awesome!! Wow. Yes, I'll review this tomorrow PM. But just reading this for a few minutes, I think you're totally on to something, and I'm guessing there's a good reason why `xgboost.dask` also calls `client.submit()` this way.
welp, …

HOWEVER, I still think this PR fixes a critical source of instability. Maybe we just have multiple stability problems in `lightgbm.dask`.

UPDATE: looking more closely at the logs, I think this test actually failed with a variation of the issue mentioned in #4112 (comment).
Notice that the same port was chosen for both workers!
That aligns with the idea from #4112 (comment).
Added #4133, which I think should fix the "sometimes we get a …" issue. Pushing an empty commit to re-trigger CI, but leaving the build from #4132 (comment) undisturbed in case anyone wants to look at the logs.
The most recent …
I've manually restarted that build.
@jameslamb Your detailed description makes a lot of sense!
`allow_other_workers` (default: `True`): indicate whether it is ok to run the task on other workers not mentioned in `workers`

But it looks like this param is already `False` by default:
It is, but I think we should be explicit about that. In case it wasn't clear, the lack of …
oh I see, I wrote "default: `True`" in the PR description. Edited 😂
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
This PR resolves a critical source of instability in `lightgbm.dask`, which I think might be the cause of some of the issues we have observed sporadically in tests (#4057, #4074, #4099).

Changes in this PR
* changed `lightgbm.dask`'s use of `distributed.Client.submit()` to more tightly control how Dask schedules LightGBM training tasks

How this improves `lightgbm`
I believe this PR will fix a critical source of instability that can cause training to fail (either throwing an error or hanging indefinitely) if multiple training tasks are accidentally scheduled onto the same worker. I think that the risk of this issue increases as you have more workers in the Dask cluster and as there is more other work going on in the cluster alongside LightGBM training.
This PR should also improve the stability of LightGBM's Dask unit tests.
Overview
In distributed LightGBM training, users create `n` workers, each of which runs LightGBM training on a local chunk of the full training data. Estimators in `lightgbm.dask` expect training data to be given in Dask collections. Those estimators look at which Dask workers hold each partition of the training data, and run one LightGBM training process per worker.

The logic for scheduling those training processes out onto Dask workers is currently incorrect.
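For context, here is a minimal sketch of that user-facing workflow. This is not code from this PR: the cluster setup, data shapes, and estimator choice are hypothetical, assuming a version of `lightgbm` that ships the Dask estimators.

```python
import dask.array as da
from distributed import Client, LocalCluster
from lightgbm import DaskLGBMRegressor

# hypothetical local cluster with two workers; any Dask cluster works the same way
cluster = LocalCluster(n_workers=2)
client = Client(cluster)

# Dask collections whose partitions end up spread across the workers
X = da.random.random((10_000, 20), chunks=(2_500, 20))
y = da.random.random((10_000,), chunks=(2_500,))

# lightgbm.dask inspects which workers hold partitions of X and y,
# then runs one LightGBM training process on each of those workers
model = DaskLGBMRegressor(n_estimators=50)
model.fit(X, y)
```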
LightGBM/python-package/lightgbm/dask.py
Lines 383 to 397 in 9cab93a
The current code follows pseudocode roughly like this:
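The original pseudocode block is not reproduced here. As a rough sketch, with illustrative names (`client`, `worker_map`, `params`, `list_of_parts` are assumptions, and the argument list is simplified), the current behavior amounts to:

```python
# sketch of the CURRENT behavior: one task is submitted per worker that
# holds data, but nothing tells Dask where each task must run, so two
# _train_part() tasks can land on the same worker
futures = [
    client.submit(
        _train_part,
        params=params,
        list_of_parts=list_of_parts,
        machines=machines,
    )
    for worker, list_of_parts in worker_map.items()
]
```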
However, the desired behavior is slightly different. It should be more like this:
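Continuing the same sketch (same assumed names, simplified arguments), the desired behavior pins each task to the worker that holds its partitions:

```python
# sketch of the DESIRED behavior: each task may run only on the worker
# that holds its partitions, and nowhere else
futures = [
    client.submit(
        _train_part,
        params=params,
        list_of_parts=list_of_parts,
        machines=machines,
        workers=[worker],           # run only on this specific worker
        allow_other_workers=False,  # never fall back to a different worker
        pure=False,                 # training has side effects, so don't reuse cached results
    )
    for worker, list_of_parts in worker_map.items()
]
```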
This more specific definition is important. If the training data are split across two workers (for example) and LightGBM sets `machines` to a list with those two workers, but then two `_train_part()` tasks are scheduled on the same worker, training will fail.

* If that happens on a worker that is not `rank 0`, the worker will be killed and restarted.
* If that happens on the worker that is `rank 0`, I saw that LightGBM just hangs indefinitely. If you have task-level timeouts set in Dask, Dask might eventually kill that process.

Either way, training will not succeed.

Per the Dask docs, the following parameters should be used in `lightgbm.dask` to achieve tight control over where tasks run (see the sketch after this list).

* `workers`: specify EXACTLY which worker(s) the task being submitted can run on
* `allow_other_workers` (default: `False`): indicate whether it is ok to run the task on other workers not mentioned in `workers`
* `pure` (default: `True`): whether or not the function is a "pure function", which, according to Wikipedia, means that it always returns the same outputs given the same inputs, does not modify any global state (including the file system), and does not have side effects on output streams
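To make those three parameters concrete outside of LightGBM, here is a small self-contained example; the function, value, and worker choice are made up for illustration:

```python
from distributed import Client, LocalCluster

def double(x):
    # prints to stdout, so it is not a "pure" function
    print(f"doubling {x}")
    return x * 2

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=2))

    # pick one worker address to pin the task to
    target_worker = sorted(client.scheduler_info()["workers"])[0]

    future = client.submit(
        double,
        21,
        workers=[target_worker],    # only this worker may run the task
        allow_other_workers=False,  # do not fall back to any other worker
        pure=False,                 # re-run on every submit instead of reusing a cached result
    )
    print(future.result())  # 42
```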
More Background

The bug fixed by this PR is a general class of problem in `lightgbm.dask` that we should be aware of. By default, Dask assumes that tasks can be scheduled onto any worker, and even that they can be "stolen" from one worker by another. These default assumptions don't hold for distributed training of LightGBM or XGBoost.
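As an aside (not something this PR does), work stealing can also be switched off cluster-wide through Dask configuration; the setting has to be applied before the scheduler is created:

```python
import dask
from distributed import Client, LocalCluster

# disable work stealing globally, so tasks stay on the worker they were assigned to
dask.config.set({"distributed.scheduler.work-stealing": False})

client = Client(LocalCluster(n_workers=2))
```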
Notes for Reviewers
I checked the blame in `dask/distributed`, and it looks like the three keyword arguments to `client.submit()` that I'm specifying here have been a part of the client for at least 3 years, so I don't think they cause any compatibility issues we need to worry about.

https://github.com/dask/distributed/blame/1b32bd30201ef6ced5029180143d2c37b393b586/distributed/client.py#L1234-L1240
that I'm specifying here have been a part of the client for at least 3 years. So I don't think they cause any compatibility issues we need to worry about.https://github.com/dask/distributed/blame/1b32bd30201ef6ced5029180143d2c37b393b586/distributed/client.py#L1234-L1240