[dask] use random ports in network setup #3823
Conversation
Thanks for your contribution and interest in LightGBM, @jmoralez! It does seem like this produces the desired behavior on LocalCluster, but I'm concerned that this might be disruptive for users on multi-machine clusters, such as those that can be created with dask-cloudprovider. With the approach proposed in this PR, you lose the ability to control the range of ports that are used by LightGBM, which I think would be problematic for people working in environments that are limited by firewall rules. What do you think?
Thank you for your feedback, @jameslamb! I don't think there will be a problem with using a different type of cluster, since …
Ah! Let me add more information on this. By default Dask does find ports for that worker-to-scheduler communication randomly, but it exposes options to control that more tightly: https://docs.dask.org/en/latest/setup/cli.html#handling-ports.

The ports for LightGBM aren't related to Dask directly. LightGBM has its own distributed learning framework written in C++ (https://github.com/microsoft/LightGBM/tree/master/src/network), and this Dask package just wraps that framework. In the Dask package here, we set up one long-running Dask task per worker, and that task runs a LightGBM worker. The LightGBM workers talk to each other directly over TCP sockets (by default), and Dask has no knowledge of that communication.
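For context, here is a rough sketch of what "one long-running Dask task per worker" looks like in practice. Everything in it is illustrative: the function `_run_lightgbm_worker`, the scheduler address, and the `machines` values are assumptions, not the actual lightgbm.dask internals; the real package builds the `machines` list itself and runs its own training function.

```python
from dask.distributed import Client


def _run_lightgbm_worker(machines: str, worker_address: str) -> str:
    # Stand-in for the real per-worker task: it would start a LightGBM worker
    # that opens its own TCP socket and talks to the other LightGBM workers
    # directly, so Dask never sees that traffic.
    return f"{worker_address} joined {machines}"


client = Client("tcp://127.0.0.1:8786")        # example scheduler address

# LightGBM's `machines` parameter is a comma-separated list of ip:port pairs.
machines = "10.0.0.1:13000,10.0.0.2:13000"     # made-up addresses

futures = [
    client.submit(
        _run_lightgbm_worker,
        machines,
        address,                 # distinct argument, so each task gets its own key
        workers=[address],       # pin this task to one specific worker
        allow_other_workers=False,
    )
    for address in client.scheduler_info()["workers"]
]
client.gather(futures)           # block until every pinned task finishes
```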
Thanks very much! If it's easier, I have a testing setup that uses … When I was working on #3766, I found that there were several fixes I tried which worked well on LocalCluster that didn't hold up in a multi-machine setting.
Hi @jameslamb. I've successfully run the …
Hi James. I've set up the version of my branch here. I think this random port assignment could be a fallback when specifying … As a side note, I've noticed that with few samples I get …
Can you please open a separate issue for this so we can discuss it there? I have some theories.
Really cool, thank you! I clicked into the notebook and ran it. I really appreciate the effort you've put into this PR and into setting up this nice test harness in Coiled. I can see that the 5 consecutive runs worked! I even tried changing the test and running 50 consecutive trainings back to back. No problems :). I can see that this cluster's workers are on different IPs, so I think this is a great test that this PR will work in other multi-machine settings.

I like the idea of possibly using both this implementation and the current one, but I'd rather switch the order. I think it makes sense to have your proposed (faster) approach be the default behavior, and for the current behavior to only be used if people opt in to it because they're working in an environment with networking constraints, like firewall rules that only permit specific port ranges. I think users will be ok with paying a small performance penalty to comply with their security setup, and that anyone NOT constrained in that way should get the faster method by default. However, I don't think we should support setting …

Ok, sorry, that was a lot of information. Given all that, here's my proposal: …

What do you think?
Hi James. That sounds good, I'll work on that and update this PR. Thank you!
Hi James. Since …
Thanks! Really like the changes. I agree with the decision you made about params["local_listen_port"] == 12400 inside _train(), since by that point that parameter is guaranteed to exist:

LightGBM/python-package/lightgbm/dask.py, lines 226 to 230 in 56fc036:

```python
params = _choose_param_value(
    main_param_name="local_listen_port",
    params=params,
    default_value=12400
)
```
I left some new suggestions. Once you've addressed those, I'll test this manually on a FargateCluster with dask-cloudprovider, and ask @StrikerRUS for a review.
Thank you for your feedback, James. I've included your comments.
Just left one more style thing, otherwise looks ok to me! I'll pull this and test on AWS.
@StrikerRUS whenever you have time, could you give your opinion? I can give you the relevant context so you don't have to re-read all the discussion here and in #3768.
In #3766, I added a function _find_open_port() that the Dask package uses to build the machines list. But as I noted in #3766 (comment), that approach adds some overhead. It currently runs the check sequentially, so for a 3-worker cluster it will go "find port on w1, then find port on w2, then find port on w3". This means there is O(n_workers) overhead introduced.
This PR proposes running a function to find a random port on every worker simultaneously, so in theory the overhead of setting up machines becomes O(1). This is faster, but it also means the ports used for machines are totally unrelated to local_listen_port.
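Roughly, the difference between the two strategies looks like this; `find_port` below is a placeholder for whatever port-search logic runs on a worker, not a real lightgbm.dask function.

```python
def build_ports_sequentially(client, find_port):
    # O(n_workers): ask one worker at a time, waiting for each answer.
    ports = {}
    for address in client.scheduler_info()["workers"]:
        future = client.submit(find_port, workers=[address], pure=False)
        ports[address] = future.result()
    return ports


def build_ports_concurrently(client, find_port):
    # Roughly O(1) wall-clock: client.run() calls find_port on every worker
    # at the same time and returns {worker_address: result}.
    return client.run(find_port)
```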
In #3823 (comment) I told @jmoralez that in my opinion, this faster behavior should be the default, but that we still need to give people who are limited by firewall rules the ability to specify a local_listen_port and know that only that port and those near it will be searched.
So we settled on the current state of this PR (a rough sketch of the logic follows this list), which is:
- if local_listen_port is missing from the parameters, use the faster approach
- if local_listen_port is set to 12400 (LightGBM's default), use the faster approach
- if local_listen_port is set to a value other than 12400, use the _find_open_port() approach, where we specifically search 1000 ports, from local_listen_port to local_listen_port + 999.
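A minimal sketch of that decision logic, assuming helpers named _find_random_open_port() and _find_open_port() like the ones discussed in this thread; the function below and its signatures are illustrative, not the exact code in the PR.

```python
def _choose_worker_ports(client, params):
    port = params.get("local_listen_port", 12400)
    if port == 12400:
        # missing or left at LightGBM's default: faster path, a random open
        # port is found on every worker concurrently
        return client.run(_find_random_open_port)
    # user chose a specific port: search the range [port, port + 999] for each
    # worker (in the real code this search runs on the workers themselves)
    return {
        address: _find_open_port(address, port, max_ports_to_search=1000)
        for address in client.scheduler_info()["workers"]
    }
```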
@jmoralez Thanks a lot for the contribution!
@jameslamb Many thanks for the detailed overview of the problem behind this PR!
I'm worried about only one thing. After this PR, port 12400 is treated as "not set". That is not true from the user's point of view. Imagine the following scenario: a user reads the LightGBM docs and finds that the port for distributed training can be set via the local_listen_port parameter and that its default value is 12400. The user doesn't have any preference for ports and decides to leave the default value (or, a more critical problem, hardcodes 12400 in params). The user goes to their firewall settings and opens port 12400 with the hope that everything will work. But it won't work, because a random port, not 12400, will actually be used inside LightGBM.

Things get worse because there is no dedicated port param in the Dask wrapper that can be set to, say, None and take priority over local_listen_port in kwargs. I have only one solution for this inconsistency for now: document in the Dask docs that, for Dask only, port 12400 === random.

Something like the following, with a warning directive:
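For illustration only, such a note could look roughly like this; the wording is a sketch, not the exact directive proposed in this thread.

```rst
.. warning::

    In the Dask estimators, leaving ``local_listen_port`` at its default value
    of ``12400`` is treated as "no preference": a random open port is chosen on
    each worker. If your firewall only allows specific ports, set
    ``local_listen_port`` to a non-default value.
```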
Thanks @StrikerRUS. @jmoralez, don't make any changes yet. I need to go re-read the code in …
I haven't forgotten about this PR. I did a lot of research this weekend, reading through LightGBM's …
Ok @jmoralez, we just merged #3994, so this PR should now be unblocked. Could you please update this to the latest …

LightGBM/python-package/lightgbm/dask.py, line 418 in 1f73f55
Thank you, James! Should I remove …
yes please
Excellent work! This is a nice change that simplifies the code AND makes it faster.
The tests look good to me as well. Thanks very much!
@@ -343,6 +335,19 @@ def test_classifier_pred_contrib(output, centers, client):
    client.close(timeout=CLIENT_CLOSE_TIMEOUT)

def test_find_random_open_port(client):
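A hedged sketch of what a test like this could check, based on the PR description's note that the found ports should all differ on a LocalCluster; it assumes lightgbm.dask exposes a _find_random_open_port helper and that client is a Dask client fixture, and the actual test in the PR may differ.

```python
import lightgbm as lgb


def test_find_random_open_port(client):
    # run the port finder once on every worker, concurrently
    worker_address_to_port = client.run(lgb.dask._find_random_open_port)
    found_ports = worker_address_to_port.values()
    # on a LocalCluster every worker lives on the same machine, so each worker
    # must have been given a different port
    assert len(set(found_ports)) == len(found_ports)
```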
excellent test! I really like this
Thanks a lot for all your help with this, James!
This is intended to solve the problem that @jameslamb defined in #3768 by running socket.bind('', 0) with client.run, which finds a random port for each worker. I've included a small test to check that the found ports are indeed different for a LocalCluster. Happy to receive any feedback.
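As a rough illustration of that idea (a minimal sketch, assuming a function name like _find_random_open_port and an example scheduler address; the actual code in the PR may differ):

```python
import socket

from dask.distributed import Client


def _find_random_open_port() -> int:
    """Bind to port 0 so the OS picks a free port, then report that port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        port = s.getsockname()[1]
    # the port is released as soon as the socket closes, so there is a small
    # window before LightGBM binds it again
    return port


client = Client("tcp://127.0.0.1:8786")  # example scheduler address (assumption)

# client.run() executes the function once on every worker, concurrently, and
# returns a dict mapping worker address -> chosen port
worker_address_to_port = client.run(_find_random_open_port)
```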