[dask] [python-package] Search for available ports when setting up network (fixes #3753) #3766
Conversation
lightgbm_ports = set()
worker_ip_to_port = {}
for worker_address in worker_addresses:
    port = client.submit(
        func=_find_open_port,
        workers=[worker_address],
        worker_ip=urlparse(worker_address).hostname,
        local_listen_port=local_listen_port,
        ports_to_skip=lightgbm_ports
    ).result()
    lightgbm_ports.add(port)
    worker_ip_to_port[worker_address] = port
I want to draw reviewers' attention to this block. This setup checks for ports to use sequentially: it first runs _find_open_port() on worker 1's machine, then worker 2's machine, then worker 3's, and so on.
That means there is a small startup cost. This process took about 5 seconds per worker when I was running a cluster on AWS Fargate, so you might expect around 1 minute of startup time for a cluster with 20 workers.
I don't think this is too bad, and I think it's a fairly clean way to ensure that if you have multiple worker processes on the same machine, they're assigned different ports in the network parameters.
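For readers who want the mechanics: the diff above only shows the call site, so here is a minimal sketch of what a per-worker port probe along the lines of _find_open_port() could look like. The function name and parameters come from the diff; the body below is an assumption, not the PR's actual implementation.

import socket


def _find_open_port(worker_ip, local_listen_port, ports_to_skip):
    """Sketch of a port probe run on one worker (assumed implementation)."""
    max_tries = 1000
    for i in range(max_tries):
        candidate = local_listen_port + i
        # skip ports already claimed by other workers on the same machine
        if candidate in ports_to_skip:
            continue
        try:
            # if the bind succeeds, nothing else is currently using this port
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.bind((worker_ip, candidate))
            return candidate
        except OSError:
            continue
    raise RuntimeError("Could not find an open port on worker %s" % worker_ip)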
This is probably overkill, but if you wanted to speed it up, maybe you could group workers by IP address and run the search in parallel across separate nodes (IP addresses)? Then the search on each node gathers as many ports as there are workers on that server.
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlparse
import socket

from dask.distributed import Client, wait


def _find_ports_for_workers(client: Client, worker_addresses: Iterable[str], local_listen_port: int) -> Dict[str, int]:
    worker_ip_to_port = {}
    # group workers by IP address
    worker_hostname_map = defaultdict(list)
    for worker_address in worker_addresses:
        hostname = urlparse(worker_address).hostname
        worker_hostname_map[hostname].append(worker_address)
    # run the search for ports once per node (IP address), in parallel
    lgbm_machines = []
    for hostname in worker_hostname_map:
        machines_on_node = client.submit(
            func=_port_search_for_ip,
            worker_ip=hostname,
            n_desired_ports=len(worker_hostname_map[hostname]),
            local_listen_port=local_listen_port,
            # pin the search to a worker living on this node
            workers=worker_hostname_map[hostname]
        )
        # add more futures
        lgbm_machines.append(machines_on_node)
    # wait for the search across nodes to complete
    _ = wait(lgbm_machines)
    lgbm_machines = client.gather(lgbm_machines)
    # map each worker address to one "ip:port" result from its node
    for hostname, machines in zip(worker_hostname_map, lgbm_machines):
        for worker_address, machine in zip(worker_hostname_map[hostname], machines):
            worker_ip_to_port[worker_address] = int(machine.split(':')[1])
    return worker_ip_to_port


def _port_search_for_ip(worker_ip: str, local_listen_port: int, n_desired_ports: int) -> List[str]:
    max_tries = 1000
    out_ports = []
    found_all_ports = False
    for i in range(max_tries):
        out_port = local_listen_port + i
        try:
            # bind succeeds only if the port is currently free on this machine
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.bind((worker_ip, out_port))
                s.listen()
                out_ports.append(out_port)
            if len(out_ports) == n_desired_ports:
                found_all_ports = True
                break
        except OSError:
            continue
    if not found_all_ports:
        raise RuntimeError("failed to find %d open ports on %s" % (n_desired_ports, worker_ip))
    return [worker_ip + ':' + str(x) for x in out_ports]
Otherwise, the constraints of lightgbm + dask are that: 1) lightgbm requires that each worker knows about all of the other workers, while at the same time 2) prior to distributing work to each worker, we need to know which workers can listen for lightgbm traffic on which ports. Seems like your fix nails this!
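To make constraint (1) concrete: the per-worker port map eventually has to be turned into LightGBM's machines network parameter, a comma-separated ip:port list that every worker receives. A rough sketch of that final step, assuming a worker_ip_to_port dict like the one built in the diff above (the helper name and exact parameter handling here are assumptions, not the PR's actual code):

from urllib.parse import urlparse


def _build_network_params(worker_ip_to_port, this_worker_port):
    # every worker gets the full list of peers...
    machines = ','.join(
        '%s:%d' % (urlparse(address).hostname, port)
        for address, port in worker_ip_to_port.items()
    )
    # ...plus the single port it should listen on itself
    return {
        'machines': machines,
        'local_listen_port': this_worker_port,
        'num_machines': len(worker_ip_to_port)
    }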
This is probably overkill, but if you wanted to speed it up, maybe you could group workers by IP address and run the search in parallel across separate nodes (IP addresses)? Then the search on each node gathers as many ports as there are workers on that server.
That's a great idea, I like it! I think I'll push it to a later PR though, since it would take me some time to test and I really want to unblock you for #3708
GH needs more emojis but if I could, I'd give you the prayer "bless up" hands
haha thank you
I just documented this in #3768, thanks for taking the time to type out that possible implementation. Super helpful.
hmmm, never saw this error before. Spotted on the
@jameslamb Just to clarify: I have literally zero experience with Dask. Sorry.
However, all your thoughts sound reasonable to me, so I think I can approve this PR.
Just please take a look at some general Python comments I left.
@StrikerRUS it's ok, I understand! I'm not sure if any other maintainers have Dask experience. If there's ever a Dask PR that you feel is large or controversial, we can probably get some reviewing help from our friends at XGBoost, or I can ask others in the Dask community.
Asking for help from outside reviewers, or maybe even finding someone interested and experienced in both LightGBM and Dask who could collaborate on improvements to the Dask package on a regular basis and support you, would be just awesome.
@jameslamb Thanks for addressing comments!
I've moved my test setup out into its own repo: https://github.com/jameslamb/lightgbm-dask-testing. I'm going to test this on
happy to say this worked on
See #3753 for additional details.
The Problem
By default, LightGBM's Python package is built to use TCP sockets for distributed learning (-DUSE_SOCKET=1). When performing distributed learning, LightGBM requires a list of machines and ports, as is documented here.
lightgbm.dask currently creates this list by assigning a fixed range of ports. Assuming that you pass the parameter local_listen_port=12400, for example, the first worker will open a socket on {first_worker_ip}:12400, the second on {second_worker_ip}:12401, and so on.
This approach is fragile, since it assumes that the ports you request will definitely be available. It can lead to "binding port ... failed" errors like the one reported in #3753.
This can happen either because of connections not quite cleaned up after previous LightGBM training runs (#3753 (comment)) or because of any other processes running on that machine which might already be using a port that LightGBM wants.
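For illustration only, the fixed-range assignment described above amounts to something like the following. This is a sketch of the pre-PR behavior, not the actual lightgbm.dask code; the function name is hypothetical.

from typing import Dict, Iterable


def _assign_fixed_ports(worker_addresses: Iterable[str], local_listen_port: int = 12400) -> Dict[str, int]:
    # assumes local_listen_port, local_listen_port + 1, ... are all free on the
    # relevant machines, which is exactly the fragile assumption this PR removes
    return {
        address: local_listen_port + i
        for i, address in enumerate(worker_addresses)
    }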
Changes in this PR
- lightgbm.dask will now search for open ports on each worker when constructing the network. It uses local_listen_port as a starting point, but doesn't assume that that port or any other specific port is available.
- The machines / port parameters are now set up once in _train(), before setting up the network. On master today, the code to set up those parameters is re-run in _train_part() (once on each worker) even though it produces the same results.
- The fix was exercised by calling fit() with a Dask model many times in a row with no sleeping in between...this is the primary way that the bug in #3753 ([dask] [python-package] DaskLGBMRegressor training error: 'binding port 12402 failed') shows up.
Notes for reviewers
I tested this with both a LocalCluster with 3 workers and a dask_cloudprovider.aws.FargateCluster, using the steps described in #3753 (comment) and the notebooks in https://github.com/jameslamb/talks/tree/main/recent-developments-in-lightgbm. I think I'll set up a separate small repo with my testing setup so that this can be more easily re-used by contributors in the future. This can be tested with the setup in https://github.com/jameslamb/lightgbm-dask-testing.
I captured the screenshot below to show some more evidence that this worked. I had the code print out its decision for ports on each worker, and I ran training 10 times in a row in a for loop like this:
You can see that on some runs, it chose 12400, 12401, and 12402. But on others, LightGBM chose to skip up a few ports. That's the "search until you find one" stuff working! The picture on the right side of the screenshot shows the individual training runs succeeding one after another.
Since this worked well on both LocalCluster (where all workers have the same IP address) and FargateCluster (where each worker is a different physical machine, with a unique IP address), I feel confident in this fix.
Special thanks to @imatiach-msft and @ffineis for their help figuring this out!