Adding this to #2302 and closing it, per our practice for managing features. Anyone is welcome to contribute this feature! Please leave a comment here if you're interested in contributing it.
Summary
In LightGBM distributed training (documented here and here), each worker needs access to a list of all the other workers' IPs, plus a port it can use to communicate with them.

#3766 updated lightgbm.dask to search for open ports on each worker when creating this list, instead of just assuming a fixed range of ports would be available. This works well, but it's a blocking operation that has to be done sequentially, so it slows down training.
The following pseudocode illustrates the process:
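A rough local sketch of that sequential search (hypothetical helper names; in the real lightgbm.dask code the search is submitted to each worker via the Dask client, and ports already assigned to other workers on the same IP are skipped):

```python
import socket

def find_open_port():
    """Ask the OS for a free TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

def find_ports_for_workers(worker_addresses):
    """Pick one port per worker process, one worker at a time.

    Each iteration blocks until the previous one finishes, so this
    step is O(num_worker_processes) in wall-clock time.
    """
    worker_address_to_port = {}
    for address in worker_addresses:
        # blocking call, executed sequentially for every worker process
        worker_address_to_port[address] = find_open_port()
    return worker_address_to_port

ports = find_ports_for_workers(["tcp://127.0.0.1:8786", "tcp://127.0.0.1:8787"])
```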
This is done sequentially because multiple Dask worker processes can live on the same IP address. If you use a LocalCluster with 3 workers, for example, all 3 of those workers will run on your local machine. Or if you use a multi-machine cluster with nprocs > 1, multiple worker processes will run on each physical machine in the cluster.

As a result of this change, the time complexity of that "find open ports" step is O(num_worker_processes). If instead the search were done only once per IP address, it could safely be parallelized across machines, and the time complexity would be more like O(nprocs).

To close this issue, change lightgbm.dask._find_ports_for_workers() (LightGBM/python-package/lightgbm/dask.py, line 72 in f6d2dce) so that it performs the port search once per IP address and runs those per-machine searches in parallel.
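The proposed change could be sketched as grouping worker addresses by host and finding all of a host's ports in one step (a local illustration with hypothetical names; in lightgbm.dask each per-host search would run remotely on that host and the searches could be submitted concurrently):

```python
import socket
from collections import defaultdict
from urllib.parse import urlparse

def find_n_open_ports(n):
    """Find n distinct open ports on this machine.

    Sockets are held open while binding so the OS cannot hand
    back the same port twice.
    """
    sockets, ports = [], []
    for _ in range(n):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind(("", 0))
        sockets.append(s)
        ports.append(s.getsockname()[1])
    for s in sockets:
        s.close()
    return ports

def find_ports_by_host(worker_addresses):
    """Group worker processes by IP, then search once per host.

    The per-host searches are independent of one another, so in the
    Dask version they could run in parallel across machines; the
    wall-clock cost becomes roughly O(nprocs) instead of
    O(num_worker_processes).
    """
    host_to_workers = defaultdict(list)
    for address in worker_addresses:
        host_to_workers[urlparse(address).hostname].append(address)

    worker_address_to_port = {}
    for host, workers in host_to_workers.items():
        # in lightgbm.dask this body would execute remotely on `host`
        ports = find_n_open_ports(len(workers))
        worker_address_to_port.update(zip(workers, ports))
    return worker_address_to_port
```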
Motivation
This optimization would reduce the overhead introduced by using Dask for distributed training, which should make training faster.
References
This could be done following something like the code @ffineis provided in #3766 (comment).