[tests][cli] distributed training #4254
Merged
Changes from 5 commits

Commits (18):
- 7356527 include distributed tests (jmoralez)
- 711009c remove github action file (jmoralez)
- 3b9aac0 merge master (jmoralez)
- e3c96a8 try CI (jmoralez)
- eabb3b7 build shared library and fix linting error (jmoralez)
- 9a4ad3b ignore files created for testing. add type hints and check with mypy.… (jmoralez)
- 1d108e1 lint (jmoralez)
- 9f51ad8 use pre_partition and write separate model files. remove mypy (jmoralez)
- 17ee6ab update docs (jmoralez)
- a424374 remove ci. lower rtol. pass num_machines in config (jmoralez)
- f5e9d49 write predict.conf in the predict method. more robust port setup. use… (jmoralez)
- f7c6fdd add paths to tests and binary. remove lgb dependency. update .igtignore. (jmoralez)
- c964fdf lint (jmoralez)
- 56113de allow to pass executable dir as argument to pytest (jmoralez)
- 83ccf6a pass execfile to pytest instead of execdir (jmoralez)
- e84df75 add suggestions (jmoralez)
- 45bfc79 use os.path and add type hint to predict_config (jmoralez)
- d4cbb7a Update tests/distributed/_test_distributed.py (jameslamb)
tests/distributed/_test_distributed.py (new file, 110 lines):

import copy
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from sklearn.datasets import make_blobs, make_regression
from sklearn.metrics import accuracy_score

import lightgbm as lgb


def create_data(task, n_samples=1_000):
    if task == 'binary-classification':
        centers = [[-4, -4], [4, 4]]
        X, y = make_blobs(n_samples, centers=centers, random_state=42)
    elif task == 'regression':
        X, y = make_regression(n_samples, n_features=4, n_informative=2, random_state=42)
    dataset = np.hstack((y[:, None], X))
    return dataset


def run_and_log(cmd):
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    for c in iter(lambda: process.stdout.read(1), b''):
        sys.stdout.buffer.write(c)


class DistributedMockup:
    default_config = {
        'task': 'train',
        'output_model': 'model.txt',
        'machine_list_file': 'mlist.txt',
        'tree_learner': 'data',
        'force_row_wise': True,
        'verbose': 0,
        'num_boost_round': 20,
        'num_leaves': 15,
        'num_threads': 2,
    }

    def __init__(self, config={}, n_workers=2):
        self.config = copy.deepcopy(self.default_config)
        self.config.update(config)
        self.config['num_machines'] = n_workers
        self.n_workers = n_workers

    def worker_train(self, i):
        cmd = f'./lightgbm config=train{i}.conf'.split()
        if i == 0:
            return run_and_log(cmd)
        subprocess.run(cmd)

    def _set_ports(self):
        self.listen_ports = [lgb.dask._find_random_open_port() for _ in range(self.n_workers)]
        with open('mlist.txt', 'wt') as f:
            for port in self.listen_ports:
                f.write(f'127.0.0.1 {port}\n')

    def _write_data(self, data):
        np.savetxt('train.txt', data, delimiter=',')
        for i, partition in enumerate(np.array_split(data, self.n_workers)):
            np.savetxt(f'train{i}.txt', partition, delimiter=',')

    def fit(self, data):
        self._write_data(data)
        self.label_ = data[:, 0]
        self._set_ports()
        futures = []
        with ThreadPoolExecutor(max_workers=self.n_workers) as executor:
            for i in range(self.n_workers):
                self.write_train_config(i)
                futures.append(executor.submit(self.worker_train, i))
            _ = [f.result() for f in futures]

    def predict(self):
        cmd = './lightgbm config=predict.conf'.split()
        run_and_log(cmd)
        y_pred = np.loadtxt('predictions.txt')
        return y_pred

    def write_train_config(self, i):
        with open(f'train{i}.conf', 'wt') as f:
            f.write(f'local_listen_port = {self.listen_ports[i]}\n')
            f.write(f'data = train{i}.txt\n')
            for param, value in self.config.items():
                f.write(f'{param} = {value}\n')


def test_classifier():
    data = create_data(task='binary-classification')
    params = {
        'objective': 'binary',
    }
    clf = DistributedMockup(params)
    clf.fit(data)
    y_probas = clf.predict()
    y_pred = y_probas > 0.5
    assert accuracy_score(clf.label_, y_pred) == 1.


def test_regressor():
    data = create_data(task='regression')
    params = {
        'objective': 'regression',
    }
    reg = DistributedMockup(params)
    reg.fit(data)
    y_pred = reg.predict()
    np.testing.assert_allclose(y_pred, reg.label_, rtol=0.5, atol=50.)
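Per the later commits in this PR ("allow to pass executable dir as argument to pytest", "pass execfile to pytest instead of execdir"), the path to the LightGBM CLI binary eventually becomes a pytest option rather than the hard-coded ./lightgbm used above. A minimal sketch of how such an option could be wired up in a conftest.py; the --execfile option name and the executable fixture are assumptions based on the commit messages, not the PR's actual code:

import pytest


def pytest_addoption(parser):
    # Register a command-line option so a test run can point at any CLI binary.
    parser.addoption(
        '--execfile',
        action='store',
        default='../../lightgbm',  # illustrative default location of the compiled CLI
        help='Path to the lightgbm CLI executable used by the distributed tests.',
    )


@pytest.fixture(scope='session')
def executable(request):
    # Tests (or DistributedMockup) would receive this path instead of assuming ./lightgbm.
    return request.config.getoption('--execfile')

A run might then look like pytest tests/distributed/ --execfile=/path/to/lightgbm (again, the option name is only an assumption based on the commit messages).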
predict.conf (new file, 4 lines), the prediction config used by DistributedMockup.predict():
task = predict
data = train.txt
input_model = model.txt
output_result = predictions.txt
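Commit f5e9d49 in the list above ("write predict.conf in the predict method") indicates this static file is later generated by the predict method instead of being checked in. A rough sketch of what that generation could look like, mirroring the four keys above; the helper name and its parameters are hypothetical:

def write_predict_config(data_path='train.txt', model_path='model.txt', output_path='predictions.txt'):
    # Write a minimal LightGBM CLI prediction config equivalent to the checked-in predict.conf.
    with open('predict.conf', 'wt') as f:
        f.write('task = predict\n')
        f.write(f'data = {data_path}\n')
        f.write(f'input_model = {model_path}\n')
        f.write(f'output_result = {output_path}\n')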
Sorry for writing this so late!
Do you @jameslamb really think that parallel tests should be executed as a separate CI job? I believe that environment initialization overhead is quite big in this case.
From my point of view, we should add new CI jobs only when the new job requires something special that we haven't done before as part of our CI scripts. For example:
- `dask` is tested within our ordinary (mostly Python-package) tests (we just added some dependencies to the conda env); we don't have a separate `dask` CI job
- `mpi` requires an MPI installation and is mutually exclusive with the ordinary installation, hence we have a separate `mpi` CI job
- `gpu` requires installation of `boost` and GPU drivers and is mutually exclusive with the ordinary installation, hence we have a separate `gpu` CI job
- `regular`, `sdist` and `bdist` jobs are present because they all test different ways of installing the Python package and its correctness
- `swig` only compiles the SWIG wrapper and doesn't run any tests (even for that scenario we used to compile SWIG artifacts within the `sdist` job before we got self-running agents to balance CI loading)
- `cpp_tests` requires special CMake flags and doesn't require Python ecosystem initialization, hence we have a separate `cpp-tests` CI job

Also, please note that C API tests are run with each Python-package-related CI job.
https://github.com/microsoft/LightGBM/blob/master/tests/c_api_test/test_.py
I believe that basic distributed tests which don't require an MPI installation or other special things can be run with each Python-package-related CI job. Please share your opinion on this.
Specifically, the most recent run of this job took 4m15s, about 1m40s of which was spent compiling `lightgbm` and running the tests (based on timestamps in the logs).

Reference: https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=10053&view=logs&j=cd496dfc-df55-5b41-9968-b20563c04f1a

They are "basic" today because I've asked @jmoralez to focus this first PR on only setting up the structure and not on adding a lot of additional tests. But I expect that a lot more tests will be added in the future. Every time we see an issue with `lightgbm.dask` or from `mmlspark` and wonder "is this a problem at the level of those interfaces or something from LightGBM's core library?", I think we'll want to add tests here.

I think it's possible that a few months from now, these distributed training tests might take 3 or 4 minutes to run (just the test part, not all of the environment setup). Instead of adding 3 to 4 minutes to each Python CI job, I thought it made sense to add only a single job as a starting point.
If you don't find that convincing, I don't feel strongly enough about it to block the PR until we agree. If you still think these tests should be run on every Python job instead of as a standalone, please advise @jmoralez on specifically what changes should be made (for example).
I do support this.
For sure, that was the reason I created #3841.
You are right that separate CI jobs have their own advantages, but I think the main benefit of embedding distributed tests into the CI jobs with the main tests is that you get good coverage and balance across different OSes, compilers and environments in general. You don't need to think about where to set up a new job (Azure, GitHub Actions, Appveyor), what compiler to choose on one CI service and what compiler on another, or invent new `if` rules for `setup.py` or anything else. It is already done and you get it for free.

I believe it's a bad idea to merge PRs that are "convenient" for only one of the maintainers. We are not speaking about just theoretical things, we are discussing this particular PR and the changes it proposes to merge into the `master` branch. I don't agree that this is "delaying" the PR somehow. This is what PRs are designed for: discussing the changes they contain. With new info uncovered, some previous statements may become irrelevant, for example. But this should be discussed before the merge to prevent code reverts in the future, I suppose.
I don't disagree. Sorry if it sounded like I was complaining, that was not my intention.
In this case, @jmoralez WILL have to think about those things, all at once, right now. Since distributed training with the CLI isn't currently tested on Mac or Windows, for example, if failures show up when adding those tests to jobs for those operating systems, then they'll have to be fixed (or ignored with `if` or `pytest.skip`) before this can be merged. The same goes for other combinations, like different Python versions.

So, said another way, I have a preference for taking smaller, incremental steps for adding this particular type of testing. I prefer adding this type of testing in a single job right now because then two types of work (changing which configurations the job runs on, adding new tests) can happen in parallel.
I think you are saying you'd prefer to add broader coverage in this first PR. I am ok with that and happy to support it.
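For reference, the `pytest.skip` route mentioned above is straightforward to apply per platform. A minimal sketch, assuming the tests would only be expected to pass on Linux at first; the condition and reason string are illustrative, not part of this PR:

import sys

import pytest


# Hypothetical marker applied to a test like test_classifier above:
# skip on platforms where CLI distributed training is not yet exercised.
@pytest.mark.skipif(
    sys.platform in ('darwin', 'win32'),
    reason='distributed CLI training tests currently run only on Linux',
)
def test_classifier():
    ...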
Oh no, definitely no! Only in case he'd love to do it.
As an example, please remember adding cpp tests with GTest. First, the possibility of compiling the tests was added in #3555 and was checked for correctness: #3555 (comment). Then, we actually added CI jobs with cpp tests: #4166. And right now different tests are proposed to be added in pending PRs. I think this is a good example of this incremental approach.

As the core code of this PR is independent of the way we will execute it, I think we can go in the same direction here. Leave only the testing code in this PR and add CI jobs calling it in a follow-up one(s). WDYT?
Oh ok, I guess I misunderstood. I thought that your comment at the beginning of this thread was saying that you wanted this PR to add these tests to all existing Python jobs. I think now that maybe you just meant "for us to consider #3841 fixed, it should be added to all of those jobs".
Sure! Let's do that, I agree. So then I think this PR would need the following changes:
- remove "fixes" from the PR title and "close" from the PR description
- remove the changes to `.vsts-ci.yml` and `.ci/test.sh`
@StrikerRUS do you agree?
Yeah, #3841 should be "fixed" only after adding some meaningful number of tests for distributed code.
Totally agree.
Great. @jmoralez when you have time, could you please remove the changes from `.vsts-ci.yml` and `.ci/test.sh`?

I've removed "fixes" from the PR title.