Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tests][cli] distributed training #4254

Merged
merged 18 commits into from
Jul 4, 2021

Conversation

jmoralez
Copy link
Collaborator

@jmoralez jmoralez commented May 4, 2021

This aims to close contributes to #3841 by proposing a way to run tests on distributed training calling the CLI from python.

Here I define a class DistributedMockup (suggestions for a better name are more than welcome) that implements a fit method that creates the required configuration files, saves the data and then runs the lightgbm binary using a ThreadPoolExecutor. The training logs are taken from the first machine in mlist.txt and are piped to stdout.

There's also a predict method that computes the predictions using the trained model on a single call to the binary (don't think it's necessary here to make a call on each partition).

I tested this on my fork using github actions, however this will run on azure pipelines so I'll be working on that following the very detailed explanation here: #3841 (comment)

@jmoralez
Copy link
Collaborator Author

jmoralez commented May 5, 2021

Hi @jameslamb. Will this:

pytest $BUILD_DIRECTORY/tests || exit -1

trigger the tests?
In that case I think I have to fix the paths, everything right now is designed to work by calling pytest from tests/distributed/

@jmoralez jmoralez marked this pull request as ready for review May 6, 2021 00:29
@jmoralez jmoralez changed the title WIP: [tests][cli] distributed training (fixes #3841) [tests][cli] distributed training (fixes #3841) May 6, 2021
@jmoralez
Copy link
Collaborator Author

jmoralez commented May 6, 2021

This ran successfully (logs) so I think it's ready for review. Right now it's only running two simple tests but could be extended to perform the tests that are run for the dask interface, we could probably borrow most of the data creation and the tests themselves.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thank you for the effort to make this extensible and to fit it into this project's existing continuous integration setup!


Can you please add rules for the files that would be created by these tests to the top-level .gitignore?

I pulled this branch and ran the tests tonight (worked perfectly 🎉 ) , and several files were untracked.

On branch tests/distributed
Your branch is up to date with 'origin/tests/distributed'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	mlist.txt
	predictions.txt
	train.txt
	train0.conf
	train0.txt
	train1.conf
	train1.txt

nothing added to commit but untracked files present (use "git add" to track)

I see that right now, the logic for _write_data() is written like "evenly split the training data between the workers". In my opinion, I think it would be better to have DistributedMockup.fit() expect data to be a list of data chunks. In the few months since lightgbm.dask was first introduced, we've seen a lot of issues related to uneven distribution of the training data across workers (for example #3817, #3882, #4101 (comment), #4206).

If fit() took in List[np.ndarray], then it would be easy to write tests on the behavior of distributed training in situations where, for example, feature distributions are very different on different partitions, or where the dataset sizes are very different.

What do you think?


Can you please add docstrings and type hints (including return type hints) on the methods of DistributedMockup, create_data(), and run_and_log()? I think that will help others to be able to contribute changes to these tests in the future, and will give us a chance of finding subtle bugs with mypy.

As of #3756, we've been trying to encourage the use of type hints in this project's Python code.

Notes for Other Reviewers

I support merging this PR with only a single job on Linux_latest (this is what I asked for in #3841 (comment)) and with only minimal tests like those in test_classifier and test_regressor. I have a preference for setting up the structure here and then incrementally adding tests later, like @jmoralez has proposed.

@jmoralez
Copy link
Collaborator Author

Hi @jameslamb, I've included your comments, I think sending the data already partitioned to the fit method makes a lot of sense.

I noticed that mypy is only ran for the python-package folder in the lint task so I included it in the cli-distributed task to check the _test_distributed.py file.

This job failed in the CI but seems to be testing on windows, so I don't think is related to this PR. Looking forward to your thoughts on these changes.

@jameslamb
Copy link
Collaborator

This job failed in the CI but seems to be testing on windows, so I don't think is related to this PR. Looking forward to your thoughts on these changes.

Yep, I don't think that's related to your changes. It's an error we see occasionally on Appveyor.

C:\Miniconda37-x64\envs\test-env\lib\site-packages\lightgbm\basic.py in _lazy_init(self, data, label, reference, weight, group, init_score, predictor, silent, feature_name, categorical_feature, params)
   1268             self.__init_from_csc(data, params_str, ref_dataset)
   1269         elif isinstance(data, np.ndarray):
-> 1270             self.__init_from_np2d(data, params_str, ref_dataset)
   1271         elif isinstance(data, list) and len(data) > 0 and all(isinstance(x, np.ndarray) for x in data):
   1272             self.__init_from_list_np2d(data, params_str, ref_dataset)
C:\Miniconda37-x64\envs\test-env\lib\site-packages\lightgbm\basic.py in __init_from_np2d(self, mat, params_str, ref_dataset)
   1318             c_str(params_str),
   1319             ref_dataset,
-> 1320             ctypes.byref(self.handle)))
   1321         return self
   1322 
OSError: exception: access violation writing 0xFFFFFFFF93070000
OSError: exception: access violation writing 0xFFFFFFFF93070000

I've manually restarted that build.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, looks great! I left a few more suggested changes, and one question that I think we need help with from other maintainers.

.ci/test.sh Outdated Show resolved Hide resolved
tests/distributed/_test_distributed.py Outdated Show resolved Hide resolved
tests/distributed/_test_distributed.py Outdated Show resolved Hide resolved
@jameslamb jameslamb self-requested a review May 12, 2021 04:08
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thank you!

@shiyu1994 could you also please review? I think you're more familiar with the CLI than I am (and @StrikerRUS is currently reviewing a lot of other PRs 😀 )

@jameslamb jameslamb requested a review from StrikerRUS May 25, 2021 03:42
Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing fundamental comments! Please now check some minor comments below.

tests/distributed/_test_distributed.py Outdated Show resolved Hide resolved
tests/distributed/_test_distributed.py Outdated Show resolved Hide resolved
tests/distributed/_test_distributed.py Outdated Show resolved Hide resolved
tests/distributed/_test_distributed.py Outdated Show resolved Hide resolved
tests/distributed/_test_distributed.py Outdated Show resolved Hide resolved
tests/distributed/predict.conf Outdated Show resolved Hide resolved
Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my previous comments! Here are some new comments about outdated docstrings and .gitignore update.

tests/distributed/_test_distributed.py Outdated Show resolved Hide resolved
tests/distributed/_test_distributed.py Outdated Show resolved Hide resolved
tests/distributed/_test_distributed.py Outdated Show resolved Hide resolved
.gitignore Outdated Show resolved Hide resolved
.gitignore Show resolved Hide resolved
Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for as always valuable contribution! LGTM, except one typo and one simplification.

tests/distributed/_test_distributed.py Outdated Show resolved Hide resolved
tests/distributed/_test_distributed.py Outdated Show resolved Hide resolved
@jmoralez
Copy link
Collaborator Author

@StrikerRUS I've added your latest comments, let me know what you think.

@StrikerRUS
Copy link
Collaborator

@jmoralez Yeah, I saw. Many thanks!
I've already approved this PR (#4254 (review)).

@jameslamb There have been several changes since your latest approval. Would you like to take another look before merging?

@jameslamb
Copy link
Collaborator

@jameslamb There have been several changes since your latest approval. Would you like to take another look before merging?

yes please, will look soon

@jameslamb jameslamb self-requested a review June 25, 2021 03:25
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good to me. I just left some minor suggestions.

And one more general one... Can you please replace all uses of pathlib with os.path?

This project does not use pathlib anywhere else (you can confirm this with git grep pathlib), and keeping such things consistent reduces the maintenance burden for maintainers and friction for new contributors.

tests/distributed/_test_distributed.py Outdated Show resolved Hide resolved
tests/distributed/_test_distributed.py Outdated Show resolved Hide resolved
@jmoralez
Copy link
Collaborator Author

I think pathlib should be considered instead of os.path, for example this:

dir_path = os.path.dirname(os.path.realpath(__file__))
if os.path.isfile(os.path.join(dir_path, 'VERSION.txt')):
with open(os.path.join(dir_path, 'VERSION.txt')) as version_file:

becomes this:

dir_path = Path(__file__).parent
version_file = dir_path / 'VERSION.txt'
if version_file.is_file():
    with version_file.open() as f:

@StrikerRUS
Copy link
Collaborator

I think pathlib should be considered instead of os.path, for example this:

I believe we can start migrating from os.path to pathlib similarly to #4136. @jameslamb WDYT?

@jameslamb
Copy link
Collaborator

I believe we can start migrating from os.path to pathlib

Unless there is a functional reason to prefer pathlib (e.g. efficiency, compatibility), I don't support transitioning this project's Python code from os.path to pathlib. I don't personally think that the example in #4254 (comment) is very compelling.

Is there some functional reason to prefer it?

similarly to #4136

I've observed that some other "good first issue" types of efforts (like #4136) have attracted contributions that required a lot of maintainer effort. In my opinion, that is worthwhile if the project gets some functional benefit from the changes, but not something that we should encourage for a change whose primary benefit, as I understand it, is a style preference.

If you'd like to make this change, @StrikerRUS , I won't oppose that. But I think the issue to document it should not use the good first issue tag, and that maintainers like you should do some of the work and not worry about preserving it for new contributors.

@StrikerRUS
Copy link
Collaborator

Is there some functional reason to prefer it?

It makes code much clearer and it's portable across platforms. With os.path module, ideally, we should use os.pardir instead of .. and os.path.sep instead of / inside os.path.join(), for example. But we often forget about it:

os.path.join(curr_path, '../../'),
os.path.join(curr_path, '../../python-package/lightgbm/compile'),
os.path.join(curr_path, '../../python-package/compile'),
os.path.join(curr_path, '../../lib/')]

One good article:
https://treyhunner.com/2018/12/why-you-should-be-using-pathlib/

I've observed that some other "good first issue" types of efforts (like #4136) have attracted contributions that required a lot of maintainer effort.

To be honest, I've never been a fan of "good first issue"s due to the reasons you've written above.

If you'd like to make this change, @StrikerRUS , I won't oppose that.

Great! I'll prepare an issue then.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks very much for the help!

@StrikerRUS
Copy link
Collaborator

@jameslamb Can we merge this?

@jameslamb
Copy link
Collaborator

@jameslamb Can we merge this?

yep! Sorry, should have merged it right after my review. I got distracted by something else and forgot to come back to this PR. Thanks for the @.

@jameslamb jameslamb merged commit b699fa6 into microsoft:master Jul 4, 2021
@jmoralez jmoralez deleted the tests/distributed branch July 8, 2021 02:40
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants