[tests][dask] add scikit-learn compatibility tests (fixes #3894) #3947
Conversation
Thanks very much for working on this! I've updated the PR title and description so that they're descriptive of the work. Linking to the relevant issue in the description helps reviewers to understand the context of why a pull request has been opened. Using an informative title makes the PR easier to understand in the git history and as a changelog entry: https://github.com/microsoft/LightGBM/releases/tag/untagged-edc0a3c487f27c6fbdb9
I'd be happy to review this code once you sign the Contributor License Agreement. Please let me know if you have any questions about it.
No problem, I had a lot of fun!
I'm surprised by that, but don't worry about it. I was just showing that we use some automation that automatically adds one changelog entry for each pull request, with the pull request title as the text.
Can you please explain specifically what you mean by "causing problems with the client"?
I left some suggestions that might help with the failing tests in CI.
Are you able to test this locally? That might be faster for you than relying on our CI for feedback. If you have trouble building lightgbm locally or running the tests locally, let me know and I can help.
```python
@pytest.mark.parametrize(
    "estimator, check",
    _generate_checks_per_estimator(_yield_all_checks, _tested_estimators()),
)
def test_sklearn_integration(estimator, check):
```
Please pass the `client` fixture into this test and the one below. It should fix errors like this that I see in CI:

```
E ValueError: No clients found
```
Are you referring to test_dask.py? I have built lightgbm on WSL Ubuntu 18.04, and I can run test_dask.py and test_sklearn.py.
Co-authored-by: James Lamb <jaylamb20@gmail.com>
Several of the tests output this error when I run locally:

```
    def deepcopy(x, memo=None, _nil=[]):
        """Deep copy operation on arbitrary Python objects.

        See the module's __doc__ string for more info.
        """
        if memo is None:
            memo = {}

        d = id(x)
        y = memo.get(d, _nil)
        if y is not _nil:
            return y

        cls = type(x)

        copier = _deepcopy_dispatch.get(cls)
        if copier is not None:
            y = copier(x, memo)
        else:
            if issubclass(cls, type):
                y = _deepcopy_atomic(x, memo)
            else:
                copier = getattr(x, "__deepcopy__", None)
                if copier is not None:
                    y = copier(memo)
                else:
                    reductor = dispatch_table.get(cls)
                    if reductor:
                        rv = reductor(x)
                    else:
                        reductor = getattr(x, "__reduce_ex__", None)
                        if reductor is not None:
>                           rv = reductor(4)
```
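The traceback above shows `copy.deepcopy` falling back to `__reduce_ex__` (i.e. pickling) for any type that has no `__deepcopy__` method and no entry in copy's dispatch table. Objects that cannot be pickled, such as an `asyncio.Task` or a thread lock, then raise `TypeError`. A minimal stdlib reproduction of the same fallback path, using a thread lock since it needs no event loop:

```python
import copy
import threading

# A thread lock, like an asyncio.Task, has no __deepcopy__ and no entry in
# copy's dispatch table, so deepcopy falls back to __reduce_ex__ (pickling),
# which raises TypeError for unpicklable objects.
lock = threading.Lock()

try:
    copy.deepcopy(lock)
except TypeError as exc:
    print(f"deepcopy failed: {exc}")
```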
```python
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(('127.0.0.1', 12400))
```
What is the purpose of these `socket.socket()` lines? Are you just checking that this port is available? `lightgbm.dask` already does that automatically, so this is not necessary. Can you please remove these?
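For context on why binding a hard-coded port is fragile: the usual way to find a free port is to bind port 0 and let the OS pick an ephemeral one. This is only an illustration of the general pattern, not how `lightgbm.dask` implements its internal port search:

```python
import socket

def find_open_port() -> int:
    """Ask the OS for a currently-free ephemeral port by binding port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 = let the OS choose
        return s.getsockname()[1]

port = find_open_port()
print(port)
```

Binding a fixed port like 12400 can collide with another process (or another CI job), while port 0 never does.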
Draft Releases can be seen only by maintainers.
oh didn't realize! I thought they could only be EDITED by maintainers. Ok, makes sense.
Ok, do you need help debugging?
I'll try running this myself a bit. I don't know what the internals of those checks imported from

My understanding is that one of these functions is run in

LightGBM/tests/python_package_test/test_sklearn.py Lines 1165 to 1168 in 84c4b75

I looked at `parametrize_with_checks()` and the checks on their GitHub, and it seems like most of them call `clone()`, which copies the passed-in estimator and also calls `copy.deepcopy()` on the original estimator's params before running each test on a cloned estimator. I think they are using `copy.deepcopy()` from the Python standard library, and I think that is behind the `TypeError: cannot pickle '_asyncio.Task' object`.
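To make the `clone()` connection concrete: scikit-learn's `clone()` roughly reads `get_params()`, deep-copies the values, and constructs a fresh estimator of the same class, which is where `copy.deepcopy()` enters. A stdlib-only sketch of that behavior, using a hypothetical `DummyEstimator` rather than a real scikit-learn class:

```python
import copy

class DummyEstimator:
    """Hypothetical estimator following the get_params contract."""
    def __init__(self, alpha=1.0, tags=None):
        self.alpha = alpha
        self.tags = tags if tags is not None else []

    def get_params(self):
        return {"alpha": self.alpha, "tags": self.tags}

def simple_clone(estimator):
    """Rebuild an unfitted estimator from a deep copy of its params,
    mirroring what sklearn.base.clone does before each check runs."""
    params = copy.deepcopy(estimator.get_params())
    return type(estimator)(**params)

est = DummyEstimator(alpha=0.5, tags=["a"])
cloned = simple_clone(est)
# The clone shares no mutable state with the original.
print(cloned.alpha, cloned is est, cloned.tags is est.tags)
```

If any param value holds something unpicklable (such as a live client or task), the `copy.deepcopy()` step is exactly where the `TypeError` would surface.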
ok @imjwang I did some investigation. There is a lot here, sorry. Please don't change anything in this PR until @StrikerRUS also gives an opinion on my comments below.

### Why so many of the tests in this PR are failing

I think we have to skip any of the tests from https://github.com/scikit-learn/scikit-learn/blob/31b34b560de57a049dd435dccc55112271322370/sklearn/utils/estimator_checks.py that involve calling

Many of the tests fail with errors like this

or

### What I think should be done in this PR

So I think that for now, the best we can do is specifically select the checks that don't have

It's not a lot, but at least it will catch a few things. One nice side-effect of this is that the code for different versions of scikit-learn can be removed.

```python
import sklearn.utils.estimator_checks as sklearn_checks

_check_names = [
    "check_estimator_get_tags_default_keys",
    "check_get_params_invariance",
    "check_set_params"
]

sklearn_checks_to_run = []
for check_name in _check_names:
    check_func = getattr(sklearn_checks, check_name, None)
    if check_func:
        sklearn_checks_to_run.append(check_func)


def _tested_estimators():
    for Estimator in [lgb.DaskLGBMClassifier, lgb.DaskLGBMRegressor]:
        yield Estimator()


@pytest.mark.parametrize("estimator", _tested_estimators())
@pytest.mark.parametrize("check", sklearn_checks_to_run)
def test_sklearn_integration(estimator, check, client):
    estimator.set_params(local_listen_port=18000, time_out=5)
    name = type(estimator).__name__
    check(name, estimator)
    client.close(timeout=CLIENT_CLOSE_TIMEOUT)


# this test is separate because it takes a not-yet-constructed estimator
@pytest.mark.parametrize("estimator", list(_tested_estimators()))
def test_parameters_default_constructible(estimator):
    name, Estimator = estimator.__class__.__name__, estimator.__class__
    sklearn_checks.check_parameters_default_constructible(name, Estimator)
```

I tested this and the tests passed for both

### How we can get better coverage of scikit-learn compatibility in the future

I've opened dask/dask-ml#796 in

### Separate Question

I explicitly omitted

```python
import lightgbm as lgb
from sklearn.utils.estimator_checks import check_no_attributes_set_in_init

check_no_attributes_set_in_init("LGBMClassifier", lgb.LGBMClassifier())
```
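For readers unfamiliar with what `check_no_attributes_set_in_init` verifies: after construction, an estimator instance should have no attributes beyond its `__init__` parameters. A rough stdlib sketch of that idea (the function name and the `GoodEstimator`/`BadEstimator` classes are illustrations, not scikit-learn's actual implementation, which handles more edge cases):

```python
import inspect

def assert_no_extra_attributes_set_in_init(estimator_cls):
    """Rough sketch of sklearn's check_no_attributes_set_in_init: after
    construction, instance attributes should match __init__'s parameters."""
    init_params = {
        name for name in inspect.signature(estimator_cls.__init__).parameters
        if name not in ("self", "args", "kwargs")
    }
    instance = estimator_cls()
    extra = set(vars(instance)) - init_params
    assert not extra, f"attributes set outside __init__ params: {sorted(extra)}"

class GoodEstimator:
    def __init__(self, alpha=1.0):
        self.alpha = alpha

class BadEstimator:
    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.fitted_ = None  # not an __init__ parameter -> check fails

assert_no_extra_attributes_set_in_init(GoodEstimator)  # passes silently

try:
    assert_no_extra_attributes_set_in_init(BadEstimator)
except AssertionError as exc:
    print(exc)
```

An estimator whose `__init__` accepts `**kwargs` and derives extra attributes from them, as discussed above, would fail a check of this shape.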
This test is skipped:

LightGBM/python-package/lightgbm/sklearn.py Lines 498 to 502 in 4ae5949

scikit-learn/scikit-learn#16241

Ah, I see! The majority of tests pass. TBH, I don't think that applying a few separate tests from the whole scikit-learn test suite makes a lot of sense... Checking "compatibility" in such a way may be confusing and give us false confidence. I believe it's better to consider the Dask estimators not compatible with the scikit-learn API for now.
Ah, missed that! Didn't think to look for testing-specific stuff in the sklearn module itself. Got it.
I think that we should keep this handful of tests. Some of them capture things that you had to spend time and energy teaching me on #3883. Having those checks in tests might save us similar reviewing effort in the future: if contributors propose a pull request that fails one of those checks, the tests will tell them what went wrong and what they should fix. @imjwang could you please update this PR to latest
@jameslamb Yes, sure thing
The current code fails the lint test for imports and I'm very confused. The only line I added was
oh sorry! We very very recently added

Run the following to fix it:

```shell
pip install isort
isort .
```
looks great, thanks for the contribution! I'd like to request one more review from @StrikerRUS before we merge this.
```python
_check_names = [
    "check_estimator_get_tags_default_keys",
    "check_get_params_invariance",
    "check_set_params"
]

sklearn_checks_to_run = []
for check_name in _check_names:
    check_func = getattr(sklearn_checks, check_name, None)
    if check_func:
        sklearn_checks_to_run.append(check_func)
```
Please convert this piece of code into a function. Let's keep the code organized.

```python
def sklearn_checks_to_run():
    check_names = [
        "check_estimator_get_tags_default_keys",
        "check_get_params_invariance",
        "check_set_params"
    ]
    for check_name in check_names:
        check_func = getattr(sklearn_checks, check_name, None)
        if check_func:
            yield check_func
```
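The `getattr(sklearn_checks, check_name, None)` lookup above is what makes the test file tolerant of scikit-learn versions in which some check names don't exist. The same defensive-lookup pattern can be shown on a stdlib module (`math` is just a stand-in here; `available_funcs` is a hypothetical helper, not LightGBM code):

```python
import math

def available_funcs(module, names):
    """Yield only the functions that actually exist on this module version,
    silently skipping names that are missing (older or newer releases)."""
    for name in names:
        func = getattr(module, name, None)
        if func is not None:
            yield func

# math.isqrt only exists on Python 3.8+; "no_such_check" never exists.
funcs = list(available_funcs(math, ["sqrt", "isqrt", "no_such_check"]))
print([f.__name__ for f in funcs])
```

With this approach, a missing check is simply not parametrized rather than raising an `AttributeError` at collection time.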
@StrikerRUS Does this work?
@imjwang Thank you for your contribution!
@jameslamb @StrikerRUS Thanks for the opportunity and support! I've learned a good bit about testing and CI and I am happy to contribute to this project.
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Adds scikit-learn integration tests for the Dask estimators. This fixes #3894.