[dask] Add a scikit-learn compatibility test #3894

jameslamb · 2021-02-02T22:26:33Z

Summary

scikit-learn supports a test that people writing scikit-learn extensions can use to check API compatibility with the rest of the scikit-learn ecosystem. Such a test should be added for the code in python-package/lightgbm/dask.py.

Motivation

The discussion in #3883 focused a lot on compatibility of the dask estimators, like DaskLGBMClassifier, with the broader scikit-learn ecosystem. It was proposed in #3883 (comment) that we should have a test on the Dask module similar to

LightGBM/tests/python_package_test/test_sklearn.py

Line 1169 in a4cae37

def test_sklearn_integration(estimator, check):

.

I think that actually several of the tests from that module should have Dask equivalents, to test compatibility with the scikit-learn ecosystem.

References

More information on the scikit-learn API is available in "Developing scikit-learn estimators". In general, changes to lightgbm.dask that make it align more closely with that spec are welcome.

The text was updated successfully, but these errors were encountered:

imjwang · 2021-02-02T23:46:46Z

Hey, I would like to be assigned this issue. I am in a software methodology class and passionate about data science.

jameslamb · 2021-02-03T00:13:36Z

Thanks @lmjwang! You're welcome to take it.

A few things you should know:

the Dask tests will only work on Linux, so you will have to develop in a Linux environment
you can build the package from source by running cd python-package && python setup.py install
you can run just the Dask tests to test your code by running pytest tests/python_package_test/test_dask.py

Feel free to ask questions here if you get stuck. Before beginning, please read the items that are linked in the issue's description.

imjwang · 2021-02-07T17:20:49Z

Hey @jameslamb, I'm going through test_sklearn.py and I'm curious why lgb.sklearn.LGBMRanker is not included here?

def _tested_estimators():
    for Estimator in [lgb.sklearn.LGBMClassifier, lgb.sklearn.LGBMRegressor]:
        yield Estimator()

jameslamb · 2021-02-07T17:31:31Z

Good question! I do not actually know. @StrikerRUS do you know?

By the way in the future, a link to code is more helpful than just copying it:

LightGBM/tests/python_package_test/test_sklearn.py

Line 1150 in e056237

def _tested_estimators():

StrikerRUS · 2021-02-07T18:32:08Z

Scikit-learn doesn't have learning-to-rank applications. So there is no point to test LGBMRanker to be "compatible" with something that doesn't support ranking.
From the technical point of view, LGBMRanker requires group argument to be passed to fit() method. And scikit-learn API tests knows nothing about it.

imjwang · 2021-02-09T20:45:26Z

I copied these lines of code to test_dask.py with their dependencies, but used lgb.DaskLGBMClassifier, lgb.DaskLGBMRegressor for _tested_estimators():

LightGBM/tests/python_package_test/test_sklearn.py

Lines 1150 to 1182 in a4cae37

    
           def _tested_estimators(): 
        
               for Estimator in [lgb.sklearn.LGBMClassifier, lgb.sklearn.LGBMRegressor]: 
        
                   yield Estimator() 
        
           if sk_version < parse_version("0.23"): 
        
               def _generate_checks_per_estimator(check_generator, estimators): 
        
                   for estimator in estimators: 
        
                       name = estimator.__class__.__name__ 
        
                       for check in check_generator(name, estimator): 
        
                           yield estimator, check 
        
               @pytest.mark.skipif( 
        
                   sk_version < parse_version("0.21"), reason="scikit-learn version is less than 0.21" 
        
               ) 
        
               @pytest.mark.parametrize( 
        
                   "estimator, check", 
        
                   _generate_checks_per_estimator(_yield_all_checks, _tested_estimators()), 
        
               ) 
        
               def test_sklearn_integration(estimator, check): 
        
                   xfail_checks = estimator._get_tags()["_xfail_checks"] 
        
                   check_name = check.__name__ if hasattr(check, "__name__") else check.func.__name__ 
        
                   if xfail_checks and check_name in xfail_checks: 
        
                       warnings.warn(xfail_checks[check_name], SkipTestWarning) 
        
                       raise SkipTest 
        
                   estimator.set_params(min_child_samples=1, min_data_in_bin=1) 
        
                   name = estimator.__class__.__name__ 
        
                   check(name, estimator) 
        
           else: 
        
               @parametrize_with_checks(list(_tested_estimators())) 
        
               def test_sklearn_integration(estimator, check, request): 
        
                   estimator.set_params(min_child_samples=1, min_data_in_bin=1) 
        
                   check(estimator)

But running pytest tests/python_package_test/test_dask.py -k test_sklearn_integration
is giving me a bunch of failed tests mostly with ValueError: No clients found.

Does this satisfy the task or am I missing something? @jameslamb

jameslamb · 2021-02-09T23:26:29Z

Seems like a good approach! If you feel like you have something that is mostly working, please open a pull request and I can give more specific help there.

For the No clients found error...

Creating an instance of the Dask estimators requires an active Dask client. We do this with a client fixture from dask/distributed.

You can use this by passing the keyword argument client into the test function, like

LightGBM/tests/python_package_test/test_dask.py

Line 336 in 18d5793

def test_training_does_not_fail_on_port_conflicts(client):

* add test_dask.py * Update tests/python_package_test/test_dask.py Co-authored-by: James Lamb <jaylamb20@gmail.com> * clients * remove ports * safe sklearn checks * safe sklearn checks * fix whitespace * fix whitespace-try 2 * fix whitespace-try 3 * isort * isort * sklearn_checks_to_learn Co-authored-by: James Lamb <jaylamb20@gmail.com>

StrikerRUS · 2021-02-18T02:35:11Z

We may want to get back for this after resolving upstream issue dask/dask-ml#796.

github-actions · 2023-08-23T16:22:45Z

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

jameslamb added good first issue dask labels Feb 2, 2021

jameslamb mentioned this issue Feb 2, 2021

[dask] remove 'client' kwarg from fit() and predict() (fixes #3808) #3883

Merged

jameslamb assigned imjwang Feb 3, 2021

jameslamb mentioned this issue Feb 12, 2021

[tests][dask] add scikit-learn compatibility tests (fixes #3894) #3947

Merged

StrikerRUS closed this as completed in #3947 Feb 18, 2021

StrikerRUS mentioned this issue Apr 28, 2021

LGBMRanker query group setting when using gridsearchcv #3018

Closed

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[dask] Add a scikit-learn compatibility test #3894

[dask] Add a scikit-learn compatibility test #3894

jameslamb commented Feb 2, 2021

imjwang commented Feb 2, 2021

jameslamb commented Feb 3, 2021

imjwang commented Feb 7, 2021

jameslamb commented Feb 7, 2021

StrikerRUS commented Feb 7, 2021

imjwang commented Feb 9, 2021

jameslamb commented Feb 9, 2021

StrikerRUS commented Feb 18, 2021

github-actions bot commented Aug 23, 2023

[dask] Add a scikit-learn compatibility test #3894

[dask] Add a scikit-learn compatibility test #3894

Comments

jameslamb commented Feb 2, 2021

Summary

Motivation

References

imjwang commented Feb 2, 2021

jameslamb commented Feb 3, 2021

imjwang commented Feb 7, 2021

jameslamb commented Feb 7, 2021

StrikerRUS commented Feb 7, 2021

imjwang commented Feb 9, 2021

jameslamb commented Feb 9, 2021

StrikerRUS commented Feb 18, 2021

github-actions bot commented Aug 23, 2023