
[python] improved sklearn interface #870

Merged
merged 15 commits into microsoft:master on Sep 5, 2017

Conversation

StrikerRUS
Collaborator

This PR makes the sklearn wrapper pass scikit-learn's check_estimator test.

The one remaining thing I want to add to this PR is dynamic construction of the docstring. I mean, there is no information about the n_classes_ and classes_ attributes in the docstring of LGBMClassifier. I want something like this:

base_doc = LGBMModel.__doc__
__doc__ = (base_doc[:base_doc.find('more important the feature).')]
           + "description of classes_ and n_classes_ here"
           + base_doc[base_doc.find('Note'):])

Maybe someone could help me with it?

@msftclas

@StrikerRUS,
Thanks for having already signed the Contribution License Agreement. Your agreement was validated by Microsoft. We will now review your pull request.
Thanks,
Microsoft Pull Request Bot

self.best_iteration = self._Booster.best_iteration
self.best_score = self._Booster.best_score
self._best_iteration = self._Booster.best_iteration
self._best_score = self._Booster.best_score
Collaborator Author

Should L455 be moved inside the if statement?

Contributor

The target values
y_pred: array_like of shape [n_samples] or shape[n_samples * n_class] (for multi-class)
y_pred: array-like of shape = [n_samples] or shape = [n_samples * n_class] (for multi-class)
Collaborator Author

I suppose there should be "... or shape = [n_samples * n_outputs] (for multi-output problem)".

Contributor

@StrikerRUS we don't support multi-output yet.

Collaborator Author

@StrikerRUS StrikerRUS Aug 30, 2017

@wxchan
Thanks for your review. Can you then please explain the case of y with shape = [n_samples * n_class]?
It's still a 1d array, as I understand, and possible only in a classification setting, right?

Contributor

@StrikerRUS yes, it's the flattened (n_samples, n_class) array.
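The shape relationship can be illustrated with NumPy (the exact flattening order is a library detail; this sketch uses NumPy's default row-major ravel):

```python
import numpy as np

# A (n_samples, n_class) score matrix and its flat 1-D form of length
# n_samples * n_class, as discussed above.
n_samples, n_class = 4, 3
scores_2d = np.arange(n_samples * n_class, dtype=float).reshape(n_samples, n_class)
scores_flat = scores_2d.ravel()                      # 1-D, length 12
restored = scores_flat.reshape(n_samples, n_class)   # round-trips losslessly
```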

Collaborator Author

@StrikerRUS StrikerRUS Aug 30, 2017

@wxchan
Thank you. I'll edit the check_consistent_length function in the next commit.

And what about docstring (my first comment in this PR)? Do you know how it could be done?

Collaborator Author

@wxchan
Am I right that the fit method accepts only y of shape = [n_samples] in both the classification and regression cases?

Contributor

@StrikerRUS yes for the shape of y. I don't know how to do the docstring thing; you can simply add those to LGBMModel.__doc__ with a note that they are Classifier-only.

Collaborator Author

@wxchan
Thank you. I combined both ways of improving the docstrings: dynamic construction and notes about the classification problem only.

@henry0312
Contributor

@wxchan please review later

@@ -40,7 +40,7 @@ if [[ ${TASK} == "if-else" ]]; then
exit 0
fi

conda install --yes numpy scipy scikit-learn pandas matplotlib
conda install --yes numpy nose scipy scikit-learn pandas matplotlib
Contributor

why do you install nose?

Collaborator Author

Sklearn's checks require nose. Without it the new integration test fails.

Contributor

Sklearn's checks require nose

Oh, I got it.

@StrikerRUS
Collaborator Author

Refer to #261

elif (isinstance(eval_group, dict) and any(i not in eval_group or eval_group[i] is None for i in range_(len(eval_group)))) \
or (isinstance(eval_group, list) and any(group is None for group in eval_group)):
raise ValueError("Should set group for all eval dataset for ranking task; if you use dict, the index should start from 0")
raise ValueError("Should set group for all eval datasets for ranking task; "
"if you use dict, the index should start from 0")

if eval_at is not None:
Collaborator Author

@StrikerRUS StrikerRUS Aug 31, 2017

@wxchan
Is it critical to check eval_at for None? I mean, will it cause an error in the underlying booster, or can we omit the if statement, since the default value is [1] and the docstring says it should be a list of int?


Contributor

Will eval_at = None raise an error?

Collaborator Author

@wxchan I'll check it. But I want to say that it seems strange to me that we check for None when the default value is [1] and the docstring doesn't allow passing None. It's like checking eval_at for, let's say, a string type...

Contributor

@wxchan wxchan Sep 3, 2017

@StrikerRUS I think it's because the default used to be None. I guess it will fail at param_dict_to_str if it's None, and users didn't always understand our error message. I think removing it is fine.

Collaborator Author

@StrikerRUS StrikerRUS Sep 3, 2017

@wxchan
I removed the check for None and tried passing None:
TypeError: Unknown type of parameter:ndcg_eval_at, got:NoneType
and string:
LightGBMError: b'invalid stoll argument'
and object:
TypeError: Unknown type of parameter:ndcg_eval_at, got:type

I think error message for None is even more user-friendly than for string :-).
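The errors above come from parameter serialization. A simplified, hypothetical sketch (not LightGBM's actual param_dict_to_str) of how such a type check can produce them:

```python
# Hypothetical, simplified serializer: turns one parameter into a 'key=value'
# string and rejects unsupported types with the kind of message quoted above.
def param_to_str(name, value):
    if isinstance(value, (list, tuple)):
        return '{}={}'.format(name, ','.join(map(str, value)))
    if isinstance(value, (bool, int, float, str)):
        return '{}={}'.format(name, value)
    raise TypeError('Unknown type of parameter:{}, got:{}'
                    .format(name, type(value).__name__))

param_to_str('ndcg_eval_at', [1, 3, 5])  # 'ndcg_eval_at=1,3,5'
```

Passing None falls through both isinstance checks and raises the TypeError, which is why no explicit None check is needed upstream.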

@StrikerRUS
Collaborator Author

I think I've finished, except this #870 (comment)

@StrikerRUS StrikerRUS changed the title [WIP] [python] improved sklearn interface [python] improved sklearn interface Sep 2, 2017
return self.booster_.predict(X, pred_leaf=True, num_iteration=num_iteration)

@property
def n_features_(self):

Collaborator Author

@StrikerRUS StrikerRUS Sep 3, 2017

@wxchan Nothing is said about it for Random Forest or Extra Trees; there n_features isn't deprecated. I can't find the reason why they deprecated it in Gradient Boosting (no information in the changelog on the site).

Contributor

@StrikerRUS They discussed it in this thread scikit-learn/scikit-learn#7846 (comment). I think it cannot pass some test they added.

Collaborator Author

@StrikerRUS StrikerRUS Sep 3, 2017

@wxchan Thank you very much for the information.
I've reproduced the tests locally (scikit-learn 0.19), printing each test's name:

import warnings

import lightgbm as lgb
from sklearn.exceptions import SkipTestWarning
from sklearn.utils.estimator_checks import (_yield_all_checks, SkipTest,
                                            check_parameters_default_constructible,
                                            check_no_fit_attributes_set_in_init)

# we cannot use `check_estimator` directly since there is no skip test mechanism
for name, estimator in ((lgb.sklearn.LGBMClassifier.__name__, lgb.sklearn.LGBMClassifier),
                        (lgb.sklearn.LGBMRegressor.__name__, lgb.sklearn.LGBMRegressor)):
    check_parameters_default_constructible(name, estimator)
    check_no_fit_attributes_set_in_init(name, estimator)
    # we cannot leave default params (see https://github.com/Microsoft/LightGBM/issues/833)
    estimator = estimator(min_data=1, min_data_in_bin=1)
    for check in _yield_all_checks(name, estimator):
        if check.__name__ == 'check_estimators_nan_inf':
            continue  # skip test because LightGBM deals with nan
        try:
            print(check.__name__)
            check(name, estimator)
        except SkipTest as message:
            warnings.warn(message, SkipTestWarning)

And it seems that everything is OK with our estimators even without the deprecation mark:

check_estimators_dtypes
check_fit_score_takes_y
check_dtype_object
check_sample_weights_pandas_series
check_sample_weights_list
check_estimators_fit_returns_self
check_estimators_empty_data_messages
check_pipeline_consistency
check_estimators_overwrite_params    <-----------------------
check_estimator_sparse_data
check_estimators_pickle
check_classifier_data_not_an_array
check_classifiers_one_label

C:\Program Files\Anaconda3\lib\site-packages\lightgbm\basic.py:423: UserWarning: Converting data to scipy sparse matrix.
  warnings.warn('Converting data to scipy sparse matrix.')

check_classifiers_classes
check_estimators_partial_fit_n_features
check_classifiers_train
check_classifiers_regression_target
check_supervised_y_2d
check_estimators_unfitted
check_non_transformer_estimators_n_iter
check_decision_proba_consistency
check_fit2d_predict1d
check_fit2d_1sample
check_fit2d_1feature
check_fit1d_1feature
check_fit1d_1sample
check_get_params_invariance
check_dict_unchanged
check_dont_overwrite_parameters    <-----------------------
check_estimators_dtypes
check_fit_score_takes_y
check_dtype_object
check_sample_weights_pandas_series
check_sample_weights_list
check_estimators_fit_returns_self
check_estimators_empty_data_messages
check_pipeline_consistency
check_estimators_overwrite_params    <-----------------------
check_estimator_sparse_data
check_estimators_pickle
check_regressors_train
check_regressor_data_not_an_array

C:\Program Files\Anaconda3\lib\site-packages\lightgbm\basic.py:423: UserWarning: Converting data to scipy sparse matrix.
  warnings.warn('Converting data to scipy sparse matrix.')


check_estimators_partial_fit_n_features
check_regressors_no_decision_function
check_supervised_y_2d
check_supervised_y_no_nan
check_regressors_int
check_estimators_unfitted
check_non_transformer_estimators_n_iter
check_fit2d_predict1d
check_fit2d_1sample
check_fit2d_1feature
check_fit1d_1feature
check_fit1d_1sample
check_get_params_invariance
check_dict_unchanged
check_dont_overwrite_parameters  <-----------------------

Contributor

OK, LGTM now; we can deprecate it in the future if it causes issues.

Collaborator Author

@wxchan Sure, we'll see.

@guolinke
Collaborator

guolinke commented Sep 3, 2017

@wxchan @StrikerRUS
ping me when this is ready to merge.

@wxchan
Contributor

wxchan commented Sep 3, 2017

@StrikerRUS is there any remaining issue in this thread that I forgot?

@StrikerRUS
Collaborator Author

@wxchan I suppose not.

@StrikerRUS
Collaborator Author

The only thing I'm worried about is readthedocs. Are dynamic docstrings OK with it?

LGBMLabelEncoder = None
LGBMStratifiedKFold = None
LGBMGroupKFold = None
_SKLEARN_INSTALLED = False
Contributor

Looks like a typo? All other places use SKLEARN_INSTALLED.

Collaborator Author

Oops, sorry.

_LGBMClassifierBase = object
_LGBMRegressorBase = object
_LGBMLabelEncoder = None
LGBMDeprecated = None
Contributor

@wxchan wxchan Sep 3, 2017

I remember this will raise an error if sklearn is not installed; check #221. You can uninstall scikit-learn and give it a try.

Collaborator Author

I thought it was just a forgotten object. Sorry.
By the way, in two words, why are these dummy variables needed?

Collaborator Author

Maybe I don't need to specify some of these either?

LGBMNotFittedError = ValueError 
_LGBMCheckXY = None
_LGBMCheckArray = None
_LGBMCheckConsistentLength = None
_LGBMCheckClassificationTargets = None 

Collaborator Author

I uninstalled scikit-learn, removed the line
LGBMDeprecated = None
and
import lightgbm as lgb
didn't raise an error. Does it indicate that everything is OK?

@wxchan
Contributor

wxchan commented Sep 3, 2017

For docstrings, you can run make html in the docs folder; it will generate the HTML doc pages, and you can check whether they are what you expected.

@StrikerRUS
Collaborator Author

@wxchan
Thanks a lot for the tip!
I've generated the docs and it seems that everything is OK, including the dynamically constructed docstrings.

elif isinstance(self, LGBMRanker):
self._objective = "lambdarank"
else:
raise ValueError("Unknown LGBMModel type.")
Contributor

I think it's better to warn, instead of raising an exception, because we can't make an instance of LGBMModel.

Contributor

I'm sorry, I was wrong.
Please ignore this point :p

Contributor

@henry0312 henry0312 left a comment

This PR seems good, although we need @wxchan's approval.

Number of parallel threads.
silent : boolean
silent : bool, optional (default=True)
Whether to print messages while running boosting.
**kwargs : other parameters
Check http://lightgbm.readthedocs.io/en/latest/Parameters.html for more parameters.
Note: **kwargs is not supported in sklearn, it may cause unexpected issues.
Collaborator Author

Is this note still relevant?

Contributor

I can't ensure everything works, because sklearn doesn't support **kwargs and it indeed raised some issues before.
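A sketch of why **kwargs is fragile here, with a hypothetical KwargsEstimator; get_params mimics sklearn's signature-based parameter discovery without requiring sklearn itself:

```python
import inspect

# sklearn tooling reconstructs estimators from get_params(), which only sees
# named __init__ parameters -- anything passed via **kwargs is invisible and
# gets silently dropped on clone/reconstruction.
class KwargsEstimator:  # hypothetical, for illustration only
    def __init__(self, alpha=1.0, **kwargs):
        self.alpha = alpha
        self._other = kwargs  # invisible to get_params()

    def get_params(self):
        # Mimic sklearn: only named __init__ parameters count; **kwargs excluded.
        sig = inspect.signature(type(self).__init__)
        return {p.name: getattr(self, p.name)
                for p in sig.parameters.values()
                if p.name != 'self' and p.kind is not p.VAR_KEYWORD}

est = KwargsEstimator(alpha=2.0, min_data=1)
cloned = KwargsEstimator(**est.get_params())  # min_data is silently lost
```

So extra booster parameters passed through **kwargs can vanish in pipelines or searches that clone the estimator, which is what the docstring note warns about.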

Collaborator Author

@wxchan OK, got it!

@wxchan
Contributor

wxchan commented Sep 5, 2017

@guolinke LGTM now, I think it's ok to merge. @StrikerRUS Do you have anything left?

@StrikerRUS
Collaborator Author

@wxchan @guolinke I've done all I wanted.

@guolinke guolinke merged commit 015c8ff into microsoft:master Sep 5, 2017
@StrikerRUS StrikerRUS deleted the sklearn-improving branch September 5, 2017 10:50
guolinke pushed a commit that referenced this pull request Oct 9, 2017
* improved sklearn interface; added sklearns' tests

* moved best_score into the if statement

* improved docstrings; simplified LGBMCheckConsistentLength

* fixed typo

* pylint

* updated example

* fixed Ranker interface

* added missed boosting_type

* fixed more comfortable autocomplete without unused objects

* removed check for None of eval_at

* fixed according to review

* fixed typo

* added description of fit return type

* dictionary->dict for short

* markdown cleanup
@lock lock bot locked as resolved and limited conversation to collaborators Mar 12, 2020