
[python] improved sklearn interface #870

Merged
merged 15 commits into microsoft:master on Sep 5, 2017

Conversation

StrikerRUS
Collaborator

This PR makes the sklearn wrapper pass scikit-learn's check_estimator test.

The one remaining thing I want to add to this PR is dynamic construction of the docstring. I mean, there is no information about the n_classes_ and classes_ attributes in the docstring of LGBMClassifier. I want something like this:

base_doc = LGBMModel.__doc__
__doc__ = (base_doc[:base_doc.find('more important the feature).')]
           + "description of classes_ and n_classes_ here"
           + base_doc[base_doc.find('Note'):])

Maybe someone could help me with it?

@msftclas

@StrikerRUS,
Thanks for having already signed the Contribution License Agreement. Your agreement was validated by Microsoft. We will now review your pull request.
Thanks,
Microsoft Pull Request Bot

self.best_iteration = self._Booster.best_iteration
self.best_score = self._Booster.best_score
self._best_iteration = self._Booster.best_iteration
self._best_score = self._Booster.best_score
Collaborator Author

Should L455 be moved inside the if statement?

Contributor

The target values
y_pred: array_like of shape [n_samples] or shape[n_samples * n_class] (for multi-class)
y_pred: array-like of shape = [n_samples] or shape = [n_samples * n_class] (for multi-class)
Collaborator Author

I suppose there should be "... or shape = [n_samples * n_outputs] (for multi-output problem)".

Contributor

@StrikerRUS we don't support multi-output yet.

Collaborator Author

@StrikerRUS StrikerRUS Aug 30, 2017

@wxchan
Thanks for your review. Can you then please explain the case of y with shape = [n_samples * n_class]?
It's still a 1d array, as I understand, and possible only in a classification setting, right?

Contributor

@StrikerRUS yes, it's the flattened (n_samples, n_class) array.
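The shape relationship can be illustrated with NumPy (the exact flattening order is a library detail; this sketch uses NumPy's default row-major ravel):

```python
import numpy as np

# A (n_samples, n_class) score matrix and its flat 1-D form of length
# n_samples * n_class, as discussed above.
n_samples, n_class = 4, 3
scores_2d = np.arange(n_samples * n_class, dtype=float).reshape(n_samples, n_class)
scores_flat = scores_2d.ravel()                      # 1-D, length 12
restored = scores_flat.reshape(n_samples, n_class)   # round-trips losslessly
```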

Collaborator Author

@StrikerRUS StrikerRUS Aug 30, 2017

@wxchan
Thank you. I'll edit the check_consistent_length function in the next commit.

And what about docstring (my first comment in this PR)? Do you know how it could be done?

Collaborator Author

@wxchan
Am I right that the fit method accepts only y of shape = [n_samples] in both the classification and regression cases?

Contributor

@StrikerRUS yes for the shape of y. I don't know how to do the docstring thing; you can simply add those to LGBMModel.__doc__ with a note that they are Classifier-only.

Collaborator Author

@wxchan
Thank you. I combined both ways of improving the docstrings: dynamic construction and notes about the classification problem only.

@henry0312
Contributor

@wxchan please review later

@@ -40,7 +40,7 @@ if [[ ${TASK} == "if-else" ]]; then
exit 0
fi

conda install --yes numpy scipy scikit-learn pandas matplotlib
conda install --yes numpy nose scipy scikit-learn pandas matplotlib
Contributor

why do you install nose?

Collaborator Author

Sklearn's checks require nose. Without it the new integration test fails.

Contributor

Sklearn's checks require nose

Oh, I got it.

@StrikerRUS
Collaborator Author

Refer to #261

elif (isinstance(eval_group, dict) and any(i not in eval_group or eval_group[i] is None for i in range_(len(eval_group)))) \
or (isinstance(eval_group, list) and any(group is None for group in eval_group)):
raise ValueError("Should set group for all eval dataset for ranking task; if you use dict, the index should start from 0")
raise ValueError("Should set group for all eval datasets for ranking task; "
"if you use dict, the index should start from 0")

if eval_at is not None:
Collaborator Author

@StrikerRUS StrikerRUS Aug 31, 2017

@wxchan
Is it critical to check eval_at for None? I mean, will it cause an error in the underlying booster, or can we omit the if statement, since the default value is [1] and the docstring says it should be a list of int?


Contributor

Will eval_at = None raise an error?

Collaborator Author

@wxchan I'll check it. But I want to say that it seems strange to me that we check for None when the default value is [1] and the docstring doesn't allow passing None. It's like checking eval_at for, let's say, a string type...

Contributor

@wxchan wxchan Sep 3, 2017

@StrikerRUS I think it's because the default used to be None. I guess it will fail at param_dict_to_str if it's None, and users didn't always understand our error message. I think removing it is fine.

Collaborator Author

@StrikerRUS StrikerRUS Sep 3, 2017

@wxchan
I removed the check for None and tried passing None:
TypeError: Unknown type of parameter:ndcg_eval_at, got:NoneType
and string:
LightGBMError: b'invalid stoll argument'
and object:
TypeError: Unknown type of parameter:ndcg_eval_at, got:type

I think error message for None is even more user-friendly than for string :-).
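The errors above come from parameter serialization. A simplified, hypothetical sketch (not LightGBM's actual param_dict_to_str) of how such a type check can produce them:

```python
# Hypothetical, simplified serializer: turns one parameter into a 'key=value'
# string and rejects unsupported types with the kind of message quoted above.
def param_to_str(name, value):
    if isinstance(value, (list, tuple)):
        return '{}={}'.format(name, ','.join(map(str, value)))
    if isinstance(value, (bool, int, float, str)):
        return '{}={}'.format(name, value)
    raise TypeError('Unknown type of parameter:{}, got:{}'
                    .format(name, type(value).__name__))

param_to_str('ndcg_eval_at', [1, 3, 5])  # 'ndcg_eval_at=1,3,5'
```

Passing None falls through both isinstance checks and raises the TypeError, which is why no explicit None check is needed upstream.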

@StrikerRUS
Collaborator Author

I think I've finished, except this #870 (comment)

@StrikerRUS StrikerRUS changed the title [WIP] [python] improved sklearn interface [python] improved sklearn interface Sep 2, 2017
return self.booster_.predict(X, pred_leaf=True, num_iteration=num_iteration)

@property
def n_features_(self):

Collaborator Author

@StrikerRUS StrikerRUS Sep 3, 2017

@wxchan Nothing is said about it for Random Forest or Extra Trees; there n_features isn't deprecated. I can't find the reason why they deprecated it in Gradient Boosting (no information in the changelog on the site).

Contributor

@StrikerRUS They discussed it in this thread scikit-learn/scikit-learn#7846 (comment). I think it cannot pass some test they added.

Collaborator Author

@StrikerRUS StrikerRUS Sep 3, 2017

@wxchan Thank you very much for the information.
I've reproduced the tests locally (scikit-learn 0.19), printing each test's name:

import warnings

import lightgbm as lgb
from sklearn.exceptions import SkipTestWarning
from sklearn.utils.estimator_checks import (_yield_all_checks, SkipTest,
                                            check_parameters_default_constructible,
                                            check_no_fit_attributes_set_in_init)

# we cannot use `check_estimator` directly since there is no skip test mechanism
for name, estimator in ((lgb.sklearn.LGBMClassifier.__name__, lgb.sklearn.LGBMClassifier),
                        (lgb.sklearn.LGBMRegressor.__name__, lgb.sklearn.LGBMRegressor)):
    check_parameters_default_constructible(name, estimator)
    check_no_fit_attributes_set_in_init(name, estimator)
    # we cannot leave default params (see https://github.com/Microsoft/LightGBM/issues/833)
    estimator = estimator(min_data=1, min_data_in_bin=1)
    for check in _yield_all_checks(name, estimator):
        if check.__name__ == 'check_estimators_nan_inf':
            continue  # skip test because LightGBM deals with nan
        try:
            print(check.__name__)
            check(name, estimator)
        except SkipTest as message:
            warnings.warn(message, SkipTestWarning)

And it seems that everything is OK with our estimators even without the deprecation mark:

check_estimators_dtypes
check_fit_score_takes_y
check_dtype_object
check_sample_weights_pandas_series
check_sample_weights_list
check_estimators_fit_returns_self
check_estimators_empty_data_messages
check_pipeline_consistency
check_estimators_overwrite_params    <-----------------------
check_estimator_sparse_data
check_estimators_pickle
check_classifier_data_not_an_array
check_classifiers_one_label

C:\Program Files\Anaconda3\lib\site-packages\lightgbm\basic.py:423: UserWarning: Converting data to scipy sparse matrix.
  warnings.warn('Converting data to scipy sparse matrix.')

check_classifiers_classes
check_estimators_partial_fit_n_features
check_classifiers_train
check_classifiers_regression_target
check_supervised_y_2d
check_estimators_unfitted
check_non_transformer_estimators_n_iter
check_decision_proba_consistency
check_fit2d_predict1d
check_fit2d_1sample
check_fit2d_1feature
check_fit1d_1feature
check_fit1d_1sample
check_get_params_invariance
check_dict_unchanged
check_dont_overwrite_parameters    <-----------------------
check_estimators_dtypes
check_fit_score_takes_y
check_dtype_object
check_sample_weights_pandas_series
check_sample_weights_list
check_estimators_fit_returns_self
check_estimators_empty_data_messages
check_pipeline_consistency
check_estimators_overwrite_params    <-----------------------
check_estimator_sparse_data
check_estimators_pickle
check_regressors_train
check_regressor_data_not_an_array

C:\Program Files\Anaconda3\lib\site-packages\lightgbm\basic.py:423: UserWarning: Converting data to scipy sparse matrix.
  warnings.warn('Converting data to scipy sparse matrix.')


check_estimators_partial_fit_n_features
check_regressors_no_decision_function
check_supervised_y_2d
check_supervised_y_no_nan
check_regressors_int
check_estimators_unfitted
check_non_transformer_estimators_n_iter
check_fit2d_predict1d
check_fit2d_1sample
check_fit2d_1feature
check_fit1d_1feature
check_fit1d_1sample
check_get_params_invariance
check_dict_unchanged
check_dont_overwrite_parameters  <-----------------------

Contributor

OK, LGTM now; we can deprecate it in the future if it causes issues.

Collaborator Author

@wxchan Sure, we'll see.

@guolinke
Collaborator

guolinke commented Sep 3, 2017

@wxchan @StrikerRUS
ping me when this is ready to merge.

@wxchan
Contributor

wxchan commented Sep 3, 2017

@StrikerRUS is there any remaining issue in this thread that I forgot?

@StrikerRUS
Collaborator Author

@wxchan I suppose not.

@StrikerRUS
Collaborator Author

The only thing I'm worried about is readthedocs. Are dynamic docstrings OK with it?

LGBMLabelEncoder = None
LGBMStratifiedKFold = None
LGBMGroupKFold = None
_SKLEARN_INSTALLED = False
Contributor

Looks like a typo? All other places use SKLEARN_INSTALLED.

Collaborator Author

Oops, sorry.

_LGBMClassifierBase = object
_LGBMRegressorBase = object
_LGBMLabelEncoder = None
LGBMDeprecated = None
Contributor

@wxchan wxchan Sep 3, 2017

I remember this will raise an error if sklearn is not installed; check #221. You can uninstall scikit-learn and give it a try.

Collaborator Author

I thought it was just a forgotten object. Sorry.
By the way, in two words, why are these dummy variables needed?

Collaborator Author

Maybe I don't need to specify some of these either?

LGBMNotFittedError = ValueError 
_LGBMCheckXY = None
_LGBMCheckArray = None
_LGBMCheckConsistentLength = None
_LGBMCheckClassificationTargets = None 

Collaborator Author

I uninstalled scikit-learn, removed the line
LGBMDeprecated = None
and
import lightgbm as lgb
didn't raise an error. Does it indicate that everything is OK?

@wxchan
Contributor

wxchan commented Sep 3, 2017

For docstrings, you can run make html in the docs folder; it will generate the HTML doc pages, and you can check whether they are what you expected.

@StrikerRUS
Collaborator Author

@wxchan
Thanks a lot for the tip!
I've generated the docs and it seems that everything is OK, including the dynamically constructed docstrings.

elif isinstance(self, LGBMRanker):
self._objective = "lambdarank"
else:
raise ValueError("Unknown LGBMModel type.")
Contributor

I think it's better to warn, instead of raising an exception, because we can't make an instance of LGBMModel.

Contributor

I'm sorry, I was wrong.
Please ignore this point :p

Contributor

@henry0312 henry0312 left a comment

This PR seems good, although we need @wxchan's approval.

Number of parallel threads.
silent : boolean
silent : bool, optional (default=True)
Whether to print messages while running boosting.
**kwargs : other parameters
Check http://lightgbm.readthedocs.io/en/latest/Parameters.html for more parameters.
Note: **kwargs is not supported in sklearn, it may cause unexpected issues.
Collaborator Author

Is this note still relevant?

Contributor

I can't ensure everything works, because sklearn doesn't support **kwargs and it indeed raised some issues before.
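A sketch of why **kwargs is fragile here, with a hypothetical KwargsEstimator; get_params mimics sklearn's signature-based parameter discovery without requiring sklearn itself:

```python
import inspect

# sklearn tooling reconstructs estimators from get_params(), which only sees
# named __init__ parameters -- anything passed via **kwargs is invisible and
# gets silently dropped on clone/reconstruction.
class KwargsEstimator:  # hypothetical, for illustration only
    def __init__(self, alpha=1.0, **kwargs):
        self.alpha = alpha
        self._other = kwargs  # invisible to get_params()

    def get_params(self):
        # Mimic sklearn: only named __init__ parameters count; **kwargs excluded.
        sig = inspect.signature(type(self).__init__)
        return {p.name: getattr(self, p.name)
                for p in sig.parameters.values()
                if p.name != 'self' and p.kind is not p.VAR_KEYWORD}

est = KwargsEstimator(alpha=2.0, min_data=1)
cloned = KwargsEstimator(**est.get_params())  # min_data is silently lost
```

So extra booster parameters passed through **kwargs can vanish in pipelines or searches that clone the estimator, which is what the docstring note warns about.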

Collaborator Author

@wxchan OK, got it!

@wxchan
Contributor

wxchan commented Sep 5, 2017

@guolinke LGTM now, I think it's ok to merge. @StrikerRUS Do you have anything left?

@StrikerRUS
Collaborator Author

@wxchan @guolinke I've done all I wanted.

@guolinke guolinke merged commit 015c8ff into microsoft:master Sep 5, 2017
@StrikerRUS StrikerRUS deleted the sklearn-improving branch September 5, 2017 10:50
guolinke pushed a commit that referenced this pull request Oct 9, 2017
* improved sklearn interface; added sklearns' tests

* moved best_score into the if statement

* improved docstrings; simplified LGBMCheckConsistentLength

* fixed typo

* pylint

* updated example

* fixed Ranker interface

* added missed boosting_type

* fixed more comfortable autocomplete without unused objects

* removed check for None of eval_at

* fixed according to review

* fixed typo

* added description of fit return type

* dictionary->dict for short

* markdown cleanup
@lock lock bot locked as resolved and limited conversation to collaborators Mar 12, 2020