[python] improved sklearn interface #870
Conversation
@StrikerRUS,
python-package/lightgbm/sklearn.py (outdated)

```diff
-        self.best_iteration = self._Booster.best_iteration
-        self.best_score = self._Booster.best_score
+        self._best_iteration = self._Booster.best_iteration
+        self._best_score = self._Booster.best_score
```
Should L455 be moved inside the `if` statement?
@StrikerRUS yes.
python-package/lightgbm/sklearn.py (outdated)

```diff
         The target values
-    y_pred: array_like of shape [n_samples] or shape[n_samples * n_class] (for multi-class)
+    y_pred: array-like of shape = [n_samples] or shape = [n_samples * n_class] (for multi-class)
```
I suppose there should be "... or shape = [n_samples * n_outputs] (for multi-output problems)".
@StrikerRUS we don't support multi-output yet.
@wxchan
Thanks for your review. Can you please explain the case of `y` with shape = [n_samples * n_class]? It's still a 1d array, as I understand, and it's possible only in the classification case, right?
@StrikerRUS yes, it's the flattened (n_samples, n_class) array.
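For illustration only (plain numpy with made-up values; the exact memory order LightGBM uses when flattening its predictions is an implementation detail of the library), a (n_samples, n_class) array and its 1d flattened form relate like this:

```python
import numpy as np

# hypothetical per-class scores: n_samples=3, n_class=2
preds_2d = np.array([[0.9, 0.1],
                     [0.2, 0.8],
                     [0.6, 0.4]])

# the flattened form with shape [n_samples * n_class]
preds_1d = preds_2d.reshape(-1)
print(preds_1d.shape)  # (6,)

# the (n_samples, n_class) view can be recovered when the class count is known
n_class = 2
recovered = preds_1d.reshape(-1, n_class)
print(np.array_equal(recovered, preds_2d))  # True
```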
@wxchan
Thank you. I'll edit the `check_consistent_length` function in the next commit.
And what about the docstring (my first comment in this PR)? Do you know how it could be done?
@wxchan
Am I right that the `fit` method accepts only `y` of shape = [n_samples] in both the classification and regression cases?
@StrikerRUS yes for the shape of `y`. I don't know how to do the docstring thing; you can simply add those to LGBMModel's docstring with a `Classifier only` note.
@wxchan
Thank you. I combined both ways of improving the docstrings: dynamic construction and notes marking classification-only attributes.
@wxchan please review later
```diff
@@ -40,7 +40,7 @@ if [[ ${TASK} == "if-else" ]]; then
     exit 0
 fi

-conda install --yes numpy scipy scikit-learn pandas matplotlib
+conda install --yes numpy nose scipy scikit-learn pandas matplotlib
```
Why do you install nose?
Sklearn's checks require `nose`. Without it, the new integration test fails.
> Sklearn's checks require nose

Oh, I got it.
Refer to #261
python-package/lightgbm/sklearn.py (outdated)

```diff
         elif (isinstance(eval_group, dict) and any(i not in eval_group or eval_group[i] is None for i in range_(len(eval_group)))) \
                 or (isinstance(eval_group, list) and any(group is None for group in eval_group)):
-            raise ValueError("Should set group for all eval dataset for ranking task; if you use dict, the index should start from 0")
+            raise ValueError("Should set group for all eval datasets for ranking task; "
+                             "if you use dict, the index should start from 0")

         if eval_at is not None:
```
@wxchan
Is it critical to check `eval_at` for None? I mean, will it cause an error in the underlying booster, or can we omit the `if` statement, since the default value is `[1]` and the docstring says it should be a list of int?
Will `eval_at = None` raise an error?
@wxchan I'll check it. But I want to say it seems strange to me that we check for `None` while the default value is `[1]` and the docstring doesn't allow passing `None`. It's like checking `eval_at` for, let's say, a string type...
@StrikerRUS I think it's because the default used to be None. I guess it will fail at param_dict_to_str if it's None, and users didn't always understand our error message. I think removing it is fine.
@wxchan
I removed the check for None and tried passing `None`:

    TypeError: Unknown type of parameter:ndcg_eval_at, got:NoneType

a string:

    LightGBMError: b'invalid stoll argument'

and an `object`:

    TypeError: Unknown type of parameter:ndcg_eval_at, got:type

I think the error message for `None` is even more user-friendly than the one for a string :-).
I think I've finished, except for this #870 (comment)
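For context, these failures come from parameter serialization: parameters are rendered into a `key=value` string, and there is no rendering for types like `NoneType`. A minimal sketch of that idea (a hypothetical helper, not LightGBM's actual `param_dict_to_str`):

```python
def param_to_str(key, value):
    """Render one parameter for a LightGBM-style parameter string (sketch)."""
    if isinstance(value, (list, tuple)):
        return '{}={}'.format(key, ','.join(map(str, value)))
    elif isinstance(value, (int, float, str)):
        return '{}={}'.format(key, value)
    else:
        # mirrors the style of the error message quoted above
        raise TypeError('Unknown type of parameter:{}, got:{}'.format(
            key, type(value).__name__))

print(param_to_str('ndcg_eval_at', [1, 3, 5]))  # ndcg_eval_at=1,3,5

try:
    param_to_str('ndcg_eval_at', None)
except TypeError as err:
    print(err)  # Unknown type of parameter:ndcg_eval_at, got:NoneType
```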
```diff
         return self.booster_.predict(X, pred_leaf=True, num_iteration=num_iteration)

+    @property
+    def n_features_(self):
```
Looks like sklearn will remove this in the future (https://github.com/scikit-learn/scikit-learn/blob/ef5cb84a805efbe4bb06516670a9b8c690992bd7/sklearn/ensemble/gradient_boosting.py#L936); should we add something like this?
@wxchan Nothing is said about it for Random Forest or Extra Trees; there `n_features` isn't deprecated. I can't find the reason why they added `deprecated` in Gradient Boosting (no information in the changelog on the site).
@StrikerRUS They discussed it in this thread: scikit-learn/scikit-learn#7846 (comment). I think it couldn't pass some tests they added.
@wxchan Thank you very much for the information.
I've reproduced the tests locally (scikit-learn 0.19), printing each test's name:

```python
import warnings

import lightgbm as lgb
from sklearn.exceptions import SkipTestWarning
from sklearn.utils.estimator_checks import (_yield_all_checks, SkipTest,
                                            check_parameters_default_constructible,
                                            check_no_fit_attributes_set_in_init)

# we cannot use `check_estimator` directly since there is no skip test mechanism
for name, estimator in ((lgb.sklearn.LGBMClassifier.__name__, lgb.sklearn.LGBMClassifier),
                        (lgb.sklearn.LGBMRegressor.__name__, lgb.sklearn.LGBMRegressor)):
    check_parameters_default_constructible(name, estimator)
    check_no_fit_attributes_set_in_init(name, estimator)
    # we cannot leave default params (see https://github.com/Microsoft/LightGBM/issues/833)
    estimator = estimator(min_data=1, min_data_in_bin=1)
    for check in _yield_all_checks(name, estimator):
        if check.__name__ == 'check_estimators_nan_inf':
            continue  # skip test because LightGBM deals with nan
        try:
            print(check.__name__)
            check(name, estimator)
        except SkipTest as message:
            warnings.warn(str(message), SkipTestWarning)
```

And it seems that everything is OK with our estimators even without the deprecation mark:
```text
check_estimators_dtypes
check_fit_score_takes_y
check_dtype_object
check_sample_weights_pandas_series
check_sample_weights_list
check_estimators_fit_returns_self
check_estimators_empty_data_messages
check_pipeline_consistency
check_estimators_overwrite_params <-----------------------
check_estimator_sparse_data
check_estimators_pickle
check_classifier_data_not_an_array
check_classifiers_one_label
C:\Program Files\Anaconda3\lib\site-packages\lightgbm\basic.py:423: UserWarning: Converting data to scipy sparse matrix.
  warnings.warn('Converting data to scipy sparse matrix.')
C:\Program Files\Anaconda3\lib\site-packages\lightgbm\basic.py:423: UserWarning: Converting data to scipy sparse matrix.
  warnings.warn('Converting data to scipy sparse matrix.')
C:\Program Files\Anaconda3\lib\site-packages\lightgbm\basic.py:423: UserWarning: Converting data to scipy sparse matrix.
  warnings.warn('Converting data to scipy sparse matrix.')
C:\Program Files\Anaconda3\lib\site-packages\lightgbm\basic.py:423: UserWarning: Converting data to scipy sparse matrix.
  warnings.warn('Converting data to scipy sparse matrix.')
C:\Program Files\Anaconda3\lib\site-packages\lightgbm\basic.py:423: UserWarning: Converting data to scipy sparse matrix.
  warnings.warn('Converting data to scipy sparse matrix.')
check_classifiers_classes
check_estimators_partial_fit_n_features
check_classifiers_train
check_classifiers_regression_target
check_supervised_y_2d
check_estimators_unfitted
check_non_transformer_estimators_n_iter
check_decision_proba_consistency
check_fit2d_predict1d
check_fit2d_1sample
check_fit2d_1feature
check_fit1d_1feature
check_fit1d_1sample
check_get_params_invariance
check_dict_unchanged
check_dont_overwrite_parameters <-----------------------
check_estimators_dtypes
check_fit_score_takes_y
check_dtype_object
check_sample_weights_pandas_series
check_sample_weights_list
check_estimators_fit_returns_self
check_estimators_empty_data_messages
check_pipeline_consistency
check_estimators_overwrite_params <-----------------------
check_estimator_sparse_data
check_estimators_pickle
check_regressors_train
check_regressor_data_not_an_array
C:\Program Files\Anaconda3\lib\site-packages\lightgbm\basic.py:423: UserWarning: Converting data to scipy sparse matrix.
  warnings.warn('Converting data to scipy sparse matrix.')
C:\Program Files\Anaconda3\lib\site-packages\lightgbm\basic.py:423: UserWarning: Converting data to scipy sparse matrix.
  warnings.warn('Converting data to scipy sparse matrix.')
C:\Program Files\Anaconda3\lib\site-packages\lightgbm\basic.py:423: UserWarning: Converting data to scipy sparse matrix.
  warnings.warn('Converting data to scipy sparse matrix.')
C:\Program Files\Anaconda3\lib\site-packages\lightgbm\basic.py:423: UserWarning: Converting data to scipy sparse matrix.
  warnings.warn('Converting data to scipy sparse matrix.')
C:\Program Files\Anaconda3\lib\site-packages\lightgbm\basic.py:423: UserWarning: Converting data to scipy sparse matrix.
  warnings.warn('Converting data to scipy sparse matrix.')
check_estimators_partial_fit_n_features
check_regressors_no_decision_function
check_supervised_y_2d
check_supervised_y_no_nan
check_regressors_int
check_estimators_unfitted
check_non_transformer_estimators_n_iter
check_fit2d_predict1d
check_fit2d_1sample
check_fit2d_1feature
check_fit1d_1feature
check_fit1d_1sample
check_get_params_invariance
check_dict_unchanged
check_dont_overwrite_parameters <-----------------------
```
OK, LGTM now; we can deprecate it in the future if it raises some issue.
@wxchan Sure, we'll see.
@wxchan @StrikerRUS
@StrikerRUS is there any remaining issue in this thread I forgot?
@wxchan I suppose not.
The only thing I'm worried about is readthedocs. Are dynamic docstrings OK for it?
python-package/lightgbm/compat.py (outdated)

```python
    LGBMLabelEncoder = None
    LGBMStratifiedKFold = None
    LGBMGroupKFold = None
    _SKLEARN_INSTALLED = False
```
Looks like a typo? All other places use SKLEARN_INSTALLED.
Oops, sorry.
python-package/lightgbm/compat.py (outdated)

```python
    _LGBMClassifierBase = object
    _LGBMRegressorBase = object
    _LGBMLabelEncoder = None
    LGBMDeprecated = None
```
I remember this will raise an error if sklearn is not installed; check #221. You can uninstall scikit-learn and give it a try.
I thought it was just a forgotten object. Sorry.
By the way, in two words, why are these dummy variables needed?
Maybe some of these don't need to be specified either?

```python
LGBMNotFittedError = ValueError
_LGBMCheckXY = None
_LGBMCheckArray = None
_LGBMCheckConsistentLength = None
_LGBMCheckClassificationTargets = None
```
I uninstalled scikit-learn, removed the line `LGBMDeprecated = None`, and `import lightgbm as lgb` didn't raise an error. Does that indicate that everything is OK?
For docstrings, you can run
@wxchan
```python
        elif isinstance(self, LGBMRanker):
            self._objective = "lambdarank"
        else:
            raise ValueError("Unknown LGBMModel type.")
```
I think it's better to warn instead of raising an exception, because we can't make an instance of LGBMModel.
I'm sorry that I was wrong.
Please ignore this point :p
This PR seems good, although we need @wxchan's approval.
```diff
         Number of parallel threads.
-    silent : boolean
+    silent : bool, optional (default=True)
         Whether to print messages while running boosting.
     **kwargs : other parameters
         Check http://lightgbm.readthedocs.io/en/latest/Parameters.html for more parameters.
         Note: **kwargs is not supported in sklearn, it may cause unexpected issues.
```
Is this note still accurate?
I can't ensure everything works, because sklearn doesn't support **kwargs, and it has indeed raised some issues before.
@wxchan OK, got it!
@guolinke LGTM now, I think it's OK to merge. @StrikerRUS Do you have anything left?
* improved sklearn interface; added sklearn's tests
* moved best_score into the if statement
* improved docstrings; simplified LGBMCheckConsistentLength
* fixed typo
* pylint
* updated example
* fixed Ranker interface
* added missed boosting_type
* fixed more comfortable autocomplete without unused objects
* removed check for None of eval_at
* fixed according to review
* fixed typo
* added description of fit return type
* dictionary->dict for short
* markdown cleanup
This PR makes the sklearn wrapper pass scikit-learn's `check_estimator` test.

The one remaining thing I want to add to this PR is dynamic construction of docstrings. I mean, there is no information about the `n_classes_` and `classes_` attributes in the docstring of `LGBMClassifier`. I want something like this:

Maybe someone could help me with it?
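One possible shape for that dynamic construction (a sketch with made-up docstring text, not the wording that eventually landed in the PR): extend the base class's docstring at class-definition time, so classifier-only attributes appear only on the subclass.

```python
class LGBMModel(object):
    """LightGBM model.

    Attributes
    ----------
    n_features_ : int
        The number of features of the fitted model.
    """


class LGBMClassifier(LGBMModel):
    # extend the inherited docstring with classifier-only attributes
    __doc__ = LGBMModel.__doc__ + (
        "    n_classes_ : int\n"
        "        The number of classes (classifier only).\n"
        "    classes_ : array of shape = [n_classes]\n"
        "        The class labels (classifier only).\n"
    )


print('n_classes_' in LGBMClassifier.__doc__)  # True
print('n_features_' in LGBMClassifier.__doc__)  # True
```

Since `__doc__` is built when the class body executes, tools like `help()` and Sphinx/readthedocs see the combined text like any ordinary docstring.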