WIP: [python-package] support sub-classing scikit-learn estimators #6783
base: master
Conversation
Could you please set up an RTD build for this branch? I'd like to see how …
Sure, here's a first build: https://readthedocs.org/projects/lightgbm/builds/26983170/
Great simplification, thanks for working on it!
I don't have any serious comments, just want to get some answers before approving.
docs/FAQ.rst
Outdated
def predict(self, X, max_score: float = np.inf):
    preds = super().predict(X)
    preds[np.where(preds > max_score)] = max_score
Maybe np.clip() for simplicity?
Suggested change:
- preds[np.where(preds > max_score)] = max_score
+ np.clip(preds, a_min=None, a_max=max_score, out=preds)
oh nice, sure!
Where did this change go? 😄
(also, maybe make this an out-of-place operation?)
Where did this change go
It's there:
Line 405 in e39d19f
np.clip(preds, a_min=None, a_max=max_score, out=preds)
Might just not be obvious because I committed it directly via the git CLI, instead of clicking the button here.
maybe make this an out-of-place operation?
It's an in-place operation if you pass a pre-allocated array to out, as we are here.
import numpy as np
y = np.array([0, 1, 2, 3, 4])
np.clip(y, a_min=None, a_max=2, out=y)
print(y)
# [0 1 2 2 2]
From https://numpy.org/doc/stable/reference/generated/numpy.clip.html#numpy.clip:
out : ndarray, optional
    The results will be placed in this array. It may be the input array for in-place clipping. out must be of the right shape to hold the output. Its type is preserved.
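For comparison, a minimal sketch of the out-of-place form: dropping out= makes np.clip allocate and return a new array instead of overwriting the input.
import numpy as np
y = np.array([0, 1, 2, 3, 4])
z = np.clip(y, a_min=None, a_max=2)  # out-of-place: y stays unchanged
print(y)
# [0 1 2 3 4]
print(z)
# [0 1 2 2 2]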
Somehow GitHub did not show me the up-to-date code yesterday 🤯 nvm, sorry for bringing this up again 👀
    importance_type=importance_type,
    **kwargs,
)
super().__init__(**kwargs)

_base_doc = LGBMClassifier.__init__.__doc__
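For context, a minimal sketch of the docstring-reuse pattern this hunk relies on; the subclass name here is hypothetical, not the one from the diff:
from lightgbm import LGBMClassifier

class MyClassifier(LGBMClassifier):  # hypothetical subclass
    def __init__(self, **kwargs):
        # forward everything to the parent instead of re-listing parameters
        super().__init__(**kwargs)

    # borrow the parent constructor's docstring so help() and the rendered
    # docs still show the full parameter list
    __init__.__doc__ = LGBMClassifier.__init__.__doc__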
I think it's a little better for users to see all the parameters right here, instead of having to click over to another page.
This is what XGBoost is doing too: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRFRegressor
But I do also appreciate that it could look confusing.
If we don't do it this way, then I'd recommend we add a link in the docs for **kwargs in these estimators, like this:
**kwargs
    Other parameters for the model. These can be any of the keyword arguments for LGBMModel, or any other LightGBM parameters documented at https://lightgbm.readthedocs.io/en/latest/Parameters.html.
I have a weak preference for keeping it as-is (the signature in docs not having all parameters, but docstring having all parameters), but happy to change it if you think that's confusing.
Thanks for clarifying your opinion!
I love your suggestion for the **kwargs description. But my preference is also weak 🙂
I think we need a third judge opinion for this question.
Either way, I'm approving this PR!
Sorry, I only saw this now! My personal preference would actually be to keep all of the parameters (similar to the previous state) and simply make them keyword arguments. While this results in more code and some duplication of defaults, I think that this is the clearest interface for users. If you think this is undesirable @jameslamb, I'd at least opt for documenting all of the "transitive" parameters, just like in the XGBoost docs.
Going over the code again, given the number of times the args are repeated, I think it's a very practical consideration to use **kwargs.
Thanks!
Haha so actually, I think your previous comment + @StrikerRUS's question has convinced me that we should put all the keyword arguments back!
- it reduces confusion in the docs or when running help() interactively ("the clearest interface for users" like you said)
- we have a unit test to ensure that they all stay in sync (rough sketch after this list)
- it's ok for that to be slightly more development effort because:
  - adding/removing keyword arguments from the class constructors is not that common here
  - adding new classes inheriting from these estimators here is very rare
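A rough sketch of what such a sync test could look like (illustrative, inspect-based; not necessarily the project's actual test):
import inspect
import lightgbm as lgb

def test_keyword_args_stay_in_sync():
    base = inspect.signature(lgb.LGBMModel.__init__).parameters
    for cls in (lgb.LGBMClassifier, lgb.LGBMRegressor, lgb.LGBMRanker):
        sub = inspect.signature(cls.__init__).parameters
        # every keyword argument of the base estimator should reappear
        # in the subclass constructor
        missing = set(base) - set(sub)
        assert not missing, f"{cls.__name__} is missing {missing}"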
I'll make that change tomorrow. Put WIP: in the title to indicate it's not ready to merge yet.
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Thank you very much!
Thanks!
I recently saw a Stack Overflow post ("Why can't I wrap LGBM?") expressing the same concerns from #4426 ... it's difficult to sub-class lightgbm's scikit-learn estimators.

It doesn't have to be! Look how minimal the code is for XGBRFRegressor:
https://github.com/dmlc/xgboost/blob/45009413ce9f0d2bdfcd0c9ea8af1e71e3c0a191/python-package/xgboost/sklearn.py#L1869

This PR proposes borrowing some patterns I learned while working on xgboost's scikit-learn estimators to make it easier to sub-class lightgbm estimators. This also has the nice side effect of simplifying the lightgbm.dask code 😁
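As a rough illustration, a custom estimator under the **kwargs-based constructors proposed here could be as small as this (a sketch; get_params/set_params details omitted):
import numpy as np
from lightgbm import LGBMRegressor

class CappedLGBMRegressor(LGBMRegressor):
    """LGBMRegressor whose predictions are clipped to an upper bound."""

    def predict(self, X, max_score: float = np.inf, **kwargs):
        preds = super().predict(X, **kwargs)
        return np.clip(preds, a_min=None, a_max=max_score)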
Notes for Reviewers

Why make the breaking change of requiring keyword args?

As part of this PR, I'm proposing immediately switching the constructors for scikit-learn estimators here (including those in lightgbm.dask) to only supporting keyword arguments.
Why I'm proposing this instead of a deprecation cycle:
- scikit-learn itself does this (HistGradientBoostingClassifier example)

I posted a related answer to that Stack Overflow question: https://stackoverflow.com/a/79344862/3986677
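For readers unfamiliar with the mechanism: a bare * in a signature makes every parameter after it keyword-only. A minimal sketch with illustrative parameter names:
class Estimator:
    def __init__(self, *, num_leaves=31, learning_rate=0.1):
        self.num_leaves = num_leaves
        self.learning_rate = learning_rate

Estimator(num_leaves=64)   # OK
# Estimator(64)            # would raise TypeError: positional args not allowed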