[RFC] compatibility with scikit-learn #2628
Comments
Is there any reason why they broke compatibility? |
Unfortunately, I have no info. According to scikit-learn/scikit-learn#15805 (comment), seems that "optional data-dependent parameters" actually means "optional parameters should be indexable and have length equal to number of data samples". And "Also it is expected that parameters with trailing _ are not to be set inside the |
@StrikerRUS without eval sets, how does sklearn do early stopping? |
Via valid data fraction parameter and
Not a fair replacement for the current behavior I think. What if validation data comes from a different file, for example? Then it might be not a function to partition data, but a more general function that returns some data. However, passing (actually storing) functions is dangerous due to But hold on - seems they want to revert this check completely for now in 0.22.1: scikit-learn/scikit-learn#15805 (comment). Speaking generally about sklearn wrapper design, I think we should wait at least until their HistGradientBoosting leaves the beta stage. They will probably have the same problems as we do because they want to mimic LightGBM (mostly)/XGBoost/CatBoost behavior: scikit-learn/scikit-learn#15127, scikit-learn/scikit-learn#15841, scikit-learn/scikit-learn#14830, ...
|
Yeah we'll be fixing this issue in the next minor release. Conventionally, we expect the fit params to be sample aligned, and that's why the recent change wasn't considered backward incompatible. We know it has its limitations, and we're working on an API to fix those issues in a better way, but that'll take a while probably. |
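To make the "sample aligned" convention discussed above concrete, here is a minimal sketch of such a check. The helper name `is_sample_aligned` and the placeholder values are hypothetical and chosen only for illustration; this is not scikit-learn's actual validation code.

```python
def is_sample_aligned(value, n_samples):
    """Return True if `value` is indexable and has one entry per sample.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    try:
        return hasattr(value, "__getitem__") and len(value) == n_samples
    except TypeError:
        # Scalars and None have no len(), so they are not sample aligned.
        return False


n_samples = 4
sample_weight = [1.0, 0.5, 2.0, 1.0]   # one weight per training sample
eval_set = [("X_valid", "y_valid")]    # placeholder for one validation pair
early_stopping_rounds = 5              # a plain scalar

for name, value in [
    ("sample_weight", sample_weight),
    ("eval_set", eval_set),
    ("early_stopping_rounds", early_stopping_rounds),
]:
    print(name, is_sample_aligned(value, n_samples))
```

Under this reading, only `sample_weight` qualifies, while scalar and validation-set arguments do not, which matches the concern raised in this thread.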
@adrinjalali Thank you very much for the heads-up! However, I strongly believe that a library's public integration API cannot be developed by internal "convention", but should be documented in an appropriate place ahead of actual changes. Unfortunately, third-party developers cannot follow all the discussions that happen in a repo's issues, and the only way for them to get familiar with API conventions and agreements is public docs and warnings. Also, it'd be very cool to introduce breaking checks firstly in BTW, is it possible to revert another breaking change in |
Although we have not decided anything yet, we cannot let failing CI tests paralyze our development process and nightly releases. So we simply prohibit using newer scikit-learn versions for now: #2637. |
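A version gate of the kind described above could look roughly like the following. This is only a sketch under assumptions: the function names and the error message are hypothetical, and it is not the code from #2637. The last supported version, 0.21.3, is the one named in the issue description.

```python
LAST_SUPPORTED = (0, 21, 3)  # last supported scikit-learn version per this issue


def parse_version(version):
    """Turn '0.21.3' into (0, 21, 3); non-numeric suffixes such as 'dev0' are ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def ensure_supported_sklearn(installed_version):
    """Raise if the installed version is newer than the last supported one."""
    if parse_version(installed_version) > LAST_SUPPORTED:
        raise RuntimeError(
            "scikit-learn {} is not supported; the last supported version is "
            "0.21.3".format(installed_version)
        )


ensure_supported_sklearn("0.21.3")  # passes silently
try:
    ensure_supported_sklearn("0.22")
except RuntimeError as err:
    print(err)
```

In a real package the installed version string would come from `sklearn.__version__`; it is passed in explicitly here so the sketch has no dependencies.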
Be aware that we are going to revert back the behaviour in scikit-learn/scikit-learn#15863 which will be back-ported in 0.22.1. In your EDIT: Here I am referring to passing scalar values in |
Which change specifically? |
BTW the fact that fit params should be data-dependent is not new. Early stopping parameters (
For the validation set I am not so sure. In the next major version for scikit-learn we will probably work on generalizing a consistent API for early stopping and we will have to discuss what is the proper way to pass a pre-sampled validation set. |
@glemaitre This is very awesome! Thanks a lot for heads-up!
That one which prohibits setting private fields in
Agree! However, "data dependent variables" is not a very intuitive term to describe such expected behaviour. Maybe it's better to rephrase it as "data aligned variables" to avoid confusion? Or the best option is to explicitly say what you mean, for example: "data aligned variables which should be indexable and have length equal to number of data samples in training set".
More awesome news! |
We are still not fully decided about this requirement. It will probably be refined soon.
This is being rolled-back in 0.22.1: scikit-learn/scikit-learn#15947 |
@ogrisel Great news! Looking forward to the new version!
But I do not see any diffs for |
scikit-learn 0.22.1 is out. This restores the previous behavior. |
Hum I just saw the last comment.
This is not incompatible. scikit-learn/scikit-learn#15947 makes it possible to customize the way an estimator checks that it has been fitted when calling predict / transform. |
@ogrisel Great! Thank you very much for keeping us up to date with the latest news!
Sorry, but I'm afraid I haven't fully understood this. Does it mean that with 0.22.1 version the following code is allowed in LightGBM/python-package/lightgbm/sklearn.py Lines 320 to 322 in edb9149
I'm asking because the new version has not been updated on conda yet, which is where our CI routines get packages. |
It's true that setting (fixed) private attribute in What has changed in scikit-learn/scikit-learn#14511 is that this check used to be only run when Note that you can use |
@ogrisel Ah, got it! Thanks a lot for the detailed info! |
Just curious, are there any plans to bring consistency between docs and public API tests for that aspect? Or what is the internal convention on private attributes in |
Gentle ping @ogrisel . |
@StrikerRUS I opened scikit-learn/scikit-learn#16241 to discuss this. |
Thank you, @ogrisel ! |
To give a status update on this, we are still waiting for a contributor to propose a fix in scikit-learn/scikit-learn#16241 (have somewhat limited resources atm). Bear in mind that common checks in scikit-learn are by no means perfect, and while we are working to improve them, occasionally they will yield false positives: scikit-learn/scikit-learn#6715. They have been made more modular with In this case, I think the right thing is to skip that test and indicate |
Thanks a lot for your feedback!
Yeah, we've been waiting for an answer exactly for this question 🙂
Do not waste your time! I'll prepare changes and ping you when they are ready for review. |
Just made #2946 already :) |
Huh, nice timing! |
The new scikit-learn version (0.22) breaks compatibility with our current estimators from the sklearn wrapper (it is not present in the default conda channel yet, that's why we have a green master). There are two failing tests there:

The first failing test says that we are not allowed to pass any optional arguments to the fit method, except sample_weight. There is an issue in the scikit-learn repo which, once fixed, should bring back the possibility to pass scalar values in 0.22.1: scikit-learn/scikit-learn#15805. However, it will not solve our problem because we have a lot of non-scalar arguments, e.g. eval_set, eval_names, eval_init_score, etc. Speaking about eval_set, they are thinking about the possibility of adding a mechanism for custom validation data in scikit-learn, but haven't designed an API for it yet: scikit-learn/scikit-learn#15127 (comment). So I don't think we should expect a public API release soon.

The second failing test indicates that an estimator "should not set any attribute apart from parameters during init". It seems that this means we cannot use kwargs anymore.

Moreover, it turned out recently that passing the check_estimator test doesn't mean that an estimator is "fully compatible" with the scikit-learn library: scikit-learn/scikit-learn#15392 (comment). In fact, it means we have no formal indicator that our estimators work fine with tools from scikit-learn.

Given that we know our estimators were compatible with scikit-learn < 0.22 in terms of passing their public API test and in practice as well (we have no reproducible open issues related to the sklearn wrapper in our repo at present), I think that the only option we have for now is to document here that the last supported version of scikit-learn is 0.21.3 and prohibit using more recent versions by raising a fatal error here. Then we should wait for a stable and more complete version of the check_estimator test and rewrite our wrapper from scratch, preserving our current features (if that is possible at all). For now I don't think we have the resources to rewrite our wrapper completely after every new scikit-learn release, which can potentially break everything by introducing new checks in check_estimator without any deprecation/future warnings (the second current failing test), or even without any new checks in check_estimator but with new checks directly inside a tool (the first current failing test). Even if we do that, we have no guarantee that we are "fully compatible" with a new version, and we will have to check compatibility by testing every tool from scikit-learn with our estimators in different scenarios by hand and wait for issue reports from users.

Maybe we should ask for help or advice from the scikit-learn team, as was kindly suggested here: scikit-learn/scikit-learn#15392 (comment). IDK...
@guolinke, @chivee, @Laurae2, @jameslamb, @henry0312
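To illustrate why accepting extra keyword arguments can trip a rule like "should not set any attribute apart from parameters during init", here is a simplified sketch of the idea. The helper `attributes_set_outside_params` and both estimator classes are hypothetical; this is not scikit-learn's check, nor the LightGBM wrapper itself.

```python
import inspect


def attributes_set_outside_params(estimator_cls):
    """List attributes created by __init__ that are not named constructor parameters."""
    signature = inspect.signature(estimator_cls.__init__)
    variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    param_names = {
        name
        for name, param in signature.parameters.items()
        if name != "self" and param.kind not in variadic
    }
    instance = estimator_cls()
    return sorted(set(vars(instance)) - param_names)


class PlainEstimator:
    def __init__(self, n_estimators=100):
        self.n_estimators = n_estimators


class KwargsEstimator:
    # Hypothetical wrapper that stores extra keyword arguments at construction.
    def __init__(self, n_estimators=100, **kwargs):
        self.n_estimators = n_estimators
        self._other_params = dict(kwargs)


print(attributes_set_outside_params(PlainEstimator))
print(attributes_set_outside_params(KwargsEstimator))
```

The first class sets only its named parameter, so nothing is reported; the second creates an extra attribute to hold the keyword arguments, which is the pattern such a check would flag.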