
WIP: [python-package] support sub-classing scikit-learn estimators #6783

Open · wants to merge 17 commits into master
Conversation

@jameslamb (Collaborator) commented Jan 10, 2025

I recently saw a Stack Overflow post ("Why can't I wrap LGBM?") expressing the same concern raised in #4426: it's difficult to sub-class lightgbm's scikit-learn estimators.

It doesn't have to be! Look how minimal the code is for XGBRFRegressor:

https://github.com/dmlc/xgboost/blob/45009413ce9f0d2bdfcd0c9ea8af1e71e3c0a191/python-package/xgboost/sklearn.py#L1869

This PR proposes borrowing some patterns I learned while working on xgboost's scikit-learn estimators to make it easier to sub-class lightgbm estimators. This also has the nice side effect of simplifying the lightgbm.dask code 😁

Notes for Reviewers

Why make the breaking change of requiring keyword args?

As part of this PR, I'm proposing immediately switching the constructors for the scikit-learn estimators here (including those in lightgbm.dask) to supporting only keyword arguments.

Why I'm proposing this instead of a deprecation cycle:

import lightgbm as lgb
lgb.LGBMClassifier("gbdt")
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# TypeError: LGBMClassifier.__init__() takes 1 positional argument but 2 were given
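(For context: keyword-only parameters are declared with a bare `*` in the signature. A generic sketch, not the PR's actual code:)

```python
class Example:
    # everything after the bare `*` must be passed by keyword
    def __init__(self, *, boosting_type: str = "gbdt", n_estimators: int = 100):
        self.boosting_type = boosting_type
        self.n_estimators = n_estimators

Example(boosting_type="dart")  # OK

try:
    Example("dart")  # positional arguments are rejected
except TypeError as err:
    print(err)  # takes 1 positional argument but 2 were given
```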

I posted a related answer to that Stack Overflow question

https://stackoverflow.com/a/79344862/3986677

@jameslamb jameslamb changed the title WIP: [python-package] support sub-classing scikit-learn estimators [python-package] support sub-classing scikit-learn estimators Jan 11, 2025
@jameslamb jameslamb marked this pull request as ready for review January 11, 2025 05:06
@jameslamb jameslamb mentioned this pull request Jan 23, 2025
@StrikerRUS (Collaborator)

Could you please set up an RTD build for this branch? I'd like to see how the __init__ signature will be rendered there.

@jameslamb (Collaborator, Author)

Sure, here's a first build: https://readthedocs.org/projects/lightgbm/builds/26983170/

@StrikerRUS (Collaborator) left a comment

Great simplification, thanks for working on it!

I don't have any serious comments, just want to get some answers before approving.

docs/FAQ.rst Outdated

def predict(self, X, max_score: float = np.inf):
preds = super().predict(X)
preds[np.where(preds > max_score)] = max_score
Collaborator:

Maybe np.clip() for simplicity?

Suggested change
preds[np.where(preds > max_score)] = max_score
np.clip(preds, a_min=None, a_max=max_score, out=preds)

Collaborator (Author):

oh nice, sure!

Collaborator:

Where did this change go? 😄

Collaborator:

(also, maybe make this an out-of-place operation?)

@jameslamb (Collaborator, Author) commented Feb 5, 2025

Where did this change go

It's there:

[screenshot of the committed change]

np.clip(preds, a_min=None, a_max=max_score, out=preds)

Might just not be obvious because I committed it directly via the git CLI, instead of clicking the button here.

maybe make this an out-of-place operation?

It's an in-place operation if you pass a pre-allocated array to out as we are here.

import numpy as np
y = np.array([0, 1, 2, 3, 4])
np.clip(y, a_min=None, a_max=2, out=y)
print(y)
# [0 1 2 2 2]

From https://numpy.org/doc/stable/reference/generated/numpy.clip.html#numpy.clip

out ndarray, optional
The results will be placed in this array. It may be the input array for in-place clipping. out must be of the right shape to hold the output. Its type is preserved.
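For comparison, omitting `out=` gives the out-of-place form, which allocates a new array and leaves the input untouched:

```python
import numpy as np

y = np.array([0, 1, 2, 3, 4])
clipped = np.clip(y, a_min=None, a_max=2)  # no out= -> a new array is returned
print(clipped)  # [0 1 2 2 2]
print(y)        # [0 1 2 3 4]  (unchanged)
```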

Collaborator:

Somehow GitHub did not show me the up-to-date code yesterday 🤯 nvm, sorry for bringing this up again 👀

python-package/lightgbm/sklearn.py (thread resolved)
tests/python_package_test/test_dask.py (thread resolved)
tests/python_package_test/test_dask.py (thread resolved)
importance_type=importance_type,
**kwargs,
)
super().__init__(**kwargs)

_base_doc = LGBMClassifier.__init__.__doc__
@StrikerRUS (Collaborator) commented Jan 27, 2025

Do you think it's OK to have just one client argument in the signature, but describe all the parent args in the docstring?

[screenshot of the rendered docs]

Collaborator (Author):

I think it's a little better for users to see all the parameters right here, instead of having to click over to another page.

This is what XGBoost is doing too: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRFRegressor

But I do also appreciate that it could look confusing.

If we don't do it this way, then I'd recommend we add a link in the docs for `**kwargs` in these estimators, like this:

**kwargs Other parameters for the model. These can be any of the keyword arguments for LGBMModel or any other LightGBM parameters documented at https://lightgbm.readthedocs.io/en/latest/Parameters.html.

I have a weak preference for keeping it as-is (the signature in docs not having all parameters, but docstring having all parameters), but happy to change it if you think that's confusing.

Collaborator:

Thanks for clarifying your opinion!
I love your suggestion for the **kwargs description. But my preference is also weak 🙂
I think we need a third opinion to settle this question.

Either way, I'm approving this PR!

Collaborator (Author):

@jmoralez or @borchero could one of you comment on this thread and help us break the tie?

To make progress on the release, if we don't hear back in the next 2 days I'll merge this PR as-is and we can come back and change the docs later.

Collaborator:

Sorry, I only saw this now! My personal preference would actually be to keep all of the parameters (similar to the previous state) and simply make them keyword arguments. While this results in more code and some duplication of defaults, I think that this is the clearest interface for users. If you think this is undesirable @jameslamb, I'd at least opt for documenting all of the "transitive" parameters, just like in the XGBoost docs.

Collaborator:

Going over the code again — given the number of times the args are repeated, I think using **kwargs is a very practical choice.

Collaborator (Author):

Thanks!

Haha so actually, I think your previous comment + @StrikerRUS 's question has convinced me that we should put all the keyword arguments back!

  • it reduces confusion in the docs and when running help() interactively ("the clearest interface for users", like you said)
  • we have a unit test to ensure that they all stay in sync
  • it's ok for that to be slightly more development effort because:
    • adding/removing keyword arguments from the class constructors is not that common here
    • adding new classes inheriting from these estimators here is very rare
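A rough sketch of that direction — explicit keyword arguments repeated in the sub-class and forwarded to the parent. This is simplified, with a made-up two-parameter subset and hypothetical class names rather than the real estimator signatures:

```python
class BaseEstimatorSketch:
    def __init__(self, *, boosting_type: str = "gbdt", learning_rate: float = 0.1, **kwargs):
        self.boosting_type = boosting_type
        self.learning_rate = learning_rate
        self._other_params = kwargs

class DaskEstimatorSketch(BaseEstimatorSketch):
    # defaults are duplicated on purpose so that help() and the rendered docs
    # show every parameter; a unit test can keep them in sync with the parent
    def __init__(
        self,
        *,
        boosting_type: str = "gbdt",
        learning_rate: float = 0.1,
        client=None,
        **kwargs,
    ):
        self.client = client
        super().__init__(
            boosting_type=boosting_type,
            learning_rate=learning_rate,
            **kwargs,
        )
```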

Collaborator (Author):

I'll make that change tomorrow. I've put WIP: in the title to indicate it's not ready to merge yet.

jameslamb and others added 2 commits January 29, 2025 22:29
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
jameslamb and others added 2 commits January 29, 2025 22:48
@jameslamb jameslamb requested a review from StrikerRUS January 30, 2025 04:48
@StrikerRUS (Collaborator) left a comment

Thank you very much!

@borchero (Collaborator) left a comment

Thanks!


@jameslamb jameslamb changed the title [python-package] support sub-classing scikit-learn estimators WIP: [python-package] support sub-classing scikit-learn estimators Feb 6, 2025