
WIP: [python-package] support sub-classing scikit-learn estimators #6783

Open · wants to merge 17 commits into master
Conversation

@jameslamb (Collaborator) commented Jan 10, 2025

I recently saw a Stack Overflow post ("Why can't I wrap LGBM?") expressing the same concern raised in #4426: it's difficult to sub-class lightgbm's scikit-learn estimators.

It doesn't have to be! Look how minimal the code is for XGBRFRegressor:

https://github.com/dmlc/xgboost/blob/45009413ce9f0d2bdfcd0c9ea8af1e71e3c0a191/python-package/xgboost/sklearn.py#L1869

This PR proposes borrowing some patterns I learned while working on xgboost's scikit-learn estimators to make it easier to sub-class lightgbm estimators. This also has the nice side effect of simplifying the lightgbm.dask code 😁

Notes for Reviewers

Why make the breaking change of requiring keyword args?

As part of this PR, I'm proposing immediately switching the constructors for the scikit-learn estimators here (including those in lightgbm.dask) to supporting only keyword arguments.

Why I'm proposing this instead of a deprecation cycle:

import lightgbm as lgb
lgb.LGBMClassifier("gbdt")
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# TypeError: LGBMClassifier.__init__() takes 1 positional argument but 2 were given
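(For context: keyword-only parameters are declared with a bare `*` in the signature. A generic sketch, not the PR's actual code:)

```python
class Example:
    # everything after the bare `*` must be passed by keyword
    def __init__(self, *, boosting_type: str = "gbdt", n_estimators: int = 100):
        self.boosting_type = boosting_type
        self.n_estimators = n_estimators

Example(boosting_type="dart")  # OK

try:
    Example("dart")  # positional arguments are rejected
except TypeError as err:
    print(err)  # takes 1 positional argument but 2 were given
```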

I posted a related answer to that Stack Overflow question

https://stackoverflow.com/a/79344862/3986677

@jameslamb jameslamb changed the title WIP: [python-package] support sub-classing scikit-learn estimators [python-package] support sub-classing scikit-learn estimators Jan 11, 2025
@jameslamb jameslamb marked this pull request as ready for review January 11, 2025 05:06
@jameslamb jameslamb mentioned this pull request Jan 23, 2025
@StrikerRUS (Collaborator)

Could you please set up an RTD build for this branch? I'd like to see how the __init__ signature will be rendered there.

@jameslamb (Collaborator, Author)

Sure, here's a first build: https://readthedocs.org/projects/lightgbm/builds/26983170/

@StrikerRUS (Collaborator) left a comment

Great simplification, thanks for working on it!

I don't have any serious comments, just want to get some answers before approving.

docs/FAQ.rst Outdated

def predict(self, X, max_score: float = np.inf):
preds = super().predict(X)
preds[np.where(preds > max_score)] = max_score
Collaborator:

Maybe np.clip() for simplicity?

Suggested change
preds[np.where(preds > max_score)] = max_score
np.clip(preds, a_min=None, a_max=max_score, out=preds)

Collaborator (Author):

oh nice, sure!

Collaborator:

Where did this change go? 😄

Collaborator:

(also, maybe make this an out-of-place operation?)

@jameslamb (Collaborator, Author) commented Feb 5, 2025

Where did this change go

It's there:

[screenshot of the committed change]

np.clip(preds, a_min=None, a_max=max_score, out=preds)

Might just not be obvious because I committed it directly via the git CLI, instead of clicking the button here.

maybe make this an out-of-place operation?

It's an in-place operation if you pass a pre-allocated array to out as we are here.

import numpy as np
y = np.array([0, 1, 2, 3, 4])
np.clip(y, a_min=None, a_max=2, out=y)
print(y)
# [0 1 2 2 2]

From https://numpy.org/doc/stable/reference/generated/numpy.clip.html#numpy.clip

out ndarray, optional
The results will be placed in this array. It may be the input array for in-place clipping. out must be of the right shape to hold the output. Its type is preserved.
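For comparison, omitting `out=` gives the out-of-place form, which allocates a new array and leaves the input untouched:

```python
import numpy as np

y = np.array([0, 1, 2, 3, 4])
clipped = np.clip(y, a_min=None, a_max=2)  # no out= -> a new array is returned
print(clipped)  # [0 1 2 2 2]
print(y)        # [0 1 2 3 4]  (unchanged)
```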

Collaborator:

Somehow GitHub did not show me the up-to-date code yesterday 🤯 nvm, sorry for bringing this up again 👀

python-package/lightgbm/sklearn.py (thread resolved)
tests/python_package_test/test_dask.py (thread resolved)
tests/python_package_test/test_dask.py (thread resolved)
importance_type=importance_type,
**kwargs,
)
super().__init__(**kwargs)

_base_doc = LGBMClassifier.__init__.__doc__
@StrikerRUS (Collaborator) commented Jan 27, 2025

Do you think it's OK to have just one client argument in the signature, but describe all the parent args in the docstring?

[screenshot of the rendered docs]

Collaborator (Author):

I think it's a little better for users to see all the parameters right here, instead of having to click over to another page.

This is what XGBoost is doing too: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRFRegressor

But I do also appreciate that it could look confusing.

If we don't do it this way, then I'd recommend we add a link in the docs for `**kwargs` in these estimators, like this:

**kwargs Other parameters for the model. These can be any of the keyword arguments for LGBMModel or any other LightGBM parameters documented at https://lightgbm.readthedocs.io/en/latest/Parameters.html.

I have a weak preference for keeping it as-is (the signature in docs not having all parameters, but docstring having all parameters), but happy to change it if you think that's confusing.

Collaborator:

Thanks for clarifying your opinion!
I love your suggestion for the **kwargs description. But my preference is also weak 🙂
I think we need a third opinion to settle this question.

Either way, I'm approving this PR!

Collaborator (Author):

@jmoralez or @borchero could one of you comment on this thread and help us break the tie?

To make progress on the release, if we don't hear back in the next 2 days I'll merge this PR as-is and we can come back and change the docs later.

Collaborator:

Sorry, I only saw this now! My personal preference would actually be to keep all of the parameters (similar to the previous state) and simply make them keyword arguments. While this results in more code and some duplication of defaults, I think that this is the clearest interface for users. If you think this is undesirable @jameslamb, I'd at least opt for documenting all of the "transitive" parameters, just like in the XGBoost docs.

Collaborator:

Going over the code again — given the number of times the args are repeated, I think using **kwargs is a very practical choice.

Collaborator (Author):

Thanks!

Haha so actually, I think your previous comment + @StrikerRUS 's question has convinced me that we should put all the keyword arguments back!

  • it reduces confusion in the docs and when running help() interactively ("the clearest interface for users", like you said)
  • we have a unit test to ensure that they all stay in sync
  • it's ok for that to be slightly more development effort because:
    • adding/removing keyword arguments from the class constructors is not that common here
    • adding new classes inheriting from these estimators here is very rare
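A rough sketch of that direction — explicit keyword arguments repeated in the sub-class and forwarded to the parent. This is simplified, with a made-up two-parameter subset and hypothetical class names rather than the real estimator signatures:

```python
class BaseEstimatorSketch:
    def __init__(self, *, boosting_type: str = "gbdt", learning_rate: float = 0.1, **kwargs):
        self.boosting_type = boosting_type
        self.learning_rate = learning_rate
        self._other_params = kwargs

class DaskEstimatorSketch(BaseEstimatorSketch):
    # defaults are duplicated on purpose so that help() and the rendered docs
    # show every parameter; a unit test can keep them in sync with the parent
    def __init__(
        self,
        *,
        boosting_type: str = "gbdt",
        learning_rate: float = 0.1,
        client=None,
        **kwargs,
    ):
        self.client = client
        super().__init__(
            boosting_type=boosting_type,
            learning_rate=learning_rate,
            **kwargs,
        )
```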

Collaborator (Author):

I'll make that change tomorrow. I've put WIP: in the title to indicate it's not ready to merge yet.

jameslamb and others added 2 commits January 29, 2025 22:29
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
jameslamb and others added 2 commits January 29, 2025 22:48
@jameslamb jameslamb requested a review from StrikerRUS January 30, 2025 04:48
@StrikerRUS (Collaborator) left a comment

Thank you very much!

@borchero (Collaborator) left a comment

Thanks!


@jameslamb jameslamb changed the title [python-package] support sub-classing scikit-learn estimators WIP: [python-package] support sub-classing scikit-learn estimators Feb 6, 2025