
Staged predict function as in scikit-learn #5031

Closed
egemenzeytinci opened this issue Feb 24, 2022 · 8 comments


@egemenzeytinci

In scikit-learn, the staged_predict function lets you see the predicted regression targets at each stage, which makes it possible to monitor the model after each boosting step. Here is the link to the function: staged_predict

As far as I can see, LightGBM has no equivalent of this function for getting predictions at each step.

@jameslamb
Collaborator

Thanks for using LightGBM!

I see the following description at the link you provided:

This method allows monitoring (i.e. determine error on testing set) after each stage.

Could you explain a bit more why you think LightGBM would benefit from adding this method to LGBMRegressor in its Python package?

You can already achieve "determine error on testing set after each stage" by providing validation sets:

import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=10_000, n_features=8, n_informative=5)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.1,
    random_state=42
)

reg = lgb.LGBMRegressor(
    n_estimators=5,
    metric=["l2", "mae", "mape"]
)
reg.fit(X_train, y_train, eval_set=[(X_test, y_test)])

# show metrics evaluated at each iteration
reg.evals_result_
{'valid_0': OrderedDict([('l2',
               [9229.674591859464,
                7825.89932531363,
                6625.686405072829,
                5616.5372420227495,
                4797.817180326764]),
              ('l1',
               [76.4812212591533,
                70.12525847773459,
                64.32944081379047,
                58.96086177684842,
                54.24494127560389]),
              ('mape',
               [0.9449547823155805,
                0.9018169596502487,
                0.8558294965519376,
                0.8024717734359049,
                0.7569051056778325])])}

And if you want to get the predictions at each iteration, LightGBM allows you to provide an iteration number to its various predict() methods.

# example: get the model's predictions of the target on the training data, using only the first 2 trees
reg.predict(X_train, num_iteration=2)

@egemenzeytinci
Author

egemenzeytinci commented Feb 25, 2022

Thanks for your answer.

1. staged_predict could also be added to the classifier (along with staged_predict_proba, which doesn't apply to the regressor).

2. staged_predict is interesting to us for the following reasons:

  • The evolution of a prediction could be used as features in a downstream model (akin to an embedding). This is the most interesting feature we're after.
  • In some of our products, we'd like to present the whole prediction sequence as opposed to just the end result.
  • We have custom metrics and analyses we'd like to compute on each prediction round.

I get that the predictions can be obtained using the num_iteration kwarg inside a for loop, but AFAIU, to get the whole prediction sequence this quickly becomes inefficient as the number of trees grows, and unusable for 100+ trees.

@egemenzeytinci
Author

egemenzeytinci commented Mar 1, 2022

Gentle ping @jameslamb, is there any update on this issue?

@wuzhe1234

@egemenzeytinci, saving the prediction at each iteration with a callback could be a workaround.

import lightgbm as lgb
import numpy as np
import pandas as pd

# x_train / dtrain / dvalid / params / num_rounds come from your own setup
x_train = df_train["feature"]
n_rows = x_train.shape[0]
predictions = pd.DataFrame(np.zeros((n_rows, num_rounds), dtype=float))

def save_predictions(env):
    # env.model is the Booster; _Booster__inner_predict is a private API
    # that returns predictions on dataset 0 (the training set).
    # Copy, because the buffer is reused in each iteration.
    train_preds = env.model._Booster__inner_predict(0).copy()
    predictions.iloc[:, env.iteration] = train_preds

evals_result = {}
model = lgb.train(
    params,
    dtrain,
    num_boost_round=num_rounds,
    valid_sets=[dtrain, dvalid],
    valid_names=["train", "valid"],
    callbacks=[
        lgb.log_evaluation(period=20),
        lgb.record_evaluation(evals_result),
        save_predictions,
    ],
)

@jameslamb
Collaborator

Sorry for the delay, this project is really struggling with a lack of maintainer availability at the moment.

If this is something that's standard in scikit-learn for regression and classification, we're open to adding it to LightGBM's scikit-learn API. But we can't make any commitment to that happening in the near future.

If you're very interested in seeing this in LightGBM, the best way to make that happen soon is probably to contribute it yourself. If you're interested in attempting a pull request, we'd be happy to help with reviews and answers to any questions you have.

@github-actions

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!


github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023
@jameslamb
Collaborator

Sorry, this was locked accidentally. Just unlocked it. We'd still love help with this feature!

microsoft unlocked this conversation Aug 18, 2023