Implement easy access to single-tree prediction in fitted LGBM model #3058

Closed
pransito opened this issue May 8, 2020 · 7 comments

@pransito

pransito commented May 8, 2020

This has been mentioned in #845; however, the solution suggested there does not work. Here I would like to re-emphasize the need and elaborate on the desired feature.

Summary

In sklearn it is very easy to access the prediction of every single tree in the ensemble via "model.estimators_" — that is, each tree's own prediction, independent of all other trees (not a cumulative prediction). In LightGBM (I am mainly concerned with regression) this is difficult or so far even impossible to achieve. In #845 it was suggested to do this via booster.dump_model and leaf-index prediction, but I have not managed to make that work: the values associated with the leaves appear to be mean-corrected, or only reflect the incremental change relative to the previous trees. Even taking all of this into account, the result is still a cumulative prediction and therefore yields very narrow prediction distributions.
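
For reference, a minimal sketch of the sklearn access pattern described above, using a random forest regressor so that each element of estimators_ is a stand-alone tree; the toy data, model settings, and variable names here are illustrative and not taken from the issue:

```python
# Sketch: per-tree predictions in scikit-learn via estimators_.
# The toy data and hyperparameters are illustrative only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# One column per tree: each tree's own prediction for every row of X,
# independent of all other trees.
per_tree = np.column_stack([tree.predict(X) for tree in model.estimators_])
print(per_tree.shape)        # (200, 50)
print(per_tree.std(axis=1))  # spread of per-tree predictions for each row
```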

Motivation

It would be very useful to have this feature because in certain use cases it is important to get an idea of the distribution of predictions across all the trees (is it wide or narrow; is it skewed?). In some sense it may be interpreted as a posterior distribution over the target variable that is to be predicted (in LGBM regression). This is relevant for both classical GBM regression and classical RF regression.

Description

Like in sklearn, there should be an .estimators_ object with a .predict(X) method that returns the prediction of every single tree for every row in X. It should be easily accessible and not hidden, and it should automatically handle whether boost_from_average was used. There should be a clear distinction between cumulative prediction (currently implemented via .predict(num_iteration=i)) and "iid" prediction (i.e. every single tree on its own), which I suggest implementing as a new feature. One could imagine giving the .predict() function a flag cumulative=True; when set to False, the trees would answer independently of one another.
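
To make the request concrete, a purely hypothetical sketch of how such a flag might look from the user's side; the cumulative parameter does not exist in LightGBM and is shown only to illustrate the proposal:

```python
# Hypothetical usage of the proposed interface. The `cumulative` flag does NOT
# exist in LightGBM; it is shown only to illustrate the shape of the request.
import numpy as np
import lightgbm as lgb

X = np.random.rand(200, 5)
y = np.random.rand(200)
bst = lgb.train({"objective": "regression", "verbose": -1}, lgb.Dataset(X, y))

# Current behavior: cumulative prediction of the first 3 trees.
cumulative_pred = bst.predict(X, num_iteration=3)

# Requested behavior (hypothetical): one independent prediction per tree,
# e.g. an array of shape (n_rows, n_trees).
# per_tree_pred = bst.predict(X, cumulative=False)
```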

References

https://github.com/scikit-learn/scikit-learn/blob/95d4f0841/sklearn/tree/_classes.py#L395

@franktoffel

Any update on this? We are facing similar issues.

@guolinke
Collaborator

@shiyu1994 can you help to check this?

@shiyu1994
Collaborator

> @shiyu1994 can you help to check this?

Maybe we can add a predict_with_tree(tree_id=i) method for Booster. I'll handle this.

@StrikerRUS
Collaborator

Would adding a start_iteration parameter to the existing predict method be enough? It would then be possible to select a single tree with the help of num_iteration and start_iteration, and it would be consistent with the API of the save_model method (and some others); see the usage sketch after the docstrings below.

```python
def predict(self, data, num_iteration=None,
            raw_score=False, pred_leaf=False, pred_contrib=False,
            data_has_header=False, is_reshape=True, **kwargs):
    """Make a prediction.

    Parameters
    ----------
    data : string, numpy array, pandas DataFrame, H2O DataTable's Frame or scipy.sparse
        Data source for prediction.
        If string, it represents the path to txt file.
    num_iteration : int or None, optional (default=None)
        Limit number of iterations in the prediction.
        If None, if the best iteration exists, it is used; otherwise, all iterations are used.
        If <= 0, all iterations are used (no limits).
    raw_score : bool, optional (default=False)
        Whether to predict raw scores.
    pred_leaf : bool, optional (default=False)
        Whether to predict leaf index.
    pred_contrib : bool, optional (default=False)
        Whether to predict feature contributions.

        .. note::

            If you want to get more explanations for your model's predictions using SHAP values,
            like SHAP interaction values,
            you can install the shap package (https://github.com/slundberg/shap).
            Note that unlike the shap package, with ``pred_contrib`` we return a matrix with an extra
            column, where the last column is the expected value.

    data_has_header : bool, optional (default=False)
        Whether the data has header.
        Used only if data is string.
    is_reshape : bool, optional (default=True)
        If True, result is reshaped to [nrow, ncol].
    **kwargs
        Other parameters for the prediction.

    Returns
    -------
    result : numpy array, scipy.sparse or list of scipy.sparse
        Prediction result.
        Can be sparse or a list of sparse objects (each element represents predictions for one class) for feature contributions (when ``pred_contrib=True``).
    """
```

```python
def save_model(self, filename, num_iteration=None, start_iteration=0, importance_type='split'):
    """Save Booster to file.

    Parameters
    ----------
    filename : string
        Filename to save Booster.
    num_iteration : int or None, optional (default=None)
        Index of the iteration that should be saved.
        If None, if the best iteration exists, it is saved; otherwise, all iterations are saved.
        If <= 0, all iterations are saved.
    start_iteration : int, optional (default=0)
        Start index of the iteration that should be saved.
    importance_type : string, optional (default="split")
        What type of feature importance should be saved.
        If "split", result contains numbers of times the feature is used in a model.
        If "gain", result contains total gains of splits which use the feature.

    Returns
    -------
    self : Booster
        Returns self.
    """
```

@shiyu1994
Collaborator

I've done the implementation as @StrikerRUS suggested. If boost_from_average is enabled, the average score is integrated into the first tree, so booster.predict(data, start_iteration=0, num_iteration=1) will provide the score of the first tree with the average value added. Does that meet your request? @pransito
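
To connect this back to the original request, a sketch (again assuming LightGBM >= 3.0, where predict accepts start_iteration) of collecting every tree's individual score; note that with boost_from_average the average is folded into tree 0, so that column includes the offset:

```python
# Sketch: per-tree scores from a trained Booster (LightGBM >= 3.0).
# With boost_from_average, the average score is part of tree 0's column.
import numpy as np
import lightgbm as lgb

X = np.random.rand(200, 5)
y = np.random.rand(200)
n_trees = 20
bst = lgb.train({"objective": "regression", "verbose": -1},
                lgb.Dataset(X, y), num_boost_round=n_trees)

per_tree = np.column_stack([
    bst.predict(X, start_iteration=i, num_iteration=1) for i in range(n_trees)
])
print(per_tree.shape)  # (200, 20); column i is the score added by tree i
# For regression, the per-tree scores sum to the full prediction.
print(np.allclose(per_tree.sum(axis=1), bst.predict(X)))
```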

@franktoffel

franktoffel commented Aug 6, 2020 via email

@github-actions

This issue has been automatically locked since there has not been any recent activity after it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023