[python package]: suggestion: lgb.Booster.predict() should check that the input X data makes sense #812

j-mark-hou · 2017-08-09T14:28:40Z

In particular, I'm thinking about these things:

0. If the input is an np.array, check that the number of columns is the same as the number of features the lgb.Booster object uses. If not, throw a warning.
1. If the input is a pd.DataFrame object, check that the feature_names of the lgb.Booster object is a superset of the columns of the pd.DataFrame.
2. If feature names in the booster object are repeated, or if column names in the pd.DataFrame are repeated, fall back to 0.

if these things sound reasonable I'd be happy to add these checks to the lgb.Booster.predict() function prior to calling the lgb._InnerPredictor.predict()
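
For illustration only, the three checks proposed above could be sketched roughly as follows. This is not LightGBM code: `check_predict_input` and its parameters are hypothetical, and the sketch works on plain lists of names and a column count so that it does not depend on numpy or pandas.

```python
import warnings
from collections import Counter


def check_predict_input(model_feature_names, n_columns, column_names=None):
    """Hypothetical helper: compare prediction input against a model's features.

    model_feature_names: feature names stored in the model
    n_columns: number of columns in the input
    column_names: column names if the input is a DataFrame, otherwise None
    Returns True if the input looks consistent, False (after a warning) if not.
    """
    def has_duplicates(names):
        return any(count > 1 for count in Counter(names).values())

    # Repeated names on either side make name matching ambiguous, so in that
    # case fall back to the plain column-count comparison.
    use_names = (
        column_names is not None
        and not has_duplicates(model_feature_names)
        and not has_duplicates(column_names)
    )

    if use_names:
        # The model's feature names must be a superset of the input columns.
        known = set(model_feature_names)
        unknown = [name for name in column_names if name not in known]
        if unknown:
            warnings.warn(f"columns not known to the model: {unknown}")
            return False
        return True

    # Array input (or ambiguous names): the column count must match.
    expected = len(model_feature_names)
    if n_columns != expected:
        warnings.warn(f"model expects {expected} features, got {n_columns} columns")
        return False
    return True
```

With a real Booster, the name list would presumably come from `booster.feature_name()` and the column information from the input's shape and columns.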

guolinke · 2017-08-16T13:05:55Z

@j-mark-hou sure, very happy to see this feature.

guolinke · 2017-10-26T01:35:06Z

@j-mark-hou any updates ?

j-mark-hou · 2017-10-26T01:40:53Z

I think this should maybe be rolled into a more systematic rewrite of the pandas API. There are some things about the current implementation that I don't quite understand, and unfortunately I don't have the time to dig deeper into this right now. Sorry, I'll let you know if I revisit this at some point in the future.

arsenyinfo · 2018-03-30T23:15:10Z

Currently, the missing check leads to an inconsistency between the sklearn wrapper and the Booster API:

In [1]: from lightgbm import LGBMClassifier, Dataset, train
   ...: import numpy as np
   ...:
   ...: x_data, y_data = np.random.rand(1000, 100), np.random.rand(1000) > .5
   ...: x_bad = np.random.rand(1000, 101)
   ...:
   ...:
   ...: def sklearn_style():
   ...:     clf = LGBMClassifier()
   ...:     clf.fit(x_data, y_data)
   ...:     return clf.predict(x_bad)
   ...:
   ...:
   ...: def xgboost_style():
   ...:     dataset = Dataset(x_data, y_data)
   ...:     params = {'application': 'binary'}
   ...:     booster = train(params, dataset)
   ...:     return booster.predict(x_bad)
   ...:

In [2]: sklearn_style().shape
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-281be6aeee0d> in <module>()
----> 1 sklearn_style().shape

<ipython-input-1-adc7de9c0c86> in sklearn_style()
      9     clf = LGBMClassifier()
     10     clf.fit(x_data, y_data)
---> 11     return clf.predict(x_bad)
     12
     13

~/.pyenv/versions/3.6.2/lib/python3.6/site-packages/lightgbm/sklearn.py in predict(self, X, raw_score, num_iteration)
    674
    675     def predict(self, X, raw_score=False, num_iteration=0):
--> 676         class_probs = self.predict_proba(X, raw_score, num_iteration)
    677         class_index = np.argmax(class_probs, axis=1)
    678         return self._le.inverse_transform(class_index)

~/.pyenv/versions/3.6.2/lib/python3.6/site-packages/lightgbm/sklearn.py in predict_proba(self, X, raw_score, num_iteration)
    704                              "match the input. Model n_features_ is %s and "
    705                              "input n_features is %s "
--> 706                              % (self._n_features, n_features))
    707         class_probs = self.booster_.predict(X, raw_score=raw_score, num_iteration=num_iteration)
    708         if self._n_classes > 2:

ValueError: Number of features of the model must match the input. Model n_features_ is 100 and input n_features is 101

In [3]: xgboost_style().shape
[LightGBM] [Info] Number of positive: 506, number of negative: 494
[LightGBM] [Info] Total Bins 25500
[LightGBM] [Info] Number of data: 1000, number of used features: 100
[LightGBM] [Info] Finished loading 100 models
Out[3]: (1000,)

IMHO, a check on the feature count is really important. Without it, mismatched input is silently accepted, which can lead to nasty, hard-to-debug issues.
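
Until such a check exists in the library, a caller-side guard along these lines might serve as a workaround. It is a minimal sketch: `predict_checked` is a hypothetical name, the error text simply mirrors the sklearn wrapper's message from the traceback above, and the input is assumed to be 2-D with a NumPy-style `.shape`.

```python
def predict_checked(booster, X):
    """Hypothetical wrapper: refuse to predict when the column count is wrong."""
    # Booster.num_feature() reports how many features the model was trained on.
    expected = booster.num_feature()
    # Assumption: X is 2-D and exposes .shape (np.ndarray, pd.DataFrame, ...).
    actual = X.shape[1]
    if actual != expected:
        raise ValueError(
            "Number of features of the model must match the input. "
            f"Model n_features_ is {expected} and input n_features is {actual}"
        )
    return booster.predict(X)
```

In the session above, `predict_checked(booster, x_bad)` would then raise a ValueError instead of returning an array of shape (1000,).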

StrikerRUS · 2019-08-01T16:51:34Z

Closed in favor of tracking this in #2302. We decided to keep all feature requests in one place.

Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not a topic starter) if you are actively working on implementing this feature.

jsh9 · 2019-08-31T19:46:42Z

I may be interested in helping improve this. What would be the ideal behavior here? Does LightGBM make predictions based only on column order (like sklearn), or based on column names?

StrikerRUS · 2019-10-03T18:30:10Z

The shape of the data for prediction is now checked on the C++ side, thanks to #2464.

The next steps may be to check the type and order of features, and their names in case of pandas DataFrame.
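
As a rough illustration of the name-and-order step (type checks are left out), the sketch below computes the positions that put a DataFrame's columns into the order the model was trained with. `column_order` is a hypothetical helper, not part of LightGBM, and it assumes that both sides must contain exactly the same unique names.

```python
def column_order(feature_names, columns):
    """Hypothetical helper: positions that reorder `columns` to match `feature_names`.

    Raises ValueError if either side has duplicate names or the name sets differ.
    """
    feature_names = list(feature_names)
    columns = list(columns)
    if len(set(feature_names)) != len(feature_names) or len(set(columns)) != len(columns):
        raise ValueError("duplicate names make matching by name ambiguous")

    # Map each input column name to its current position.
    position = {name: i for i, name in enumerate(columns)}
    known = set(feature_names)
    missing = [name for name in feature_names if name not in position]
    unexpected = [name for name in columns if name not in known]
    if missing or unexpected:
        raise ValueError(f"missing: {missing}, unexpected: {unexpected}")

    # For every model feature, in training order, pick the matching input column.
    return [position[name] for name in feature_names]
```

With pandas, the result could be applied as `df.iloc[:, column_order(booster.feature_name(), df.columns)]` before calling `predict`.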

jmoralez · 2021-12-24T01:24:07Z

Reopening because I'm working on this.

…812) (#4909)

* check feature names and order in predict with dataframe
* slice df in predict to remove the target
* scramble features
* handle int column names
* only change column order when needed
* include validate_features param in booster and sklearn estimators
* document validate_features argument
* use all_close in preds checks and check for assertion error to compare different arrays
* perform remapping and checks in cpp
* remove extra logs
* fixes
* revert cpp
* proposal
* remove extra arg
* lint
* restore _data_from_pandas arguments
* Apply suggestions from code review

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* move data conversion to Predictor.predict
* use Vector2Ptr

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

github-actions · 2023-08-15T20:34:52Z

This issue has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

Timeline:

guolinke added the feature request label Aug 16, 2017
StrikerRUS mentioned this issue Feb 20, 2019: [python] Crash in training with early stopping inside sklearn Pipeline with dimentionality reduction #2012 (Closed)
guolinke mentioned this issue Aug 1, 2019: Feature Requests & Voting Hub #2302 (Open)
guolinke closed this as completed Aug 1, 2019
StrikerRUS mentioned this issue Aug 29, 2019: Question: behavior of Booster.predict() with wrong number of columns #2366 (Closed)
StrikerRUS mentioned this issue Sep 10, 2019: LightGBM predicts even with missing features #2396 (Closed)
guolinke reopened this Sep 27, 2019
guolinke mentioned this issue Sep 27, 2019: check the shape for mat, csr and csc in prediction #2464 (Merged)
StrikerRUS closed this as completed Oct 3, 2019
StrikerRUS mentioned this issue Mar 27, 2021: Inconsistent prediction when the order of the columns of pandas dataframe is different from training #4102 (Closed)
jmoralez reopened this Dec 24, 2021
jmoralez mentioned this issue Dec 24, 2021: [python-package] check feature names in predict with dataframe (fixes #812) #4909 (Merged)
StrikerRUS closed this as completed in #4909 Jun 27, 2022
jameslamb mentioned this issue Oct 7, 2022: [DO NOT MERGE] Release v3.3.3 #5525 (Closed; 40 tasks)
github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023
