Weights & Early Stopping with LGBMRegressor #4551

Closed
John64 opened this issue Aug 24, 2021 · 7 comments
@John64

John64 commented Aug 24, 2021

I've been using LightGBM for a while, but mostly for classification and never with weights. I have a basic pandas DataFrame with the weights in a column. I can't get an LGBMRegressor with early stopping and the 'mae' objective to run. I keep getting this error:
('Wrong type(float) for weight.\nIt should be list, numpy 1-D array or pandas Series',) Wrong type(float) for weight. It should be list, numpy 1-D array or pandas Series
Here is how the model is declared:

obj = 'mae'
eval_setit = obj
model = LGBMRegressor(boosting_type='gbdt', objective=obj, learning_rate=.3, n_jobs = 1, num_threads=1
                              ,early_stopping_round=10, num_iterations=100)

The call to model.fit raises the error:

weight_sample = X_train['weight'].values
weight_eval = X_test['weight'].to_list()
model.fit(X_train[filter_now].values, y_train.values, eval_set=[(X_test[filter_now].values, y_test.values)], 
    eval_metric=eval_setit, verbose=False, sample_weight=weight_sample, eval_sample_weight=weight_eval)

X_test and X_train are pandas DataFrames; y_test and y_train are pandas Series.

I've tried various combinations of .values, .to_list(), and .ravel(), since one post online said this can happen when the format of the weights doesn't match the format of the data, but I haven't found a solution. The error is always the same as long as eval_sample_weight is given. Without it, everything appears to run, but then early stopping weights all samples in the eval_set equally, which introduces error.

I'm hoping someone else has run into this or knows what the error is referring to. There are no NaNs/nulls in the weights and they're the correct length. The weight column in pandas is float64, and all values are between ~0.03 and ~0.078.
I'm running this in parallel, and some of the .to_list() and .values calls appear to be slowing things down, so if someone has a cleaner way to train from a DataFrame, that'd be great. Any help (examples, links, general advice...) would be appreciated. I'm out of ideas. Thank you.

@jameslamb
Collaborator

Thanks for using LightGBM! We need some more information to help you.

  1. What version of LightGBM are you using and how did you install it?
  2. Are you able to provide a fully-reproducible example (some self-contained code that maintainers could run which produces the same error)? That would reduce the effort needed to answer your question.

@John64
Author

John64 commented Aug 24, 2021

From the Miniconda package list:

lightgbm                  3.2.1            py39h415ef7b_0    conda-forge
python                    3.9.6           h7840368_1_cpython    conda-forge
pandas                    1.3.0            py39h2e25243_0    conda-forge
numpy                     1.21.0           py39h6635163_0    conda-forge

Not sure how people usually do this, but this minimal code will reproduce the error, as long as pickle objects are OK. The files wouldn't upload here, so I put them in my repository:
https://github.com/John64/dataexamples

import numpy as np
import pandas as pd
import pickle
from lightgbm import LGBMRegressor

tpath = 'D:\\temp\\'
limitcols = ['tb_prec_min', 'mc_ratio3', 'mc_ratio1', 's1_bestscore', 'tb_prec_avg','b_rawRsys'] #'weight'

X_train = pickle.load(open(tpath+ 'X_train.dat','rb'))
X_test = pickle.load(open(tpath+ 'X_test.dat','rb'))
y_train = pickle.load(open(tpath+ 'y_train.dat','rb'))
y_test = pickle.load(open(tpath+ 'y_test.dat','rb'))

model = LGBMRegressor(boosting_type='gbdt', objective='mae', learning_rate=.3, n_jobs = 1, num_threads=1,
                      early_stopping_round=10, num_iterations=100)

weight_sample = X_train['weight'].values
weight_eval = X_test['weight'].to_list()
model.fit(X_train[limitcols], y_train.values, eval_set=[(X_test[limitcols].values, y_test.values)],
          eval_metric='mae', verbose=True, sample_weight=weight_sample,
          eval_sample_weight=weight_eval)  # works if eval_sample_weight is omitted

#TypeError: Wrong type(float) for weight. It should be list, numpy 1-D array or pandas Series

Thanks Again

@jameslamb
Collaborator

Thanks very much!

From that information, I can see that you're using Python 3.9 and version 3.2.1 of lightgbm, and that you're on Windows (based on the path D:\\temp\\).

I personally don't open pickle files whose origin I don't know about, since it is possible to define arbitrary code to run when an object is unpickled.

You could create a fully-reproducible example for this case by, for example

@John64
Author

John64 commented Aug 24, 2021

Will CSV work? I got the same error with CSV files.
It's been a while since I used numpy to generate random data, but I'll try in the morning if needed. Thanks.

X_train = pd.read_csv(tpath+ 'X_train.csv', index_col=0)
X_test = pd.read_csv(tpath+ 'X_test.csv', index_col=0)
y_train = pd.read_csv(tpath+ 'y_train.csv', index_col=0)
y_test = pd.read_csv(tpath+ 'y_test.csv', index_col=0)
[X_test.csv](https://github.com/microsoft/LightGBM/files/7042522/X_test.csv)
[X_train.csv](https://github.com/microsoft/LightGBM/files/7042523/X_train.csv)
[y_test.csv](https://github.com/microsoft/LightGBM/files/7042524/y_test.csv)
[y_train.csv](https://github.com/microsoft/LightGBM/files/7042525/y_train.csv)

@StrikerRUS
Collaborator

StrikerRUS commented Aug 25, 2021

Hey @John64! Thanks a lot for the repro!
Just a hint: you can simplify CSV reading like the following:

X_train = pd.read_csv(r'https://github.com/microsoft/LightGBM/files/7042523/X_train.csv', index_col=0)

This is the same error as in #4534. And the solution is in #4534 (comment).

You should pass lists for eval_* arguments, one item per validation pair of X and y.
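
To see why a flat list of weights ends up as `Wrong type(float)`, here is a simplified pure-Python illustration. `pick_weight` is a hypothetical helper, not LightGBM's actual code; it assumes the wrapper looks up the weights for the i-th validation pair by index, which is consistent with the error message in this issue.

```python
def pick_weight(eval_sample_weight, i):
    """Hypothetical helper: return the weights for the i-th entry of eval_set."""
    if eval_sample_weight is None:
        return None
    return eval_sample_weight[i] if i < len(eval_sample_weight) else None


weight_eval = [0.03, 0.05, 0.078]  # weights for a single validation set

flat = pick_weight(weight_eval, 0)       # indexing a flat list yields one float
wrapped = pick_weight([weight_eval], 0)  # indexing the wrapped list yields all weights

print(type(flat).__name__, type(wrapped).__name__)  # → float list
```

Under that assumption, the first validation set receives a single float as its weight when the list is not wrapped, which is the type the error complains about.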

Just fix these lines in your code:

model.fit(X_train[limitcols], y_train.values, eval_set=[(X_test[limitcols].values, y_test.values)],
          eval_metric='mae', verbose=True, sample_weight=weight_sample, eval_sample_weight=[weight_eval])

Note that eval_sample_weight is a list of arrays:

eval_sample_weight=[weight_eval]

@John64
Author

John64 commented Aug 25, 2021

Thank you very much, Striker. Working great.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023