[python] Reproduce the result of cross validation #5000

truongphanduykhanh · 2022-02-09T16:06:55Z

Description

Just for self-study and to understand the package, I would like to replicate the boosters generated by lgb.cv() (with custom folds). I use lgb.train() on each custom fold's data, but the boosters from lgb.train() are different from the boosters from lgb.cv().

In the following example, I use 3 custom folds on a toy data set.

Reproducible example

import pandas as pd
import lightgbm as lgb

# ------------------------------
# import data
data = pd.read_csv('data.csv')
print(data.shape)
(12, 3)
print(data.columns)
Index(['LABEL', 'FEAT1', 'FEAT2'], dtype='object', length=3)

# ------------------------------
# 3 folds to split (train, valid)
# train sets have 2, 3 and 4 rows, respectively
# every valid set has 1 row
folds = [([0, 1], [2]), ([3, 4, 5], [6]), ([7, 8, 9, 10], [11])]

# ------------------------------
DEFAULTED_PARAMS = {
    'objective': 'binary',
    'metric': 'auc',
    'verbose': -1
}
data_lgb = lgb.Dataset(data=data[['FEAT1', 'FEAT2']], label=data['LABEL'])

# ------------------------------
# (1) cross validation
cvbooster = lgb.cv(
    params=DEFAULTED_PARAMS,
    train_set=data_lgb,
    num_boost_round=10,
    folds=folds,
    return_cvbooster=True
)

# ------------------------------
# (2) replicate by separate models
boosters = []
for i in range(3):
    train_feat = data.loc[folds[i][0], ['FEAT1', 'FEAT2']]
    train_label = data.loc[folds[i][0], 'LABEL']
    train_data = lgb.Dataset(data=train_feat, label=train_label)
    booster = lgb.train(
        params=DEFAULTED_PARAMS,
        train_set=train_data,
        num_boost_round=10
    )
    boosters.append(booster)

# ------------------------------
# assess the results from (1) cv and (2) separated models
# only try the first booster in each
valid_feat = data.loc[folds[0][1], ['FEAT1', 'FEAT2']]  # valid features of the first fold

pred_cv = cvbooster['cvbooster'].boosters[0].predict(valid_feat)  # first booster from .cv()
pred_train = boosters[0].predict(valid_feat)  # first booster from .train()


pred_cv
array([0.12345])

pred_train
array([0.11111])  # it's different from the prediction of cv.

# ------------------------------
# the predictions are still different when I control the num_iteration
pred_cv = cvbooster['cvbooster'].boosters[0].predict(valid_feat, num_iteration=5)
pred_train = boosters[0].predict(valid_feat, num_iteration=5)

pred_cv
array([0.13579])

pred_train
array([0.12222])  # it's still different from the prediction of cv.

So cvbooster['cvbooster'].boosters[0] is a different booster from boosters[0]. I've tried many different data sets and custom folds but still can't replicate the booster from cv. Sorry if I made a silly mistake somewhere.

Environment info

LightGBM version: 3.2.1
Command(s) you used to install LightGBM:

pip install lightgbm


jameslamb · 2022-02-09T16:25:45Z

Thanks for your interest in LightGBM and excellent issue write-up!

I haven't had a chance to run your code yet but I can just say briefly that I see one major way that your code differs from lightgbm.cv().

A LightGBM Dataset object contains the result of some pre-processing, including binning continuous features into histograms. Each time you construct a Dataset from raw data like a numpy array or pandas data frame, LightGBM will calculate new bin boundaries for features.
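
To illustrate why recomputing bin boundaries on a subset matters, here is a minimal pure-Python sketch. It uses a toy equal-frequency rule, which is an assumption for illustration only and not LightGBM's actual bin-finding algorithm. The names bin_upper_bounds, to_bin, feature and fold_rows are hypothetical.

```python
from bisect import bisect_left


def bin_upper_bounds(values, max_bin):
    """Toy equal-frequency binning (NOT LightGBM's real algorithm)."""
    ordered = sorted(values)
    n = len(ordered)
    # Choose max_bin - 1 cut points at evenly spaced ranks.
    return [ordered[(k * n) // max_bin] for k in range(1, max_bin)]


def to_bin(value, bounds):
    # Index of the first boundary that is >= value.
    return bisect_left(bounds, value)


# Hypothetical feature column with 12 rows, like the toy data in this issue.
feature = [0.1, 0.4, 0.5, 0.9, 1.3, 1.8, 2.2, 2.9, 3.5, 4.1, 4.8, 5.6]
fold_rows = [0, 1, 2, 3]  # hypothetical training rows of one fold

# Boundaries computed once on the full column vs. recomputed on the fold only.
full_bounds = bin_upper_bounds(feature, max_bin=4)
fold_bounds = bin_upper_bounds([feature[i] for i in fold_rows], max_bin=4)

print("full data boundaries:", full_bounds)
print("fold-only boundaries:", fold_bounds)
print("bin of 0.9:", to_bin(0.9, full_bounds), "vs", to_bin(0.9, fold_bounds))
```

In this toy setup the same raw value lands in a different bin depending on which rows were used to find the boundaries, so a tree learner would see different candidate split points in the two cases.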

To ensure that all subsets used in CV have the same bin boundaries, the process for lightgbm.cv() is to take a fully-constructed dataset and then to extract subsets of it for each fold using the .subset() method.

LightGBM/python-package/lightgbm/engine.py

Lines 348 to 349 in 0688f47

    
    for train_idx, test_idx in folds:
        train_set = full_data.subset(sorted(train_idx))

In your code, you are creating a new Dataset object on each chunk of the training data.

For more details, you may want to see this related discussion: #4319

Could you try modifying your code to create one Dataset on the full training dataset and then use Dataset.subset() to extract individual folds?

truongphanduykhanh · 2022-02-10T02:59:05Z

Thank you very much. I changed the code according to your suggestion and it works like a charm! The discussion in #4319 is very good too. I much appreciate your effort on this repo.

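# ------------------------------
# replicate by separate models, using subsets of the full Dataset
boosters = []
for i in range(3):
    # subset the Dataset built on ALL rows (data_lgb), so bin boundaries match cv
    train_fold = data_lgb.subset(folds[i][0])  # <----- change here
    booster = lgb.train(
        params=DEFAULTED_PARAMS,
        train_set=train_fold,  # train on the fold subset
        num_boost_round=10
    )
    boosters.append(booster)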

# ------------------------------
valid_feat = data.loc[folds[0][1], ['FEAT1', 'FEAT2']]

pred_cv = cvbooster['cvbooster'].boosters[0].predict(valid_feat)
pred_train = boosters[0].predict(valid_feat)

pred_cv
array([0.12345])

pred_train
array([0.12345])  # it's exactly the same as in cv

jameslamb · 2022-02-10T05:16:41Z

oh great, glad that worked! I'll close this for now, come back any time if you have other questions 👋

github-actions · 2023-08-23T00:20:25Z

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

jameslamb added the question label Feb 9, 2022

jameslamb changed the title ~~Can't reproduce the result of native cross validation~~ [python] Can't reproduce the result of native cross validation Feb 9, 2022

truongphanduykhanh changed the title ~~[python] Can't reproduce the result of native cross validation~~ [python] Reproduce the result of cross validation Feb 10, 2022

jameslamb closed this as completed Feb 10, 2022

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
