[python] Reproduce the result of cross validation #5000

Closed
truongphanduykhanh opened this issue Feb 9, 2022 · 4 comments
truongphanduykhanh commented Feb 9, 2022

Description

For my own research and to understand the package better, I would like to replicate the boosters generated by lgb.cv() (with custom folds). I use lgb.train() on the data of each custom fold, but the boosters from lgb.train() are different from the boosters from lgb.cv().

In the following example, I use 3 custom folds on a toy data set.

Reproducible example

import pandas as pd
import lightgbm as lgb

# ------------------------------
# import data
data = pd.read_csv('data.csv')
print(data.shape)  # (12, 3)
print(data.columns)  # Index(['LABEL', 'FEAT1', 'FEAT2'], dtype='object', length=3)

# ------------------------------
# 3 folds to split (train, valid)
# train sets have 2, 3 and 4 rows, respectively
# every valid set has 1 row
folds = [([0, 1], [2]), ([3, 4, 5], [6]), ([7, 8, 9, 10], [11])]

# ------------------------------
DEFAULTED_PARAMS = {
    'objective': 'binary',
    'metric': 'auc',
    'verbose': -1
}
data_lgb = lgb.Dataset(data=data[['FEAT1', 'FEAT2']], label=data['LABEL'])

# ------------------------------
# (1) cross validation
cvbooster = lgb.cv(
    params=DEFAULTED_PARAMS,
    train_set=data_lgb,
    num_boost_round=10,
    folds=folds,
    return_cvbooster=True
)

# ------------------------------
# (2) replicate by separate models
boosters = []
for i in range(3):
    train_feat = data.loc[folds[i][0], ['FEAT1', 'FEAT2']]
    train_label = data.loc[folds[i][0], 'LABEL']
    train_data = lgb.Dataset(data=train_feat, label=train_label)
    booster = lgb.train(
        params=DEFAULTED_PARAMS,
        train_set=train_data,
        num_boost_round=10
    )
    boosters.append(booster)

# ------------------------------
# assess the results from (1) cv and (2) separated models
# only try the first booster in each
valid_feat = data.loc[folds[0][1], ['FEAT1', 'FEAT2']]  # valid features of the first fold

pred_cv = cvbooster['cvbooster'].boosters[0].predict(valid_feat)  # first booster from .cv()
pred_train = boosters[0].predict(valid_feat)  # first booster from .train()


pred_cv
array([0.12345])

pred_train
array([0.11111])  # it's different from the prediction of cv.

# ------------------------------
# the predictions are still different when I control the num_iteration
pred_cv = cvbooster['cvbooster'].boosters[0].predict(valid_feat, num_iteration=5)
pred_train = boosters[0].predict(valid_feat, num_iteration=5)

pred_cv
array([0.13579])

pred_train
array([0.12222])  # it's still different from the prediction of cv.

So cvbooster['cvbooster'].boosters[0] is a different booster from boosters[0]. I've tried many different data sets and custom folds but still can't replicate the booster from cv. Sorry if I made a silly mistake somewhere.

Environment info

LightGBM version: 3.2.1
Command(s) you used to install LightGBM:

pip install lightgbm
jameslamb commented Feb 9, 2022

Thanks for your interest in LightGBM and excellent issue write-up!

I haven't had a chance to run your code yet but I can just say briefly that I see one major way that your code differs from lightgbm.cv().

A LightGBM Dataset object contains the result of some pre-processing, including binning continuous features into histograms. Each time you construct a Dataset from raw data like a numpy array or pandas data frame, LightGBM will calculate new bin boundaries for features.

To ensure that all subsets used in CV have the same bin boundaries, the process for lightgbm.cv() is to take a fully-constructed dataset and then to extract subsets of it for each fold using the .subset() method.

for train_idx, test_idx in folds:
    train_set = full_data.subset(sorted(train_idx))

In your code, you are creating a new Dataset object on each chunk of the training data.
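To illustrate why separately constructed Datasets can lead to different boosters, here is a rough sketch using only the Python standard library. It uses equal-frequency bins from statistics.quantiles as a simplified stand-in; this is an assumption for illustration and not LightGBM's actual bin-finding algorithm. The feature values and the helper names bin_edges and assign_bins are hypothetical.

```python
from statistics import quantiles


def bin_edges(values, n_bins=4):
    """Equal-frequency bin boundaries (simplified stand-in, not LightGBM's bin finder)."""
    return quantiles(values, n=n_bins)


def assign_bins(values, edges):
    """Map each value to the index of the bin it falls into."""
    return [sum(v > e for e in edges) for v in values]


# Hypothetical feature column with 12 rows, the same size as the toy data in this issue.
feature = [0.1, 0.4, 0.5, 0.9, 1.3, 2.0, 2.2, 3.5, 4.1, 5.0, 7.5, 9.0]
train_idx = [7, 8, 9, 10]  # training rows of the third fold
fold_values = [feature[i] for i in train_idx]

# (a) Boundaries computed once on the full data, then reused for the fold.
#     This mirrors the idea behind taking a subset of one full Dataset.
full_edges = bin_edges(feature)
bins_shared = assign_bins(fold_values, full_edges)

# (b) Boundaries recomputed on the fold alone.
#     This mirrors the idea behind building a new Dataset per fold.
fold_edges = bin_edges(fold_values)
bins_rebuilt = assign_bins(fold_values, fold_edges)

print("edges from full data:", full_edges)
print("edges from fold only:", fold_edges)
print("fold bins, shared edges: ", bins_shared)
print("fold bins, rebuilt edges:", bins_rebuilt)
```

The same four rows end up in different bins depending on where the boundaries were computed, so a tree learner that only sees binned values would be working with different inputs in the two cases.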

For more details, you may want to see this related discussion: #4319

Could you try modifying your code to create one Dataset on the full training dataset and then use Dataset.subset() to extract individual folds?

@jameslamb jameslamb changed the title Can't reproduce the result of native cross validation [python] Can't reproduce the result of native cross validation Feb 9, 2022
truongphanduykhanh commented Feb 10, 2022

Thank you very much. I changed the code according to your suggestion and it works like a charm! The discussion in #4319 is very good too. I much appreciate your effort on this repo.

# ------------------------------
boosters = []
for i in range(3):
    # take each fold from the full Dataset (data_lgb) so bin boundaries are shared
    train_fold = data_lgb.subset(folds[i][0])  # <----- change here
    booster = lgb.train(
        params=DEFAULTED_PARAMS,
        train_set=train_fold,  # train on the fold subset
        num_boost_round=10
    )
    boosters.append(booster)

# ------------------------------
valid_feat = data.loc[folds[0][1], ['FEAT1', 'FEAT2']]

pred_cv = cvbooster['cvbooster'].boosters[0].predict(valid_feat)
pred_train = boosters[0].predict(valid_feat)

pred_cv
array([0.12345])

pred_train
array([0.12345])  # it's exactly the same as in cv

@truongphanduykhanh truongphanduykhanh changed the title [python] Can't reproduce the result of native cross validation [python] Reproduce the result of cross validation Feb 10, 2022
@jameslamb
Collaborator

oh great, glad that worked! I'll close this for now, come back any time if you have other questions 👋

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023