Data leakage in cross-validation functions in R and Python packages #4319

Open
fabsig opened this issue May 25, 2021 · 13 comments

@fabsig
Contributor

fabsig commented May 25, 2021

Description

The cross-validation functions in the R and Python packages (here and here) currently introduce data leakage. First, the entire data set is used to create the feature mapper (which maps features into bins); only afterwards is the data split into training and validation sets, and both sets use the same feature mapper (see here). Crucially, the test / validation data is thus also used to create this feature mapper. In other words, part of the model (the feature mapper) has already "seen" part of the validation data on which the model is supposedly evaluated in an out-of-sample manner. Note that no label data has leaked, but information from the feature data should not leak either.
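For illustration, here is a rough sketch of the mechanism using the public Dataset.subset() API and the variables (X, y, train_ind, test_ind) from the reproducible example below. This is not the exact lgb.cv implementation, just the same effect:

import lightgbm as lgb

# The bin boundaries (the feature mapper) are computed from ALL rows here,
# including the rows that will later serve as the validation fold.
full_data = lgb.Dataset(X, y)
full_data.construct()

# Both fold views reuse the bin mapper of full_data, so the validation
# rows have already influenced the binning used to fit the model.
train_fold = full_data.subset(sorted(train_ind))
valid_fold = full_data.subset(sorted(test_ind))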

The code below demonstrates the problem. The two versions ((i) splitting the data into training and validation "by hand" and (ii) using the cv function) should produce identical results, but they do not.

I will open a pull request with a proposal for a partial fix for the case free_raw_data=False. Also, I suggest adding a warning message to notify users about this form of data leakage.

Reproducible example

data_leakage_CV_lightgbm.py
import lightgbm as lgb
import numpy as np
def f1d(x):
    """Non-linear function for simulation"""
    return (1.7 * (1 / (1 + np.exp(-(x - 0.5) * 20)) + 0.75 * x))

# Simulate data
n = 500  # number of samples
np.random.seed(1)
X = np.random.rand(n, 2)
f = f1d(X[:, 0])
y = f + np.sqrt(0.01) * np.random.normal(size=n)
# Split into train and test data
train_ind = np.arange(0, int(n / 2))
test_ind = np.arange(int(n / 2), n)
folds = [(train_ind, test_ind)]
params = {'objective': 'regression_l2',
          'learning_rate': 0.05,
          'max_depth': 6,
          'min_data_in_leaf': 5,
          'verbose': 0}

# Using cv function
data = lgb.Dataset(X, y)
cvbst = lgb.cv(params=params, train_set=data,
               num_boost_round=10, early_stopping_rounds=5,
               folds=folds, verbose_eval=True, show_stdv=False, seed=1)
# Results for last 3 iterations:
#[8]     cv_agg's l2: 0.536216
#[9]     cv_agg's l2: 0.48615
#[10]    cv_agg's l2: 0.440826

# Using train function and manually splitting the data does not give the same results
data_train = lgb.Dataset(X[train_ind, :], y[train_ind])
data_eval = lgb.Dataset(X[test_ind, :], y[test_ind], reference=data_train)
evals_result = {}
bst = lgb.train(params=params,
                train_set=data_train,
                num_boost_round=10,
                valid_sets=data_eval,
                early_stopping_rounds=5,
                evals_result=evals_result)
# Results for last 3 iterations:
#[8]     valid_0's l2: 0.534485
#[9]     valid_0's l2: 0.484565
#[10]    valid_0's l2: 0.438873

Environment info

LightGBM version or commit hash: da3465c

Command(s) you used to install LightGBM: python setup.py install

@fabsig
Contributor Author

fabsig commented May 25, 2021

Note that with the proposed fix in #4320, the two versions in the example above produce identical results when setting free_raw_data=False, i.e., data = lgb.Dataset(X, y, free_raw_data=False)

@StrikerRUS
Collaborator

StrikerRUS commented May 25, 2021

Thanks a lot for the detailed description!

both the training and the validation data sets use the same feature mapper

According to this answer it looks like this is done by design for some reason.

Set reference will use the reference's (usually trainset) bin mapper to construct the valid set.
#2553 (comment)

Maybe @guolinke can comment?

Also, see this #3362 (comment).

@jameslamb jameslamb changed the title Data leakage in cross-valiation functions in R and Python packages Data leakage in cross-validation functions in R and Python packages May 26, 2021
@fabsig
Contributor Author

fabsig commented May 26, 2021

@StrikerRUS: Thank you for your feedback. The fact that both the training and validation data use the same feature mapper is, per se, not a problem, as long as the feature mapper is constructed using information from the training data only. But this feature mapper is part of the model and must not be constructed using the validation data.
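For illustration, a minimal sketch of that distinction, reusing X, y, train_ind and test_ind from the reproducible example above (illustrative only):

import lightgbm as lgb

# Leaks: the bin boundaries are computed from the full data, so the
# validation rows influence the feature mapper used during training.
full_data = lgb.Dataset(X, y)
train_leaky = lgb.Dataset(X[train_ind, :], y[train_ind], reference=full_data)
valid_leaky = lgb.Dataset(X[test_ind, :], y[test_ind], reference=full_data)

# No leak: the bin boundaries are computed from the training fold only;
# the validation fold merely shares that mapper via reference=.
train_clean = lgb.Dataset(X[train_ind, :], y[train_ind])
valid_clean = lgb.Dataset(X[test_ind, :], y[test_ind], reference=train_clean)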

@mayer79
Contributor

mayer79 commented Jun 4, 2021

IMHO, the leakage from joint binning is negligible (e.g. it does not use the response variable). My suggestion is to mention it in the help of lgb.cv instead of making the code longer and the run time slower. The note could be: "Note that feature binning is done for the combined data, not per fold."

@StrikerRUS
Collaborator

I'm +1 for @mayer79's proposal of documenting this problem.

@fabsig
Contributor Author

fabsig commented Jun 18, 2021

I would recommend a zero-tolerance policy for all kinds of data leakage in cross-validation. The fact that one needs to add more lines of code does not seem to be a sound argument against fixing the problem. Intuitively, the amount of information leakage is often small, and I agree that this intuition holds true in many applications. But can you guarantee that there are no datasets where this data leakage might be a serious issue?

@jameslamb
Collaborator

@fabsig I apologize for the long delay in responding to this issue! I'd like to pick it up again and try to move it to resolution before LightGBM 4.0.0.

@shiyu1994 @btrotta @Laurae2 @guolinke could you take a look at this issue and give us your opinion?

Specifically, this question:

Today, lgb.cv() in the R and Python packages constructs a single Dataset from the full raw training data, then performs k-fold cross validation by taking subsets of that Dataset.

Should it be modified to instead subset the raw training data in each cross validation trial, and create new Datasets from each of those subsets?

If we decide to move forward with this change, I'd be happy to start providing a review on the specific code changes in #4320.
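To make the proposal concrete, here is a sketch of what such per-fold Dataset construction could look like from user code. This is illustrative only (the function name cv_without_leakage is hypothetical, the splits come from scikit-learn's KFold, and it uses the pre-4.0 evals_result argument as in the reproducible example above); it is not the implementation proposed in #4320:

import lightgbm as lgb
import numpy as np
from sklearn.model_selection import KFold

def cv_without_leakage(params, X, y, num_boost_round=10, n_splits=5, seed=1):
    """Build a fresh Dataset per fold so that bin boundaries are computed
    from each training fold only and no validation rows reach the mapper."""
    fold_curves = []
    for train_idx, valid_idx in KFold(n_splits=n_splits, shuffle=True,
                                      random_state=seed).split(X):
        train_set = lgb.Dataset(X[train_idx], y[train_idx])
        valid_set = lgb.Dataset(X[valid_idx], y[valid_idx], reference=train_set)
        evals_result = {}
        lgb.train(params, train_set,
                  num_boost_round=num_boost_round,
                  valid_sets=[valid_set],
                  evals_result=evals_result)
        fold_curves.append(evals_result['valid_0']['l2'])  # 'l2' matches regression_l2
    # Average the per-iteration metric across folds, as lgb.cv reports it.
    return np.mean(np.array(fold_curves), axis=0)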

@guolinke
Collaborator

Sorry for the late response.
IMO, I agree with @fabsig: zero tolerance for leakage is the best solution.
@shiyu1994, can you help with this?

@shiyu1994
Collaborator

Sorry for the slow response. I'm busy with several large PRs these days and missed this.

Yes, I fully support this idea. Actually, @guolinke and I have discussed this issue before. A strict cv function should do everything without looking at any data in the test fold.

I'll provide a review for #4320.

@mayer79
Contributor

mayer79 commented Nov 3, 2021

@shiyu1994: Agreed, but please monitor the memory footprint for large data. In my view, it would not be acceptable for the footprint to increase by a factor of k, where k is the fold count, compared to the current solution. (This depends on how lgb.cv() currently stores the data, which I am actually not sure about.)

@shiyu1994
Collaborator

@mayer79 The current solution stores only a single copy of the data. I think that to fully avoid data leakage, it is unavoidable to store k copies of the discretized data (each copy holding (k-1)/k of the original rows), because each copy will have different boundaries for feature discretization.
Do you think it is worthwhile to provide such an alternative that fully avoids data leakage at the cost of extra memory? We could keep the current approach as a memory-saving option, so that users can make the trade-off.
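To put rough numbers on that trade-off (back-of-the-envelope arithmetic, not a measurement): with k = 5 folds, each per-fold training Dataset holds 4/5 of the rows, so keeping all of them in memory at once costs about 5 × 4/5 = 4 times the size of the single shared discretized copy used today. If instead each fold's Dataset is constructed, trained on, and freed before the next one, the peak stays at roughly one discretized copy plus the retained raw data (free_raw_data=False), well below the factor-of-k worst case.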

@mayer79
Contributor

mayer79 commented Nov 5, 2021

Thanks so much for clarifying. Having both options would be an ideal solution. Still, I think the potential for leakage is negligible compared to all the bad things the average data scientist might do during modeling (e.g. doing stratified instead of grouped splitting when rows are dependent) ;-(

@harshsarda29

harshsarda29 commented Jul 6, 2024

Using the entire data set for binning is an issue in cases where the data distribution changes over time, which is often the case in real-world applications. While working with my data I ran into exactly this: when I try to reproduce the results generated by lightgbm.cv without passing the entire data set as reference, the metric (average precision score) differs by a lot; in my case, the difference is 0.06. With the current approach, the cross-validation scores will always look quite high compared to retraining the model on the entire data and then evaluating on a test set.
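For anyone trying to reproduce lgb.cv scores outside of lgb.cv, a rough sketch of the pattern described above (the train_idx / valid_idx split is hypothetical; this approximates the shared binning rather than matching lgb.cv internals exactly):

import lightgbm as lgb

# Bins come from the entire data set, as lgb.cv currently computes them ...
full_data = lgb.Dataset(X, y, free_raw_data=False)

# ... so both fold Datasets must reference it to get comparable metrics.
train_fold = lgb.Dataset(X[train_idx], y[train_idx], reference=full_data)
valid_fold = lgb.Dataset(X[valid_idx], y[valid_idx], reference=full_data)

# Binning on the training fold alone (dropping reference=full_data) can
# shift the evaluation metric noticeably when the feature distribution
# drifts between folds.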
