Data leakage in cross-validation functions in R and Python packages #4319

Open
fabsig opened this issue May 25, 2021 · 13 comments

@fabsig
Contributor

fabsig commented May 25, 2021

Description

The cross-validation functions in the R and Python packages (here and here) currently introduce data leakage. First, the entire data set is used to create the feature mapper (which maps features into bins); only afterwards is the data split into training and validation sets, and both sets use the same feature mapper (see here). Crucially, the test / validation data is thus also used to create this feature mapper. In other words, part of the model (the feature mapper) has already "seen" part of the validation data on which the model is supposedly evaluated in an out-of-sample manner. Note that no label data has leaked, but information from the feature data should not leak either.
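For illustration, here is a rough sketch of the mechanism using the public Dataset.subset() API and the variables (X, y, train_ind, test_ind) from the reproducible example below. This is not the exact lgb.cv implementation, just the same effect:

import lightgbm as lgb

# The bin boundaries (the feature mapper) are computed from ALL rows here,
# including the rows that will later serve as the validation fold.
full_data = lgb.Dataset(X, y)
full_data.construct()

# Both fold views reuse the bin mapper of full_data, so the validation
# rows have already influenced the binning used to fit the model.
train_fold = full_data.subset(sorted(train_ind))
valid_fold = full_data.subset(sorted(test_ind))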

The code below demonstrates the problem. The two versions ((i) splitting the data into training and validation "by hand" and (ii) using the cv function) should produce identical results, but they do not.

I will open a pull request with a proposal for a partial fix for the case free_raw_data=False. Also, I suggest adding a warning message to notify users about this form of data leakage.

Reproducible example

data_leakage_CV_lightgbm.py
import lightgbm as lgb
import numpy as np
def f1d(x):
    """Non-linear function for simulation"""
    return (1.7 * (1 / (1 + np.exp(-(x - 0.5) * 20)) + 0.75 * x))

# Simulate data
n = 500  # number of samples
np.random.seed(1)
X = np.random.rand(n, 2)
f = f1d(X[:, 0])
y = f + np.sqrt(0.01) * np.random.normal(size=n)
# Split into train and test data
train_ind = np.arange(0, int(n / 2))
test_ind = np.arange(int(n / 2), n)
folds = [(train_ind, test_ind)]
params = {'objective': 'regression_l2',
          'learning_rate': 0.05,
          'max_depth': 6,
          'min_data_in_leaf': 5,
          'verbose': 0}

# Using cv function
data = lgb.Dataset(X, y)
cvbst = lgb.cv(params=params, train_set=data,
               num_boost_round=10, early_stopping_rounds=5,
               folds=folds, verbose_eval=True, show_stdv=False, seed=1)
# Results for last 3 iterations:
#[8]     cv_agg's l2: 0.536216
#[9]     cv_agg's l2: 0.48615
#[10]    cv_agg's l2: 0.440826

# Using train function and manually splitting the data does not give the same results
data_train = lgb.Dataset(X[train_ind, :], y[train_ind])
data_eval = lgb.Dataset(X[test_ind, :], y[test_ind], reference=data_train)
evals_result = {}
bst = lgb.train(params=params,
                train_set=data_train,
                num_boost_round=10,
                valid_sets=data_eval,
                early_stopping_rounds=5,
                evals_result=evals_result)
# Results for last 3 iterations:
#[8]     valid_0's l2: 0.534485
#[9]     valid_0's l2: 0.484565
#[10]    valid_0's l2: 0.438873

Environment info

LightGBM version or commit hash: da3465c

Command(s) you used to install LightGBM: python setup.py install

@fabsig
Contributor Author

fabsig commented May 25, 2021

Note that with the proposed fix in #4320, the two versions in the example above produce identical results when setting free_raw_data=False, i.e., data = lgb.Dataset(X, y, free_raw_data=False)

@StrikerRUS
Collaborator

StrikerRUS commented May 25, 2021

Thanks a lot for the detailed description!

both the training and the validation data sets use the same feature mapper

According to this answer it looks like this is done by design for some reason.

Set reference will use the reference's (usually trainset) bin mapper to construct the valid set.
#2553 (comment)

Maybe @guolinke can comment?

Also, see this #3362 (comment).

@jameslamb jameslamb changed the title Data leakage in cross-valiation functions in R and Python packages Data leakage in cross-validation functions in R and Python packages May 26, 2021
@fabsig
Contributor Author

fabsig commented May 26, 2021

@StrikerRUS: Thank you for your feedback. The fact that both the training and validation data use the same feature mapper is, per se, not a problem, as long as the feature mapper is constructed using information from the training data only. But this feature mapper is part of the model and must not be constructed using the validation data.
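For illustration, a minimal sketch of that distinction, reusing X, y, train_ind and test_ind from the reproducible example above (illustrative only):

import lightgbm as lgb

# Leaks: the bin boundaries are computed from the full data, so the
# validation rows influence the feature mapper used during training.
full_data = lgb.Dataset(X, y)
train_leaky = lgb.Dataset(X[train_ind, :], y[train_ind], reference=full_data)
valid_leaky = lgb.Dataset(X[test_ind, :], y[test_ind], reference=full_data)

# No leak: the bin boundaries are computed from the training fold only;
# the validation fold merely shares that mapper via reference=.
train_clean = lgb.Dataset(X[train_ind, :], y[train_ind])
valid_clean = lgb.Dataset(X[test_ind, :], y[test_ind], reference=train_clean)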

@mayer79
Contributor

mayer79 commented Jun 4, 2021

IMHO, the leakage from joint binning is negligible (e.g. it does not use the response variable). My suggestion is to mention it in the help of lgb.cv instead of making the code longer and the run time slower. The note could be: "Note that feature binning is done for the combined data, not per fold."

@StrikerRUS
Collaborator

I'm +1 for @mayer79's proposal of documenting this problem.

@fabsig
Contributor Author

fabsig commented Jun 18, 2021

I would recommend a zero-tolerance policy for all kinds of data leakage in cross-validation. The fact that one needs to add more lines of code does not seem to be a sound argument against fixing the problem. Intuitively, the amount of information leakage is often small, and I agree that this intuition holds true in many applications. But can you guarantee that there are no datasets where this data leakage might be a serious issue?

@jameslamb
Collaborator

@fabsig I apologize for the long delay in responding to this issue! I'd like to pick it up again and try to move it to resolution before LightGBM 4.0.0.

@shiyu1994 @btrotta @Laurae2 @guolinke could you take a look at this issue and give us your opinion?

Specifically, this question:

Today, lgb.cv() in the R and Python packages constructs a single Dataset from the full raw training data, then performs k-fold cross validation by taking subsets of that Dataset.

Should it be modified to instead subset the raw training data in each cross validation trial, and create new Datasets from each of those subsets?

If we decide to move forward with this change, I'd be happy to start providing a review on the specific code changes in #4320.
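To make the proposal concrete, here is a sketch of what such per-fold Dataset construction could look like from user code. This is illustrative only (the function name cv_without_leakage is hypothetical, the splits come from scikit-learn's KFold, and it uses the pre-4.0 evals_result argument as in the reproducible example above); it is not the implementation proposed in #4320:

import lightgbm as lgb
import numpy as np
from sklearn.model_selection import KFold

def cv_without_leakage(params, X, y, num_boost_round=10, n_splits=5, seed=1):
    """Build a fresh Dataset per fold so that bin boundaries are computed
    from each training fold only and no validation rows reach the mapper."""
    fold_curves = []
    for train_idx, valid_idx in KFold(n_splits=n_splits, shuffle=True,
                                      random_state=seed).split(X):
        train_set = lgb.Dataset(X[train_idx], y[train_idx])
        valid_set = lgb.Dataset(X[valid_idx], y[valid_idx], reference=train_set)
        evals_result = {}
        lgb.train(params, train_set,
                  num_boost_round=num_boost_round,
                  valid_sets=[valid_set],
                  evals_result=evals_result)
        fold_curves.append(evals_result['valid_0']['l2'])  # 'l2' matches regression_l2
    # Average the per-iteration metric across folds, as lgb.cv reports it.
    return np.mean(np.array(fold_curves), axis=0)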

@guolinke
Collaborator

Sorry for the late response.
IMO, I agree with @fabsig: zero tolerance for leakage is the best solution.
@shiyu1994, can you help with this?

@shiyu1994
Collaborator

Sorry for the slow response. I'm busy with several large PRs these days and missed this.

Yes, I fully support this idea. Actually, @guolinke and I have discussed this issue before. A strict cv function should do everything without looking at any data in the test fold.

I'll provide a review for #4320.

@mayer79
Contributor

mayer79 commented Nov 3, 2021

@shiyu1994: Agreed, but please monitor the memory footprint for large data. In my view, it would not be acceptable for the footprint to increase by a factor of k, where k is the fold count, compared to the current solution. (This depends on how lgb.cv() currently stores the data, which I am actually not sure about.)

@shiyu1994
Collaborator

@mayer79 The current solution stores only a single copy of the data. I think that to fully avoid data leakage, it is unavoidable to store k copies of the discretized data (each copy holding (k-1)/k of the original rows), because each copy will have different boundaries for feature discretization.
Do you think it is worthwhile to provide such an alternative that fully avoids data leakage at the cost of extra memory? We could keep the current approach as a memory-saving option, so that users can make the trade-off.
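To put rough numbers on that trade-off (back-of-the-envelope arithmetic, not a measurement): with k = 5 folds, each per-fold training Dataset holds 4/5 of the rows, so keeping all of them in memory at once costs about 5 × 4/5 = 4 times the size of the single shared discretized copy used today. If instead each fold's Dataset is constructed, trained on, and freed before the next one, the peak stays at roughly one discretized copy plus the retained raw data (free_raw_data=False), well below the factor-of-k worst case.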

@mayer79
Contributor

mayer79 commented Nov 5, 2021

Thanks so much for clarifying. Having both options would be an ideal solution. Still, I think the potential for leakage is negligible compared to all the bad things the average data scientist might do during modeling (e.g. doing stratified instead of grouped splitting when rows are dependent) ;-(

@harshsarda29

harshsarda29 commented Jul 6, 2024

Using the entire data set for binning is an issue in cases where the data distribution changes over time, which is often the case in real-world applications. While working with my data I ran into exactly this: when I try to reproduce the results generated by lightgbm.cv without passing the entire data set as reference, the metric (average precision score) differs by a lot; in my case, the difference is 0.06. With the current approach, the cross-validation scores will always look quite high compared to retraining the model on the entire data and then evaluating on a test set.
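For anyone trying to reproduce lgb.cv scores outside of lgb.cv, a rough sketch of the pattern described above (the train_idx / valid_idx split is hypothetical; this approximates the shared binning rather than matching lgb.cv internals exactly):

import lightgbm as lgb

# Bins come from the entire data set, as lgb.cv currently computes them ...
full_data = lgb.Dataset(X, y, free_raw_data=False)

# ... so both fold Datasets must reference it to get comparable metrics.
train_fold = lgb.Dataset(X[train_idx], y[train_idx], reference=full_data)
valid_fold = lgb.Dataset(X[valid_idx], y[valid_idx], reference=full_data)

# Binning on the training fold alone (dropping reference=full_data) can
# shift the evaluation metric noticeably when the feature distribution
# drifts between folds.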
