Data leakage in cross-validation functions in R and Python packages #4319
Comments
Note that with the proposed fix in #4320, the above example produces the same results twice when setting free_raw_data=False.
Thanks a lot for the detailed description!
According to this answer it looks like this is done by design for some reason.
Maybe @guolinke can comment? Also, see this #3362 (comment).
@StrikerRUS: Thank you for your feedback. The fact that both the training and validation data use the same feature mapper is, per se, no problem as long as the feature mapper is only constructed using information from the training data. But this feature mapper is part of the model and must not be constructed using the validation data.
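To make that distinction concrete, here is a minimal sketch (the data and variable names are illustrative, not taken from this issue's reproducible example) of the non-leaking direction of sharing: the bin mapper is built from the training rows only, and the validation Dataset reuses it through the reference argument.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
train_idx, valid_idx = np.arange(800), np.arange(800, 1000)

# The feature mapper (bin boundaries) is computed from the training rows only.
dtrain = lgb.Dataset(X[train_idx], label=y[train_idx])

# The validation Dataset reuses the training bin mapper via `reference`,
# so no validation feature information flows into the mapper.
dvalid = lgb.Dataset(X[valid_idx], label=y[valid_idx], reference=dtrain)

booster = lgb.train({"objective": "regression", "verbosity": -1},
                    dtrain, num_boost_round=50, valid_sets=[dvalid])
```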
IMHO, the leakage from joint binning is negligible (e.g. it does not use the response variable). My suggestion is to mention it in the help of the cross-validation functions.
I'm +1 for @mayer79's proposal of documenting this problem. |
I would recommend a zero-tolerance policy for all kinds of data leakage in cross-validation. The fact that one needs to add more lines of code does not seem to be a sound argument for not fixing the problem. Intuitively, the amount of information leakage is often small, and I agree that this intuition holds true in many applications. But can you guarantee that there are no datasets where this data leakage might be a serious issue?
@fabsig I apologize for the long delay in responding to this issue! I'd like to pick it up again and try to move it to resolution before LightGBM 4.0.0. @shiyu1994 @btrotta @Laurae2 @guolinke could you take a look at this issue and give us your opinion? Specifically, this question:
If we decide to move forward with this change, I'd be happy to start providing a review on the specific code changes in #4320. |
Sorry for the late response.
Sorry for the slow response. I'm busy with several large PRs these days and missed this. Yes, I fully support this idea. Actually, @guolinke and I have discussed this issue before. A strict cv function should do everything without looking at any data in the held-out test fold. I'll provide a review for #4320.
@shiyu1994: Agreed, but please monitor the memory footprint for large data. In my view, it would not be acceptable for the footprint to increase by a factor of k, where k is the fold count, compared to the current solution. (This depends on how
@mayer79 The current solution stores only a single copy of the data. I think that to fully avoid data leakage, it is unavoidable to store
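For illustration, here is a rough sketch of what a strict, leakage-free CV loop could look like (the strict_cv helper and its structure are hypothetical, not the implementation proposed in #4320): each fold builds its own bin mapper from its training rows, and if folds are processed sequentially only one binned copy is alive at a time, so peak memory need not grow by a factor of k.

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold

def strict_cv(params, X, y, n_splits=5, num_boost_round=100):
    """Hypothetical leakage-free CV: one bin mapper per fold, built from
    that fold's training rows only."""
    scores = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, valid_idx in kf.split(X):
        dtrain = lgb.Dataset(X[train_idx], label=y[train_idx])
        dvalid = lgb.Dataset(X[valid_idx], label=y[valid_idx], reference=dtrain)
        evals = {}
        lgb.train(params, dtrain, num_boost_round=num_boost_round,
                  valid_sets=[dvalid],
                  callbacks=[lgb.record_evaluation(evals)])
        scores.append(evals["valid_0"][params["metric"]][-1])
        # dtrain/dvalid are rebound on the next iteration, so only one
        # fold's binned data needs to be held in memory at a time.
    return scores

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 5)), rng.normal(size=500)
print(strict_cv({"objective": "regression", "metric": "l2", "verbosity": -1},
                X, y, num_boost_round=20))
```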
Thanks so much for clarifying. Having both options would be an ideal solution. Still, I think the potential leakage is negligible compared to all the bad things the average data scientist might do during modeling (e.g. doing stratified instead of grouped splitting when rows are dependent) ;-(
There is an issue with using the entire data for binning in cases where the data distribution changes over time, which is often the case in real-world applications. While working on my data, I faced this exact issue: when I try to reproduce the results generated by lightgbm.cv without providing reference = entire dataset, the metric (average precision score) differs by a lot; in my case, the difference is 0.06. With the current approach, the cross-validation scores will always look quite high compared to retraining the model on the entire data and then evaluating on test.
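In other words (a hedged sketch, not code from this comment), reproducing lightgbm.cv's current numbers by hand requires reusing the bin mapper that was computed on the entire data, which is precisely the leakage discussed in this issue:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
train_idx, valid_idx = np.arange(800), np.arange(800, 1000)

# Bin mapper computed on ALL rows, as cv() currently does.
full = lgb.Dataset(X, label=y, free_raw_data=False).construct()

# Passing the full-data Dataset as `reference` makes the training fold reuse
# that mapper (and the validation fold reuses it in turn). This matches
# cv()'s behaviour, but leaks validation feature information into the binning.
dtrain = lgb.Dataset(X[train_idx], label=y[train_idx], reference=full)
dvalid = lgb.Dataset(X[valid_idx], label=y[valid_idx], reference=dtrain)
```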
Description
The cross-validation functions in the R and Python packages (here and here) currently introduce data leakage. First, the entire data set is used to create the feature mapper (which maps features into bins); afterwards, the data is split into training and validation sets, and both the training and the validation data sets use the same feature mapper (see here). Crucially, the test / validation data is thus also used to create this feature mapper. In other words, part of the model (the feature mapper) has already "seen" part of the validation data on which the model is supposedly evaluated in an out-of-sample manner. Note that no label data is leaked, but information on feature data should not leak either.
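As a simplified, hypothetical sketch of this mechanism (not the attached reproducer linked below; data and parameter choices are illustrative), the difference can be seen by passing a single explicit fold to cv(), where binning happens on the full data before the split, and training the same fold by hand, where binning uses the training rows only:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(1)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
train_idx, valid_idx = np.arange(800), np.arange(800, 1000)
params = {"objective": "regression", "metric": "l2", "verbosity": -1}

# (ii) cv() with one explicit fold: the bin mapper is built on the FULL
#      data set first, then the fold is split off as a subset.
cv_result = lgb.cv(params, lgb.Dataset(X, label=y, free_raw_data=False),
                   num_boost_round=20, folds=[(train_idx, valid_idx)])

# (i) The same fold "by hand": the bin mapper is built on the training
#     rows only, and the validation data merely reuses it.
dtrain = lgb.Dataset(X[train_idx], label=y[train_idx])
dvalid = lgb.Dataset(X[valid_idx], label=y[valid_idx], reference=dtrain)
evals = {}
lgb.train(params, dtrain, num_boost_round=20, valid_sets=[dvalid],
          callbacks=[lgb.record_evaluation(evals)])

# Without the joint binning, the two validation l2 values would coincide.
print({k: v[-1] for k, v in cv_result.items()})
print(evals["valid_0"]["l2"][-1])
```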
The code in the reproducible example below demonstrates the problem. The two versions ((i) splitting the data into training and validation "by hand" and (ii) using the cv function) should produce identical results, but they do not. I will make a pull request / proposal with a partial fix for the case when free_raw_data=False. Also, I suggest adding a warning message to notify users about this form of data leakage.
Reproducible example
data_leakage_CV_lightgbm.py
Environment info
LightGBM version or commit hash: da3465c
Command(s) you used to install LightGBM:
python setup.py install