SequentialDataset inter_feat has label leakage #620
Comments
I have found a simple way to work around it, but I am not sure how to prevent this risk in the general case. Please let me know if you have a clear idea for a fix; I would be glad to open a PR for it.
Can you describe how you fixed this bug? We can then check whether there are any other problems.
In my case, I need to fetch the
and then work with the I notice that in the |
OK, on the whole there is no problem with your idea. However, we are going to release v0.2.0 (on the 0.2.x branch) soon. On this branch, RecBole/recbole/data/dataset/dataset.py Line 113 in 0772227
RecBole/recbole/data/dataset/dataset.py Lines 924 to 929 in 0772227
So we will rewrite the code you provided into an equivalent form. Finally, thank you for discovering this problem.
It is a pleasure to contribute to RecBole.
That's fine. We suggest that you contribute to the 0.2.x branch, because we will release v0.2.0 next month. Thanks again for your contributions.
I have recently been developing a sequential dataset, but in a way similar to LightGCN. So I followed the current LightGCN implementation and fetched the historical interactions from `train_data.inter_feat`, but the performance was far higher than I expected. I then carefully checked whether the test labels could be leaking. Unfortunately, they are.
During `leave_one_out` in sequential_dataset, the whole dataset is copied directly, and each split is then selected by index according to the item indices. In general this is fine, but the key problem is that the inter_feat of the resulting datasets is identical, whether for train_set, valid_set or test_set. So when I fetch inter_feat directly from the training set, the test labels leak. By comparison, the general dataset implementation selects the corresponding indices and stores the partitioned inter_feat, so LightGCN does not suffer from this issue.
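The difference between the two splitting strategies can be illustrated with a toy example. This is only a sketch under stated assumptions: `ToySequentialDataset`, `split_by_copy`, `split_by_partition` and the `target_index` attribute are hypothetical names made up for illustration, not RecBole's classes or API, and the copy-based split is modeled here as a shallow copy.

```python
import copy


class ToySequentialDataset:
    """Hypothetical stand-in for a dataset; NOT RecBole's class."""

    def __init__(self, inter_feat):
        # Each row is a (user, item, timestamp) tuple.
        self.inter_feat = inter_feat
        # Hypothetical attribute: which rows this split is evaluated on.
        self.target_index = list(range(len(inter_feat)))


def split_by_copy(dataset, index_groups):
    """Copy the whole dataset per split and only narrow the target index."""
    parts = []
    for idx in index_groups:
        part = copy.copy(dataset)  # inter_feat still holds every row
        part.target_index = idx
        parts.append(part)
    return parts


def split_by_partition(dataset, index_groups):
    """Keep only the selected rows of inter_feat in each split."""
    parts = []
    for idx in index_groups:
        rows = [dataset.inter_feat[i] for i in idx]
        parts.append(ToySequentialDataset(rows))
    return parts


# One user with four interactions in time order.
# Leave-one-out: the last row is the test target, the one before is validation.
inter = [("u1", "i1", 1), ("u1", "i2", 2), ("u1", "i3", 3), ("u1", "i4", 4)]
full = ToySequentialDataset(inter)
groups = [[0, 1], [2], [3]]

train_c, valid_c, test_c = split_by_copy(full, groups)
train_p, valid_p, test_p = split_by_partition(full, groups)

test_row = inter[3]
leaky = test_row in train_c.inter_feat  # test interaction visible in "train"
safe = test_row in train_p.inter_feat   # test interaction absent from train
print(leaky, safe)
```

In the copy-based split, anything that reads `inter_feat` from the training split (for example, to build an interaction graph) also sees the held-out interactions, while the partitioned split exposes only the training rows.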
I am not sure whether this is a bug, but it does pose a risk. A user who is unaware of this subtlety in the sequential dataset could report a spurious performance improvement.
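One way to guard against this kind of silent leakage is a quick overlap check before training. The helper below is a minimal sketch, assuming each interaction can be represented as a hashable `(user, item, timestamp)` tuple; `find_leaked_rows` is a hypothetical name, not part of RecBole.

```python
def find_leaked_rows(train_rows, heldout_rows):
    """Return the training rows that also appear among the held-out rows."""
    heldout = set(heldout_rows)
    return [row for row in train_rows if row in heldout]


# Toy data: the last training row duplicates the test interaction.
train_rows = [("u1", "i1", 1), ("u1", "i2", 2), ("u1", "i4", 4)]
test_rows = [("u1", "i4", 4)]

leaked = find_leaked_rows(train_rows, test_rows)
if leaked:
    print(f"{len(leaked)} held-out interaction(s) found in training data")
```

An empty result does not prove the absence of leakage through other channels, but a non-empty one is a strong sign that the training split exposes evaluation targets.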