SequentialDataset inter_feat has label leakage #620

Closed
rowedenny opened this issue Dec 25, 2020 · 6 comments
Labels: enhancement (New feature or request)
@rowedenny (Contributor)

I have recently been developing a model on a sequential dataset that, similar to LightGCN, needs the historical interactions. So I followed the current implementation of LightGCN and fetched them from train_data.inter_feat, but the performance was far higher than I expected.

I then carefully checked whether the test labels could possibly be leaking, and unfortunately they indeed are.

During leave_one_out in sequential_dataset, the whole dataset is copied directly and the splits are then selected by item indices. In general this should be fine, but the key problem is that the inter_feat of the dataset is identical across train_set, valid_set, and test_set. So when I fetch inter_feat directly from the train_set, the test labels leak in.

In comparison, the implementation in the general dataset selects the corresponding indices and saves the partitioned inter_feat, so LightGCN does not suffer from this issue.

I am not sure whether this counts as a bug, but it is a real risk: a user who is not aware of this subtle issue in the sequential dataset would report a false improvement in performance.
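
For anyone who wants to see the symptom directly, a quick sanity check along these lines should reveal it. This is only a sketch from memory: the model and dataset names are placeholders, and the exact attribute path to the split datasets may differ slightly between versions.

from recbole.config import Config
from recbole.data import create_dataset, data_preparation

config = Config(model='GRU4Rec', dataset='ml-100k')  # placeholder sequential model / dataset
dataset = create_dataset(config)
train_data, valid_data, test_data = data_preparation(config, dataset)

# Under leave_one_out the splits only differ by the selected indices,
# while every split still holds the complete inter_feat, so the train
# split can reach the validation/test target interactions.
print(len(train_data.dataset.inter_feat))
print(len(test_data.dataset.inter_feat))  # same length as above -> test labels reachable from train_data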

@rowedenny (Contributor, Author)

I have found a simple way to work around it, but I am not sure how to prevent this risk in the general case. Please let me know if you have a clear idea for a fix, and I would be happy to open a PR for it.

@chenyushuo (Collaborator)

Can you describe your way of fixing this bug? Then we can check whether there are any other problems.

@rowedenny (Contributor, Author)

rowedenny commented Dec 26, 2020

In my case, I need to fetch self.inter_feat of a sequential dataset object. Taking leave_one_out into consideration, the last two interactions of each user should be excluded, so what I did is rework the function _create_sparse_matrix with the following code

# keep everything except each user's last two interactions (the valid/test targets)
new_inter_feat = self.inter_feat.groupby(self.uid_field, as_index=False).apply(lambda x: x.iloc[:-2])

and then work with new_inter_feat instead of self.inter_feat.
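
To make the idea concrete outside of RecBole's internals, here is a standalone sketch of how the truncated frame could be used to build the user-item matrix with scipy. It is not the actual _create_sparse_matrix code; iid_field, user_num, and item_num are assumed to be the usual dataset attributes:

import numpy as np
import scipy.sparse as sp

# keep everything except each user's last two interactions (the valid/test targets)
new_inter_feat = self.inter_feat.groupby(self.uid_field, as_index=False).apply(lambda x: x.iloc[:-2])

rows = new_inter_feat[self.uid_field].values
cols = new_inter_feat[self.iid_field].values
data = np.ones(len(new_inter_feat))

# interaction matrix built only from the training interactions
mat = sp.coo_matrix((data, (rows, cols)), shape=(self.user_num, self.item_num))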

I notice that data_augmentation in the sequential dataset uses the indices of the dataframe so that the dataloader can generate the context and the target, so it seems we can directly overwrite self.inter_feat of the train_set. At least it works in my case, but I have no idea about the general case. Please do take a close look at it.

@chenyushuo (Collaborator)

OK, on the whole there is no problem with your idea.

However, we are going to release v0.2.0 (on the 0.2.x branch) soon. On that branch, dataset.inter_feat is converted to an Interaction after the initial data processing:

self._change_feat_format()

def _change_feat_format(self):
    """Change feat format from :class:`pandas.DataFrame` to :class:`Interaction`."""
    for feat_name in self.feat_name_list:
        feat = getattr(self, feat_name)
        setattr(self, feat_name, self._dataframe_to_interaction(feat))

So we will rewrite the code you provided into an equivalent form.
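
For reference, the same trimming on an Interaction-format inter_feat could be sketched with a plain index mask. This assumes the rows are sorted by (user, time), that every user kept by leave_one_out has at least three interactions, and that Interaction supports boolean-mask indexing:

import numpy as np
import torch

uid = dataset.inter_feat[dataset.uid_field].numpy()  # user id column, sorted by (user, time)
last = np.append(np.where(uid[1:] != uid[:-1])[0], len(uid) - 1)  # last row of every user group
keep = np.ones(len(uid), dtype=bool)
keep[last] = False        # drop the test target ...
keep[last - 1] = False    # ... and the validation target of each user
train_inter_feat = dataset.inter_feat[torch.from_numpy(keep)]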

Finally, thank you for discovering this problem.

@rowedenny (Contributor, Author)

A pleasure to make a contribution to RecBole.

@chenyushuo (Collaborator)

That's fine; we suggest you make your contributions on the 0.2.x branch, because we will release v0.2.0 next month.

Thanks for your contributions again.
