SequentialDataset inter_feat has label leakage #620

Closed
rowedenny opened this issue Dec 25, 2020 · 6 comments
Labels: enhancement (New feature or request)
@rowedenny (Contributor)

I have recently been developing a model on a sequential dataset that, similar to LightGCN, needs the historical interactions. So I followed the current implementation of LightGCN and fetched them from train_data.inter_feat, but the performance was far higher than I expected.

I then carefully checked whether the test labels could possibly be leaking, and unfortunately they indeed are.

During leave_one_out in sequential_dataset, the whole dataset is copied directly and the splits are then selected by item indices. In general this should be fine, but the key problem is that the inter_feat of the dataset is identical across train_set, valid_set, and test_set. So when I fetch inter_feat directly from the train_set, the test labels leak in.

In comparison, the implementation in the general dataset selects the corresponding indices and saves the partitioned inter_feat, so LightGCN does not suffer from this issue.

I am not sure whether this counts as a bug, but it is a real risk: a user who is not aware of this subtle issue in the sequential dataset would report a false improvement in performance.
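
For anyone who wants to see the symptom directly, a quick sanity check along these lines should reveal it. This is only a sketch from memory: the model and dataset names are placeholders, and the exact attribute path to the split datasets may differ slightly between versions.

from recbole.config import Config
from recbole.data import create_dataset, data_preparation

config = Config(model='GRU4Rec', dataset='ml-100k')  # placeholder sequential model / dataset
dataset = create_dataset(config)
train_data, valid_data, test_data = data_preparation(config, dataset)

# Under leave_one_out the splits only differ by the selected indices,
# while every split still holds the complete inter_feat, so the train
# split can reach the validation/test target interactions.
print(len(train_data.dataset.inter_feat))
print(len(test_data.dataset.inter_feat))  # same length as above -> test labels reachable from train_data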

@rowedenny (Contributor, Author)

I have found a simple way to work around it, but I am not sure how to prevent this risk in the general case. Please let me know if you have a clear idea for a fix, and I would be happy to open a PR for it.

@chenyushuo (Collaborator)

Can you describe your way of fixing this bug? Then we can check whether there are any other problems.

@rowedenny (Contributor, Author)

rowedenny commented Dec 26, 2020

In my case, I need to fetch self.inter_feat of a sequential dataset object. Taking leave_one_out into consideration, the last two interactions of each user should be excluded, so what I did is rework the function _create_sparse_matrix with the following code

# keep everything except each user's last two interactions (the valid/test targets)
new_inter_feat = self.inter_feat.groupby(self.uid_field, as_index=False).apply(lambda x: x.iloc[:-2])

and then work with new_inter_feat instead of self.inter_feat.
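
To make the idea concrete outside of RecBole's internals, here is a standalone sketch of how the truncated frame could be used to build the user-item matrix with scipy. It is not the actual _create_sparse_matrix code; iid_field, user_num, and item_num are assumed to be the usual dataset attributes:

import numpy as np
import scipy.sparse as sp

# keep everything except each user's last two interactions (the valid/test targets)
new_inter_feat = self.inter_feat.groupby(self.uid_field, as_index=False).apply(lambda x: x.iloc[:-2])

rows = new_inter_feat[self.uid_field].values
cols = new_inter_feat[self.iid_field].values
data = np.ones(len(new_inter_feat))

# interaction matrix built only from the training interactions
mat = sp.coo_matrix((data, (rows, cols)), shape=(self.user_num, self.item_num))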

I notice that data_augmentation in the sequential dataset uses the indices of the dataframe so that the dataloader can generate the context and the target, so it seems we can directly overwrite self.inter_feat of the train_set. At least it works in my case, but I have no idea about the general case. Please do take a close look at it.

@chenyushuo (Collaborator)

OK, on the whole there is no problem with your idea.

However, we are going to release v0.2.0 (on the 0.2.x branch) soon. On that branch, dataset.inter_feat is converted to an Interaction after the initial data processing:

self._change_feat_format()

def _change_feat_format(self):
    """Change feat format from :class:`pandas.DataFrame` to :class:`Interaction`."""
    for feat_name in self.feat_name_list:
        feat = getattr(self, feat_name)
        setattr(self, feat_name, self._dataframe_to_interaction(feat))

So we will rewrite the code you provided into an equivalent form.
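
For reference, the same trimming on an Interaction-format inter_feat could be sketched with a plain index mask. This assumes the rows are sorted by (user, time), that every user kept by leave_one_out has at least three interactions, and that Interaction supports boolean-mask indexing:

import numpy as np
import torch

uid = dataset.inter_feat[dataset.uid_field].numpy()  # user id column, sorted by (user, time)
last = np.append(np.where(uid[1:] != uid[:-1])[0], len(uid) - 1)  # last row of every user group
keep = np.ones(len(uid), dtype=bool)
keep[last] = False        # drop the test target ...
keep[last - 1] = False    # ... and the validation target of each user
train_inter_feat = dataset.inter_feat[torch.from_numpy(keep)]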

Finally, thank you for discovering this problem.

@rowedenny (Contributor, Author)

A pleasure to make a contribution to RecBole.

@chenyushuo (Collaborator)

That's fine; we suggest you make your contributions on the 0.2.x branch, because we will release v0.2.0 next month.

Thanks for your contributions again.
