[Question] Add features to .inter atomic file #608

Closed
mayaKaplansky opened this issue Dec 22, 2020 · 17 comments
Labels
FAQ Frequently Asked Questions

Comments

@mayaKaplansky

Hi
I see I can have user features and item features, but my dataset has interaction features.
Can I express them in the .inter file as additional columns?

@mayaKaplansky mayaKaplansky added the bug Something isn't working label Dec 22, 2020
@linzihan-backforward
Collaborator

Yes, you can add any feature columns in the .inter file.
Also, there are a few dataset parameters that you can set to control the pipeline, such as:

load_col:
    inter: [user_id, item_id, rating, timestamp]

More instructions can be found in our docs.
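To make the answer above concrete, here is a minimal sketch of writing a .inter atomic file that carries one extra interaction column. It assumes the `name:type` header style shown later in this thread and a tab separator (which I believe is RecBole's default). The column `dwell_time` and the helper `write_inter_file` are hypothetical names used only for illustration.

```python
import csv
import io


def write_inter_file(handle, rows, fields):
    """Write rows as a tab-separated table with a `name:type` header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow([f"{name}:{ftype}" for name, ftype in fields])
    for row in rows:
        writer.writerow([row[name] for name, _ in fields])


# Hypothetical schema: `dwell_time` is an illustrative interaction feature.
FIELDS = [
    ("user_id", "token"),
    ("item_id", "token"),
    ("timestamp", "float"),
    ("dwell_time", "float"),
]

ROWS = [
    {"user_id": 1, "item_id": 1193, "timestamp": 978302107, "dwell_time": 12.5},
    {"user_id": 1, "item_id": 661, "timestamp": 978302108, "dwell_time": 3.0},
]

# An in-memory buffer stands in for the real .inter file on disk.
buffer = io.StringIO()
write_inter_file(buffer, ROWS, FIELDS)
inter_text = buffer.getvalue()
print(inter_text)
```

With a file like this, the extra column would also need to appear in the `inter` list under `load_col` so that it is actually loaded.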

@hyp1231
Member

hyp1231 commented Dec 23, 2020

Thanks for Zihan's comment about selectively loading interaction features from atomic files to Dataset.

As for accessing features in models, generally, you can just fetch them from Interaction.
Besides, if you are developing sequential models and want to get historical interaction features, please see #546 for details.

@hyp1231 hyp1231 added FAQ Frequently Asked Questions and removed bug Something isn't working labels Dec 23, 2020
@mayaKaplansky
Author

Thanks
Is 'rating' a mandatory field?
I have other features.
Also, as for the timestamp: is it OK to use just the sequential order as the time (1, 2, 3)? Is it OK for all the sessions to be marked like this, just to indicate the order within the session?

@rowedenny
Contributor

  1. To the best of my knowledge, 'rating' is NOT a mandatory field, as long as you do not specify it in the config file.
  2. For the timestamps, you can mark each interaction with a numeric value. If you create your dataset object as an instance of SequentialDataset, it will sort by the user field first and then by the time field, in ascending order, which gives the order.

Just a reminder: if a user in your dataset has multiple sessions, the interactions from different sessions will be mixed, since they are all marked with the same timestamps. Other than this minor tip, I think it should be fine.

For more details, please refer to the following code: https://github.com/RUCAIBox/RecBole/blob/master/recbole/data/dataset/sequential_dataset.py#L74
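The session-mixing caveat can be seen in a small sketch. This is plain Python, not RecBole code: it only mimics an ascending sort on (user, time). The workaround `add_session_offset` and its `stride` parameter are my own assumptions for illustration, not a library feature.

```python
# Each tuple: (user_id, session_id, item_id, timestamp), where the timestamp
# is only the position inside the session (1, 2, 3, ...).
interactions = [
    ("u1", "s1", "a", 1),
    ("u1", "s1", "b", 2),
    ("u1", "s2", "c", 1),
    ("u1", "s2", "d", 2),
]


def sort_by_user_then_time(rows):
    """Stable ascending sort on (user_id, timestamp)."""
    return sorted(rows, key=lambda r: (r[0], r[3]))


# Items from the two sessions interleave because their timestamps collide.
mixed = [r[2] for r in sort_by_user_then_time(interactions)]


def add_session_offset(rows, stride=1000):
    """Hypothetical workaround: shift each session's timestamps by a
    per-session offset so that sessions no longer overlap in time."""
    session_index = {}
    shifted = []
    for user, session, item, ts in rows:
        idx = session_index.setdefault((user, session), len(session_index))
        shifted.append((user, session, item, idx * stride + ts))
    return shifted


separated = [r[2] for r in sort_by_user_then_time(add_session_offset(interactions))]
print(mixed, separated)
```

The offset only works if `stride` is larger than the longest session. Treating each session as its own "user" is another option that avoids the collision altogether.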

@mayaKaplansky
Author

Thanks for the timestamp tip!

Can you explain what you mean by "As for accessing features in models, generally, you can just fetch them from Interaction"? And what's the difference between features and historical features?
If I have features in the interaction file and I want to use GRU4Rec, should I use the original one or GRU4RecF?
Thanks!

@rowedenny
Contributor

rowedenny commented Dec 24, 2020

Since I happened to code this part, I think I am the right person to reply.
The tricky part is related to the dataloader implementation:

  1. If I remember correctly, in the general recommendation models the implementation automatically fetches all the interaction features. In sequential recommendation, however, the original dataloader only fetched the user, item, and timestamp fields and ignored the remaining fields. That is the part my PR addresses. So going back to your question: if your model is a sequential model and the feature is an interaction feature, then it is the historical feature we have talked about.
  2. If I remember correctly, the features used in GRU4RecF are the user/item profile features, not interaction features. To get access to them, you need to 1) specify a field name attribute for the model and 2) in calculate_loss, use interaction[self.field_name] to fetch the features.

@mayaKaplansky
Author

Thanks!
for #1 - I understand that after the PR I can add new features to the interaction file, but I couldn't figure out whether I should specify that somewhere in the code

for #2 - So if my features are only on the interaction (and not on the user or item), I should use GRU4Rec and not GRU4RecF. What do you mean by "to get access to them" - do you mean that the model will consider them?

@rowedenny
Contributor

Please allow me to reply with an example. Say we have the following atomic file:

user_id:token item_id:token rating:float timestamp:float
1 1193 5 978302107
1 661 3 978302108
1 743 2 978302109

After the data augmentation, we expect to generate the following two sequences,

user_id:token   item_id_sequence:token   ITEM_ID   FEATURE_SEQUENCE_FIELD_NAME
1               1193                     661       5
1               1193, 661                743       5, 3
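The augmentation described above can be sketched in a few lines. This is a simplified illustration for a single user whose rows are already in time order, not RecBole's actual implementation (it ignores maximum sequence length, padding, and multiple users). The function name `augment` and its `feature` parameter are hypothetical.

```python
def augment(rows, feature="rating"):
    """Turn one user's time-ordered interactions into prefix sequences.

    Each sample is (user_id, item_id_sequence, target_item, feature_sequence),
    where both sequences cover only the interactions before the target.
    """
    samples = []
    for end in range(1, len(rows)):
        history = rows[:end]
        samples.append((
            rows[end]["user_id"],
            [r["item_id"] for r in history],
            rows[end]["item_id"],
            [r[feature] for r in history],
        ))
    return samples


# The three interactions from the atomic file in the example above.
rows = [
    {"user_id": 1, "item_id": 1193, "rating": 5, "timestamp": 978302107},
    {"user_id": 1, "item_id": 661, "rating": 3, "timestamp": 978302108},
    {"user_id": 1, "item_id": 743, "rating": 2, "timestamp": 978302109},
]

samples = augment(rows)
for sample in samples:
    print(sample)
```

The two samples it produces correspond to the two rows of the table: the history [1193] with target 661, and the history [1193, 661] with target 743, each paired with the ratings of the history items.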

Going back to your questions

  1. You need to specify the suffix used to generate the necessary X_sequence_field_name(s), for example the item sequence used to predict the target item. RecBole has implemented the mapping: as long as you specify "LIST_SUFFIX: _list", as shown in the config, it will generate the sequence field name by appending the suffix to the corresponding field name, e.g. item_id becomes item_id_list (written as item_id_sequence in the example above). I think that, given the PR above, this also works for the other interaction features.
  2. Now we need to fetch it within the model to calculate the loss. First assign an attribute to the model to hold the field name, then fetch the feature sequence from interaction. More concretely, here is an example:
def __init__(self, config, dataset):
    super(MyModel, self).__init__(config, dataset)
    self.FEATURE = config['FEATURE_FIELD']
    self.FEATURE_SEQ = self.FEATURE + config['LIST_SUFFIX']

and then you can get access to it, for example, within function calculate_loss,

def calculate_loss(self, interaction): 
    feature_seq = interaction[self.FEATURE_SEQ]

@mayaKaplansky
Author

Thank you!
Is it correct that the changes you suggest above are only if I want to predict additional features?
If I am only interested in predicting the next item_id, but want to use the features as additional info that can influence the model (used for learning), then I don't need to make these changes?

@rowedenny
Contributor

Please allow me to confirm your use case. Are you dealing with a case like this:
For movie recommendation, you would like to consider not only the movies that a user comments on, but also the ratings, which show how much the user likes each movie? (I believe this is the "use the features as additional info" case that you describe.)
In that case, you definitely need the changes.

@rowedenny
Contributor

Feel free to correct me if I am wrong.

Only the basic fields of the data frame have been pre-registered in abstract_recommender.py.
For example, if your model inherits from SequentialRecommender, then the attributes

self.USER_ID = config['USER_ID_FIELD']
self.ITEM_ID = config['ITEM_ID_FIELD']
self.ITEM_SEQ = self.ITEM_ID + config['LIST_SUFFIX']
self.ITEM_SEQ_LEN = config['ITEM_LIST_LENGTH_FIELD']
self.POS_ITEM_ID = self.ITEM_ID
self.NEG_ITEM_ID = config['NEG_PREFIX'] + self.ITEM_ID

will give access to the corresponding fields in the dataframe. Other than that, all customized fields need to be explicitly specified by the user.

Say you want to fetch the field named NUM_OF_TIMES. Then you need to 1) register the field name in the config, 2) assign an attribute within your customized model, e.g. self.NUM_OF_TIMES = config['NUM_OF_TIMES_FIELD'], and 3) fetch the field via interaction[self.NUM_OF_TIMES].
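The three steps can be traced end to end with stand-ins. In this sketch, plain dicts play the roles of the RecBole config and of the Interaction batch, and `MyModel` is a hypothetical class that does not inherit from any RecBole recommender. The field value `num_of_times` is likewise made up.

```python
class MyModel:
    """Stand-in for a customized recommender; not a real RecBole class."""

    def __init__(self, config):
        # Step 2: keep the registered field name as a model attribute.
        self.NUM_OF_TIMES = config["NUM_OF_TIMES_FIELD"]

    def calculate_loss(self, interaction):
        # Step 3: fetch the column by its field name.
        num_of_times = interaction[self.NUM_OF_TIMES]
        # A real model would compute a loss here; the sketch just returns
        # the fetched values so they can be inspected.
        return num_of_times


# Step 1: register the field name in the config (a plain dict here).
config = {"NUM_OF_TIMES_FIELD": "num_of_times"}

# A dict standing in for one batch of interactions.
interaction = {"item_id": [1193, 661], "num_of_times": [2, 7]}

model = MyModel(config)
fetched = model.calculate_loss(interaction)
print(fetched)
```

Keeping the field name in the config rather than hard-coding it in the model means the same model code can be pointed at a differently named column without edits.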

@mayaKaplansky
Author

Thanks, I guess we are waiting for an answer in the other thread :)

@ShanleiMu
Member

Thanks for @rowedenny's replies.

@mayaKaplansky If you want to use the additional inter feature fields in your sequential recommender, you can follow rowedenny's #608 (comment).

@mayaKaplansky
Author

Thanks! Use in a way that the model will use them for learning, or use in a way that they can be predicted?
I don't want to predict them, just use them for learning.
If I don't change as #608 (comment) of rowedenny, then what do your changes do in the model?

@mayaKaplansky
Author

Thank you for all your help.
So in abstract_recommender.py this is what it looks like now:

class SequentialRecommender(AbstractRecommender):
    """
    This is an abstract sequential recommender. All sequential models should implement this class.
    """
    type = ModelType.SEQUENTIAL

    def __init__(self, config, dataset):
        super(SequentialRecommender, self).__init__()

        # load dataset info
        self.USER_ID = config['USER_ID_FIELD']
        self.ITEM_ID = config['ITEM_ID_FIELD']
        self.ITEM_SEQ = self.ITEM_ID + config['LIST_SUFFIX']
        self.ITEM_SEQ_LEN = config['ITEM_LIST_LENGTH_FIELD']
        self.POS_ITEM_ID = self.ITEM_ID
        self.NEG_ITEM_ID = config['NEG_PREFIX'] + self.ITEM_ID
        self.max_seq_length = config['MAX_ITEM_LIST_LENGTH']
        self.n_items = dataset.num(self.ITEM_ID)
        self.JobGroup = config['JobGroup_FIELD']
        self.JobGroup_SEQ = self.JobGroup + config['LIST_SUFFIX']
        self.AgeGroup = config['AgeGroup_FIELD']
        self.AgeGroup_SEQ = self.AgeGroup + config['LIST_SUFFIX']
        self.GenderID = config['GenderID_FIELD']
        self.GenderID_SEQ = self.GenderID + config['LIST_SUFFIX']
        self.PatientLocationID = config['PatientLocationID_FIELD']
        self.PatientLocationID_SEQ = self.PatientLocationID + config['LIST_SUFFIX']

Is this OK?

You also explained that I need to specify this in the config file, by which I assume you meant:

# Selectively Loading
load_col:
    inter: [session_id, item_id, timestamp, PatientLocationID, GenderID, AgeGroup, JobGroup]

And your last instruction was: fetch the field via interaction[self.NUM_OF_TIMES]
I couldn't find where I should make this change. Can you elaborate?

Many thanks!

@rowedenny
Contributor

rowedenny commented Dec 30, 2020

  1. The customized model looks OK to me. A minor tip: you may create a model that inherits from SequentialRecommender instead of AbstractRecommender, and then create the additional fields you need. The benefit is that the class SequentialRecommender already provides specific functions, e.g. data augmentation, and the corresponding sampler class RepeatableSampler is pre-defined based on the class of the model.
  2. For NUM_OF_TIMES, I notice that you raised another thread about sessions, yet I replied in this thread. However, the idea for fetching a customized field still works. For example, if you want to fetch job_group in the calculate_loss function of the customized model, you can call interaction[self.JobGroup], and it should work.
  3. Finally, I strongly suggest you examine the field values. I would first shrink the dataset to a smaller one, say only one user with several interactions, and print out the dataframe. Then check whether the field values fetched from interaction are identical.
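The check suggested in point 3 could look roughly like the sketch below. It uses plain lists and dicts in place of the real dataframe and Interaction, and the helpers `rows_for_user` and `mismatches` are hypothetical. One caveat, as far as I know: token-type fields are remapped to internal ids when loaded, so a raw comparison like this is most direct for numeric fields.

```python
def rows_for_user(rows, user_id):
    """Shrink the raw table to a single user's interactions."""
    return [r for r in rows if r["user_id"] == user_id]


def mismatches(raw_rows, fetched_values, field):
    """Return (position, raw_value, fetched_value) for every differing entry."""
    expected = [r[field] for r in raw_rows]
    if len(expected) != len(fetched_values):
        raise ValueError("row counts differ between raw data and fetched values")
    return [
        (i, e, f)
        for i, (e, f) in enumerate(zip(expected, fetched_values))
        if e != f
    ]


# Toy raw data; the values are made up for illustration.
raw = [
    {"user_id": 1, "item_id": 1193, "JobGroup": 4},
    {"user_id": 1, "item_id": 661, "JobGroup": 4},
    {"user_id": 2, "item_id": 743, "JobGroup": 9},
]

one_user = rows_for_user(raw, 1)
for row in one_user:
    print(row)

# Stand-in for the values obtained via interaction[self.JobGroup].
fetched = [4, 4]
problems = mismatches(one_user, fetched, "JobGroup")
print(problems)
```

An empty result means the fetched column matches the raw file for that user.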

@mayaKaplansky
Author

Thank you!
