Is it possible to learn an action model (or the action's effects) with V-JEPA? #71
Comments
Hi, if you want to predict the pixel values of the next frame based on the previous frame, you can modify the masking methodology to mask only the last frame (or something similar). However, as mentioned in the blog, video may progress slowly, which makes this type of task too easy.
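For concreteness, here is a minimal sketch of what "mask only the last frame" could look like, assuming the clip is patchified into a (T', H', W') grid of tokens flattened in (t, h, w) order. The function, layout, and grid sizes are illustrative assumptions, not the repo's actual masking API:

```python
import torch

def last_frame_mask(t_tokens: int, h_tokens: int, w_tokens: int):
    """Return flat indices of context (visible) and target (masked) tokens,
    masking every spatial position of the final temporal slice only.

    Assumes tokens are flattened in (t, h, w) order -- an illustrative
    assumption, not necessarily how the repo lays them out.
    """
    ids = torch.arange(t_tokens * h_tokens * w_tokens).reshape(t_tokens, h_tokens, w_tokens)
    target = ids[-1].reshape(-1)      # all tokens of the last frame
    context = ids[:-1].reshape(-1)    # everything before it stays visible
    return context, target

# Example: an 8x14x14 token grid
ctx, tgt = last_frame_mask(8, 14, 14)
print(ctx.numel(), tgt.numel())  # 1372 context tokens, 196 target tokens
```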
Thanks a lot for your answer, but when you say "However, as mentioned in the blog, video may progress slowly which makes this type of task too easy.", what exactly do you mean by "video may progress slowly"?
Say you have a person moving his hand at 60 fps. Between frame 1 and frame 2, his hand has barely moved, while the background and the rest of his body haven't moved at all. Thus, it is likely that the model will learn to predict the 2nd frame to be exactly the same as the 1st frame. In contrast, masking tubelets that span the whole temporal (z) direction over a small portion of x, y, as in the paper, forces the model to predict, for example, his masked head using his body as context.
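To illustrate the contrast, here is a rough sketch of masking a small spatial block across the entire temporal axis (a tubelet) rather than a whole frame. The real multi-block sampler in the repo is more involved (several blocks, sampled scales and aspect ratios), so treat this as an illustration only:

```python
import torch

def tubelet_mask(t_tokens: int, h_tokens: int, w_tokens: int,
                 block_h: int = 4, block_w: int = 4, seed: int = 0):
    """Mask a (block_h x block_w) spatial region across the *entire*
    temporal axis, instead of masking a whole frame.

    Illustrative sketch; block sizes and the random placement are assumptions.
    """
    g = torch.Generator().manual_seed(seed)
    top = torch.randint(0, h_tokens - block_h + 1, (1,), generator=g).item()
    left = torch.randint(0, w_tokens - block_w + 1, (1,), generator=g).item()

    ids = torch.arange(t_tokens * h_tokens * w_tokens).reshape(t_tokens, h_tokens, w_tokens)
    target = ids[:, top:top + block_h, left:left + block_w].reshape(-1)

    keep = torch.ones(t_tokens * h_tokens * w_tokens, dtype=torch.bool)
    keep[target] = False
    context = ids.reshape(-1)[keep]
    return context, target
```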
Hello,
I would like to know if it is possible to incorporate knowledge of the actions performed by an agent into the architecture.
From my understanding, the unmasked part of the image and the coordinates of the masked parts are given as input to the predictor (which predicts the masked parts). So, as I understand it, the predictor predicts static elements (parts of the same image) rather than next states.
Would it be possible, instead, to make JEPA predict the next image, given the present image and an action? Or can the current implementation be used to produce representations that would fit this downstream task (i.e. obtaining the "effects" of an action on an image)?
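To make the question concrete, here is a rough, purely hypothetical sketch of the kind of action-conditioned predictor I have in mind; the class name, dimensions, and training target are my own assumptions and not anything present in this codebase:

```python
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):
    """Hypothetical sketch: predict the encoder representation of the *next*
    frame from the representation of the current frame plus an action token.
    Not part of V-JEPA; design choices here are assumptions for illustration.
    """
    def __init__(self, embed_dim: int = 384, action_dim: int = 8,
                 depth: int = 6, heads: int = 6):
        super().__init__()
        # Embed the action as a single extra token prepended to the sequence.
        self.action_proj = nn.Linear(action_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.predictor = nn.TransformerEncoder(layer, depth)

    def forward(self, z_t: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # z_t: (B, N, D) patch representations of the current frame (from a frozen encoder)
        # action: (B, action_dim) continuous or one-hot action vector
        a = self.action_proj(action).unsqueeze(1)          # (B, 1, D)
        out = self.predictor(torch.cat([a, z_t], dim=1))   # condition on the action token
        return out[:, 1:]                                  # predicted next-frame representations

# The training target would be the (frozen/EMA) encoder's representation of the
# next frame, e.g. a regression loss between the prediction and z_{t+1}.detach().
```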
Thanks a lot