Is it possible to learn an action model (or the action's effects) with V-JEPA? #71

aymeric75 opened this issue Jun 26, 2024 · 3 comments

aymeric75 commented Jun 26, 2024

Hello,

I would like to know whether it is possible to add knowledge of the actions performed by an agent into the architecture.

From my understanding, the unmasked part of the image and the coordinates of the masked parts are given as input to the predictor (which predicts the masked parts). So, as I understand it, the predictor predicts static elements (parts of the same image) rather than next states.

Would it be possible, instead, to make JEPA predict the next image, given the present image and an action? Or can the current implementation be used to produce representations that would fit this downstream task (i.e. obtaining the "effects" of an action on an image)?
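To make the idea concrete, here is a rough sketch of the kind of action-conditioned predictor I have in mind. This is purely hypothetical: the module, names, and dimensions are made up for illustration and are not part of the actual V-JEPA code.

```python
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):
    """Toy sketch: predict the representation of the next frame from the
    representation of the current frame plus an action embedding.
    (Hypothetical module, not part of the V-JEPA repo.)"""

    def __init__(self, repr_dim=768, num_actions=10, action_dim=64):
        super().__init__()
        self.action_emb = nn.Embedding(num_actions, action_dim)
        self.predictor = nn.Sequential(
            nn.Linear(repr_dim + action_dim, repr_dim),
            nn.GELU(),
            nn.Linear(repr_dim, repr_dim),
        )

    def forward(self, z_t, action):
        # z_t: (batch, repr_dim) encoder representation of the current frame
        # action: (batch,) integer action ids
        a = self.action_emb(action)  # (batch, action_dim)
        return self.predictor(torch.cat([z_t, a], dim=-1))  # predicted z_{t+1}

# The idea would be to train this JEPA-style, regressing the prediction onto
# the (target-encoder) representation of the next frame with an L1/L2 loss.
model = ActionConditionedPredictor()
z_t = torch.randn(4, 768)
action = torch.randint(0, 10, (4,))
z_next_pred = model(z_t, action)
```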

Thanks a lot

icekang commented Jul 1, 2024

Hi,

If you want to predict the pixel values of the next frame based on the previous frames, you can modify the masking methodology to mask only the last frame (or something similar). However, as mentioned in the blog, video may progress slowly, which makes this type of task too easy.
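As a minimal sketch of what "mask only the last frame" could look like, assuming patch tokens arranged as a (T, H, W) grid (the shapes and function name are illustrative, not the repo's actual mask collator):

```python
import torch

def last_frame_mask(num_frames, h_patches, w_patches):
    """Boolean mask over patch tokens of shape (T, H, W), where True marks
    tokens to be masked (predicted). Here only the last frame is masked.
    (Illustrative sketch, not the actual V-JEPA mask collator.)"""
    mask = torch.zeros(num_frames, h_patches, w_patches, dtype=torch.bool)
    mask[-1] = True  # hide every patch of the last frame
    return mask.flatten()  # (T * H * W,) flattened token mask

mask = last_frame_mask(num_frames=16, h_patches=14, w_patches=14)
context_idx = (~mask).nonzero(as_tuple=True)[0]  # tokens the predictor sees
target_idx = mask.nonzero(as_tuple=True)[0]      # tokens it must predict
```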

It’s also important to note that, in most videos, things evolve somewhat slowly over time. If you mask a portion of the video but only for a specific instant in time and the model can see what came immediately before and/or immediately after, it also makes things too easy and the model almost certainly won’t learn anything interesting. As such, the team used an approach where it masked portions of the video in both space and time, which forces the model to learn and develop an understanding of the scene.

aymeric75 (Author) commented

Hi,

Thanks a lot for your answer, but when you say "However, as mentioned in the blog, video may progress slowly which makes this type of task too easy", what exactly do you mean by "video may progress slowly"?

icekang commented Jul 7, 2024

Say you have a video of a person moving his hand at 60 fps. Between frame 1 and frame 2, his hand has barely moved, while the background and the rest of his body have not moved at all. Thus, the model is likely to learn to predict the 2nd frame as being exactly the same as the 1st frame.

In contrast, masking in tubelets that span the whole temporal direction over a small portion of the spatial (x, y) extent, as in the paper, forces the model to learn to predict, for example, the person's masked head using his body as context.
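As a minimal sketch of such a space-time tube mask (again with illustrative shapes and names, not the actual V-JEPA multi-block mask collator), a random spatial block masked across all frames could be built like this:

```python
import torch

def tube_mask(num_frames, h_patches, w_patches, block_h=6, block_w=6):
    """Mask a random (block_h x block_w) spatial block across ALL frames,
    so the masked region forms a tube through time.
    (Illustrative sketch, not the actual V-JEPA multi-block collator.)"""
    mask = torch.zeros(num_frames, h_patches, w_patches, dtype=torch.bool)
    top = torch.randint(0, h_patches - block_h + 1, (1,)).item()
    left = torch.randint(0, w_patches - block_w + 1, (1,)).item()
    mask[:, top:top + block_h, left:left + block_w] = True
    return mask.flatten()  # (T * H * W,) flattened token mask

mask = tube_mask(num_frames=16, h_patches=14, w_patches=14)
```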
