Details about generating video captions for InterVid #142

Open
fmthoker opened this issue Jul 9, 2024 · 1 comment
Comments

fmthoker commented Jul 9, 2024

Dear authors,
Can you share some details about how we can generate captions for new videos in the same manner as was done for InternVid? According to the paper, you generated a single caption for the middle frame using BLIP-2 and frame-by-frame captions using the Tag2Text model at a low fps. Can you share the fps used for the Tag2Text part and how many frames were used for each video? Is the number of frames fixed, or does it vary with the video length? Any other details would be helpful.
Finally, how did you summarize all the captions using the T5-summary model? Did you use any specific prompts?

Member

yinanhe commented Jul 10, 2024

Thank you for your interest in our work. For the InternVid dataset, we sampled frames at a rate of 1 frame per second and used the Tag2Text model to produce image-level captions. However, because the descriptions generated by Tag2Text were somewhat repetitive, we also used BLIP-2 to enrich the captions. Additionally, we included descriptions of intermediate frames in the overall narrative. For summarization, the T5-summarize model had already been trained on summarization tasks, so no elaborate prompt crafting was needed.
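For readers who want to reproduce the sampling step, here is a minimal sketch of how frame indices could be chosen at 1 frame per second, together with the middle frame mentioned in the question for BLIP-2. It is not the authors' code: the function names `sample_frame_indices` and `middle_frame_index` are hypothetical, and the rounding strategy is an assumption. Under this reading, the number of captioned frames grows with the clip duration rather than being fixed.

```python
def sample_frame_indices(num_frames: int, native_fps: float,
                         target_fps: float = 1.0) -> list[int]:
    """Return indices of frames sampled at roughly `target_fps`.

    Hypothetical helper: picks the native frame closest to each
    sampling instant (0 s, 1 s, 2 s, ... for target_fps = 1).
    """
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    step = native_fps / target_fps  # native frames per sampled frame
    indices = []
    k = 0
    while True:
        idx = int(round(k * step))
        if idx >= num_frames:
            break
        indices.append(idx)
        k += 1
    return indices


def middle_frame_index(num_frames: int) -> int:
    """Index of the middle frame (assumed target for the single BLIP-2 caption)."""
    return num_frames // 2


if __name__ == "__main__":
    # A 3-second clip at 30 fps has 90 frames.
    print(sample_frame_indices(90, 30.0))
    print(middle_frame_index(90))
```

The selected frames would then be passed to whichever captioning models you use; loading and running Tag2Text, BLIP-2, and the T5 summarizer is left out here because their exact configuration is not given in this thread.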

Additionally, I would like to point you to VideoChat2-HD, our more accurate and detailed multimodal video model. All you need to do is input the video into the model and use a simple prompt such as "Describe the video in detail." The model will then generate descriptions that are richer and more precise than those in InternVid.
