rkzheng99/ViLLa

ViLLa: Video Reasoning Segmentation with Large Language Model

Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao*

[paper] [code]

Previous studies have explored integrating reasoning with video segmentation through LLMs, but they struggle to model complex scenes characterized by multiple objects, rapid motion, heavy occlusions, and extended durations. ViLLa (Video reasoning segmentation with Large Language Model) handles both complex reasoning and referring video segmentation. It also performs well on several temporal understanding benchmarks.

Illustrations of ViLLa.

ViLLa is an effective and efficient LMM capable of segmenting and tracking: (a) multiple objects with rapid motion; (b) objects in crowded scenes; (c) objects in long videos with occlusions.

Visualization Results.

Comparison between ViLLa and VISA.

Experiments

Reasoning video segmentation results for ViLLa and prior related work on the VideoReasonSeg benchmark. "Seg" stands for "Segmentation" and "MC" stands for "Multiple Choice".

