ViLLa: Video Reasoning Segmentation with Large Language Model

Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao*

Previous studies have explored integrating reasoning with video segmentation through LLMs, but they struggle to model complex scenes characterized by multiple objects, rapid motion, heavy occlusions, and extended durations. ViLLa (Video reasoning segmentation with Large Language Model) handles complex reasoning and referring video segmentation, and it also performs well across different temporal understanding benchmarks.

Illustrations of ViLLa.

Our ViLLa is an effective and efficient LMM capable of segmenting and tracking: (a) multiple objects with rapid motion; (b) objects in crowded scenes; (c) objects in long videos with occlusions.

Visualization Results.

Comparison between ViLLa and VISA.

Experiments

Reasoning video segmentation results for ViLLa and prior work on the VideoReasonSeg benchmark. "Seg" stands for "Segmentation" and "MC" for "Multiple Choice".
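To make the two result columns easier to picture, here is a minimal sketch of how a segmentation score and a multiple-choice score could be computed. This README does not state which metrics VideoReasonSeg uses, so the sketch assumes a per-frame mask IoU averaged over the video for "Seg" and plain answer accuracy for "MC". The function names `frame_iou`, `video_iou`, and `multiple_choice_accuracy` are hypothetical and are not taken from the ViLLa code or the benchmark's evaluation script.

```python
from typing import List, Sequence

# One binary mask per frame, stored as rows of 0/1 values.
Mask = List[List[int]]


def frame_iou(pred: Mask, gt: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def video_iou(pred_masks: Sequence[Mask], gt_masks: Sequence[Mask]) -> float:
    """Mean per-frame IoU over a video (assumed 'Seg'-style score)."""
    if len(pred_masks) != len(gt_masks) or not pred_masks:
        raise ValueError("need the same non-zero number of frames")
    scores = [frame_iou(p, g) for p, g in zip(pred_masks, gt_masks)]
    return sum(scores) / len(scores)


def multiple_choice_accuracy(predictions: Sequence[str],
                             answers: Sequence[str]) -> float:
    """Fraction of questions answered correctly (assumed 'MC'-style score)."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("need the same non-zero number of answers")
    correct = sum(1 for p, a in zip(predictions, answers) if p == a)
    return correct / len(answers)


# Toy two-frame video with 2x2 masks (illustrative data only).
pred_video = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gt_video = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
seg_score = video_iou(pred_video, gt_video)
mc_score = multiple_choice_accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"])
```

The toy masks and answers above are illustrative only and do not correspond to any reported result.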
