[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
[CVPR2023 Highlight] GRES: Generalized Referring Expression Segmentation
[CVPR 2024] Aligning and Prompting Everything All at Once for Universal Visual Perception
[ICCV2021 & TPAMI2023] Vision-Language Transformer and Query Generation for Referring Segmentation
Instruction Following Agents with Multimodal Transformers
[NeurIPS 2023] Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
VTC: Improving Video-Text Retrieval with User Comments
Streamlit App Combining Vision, Language, and Audio AI Models
Multimodal Agentic GenAI Workflow – Seamlessly blends retrieval and generation for intelligent storytelling
Coding a Multi-Modal vision model like GPT-4o from scratch, inspired by @hkproj and PaliGemma
VizWiz Challenge Term Project for Multi Modal Machine Learning @ CMU (11777)