Muhammad Sohail Danish*, Muhammad Akhtar Munir*, Syed Roshaan Ali Shah, Kartik Kuckreja, Fahad Shahbaz Khan, Paolo Fraccaro , Alexandre Lacoste and Salman Khan
* Equally contributing first authors
Mohamed bin Zayed University of AI, University College London, Linköping University, IBM Research Europe, UK, ServiceNow Research, Australian National University
Official GitHub repository for GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks
- Dec-02-24: We release the benchmark dataset (HuggingFace link).
- Dec-02-24: The arXiv preprint is released: https://arxiv.org/abs/2411.19325 🔥🔥
The code and leaderboard will be released shortly. Follow this repository for updates!
Figure: Examples of tasks from the GEOBench-VLM benchmark. Our benchmark is designed to evaluate VLMs on a diverse range of remote sensing applications. The benchmark includes over 10,000 questions spanning a range of tasks essential for Earth Observation, such as Temporal Understanding, Referring Segmentation, Visual Grounding, Scene Understanding, Counting, Detailed Image Captioning, and Relational Reasoning. Each task is tailored to capture unique domain-specific challenges, featuring varied visual conditions and object scales, and requiring nuanced understanding for applications like disaster assessment, urban planning, and environmental monitoring.
Abstract: While numerous recent benchmarks focus on evaluating generic Vision-Language Models (VLMs), they fall short in addressing the unique demands of geospatial applications. Generic VLM benchmarks are not designed to handle the complexities of geospatial data, which is critical for applications such as environmental monitoring, urban planning, and disaster management. Some of the unique challenges in the geospatial domain include temporal analysis of changes, counting objects in large quantities, detecting tiny objects, and understanding relationships between entities in remote sensing imagery. To address this gap, we present GEOBench-VLM, a comprehensive benchmark specifically designed to evaluate VLMs on geospatial tasks, including scene understanding, object counting, localization, fine-grained categorization, and temporal analysis. Our benchmark features over 10,000 manually verified instructions and covers a diverse set of variations in visual conditions, object types, and scales. We evaluate several state-of-the-art VLMs to assess their accuracy within the geospatial context. The results indicate that although existing VLMs demonstrate potential, they face challenges with geospatial-specific examples, highlighting room for further improvement. Specifically, the best-performing GPT-4o achieves only 40% accuracy on MCQs, which is only double the random-guess performance.
- **GEOBench-VLM Benchmark.** We introduce GEOBench-VLM, a benchmark suite designed specifically for evaluating VLMs on geospatial tasks and addressing geospatial data challenges. It covers 8 broad categories and 31 sub-tasks with over 10,000 manually verified questions.
- **Evaluation of VLMs.** We provide a detailed evaluation of ten state-of-the-art VLMs, including generic (open- and closed-source) and task-specific geospatial VLMs, highlighting their capabilities and limitations in handling geospatial tasks.
- **Analysis of Geospatial Task Performance.** We analyze performance across a range of tasks, including scene classification, counting, change detection, relationship prediction, visual grounding, image captioning, segmentation, disaster detection, and temporal analysis, among others, providing key insights into improving VLMs for geospatial applications.
Table: Overview of Generic and Geospatial-specific Datasets & Benchmarks, detailing modalities (O=Optical, PAN=Panchromatic, MS=Multi-spectral, IR=Infrared, SAR=Synthetic Aperture Radar, V=Video, MI=Multi-image, BT=Bi-Temporal, MT=Multi-temporal), data sources (DRSD=Diverse RS Datasets, OSM=OpenStreetMap, GE=Google Earth), answer types (MCQ=Multiple Choice, SC=Single Choice, FF=Free-Form, BBox=Bounding Box, Seg=Segmentation Mask), and annotation types (A=Automatic, M=Manual).
Our pipeline integrates diverse datasets, automated tools, and manual annotation. Tasks such as scene understanding, object classification, and non-optical analysis are built on existing classification datasets, with GPT-4o generating unique MCQs with five options: one correct answer, one semantically similar "closest" distractor, and three plausible alternatives. Spatial relationship tasks rely on manually annotated object-pair relationships, with consistency ensured through cross-verification. Caption generation leverages GPT-4o, combining the image, object details, and spatial interactions, followed by manual refinement for high precision.
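As a rough illustration of the MCQ construction step described above, here is a minimal sketch in Python. The `query_gpt4o` helper, the prompt text, and the field names are hypothetical assumptions for illustration; the actual prompts and pipeline are defined in the paper.

```python
# Minimal sketch of MCQ construction from a classification sample, assuming a
# hypothetical `query_gpt4o(prompt) -> str` helper and hypothetical field names.
import json
import random

DISTRACTOR_PROMPT = (
    "The ground-truth label of a remote sensing image is '{label}'. "
    "Return a JSON list with one semantically closest distractor and "
    "three other plausible but incorrect options."
)

def build_mcq(sample, query_gpt4o):
    """Build a five-option MCQ: the correct label, one 'closest' distractor, three plausible ones."""
    label = sample["label"]                      # ground-truth class from the source dataset
    distractors = json.loads(query_gpt4o(DISTRACTOR_PROMPT.format(label=label)))[:4]
    options = [label] + distractors              # 1 correct + 4 distractors
    random.shuffle(options)                      # avoid positional bias toward the first option
    return {
        "image": sample["image_path"],
        "question": "Which category best describes this scene?",
        "options": options,
        "answer_index": options.index(label),    # index of the correct choice after shuffling
    }
```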
Performance Summary of VLMs Across Geospatial Tasks. GPT-4o achieves higher accuracy on relatively easy tasks such as Aircraft Type Classification, Disaster Type Classification, Scene Classification, and Land Use Classification. However, on average, the best-performing GPT-4o achieves only 40% accuracy on MCQs spanning diverse geospatial tasks, only double the random-guess performance. These results showcase the varying strengths of VLMs in addressing diverse geospatial tasks.
Results highlight the strengths of VLMs in handling temporal geospatial challenges. We evaluate five tasks: Crop Type Classification, Disaster Type Classification, Farm Pond Change Detection, Land Use Classification, and Damaged Building Count. GPT-4o achieves the highest overall accuracy on both the classification and counting tasks.
| Model | Crop Type Classification | Disaster Type Classification | Farm Pond Change Detection | Land Use Classification | Damaged Building Count |
|---|---|---|---|---|---|
| LLaVA-OneV | 0.1273 | 0.4493 | 0.1579 | 0.5672 | 0.2139 |
| Qwen2-VL | 0.1273 | 0.5903 | 0.0921 | 0.5869 | 0.2270 |
| GPT-4o | 0.1818 | 0.6344 | 0.1447 | 0.6230 | 0.2420 |
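As a rough illustration of how per-task accuracies like those above can be computed, here is a minimal sketch. The record fields (`task`, `prediction`, `answer`) are assumptions for illustration, not the official evaluation schema or code.

```python
# Minimal sketch of per-task MCQ accuracy; the record fields ('task', 'prediction',
# 'answer') are assumed for illustration and are not the official schema.
from collections import defaultdict

def per_task_accuracy(records):
    """Return {task: fraction of records whose predicted option matches the answer}."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[r["task"]] += 1
    return {task: correct[task] / total[task] for task in total}

# Toy example with dummy records:
print(per_task_accuracy([
    {"task": "Land Use Classification", "prediction": "B", "answer": "B"},
    {"task": "Land Use Classification", "prediction": "C", "answer": "A"},
]))  # {'Land Use Classification': 0.5}
```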
Referring Expression Detection. We report precision at IoU thresholds of 0.5 and 0.25.
| Model | Precision@0.5 IoU | Precision@0.25 IoU |
|---|---|---|
| Sphinx | 0.3408 | 0.5289 |
| GeoChat | 0.1151 | 0.2100 |
| Ferret | 0.0943 | 0.2003 |
| Qwen2-VL | 0.1518 | 0.2524 |
| GPT-4o | 0.0087 | 0.0386 |
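For reference, here is a minimal sketch of the Precision@IoU metric reported in the table above, assuming axis-aligned boxes in (x1, y1, x2, y2) format and one ground-truth box per expression. This is an illustration of the metric, not the benchmark's official evaluation script.

```python
# Minimal sketch of Precision@IoU for referring expression detection, assuming
# axis-aligned (x1, y1, x2, y2) boxes and one ground-truth box per expression.
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def precision_at_iou(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of predictions whose IoU with the matching ground-truth box meets the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(pred_boxes)

# Toy example: one close prediction, one far off -> precision 0.5 at IoU 0.5.
print(precision_at_iou([(10, 10, 50, 50), (0, 0, 5, 5)],
                       [(12, 12, 48, 52), (40, 40, 80, 80)]))
```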
Scene Understanding: The figure illustrates model performance on geospatial scene understanding tasks, highlighting successes in clear contexts and challenges in ambiguous scenes. The results emphasize the importance of contextual reasoning and of addressing overlapping visual cues for accurate classification.
Counting: The figure showcases model performance on counting tasks, where Qwen2-VL, GPT-4o, and LLaVA-OneV perform better at identifying objects. Other models, such as Ferret, struggle with overestimation, highlighting challenges in object differentiation and spatial reasoning.
Object Classification: The figure highlights model performance on object classification, showing success with familiar objects like the "atago-class destroyer" and "small civil transport/utility" aircraft. However, models struggle with rarer objects like the "murasame-class destroyer" and "garibaldi aircraft carrier", indicating a need for improvement on less common classes and fine-grained recognition.
Event Detection: Model performance on disaster assessment tasks, with success in scenarios like 'fire' and 'flooding' but challenges in ambiguous cases like 'tsunami' and 'seismic activity'. Misclassifications highlight limitations in contextual reasoning and insufficient exposure to overlapping disaster features.
Spatial Relations: The figure demonstrates model performance on spatial relationship tasks, with success in close-object scenarios and struggles in cluttered environments with distant objects.
If you find our work and this repository useful, please consider giving our repo a star and citing our paper as follows:
@article{danish2024geobenchvlm,
title={GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks},
author={Muhammad Sohail Danish and Muhammad Akhtar Munir and Syed Roshaan Ali Shah and Kartik Kuckreja and Fahad Shahbaz Khan and Paolo Fraccaro and Alexandre Lacoste and Salman Khan},
year={2024},
eprint={2411.19325},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.19325},
}
If you have any questions, please create an issue on this repository or contact us at muhammad.sohail@mbzuai.ac.ae.