LMMRotate 🎮: A Simple Aerial Detection Baseline of Multimodal Language Models

Qingyun Li, Yushi Chen, Xinya Shu, Dong Chen, Xin He, Yi Yu, Xue Yang

If you find our work helpful, please consider giving us a ⭐!

This repo is a technical practice of fine-tuning large multimodal language models for oriented object detection in the style of MMRotate, and it hosts the official implementation of the paper: A Simple Aerial Detection Baseline of Multimodal Language Models.

(Figure: framework)

We currently support fine-tuning and evaluating Florence-2 models on three optical datasets (DOTA-v1.0, DIOR-R, FAIR1M-v1.0) and two SAR datasets (SRSDD, RSAR), reproducing the experimental results in the technical report paper. Thanks to the strong grounding and detection performance of the pre-trained foundation model, our detection performance rivals that of conventional detectors (e.g., RetinaNet, FCOS), even in challenging scenarios with dense and small-scale objects in the images. We hope that this baseline will serve as a reference for future MLM development, enabling more comprehensive capabilities for understanding remote sensing data.

We'll also release a resource-friendly setting to enable experiments on consumer-grade GPUs such as the RTX 4090 (work in progress).

Performance

Get the model weights on Hugging Face

Click here for the visualization of the MLM detector; you can zoom in for a clearer view.

(Figure: performance results)

NOTE: The results of the jointly trained Florence-2-large models above will be updated in the current technical report paper soon.

mAP_nc stands for 'mAP without confidence scores'. As our detector does not output confidence scores, we use mAP_nc and mF1 as evaluation metrics; refer to the technical report paper for more details. This notebook records the practices from the exploration stage.
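
For reference, the per-class F1 underlying mF1 is the standard harmonic mean of precision P and recall R, F1 = 2PR / (P + R); how predictions are matched to ground truth when computing it is detailed in the paper.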

Get Started (WIP)

First, refer to Enviroment.md to prepare an environment.

Then, refer to Data.md to prepare/download the data.

NOTE:

  1. We support multi-node distributed training based on SLURM. If your resource platform is different and requires multi-node distributed training, you may need to adapt the shell scripts to your platform. Alternatively, you can fold the node count into the gradient_accumulation_steps option (see the sketch after this list). Contact us via an issue for more support.
  2. The v2 in a script name records the response format version, not the dataset version. For example, dota1-v2 means DOTA-v1.0 with the 2nd response format.
  3. The data split names may be misleading. We use trainval to represent the default training split (training on trainval if a val split exists, otherwise on train only; testing on test only). However, as described in the paper, the mF1 calculation requires ground truth for evaluation. Hence, we append -train to the dataset name to indicate training on train only and evaluating on val. (Contact me via an issue if anything is still confusing; I plan to refactor this in the future.)
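
As an example of trading node count for gradient accumulation, the two DOTA-v1.0 launch scripts from the Practices section below are intended to give the same effective batch size, assuming the script names encode per-device batch x GPU count (x gradient accumulation steps):

# effective batch size = per_device_batch x total_gpus x gradient_accumulation_steps
# SLURM, 16 GPUs in total, per-device batch 2, no accumulation: 2 x 16 x 1 = 32
srun ... bash scripts/florence-2-l_vis1024-lang2048_dota1-v2_b2x16-100e.sh
# single node, 8 GPUs, per-device batch 2, accumulation 2:      2 x 8 x 2 = 32
bash scripts/florence-2-l_vis1024-lang2048_dota1-v2_b2x8xga2-100e.sh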

Practices

  • fine-tune the MLM detector (taking Florence-2-large on DOTA-v1.0 as an example, via SLURM or standalone):
srun ... bash scripts/florence-2-l_vis1024-lang2048_dota1-v2_b2x16-100e.sh
bash scripts/florence-2-l_vis1024-lang2048_dota1-v2_b2x8xga2-100e.sh
  • evaluate the model on DOTA-v1.0:
# get mAP_nc
srun ... bash scripts/eval_slurm.sh <checkpoint folder path>
bash scripts/eval_standalone.sh <checkpoint folder path>
# then get mF1
python -u -m lmmrotate.modules.f1_metric <checkpoint folder path>/<pkl file>
  • visualization (for 20 sampled images):
bash scripts/eval_standalone.sh <checkpoint folder path> --shuffle_seed 42 --clip_num 20 --vis
  • train a baseline detector and get the mAP, mAP_nc, and F1 scores (taking Rotated-RetinaNet on RSAR as an example):
# train the detector
python -u playground/mmrotate_train.py playground/mmrotate_configs/rotated-retinanet-rbox-le90_r50_fpn_1x_rsar-1024.py --work-dir playground/mmrotate_workdir/rotated-retinanet-rbox-le90_r50_fpn_1x_rsar-1024
# inference on test dataloader to get result pickle file
python -u playground/mmrotate_test.py playground/mmrotate_configs/rotated-retinanet-rbox-le90_r50_fpn_1x_rsar-1024.py playground/mmrotate_workdir/rotated-retinanet-rbox-le90_r50_fpn_1x_rsar-1024/epoch_12.pth --out playground/mmrotate_workdir/rotated-retinanet-rbox-le90_r50_fpn_1x_rsar-1024/results_test.pkl
# get mAP_nc (best mAP_nc over a sequence of score thresholds)
python -u playground/eval_mmrotate_detector_mapnc.py --dataset_name rsar --pickle_result_path playground/mmrotate_workdir/rotated-retinanet-rbox-le90_r50_fpn_1x_rsar-1024/results_test.pkl
# get F1 (best F1 over a sequence of score thresholds)
python -u -m lmmrotate.modules.f1_metric playground/mmrotate_workdir/rotated-retinanet-rbox-le90_r50_fpn_1x_rsar-1024/results_test.pkl

Interface

Some options of the training script:

  • data_path and image_folder: you can pass multiple datasets here to train on multiple datasets; remember to set dataset_mode.
  • dataset_mode: one of single, concat, and balanced; refer to the paper for more details.
  • model_type: currently only florence2; more models are work in progress.
  • model_name_or_path: the pretrained model path or name on the Hugging Face Hub.
  • image_square_length: set to 1024 to train florence2 at 1024x1024 resolution; it has no effect for models with dynamic resolution.
  • language_model_max_length: model_max_length for the language model.
  • model_revision: commit id of the model on the Hugging Face Hub. (Florence-2-large has had a later update, which is not used in this repo.)
  • language_model_lora: LoRA option, similar to InternVL.
  • response_format: box encoding and decoding format.
  • ...... (Contact me via an issue if you have questions.)
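
As an illustration of how these options combine for joint training on two datasets with balanced sampling, here is a hypothetical flag set. <train entry> is a placeholder for the actual training command, which is wrapped inside the launch scripts under scripts/, and the dataset paths are placeholders too; only the option names come from the list above:

# hypothetical sketch, not a verified command: check the launch scripts under scripts/ for the real entry point and defaults
<train entry> \
    --model_type florence2 \
    --model_name_or_path microsoft/Florence-2-large \
    --data_path <dota annotation file> <dior annotation file> \
    --image_folder <dota image folder> <dior image folder> \
    --dataset_mode balanced \
    --image_square_length 1024 \
    --language_model_max_length 2048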

Some options of the map_nc eval script:

  • model_ckpt_path: checkpoint path; you can pass multiple checkpoints.
  • result_path: folder to save eval log
  • eval_intermediate_checkpoints: whether to eval intermediate checkpoints
  • vis: visualize the result while evaluating
  • pass_evaluate: only run inference and dump results without evaluating them (because inference requires a GPU, while evaluation does not).
  • dataset_type: which dataset to evaluate; if not passed, it is inferred from the checkpoint name.
  • split: which splits to evaluate on.
  • clip_num: clip the dataset to this many samples when you want quick results or visualizations.
  • shuffle_seed: seed for shuffling when clipping the dataset.
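
For example, several of these options can be combined with the standalone eval script from the Practices section. This is a sketch: the option names follow the list above, the paths are placeholders, and the exact argument spelling should be checked against scripts/eval_standalone.sh:

# evaluate multiple checkpoints, including intermediate ones, and visualize while evaluating
bash scripts/eval_standalone.sh <checkpoint folder path 1> <checkpoint folder path 2> \
    --result_path <folder to save eval log> \
    --eval_intermediate_checkpoints \
    --vis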

Contact and Acknowledge

Feel free to contact me via email (21b905003@stu.hit.edu.cn) or a GitHub issue. I'll continue to maintain this repo.

The code is based on MMRotate and Transformers. Many modules refer to InternVL and LLaVA. The model architecture benefits from the open-source general-purpose vision-language model Florence-2. Thanks for their brilliant work.

Citation

If you find our paper or benchmark helpful for your research, please consider citing our paper and giving this repo a star ⭐. Thank you very much!

@article{li2025lmmrotate,
  title={A Simple Aerial Detection Baseline of Multimodal Language Models},
  author={Li, Qingyun and Chen, Yushi and Shu, Xinya and Chen, Dong and He, Xin and Yu, Yi and Yang, Xue},
  journal={arXiv preprint arXiv:2501.09720},
  year={2025}
}