Rank | A1: Perception (dev set) | A1: Perception (test set) | A2: Description | A3: Assessment |
---|---|---|---|---|
🥇 | InternLM-XComposer-VL (0.6535) | InternLM-XComposer-VL (0.6435) | InternLM-XComposer-VL (4.21/6) | InternLM-XComposer-VL (0.542,0.581) |
🥈 | LLaVA-v1.5-13B (0.6214) | InstructBLIP-T5-XL (0.6194) | Kosmos-2 (4.03/6) | Qwen-VL (0.475,0.506) |
🥉 | InstructBLIP-T5-XL (0.6147) | Qwen-VL (0.6167) | mPLUG-Owl (3.94/6) | LLaVA-v1.5-13B (0.444,0.473) |
For the partition into the dev and test subsets, please see our dataset release notes. As some models excel under the original testing pipeline while others perform better under PPL-based testing, we maintain two leaderboards, one for each testing method. See the examples for their different settings.
- 14 models tested
- via Multi-Choice Questions
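As a rough illustration of the MCQ setting (the prompt wording and answer-matching rule below are assumptions for illustration, not the official evaluation code): the question is shown together with lettered options, and the model's free-form reply is matched against the ground-truth choice.

```python
# Hedged sketch of an MCQ-style check; prompt format and matching rule are
# illustrative assumptions, not the official pipeline.
def build_mcq_prompt(question: str, options: list[str]) -> str:
    letters = "ABCD"
    lines = [question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the option's letter from the given choices directly.")
    return "\n".join(lines)

def is_correct(model_reply: str, options: list[str], gt_index: int) -> bool:
    reply = model_reply.strip().lower()
    # Accept either the bare letter (e.g. "b") or the full option text.
    return reply.startswith("abcd"[gt_index]) or options[gt_index].lower() in reply

prompt = build_mcq_prompt("How is the clarity of this image?", ["High", "Medium", "Low"])
print(is_correct("B. Medium", ["High", "Medium", "Low"], gt_index=1))  # True
```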
Model Name | yes-or-no | what | how | distortion | others | in-context distortion | in-context others | overall |
---|---|---|---|---|---|---|---|---|
random guess | 0.5000 | 0.2786 | 0.3331 | 0.3789 | 0.3848 | 0.3828 | 0.3582 | 0.3780 |
LLaVA-v1.5 (Vicuna-v1.5-7B) | 0.6636 | 0.5819 | 0.5051 | 0.4942 | 0.6574 | 0.5461 | 0.7061 | 0.5866 |
LLaVA-v1.5 (Vicuna-v1.5-13B) | 0.6527 | 0.6438 | 0.5659 | 0.5603 | 0.6713 | 0.6118 | 0.6735 | 0.6214 |
InternLM-XComposer-VL (InternLM) | 0.6945 | 0.6527 | 0.6085 | 0.6167 | 0.7014 | 0.5691 | 0.7510 | 0.6535 |
IDEFICS-Instruct (LLaMA-7B) | 0.5618 | 0.4469 | 0.4402 | 0.4280 | 0.5417 | 0.4474 | 0.5633 | 0.4870 |
Qwen-VL (QwenLM) | 0.6309 | 0.5819 | 0.5639 | 0.5058 | 0.6273 | 0.5789 | 0.7388 | 0.5940 |
Shikra (Vicuna-7B) | 0.6564 | 0.4735 | 0.4909 | 0.4883 | 0.5949 | 0.5000 | 0.6408 | 0.5465 |
Otter-v1 (MPT-7B) | 0.5709 | 0.4071 | 0.3955 | 0.4222 | 0.4931 | 0.4408 | 0.5265 | 0.4635 |
InstructBLIP (Flan-T5-XL) | 0.6764 | 0.5996 | 0.5598 | 0.5623 | 0.6551 | 0.5822 | 0.6939 | 0.6147 |
InstructBLIP (Vicuna-7B) | 0.7164 | 0.5265 | 0.4381 | 0.4864 | 0.6250 | 0.5559 | 0.6490 | 0.5672 |
VisualGLM-6B (GLM-6B) | 0.6018 | 0.5420 | 0.4625 | 0.5175 | 0.5440 | 0.5362 | 0.5714 | 0.5378 |
mPLUG-Owl (LLaMA-7B) | 0.6600 | 0.5487 | 0.4402 | 0.5136 | 0.5509 | 0.5428 | 0.6571 | 0.5538 |
LLaMA-Adapter-V2 | 0.6618 | 0.5929 | 0.5213 | 0.5739 | 0.5625 | 0.6316 | 0.6490 | 0.5946 |
LLaVA-v1 (Vicuna-13B) | 0.5400 | 0.5310 | 0.5538 | 0.4864 | 0.5463 | 0.5559 | 0.6327 | 0.5418 |
MiniGPT-4 (Vicuna-13B) | 0.5582 | 0.5022 | 0.4037 | 0.4202 | 0.4838 | 0.5197 | 0.6122 | 0.4903 |
Results of GPT-4V and non-expert humans:
Participant Name | yes-or-no | what | how | distortion | others | in-context distortion | in-context others | overall |
---|---|---|---|---|---|---|---|---|
GPT-4V (Closed-Source Model) | 0.7792 | 0.7918 | 0.6268 | 0.7058 | 0.7303 | 0.7466 | 0.7795 | 0.7336 |
Junior-level Human | 0.8248 | 0.7939 | 0.6029 | 0.7562 | 0.7208 | 0.7637 | 0.7300 | 0.7431 |
Senior-level Human | 0.8431 | 0.8894 | 0.7202 | 0.7965 | 0.7947 | 0.8390 | 0.8707 | 0.8174 |
GPT-4V performs roughly on par with a junior-level human.
Results of Open-source models:
Model Name | yes-or-no | what | how | distortion | others | in-context distortion | in-context others | overall |
---|---|---|---|---|---|---|---|---|
random guess | 0.5000 | 0.2848 | 0.3330 | 0.3724 | 0.3850 | 0.3913 | 0.3710 | 0.3794 |
LLaVA-v1.5 (Vicuna-v1.5-7B) | 0.6460 | 0.5922 | 0.5576 | 0.4798 | 0.6730 | 0.5890 | 0.7376 | 0.6007 |
LLaVA-v1.5 (Vicuna-v1.5-13B) | 0.6496 | 0.6486 | 0.5412 | 0.5355 | 0.6659 | 0.5890 | 0.7148 | 0.6140 |
InternLM-XComposer-VL (InternLM) | 0.6843 | 0.6204 | 0.6193 | 0.5681 | 0.7041 | 0.5753 | 0.7719 | 0.6435 |
IDEFICS-Instruct (LLaMA-7B) | 0.6004 | 0.4642 | 0.4671 | 0.4038 | 0.5990 | 0.4726 | 0.6477 | 0.5151 |
Qwen-VL (QwenLM) | 0.6533 | 0.6074 | 0.5844 | 0.5413 | 0.6635 | 0.5822 | 0.7300 | 0.6167 |
Shikra (Vicuna-7B) | 0.6909 | 0.4793 | 0.4671 | 0.4731 | 0.6086 | 0.5308 | 0.6477 | 0.5532 |
Otter-v1 (MPT-7B) | 0.5766 | 0.3970 | 0.4259 | 0.4212 | 0.4893 | 0.4760 | 0.5417 | 0.4722 |
InstructBLIP (Flan-T5-XL) | 0.6953 | 0.5900 | 0.5617 | 0.5731 | 0.6551 | 0.5651 | 0.7121 | 0.6194 |
InstructBLIP (Vicuna-7B) | 0.7099 | 0.5141 | 0.4300 | 0.4500 | 0.6301 | 0.5719 | 0.6439 | 0.5585 |
VisualGLM-6B (GLM-6B) | 0.6131 | 0.5358 | 0.4403 | 0.4856 | 0.5489 | 0.5548 | 0.5779 | 0.5331 |
mPLUG-Owl (LLaMA-7B) | 0.7245 | 0.5488 | 0.4753 | 0.4962 | 0.6301 | 0.6267 | 0.6667 | 0.5893 |
LLaMA-Adapter-V2 | 0.6618 | 0.5466 | 0.5165 | 0.5615 | 0.6181 | 0.5925 | 0.5455 | 0.5806 |
LLaVA-v1 (Vicuna-13B) | 0.5712 | 0.5488 | 0.5185 | 0.4558 | 0.5800 | 0.5719 | 0.6477 | 0.5472 |
MiniGPT-4 (Vicuna-13B) | 0.6077 | 0.5033 | 0.4300 | 0.4558 | 0.5251 | 0.5342 | 0.6098 | 0.5177 |
- 11 models tested
- via Losses of Different Answers
- non-finalized work-in-progress version; results may be updated

No options are provided in the prompts!
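A minimal sketch of the PPL-based protocol under stated assumptions: each candidate answer is appended to the question, the language-modeling loss is computed over the answer tokens only, and the candidate with the lowest loss is taken as the prediction. A text-only HuggingFace causal LM ("gpt2") stands in for the MLLM here purely to keep the snippet self-contained; the real pipeline also conditions on the image.

```python
# Illustrative sketch only: rank candidate answers by their LM loss.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder text-only LM
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def answer_loss(question: str, answer: str) -> float:
    prompt_ids = tok(question, return_tensors="pt").input_ids
    full_ids = tok(question + " " + answer, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100             # score only the answer tokens
    with torch.no_grad():
        loss = model(full_ids, labels=labels).loss
    return loss.item()

question = "How is the clarity of this image? Answer:"
candidates = ["High", "Medium", "Low"]
losses = {c: answer_loss(question, c) for c in candidates}
prediction = min(losses, key=losses.get)                 # lowest loss = most likely answer
```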
Model Name | yes-or-no | what | how | distortion | others | in-context distortion | in-context others | overall |
---|---|---|---|---|---|---|---|---|
idefics | 0.6109 | 0.5332 | 0.4422 | 0.4942 | 0.5625 | 0.4704 | 0.6327 | 0.5318 |
instructblip_t5 | 0.6200 | 0.4425 | 0.3996 | 0.4280 | 0.5347 | 0.4737 | 0.5837 | 0.4936 |
instructblip_vicuna | 0.5964 | 0.4137 | 0.4158 | 0.4689 | 0.4699 | 0.4704 | 0.5429 | 0.4816 |
kosmos_2 | 0.5800 | 0.3562 | 0.3915 | 0.3969 | 0.4606 | 0.4408 | 0.5551 | 0.4502 |
llama_adapter_v2 | 0.5691 | 0.3208 | 0.4057 | 0.3852 | 0.4491 | 0.4671 | 0.5061 | 0.4401 |
llava_v1.5 | 0.6764 | 0.4071 | 0.3469 | 0.4280 | 0.5347 | 0.4803 | 0.5306 | 0.4863 |
llava_v1 | 0.5945 | 0.4071 | 0.3671 | 0.3872 | 0.5116 | 0.4671 | 0.5306 | 0.4629 |
minigpt4_13b | 0.5509 | 0.4248 | 0.3347 | 0.4047 | 0.4722 | 0.4112 | 0.5020 | 0.4415 |
mplug_owl | 0.7909 | 0.4027 | 0.3793 | 0.4981 | 0.5602 | 0.5757 | 0.5347 | 0.5378 |
otter_v1 | 0.6782 | 0.4248 | 0.4462 | 0.4514 | 0.5833 | 0.5164 | 0.5878 | 0.5251 |
shikra | 0.6655 | 0.4690 | 0.5030 | 0.4669 | 0.6042 | 0.5230 | 0.6776 | 0.5525 (rank 1) |
Model Name | yes-or-no | what | how | distortion | others | in-context distortion | in-context others | overall |
---|---|---|---|---|---|---|---|---|
idefics | 0.6752 | 0.5163 | 0.4280 | 0.4712 | 0.6396 | 0.4726 | 0.6250 | 0.5458 |
instructblip_t5 | 0.6661 | 0.4707 | 0.3971 | 0.4173 | 0.6181 | 0.4486 | 0.6364 | 0.5184 |
instructblip_vicuna | 0.6843 | 0.4469 | 0.3827 | 0.4981 | 0.5060 | 0.4726 | 0.5985 | 0.5130 |
kosmos_2 | 0.6496 | 0.3861 | 0.4239 | 0.4038 | 0.5585 | 0.4658 | 0.6061 | 0.4950 |
llama_adapter_v2 | 0.6551 | 0.3536 | 0.4012 | 0.4154 | 0.4964 | 0.4829 | 0.5758 | 0.4796 |
llava_v1.5 | 0.7500 | 0.4685 | 0.3519 | 0.4453 | 0.5776 | 0.5171 | 0.6578 | 0.5338 |
llava_v1 | 0.6642 | 0.4447 | 0.3951 | 0.4096 | 0.5847 | 0.4726 | 0.6250 | 0.5090 |
minigpt4_13b | 0.5730 | 0.4577 | 0.3580 | 0.4096 | 0.4988 | 0.4281 | 0.5758 | 0.4676 |
mplug_owl | 0.8449 | 0.4013 | 0.3951 | 0.4981 | 0.5752 | 0.6027 | 0.6212 | 0.5619 (rank 1) |
otter_v1 | 0.6971 | 0.4382 | 0.4568 | 0.4288 | 0.6372 | 0.4897 | 0.6553 | 0.5391 |
shikra | 0.6515 | 0.4729 | 0.5021 | 0.4269 | 0.6205 | 0.5034 | 0.7197 | 0.5478 |
Abbreviations for dimensions: comp: completeness, prec: precision, rele: relevance. For each dimension, p_0, p_1, and p_2 denote the proportions of outputs rated 0, 1, and 2, respectively.
Model Name | p_{0, comp} | p_{1, comp} | p_{2, comp} | s_{comp} | p_{0, prec} | p_{1, prec} | p_{2, prec} | s_{prec} | p_{0, rele} | p_{1, rele} | p_{2, rele} | s_{rele} | s_{sum} |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaVA-v1.5 (Vicuna-v1.5-13B) | 27.68% | 53.78% | 18.55% | 0.91/2.00 | 25.45% | 21.47% | 53.08% | 1.28/2.00 | 6.31% | 58.75% | 34.94% | 1.29/2.00 | 3.47/6.00 |
InternLM-XComposer-VL (InternLM) | 19.94% | 51.82% | 28.24% | 1.08/2.00 | 22.59% | 28.99% | 48.42% | 1.26/2.00 | 1.05% | 10.62% | 88.32% | 1.87/2.00 | 4.21/6.00 |
IDEFICS-Instruct (LLaMA-7B) | 28.91% | 59.16% | 11.93% | 0.83/2.00 | 34.68% | 27.86% | 37.46% | 1.03/2.00 | 3.90% | 59.66% | 36.44% | 1.33/2.00 | 3.18/6.00 |
Qwen-VL (QwenLM) | 26.34% | 49.13% | 24.53% | 0.98/2.00 | 50.62% | 23.44% | 25.94% | 0.75/2.00 | 0.73% | 35.56% | 63.72% | 1.63/2.00 | 3.36/6.00 |
Shikra (Vicuna-7B) | 21.14% | 68.33% | 10.52% | 0.89/2.00 | 30.33% | 28.30% | 41.37% | 1.11/2.00 | 1.14% | 64.36% | 34.50% | 1.33/2.00 | 3.34/6.00 |
Otter-v1 (MPT-7B) | 22.38% | 59.36% | 18.25% | 0.96/2.00 | 40.68% | 35.99% | 23.33% | 0.83/2.00 | 1.95% | 13.2% | 84.85% | 1.83/2.00 | 3.61/6.00 |
Kosmos-2 | 8.76% | 70.91% | 20.33% | 1.12/2.00 | 29.45% | 34.75% | 35.81% | 1.06/2.00 | 0.16% | 14.77% | 85.06% | 1.85/2.00 | 4.03/6.00 |
InstructBLIP (Flan-T5-XL) | 23.16% | 66.44% | 10.40% | 0.87/2.00 | 34.85% | 26.03% | 39.12% | 1.04/2.00 | 14.71% | 59.87% | 25.42% | 1.11/2.00 | 3.02/6.00 |
InstructBLIP (Vicuna-7B) | 29.73% | 61.47% | 8.80% | 0.79/2.00 | 27.84% | 23.52% | 48.65% | 1.21/2.00 | 27.40% | 61.29% | 11.31% | 0.84/2.00 | 2.84/6.00 |
VisualGLM-6B (GLM-6B) | 30.75% | 56.64% | 12.61% | 0.82/2.00 | 38.64% | 26.18% | 35.18% | 0.97/2.00 | 6.14% | 67.15% | 26.71% | 1.21/2.00 | 2.99/6.00 |
mPLUG-Owl (LLaMA-7B) | 28.28% | 37.69% | 34.03% | 1.06/2.00 | 26.75% | 18.18% | 55.07% | 1.28/2.00 | 3.03% | 33.82% | 63.15% | 1.6/2.00 | 3.94/6.00 |
LLaMA-Adapter-V2 | 30.44% | 53.99% | 15.57% | 0.85/2.00 | 29.41% | 25.79% | 44.8% | 1.15/2.00 | 1.50% | 52.75% | 45.75% | 1.44/2.00 | 3.45/6.00 |
LLaVA-v1 (Vicuna-13B) | 34.10% | 40.52% | 25.39% | 0.91/2.00 | 30.02% | 15.15% | 54.83% | 1.25/2.00 | 1.06% | 38.03% | 60.91% | 1.6/2.00 | 3.76/6.00 |
MiniGPT-4 (Vicuna-13B) | 34.01% | 32.15% | 33.85% | 1.00/2.00 | 29.20% | 15.27% | 55.53% | 1.26/2.00 | 6.88% | 45.65% | 47.48% | 1.41/2.00 | 3.67/6.00 |
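The figures in the table are consistent with each per-dimension score being the expected rating, s = 0·p_0 + 1·p_1 + 2·p_2 (out of 2), with s_sum adding the three dimensions (out of 6). A quick sanity check against the InternLM-XComposer-VL row:

```python
# Recompute the InternLM-XComposer-VL scores from its p_i values in the table.
def dim_score(p0: float, p1: float, p2: float) -> float:
    return 0 * p0 + 1 * p1 + 2 * p2           # expected rating, out of 2

s_comp = dim_score(0.1994, 0.5182, 0.2824)    # ~1.08
s_prec = dim_score(0.2259, 0.2899, 0.4842)    # ~1.26
s_rele = dim_score(0.0105, 0.1062, 0.8832)    # ~1.87
s_sum = s_comp + s_prec + s_rele              # ~4.21, matching the leaderboard
```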
The datasets can be found here.
See IQA_outputs/eval.ipynb for our ablation experiments.
Model Name | KoNIQ-10k | SPAQ | LIVE-FB | LIVE-itw | CGIQA-6K | AGIQA-3K | KADID-10K | Average |
---|---|---|---|---|---|---|---|---|
NIQE | 0.316/0.377 | 0.693/0.669 | 0.211/0.288 | 0.480/0.451 | 0.075/0.056 | 0.562/0.517 | 0.374/0.428 | 0.387/0.398 |
CLIP-ViT-Large-14 | 0.468/0.505 | 0.385/0.389 | 0.218/0.237 | 0.307/0.308 | 0.285/0.290 | 0.436/0.458 | 0.376/0.388 | 0.354/0.368 |
LLaVA-v1.5 (Vicuna-v1.5-7B) | 0.463/0.459 | 0.443/0.467 | 0.305/0.321 | 0.344/0.358 | 0.321/0.333 | 0.672/0.738 | 0.417/0.440 | 0.424/0.445 |
LLaVA-v1.5 (Vicuna-v1.5-13B) | 0.448/0.460 | 0.563/0.584 | 0.310/0.339 | 0.445/0.481 | 0.285/0.297 | 0.664/0.754 | 0.390/0.400 | 0.444/0.474 |
InternLM-XComposer-VL (InternLM) | 0.568/0.616 | 0.731/0.751 | 0.358/0.413 | 0.619/0.678 | 0.246/0.268 | 0.734/0.777 | 0.540/0.563 | 0.542/0.581 |
IDEFICS-Instruct (LLaMA-7B) | 0.375/0.400 | 0.474/0.484 | 0.235/0.24 | 0.409/0.428 | 0.244/0.227 | 0.562/0.622 | 0.370/0.373 | 0.381/0.396 |
Qwen-VL (QwenLM) | 0.470/0.546 | 0.676/0.669 | 0.298/0.338 | 0.504/0.532 | 0.273/0.284 | 0.617/0.686 | 0.486/0.486 | 0.475/0.506 |
Shikra (Vicuna-7B) | 0.314/0.307 | 0.32/0.337 | 0.237/0.241 | 0.322/0.336 | 0.198/0.201 | 0.640/0.661 | 0.324/0.332 | 0.336/0.345 |
Otter-v1 (MPT-7B) | 0.406/0.406 | 0.436/0.441 | 0.143/0.142 | -0.008/0.018 | 0.254/0.264 | 0.475/0.481 | 0.557/0.577 | 0.323/0.333 |
Kosmos-2 | 0.255/0.281 | 0.644/0.641 | 0.196/0.195 | 0.358/0.368 | 0.210/0.225 | 0.489/0.491 | 0.359/0.365 | 0.359/0.367 |
InstructBLIP (Flan-T5-XL) | 0.334/0.362 | 0.582/0.599 | 0.248/0.267 | 0.113/0.113 | 0.167/0.188 | 0.378/0.400 | 0.211/0.179 | 0.290/0.301 |
InstructBLIP (Vicuna-7B) | 0.359/0.437 | 0.683/0.689 | 0.200/0.283 | 0.253/0.367 | 0.263/0.304 | 0.629/0.663 | 0.337/0.382 | 0.389/0.446 |
VisualGLM-6B (GLM-6B) | 0.247/0.234 | 0.498/0.507 | 0.146/0.154 | 0.110/0.116 | 0.209/0.183 | 0.342/0.349 | 0.127/0.131 | 0.240/0.239 |
mPLUG-Owl (LLaMA-7B) | 0.409/0.427 | 0.634/0.644 | 0.241/0.271 | 0.437/0.487 | 0.148/0.180 | 0.687/0.711 | 0.466/0.486 | 0.432/0.458 |
LLaMA-Adapter-V2 | 0.354/0.363 | 0.464/0.506 | 0.275/0.329 | 0.298/0.360 | 0.257/0.271 | 0.604/0.666 | 0.412/0.425 | 0.381/0.417 |
LLaVA-v1 (Vicuna-13B) | 0.462/0.457 | 0.442/0.462 | 0.264/0.280 | 0.404/0.417 | 0.208/0.237 | 0.626/0.684 | 0.349/0.372 | 0.394/0.416 |
MiniGPT-4 (Vicuna-13B) | 0.239/0.257 | 0.238/0.253 | 0.170/0.183 | 0.339/0.340 | 0.252/0.246 | 0.572/0.591 | 0.239/0.233 | 0.293/0.300 |
Overall, internlm_xcomposer_vl has the best IQA performance among these models, with 6 wins among the 7 datasets (as of 30th Oct). qwen-vl and llava-v1.5 are strong runner-ups.
We release the results of these models (as well as the post-evaluation code) in IQA_results for reference.
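The paired numbers in the table above are SRCC/PLCC correlations between the model-derived quality scores and the human mean opinion scores (MOS) of each dataset. Below is a minimal sketch of how such a pair can be computed with scipy, assuming per-image scalar predictions are already available (the official post-evaluation code is in IQA_outputs/eval.ipynb); the arrays are hypothetical.

```python
# Minimal sketch: SRCC (Spearman) and PLCC (Pearson) between predicted
# quality scores and MOS for one dataset; the values here are hypothetical.
from scipy.stats import pearsonr, spearmanr

predicted_scores = [0.71, 0.42, 0.55, 0.90, 0.18]   # hypothetical model outputs
mos_labels = [72.1, 40.3, 58.9, 88.0, 25.4]         # hypothetical MOS values

srcc = spearmanr(predicted_scores, mos_labels).correlation
plcc = pearsonr(predicted_scores, mos_labels)[0]
print(f"SRCC/PLCC = {srcc:.3f}/{plcc:.3f}")
```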
Please contact any of the first authors of this paper for queries:

- Haoning Wu, haoning001@e.ntu.edu.sg, @teowu
- Zicheng Zhang, zzc1998@sjtu.edu.cn, @zzc-1998
- Erli Zhang, ezhang005@e.ntu.edu.sg, @ZhangErliCarl
If you find our work interesting, please feel free to cite our paper:
@article{wu2023qbench,
title={Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision},
author={Wu, Haoning and Zhang, Zicheng and Zhang, Erli and Chen, Chaofeng and Liao, Liang and Wang, Annan and Li, Chunyi and Sun, Wenxiu and Yan, Qiong and Zhai, Guangtao and Lin, Weisi},
year={2023},
}