MobileViT is a lightweight vision Transformer network that can be used as a general-purpose backbone in computer vision. It combines the strengths of CNNs and Transformers, so it handles both local and global features well and mitigates the lack of inductive bias in pure Transformer models. As a result, at a comparable parameter count, it achieves higher accuracy than other SOTA models on image classification, object detection, and semantic segmentation.

Models | Top1 | Top5 | Reference top1 | Reference top5 | FLOPs (M) | Params (M) |
---|---|---|---|---|---|---|
MobileViT_XXS | 0.6867 | 0.8878 | 0.690 | - | 337.24 | 1.28 |
MobileViT_XS | 0.7454 | 0.9227 | 0.747 | - | 930.75 | 2.33 |
MobileViT_S | 0.7814 | 0.9413 | 0.783 | - | 1849.35 | 5.59 |
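
To make the comparison with the reference results easier to read, the following minimal sketch computes how far each reproduced Top1 value falls below the reference Top1, in percentage points. The numbers are copied from the table above; the dictionary `top1` and the helper `top1_gap_points` are hypothetical names introduced only for this illustration and are not part of any library.

```python
# Top1 accuracies copied from the table above: (reproduced, reference).
top1 = {
    "MobileViT_XXS": (0.6867, 0.690),
    "MobileViT_XS": (0.7454, 0.747),
    "MobileViT_S": (0.7814, 0.783),
}


def top1_gap_points(reproduced, reference):
    """Return reference minus reproduced accuracy, in percentage points."""
    return round((reference - reproduced) * 100, 2)


# Hypothetical helper applied to every model in the table.
gaps = {name: top1_gap_points(*values) for name, values in top1.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} points below the reference Top1")
```

With these values, every reproduced model is within about a third of a percentage point of its reference Top1.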