CSWinTransformer is a vision Transformer that can serve as a general-purpose backbone for computer vision tasks. It computes self-attention within a cross-shaped window, which is highly efficient computationally and attains a global receptive field within two layers of computation. CSWinTransformer also introduces a new positional encoding, LePE (Locally-enhanced Positional Encoding), which further improves model accuracy. Paper
Models | Top1 | Top5 | Reference top1 | Reference top5 | FLOPs (G) | Params (M) |
---|---|---|---|---|---|---|
CSWinTransformer_tiny_224 | 0.8281 | 0.9628 | 0.828 | - | 4.1 | 22 |
CSWinTransformer_small_224 | 0.8358 | 0.9658 | 0.836 | - | 6.4 | 35 |
CSWinTransformer_base_224 | 0.8420 | 0.9692 | 0.842 | - | 14.3 | 77 |
CSWinTransformer_large_224 | 0.8643 | 0.9799 | 0.865 | - | 32.2 | 173.3 |
CSWinTransformer_base_384 | 0.8550 | 0.9749 | 0.855 | - | 42.2 | 77 |
CSWinTransformer_large_384 | 0.8748 | 0.9833 | 0.875 | - | 94.7 | 173.3 |
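For reference, below is a minimal sketch of the cross-shaped window self-attention described above. It is a simplified illustration, not the official implementation: the function names `stripe_attention` and `cswin_attention` are hypothetical, Q/K/V projections and LePE are omitted, and PyTorch is assumed. Half of the heads attend within horizontal stripes and the other half within vertical stripes, so concatenating the two halves gives each token a cross-shaped receptive field in a single layer.

```python
# Simplified sketch of cross-shaped window self-attention (not the official CSWin code).
import torch
import torch.nn.functional as F


def stripe_attention(x, num_heads, stripe_width, vertical):
    """Self-attention restricted to horizontal or vertical stripes.

    x: (B, H, W, C) feature map. Each stripe spans the full width (horizontal)
    or the full height (vertical) and is `stripe_width` rows/columns thick.
    For simplicity Q = K = V = x (no learned projections, no LePE).
    """
    B, H, W, C = x.shape
    if vertical:                      # vertical stripes: transpose so stripes lie along rows
        x = x.transpose(1, 2)
        H, W = W, H
    sw = stripe_width
    head_dim = C // num_heads
    # Partition the map into stripes of shape (sw, W) and flatten each stripe into tokens.
    x = x.reshape(B, H // sw, sw, W, C).reshape(B * (H // sw), sw * W, C)
    q = k = v = x.reshape(-1, sw * W, num_heads, head_dim).transpose(1, 2)
    attn = F.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
    out = (attn @ v).transpose(1, 2).reshape(B, H // sw, sw, W, C)
    out = out.reshape(B, H, W, C)
    if vertical:
        out = out.transpose(1, 2)     # undo the initial transpose
    return out


def cswin_attention(x, num_heads=4, stripe_width=2):
    """Cross-shaped window: split channels/heads in half, run horizontal-stripe
    attention on one half and vertical-stripe attention on the other, then
    concatenate so each token sees a cross-shaped region in one layer."""
    c = x.shape[-1] // 2
    h_part = stripe_attention(x[..., :c], num_heads // 2, stripe_width, vertical=False)
    v_part = stripe_attention(x[..., c:], num_heads // 2, stripe_width, vertical=True)
    return torch.cat([h_part, v_part], dim=-1)


x = torch.randn(1, 8, 8, 64)          # (batch, height, width, channels)
y = cswin_attention(x)
print(y.shape)                        # torch.Size([1, 8, 8, 64])
```

In the full model the stripe width grows with network depth and LePE adds a depthwise-convolution-based positional term on V, neither of which is shown in this sketch.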