[Docs] Update README in configs according to OpenMMLab standard. #672

Merged: 3 commits, Jan 26, 2022

29 changes: 14 additions & 15 deletions configs/conformer/README.md
@@ -1,28 +1,16 @@
# Conformer: Local Features Coupling Global Representations for Visual Recognition
<!-- {Conformer} -->
# Conformer

> [Conformer: Local Features Coupling Global Representations for Visual Recognition](https://arxiv.org/abs/2105.03889)
<!-- [ALGORITHM] -->

## Abstract

<!-- [ABSTRACT] -->
Within Convolutional Neural Network (CNN), the convolution operations are good at extracting local features but experience difficulty to capture global representations. Within visual transformer, the cascaded self-attention modules can capture long-distance feature dependencies but unfortunately deteriorate local feature details. In this paper, we propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning. Conformer roots in the Feature Coupling Unit (FCU), which fuses local features and global representations under different resolutions in an interactive fashion. Conformer adopts a concurrent structure so that local features and global representations are retained to the maximum extent. Experiments show that Conformer, under the comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet. On MSCOCO, it outperforms ResNet-101 by 3.7% and 3.6% mAPs for object detection and instance segmentation, respectively, demonstrating the great potential to be a general backbone network.

<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/144957687-926390ed-6119-4e4c-beaa-9bc0017fe953.png" width="90%"/>
</div>
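
The Feature Coupling Unit is the key piece here: at each stage it pools the CNN feature map down to the transformer's patch grid and, in the other direction, folds the patch tokens back up to the feature-map resolution. The sketch below only illustrates that interaction under assumed shapes; the module name, 1x1 projections, and simple additive fusion are assumptions, not the mmcls implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureCouplingUnit(nn.Module):
    """Illustrative FCU-style fusion between a CNN feature map (B, C, H, W)
    and transformer patch tokens (B, N, D), assuming N == patch_hw ** 2."""

    def __init__(self, cnn_channels: int, embed_dim: int, patch_hw: int):
        super().__init__()
        self.patch_hw = patch_hw
        self.cnn_to_token = nn.Conv2d(cnn_channels, embed_dim, kernel_size=1)
        self.token_to_cnn = nn.Conv2d(embed_dim, cnn_channels, kernel_size=1)

    def forward(self, cnn_feat, tokens):
        b, n, d = tokens.shape
        # Local -> global: project channels, pool to the token grid, flatten to tokens.
        t = self.cnn_to_token(cnn_feat)
        t = F.adaptive_avg_pool2d(t, self.patch_hw).flatten(2).transpose(1, 2)
        tokens = tokens + t
        # Global -> local: fold tokens back into a grid, project, upsample, add.
        g = tokens.transpose(1, 2).reshape(b, d, self.patch_hw, self.patch_hw)
        g = F.interpolate(self.token_to_cnn(g), size=cnn_feat.shape[-2:],
                          mode="bilinear", align_corners=False)
        return cnn_feat + g, tokens

fcu = FeatureCouplingUnit(cnn_channels=256, embed_dim=384, patch_hw=14)
feat, tok = fcu(torch.randn(2, 256, 56, 56), torch.randn(2, 14 * 14, 384))
print(feat.shape, tok.shape)  # torch.Size([2, 256, 56, 56]) torch.Size([2, 196, 384])
```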

## Citation

```latex
@article{peng2021conformer,
title={Conformer: Local Features Coupling Global Representations for Visual Recognition},
author={Zhiliang Peng and Wei Huang and Shanzhi Gu and Lingxi Xie and Yaowei Wang and Jianbin Jiao and Qixiang Ye},
journal={arXiv preprint arXiv:2105.03889},
year={2021},
}
```

## Results and models

### ImageNet-1k
@@ -35,3 +23,14 @@ Within Convolutional Neural Network (CNN), the convolution operations are good a
| Conformer-base-p16\* | 83.29 | 22.89 | 83.82 | 96.59 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/conformer/conformer-base-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/conformer/conformer-base-p16_3rdparty_8xb128_in1k_20211206-bfdf8637.pth) |

*Models with \* are converted from the [official repo](https://github.com/pengzhiliang/Conformer). The config files of these models are only for validation. We don't ensure these config files' training accuracy and welcome you to contribute your reproduction results.*
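
For a quick sanity check of a converted checkpoint, the config and weight URL from the table can be fed straight into the MMClassification inference helpers. This is only a sketch assuming the mmcls 0.x Python API (`mmcls.apis.init_model` / `inference_model`) and a locally available test image; the result field names may differ across versions.

```python
from mmcls.apis import inference_model, init_model

config = 'configs/conformer/conformer-base-p16_8xb128_in1k.py'
checkpoint = ('https://download.openmmlab.com/mmclassification/v0/conformer/'
              'conformer-base-p16_3rdparty_8xb128_in1k_20211206-bfdf8637.pth')

model = init_model(config, checkpoint, device='cpu')  # weights are downloaded on first use
result = inference_model(model, 'demo/demo.JPEG')     # replace with any local test image
print(result['pred_class'], result['pred_score'])
```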

## Citation

```
@article{peng2021conformer,
title={Conformer: Local Features Coupling Global Representations for Visual Recognition},
author={Zhiliang Peng and Wei Huang and Shanzhi Gu and Lingxi Xie and Yaowei Wang and Jianbin Jiao and Qixiang Ye},
journal={arXiv preprint arXiv:2105.03889},
year={2021},
}
```
6 changes: 3 additions & 3 deletions configs/conformer/metafile.yml
@@ -10,9 +10,9 @@ Collections:
URL: https://arxiv.org/abs/2105.03889
Title: "Conformer: Local Features Coupling Global Representations for Visual Recognition"
README: configs/conformer/README.md
# Code:
# URL: # todo
# Version: # todo
Code:
URL: https://github.com/open-mmlab/mmclassification/blob/v0.19.0/mmcls/models/backbones/conformer.py
Version: v0.19.0

Models:
- Name: conformer-tiny-p16_3rdparty_8xb128_in1k
34 changes: 17 additions & 17 deletions configs/deit/README.md
@@ -1,30 +1,16 @@
# Training data-efficient image transformers & distillation through attention
<!-- {DeiT} -->
# DeiT

> [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877)
<!-- [ALGORITHM] -->

## Abstract

<!-- [ABSTRACT] -->
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.

<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/143225703-c287c29e-82c9-4c85-a366-dfae30d198cd.png" width="40%"/>
</div>
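
The distillation token mentioned above gives the student a second output head: the class token is trained against the ground-truth label, while the distillation token is trained against the teacher's decision. Below is a minimal sketch of the hard-distillation objective; the function name and equal loss weighting are assumptions, not the released DeiT or mmcls code.

```python
import torch
import torch.nn.functional as F

def deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, target):
    """Class-token head learns from labels; distillation-token head learns
    from the teacher's hard (argmax) predictions."""
    loss_cls = F.cross_entropy(cls_logits, target)
    teacher_label = teacher_logits.argmax(dim=1)
    loss_dist = F.cross_entropy(dist_logits, teacher_label)
    return 0.5 * (loss_cls + loss_dist)

logits_cls = torch.randn(8, 1000)
logits_dist = torch.randn(8, 1000)
teacher_logits = torch.randn(8, 1000)   # e.g. from a RegNetY-16GF teacher
labels = torch.randint(0, 1000, (8,))
print(deit_hard_distillation_loss(logits_cls, logits_dist, teacher_logits, labels))
```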

## Citation
```{latex}
@InProceedings{pmlr-v139-touvron21a,
title = {Training data-efficient image transformers &amp; distillation through attention},
author = {Touvron, Hugo and Cord, Matthieu and Douze, Matthijs and Massa, Francisco and Sablayrolles, Alexandre and Jegou, Herve},
booktitle = {International Conference on Machine Learning},
pages = {10347--10357},
year = {2021},
volume = {139},
month = {July}
}
```

## Results and models

### ImageNet-1k
@@ -48,3 +34,17 @@ The teacher of the distilled version DeiT is RegNetY-16GF.
MMClassification doesn't support training the distilled version of DeiT; the distilled checkpoints are provided for inference only.
```

## Citation

```
@InProceedings{pmlr-v139-touvron21a,
title = {Training data-efficient image transformers &amp; distillation through attention},
author = {Touvron, Hugo and Cord, Matthieu and Douze, Matthijs and Massa, Francisco and Sablayrolles, Alexandre and Jegou, Herve},
booktitle = {International Conference on Machine Learning},
pages = {10347--10357},
year = {2021},
volume = {139},
month = {July}
}
```
3 changes: 3 additions & 0 deletions configs/deit/metafile.yml
@@ -11,6 +11,9 @@ Collections:
URL: https://arxiv.org/abs/2012.12877
Title: "Training data-efficient image transformers & distillation through attention"
README: configs/deit/README.md
Code:
URL: https://github.com/open-mmlab/mmclassification/blob/v0.19.0/mmcls/models/backbones/deit.py
Version: v0.19.0

Models:
- Name: deit-tiny_3rdparty_pt-4xb256_in1k
33 changes: 16 additions & 17 deletions configs/efficientnet/README.md
@@ -1,30 +1,16 @@
# Rethinking Model Scaling for Convolutional Neural Networks
<!-- {EfficientNet} -->
# EfficientNet

> [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946v5)
<!-- [ALGORITHM] -->

## Abstract

<!-- [ABSTRACT] -->
Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet. To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters.

<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/150078232-d28c91fc-d0e8-43e3-9d20-b5162f0fb463.png" width="60%"/>
</div>
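
The compound scaling rule can be made concrete with a few lines of arithmetic: depth, width, and resolution grow as alpha^phi, beta^phi, and gamma^phi with alpha * beta^2 * gamma^2 ≈ 2, so each increment of phi roughly doubles the FLOPs. The coefficients below are those reported for the B0 baseline; the released B1-B7 models round the resulting values, so this sketch only approximates the actual configurations.

```python
alpha, beta, gamma = 1.2, 1.1, 1.15  # depth / width / resolution coefficients from the paper

def compound_scale(phi: int, base_resolution: int = 224):
    """Return the depth multiplier, width multiplier, and input resolution for a given phi."""
    depth_mult = alpha ** phi
    width_mult = beta ** phi
    resolution = int(round(base_resolution * gamma ** phi))
    return depth_mult, width_mult, resolution

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution {r}")
```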

## Citation

```latex
@inproceedings{tan2019efficientnet,
title={Efficientnet: Rethinking model scaling for convolutional neural networks},
author={Tan, Mingxing and Le, Quoc},
booktitle={International Conference on Machine Learning},
pages={6105--6114},
year={2019},
organization={PMLR}
}
```

## Results and models

### ImageNet-1k
@@ -60,3 +46,16 @@ Note: In MMClassification, we support training with AutoAugment, don't support A
| EfficientNet-B8 (AA + AdvProp)\* | 87.41 | 1.09 | 85.38 | 97.28 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/efficientnet/efficientnet-b8_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b8_3rdparty_8xb32-aa-advprop_in1k_20220119-297ce1b7.pth) |

*Models with \* are converted from the [official repo](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet). The config files of these models are only for inference. We don't ensure these config files' training accuracy and welcome you to contribute your reproduction results.*

## Citation

```
@inproceedings{tan2019efficientnet,
title={Efficientnet: Rethinking model scaling for convolutional neural networks},
author={Tan, Mingxing and Le, Quoc},
booktitle={International Conference on Machine Learning},
pages={6105--6114},
year={2019},
organization={PMLR}
}
```
10 changes: 5 additions & 5 deletions configs/lenet/README.md
@@ -1,19 +1,19 @@
# Backpropagation Applied to Handwritten Zip Code Recognition
<!-- {LeNet} -->
# LeNet

> [Backpropagation Applied to Handwritten Zip Code Recognition](https://ieeexplore.ieee.org/document/6795724)
<!-- [ALGORITHM] -->

## Abstract

<!-- [ABSTRACT] -->
The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.

<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/142561080-cd1c4bdc-8739-46ca-bc32-76d462a32901.png" width="50%"/>
</div>
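
For reference, a LeNet-style network is only a handful of layers. The sketch below is a common modern rendition in PyTorch (a 32x32 grayscale input, two conv/pool stages, and three fully connected layers); it is not the exact 1989 zip-code architecture and not the mmcls config.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """LeNet-style CNN for 32x32 single-channel digit images."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```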

## Citation
```latex

```
@ARTICLE{6795724,
author={Y. {LeCun} and B. {Boser} and J. S. {Denker} and D. {Henderson} and R. E. {Howard} and W. {Hubbard} and L. D. {Jackel}},
journal={Neural Computation},
33 changes: 17 additions & 16 deletions configs/mlp_mixer/README.md
@@ -1,28 +1,16 @@
# MLP-Mixer: An all-MLP Architecture for Vision
<!-- {Mlp-Mixer} -->
# Mlp-Mixer

> [MLP-Mixer: An all-MLP Architecture for Vision](https://arxiv.org/abs/2105.01601)
<!-- [ALGORITHM] -->

## Abstract
<!-- [ABSTRACT] -->

Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers.

<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/143178327-7118b48a-5f5f-4844-a614-a571917384ca.png" width="90%"/>
</div>
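
One Mixer layer applies the two MLPs described above: a token-mixing MLP across the patch axis, then a channel-mixing MLP per patch, each behind a LayerNorm with a residual connection. The hidden sizes and names below are illustrative, not the mmcls implementation.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, num_patches: int, dim: int, token_hidden: int, channel_hidden: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                        # x: (B, num_patches, dim)
        y = self.norm1(x).transpose(1, 2)        # mix information across patches
        x = x + self.token_mlp(y).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))  # mix information across channels
        return x

block = MixerBlock(num_patches=196, dim=768, token_hidden=384, channel_hidden=3072)
print(block(torch.randn(2, 196, 768)).shape)  # torch.Size([2, 196, 768])
```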

## Citation
```latex
@misc{tolstikhin2021mlpmixer,
title={MLP-Mixer: An all-MLP Architecture for Vision},
author={Ilya Tolstikhin and Neil Houlsby and Alexander Kolesnikov and Lucas Beyer and Xiaohua Zhai and Thomas Unterthiner and Jessica Yung and Andreas Steiner and Daniel Keysers and Jakob Uszkoreit and Mario Lucic and Alexey Dosovitskiy},
year={2021},
eprint={2105.01601},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```

## Results and models

### ImageNet-1k
@@ -33,3 +21,16 @@ Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Re
| Mixer-L/16\* | 208.2 | 44.57 | 72.34 | 88.02 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/mlp_mixer/mlp-mixer-large-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mlp-mixer/mixer-large-p16_3rdparty_64xb64_in1k_20211124-5a2519d2.pth) |

*Models with \* are converted from [timm](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/mlp_mixer.py). The config files of these models are only for validation. We don't ensure these config files' training accuracy and welcome you to contribute your reproduction results.*

## Citation

```
@misc{tolstikhin2021mlpmixer,
title={MLP-Mixer: An all-MLP Architecture for Vision},
author={Ilya Tolstikhin and Neil Houlsby and Alexander Kolesnikov and Lucas Beyer and Xiaohua Zhai and Thomas Unterthiner and Jessica Yung and Andreas Steiner and Daniel Keysers and Jakob Uszkoreit and Mario Lucic and Alexey Dosovitskiy},
year={2021},
eprint={2105.01601},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
6 changes: 3 additions & 3 deletions configs/mlp_mixer/metafile.yml
@@ -10,9 +10,9 @@ Collections:
URL: https://arxiv.org/abs/2105.01601
Title: "MLP-Mixer: An all-MLP Architecture for Vision"
README: configs/mlp_mixer/README.md
# Code:
# URL: # todo
# Version: # todo
Code:
URL: https://github.com/open-mmlab/mmclassification/blob/v0.18.0/mmcls/models/backbones/mlp_mixer.py
Version: v0.18.0

Models:
- Name: mlp-mixer-base-p16_3rdparty_64xb64_in1k
27 changes: 14 additions & 13 deletions configs/mobilenet_v2/README.md
@@ -1,20 +1,29 @@
# MobileNetV2: Inverted Residuals and Linear Bottlenecks
<!-- {MobileNet V2} -->
# MobileNet V2

> [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381)
<!-- [ALGORITHM] -->

## Abstract
<!-- [ABSTRACT] -->

In this paper we describe a new mobile architecture, MobileNetV2, that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes. We also describe efficient ways of applying these mobile models to object detection in a novel framework we call SSDLite. Additionally, we demonstrate how to build mobile semantic segmentation models through a reduced form of DeepLabv3 which we call Mobile DeepLabv3.

The MobileNetV2 architecture is based on an inverted residual structure where the input and output of the residual block are thin bottleneck layers, opposite to traditional residual models which use expanded representations in the input. MobileNetV2 uses lightweight depthwise convolutions to filter features in the intermediate expansion layer. Additionally, we find that it is important to remove non-linearities in the narrow layers in order to maintain representational power. We demonstrate that this improves performance and provide an intuition that led to this design. Finally, our approach allows decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on ImageNet classification, COCO object detection, and VOC image segmentation. We evaluate the trade-offs between accuracy and number of operations measured by multiply-adds (MAdd), as well as the number of parameters.

<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/26739999/142563365-7a9ea577-8f79-4c21-a750-ebcaad9bcc2f.png" width="40%"/>
</div>
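
The inverted residual block described above expands with a 1x1 convolution, filters with a 3x3 depthwise convolution, and projects back down with a linear (no activation) 1x1 convolution, adding a shortcut when the stride is 1 and the channel counts match. Below is a minimal sketch with an assumed expansion ratio of 6 and illustrative names, not the mmcls implementation.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, expand: int = 6):
        super().__init__()
        hidden = in_ch * expand
        self.use_res = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),  # depthwise
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),    # linear bottleneck
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_res else out

block = InvertedResidual(32, 32, stride=1)
print(block(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 32, 56, 56])
```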

## Results and models

### ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|:---------------------:|:---------:|:--------:|:---------:|:---------:|:---------:|:--------:|
| MobileNet V2 | 3.5 | 0.319 | 71.86 | 90.42 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.pth) &#124; [log](https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.log.json) |

## Citation
```latex

```
@INPROCEEDINGS{8578572,
author={M. {Sandler} and A. {Howard} and M. {Zhu} and A. {Zhmoginov} and L. {Chen}},
booktitle={2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition},
@@ -26,11 +35,3 @@ The MobileNetV2 architecture is based on an inverted residual structure where th
doi={10.1109/CVPR.2018.00474}}
}
```

## Results and models

### ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|:---------------------:|:---------:|:--------:|:---------:|:---------:|:---------:|:--------:|
| MobileNet V2 | 3.5 | 0.319 | 71.86 | 90.42 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.pth) &#124; [log](https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.log.json) |