A history of computer vision architectures, with a focus on classification, segmentation, and object detection networks.

Image Classification

Paper | Date | Description |
---|---|---|
Neocognitron | 1979 | A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position; an early precursor of convolutional networks. |
ConvNet | 1989 | Used back-propagation to learn the convolution kernel coefficients directly from images of handwritten digits. |
LeNet | December 1998 | Established the classic CNN layout of stacked convolution and subsampling (pooling) layers followed by fully connected layers, applied to handwritten digit and document recognition. |
AlexNet | September 2012 | Introduced ReLU activations and Dropout to CNNs. Winner of ILSVRC 2012. |
ZFNet | 2013 | A classic convolutional neural network whose design was motivated by visualizing intermediate feature layers and the operation of the classifier. Compared to AlexNet, the filter sizes and the stride of the convolutions are reduced. |
GoogLeNet | 2014 | A 22-layer deep network built from Inception modules, evaluated on both classification and detection; winner of the ILSVRC 2014 classification task. |
VGG | September 2014 | Used a large number of small (3×3) convolutional filters in each layer to learn complex features; achieved state-of-the-art results at ILSVRC 2014. |
Inception Net | September 2014 | Introduced Inception Modules consisting of multiple parallel convolutional layers, designed to recognize different features at multiple scales. |
HighwayNet | 2015 | Introduced learned gating units that regulate information flow through the network, easing gradient-based training of very deep networks. |
Inception Net v2 / Inception Net v3 | December 2015 | Design optimizations of the Inception modules, such as factorized convolutions and label smoothing, which improved performance and accuracy. |
ResNet | December 2015 | Introduced residual connections, shortcuts that bypass one or more layers in the network. Winner of ILSVRC 2015. A minimal sketch of this and related building blocks appears after this table. |
Inception Net v4 / Inception ResNet | February 2016 | Hybrid approach combining Inception Net and ResNet. |
DenseNet | August 2016 | Each layer receives input from all previous layers, creating dense connections between layers that encourage feature reuse and allow the network to learn more diverse features. |
DarkNet | 2016 | A family of convolutional backbones (Darknet-19, Darknet-53) used by the YOLOv2 and YOLOv3 object detection approaches. |
Xception | October 2016 | Based on Inception V3, but uses depthwise separable convolutions instead of Inception modules. |
ResNeXt | November 2016 | Built on ResNet; introduces grouped convolutions, where the filters in a convolutional layer are divided into multiple parallel groups. |
FractalNet | 2017 | Builds very deep networks from a fractal expansion rule, presented as a simple alternative to ResNet that shows residual connections are not required for training very deep networks. |
Capsule Networks | 2017 | Proposed to improve the performance of CNNs, especially in terms of spatial hierarchies and rotation invariance. |
WideResNet | 2016 | Decreases the depth and increases the width of residual networks, showing that wide, relatively shallow ResNets can outperform very deep, thin ones. |
PolyNet | 2017 | Introduces PolyInception modules, which compose Inception units polynomially to increase the structural diversity of very deep networks. |
Pyramidal Net | 2017 | Increases the feature map dimension gradually across layers instead of sharply at each downsampling residual unit, and uses zero-padded identity-mapping shortcuts when the feature dimension grows, mixing plain and residual behaviour. |
Squeeze and Excitation Nets | 2017 | Focuses on channel relationships with a novel architectural unit, the "Squeeze-and-Excitation" (SE) block, which adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. SE blocks can be stacked to form SENet architectures that generalise well across datasets (see the sketch after this table). |
MobileNet V1 | April 2017 | Uses depthwise separable convolutions to reduce the number of parameters and the computation required. |
CMPE-SE | 2018 | Competitive Squeeze-and-Excitation networks; extend SE blocks by modelling competition between the residual and identity mappings during channel recalibration. |
RAN | 2017 | Residual Attention Network, built by stacking Attention Modules that generate attention-aware features; the attention-aware features from different modules change adaptively as layers go deeper. |
CB-CNN | 2018 | Channel Boosted CNN. Channel Boosting exploits both the channel dimension of a CNN (learning from multiple input channels) and transfer learning (TL), which is applied at two stages: channel generation and channel exploitation. |
CBAM | 2018 | Convolutional Block Attention Module, a simple yet effective attention module for feed-forward convolutional neural networks. Given an intermediate feature map, the module sequentially infers attention maps along two separate dimensions, channel and spatial, which are then multiplied with the input feature map for adaptive feature refinement. |
MobileNet V2 | January 2018 | Built upon the MobileNet V1 architecture; uses inverted residuals and linear bottlenecks. |
MobileNet V3 | May 2019 | Uses hardware-aware neural architecture search (NAS) complemented by the NetAdapt algorithm to tune the architecture for mobile CPUs. |
EfficientNet | May 2019 | Uses a compound scaling method to scale the network's depth, width, and resolution, achieving high accuracy at relatively low computational cost. |
NoisyStudent | 2020 | Noisy Student Training extends self-training and distillation with equal-or-larger student models and noise added to the student during learning. On ImageNet, an EfficientNet is first trained on labeled images and used as a teacher to generate pseudo-labels for 300M unlabeled images. |
Vision Transformer | October 2020 | Images are split into patches, which are treated as tokens; a sequence of linear embeddings of these patches is fed to a standard Transformer encoder (patch embedding is included in the sketch after this table). |
SwAV | 2020 | Self-supervised learning approach that clusters image features online and enforces consistency by swapping cluster assignments between different augmented views of the same image. |
ResNeSt | 2020 | Introduces Split-Attention blocks that apply channel-wise attention across feature-map groups, scaling ResNet-style models to new levels of performance. |
DeiT | December 2020 | A convolution-free, data-efficient vision transformer trained with a teacher-student strategy using an attention-based distillation token. |
Swin Transformer | March 2021 | A hierarchical vision transformer that uses shifted windows to address the challenges of adapting the transformer model to computer vision. |
CaiT | 2021 | Class-Attention in Image Transformers; trains deeper vision transformers using LayerScale and dedicated class-attention layers. |
T2T-ViT | 2021 | Improves transformer-based vision models with a token-to-token (T2T) module that progressively aggregates neighbouring tokens to capture local structure. |
TNT | 2021 | Transformer-in-Transformer architecture; an inner transformer models sub-patches within each patch while an outer transformer models the patches themselves, giving finer-grained feature learning. |
BEiT | June 2021 | Utilizes a masked image modeling task inspired by BERT, using image patches and discrete visual tokens to pretrain vision Transformers. |
MobileViT | October 2021 | A lightweight vision transformer designed for mobile devices, effectively combining the strengths of CNNs and ViTs. |
Masked AutoEncoder | November 2021 | An encoder-decoder architecture that reconstructs input images by masking random patches and leveraging a high proportion of masking for self-supervision. |
CoAtNet | 2021 | Convolution and Self-Attention Network; stacks convolutional stages before self-attention stages to combine the generalization of convolutions with the capacity of attention. |
ConvNeXt | January 2022 | A pure convolutional network modernized with design choices borrowed from vision transformers, improving upon the designs of earlier CNNs. |
NFNet | 2021 | Normalizer-Free networks; achieve high-performance large-scale image recognition without batch normalization, using adaptive gradient clipping. |
MLP-Mixer | 2021 | Introduced mixer layers that alternate token-mixing and channel-mixing MLPs as an alternative to convolutions and self-attention. |
gMLP | 2021 | An MLP-based architecture with spatial gating units, shown to match Transformers on vision and language tasks without self-attention. |
ConvMixer | January 2022 | Operates directly on patch embeddings, using depthwise convolutions to mix spatial locations and pointwise convolutions to mix channels. |
MViT | 2022 | Multiscale Vision Transformer; builds a hierarchical, multiscale feature representation within the transformer, designed for both image and video recognition. |
Shuffle Transformer | 2022 | Combined shuffle units with transformer blocks for efficient processing |
CrossViT | 2021 | A dual-branch vision transformer that processes small and large image patches in separate branches and fuses them with cross-attention. |
RegNet | 2020 | Introduced a design space exploration approach to neural network architecture search, producing efficient and high-performing models for image classification and other tasks. |
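
Several recurring building blocks from the table above, residual shortcuts (ResNet), squeeze-and-excitation recalibration (SENet), depthwise separable convolutions (Xception, MobileNet), and ViT-style patch embedding, can be summarized in a few lines of code. The sketch below is a minimal, illustrative PyTorch rendering; the class names (`SEBlock`, `DepthwiseSeparableConv`, `ResidualSEBlock`, `PatchEmbedding`) and layer sizes are assumptions chosen for clarity, not the authors' reference implementations.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: recalibrates channels via a global-pooling bottleneck."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global spatial context per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)  # excitation: per-channel weights
        return x * w  # recalibrate the feature map


class DepthwiseSeparableConv(nn.Module):
    """Xception/MobileNet-style factorization: depthwise 3x3 conv followed by 1x1 pointwise conv."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))


class ResidualSEBlock(nn.Module):
    """ResNet-style basic block with an SE stage on the residual branch."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.se = SEBlock(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.se(self.body(x)))  # identity shortcut bypasses the block


class PatchEmbedding(nn.Module):
    """ViT-style tokenization: split the image into patches and linearly embed each one."""

    def __init__(self, img_size: int = 224, patch: int = 16, in_ch: int = 3, dim: int = 768):
        super().__init__()
        # A strided convolution implements "flatten each patch + linear projection" in one step.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.num_patches = (img_size // patch) ** 2

    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, num_patches, dim) token sequence


if __name__ == "__main__":
    x = torch.randn(1, 64, 56, 56)
    print(ResidualSEBlock(64)(x).shape)              # torch.Size([1, 64, 56, 56])
    print(DepthwiseSeparableConv(64, 128)(x).shape)  # torch.Size([1, 128, 56, 56])
    img = torch.randn(1, 3, 224, 224)
    print(PatchEmbedding()(img).shape)               # torch.Size([1, 196, 768])
```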

Object Detection and Segmentation

Paper | Date | Description |
---|---|---|
RCNN | November 2013 | Uses selective search for region proposals, a CNN for feature extraction, and an SVM for classification, followed by bounding-box offset regression. |
SPPNet | 2014 | Spatial Pyramid Pooling network; computes the convolutional feature map once per image and pools region features at multiple scales, removing the fixed input-size constraint. |
Fast RCNN | April 2015 | Processes the entire image through a CNN, employs RoI Pooling to extract fixed-size feature vectors from region proposals, followed by classification and bounding-box regression. |
Faster RCNN | June 2015 | A region proposal network (RPN) and a Fast R-CNN detector collaboratively predict object regions by sharing convolutional features. |
YOLOv1 | 2015 | You Only Look Once v1; frames detection as a single regression problem, predicting bounding boxes and class probabilities directly from the full image in one evaluation. |
SSD | December 2015 | Discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. |
RFCN | 2016 | Region-based Fully Convolutional Network; shares nearly all computation across the whole image using position-sensitive score maps instead of a costly per-region subnetwork. |
YOLOv2 | 2016 | You Only Look Once v2 (YOLO9000); adds batch normalization, anchor boxes, and multi-scale training to YOLOv1. |
Feature Pyramid Network | December 2016 | Leverages the inherent multi-scale hierarchy of deep convolutional networks to efficiently construct feature pyramids. |
Mask RCNN | March 2017 | Extends Faster R-CNN to instance segmentation by adding a branch for predicting an object mask in parallel with the existing branch for bounding-box recognition. |
Focal Loss | August 2017 | Addresses class imbalance in dense object detectors by down-weighting the loss assigned to well-classified examples (a minimal sketch appears after this table). |
RetinaNet | 2017 | A one-stage object detection model that utilizes a focal loss function to address class imbalance during training. |
Cascade RCNN | 2018 | A multi-stage object detection architecture, the Cascade R-CNN, consists of a sequence of detectors trained with increasing IoU thresholds, to be sequentially more selective against close false positives. The detectors are trained stage by stage, leveraging the observation that the output of a detector is a good distribution for training the next higher quality detector. |
YOLOv3 | 2018 | You Only Look Once v3; adds a Darknet-53 backbone and predictions at three scales. |
EfficientDet | 2019 | Systematically studies detector design choices and proposes a weighted bi-directional feature pyramid network (BiFPN) together with compound scaling of resolution, depth, and width. |
CenterNet | 2019 | Detects each object as a keypoint triplet (a center point plus two corners), efficiently exploring the visual patterns within each cropped region at minimal cost. |
DETR | 2020 | Detection Transformer; end-to-end object detection with Transformers, viewing detection as a direct set prediction problem. |
YOLOv4 | 2020 | You Only Look Once v4; combines a CSPDarknet53 backbone with "bag of freebies" and "bag of specials" training techniques. |
YOLOv5 | 2020 | You Only Look Once v5; Ultralytics' PyTorch implementation and refinement of the YOLO family. |
YOLOv6 | 2022 | You Only Look Once v6; a YOLO variant from Meituan aimed at industrial applications. |
YOLOv7 | 2022 | You Only Look Once v7; introduces "trainable bag-of-freebies" improvements for real-time detection. |
YOLOv8 | 2023 | You Only Look Once v8; Ultralytics' anchor-free successor to YOLOv5, supporting detection, segmentation, and classification. |
YOLO-NAS | 2023 | A YOLO architecture produced by neural architecture search, aimed at the best accuracy and latency tradeoff for object detection tasks. |
RT-DETR | 2023 | A cutting-edge end-to-end object detector that provides real-time performance while maintaining high accuracy. It leverages the power of Vision Transformers (ViT) to efficiently process multiscale features by decoupling intra-scale interaction and cross-scale fusion. RT-DETR is highly adaptable, supporting flexible adjustment of inference speed using different decoder layers without retraining. The model excels on accelerated backends like CUDA with TensorRT, outperforming many other real-time object detectors. |
SAM | 2023 | The Segment Anything Model (SAM) is a promptable image segmentation model at the heart of the Segment Anything initiative, which introduces a novel model, task, and dataset for image segmentation. |
Fast-SAM | 2023 | FastSAM is designed to address the limitations of the Segment Anything Model (SAM), a heavy Transformer model with substantial computational resource requirements. The FastSAM decouples the segment anything task into two sequential stages: all-instance segmentation and prompt-guided selection. The first stage uses YOLOv8-seg to produce the segmentation masks of all instances in the image. In the second stage, it outputs the region-of-interest corresponding to the prompt. |
Mobile-SAM | 2023 | Mobile Segment Anything (MobileSAM); replaces SAM's heavy ViT-H image encoder with a lightweight distilled encoder, making promptable segmentation practical on mobile devices. |
YOLOv9 | 2024 | You Only Look Once v9; introduces programmable gradient information (PGI) and the GELAN architecture. |
YOLO-World | 2024 | YOLO-World tackles the challenges faced by traditional Open-Vocabulary detection models, which often rely on cumbersome Transformer models requiring extensive computational resources. These models' dependence on pre-defined object categories also restricts their utility in dynamic scenarios. YOLO-World revitalizes the YOLOv8 framework with open-vocabulary detection capabilities, employing vision-language modeling and pre-training on expansive datasets to excel at identifying a broad array of objects in zero-shot scenarios with unmatched efficiency. |
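
The Focal Loss / RetinaNet entries above describe down-weighting well-classified examples so that training concentrates on hard ones. Below is a minimal, illustrative PyTorch sketch of the binary focal loss using the commonly cited defaults (alpha = 0.25, gamma = 2); it is a hedged example, not the RetinaNet reference implementation.

```python
import torch
import torch.nn.functional as F


def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    `logits` are raw per-anchor class scores and `targets` are 0/1 labels of the
    same shape. Well-classified examples (p_t close to 1) are down-weighted by the
    (1 - p_t)**gamma modulating factor, so the loss focuses on hard examples.
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)               # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)   # class-balancing weight
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()


if __name__ == "__main__":
    logits = torch.tensor([2.0, -1.0, 0.5])   # raw scores for three anchors
    targets = torch.tensor([1.0, 0.0, 1.0])   # ground-truth labels
    print(focal_loss(logits, targets))        # scalar loss, small for easy examples
```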