A history of computer vision architectures, with a focus on classification, segmentation, and object detection networks.

Image Classification

Paper | Date | Description |
---|---|---|
Neocognitron | 1979 | A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position; an early precursor of convolutional networks. |
ConvNet | 1989 | Used back-propagation to learn the convolution kernel coefficients directly from images of handwritten digits. |
LeNet | December 1998 | Established the classic CNN layout of stacked convolution and subsampling (pooling) layers followed by fully connected layers, applied to handwritten digit and document recognition. |
AlexNet | September 2012 | Introduced ReLU activations and Dropout to CNNs. Winner of ILSVRC 2012. |
ZFNet | 2013 | A classic convolutional neural network whose design was motivated by visualizing intermediate feature layers and the operation of the classifier. Compared to AlexNet, the filter sizes and the stride of the convolutions are reduced. |
GoogLeNet | 2014 | A 22-layer deep network built from Inception modules, evaluated on both classification and detection; winner of the ILSVRC 2014 classification task. |
VGG | September 2014 | Used a large number of small (3×3) convolutional filters in each layer to learn complex features; achieved state-of-the-art results at ILSVRC 2014. |
Inception Net | September 2014 | Introduced Inception Modules consisting of multiple parallel convolutional layers, designed to recognize different features at multiple scales. |
HighwayNet | 2015 | Introduced learned gating units that regulate information flow through the network, easing gradient-based training of very deep networks. |
Inception Net v2 / Inception Net v3 | December 2015 | Design optimizations of the Inception modules, such as factorized convolutions and label smoothing, which improved performance and accuracy. |
ResNet | December 2015 | Introduced residual connections, shortcuts that bypass one or more layers in the network. Winner of ILSVRC 2015. A minimal sketch of this and related building blocks appears after this table. |
Inception Net v4 / Inception ResNet | February 2016 | Hybrid approach combining Inception Net and ResNet. |
DenseNet | August 2016 | Each layer receives input from all previous layers, creating dense connections between layers that encourage feature reuse and allow the network to learn more diverse features. |
DarkNet | 2016 | A family of convolutional backbones (Darknet-19, Darknet-53) used by the YOLOv2 and YOLOv3 object detection approaches. |
Xception | October 2016 | Based on Inception V3, but uses depthwise separable convolutions instead of Inception modules. |
ResNeXt | November 2016 | Built on ResNet; introduces grouped convolutions, where the filters in a convolutional layer are divided into multiple parallel groups. |
FractalNet | 2017 | Builds very deep networks from a fractal expansion rule, presented as a simple alternative to ResNet that shows residual connections are not required for training very deep networks. |
Capsule Networks | 2017 | Proposed to improve the performance of CNNs, especially in terms of spatial hierarchies and rotation invariance. |
WideResNet | 2016 | Decreases the depth and increases the width of residual networks, showing that wide, relatively shallow ResNets can outperform very deep, thin ones. |
PolyNet | 2017 | Introduces PolyInception modules, which compose Inception units polynomially to increase the structural diversity of very deep networks. |
Pyramidal Net | 2017 | Increases the feature map dimension gradually across layers instead of sharply at each downsampling residual unit, and uses zero-padded identity-mapping shortcuts when the feature dimension grows, mixing plain and residual behaviour. |
Squeeze and Excitation Nets | 2017 | Focuses on channel relationships with a novel architectural unit, the "Squeeze-and-Excitation" (SE) block, which adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. SE blocks can be stacked to form SENet architectures that generalise well across datasets (see the sketch after this table). |
MobileNet V1 | April 2017 | Uses depthwise separable convolutions to reduce the number of parameters and the computation required. |
CMPE-SE | 2018 | Competitive Squeeze-and-Excitation networks; extend SE blocks by modelling competition between the residual and identity mappings during channel recalibration. |
RAN | 2017 | Residual Attention Network, built by stacking Attention Modules that generate attention-aware features; the attention-aware features from different modules change adaptively as layers go deeper. |
CB-CNN | 2018 | Channel Boosted CNN. Channel Boosting exploits both the channel dimension of a CNN (learning from multiple input channels) and transfer learning (TL), which is applied at two stages: channel generation and channel exploitation. |
CBAM | 2018 | Convolutional Block Attention Module, a simple yet effective attention module for feed-forward convolutional neural networks. Given an intermediate feature map, the module sequentially infers attention maps along two separate dimensions, channel and spatial, which are then multiplied with the input feature map for adaptive feature refinement. |
MobileNet V2 | January 2018 | Built upon the MobileNet V1 architecture; uses inverted residuals and linear bottlenecks. |
MobileNet V3 | May 2019 | Uses hardware-aware neural architecture search (NAS) complemented by the NetAdapt algorithm to tune the architecture for mobile CPUs. |
EfficientNet | May 2019 | Uses a compound scaling method to scale the network's depth, width, and resolution, achieving high accuracy at relatively low computational cost. |
NoisyStudent | 2020 | Noisy Student Training extends self-training and distillation with equal-or-larger student models and noise added to the student during learning. On ImageNet, an EfficientNet is first trained on labeled images and used as a teacher to generate pseudo-labels for 300M unlabeled images. |
Vision Transformer | October 2020 | Images are split into patches, which are treated as tokens; a sequence of linear embeddings of these patches is fed to a standard Transformer encoder (patch embedding is included in the sketch after this table). |
SwAV | 2020 | Self-supervised learning approach that clusters image features online and enforces consistency by swapping cluster assignments between different augmented views of the same image. |
ResNeSt | 2020 | Introduces Split-Attention blocks that apply channel-wise attention across feature-map groups, scaling ResNet-style models to new levels of performance. |
DeiT | December 2020 | A convolution-free, data-efficient vision transformer trained with a teacher-student strategy using an attention-based distillation token. |
Swin Transformer | March 2021 | A hierarchical vision transformer that uses shifted windows to address the challenges of adapting the transformer model to computer vision. |
CaiT | 2021 | Class-Attention in Image Transformers; trains deeper vision transformers using LayerScale and dedicated class-attention layers. |
T2T-ViT | 2021 | Improves transformer-based vision models with a token-to-token (T2T) module that progressively aggregates neighbouring tokens to capture local structure. |
TNT | 2021 | Transformer-in-Transformer architecture; an inner transformer models sub-patches within each patch while an outer transformer models the patches themselves, giving finer-grained feature learning. |
BEiT | June 2021 | Utilizes a masked image modeling task inspired by BERT, using image patches and discrete visual tokens to pretrain vision Transformers. |
MobileViT | October 2021 | A lightweight vision transformer designed for mobile devices, effectively combining the strengths of CNNs and ViTs. |
Masked AutoEncoder | November 2021 | An encoder-decoder architecture that reconstructs input images by masking random patches and leveraging a high proportion of masking for self-supervision. |
CoAtNet | 2021 | Convolution and Self-Attention Network; stacks convolutional stages before self-attention stages to combine the generalization of convolutions with the capacity of attention. |
ConvNeXt | January 2022 | A pure convolutional network modernized with design choices borrowed from vision transformers, improving upon the designs of earlier CNNs. |
NFNet | 2021 | Normalizer-Free networks; achieve high-performance large-scale image recognition without batch normalization, using adaptive gradient clipping. |
MLP-Mixer | 2021 | Introduced mixer layers that alternate token-mixing and channel-mixing MLPs as an alternative to convolutions and self-attention. |
gMLP | 2021 | An MLP-based architecture with spatial gating units, shown to match Transformers on vision and language tasks without self-attention. |
ConvMixer | January 2022 | Operates directly on patch embeddings, using depthwise convolutions to mix spatial locations and pointwise convolutions to mix channels. |
MViT | 2022 | Multiscale Vision Transformer; builds a hierarchical, multiscale feature representation within the transformer, designed for both image and video recognition. |
Shuffle Transformer | 2022 | Combined shuffle units with transformer blocks for efficient processing |
CrossViT | 2021 | A dual-branch vision transformer that processes small and large image patches in separate branches and fuses them with cross-attention. |
RegNet | 2020 | Introduced a design space exploration approach to neural network architecture search, producing efficient and high-performing models for image classification and other tasks. |
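
Several recurring building blocks from the table above, residual shortcuts (ResNet), squeeze-and-excitation recalibration (SENet), depthwise separable convolutions (Xception, MobileNet), and ViT-style patch embedding, can be summarized in a few lines of code. The sketch below is a minimal, illustrative PyTorch rendering; the class names (`SEBlock`, `DepthwiseSeparableConv`, `ResidualSEBlock`, `PatchEmbedding`) and layer sizes are assumptions chosen for clarity, not the authors' reference implementations.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: recalibrates channels via a global-pooling bottleneck."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global spatial context per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)  # excitation: per-channel weights
        return x * w  # recalibrate the feature map


class DepthwiseSeparableConv(nn.Module):
    """Xception/MobileNet-style factorization: depthwise 3x3 conv followed by 1x1 pointwise conv."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))


class ResidualSEBlock(nn.Module):
    """ResNet-style basic block with an SE stage on the residual branch."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.se = SEBlock(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.se(self.body(x)))  # identity shortcut bypasses the block


class PatchEmbedding(nn.Module):
    """ViT-style tokenization: split the image into patches and linearly embed each one."""

    def __init__(self, img_size: int = 224, patch: int = 16, in_ch: int = 3, dim: int = 768):
        super().__init__()
        # A strided convolution implements "flatten each patch + linear projection" in one step.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.num_patches = (img_size // patch) ** 2

    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, num_patches, dim) token sequence


if __name__ == "__main__":
    x = torch.randn(1, 64, 56, 56)
    print(ResidualSEBlock(64)(x).shape)              # torch.Size([1, 64, 56, 56])
    print(DepthwiseSeparableConv(64, 128)(x).shape)  # torch.Size([1, 128, 56, 56])
    img = torch.randn(1, 3, 224, 224)
    print(PatchEmbedding()(img).shape)               # torch.Size([1, 196, 768])
```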

Object Detection and Segmentation

Paper | Date | Description |
---|---|---|
RCNN | November 2013 | Uses selective search for region proposals, a CNN for feature extraction, and an SVM for classification, followed by bounding-box offset regression. |
SPPNet | 2014 | Spatial Pyramid Pooling network; computes the convolutional feature map once per image and pools region features at multiple scales, removing the fixed input-size constraint. |
Fast RCNN | April 2015 | Processes the entire image through a CNN, employs RoI Pooling to extract fixed-size feature vectors from region proposals, followed by classification and bounding-box regression. |
Faster RCNN | June 2015 | A region proposal network (RPN) and a Fast R-CNN detector collaboratively predict object regions by sharing convolutional features. |
YOLOv1 | 2015 | You Only Look Once v1; frames detection as a single regression problem, predicting bounding boxes and class probabilities directly from the full image in one evaluation. |
SSD | December 2015 | Discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. |
RFCN | 2016 | Region-based Fully Convolutional Network; shares nearly all computation across the whole image using position-sensitive score maps instead of a costly per-region subnetwork. |
YOLOv2 | 2016 | You Only Look Once v2 (YOLO9000); adds batch normalization, anchor boxes, and multi-scale training to YOLOv1. |
Feature Pyramid Network | December 2016 | Leverages the inherent multi-scale hierarchy of deep convolutional networks to efficiently construct feature pyramids. |
Mask RCNN | March 2017 | Extends Faster R-CNN to instance segmentation by adding a branch for predicting an object mask in parallel with the existing branch for bounding-box recognition. |
Focal Loss | August 2017 | Addresses class imbalance in dense object detectors by down-weighting the loss assigned to well-classified examples (a minimal sketch appears after this table). |
RetinaNet | 2017 | A one-stage object detection model that utilizes a focal loss function to address class imbalance during training. |
Cascade RCNN | 2018 | A multi-stage object detection architecture, the Cascade R-CNN, consists of a sequence of detectors trained with increasing IoU thresholds, to be sequentially more selective against close false positives. The detectors are trained stage by stage, leveraging the observation that the output of a detector is a good distribution for training the next higher quality detector. |
YOLOv3 | 2018 | You Only Look Once v3; adds a Darknet-53 backbone and predictions at three scales. |
EfficientDet | 2019 | Systematically studies detector design choices and proposes a weighted bi-directional feature pyramid network (BiFPN) together with compound scaling of resolution, depth, and width. |
CenterNet | 2019 | Detects each object as a keypoint triplet (a center point plus two corners), efficiently exploring the visual patterns within each cropped region at minimal cost. |
DETR | 2020 | Detection Transformer; end-to-end object detection with Transformers, viewing detection as a direct set prediction problem. |
YOLOv4 | 2020 | You Only Look Once v4; combines a CSPDarknet53 backbone with "bag of freebies" and "bag of specials" training techniques. |
YOLOv5 | 2020 | You Only Look Once v5; Ultralytics' PyTorch implementation and refinement of the YOLO family. |
YOLOv6 | 2022 | You Only Look Once v6; a YOLO variant from Meituan aimed at industrial applications. |
YOLOv7 | 2022 | You Only Look Once v7; introduces "trainable bag-of-freebies" improvements for real-time detection. |
YOLOv8 | 2023 | You Only Look Once v8; Ultralytics' anchor-free successor to YOLOv5, supporting detection, segmentation, and classification. |
YOLO-NAS | 2023 | A YOLO architecture produced by neural architecture search, aimed at the best accuracy and latency tradeoff for object detection tasks. |
RT-DETR | 2023 | A cutting-edge end-to-end object detector that provides real-time performance while maintaining high accuracy. It leverages the power of Vision Transformers (ViT) to efficiently process multiscale features by decoupling intra-scale interaction and cross-scale fusion. RT-DETR is highly adaptable, supporting flexible adjustment of inference speed using different decoder layers without retraining. The model excels on accelerated backends like CUDA with TensorRT, outperforming many other real-time object detectors. |
SAM | 2023 | The Segment Anything Model (SAM) is a promptable image segmentation model at the heart of the Segment Anything initiative, which introduces a novel model, task, and dataset for image segmentation. |
Fast-SAM | 2023 | FastSAM is designed to address the limitations of the Segment Anything Model (SAM), a heavy Transformer model with substantial computational resource requirements. The FastSAM decouples the segment anything task into two sequential stages: all-instance segmentation and prompt-guided selection. The first stage uses YOLOv8-seg to produce the segmentation masks of all instances in the image. In the second stage, it outputs the region-of-interest corresponding to the prompt. |
Mobile-SAM | 2023 | Mobile Segment Anything (MobileSAM); replaces SAM's heavy ViT-H image encoder with a lightweight distilled encoder, making promptable segmentation practical on mobile devices. |
YOLOv9 | 2024 | You Only Look Once v9; introduces programmable gradient information (PGI) and the GELAN architecture. |
YOLO-World | 2024 | YOLO-World tackles the challenges faced by traditional Open-Vocabulary detection models, which often rely on cumbersome Transformer models requiring extensive computational resources. These models' dependence on pre-defined object categories also restricts their utility in dynamic scenarios. YOLO-World revitalizes the YOLOv8 framework with open-vocabulary detection capabilities, employing vision-language modeling and pre-training on expansive datasets to excel at identifying a broad array of objects in zero-shot scenarios with unmatched efficiency. |
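
The Focal Loss / RetinaNet entries above describe down-weighting well-classified examples so that training concentrates on hard ones. Below is a minimal, illustrative PyTorch sketch of the binary focal loss using the commonly cited defaults (alpha = 0.25, gamma = 2); it is a hedged example, not the RetinaNet reference implementation.

```python
import torch
import torch.nn.functional as F


def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    `logits` are raw per-anchor class scores and `targets` are 0/1 labels of the
    same shape. Well-classified examples (p_t close to 1) are down-weighted by the
    (1 - p_t)**gamma modulating factor, so the loss focuses on hard examples.
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)               # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)   # class-balancing weight
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()


if __name__ == "__main__":
    logits = torch.tensor([2.0, -1.0, 0.5])   # raw scores for three anchors
    targets = torch.tensor([1.0, 0.0, 1.0])   # ground-truth labels
    print(focal_loss(logits, targets))        # scalar loss, small for easy examples
```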