1.ARIA: Adversarially Robust Image Attribution for Content Provenance ⬇️
Image attribution -- matching an image back to a trusted source -- is an emerging tool in the fight against online misinformation. Deep visual fingerprinting models have recently been explored for this purpose. However, they are not robust to tiny input perturbations known as adversarial examples. First we illustrate how to generate valid adversarial images that can easily cause incorrect image attribution. Then we describe an approach to prevent imperceptible adversarial attacks on deep visual fingerprinting models, via robust contrastive learning. The proposed training procedure leverages training on
$\ell_\infty$ -bounded adversarial examples, it is conceptually simple and incurs only a small computational overhead. The resulting models are substantially more robust, are accurate even on unperturbed images, and perform well even over a database with millions of images. In particular, we achieve 91.6% standard and 85.1% adversarial recall under$\ell_\infty$ -bounded perturbations on manipulated images compared to 80.1% and 0.0% from prior work. We also show that robustness generalizes to other types of imperceptible perturbations unseen during training. Finally, we show how to train an adversarially robust image comparator model for detecting editorial changes in matched images.
2.RELMOBNET: A Robust Two-Stage End-To-End Training Approach For MOBILENETV3 Based Relative Camera Pose Estimation ⬇️
Relative camera pose estimation plays a pivotal role in dealing with 3D reconstruction and visual localization. To address this, we propose a Siamese network based on MobileNetV3-Large for an end-to-end relative camera pose regression independent of camera parameters. The proposed network uses pair of images taken at different locations in the same scene to estimate the 3D translation vector and rotation vector in unit quaternion. To increase the generality of the model, rather than training it for a single scene, data for four scenes are combined to train a single universal model to estimate the relative pose. Further for independency of hyperparameter weighing between translation and rotation loss is not used. Instead we use the novel two-stage training procedure to learn the balance implicitly with faster convergence. We compare the results obtained with the Cambridge Landmarks dataset, comprising of different scenes, with existing CNN-based regression methods as baselines, e.g., RPNet and RCPNet. The findings indicate that, when compared to RCPNet, proposed model improves the estimation of the translation vector by a percentage change of 16.11%, 28.88%, 52.27% on the Kings College, Old Hospital, St Marys Church scenes from Cambridge Landmarks dataset, respectively.
3.NeuralFusion: Neural Volumetric Rendering under Human-object Interactions ⬇️
4D reconstruction and rendering of human activities is critical for immersive VR/AR experience. Recent advances still fail to recover fine geometry and texture results with the level of detail present in the input images from sparse multi-view RGB cameras. In this paper, we propose NeuralHumanFVV, a real-time neural human performance capture and rendering system to generate both high-quality geometry and photo-realistic texture of human activities in arbitrary novel views. We propose a neural geometry generation scheme with a hierarchical sampling strategy for real-time implicit geometry inference, as well as a novel neural blending scheme to generate high resolution (e.g., 1k) and photo-realistic texture results in the novel views. Furthermore, we adopt neural normal blending to enhance geometry details and formulate our neural geometry and texture rendering into a multi-task learning framework. Extensive experiments demonstrate the effectiveness of our approach to achieve high-quality geometry and photo-realistic free view-point reconstruction for challenging human performances.
4.Improving generalization with synthetic training data for deep learning based quality inspection ⬇️
Automating quality inspection with computer vision techniques is often a very data-demanding task. Specifically, supervised deep learning requires a large amount of annotated images for training. In practice, collecting and annotating such data is not only costly and laborious, but also inefficient, given the fact that only a few instances may be available for certain defect classes. If working with video frames can increase the number of these instances, it has a major disadvantage: the resulting images will be highly correlated with one another. As a consequence, models trained under such constraints are expected to be very sensitive to input distribution changes, which may be caused in practice by changes in the acquisition system (cameras, lights), in the parts or in the defects aspect. In this work, we demonstrate the use of randomly generated synthetic training images can help tackle domain instability issues, making the trained models more robust to contextual changes. We detail both our synthetic data generation pipeline and our deep learning methodology for answering these questions.
5.Sensing accident-prone features in urban scenes for proactive driving and accident prevention ⬇️
In urban cities, visual information along and on roadways is likely to distract drivers and leads to missing traffic signs and other accident-prone features. As a solution to avoid accidents due to missing these visual cues, this paper proposes a visual notification of accident-prone features to drivers, based on real-time images obtained via dashcam. For this purpose, Google Street View images around accident hotspots (areas of dense accident occurrence) identified by accident dataset are used to train a family of deep convolutional neural networks (CNNs). Trained CNNs are able to detect accident-prone features and classify a given urban scene into an accident hotspot and a non-hotspot (area of sparse accident occurrence). For given accident hotspot, the trained CNNs can classify it into an accident hotspot with the accuracy up to 90%. The capability of detecting accident-prone features by the family of CNNs is analyzed by a comparative study of four different class activation map (CAM) methods, which are used to inspect specific accident-prone features causing the decision of CNNs, and pixel-level object class classification. The outputs of CAM methods are processed by an image processing pipeline to extract only the accident-prone features that are explainable to drivers with the help of visual notification system. To prove the efficacy of accident-prone features, an ablation study is conducted. Ablation of accident-prone features taking 7.7%, on average, of total area in each image sample causes up to 13.7% more chance of given area to be classified as a non-hotspot.
6.Confidence Calibration for Object Detection and Segmentation ⬇️
Calibrated confidence estimates obtained from neural networks are crucial, particularly for safety-critical applications such as autonomous driving or medical image diagnosis. However, although the task of confidence calibration has been investigated on classification problems, thorough in-ves-tiga-tions on object detection and segmentation problems are still missing. Therefore, we focus on the investigation of confidence calibration for object detection and segmentation models in this chapter. We introduce the concept of multivariate confidence calibration that is an extension of well-known calibration methods to the task of object detection and segmentation. This allows for an extended confidence calibration that is also aware of additional features such as bounding box/pixel position, shape information, etc. Furthermore, we extend the expected calibration error (ECE) to measure mis-ca-li-bra-tion of object detection and segmentation models. We examine several network architectures on MS COCO as well as on Cityscapes and show that especially object detection as well as instance segmentation models are intrinsically miscalibrated given the introduced definition of calibration. Using our proposed calibration methods, we have been able to improve calibration so that it also has a positive impact on the quality of segmentation masks as well.
7.Towards Safe, Real-Time Systems: Stereo vs Images and LiDAR for 3D Object Detection ⬇️
As object detectors rapidly improve, attention has expanded past image-only networks to include a range of 3D and multimodal frameworks, especially ones that incorporate LiDAR. However, due to cost, logistics, and even some safety considerations, stereo can be an appealing alternative. Towards understanding the efficacy of stereo as a replacement for monocular input or LiDAR in object detectors, we show that multimodal learning with traditional disparity algorithms can improve image-based results without increasing the number of parameters, and that learning over stereo error can impart similar 3D localization power to LiDAR in certain contexts. Furthermore, doing so also has calibration benefits with respect to image-only methods. We benchmark on the public dataset KITTI, and in doing so, reveal a few small but common algorithmic mistakes currently used in computing metrics on that set, and offer efficient, provably correct alternatives.
8.Data refinement for fully unsupervised visual inspection using pre-trained networks ⬇️
Anomaly detection has recently seen great progress in the field of visual inspection. More specifically, the use of classical outlier detection techniques on features extracted by deep pre-trained neural networks have been shown to deliver remarkable performances on the MVTec Anomaly Detection (MVTec AD) dataset. However, like most other anomaly detection strategies, these pre-trained methods assume all training data to be normal. As a consequence, they cannot be considered as fully unsupervised. There exists to our knowledge no work studying these pre-trained methods under fully unsupervised setting. In this work, we first assess the robustness of these pre-trained methods to fully unsupervised context, using polluted training sets (i.e. containing defective samples), and show that these methods are more robust to pollution compared to methods such as CutPaste. We then propose SROC, a Simple Refinement strategy for One Class classification. SROC enables to remove most of the polluted images from the training set, and to recover some of the lost AUC. We further show that our simple heuristic competes with, and even outperforms much more complex strategies from the existing literature.
9.Synthesizing Photorealistic Images with Deep Generative Learning ⬇️
The goal of this thesis is to present my research contributions towards solving various visual synthesis and generation tasks, comprising image translation, image completion, and completed scene decomposition. This thesis consists of five pieces of work, each of which presents a new learning-based approach for synthesizing images with plausible content as well as visually realistic appearance. Each work demonstrates the superiority of the proposed approach on image synthesis, with some further contributing to other tasks, such as depth estimation.
10.The effect of fatigue on the performance of online writer recognition ⬇️
Background: The performance of biometric modalities based on things done by the subject, like signature and text-based recognition, may be affected by the subject state. Fatigue is one of the conditions that can significantly affect the outcome of handwriting tasks. Recent research has already shown that physical fatigue produces measurable differences in some features extracted from common writing and drawing tasks. It is important to establish to which extent physical fatigue contributes to the intra-person variability observed in these biometric modalities and also to know whether the performance of recognition methods is affected by fatigue. Goal: In this paper we assess the impact of fatigue on intra-user variability and on the performance of signature-based and text-based writer recognition approaches encompassing both identification and verification. Methods: Several signature and text recognition methods are considered and applied to samples gathered after different levels of induced fatigue, measured by metabolic and mechanical assessment and, also by subjective perception. The recognition methods are Dynamic Time Warping and Multi Section Vector Quantization, for signatures, and Allographic Text-Dependent Recognition for text in capital letters. For each fatigue level, the identification and verification performance of these methods is measured. Results: Signature shows no statistically significant intra-user impact, but text does. On the other hand, performance of signature-based recognition approaches is negatively impacted by fatigue whereas the impact is not noticeable in text-based recognition, provided long enough sequences are considered.
11.Reconstruction of Perceived Images from fMRI Patterns and Semantic Brain Exploration using Instance-Conditioned GANs ⬇️
Reconstructing perceived natural images from fMRI signals is one of the most engaging topics of neural decoding research. Prior studies had success in reconstructing either the low-level image features or the semantic/high-level aspects, but rarely both. In this study, we utilized an Instance-Conditioned GAN (IC-GAN) model to reconstruct images from fMRI patterns with both accurate semantic attributes and preserved low-level details. The IC-GAN model takes as input a 119-dim noise vector and a 2048-dim instance feature vector extracted from a target image via a self-supervised learning model (SwAV ResNet-50); these instance features act as a conditioning for IC-GAN image generation, while the noise vector introduces variability between samples. We trained ridge regression models to predict instance features, noise vectors, and dense vectors (the output of the first dense layer of the IC-GAN generator) of stimuli from corresponding fMRI patterns. Then, we used the IC-GAN generator to reconstruct novel test images based on these fMRI-predicted variables. The generated images presented state-of-the-art results in terms of capturing the semantic attributes of the original test images while remaining relatively faithful to low-level image details. Finally, we use the learned regression model and the IC-GAN generator to systematically explore and visualize the semantic features that maximally drive each of several regions-of-interest in the human brain.
12.On Modality Bias Recognition and Reduction ⬇️
Making each modality in multi-modal data contribute is of vital importance to learning a versatile multi-modal model. Existing methods, however, are often dominated by one or few of modalities during model training, resulting in sub-optimal performance. In this paper, we refer to this problem as modality bias and attempt to study it in the context of multi-modal classification systematically and comprehensively. After stepping into several empirical analysis, we recognize that one modality affects the model prediction more just because this modality has a spurious correlation with instance labels. In order to primarily facilitate the evaluation on the modality bias problem, we construct two datasets respectively for the colored digit recognition and video action recognition tasks in line with the Out-of-Distribution (OoD) protocol. Collaborating with the benchmarks in the visual question answering task, we empirically justify the performance degradation of the existing methods on these OoD datasets, which serves as evidence to justify the modality bias learning. In addition, to overcome this problem, we propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned according to the training set statistics. Thereafter, we apply this method on eight baselines in total to test its effectiveness. From the results on four datasets regarding the above three tasks, our method yields remarkable performance improvements compared with the baselines, demonstrating its superiority on reducing the modality bias problem.
13.Improving Amharic Handwritten Word Recognition Using Auxiliary Task ⬇️
Amharic is one of the official languages of the Federal Democratic Republic of Ethiopia. It is one of the languages that use an Ethiopic script which is derived from Gee'z, ancient and currently a liturgical language. Amharic is also one of the most widely used literature-rich languages of Ethiopia. There are very limited innovative and customized research works in Amharic optical character recognition (OCR) in general and Amharic handwritten text recognition in particular. In this study, Amharic handwritten word recognition will be investigated. State-of-the-art deep learning techniques including convolutional neural networks together with recurrent neural networks and connectionist temporal classification (CTC) loss were used to make the recognition in an end-to-end fashion. More importantly, an innovative way of complementing the loss function using the auxiliary task from the row-wise similarities of the Amharic alphabet was tested to show a significant recognition improvement over a baseline method. Such findings will promote innovative problem-specific solutions as well as will open insight to a generalized solution that emerges from problem-specific domains.
14.Deep Dirichlet uncertainty for unsupervised out-of-distribution detection of eye fundus photographs in glaucoma screening ⬇️
The development of automatic tools for early glaucoma diagnosis with color fundus photographs can significantly reduce the impact of this disease. However, current state-of-the-art solutions are not robust to real-world scenarios, providing over-confident predictions for out-of-distribution cases. With this in mind, we propose a model based on the Dirichlet distribution that allows to obtain class-wise probabilities together with an uncertainty estimation without exposure to out-of-distribution cases. We demonstrate our approach on the AIROGS challenge. At the start of the final test phase (8 Feb. 2022), our method had the highest average score among all submissions.
15.Joint Answering and Explanation for Visual Commonsense Reasoning ⬇️
Visual Commonsense Reasoning (VCR), deemed as one challenging extension of the Visual Question Answering (VQA), endeavors to pursue a more high-level visual comprehension. It is composed of two indispensable processes: question answering over a given image and rationale inference for answer explanation. Over the years, a variety of methods tackling VCR have advanced the performance on the benchmark dataset. Despite significant as these methods are, they often treat the two processes in a separate manner and hence decompose the VCR into two irrelevant VQA instances. As a result, the pivotal connection between question answering and rationale inference is interrupted, rendering existing efforts less faithful on visual reasoning. To empirically study this issue, we perform some in-depth explorations in terms of both language shortcuts and generalization capability to verify the pitfalls of this treatment. Based on our findings, in this paper, we present a plug-and-play knowledge distillation enhanced framework to couple the question answering and rationale inference processes. The key contribution is the introduction of a novel branch, which serves as the bridge to conduct processes connecting. Given that our framework is model-agnostic, we apply it to the existing popular baselines and validate its effectiveness on the benchmark dataset. As detailed in the experimental results, when equipped with our framework, these baselines achieve consistent and significant performance improvements, demonstrating the viability of processes coupling, as well as the superiority of the proposed framework.
16.LF-VIO: A Visual-Inertial-Odometry Framework for Large Field-of-View Cameras with Negative Plane ⬇️
Visual-inertial-odometry has attracted extensive attention in the field of autonomous driving and robotics. The size of Field of View (FoV) plays an important role in Visual-Odometry (VO) and Visual-Inertial-Odometry (VIO), as a large FoV enables to perceive a wide range of surrounding scene elements and features. However, when the field of the camera reaches the negative half plane, one cannot simply use [u,v,1]^T to represent the image feature points anymore. To tackle this issue, we propose LF-VIO, a real-time VIO framework for cameras with extremely large FoV. We leverage a three-dimensional vector with unit length to represent feature points, and design a series of algorithms to overcome this challenge. To address the scarcity of panoramic visual odometry datasets with ground-truth location and pose, we present the PALVIO dataset, collected with a Panoramic Annular Lens (PAL) system with an entire FoV of 360x(40-120) degrees and an IMU sensor. With a comprehensive variety of experiments, the proposed LF-VIO is verified on both the established PALVIO benchmark and a public fisheye camera dataset with a FoV of 360x(0-93.5) degrees. LF-VIO outperforms state-of-the-art visual-inertial-odometry methods. Our dataset and code are made publicly available at this https URL
17.Active Learning for Point Cloud Semantic Segmentation via Spatial-Structural Diversity Reasoning ⬇️
The expensive annotation cost is notoriously known as a main constraint for the development of the point cloud semantic segmentation technique. In this paper, we propose a novel active learning-based method to tackle this problem. Dubbed SSDR-AL, our method groups the original point clouds into superpoints and selects the most informative and representative ones for label acquisition. We achieve the selection mechanism via a graph reasoning network that considers both the spatial and structural diversity of the superpoints. To deploy SSDR-AL in a more practical scenario, we design a noise aware iterative labeling scheme to confront the "noisy annotation" problem introduced by previous dominant labeling methods in superpoints. Extensive experiments on two point cloud benchmarks demonstrate the effectiveness of SSDR-AL in the semantic segmentation task. Particularly, SSDR-AL significantly outperforms the baseline method when the labeled sets are small, where SSDR-AL requires only
$5.7%$ and$1.9%$ annotation costs to achieve the performance of$90%$ fully supervised learning on S3DIS and Semantic3D datasets, respectively.
18.An exploration of the performances achievable by combining unsupervised background subtraction algorithms ⬇️
Background subtraction (BGS) is a common choice for performing motion detection in video. Hundreds of BGS algorithms are released every year, but combining them to detect motion remains largely unexplored. We found that combination strategies allow to capitalize on this massive amount of available BGS algorithms, and offer significant space for performance improvement. In this paper, we explore sets of performances achievable by 6 strategies combining, pixelwise, the outputs of 26 unsupervised BGS algorithms, on the CDnet 2014 dataset, both in the ROC space and in terms of the F1 score. The chosen strategies are representative for a large panel of strategies, including both deterministic and non-deterministic ones, voting and learning. In our experiments, we compare our results with the state-of-the-art combinations IUTIS-5 and CNN-SFC, and report six conclusions, among which the existence of an important gap between the performances of the individual algorithms and the best performances achievable by combining them.
19.6D Rotation Representation For Unconstrained Head Pose Estimation ⬇️
In this paper, we present a method for unconstrained end-to-end head pose estimation. We address the problem of ambiguous rotation labels by introducing the rotation matrix formalism for our ground truth data and propose a continuous 6D rotation matrix representation for efficient and robust direct regression. This way, our method can learn the full rotation appearance which is contrary to previous approaches that restrict the pose prediction to a narrow-angle for satisfactory results. In addition, we propose a geodesic distance-based loss to penalize our network with respect to the SO(3) manifold geometry. Experiments on the public AFLW2000 and BIWI datasets demonstrate that our proposed method significantly outperforms other state-of-the-art methods by up to 20%. We open-source our training and testing code along with our pre-trained models: this https URL.
20.Improved Dual Correlation Reduction Network ⬇️
Deep graph clustering, which aims to reveal the underlying graph structure and divide the nodes into different clusters without human annotations, is a fundamental yet challenging task. However, we observed that the existing methods suffer from the representation collapse problem and easily tend to encode samples with different classes into the same latent embedding. Consequently, the discriminative capability of nodes is limited, resulting in sub-optimal clustering performance. To address this problem, we propose a novel deep graph clustering algorithm termed Improved Dual Correlation Reduction Network (IDCRN) through improving the discriminative capability of samples. Specifically, by approximating the cross-view feature correlation matrix to an identity matrix, we reduce the redundancy between different dimensions of features, thus improving the discriminative capability of the latent space explicitly. Meanwhile, the cross-view sample correlation matrix is forced to approximate the designed clustering-refined adjacency matrix to guide the learned latent representation to recover the affinity matrix even across views, thus enhancing the discriminative capability of features implicitly. Moreover, we avoid the collapsed representation caused by the over-smoothing issue in Graph Convolutional Networks (GCNs) through an introduced propagation regularization term, enabling IDCRN to capture the long-range information with the shallow network structure. Extensive experimental results on six benchmarks have demonstrated the effectiveness and the efficiency of IDCRN compared to the existing state-of-the-art deep graph clustering algorithms.
21.A Novel Hand Gesture Detection and Recognition system based on ensemble-based Convolutional Neural Network ⬇️
Nowadays, hand gesture recognition has become an alternative for human-machine interaction. It has covered a large area of applications like 3D game technology, sign language interpreting, VR (virtual reality) environment, and robotics. But detection of the hand portion has become a challenging task in computer vision and pattern recognition communities. Deep learning algorithm like convolutional neural network (CNN) architecture has become a very popular choice for classification tasks, but CNN architectures suffer from some problems like high variance during prediction, overfitting problem and also prediction errors. To overcome these problems, an ensemble of CNN-based approaches is presented in this paper. Firstly, the gesture portion is detected by using the background separation method based on binary thresholding. After that, the contour portion is extracted, and the hand region is segmented. Then, the images have been resized and fed into three individual CNN models to train them in parallel. In the last part, the output scores of CNN models are averaged to construct an optimal ensemble model for the final prediction. Two publicly available datasets (labeled as Dataset-1 and Dataset-2) containing infrared images and one self-constructed dataset have been used to validate the proposed system. Experimental results are compared with the existing state-of-the-art approaches, and it is observed that our proposed ensemble model outperforms other existing proposed methods.
22.TeachAugment: Data Augmentation Optimization Using Teacher Knowledge ⬇️
Optimization of image transformation functions for the purpose of data augmentation has been intensively studied. In particular, adversarial data augmentation strategies, which search augmentation maximizing task loss, show significant improvement in the model generalization for many tasks. However, the existing methods require careful parameter tuning to avoid excessively strong deformations that take away image features critical for acquiring generalization. In this paper, we propose a data augmentation optimization method based on the adversarial strategy called TeachAugment, which can produce informative transformed images to the model without requiring careful tuning by leveraging a teacher model. Specifically, the augmentation is searched so that augmented images are adversarial for the target model and recognizable for the teacher model. We also propose data augmentation using neural networks, which simplifies the search space design and allows for updating of the data augmentation using the gradient method. We show that TeachAugment outperforms existing methods in experiments of image classification, semantic segmentation, and unsupervised representation learning tasks.
23.RRL:Regional Rotation Layer in Convolutional Neural Networks ⬇️
Convolutional Neural Networks (CNNs) perform very well in image classification and object detection in recent years, but even the most advanced models have limited rotation invariance. Known solutions include the enhancement of training data and the increase of rotation invariance by globally merging the rotation equivariant features. These methods either increase the workload of training or increase the number of model parameters. To address this problem, this paper proposes a module that can be inserted into the existing networks, and directly incorporates the rotation invariance into the feature extraction layers of the CNNs. This module does not have learnable parameters and will not increase the complexity of the model. At the same time, only by training the upright data, it can perform well on the rotated testing set. These advantages will be suitable for fields such as biomedicine and astronomy where it is difficult to obtain upright samples or the target has no directionality. Evaluate our module with LeNet-5, ResNet-18 and tiny-yolov3, we get impressive results.
24.Implicit Optimizer for Diffeomorphic Image Registration ⬇️
Diffeomorphic image registration is the underlying technology in medical image processing which enables the invertibility and point-to-point correspondence. Recently, numerous learning-based methods utilizing convolutional neural networks (CNNs) have been proposed for registration problems. Compared with the speed boosting, accuracy improvement brought by the complicated CNN-based methods is minor. To tackle this problem, we propose a rapid and accurate Implicit Optimizer for Diffeomorphic Image Registration (IDIR) which utilizes the Deep Implicit Function as the neural velocity field (NVF) whose input is the point coordinate p and output is velocity vector at that point v. To reduce the huge memory consumption brought by NVF for 3D volumes, a sparse sampling is employed to the framework. We evaluate our method on two 3D large-scale MR brain scan datasets, the results show that our proposed method provides faster and better registration results than conventional image registration approaches and outperforms the learning-based methods by a significant margin while maintaining the desired diffeomorphic properties.
25.Learn From the Past: Experience Ensemble Knowledge Distillation ⬇️
Traditional knowledge distillation transfers "dark knowledge" of a pre-trained teacher network to a student network, and ignores the knowledge in the training process of the teacher, which we call teacher's experience. However, in realistic educational scenarios, learning experience is often more important than learning results. In this work, we propose a novel knowledge distillation method by integrating the teacher's experience for knowledge transfer, named experience ensemble knowledge distillation (EEKD). We save a moderate number of intermediate models from the training process of the teacher model uniformly, and then integrate the knowledge of these intermediate models by ensemble technique. A self-attention module is used to adaptively assign weights to different intermediate models in the process of knowledge transfer. Three principles of constructing EEKD on the quality, weights and number of intermediate models are explored. A surprising conclusion is found that strong ensemble teachers do not necessarily produce strong students. The experimental results on CIFAR-100 and ImageNet show that EEKD outperforms the mainstream knowledge distillation methods and achieves the state-of-the-art. In particular, EEKD even surpasses the standard ensemble distillation on the premise of saving training cost.
26.Understanding Adversarial Robustness from Feature Maps of Convolutional Layers ⬇️
The adversarial robustness of a neural network mainly relies on two factors, one is the feature representation capacity of the network, and the other is its resistance ability to perturbations. In this paper, we study the anti-perturbation ability of the network from the feature maps of convolutional layers. Our theoretical analysis discovers that larger convolutional features before average pooling can contribute to better resistance to perturbations, but the conclusion is not true for max pooling. Based on the theoretical findings, we present two feasible ways to improve the robustness of existing neural networks. The proposed approaches are very simple and only require upsampling the inputs or modifying the stride configuration of convolution operators. We test our approaches on several benchmark neural network architectures, including AlexNet, VGG16, RestNet18 and PreActResNet18, and achieve non-trivial improvements on both natural accuracy and robustness under various attacks. Our study brings new insights into the design of robust neural networks. The code is available at \url{this https URL}.
27.Efficient Video Segmentation Models with Per-frame Inference ⬇️
Most existing real-time deep models trained with each frame independently may produce inconsistent results across the temporal axis when tested on a video sequence. A few methods take the correlations in the video sequence into account,e.g., by propagating the results to the neighboring frames using optical flow or extracting frame representations using multi-frame information, which may lead to inaccurate results or unbalanced latency. In this work, we focus on improving the temporal consistency without introducing computation overhead in inference. To this end, we perform inference at each frame. Temporal consistency is achieved by learning from video frames with extra constraints during the training phase. introduced for inference. We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods. On the task of semantic video segmentation, weighing among accuracy, temporal smoothness, and efficiency, our proposed method outperforms keyframe-based methods and a few baseline methods that are trained with each frame independently, on datasets including Cityscapes, Camvid, and 300VW-Mask. We further apply our training method to video instance segmentation on YouTubeVISand develop an application of portrait matting in video sequences, by segmenting temporally consistent instance-level trimaps across frames. Experiments show superior qualitative and quantitative results. Code is available at: this https URL.
28.Analyzing Human Observer Ability in Morphing Attack Detection -- Where Do We Stand? ⬇️
While several works have studied the vulnerability of automated FRS and have proposed morphing attack detection (MAD) methods, very few have focused on studying the human ability to detect morphing attacks. The examiner/observer's face morph detection ability is based on their observation, domain knowledge, experience, and familiarity with the problem, and no works report the detailed findings from observers who check identity documents as a part of their everyday professional life. This work creates a new benchmark database of realistic morphing attacks from 48 unique subjects leading to 400 morphed images presented to the observers in a Differential-MAD (D-MAD) setting. Unlike the existing databases, the newly created morphed image database has been created with careful considerations to age, gender and ethnicity to create realistic morph attacks. Further, unlike the previous works, we also capture ten images from Automated Border Control (ABC) gates to mimic the realistic D-MAD setting leading to 400 probe images in border crossing scenarios. The newly created dataset is further used to study the ability of human observers' ability to detect morphed images. In addition, a new dataset of 180 morphed images is also created using the FRGCv2 dataset under the Single Image-MAD (S-MAD) setting. Further, to benchmark the human ability in detecting morphs, a new evaluation platform is created to conduct S-MAD and D-MAD analysis. The benchmark study employs 469 observers for D-MAD and 410 observers for S-MAD who are primarily governmental employees from more than 40 countries. The analysis provides interesting insights and points to expert observers' missing competence and failure to detect a considerable amount of morphing attacks. Human observers tend to detect morphed images to a lower accuracy as compared to the automated MAD algorithms evaluated in this work.
29.Optimal channel selection with discrete QCQP ⬇️
Reducing the high computational cost of large convolutional neural networks is crucial when deploying the networks to resource-constrained environments. We first show the greedy approach of recent channel pruning methods ignores the inherent quadratic coupling between channels in the neighboring layers and cannot safely remove inactive weights during the pruning procedure. Furthermore, due to these inactive weights, the greedy methods cannot guarantee to satisfy the given resource constraints and deviate with the true objective. In this regard, we propose a novel channel selection method that optimally selects channels via discrete QCQP, which provably prevents any inactive weights and guarantees to meet the resource constraints tightly in terms of FLOPs, memory usage, and network size. We also propose a quadratic model that accurately estimates the actual inference time of the pruned network, which allows us to adopt inference time as a resource constraint option. Furthermore, we generalize our method to extend the selection granularity beyond channels and handle non-sequential connections. Our experiments on CIFAR-10 and ImageNet show our proposed pruning method outperforms other fixed-importance channel pruning methods on various network architectures.
30.Fourier-Based Augmentations for Improved Robustness and Uncertainty Calibration ⬇️
Diverse data augmentation strategies are a natural approach to improving robustness in computer vision models against unforeseen shifts in data distribution. However, the ability to tailor such strategies to inoculate a model against specific classes of corruptions or attacks -- without incurring substantial losses in robustness against other classes of corruptions -- remains elusive. In this work, we successfully harden a model against Fourier-based attacks, while producing superior-to-AugMix accuracy and calibration results on both the CIFAR-10-C and CIFAR-100-C datasets; classification error is reduced by over ten percentage points for some high-severity noise and digital-type corruptions. We achieve this by incorporating Fourier-basis perturbations in the AugMix image-augmentation framework. Thus we demonstrate that the AugMix framework can be tailored to effectively target particular distribution shifts, while boosting overall model robustness.
31.Learning Transferable Reward for Query Object Localization with Policy Adaptation ⬇️
We propose a reinforcement learning based approach to \emph{query object localization}, for which an agent is trained to localize objects of interest specified by a small exemplary set. We learn a transferable reward signal formulated using the exemplary set by ordinal metric learning. Our proposed method enables test-time policy adaptation to new environments where the reward signals are not readily available, and outperforms fine-tuning approaches that are limited to annotated images. In addition, the transferable reward allows repurposing the trained agent from one specific class to another class. Experiments on corrupted MNIST, CU-Birds, and COCO datasets demonstrate the effectiveness of our approach.
32.Highly-Efficient Binary Neural Networks for Visual Place Recognition ⬇️
VPR is a fundamental task for autonomous navigation as it enables a robot to localize itself in the workspace when a known location is detected. Although accuracy is an essential requirement for a VPR technique, computational and energy efficiency are not less important for real-world applications. CNN-based techniques archive state-of-the-art VPR performance but are computationally intensive and energy demanding. Binary neural networks (BNN) have been recently proposed to address VPR efficiently. Although a typical BNN is an order of magnitude more efficient than a CNN, its processing time and energy usage can be further improved. In a typical BNN, the first convolution is not completely binarized for the sake of accuracy. Consequently, the first layer is the slowest network stage, requiring a large share of the entire computational effort. This paper presents a class of BNNs for VPR that combines depthwise separable factorization and binarization to replace the first convolutional layer to improve computational and energy efficiency. Our best model achieves state-of-the-art VPR performance while spending considerably less time and energy to process an image than a BNN using a non-binary convolution as a first stage.
33.On Monocular Depth Estimation and Uncertainty Quantification using Classification Approaches for Regression ⬇️
Monocular depth is important in many tasks, such as 3D reconstruction and autonomous driving. Deep learning based models achieve state-of-the-art performance in this field. A set of novel approaches for estimating monocular depth consists of transforming the regression task into a classification one. However, there is a lack of detailed descriptions and comparisons for Classification Approaches for Regression (CAR) in the community and no in-depth exploration of their potential for uncertainty estimation. To this end, this paper will introduce a taxonomy and summary of CAR approaches, a new uncertainty estimation solution for CAR, and a set of experiments on depth accuracy and uncertainty quantification for CAR-based models on KITTI dataset. The experiments reflect the differences in the portability of various CAR methods on two backbones. Meanwhile, the newly proposed method for uncertainty estimation can outperform the ensembling method with only one forward propagation.
34.Instantaneous Physiological Estimation using Video Transformers ⬇️
Video-based physiological signal estimation has been limited primarily to predicting episodic scores in windowed intervals. While these intermittent values are useful, they provide an incomplete picture of patients' physiological status and may lead to late detection of critical conditions. We propose a video Transformer for estimating instantaneous heart rate and respiration rate from face videos. Physiological signals are typically confounded by alignment errors in space and time. To overcome this, we formulated the loss in the frequency domain. We evaluated the method on the large scale Vision-for-Vitals (V4V) benchmark. It outperformed both shallow and deep learning based methods for instantaneous respiration rate estimation. In the case of heart-rate estimation, it achieved an instantaneous-MAE of 13.0 beats-per-minute.
35.StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Translation ⬇️
Generating images that fit a given text description using machine learning has improved greatly with the release of technologies such as the CLIP image-text encoder model; however, current methods lack artistic control of the style of image to be generated. We present an approach for generating styled drawings for a given text description where a user can specify a desired drawing style using a sample image. Inspired by a theory in art that style and content are generally inseparable during the creative process, we propose a coupled approach, known here as StyleCLIPDraw, whereby the drawing is generated by optimizing for style and content simultaneously throughout the process as opposed to applying style transfer after creating content in a sequence. Based on human evaluation, the styles of images generated by StyleCLIPDraw are strongly preferred to those by the sequential approach. Although the quality of content generation degrades for certain styles, overall considering both content \textit{and} style, StyleCLIPDraw is found far more preferred, indicating the importance of style, look, and feel of machine generated images to people as well as indicating that style is coupled in the drawing process itself. Our code (this https URL), a demonstration (this https URL), and style evaluation data (this https URL) are publicly available.
36.RescueNet: A High Resolution UAV Semantic Segmentation Benchmark Dataset for Natural Disaster Damage Assessment ⬇️
Due to climate change, we can observe a recent surge of natural disasters all around the world. These disasters are causing disastrous impact on both nature and human lives. Economic losses are getting greater due to the hurricanes. Quick and prompt response of the rescue teams are crucial in saving human lives and reducing economic cost. Deep learning based computer vision techniques can help in scene understanding, and help rescue teams with precise damage assessment. Semantic segmentation, an active research area in computer vision, can put labels to each pixel of an image, and therefore can be a valuable arsenal in the effort of reducing the impacts of hurricanes. Unfortunately, available datasets for natural disaster damage assessment lack detailed annotation of the affected areas, and therefore do not support the deep learning models in total damage assessment. To this end, we introduce the RescueNet, a high resolution post disaster dataset, for semantic segmentation to assess damages after natural disasters. The RescueNet consists of post disaster images collected after Hurricane Michael. The data is collected using Unmanned Aerial Vehicles (UAVs) from several areas impacted by the hurricane. The uniqueness of the RescueNet comes from the fact that this dataset provides high resolution post-disaster images and comprehensive annotation of each image. While most of the existing dataset offer annotation of only part of the scene, like building, road, or river, RescueNet provides pixel level annotation of all the classes including building, road, pool, tree, debris, and so on. We further analyze the usefulness of the dataset by implementing state-of-the-art segmentation models on the RescueNet. The experiments demonstrate that our dataset can be valuable in further improvement of the existing methodologies for natural disaster damage assessment.
37.Learning to Identify Perceptual Bugs in 3D Video Games ⬇️
Automated Bug Detection (ABD) in video games is composed of two distinct but complementary problems: automated game exploration and bug identification. Automated game exploration has received much recent attention, spurred on by developments in fields such as reinforcement learning. The complementary problem of identifying the bugs present in a player's experience has for the most part relied on the manual specification of rules. Although it is widely recognised that many bugs of interest cannot be identified with such methods, little progress has been made in this direction. In this work we show that it is possible to identify a range of perceptual bugs using learning-based methods by making use of only the rendered game screen as seen by the player. To support our work, we have developed World of Bugs (WOB) an open platform for testing ABD methods in 3D game environments.
38.Online handwriting, signature and touch dynamics: tasks and potential applications in the field of security and health ⬇️
Background: An advantageous property of behavioural signals ,e.g. handwriting, in contrast to morphological ones, such as iris, fingerprint, hand geometry, etc., is the possibility to ask a user for a very rich amount of different tasks. Methods: This article summarises recent findings and applications of different handwriting and drawing tasks in the field of security and health. More specifically, it is focused on on-line handwriting and hand-based interaction, i.e. signals that utilise a digitizing device (specific devoted or general-purpose tablet/smartphone) during the realization of the tasks. Such devices permit the acquisition of on-surface dynamics as well as in-air movements in time, thus providing complex and richer information when compared to the conventional pen and paper method. Conclusions: Although the scientific literature reports a wide range of tasks and applications, in this paper, we summarize only those providing competitive results (e.g. in terms of discrimination power) and having a significant impact in the field.
39.Predicting 4D Liver MRI for MR-guided Interventions ⬇️
Organ motion poses an unresolved challenge in image-guided interventions. In the pursuit of solving this problem, the research field of time-resolved volumetric magnetic resonance imaging (4D MRI) has evolved. However, current techniques are unsuitable for most interventional settings because they lack sufficient temporal and/or spatial resolution or have long acquisition times. In this work, we propose a novel approach for real-time, high-resolution 4D MRI with large fields of view for MR-guided interventions. To this end, we trained a convolutional neural network (CNN) end-to-end to predict a 3D liver MRI that correctly predicts the liver's respiratory state from a live 2D navigator MRI of a subject. Our method can be used in two ways: First, it can reconstruct near real-time 4D MRI with high quality and high resolution (209x128x128 matrix size with isotropic 1.8mm voxel size and 0.6s/volume) given a dynamic interventional 2D navigator slice for guidance during an intervention. Second, it can be used for retrospective 4D reconstruction with a temporal resolution of below 0.2s/volume for motion analysis and use in radiation therapy. We report a mean target registration error (TRE) of 1.19 $\pm$0.74mm, which is below voxel size. We compare our results with a state-of-the-art retrospective 4D MRI reconstruction. Visual evaluation shows comparable quality. We show that small training sizes with short acquisition times down to 2min can already achieve promising results and 24min are sufficient for high quality results. Because our method can be readily combined with earlier methods, acquisition time can be further decreased while also limiting quality loss. We show that an end-to-end, deep learning formulation is highly promising for 4D MRI reconstruction.
40.Local Intensity Order Transformation for Robust Curvilinear Object Segmentation ⬇️
Segmentation of curvilinear structures is important in many applications, such as retinal blood vessel segmentation for early detection of vessel diseases and pavement crack segmentation for road condition evaluation and maintenance. Currently, deep learning-based methods have achieved impressive performance on these tasks. Yet, most of them mainly focus on finding powerful deep architectures but ignore capturing the inherent curvilinear structure feature (e.g., the curvilinear structure is darker than the context) for a more robust representation. In consequence, the performance usually drops a lot on cross-datasets, which poses great challenges in practice. In this paper, we aim to improve the generalizability by introducing a novel local intensity order transformation (LIOT). Specifically, we transfer a gray-scale image into a contrast-invariant four-channel image based on the intensity order between each pixel and its nearby pixels along with the four (horizontal and vertical) directions. This results in a representation that preserves the inherent characteristic of the curvilinear structure while being robust to contrast changes. Cross-dataset evaluation on three retinal blood vessel segmentation datasets demonstrates that LIOT improves the generalizability of some state-of-the-art methods. Additionally, the cross-dataset evaluation between retinal blood vessel segmentation and pavement crack segmentation shows that LIOT is able to preserve the inherent characteristic of curvilinear structure with large appearance gaps. An implementation of the proposed method is available at this https URL.
41.An Ensemble Approach for Patient Prognosis of Head and Neck Tumor Using Multimodal Data ⬇️
Accurate prognosis of a tumor can help doctors provide a proper course of treatment and, therefore, save the lives of many. Traditional machine learning algorithms have been eminently useful in crafting prognostic models in the last few decades. Recently, deep learning algorithms have shown significant improvement when developing diagnosis and prognosis solutions to different healthcare problems. However, most of these solutions rely solely on either imaging or clinical data. Utilizing patient tabular data such as demographics and patient medical history alongside imaging data in a multimodal approach to solve a prognosis task has started to gain more interest recently and has the potential to create more accurate solutions. The main issue when using clinical and imaging data to train a deep learning model is to decide on how to combine the information from these sources. We propose a multimodal network that ensembles deep multi-task logistic regression (MTLR), Cox proportional hazard (CoxPH) and CNN models to predict prognostic outcomes for patients with head and neck tumors using patients' clinical and imaging (CT and PET) data. Features from CT and PET scans are fused and then combined with patients' electronic health records for the prediction. The proposed model is trained and tested on 224 and 101 patient records respectively. Experimental results show that our proposed ensemble solution achieves a C-index of 0.72 on The HECKTOR test set that saved us the first place in prognosis task of the HECKTOR challenge. The full implementation based on PyTorch is available on \url{this https URL}.
42.Faithful learning with sure data for lung nodule diagnosis ⬇️
Recent evolution in deep learning has proven its value for CT-based lung nodule classification. Most current techniques are intrinsically black-box systems, suffering from two generalizability issues in clinical practice. First, benign-malignant discrimination is often assessed by human observers without pathologic diagnoses at the nodule level. We termed these data as "unsure data". Second, a classifier does not necessarily acquire reliable nodule features for stable learning and robust prediction with patch-level labels during learning. In this study, we construct a sure dataset with pathologically-confirmed labels and propose a collaborative learning framework to facilitate sure nodule classification by integrating unsure data knowledge through nodule segmentation and malignancy score regression. A loss function is designed to learn reliable features by introducing interpretability constraints regulated with nodule segmentation maps. Furthermore, based on model inference results that reflect the understanding from both machine and experts, we explore a new nodule analysis method for similar historical nodule retrieval and interpretable diagnosis. Detailed experimental results demonstrate that our approach is beneficial for achieving improved performance coupled with faithful model reasoning for lung cancer prediction. Extensive cross-evaluation results further illustrate the effect of unsure data for deep-learning-based methods in lung nodule classification.
43.Monogenic Wavelet Scattering Network for Texture Image Classification ⬇️
The scattering transform network (STN), which has a similar structure as that of a popular convolutional neural network except its use of predefined convolution filters and a small number of layers, can generates a robust representation of an input signal relative to small deformations. We propose a novel Monogenic Wavelet Scattering Network (MWSN) for 2D texture image classification through a cascade of monogenic wavelet filtering with nonlinear modulus and averaging operators by replacing the 2D Morlet wavelet filtering in the standard STN. Our MWSN can extract useful hierarchical and directional features with interpretable coefficients, which can be further compressed by PCA and fed into a classifier. Using the CUReT texture image database, we demonstrate the superior performance of our MWSN over the standard STN. This performance improvement can be explained by the natural extension of 1D analyticity to 2D monogenicity.
44.Structure-aware Unsupervised Tagged-to-Cine MRI Synthesis with Self Disentanglement ⬇️
Cycle reconstruction regularized adversarial training -- e.g., CycleGAN, DiscoGAN, and DualGAN -- has been widely used for image style transfer with unpaired training data. Several recent works, however, have shown that local distortions are frequent, and structural consistency cannot be guaranteed. Targeting this issue, prior works usually relied on additional segmentation or consistent feature extraction steps that are task-specific. To counter this, this work aims to learn a general add-on structural feature extractor, by explicitly enforcing the structural alignment between an input and its synthesized image. Specifically, we propose a novel input-output image patches self-training scheme to achieve a disentanglement of underlying anatomical structures and imaging modalities. The translator and structure encoder are updated, following an alternating training protocol. In addition, the information w.r.t. imaging modality can be eliminated with an asymmetric adversarial game. We train, validate, and test our network on 1,768, 416, and 1,560 unpaired subject-independent slices of tagged and cine magnetic resonance imaging from a total of twenty healthy subjects, respectively, demonstrating superior performance over competing methods.
45.Provable Stochastic Optimization for Global Contrastive Learning: Small Batch Does Not Harm Performance ⬇️
In this paper, we study contrastive learning from an optimization perspective, aiming to analyze and address a fundamental issue of existing contrastive learning methods that either rely on a large batch size or a large dictionary. We consider a global objective for contrastive learning, which contrasts each positive pair with all negative pairs for an anchor point. From the optimization perspective, we explain why existing methods such as SimCLR requires a large batch size in order to achieve a satisfactory result. In order to remove such requirement, we propose a memory-efficient Stochastic Optimization algorithm for solving the Global objective of Contrastive Learning of Representations, named SogCLR. We show that its optimization error is negligible under a reasonable condition after a sufficient number of iterations or is diminishing for a slightly different global contrastive objective. Empirically, we demonstrate that on ImageNet with a batch size 256, SogCLR achieves a performance of 69.4% for top-1 linear evaluation accuracy using ResNet-50, which is on par with SimCLR (69.3%) with a large batch size 8,192. We also attempt to show that the proposed optimization technique is generic and can be applied to solving other contrastive losses, e.g., two-way contrastive losses for bimodal contrastive learning.
46.TwistSLAM: Constrained SLAM in Dynamic Environment ⬇️
Moving objects are present in most scenes of our life. However they can be very problematic for classical SLAM algorithms that assume the scene to be rigid. This assumption limits the applicability of those algorithms as they are unable to accurately estimate the camera pose and world structure in many scenarios. Some SLAM systems have been proposed to detect and mask out dynamic objects, making the static scene assumption valid. However this information can allow the system to track objects within the scene, while tracking the camera, which can be crucial for some applications. In this paper we present TwistSLAM a semantic, dynamic, stereo SLAM system that can track dynamic objects in the scene. Our algorithm creates clusters of points according to their semantic class. It uses the static parts of the environment to robustly localize the camera and tracks the remaining objects. We propose a new formulation for the tracking and the bundle adjustment to take in account the characteristics of mechanical joints between clusters to constrain and improve their pose estimation. We evaluate our approach on several sequences from a public dataset and show that we improve camera and object tracking compared to state of the art.
47.Time Efficient Training of Progressive Generative Adversarial Network using Depthwise Separable Convolution and Super Resolution Generative Adversarial Network ⬇️
Generative Adversarial Networks have been employed successfully to generate high-resolution augmented images of size 1024^2. Although the augmented images generated are unprecedented, the training time of the model is exceptionally high. Conventional GAN requires training of both Discriminator as well as the Generator. In Progressive GAN, which is the current state-of-the-art GAN for image augmentation, instead of training the GAN all at once, a new concept of progressing growing of Discriminator and Generator simultaneously, was proposed. Although the lower stages such as 4x4 and 8x8 train rather quickly, the later stages consume a tremendous amount of time which could take days to finish the model training. In our paper, we propose a novel pipeline that combines Progressive GAN with slight modifications and Super Resolution GAN. Super Resolution GAN up samples low-resolution images to high-resolution images which can prove to be a useful resource to reduce the training time exponentially.
48.Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph ⬇️
This paper addresses the unsupervised learning of content-style decomposed representation. We first give a definition of style and then model the content-style representation as a token-level bipartite graph. An unsupervised framework, named Retriever, is proposed to learn such representations. First, a cross-attention module is employed to retrieve permutation invariant (P.I.) information, defined as style, from the input data. Second, a vector quantization (VQ) module is used, together with man-induced constraints, to produce interpretable content tokens. Last, an innovative link attention module serves as the decoder to reconstruct data from the decomposed content and style, with the help of the linking keys. Being modal-agnostic, the proposed Retriever is evaluated in both speech and image domains. The state-of-the-art zero-shot voice conversion performance confirms the disentangling ability of our framework. Top performance is also achieved in the part discovery task for images, verifying the interpretability of our representation. In addition, the vivid part-based style transfer quality demonstrates the potential of Retriever to support various fascinating generative tasks. Project page at this https URL.