Foreground detection in videos captured by moving cameras
Source code for the paper "Real-Time Hysteresis Foreground Detection in Video Captured by Moving Cameras", presented at the 2022 IEEE International Conference on Imaging Systems and Techniques (IST 2022), June 21-23, 2022 (link)
- The program has been tested on Windows 10 with OpenCV 3.4.1 in Release x64 mode; it should work with any OpenCV 3.x version.
- The .exe file takes one argument: the path to the video file (see the usage example after the setup steps below)
- The program has been tested with the DAVIS and SCBU datasets
- The SCBU data can also be found here
- set up Visual Studio with OpenCV (guide)
- add a folder called "results" next to main.cpp in the project directory
- set the desired parameters in config.xml and also in DCFG.h
- add the path to the video file in Visual Studio > Project > Properties > Debugging > Command Arguments
- run the program
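For example, assuming the build produces an executable named `DCFG.exe` (the actual name depends on your project settings), the program can also be launched from a terminal:

```
DCFG.exe C:\path\to\video.mp4
```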
Foreground detection is an important first step in video analytics. While stationary cameras facilitate foreground detection thanks to the apparent motion between the moving foreground and the still background, moving cameras make the task more challenging because both the foreground and the background appear to move in the video. To tackle this challenging problem, an innovative real-time foreground detection method is presented that models the foreground and the background simultaneously and works for both moving and stationary cameras. In particular, each input video frame is first partitioned into a number of blocks. Then, assuming the background occupies the majority of each video frame, the iterative pyramidal implementation of the Lucas-Kanade optical flow approach is applied to the centers of the background blocks in order to estimate the global motion and compensate for the camera movements. Subsequently, each block in the background is modeled by a mixture of Gaussian distributions, and a separate Gaussian mixture model is constructed for the foreground in order to enhance the classification. However, errors in motion compensation can contaminate the foreground model with background values. The novel idea of the proposed method is to match a set of background samples to their corresponding block over the most recent frames in order to avoid contaminating the foreground model with background samples. The input values that do not fit either the statistical or the sample-based background model are used to update the foreground model. Finally, the foreground is detected by applying the Bayes classification technique to the major components in the background and foreground models, which removes false positives through a hysteresis effect. Experimental evaluations demonstrate the feasibility of the proposed method for foreground segmentation when applied to videos in public datasets.
The real-world applicability of current methods for foreground detection with moving cameras suffers from high computational requirements and/or low performance in classifying foreground and background. Here we apply spatial and temporal features to model the background and the foreground statistically and separately, in order to classify them in real time. Each block of the background is modeled by a mixture of Gaussian distributions (MOG) together with a set of values sampled randomly in the spatial and temporal domains. At each video frame, the Lucas-Kanade optical flow method is applied to the block centers in order to estimate the camera motion and find the corresponding locations between two adjacent frames. The global motion is then compensated by updating the background models of each block according to the values at its corresponding location in the previous frame. The foreground, on the other hand, is modeled by another MOG, which is updated by the input values that do not fit the background models.
The first observation in videos obtained by moving cameras is that the entire captured scene appears to be moving from the camera's perspective. By assuming that the background occupies the majority of the scene compared to the objects of interest, we can estimate the motion of the camera relative to the background. Afterwards, the estimated camera motion can be compensated by using the corresponding values in the previous frame to update the background models. The foreground can then be segmented using approaches similar to those used with stationary cameras. Here, we apply an MOG to model the entire foreground using the values that are not absorbed by the background models. The major components of the Gaussian mixture distributions in the background and foreground models are used for the final binary classification. The details of each step are described in this section.
In many scenarios the objects of interest occupy only a portion of each video frame, and the remaining majority is considered to be background.
Therefore, most point displacements between video frames are caused by the camera motion, which can be estimated by calculating the global motion.
For the sake of computational efficiency, and to account for spatial relationships, a block-based approach similar to prior work is applied: the input image is converted to grayscale and divided into a grid of equal-sized blocks.
The Kanade–Lucas–Tomasi (KLT) feature tracking approach is applied to the centers of the grid cells from the previous frame.
Then a homography matrix $H^{(t)}$ is obtained that warps the image pixels at frame $t-1$ to their corresponding locations at frame $t$, where each tracked point $p^{(t-1)}$ is mapped to $\tilde{p}^{(t)} = H^{(t)}\,p^{(t-1)}$ in homogeneous coordinates, and a reverse transformation matrix $\left(H^{(t)}\right)^{-1}$ maps the points back. The homography is solved by applying the RANSAC algorithm in order to remove outliers from the further calculations. Also, the center points of the blocks classified as foreground in the previous frame are excluded from this calculation, as they do not contribute to the camera motion.
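A minimal sketch of this step with the OpenCV API (the function and parameters such as `blockSize` are illustrative, not the repository's exact code):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Estimate the global motion between two grayscale frames by tracking
// the block centers with pyramidal Lucas-Kanade and fitting a
// homography with RANSAC to reject outliers.
cv::Mat estimateGlobalMotion(const cv::Mat& prevGray, const cv::Mat& currGray,
                             int blockSize = 24)
{
    // Centers of the grid blocks are the points to track. In the full
    // method, centers of blocks classified as foreground in the previous
    // frame would be excluded here.
    std::vector<cv::Point2f> prevPts;
    for (int y = blockSize / 2; y < prevGray.rows; y += blockSize)
        for (int x = blockSize / 2; x < prevGray.cols; x += blockSize)
            prevPts.emplace_back((float)x, (float)y);

    std::vector<cv::Point2f> currPts;
    std::vector<uchar> status;
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(prevGray, currGray, prevPts, currPts,
                             status, err, cv::Size(21, 21), 3);

    // Keep only the successfully tracked point pairs.
    std::vector<cv::Point2f> src, dst;
    for (size_t i = 0; i < status.size(); ++i)
        if (status[i]) { src.push_back(prevPts[i]); dst.push_back(currPts[i]); }

    if (src.size() < 4)  // not enough matches to fit a homography
        return cv::Mat::eye(3, 3, CV_64F);

    // RANSAC removes outliers such as points on moving objects.
    return cv::findHomography(src, dst, cv::RANSAC, 3.0);
}
```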
Each block of the image is modeled by a mixture of Gaussian distributions, and the model is updated at each video frame. In order to update the background models at each frame, we have to calculate the corresponding values in the warped background image of the previous frame. The mean and variance of the warped background model are calculated as a weighted sum of the neighboring models, where each weight is proportional to a rectangular overlap area, as in bilinear interpolation:
$$
\begin{gathered}
\tilde{\mu}_i^{(t-1)} = \sum_{k \in \mathcal{R}_i}\omega_k \mu_k^{(t-1)} \\
\tilde{\sigma}_i^{(t-1)} = \sum_{k \in \mathcal{R}_i}\omega_k \sigma_k^{(t-1)}
\end{gathered}
$$
where $\mathcal{R}_i$ denotes the set of neighboring blocks that overlap the warped location of block $i$, and $\omega_k$ is the bilinear interpolation weight of block $k$, proportional to the overlapping rectangular area.
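As an illustration, the four weights for a warped (non-integer) block location can be computed like a standard bilinear interpolation (a generic sketch, not necessarily the exact implementation):

```cpp
#include <cmath>

// Bilinear weights of the four blocks surrounding a warped block
// coordinate (bx, by), expressed in block units. Each neighbor's weight
// is proportional to the overlapping rectangular area.
void bilinearWeights(float bx, float by, float w[4])
{
    float fx = bx - std::floor(bx);   // fractional offsets
    float fy = by - std::floor(by);
    w[0] = (1.f - fx) * (1.f - fy);   // top-left block
    w[1] = fx * (1.f - fy);           // top-right block
    w[2] = (1.f - fx) * fy;           // bottom-left block
    w[3] = fx * fy;                   // bottom-right block
}
```

The warped mean of a block is then the sum of these weights times the means of the four overlapped blocks, and likewise for the variance.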
Since the camera may pan, there can be slight variations in illumination due to changes in the viewing angle and light direction.
The Gaussian model retains information from previous frames and may be slow to catch up with the pace of changing values at the borders of the video frames.
In order to make the model parameters adapt to these changes, a global variation factor $g^{(t)}$, capturing the frame-wide intensity variation, is added to the warped means when the models are updated at frame $t$:
$$
\begin{gathered}
\mu_k^{(t)} = \left(n_k^{(t-1)}\left(\tilde{\mu}_k^{(t-1)} + g^{(t)}\right) + M^{(t)}\right) / \left(n_k^{(t-1)} + 1\right) \\
\sigma_k^{(t)} = \left(n_k^{(t-1)}\tilde{\sigma}_k^{(t-1)} + V^{(t-1)}\right) / \left(n_k^{(t-1)} + 1\right) \\
n_k^{(t)} = n_k^{(t-1)} + 1 \\
\alpha_k^{(t)} = n_k^{(t)} \Big/ \sum_{j=1}^{K} n_j^{(t)}
\end{gathered}
$$
where $M$ and $V$ are the mean and the variance estimated from the input block values, $n_k$ is the number of times component $k$ has been updated, and $\alpha_k$ is the mixture weight of component $k$ among the $K$ components.
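These update equations translate directly into code; a minimal sketch, with an illustrative `Gaussian` struct and variable names:

```cpp
#include <vector>

// One Gaussian component of a block's background model.
struct Gaussian {
    double mu;     // mean
    double sigma;  // variance
    double n;      // number of times the component has been updated
    double alpha;  // mixture weight
};

// Update the matched component k with the block statistics of the
// current frame: M = input block mean, V = input block variance,
// g = global illumination variation factor. mu and sigma are assumed
// to hold the warped (bilinearly interpolated) values already.
void updateComponent(std::vector<Gaussian>& mog, int k,
                     double M, double V, double g)
{
    Gaussian& c = mog[k];
    c.mu    = (c.n * (c.mu + g) + M) / (c.n + 1.0);
    c.sigma = (c.n * c.sigma + V) / (c.n + 1.0);
    c.n    += 1.0;

    // Re-normalize the mixture weights: alpha_k = n_k / sum_j n_j.
    double total = 0.0;
    for (const Gaussian& gj : mog) total += gj.n;
    for (Gaussian& gj : mog) gj.alpha = gj.n / total;
}
```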
In the case of moving cameras, the objects of interest are usually present in the scene for a long time, as the camera is focused on them. Therefore, it is reasonable to model the values of the foreground objects throughout the video. An approach similar to the background modeling is applied for modeling the foreground, except that a single mixture of Gaussian distributions is used for all the foreground pixels. Also, instead of a single component, a number of components from the foreground model with the largest weights are considered to represent the foreground objects. This is because the foreground objects have multiple parts with different intensity values, and each major component in the foreground model represents one part of the foreground.
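For illustration, selecting the major foreground components by weight can be as simple as the following sketch (the struct and the count `m` are assumptions):

```cpp
#include <algorithm>
#include <vector>

struct Gaussian { double mu, sigma, alpha; };  // illustrative component

// Return the m foreground components with the largest mixture weights;
// each one is taken to represent one part of the foreground objects.
std::vector<Gaussian> majorComponents(std::vector<Gaussian> mog, size_t m)
{
    std::sort(mog.begin(), mog.end(),
              [](const Gaussian& a, const Gaussian& b) {
                  return a.alpha > b.alpha;  // descending weight
              });
    if (mog.size() > m) mog.resize(m);
    return mog;
}
```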
In addition to the statistical modeling and inspired by the ViBe method, we keep a set of sample values as a secondary non-parametric model for each block.
This set is initialized by the mean value of the block and its neighboring blocks at the first frame.
At each of the consecutive frames one of the values in the set is selected randomly and replaced with the new mean value.
The collection of background sample values for block $i$ at frame $t$ can be denoted as $\mathcal{S}_i^{(t)} = \{s_1, s_2, \ldots, s_N\}$, where $N$ is the number of stored samples. An input block mean is considered to match this non-parametric model when it lies within a fixed distance of at least one stored sample, similar to the matching rule in ViBe.
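A hedged sketch of this sample-based model (the sample count and the matching threshold are assumed parameters; the exact matching rule may differ from the paper's):

```cpp
#include <cmath>
#include <cstdlib>
#include <vector>

// Non-parametric background model of one block: a set of sampled
// block-mean values, updated by random replacement as in ViBe.
struct SampleModel {
    std::vector<double> samples;

    // Initialize from the means of the block and its neighbors (frame 1).
    explicit SampleModel(const std::vector<double>& initialMeans)
        : samples(initialMeans) {}

    // Replace a randomly chosen sample with the new block mean.
    void update(double newMean) {
        samples[std::rand() % samples.size()] = newMean;
    }

    // The block mean matches the model if it is close enough to at
    // least one stored sample (threshold is an assumed parameter).
    bool matches(double mean, double threshold = 20.0) const {
        for (double s : samples)
            if (std::abs(mean - s) < threshold) return true;
        return false;
    }
};
```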
For the final classification, foreground likelihood values are first calculated for each pixel of the input image based on the major components of the background and foreground models, yielding the probability that each pixel belongs to the foreground.
Then the mean value of each super-pixel is compared against the major component in the background model of the corresponding block, as well as against each component in the foreground model.
The foreground confidence map is computed from these comparisons, where the super-pixels that do not fit the major background component of their block are marked as definite foreground, and the super-pixels that match one of the major foreground components are marked as candidate foreground; combining the two sets yields the hysteresis classification.
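While the exact formulation is in the paper, the hysteresis idea itself can be sketched generically with OpenCV: definite (high-confidence) foreground seeds validate the connected candidate (low-confidence) regions. The thresholds below are illustrative:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Generic hysteresis thresholding of a foreground confidence map
// (CV_32F in [0,1]): keep low-confidence candidate pixels only if their
// connected component contains a high-confidence definite pixel.
cv::Mat hysteresisMask(const cv::Mat& confidence,
                       float lowThr = 0.3f, float highThr = 0.7f)
{
    cv::Mat definite, candidate;
    cv::threshold(confidence, definite, highThr, 255, cv::THRESH_BINARY);
    cv::threshold(confidence, candidate, lowThr, 255, cv::THRESH_BINARY);
    definite.convertTo(definite, CV_8U);
    candidate.convertTo(candidate, CV_8U);

    // Label candidate components, then flag those touched by a definite pixel.
    cv::Mat labels;
    int nLabels = cv::connectedComponents(candidate, labels, 8, CV_32S);
    std::vector<uchar> keep(nLabels, 0);
    for (int y = 0; y < definite.rows; ++y)
        for (int x = 0; x < definite.cols; ++x)
            if (definite.at<uchar>(y, x)) keep[labels.at<int>(y, x)] = 1;

    cv::Mat mask(confidence.size(), CV_8U, cv::Scalar(0));
    for (int y = 0; y < mask.rows; ++y)
        for (int x = 0; x < mask.cols; ++x)
            if (keep[labels.at<int>(y, x)]) mask.at<uchar>(y, x) = 255;
    return mask;
}
```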
The different stages of the classification process are illustrated in the figure. From top to bottom, each row represents a sample video frame from the DAVIS, Segment Pool Tracking, and SCBU datasets, respectively. The second column shows heatmaps in which pixels with a higher probability of belonging to the foreground appear in red. The third column shows the results of the watershed segmentation algorithm applied to each video frame, with the markers chosen uniformly across the image at the same locations as the background block centers. The fourth column illustrates the foreground confidence maps calculated as described above, and the last column shows the final foreground detection results after morphological dilation.
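The watershed step with uniformly placed markers can be reproduced with OpenCV along these lines (a sketch; the marker spacing is an assumed parameter):

```cpp
#include <opencv2/opencv.hpp>

// Segment a frame into super-pixels with cv::watershed, seeding one
// marker per background block center on a uniform grid.
cv::Mat watershedSuperpixels(const cv::Mat& frameBGR, int step = 24)
{
    cv::Mat markers(frameBGR.size(), CV_32S, cv::Scalar(0));
    int label = 1;
    for (int y = step / 2; y < frameBGR.rows; y += step)
        for (int x = step / 2; x < frameBGR.cols; x += step)
            markers.at<int>(y, x) = label++;  // one seed per block center

    cv::watershed(frameBGR, markers);  // grows each seed into a region
    return markers;  // CV_32S label map; -1 marks region boundaries
}
```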
The performance of the proposed method is evaluated using video data from the publicly available SCBU dataset, which consists of nine video sequences captured by moving cameras. The videos in the dataset pose various challenges to foreground segmentation, such as fast- or slow-moving objects, objects of different sizes, illumination changes, and similarities in intensity values between the background and the foreground. In terms of time and space complexity, statistical methods are more efficient, as methods based on deep neural networks require more resources. Therefore, our method is more practical in applications with real-time requirements and on edge devices with lower hardware capacity. It can be seen that our proposed method is able to detect the foreground in various challenging scenarios. Compared to some representative methods, such as MCD and MCD NP, our method models the foreground and background separately, which enhances the classification results. One limitation of the proposed method is that the foreground model does not adapt well to sudden illumination changes caused by the panning of the camera. Also, the camouflage problem, where the foreground color values are very similar to those of the corresponding background block, can lead to false negatives. This problem could be addressed in future studies by introducing more discriminative features into the statistical modeling process. The F-score metric is used to evaluate the quantitative results:
$$
F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
$$

where $\mathrm{Precision} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$ and $\mathrm{Recall} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$, with TP, FP, and FN denoting the numbers of true positive, false positive, and false negative foreground pixels, respectively.
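Given binary ground-truth and detection masks, the F-score can be computed as follows (a standard implementation, not taken from the repository):

```cpp
#include <opencv2/opencv.hpp>

// Pixel-wise F-score between a detected mask and the ground truth
// (both CV_8U, nonzero = foreground).
double fscore(const cv::Mat& detected, const cv::Mat& groundTruth)
{
    double tp = cv::countNonZero(detected & groundTruth);   // true positives
    double fp = cv::countNonZero(detected & ~groundTruth);  // false positives
    double fn = cv::countNonZero(~detected & groundTruth);  // false negatives
    if (tp == 0.0) return 0.0;
    double precision = tp / (tp + fp);
    double recall    = tp / (tp + fn);
    return 2.0 * precision * recall / (precision + recall);
}
```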
The average running speed of the proposed method for each video frame is reported in the table.
In this study, a new real-time method is proposed for locating moving objects in videos captured by non-stationary cameras, which is one of the challenging problems in computer vision. The global motion is estimated and used to compensate for background variations caused by the camera movements. Each block is modeled by a mixture of Gaussian distributions, which is updated with the values at the corresponding locations in the warped image after motion compensation. Additionally, the mean values of each block, along with the mean values of its neighboring blocks, are kept as a set of samples that is updated by random replacement. The foreground, on the other hand, is modeled by a separate MOG, which is updated by the values that do not fit either the statistical or the sample-based background model. For classification, each input value is compared against both the background and the foreground models to obtain the definite and the candidate foreground locations, respectively. The watershed segmentation algorithm is then applied to detect the final foreground mask. Experimental results demonstrate the feasibility of the proposed method in real-time video analytics systems.
@inproceedings{ghahremannezhad2022real,
title={Real-Time Hysteresis Foreground Detection in Video Captured by Moving Cameras},
author={Ghahremannezhad, Hadi and Shi, Hang and Liu, Chengjun},
booktitle={2022 IEEE International Conference on Imaging Systems and Techniques (IST)},
pages={1--6},
year={2022},
organization={IEEE}
}