The trade-off between accuracy and the complexity of real-time background subtraction

Background subtraction, used in object detection, tracking, and action recognition, is a typical method that separates foreground objects from the background. These applications require both accuracy and a complexity-reduction technique. Some approaches have been proposed to either increase accuracy or decrease complexity; however, the trade-off between increasing accuracy and reducing the complexity of background subtraction remains a major challenge. To address this issue, a background subtraction-based real-time moving-object detection approach is proposed. The key contribution in the authors' proposal is to use a colour image and a novel colour-gradient blending fused image to achieve accurate background/foreground segmentation. The fused image is a combination of a gradient image and a colour image that corrects illumination variations and preserves edge information. Also, thresholds are adaptively selected based on the dynamic background behaviour to attain a more robust classification system. The proposed model is evaluated on real-time and complex videos from the CD-2012 and CD-2014 change detection data sets, and the CMD data set. Experimental results indicate that the authors' method processes around 43 frames per second and requires six bytes of memory per pixel, which is noticeably more efficient and less complex than other background subtraction methods.


INTRODUCTION
For many years, background subtraction (BGS) has attracted a lot of attention in computer vision and image processing for its broad array of applications. BGS extracts moving objects from a video. The approach needs to model the background, and must capture different types of background, to detect moving foreground (FG) objects. A frame-difference BGS approach detects moving objects from the absolute intensity difference between the observed frame and the reference frame. On the other hand, some sample consensus approaches used local binary pattern (LBP) [1], local binary similarity pattern (LBSP) [2,3], local singular value decomposition binary pattern (SVDBP) [4], or gradient magnitude (GM) [5] in addition to colour to increase accuracy. Although the additional features increase the accuracy of colour-based BGS to some extent, they bring extraneous complexity. Therefore, to solve the complexity-versus-accuracy problem, we utilize a colour feature and a novel colour-gradient blending fused feature for background/foreground segmentation. The new fused feature is created from the colour feature and the gradient feature. There have been many proposed approaches to object detection, all with different goals. We classify object detection methods into two categories: typical object detection methods and moving-object detection methods. The typical object detection approach is the process of segmenting objects, such as humans, cars, trees, and buildings, in images or videos. On the other hand, the latter approach classifies objects in motion, such as humans, vehicles, and animals. BGS is mainly used in background reconstruction, moving-object classification, and moving-object detection. The moving object-related applications are in intelligent visual surveillance systems, like road traffic surveillance, video surveillance, and maritime and airport monitoring [6,7].
Other applications include human-computer interaction, traffic monitoring, real-time gesture recognition, content-based video coding, optical motion capture, object tracking, and activity recognition [6,8,9]. (IET Image Process. 2021;15:350-368; wileyonlinelibrary.com/iet-ipr.) The demand for real-time moving-object detection is growing nowadays. Real-time
moving-object detection and tracking are promising applications in the smart home, the smart environment, smart cities, robotics, and drones. The Internet of Things, cloud computing, fog computing, and edge computing also use detection services for surveillance [10]. Many sensors have limited memory and processing capabilities. Hence, we need applications that utilize fewer resources and yet still provide accuracy. According to our recent review of moving-object detection, we conclude that although some methods are quite effective in terms of accuracy, their complexity is not acceptable for extensive use. Compared to a shallow learning-based method, an accurate deep learning-based method incurs huge computational costs and requires a large volume of ground-truth data to train the model. Bouwmans and Garcia-Garcia [6] identified the present trends in BGS approaches in terms of effectiveness, memory usage, and time complexity. Therefore, we propose a BGS-based real-time moving-object detection approach. The proposed method can be applied in limited-computing and low-memory devices, and is effective and efficient. Moreover, it can work well despite dynamic backgrounds, thermal effects, unstable video, bad weather, and turbulence. There are many challenges in BGS or moving-foreground detection, such as illumination variations, light switching, dynamic backgrounds, shadows, motionless moving objects, camouflage, sleeping foreground objects, camera jitter [6], noisy images, inserted background objects [11], and so on. To solve such challenges, many methods have been proposed. A frame-difference method [12,13] was proposed for background/foreground segmentation that was effective to some extent for non-challenging video cases. Although this kind of approach can quickly classify background and FG, it is very sensitive to illumination variations and camera jitter, and it fails when there are either mild or large variations in the background of a video.
Later on, the statistical methods [14-20] and the codebook method [21] tried to make an illumination-invariant pixel model. Although their accuracy increased compared to the frame-difference method, these methods underperform when dealing with background variations, just like the frame-difference method. Also, the sample consensus-based approaches [2,22,23] achieved some success with intensity changes, but their accuracy decreases sharply when there are large, unexpected changes in a video. To work with dynamic backgrounds (either mild or large changes in the background), sample consensus-based approaches [2-5, 24, 25] added some pixel descriptors to create an illumination- and shadow-invariant model, and introduced a dynamic background control mechanism. Later on, deep learning-based BGS (DeepBS) [26] exploited two methods [3,19] to create backgrounds to train and test the model. Although the methods [3,26] increased performance to some extent, they incurred unnecessary costs. Therefore, it is difficult to exploit these approaches in real-time applications.
Some BGS approaches [1,5,27,28] employ local descriptors, like LBP, LBSP, and GM, to get the texture and neighbour information, correct the intensity, and remove shadow. LBP- or LBSP-based BGS incurs a higher cost than gradient-based BGS. In LBP [1,27], a local binary pattern emerges from matching each neighbouring pixel with the centre pixel based on a constant threshold. The threshold is some fraction of the centre pixel. Unlike LBP, LBSP [28] encodes the pattern based on the similarity between the centre pixel and the neighbouring pixels. Although LBP and LBSP descriptors are said to be intensity-invariant, they fail in a flat region, as shown in Figure 1, which results in false detections. The window size of the descriptors affects intensity correction. For example, considering the same window size of 5 × 5 (Figure 1), the LBSP pattern shows more variations in frame #160 than the corresponding patterns in frame #1 and frame #83. Therefore, large block-size descriptors are very sensitive to mild variations in the background. In the same way, LBP and GM also exhibited different patterns and magnitudes for those three frames. On the other hand, in our design, a colour-gradient blending fused image (FI) with a 3 × 3 kernel shows the same values in cases with only the background, the shadow, and the front side of the foreground. The FI can reduce illumination variance. In conclusion, the novel FI can detect all the edges and correct the intensity to some extent at the same time. Deep learning-based typical object-detection approaches, such as the fast region-based convolutional neural network (Fast R-CNN) [29], Faster R-CNN [30], you only look once (YOLO) [31], and Mask R-CNN [32], were proposed for accurate object detection. However, those methods did not address how to detect moving objects.
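The flat-region sensitivity of LBP discussed above is easy to reproduce. The sketch below is our own minimal single-patch LBP with an assumed threshold fraction, not the exact formulation of [1,27]; it shows that a textureless patch yields an all-zero pattern, while a few grey levels of noise already flip bits once the threshold fraction is small:

```python
import numpy as np

def lbp3x3(patch, frac=0.1):
    """Classic LBP over one 3x3 patch: each of the 8 neighbours is compared
    with the centre pixel against a threshold that is a fraction of the centre."""
    c = float(patch[1, 1])
    thr = frac * c
    neigh = np.delete(patch.flatten(), 4)        # the 8 neighbours
    return ((neigh.astype(float) - c) > thr).astype(int)

# A perfectly flat region produces the all-zero pattern...
flat = np.full((3, 3), 180)
# ...but tiny noise flips bits when the threshold fraction is small.
noisy = flat + np.array([[0, 2, 0], [1, 0, 3], [0, 2, 0]])
```

With `frac=0.1` the flat patch stays all-zero; with a very small fraction such as `frac=0.005`, the mildly noisy patch already produces nonzero bits, which is exactly the false-detection mechanism described in the text.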
Alternatively, traditional image processing-based methods such as the Visual Background extractor (ViBe) [23], ViBe+ [24], the pixel-based adaptive segmenter (PBAS) [5], LOBSTER [2], the Self-balanced Sensitivity Segmenter (SuBSENSE) [3], and Sub-superpixel-based Background Subtraction (SBS) [20], as well as machine learning-based approaches such as the weightless neural network (WNN) [33], the CNN [34], and DeepBS [26], were proposed for moving-object segmentation. PBAS and SuBSENSE used pixel features and dynamic background control mechanisms for accurate segmentation. Our proposal enhances PBAS and SuBSENSE using the novel FI. Our method outperforms the existing approaches in terms of accuracy and computational complexity.
In this research, we propose a novel FI-based background subtraction (FBGS) approach. First, we create the FI, which is a combination of the gradient image and the colour image, and which reduces noise while preserving edge information. Afterwards, we fuse the FI and the colour image, model the dynamic background characteristics, update the background samples, and select an adaptive threshold for background/foreground classification. Finally, we apply post-processing to discard noise in the raw background/foreground segmentation. We list our key contributions as follows: (1) We design a novel colour-gradient FI, which is a weighted combination of colour and the GM, for effective background/foreground classification. (2) Under different pixel-intensity variations, thresholds are effectively selected at runtime to achieve accurate segmentation. (3) Our system is based on colour, the FI, and an adaptive threshold, which can improve the segmentation accuracy in challenging cases. (4) We test our system with 54 videos from the CD-2012 and CD-2014 change detection data sets, and the CMD data set. Our experimental results demonstrate the performance effectiveness of our proposed system.

FIGURE 1 Colour intensity, local binary pattern (LBP), local binary similarity pattern (LBSP), gradient magnitude (GM), and fused image (FI) values displayed at pixel (344, 410) for frame 1, frame 83, and frame 160. At these coordinates, frame 1, frame 83, and frame 160 contain only background, shadow, and the front side of the foreground, respectively
In the rest of the paper, Section 2 describes other relevant work. The hyperparameter setup, background model, background/foreground classification, background change measurement, dynamic or non-dynamic region detection, blinking pixel level, threshold factor update, background factor update, background model update decision, background diffusion, unstable video detection, and post-processing are presented in Section 3. We describe the parameter settings, the image-denoising effects, the importance of FI magnitude, a comparison of methods, and complexity measurements in Section 4, and we discuss the important aspects of our method in Section 5. Conclusions and future work are elaborated on in Section 6.

RELATED WORKS
From the aspect of high-accuracy performance, many BGS approaches have been proposed. BGS approaches can be classified into basic/simple methods, statistical- and cluster-based methods, sample consensus-based methods, and learning-based methods. Figure 2 depicts the taxonomy of the BGS algorithms.
(1) Simple method. A background model is based on two- or three-frame differences, persistent frame differences, averages, medians, and histograms of pixels in a video. In the two-frame difference method [13] and the block-frame difference method [35], an observed video frame is subtracted from a reference background model to detect moving objects. While the absolute difference between two frames extracts some moving objects, it results in false classifications if there are dynamic backgrounds, intensity changes, colour variations, shadows, unstable video, start-up objects, and so on. The method can only detect the boundary edge of evenly coloured-intensity or grey-intensity objects. In median-based BGS, the median of each pixel over the temporal direction separates the background from the FG. Temporal median-based BGS fails when each pixel is observed as background in less than 50% of the total observations. Spectral-360 [36] exploited an image formation model and colour spectrum reflectance, and classified the foreground from the video based on full-colour spectrum reflectance feature resemblance, instead of colour/grey-intensity similarity in a pixel.

FIGURE 2 Taxonomy of the background subtraction approaches: MOG [37], KNN [17], GMM [14], FTSG [19], SBS [20], SC-SOBS [38], WNN [33], CNN [34], and DeepBS [26]

(2) Statistical- and cluster-based BGS methods. In the mixture of Gaussians (MOG) model [37], a pixel belongs to more than one Gaussian distribution, whereas Zivkovic and Van Der Heijden [17] used the K-nearest neighbours (KNN) algorithm for BGS, where a pixel belongs to a single cluster. Although MOG and KNN operate well for non-dynamic backgrounds, they fail to segment a moving object from a multi-modal background. In the Gaussian mixture model (GMM) [14], Stauffer and Grimson observed the pixels' intensity behaviour (road and water-surface pixels, and the monitor-flickering pixel) over time, and found a lot of intensity variation. The variation remained approximately constant for a static environment under fixed lighting conditions. The GMM cannot accurately track an object inserted into a scene, since it does not consider recent pixel behaviour. In addition, the GMM results in false classifications when an object stays static for a long time and then starts to change position. A Gaussian distribution can be exploited for BGS when there are small grey-intensity or colour-intensity variations in a pixel, rather than dynamic backgrounds such as rippling water, a plant's leaves swinging, and the like. Wang et al.
[19] presented a method named the flux tensor with split Gaussian (FTSG) model, a combination of a spatio-temporal motion detector called a flux tensor and a mixture of Gaussian distributions for BGS. Unlike MOG, which shares the same fixed number of Gaussian distributions for both foreground and background, FTSG brings one Gaussian distribution for the foreground and one Gaussian mixture for the background. It is known that MOG is not illumination- and shadow-invariant. Also, the flux tensor cannot detect soft motion. Instead of the per-pixel similarity measure, Chen et al. [20] measured a superpixel-by-superpixel analogy and called it SBS. The superpixel was constructed using a simple linear iterative clustering (SLIC) algorithm, and the superpixel was subsequently clustered into a sub-superpixel with the help of a K-means algorithm. Moreover, the intensity-correction process could not be fully effective, because intensity varies randomly in reality. (3) Sample consensus-based method. At first, Wang and Suter [22] proposed the novel concept of a sample consensus (SACON) for BGS. Later on, Barnich and Van Droogenbroeck presented ViBe [23] to classify the background or FG. Like SACON, ViBe is also a sample consensus-based BGS approach. Unlike SACON, which updates blob samples and pixel samples simultaneously, ViBe updates only pixel samples randomly over time, based on a conservative update policy. Under the conservative update policy, the foreground will never be considered background in the whole BGS process. Therefore, if moving objects stop, the objects will never become background objects. Afterwards, Van Droogenbroeck et al. improved ViBe, naming it ViBe+ [24], introducing an updating factor of 1∕5 (because of large global motion due to unstable video) instead of one random selection out of 20 pixels (background samples), and added a new post-processing operation. Unlike ViBe, St. Charles et al.
utilized the LBSP descriptor and the colour feature in the LOBSTER [2] and SuBSENSE [3] methods. Both ViBe and LOBSTER use the same background modelling policy, diffusion mechanism, and median filter-based post-processing. In the same way, Hofmann et al. [5] also proposed a sample consensus-based BGS called the pixel-based adaptive segmenter. For the first time, PBAS created a dynamic background modelling mechanism after realizing the essence of updating stored background samples and thresholds for the dynamic background. Therefore, PBAS introduced two parameters: the decision threshold, which was updated dynamically at a fixed scale, and a learning rate calculating the degree of background dynamics in order to update the background at runtime. An extension of PBAS introduced by Tiefenbacher et al. [25] used a proportional-integral-derivative (PID) regulator to manage the decision threshold and to change the update-rate variables more effectively. In 2015, St. Charles et al. improved their previous LOBSTER method by introducing intra-LBSP, inter-LBSP, a blind update (the background could be updated by both foreground pixels and background pixels), a dynamic background modelling mechanism like PBAS, and adaptive post-processing. The goal of this research work is to detect moving objects in a dynamic environment, like PBAS and SuBSENSE. To clearly explain the difference between PBAS, SuBSENSE, and our FBGS architecture with classic dynamic-update-based methods, we present the key performance indicators used in each approach and the corresponding operations in Table 1. In 2016, to make a shadow- and illumination-invariant BGS system, Guo et al. [4] computed the SVD of an image block, normalized it (normalized SVD), and finally encoded the LBP pattern of the normalized SVD. Although Guo et al. [4] claimed that SVDBP was more illumination- and shadow-invariant than LBP, the performance of this approach was not as good as the LBSP-based SuBSENSE [3]. (4) Learning-based method.
A self-organizing method called spatially coherent self-organizing background subtraction (SC-SOBS) [38] automatically creates a neural network model for background/foreground segmentation. SC-SOBS was proposed to work in a challenging environment, such as a dynamic background, with slow illumination changes, and with camouflage problems. In 2014, the WNN [33] was proposed to adapt to random behaviour in the background and to control the effectiveness of the BGS system using colour-intensity history. Braham et al. [34] proposed a CNN-based BGS that matched patch-wise background images created by the median-based method [39]. Babaee et al. [26] developed DeepBS, which matched the image patch of the background frame generated by the combination of SuBSENSE [3] and FTSG [19], rather than only the frame-difference BGS in the previous approach [34]. The DeepBS method is not completely error-free, because a portion of the image patch could be foreground and background at the same time.

TABLE 1 Key performance indicators and corresponding operations in PBAS, SuBSENSE, and FBGS

PBAS (colour, GM, dynamic threshold, dynamic background factor): The segmentation decision is based only on a weighted fusion of colour and gradient magnitude. In addition, in PBAS, the background and the threshold are updated according to a conservative policy and the minimal distance between an observed pixel and its stored background pixel, respectively.

SuBSENSE (colour, LBSP, dynamic threshold, dynamic background factor): The segmentation decision is based on colour and LBSP. Moreover, SuBSENSE updates the background and the threshold according to a blind policy and a threshold factor, respectively. The threshold factor is determined not only from the minimal distance between an observed pixel and its stored background pixel, like PBAS, but also from the noise of the raw segmentation of the previous frame.

METHOD PRELIMINARIES
The sample consensus-based method is a pixel-by-pixel background/foreground classification method. Hereinafter, we define background samples to represent background pixels. In sample consensus-based BGS, the neighbours of each pixel are separately stored as background samples from the first video frame. The background samples are updated based on the background condition over time. Then, an observed pixel from the next video sequences is segmented as foreground if a certain number of background pixels do not match the observed pixel. The proposed FBGS method uses our newly created colour-gradient blending FI and a colour image to extract the foreground from the background. We define notations in Table 2. According to LOBSTER [2], SuBSENSE [3], PBAS [5], ViBe [23], and ViBe+ [24], these hyperparameters are found empirically. The best values of these hyperparameters are used to achieve the best performance results. Figure 3 depicts the FBGS method. The FBGS model has 13 blocks, as follows.
(1) FI creation. The FI is a combination of a colour image and a gradient image. The gradient image is defined as an image that takes a GM of a colour intensity image for each pixel.
The GM shows a directional change in the colour intensity of a colour-intensity image. A specific percentage of colour/grey intensity and the GM are combined to create the new FI (see Section 3.1). (2) Background model. Background samples are initialized from the first frame (see Section 3.2). (3) Background/foreground classification. Each observed pixel is compared with its stored background samples (see Section 3.3). (4) Background change measurement. The minimum distance between a pixel and its background samples is estimated in order to measure the background change, which is used to update the threshold factor and background factor (see Section 3.4). (5) Dynamic or non-dynamic region detection. Tree leaves rustling, a fountain, light fluctuations, and the like constitute a dynamic background; otherwise, the background is static (see Section 3.5). (6) Blinking pixel level. A blinking pixel is defined as a pixel that switches between foreground and background. The number of blinking pixels is estimated from the raw segmentation. This level is used to decide the degree of change in the threshold factor and background factor (see Section 3.6). (7) Threshold factor update. The threshold factor is updated over time based on the minimal distance between a pixel and its background samples. This factor is utilized to compute the decision thresholds (see Section 3.7). (8) Background factor update. The background factor is used to calculate the probability of a background sample update and the probability of diffusion (see Section 3.8).
TABLE 2 Notation

The Manhattan distance between A and B, where A is the intensity of a pixel, and B is the pixel's background-sample intensity or the pixel's intensity in the previous frame
The minimum Manhattan distance between A and B
T I (x, y): the decision threshold of a colour/grey image pixel
T F (x, y): the decision threshold of a fused image pixel
U dcre: the degree of the background factor decrement
The weight factor to fuse colour and normalized gradient pixels; set to 0.60
N: the number of background samples
t: the fusion weight factor of a colour/grey image pixel and a fused image pixel; for a colour pixel, t = 1; for a grey pixel, t = 3
# min: the minimum number of required matches
T I m: the minimum threshold of a colour/grey image pixel
T I O: the threshold offset of a colour/grey image pixel
T F m: the minimum threshold of a fused image pixel
T F O: the threshold offset of a fused image pixel
U (x, y): the background factor; initially, U (x, y) = 4; the lower limit of U (x, y) = 4; the upper limit of U (x, y) = 256

FIGURE 3 The fused image-based FBGS model

(11) Background diffusion. In addition to background sample replacement, the neighbouring pixels' background samples are replaced with the detected background pixels based on the probability of a background update, the ghost artefacts, and the background condition (dynamic or static) (see Section 3.11). (12) Unstable video detection. An unstable video can be caused by a non-static camera. For unstable video, the method replaces background samples with background pixels more frequently, and increases the threshold more sharply (see Section 3.12). (13) Post-processing. Post-processing discards noise (isolated foreground pixels) and fills the unfilled regions of objects detected during background/foreground classification (see Section 3.13).
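To illustrate step (13), the following is a toy stand-in for the post-processing stage, not the paper's actual filter (that is described in Section 3.13); it drops isolated foreground pixels and fills single-pixel holes by counting 8-connected neighbours:

```python
import numpy as np

def postprocess(mask):
    """Toy post-processing on a 0/1 mask: remove foreground pixels with no
    foreground neighbour, and fill holes fully surrounded by foreground."""
    h, w = mask.shape
    p = np.pad(mask, 1)
    # count the 8-connected neighbours of every pixel
    neigh = sum(p[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                if (dy, dx) != (0, 0))
    out = mask.copy()
    out[(mask == 1) & (neigh == 0)] = 0   # isolated foreground pixel -> noise
    out[(mask == 0) & (neigh == 8)] = 1   # single-pixel hole -> filled
    return out
```

Real systems typically use median filtering or morphological opening/closing for this step; the neighbour-count version above merely makes the two goals named in the text (noise removal and hole filling) concrete.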

FI creation
In this subsection, we detail our FI-generation model, which takes red-green-blue (RGB) or greyscale images as input. (1) Gradient image. Input to the model can be video frames or image sequences. The video frame and the gradient image are denoted as I I (x, y) and I G (x, y), respectively. Here, x and y represent the row and column of the frame, respectively. I I (x, y) is taken without any smoothing to create I G (x, y). We do not smooth I I (x, y) during convolution because smoothing erodes edge information in addition to reducing noise. I I (x, y) is convolved with horizontal convolution kernel K x and vertical convolution kernel K y to get the gradient in horizontal direction ∇X and vertical direction ∇Y, respectively. Both K x and K y are 3 × 3 kernels called Sobel kernels. The asterisk * indicates a two-dimensional convolution operator. Therefore, the gradient vector ∇M is represented as [∇X, ∇Y ] T , as in (5). The gradient frame, I G (x, y), is calculated using (6).
(2) Normalized image. The gradient image, I G (x, y), is normalized to ensure it is in the range [0, 255], like input image I I (x, y). The normalized gradient image, I NG (x, y), is calculated as in (7). (3) FI. Before fusion, a Gaussian function is used to smooth I I (x, y), which results in I S (x, y).

FIGURE 4 The colour-gradient blending rule-based fused image construction model

Consequently, we combine I S (x, y) with I NG (x, y) to get FI I F (x, y) using the colour-gradient blending rule in (8), where the weight factor is set to 0.60, after the experiment explained in Section 4.1.1. I F (x, y) preserves sharp edges and reduces noise at the same time. The GM of an image is not illumination-invariant: when the intensity of the neighbours of observed pixel I I (x, y) increases, the GM of the observed pixel also increases; and when the intensity of the neighbours decreases, the GM also decreases. That is why we smooth I I (x, y) when combining it with I NG (x, y) in (8): we are interested in the average colour/grey-intensity change of a pixel, since a colour/grey-intensity change can be noise or real motion. Although I F (x, y) is not completely illumination-invariant, I F (x, y) reduces noise while keeping actual motion and edge information.
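The FI pipeline can be sketched as below. Because Equations (7) and (8) are not reproduced in this text, the per-frame maximum normalization and the assignment of the 0.60 weight to the smoothed-colour term are assumptions; `conv2` is a naive stand-in for a library convolution routine:

```python
import numpy as np

def conv2(img, k):
    """Naive 3x3 'same' convolution with edge padding."""
    h, w = img.shape
    p = np.pad(img, 1, mode='edge')
    out = np.zeros((h, w), dtype=float)
    for i in range(3):
        for j in range(3):
            out += k[2 - i, 2 - j] * p[i:i + h, j:j + w]  # flipped kernel
    return out

def fused_image(frame, alpha=0.60):
    """Sketch of FI creation: Sobel gradient on the unsmoothed frame,
    gradient magnitude normalized to [0, 255], then blended with the
    Gaussian-smoothed colour frame (weight direction is assumed)."""
    f = frame.astype(float)
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)   # Sobel K_x
    gx, gy = conv2(f, kx), conv2(f, kx.T)                        # ∇X, ∇Y
    g = np.hypot(gx, gy)                                         # magnitude
    g = 255.0 * g / g.max() if g.max() > 0 else g                # normalize
    gauss = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], float) / 16.0
    smooth = conv2(f, gauss)                                     # I_S
    return alpha * smooth + (1.0 - alpha) * g                    # blended FI
```

On a perfectly flat frame the gradient term vanishes, so the FI reduces to 0.60 times the (unchanged) smoothed colour, which matches the intuition that the FI only adds information where edges exist.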

Background model
RGB or greyscale image I I (x, y) and FI I F (x, y) are the inputs to the FBGS model. We follow a sample consensus approach for segmentation. The sample consensus-based BGS approaches [3,5,23] store background samples for a pixel over time, and then segment the pixel based on similarities/dissimilarities between the pixel and its stored background samples. If I I (x, y) and I F (x, y) are the first frames, colour/grey-image background model B I (x, y) from I I (x, y) and FI background model B F (x, y) from I F (x, y) are initialized for the first time before background/foreground segmentation, as follows:
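The initialization described here can be sketched as follows. Drawing each stored sample from the pixel's 8-neighbourhood in the first frame follows the description at the start of this section; the choice of 20 samples per pixel is an assumption borrowed from ViBe [23]:

```python
import numpy as np

def init_background_model(first_frame, n_samples=20, rng=None):
    """ViBe-style model initialization: each pixel's N background samples
    are drawn at random from its 3x3 neighbourhood in the first frame."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = first_frame.shape
    pad = np.pad(first_frame, 1, mode='edge')
    model = np.empty((n_samples, h, w), dtype=first_frame.dtype)
    for k in range(n_samples):
        # per-pixel random neighbour offsets in {-1, 0, 1}^2
        dy = rng.integers(0, 3, size=(h, w))
        dx = rng.integers(0, 3, size=(h, w))
        ys = dy + np.arange(h)[:, None]
        xs = dx + np.arange(w)[None, :]
        model[k] = pad[ys, xs]
    return model
```

In the full method this would be run twice, once on I I (x, y) to build B I (x, y) and once on I F (x, y) to build B F (x, y).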

Background/foreground classification
After background sample initialization from the first frame, I I 1 (x, y), background/foreground classification starts from the second frame, I I 2 (x, y). According to sample consensus-based BGS, our FBGS compares each observed pixel, I I O (x, y) and I F O (x, y) at coordinates (x, y), with background samples B I N (x, y) and B F N (x, y), which were previously initialized and are continuously updated. If the similarity matching number, # count , of an observed pixel at coordinates (x, y) is less than the minimum number of required matches, # min , observed pixel (x, y) will be classified as foreground (i.e. 1); otherwise, the pixel will be segmented as background (i.e. 0). The raw segmentation, F R (x, y), is expressed in (11). In this equation, T I (x, y) is the decision threshold of I I (x, y), and T F (x, y) is the decision threshold of I F (x, y). L I 1 (I I O (x, y), B I N (x, y)) is the Manhattan distance between intensity pixel I I O (x, y) and its background sample, B I N (x, y), and L F 1 (I F O (x, y), B F N (x, y)) is the Manhattan distance between fused pixel I F O (x, y) and its background sample, B F N (x, y); the weighted sum of these two distances is the combination of the Manhattan distances for pixel (x, y). Here, t is the fusion weight factor.
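The per-pixel decision can be sketched as below. Since Equation (11) is not reproduced in this text, the exact way the two distances and the two thresholds combine is an assumption; here each distance is simply tested against its own threshold, and `n_min` plays the role of # min :

```python
def classify_pixel(obs_i, obs_f, samples_i, samples_f, T_i, T_f, n_min=2):
    """Sample-consensus decision for one pixel: count background samples
    close to the observation; fewer than n_min matches means foreground."""
    matches = 0
    for b_i, b_f in zip(samples_i, samples_f):
        d_i = abs(float(obs_i) - float(b_i))   # colour Manhattan distance
        d_f = abs(float(obs_f) - float(b_f))   # fused-image Manhattan distance
        if d_i < T_i and d_f < T_f:            # assumed per-feature matching rule
            matches += 1
    return 1 if matches < n_min else 0         # 1 = foreground, 0 = background
```

An observation that sits near its stored samples in both the colour and FI spaces accumulates enough matches and is kept as background; a large deviation in either space starves the match count and yields foreground.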

Background change measurement
Our background change measurement is similar to those in [3] and [5]. We note that H (x, y) and U (x, y), which control the dynamic or non-dynamic regions of a video, can increase or decrease detection accuracy. The difference between the observed background pixel and its background samples, the noise level (approximated by blinking pixels), and the type of video (stable or unstable) successively determine the nature (dynamic or non-dynamic) of the background. Before calculating short-term moving average M S A (x, y) and long-term moving average M L A (x, y) between a pixel at coordinates (x, y) and its background samples, B I N (x, y) and B F N (x, y), the minimum average distance, D min (x, y), between the pixel and its background samples is estimated as seen in (12). In that equation, L1 min (I I O (x, y), B I N (x, y)) is the minimum Manhattan distance between observed pixel I I O (x, y) and its background samples, B I N (x, y); L1 min (I F O (x, y), B F N (x, y)) is the minimum Manhattan distance between observed pixel I F O (x, y) and its background samples, B F N (x, y); and mxIDst is the maximum intensity distance used to normalize the distance. Afterwards, we determine short-term moving average M S A (x, y) and long-term moving average M L A (x, y) using the short-term learning rate and the long-term learning rate, respectively. Note that these two learning-rate parameters have the same meaning throughout our proposed method.
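The two moving averages can be written as simple exponential updates. The learning-rate symbols are not reproduced in this text, so `alpha_s` and `alpha_l` (and their values) are placeholders:

```python
def update_averages(d_min, m_sa, m_la, alpha_s=0.04, alpha_l=0.01):
    """Exponential moving averages of the normalized minimum distance:
    the short-term average reacts faster than the long-term one."""
    m_sa = (1.0 - alpha_s) * m_sa + alpha_s * d_min   # short-term M_SA
    m_la = (1.0 - alpha_l) * m_la + alpha_l * d_min   # long-term M_LA
    return m_sa, m_la
```

Feeding the same distance repeatedly drives both averages toward it, with the short-term average always ahead; the gap between the two is what later signals a recent change in background behaviour.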

Dynamic or non-dynamic region detection
The state (dynamic or non-dynamic) of a region has a very important effect on segmentation. Rather than a global measurement, we calculate the local behaviour of the background. A dynamic background does not appear at the same magnitude, and also does not occur in all regions of the frame, except in unstable videos. In the case of an unstable video, we know that all the pixels shift by the same amount between consecutive frames, on average, because of the camera movement. In a dynamic background, for example, in some regions, tree leaves move and fountains spray water, while in the rest of the frame, nothing happens. Also, pixel-to-pixel motion within a dynamic background region varies. Therefore, we approximate the segmentation fault to separately identify dynamic or non-dynamic background S (x, y) per region. We calculate per-pixel short-term segmentation fault R S (x, y) and long-term segmentation fault R L (x, y) in the following way: We apply normalization to keep the range of R S (x, y) and R L (x, y) within [0, 1]. Wrongly classified foreground produces a false positive. Therefore, dynamic or non-dynamic region S (x, y) is calculated as expressed in (17). In that equation, H (x, y) is the threshold factor in (19), Rd m is the minimum threshold, and R m is the minimum unstable ratio of the background region. Logical variable S (x, y) is either 1 or 0. S (x, y) = 1 means a dynamic pixel; otherwise, S (x, y) = 0 indicates a non-dynamic pixel. The flag, S (x, y), increases or decreases decision thresholds T I (x, y) and T F (x, y) in (21) and (22).

Blinking pixel level

FBGS calculates pixel-level classification errors by identifying blinking pixels and average background noise. Note that blinking pixels are identified as in the methods of [3,5,24]. In FBGS, a blinking pixel resides either in the observed classified frame, F R O (x, y), or in the previously segmented frame, F R O−1 (x, y), but not in both. Thus, the easiest way to find blinking pixel B(x, y) is the XOR operation between F R O (x, y) and F R O−1 (x, y). However, some errors occur if a direct XOR is carried out between F R O (x, y) and the post-processed F R O−1 (x, y), since the border of the moving foreground in F R O (x, y) produces extra blinking pixels. Therefore, the XOR is taken between the dilated F R O (x, y) and the post-processed F R O−1 (x, y). Pixel-level classification error W (x, y) is then adjusted dynamically, as expressed in (18). In that equation, R m is the minimum unstable ratio, W icre is the variational increment, and W dcre is the variational decrement. W (x, y) is always ⩾ 0. T indicates true. With a small amount of noise, W (x, y) ≈ 0; this happens when a region has very low intensity variation, a completely static background, or a nearly static background. Alternatively, W (x, y) ≈ 1 in a completely dynamic region, such as fountains, tree leaves rustling, or very large intensity variations due to internal camera errors, light fluctuations, and so on.
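Read literally, the dilate-then-XOR rule looks as below (a sketch; the 3 × 3 structuring element and the 0/1 mask encoding are assumptions). A pixel that flips far from the current object is flagged as blinking, while the previous frame's border pixels adjacent to the current object fall inside the dilated mask and are suppressed:

```python
import numpy as np

def blinking_pixels(seg_now, seg_prev_post):
    """XOR of the 3x3-dilated current raw segmentation with the
    post-processed previous segmentation (0/1 integer masks)."""
    h, w = seg_now.shape
    p = np.pad(seg_now, 1)
    dil = np.zeros_like(seg_now)
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            dil |= p[dy:dy + h, dx:dx + w]   # 3x3 binary dilation
    return dil ^ seg_prev_post               # 1 where classification flipped
```

The count of ones in the returned map (relative to the frame size) is what feeds the W (x, y) update in (18).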

Threshold factor update
Now, local distance threshold factor H (x, y) is updated as expressed in (19) based on M S A (x, y), M L A (x, y), and W (x, y). In (19), Z is the threshold factor step size. H (x, y) is always ⩾ 1.

Background factor update
In addition to the threshold factor update, background factor U (x, y) is also continuously updated, as seen in (20). In the equation, U icre and U dcre , respectively, are the degree of background factor increment and decrement, while R m is the minimum unstable ratio of the background region, and U (x, y) is the background factor.

Decision threshold
Two decision thresholds, T I (x, y) and T F (x, y) in (11), are determined by (21) and (22), where T I O is the colour threshold offset, T F O is the fused threshold offset, T I m is the minimum colour distance threshold, and T F m is the minimum fused distance threshold. A static region leads to smaller values of T I (x, y) and T F (x, y), so any changes in the video are detected as foreground objects. On the other hand, a highly dynamic region (rustling tree leaves, rippling water, a shaking video, quick light changes etc.) leads to large values for T I (x, y) and T F (x, y), but then we cannot detect soft-motion objects, which increases the false negatives. Therefore, a robust strategy is needed to avoid both false positives and false negatives. That is why we introduce the pixel-level threshold factor, H (x, y), just like in [3,5]. In a dynamic region, H (x, y) needs to step up to increase T I (x, y) and T F (x, y) in order to reduce false classifications. On the other hand, for a non-dynamic (static) area, H (x, y) needs to diminish to decrease T I (x, y) and T F (x, y) and manage camouflage problems. We also bring in the pixel-level background factor, U (x, y), like the methods in [3,5], to measure the degree of dynamic or non-dynamic background. Note that the probability of a background sample update, P (x, y), is equal to 1∕U (x, y). The proposed FBGS randomly updates background samples in B I N (x, y) and B F N (x, y) based on P (x, y). A dynamic region steps down U (x, y), which in turn increases P (x, y), and a non-dynamic region increases U (x, y), which eventually decreases P (x, y). Hence, H (x, y) and U (x, y) are two important variables that can increase or decrease the accuracy of segmentation.
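Since (21) and (22) are not reproduced in the text, the following sketch assumes the common PBAS/SuBSENSE form (threshold = factor × minimum distance + offset); only P (x, y) = 1∕U (x, y) is stated by the paper itself:

```cpp
#include <algorithm>
#include <cassert>

// Hedged sketch: the exact forms of (21) and (22) are not given in the
// text, so we assume the widespread PBAS/SuBSENSE-style shape in which
// each per-pixel decision threshold is the threshold factor H(x, y)
// times the minimum distance threshold, plus an offset, floored at the
// minimum. Parameter names follow the paper.
inline double colourThreshold(double H, double T_I_m, double T_I_O) {
    return std::max(T_I_m, H * T_I_m + T_I_O);   // assumed form of (21)
}
inline double fusedThreshold(double H, double T_F_m, double T_F_O) {
    return std::max(T_F_m, H * T_F_m + T_F_O);   // assumed form of (22)
}

// The paper does state this relation explicitly: P = 1/U.
inline double updateProbability(double U) { return 1.0 / U; }
```

With H > 1 in dynamic regions, both thresholds grow and fewer soft changes are flagged as foreground, which is exactly the behaviour the paragraph describes.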

Background model update decision
Background model update plays a very important role in a sample consensus-based background/foreground classification. The decision variables T I (x, y) and T F (x, y), the minimum number of required samples, # min , the neighbour window size ((3 × 3) or (5 × 5)), and sample size N all directly affect detection accuracy. There are two background update strategies: a conservative update and a blind update. The conservative update never includes foreground-classified pixels in the background samples. The problem is that a badly recognized foreground, the place an object moves through, and a foreground object after it stops will never be included in the background. This situation is called deadlock [23]. A blind update policy never causes a deadlock because it can add the current pixel to the background, whether it is background or foreground. In our proposed method, we adopt this update policy despite one of its drawbacks: slowly moving objects can be assimilated into the background if no condition constrains them. For this reason, our method increases U (x, y) in a static (non-dynamic) region and decreases it in a dynamic region. As a result, the probability of a background sample update becomes low in a non-dynamic region and high in a dynamic region. When an observed pixel, I I O (x, y) or I F O (x, y), is classified as either foreground or background, that pixel has a 1∕U (x, y) chance to randomly replace one of the background samples in B I N (x, y) and B F N (x, y). A dynamic or a non-dynamic background updates H (x, y) and U (x, y) at runtime. Incorrectly increasing or decreasing H (x, y) and U (x, y) results in false detections. Therefore, accurate identification of the nature of a region is needed to correctly adapt these two variables. Based on some conditions, neighbours' background samples are also replaced in order to preserve spatial consistency, which is called diffusion. In addition, more background samples are randomly replaced when the video is unstable.
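The blind update with probability 1∕U (x, y) can be sketched as follows; maybeUpdateSamples is a hypothetical name, and rand() stands in for whatever RNG the real implementation uses:

```cpp
#include <cstdlib>
#include <vector>

// Sketch of the blind update policy: whatever the classification, the
// observed value replaces one of the N background samples for this pixel
// with probability P = 1/U. A real implementation would hold samples per
// pixel and also diffuse into a random neighbour's sample set.
void maybeUpdateSamples(std::vector<int>& samples, int observed, int U) {
    if (U <= 0 || samples.empty()) return;
    if (std::rand() % U == 0) {                     // fires with prob 1/U
        samples[std::rand() % (int)samples.size()] = observed;
    }
}
```

Because U (x, y) shrinks in dynamic regions, those regions refresh their samples more often, which is how the model absorbs rustling leaves or rippling water without flagging them as foreground.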

Background diffusion
We diffuse the background based on the background's nature (dynamic or non-dynamic) and the ghost artefact, G (x, y). G (x, y) arises when an object starts moving for the first time. G (x, y) is calculated as expressed in (24), where G min = 0.995 is the ghost minimum, G max = 0.010 is the ghost maximum, and M NS A (x, y) is the short-term moving average between consecutive pixels of consecutive frames, unlike the short-term moving average between a pixel and its background samples, M S A (x, y). G (x, y) = 1 if there is a ghost effect; G (x, y) = 0 if there is no ghost artefact. If G (x, y) = 1, the dynamic or non-dynamic flag, S (x, y), leads to randomly diffusing, respectively, the 5 × 5 or 3 × 3 neighbourhood, based on probability P (x, y).
Before measuring the short-term moving average between consecutive pixels of consecutive frames, M NS A (x, y), the mean of the normalized combined colour and fused distance, D N (x, y), is approximated, as seen in (24). In this equation, the first term is the Manhattan distance between observed pixel I I O (x, y) and its previous pixel, and the second term is the Manhattan distance between observed pixel I F O (x, y) and its previous pixel. Thereafter, the per-pixel normalized short-term moving-average distance between consecutive frames, M NS A (x, y), is estimated using the short-term learning rate.

Unstable video detection
Unstable video segmentation is a very challenging problem because both the foreground and the background move. We measure the short-term moving average, V S m (x, y), and the long-term moving average, V L m (x, y), of input video frames I I (x, y). To find the moving averages, we downsample I I (x, y) to create a downsampled version, I D (x, y), which reduces the complexity, and compute the moving averages on it. Afterwards, the frame-level motion distance, dist (x, y), is estimated. When dist (x, y) > T I m ∕2, 10% of B I (x, y) and 10% of B F (x, y) are randomly replaced from the last video frames, I I O−1 (x, y) and I F O−1 (x, y). In addition, the lower limit and the upper limit of U (x, y) are updated to 2 and 256, respectively, when dist (x, y) > T I m ∕4.
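A minimal sketch of the unstable-video trigger follows; since the paper's moving-average equations are not reproduced here, frameDistance simply takes a mean absolute difference over the downsampled grey frames, while the decision rule (dist > T I m ∕2 ⇒ refresh 10% of the samples) follows the text:

```cpp
#include <cstdlib>
#include <vector>

// Mean absolute difference between two downsampled grey frames, as an
// assumed stand-in for the paper's frame-level motion distance.
double frameDistance(const std::vector<int>& a, const std::vector<int>& b) {
    double sum = 0.0;
    for (size_t i = 0; i < a.size(); ++i) sum += std::abs(a[i] - b[i]);
    return sum / (double)a.size();
}

// The paper's stated rule: when the distance exceeds T_I_m / 2, 10% of
// the colour and fused background samples are randomly refreshed from
// the previous frame.
bool shouldRefreshSamples(double dist, double T_I_m) {
    return dist > T_I_m / 2.0;
}
```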

Post-processing
According to [40], a post-processing operation can play a key role in background/foreground segmentation. The post-processing operation is carried out as follows. A morphological close operation is performed on the raw segmentation frame, F R (x, y), followed by a flood fill operation and then the complement operation. After that, the bitwise OR operation is performed between F R (x, y) and this complement output. Finally, we apply a median filter to the OR output to get an error-free moving object silhouette, F F (x, y). The size of the median filter is chosen dynamically, based on the input frame size.
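The flood-fill/complement/OR portion of this chain is the classic hole-filling trick; the dependency-free sketch below shows just that core (the morphological closing and median filter are omitted for brevity):

```cpp
#include <vector>
#include <queue>
#include <utility>
#include <cstdint>

using Mask = std::vector<std::vector<uint8_t>>;

// Flood fill the background from the border; pixels that are background
// but unreachable from the border are interior holes, which are then
// OR-ed back into the mask. Equivalent to flood fill + complement + OR.
Mask fillHoles(const Mask& m) {
    int h = (int)m.size(), w = (int)m[0].size();
    Mask reach(h, std::vector<uint8_t>(w, 0));  // border-reachable background
    std::queue<std::pair<int,int>> q;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            if ((y == 0 || y == h - 1 || x == 0 || x == w - 1) && !m[y][x]) {
                if (!reach[y][x]) { reach[y][x] = 1; q.push({y, x}); }
            }
    const int dy[] = {1, -1, 0, 0}, dx[] = {0, 0, 1, -1};
    while (!q.empty()) {
        auto [y, x] = q.front(); q.pop();
        for (int k = 0; k < 4; ++k) {
            int ny = y + dy[k], nx = x + dx[k];
            if (ny >= 0 && ny < h && nx >= 0 && nx < w &&
                !m[ny][nx] && !reach[ny][nx]) {
                reach[ny][nx] = 1; q.push({ny, nx});
            }
        }
    }
    Mask out = m;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            if (!m[y][x] && !reach[y][x]) out[y][x] = 1;  // fill the hole
    return out;
}
```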

Performance analysis process
We used three data sets (change detection data sets CD-2012 [41] and CD-2014 [42], and CMD [43]) to appraise the performance of our proposed FBGS. CD-2012 contains 31 videos divided into six categories: baseline (BL), thermal (TH), shadow (SH), dynamic background (DB), camera jitter (CJ), and intermittent object motion (IM). Table 3 describes the six different video categories of this data set. And CD-2014 presents 53 actual, complex video sequences of five more categories in addition to the six categories from CD-2012. The five new categories are low frame rate (LF), night video (NV), PTZ, turbulence (TB), and bad weather (BW). We describe these five categories in Table 4.
CMD is a small data set that has one video of 500 frames. The video contains soft camera jitter. We used these data sets because they have challenging sequences in different categories. Also, we can easily compare our FBGS with most of the recent BGS approaches, because those methods measured their performance using these data sets as well. The key performance metrics (KPM) from [41] are utilized to assess the effectiveness of our FBGS approach. Table 5 lists the metrics used for the accuracy calculation.
All approaches are developed in OpenCV C++ 3.3.0 in Microsoft Visual Studio 2015 on a computer with an Intel Core i5-8400 2.80 GHz CPU, running the Windows 10 (64 bit) OS with 16 GB of RAM. Memory usage and processing time are estimated in terms of bytes per pixel (bpp) and frames per second (fps), respectively. We use RGB videos with dimensions of 320 × 240 for the complexity comparison.

EXPERIMENTS AND RESULTS EVALUATIONS
This section includes (1) the determination of parameters, (2) an image denoising effect experiment, (3) the effect of GM and FI magnitude, (4) accuracy under different metrics, and (5) a complexity measurement.

Determination of parameters
As shown in Figure 3, our system is robust because it consists of many interrelated functions, which makes it difficult to achieve optimal hyperparameters by experimenting with every combination of parameters.

Metric (Denoted): Description

True Positive (TP): A foreground pixel correctly segmented as a foreground pixel.
False Positive (FP): A background pixel inaccurately segmented as a foreground pixel.
True Negative (TN): A background pixel correctly detected as a background pixel.
False Negative (FN): A foreground pixel wrongly classified as a background pixel.
Recall (Re = TP / (TP + FN)): How many samples a system segments correctly within all positive classes; a large value of Re is better.
Specificity (Sp = TN / (TN + FP)): Correctly segmented background samples within the total number of background samples. Sp is also named the true negative rate (TNR); a large value of Sp is better.
False Positive Rate (FPR = FP / (FP + TN)): The proportion of background pixels incorrectly segmented as foreground; a small value for FPR is better.
False Negative Rate (FNR = FN / (TP + FN)): The proportion of foreground pixels wrongly segmented as background; a small value for FNR is better.
Percentage of Wrong Classification (PWC = 100 (FN + FP) / (TP + FN + FP + TN)): The percentage of foreground pixels incorrectly classified as background, and background pixels classified as foreground, out of the total pixels; a small value for PWC is better.
Precision (Pr = TP / (TP + FP)): How correctly a system classifies all positive detections; a large value for Pr is better.
F-measure (FM = 2 · Pr · Re / (Pr + Re)): Detection accuracy, measured as the harmonic mean of recall and precision. F-measure is also called F-score or F1-score; a large value for FM is better.
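These metrics follow directly from the four pixel counts; the sketch below uses the standard change-detection.net formulas (the Sp, FPR, FNR, and FM expressions are the usual ones, stated explicitly since the extracted table omitted some of them):

```cpp
#include <cassert>

// Raw per-video or per-category pixel counts.
struct Counts { double TP, FP, TN, FN; };

inline double recall(Counts c)      { return c.TP / (c.TP + c.FN); }
inline double specificity(Counts c) { return c.TN / (c.TN + c.FP); }
inline double fpr(Counts c)         { return c.FP / (c.FP + c.TN); }
inline double fnr(Counts c)         { return c.FN / (c.TP + c.FN); }
inline double pwc(Counts c) {
    return 100.0 * (c.FN + c.FP) / (c.TP + c.FN + c.FP + c.TN);
}
inline double precision(Counts c)   { return c.TP / (c.TP + c.FP); }
inline double fmeasure(Counts c) {  // harmonic mean of precision and recall
    double p = precision(c), r = recall(c);
    return 2.0 * p * r / (p + r);
}
```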

FIGURE 5: Input image, raw gradient image, normalized gradient image, and the fused image

Moreover, in each function, we calculate the optimal parameters to get the best performance for the next function. Therefore, we inherited some parameters from existing approaches such as LOBSTER [2], SuBSENSE [3], PBAS [5], ViBe [23], and ViBe+ [24], and the rest of the parameters are determined following the procedures of those approaches. We take the following hyperparameter from the existing approaches: # min = 2, the required number of matching samples for classification, chosen similarly to ViBe, ViBe+, and PBAS. The optimal values of the rest of the hyperparameters were determined experimentally on the CD-2012 [41] and CD-2014 [42] data sets, as discussed in the next subsection. Figure 5 shows the input colour image, the raw gradient image, the normalized gradient image, and the novel FI; the latter three are created from the input colour image. The raw gradient image looks noisier than the normalized gradient image and the FI. The FI looks sharper, and has less error, than the raw gradient image and the normalized gradient image. The normalized gradient image contains only edge information, whereas the FI exhibits smoothed edge information as well as non-edge information. We experimented to choose the value of the blending ratio described in Section 3.1. Table 6 shows the colour and GM ratios used to create an FI. We selected the ratio based on the accuracy in terms of overall average F1-score for the CD-2012 and CD-2014 data sets. From Table 6, we conclude that accuracy in terms of F1-score increases as the GM share increases and decreases as the colour share increases; the sum of the colour and GM shares is constant at 100%. We find the best accuracy with 40% colour and 60% GM. Therefore, we set the GM weight to 0.60 to create the FI.
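With the selected ratio, the per-pixel blending reduces to a weighted sum. In this sketch, fusedPixel is an illustrative name, and the assumption that both inputs are already normalized to [0, 255] is ours (the exact gradient normalization is described in Section 3.1 and not repeated here):

```cpp
#include <algorithm>

// Fused-image construction with the ratio selected in Table 6:
// FI = 0.40 * colour + 0.60 * normalized gradient magnitude.
// Both inputs are assumed to already be 8-bit values in [0, 255].
inline int fusedPixel(int colour, int gradNorm) {
    double v = 0.40 * colour + 0.60 * gradNorm;
    return std::min(255, std::max(0, (int)(v + 0.5)));  // round and clamp
}
```

Applying this per channel (or on the grey image) to the whole frame yields the FI that the segmentation stage consumes alongside the raw colour image.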

Effect of the number of background samples
The complexity of FBGS does not depend much on the number of background samples, N. We selected N based on its effect on accuracy in terms of the overall F1-score. Figure 6 shows the accuracy for different N with the CD-2012 and CD-2014 data sets. For CD-2014, the accuracy increased with increasing N, but for CD-2012, the accuracy escalated until N = 49 and then fell. Note that CD-2014 is a more challenging data set than CD-2012, so a large N performs slightly better in challenging video cases. We find the best scores for N = 49 across the data sets; therefore, we fixed N at 49.

Other parameters settings
According to the experiment shown in Table 7, we get the optimal values of the minimum ghost artefact, G min = 0.995, and the maximum ghost artefact, G max = 0.010. We also take the optimal values of the minimum segmentation fault, Rd m = 3.000, and the minimum unstable ratio of the background region, R m = 0.100, according to Table 8. Moreover, the optimal value of the threshold factor step size, Z = 0.01, is determined based on the experiments shown in Table 9. The rest of the hyperparameters are determined in the same way, according to the experimental results.

Image denoising effect
We analysed the denoising impact on input image I I (x, y) for background/foreground segmentation by considering the two inputs: I F (x, y) + Gaussian smoothing of I I (x, y) and I F (x, y) + non-smoothing of I I (x, y). Table 10 shows that F1-scores of I F (x, y) + non-smoothing of I I (x, y) are always better than I F (x, y) + Gaussian smoothing of I I (x, y). This happens because the  smoothing function (via Gaussian filter, average filter etc.) not only reduces noise, but also distorts the essential information of a pixel. Therefore, we used an intact I I (x, y) in addition to I F (x, y) in the FBGS model. Note that we only smooth I I (x, y) when combining it with the GM to construct FI I F (x, y).

Effect of GM and FI magnitude
By analysing the accuracy difference between colour + GM and colour + FI, shown in Table 11, we observe that the proposed colour + FI always performs better than colour + GM. Therefore, we use the colour image and the FI for background/ foreground segmentation.

Accuracy under different metrics
We measured the effectiveness of our proposed FBGS on CD-2012, CD-2014, and CMD based on the metrics described in Table 5. Table 12 shows the performance of our method for each category (Tables 3 and 4) of the CD-2012, CD-2014, and CMD data sets in terms of recall (Re), specificity (Sp), FPR, FNR, PWC, precision (Pr), and F1-score, together with the overall average scores of these metrics. The overall average F1-scores are 0.8088, 0.7223, and 0.9481 for the CD-2012, CD-2014, and CMD data sets, respectively. The F1-scores do not deviate much across the different categories of the CD-2012 data set, but for LF, NV, and PTZ from the CD-2014 data set, the accuracy in terms of F1-score is lower. Precision falls more in challenging cases than in non-challenging cases, but recall does not fluctuate much across categories. Most importantly, our proposed FBGS mainly shows better accuracy in challenging cases, like those with shadows, dynamic backgrounds, camera jitter, bad weather, PTZ, and turbulence, compared to the rest of the categories. This performance shows the effectiveness of our proposed approach. Since not all changes in the videos are noise, we created an FI that reduces noise rather than completely removing it. The remaining noise does not affect segmentation because we adaptively determine the dynamic threshold, update the background samples, and perform the post-processing.

Comparison of our method with existing methods for the CMD data set
We show the comparison of our proposed FBGS approach with the existing methods in Table 13 in terms of Re, Sp, FPR, FNR, PWC, Pr, and F1-score. Moreover, for the comparison, F1-score and Re mainly focus on representing detection accuracy for background/foreground segmentation in our experiment. The bold, italic, and underlining indicate first, second, and third positions. For the CMD data set in terms of the Re performance metric, our FBGS outperforms at 98.51%. The reason is that our non-smoothing colour/grey image, I I (x, y), does not lose intensity changes, and our novel FI, I F (x, y), preserves subtle edge information, which increases true positives. On the other hand, while considering accuracy in terms of F1-score, our proposed method was the best, with 94.81% accuracy, as shown in Table 13. This happened because of our novel FI, which reduces noise due to pixel intensity change, and our threshold selection, which is based on background behaviour.

Comparison of the methods with CD-2012 and CD-2014 data sets
Unlike the accuracy assessment for the CMD data set, we compared our proposed approach with the state-of-the-art approaches based on average F1-score of each video category in the CD-2012 and CD-2014 data sets. The comparison results are shown in Table 14. Like Table 13, bold, italic, and underlining denote first, second, and third positions. Our proposal outperforms for DB and PTZ videos, with 88.32% and 35.64% average F1-scores, respectively. Recall that the backgrounds of DB videos are dynamic, and the PTZ video has the pantilt-zoom effect. Our method outperforms in these two cases for two reasons. (1) Our proposed novel FI not only reduces noise but also preserve the edge information, which leads to an increase in Re and Pr. (2) We determine decision thresholds per pixel, based on the nature of the local background region, and we update the background model dynamically, which reduces false positives and improves true positive detection.

Complexity measurement
According to Bouwmans and Garcia-Garcia [6], lowering the complexity while maintaining the accuracy of BGS is a current research trend. Therefore, complexity reduction, in addition to increasing accuracy, is an important concern. Table 15 shows the complexity comparison of PBAS, LOBSTER (LOB), SuBSENSE (SuBS), and ViBe+ with our FBGS method. Bold, italic, and underlining represent the same positions as in previous tables. In terms of frames per second, PBAS performed best, and ViBe+ finished last among the five methods.
On the other hand, the proposed FBGS and LOB finished second and third, respectively. PBAS processes 3.73 more frames per second than our FBGS. SuBS displayed 25.02 fps, whereas our method exhibited 42.78 fps, more than 17 fps faster than SuBS. Although the accuracy of SuBS in terms of average F1-score is slightly better, its complexity is greater in terms of both frames per second and bytes per pixel. Because DeepBS uses SuBS and FTSG for background construction, the complexity of DeepBS is higher than that of SuBS and of our proposed method. In terms of memory requirements, ViBe+ requires the least space among the five methods. Ours uses 6 bpp, which is 9 bpp fewer than PBAS, LOB, and SuBS, each of which allocates 15 bpp. Our proposed FBGS placed second in both time complexity and memory usage; on average, our method achieved the best position.

DISCUSSION
FBGS proved its effectiveness when we assessed the performance with the three data sets (Tables 12, 13, and 14). Our proposed method exhibited a 0.8088 F1-score, a 0.7223 F1-score, and a 0.9481 F1-score for the CD-2012, CD-2014, and CMD data sets, respectively. Tables 13, 14, and 15 show the advantages of our system as follows. First, according to Table 13, our FBGS achieved the best recall (Re) and F1-score on the CMD data set [43]. Also, in Table 14, our method achieved the best accuracy on dynamic background (DB) and pan-tilt-zoom (PTZ) videos. Second, according to Table 15, the proposed method noticeably reduces algorithmic complexity, so it can be used on devices with limited computing power. Considering real-time BGS, DB and PTZ are the major video categories used to compare each method in our proposal. Although the performance of the proposed method may not seem as promising as deep learning-based approaches such as DeepBS, our approach shows better results than DeepBS on DB and PTZ videos (Table 14). Even SuBSENSE, which does not include a CNN framework, shows better performance in many scenarios. The F1-score is a very important metric for measuring segmentation accuracy, and our proposal achieved a better F1-score than SuBSENSE on the CMD data set as well as on the DB and PTZ videos of the CD-2014 data set [42] (Tables 13 and 14).
We observe that, for unstabilized video, using the colour feature of a pixel alone yields better classification than the colour + FI features; the colour + FI features result in more false positives for unstabilized video (PTZ) than in the other cases. For PTZ, we obtained a 0.5425 F1-score with colour only and a 0.3564 F1-score with colour + FI. For the rest of the categories, however, colour + FI always performed better than colour alone. We also observe that adaptive threshold determination, background sample updates, and post-processing play a key role in accurate segmentation. Finally, by comparing our proposal with the existing approaches, we make the following observations: (1) The intensity variation of a pixel is one of the main causes of false detection. To address this issue, we propose a novel FI that can correct intensity variations (Figure 1). Therefore, using the FI, our proposed BGS provides correct background/foreground segmentation, as shown in the experimental results (Table 12).
(2) Edge information is not illumination-invariant; in particular, the boundary edges of objects can increase false positives under illumination variations. The proposed FI smooths the edge information to reduce such false positives. Table 11 shows that our colour + FI-based BGS improves the average F1-score by 1.59% compared with the colour + GM-based BGS. (3) Accurate segmentation is very difficult in challenging video cases such as dynamic backgrounds, thermal effects, intermittent object motion, camera jitter, bad weather, night videos, pan-tilt-zooming, and turbulence. However, our FI-based BGS approach improves classification accuracy for these challenging video cases; in Table 14, our proposal achieves the best segmentation accuracy for the dynamic background and pan-tilt-zooming video cases. (4) Real-time moving-object detection becomes easier with our FI-based BGS approach, because our proposal significantly reduces time complexity and memory space, making it suitable for real-time applications (Table 15).

CONCLUSIONS AND FUTURE WORKS
The FI is a combination of a colour/grey image and a gradient image. The GM alone is not error-free. Therefore, we created an FI that reduces noise due to illumination changes, thermal effects, shadows, and so on. The FI can thus also be applied in other research areas where image noise reduction is necessary. Dynamic threshold determination, background model updates, and post-processing also play a key role in correct background/foreground segmentation.
The proposed FBGS method shows competitive effectiveness. More importantly, our proposed algorithm performs well for the most challenging cases, such as dynamic backgrounds, PTZ, turbulence, camera jitter, thermal images, intermittent object motion, bad weather, shadows, and so on. Also, our FBGS can be utilized in real time.
In the future, we will use a deep CNN to extract salient features in addition to the FI feature in order to improve detection accuracy. We will also create a condition-dependent sample update strategy instead of the random sample update policy for the background sample update. Moreover, we will reduce the method's complexity by considering fewer features and fewer parameters when determining the dynamics of the background. Finally, because many parameters are used to model dynamic backgrounds, it is hard to analyse them to find their optimal values; we will study how to formulate and solve this optimal hyperparameter selection problem.