An efficient local stereo matching method based on an adaptive exponentially weighted moving average filter in SLIC space

Funding information: National Basic Research Program of China, Grant/Award Number: 2015CB352001; Shanghai Leading Academic Discipline Project, Grant/Award Number: S30502

Abstract: Rapidly obtaining accurate dense disparity maps has been the focus of stereo matching research. At present, approaches that achieve superior disparity maps require a large amount of computation, which is not suitable for practical applications. To address this issue, this paper proposes an efficient local matching method based on an adaptive exponentially weighted moving average filter and the simple linear iterative clustering (SLIC) segmentation algorithm. First, an effective matching cost is introduced that adaptively integrates the absolute intensity difference with the Census transform, making it robust to textureless regions and luminance variation. Then, during cost aggregation, the exponentially weighted moving average filter and SLIC segmentation are combined to handle the problems of computational cost and adaptive expansion of the aggregation window. Finally, the dense disparity map is obtained by a winner-takes-all approach and disparity refinement. To demonstrate its efficiency and validity, the method is quantitatively tested and compared with existing approaches on the Middlebury benchmark. The results show a non-occlusion accuracy of 90.66% and an average runtime of 7.01 s on the 2014 Middlebury dataset. Compared with existing competitive methods, the proposed method achieves superior matching results at a lower time cost.


INTRODUCTION
Stereo matching is at the forefront of the field of computer vision. It has wide applications, such as achieving the bokeh effect in images [1], 3D reconstruction [2], and industrial measurements [3]. Owing to the need for real-time processing and precision, the fast acquisition of dense disparity maps with high accuracy has been the fundamental goal of stereo matching. To achieve this, four critical steps are considered: cost computation, cost aggregation, disparity computation, and disparity optimization [4]. Existing algorithms are classified into global and local algorithms based on the strategy used in the disparity computation step. Global algorithms construct a global energy function through dynamic programming [5], belief propagation [6], and graph cuts [7] based on the smoothness assumption. Semi-global matching (SGM) [8,9] reduces the cost of this optimisation; even so, the runtime of global methods on high-resolution datasets is over 120 s [10], which is not suitable for practical applications. Local methods, by contrast, have received substantial attention owing to the use of the winner-takes-all (WTA) strategy for disparity computation, which significantly reduces the time complexity. The primitive approach is the box filter [4], which uses integral operations to achieve linear time. However, fixed-size windows (FW) destroy the disparity discontinuity and lead to edge blurring. Therefore, the local smoothness assumption has been proposed, which states that the support window should fit the object boundary to improve performance at edges. Accordingly, methods based on variable support windows (VSW) have been researched.
In recent times, the use of convolutional neural networks (CNNs) for stereo matching to produce high-quality disparity maps has been increasing. The matching cost convolutional neural network (MC-CNN) uses CNNs to obtain matching costs [11]. Liang [12] and Nikolaus [13] propose end-to-end network architectures for stereo matching. Cascade Residual Learning (CRL) [14] uses a cascaded CNN to produce high-quality disparities. All these CNN-based approaches have yielded better results, but they require a large amount of training data with accurate ground truth, as well as hardware resources for training and computation, which limits their use in low-end products.
Considering these factors, a local algorithm is chosen as the focus of this paper. This study concentrates on the cost computation and aggregation steps. Cost computation provides the source information for subsequent processing, which builds the foundation for excellent results. Additionally, cost aggregation is critical for preserving structure and recovering accurate disparity by producing proper support windows. An adaptive exponentially weighted moving average filter method based on the simple linear iterative clustering (SLIC) space is proposed to balance computational efficiency and disparity accuracy. To address the limited computational efficiency of local algorithms, the SLIC segmentation algorithm [15] is introduced to reduce the problem dimension from the pixel domain to the superpixel domain. Unlike in other segment-based methods, the disparity is allowed to vary within individual segments, owing to the combination of pixel-wise and superpixel-wise costs. To overcome the problem of insufficient information in a limited aggregation window, the adaptive exponentially weighted moving average (EWMA) filter [16] is introduced, which combines a smoothing filter with adaptive weights.
The remainder of this paper is organised as follows. Section 2 further describes the related work. Section 3 describes the proposed stereo matching method in terms of the four critical steps mentioned above. Section 4 presents the experimental approach and the evaluation results. Finally, the main conclusions are discussed in Section 5.

RELATED WORKS
Since cost computation and cost aggregation are the two major areas related to our study, we briefly review the latest literature related to them in the following sections.
In order to avoid the weaknesses of individual cost functions, multiple cost functions are combined to improve performance. Zhu [17] constructed a cost function based on the truncated absolute difference (AD) and the gradient, using an explicit factor to balance the two terms. A similar formulation is adopted in [18], where the gradient and the Census transform (CT) are blended with an implicit factor. However, the factor in these methods is a constant. This study employs a variable factor to adaptively balance the shares of the two metrics and thereby strengthen the combination.

Cost aggregation based on segmentation and iterative filtering
A proper aggregation strategy can reduce the computational load and have a significant effect in large low-texture regions, repetitive regions, and occluded regions. Veksler [19] proposed multiple windows to aggregate the optimal cost. However, the improvement over the box filter was limited owing to the finite number of window sizes. Mei et al. [20] improved the cross-based windows proposed by Zhang et al. [21]. This approach chooses optimal window shapes to preserve the boundaries, which significantly increases the time consumption. Considering that both VSW filters and cross-based window filters only have binary weight values, the bilateral filter (BF) [22] was studied, which adjusts weights depending on the local distributions of colour and distance. Inspired by the concept of the BF, Yoon and Kweon [23,24] proposed the adaptive approach, which replaced the Gaussian function with an exponential function to assign the weights based on the Euclidean distance of range similarity and spatial proximity. Further, to reduce the time consumption, Hosni [25] introduced the guided filter (GF) into stereo matching, with a processing time independent of the kernel size. While these adaptive support weight (ASW) methods further improve the accuracy, their computational efficiency gradually decreases [26,27].
Various methods to further increase the matching efficiency have been researched. On the one hand, the non-local solution based on the minimum spanning tree (MST) [28] has been studied. The segment tree (ST) [29] and the iterative MST structure [30] were proposed following the MST. However, for all MST-based solutions, it is difficult to guarantee the performance on inclined surfaces [31] owing to the limited valid aggregation with respect to the geodesic distance. This is particularly significant in large, low-texture regions, such as grey grounds and white walls. On the other hand, segment-based methods have received significant attention from researchers owing to the efficiency of down-sampling. Tombari [32] used the Mean-Shift segmentation algorithm to obtain segments; the cost is then aggregated from each segment and its corresponding adaptive window, which achieves better performance in highly textured regions. However, errors still exist in low-texture regions owing to the few identifiable correlations. To alleviate this problem, Chang [33] employed the SLIC segmentation algorithm and blended the costs of multiple neighbouring segments. However, these segment-based methods share a common shortcoming: the disparities between segments may be discontinuous.
This paper combines the adaptive EWMA filter and the SLIC segmentation algorithm in the cost aggregation step. In contrast to the segmentation-based adaptive support (SAS) algorithm [32], the proposed method performs a weighted average between segments to obtain sufficient information. Different from the segment-based cost aggregation (SCA) algorithm [33], which uses a fixed weight, our approach employs adaptive weights for the EWMA filter, which efficiently extends the support window in SLIC space.

METHODS
This section describes the proposed method in detail. The flowchart of the method is also shown (see Figure 1).

Matching cost computation
The cost computation process utilises a variety of cost functions. Individual functions are straightforward and efficient, but their results are vulnerable to noise perturbation and illumination changes. Therefore, combinations of multiple matching costs are exploited to obtain better performance. The AD detects the luminance difference between pixels and is best suited for low-texture and flat areas, while the CT measures the relative magnitudes within neighbourhoods and has a high tolerance to radiometric distortion and highly textured regions. Therefore, a cost function combining AD and CT is an optimal choice for robustness in a variety of scenarios. The combined cost is

$$C(p,d)=\alpha\,C_{AD}(p,d)+(1-\alpha)\,C_{census}(p,d), \tag{1}$$

with

$$C_{AD}(p,d)=\frac{1}{3}\sum_{c\in\{R,G,B\}}\left|I_{L}^{c}(p)-I_{R}^{c}(p-d)\right|,\qquad C_{census}(p,d)=\sum_{q\in N(p)}H\!\left(I_{L}(q)-I_{L}(p)\right)\oplus H\!\left(I_{R}(q-d)-I_{R}(p-d)\right),$$

where $I_{L}^{c}$ and $I_{R}^{c}$ are the values of channel $c$ in the left and right images, respectively, $\oplus$ is the XOR operator, and $H(\cdot)$ is the binary luminance difference. The $\alpha$-value varies depending on the geometric structure to make full use of the characteristics of the AD and CT strategies, making the matching cost curve more robust to textureless regions and brightness changes.
To attribute more weight to the AD-based cost $C_{AD}$ in textureless regions, where the CT cost $C_{census}$ has no discriminating power, the parameter $\alpha$ is negatively correlated with the detail strength. $\alpha$ is defined by the Laplacian function with respect to the gradient $I_G$ at pixel $(i,j)$:

$$\alpha(i,j)=\exp\!\left(-\frac{\left|I_{G}(i,j)\right|}{g}\right).$$
The value of $g$ controls the shape of the Laplacian filter. The gradient $I_G(i,j)$ is detected with convolution operations in a discrete form:

$$I_{G}(i,j)=\sqrt{I_{GH}(i,j)^{2}+I_{GV}(i,j)^{2}},\qquad I_{GH}=f_{H}*I,\quad I_{GV}=f_{V}*I.$$
Here, $I_{GH}$ and $I_{GV}$ are the horizontal and vertical gradients, respectively, and $f_H$ and $f_V$ are filter kernels with radii $R_1$ and $R_2$. The coefficient values of the convolution kernel are negatively correlated with the radial distance. Hence, the initial values $f(x,y)$ are generated by the Gaussian function:

$$f(x,y)=\exp\!\left(-\frac{x^{2}}{2\sigma_{x}^{2}}-\frac{y^{2}}{2\sigma_{y}^{2}}\right).$$

The values of $\sigma_x$ and $\sigma_y$ control the shape of the function. Some additional processing is necessary to make the filter sensitive to gradients, similar to the Sobel operator: the values of the middle column of $f(x,y)$ are set to zero, and the values of the left column are negated. Then, $f_H$ is produced to detect the gradients in the horizontal direction, and $f_V$ is the transpose of $f_H$.
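As a sketch of this cost construction, the following Python snippet combines a truncated AD term with a Census/Hamming term under a gradient-driven adaptive weight. The paper's implementation is in MATLAB; the function names, the use of `np.gradient` in place of the Gaussian-derived kernels, and the truncation value are our own simplifications:

```python
import numpy as np

def census_transform(img, r=1):
    """Binary Census descriptor: compare each pixel against its
    (2r+1)^2 - 1 neighbours (H x W x 8 boolean bits for r=1)."""
    H, W = img.shape
    pad = np.pad(img, r, mode='edge')
    bits = []
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue
            bits.append(pad[r + dy:r + dy + H, r + dx:r + dx + W] > img)
    return np.stack(bits, axis=-1)

def combined_cost(left, right, d, g=1.5, trunc=60.0):
    """AD-Census cost for one disparity d, blended by an adaptive alpha
    that favours AD in flat regions and Census near texture."""
    H, W = left.shape
    right_sh = np.empty_like(right)
    right_sh[:, d:] = right[:, :W - d] if d > 0 else right
    right_sh[:, :d] = right[:, :1]                      # replicate the left border
    c_ad = np.minimum(np.abs(left - right_sh), trunc)   # truncated absolute difference
    c_census = np.count_nonzero(                        # Hamming distance via XOR of bits
        census_transform(left) ^ census_transform(right_sh), axis=-1).astype(float)
    gy, gx = np.gradient(left)                          # simple gradient magnitude
    alpha = np.exp(-np.hypot(gx, gy) / g)               # Laplacian-shaped weight, large in flat areas
    return alpha * c_ad + (1.0 - alpha) * c_census
```

In a full pipeline this function would be evaluated once per candidate disparity to build the cost volume.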

Cost aggregation
Classic cost aggregation methods operate pixel by pixel, which leads to a significant amount of redundant computation and slow propagation of information. This paper proposes an efficient local stereo matching method using SLIC segmentation and adaptive EWMA filtering. This approach has two advantages. First, based on the weight matrix for obtaining a smooth matching cost, the EWMA filter can quickly expand the support window. Second, the SLIC segmentation clusters similarly coloured or neighbouring pixels together for matching. This not only resists the influence of noise to a certain extent, but also reduces the dimensionality from pixel-wise to superpixel-wise. These factors significantly improve computational efficiency and matching accuracy. Hence, EWMA filter propagation in SLIC space is adopted for cost aggregation. A detailed description of the cost aggregation process is provided below.

Firstly, through the SLIC segmentation algorithm, an SLIC space is established to cluster pixels into superpixels based on the local intensity distribution. The initial cost matrix is aggregated within individual superpixels rather than a rectangular or weighted window. This is a down-sampling step, projecting the matching costs from pixel space $C_p$ into SLIC space $C_s$ by employing a splat matrix $S$ and a mean filter:

$$C_{s}=\tilde{S}\,C_{p},$$

where $\tilde{S}$ denotes the splat matrix normalised by the pixel count of each segment.
Through the mean filtering in each segment, some spurious local minima caused by noise interference disappear, while the true ones are retained [34].
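The down-sampling step can be sketched in numpy as follows; here the splat matrix is realised implicitly through the segment label map, and the function name is our own:

```python
import numpy as np

def cost_to_slic_space(cost_p, labels):
    """Project a pixel-wise cost volume (H x W x D) into superpixel space by
    averaging the cost inside each segment (splat matrix + mean filter).
    Assumes labels are contiguous integers 0..n_seg-1."""
    H, W, D = cost_p.shape
    n_seg = labels.max() + 1
    flat = labels.ravel()
    counts = np.bincount(flat, minlength=n_seg).astype(float)  # pixels per superpixel
    cost_s = np.empty((n_seg, D))
    for d in range(D):
        sums = np.bincount(flat, weights=cost_p[:, :, d].ravel(), minlength=n_seg)
        cost_s[:, d] = sums / counts                           # per-segment mean cost
    return cost_s                                              # n_seg x D
```

The per-segment mean is exactly the row-normalised splat-matrix product, but `np.bincount` avoids materialising the sparse matrix.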
The SLIC segmentation algorithm [15] uses k-means to generate superpixels and is faster and more memory-efficient than other methods. Two significant principles are involved: the spatial distance range is proportional to the superpixel size, and the weighted distance measurement combines normalised colour similarity and spatial closeness. Therefore, pixels that are close in colour and spatial space are efficiently clustered. The disparity maps generated from three cost aggregation strategies are shown in Figure 2. The first strategy only uses WTA on the pixel cost; the second uses a combination of pixel cost and superpixel cost without the EWMA filter; the third adds the EWMA filter to the second strategy. It is noteworthy that many spurious points are removed owing to the superpixel cost, especially in low-texture regions, while the EWMA filter dramatically improves the disparity maps owing to the cost aggregation over vast regions.

Secondly, cost aggregation is a critical step in maintaining the disparity boundary and smoothness. Distinct from global algorithms, the smoothness constraint is embedded into the weight matrix used by the EWMA filter. To adaptively propagate the cost among specific regions according to the image content, the weight matrix corresponding to the SLIC space is computed. The weight matrix is an indicator of the connection between neighbourhoods in the SLIC space. Adjacent superpixels with similar colours share prominent correlations, and these correlations can be enhanced by applying the smoothing filter between them. The colour weight is

$$W_{c}(i,j)=\exp\!\left(-\frac{\|\mu_{i}-\mu_{j}\|_{2}^{2}}{\sigma_{c}^{2}}\right).$$

Here, $\mu_i$ and $\mu_j$ are the average values of the pixels within superpixels $i$ and $j$, respectively, in CIE-Lab colour space; $\sigma_c$ is the parameter that controls the decay rate of the colour weight in the Gaussian function; and $\|\cdot\|_2$ represents the L2 norm.
Further, the distance weight is

$$W_{d}(i,j)=\exp\!\left(-\frac{\|p_{i}-p_{j}\|_{2}^{2}}{\sigma_{d}^{2}}\right),\qquad W(i,j)=W_{c}(i,j)\,W_{d}(i,j),$$

where $p_i$ and $p_j$ are the coordinates of the barycentres of superpixels $i$ and $j$ in Euclidean space, and $\sigma_d$ is the parameter that controls the decay rate of the distance weight in the Gaussian function. Based on the weight matrix between superpixels, the matching cost is passed through the EWMA filter. For computational efficiency, the weight matrix is normalised to $\tilde{W}$, giving the recurrence

$$C_{s}(t+1)=\lambda\,\tilde{W}\,C_{s}(t)+(1-\lambda)\,C_{s}(0),$$

where $C_s(t)$ is the cost matrix at $t$ iterations and $\lambda$ is the control probability.

The weighted support windows following the propagation of the EWMA filter in three synthetic images are shown in Figure 3, in which the red points represent the centre of the superpixel and the black arrows represent the propagation direction at each iteration. Once the iterations are complete, the support window contributing to the central superpixel is constructed (marked in blue in Figures 3 and 4, and in grey in Figure 5b; brighter superpixels have larger support weights). Furthermore, benefiting from the aggregation effect in SLIC space, the propagation of the EWMA filter effectively maintains its original diffusion trajectories (Figure 3) even in the presence of severe Gaussian noise perturbation (σ = 64). This indicates that the support window expands uniformly in all directions around the central superpixel in flat regions, while the weight energy decays with increasing radial distance. As shown in Figure 5, a similar distribution can be observed in the edge region, the difference being that the diffusion window stops moving in the presence of a rapid colour transition. Hence, the propagation strategy can quickly and accurately locate the adaptive weight window even with noise disturbance in typical scenes. As a result, adjacent segments with similar colours have homogeneous disparities, while others possess abruptly changing disparities. Subsequently, for the sake of disparity computation performance, the superpixel-wise matching cost volume needs to be transformed back.
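The weight construction and EWMA propagation over the superpixel graph can be sketched as follows. This is a dense-matrix toy version under our own parameter names (`sigma_c`, `sigma_d`, `lam`); a practical implementation would restrict the weights to adjacent superpixels:

```python
import numpy as np

def ewma_aggregate(cost_s, mean_lab, centers, sigma_c=10.0, sigma_d=10.0,
                   lam=0.9995, iters=15):
    """EWMA cost propagation over superpixels.
    cost_s:   n x D superpixel-wise cost volume
    mean_lab: n x 3 average CIE-Lab colours per superpixel
    centers:  n x 2 barycentre coordinates per superpixel"""
    dc = np.linalg.norm(mean_lab[:, None, :] - mean_lab[None, :, :], axis=-1)
    dd = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    W = np.exp(-(dc / sigma_c) ** 2) * np.exp(-(dd / sigma_d) ** 2)  # colour x distance weight
    W /= W.sum(axis=1, keepdims=True)          # row-normalise -> W~
    c0 = cost_s.copy()
    c = cost_s.copy()
    for _ in range(iters):
        c = lam * (W @ c) + (1.0 - lam) * c0   # EWMA recurrence, anchored to the initial cost
    return c
```

Each iteration diffuses cost between similar, nearby superpixels, so the effective support window grows with the number of iterations while the `(1 - lam)` term keeps the result tied to the initial cost.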
Essentially, the pixel-wise cost $C_p^{*}$ is generated by the up-sampling projection, through the splat matrix, of the superpixel-wise cost $C_s^{*}$ obtained by the EWMA filter:

$$C_{p}^{*}=S^{T}C_{s}^{*}.$$
Finally, unlike in other methods [32,33], the matching cost $C$ is formed by combining the two costs from pixel space and SLIC space:

$$C=\beta\,C_{p}+(1-\beta)\,C_{p}^{*},$$

where $\beta$ is the balance factor. Owing to the integration of the smooth superpixel cost and the dependable pixel cost, the performance of this strategy is similar to that of semi-global algorithms. Consequently, the disparity preserves both discontinuity and smoothness, outperforming other segment-based algorithms.
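The up-sampling and blending steps reduce to a label lookup plus a weighted sum; a minimal sketch (the `beta` name is our stand-in for the paper's balance factor):

```python
import numpy as np

def combine_costs(cost_p, cost_s, labels, beta=0.4):
    """Blend the dependable pixel-wise cost (H x W x D) with the smoothed
    superpixel-wise cost (n_seg x D), up-sampled via the segment labels."""
    cost_s_up = cost_s[labels]                 # label lookup: n_seg x D -> H x W x D
    return beta * cost_p + (1.0 - beta) * cost_s_up
```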

Disparity selection
Following the cost combination, the final cost volume in pixel space is generated. Based on the characteristics of this cost volume, complex disparity computation methods are avoided in order to reduce the computational burden, and the disparity map is acquired solely by the WTA strategy. The initial disparity map $D$ is determined by the disparity range $[d_{min}, d_{max}]$ and the matching cost $C$:

$$D(p)=\mathop{\arg\min}_{d\in[d_{min},\,d_{max}]}C(p,d).$$
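The WTA selection is a single argmin over the disparity axis of the cost volume; a minimal sketch with our own function name:

```python
import numpy as np

def wta_disparity(cost, d_min=0):
    """Winner-takes-all: pick, per pixel, the disparity with minimal cost.
    cost is an H x W x D volume whose last axis indexes d_min..d_min+D-1."""
    return d_min + np.argmin(cost, axis=-1)
```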

Disparity refinement
The initial disparity map obtained from the disparity selection still contains errors that require detection and correction. Owing to the cost propagation in SLIC space, a mismatch often emerges as a cluster of pixels. In such cases, the commonly used median filter [8] and nearest-interpolation [35] approaches are not suitable. Therefore, plane fitting [36] is used here to refine the unstable disparities. The left-right consistency check (LRC) is used to detect erroneous matches violating the coherence constraint:

$$\left|D_{Left}(p)-D_{Right}\bigl(p-D_{Left}(p)\bigr)\right|\le\tau.$$

Here, $D_{Left}(p)$ and $D_{Right}(p)$ are the disparity maps for the left and right images, respectively, and $\tau$ is the consistency threshold (typically one pixel); pixels violating the inequality are marked as unstable. Following the LRC operations, stable candidates are obtained for computing the plane parameters in each segment. Furthermore, the arithmetic average error is used to measure the quality of the plane fitting. Nevertheless, some segments fail the fitting owing to the non-planarity of an object or improper segmentation, so smaller segments are necessary for additional fitting to further correct false disparities. Therefore, the segment size for iterative plane fitting progresses from large to small, after which an optimised disparity map is produced.
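The LRC test can be sketched as a vectorised lookup of each left-view disparity in the right view (function name and border clipping are our own choices):

```python
import numpy as np

def lrc_check(disp_left, disp_right, tol=1):
    """Left-right consistency: pixel p is stable iff
    |D_left(p) - D_right(p - D_left(p))| <= tol.
    Expects integer disparity maps; returns a boolean stability mask."""
    H, W = disp_left.shape
    xs = np.arange(W)[None, :].repeat(H, axis=0)
    xr = np.clip(xs - disp_left, 0, W - 1)            # matching column in the right view
    d_r = disp_right[np.arange(H)[:, None], xr]       # right-view disparity at the match
    return np.abs(disp_left - d_r) <= tol
```

Pixels where the mask is `False` would then be re-estimated by the segment-wise plane fitting described above.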

EXPERIMENTS
To verify the contributions of this paper, relevant experiments were designed. The proposed method was tested on a desktop computer with an Intel Core i7 3.2 GHz CPU, with the implementation done in MATLAB. The parameters were set as follows: σ_x = 5.0, σ_y = 2.2, g = 1.5, β = 0.4, σ_d = σ_c = 10.0, t = 15, and λ = 0.9995. The superpixel size was set to 5 × 5. The superpixel number is adaptively calculated by dividing the pixel number by the superpixel size. As a result, the sizes of all superpixels fluctuate around a fixed size regardless of the variation in image resolution. Additionally, the Middlebury v.2 [37] and v.3 [38] datasets were selected as the experimental datasets.

Stability of the matching cost method under different luminance levels
The radiometric changes generated by fluctuations in camera exposure as well as environmental illumination are among the primary factors that impact matching precision. The 2006 dataset at third resolution used in the experiments includes 21 stereo pairs, where each pair contains three different illuminations (I1, I2, I3), and each illumination contains three different exposures (E0, E1, E2). The left image was fixed at exposure E1, while the right image was selected at exposures E0, E1, and E2 under the different illuminations I1, I2, and I3. Widely used matching cost functions were selected for comparison: the combination of CT and the gradient (GA), the combination of GA and AD, the combination of CT and AD, and the combination of CT, AD, and GA. The results after cost aggregation (Section 2.2) are presented in this section.
The visual performance on Flowerpots is shown in Figure 6. Red marks the errors, while blue marks the errors for which the proposed method obtains accurate results. It is clear that the variability of the exposure is inversely related to the accuracy. Table 1 presents the quantitative results in detail. Among the methods tested, the proposed approach has the fewest errors on slanted planes and edges. It can also be observed that the combination of CT and AD ranks second, which provides a solid foundation for the cost function proposed herein. Moreover, the overall average mismatching rate of the proposed method is the lowest under the different illumination conditions, indicating that the proposed approach is more robust to exposure variation.

4.2.1 Quantitative comparison between existing methods and the proposed method

Accuracy and computational efficiency are two key criteria of stereo matching. The Middlebury 2014 training dataset at quarter resolution was chosen to evaluate the effectiveness of the method on these criteria. This dataset includes 15 typical scenarios of stereo pairs with ground truth. Several well-known matching methods were considered for comparison. The box filter [4] is the fastest local algorithm, which sets a target for the runtime. The BF [24] is an important adaptive approach in stereo matching, while the GF [25] dramatically reduces the computational burden. All three methods were applied to the cost volume in the aggregation step: the box filter was used to evaluate the efficiency of the proposed method, while the BF and GF were chosen to evaluate its quality. These methods were applied in the same software environment. The disparity maps (Figure 7) show that, compared with the other methods, the proposed method performs well in occluded regions in the Adirondack, Piano, and PlaytableP datasets. In the Recycle, Shelves, and Vintage datasets, the proposed method shows superior performance on slanted planes with large low-texture regions. Moreover, our method accurately restores the disparity in repetitive patterns (the back of the chair in Adirondack, the keys in Piano, and the keypads in Vintage). The overall results indicate that the combination of the EWMA filter and SLIC segmentation can resist the influence of occlusion and improve accuracy on slanted planes, repetitive patterns, and low-texture regions. SLIC segmentation categorises pixels into different structures, and the EWMA filter propagates information from reliable regions to unreliable ones. As a consequence, the disparity of the various regions is properly preserved by the proposed framework.
The detailed accuracy ratios (see Table 2) reveal that, compared with the BF [24] and GF [25], the proposed algorithm achieves the best accuracy: 85.63% overall and 90.66% in the non-occluded regions. In the discontinuity regions, the accuracy of the proposed method is 16.72% higher than that of the GF, which is the only other method whose accuracy exceeds 50%. The computation measurements also show that its runtime is far lower than that of the others, even competing with that of the box filter. The proposed method is further compared with related local algorithms on the Middlebury 2014Q dataset. Bad 1.0 is set as the error threshold in non-occluded regions. SAS [32], SCA [33], and the proposed method are implemented with the SLIC algorithm, while GF_EMA [39] employs the EWMA filter and guided filter for smoothing the cost volume in pixel space. (The running time of GF_EMA is not published in the original paper.)
The quantitative results are shown in Table 3. On the one hand, the proposed method achieves the highest accuracy and computational efficiency among the four algorithms: compared with SAS and SCA, the running time is reduced by 97.94% and 96.38%, respectively. On the other hand, our method is more stable, with a standard deviation 66.17% lower than that of GF_EMA.

4.2.2 Quantitative comparison between competitive methods and the proposed method

To further evaluate the potential of the proposed method in terms of efficiency and effectiveness, it is compared with competitive methods, including MSCS [40], LocalExp [41], DoG-Guided [42], and INTS [43], on the Middlebury 2014 dataset. These algorithms are the latest evolution of the classic local and global techniques, such as the GF, SGM, and graph cuts. MSCS is a global disparity estimator that combines MeshStereo and cross-scale cost filtering for faster computation. LocalExp is a state-of-the-art algorithm that combines slanted patch matching and improved graph cuts with MC-CNN-acrt embedded. DoG-Guided is a local stereo matching algorithm that employs an adaptive GF, while INTS is a stereo matching algorithm based on a semi-global approach.
The quantitative results are shown in Table 4, which lists the average precision percentages, where the subscript numbers indicate the corresponding ranks. In terms of time efficiency, MSCS is ten times slower than the proposed method, owing to the convergence of the global energy function in the former approach. Similarly, the proposed method is 120 times faster than LocalExp, although they share the same average rank. Both DoG-Guided and INTS are inferior to the proposed method in terms of both accuracy and time consumption. These comparisons indicate that the proposed method not only achieves higher accuracy but also significantly reduces the running time, demonstrating that the present work enables an extensive improvement in stereo matching performance.

Quantitative comparison on Middlebury v.2 and v.3 datasets
To verify the stability and robustness of the algorithm, four well-known algorithms, including SWSS [44], MGM [45], MSCS [40], and CS-PM [46], are tested on Middlebury v.2 datasets (31 stereo pairs with third resolution) and Middlebury v.3 datasets (28 stereo pairs with quarter resolution). Here, SWSS and CS-PM are local algorithms whereas MGM and MSCS are global algorithms.
The quantitative results are shown in Tables 5 and 6. Bad 1.0 and bad 2.0 in non-occluded regions are measured for v.2 and v.3 datasets, respectively. The proposed method achieves the first rank along with MSCS. However, although MSCS has a lower matching error than our method, its execution time is more than 10 times as long as ours. Furthermore, our method is faster and more accurate than SWSS, MGM, and CS-PM. The result demonstrates that the proposed method performs well in various scenarios in terms of stability and robustness.

Impact of different parameter settings
Different combinations of parameters can produce different disparity qualities. Therefore, to obtain better results, it is necessary to explore the effects of typical parameters. Based on the theory presented above, the number of iterations and the size of a superpixel are two key parameters. The experiments with various parameter values are conducted on the 2014 training datasets.

Effects of varying the superpixel size
The superpixel size k not only impacts the accuracy but also influences the computational efficiency. Since the superpixel size is inversely proportional to the superpixel number, the larger the superpixel size, the lower the computational load. The results for different superpixel sizes (varying from approximately 2 × 2 to 15 × 15) in terms of average mismatching ratio and time consumption are shown for all regions (see Figure 8a). It is noteworthy that up to a moderate superpixel size, such as 7 × 7, the accuracy oscillates within a small range. Good performance is achieved (average accuracy of 90.66% and runtime of 6.94 s) when the superpixel size k is 5 × 5.

Effects of the number of iterations
The number of iterations directly impacts the support window of the EWMA filter: the greater the number of iterations, the wider the adaptive window, while the runtime increases accordingly. Here, the number of iterations was varied from 5 to 35, and the resulting impact on the average mismatching ratio and time consumption is shown for all regions (Figure 8b). The graph shows that, in the non-occluded region, good performance is achieved (average accuracy of 90.01% and runtime of 6.71 s) when the number of iterations t is 15.

CONCLUSION
This study provides a local stereo matching method to obtain an accurate disparity map more efficiently. This is achieved by focusing on the cost computation and aggregation steps of stereo matching to quickly obtain a cost volume with appropriate support windows. Accordingly, a novel algorithm is proposed based on an adaptive EWMA filter and the SLIC segmentation algorithm. In the cost computation step, an adaptive strategy based on AD and CT is introduced. Then, during cost aggregation, the EWMA filter and SLIC segmentation are integrated to improve computational efficiency and increase the acquisition of smoothing information. Experiments on the Middlebury 2006 dataset demonstrate that the proposed approach is robust against radiometric variations. The average matching accuracy on the 2014 Middlebury standard image pairs is 90.66% in non-occluded regions, which is superior to several representative local methods. Furthermore, the average runtime is 7.01 s, which is lower than that of other commonly used methods, and the proposed method also performs well on the Middlebury v.2 and v.3 datasets. The efficiency and effectiveness of the proposed approach show considerable potential for practical applications, such as the bokeh effect and autofocus in smartphones. However, the approach has weak disparity discrimination in large flat areas with no valid information; adding proper constraints to the EWMA filter is a possible solution to be explored in future work. Furthermore, the method is implemented in MATLAB; for practical applications, hardware acceleration could be adopted to further reduce the runtime.

ACKNOWLEDGEMENT
This work was supported by the National Basic Research Program of China (2015CB352001) and the Shanghai Leading Academic Discipline Project (S30502).