CSNet: Cascade stereo matching network using multi-information cost volume

The disparity map produced by matching a pair of rectified stereo images provides estimated depth information and accurate distance calculations for autonomous driving. For most stereo matching networks, the cost volume plays a crucial role in the accuracy of the disparity map. To increase this accuracy, an improved cost volume method called multi-information cost volume (MICV) is proposed: it fuses the concatenation volume with an improved correlation volume, computed in both inner product space and Euclidean space, to measure the similarity between features in the left and right images. The inclusion of a squeeze-and-excitation (SE) module further improves MICV by adjusting the contribution of the correlation volume. To refine the disparity map and enhance the semantics of small objects, a cascade stereo network called CSNet is proposed, with a dilation feature fusion unit (DFFU) to calculate and integrate disparity maps from three branches of different scales. The smaller-scale branches are gradually integrated into the larger-scale branches to transmit semantic information. The proposed method was evaluated using several established benchmarks, including the Sceneflow, KITTI2012 and KITTI2015 datasets. Experimental results demonstrate that the authors' method produces more accurate disparity maps than existing state-of-the-art methods.


INTRODUCTION
Autonomous vehicles, intelligent driver assistance systems, and other smart devices need to perceive the environment. A great deal of work is currently devoted to environmental perception, such as target detection [1,10], target tracking [2] and semantic segmentation [3]. However, these methods can only produce 2D results, and they need to be combined with depth information to generate usable 3D scenes. Stereo matching is a passive depth perception technology that estimates the horizontal displacement (disparity) between a pair of corresponding pixels on a rectified pair of stereo images. For one pixel in one image of the pair, if its disparity to the corresponding point in the other is d, then this pixel's depth is computed as lf/d, where l denotes the baseline distance between the two cameras and f is the focal length. In recent years, CNN-based stereo matching methods have outperformed conventional methods significantly. Researchers have particularly focused on using CNNs to improve the cost volume. MC-CNN [4] first introduced a CNN to learn how to match corresponding points between a rectified stereo image pair. DispNet [5] constructed a correlation cost volume from left and right feature maps, which was then put into a CNN for regularization. GC-Net [6] and PSMNet [7] applied a 4D concatenation-based cost volume instead of a correlation volume, which coalesces the left and right features without losing the geometry of stereo vision. GwcNet [8] proposed a group-wise correlation composed of the correlation volume and the concatenation volume, which produced a more accurate disparity map.
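The depth relation above can be checked with a one-line helper (a minimal sketch; the baseline and focal length in the usage note are illustrative KITTI-like values, not taken from the paper):

```python
def depth_from_disparity(d_px, baseline_m, focal_px):
    """Depth = l * f / d: baseline (m) times focal length (px), over disparity (px)."""
    if d_px <= 0:
        raise ValueError("disparity must be positive")
    return baseline_m * focal_px / d_px
```

For example, a disparity of 10 px with a 0.54 m baseline and 721 px focal length gives a depth of about 38.9 m; halving the disparity doubles the estimated depth, which is why accurate disparities matter most for distant objects.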
However, the cost volume methods above still have limitations: (1) correlation volume loses the spatial information of features because it only generates a single-channel correlation map across each disparity level; (2) concatenation volume forces the following aggregation network to use more parameters to learn the similarity from scratch, which overburdens the network; (3) although the group-wise correlation has the advantages of both the correlation and concatenation modes, it computes the similarity in inner product space and stiffly combines them without any guidance.
Furthermore, disparity map refinement is an important part of stereo matching in both traditional and CNN-based methods. Most stereo matching networks simply stack multiple convolutional layers to refine the initial disparity map. Many downsampling operations occur within these complex and redundant networks [17,19], and detailed information about small objects on the road is lost in the process. This decreases the accuracy of the disparity map and, in turn, of the distance information computed for vehicles on the road, providing incorrect distances to the driver.
To overcome these disadvantages, we propose a novel cascade stereo matching network (CSNet) that uses a multi-information cost volume (MICV). To increase disparity map accuracy, MICV fuses the concatenation volume and an improved correlation volume. Unlike the group-wise correlation, our correlation volume is calculated not only in inner product space but also in Euclidean space, in order to preserve and enhance the similarity information of traffic features from the left and right images. MICV is further improved with a squeeze-and-excitation (SE) module [9] that adaptively adjusts the contribution of the correlation volume.
To achieve efficient disparity map refinement, we designed CSNet with MICV to repair disparity maps gradually. The input stereo images are first mapped into three groups of high-dimensional features of different scales by the feature extractor network; the groups are then combined into three MICVs (V_1/16, V_1/8, V_1/4) of different scales using the proposed method. The MICVs are fed into three independent aggregation networks to produce disparity maps. For clarity, we refer to the three branches from top to bottom as branch_1, branch_2, and branch_3. In a conventional CNN, as the network deepens, the feature maps carry more semantic information, but multiple downsampling operations filter out detailed information about small objects. Across the three MICVs, as the scale increases from V_1/16 to V_1/4, the resolution and level of detail increase, but semantic information is reduced. Therefore, we use branch_1 and branch_2 to gradually supplement the semantic information in branch_3 to improve disparity accuracy. In addition, to fuse the information from the different branches more efficiently, we propose a dilation feature fusion unit (DFFU), which merges information from surrounding pixels to repair the upsampled feature map.
The logical flow diagram of the proposed method is shown in Figure 1 and is explained in detail in Section 3. We set up a series of ablation experiments and comparison experiments on Sceneflow [5], KITTI2012 [11] and KITTI2015 [12] datasets to verify the effectiveness of the proposed method. Experimental results show that our method outperforms most existing methods.
Our main contributions can be summarized as follows.
1. We propose a new combination of left and right features, MICV, which consists of the concatenation volume and an improved correlation volume computed in both inner product space and Euclidean space; an SE module adaptively adjusts the contribution of the correlation volume.
2. We propose a cascade stereo network, CSNet, with a dilation feature fusion unit (DFFU) that gradually fuses semantic information from the smaller-scale branches into the larger-scale branches to refine the disparity map.
3. We evaluate the proposed method on the Sceneflow, KITTI2012, and KITTI2015 benchmarks, where it produces more accurate disparity maps than most existing methods.

RELATED WORK
To make smart devices understand their surroundings, they must be given depth information. In the field of autonomous driving, stixel-world [28] is often used to define a compact medium-level representation of dense 3D disparity data. In stixel-world, millions of disparity pixels are abstracted into hundreds or thousands of stixels to measure the space in front of the vehicle restricted by objects with almost vertical surfaces. Schneider et al. [29] propose a novel vision-based scene model that uses disparity maps to compute the geometric and semantic layout of a scene. Daniel et al. [30] further fused semantic segmentation maps with disparity maps to obtain more accurate slanted stixels. However, calculating stixels requires a precise and dense disparity map. Traditional stereo matching [13] typically consists of four steps: (1) matching cost computation; (2) cost aggregation; (3) disparity calculation; (4) disparity refinement. Due to CNN's excellent characterization capabilities for various computer vision applications, CNN is also used to implement part or all of these four steps.
Zbontar and LeCun [4] first introduced a CNN to compute the matching cost between a pair of 9×9 image patches. On this basis, Luo et al. [14] designed a faster network that uses a product layer to measure the similarity between the two features of a siamese architecture, and trained the whole network in a multi-class classification setting. Shaked and Wolf [15] presented a novel highway network architecture to compute the matching cost and to modify it with a confidence network. Though these data-driven approaches significantly outperform conventional hand-crafted methods, numerous and complex post-processing procedures are still necessary to produce usable results, including cost aggregation, semi-global matching [16] and sub-pixel enhancement. These CNN-based stereo matching methods are therefore two-stage networks with post-processing, not end-to-end networks.
To realize an end-to-end process, a carefully designed CNN estimates the disparity directly, without any post-processing. Mayer et al. [5] proposed a large synthetic dataset for stereo matching and a stereo network (DispNet) for predicting disparity end to end. Following DispNet, several researchers introduced the correlation cost volume to regress disparity. Kendall et al. [6] designed a novel concatenation-based cost volume (GC-Net), which is put into a 3D CNN to incorporate contextual information from the height, width, and disparity dimensions. Chang et al. [7] extended GC-Net and proposed a stacked hourglass module to process the cost volume; this module repeatedly aggregates the global context information of the cost volume. GwcNet [8] introduced the group-wise correlation, which consists of the correlation volume and the concatenation volume, to measure feature similarities without losing spatial information. However, these networks do not consider the refinement of the disparity, which results in inaccurate depth information for vehicles.
In the area of disparity map refinement, Gidaris and Komodakis [17] presented a network that improves label accuracy in three steps: (1) it detects incorrect labels, (2) replaces them with new labels, and (3) refines the new labels; they used this network to refine disparity maps. Following DispNet, Pang et al. [18] proposed a two-stage network (CRL). The first stage, composed of DispNet, generates the initial disparity, and the second stage learns the correction. Liang et al. [19] stacked several convolutional layers to repair the disparity map with an iterative strategy. Khamis et al. [20] applied six dilation convolution blocks to enlarge the receptive field for disparity refinement. We adopted a cascade architecture because of its success across many CNN tasks, including semantic segmentation, object detection, and edge extraction. For semantic segmentation, ICNet [21] used a cascade hierarchical optimization strategy to obtain rough semantic segmentation feature maps from low-resolution images, and then obtained detailed information from high-resolution images to realize hierarchical optimization of the segmentation maps. DFANet [22] applied a multi-stage, multi-network sequential optimization strategy to obtain better segmentation results. For object detection, Cascade R-CNN [23] gradually obtains a more accurate detection box by increasing the IoU threshold, and Cascade RPN refines the anchors stage by stage, thereby greatly improving detection performance. The edge extraction network RCFNet [24] obtains detailed edge extraction results from the bottom layer and then gradually integrates high-level semantic information to produce accurate edges. Based on the successes of these methods, we propose CSNet with MICV and DFFU and detail them in the next sections.

APPROACH
In this section, we describe the construction process of MICV and the architecture of CSNet. We then introduce the DFFU to facilitate the fusion of information at different scale branches. Finally, we detail how to train our network via multisupervised loss.

Feature extractor
Using PSMNet [7] as our basis, we built our feature extractor network with three changes: we (1) reduced the number of stacked blocks in the residual layers, (2) removed the spatial pyramid pooling module, and (3) added attention modules based on SE [9] after each residual module. The improved feature extractor network is detailed in Table 1.

MICV
In order to produce accurate disparity maps, it is especially important to build an easy-to-handle, informative cost volume.
Early stereo matching networks [4,5,14,15,18] often used the inner product operation or distance metrics to measure the similarity of two image patches. Obviously, these operations discard the spatial information of the features. To overcome this problem, GC-Net [6] concatenated the left and right feature maps (f_l, f_r) at each disparity level, forming a 4D cost volume that forces the network to retain the stereoscopic geometry. To retain spatial information from the features and still measure the similarity between them, GwcNet [8] proposed a group-wise correlation that combines the concatenation volume with the correlation volume. However, group-wise correlation (1) uses only the inner product operation to measure the feature similarity of the left and right images, without considering a distance metric, and (2) mechanically combines the concatenation volume with the correlation volume without guidance. To overcome these problems, we propose MICV, composed of the concatenation volume and a correlation volume computed in both inner product space and Euclidean space, to further supplement the similarity information of the left and right features. Moreover, we introduce an SE module to adaptively balance the relationship between the correlation volume and the concatenation volume. Figure 2 illustrates the details of the multi-information cost volume.
Before building MICV, we set a maximum disparity D_max for the input images to exclude pixels with excessive disparity values. For the unary features (f_l, f_r) of shape [H/4, W/4, C], the maximum disparity is D_max/4.
We aggregate the left and right unary features using two correlation operations, generating two feature maps, F_in and F_dis, which denote the similarity of the left and right features in inner product space and Euclidean space, respectively. F_in and F_dis are then concatenated along the channel dimension to produce the correlation feature. In short, the correlation is calculated as

F_in(d, x, y) = (1/C) \langle f_l(x, y), f_r(x − d, y) \rangle,   F_dis(d, x, y) = \lVert f_l(x, y) − f_r(x − d, y) \rVert_2,   (1)

where d denotes each possible candidate disparity. Equation (1) shows that the correlation feature is computed at all disparity levels d; all the correlation features are then packed into a correlation volume. Based on GC-Net [6], we construct the concatenation volume using a shift operation: the right feature is shifted by each disparity level and concatenated with the left feature along the channel dimension, as shown in Equation (2):

V_cat(d, x, y) = Concat{ f_l(x, y), f_r(x − d, y) }.   (2)

Finally, the correlation volume is reweighted by the SE module and concatenated with the concatenation volume to form the MICV, as shown in Equation (3).

FIGURE 2  Details of the multi-information cost volume
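The two similarity measures behind MICV can be sketched in plain Python for a single image row (a simplified illustration of the idea, not the paper's implementation; real networks compute this over 4D feature tensors on the GPU):

```python
import math

def correlation_volumes(f_l, f_r, max_disp):
    """f_l, f_r: per-pixel channel vectors along one row (lists of lists).
    Returns (F_in, F_dis): for each disparity d and position x, the mean
    inner product and the Euclidean distance between f_l[x] and f_r[x-d]
    (entries stay 0.0 where x - d falls outside the image)."""
    width, channels = len(f_l), len(f_l[0])
    F_in = [[0.0] * width for _ in range(max_disp)]
    F_dis = [[0.0] * width for _ in range(max_disp)]
    for d in range(max_disp):
        for x in range(d, width):  # shift operation: compare x with x - d
            a, b = f_l[x], f_r[x - d]
            F_in[d][x] = sum(ai * bi for ai, bi in zip(a, b)) / channels
            F_dis[d][x] = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return F_in, F_dis
```

Concatenating F_in and F_dis along the channel dimension yields the correlation feature: a small distance and a large inner product both indicate a likely match, and the two measures can disagree, which is why combining them adds information.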

CSNet
The quality of initial disparity maps often suffers from outliers and blurred edges. Therefore, disparity refinement plays an important role in both traditional and CNN-based methods. Some recent stereo matching networks also use refinement modules. CRL [18] constructed a residual network for learning the correction of the initial disparity. StereoNet [20] simply stacks six dilation convolution blocks to refine the initial disparity map. However, these methods are unable to preserve the details lost in the downsampled feature maps. A cascade structure is used in many recent detection networks to retain detailed information and enhance the ability to detect small objects. Figure 3 shows the details of the cascade stereo network. Unlike previous work [6][7][8][20], which used only one feature scale to calculate the disparity map, CSNet uses three features of different scales (V_1/16, V_1/8, V_1/4) to compute disparity maps. V_1/16 is put into the stacked hourglass module to aggregate the feature information and obtain a prediction cost C_1/16 at 1/16 resolution. V_1/8 and V_1/4 use the same aggregation strategy, but to reduce the calculation time, the number of hourglass modules is reduced to 2 and 1 respectively, generating prediction costs (C_1/8, C_1/4) at 1/8 and 1/4 resolution. Then, C_1/16 is gradually fused with branch_2 and C_1/8 with branch_3 to enhance the semantic information of small objects and further improve disparity accuracy. With the three branches, CSNet refines the disparity map gradually from coarse to fine. Since the features of each branch of CSNet have different spatial scales, it is necessary to upsample the width, height, and disparity dimensions of the smaller feature map before mixing features. However, the fine texture of the feature map inevitably degrades after upsampling by interpolation, which affects the quality of the disparity map.
To solve this problem, we propose a DFFU that merges information from the surrounding pixels to effectively repair the up-sampled feature map, as shown in Figure 4. Taking the fusion of branch_1 and branch_2 as an example, the input to DFFU consists of two parts: C_1/16 and f_1/8, where f_1/8 is the feature map of V_1/8 after two convolutional sampling layers. We first up-sample the predicted cost C_1/16 to the same spatial size as f_1/8. It is then passed through three dilated convolution layers with dilation factors of 2, 4, and 8 to sample the up-sampled volume from a larger context without increasing the network size. Each dilated convolution is followed by a batch normalization layer and a ReLU activation layer. The output is then fused with f_1/8 to generate a new predicted cost f′_1/8 rich in semantic information.
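The repair step can be illustrated with a 1-D toy version (a hypothetical sketch with hand-picked averaging weights and additive fusion; the real DFFU uses learned multi-channel convolutions with batch normalization):

```python
def dilated_conv1d(x, w, dilation):
    """Zero-padded 'same' 1-D dilated convolution with an odd-length kernel w."""
    k_half = len(w) // 2
    out = []
    for i in range(len(x)):
        s = 0.0
        for j, wj in enumerate(w):
            idx = i + dilation * (j - k_half)  # dilation spreads the taps apart
            if 0 <= idx < len(x):
                s += wj * x[idx]
        out.append(s)
    return out

def dffu_1d(upsampled_cost, branch_feature):
    """Toy DFFU: three dilated convs (dilation 2, 4, 8), each followed by
    ReLU, then fusion with the branch feature by element-wise addition."""
    y = upsampled_cost
    for d in (2, 4, 8):
        y = [max(v, 0.0) for v in dilated_conv1d(y, [0.25, 0.5, 0.25], d)]
    return [a + b for a, b in zip(y, branch_feature)]
```

The growing dilation factors let each output value draw on an increasingly wide neighbourhood of the interpolated cost without adding parameters, which is the stated motivation for using dilated rather than ordinary convolutions here.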

Disparity regression and loss function
For each branch, two 3D convolution layers were employed to refine the predicted cost and output the final predicted volume of size 1 × sH × sW × sD_max, where s ∈ {1/4, 1/8, 1/16}, respectively. The predicted volume was then up-sampled and converted into a probability volume P with a softmax operation across the disparity dimension. For each point, the predicted disparity \hat{d} is calculated as the sum of each disparity level weighted by its probability:

\hat{d} = \sum_{k=0}^{D_max − 1} k \cdot P_k,   (4)

where k and P_k denote every possible disparity level and its corresponding probability, respectively. The final predicted disparity maps output by branches 1, 2, and 3 are denoted \hat{d}_1, \hat{d}_2, and \hat{d}_3. To train the entire network effectively, we used the following multi-supervised loss function:

L = \sum_{i=1}^{3} λ_i \cdot (1/N) \sum_{p} Smooth_{L1}(d(p) − \hat{d}_i(p)),   (5)

where N denotes the number of pixels in the input image, λ_i represents the weight for the i-th predicted map, and d denotes the ground-truth disparity map. The Smooth-L1 loss [7] is less sensitive to outliers and is computed with Equation (6):

Smooth_{L1}(x) = 0.5 x^2 if |x| < 1,   |x| − 0.5 otherwise.   (6)
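The soft-argmax regression and the Smooth-L1 loss can be sketched per pixel (a minimal illustration; here the probability is taken as a softmax over negated matching costs, a common convention that the network's aggregation module realizes internally):

```python
import math

def soft_argmax_disparity(costs):
    """Expected disparity: softmax over negated matching costs, then the
    probability-weighted sum of disparity levels k (soft argmin/argmax)."""
    exps = [math.exp(-c) for c in costs]
    z = sum(exps)
    return sum(k * e / z for k, e in enumerate(exps))

def smooth_l1(x):
    """Smooth-L1: quadratic near zero, linear beyond |x| = 1."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5
```

Because the result is an expectation over all levels, the predicted disparity is continuous (sub-pixel) and the whole operation is differentiable, which is what allows training with the Smooth-L1 loss end to end.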

EXPERIMENTS
In this section, the experiments conducted on CSNet are presented. First, we introduce the datasets used for training and testing, followed by the evaluation method and the experimental process. Finally, the results of our network experiments are analyzed and discussed.

Datasets and evaluation metric
Public datasets were adopted to evaluate the proposed CSNet: Sceneflow, KITTI2012, and KITTI2015. In both KITTI datasets, only the training images contain corresponding ground-truth disparity, so the predicted disparity maps for the test sets must be uploaded to the KITTI server to obtain testing results. Therefore, we divided the training data of these two datasets into training sets (160/160) and validation sets (35/40), respectively.
To evaluate the stereo matching results fairly and effectively, we adopted the end-point error (EPE) as the testing metric. EPE is calculated as the average Euclidean distance between the predicted and ground-truth disparities. We also used the percentage of bad pixels whose disparity error is larger than t pixels, expressed as the t-px error. For all error metrics, lower values are better.
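Both metrics are straightforward to compute from flattened disparity maps (a minimal sketch):

```python
def end_point_error(pred, gt):
    """Mean absolute disparity difference over all pixels (EPE)."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)

def t_px_error(pred, gt, t=3.0):
    """Percentage of pixels whose disparity error exceeds t pixels."""
    bad = sum(1 for p, g in zip(pred, gt) if abs(p - g) > t)
    return 100.0 * bad / len(gt)
```

Note that a single grossly wrong pixel inflates EPE far more than the t-px error, which is why benchmarks typically report both.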

Implementation details
The proposed network was implemented in PyTorch. We applied the Adam [25] optimizer (β_1 = 0.9, β_2 = 0.999) to train all models on 2 Nvidia Titan-XP GPUs with a batch size of 6. Data augmentation, including color normalization and random cropping, was carried out during training to help train a model robust to noise. The weights of the three outputs were set as λ_1 = 0.5, λ_2 = 0.7, λ_3 = 1.0. For the Sceneflow dataset, images were randomly cropped to 256(H) × 512(W) during training. Following PSMNet [7], the maximum disparity value D_max was set to 192. We trained our models from scratch with an initial learning rate of 0.001 for 20 epochs. After the 10th epoch, the learning rate was halved every two epochs. The trained model was used both directly for testing and as a starting model for KITTI.
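The stated schedule (0.001 for the first 10 epochs, then halved every two epochs) can be written as a small helper (a sketch under our reading of the schedule; the exact off-by-one choice at the boundary is an assumption):

```python
def learning_rate(epoch, base_lr=0.001):
    """LR schedule: base_lr through epoch 10, then halved every two epochs."""
    if epoch <= 10:
        return base_lr
    halvings = (epoch - 10 + 1) // 2  # epochs 11-12 -> 1 halving, 13-14 -> 2, ...
    return base_lr * 0.5 ** halvings
```

Under this reading, training ends (epoch 20) at a learning rate of 0.001 / 32.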
For KITTI2012∕2015, we further fine-tuned the trained model on the KITTI2012∕2015 training set for 300 epochs. The initial learning rate for the fine-tuning was set to 0.001 and dropped to 0.0001 after 200 epochs. In addition, for fair comparison with advanced methods, we used all KITTI2012/2015  training data to fine-tune the trained mode for 500 epochs, and then submitted testing results to the KITTI server.

Ablation experiments
In this section, we explore the performance of different model variants on the SceneFlow dataset and KITTI dataset and justify our design choices.

Ablation experiments for cost volume
To compare the effects of different cost volumes on the matching results, we conducted experiments with several settings on the base model (SNet-Cat), which has only branch_3. The experimental results are shown in Tables 2 and 3. In the "Cost Volume Mode" column, "cat" represents the concatenation mode, while "dis" and "inn" denote the correlation volume calculated in Euclidean space and inner product space, respectively. Therefore, "SNet-Di" means that the cost volume consists of a distance volume and an inner product volume; "SNet-Cd", a concatenation volume and a distance volume; "SNet-Ci", a concatenation volume and an inner product volume; "SNet-Cdi", a concatenation volume and correlation volumes calculated in both Euclidean space and inner product space; and "SNet-Multi", the final MICV. As shown in Tables 2 and 3, the EPE for SNet-Cd was 0.728, lower than the 0.732 for SNet-Cat, which in turn was lower than the 0.733 for SNet-Di. These results indicate that (1) the more complementary information the cost volume contains, the higher the disparity accuracy; and (2) the spatial information of features plays an important role in stereo matching, and adding similarity information can further improve performance. Unlike GwcNet [8], SNet-Cdi calculates the similarity in both Euclidean space and inner product space and obtains higher-quality disparity maps. The error can be further reduced by applying an adaptive weight module to the correlation volume. The performance of SNet-Multi exceeds that of almost all comparison groups.

Ablation experiments for cascade architecture

In this paper, we propose a stereo network with a cascade architecture (CSNet) that progressively improves the quality of disparity maps. In this section, we evaluate the significance of the cascade architecture and the DFFU. Experimental results in Table 4 show that the cascade architecture plays a crucial role in improving accuracy. The EPE of the initial model (SNet-Multi) was 0.712, and it decreased to 0.705 and 0.686 after adding branch_2 and branch_1, respectively. This is because the small-scale branches contain more semantic information, while the large-scale branches have more detailed information, and the cascade network we designed aggregates them to improve the disparity accuracy of the larger-scale branches. After excluding the DFFU or its dilation operation, the EPE value rose to 0.694 and 0.687, respectively, which shows that the DFFU facilitates the merging of information from different branches, while dilated convolution repairs the upsampled features better than ordinary convolution.

Sceneflow
The comparative results are shown in Table 5. Our method provides the highest-quality disparity maps: CSNet surpassed EdgeStereo [27] by 0.05 px on EPE and SSPCV [26] by 0.02% on the 3-px error. Moreover, to visually demonstrate the effectiveness of the proposed method, we present three sample predicted disparity maps in Figure 5. These samples show that our method generates very accurate disparity estimations, even for the small and complex objects in the orange box.

KITTI2012
For KITTI2012, we first fine-tuned the model pre-trained on the Sceneflow dataset for 500 epochs, and then uploaded the testing results to the KITTI server to assess performance. The results of the comparison with other models are shown in Table 6, which reports the 2-, 3-, and 5-pixel errors on all areas (All) and non-occluded regions (Noc). Our method produced results superior to those of most existing methods; its 3-px error surpassed EdgeStereo [27] by 0.14% and GwcNet [8] by 0.01% on all areas. For visual comparison, Figure 6 presents two sample disparity maps and corresponding error maps estimated by GC-Net, PSMNet, GwcNet and CSNet. These samples illustrate that our method not only produced more accurate disparities in ill-posed regions such as sky and strong light, but also produced smoother disparity values at the edges of objects. Take picture (b) in Figure 6 as an example: the rear windshield of a car is part of a single object, so its disparity values should be similar and its color consistent in the disparity map. However, the disparity maps produced by GC-Net, PSMNet, and GwcNet all show inconsistencies in this region, while CSNet's result is more uniform.

FIGURE 5
Three visual results on the Sceneflow dataset. The first column shows the left input images, the second column shows the corresponding ground truth, and the third column shows the disparity maps generated by CSNet. In the orange box, we highlight CSNet's excellent matching capability for small, complex objects

KITTI2015
For KITTI2015, we first fine-tuned the model pre-trained on the Sceneflow dataset for 500 epochs, and then uploaded the testing results to the KITTI server to assess performance. The results of the comparison with other models are shown in Table 7, which reports the 3-pixel errors in the background (D1-bg), foreground (D1-fg), and all areas (D1-all). "All" means that all pixels were included in the calculation, while "Noc" means that only pixels in non-occluded regions were considered. The results indicate that our method was superior to most existing methods: its 3-px error surpassed GwcNet's by 0.01% in the All D1-all area and EdgeStereo's by 0.02% in the Noc D1-all area. For visual comparison, Figure 7 shows two sample disparity maps and corresponding error maps estimated by GC-Net, PSMNet, GwcNet and CSNet. The scene in Figure 7(b) has a reflection problem: the other comparison methods resulted in long strips of mismatched points, while CSNet effectively reduced such mismatches (with fewer striped mismatched points).

Time spent
Since most researchers do not disclose the time spent on the Sceneflow test set, we mainly compared time using the KITTI test sets. Image sizes in the KITTI 2012 and KITTI 2015 datasets are the same, so a stereo matching network takes roughly the same time to calculate a disparity map on these two datasets. The results shown in Tables 6 and 7 indicate that most networks needed several hundred milliseconds to calculate a disparity map, a time that cannot meet real-time requirements. DispNet [5] was the fastest of all comparison methods. It took only 0.06 seconds to calculate a disparity map, but its disparity accuracy (D1-all) is only half that of CSNet. It took 0.22 seconds for iResNet to calculate a disparity map, and its disparity accuracy is comparable to CSNet's, but its parameter count reached 43.00 M, far exceeding the 6.41 M of CSNet.

FIGURE 7  Two predicted disparity maps and corresponding error maps for KITTI 2015 test images. The first row shows the input left and right images. For each image, the disparity maps and corresponding error maps are generated by GC-Net, PSMNet, GwcNet, and CSNet. In addition, the top left corner of each disparity map shows its 3-px error in all areas

CONCLUSION
In this paper, we propose a novel cascade stereo network that uses a multi-information cost volume. The cost volume consists of the concatenation volume and the correlation volume, which is calculated in both inner product space and Euclidean space. Moreover, we introduce the SE module to adaptively adjust the contribution of the correlation volume, further improving performance. The cascade structure and the accompanying DFFU effectively fuse the semantic information from the smaller-scale branches with the detailed information from the larger-scale branches to gradually refine the disparity map. Comprehensive experiments on the Sceneflow and KITTI datasets demonstrate that the matching precision and generalization ability of CSNet are better than those of many advanced stereo networks. For future work, we are interested in exploring an unsupervised stereo matching network that can achieve results comparable to those of existing state-of-the-art networks. In addition, we will explore the impact of night-time environments on the performance of stereo matching networks.