A full‐reference stereoscopic image quality assessment index based on stable aggregation of monocular and binocular visual features

Funding information: Key Research & Development Plan of Shandong Province, Grant/Award Number: 2019GGX101021; China Postdoctoral Science Foundation, Grant/Award Number: 2017M622136

Abstract

In stereoscopic image quality assessment, the human visual system has been universally taken into account to detect perceptual characteristics. A novel full-reference stereoscopic image quality assessment metric that considers both monocular and binocular visual features of the human visual system is proposed. In particular, a new region segmentation algorithm is first proposed to divide 3D images into occluded and non-occluded regions. The just noticeable difference model is employed on the occluded regions to formulate monocular vision, while the binocular just noticeable difference model is applied to the non-occluded regions to reveal the binocular vision of the human visual system. In the proposed region segmentation, disparity information and the Euclidean distance between stereo pairs are both adopted to solve the unstable segmentation problem of traditional methods. A new pooling strategy based on global edge features is then presented to aggregate the just noticeable difference and binocular just noticeable difference evaluation maps. In addition, some local image features are also extracted as a supplement to the just noticeable difference models to describe visual characteristics of the human visual system. Finally, an overall quality score is calculated based on the above-mentioned features to measure the visual quality of distorted stereo pairs. Experimental results show that the proposed metric achieves high consistency with subjective perception and outperforms state-of-the-art algorithms on stereoscopic image quality assessment.


INTRODUCTION
With the development of society and technology, two-dimensional (2D) images can no longer satisfy humans' growing demand for perceiving the real world, and three-dimensional (3D) images have become increasingly prevalent [1]. A new challenge is how to assess the quality of 3D images in various application scenarios [2][3][4][5][6][7].
The human visual system (HVS) has been seen as a vital factor in producing objective metrics for image quality assessment (IQA) [8][9][10][11]. In order to reveal perceptual characteristics of the HVS, various just noticeable difference (JND) and binocular just noticeable difference (BJND) models have been proposed and employed for 2D/3D IQA [12][13][14]. For 2D images, JND models are proposed based on monocular vision to indicate the minimum perceptible difference threshold: humans cannot perceive a difference if the image intensity variation is less than this threshold. In 2D IQA, the JND threshold map of a distorted image can be used to predict a quality score for the image. However, for 3D images, monocular vision and binocular vision occur simultaneously, and JND models are not suitable to express both of them. BJND models are proposed based on binocular vision, considering the perceptible difference of both the left and right views. BJND indicates that the HVS cannot detect asymmetric distortions that are smaller than the BJND threshold. In stereoscopic IQA (SIQA), the BJND values of both the left and right images are used to obtain the final quality score.

(This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2021 The Authors. IET Image Processing published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.)
In order to achieve better performance, some researchers have made efforts to integrate both monocular and binocular perceptual factors into SIQA models [14][15][16][17]. The most representative one is the model proposed by Fezza et al. in [14], which is based on both JND and BJND. Stereo images are divided into two areas, named the occluded (OC) region and the non-occluded (NOC) region. For a stereo pair, pixels that have corresponding information in both the left and right images are classified into the NOC region, while the remaining pixels form the OC region. In the OC region monocular vision occurs, while in the NOC region binocular vision appears. JND is employed in the OC region to obtain one predicted score, and BJND is used in the NOC region to obtain another; the final prediction result aggregates these two scores. By considering both monocular and binocular perception, the model in [14] achieved better performance. Apart from monocular and binocular sight, some other features are also considered in [18,19]; for example, visual attention and natural scene statistics are included to assess image quality. Other models have taken features such as depth perception, disparity and binocular fusion into account [15][16][17]. For the HVS, global and local features both play significant roles, and comprehensive consideration of HVS features can boost the performance of SIQA models. However, the works mentioned above are all based on global images and ignore some local information of stereo images. Owing to this insufficient consideration, these SIQA models cannot achieve the expected assessment results.

FIGURE 1 The frame of the proposed stereoscopic image quality assessment metric
Up to now, few researchers have paid close attention to the segmentation of OC and NOC regions, which is also a key procedure in SIQA. An effective dividing algorithm can not only coincide better with the HVS but also improve the accuracy of existing SIQA models that are based on region partition. The most commonly used region segmentation method for dividing OC and NOC regions can be found in [20], and it has also been used in [14,21]. As introduced in [20], two disparity maps can be obtained for a stereo image pair. By computing the absolute difference of these two disparity maps, a Left-Right Consistency (LRC) map is obtained, and each pixel in the LRC map serves as an index for region segmentation. Finally, an appropriate threshold is assigned to determine the OC and NOC regions: a pixel in the left or right image is classified into the OC region if the value at the same location of its LRC map is greater than the threshold, and into the NOC region otherwise. However, in the method of [20], the calculation of disparity maps can easily be affected by noise, especially for images with severe distortion. For distorted stereo pairs derived from the same reference images, the disparity maps may vary greatly, and consequently the resulting LRC maps are unstable for region segmentation.
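The LRC check described above can be sketched as follows; the function name, the integer-disparity warping and the threshold value are illustrative assumptions, and `disp_left`/`disp_right` stand for precomputed disparity maps from a stereo matcher:

```python
import numpy as np

def lrc_segmentation(disp_left, disp_right, threshold=1.0):
    """Classic left-right consistency check for OC/NOC segmentation.

    Each left-view pixel is warped into the right view using its
    disparity; if the two disparity estimates disagree by more than the
    threshold, the pixel is treated as occluded (OC)."""
    h, w = disp_left.shape
    cols = np.arange(w)
    lrc = np.zeros_like(disp_left, dtype=float)
    for y in range(h):
        # Column of each left-view pixel when projected into the right view.
        target = np.clip(cols - disp_left[y].astype(int), 0, w - 1)
        lrc[y] = np.abs(disp_left[y] - disp_right[y, target])
    # True marks NOC (consistent) pixels, False marks OC pixels.
    return lrc <= threshold
```

As the paper notes, this check inherits any noise present in the disparity maps, which is what motivates the RSDE alternative proposed later.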
In this paper, we propose a full-reference (FR) SIQA model based on a novel region segmentation method and a global feature-based pooling strategy. The frame of the proposed model is illustrated in Figure 1. On the one hand, a better-performing Region Segmentation method based on Disparity maps and Euclidean distance (named RSDE) is developed to divide stereo pairs into OC and NOC regions. In order to simulate monocular and binocular vision, JND [12] and BJND [13] are employed in the different regions. Moreover, a Pooling Strategy based on Global Edge features (PSGE) is proposed to obtain a predicted score from both the JND and BJND maps. On the other hand, we extract the local phase and amplitude in the spatial domain to capture local information that is not acquired by JND and BJND; their similarities produce another predicted quality score. The final evaluation score is obtained by aggregating the two predicted quality scores. The contributions can be summarized as follows:

1) Considering the importance of region division, we develop a novel region segmentation method based on disparity maps and Euclidean distance (ED), i.e. RSDE. Compared with conventional methods based on LRC maps, the segmentation results of the proposed RSDE are more stable, reducing the effect of distortions on individual points. Therefore, we can achieve a better balance between monocular and binocular vision.

2) In SIQA, monocular and binocular vision have different proportions. The proposed PSGE pooling strategy provides an effective way to evaluate the visual contributions of the NOC and OC regions.

3) Considering the lack of local information, some local HVS features, including the local amplitude and phase, are extracted to obtain more information for better quality evaluation, which can be seen as complementary to the JND and BJND characteristics.
The remainder of the paper is organized as follows. Section 2 briefly reviews related works. Some basic foundations of the proposed model are given in Section 3. The proposed SIQA model is introduced in Section 4. Experiments and analysis of the proposed method are presented in Section 5. Finally, the conclusion is drawn in Section 6.

RELATED WORKS
In the field of SIQA, similar to 2D IQA, research works can be categorized into subjective and objective methods. Subjective methods depend on human observers, while objective methods provide metrics that are consistent with subjective results. Many subjective experiments have been conducted, taking a growing number of factors that affect prediction scores into account. However, due to the high cost in human resources and time, objective SIQA methods consistent with the HVS are more practical in applications.
Some SIQA methods adopt well-known 2D IQA models such as the Structural SIMilarity (SSIM) index [43] and the Universal Quality Index (UQI) [44]. Campisi et al. [22] used four 2D models, SSIM, UQI, C4 [45] and a reduced-reference IQA model (RRIQA) [25], to assess the performance of SIQA models. In [23], SSIM and the peak signal-to-noise ratio (PSNR) are applied directly by Yasakethu et al. to assess stereoscopic images; the quality score is the weighted average of the two scores. In [24], Kuo et al. obtained a reference cyclopean image and a distorted one by combining the left and right images; information content weighted SSIM (IW-SSIM) [46] is then used to calculate the similarity between the two images. In [15], apart from the cyclopean image as in [24], depth perception is considered by Liu et al.: two depth maps and two cyclopean images are obtained from the reference and distorted stereo pairs, MS-SSIM [47] is used to compute similarity scores between the depth maps and between the cyclopean images, respectively, and the combination of these scores serves as the final quality score. In [48], SSIM and C4 are applied to disparity maps and stereo views. Fan et al. [49] used disparity maps to produce cyclopean images; UQI is used to calculate the similarity between the cyclopean images and between the disparity maps, respectively, and the quality score pools the UQI results with the JND values of the cyclopean images. Many 2D IQA models have been adopted to construct SIQA models, and a comprehensive performance comparison can be found in [50]. In addition, Zhou et al. [34] proposed an NR-SIQA model based on binocular combination and a learning machine: two binocular combinations of the stimuli are generated using different combination strategies, and quality-aware features of the combinations are extracted and used to produce the final quality result. In [35], Kim et al. proposed a patch-based Convolutional Neural Network (CNN) to extract features.
A series of stereo images is divided into a desirable number of patches to produce training data. The patches are fed into the network to obtain the quality score of each patch, and the patch quality results of an image are finally pooled into its overall quality score. Ding et al. [36] extracted features from cyclopean images and disparity maps using a two-column dense CNN model and fused these features together to obtain the final SIQA score of stereo images. Yue et al. [40] proposed an NR-SIQA method that obtains quality-sensitive features from the cyclopean phase map and applies a Support Vector Regression (SVR) model to combine the features. In [41], Liu et al. proposed an NR-SIQA method based on hierarchical learning: the stereoscopic images are first divided into different groups according to distortion, each classified image group is fed into an image quality predictor comprising five different perceptual channels, and a Support Vector Machine (SVM) is finally applied to combine the outputs of the five channels and obtain the quality assessment result. Oh et al. [51] proposed a deep NR IQA model in terms of local and global feature aggregation, in which a CNN-based model extracts features from left and right patches and produces the final quality score. These CNN-based methods can extract features automatically, but may ignore the relationship between patches of stereo pairs. Moreover, owing to their closed feature extraction, CNN-based models lack interpretability, which can make the optimization of the various networks difficult, and training CNNs is time-consuming. In contrast, traditional metrics have been studied widely due to advantages such as higher interpretability and lower time complexity. Hence, we mainly focus on investigating effective hand-crafted features for SIQA and explore more aggregation manners to reveal the monocular and binocular mechanisms of the HVS.
Some SIQA models also take other HVS characteristics into account. JND-based models have attracted great interest recently. For conventional 2D images, JND models have been proposed to indicate the maximum image distortion that is undetectable to the HVS [52][53][54][55][56][57][58][59]. They can be classified into two categories: frequency-domain and spatial-domain models. Frequency-domain models often transform an image into a sub-band representation, for example by the Discrete Cosine Transform (DCT) or wavelet decomposition [52][53][54]. Spatial-domain models are more convenient because they are calculated directly on the pixels of an image [57][58][59][60][61]; therefore, we focus on spatial (pixel-wise) models. Pixel-wise JND models often consider HVS features such as the Luminance Adaptation (LA) and Contrast Masking (CM) effects. Based on both of them, early JND models were proposed by Chou et al. [57,58]. Further, Zhang et al. [59] used the Contrast Sensitivity Function (CSF) [62] as well as LA and CM to obtain JND thresholds. Liu et al. [60] developed a more accurate JND model on the basis of [59] by exploiting edge masking and texture masking. Wang et al. [61] proposed a JND model for screen content images based on the edge profile, which is decomposed into three features: luminance, contrast and structure information.
However, such JND models are not suitable for stereoscopic images. To be applicable to stereo pairs, stereo characteristics such as binocular vision and depth features must be considered. For 3D images, several studies have concentrated on finding a visibility threshold for 3D images/videos. Zhao et al. [13] proposed a BJND model considering LA and CM. De Silva et al. [63,64] studied the JND in depth perception for 3D video displays; the model can be utilized to recover depth perception when the depth information is lost or lossily compressed. Some other features are considered in [18,19,[65][66][67][68]. Based on the free energy principle and the HVS, Zhu et al. [65] imported 2D IQA methods into SIQA by using a dual-weight model. In [66,67]

Just noticeable difference (JND)
In recent works, JND has been used widely in image processing to keep consistent with the HVS. In the OC region, monocular sight occurs; in order to maintain high consistency with the HVS, JND is used in our model to simulate monocular vision [12]. The JND threshold is obtained from LA and CM as

JND(u, v) = LA(u, v) + CM(u, v),

where (u, v) represents a pixel location in an image. LA, which represents the luminance adaptation of the HVS, can be described as follows [69]:

LA(u, v) = 17(1 − √(I(u, v)/127)) + 3, if I(u, v) ≤ 127;
LA(u, v) = (3/128)(I(u, v) − 127) + 3, otherwise,

where I(u, v) represents the pixel value at location (u, v).
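As a concrete illustration, the piecewise LA threshold can be computed as below; this is a sketch assuming the classic Chou–Li constants (17, 3 and 3/128), and the exact values used in [12] may differ:

```python
import numpy as np

def luminance_adaptation(img):
    """Pixel-wise luminance adaptation (LA) threshold, Chou-Li style.

    Dark backgrounds tolerate larger distortion (higher threshold),
    and the threshold grows linearly again for bright backgrounds."""
    img = np.asarray(img, dtype=float)
    dark = 17.0 * (1.0 - np.sqrt(img / 127.0)) + 3.0
    bright = (3.0 / 128.0) * (img - 127.0) + 3.0
    return np.where(img <= 127, dark, bright)
```

The minimum of the curve sits at mid-grey (threshold 3 at intensity 127), matching the HVS being most sensitive at medium luminance.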
CM, another phenomenon of the HVS, indicates the sensitivity to changes of different features (i.e. spatial frequency, direction and location) [70]. It is represented as

CM(u, v) = EM(u, v) + TM(u, v),

where EM and TM are edge masking and texture masking, respectively. In image decomposition processing, an image I is seen as the sum of a structural image I_e and a textural image I_t (i.e. I = I_e + I_t) [71]. Let C_e and C_t denote the maximum luminance contrast of I_e and I_t. EM and TM can then be obtained as

EM(u, v) = η·ξ_e·C_e(u, v), TM(u, v) = η·ξ_t·C_t(u, v),

where ξ_e and ξ_t are the weights of EM and TM, set to 1 and 3, respectively, and the constant η is assigned to 0.117.

Binocular just noticeable difference (BJND)
In NOC regions, binocular vision occurs, and BJND is employed to model the visual factors mentioned in the last section. Similar to JND, CM and LA are also considered in BJND models. The overall BJND threshold is defined as in [13] as a function of the monocular threshold and of the noise amplitude in the other view, where the parameters N_a and λ are equal to 0.3 and 3.76, respectively. T_{l|r}(u, v) is the maximum JND threshold at location (u, v) of the left or right image; it is computed from the background luminance L_bg(u, v) of the stereo pair, which is obtained by filtering with a low-pass filter B as in Equation (7), and d represents the disparity between the left and right images.
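A minimal sketch of such a compound threshold is given below. This is not the exact model of [13] but an illustration of its general form, in which the binocular threshold shrinks from the monocular threshold T_{l|r} towards zero as the noise amplitude already present in the other view grows; the exponent value follows the text, and the clipping is an assumption added for numerical safety:

```python
import numpy as np

def bjnd_threshold(T, noise_other_view, lam=3.76):
    """Simplified BJND-style threshold (sketch, not the exact model of [13]).

    T                : monocular visibility threshold map T_{l|r}(u, v)
    noise_other_view : distortion amplitude in the other view
    Returns T when the other view is clean and decays to 0 as the other
    view's noise reaches the threshold itself."""
    ratio = np.clip(noise_other_view / np.maximum(T, 1e-12), 0.0, 1.0)
    return T * (1.0 - ratio ** lam) ** (1.0 / lam)
```

This captures the key asymmetric-distortion property used later in the pooling: the more one view is degraded, the less additional distortion the other view can hide.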

PROPOSED SIQA MODEL BASED ON RSDE AND HVS CHARACTERISTICS
As mentioned in Section 1, existing SIQA works based on JND and BJND depend strongly on regional division, and the conventional region segmentation method based on LRC maps (RSLM) has shown its weaknesses. We propose a novel FR SIQA model in which an effective region segmentation method divides the OC and NOC regions, and the PSGE strategy is presented to fuse the JND and BJND scores on the OC and NOC regions. Moreover, as a supplement to the JND and BJND evaluation, some local features (i.e. local phase and amplitude) are also adopted to predict the visual quality of distorted images. Finally, the evaluation results of these two parts are combined to indicate the real quality scores of distorted images.

Proposed RSDE algorithm for region segmentation
As mentioned in Section 1, traditional methods (e.g. RSLM) for dividing stereo images into OC and NOC regions have shown disadvantages; for example, they cannot obtain a stable LRC map and thus fail to achieve the expected region segmentation. ED, which represents the distance between two objects, has been widely adopted to reflect the similarity between two locations of a graphic due to its convenient calculation and high intelligibility in image processing.

FIGURE 2 Region segmentation of the method based on the LRC map and of RSDE. Both results include white and black areas; the former represents the NOC region and the latter represents the OC region
Therefore, we develop a novel region segmentation algorithm (i.e. RSDE) based on disparity maps and ED to distinguish the OC and NOC districts and obtain better segmentation results. For a distorted stereo pair in FR SIQA, its reference pair is available, and from the reference left and right images we can obtain two disparity maps (d_l and d_r) as in [20]. As shown in Figure 2, besides the disparity information, we also consider the impact of image distortions on the region segmentation. The similarity of the distorted left and right images can be evaluated by ED. In order to solve the problem that the ED calculation may be affected by different distortions at individual points, or by points whose disparity is wrongly calculated, we compute a Local Euclidean Distance (LED) over a 5×5 window. The LED between one pixel in the distorted left view and its corresponding pixel in the right view is obtained as in Equation (8), and the right LED map related to the distorted right image is computed analogously as in Equation (9):

E_l(u, v) = sqrt( Σ_{i=−2..2} Σ_{j=−2..2} [P(u+i, v+j) − P′(u+i, v − d_l(u, v)+j)]² ),   (8)

E_r(u, v) = sqrt( Σ_{i=−2..2} Σ_{j=−2..2} [P(u+i, v+j) − P′(u+i, v + d_r(u, v)+j)]² ),   (9)
where E_l(u, v) and E_r(u, v) are the two LED maps based on the distorted left and right images, respectively, P represents the 5×5 neighbourhood matrix of (u, v) in the distorted left (right) view, and P′ represents the 5×5 neighbourhood matrix of (u, v ± d(u, v)) in the corresponding right (left) view. d_l and d_r are the disparity maps of the reference pair.
For the two LED maps above, smaller ED values indicate higher similarity [72]. For stereoscopic pairs, the number of pixels belonging to the NOC region is much larger than that of the OC region; in other words, a random pixel in one view is more likely to belong to the NOC region. On this basis, we can largely tolerate pixels being divided into the NOC region by mistake. Therefore, the LED map used to divide the regions is obtained by a minimum strategy:

E(u, v) = min{E_l(u, v), E_r(u, v)}.   (10)

Finally, an appropriate threshold, which is viewed as the boundary between the different areas, is used to obtain the ultimate result of RSDE:

RSDE(u, v) = NOC, if E(u, v) ≤ th_o; OC, otherwise,   (11)

where th_o is the threshold used to divide the areas.
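The RSDE steps (LED over a 5×5 window, minimum strategy, thresholding) can be sketched as follows; the horizontal-shift matching convention, the use of raw intensities and the function name are assumptions for illustration, and `th_o` follows the value reported later in the experiments:

```python
import numpy as np

def rsde_segmentation(dist_left, dist_right, disp_left, disp_right,
                      th_o=0.12, half=2):
    """RSDE-style OC/NOC segmentation (sketch).

    For each pixel, the local Euclidean distance (LED) between a 5x5
    patch in one distorted view and the disparity-shifted patch in the
    other view is computed; the per-pixel minimum of the two LED maps
    is thresholded to separate NOC (True) from OC (False) pixels."""
    h, w = dist_left.shape

    def led(src, dst, disp, sign):
        out = np.full((h, w), np.inf)   # unmatched pixels fall into OC
        for y in range(half, h - half):
            for x in range(half, w - half):
                xm = x + sign * int(disp[y, x])       # matched column
                if half <= xm < w - half:
                    p = src[y-half:y+half+1, x-half:x+half+1]
                    q = dst[y-half:y+half+1, xm-half:xm+half+1]
                    out[y, x] = np.sqrt(np.sum((p - q) ** 2))
        return out

    e_l = led(dist_left, dist_right, disp_left, -1)
    e_r = led(dist_right, dist_left, disp_right, +1)
    e = np.minimum(e_l, e_r)          # minimum strategy of Equation (10)
    return e <= th_o                  # True = NOC, False = OC
```

Because the disparity maps come from the reference pair, the patch matching stays stable even when the distorted views disagree strongly, which is the point of RSDE.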

PSGE pooling strategy
Based on the region segmentation map RSDE(u, v), the JND model is carried out to evaluate the visual distortion of the OC region, while BJND is employed on the NOC region. The proposed pooling strategy PSGE is then used to pool the different parts, and a predicted quality score reflecting the monocular and binocular characteristics is obtained after the aggregation.
In the proposed PSGE pooling strategy, global edge features are extracted as weights to evaluate the visual contributions of NOC and OC regions. We use Global Contrast (GC) and Global Width (GW) to reflect edge properties.
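The weighting idea can be sketched as follows, assuming an SSIM-style similarity of the form used in Equation (18); the stabilising constant `c` and the function names are illustrative:

```python
import numpy as np

def similarity_map(a, b, c=0.01):
    """SSIM-style similarity between two feature maps (Equation (18) form)."""
    return (2.0 * a * b + c) / (a ** 2 + b ** 2 + c)

def psge_weights(gc_dist, gc_ref, gw_dist, gw_ref, alpha1=1.0, alpha2=1.0):
    """PSGE weight map from global contrast (GC) and global width (GW)
    similarity, with alpha1 = alpha2 = 1 as in the text (sketch)."""
    s_gc = similarity_map(gc_dist, gc_ref)
    s_gw = similarity_map(gw_dist, gw_ref)
    return (s_gc ** alpha1) * (s_gw ** alpha2)
```

Pixels whose edge contrast and edge width are well preserved receive weights near 1, so they dominate the subsequent aggregation of the JND and BJND maps.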
For a distorted image and its reference image, let ⊗, a, w and b denote the convolution operation, the contrast parameter, the width parameter and the basis parameter, respectively. A practical model s(u; a, w, b, u_0), used to extract the GC map and the GW map, is obtained by convolving the ideal edge model e(u; a, b, u_0) with the Gaussian kernel g(u; w) [73]:

s(u; a, w, b, u_0) = e(u; a, b, u_0) ⊗ g(u; w) = (a/2)[1 + erf((u − u_0)/(√2·w))] + b,   (12)

and the ideal edge model can be described as

e(u; a, b, u_0) = a·U(u − u_0) + b,

where U(⋅) denotes the unit-step function and erf(⋅) is the error function

erf(u) = (2/√π) ∫_0^u exp(−t²) dt.

The parameters a and w can then be estimated as

w = 1/√(ln l_1),  a = √(2π)·w·d_1·exp(u_0²/(2w²)),

where l_1 = d_1²/(d_2·d_3), l_2 = d_2/d_3 and u_0 = (w²/2)·ln l_2. Here, d_1, d_2 and d_3 are the responses of the first-order derivative of the Gaussian filter g(u; w) sampled at u = 0, 1 and −1; that is, we set d_2 = s′(1; a, w, u_0) and d_3 = s′(−1; a, w, u_0), respectively. After calculating the feature maps GC and GW as in Equation (12), the similarity between the left (right) distorted and reference images can be calculated in the same way for both maps:

S_m^{l,r}(u, v) = (2·m_dis^{l,r}(u, v)·m_ref^{l,r}(u, v) + c)/([m_dis^{l,r}(u, v)]² + [m_ref^{l,r}(u, v)]² + c),  m ∈ {GC, GW},   (18)

where m_dis^{l,r} and m_ref^{l,r} represent the GC (or GW) maps of the distorted and reference left (right) images and c is a small positive constant. The weights of the left (right) distorted image can then be obtained as

W^{l,r}(u, v) = [S_GC^{l,r}(u, v)]^{α_1}·[S_GW^{l,r}(u, v)]^{α_2},

where α_1 and α_2 are both set to 1. A quality score map of the OC region is then defined as the weighted JND-based similarity over the OC pixels (Equation (20)), and in a similar way a quality score map of the NOC region is obtained from the weighted BJND-based similarity over the NOC pixels (Equation (21)).
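The closed-form recovery of the edge parameters from three derivative samples can be checked numerically. The sketch below assumes the erf-shaped blurred-edge profile and the sampling positions u = 0, ±1 described above; whether [73] uses exactly this sampling scheme is an assumption:

```python
from math import erf, exp, log, pi, sqrt

def edge_profile(x, a, w, b, x0):
    """Ideal step edge of contrast a and basis b blurred by a Gaussian
    of width w: s(x) = a/2 * (1 + erf((x - x0) / (sqrt(2) * w))) + b."""
    return 0.5 * a * (1.0 + erf((x - x0) / (sqrt(2.0) * w))) + b

def edge_derivative(x, a, w, x0):
    """s'(x) = a * g(x - x0; w): a Gaussian bump centred on the edge."""
    return a / (sqrt(2.0 * pi) * w) * exp(-(x - x0) ** 2 / (2.0 * w ** 2))

def estimate_edge_params(d1, d2, d3):
    """Recover width w, offset x0 and contrast a from three samples of
    the derivative profile: d1 = s'(0), d2 = s'(1), d3 = s'(-1)."""
    l1 = d1 ** 2 / (d2 * d3)            # equals exp(1 / w^2)
    l2 = d2 / d3                        # equals exp(2 * x0 / w^2)
    w = 1.0 / sqrt(log(l1))
    x0 = 0.5 * w ** 2 * log(l2)
    a = d1 * sqrt(2.0 * pi) * w * exp(x0 ** 2 / (2.0 * w ** 2))
    return a, w, x0

# Sanity check: synthesise an edge and recover its parameters exactly.
a_true, w_true, x0_true = 40.0, 0.8, 0.3
d1 = edge_derivative(0.0, a_true, w_true, x0_true)
d2 = edge_derivative(1.0, a_true, w_true, x0_true)
d3 = edge_derivative(-1.0, a_true, w_true, x0_true)
a_est, w_est, x0_est = estimate_edge_params(d1, d2, d3)
```

Since the derivative of the blurred edge is a Gaussian, the ratios l_1 and l_2 cancel the contrast and isolate the width and offset, which is why three samples suffice.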

Local feature description for quality evaluation
Image local features are also of great importance in SIQA. Phase, as an important characteristic of images, has been applied in image processing due to its high stability and adaptability [74]. However, different regions are evaluated independently in traditional methods, so the global phase from a direct Fourier transform is not an effective measure of local features [21]. Moreover, existing works have shown that the log-Gabor filter is an excellent choice to imitate simple cells in binocular vision [75][76][77]; its transfer function is

G(ω, θ) = exp(−[log(ω/ω_s)]²/(2σ_s²))·exp(−(θ − θ_o)²/(2σ_o²)),

where ω and θ are the normalized radial frequency and the orientation angle of the filter, ω_s and θ_o are the corresponding centre frequency and orientation of the filter, and the parameters σ_s and σ_o determine the strength (spread) of the filter. Let [e_{s,o}(x), o_{s,o}(x)] denote the even- and odd-symmetric responses of the filter at location x on scale s and along orientation o. The Local aMplitude (LM) at location x on scale s and along orientation o can then be calculated as

LM_{s,o}(x) = sqrt(e_{s,o}(x)² + o_{s,o}(x)²),

and the local energy along orientation o can be computed as

E_o(x) = sqrt(F_o(x)² + H_o(x)²),  with F_o(x) = Σ_s e_{s,o}(x) and H_o(x) = Σ_s o_{s,o}(x).

The PC along orientation o can be computed as

PC_o(x) = E_o(x)/(ε_1 + Σ_s LM_{s,o}(x)),

where ε_1 is a small constant greater than 0. However, we use the Local Phase (LP) and LM to describe local features rather than PC directly. Let o_m denote the orientation that corresponds to the maximum PC value; the local phase is then defined as

LP(x) = arctan2(H_{o_m}(x), F_{o_m}(x)),

and the local amplitude is defined as the sum of the local amplitudes of all scales along the orientation o_m:

LM(x) = Σ_s LM_{s,o_m}(x).
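A 1-D sketch of the log-Gabor analysis is given below; the centre frequency, bandwidth ratio and function name are illustrative assumptions, and a full implementation would use 2-D filters over several scales and orientations:

```python
import numpy as np

def log_gabor_response(signal, f0=0.1, sigma_ratio=0.55):
    """Complex log-Gabor response of a 1-D signal (sketch).

    The filter is defined in the frequency domain as
    G(f) = exp(-log(f / f0)^2 / (2 * log(sigma_ratio)^2)) for f > 0,
    so the inverse FFT of G * F(signal) is an analytic (even + i*odd)
    response: its magnitude is the local amplitude and its angle is
    the local phase."""
    n = len(signal)
    freqs = np.fft.fftfreq(n)
    G = np.zeros(n)
    pos = freqs > 0                      # keep positive frequencies only
    G[pos] = np.exp(-np.log(freqs[pos] / f0) ** 2
                    / (2.0 * np.log(sigma_ratio) ** 2))
    resp = np.fft.ifft(np.fft.fft(signal) * G)
    local_amplitude = np.abs(resp)
    local_phase = np.angle(resp)
    return local_amplitude, local_phase
```

Discarding the negative frequencies makes the filtered result analytic, so even and odd responses come out together as the real and imaginary parts of one complex signal.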

Overall quality estimation
Based on the score maps of the OC and NOC regions in Equations (20) and (21), we can obtain a predicted score for the monocular and binocular characteristics. Moreover, in Section 4.3 we have obtained the LM and LP maps of the distorted and reference images. In this section, we compute the similarities of the LM and LP maps and pool them into another predicted score; the total quality assessment result is obtained by combining both of them. The predicted score of the monocular and binocular characteristics is obtained as

Q_1 = (1/M) Σ_{(u,v)∈OC} Q_oc(u, v)·W_oc(u, v) + (1/N) Σ_{(u,v)∈NOC} Q_noc(u, v)·W_noc(u, v),

where W_oc and W_noc are the weight maps of the OC and NOC regions, and M and N denote the numbers of pixels of the OC and NOC regions, respectively. For the LP and LM maps, we obtain similarity maps using the same form as Equation (18):

S_LM(x) = (2·LM_ref(x)·LM_dis(x) + ε_2)/(LM_ref(x)² + LM_dis(x)² + ε_2),
S_LP(x) = (2·LP_ref(x)·LP_dis(x) + ε_2)/(LP_ref(x)² + LP_dis(x)² + ε_2),

where ref and dis denote the reference and distorted stereo images and ε_2 is a constant set to 0.01. The similarity maps of the left (right) views are combined as

S^{l,r}(x) = W_LM^{l,r}(x)·S_LM^{l,r}(x) + W_LP^{l,r}(x)·S_LP^{l,r}(x),

where W_LM^{l,r} and W_LP^{l,r} are the weights of S_LM^{l,r} and S_LP^{l,r} given by the log-Gabor filter responses. The prediction score of the local features is then

Q_2 = (1/Ω) Σ_x S(x),

where Ω denotes the number of pixels of a local feature map. Finally, the predicted score of our proposed model is

Q = f_1·Q_1 + f_2·Q_2,

where f_1 and f_2 are constants with f_1 + f_2 = 1; here, we set f_1 = f_2 = 0.5.
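The two-stage aggregation can be sketched as follows; the function signature and the plain averaging of the local-feature map are illustrative assumptions:

```python
import numpy as np

def overall_score(q_oc, w_oc, q_noc, w_noc, q_local, f1=0.5, f2=0.5):
    """Combine region-based and local-feature scores (sketch).

    q_oc / q_noc : per-pixel quality maps of the OC and NOC regions
    w_oc / w_noc : the corresponding PSGE weight maps
    q_local      : per-pixel similarity map of the local LM/LP features
    The region score is a weighted average over both regions, the local
    score a plain average, and the two are mixed with f1 + f2 = 1."""
    num = (q_oc * w_oc).sum() + (q_noc * w_noc).sum()
    den = w_oc.sum() + w_noc.sum()
    q_region = num / den
    return f1 * q_region + f2 * q_local.mean()
```

With f1 = f2 = 0.5 the monocular/binocular evidence and the local-feature evidence contribute equally to the final prediction.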

Description of SIQA database
As mentioned in Section 2, more and more subjective experiments have been conducted, taking a growing number of factors that affect prediction scores into account. In order to provide common evaluation criteria for objective SIQA models, many databases have been created. Some prevalent databases widely used in SIQA are LIVE 3D Phase I [78], LIVE 3D Phase II [79], IEEE-SA DB [80] and IVC [81]. We use LIVE 3D Phase I and Phase II to assess our proposed model. Similar to [82], five symmetric distortion types, namely JPEG, JPEG2000 (JP2K), Gaussian White Noise (WN), Gaussian Blur and Rayleigh Fast Fading (FF), were used to establish LIVE 3D Phase I. All reference images are degraded by the five distortions at different levels, and the left and right images of each pair are distorted to the same degree. Phase I comprises 20 reference images and 365 distorted stereo pairs (80 pairs each for JPEG, JP2K, WN and FF, and 45 pairs for Blur). Similarly, Phase II consists of 8 reference images and 360 distorted images with co-registered human scores in the form of DMOS. Each reference stereo pair is processed into three symmetrically distorted stereo pairs and six asymmetrically distorted stereo pairs.

Performance index
Three widely used indicators, the Pearson Linear Correlation Coefficient (PLCC), the Spearman Rank-Order Correlation Coefficient (SROCC) and the Root Mean Squared Error (RMSE) between subjective and objective scores, are applied to assess the performance of our proposed model after non-linear regression. We use a four-parameter logistic mapping function for the non-linear regression [83]:

DMOS_p = (b_1 − b_2)/(1 + exp(−(x − b_3)/b_4)) + b_2,

where x is the objective score and b_1, b_2, b_3 and b_4 are the parameters to be fitted. The ranges of PLCC and SROCC are both [0, 1], while RMSE is greater than or equal to 0. Higher values of PLCC and SROCC indicate a better correlation between subjective and objective prediction results, whereas for RMSE a lower value reflects the same.
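The logistic mapping and the three indicators can be sketched as below (a pure-NumPy sketch; the rank-based SROCC assumes no tied scores, and in practice the parameters b_1–b_4 would be fitted by non-linear least squares):

```python
import numpy as np

def logistic4(x, b1, b2, b3, b4):
    """Four-parameter logistic: DMOS_p = (b1-b2)/(1+exp(-(x-b3)/b4)) + b2."""
    return (b1 - b2) / (1.0 + np.exp(-(x - b3) / b4)) + b2

def plcc(x, y):
    """Pearson linear correlation coefficient."""
    xm, ym = x - x.mean(), y - y.mean()
    return float((xm * ym).sum() / np.sqrt((xm ** 2).sum() * (ym ** 2).sum()))

def srocc(x, y):
    """Spearman rank-order correlation: PLCC of the rank-transformed
    data (assumes no ties; tied data would need averaged ranks)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return plcc(rank(x), rank(y))

def rmse(x, y):
    """Root mean squared error between mapped and subjective scores."""
    return float(np.sqrt(np.mean((x - y) ** 2)))
```

PLCC and RMSE are computed on the logistic-mapped scores, while SROCC, being rank-based, is invariant to the monotonic mapping and can be computed on the raw objective scores.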

Performance of region segmentation
We propose a novel region segmentation method (i.e. RSDE) based on disparity and ED. To mitigate the effect of wrongly calculated points, we calculate the ED between points over 5×5 neighbourhoods. In order to verify this choice of neighbourhood size, we provide the results for various sizes, including 3×3, 5×5, 7×7 and 9×9, in Table 1; apart from the neighbourhood size, all other settings are the same as in our model in Section 4. From the table, we can conclude that the 5×5 neighbourhood gives the best result for our model. As for the threshold in Equation (11), we tested different thresholds from 0.02 to 0.20, and some well-performing results are shown in Table 2. All tests are based on the condition that the threshold is the only variable and the other settings are the same as in Section 4. From the table, we can clearly find that the threshold 0.12 yields the best performance.
In Figure 3, we provide the results of the conventional method RSLM [20] and our RSDE. Figures 3(a) and 3(b) show four stereoscopic pairs of different scenes selected randomly from LIVE 3D Phase I. Figure 3(c) contains the region segmentation results of RSLM, while Figure 3(d) presents the results of the RSDE described in Section 4.1. In these results, the white area represents the NOC region and the black area denotes the OC region. To prove the effectiveness of RSDE, we compare the performance of SIQA using RSLM and RSDE for region segmentation in Table 3. From the table, we can see that the PLCC and SROCC values obtained with RSDE are higher than those of the conventional method, which means that the SIQA model performs better with our proposed RSDE.
In order to prove the stability of RSDE, we also use the third stereo pair of Figure 3 as the reference images. For each of the distortions, including JP2K, JPEG, WN, FF and Blur, we randomly select one stereo pair, obtaining five distorted stereo pairs. Compared with the RSDE results in Figure 5, the similarity among the segmentation results in Figure 4 is lower, which demonstrates the effectiveness of the LED maps with the 5×5 window and the stability of the proposed RSDE. But for WN distortion, the similarity between Figures 4(a) and 4(c) is higher than that between Figures 5(a) and 5(c), which means that the proposed RSDE is not highly effective for White Noise.

Performance of local feature extraction
Feature extraction is one of the most important processes in SIQA. An excellent feature extraction result not only benefits the accuracy of SIQA models but is also instrumental in coinciding with subjective results. Here, an image from Figure 3 is taken as an example: Figure 6 shows its local phase maps, and Figure 7 shows the local amplitude maps of the same original image. In order to demonstrate the effects of different distortions, distorted images with similar DMOS values are selected. As can be seen from Figure 6, the similarity between (a) and (d) is obviously lower than that between (a) and any other map such as (b), (c), (e) and (f). Except for (e), with little important structural visibility, the other maps (b, c, f) all have a high similarity with (a), which means that LP is an excellent description of features in SIQA due to its preservation of essential structure information. We can find the same situation in Figure 7: the LM maps in (b), (c), (e) and (f) are all more similar to (a) than (d) is. These two phenomena indicate that extracting features using PC may not be highly effective for White Noise, which may influence the results of the proposed model.
In order to prove the effectiveness of the feature extraction, four different schemes are tested on Phase I, and JND and BJND are included in all of them. For Scheme I, only JND and BJND are applied to produce the final prediction scores as in Equation (28). For Scheme II, besides JND and BJND, only LP is used to calculate the local scores. For Scheme III, only the local amplitude is used and the other operations are the same as in Scheme I. For Scheme IV, global PC is used to extract the local features. The results are listed in Tables 4 and 5. From the tables, we can see that Scheme I has lower PLCC and SROCC than the others because it disregards structure features. Among Schemes II, III and IV, Schemes II and III achieve higher results than Scheme IV because global PC is not the best choice for feature extraction, as mentioned in Section 4.3. Comparing Scheme II with Scheme III, we can conclude that LP contributes more than the local amplitude in the feature extraction process. For JP2K and WN distortion, the performance of the proposed model is slightly lower than that of Schemes II and III; nevertheless, the proposed model achieves the best overall result by considering both HVS characteristics and local features.

Performance of the proposed model
All experiments are conducted on LIVE 3D Phase I and Phase II, described in Section 5.1. PLCC, SROCC and RMSE, described in Section 5.2, are used as indexes to measure the effectiveness of the different methods. Table 6 reports, for both datasets, the overall performances of 12 existing models and the proposed model, while Tables 7 and 8 show their performances for each distortion type (JP2K, JPEG, WN, FF and Blur) in the two phases. In the tables, the bold value of each performance index indicates the best result. From the tables, it can be found that the conventional methods SSIM and MS-SSIM perform worse than the proposed model on both Phase I and Phase II, because binocular features are not taken into account. Among the eight FR models, Fezza's method uses RSLM to divide images into different regions, whose limitations were introduced in Section 1 and Section 4.1; it therefore cannot achieve satisfactory results owing to the poor region division. Shao's model uses a region division strategy with a similar weakness to Fezza's, so its overall performance is also limited. Lin's method fails to reach the expected result, which may be caused by the lack of disparity maps. Chen's method [27] and Benoit's method [81] depend heavily on stereo matching. Ma's method and Liu's method are based on the conventional algorithms SSIM and MS-SSIM, and their unsatisfactory results may be attributed to the difference between 2D and 3D images. Chen's method [29] takes some local and global features into account but may lose some HVS characteristics; nevertheless, it performs excellently on white noise, as does Liu's method [15]. In Table 6, among the five NR methods, Kim's [35] and Oh's [51] perform better than the other NR methods on Phase I, which can be attributed to the strength of CNNs in feature extraction.
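The three performance indexes can be computed as sketched below. The function name and toy scores are illustrative; note that in practice a nonlinear (logistic) regression is often applied to the objective scores before PLCC and RMSE are computed, a step omitted here for brevity.

```python
import numpy as np
from scipy import stats

def evaluate_metric(objective, subjective):
    """Compute PLCC, SROCC and RMSE between predicted quality
    scores and subjective scores (e.g. DMOS)."""
    objective = np.asarray(objective, dtype=float)
    subjective = np.asarray(subjective, dtype=float)
    plcc = stats.pearsonr(objective, subjective)[0]    # linear correlation
    srocc = stats.spearmanr(objective, subjective)[0]  # rank-order correlation
    rmse = float(np.sqrt(np.mean((objective - subjective) ** 2)))
    return plcc, srocc, rmse

# Toy example: four predicted scores against four subjective scores
plcc, srocc, rmse = evaluate_metric([1.0, 2.0, 3.0, 4.0],
                                    [1.1, 1.9, 3.2, 3.8])
```

Higher PLCC and SROCC (closer to 1) and lower RMSE indicate better agreement with subjective judgements, which is how the bold entries in Tables 6 to 8 are determined.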
For the same reason, Kim's method achieves the best performance for JPEG distortion. Compared with Yue's [40] and Liu's metrics [41], the proposed method achieves much better assessment results on LIVE Phase I, which can be ascribed to its sufficient consideration of both global and local features. Liu's method does perform better than the proposed model on Phase II, which may be caused by its individual calculation for different distortions. Oh's method [51], owing to its lack of binocular features, cannot obtain the expected results compared with the proposed metric. However, the main focus of this work is FR models. Excluding the NR models, the proposed model outperforms all others on both Phase I and Phase II (see Table 6). Its performance for JP2K, FF and Blur is better than that of all other models on Phase I, and for the JPEG and WN distortions it achieves results comparable to the others. Moreover, the performance of the proposed model on Phase II is better than that of most other FR models. These results prove the effectiveness of RSDE, the PSGE pooling method and the proposed model as a whole.

Time complexity tests
In this subsection, a series of experiments is conducted to analyse the time complexity of the proposed model. Table 9 lists the mean calculation time per 3D image pair on LIVE 3D Phase I for the different methods. From the table, we can see that the proposed metric has lower time complexity than Kim's and Lin's methods, which demonstrates its superiority in running speed. Kim's method computes the final quality score from local patches and therefore needs more computational time. For Lin's method, the extraction of complicated features leads to higher computational time than for the proposed metric. Liu's model [15] has a lower running time, but its performance is inferior to the proposed metric owing to insufficient consideration of both monocular and binocular features.

CONCLUSION
This paper has presented an FR stereoscopic image quality model considering both monocular and binocular characteristics as well as local features. The advantages of our work are as follows: (1) the proposed region segmentation method (RSDE) stably divides stereo images into OC and NOC regions, ensuring a better balance between monocular and binocular vision; (2) the PSGE pooling strategy well reflects the visual contributions of the OC and NOC regions; (3) the extracted local features provide additional local information to assist the visual evaluation. Compared with existing SIQA models, our model achieves better performance, which means it has higher consistency with subjective scores.