Sonar image quality evaluation using deep neural network

Sonar technology plays an important role in the development of marine resources and in military strategy. Due to the poor quality of underwater acoustic channels, the sonar images collected by sonar equipment are easily affected by various kinds of distortion. To obtain high-quality sonar images, the authors devise a novel dual-path deep neural network (DPDNN) to measure the quality of sonar images. In the two paths, the authors use batch normalization layers to reduce the training time and skip connections to speed up feature extraction. Based on these two operations, the authors extract the microscopic and macroscopic structures of sonar images, respectively. Finally, a global average pooling layer and a fully connected layer are used to connect the two paths. Experiments show that the authors' DPDNN achieves significant improvements in prediction performance and efficiency. The source code will be published in the near future.


INTRODUCTION
With the development of marine resources in various countries, side-scan sonar technology is becoming increasingly popular among researchers, who have accordingly established higher requirements for the quality of sonar images. It should be noted that the internal space of underwater navigation equipment is small (as shown in Figure 1), and so it is impossible to carry a large amount of computing equipment. Therefore, the collected sonar images need to be transmitted to ground equipment through an underwater acoustic channel for further analysis and processing. Due to the complexity of the underwater acoustic channel environment, sonar images are often damaged by various kinds of distortion, which degrades image quality [1,2]. Low-quality sonar images cannot meet research requirements, and so it is necessary to design an effective sonar image quality evaluation (IQE) technology to guide the transmission of sonar images. Moreover, IQE plays an important role in estimating image degradation and optimizing image compression parameters. According to the amount of information used from reference images, IQE can be divided into three categories: full-reference IQE, reduced-reference IQE and no-reference IQE.
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

The earliest application objects of IQE were natural scene images [3,4] and screen content images [5,6], and the corresponding research is relatively mature. However, there are few studies on sonar IQE. The difference between sonar images and common natural scene images or screen content images is that the quality of the latter two types can be judged according to the observer's own perception. Generally, natural scene images are mostly composed of thick lines, rich colours and complex textural content, while screen content images are mostly composed of thin lines, limited colours and simple shapes. Sonar images are mostly greyscale images in which the pixel values change infrequently, the contrast is low and the details and content are few. Furthermore, they are mostly used for mapping and detection rather than aesthetics, and so the structural information is of greater concern to the observer. Therefore, when evaluating sonar image quality, it is better to rely on professional knowledge rather than on normal aesthetic judgement. The characteristics of the above three types of images are obviously different, and so the evaluation methods are also different.
In recent years, sonar IQE has been widely studied and developed. Chen et al. [7] propose a full-reference sonar image quality predictor based on the similarity of the local entropy map and the edge map, from the perspectives of statistical and structural information.

IET Image Process. 2021;1-8. wileyonlinelibrary.com/iet-ipr

FIGURE 1
Underwater navigation equipment for sonar image acquisition

Chen et al. [8] propose a three-stage reduced-reference sonar image quality predictor based on the definition of the target in the sonar image, the amount of information that can be extracted from the sonar image and the degree of comfort degradation of the sonar image. Chen et al. [9] propose a no-reference sonar image quality index that evaluates image quality by measuring the degree of contour degradation after the distorted image is filtered. In addition, some machine learning methods [10-12] are also widely used in the field of IQE. However, the performance of the above-mentioned sonar IQE methods based on traditional image analysis technology is not ideal [2,7-9]. Therefore, in this work, we devise a novel dual-path deep neural network (DPDNN) to better evaluate the quality of sonar images, inspired by the multi-path neural network [13]. Specifically, in the first path, we connect convolutional layers and max-pooling layers in turn, and selectively add batch normalization (BN) layers to reduce the training time. In the second path, in addition to the convolutional layers, max-pooling layers and BN layers, we add a skip connection operation to speed up feature extraction. Finally, a global average pooling (GAP) layer and a fully connected (FC) layer are used to connect the above two paths and obtain the evaluation quality score of a sonar image. Compared with traditional sonar IQE methods, the DPDNN model achieves significant improvements in the accuracy, monotonicity and consistency of prediction.
The contributions of this work are summarized as follows: (1) Aiming to address the shortcomings of traditional image analysis techniques [2,7-9], we apply deep learning technology to sonar IQE and achieve improved results; (2) We design a novel DPDNN with very few parameters and high execution efficiency; (3) Compared with the state-of-the-art no-reference contour degradation measurement (NRCDM) [9] model, the performance of our DPDNN model is improved by 5.76% on the Pearson linear correlation coefficient (PLCC), 11.01% on the Spearman rank order correlation coefficient (SROCC), 15.90% on the Kendall rank order correlation coefficient (KROCC), 5.77% on the root mean square error (RMSE) and 8.41% on the mean absolute error (MAE), and the execution efficiency is increased by a factor of 15.

FIGURE 2
The basic structure of the DPDNN includes two paths for feature extraction; a cascade operation for feature fusion; and a convolution layer, GAP layer and FC layer for regression
The rest of this paper is arranged as follows. In Section 2, we introduce the proposed DPDNN model and network training in detail. Section 3 provides the results of the experiments on the new sonar image quality database (SIQD) [14] to demonstrate the execution efficiency and evaluation performance of the DPDNN model. Finally, we summarize this paper in Section 4.

PROPOSED DPDNN METHOD
With the development of sonar technology, there has been an increase in technologies used to evaluate sonar image quality [2,7-9], but these traditional image analysis technologies still cannot achieve satisfactory results. Therefore, in this section we elaborate on the DPDNN model specially designed for sonar IQE; the structure of our DPDNN model is shown in Figure 2.
In this section, we first introduce the structure of the DPDNN network, including the establishment of the two paths and how the two paths are combined to evaluate the quality of sonar images. Then, we explain the specific content of the network training.

DPDNN structure
First, the establishment of our first path is inspired by the classic VGG-Net network structure [15]. VGG-Net only uses convolutional layers and max-pooling layers for feature extraction, which increases the training time. Therefore, we selectively add a BN layer after each of the last six convolutional layers to reduce the training time of the network and prevent parameter overfitting. Specifically, we connect eight convolution layers, six BN layers and three max-pooling layers to construct a sequential network for extracting the microscopic features of sonar images. Note that no BN layer is added after the first two convolution layers, for better feature protection. The standardization of each feature map $F_i$ is defined as follows:

$$\hat{F}_i = \frac{F_i - \mu}{\sqrt{\sigma^2 + \varepsilon}}, \quad \mu = \frac{1}{N}\sum_{j=1}^{N} F_{i,j}, \quad \sigma^2 = \frac{1}{N}\sum_{j=1}^{N}\left(F_{i,j} - \mu\right)^2, \tag{1}$$

where $\mu$ and $\sigma^2$ are the mean and variance of the mini-batch, respectively; $N$ is the size of the mini-batch; $F_{i,j}$ represents the $i$th feature of the $j$th sample in the mini-batch; and $\varepsilon$ is a fitting parameter that improves the numerical stability of the network. However, the representation ability of the features may decrease after normalization. Therefore, we introduce two parameters $\gamma$ and $\beta$ to transform the normalized feature in the BN layer:

$$\tilde{F}_i = \gamma \hat{F}_i + \beta, \tag{2}$$

where $\tilde{F}_i$ is the transformed standardized feature, and $\gamma$ and $\beta$ are optimal values obtained through network training. After going through the above network structure, we express the output $F^1_{17}$ of the first path as follows:

$$F^1_{17} = \omega^1_{17} * F^1_{16} + b^1_{17}, \tag{3}$$

where $*$ is the convolution operator; $\omega^1_{17}$ and $b^1_{17}$ represent the weight matrix and offset matrix of the 17th layer, respectively; and $F^1_{16}$ is the feature map of the 16th layer. Secondly, the construction of our second path benefits from the skip connection operation in Res-Net [16], whose introduction can avoid the vanishing gradient problem and improve the feature propagation speed. Therefore, we introduce the skip connection operation on the basis of the first path.
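As a minimal illustration of the two BN steps described above, the standardization and the learnable affine transform can be sketched in NumPy (the scalar `gamma`, `beta` and `eps` values here are illustrative placeholders, not trained parameters):

```python
import numpy as np

def batch_norm(F, gamma=1.0, beta=0.0, eps=1e-5):
    # Standardize each feature over the mini-batch, then apply the
    # learnable affine transform gamma * F_hat + beta.
    mu = F.mean(axis=0)                    # per-feature mini-batch mean
    var = F.var(axis=0)                    # per-feature mini-batch variance
    F_hat = (F - mu) / np.sqrt(var + eps)  # standardized features
    return gamma * F_hat + beta            # restore representation ability

# Toy mini-batch of N = 4 samples with 3 features each.
batch = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 6.0],
                  [3.0, 6.0, 9.0],
                  [4.0, 8.0, 12.0]])
out = batch_norm(batch)
print(out.mean(axis=0))  # per-feature means are approximately 0
print(out.std(axis=0))   # per-feature deviations are approximately 1
```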
Specifically, we connect 12 convolution layers, 3 max-pooling layers and 9 BN layers to form a sequential network for extracting macroscopic features. The convolution kernel size of the first convolution layer $F^2_1$ is 9, which is used to extract more abundant image features. In addition, no BN layer is added after the sixth convolution layer $F^2_6$, because this layer is used to reduce the dimensions by properly fusing feature maps. Finally, the skip connection operation is used to connect the feature map $F^2_1$ output by the first convolution layer with the feature map $F^2_5$ output by the fifth convolution layer. The $1 \times 1$ convolution kernel used in the cascade operation is applied as follows:

$$F^2_6 = \omega^2_6 * \left[F^2_1, F^2_5\right] + b^2_6, \tag{4}$$

where $[F^2_1, F^2_5]$ means connecting the two 3D feature mapping matrices along the third dimension; $\omega^2_6$ and $b^2_6$ represent the weight matrix and offset matrix of the sixth convolution layer, respectively; and $F^2_6$ is the feature map of the sixth layer. The output of the second path is expressed as follows:

$$F^2_{16} = \omega^2_{16} * F^2_{15} + b^2_{16}, \tag{5}$$

where $\omega^2_{16}$, $b^2_{16}$, $F^2_{15}$ and $F^2_{16}$ have the same meanings as the corresponding variables in Equation (3). Note that $X^1_x$ denotes variables in the first path and $X^2_x$ denotes variables in the second path, where $X \in \{\omega, b, F\}$ and $x \in \{1, 2, \ldots\}$. Finally, we design a feature fusion structure to connect the microscopic and macroscopic features. Specifically, the above two paths are fused by a concatenation operation, and then a convolution layer with a kernel size of one is used to transition the features:

$$F_{17} = \omega_{17} * \left[F^1_{17}, F^2_{16}\right] + b_{17}, \tag{6}$$

where $\omega_{17}$, $b_{17}$ and $F_{17}$ have the same meanings as the corresponding terms in Equation (4). Then, the GAP layer is used to reduce the number of parameters and alleviate the overfitting problem:

$$g_t = \frac{1}{S}\sum_{s=1}^{S} p_{s,t}, \tag{7}$$

where $p_{s,t}$ represents the $s$th pixel value in the $t$th feature map, $S$ is the total number of pixels in each feature map and $t$ is the sequence number of the feature map. In the end, 128 GAP values are fed through the FC layer to obtain the DPDNN's prediction of sonar image quality.
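To make the wiring concrete, the dual-path layout can be sketched with the Keras functional API (the framework the authors train with). The filter counts, input resolution and exact layer ordering below are assumptions for illustration; only the stated constraints follow the description above: 8 convolutions/6 BN/3 poolings in path one with no BN after the first two convolutions; 12 convolutions/9 BN/3 poolings in path two with a 9 × 9 first kernel and a 1 × 1 skip-fusion convolution without BN; and a 1 × 1 transition convolution producing 128 GAP values for the FC layer.

```python
from tensorflow.keras import layers, Model

def conv_bn(x, filters, kernel=3, use_bn=True):
    # Convolution optionally followed by a batch-normalization layer.
    x = layers.Conv2D(filters, kernel, padding='same', activation='relu')(x)
    return layers.BatchNormalization()(x) if use_bn else x

inp = layers.Input(shape=(128, 128, 1))  # input resolution is an assumption

# Path 1 (microscopic features): 8 convolutions, 6 BN layers, 3 max-poolings;
# no BN after the first two convolutions.
p1 = conv_bn(inp, 32, use_bn=False)
p1 = conv_bn(p1, 32, use_bn=False)
p1 = layers.MaxPooling2D()(p1)
for f in (64, 64, 64):
    p1 = conv_bn(p1, f)
p1 = layers.MaxPooling2D()(p1)
for f in (128, 128, 128):
    p1 = conv_bn(p1, f)
p1 = layers.MaxPooling2D()(p1)

# Path 2 (macroscopic features): 12 convolutions, 9 BN layers, 3 max-poolings.
# The first convolution uses a 9x9 kernel; a skip connection concatenates its
# output with a later feature map, fused by a 1x1 convolution without BN.
skip = p2 = conv_bn(inp, 32, kernel=9, use_bn=False)
for f in (32, 32, 64, 64):
    p2 = conv_bn(p2, f)
p2 = layers.Concatenate()([skip, p2])
p2 = conv_bn(p2, 64, kernel=1, use_bn=False)  # 1x1 fusion / dimension reduction
p2 = layers.MaxPooling2D()(p2)
for f in (64, 64, 64):
    p2 = conv_bn(p2, f)
p2 = layers.MaxPooling2D()(p2)
p2 = conv_bn(p2, 128)
p2 = conv_bn(p2, 128)
p2 = conv_bn(p2, 128, use_bn=False)
p2 = layers.MaxPooling2D()(p2)

# Feature fusion: concatenate the two paths, apply a 1x1 transition
# convolution, then GAP (128 values) and an FC layer for regression.
fused = layers.Concatenate()([p1, p2])
fused = layers.Conv2D(128, 1, activation='relu')(fused)
gap = layers.GlobalAveragePooling2D()(fused)
score = layers.Dense(1)(gap)  # predicted quality score
model = Model(inp, score)
```

Both paths end at the same spatial resolution (three poolings each), so the final concatenation is shape-compatible.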

Network training settings
This paper uses the stochastic gradient descent (SGD) optimization algorithm in the training process, which has shown good performance in previous research. To make the network converge to better parameters, we use momentum and learning rate decay in the optimization algorithm, so that the network can avoid falling into a local optimum or a saddle point [17]. Specifically, the momentum coefficient is 0.9, the initial learning rate is 0.0001, the decay rate is 0.00001, the mini-batch size is set to 16 and the number of training epochs is set to 300 [18]. As in most regression tasks, the mean square error (MSE) loss function is used to optimize the network during training:

$$L = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{x}_i - x_i\right)^2, \tag{8}$$

where $\hat{x}_i$ and $x_i$ represent the true label value and the predicted value of the $i$th image, respectively, and $N$ represents the total number of labels. The DPDNN is then trained in the built environment, so that the network obtains the best correlation and the best model parameters on the validation set. The network layer structure parameters of the DPDNN are shown in Table 1.
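The objective and the update rule can be written out as a small pure-NumPy sketch (illustrative only; `sgd_momentum_step` is a hypothetical helper mirroring SGD with momentum, not the authors' training code):

```python
import numpy as np

def mse_loss(y_true, y_pred):
    # L = (1/N) * sum_i (x_hat_i - x_i)^2 over the N labels in a batch.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y_pred - y_true) ** 2))

def sgd_momentum_step(w, grad, velocity, lr=1e-4, momentum=0.9):
    # One parameter update of SGD with momentum: accumulate a velocity
    # term, then move the weight along it.
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Toy check: predictions [58, 50, 66] against MOS labels [60, 45, 70].
print(mse_loss([60.0, 45.0, 70.0], [58.0, 50.0, 66.0]))  # (4 + 25 + 16) / 3 = 15.0
```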
In neural network training, the amount of data is very important since it determines the robustness and generalization of the network. However, the SIQD database contains only 840 sonar images, and so we augment the sonar data. Specifically, we rotate and flip (horizontally and vertically) the sonar images, expanding the data to eight times the original amount. Figure 3 shows examples of the augmented images.
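One way to realize an eightfold expansion is to take the dihedral variants of each image: the original, its three right-angle rotations and a horizontal flip of each. This is a sketch under that assumption; the authors' exact combination of rotations and flips may differ:

```python
import numpy as np

def augment_x8(img):
    # The four right-angle rotations of the image, plus the horizontal
    # flip of each rotation: eight distinct variants for a generic image.
    rotations = [np.rot90(img, k) for k in range(4)]
    return rotations + [np.fliplr(r) for r in rotations]

# A toy 4x4 "sonar image" with no symmetry.
img = np.arange(16.0).reshape(4, 4)
variants = augment_x8(img)
print(len(variants))  # 8 variants per source image, so 840 -> 6720
```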

EXPERIMENTAL RESULTS
In this section, we demonstrate the performance of the DPDNN model on the newly proposed SIQD database. The SIQD database contains 40 lossless sonar images taken by sonar equipment (as shown in Figure 4, depicting swimmers, the seabed, shipwrecks and underwater creatures) and 800 lossy sonar images generated by four distortion types at four to six distortion levels. Each sonar image in the SIQD database was scored subjectively by 25 non-professional undergraduate or graduate students, and the mean opinion score (MOS) was then obtained through data cleaning. For more details, please refer to reference [14]. In a neural network training task, the amount of data is an important consideration. Therefore, we rotate the images in the SIQD database by 90°, 180° and 270° and flip them horizontally and vertically to obtain 6720 images. The expanded data set is then divided into five groups according to the types of reference images, among which three groups form the training set (4032 images), one group forms the validation set (1344 images) and the remaining group forms the test set (1344 images). It should be noted that each group needs to be tested, and so we train five models and take the final average as the result of the DPDNN.
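The grouping protocol above can be sketched as follows; the group labels and the rule for which group serves as the validation set in each fold are assumptions for illustration:

```python
# Five reference-image groups; each serves once as the test set, with
# another group as the validation set and the remaining three for training.
groups = [f"group_{i}" for i in range(5)]  # hypothetical group labels

folds = []
for test_idx in range(5):
    val_idx = (test_idx + 1) % 5  # assumed rotation for the validation group
    train = [g for i, g in enumerate(groups) if i not in (test_idx, val_idx)]
    folds.append({"train": train, "val": groups[val_idx], "test": groups[test_idx]})

print(len(folds))              # 5 folds -> 5 trained models to average
print(len(folds[0]["train"]))  # 3 training groups per fold
```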
To better verify the performance, we compare the DPDNN model with 11 state-of-the-art IQE models: BLind Image Integrity Notator using DCT Statistics 2 (BLIINDS2) [19], Blind/Reference-less Image Spatial QUality Evaluator (BRISQUE) [20], Blind image quality assessment using joint statistics of Gradient Magnitude and Laplacian Features (GMLF) [21], No-reference Free Energy based Robust Metric (NFERM) [22], AR-based Image Sharpness Metric (ARISM) [23], Blind Quality Measure for Screen content images (BQMS) [24], Blind Image Quality Assessment method based on High Order Statistics Aggregation (HOSA) [25], Accelerated Screen Image Quality Evaluator (ASIQE) [26], Blind Pseudo Reference Image based quality measure (BMPRI) [27], Blind Multiple Pseudo Reference Image based quality measure (BPRI) [28] and No-Reference Contour Degradation Measurement (NRCDM) [9]. Among these, the NRCDM model is a recently proposed model for evaluating the quality of sonar images based on the degree of degradation of a filtered sonar image contour. To fairly measure quality, we use five classic criteria recommended by the video quality experts group [29]: the PLCC is used to measure the prediction accuracy; the SROCC and KROCC are used to measure the monotonicity of the predictions; and the RMSE and MAE are used to measure the consistency of the predictions. PLCC, SROCC and KROCC values closer to 1 and RMSE and MAE values closer to 0 indicate better model performance. It should be noted that there may be non-linearity in the subjective scoring process. Therefore, we first use a five-parameter logistic regression to map the objective quality prediction score of each model:

$$f(s) = \beta_1\left(\frac{1}{2} - \frac{1}{1 + e^{\beta_2 (s - \beta_3)}}\right) + \beta_4 s + \beta_5, \tag{9}$$

where $f(s)$ is the evaluation score after the regression, $s$ represents the sonar image quality prediction score and $\beta_1, \ldots, \beta_5$ are the five fitting parameters obtained with the Gauss-Newton method.
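The non-linear mapping step can be sketched with SciPy using the common five-parameter logistic function; note that `curve_fit` uses a Levenberg-Marquardt-style solver rather than plain Gauss-Newton, and the scores below are made-up illustrative data:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic5(s, b1, b2, b3, b4, b5):
    # Five-parameter logistic mapping applied to objective scores before
    # computing PLCC/RMSE/MAE, to absorb non-linearity in subjective rating.
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (s - b3)))) + b4 * s + b5

# Hypothetical objective scores and MOS values.
s = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
mos = np.array([20.0, 35.0, 50.0, 65.0, 80.0])
p0 = [1.0, 1.0, float(s.mean()), 1.0, float(mos.mean())]
params, _ = curve_fit(logistic5, s, mos, p0=p0, maxfev=10000)
mapped = logistic5(s, *params)  # regressed scores, compared against MOS
```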
Table 2 records the results obtained on the SIQD database according to the above evaluation criteria. To facilitate identification by readers, we mark the results of the two best-performing models in bold. It can be seen from the experimental results that the DPDNN model proposed in this paper achieves the best performance on all five test indexes, with performance gains of approximately 5.76%, 11.01%, 15.90%, 5.77% and 8.41% in terms of the PLCC, SROCC, KROCC, RMSE and MAE, respectively. In addition, only our proposed model scores higher than 0.75 on the PLCC and SROCC, higher than 0.55 on the KROCC and lower than 9 on the RMSE and MAE, which again shows that the DPDNN model makes excellent predictions with respect to accuracy, monotonicity and consistency. Except for the NRCDM, all the other selected methods have achieved good performance on natural images or screen content images, but their performance on sonar images is not as good as that of our DPDNN model, as shown in Table 2.
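The five criteria can be computed directly with SciPy and NumPy; a self-contained sketch on hypothetical scores:

```python
import numpy as np
from scipy import stats

def iqe_metrics(mos, pred):
    # PLCC measures accuracy; SROCC/KROCC measure monotonicity;
    # RMSE/MAE measure consistency of the predictions.
    mos, pred = np.asarray(mos, float), np.asarray(pred, float)
    plcc, _ = stats.pearsonr(pred, mos)
    srocc, _ = stats.spearmanr(pred, mos)
    krocc, _ = stats.kendalltau(pred, mos)
    rmse = float(np.sqrt(np.mean((pred - mos) ** 2)))
    mae = float(np.mean(np.abs(pred - mos)))
    return plcc, srocc, krocc, rmse, mae

# Hypothetical MOS labels versus model predictions.
metrics = iqe_metrics([10, 20, 30, 40], [12, 19, 33, 41])
```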
In addition to the above indicators, execution efficiency is also an important measure. In the experiment, we use MATLAB R2014a to test the selected comparison models, and TensorFlow [30] and Keras [31] to train and test the DPDNN model. The operating environment is a Windows 10 server with an Intel(R) Xeon(R) CPU E5-2620 v4 at 2.10 GHz, 192.00 GB of RAM and an NVIDIA TITAN Xp GPU. From Table 3, it can easily be seen that the proposed DPDNN model consumes only 11.7 ms per image, far exceeding the execution efficiency of the other models, mainly because the DPDNN model has few parameters, at only 0.45 MB. Compared with the state-of-the-art NRCDM, our DPDNN not only achieves excellent prediction performance, but also improves the execution efficiency by approximately 15 times.
To show the performance of the DPDNN model more intuitively, in Figure 5 we draw scatter plots between the objective quality prediction scores obtained by the above 11 IQE models and our DPDNN model and the corresponding MOS values. In each scatter diagram, we use different colour marks to distinguish the sampling points of different distortion types: red circles, green squares, blue diamonds and pink triangles represent 'Group 1', 'Group 2', 'Group 3' and 'Group 4', respectively, where 'Group 1'-'Group 4' are the image groups generated by the four types of distortion. Two conclusions can easily be drawn from Figure 5. (1) Compared with the other 11 IQE models, our DPDNN model has higher monotonicity and linearity. Specifically, the scatter plot of the DPDNN model is more tightly clustered than that of the NRCDM model (ranking second), which is specially designed for sonar IQE. (2) An outstanding IQE model should be able to predict each distortion type well. As shown in Figure 5, BLIINDS2, BRISQUE, GMLF, ARISM, BQMS, HOSA and ASIQE cannot accurately evaluate 'Group 3'; NFERM cannot evaluate 'Group 1' and 'Group 3' very well; and BMPRI and BPRI cannot evaluate 'Group 2' properly. In contrast, our DPDNN model is robust to all four distortion types in the SIQD database. In addition, we show some examples of the DPDNN and NRCDM models (ranking second) when evaluating sonar image quality. As shown in Figure 6, the evaluation results of the proposed DPDNN network are close to the manual scores, while the prediction results of the NRCDM show a certain gap with the MOS scores, which demonstrates the good performance of the DPDNN.
Finally, we use the F-test to further examine the statistical significance of the DPDNN model. Specifically, we compare the variances of the prediction residuals of two objective methods. If F > F_critical, there is a significant difference between the two methods. In our test, the significance level is set to 0.05, and F_critical is the corresponding critical value of the F-distribution. Therefore, we can make a reasonable statistical judgement on the new objective method by analysing its prediction residuals. The statistical significance results are shown in Table 4. The symbol '0' indicates that there is no significant difference between the two objective methods, '+1' indicates that our method is statistically superior to the other method and '-1' indicates that our method is statistically inferior to it. It can be seen from Table 4 that our proposed DPDNN model obtains a statistically significant performance improvement.
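The variance-ratio test described here can be sketched with SciPy (the residuals are hypothetical; in the paper, the inputs would be each method's prediction errors against MOS):

```python
import numpy as np
from scipy import stats

def f_test(residuals_a, residuals_b, alpha=0.05):
    # Ratio of residual variances; the difference between two objective
    # methods is significant when F exceeds the critical value at level alpha.
    ra, rb = np.asarray(residuals_a, float), np.asarray(residuals_b, float)
    f_stat = np.var(ra, ddof=1) / np.var(rb, ddof=1)
    f_crit = stats.f.ppf(1.0 - alpha, len(ra) - 1, len(rb) - 1)
    return f_stat, f_crit, bool(f_stat > f_crit)

# Hypothetical residuals: method B tracks MOS far more closely than method A.
res_a = [9.0, -8.0, 7.5, -9.5, 8.2, -7.8]
res_b = [1.0, -0.8, 1.2, -0.9, 1.1, -1.0]
f_stat, f_crit, significant = f_test(res_a, res_b)
```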

FIGURE 6
Evaluation results of sonar images by the DPDNN and NRCDM models

CONCLUSION
In this paper, we proposed a novel dual-path deep neural network (DPDNN) for sonar IQE, in which the GAP layer and the FC layer are used to integrate the microscopic and macroscopic features extracted by the two paths. The experimental results on the newly established SIQD database show that the proposed DPDNN model achieves a significant improvement in prediction performance and execution efficiency. Compared with the state-of-the-art NRCDM model, our DPDNN model is 15 times more efficient in terms of execution, which makes it very suitable for environments with strict real-time requirements, such as torpedo detection. In future research, we will consider applying block processing to sonar images to achieve improved performance.