An image enhancement algorithm of video surveillance scene based on deep learning

Shuai Liu, Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha 410081, China. Email: liushuai@hunnu.edu.cn
Yu-Dong Zhang, School of Informatics, University of Leicester, Leicester LE1 7RH, UK. Email: yudongzhang@ieee.org

Funding: Suqian Science and Technology Project (Z2019109); Natural Science Foundation of Hunan Province (2020JJ4434); Key Scientific Research Projects of Department of Education of Hunan Province (19A312); Hunan Provincial Science & Technology Project Foundation (2018TP1018, 2018RS3065); Innovation and Entrepreneurship Training Program of Hunan Xiangjiang Artificial Intelligence Academy; Educational Reform Project of Hunan Xiangjiang Artificial Intelligence Academy; Open Project Program of the State Key Lab of CAD&CG (A1926)

Target enhancement is the most important task in a video surveillance system. In order to improve the accuracy and efficiency of target enhancement, and to better support the subsequent recognition, tracking, behaviour understanding and other processing of targets, a deep learning-based image enhancement algorithm for video surveillance scenes is proposed. First, super-resolution reconstruction of the image is carried out with an image super-resolution reconstruction method based on a hybrid deep convolutional network, to improve the sharpness of the image. Then, for the reconstructed video surveillance scene image, a watershed image enhancement algorithm based on morphology and region merging is used to enhance the video surveillance scene image. Deep learning algorithms can improve the accuracy of image enhancement through iterative calculation. Experimental results show that after image enhancement in daytime, night and noisy video surveillance scenes, the maximum enhancement difference rate is less than 0.5%, the intersection-over-union is close to 1, and the average image enhancement time is less than 1.3 s. The algorithm can enhance images of video surveillance scenes and improve their clarity.


INTRODUCTION
In daily life, human beings obtain most of their information through the visual system, in the form of images, so the image is an irreplaceable information carrier for human perception of the world [1]. With the development of science and technology, humans use computers to simulate the process by which the human brain recognizes targets in an image, in order to obtain image processing methods and make images more suitable for human observation or instrument detection [2]. In the field of computer vision research, image enhancement is the first step and is very important, but there is still a lot of room for improvement. People are still trying to apply advanced algorithms to the field of image enhancement, hoping to produce better enhancement results. Therefore, further research on image enhancement algorithms is still of great practical significance [4].
In the research and application of images, people are often interested in certain parts of the image, commonly called targets or foregrounds. They generally correspond to specific regions with unique properties in the image. In order to identify and analyse the objects in the image, it is necessary to separate them from the image [5]. On this basis, the target can be further measured and the image put to use. With the development of computer technology and digital image monitoring technology, key military alert targets have begun to use digital image monitoring systems to improve monitoring efficiency [6]. In order to further improve the intelligence of the monitoring system, it is necessary to analyse the image effectively. Video surveillance scene image enhancement is an important image analysis technology [7].
Related scholars have carried out research on image enhancement algorithms for video surveillance scenes, such as methods based on the peak signal-to-noise ratio (PSNR) [8] and structural similarity (SSIM) [9]. The main advantage of the PSNR-based algorithm is that it needs no parameter control and its enhancement accuracy is relatively high, but it must calculate the similarity between adjacent regions. The SSIM, as the most basic calculation unit, is an index that measures the similarity of two images. It can improve the accuracy of image enhancement, but it takes a long time, resulting in low efficiency.
Deep learning is a branch of machine learning and a new direction of the recent decade. The essence of deep learning is bionics: it simulates the process of human brain perception and cognition. The research goal of deep learning is to construct neural networks that, like the human brain, perform analytical learning and imitate the mechanisms of the visual system to analyse image, sound and text information. Deep learning developed on the basis of the artificial neural network. At the end of the last century, neural networks received great attention in the field of machine learning, but problems then appeared, such as overfitting, difficult parameter tuning, slow training, and no clear advantage over other methods when the network had few layers, so they gradually faded out [10]. In 2006, Geoffrey Hinton put forward the idea of deep learning. Subsequently, deep learning attracted many scholars in academia and was widely used in industry, making breakthroughs in speech recognition, image recognition and natural language processing. The difference between deep learning and traditional shallow learning is that deep learning requires a deep network structure and attaches great importance to feature learning. Compared with hand-crafted feature construction, it is easier to obtain all kinds of latent information by training on massive data and learning features. At present, deep learning has made remarkable achievements in many fields and has become a research hotspot in academia and industry. The upsurge of deep learning is still continuing, with continuous major breakthroughs that will have a significant impact on the field of machine learning.
In this paper, deep learning technology is used in the video surveillance scene image enhancement, and an image enhancement algorithm of video surveillance scene based on deep learning is proposed to realize the video surveillance scene image enhancement.

IMAGE ENHANCEMENT ALGORITHM OF VIDEO SURVEILLANCE SCENE BASED ON DEEP LEARNING
This chapter first performs super-resolution (SR) reconstruction of the image with a hybrid deep convolutional network. The image background and target are then characterized by a morphological open-close hybrid operation, and the contour is enhanced by Laplacian sharpening. According to the brightness of the video surveillance scene image regions, the object and background regions are marked, and the watershed of the modified gradient amplitude image is segmented. Finally, region merging based on maximum similarity (RMMS) is used to merge the segmented image regions, folding the unmarked regions into the target or background regions until no new regions can be merged. These steps effectively improve the clarity of the reconstructed image and the accuracy of the reconstruction, and together implement an image enhancement algorithm for video surveillance scenes.

Image SR reconstruction method based on hybrid deep convolution network
Super-resolution (SR) reconstruction aims to reconstruct high-resolution (HR) images with rich details from one or more input low-resolution (LR) images. As an ill-posed inverse problem, reconstruction needs to collect and analyse as many adjacent pixels as possible to obtain more clues, which can supplement the missing pixel information in the up-sampling process. Single-image super-resolution (SISR) reconstruction of a video surveillance scene image uses the rich information contained in the image and the visual priors obtained from sample images to identify important visual clues, fill in details, and render the result as faithfully and aesthetically as possible. In recent years, reconstruction technology has been widely used in medical imaging, satellite remote sensing, video surveillance and other fields [11]. This paper proposes an image SR reconstruction method based on a hybrid deep convolutional network, one of the core technologies in the field of deep learning. In this method, a codec structure built from convolution and deconvolution removes the noise generated during image reconstruction. Through the mixed use of different convolution methods, an end-to-end network is constructed, and an SR image consistent with the original image is reconstructed [12].
Through the image SR reconstruction method based on the hybrid deep convolutional network, the image is SR-reconstructed to improve its sharpness. Then, for the reconstructed video surveillance scene image, the watershed image enhancement algorithm based on morphology and region merging is used to enhance the video surveillance scene image.

2.1.1 Up-sampling

In the task of SR reconstruction of a video surveillance scene image based on deep learning, the reconstruction model of the high-resolution image is

I_H = B^(-1)(D^(-1)(I_L)) − S,   (1)

where I_H is the high-resolution image of the video surveillance scene; D^(-1) is the up-sampling operation; I_L is the low-resolution image of the video surveillance scene; B^(-1) is the deblurring operation; and −S is the noise reduction operation. In the up-sampling module, the LR image of the video surveillance scene is scaled to a sub-LR image with the same number of pixels as the high-resolution image [13]. The simplest way of up-sampling is resampling and interpolation: the input image is rescaled to the desired size, and the pixel value of each point is calculated at the same time. In this paper, a deconvolution network structure is used to up-sample the input LR image to the target image size [14]. In this part, a deconvolution layer with a 3 × 3 convolution kernel is used to up-sample the LR image of the video surveillance scene to the same scale as the target image, as the input of the feature extraction layer.
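As a rough NumPy sketch (not the paper's trained network), the deconvolution-style up-sampling step can be imitated by zero-insertion followed by a 3 × 3 'same' convolution; the kernel below is a hand-picked placeholder for learned weights:

```python
import numpy as np

def conv2d_same(x, k):
    """'Same' 2-D convolution: zero padding, stride 1, output keeps input size."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def deconv_upsample(lr, kernel, scale=2):
    """Zero-insertion upsampling followed by a 3x3 convolution,
    a simple stand-in for a learned deconvolution layer."""
    h, w = lr.shape
    up = np.zeros((h * scale, w * scale))
    up[::scale, ::scale] = lr  # scatter LR pixels onto the HR grid
    return conv2d_same(up, kernel)
```

With an averaging kernel, the scattered LR pixels are smeared into the inserted zeros, which is exactly the interpolation role a deconvolution layer learns to play.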

Image feature extraction
In traditional image restoration, one strategy of image feature extraction is to extract image blocks densely and then represent them with a set of pre-trained bases (such as principal component analysis (PCA), discrete cosine transform (DCT), etc.). In a convolutional neural network, this part can be incorporated into the basic optimization of the network, and the convolution operation can automatically extract video surveillance scene image features [15]. Formally, the feature extraction sub-network in this paper is represented as

F_1 = W_1 ⊗ Z + b_1,

where Z is the LR image of the video surveillance scene after up-sampling; W_1 and b_1 represent the convolution weight and offset, respectively, and the size of W_1 is 3 × 3 × 64; ⊗ is the convolution operation, with a zero boundary added and a step size of 1, so that the input and output sizes are consistent and boundary rank reduction is prevented.
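The feature extraction step (each filter W_1 convolved with the up-sampled image Z plus an offset b_1) can be sketched directly in NumPy, assuming a small bank of 3 × 3 filters with zero padding and stride 1; the filter values stand in for trained weights:

```python
import numpy as np

def extract_features(z, weights, biases):
    """Feature extraction layer: one 'same' 3x3 convolution per filter,
    zero padding, stride 1, so input and output sizes match.
    weights: (n_filters, 3, 3), biases: (n_filters,)."""
    h, w = z.shape
    zp = np.pad(z, 1)  # zero boundary prevents size shrinkage
    feats = np.zeros((len(weights), h, w))
    for n in range(len(weights)):
        for i in range(h):
            for j in range(w):
                feats[n, i, j] = np.sum(zp[i:i + 3, j:j + 3] * weights[n]) + biases[n]
    return feats
```

An identity filter (1 at the centre, 0 elsewhere) reproduces the input exactly, which is a handy sanity check for the padding and indexing.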

Structure of codec denoising
In the feature denoising structure, convolution and deconvolution are cascaded to construct the encoding and decoding structure, which eliminates the feature noise of the video surveillance scene image to the maximum extent. The convolution layers retain the main video surveillance scene image content, while the deconvolution layers compensate for the details, so as to achieve a good denoising effect while better retaining the image content. Formally, this layer is expressed as

H_1(F_1) = W_2 ⊗ F_1 + b_2,
H_2(F_1) = W_3 ⊗ H_1(F_1) + b_3,

where F_1 is the output of the image feature extraction stage; W_2 and b_2 are the convolution weight and offset; W_3 and b_3 are the deconvolution weight and offset; and H_1(F_1) and H_2(F_1) are the video surveillance scene image features extracted by one and two operations in turn. The convolution layers gradually reduce the size of the feature map, retain the main image information of the video surveillance scene, and obtain abstract image features. The deconvolution layers gradually enlarge the feature map and restore the detail information of the image features. At the same time, a jump (skip) connection is used to speed up the training process and to ensure that the input and output sizes of the encoding and decoding structure are consistent. This also preserves test efficiency under the limited computing power of a mobile terminal, and yields a denoised image feature map of the video surveillance scene [16].
All 64 convolution kernels of size 3 × 3 × 64 are used in the convolution layers of this part. In the first half, the convolution operation adds no zero boundary and uses a step size of 2, so the output feature size becomes half of the input size. In the second half, deconvolution feature recovery adds a zero boundary with a step size of 2, and the video surveillance scene feature map is restored to its original size to keep the image scale intact.
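As a toy stand-in for the codec structure (no learned weights: a 2 × 2 average replaces the stride-2 convolution, and nearest-neighbour expansion replaces the stride-2 deconvolution), the halve-then-restore size behaviour and the skip connection look like this:

```python
import numpy as np

def encode(x):
    """Stride-2 stage: halves the feature map. A 2x2 average stands in
    for a learned stride-2 convolution (keeps the main content)."""
    return 0.25 * (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2])

def decode(h):
    """Stride-2 deconvolution stand-in: expands each value into a 2x2
    block, restoring the original feature-map size."""
    return np.kron(h, np.ones((2, 2)))

def codec_denoise(x):
    """Encoder-decoder with a jump (skip) connection, so the input and
    output sizes of the codec structure stay consistent."""
    return decode(encode(x)) + x
```

The skip connection adds the input straight to the decoder output, which both stabilises training in the real network and guarantees matching input/output sizes.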

Reconstruction of video surveillance scene image
In the process of video surveillance scene image reconstruction, mapping the hidden-state feature image F_2 at the input to the SR reconstructed image at the output can be regarded as the inverse operation of the feature extraction stage. In the network convolution, the convolution kernel W_dc is used as a reconstruction basis coefficient, and each position of the high-dimensional hidden-state image feature is regarded as the SR image domain to obtain the SR reconstructed image. For this reason, a convolution layer is defined to generate the final SR image I_SR:

I_SR = W_dc ⊗ F_4 + b_5,

where W_dc is c convolution kernels with a size of 1 × 1 × 64, c is the number of video surveillance scene image channels, b_5 is the convolution bias, and F_4 is the feature map after four convolutions.
The first four convolutions in this part use 64 kernels of size 3 × 3 × 64 to extract high-dimensional image features of the video surveillance scene. The dilation (convolution point spacing) of the kernels is 1, 1, 2 and 4, respectively, which lets the features be computed with a larger receptive field. Then, a convolution layer composed of 64 kernels of size 3 × 3 × 64 sums the first extracted video surveillance scene image features with the high-dimensional features, to ensure full utilization of the features. Finally, the required SR image is reconstructed using the convolution layer of c kernels with a size of 1 × 1 × 64.
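A minimal sketch of the two ingredients described above, assuming single-channel features and hand-picked kernels: a 3 × 3 convolution with a configurable point spacing (dilation), and a 1 × 1 convolution that collapses a (C, H, W) feature stack into one image:

```python
import numpy as np

def dilated_conv_same(x, k, d):
    """3x3 convolution with dilation d (convolution point spacing),
    zero padded so the output keeps the input size. Larger d widens
    the receptive field without more weights."""
    h, w = x.shape
    xp = np.pad(x, d)
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            # sample 3 points per axis, spaced d apart
            out[i, j] = np.sum(xp[i:i + 2 * d + 1:d, j:j + 2 * d + 1:d] * k)
    return out

def reconstruct_sr(f4, w1x1, b5):
    """1x1 convolution: a weighted sum over the feature channels
    collapses the stack f4 (C, H, W) into the SR image."""
    return np.tensordot(w1x1, f4, axes=1) + b5
```

The 1 × 1 layer shows why the paper calls W_dc a basis coefficient: each output pixel is just a linear combination of the channel values at that position.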

Loss function and optimization
For any given training data set, the goal is to find an accurate mapping F(x) that minimizes the mean square error (MSE) between the SR reconstructed image I_SR and the real image T, and so improves the image quality evaluation index, the peak signal-to-noise ratio (PSNR). Although a high PSNR does not guarantee an absolutely excellent reconstructed image, satisfactory performance is still observed when alternative evaluation indexes are used to evaluate the model.
The MSE is

MSE = (1 / (H W)) Σ_{i=1}^{H} Σ_{j=1}^{W} (P(i, j) − T(i, j))²,

where P(i, j) and T(i, j) represent the predicted image and the real image, respectively; i and j index the height and width of the video surveillance scene image; and H and W are the height and width of the video surveillance scene image, respectively.

That is, the training objective is

f_loss = (1 / N) Σ_{n=1}^{N} ‖F(x_n) − T_n‖² + λ ‖W‖²_2,

where f_loss is the loss function; λ is the multiplicative coefficient of the weight-attenuation (weight decay) term ‖W‖²_2; and N is the number of optimization steps over the training pairs.
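The training objective can be sketched as mean squared error over the training pairs plus an L2 weight-decay penalty; the decay coefficient here is an assumed placeholder, not a value from the paper:

```python
import numpy as np

def sr_loss(preds, targets, weights, decay=1e-4):
    """Mean squared error averaged over N training pairs, plus an L2
    weight-decay penalty on the network weights."""
    n = len(preds)
    mse = sum(np.mean((p - t) ** 2) for p, t in zip(preds, targets)) / n
    return mse + decay * np.sum(weights ** 2)
```

Minimizing this MSE directly maximizes PSNR, since PSNR is a monotone decreasing function of the MSE.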

2.2 Watershed image enhancement algorithm based on morphology and region merging

2.2.1 Morphological open-close hybrid operation

Mathematical morphology is a tool for the mathematical analysis of images based on structural elements. In the analysis, structural elements with a certain morphological structure are used to measure and extract the corresponding shapes in the image, so as to further achieve the purpose of image recognition and analysis. In image processing, there are four basic operations of mathematical morphology: dilation, erosion, opening and closing.
Image erosion can remove small protrusions or burrs of objects smaller than the structural element. By selecting structural elements of different sizes, objects of different sizes can be removed from the image. The dilation operation can enlarge objects and fill the holes in an object.
In morphology, the structural element is the most basic concept. The function of structural elements in a morphological transformation is equivalent to the 'filter window' in signal processing. Let B be the target structural element in the video surveillance scene image and B(x) its translation to point x. For each point x in the video surveillance scene image space E and a target region A, the erosion X and dilation Y are defined as

X = A ⊖ B = {x ∈ E : B(x) ⊆ A},
Y = A ⊕ B = {x ∈ E : B(x) ∩ A ≠ ∅}.

The morphological open-close hybrid operation can eliminate holes in the target details of the image while maintaining the integrity and position of the target information. Before subsequent processing, this step eliminates the local extrema caused by irregular grey disturbance and noise in the gradient image, and preserves the object contours and information of the video surveillance scene image.
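A grayscale open-close operation can be sketched with min/max window filters (erosion and dilation); the 3 × 3 window size is an assumption, not the paper's structural element:

```python
import numpy as np

def erode(img, size=3):
    """Grayscale erosion: minimum over the structuring-element window."""
    p = size // 2
    xp = np.pad(img, p, mode='edge')
    h, w = img.shape
    return np.array([[xp[i:i + size, j:j + size].min() for j in range(w)]
                     for i in range(h)])

def dilate(img, size=3):
    """Grayscale dilation: maximum over the structuring-element window."""
    p = size // 2
    xp = np.pad(img, p, mode='edge')
    h, w = img.shape
    return np.array([[xp[i:i + size, j:j + size].max() for j in range(w)]
                     for i in range(h)])

def open_close(img, size=3):
    """Opening (erode then dilate) removes bright specks smaller than the
    window; closing (dilate then erode) fills dark holes. Chaining the
    two gives the open-close hybrid operation."""
    opened = dilate(erode(img, size), size)
    return erode(dilate(opened, size), size)
```

A single bright pixel (a noise-induced local extremum) is wiped out by the opening stage, while flat regions pass through unchanged, which is exactly the pre-watershed cleanup described above.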

Laplacian sharpening
The edges and details of the video surveillance scene image are further emphasized, and the image is sharpened to improve the contrast [17]. Laplacian sharpening relates each pixel to the degree of mutation of its surrounding pixels, and the transition degree of the pixel can be found from the second derivative. When the grey value of the centre pixel in a neighbourhood is lower than the average grey value of the neighbourhood, the grey value of the centre pixel should decrease further; when it is higher than the average, it should increase. In this way, sharpening of the video surveillance scene image is realized [18]. The relationship between the second derivative and the pixels is

∇²f(x, y) = f(x + 1, y) + f(x − 1, y) + f(x, y + 1) + f(x, y − 1) − 4 f(x, y),

where the input video surveillance scene image is f(x, y), and x and y index the pixels of the image. The central mask coefficient of the Laplacian gives the four-neighbourhood matrix template

[ 0  1  0 ]
[ 1 −4  1 ]
[ 0  1  0 ].

Similarly, the sharpening of the eight-neighbourhood of the video surveillance scene image uses

∇²f(x, y) = f(x−1, y−1) + f(x−1, y) + f(x−1, y+1) + f(x, y−1) + f(x, y+1) + f(x+1, y−1) + f(x+1, y) + f(x+1, y+1) − 8 f(x, y),

with matrix template

[ 1  1  1 ]
[ 1 −8  1 ]
[ 1  1  1 ].

By replacing the value at the original pixel (x, y) with the obtained value, similar boundaries can be obtained, and the sharpened video surveillance scene image is given by

g(x, y) = f(x, y) − ∇²f(x, y).
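The standard four- and eight-neighbourhood Laplacian masks and the sharpening rule (subtracting the Laplacian from the image, given the negative centre coefficient) translate directly into code; this is a plain sketch, not the paper's implementation:

```python
import numpy as np

# 4-neighbourhood and 8-neighbourhood Laplacian masks
LAP4 = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
LAP8 = np.array([[1, 1, 1], [1, -8, 1], [1, 1, 1]], dtype=float)

def sharpen(img, mask=LAP4):
    """g(x, y) = f(x, y) - laplacian(f): with a negative-centre mask,
    subtracting the second derivative boosts edges over flat regions."""
    f = img.astype(float)
    fp = np.pad(f, 1, mode='edge')
    h, w = f.shape
    lap = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            lap[i, j] = np.sum(fp[i:i + 3, j:j + 3] * mask)
    return f - lap
```

On a flat region the mask sums to zero, so the image passes through unchanged; a pixel brighter than its neighbours gets pushed further up, which is the contrast-boosting behaviour described above.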

Marking background and target
When looking for the markers, the target object is relatively bright, while the background corresponds to the dark area. If the background area of the video surveillance scene image is dark, image binarization can be used to find the region corresponding to the background, ensuring that the external markers are included in the background. Because the target object in the video surveillance scene image is generally located in a prominent position relative to other objects, it is likely to be the local region with the largest pixel values, which can be found by searching for local extrema. The largest value is used as the marker of the target object [19]. The process of finding internal and external markers is shown in Figure 1.

FIGURE 1 The process of finding inner and outer markers

First, the video surveillance scene image is repeatedly opened and closed to generate the regional maxima and minima of the image, and the maximum points are marked as the target. Then, the video surveillance scene image is binarized, and the pixel position information is transformed into grey information by a distance transform. Finally, the boundaries between adjacent areas of the video surveillance scene image are extracted to form the background marker, and the gradient amplitude image of the video surveillance scene image is modified.
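A small sketch of the marker-finding steps, assuming a bright target on a dark background; the brute-force distance transform is for illustration only, and a real implementation would use an optimized transform:

```python
import numpy as np

def distance_transform(binary):
    """Brute-force Euclidean distance from each foreground pixel to the
    nearest background pixel (turns position info into grey info)."""
    fg = np.argwhere(binary)
    bg = np.argwhere(~binary)
    dist = np.zeros(binary.shape)
    for i, j in fg:
        dist[i, j] = np.sqrt(((bg - (i, j)) ** 2).sum(axis=1)).min()
    return dist

def find_markers(img, thresh):
    """Binarize (bright target vs dark background), then take the
    regional maximum of the distance map as the inner (target) marker
    and the background itself as the outer marker."""
    binary = img > thresh
    dist = distance_transform(binary)
    inner = dist == dist.max()   # deepest point inside the target
    outer = ~binary              # dark area = background marker
    return inner, outer
```

The inner marker sits at the pixel farthest from any background pixel, which matches the "local extreme value" criterion in the text.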

2.2.4 Region merging based on maximum similarity (RMMS)

After the marking in the previous step, there are still some unmarked areas left in the video surveillance scene image. This method folds the merged regions into the target region or background region. RMMS is divided into two stages. The first stage is an iterative sampling method, which merges the unmarked regions that are most similar to the background set, until no region in the background set can merge with any region of the unmarked set. In each iteration step, the background set and the unmarked set are updated continuously: the background set expands and the unmarked set shrinks. The second stage deals with the regions remaining after the first stage. Within the unmarked set, regions are merged according to the maximum similarity of regions; the merged regions belong to the target or background and are handed over to the next round. The two stages are executed in turn until no new regions can be merged.

• B is a neighbouring region of A, and S_A = {S_A^i} is the set of all neighbouring regions of A.
• The degree of similarity (A, S_A^i) between A and each of its neighbours is calculated.
• When the similarity between region A and region B satisfies (B, A) = max_i (A, S_A^i), region B and region A are merged.
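The first merging stage can be sketched as follows, with hypothetical region IDs, normalised grey histograms per region, and the Bhattacharyya coefficient assumed as the similarity measure (the paper does not specify one):

```python
import numpy as np

def similarity(h1, h2):
    """Bhattacharyya coefficient between normalised region histograms:
    1 means identical distributions, 0 means disjoint."""
    return float(np.sum(np.sqrt(h1 * h2)))

def merge_into_background(hist, adjacency, background, unlabelled):
    """Stage 1 of RMMS: repeatedly fold an unlabelled region into the
    background set when its most similar neighbour is already in the
    background; stop when a full pass produces no new merge."""
    merged = True
    while merged:
        merged = False
        for r in sorted(unlabelled):  # snapshot: safe to mutate the set
            best = max(adjacency[r], key=lambda n: similarity(hist[r], hist[n]))
            if best in background:
                background.add(r)
                unlabelled.discard(r)
                merged = True
    return background, unlabelled
```

Each pass expands the background set and shrinks the unlabelled set, mirroring the iterative update described in the text; stage 2 would apply the same maximum-similarity rule among the remaining unlabelled regions.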

2.2.5 Flow chart of the algorithm

Figure 2 shows the flow chart of the proposed algorithm. First, the obtained video surveillance scene image is converted to a grey image, the grey image is opened and closed morphologically, and the contour is then enhanced by the Laplacian [20]. Then, based on the brightness of the video surveillance scene image regions, the object and background regions are marked, and the watershed of the modified gradient amplitude image is segmented. Finally, RMMS is used to merge the segmented image regions, and the unmarked regions are merged into the target or background regions until no new regions can be merged.

EXPERIMENTAL ANALYSIS
In order to further test the effect of the proposed algorithm on video surveillance scene image processing, the experiment is run on a computer with an Intel Core i7-7500U CPU and 8 GB RAM. The tool used in the experiment is MATLAB 2015.

Super resolution reconstruction effect of video surveillance scene image
In order to test the enhancement effect of the proposed algorithm on video surveillance scene image, the video surveillance scene image in daytime, night and noisy video surveillance scene image, in Figure 3, are taken as examples to test the image quality change after processing by this algorithm. The visual effect of video surveillance scene image in Figure 3 after SR reconstruction is shown in Figure 4.
Comparing Figures 3 and 4, we can see that the clarity of the daytime, night and noisy video surveillance scene images is improved after processing by the proposed algorithm. The reason is that the proposed algorithm directly realizes an end-to-end mapping between the LR and SR video surveillance scene images, and adopts the cascaded convolution-deconvolution encoding and decoding method, which improves the clarity of the images and eliminates noise, thereby removing the noise in the image features and optimizing the image quality of the video surveillance scene.
The above experimental results test the effect of the algorithm visually; the following two classical metrics, the peak signal-to-noise ratio (PSNR) and SSIM, are used to evaluate the SR reconstruction effect of the algorithm for the video surveillance scene image:

PSNR = 20 log_10 (l / RMSE),

where RMSE is the root MSE of the video surveillance scene image reconstruction:

RMSE = sqrt( (1 / (p q)) Σ_{i=1}^{p} Σ_{j=1}^{q} (I_SR(i, j) − T(i, j))² ),

where p is the number of rows of the video surveillance scene image; q is the number of columns; I_SR is the reconstructed video surveillance scene image; and T is the actual video surveillance scene image. The SSIM is

SSIM = l(I_SR, T)^{α_1} · c(I_SR, T)^{α_2} · s(I_SR, T)^{α_3},

where α_1, α_2 and α_3 are weights; l, c and s are the luminance, contrast and structure comparison terms, computed from I_SR and T with stabilizing constants; and l is the term involving the maximum pixel value of the video surveillance scene image. In general, the higher the values of PSNR and SSIM, the better the effectiveness of the reconstructed video surveillance scene image and the more accurate its structure.
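Under the usual definitions (and a simplified single-window SSIM with equal weights and the standard stabilising constants, which is an assumption here rather than the paper's exact configuration), the two metrics can be computed as:

```python
import numpy as np

def rmse(sr, t):
    """Root mean square error between reconstruction and ground truth."""
    return float(np.sqrt(np.mean((sr - t) ** 2)))

def psnr(sr, t, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    return 20.0 * np.log10(peak / rmse(sr, t))

def ssim_global(sr, t, peak=255.0):
    """Simplified single-window SSIM: product of luminance, contrast and
    structure terms with equal weights and the usual stabilisers."""
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    mx, my = sr.mean(), t.mean()
    vx, vy = sr.var(), t.var()
    cov = ((sr - mx) * (t - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

A perfect reconstruction gives SSIM = 1 and an unbounded PSNR, matching the "closer to 1 / higher is better" reading used in the results below.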
The number of processing iterations for the video surveillance scene image is set to 150, 250, 350, 450 and 550 in turn. After testing the proposed algorithm, the peak signal-to-noise ratio and structural similarity of the three video surveillance scene images in Figure 3 are obtained; the results are shown in Figure 5.

Analysis of Figure 5 shows that, after the algorithm is run for many iterations, the peak signal-to-noise ratio of the three video surveillance scene images is greater than 36 dB and the SSIM is greater than 95%. It can be seen that, after processing the three kinds of video surveillance scene images, the reconstructed images have good quality and accurate structure.

Image enhancement effect of video surveillance scene
When judging the performance of the proposed algorithm, the enhancement difference rate, the intersection-over-union, and the enhancement efficiency are used. The enhancement difference rate is defined as

D = |S_1 − S_2| / S_1 × 100%,

where S_1 and S_2 are the target region areas of the video surveillance scene image before and after enhancement. A large difference rate indicates that the enhancement effect of the algorithm is poor. The number of processing iterations for the video surveillance scene image is set to 150, 250, 350, 450 and 550 in turn, and the enhancement difference rate test results are shown in Figure 6. Analysis of Figure 6 shows that, after segmenting the daytime, night and noisy video surveillance scene images with the proposed algorithm, the difference rate between the segmented video surveillance scene image and the original image is very low, with a maximum value below 0.5%, which indicates that the enhancement effect of the proposed algorithm is good.
The intersection-over-union (IOU) is calculated as

IOU = area(C ∩ G) / area(C ∪ G),

where area(C) and area(G) are the real background part of the video surveillance scene image and the background part obtained by the enhancement. The closer the intersection-over-union is to 1, the higher the enhancement accuracy of the video surveillance scene image. The number of processing iterations for the video surveillance scene image is set to 150, 250, 350, 450 and 550 in turn, and the IOU test results are shown in Figure 7.
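Both judging criteria are easy to compute given binary masks and region areas; the difference-rate formula below assumes the relative-area reading of S_1 and S_2:

```python
import numpy as np

def iou(pred, truth):
    """Intersection-over-union of two binary masks; 1 = perfect overlap."""
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return inter / union

def difference_rate(s1, s2):
    """Relative area change between the original (s1) and enhanced (s2)
    target regions, as a percentage (assumed reading of the criterion)."""
    return abs(s1 - s2) / s1 * 100.0
```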
Analysis of Figure 7 shows that, after the enhancement of the daytime, night and noisy video surveillance scene images, the intersection-over-union achieved by the proposed algorithm is very close to 1; that is, the ratio between the real background part and the background part obtained by enhancement is close to 1, indicating that the algorithm has high enhancement accuracy for video surveillance scene images.
The number of processing iterations for the video surveillance scene image is set to 150, 250, 350, 450 and 550 in turn, and the execution time of the proposed algorithm for enhancing the three types of video surveillance scene images in Figure 3 is tested. The test results are shown in Table 1.
Analysis of Table 1 shows that, before the proposed algorithm is used, the average enhancement time of the daytime, night and noisy video surveillance scene images is more than 12 s, and the enhancement efficiency is low; after the proposed algorithm is applied, the average enhancement time of the daytime, night and noisy video surveillance scene images is less than 1.3 s, and the enhancement efficiency is higher. The proposed method uses a deep learning algorithm: through iterative learning, the optimal convergence state can be reached quickly and the image is reconstructed efficiently, which improves the efficiency of the algorithm.

CONCLUSION
Image enhancement is the basis of image analysis, image recognition and image understanding. The quality of image enhancement is directly related to the effect of subsequent image processing, so it plays a very important role in the image processing pipeline. Image enhancement uses the grey level, colour, texture, shape and other information of the image to separate regions with different meanings according to a given criterion, so that consistency within each region and difference between regions are both satisfied. Image enhancement can extract the objects of interest from a complex scene for the next processing step. Video surveillance scene image enhancement can effectively divide the target and background into two parts, and help people judge the target information in the video surveillance scene image quickly and effectively. This paper takes the video surveillance scene image as the research object and proposes a deep learning-based image enhancement algorithm for video surveillance scenes. In the experiment, the simulation test is carried out with MATLAB 2015 software. The tests show that the algorithm can segment various types of video surveillance scene images with high accuracy and high efficiency, and it is an effective image enhancement algorithm.
1. Cross-comparison tests on various types of video surveillance scene images show that the designed algorithm can effectively improve the definition of video surveillance scene images.
2. The average enhancement time of the algorithm in this paper is less than 1.3 s, indicating high enhancement efficiency.
In future research, the deep learning-based video surveillance scene image enhancement algorithm should be further improved, with the goal of applying it to the batch processing of video surveillance scene images.