Blood vessel and background separation for retinal image quality assessment

Retinal image analysis has become an intuitive and standard aided diagnostic technique for eye diseases. Good image quality is essential for doctors to provide timely and accurate disease diagnosis. This paper proposes an end-to-end learning based method for evaluating retinal image quality. First, the blood vessels of the input image are segmented by U-Net, and the fundus image is divided into two parts: blood vessels and background. Then, we design a dual branch network module that extracts the global features influencing image quality and suppresses the interference of blood vessels and local textures to achieve better performance. The proposed module can be embedded in various advanced network structures. The experimental results show a more efficient convergence rate for networks equipped with the module. The best network accuracy is 85.83%, the AUC is 0.9296, and the F1-score is 0.7967 on the collected local dataset. Additionally, model generalization is tested on the public DRIMDB dataset, where the accuracy, AUC, and F1-score reach 97.89%, 0.9978, and 0.9688, respectively. Compared with state-of-the-art networks, the proposed method is proven to be accurate and effective for retinal image quality assessment.


INTRODUCTION
Diabetic retinopathy (DR) is a common complication of diabetes and one of the most common fundus diseases. About a third of diabetic patients will develop symptoms of diabetic retinopathy, and 10% of patients will have vision-threatening conditions [1]. Since the early stage of silent diseases is usually asymptomatic, patients are not aware of the disease until suffering serious visual impairment in an advanced disease stage, when it is too late to prevent blindness. Therefore, digital fundus photography based screening technology is highly recommended as an early diagnosis tool [2]. In current clinical diagnosis, DR detection is based on detailed analyses of retinal images collected by fundus cameras, and different treatment plans are then formulated based on the detection results [3]. This task is manually performed and time consuming. In contrast, a computer-aided diagnostic system (CADS) reduces the experts' workload and speeds up the diagnosis process. Generally, modern CADS for ophthalmology applies advanced image processing and machine learning techniques, which helps to improve the lesion detection rate [4].
A study involving an extensive database of retinal images shows that more than 25% of the images have insufficient quality for proper medical diagnosis [5], which limits CADS performance. Fundus images are scanned by a technician and subsequently sent to a physician or CADS for reading, but there is a lag in this time window. Once an image quality problem is found during the reading process, the scanning step must be restarted. Beyond the financial cost of obtaining a better quality photo, the more serious risk is misdiagnosis caused by poor image quality, which delays treatment. To improve the robustness of physician diagnosis and CADS, quality assessment of the input retinal image during acquisition is significant. We expect to determine whether the image satisfies the requirements shortly after scanning. To ensure the quality of fundus images and reduce the time and effort of manual screening, it is necessary to evaluate fundus image quality automatically and objectively during the acquisition process.
In recent years, convolutional neural networks (CNNs) have developed rapidly in the field of image analysis, and end-to-end learning has gradually substituted manual feature extraction methods. However, in the image quality assessment task, we notice that if the whole fundus image is sent directly to a CNN for training, the trained model focuses more on fundus vessels with locally strong gradients and thus ignores large scale features in the background that affect image quality, such as overexposed or blurred areas. For some slightly blurred or exposed images, the fundus blood vessel structure is relatively continuous and still has strong contrast, so excessive attention to blood vessels adversely influences the final result of image quality assessment. Motivated by this observation through feature visualization, we propose a dual branch module with vessel segmentation guided background extraction to enhance CNN performance for retinal image quality assessment. Our contributions are as follows:
1. A novel training strategy based on blood vessel and background separation of retinal images is proposed for the retinal image quality assessment task, achieving better results than state-of-the-art methods on the collected and public datasets.
2. The proposed framework can be assembled with common CNN structures and substantially improves their classification performance.
3. The proposed dual branch framework extracts more discriminative information from the background area rather than vascular information for retinal image quality assessment.
The rest of this paper is organized as follows. Section 2 describes the related work. In Section 3, the dual branch module for separating blood vessels and background is introduced in detail. In Section 4, the experimental results of our work alongside the state-of-the-art methods are given. Finally, we draw conclusions and outline future work in Section 5.

RELATED WORKS
Automatic image quality assessment (IQA) of fundus images is very important to filter out unqualified images before an ophthalmologist makes a diagnostic decision. IQA methods for natural images cannot be directly applied to fundus images, because the evaluation standard for natural images is quite different from that for retinal images: it does not consider the location of quality defects, and quality evaluation for natural images is irrelevant to deep semantic information [6].
Martini et al. [7] conducted a survey of the field of retinal image quality assessment (RIQA). Lee et al. [8] created a Q index with a histogram model, a pioneering work on RIQA, to classify retinal images into two categories, good or poor quality. Nugroho et al. [9] used a Bayesian classifier and contrast measurement to classify 47 images, obtaining a sensitivity of 97.6%, specificity of 80.0%, and accuracy of 89.3%. Yao et al. [10] obtained 93.8% specificity and 96.19% AUC using general quality parameters (including entropy, texture, symmetry, and frequency components) and support vector machine (SVM) classifiers. Abdel-Hamid et al. [11] proposed a classifier based on wavelets and the hue saturation value space to evaluate sharpness, illuminance, uniformity, field of view (FOV), and outliers. Their research used multiple databases and obtained 99.8% AUC through the SVM classifier. Carrillo et al. [12] analysed several generic features, such as the statistical features of histograms, co-occurrence matrices, run-length matrices, and the cumulative probability of blur detection. Javidi et al. [13] presented a framework based on morphological component analysis. Benefiting from adaptive representations via dictionary learning, it was adapted to work on retinal images with severe DR to simultaneously separate vessels and exudate lesions as diagnostically useful morphological components. However, there are still many shortcomings in traditional machine learning based methods, such as insufficient representation ability, poor performance in fitting large scale data, weak generalization ability, and complex handcrafted feature engineering. Deep learning has the advantages of a fast convergence rate, feature learning from the original input data without any manual operation, and better performance than traditional machine learning in many fields with big data.
In recent years, deep learning has also been applied to retinal image quality assessment, with accuracy and efficiency significantly better than those of traditional methods. In ref. [14], the authors combined the features of local salient feature maps and CNNs, and trained the model on 20,000 images using random forest and SVM classifiers. After cross validation, their specificity reached 97.8%. Saha et al. [15] trained AlexNet on EyePACS database images, showing an accuracy of 100% in categorizing 'accept' and 'reject' images, and in a clinical trial the results showed 97% agreement with human graders. Zago et al. [16] used a model pretrained on the ImageNet database to extract general features and evaluated the retinal quality classification performance by fine-tuning the weight parameters. Zhang et al. [17] combined residual and dense network blocks to obtain more detailed features of fundus images. Chalakkal et al. [18] proposed a two-stage evaluation method. In the first stage, a neural network is trained on various public datasets to determine the quality of the retinal image and the field definition, dividing images into two categories: low and high quality. In the second stage, unsupervised classification technology is applied to detect important structures (such as the optic disc and macula) in the images classified as high quality in the first stage, further dividing them into medically suitable retinal images and low quality images. Fu et al. [19] noted that previous work ignored colour spaces other than RGB in the human visual system; therefore, they proposed a general multicolour space fusion network to integrate the representations of the RGB, HSV, and LAB colour spaces. Robin et al. [20] proposed a lightweight RIQA algorithm for macular region localization and evaluated the visibility of macular images through a lightweight CNN. If the macular region was clearly visible in the field of view, the image quality was qualified. However, this algorithm ignores local exposure in the image, so it is difficult to make correct predictions. Wanderley et al. [21] presented a thorough evaluation of performance for RIQA. Zhang et al. [22] presented a novel OCT image quality assessment method, introducing the concept of pairwise learning to extract image features sensitive to OCT image quality levels. Gao et al. [23] presented an OCT image quality assessment system capable of automatic classification based on signal completeness, location, and effectiveness. Reddy et al. [24] proposed a CNN to classify the quality of fundus images automatically, which outperformed the best known blind image quality assessment algorithms, namely, DIIVINE, BLIINDS-II, and BRISQUE. Shen et al. [25] presented a new multitask domain adaptation framework to assess fundus image quality. The proposed framework provides interpretable quality assessment with both quantitative scores and quality visualization for potential real-time image recapture with proper adjustment. In particular, their approach can detect the optic disc and fovea structures as landmarks to assist the quality assessment through coarse-to-fine feature encoding. Calderon et al. [26] evaluated eight CNN architectures with different configurations for the quality assessment of retinal images taken by a nonmydriatic fundus camera with a 200° FOV.
All these works feed the whole image into the model during the training stage. Through feature visualization, we find that this may cause the model to pay more attention to local vascular features and ignore other features in the background, which makes it impossible for the model to make correct predictions on some images of poor quality due to exposure, because the overall blood vessel information of an exposed image is still well preserved. In addition, the background contains information such as the macula and optic disc. Generally, poor quality is caused by insufficient imaging conditions, such as bad lighting, poor focusing, or improper operation, which lead to exposure, artifacts, and image blur. These factors are more likely to fall into the background, which occupies the majority of the image, while the vascular information may remain complete; with too high a weight on vascular information, such an image may be misjudged as good. Therefore, it is necessary to enhance information extraction in the background at the global style level. To solve the abovementioned problems, we propose a dual branch network structure that divides the input image into two components, blood vessels and background, and sends them to the training model to capture more quality-relevant features. Supplemented by an additional public vessel segmentation dataset with only 40 images, the proposed module can significantly improve CNN performance.

Datasets
There are two datasets used in this paper:

Collected dataset
We collected an original dataset including 3,501 images labelled by professional doctors, where images of poor quality cannot be used to make a valid diagnosis. We randomly sample 80% of the images and augment them as the training set. After separating retinal blood vessels and background information, we train the retinal image quality assessment model.

DRIMDB
DRIMDB [27] was created by Sevik et al. The dataset contains 216 retinal images obtained from the medical school of Karadeniz Technical University in Turkey. Images were taken by a Canon CF-60UVi fundus camera at a 60° FOV and stored as JPEG files at 570 × 760 pixel resolution. Images were divided into three categories by a professional ophthalmologist: 125 good, 69 poor, and 22 outlier images (the outliers were not considered). Sample images are shown in Figure 1. A good quality retinal image, such as Figure 1(a), has no low-quality factors, and all retinopathy characteristics are clearly visible. Retinal images may also have serious quality issues such as blur, poor illumination, and low contrast, which prevent a professional and comprehensive diagnosis; Figure 1(b-d) show such low quality retinal images.

Data preprocessing
Since fundus images come from data sources captured by different fundus cameras, some images contain large black backgrounds. The first step of preprocessing is to extract the region of interest (ROI), removing the black edges and reserving the fundus portion. The final trimming result of one fundus image is shown in Figure 2.
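As a concrete illustration, the ROI extraction step can be sketched as follows; the intensity threshold and the OpenCV-based bounding-box approach are our assumptions, since the paper does not specify the implementation:

```python
import cv2
import numpy as np

def crop_fundus_roi(image, threshold=10):
    """Crop the black border around the fundus region.

    A minimal sketch: pixels brighter than `threshold` (an assumed value)
    are treated as fundus, and the image is cropped to their bounding box.
    """
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    coords = np.argwhere(gray > threshold)   # coordinates of non-black pixels
    if coords.size == 0:                     # fully black image: nothing to keep
        return image
    y0, x0 = coords.min(axis=0)
    y1, x1 = coords.max(axis=0) + 1
    return image[y0:y1, x0:x1]               # bounding box of the fundus
```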
Because the good quality and poor quality images in the training dataset are unbalanced, we use data augmentation to enlarge the training data several times. More specifically, we randomly select an operation from left-right flipping, upside-down flipping, and rotation by 90°, 180°, or 270° to generate more images. According to the image distribution of the training data, the number of good quality images is increased by 4 times, and the number of poor quality images is increased by 8 times. Finally, we resize the training and test images to the same size of 224 × 224 and normalize all images as $\hat{x} = (x - \bar{x})/\sigma$, where $\bar{x}$ represents the mean and $\sigma$ represents the variance.
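A sketch of this augmentation and normalization pipeline is given below; the use of per-image statistics is our assumption where the text leaves the exact normalization scheme ambiguous:

```python
import random
import numpy as np
import cv2

def augment(image):
    """Randomly apply one operation, mirroring the paper's augmentation
    (left-right flip, upside-down flip, or 90/180/270 degree rotation)."""
    op = random.choice(["hflip", "vflip", "rot90", "rot180", "rot270"])
    if op == "hflip":
        return cv2.flip(image, 1)
    if op == "vflip":
        return cv2.flip(image, 0)
    k = {"rot90": 1, "rot180": 2, "rot270": 3}[op]
    return np.rot90(image, k).copy()

def normalize(image, size=224):
    """Resize to 224x224 and standardize. Per-image mean/std is an
    assumed interpretation of the paper's normalization step."""
    image = cv2.resize(image, (size, size)).astype(np.float32)
    return (image - image.mean()) / (image.std() + 1e-8)
```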

The proposed dual branch module for blood vessels and background separation
The schematic illustration of our algorithm is shown in Figure 3. The content of one fundus retinal image can be roughly divided into two parts, fundus vessels and background. The integrity of blood vessel information is the most intuitive factor for measuring image quality: if the blood vessels in the fundus image are occluded, the image quality is not satisfactory. Due to improper operation, an overexposed region is more likely to appear in the background and becomes a global style feature, but it does not affect the integrity of vascular information or the local high contrast. In this case, if the weight of vascular information is set too high, the negative quality factors in the background area will be ignored, resulting in misjudgement of image quality.
Here, we train the encoding network to learn the feature representation as

$$\hat{l} = f_e(I; \theta_e),$$

where $I$ and $l$ represent the input image and label, $f_e$ represents the encoding network, and $\theta_e$ is the corresponding parameter of the encoding network. Optimization is achieved by gradient descent over $\theta_e$. Therefore, the features learned by the CNN from the input image are crucial to the final classification result. However, CNNs focus more on object texture rather than the general global shape [28]. In the retinal image quality assessment task, the model will be biased towards extracting blood vessel features because of their locally salient contrast. Therefore, we propose to separate the blood vessels and background in fundus images to extract more discriminative features from the two components, respectively. This separation can be regarded as

$$I = I_B + I_V,$$

where $I_B$ represents the fundus background and $I_V$ represents the blood vessel information.
Although U-Net [29] was not originally proposed for fundus images, it still achieves good results in vessel segmentation on fundus images, benefiting from its skip connection structure. We train the U-Net model on the public dataset DRIVE [30] to achieve image separation.
Consider the training dataset for vessel segmentation, $T = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_p, Y_p)\}$, where $X_p$ is one image and $Y_p$ is the corresponding label map. The label set of $T$ is $\{0, 1\}$, with 0 being the background label. The corresponding feature maps of U-Net are denoted as $F_U \in \mathbb{R}^{c \times h \times w}$, where $c$ is the number of channels and $h$ and $w$ are the height and width of $F_U$, respectively.
Based on $F_U$, we set the number of output channels to 2, and the two channels are used to determine whether each pixel belongs to a blood vessel. The output $M \in \mathbb{R}^{2 \times h \times w}$ indicates the vessel or background likelihood for each pixel. Therefore, we extract the vessel mask $M^{(1)}$ from one channel and the background mask $M^{(2)}$ from the other channel and then multiply each with the original image elementwise; the final result is shown in Figure 4. The overall structure of the blood vessels in the retinal image is roughly preserved, but some subtle blood vessels are not completely separated because of the domain shift between the vessel segmentation training dataset and the image quality assessment dataset.
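The separation step can be sketched as below; the two-channel logit output and the argmax-based masking are assumptions consistent with the definition of $M$ above, not necessarily the paper's exact implementation:

```python
import torch

def separate_vessels(unet, image):
    """Split a fundus image into vessel and background components using a
    pretrained U-Net. A sketch: `unet` is assumed to return a 2-channel
    logit map (background, vessel); `image` is a (1, 3, H, W) tensor.
    """
    with torch.no_grad():
        logits = unet(image)                          # (1, 2, H, W)
        labels = logits.argmax(dim=1, keepdim=True)   # 0 = background, 1 = vessel
    vessel_mask = (labels == 1).float()               # M^(1)
    background_mask = (labels == 0).float()           # M^(2)
    i_vessel = image * vessel_mask                    # I_V: vessels only
    i_background = image * background_mask            # I_B: background only
    return i_vessel, i_background
```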

Network structure
The proposed module can be embedded into common CNN models. Here, the main network structure takes ResNet [31] as an example, and the detailed configuration is listed in Table 1.
ResNet introduces skip connections, which solves the vanishing gradient problem in deep network training. The overall network structure is shown in Figure 5. After separating the two components of the fundus image, the images containing only blood vessels or only background are sent to ResNet, respectively. A concatenation operation along the channel dimension is performed after two residual blocks. Batch normalization (BN) [32] followed by a rectified linear unit (ReLU) [33] is used in the vessel branch, which is consistent with the original ResNet structure; in the background branch, BN is replaced by instance normalization (IN) [34], and the rest of the structure is unchanged. BN normalizes each batch to ensure the consistency of the data distribution, but this operation may weaken or eliminate some unique features, such as local exposure in low quality images. In the background path, we aim to learn more detailed features. IN normalizes the information within each image and can be regarded as an integration and adjustment of global information.
The reserved features make the network learn the particular information of each image, so we choose IN as the standard operation in the background path. Since the feature information learned by the CNN in the final convolutional layer is relatively abstract and the spatial information of the original fundus image is lost, we introduce skip connections between the shallow and deeper layers of the CNN and perform scale correction through three 1 × 1 convolutions. The number of channels in each 1 × 1 convolution is the same as that of the feature map with the same size in the main network, namely 512, 1024, and 2048. In addition, we use one convolution after the concatenation of each feature map, and its number of channels is set to half of the channel number of the concatenated feature map to reduce the feature dimensions. We then reweight all the feature maps with the SE-Block [35], which models the inter-dependencies between channels of convolutional features. It is defined as the transformation from feature maps $F$ to $F'$:

$$F' = w \otimes F, \quad w = \sigma\big(W_2\,\delta(W_1\, f_{Global}(F))\big),$$

where $\sigma$ and $\delta$ refer to the sigmoid activation and the ReLU function, respectively, $W_1$ and $W_2$ are the weight matrices of two fully connected (FC) layers, $f_{Global}$ is the global average pooling operation, and $\otimes$ denotes channelwise multiplication. By reweighting the channel features using the vector $w$, informative features can be selectively emphasized. This allows the foreground and background information to be fully explored and helps to optimize the final feature descriptor for image quality assessment. Then, the FC layer and cross entropy loss are employed.

The pseudocode of our algorithm's forward propagation is shown in Table 2. First, the blood vessels of the input image are extracted by the U-Net model pretrained on DRIVE, and the fundus image is divided into two components: the blood vessel part and the background. Then, the dual branch network module extracts features from the vessels and background, respectively. In this module, IN is used for the background branch and BN for the vessel branch. At last, we fuse the feature maps from the two branches, concatenate the shallow spatial information, and reweight the feature maps with SE-Blocks.
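To make the dual branch design concrete, the sketch below shows a minimal stem with BN in the vessel branch, IN in the background branch, channel concatenation, and SE reweighting; the layer sizes are illustrative assumptions rather than the exact configuration in Table 1:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: F' = sigmoid(W2 ReLU(W1 GAP(F))) * F."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))           # global average pool + two FC layers
        return x * w.view(x.size(0), -1, 1, 1)    # channelwise reweighting

class DualBranchStem(nn.Module):
    """A minimal sketch of the dual branch idea: BN in the vessel branch,
    IN in the background branch, then channel concatenation and SE."""
    def __init__(self, out_channels=64):
        super().__init__()
        self.vessel = nn.Sequential(
            nn.Conv2d(3, out_channels, 7, stride=2, padding=3),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))
        self.background = nn.Sequential(
            nn.Conv2d(3, out_channels, 7, stride=2, padding=3),
            nn.InstanceNorm2d(out_channels), nn.ReLU(inplace=True))
        self.se = SEBlock(2 * out_channels)

    def forward(self, i_vessel, i_background):
        fused = torch.cat([self.vessel(i_vessel),
                           self.background(i_background)], dim=1)
        return self.se(fused)                     # reweighted fused features
```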

EXPERIMENTAL RESULTS
To verify the superiority of our network, we collect labelled fundus image datasets and train the proposed method and state-of-the-art network architectures on them to evaluate fundus image quality. There are many hyperparameters in CNNs, such as the number of network layers, the learning rate, the batch size, and the optimizer parameters. We use Adam to train our network and set the initial learning rate to 2e-5. In addition, we train the model with a batch size of 8. The parameters used in the other networks are the same.
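A minimal training setup matching these reported hyperparameters might look as follows; the plain ResNet-50 backbone here is a stand-in assumption for any network assembled with the proposed module:

```python
import torch
import torchvision

# Hyperparameters as reported: Adam optimizer, initial learning rate 2e-5,
# batch size 8. The backbone choice is a placeholder, not the paper's model.
model = torchvision.models.resnet50(num_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
criterion = torch.nn.CrossEntropyLoss()
batch_size = 8
```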

Compared with the state-of-the-art CNN models
We compare the proposed method with other popular models, including ResNet [31], SeNet [35], Inception-V3 [36], Inception-V4 [37], and DenseNet [38]. The experimental results are shown in Table 3. The suffix p denotes a network assembled with the proposed module, which manifests better accuracy, AUC, and F1-score than the original architecture, showing the effectiveness of the vessel and background separation strategy for enhancing quality assessment performance.
The experimental results show that the dual branch module is feasible and effective for fundus image quality assessment, with all performance indicators improved to some extent. In addition, we select three recent research works to train on the local dataset; the results are given below the solid line in Table 3. We compare the upgraded Inception-V3 with the method proposed by Zago [16]. The methods proposed by Zhang [17] and Fu [19] replace the basic ResNet modules. We also introduce two natural IQA methods to validate their performance in the RIQA area, fine-tuning the output layer of each network; the results are shown in Table 3. The method proposed by Hou [39] extracts features from a pair of images using two ResNet-18 networks with shared weights, subtracts the output feature maps of the two models element by element, and obtains the final result. Because of its advanced network structure, it achieves better results than the comparative models, but it does not exceed any model based on our proposed framework (ResNet50-p, ResNet101-p, SeNet50-p, SeNet101-p, Inception-V3-p, Inception-V4-p, DenseNet121-p).

Model generalization
To further verify model generalization, we test the models' performance on DRIMDB, a public fundus image quality assessment dataset. The dataset includes 125 good images, 69 poor images, and 22 outliers. The dataset is too small to train a CNN model, so we only use it as the test dataset. The results are shown in Table 4, where our method is again labelled by the suffix p. The experimental results show that each of the commonly used models achieves better results after assembling our module, with the various indicators improved by 4-22%. One possible reason is that there is a large stylistic difference between the training and test data: training with the entire image as the model input may result in learning locally strong texture information.
Using the dual branch network structure effectively reduces the possibility of this phenomenon. Compared with recently proposed methods, our method shows an obvious performance improvement over the evaluation indices. Fu's method based on DenseNet is slightly ahead of DenseNet121-p in AUC, perhaps because it ensembles the training results of three basic models and requires more memory, while our proposed method uses only a small number of additional modules composed of convolution layers and still achieves leading results. Li [40] proposed extracting image features from left and right view angles respectively and fusing them step by step to simulate the working mode of both eyes and make full use of high- and low-frequency information. This method is based on the behaviour of human eyes and can achieve good results in natural IQA tasks, but it still does not distinguish between lesion and normal areas in RIQA, which makes the overall network performance less competitive, especially on the DRIMDB dataset.

Comparison of training rates for optimal network performance
In this section, we find that our proposed method not only outperforms the common models in performance but also achieves its best performance faster than the other models. Figure 6 shows the iterative results of all network training and testing. We use the same colour to represent the same network: the dotted line represents the basic network, and the solid line represents the network using our proposed method. Our models converge faster than the dotted lines and present a lower error rate with less fluctuation. The number of training epochs for each model to achieve its best result is shown in Figure 7. We find that when the network is shallow, our proposed method has a significant performance advantage. As the number of network layers grows, the network can learn more detailed feature representations, but there is still a certain accuracy gap with our method. From Figure 7(a), the gap among SeNet, DenseNet, and our methods in the number of epochs needed to achieve the best effect is relatively small, because they adopt channel enhancement and multilevel connection, respectively, to boost the feature representation ability and accelerate the convergence rate. Essentially, the proposed dual branch module actively separates image features at the local and global levels to further improve the feature representation ability at the expense of a small amount of computation and storage space. Figure 7(b) shows the convergence rate of each model on DRIMDB. Our method achieves the best performance very quickly because of the small data quantity in DRIMDB and the obvious difference between good and poor quality: most of the poor quality images do not contain a complete vascular structure, and our method can effectively identify such broken structures. In addition, the optimal convergence rate of SeNet-101 is very good because attention based channel dependency modelling can effectively enhance the background characteristics, while DenseNet performs much worse than before because its multilevel connection mode still contains many redundant vascular features and does not truly learn the features in the background. Feature redundancy of DenseNet is also found in the DR recognition task [41].

Effect of model capacity sensitivity
In our proposed dual branch network structure, compared with the baseline, we add one training path, which increases the number of parameters of the whole network. To prove that the main reason for the improvement is vessel and background separation rather than the expansion of model capacity, we conduct an experiment in which two copies of the original image are input into the dual branch network for training, with the rest of the settings unchanged (represented by o). Meanwhile, we also test with input images containing only background information; the results are shown in Table 5.
The performance of o on the local dataset is much lower than that of the proposed network, even weaker than that of the baseline. Redundant features are extracted when training the network with two identical images, which affects the overall performance. As a result of learning a large number of redundant features, the network does not converge on DRIMDB.
Using separated background images as the network input during the training stage boosts the overall performance to some extent, but it is still slightly lower than the dual branch network, indicating that vascular information is beneficial to the final quality assessment and cannot be completely eliminated. Based on this idea, we propose the vessel and background dual branch training strategy, which not only avoids extracting the same redundant features but also combines local and global features to improve the performance.

The CNNs parameters
The parameter counts of the networks are shown in Table 6. Our models add a considerable number of parameters, mainly on account of the large-channel SE-Blocks in the deep layers. Although the proposed method incurs a certain computational burden, it only needs a few epochs to achieve optimal performance. Moreover, a model with fewer parameters (like DenseNet121-p) can outperform a model with more parameters (like Inception-V4): the accuracy of DenseNet121-p is 85.37%, the AUC is 0.9296, and the F1-score is 0.783, better than the accuracy of 83.66%, AUC of 0.9072, and F1-score of 0.7673 achieved by Inception-V4.

Feature map visualization
In the previous experiments, by visualizing the feature maps of traditional CNNs like ResNet, we find that the extracted features mostly come from the retinal blood vessels. Thus, critical factors for retinal image quality are ignored, such as the macula and dark or overexposed areas in the background. The visualization results are shown in Figure 8. Figure 8(a) shows low quality images with dark or overexposed areas. Figure 8(b) shows the feature response images of ResNet50: the maximum response areas mainly focus on the retinal vessels (the redder the colour, the higher the response), and the response in the dark or overexposed areas is feeble. We also provide the feature response images of our method in Figure 8(c). These images intuitively illustrate that the proposed model no longer mainly focuses on the blood vessels but pays more attention to the areas that influence the image quality assessment.
Given a low quality test image that has a dark area in the upper left corner and overexposure in the middle right region, we overlay the feature maps of different models on the test image; the visualization results are given in Figure 9. Although ResNet and SeNet successfully recognize the overexposed area, the dark area in the upper left corner is not detected, and the feature maps of DenseNet are redundant, with the heat map lit almost everywhere. After embedding our proposed module, all these methods detect more accurate zones that directly influence the image quality.

CONCLUSIONS AND FUTURE WORK
In this paper, we propose a retinal quality assessment method based on fundus vessel and background separation. The blood vessels are obtained by the pretrained U-Net and used as a mask, and the input image is multiplied element by element to obtain two images, one with only blood vessels and one with only background, which are fed into the network to make it learn the discriminative features closely related to image quality. The proposed module can be embedded into common CNN structures. Experiments show that this method achieves better results and reaches its best performance more quickly. Especially, it also generalizes to the unseen public dataset DRIMDB with an accuracy of 97.89%, an AUC of 0.9978, and an F1-score of 0.9688. In future work, more factors will be considered under a multi-task learning framework, such as clarity, field definition, etc.