Ground truth free retinal vessel segmentation by learning from simple pixels

Retinal vessel segmentation is fundamental to automatic retinal image analysis and ocular disease screening. This paper aims to learn a ground-truth-free feature aggregation strategy for vessel segmentation. Five vesselness maps modelling the vessels' profile, appearance and shape are first generated; in addition, the histogram of the local binary pattern and the green colour channel are extracted. In each vesselness map, pixels with large vesselness values are regarded as simple positive samples, pixels with small vesselness values as simple negative samples, and pixels with mediocre values as difficult pixels. The simple positive and simple negative samples near the difficult pixels constitute the training dataset, while the remaining vesselness maps, together with the local binary pattern histogram and the green colour channel, are used as features to learn a strong classifier. Then, without leveraging any ground truth, multiple kernel boosting combines four support vector machine kernels to learn a strong vessel model for each image. Applying this learnt model to the pixels with mediocre values in a single vesselness map determines their labels. In total, five strong vessel models are learnt. Finally, pixels with majority support from the strong vessel models are labelled as vessel pixels. The proposed method achieves an accuracy of 94.83%, sensitivity of 72.59% and specificity of 98.11% on the DRIVE dataset, and an accuracy of 95.51%, sensitivity of 78.09% and specificity of 97.56% on STARE. It outperforms the state-of-the-art unsupervised methods and achieves performance comparable to supervised methods.


INTRODUCTION
Retinal fundus images provide a window to inspect the fundus of the eye, and they are widely used for the diagnosis of various pathologies, such as age-related macular degeneration, diabetic retinopathy and glaucoma. Manual analysis of retinal images is time consuming and tedious for ophthalmologists. Moreover, it is impossible to accurately quantify the fine structures in the fundus manually. Therefore, automating the analysis is highly important. As one of the basic procedures in automatic analysis, vessel segmentation is still a challenging task. On one hand, vessel width varies widely; for example, vessels at the end of each branch are often only a few pixels, or even a single pixel, wide in the images. On the other hand, the vessel segmentation task is adversely affected by other structures with similar linearity and connectivity.
Unsupervised Methods: Mendonca and Campilho [2] use a top-hat transform with circular structuring elements to enhance the vessels first, and then reconstruction operators to segment them. In [3], a revised top-bottom-hat transformation is proposed for segmentation. Dash and Bhoi [4] generate a vessel-enhanced image and then use adaptive thresholding to extract the vessels. Inspired by the shape and profile priors of the vessels, matched filters have been proposed. For example, assuming that the vessels are bar-shaped structures, Azzopardi et al. use the trainable bar-selective combination of shifted filter responses (B-COSFIRE) to measure the vesselness [5]. Nguyen et al. assume that the vessels are line-like structures and propose to use multi-scale line filter responses to measure the vesselness [6]. Inspired by [6], improved line detectors are proposed for vessel segmentation in [7,8]. Based on the observation that the profile of the vessels is Gaussian-shaped and symmetric with respect to the peak position, Zhang et al. propose to use the responses of the first-order derivative of Gaussian to measure the vesselness [9]. Gou et al. also propose to use the responses of multi-scale Gaussian filtering to measure the vesselness [10], while Shukla et al. propose to use the responses of fractional filtering [11].
Supervised Methods: The supervised methods formulate vessel segmentation as a classification problem, and they use the ground truth to learn strategies for aggregating various features for the classification. For example, Ricci and Perfetti [12] learn a support vector machine (SVM) to detect vessel points from line filter responses. Lupascu et al. [13] learn an AdaBoost classifier which takes multi-scale local intensity structure, spatial properties and geometry features as input, while Fraz et al. [14] learn a LogitBoost classifier. In [15], grey-level and moment-invariant-based features are extracted to learn a neural network scheme for vessel segmentation. In [16], information theory and machine learning techniques are used to automatically select a subset of B-COSFIRE filters to further improve the performance of [5]. Li et al. [17] train a deep neural network to learn a cross-modality data transform from retinal image to vessel map. Fu et al. [18] generate a vessel probability map with a fully convolutional neural network, and then refine the probability map with a fully connected conditional random field. Thereafter, deep learning technologies are used to segment the vessels in [19][20][21][22][23]. Supervised methods are more robust to vessel deformations and brightness variation since they combine different priors about the vessels. However, on one hand, such methods require ground truth: labelling pixel-level ground truth requires professional skills and is tedious and time consuming. Moreover, deep learning-based vessel segmentation methods require Graphics Processing Units (GPUs) to train the segmentation model. On the other hand, supervised methods are dataset-dependent: their performance usually drops sharply when the test images come from a different dataset.
To learn a feature aggregation strategy without relying on the ground truth, we propose a ground-truth-free learning framework to segment the vessels. We observe that vesselness maps, which can be generated by unsupervised methods, already identify some vessel pixels and background pixels, and these can be used to learn a feature aggregation strategy. Specifically, pixels with high vesselness values are highly likely to be vessel pixels, while pixels with small vesselness values are highly likely to be background. We call these pixels simple pixels. For pixels with mediocre vesselness values, it is uncertain whether they are vessel or background pixels; we call these difficult pixels. It is therefore natural to learn the feature aggregation strategy from the simple pixels in order to determine the labels of the difficult pixels.
There are two stages in learning a vessel model from simple pixels. The first stage is vesselness map generation. In this stage, five vesselness maps are generated, of which the first two are proposed in this work: one is the centreline-boundary contrast map, modelling the appearance of the vessels; the other is the difference map of diffusion, which uses the difference between the diffused image and the original one to measure the vesselness. The remaining three are generated using existing unsupervised vessel segmentation methods, that is, the response map of the B-COSFIRE filter [5], the line detector response map [6] and the green channel without the local mean. These vesselness maps are used to generate the simple samples for learning the strong vessel model. The second stage is learning the strong vessel model from the simple samples. For each vesselness map of the retinal image, we use the simple pixels as the training set. For each sample, three feature descriptors are exploited to characterise it: the remaining vesselness values, the histogram of the local binary pattern (LBP) [24,25] and the green colour channel. Then we learn a strong vessel model based on multiple kernel boosting (MKB) [26] to classify the difficult pixels. To exploit rich feature representations, we use four kernels: the linear function, the polynomial function, the radial basis function (RBF) and the sigmoid function. Finally, the strong vessel model is utilised to determine the labels of the difficult pixels. In total, five strong vessel models are learnt for each retinal image, and pixels that win three or more votes are finally labelled as vessel pixels.
Overall, the main contributions of this paper are twofold: (1) we propose two vesselness maps, the centreline-boundary contrast map and the difference map of diffusion, which provide training samples for the strong vessel model; (2) we propose a vessel segmentation framework which learns a strong vessel model from the simple pixels of each retinal image, so that for each image the strong model is customised and ground truth free.
Preliminary work has been published in a MICCAI workshop [27]. We extend our previous work [27] with the following improvements: (1) we use the LBP histogram to describe the texture information; (2) to cope with the various types of features, we propose to use MKB [26] to learn the feature aggregation strategy, whereas the conference version [27] simply aggregates the vesselness maps. Compared to unsupervised methods, our method achieves better performance; compared to supervised methods, without using any ground truth information, it achieves comparable performance.
In the remainder of this paper, we first introduce the proposed method in Section 2. Then we report our experimental results in Section 3. We discuss our method and compare it with the state of the art in Section 4. Finally, we conclude the paper in Section 5.

METHODOLOGY
The proposed unsupervised method obtains training samples from the vesselness maps and learns strong vessel models from these samples. It includes two stages: the first is the generation of the vesselness maps and the second is the learning of the strong vessel models. The final result is obtained by voting.
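The voting in the final step can be sketched in a few lines of NumPy. The five binary label maps and the three-of-five vote threshold follow the paper; the helper name is ours:

```python
import numpy as np

def majority_vote(label_maps):
    """Combine K binary label maps (one per strong vessel model) by
    majority vote: a pixel is labelled as vessel when more than half
    of the models agree, i.e. three or more votes out of five."""
    stack = np.stack(label_maps, axis=0)       # shape (K, H, W)
    votes = stack.sum(axis=0)                  # per-pixel vote count
    return votes >= stack.shape[0] // 2 + 1    # majority threshold

# Toy example: five 2x2 label maps.
maps = [np.array([[1, 0], [1, 1]]),
        np.array([[1, 0], [0, 1]]),
        np.array([[1, 1], [0, 0]]),
        np.array([[0, 0], [1, 1]]),
        np.array([[1, 0], [0, 1]])]
seg = majority_vote(maps)                      # vessel where >= 3 votes
```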

Vesselness maps generation
We extract vesselness maps from the green channel of the retinal images since the contrast between vessels and background in the green channel is higher than that in the red and blue channels [1,5,12,28]. To take advantage of various vesselness maps, five maps are extracted. The first is the multi-scale centreline-boundary contrast map (f_C), which takes the appearance and the linearity into consideration. The second is the difference map of diffusion (f_D), which considers the appearance difference between the vessels and the background from the viewpoint of a diffusion system. The third is the response of the B-COSFIRE filters (f_B) [5], which assumes that the vessels are bar-shaped, while the fourth is the response of the multi-scale line filters (f_L) [6], which assumes that the vessels are line-like. The last is the green channel without the local mean (f_M), which models the appearance of the vessels. In the following, we describe the first two maps in detail since f_B and f_L have been described in [5] and [6], respectively, and the last one is easy to obtain.

Multi-scale centreline-boundary contrast map
According to the properties that the intensities of the vessels are lower than those of the background pixels and that the vessels are locally linear, we present a centreline-boundary contrast filter with a line structure to enhance the vessels. Figure 1(a) shows a basic centreline-boundary contrast filter of size 5 × 7. Generally, we define a basic centreline-boundary contrast filter of size (2r_1 + 1) × (2r_2 + 1) at direction 0°, where x is the row index ranging from 1 to (2r_1 + 1) and y is the column index ranging from 1 to (2r_2 + 1); (2r_1 + 1) is the height of the filter and (2r_2 + 1) its width. Vessels stretch over the whole retinal image and converge into the optic disc from different directions. Moreover, the width of the vessels is variable: usually, trunk vessels are wider than their branches. To handle vessels in different directions, following [6,12], we rotate G(⋅; r_1, r_2, 0°) every 15° from 0° to 180°, generating 12 filters. To enhance vessels of variable width, the common strategy is multi-scale filtering. For example, Nguyen et al. [6] adopt eight scales from 1 to 15 with a step of 2 for the line detectors. Since we compute the contrast between the vessels and the background, we adopt larger scales ranging from 3 to 19; that is, we generalise the centreline-boundary contrast filter by varying the half height r_1 and half width r_2 from 1 to 9 with step 1. Correspondingly, given an image I, the centreline-boundary contrast vesselness map f_C is computed by convolving I with the filter bank (∗ denoting the convolution operator) and aggregating the responses over all scales and orientations. Thus, vessel-like pixels are enhanced due to their high contrast to the background, while background pixels are suppressed for their smoothness.
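As a rough illustration of the idea (the paper's exact filter weights are given by its definition of G, which is not reproduced here), the sketch below builds a 0°-direction contrast kernel whose response is the mean of the two boundary rows minus the mean of the centreline row, and takes the maximum response over a few square scales. Rotated versions every 15° would be added in the full filter bank:

```python
import numpy as np

def cb_filter(r1, r2):
    """A plausible centreline-boundary contrast kernel at direction 0
    degrees (an assumption, not the paper's exact weights): a dark
    horizontal vessel on a bright background yields a large response."""
    h, w = 2 * r1 + 1, 2 * r2 + 1
    k = np.zeros((h, w))
    k[0, :] = k[-1, :] = 1.0 / (2 * w)   # two boundary rows, mean weight
    k[r1, :] = -1.0 / w                  # centreline row, negative mean
    return k

def filter_response(img, kernel):
    """Naive 'same'-size correlation with zero padding (NumPy only)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# Synthetic green channel: bright background, one dark horizontal vessel.
img = np.full((21, 21), 0.8)
img[10, :] = 0.2                         # one-pixel-wide dark vessel
resp = np.maximum.reduce([filter_response(img, cb_filter(r, r))
                          for r in range(1, 4)])   # small multi-scale max
```

On this toy image the response is large on the vessel centreline and near zero on the homogeneous background, which is exactly the enhancement behaviour described above.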

Difference map of diffusion
Regarding the image as a partial differential equation (PDE) based diffusion system, the intensity value can be interpreted as a concentration at each location [29]. Thus, when the PDE is applied to the retinal image, intensities diffuse from high values to low values. In retinal images, the intensities of vessel pixels are lower than those of background pixels, so the intensities of the vessel pixels increase after diffusion. Moreover, the background is almost homogeneous, so the intensities of background pixels change only slightly. Therefore, we propose to use the difference map between the diffused image and the original image to measure the vesselness.
Mathematically, we formulate the image diffusion as an evolutionary PDE controlled by a function g(x, y, t):

∂I(x, y, t)/∂t = DIV( g(x, y, t) ∇I(x, y, t) ),

where DIV is the divergence operator, g(x, y, t) is a non-linear diffusion function of the image gradient and t, and ∇ = [∂/∂x, ∂/∂y]^T is the spatial gradient operator. We adopt the tri-diagonal matrix algorithm to solve the diffusion equation [30]. Then the difference map between the diffused image and the original image is obtained by

f_D = I_t − I,

where I_t is the diffused image at time t. Usually, a large t yields a difference map with a clean background but eliminates the small vessels. In contrast, a small t yields a difference map with abundant vessel detail as well as noise. To balance low noise against vessel detail, we set t = 10. Figure 1(d) shows an example of the difference map of diffusion.
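The following is a minimal explicit-scheme sketch of this idea. The paper solves the PDE implicitly with a tri-diagonal solver; the diffusivity g used here (a Perona-Malik-style exponential with a hypothetical conductance parameter kappa) is our assumption, not the paper's exact choice:

```python
import numpy as np

def diffuse(img, t=10, dt=0.2, kappa=1.0):
    """Explicit non-linear diffusion: each step moves intensity toward
    the four neighbours, weighted by g (which decays with the local
    gradient), with zero-flux boundaries.  Dark vessels are filled in
    while the homogeneous background barely changes."""
    u = img.astype(float).copy()
    g = lambda d: np.exp(-(d / kappa) ** 2)   # assumed diffusivity
    for _ in range(int(t / dt)):
        dn = np.roll(u, -1, 0) - u; dn[-1] = 0    # differences toward
        ds = np.roll(u, 1, 0) - u;  ds[0] = 0     # the four neighbours,
        de = np.roll(u, -1, 1) - u; de[:, -1] = 0 # zeroed at the border
        dw = np.roll(u, 1, 1) - u;  dw[:, 0] = 0
        u += dt * (g(dn) * dn + g(ds) * ds + g(de) * de + g(dw) * dw)
    return u

# Toy image: bright background with one dark vessel row.
img = np.full((15, 15), 0.8)
img[7, :] = 0.2
f_D = diffuse(img) - img        # difference map of diffusion
```

As the text above predicts, the difference map is strongly positive on the vessel (its intensity rises after diffusion) and close to zero on the far background.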

Unsupervised vessel segmentation
So far, we have obtained five unsupervised vesselness maps f_C, f_D, f_B, f_L and f_M. We normalise each vesselness map into [0, 1], so that the vesselness value indicates the probability of a pixel being a vessel pixel. In each vesselness map, pixels with extremely large vesselness values are highly likely to be vessel pixels; similarly, pixels with extremely small vesselness values are highly likely to be background. These simple pixels can be directly labelled as vessel or background. However, the labels of pixels with mediocre vesselness values are ambiguous according to a single vesselness map alone. To determine the labels of these difficult pixels, we learn a strong classifier on simple pixels sampled from the single vesselness map. Taking f_C as an example, Figure 2 illustrates the flow chart of our method using the simple pixels to determine the labels of the difficult pixels in f_C. In detail, for each vesselness map f ∈ {f_C, f_D, f_B, f_L, f_M}, we first decompose it into two sets: the vessel candidate set S_C and the background set S_B. S_C collects the top T_1 of pixels and S_B collects the rest. Pixels in S_B are directly regarded as background, while the labels of pixels in S_C are given by a strong classifier. To learn the strong classifier, we extract training samples from f: the top T_2 of pixels in f are taken as positive samples, denoted S_+, and pixels near S_C are regarded as negative samples, denoted S_−. S_+ and S_− constitute the training set S = S_+ ∪ S_−. For each sample in the training set, we extract three types of features. The first is the set of remaining vesselness maps {f_C, f_D, f_B, f_L, f_M} \ {f}. Usually, vessel pixels close to the vessel centreline have smaller green-channel values than vessel pixels close to the vessel wall, which in turn have smaller values than background pixels.
To describe such a structure, we compute the LBP histogram [24,25] of the patch centred on the sample in the green channel as the second feature. Due to the high contrast between vessels and background in the green channel, similar to [28], we also use the green channel value as the third feature. To learn an SVM for the retinal image directly, an appropriate kernel must be determined beforehand; however, it is not clear how these vesselness values, texture information and colour information are best combined. To handle this problem, we propose to use multiple kernel boosting [26,31] to combine several SVMs with different kernels.
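The decomposition into candidate, simple-positive and background sets can be sketched with percentile thresholds. Here t1 and t2 correspond to the paper's T_1 and T_2 (14% and 7% on DRIVE); the function name is ours, and the selection of negatives near S_C is omitted for brevity:

```python
import numpy as np

def partition_pixels(vmap, t1=0.14, t2=0.07):
    """Split one normalised vesselness map into the candidate set S_C
    (top t1 fraction, labels to be decided by the strong classifier),
    the simple positives S_plus (top t2 fraction) and the background
    set S_B (everything outside S_C)."""
    v = vmap.ravel()
    thr1 = np.quantile(v, 1 - t1)   # candidate threshold
    thr2 = np.quantile(v, 1 - t2)   # simple-positive threshold
    s_c = vmap >= thr1
    s_plus = vmap >= thr2
    s_b = ~s_c
    return s_c, s_plus, s_b

rng = np.random.default_rng(0)
vmap = rng.random((100, 100))       # stand-in for a normalised map
s_c, s_plus, s_b = partition_pixels(vmap)
```

By construction S_plus is a subset of S_C, matching the text: every simple positive is also a candidate pixel.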
Formally, for a sample s, the combined kernel is a linear combination of L kernels:

k(s, s_i) = Σ_{l=1}^{L} β_l k_l(s, s_i),

where s_i ∈ S and β_l is the l-th kernel weight. The decision function is then defined as

F(s) = Σ_{i=1}^{N} α_i y_i k(s, s_i) + b,

where N is the number of samples in S, {α_i} are the Lagrange multipliers and b is the bias of the standard SVM, and y_i = 1 if s_i is from S_+ and y_i = −1 otherwise. Let h_l denote the response of a single SVM:

h_l(s) = Σ_{i=1}^{N} α_i y_i k_l(s, s_i) + b.

Treating each single SVM h_l(s) as a weak classifier and the strong vessel model as a linear combination of the SVMs, it is natural to formulate the problem in the AdaBoost form

F(s) = Σ_{j=1}^{J} γ_j h_j(s),

where J is the number of boosting iterations. Following [26], we first separately train each single SVM classifier h_l(s) to construct an SVM pool. In our case, we use four kernel functions, that is, the linear function, the polynomial function, the RBF and the sigmoid function, and extract three kinds of features for each sample; the SVM pool thereby consists of 12 SVMs. Then we iteratively select the SVM with the smallest weighted classification error ε_j into the decision function, where

ε_j = Σ_{i=1}^{N} w_j(i) sgn(−y_i h_j(s_i)),

and sgn(⋅) is a function that equals 1 when its input is larger than 0 and outputs 0 otherwise. Here w_j(i) is the weight of sample s_i at iteration j; when j = 1, w_j(i) = 1/N, and thereafter the weights are updated following the standard AdaBoost scheme. For each vesselness map, we learn a strong vessel model, so in total we learn five strong vessel models. We apply these strong models to the vessel candidate set, and label the pixels winning the majority of the votes as vessel pixels, obtaining the segmentation.
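The boosting selection step above can be sketched as follows. The pool of pre-trained single-kernel SVMs is abstracted here to a matrix of ±1 predictions (any weak learner would do), and the weight update follows the standard AdaBoost scheme, which we assume matches the paper's formulation:

```python
import numpy as np

def boost_pool(H, y, J=5):
    """AdaBoost-style selection from a pool of pre-trained weak
    classifiers, mirroring the MKB step: H[l, i] is the +/-1 prediction
    of the l-th classifier (a single-kernel SVM in the paper) on sample
    i.  Each iteration picks the classifier with the smallest weighted
    error eps and adds it with weight 0.5 * ln((1 - eps) / eps)."""
    L, N = H.shape
    w = np.full(N, 1.0 / N)          # initial sample weights w_1(i) = 1/N
    chosen, gammas = [], []
    for _ in range(J):
        errs = np.array([w[H[l] != y].sum() for l in range(L)])
        j = int(np.argmin(errs))     # weak learner with smallest error
        eps = max(errs[j], 1e-12)
        if eps >= 0.5:               # no better than chance: stop
            break
        gamma = 0.5 * np.log((1 - eps) / eps)
        chosen.append(j); gammas.append(gamma)
        w *= np.exp(-gamma * y * H[j])   # up-weight mistakes
        w /= w.sum()
    return chosen, gammas

# Toy pool: three weak classifiers on six labelled samples.
y = np.array([1, 1, 1, -1, -1, -1])
H = np.array([[1, 1, 1, -1, -1, 1],      # one mistake (sample 5)
              [1, 1, -1, -1, 1, -1],     # two mistakes
              [1, -1, 1, -1, -1, -1]])   # one mistake (sample 1)
chosen, gammas = boost_pool(H, y, J=2)
```

After the first round, the sample misclassified by the chosen learner is up-weighted, so the second round prefers a learner that gets that sample right.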
As there are small isolated holes and noise in the segmentation, we fill the holes whose areas are less than 20 pixels. Meanwhile, for each connected component in the segmentation, we compute the equivalent diameter R_E of the circle with the same area and the major-axis length R_M of the ellipse that has the same normalised second central moments as the component. The larger the ratio R_E/R_M, the more circle-like the connected component. Thus, to remove noise that looks like circles, we remove the small connected components satisfying R_E/R_M ≥ T_3, where T_3 is a threshold.
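The circle-likeness ratio can be computed from a component's area and second central moments. The moment-based major-axis formula below follows the common regionprops convention and is our reading of the paper's R_M:

```python
import numpy as np

def circle_likeness(mask):
    """Ratio R_E / R_M for one binary connected component: R_E is the
    diameter of the circle with the same area, R_M the major-axis
    length of the ellipse with the same second central moments.
    Values near 1 indicate circle-like blobs; elongated, vessel-like
    components score much lower."""
    ys, xs = np.nonzero(mask)
    area = len(xs)
    r_e = 2.0 * np.sqrt(area / np.pi)                # equivalent diameter
    cov = np.cov(np.vstack([xs, ys]).astype(float))  # second central moments
    lam_max = np.linalg.eigvalsh(cov)[-1]
    r_m = 4.0 * np.sqrt(lam_max)                     # major-axis length
    return r_e / r_m

# A round blob scores near 1; a thin line scores far lower.
yy, xx = np.mgrid[:41, :41]
disk = (yy - 20) ** 2 + (xx - 20) ** 2 <= 100        # radius-10 disk
line = np.zeros((41, 41), bool)
line[20, 5:36] = True                                # 31-pixel line
```

With the paper's threshold T_3 = 0.4, the ratio cleanly separates the round blob from the thin, vessel-like component in this toy example.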

EXPERIMENTAL RESULTS
We use two widely used public datasets to evaluate the proposed method. One is DRIVE [32], which includes 20 retinal images for training and 20 for testing. These images were captured by a Canon CR5 non-mydriatic 3CCD camera at a 45° field of view. For each image in the testing set, two manual segmentations by two observers and a mask image are provided. The first observer marked 577,649 pixels as vessel and 3,960,494 pixels as background (12.7% vessel), while the second observer marked 556,532 pixels as vessel and 3,918,611 pixels as background (12.3%). The other is STARE [33], which includes 20 images and two corresponding manual segmentations by two different observers. In STARE [33], 10 images contain signs of pathologies and 10 are healthy retinal images. These images were captured by a TopCon TRV-50 fundus camera at a 35° field of view. The first observer labelled 10.4% of pixels as vessel while the second observer labelled 14.9%. Following [6,32], we use the segmentations of the first observer as ground truth for evaluation and the performance of the second observer as a benchmark for comparison.
In our experiments, since vessel pixels account for 12.7% of pixels in DRIVE and 10.4% in STARE, to ensure that almost all vessel pixels are included in the candidate set, we set T_1 to 14% for DRIVE [32] and 13% for STARE [33]. To ensure that almost all pixels in the simple positive set are vessel pixels, we choose the top 7% of pixels as simple positives, that is, T_2 = 7% for both datasets. We vary T_3 from 0.1 to 0.8 with a step of 0.1 on the training set of DRIVE [32] and find that the performance is best when T_3 = 0.4, so we set T_3 = 0.4 on both datasets.
Following the previous methods [5,14,17], we adopt accuracy (Acc), sensitivity (Se) and specificity (Sp) to compare the performance of our method with the state-of-the-art methods. Accuracy measures the proportion of pixels that are correctly labelled, sensitivity the proportion of vessel pixels that are correctly detected, and specificity the proportion of background pixels that are correctly detected. Conventionally, only pixels inside the field of view are taken into consideration, since pixels outside the field of view can easily be labelled as background via thresholding. The performance for each image in the two datasets is reported in Table 1; the last row of Table 1 gives the average sensitivity, specificity and accuracy.
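These three measures follow directly from the confusion counts inside the field of view; a minimal sketch:

```python
import numpy as np

def se_sp_acc(pred, gt, fov):
    """Sensitivity, specificity and accuracy restricted to the field of
    view, matching the evaluation protocol described above."""
    p, g = pred[fov].astype(bool), gt[fov].astype(bool)
    tp = np.sum(p & g)            # vessel pixels correctly detected
    tn = np.sum(~p & ~g)          # background correctly detected
    fp = np.sum(p & ~g)
    fn = np.sum(~p & g)
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return se, sp, acc

# Toy example on a four-pixel field of view.
gt = np.array([1, 1, 0, 0], bool)
pred = np.array([1, 0, 0, 0], bool)
fov = np.ones(4, bool)
se, sp, acc = se_sp_acc(pred, gt, fov)   # 0.5, 1.0, 0.75
```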

DISCUSSION
In this section, we first discuss the effectiveness of each component of the proposed method. Then, to show that the proposed method reaches state-of-the-art performance, we compare it with state-of-the-art methods.

Component analysis
To demonstrate the effectiveness of the proposed unsupervised learning framework, we first compare our method with each individual vesselness map. As shown in Table 2, when we use an RBF SVM to learn the classifier [27], the specificity on DRIVE [32] is 97.07%, while the specificity of the best single component, that is, [5], is 97.04%. The accuracy of our method is superior to that of each vesselness map. When we combine the vesselness maps with the LBP histogram [24,25] and green colour features, the specificity is further improved; although the sensitivity decreases, the final accuracy on DRIVE still improves. Similarly, on STARE [33], our method with an RBF SVM achieves better accuracy than any single vesselness map, and adopting MKB to combine the features improves the accuracy further. It is worth noting that Table 2 also demonstrates the effectiveness of the two proposed vesselness maps f_C and f_D. Figure 3 shows two examples illustrating the effectiveness of the MKB-based SVM: when the LBP histogram [24,25] and green channel are combined using the MKB-based SVM, fewer background pixels with high contrast to their surroundings are wrongly segmented as vessels. We also validate the effectiveness of the LBP [24,25] and green channel features and of the post-processing. Table 3 shows that the accuracies on both datasets after combining with LBP [24,25] are slightly improved compared to aggregating only the vesselness maps and green channel. After filling the holes and removing the small circle-like structures, the accuracies are further improved.

Comparisons with the state-of-the-arts
We compare our method with seven supervised methods and nine unsupervised methods. Among the supervised methods, Fu et al. [18] and DUNet [21] are deep learning based. The results of the compared methods are taken from their original papers, and all are compared under the same experimental protocol. The results of our method and the compared methods on DRIVE [32] and STARE [33] are reported in Tables 4 and 5, respectively. Generally, our method outperforms the first five unsupervised methods in sensitivity, specificity and accuracy on both datasets, and achieves performance comparable to the last four unsupervised methods. This indicates that aggregating multiple vesselness maps by learning from simple pixels can further advance segmentation performance. Compared to the supervised methods, the proposed method still achieves better sensitivity. Although the performances of [17] and [34] are superior to ours on DRIVE [32] when their models are trained on the DRIVE [32] training images, their accuracies decrease to 0.9486 and 0.9428, respectively, when trained on STARE [33], while ours is 0.9483. Similarly, [17] and [34] achieve better accuracy on STARE [33] when trained on STARE [33], but their accuracies decrease to 0.9545 and 0.9413 when trained on DRIVE, while ours is 0.9551. This performance drop stems from a common assumption: supervised methods assume that testing images and training images follow the same distribution, and once a testing image deviates from the training distribution, they tend to make wrong decisions. It is worth mentioning that our method achieves performance similar to the deep learning-based method [18] on both DRIVE and STARE. A possible reason is that the spatial scale of vessels is extremely small, so preserving spatial resolution during feature extraction contributes to segmentation performance.
However, the standard convolutional neural network used by Fu et al. [18] for feature extraction cannot preserve the resolution of high-level feature maps due to the max-pooling layers with downsampling. As for the deep learning-based method DUNet [21], it clearly achieves higher sensitivity and accuracy than ours, and similar specificity, on DRIVE when trained on DRIVE. However, when trained on STARE, the performance of DUNet [21] drops sharply and is inferior to ours in sensitivity and accuracy, indicating that its performance is dataset-dependent. Similarly, DUNet [21] is superior to ours in specificity and accuracy on STARE when trained on STARE, but its performance drops sharply when trained on DRIVE and is consistently inferior to ours in sensitivity, specificity and accuracy. Additionally, the training of DUNet [21] requires GPUs. As shown in Tables 4 and 5, our method achieves superior average accuracies on both datasets.
We make visual comparisons between [5], [6] and our method. Figure 4 shows two examples from DRIVE, in which pixels in yellow are true positives, pixels in red are false negatives, and pixels in green are false positives. From the first example, we can see that B-COSFIRE [5] and the multi-scale line detector [6] fail to segment the vessel pixels in the white ellipses due to their low contrast to the background. From the second example, we can see that B-COSFIRE [5] and the multi-scale line detector [6] are confused by the noisy background and the optic disc. On the contrary, our method works well. A possible reason is that our method aggregates multiple vesselness maps by learning, making decisions on difficult pixels more accurately. Figure 5 shows two examples from STARE. Similarly, our method works better than B-COSFIRE [5] and the multi-scale line detector [6] when the background is cluttered with lesions.
Although our method is able to aggregate the vesselness maps to achieve advanced performance, it still fails to segment tiny vessels, as shown in Figure 6(b). A possible reason is that the features we exploit are not discriminative enough to distinguish tiny vessels from the background. Additionally, since red lesions have appearances similar to vessels, our method tends to wrongly treat them as vessel pixels, as shown in Figure 6(d).

CONCLUSION
We present two vesselness maps and a new framework for vessel segmentation in retinal images. Our framework learns strong vessel models from the simple pixels, whose labels can be determined easily, to decide the labels of the difficult pixels that are ambiguous in any single feature map. To combine various types of features, we propose to use MKB to choose the kernels automatically. For each retinal image, the strong vessel models are customised, using only information from that image. Our method shows its advantages in two aspects: (1) compared to most other unsupervised methods, it avoids sensitive threshold selection, and it achieves better performance on the two public datasets DRIVE [32] and STARE [33] than existing unsupervised methods; (2) compared to supervised methods, it does not rely on expensive ground truth while achieving competitive performance.