Segmentation of natural images based on super pixel and graph merging

Sanjoy K. Saha, Department of Computer Science and Engineering, Jadavpur University, Kolkata, India. Email: sks_ju@yahoo.co.in

Abstract: The segmentation of natural images is one of the most researched topics in computer vision. There are two principal approaches to the task: the statistical (unsupervised) approach and the supervised approach. The proposed methodology segments natural images by combining a set of statistical algorithms. First, the image is preprocessed to enhance the edges; the preprocessed output is a weighted average of the denoised image and its derivatives. Thereafter, an energy-based super pixelation is applied to over-segment the image. Finally, a connectivity graph is built in which nodes correspond to super pixels and edges connect adjacent super pixels. Adjacent super pixels are merged based on a confidence value defined in terms of their textural and colour similarity. The proposed methodology has been applied to the images of the BSDS500 dataset, and its performance has been compared with that of other works based on detected edge maps. A few works generate ultrametric contour maps (UCM); to compare with those works, a UCM is also generated by the proposed methodology by considering images at multiple scales. It is observed that the segmentation output of the proposed methodology is better, and the methodology is much faster than the others, making it suitable for real-time application in robot vision.


| INTRODUCTION
Segmentation of natural images is one of the most challenging problems of computer vision and still an active area of research. Semantic segmentation is quite subjective in nature. Furthermore, factors like occlusion, illumination, texture and variation in orientation, context and so on make the task even more critical. Devising a solution that works for a variety of images and can handle all these issues is still in demand. Though deep networks provide good results for a set of images, they suffer from huge training data requirements, susceptibility to pixel-level noise [1] and uncertainty of operations. However, Bayesian deep learning is directed towards such issues [2,3]. Statistical algorithms, on the other hand, are free from training, and the physical significance of their parameters is easily interpreted. Thus, the limitations of the supervised approach and the flexibility of the statistical approach inspired us to choose the latter.
The present work is motivated by the need for a lightweight system to segment natural scenes for possible usage in robotic applications like semantic simultaneous localization and mapping (SLAM). In most cases, semantically similar regions in a natural image are coherent in terms of statistical features. Our objective is to exploit this aspect in developing a set of algorithms to segment the scene. Thereafter, individual segments can be used for semantic labelling. Though deep networks can perform the task, they are quite power-consuming and thus not suitable for robots with limited resources. The proposed methodology first uses a fast super pixel segmentation to over-segment the image and then groups the statistically similar super pixels to form segments. For such grouping, a graph-merging-based methodology has been adopted. The rest of the paper is organized as follows: Section 2 provides a survey of past efforts. The proposed methodology is described in Section 3. Experimental results and concluding remarks are presented in Sections 4 and 5, respectively.

| PAST WORK
Past efforts can be broadly categorized into supervised learning based approaches and unsupervised image statistics based approaches. A brief survey of both approaches is outlined as follows.

| Supervised learning based approaches
An early work by Martin et al. [4] is based on a supervised approach. It detects contextual boundaries using local patches and certain global information, and finally builds the segmented map. This is achieved by training a probabilistic classifier that puts fuzzy markings on boundary pixels by evaluating their context, defined by the surrounding patch. A two-stage approach was proposed by Dong et al. [5]. At the first stage, colour quantization is achieved through clustering in LUV space. Subsequently, classification on the reduced colour space is done using a neural network [5]. In recent times, the use of deep convolutional neural networks (DCNN) has become the leading trend. To generate the labelled map of pixels, a deep network either works on the whole image or works with patches and classifies pixels according to their neighbourhood patches. Bell et al. [6] segmented the image by labelling each pixel according to its physical material, predicted from its surrounding patch using a DCNN. Instead of using pixel-level annotation, Papandreou et al. [7] considered bounding boxes of the objects as the input for training a weakly supervised CNN with expectation maximization.
Deep learning is also utilized for semantic contour detection [8]. Chen et al. [9,10] employed a DCNN, atrous spatial pyramid pooling and conditional random fields, in that order, to segment natural images at the pixel level. The deconvolution network (deconvnet), which takes the whole image as input rather than pixel-neighbourhood patches and generates the complete segmented map in one shot, is becoming popular. It uses an encoder-decoder type architecture for segmentation. Works by Noh et al. [11] and Badrinarayanan et al. [12] are among the pioneers of this approach. Chen et al. [13] proposed semantic image segmentation using multiscaled input to an attention network for refining the pixel labels. Lin et al. [14] proposed the idea of feeding multiscaled input to parallel networks that are further tuned with multipath refinement. Loss of information due to pooling is a major concern in deep networks, which is handled by fusing multiple parallel networks trained with different variants of the loss function. Though pixel-level segmentation worked well with deep networks, most of the errors were due to incorrect segment boundaries. Atrous spatial pyramid pooling coupled with a deconvolution network was proposed by Chen et al. [15] to handle the fuzziness of categorization at object boundaries. In recent times, Xian et al. [16] proposed pixel-level classification by a novel deep net architecture known as the semantic projection network. It can be trained with very few instances using transfer learning. Wu et al. [17] developed the idea of unseen-category classification by generating annotated synthetic datasets based on trained data and further adapting the network to them.
Supervised approaches work on the idea of training a classifier that labels pixels or patches based on the features learnt during training. With a considerable amount of training and a complex hierarchy of classifiers, good results are achievable, but designing or training a universal model is very difficult. Though recent works are trying to train models with few training instances, their performance deteriorates with increasing intraclass feature diversity.

| Unsupervised image statistics based approaches
The unsupervised statistics based approach mainly focuses on dividing the image into collections of pixels based on continuity in terms of properties like colour, texture and so on. It is, generally, followed by merging or splitting of those collections. Different works have followed different criteria for the split/merge process. Variants of graph cut have also been used. This approach intends to segment semantically without labelling the segments or pixels. However, a further recognition stage has to be considered for categorizing or labelling the segments.
The concept of normalized cut, which uses an intragroup similarity and intergroup dissimilarity metric, was first proposed by Shi and Malik [18]. It formed the basis of many graph-based segmentation methods. Felzenszwalb et al. [19] used a pairwise region element comparison for gradually segmenting the fully connected spatial graph and merging regions iteratively. Normalized cut coupled with mean shift segmentation was proposed by Tao et al. [20] to reduce the complexity of graph cut. Zhang et al. [21] proposed a hybrid scheme combining aspects of both the statistical and supervised approaches: a weakly supervised learning scheme was considered that utilized a Gaussian mixture model (GMM) on structural features of super pixels, and graphlet-cut (a variant of graph cut) was proposed to segment natural images [21].
Careful design of feature extractor is the main aspect of statistical approach. A variety of features have been proposed. Spectral signature based approach for patch categorisation and subsequent fuzzy segmentation [22], interdistance discrimination between Laplacian of marginal histogram of linear filter output [23], integrating level set with GMM [24] are some of those. Rao et al. [25] used GMM along with adaptive chain coding for detecting edges of homogeneous regions.
Apart from graphs, tree-based approaches are also quite popular. An approach for detecting contours by combining spectral clustering with a hierarchical region tree was proposed by Arbelaez et al. [26]. Tree partitioning over initial region descriptor features was suggested by Panagiotakis et al. [27], further supplemented with Bayesian flooding and region merging. A simpler version using bottom-up hierarchical segmentation was proposed by Alpert et al. [28]; the work is based on a probabilistic classifier that takes geometric, spectral and textural cues from the neighbourhood patch of a pixel to label it. It is observed that graph-cut approaches mainly use correlation clustering, in which case the variation in local boundary ambiguity prevents merging or splitting of sections. This problem can be handled by higher-order correlation clustering, as suggested by Kim et al. [29].
K-means is used widely for initial stages in the segmentation process. Zheng et al. [30] developed hierarchical generalized K-means to improve on the results. Recently, region merging by multiscale combinatorial grouping augmented with normalized cut was proposed by Pont-Tuset et al. [31] that produces considerably good results on natural images. Considering multiple hypothesis based on different distance metrics is a recent trend. By simple minimization of cost functions over the outcome of different merging schemes on an oversegmented image achieved new benchmarks [32].
It must be noted that statistical approaches do not label the pixels or regions semantically. Supervised approaches do label, but labelling is mostly restricted to pixels whose classes are in the training set; otherwise, the pixels are ignored as background. As a result, the segmentation also suffers.
The past study indicates that natural image segmentation is still an open area and significant works are still being reported. Most current works follow image-feature-based approaches supplemented by merge/split operations, which motivates us to explore a similar approach. A previous work of ours [33] considered K-means-based segmentation of the image into spatially connected super pixels and further used a graph-cut-based merging process to refine the results. With further experiments and analysis, it was understood that a faster oversegmentation method to obtain the super pixels and a more robust merging technique could be explored to make the process faster and to improve the accuracy. The methodology was developed mostly to cater to the segmentation of scenes used for mapping in SLAM. A complete map is needed for robot vision, rather than a partial map of labelled segments (which is restricted to trained categories). In general, deep learning and supervised approaches tend to segment only the parts of the scene that can be labelled, whereas a statistical approach segments the whole scene based on visual cues. Thus our motivation is fuelled by previous experience, past studies and specific requirements. In this context, the present work focuses on devising a fast methodology to segment natural scenes.

| PROPOSED METHODOLOGY
The proposed methodology consists of multiple stages. The broad steps are preprocessing, super pixel segmentation and neighbourhood graph-cut-based merging. The methodology can also be extended to work on multiscaled versions of an image and thereby generate edges of varying confidence. The preprocessing step intends to meet the contradictory criteria of noise removal and strengthening the edges. Super pixel segmentation over-segments the image. The extracted segments are considered as the nodes in the graph-cut-based merging step that generates the final output. The overall block diagram of the process is shown in Figure 1. Individual steps are detailed in the following sections.

| Preprocessing
Super pixel segmentation forms the core part of the proposed methodology. Its performance is affected by noise that corrupts the edges. Hence, it is important to get rid of such noise without significantly compromising the edges. To achieve this, an edge-preserving smoothing is adopted: a fast nonlocal means denoising technique [34] is applied. It removes the grain noise and simultaneously preserves bold texture features. As super pixel-based segmentation relies on edges, we apply measures to enhance and sharpen the edges further. First- and second-order derivatives of the denoised image are computed. A weighted average of the denoised image and the two derivatives is the final preprocessed version used in super pixel based segmentation. The derivatives help to sharpen the edges that may have been weakened by denoising. On the other hand, to reduce the impact of residual noise (if any), the weights of the derivatives are kept low in the averaging process. In our work, the weights of the denoised image, first derivative and second derivative are in the ratio 4:2:1. Figure 2 shows the preprocessing steps.
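As a concrete illustration, the weighted blend described above can be sketched as follows; the specific derivative operators (gradient magnitude and a discrete Laplacian) are our assumption, as the text does not name the exact kernels:

```python
import numpy as np

def preprocess(denoised, w=(4, 2, 1)):
    """Blend a denoised image with its first- and second-order derivatives
    using the 4:2:1 weighting from the text. The choice of gradient
    magnitude and Laplacian as the derivative operators is an assumption."""
    img = denoised.astype(np.float64)
    # First derivative: gradient magnitude via central differences.
    gy, gx = np.gradient(img)
    d1 = np.hypot(gx, gy)
    # Second derivative: discrete Laplacian (periodic boundary via np.roll).
    lap = (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
           np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4 * img)
    d2 = np.abs(lap)
    w0, w1, w2 = w
    out = (w0 * img + w1 * d1 + w2 * d2) / (w0 + w1 + w2)
    return np.clip(out, 0, 255)
```

Keeping the derivative weights low (2 and 1 against 4 for the image) limits how much residual noise is re-amplified.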

| Super pixel segmentation
The super pixel segmentation stage over-segments an image, and the extracted segments are taken as the nodes for graph-based merging. We applied the SEEDS super pixel method proposed by Van den Bergh et al. [35]. It is faster than other state-of-the-art super pixel algorithms [35], which motivated us to consider SEEDS in our methodology, as we intend to apply segmentation in the context of real-time robotics applications like SLAM. For completeness, we outline the principle of the technique.
The SEEDS super pixel method segments the image into a number of super pixels by maximizing an energy function. The energy function consists of two terms. One term focuses on the colour distribution within the super pixel; thus, maximising the energy function tends to achieve colour homogeneity. The other term, called the boundary term, takes care of the shapes of the super pixel boundaries. To compute the boundary term, a patch around each boundary pixel is considered. Around the edges, the patch covers multiple super pixels. In the process of maximization, it essentially tries to redistribute the edge pixels into neighbouring super pixels. Initially, the image is divided into nonoverlapping blocks of small size, which are taken as the initial super pixels. The process proceeds iteratively, following a hill climbing optimization algorithm, updating the super pixel boundaries and maintaining colour consistency. In comparison to traditional region growing or graph-cut-based approaches, this method utilizes a local optimizer that works very efficiently through simple memory look-ups. The process is extremely fast. Parameters like the initial number of super pixels and the number of iterations are to be specified; the higher the number of super pixels or iterations, the more time is required. The process, in general, over-segments an image. In the proposed methodology, the final step is graph-based merging, and the use of super pixels speeds up this process. Hence, for our purpose, fine tuning of parameters to achieve optimal output is not essential. Moreover, such tuning would vary from image to image. To keep the process simple, we have considered a fixed set of parameters: 200 and 10 for the initial number of super pixels and the number of iterations, respectively. The SEEDS algorithm returns a labelled image. The label of a pixel identifies the corresponding super pixel. The value ranges from 0 to N_sp − 1, where N_sp is the number of super pixels detected by the algorithm.
Figure 3 shows the overall block diagram of super pixel segmentation. In general, the SEEDS algorithm works with the HSV model. However, it is observed that for images with substantial grayness, where R, G and B are almost equal, the outcome suffers. In our work, either the RGB or the HSV model is considered, depending on the colour variance. The RGB image is divided into a number of blocks. For each block, the average R, G and B values are computed. The variances of the block-level averages of the components (σ_r, σ_g, σ_b) are then obtained. The average of σ_r, σ_g and σ_b denotes the overall variance of the image in the RGB model. For the HSV model, the variance is computed in a similar way. The model that reflects more variance is taken as the input. Computation of variance at the block level makes the process faster. In our work, an image is divided into 25 blocks in a 5×5 fashion, irrespective of aspect ratio.
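The block-variance-based choice of colour model might be sketched as follows; converting only the block means to HSV (rather than converting the whole image before averaging) is a simplification of ours:

```python
import colorsys
import numpy as np

def pick_colour_model(img, grid=5):
    """Choose RGB or HSV for the SEEDS input by comparing block-level
    variance, as described in the text. img: float array in [0, 1] of
    shape (H, W, 3). Assumption: HSV block averages are approximated by
    converting the RGB block means."""
    h, w, _ = img.shape
    ys = np.linspace(0, h, grid + 1, dtype=int)
    xs = np.linspace(0, w, grid + 1, dtype=int)
    rgb_means, hsv_means = [], []
    for i in range(grid):
        for j in range(grid):
            block = img[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            m = block.reshape(-1, 3).mean(axis=0)
            rgb_means.append(m)
            hsv_means.append(colorsys.rgb_to_hsv(*m))
    # Overall variance = mean of per-channel variances of the block averages.
    var_rgb = np.var(np.array(rgb_means), axis=0).mean()
    var_hsv = np.var(np.array(hsv_means), axis=0).mean()
    return "HSV" if var_hsv > var_rgb else "RGB"
```

For a nearly grayscale image, hue and saturation carry almost no variance, so the RGB model wins, matching the observation above.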

| Graph-based merging
Super pixel segmentation over-segments an image. As edges play a dominating role in the process, colour variation and textures present in an object result in multiple super pixels. At this stage, we take up a merging process to reduce such effects. In the proposed methodology, graph-cut-based merging is a conglomeration of different distance functions that work on different features of the image. First, a neighbourhood graph is created in which each node is a super pixel. There is an edge between two nodes if the corresponding super pixels are neighbours in the spatial domain. The graph is represented by an adjacency matrix GN of size N_sp × N_sp, where GN_ij = 1 if super pixels i and j are spatial neighbours and GN_ij = 0 otherwise. The similarity between each connected pair of nodes (super pixels) is studied for possible merging. We have focused on textural and colour similarity. Figure 4 shows the block diagram for the merging process.

[FIGURE 3: SEEDS super pixel segmentation]
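Building the adjacency matrix GN from a SEEDS-style label map can be sketched as:

```python
import numpy as np

def adjacency_from_labels(labels, n_sp):
    """Build the N_sp x N_sp adjacency matrix GN from a label map:
    GN[i, j] = 1 when super pixels i and j touch (4-connectivity)."""
    gn = np.zeros((n_sp, n_sp), dtype=np.uint8)
    # Compare each pixel with its right and bottom neighbour.
    for a, b in [(labels[:, :-1], labels[:, 1:]),
                 (labels[:-1, :], labels[1:, :])]:
        diff = a != b
        pairs = np.stack([a[diff], b[diff]], axis=1)
        for i, j in pairs:
            gn[i, j] = gn[j, i] = 1
    return gn
```

Only pairs with GN_ij = 1 are then examined for merging, which keeps the similarity computations local.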

| Distance in terms of texture
The local binary pattern (LBP) [36] of an image captures the textural features around each pixel in a small eight-neighbourhood region. The histogram of LBP values in a super pixel captures its general texture, under the assumption that the texture does not vary drastically within the super pixel. Suppose i and j are two connected super pixels and the corresponding normalized LBP histograms are H_i and H_j, respectively. Their similarity is measured using the χ² distance [37,38], as it tests the consistency between the shapes of two histograms. Moreover, the measure is easily interpreted and is robust towards vertical scaling of histograms. The similarity between super pixels i and j is computed as:

CS_ij = (1/2) Σ_{k=1}^{b} (H_i(k) − H_j(k))² / (H_i(k) + H_j(k))

The LBP histograms are quantized into b bins to ignore minor variations in texture. In our experiment, b is taken as 16. A low value of CS_ij corresponds to high similarity in terms of texture.
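A minimal sketch of the texture distance, assuming a basic unrotated 8-neighbour LBP and the χ² histogram distance described above (the bin count b = 16 follows the text):

```python
import numpy as np

def lbp_image(gray):
    """Basic 8-neighbour LBP: each neighbour >= centre contributes one bit.
    Border pixels are dropped; no rotation invariance (an assumption)."""
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]
    code = np.zeros_like(c)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        nb = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= (nb >= c).astype(np.int32) << bit
    return code

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-squared distance between two normalised histograms."""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def texture_distance(lbp_i, lbp_j, bins=16):
    """CS_ij for two super pixels given their pixels' LBP codes."""
    hi, _ = np.histogram(lbp_i, bins=bins, range=(0, 256))
    hj, _ = np.histogram(lbp_j, bins=bins, range=(0, 256))
    hi = hi / max(hi.sum(), 1)
    hj = hj / max(hj.sum(), 1)
    return chi2_distance(hi, hj)
```

Identical texture yields CS_ij = 0; completely disjoint histograms yield the maximum distance of 1.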

| Distance in terms of colour
The LBP captures the grayscale texture patterns. Two super pixels similar in terms of LBP may vary in colour. On the other hand, super pixels with perceptually similar colour may vary in terms of LBP. Hence, we consider colour based similarity also.

[FIGURE 4: Graph-based merging]
Global and detailed measures are formulated to judge the colour similarity of neighbouring super pixels.
The simplest measure for global colour similarity between two neighbouring super pixels i and j is computed as AC_ij = (|μ_ir − μ_jr| + |μ_ig − μ_jg| + |μ_ib − μ_jb|) / 3, where μ_ir, μ_ig and μ_ib represent the average R, G and B values of super pixel i, and μ_jr, μ_jg and μ_jb are the same for super pixel j. The global measure thus obtained is largely unaffected by random noise. Perceptually similar colours may still vary in terms of R, G and B. Hence, another global measure AH_ij is considered based on hue. It is taken as the absolute difference between the average hues of the two super pixels.
The detailed measure is based on the Bhattacharyya distance [39] between the RGB histograms of the two super pixels. It is computed as follows:

BD_ij = (1/3) Σ_{c∈{R,G,B}} sqrt(1 − Σ_{k=1}^{b} sqrt(H_i^c(k) · H_j^c(k)))

H_i^c and H_j^c represent the normalised histograms of colour channel c (i.e., R, G or B) for super pixels i and j, respectively. The number of bins b is taken as 16, which helps to ignore minor variations in colour. The χ² measure is comparatively stricter, and a small shift in a histogram may result in a high distance. The Bhattacharyya distance is a relaxed version and measures the proximity between two probability distributions with different amounts of deviation.
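The three colour measures might be sketched as follows; the exact Bhattacharyya form and the non-circular hue difference are our assumptions, since the printed equations are garbled in the source:

```python
import colorsys
import numpy as np

def bhattacharyya(h1, h2):
    """sqrt(1 - BC), where BC is the Bhattacharyya coefficient of two
    normalised histograms."""
    bc = np.sum(np.sqrt(np.asarray(h1, float) * np.asarray(h2, float)))
    return np.sqrt(max(0.0, 1.0 - bc))

def colour_distances(pix_i, pix_j, bins=16):
    """AC_ij, AH_ij and BD_ij for two super pixels given their pixel
    colours as (N, 3) RGB arrays with values in [0, 255]."""
    pi = np.asarray(pix_i, float)
    pj = np.asarray(pix_j, float)
    mi, mj = pi.mean(axis=0), pj.mean(axis=0)
    ac = np.abs(mi - mj).mean()                 # AC_ij: mean abs diff of averages
    hi = colorsys.rgb_to_hsv(*(mi / 255.0))[0]
    hj = colorsys.rgb_to_hsv(*(mj / 255.0))[0]
    ah = abs(hi - hj)                           # AH_ij: non-circular, a simplification
    bd = 0.0                                    # BD_ij: per-channel Bhattacharyya
    for c in range(3):
        h1, _ = np.histogram(pi[:, c], bins=bins, range=(0, 256))
        h2, _ = np.histogram(pj[:, c], bins=bins, range=(0, 256))
        bd += bhattacharyya(h1 / max(h1.sum(), 1), h2 / max(h2.sum(), 1))
    return ac, ah, bd / 3.0
```

All three distances vanish for identical super pixels, and the 16-bin histograms smooth over small colour fluctuations.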

| Merging
As described, the similarity between each pair of neighbouring nodes (super pixels) in the graph is measured as follows:
- textural similarity based on the χ² distance between their LBP histograms
- global comparison of colour based on the absolute difference between their average RGB values
- global comparison of colour based on the absolute difference between their average hues
- detailed comparison of colour based on the Bhattacharyya distance between their RGB histograms.
With respect to the graph, each of the distance measures is normalised such that the maximum and minimum distances are mapped to 0 and 1, respectively. Thus, the transformed values represent a similarity confidence: the higher the value, the more similar the super pixels. Finally, each edge is assigned a single confidence measure combining all four individual measures. Let CCS_ij, CBD_ij, CAC_ij and CAH_ij represent the transformed similarity confidences corresponding to CS_ij, BD_ij, AC_ij and AH_ij, respectively; the combined confidence is obtained from these four values.
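The normalisation and merging step can be sketched with a union-find structure; the plain average used to combine the four confidences and the merge threshold are our assumptions, as the paper's exact combination rule did not survive extraction:

```python
import numpy as np

def merge_superpixels(edges, dists, threshold=0.5):
    """edges: list of (i, j) super pixel pairs; dists: dict of distance
    arrays aligned with edges (e.g. CS, BD, AC, AH). Each distance is
    min-max normalised and inverted so the minimum distance maps to
    confidence 1 and the maximum to 0, as in the text. Averaging the
    confidences and the 0.5 threshold are assumptions of this sketch."""
    conf = np.zeros(len(edges))
    for d in dists.values():
        d = np.asarray(d, float)
        span = d.max() - d.min()
        c = 1.0 - (d - d.min()) / span if span > 0 else np.ones_like(d)
        conf += c / len(dists)
    # Union-find: merge every pair whose combined confidence clears the threshold.
    n = max(max(i, j) for i, j in edges) + 1
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for (i, j), c in zip(edges, conf):
        if c >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)]
```

Union-find makes the merging transitive: if i merges with j and j with k, all three end up in one segment.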

| Generation of ultrametric contour map
Segmentation approaches broadly fall into two categories: region extracting and edge generating. The proposed super pixel segmentation followed by graph-cut-based merging extracts regions. To compare the performance with edge-based approaches, we have also dealt with multiscale versions of the input image and generated an ultrametric contour map (UCM) [26]. A UCM is an edge map in which the edge pixel value denotes the confidence that it belongs to a semantic edge. The process described so far corresponds to scale 1. The preprocessed image is further scaled with the factors 0.8, 0.6, 0.4 and 0.2. For each such version, the SEEDS super pixel and graph merging steps are applied to generate the output. The maximum number of super pixels (input to SEEDS) is also reduced proportionately according to the scale. Though SEEDS determines the number of super pixels for a given image dimension and type (colour/greyscale), the reduced upper bound on the number of super pixels at each scale produces larger super pixels. Edges are extracted from each version. With the increase in the size of super pixels, only the pronounced edges with sharp contrast are preserved. To generate the UCM, all edge maps are resized to scale 1, skeletonised and averaged. In the reduced versions, more and more intraregion or weak edges are eliminated, whereas dominating edges are retained across scales.
Thus, in the UCM, the more contextually relevant edges are emphasized while minor edges formed by local interference are damped. This form of output does not report individual segments; rather, it reflects segment boundaries in a fuzzy manner. It is useful for applications where the confidence of a contextual edge is important. Figure 5 shows the stepwise output for a sample image taken from the BSDS500 dataset [26]. In the preprocessed image, it can be observed that the edges are quite emphasised, as intended. On the output of super pixel segmentation, the graph has been superimposed. The red links connect the neighbouring super pixels marked for merging. The final merged regions, groundtruth and UCM are also shown.
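The multiscale UCM construction might be sketched as follows; skeletonisation is omitted and a nearest-neighbour resize stands in for a proper interpolator:

```python
import numpy as np

def edges_from_labels(labels):
    """Binary boundary map: pixels whose label differs from a neighbour."""
    e = np.zeros(labels.shape, bool)
    e[:, 1:] |= labels[:, 1:] != labels[:, :-1]
    e[1:, :] |= labels[1:, :] != labels[:-1, :]
    return e

def resize_nn(img, shape):
    """Nearest-neighbour resize back to scale 1 (a simplification)."""
    ys = (np.arange(shape[0]) * img.shape[0] / shape[0]).astype(int)
    xs = (np.arange(shape[1]) * img.shape[1] / shape[1]).astype(int)
    return img[np.ix_(ys, xs)]

def build_ucm(label_maps, full_shape):
    """Average the per-scale edge maps (resized to scale 1) into a
    UCM-style confidence map; skeletonisation is omitted here."""
    acc = np.zeros(full_shape)
    for lm in label_maps:
        acc += resize_nn(edges_from_labels(lm).astype(float), full_shape)
    return acc / len(label_maps)
```

Edges that survive at every scale accumulate confidence near 1, while scale-specific weak edges are averaged down.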

| EXPERIMENTS AND RESULTS
The proposed methodology has been implemented in C++ using the OpenCV libraries [40]. Experiments were performed on a Linux machine using a single core of an Intel i5 processor clocked at 3.0 GHz, with 8 GB of DDR3 RAM and without any GPU support. The BSDS500 dataset [26] has been used for the experiments. The dataset consists of 500 images in a mix of portrait and landscape modes. The resolution of the images is 481×321 for landscape images and 321×481 for portraits. For evaluating the proposed methodology, the resolution is increased by 1.5 times; thus the size becomes 721×481 (481×721) for landscape (portrait) images. This is done to make the full-scale image size similar to a standard webcam resolution. As this resolution is commonly used, it enables evaluation of efficiency in a realistic scenario. The ground truth is also scaled up accordingly. The dataset provides a variety of image categories, such as natural scenes, scenes with wildlife or animals, scenes focused on human subjects and scenes with man-made objects or structures. Both indoor and outdoor images are present. The groundtruth of semantic segmentation is also provided; each image has been manually segmented by multiple human subjects. The dataset thus offers a single-human annotation (by randomly picking one of the multiple groundtruths) and a multi-human annotation (multiple groundtruths by different individuals) for each image. An edge contour map is formed by averaging the groundtruths in the multi-human annotation. It is treated as the groundtruth UCM, where the edge strength is real-valued and lies within [0,1].
We have applied the proposed methodology to each image. The performance of the proposed methodology has been compared with that of a number of other systems in terms of recall, precision, F-measure and time. These are elaborated in the following section.

[FIGURE 5: Output at different stages. From left to right, the top row shows the input image, the preprocessed output, and the output after super pixel segmentation and neighbourhood graph generation (the red edges are marked for merging). From left to right, the bottom row shows the output, the ground truth and the multiscale UCM output. UCM, ultrametric contour map]

| Comparison of performance
We have compared the performance of the proposed methodology with the works of Felzenszwalb et al. [19], Arbelaez et al. [26], Pont et al. [31] and Bosch et al. [32]. Two of these works [19,32] generate a segmented image as output, and the other two [26,31] provide the output in UCM form. We have likewise provided the output in region form (i.e., at scale 1) and in UCM form (i.e., by averaging the edge maps at various scales). For each comparison, the relevant form is considered accordingly. Figure 6 shows a visual comparison of the output generated by the methodologies of Felzenszwalb et al. [19] and Bosch et al. [32]. For the groundtruth, the segments are shown in the average of the colours of the corresponding pixels. The segments in the output of [19] are assigned random colours by its source code; hence, it should not be compared with the groundtruth on the basis of colour. A colour corresponds to a region, and comparison with the groundtruth has to be made based on the regions. The output of the proposed methodology and that of Bosch et al. [32] are rendered in the same manner as the groundtruth. It can be observed that the images are quite oversegmented by the method of Felzenszwalb et al. [19]. That methodology does not perform any global optimization; moreover, its local optimization is affected by the size of the components. The methodology of Bosch et al. [32], on the other hand, optimizes the segmentation by comparing various techniques that balance each other's local vulnerabilities, resulting in a globally optimized output. The proposed methodology uses various measures for comparing only neighbouring super pixels during graph merging. So, in a way, it is the best of both: it excels in accuracy without compromising speed. Visually, the output of the proposed methodology is observed to be better. It should be noted that comparison is performed on the basis of the binary edge map of the groundtruth images and the region output of Felzenszwalb et al. [19], Bosch et al. [32] and the proposed methodology.
As the segments are not labelled semantically, a region-overlap metric is not applicable in this case; thus, pixel-to-pixel comparison between binary edge maps is performed instead. Figure 7 shows the visual comparison with the methodologies of Arbelaez et al. [26] and Pont et al. [31]. In this case, the outputs are considered in UCM form. Visually, it is clear that the proposed methodology outperforms the others.
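The pixel-to-pixel evaluation can be sketched as follows; note that standard BSDS benchmarks allow a small spatial tolerance when matching edge pixels, which this strict version omits:

```python
import numpy as np

def edge_prf(pred, gt):
    """Pixel-to-pixel precision, recall and F-measure between two binary
    edge maps. A strict per-pixel match is used here, a simplification of
    tolerance-based benchmark matching."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    tp = np.logical_and(pred, gt).sum()          # edge pixels found in both maps
    precision = tp / max(pred.sum(), 1)          # fraction of predicted edges that are true
    recall = tp / max(gt.sum(), 1)               # fraction of true edges recovered
    f = 0.0 if precision + recall == 0 else \
        2 * precision * recall / (precision + recall)
    return precision, recall, f
```

Oversegmentation inflates the predicted edge count, lowering precision while keeping recall high, which matches the trade-off discussed below for the proposed method.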
For quantitative comparison of performance, recall, precision and F-measure are considered as metrics. Evaluation is performed twice: once with the single-human annotation as the groundtruth, and again with the UCM groundtruth obtained from the multi-human annotation as described earlier. Table 1 shows the comparison with the works of Felzenszwalb et al. [19] and Bosch et al. [32]. The average values over all images in the dataset are shown in the table. In terms of F-measure and recall, the proposed methodology is better than the others. However, its precision is lower. This is because, in our work, undersegmentation (i.e., missing regions and hence edges) is considered costlier; the proposed methodology oversegments the image, which results in a fall in precision. The thresholds may be tuned to achieve higher precision, but that would compromise the recall. The F-measure of Felzenszwalb et al. is significantly lower due to its oversimplification of local features, which generates too many segments. Bosch et al. performs close to our approach in terms of F-measure, and its precision and recall values are also balanced. It uses a combination of multiple segmentation approaches with subsequent voting for the best result. Thus, its results are quite good in terms of accuracy, but it is computationally heavy. Figure 8 shows the variation of F-measure with threshold for the methods of Arbelaez et al. [26], Pont et al. [31] and the proposed one. In this case, the comparison is based on the UCMs generated by the methods and the groundtruth UCM. A threshold (i.e., on the confidence) is applied to the UCMs, including the groundtruth. For each UCM, the confidence values are normalized and the threshold is varied from 0.5 to 0.95 to generate the binary edge images for comparison. The average F-measure over all images in the dataset is shown. The plot shows that the F-measure of the proposed methodology is much higher than that of the others over a major portion of the range.
This indicates that the proposed methodology detects the true edges with considerably high confidence, and hence they survive even at higher thresholds. Table 2 shows the time required by the proposed methodology for images at different scales. With decreasing scale, the latency decreases in proportion to the area of the image. Although the time required for segmenting a full-scale image does not qualify for real-time robotic applications, at a slightly reduced scale it is good enough. Moreover, with a higher computational setup, it can achieve speeds suitable for real-time application on full-scale images as well. Table 3 shows the comparison between the methodologies in terms of time requirement. Felzenszwalb et al. [19] deal with a large graph, which makes the process slow. The methodology of Arbelaez et al. [26] first prepares a contour image using multiple local cues with nonlinear complexity. Pont et al. [31] applied a fast normalized-cut approach in a hierarchical graph of super pixels to generate the UCM at each scale. For both of these works, the process is repeated at multiple scales.

TABLE 1 Comparison of performance based on edge map
Relatively complex processing at each scale has been adopted in those works; hence, the proposed methodology is faster. Bosch et al. [32] applied multiple popular image segmentation schemes to keep the accuracy high, and that makes the process slow. Thus, it is clear that the proposed methodology is much faster than the others in both single-scale and multiscale modes. At a reduced scale and/or with a multicore implementation, the methodology can handle six or more frames per second and can be utilized in robot vision applications.

| CONCLUSION
In this work, a novel methodology has been presented for natural image segmentation that elegantly combines different techniques. In the preprocessing stage, a weighted average of the denoised image and its derivatives helps to enhance the edges along with noise suppression. The SEEDS super pixel method is used to generate the super pixels. On the basis of their adjacency, the super pixels are represented in a graph, and a merging scheme is applied based on the similarity confidence of the connected super pixels. The confidence has been defined based on the textural, global and detailed colour similarity between adjacent super pixels. The performance of the proposed methodology on the BSDS500 dataset has been compared with that of a number of other works and found to be superior. The proposed methodology is much faster and can be utilized in real-time applications.