Multiscale ensemble of convolutional neural networks for skin lesion classification

Early detection and treatment of skin cancer can considerably reduce patient mortality rates. Convolutional neural networks (CNNs) have been widely applied in the field of computer-aided diagnosis. However, for skin lesions, the inconsistent size of lesion regions in dermatoscope images hinders precise discrimination by CNNs. To solve this problem, a multiscale ensemble of convolutional neural networks called MECNN is proposed, which involves three branches with different lesion scales as the model input. The first branch locates the lesion region outline by identifying the largest local response point. MECNN then reduces the search area of the lesion region and divides the outline into two scales used as the input for the other two branches. A global loss function is defined to control the learning objectives of the three branches, and MECNN fuses the branch outputs as the final classification result. The proposed model is evaluated on the public HAM10000 dataset and achieves higher classification accuracy than comparative state-of-the-art methods.

This technology is particularly important in the diagnosis of melanoma. However, visual diagnosis is time-consuming and subjective. It is difficult for dermatologists to distinguish malignant from benign skin lesions because of their visual similarity [4]. Therefore, computer-aided diagnosis (CAD) systems for skin lesion recognition have naturally proved to be contributing assessment tools that can reduce inter-observer variability and address the limited availability of trained experts [5].
To relieve these problems, we propose a skin lesion classification method called MECNN based on a multiscale ensemble of CNNs. The proposed framework includes three branches (one main branch and two auxiliary branches), each of which can embed a common network structure such as ResNet [6] or DenseNet [7]. We reduce the dimensions of the features extracted by the main branch to obtain an attention map, identify the position of the maximum local response point in the attention map, and map this point to the input image to obtain the cropping window centre. Subsequently, we crop the input image with two different window sizes to obtain images at two scales. A global loss function is defined to control the learning objectives of the three branches, owing to which the prediction results of the two auxiliary branches become as similar as possible to those of the main branch. The main contributions are summarized as follows:
1. We propose a skin lesion classification method, MECNN, based on a multiscale ensemble of convolutional neural networks, which achieves better results than state-of-the-art methods on a public dataset.
2. The proposed framework can be assembled with common CNN structures and substantially improves their classification performance.
3. A global loss function is designed to control the learning objectives of the multiscale network branches and helps reduce intra-class differences.
4. The proposed multiscale ensemble can accurately extract the salient lesion region in dermatologic images without extra segmentation labels and helps relieve noise interference from the background around the lesion region.
The remainder of this paper is organized as follows: Section 2 reviews related work and summarizes the existing research on the classification of skin lesions. Section 3 describes the proposed method. Section 4 discusses the experimental results and the related evaluation. Finally, Section 5 concludes the proposed approach.

Skin lesion classification
In recent years, deep learning has been widely used in the field of skin lesion classification. Kawahara et al. [14] proposed a multichannel CNN in which multiresolution images were provided as input, and the framework was optimized by an auxiliary loss function. Esteva et al. [15] trained InceptionV3 with 129,450 images to diagnose the most common and fatal skin cancers, and the diagnostic level of this method was noted to be similar to that of dermatologists. Ge et al. [16] trained a CNN model with clinical images and demonstrated the effectiveness of multimodal learning in skin lesion classification. Zhang et al. [17] proposed an attention residual learning based CNN for skin lesion classification; this method improved the last layer of ResNet and enhanced the model's ability to discriminate dermatologic images through a novel attention mechanism. Menegola et al. [18] used six public datasets to pretrain CNNs to boost the performance of skin lesion classification. Chaturvedi et al. [19] trained a pretrained MobileNet on the HAM10000 dataset [20]; this model outperformed existing models while being faster and having fewer parameters. Zhang et al. [21] optimized a collaborative deep learning method in terms of classification and collaborative errors, which proved effective in many medical image classification tasks. Ratul et al. [22] employed dilated convolution and reported the highest performance with an extended InceptionV3.

Segmentation based lesion classification
To focus on the local lesion region of dermatologic images, many existing methods first locate the skin lesion region and subsequently classify it [5,13]. Diaz et al. [23] used a structure segmentation model to enable further feature representation; the diagnostic model used all the features to predict the final skin lesion type. Yu et al. [24] used a joint segmentation-classification model to increase accuracy: the segmentation mask was used to locate the skin lesion region, and the cropped skin lesion image was then classified. Jia et al. [25] proposed a two-stage framework with only one network to classify dermatologic images: a convolutional network was trained, the input region with the maximum activation value was then cut out, and the network was retrained to output the final probability. Alom et al. [26] combined feature maps from lower and higher layers and achieved excellent results with the same or fewer network parameters in both segmentation and classification tasks.

Ensemble learning based classification
To enhance the classification accuracy of skin lesions, many studies have used ensemble learning to fuse the diagnosis results of multiple methods [27]. Matsunaga et al. [28] proposed a deep neural network ensemble method to classify melanoma, seborrheic keratosis and nevus: first, a classifier was generated to separate melanoma from other lesions; second, another classifier was generated to separate seborrheic keratosis from other skin injuries; finally, the model fused the output classification probabilities. Mahbod et al. [29] proposed a CNN integration scheme that used the AlexNet, VGG16 and ResNet-18 models to extract features; the features were used to train different support vector machine classifiers, and the classification vectors were finally fused to provide the final classification results. Another study proposed a combination approach involving traditional image processing and deep learning techniques: six features were first extracted from the dermatologic images, the ResNet-50 network model was trained using these feature representations to obtain six feature vectors, and the obtained feature vectors were finally mapped to skin lesion types through logistic regression.
Bi et al. [32] trained a multiple-CNN ensemble for the same classification problem; the authors trained three networks by fine-tuning a pretrained CNN for distinct problems: one network addressed the three-class problem, and the other two were binary classifiers (melanoma versus other lesion classes, and seborrheic keratosis versus other lesion classes). The area under the curve (AUC) for melanoma, seborrheic keratosis and the other categories was 0.854, 0.976 and 0.915, respectively. Carcagni et al. [33] proposed a CNN architecture based on DenseNet, in which six different models were obtained through multilevel fine-tuning of the dense blocks.
Finally, an SVM model was used to classify the seven types of skin injuries. In general, although deep learning based skin lesion classification methods can achieve high performance, the inter-class similarities and intra-class differences caused by the inconsistent size of the lesion regions in dermatologic images still limit model generalization.

MECNN
To solve the problem of deteriorated CNN performance caused by the inconsistent size of the lesion regions in dermatoscope images, we propose a skin lesion classification method based on a multiscale ensemble of CNNs called MECNN, as shown in Figure 1. MECNN includes three branches (one main branch and two auxiliary branches), and the feature extraction modules of the branches share the same structure. The feature extraction module can be substituted by common network structures, such as ResNet, DenseNet, etc.
Specifically, an input image passes through the main branch's feature extraction module to obtain feature maps. These feature maps are passed through a classifier to obtain the classification output Pr1 and are also used to generate an attention map. In this attention map, we identify the point with the maximum local feature response and map it to the input image to locate the cropping centre. Two images, Crop1 and Crop2, with different scales are obtained by cropping the input image around this centre with two preset cropping window sizes. These two cropped images are used as the inputs of the two auxiliary branches; before being fed to them, Crop1 and Crop2 are first upsampled to the size of the input image. They are then sent through a feature extraction module and classifier, like the main branch, to obtain Pr2 and Pr3. In the training stage, we expect Pr2 and Pr3 obtained in the auxiliary branches to improve the lesion localization ability of the main branch. Therefore, a global loss function is designed to control the learning objectives of the three branches, owing to which the predicted results of the two auxiliary branches become similar to those of the main branch. The same network structure is used in the test stage, and the results can be combined with the multiscale image information, which is reflected in the image cropping process. The multiscale information provides an attention based data augmentation scheme, and the cropping window helps relieve noise interference from the background around the lesion region. The multiscale pairwise ranking loss functions help reduce intra-class differences. The pseudo code in Table 2 illustrates our method further.
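The fusion of the three branch outputs described above can be sketched as follows. The paper states that the branch outputs are fused into the final classification result but does not specify the operator; averaging the softmax probabilities, as below, is one plausible assumption, and `fuse_branch_predictions` is a hypothetical helper name.

```python
import torch

def fuse_branch_predictions(logits_list):
    """Average the softmax outputs of the branches (Pr1, Pr2, Pr3) to
    form the final classification result. Averaging is an assumption:
    any order-preserving combination of the branch probabilities could
    be substituted."""
    probs = [torch.softmax(l, dim=1) for l in logits_list]
    return torch.stack(probs, dim=0).mean(dim=0)

# Toy example: three branches, batch of 2, the 7 HAM10000 classes.
pr = torch.zeros(2, 7)
pr[:, 4] = 2.0                       # every branch favours class index 4
fused = fuse_branch_predictions([pr, pr, pr])
assert fused.shape == (2, 7)
assert fused.argmax(dim=1).tolist() == [4, 4]
```

Because each branch sees the lesion at a different scale, averaging lets a confident auxiliary branch correct a weak main-branch prediction without any extra parameters.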
A detailed description of the method is presented as follows.

Data augmentation
We resize the input images to 224 × 224. The images of each category are randomly divided into 5 equal groups, of which 4 groups are used for training and the last group is used for testing. Because of the unbalanced distribution of the training dataset, we augment each image category to a similar number of images. The specific mechanisms include rotation as well as contrast, brightness and saturation adjustment, after which the total training set contains 59,500 images.

Cropping scheme
The proposed method aims to solve the problem of the non-uniform scales of the lesion regions in dermatologic images; therefore, it is crucial to suitably extract the lesion regions. We cut out the lesion region by setting cropping boxes of different sizes to obtain cropped images. The obtained images are upsampled to the size of the original image, thereby enlarging the lesion region. The specific operation is shown in Figure 2. First, we seek the maximum local feature response point in the feature maps, Figure 2(A). Then, we map this point to the input image, Figure 2(C), as the cropping centre. Finally, we crop the input image around the cropping centre with the preset window size. The maximum local response point is

(x_mr, y_mr) = arg max_{x,y} f_{3×3}(M_{x,y}),

where M_{x,y} is the output feature map before the FC layer of the main branch in Figure 1, (x, y) is the spatial coordinate of M, and f_{3×3} is a convolution operation with a kernel size of 3 × 3 and fixed kernel weights of 1. After finding the maximum local response point (x_mr, y_mr), we map this point to the input image by

(x_c, y_c) = (2^k · x_mr, 2^k · y_mr),

where (x_c, y_c) is the cropping centre in the input image and k is the number of downsampling operations in the feature extraction module. Note that if the cropping box exceeds the input image, the image vertex closest to the orange point in Figure 2(A) is selected, and cropping is performed considering this vertex as that of the cropping box.
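The localization and cropping steps above can be sketched in PyTorch. `crop_around_peak` is a hypothetical helper; k = 5 assumes a ResNet50-style backbone with five factor-of-2 downsamplings (7 × 7 feature maps from a 224 × 224 input), and the boundary clamping implements the vertex rule described in the text.

```python
import torch
import torch.nn.functional as F

def crop_around_peak(image, attn, win, k=5):
    """Find the maximum local response point on the attention map with a
    3x3 all-ones convolution (f_3x3), map it back to input coordinates
    by multiplying by 2**k, and crop a win x win window, clamping the
    box to the image boundary. Single-image sketch: image is (C, H, W),
    attn is (h, w)."""
    ones = torch.ones(1, 1, 3, 3)                              # fixed weights of 1
    resp = F.conv2d(attn[None, None], ones, padding=1)[0, 0]   # f_3x3(M)
    idx = resp.argmax().item()
    y_mr, x_mr = divmod(idx, resp.shape[1])
    x_c, y_c = x_mr * 2 ** k, y_mr * 2 ** k                    # map to input image
    H, W = image.shape[-2:]
    # If the box would exceed the image, slide it so the nearest image
    # vertex becomes a vertex of the cropping box.
    x0 = min(max(x_c - win // 2, 0), W - win)
    y0 = min(max(y_c - win // 2, 0), H - win)
    return image[..., y0:y0 + win, x0:x0 + win]
```

The cropped region would then be bilinearly upsampled back to 224 × 224 before entering an auxiliary branch.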

Loss function
A global loss function is designed, motivated by Fu et al. [34]. The total loss function consists of two parts, namely the classification and pairwise ranking loss functions:

L = Σ_{s=1}^{3} L_cls(Pr_s, Y) + Σ_{s=2}^{3} L_pair(p_1, p_s), with L_pair(p_1, p_s) = max(0, p_1 + margin − p_s),

where Y is the ground-truth label and p_s is the probability that branch s assigns to the ground-truth class. This constraint ensures that p_s > p_1 + margin during the training stage, and the hyperparameter margin is set to 0.05 in all the experiments. In the training process, as shown in Figure 1, the prediction results of the lower two branches become as similar as possible to those of the upper branch, and the final classification results are obtained by comprehensively considering the outputs of the three branches.
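A sketch of this global loss, following the RA-CNN-style ranking loss of Fu et al. [34]: cross-entropy on every branch plus a hinge term pushing each auxiliary branch's ground-truth probability above the main branch's by at least the margin. `mecnn_loss` is a hypothetical name, and summing the per-branch terms without weights is an assumption.

```python
import torch
import torch.nn.functional as F

def mecnn_loss(logits, targets, margin=0.05):
    """logits = [Pr1, Pr2, Pr3], each (B, num_classes); targets (B,).
    Classification part: cross-entropy on each branch.
    Ranking part: max(0, p1 + margin - ps) for each auxiliary branch s,
    where p is the softmax probability of the ground-truth class."""
    cls_loss = sum(F.cross_entropy(l, targets) for l in logits)
    p = [torch.softmax(l, dim=1).gather(1, targets[:, None]).squeeze(1)
         for l in logits]                       # ground-truth class probabilities
    rank_loss = sum(F.relu(p[0] + margin - p_s).mean() for p_s in p[1:])
    return cls_loss + rank_loss
```

The hinge term is zero once an auxiliary branch beats the main branch by the margin, so it only trains the auxiliary branches while their crops are not yet more informative than the full image.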

EXPERIMENTAL RESULTS
As the parameter setup for all the experiments, the initial learning rate is set to 0.001 and is decayed by a factor of 0.1 every 7 epochs. The SGD optimizer is used for training. We find that the margin in Equation (4) is robust to optimization, so we empirically set the margin to 0.05. All the experiments are performed on an Ubuntu system with a 1080Ti GPU and the PyTorch framework.
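This training setup maps directly onto PyTorch's SGD optimizer and StepLR scheduler; `momentum=0.9` is an assumed default not stated in the paper, and the linear layer stands in for the full MECNN model.

```python
import torch

model = torch.nn.Linear(10, 7)  # stand-in for the MECNN parameters
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Decay the learning rate by a factor of 0.1 every 7 epochs, as above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
```

During training, `scheduler.step()` is called once per epoch after `optimizer.step()`, so the learning rate drops to 1e-4 at epoch 7, 1e-5 at epoch 14, and so on.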

Evaluation indices
The evaluation indices of the experiments include the precision, recall and F1-score. TP, TN, FP and FN represent the numbers of true positives, true negatives, false positives and false negatives, respectively.

Compared with recent skin lesion classification methods
We use different network structures (ResNet50, ResNet101, DenseNet121, DenseNet161) as feature extraction modules. The evaluation indices are calculated by micro-average and macro-average methods. Micro-averaging considers all categories at once to calculate the accuracy of category prediction, whereas macro-averaging considers each category separately and obtains the final accuracy on the test set by arithmetic averaging. Because both micro-averaging and macro-averaging are used for comparison in the field of skin lesion classification, we present the results of both averaging methods, as listed in Figure 3 and Table 3. For the precision, recall and F1-score of each model listed in Table 3, the largest values over a training process of 50 epochs are reported. Moreover, the proposed method is compared with other ensemble methods, with [en] denoting an ensemble method.
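The difference between the two averaging schemes can be made concrete with precision as an example: micro-averaging pools the counts over all classes before dividing, while macro-averaging computes a per-class precision first and then takes the arithmetic mean. `micro_macro_precision` is an illustrative helper, not from the paper.

```python
def micro_macro_precision(per_class_counts):
    """per_class_counts: list of (TP, FP) pairs, one per class.
    Micro: pool TP and FP over classes, then divide.
    Macro: per-class precision, then arithmetic mean."""
    micro = (sum(tp for tp, _ in per_class_counts)
             / sum(tp + fp for tp, fp in per_class_counts))
    macro = (sum(tp / (tp + fp) for tp, fp in per_class_counts)
             / len(per_class_counts))
    return micro, macro

# A large class (90 TP, 10 FP) and a small one (5 TP, 5 FP):
micro, macro = micro_macro_precision([(90, 10), (5, 5)])
assert abs(micro - 95 / 110) < 1e-9   # dominated by the large class
assert abs(macro - 0.7) < 1e-9        # each class weighted equally
```

On an imbalanced dataset such as HAM10000, the micro average is dominated by the majority NV class, which is why reporting both is informative.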
Even when different network structures are used as the feature extraction modules, the proposed framework achieves higher performance, which demonstrates that the proposed multiscale deep ensemble model is not sensitive to the choice of feature extraction module. In the best-performing model, ResNet50 is used as the feature extraction module of the ensemble; the corresponding macro precision, macro recall and macro F1-score are 0.8941, 0.8769 and 0.8723, respectively. Among the other skin lesion classification methods, Gessert's method exhibits the highest performance.

Lesion region cropping results
Because of the inconsistent size of the lesion regions in dermatologic images, and especially the extremely small lesion regions present in many images, the networks cannot accurately extract the associated features, leading to low classification accuracy. Therefore, it is meaningful to crop out the correct skin lesion regions from the input images. We test the cropping method: although the cropping box can effectively extract the lesion region, the box is not always centred on the lesion. This phenomenon occurs because, after the maximum response point is determined, the cropping frame cannot move beyond the image boundary, and the corresponding cropping is performed using the image vertex as the cropping edge. Furthermore, because of the large size of the cropping frame, the cropping frames of all the images may not be centred on the lesion.
Reducing the interference of hair in dermatologic images has challenged researchers for a long time, as hair interference directly affects the results. Therefore, we crop images with hair interference and observe whether the result is affected by the hair, which would lead to inaccurate extraction of the lesion region. As shown in Figure 5, the cropping frame is barely affected by the hair.

Confusion matrix analysis
To further analyse the effectiveness of the method, we study the confusion matrix. Zhang et al. [17] also used ResNet50 as the feature extraction module, like our method, so we compare our results with theirs [17]. The experimental results are shown in Figure 6. The proposed method exhibits higher accuracy for each category, which shows that it is effective for all the skin lesion categories. For the NV column, it is noted that the proposed method misclassifies fewer samples of the other classes as NV than the method of [17]. This finding indicates that the proposed method can accurately locate the lesion region by cropping out features at different scales after feature extraction, so that the features of each class can be extracted more effectively during re-extraction. Moreover, the results obtained at different scales are combined in the final classification, thereby improving the accuracy for each class.

Ablation study
Here, we discuss the effectiveness of the cropping process and the pairwise ranking loss function in our methods.

The cropping process
Table 4 shows the performance when choosing different window sizes and different numbers of sub-branches.
According to the results obtained using only one sub-branch, window sizes smaller than 140 lead to worse performance because they crop only a limited area of the lesion region. Based on this result, we choose two window sizes, 168 and 196, to cover the whole range from 140 to 224 (the original image size), and obtain our best result.
Besides, we show the influence of the cropping centre selection in Figure 7, where the green line represents the F1-score obtained when a fixed point (the input image centre) is used as the cropping centre. The results show that the attention map based centre selection helps find the lesion region and improves the F1-score by 2-5%. The performance based on a fixed cropping centre strongly depends on the location and size of the lesion area, whereas our method, based on the local maximum response point of the attention map, can dynamically adjust the cropping window position according to the lesion location.

The pairwise ranking loss function
In Table 5, we list the prediction probabilities of our model based on ResNet50 with two auxiliary branches for 10 different dermatologic images. All the full-model prediction probabilities are notably higher than those of the main branch alone, which shows that the pairwise ranking loss enhances the model's prediction confidence.

CONCLUSIONS AND FUTURE WORK
In this paper, we proposed a multiscale deep ensemble model for skin lesion classification. The network can accurately crop out the lesion regions in dermatologic images for further feature extraction and classifies by combining the features obtained at different scales. The proposed method outperforms the other methods on the dermatologic dataset HAM10000. Future work will focus on accurately locating the skin lesion regions by dynamically adjusting the size of the cropping box.

FIGURE 1
FIGURE 1 Network structure diagram overview. The features of the input image a1 are extracted to obtain the attention map, which is then cropped to obtain Crop1 and Crop2 at different scales, used for further feature extraction. Pr1, Pr2 and Pr3 represent the prediction outputs of the branches. L_s and L_pair represent the classification and pairwise ranking loss functions, respectively

FIGURE 2
FIGURE 2 Schematic diagram of the cropping scheme. A convolution operation is performed on A to obtain B; the orange point in B is the maximum local response point, and its position is used as the cropping centre in C with different window sizes

FIGURE 3
FIGURE 3 Comparison of the skin lesion classification methods with micro-average calculation. We use different basic networks as the feature extraction modules.

Using ResNet50 as the network feature extraction module, we compare the input image with the cropped images at two different scales to further analyse the effectiveness of the cropping method. Two images with small lesion regions are selected for each type of skin lesion, and the cut parts are marked with boxes of different colours: the blue and red boxes denote cropping window sizes of 168 × 168 and 196 × 196, respectively. The results are shown in Figure 4.

FIGURE 4
FIGURE 4 Cropping results of dermatologic images

TABLE 2 The
pseudo code of MECNN. S is the total number of training steps, N is the number of auxiliary branches, and h_n, w_n are the cropping window sizes of an auxiliary branch. The cropping centre is located by (x_mr, y_mr) = arg max_{x,y} f_{3×3}(M_{x,y})

TABLE 3
Comparison of the MECNN framework and recent studies with macro-average calculation. We use different basic networks as the feature extraction modules. The maximum values of the performance indices are boldfaced

TABLE 5
The prediction probability of MECNN based on ResNet50 and two branches for ten images

TABLE 4
Results pertaining to different sizes of cropping boxes