Recognizing apple leaf diseases using a novel parallel real-time processing framework based on MASK RCNN and transfer learning: An application for smart agriculture

Effective recognition of fruit leaf diseases has a substantial impact on agro-based economies. Several fruit diseases badly impact the yield and quality of fruits. Naked-eye inspection of an infected region is a difficult and tedious process; therefore, an automated system for accurate recognition of the disease is required. It is widely understood that low-contrast images reduce identification and classification accuracy. Here, a parallel framework for real-time apple leaf disease identification and classification is proposed. Initially, a hybrid contrast stretching method is proposed to increase the visual impact of an image, and then Mask RCNN is configured to detect the infected regions. In parallel, the enhanced images are utilized to train a pre-trained CNN model for feature extraction. A selection method based on Kapur's entropy along with MSVM (EaMSVM) is developed to select strong features for the final classification. The Plant Village dataset is employed for the experimental process, and the best accuracy of 96.6% is achieved with the ensemble subspace discriminant analysis (ESDA) classifier. A comparison with previous techniques illustrates the superiority of the proposed framework.

of apple leaf diseases are tiny; and iv) interference of ambient factors such as illumination and shadow affects the detection of lesion spots on leaves [20].
Recent computational developments, particularly embedded graphics processing units (GPUs), have revolutionized machine learning; as a result, novel advanced models and techniques have been proposed [21]. Deep learning is one of these techniques, referring to artificial neural network (ANN) architectures with many processing layers [22]. The deep learning approach has shown huge success in several computer vision applications such as medical imaging, assisted living, and object classification [23][24][25]. In the agriculture domain, however, deep learning is mostly employed for classification purposes [26,27], and segmentation of the tiny infected regions remains difficult owing to the scarcity of pixel-level ground truth images [28].
Several researchers have introduced computer vision-based techniques in the literature. Dubey et al. [29] describe a computer vision-based technique that comprises three main steps: i) segmentation of lesion spots using K-means clustering; ii) extraction of colour, LBP, and shape features from the segmented spots; and iii) application of the multiclass SVM (MSVM) classifier for the final classification. Experiments performed on several apple diseases achieved notable performance. Khan et al. [2] present a deep learning system for several fruit diseases with two main steps: disease segmentation using a correlation-based strategy and disease classification using deep features. Ormani et al. [30] present a computerized technique for apple disease identification involving three main steps: i) K-means clustering for segmentation of lesion spots; ii) extraction of GLCM, colour, and wavelet features from the segmented spots; and iii) feeding the extracted features into a support vector regression classifier for classification. Shuaibu et al. [31] introduce an optimized stochastic algorithm for the detection of the apple disease Marssonina. Optimal features are chosen using the particle swarm optimization (PSO) algorithm based on the highest discrimination; the SVM classifier then performs the classification and achieves improved accuracy. Barbedo et al. [32] present a methodology that applies a colour channel transformation and a Boolean operation to a binary mask to segment lesion spots on infected plant leaves; this method resolves challenges such as variability in illumination and environment.
In this work, we propose a framework for apple leaf disease detection and classification. For this purpose, we generate ground truths to train a CNN model named Mask RCNN. Our main contributions in this work are as follows:

PROPOSED METHOD
This section presents our proposed deep learning scheme for apple leaf disease detection and classification. Figure 1 shows the proposed model, which consists of two parallel steps: disease detection and disease classification. A hybrid contrast enhancement method is first implemented to improve the quality of the images. The improved images are then passed to Mask RCNN for detection and, in parallel, used to retrain ResNet-50 for classification. The details of each step shown in this figure are as follows:

Hybrid contrast stretching
Contrast stretching is an important step in computerized methods that identify specific regions or objects, such as infected parts [33]. High-quality images usually help extract the infected region with good precision. Here, we implement a hybrid contrast stretching approach to increase the visual quality of an image. The presented approach fuses top-hat and bottom-hat filter responses and then applies a mean-value-based threshold function that increases the intensity of selected pixels. This approach is simple but effective for this work. Mathematically, the method proceeds as follows. Consider an input image I of dimension 256 × 256 × 3. The top-hat filter is defined as:

top(i, j) = I(i, j) − (I ∘ s)(i, j)

where top is the top-hat activation function, I is the original image, (I ∘ s) denotes the morphological opening, and s is the structuring element, initialized as 11 in this work. The bottom-hat filter is defined as:

Ψ(i, j) = (I • s)(i, j) − I(i, j)

where Ψ denotes the bottom-hat activation function, (I • s) denotes the morphological closing, and the structuring element s has value 5 for this filter. After that, we add the top-hat information to the original image:

New(i, j) = I(i, j) + top(i, j)

The new enhanced image New(i, j) has improved local contrast, which is further improved by the addition of the bottom-hat image:

New_b(i, j) = New(i, j) + Ψ(i, j)

The mean μ of this new image is calculated and used in a threshold function that retains the boosted intensity of pixels at or above μ, producing New_u(i, j), the final contrast-enhanced image. The resultant image is passed to Mask RCNN for disease spot identification.
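The steps above can be sketched in numpy. This is a minimal illustration, not the paper's exact code: the erosion/dilation helpers and the fusion formula I + top − bottom (the usual morphological contrast-enhancement form) are assumptions, since the original equations are not reproduced in the text.

```python
import numpy as np

def erode(img, k):
    """Greyscale erosion: minimum over a k x k neighbourhood."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].min()
    return out

def dilate(img, k):
    """Greyscale dilation: maximum over a k x k neighbourhood."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].max()
    return out

def hybrid_contrast_stretch(img, k_top=11, k_bottom=5):
    """Fuse top-hat and bottom-hat responses to boost local contrast.

    Assumption: fused = I + tophat - bottomhat, clipped to [0, 255].
    """
    img = img.astype(np.float64)
    tophat = img - dilate(erode(img, k_top), k_top)           # I - opening(I)
    bottomhat = erode(dilate(img, k_bottom), k_bottom) - img  # closing(I) - I
    fused = img + tophat - bottomhat
    return np.clip(fused, 0, 255).astype(np.uint8)
```

In practice a library implementation (e.g. OpenCV's `morphologyEx` with `MORPH_TOPHAT` / `MORPH_BLACKHAT`) would replace the explicit loops; they are written out here only to keep the sketch dependency-free.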

Modified mask RCNN
Segmentation of the image is used to identify different changes in plant leaves. The real challenge is to identify the precise pixel boundaries of the infected area; its precise location helps farmers take timely, targeted counter-actions to avoid further spread of the infection. In this work, Mask RCNN [34] is utilized to segment infected regions using contrast-enhanced images. Mask RCNN is an evolved, more efficient version of Faster RCNN. It consists of several parts that aid each other in producing the desired output: the backbone, the region proposal network (RPN), ROI alignment, the network head, and the loss function, as illustrated in Figure 2.
Backbone: This is a standard CNN (usually ResNet) that acts as the feature extractor. Low-level features are extracted from the early layers, while high-level ones are extracted from the later layers. The backbone creates feature maps as the input image passes through it. Consider a contrast-stretched image of dimension 256 × 256 × 3. As the acceptable input of Mask RCNN has dimension 1024 × 1024 × 3, we convert the enhanced images to the required network input size using a simple function of the programming tool Matlab (imresize). Each 1024 × 1024 × 3 input image is converted to a feature map of size 32 × 32 × 2048, which acts as input to the further layers of the network. A feature pyramid network (FPN) [35] is employed to aid ResNet during feature extraction: using FPN, objects can be represented better at multiple scales, and both low-level and high-level features can be accessed at every level. We use two separate backbone configurations for the implementation of Mask RCNN: 1) ResNet50+FPN and 2) ResNet101+FPN.
Region proposal network (RPN): The RPN proposes potential candidates for object bounding boxes by utilizing the image features from the backbone. This network replaces selective search [36], which is a computationally slower mechanism. The RPN scans regions of the image called anchors, which overlap and are spread all over the image to find potential bounding box candidates at different sizes and aspect ratios. Anchors with the highest RPN proposal probability are chosen for further refinement. As anchors overlap, non-maximum suppression [37] based on class scores is used to reduce redundancy among the RPN proposals. The RPN loss function is described as:

L({b_i}, {B_i}) = (1/N_cls) Σ_i L_cls(b_i, b_i*) + λ (1/N_reg) Σ_i b_i* L_reg(B_i, B_i*)

Here, the predicted probability of anchor i is represented by b_i, and the ground truth label is denoted by b_i*, which is 1 if the anchor is positive and 0 otherwise.
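The IoU-based non-maximum suppression step described above can be sketched as follows; this is a generic greedy NMS (as in the cited literature), not code from the paper, and the box format [x1, y1, x2, y2] is an assumption.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep highest-scoring proposals, drop any remaining
    proposal whose IoU with a kept box exceeds thresh."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= thresh]
    return keep
```

Production detectors would use a framework primitive (e.g. `torchvision.ops.nms`); the loop form simply makes the suppression rule explicit.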
The bounding box of the segmented area is drawn from the four coordinates provided by the vector B_i, and B_i* represents the corresponding ground truth bounding box. For the computation of the precise location of input features, the ROI align layer uses bilinear interpolation [38] at four regularly sampled locations in every bin of the ROI. For further refinement, average or max pooling is carried out on the attained features.
Network head: The network head is a fully connected stage used to perform classification, prediction of the segmented mask, and bounding box generation. Its input is the features extracted by ROI align, and it performs these tasks simultaneously. For each ROI, a mask of size m × m is predicted; it is necessary to preserve the definite m × m spatial layout of the object. The utilized multi-task loss function is defined as follows:

L = L_cls + L_box + L_mask

where L_mask is the average per-pixel binary cross-entropy over the predicted m × m mask. The parameters initialized for the configuration of Mask RCNN are given in Table 1: the training steps per epoch are 400, the total validation steps are 50, the learning rate is 0.01, and the momentum is 0.6. The resultant segmented images are shown in Figure 3, where the second column shows the ground truth images, the third column shows the bounding box images, and the last column represents the final segmented images.
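The mask term of the loss can be illustrated with a small numpy sketch; this is a generic average binary cross-entropy (the standard Mask R-CNN formulation), offered as an assumption since the paper does not reproduce its equation.

```python
import numpy as np

def mask_bce(pred, target, eps=1e-7):
    """Average per-pixel binary cross-entropy over an m x m mask.

    pred: predicted per-pixel foreground probabilities in [0, 1].
    target: binary ground truth mask of the same shape.
    """
    pred = np.clip(pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def total_loss(l_cls, l_box, pred_mask, gt_mask):
    """Multi-task loss L = L_cls + L_box + L_mask."""
    return l_cls + l_box + mask_bce(pred_mask, gt_mask)
```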

Classification task
Today, computerized methods are routinely applied to classification problems, including in agriculture, where they aid agriculture experts in classifying multiple diseases. In this domain, deep convolutional neural networks in particular have become the dominant choice for such classification tasks.

Feature extraction using TL
A pre-trained ResNet-50 model is retrained on the selected fruit images through transfer learning (TL). TL is the process of reusing a pre-trained model for a new task with less computational time; the complete process is shown in Figure 4. As mentioned in Figure 1, the ResNet50 pre-trained convolutional neural network is employed and retrained on contrast-enhanced apple leaf images. Originally, the ResNet model [39] is trained on the ImageNet dataset [40]. The model comprises 177 layers in total, including 53 convolutional layers, 49 ReLU layers, 53 batch normalization layers, and 16 additional layers; the architecture of ResNet50 is given in Table 2. The model is fine-tuned for the new task: the last layer is removed and replaced with a new layer covering only the four apple leaf classes mentioned in the results section. After that, activations are taken from two feature layers (average pool and FC) to obtain two feature vectors. During training, we use the following hyperparameters: mini-batch size of 64, momentum of 0.93, learning rate of 0.0001, and weight decay of 0.2. We use cross-entropy as the loss function, defined as:

CE = −(1/V) Σ_{v=1}^{V} Σ_{m=1}^{M} ℂ_{v,m} log(P_{v,m})

where M is the number of classes, ℂ are the one-hot class labels, and P_{v,m} is the predicted probability of class m for each of the V observations.
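The cross-entropy loss above can be computed directly; a minimal sketch (probabilities and one-hot labels as numpy arrays, with the averaging over observations assumed as written above):

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Categorical cross-entropy over V observations and M classes.

    probs:  (V, M) predicted class probabilities.
    labels: (V, M) one-hot ground truth labels.
    """
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(labels * np.log(probs), axis=1))
```

A perfect prediction gives a loss of zero, and a uniform prediction over M classes gives log(M).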
Optimization algorithms are used to minimize the loss function; here, gradient descent is employed. Choosing a good learning rate is crucial to reaching a local minimum: it defines how fast or slow the system converges. With a well-chosen learning rate, the system attains better accuracy and becomes time-efficient.
Mathematically, gradient descent is defined as:

θ = θ − η ∇_θ J(θ)

Mini-batch gradient descent is one variant of gradient descent. It learns faster and updates more frequently, as it goes through a smaller number of examples per update, and it avoids redundant examples by selecting random examples. A further advantage is that it improves the generalization error, as the batch size is smaller than the training dataset. The update for mini-batch gradient descent is defined as:

θ = θ − η ∇_θ J(θ; x^{(i:i+n)}, y^{(i:i+n)})

where η is the learning rate, x^{(i)} and y^{(i)} are the training examples, and n is the size of the mini-batch. Momentum and learning rate are 0.93 and 0.001, respectively. After applying the activation function on the average pool layer and the fully connected layer, we obtain two feature vectors of dimensions N × 2048 and N × 1000, respectively. Later, we select the best features from both using the Kapur entropy along with the SVM approach and perform classification.
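The mini-batch update with momentum can be sketched on a toy least-squares objective; this stands in for the network loss purely for illustration, and the objective, data, and function names are assumptions, not the paper's training code.

```python
import numpy as np

def minibatch_gd(X, y, lr=0.001, momentum=0.93, batch=64, epochs=50, seed=0):
    """Mini-batch gradient descent with momentum on the illustrative
    least-squares objective J(theta) = mean((X @ theta - y)^2) / 2."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    velocity = np.zeros_like(theta)
    for _ in range(epochs):
        idx = rng.permutation(len(X))          # shuffle each epoch
        for start in range(0, len(X), batch):
            b = idx[start:start + batch]       # random mini-batch
            grad = X[b].T @ (X[b] @ theta - y[b]) / len(b)
            velocity = momentum * velocity - lr * grad
            theta += velocity                  # momentum update
    return theta
```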

Feature selection and classification
The selection of specific features from the originally extracted feature space is important to improve accuracy and minimize computational time. Several feature selection techniques have been presented in the literature for selecting the most discriminant feature points. The main purpose of feature selection in this work is to reduce the feature length and minimize the execution time during the testing process. For this purpose, we propose a Kapur entropy along with SVM (EaSVM) approach. In this approach, the Kapur entropy is initially calculated as:

E_n = −Σ_j p_j log_2(p_j)

where E_n denotes the entropies of the apple leaf classes and p_j denotes the probability values of each class. Based on E_n, we define an activation function to maximize the entropy values: E_mx = argmax{E_n}. The value of E_mx is embedded in a threshold function, and the features equal to or greater than E_mx are selected and represented by E_sl. The features selected by this threshold are fed to the multiclass SVM (MSVM), which is employed as the fitness function and computes the error rate. This process is repeated up to 50 times, and the best feature vector is selected based on the minimum error rate. After the selection of the best feature vector, several classification methods are applied to obtain the final results.
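The entropy scoring and thresholding steps can be sketched as follows. This is a loose illustration only: the histogram-based probability estimate, the fractional threshold on the maximum entropy, and the omission of the MSVM fitness loop are all assumptions, since the paper does not fully specify these details.

```python
import numpy as np

def feature_entropy(F, bins=16):
    """Shannon-style entropy of each feature column, with a histogram
    estimate standing in for the per-class probabilities p_j."""
    ent = np.empty(F.shape[1])
    for j in range(F.shape[1]):
        counts, _ = np.histogram(F[:, j], bins=bins)
        p = counts / counts.sum()
        p = p[p > 0]                       # drop empty bins before log
        ent[j] = -np.sum(p * np.log2(p))
    return ent

def select_features(F, keep_ratio=0.5):
    """Keep columns whose entropy reaches a threshold derived from the
    maximum entropy E_mx; a full EaSVM loop would then score candidate
    subsets with an MSVM fitness function over repeated iterations."""
    ent = feature_entropy(F)
    threshold = keep_ratio * ent.max()
    return np.where(ent >= threshold)[0]
```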

EXPERIMENTAL RESULTS AND DISCUSSION
This method is validated on the Plant Village dataset [41]. The results are divided into two parts: i) segmentation results of the implemented contrast enhancement and Mask RCNN, and ii) classification results of apple diseases using deep learning.

Experimental setup
A few sample images from the Plant Village dataset are shown in Figure 5. The VGG image annotator (VIA) is used for manual annotation of the dataset; a few samples of the manually annotated images are shown in Figure 6. The infected leaf regions are separated by making a mask, and the annotated images are saved in the form of a JSON file, which is used as the ground truth for training. Apple scab, black rot, and cedar rust images are used for segmentation. In the evaluation of the segmentation task, an 80:20 train-test split is employed over the total of 1200 ground truth images; the split details are given in Table 3. The ResNet50 pre-trained model is utilized for deep feature extraction from the last two layers: average pooling and fully connected. A cross-validation strategy is used to compute classification results over different K folds (K = 5, 10, 15, 20). Ten different classification methods are used to evaluate the proposed system: linear discriminant, linear SVM, quadratic SVM, cubic SVM, medium Gaussian SVM, coarse Gaussian SVM, medium KNN, cubic KNN, ensemble bagged tree, and ensemble subspace discriminant. Classification performance is evaluated on a number of parameters including sensitivity, specificity, precision, AUC, FPR, and accuracy. Execution time is also calculated to see which method is more time-efficient. Classification results are obtained using a Core i7 4770 CPU, 16 GB RAM, and an NVIDIA GTX 1070 GPU.

Result analysis of segmentation using ResNet
The evaluation of the segmentation model is done by calculating precision and recall at multiple threshold values of intersection over union (IoU): 0.5, 0.75, and 0.85. Precision and recall at these thresholds for the ResNet50+FPN backbone are given in Tables 4 and 5. With the increase in the threshold value, the precision and recall values start to decline: at a lower threshold the false rate is maximized, while higher threshold values minimize such false identifications. Compared to training the network heads only, we obtain better results when training all layers of the network. The corresponding results for the ResNet101+FPN backbone are given in Tables 6 and 7. At the 0.5 threshold, the calculated precision is 0.793, and the recall is 0.532 when training network heads only versus 0.554 when training all layers. After obtaining the results for both backbones, we infer that the results with the ResNet101 backbone are better.
Average precision (AP) is computed for better evaluation of each case at the threshold values of 0.5, 0.75, and 0.85; the computed AP is shown in Tables 8 and 9. The highest AP, 0.861 at the 0.5 threshold, is observed when all layers of the network are trained with ResNet101 as the backbone. From these results, we observe that training all layers of the model with ResNet101+FPN (R101+FPN) as the backbone gives the best performance.
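The precision/recall computation at an IoU threshold can be sketched as follows; this is a generic greedy one-to-one matching of predicted boxes to ground truths (the box format and matching rule are assumptions, not the paper's evaluation script).

```python
def box_iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(preds, gts, iou_thresh=0.5):
    """Greedily match each prediction to an unmatched ground truth at
    the given IoU threshold; returns (precision, recall)."""
    matched = set()
    tp = 0
    for p in preds:
        for gi, g in enumerate(gts):
            if gi not in matched and box_iou(p, g) >= iou_thresh:
                matched.add(gi)
                tp += 1
                break
    fp = len(preds) - tp
    fn = len(gts) - tp
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return precision, recall
```

Raising `iou_thresh` makes matching stricter, which is why precision and recall decline at 0.75 and 0.85 in the tables above.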

Result analysis for classification
In this section, the classification results are presented. Table 10 shows the classification results on the average pooling layer for different types of classifiers; each classifier is validated by cross-validation. The best performance is attained with linear discriminant (LD) compared to the other classifiers using the average pooling layer features. For K = 5, the attained accuracy is 99.1%, with sensitivity of 99.25%, specificity of 99%, precision of 99.25%, area under the curve of 0.9925, and FPR of 0.0025. This accuracy of LD is also validated through the confusion matrix in Figure 7.
Figure 7a clearly shows that the proposed method gives better performance at K = 5: the apple rust class gives 100% accuracy, whereas the other three classes reach 99%. The second-best accuracy, 98.8%, is attained with LD over 10-fold cross-validation, with sensitivity, specificity, precision, AUC, and FPR of 99%, 99%, 98.75%, 0.9925, and 0.0025, respectively. The confusion matrix in Figure 7 also validates these results: Figure 7b shows that apple rust gives 100% accuracy, apple scab and black rot reach 99%, and the healthy apple class attains 98%. For K = 15, the best attained accuracy is 98.8% with LD, with sensitivity of 98.75%, specificity of 96%, precision of 98.75%, AUC of 0.99, and FPR of 0.005; Figure 7c shows that apple scab and apple rust attain 100% accuracy, whereas black rot and the healthy class give 99% and 96%, respectively. For K = 20, the best classification accuracy of 99.1% is attained with LD, with sensitivity of 99.25%, specificity of 98%, precision of 99%, AUC of 0.9925, and FPR of 0.0025. Processing time is also an important parameter for a real-time system, so we compute the processing time of each classifier, as given in Table 10; the time consumption of each classifier using the average pooling layer is shown in Figure 8. Table 11 shows the results of the proposed classification method on the fully connected layer. For K = 5, the best performance is attained with ensemble subspace discriminant (ESD) compared to the other classifiers using the fully connected layer features: the attained accuracy is 96.6%, with a sensitivity rate of 96.5%, specificity of 96%, precision of 96.75%, area under the curve of 1, and FPR of 0.01. This accuracy of ESD is also validated through the confusion matrix in Figure 9; Figure 9a shows that the proposed method gives better performance at K = 5.
The apple rust class gives 100% accuracy, whereas black rot and healthy give 96% and scab reaches 95%. When the classification accuracy is computed for K = 10, the best accuracy of 98.2% is obtained with the CSVM classifier, with a sensitivity rate of 98.5%, specificity of 98%, precision of 98.25%, AUC of 1, and FPR of 0.005. Figure 9b shows that the apple scab and black rot classes give 99% accuracy, while the others give 98%. For K = 15, the QSVM classifier attains the best accuracy of 98.5%, the highest among all classifiers, with a sensitivity rate of 98.75%, specificity of 98%, precision rate of 98.75%, AUC of 1.00, and FPR of 0.005; a confusion matrix validates these results, as illustrated in Figure 9c. For K = 20, QSVM gives the best accuracy of 97.9%; Figure 9d shows that the apple rust class gives 100% accuracy, whereas apple scab, black rot, and healthy give 99%, 98%, and 95%, respectively. Overall, ESD and QSVM give the better results. Processing time is also computed for the fully connected layer and plotted in Figure 10.
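The per-class metrics reported above follow directly from a confusion matrix; a minimal sketch of their standard definitions (rows as true classes and columns as predicted classes is an assumed convention):

```python
import numpy as np

def per_class_metrics(cm):
    """Sensitivity, specificity, precision, FPR, and overall accuracy
    from a confusion matrix (rows = true class, cols = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp   # missed positives per class
    fp = cm.sum(axis=0) - tp   # false alarms per class
    tn = cm.sum() - tp - fn - fp
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "fpr": fp / (fp + tn),
        "accuracy": tp.sum() / cm.sum(),
    }
```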
Comparing the results of the average pooling layer and the fully connected layer, we infer better performance on the average pooling layer in terms of both accuracy and time: it takes the minimum time to attain the better accuracy. A comparison with known methods is also presented in Table 12, which supports the superiority of the proposed framework.

CONCLUSION
A new method is presented for real-time apple leaf disease detection and classification using Mask RCNN and deep learning feature selection. The experimental results demonstrate the efficiency of the proposed technique for the detection and classification of apple leaf diseases. The results support our conclusion that the implementation of the contrast stretching approach improves the detection of the infected leaf regions and gives a better mean overlapping coefficient (MoC); this step is also useful in detecting minor spots with a lower error rate. For the classification step, selecting the best features minimizes the error rate and increases the precision value. Based on the comparison between the two layer features, it is observed that the average pool layer features are more discriminant than the FC layer features.
In future studies, we will try to include a larger number of ground truth images to train the MASK RCNN for improvement of the MoC value. Also, we will consider more different kinds of fruits for the classification task.