Classification of cassava leaf diseases using deep Gaussian transfer learning model

In Sub‐Saharan Africa, cassava diseases are diagnosed by experts who visually examine plants for disease symptoms on the leaves, a subjective method. Machine learning algorithms have been employed to identify and classify crop diseases quickly. In this study, we propose a model that integrates a transfer learning approach with a deep Gaussian convolutional neural network model. Two pre‐trained transfer learning models were used, namely MobileNet V2 and VGG16, together with three different kernels: a hybrid kernel (the product of a squared exponential kernel and a rational quadratic kernel), a squared exponential kernel, and a rational quadratic kernel. In experiments using MobileNet V2 and the three kernels, the hybrid kernel performed best, with an accuracy of 90.11%, compared to 86.03% for the squared exponential kernel and 85.14% for the rational quadratic kernel. Likewise, experiments using VGG16 and the three kernels showed that the hybrid kernel performed best, with an accuracy of 88.63%, compared to the squared exponential kernel's 84.62% and the rational quadratic kernel's 83.95%. All experiments were run on an ordinary computer without GPU access, which was the major limitation of the study.

calibration.8 Decision-makers are typically concerned not only with predictions but also with the degree of confidence in those predictions.9 One of the most appealing features of Gaussian processes is their capacity to provide accurately calibrated posterior distributions.10 Besides being frequently employed in the time-series and regression domains, Gaussian process models can also solve classification problems through a response function and variational inference. Gaussian processes (GPs) can be particularly helpful for problems with small datasets.11,12 In contrast to many supervised machine learning techniques, such as least squares and artificial neural networks (ANNs), Gaussian process models are resistant to overfitting and can quantify the uncertainty of their predictions.13 Although Gaussian process classifiers perform similarly to non-linear support vector machines (SVMs), they have been preferred for additional benefits such as uncertainty representation and hyper-parameter selection.14 Even though GPs offer uncertainty estimates and a marginal likelihood objective, their accuracy is limited by their inductive biases. Deep Gaussian processes are now used to address the drawbacks of Gaussian process models.15 A deep convolutional Gaussian process iteratively convolves several GP functions across an image.16 A deep Gaussian process (DGP) is a non-parametric deep Bayesian method that uses a hierarchical composition of GPs to address the limitations of traditional (single-layer) GPs while retaining their benefits.17,18 In this article, we extend our previously described technique, the deep Gaussian convolutional neural network (DGCNN) model,19 with a transfer learning strategy. Transfer learning (TL) is "a research subject in machine learning (ML) that focuses on retaining knowledge learned while addressing one problem and applying it to a different but related problem." The main advantage of TL is that it provides a better and faster response while requiring less effort to collect training data and construct the model.20 We hypothesize that integrating TL and the DGCNN may improve the proposed model's ability to identify diseases in cassava leaves. This study was developed within the field of artificial intelligence for development (AI4D) to help achieve sustainable development goal (SDG) number two: "End hunger, achieve food security and improved nutrition, and promote sustainable agriculture."

The motivation for this study
The accuracy of probability predictions made by learning algorithms can be assessed through the calibration procedure, and research on calibrating Gaussian processes exists in the literature.18 A classifier's output accurately depicts the likelihood of a specific class when it predicts a given class label with a probability p that corresponds to the actual proportion p of test points falling into that class.21 One of the best properties of Gaussian processes is their capacity to offer accurately calibrated posterior distributions.10 Gaussian processes are non-parametric Bayesian models well suited to modeling predictions. Traditional techniques frequently overlook crucial aspects of Bayesian models, namely the representation and propagation of uncertainty. In general, decision-makers are concerned not only with projections but also with how much confidence they can place in them; action cannot be taken until the model being evaluated is certain of its predictions. This is essential for applications14 such as self-driving cars, robotics,23 security, medical diagnosis, computer simulations,22 spatial-temporal modeling,24 link analysis and transfer learning,25 and indoor positioning systems.26 The Bayesian formalism provides a principled way to obtain these uncertainties. Bayesian strategies address all types of model uncertainty, whether in parameter inference or outcome prediction. Although Gaussian process classifiers perform comparably to non-linear SVMs, some practitioners prefer them for extra advantages such as uncertainty representation and hyperparameter selection.14 However, the greatest barrier to applying GPs to today's massive datasets is their computational and storage complexity.27,28 The problem of computational cost can be addressed with approximations or iterative matrix calculations.29,30 Motivated by recent advances in deep learning and deep Gaussian processes,19 this study proposes to combine transfer learning (pre-trained deep neural networks) with a DGCNN. The proposed model exploits the strengths of both deep transfer learning and deep Gaussian processes.

Transfer learning
According to Reference 21, transfer learning (TL) is characterized in terms of a task and a domain. A domain $\mathcal{D} = \{\mathcal{X}, P(X)\}$ consists of a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$, where $X = \{x_1, \ldots, x_n\} \in \mathcal{X}$. Given a particular domain $\mathcal{D}$, a task $\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}$ consists of two parts: a label space $\mathcal{Y}$ and an objective prediction function $f(\cdot)$, which is learned from training data consisting of pairs $\{x_i, y_i\}$, where $x_i \in X$ and $y_i \in \mathcal{Y}$. The function $f(\cdot)$ can then predict the label $f(x)$ of a new instance $x$. Transfer learning seeks to enhance the learning of the target prediction function $f_T(\cdot)$ in the target domain $\mathcal{D}_T$ by utilizing the knowledge in the source domain $\mathcal{D}_S$ and the source learning task $\mathcal{T}_S$, where $\mathcal{D}_S \neq \mathcal{D}_T$ or $\mathcal{T}_S \neq \mathcal{T}_T$. In this study, the transfer learning approach is integrated with the DGCNN19 that we previously proposed for the detection and classification of cassava diseases.

A review of some of the pre-trained transfer learning models

VGG16
The ConvNets of the VGG architecture 31 are trained on fixed-size RGB images with a dimension of 224 × 224. The only preprocessing is subtracting the mean RGB value, computed on the training set, from each pixel. The image is passed through a series of convolutional layers, each with a very small 3 × 3 receptive field (the smallest size that can capture the notions of left/right, up/down, and center). One of the configurations also uses 1 × 1 convolution filters, which may be thought of as a linear transformation of the input channels. For the 3 × 3 convolutional layers, the convolution stride is set to 1 pixel, and the spatial padding of the convolutional layer input is set to 1 pixel to maintain the spatial resolution after convolution. Spatial pooling is provided by five max-pooling layers that follow some of the convolutional layers; max-pooling is performed over a 2 × 2-pixel window with stride 2. Three fully connected layers are added after the stack of convolutional layers, and a soft-max layer comes last. All networks share the same arrangement of fully connected layers. The rectification (ReLU) 32 non-linearity is applied in every hidden layer. Local response normalization is absent from all but one of the networks. The VGG19 model is similar to VGG16 except that it has 19 rather than 16 weight layers (convolutional and fully connected layers); VGG19 has three more convolutional layers than VGG16. 33 For details, refer to Reference 31.
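As a minimal sketch of this starting point, a pre-trained VGG16 can be loaded in Keras as a frozen feature extractor; the dense soft-max head below is an illustrative stand-in, not the DGCNN prediction layer the study actually attaches:

```python
import tensorflow as tf

# Load VGG16 pre-trained on ImageNet, without its fully connected top,
# so that a new prediction head can be attached.
base = tf.keras.applications.VGG16(
    weights="imagenet",
    include_top=False,           # drop the three fully connected layers
    input_shape=(224, 224, 3),   # VGG16's fixed RGB input size
)
base.trainable = False           # freeze the convolutional feature extractor

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # 5 cassava classes
])
model.summary()
```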

MobileNetV2
MobileNetV2 34 improves the performance of mobile models across a variety of model sizes, tasks, and benchmarks. The key idea of the MobileNet models is to replace expensive convolutional layers with depthwise separable convolutional blocks, which consist of a 3 × 3 depthwise convolutional layer that filters the input, followed by a 1 × 1 pointwise convolutional layer that combines the filtered values to create new features. This achieves the same objective as conventional convolution while being substantially faster. In the MobileNetV1 design, 13 depthwise separable convolutional blocks follow a standard 3 × 3 convolution. Each block in MobileNetV2 contains a 1 × 1 expansion layer, a depthwise convolutional layer, and a 1 × 1 pointwise (projection) convolutional layer. In contrast to MobileNetV1, the projection layer of MobileNetV2 converts a tensor with many channels into one with few channels, so the output of each bottleneck residual block passes through a bottleneck. The 1 × 1 expansion convolutional layer multiplies the number of channels according to the expansion factor before the depthwise convolution. MobileNetV2's second new feature is the residual connection, which allows gradients to flow through the network more easily. Each layer of MobileNetV2 has batch normalization and the ReLU6 activation function, except for the output of the projection layer, which has no activation. In the MobileNetV2 architecture, the 17 bottleneck residual blocks are followed by a standard 1 × 1 convolution, a global average pooling layer, and a classification layer. For details, refer to References 34 and 35.
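A minimal sketch of one bottleneck residual block as just described, assuming the Keras layer API (illustrative, not code from the study):

```python
import tensorflow as tf
from tensorflow.keras import layers

def inverted_residual(x, expansion, out_channels, stride):
    """One MobileNetV2-style bottleneck residual block (sketch)."""
    in_channels = x.shape[-1]
    # 1x1 expansion layer: multiply channels by the expansion factor.
    h = layers.Conv2D(expansion * in_channels, 1, use_bias=False)(x)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)          # ReLU6
    # 3x3 depthwise convolution filters each channel separately.
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same",
                               use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)
    # 1x1 projection layer back to few channels, with no activation
    # (the linear bottleneck).
    h = layers.Conv2D(out_channels, 1, use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    # Residual connection only when the shapes allow it.
    if stride == 1 and in_channels == out_channels:
        h = layers.Add()([x, h])
    return h
```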

The ResNet
The plain baselines of the ResNet 36 design were inspired by the philosophy of the VGG networks. 31 The convolutional layers mostly have 3 × 3 filters and follow two essential design rules: for a particular output feature map size, each layer has the same number of filters, and if the output feature map size is halved, the number of filters is doubled to preserve the time complexity per layer. Downsampling is performed directly by convolutional layers with a stride of 2. The network ends with a global average pooling layer and a 1000-way fully connected layer with softmax, for a total of 34 weighted layers. Compared with ResNet, the VGG networks are more intricate and contain more filters: ResNet's 34-layer baseline has 3.6 billion FLOPs (multiply-adds), far fewer than VGG-19's 19.6 billion. Shortcut connections are then added, which turn the plain network into its residual counterpart. The identity shortcuts can be used directly when the input and output dimensions coincide. When the dimensions increase, there are two options: the shortcut still performs identity mapping, with extra zero entries padded to account for the greater dimensions (adding no extra parameters), or a projection shortcut is used to match the dimensions. For both options, when the shortcuts cross feature maps of two sizes, they are performed with a stride of 2. For details, refer to Reference 36.
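The following sketch illustrates a ResNet basic block with the shortcut logic described above; for simplicity it uses only the projection shortcut (the second option) when dimensions change, rather than zero-padding:

```python
import tensorflow as tf
from tensorflow.keras import layers

def basic_block(x, filters, downsample=False):
    """A ResNet basic block with identity or projection shortcut (sketch)."""
    stride = 2 if downsample else 1
    h = layers.Conv2D(filters, 3, strides=stride, padding="same",
                      use_bias=False)(x)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU()(h)
    h = layers.Conv2D(filters, 3, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    # Projection shortcut (1x1 conv, stride 2) when dimensions change;
    # otherwise the identity shortcut is used directly.
    shortcut = x
    if downsample or x.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride,
                                 use_bias=False)(x)
        shortcut = layers.BatchNormalization()(shortcut)
    return layers.ReLU()(layers.Add()([shortcut, h]))
```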

InceptionV3
In terms of cost and parameter efficiency, inception networks (GoogLeNet/Inception v1) 37 outperform VGGNet. 31 However, an inception network must be altered with care to avoid losing its computational advantages, and the uncertainty around a new network's efficacy makes it challenging to modify an inception network for different use cases. For the InceptionV3 38 model, numerous network optimization strategies have been proposed to reduce these constraints and promote model adaptation, including factorized convolutions, regularization, dimension reduction, and parallelized calculations. For details, refer to Reference 39.

DenseNet201
A dense convolutional network (DenseNet) is a feed-forward network in which each layer is connected to every other layer. 40 Whereas a typical convolutional network with L layers has L connections, one between each layer and the one following it, this network has L(L + 1)/2 direct connections: each layer uses the feature maps of all preceding layers as inputs, and its own feature maps are used as inputs to all subsequent layers. Dense networks alleviate the vanishing-gradient problem, enhance feature propagation, encourage feature reuse, and drastically reduce the number of parameters. For details, refer to Reference 40.
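A minimal sketch of this dense connectivity pattern, assuming DenseNet's BN-ReLU-Conv composite ordering and an illustrative growth rate:

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers, growth_rate):
    """Dense connectivity: each layer sees all preceding feature maps."""
    features = [x]
    for _ in range(num_layers):
        # Concatenate every feature map produced so far.
        h = layers.Concatenate()(features) if len(features) > 1 else features[0]
        h = layers.BatchNormalization()(h)
        h = layers.ReLU()(h)
        h = layers.Conv2D(growth_rate, 3, padding="same", use_bias=False)(h)
        features.append(h)
    return layers.Concatenate()(features)

# With L layers, the block contains L(L + 1)/2 direct connections.
```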

The squared exponential kernel
The squared exponential (SE) kernel function is the most popular kernel or covariance function38,41-43 and the de facto default kernel for Gaussian processes, because it is broadly applicable and can be integrated against most functions. The SE kernel is also known as the RBF kernel or the Gaussian kernel function. Equation (1) defines the SE kernel as

$$k_{\mathrm{SE}}(x, x') = \sigma^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right), \tag{1}$$

where the two parameters are the lengthscale $\ell$ and the output variance $\sigma^2$. The lengthscale $\ell$ controls how much the function "wiggles"; extrapolating more than $\ell$ units away from the data is typically challenging. The output variance $\sigma^2$ determines how far the function is, on average, from its mean. Every kernel has this scaling factor.
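A quick numerical sketch of Equation (1) for one-dimensional inputs, using plain NumPy and the unit parameters adopted later in the experiments:

```python
import numpy as np

def squared_exponential(x1, x2, variance=1.0, lengthscale=1.0):
    """SE/RBF kernel of Equation (1) for 1-D inputs."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2   # pairwise squared distances
    return variance * np.exp(-sq_dist / (2.0 * lengthscale ** 2))

x = np.linspace(0.0, 5.0, 6)
K = squared_exponential(x, x)   # 6 x 6 covariance matrix; K[i, i] == variance
```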

The rational quadratic kernel
The rational quadratic (RQ) kernel is equivalent to combining multiple SE kernels with different lengthscales.38,41-43 GP priors with this kernel therefore anticipate functions that vary smoothly across a broad range of lengthscales. The parameter $\alpha$ controls the relative weighting of large-scale and small-scale variations; as $\alpha \to \infty$, the RQ kernel becomes identical to the SE kernel. Equation (2) specifies the RQ kernel as

$$k_{\mathrm{RQ}}(x, x') = \sigma^2 \left(1 + \frac{(x - x')^2}{2\alpha\ell^2}\right)^{-\alpha}. \tag{2}$$

Designing new covariance or kernel functions
The conventional kernels discussed above perform well if all of the data are of the same kind, but what happens when there is a range of feature types to regress on simultaneously?38 A new kernel can then be created that is well suited to the available data, and numerous studies38,41-43 suggest methods for creating new kernels from existing ones. Multiplying kernels together is the conventional way to combine them, especially if they are defined on different inputs. Multiplying two kernels is comparable to an AND operation: the resulting kernel is high only where both base kernels are high. Equation (3) shows that, given two valid kernels $k_1(x, x')$ and $k_2(x, x')$, their product

$$k(x, x') = k_1(x, x') \, k_2(x, x') \tag{3}$$

is a valid kernel. An extension of this argument demonstrates that $k^q(x, x')$ is a valid covariance function for $q \in \mathbb{N}$.43 For further information on kernel design, see References 38,41-43.
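A small numerical check of the product rule in Equation (3): with unit parameters and one-dimensional inputs (illustrative assumptions), the elementwise product of two valid Gram matrices remains positive semi-definite:

```python
import numpy as np

def rational_quadratic(x1, x2, variance=1.0, lengthscale=1.0, alpha=1.0):
    """RQ kernel of Equation (2) for 1-D inputs."""
    sq = (x1[:, None] - x2[None, :]) ** 2
    return variance * (1.0 + sq / (2.0 * alpha * lengthscale ** 2)) ** (-alpha)

x = np.linspace(0.0, 5.0, 50)
sq = (x[:, None] - x[None, :]) ** 2
K_se = np.exp(-sq / 2.0)                     # SE kernel with unit parameters
K_prod = K_se * rational_quadratic(x, x)     # elementwise (Schur) product

# The product of two valid kernels is itself valid: the Gram matrix stays
# positive semi-definite (eigenvalues >= 0 up to numerical error).
assert np.linalg.eigvalsh(K_prod).min() > -1e-9
```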

Review of related work on crop leaf disease detection and classification
This section reviews related work on automated detection and classification of crop leaf diseases using deep learning models. Further works can be found in Reference 44. We note, however, that some articles behind paywalls were not accessible in Uganda and were therefore not included in the review below.
The study 44 proposed a model using the probability distribution functions of image color histograms and the convolution of the Chebyshev orthogonal functions. In comparison to the baseline network, the results generated using the modified MobileNetV2 neural network demonstrated a statistically significant improvement in the accuracy of cassava leaf disease recognition.
The study 45 reviewed the application of convolutional neural networks (CNNs) to the detection of plant leaf diseases. The results revealed that deep convolutional neural networks trained on image data were the most effective technique for spotting disease early. Moreover, owing to the apparent similarity of the architectures used, the majority of studies report almost equivalent results. To avoid significant duplication of work, new specifications, experiments, and architectures are needed.
The study 47 proposed a model for the classification of cassava leaf diseases using deep neural networks. The proposed model was trained using a transfer learning approach and achieved accuracies of 81.43% and 89.09% on the original and segmented datasets, respectively.
The study 48 proposed a novel neural network model for classifying orange defects. The proposed technique takes orange images as inputs and computes a gray-level co-occurrence matrix to extract the texture and gray features of the defect area; the defect areas are then categorized using an RBPNN-based classifier. The experimental outcomes show that a classification accuracy of up to 88% was attained.
The study 49 proposed a deep residual convolutional neural network (DRNN) model for detecting cassava mosaic disease. According to the experimental findings, classification accuracy improves when a balanced image dataset is employed. On the cassava disease dataset, the proposed DRNN model outperforms a plain convolutional neural network by 9.25%.
To identify and categorize the most prevalent guava plant diseases, the study 50 proposed an artificial intelligence (AI) driven framework. On a high-resolution image dataset of guava leaves and fruit, the proposed framework is assessed. On a set of RGB, HSV, and LBP features, the bagged tree classifier produced the best recognition results (99% accuracy in identifying four guava fruit disorders against healthy fruit). By adopting early safeguards, the proposed framework might assist the farmers in preventing potential production loss.
The study 51 developed a model that combines IoT and deep learning into an automatic and intelligent data collection and classification framework. Although the proposed model's classification performance is on par with cutting-edge models, training time was cut by 86.67%.
The study 52 proposed a double generative adversarial network (DoubleGAN) for plant disease detection. With DoubleGAN, the dataset was enlarged, the generated images were clearer than those of a DCGAN (deep convolutional generative adversarial network), and the accuracy of identifying plant species and diseases was 99.80% and 99.53%, respectively.

Dataset used
The cassava leaves dataset 3 used was divided into five categories: cassava bacterial blight (CBB), cassava brown streak disease (CBSD), cassava green mite (CGM), cassava mosaic disease (CMD), and healthy (316 images). The image data was gathered via smartphones in a crowdsourcing experiment run by the National Crops Resources Research Institute (NaCRRI) and the Artificial Intelligence Lab at Makerere University. NaCRRI is the government agency in Uganda in charge of research on crops. NaCRRI experts carefully labeled each image, scoring it for the incidence and severity of each disease. All experiments used k-fold cross-validation. Data preparation involved four main steps: data acquisition, data cropping, data annotation, and data verification; the details are in Reference 3. The dataset is highly non-linearly separable 46 and imbalanced, with CMD and CBSD dominating the other classes. The class-imbalance problem was addressed through data augmentation, which applies random transformations to the source images and also increases the generalizability of the trained model.
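A minimal augmentation pipeline in the spirit described above, assuming Keras preprocessing layers; the specific transformations and their magnitudes are illustrative assumptions, not the study's exact settings:

```python
import tensorflow as tf

# Random transformations applied to source images to counter class
# imbalance and improve the trained model's generalization.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomContrast(0.1),
])

# Applied on the fly during training, for example:
# train_ds = train_ds.map(lambda x, y: (augment(x, training=True), y))
```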

The proposed hybrid kernel
This section discusses the proposed covariance function, which combines a rational quadratic kernel with a squared exponential kernel (also known as the radial basis function (RBF) kernel or the Gaussian kernel function). According to the studies,38,42 applying the squared exponential (SE) kernel to functions drawn from a Gaussian process yields a smooth prior. The rational quadratic (RQ) kernel, like the SE kernel, yields a comparatively smooth prior on functions sampled from the Gaussian process. Owing to its reasonable performance across datasets, the SE kernel is frequently used as the default kernel in Gaussian process studies, while the RQ kernel is equivalent to adding together several RBF kernels with various lengthscales. In this study, we propose that a combination of the SE and RQ kernels produces a powerful model. The parameters of the hybrid model are the lengthscale $\ell$, which indicates the GP's "wiggliness," and the variance $\sigma^2$, which regulates the amplitude.42,43,53 Following Equation (3), we propose the hybrid covariance function of Equation (5), the product of a squared exponential kernel and a rational quadratic kernel:

$$k_{\mathrm{hybrid}}(x, x') = k_{\mathrm{SE}}(x, x') \, k_{\mathrm{RQ}}(x, x') = \sigma^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right) \left(1 + \frac{(x - x')^2}{2\alpha\ell^2}\right)^{-\alpha}, \tag{5}$$

where $\ell$ is the lengthscale, $\sigma^2$ is the output variance (here absorbing the product of the two kernels' variances), and the parameter $\alpha$ controls how much the large- and small-scale variations are weighted. For details on kernel design, refer to References 42,43,53.
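Since the study builds on GPflow, the hybrid kernel of Equation (5) can be sketched with GPflow's kernel-combination operators; the unit parameters match those reported in the experiments:

```python
import gpflow

# Hybrid covariance function: the product of an SE kernel and an RQ
# kernel, with default unit lengthscale and variance as in the paper.
se = gpflow.kernels.SquaredExponential(variance=1.0, lengthscales=1.0)
rq = gpflow.kernels.RationalQuadratic(variance=1.0, lengthscales=1.0)
hybrid = se * rq   # equivalent to gpflow.kernels.Product([se, rq])
```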

The proposed model
The final fully connected layer of the pre-trained model was removed during the development of the proposed approach and replaced by a DGCNN 19 that serves as the prediction layer. The DGCNN model is built from GP layers, each representing the prior and posterior of a single multi-output GP. A GP layer can be likened to a typical fully connected (dense) layer in a DNN because it has an infinite number of basis functions. It is described by three GPflow 54 objects: a kernel, inducing variables, and a mean function. The DGCNN model is then built with Keras' sequential model, as shown in Figure 1.

FIGURE 1: Initializing, fitting, and evaluating the proposed model.

Convolutional neural networks and the DGP model were combined to form the DGCNN model. The DGP model is composed of the kernel function, the inducing variables, the GP layer, and the likelihood layer. The likelihood layer computes the variational expectation of the objective and is in charge of handling our likelihood distribution p(y|f). The 10 dense layers of the CNN model each comprise 100 units, with ReLU 55 as the activation function for non-linearity. The final layers of the neural network reduce the number of dimensions to one and do not employ non-linearity; the neural network layers are thus understood to perform a non-linear feature-warping process. The CNN model and the DGP model are integrated using Keras' sequential model, with the latter serving as the prediction layer. The proposed kernel from Equation (5) was employed as the kernel function. In all of our trials, the kernel functions' default parameters were used: the lengthscale $\ell$, which represents the "wiggliness" of the GP, and the variance $\sigma^2$, which controls its amplitude,42,43,53 were both set to 1. The marginal likelihood was then employed for hyperparameter selection. Default settings were kept for the pre-trained models. The DGP model also made use of the Gaussian likelihood and the likelihood container. The DGP model was implemented with DeepGP from the GPflux library; DeepGP is a specialization of the Keras model class, so Keras handles data minibatching and the tracking of our training losses. The dataset was scaled using a standard scaler, 56 and principal component analysis (PCA) 57 was employed for dimension reduction. Google Colab was used for all of the experiments, together with a laptop running Windows 11 with a 2.50 GHz Intel Core i7 processor, 16 GB of RAM, and a 2 GB NVIDIA GeForce MX150 GPU. GPflow 54 and GPflux 58 were both utilized as Python libraries. Figure 2 below shows the proposed model's pseudocode.
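A heavily simplified sketch of this hybrid architecture, following GPflux's pattern for combining Keras dense layers with a GP layer; the feature dimension, inducing-point count, and optimizer settings are illustrative assumptions rather than the study's exact configuration:

```python
import numpy as np
import tensorflow as tf
import gpflow
import gpflux

num_data, num_features, num_inducing = 5656, 50, 100   # illustrative sizes

# Hybrid kernel of Equation (5) for the GP prediction layer.
kernel = gpflow.kernels.SharedIndependent(
    gpflow.kernels.SquaredExponential() * gpflow.kernels.RationalQuadratic(),
    output_dim=1,
)
inducing = gpflow.inducing_variables.SharedIndependentInducingVariables(
    gpflow.inducing_variables.InducingPoints(np.random.randn(num_inducing, 1))
)
gp_layer = gpflux.layers.GPLayer(kernel, inducing, num_data=num_data,
                                 num_latent_gps=1)

# Ten 100-unit ReLU dense layers warp the features; a final linear layer
# reduces the representation to one dimension before the GP layer.
nn_layers = [tf.keras.layers.Dense(100, activation="relu",
                                   input_shape=(num_features,))]
nn_layers += [tf.keras.layers.Dense(100, activation="relu") for _ in range(9)]
model = tf.keras.Sequential(nn_layers + [tf.keras.layers.Dense(1), gp_layer])

# The Gaussian likelihood enters through GPflux's likelihood container,
# used here as the Keras loss; Keras then handles minibatching.
model.compile(
    loss=gpflux.losses.LikelihoodLoss(gpflow.likelihoods.Gaussian()),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
)
```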

Model evaluation
The model was evaluated using accuracy, precision, recall, F1-score, and the loss function. Accuracy is measured as the proportion of correctly predicted outcomes among all predictions. 59 Accuracy was chosen because it quickly indicates a model's quality and because it is effective for binary classification problems. Additionally, precision, recall, and F1-score were used to assess the models' performance. The performance measures are calculated using the formulae in Equations (6) through (9):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{7}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{8}$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{9}$$

where TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative, respectively. The loss function 58 was also used as a performance metric.
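Equations (6) through (9) can be computed directly with scikit-learn; the labels below are illustrative, and macro averaging is one reasonable choice for the five cassava classes:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Illustrative ground-truth and predicted labels for five classes.
y_true = [0, 1, 2, 2, 1, 0, 3, 4, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 3, 4, 2, 2]

print("accuracy :", accuracy_score(y_true, y_pred))
# Macro averaging weights the five cassava classes equally.
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("f1-score :", f1_score(y_true, y_pred, average="macro"))
```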

FIGURE 2: The pseudocode of the proposed model.

Pre-trained model parameters
During this study, the pre-trained models are combined with a DGCNN as the top classification layer. We consequently need to know each model's parameters: the total, trainable, and non-trainable parameters. VGG16, VGG19, ResNet50, InceptionV3, DenseNet201, and MobileNetV2 are the six pre-trained models considered in this study. As indicated in Table 1 below, ResNet50 contains more total and trainable parameters than the other pre-trained models, while MobileNetV2 has the fewest. VGG16 and VGG19 have no non-trainable parameters, whereas DenseNet201 has the most non-trainable parameters.

TABLE 1: Pre-trained models with their corresponding parameters.
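One way to reproduce such parameter counts, assuming the Keras applications API (a sketch; the exact numbers depend on the input shape and whether the top is included):

```python
import numpy as np
import tensorflow as tf

# Load a pre-trained model without its top and sum variable sizes.
base = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False)
trainable = sum(int(np.prod(v.shape.as_list()))
                for v in base.trainable_variables)
non_trainable = sum(int(np.prod(v.shape.as_list()))
                    for v in base.non_trainable_variables)
print(f"total: {trainable + non_trainable:,}  "
      f"trainable: {trainable:,}  non-trainable: {non_trainable:,}")
```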

Comparative performance of the pre-trained models on the cassava dataset
In this experiment, we measure a model's performance using k-fold cross-validation, specifically 10-fold cross-validation. This helps characterize the variance of the model with respect to the stochastic nature of the learning process and variations across the training and test datasets. When estimating a confidence interval, the model's performance can be taken as the mean performance across the k folds, supplied with the standard deviation. This study used k = 10, a value that has been found empirically to yield a model skill estimate with low variance and low bias. 60 We did a performance study on VGG16, VGG19, ResNet50, InceptionV3, DenseNet201, and MobileNetV2; VGG16 and MobileNetV2 performed better than the other models. We did this comparison because different algorithms perform differently on different versions of the dataset: different algorithms make different assumptions (for instance, some algorithms do not need normalized data, while a gradient descent-based algorithm does), and the differences in performance can also arise from the models' different architectures, for example, the number of gates used, how they handle overfitting, and the kinds of activations used. Therefore, after applying data augmentation to the original dataset, we did a comparative study of VGG16, VGG19, ResNet50, InceptionV3, DenseNet201, and MobileNetV2. Whereas all the models achieved relatively high training accuracy in this experiment, the majority had low validation accuracy, an indication of overfitting. However, MobileNetV2 and VGG16 exhibited better performance on both training and validation accuracies than the rest. The performance of the models is presented in Table 2.
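A sketch of the 10-fold protocol, assuming a scikit-learn-style estimator; build_model, X, and y are hypothetical placeholders for the actual pipeline:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, build_model, k=10):
    """Mean and standard deviation of accuracy across k stratified folds."""
    scores = []
    splitter = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    for train_idx, test_idx in splitter.split(X, y):
        model = build_model()                 # fresh model per fold
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(scores), np.std(scores)   # mean +/- std across folds
```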

Experiments with the proposed model
We demonstrate our proposed model's performance in this section. The model takes the chosen pre-trained models, namely MobileNet V2 and VGG16, and replaces the top layer with a DGCNN as the prediction layer. In experiments using the MobileNet V2 and VGG16 pre-trained models, the hybrid kernel outperformed both the squared exponential (SE) kernel and the rational quadratic (RQ) kernel; the squared exponential kernel, in turn, performed slightly better than the rational quadratic kernel. The outcomes also showed that accuracy rose with the number of epochs. According to Tables 3 and 4, the hybrid kernel performed better overall than the squared exponential kernel and the rational quadratic kernel. Further findings showed that the MobileNet V2 pre-trained model performed better than VGG16, although the difference in performance was small. Results from experiments employing MobileNet V2 and VGG16 are displayed in Tables 3 and 4, respectively. Figure 3 illustrates the training loss used in this study as a performance measure: the top row displays loss versus epochs using MobileNet V2, and the bottom row displays the results using VGG16.

TABLE 2: Classification results with pre-trained models at epoch 1000.

The findings show that the squared exponential kernel, followed by the hybrid kernel and then the rational quadratic kernel, provided the best loss. The loss with the MobileNet V2 pre-trained model followed a roughly linear trend across the squared exponential (SE), rational quadratic (RQ), and hybrid kernels, whereas the results with the VGG16 pre-trained model did not.

DISCUSSION
In this study, we propose a novel model that integrates transfer learning with a DGCNN for the classification of cassava diseases. We first compared the performance of pre-trained transfer learning models. Six pre-trained models were considered: VGG16, VGG19, ResNet50, InceptionV3, DenseNet201, and MobileNetV2. Table 2 shows the performance of these pre-trained transfer learning models. It is evident from Table 2 that InceptionV3, DenseNet201, and MobileNetV2 had overfitting issues, ResNet50 had underfitting issues, and VGG16 and VGG19 performed reasonably well on both training and validation. The results further revealed that InceptionV3 and MobileNetV2 performed better than the rest of the models, with the same training accuracy of 99.70%; however, MobileNetV2 beat InceptionV3 on validation accuracy, and VGG16 performed better than the rest of the models on validation accuracy.

Motivated by the experimental results in Table 2, we did further experiments with VGG16 and MobileNetV2 using the DGCNN 19 as a prediction layer. Three covariance functions were used in this study: the hybrid kernel, the rational quadratic kernel, and the squared exponential kernel. As shown in Table 3, the hybrid kernel outperformed the squared exponential kernel and the rational quadratic kernel in experiments using MobileNet V2 as the pre-trained model and the DGCNN as the prediction layer: the accuracy for the hybrid kernel, rational quadratic kernel, and squared exponential kernel was 90.1%, 85.4%, and 86.3%, respectively. Table 4 displays the tests with the VGG16 pre-trained model using the DGCNN as a prediction layer; its findings show that the hybrid kernel again performed better overall than the squared exponential kernel and the rational quadratic kernel, with accuracies of 88.6% versus 84.6% and 83.9%, respectively. The results also showed that the pre-trained MobileNet V2 model performed better than VGG16. When loss is used as the performance measure, the hybrid kernel performs best, followed by the squared exponential kernel and then the rational quadratic kernel; the training loss experiments with the pre-trained MobileNet V2 and VGG16 models confirm this ordering. In our experiments, the lengthscale and variance parameters of the kernel functions were both set to 1, and the marginal likelihood was then employed for hyperparameter selection.

During this study, we also compared our results with those of the original study 3 that released the cassava leaf dataset we used. Our proposed model achieved accuracies of 90.1% and 88.6% with MobileNet V2 and VGG16, respectively, whereas the study 3 reported an accuracy of 93%, so our results show a reduction in performance of about 2%. However, the study 3 did not report precision, recall, or F1-score; our results on those metrics are reported in Tables 3 and 4.

CONCLUSION
In this study, we combined a DGCNN with a pre-trained transfer learning strategy to detect and classify cassava diseases. Three different kernels were used in the experiments: a hybrid kernel (the product of a squared exponential kernel and a rational quadratic kernel), a squared exponential kernel, and a rational quadratic kernel. Two pre-trained transfer learning models, MobileNet V2 and VGG16, were considered throughout this study. Experiments using MobileNet V2 and the three kernels showed improved performance compared to those using VGG16 and the three kernels. Additionally, experiments using MobileNet V2 and the three kernels showed that the hybrid kernel performed better than the squared exponential kernel and the rational quadratic kernel. The primary computational resource limitation of this study was the use of an ordinary laptop computer for all experiments, which led to several experiments taking several minutes to complete before yielding results. In future work, we propose to compare the proposed model's performance to that of kernelized SVMs on the same dataset to ascertain whether the proposed model performs better. We also propose to develop an algorithm that integrates an ensemble of pre-trained transfer learning models with a DGCNN to check whether there is a performance improvement.

CONFLICT OF INTEREST STATEMENT
The authors have no conflict of interest relevant to this article.

DATA AVAILABILITY STATEMENT
The data and the code used during this study will be shared on request.