Electricity theft detection for energy optimization using deep learning models

The rapid increase in nontechnical loss (NTL) has become a principal concern for distribution system operators (DSOs) over the years. Electricity theft makes up a major part of NTL. It causes losses for the DSOs and also deteriorates the quality of electricity. The introduction of advanced metering infrastructure, along with the upgrading of traditional grids to smart grids (SGs), has helped electric utilities collect the electricity consumption (EC) readings of consumers, which further empowers machine learning (ML) algorithms to be exploited for efficient electricity theft detection (ETD). However, there are still some shortcomings in the existing ML-based theft classification schemes, such as class imbalance, the curse of dimensionality, and the omission of automated hyperparameter tuning, that limit their performance. Therefore, it is essential to develop a novel approach to deal with these problems and efficiently detect electricity theft in SGs. Using the salp swarm algorithm (SSA), gate convolutional autoencoder (GCAE), and cost-sensitive learning and long short-term memory (CSLSTM), an effective ETD model named SSA-GCAE-CSLSTM is proposed in this work. Furthermore, a hybrid GCAE model is developed via the combination of the gated recurrent unit and the convolutional autoencoder. The proposed model comprises five submodules: (1) data preparation, (2) data balancing, (3) dimensionality reduction, (4) hyperparameters' optimization, and (5) electricity theft classification. The real EC data provided by the State Grid Corporation of China are used for performance evaluations via extensive simulations. The proposed model is compared with two basic models, CSLSTM and GCAE-CSLSTM, along with seven benchmarks: support vector machine, decision tree, extra trees, random forest, adaptive boosting, extreme gradient boosting, and convolutional neural network. The results exhibit that SSA-GCAE-CSLSTM yields 99.45% precision, 95.93% F1 score, 92.25% accuracy, and 71.13% area under the receiver operating characteristic curve score, and surpasses the other models in terms of ETD.


KEYWORDS
convolutional autoencoder, deep learning, electricity theft detection, long short-term memory, smart grids, weighting

1 | INTRODUCTION
Electricity makes a significant contribution to the lives of human beings, where nearly everyone in the world has access to it. 2,3 On the basis of the distribution system operators' opinion, electricity loss can be categorized into technical loss (TL) and nontechnical loss (NTL). 4 TL is a type of electricity loss that takes place within the power distribution network due to heating and burning up of the hardware devices involved in electricity transfer, such as electric transformers, cables, conductors, and other equipment. 5 NTL refers to the amount of electricity consumed but not billed. It occurs due to several factors, such as measurement error, malfunctioning of smart meters (SMs), electricity theft, and so forth. 6 However, the principal reason for NTL occurrence is electricity theft. Electricity theft is an unpleasant event that leads to huge economic losses globally. The total monetary losses that occur throughout the world due to electricity theft are more than $25 billion yearly, of which a $4.5 billion loss occurs in India 7 and more than a ¥100 million loss occurs in the Fujian province of China. 8 Due to the increased on-field inspection and the emergence of smart grids (SGs), traditional electricity theft methods, such as meter bypassing, are currently outdated and replaced by modern theft strategies, such as the transmission of tampered (meddled) SM readings. The SM readings can be tampered with before, within, and after the SMs. 1 Classical theft detection approaches, such as state-based and game theory-based approaches, 9 are employed for NTL detection. However, the game theory-based methods are not optimal while the state-based methods are costly. Currently, a huge amount of the historical electricity consumption (EC) data of electricity consumers collected by advanced metering infrastructure (AMI) is easily available; therefore, machine learning (ML) techniques obtain satisfactory and better performance for electricity theft detection (ETD) in SGs. The ML techniques 10-18 employ the EC history of consumers collected by the AMI in SGs. Normally, electricity consumers' EC adopts a specific symmetric and synchronized statistical pattern, whereas asymmetry in the consumers' EC profiles can be an indication of theft (abnormal) consumers' activities. The EC history database is used to train the ML classifier, which is later employed to predict and find malicious patterns. Since these classifiers utilize the ready-made SM EC history, their costs are not high. However, there are some problems in the available ML-based theft classification techniques that restrict their performance with respect to theft classification accuracy, precision, and F1 score. One of the problems is class imbalance. This problem refers to the concept that the number of electricity theft and nontheft users is not the same. Moreover, the nontheft users heavily outnumber the theft users. This is due to the normal phenomenon that the EC instances for nontheft users are available in a huge range while theft samples are very rarely available or sometimes do not exist in the real environment.
Another issue in employing ML classifiers in ETD tasks is the curse of dimensionality. This aspect refers to the issue that arises while working with high-dimensional data. 4 Classification loss, in such situations, is directly proportional to the number of features (dimensions). Ignoring this issue badly affects the performance of ML classifiers with respect to theft classification accuracy, precision, and F1 score. Another crucial issue is ignoring the automated tuning of the hyperparameters of ML classifiers. When tuning of hyperparameters is done manually instead of automatically, it affects the ML algorithm's performance in terms of computational overhead, precision, F1 score, and accuracy. 19

In this study, we develop a novel electricity theft detector called SSA-GCAE-CSLSTM that makes use of users' EC history to address issues with existing electricity theft classification techniques. This detector is based on the gated recurrent unit (GRU), convolutional autoencoder (CAE), salp swarm algorithm (SSA) optimization, and cost-sensitive learning (class-weighting) and long short-term memory (CSLSTM). The following is a list of the main contributions made in the underlying work.
• The class imbalance problem is tackled using cost-sensitive (weighting) learning methods by weighing the classes inversely proportional to the frequency of their instances. In this way, the overfitting issue that happens due to the usage of the synthetic minority oversampling technique (SMOTE) for data balancing is resolved. Furthermore, the underfitting issue created by random undersampling (RUS)-based data balancing is also tackled.
• The curse of dimensionality problem is tackled by introducing a new hybrid feature extractor that combines the GRU and CAE, that is, the gate convolutional autoencoder (GCAE). In this way, the unique and important features are extracted using GCAE and the curse of dimensionality issue is avoided.
• The automated tuning of the hyperparameters of the proposed model is performed using a bioinspired metaheuristic algorithm, known as SSA. The loss function is considered as the fitness function in SSA to fine-tune the hyperparameters.
• SSA is used to fine-tune the learning rate hyperparameter to address the local optimum stagnation problem in the proposed model. The SSA can escape the local optimum trap by exploiting its capacity to do a global search.
• The EC historical data of the State Grid Corporation of China (SGCC) are used for performance validation of the proposed SSA-GCAE-CSLSTM theft detector.
Extensive simulations are performed and the results prove that the proposed model obtains an improved ETD performance.
The remaining paper is arranged as follows. The related work can be found in Section 2. The problem statement, proposed ETD model, and simulation outcomes are presented in Sections 3, 4, and 5, respectively. Section 6 contains the paper's conclusion.

2 | RELATED WORK
This section provides an overview of the many ETD methods used in the literature that utilize the EC data of SMs. ETD algorithms are used to separate electricity theft consumers from normal ones. The existing ETD algorithms employed in the literature are grouped into unsupervised and supervised learning techniques. In the unsupervised learning category, the algorithms are trained using unlabeled data sets. Whereas, in the supervised learning category, the algorithms are trained using labeled data sets in which all normal EC consumers are labeled as 0 while all fraud consumers are labeled as 1.
The authors 10-17,20 have worked on supervised learning algorithms for ETD. In Ibrahem et al., 10 a feedforward network is proposed for detecting electricity fraud. The network incurs minimal computational time. A functional encryption technique is applied to encrypt SM readings and the encrypted aggregated readings are sent to the system operator to maintain consumers' privacy. However, the data imbalance between the count of normal and abnormal users is neglected, which is one of the significant problems that restrict the theft detection performance of ML algorithms. Husain et al. 11 proposed a categorical boosting (CatBoost) algorithm to detect electricity theft. The K nearest neighbor algorithm is leveraged to fill in the missing values. SMOTETomek, a hybrid data oversampling and undersampling technique, is employed to resolve the data imbalance issue. To overcome the curse of dimensionality, feature extraction and selection are also carried out using the feature extraction and scalable hypothesis technique. However, the automatic adjustment of the suggested theft detector's hyperparameters is not taken into account. Gunturi and Sarkar 12 employed ensemble techniques for the detection of electricity theft in SGs. The algorithms used are random forest (RF), extra trees (ET), adaptive boosting (AdaBoost), CatBoost, light boosting, and extreme gradient boosting (XGBoost). ET and RF are the best-performing classifiers among the abovementioned classifiers. The data imbalance issue is handled using SMOTE. However, the overfitting problem takes place due to the exploitation of SMOTE for data balancing.
An XGBoost classifier is proposed for detecting electricity fraud in SGs in Yan and Wen. 13 In addition, an Irish EC data set is used for training the model. The EC data collected from Irish SMs are preprocessed to fill missing values and remove outliers using the mean value and three sigma rule (TSR) methods, respectively. However, the problems of data imbalance and the curse of dimensionality are ignored. Hasan et al. 14 proposed a hybrid of the convolutional neural network (CNN) and the LSTM network, that is, CNN-LSTM, for detecting electricity theft. For computing the missing data, a novel preprocessing mechanism is introduced. It employs the local average of the EC data for the missing values' calculation. Data imbalance is tackled using SMOTE. However, the model overfitting problem occurs due to SMOTE. In Yao et al., 15 a combined CNN model is proposed to detect anomalies in SM data. The EC data from the SGCC are used for analysis. The Paillier algorithm protects energy privacy. However, the issue of data imbalance is not taken into account. For ETD in SGs, Avila et al. 16 used the random undersampling boosting algorithm. A maximal overlap discrete wavelet packet transform is used to extract significant features. RUS is used to balance the data. However, important and necessary information loss occurs due to randomly removing the majority class samples using RUS, which leads to the model underfitting problem. In Feng et al., 17 a TextCNN model is used for fraud detection in SM data. The Irish and Chinese data sets are employed for model training and testing. A new data balancing technique is proposed. However, tuning of the hyperparameters of the model is not considered. Jindal et al. 20 proposed a two-level EC data processing and analysis strategy based on the decision tree (DT) and support vector machine (SVM). This strategy is proposed to detect electricity theft with a minimum false positive rate (FPR). The DT is employed to compute the expected EC of the users. This expected EC is then fed into the SVM classifier to classify the users as normal or fraud. However, class imbalance and hyperparameters' tuning are ignored.
The authors 21-25 employed unsupervised learning techniques for the detection of electricity fraud in SGs. Júnior et al. 21 proposed an unsupervised technique for detecting electricity theft. An optimal path forest clustering technique is employed to separate the theft and nontheft users in industrial and commercial users' EC data. The EC data employed for analysis are obtained from Brazil's electric grid. Both the commercial and industrial EC data sets have eight features only, so the dimensions are already reduced and there is no need for feature engineering. However, tuning of hyperparameters is ignored. Razavi et al. 22 introduced a combined model that consists of the finite mixture model (FMM) clustering algorithm and the genetic algorithm (GA). The GA is employed for feature generation. GA applies a set of specific primitive functions to create (generate) new features. The FMM technique is used to perform EC consumers' segmentation into clusters where the consumers in the same cluster have a similar EC profile. Finally, the gradient boosting classifier is employed for electricity theft classification. The EC data for analysis are obtained from the Commission for Energy Regulation (CER), which is an Irish energy sector regulator. The data balancing is done using the six attacking models. However, hyperparameter tuning of the proposed clustering and classification models is neglected. Feng et al. 23 proposed an NTL detection method using local matrix reconstruction. Only five EC values on a daily basis are utilized to prevent high dimensionality. The principal component analysis (PCA) algorithm is employed to compute the reconstruction error (RE) for each sample. The RE of each EC record (sample) is compared with its neighbor samples to compute the local outlier values, which show the fraud degree of every EC record. The data sets for analysis purposes are obtained from CER and another one published by the US energy department. Hussain et al. 24 proposed a hybrid model that comprises outlier removal clustering (ORC) and robust PCA (ROBPCA). The curse of dimensionality issue is tackled using ROBPCA by extracting the important features from the EC data, whereas the ORC technique is employed to classify EC patterns with a low or high degree of possibility with respect to electricity theft. However, hyperparameters' tuning of the newly proposed hybrid model is ignored. A Gustafson-Kessel fuzzy clustering technique is proposed in Viegas et al. 25 for detecting the NTL in EC patterns. The analysis is performed using Irish EC data without involving hyperparameter tuning.

3 | PROBLEM STATEMENT
The ensemble (ET and RF) models and the XGBoost model were proposed by Gunturi and Sarkar 12 and Yan and Wen, 13 respectively. The problem of the curse of dimensionality, however, is disregarded. This issue happens when an ML or deep learning (DL) technique deals with high-dimensional data. If this issue is not properly dealt with, the ML and DL models memorize the redundant features and noise, which results in poor model generalization. For NTL identification in SGs, Hasan et al. 14 suggested a CNN-LSTM model. SMOTE is used to address the issue of data imbalance. However, the use of SMOTE for data balancing results in the overfitting issue. This issue arises because SMOTE overpopulates only a specific area instead of generating synthetic data points throughout the data. Overfitting refers to when a model performs well on seen instances and performs poorly on unseen instances.
Yao et al. 15 proposed a combined CNN model to detect abnormality in metering data. However, the class imbalance issue is ignored. The class imbalance issue refers to the concept where the number of instances belonging to the classes is not balanced and there is a pretty huge difference in the ratio of the classes' frequencies. Ignoring this issue leads the ML and DL models to be skewed towards the dominant class (the class with a huge number of instances). Therefore, the model outputs false predictions for minority class instances, which results in low theft detection performance. Moreover, Avila et al. 16 used RUS to tackle the class imbalance problem. However, some important and necessary information may be lost due to RUS. In RUS, the number of instances of the dominant class is randomly reduced until it is equal to or nearly equal to the number of instances of the minority class. Important information is lost from the majority class as a result, which causes an underfitting issue for the classifier.
When a classifier performs poorly on both seen and unseen data samples, an underfitting problem emerges. Feng et al. 17 proposed TextCNN for ETD. However, the hyperparameters' tuning of the model is not considered, which leads to low accuracy. In addition, ignoring the tuning of the learning rate hyperparameter implies that the model does not have a suitable learning rate, which leads the model to the local optima stagnation problem.

4 | PROPOSED ETD MODEL
In this part, the elements and their flows that are pertinent to the proposed ETD model are discussed. In addition, real EC data are used to build and evaluate the proposed ETD model. The model is made up of five parts, starting with the data preprocessing component followed by the abnormal and normal data balancing component. Afterwards, the dimensionality reduction component is used, followed by the proposed theft classification and parameters' optimization components, as displayed in Figure 1. All of these components are combined to configure a complete ETD solution. The proposed ETD solution uses the cost-sensitive learning (weighting) method to tackle the class imbalance problem. Moreover, a GCAE feature extractor is proposed to solve the curse of dimensionality issue. Afterwards, the proposed solution employs SSA to fine-tune its hyperparameters. Finally, the proposed SSA-GCAE-CSLSTM theft detector is introduced to classify the normal and abnormal energy consumers. The reason for selecting the hybrid of these three models is that this combination has not been used in the literature earlier for the same purpose. Moreover, after trying different individual and hybrid models, these models proved to provide the best performance of all. The comprehensive details of the proposed solution's components are provided in the subsequent subsections.
4.1 | Data preprocessing

Our proposed ETD solution is applied to the real EC data of the SMs obtained from SGCC. 26 The data are collected from the Fujian province of China. The data set has the real normal and abnormal consumers' records, collected and verified using an on-field inspection process. More details relevant to the data set are given in Table 1. The EC data are recorded by the SMs that are located at the consumers' premises. Afterwards, the SM sends the data to the electric utility using a communication medium, like ZigBee, radio frequency, power line carrier communication, and so forth. 27 Besides, in SMs, there can be possibilities of components' malfunctioning or failure, memory loss, and so forth, while recording the consumption readings. Due to this, outliers and missing values exist in the data set. Eventually, we need to clear the data set of missing values and outliers before passing it to any ML or DL model for training purposes. Therefore, we select some existing popular methods to fill the missing values, remove the outlier values, and finally normalize the data. The outliers are removed using the TSR method while missing or not-a-number (NaN) values are handled using linear interpolation (LI). In addition, for bringing the unscaled data into a specific range, min-max (MM) normalization is used. To calculate and fill the NaN valued cells in the data set, LI 28 is implemented using Equation (1):

$$x_{i,d} = \begin{cases} \dfrac{x_{i,d-1} + x_{i,d+1}}{2}, & \text{if } x_{i,d} \text{ is NaN},\\ x_{i,d}, & \text{otherwise,} \end{cases} \tag{1}$$

where $d$ represents a time slot, that is, a day in our case, with $d = 1, \ldots, 1034$, and $i$ represents the consumer number, with $i = 1, \ldots, 42{,}372$. $x_{i,d}$ denotes the EC reading of the $i$th user on the $d$th day, $x_{i,d-1}$ indicates the EC reading of the previous slot, and $x_{i,d+1}$ is the EC reading of the next slot. Furthermore, there are some outlier (unusual) values in the considered data set, which negatively affect the theft classification performance of the classifiers. For this reason, we handle the outliers using the TSR technique. The TSR 28 is implemented using Equation (2):
$$f(x_{i,d}) = \begin{cases} avg(X) + 3\, std(X), & \text{if } x_{i,d} > avg(X) + 3\, std(X),\\ x_{i,d}, & \text{otherwise,} \end{cases} \tag{2}$$

where $\bar{X}$ is equal to $avg(X)$ while the dataframe is represented by $X$. The dataframe comprises several EC values, given by $x_{i,d}$. In addition, the average and standard deviation of $X$ are represented by $avg(X)$ and $std(X)$, respectively. Furthermore, the sensitivity of deep models to data scalability and diversification 28 makes it a necessity to normalize the data leveraging the MM scaling technique. 28 The technique is implemented using Equation (3):

$$x_{norm} = \frac{x_{i,d} - X_{min}}{X_{max} - X_{min}}, \tag{3}$$

where $X_{max}$ exhibits the maximum EC value of $X$ and $X_{min}$ indicates the minimum EC value of $X$.
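To make the preprocessing pipeline concrete, the following is a minimal sketch of Equations (1)-(3) in Python, assuming the EC readings sit in a pandas DataFrame `df` with one row per consumer and one column per day; the function name and DataFrame layout are illustrative assumptions, not the authors' released code.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """LI (Equation 1), TSR (Equation 2), and MM scaling (Equation 3)."""
    # Equation (1): fill NaN cells by linear interpolation along the day axis
    df = df.interpolate(method="linear", axis=1, limit_direction="both")
    # Equation (2): cap outlier readings above avg(X) + 3 * std(X)
    cap = df.values.mean() + 3 * df.values.std()
    df = df.clip(upper=cap)
    # Equation (3): min-max normalization into [0, 1]
    return (df - df.values.min()) / (df.values.max() - df.values.min())
```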

4.2 | Data balancing
The issue of data imbalance negatively affects the performance of the classifiers. The data imbalance problem arises when one class's samples occur more frequently than another class's samples. Due to the fact that abnormal users' data instances are less frequently available than those of normal users, this issue is rather typical in the ETD domain. Due to this reason, the classifiers generate skewed and biased results with respect to the normal class and neglect the abnormal (minority) class. Therefore, we select a cost-sensitive learning (weighting) technique to tackle the data imbalance issue. Actually, two types of methods are used to handle the class imbalance issue: algorithm level and data level. 29 The algorithm-level methods include the ensemble and cost-sensitive methods, whereas the data-level methods involve all the data sampling methods. A wide range of data-level methods are employed in ETD. 12,14,16 However, data-level sampling techniques like SMOTE and RUS cause overfitting and underfitting issues, respectively, as already discussed in Section 3. Moreover, data-level methods are widely used in ETD while, at the algorithm level, weighting is narrowly leveraged in ETD, like in Pereira and Saraiva 30 and Walsh and Tardy. 31 Therefore, we prefer to use an algorithm-level balancing method, that is, the cost-sensitive learning (weighting) method, for ETD in SGs. In this method, false negatives and false positives are assigned different misclassification costs. Afterwards, the cost matrix (confusion matrix) is used to address the data imbalance issue. Class weighting is a type of cost-sensitive learning technique used to tackle data imbalance issues. In class weighting, the classes are weighted inversely proportional to their sample counts. 30 In this way, the abnormal (minority) class has a high weight value and the normal (majority) class has a low weight value. Resultantly, the classes are balanced. We choose the class-weighting method for data balancing because it works at the algorithm level and, unlike the existing techniques, it avoids issues like overfitting, underfitting, and so forth.
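As a minimal sketch of this inverse-frequency weighting, the snippet below derives Keras-style class weights with scikit-learn's `compute_class_weight`; the toy label array is illustrative only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy binary labels: 0 = normal (majority), 1 = theft (minority)
y_train = np.array([0] * 900 + [1] * 100)

# "balanced" weighs each class inversely proportional to its frequency
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y_train)
class_weight = {0: weights[0], 1: weights[1]}   # {0: ~0.56, 1: ~5.0}
# Later passed to Keras: model.fit(..., class_weight=class_weight)
```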

4.3 | Dimensionality reduction
In this subsection, we discuss the dimensionality reduction process using our proposed hybrid model. The proposed GCAE technique is a combination of the CAE and GRU. The EC data recorded by the SMs increase in dimensions on a timely basis. Therefore, we need to obtain the most important, unique, and suitable set of features out of the total features. For this reason, we choose the concept of the autoencoder (AE) for its powerful dimensionality reduction ability. 32 The dimensionality of a data set denotes the number of its input features (dimensions). Moreover, the techniques employed to reduce the quantity of the input features in a data set are called dimensionality reduction techniques. The input dimensions constantly make a classification or any predictive task difficult for a model, which is generally referred to as the curse of dimensionality problem. 33 This problem causes the ML and DL classifiers to perform poorly. Therefore, it is necessary to tackle this problem.
We propose a new hybrid GCAE to handle the curse of dimensionality issue.
As is clear from its name, the GCAE feature extractor is a combination of the GRU and CAE, where CAE is a combination of the AE and CNN. The AE is an unsupervised neural network that consists of an encoder and a decoder. When the input x is passed to the encoder neural network, it is transformed into a compressed (encoded) vector z by passing it through multiple Dense layers in the encoder neural network. Afterwards, from z, the input is reconstructed back and x′ is achieved using multiple Dense layers in the decoder neural network. These three techniques work together to reduce the dimensions of the given data set, that is, reduce the number of features present in the data set. The working procedures of the encoder and decoder are given in Equations (4) and (5), 34 respectively:

$$z = \sigma(W_{enc}\, x + b_{enc}), \tag{4}$$

where an activation function exhibiting nonlinearity, for example, Leaky rectified linear unit (LeakyReLU), ReLU, Sigmoid, Hyperbolic tangent (Tanh), and so forth, is represented by $\sigma$. $b_{enc}$ and $W_{enc}$ are the biases and weights of the encoder-related Dense layers, respectively. Afterwards, the reconstructed output x′ of the original input x is calculated by the decoder as follows:

$$x' = \sigma(W_{dec}\, z + b_{dec}), \tag{5}$$

where $b_{dec}$ and $W_{dec}$ are the biases and weights of the decoder-related Dense layers, respectively. If the weights of the encoder network and decoder network are tied in the AE, $W_{dec} = W_{enc}^{T}$, the quantity of the parameters is minimized by 50%. Normally, in an AE, there are several hidden (Dense) layers. Moreover, the decoder network is a picture or mirror image of the encoder network. Therefore, their weights are tied. While training the AE, the weight updating process is performed to reduce the RE through the backpropagation algorithm. The RE is calculated using Equation (6): 32,34

$$RE = \lVert x - x' \rVert^{2}. \tag{6}$$

The CAE is actually an AE in which a CNN is employed by following the AE's schema. The CNN has pooling and convolution layers, in distinction from the classical deep networks. Furthermore, the GRU is combined with the CAE to design our proposed GCAE feature extractor. The GRU is added to the CAE in a way that two GRU layers are added: one at the encoder and another at the decoder. In the encoder network of GCAE, we use two one-dimensional convolution layers (Conv1D), one GRU layer, and three one-dimensional MaxPooling (MaxPooling1D) layers. The decoder network of GCAE contains one GRU layer, three Conv1D layers, and three one-dimensional UpSampling (UpSampling1D) layers. More implementation details of the proposed GCAE feature extractor are provided in Table 2.
Like the basic AE, the GCAE also comprises two networks: the encoder and the decoder. It is shown in the architecture of GCAE in Table 2 that the first six layers, from the first to the sixth, are part of the encoder while the remaining layers belong to the decoder neural network. The encoder aims to compress the inputted data, whereas the decoder seeks to reconstruct the original inputted data from its compressed form prepared by the encoder. After the training process, the encoder submodel is retained and the decoder submodel is discarded. The encoder is then employed for feature extraction. The extracted features can then be employed to train another ML algorithm. 33 In our scenario, we select only 10,000 instances out of 42,372 in the SGCC data set for analysis. The reason is that if we use all 42,372 records, the session crashes due to the utilization of all available RAM in Google Colab during the training of the GCAE feature extractor. Therefore, we employ a data set of 10,000 instances and 1034 dimensions for feature extraction. The feature extraction is done using GCAE, which results in (None, 130, 1) output dimensions at the encoded layer (the sixth layer of the encoder given in Table 2), that is, 130 features only. Hence, the dimensions are reduced from 1034 to 130 only. Subsequently, these extracted dimensions are passed to train the SSA- and CS-based LSTM to obtain better theft detection results.
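The following Keras sketch illustrates how such an encoder-decoder could be wired together. Since Table 2 is not reproduced here, the filter counts, kernel sizes, exact layer ordering, and the final Cropping1D fix-up are our assumptions; only the layer types and counts (two Conv1D, one GRU, three MaxPooling1D in the encoder; three Conv1D, one GRU, three UpSampling1D in the decoder) and the (None, 130, 1) encoded shape follow the text.

```python
from tensorflow.keras import layers, models

inp = layers.Input(shape=(1034, 1))
# Encoder: two Conv1D, one GRU, and three MaxPooling1D layers
x = layers.Conv1D(32, 3, activation="relu", padding="same")(inp)
x = layers.MaxPooling1D(2, padding="same")(x)            # (None, 517, 32)
x = layers.Conv1D(16, 3, activation="relu", padding="same")(x)
x = layers.MaxPooling1D(2, padding="same")(x)            # (None, 259, 16)
x = layers.GRU(1, return_sequences=True)(x)              # (None, 259, 1)
encoded = layers.MaxPooling1D(2, padding="same")(x)      # (None, 130, 1)

# Decoder: one GRU, three Conv1D, and three UpSampling1D layers
y = layers.GRU(16, return_sequences=True)(encoded)
y = layers.UpSampling1D(2)(y)                            # (None, 260, 16)
y = layers.Conv1D(16, 3, activation="relu", padding="same")(y)
y = layers.UpSampling1D(2)(y)                            # (None, 520, 16)
y = layers.Conv1D(32, 3, activation="relu", padding="same")(y)
y = layers.UpSampling1D(2)(y)                            # (None, 1040, 32)
y = layers.Cropping1D((3, 3))(y)                         # assumed trim back to 1034
out = layers.Conv1D(1, 3, activation="sigmoid", padding="same")(y)

gcae = models.Model(inp, out)
gcae.compile(optimizer="adam", loss="mse")               # minimizes the RE of Equation (6)
encoder = models.Model(inp, encoded)                     # retained for feature extraction
```

After fitting `gcae` on the preprocessed EC matrix, `encoder.predict(...)` yields the 130-dimensional representations that are passed on to the CSLSTM classifier.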

4.4 | Hyperparameters' optimization
In CSLSTM, the stochastic gradient descent 35 optimization technique is employed to update the weights of the model in an iterative manner to obtain the optimal loss value and achieve better theft detection accuracy. Moreover, along with weight parameter optimization, the model's hyperparameters' optimization also has a direct effect on theft detection performance. Therefore, CSLSTM's hyperparameters need to be optimized to achieve high theft detection performance. These hyperparameters include the number of epochs, the learning rate, the optimizer type, the number of neurons, and the network weight initializer. Finding the optimal values for these hyperparameters notably enhances the ETD performance.
Different hyperparameters' optimization techniques are employed in the literature. 36-38 For example, in Gu et al., 36 the grid-search (GS) algorithm is used for hyperparameters' tuning. However, grid search is a time-consuming algorithm because it tries all the possible options (combinations) of the hyperparameters to obtain the best combination of hyperparameter values that optimizes the ETD performance. 39 Furthermore, the random-search (RS) optimization technique is employed in Ismail et al., 38 where the optimal combination of hyperparameters is chosen from a random set of hyperparameter combinations. However, in this search strategy, an optimal combination of hyperparameters that enhances ETD performance is not guaranteed. 39 On the other hand, bioinspired optimization algorithm-based hyperparameters' tuning implements an intelligent search strategy that takes the detector's performance into account while searching the hyperparameters' search space. For this reason, in this research, we leverage a bioinspired hyperparameters' optimization algorithm, namely, SSA. 40 It is a bioinspired metaheuristic optimization technique. Generally, three factors, accuracy, precision, and F1 score, affect ETD performance. All of them are dependent upon the classification loss (error). If the loss is high, accuracy, precision, and F1 score will be low and vice versa. Therefore, we need to focus on loss optimization to obtain the best theft detection performance in terms of accuracy, precision, and F1 score. Hence, we design a single-objective fitness function, that is, a loss function, to optimize using SSA and achieve high ETD performance. The loss function in this scenario is the $binary\_crossentropy$ loss (logarithmic loss) function 41 as we intend to predict whether a consumer is a theft (represented by 1) or nontheft (represented by 0) consumer. The $binary\_crossentropy$ function takes two input arrays, Ytest and Ypred. The fitness function is given in Equation (7): 41

$$binary\_crossentropy = -\frac{1}{n} \sum_{i=1}^{n} \left[ Ytest_i \log(Ypred_i) + (1 - Ytest_i) \log(1 - Ypred_i) \right], \tag{7}$$

where $i$ and $n$ show a specific energy consumer and the total number of consumers, respectively. $Ytest_i$ is the actual testing data label in the data set and $Ypred_i$ is the energy theft's probabilistic value calculated via SSA-GCAE-CSLSTM.
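A direct NumPy transcription of Equation (7), usable as the SSA fitness function, might look as follows; the epsilon clipping is our added safeguard against log(0).

```python
import numpy as np

def binary_crossentropy_loss(y_test, y_pred, eps=1e-12):
    """Equation (7): mean logarithmic loss over n consumers."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # assumed guard against log(0)
    return -np.mean(y_test * np.log(y_pred)
                    + (1 - y_test) * np.log(1 - y_pred))
```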
Here, the goal is to obtain the optimal combination of values for the hyperparameters that minimizes the loss function ($binary\_crossentropy$) given in Equation (7). Mirjalili et al. 40 were inspired by the moving and foraging mechanism of salp swarms in oceans and introduced the SSA. The salps mostly create a swarm in oceans (also called a salp chain). The primary reason for creating such a chain is not yet clearly discovered, but some members of the research community say that the salp swarm (chain) obtains fast coordinated changes in mobility towards foraging. 42 For the mathematical formulation of the SSA, 40 the population is split into the follower and leader categories. The leader salp is always at the front of the salp chain while the remaining salps are the followers. The leader salp leads and directs the chain (swarm) while the followers go next to each other. The current positions of all the salps are retained in a two-dimensional matrix ($M$), as in other bioinspired swarming algorithms. The food source $F$ is assumed to be available in the search area as the salp chain's target. The location of the leader salp is updated using Equation (8): 40

$$M_j^1 = \begin{cases} F_j + R_1 \left( (UB_j - LB_j) R_2 + LB_j \right), & R_3 \geq 0.5,\\ F_j - R_1 \left( (UB_j - LB_j) R_2 + LB_j \right), & R_3 < 0.5, \end{cases} \tag{8}$$

where $M_j^1$ represents the location of the leader salp in the $j$th dimension. $LB_j$ and $UB_j$ show the lower and upper bounds of the $j$th dimension, respectively, while the location of the food source in the $j$th dimension of the search field is represented by $F_j$. $R_1$, $R_2$, and $R_3$ are random numbers. Equation (8) indicates that the first salp, that is, the leader, changes its location with regard to $F_j$. The random number $R_1$ is the most significant and valuable parameter because it balances the local search (exploitation) and global search (exploration). $R_1$ is computed using Equation (9): 40

$$R_1 = 2 e^{-\left(\frac{4l}{L}\right)^2}, \tag{9}$$

where $L$ and $l$ represent the maximum number of iterations and the number of the current iteration, respectively. The random numbers $R_2$ and $R_3$ are generated uniformly between 0 and 1. The follower salps' locations are updated using Equation (10), which is the equation of Newton's law of motion:

$$M_j^i = \frac{1}{2} a t^2 + V_0 t, \tag{10}$$

where $i \geq 2$ and $M_j^i$ indicates the location of the $i$th member of the follower salps category in the $j$th dimension. $V_0$ shows the starting speed, $t$ indicates the time, and $V = \frac{M - M_0}{t}$. Furthermore, in optimization, the step size between two iterations is taken to be 1 and the initial speed ($V_0$) is considered to be 0. Hence, Equation (10) is modified and given as Equation (11): 40

$$M_j^i = \frac{1}{2} \left( M_j^i + M_j^{i-1} \right), \tag{11}$$

where the constraint $i \geq 2$ remains in place. The $i$th salp in the $j$th dimension among the follower salps is represented by $M_j^i$.
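For illustration, a compact Python sketch of this search loop, specialized to continuous hyperparameters such as the learning rate, is given below; the population size, iteration count, and the clipping to the bounds are illustrative assumptions, and `train_and_score` in the usage comment is a hypothetical user-supplied evaluator.

```python
import numpy as np

def ssa_minimize(fitness, lb, ub, n_salps=10, n_iters=50, seed=0):
    """Minimal SSA sketch implementing Equations (8), (9), and (11)."""
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    M = rng.uniform(lb, ub, size=(n_salps, lb.size))     # salp positions
    F, F_fit = M[0].copy(), np.inf                       # food source = best so far
    for l in range(1, n_iters + 1):
        for pos in M:                                    # update the food source
            fit = fitness(pos)
            if fit < F_fit:
                F, F_fit = pos.copy(), fit
        R1 = 2 * np.exp(-(4 * l / n_iters) ** 2)         # Equation (9)
        for j in range(lb.size):                         # leader salp, Equation (8)
            R2, R3 = rng.random(), rng.random()
            step = R1 * ((ub[j] - lb[j]) * R2 + lb[j])
            M[0, j] = F[j] + step if R3 >= 0.5 else F[j] - step
        for i in range(1, n_salps):                      # followers, Equation (11)
            M[i] = (M[i] + M[i - 1]) / 2
        M = np.clip(M, lb, ub)                           # keep salps inside bounds
    return F, F_fit

# Usage: tune a single learning rate in [1e-4, 1e-1]
# best_lr, best_loss = ssa_minimize(lambda p: train_and_score(p[0]),
#                                   lb=[1e-4], ub=[1e-1])
```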
Furthermore, using Equations (8) and (11), the salp swarm is created. Finally, the optimized values for the abovementioned hyperparameters computed by SSA are provided in Table 3. In addition, we employed the GS 41 and RS 38 hyperparameter optimization algorithms for comparison purposes. The optimized values computed by GS and RS are given in Tables 4 and 5, respectively. The proposed model is not compared with other optimization algorithms, like PSO and GA, as they consume a large amount of computational time.

4.5 | Electricity theft classification
This subsection provides the description of the theft classification using the proposed SSA-GCAE-CSLSTM model. After conducting data preprocessing using LI, MM, and TSR, data balancing using the class-weighting mechanism, dimensionality reduction using GCAE, and hyperparameters' tuning using SSA, the final classification task is performed. In this section, the weighting-based balanced data, GCAE-based reduced dimensions, and SSA-based optimized hyperparameters are inputted into the LSTM neural network to perform accurate and precise electricity theft classification in SGs. LSTM 43 was introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997. Learning the mechanism to retain information over long timestamps by recurrent backpropagation networks incurs an extremely long duration, which is due to inadequately decaying loss (error) backflow. This issue is addressed by the gradient-based algorithm known as LSTM. The basic recurrent neural network (RNN) employs its feedback links (connections) to retain the recent input's representations in terms of short-term memory while LSTM does so by gradually updating the weights.
The main advantage of the RNN is that it is able to leverage the context-relevant information when mapping between the input sequence and the output sequence. 44 However, the classic RNN has limited access to the context range. As the input circulates and traverses through all the recurrent connections in the network, the effect of the input on the hidden layer, and then on the output of the recurrent network, either blows up or decays. This influence (effect) of the input over the hidden layer and then over the network's output is called the vanishing gradient problem. 44 This problem is tackled by the LSTM. The LSTM comprises multiple subnetworks that are recurrently connected, called memory blocks. Each block holds single or multiple recurrent memory cells and three multiplicative gates (input, output, and forget). These gates allow the LSTM memory cells to retain information over long periods. The forget gate decides which information is to be removed from the cell state, the input gate finalizes which information from the input data is to be renovated (updated), and the final output is determined via the output gate on the basis of the memory of the block and the input. 45 The LSTM cell's mathematical formulation is given in Equations (12)-(17): 41,45

$$f_t = Sigmoid(W_f [h_{t-1}, x_t] + b_f), \tag{12}$$
$$i_t = Sigmoid(W_i [h_{t-1}, x_t] + b_i), \tag{13}$$
$$\tilde{c}_t = Tanh(W_c [h_{t-1}, x_t] + b_c), \tag{14}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \tag{15}$$
$$o_t = Sigmoid(W_o [h_{t-1}, x_t] + b_o), \tag{16}$$
$$h_t = o_t \odot Tanh(c_t), \tag{17}$$

where $b_o$, $b_i$, $b_c$, and $b_f$ represent the biases relevant to the output gate, input gate, current cell state, and forget gate, respectively. The hidden state's activation at the prior timestamp $t-1$ and the input vector (instance) at the current timestamp $t$ are represented by $h_{t-1}$ and $x_t$, respectively. Finally, Sigmoid and Tanh are the activation functions used in the LSTM cell. The formulas for these activation functions are provided in Equations (18) and (19): 41

$$Sigmoid(x) = \frac{1}{1 + e^{-x}}, \tag{18}$$
$$Tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}. \tag{19}$$
An LSTM cell that presents the above procedure in graphical form is given in Figure 2. 45 The LSTM is fit using the balanced data obtained by class weighting, the extracted features obtained by GCAE, and the adjusted hyperparameters achieved by SSA. The adjusted and optimal values of the hyperparameters for LSTM are given in Table 3. Moreover, the LSTM implementation involves one LSTM layer, two LeakyReLU layers, three Dropout layers, two BatchNormalization layers, two Dense layers, and one Flatten layer. The selected dropout probability is 0.3. The number of neurons for LSTM, the number of neurons for the first Dense layer, and the learning rate for each LeakyReLU layer are finalized by SSA and provided in Table 3. Only one neuron is used for the final layer, that is, the Dense layer, because it gives a single output for an instance of EC data: either theft or honest. The input shape is one timestamp with 130 features, that is, (130, 1). Moreover, $binary\_crossentropy$ is employed as the loss function while the optimizer for weights' optimization to achieve minimum loss is selected by SSA and provided in Table 3. In addition, the number of epochs is also tuned by SSA and given in Table 3, with a batch_size of 128. Finally, the LSTM model is compiled using the TensorFlow Python library.
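A Keras sketch of this classifier is shown below. The SSA-tuned values (LSTM units, Dense units, LeakyReLU slope, optimizer) come from Table 3, which is not reproduced here, so the defaults used are placeholders; the layer ordering is assumed and, for brevity, the sketch uses one Dropout layer fewer than the text's configuration. Cost sensitivity enters only through the class_weight argument at fit time.

```python
from tensorflow.keras import layers, models

def build_cslstm(lstm_units=64, dense_units=32, alpha=0.01):
    """CSLSTM sketch; lstm_units, dense_units, and alpha stand in for SSA-tuned values."""
    model = models.Sequential([
        layers.Input(shape=(130, 1)),            # 130 GCAE features, 1 timestamp
        layers.LSTM(lstm_units, return_sequences=True),
        layers.LeakyReLU(alpha),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(dense_units),
        layers.LeakyReLU(alpha),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),   # single theft/honest output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# model.fit(X_train, y_train, epochs=25, batch_size=128,
#           class_weight=class_weight, validation_data=(X_test, y_test))
```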

5 | DISCUSSION OF SIMULATION RESULTS
In this part, simulation results are used to illustrate the applicability of the proposed SSA-GCAE-CSLSTM model for ETD. By contrasting the proposed model with the benchmark ETD models, we demonstrate the efficacy of the proposed model. The proposed and benchmark models are simulated using the real SM EC data utilizing the popular Python libraries Keras, TensorFlow, and Scikit-Learn.

5.1 | Simulation configuration
Using the real SGCC data set, the Python programming language, and Google Colaboratory (a free cloud service), the proposed model is trained and tested. The data set comprises 2 years and 10 months (January 1, 2014-October 31, 2016) of EC data from real SMs of Fujian, China. More elaborate information about the data set is given in Table 1. In addition, the SGCC data set clearly states 3615 theft users, validated using on-site inspections. Furthermore, to achieve more useful and improved results, as discussed in Section 4.1, we start with the preprocessing of the data set, where the MM scaler, TSR, and LI are employed. The entire data set is split into training and testing sets at a later stage. The ratios for the testing and training sets are considered to be 20% and 80%, respectively. After that, the cost-sensitive learning (weighting) technique is used to balance the training data, as mentioned in Section 4.2. Later, to lessen the dimensionality of the data, GCAE is trained to extract the important features. Its encoder consists of two Conv1D layers, three MaxPooling1D layers, and one GRU layer, while its decoder contains three Conv1D layers, three UpSampling1D layers, and one GRU layer. The hyperparameters of the proposed model are then tuned using SSA. Moreover, binary cross-entropy is used by SSA as the objective (fitness) function to achieve the optimal values of the hyperparameters. In addition, 25 epochs are selected as suitable by SSA for training the proposed model. Along with the number of epochs, other hyperparameters are also tuned using SSA. The SSA-based hyperparameters' tuning leads the proposed model to minimize the $binary\_crossentropy$ error and increase the ETD accuracy. Finally, the optimized values of the hyperparameters outputted from SSA, the extracted dimensions obtained from GCAE, and the class weights defined by the cost-sensitive (weighting) learning technique are passed to LSTM for electricity theft classification, as already discussed in Section 4.5.
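The 80/20 split described above can be reproduced with scikit-learn as follows; the placeholder arrays and the stratify and random_state arguments are our assumptions, the latter keeping the theft ratio consistent across the two sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the GCAE features and labels
X = np.random.rand(10_000, 130)           # 10,000 instances, 130 features
y = np.random.randint(0, 2, size=10_000)  # 0 = normal, 1 = theft

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```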

5.2 | Performance evaluation measures
Accuracy, precision, area under the receiver operating characteristic curve (AUC) score, F1 score, and recall are the performance metrics utilized in this paper to assess the performance of the proposed model. 19,47-49 To calculate the aforementioned metrics, we need to calculate the confusion matrix. The confusion matrix consists of four elements: false positives (FP), true positives (TP), false negatives (FN), and true negatives (TN). FP refers to the concept when a classifier predicts a normal user as abnormal. TP refers to the concept when a classifier predicts an abnormal EC user as abnormal. The prediction of an abnormal user as normal is referred to as FN while the prediction of a normal consumer as normal is referred to as TN. The formulas for the performance measures leveraged in this work are provided in Equations (20)-(24): 46,49

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \tag{20}$$
$$Precision = \frac{TP}{TP + FP}, \tag{21}$$
$$Recall = \frac{TP}{TP + FN}, \tag{22}$$
$$F1\ score = 2 \times \frac{Precision \times Recall}{Precision + Recall}, \tag{23}$$
$$AUC = \frac{\sum_{i \in PC} RANK_i - \frac{PC(PC + 1)}{2}}{PC \times NC}, \tag{24}$$
where NC and PC show the numbers of samples relevant to the negative (normal) and positive (abnormal) classes, respectively. $RANK_i$ represents the rank value of the $i$th EC sample. Additionally, accuracy is the proportion of samples successfully predicted across the entire sample set. Precision describes the accurate forecasts made by a model, which aids the power company in cutting the cost of on-field inspection, while recall aids utilities in minimizing financial loss. Besides, the F1 score is calculated using precision and recall. To distinguish between the FPR and the TP rate (TPR), AUC is used; in other words, it measures how well the classifier trades off the TPR against the FPR. It varies between 0.5 and 1, where 1 means perfect classification and 0.5 means total failure of the classification technique.
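These five measures map directly onto scikit-learn helpers, as in the sketch below; the toy label and probability arrays are purely illustrative.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = np.array([0, 0, 1, 1, 0, 1])               # toy ground-truth labels
y_prob = np.array([0.1, 0.4, 0.8, 0.3, 0.2, 0.9])   # toy theft probabilities
y_pred = (y_prob >= 0.5).astype(int)                # threshold at 0.5

print("Accuracy :", accuracy_score(y_true, y_pred))     # Equation (20)
print("Precision:", precision_score(y_true, y_pred))    # Equation (21)
print("Recall   :", recall_score(y_true, y_pred))       # Equation (22)
print("F1 score :", f1_score(y_true, y_pred))           # Equation (23)
print("AUC      :", roc_auc_score(y_true, y_prob))      # Equation (24)
```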

5.3 | Benchmark schemes
The benchmark schemes, which are used in our work for comparative purposes, are explained in this subsection. The same data preprocessing and data balancing techniques are used for the existing schemes to conduct a fair and unbiased comparison. The benchmark techniques considered in this paper include SVM, ET, RF, XGBoost, AdaBoost, DT, CNN, CSLSTM, and the newly developed GCAE-CSLSTM. CSLSTM and GCAE-CSLSTM are already discussed in detail in the system model section. A short introduction of the remaining benchmark techniques is provided in the following subsections.
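For reference, the scikit-learn/XGBoost benchmarks could be instantiated as below; the hyperparameter defaults are ours, and the class_weight/scale_pos_weight settings are an assumed way of mirroring the cost-sensitive setup of the proposed model.

```python
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              AdaBoostClassifier)
from xgboost import XGBClassifier

n_normal, n_theft = 9_000, 1_000        # illustrative class counts

benchmarks = {
    "SVM": SVC(class_weight="balanced", probability=True),
    "DT": DecisionTreeClassifier(class_weight="balanced"),
    "ET": ExtraTreesClassifier(class_weight="balanced"),
    "RF": RandomForestClassifier(class_weight="balanced"),
    "AdaBoost": AdaBoostClassifier(),
    "XGBoost": XGBClassifier(scale_pos_weight=n_normal / n_theft),
}
# Each model is then fit on the same preprocessed, balanced training data.
```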

5.3.1 | Support vector machine
SVM is a well-established, popular, and frequently applied technique that can be applied to regression and classification problems, though in different versions, that is, the support vector regressor for regression and the support vector classifier for classification. In classification tasks, it was originally made for binary classification problems. 49 However, it can be employed for multiclass classification problems using the same principle used for binary classification. A multiclass classification problem is broken into multiple binary classification problems, called the one-vs-one method, and then the same principle is used as in binary classification. In this way, it leads to multiclass classification. 50 Recently, in Jindal et al., 20 Haq et al., 51 and Kong et al., 52 SVM was used as the proposed scheme for ETD. That is why we choose SVM as a benchmark scheme in this research work.

5.3.2 | Decision tree
DT is a popular supervised classification algorithm. It can be used to perform all such tasks that involve classification and regression. To predict a class label in DTs for a data instance, the process starts from the root. The values of the root feature (attribute) are compared with the instance's feature. Then, we select the branch based on that value and jump to the subsequent node of the tree. 53 DT is considered as the proposed classifier in Tehrani et al. 54 for ETD in SGs. Moreover, it is considered as an important component of the proposed models in Jindal et al. 20 and Kong et al. 52 for NTL detection. Therefore, we choose it as a baseline technique in this paper.

5.3.3 | Extra trees
ET is an ensemble-supervised ML technique that consists of multiple DT classifiers. It can be used to perform both supervised regression and classification tasks. It is a highly accurate and computationally efficient algorithm. 55 It is different from the other tree-ensemble algorithms for two reasons: first, it divides the nodes by selecting the cut points fully at random and, second, it takes the whole training data set (no subsets of the data set) to create more trees. Recently, the ET ensemble was selected as the proposed scheme for ETD in SGs in Gunturi and Sarkar. 12 For this reason, we select it to be one of our benchmark schemes in this work.

5.3.4 | Random forest
RF is a tree-based ensemble ML classifier that consists of multiple DTs. It obeys the bagging mechanism. Bagging is an ensemble learning mechanism, often used for homogeneous weak classifiers in parallel. Moreover, the final prediction result is decided by calculating the average or majority vote of the results of various DTs. 49,50 RF is proposed in Gunturi and Sarkar. 12 So, we choose RF as one of our benchmark classifiers to have a suitable performance comparison.

5.3.5 | Adaptive boosting
AdaBoost is created on the concept of boosting. In boosting, multiple weak learners create a robust meta-learner using the majority vote process. Multiple boosting iterations (rounds) are performed for creating AdaBoost or any other boosting technique. The number of boosting rounds is equal to the number of weak learners (base estimators) in a boosting technique. In the first boosting iteration in AdaBoost, the first weak classifier is trained and prediction results are generated. 56 In addition, the error is computed for the first round. In the second iteration, the misclassified records are weighted more as compared with the correctly classified observations of the previous round. This process continues till the last iteration and finally these weak learners are combined to build a meta classifier. The meta-learner assigns the labels for each observation using weighted majority voting. 50 Very recently, in Qu et al., 57 AdaBoost was used as the proposed scheme for detecting anomalies in residential EC data. For this reason, we select it as a benchmark in the proposed work.

5.3.6 | Extreme gradient boosting
XGBoost is an ensemble ML technique used for classification and regression-related problems. In boosting, the DTs are employed sequentially and are trained to correctly classify the records misclassified by the previous weak learner. In the boosting ensemble category, AdaBoost was the first model. 58 Furthermore, in gradient boosting, weak classifiers are trained employing a gradient descent optimizer and a differentiable loss calculation function. 56 XGBoost is computationally efficient as compared with the gradient boosting classifier. 58 Recently, it was proposed by Yan and Wen 13 to detect electricity theft using the metering data present in the Irish data set. Therefore, we choose XGBoost as a benchmark scheme in this article.

5.3.7 | Convolutional neural network
CNN is a DL technique, which was originally proposed for image-processing tasks. CNN takes an image as an input, gives importance in terms of the bias and weight parameters to the different objects available in the image, and separates these objects from each other. 50 It consists of five layer types: input, convolution, pooling, fully connected, and softmax (for multiclass classification) or logistic (for binary classification). It was originally developed to solve image-processing problems, like in Kim et al. 59 and Deperlioglu et al. 60 However, researchers have used it in other domains too. 61,62 Some authors have also provided an in-depth analysis of ETD in the form of review papers. 63,64 Recently, in Duarte Soares et al. 65 and Mangat et al., 66 CNN was employed as the proposed model for NTL detection. For this reason, we selected CNN as one of the benchmark models in this work.

5.4 | Proposed model performance evaluation
This subsection discusses the performance evaluation of the proposed SSA-GCAE-CSLSTM approach for theft classification using the aforementioned performance indexes. In this regard, we begin with the data set preprocessing to remove outliers and fill missing values. In addition, the data set is normalized using the MM scaler. Afterwards, data balancing needs to be performed; otherwise, if we train a model with an imbalanced data set, it gets biased and generates a high FPR value. Therefore, the cost-sensitive learning (class-weighting) approach is employed to handle the class imbalance problem, as elaborated in Section 4.2. Moreover, no new artificial instances are generated in this method. The weights are assigned inversely proportional to the number of instances in the classes. A comparison between the selected cost-sensitive learning-based LSTM and the existing techniques (SMOTE and RUS)-based LSTM is performed for visualizing the effectiveness of the selected data balancing technique. The comparison results are provided in Table 6 and Figure 3.
In terms of the accuracy, precision, F1 score, and AUC score performance metrics, our technique outperforms SMOTE-LSTM and RUS-LSTM, as reported in Table 6 and Figure 3. Imbalanced-LSTM achieves 91.22% accuracy, 99.81% precision, 50.88% AUC score, 91.29% recall, and 95.40% F1 score. Regarding accuracy, precision, and F1 score, it appears superior to RUS-LSTM and SMOTE-LSTM, but not so when considering AUC and recall. It means that the Imbalanced-LSTM classifier is biased and classifies the theft users as normal. Therefore, it obtains high accuracy and low AUC values. Furthermore, the cost-sensitive learning-based LSTM obtains better results as compared with the SMOTE- and RUS-based LSTM. It obtains 89.57% accuracy and a 66.28% AUC score, which are better results without the model's biasness. Hence, CSLSTM's results are improved as compared with Imbalanced-LSTM, RUS-LSTM, and SMOTE-LSTM. Furthermore, Figure 4 shows the performance comparison of CSLSTM with Imbalanced-LSTM, SMOTE-LSTM, and RUS-LSTM in terms of ROC-AUC. It is depicted in the figure that CSLSTM achieves a better result in terms of ROC-AUC, where it exhibits the maximum TPR and minimum FPR values as compared with the RUS, SMOTE, and Imbalanced-LSTM cases.

Now, the convergence analysis of the proposed SSA-GCAE-CSLSTM and benchmark CNN models is done with respect to accuracy and loss during the training and testing phases. Figure 5 presents the convergence analysis of the proposed model in terms of accuracy. The proposed model effectively conducts feature extraction using GCAE and efficiently optimizes the hyperparameters using SSA to avoid overfitting and improve the theft detection accuracy, as depicted in Figure 5. The epoch count is a hyperparameter that controls the training and testing process of a model. Along with the other hyperparameters, we also tune the number of epochs using SSA and finally get a suitable epoch value of 25. On the training data set, the accuracy value of the proposed model increases gradually and finally reaches 0.9225, while on the testing data set, some rise and fall can be noticed in the accuracy values. The considered data set comprises some zero time series data. At the 10th epoch, the proposed SSA-GCAE-CSLSTM model is trained using a batch having zero time series values. Therefore, it does not predict well on the nonzero-valued batch in the testing phase, which causes a temporary overfitting issue. The training and testing accuracies are closer after the 10th epoch, and overfitting is prevented. Finally, no change is seen in the training or testing accuracy values of the proposed model after the 21st epoch. The overfitting problem is avoided and the model is generalized due to the efficient and powerful feature extraction and the suitable tuning of the hyperparameters' values using GCAE and SSA, respectively.
The proposed model's convergence analysis in relation to loss is shown in Figure 6. A spike appears at the 10th epoch because the model is trained using a zero-valued batch (a batch having zero values) while, in the testing phase, it is tested on a nonzero-valued batch, which causes the overfitting issue only in that specific epoch. After the 10th epoch, the overfitting issue is tackled.
The convergence analysis for the existing CNN is performed for comparison purposes. Figure 8 presents the convergence analysis of CNN with regard to loss. The training loss decreases while it fluctuates at some epochs before finally reaching the value of 0.2723. The testing loss, on the other hand, finally reaches the value of 0.5015. It means that CNN overfits and the loss does not decrease uniformly on either the training or testing data set. The overfitting problem occurs for two reasons. The first reason is that no dedicated feature extraction is conducted. The second reason is that no automatic hyperparameter adjustment is performed for CNN. At the 5th and 14th epochs, overfitting occurs due to the zero-valued batches. After the 14th epoch, both the training and testing losses become equal and overfitting is avoided until the 18th epoch. After the 18th epoch, the overfitting starts again and lasts till the end. This overfitting does not happen due to the zero-valued batches; it occurs due to the unavailability of specialized feature extraction and hyperparameter tuning mechanisms for the existing CNN model.
The proposed and existing models' evaluation results, which were obtained using 20% testing data and 80% training data, are presented in Table 7 and Figure 9. The proposed classifier performs better than all of the benchmark classifiers, as shown by the results. In the proposed SSA-GCAE-CSLSTM model, the simultaneous usage of CS, GCAE, and SSA improves the theft detection performance and makes it the best-performing model among all the considered benchmarks. It attains 0.9225 accuracy, which is the best ETD accuracy. It also outperforms the benchmark classifiers, such as DT, SVM, AdaBoost, XGBoost, RF, ET, CNN, CSLSTM, our newly proposed feature extractor-based GCAE-CSLSTM, GS-GCAE-CSLSTM, and RS-GCAE-CSLSTM, with respect to accuracy. Our proposed model also beats the aforementioned benchmarks (except GS-GCAE-CSLSTM and RS-GCAE-CSLSTM) in terms of precision. It obtains a precision value of 0.9945, which is among the best precision values. A higher precision value shows that the model is highly reliable in classifying instances as positive. In addition, the proposed model also obtains the best AUC value among all the selected benchmark schemes. It achieves a 0.7113 AUC value, which is the best AUC score among the selected benchmarks. It means that our proposed model better differentiates and separates the normal and theft classes. The AUC-ROC curves for the proposed model and benchmark schemes are shown in Figure 10. According to the figure, the proposed model's AUC value of 0.7113 is reasonable when compared with all of the chosen benchmark schemes. It means that our proposed model discriminates the two classes (benign and theft) in a suitable way because of its strong feature extraction and hyperparameters' tuning abilities.

5.5 | Ablation results and computational complexity
The ablation results of the proposed model are presented as follows. When the GRU is ablated, the accuracy, AUC score, precision, recall, and F1 score of SSA-CAE-CSLSTM are 0.3050, 0.5703, 0.2479, 0.9597, and 0.3941, respectively. Similarly, when both the GRU and CAE are ablated, we have 0.9145 accuracy, 0.5425 AUC score, 0.9945 precision, 0.9184 recall, and 0.9550 F1 score for SSA-CSLSTM.
The computational complexities of different algorithms are given as follows.
• SVM: Nonlinear SVM's training complexity mostly lies between O(n^2) and O(n^3), where n is the number of instances.67,68
• SSA: The time complexity of SSA is O(tNd), where t is the number of iterations, N is the number of salps (population size), and d is the problem dimensionality (number of variables); a minimal loop illustrating this bound is sketched after this list.75
• CSLSTM: The time complexity of CSLSTM is O(T d_h^2 + T d_h d_i + f(T) + g(T)), where T is the length of the input sequence, d_h is the hidden state's dimension, d_i is the input's dimension, f(T) is the time complexity function for cost-sensitive learning, and g(T) is the time complexity function for the cost-sensitive learning backward pass.
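The O(tNd) bound for SSA follows from its structure: each of the t iterations updates all N salps across all d dimensions. The following minimal implementation, with arbitrary bounds and a toy objective standing in for a validation-loss objective, makes the three factors explicit:

```python
import numpy as np

def ssa_minimize(obj, d, N=30, t_max=100, lb=-1.0, ub=1.0, seed=0):
    """Minimal salp swarm algorithm: O(t * N * d) position updates."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, size=(N, d))           # N salps in d dimensions
    fit = np.apply_along_axis(obj, 1, X)
    best = X[fit.argmin()].copy()                  # food source (leader target)
    best_fit = fit.min()
    for t in range(1, t_max + 1):                  # t iterations
        c1 = 2 * np.exp(-(4 * t / t_max) ** 2)     # exploration/exploitation balance
        for i in range(N):                         # N salps ...
            if i == 0:                             # leader moves around the food
                c2, c3 = rng.random(d), rng.random(d)
                step = c1 * ((ub - lb) * c2 + lb)  # ... over d dimensions each
                X[i] = np.where(c3 < 0.5, best + step, best - step)
            else:                                  # followers track the salp ahead
                X[i] = (X[i] + X[i - 1]) / 2.0
        X = np.clip(X, lb, ub)
        fit = np.apply_along_axis(obj, 1, X)
        if fit.min() < best_fit:
            best_fit, best = fit.min(), X[fit.argmin()].copy()
    return best, best_fit

# Toy example: minimize a 2-D quadratic surrogate objective
best, val = ssa_minimize(lambda x: np.sum(x ** 2), d=2)
print(best, val)
```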
| Mapping of the limitations, solutions, and validations

The mapping is given in Table 8. L.1 represents limitation 1, the class imbalance problem. It is resolved using the class-weighting method, and the result is validated in Table 6 and Figure 4 by comparing CSLSTM with the other data balancing techniques. The second limitation is the curse of dimensionality, which is resolved by the newly introduced hybrid GCAE model. The solution is validated in Table 7 and Figure 9, where GCAE-CSLSTM obtains better results than all eight models (SVM, ET, RF, DT, XGBoost, AdaBoost, CNN, and CSLSTM) that employ no dedicated feature extractor. The third limitation is ignoring hyperparameter tuning, which leads to low accuracy. We used SSA to tune the hyperparameters of the proposed model to achieve better results; this solution is validated using Table 7 and Figure 10.
Limitation 4 concerns the overfitting issue created by SMOTE-based data balancing in the literature. We tackle this issue by using class-weighting-based data balancing instead of SMOTE; the solution is validated in Figures 5 and 6. Limitation 5 is the underfitting problem, which occurs due to the usage of the RUS technique for data balancing in the literature. Instead of RUS-based balancing, we use class-weighting-based balancing to tackle the underfitting issue; this solution is validated by the comparison of CSLSTM and RUS-LSTM given in Table 6 and Figure 4. Finally, the last limitation is the stagnation of the proposed model in local optima. This issue is tackled by tuning the proposed model's learning rate hyperparameter using SSA; the solution is validated in Table 7 and Figure 10 by comparing the results of the proposed SSA-GCAE-CSLSTM with the other nine models.
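Since class weighting is the recurring remedy for limitations 1, 4, and 5, a short sketch of how such weights are typically derived and passed to training may be useful. The 9:1 label vector and the commented fit call are illustrative placeholders, not the SGCC data or the paper's exact configuration:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([0] * 900 + [1] * 100)   # imbalanced: 9:1 benign vs. theft

# 'balanced' gives w_c = n_samples / (n_classes * n_c): here {0: 0.556, 1: 5.0},
# so every theft sample contributes 9x more to the loss than a benign one.
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y_train)
class_weight = dict(enumerate(weights))
print(class_weight)

# In Keras, the weights scale each sample's loss term during training:
# model.fit(X_train, y_train, class_weight=class_weight, epochs=25)
```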
T A B L E 8 Mapping of the limitations identified, solutions proposed, and validations done.

L.1 Limitation: The model's bias towards the majority class due to data imbalance, which has a negative impact on the effectiveness of theft detection.13,15
    Solution: The cost-sensitive learning (weighting) method is used to handle the data imbalance issue.
    Validation: The cost-sensitive (weighting) data balancing technique is validated, as given in Table 6 and Figure 4.
L.2 Limitation: Model generalization is hampered by neglecting the curse of dimensionality issue.12,13
    Solution: GCAE is used to efficiently extract the important features and remove the unnecessary ones to improve generalization.
    Validation: The generalization of the proposed GCAE feature extractor-based SSA-GCAE-CSLSTM model is validated, as given in Table 7 and Figure 9.
L.3 Limitation: Hyperparameter tuning is not considered, which results in low accuracy.13,17
    Solution: SSA is used to adjust the hyperparameters' values of the proposed SSA-GCAE-CSLSTM model.
    Validation: SSA-based parameter tuning shows improved performance results, as presented in Table 7 and Figure 10.
L.4 Limitation: Overfitting occurs due to the usage of SMOTE for data balancing, as SMOTE overpopulates a specific region instead of increasing the overall data.12,14
    Solution: The cost-sensitive learning (weighting) method tackles SMOTE's overfitting issue.
    Validation: Cost-sensitive learning (weighting)-based data balancing, which keeps the proposed SSA-GCAE-CSLSTM from overfitting, is validated, as shown in Figures 5 and 6.
L.5 Limitation: Loss of important information due to RUS, which leads to the underfitting problem.16
    Solution: Cost-sensitive learning (weighting) is employed to balance the data instead of RUS to solve the underfitting issue.
    Validation: The proposed cost-sensitive learning-based data balancing is compared with RUS, as presented in Table 6 and Figure 4.
L.6 Limitation: Local optima stagnation.13,17
    Solution: The learning rate of the proposed model is optimized using the global search ability of SSA, which keeps the proposed model from being trapped in local optima.
    Validation: The SSA-GCAE-CSLSTM model is validated, as given in Table 7 and Figure 10.

Abbreviations: CSLSTM, cost-sensitive learning and long short-term memory; GCAE, gate convolutional autoencoder; RUS, random undersampling; SMOTE, synthetic minority oversampling technique; SSA, salp swarm algorithm.
| CONCLUSION
In this paper, we introduced SSA-GCAE-CSLSTM, a novel technique for ETD in AMI. A bioinspired optimization approach, SSA, is used to automatically tune the model's hyperparameters. In the proposed work, a hybrid GCAE model is developed for feature extraction. Furthermore, the CS learning method is implemented to balance the EC data acquired from SGCC. Finally, the LSTM model is used for the classification task. In addition to the proposed SSA-GCAE-CSLSTM model, two basic versions of it, CSLSTM and GCAE-CSLSTM, along with seven benchmarks, SVM, DT, ET, RF, AdaBoost, XGBoost, and CNN, are implemented for comparison purposes. After extensive simulations, the proposed model provides excellent ETD results: 99.45% precision, 95.93% F1 score, 92.25% accuracy, and 71.13% AUC score, outperforming its basic versions and all the benchmarks. It is concluded that SSA-GCAE-CSLSTM is an effective model that performs efficient theft detection.


F I G U R E 4 The data balancing techniques' comparison. AUC, area under the receiver operating characteristic curve; CSLSTM, cost-sensitive learning and long short-term memory; LSTM, long short-term memory; RUS, random undersampling; SMOTE, synthetic minority oversampling technique.
F I G U R E 5 The proposed SSA-GCAE-CSLSTM model's training and testing accuracy. CSLSTM, cost-sensitive learning and long short-term memory; GCAE, gate convolutional autoencoder; SSA, salp swarm algorithm.
F I G U R E 6 The testing and training loss of the proposed SSA-GCAE-CSLSTM model. CSLSTM, cost-sensitive learning and long short-term memory; GCAE, gate convolutional autoencoder; SSA, salp swarm algorithm.

T A B L E 2 Proposed GCAE feature extractor architecture.
T A B L E 3 Optimal values of hyperparameters computed by salp swarm algorithm.
Optimal values of hyperparameters computed by grid search.
Optimal values of hyperparameters computed by random search.
T A B L E 6 Comparison of cost-sensitive (weighting) with other data balancing techniques. Abbreviations: AUC, area under the receiver operating characteristic curve; CSLSTM, cost-sensitive learning and long short-term memory; LSTM, long short-term memory; RUS, random undersampling; SMOTE, synthetic minority oversampling technique.
F I G U R E 7 The testing and training accuracy of the benchmark CNN model. CNN, convolutional neural network.
F I G U R E 8 The testing and training loss of the benchmark CNN model. CNN, convolutional neural network.
T A B L E 7 Comparison of SSA-GCAE-CSLSTM with the benchmarks. Abbreviations: AdaBoost, adaptive boosting; AUC, area under the receiver operating characteristic curve; CNN, convolutional neural network; CSLSTM, cost-sensitive learning and long short-term memory; DT, decision tree; ET, extra trees; GCAE, gate convolutional autoencoder; GS, grid search; RF, random forest; RS, random search; SSA, salp swarm algorithm; SVM, support vector machine; XGBoost, extreme gradient boosting.
F I G U R E 10 Techniques' comparison based on AUC. AdaBoost, adaptive boosting; AUC, area under the receiver operating characteristic curve; CNN, convolutional neural network; CSLSTM, cost-sensitive learning and long short-term memory; DT, decision tree; ET, extra trees; GCAE, gate convolutional autoencoder; RF, random forest; SSA, salp swarm algorithm; SVM, support vector machine; XGBoost, extreme gradient boosting.