Performance of cache placement using supervised learning techniques in mobile edge networks

Funding information: Natural Sciences and Engineering Research Council of Canada

Abstract: With the growth of mobile data traffic in wireless networks, caches are used to bring data closer to mobile users and to minimise the traffic load on the macro base station (MBS). Storing data in caches on user terminals (UTs) and small base stations (SBSs) poses challenges in deciding which contents to cache. Here, a multi-objective cache content strategy that aims to maximise the cache hit rate of SBSs in mobile edge networks (MENs) is proposed. The multi-objective cache placement optimisation is formulated as a classification problem. Unlike previous work, mobility input attributes such as user locations, contact duration, communication ranges, and contact probability between UTs and SBSs, as well as content popularity and the correlation between these input attributes, are used to separate the decision space into two regions: cache and not cache. The stochastic gradient descent (SGD) algorithm is used to train three supervised machine learning techniques, artificial neural network (ANN), support vector machine (SVM), and logistic regression (LR), to define the hyperplane that separates the cache content decision space. Simulation results show that, compared with the weighted-sum approach, the SBS cache hit rates increase on average by 18.58%, 18.52%, and 18.2%, and the total energy consumption values decrease on average by 33.49%, 53.19%, and 49.9% for ANN, SVM, and LR, respectively.

Various cache placement schemes have been designed for edge caching in wireless networks. Most existing literature on cache content placement algorithms aims to maximise the cache hit rate by placing contents based on changes in content popularity, because a large share of content requests is generated by a few popular files. Placing only popular files, however, may result in replication of the same file across many caches in one cell. In [4], the authors presented a cache placement algorithm in which the binary decision to cache or not cache (1 or 0) is converted into a convex optimisation problem by relaxing the decision to any value in [0, 1]. An optimisation technique finds a non-integer solution, which is then converted into an integer solution. The authors took the content popularity distribution as an input to the optimisation problem and specified its constraints based on the network topology. A deterministic caching approach was used for cache placement based on computational greedy algorithms. In [5], an optimal greedy algorithm was proposed for cache content placement; the authors searched for optimal contents by balancing content diversity in caches within the wireless network cell against the channel diversity gain. The authors in [6] designed cache placement with the objective of minimising the delay of delivering files to end users. They proved that the objective function is a monotone non-increasing super-modular function that can be solved by a greedy algorithm to obtain a locally optimal solution. Deterministic caching approaches can improve network performance only under the assumption that UTs stay at the same locations for several hours [6].
To enlarge the set of cached files and exploit caching diversity, cooperative caching and device-to-device (D2D) caching approach was presented in [7][8][9]. The D2D approach utilises UT caches to improve network performance. The D2D cache placement algorithms considered content popularity, distances between users in one cell, and social association between users [10]. An asymptotically optimised online learning content placement and content delivery algorithm was proposed in [10]. The algorithm used a distributed on-line learning approach at individual edge servers based on stochastic gradient descent (SGD). Studies that use multi-winner auction theory for cache placement-based D2D communication showed that the network throughput increased linearly with the number of UTs in one cell [11].
A learning-based proactive caching was investigated to predict the content demand of users. In [12], the authors used singular value decomposition (SVD) to predict content preferences. In [13], users-to-content association was used to predict content popularity using non-negative matrix factorisation (NMF) machine learning. This is a linear model that learns through one layer relationship between the user and the contents. Multi-layer learning was proposed in [14] that used a deep neural network, which provides a non-linear model for the association between users and the content demands. Distributed deep learning (DDL) was proposed in [15] in which deep learning was used to predict users demand at the edge of the network. Then, the trained models were collected by the central server from the MENs and the global model was updated accordingly.
The cache placement problem was studied from the perspective of user mobility. In [16], the authors presented the effects of exploiting the statistical traffic pattern and users' context information, such as file popularity, user location, and mobility pattern on the amount of resources needed and at which network location the content should be pre-cached. Context-aware proactive caching was also considered in the work proposed in [17]. The authors formulated a multi-armed bandit optimisation problem. They considered user mobility by computing connected users' preferences and users' service priorities in a given time slot. The content popularity was considered to change over time by taking into account users' movements and connections between new sets of users. In [18], the authors addressed realistic features of cache placement by describing user mobility via the set of highly visited locations within the cell where each location was covered by one or more SBSs. The impact of mobility-awareness in cache placement algorithm was discussed in [19]. The authors formulated the problem of caching of coded segments at base stations (BSs) and UTs taking into account user mobility and the content amount per transmission. User mobility was presented as a peer-to-peer connectivity model with contact duration and contact frequency perspectives. The problem was formulated as an integer programming problem and solved by sub-modular optimisation.
In most previous works, such as those in [16][17][18][19], the impact of user mobility on caching strategy was considered from one of the following perspectives: either from users' content preferences, highly visited places by the users, user social context, content segmentation and multi-casting of segments to multiple cache locations, or users' contact probability. However, improving the quality of service by increasing the cache hit rate that leads to minimising the latency of downloading the contents and minimising the total energy consumption require the utilisation of learning techniques in the modelling of cache placement. Learning technique can be used to exploit different user mobility attributes and the relationship between these attributes that can affect cache placement decisions. This is a challenging problem and motivates for further investigations. Future wireless network should be more adaptive to the changes in user mobility, data rate, delay, number of SBSs and user terminals, and many other network information and parameters. To improve the adaptiveness to the dynamic changes in wireless networks, the network intelligence should be used to model network status and update future predictions. This is an open research challenge requiring the design of edge caching strategies which can automatically adapt the wireless network changes [20].
In this paper, a new formulation of the cache placement problem is investigated. It builds on our formulation in [21] of a multi-objective function that aims to maximise the cache hit rate by maximising the file content popularity, maximising the contact duration between the mobile user and the cache storing the required contents, minimising the communication range between the mobile user and the cache, and maximising the UT contact probability. In that formulation and solution, we assumed equal weights for all input parameters. In this paper, we focus on the values of the weighted-sum (WS) cache placement input attributes to model the relationship between these values and the hit rate when users are moving. Weights are assigned to each input to assess the contribution of the input attributes to the cache placement decision. The multi-objective cache placement optimisation is formulated as a classification problem. The formulation of cache placement as binary classification was investigated in [22]. The authors considered properties of the video contents and the context of the users requesting them as the input attributes defining the cache decision space; these attributes included the number of views, number of likes, shares, comments, language, user subscription, and user past views. Unlike previous work, we explore mobility input attributes such as user locations, contact duration, communication ranges, and contact probability between UTs and SBSs, as well as content popularity and the correlation between these input attributes, which can separate the decision space into two regions: cache and not cache.
The main contributions of this work are summarised as follows:
- Unlike existing techniques, we propose a new cache placement algorithm formulated as a binary classification problem (to cache or not to cache) based on user locations, contact probability, communication range, contact duration, and content popularity. Artificial neural network (ANN), support vector machine (SVM), and logistic regression (LR) are used to model cache placement decisions.
- We investigate the characteristics of the input features (attributes) and the properties of these characteristics that affect supervised machine learning approaches.
- We investigate the performance of the new cache placement models built with supervised learning techniques by running the proposed models under different system parameters. These parameters are varied to study the sensitivity of the classification decisions to changes in system parameters.
The remainder of the paper is organised as follows: Section 2 describes the system model and assumptions. In Section 3, we present cache placement problem formulation using binary classification approach. Section 4 discusses hyperplane parameters estimation based on the learning techniques. In Section 5, we provide the simulation results and discussion followed by conclusion in Section 6.

| SYSTEM MODEL
As illustrated in Figure 1, we consider a MEN with multiple SBSs and UTs. Each macrocell consists of one MBS connected to a gateway of the core network via a high-speed interface, N SBSs connected to the MBS through backhaul links, and M UTs connected to neighbouring devices through D2D communication and to SBSs. The sets of SBSs and UTs are denoted by s = {s_1, s_2, …, s_N} and u = {u_1, u_2, …, u_M}, respectively. Each MBS, SBS, and UT has a cache storage unit, with different storage capacities. Within one macrocell, the cache storage capacities are defined by the two sets c_s = {c_{s_1}, c_{s_2}, …, c_{s_N}} and c_u = {c_{u_1}, c_{u_2}, …, c_{u_M}} for SBS and UT caches, respectively. It is assumed that the MBS has enough cache capacity to store F files defined by the set f = {f_1, f_2, …, f_F}.
There can be four types of cache content delivery approaches, as shown in Figure 1. Out of these approaches, we consider the following approaches: local cache, SBS cache, and MBS cache.
The content library consists of F files stored in the MBS cache. Each file is denoted by f_z for z = 1, …, F. Files are requested from the main library based on their popularity distribution. The popularities of the F files are denoted by the set Λ = {Λ_1, Λ_2, …, Λ_F}, which can be characterised by the Zipf popularity distribution [23].
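As a concrete illustration, Zipf-distributed requests can be generated as follows; this is a minimal Python sketch, and the skew parameter gamma = 0.8 is an assumption for illustration, since its value is not fixed in this section:

```python
import random

def zipf_popularity(F, gamma=0.8):
    """Zipf popularity over F files: rank z gets probability
    proportional to 1 / z**gamma."""
    weights = [1.0 / z**gamma for z in range(1, F + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_requests(F, num_requests, gamma=0.8, seed=0):
    """Draw file IDs in 1..F for num_requests user requests."""
    rng = random.Random(seed)
    popularity = zipf_popularity(F, gamma)
    return rng.choices(range(1, F + 1), weights=popularity, k=num_requests)

# 100 files in the MBS library, 10000 requests (as in the simulation setup)
requests = sample_requests(F=100, num_requests=10000)
```

Lower-ranked (more popular) files receive proportionally more of the generated requests, which is what makes caching only the head of the distribution attractive.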
It is assumed that the connectivity of users in the D2D communication and in the small cell network (communication between mobile UT and SBS) is a peer-to-peer connectivity model [19]. The user mobility is modelled based on spatial and temporal properties. The spatial properties provide physical location information of user mobility patterns, while the temporal properties provide time-related information [24]. The user mobility can be modelled by assuming a pairwise contact process that follows an independent Poisson process.
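Under this model, the contact instants of a single UT-SBS pair can be simulated by drawing i.i.d. exponential inter-contact times; the rate and horizon values below are illustrative assumptions, not parameters taken from the paper:

```python
import random

def simulate_contacts(rate, horizon, seed=0):
    """Contact instants of one UT-SBS pair as a homogeneous Poisson
    process: inter-contact times are i.i.d. Exponential(rate); events
    after `horizon` are discarded."""
    rng = random.Random(seed)
    t, contacts = 0.0, []
    while True:
        t += rng.expovariate(rate)  # next inter-contact gap
        if t > horizon:
            return contacts
        contacts.append(t)

# illustrative values: mean inter-contact time 1/0.1 = 10 time units
times = simulate_contacts(rate=0.1, horizon=1000.0)
```

The expected number of contacts over the horizon is rate × horizon, so the pairwise contact frequency used by mobility-aware placement falls out of this process directly.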

| MULTI-OBJECTIVE CACHE PLACEMENT
A cache placement algorithm decides which contents to cache and where to cache them. After the cache placement phase is complete, users start sending their requests for contents during the cache access and delivery phase. If the content is downloaded from another UT and/or from an SBS cache, it is considered a hit; if it is downloaded from the MBS, it is considered a miss. The cache placement policy aims to maximise the cache hit rate, which minimises the content delivery latency. The weighted-sum cache placement approach provides insight into the input data attributes; an approach able to indicate which input attributes maximise the hit rate is useful for implementing efficient content delivery to UTs. Assume a matrix A = [A_UT : A_SBS]_{(M+N)×F} as the cache strategy matrix to be determined by the cache placement algorithm. The entries of A are either one or zero (cached or not cached). Let the matrix (A_UT)_{M×F} represent the cache strategy of files cached in UTs, where a_{k,z} ∈ A_UT indicates that file f_z can be served by UT u_k (k = 1, …, M). Similarly, the matrix (A_SBS)_{N×F} represents the cache strategy of the files cached in SBSs, where a_{j,z} ∈ A_SBS indicates that file f_z can be served by SBS s_j, ∀j ∈ {1, …, N}. The objective function for the latency-efficient cache placement algorithm (Equation (1)) maximises the probability of UT u_i downloading the requested files from nearby UTs and SBSs within the contact durations [21], and the weighted-sum of the normalised objective values is

f_WS = w_1 Λ_z + w_2 T*_{UT,SBS} + w_3 D*_{UT,SBS} + w_4 P_m^{UT,SBS},    (2)

where w_1, w_2, w_3, and w_4 are the weights of each objective function. The file popularity probability, normalised contact duration, normalised communication range, and contact probability are represented by Λ_z, T*_{UT,SBS}, D*_{UT,SBS}, and P_m^{UT,SBS}, respectively.
In the multi-objective optimisation problem, weights act as general gauges of the relative importance of each objective function. Giving one objective function more importance than the others in maximising the cache hit rate requires selecting a set of weights that reflects these preferences, and the final solution may not accurately reflect the initial weight preferences of each objective function [25]. In the cache placement problem, objectives may conflict with one another, and optimising a solution with respect to a single objective function may yield an unacceptable solution with respect to the others. For example, placing files in caches by maximising content popularity while neglecting other objectives, such as minimising the distances between a UT and other UTs and SBSs, may produce an unreasonable solution. A set of solutions needs to be investigated such that each solution satisfies all objectives at an acceptable level.

| Input attributes influence weights
The cache placement decision is a binary decision (place or do not place the contents) resulting in one of two cases when the user requests the content: either the content is found (Hit = 1) or not found (Hit = 0) in the UT and SBS caches. We will use the binary classification conditional probabilities presented in [26]. As mentioned before, cache placement has four input attributes: file popularity probability (Λ_z), normalised contact duration (T*_{UT,SBS}), normalised communication range (D*_{UT,SBS}), and contact probability (P_m^{UT,SBS}). Each input attribute has an influence on the cache placement decision, represented by its weight; the combination of these influence weights gives an influence score for the cache placement decision.
To formulate the influence weights, we assume that the influence weight of each input attribute on the cache placement decision lies between −1 and +1: certainty of the decision 'place' produces a weight of +1, and certainty of the decision 'not place' produces a weight of −1. Thus, the influence weight of an input attribute value represents its influence on the cache placement decision relative to its opposite. For any input attribute, the sum of the conditional probabilities of the two possible results is 1, as the events are mutually exclusive:

P(Hit = 1 | I_i) + P(Hit = 0 | I_i) = 1,    (3)

where I_i is the value of the input attribute, ∀i ∈ {1, …, 4}. Consider the cache hit as positive, with probabilities mapped to the range 0 to +1, and the cache miss as negative, with probabilities mapped to the range −1 to 0; the range of mapped probabilities for any input attribute is then between −1 and +1. Since each weight represents its influence on both probabilities, the influence weight of an input attribute over both cache states can be computed as

w_i = P(Hit = 1 | I_i) − P(Hit = 0 | I_i).    (4)

Assume that, over a given period of time, there are V user requests, each resulting in a hit or a miss; the hit rate h can then be computed as

h = (number of hits)/V.    (5)

Using Equation (4), the file popularity probability influence weight (w_1), normalised contact duration influence weight (w_2), normalised communication range influence weight (w_3), and contact probability influence weight (w_4) are equal to 0.2, 0.6, 0.8, and −0.2, respectively. Figure 2 shows an example of input attribute influence weights for the two classes of the cache placement problem.
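The mapping of the two conditional probabilities onto a [−1, +1] influence weight described above can be sketched as follows, with the probabilities estimated from hit/miss counts (the counts shown are hypothetical):

```python
def influence_weight(hits, misses):
    """Estimate P(Hit = 1 | I) and P(Hit = 0 | I) from hit/miss counts
    observed for one input attribute value, then map their difference
    onto [-1, +1]: certainty of 'place' gives +1, certainty of
    'not place' gives -1."""
    total = hits + misses
    p_hit = hits / total
    p_miss = misses / total       # p_hit + p_miss == 1
    return p_hit - p_miss         # equivalently 2 * p_hit - 1

# hypothetical counts: an attribute value seen with 80 hits, 20 misses
w = influence_weight(80, 20)      # ≈ 0.6
```

Because the two conditional probabilities sum to one, the difference is just 2·P(hit) − 1, so an uninformative attribute (hits and misses equally likely) maps to a weight of zero.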

| Classification-based influence scores
For each user request, we can compute a single score for each influence weight of the input attributes. Summing and averaging the influence weights w_{a,j} over all user requests for input attribute I_a yields a score that serves as an indicator of the influence weight of that attribute. For our cache placement input attributes, the influence weight scores are

S_a = (1/V) ∑_{j=1}^{V} w_{a,j},    (6)

where a ∈ {1, 2, 3, 4}; S_1, S_2, S_3, and S_4 represent the influence weight scores for Λ_z, T*_{UT,SBS}, D*_{UT,SBS}, and P_m^{UT,SBS}, respectively; and w_{a,j} is the influence weight of the ath input attribute for the jth user request.
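The averaging of per-request influence weights into per-attribute scores can be sketched as below; the two example requests and their weight values are hypothetical:

```python
def influence_scores(weights_per_request):
    """weights_per_request[j][a] is the influence weight w_{a,j} of
    attribute a for user request j; the score S_a averages it over
    all V requests."""
    V = len(weights_per_request)
    num_attrs = len(weights_per_request[0])
    return [sum(row[a] for row in weights_per_request) / V
            for a in range(num_attrs)]

# two hypothetical requests, four attributes in the order
# (popularity, contact duration, communication range, contact probability)
scores = influence_scores([[0.2, 0.6, 0.8, -0.2],
                           [0.4, 0.6, 0.6,  0.0]])
```

A strongly positive score marks an attribute whose values consistently coincide with hits, and a negative score one that coincides with misses.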
According to (6), the influence score can be used as evidence that the corresponding example results in one class and not the other. The samples representing previous cache placement decisions for V user requests are treated as a training dataset, from which we can search for cases that result in correct classification and misclassification.
Classification is the task of predicting the class to which a set of input attributes belongs. The classifier k(I, w) aims to define a decision surface k(I, w) = 0 such that, for a given set of input attributes I, the classifier can predict the class to which the input attributes belong: given the value of I, its class label y is predicted as y = +1 if k(I, w) > 0 and y = −1 if k(I, w) < 0, that is, k(I, w) decides on which side of the decision surface k(I, w) = 0 the input attributes I lie [27]. For a given training set with V samples, each resulting in either y = −1 for a miss or y = +1 for a hit, the goal is to design a hyperplane that separates these two classes:
k(I, w) = f(∑_{i=1}^{4} w_i I_i + w_0),    (7)

where w_0 is the bias and f is the classifier function, which can be either linear or non-linear. We need to build a classifier that places points with negative score values on one side of the hyperplane and points with positive score values on the other side. Figure 3 shows an example scenario in which the hyperplane separates the 4-dimensional space into two regions. Each row of the training set contains the values of the input attributes and the label y, the class to which that row belongs, either y = −1 or y = +1. The influence weights w_i for i = 1, …, 4 and the bias w_0 can be computed by minimising the mean square error (MSE) cost function [28]:

J(w_j) = (1/V) ∑_{j=1}^{V} (y'_j − y_j)²,    (9)
where y'_j is the predicted class for the given values of the influence weights and input attributes, and y_j is the actual class of the jth sample. The MSE is minimised so that the predicted responses are as close as possible to the actual class values, and the weights and bias are estimated to formulate the optimal classifier model. Learning techniques that can be used to estimate the influence weights and build the cache placement model are presented in the following section.
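A minimal sketch of minimising the MSE cost by stochastic gradient descent for a linear classifier is given below; the toy one-feature dataset, learning rate, and epoch count are assumptions for illustration, not the paper's settings:

```python
import random

def sgd_train(X, y, alpha=0.05, epochs=5000, seed=0):
    """Minimise the MSE cost J(w) = (1/V) * sum((y' - y)^2) for the
    linear predictor y' = w.I + w0 by stochastic gradient descent:
    one randomly drawn sample per update step."""
    rng = random.Random(seed)
    w, w0 = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        i = rng.randrange(len(X))
        pred = sum(wj * xj for wj, xj in zip(w, X[i])) + w0
        err = pred - y[i]                  # gradient of (pred - y)^2 is 2*err
        w = [wj - alpha * 2 * err * xj for wj, xj in zip(w, X[i])]
        w0 -= alpha * 2 * err
    return w, w0

# toy one-attribute dataset with class labels -1 (miss) / +1 (hit)
X = [[0.1], [0.2], [0.8], [0.9]]
y = [-1, -1, 1, 1]
w, w0 = sgd_train(X, y)
```

The learned sign of w·I + w0 then plays the role of the decision surface: positive values fall on the cache side, negative values on the not-cache side.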

| HYPERPLANE PARAMETERS ESTIMATION BASED ON LEARNING TECHNIQUES
The hyperplane parameter estimation can be implemented using supervised machine learning classifiers. A supervised classifier requires a dataset of input/output pairs: each row of the dataset contains the observations (the vector of input features) and the corresponding class y. The classification function builds a model by training on the given dataset, and the model can then estimate the output y' for new input features. The output of the classifier is the decision to either place the contents (cache) or not place them (not cache). The objective of the learning algorithm is to minimise the cost function given in Equation (9), optimised using the stochastic gradient descent (SGD) algorithm, which learns iteratively from the training dataset. Three supervised machine learning (ML) techniques are used as binary classifiers for the cache content placement algorithm, as presented in the following subsections. We adopt the SGD algorithm in all three learning techniques because the literature shows that SGD can minimise the loss function evaluated over a given training set and find a good set of estimated parameters. It can also avoid overfitting by stopping the iterations of the optimisation routine early, before convergence [29].
Consider the training dataset (the observations), consisting of n input features (I_1, I_2, …, I_n) and the corresponding class y for each of the V observations. The classifier output y is 1 if the observation belongs to the class 'cache' and 0 if it belongs to the class 'not cache'.

| Artificial neural network (ANN) for cache placement
An artificial neural network (ANN) is first adapted for binary classification of the cache placement problem. The adaptive moment estimation (Adam) gradient-descent-based optimiser is used for estimating the cache contents. The Adam algorithm combines RMSprop and SGD with momentum: it uses the squared gradient to scale the learning rate and a moving average of the gradient instead of the gradient with momentum [30]. Adam is suited to machine learning problems with high-dimensional input parameter spaces and large datasets, and it calculates the adaptive learning rate individually for the various parameters involved in the training of gradients [30].
We consider an ANN with three layers: one input layer, one hidden layer, and one output layer. The input layer is the first layer of the ANN and has no weights associated with it; it consists of n neurons representing the number of input features. Each neuron in the hidden layer transforms the values from the previous layer with a weighted linear summation followed by a sigmoid non-linear activation function. The output layer receives the output of the hidden layer and computes the output value y. The ANN estimates a non-linear classifier that separates the two classes, written as k(I, w) = f_Ω(I, w), where f_Ω(I, w) is a non-linear function consisting of the weights of the neurons in the hidden and output layers as well as the activation function. The Adam update uses the following notation:
- g_t = ∇_Ω J(Ω_{t−1}), the gradient of the cost function with respect to the model weights at time t;
- Ω_{t+1}, the updated model weights at time t + 1;
- α, the initial learning rate;
- ϵ, a very small number to prevent division by zero in the implementation, assumed ϵ = 10^{−8}.
The algorithm of artificial neural network for cache placement is shown in Algorithm 1.
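A single Adam update step can be sketched as follows; the hyperparameter values are the commonly used defaults, not values taken from the paper, and the quadratic toy objective is only for illustration:

```python
import math

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moving averages of the gradient (m) and squared
    gradient (v) are bias-corrected, then each parameter takes its own
    adaptive step."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]   # bias correction
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    w = [wi - alpha * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v

# toy objective J(w) = w^2, gradient 2w, starting from w = 1.0
w, m, v = [1.0], [0.0], [0.0]
for t in range(1, 2001):
    w, m, v = adam_step(w, [2 * w[0]], m, v, t)
```

Dividing by the root of the squared-gradient average gives each weight its own effective step size, which is why Adam scales well to the high-dimensional parameter spaces mentioned above.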

| Support vector machine (SVM) for cache placement
Support vector machine (SVM) is a learning technique that defines a hyperplane splitting the attribute space (input features) into two classes. Given a training dataset {(I_i, y_i)}_{i=1}^{V} with vectors I_i and labels y_i ∈ {+1, −1}, the aim is to design a linear classifier k(I, w) that satisfies [31]

y_i w^T I_i > 0, ∀i ∈ {1, …, V}.

This is equivalent to

y_i w^T ϕ(I_i) ≥ 1, ∀i ∈ {1, …, V},

where the non-linear function ϕ(·) maps the input space to a high-dimensional feature space. The SVM classifier objective function can then be defined as [31]

min_w (1/2) w^T w + C ∑_{i=1}^{V} max(0, 1 − y_i w^T I_i),    (14)
where (1/2) w^T w is the regularisation term that maximises the margin and imposes a preference over the hypothesis space to achieve better generalisation. The term ∑_{i=1}^{V} max(0, 1 − y_i w^T I_i) is the hinge loss (empirical risk) and penalises weight vectors that make mistakes. The penalty parameter C (C > 0) controls the trade-off between a large margin and a small hinge loss.
As shown in Equation (14), if the product of the true label y and the predicted value y' = w^T I is positive and larger than 1, the loss is zero; otherwise, the loss increases linearly. The stochastic gradient descent algorithm for SVM performs gradient descent with respect to the objective function in Equation (14). The update rule for the weights is then [32]

w_{t+1} = w_t − α ∇_{w_t} J(w_t),
where α is the learning rate and ∇_{w_t} denotes a sub-gradient with respect to w_t. The SVM algorithm for cache placement is shown in Algorithm 2.

Algorithm 2 Support Vector Machine Classifier for Cache Placement
1: Input: dataset with input features I_j and one output label y.
2: Output: classifier model k(I, w) given in (7).
3: Initialise bias w_0 = 0, weights w_j = 0, and learning rate α = 0.001.
4: Split dataset into 70% for training and 30% for testing.
Training the model:
5: for i = 1 to number of epochs do
6:   Pick a random sample (I_i, y_i) from the training dataset.
7:   Compute the sub-gradient ∇_{w_t} of J(w) in (14).
8:   w_{t+1} = w_t − α ∇_{w_t} J(w_t)
9: end for
10: Return the model structure with updated bias w_0 and weights w_j.
Testing the model:
11: for each instance in the testing dataset do
12:   Read input attributes I_j.
13:   Compute y' using the model from step 10.
14: end for
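The training loop of Algorithm 2 can be sketched as a Pegasos-style sub-gradient descent on the hinge-loss objective; the toy dataset, the values of C, the learning rate, and the constant bias feature are assumptions for illustration:

```python
import random

def svm_sgd(X, y, C=10.0, alpha=0.01, epochs=2000, seed=0):
    """Sub-gradient SGD for the hinge-loss objective
    (1/2)||w||^2 + C * sum(max(0, 1 - y_i * w.I_i)): at each step the
    regulariser always contributes, and the hinge term contributes only
    when the sampled point violates the margin."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        i = rng.randrange(len(X))
        margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
        if margin < 1:                     # hinge active
            grad = [wj - C * y[i] * xj for wj, xj in zip(w, X[i])]
        else:                              # only the regulariser
            grad = list(w)
        w = [wj - alpha * g for wj, g in zip(w, grad)]
    return w

# toy data; the trailing constant-1 feature plays the role of the bias w_0
X = [[0.0, 1.0], [0.2, 1.0], [0.8, 1.0], [1.0, 1.0]]
y = [-1, -1, 1, 1]
w = svm_sgd(X, y)
```

Folding the bias into a constant feature keeps the update rule identical to step 8 of Algorithm 2 for every parameter.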

| Logistic regression (LR) for cache placement
The aim of the binary logistic regression (LR) algorithm is to train a classifier that makes a binary decision about the class of a new input observation. The LR model estimates the probability P_r(y = 1 | I) that the observation belongs to the class 'cache' or the probability P_r(y = 0 | I) that it belongs to the class 'not cache'. The LR model learns from the training set by updating the weights w_1, w_2, …, w_4 and the bias w_0. As discussed in Section 3.1, each weight w_i is associated with one input attribute I_i and represents the importance of that attribute to the classification decision; the bias term is added to the weighted inputs.
According to Bayesian statistics, the posterior probability of a random event is the conditional probability assigned after the relevant evidence or background is taken into account. For a logistic regression model on a two-class (y_1, y_2) classification task, the posterior probabilities are modelled as [33]

P_r(y_1 | I) = 1/(1 + e^{−(w^T I + w_0)})

and

P_r(y_2 | I) = 1 − P_r(y_1 | I).

After training of the weights and bias is complete, the classifier multiplies each input I_j by its updated weight w_j, sums the weighted features, and adds the updated bias term w_0, as shown in Algorithm 3.

Algorithm 3 Logistic Regression Classifier for Cache Placement
1: Input: dataset with input features I_j and one output label y.
2: Output: classifier model k(I, w).
3: Initialise bias w_0 = 0, weights w_j = 0, and learning rate α.
4: Split dataset into 70% for training and 30% for testing.
Training the model:
5: for each training instance in the training dataset do
6:   Compute the predicted output and its error.
7:   Update bias and weights.
8: end for
9: Return the model with updated bias w_0 and weights w_j.
Testing the model:
10: for each instance in the testing dataset do
11:   Read input attributes I_j.
12:   Compute y' using the trained model.
13:   if y' ≥ 0.5 then
14:     The decision is cache.
15:   else
16:     The decision is not cache.
17:   end if
18: end for

The classifier models described by Algorithms 1-3 define the contents of the SBS caches; that is, these classifiers determine the contents of the solution matrix A presented in the multi-objective function given in Equation (1).
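The prediction step of the LR classifier (sigmoid posterior followed by the 0.5 threshold) can be sketched as below; the weight, bias, and input values are hypothetical, not trained values from the paper:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lr_predict(I, w, w0):
    """Posterior P(y = 'cache' | I) = sigmoid(w.I + w0); the decision
    rule caches when this probability is at least 0.5."""
    z = sum(wi * xi for wi, xi in zip(w, I)) + w0
    p = sigmoid(z)
    return ("cache" if p >= 0.5 else "not cache"), p

# hypothetical attribute values (popularity, duration, range, contact
# probability) and hypothetical trained weights/bias
decision, p = lr_predict([0.7, 0.6, 0.3, 0.5],
                         w=[1.2, 0.9, -0.8, 0.4], w0=-0.9)
```

Thresholding the posterior at 0.5 is equivalent to testing the sign of w·I + w_0, so LR defines a linear decision surface in the attribute space, like the hyperplane of Section 3.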

| EXPERIMENTAL RESULTS
In this section, we evaluate the performance of the classifier-based supervised machine learning algorithms for cache placement and access via simulations. We compare the performance of the proposed cache placement approaches with the popularity-based, random, and weighted-sum cache placement algorithms.

| Simulation Setup
Our simulation is implemented on a certain sub-area with the area dimensions and number of SBSs described in [34]. We assume that the macrocell includes one MBS and 15 SBSs, and that users pass through three different paths with 11 locations on each path (Figure 7). The results are computed for 10,000 randomly generated user locations and 10,000 different user requests within one wireless network cell. Each request consists of the requested file ID, and file requests follow the Zipf popularity distribution. A number of experiments compare the impact of system parameters on the performance of the different cache placement algorithms; during each experiment, one parameter is varied while the rest are fixed. The parameters are listed in Table 1. For the weighted-sum approach, we assume all parameters have equal weights and a threshold of 0.5, taking 50% of the effect of the input attributes.
We assume that the user mobility parameters follow the work in [19]: contacts form an independent Poisson process, and the pairwise contact duration between a mobile UT and an SBS follows an exponential distribution with parameter λ_{i,j}^{SBS}, where Γ(10, 1/100) represents the contact rate between mobile UT and SBS. The dataset is divided into 70% for training and 30% for testing. The simulation was implemented in Python on a Jupyter notebook installed in an Anaconda environment [35].
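The mobility parameters described above can be sampled as follows; drawing one exponential contact duration per UT-SBS pair from a Gamma-distributed contact rate is our reading of the model in [19], and the number of pairs is an arbitrary choice:

```python
import random

def draw_contact_model(num_pairs, seed=0):
    """Per-pair contact rates lambda drawn from Gamma(shape=10,
    scale=1/100), i.e. mean rate 0.1, then one exponential contact
    duration per pair parameterised by that rate."""
    rng = random.Random(seed)
    rates = [rng.gammavariate(10, 1 / 100) for _ in range(num_pairs)]
    durations = [rng.expovariate(lam) for lam in rates]
    return rates, durations

rates, durations = draw_contact_model(1000)
```

With shape 10 and scale 1/100 the mean contact rate is 0.1, so the typical contact duration scale in this sketch is around 10 time units.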

| Cache placement algorithms
To evaluate the performance of the proposed classifier-based cache placement algorithm and investigate the relationship between the input attributes and the cache hit rate, we compare it against the following baseline caching algorithms. (From Table 1: number of files in the MBS, 100; SBS cache size, 10%-100% of the MBS library.)

| Popularity caching [19]
It assumes that the most popular contents are placed in each SBS cache.

| Random caching
It assumes that contents are placed randomly in each SBS cache.

| Weighted-sum approach caching [21]
In our earlier work, we assumed that the contents are placed in SBS caches depending on input attributes such as file popularity, contact duration between UT and SBSs, communication ranges between UT and SBSs, and contact probability between UTs and SBSs.

| Artificial neural network with SGD learning
An ANN is trained on datasets of previous user input attributes, user requests, and the resulting hits or misses to extract information from the input attributes and predict cache contents.

| Logistic regression (LR) with SGD learning
It uses an optimisation algorithm (SGD) to train the cache placement model by minimising the loss function in (9). The trained model is then used to predict cache contents.

| Support vector machine (SVM) with SGD learning
It uses an SVM to find the hyperplane that separates the decision space into two classes, place or do not place the contents, corresponding to cache hit or cache miss. The iterative learning process is implemented using the SGD algorithm, which minimises the cost function in (9).

| Learning technique classification models
In this section, the results are evaluated for the learning techniques used in this paper: ANN, SVM, and LR. The performance of the classification models is explored on the same dataset using a confusion matrix, which summarises classification performance by comparing the predictions on the testing dataset against the correct outcomes; the model accuracy scores are then calculated to identify model errors [36]. Figures 4-6 show the confusion matrices for the ANN, SVM, and LR classification models, respectively. These figures summarise the percentages of correct and incorrect predictions for each class and the types of errors made by the classifiers. The correct estimations lie along the diagonal from top-left to bottom-right of the matrices. There are 23.13%, 21.27%, and 23.09% of the sample cases that belong to the not-cached category and are predicted not cached by the ANN, SVM, and LR models, respectively. These samples are true negatives (TN): the prediction is not to cache and the correct decision is not to cache. The other correct estimation is the true positive (TP) case: 75.35%, 74.64%, and 74.42% of the samples, for which the actual decision was to cache, are predicted to be cached by the ANN, SVM, and LR models, respectively. These are the cases where users requested the files and the files were stored in the caches.

Looking at the classification errors, the false positive (FP) rate is 0.0%, 0.04%, and 1.86% for the ANN, SVM, and LR models, respectively. This means that a small percentage of samples in the SVM and LR models are identified as cache when they should not actually be cached. From the cache placement perspective, this wastes energy and adds latency by storing contents in SBS caches that will not be requested from those SBSs and/or will not result in successful transmission to the requesting UTs. From this point of view, ANN can be considered the best model for estimating cache contents compared with the other cache placement algorithms used in this work. Table 2 describes the performance of our estimators in more depth by applying the proposed ANN, SVM, and LR algorithms to the datasets and comparing the accuracy, precision, recall, and F1 score of the results. The accuracy gives the ratio of correctly predicted observations to the total observations: Accuracy = (TP + TN)/(TP + FP + FN + TN). Accuracy alone does not fully indicate the performance of the estimator; analysing the other performance metrics gives a better understanding of the model before using it in the actual environment with different cases. The precision indicates how precise the model is: of the samples predicted positive (cache), how many are actually positive. Hence, Precision = TP/(TP + FP). The sensitivity, or recall, measures the ratio of correctly predicted positive observations (cache) to all actual positive observations: Recall = TP/(TP + FN). Then, the harmonic mean of precision and recall gives the F1 score performance index.
The F1 score describes model performance better than the accuracy index when the class distribution is uneven. It is defined as F1 = 2 × (Precision × Recall)/(Precision + Recall). Based on the results in Table 2, among the three classification models, ANN performed best and achieved the highest score compared with the other algorithms. However, all the remaining algorithms also demonstrate good performance when compared with our previous work, the weighted-sum approach.
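The four metrics above can be computed directly from confusion-matrix counts. The example plugs in the ANN percentages reported earlier (TP = 75.35%, TN = 23.13%, FP = 0.0%), with FN taken as the remainder of 100%:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# ANN confusion-matrix percentages; FN = 100 - 75.35 - 23.13 - 0.0 = 1.52.
acc, prec, rec, f1 = classification_metrics(tp=75.35, fp=0.0, fn=1.52, tn=23.13)
```

With FP = 0, precision is exactly 1.0 for the ANN, which is why it scores highest in Table 2.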

| Performance evaluation
In this paper, we show only the results of caching in the SBS caches. The six strategies are evaluated and compared by computing the cache hit rate in different experiments. User mobility is considered by taking into account that users may request files while moving from one location to another within the wireless network. Users can download files from nearby SBSs without needing the backhaul link, which results in a cache hit. The cache hit rate is the number of cache hits divided by the total number of user requests.
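The hit-rate metric just defined can be sketched as a simple counting function. The (file_id, sbs_id) request format is an assumption for illustration, and the sketch deliberately ignores whether a download completes within the contact duration:

```python
def cache_hit_rate(requests, sbs_caches):
    """Fraction of requests served by a nearby SBS cache (cache hits).

    `requests` is a list of (file_id, sbs_id) pairs: the file a user asks for
    and the SBS the user is in contact with at that moment. `sbs_caches`
    maps each SBS id to the set of file ids it stores.
    """
    hits = sum(1 for file_id, sbs_id in requests
               if file_id in sbs_caches.get(sbs_id, set()))
    return hits / len(requests)
```
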

| Impact of cache size
To study the impact of the SBS cache size on the cache hit rate, we consider SBS cache sizes ranging from 10% to 100% of the entire file library size. Figure 8 shows the effect of changing the SBS cache size on the SBS hit rate. For all cache placement algorithms, increasing the cache size increases the cache hit rate, because a larger cache can hold more files, which increases the possibility of finding the requested contents. In popularity and random caching, the cache hit rate increases almost linearly with the SBS cache size. The WS approach performs better than popularity and random caching, with a near-linear increase in the cache hit rate as the SBS cache size grows.
The performance of the learning techniques ANN, SVM, and LR is very similar and better than that of the other techniques. Their results show that although increasing the cache size increases the cache hit rate, the hit rate is already 0.92 for ANN, 0.921 for SVM, and 0.91 for LR even at a 10% SBS cache size. This means that the trained models in the three learning techniques were able to extract information from the dataset and give correct predictions for cache content placement regardless of the cache size.

Figure 9 shows the impact of SBS cache size on the hit rate for different user speeds. The figure shows the results for 10 m/s and 25 m/s for four cache placement techniques: WS, ANN, SVM, and LR. We observe that at low user speed (10 m/s), the SVM caching technique outperforms ANN, LR, and WS. When the user speed increases to 25 m/s, ANN gives a higher hit rate than SVM, LR, and WS.

Figure 11 presents the influence of the SBS data rate on the cache hit rate. As the results show, for all cache placement techniques, increasing the SBS data rate increases the hit rate, since the SBSs can transfer more data during contact with the UT. A higher SBS data rate allows the user to complete downloading the contents within the available time while moving from one location to another. When the SBS data rate is 4 Mbps, the SVM technique outperforms the ANN, LR, and WS caching algorithms, while for higher data rates (16 Mbps), LR shows a higher hit rate than ANN, SVM, and WS.

| Impact of file size
Figure 12 shows the impact of changing the file size from 1 to 8 MB on the SBS cache hit rate. As expected, the hit rate decreases as the file size increases, because the users download the files while moving from one location to the next along the path.
When the file size increases, the users cannot complete downloading the whole file within the available time, which results in a miss. When the file size is small (1 MB), the ANN gives a higher hit rate than the SVM, LR, and WS caching techniques; at 8 MB, the ANN again outperforms SVM, LR, and WS. Table 3 compares the cache hit rate of four cache placement algorithms while changing selected system parameters between low and high values. The results show that changing the SBS cache size from low to high only slightly increases the cache hit rates for the ANN, SVM, and LR models, with ANN achieving the highest score; these models are relatively insensitive because they were trained to place the contents based on the input features, by defining the hyperplane parameters that separate the two classes. The WS cache placement algorithm, in contrast, improves noticeably as the cache size increases, since in WS all input attributes are assigned equal weights in the cache placement model. Changing the user speed from low to high affects all algorithms. Note that in our simulation, the speed is increased only to the point where the UT is still able to download the contents from the SBS while on the move. LR is the least sensitive to changes in user speed and yields a higher hit rate. For the SBS data rate, changing the rate from low to high affects the cache hit rates of all algorithms, since a higher data rate increases the amount of content transmitted to the UT; SVM shows the least sensitivity, and its difference between low and high data rates is marginal. In the last experiment, we changed the stored file size and computed its effect on the cache placement algorithms; in this case, the ANN has the highest hit rate compared with the remaining techniques, as shown in Figure 12.
Figure 13 shows the impact of the cache size and the cache placement technique on the normalised total energy consumption of the SBSs and MBS required to download the files. As the figure shows, the normalised total energy consumption decreases as the cache size increases. This is because as the SBS cache size increases, the cache hit rate increases, meaning the contents are mostly downloaded from the SBSs; the increase in energy consumed transmitting contents from the SBSs to the UTs corresponds to a decrease in the energy consumed downloading the same contents from the MBS. When comparing the normalised total energy consumption required to transfer the contents from the SBSs and MBS to the UTs among the different cache placement algorithms, we see that SVM and LR result in lower energy consumption, whereas the WS technique results in higher total energy consumption than the learning techniques.
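The trend in Figure 13 can be illustrated with a minimal two-tier energy sketch, assuming that each cache hit costs a fixed per-file energy on the SBS link and each miss costs a higher per-file energy on the MBS link; the values e_sbs and e_mbs below are illustrative placeholders, not the paper's parameters:

```python
def normalised_total_energy(hit_rate, e_sbs=1.0, e_mbs=4.0):
    """Normalised total download energy as a function of the cache hit rate.

    Hits are served by an SBS at cost e_sbs per file; misses fall back to
    the MBS at cost e_mbs per file. The total is normalised by the
    all-miss (MBS-only) case, so 1.0 means no caching benefit.
    """
    total = hit_rate * e_sbs + (1.0 - hit_rate) * e_mbs
    return total / e_mbs
```

Under this model, energy decreases linearly as the hit rate grows, which is consistent with the observation that larger caches (higher hit rates) reduce the normalised total energy.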

| DISCUSSION
In this study, we proposed a new cache placement scheme using supervised learning techniques in MENs. To the best of our knowledge, this is the first work in which cache placement is formulated as a binary classification based on a multi-objective optimisation problem to improve the cache hit rates, which in turn minimises the total latency of downloading the contents to the end users. The proposed learning-based cache placement also minimises the total energy consumption required to download the contents. We analysed three learning techniques for the prediction of cache contents: ANN, SVM, and LR. The first part of the experiments measures the accuracy and prediction precision of these three algorithms. For the same datasets, the three algorithms show competitive performance in their computed precision, recall, and F1 score. The prediction accuracy is higher for ANN than for SVM and LR. There is no unique classifier that performs best for all situations of a cache placement system in MENs. To select one of the three algorithms to model the cache placement automatically for a given dataset, number of input features, and computation resources, the following should be considered:
- If the collected dataset is not large, SVM or LR should be selected; otherwise, ANN is selected.
- If the number of input features increases, and consequently the hyperplane dimension, ANN or SVM should be selected; otherwise, LR can be selected.
- If the computation resources are limited, LR is selected, since it can be implemented faster and more easily; otherwise, ANN or SVM is selected.
Further selection criteria can be derived from the second part of the experiments, which measures the cache hit rate in MENs while changing the system parameters. Table 3 can assist in selecting an appropriate learning algorithm based on the changes in SBS cache size, user speed, SBS communication range, and file size, as follows:
- If the SBS cache size is low, SVM should be selected; otherwise, ANN should be selected.
- If most users are known to move at low speed, ANN should be used; otherwise, LR should be used at higher user speeds.
- If the SBS communication ranges are low, LR should be used; otherwise, ANN should be used for high SBS communication ranges.
- If the SBS data rates are low, SVM should be used; otherwise, LR should be used.
- If the file size is large, ANN should be used.
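A rule-based automatic selection of this kind could start from the dataset-size, feature-count, and compute-budget criteria listed above; the priority ordering used here (compute budget first, then dataset size, then feature count) is our assumption, since the criteria are listed without an explicit ranking:

```python
def select_classifier(dataset_large, many_features, limited_compute):
    """Pick ANN, SVM, or LR from the selection criteria above.

    Rules are applied in an assumed priority order: a limited compute
    budget forces LR; a small dataset favours SVM (or LR); many input
    features favour ANN (or SVM); otherwise ANN is the default.
    """
    if limited_compute:
        return "LR"
    if not dataset_large:
        return "SVM"   # LR is an equally valid choice here
    if many_features:
        return "ANN"   # SVM is an equally valid choice here
    return "ANN"
```
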
(Figure 13: Impact of cache size on the normalised total energy consumption with different cache placement algorithms.)
Continuous dataset accumulation is suggested to better understand the effect of the classification decisions and of changes in system parameters on the latency and the total energy consumption required to download the contents in mobile edge networks. Increasing the training dataset by orders of magnitude requires adopting new technologies such as big data and developing the proposed algorithms into deep learning-based prediction algorithms. However, using supervised learning techniques has a practical implication: collecting datasets from moving users and building such a large dataset while continuously changing the system parameters requires a considerable amount of resources, time, and effort. These resources may not be available in many practical cases, which limits the use of deep learning methods. In addition, an increase in training samples may cause a supervised learning technique to fit itself to the statistical noise in the training sample, resulting in overfitting. To overcome these problems, the use of semi-supervised learning, which leverages both labelled and unlabelled data, is suggested.

| CONCLUSION
This work focussed on formulating cache placement as a classification problem that can be solved using machine learning techniques. Different learning techniques were investigated to understand the problem and the input attributes correlated with the classification decisions. We analysed the characteristics of the input features (attributes) related to the cache content placement objectives and their properties. These input attributes were important for forming the hyperplane that separates the multi-dimensional space into two decision regions (cache contents and do not cache contents). The performance of the different machine learning models was compared on the same datasets. Finally, the cache placement models using the artificial neural network, support vector machine, logistic regression, and weighted-sum algorithms were tested with actual datasets of user requests taken from the published literature. The tests were made by changing one system parameter while fixing the others and computing the hit rate, to investigate the sensitivity of the classification to changes in the environment parameters.
It has been noticed that with the changes in system parameters, there were a few limitations and weaknesses in each classifier model used in this research. Generating an automatic algorithm selection that decides which algorithm is executed in each wireless network cell depending on the size of the dataset, the number of input features, the available resources, and values of system parameters is suggested for future work. The automatic selection of the algorithm can be a rule-based technique aiming to combine different learning algorithms.
This work yielded promising results for the formulation of cache placement as a classification problem, but continuous dataset accumulation and the adoption of semi-supervised deep learning techniques are also suggested, to better understand the effect of the classification decisions and of changes in system parameters on the latency and energy consumption caused by downloading the contents requested by users.