From statistical‐ to machine learning‐based network traffic prediction

Nowadays, due to the exponential and continuous expansion of new paradigms such as the Internet of Things (IoT), Internet of Vehicles (IoV), and 6G, the world is witnessing a tremendous and sharp increase in network traffic. In such large-scale, heterogeneous, and complex networks, the volume of transferred data, as big data, is considered a challenge that causes various networking inefficiencies. To overcome these challenges, various techniques, collectively called Network Traffic Monitoring and Analysis (NTMA), have been introduced to monitor network performance. Network Traffic Prediction (NTP) is a significant subfield of NTMA that mainly focuses on predicting future network load and its behavior. NTP techniques can generally be realized in two ways, that is, statistical- and Machine Learning (ML)-based. In this paper, we provide a study of existing NTP techniques by reviewing, investigating, and classifying the recent relevant works conducted in this field. Additionally, we discuss the challenges and future directions of NTP, showing how ML and statistical techniques can be used to address them.


INTRODUCTION
During the last decades, new networking paradigms, for example, Wireless Sensor Networks (WSNs), the Internet of Things (IoT), the Internet of Vehicles (IoV), and the 6th generation of cellular networks (6G), 1 have been emerging to establish the network infrastructures for real-world applications such as smart cities, crisis management, and smart roads. 2 Thanks to the miniaturization of digital equipment, today's networks include thousands of connected User Equipments (UEs) (known as end-node devices) that can generate and/or consume data. IoT, as an emerging networking paradigm, provides an overlay network on top of other network infrastructures, from Near-Field Communication (NFC) to cellular networks, to connect a virtually unlimited number of UEs. 3 It is expected that by 2025 the number of connected IoT devices will increase to 75 billion, as predicted by Cisco. 4,5 While managing such numerous devices is a challenging issue, other characteristics of networks, for example, heterogeneity and mobility, can also cause networking inefficiencies. Network heterogeneity is not solely due to the diversity of device types; it can also relate to other factors such as the volume of data generated by each connected UE, the required services, and the diversity of network connections. Given these characteristics of new networking paradigms, the volume of generated data can be enormous, which has given rise to the Big Data era. 6 To provide an efficient network infrastructure to transfer and manage such a huge volume of data, different techniques have been introduced, mainly to prevent various network faults and inefficiencies, support Quality of Service (QoS), and provide security. Apart from the type of network infrastructure, QoS depends on the performance of the data route, and even on the ability of routing devices to analyze the real-time network situation to make dynamic networking adjustments and allocation decisions.
Network Traffic Monitoring and Analysis (NTMA) techniques are mainly introduced to monitor network performance by providing information to analyze the network and offer solutions that address its challenges without human intervention. 7 There are four main subfields of NTMA, 8 including (i) Network Traffic Prediction (NTP), 9 (ii) Network Traffic Classification (NTC), 10 (iii) fault management, and (iv) network security. Among these subfields, NTP focuses on analyzing the network load and predicting the network traffic to avoid faults and inefficiencies in networking. In this study, we focus on NTP as one of the most critical solutions for addressing various networking challenges, for example, resource provisioning, congestion control, resource discovery, network behavior analysis, and anomaly detection. 6 Different techniques have been introduced to perform NTP, but generally, existing solutions can be divided into two types, that is, Machine Learning (ML)-based techniques and statistical-based techniques. As NTP can be designed based on both types, we first review the most relevant techniques and then investigate the proposed solutions for each type. The contributions of this study include: • Investigating the existing NTP techniques and available solutions to predict network behavior.
• Classifying the NTP techniques based on statistical-, ML-based, and hybrid techniques.
• Providing a concrete future direction based on real-world applications compared to state-of-the-art techniques, models, and frameworks.
• Proposing a schema to integrate statistical-based techniques and ML-based techniques to improve the performance of NTP techniques.
The rest of the paper is organized as follows. In Section 2, we first introduce the basic concepts and discuss the available types of NTP techniques. In Section 3, we survey and analyze existing solutions and provide a classification of them. In Section 4, we discuss the challenges and future directions, and finally, in Section 5, we conclude this study.

NTP, as a subtechnique of NTMA, is used to determine the status of the network, identify changes, and predict network traffic behavior in the foreseeable future. Generally, the results of NTP techniques can be used in a wide range of applications, for example, QoS provisioning, fault detection, and security attack detection. The problem of predicting future network traffic volume is traditionally formulated as a Time Series Forecasting (TSF) or, more rarely, a spatiotemporal problem, aimed at constructing a regression model capable of estimating future traffic volume by extracting the dependencies between historical and future data. 15,16 Low computational overhead, simplicity, and the limited number of required features are typical advantages of TSF approaches. 17 On the other hand, due to new demands stemming from the ever-increasing scale, speed, and heterogeneity of networks, non-TSF approaches are gaining ground. 18 These methods typically leverage flow and packet header data to estimate future incoming flows instead of traffic volume. 19
Basically, popular NTP solutions are divided into statistical analysis methods and ML-based models. 20,21 In the following, we shed light on some differentiating aspects of common network prediction methods and techniques.

Statistical techniques for NTP
Statistical techniques are mainly based on analyzing data patterns without any prior knowledge (ie, without training). Most such techniques compare the current data pattern with the last identified pattern to recognize important changes. Linear statistical models extract patterns from historical data and predict future points of a time-series according to the lagged data. The well-known members of this category are the AutoRegressive Moving Average (ARMA) and AutoRegressive Integrated Moving Average (ARIMA) models, as well as variants of the latter. 22 ARIMA, also known as the Box-Jenkins model, is a prevalent paradigm among statistical models for time-series prediction. Both the ARMA and ARIMA models emerge from the combination of the autoregressive (AR) model, which involves lagged values of observations, and the moving average (MA) model, which takes lagged errors. Nevertheless, the distinction between them lies in their treatment of stationarity of the time-series. While ARMA assumes the time-series is stationary, ARIMA establishes stationarity of the data through a differencing process, which may be applied multiple times until stationarity is reached. In "ARIMA," the I (standing for "Integrated") refers to this procedure. An ARIMA model is denoted as ARIMA(p, d, q), where p indicates the order of the AR part, d the degree of differencing involved, and q the order of the MA part. 23,24 As concisely described below, ARMA and ARIMA can be formulated, respectively, in Equations (1) and (3). Assuming a time-series X_t consisting of real numbers x_t indexed by an integer t, an ARMA(p, q) model is given by Equation (1), where \varphi_i are the parameters of the AR part, \theta_i are the parameters of the MA part, and \varepsilon_t are error terms assumed to be independent and identically distributed (i.i.d.) variables sampled from a normal distribution with zero mean:

X_t = \varepsilon_t + \sum_{i=1}^{p} \varphi_i X_{t-i} + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}.    (1)
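As an illustration, the ARMA process of Equation (1) can be simulated in a few lines; the coefficient values below are arbitrary illustrative choices, not drawn from any referenced work.

```python
import random

def simulate_arma(phi, theta, n, seed=0):
    """Simulate an ARMA(p, q) series:
    X_t = e_t + sum_i phi_i * X_{t-i} + sum_i theta_i * e_{t-i}."""
    rng = random.Random(seed)
    p, q = len(phi), len(theta)
    x, e = [], []
    for t in range(n):
        e_t = rng.gauss(0.0, 1.0)   # i.i.d. error term with zero mean
        ar = sum(phi[i] * x[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[i] * e[t - 1 - i] for i in range(q) if t - 1 - i >= 0)
        e.append(e_t)
        x.append(e_t + ar + ma)
    return x

# An ARMA(1, 1) path with illustrative coefficients phi = 0.6, theta = 0.3.
series = simulate_arma(phi=[0.6], theta=[0.3], n=200)
```

In practice, libraries such as statsmodels fit such models directly from data; the sketch above only makes the recursion in Equation (1) concrete.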
Interchangeably, Equation (1) can be written as below, where L is the lag operator:

\left(1 - \sum_{i=1}^{p} \varphi_i L^i\right) X_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) \varepsilon_t.    (2)

Interested readers are referred to Reference 23. Viewing the ARIMA model as a generalization of ARMA, the following formula defines an ARIMA(p, d, q) process with drift \delta:

\left(1 - \sum_{i=1}^{p} \varphi_i L^i\right) (1 - L)^d X_t = \delta + \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) \varepsilon_t.    (3)

Diverse variations of ARIMA have been proposed for various applications and time-series; among them, it is worth mentioning Seasonal ARIMA (SARIMA) 25 and Fractional AutoRegressive Integrated Moving Average (FARIMA). 26 The former is often used in NTP given its compatibility with the nature of changes in networks, which usually obey certain time patterns. A FARIMA forecasting model is an extension of the ARIMA(p, d, q) model in which the fractional parameter d can take real values rather than just integers, and is given by Equation (4), where L is the lag operator and \varepsilon_t are the error terms: 27

\left(1 - \sum_{i=1}^{p} \varphi_i L^i\right) (1 - L)^d X_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) \varepsilon_t, \quad d \in \mathbb{R}.    (4)

In addition to the components of conventional ARIMA, the SARIMA 28 model also includes a frequency component, known as seasonality and denoted by S. 29 A SARIMA model conducts prediction based on a linear combination of past observations and their related errors; as its name implies, the seasonality factor plays a key role in the structure and performance of this model. The SARIMA process is often written in the form SARIMA(p, d, q) × (P, D, Q)_S. For a given time-series {X_t} with seasonality length S, the SARIMA process is given by Equation (5), where the differenced series w_t = (1 - B)^d (1 - B^S)^D X_t is a stationary ARMA process and d and D are nonnegative integers:

\varphi_p(B)\, \Phi_P(B^S)\, w_t = \theta_q(B)\, \Theta_Q(B^S)\, \varepsilon_t.    (5)
where:
• the backshift operator B is defined as B^k X_t = X_{t-k};
• \varepsilon_t are i.i.d. samples with zero mean and variance \sigma^2, and for all k ≠ 0, Cov(\varepsilon_t, \varepsilon_{t-k}) = 0;
• the nonseasonal components are \varphi_p(B) = 1 - \varphi_1 B - \dots - \varphi_p B^p and \theta_q(B) = 1 + \theta_1 B + \dots + \theta_q B^q;
• the seasonal components are \Phi_P(B^S) = 1 - \Phi_1 B^S - \dots - \Phi_P B^{PS} and \Theta_Q(B^S) = 1 + \Theta_1 B^S + \dots + \Theta_Q B^{QS}.

As another of ARIMA's extensions, FARIMA is a generalization of the ARMA model customized to support applications such as NTP in which, besides short-term dependencies, there are considerable linear long-term dependencies between observations. Unlike the ordinary ARIMA process, the differencing parameter d in the FARIMA model can take noninteger values. 30 The general FARIMA process is expressed in Equation (6), where B is the backshift operator:

\varphi_p(B)\, (1 - B)^d X_t = \theta_q(B)\, \varepsilon_t.    (6)

Generally, the family of ARIMA models rests on the assumption of time-series stationarity, whereas in dynamic environments such as IoT, network traffic with severe and intermittent fluctuations can degrade model performance. Nonetheless, by using transformations other than differencing (eg, logarithms) to decrease nonstationarity in the input data, this deficiency can be overcome to some extent. 23,31 Another standard statistical model widely employed in time-series problems is Generalized Autoregressive Conditional Heteroskedasticity (GARCH). 32 This model is an extension of the Autoregressive Conditional Heteroskedasticity (ARCH) model introduced by Engle in 1982 to estimate the volatility of target variables. 33 The main goal is to model the changes in the variance of target variables, part of whose total variance is conditioned on lagged values of the target variance and the model's residuals. To this end, the concept of Conditional Variance (also referred to as Conditional Volatility) plays a key role. Considering {\varepsilon_t} as a real-valued discrete-time stochastic process with \varepsilon_t = \sigma_t w_t, where w_t is discrete white noise with i.i.d. samples (\mu = 0, \sigma^2 = 1), the GARCH(p, q) process is denoted by Equation (7):

\sigma_t^2 = \alpha_0 + \sum_{i=1}^{q} \alpha_i \varepsilon_{t-i}^2 + \sum_{j=1}^{p} \beta_j \sigma_{t-j}^2,    (7)
where \alpha_i and \beta_j are the model's parameters; meanwhile, to avoid negative variance, the following constraints are imposed: 34,35

\alpha_0 > 0, \quad \alpha_i \ge 0 \; (i = 1, \dots, q), \quad \beta_j \ge 0 \; (j = 1, \dots, p).

Moreover, in parallel to being applied jointly with nonlinear approaches, some hybrid models built on ARIMA's foundation have been proposed. For instance, Fuzzy-AutoRegressive Integrated Moving Average (Fuzzy-ARIMA) is a method that fuzzifies ARIMA's parameters using the fuzzy regression method. 36 In terms of acronyms, as seen sporadically in some references, Fuzzy-ARIMA is also referred to as FARIMA, which should not be confused with the Fractional-ARIMA model; the latter is also recorded as ARFIMA in some sources. 37
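As an illustrative sketch, the conditional-variance recursion of Equation (7) can be written directly for the common GARCH(1, 1) case; the parameter values below are arbitrary but satisfy the nonnegativity constraints above.

```python
import random

def garch11_path(alpha0, alpha1, beta1, n, seed=0):
    """GARCH(1,1): e_t = sigma_t * w_t,
    sigma_t^2 = alpha0 + alpha1 * e_{t-1}^2 + beta1 * sigma_{t-1}^2."""
    assert alpha0 > 0 and alpha1 >= 0 and beta1 >= 0   # nonnegativity constraints
    rng = random.Random(seed)
    var = alpha0 / (1 - alpha1 - beta1)   # unconditional variance as starting point
    eps, variances = 0.0, []
    for _ in range(n):
        var = alpha0 + alpha1 * eps ** 2 + beta1 * var   # Equation (7) recursion
        eps = (var ** 0.5) * rng.gauss(0.0, 1.0)         # e_t = sigma_t * w_t
        variances.append(var)
    return variances

v = garch11_path(alpha0=0.1, alpha1=0.2, beta1=0.7, n=100)
```

Because the constraints hold, every conditional variance in the path stays strictly positive.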

ML techniques for NTP
In general, the problems solvable by ML techniques can be formulated in four broad categories, namely, classification, regression, clustering, and rule extraction. 19 Correspondingly, there are four ML paradigms matching the nature of the problem at hand, namely Supervised Learning (SL), Unsupervised Learning, Semi-supervised Learning, and Reinforcement Learning. 38 Each of these paradigms has its own implications for data collection, ground-truth creation, and feature engineering. Most ML methods used in NTP are subtechniques of SL, as the models need to be trained on historical data. SL uses labeled (historical) data to build the models employed in classification and regression problems, where predicting outcomes in the form of discrete or continuous quantities is intended. In many real-world problems, access to labeled data is subject to constraints; in networking, most data gathered from a network is unlabeled or semi-labeled. 39 In the absence of sufficient knowledge, or when many labels are missing, the Semi-Supervised Learning (SSL) paradigm, a particular variant of SL exploiting techniques such as Active Learning, 40 can be leveraged. 6 One of the critical aspects of ML is choosing the proper model from the mass of available algorithms and techniques. Different factors apply, for example, the goals of the application, the pros and cons of the operating environment regarding the deployment of ML models, the learning method (ie, supervised or unsupervised), how data are accessed, etc. 41 Some of the most widely used models are presented in the following: • Neural Networks: Artificial Neural Networks (ANNs) are among the most potent and widely used ML techniques. 42 Thanks to the activation function, ANNs can learn complex nonlinear dependencies among numerous variables; thus, they are generally known as Universal Function Approximators.
42 The general architecture of an ANN is a directed graph 43 consisting of input and output layers connected via so-called hidden layers, of which there may be one or more. Input values reach the output layer through transformations applied in the hidden layers; the number of these layers is also referred to as the depth of the model. Based on depth, the term "Deep Neural Network" denotes ANNs constructed of two or more hidden layers, as opposed to "Shallow Neural Network," which refers to traditional baseline ANNs. 42 Due to their flexible structure, Deep Neural Networks (DNNs) have gained striking popularity in time-series prediction. In this context, the Recurrent Neural Network (RNN), by allowing inputs to be recycled in hidden layers through recurrent connections, has been a major advance. Different RNN-based architectures can be defined according to the adopted activation function and how the neurons connect to each other, namely the Fully Recurrent Neural Network (FRNN), Bidirectional Neural Networks (BNN), stochastic neural networks, and the well-known Long Short-Term Memory (LSTM) paradigm. 44 LSTM, as an extension of the RNN, was proposed to resolve the vulnerability of normal RNNs to the gradient exploding/vanishing problem caused by long-term dependencies. [45][46][47] With some innovations in its architecture, including a triple gate mechanism to control inputs to cells and a feedback loop for data retention, LSTM can learn long-term dependencies and remove invalid inputs that cause perturbations in the cell's outputs. 6,48 In practice, an implemented LSTM model usually consists of a set of blocks, where each block contains several LSTM cells.
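The triple gate mechanism described above can be sketched for a single scalar LSTM cell; the weight values below are arbitrary placeholders chosen only for illustration, not a real trained model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell_step(x, h_prev, c_prev, w):
    """One forward step of a single-unit LSTM cell (scalar state for clarity).

    w maps gate name -> (w_x, w_h, bias). The forget, input, and output gates
    control what is discarded from, written to, and exposed from the cell state."""
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])    # forget gate
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])    # input gate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])    # output gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h_prev + w["g"][2])  # candidate state
    c = f * c_prev + i * g     # cell state: retained memory plus new content
    h = o * math.tanh(c)       # hidden state exposed to the next layer/time step
    return h, c

# Placeholder weights and a toy input sequence.
weights = {k: (0.5, 0.5, 0.0) for k in ("f", "i", "o", "g")}
h, c = 0.0, 0.0
for x_t in [0.1, 0.4, -0.2]:
    h, c = lstm_cell_step(x_t, h, c, weights)
```

In real NTP models the states are vectors and the weights are learned by backpropagation through time; the scalar version above only makes the gating arithmetic explicit.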

FIGURE 1 A schema of how deep reinforcement learning interacts with the network
Despite the undeniable capabilities of deep learning-based methods, their slow training process is a significant problem for their application in dynamic environments. Moreover, the lack of transparency in the learning process of these models is another limitation. 49,50 • Reinforcement Learning: The Q-Learning algorithm, along with Deep Reinforcement Learning (DRL) (which is in fact a combination of Q-Learning and DNNs), is one of the two algorithms representing the Reinforcement Learning (RL) method. 8,12 As elaborated by Watkins, 51 the Q-Learning algorithm provides learner agents that can act optimally in Markovian environments relying on knowledge stemming from experiencing the consequences of actions, without needing to map the environment. 51 It relies on a function, called the Q-function, to learn a table containing all available state-action pairs and their long-term rewards. 52 In NTP, RL and its variants have good potential to interact with the network to learn its behavior and predict it in the future. As mentioned, RL can be integrated with DNNs, which can help improve the performance of RL techniques. Figure 1 shows a schema of the interaction between DRL and the network in the context of NTP.
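The tabular Q-function update at the heart of Q-Learning can be sketched as follows; the load states and scaling actions are hypothetical examples of how an NTP agent might be framed, not taken from the surveyed works.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    best_next = max(q[next_state].values())
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q[state][action]

# Hypothetical table: two network-load states, two scaling actions.
q = {"low_load": {"scale_in": 0.0, "scale_out": 0.0},
     "high_load": {"scale_in": 0.0, "scale_out": 0.0}}

# The agent scaled out under high load, was rewarded, and the load dropped.
q_update(q, "high_load", "scale_out", reward=1.0, next_state="low_load")
```

In DRL, the table is replaced by a DNN that approximates the Q-function, which is what makes the approach viable for large network state spaces.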
Although the ML techniques used in NTP are not limited to the list above, DNNs and RL are the most important ones based on our literature review, explained in Section 3.

Data collection
Creating an efficient model for a problem is highly dependent on the availability of appropriate and unbiased representative data. 19 Due to the variety of data in different networking applications, as well as the alternation of data over time, it is essential to adopt a suitable data collection method so that ML models can be trained. In networking, traffic data can be extracted through the packet inspection process using the Deep Packet Inspection (DPI) and SPI methods. 12 While the former is based on reading and, if necessary, analyzing the full packet contents, including application headers and payload, the latter examines only the headers of the network and/or transport layers in packets selected according to the target sampling strategies. 7 Nevertheless, apart from some specific applications of DPI (eg, filtering or troubleshooting), since this method imposes significant computational and memory overheads on the network, real-time traffic-monitoring-oriented tasks rely on SPI for extracting the required data from packet streams. 12,14 Data collection is typically performed in an offline or online manner. In the offline method, the data are used in their entirety to train the model at once; then the model is deployed and used for operational data analysis. In the online method, through a continuous process, model training is launched in conjunction with deployment in the operating environment, and the model's knowledge is updated with new input data received in sequential order. 13,53 While SL, Unsupervised Learning (UL), and SSL are generally used in the offline learning setting, some RL techniques, for example, State-action-reward-state-action (SARSA), 54 as well as incremental learning techniques, 55 are mainly designed to update the model gradually based on new data (Figure 2). Training the model using collected data can be achieved in either a batch or an incremental (also known as streaming) way, depending on the situation and learning settings.
In the batch setting, the collected data are divided into three subsets: training, test, and validation (the last of which is sometimes called the development set). The validation set is used when selecting the appropriate model and its architecture is part of the process; otherwise, it is not needed. Determining the optimal values of the model parameters (eg, the weights of connections between neurons in a neural network [NN]) and evaluating the model's performance are accomplished, respectively, using the training and test sets. On the other hand, in the incremental method, the data are streamed to the training model for various reasons (eg, a volume too large to load at once, or gradual generation of data). 13,53 Moreover, in dynamic environments, especially online applications, the ML model must be continuously retrained. In such cases, to solve the concept drift problem, and given the high computational cost of retraining from scratch, using only the new data for training via incremental approaches is an efficient solution. 12 Figure 2 shows the different ways of collecting data to train ML models in networking.
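The batch setting described above can be sketched as a simple chronological split; the 60/20/20 ratios are illustrative defaults, not prescribed by the surveyed works.

```python
def batch_split(data, train=0.6, val=0.2):
    """Chronological train/validation/test split for time-series data.
    Shuffling is deliberately avoided so that future samples never leak
    into the training set."""
    n = len(data)
    i, j = int(n * train), int(n * (train + val))
    return data[:i], data[i:j], data[j:]

# A toy series of 100 ordered observations.
samples = list(range(100))
tr, va, te = batch_split(samples)
```

In the incremental setting, by contrast, no such static split exists: each newly arriving observation is used to update the model and is then discarded or archived.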

Feature engineering
As one of ML's pillars, feature engineering includes feature selection and feature extraction, in addition to data cleansing and preprocessing. Feature selection refers to selecting effective discriminative features and removing irrelevant and redundant ones, whereas feature extraction involves deriving new extended features from existing ones. Both procedures reduce data dimensionality and computational overhead and, consequently, increase the model's efficiency and accuracy. 56 In networking, features can be classified by granularity into three levels: packet-level, flow-level, and connection-level. 57 The finest level of granularity is packet-level features, where packet-related statistical data such as the mean, root mean square (RMS), and variance, as well as time-series information, are extracted or derived from collected packets. 19 The independence of these features from the sampling method adopted to collect the data is their key advantage.
Features such as mean flow duration, mean number of packets per flow, and average flow size in bytes are observable at the flow level. At the highest level of granularity are connection-level features, which are extracted from the transport layer; throughput and the advertised window size in TCP connection headers are examples of these features. Despite the high-quality information provided by connection-level features, the excessive computational overhead they impose and their high distortion when facing sampling and routing asymmetries are among their drawbacks. Feature extraction is often performed using techniques such as Principal Component Analysis (PCA), entropy, and the Fourier transform. 19 Table 1 shows a summary of some popular network features.
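As a minimal sketch, packet-level statistics such as the mean, variance, and RMS can be derived directly from the sizes of collected packets; the packet sizes below are made-up values.

```python
def packet_level_features(packet_sizes):
    """Derive simple packet-level statistics (mean, variance, RMS)
    from the sizes of collected packets."""
    n = len(packet_sizes)
    mean = sum(packet_sizes) / n
    var = sum((s - mean) ** 2 for s in packet_sizes) / n
    rms = (sum(s * s for s in packet_sizes) / n) ** 0.5
    return {"mean": mean, "variance": var, "rms": rms}

# Hypothetical packet sizes in bytes captured from a link.
feats = packet_level_features([64, 1500, 512, 64, 1500])
```

Note the identity RMS² = mean² + variance, which is a quick sanity check for any such feature extractor.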

REVIEW OF EXISTING WORK
Using ML and statistical-based techniques in NTP is a well-established research area. Historically, one of the earliest works applying ML to NTP belongs to Yu and Chen, carried out in 1993. They used a multilayer perceptron (MLP) neural network (MLP-NN), motivated by enhancing accuracy over traditional AR methods. 19 Since then, many researchers have devoted themselves to improving ML-based solutions for predicting network behavior as accurately and promptly as possible. However, in this survey we focus on the most important recent works in the field. To review the literature, we have used the method shown in Figure 3 to refine the related literature. The NTP-related literature is classified as follows.

FIGURE 3 The method adopted to refine the literature 58

Pure ML-based NTP solutions
Evaluation of the performance and efficiency of ML-based methods is a significant part of the literature, and LSTM is one of the most widely used techniques in this field. The work by Alawe et al 59 targets the Access and Mobility Management Function (AMF) in the 5G network. To this end, two approaches, the Feedforward Neural Network (FFNN) and LSTM, are examined. The authors propose using ML to forecast the arrival of requests from User Equipment (UE) and, consequently, to drive the scale-out/in process. This avoids the rejection of requests and keeps the attach duration (how long UEs stay connected to network resources) low. The dataset is classified into 10 different classes based on the load and the number of AMFs needed. The first technique, FFNN, is used for predicting the load class of the next period; the second, LSTM, is tested for predicting the average load of the upcoming period. Both networks are trained with 60% of the dataset and then asked to predict the remaining 40%. The results indicate that LSTM outperforms FFNN in terms of prediction accuracy.
Trinh et al 46 have dedicated their work to presenting a network traffic prediction model for the LTE environment using the LSTM algorithm. An LSTM network consisting of multiple unified LSTM units is applied to raw mobile traffic data collected directly from the Physical Downlink Control CHannel (PDCCH) of LTE. Treating NTP as a supervised multivariate problem, the proposed model aims at minimizing the prediction error with respect to the information extracted from the PDCCH. This data-gathering methodology and the multistep structure adopted for the predictive network are emphasized as distinctive research aspects. According to the comparison results, the proposed model outperforms the ARIMA and FFNN models.
Pruning internal connections between NN neurons to diminish computational cost constitutes the idea underlying the research by Hua et al. 47 Based on this, a heuristic architecture with sparse neural connections, called Random Connectivity Long Short-Term Memory (RCLSTM), is introduced, in which the complete (one-to-one) connection between neural network neurons, as in conventional LSTM, gives way to a random pattern of links between them. The simulated model consists of a three-layer stacked RCLSTM.
Wang et al 60 put forward a model that aims to improve cellular traffic prediction accuracy with limited real data while supporting data privacy. The model, called ctGAN-S2S, consists of a cellular traffic generative adversarial network and a sequence-to-sequence neural network, in which learning is fed by an arbitrary-length time window of historical cellular traffic time-series data. The augmentation model generates close-to-real cellular traffic data; eliminating the need for the original data thus supports data protection and privacy.
The main contribution of Vinayakumar et al 20 is to compare several nonlinear prediction methods, namely the FFNN, RNN, Identity Recurrent Neural Network (IRNN), Gated Recurrent Unit (GRU), and LSTM (generally referred to as RNNs), in terms of performance in Traffic Matrix (TM) prediction in large networks under different experiments. The experiments arranged in the test scenarios help identify the optimal network parameters and structure of the RNN. The TM is fed in using a sliding-window approach. The obtained results indicate the superiority of LSTM over the other models, while GRU imposes relatively less computational cost.
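The sliding-window framing used in such TM prediction work can be sketched as follows; the window length and the toy series are illustrative assumptions, not values from the cited study.

```python
def sliding_windows(series, window, horizon=1):
    """Frame a traffic series as supervised pairs: each input is `window`
    consecutive observations, and the target is the observation
    `horizon` steps after the window ends."""
    pairs = []
    for t in range(len(series) - window - horizon + 1):
        pairs.append((series[t:t + window], series[t + window + horizon - 1]))
    return pairs

# A toy traffic-volume series; window of 3 past samples, 1-step-ahead target.
pairs = sliding_windows([10, 12, 15, 14, 18, 21], window=3)
# First pair: ([10, 12, 15], 14)
```

Each pair then becomes one training example for a regression model such as an RNN, GRU, or LSTM.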
In a considerable number of studies, Gaussian Process Regression (GPR) is the underlying leveraged technique. The research by Bayati et al 61 is one of those exploiting GPR in this context. The proposed solution is based on the Direct (or Parallel) strategy for multiple-step-ahead traffic prediction, in which the entire process is divided into H distinct models trained concurrently to conduct H-step-ahead time-series prediction. Each time-step forecast is fed, as one of its input features, by the prediction gained at the previous time-step. Consequently, the prediction error at one time-step (ie, the uncertainty in the feature vector) propagates through the forecasts at the next time-step. To tackle this error propagation, the paradigm has been investigated, indicating that the desired performance of a multiple-step-ahead prediction strictly depends on, and can be influenced by, the classification of data at higher levels.
Ensuring QoS in the network in a way that simultaneously provides resource (bandwidth) efficiency requires an accurate real-time estimation of the network's future behavior. To this end, an online bandwidth allocation method based on GPR has been proposed by Kim and Hwang. 62 The theorem that a stationary increase in the size of a given process leads to large-buffer asymptotics of the queue-length process forms the basis of the proposed model for deriving the proper bandwidth.
A combined solution to deal with complex network data flows is presented by Wang et al. 63 The proposed solution consists of a preprocessing step and a two-fold prediction process using LSTM and GPR models. In the initial phase, the data flow's dominant periodic features are extracted through Fourier analysis, and the LSTM model is then applied to the remaining small components. A complementary step adopting GPR is launched to estimate the residual components and improve the prediction's accuracy.
Poupart et al 64 elaborate on utilizing ML techniques to approximate each network flow's size at its start, focusing on detecting elephant flows independent of the source application or end host. The proposed estimation method benefits from metadata extracted from the first few packets of a flow. In this regard, the authors examine three ML techniques, including Gaussian processes, a Gaussian Mixture Model with Bayesian Moment Matching, and neural networks, to predict the flow size based on existing historical data and online streaming data.
Mikaeil 65 proposes a method for near-time prediction of primary user (PU) channel-state availability (ie, spectrum occupancy) in Cognitive Radio Networks (CRNs) using Bayesian online learning (BOL). Given that the nature of PU channel-state availability can be considered a dual-state switching time-series, the captured time-series representing the PU channel state (ie, PU idle or PU occupied) is fed as an observation sequence into the BOL prediction algorithm.
The primary use of cardinality estimation algorithms in computer networks is counting the number of distinct flows. Cohen and Nezri 66 have investigated the application of flow cardinality estimation algorithms in SDN environments. Their focus was on common deficiencies faced by sampling methods, especially in adapting to changes in the flow size distribution. Further, they have introduced and elucidated a framework that benefits from online ML; to achieve the best performance and accuracy, three popular linear regression ML algorithms, namely Stochastic Gradient Descent (SGD), Recursive Least Squares (RLS), and Passive-Aggressive (PA), have been examined.
Cui et al 67 discuss the Stochastic Online Learning (SOL) technique as a model tailored to environments like Mobile Edge Computing (MEC), which can experience time-varying, stochastic traffic arrivals without the Markov property. Among the multitude of articles, this paper is one of the few that addresses online learning from network behavior. Unlike the majority of ML methods that rely on learning from training data, SOL learns from network changes by using SGD. This approach aims at establishing a trade-off between learning accuracy and learning time while increasing network throughput.
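A minimal sketch of a stochastic-gradient-style online update for a linear predictor, in the spirit of learning from a stream of observations rather than a fixed training set; the feature/target values below are made up for illustration.

```python
def sgd_step(weights, features, target, lr=0.01):
    """One stochastic gradient descent update of a linear predictor
    on a single newly observed sample (squared-error loss)."""
    pred = sum(w * x for w, x in zip(weights, features))
    err = pred - target
    # Gradient of 0.5 * err^2 w.r.t. each weight is err * x.
    return [w - lr * err * x for w, x in zip(weights, features)]

# A stream of (features, target) observations arriving one at a time;
# the first feature acts as a bias term.
w = [0.0, 0.0]
for feats, y in [([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0), ([1.0, 1.0], 3.0)]:
    w = sgd_step(w, feats, y)
```

Because each update touches only the latest sample, the model adapts continuously as traffic arrives, which is the property that makes SGD-style methods suitable for online settings such as MEC.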
Zhang et al 68 present an incremental deep computation model for wireless big data feature learning in IoT. The model is constructed by stacking several incremental tensor auto-encoders (ITAE). To handle newly arriving wireless samples, two types of ITAEs, distinguished by their learning strategy, are developed, namely the parameter-based incremental learning algorithm (PI-TAE) and the structure-based incremental learning algorithm (SI-TAE). When facing newly arriving samples, the proposed model only needs to load the new samples into memory to update the parameters and structure, respectively, by PI-TAE and SI-TAE. This mechanism underlies the model's capability to deal with wireless big data for feature learning in real time.
Cellular link bandwidth prediction in Long-Term Evolution (LTE) networks has been investigated by Yue et al. 69 The authors have approached the problem by analyzing the correlations between various lower-layer information and the link bandwidth. Thereupon, they propose a framework using a Random Forest-based prediction model which, through an offline data feeding process and the intrinsic capabilities of Random Forest, identifies the most important features and uses them to predict link bandwidth in real time.
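The feature-selection idea behind such a framework can be sketched with scikit-learn's impurity-based feature importances; the "lower-layer indicators" below (`rsrp`, `cqi`) are synthetic stand-ins, not the actual LTE measurements of the surveyed paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical sketch: train a Random Forest offline on candidate
# features, then read its feature importances to keep the informative
# ones for online bandwidth prediction. All data here is synthetic.

rng = np.random.default_rng(42)
n = 1000
rsrp = rng.normal(-90, 10, n)        # stand-in: reference signal power
cqi = rng.integers(1, 16, n).astype(float)   # stand-in: channel quality index
noise_feat = rng.normal(0, 1, n)     # deliberately irrelevant feature
bandwidth = 0.5 * rsrp + 3.0 * cqi + rng.normal(0, 1, n)

X = np.column_stack([rsrp, cqi, noise_feat])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, bandwidth)
importances = model.feature_importances_     # nonnegative, sums to 1.0
```

The irrelevant feature receives a near-zero importance, which is exactly the signal an offline feeding process would use to prune the feature set.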
Zhang et al 70 have introduced an approach using a Convolutional Neural Network (CNN) to collectively model the spatial and temporal dependence for cell traffic prediction. The key features of the proposed approach are treating the traffic data as images and utilizing a parametric matrix-based fusion method to estimate the influence degrees of the spatial and temporal dependence.
In a study by Pfulb et al, 71 the problem of estimating the expected bit rate of network flows based on their metadata is cast as a three-class classification problem using a fully connected DNN with the ReLU activation function. Their approach comprises three stages: data collection, data preparation, and data processing. In this approach, DNN training is treated as a streaming problem by dividing the dataset into identical data blocks. The model is trained and tested in a semi-streaming fashion, applied to all blocks one by one, such that all intended preparations are performed block-wise.
To improve the self-management and active adjustment capabilities of base stations in wireless networks, Li et al 72 address the temporal and spatial correlation of traffic data. To this end, they have composed a deep network-based framework for network traffic prediction using CNN and LSTM, constituted of three main units by which the spatio-temporal correlation of wireless network traffic data can be captured effectively.
The first part of Table 2 shows the list of the reviewed literature in this category.

Pure statistical-based NTP solutions
Mehdi et al 36 have incorporated fuzzy regression and ARIMA models (Fuzzy-ARIMA), aiming to combine the advantages of both methods. Besides, to perform real-time predictions based on historical data, they have adopted a sliding window technique called SOFA, which reduces the effect of input data fluctuations over time.
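The plain autoregressive backbone of such windowed schemes can be sketched as follows; this is ordinary least-squares AR refit on a sliding window, an assumed simplification rather than the Fuzzy-ARIMA/SOFA method itself:

```python
import numpy as np

# Hypothetical sketch: fit AR(p) by least squares on the most recent
# window of a traffic series and predict the next value. The window
# length and order are illustrative parameters.

def ar_forecast(window, p=2):
    """Fit AR(p) with intercept on `window` and predict the next value."""
    y = window[p:]
    X = np.column_stack(
        [window[p - k - 1 : len(window) - k - 1] for k in range(p)]
    )
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    lags = window[-1 : -p - 1 : -1]          # most recent p values
    return coef[0] + lags @ coef[1:]

series = np.sin(np.arange(100) * 0.3)        # synthetic "traffic" signal
pred = ar_forecast(series[-30:], p=2)        # one-step forecast from last 30 samples
```

A pure sinusoid satisfies an exact AR(2) recurrence, so the forecast here matches the true next value; on real traffic the window would be slid forward and the fit repeated at each step.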
Aiming to allocate bandwidth efficiently in SDNs, Bouzidi et al 77 introduce a heuristic rule placement algorithm in which an online-learning module, based on linear regression, is utilized for network delay prediction. The overall framework of the proposed solution encompasses formulating the flow rule placement as an Integer Linear Program (ILP) targeting the minimization of total network delay, and finally solving the defined ILP problem through a devised algorithm, which reduces the time complexity and enhances the estimation accuracy.
The second part of Table 2 shows the list of the reviewed literature in this category.

Hybrid NTP solutions
In addition to the pure statistical- and ML-based NTP techniques, there are some proposed models, called hybrid solutions, that use both of these approaches simultaneously. Xu et al 50 have presented an extension of Reference 78, proposing an architecture for traffic prediction in the Cloud Radio Access Network (C-RAN) with distributed remote radio heads (RRHs) and a centralized pool bearing many parallel Baseband Units (BBUs). In this architecture, an alternating direction method of multipliers (ADMM) and cross-validation empowered Gaussian Process (GP) framework is performed, in which the parallel BBUs contribute to the training process and the local predictions are incorporated altogether via cross-validation to create the final prediction. Retaining the trade-off between accuracy and time consumption, and scalability, are the two focal claims of this method.

Bouzidi et al 79 revisit their earlier setting in terms of the adopted learning algorithm, where the linear regression model has given way to an LSTM-based model.

Nie et al 80 have modeled the NTP problem in the Intelligent Internet of Things (IIoT) ecosystem as a Markov decision process. Aiming to extract short-term time-varying features of network traffic in real time while minimizing the training data size, they propose an RL-based approach consisting of Monte-Carlo learning (MCL), Q-learning, and Kullback-Leibler (KL) divergence. Moreover, to deal with the degrading impact of the vast state space of Monte-Carlo Q-learning in IIoT ecosystems, a greedy adaptive dictionary learning algorithm is proposed that reduces the computational complexity.

To discover future network-wide traffic behavior, Zhao et al 81 introduce a TM prediction method, coined WSTNet, a complementary combination of CNN and LSTM that utilizes the Discrete Wavelet Transform (DWT) as a feature engineering tool. The method is comprised of three phases.
First, in the preprocessing step, the original TM series is decomposed using DWT into multilevel time-frequency subseries at various timescales; second, to draw out the spatial patterns of traffic flows between endpoints, a CNN without pooling is leveraged. Finally, an LSTM with a self-attention technique is adopted to extract the TM series' long-term temporal dependencies.
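The wavelet preprocessing step can be illustrated with a single-level Haar transform written from scratch; a real pipeline would typically use a multilevel transform from a library such as PyWavelets, so this is only a sketch of the decomposition idea:

```python
import numpy as np

# Hypothetical sketch: one-level Haar wavelet decomposition splitting a
# series into a low-frequency approximation (trend) and a high-frequency
# detail (fluctuation) subseries, with perfect reconstruction.

def haar_dwt(series):
    x = np.asarray(series, dtype=float)
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2)   # low-frequency part
    detail = (even - odd) / np.sqrt(2)   # high-frequency part
    return approx, detail

def haar_idwt(approx, detail):
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2)
    x[1::2] = (approx - detail) / np.sqrt(2)
    return x

traffic = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 3.0])
a, d = haar_dwt(traffic)             # two half-length subseries
```

Each subseries is half the original length and more stable than the raw signal, which is precisely why such decompositions ease the downstream prediction task.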
Zang et al 82 have composed K-means clustering, wavelet decomposition, and the Elman neural network (ENN) in a framework to predict cell-station traffic volumes by using the spatial-temporal information of cellular traffic flow. After clustering the multiple BS traffic flows, the integrated time series are decomposed into high and low frequencies through the wavelet transform. This creates new subdivisions of data with higher stability and more tractable features, facilitating the prediction process.
The main idea of Aldhyani et al 83 is to make the adopted ML models robust by using specific techniques. They propose a two-fold process to improve the performance of LSTM and Adaptive Neuro-Fuzzy Inference System (ANFIS) models in predicting network traffic. In the first step, the data is preprocessed using the weighted exponential smoothing model; then, the historical data is clustered by non-crisp Fuzzy C-Means (FCM). The presented results show that identifying and classifying existing patterns in the data can improve the performance of LSTM and ANFIS models in forecasting network traffic behavior.
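The smoothing half of that preprocessing can be sketched with simple exponential smoothing; the weighting scheme and `alpha` below are assumed for illustration and may differ from the surveyed paper's weighted variant:

```python
import numpy as np

# Hypothetical sketch of the preprocessing step only: exponential
# smoothing that damps spikes in a traffic series before it is fed to
# a predictor. `alpha` is an assumed constant, not taken from the paper.

def exponential_smoothing(series, alpha=0.3):
    """Return s with s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    s = np.empty(len(series))
    s[0] = series[0]
    for t in range(1, len(series)):
        s[t] = alpha * series[t] + (1 - alpha) * s[t - 1]
    return s

noisy = np.array([10.0, 12.0, 9.0, 50.0, 11.0, 10.0])   # one traffic spike
smooth = exponential_smoothing(noisy)
```

The spike at index 3 is attenuated rather than passed through, which is the robustness effect the preprocessing step is after.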
The third part of Table 2 shows the list of the reviewed literature in this category.

Comparative works
Finding a good prediction solution by comparing the characteristics, shortcomings, and strengths of statistical- and ML-based methods has formed one of the areas of interest in research related to this discipline. Xu et al 78 have established a wireless traffic prediction model by applying the GP method to real 4G traffic data. Observing network traffic to capture the periodic trend and dynamic deviations of the data, as well as leveraging the Toeplitz structure of the covariance matrix to reduce the computational complexity of hyperparameter learning, are the two pillars of the proposed method. The proposed model claims a significant reduction in computational complexity together with high accuracy.
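The structural observation underpinning that complexity reduction is easy to demonstrate: a stationary kernel evaluated on a regularly spaced time grid yields a Toeplitz covariance matrix, which specialized solvers (eg, Levinson recursion) can exploit. The kernel and length scale below are illustrative, not the authors' choices:

```python
import numpy as np

# Hypothetical sketch: an RBF (stationary) kernel on a regular time grid
# produces a covariance matrix that is constant along each diagonal
# (Toeplitz), enabling O(n^2) solvers in place of generic O(n^3) ones
# during GP hyperparameter learning.

def rbf_kernel(t1, t2, length_scale=2.0):
    return np.exp(-0.5 * ((t1 - t2) / length_scale) ** 2)

t = np.arange(6, dtype=float)                 # regularly spaced time grid
K = rbf_kernel(t[:, None], t[None, :])        # 6x6 GP covariance matrix

# Toeplitz check: every entry depends only on the lag |i - j|
is_toeplitz = all(
    np.allclose(np.diag(K, k), K[0, k]) for k in range(K.shape[1])
)
```

Since entry (i, j) depends only on t_i - t_j, the whole matrix is determined by its first row, which is what the fast hyperparameter-learning routines rely on.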
In order to highlight the magnitude of the impact of network Key Performance Indicators (KPIs) on prediction accuracy, comparative research by Le et al 84 has been conducted comprising two different settings. The first addresses time-series traffic forecasting, drawn mainly on exploiting the traffic's historical traces, while the second involves predicting the traffic's KPIs through analyzing the relationship between KPIs and future patterns of network traffic. In this analysis, the two main criteria considered are Mutual Information (MI) and Relative Mutual Information (RMI). The performance of three algorithms, including GP, ANN, and AR, has been examined in both settings, reflecting better performance and accuracy from the GP.
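The MI criterion for ranking KPIs can be sketched with a histogram-based estimator; the data is synthetic, and the RMI of the surveyed paper would additionally normalize this score by an entropy term:

```python
import numpy as np

# Hypothetical sketch: mutual information (in bits) between a
# discretized KPI and a discretized future-traffic indicator, estimated
# from a joint histogram. Bin count is an assumed parameter.

def mutual_information(x, y, bins=4):
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / np.outer(px, py)[nz])))

rng = np.random.default_rng(1)
kpi = rng.uniform(0, 1, 5000)
informative = mutual_information(kpi, kpi)                     # KPI vs itself
uninformative = mutual_information(kpi, rng.uniform(0, 1, 5000))
```

A KPI that fully determines the target attains MI close to its own entropy (here about 2 bits for 4 near-uniform bins), while an independent KPI scores near zero, which is how such a criterion separates useful indicators from noise.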
The research by Soheil et al, 85 beyond a solely theoretical basis, addresses some practical aspects of training and deploying ML models for predicting real-world network data streams in telemetry systems, that is, issues that challenge applying an ML model and achieving the expected accuracy. To this end, various training-related aspects, including the volume, freshness, and selection of training data, have been examined to show their impact on the accuracy and overhead (and thus feasibility) of both adopted models, namely LSTM and SARIMA. Further, utilizing separate models for different segments of a data stream is explored as well. Drawing on the achieved results, the article concludes that network modeling often needs to be customized based on its target application for a specific data stream from a particular network.
Jaffry 43 introduces an LSTM-based model for traffic prediction in the Long Term Evolution-Advanced (LTE-A) network. It has been compared with similar paradigms based on ARIMA and FFNN in terms of performance and accuracy. The results of this study indicate the superiority of LSTM over FFNN and ARIMA. In addition, the efficiency of LSTM when working with small amounts of training data is another advantage mentioned for this model.
A comprehensive comparison between three predictor classes, focusing on several novel criteria, is conducted by Faisal et al. 86 The compared techniques are Last Value (LV), windowed MA, Double Exponential Smoothing (DES), AR, ARMA, ANN-based predictors, and wavelet-based predictors. Besides accuracy, as seen in other similar studies, this study also investigates overhead in terms of both computation cost and power consumption. Further, employing a novel synthesized metric called the Error Energy Score (EE-Score), accuracy and energy consumption have been combined into a single global performance score for comparing predictors. Among the article's conclusions, it is noteworthy that contextual conditions (ie, network characteristics) determine the proper predictor.
Azzouni and Pujolle 87 have suggested an LSTM framework for predicting the TM in large networks. The potency of RNNs for sequence modeling tasks stems from their architectural characteristics: having a cyclic connection over time enables them to store the activation of each time step in an internal state, which forms a temporal memory. To establish the continuous data feeding and learning required for real-time prediction of the current traffic vector Xt, a sliding window technique has been adopted, which provides a fixed number of previous time slots to learn from. In Reference 27 by Chriskatris and Daskalaki, under different application strategies, including single, hybrid, selective, and combined, three models, namely FARIMA, FARIMA/GARCH, and NN, and their compositions have been applied to several schemes of dynamic bandwidth allocation for video transmission through the network. The comparison outcomes highlight the predictive capability and cost effectiveness of a hybrid model consisting of FARIMA/GARCH and NN.
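The sliding-window feeding scheme amounts to turning a traffic series into (window, next value) training pairs, as one would feed to an LSTM or any other sequence model; the window length below is an assumed illustrative parameter:

```python
import numpy as np

# Hypothetical sketch: construct supervised training pairs from a
# traffic series via a fixed-length sliding window. Each row of X holds
# `window` consecutive values; y holds the value that follows them.

def sliding_windows(series, window=4):
    x = np.asarray(series, dtype=float)
    X = np.lib.stride_tricks.sliding_window_view(x, window)[:-1]
    y = x[window:]
    return X, y

traffic = np.arange(10.0)            # stand-in traffic series
X, y = sliding_windows(traffic)      # 6 training pairs
```

At prediction time the same window simply slides forward: the newest observation enters the window, the oldest drops out, and the model is queried (or updated) again, which is what makes the scheme suitable for continuous real-time learning.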
In Reference 88 by Chriskatris and Daskalaki, individual, hybrid, and selective usage schemes of models for composing prediction approaches are examined. To this end, FARIMA and MLP models and their hybridization are used, while in the selective scheme, swapping between the FARIMA and NN models proceeds based on the White NN test, considering the nonlinearity of traffic data. The outcomes indicate that the hybrid method outperforms the others.
The last part of Table 2 shows the list of reviewed literature in this category by comparing them in different aspects, for example, network type, purpose and approach.

CHALLENGES OF EXISTING SOLUTIONS IN NTP
Given the inherent complexity of working with big data, especially in network environments, there are many challenging issues that need further investigation. In this section, some open challenges in the field of NTP will be highlighted.
• Lack of specialized theoretical framework: The ML methods used in NTP address prediction needs in the general context of time-series-based problem solving. However, in practice, the impact of numerous environmental factors, such as topology, speed, and heterogeneity, on the performance of ML models is very significant. In this regard, establishing a dedicated and promising theoretical framework for the application of ML in computer networks is one of the basic needs in this area.
• Ground truth: ML techniques are highly dependent on what is known as Ground Truth (GT). GT reflects the abundance of labeled data as well as the validity of the labels assigned to the observations. In the main applications of ML, regression and classification, the GT extracted from the training data is the basis for comparison and inference when estimating the labels of new data and, in data categorization, it is used to evaluate the accuracy of model performance. One of the main challenges involved, especially in dynamic environments such as network traffic, is the scarcity of labeled data or delays in accessing labeled data; a situation in which the semi-supervised strategy becomes relevant. PCA techniques, manual or synthetic labeling, and Active-Learning techniques are among the measures that could be taken to overcome this problem.
• Representative datasets: The shortage of public and representative datasets to train ML models is a fundamental challenge in network-related prediction scenarios. 89 Due to the data-driven nature of ML solutions, a lack of proper representative data, or its poor quality, can lead to reduced accuracy. On the other hand, gaining access to the required representative data is often subject to strict constraints. 12
• Accuracy versus speed: Since increasing the accuracy of ML techniques may in some cases lead to a higher computational load, and thus more time, any improvement in accuracy will come at the cost of slowing down the training of the ML models, and vice versa. Finding the equilibrium point between these two parameters and using hybrid methods for NTP while the model is being (re)trained is recommended in this regard. 12
• Retraining: In dynamic environments, due to constant changes, model retraining is essential to maintain validity. Unlike static environments with a limited, specific dataset, allowing model retraining at any time and retraining dynamic models involve complex technical problems. Even if access to previous data is available, the task will not be feasible with conventional static methods due to the high computational cost. In online environments such as computer networks, as the pace of change increases, the barriers to model retraining become more complex. One common solution is to use incremental methods for retraining, in which the model is only updated with new observations.
• The Theory of Networks challenge: Introduced by David Meyer in 2017, 12 this concept explains that models trained on public datasets cannot be used in real-world applications. In this case, each ML model should be trained separately for each network in a real-world application. Considering this challenge, ML methods with very high training times are not appropriate for NTP models, as the network cannot be left unattended for a long period of time.
• Deployment issues: In modern prevalent frameworks for software development and support, such as DevOps and DevSecOps, the software life cycle in its traditional form has been fundamentally transformed and is influenced by concepts such as CI/CD, delivery flow, workflow, etc. These frameworks impose specific rules, automations, and restrictions on the operating environment regarding the testing and deployment of software updates and changes. 90 As such, deploying online-generic NTP models that need access to real-world network data during the test phase is an important challenge. In such environments, the criterion for evaluating new updates and licensing deployment is passing a comprehensive test set to ensure that neither any component nor the whole end-to-end workflow is degraded by new deployments. Due to the dynamic nature of live streaming data and the data-driven structure of online-generic NTP approaches, gaining accurate and reliable results through testing is notoriously difficult, if not impossible. 91 Although classical AutoML accelerates the development of ML solutions, it does not address the issues involved in deploying ML systems. Auto-Adaptive Machine Learning (AAML) is a notion coined in Reference 91 to respond to this problem. It proposes an architecture for ML systems grounded in the concept of continual learning, aiming to address the aforementioned challenge through the continuous update of deployed models.

CONCLUSION
Network Traffic Prediction is considered an important solution for addressing various challenges in networking, for example, resource provisioning, congestion detection, and fault tolerance. The field of NTP is still in its infancy, as different models and techniques are used to predict network traffic depending on the network's type and purpose. Although ML techniques are widely used for NTP, some challenges should be addressed to improve their efficiency in the case of NTP, for example, high computational cost, training and retraining difficulties, the intrinsically large volatility of data, etc. On the other hand, statistical techniques have traditionally been used for network traffic prediction; however, they suffer from various challenges, for example, the excessive volume of needed historical data and their inherent constraints in dealing with nonstationary time series. In this paper, we have surveyed both ML- and statistical-based techniques for NTP. Our study shows that most existing solutions are not efficient in real-world applications due to the special characteristics of network data processing in modern networking technologies, such as the Theory of Networks, the retraining challenge, and ground truth.

DATA AVAILABILITY STATEMENT
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.