CA-MLBS: content-aware machine learning based load balancing scheduler in the cloud environment

Cloud computing is the on-demand provision of computing resources over the Internet, such as cloud storage, computing power, and networking. Cloud computing has several advantages, including high speed, cost reduction, data security, and scalability. The main challenge in the cloud environment is to balance the workloads and network traffic among the available resources to achieve maximum performance. Several methods have been proposed in the literature for effective load balancing, including heuristic, meta-heuristic, and hybrid algorithms. The performance of these techniques has been improved by combining machine learning based Artificial Intelligence (AI) techniques and meta-heuristic algorithms. Most of the existing load balancing techniques are not aware of the content type of user tasks. However, the literature indicates that the content type of the tasks can be very effective in designing a balanced workload distribution system in the cloud. In this work, a novel AI-assisted hybrid approach called Content-aware Machine Learning based Load Balancing Scheduler (CA-MLBS) is proposed. CA-MLBS combines machine learning and meta-heuristic algorithms: it first classifies tasks by file type and then schedules them. A Support Vector Machine (SVM) based classifier is used to classify user tasks into different content types such as video, audio, image, and text. A meta-heuristic algorithm based on Particle Swarm Optimization (PSO) is then used to map users' tasks to virtual machines in the cloud. The proposed approach was implemented and evaluated using the well-known CloudSim simulation toolkit and compared with the Ant Colony Optimization File Type Format (ACOFTF) and Data Files Type Formatting (DFTF) heuristics. The results show that the proposed CA-MLBS technique achieved improvements of up to 29%, 29%, and 44% in terms of makespan, response time, and throughput, respectively.


| INTRODUCTION
Cloud computing is the provision of computing resources over the Internet. These resources include data storage, databases, software, networks, and servers. There are three types of cloud computing services: Software-as-a-Service (SaaS) (Cusumano, 2010), Infrastructure-as-a-Service (IaaS) (Serrano et al., 2015), and Platform-as-a-Service (PaaS) (Boniface et al., 2010). Software-as-a-Service delivers applications as a service over the Internet, so that users can access an application without any installation, maintenance, or infrastructure cost.
Infrastructure-as-a-Service (IaaS) delivers infrastructure such as dedicated servers, networks, storage, and virtual machines over the Internet. In IaaS, the applications, operating system, and other application runtime requirements are managed by the user. The Platform-as-a-Service model provides the infrastructure, runtime, middleware, and operating system, while applications and data are managed by the user.
The most popular public cloud providers are Amazon Web Services (Amazon EC, 2015), Microsoft Azure (Chappell, 2010), Google Cloud Platform (Krishnan & Gonzalez, 2015), and IBM Cloud (Boniface et al., 2010). Cloud computing offers several benefits, such as more reliable infrastructure, agility in business processes, ease of collaboration between teams, and cost savings. In addition, the cloud makes data available from anywhere and on any device on a scalable and secure platform (Sunyaev, 2020).
Cloud computing includes hosts (i.e., off-site physical servers) and virtual machines (virtual environments of computing resources such as CPU, RAM, and storage). Usually, a host computer contains multiple virtual machines (VMs). In the cloud computing environment, the process of effectively distributing the workload across multiple VMs is referred to as load balancing. Load balancing is important to achieve scalable performance (Mishra et al., 2020). The benefit of cloud load balancing is twofold. On one side, cloud users want to execute their tasks in less time and at a lower overall cost, which leads to higher user satisfaction. On the other side, the cloud service provider needs minimized execution cost and high utilization of cloud resources, which lead to a high Return on Investment (ROI). To attain higher user satisfaction and improved resource utilization, the workload needs to be distributed in a balanced way (Nabi, Ahmad, et al., 2022; Nabi & Ahmed, 2021a; Nabi, Aleem, et al., 2022). Efficient utilization of cloud resources maximizes the profit of cloud service providers and reduces energy consumption. As a system's computing needs increase, the infrastructure can be scaled by adding more hosts and VMs. However, it is difficult to estimate the size and nature of the workload and to determine the exact number of resources (with their computing power) that must be created at runtime. In this regard, AI-powered algorithms can help to predict the workload, the type of workload, and the resource requirements with their computing capability.
To achieve maximum load balancing, task scheduling algorithms play an important role. Cloud task scheduling algorithms can be classified into two broad categories: heuristic and meta-heuristic algorithms. Heuristic algorithms such as round-robin, max-min, and min-min (Ibrahim, Nabi, Baz, Naveed, & Alhakami, 2020) are mostly problem-dependent. Round-robin is the simplest model of load balancing. It forwards client requests to each VM in turn, is easy to implement, and has low scheduling overhead. The Max-min load balancing model gives larger tasks the highest priority and smaller tasks the lowest priority, so it favours larger tasks and penalizes smaller ones. The Min-Min algorithm, in contrast, calculates the completion time of the tasks and schedules first the task with the minimum completion time. These algorithms are fast and suitable for short-term scheduling solutions. Several heuristic approaches have been presented in the literature. The Zero Imbalance Mechanism (Kong et al., 2020) is based on the transfer time of tasks to the network. The Dynamic Resource Aware Load Balancing Approach (DRALBA) (Nabi et al., 2021) allocates a group of independent and compute-intensive tasks to the available virtual machines in a balanced way; it maps tasks to VMs based on the computation share of the VMs and dynamically updates each VM's computing share. Another heuristic-based load balancing method (Adhikari & Amgoth, 2018) uses task size as one of its main considerations. Although heuristic algorithms optimize the makespan and quality of service (QoS) metrics, they have limitations, such as the inability to find an optimal solution (Kaur & Kaur, 2022) for load balancing problems with conflicting parameters such as execution time and execution cost.
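The difference between the Min-Min and Max-Min heuristics can be illustrated with a minimal sketch. It follows the textbook formulations of the two heuristics, not code from any of the cited works, and it assumes that the completion time of a task on a VM is the VM's ready time plus task length (in MI) divided by VM speed (in MIPS). The function name `schedule` and its parameters are hypothetical.

```python
def schedule(task_lengths, vm_mips, pick_largest=False):
    """Map tasks to VMs with Min-Min (default) or Max-Min (pick_largest=True).

    Assumption: completion time = VM ready time + task length (MI) / VM speed (MIPS).
    Returns (mapping of task index -> VM index, makespan).
    """
    ready = [0.0] * len(vm_mips)              # time at which each VM becomes free
    unscheduled = set(range(len(task_lengths)))
    mapping = {}
    while unscheduled:
        # For every unscheduled task, find the VM that finishes it earliest.
        best = {
            t: min((ready[v] + task_lengths[t] / vm_mips[v], v)
                   for v in range(len(vm_mips)))
            for t in unscheduled
        }
        # Min-Min picks the task with the smallest earliest completion time,
        # Max-Min picks the task with the largest one.
        chooser = max if pick_largest else min
        task = chooser(unscheduled, key=lambda t: (best[t][0], t))
        finish, vm = best[task]
        mapping[task] = vm
        ready[vm] = finish
        unscheduled.remove(task)
    return mapping, max(ready)


# Hypothetical workload: three tasks (MI) and two VMs (MIPS).
min_min_map, min_min_makespan = schedule([1000, 2000, 4000], [500, 1000])
max_min_map, max_min_makespan = schedule([1000, 2000, 4000], [500, 1000],
                                         pick_largest=True)
```

On this small workload Min-Min keeps choosing the faster VM and leaves the slower one idle, whereas Max-Min places the largest task first and spreads the remaining tasks, which is the kind of imbalance that motivates the more elaborate schedulers discussed next.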
Methods based on meta-heuristics do not depend on a particular problem. They are generic algorithms that can be applied to a large number of related problems. Meta-heuristic algorithms address the limitations of heuristic algorithms and can provide near-optimal results for scheduling problems.
Several meta-heuristic algorithms have been presented in the literature, such as the deadline and resource constrained particle swarm optimization algorithm (PSO-RDAL) (Nabi & Ahmed, 2021a), Simulated Annealing (Hanine & Benlahmar, 2020), the ant colony optimization based dynamic and elastic algorithm (D-ACOELB) (Naik, 2020), and another swarm-based meta-heuristic technique (Megharaj & Kabadi, 2019). Each meta-heuristic algorithm has its strengths and limitations. Therefore, combining two or more algorithms could complement the advantages and lead to more suitable solutions.
An algorithm that combines multiple algorithms to solve a certain problem, such as scheduling, is called a hybrid scheduling algorithm. These algorithms can be a combination of (1) heuristic and meta-heuristic algorithms, (2) two meta-heuristic algorithms, or (3) AI-powered machine learning and meta-heuristic algorithms. Compared to single meta-heuristic algorithms, hybrid meta-heuristic algorithms use the strengths of multiple algorithms to solve a problem. Several hybrid algorithms for load balancing optimization in the cloud environment are presented in the literature, such as a combination of Firefly and genetic algorithms (Rajagopalan et al., 2020) and a hybrid of Cuckoo and Firefly. In addition, several literature reviews (Abrol et al., 2020; Pradhan et al., 2021) show the use of various meta-heuristic based algorithms to optimize load balancing in the cloud computing environment.

| Motivation
Load balancing in the cloud environment makes the system scalable, improves reliability, increases performance, and boosts cost-effectiveness. The literature survey shows that most modern approaches to task scheduling consider parameters such as task length, task priority, and task file size to evaluate their proposed task scheduling schemes. However, most of these scheduling schemes do not consider the content type of the input workload. Moreover, the literature study shows that the content type of the workload can play a crucial role in balancing the workload (Cervantes et al., 2020).
The adoption of cloud computing has grown rapidly along with other core services such as cloud storage. Various studies presented in the literature have applied machine learning (ML) algorithms to address the growing data challenges. There are three common types of ML techniques: Unsupervised Learning (Celebi & Aydin, 2016), Supervised Learning (Liu, 2011), and Reinforcement Learning (Mirjalili et al., 2020).
Supervised learning is a machine learning methodology in which an algorithm learns from input data that has been labelled with the desired output. Unsupervised learning, in contrast, is a methodology in which an algorithm identifies patterns in the data points of an unlabelled dataset. Reinforcement learning is a methodology in which an agent takes suitable actions to maximize a reward in a specific scenario. To classify data based on certain features, ML provides classification algorithms, which are mostly part of supervised learning (Reddy & Varma, 2020) techniques. The classification algorithms analyse the available data to extract features and build the corresponding trained model. Then, the trained model is used with a test dataset to classify the data (Sahli, 2020; Sen et al., 2020).
Some hybrid techniques of AI-based machine learning algorithms and meta-heuristics are discussed. These hybrid algorithms show significant improvements (Pinho et al., 2020;Rabbani et al., 2020) and demonstrate that machine learning with meta-heuristic algorithms can significantly improve the solutions to various problems. Existing classification methods (Liu, 2011;Mirjalili et al., 2020;Pinho et al., 2020;Rabbani et al., 2020) also use PostgreSQL and Amazon Web Services (AWS) (AWSACC Services, n.d.) to develop techniques to solve the content classification problem. PostgreSQL is an open source and highly expandable database management system that supports both relational and non-relational queries.
Compared to other database management systems, PostgreSQL supports a wide range of data types such as geometric, network address, JSON, XML, HSTORE, arrays, ranges, and composite types. However, these platforms themselves do not provide content classification mechanisms such as text-, image-, audio-, and video-based resource management. Support Vector Machine (SVM) is considered a preferred choice for input workload classification based on its content (Cervantes et al., 2020). The existing content-based workload distribution optimization studies in the cloud environment such as Ant Colony Optimization File Type Format (ACOFTF) and Data Files Type Formatting (DFTF) (Junaid, Sohail, Rais, et al., 2020) can be improved by a refined dataset, an improved kernel method, and simpler hybrid meta-heuristic algorithms.
Compared to other ML algorithms such as the Artificial Neural Network (ANN), the SVM classification algorithm has the following advantages: (1) shorter training time, (2) better capacity to converge, and (3) better interpretability. Based on all the above, there is a need for a content-aware load balancing model using machine learning classification that can classify the cloud data into appropriate categories, so that the classified tasks can be scheduled to the best-suited type of VMs by the load balancing scheduler. In the next section, we present a Content-Aware Load Balancing model to overcome the limitations of existing approaches.

| Major contributions
This paper presents an AI-enabled, meta-heuristic based, hybrid, content-based method that uses a combination of machine learning and particle swarm optimization techniques to improve load balancing in cloud computing. The proposed Content-Aware Machine Learning based Load Balancing Scheduler (CA-MLBS) classifies users' tasks based on content type. For task classification, CA-MLBS uses the SVM classification algorithm. SVM is one of the most popular classification methods for cloud tasks (Cervantes et al., 2020). Moreover, SVM is a robust supervised learning algorithm that can be used for both classification and regression tasks. The proposed CA-MLBS technique uses the particle swarm optimization (PSO) algorithm to assign the classified tasks to the appropriate group of VMs. The reason for using the PSO algorithm for load balancing is that PSO-based algorithms can improve load balancing quickly and with efficient memory utilization. The literature study shows that PSO-based task scheduling and optimization algorithms perform better and are simpler to implement than other optimization algorithms such as the Genetic Algorithm (GA) and Ant Colony Optimization (ACO), among others (Fadlallah et al., 2021; Nabi & Ahmed, 2021a; Nabi, Aleem, et al., 2022). The proposed scheduling scheme classifies users' cloud tasks based on the type of the task (video, audio, image, or text). Moreover, CA-MLBS uses file fragments to build its training model and assign classification labels. For each file type, there is a corresponding set of VMs with their respective configurations. The classified tasks are distributed to the appropriate VMs using PSO to effectively distribute the workload.
A multi-objective model can be used for load balancing. However, a multi-objective model is more useful for optimizing multiple objectives with conflicting parameters, where attaining one objective (i.e., improving the performance of a certain parameter) may degrade another aspect, for example, time or cost. The proposed research instead evaluates multiple non-conflicting parameters such as makespan, throughput, and response time (Nabi & Ahmed, 2021a; Nabi, Aleem, et al., 2022). In future, a multi-objective model will be used for optimizing conflicting parameters such as time and cost. The main contributions of this work are summarized below:
• This research proposes a content-aware, machine learning-based load balancer (CA-MLBS) scheme based on a hybrid load balancing model;
• The CA-MLBS uses the File Fragment Type (FFT) (Mittal et al., 2019) dataset, based on file fragments of different content types, to improve classification results and workload distribution;
• The proposed algorithm is an easily applicable and simplified approach compared to the existing content-type algorithms;
• The proposed scheme classifies users' cloud tasks based on content types such as video, audio, image, and text;
• The proposed algorithm uses a PSO-based scheduling mechanism to optimize workload distribution;
• Extensive analysis and comparison of response time, throughput, and makespan have been performed.
The remainder of the article is organized as follows: Section 2 presents the literature review. Section 3 contains the system architecture, system model, CA-MLBS algorithm, complexity analysis, and scheduling overhead. Section 4 presents the experimental setup, dataset configurations, performance comparison, and evaluation results. Section 5 provides the conclusion and a roadmap for the future.

| RELATED WORK
This section discusses the relevant literature, highlighting the pros and cons of the state of the art; a summary is given in Table 1. The Resource-Aware Load-Balancing Algorithm (RALBA) (Hussain et al., 2018) is a heuristic load-balancing algorithm that performs workload distribution according to the processing capacity of VMs. It applies the load balancing algorithm in the following two phases: (1) identifying the processing requirements of the task and (2) identifying the processing capacity of the available VMs. Based on the task's requirements, RALBA allocates the task to the appropriate VM. The RALBA algorithm is based on two sub-schedulers (fill and split schedulers). RALBA provides improvements in makespan, resource utilization, and execution time. However, RALBA does not consider task response time or content type classification of cloud tasks.
TABLE 1 Summary of load balancing models in cloud computing
The Dynamic and Resource Aware Load Balancing Algorithm (DRALBA) is a heuristic load balancing algorithm that balances loads based on processing performance and maintains the VMs' workload. It assigns a set of independent tasks to a preconfigured group of VMs. It calculates the processing capacity of a group of VMs for a set of independent tasks. Based on these calculations, DRALBA selects the most appropriate VM with the highest processing capacity. DRALBA provides improvements in resource allocation and response time. However, it lacks a mechanism for distributing the workload based on content type classification. The Overall Gain-based Resource-aware Dynamic Load-balancer (OG-RADL) (Nabi & Ahmed, 2021b) is a heuristic algorithm for dynamic load balancing. The OG-RADL algorithm improves workload distribution with enhanced resource utilization and better task deadline management, which enhances overall cloud performance. OG-RADL presents a novel technique to normalize the values of evaluation parameters such as average resource utilization ratio, response time, makespan, and task deadline. Moreover, the OG-RADL load balancing technique provides improvements in the overall gain of the cloud. However, OG-RADL requires further improvements for task sequencing and content-based workload distribution.
Muthusamy and Chandran (2021) have proposed a cluster-based task scheduling framework (CBTS) that uses K-Means clustering and considers task length and VM limit. In CBTS, tasks are distributed based on task length and VMs are clustered based on computational power. The single task in each cluster is assigned to the appropriate VM in the VM group of that cluster. This technique has shown improvements in execution time and makespan. However, the CBTS technique does not support classification of tasks based on their content.
Particle swarm optimization based resource and deadline aware dynamic load balancers (PSO-RDAL) (Nabi & Ahmed, 2021a) have been proposed. The PSO-RDAL is a meta-heuristic load balancing algorithm that delivers lower processing cost and time for large-scale independent cloud tasks. This research work evaluates the PSO-RDAL mechanism using multiple performance aspects such as resource consumption, makespan, task deadline compliance, response-time, penalty-cost, and execution-cost. The PSO-RDAL does not perform classification by task content type. Semmoud et al. (2020) propose a Starvation Threshold-based algorithm for scheduling problems. Each VM maintains its workload state and performs load balancing without considering the state of the other VMs. Experiments are conducted to evaluate the timeout and response time based on quality of service (QoS) metrics. The performance evaluation of the proposed method was performed with up to 800 tasks on 100 VMs.
The proposed study showed improvements in response time and makespan. However, the proposed technique was tested with a smaller dataset.
Moreover, this technique does not show content-based classification support for cloud tasks. The mutation-based PSO (Agarwal et al., 2020) updates the fitness value of each particle until the maximum number of iterations is reached. The makespan QoS performance metric is used to analyse the experimental results. For the proposed method, 20 data centers and up to 200 tasks were used for simulation. The proposed method achieves an improvement in the makespan. However, the pseudocode of the algorithm was not properly explained. Moreover, this technique does not support the classification of tasks based on the content type. Muthsamy and Ravi (2020) have presented a load-balancing technique based on the Artificial Bee Colony (ABC) algorithm. The ABC-based scheduling algorithm performs an optimal search, following the honey bees' approach of finding the best of the available sources. The presented study has shown improvements in makespan and execution time. However, the dataset used for the evaluation is primitive, and the technique cannot deal with content-aware cloud task scheduling. A hybrid approach of PSO and Firefly Algorithm (FA) optimization was proposed (Lilhore et al., 2020), which assigns the shortest job to the fastest processor-based machine and applies the shortest-job-next approach to PSO. The proposed technique considers the makespan and task migration as core scheduling objectives. However, the experiments were conducted in a limited environment and lack a comprehensive evaluation with state-of-the-art datasets. It also does not support classification of tasks by task type.
Another load balancing approach (Mishra & Majhi, 2021) is inspired by the improved PSO and Honey-Bee Optimization (HBO) algorithms. The proposed technique distributes the workload among the VMs in a manner analogous to birds searching for food sources. The proposed method showed improvements in makespan, response time, and throughput. However, the experiments were conducted in a limited setting with a small workload. The results could vary drastically if the experiments were conducted with large datasets. In addition, the proposed method lacks classification of tasks based on task types. An adaptive dragonfly algorithm (ADA) based on the dragonfly algorithm (DA) and the firefly algorithm (FA) was presented (Neelima & Reddy, 2020). The main goal of the proposed approach is to allocate the workload to the VMs using the ADA, which limits the absolute execution time and cost. The proposed method shows improvements in execution time. However, the task dataset is small for the configured number of VMs. Moreover, this approach is not a content-aware model. Sharma and Garg (2020) presented a hybrid meta-heuristic approach called the Harmony-Inspired Genetic Algorithm (HIGA). The HIGA technique uses the exploration and exploitation features of genetic and harmony search algorithms, respectively. HIGA identifies the local and global optima and provides fast convergence. The proposed technique has shown improvement in makespan and energy consumption.
However, the experiments were conducted in a limited environment with a limited dataset. Moreover, the tasks in the proposed study are not classified by task types. Data Files Type Formatting (DFTF) is based on Cat Swarm Optimization (CSO) and SVM (Junaid, Sohail, Rais, et al., 2020). DFTF classifies the cloud tasks into different types, that is, text, images, video, and audio, with an SVM using a polynomial kernel.
The classified tasks are input to CSO to perform load balancing. However, DFTF does not use a state-of-the-art dataset. In addition, DFTF classifies videos and audio into subcategories. The Radial Basis Function (RBF) kernel provides higher accuracy for the fine-tuned dataset compared to the polynomial kernel.
The QMPSO is a hybrid approach of modified Q-learning and particle swarm optimization (Jena et al., 2022). Three objective functions were formulated, which include (1) the workload difference between hosts and the average load on the cloud network, (2) the total energy consumption, and (3)  The proposed model categorizes tasks based on their size, but it lacks classification of tasks based on task types.
Another study proposes a honey bee behaviour-based load balancing method, which attempts to minimize load redundancy by assigning tasks to matching or suitable VMs. After task assignment, it computes the state of the VMs. The proposed method showed improvements in the following QoS performance metrics: makespan and degree of load balancing. However, the proposed technique does not show such improvements in response time. Moreover, the study lacks a classification approach to categorize the tasks by content type. The summary of the literature review can be found in Table 1.

| PROPOSED CA-MLBS FRAMEWORK
In the cloud computing environment, balancing the workload across VMs is a challenging task. Balanced workload distribution helps to achieve optimal utilization of cloud computing resources. Optimal load balancing also improves cloud task execution and response time. Load balancing methods use various task scheduling algorithms based on cloud task characteristics such as task length, file size, and task content type to calculate the available workload of cloud resources. An optimal load balancing method receives incoming cloud task requests, interprets their processing needs, estimates the existing workload, and selects the best VM to process the request based on this information.
Our proposed cloud load balancing model called CA-MLBS is based on a hybrid technique that combines PSO and SVM machine learning algorithms. The detailed system-architecture of our proposed hybrid load balancing model is shown in Figure 1. As shown in Figure 1, the cloud computing infrastructure has a lowest layer, the physical layer (Kaur, 2020), which includes hosts (a host is a physical dedicated server with processing, memory, data storage, and data transfer resources). The physical layer of cloud computing also includes cloud data storage resources.
The virtualization layer of cloud computing infrastructure is on the top of the physical layer that includes Virtual Machines (VMs). A VM uses the resources of a host machine and acts as a separate computer with all the essential software including the operating system, web server, and database applications, and so on. A host machine can have multiple virtual machines, and a virtual machine receives incoming cloud task requests, processes these tasks, and sends the execution responses back to the user.
The topmost layer is the load balancing layer that receives the user's cloud task, parses the cloud task parameters, and sends tasks to VMs.
The load balancing layer is responsible for effective workload distribution among VMs. This is the layer where we introduce our proposed scheduling algorithm. Our proposed load balancing model classifies user tasks based on content type and selects the best VM to map a specific cloud task.
The working semantics of the proposed load balancing model are presented in Figure 2. The Particle Swarm Optimization (PSO) technique is used for load balancing because PSO performs better than its counterparts (Nabi, Ahmad, et al., 2022; Nabi & Ahmed, 2021a). The PSO-based load balancing phase includes the following steps: initializing the PSO scheduler along with its parameters, computing the particle positions and velocities, evaluating the fitness function, updating all the particles, checking the termination criterion, and producing the final mapping of the optimal solution. After receiving the final mapping from PSO, the tasks are executed according to the optimal mapping plan. When the execution of the tasks is completed, the makespan, throughput, and response time of the tasks are calculated.
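For reference, the velocity and position updates of the canonical PSO algorithm can be written as follows. This is the standard textbook form; the text above does not state which PSO variant or which parameter values CA-MLBS uses, so the symbols are assumptions for illustration: $x_i^{t}$ and $v_i^{t}$ are the position and velocity of particle $i$ at iteration $t$, $p_i$ is the best position found so far by particle $i$, $g$ is the best position found by the whole swarm, $w$ is the inertia weight, $c_1$ and $c_2$ are acceleration coefficients, and $r_1, r_2$ are random numbers drawn uniformly from $[0, 1]$.

```latex
\begin{aligned}
v_i^{t+1} &= w\,v_i^{t} + c_1 r_1 \left(p_i - x_i^{t}\right) + c_2 r_2 \left(g - x_i^{t}\right) \\
x_i^{t+1} &= x_i^{t} + v_i^{t+1}
\end{aligned}
```

In a scheduling setting, each position typically encodes one candidate task-to-VM mapping, and the fitness function scores that mapping.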

| CA-MLBS algorithm
The CA-MLBS task scheduling algorithm (Algorithm 1) starts by receiving incoming cloud task requests as file fragments of the task content, for example, text, image, audio, and video. The first phase is content type classification, which is performed using an SVM machine learning algorithm and prepares collections of classified tasks. The key benefits of using SVM in the proposed model are that SVM is relatively memory efficient, is effective in high-dimensional spaces, works well when there is a clear margin of separation between classes, and remains effective when the number of dimensions is greater than the number of samples.

FIGURE 1 System architecture of CA-MLBS
The content type classification process is based on the Radial Basis Function (RBF) kernel method using high-dimensional data (Algorithm 1, lines 1-8). Once the algorithm has collections of classified tasks, it proceeds with load balancing. The load balancing process takes two parameters: (1) first, the collections of classified tasks, and (2) the groups of VMs, where each group of VMs is dedicated to a particular content type. The load balancing process is based on the PSO scheduling algorithm explained earlier (Algorithm 1, lines 9-35).
The PSO-based scheduling algorithm specifies its parameters such as the number of iterations and the population size (Algorithm 1, lines 11-12). The PSO mechanism first initializes the particles with random velocities and positions (Algorithm 1, lines 13-18). After the PSO parameters have been initialized with the defined configurations, the fitness function is evaluated at each iteration to update the global and local best values (Algorithm 1, lines 19-32). The objective of the fitness function is to find the best VM for each task within the appropriate set of VMs. The evaluation process is repeated until the exit criterion is met. Finally, the PSO-based scheduling algorithm determines the best VM for each task and returns the data of the scheduled tasks (Algorithm 1, line 33).
The main elements of the proposed model were presented in the previous section. We build the training model using the kernel function of the SVM. A kernel function is a mathematical function that processes the input data and converts it into the required form: it maps the low-dimensional input data into a high-dimensional feature space in which classification can be performed. There are various SVM kernel functions, both linear and non-linear, such as the polynomial, radial basis function (RBF), and sigmoid kernels. We used the RBF (Majdisova & Skala, 2017) kernel function in our model because it imposes fewer constraints on data formats and offers versatile applicability, high accuracy, and fast convergence. The RBF kernel model is shown in Equation (1):
$$K(X_1, X_2) = \exp\left(-\gamma \lVert X_1 - X_2 \rVert^{2}\right) \tag{1}$$
In the above equation, gamma (γ) controls how far the influence of a single training point reaches, and ‖X1 − X2‖ is the Euclidean distance between the two feature vectors X1 and X2. We use a testing dataset (as shown in Table 9), which is also a collection of file fragments associated with different content types, as the input source. The classification is implemented with Scikit-learn, a machine learning library for Python. It supports major machine learning algorithms such as SVM, k-means, and random forests. Using Scikit-learn, the dataset was first filtered for text, image, audio, and video content (as shown in Figure 8). It is also converted into the desired dataset format (Chang & Lin, 2011).
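To make the role of gamma concrete, the following minimal sketch evaluates the standard RBF kernel, K(x1, x2) = exp(-gamma * ||x1 - x2||^2), in plain Python. It illustrates the general kernel formula only and is not the classifier used in this work; the function name `rbf_kernel` and the sample feature vectors are hypothetical.

```python
import math


def rbf_kernel(x1, x2, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    if len(x1) != len(x2):
        raise ValueError("feature vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * squared_distance)


# Hypothetical feature vectors (for example, normalized byte statistics of two fragments).
fragment_a = [0.10, 0.80, 0.30]
fragment_b = [0.20, 0.70, 0.40]

# A larger gamma makes the similarity decay faster with distance.
similarity_wide = rbf_kernel(fragment_a, fragment_b, gamma=0.5)
similarity_narrow = rbf_kernel(fragment_a, fragment_b, gamma=50.0)
```

Identical vectors always yield a kernel value of 1, and the value approaches 0 as the vectors move apart, so a large gamma restricts each training point's influence to its close neighbourhood.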
The content-based classification process performs classification of tasks into four classes: video, audio, image, and text. Content-based classification reduces the processing overhead for the load balancer by preprocessing learning, feature extraction, and classification. After classification, the classified tasks are converted to the Comma-Separated Values (CSV) format. CSV format is one of the most commonly used formats. The CSV format is in CloudSim ready format that can easily used as input for CloudSim modelling and it can easily be read in any programming language.
Each line in the CSV file describes the content type (i.e., text, image, audio, or video), the length of the task in millions of instructions (MIs), and the file size in MIs. Table 2 shows the sample dataset.
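As an illustration of how such a file could be consumed, the sketch below parses a small CSV sample and groups the tasks by content type. The column names (`content_type`, `length_mi`, `file_size`) and all sample values are hypothetical; the actual layout is the one given in Table 2.

```python
import csv
import io

# Hypothetical sample: content type, task length (MI), file size.
SAMPLE_CSV = """content_type,length_mi,file_size
video,4000,900
audio,1800,300
image,1200,150
text,400,20
"""

def load_tasks(csv_text):
    """Parse classified tasks into dictionaries with numeric fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "content_type": row["content_type"].strip().lower(),
            "length_mi": int(row["length_mi"]),
            "file_size": int(row["file_size"]),
        }
        for row in reader
    ]

def group_by_type(tasks):
    """Collect tasks per content type so each VM group receives its own list."""
    groups = {}
    for task in tasks:
        groups.setdefault(task["content_type"], []).append(task)
    return groups

tasks = load_tasks(SAMPLE_CSV)
groups = group_by_type(tasks)
print(sorted(groups))
```

Grouping by content type at load time mirrors the design of the model, in which each group of VMs only ever sees tasks of its own type.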

| Load balancing step
The classified tasks in CSV format are then the input to the load balancing process. There are four groups of VMs, one for each task category, with corresponding resource configurations (i.e., processing power, memory, and storage capacity). For example, a video content type task typically requires more processing, memory, and storage resources. We configured video-type VMs with 1000 MIPS (million instructions per second), 16 GB of memory, and 360 GB of storage. Audio-type VMs were configured with 900 MIPS, 12 GB of memory, and 250 GB of storage; image-type VMs with 700 MIPS, 8 GB of memory, and 200 GB of storage; and text-type VMs with 500 MIPS, 4 GB of memory, and 120 GB of storage. The load balancing step takes the groups of VMs and the classified tasks as input. Each group of VMs is responsible for processing a specific type of task.
We used a PSO-based scheduling algorithm; PSO is one of the most popular population-based optimization algorithms. The PSO concept is derived from social interaction to solve a problem. It has multiple agents (particles) that move as a swarm through the search space and search for the best solution. After each iteration, the position and velocity of the particles are updated until the best solution is reached. PSO is one of the most robust, coherent, and manageable algorithms compared to other population-based algorithms. We used the CloudSim (Calheiros et al., 2011) simulation environment to evaluate our proposed model. Our implementation is based on JSwarm-PSO (Cingolani, 2011), a Java-based PSO framework.
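The particle update described above can be sketched in a few lines of Python. This is a simplified illustration, not the JSwarm-PSO implementation used in the paper: the encoding (one continuous coordinate per task, truncated to a VM index), the makespan-only fitness, the PSO constants, the number of VMs per group, and the task lengths are all assumptions. Only the per-group MIPS, memory, and storage values come from the text.

```python
import random

# Per-group VM settings from the text; VMs per group is an assumption.
VM_GROUPS = {
    "video": {"mips": 1000, "ram_gb": 16, "storage_gb": 360},
    "audio": {"mips": 900, "ram_gb": 12, "storage_gb": 250},
    "image": {"mips": 700, "ram_gb": 8, "storage_gb": 200},
    "text": {"mips": 500, "ram_gb": 4, "storage_gb": 120},
}

def makespan(assignment, lengths, mips, n_vms):
    """Largest per-VM finish time, assuming tasks on a VM run sequentially."""
    finish = [0.0] * n_vms
    for length, vm in zip(lengths, assignment):
        finish[vm] += length / mips
    return max(finish)

def decode(position, n_vms):
    """Map a continuous particle position to one VM index per task."""
    return [min(n_vms - 1, max(0, int(x))) for x in position]

def pso_schedule(lengths, mips, n_vms, particles=20, iterations=100,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    dim = len(lengths)

    def fitness(position):
        return makespan(decode(position, n_vms), lengths, mips, n_vms)

    # Random initial positions and velocities.
    pos = [[rng.uniform(0, n_vms) for _ in range(dim)] for _ in range(particles)]
    vel = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(particles)]
    best_pos = [p[:] for p in pos]            # local (personal) bests
    best_fit = [fitness(p) for p in pos]
    g = min(range(particles), key=best_fit.__getitem__)
    g_pos, g_fit = best_pos[g][:], best_fit[g]  # global best

    for _ in range(iterations):
        for i in range(particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] = min(n_vms - 1e-9, max(0.0, pos[i][d] + vel[i][d]))
            f = fitness(pos[i])
            if f < best_fit[i]:
                best_pos[i], best_fit[i] = pos[i][:], f
                if f < g_fit:
                    g_pos, g_fit = pos[i][:], f
    return decode(g_pos, n_vms), g_fit

# Hypothetical video task lengths in MI, scheduled on three video-type VMs.
video_lengths = [4000, 3500, 3000, 2500, 2000, 1500]
assignment, best = pso_schedule(video_lengths, VM_GROUPS["video"]["mips"], n_vms=3)
print(assignment, best)
```

Running one such search per content type keeps each search space small, because a task can only be mapped to VMs in its own group.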

Fitness evaluation
The fitness evaluation function calculates a fitness value that indicates how close a candidate solution is to achieving the specified goals. It serves as a guide to the optimal solution and is an important part of meta-heuristic algorithms. In load balancing models, it helps to find the best VM for a given task to achieve effective workload distribution. Our proposed model performs fitness evaluation at each iteration and checks the exit criterion that ends the algorithm. Our proposed model optimizes the response time, throughput, and makespan to improve the load balancing among VMs.
Makespan: To get the best performance from cloud computing resources, the makespan should be minimized. The term makespan is used for the finish time of all cloud tasks using the available cloud resources such as computing power, memory, and storage (Zheng et al., 2017). Makespan is represented in Equation (2).
In Equation (2), $T_t$ represents the number of tasks, $VM_l$ represents the number of VMs, and $FT_{tv}$ represents the finish time of the tth task on the vth VM.
The processing time of a given type of task for a given type of VM needs to be estimated so that the load balancer can select the best VM from the group. We have presented the calculation for makespan in Equation (3).
In Equation (3), $FT_{max}$ is the maximum finish time of the tth task on the vth VM, n is the task count, and m is the VM count.
The total finish time of all the cloud tasks has been computed with the support of the objective function (Khorsand et al., 2019; Saeedi et al., 2020) using Equation (4).
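A small worked example shows how the choice of mapping changes the makespan. It assumes that the tasks assigned to a VM run one after another from time zero, so a VM's finish time is the sum of task length divided by MIPS over its tasks. The task lengths and the two mappings are hypothetical; the 500 MIPS rating is the text-type VM setting given earlier.

```python
def finish_times(task_lengths_mi, vm_mips, assignment):
    """Per-VM finish time, assuming sequential execution starting at time 0."""
    totals = [0.0] * len(vm_mips)
    for length, vm in zip(task_lengths_mi, assignment):
        totals[vm] += length / vm_mips[vm]
    return totals

def makespan(task_lengths_mi, vm_mips, assignment):
    """Finish time of the last VM to complete its tasks."""
    return max(finish_times(task_lengths_mi, vm_mips, assignment))

lengths = [2000, 1000, 3000, 1500]  # hypothetical task lengths in MI
mips = [500, 500]                   # two text-type VMs at 500 MIPS
balanced = [0, 1, 1, 0]             # VM index chosen for each task
skewed = [0, 0, 0, 1]

print(finish_times(lengths, mips, balanced), makespan(lengths, mips, balanced))
print(finish_times(lengths, mips, skewed), makespan(lengths, mips, skewed))
```

With the balanced mapping the two VMs finish at 7 s and 8 s, giving a makespan of 8 s. The skewed mapping leaves one VM busy for 12 s while the other finishes after 3 s, so its makespan is 12 s.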

Response time: The response time in cloud computing is the total time it takes to respond to a cloud task request (Phi et al., 2018). The response time of tasks has been computed using Equation (5).
In Equation (5), $R_T$ represents the response time, $P_{Tt}$ represents the processing time of the tth task, $S_{Tt}$ represents the start time of the tth task, and $E_{Tt}$ represents the end time of the tth task.
Throughput: Load balancing throughput in the cloud is the number of tasks processed in a given period of time (Ahmad & Khan, 2018). An efficient load balancing algorithm should maximize the completion of the number of tasks in a unit of time. We have represented the throughput in Equation (6).
In Equation (6), $J_{TP}$ represents the job throughput, $J_h$ represents the hth job, and $J_{th}$ represents the completion time of the hth job.
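A minimal sketch of how these two metrics could be computed from per-task timestamps. It assumes that a task's response time is the interval between its start and end times and that throughput is the number of completed tasks divided by the elapsed time. These readings, the field names `start` and `end`, and the sample values are illustrative assumptions, not the paper's exact formulas.

```python
def response_times(tasks):
    """Per-task response time, taken here as end time minus start time."""
    return [task["end"] - task["start"] for task in tasks]

def mean_response_time(tasks):
    times = response_times(tasks)
    return sum(times) / len(times)

def throughput(tasks):
    """Completed tasks per unit of time over the whole observation window."""
    elapsed = max(t["end"] for t in tasks) - min(t["start"] for t in tasks)
    return len(tasks) / elapsed

# Hypothetical timestamps in seconds.
schedule = [
    {"start": 0.0, "end": 4.0},
    {"start": 0.0, "end": 2.0},
    {"start": 2.0, "end": 8.0},
    {"start": 4.0, "end": 7.0},
]

print(response_times(schedule))      # per-task values
print(mean_response_time(schedule))  # average over the four tasks
print(throughput(schedule))          # tasks per second
```

For this sample the per-task response times are 4, 2, 6, and 3 s, the mean is 3.75 s, and four tasks complete within 8 s, a throughput of 0.5 tasks per second.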

| Experimental setup
CloudSim (Calheiros et al., 2011) is used for verification and evaluation of the proposed model. CloudSim is a widely used Java-based cloud computing simulation framework that supports the setup of cloud computing resources such as hosts, VMs, network pools, and storage capacities. It also helps to set up cloudlets (a cloudlet is a cloud task in the CloudSim environment), datacenter brokers, VM allocation policies, bandwidth, and RAM provisioning. It facilitates verifying and evaluating load balancing models cost-effectively and allows users to concentrate on system modelling without dealing with the low-level details of cloud infrastructure and services. CloudSim provides an open-source, free, and customizable cloud simulation environment for assessing the performance of cloud computing models.
Further, it is an easy-to-use and robust cloud simulation software that allows testing a wide range of cloud applications in heterogeneous environments. Our proposed model has four groups of VMs: text-type, image-type, audio-type, and video-type VMs. Each group of VMs has a distinct configuration of computing, memory, storage, and network resources. The host configurations are presented in Table 3, and the VM configurations for each group of VMs in the simulation environment are presented in Table 7.
The state-of-the-art load balancing strategies used for the comparative evaluation were evaluated under different parameter settings, and the settings that gave the best results were selected for comparison. The parameter settings used for the proposed approach CA-MLBS, DFTF, and ACOFTF are presented in Tables 4-6, respectively.
TABLE 10 Iterations and population's best statistics (Adil et al., 2022)
The input to the data classification module is the collection of data in the form of video, text, audio, and images. The classification module takes the input data randomly and then performs classification on this data using polynomial SVM.

TABLE 3 Host parameters
2. The Ant Colony Optimization File Type Format (ACOFTF) hybrid algorithm is based on Ant Colony Optimization (ACO) and SVM. It uses the SVM classifier to classify the user's tasks into different types, such as text, images, video, and audio. Once the user's tasks are classified into the appropriate categories, they are input to the load balancing phase, which is based on ACO and distributes the load among VMs. The approach is divided into two main modules: 'Data Classification based on SVM' and 'Load Balancing using ACO'. The input to the data classification module is the collection of data in the form of video, text, audio, and images. The classification module takes the input data randomly and then classifies this data using polynomial SVM.

| Performance metric
The proposed model is evaluated and compared with the existing load balancing techniques explained in the above section. The QoS performance metrics such as makespan, response time, and throughput are analysed here.

FIGURE 5 Throughput comparison by iterations and population
FIGURE 6 Fitness value comparison by iterations and population

1. Makespan: Makespan is the finish time for completing all the tasks using the available resources (see Equation (3)).
2. Response time: Response time in cloud computing is the total time it takes to respond to a cloud task request (see Equation (5)).
3. Throughput: Throughput in cloud load balancing is the number of tasks processed in a unit of time (see Equation (6)).

| Evaluation of analysis
The proposed algorithm is evaluated in two phases: (1) classification QoS measures and (2) load balancing QoS measures.

| Classification evaluation
To validate the proposed classification model, accuracy, recall, precision, and F-measure were used as evaluation parameters. The results of the proposed approach were compared with state-of-the-art content-based planning algorithms such as DFTF (Junaid, Sohail, Rais, et al., 2020), ACOFTF, and Bayes Net (Çiğşar & Ünal, 2019) (as shown in Table 11). The results of the comparison are shown in Figure 7. In the figure, the x-axis shows the accuracy, recall, precision, and F-measure, while the y-axis shows the value of each classification QoS measure. The classification evaluation shows that CA-MLBS performs better on all classification measures.
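For reference, the four classification measures can be computed from true and predicted labels as in the sketch below, which uses the standard per-class definitions of precision, recall, and F-measure. The function name `classification_report` and the label sequences are hypothetical and are not results from the paper.

```python
def classification_report(y_true, y_pred):
    """Overall accuracy plus per-class precision, recall, and F-measure."""
    pairs = list(zip(y_true, y_pred))
    report = {"accuracy": sum(t == p for t, p in pairs) / len(pairs)}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f_measure = 2 * precision * recall / (precision + recall)
        else:
            f_measure = 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f_measure": f_measure,
        }
    return report

# Hypothetical true and predicted content types for six file fragments.
y_true = ["video", "video", "audio", "image", "text", "text"]
y_pred = ["video", "audio", "audio", "image", "text", "image"]

report = classification_report(y_true, y_pred)
print(report["accuracy"], report["video"])
```

In this sample, four of the six fragments are labelled correctly. The video class has perfect precision but a recall of only 0.5, because one video fragment was predicted as audio.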

| Load balancing evaluation
1. Evaluation based on makespan: We conducted a thorough analysis of CA-MLBS for the makespan metric and compared it to the DFTF and ACOFTF scheduling techniques. Our analysis is based on different configurations of VMs and tasks. The statistics of our analysis are presented in Table 12, in which the abbreviations Makespan (M), Response Time (RT), and Throughput (T) are used to make the table more readable. The graph of the statistics is presented in Figure 8. In the figure, the x-axis shows the number of tasks and the y-axis shows the makespan value. In each column, the respective algorithm is indicated.
2. Evaluation based on response time: We performed a thorough analysis of CA-MLBS based on response time and compared it to DFTF and ACOFTF. The analysis was performed based on different configurations of VMs and tasks. The statistics of these analyses are presented in Table 12, and a comparison graph is presented in Figure 9. In the figure, the number of tasks is shown on the x-axis and the response time value is shown on the y-axis. In each column, the respective algorithm is indicated. The results show that CA-MLBS optimized the response time by up to 15% compared to ACOFTF and by up to 29% compared to DFTF.
3. Evaluation based on throughput: We performed an in-depth analysis of CA-MLBS and compared it with the DFTF and ACOFTF approaches. The analysis was performed based on different configurations of VMs and tasks. The statistics of these analyses are presented in Table 12 and Figure 10. In the figure, the x-axis shows the number of tasks and the y-axis shows the throughput value. In each column, the respective algorithm is shown. These results show that the CA-MLBS approach optimizes throughput by up to 21% compared to ACOFTF and by up to 44% compared to DFTF.
TABLE 12 Load balancing's QoS measures comparison of all algorithms: CA-MLBS, ACOFTF, and DFTF (Junaid, Sohail, Rais, et al., 2020)

| CONCLUSION AND FUTURE WORK
Load balancing based on the content of cloud tasks in the cloud computing environment can significantly improve workload distribution. Several studies have been presented that classify cloud tasks based on content types. Some of them use machine learning to improve the balanced distribution of workload across VMs. We proposed a hybrid load balancing model CA-MLBS that provides improved load balancing compared to current approaches. Our proposed model is based on two steps: (1) a content-based classification and (2) a load balancing step. The content-based classification process performs classification of tasks into four classes: Video, Audio, Image, and Text. This reduces the processing overhead for the load balancer through preprocessing, learning, feature extraction, and classification. After classification, we converted the classified tasks to comma-separated value format. The classified tasks also have different task lengths and sizes depending on the task type. In the load balancing step, we input four groups of VMs and classified task collections. Each VM group is dedicated to a specific task type and has different resource configurations. Our model uses a PSO-based load balancing algorithm to distribute the balanced tasks to the corresponding type of VMs. We conducted a thorough analysis of our model and compared the results of the analysis with state-of-the-art approaches such as DFTF and ACOFTF.
As QoS metrics, we used makespan, response time, and throughput. The results of the comparison show that CA-MLBS improved makespan, response time, and throughput compared to DFTF and ACOFTF. Moreover, CA-MLBS is an easy-to-use and simplified approach compared to the existing content type algorithms.
In the future, the proposed approach will be extended with other meta-heuristics such as the genetic algorithm, and additional QoS parameters such as energy consumption, overhead time, migration time, and optimization time will be considered. The proposed approach assumes that each input task falls into exactly one of four categories (video, audio, image, or text), without overlap. Overlapping collections of tasks and multi-objective methodologies will also be considered in the future.

DATA AVAILABILITY STATEMENT
The data are openly available in a public repository as the File Fragment Type (FFT)-75 dataset (https://ieeedataport.org/open-access/file-fragment-type-fft-75-dataset).