Predicting accurate batch queue wait times on production supercomputers by combining machine learning techniques

The ability to accurately predict when a job on a supercomputer will leave the queue and start to run is not only beneficial for providing insights to users, but can also help enable non-traditional HPC workloads that are not necessarily suited to the batch queue style approach that is ubiquitous on production HPC machines. However, there are numerous challenges in achieving such a prediction with high accuracy, not least because the queue's state can change rapidly and depends upon many factors. In this work, we explore a novel machine learning approach for predicting queue wait times, hypothesising that such a model can capture the complex behavior resulting from the queue policy and other interactions to generate accurate job start times. For ARCHER2 (HPE Cray EX), Cirrus (HPE 8600), and 4-cabinet (HPE Cray EX) we explore how different machine learning approaches and techniques improve the accuracy of our predictions, comparing against the estimation generated by Slurm. By combining categorization and regression models, we demonstrate that our approach delivers the most accurate predictions across our machines of interest, with the result of this work being the ability to predict job start times within 1 min of the actual start time for around 65% of jobs on ARCHER2 and 4-cabinet, and 76% of jobs on Cirrus. When compared against what Slurm can deliver, via the backfill plugin, this represents around 3.8 times better accuracy on ARCHER2 and 18 times better for Cirrus. Furthermore, our approach can accurately predict the start time for three-quarters of all jobs within 10 min of the actual start time on ARCHER2 and 4-cabinet, and for 90% of jobs on Cirrus. Whilst the initial driver of this work was to better facilitate non-traditional, interactive and urgent, workloads on HPC machines, the insights gained can also be used to provide wider benefits to users, enrich existing batch queue systems, and inform supercomputing center policy.

use and job parameters to select in their scripts, to better inform HPC centers so they can enhance queue policies and queue system settings to improve throughput.
In addition to the general user benefits described above, our specific area of interest driving this work is non-traditional, interactive and urgent, HPC workloads. Small-scale urgent workloads previously relied upon high-priority queues or the ability to interrupt existing simulations, 1 however this is not practicable for unpredictable and dynamic situations, where the amount of computing required can be large and vary significantly as time progresses. 2 The VESTEC marshalling and control system 2 has been developed as a generic solution for running urgent, interactive workloads on HPC machines. Integrating use-cases ranging from wildfire fighting 3 to tracking mosquito-borne diseases, 4 these represent highly dynamic workloads, often driven by the arrival of data from external sources or interactivity from the end-user, with the requirement that such workloads must start to run as quickly as possible. Consequently, being able to accurately estimate how long jobs will likely queue before they start to run on compute nodes, across several supercomputers, is critical in providing optimal workload placement.
The major challenge in predicting when a job will start to run on a production supercomputer is that there are numerous influencing factors, many of which are complex and unknown to end users. For example, job start time depends not only upon the configuration of the queue itself but also, at the time of submission, both the current queue state and jobs that will be subsequently submitted whilst our job of interest is still queued.
There are also other aspects, such as queuing policy, which are often determined by the HPC center and result in the queues appearing somewhat of a black-box to users when they are trying to estimate how quickly their jobs will start to run. Consequently, the use of machine learning based on historical batch queue data is an approach that has gained some traction, with the idea being that these models can capture the underlying patterns and use this to more accurately determine how long a job will wait in the queue. However, existing machine learning models tend to be overly simplistic, limited by specific requirements or assumptions, or only target small-scale HPC machines or numbers of jobs.
In this paper, we explore the use of machine learning to predict the queue wait time for jobs on HPC machines, using data from real-world jobs submitted to production HPC machines. In Section 2, we survey related work before describing the HPC machines used in this work and reporting the performance of Slurm's built-in estimator as a baseline. In Section 3, we describe our initial machine learning approach which, whilst simplistic, provides a limited degree of accuracy, before building on this in Section 4 to improve our predictions by incorporating the current state of the queue into our models. This is then further enhanced in Section 5 by combining classification and regression to specialize the regression models' predictions, before briefly illustrating the use of our models to produce insights for users in Section 6. Lastly, we draw conclusions in Section 7 and discuss further work.
The contributions of this paper are: 1. Demonstration that the estimated time provided by Slurm's backfill plugin is inaccurate for real-world jobs, and that its accuracy also varies significantly across machines.
2. The ability to handle the highly unpredictable workloads present on production supercomputers by developing a stochastic method which generates different random queue states that are representative of machine usage patterns and are used as inputs to the model.
3. Illustration that, whilst it is possible to generate moderately accurate predictions based on a simplistic model, by adopting a multi-model approach of classification and regression one is able to obtain improved accuracy of results.
4. Presentation of job start time accuracy within a specific time frame of the actual start times. By contrast, the majority of related work presents accuracy using metrics such as mean standard error. Whilst such mathematical-based metrics are useful, it is often more important to understand how close job start time predictions are to their actual start time counterparts.

BACKGROUND AND RELATED WORK
Job scheduling algorithms typically rely on policies such as first come first served, shortest job first, longest job first, or a job scoring methodology.
However, to ensure that the compute nodes of HPC machines remain filled and fairly allocated amongst users, there are numerous additional complexities imposed at the system administration level, making their operation far less transparent. Backfilling is one such example, where smaller jobs lower down the queue are prioritized to fill up small numbers of available compute nodes which are not sufficient for the larger jobs higher up the queue.
Not only does backfilling induce additional uncertainties when trying to understand how long a job will queue before starting to run, but furthermore there are numerous backfilling algorithms that can be selected, making this even more opaque. Backfilling is just one example and, put simply, the queue systems of modern HPC machines are black-boxes. As such, it is a non-trivial task to predict how long jobs submitted to queues will wait before they are allocated to compute node(s) and start executing.
An early approach to job wait time prediction was first proposed in Reference 5, which calculates the average time based upon previous jobs submitted by a user. This method is very simplistic, only needing to take the average of previous jobs, and the resulting accuracy of prediction was found to be lacking. By contrast, Reference 6 followed a simulation approach where they simulated first come first served, shortest job first, and backfill scheduling algorithms. These were then used to predict the wait time for each application when that application is submitted to the scheduler. As part of this, Reference 6 also predicted the runtime of applications, enabling a complete view of the state of the machine, and errors in job wait time predictions ranged between 5.01 and 996.67 min depending upon the workload being predicted and scheduling algorithm simulated.
Approaches exploiting historical queue data have become more common than queue wait time averaging or simulation. Supervised learning, where mathematical models are trained using the historical queue data, is most popular, and Reference 7 proposed an instance based learning (IBL) approach to predict job start times based upon historical wait times. Whilst this work was early in the field of machine learning, with limited techniques available to the authors and the number of jobs fairly small, their absolute average error ranged between 210.5 and 577.1 min, which was promising for the time and acted as a foundation for further work. An example is Reference 8, which predicts the job queue wait time by undertaking a classification of similar jobs in the historical queue data. Their approach first predicts a job's wait time using the K-nearest neighbor (KNN) technique and then undertakes classification using a support vector machine (SVM) among the classes of the jobs, with the probabilities given by the SVM used to provide a set of predicted wait times with probabilities. Their experiments were focused on the grid rather than an HPC machine, and they were aiming to predict within a window of 1 h for the job start time. Nevertheless, this is still applicable for supercomputer queue time prediction, especially because their work was able to demonstrate correct categorization of job start times between 77% and 83% of the time. By contrast, in this work, we are using real-world HPC machines, and seeking predictions that are much more accurate and as close as possible to the actual job start time rather than predicting within hour-sized windows.
In Reference 9, the authors proposed a method of predicting queue wait times based on a hidden Markov model. They were interested in queue congestion, where the greater the congestion the longer the time before a job starts. In this work, the authors represented queue congestion as an estimate of the state according to the degree of congestion for the queue waiting time expected at time t, with the objective being that they can then use their model to predict the queue waiting times at time t + 1. When comparing their prediction accuracy against those of the other methods, results showed that the proposed algorithm improves the prediction accuracy by up to 60%, although the dataset used, containing only 10,836 jobs, is small.
By contrast to the supervised learning approaches detailed above, Reference 10 studied the use of reinforcement learning (RL) to predict queue wait times. In this approach, a model is trained based upon rewarding desired behaviors and punishing undesirable ones. In Reference 10, the authors highlight the role of RL in handling the unknown amount of work in the queue, however their approach also requires prediction of the actual runtime of jobs before undertaking the job start time prediction. Undertaking runtime prediction is required for accurately assessing the amount of work in the queue, but requires in-depth knowledge of each individual job and is not scalable to large systems with many different workloads. For instance, in Reference 10 the authors limited themselves only to VASP, and by contrast we aim for an approach which can be run on a snapshot of the machine executing a diverse workload without requiring such in-depth knowledge.
We summarize that previous work predicting queue wait times has demonstrated promise, but can be somewhat limited, based upon small-scale datasets, or imposing numerous assumptions that often do not generalize to real-world HPC machines. 11 However, all the papers surveyed in this section highlight the difficulty of predicting the job queue wait time and demonstrate this is complicated by several factors outside the control or knowledge of users. Often one is not aware of the full set of criteria that schedulers are using to determine when jobs will run, and there can be complicated inter-job relationships at play too. Driven by our interest in urgent computing workloads, 3 we require the ability to quickly predict how long a job will queue for on a given HPC machine before running, because we require that urgent jobs start to run as soon as possible. This means that we require the predicted start time to be reported in minutes and seconds, and it is important that the accuracy of such predictions is within a few minutes.

Supercomputers studied in this work
In the experiments detailed in this paper, we work with historical data from the following three production supercomputers:

4-cabinet:
The preliminary 4-cabinet ARCHER2 system that was available before the full ARCHER2 system was commissioned. An HPE Cray EX, at 1000 nodes this was approximately a fifth the size of ARCHER2, and we use historical queue data over 9 months, from February 2021 to October 2021, which comprises 373,560 (standard queue) jobs. This system ran Slurm version 20.11.
These three systems represent production HPC systems in use on a 24/7 basis, and provide a diverse set of system and job sizes to use for developing and evaluating our models. All systems run the Slurm 12 queue system and we used the sacct command to obtain the historical queue data. On ARCHER2 there is one partition, standard, but two qualities of service (QoS), standard and short. The standard QoS is based upon the default fair share job priority, whereas jobs in the short queue are assigned a higher priority but can only run over a maximum of eight nodes for, at most, 20 min. There have been no modifications made to Slurm itself or the queue policies by the team.
It can be seen that we report the number of jobs in the standard and short queues for ARCHER2, but just the standard queues for Cirrus and the 4-cabinet system. This is because, as we highlight in Section 3, it is much easier to accurately predict job start times for short queues compared to the standard queue. The reason for this is that, in the short queue, jobs are always small, short running, and tend to start very quickly, in contrast to the standard queue which does not have these limitations and where jobs can wait for considerable amounts of time. 13 Therefore, the majority of this paper is focused on prediction for jobs in the standard queues across our supercomputers of interest, because that is the major challenge for job start time prediction and also because, due to the limits of the short queue (e.g., jobs running for a maximum of 20 min on ARCHER2), a workload of any complexity must use the standard queue.

Slurm estimated time
The Slurm queue system 12 provides its own job start time prediction capability, reporting an expected start time for jobs if Slurm is configured to use the backfill scheduling plugin. This prediction is only offered by Slurm once a job has been submitted, which is not suitable for our requirements in making informed scheduling choices for urgent computing, but nevertheless this estimated time acts as a baseline against which we can compare the success of our machine learning approach in subsequent sections.
We tracked the lifetime of all jobs submitted on ARCHER2 and Cirrus over a 2-week period, using a script that continually polls the queue system for newly submitted jobs. Jobs are then stored and their details updated as time progresses, specifically amending the start time estimate if appropriate, and once a job starts running our script compares Slurm's job start estimate(s) against the actual start time of that job.
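The polling step described above can be sketched in a few lines of Python, assuming Slurm's squeue command with its standard %i (job id) and %S (estimated start time) format specifiers; the helper names below are illustrative and not the actual monitoring script.

```python
import subprocess


def parse_squeue(output: str) -> dict:
    """Parse 'jobid starttime' lines, one pending job per line."""
    estimates = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        job_id, start = line.split(maxsplit=1)
        estimates[job_id] = start
    return estimates


def poll_estimated_starts() -> dict:
    """Ask Slurm for the estimated start time (%S) of every pending job."""
    out = subprocess.run(
        ["squeue", "--states=PENDING", "--noheader", "-o", "%i %S"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_squeue(out)
```

Repeatedly calling such a function and recording how the per-job estimate evolves is sufficient to compare the initial and best Slurm estimates against the actual start time once the job runs.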
Table 1 reports the accuracy of the estimated start time provided by Slurm for the standard queues on both ARCHER2 and Cirrus. In this table, we report the percentage of jobs whose start time estimates were accurate to within a specific time-frame of when the job actually started. Slurm updates the estimated start time for jobs if appropriate, and therefore there are two accuracy numbers reported for each machine in Table 1: initial is the accuracy of the initial prediction made by Slurm when the job was submitted, and best is the best accuracy obtained across all the estimated start times for a job. For ARCHER2, on average, 83% of jobs had multiple estimates provided by Slurm over their lifetime and 53% of jobs had more than five updated estimates made.
Table 1 reports that the estimates generated by Slurm are fairly inaccurate, especially for Cirrus. Irrespective of the machine in use, the initial estimates are less accurate than the best estimate across all estimated start times, and it is most common for predictions to be made that are only accurate to within an hour or more of the actual start time. There is a difference in Slurm's estimation accuracy between ARCHER2 and Cirrus, which is because of the differences in usage model for the machines. Cirrus is a high-throughput system with many jobs requesting smaller numbers of nodes for a shorter amount of time. Consequently, Slurm overestimated the start time on Cirrus in 96% of cases, whereas on ARCHER2, which follows a more traditional HPC system usage model, overestimation of the start time occurred in 60% of cases.
TABLE 1 Prediction accuracy of Slurm's backfill scheduling plugin for the standard queue on ARCHER2 and Cirrus. From Slurm's estimated start times we can conclude that the usage mode of the machine makes a significant impact on the accuracy of predictions generated. Slurm tends to overestimate, rather than underestimate, the start time and is often seemingly able to generate more accurate estimates for systems whose workload is more traditional HPC style.

INITIAL MACHINE LEARNING MODEL
Previous work in References 14 and 8 demonstrated that K-nearest neighbor (KNN) 15 is a successful approach for generating queue wait time predictions, and a simple approach acting on the data was our starting point. KNN is a simple supervised machine learning algorithm for solving both classification and regression problems and works based upon the assumption that similar items exist in close proximity. Calculating the k neighbors that are nearest to the feature of interest, these closest neighboring values are then reduced to the overall prediction, often by taking the mean.
An important configuration for KNN is what value of k to use, that is, the number of closest neighbors to each point that need to be considered.
Based upon experimentation, we found that k = 10 was most appropriate, using the KNeighborsRegressor from Sklearn 16 with the default minkowski distance metric to determine the nearest neighbors.
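The model configuration described above can be sketched as follows, on small synthetic data rather than the real queue history; the feature values and wait times here are purely illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Illustrative synthetic data: each row is (requested nodes, requested
# wall time in minutes); targets are observed queue wait times in minutes.
X_train = np.array([[4, 60], [8, 120], [128, 1440],
                    [4, 30], [256, 720], [16, 240]])
y_train = np.array([5.0, 12.0, 340.0, 3.0, 610.0, 45.0])

# k = 10 was found best on the real data; k = 3 here as the toy set is tiny.
# The default metric="minkowski" (with p=2) is Euclidean distance.
knn = KNeighborsRegressor(n_neighbors=3, metric="minkowski")
knn.fit(X_train, y_train)

# Predicted wait time for a new 8-node, 2-h job: the mean of the wait
# times of the 3 nearest training jobs.
predicted_wait = knn.predict([[8, 120]])
```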
This simple KNN approach served two purposes, firstly to understand how even a very basic machine learning approach compares against Slurm's built in estimator, and secondly to act as a foundation for more complex machine learning approaches described in subsequent sections.
Initially focusing on ARCHER2, based on the historical queue data we trained two regression KNN models. The first, basic, only selects the number of nodes and wall time requested by the user as features for each job. The second model, temporal, also includes the time and day of the week when the job was submitted as features, and this enables us to understand the importance of when a job was submitted for accurately predicting job start time. Features for each job are normalized such that they all have approximately equal range, where we calculate the mean and standard deviation for each element for all jobs, and adjust these as per Equation (1) so that they are centered around 0, with a standard deviation of 1.
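Although Equation (1) itself is not reproduced in this excerpt, the description corresponds to standard z-score normalization, which might be sketched as:

```python
import numpy as np


def standardize(features: np.ndarray) -> np.ndarray:
    """Z-score normalization: center each feature column around 0 with a
    standard deviation of 1, per the description of Equation (1)."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / std


# Illustrative job features: (requested nodes, requested wall time in min).
jobs = np.array([[4.0, 60.0], [8.0, 120.0], [128.0, 1440.0]])
normalized = standardize(jobs)
```

Normalizing in this way stops the feature with the largest raw range (here, wall time) from dominating the nearest-neighbor distance calculation.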
Throughout the experiments in this paper, we select 80% of jobs for training and 20% for testing. However, based upon experimentation it was found that if one did not enforce a separation in time between jobs in the test and training sets, then this would result in artificially high prediction accuracy.
This is because users often submit multiple very similar jobs in one go, and this is so common that, for instance, Slurm provides support for job arrays to help manage collections of similar jobs. We therefore found that naively selecting one job in five for the testing set resulted in our machine learning models having likely encountered very similar jobs previously, and thus artificially made the task of queue wait time prediction easier. The random approach, which randomly picks 80% of jobs for training and 20% for testing, suffers from the same issue. This is because, across the randomly selected dataset, for a job in the 20% of testing data there is an 80% chance that its neighboring job will reside in the training set. Ultimately, we found these strategies resulted in a poor distribution of data, and a likelihood that jobs with near identical queue features, and hence a similar wait time, would be split across the test and training sets.
Instead, to provide a fairer testing regime, we enforced a separation in time between our test and training data by selecting every fifth day's set of jobs to form our test jobs. For instance, if all jobs on a Monday are selected as test jobs then the next test day will be a Saturday. This ensures that there is a wider range of test jobs across different hours and different days against which to test our predictions sight unseen.
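One way to implement this every-fifth-day hold-out is sketched below, assuming each job record carries a sortable submission-date key; the field name submit_date is hypothetical.

```python
def split_by_day(jobs, test_every=5):
    """Hold out every fifth distinct submission day as test data, so test
    jobs are separated in time from the jobs the model was trained on."""
    days = sorted({job["submit_date"] for job in jobs})
    test_days = set(days[::test_every])  # every fifth day, ~20% of days
    train = [job for job in jobs if job["submit_date"] not in test_days]
    test = [job for job in jobs if job["submit_date"] in test_days]
    return train, test
```

Because whole days move to the test set together, near-identical jobs submitted minutes apart cannot straddle the train/test boundary.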
Table 2 reports the results of prediction using our simple KNN models for ARCHER2. Accuracy is reported as the percentage of predictions that are made correctly within a specific time-frame of the actual job start time, with the smaller the difference the more accurate the model. We trained each model for both the ARCHER2 standard queue (314,880 jobs) and short queue (73,472 jobs), where the temporal models, which are also provided with job submission date and time as features, result in increased accuracy of prediction compared against the basic model. This demonstrates that there is a correlation between when a job is submitted to the HPC machine queue and how long it will wait in the queue. Whilst such a statement will likely not surprise an experienced user of such machines from their own personal experience, it is still important to identify that this relationship exists in the data.
For the temporal model operating on the standard queue in Table 2, when compared against the estimates generated by Slurm and reported in Table 1, this simple KNN approach improves the accuracy of predicted start times up to and including accuracies that fall within 10 min of the actual start time. However, this comes at the cost of the less accurate predictions, where for instance only 61% of jobs in the ARCHER2 standard queue are correctly predicted to start within an hour of the actual start time, compared to 70% for the estimation generated by Slurm. In contrast to the standard queue, predictions for the short queue are more accurate and there is less difference between the basic and temporal models. This is because there is considerably more uniformity to the short queue, where small jobs with a short requested wall time are submitted to a set of reserved nodes, compared with the standard queue, and as such this makes it much more predictable.
We then trained our temporal KNN regression model using historical data from the standard queues of ARCHER2, Cirrus, and 4-cabinet. There is a separate model trained for each machine, and the results of using these trained models to test predicted job start times for 20% of the data, sight unseen, are reported in Table 3. It is interesting to observe that predictions for job wait times on Cirrus using our simple model are considerably more accurate. The high-throughput nature of Cirrus means that jobs on average tend to be smaller and faster running than on ARCHER2 and 4-cabinet, thus starting more quickly. Consequently, the model is biased towards predicting these patterns of jobs, and whilst not all jobs follow this pattern, enough do to make predictions more accurate in this regime. Consequently, on Cirrus improving accuracy beyond this level will be more challenging, as to do so we must undertake accurate start time predictions for jobs that do not conform with the most common machine usage pattern.
Even though the predictions in Table 3 are far from perfect, given the simplicity of the models in use we were surprised at how well they performed. Compared against the accuracy of estimates provided by Slurm, based on the best accuracy reported in Table 1, it can be seen that for ARCHER2 the KNN regression approach provides a greater number of predictions that fall very close (10 min or less) to the actual start time, although there is a reduction for less accurate predictions. For Cirrus, the KNN regression predictions are considerably more accurate than those provided by Slurm.
Whilst the accuracy of prediction delivered by our simple model is encouraging, this is not sufficient for our urgent use-case, or indeed for users more generally wishing to obtain insight around how long their application will queue. We consider a reasonable level of accuracy, within about 10 min or less of the actual start time, to be a minimum requirement.

QUEUE SNAPSHOT MACHINE LEARNING APPROACH
It was highlighted in Section 3 that for accuracy of prediction it is important to include when a job was submitted to the queue as features. This demonstrates there is a correlation between how long a job waits in the queue and when the user submitted it to the HPC machine. However, more generally, it is not the exact time and day when a job was submitted that is directly important, but instead the fact that this represents the queue being in a specific state. For example, at 10 p.m. on a weekend the queue might be very quiet and submitted jobs will start to run quickly, whereas at 2 p.m. on a Wednesday afternoon it is likely that many users are contending for the compute nodes and hence jobs will wait much longer in the queue. Whilst including the date and time of job submission as features improved the prediction accuracy of our simple model in Section 3, it was our hypothesis that these are a rather crude way of representing the state of the queue and accuracy can be improved by providing the current state of the queue as features to our model when training and testing. This is illustrated in Figure 1, where queue state represents, at the time of job submission, all other jobs that are running or waiting to run on the HPC machine.
Consequently, for each job, we capture the state of the queued and running jobs at the time of submission and provide this to our model along with details of the job itself during training. Once trained, when making predictions for job start times using our test data, we take a snapshot of the running and queued jobs from the historically gathered data. When deploying our approach in the real world this can be done in real time, polling via the appropriate Slurm commands and providing the result as inputs to the model.
Representing the state of queued and running jobs must be done in a manner that our models can easily consume. Specifically, providing values directly for every single queued or running job at the time of job submission would be cumbersome and liable to result in a very complex model.
Instead, the current queue state is divided into queued and running jobs, and there are seven features representing the queued jobs and six representing the running jobs. These are summarized in Table 4 and, as can be seen, we are concerned with general features such as the number of nodes requested by the job, the requested wall time, the day of the week, and the time of day. Furthermore, we represent the current state of the queue by features whose names contain _q_ for queued jobs and _r_ for running jobs. The queue state is represented by scalars describing the number of queued and running jobs (s_q_jobs, s_r_jobs), the total number of nodes that have been requested by queued and running jobs (s_q_nodes, s_r_nodes), and the total work, which is the nodes multiplied by the wall time for queued and running jobs (s_q_work, s_r_work).
Furthermore, for queued jobs we also consider the scalar m_q_wait which is the median time that jobs have been waiting to run.
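A sketch of computing these scalar features is given below, assuming each job is represented as a small dictionary of requested nodes, requested wall time (minutes), and, for queued jobs, the time already waited; the dictionary layout is an assumption for illustration.

```python
import statistics


def queue_state_scalars(queued, running):
    """Scalar queue-state features of Table 4. Each job dict carries
    'nodes' and 'wall_time' (requested, minutes); queued jobs also
    carry 'waited', the minutes spent in the queue so far."""
    return {
        "s_q_jobs": len(queued),
        "s_r_jobs": len(running),
        "s_q_nodes": sum(j["nodes"] for j in queued),
        "s_r_nodes": sum(j["nodes"] for j in running),
        # "work" is node-minutes: nodes multiplied by requested wall time.
        "s_q_work": sum(j["nodes"] * j["wall_time"] for j in queued),
        "s_r_work": sum(j["nodes"] * j["wall_time"] for j in running),
        "m_q_wait": (statistics.median(j["waited"] for j in queued)
                     if queued else 0.0),
    }
```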
In addition to the scalar values reported in Table 4 that describe the state of the queue, we also provide three histograms in the feature set for both queued and running jobs. These histograms enable us to categorize the queue state into distinct bins that form a distribution, with these distributions representing the state of the queue. Each histogram comprises eight bins and it is important to appropriately calculate the size and shape of each bin forming the histogram. To achieve this we aim for bins to approximately contain the same number of elements, from a global perspective, across all our jobs of interest in the training data. This is calculated by a script which works through each submitted job in the training data and, based upon the other jobs currently queued and running, calculates the appropriate dimension of each histogram bin. It should be highlighted that, whilst we aim for each bin to hold roughly the same number of elements from the global perspective, at the individual job level the size of each bin representing the current queue state often varies significantly, and it is this which provides the characterization of the state.
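One plausible way to derive such equal-count bins from the training data is quantile (equal-frequency) binning; the exact procedure used by the script is not detailed here, so the following is an illustrative sketch using NumPy.

```python
import numpy as np


def equal_count_bin_edges(values, n_bins=8):
    """Bin edges chosen from training data so that, globally, each of
    the n_bins bins holds roughly the same number of values."""
    return np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))


def queue_state_histogram(values, edges):
    """Histogram of one queue property (e.g., nodes requested by the
    currently queued jobs) over the globally derived bin edges."""
    counts, _ = np.histogram(values, bins=edges)
    return counts
```

At prediction time the same global edges are reused, so an individual job's histogram counts vary with the queue state and characterize it.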
Based on providing this enhanced queue state information we then retrained the KNN models and reran our experiments on the ARCHER2, Cirrus, and 4-cabinet systems for the standard queues. Using the same split and selection of training and test data as previously described, the accuracy of our predictions when including the queue state is reported in Table 5. When comparing against the prediction accuracy of our simple KNN model in Table 3, it is observed that providing the state of the queue as an input to the machine learning model tends to generally improve prediction accuracy, but this improvement is fairly limited.

TABLE 4 The features used in our queue state aware machine learning model.

We hypothesized that the limited improvement in prediction accuracy reported in Table 5 compared against the simple KNN model of Section 3 was because of uncertainties in the queue state. When a job is submitted we know exactly the number of other jobs already running and queued up waiting to run, and the number of compute nodes requested by the queued jobs and in use by the running jobs. However, we do not know exactly how long jobs will run for, and this was highlighted in Reference 10. Users provide a maximum wall time for their jobs, however when surveying the historical queue data we found that this maximum wall time tends to overestimate the actual job runtimes, on average by around eight times on ARCHER2 and 4-cabinet and six times on Cirrus. The most common source of these overestimations is where users select a default value, such as an hour or a day, as the maximum wall time of their job.
Consequently, when a job is submitted to the HPC machine there is uncertainty around how much work there is on the supercomputer, in terms of how much longer running jobs will continue to run for and how long queued jobs will actually run for. This amount of work considerably impacts the start time of a job, and approaches such as Reference 10 look to address this challenge by undertaking a prediction of runtime for currently running jobs. However, predicting the runtime of running and queued jobs requires in-depth knowledge about those jobs, and instead it was our hypothesis that we could use this workload uncertainty as an advantage because it enables us to quantify the uncertainty on our predicted wait time.

FIGURE 2 Illustration of stochastic approach operating over distinct randomly generated queue states.
To address the uncertainty of work in the queue we adopt a stochastic approach where we train our KNN regression model on the actual wall times of jobs from the historical data, as these actual runtimes are known. When undertaking job start time predictions we only know the maximum specified wall time for queued and running jobs. Consequently, a large number of possible queue states are generated based on randomly chosen wall times for the queued and running jobs (as these are the unknowns). Each of these possible states is then run through our trained model as a separate prediction, and the resulting distribution of predicted job wait times is used to determine the expected, mean, wait time.
An error estimate, such as the standard deviation, can also be generated to provide a measure of confidence. This confidence estimation is especially useful for the urgent computing use case: if the estimated accuracy is low then the VESTEC urgent computing system could schedule the workload across multiple HPC machines and pick results from the job which ran through to completion first.
This stochastic approach for job start time prediction is illustrated in Figure 2, where for clarity of presentation we only illustrate four distinct queue states being predicted by our trained model, although in reality we generate 100 distinct queue states. The same job details are provided to each prediction, but each is provided with a separate queue state and generates its own distinct predicted start time. These predictions are then fed into the combination stage, which calculates the overall, mean, prediction and quantifies the uncertainty.
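As a concrete sketch of this combination stage, the following illustrative Python (the function and variable names are hypothetical, not the authors' actual code) pairs the same job features with each randomly generated queue state, runs them all through a trained regression model, and reports the mean prediction with its standard deviation as the uncertainty:

```python
import numpy as np

def stochastic_predict(model, job_features, queue_states):
    """Predict a job's wait time over many randomly generated queue states.

    job_features: 1-D array of per-job features (e.g. nodes, requested wall time).
    queue_states: (n_states, n_queue_features) array, one row per random state.
    Returns the mean predicted wait time and its standard deviation.
    """
    n_states = len(queue_states)
    # Pair the same job details with each distinct queue state
    inputs = np.hstack([np.tile(job_features, (n_states, 1)), queue_states])
    predictions = model.predict(inputs)
    return predictions.mean(), predictions.std()
```

Any regressor with a scikit-learn-style `predict` method (such as `KNeighborsRegressor`) can be passed as `model`, so the same combination stage serves both the KNN and boosted trees variants discussed later.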
For such a stochastic prediction method to work we need to generate a set of random wall times for queued and running jobs as part of the 100 queue states. Ideally these will follow the same distribution of wall times as jobs previously submitted to the queue, and to achieve this we consider the distribution of actual to requested wall time for historical jobs. Figure 3 illustrates our approach of distribution generation for ARCHER2, where we consider all jobs within certain node ranges and aim for the random distribution of queue states for that specific machine to conform to this pattern.
Based upon these distributions of the actual to requested wall times, we then generate a random number that obeys the distribution. This is achieved using the cumulative distribution function of the chosen distribution. Given the probability density function (PDF), in our case the histogram of Figure 3, we determine the cumulative distribution function (CDF) from Equation (2), where the CDF ranges over [0, 1].
Consequently, based upon the CDF, if we pick a uniform random number, x, between [0, 1] we can then obtain a random number y that follows the PDF by using Equation (3). This approach enables us to generate random numbers that have the same distribution as the histogram shown in Figure 3, which is the ratio of actual to requested wall time of all jobs stratified by node count, in the case of Figure 3 within a specific month on ARCHER2. It is not intended that the reader analyses the values reported in Figure 3 in detail, not least because the specifics will change from machine to machine and month to month; instead it provides an illustration of the distribution that our approach uses in generating random numbers. Using these random numbers we can determine random wall times by calculating y times the requested wall time. Therefore, whilst the runtime of each job in each queue state is random, these follow a realistic pattern given the jobs that are typically run on such a machine.

FIGURE 3 Illustrations of queue-based distributions that are used to generate random numbers in our approach which, whilst random, provide a realistic pattern of the jobs being run on the machine because they are picked from this distribution. This distribution is the ratio of actual to requested wall time for all jobs stratified by node count, submitted to ARCHER2 in a single month.

TABLE 6 Prediction accuracy of stochastic queue state approach, running our K-nearest neighbor (KNN) regression model with 100 randomly generated queue states for each job to determine the overall prediction.
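The inverse transform sampling step described above can be sketched as follows; the histogram counts and bin edges stand in for the per-machine, per-node-range distributions of Figure 3, and all names here are illustrative:

```python
import numpy as np

def sample_ratios(hist_counts, bin_edges, n, rng):
    """Draw n random actual/requested wall-time ratios that follow the
    histogram (the empirical PDF) by inverting its cumulative distribution."""
    pdf = hist_counts / hist_counts.sum()
    cdf = np.cumsum(pdf)               # CDF over bin upper edges, ending at 1
    u = rng.random(n)                  # uniform random numbers x in [0, 1)
    idx = np.searchsorted(cdf, u)      # invert the CDF: choose a histogram bin
    # Place each sample uniformly within its chosen bin
    lo, hi = bin_edges[idx], bin_edges[idx + 1]
    return lo + rng.random(n) * (hi - lo)
```

Multiplying a sampled ratio y by a job's requested wall time then gives a random but realistic wall time for that queued or running job, as described in the text.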
From a code perspective, once the raw job data has been cleaned and preprocessed into a usable state, we run a Python script which operates across the data and generates the CDF wall time distributions. This data is then stored and used by a subsequent script which, for each job, constructs a list of the running and queued jobs in the queue at the job's submit time. For each of the 20% of total jobs selected for testing, 100 sets of queue features are generated with the random wall times. Based on the predictions generated by our KNN model for each generated queue state, we then take the mean of the k nearest (based on a distance metric) vectors' actual wait times to be the predicted wait time. The prediction accuracy of this approach is reported in Table 6 for ARCHER2, Cirrus, and the ARCHER2 4-cabinet system. This stochastic queue generation approach improves prediction accuracy, especially for ARCHER2 and the 4-cabinet system. Whilst even our simple KNN approach outperformed Slurm's estimations for the most accurate predictions, this is the first time where we outperform job start estimations generated by Slurm for all levels of accuracy across all machines.

Boosted trees
Until this point we have used a regression machine learning model based on K-nearest neighbors (KNN), as this was demonstrated to work well with queue predictions in References 8 and 14. However, KNN is a fairly simplistic approach and so an important question was whether more advanced techniques would provide increased prediction accuracy. We explored the use of boosted trees,17 which model non-linear relationships in the data; this is potentially advantageous as highlighted by Reference 10, who themselves obtained success using boosted trees in their work.
Otherwise known as gradient boosting, boosted trees rely on the concept of decision tree ensembles, where a model comprises a set of classification or regression trees and features of the problem are split up amongst tree leaves. Each leaf holds a score associated with that feature and, as one walks the tree, scores are combined to form the basis of an overall prediction. A single tree is not sufficient for the level of accuracy required in practice, and so an ensemble of trees, where the model sums the predictions of multiple trees together, is used. As one trains a boosted trees model the trees are built one at a time, and each new tree helps to correct the errors made by previous trees.20 We used the XGBoost library,21 which is an open source framework aiming to provide a scalable, portable and distributed gradient boosting library for Python and numerous other languages. Training our boosted trees model using the same stochastic queue state representation as for the KNN model, we report results for this approach in Table 7. Whilst this generally provides more accuracy than the KNN approach in Table 6, for instance it addresses the reduction in accuracy up to 1 min for Cirrus seen in Table 6, it is not a silver bullet. Still only around 55% of jobs on ARCHER2 and 4-cabinet are correctly predicted to start within 10 min of the actual start time.
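The paper itself uses the XGBoost library; as a self-contained sketch of the same gradient boosting idea, the snippet below trains scikit-learn's `GradientBoostingRegressor` on synthetic stand-in data. The feature layout and hyperparameters are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
X = rng.random((1000, 6))        # stand-in for job + queue-state feature vectors
y = rng.random(1000) * 3600.0    # stand-in wait times in seconds

# Trees are built one at a time, each fitted to the residual errors of the
# ensemble so far; learning_rate scales each new tree's contribution.
model = GradientBoostingRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X[:800], y[:800])      # 80/20 train/test split, as in the paper
predicted_waits = model.predict(X[800:])
```

XGBoost's `XGBRegressor` exposes the same `fit`/`predict` interface, so swapping it in changes only the model construction line.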

COMBINING CLASSIFICATION AND REGRESSION
In Reference 14, the authors improved the accuracy of their predictions by splitting their data between jobs that start within an hour, termed quick starters, and longer waiting ones. They did this because the quick starters were commonplace and found to bias their models towards such predictions. In contrast, until this point we have been using regression models trained with 80% of the historical data for each machine to generate a numeric queue wait time estimation. However, intuitively, users often do not consider wait times to the exact second but instead within a specific bound, for instance whether the job will start within the next minute, 10 min, or hour.
Whilst the quick starters concept developed in Reference 14 was driven by grid computing rather than HPC, nevertheless when exploring the historical queue data on our HPC machines we found that a large proportion of jobs start within 10 s or less. Such quick job start times account for around 25% of jobs on ARCHER2, 60% of jobs on Cirrus, and 28% of jobs on 4-cabinet. Consequently, these frequent jobs with very short queue times bias our models during training towards predicting shorter job queue times across the board. Therefore we modified our approach by first defining categories of job start time and categorizing jobs within these. This categorization no longer involves generating an exact predicted time, as per regression, but instead requires our model to determine which category of start time a job will reside in. We define the term immediate starters, which represents jobs that start within 10 s of being queued, and first use a binary classification model to predict whether jobs are immediate starters or not.
We focus our actual classification on whether a job is a member of the category which contains the most jobs, which for ARCHER2 and 4-cabinet is the non-immediate starters category and for Cirrus is the immediate starters category. We drive the grouping by this largest category because those jobs are more plentiful and hence easier to predict for. For instance, with ARCHER2, categorization is driven by those jobs predicted to be non-immediate starters and every other job is assumed to be an immediate starter.
Those jobs not classified as immediate starters are then used as inputs to a subsequent model which categorizes them into one of seven start categories. These categories are: starting within a minute of being submitted to the queue, within 5 min, within 10 min, within 30 min, within 1 h, within 4 h, or starting over 4 h after being submitted. For both classifications we found that using a boosted trees approach was most effective, and although we use boosted trees compared to Reference 14, who used SVM, in contrast we report both exact and relaxed accuracy. The exact accuracy is the percentage of predictions made in the exactly correct category, whereas the relaxed accuracy is the percentage of predictions which are either correct or mispredicted only into a category either side. The reason for this relaxed measure is that we found it fairly common for some predictions to be close to a category boundary but, because the classifier is making a distinct choice, to be miscategorized into the neighboring category. It can be seen from Table 8 that the classification of jobs which are either immediate starters or starting within a minute of submission is especially accurate. The accuracy is more variable for other categories, although still tending to be fairly good for most.
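The exact and relaxed accuracy metrics described above can be sketched as follows, treating the seven categories as ordinal integers (the function name and encoding are illustrative):

```python
import numpy as np

def exact_and_relaxed_accuracy(true_cats, pred_cats):
    """Categories are ordinal integers, e.g. 0 = within a minute, ...,
    6 = over 4 h. Relaxed accuracy also counts predictions landing in a
    category immediately adjacent to the true one."""
    true_cats = np.asarray(true_cats)
    pred_cats = np.asarray(pred_cats)
    exact = np.mean(true_cats == pred_cats)
    relaxed = np.mean(np.abs(true_cats - pred_cats) <= 1)
    return exact, relaxed
```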
In contrast to Reference 14 we still want to obtain numerical start time predictions, and Figure 5 illustrates the overall flow of our modified approach when predicting the start time of a job. The binary classification of whether the job will start immediately or not is first undertaken for all 100 stochastic queue states and the dominant decision, whether immediate or not, is selected. If the job is selected as an immediate starter then its predicted start time is set to be 10 s after the queue time. Otherwise the job is fed into instances of the classification and regression models, each running with the 100 distinct queue states. For each of these queue states, once the job start category has been determined, the appropriate regression model for that category is selected and used to generate the predicted start time. All 100 predicted start times are then combined, with the mean prediction taken as the overall start time prediction as described in Section 4.
The regression model for each category has been trained on data from that category and the categories either side of it. The reasoning is that, as per Table 8, if a job is going to be miscategorized then it is most likely to reside in one of the categories either side of the correct one. Consequently, by training a regression model with three categories of data we provide the opportunity for these mispredictions to snap back into the correct category. This is illustrated in Figure 4, where those jobs not categorized as immediate starters are categorized as starting within one of seven timing categories on the left of the figure. Boosted trees regression models are trained for each category using data from the category itself along with the categories either side. The appropriate pre-trained model is then selected for undertaking predictions, as illustrated in the overall flow depicted in Figure 5. Table 9 reports overall accuracy for this approach, which results in considerably improved prediction accuracy compared to the models in Section 4. Our combined classification and regression approach correctly predicts jobs starting within 1 min for 63% of jobs on ARCHER2, 76% on Cirrus, and 66% on 4-cabinet, compared with Slurm's estimator reported in Table 1 which accurately predicts jobs starting within a minute at best 16% of the time on ARCHER2 and 4% of the time on Cirrus. From Table 9, it can also be seen that we are reporting three quarters of all job start times accurate within 10 min on ARCHER2 and 4-cabinet, and 90% on Cirrus, compared with Slurm's estimator that predicts accurately within 10 min only 30% of the time for ARCHER2 and 4% of the time for Cirrus.
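Pulling the stages together, the overall prediction flow can be sketched as below. The model objects and helper name are hypothetical, and each per-category regressor is assumed to have been trained on its own category plus the two neighboring ones, as described above:

```python
import numpy as np

IMMEDIATE_START_SECS = 10  # "immediate starters" begin within 10 s of queuing

def predict_start_time(binary_clf, category_clf, regressors, job, queue_states):
    """Sketch of the combined classification-and-regression flow.

    regressors: dict mapping a start category index to the regression model
    trained on that category and the categories either side of it.
    Returns (predicted wait in seconds, uncertainty).
    """
    inputs = np.hstack([np.tile(job, (len(queue_states), 1)), queue_states])
    # Majority vote over all stochastic queue states: immediate starter or not?
    if binary_clf.predict(inputs).mean() > 0.5:
        return IMMEDIATE_START_SECS, 0.0
    # Otherwise classify each state into a start category, then apply the
    # regression model associated with that category.
    cats = category_clf.predict(inputs)
    waits = np.array([regressors[c].predict(row[None, :])[0]
                      for c, row in zip(cats, inputs)])
    return waits.mean(), waits.std()  # overall prediction and uncertainty
```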

Model runtime
In this paper, we have mainly focused our evaluation on the prediction accuracy of machine learning models. However, such models must also be realistic to run for jobs, especially with our focus of undertaking predictions for urgent workloads to rapidly make decisions around job placement across numerous supercomputers. Consequently, the runtime of the models is also important, especially when undertaking inference for job start time predictions. We ran all our machine learning scripts on a 26-core Intel Xeon Platinum (Skylake) 8170 CPU, with the runtime in seconds for different aspects reported in Table 10 for each target supercomputer. With the exception of CDF generation and histogram bin identification, all codes were threaded across all 26 cores. Model training is by far the most time consuming activity, although this only needs to be performed once per machine, whereas queue wait time prediction for a single job takes less than a second. The boosted trees classification and regression models described in Section 5 ran faster than the KNN model described in Section 4, training in considerably less time and also generating predictions in less time too. This was unexpected given the more advanced nature of boosted trees compared to KNN, and the fact that the boosted trees models undertake both classification and regression. Irrespective, for job wait time prediction the runtimes are small and we demonstrate that our approach is realistic to use in a semi real-time fashion, returning results in approximately a tenth of a second for each machine when using our most accurate prediction approach.

USER INSIGHTS GAINED FROM MODELS
An objective of this paper has been to develop a model that can accurately predict job wait times for urgent workloads, enabling informed choices around workload placement. However these models can also be used more widely by users to help understand optimal job configurations when submitting to the queue, for instance answering questions such as whether changing the number of nodes or the maximum wall time will impact the overall queue wait time. Based upon the models developed in Sections 4 and 5, we undertook a number of predictions for the queue state of ARCHER2 on a standard Tuesday morning at 11 a.m., for different numbers of nodes and maximum requested wall times. These predictions are illustrated in the heatmap of Figure 6, where users can obtain specific insights across a range of node sizes and wall times. Predicted queue wait times ranged from the job running immediately to waiting 27 min and 8 s. For example, if the user sought to run over 16 nodes then, if possible, they should set a maximum requested wall time of 2 h or lower, because from a 4 h requested wall time onwards the queue wait time increases sharply. If the user was requesting 32 nodes, then they should avoid requesting a 4 h maximum wall time, as this is predicted to result in a much longer queue wait time than, for instance, requesting an 8 or 12 h maximum wall time on 32 nodes. There are some interesting patterns in the data, for instance a very long queue wait time is predicted for a wall time of 8 h over 128 nodes. From analysing the queue state we saw that there were a larger number of jobs requesting greater node counts than on average, which could in part explain this prediction. However, a major challenge with machine learning models is that they often lack explainability,22 and this instance illustrates a limitation of our approach: it would be useful to be able to further understand why our model predicts a shorter queue wait time for wall times of 12 h over 128 nodes, and indeed for many wall time configurations on 256 and 512 nodes.
Whilst these are simple examples, they illustrate how our prediction models can be used more widely to provide insights to users around how changing the number of requested resources will impact their queue wait time, and hence help them to make more informed choices around job configuration.
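The kind of sweep behind the Figure 6 heatmap can be sketched as evaluating a trained predictor over a grid of configurations while holding the queue snapshot fixed; the predictor interface here is an assumption for illustration:

```python
import numpy as np

def wait_time_grid(predict_fn, node_counts, wall_times, queue_state):
    """Evaluate a trained predictor over a grid of (nodes, wall time)
    configurations for one fixed queue snapshot. predict_fn is assumed to
    take (nodes, wall_time, queue_state) and return a wait time in seconds."""
    grid = np.empty((len(node_counts), len(wall_times)))
    for i, n in enumerate(node_counts):
        for j, w in enumerate(wall_times):
            grid[i, j] = predict_fn(n, w, queue_state)
    return grid  # rows: node counts, columns: requested wall times
```

The resulting matrix can be rendered directly as a heatmap, with each cell showing the predicted wait for that job configuration.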

CONCLUSIONS AND FURTHER WORK
In this paper, we explored the use of machine learning to predict job start times on three production HPC machines that represent a diverse range of machine sizes and usage models. Beginning with job start time estimates provided by Slurm as a baseline, we then explored the accuracy of a simple KNN model. Building upon this simple KNN model we explored how to provide the queue state as an input to our models; however, this was found to be complicated by uncertainties in the overall amount of work in the queue at job submission time, because only maximum wall times are provided and these can vary significantly from the actual runtime. Consequently, we devised a stochastic approach which generates 100 different queue workload states for each job and, whilst these are random, being based on the distribution of wall times from jobs previously submitted they are still a realistic representation.
After exploring the improved prediction that our stochastic approach provides, both with KNN and boosted trees techniques, we then developed a multi-stage approach which combines classification and regression. By adopting this approach we demonstrated significantly improved prediction accuracy for job start time, predicting within 1 min for around 65% of jobs on ARCHER2 and 4-cabinet, and 76% of jobs on Cirrus, as well as accurately predicting three quarters of all job start times within 10 min on ARCHER2 and 4-cabinet, and 90% of jobs on Cirrus. This represents a 3.8 times more accurate prediction for ARCHER2 and 18 times more accurate for Cirrus when compared to Slurm's estimations within a 1 min accuracy window. When considering a 10 min window, our approach is 2.2 times more accurate for ARCHER2 and 20 times more accurate for Cirrus than Slurm's estimations.
The models we have developed can also be used to provide enhanced insights to users around when they can expect their jobs to run. In Section 6, we provided an example of this when changing the number of nodes and requested wall time. The job submission time was fixed to be an average Tuesday at 11 a.m., but it is also possible to vary the queue submission time and explore how submitting jobs at different times might reduce the overall wait time, and this would be interesting to explore in the future. Our approach could be incorporated into a tool that users can use to dynamically explore the most appropriate parameters for their jobs to optimize configurations, as well as potentially enhancing the Slurm queue system to provide more accurate job start predictions.
To further improve the accuracy of our models, when binning the queue state we could consider the distribution of actual to requested wall times within certain wall time ranges, for instance for jobs with requested wall times of less than 1 h. This would provide a more accurate distribution for each given job, although for some edge cases there may be few jobs to draw a distribution from, which would impact accuracy. Furthermore, whilst our start time predictions take into account the existing jobs currently running and queued at job submission, we do not consider additional jobs that might be submitted after the job of interest has been queued but is still waiting to run. These subsequent jobs could impact the job start time, and it would be possible to generate some stochastic representation of the likelihood and dimensions of such jobs, providing these as part of our queue state model inputs, which could further improve prediction accuracy.
We conclude that our approach significantly improves upon the prediction accuracy of Slurm's estimator. In contrast to existing machine learning techniques for predicting job wait time on HPC machines, we are able to generate numerical job start times that tend to fall within 1 to 10 min of actual job start times across our machines of interest. Our approach has been incorporated into the VESTEC urgent computing system to accurately predict job queue wait times across many different machines to ensure suitable job placement, and is also of benefit more widely to users and system administrators of HPC machines.

FIGURE 1 Illustration of model incorporating snapshot of queue for job start time training and prediction.

FIGURE 4 Illustration of categorizing job start times and then, for each category, running the boosted trees model trained on that category and the one immediately preceding and following it.

FIGURE 5 Illustration of overall flow for job start time prediction with combined classification and regression models.

TABLE 9 Prediction accuracy of combination of classification and regression boosted trees models.

FIGURE 6 Predicting job queue wait time on ARCHER2 for different numbers of nodes and maximum requested wall times, where the queue represents a standard Tuesday morning at 11 a.m.
TABLE 2 Prediction accuracy of simple K-nearest neighbor (KNN) models on the queue data for ARCHER2 standard and short queues, comparing a basic model (the requested number of nodes and wall time only as features) against a temporal model (also including job submission time and day).

TABLE 3 Prediction accuracy of simple K-nearest neighbor (KNN) model on the standard queue across our machines of interest.
accurate than those for ARCHER2 and 4-cabinet. This is in contrast with the estimations made by Slurm and reported in Table 1, which were considerably less accurate for Cirrus than ARCHER2, and in all cases the simple KNN model predictions for Cirrus outperform what Slurm can provide.
TABLE 7 Prediction accuracy of stochastic predicted queue state boosted trees model across HPC machines of interest.
Table 8 reports the accuracy of classification for our different job start categories; this classification follows the approach of Reference 14.

TABLE 8 Prediction accuracy of classification of jobs into start categories using boosted trees.

TABLE 10 Runtime of model training and prediction activities on a 26-core Intel Xeon Platinum (Skylake) 8170 CPU for each target supercomputer.