When Do We Need Massive Computations to Perform Detailed COVID‐19 Simulations?

Abstract The COVID‐19 pandemic has infected over 250 million people worldwide and killed more than 5 million as of November 2021. Many intervention strategies are utilized (e.g., masks, social distancing, vaccinations), but officials making decisions have a limited time to act. Computer simulations can aid them by predicting future disease outcomes, but they also require significant processing power or time. It is examined whether a machine learning model can be trained on a small subset of simulation runs to inexpensively predict future disease trajectories resembling the original simulation results. Using four previously published agent‐based models (ABMs) for COVID‐19, a decision tree regression for each ABM is built and its predictions are compared to the corresponding ABM. Accurate machine learning meta‐models are generated from ABMs without strong interventions (e.g., vaccines, lockdowns) using small amounts of simulation data: the root‐mean‐square error (RMSE) with 25% of the data is close to the RMSE for the full dataset (0.15 vs 0.14 in one model; 0.07 vs 0.06 in another). However, meta‐models for ABMs employing strong interventions require much more training data (at least 60%) to achieve a similar accuracy. In conclusion, machine learning meta‐models can be used in some scenarios to assist in faster decision‐making.


Introduction
As of November 2021, COVID-19 was directly responsible for an estimated 5 million deaths and over 250 million cases. [1] Taking the United States as an example, these numbers translate to almost 750 000 deaths from about 46 million cases, [2] while noting that fatality may be underestimated [3] and differs across sub-groups based on factors such as socio-economic status or race and ethnicity. [4] Although reports previously considered that "transition toward normalcy in the United States remains most likely in the second quarter of 2021," [5] the delta variant has effectively triggered a "new phase in the pandemic" [6] as can be seen with a rebound of over 100 000 new cases per day in the United States in August and September. [2] Similar phenomena can be observed worldwide and continue to require action by government officials to limit the spread of disease. [7] For example, France has implemented a COVID-19 "health pass" while Italy has a similar "Green Pass"; mandates for masks or vaccines are on the agenda across several US states; and lockdowns as well as travel curbs are making a comeback in China. A broad set of intervention strategies is available to policymakers, [8][9][10][11] including vaccines, preventative care (e.g., masks and social distancing), or lockdowns (e.g., remote work and travel restrictions). Implementing one of these strategies involves several logistical parameters: for instance, testing requires capacity for contact tracing and policies for quarantine (e.g., is a negative test required to leave quarantine? is a number of days required?); similarly, vaccination involves a complex logistical chain from shipping to administering doses.
DOI: 10.1002/adts.202100343
[12] In addition to the many possible combinations of interventions and parameter values, the "right set" of interventions can vary across places (e.g., based on disease incidence and hospital capacity), across time (e.g., in response to a new wave of infections), and across individuals (e.g., priority for vaccination to those most at risk). This results in a very large search space of possible interventions. [13] Although the necessity of immediate actions in the early days of the pandemic may have resulted in choosing an intervention based on minimal insight, there is now evidence for serious consequences in rolling a sub-optimal intervention: lives may be lost, the cost of future interventions may be heightened, or the adherence (hence the impact) of future interventions may be lowered. [14][15][16] Computer simulations using agent-based models (ABMs) can aid officials in making these decisions, by modeling the effects of specific interventions in specific places (e.g., small towns, [17] educational institutions, [18,19] supermarkets [20] ), populations (e.g., targeted vaccinations [21] ), or time windows (e.g., during a yearly mass pilgrimage [22] ). Several modeling frameworks are available to quickly run simulations for specific interventions, while accounting for individual heterogeneity in risk factors and contact patterns (e.g., by embedding agents across multiple networks such as community and work). However, significant computing resources are required to perform detailed simulations that track individual transmissions and evaluate various interventions. [23,24] Our analysis across six projects showed that www.advancedsciencenews.com www.advtheorysimul.com cloud computing or high-performance computing clusters were frequently needed. [25] This is exemplified by the model from Chang et al., which ran on 4264 compute cores. 
[23] So far, the primary solution to perform resource-intensive agent-based COVID-19 simulations has been to find more computing power. For example, the COVID-19 High-Performance Computing Consortium (covid19-hpc-consortium.org) was created to make private computing resources available to COVID-19 researchers. [26] Similarly, the Partnership for Advanced Computing in Europe (PRACE) offered a fast-track mechanism for access to supercomputers (prace-ri.eu/hpc-access/hpcvsvirus), and national laboratories issued calls for rapid-response research. [27] These computing requirements can be limiting in the context of pandemic responses, because officials are forced to wait for results before acting to prevent the spread of the disease. They also stress inequities in simulation research, as some groups may struggle to perform their simulations in a timely manner due to the lack of resources at smaller research organizations. [28] In this paper, we examine whether massive computations are an absolute requirement to support decision-makers in comprehensively examining the expected consequences of various intervention scenarios in the context of COVID-19. Specifically, we assess whether we can replace a (computationally expensive) COVID-19 ABM with a (cheaper) machine learning model. The core idea is to perform just enough simulations to train an accurate machine learning model, which can then be used to predict the remaining simulation results inexpensively. Our approach consists of generating data from four previously published and validated ABMs for COVID-19 [12,24,29,30] and using the data to train machine learning regression meta-models. By varying the amount of data used to train these meta-models, we characterize the relationship between how much data is used to train a meta-model and how accurate that meta-model is.
Our specific contributions are as follows:
• We develop machine learning regression meta-models for four COVID-19 ABMs to predict the total proportion of the population that will become infected, in response to the intervention scenarios captured by each model's parameters.
• We examine the effects that different amounts of training data have on the overall accuracy of those meta-models, thus establishing the situations under which a COVID-19 ABM may require significant computing power to achieve accurate results.
The remainder of this paper is organized as follows. In Section 2, we summarize the key features of COVID-19 and interventions that are available in the four ABMs used in this study. Our background also briefly explains how ABMs are created for COVID-19 and how machine learning can be used to generate simulation meta-models. Then in Section 3, we cover our methods starting with the implementation and verification of the selected ABMs, and then detailing how we performed our machine learning regressions. In Section 4, we analyze the results produced by the verification data for our ABMs and the machine learning regressions. Next in Section 5, we discuss the interpretation and limitations of our results. Finally in Section 6, we provide concluding remarks and suggest future work that could be undertaken based on our results.

Background
In this section, we will examine key details of COVID-19 and available interventions, as well as how ABMs are created for it. Then, we summarize how machine learning has been used previously to create meta-models for simulations.

COVID-19
COVID-19, caused by the SARS-CoV-2 virus, was first reported in Wuhan, China in December 2019 [31] with newer studies suggesting a first case as far back as mid-November 2019. [32] The virus spreads through droplets that are released when an infected individual coughs or sneezes. As a respiratory disease, the primary mode of transmission is thus via exposure to droplets, either indirectly (e.g., via contact through contaminated objects or hands) or directly (airborne). [33,34] The virus infects cells in an individual's lungs, interfering with the lungs' ability to function properly. [35] Symptoms of the disease include fever, loss of smell, or cough. [36,37] A systematic review of 45 studies reported that 73% of individuals experience at least one persistent symptom. [38] Most commonly, symptoms such as fatigue, sleep disorders, or loss of smell can be experienced for months. [39] Less common consequences include multi-organ damage, [40] for example, in the cardiovascular system, [41] kidneys, [42] nervous system, [43] or immune system. [44] Prior to the development of vaccines, all interventions for COVID-19 were necessarily non-pharmaceutical. These interventions included the use of masks, social distancing, regional lockdowns, and contact tracing. Masks reduce the spread of COVID-19 by lowering the potential for infected particles to enter the environment, and higher mask compliance leads to more effective disease mitigation. [45] Social distancing and the closing of restaurants, gyms, and other public locations led to a statistically significant reduction in the spread of COVID-19 as well. [46] Contact tracing, which helps officials control the spread of the virus by tracking who may have been in contact with an infected individual, has also been found to be effective. [47] In December of 2020, the first pharmaceutical intervention (i.e., vaccine) for COVID-19 became available.
[48] The messenger RNA (mRNA) vaccines for COVID-19 contain a small piece of the SARS-CoV-2 virus's mRNA that triggers the immune system to start producing an immune response to the virus. The FDA found that the Pfizer vaccine was 95% effective against COVID-19, [48] but effectiveness may be reduced by mutations. [49] Mutations known as "variants of concern" impact aspects such as transmissibility and vaccine effectiveness. [50] In particular, some variants can evade antibodies caused by infection and current vaccines, [51] which calls for the development of next-generation vaccines and antibody therapies. [52] The existence of repeat infections [53] as well as breakthrough infections [54,55] (i.e., infection of a fully vaccinated person) has called into question the possibility of herd immunity, [56] that is, the assumption that most transmission would be blocked if a given threshold of the population gains immunity.

Agent-Based Models for COVID-19
In early 2020, given the urgency of the COVID-19 pandemic and the scarcity of data, the first generation of COVID-19 models was based on a compartmental approach in which individuals are aggregated into groups (e.g., susceptible, infected) and a simulation proceeds by applying flow equations between groups. These classic compartmental epidemiological models focused on estimating key epidemiological quantities such as R0, the number of new cases generated by each infected individual. [57,58] Although these models were imperfect in different ways, [59][60][61] experts' predictions have still outperformed those of the general public [62] and their models "have influenced public health policies and increased the familiarity of the general public as well as policymakers with the modeling process, its value, and its limitations." [25] The most commonly used epidemic models categorized the population into susceptible, infected, and recovered (SIR) compartments or added an intermediate stage for exposure (SEIR). [63] A review of models produced between January and November 2020 found that the SIR and SEIR approaches represented 46.1% of all models, whereas ABMs only accounted for 1.3% of studies at the time. [64] As the evidence base progressed, modelers realized that the assumptions of compartmental models (e.g., treating individuals as part of homogeneous groups) were limiting. In the words of Tolk et al., "As the pandemic unfolded, it quickly became evident these were not valid assumptions: the virus does not impact all populations evenly, and the interaction among different groups is far from even."
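The flow-equation approach can be illustrated with a minimal SEIR sketch in Python. All parameter values below are illustrative (e.g., a 5.2-day mean incubation period and R0 = beta / gamma = 2.5) rather than drawn from any published COVID-19 model, and scipy is assumed to be available.

```python
import numpy as np
from scipy.integrate import odeint

def seir_rhs(y, t, beta, sigma, gamma):
    """Flow equations of the classic SEIR compartmental model."""
    S, E, I, R = y                      # population normalized to 1
    dS = -beta * S * I                  # susceptibles become exposed
    dE = beta * S * I - sigma * E       # exposed become infectious
    dI = sigma * E - gamma * I          # infectious recover
    dR = gamma * I
    return [dS, dE, dI, dR]

# Illustrative rates: 5.2-day mean incubation, R0 = beta / gamma = 2.5.
beta, sigma, gamma = 0.5, 1 / 5.2, 0.2
y0 = [0.999, 0.0, 0.001, 0.0]           # 0.1% initially infectious
t = np.linspace(0, 180, 181)            # 180 simulated days
S, E, I, R = odeint(seir_rhs, y0, t, args=(beta, sigma, gamma)).T

attack_rate = R[-1]                     # proportion ever infected
```

Note that the model only tracks group sizes over time: every susceptible individual is treated identically, which is precisely the homogeneity assumption that motivated the shift to ABMs.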
[65] The rationale for the use of individual-level simulation models such as ABM thus centers on the notion of heterogeneity: heterogeneity of risk factors for individuals (e.g., age and underlying or "pre-existing" conditions that increase the risk of severe illness conditions [66,67] ), heterogeneity of behaviors (e.g., interest in vaccination, [68] compliance with mask policies [69,70] ), heterogeneity in socio-ecological vulnerability across places (e.g., lower access to resources in rural counties, [71] urban sub-populations at risk [72] ), and heterogeneity in contact patterns. [73,74] Although the need was clear, the development of ABMs was initially challenged by a lack of data, limited understanding of the disease, and occasionally a limited skillset. [25] The situation has changed with the growing evidence base on COVID-19, the availability of mobility data, and the development of reusable frameworks to instantiate ABMs specifically for COVID-19 (e.g., COVASIM, [75] OpenABM-Covid19, [76] various modeling pipelines [77] ).
An ABM represents individuals (as agents) and their interactions with the environment as well as other agents. Each agent can have a state along with additional characteristics such as age or position within social networks. States in several ABMs are inspired by classic compartmental epidemiological models, hence it is common to use an SEIR model to represent the progression of each agent. [29,78,79] Although earlier models may have relied on only four stages (SEIR), newer ones have introduced additional stages to account for vaccination in two doses, disease severity, and the difference between asymptomatic and symptomatic individuals (Figure 1). In contrast with their mathematical predecessors, ABMs include the movements of agents over a space (e.g., a grid) and interactions with other agents or their environment (e.g., contaminated surfaces), which leads to the spread of the simulated disease. [12] Additional model logic is governed by algorithms, for example, to identify agents who will be vaccinated or infected. One of the factors that makes ABM agents unique is their ability to model individual characteristics such as age or other risk factors in the context of infectious disease. States and characteristics found in the four chosen models for our study are expanded in Table 1. Although each model has been previously published with a detailed technical description, we provide a succinct overview of each model in the subsequent subsections to keep this manuscript self-contained.

Table 1. Characteristics of the agent-based models (columns: Characteristic; Li [12]; Shamil [24]; Silva [29]; Badham [30]; rows cover model states and individual characteristics).

Table 2. Typical wallclock time and computing platform for the four agent-based models.

Model         Wallclock time          Platform
Li [12]       Between 900 and 1200    Cloud computing (Azure)
Shamil [24]   32 492.86 on average    High-performance computing cluster
Silva [29]    727.12 on average       High-performance computing cluster
Badham [30]   34.63 on average        Personal computer
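The contrast with compartmental models can be made concrete with a deliberately minimal agent-based sketch: each agent carries its own state, and transmission arises from individual contacts rather than flow equations. The contact counts and probabilities below are illustrative, and agents mix at random instead of across the realistic community, work, and school networks used by the four published models.

```python
import random

S, E, I, R = "S", "E", "I", "R"         # per-agent SEIR states

def step(states, contacts=8, p_transmit=0.05, p_onset=0.2, p_recover=0.1):
    """One simulated day: infectious agents meet a few random others."""
    new = list(states)                  # two-phase update: read old, write new
    n = len(states)
    for i, state in enumerate(states):
        if state == I:
            for j in random.sample(range(n), contacts):
                if states[j] == S and random.random() < p_transmit:
                    new[j] = E          # a susceptible contact is exposed
            if random.random() < p_recover:
                new[i] = R
        elif state == E and random.random() < p_onset:
            new[i] = I                  # incubation ends
    return new

random.seed(42)
states = [S] * 995 + [I] * 5            # 1000 agents, 5 seed infections
for _ in range(90):                      # simulate 90 days
    states = step(states)
attack_rate = states.count(R) / len(states)
```

Because each agent is an individual record, heterogeneity (age, risk factors, network position) can be added by attaching attributes to agents, which is impossible in a purely compartmental formulation.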
Since a key motivation for our approach (and the use of meta-modeling in general) is to create a cheaper proxy to a computationally expensive model, we exemplify the wallclock time typically required by the four chosen models, across platforms (Table 2). These platforms show the diversity of hardware that end users of models can access, from cloud-computing services (Microsoft Azure with AMD EPYC platform for Li and Giabbanelli [12] ) and high-performance computing clusters (Intel Xeon processors with 96 GB of memory for Shamil et al. [24] and Silva et al. [29] ) to personal computers (AMD Ryzen with 16 GB of memory for Badham et al. [30] ).

Li and Giabbanelli Model
The ABM developed by Li and Giabbanelli [12] is built on the Covasim framework, with several modifications that introduce vaccination. This ABM uses 656 000 agents and simulates the spread of COVID-19 for 180 days. The model contains states for susceptible agents, as well as several different states of infection, including asymptomatic infection and three other levels of infection from mild to critical (Figure 1). It also includes the ability to simulate two vaccine doses for agents, which can remove them from the pool of susceptible agents. In addition to vaccine support, the model also inherits the ability to handle quarantines and contact tracing from Covasim. Finally, the Covasim platform embeds agents across different networks (e.g., community, work, school) to model how interventions have different impacts across settings (e.g., social distancing in the community, face masks at work).

Shamil et al. Model
The ABM created by Shamil et al. [24] includes two possible agent configurations for different cities in the United States. One such configuration is a simplified representation of New York City, which contains 10 000 agents simulated for 90 days. It contains states representing different stages of contagion and symptoms, with agents initially healthy and transitioning through a non-contagious state into a contagious asymptomatic state, then into a contagious symptomatic state, and finally into a recovered or dead state. This model includes representations of large-scale gatherings and various other day-to-day activities that could spread the virus, such as attending work or school. This ABM provides support for contact tracing and quarantines.

Silva et al. Model
The ABM released by Silva et al. [29] models agents in infected states, as well as different levels of virus severity and hospitalization. Age and risk factors are also included for each agent, and these can be used when determining a lockdown policy. This simulation models the interactions of different agents from different home environments as well, and allows for lockdowns that prevent agents from traveling and interacting. This ABM provides support for quarantines, lockdowns, and masks.

Badham et al. Model
The ABM built by Badham et al. [30] models agents in infected states, and contains states for hospitalization and critical infection. Community spread is represented as the agents move around their environment, although movement of agents can be restricted by several policies. This ABM provides support for social distancing, hospital simulation, isolation, and movement restrictions.
Figure 2. Flowchart for the process of training a meta-model.

Simulation Meta-Modeling
Meta-models, also known as surrogate models, are approximations of a more complex model. [80] Although an approximation may be less accurate, this is usually tolerated in exchange for a significant improvement (i.e., reduction) in computational cost such as wallclock time. [80] For example, a meta-model for hydrodynamic and thermal simulation reduced compute time by 93% while only reducing accuracy by 4%. [81] In another simulation of social networks, simplifications resulted in an 85% decrease in runtime and a 32% decrease in memory requirements. [82] As the notion of "models" can be broadly conceptualized across fields, it is important to distinguish two settings. In pure mathematics, meta-models are mathematical functions that approximate the output of another, more complex mathematical function. In the simulation setting of interest here, meta-models are models that predict the results of a simulation with lower computational requirements. Machine learning is one approach to create simulation meta-models. One common type of machine learning meta-model is a regression model, [83][84][85] which is appropriate when the output of the simulation model (which we seek to approximate) is a numerical value based on its input parameters. Another common type of machine learning meta-model is a classification model, which produces a class (i.e., a group with similar characteristics) based on input parameters. [86] In this paper, we focus on regressions, whereby the objective is to provide an accurate but faster proxy to the final result of an expensive simulation. Figure 2 provides a high-level illustration of the process of constructing a simulation meta-model via machine learning. Gathering data for a meta-model involves running the simulation multiple times with varying input parameters. The parameters that should be varied during simulation runs are the desired meta-model input parameters.
[87] For example, to train a clinically relevant meta-model of the human immunodeficiency virus (HIV), an HIV simulation model may be run for several values of clinically relevant parameters such as the start and expected efficacy of treatment. [88] The data produced by the simulation model then becomes the input or training data for the meta-model. The number of data points gathered for training depends on the computational requirements of the simulation itself. [89] Once the meta-model has been trained, its results are compared against the simulation itself using an error metric such as the mean-square error (MSE), root-mean-square error (RMSE), [87] or normalized RMSE. [90] Several sampling methods or "designed experiments" can be used to produce the training data for the meta-model, that is, to select values of the simulation parameters. Such techniques include random sampling, factorial sampling, and Latin hypercube sampling, [91] which have a long track record when applied to simulations. [92] Random sampling picks a certain number of entries or parameter values at random, which may lead to a cluster of points (i.e., over-sampling in some areas) or an absence of points (i.e., under-sampling in others). Factorial sampling explores every value of a parameter combined with every value of every other parameter. [93] For example, if two different parameters could each be either 0 or 1, the factorial set would be (0, 0), (0, 1), (1, 0), and (1, 1). Latin hypercube sampling uses each value for a parameter only once, but ensures an even spread over the domain of parameter values. [94]
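The three designs can be sketched as follows. The two parameters (a mask compliance in [0, 1] and a contact tracing level in [0, 0.9]) and their grids are hypothetical, and scipy's quasi-Monte Carlo module is assumed for the Latin hypercube.

```python
import itertools
import numpy as np
from scipy.stats import qmc

# Factorial design: every value of each parameter combined with every
# value of every other parameter (here, a 3 x 3 grid of 9 combinations).
mask_levels = [0.0, 0.5, 1.0]
trace_levels = [0.0, 0.6, 0.9]
factorial = list(itertools.product(mask_levels, trace_levels))

# Latin hypercube design: 9 samples spread evenly over the unit square
# (each "row" and "column" used exactly once), then scaled to the ranges.
unit = qmc.LatinHypercube(d=2, seed=0).random(n=9)
lhs = qmc.scale(unit, l_bounds=[0.0, 0.0], u_bounds=[1.0, 0.9])

# Random design, for contrast: may over-sample some regions of the
# parameter space and leave gaps in others.
rng = np.random.default_rng(0)
randomized = rng.uniform([0.0, 0.0], [1.0, 0.9], size=(9, 2))
```

The factorial design grows exponentially with the number of parameters, whereas a Latin hypercube lets the analyst fix the sample budget independently of dimensionality.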

Experimental Section
In this section, we explain how we used each ABM to perform the machine learning regressions. A table is provided to list the parameter values used in generating data from each model. When the parameter values require further explanations, a secondary table is also included; additional details can be found in the peer-reviewed manuscripts corresponding to each model. Apart from the Li and Giabbanelli model for which we already had data (as it was produced by our group), we accessed the public implementations of the other models, verified the code vis-à-vis the corresponding publication, and then produced the simulation data. A flowchart of the overall process is provided in Figure 3.

Data Generation and Collection
The datasets generated based on the following procedures can all be openly accessed by readers from our third-party repository, https://osf.io/d7vqa/. Our approach to data collection relied on a factorial design of experiments, thus gathering data for combinations of parameter values. We also assessed whether the models experienced bifurcations or noticeably different simulation paths, as each pathway in a model's execution could then require dedicated data collection. As the models did not exhibit bifurcations (Figure 4), we focused on the creation of a comprehensive dataset in terms of combinations of parameter values.
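A factorial data-collection loop of this kind can be sketched as follows. Here, run_simulation is a hypothetical stand-in for one expensive stochastic ABM execution (its closed-form output is purely illustrative), and the parameter grids and replication count do not correspond to any model in this study.

```python
import itertools
import random
import statistics

def run_simulation(mask_compliance, trace_level, seed):
    """Hypothetical stand-in for one stochastic ABM run; returns a
    fake proportion of the population infected."""
    rng = random.Random(seed)
    base = 0.8 * (1 - mask_compliance) * (1 - trace_level)
    return min(1.0, max(0.0, base + rng.gauss(0, 0.02)))

mask_levels = [0.0, 0.5, 1.0]
trace_levels = [0.0, 0.6, 0.9]
replications = 10

rows = []
for mask, trace in itertools.product(mask_levels, trace_levels):
    # Average several stochastic replications per combination of values.
    outcomes = [run_simulation(mask, trace, seed=r) for r in range(replications)]
    rows.append((mask, trace, statistics.mean(outcomes)))
# 3 x 3 combinations x 10 replications = 90 runs yielding 9 training rows.
```

Each row (parameter values plus averaged outcome) then becomes one training example for the meta-model.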

Li and Giabbanelli Model
This model was produced by our research group and utilized as part of a factorial analysis (in the absence of vaccines) [13] or via a grid search (with vaccines). [12] It was thus the only case in which we had an existing and extensive simulated dataset to rely upon. For our meta-modeling purposes, we split the original dataset into three separate sets based on their vaccine plans: no vaccines in Table 3 and the two vaccination plans in Table 4. The first subset of data captured simulations that did not include a vaccine, the second subset contained the simulations that used one vaccine plan (under the Trump administration), and the third subset was formed of vaccine simulations under the other vaccine plan (from the current Biden administration). These three sets of data were used to train three different meta-models, since the two vaccine datasets contained vaccination while the first, non-vaccine dataset was limited to non-pharmaceutical interventions. The number of simulation runs per combination of parameter values varies, as it was set to automatically spawn new runs until a 95% confidence interval would be reached (in contrast with a pre-determined fixed number of runs). The dataset had a total of

Shamil et al. Model
For the Shamil et al. model, [24] we ran the New York City configuration with four different values for two parameters using a factorial design, where every possible combination is used. We also performed ten replications on each combination. The parameters we used are explained in Table 5. The 85% smartphone ownership level is based on a survey by the Pew Research Center from February 8, 2021. [95] The 14-day quarantine is based on the recommendations of the CDC at a time close to when the model was developed, in December 2020. [96] In total, we generated n = 240 data points from this model.

Table 5 (excerpt). "When to start contact tracing for an exposed individual": immediately, upon confirmation from a positive case.

Silva et al. Model
For the Silva et al. model, [29] we ran the model with a factorial design on five different parameters, running each simulation for 60 days in the model. We also performed 30 replications on each combination. The parameters we used and their values are listed in Table 6. To verify that we used this model correctly, we replicated Figure 5a from the original paper, [29] which tracked agent states over 60 days with no interventions and a population and grid size of 300. In total, we generated n = 13 500 data points from this model.

Badham et al. Model
For the Badham et al. model, [30] we ran the model with a factorial design on four different parameters, performing 30 replications. All other settings were set to the default values for version 1.1 of the model. The parameters we altered are explained in Table 8. To verify that we used this model correctly, we replicated Figure 4 from the original paper, [30] which tracked daily hospital admissions. In total, we generated n = 2430 data points from this model.

Table 9. Explanations of values for the social distancing policy parameter (c.f. Table 8).

Distancing policy   Description
None                No distancing
AllPeople           Everyone reduces their contacts by the "contact reduction" amount
ByContact           The probability of contact is reduced by the "contact reduction" amount

Normalization and Model Training
To perform our regressions, we used TreeLearner regressors from the Orange data mining library (orange3 version 3.26.0) running on Python (version 3.8.3). When we loaded our datasets, we converted their cumulative infection numbers to proportions of the total population, so that the results from the four different models are comparable. Then, we used a tenfold cross-validation method to train our meta-models. Cross-validation involves separating the data into ten different "folds." One by one, these folds are used as testing data for our model, while the other nine folds are used to train the model. We also needed to optimize the parameters of our regressor, so we used hyperparameter optimization to maximize our meta-model accuracy.
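The normalization and cross-validation steps can be sketched as follows. The study itself used Orange's TreeLearner; the snippet below substitutes scikit-learn's DecisionTreeRegressor as a widely available equivalent, and the dataset is a synthetic stand-in rather than actual simulation output.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for simulation output: two intervention parameters
# and a cumulative infection count for a population of 10 000 agents.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 2))
population = 10_000
cum_infections = population * 0.8 * (1 - X[:, 0]) * (1 - X[:, 1])

# Convert counts to proportions so outputs are comparable across models.
y = cum_infections / population

# Ten-fold cross-validation: each fold serves once as the test set while
# the other nine folds train the regressor.
tree = DecisionTreeRegressor(max_depth=10, min_samples_split=20, random_state=0)
folds = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(tree, X, y, cv=folds,
                         scoring="neg_root_mean_squared_error")
rmse = -scores.mean()                   # in proportions of the population
```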
Hyperparameter optimization ensures that the regressor uses the most accurate set of model parameters by performing a second set of cross-validations, which is known as a nested cross-validation design. The training data from the first set of folds is divided into a further ten folds, where one fold at a time is used as a validation set. Then, the remaining nine folds are used to train a model using the different combinations of parameters to select the best combination. To select our parameters, we used a Latin hypercube design, which ensures that the sample space for parameters is covered effectively. We optimized the two parameters that have the most impact on the decision tree regressor:
• The maximum depth of the tree. A smaller depth forces the meta-model to be simplified. We considered values from 5 to 50 by steps of 5.
• The minimum number of samples to split a node. Raising this minimum stops the machine learning algorithm sooner and delivers a simpler model. We examined values from 10 to 100 by steps of 10.
For details on the impact of these parameters on a decision tree, we refer the reader to two recent examples of optimization. [97,98] The optimization process was conducted using a Latin hypercube with ten different samples (Figure 5).
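This nested design can be sketched as follows, again substituting scikit-learn for Orange and synthetic data for simulation output: an outer ten-fold loop estimates accuracy, while an inner ten-fold search selects among ten Latin hypercube samples over the two hyper-parameter grids.

```python
import numpy as np
from scipy.stats import qmc
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

depths = list(range(5, 55, 5))           # maximum depth: 5 to 50 by 5
min_splits = list(range(10, 110, 10))    # min samples to split: 10 to 100

# Ten Latin hypercube samples over the two grids: each sample selects
# one (max_depth, min_samples_split) pair.
unit = qmc.LatinHypercube(d=2, seed=0).random(n=10)
candidates = [{"max_depth": [depths[int(u * len(depths))]],
               "min_samples_split": [min_splits[int(v * len(min_splits))]]}
              for u, v in unit]

# Synthetic stand-in dataset with two intervention parameters.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(300, 2))
y = 0.8 * (1 - X[:, 0]) * (1 - X[:, 1])

# Inner ten-fold search picks the best candidate; the outer ten-fold
# loop evaluates the selected model on held-out folds.
inner = GridSearchCV(DecisionTreeRegressor(random_state=0), candidates,
                     cv=KFold(n_splits=10, shuffle=True, random_state=0),
                     scoring="neg_root_mean_squared_error")
outer_scores = cross_val_score(inner, X, y,
                               cv=KFold(n_splits=10, shuffle=True, random_state=1),
                               scoring="neg_root_mean_squared_error")
nested_rmse = -outer_scores.mean()
```

Keeping the hyper-parameter search inside the inner folds prevents the selected settings from leaking information about the outer test folds.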

Model Evaluation
Once the hyperparameter optimization selected the most accurate set of parameters, these parameters were used to build the model for the outer folds. Then, we analyzed how the accuracy of the models changed as a function of the amount of data that was used to train them, by calculating the RMSE for their predicted infected proportions (Equation 1). RMSE is a common error metric for regression models [99][100][101] and it is useful because the units for RMSE are the same as the units for the simulation output (i.e., our errors for this paper are measured in proportions of the population).
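The RMSE (Equation 1) follows its standard definition, where y_i denotes the simulated proportion of the population infected, ŷ_i the meta-model's prediction, and n the number of test points:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{1}
```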
While the RMSE is informative for the quality of the regression, it does not suffice to compare two models, especially when the inputs are auto-correlated in the simulation model and are assumed to be independent in the meta-model. Consequently, we complement the RMSE by comparing the point estimates of the meta-model and simulation model on different outputs (number of deaths, number of infections) and across levels of policyrelevant control parameters (e.g., reduction in contacts).
Finally, we measure the time necessary to train the meta-model (wallclock time measured in seconds). This measure is important to assess whether the complete pipeline is truly time-saving for the end user, as the training time may otherwise be a hidden and potentially prohibitive cost for some machine learning methods. In sum, if we need few expensive simulations to create a meta-model that achieves a low RMSE and the meta-model can be created quickly, then there is a demonstrated incentive to use our solution instead of current practices.

Results
The ABMs used for this study were chosen as we could access their code and verify our use of the model vis-à-vis published figures in peer-reviewed manuscripts. The first subsection focuses on this verification effort, by contrasting the results that we obtained from the model with the authors' published results; figures were all reproduced with the authors' permission. The second subsection is centered on the machine learning results. To provide full transparency and support replicability efforts regarding machine learning, the training data used in this subsection is openly accessible on the third-party repository https://osf.io/d7vqa/, provided by the Open Science Framework.

Agent-Based Model Verification
As noted in Section 3.1.1, the model by Li and Giabbanelli [12] was not subject to verification, since we directly used data produced by the model. That is, we already had access to its full simulation results. In contrast, the other three models had to be verified since we did not have a spreadsheet of results to use, and hence we ran the code provided by the authors. The verification sought to replicate the published work of the authors, to ensure that we were using their model adequately.

Figure 6. Figure 5a from Shamil et al. [24] (left) with our replication (right) using five repetitions. The left figure is reproduced with permission. [24] Copyright 2021, The Authors, published under exclusive licence to Springer Science+Business Media, LLC part of Springer Nature.
For the Shamil model, we replicated Figure 5a from Shamil et al. [24] at six different contact tracing levels (0, 0.6, 0.7, 0.75, 0.8, and 0.9). Qualitatively, we note that the shapes and trajectories of the infections are similar in our replication (Figure 6). However, values occasionally differ and some of the curves for the higher trace levels also overlapped in our replication, which is not the case in the original. Hence, results are not reproduced in terms of outcome, but they are "analysis reproducible" [102] since the same conclusions about the effect of a COVID-19 intervention are reached based on the authors' original results and the results obtained here. Based on previous attempts at replicating simulation models, [103] a possibility is that the high variability in the model's output results in noticeable differences across individual runs, and only a large number of runs (e.g., to achieve a 95% confidence interval) would allow comparable curves to be created. In other words, the discrepancies are a likely consequence of output variability in the published work.
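The confidence-interval reasoning above can be sketched with a normal approximation over repeated runs; the run values below are hypothetical illustrations, not outputs of the Shamil model:

```python
import statistics

def mean_ci95(samples):
    # Mean and normal-approximation 95% confidence interval
    # across repeated stochastic runs of an ABM output.
    n = len(samples)
    m = statistics.mean(samples)
    half = 1.96 * statistics.stdev(samples) / n ** 0.5
    return m, m - half, m + half

# Hypothetical peak infected proportions from ten runs of one scenario.
runs = [0.31, 0.27, 0.35, 0.29, 0.33, 0.30, 0.26, 0.34, 0.28, 0.32]
mean, lo, hi = mean_ci95(runs)
print(f"{mean:.3f} [{lo:.3f}, {hi:.3f}]")
```

With more repetitions, the interval narrows and replicated curves become easier to compare against published ones.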
For the Silva model, we replicated Figure 5a from Silva et al. [29] using a population size and grid size of 300 (Figure 7). The state counts in our replication closely follow the average state counts of the original. For the Badham model, we replicated Figure 4 from Badham et al. [30] using the model's atJul13 scenario (Figure 8). The shape and values in our replication closely follow the original figure. Both models were thus reproduced at the level of the outcome.

Machine Learning Results: RMSE and Point Predictions
The RMSE for the Li and Giabbanelli model [12] without vaccines is shown in Figure 9. The RMSE is around 0.0795 regardless of the sample size, which shows that running as little as 5% of the experiments from the ABM suffices to then infer the rest via machine learning. However, we note that the error margins do benefit from an early increase in data, up to 30% of the sample size. The situation changes when vaccines are introduced. For both the Trump administration's vaccine plan and the Biden administration's plan, we see that the RMSE strictly decreases as the sample size increases (Figure 10). The error can be as low as 0.0034, while noting that the decrease is essentially from a low initial error (RMSE of about 0.0041 to 0.0043 at 5% of the sample size) to a slightly lower one.
Different situations are observed among the other three models. For the Shamil et al. model, [24] results show an error around 0.065 when using most of the data (Figure 11). Most interestingly, the error has a non-monotonic relation with the sample size: it is lowest at the smallest sample size (10%) and at about half of the sample (45%), but higher in between. For the model by Silva et al., [29] we see a decrease in RMSE as the sample size increases (Figure 12). Again, we note that the gains should be contextualized given the small range of the RMSE: 10% of the dataset suffices to achieve an error of 0.028 and even taking the full dataset only brings it down to 0.0225 at best. Finally, the model by Badham et al. [30] shows a decrease of RMSE as the sample size increases up to 50% of the dataset, after which the error plateaus (Figure 13). Relative to its scale, we emphasize that the total decrease in RMSE between the 5% and 100% sample sizes is only around 0.01.

Figure 7. Figure 5a from Silva et al. [29] (left) with our replication (right) using ten repetitions. The left figure is reproduced with permission. [29] Copyright 2021, Elsevier.

Figure 8. Figure 4 from Badham et al. [30] (top) with our replication (bottom) using 1000 repetitions. The top figure is reproduced under the terms of the Creative Commons CC-BY License. [30] Copyright 2021, The Authors, published by JASSS.
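The relation between training-set size and accuracy discussed above can be sketched as a learning curve. The snippet below uses a depth-1 regression stump as a minimal stand-in for the decision tree regressor, trained on entirely synthetic data, so the numbers are illustrative only:

```python
import math
import random

def fit_stump(xs, ys):
    # One-split regression stump: choose the threshold minimizing squared
    # error, predicting the mean on each side (a depth-1 decision tree).
    best = None
    pairs = sorted(zip(xs, ys))
    for i in range(1, len(pairs)):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, pairs[i][0], ml, mr)
    _, thr, ml, mr = best
    return lambda x: ml if x < thr else mr

random.seed(0)
xs = [random.random() for _ in range(200)]
ys = [(0.1 if x < 0.5 else 0.4) + random.gauss(0, 0.02) for x in xs]
train_x, train_y = xs[:150], ys[:150]
test_x, test_y = xs[150:], ys[150:]

for frac in (0.1, 0.5, 1.0):  # fraction of the training data used
    k = int(len(train_x) * frac)
    model = fit_stump(train_x[:k], train_y[:k])
    rmse = math.sqrt(sum((model(x) - y) ** 2
                         for x, y in zip(test_x, test_y)) / len(test_x))
    print(f"{frac:.0%} of training data -> RMSE {rmse:.3f}")
```

For a model with simple dynamics (like the synthetic step function here), the held-out RMSE is already low at a small training fraction, mirroring the plateau behavior reported for the ABMs without strong interventions.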
To further assess the errors, we compared the outcomes of the meta-model and the four simulation models on two variables of interest (death cases, infections) across several values of policy-relevant parameters such as the tracing percentage or contact reduction. Results show that the meta-model follows the simulation model very closely across several intervention scenarios (Figure 14a,c,d). Although Figure 14d stands out, the tight y-axis means that the meta-model has settled on one value (0.19) whereas the simulation model still makes very small changes (by at most 5 × 10⁻⁵). The main discrepancy is seen in Figure 14e, where the meta-model follows the overall trend of the simulation but point estimates can deviate by as much as 0.025.

Training Times
For each ABM, we measured how long it took to create the machine learning meta-model based on 100% of the data. This is an upper bound, as the previous section has shown that comparable estimations are obtained when the meta-model is trained with less data. Results (Table 10) were obtained on a personal computer, showing that the meta-model can be built within minutes. The training time depends on the complexity of the search space, as evidenced by variations for a single model based on either the policy scenario (Li and Giabbanelli) or the desired output (Silva et al.). These time estimates can be contextualized based on Table 2, which showed the time to run a single simulation from each ABM. The contrast reveals that the cost of building the meta-model is less than the cost of running a single simulation from the corresponding ABM, which often needed a computing cluster. In conclusion, there is no "hidden cost" in building the meta-models as this operation is computationally inexpensive compared to the simulations.
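The wall-clock measurement itself is straightforward; a minimal sketch, in which `train_meta_model` is a hypothetical placeholder for the actual training pipeline:

```python
import time

def timed(fn, *args):
    # Return fn's result and its wall-clock duration in seconds.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Hypothetical stand-in for training the meta-model on the full dataset;
# the real pipeline would fit the decision tree regressor here.
def train_meta_model(n_rows):
    return sum(i * i for i in range(n_rows))

result, seconds = timed(train_meta_model, 100_000)
print(f"training took {seconds:.4f} s")
```

Using `time.perf_counter` (a monotonic, high-resolution clock) avoids the drift and adjustment issues of wall-clock time functions when timing short operations.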

Discussion
As real-world data does not directly support what-if analyses and policy evaluation of alternatives, high-resolution models such as ABMs are needed to support policymakers in running detailed scenarios and building trust in their outcomes. However, these ABMs can be computationally intensive, thus requiring resources that policymakers may not have readily at their disposal, or taking too long given the urgent need for action. In this case, meta-models are needed to support analysts and decision-makers by generating results faster, on ubiquitous hardware. While many other works have been devoted to either creating ABMs for COVID-19 or applying machine learning to real-world data, [65] our study is the first to combine these techniques to examine whether access to COVID-19 ABMs could be democratized by making them more computationally affordable for end users. Specifically, we analyzed how the amount of data used to train a regression meta-model affects its accuracy and differentiated situations where more data is required from situations where small amounts of data are sufficient. This analysis contributes to the broader literature on assessing and simplifying COVID-19 models, which has already established that the number of parameters could be decreased significantly. [59] Our results show that models which do not have strong interventions like lockdowns or vaccines do not require as much training data, hence it is possible to run a few computationally expensive simulations and then switch to an inexpensive meta-model. The Covasim model with no vaccines and the Shamil model showed no discernible downward trend in their RMSE values, so adding more training data would not guarantee an increase in model accuracy.
While the Badham model showed a slight decrease in RMSE, it only decreased by 0.01 from 5% to 100% sample size. Conversely, the models which did have strong interventions showed a strong downward trend in their RMSE values. Both Covasim models that used vaccines only began to stabilize after 60% of the data was used in training, and while the Silva model stabilized slightly faster, it took around 45% of the data to do so. Note that the RMSE values were low to start with (e.g., less than 0.0045 for Covasim with vaccines), so the initial error may already be tolerable for some end users as part of a trade-off between accuracy and computational needs.

Figure 14. […] bottom row: death cases) and policy scenarios for the four models. Other parameters were fixed as follows: a) distancing policy: by contact, short/long movement reductions: 0.25; b) layers impacted: all, contact reduction in work/school/community: 0.7, daily tests: 1.1M, quarantine: both, test sensitivity 0.55, time to contact trace 7 days, no presumptive quarantine; c-d) 500 agents, grid size 500, conditional lockdown, masks; e) 3 days required for contact tracing.
There are three main limitations to our work. First, we focused on peer-reviewed COVID-19 ABMs that we could run to obtain data and for which we could verify our use of the model vis-a-vis published results. As shown in previous calls for transparency in COVID-19 modeling, modelers do not systematically provide their code, [104] despite the existence of several platforms in which scientists can openly share code and data. [105] In a transparency assessment, Jalali et al. found that most models do not share their code, [106] which echoes similar observations about practices in agent-based modeling across application domains. [107,108] Our criteria thus meant that we could only assess a subset of existing models, and it is possible that different trends or initial error levels would be observed in other models. We note that projects that shared their code were also transparent regarding how their computational results were produced, [109] hence we were able to perform verification and we applied the same level of transparency when conveying the model's parameters. Second, several key parameters and assumptions regarding COVID-19 continue to change. For example, a study on almost 100 000 volunteers in July 2021 found a vaccine effectiveness of 49%, which is less than the lowest value assumed in some of the previous modeling studies. [12,110] Other studies have shown that vaccinated individuals have a viral load comparable to unvaccinated ones within the first few days, [111,112] whereas the flow diagrams used in many models have considered that vaccinated individuals were fully removed from the population. Our conclusions are limited to COVID-19 ABMs developed so far, since future models may exhibit markedly different dynamics. In particular, bifurcations may be present in future models, which would require an analysis of models by clusters of trajectories rather than average dynamics. [113] Although in reality there are rare events whose large impact (e.g., superspreader events, the highly publicized case of a celebrity) can set the rest of the simulation on a very different trajectory, most current ABMs do not (yet) account for these events, hence they are easier to approximate with a meta-model. Many of the COVID-19 models exhibiting bifurcations are built from nonlinear differential equations rather than ABMs; these models provide abundant illustrations of bifurcations of various types, [114][115][116][117][118] such as the Hopf bifurcation (and the related Neimark-Sacker bifurcation) or the period-doubling bifurcation.
Finally, the size of the datasets that we generated was limited by the high computational needs of the COVID-19 ABMs. For example, the Shamil model required around 8 h per run on a node of a high-performance computing cluster. Given this limitation, our conclusions focus on the trend (e.g., is there a reduction in error as more simulations are used?) rather than on a precise point estimate of the error (e.g., exactly how many simulations are required to achieve a given error level).
There are several potential avenues for future work. We focused on predicting the final result of the simulation, but there would also be merit in predicting the time series of outputs. Although meta-modeling of simulation time series is more common in finance, [119] future studies could also apply these techniques to COVID-19 ABMs. In particular, several such ABMs (e.g., COVID-Town [120]) have been developed to perform a joint analysis of economic and epidemic dynamics, and these analyses are particularly interested in the shape [121] of the economic recovery over time (i.e., the time series).
Another possibility would be to explore the prediction of multiple outputs. Indeed, various stakeholders may be interested in the effects of a COVID-19 intervention on different outcomes, ranging from epidemiological (e.g., number of new cases) to logistical (e.g., spare capacity in intensive care units) and economic (e.g., loss in productivity or revenues). Creating a multi-value regression model would thus address the multi-value optimization problem whereby decision-makers seek a balance between the effects of an intervention on health, the economy, and other (possibly non-correlated) issues. While we may expect that more simulations are necessary to train an accurate meta-model able to predict several partially correlated outputs, the specific relationship remains an open question. That is, we still need to establish how the number of predicted outputs raises the need for simulation data. In the interim, modelers can follow our approach to predict the effect of an intervention on different evaluation metrics, but they would create one regression model per metric rather than a single model predicting all metrics at the same time. Figure 15 exemplifies the performance of a model trained to predict the number of deaths rather than infections on the Silva et al. ABM, showing that the model is already accurate with a very small fraction of the data and can be further improved if more data is provided.
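The interim approach above (one regression model per evaluation metric) can be sketched as follows; the metric names are hypothetical and a mean-baseline regressor stands in for the decision tree:

```python
# Train one regressor per outcome metric rather than a single
# multi-output model; a mean baseline stands in for the regressor.
class MeanRegressor:
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)

# Hypothetical simulation outputs keyed by metric (proportions).
outputs = {
    "new_cases":    [0.12, 0.18, 0.15],
    "icu_capacity": [0.40, 0.35, 0.38],
    "lost_revenue": [0.05, 0.09, 0.07],
}
X = [[0.2], [0.5], [0.8]]  # e.g., contact-reduction levels

models = {metric: MeanRegressor().fit(X, y) for metric, y in outputs.items()}
for metric, model in models.items():
    print(metric, round(model.predict([[0.6]])[0], 3))
```

Swapping the baseline for any regressor with the same `fit`/`predict` interface keeps the per-metric structure unchanged, so each stakeholder's metric can be predicted independently.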
Finally, the amount of data necessary to train an accurate model may be further reduced depending on the design of experiments. [122] Indeed, when a high-resolution simulation system is used to train a meta-model, the design of experiments that produces the data is essential for the success of the meta-model. [123,124] In ABMs that have a very large number of parameters, an important first step would be to assess which ones play a role in determining the output, either by themselves or in combination with other parameters (second-order effects, third-order effects, etc.). This can be approached by a factorial design of experiments and an ensuing analysis that decomposes the variance of the output over the parameters. As noted above, such analyses have shown that the number of parameters could be decreased significantly. [59] This approach may be less applicable to the ABMs studied here, as they have only a handful of parameters, which play a statistically significant role either directly or in combination with other parameters. [13] In this case, an adaptive design of experiments may be more helpful [125][126][127] than a fixed design.
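A full factorial design, as mentioned above, simply crosses every level of every parameter so that a variance decomposition can attribute effects to parameters and their interactions; the parameter names and levels below are hypothetical illustrations, not those of any ABM studied here:

```python
from itertools import product

# Full factorial design over hypothetical ABM parameters: every
# combination of levels becomes one simulation configuration.
levels = {
    "contact_reduction": [0.0, 0.35, 0.7],
    "daily_tests":       [100_000, 1_000_000],
    "quarantine_days":   [7, 14],
}
design = [dict(zip(levels, combo)) for combo in product(*levels.values())]
print(len(design))  # 3 * 2 * 2 = 12 configurations
print(design[0])
```

Each configuration would be run (possibly with repetitions) before fitting the meta-model; an adaptive design would instead add configurations where the meta-model is most uncertain rather than enumerating the full grid up front.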

Conclusion
In this paper, we analyzed how the amount of data used to train a simulation meta-model affected the accuracy of the model. We found that for models with no strong interventions such as vaccines or lockdowns, a small amount of data could generate a model with similar accuracy to one trained on a much larger amount of data. However, models which had strong interventions required large amounts of data to train a model that achieved a stable accuracy. These results indicate that modeling the spread of COVID-19 without strong interventions can be done with very little data, but when stronger interventions are considered, much more data is required to train an accurate model.

Figure 15. Root-mean-square error of the predicted proportion of the population that died (rather than being infected) as a function of the fraction of data used to train the meta-model. Insets provide a per-fold graph showing the same RMSE (y-axis) across folds in a 10-fold cross-validation (x-axis); note that folds are unordered, hence insets cannot be used as trend graphs.