r2ogs5: Calibration of Numerical Groundwater Flow Models with Bayesian Optimization in R

Large-scale and high-resolution groundwater models are becoming increasingly important for clarifying the extent to which climate trends and extreme weather affect the groundwater balance regionally. As a result, the parameterization of groundwater models is becoming more detailed and more complex, making conventional calibration methods too time-consuming. Moderating the computational demand of finding optimal solutions for the resulting, potentially multi-modal objective function requires intelligent and efficient global optimization methods. Moreover, the increasing use of the modern scripting languages R and Python to craft environmental analysis workflows calls for integrating groundwater flow simulators into these languages. Here we introduce and exemplify r2ogs5, a tool that integrates version 5 of the open-source simulation software OpenGeoSys into the programming language R. r2ogs5 allows for calibration of numerical groundwater flow models with a sequential model-based optimization approach that combines Bayesian optimization (BO) with surrogate modeling. We describe the structure and function of r2ogs5 as well as the implemented calibration method, and then demonstrate the method by calibrating 4 and 12 parameters of two simple groundwater flow models. The results indicate that, in the case of the 12-parameter model, this method needs fewer runs of the groundwater flow model than conventional gradient search and Latin hypercube sampling. We believe that our method offers the potential to calibrate computationally expensive groundwater flow models. r2ogs5 gives groundwater flow modelers access to the statistical analysis and visualization capabilities of the R language and is a valuable tool for geoscientists already using R.


Introduction
Numerical groundwater flow models (NGWM) are essential tools to analyze the development of water resources. In recent years, NGWM have become more complex in terms of the number of included processes, and larger model domains with a higher spatial resolution (Doherty and Simmons 2013;Asher et al. 2015). This translates into high computational costs for executing a single NGWM for which a set of parameters is defined. Therefore, model calibration becomes increasingly expensive in terms of computing resources and run time (Gorelick and Zheng 2015). During calibration, the parameters are optimized to find a model that best fits observations from the modeled system. Additionally, the objective function (deviation of simulated to observation data) that is to be minimized during calibration is often multi-modal, that is, may contain several local optima.
Here, global optimization methods are required.
NGWA.org Vol. 61, No. 1-Groundwater-January-February 2023 (pages 119-130)

In such a case, conventional gradient search (GS) methods suffer from getting trapped in local optima and require several runs of the NGWM per iteration to approximate the gradient (Byrd et al. 1995). When gradient-based methods are applied in conjunction with multi-start sampling algorithms to increase the probability of finding the global optimum, the total number of required NGWM runs is multiplied further (Gorelick and Zheng 2015). Similarly, probabilistic approaches such as simulated annealing (SA) require hundreds of model runs and are not suitable for computationally expensive NGWM such as regional-scale models (Pirot et al. 2019).
Current research focuses, for instance, on ensemble smoothing, a method scalable to computationally expensive models with large numbers of NGWM parameters (>100), and on surrogate modeling to estimate the NGWM parameter probability distributions via Bayesian inversion (Xu and Valocchi 2015; Xu et al. 2017; Zhang et al. 2020). PEST (Doherty 2015), a parameter estimation tool that is employed by the well-known modeling software packages MODFLOW (Langevin et al. 2021) and FEFLOW (Trefry and Muffels 2007), relies mostly on gradient-based optimization but also includes an iterative ensemble smoothing method (White et al. 2020).
Surrogates (also known as meta-models or response surface models) are simple data-driven models that describe the relationship between NGWM parameters and corresponding simulation results. If properly trained, surrogates can replace the expensive NGWM to reduce the computational demand for exploring the parameter space of NGWMs (Asher et al. 2015). For training surrogates, sampling-based strategies like Latin hypercube sampling (LHS) are applied, which in turn, still require hundreds to thousands of NGWM runs (Xu et al. 2017). A random sample of parameters, such as in LHS, will also be highly inefficient to represent the NGWM at critical locations (i.e., the optimum) in the parameter search space. Such training approaches become limiting for models that already require high-performance computers to run a single model. Here, new strategies for efficient training of a sufficiently accurate surrogate are required.
A promising solution is Bayesian optimization (BO), an active learning algorithm also referred to as sequential model-based optimization in the literature (Kushner 1964; Mockus et al. 1978). It differs from other surrogate-based optimization procedures in the way the training data for the surrogate model are acquired. In BO, the surrogate is trained in a sequential way, seeking the least number of NGWM runs. A surrogate model, here a Gaussian process (GP), represents the prior beliefs on how the objective function that we seek to minimize changes with the parameters. Sequential Bayesian updates with new NGWM results yield a posterior surrogate model for a likely objective function. The parameters for the update are selected according to a criterion that determines where the surrogate model needs improvement in order to better predict the minimum. Therefore, BO is very data-efficient and is often used for optimization problems with expensive objective functions, for instance in economics, engineering, and machine learning (Shahriari et al. 2016; Karban et al. 2021), when the number of parameters is not too large (<100). Note that by the terms expensive objective function and computationally expensive model we refer to models with long run times. For groundwater flow modeling, therefore, BO seems promising if a model is expensive in terms of run time without having too many parameters.
In geoscience, however, there are only a few studies reporting the use of BO (Hamdi et al. 2017; Christelis et al. 2019; Pirot et al. 2019). Other sophisticated surrogate-based optimization algorithms, named "adaptive-recursive approach" by Razavi et al. (2012), follow a similar logic as BO and are popular for solving design problems in the environmental sciences (Bau and Mayer 2006; Di Pierro et al. 2009; Fen et al. 2009). A version of this algorithm that uses radial basis functions as surrogates already showed strong advantages over GS and other conventional optimization heuristics for the calibration of groundwater bioremediation models (Mugunthan et al. 2005). In comparison to this approach, however, BO not only focuses on the surrogate prediction but combines it with the prediction uncertainty to balance exploration (to reduce surrogate uncertainty) and exploitation (to focus on the current predicted minimum) of the parameter search space. To the best of our knowledge, BO has not yet been applied to calibrate numerical groundwater flow models.
Furthermore, BO is not available in calibration tools such as PEST or groundwater flow modeling packages such as FEFLOW, MODFLOW, or OpenGeoSys (OGS; Bilke et al. 2019; Kolditz et al. 2012). Many simulators such as OGS do not ship calibration tools at all and rely on third-party software. A promising alternative is to integrate groundwater flow modeling packages into modern data science languages such as R and Python via application programming interfaces (APIs); examples are FloPy (Bakker et al. 2016), ogs5py (Müller et al. 2021), RedModRPhree (De Lucia and Kühn 2021), and toughio (Luu 2020). R and Python offer many functionalities for model pre- and postprocessing, including optimization and visualization, and provide the flexibility to quickly customize and integrate new methods into modeling workflows (Müller et al. 2021).
In this paper, therefore, we introduce and exemplify r2ogs5, a tool that integrates OGS version 5 into the statistical programming language R. Although a Python API for OGS version 5 already exists (Müller et al. 2021), we considered the prospect of integrating OGS into R very promising, as R, in addition to Python, is one of the most popular programming languages for data science, statistics, and visualization (Karakan 2020), and in the geosciences (Lovelace et al. 2019). r2ogs5 is an R package that allows users to control OGS simulation runs and retrieve the output within an R session. In this way, reproducible and shareable modeling workflows can be scripted easily with R. To leverage the advantages of BO for calibration, we implemented this method into r2ogs5.
In this article, we (1) describe the structure of the OGS interface r2ogs5, (2) give a brief introduction to how to use it, (3) present the integrated BO calibration method, and (4) demonstrate the calibration of a very simple NGWM example. The example is kept simple as we only want to present the developed method and give a short glimpse of its potential for groundwater flow model calibration. An in-depth evaluation is out of the scope of this article. Note that while the BO method is tailored to the r2ogs5 package, the principle of the BO calibration method is, of course, applicable to NGWM in general.

Description of the r2ogs5 Package
r2ogs5 is a file-based API integrating OGS version 5 into the programming language R. It provides R users with the capabilities of OGS to simulate (coupled) thermal, hydrological, mechanical, chemical, and bioreactive processes in porous and fractured media (Kolditz et al. 2012; Bilke et al. 2019), including groundwater flow (Sun et al. 2011; Jing et al. 2018). Moreover, r2ogs5 also features the OGS parallelization capabilities (domain decomposition and parallelized reaction simulation) to profit from high-performance computing. In OGS, the model specifications such as processes, model geometry, boundary and initial conditions, material properties, etc., are defined in corresponding input files (e.g., *.msh, *.bc). A simulation is executed from the command line with a call to the executable, and the simulation output is then written to specific output files (*.tec, *.pvd, or *.vtk). For a detailed explanation of OGS version 5, see Kolditz et al. (2012) or consider www.opengeosys.org. The following operations are available in r2ogs5:
• create input objects or modify them
• create input objects automatically from existing OGS input files
• write out input files
• start the simulation
• read simulation output into R
• set up and run ensemble simulations
• set up, start, and analyze calibration runs
An overview of the API structure is given in Figure 1. Interaction with OGS is made by maintaining an intuitive object-oriented approach (Figure 1) using the R native S3 class system (Wickham 2019). An OGS model is created as a specific r2ogs5 model object (R object of class ogs5 in Figure 1). This model object contains two child objects: input and output (see ogs5 class in Figure 1). Every element of the ogs5$input object represents the content of a specific OGS input file (input classes in Figure 1) and can easily be browsed using the list operator $ in R; similarly, ogs5$output contains the output of the NGWM.
Model specifications set by keyword blocs in the OGS input files can be read in with a dedicated function (input file handling in Figure 1), created as respective classes (input classes in Figure 1), and added to ogs5$input. For instance, the *.bc input file contains keyword blocs of the boundary conditions to be simulated. In r2ogs5, correspondingly, these are represented as class ogs5_bc (*.bc file) and class ogs5_bc_bloc (BOUNDARY_CONDITION keyword bloc). For every type of input parameter bloc, a special function is available to set up such a bloc and add it to the model object (create model object in Figure 1). Alternatively, the blocs can be read in automatically from a directory with existing OGS input files (input_add_blocs_from_file(); Figure 1). Functions to create simple FEM mesh geometries are also included. For starting a simulation, the r2ogs5 model object is exported back to OGS input files in a user-defined path (input file handling in Figure 1) and a system call to the executable referencing the path is created (execution; Figure 1). OGS models that were parallelized with the message passing interface can be executed by specifying the arguments use_mpi and number_of_cores to the execution function.
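A single-simulation workflow along these lines might look like the following sketch. Only input_add_blocs_from_file() and the use_mpi/number_of_cores arguments are named above; the constructor, writing, and execution function names (create_ogs5(), ogs5_write_inputfiles(), ogs5_run()) and all paths are assumptions for illustration; consult the package tutorials for the exact API.

```r
library(r2ogs5)

# Assumed constructor: returns an object of class ogs5 with empty
# $input and $output child objects.
model <- create_ogs5(sim_name = "gw_flow", sim_path = "path/to/run")

# Read all keyword blocs from existing OGS5 input files into model$input
# (function name as given in the text).
model <- input_add_blocs_from_file(model, path = "path/to/existing_inputs")

# Browse the input with the $ operator, e.g. the boundary conditions:
model$input$bc

# Write the input files out and start the simulation
# (assumed function names; serial run shown here).
ogs5_write_inputfiles(model)
ogs5_run(model, ogs_exe = "path/to/ogs",
         use_mpi = FALSE, number_of_cores = 1)
```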
The output of a simulation can be read in with dedicated functions (output file handling; Figure 1) and is represented as an R native object (e.g., data.frame, numeric, list).
Ensemble simulations can be initialized by creating an ensemble object (ens class in Figure 1) and adding several objects of class ogs5. Similar to single simulation runs, ensembles can be executed and the output can be retrieved with convenient wrapper functions (ens_write_inputfiles(), ens_run(); Figure 1). Specific outputs can be selectively read by indicating one or several bloc names used in the *.out file (ogs5_out class) that define the type and dimensions of the output generated by the NGWM or respective ensemble. After reading, the output is added to the $output section of the ens or ogs5 object. The R objects created by r2ogs5 can be manipulated at will with the tools in specialized third-party R libraries. For instance, with only a few lines of code the visualization library ggplot2 can plot ogs5$output objects. Additionally, r2ogs5 includes functions for model calibration (calibration with BO in Figure 1; also see section Functions for Calibration with BO). New methods can be easily integrated into the r2ogs5 workflow, as the API handles the hard work of file i/o and path configuration, allowing users to focus on the modeling itself. The strength of r2ogs5, therefore, comes into play when multiple simulations are intended, such as for sensitivity or parameter studies, scenario analyses, model calibration, or uncertainty analyses, or when simulations are to be included in data analysis pipelines. For further illustration, Figure 2 shows a pseudo R script to set up a simulation.
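An ensemble study, for example varying one material property, could be sketched as follows. ens_write_inputfiles() and ens_run() are named above; ens_create(), ens_add_ogs5(), and the list path to the conductivity value are assumptions for illustration, not the exact r2ogs5 structure.

```r
# Hypothetical sketch of a small parameter study over hydraulic conductivity.
ens <- ens_create(name = "k_study")
for (k in c(1e-5, 1e-4, 1e-3)) {
  m <- model                       # copy an existing ogs5 object
  # modify e.g. a material property bloc (illustrative list path):
  m$input$mmp$MEDIUM_PROPERTIES1$PERMEABILITY_TENSOR <- paste("ISOTROPIC", k)
  ens <- ens_add_ogs5(ens, m)
}
ens_write_inputfiles(ens)          # export input files for all members
ens_run(ens, ogs_exe = "path/to/ogs")
# afterwards, the results are added to each member's $output section
```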
r2ogs5 is open-source and available at https://doi .org/10.5281/zenodo.5572571. The development page is located at https://gitlab.opengeosys.org/ogs5/r2ogs5; the package includes an installation guide and several tutorials with further information on how to set up (ensemble) simulations.

Functions for Calibration with BO
The idea of our BO calibration method is to employ a data-driven surrogate that replaces the NGWM inside the calibration algorithm. The surrogate is sequentially trained on the NGWM input parameter vectors β_i and corresponding values of a user-defined objective function y(β_i) of simulations with the NGWM. Here, the objective function is defined as the root mean squared error of NGWM simulation output relative to calibration data (1. Evaluate objective function in Figure 3); however, any other objective function can be defined as well. The NGWM simulation output depends on the type of NGWM and the calibration data and can, for instance, consist of head values at specific wells or time series of head values. Specifically, the surrogate is used to map the input parameter vector of each NGWM simulation to the resulting calibration error (objective function). At the start of the algorithm, an initial sample of parameter vectors B_0 from the parameter search space D is evaluated with the NGWM, the respective objective function values are calculated, and the surrogate is trained. Then, a new parameter vector β_i (i indexing the current iteration) is selected for which the surrogate predicts the lowest objective function value and/or for which the prediction is most uncertain. The sampled β_i is evaluated with the NGWM, and the surrogate is retrained, now including β_i and the corresponding objective function value y(β_i) (2. Update surrogate model in Figure 3). Finally, the next parameter vector is queried and the procedure is repeated until the maximum number of iterations is reached (3. Choose next parameter in Figure 3). The implemented variant of BO is based on the GP upper confidence bound algorithm (Auer 2002). Here, the surrogate is based on a GP model that is fully determined by its overall mean μ and kernel or covariance function k.
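The sequential loop above can be sketched in self-contained R, using the GPfit package as in the text, on a toy one-dimensional objective that stands in for the NGWM calibration error. This is an illustrative sketch, not the r2ogs5 implementation: the toy function, the grid-based LCB minimization, and all constants are assumptions.

```r
library(GPfit)

f <- function(b) (b - 0.3)^2 + 0.05 * sin(20 * b)  # stand-in for the NGWM rmse
kappa <- 2                                          # exploration weight

# 1. evaluate the objective for an initial sample B_0 (regular here;
#    r2ogs5 uses Latin hypercube sampling)
B <- matrix(seq(0.05, 0.95, length.out = 5), ncol = 1)
y <- apply(B, 1, f)

for (i in 1:15) {
  # 2. update the GP surrogate with all evaluations so far
  gp <- GP_fit(B, y)

  # 3. choose the next parameter by minimizing the lower confidence bound
  cand <- matrix(seq(0, 1, length.out = 200), ncol = 1)
  pr <- predict(gp, cand)                 # $Y_hat = mean, $MSE = variance
  lcb <- pr$Y_hat - kappa * sqrt(pmax(pr$MSE, 0))
  b_new <- cand[which.min(lcb), , drop = FALSE]

  # evaluate the expensive model only at the selected point and repeat
  B <- rbind(B, b_new)
  y <- c(y, f(b_new))
}
B[which.min(y), ]   # parameter with the smallest objective value found
```

In the actual implementation, the LCB is minimized with a multi-start gradient search rather than on a candidate grid; the grid merely keeps the sketch short.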
The objective function value y(β_i) is then modeled as y(β_i) = μ + Z(β_i), where Z is a zero-mean GP with variance σ² and correlation function k parameterized by γ. We used the GP implementation of the R package GPfit as a surrogate model, where the GP parameters γ and σ² are estimated by maximizing a profile likelihood (MacDonald et al. 2015). With the surrogate, the objective function value ŷ(β_i) is predicted and the corresponding prediction uncertainty ŝ(β_i) is estimated. Based on these, confidence bounds for the surrogate predictions are computed, with the lower confidence bound given by LCB(β_i) = ŷ(β_i) − κ ŝ(β_i) (Equation 3). The lower confidence bound (LCB) is then used to choose the next NGWM parameter vector for evaluation.
The tuning parameter κ serves to set how much weight is placed on reducing the uncertainty ŝ. A higher κ favors exploration of the parameter search space D, while a lower κ favors focusing on promising regions of this space. In every iteration of the calibration algorithm, the LCB is computed and minimized to find the next parameter vector β_new. This criterion distinguishes BO from the two methods discussed by Razavi et al. (2012), where only ŷ(β_i) is minimized in the method named adaptive-recursive approach, and the distance to previous parameter vectors is used as a criterion in the other method. A variety of other criteria for querying new parameter vectors, such as expected improvement, are popular in BO, as discussed by Shahriari et al. (2016).
The BO loop described in Figure 3 starts with an initial sample of values B_0 from the user-defined parameter search space and ends when the maximum number of iterations is reached, returning the parameter vector with the smallest objective function value (i.e., RMSE) and the surrogate model obtained for the search history so far. In every iteration, the minimization in step 3 over the smooth LCB of the GP surrogate is achieved via a multi-start L-BFGS-B GS algorithm. It is important to remark that the optimization of the LCB requires no further NGWM runs, and the runtime for the surrogate mean ŷ(β_i) and uncertainty predictions ŝ(β_i) necessary for obtaining the LCB for any parameter vector β_i is very short.
As shown in Figure 4, a calibration run can be set up for an existing NGWM of class ogs5 by defining a parameter search space (Figure 4, lines 2-6) and sampling an initial set of parameter vectors (Figure 4, lines 7-9). The parameters have to be spatially uniform. A function y() that takes the ogs5_obj and the observation data x as arguments, computes the objective function value y(β_i), and returns a single value has to be defined (Figure 4, lines 11-13). These three elements are then handed to the function that starts the calibration (Figure 4, lines 15-17) and returns a calibration object of class BO. Note that these steps can be written into a script and easily submitted to a high-performance computing cluster. The NGWM runs for the initial parameter vectors can be run in parallel as a simple fork cluster according to the number of ensemble_cores specified in the function call; afterwards, the algorithm continues to evaluate the NGWM sequentially. In BO, a critical choice is the maximum number of iterations. We propose an implementation of the algorithm where calibration can be started with a small number of maximum iterations and later resumed from this or the previous state. The calibration object obtained from the first run (bo1; Figure 4) can simply be reinserted into the function to resume calibration. Further, a diagnosis plot method for the calibration object is proposed that summarizes the progress of the algorithm (Figure 5). It may serve as an indicator of whether the assumptions for GP models are satisfied. If the surrogate is not suited to represent the objective function values correctly, the confidence bounds remain wide and the prediction error does not converge toward zero.
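A calibration setup in the spirit of Figure 4 might read as follows. The concepts (search space, initial LHS sample, objective function, ensemble_cores, resuming from bo1) come from the text, but the function and argument names (cal_sample_parameters(), cal_bayesOpt(), the columns of par_df) are assumptions for illustration.

```r
# Hypothetical sketch of a BO calibration run with r2ogs5.
par_df <- data.frame(                 # parameter search space D
  parameter = paste0("k_layer", 1:4), # spatially uniform conductivities
  min = 1e-6, max = 1e-3
)
init <- cal_sample_parameters(par_df, n = 12)   # LHS initial sample B_0

# objective function: rmse of simulated vs. observed heads
y <- function(ogs5_obj, x) {
  sim <- ogs5_obj$output$heads        # illustrative location of the heads
  sqrt(mean((x - sim)^2))
}

bo1 <- cal_bayesOpt(ogs5_obj = model, par_df = par_df, init = init,
                    objective = y, obs = head_obs,
                    max_it = 30, kappa = 2, ensemble_cores = 4)
bo1 <- cal_bayesOpt(BO_init = bo1, max_it = 30)  # resume from previous state
plot(bo1)                                        # diagnosis plot (Figure 5)
```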

Example Description
The example NGWM consists of the steady-state model of the Ashi river catchment in northeastern China

[Figure caption fragment: (B) objective function value predicted by the surrogate (green line) with lower and upper confidence bounds (shaded), and objective function value simulated with the NGWM (black line); (C) surrogate error as the difference between predicted and simulated objective function; (D) ratio of surrogate error to surrogate uncertainty.]
head values at nodes constituting rivers, and (3) pumping rates as sink terms at individual wells. The initial condition is a hydraulic head of 200 m across the domain. The domain was discretized into 630 nodes and 9166 elements. Hydraulic conductivity was assumed isotropic and defined separately for each of the four geologic layers. This resulted in a model with a parameter vector of length d = 4 (NGWM I). We use d to indicate the length of the parameter vector to calibrate, which is synonymous with the dimension of the parameter search space. Search space bounds for the hydraulic conductivities are listed in Table S1. To further demonstrate the performance of our method in calibrating more complex models, we set up an additional NGWM (NGWM II). Compared to the described model, we split the original model domain into three horizontal regions, each containing four geologic layers. For every region, separate hydraulic conductivity parameters were assigned with respect to the geological layering. This resulted in parameter vectors of length d = 12 for NGWM II (Figure 6). Observations of hydraulic head at 14 wells were used as reference data for calibration. Observed and simulated head values are given in Figure S1. The objective function was defined as the root mean squared difference between observed (x_m) and simulated (NGWM_m) hydraulic head values at all 14 wells (Equation 4), with m indexing single wells and i indexing the parameter vector of a specific iteration. During the calibration, the hydraulic conductivities for the defined regions and layers were optimized within the defined bounds. For comparison to our BO implementation, we conducted the calibration with a conventional GS method and a surrogate modeling approach based solely on LHS. GS is a direct calibration of the simulation model by minimizing the objective function via a Newton procedure.
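The objective function of Equation 4 can be written as a short R function; run_ngwm() is a stand-in for executing the model and extracting the 14 simulated heads, not an r2ogs5 function.

```r
# rmse between observed heads x (length 14) and heads simulated with
# parameter vector beta_i at the calibration wells (Equation 4).
objective <- function(beta_i, x, run_ngwm) {
  sim <- run_ngwm(beta_i)   # numeric vector of 14 simulated head values
  sqrt(mean((x - sim)^2))
}
```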
Very similar gradient-based procedures are implemented for optimization in most calibration tools such as PEST. Here, we used the box-constrained Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS-B) after Byrd et al. (1995), available from the package stats in R (stats::optim() with default control parameters). Convergence is reached when the decrease in the objective function is within 1.0e+7 times the machine tolerance, which corresponds to approximately 1.0e−8.
Second, we used an LHS-based surrogate approach (Buchwald et al. 2020). Here, a sample of NGWM input parameter vectors is selected at random via the LHS method. The NGWM is then run for each parameter vector in the sample, and the corresponding objective function values are used in conjunction with the NGWM input parameter vectors to train the surrogate. As surrogate, we used a GP similar to BO. With the surrogate, we predicted the NGWM objective function and derived the corresponding optimal NGWM input parameter vectors with a subsequent multi-start GS optimization. Note that no additional NGWM runs were necessary for this subsequent GS optimization, as solely the surrogate was used. For both NGWM I and II, we applied the LHS method with sample sizes of 60, 136, and 250.
For BO, we used initial sample sizes corresponding to three times the length of the parameter vector of each NGWM, resulting in 12 (NGWM I) and 36 (NGWM II) initial parameter vectors sampled by the Latin hypercube method. For both NGWMs, we conducted 10 calibration trials with BO, GS, and LHS each. The model input files and R scripts used to obtain the results of this example can be accessed via the package repository.

Results
After calibration of NGWM I and II, simulated head values at the wells used for calibration range from approximately 130 to 140 m, with lower values for NGWM II (Figure S1). The separation of the model domain into three horizontal zones paid off, and the root mean squared error to the observation data was reduced from approximately 11 to 8 m (Figure S2). Figure 7 shows the evolution of the objective function minimum with increasing iterations for individual calibration trials. In general, calibrating NGWM II required more NGWM runs than calibrating NGWM I, most probably because the parameter search space dimension was higher. Up to a number of 10 NGWM simulations, the objective function minimum decreased more quickly in trials conducted with BO than with GS. This is because GS initially starts at one random position in the search space and then walks along the steepest gradient until it restarts at a new position. If the start position is unfavorable and the gradient is not steep, optimization is slow and the current minimum is not reduced quickly. BO, on the other hand, starts by evaluating the initial sample and profits from the fact that this sample was taken across the search space, which increases the chance of hitting a parameter vector that yields a low objective function value right at startup.
To compare the performance of the three methods, we defined the highest objective function optimum achieved with BO as the reference for NGWM I and NGWM II, respectively; the corresponding optima are 10.89 and 8.34 m. For the calibration of NGWM I with parameter vectors of d = 4, GS required a similar number of simulations as BO to achieve an approximate objective function minimum (value below the defined reference) (Figure 7). However, for calibrating NGWM II with parameter vectors of d = 12, GS tended to need more simulations than BO to reach a comparable approximate minimum (Figures 7 and S2, Table S2). On some occasions, GS dropped into the optimal parameter region earlier than BO but then seemed to be trapped in apparent local minima. It is probable that the objective function for NGWM II is more complex and contains more local minima than that for NGWM I. As GS randomly chooses the initial values of the NGWM parameters for gradient descent, the risk of being trapped in a local minimum is higher for a more complex objective function. The only counteraction is to increase the number of starts for GS, which further increases the computational burden. It is also visible that BO achieves more consistent results than GS (Figure 7).
The LHS method did not yield consistent results for NGWM II. An intuitive explanation is that the minima found by BO and GS are located on the limits of the parameter search space for these calibration problems (not shown in the figures). The probability of LHS sampling points from a specific region, such as the surface of the hypercube, is very small. In addition, there are no data from outside the limits to be used for interpolation. Therefore, the resulting GP model often fails to represent these regions properly. This problem gains importance with increasing dimension of the parameter search space, which can be seen from the small effect of increasing LHS sample sizes on the variability for NGWM II. BO, on the other hand, proved to be suitable for problems of this category. A more optimal solution may lie outside the defined parameter search space; however, extending this search space would result in parameter values beyond what we consider realistic to assume about the modeled site.
The results are different when looking at the computational costs for calibration in terms of run time instead of the number of NGWM simulations. To calibrate the groundwater model with a parameter vector of length 12, BO took roughly three times as long as GS. This is because in BO additional computational cost has to be spent to train the GP surrogate model. This cost is relatively high compared to the run time of the single 12-parameter NGWM (0.58 s on average). The cost incurred for GP surrogate training increases considerably, by O(n³) with the number of BO iterations n (Shahriari et al. 2016). That is, the run time of calibration with BO would increase drastically for NGWMs with an increasing number of parameters. This limitation could be tackled with dimensionality reduction techniques, faster surrogate models, and a batch-parallel variant of BO (Desautels et al. 2012). For instance, state-of-the-art optimization by BO of an algorithm configuration problem with 76 parameters was reported by Hutter et al. (2011) using random forest surrogate models, which can be easily parallelized. However, for NGWMs with hundreds of parameters, BO does not seem suitable.
Furthermore, the run time of the NGWMs can have a substantial influence on algorithm performance as well, for instance when the NGWM is transient as opposed to the steady-state model we employed here. One can estimate the total calibration time for a certain NGWM with BO (T_BO) by multiplying the number of NGWM simulations N_opt, required by the algorithm until reaching the objective function optimum, with the runtime of a single NGWM simulation t_NGWM and adding the time to train the GP surrogate models t_GP(N_opt), that is, T_BO = N_opt · t_NGWM + t_GP(N_opt) (Equation 5). N_opt corresponds to all runs of the NGWM that have been involved in computing the objective function, from the initial sample of parameter vectors B_0 and from the subsequently, sequentially obtained parameter vectors β_n.
In GS, the bottleneck is the NGWM runs required to obtain the values for calculating the gradients and matrix inversions; the time to actually compute the gradients and matrix inversions is negligible. Therefore, T_GS can be simplified to T_GS = N_opt · t_NGWM (Equation 6), where N_opt is the number of NGWM runs that were required to compute the gradients and matrix inversions in all iterations of the algorithm until the objective function optimum was reached.
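Equations 5 and 6 can be expressed directly in R for such an extrapolation. The N_opt values and the GP training time t_gp() below are illustrative stand-ins, not the measured quantities behind Figure S3.

```r
# Total calibration time as a function of the single-model run time.
T_bo <- function(n_opt, t_ngwm, t_gp) n_opt * t_ngwm + t_gp(n_opt)  # Eq. 5
T_gs <- function(n_opt, t_ngwm)       n_opt * t_ngwm                # Eq. 6

t_ngwm <- seq(0, 8 * 60, by = 10)   # hypothetical run times, 0 to 8 min in s
# illustrative extrapolation in the spirit of Figure S3 (assumed values):
plot(t_ngwm / 60, T_bo(40, t_ngwm, function(n) 30 * n) / 60, type = "l",
     xlab = "NGWM run time [min]", ylab = "total calibration time [min]")
lines(t_ngwm / 60, T_gs(90, t_ngwm) / 60, lty = 2)  # GS needs more NGWM runs
```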
With Equations 5 and 6 and the data from the conducted calibration trials, the total calibration time can be extrapolated for hypothetical run times of a specific NGWM. Of course, this extrapolation is a simplification and the exact results should be taken rather qualitatively. For NGWM II, we computed N_opt for each of the calibration trials conducted with BO and GS based on the reference optimum defined above. Then, we extrapolated the corresponding T_BO and T_GS for hypothetical run times of NGWM II of up to 8 min. On average, GS will be faster than BO for an NGWM run time below 1 min (Figure S3). This is because the ratio of the computational cost for the surrogate to the cost of an NGWM run is comparatively high. When the NGWM run time increases above 1 min, however, this ratio drops and the computational cost of the surrogate model becomes comparatively less important. In this case, BO will be faster, as it requires fewer runs of the NGWM. Considering this and the findings by Hutter et al. (2011), we expect that the BO method has the potential to improve the computational efficiency of the calibration of computationally expensive groundwater flow models (e.g., coupled transient models at regional scales or models involving reactive transport) if the number of NGWM parameters remains below 80. Investigating this requires further experimentation and is out of the scope of this article.

Conclusions
In this paper, we described the r2ogs5 API for the OGS version 5 modeling software and showcased our implementation of BO for calibrating numerical groundwater flow models. By integrating numerical groundwater flow modeling into R, a substantial step was made to simplify and unify modeling workflows and to extend R's geocomputation capabilities. With our BO method, we successfully calibrated simple numerical groundwater flow models with 4 and 12 parameters and obtained consistent approximate objective function minima with fewer NGWM runs than the GS and LHS methods. This highlights a promising potential of BO for calibrating groundwater flow models with up to 80 parameters; however, further research should address more in-depth comparisons to up-to-date methods. Here, r2ogs5 can serve as a basis for further development, and our implemented BO method can also be easily extended or transferred.

Acknowledgments
We acknowledge funding from the Initiative and Networking Fund of the Helmholtz Association through the project Digital Earth (funding code ZT-0025). Furthermore, we acknowledge the Helmholtz Centre for Environmental Research-UFZ for additional funding and support (www.ufz.de). We would like to express our gratitude to the OpenGeoSys Community for technical support. The results have been developed in part on the High-Performance Computing Cluster EVE, a joint effort of both the UFZ and the German Centre for Integrative Biodiversity Research. Furthermore, we appreciate the critical and constructive review comments and questions of Michael Fienen and two anonymous reviewers that greatly improved the quality of this article. Open access funding enabled and organized by Projekt DEAL.

Supporting Information
Table S1. Bounds of hydraulic conductivities for individual geological layers. In case of NGWM II, the same layer-specific bounds are defined for the conductivities in the three horizontal regions.
Table S2. List of all runs presented in the results (Figure 7) with the number of NGWM runs required to reach a reference optimum of the objective function. The reference optima correspond to the highest optima found in all calibration trials with BO for the respective NGWMs I and II. Following, the actual approximate optimum (≤ reference optimum) and the number of NGWM runs until the final optimum found in the respective calibration trial are given.
Figure S1. Measured and simulated hydraulic head at 15 well locations used for model calibration. NGWM I indicates the numerical groundwater flow model with d = 4 parameters; NGWM II indicates the numerical groundwater flow model with d = 12 parameters. Results correspond to calibration trial number 4 with BO for NGWM II and trial number 6 for NGWM I.
Figure S2. Required number of NGWM simulations until convergence for individual calibration trials.
Bayesian optimization (BO, n = 10), L-BFGS-B gradient search (GS, n = 10), and random Latin hypercube sampling (n = 28); n is the number of calibration trials conducted. The number of NGWM simulations for BO includes NGWM simulations of the initial sample and the sequential updates.
Figure S3. Expected total calibration time for the calibration of NGWM II (d = 12) for hypothetical run times. Total calibration times were extrapolated from the data collected for the respective calibration trials of NGWM II (d = 12) according to Equations 5 and 6 and plotted as means (lines) and standard deviations (shades).