A machine learning‐based approach to the multiobjective optimization of CO2 injection and water production during CCS in a saline aquifer based on field data

Carbon capture and storage (CCS) projects are increasingly important given the growing emission of greenhouse gases, especially carbon dioxide (CO2). Our objective functions were chosen because of the importance of CO2 storage in CCS projects and the need to produce less water in projects that require water production. Support vector regression (SVR), artificial neural network (ANN), and multivariate adaptive regression spline (MARS) models were used as proxies for reservoir simulations. Comparing these three data-driven models on the available field data showed that MARS had the highest accuracy, although it was very close to that of the ANN. For predicting CO2 storage, MARS gave root mean square error (RMSE), mean absolute error (MAE), and R2 values of 2.78%, 1.95%, and 0.998, respectively, on test data, and 3.73%, 3.53%, and 0.995 on blind data. For predicting water production, the RMSE, MAE, and R2 were 3.58%, 2.81%, and 0.997 on test data, and 3.94%, 2.83%, and 0.997 on blind data. By reducing the computational load and the time needed to reproduce the simulation data, MARS is beneficial for optimization. This study highlights the application of the ML approach to coupling with genetic algorithms for optimizing CO2 storage and water production. With the proposed framework, preprocessing, feature selection, two-stage validation of the data-driven model, and optimization are performed for the aquifer, and a repository containing a series of optimal solutions is developed for use in projects.


| INTRODUCTION
The burning of fossil fuels such as coal and petroleum emits greenhouse gases, especially CO2. To slow down, and eventually stop, rising global temperatures, atmospheric concentrations of CO2 and other greenhouse gases must be stabilized. 1 Carbon capture and storage (CCS) is a striking option for reducing the concentration of CO2 in the atmosphere and could make a significant contribution to reducing greenhouse gas emissions. 2 For climate change mitigation to be effective, CCS is essential. 3 A geological CCS process involves capturing CO2 from emission sources (such as power plants), transporting it to a suitable storage location, and injecting it into an underground geological formation for a prolonged period of time. For more than two decades, geological CO2 storage (GCS) has been studied as a possible way to reduce CO2 emissions. [4][5][6][7][8] Among geological storage options, deep saline aquifers (DSAs) are the most preferred because of their large potential capacity, their stability, and the fact that they are widely dispersed. Previous studies by the International Energy Agency Greenhouse Gas program (IEA GHG) have shown that DSAs have the highest storage capacity of the various geological reservoirs available for CO2 storage, so substantial decreases in CO2 emission levels may depend upon using DSAs as carbon storage reservoirs. Researchers have been studying the feasibility and applicability of injecting gas into geological formations, including DSAs. [9][10][11][12] Since the early 1990s, studies have been conducted on storing CO2 at conditions that keep the injected CO2 in a supercritical state in DSAs. 13 DSAs have also been the subject of several other studies investigating different phenomena and simulations. [14][15][16][17][18] However, numerical simulators are not well suited to optimization tasks because of their inherent shortcomings, 19 especially their time-consuming nature.
Using proxy models, it is possible to replicate the original models very quickly with acceptable accuracy. A proxy model (or surrogate model) replicates numerical simulation output for selected input parameters using a mathematical or statistical function. 20 Reduced-order models have been used to model thermal recovery processes 21 and transient CO2 and brine leakage, 22 response surface models have been used for purposes such as history matching, 23 and reduced-physics models have been used for purposes such as waterflooded reservoirs. 24 The weaknesses of reduced-order models lie in their reduced physics and coarse space-time resolution. In addition, statistical approaches (response surfaces) pose two main problems when applied as proxies for numerical reservoir simulation models: the issue of "correlation versus causation" and the assumption of a predetermined (linear, polynomial, and so on) functional form for the data being modeled. When the nature of a given complex problem cannot be captured by a predetermined functional form and its behavior changes frequently, these approaches fail. 25 Petroleum engineering and other engineering fields are increasingly using data-driven models because of these weaknesses. Data-driven models do not have the limitations of other types of proxy models and offer more flexibility and capability. 26 In the artificial intelligence (AI)-based approach, machine learning (ML) techniques enable us to build a model that learns system behavior from the data provided and predicts system outputs as new data enter the model. Data-driven models can utilize AI techniques such as artificial neural networks (ANN), fuzzy logic, and genetic algorithms, and they have been used in petroleum engineering for a variety of purposes.
[27][28][29] Several forecasting tools are popular among energy researchers, including ANN, support vector regression (SVR), and multivariate adaptive regression spline (MARS). On one hand, in addition to high accuracy, robustness to outliers, and the fact that no assumptions are made about the distribution of the environmental variables, these three methods have been shown to model highly nonlinear data, provide fast algorithms, and be powerful tools for estimating real-valued functions. [30][31][32] On the other hand, ANN and SVR techniques have been widely recognized and applied in engineering over the past years, including in forecasting (or regression analysis), decision-making processes (or classification), and real-life engineering problems. [33][34][35] In terms of carbon dioxide storage in aquifers (prediction of CO2 injection and saline water production) and coupling to genetic optimization, however, no comparison between the two methods exists. In contrast to the SVR and ANN models, the MARS model has not been widely tested. Despite the scarcity of literature on MARS models applied to CO2 injection, this model has proven accurate in several engineering estimation challenges. 36,37 Therefore, a comparative study between these three methods is conducted in this field, their performances are compared with each other, and finally, they are used in the optimization phase of the study.
A variety of studies have used ANN in petroleum engineering, including screening of enhanced oil recovery (EOR) methods, 38 drilling engineering, 39 assisted history matching, 29 estimation of dew point pressure, 40 and so on. ANN was used to predict the storage efficiency of CO2 in a saline aquifer 41 as well as to predict CO2 properties in carbon capture and sequestration operations. 42 In CO2-EOR and sequestration projects, the water-alternating-gas process was evaluated using ANN models, 43,44 and ANNs were employed as proxy models for EOR projects. 45,46 So, it is clear that the ANN model is a robust tool for predicting the feasibility of CO2 sequestration with high accuracy. Another AI and data mining technique used for prediction is the support vector machine (SVM). SVM was first introduced for classification problems, and SVR was then developed for regression problems. 47 In supervised learning, SVR is used to predict continuous values and follows the same principle as SVM. SVR has been studied as a proxy for reservoir-simulation models. [48][49][50] Considering the outputs of this study, SVR was applied to the data. MARS is another ML method; it is a flexible regression method that automatically searches for interactions and nonlinear relationships. The MARS method was shown to perform well as a surrogate model for optimizing hydraulic fracturing treatment, 51 and as an approximate hydrothermal model for optimizing geothermal reservoir operation. 52 In other studies, MARS has been applied extensively to approximate a water-oil-CO2 flow reservoir model for efficient optimization of CO2 geological storage and EOR, 53 to develop surrogate models that quantitatively evaluate the leakage risk of CO2 sequestration, 54,55 and to CO2 modeling for investigating optimal design, geological uncertainty, and risk assessment during combined CO2 underground storage and utilization.
[56][57][58] In this study, existing ML methods become more understandable through the presentation of continuous and regular steps. In addition, comparing the efficacy of ML methods for aquifer injection and production can be useful for evaluating their performance. Using the selected ML approach, aquifer data are reproduced and then coupled with an optimization algorithm. Furthermore, to allow much higher injection and to prevent fracturing of the formation due to the pressure increase caused by CO2 injection, saline water production is appropriate, 59 so that formation fracturing and its subsequent environmental impact can be prevented. However, producing saline water can be very expensive, so much so that in the production optimization of hydrocarbon fields, water production accounts for 50% of the costs. 60 Simultaneous brine production, which is the focus of this study, can depressurize the formation and compensate for the excessive pressure build-up caused by CO2 injection. Considering the vital role of aquifers in emission-reduction targets, their high capacity, the superiority of data-driven methods over conventional methods, and the need to avoid sharply increasing reservoir pressure while simultaneously producing saline water, we optimize the amount of saline water produced from the aquifer and the amount of CO2 stored simultaneously; this combination has not previously been studied, either separately or simultaneously. Considering the existing gap in this field, optimization is carried out with a multiobjective genetic algorithm (NSGA-II) coupled with the ML method selected from the three available methods; the objective is to maximize CO2 storage while minimizing saline water production from the aquifer. Numerous studies have performed optimization using genetic algorithms in various fields, which supports their suitability.
[61][62][63][64] Because each realization of the aquifer model is time-consuming (about 2 h), coupling it directly to the optimization algorithm would also be time-consuming. To address this problem, the selected ML method reproduces the information of the aquifer model with proper accuracy and much faster. Given the lack of multiobjective optimization studies of CO2 storage in real DSAs, and specifically the absence of multiobjective optimization of CO2 storage and saline water production in aquifers coupled to MARS, this study is doubly important. In fact, in this area, comparing three ML methods can help determine which surrogate reservoir model (SRM) is better to use. This optimization framework and its coupling with a data-driven model are applied and verified with field data from an aquifer; they can be reused for other tasks, allow the impact of various factors on storage in DSAs to be studied in a more reasonable time, and make it possible to optimize other aquifers for which this was practically impossible because of their time-consuming realizations.

| METHODOLOGY
In this study, we present a general framework for developing a data-driven model and optimization in the injection of CO 2 into the aquifer and the simultaneous production of water. To optimize both CO 2 storage and the amount of water produced in our aquifer, this framework optimizes two objective functions. As a result, we used information from an actual aquifer in the numerical simulator. Due to the time-consuming nature of each simulation run, especially in optimization-related tasks, SRMs greatly reduce computational load and, thus, enable the creation of a suitable Pareto front. The approach presented for AI-based methods in this study begins with the extraction of necessary data from the numerical model, followed by feature selection.
In this way, with an experimental design technique called Latin hypercube sampling and by defining two variables with operating ranges, we extract the data required from several realizations of the numerical simulator. By creating a spatio-temporal database, we perform the necessary preprocessing on the extracted data. Following this, several steps were taken to select the most effective features for training the data-driven model; these steps include analyzing linear and nonlinear relationships, as well as analyzing ANN weights. With this, the database is prepared for training. In this study, three data-driven models, SVR, ANN, and MARS, are compared to select the model with the least error on the available data for coupling with the genetic algorithm (GA). For ANN training, four steps are provided: categorization of data, network architecture, training (or learning) algorithm, and activation function. For validation, we use an additional step apart from the test data, called blind validation. Furthermore, the root mean squared error (RMSE), mean absolute error (MAE), and coefficient of determination (R2) were examined on the available data. Following the comparison and selection of the SRM from the ML methods, it is used for coupling with the optimization algorithm in the next stage. As the last step, the two-objective genetic algorithm is used to optimize the two objective functions of this study and obtain the optimal Pareto front, which is a set of optimal solutions needed for different projects. The Pareto front is subsequently validated in two steps. In the first step, 3000 experiments were randomly selected within the specified range and input to the SRM, and the results were compared with the Pareto front. As a next step, to further validate the Pareto front, we take the output of the numerical simulation model in a random scenario and compare it with the existing Pareto front to examine the domination of the Pareto front. The research workflow is shown in Figure 1.
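The step-one Pareto validation described above amounts to a dominance check: none of the randomly sampled scenarios passed through the SRM should dominate a point on the Pareto front. A minimal sketch of such a check (plain NumPy, with hypothetical toy points; both objectives are cast as minimization, so maximizing CO2 storage becomes minimizing its negative):

```python
import numpy as np

def pareto_front(points):
    """Return a boolean mask of non-dominated points.

    Objectives are assumed to be formulated for minimization, e.g.
    (-CO2_stored, water_produced).
    """
    points = np.asarray(points, dtype=float)
    mask = np.ones(len(points), dtype=bool)
    for i in range(len(points)):
        if not mask[i]:
            continue
        # j dominates i if j is <= in every objective and < in at least one
        dominates_i = (np.all(points <= points[i], axis=1)
                       & np.any(points < points[i], axis=1))
        if dominates_i.any():
            mask[i] = False
    return mask

# Toy check: (1, 1) dominates (2, 2); (0, 3) and (3, 0) are trade-offs
pts = [(1.0, 1.0), (2.0, 2.0), (0.0, 3.0), (3.0, 0.0)]
print(pareto_front(pts))  # → [ True False  True  True]
```

The same mask, applied to the union of the Pareto front and the 3000 random SRM evaluations, confirms the front's domination if no random point survives the filter while a front point is eliminated.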

| Model description
In this study, heterogeneous field data of one aquifer have been used. Five injection wells for CO2 gas and six wells for producing saline water have been completed in this aquifer. In Figure 2, the aquifer schematic and the locations of the wells are shown.

F I G U R E 1 An overview of the study framework. ANN, artificial neural network; MAE, mean absolute error; MARS, multivariate adaptive regression spline; RMSE, root mean square error; SVR, support vector regression.
The top of the aquifer lies at 17,000 to 24,700 feet, and the initial pressure and temperature are 49.34 MPa and 300°F, both higher than the critical pressure and temperature of CO2 (7.38 MPa and 87.9°F). Consequently, the injected fluid will be supercritical, which has a significant effect on the amount of CO2 stored, since supercritical CO2 has a much higher density. The boundary condition of the production wells is a constant bottom-hole pressure (BHP) of 2500 psi, and the simulation domain (29.2 km × 23.4 km × 0.253 km) is divided into a 60 × 47 × 31 grid, for a total of 87,420 cells. Table 1 provides additional information on this heterogeneous aquifer.
CO2 injection wells are completed in the bottom layers and saline water production wells in the upper layers. This arrangement reduces the free CO2 gas phase under the caprock because the gas is stored through dissolution and residual trapping mechanisms, so less free CO2 accumulates under the caprock and the risk of leakage is reduced. 65,66

| Generation of training database and feature selection
Because a numerical simulator is needed to model system performance, the ability of a smart proxy model to reproduce the results of a numerical reservoir simulation model is essential. The spatio-temporal database stockpiles information from the different realizations of the reservoir simulation model. To introduce the uncertainties involved in the reservoir model to an SRM, a small number of realizations were built and run using a commercial numerical reservoir simulator. For example, in a history matching study in which the adjustable parameters are porosity and permeability distributions, the realizations vary in their distributions. 67 Another example could be an uncertainty analysis of operational constraints or different operational conditions. 29,68,69 Although an SRM does not require a large number of simulation runs, there are no rules that determine the exact number of realizations required for a perfect SRM. In developing an SRM, many factors can alter the number of runs needed, including the complexity of the problem, particularly the level of reservoir heterogeneity. Generally, the number of runs is selected based on a rule of thumb, where the user's experience can be helpful. Nonetheless, it is obvious that an SRM will not perform well if the number of simulation runs is too small. 70 So, to develop a smart proxy model, we need aquifer simulation data. Therefore, to design the numerical reservoir simulation runs used to develop the smart proxy model, two variables are considered that introduce the uncertainties involved in the reservoir model to the SRM: CO2 injection rate and injection duration. Thus, an experimental technique called Latin hypercube sampling is used, which is a sampling tool 29 that covers the sample space without taking duplicate samples.

F I G U R E 2 Schematic of the aquifer and locations of the wells; CCS# production wells from 1 to 6, CCS# injection wells from 1 to 5. CCS, carbon capture and storage.

T A B L E 1 Summary of saline aquifer properties.

Parameter: Value
Average porosity (%): 3
Formation temperature (°F): 300
Average horizontal permeability (md): 54
Vertical to horizontal permeability ratio: 0.1
Water salinity (ppm): 51,000
Maximum gas injection pressure (psi): 8700
Maximum gas injection rate (MMSCFD): 8-20

The ranges of CO2 injection rate and injection duration were 8-20 MMSCFD and 10-40 years, respectively. At multiple reporting steps, the commercial numerical simulator collects data from each scenario that is run. A total of 11 simulation scenarios, as illustrated in Figure 3, were planned within these limits. Next, the numerical reservoir simulation runs were executed and the required data were extracted. Thirty-one features are included in these data. Based on these extracted data, the smart proxy model is built upon the spatio-temporal database. The spatio-temporal database generated in this step must include all of the available information needed to train the neural network that makes up the smart proxy model. The rows of our spatio-temporal database represent time intervals, and the columns represent extracted data and indexing data; dynamic data are recorded in steps of 1 month in each realization of the numerical simulation.
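The Latin hypercube design described above can be sketched with SciPy's quasi-Monte Carlo module. The variable ranges (8-20 MMSCFD, 10-40 years) and the 11-scenario count come from the text; the seed is an arbitrary assumption:

```python
import numpy as np
from scipy.stats import qmc

# Latin hypercube design over the two uncertain variables:
# CO2 injection rate (8-20 MMSCFD) and injection duration (10-40 years).
sampler = qmc.LatinHypercube(d=2, seed=42)
unit_sample = sampler.random(n=11)            # 11 scenarios in [0, 1)^2
design = qmc.scale(unit_sample, [8, 10], [20, 40])

# Each row is one simulation scenario (rate, duration); no two scenarios
# share the same stratum in either dimension.
print(design.shape)  # (11, 2)
```

Each row of `design` would then parameterize one run of the commercial simulator.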
For neural network training, the spatio-temporal database should be preprocessed for issues such as missing values, normalization, and inefficient data; in this article, one form of preprocessing performed is normalization of the data. Because the data of each feature differ in scale, normalization transfers them to a specified interval to standardize the effect of the data. Learning and convergence are generally accelerated when data are normalized. Thus, we normalize the data between 0 and 1 (Equation 1) and restore them to their original values after training the network (Equation 2):

x_norm = (x − x_min) / (x_max − x_min), (1)

x = x_norm (x_max − x_min) + x_min, (2)

where x_norm is the normalized data value, x is the data value, and x_min and x_max are the minimum and maximum data values, respectively. Additionally, we exclude data such as the indexing data used to format the database, which are not used to train the network. If there are missing values in ML data sets, they can be handled by deleting rows, replacing them with means, medians, or modes, assigning them a unique category, or predicting them. As the data in this study were obtained from a numerical simulator with defined scenarios, there are no problems associated with missing values. On the other hand, considering how many parameters are present in each model, it is important to choose the ones with the greatest influence on the output. To uncover the features affecting the investigated output parameters, namely CO2 storage and saline water production, it is necessary to study both the linear and nonlinear relationships of the input parameters with the outputs and with each other. Nevertheless, specialized knowledge is also important, because it helps us better understand the data and thereby gain insight into which features are likely to be important or should be omitted.
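The min-max normalization of Equations (1) and (2) can be sketched as follows (NumPy; the pressure values are illustrative):

```python
import numpy as np

def minmax_normalize(x):
    """Equation (1): scale a feature column to [0, 1]."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min), x_min, x_max

def minmax_restore(x_norm, x_min, x_max):
    """Equation (2): map normalized values back to the original scale."""
    return np.asarray(x_norm) * (x_max - x_min) + x_min

pressure = np.array([2500.0, 5000.0, 8700.0])   # hypothetical feature column
norm, lo, hi = minmax_normalize(pressure)        # values in [0, 1]
print(minmax_restore(norm, lo, hi))  # [2500. 5000. 8700.]
```

Keeping `x_min` and `x_max` from the training data is what allows predictions to be restored to physical units after the network is trained.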
Therefore, from a specialized standpoint, it is important to consider pressure, injection rate, and duration when storing carbon dioxide. In the aquifer, these parameters affect CO2 storage capacity, security, and breakthrough time, as there are limits on pressure build-up and injection rates. 71,72 Saline water production in aquifers also affects pressure and, therefore, CO2 storage. 59 For the implementation of models, it is useful to consider storage operation characteristics (such as injection rate and duration) and regulatory constraints (such as pressure) that affect storage efficiency. 73 Based on Equations (3) and (4), we used the Pearson coefficient as well as Spearman's rank coefficient to study linear and nonlinear relationships.
r = Σ (x_i − x̄)(y_i − ȳ) / √[Σ (x_i − x̄)² Σ (y_i − ȳ)²], (3) Pearson correlation coefficient

ρ = 1 − 6 Σ D² / [n(n² − 1)], (4) Spearman's rank correlation coefficient

In Equation (3), x and y are our samples, x̄ and ȳ are the averages of the samples, and n is the sample size; in Equation (4), D is the difference between the ranks of the corresponding members of the two groups and n is the volume of each group.

F I G U R E 3 Scenario design. Sampling for simulation runs (blue dots), and sampling for blind validation data (red dots).

There is a stronger correlation between the input and output parameters if the Pearson and Spearman coefficients are close to +1 or −1. On the other hand, if these coefficients for pairs of input parameters are close to zero, it can indicate that they are independent of each other; this independence of the input parameters leads to the selection of features that bring more information into the neural network. The weights of the neural network were also considered as another criterion for evaluating the effect of inputs on outputs; in fact, this is another method of evaluating the nonlinear relationship between parameters based on the concepts of ANN. The larger the weight assigned to an input parameter, the greater its impact on network training. Therefore, we trained the ANN with all available inputs and, by extracting the weights assigned to each input parameter and calculating the sum of their absolute values, identified the effective and ineffective parameters. We scale the network weights to the range of 0 to 1 to display them alongside the other methods. Figure 4 shows the details of the seven parameters selected based on the criteria mentioned above for feature selection. Hence, we benefit from all the criteria mentioned to select effective parameters. A total of seven parameters are selected as inputs, as shown in the figure mentioned above. Water production and carbon dioxide storage are considered the outputs.
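Both coefficients of Equations (3) and (4) are available in SciPy. A small sketch on synthetic data (a hypothetical monotonic but nonlinear feature-output pair) shows why the two are computed side by side: Spearman's rho captures any monotonic relationship, while Pearson's r measures only its linear part:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
injection_rate = rng.uniform(8, 20, size=100)              # hypothetical feature
co2_stored = injection_rate ** 2 + rng.normal(0, 5, 100)   # monotonic, nonlinear

r, _ = pearsonr(injection_rate, co2_stored)     # Equation (3)
rho, _ = spearmanr(injection_rate, co2_stored)  # Equation (4)
print(round(r, 3), round(rho, 3))               # both close to 1 here
```

In the feature-selection step, pairs of inputs with coefficients near zero are preferred, since independent inputs carry more information into the network.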
Additionally, Figure 5 shows density distribution diagrams of some of the data, from which the distributions of, and relationships between, the plotted parameters can be seen to some extent. The diagram reveals approximately linear relationships between some features, as well as the near-normal distribution of the data shown in each row.

| Design and training of ANN, SVR, and MARS
After the numerical simulation model and feature selection, the next step is to construct the smart proxy model. In this paper, the data-driven models are built to evaluate the CO2 storage and saline water production in the aquifer. At first, the ANN was chosen as the data-driven model in this study. Neural networks are sets of neurons that follow a unique algorithm. Modeled on and inspired by the human brain, they are designed and used to recognize patterns. ANN uses the processing of the brain as a basis to develop algorithms that can model complex patterns and prediction problems. Neurons in ANN models are connected by weighted links in a complex and nonlinear manner. With ANN models, all processes, such as data gathering and analysis, architecture design, number of hidden layers, weights, and biases, are computed through learning and training. A neural network consists of three kinds of layers: input, hidden, and output. The first layer is the input layer; it contains the input neurons that send information to the hidden layer. The hidden layer performs the computations on the input data and transfers the output to the output layer; it involves weights, activation functions, and cost functions. An artificial neuron receives a weighted input. These neurons (also called nodes) have activation functions; the activation function processes the input to give an output. Because the data used are labeled (input and output data), learning is supervised. Our approach to ANN development follows these four general steps:

F I G U R E 4 Feature selection; comparing methods for selected parameters.

(1) Categorization of data: In this study, for multilayer perceptron (MLP) training, the data in the database are divided into three parts: training data (75%), validation data (12.5%), and testing data (12.5%).
During training, validation data are also applied to the network; their significance is that they prevent over-fitting or memorization. In fact, we use validation data to ensure that the model does not become overly reliant on the training data. After training, the network is applied to the testing data set, which provides an unbiased evaluation of the final model fitted on the training data set.
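The 75/12.5/12.5 split of step (1) can be sketched with two calls to scikit-learn's train_test_split (the arrays here are stand-ins for the spatio-temporal database):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(400).reshape(200, 2)   # stand-in feature matrix
y = np.arange(200)                   # stand-in target

# First split off 25%, then halve it into validation and test (12.5% each).
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.25, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=1)

print(len(X_train), len(X_val), len(X_test))  # 150 25 25
```

The validation set monitors over-fitting during training; the test set is held out until the end for an unbiased evaluation.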
(2) Network architecture: This describes the pattern of connections between neurons, otherwise known as the network architecture. MLP networks are among the most common neural networks. By analyzing the minimum mean square error (MSE) over various random searches, the tuner optimizes the number of layers, the number of neurons in each layer, and the learning rate of the neural network. Keras Tuner is a library that helps pick the optimal set of hyperparameters for a TensorFlow program. This hyperparameter tuning method randomly tries combinations of hyperparameters from a given search space. In this study, the ANN consists of three parts: the input layer, the output layer, and three hidden layers. Figure 6 shows the architecture of the ANN studied.

F I G U R E 5 Density distribution diagrams of some features.
The input layer in this study includes the input parameters selected during the feature selection step for training the neural network, and the output layer includes our two objective functions. Following the evaluation of the number-of-neurons hyperparameter, the first, second, and third hidden layers have 256, 96, and 224 neurons, respectively. Moreover, as a result of the optimization, the learning rate was determined to be 0.05.
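The study uses Keras Tuner's random search. As an illustrative stand-in, the same idea, randomly sampling layer sizes and learning rates and keeping the combination with the lowest MSE, can be sketched with scikit-learn's MLPRegressor and RandomizedSearchCV (synthetic data and a small hypothetical search space; the study's actual winning architecture was 256/96/224 neurons with a 0.05 learning rate):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 7))                  # 7 selected input features
y = X.sum(axis=1) + rng.normal(0, 0.05, 200)    # synthetic target

# Hypothetical search space, kept tiny so the sketch runs quickly.
search_space = {
    "hidden_layer_sizes": [(32, 16), (64,), (16, 16, 16)],
    "learning_rate_init": [0.005, 0.01, 0.05],
}
tuner = RandomizedSearchCV(
    MLPRegressor(max_iter=300, random_state=0),
    search_space, n_iter=3, cv=3,
    scoring="neg_mean_squared_error", random_state=0,
)
tuner.fit(X, y)
print(tuner.best_params_)
```

Each sampled combination is scored by cross-validated MSE, mirroring how the tuner ranks architectures before the final network is retrained.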
(3) Training (or learning) algorithm: The backpropagation algorithm is widely used to train the MLP network; it calculates the error associated with each neuron in the hidden layers. It is an algorithm that uses gradient descent for supervised learning of neural networks. Low speed and high sensitivity to the learning rate are the problems of gradient descent. The Adam method used in this study is an optimized way of implementing gradient descent that finds the minimum of the cost function faster. 74 In addition to being fast and convergent, it also corrects for a high learning rate (α). In this study, the learning rate was set at 0.05. (4) Activation function: An ANN must learn the complex, nonlinear relationship between input and output data. Therefore, to assist the network in learning complex data patterns, activation functions are added to the ANN. The activation function converts the output of the previous neuron into a form suitable for use as an input to the next neuron. Activation functions also bound a neuron's output, since very large outputs complicate the calculations. Differentiability is another key feature of activation functions, important for the backpropagation algorithm. This study utilized the ReLU activation function, which does not suffer from the limitations of the hyperbolic tangent or sigmoid activation functions. As Equation (5) shows, with ReLU, inputs less than zero are set to zero, while inputs greater than zero are passed through unchanged:

f(x) = max(0, x). (5)
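Equation (5) and the (sub)derivative that backpropagation relies on can be sketched in a few lines:

```python
import numpy as np

def relu(x):
    """Equation (5): f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Subgradient used by backpropagation: 1 for x > 0, else 0."""
    return (np.asarray(x) > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(relu_grad(x))  # [0. 0. 1.]
```

Because the gradient is exactly 1 on the positive side, ReLU avoids the vanishing-gradient behavior of the sigmoid and hyperbolic tangent functions mentioned above.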
SVR is a regressor that predicts continuous variables using the same principle as SVMs. The basic idea behind SVR is to find the best-fit line; in SVR, the best-fit line is the hyperplane that contains the maximum number of points. An SVR algorithm approximates the best value within a given margin known as the ε-tube. Since SVR is a regression algorithm, rather than using the curve as a decision boundary, it uses the curve to find a match between the vectors and the position of the curve. Support vectors help determine the best fit between the data points and the function representing them. MARS is a nonlinear and nonparametric regression method that models the nonlinear responses between the inputs and outputs of a system using piecewise linear segments (splines) of differing gradients. Piecewise curves (called basis functions) give the model greater flexibility, allowing for bends, thresholds, and other deviations from linearity. MARS generates basis functions by searching in a stepwise manner, and MARS models are constructed in a two-phase procedure: in the forward phase, functions are added and potential knots are identified to improve performance, and in the backward phase, the least effective terms are pruned. The data used to train the ANN are also used for the SVR and MARS methods; the data in the database were divided into test data (20%) and training data (80%). In this study, the kernel function used for SVR is the radial basis function (rbf), a very well-known kernel function for SVR. Additionally, the data used have been normalized. We defined intervals for each of the three SVR hyperparameters, namely gamma, epsilon, and C, and used GridSearchCV in Python to select the most optimal values. The optimal values for gamma, epsilon, and C were 0.1, 0.1, and 100, respectively. The MARS algorithm was implemented with Python's py-earth package.
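The SVR tuning described above, an rbf kernel with gamma, epsilon, and C selected by GridSearchCV on normalized data, can be sketched as follows (synthetic data; the grid values are illustrative, not the study's exact search space):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
X = rng.uniform(size=(150, 2))                            # synthetic inputs
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.02, 150)

X_scaled = MinMaxScaler().fit_transform(X)                # normalize to [0, 1]

# Illustrative grid over the three SVR hyperparameters.
param_grid = {"gamma": [0.01, 0.1, 1], "epsilon": [0.01, 0.1], "C": [1, 10, 100]}
grid = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5,
                    scoring="neg_mean_absolute_error")
grid.fit(X_scaled, y)
print(grid.best_params_)
```

GridSearchCV exhaustively cross-validates every combination, which is why the study reports a single optimal triple (gamma = 0.1, epsilon = 0.1, C = 100).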
We applied all three ML methods, that is, ANN, SVR, and MARS, to the training data to train the models. To determine which proxy model is more accurate, we compare them and choose one for coupling with the optimization algorithm. As shown in Figure 7, the ANN is trained and the MSE of the data during training is plotted. It can be seen that the MSE decreased as the number of epochs increased, indicating appropriate network training. The MSE is calculated by Equation (6):

MSE = (1/n) Σ (y_i − y_pred,i)², (6)

where y_i is the actual value, y_pred,i is the predicted value, and n is the number of samples.
With the network trained, the model is applied to the test data as a further step toward network approval. Figures 8 and 9 (red dots) show that the predicted values of cumulative CO2 storage and water production are very close to their actual values. This is also evident from R2, which is calculated from Equation (7),
where ȳ_pred is the average of the predicted values. After training and approval of the ANN, the SVR and MARS training results were also reviewed. The blue and black dots in Figures 8 and 9 show the results of applying the test data to the trained SVR and MARS models, respectively, and the accuracy of the model training. To make sure the ANN model was accurate, and to check whether it was improving during training, we used validation data; at the end of the training process, testing data were used to assess whether the training was successful. For the SVR and MARS models, on the other hand, we used testing data sets larger than those used for the ANN to observe the accuracy of the trained models. To select the appropriate proxy model, a further step, blind validation, is performed in this study.

| Blind validation
As a final validation of the three data-driven models, according to Figure 3 (red dots), we run the numerical simulator and extract the data needed; these data play no role in training the models. They are given as input to the ANN, SVR, and MARS, and the corresponding outputs are calculated by each model. The red, blue, and black dots in Figures 10 and 11 show the predicted and actual values of ANN, SVR, and MARS on the blind data set, respectively, and R2 is calculated as well. In statistics, R2 measures how well data fit a statistical model, such as a regression line or curve; that is, it measures how well observed outcomes are reproduced by the model. Nevertheless, it is not always an appropriate criterion for selecting the best model, because it is highly dependent on the model used and cannot by itself be used to compare different models. MAE is a metric used to summarize and assess the quality of an ML model; statistically, it is the mean difference between two continuous variables. The MAE is calculated using Equation (8).
Furthermore, the RMSE can be calculated to measure the performance of our models. The RMSE denotes the square root of the MSE and is calculated using Equation (9). These distance measurements help quantify the accuracy of our data-driven models compared with the numerical simulator.
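The three metrics can be written out directly. Note that this sketch uses the common form of R2 based on the mean of the observed values; the paper's Equation (7) is written in terms of the average of the predicted values, ȳ_pred:

```python
import numpy as np

def mae(y_true, y_pred):
    # Equation (8): mean absolute error.
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    # Equation (9): square root of the MSE (Equation 6).
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    # Coefficient of determination (common form).
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Illustrative values only, not the field data.
y_true = np.array([10.0, 12.0, 9.0, 11.0])
y_pred = np.array([10.5, 11.5, 9.2, 11.1])
print(mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```

The percentage errors reported later follow from dividing MAE or RMSE by the mean of the objective function, for example `100 * rmse(y_true, y_pred) / np.mean(y_true)`.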
With the blind validation data, the error values are calculated for the trained ANN, SVR, and MARS models so that a more suitable model can be chosen for coupling with the optimization algorithm. Table 2 shows the R2, RMSE, and MAE values for the trained ANN, SVR, and MARS on the testing and blind validation data. Since we are dealing with cumulative CO2 storage and water production volumes, the raw RMSE and MAE values can be difficult to interpret. Consequently, we use their percentage values, calculated by dividing them by the average of the two objective functions, as shown in Figure 12, which is more understandable.
Because R2 alone is not adequate for selecting our models, the RMSE and MAE values on both the testing and blind validation data sets were evaluated. Overall, on the testing data set, the RMSE and MAE values for MARS are lower than those of both ANN and SVR, although very close to the ANN results; the same holds for the blind data set. For CO2 storage, the MARS RMSE, MAE, and R2 values were 2.78%, 1.95%, and 0.998, respectively, on the test data, and 3.73%, 3.53%, and 0.995 on the blind data. For water production, the RMSE, MAE, and R2 were 3.58%, 2.81%, and 0.997 on the test data and 3.94%, 2.83%, and 0.997 on the blind data, respectively. Hence, according to these results, MARS was chosen as the proxy model for coupling with the optimization algorithm.

| Multioptimization
In this study, metaheuristic, stochastic optimization protocols such as genetic algorithms were used to optimize the objective functions. Metaheuristic algorithms are inspired by nature, physics, and human behavior and are used to solve many optimization problems. These algorithms are either population-based or single-solution-based. Our study employs a well-known population-based metaheuristic, namely the genetic algorithm. This type of optimization protocol does not have the limitations of conventional gradient-based methods, such as requiring continuity and differentiability of the objective function. Genetic algorithms are not limited by the amount of data available, which makes them well suited to problems that are too complex for traditional methods; they can also find solutions to problems that are not well understood, making them a powerful research tool. In addition, genetic algorithms (GAs) can find better solutions than traditional algorithms because they are less likely to get stuck in local minima, even when compared with algorithms such as particle swarm optimization and simulated annealing. 75,76 In this study, the optimization involves more than one objective function, which is known as a multiobjective optimization problem. The CO2 storage volume is a critical factor in optimization for CCS projects, and the total amount of water produced to reduce aquifer pressure is another important aspect. Accordingly, the objective functions for the GA are CO2 storage and water production. The GA does not search via gradients; it searches via random sampling with stochastic operators rather than deterministic rules. In summary, the genetic algorithm consists of the following steps: (1) Randomly generate a list of candidate solutions to the problem.
(2) Evaluate each candidate solution against the problem using a fitness function. (3) Keep the best solutions; these are used to generate new candidate solutions. (4) Repeat the previous two steps until either an acceptable solution is found or the algorithm has run for a given number of cycles/generations.
The basic operators of the GA are selection (reproduction), crossover, and mutation. First, selection is applied to the population, choosing chromosomes from the parent population to cross over and produce offspring; it is based on Darwin's theory of "survival of the fittest." Once the population has been enriched with better individuals after the reproduction phase, the crossover operator creates better strings from the mating pool: recombining good individuals is likely to create even better ones. After crossover, the strings are subjected to the mutation operator, which causes a sudden change in a gene within a chromosome. This allows the algorithm to reach solutions far from the current ones and ensures the search is not trapped at a local optimum, preventing premature convergence and preserving diversity within the population.
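The steps and operators above can be sketched as a minimal real-coded GA in NumPy. The fitness function here is a hypothetical stand-in for a proxy-model evaluation, and the tournament selection and blend crossover are illustrative choices, not necessarily the exact operators used in the study:

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(pop):
    # Hypothetical stand-in for a proxy-model evaluation (to be maximized);
    # the optimum sits at x = 0.5 in every dimension.
    return -np.sum((pop - 0.5) ** 2, axis=1)

def genetic_algorithm(n_var=7, pop_size=100, n_gen=50, p_mut=0.1):
    # (1) Randomly generate an initial population of solutions in [0, 1).
    pop = rng.uniform(size=(pop_size, n_var))
    for _ in range(n_gen):
        # (2) Evaluate every solution with the fitness function.
        fit = fitness(pop)
        # (3) Selection: binary tournament keeps the fitter of two random picks.
        i, j = rng.integers(pop_size, size=(2, pop_size))
        parents = np.where((fit[i] > fit[j])[:, None], pop[i], pop[j])
        # Crossover: blend pairs of parents to create offspring.
        alpha = rng.uniform(size=(pop_size, 1))
        offspring = alpha * parents + (1 - alpha) * parents[::-1]
        # Mutation: random resets of a few genes preserve diversity.
        mask = rng.uniform(size=offspring.shape) < p_mut
        offspring[mask] = rng.uniform(size=mask.sum())
        pop = offspring
        # (4) Repeat until the generation budget is exhausted.
    return pop[np.argmax(fitness(pop))]

best = genetic_algorithm()
```

In the multiobjective setting used in the study, the scalar fitness comparison is replaced by the dominance relation described next.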
In a single-objective optimization problem, a solution's superiority over others is easily determined by comparing objective function values. In the multiobjective case, one cannot directly compare values of one objective function with another; instead, the goodness of a solution is determined by dominance. If x1 is strictly better than x2 in at least one objective and no worse than x2 in all objectives, then x1 dominates x2. Thus, any set of solutions can be divided into dominated and nondominated subsets; Figure 13 illustrates this. The nondominated set of the entire feasible decision space is called the Pareto-optimal or Pareto-efficient set. There is no single "best solution" in this set, so we can choose whichever solution meets our needs. The Pareto-optimal solutions can be joined by a line or surface known as the Pareto-optimal front.
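The dominance relation can be made concrete with a short check. Minimization of both objectives is assumed here; an objective to be maximized, such as CO2 storage, would simply be negated first:

```python
import numpy as np

def dominates(f1, f2):
    # f1 dominates f2 if it is no worse in all objectives
    # and strictly better in at least one (minimization assumed).
    return np.all(f1 <= f2) and np.any(f1 < f2)

def nondominated(F):
    # Return the indices of the nondominated (Pareto-efficient) points.
    keep = []
    for i, fi in enumerate(F):
        if not any(dominates(fj, fi) for j, fj in enumerate(F) if j != i):
            keep.append(i)
    return keep

# Two objectives, e.g. (-CO2 storage, water production), both minimized.
F = np.array([[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [2.5, 3.5], [4.0, 1.0]])
print(nondominated(F))  # → [0, 1, 2, 4]; point 3 is dominated by point 1
```

Filtering a population this way, generation after generation, is what pushes the nondominated set toward the Pareto-optimal front.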
The goal of multiobjective optimization is to find a set of solutions as close as possible to the Pareto front. In this study, there are seven decision variables and two objective functions: cumulative CO2 storage and water production. Specifically, the maximum CO2 storage and the minimum water production are desirable. Table 3 presents the decision variables used in this study along with their operational ranges.
When MARS is coupled to the GA, the inputs are normalized before entering the proxy, and the proxy then makes the required predictions. The GA has a population size of 100, meaning there will be 100 nondominated solutions in the developed Pareto front. The stopping criterion for the algorithm is 1000 generations, after which the algorithm halts.
As a result of running the algorithm, the Pareto front is calculated and shown in Figure 14 (red dots). On the Pareto front, the range of the first objective function, CO2 storage, is between 0.64 × 10^6 and 5.9 × 10^6 MMSCF, and the range of the second objective function, cumulative water production, is 8.5 to 2.5 × 10^3 MMbbl. To validate the Pareto front, in the first step 3000 experiments were randomly selected within the specified ranges and input to the SRM, that is, MARS, and the results were compared with the Pareto front. Figure 14 shows that the red dots (the Pareto front) dominate all the blue dots, proving its validity.
As a next step, to further validate the Pareto front, we take the output of the numerical simulation model in a random scenario and compare it with the existing Pareto front. As shown in Figure 15, in this step the Pareto front (red dots) dominates the other points calculated by the numerical simulation. As coupling the numerical simulation to the optimization algorithm is time-consuming, we use MARS to reproduce the numerical simulation results: each realization of the numerical model takes about 2 h, while the ML predictions take a few seconds. The proxy model therefore significantly reduces the computational overhead of a high-resolution Pareto front. Thus, a Pareto front-based multiobjective optimization protocol has been developed to provide project decision-makers with a repository containing a series of optimal solutions.

FIGURE 13 The Pareto front, the dominated, and the dominating solutions.

TABLE 3 Decision variables in GA along with their operational ranges.

Parameter | Unit | Min | Max
Field gas injection rate | MMSCFD | 40 | 100
Field pressure | psi | 6500 | 8000
Field water production rate | bbl/day | 10,000 | 20,000
Bottom hole pressure; injection well 2 | psi | 6500 | 8500
Well gas injection rate; well 1 | MMSCFD | 7 | 15
Well water production rate; well 1 | bbl/day | 3500 | 5000
Injection duration | Years | 15 | 30

| CONCLUSION
As a result of the importance of CO2 storage in CCS projects, and of producing less water to reduce costs in projects requiring water production, these functions were chosen as our target functions. In this study, a framework was presented for preparing and processing the data, which simplifies the management and training of data-driven models. An SRM-based optimization approach was then presented, and the following conclusions are drawn: (1) The features most effective with respect to the objective functions were determined from the analysis of the relationships in the data, on the basis of which the SRM makes appropriate predictions. Based on this analysis, the field gas injection rate, field pressure, field water production rate, bottom hole pressure of injection well 2, gas injection rate of well 1, water production rate of well 1, and injection duration were selected. (2) A four-stage process was followed for ANN design and training, followed by validation through testing and blind validation data according to the provided framework; the steps of data categorization, network architecture, training (or learning) algorithm, and activation function were defined and applied. On the testing data set, the RMSE and MAE values for MARS are lower than those of both ANN and SVR, or at least very close to the ANN results; the same holds for the blind data set. (3) In MARS, the RMSE, MAE, and R2 measurements for CO2 storage were 2.78%, 1.95%, and 0.998, respectively, on the test data, while on the blind data they were 3.73%, 3.53%, and 0.995. For water production, the RMSE, MAE, and R2 were 3.58%, 2.81%, and 0.997 on the test data and 3.94%, 2.83%, and 0.997 on the blind data, respectively.
(4) By coupling MARS with the two-objective GA, the problem of time-consuming realizations of the aquifer numerical simulation is solved, and it is possible to draw a high-resolution Pareto front with a much lower computational load. In this framework, validation is performed in two stages, namely, a random realization of the numerical model and random selection within the operational ranges of the decision variables, comparing the results with the Pareto front to confirm that it dominates all other points.