Physics‐Based Human‐in‐the‐Loop Machine Learning Combined with Genetic Algorithm Search for Multicriteria Optimization: Electrochemical CO2 Reduction Reaction

Machine learning (ML) can be a powerful tool to expedite materials research, but the deployment for experimental research is often hindered by data scarcity and model uncertainty. An human‐in‐the‐loop procedure to tailor the implementation of ML for multicriteria optimization is described. The effectiveness of this procedure in the development of a nafion‐based membrane electrode assembly for electrochemical CO2 reduction reaction (CO2RR) into CO for two targets is demonstrated: energy efficiency (EE) and partial current density for CO2RR ( jCO2RR ). Model‐agnostic nonlinear correlation analyses identify the 11 features relevant to those targets. The three studied decision tree‐based ML models yield similar cross‐validation errors so an ad hoc feature analysis of the models is done with SHapley Additive exPlanations and nonlinear correlation techniques. The predicted EE‐ jCO2RR space and the functional dependency of the predictions are investigated to assess model plausibility. A genetic algorithm with CO production cost as the final target with subsequent validation experiments of candidate conditions is devised. The model chosen through ad hoc analysis yields the highest accuracy and the only one that can locate the Pareto front with a single round of experiments, demonstrating how appropriate model selection through careful inspection can greatly accelerate the research cycle.


Introduction
Machine learning (ML) [1,2] has gained enormous momentum in materials science as a way of potentially expediting compound property prediction, [3,4] materials discovery, [5,6] analytical characterization, [7] and the optimization of materials, [8] reactions, [9] devices, [10,11] and protocols. [12] The success of the algorithms is often driven by a large set of input data for predicting target variables. Accordingly, recent material informatics projects [13,14] have increased data availability using simulated data while combinatorial experiments have sped up data acquisition. [15,16] Despite the recent success of data-driven ML, their practical implementation for experimental research is still challenging due to the following three reasons. 1) The preliminary exploration in a new research area inherently confronts data limitations, constraints, and uncertainty, where calculations or high-throughput measurements are not always feasible. 2) Experimentalists often deal with heterogeneous and interdependent parameters for a multicriteria feasibility study in which two or more performance matrices competingly affect a final objective (Pareto optimality). [17,18] 3) There is often an undiscussed gap between research targets and a generalized end goal.
In order to overcome these challenges and gain practical suggestions from ML, we describe an interactive procedure ( Figure 1a) involving active ad hoc interpretation of ML models. Physical and heuristics understanding toward multiple criteria are exploited to seek the best representative surrogate model for Pareto optimality which is then integrated into a goal-oriented genetic algorithm (GA). Here, we apply this methodology to the optimization of a membrane electrode assembly (MEA) toward feasible electrochemical CO 2 reduction reaction (CO 2 RR) to CO as a decisive case study (Figure 1b).
CO 2 RR has attracted immense interest [19] because of its unique features such as electricity-driven ambient reaction conditions, diversified potential products, [20] and compatibility with renewable energy. [21] Enormous efforts have resulted in various efficient catalysts [22][23][24][25][26] and gas-diffusion cell designs, [27][28][29][30] DOI: 10.1002/aisy.202200290 Machine learning (ML) can be a powerful tool to expedite materials research, but the deployment for experimental research is often hindered by data scarcity and model uncertainty. An human-in-the-loop procedure to tailor the implementation of ML for multicriteria optimization is described. The effectiveness of this procedure in the development of a nafion-based membrane electrode assembly for electrochemical CO 2 reduction reaction (CO 2 RR) into CO for two targets is demonstrated: energy efficiency (EE) and partial current density for CO 2 RR (j CO 2 RR ). Model-agnostic nonlinear correlation analyses identify the 11 features relevant to those targets. The three studied decision tree-based ML models yield similar cross-validation errors so an ad hoc feature analysis of the models is done with SHapley Additive exPlanations and nonlinear correlation techniques. The predicted EE-j CO 2 RR space and the functional dependency of the predictions are investigated to assess model plausibility. A genetic algorithm with CO production cost as the final target with subsequent validation experiments of candidate conditions is devised. The model chosen through ad hoc analysis yields the highest accuracy and the only one that can locate the Pareto front with a single round of experiments, demonstrating how appropriate model selection through careful inspection can greatly accelerate the research cycle.
achieving high Faradaic efficiencies (>90%) with concomitant high current densities (>80 mA cm À2 ). However, the complex and time-consuming electrochemical experiments within a large parameter space hinder an accelerated feasibility cycle of materials discovery and device enhancement (I). Furthermore, CO 2 RR has several essential performance metrics such as energy efficiency (EE) and partial current density for CO 2 RR (j CO 2 RR ), which generally conflict with each other due to a trade-off electrochemical principle and elude rigorous physics-based modeling to solve the Pareto problem (II, Figure 1c). Finally, device implementation is generally tied to financial incentives compared to competing technologies, and technoeconomic studies [31] should be included in any optimization procedures as a final target (III).
Within material science, decision tree-based ML models are often employed due to their robust predictions even with limited data, and here we examine several, including random forest (RF) regression [32,33] and gradient boosted regression (GBR), [34,35] which are particularly popular. First nonlinear correlation studies identified 11 experimental features relevant to either EE or j CO 2 RR . Ad hoc inspections with SHapley Additive exPlanations (SHAP) [36] and nonlinear correlation analyses were carried out to investigate the influence of each feature on the ML predictions as a means to probe the validity of each model. We further inspect the data representation of the ML models by exploring the predicted EE-j CO 2 RR space and the functional dependency of electrochemical targets, identifying GBR to be the most representative of the experimental data and in accord with our heuristics. The chosen ML model was integrated into a GA with a final target of CO production cost to locate the Pareto front.
The suggested candidates were experimentally validated, resulting in a 40% margin improvement with an abbreviated process.

Initial Experiment Dataset and Search Space Definition
Our MEA is mainly composed of GDLs, catalytic layers (CLs), and a cation-exchange membrane (CEM). CO 2 RR (Equation (1)) is prompted on the cathode CL while oxygen evolution reaction (Equation (2)) occurs on the anode side.
The Faradaic efficiency (FE) shows the selectivity of electrochemical reactions toward a target product CO and is calculated as where z is the number of electrons exchanged in the reaction, n the number of moles of product formed obtained from the gas chromatograph data by product calibration curves, F the Faradaic constant, and Q the total passed charge obtained from integrating the current from the potentiostat. The EE for the electrochemical reaction is calculated as www.advancedsciencenews.com www.advintellsyst.com where E eq the thermodynamic equilibrium potential and E cell the applied cell potential. j CO 2 RR is the current corresponding to the CO 2 RR and described as where j total is the total geometric current density. Experimental input data were obtained with a nafion-based MEA for CO 2 RR discussed in our previous study, [37] where materials synthesis, device fabrication, and catalytic performance evaluation procedures have been described in detail. While the electrochemical performance matrices of the MEA are dependent on many experimental features, the data were practically obtained with changing cathode potential V typically in steps of 0.05 V or 0.1 V as a conventional electrochemical procedure, which yielded initial 210 experimental data points from a total of 58 MEA samples. Note that the number of data points for the MEA sample varied from 2-5 due to the various experimental factors (e.g., experiment planning, time constraints, sample degradation, human errors, machine troubles, etc.) during the experiments. Given that both the FE and j CO 2 RR change continuously with the applied potential based on the electrochemical principle, we fit the raw data as a function of V for the samples having more than three data points to generate interpolated values in the interval of À0.59 to À0.39 V in increments of 0.02 V in order to mitigate the data scarcity ( Figure 2a) and disproportionality within the MEA samples. If the cathode potential of the interpolated data coincided with measured value, only the experimental value was retained. This led to additional 330 data points. We then annotated associated compositional, synthesis, process, and operational parameters, aiming to exclude possible hidden variables, [38] giving rise to 38 experimental features. After removing constant (Pearson correlation coefficient r = 0) or fully dependent parameters (r = AE1) to avoid overfitting, we obtained the final dataset containing 24 features to initiate the ML-assisted optimization.
The optimization problem can be defined in terms of γ the cost of CO production, which is dependent on EE and j CO 2 RR and is therefore an implicit function of the experimental features f i The primarily goal is to obtain a feature set with reproducible γ values lower than the breakeven cost of $0.80 kg À1 . Since γ is negatively correlated to both EE and j CO 2 RR , we aim to constrain the search space for the ML application. γ was calculated according to previous research, [31,39] assuming a price of electricity of $0.06 kWh À1 . [38] The calculated cost value in EE-j CO 2 RR (Figure 2b) was then used to trim the data with a maximum threshold value for γ of $1.20 kWh À1 .

Model-Agnostic Nonlinear Feature Analysis
To select effective feature sets for the MEA optimization, we next analyzed the correlation of each feature toward EE and j CO 2 RR with both linear and nonlinear correlation models: Pearson correlation coefficient r, distance correlation D, [40] and maximal information coefficient (MIC) (Table S1, Supporting Information). [41,42] For each method we defined a feature as being significant if the correlation coefficient was >0.2 for either of EE or j CO 2 RR ( Table 1). There was general agreement among the three methods regarding the relevance of most features, with ten being identified as relevant by all techniques. However, we selected the 11 features deemed significant for the optimization based on the nonlinear correlation analyses (bold font, Table 1), expecting that the features were likely to display nonlinear dependencies, particularly near an extremum. . a) Raw data as a function of energy efficiency and partial current density for CO 2 RR to CO. The darker symbols indicate the trimmed data. The contours provide the cost of CO 2 RR to CO in $/kg. b) Root-mean-square errors of models toward energy efficiency and current density for CO 2 RR to CO estimated by 10-fold CV and nested CV with RF regression, extra tree regression (ExTr), and GBR. c) Binary-objective SHAP value analysis with RF, ExTr, and GBR for 11 features. Cathode parameters: electrode thickness τ c , metal weight ratio MWR c , carbon surface area CS c , time for thermal processing t c , loading L c , boiling temperature of ink solvent TB c , and loading temperature TL c ; anode parameters: averaged metal atomic number MA, number of metals MN), and metal weight ratio MWR a ; and operational parameter: cathode potential V. Importantly, these selected features were quite consistent with our domain knowledge about the cell design. [39] The significance of features such as the mass ratio of Co (MWR c ) and the surface area of ketjenblack (CS c ) for the cathode is obvious for improving available catalytic sites. Also, the reaction parameter V has high correlation coefficients for both EE and j CO 2 RR as expected. However, this analysis sheds light on the relevance of process parameters such as catalyst loading (L c ), loading solvent (TP c ), and loading temperature (TL c ), [43] which have been conventionally less discussed although they play a role in surface morphology of the CL and therefore in the accessibility of the catalytic sites. Additionally, the average atomic number of the metals (MA) and the number of metals (MN) in the anode show high correlation coefficients, reflecting the experimental efforts to reduce the overpotential on the anode side due to the significant but often ignored impact of the anode reaction on EE. [44] The minimal effect of the reaction parameter of temperature may be due to the limited range of testing while the relative insignificance of CO 2 flow rate suggests that the turnover frequency of the reaction is a rate-limiting step.

Machine Learning Model Investigation with Ad Hoc Feature Analysis
We investigated three tree-based regression models: RF regression, [32,33] extra tree (ExTr) regression, [45] and GBR. [34,35] These models were first hyperparameter tuned with a grid search method based on tenfold crossvalidation (CV) within a typical parameter range. Figure 2b shows the tenfold CV root-mean-square errors of EE and j CO 2 RR , σ EE , and σ j , respectively, which were effectively constant across all examined models showing that RF, ExTr, and GBR seem to perform similarly with respect to the averaged model accuracy. We additionally conducted the nested CV [46] since this method gives a generalization assessment of the models for extrapolation, expecting any difference in the model accuracy among tested. While nested CV shows higher rootmean-square errors than CV as expected, the trend of having constant values within the models is unchanged (Figure 2b), which stresses the need for other methods for selecting the model in this search space. Aiming to see whether the models appropriately capture reasonable trends of feature influences, we carried out visualizations of ML model predictions with SHAP ( Figure S1, Supporting Information). [36] SHAP is an ML visualization approach aimed at making the ML model more explainable by computing the contribution of each feature to the prediction based on game theory. SHAP value distributions for EE and j CO 2 RR indicate the contribution of each feature toward the ML prediction (Figure 2c). The observed SHAP values with these decision tree-based ML models roughly correlated with the nonlinear correlation coefficients ( Figure S2, Supporting Information), which indicates that all three ML models reasonably reflect the correlation between features and the targets.
One significant limitation is that SHAP is an inherently linear analysis. While 10 of the 11 chosen features were found to be significant according to their Pearson coefficients, as a check we performed nonlinear correlation analysis of the model predictions and compared with that done on the training set ( Figure S3, Supporting Information). As anticipated, the correlation coefficients for the features for the model predictions were higher than those of the training data due to the reduced number of features (11 compared to 24); however, the suspiciously large correlation coefficients for V for RF and ExTr models seem to suggest an overly disproportional feature contribution and a narrower range of validity for them as compared to that of GBR.

Physics-Based Ad Hoc Analysis of ML Models
While SHAP analysis is generally effective in helping to interpret ML models, in this instance, it provided minimal insight into meaningful differences among them. Thus, we aimed to more closely investigate how well each model reflected the experimental results. Specifically, we generated 100 000 random input data points [47] for the models within the feature space which was constrained to reasonable experimental values (Table 1). Most features were given continuous values, but some, such as the thicknesses of the carbon paper and nafion, were given discrete values according to actual availability. Inspection of the EE and j CO 2 RR predicted spaces demonstrates the essential differences among the models (Figure 3a-c). Specifically, RF and ExTr yielded prediction distributions that mostly map to the centroid of the experimental results, which strongly suggests that those models cannot predict the actual values at the Pareto front since these centroid-oriented models are expected to have more significant errors on the boundary. On the other hand, the space predicted by GBR mostly covers all the input experimental data, which suggests it performs relatively well on the edge of the search space.
We further investigated the decision tree-based surrogate models by exploiting the functional dependency of electrochemical performance. Specifically, we generated 100 sets of 1000 data points with varying values of V while holding constant all other

Input data
Model prediction Figure 3. Binary-objective distribution in the space of energy efficiency and current density for CO 2 RR to CO with a) RF regression, b) ExTr, and c) GBR. GBR covers nearly the entire trimmed dataset, in contrast to the other two models. Functional dependence analysis with partial current density for CO 2 RR to CO as a function of cathode potential for d) RF, e) ExTr, and (f ) GBR. The convergence of the current density at À0.4 V independent of other feature values for RF and ExTr seems physically unlikely as compared to the distribution shown for GBR.
www.advancedsciencenews.com www.advintellsyst.com feature values for each given set. The ML-predicted EE ( Figure S4, Supporting Information) and j CO 2 RR (Figure 3d-f ) were depicted as a function of V, showing obvious differences among the examined models. For instance, the j CO 2 RR values predicted by RF (Figure 4d) show a sizable void at around À0.50 V which is inconsistent with an expected continuous j CO 2 RR -V relationship. Additionally RF and ExTr (Figure 4d,e) predicted that j CO 2 RR converged to a single value at a potential of about À0.4 V independent of the values of the other features, reflecting how those models are skewed to the centroid. On the contrary, GBR shows a reasonable dependence of j CO 2 RR on V throughout the examined feature space. These differences between models are consistent with the differences observed with the ad hoc nonlinear correlation analyses mentioned previously. The various inspections of the ML models with the aid of physics and heuristics imply that GBR can best represent the data and should be used for the optimization process.

GA with Machine Learning Models
Given that the optimization space with 11 features has vast amounts of conditions to be tested (%10 14 ), which is beyond the scope of random search experiments, we integrate GBR into a continuous GA to seek promising conditions on the binarycriteria boundary front. [48,49] Each condition and feature value can be referred to chromosome and genes, respectively. GA starts with a population of 1 000 000 randomly generated chromosomes as an initial pool in which any chromosome with the simulated cost value of the breakeven value B < $0.8 kg À1 was selected to create the subsequent pool. Then, 1000 chromosomes were randomly chosen as 0th generation from the latter to initiate evolution steps. For the evolution process, we applied a cost-stability test on each chromosome. Namely, each prediction was tested by randomly perturbing the values of its features from 0 to AE 10% and evaluating the average cost γ and standard deviation of the cost Δγ to select the survivor based on a cost function h defined as where α > 0. h is designed to help to reject localized minima which may result from overfitting and also ensure that the device is insensitive to small perturbations in processing and experimental conditions. The next generation was chosen with a roulette-wheel selection method [49] and mutated at a probability rate of 10%, meaning that any chromosome mutates at the rate of the probability in which the feature value was randomly changed AE10%.
The GBR-implemented GA produced populations which converged within four generations when there was no longer sigfinicant change in the population (Figure 4a,b). Note that while the 0th generation contains data positioned close to the boundary of the initial distribution, the final generation exists in a region a small distance away from that boundary, attributed to the effect of the cost function to find more stable conditions. The final population of chromosomes showed fairly uniform feature values with <5% variation, meaning that it effectively corresponded to a single chromosome. After repeating the GA process 30 times, four measurably different optimization conditions were attained (Table S2, Supporting Information). Figure 4c displays the difference between ML-predicted and measured EE and j CO 2 RR of the candidates, showing that the results closely matched the predictions by GBR. In contrast, the predictions by RF and ExTr were in good agreement with the measured EE but the models consistently underestimated j CO2RR , again likely a result of the tendency of the these models to reflect the average especially for j CO2RR for the search space close to the targeted Pareto front (EE = 53-60% and j CO 2 RR = 115-103 mA cm À2 ) in agreement with the observation from the distribution analysis (Figure 2a,b). Of the four measured candidates, the best result (No. 4 in Table S2, Supporting Information) was γ = $0.73 kg À1 ($0.07 kg À1 margin) with EE = 57% and j CO 2 RR = 115 mA cm À2 , corresponding to a 40% margin improvement from γ = $0.75 kg À1 ($0.05 kg À1 margin). The comparison between the updated best experimental  Figure 4. The change in a) the production cost of CO and b) energy efficiency and current density for CO 2 RR in the GA. c) Comparison of the differences in the predicted results and experimental data for the candidates for optimization. The agreement is markedly better with GBR than with either of the other models.
www.advancedsciencenews.com www.advintellsyst.com condition (No. 4, Table 1) and the initial one reasonably highlights the increase in the catalyst loading and metal weight ratio on the cathode CL, seemingly leading to the noticeable improvement in the j CO 2 RR . While it seems trivial that the thicker layer and more catalyst metals may result in a higher reaction rate, these factors often hinder reactant/product transports. The new condition mitigated these adverse effects by concomitantly reducing the thickness of the GDL and fine tuning the loading conditions. The slight increase in the metal weight ratio on the anode CL may help maintain high EE by increasing the anodic catalytic available sites. It is notable that when we include the four candidates in the training set and complete the entire process again, the results are unchanged, suggesting that we reach the actual Pareto front in the first step. If we instead use the same GA with RF or ExTr models, the chromosomes still converge in about four generations, but since the models map conditions near the actual Pareto front with those closer to the average response, the predicted optimal conditions vary widely within the region that approaches the Pareto front. Neither RF nor ExTr show improvement to locate the front even after including the data from the optimized validation experiments, showing that meaningful inspection of ML models is essential for use with limited datasets.

Conclusion
We have presented a methodology of practically deploying ML in an interactive procedure for material science experiments. The efficacy of this technique has been demonstrated in the optimization of an MEA device toward feasible CO 2 RR to CO. Nonlinear correlation analyses identified relevant features for optimization. Conventional CV provided similar errors among tested decision tree-based ML models and an ad hoc analytical visualization with SHAP indicated that the predictions of all models were similarly influenced by the features. However, an inspection of ML data representation of binary targets for EEj CO 2 RR distribution and functional dependency of performance metrics revealed significant differences among the ML models, suggesting that GBR is the most representative. The ML model was subsequently integrated into GA with a stability test to identify candidates on the Pareto front. The experimental data from the candidates showed good agreement with the ML predictions only when the model was selected according to the methodology, and conversely, only the chosen model could locate conditions on the Pareto front. This study demonstrates that careful use of ML can meaningfully expedite experimental research in materials science, even with limited data in spaces of relatively high dimension.

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.