Reinforcement Learning‐Guided Long‐Timescale Simulation of Hydrogen Transport in Metals

Abstract Diffusion in alloys is an important class of atomic processes. However, atomistic simulations of diffusion in chemically complex solids are confronted with the timescale problem: the accessible simulation time is usually far shorter than that of experimental interest. In this work, long‐timescale simulation methods are developed using reinforcement learning (RL) that extends simulation capability to match the duration of experimental interest. Two special limits, RL transition kinetics simulator (TKS) and RL low‐energy states sampler (LSS), are implemented and explained in detail, while the meaning of general RL are also discussed. As a testbed, hydrogen diffusivity is computed using RL TKS in pure metals and a medium entropy alloy, CrCoNi, and compared with experiments. The algorithm can produce counter‐intuitive hydrogen‐vacancy cooperative motion. We also demonstrate that RL LSS can accelerate the sampling of low‐energy configurations compared to the Metropolis–Hastings algorithm, using hydrogen migration to copper (111) surface as an example.


I. INTRODUCTION
Diffusional atomic motion is an essential microscopic process in the kinetics of materials [1].Various interesting phenomena and applications are rooted in diffusionrelated processes, from the interdiffusion at metal interfaces, vacancy and void formation, to hydrogen embrittlement [2] and resistance switching in oxide memristors [3].One important tool to study the diffusion process is atomistic simulation [4,5], which can simulate a wide range of materials phenomena [6,7].However, a critical challenge of atomistic simulation of diffusionrelated process is the timescale problem [8]: the atomic vibration has a timescale of fs -ps; however, the diffusionrelated transitions between adjacent energy minima have orders of magnitude larger timescale.That is because the energy barriers on the diffusion pathway slow down the diffusion process [7].The timescale problem limits most of the straightforward molecular dynamics simulations to nanoseconds, which fall short of the timescales relevant to many diffusion-related phenomena [8,9].Therefore, different methods are needed to deal with the long-timescale problem [8].
Our work is based on one of the widely studied algorithms, the kinetic Monte Carlo (KMC) method [10], where one directly works with diffusion timescale without explicitly showing the vibration timescale motion.Traditional KMC (in contrast with off-lattice KMC) requires energy minima and transition pathways (the socalled event table) as input.However, as the diffusion pathway is sometimes counter-intuitive, correctly determining the necessary input information of KMC is not a trivial task [10].To conduct a simulation without a known event table, the off-lattice KMC is developed [11].The algorithm conducts saddle point searches to obtain the diffusion pathways along with the KMC simulation.Another method reported to have advantageous efficiency is temperature accelerated dynamics (TAD), where the transition pathways are explored by high-temperature molecular dynamics [12].In both methods, the transition pathway is explored by random sampling (random initial guess in the saddle point search for off-lattice KMC, and random thermal motion for TAD).However, as the configuration space is high-dimensional, it requires a large amount of random sampling to be confident that the correct transition pathway is obtained, which limits the simulation system size and accessible timescale [11].
In this work, we developed a reinforcement learning (RL) method that guides the transition pathway sampling in off-lattice KMC to simulate long-timescale diffusion processes.Instead of searching for all nearby saddle points along randomly sampled initial directions [11], we use parameterized neural network model to guide the saddle-point search.The model can predict the direction of atomic motion that yields the high-probability transition pathway.That avoids the repeated saddle-point searches, which is the most significant contributor to the computational cost of the off-lattice KMC.We demonstrate that our RL model can either simulate physical diffusion trajectories or sample low-energy configurations in complex energy landscapes by simulating the hydrogen diffusion in alloys and metal surfaces.

II. RESULTS
Here, we briefly describe our RL method, as illustrated in Fig. 1a.In atomic diffusion, the energy landscape has a large number of local minima separated by transition energy barriers.In this paper, we use hydrogen diffusion in face-centered cubic (FCC) alloys as an example, as shown in Fig. 1b.In the local energy minimum configurations of FCC bulk structures, hydrogen atoms reside in octahedral and tetrahedral interstitial sites shown as the deep blue and shallow green potential wells in Fig. 1c, where the octahedral site has lower energy.The energy landscape is provided by a universal neural network Pre-Ferred Potential (PFP) [13] throughout this paper.Beginning from a given local energy minimum configuration or "state" s t = ( r 1 , r 2 , • • • , r N ) (the orange circles in Fig. 1, where r i is the coordinates of the ith atom), a set of possible transition displacements {a ti } (also called "actions") are first identified.In our problem, this is realized by identifying the polyhedron surrounding each hydrogen atom formed by its metal neighbors where possible actions are defined by translations through all face centers of the polyhedron (See section IV.A for details).
In the next step, an action a t is selected from the action space A st ≡ {a ti } based on the atomic descriptor D of the configuration s t .The probability of selecting each action a, π θ (a|s t ), is given by the Boltzmann policy based on a neural-network value function Q θ (s, a) [14]: where θ represents the model parameters, k B and T are the Boltzmann constant and temperature.Q θ (s, a) can also depend on T if the vibrational entropy contribution is considered, which will be discussed later.
After selecting an action a t = (i, v), the ith atom is dis-placed by vector v across the energy barrier.The system is then relaxed to the next state, s t+1 , using the MD-Min algorithm implemented in the Atomistic Simulation Environment [15].Parameters of the transition, including the transition energy barrier E NEB b , the attempt frequency ν NEB a , and the energy change after the transition ∆E, can then be estimated using the NEB method [16] setting s t and s t+1 as the initial and final points.The reward of this transition, r t , is designed to encourage either reproducing transition probabilities of the harmonic transition state theory (HTST) [17] or an energy minimization strategy, which will be discussed in the next part.The whole simulation trajectory is produced by repeating the above scheme that generates the next state according to the current state.
The Q θ (s, a) model is constructed based on the DeepPot-SE sub-networks [18].As the atomic interaction in alloys is short-range, we assume Q θ (s, a = (i, v)) is a function of the atomic environment of the moved atom i and its displacement v.The descriptor D i should be equivariant under translation, rotation, and permutation symmetry operations of the atomic system, realized by the following construction: where ˆ goes through all atoms around the ith atom within a cut-off radius r c .f c (r) is a cutoff function as defined in Ref. [18], which goes smoothly to zero at a cutoff radius r c , and G 1 k and G 2 l are embedding neural networks parametrized by θ emb .c m (m = 1, 2, • • • , M ) are the atomic species of the mth atom, and we set c M +1 as a unique "action species".The descriptor D i is invariant under all symmetry operations.The descriptor is then flattened to a vector and passed to a multilayer perceptron (MLP) that outputs the Q function: , where the model parameters θ = (θ fit , θ emb ) includes both parameters of the MLP θ fit and that of the embedding network θ emb (see section IV.B for detailed parameter settings).
By choosing different reward functions, our method has two working modes: transition kinetics simulator (TKS) and low-energy states sampler (LSS).The TKS aims to simulate physical transition probabilities according to HTST, and the LSS aims to converge to global energy minimum configurations.The TKS adopts the reward function of where ν i and ν * j are the ith normal mode vibration frequency at state s t and the jth positive vibration frequency at the transition saddle point between s t and s t+1 .The model is trained as a contextual bandit problem [27], where the value function Q θ (s t , a t ) is trained to fit the instantaneous reward r t (minimizing (Q θ (s t , a t )− r t ) 2 ).Then, as Γ sta = ν NEB a e −E NEB b /kBT = e rt/kBT (according to HTST) gives an estimation of the rate of the transition corresponding to action a, the policy in Eq. ( 1) gives the physical transition probability P (a|s t ) = Γ sta / a ∈As t Γ sta .The average residence time of the system on the state s t , ∆t = ( a∈As t Γ sta ) −1 , is then estimated as ( a∈As t e Q θ (st,a) ) −1 .Expressing the reward r t = r 0 t + r 1 t T as a linear function of T , the constant term r 0 t and linear term r 1 t can be fitted simultaneously by a two-component value function T to make the model applicable to different temperatures: (5) where λ is the learning rate, and T tr , the training temperature, is a hyperparameter that controls the relative importance of the two terms in the loss function.The two components give neural-network predictions for As a testbed, we first apply our method to hydrogen diffusion in pure FCC Cu and Ni.The model is trained on a 4 × 4 × 4 cubic supercell with 4 randomly sampled hydrogen sites.The model is then deployed to simulate a single hydrogen diffusion in a 3 × 3 × 3 cubic supercell for 500 timesteps.This system is simulated at temperatures spanning 250 K to 500 K with an interval of 50 K and repeated 50 times for each temperature.The final displacement ∆x i , total time t i , and temperature T i of the ith simulation trajectory are recorded.The two parameters D 0 and Q in the Arrhenius form of diffusivity D = D 0 e −Q/kBT are fitted by the maximum likelihood estimation (MLE): The derived D 0 and Q are reasonably consistent with the previous experimental measurement, as shown in Table I.The effective activation energy Q in simulation tends to be slightly smaller than the experimental results.That's probably because a small concentration of trapping sites (defects or impurities) in experiments are not considered T .(c) Hydrogen self-diffusivity at 400 K as a function of the short-range ordering parameter (the dashed line is a B-spline [24] connecting the data points).SRO = 0 corresponds to a fully random solid solution, SRO = 1 corresponds to WC parameters obtained from Ref. [25], and intermediate values of SRO are linearly interpolated.The SRO is sampled using the OTIS code in Ref. [26].
in simulation, which slightly increases the average energy barriers.
To test the method's capability to capture compositional complexity, we train the RL model on equiatomic CrCoNi medium entropy alloy.The CrCoNi alloy has recently attracted broad interest because of its outstanding fracture toughness and ductility [28].In the CrCoNi solid solution, each metal atom near the hydrogen can be of different atomic species, giving a complex state space.
The predicted E NN The hydrogen self-diffusion in CrCoNi is simulated using the trained model running on one hydrogen in a 4 × 4 × 4 rhombohedral supercell with short-range ordering obtained from Ref. [25].The hydrogen displacement as a function of simulation time is shown in Fig. 3a under 300 K using 30 repetitions of µs long-timescale simulations.An approximate function form of ∆x 2 ∝ t is shown by the blue line, and the diffusivity is estimated as 2.84 × 10 −14 m 2 /s.Similar simulations are implemented for different temperatures, as shown in Fig. 3b.The Arrhenius plot shows a good linear relation.The estimated effective activation energy Q equals 0.43 ± 0.01 eV, and the pre-exponential factor D 0 equals 5 ± 2 × 10 −7 m 2 /s.To our knowledge, these parameters have not been provided in the literature, so we show these results as predictions of our method.
In CrCoNi, short-range ordering (SRO) has significant influences on various properties of the material ranging from hardness [29] and stacking fault energy [25] to magnetism [30].We show that the SRO also has an evident influence on the hydrogen diffusivity in CrCoNi, as shown in Fig. 3c.The system with SRO under thermal equilibrium (SRO=1) gives approximately two times the hydrogen diffusivity of the fully random configuration (SRO=0), showing that the SRO enhances hydrogen diffusion.That can be explained by the reduction of Cr-Cr bond concentration by the SRO [25], as hydrogen transition energy barriers proximate to the Cr-Cr bond are found higher than the average hydrogen transition energy barriers in our calculations.Our results predict that the hydrogen diffusion behavior can also be tuned by the SRO in multi-principle element alloys.
The second working mode of our method, the LSS, sets the reward function as the energy reduction after the transition: r t = E(s t ) − E(s t+1 ) = −∆E.The model is trained by the deep Q network (DQN) algorithm [31], which aims to maximize the total reward R = T t=0 γ t r t on a trajectory with a discount factor γ close to one (set as 0.8 in our calculation).The model parameters are updated through the Bellman equation [31]: where θ t is the target network that updates less frequently than θ.
The converged Q θ (s t , a t ) fits the maximal total rewards after timestep t, max (at+1,at+2,••• ) T t =t γ t −t r t .As the Q function "foresees" the energy reduction of future steps and chooses actions that maximize "long-term" energy reduction, it is expected to converge to low energy configurations faster than local strategies that only consider single-step energy terms.That provides LSS a simulator of an annealing process, which converges to a near-ground state with fewer timesteps than the TKS.
We demonstrate the LSS's performance in simulating annealing by the hydrogen migration to copper (111) surface process, as shown in Fig. 4a. 4 × 4 × 3 hexagonal supercells are constructed with 10 randomly sampled hydrogen sites, and the (111) surface is created with a 15 Å vacuum layer.Hydrogen in the surface adsorption sites has lower energy than that in the bulk interstitial site, so the energy ground state is that all hydrogen atoms are on the surface adsorption sites.However, because of the energy difference between the octahedral sites and tetrahedral sites, the migration pathway involves multiple local energy minimums and low energy barriers, making it challenging to sample the low-energy states [32].After training, our RL policy gives the most likely action from each state, as shown in Fig. 4a.Within the cut-off radius of 8.5 Å in Eq. ( 3) from the surface, the highestprobability actions (HPAs) from all sites are oriented towards the surface.The HPAs from surface adsorption sites point to neighbor surface sites.This policy provides orientation for the hydrogen atoms to migrate across the local energy barriers toward the surface sites.The HPAs from sites close to the surface have larger Q values than that far from the surface, as the discount factor reduces the contribution of long-term rewards to the Q function compared to short-term rewards.For sites deeper than the cut-off radius, all move gives the same Q function due to the constraint of symmetry.
We compare the annealing process using the LSS and the Metropolis-Hastings algorithm [33], as shown in Fig. 4b,c.The LSS annealing leads all hydrogen atoms to surface adsorption sites and converges to the en-ergy ground states in 200 timesteps in all 50 trajectories.From the grey lines, one can observe that the system moves across a large number of low-energy barriers and approaches the ground state.In comparison, the Metropolis-Hastings algorithm converges slowly.Less than half of the hydrogen migrates to the surface sites in both 200 and 500 timesteps annealing, leaving ∼ 4 eV energy above the ground state on average.These results demonstrate that the LSS can show advantageous performance in approaching low-energy configurations compared to straightforward Monte Carlo methods.

III. DISCUSSION AND CONCLUSIONS
The TKS and LSS can be viewed as two special cases of a unified DQN framework.The general reward function is: (8) where log ν * j (s saddle ) + F 0 is the effective free energy of the saddle point (F 0 is a state-independent constant).There are three tunable parameters, α, β, and γ (in Eq. ( 7)), controlling the importance assigned to reproducing the correct transition probability, energy reduction, and long-term performance of the model.The TKS and LSS correspond to α = 1, β = γ = 0 and α = 0, β = 1, γ 1, respectively.Other parameter settings, despite the lack of direct physical interpretation, can be used to explore different configurations in the energy landscape with certain preferences.A probabilistic interpretation of the general framework is discussed in section IV.C, mapping each parameter set to a probability distribution function from which the trajectory is sampled.
Our method provides a computational framework to simulate the long-timescale diffusion and annealing process.Although the simulations in this paper focus on hydrogen diffusion in metals, the method is applicable to diffusion processes in different materials and microstructures, given a specifically designed action space.This method can also bridge large length scales, by first training a model on varied small structures, then deploying the model to guide the long-timescale simulation in a large supercell that includes the complexity of all trained structures.

IV. EXPERIMENTAL SECTION A. Action space identification algorithm
The action space A(s) = {a = (i, v)} is identified based on the atomic configuration s.The algorithm first identifies all hydrogen atoms with indices i 1 , i 2 , • • • .For each hydrogen atom i, the distance of all metal atoms j within a cut-off radius r c is ranked: where r ij k is the distance between atom i and atom j.
Then, we use all metal atoms j k with a distance r ij k < 1.2r ij4 (we denote the largest k satisfying the condition as n) and the hydrogen atom i itself to construct a convex hull including these atoms.If the hydrogen atom i is a corner of the convex hull, the hydrogen atom is on a surface adsorption site; if the hydrogen atom i is inside the convex hull, the hydrogen atom is a bulk interstitial site.
If the hydrogen atom is in a bulk interstitial site, we choose all face centers, ( Then, the actions towards every face center are included into the action space, except there are "collision" events.The "collision" event is defined as, if the hydrogen atom i takes the action, it will have a smaller distance than 0.5 Å with at least one other atom.If the hydrogen atom "collide" with another hydrogen atom, the action is directly discarded.If the hydrogen atom "collide" with a metal atom, the metal atom will be added to reconstruct a convex hull, and actions towards face centers adjacent to the added atom will be included, except it evokes another "collision".If that happens, the action will be directly discarded. If the hydrogen atom is on the surface adsorption site, the convex hull is reconstructed using metal atoms j k satisfying r ij k < 1.2r ij3 .Atoms directly connected with the hydrogen atom, (j 1 , j 2 , • • • , j n ), are identified as the adsorption site (we sort (j 1 , j 2 , • • • , j n ) to form a counterclockwise loop).The adsorption site center is obtained as c = 1 n k r j k .The adsorption site has n edges, and the sth edge center is e s = ( r js + r js+1 )/2.First, the surface diffusion actions (i, 1.6( e s − c)), s = 1, 2, • • • , n are included.Then, the action towards the bulk i, 3 Å c k − ri | c k − ri| is included.If "collision" happens, the same procedure as the bulk interstitial site case is applied.

B. Detailed parameter settings
The model training on pure copper and nickel is conducted on 4 × 4 × 4 cubic supercell of the FCC metals.3 atomic configurations are generated for each metal, where 4 hydrogen atoms are randomly sampled in all octahedral and tetrahedral sites in each configuration.20 and 40 trajectories are sampled for copper and nickel, respectively, with 30 timesteps in each.In the atomic relaxation and NEB calculations, all forces converge to 0.05 eV/ Å under the PreFerred Potential (PFP) v4.0.0, which is used throughout this paper.The cut-off radius of the neural network model is 4 Å.The embedding network G 1 k has one hidden layer and an output layer both with a size of 12. Throughout the paper, we take the first 1/4 columns of G 1 k to form G 2 k , and the input layers of G 1,2 k have a size of N c + 1, where N c is the number of element species.We define an element species list: , the input layer takes the N c +1 dimensional input vector whose lth component is f c (r im ) and other components are zeros.The fitting network has two hidden layers with a size of 32.The maximum atom number is set as 40, which has not been exceeded during the training.The training temperature is set as 1000 K throughout this paper.After including the nth trajectory, one randomly samples a trajectory from probability distribution P i = 1−0.991−0.99 n 0.99 n−i (recent trajectory has larger probability) and train 20 gradient descend steps from the sampled trajectory, and repeat this for n times.The training algorithm is Adam throughout this paper, and the learning rate here is set as 10 −3 in all online training.Offline training is conducted to further improve the model's accuracy.We separate the training data into the training dataset (2/3 of the data) and the testing dataset (1/3 of the data).10000 full gradient descent is implemented on the training dataset.The learning rate changes from 10 −3 to 10 −5 that exponentially decays with timesteps in all offline training in this paper.
The model training on NiCrCo medium entropy alloy is conducted on 4 × 4 × 4 cubic supercell of the FCC fully random solid solution.9 atomic configurations are generated for each metal, where 4 hydrogen atoms are randomly sampled in all octahedral and tetrahedral sites in each configuration.3 independent processes of training are conducted with 101 trajectories in each, and each trajectory contains 30 timesteps.In the atomic relaxation and NEB calculations, all forces converge to 0.05 and 0.07 eV/ Å, respectively.The cut-off radius of the neural network model is 5 Å.The embedding network G 1 k has one hidden layer and an output layer both with a size of 24.The fitting network has two hidden layers with a size of 128.The maximum atom number is set as 50, which was not exceeded during the training.The online training parameters are the same as pure metals.As to offline training, we separate the training data the same way as pure metals.Stochastic gradient descent is implemented with a minibatch size of 500 data points (one timestep is a data point).The minibatch is randomly sampled from all data points, and 10 gradient descent steps are applied to each minibatch.That is repeated for 20000 iterations.In order to avoid overfitting, a normalization term of 5 × 10 −6 θ 2 is added to the loss function.
The deep Q learning for copper (111) surface is conducted on 4 × 4 × 3 hexagonal lattice of FCC copper (4 replications on a and b directions and 3 replications on c direction.c direction is along the 3-fold axis).A vacuum layer of 15 Å is included in the c direction.We implemented 7 independent training processes, 4 of them have only one randomly sampled hydrogen atom in the copper slab (12 configurations are sampled as starting points, and initial configurations are randomly selected from them), and the other 3 have 10 randomly sampled hydrogen atoms (10 configurations are sampled as starting points).300 trajectories are sampled with 30 timesteps in each.In the atomic relaxation, all forces converge to 0.05 eV/ Å.The cut-off radius of the neural network model is 8.5 Å, as the model needs more distant atomic information to foresee the long-term rewards.the embedding network G 1 k has one hidden layer and an output layer both with a size of 24.The fitting network has two hidden layers with a size of 128.The maximum atom number is set as 260, which has not been exceeded during the training.After including the nth trajectory, one randomly samples a trajectory and trains 5 gradient descent steps from the sampled trajectory, and repeats this for n 2/3 times.The offline training randomly samples a mini-batch with 10 trajectories and applies 10 steps of gradient descent at each iteration.There are 1010 iterations in the training process.

C. Probabilistic Interpretation of the DQN framework
By setting the parameters α, β, and γ, Our method samples different probability distributions.In physical reality, the transition rate is approximately determined by the harmonic transition state theory (HTST): where s is the next state after taking action a.For kinetics simulation (TKS) that reproduces the transition probabilities of Eq. ( 10), coefficients are set as α = 1, β = 0.The expected stationary time is then evaluated as: In certain scenarios, the goal is to sample thermal equilibrium distribution.The detailed balance principle proved that the probability distribution follows Eq. ( 11) as long as α + 2β = 1.One can set β to a larger value to sample more rare transition events while keeping the thermodynamics properties correct.

γ ∼ 1: maximizing global probability of a trajectory
When we set γ ∼ 1, the algorithm maximizes R(T ) T t=0 r t (we consider setting γ slightly smaller than 1 as a convergence technique that leads to a small bias).The probability of the trajectory is: Using the expected value τ t = 1/Γ st , the probability becomes P (T |τ t = 1/Γ st ) = P (s 0 )e −T T −1 t=0 Γ st→st+1 .Then, maximizing the total reward corresponds to maximizing: e R(T )/kBT −αT P (s 0 ) α+β = P (T |τ t = 1/Γ st ) α P (s T ) β (15) Here, the initial state s 0 does not depend on the policy, so it is constant when doing the maximization.If α = 0, β = 1, the method aims to sample the most probable final state s T , corresponding to an annealing process that targets the ground state.If α = 1, β = 0, the method aims to sample the most probable trajectory based on transition kinetics.In the general case, α and β can be tuned according to sample probabilities considering both the final state distribution and transition kinetics.

FIG. 1 .
FIG. 1.(a) Computational workflow of the RL long timescale method illustrated on (b) hydrogen diffusion in CrCoNi medium entropy alloy.The blue, green, and grey spheres represent Cr, Co, and Ni atoms, respectively.The orange circle, black dashed arrow, and red arrow represent state, action space, and selected action, respectively.(c) The potential energy landscape of a hydrogen atom on the grey planes in (a, b).When calculating the energy, surrounding atoms and the z-coordinate of the hydrogen atoms are relaxed.

) 2 Q
TABLE I. Hydrogen self-diffusion simulation results in pure copper, pure nickel, and CrCoNi medium entropy alloy.∆Eb ≡ (E NN b − E NEB b and ∆νa ≡ (ln ν NN a − ln ν NEB a ) 2 are the validation error of model prediction on transition energy barrier and attempt frequency.The activation energy Q and coefficient D0 are fitted by reinforcement-learning-simulated diffusivity D = D0e −Q/k B T using maximal-likelihood estimation, and D exp 0 and Q exp are the values from previous experiments.exp (eV) 0.38 [19] 0.31-0.44[20] -0.46 [21] 0.37-0.44[22] -0.435 [23]

FIG. 2 .
FIG. 2. Comparison of the neural network model prediction of (a) transition energy barriers E NN b and (b) attempt frequency ν NN a with those calculated by the NEB method in the training dataset, E NEB b

FIG. 3 .
FIG. 3. Hydrogen diffusion simulation in CrCoNi medium entropy alloy.(a) Square hydrogen diffusion displacement ∆x 2 (absolute value) as a function of time under 300 K.The grey lines show 30 trajectories; the blue squares and error bars are the mean square displacements and their error range (± one standard error); the red dashed lines is the linear fitting of the blue dots.(b) Arrhenius plot of hydrogen self-diffusivity under different temperatures.The blue caps show the error bar of calculated diffusivities, and the black dashed line is a linear fitting of log D vs 1000T .(c) Hydrogen self-diffusivity at 400 K as a function of the short-range ordering parameter (the dashed line is a B-spline[24] connecting the data points).SRO = 0 corresponds to a fully random solid solution, SRO = 1 corresponds to WC parameters obtained from Ref.[25], and intermediate values of SRO are linearly interpolated.The SRO is sampled using the OTIS code in Ref.[26].

b
and ν NN a are approximately consistent with the values in the training and testing dataset, as shown in Fig.2, where the data points are distributed close to the diagonal line in the wide range of observed quantities.The standard deviation errors of the model predictions are close in training and testing datasets, confirming that the training data is not overfitted despite the large volume of model parameters.

FIG. 4 .
FIG. 4. Sampling low energy configurations of hydrogen migration to copper (111) surface.(a) Highest probability actions (HPAs) and Q values of hydrogen atoms.The blue, silver, and pink spheres are copper atoms, octahedral interstitial sites, and tetrahedral interstitial sites.The HPAs (the actions with the highest probability according to the policy) are shown by arrows (red arrow: a unique HPA, black arrow: multiple (but not all) actions with equal probabilities, brown arrow: all actions have equal probabilities).The Q values of HPAs are denoted.Energy (using ground state energy as reference) vs simulation step under simulated annealing with (b) T = 1000 − 950 t 200 K using the trained policy and (c) T = 3000 − 2700 t τ K (τ =200 for blue lines and τ =500 for red lines) using Metropolis-Hastings algorithm.The grey/cyan/orange thin lines are 50 simulation trajectories, and the thick red/blue lines are their average.