Phase-Change Memristive Reinforcement Learning for Rapidly Outperforming Champion Street Fighter Players

Many applications of artificial intelligence involve interacting with humans while making real-time decisions in physical systems. Maneuver sports are one example of these conditions. Movement-type simulations, viz., the esports game Street Fighter (SF), recapitulate the complex multicharacter interactions and, concurrently, the millisecond-level control challenges faced by human athletes. Herein, the physical and mental signatures of the SF agent (called SF R2) are controlled by a previously unreported model-free, natural, deep reinforcement learning algorithm, "Decay-based Phase-change memristive character-type Proximal Policy Optimization" (DP-PPO), through an assemblage of hybrid case-type training processes, and an integrated training configuration for time-trial evaluations, as well as competitions with one of the world's best SF players, is developed. The SF R2 defeats the opponent in a short time while maintaining a good health level and handles imperfect-information settings well. Training studies reveal a moderate maneuver etiquette in the SF R2, along with rapid, effective head-to-head competitions with one of the world's best SF players. This paves the way toward a broadly applicable training scheme capable of quickly controlling complicated-movement systems in fields where agents must observe unspecified human norms.


Introduction
A difficulty arises from the balance between an increased level of interaction with humans for potential applications in robotics and, at the same time, an enhanced degree of respect for imprecisely specified human norms.[7] Movement sports pose these challenges, as they require real-time control of human bodies with complicated millisecond-duration maneuvers while operating within inches of opponents.[8,9] Because traditional autonomous competitions limit the challenges to performing fixed poses and simple object identification,[10,11,35-38] maneuvers powered by fully unmanned characters could be several years away.
Human athletes are required to be highly skilled in four areas to be effective: 1) maneuver tactics, 2) human-body control, 3) movement strategy, and 4) navigation etiquette.[44] The human athlete performs precise maneuvers in short times with little margin for error and builds the tactical skills needed to defeat and hinder opponents. Moreover, human athletes develop a detailed understanding of the movements of the human body and of the signatures of the surrounding environment for controlling the human system. The human athlete also utilizes strategic thinking when modeling opponents and deciding when and how to implement a maneuver. Furthermore, human athletes conform to highly refined but imprecisely specified sportsmanship rules.
Herein, we show that, by stimulating and altering the ensemble of hybrid case-type training processes in the natural, on-policy, model-free DRL algorithm "Decay-based Phase-change memristive character-type Proximal Policy Optimization" (DP-PPO), we are able to control the physical and mental signatures of the Street Fighter (SF) game agent, which we call the SF R2, and to develop an integrated training mode for time-trial evaluations, along with competitions with one of the world's best SF players. The SF R2 was developed using DRL based on phase-change memory (PCM) hardware and achieved rapid, effective head-to-head competitions with a leading SF player, together with good time-trial performance. The SF R2 agent won more matches than a top SF player across different opponent characters, a previously unreported result. This result can be seen as an important step in the continued progression of competitive tasks, such as Poker, Jeopardy, Chess, StarCraft, Go, and others, at which computers can challenge the best players. Training studies reveal a time utilized to defeat the opponent of ≈27 s against the top SF player with a combination of a kick upward, rotating kicks, and an on-the-ground throw of the opponent. Moderate maneuver etiquette, excellent handling of imperfect-information settings, a strong ability to achieve targeted game states, and short times utilized to defeat the opponent while maintaining a good health level were further attained by the SF R2. As a result, agents such as the SF R2 have the potential to provide high-level, realistic competition for training professional players, to discover new movement techniques, and to render esports games more interesting.

Approach for SF R2
Figure 1 illustrates the training configuration of the agent SF R2. The agent runs on a separate computing element and communicates asynchronously with the game using the baselines3 library and the gym-retro platform,[45,46] because the SF game runs on a central-processing-unit (CPU) system. We chose the character Ryu as the main character, and each SF R2 instance controlled a main character on its own CPU. The SF R2 was trained from scratch utilizing 16 CPUs, an equal number of main CPUs (systems that control SF R2 instances), and a GPU machine that asynchronously updates the neural networks.
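As a rough illustration of this configuration, the following minimal Python sketch (our assumption, not the authors' code) builds several gym-retro emulator instances in separate worker processes and couples them to a stable-baselines3-style learner; the integration name is an assumption, and standard PPO stands in for the DP-PPO variant introduced below.

import retro
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

GAME = "StreetFighterIISpecialChampionEdition-Genesis"  # assumed gym-retro integration name

def make_env():
    # gym-retro allows one emulator per process, hence one environment per rollout worker
    return lambda: retro.make(game=GAME)

if __name__ == "__main__":
    vec_env = SubprocVecEnv([make_env() for _ in range(16)])   # 16 CPU rollout workers
    model = PPO("CnnPolicy", vec_env, n_steps=512, verbose=1)  # placeholder hyperparameters
    model.learn(total_timesteps=1_000_000)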
We mapped the core actions of the SF R2 to three discrete dimensions: 1) perform a simple maneuver (weak punch, strong punch, weak kick, strong kick, or guard), 2) administer a more complicated maneuver (unique offensives or combinations of offensives), and 3) render a movement (forward, backward, jump, or bend down). The effects of actions were consistent with the physics of the environment, as imposed by the game. The SF R2 utilizes the same interface as the human player to input commands to the game, and the game was designed to communicate delayed game information to the SF R2. Complicated maneuvers, viz., spinning kicks, throwing of fireballs, upward punches, and other maneuvers, were implemented after a specified number of frames, in agreement with the way human players input targeted commands. To simulate the reaction delay of human players, we chose to harness a delay time of 240 ms (≈15 frames). Moreover, the SF R2 was able to learn when to deliver an offensive more precisely, although the SF R2 agent did not show a higher tendency than humans to remain stationary in a protected position and wait a long time for opponents to expose themselves to offensives through a defensive-playing strategy. The SF R2 observed the red, green, and blue (RGB) color frame, which contains key state information such as the character health level, positions on the screen, speed vector, outgoing offensive region, incoming offensive regions, current action state (standing, airborne, or crouch), current action type (at rest, simple move, guard, recovery, or more complicated moves), and other information about itself and the opponent, from the training system (Figure 1b).
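The 240 ms reaction delay can be emulated with a thin environment wrapper; the sketch below is our illustration (not the authors' implementation), and the no-op action placeholder is an assumption about the wrapped environment.

import collections
import gym

class ReactionDelay(gym.Wrapper):
    """Execute each action ~15 frames (~240 ms at 60 fps) after the agent issues it."""
    def __init__(self, env, delay_frames=15, noop_action=0):
        super().__init__(env)
        self.delay_frames = delay_frames
        self.noop_action = noop_action  # assumption: 0 is treated as "do nothing"
        self.queue = collections.deque()

    def reset(self, **kwargs):
        # pre-fill the buffer so the first real action arrives only after the delay
        self.queue = collections.deque([self.noop_action] * self.delay_frames)
        return self.env.reset(**kwargs)

    def step(self, action):
        self.queue.append(action)              # newest command enters the buffer
        delayed_action = self.queue.popleft()  # the command issued 15 frames ago executes now
        return self.env.step(delayed_action)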
The SF R2 was trained utilizing the DRL algorithm we call DP-PPO. This scheme learns a value function (we call it a "critic") that estimates the future rewards of each possible action, as well as a policy (the term "actor" is utilized) that selects actions on the basis of agent observations. The DP-PPO extends the advantage actor-critic (A2C) approach by altering it to handle n-step returns and by replacing the expected value of future rewards with a representation of probability distributions of the reward. The neural networks are trained asynchronously by the DP-PPO, i.e., the DP-PPO model samples data from the rollout buffer (RB) while actors continuously fill the buffer with new experiences and practice maneuvers using the most recent policy. Experiments have demonstrated reinforcement learning with a software-based exponential decay function in different games, such as Walker2D, HalfCheetah, Acrobot, and others.[47] The exponential decay functions are utilized to maintain the PPO approach without introducing new hyperparameters or additional computational burden to the algorithm, to enhance exploration of the game environment by the agent at the beginning of the training, and to avoid large policy updates near the end of the training process. It is of interest in this work to examine an exponential decay function based on the conductance drift phenomenon of PCM elements (Figures S1 and S2, Supporting Information).[50] A low energy consumption was achieved in the training-based processes and the inference process. The SF R2 operations can be categorized into two processes based on system functions: 1) the training process and 2) the inference process. The PCM element was utilized in the training process, in which the data simulation was performed using the conductance drift effect of a single PCM element/device to achieve the decay operation. Only the PCM element in the amorphous state was utilized. The output conductance of the PCM element was read over time to generate input parameters for implementing the decay function. In the inference process, the trained actor model was utilized within Python software, and the model parameters were fixed and not updated further. In this process, an observation-image input was fed into the trained actor model, and the model generated a single action output that was implemented in the game environment. PCM elements were not used in the inference process.
The SF R2 learned to maintain a strong health level in approximately an hour and, within 6 h, was able to defeat the in-built AI opponent in a shorter time than ≈75% of the human players in a reference dataset for this work (Figure S3, Supporting Information). Additionally, the SF R2 was trained for up to 48 h, shaving off several seconds, until the time utilized to defeat the in-built opponent stopped improving, as shown in Figure 1c. With this training procedure, the SF R2 achieved excellent time-trial performance for different opponent characters (Figure 1d,e and Figure S4a-d, Supporting Information). Figure 1d shows the distribution of the time utilized to defeat an in-built opponent character for a hundred human players. The strong performance of the SF R2 in terms of the time used to defeat the in-built opponent character Guile is shown in Figure 1e, with a minimum time harnessed to defeat the opponent (10 s) shorter than that of human players (≈13 s).
For the SF R2, a difficulty arises from the balance between decreasing the time utilized to defeat the opponent and, at the same time, increasing the health level.[51] The SF R2 was given penalties if it took a large degree of damage or used a long time to defeat the opponent, and a progress reward for damage it delivered to the opponent (this is also described as no penalties). As a result of these shaping rewards/penalties, the SF R2 receives positive feedback quickly for decreasing the time harnessed to defeat the opponent and, simultaneously, maintaining a high character health level (Figure 2a).
[Figure 1 caption (panels f,g): training scenarios with the opponent character Guile, including two specialized scenarios where the opponent performs f) the antiair offensive and g) the dash forward; similar plots for the other two in-built opponent characters are shown in Figure S4a-d, Supporting Information; game snapshots adapted and modified from www.myabandonware.com.[100]]
The time/damage penalties were not sufficient to incentivize the SF R2 to defeat the opponent rapidly. A time utilized to defeat the opponent below 30 s in arcade mode on a medium or higher difficulty level is very difficult to attain.[52] Even when the resulting time harnessed to defeat the opponent remained short enough, the SF R2 would learn to take a defensive position and assault incoming opponents with long-distance offensives. This allowed the SF R2 to accrue large rewards without risking devastating damage. We find that adding distance penalties helped the SF R2 learn to defeat the opponent in 19.4 s (Figure 2a). Moreover, the SF R2 disclosed an average character health level of ≈83.4% over a hundred matches (the original health level is 100%). We used a distance penalty that was proportional to the position of the agent relative to that of the opponent.
Although many applications of DRL in games utilize self-play to improve performance, the straightforward application of self-play was inadequate for the SF R2 in terms of imperfect-information settings. Because the SF is a real-time game, the two players make their decisions at the same time. This means that the agent is required to make decisions without knowing the opponent's decision or strategy; thus, the SF is considered an imperfect-information game. For instance, the SF R2 performs "unsafe" maneuvers upon recovery (we call it wake up) with a larger probability than human players. Such a maneuver can easily be negated and punished (resulting in a large degree of damage). By navigating with copies of itself only, the SF R2 was ill prepared for the imperfect-information context that it would observe in human opponents. If the SF R2 agent does not defend upon wake up and anticipate the maneuvers that the opponent renders, it might be caught in an unblockable combination of offensives (the term setup is used). This feature of maneuver games, i.e., that the decisions two players make simultaneously can cause a player to be punished substantially, is not a feature of traditional turn-based games such as Chess and Go. We utilized a hybrid population of opponents, including the customized opponent and the in-built opponent, to alleviate this issue. The importance of these choices is disclosed in Figure 2b. The SF R2 was ill prepared/unable to anticipate the opponent maneuver when the SF R2 agent was trained with the in-built opponent (indicated by the weak to moderate health level), whereas when the SF R2 was trained with the customized opponent, the SF R2 agent learnt to be well prepared/able to anticipate opponent maneuvers (indicated by the moderate-to-strong health level attained).
[Figure 2c-f caption: c) ablation of elements of the jumping-type training over a range of epochs sampled during training, with the y axis measuring the agent's ability to perform a large-damage maneuver; d) the SF R2 used a shorter time to defeat the opponent with an increased amount of time penalties, although the nonbaseline policies were judged to be unsportsmanlike by test players and judges; e,f) effects of various settings on agent performance, with the baseline settings colored in a darker shade of cyan in all plots and error bars describing the range of values from training three different models: e) increasing the GAE-lambda value was advantageous (the mean time utilized to defeat the opponent decreases), and f) the SF R2 would not be able to defeat the opponent in a shorter time without the decay-type enhancement to the PPO.]
Moreover, the opportunities for the SF R2 to learn specified skills are rare. This is called the exposure problem, i.e., without the cooperation of opponents, targeted game states are not accessible to the SF R2. For instance, one of the best ways to induce a large degree of damage and achieve a short time utilized to defeat the opponent is to jump in close to the opponent constantly, because the damage received by the opponent upon a successful jump (the SF R2 lands close to the opponent and performs a combination of offensives/a throw) is substantially larger than that received by the SF R2 upon an unsuccessful jump (the opponent performs an antiair offensive). The successful jump is a situation that occurs a few times or not at all in an entire game. If the opponent is not able to perform the antiair offensive consistently, the SF R2 would learn to jump in close to opponents for a large part of the game and be foiled by a human player who is consistent in implementing antiair offensives. We developed a procedure which we call hybrid scenario training to alleviate this issue. We worked with a senior SF player to identify a small number of game situations that were pivotal for each opponent character. Scenarios that presented the SF R2 with noisy variations of these critical situations were configured. We utilized simple proportional-integral-derivative (PID) controllers to ensure that, in the training scenarios we wanted the SF R2 to be prepared for, the opponent implemented the targeted offensive, such as performing antiair offensives consistently. The specialized scenarios, viz., cases wherein opponents administer the antiair offensive and the dash forward, for the opponent character Guile are revealed in Figure 1f,g. A stratified sampling scheme was harnessed to ensure that situational diversity was present throughout the training. This approach resulted in more robust skills being learned, as disclosed in Figure 2c.
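The PID control used to script opponents can be as simple as the toy sketch below (our illustration; the gains and the 60-pixel target gap are placeholders, since the actual controllers are not published in this text).

class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error, dt=1.0):
        # classic proportional-integral-derivative law on the tracking error
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# e.g., hold a scripted opponent near a 60-pixel gap so it can reliably anti-air a jump-in
controller = PID(kp=0.5, ki=0.0, kd=0.1)
step_command = controller.update(error=60 - 45)  # positive output -> widen the gap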
Like many other sports, movement sports, which occur in both physical and virtual configurations such as that of the SF R2, require human judges. These judges review maneuver "incidents", for instance, cases where the agent delivers offensives on an opponent, and then decide whether agents obtain penalties. For the penalty duration, enforced by the game engine, the penalized agent is not able to engage the opponent (opponents would be in a targeted resting state). The maneuver rules describe various conditions under which agents are penalized, although a medium population of offensives delivered to the opponent is considered acceptable and common. The judge assessments include many contexts, e.g., the effect the offensive has on the immediate future of opponents, and the maneuver rules are ambiguous. Because judge decisions are contextual and subjective, it is extremely challenging to encode these rules in a way that provides the agent clear signals about the specified rule to learn. Thus, maneuver etiquette is an example of the difficulty that AI practitioners face when designing agents that interact with humans who expect the agent to conform to behavioral norms.
Although the observations that the SF R2 obtains from the game do not indicate whether a penalty is deserved, these observations include a flag when a medium population of offensives is delivered to the opponent. We experimented with different approaches to encode the etiquette as instantaneous penalties, on the basis of the time required to defeat the opponent, for a medium population of offensives delivered. However, the resulting policies were judged to be mild by the judges and test players when a negligible time penalty was utilized (the time required to defeat the opponent is long). Thus, we chose to harness a moderate-type methodology, which penalizes the agent with small time penalties for achieving a short to intermediate time utilized to defeat the opponent, with an additional large time penalty if short times are needed to defeat opponents. The effects of time penalties and of the key design choices utilized in this work are revealed in Figure 2d and Figure 2e,f, respectively.

SF R2 Evaluation
We utilized the SF R2 in two action events with a top SF player to evaluate the agent performance. The first event involved both time-trial and head-to-head matches. The head-to-head match with the SF R2, including one of the world's best SF players, was held at the Singapore University of Technology and Design. The SF player competed with an archetype of the SF R2. Six matches were conducted, and points were awarded based on the upper bound of the damage delivered, i.e., the player who reached the upper bound of the damage delivered first earned a point, while the other player earned zero points. The SF player utilized six different types of opponent characters, and the main character was harnessed by the SF R2. The SF R2 achieved a draw in the first event (the score was 3-3) (Figure S5a-f, Supporting Information).
After evaluating the SF R2 agent performance in the first event, we increased the network size, enhanced the training procedure, improved the opponent population type, and made adjustments to the features and rewards. The SF R2 won the rematch administered at the second event by an overall score of 4-2. Using the same main character, the SF R2 improved by a win for the opponent character Chun li. The damage delivered by the characters through each match and the points earned by each player are revealed in Figure 3a-f.
One of the advantages of utilizing DRL to develop the SF R2 is that traditional engineers do not need to program when and how to perform the skills required to win a match. The SF R2 agent learns to perform the right actions by trial and error, as long as it is exposed to the right conditions. The SF R2 was able to rapidly defeat the opponent, utilize the jumping strategy effectively, disrupt the maneuvers of opponents, defend, and perform complicated maneuvers. Evidence of the generalized tactical competence of the SF R2 is exhibited in Figure 3g. The figure shows an example from the second event whereby the SF R2 disclosed a time utilized to defeat the opponent of 27 s against the top SF player with a combination of offensives comprising a kick upward, rotating kicks, and an on-the-ground throw of the opponent. This type of tactical competence was not limited to a specified character. The distribution of the number of matches with the time utilized to defeat the opponent below 30 s for the in-built opponent and with the use of two different opponent characters is shown in Figure 3h. In the time-trial match, the top SF player was invited to try to outperform the SF R2. The SF R2 agent won all six matches, although the SF player was allowed to observe the screen recording of the SF R2 matches. Figure 3i discloses the results.
Notably, this work has demonstrated the training of the SF R2 using the DRL algorithm DP-PPO, which has not been performed before. An onset of epoch convergence of 8000 epochs is achieved by the SF R2/PCM model, which is ≈7.33% smaller than the average of 8632 epochs for current decay approaches (Figures S6 and S7, Supporting Information). This finding allows the utilization of efficient algorithms for attaining rapid computation speeds. Besides, a small energy consumption is revealed by the SF R2/PCM configuration.[55] The occurrence of the von Neumann power wall limits the energy effectiveness of the decay computation, which further hinders the advancement and efficacy of reinforcement learning. In this work, the SF R2/PCM mode discloses an energy consumption of 2.626 × 10⁻¹¹ J, which is ≈141 times smaller than the average of 3.719 × 10⁻⁹ J for existing GPU-based systems (Figures S8 and S9, Supporting Information). This result enables the utility of low-energy systems for achieving high energy efficiency.[58-60] Owing to three reasons, the PCM is a leading candidate for administering decay operations compared with existing memristive hardware: 1) mature fabrication technology, 2) intrinsic decay signature, and 3) in-memory computing ability. Experiments have demonstrated that the PCM reveals an ultralarge array size and sophisticated fabrication innovations,[61-63] which can corroborate more complicated workloads and algorithms. The conductance drift phenomenon in the PCM is also determined physically through structural relaxation after the writing process and is intrinsic in character,[64,65] which renders it general and resilient and does not require specialized programming conditions/material adoptions.[68-70]
[Figure 3 caption (panels a-f, g, h, i): the plots disclose that, once the SF R2 obtained the lead in terms of the damage delivered, the SF player could not catch it in most cases, and the onset of the large increase in the damage delivered represents the occurrence of the high-damage maneuver; g) an example from the second event in which the SF R2 exhibited a time utilized to defeat the opponent of ≈27 s against the top SF player with a combination of offensives comprising a kick upward, rotating kicks, and an on-the-ground throw for the opponent character Guile, showing that the performance of the SF R2 was contextual: although the SF player tried to defend against offensive combinations, the SF R2 was able to find new strategies to defeat opponents in a short length of time (game snapshots adapted and modified from www.myabandonware.com[100]); h) the distribution of matches with the time utilized to defeat in-built AI opponents below 30 s for two different opponent characters, demonstrating that the SF R2 has learned to defeat opponents rapidly; i) results of the time-trial competition in the second event.]
Moreover, the previously unexplored training of the SF R2 based on an integrated penalty type was exhibited. The work has also disclosed the utilization of a merged opponent population to train the SF R2, which has not been administered before. Additionally, the previously undescribed training of the SF R2 driven by a combined scenario variety was revealed. Besides, the work has elucidated the training of the SF R2 enabled by a mixed time-penalty group, which has not been implemented before. Furthermore, the previously unknown training of the SF R2 using different opponent characters was demonstrated. This highlights the potential of the training methodology not only to allow a strong agent performance, but also to modulate the physical/mental signature of the SF R2.

Discussion
Applications such as maneuver game design are challenging for conventional AI because of several requirements: 1) addressing the balance between decreasing the time utilized to defeat the opponent and, simultaneously, increasing the character health level, 2) excellent handling of imperfect-information settings, 3) a strong ability to achieve specified game states, 4) moderate maneuver etiquette, 5) good time-trial agent performance, and 6) rapid, effective head-to-head competitions with one of the world's best SF players. Currently, no traditional AI model fulfills the requirements listed above. The examples shown in this work indicate that the current state of the SF R2 is able to achieve most of these requirements in a reasonable time relative to archetypal models. The key improvement in the SF R2 that enables these applications is the demonstration of the previously unknown time utilized to defeat the opponent of 19.4 s while maintaining a character health level of ≈83.4%. This enables a decreased occurrence of defensive-playing approaches for achieving interesting games.
The ability to manage imperfect-information settings is another advantage of the SF R2. For instance, the SF R2 was able to respond to opponent strategies well, as described by the moderate-to-strong health level obtained for realistic gaming, which has not been shown before. An additional key advantage of the SF R2 is the ability to attain specified game states. The previously unreported well-administered unique offensives/offensive combinations, represented by a high percentage of large-damage maneuvers delivered to the opponent to render the game representational, were revealed by the SF R2. The ability to demonstrate moderate maneuver etiquette is another performance advantage of the SF R2. The SF R2 agent was able to conform to behavioral norms well, as denoted by the short to intermediate time utilized to defeat the opponent for obtaining realistic games, which has not been exhibited before. An extra advantage of the SF R2 is the ability to achieve good time-trial agent performance. A previously unseen shorter time utilized to defeat the opponent in time-trial matches, for the SF R2 compared to the SF player and with the use of different opponent characters, was achieved for realizing high-level gaming. The ability to exhibit effective, rapid head-to-head competitions with the top SF player is a further advantage of the SF R2. The SF R2 revealed a time utilized to defeat the opponent of ≈27 s, with a combination of a kick upward, rotating kicks, and an on-the-ground throw of the opponent, in head-to-head matches against the top SF player, which has not been elucidated before, for achieving high-level games.

Conclusion
The good time-trial agent performance, rapid and effective head-to-head competitions, excellent handling of imperfect-information settings, strong ability to achieve targeted game states, and moderate maneuver etiquette are achieved through a hybrid case-type training process of the DRL algorithm DP-PPO in an integrated training configuration for time-trial evaluations and competitions with one of the world's top SF players, which alters the physical and mental signatures of the SF R2 agent. In principle, this methodology is applicable to a wide range of agents and training models, so that an appropriate combination of algorithms and agents opens opportunities for optimizing SF R2 agent performance. The success of the SF R2 in this environment suggests that these techniques may have an effect on real-world systems such as autonomous vehicles, collaborative robotics, and aerial drones.

Experimental Section
Game Environment: Since its inception in 1992, the game SF II Champion Edition has been played by millions of players.[71] The game was created for arcade machines and subsequently developed for different game systems, such as the PC, Sony Playstation, Microsoft Xbox, Nintendo Switch, and other systems.[72] The game allowed up to two players to play on the same system, and each player chose a character out of 12 different characters.[73] The techniques of the characters were modulated for achieving fair competition.[73] The SF R2 agent was run on a digital computer using the gym-retro Python library.[45,46] The library supported reinforcement learning studies for developing next-generation computer games and was compatible with various emulators.[41,74] In traditional studies, reinforcement learning-based agents required asynchronous interactions between the computer and the game using an internet protocol, which was susceptible to network delays.[1,3,75] In this work, on the other hand, the gym-retro environment, i.e., its application-programming interface (API), waited for the SF R2 agent to execute actions based on the latest observation of the agent before the game proceeded to the next frame, thus alleviating the problem of communication delays. The actions of the agent were consistent with the game-controller inputs of human players, but only a subset of these actions was chosen to facilitate the training of agents.
Computing Environment: Each experiment utilized a trainer on a compute node with an NVIDIA Quadro P2000/P2200 or an NVIDIA GeForce GTX Titan X coupled to 16 CPUs and 42 GB of memory. Trainers were run on three different computers to train instances of the SF R2 agent, viz., Ryu, against six different opponent characters, i.e., Guile, Chun li, Ken, Dhalsim, Blanka, and Ehonda. Moreover, rollout workers were utilized in each experiment, and each worker comprised a compute node that controlled an SF R2 agent instance. For each instance of the SF R2 agent, the compute node generated the rollouts, viz., a collection of tuples comprising the character states, actions, rewards, values, and log probabilities, via tasks such as controlling the agent to execute actions, passing information to the game using the API, sending experience streams, i.e., a loss corresponding to the log probabilities and values used for updating the weights of the model, to the trainer, and receiving updated policies from trainers (Figure 1a). Each compute node utilized two CPUs and 2.1 GB of memory. A total of 96 rollout workers/SF R2 instances were utilized over approximately 2 weeks to train six different policies for defeating the opponent characters. To train a policy that was able to defeat an opponent character consistently, 16 rollout workers were harnessed over 3-4 days.
Actions: The API allowed the player to control eight independent discrete actions, viz., up, down, down left, down right, left punch, right punch, left kick, and right kick. Each action was implemented in a frame. As special moves, such as the shouryuken, hadouken, spinning kick, and other special moves, were rendered through a series of actions, the system utilized three/four frames to administer these moves. The eight actions were converted to eight discrete integer values, i.e., 0-7, which were learnt by the agent/model for generating the next action. The on-screen image was passed to the model/policy network. The model then computed the output values and the corresponding softmax/probability values for the eight actions. Finally, the agent chose the action with the maximum probability for the new state/frame.
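A minimal sketch of this index-to-action step (our illustration; the index order follows the list above, and translating the names into emulator button presses is game-specific and omitted):

import numpy as np

ACTION_NAMES = ["up", "down", "down left", "down right",
                "left punch", "right punch", "left kick", "right kick"]

probs = np.array([0.05, 0.05, 0.05, 0.05, 0.40, 0.10, 0.20, 0.10])  # example softmax output
action_index = int(np.argmax(probs))             # greedy choice -> 4
print(action_index, ACTION_NAMES[action_index])  # the "left punch" command is issued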
Policy, Actor, or Strategy Network: The policy, actor, or strategy (π) network comprised a convolutional neural network (CNN) with three convolutional layers linked to three fully connected layers. In the first layer, an observation matrix with a size of 84 × 84 × 1 was used as the input matrix. In this layer, 32 3D kernel matrices with a size of 8 × 8 × 1 were utilized. To perform the convolution operation with the 84 × 84 × 1 input matrix, each kernel matrix was scanned across all possible 8 × 8 × 1 submatrices of the input matrix. The process started from the top left corner of the input matrix and finished at the bottom right corner of the observation-input matrix, in a sequential manner.[76] We utilized a stride operation with a step size of 4 × 4, meaning that when the kernel matrix was shifted across the input matrix, a rightward shift skipped four columns between two submatrices, whereas a downward shift skipped four rows between two 8 × 8 × 1 submatrices. An element-wise multiplication was administered between the elements of the 8 × 8 × 1 kernel matrix and those of the 8 × 8 × 1 submatrices, and the elements of the resulting 8 × 8 × 1 matrix were then summed to generate one intermediate scalar value. Using this methodology, the calculated output size of the convolution of each kernel matrix with the input matrix was a 20 × 20 matrix, i.e., ((84 − 8)/4 + 1) × ((84 − 8)/4 + 1) = 20 × 20.[77] This means that for each kernel convolution on the input matrix, a total of 400 intermediate scalar values were generated from the stride operation, which resulted in the 20 × 20 matrix. As 32 kernel matrices were utilized, the output matrix from the first layer was a 20 × 20 × 32 matrix. The rectified linear-unit (ReLU) activation function was then applied to the matrix. The 20 × 20 × 32 matrix was used as the input matrix for the second convolutional layer. In the second layer, 64 3D kernel matrices with a size of 4 × 4 × 32 were utilized. We used a stride operation with a step size of 2 × 2.
To perform the convolution operation with the 20 × 20 × 32 input matrix of the second layer, each kernel matrix was shifted across all possible 4 × 4 × 32 submatrices of the input matrix. The calculated output size of the convolution of each kernel matrix with the input matrix was a 9 × 9 matrix, viz., ((20 − 4)/2 + 1) × ((20 − 4)/2 + 1) = 9 × 9. As 64 kernel matrices were utilized, the output matrix from the second layer was a 9 × 9 × 64 matrix. The ReLU activation function was then administered to the matrix. The 9 × 9 × 64 matrix was used as the input matrix for the third convolutional layer. In the third layer, 64 3D kernel matrices with a size of 3 × 3 × 64 were utilized. We used a stride operation with a step size of 1 × 1. To perform the convolution operation with the 9 × 9 × 64 input matrix of the third layer, each kernel matrix was shifted across all possible 3 × 3 × 64 submatrices of the input matrix. The calculated output size of the convolution of each kernel matrix with the input matrix was a 7 × 7 matrix, i.e., ((9 − 3)/1 + 1) × ((9 − 3)/1 + 1) = 7 × 7. As 64 kernel matrices were utilized, the output matrix from the third layer was a 7 × 7 × 64 matrix. The ReLU activation function was then administered to the matrix. The 7 × 7 × 64 matrix was flattened to generate a 1D matrix with 3136 elements. The matrix was then passed through a fully connected network with three layers, viz., one input layer, one hidden layer, and one output layer. The input layer consisted of 3136 neurons corresponding to the size of the 1D matrix, while the hidden layer comprised 512 neurons with the ReLU activation function. The output layer comprised eight neurons, and the softmax activation function was utilized. The actions were selected by generating a softmax distribution over all eight possible actions. The distribution constrained the sum of the action probabilities to 1. As a result, the action that disclosed the highest probability was chosen as the output action. The action was implemented in the game environment, which generated a new observation input that was delivered to the same policy network for the next action. The procedure was repeated until the timer expired or a character was defeated. The policy network was utilized in both the training process and the network-inference process.
Critic or Value Network: The critic or value (χ) network utilized the same network architecture as the policy network, except for the final layer. The critic network consisted of a CNN with three convolutional layers linked to three fully connected layers. However, the output layer comprised only one neuron with a hyperbolic-tangent activation function. The hyperbolic-tangent activation function generated a value for the observation input that was delivered to the policy network. Moreover, the critic network was utilized only during the training process, wherein the generated value was stored in the rollout buffer to be used for the update of the policy network and the critic network in subsequent training epochs. The critic network was not utilized after the training process.
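A minimal PyTorch sketch of the network described above (our reading of the architecture, not the authors' code): three convolutional layers, a 512-unit hidden layer, an eight-way softmax actor head, and a single tanh critic head sharing the same feature extractor.

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, n_actions=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9 -> 7x7
            nn.Flatten(),                                           # 7 * 7 * 64 = 3136
            nn.Linear(3136, 512), nn.ReLU(),
        )
        self.actor = nn.Sequential(nn.Linear(512, n_actions), nn.Softmax(dim=-1))
        self.critic = nn.Sequential(nn.Linear(512, 1), nn.Tanh())

    def forward(self, obs):
        h = self.features(obs)
        return self.actor(h), self.critic(h)

# one 84 x 84 grayscale observation -> action probabilities and a state value
net = ActorCritic()
probs, value = net(torch.zeros(1, 1, 84, 84))
action = int(torch.argmax(probs, dim=-1))  # greedy action selection, as described above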
Features: Game features were passed to the neural network in the form of an observation vector. The game features comprised the hit points (HPs) of the character, the positions of the characters, the current stance, i.e., whether the character was in the offensive state or the defensive state, the relative distance between the characters, and the remaining game time. These features were encoded in the red, green, and blue (RGB) pixels of the screen. The health bar, which contained information on the HP of the character, was represented by pixels at the top of the screen, whereas pixels at the center of the screen described the positions of the characters. The countdown timer, which disclosed time values from 99 to 0, was associated with pixels at the top center of the screen. The RGB screen pixels were converted into an 84 × 84 grayscale matrix by normalizing the previous values to new values between 0 and 1. The matrix was then utilized as an input to a nature-based CNN.[4,78,79]
Rewards: The reward function, r, described the linear sum of reward components based on the transition between the previous state s_{t-1} and the current state s_t. The components comprised the damage-delivered reward, r_dd, the damage-received penalty, p_dr, the time penalty, p_t, the distance penalty, p_d, and the winning reward, r_w. This was summarized in Equation (1), where n_dd, n_dr, n_t, and n_d represented the weights of the damage-delivered reward, damage-received penalty, time penalty, and distance penalty, respectively.
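The equation itself does not survive the text extraction; a plausible form consistent with the component definitions given here (our reconstruction, not necessarily the exact published expression) is

r(s_{t-1}, s_t) = n_{dd}\, r_{dd} + n_{dr}\, p_{dr} + n_{t}\, p_{t} + n_{d}\, p_{d} + r_{w} \qquad (1)

where the penalty terms p_{dr}, p_{t}, and p_{d} take nonpositive values as defined below.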
Damage-Delivered Reward, r_dd: The agent was given a reward for delivering damage to the opponent. The reward was described by a value calculated as the difference between the opponent's health in the previous state and that in the current state. The value was retained if the agent delivered damage to the opponent and set to zero otherwise. The retained values were further normalized to final values between 0 and 1 by scaling the retained value by 176, i.e., the original health value. We utilized n_dd with a value of 2 to incentivize the agent to fight aggressively.
Damage-Received Penalty, p_dr: The agent was given a penalty when it received damage from the opponent. The penalty was described by a value obtained as the difference between the agent's health in the current state and that in the previous state. The value was retained if the agent received damage and set to zero otherwise. The retained values were further normalized to final values between 0 and 1 by scaling the retained value by 176, viz., the original health value. As the focus was on offense rather than defense, we utilized n_dr with a value of 1.
Time Penalty, p_t: The matches ended when the HP of a character or the remaining time became zero. To avoid long-duration matches, we motivated the agent to end the game quickly by penalizing it according to the amount of time passed in the match, i.e., the time utilized to defeat the opponent. A constant penalty value of p_t = −3.4 × 10⁻⁴ was harnessed for each time step. The n_t with a default value of 1 was used, and the n_t value was modulated for different scenarios (Table S2, Supporting Information).
Distance Penalty, p_d: It was ideal for the agent to stay close to the opponent to deliver hits successfully and end matches rapidly. The penalty was represented by a value attained as the difference between the x coordinate of the agent and that of the opponent, to which a negative sign was assigned. The resulting value was divided by 3 × 10⁵ to incentivize the agent to remain close to the opponent. We utilized n_d with a default value of 1.
Winning Reward, r_w: The agent was given this reward at the end of the match, when the remaining time or the HP of a character became zero. The r_w was 1 if the agent won the match; in the case wherein the agent lost the match, r_w = −0.5.
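Putting the five components together, the following compact Python sketch (our illustration, assuming the constants quoted above; the authors' exact implementation may differ, e.g., in how the distance term is signed) computes the shaped reward for one transition. Health values are on the game's 0-176 scale, and x positions are in screen pixels.

def shaped_reward(prev_opp_hp, opp_hp, prev_agent_hp, agent_hp,
                  agent_x, opp_x, won=None,
                  n_dd=2.0, n_dr=1.0, n_t=1.0, n_d=1.0):
    r_dd = max(prev_opp_hp - opp_hp, 0) / 176.0       # damage delivered, scaled to [0, 1]
    p_dr = min(agent_hp - prev_agent_hp, 0) / 176.0   # damage received (<= 0)
    p_t = -3.4e-4                                     # constant per-step time penalty
    p_d = -abs(agent_x - opp_x) / 3e5                 # distance penalty (absolute gap assumed)
    r_w = 0.0 if won is None else (1.0 if won else -0.5)  # end-of-match reward
    return n_dd * r_dd + n_dr * p_dr + n_t * p_t + n_d * p_d + r_w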
Training Algorithm: In this work, we utilized three different models, viz., A2C, PPO, and DP-PPO, to train the SF R2 agent. All three models were developed based on the actor-critic methodology, and the models disclosed different objectives of the policy, i.e., π. Through interactions with the game environment, the algorithm obtained a small batch of experiences, viz., ⟨s, a, r, v, l⟩, to update the decision-making policy. The previous batch of experiences was discarded, and a new batch of experiences was attained utilizing the updated policy. This is known as the "on-policy learning" approach, where each new batch of experiences is utilized once per policy update.
Actor-Critic Method: The actor-critic methodology utilized two models, i.e., the actor model (represented as π) and the critic model (the symbol χ was harnessed). Both the π and the χ comprised a CNN with three convolutional layers linked to three fully connected layers. However, the models revealed a different number of nodes in the final layer. The π learnt an action, a_t, that was administered in the current state, s_t, of the game environment at a timestep, t. The π obtained the s_t, which was a grayscale-image array that summarized the game features, and subsequently generated the a_t, corresponding to the movement, viz., forward/backward, or maneuver, e.g., punch/kick, of the character. Finally, the a_t was passed to the game environment, which created the new state s_{t+1} and the reward r_t.
The SF R2 obtained positive feedback in the form of a large reward when a hit was delivered to the opponent, and when the SF R2 was hit, the SF R2 agent attained negative feedback in the form of a small reward. The π comprised a final layer with eight nodes because there were eight independent discrete actions. The softmax function was utilized as the activation element in the final layer to denote probability values for the eight actions.
The χ learnt to evaluate whether the action a_t implemented by the π in the current state s_t resulted in an enhanced next state s_{t+1} by generating a rating value v_t. The π then utilized the v_t to determine strategies to optimize the choice of actions. The χ comprised a final layer with one node, which generated a real number/value. The hyperbolic-tangent function was utilized as the activation element of the final layer.
The idea[82] involved differentiating the policy objective, L^Policy(θ), with respect to the weight of the policy, θ, and subsequently computing the estimator value of the policy gradient, i.e., the gradient of L^Policy(θ), for updating the parameters of the policy, π.[83-85] The policy comprised one of the three actor models, viz., A2C, PPO, or DP-PPO.
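The objective itself is not reproduced in the extracted text; the standard policy-gradient form consistent with the description that follows (our reconstruction of Equation (2)) is

L^{\mathrm{Policy}}(\theta) = \hat{\mathbb{E}}_t\left[\, \log \pi_\theta(a_t \mid s_t)\, A_t \,\right] \qquad (2)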
where π_θ was the stochastic policy parametrized by the policy weight θ, a_t was the action generated by the π_θ given the observation s_t at the timestep t, and A_t was defined as the estimator value of the advantage function at the same timestep, as described in Equation (4).
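Equation (4) is likewise missing from the extraction; a form consistent with the description below (our reconstruction) is

A_t = \left(\sum_{k=t}^{T} \gamma^{\,k-t}\, r_k\right) - v(s_t) \qquad (4)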
Here, A_t comprised the discounted rewards r_t from the timestep t to the terminal timestep T with the discount factor γ. This indicated that the policy was run for T timesteps in the environment to collect the current batch of experiences ⟨s_t, a_t, r_t, v_t, l_t⟩, where t lay in the interval [0, T], before being updated. The v(s_t) denoted the value at the timestep t that was obtained from the critic model χ using the observation s_t. Furthermore, E_t[…] was the expectation function that represented the mean over the current batch of experiences, which alternated between the optimization process and the sampling process.
To avoid large updates to the policy model, recent studies proposed modifications to the policy objective utilized in the PPO model,[88] as described in Equation (5). We represented R_t(θ) as the probability ratio π_θ(a_t | s_t)/π_θ_old(a_t | s_t), where R_t(θ_old) = 1, which denoted the change in the policy. The idea involved penalizing modifications to the policy that resulted in an R_t(θ) value deviating away from one.
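The clipped objective referred to here is, in its standard form (our restatement of Equation (5)),

L^{\mathrm{Policy}}(\theta) = \hat{\mathbb{E}}_t\left[\min\!\left(R_t(\theta)\, A_t,\; \mathrm{clip}\!\left(R_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) A_t\right)\right] \qquad (5)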
where ε was the constant or hyperparameter that limited the amount of change, R_t(θ)A_t was the unclipped objective, i.e., the probability ratio R_t(θ) was not bounded and was allowed to change freely, and clip(R_t(θ), 1 − ε, 1 + ε)A_t was the clipped objective that prevented R_t(θ) from exceeding the interval [1 − ε, 1 + ε] when the attained ratio was small or large. If the ratio was small, the clip term returned 1 − ε, and it returned 1 + ε for the case where the ratio was large. By utilizing the minimum value of the clipped and unclipped objectives, the change in the probability ratio was ignored when it resulted in an enhanced objective, and it was considered when it led to a deteriorated objective. Overall, this prevented drastic changes in policy updates. In this model, the generalized advantage estimation (GAE) element A_t was utilized (Equations (6) and (7)). Here, λ was defined as the GAE-lambda hyperparameter, which was a measure of the degree of dependence on the current value estimate for calculating updated value estimates.[18,89] Low values of λ were associated with a strong dependence on the current value estimates (high bias, low variance), whereas high values of λ were associated with a strong dependence on the actual rewards from the environment (low bias, high variance). Moreover, a difficulty emerged from the trade-off between the bias and the variance, and finding an optimal value of λ facilitated the training process. Furthermore, the models disclosed that variations in the λ value influenced the agent performance for the SF R2 trained with the PPO model, as illustrated in Figure 2e.
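The GAE recursion itself is not reproduced in the extracted text; the standard form consistent with this description (our reconstruction of Equations (6) and (7)) is

A_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^{l}\, \delta_{t+l} \qquad (6)

\delta_t = r_t + \gamma\, v(s_{t+1}) - v(s_t) \qquad (7)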
DP-PPO Model: Recent studies demonstrated the utilization of a decay function/eligibility trace in the clipping-value range [1 − ε, 1 + ε] by harnessing a clipping value/hyperparameter ε that decreases with an increasing number of training epochs (an epoch was defined as an update to the parameters/weights of the model), because a constant ε value led to a greedy approach and underexploration of the environment.[47,90,91] Inspired by this methodology, we considered a mathematical equation to represent the decay function for the clipping range (Equation (8)) and the policy objective of the DP-PPO algorithm (Equation (9)), where t represented the current timestep, T_final was the maximum timestep of the entire training process, ε_0 was the initial clipping value at t = 0 (the default value was 0.02), ε(t) described the decreased clipping value at the current timestep t, v was the decay exponent, and A_t denoted the GAE element. We observed that the value of (T_final − t)/T_final varied from 1 (the initial state of the training process) to 0 (the final state of the training process).
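Equations (8) and (9) are not reproduced in the extracted text; a reconstruction consistent with the mapping check discussed below (our reading, not necessarily the exact published form) is

\varepsilon(t) = \varepsilon_0 \left[\, 1000 - 999\,\frac{T_{\mathrm{final}} - t}{T_{\mathrm{final}}} \,\right]^{-v} \qquad (8)

L^{\mathrm{DP-PPO}}(\theta) = \hat{\mathbb{E}}_t\left[\min\!\left(R_t(\theta)\, A_t,\; \mathrm{clip}\!\left(R_t(\theta),\, 1-\varepsilon(t),\, 1+\varepsilon(t)\right) A_t\right)\right] \qquad (9)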
We utilized the electronic signature of PCM systems, viz., the conductance drift phenomenon, to generate the decay function for the training process. In this work, the drift exponent v = 0.1 was harnessed, which was obtained from the PCM material GeSbTe (GST) doped with nitrogen, for training the DP-PPO model (Figure S1, Supporting Information). We considered the conductance-drift phenomenon of PCM materials in Equation (10), where G_0 represented the initial conductance measured at a time t_0, G(t*) denoted the conductance at the time t*, and v was the drift exponent. To train the DP-PPO model from t = 0 to t = T_final, we mapped the measured conductance value G(t*) from t* = t_0 to t* = 1000t_0 (the natural drift process) to the decreasing clipping value ε(t) in Equation (8).
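Equation (10) is the standard conductance-drift relation (our restatement):

G(t^*) = G_0 \left(\frac{t^*}{t_0}\right)^{-v} \qquad (10)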
For the mapping procedure, in the software, at each update timestep t of the training process, we observed the progress remaining in the training process, i.e., the ratio of the number of remaining training epochs to the total number of training epochs, (T_final − t)/T_final, which was predetermined by fixing the number of epochs in the training process. For instance, if we utilized N epochs, the progress remaining at the first epoch was given by (N − 1)/N, because the update steps/intervals were uniform. For the PCM element, this process was achieved by examining t* (the time for the PCM element), which corresponded to t (the time in the software). This was administered by calculating the total number of epochs N required for training the model and then programming the conductance reader to partition the time interval [t_0, 1000t_0] into N − 1 uniform time intervals. Finally, the conductance value at each discrete time point t* = t_0 + i·999t_0/(N − 1) was read, where i described the epoch. The conductance reader was utilized to measure the conductance of the nitrogen-doped GeSbTe at the time t* at the onset of the epoch i, and the measured conductance was then scaled by a factor k, where k = ε_0/G_0. Finally, the scaled value kG(t*) was mapped to the decreasing clipping value. This process was described in Equation (11), where Equation (10) was multiplied by the factor k.
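Multiplying Equation (10) by k = ε_0/G_0 and using the time mapping above gives (our reconstruction of Equation (11))

k\, G(t^*) = \varepsilon_0 \left(\frac{t^*}{t_0}\right)^{-v} = \varepsilon_0 \left[\, 1000 - 999\,\frac{T_{\mathrm{final}} - t}{T_{\mathrm{final}}} \,\right]^{-v} \qquad (11)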
The term on the right-hand side of Equation (11) agreed well with that of Equation (8). For instance, at the time t = 0 (t* = t_0), 1000 − 999(T_final − t)/T_final = t*/t_0 = 1, indicating that the terms of both equations were equal. Moreover, at the time t = T_final (t* = 1000t_0), 1000 − 999(T_final − t)/T_final = t*/t_0 = 1000. This further indicated that the mapping between Equations (8) and (11) was valid. By utilizing the PCM-mapping approach to create the decay function, we were able to utilize a smaller number of multiplication/exponentiation operations relative to traditional software approaches. In Equation (8), when the conventional software approach was harnessed, the inner term (T_final − t)/T_final was multiplied by 999 (a multiplication operation), and the resulting bracketed term was then raised to the power of −v (an exponentiation operation).
Finally, the exponentiated term was multiplied by the ε_0 (a multiplication operation). Overall, two multiplication operations and an exponentiation operation were utilized per epoch. When the PCM-mapping approach was harnessed, the conductance reader was used to read the conductance at discrete time points, and a single multiplication operation was performed, i.e., the conductance value was scaled by the factor k = ε_0/G_0. The exponentiation operation was not utilized because the conductance decrease was a natural physical process. This indicated that the PCM-mapping approach enabled savings of a multiplication operation and an exponentiation operation for each epoch, or N multiplication operations and N exponentiation operations for N epochs. In this work, we used the PCM-mapping approach, viz., we mapped the ε(t) value used for performing updates to the policy network, to bypass the computational burden of computing the typical software-based decreasing clipping value. Recent studies illustrated the importance of enhancing the efficiency of algorithms for fundamental computations.[92] The methodology utilized in this work could potentially benefit large-scale reinforcement learning processes, i.e., cases where a decreasing clipping value outperforms constant clipping values, by minimizing the computational load, viz., the number of multiplication/exponentiation operations.
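A software stand-in for this procedure is sketched below (our illustration; the hardware conductance reader is replaced by the closed-form drift law, and G0 and t0 are placeholder units). Epoch i is mapped to a read time t* in [t0, 1000*t0], the drifted conductance is "read", and scaling by k = eps0/G0 yields the clipping value.

import numpy as np

def drift_clip_schedule(num_epochs, eps0=0.02, v=0.1, G0=1.0, t0=1.0):
    eps = np.empty(num_epochs)
    for i in range(num_epochs):
        t_star = t0 * (1.0 + 999.0 * i / (num_epochs - 1))  # partition [t0, 1000*t0]
        G = G0 * (t_star / t0) ** (-v)                      # Eq. (10): conductance drift
        eps[i] = (eps0 / G0) * G                            # Eq. (11): scale by k = eps0/G0
    return eps

schedule = drift_clip_schedule(8000)
print(schedule[0], schedule[-1])  # ~0.02 at the first epoch, ~0.01 at the last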
Model Parameter Update: The A2C, PPO, and DP-PPO models utilized the actor-critic methodology, which involved a neural-network architecture that shared the policy weight θ between the actor model π and the critic model χ. To update the θ using optimization, we constructed a loss function, i.e., a total objective, that combined the policy objective for the π, which generated optimal actions, and the value-function loss for the χ, which created the value. The total objective was further augmented by including an entropy-type bonus for sufficient exploration of the environment to avoid overexploitation by the agent.[86,93,94] The total objective was represented in Equation (12).
Here, c_1 and c_2 were coefficients, S described the entropy bonus, and L_t^value(θ) represented the value-function loss, also known as the squared-error loss between the value predicted by the χ and the actual value, denoted V_t^target. As we utilized the maximization operation for the total objective with respect to the θ, the negative of the value-function loss was minimized in Equation (12). The total objective, with a difference in the L_t^policy(θ), was utilized for the three models, as shown in Equations (2), (5), and (9). The hyperparameter values harnessed in these three models are summarized in Table S3, Supporting Information.
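A plausible form of the total objective consistent with this description (our reconstruction of Equation (12)) is

L_t^{\mathrm{total}}(\theta) = \hat{\mathbb{E}}_t\left[\, L_t^{\mathrm{policy}}(\theta) - c_1\, L_t^{\mathrm{value}}(\theta) + c_2\, S[\pi_\theta](s_t) \,\right], \qquad L_t^{\mathrm{value}}(\theta) = \left( V_\theta(s_t) - V_t^{\mathrm{target}} \right)^2 \qquad (12)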
Training Scenarios: The ability of the SF R2 to defeat an opponent required the mastery of specified skills, such as the capability to maintain an appropriate distance from the opponent, predict the timing of an offensive, and respond to a maneuver administered by opponents. To familiarize the SF R2 with the targeted maneuvers utilized by different opponent characters (different opponent characters exhibited altered maneuver behaviors), we trained six different SF R2 models for six different opponent characters. If the agent was trained with a single opponent character, it became accustomed to the maneuver behavior of that opponent character, i.e., the ensemble of maneuvers utilized for defeating it, and was not able to perform well against a new opponent character using the same assemblage of moves. For instance, the opponent character Dhalsim specialized in long-range offensives. If the agent learnt to administer a large population of offensives from an extended distance to avoid receiving a hit, it was not able to implement a high population of short-range offensives, which were required to defeat aggressive opponent characters such as Ken. Figure S4, Supporting Information, disclosed the results of different scenarios where we trained two SF R2 models separately to achieve a short time utilized to defeat the in-built opponent characters Dhalsim and Ken. Each model was trained for 150 M steps, corresponding to a maneuver time of ≈48 h.
A limitation of the agent that was trained solely with the in-built opponent was the lack of robustness. For instance, training the agent to defeat the in-built opponent rapidly resulted in agents that performed only the same assemblage of offensives, viz., the agent performed spinning kicks/threw fireballs continuously, which minimized the time utilized to defeat the opponent. As the in-built opponent performed its maneuvers in a deterministic manner, i.e., in-built opponents administered almost the same ensemble of maneuvers from a look-up table by observing the agent movement, the agent learnt to counter the predetermined maneuvers of an in-built opponent via look-up tables rather than through movements based on logic.
Based on suggestions from an expert player, we implemented two different training scenarios to improve the agent robustness. In the first scenario, we trained the agent with three variations of an opponent that exhibited different behaviors, viz., deterministic, unpredictable, and both, so that the agent learnt to counter different opponent behaviors. In the second scenario, we trained the agent with an opponent that utilized different movement styles, i.e., uppercut, dashing, and both, to increase the probability of successfully delivering a large-damage maneuver to opponent characters.
In the first scenario, we chose Dhalsim as the opponent character because it was one of the most difficult opponent characters to play against in the game. Dhalsim specialized in long-range offensives, which rendered it difficult for the agent to deliver hits from a close range. We trained the agent with three different types of Dhalsim-based opponents: 1) a customized opponent, 2) the in-built opponent, and 3) a hybrid population of customized and in-built opponents. For the customized opponent, we created an assemblage of random maneuvers in the action space of the opponent character Dhalsim, viz., the state file utilized for the two-player mode was modified. In each step of the training process, the opponent disclosed a random maneuver and the agent learnt to defeat it. For the in-built opponent, the agent was trained purely with default computer opponents that exhibited maneuvers in a deterministic/predictable manner. For the hybrid population of customized and in-built opponents, a two-step approach was utilized, viz., we trained the agent with the customized opponent followed by in-built opponents, and the agent learnt to defeat both opponent types. To evaluate the agent performance after training agents with the different opponent varieties, a new strong AI opponent was built. To build the strong AI opponent, we created a model (we called it model E) that disclosed a high degree of aggressiveness by modulating the coefficients of the reward function (Table S1, Supporting Information) and training the model with an in-built opponent for two days. We then utilized the strong AI opponent in a two-player mode to assess the performance of agents trained with the customized opponent only, solely the in-built opponent, and the hybrid population of customized and in-built opponents. Figure S10, Supporting Information, reveals a summary of the training procedure.
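The customized-opponent idea and the two-step hybrid curriculum can be sketched as follows, assuming the opponent's available maneuvers can be enumerated; the identifiers and the switch point are hypothetical.

```python
import random

class RandomManeuverOpponent:
    """Customized opponent: at every step it draws a maneuver uniformly at
    random from its action space, so the agent cannot simply memorize a
    deterministic look-up-table behavior."""

    def __init__(self, opponent_action_space):
        self.actions = list(opponent_action_space)

    def act(self, observation=None):
        # The observation is ignored: this opponent is deliberately unpredictable.
        return random.choice(self.actions)


def hybrid_curriculum(total_steps, switch_fraction=0.5):
    """Two-step hybrid training: customized (random) opponent first, then the
    in-built opponent; the switch point is an illustrative assumption."""
    switch_step = int(total_steps * switch_fraction)
    return [("customized", 0, switch_step), ("in_built", switch_step, total_steps)]
```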
Figure 2b illustrates the importance of these scenarios. The SF R2 that was trained with only the in-built opponent disclosed a weak agent performance, viz., a weak-to-medium health level of 10-55% was achieved. A moderate agent performance was exhibited by the SF R2 that was trained with the customized opponent, i.e., the SF R2 agent attained a medium-to-strong health level of 30-100%. Finally, the SF R2 that was trained with the hybrid population of customized and in-built opponents exhibited an excellent agent performance (a strong health level of 60-100% was obtained). These results revealed that training the SF R2 with different types of opponents played an important role in enhancing the agent performance.
In the second scenario, we analyzed the situation in which the agent hit the opponent with a large-damage maneuver, i.e., a hit that decreased the opponent health level by 20% or more, with a high success rate. Based on feedback from an expert player, we considered cases in which the opponent performed 1) an antiair offensive only, 2) solely a dash forward, and 3) both the antiair offensive and the dash forward. For the cases wherein the opponent administered the antiair offensive only or purely the dash forward, we restricted the action space/limited the number of irrelevant maneuvers for the opponent. For the case in which the opponent implemented both the antiair offensive and the dash forward, the union of both action spaces was integrated into the training process. In this scenario, Guile was chosen as the opponent character. Figure 2c illustrates that, for the late stages of the training process, the agent trained with the opponent that utilized both the antiair offensive and the dash forward was able to hit the in-built opponent with a large-damage maneuver with a higher probability than in the cases where the opponent administered only the antiair offensive or solely the dash forward. This indicated that the agent learnt to implement large-damage maneuvers and deliver hits on the opponent successfully. However, we noted that random-sampling fluctuations from the rollout buffer might lead to the forgetting of skills between successive training epochs, resulting in slight fluctuations in the probability traces at the converged state.
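The action-space restriction amounts to selecting the opponent's admissible maneuver subset per scenario; a minimal sketch with hypothetical maneuver labels is shown below.

```python
# Hypothetical maneuver labels; the real action encodings are not specified here.
ANTIAIR = {"antiair_offensive"}
DASH = {"dash_forward"}

OPPONENT_ACTION_SETS = {
    "antiair_only": ANTIAIR,      # case 1: antiair offensive only
    "dash_only": DASH,            # case 2: dash forward only
    "both": ANTIAIR | DASH,       # case 3: union of both action subsets
}

def restricted_opponent_actions(scenario):
    """Return the admissible opponent maneuvers for a given training scenario."""
    return OPPONENT_ACTION_SETS[scenario]
```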
Policy Selection: In recent machine-learning studies, once a model reached the convergence state, i.e., the model revealed optimal signatures, training the model for an additional number of epochs did not enhance the policy performance. In the reinforcement learning performed in this work, however, continuous exploration and random sampling of experiences from the rollout buffer took place when stochastic models, such as A2C, PPO, or DP-PPO, were trained. As a result, the policy performance varied when a converged model was trained for an extra number of epochs, as shown in Figure 2c. Training converged models for a larger number of epochs created policies that disclosed a variation in the probability of delivering a large-damage maneuver for the model trained with the in-built opponent. For instance, a policy obtained by training a converged model for an increased number of epochs exhibited a combination of small-damage and large-damage maneuvers, rather than large-damage maneuvers only, compared to a policy obtained from a model trained for a small number of epochs. Thus, our interest was in developing a policy with an excellent maneuver signature to compete with a human player.
The agent policy was sampled every ≈800,000 steps during the training of the model for each in-built opponent character, out of a total of 50 million steps/3000 epochs. A total of 62 different agent policies were sampled, and we utilized six different in-built opponent characters. We sampled the last few policies, i.e., those from the last several epochs, from the model that utilized the shortest time to defeat the in-built opponent, viz., the DP-PPO, and evaluated the policy performance over several tens of matches against in-built opponents. Model metrics, such as the time harnessed to defeat the opponent, the health level, the movement style, and the ability to avoid special combinations of offensives from the opponent, were examined. The top three policies in terms of policy performance were considered for the competitive match with the SF player. To choose the best policy out of the three, we performed a final round of selection in which the model played with human opponents. The human opponents administered actions that differed from those of the in-built opponent in various scenarios, and the selection criteria were based on the ability of the agent to counter special maneuvers and the degree of aggression exhibited, viz., the time utilized to defeat the opponent. The best policy was then harnessed for two competitive events.
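The sampling and ranking step can be sketched as follows; the checkpoint interval reflects the ≈800,000-step figure quoted above, whereas the ranking key is a simplification, since the full selection also weighed the movement style and the ability to avoid special offensive combinations.

```python
SAMPLE_EVERY = 800_000       # approximate policy-sampling interval (steps)
TOTAL_STEPS = 50_000_000     # total training steps per opponent character

def checkpoint_steps():
    """Steps at which the agent policy is snapshotted (62 checkpoints here)."""
    return list(range(SAMPLE_EVERY, TOTAL_STEPS + 1, SAMPLE_EVERY))

def top_policies(evaluations, k=3):
    """Rank evaluated checkpoints: shorter time to defeat the opponent first,
    higher remaining health second; return the top k for the final human round."""
    ranked = sorted(evaluations, key=lambda e: (e["time_to_defeat"], -e["health"]))
    return ranked[:k]
```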
In the first event, we invited a top SF player to assess the agent performance; the event involved both time-trial matches and head-to-head matches. In the time-trial match, the SF player competed with in-built opponents. The SF player utilized the main character Ryu, and six different opponent characters, viz., Guile, Ken, Chun li, Blanka, Dhalsim, and Ehonda, were harnessed by the in-built computer (Figure 3i). The results were compared with those of the agent playing against the same ensemble of in-built opponents. The SF player then competed with the agent in the head-to-head match. The best agent policy for each opponent character, i.e., the DP-PPO, was used. The agent harnessed the main character Ryu, and the SF player utilized six different opponent characters (Guile, Ken, Chun li, Blanka, Dhalsim, and Ehonda). Figure S5, Supporting Information, discloses the results. The SF player mentioned that some agent types were aggressive and that the agents utilized maneuvers randomly, e.g., the agent implemented kicks and punches at a fixed location even though the opponent was distant and stationary. As a result, when the agent performed random maneuvers, it was vulnerable to opponent offensives and often unable to recover upon receiving a hit.
The models were retrained based on the SF player's suggestions, and the training scenarios were modified. For instance, the agent was trained with a hybrid population of in-built opponents and customized opponents (Figure 2b), and the movement stance of opponents was altered to reveal more complicated maneuver situations (Figure 2c). Moreover, we analyzed the components of the reward function (Figure 2a), and the models were retrained for a larger number of steps/epochs, viz., an extra 150 million steps/9000 epochs, or a longer duration, i.e., an additional two days. We then utilized the same selection procedure to choose the agent with the best policy to compete with the SF player in the second event.
In the second event, the agent disclosed a new score of 4-2 compared to the previous score of 3-3 (Figure 3, Figure S5, Supporting Information), indicating an enhancement of the agent performance. Although the agent was not able to defeat the SF player for the opponent characters Blanka and Dhalsim, the agent was capable of delivering a large degree of damage to the opponent (Figure 3d,e and Figures S5d,e, Supporting Information). The SF player was also not able to defeat the in-built opponents for the characters Blanka and Dhalsim, as these opponent characters possessed skill advantages, such as the long-range arm offensive of Dhalsim and the rolling-body offensives of Blanka. To defeat the SF player using these characters, the agent could be trained further with a copy of the SF player through imitation learning to enhance the policy performance. [40,95,96]
Fairness for Agent and Human: As computers and humans behaved differently, it was challenging to render competitions between AI computers and human players fair. The goal was to create a sufficiently fair competition, for instance, by minimizing unfair advantages that the AI computer possessed over the human player. This was achieved by implementing game modifications. The agent differed from the human player in the following aspects.
The first aspect was perception. The agent was able to visualize the entire screen image simultaneously, i.e., the agent was capable of observing the health bar values, the remaining time of the match, the distance between the opponent and the agent, and the movement environment at the same time. In contrast, typical humans were not able to obtain this information concurrently to generate well-informed decisions for implementing a new set of actions. However, information on the remaining time was not required by the typical human player, and human players required information about the opponent position and stance only. This information was sufficient to defeat the opponent because archetypal human players were able to implement a maneuver according to a selected style of play, e.g., the human player revealed defensive maneuvers for maintaining a strong health level or aggressive maneuvers for defeating opponents rapidly. Thus, the advantage possessed by the agent became minimal.
The second aspect was ethics. The agent administered maneuvers in an unethical manner, unlike human players. For instance, special maneuvers, viz., the spinning kick, were implemented by the agent immediately after the opponent recovered from the ground, without allowing the opponent a chance to defend against the offensive. Normal human players who observed fair play waited briefly for the opponent to recover before performing new maneuvers. To alleviate this, we modified the reward function by utilizing a small time penalty. This encouraged the agent to play the match for an intermediate duration and, at the same time, maintain the number of hits delivered, so that a reasonable amount of time became available for the opponent to react to the agent's maneuvers (Figure 2d).
The third aspect was deception. Human players utilized deceptive strategies, e.g., the opponent character moved backward to encourage the main character to move toward it, and when the distance between the main character and the opponent character became small enough, the human player administered a large-damage offensive. As the agent aimed to maintain a medium distance between the main character and the opponent character, the main character approached the opponent character regardless of the human player's plans. To mitigate this issue, we trained the agent to avoid maneuvers near stage corners, where the distance between the main character and the opponent character becomes small. This was achieved by applying large penalties when the x-coordinate position of the agent became close to that of the screen edge. As a result, the agent learnt to avoid engaging opponents in unfavorable environments.
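Both fairness-motivated modifications, the small time penalty and the stage-corner penalty, can be viewed as shaping terms added to the reward; a minimal sketch with hypothetical coefficients and a hypothetical screen-coordinate convention is given below.

```python
def shaped_reward(base_reward, agent_x, screen_width,
                  time_penalty=0.001, edge_penalty=1.0, edge_margin=20):
    """Apply the two shaping terms discussed above: a small per-step time
    penalty (fair-play pacing) and a large penalty whenever the agent's
    x-coordinate approaches a stage corner."""
    r = base_reward - time_penalty
    if agent_x < edge_margin or agent_x > screen_width - edge_margin:
        r -= edge_penalty
    return r
```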
The fourth aspect was reaction time. The AI system disclosed a shorter reaction time than the typical human player. To alleviate this issue, we harnessed a small delay time of ≈240 ms/15 frames, i.e., when the agent competed with human players, each action generated by the policy network at the current frame/step was parsed to the game environment 15 frames/steps later, which corresponded well with the reaction time of professional players (200-250 ms).
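The reaction-time handicap can be realized with a fixed-length action queue; a minimal sketch assuming a 15-frame delay and an idle default action is shown below.

```python
from collections import deque

class DelayedActionBuffer:
    """Fairness delay: an action produced by the policy at frame t is only
    forwarded to the game environment at frame t + 15 (about 240 ms)."""

    def __init__(self, delay_frames=15, idle_action=0):
        # Pre-fill with idle actions so the first delay_frames steps stay idle.
        self.queue = deque([idle_action] * delay_frames)

    def step(self, new_action):
        """Enqueue the freshly generated action; return the one due this frame."""
        self.queue.append(new_action)
        return self.queue.popleft()
```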
Tests versus SF/Human Player: We invited one of the best SF players, viz., Brandon Chia (1st place, Capcom Pro Tour 2022, Asia Southeast; 1st position, Intel World Open 2021, Southeast Asia; named among the top 50 SF players in the world by the Global Esports Federation), for the time-trial evaluation and head-to-head matches with the SF R2 agent.
For the time-trial evaluation, the agent disclosed an excellent model performance and defeated the six in-built opponent characters in a shorter time than the SF player (Figure 3i). However, it was challenging to play against the in-built opponent characters Dhalsim and Blanka: the agent required more than 10 s to defeat these opponent characters, and the SF player was not able to defeat them.
In the first event, the agent attained a draw in the head-to-head matches, i.e., a score of 3-3 was achieved (Figure S5, Supporting Information). Once the agent obtained the lead in terms of the damage delivered, the SF player was not able to catch up for most matches, as shown in Figure S5a,c,f, Supporting Information. A close defeat for the agent occurred in the match with the SF player using the opponent character Dhalsim. In that match, the SF player delivered a series of damage at the onset time of ≈35 s to defeat the agent (Figure S5e, Supporting Information). Both the main character and the opponent character received a large degree of damage before the match ended. After the event, the agent was retrained for an increased period based on the feedback from the SF player.
In the second event, the agent won the head-to-head matches with a score of 4-2 (Figure 3a-f). The agent disclosed a ≈10% shorter time to defeat the SF player with the opponent character Guile in the second event compared to the first event (the time decreased from 30 s to ≈27 s) (Figure 3a, Figure S5a, Supporting Information). This was achieved through a new combination of maneuvers created by the agent (Figure 3g), which resulted in an ensemble of large damages delivered to the opponent character at the onset time of ≈5 s. Although the agent lost to the SF player with the character Blanka again, the SF R2 agent was able to deliver a larger degree of damage to the opponent character in the second event than in the first event (Figure 3d, Figure S5d, Supporting Information). The SF R2 agent was also capable of countering a larger population of Blanka's rolling-body offensives in the second event than in the first event.
The normal human players who participated in the time-trial evaluations were as follows: Ralph Toon, Lam Yu En, Yu Jiang, Rose Evangeline Anne, Cheyenne Chua, Brian Lim, Rosham Emmanuel, Maria Prisca, Wang Qiang, Ngai Lam Ho, Simon Chan, Ryan Ho, Cheryl Tan, David Han, and Albert Wijaya.
Testimonials: The following quotes were obtained from the SF player after the two events: "Agent Ryu is more difficult than the in-built opponent, countering a lot of my inputs with Shouryuken to the point that it is almost unplayable. It would be more useful for tournament practice if there is a way to reduce the AI's sensitivity to human inputs. Nevertheless, it is still very fun." -Brandon Chia (First event). "I think it has come to a point where the AI is behaving on its own most of the time by constantly attacking. To make it even better in future, the agent can be trained in a way that behaves on its own 50% of the time and waiting for the opponent to make a mistake for the other 50%. For example, there exists a range of attacks from a distance where Ken is unable to hit it. The AI can try learning to capture this in future so that it is even better for tournament play. I would say it is much more fun and reasonable this time round compared to the previous round." -Brandon Chia (Second event).
Cell Fabrication and Materials Characterization: The PCM element was fabricated utilizing an integrated technique comprising nanopatterning and conventional lithography, according to the previous fabrication procedure. [97,98] Each patterning step comprised electron-beam lithography (JEOL) or 365 nm photolithography (Cannon), followed by materials deposition and a lift-off process. All of the materials were deposited utilizing composite targets in a DC magnetron sputtering system (Blazers Cube). Nitrogen-doped GeSbTe films were deposited by sputtering from a composite GeSbTe target in flowing nitrogen gas at a constant N2/Ar flow-rate ratio of 0.2. The nitrogen-doped GeSbTe films were characterized using X-ray photoelectron spectroscopy (XPS), which showed that the nitrogen concentration in the film was ≈3 at%. A 4″ Si wafer with a 1 μm-thick SiO2 layer was utilized as the starting configuration, on which a 300 nm-thick TiW bottom electrode was deposited and patterned. An insulating layer, comprising a 25 nm-thick layer of SiO2, was deposited and etched to form pores with diameters of ≈40 nm. The openings were filled with a 25 nm-thick layer of nitrogen-doped GeSbTe to form the active region. Finally, a 300 nm-thick TiW top electrode was deposited to complete the structure.
System Structure: PCM elements were deposited on a SiO2-on-Si substrate based on the previous device structure. [97] The PCM element disclosed a pore-type structure comprising a 25 nm-thick nitrogen-doped GeSbTe layer, which was sandwiched between 300 nm-thick top and bottom TiW electrodes. The nitrogen-doped GeSbTe was confined in 40 nm-wide pores formed by a 25 nm-thick SiO2 insulating layer. The electrodes were utilized to connect the test structure to the external circuitry for electrical testing, while the silica insulator provided electrical and thermal insulation.
Electrical Characterization: The PCM element was characterized using a custom-built electrical characterization system comprising a nanosecond pulse generator (Tektronix), a digital oscilloscope (Agilent Technologies), and a probe station, based on the previous testing protocol. [98,99] The nanosecond pulse generator provided a pulse duration varying between 5 and 900 ns, a rise time of <3 ns, and a maximum amplitude of 5 V. The element was connected to the pulse generator and oscilloscope through low-capacitance cables (0.2-3 pF) and a load resistor R_l = 50 Ω. The upper bound of the time constant of the resistance-capacitance circuit was estimated to be several tens of picoseconds. The duration and height of the voltage pulses were varied from several nanoseconds to several tens of nanoseconds and from 0 to 5.0 V, respectively. To ensure good functionality, the elements were switched reversibly more than 100 times between the low resistance level of 10 kΩ and the high resistance level of ≈300 kΩ before the experimental study. The occurrence of crystallization was determined from the change of the resistance level of the element; the resistance change was of the "sudden-drop" type. The resistance of the melt-quenched amorphous state was smaller than the typical value of ≈1 MΩ because a small element size and a thin PCM layer were utilized; the amorphous region was therefore smaller, which lowered the resistance of the amorphous state.

Figure 1. Training of SF R2. a) An example training configuration. The trainer distributes training scenarios to rollout workers, each of which controls a CPU running an instance of the SF. To control a main character, the agent in the worker runs a copy of the most recent policy, π, and critic, χ. For each main character, the agent sends an action, a, to the game. The game computes the next frame and sends each new state, s, to the agent asynchronously. When the game reports that the action has been registered, the agent reports the state, action, reward, value, log-probability tuple <s, a, r, v, l> to the trainer, which stores it in the rollout buffer. To update the π- and χ-function networks, the trainer samples the rollout buffer. The data computation was implemented based on the conductance drift phenomenon of a single PCM element for attaining the decay function. PCM elements in the amorphous state were utilized, and a reset stimulus was administered to the PCM element. b) The 2D pose representation of the main character (left stick figure) and the opponent character (right stick figure) when they are at rest. c) The distribution of learning curves based on the in-built opponent character Guile. The light red line denotes the original training data, whereas the smoothed training data are represented by the red line. The DP-PPO model achieved excellent performance. The models attained a strong performance in 36 h of training, and the model was trained for up to 48 h. d) The distribution of the time utilized to defeat the in-built opponent character Guile for human players. Superimposed on d) is the number of hours that the SF R2 (DP-PPO), using 16 CPUs with a main character each, required to achieve similar performance. e) Histogram of 40 games from the time-trial policy harnessed by the SF R2 (DP-PPO) (orange bars) compared with the 5 shortest times utilized to defeat the in-built opponent character Guile for human players (cyan circles). Similar plots based on the other two in-built opponent characters are shown in Figures S4a-d, Supporting Information. f,g) The training scenarios with the opponent character Guile, including two specialized scenarios where the opponent character performs f) the antiair offensive and g) the dash forward. The snapshots of the video game were adapted and modified from www.myabandonware.com. [100]

Figure 2. SF R2 model ablations. a-d) Evaluation of networks and configurations utilized to train the SF R2. a) Including both damage penalties and the time penalty results in a shorter mean time required to defeat the opponent. Inset: simultaneously, the SF R2 was able to maintain an excellent health level. The error bars represent the range of values from the training performed on three different models. b) The SF R2 was ill prepared/unable to anticipate the opponent maneuver when it was trained with the in-built opponent (indicated by the weak-to-moderate health level), whereas when it was trained with the customized opponent, the SF R2 agent learnt to be well prepared/able to anticipate opponent maneuvers (indicated by the moderate-to-strong health level attained). c) The ablation of elements of the jumping-type training over a range of epochs sampled during training. The y axis measures the agent's ability to perform a large-damage maneuver. d) The SF R2 used a shorter time to defeat the opponent with an increased amount of time penalties; the nonbaseline policies were judged to be unsportsmanlike by test players and judges. e,f) The effects of various settings on agent performance. The baseline settings are colored in a darker shade of cyan in all plots. The range of values from the training administered on three different models is described by the error bars. e) Increasing the GAE-lambda value was advantageous (the mean time utilized to defeat the opponent decreases). f) The SF R2 would not be able to defeat the opponent in a shorter time without the decay-type enhancement to the PPO.

Figure 3. Results for SF R2. a-f) Time evolution of the damage delivered to the opponent for the SF R2 against the SF player with the opponent characters a) Guile, b) Ken, c) Chun li, d) Blanka, e) Dhalsim, and f) Ehonda. The inset of the figure shows the points obtained for each player (bottom right table). The plots disclose that, once the SF R2 obtained the lead in terms of the damage delivered, the SF player could not catch up in most cases. The onset of the large increase in the damage delivered represents the occurrence of a high-damage maneuver. g) An example from the second event in which the SF R2 exhibited a time to defeat the opponent of ≈27 s against the top SF player, with a combination of offensives comprising an upward kick, rotating kicks, and an on-the-ground throw of the opponent, for the opponent character Guile. The example shows that the performance of the SF R2 was contextual: although the SF player tried to defend against the offensive combinations, the SF R2 was able to find new strategies to defeat the opponent in a short length of time. The snapshots of the video game were adapted and modified from www.myabandonware.com. [100] h) The distribution of matches with the time utilized to defeat in-built AI opponents below 30 s for two different opponent characters, demonstrating that the SF R2 has learned to defeat opponents rapidly. i) Results of the time-trial competition in the second event.