Intrinsic Decay Property of Ti/TiOx/Pt Memristor for Reinforcement Learning

Memristor-based reinforcement learning (RL) systems have shown outstanding performance in efficient autonomous decision-making and edge computing. Sarsa (λ) is a classical multistep RL algorithm that records visited states with a λ decay and uses them to guide policy updates, which significantly improves the convergence speed. However, implementing the λ decay on traditional computing hardware is limited by the extensive computation required for power-exponential decay. Herein, the value update equation for Sarsa (λ) is implemented by using the topological structure of a memristor array, without complex circuits. Most importantly, the critical λ decay function is realized by the conductance decay property of a TiOx-based memristor. The energy required for floating-point operations can thereby be significantly reduced while the convergence speed is accelerated. A path planning task is then demonstrated based on the intrinsic conductance decay property and shows outstanding performance. Finally, the round information obtained for the task, which is based on the intrinsic decay property of the TiOx-based memristor, is mapped into a 32 × 32 memristor array in parallel to calculate the value of each round. The experimental results are similar to the simulations. This work thus provides a hardware-enabled scheme for implementing memristor-based RL algorithms.

DOI: 10.1002/aisy.202200455
important the state is. Algorithm 1 shows the pseudocode for the Sarsa (λ) algorithm. Sarsa (λ) has a wide range of applications in fields such as path planning, [5] automatic driving, [14] and scheduling management. [15] However, most RL tasks based on traditional architectures suffer from drawbacks such as high power consumption and slow convergence speed. The main impetus for hardware RL realization has been the development of memristors, [16][17][18][19][20] which are expected to enable an energy-efficient architecture in which adjustable resistors serve for both storage and computation. Recently, memristor-based RL has been widely studied. [21][22][23][24][25][26][27][28] However, in previous studies, the training process of the RL task was mainly accelerated by using the matrix-vector multiplication (MVM) inherent to the memristor array. [22,23] Hardware implementations of Sarsa (λ), which requires a stable λ decay to enhance the convergence speed, have rarely been verified. In addition, the power-exponential decay of λ causes massive floating-point operations during the Q(S, A) update on a digital platform, which leads to extensive energy consumption. [29] This work reports experiments that implement the RL Sarsa (λ) algorithm in a 32 × 32 memristor array and apply it to path planning. The critical λ decay function is represented by the decay property of the TiOx-based memristor. A fast convergence rate of the path planning task is demonstrated based on this intrinsic decay property. The round information for path planning, based on the intrinsic decay property of the TiOx-based memristor, is imported into a 32 × 32 memristor array in parallel to calculate Q(S, A) for each round. Moreover, the decay coefficient of the device is scaled by hardware to match the optimal decay amplitude of the software simulations, so that the λ decay amplitude is controlled with a simple hardware circuit.
The results indicate that the experimental results are similar to the simulated results. This work, thus, provides a fast convergence scheme for the memristor-based RL algorithm implementation.

Results and Discussion
The path planning task has been demonstrated with a memristor-based RL system, as shown in Figure 1a, in which the agent interacts with the environment to find the shortest path. Sarsa (λ) is a multistep RL algorithm with a faster convergence speed, which updates the Q(S, A) of all action-state pairs stored in the Q-table by a λ factor. To implement Sarsa (λ), the path information is first mapped to the TiOx-based memristor after 32 rounds of training. As shown in the upper panel of Figure 1a, the λ decay of Sarsa (λ) is implemented by the conductance drift of the TiOx-based memristor, which is due to the diffusion of oxygen vacancies. Different actions are matched to different conductance states of the TiOx-based memristor, which decay with time. The steps closer to the endpoint are more significant and of higher value. The upper panel of Figure 1b illustrates the schematic structure of the TiOx-based memristor, and the scanning electron microscopy (SEM) image is shown in the lower panel of Figure 1b. After the round information has been converted to conductance values, it is imported into the memristor array in parallel to calculate Q(S, A) by using the inherent MVM property of the memristor array, as shown in Figure 1c. The array conductance is programmed by the amplitude of a voltage pulse with a fixed pulse width, and the programming scheme is shown in Figure S1, Supporting Information. Here, R_k represents the kth read voltage, t indicates the length of each round in steps, and G_{k,j} indicates the conductance value for the kth action of the jth round. After a Winner-Take-All (WTA) comparison (similar to a neuronal lateral inhibition circuit, which outputs the largest value of the 32 rounds), the maximum value of the 32 rounds is returned to the computer and compared with Q*(S, A) from the previous batch; the larger one is retained, and the process is iterated until convergence.
The memristor array has a size of 32 × 32 and uses a one-transistor-one-resistor (1T1R) configuration. The Q(S, A) is calculated as follows (details of the derivation are given in Description 1, Supporting Information), where Q(S, A)(t) denotes the value of a path containing the information of t action-state pairs, β denotes the decay index inherent to the policy map, and a_{t-k+1} indicates the (t−k+1)th action, which is represented as the initial conductance weight in the array. R_{t-k+1} represents the reward of the (t−k+1)th action.
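The array readout and WTA comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that each column of the array holds one round, that the column output is the sum over k of R_k · G_{k,j}, and that the function names and the toy numbers are hypothetical.

```python
import numpy as np

def round_values(read_voltages, conductances):
    """Column outputs of a crossbar: out_j = sum_k R_k * G[k, j] (assumed form)."""
    r = np.asarray(read_voltages, dtype=float)   # R_k, one entry per step k
    g = np.asarray(conductances, dtype=float)    # G[k, j], step k of round j
    return r @ g                                 # one value per round j

def winner_take_all(values):
    """Return the index and value of the largest column output."""
    idx = int(np.argmax(values))
    return idx, float(values[idx])

# Toy example: 3 steps, 2 rounds, made-up numbers (not measured data).
r = [1.0, 2.0, 3.0]
g = [[1.0, 0.5],
     [1.0, 0.5],
     [1.0, 2.0]]
values = round_values(r, g)
best_round, best_value = winner_take_all(values)

# Keep the larger of the stored value and the new winner, as in the text.
q_star_previous = 5.0                            # hypothetical stored Q*(S, A)
q_star = max(q_star_previous, best_value)
```

In the 32 × 32 array the same product would be evaluated for all 32 rounds in one read operation; the software loop-free `r @ g` only mirrors that parallelism.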
To physically realize Sarsa (λ), a TiOx-based memristor with a stable conductance decay property is fabricated. First, the direct current (DC) characteristics of the memristor are investigated. As shown in Figure 2a, the TiOx-based memristor exhibits reliable bipolar resistive switching (100 cycles). When a positive bias is applied to the Ti electrode with a current limit of 500 μA, the device switches from a high resistance state (HRS) to a low resistance state (LRS), and it switches from the LRS back to the HRS under negative bias. Furthermore, multilevel conductance is realized by applying consecutively increasing RESET voltages (10 cycles), as shown in Figure 2b. The conductance is decreased by continuously applying increasing negative RESET voltages (−1.2 to −1.5 V, with a step of −0.1 V) to the top electrodes (TE). The inset of Figure 2b shows a representative set of multilevel conductance states, in which the conductance levels under different voltages are well differentiated. The gradual increase of the negative voltage causes the conductive filament to become thinner before it breaks, resulting in a gradual decrease of conductance from the initial state. [30,31] Interestingly, the I-V-t characteristics of ten consecutive RESET cycles indicate that all of the conductance states (read at 0.2 V) exhibit a stable decay property. Each conductance state stays within a specific conductance range without overlapping over a time scale of 100 s, as shown in Figure 2c. This may be explained by the degradation of the resistance caused by the change of the oxygen vacancy profile due to oxygen diffusion. [32,33]
Such spontaneous conductance relaxation can provide a stable decay function, i.e., the decay factor required to physically realize Sarsa (λ), without additional energy consumption or data movement, which enhances energy efficiency. The four regions are mapped to the four actions in the simulation, with the mapping scheme shown in Table S1, Supporting Information. To further investigate the conductance decay behavior of the device, we selected a group of typical conductance states from Figure 2c and fitted them with a function, as shown in Figure 2d. The results indicate that the different conductance states obey the same decay function, as follows.
where G_0 is the initial conductance, t is the time after programming, and ϑ ≈ 0.084 denotes the decay factor. According to Equation (1), the different conductance states, which decay with time and represent different actions (a_k), can be combined with the parallel memristor array-based MVM to accelerate the convergence speed of the RL algorithm.
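The fitted decay function itself is not reproduced in this text, so the following sketch assumes a power-law relaxation, G(t) = G_0 · (t / t_0)^(−ϑ), which is a common form for conductance drift. The functional form, the reference time t_0 = 1 s, and the function names are assumptions; only ϑ ≈ 0.084 and the 6.75 μS initial state come from the text. The sketch generates a synthetic decay curve and recovers ϑ from a log-log fit, which is one way such a fit could be checked.

```python
import numpy as np

THETA = 0.084  # decay factor reported in the text

def conductance(g0, t, theta=THETA, t0=1.0):
    """Assumed power-law relaxation: G(t) = G0 * (t / t0) ** (-theta), t >= t0."""
    t = np.maximum(np.asarray(t, dtype=float), t0)  # clamp to the reference time
    return g0 * (t / t0) ** (-theta)

def fit_theta(t, g):
    """Estimate theta as the negative slope of log G versus log t."""
    slope, _intercept = np.polyfit(np.log(t), np.log(g), 1)
    return -slope

# Synthetic curve over 1 s to 100 s for an initial state of 6.75 uS.
t = np.logspace(0, 2, 50)
g = conductance(6.75, t)
theta_hat = fit_theta(t, g)
```

With measured data in place of the synthetic curve, the same log-log fit would show whether all four conductance states share one slope, which is the claim made for Figure 2d.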
To verify that the physical Sarsa (λ) accelerates the convergence of the RL algorithm, a path planning (navigation) task in a 2D discrete maze (inset of Figure 3d) has been investigated. The maze world has a size of 20 × 20, where each square corresponds to a separate state. The agent selects one of four directions, "up," "down," "left," and "right," to reach the next state and receive the reward R_k. The λ decay of each state is indicated by the decay amplitude of the initial conductance of the TiOx-based memristor: every newly performed action programs the conductance value by a RESET operation, and this value then decays with time. The initial conductance values corresponding to the four actions are 6.75 μS (Up), 4.45 μS (Down), 2 μS (Left), and 1.2 μS (Right), respectively. A game round ends once the agent reaches the endpoint; the agent follows the ε-greedy algorithm (initial utilization is 90%) to update Q(S, A). The ε-greedy algorithm means that, with probability ε, the agent moves randomly, and with probability 1−ε, it takes the action with Q*(S, A) from the Q-table. [34] The R_k of the endpoint and of the traps are 100 and −50, respectively, and the R_k of common ground is set to −0.1, so that the agent finds a path that avoids the traps with the fewest steps. During the computation, the read voltage is normalized and scaled to match the corresponding reward (R_k). The four directions taken by the agent are quantified by the four conductance states of the TiOx-based memristor.
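For readers who want the software counterpart of this setup, the following is a minimal sketch of the reward scheme, ε-greedy selection, and one textbook tabular Sarsa (λ) update with accumulating traces. It is not the authors' code: the function names, the learning rate `alpha`, the discount `gamma`, and the trace decay `lam` are hypothetical, and terminal-state handling is omitted. Only the reward values (100, −50, −0.1) and the four actions come from the text.

```python
import numpy as np

ACTIONS = ("up", "down", "left", "right")
REWARD_GOAL, REWARD_TRAP, REWARD_STEP = 100.0, -50.0, -0.1  # values from the text

def reward(cell, goal, traps):
    """Reward for entering a cell of the maze."""
    if cell == goal:
        return REWARD_GOAL
    if cell in traps:
        return REWARD_TRAP
    return REWARD_STEP

def epsilon_greedy(q_row, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, alpha, gamma, lam):
    """One Sarsa(lambda) update with accumulating traces; Q and E change in place."""
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]  # temporal-difference error
    E[s, a] += 1.0                                   # mark the visited pair
    Q += alpha * delta * E                           # update every traced pair
    E *= gamma * lam                                 # decay all traces
    return delta

# One update on a toy table with 4 states and 4 actions.
rng = np.random.default_rng(0)
Q = np.zeros((4, len(ACTIONS)))
E = np.zeros_like(Q)
r = reward((0, 1), goal=(3, 3), traps={(1, 1)})      # ordinary step: -0.1
delta = sarsa_lambda_step(Q, E, s=0, a=3, r=r, s_next=1, a_next=0,
                          alpha=0.5, gamma=0.9, lam=0.8)
greedy = epsilon_greedy(np.array([0.0, 1.0, 5.0, 2.0]), 0.0, rng)
```

The line `E *= gamma * lam` is the repeated multiplication that the device conductance decay is meant to replace in hardware.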
The total rewards and exploration steps at each round are shown in Figure 3c; however, they exhibit a slow convergence speed. This is because, in software simulations, the decay factor is typically 0.8 < β < 0.9 (see Figure S2, Supporting Information), whereas the conductance drift coefficient of the device is typically 0.07 < ϑ < 0.1, yielding a slower dynamic decay, as shown in Figure 3a. Hence, the conductance is transformed mathematically so that the transformed conductance decay is mapped between 0 and 1, as given in Equation (3).
where ε is the threshold and β_Convert is set to 0 below ε. G(t) is the conductance of the device at time t, and the conversion scales the conductance above the threshold. As shown in Figure 3b, when the conductance is transformed using Equation (3) (for example, 6.75 μS for the action Up) with ε taken as 4.85 μS, the transformation matches the behavior of a mathematical exponential decay function. This conversion can be implemented simply by an arithmetic logic unit that includes a comparator. Scaling the device decay range in this way allows the decay factor to be matched to different tasks, which improves the flexibility of the system. As shown in Figure 3d, after the conversion, the agent reaches convergence after 59 rounds and forms the optimal strategy. Compared with the results before conversion (Figure 3c), the convergence is significantly faster. Figure 4a compares the convergence speed of different RL training algorithms in a path planning task (navigation in a 20 × 20 maze world). The results indicate that Sarsa (λ) after the transformation converges faster in terms of reward and step updates than Sarsa and Q-learning. Furthermore, to verify the memristor-based RL system for the path planning task, the round information of a 4 × 4 maze world calculated with Sarsa (λ) is mapped in parallel into a 32 × 32 memristor array for the Q(S, A) calculation. Figure 4b compares the simulated and experimental results for Q(S, A). The results indicate that the memristor-based RL system for path planning tasks is reliable. It is worth mentioning that, besides device conductance modulation, this work requires no extra energy consumption for policy mapping during the algorithm update, owing to the spontaneous conductance drift of the device, which would thus enhance the energy efficiency of the RL system.
Moreover, a hardware scheme based entirely on a TiOx-based memristor array would be simpler, and the corresponding peripheral circuit is shown in Figure S3, Supporting Information. In addition, a large-scale memristor-based RL system has the potential to solve more complex optimization problems, which may require multiple small arrays or structure-level designs; these will be investigated in depth in future work.
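The threshold conversion referred to as Equation (3) is not reproduced in this text. As an illustration only, the sketch below assumes a linear rescaling, β_Convert = (G(t) − ε) / (G_0 − ε) above the threshold ε and 0 at or below it, which maps the decay onto the range 0 to 1 as the text describes. The linear form and the function name are assumptions; the values 6.75 μS and 4.85 μS are the example values given for the action Up.

```python
import numpy as np

def beta_convert(g_t, g0, threshold):
    """Assumed rescaling: (G(t) - eps) / (G0 - eps) above eps, otherwise 0."""
    g_t = np.asarray(g_t, dtype=float)
    scaled = (g_t - threshold) / (g0 - threshold)
    # The comparison plays the role of the comparator mentioned in the text.
    return np.where(g_t > threshold, scaled, 0.0)

g0, eps = 6.75, 4.85                          # uS, example values from the text
samples = np.array([6.75, 5.80, 4.85, 4.00])  # hypothetical conductance readings, uS
factors = beta_convert(samples, g0, eps)
```

Under this assumed form, a conductance that has drifted only part of the way from G_0 toward ε already yields a much smaller factor than the raw ratio G(t) / G_0, which is how a slow device drift could stand in for a faster software decay factor.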

Conclusion
In this work, to implement the Sarsa (λ) algorithm, which converges faster than other classic algorithms, a TiOx-based memristor with a conductance decay property was fabricated. The value update equation for Sarsa (λ) was implemented by using a memristor array. Most importantly, the critical λ decay function of Sarsa (λ) was realized by the conductance decay property of the TiOx-based memristor. A path planning task was then demonstrated based on the intrinsic device decay and showed outstanding performance. The round information obtained for path planning, based on the policy map of the TiOx-based memristor, was mapped into a 32 × 32 memristor array in parallel to calculate the Q(S, A) of each round. The experimental results are similar to the simulated results. This work thus provides a fast-convergence scheme for the memristor-based RL algorithm implementation.

Experimental Section
The two Ti/TiOx/Pt memristors are stacked on a SiO2/Si substrate. First, the bottom electrodes (BE) (Pt/Ti, 40/5 nm) are deposited by e-beam evaporation, in which the Ti is used as an adhesion layer. Then, the functional layers (TiOx, ≈15 nm) are deposited by magnetron sputtering through co-sputtering of Ti (100 W) and O2 (2 sccm) in an Ar (20 sccm) atmosphere. Finally, the TE (Ti, 20 nm) are deposited by magnetron sputtering, followed by 20 nm of Pt, also deposited by magnetron sputtering, to protect the top Ti electrodes. All the electrodes and functional layers are patterned through ultraviolet lithography and a lift-off process. The fabrication processes of the 32 × 32 memristor array are the same as in the reference. [35] The DC tests were performed on an FS Pro semiconductor parameter analyzer.

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.