Toward Optimized In‐Memory Reinforcement Learning: Leveraging 1/f Noise of Synaptic Ferroelectric Field‐Effect‐Transistors for Efficient Exploration

Reinforcement learning (RL), exhibiting outstanding performance in various fields, requires large amounts of data for high performance. While exploration techniques address this requirement, conventional exploration methods have limitations: complexity of hardware implementation and significant hardware burden. Herein, in-memory RL systems leveraging intrinsic 1/f noise of synaptic ferroelectric field-effect-transistors (FeFETs) for efficient exploration are proposed. The electrical characteristics of fabricated FeFETs with low-power operation capability verify their suitability for neuromorphic systems. The proposed system achieves comparable performance to the conventional exploration method without additional circuits. The intrinsic 1/f noise of the FeFETs facilitates efficient exploration and offers significant advantages: efficiency in hardware implementation and simplicity in adjusting the 1/f noise level for optimal performance. This approach effectively addresses the challenges of conventional exploration methods. The operation mechanism of the exploration method utilizing the 1/f noise is systematically analyzed. The proposed in-memory RL system demonstrates robustness and reliability against device-to-device variation and the initial conductance distribution. This work provides further insights into the exploration methods of RL, paving the way for advanced in-memory RL systems.

The softmax outputs of the neural network can be used as probabilities for selecting an action. [10,11] However, implementing a softmax activation function in hardware is complex, and selecting actions stochastically requires additional circuits. [14] By introducing noise during the training process, agents are encouraged to explore their environment more extensively. This approach induces stochasticity in action selection even with a deterministic action selection method and offers the advantage of simple hardware implementation of RL.
In the noise-based exploration method, the amount of exploration varies with the given noise level. The higher the noise level, the greater the amount of exploration included in the decision-making process of the agent. Conversely, a lower noise level leads to a reduced amount of exploration, resulting in more deterministic behavior of the agent. The optimal amount of exploration for achieving high performance depends on the specific characteristics of the problem to be solved. In environments with high uncertainty or large and complex state spaces, a large amount of exploration may be necessary to discover optimal strategies and avoid getting trapped in suboptimal solutions. On the contrary, in more stable or relatively small and well-understood environments, less exploration may be more appropriate to exploit already learned knowledge and maximize performance. Therefore, the optimal noise level for achieving high performance is expected to vary with the specific problem being addressed and the neural network size. Prior studies have utilized inherent stochastic conductance switching and cycle-to-cycle conductance variability as noise sources in the neural network. [15,16] However, the noise sources used in these studies are difficult to adjust to the optimal level. Furthermore, given that the prior study employed only positive weights with a learning rule similar to spike-timing-dependent plasticity, it is challenging to apply to complicated problems that necessitate a neural network with a hidden layer. In that approach, the current state of the environment is quantized and one-hot encoded to serve as input data for the neural network, which requires a substantial number of input neurons and degrades scalability.
In this work, we propose in-memory RL systems with efficient exploration leveraging 1/f noise, the intrinsic low-frequency noise (LFN) of synaptic devices. It has generally been thought that 1/f noise has a detrimental effect on neural networks and should be minimized. However, in the proposed in-memory RL system, the intrinsic 1/f noise of the synaptic device provides stochasticity to the policy of the agent and facilitates exploration, improving training speed and performance. One notable advantage is the simplicity of adjusting the intrinsic 1/f noise of the synaptic device to an optimal level by controlling the read time, operating voltage, device size, etc. [17] This feature contributes to the flexibility and efficiency of the in-memory RL system. A ferroelectric field-effect-transistor (FeFET) that can effectively perform the operations of a neuromorphic system is fabricated and utilized as a synaptic device. The fabricated devices are investigated in terms of synaptic memory characteristics, including the 1/f noise characteristics required for efficient exploration in the in-memory RL system. The synaptic FeFETs are capable of low-power operation and have relatively high 1/f noise levels. These properties allow for a wide range of noise levels, making the devices suitable for tuning to the optimal noise level. The approximated backpropagation algorithm introduced in our prior work is used for network training. [18,19] A Frozen Lake problem, a well-known RL benchmark, is used to assess the performance of the proposed in-memory RL system. The contribution of the intrinsic 1/f noise of the fabricated FeFET to exploration is thoroughly analyzed, and the detailed operation mechanism of the exploration utilizing the intrinsic 1/f noise is systematically investigated. The effects of the noise level, the device-to-device variation, and the initial conductance distribution on the performance are also investigated.

Electrical Characteristics of Synaptic FeFETs
Electronic synaptic devices, corresponding to the synapses that transmit electrical signals in the biological nervous system, can represent the synaptic weights of neural networks. [22-24] A FeFET is fabricated as a synaptic device (Figure 1a). The key fabrication process steps for the FeFETs are represented in the Experimental Section and Figure S1, Supporting Information. Figure 1b,c shows a top optical image of the fabricated FeFET array with dimensions of 12 × 24 and a transmission electron microscope (TEM) cross-sectional image of the fabricated FeFET, respectively. A schematic and an enlarged top optical image of the fabricated FeFET array in Figure 1b are illustrated in Figure S2, Supporting Information.
Electrical characteristics of the fabricated FeFETs are investigated to assess their suitability for neuromorphic systems. Figure 1d represents the transfer curves (I_D-V_GS) of the fabricated FeFET with channel width (W) and length (L) of 0.5 and 0.4 μm, respectively. The polarization change of the hafnium zirconium oxide (HZO), the ferroelectric layer, causes the threshold voltage change of the FeFETs. When a program (PGM) pulse is applied to the gate, the threshold voltage decreases since the polarization of the HZO is directed toward the Si and the electron concentration of the channel increases. On the contrary, when an erase (ERS) pulse is applied to the gate, the threshold voltage increases since the HZO is polarized toward the TiN and the electron concentration of the channel decreases. The partial polarization of the HZO layer realizes multilevel conductance in the FeFETs. The transfer curves of 40 fabricated FeFETs are shown in Figure S3, Supporting Information. The PGM (4.0 V, 10 μs) and ERS (−4.0 V, 100 μs) pulses are applied, and the threshold voltage of each FeFET is successfully adjusted. The program-inhibit operation is evaluated using five consecutive FeFETs. The PGM (4.0 V, 10 μs) pulse is applied to the WL, while inhibit (2.0 V, 10 μs) pulses are applied to the BLs of the unselected FeFETs. Two different bias schemes for the program-inhibit operation and the corresponding conductance mapping result of the fabricated FeFETs are provided in Figure S4, Supporting Information. Only the selected FeFETs are programmed, without affecting adjacent cells, because the inhibit pulse reduces the voltage difference between the WL and the channel and thereby suppresses polarization switching in the unselected FeFETs. The long-term potentiation (LTP) and long-term depression (LTD) characteristics of the fabricated FeFET are investigated by applying PGM and ERS pulses. The transfer curves with the successive PGM and ERS pulses are illustrated in Figure 1e. While the pulse widths are fixed at 10 μs, the amplitudes of the PGM and ERS pulses are increased from 3.0 to 3.75 V in 0.05 V steps and from −2.5 to −3.7 V in −0.08 V steps, respectively. Figure 1f shows the average and standard deviation of both the LTP and LTD characteristics of the fabricated FeFET versus the number of applied pulses during 20 cycles. The 20 cycles of LTP and LTD characteristics of the fabricated FeFET are depicted in Figure S5, Supporting Information, confirming uniform cycling performance. These properties allow the fabricated FeFET to successfully mimic biological synapses and make it suitable for neuromorphic systems. The endurance characteristic of the fabricated FeFET is shown in Figure S6, Supporting Information; the memory window decreases as the number of PGM/ERS cycles increases. [27-29] Figure 1g represents the variation of the drain current (I_D) of the fabricated FeFET over time. The corresponding I_D amplitude distributions under various V_GS conditions are depicted in Figure S7, Supporting Information. The change in I_D follows a Gaussian distribution. The power spectral density (PSD) of the fabricated FeFET is measured under various V_GS conditions to investigate the LFN characteristics of the FeFET (Figure S8, Supporting Information). Figure 1h shows the I_D-normalized PSD (S_ID/I_D^2) versus frequency (f) of the FeFET. V_DS is fixed at 0.1 V, and V_GS is varied to generate I_D values ranging from 20 to 800 nA. In the measured f range, the fabricated FeFET exhibits 1/f^γ noise behavior (γ = −∂ln S_ID/∂ln f). The γ value of the FeFET ranges from 0.9 to 1.1. It has been reported that the 1/f noise originates from the random fluctuation of the carrier number owing to trapping/detrapping to/from the defects inside the gate oxide. Such behavior can be explained by the carrier number fluctuation (CNF) model, expressed as follows [30]

S_ID/I_D^2 = (g_m/I_D)^2 × (q^2 k_B T λ N_T)/(f W L C_ox^2)

where g_m is the transconductance, q is the electron charge, k_B is the Boltzmann constant, T is the temperature, N_T is the volume trap density, λ is the tunneling attenuation coefficient, and C_ox is the gate oxide capacitance per unit area.
It is necessary to investigate the correlation between S_ID/I_D^2 and (g_m/I_D)^2 to verify the origin of the 1/f noise. Figure 1i shows S_ID/I_D^2 sampled at 10 Hz and (g_m/I_D)^2 versus I_D of the fabricated FeFET. The S_ID/I_D^2 and (g_m/I_D)^2 exhibit the same tendency in all I_D operating regions, demonstrating that the 1/f noise of the fabricated FeFET stems from the CNF. Here, it is important to note that the fabricated device with a ferroelectric material exhibits relatively large noise due to the existence of phonon scattering. Typically, the 1/f noise in devices is considered undesirable, and efforts are made to minimize it to improve system performance. However, contrary to this common perception, the intrinsic 1/f noise of the synaptic device can be utilized as a source of stochasticity with high energy efficiency in RL.
The standard deviation of the drain current (σ_ID) can be used to represent the current fluctuations resulting from the intrinsic 1/f noise of the device. [31,32] The variance σ_ID^2 can be obtained as follows

σ_ID^2 = ∫ S_ID df (integrated over the measurement bandwidth from f_min to f_max)

where S_ID represents the PSD of the device. The relative ratio σ_ID/I_D, which indicates the magnitude of the 1/f noise, is given by the following equation

σ_ID/I_D = √(A ln(f_max/f_min))

where A denotes the normalized PSD (S_ID/I_D^2) at a frequency of f = 1 Hz. From the measured PSD of the fabricated FeFET, σ_ID/I_D = 0.0108 is obtained. The 1/f noise level (σ_ID/I_D) is affected by various parameters such as read time (t_read), read current (I_D), and device size (WL). By adjusting these factors, the noise level required for a specific application or system can be achieved. The intrinsic 1/f noise of the fabricated FeFET is represented by the following equation

I_D,real = I_D,ideal × (1 + N(0, σ_ID/I_D))

where I_D,real and I_D,ideal represent the drain currents with and without the intrinsic 1/f noise of the FeFET, respectively, and N(0, σ_ID/I_D) is a zero-mean Gaussian random variable with standard deviation σ_ID/I_D. Figure 1j-l indicates the 1/f noise level (σ_ID/I_D) of the fabricated FeFET versus read time, read current, and device size, respectively. The 1/f noise level increases as the read time increases and tends to decrease as the read current and device size increase. Therefore, the intrinsic 1/f noise of the fabricated FeFET can be simply adjusted to the optimal level for solving a given problem. This will be elaborated in a subsequent section.
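As a rough illustration of how these relations combine, the following Python sketch evaluates σ_ID/I_D = √(A ln(f_max/f_min)) with f_min ≈ 1/t_read and injects the resulting Gaussian fluctuation into a read current. This is a minimal sketch, not the authors' simulator; the prefactor A, the 1 kHz upper bandwidth, and the 100 nA read current are illustrative assumptions rather than measured values from this work.

```python
import numpy as np

# Minimal sketch of the 1/f noise relations above:
# sigma_ID/I_D = sqrt(A * ln(f_max/f_min)), with f_min ~ 1/t_read, and
# I_D,real = I_D,ideal * (1 + N(0, sigma_ID/I_D)).

def noise_level(A, t_read, f_max=1e3):
    """Relative 1/f noise level for a given read time (f_min = 1/t_read)."""
    f_min = 1.0 / t_read
    return np.sqrt(A * np.log(f_max / f_min))

def read_current(i_d_ideal, sigma_rel, rng):
    """One noisy read: multiplicative zero-mean Gaussian fluctuation."""
    return i_d_ideal * (1.0 + rng.normal(0.0, sigma_rel))

rng = np.random.default_rng(0)
A = 1.1e-4  # normalized PSD at 1 Hz (assumed value)
for t_read in (1e-2, 1e-1, 1.0):  # longer read -> lower f_min -> more noise
    s = noise_level(A, t_read)
    i_d = read_current(100e-9, s, rng)
    print(f"t_read={t_read:.2f} s  sigma_ID/I_D={s:.4f}  I_D={i_d:.3e} A")
```

Consistent with Figure 1j, the printed noise level grows monotonically with the read time because a longer read extends the integration bandwidth toward lower frequencies, where the 1/f spectrum carries more power.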

In-Memory RL System
The interaction between an agent and the environment in RL is shown in Figure 2a. The agent observes the given environment and obtains the current state (s_t). As the agent takes a specific action (a_t), the environment changes (s_(t+1)), and the agent receives a corresponding reward (r_t, sometimes a penalty). Through these interactions, the agent learns which action to take in a given state to maximize the expected future rewards. Figure 2b depicts an illustration of the in-memory RL system, which incorporates integrate-and-fire (I&F) neurons and a synapse array. The fabricated FeFET array is employed as the synapse array. Two synaptic FeFETs per synaptic weight are required to express negative synaptic weights. [35] The data obtained by the agent observing the environment (current state) are converted to presynaptic spikes and transferred to the synapse array. The synapse array current is summed through the current mirror, charging the membrane capacitor of the I&F neuron. When the membrane potential exceeds the threshold voltage, the spike generation circuit in the I&F neuron generates postsynaptic spikes. The agent takes an action based on the output neuron that exhibits the highest firing frequency. These processes are repeated until the given problem is solved (end of one episode). Figure 2c represents the flowchart of the problem-solving procedure of the in-memory RL system. The schematic diagram and operation of the I&F neuron circuit are shown in Figure S9, Supporting Information.
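The inference step can be summarized in a short behavioral sketch. This is our reading of Figure 2b under simplifying assumptions (unit membrane threshold, discrete time steps, conductances normalized so the threshold is reached, multiplicative Gaussian read noise), not the authors' circuit implementation: the one-hot state selects a row of the differential synapse array, the column currents charge the I&F membranes, and the action is read out from the neuron with the most spikes.

```python
import numpy as np

# Behavioral sketch of the inference step (assumptions noted above).

def select_action(G_exc, G_inh, state_idx, v_th=1.0, t_steps=100,
                  sigma_rel=0.01, rng=np.random.default_rng()):
    n_actions = G_exc.shape[1]
    v_mem = np.zeros(n_actions)        # membrane potentials of I&F neurons
    spikes = np.zeros(n_actions, int)  # postsynaptic spike counts
    for _ in range(t_steps):
        # differential column current of the active (one-hot) row,
        # perturbed by the intrinsic 1/f read noise of the FeFETs
        i_col = G_exc[state_idx] - G_inh[state_idx]
        i_col = i_col * (1.0 + rng.normal(0.0, sigma_rel, n_actions))
        v_mem += i_col                 # integrate
        fired = v_mem >= v_th          # fire
        spikes += fired
        v_mem[fired] = 0.0             # reset after a spike
    return int(np.argmax(spikes))      # action with the highest firing rate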
The performance of the proposed in-memory RL system is evaluated using a Frozen Lake problem, a simple path-finding problem with a sequential task. A grid world of the Frozen Lake problem, along with a schematic illustration of the deep Q-network (DQN) employed to solve the problem, is depicted in Figure S10, Supporting Information. The agent ("A" inside a blue circle) can take four actions in an 8 × 8 grid world: move north, south, east, or west by one tile. "S" and "G" written on green tiles indicate the starting tile and the goal tile, respectively. The other white and black tiles represent frozen tiles and holes, respectively. Each new problem begins with the agent at the starting tile of the grid world. The goal is to find a path from the starting tile to the goal tile by stepping on frozen tiles only, without falling into any holes. A total of 10 holes are randomly placed in the grid world. When the agent successfully reaches the goal tile, the episode is terminated, and the agent receives a reward of 1. On the contrary, when the agent falls into a hole, the episode is terminated, and the agent receives a reward of −1 (i.e., a penalty). In all other cases, the reward given to the agent is 0. Figure S11, Supporting Information, shows the strategy for mapping the synaptic weights of the proposed neural network to the fabricated FeFET array. Two FeFET arrays (excitatory and inhibitory arrays) are required to represent negative synaptic weights. The proposed in-memory RL system requires an energy consumption of 1.09 μJ for solving a single Frozen Lake problem (see Note S1, Supporting Information, for the energy consumption).
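For concreteness, a minimal Python model of the environment dynamics and reward scheme described above is given below. The goal position, hole placement, and action encoding are illustrative assumptions; only the reward values (+1 goal, −1 hole, 0 otherwise) and the 8 × 8 one-hot state encoding follow the text directly.

```python
import numpy as np

# Minimal Frozen Lake model matching the reward scheme described above.

N = 8
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, 1), 3: (0, -1)}  # north, south, east, west

def step(pos, action, holes, goal=(7, 7)):
    r, c = pos
    dr, dc = MOVES[action]
    r = min(max(r + dr, 0), N - 1)     # walls: the agent stays on the grid
    c = min(max(c + dc, 0), N - 1)
    if (r, c) == goal:
        return (r, c), +1, True        # reward of 1, episode terminates
    if (r, c) in holes:
        return (r, c), -1, True        # penalty, episode terminates
    return (r, c), 0, False            # frozen tile, episode continues

def one_hot(pos):
    """Encode the agent's location as the 64-dimensional network input."""
    v = np.zeros(N * N)
    v[pos[0] * N + pos[1]] = 1.0
    return v
```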

Efficient Exploration in RL Leveraging Intrinsic 1/f Noise of Synaptic FeFETs
Exploration, commonly used in RL, improves the performance of the network by gathering more information about the environment. This is especially important at the early stage of training, as it determines the overall training speed. However, since excessive exploration can slow down training by interrupting optimal action selection, conducting an appropriate amount of exploration is necessary. While relatively simple problems can be solved without exploration, the network can hardly solve relatively complicated problems without it.
LFN, which includes the 1/f noise generated in electronic devices, is a major factor in performance degradation. It has been reported that LFN reduces the synaptic weight reliability of the synaptic devices used in neuromorphic systems and consequently degrades the performance of the networks. [36,37] Therefore, it has been believed that 1/f noise should be minimized due to its detrimental impact on system performance. In this work, we utilize this ostensibly unfavorable intrinsic 1/f noise of the synaptic FeFETs to conduct exploration with low power consumption. In contrast, the conventional exploration methods are challenging to implement on a hardware basis and impose a significant hardware burden. This section thoroughly analyzes the efficient exploration method leveraging the intrinsic 1/f noise of the fabricated synaptic FeFETs.
Figure 3a illustrates the number of output spikes generated from the four output neurons for every possible state in the Frozen Lake problem at the early and later stages of training. In the early stage of training, there is only a slight difference between the numbers of output spikes since the initial synaptic weights are randomly distributed. However, as the training progresses, the number of output spikes indicating a particular action becomes dominant. These results indicate that action selection can be considerably influenced by the intrinsic 1/f noise of the synaptic FeFETs in the early stage of training but is hardly affected in the later stage. The standard deviation (σ) of the number of output spikes for every possible state is shown in Figure 3b, and the average σs in the early and later stages of training are compared in Figure 3c. The average σ in the later stage of training is about 6.5 times larger than that in the early stage. Although the action selected in the early stage of training can be changed by the intrinsic 1/f noise of the synaptic FeFETs, it is difficult to change the action selected in the test process even if the same 1/f noise level exists, since the network is stabilized (large σs of the number of output spikes). Notably, this enables stochastic action selection through deterministic action selection without additional circuits. This approach utilizing the intrinsic 1/f noise of the synaptic FeFETs functions similarly to the ε-greedy method, which is commonly used to implement exploration in RL. Both methods select a random action (exploration strategy) in the early stage of training while selecting an optimal action (exploitation strategy) in the later stage of training.
Figure 3d illustrates the number of output spikes of a well-trained network when the agent is at different arbitrary states.
In the example state 1, the output spikes in the south direction, approaching the goal tile, are generated dominantly. The output spikes in the east direction, away from the hole on the left, are generated second most. Therefore, the agent moves in the south direction with a high probability and in the east direction with a low probability. The agent hardly moves in the other directions despite the intrinsic 1/f noise of the synaptic FeFETs. Similarly, the agent moves in the east direction with a high probability in the example state 2. Figure 3e represents the change in conductance of all synaptic FeFETs in the network as the training progresses. The conductance distributions before and after training the network are depicted in Figure S12, Supporting Information. As the training progresses, particular synaptic weights become dominant, diminishing the influence of the 1/f noise on the network.
Figure 4a-c presents the problem-solving results of the proposed in-memory RL system, comparing the conventional ε-greedy method (red lines) with the exploration method leveraging the intrinsic 1/f noise of the synaptic FeFETs (black lines). The performance evaluations were conducted on ten separate untrained networks to ensure the consistency of the proposed approach. The thick solid lines in the figure indicate the average value derived from the results of the ten networks; these average lines smooth out fluctuations or outliers that may occur in individual networks. Figure 4a,b indicates the success rate of the network and the number of actions required to solve the problem as the training progresses, respectively. Figure 4c shows the action change rate (ACR), which represents the rate of selecting an action different from that in the case without the 1/f noise when the 1/f noise is considered. This indicator provides insights into how the presence of the intrinsic 1/f noise of the synaptic FeFETs affects the agent's decision-making process and exploration strategy.
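A sketch of how such an ACR can be estimated is shown below. This is our formulation of the definition above, not the authors' evaluation code: the fraction of action selections that the noisy read flips relative to a noiseless read of the same effective weights W (one row of signed column sums per state), using the multiplicative noise model from the device section.

```python
import numpy as np

# Sketch of an ACR estimate: how often the 1/f read noise flips the
# greedy action relative to a noiseless read of the same weights.

def action_change_rate(W, sigma_rel, trials=100, rng=np.random.default_rng(0)):
    changed, total = 0, 0
    for q_ideal in W:                          # one row per state
        a_ideal = np.argmax(q_ideal)           # action without the 1/f noise
        for _ in range(trials):
            q_noisy = q_ideal * (1.0 + rng.normal(0.0, sigma_rel, q_ideal.shape))
            changed += int(np.argmax(q_noisy) != a_ideal)
            total += 1
    return changed / total                     # -> ~0 once one action dominates
```

For nearly uniform rows (an untrained network), small perturbations flip the argmax often and the estimate is high; once one entry dominates each row (a trained network), the same noise level flips almost nothing, mirroring the decay of the ACR in Figure 4c.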
The success rate of the network gradually increases from 0% and eventually reaches 100% as the training progresses. Additionally, the number of actions required to solve the problem decreases as the training progresses. While the agent takes about 50 actions almost randomly in the early stage of training, it converges to taking the 14 optimal actions in the later stage of training. These results imply that the agent finds an optimal path to the goal tile as the training progresses. The network leveraging the intrinsic 1/f noise of the synaptic FeFETs exhibits a training speed similar to that of the network employing the ε-greedy method. In the early stage of training, the ACR is relatively high, reaching around 50%. As the training progresses, it gradually decreases, eventually converging to 0% in the later stage of training. The convergence of the ACR to 0% indicates a stabilized network, in which the action selection is hardly affected by the intrinsic 1/f noise of the synaptic FeFETs. This stability ensures the reliability of the agent's action selection, maintaining an optimal policy during the problem-solving procedure after training. This behavior resembles the ε-greedy method, where the amount of exploration decreases as the training progresses. The proposed in-memory RL system exhibits performance comparable to that of the networks employing the ε-greedy method, implying that the system effectively balances exploration and exploitation. In contrast, the networks devoid of the 1/f noise (green lines) are poorly trained since the ACR consistently remains at 0%. This result indicates that the agent continuously takes an identical action in each state, resulting in an insufficient amount of exploration.
Figure 4d-f represents the ratio at which the agent passes each tile (passing rate) during the test episodes at particular stages of the training process. The exploration method leveraging the intrinsic 1/f noise of the synaptic FeFETs is used for network training, and the effect of the 1/f noise is analyzed by ignoring the 1/f noise only during the test episodes. In the network without the 1/f noise, the agent initially moves toward the hole with a probability of 100% (Figure 4d, early stage of training). However, in the network with the 1/f noise, the single path toward the hole extends to the peripheral area due to random action selection. In the early stage of training, before the path to the goal tile is found, various possible paths are explored with the aid of the 1/f noise. As the training progresses, the path to the goal tile is found (Figure 4e, intermediate stage of training). Still, the network without the 1/f noise maintains a single path. In the network with the 1/f noise, the diversity of paths is reduced compared to the early stage of training, and this becomes more obvious as the training progresses. After the synaptic weights representing the path to the goal tile are strengthened, there is no significant difference between the networks with and without the 1/f noise (Figure 4f, later stage of training). This is because particular actions become dominant as the training progresses, and the 1/f noise no longer affects action selection. Note that the network without the 1/f noise exhibits a single path (exploitation strategy) without diversity, regardless of the training process. On the contrary, the 1/f noise facilitates exploration by searching various paths in the early stage of training. As the training progresses, the policy of the agent is optimized, and the 1/f noise does not interfere with problem-solving in a well-trained network. Notably, sufficient exploration is conducted by leveraging the intrinsic 1/f noise of the synaptic FeFETs without additional circuits.
Figure 4g indicates the initial and well-trained policy matrices of the agent in the Frozen Lake problem. The arrows on each tile represent the agent's predicted action selection, obtained from the differences between the numbers of the four output spikes. The nearly random distribution of the initial policy matrix reveals that the agent struggles to find a path to the goal tile and occasionally falls into a hole in the early stage of training. On the contrary, the well-trained policy matrix demonstrates the optimal path to the goal tile followed by the agent. A specific optimal path to the goal tile becomes prominently reinforced, effectively circumventing both holes and walls. Most of the output spikes are generated in the direction along the optimal path (relatively long arrows on the optimal path), while only a small number of output spikes are generated in other directions (relatively short arrows on other tiles). These results emphasize that the intrinsic 1/f noise of the synaptic FeFETs hardly affects the action selection in a well-trained network.
The policy matrices of the agent when insufficient exploration is conducted during the training process are shown in Figure S13, Supporting Information. Each policy matrix represents an instance in which the agent gets stuck in an endless loop or falls into a hole. Addressing these issues requires adjusting the 1/f noise level of the synaptic FeFETs to ensure sufficient exploration. Adequate exploration aids in achieving the optimal solution to the problem, circumventing suboptimal solutions. This highlights the advantage of the exploration method utilizing the 1/f noise: the simplicity of adjusting the intrinsic 1/f noise of the synaptic FeFETs to the optimal level for problem-solving.

Effect of the 1/f Noise Level on the Network Performance
In RL, striking a balance between exploration and exploitation is essential. The optimal amount of exploration depends on the characteristics of the given problem and, in the noise-based exploration approach, is directly governed by the noise level. Therefore, adjusting the noise level to the problem-dependent optimal level is indispensable, emphasizing the necessity of tailored approaches. This section delves into the effect of the noise level on the in-memory RL system, highlighting the significance of adjusting the noise level to an optimal level that aligns with the characteristics of different problems. In this regard, the exploration method leveraging the intrinsic 1/f noise of the synaptic FeFETs is advantageous because the noise level can be simply adjusted by controlling the read time, operating voltage, device size, etc.
Figure 5 represents the problem-solving results of the proposed in-memory RL system for various 1/f noise levels. The performance evaluations were conducted under conditions identical to those in the previous section, differing only in the 1/f noise level. Figure 5a-c shows the success rate of the network, the number of actions required to solve the problem, and the ACR of the network for various 1/f noise levels. Green, black, and red lines indicate the results when σ_ID/I_D = 0.01, 0.1, and 0.2, respectively. With an increase in the 1/f noise level, there is a noticeable decline in the training speed and the success rate (≈96% when σ_ID/I_D = 0.1 and ≈30% when σ_ID/I_D = 0.2). At 1/f noise levels deviating from the optimal level, the number of actions required to solve the problem rarely converges to the optimal number of 14 (≈18 when σ_ID/I_D = 0.1), especially at a higher noise level (≈49 when σ_ID/I_D = 0.2). The reduced training speed impedes particular actions from becoming dominant and leads to continuous random action selection due to the intrinsic 1/f noise of the synaptic FeFETs. These phenomena are corroborated by the slowly decaying ACR (≈21% when σ_ID/I_D = 0.1). At a higher 1/f noise level, the ACR remains relatively high regardless of the training process (≈54% when σ_ID/I_D = 0.2), resulting in persistent random action selection.
These results indicate that adjusting the 1/f noise level to an optimal level is essential for problem-solving.
The success rate and the number of actions required to solve the problem for various 1/f noise levels are shown in Figure 5d. The black and red symbols denote the success rate of the network and the number of actions required to solve the problem, respectively. The performance of the network is highly dependent on the 1/f noise level. An optimal range of 1/f noise levels exists in which the network is trained well and achieves high performance. An excessively low 1/f noise level results in insufficient exploration and degrades performance, while an excessively high 1/f noise level introduces excessive stochasticity and disrupts optimal action selection. Both hamper the network training and lead to poor performance. Within the optimal range of 1/f noise levels, the network exhibits adequate exploration while maintaining stability and consistency in action selection. These results highlight the importance of adapting the 1/f noise level of the in-memory RL system.
Unlike Figure 4d-f, which shows the passing rate of the agent as the training progresses in a single network, Figure 5e indicates the average passing rate of ten separate networks for various 1/f noise levels. When σ_ID/I_D = 0.01, the agent achieves a 100% success rate in reaching the goal tile. The intermediate paths to the goal tile vary since the agents of the ten separate networks move along different optimal paths, while the passing rates near the goal tile are about 100%. As the 1/f noise level increases, the agent struggles to find an optimal path to the goal tile, and the paths tend to extend to the peripheral area. Specifically, at a high 1/f noise level (σ_ID/I_D = 0.2), the movement of the agent follows an almost random trajectory. A noticeable result is that the passing rates near the goal tile are only about 50% and 10% for σ_ID/I_D = 0.1 and 0.2, respectively, despite the success rates of ≈96% and 30%, respectively (Figure 5a). This result is attributed to the movement of the agent deviating from the optimal path and getting stuck in a loop, traversing the same tile repeatedly within a single episode. This result aligns with Figure 5b, where the number of actions required to solve the problem does not converge to the optimal number of 14.
The paths followed by the agents of ten separate networks for various 1/f noise levels are shown in Figure 5f. When σ_ID/I_D = 0.01, every agent of the ten networks reaches the goal tile via the optimal path (green paths). On the contrary, when σ_ID/I_D = 0.1, the agents of several networks deviate from the optimal path and reach the goal tile via detours (gray paths). These suboptimal paths are formed not only when insufficient exploration is conducted but also when the 1/f noise level is excessively high. When σ_ID/I_D = 0.2, the agents of most of the networks fail to find a path to the goal tile and get stuck in an endless loop or fall into a hole (red paths).
Figure 5g represents the policy matrices of the agent for σ_ID/I_D = 0.1 and 0.2. When the 1/f noise level is high, the policy is not distinctly reinforced in a specific direction, unlike the well-trained policy matrix (Figure 4g). At both 1/f noise levels, most arrows (policy) point in diagonal directions, indicating that no particular action dominates. Especially for σ_ID/I_D = 0.2, the lengths of most arrows are similar, with the majority pointing in almost random directions (resembling the initial policy matrix in Figure 4g). This indicates that the action selection can be easily affected by the 1/f noise, leading to an unstable network (similar to the untrained initial network). A high 1/f noise level further increases the variability in action selection. For σ_ID/I_D = 0.1, the agent follows suboptimal paths while circumventing the holes. Two suboptimal paths (yellow arrows) are formed since the policy is not distinctly reinforced toward a specific single path. Endless loops (red arrows) also exist, where the agent repeatedly moves between two tiles. Despite these loops, the variability in action selection attributed to the 1/f noise often enables the agent to escape from the loops and successfully reach the goal tile. On the contrary, for σ_ID/I_D = 0.2, it is relatively difficult for the agent to escape from the loops due to their extensive size and robust connectivity, and the agent often falls into a hole.
The performance of the in-memory RL system degrades at excessively high or low 1/f noise levels, emphasizing the importance of adjusting the 1/f noise level to the optimal range that avoids excessive or insufficient exploration. The exploration method leveraging the 1/f noise provides flexibility and adaptability to the in-memory RL system through the easily adjustable intrinsic 1/f noise of the synaptic FeFETs. The 1/f noise level of the fabricated synaptic FeFETs versus I_D (Figure 1k) falls within the optimal noise range (Figure 5d) in all I_D operating regions (Figure 1e).

Effects of the Device-to-Device Variation and the Initial Conductance Distribution
In fabricated synaptic devices, inherent variations between devices are commonly observed. [45] The effect of the device-to-device variation of the fabricated synaptic FeFETs on the performance of the proposed in-memory RL system is investigated. The device-to-device variation is expressed as follows [46,47]

G_real = G_ideal × (1 + N(0, σ))

where G_real and G_ideal denote the conductance of the synaptic FeFET with and without the device-to-device variation, respectively, N(0, σ) is a zero-mean Gaussian random variable, and σ indicates the standard deviation of the device-to-device variation.
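The stated variation model translates directly into a one-line perturbation of the conductance array, applied once per device before training. In the sketch below, the array shape and the 50 μS baseline are illustrative assumptions.

```python
import numpy as np

# Realization of the device-to-device variation model above: a fixed
# multiplicative error per device, drawn once before training.

def apply_d2d_variation(G_ideal, sigma, rng=np.random.default_rng(0)):
    return G_ideal * (1.0 + rng.normal(0.0, sigma, G_ideal.shape))

G_ideal = np.full((64, 4), 50e-6)                 # uniform 50 uS array (assumed)
G_real = apply_d2d_variation(G_ideal, sigma=0.4)  # the sigma = 0.4 case of Figure 6
```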
Figure 6a-d shows the effect of the device-to-device variation on the performance of the proposed in-memory RL system. The performance evaluations were conducted under conditions identical to those in the previous section, except for the device-to-device variation. Figure 6a-c depicts the success rate of the network, the number of actions required to solve the problem, and the ACR of the network for various device-to-device variations. Each result represents the average value of five separate networks; the results of each network are illustrated in Figure S14, Supporting Information. As the device-to-device variation increases, the training speed gradually declines. However, up to σ = 0.4, the proposed in-memory RL system maintains a high success rate with the optimal number of actions, exhibiting strong immunity to the device-to-device variation. The ACR also converges to about 1% after sufficient exploration in the early stage of training, and the network is stabilized. When σ = 0.5, the performance of the network slightly degrades. The average value of the results of the last 1000 episodes in the later stage of training is shown in Figure 6d. The saturation episode indicates the average number of episodes required to achieve a success rate of 100% in each network. The saturation episode increases with σ, indicating a decline in the training speed. Overall, the device-to-device variation of the fabricated synaptic FeFETs hardly degrades the performance of the proposed in-memory RL system.
Figure 6e represents the performance of the network as a function of the maximum number of actions per episode and the number of episodes per problem. Both of these variables play a crucial role in achieving high performance. The performance of the network improves as these variables increase, while the network is poorly trained in the opposite case. An adequate maximum number of actions allows the agent sufficient time to explore the environment and gather information; if the maximum number of actions is excessively small, the agent may not have enough time to find optimal strategies. On the other hand, if the number of episodes per problem is excessively large, training time is prolonged without substantial performance improvements.
The initial conductance distribution of the synaptic devices before training the network is an important factor affecting the performance of the network. [48,49] The performance evaluations were conducted under conditions identical to those in the previous section for various initial conductance distributions (σ/μ). Figure 7a-c depicts the effect of the σ/μ on the performance of the network. The average performance of five separate untrained networks in each distribution is presented, and each result is the average value of the last 1000 episodes in the later stage of training. The black and red symbols in Figure 7a represent the success rate of the network and the number of actions required to solve the problem, respectively. The networks exhibit consistent and robust performance with minimal variation, regardless of the σ/μ. The number of actions required to solve the problem is the optimal number of 14, with slight variation according to the σ/μ. Figure 7b shows the saturation episodes for various σ/μ. As the σ/μ increases, the saturation episode decreases (faster training), while the variation of each result increases. The ACR in the early stage of training decreases as the σ/μ increases (Figure 7c). These results are consistent with Figure 7b, where the training speed accelerates with increasing σ/μ.

Conclusion
We have proposed an in-memory RL system leveraging the intrinsic 1/f noise of the fabricated synaptic FeFETs for efficient exploration. The proposed in-memory RL system exhibits performance comparable to that of networks utilizing the ε-greedy method, even in the presence of the 1/f noise during both the training and test processes. Despite employing deterministic action selection, the network conducted sufficient exploration during the training process through the intrinsic 1/f noise of the synaptic FeFETs, even at a minimal 1/f noise level. The contribution of the intrinsic 1/f noise of the synaptic FeFETs in facilitating exploration was comprehensively examined. This approach obviates the additional circuits that are essential for implementing conventional exploration methods on a hardware basis, thereby reducing the hardware burden.
The effect of the 1/f noise level on the in-memory RL system was investigated. The performance degrades at excessively high or low 1/f noise levels, while high performance is achieved within the optimal range of 1/f noise levels. In this regard, one significant advantage of the exploration method employed in this work is the simplicity of adjusting the intrinsic 1/f noise of the synaptic FeFETs to an optimal level. This feature enhances the flexibility and efficiency of the proposed in-memory RL system while imposing a minimal hardware burden. The system also exhibits strong immunity to the device-to-device variation and the initial conductance distribution of the fabricated synaptic FeFETs. These results highlight the robustness and reliability of the proposed in-memory RL system.

Experimental Section
Fabrication Process of Synaptic FeFETs: A p-type silicon-on-insulator wafer with a low dopant concentration was used for the fabrication of the FeFETs. The device silicon layer had a thickness of 100 nm. The wafer was cleaned using an SPM solution (H2SO4:H2O2); these cleaning steps were conducted after the active patterning. Following the cleaning process, the dielectric layer (Al2O3) and the ferroelectric layer (HZO) were deposited using atomic layer deposition. The deposition cycles of the HZO layer involved two cycles of HfO2 and one cycle of ZrO2; these cycles were repeated 23 times, with an additional two cycles of HfO2, to form a 6.2 nm thick HZO layer. Consequently, a 1.0 nm Al2O3 layer and a 6.2 nm HZO layer were formed as the dielectric and ferroelectric layers, respectively. A 100 nm TiN layer was sputtered on the wafer; this TiN layer was then patterned to form the gate metal and also served as a hard mask for the subsequent implantation step. Phosphorus ions were implanted into the source and drain regions with a dose of 10^15 cm^−2 and an energy of 10 keV. To crystallize the HZO film and activate the dopants, a postmetal annealing step was performed using rapid thermal annealing at 500 °C for 30 s in an N2 ambient. Finally, a high-pressure annealing (HPA) process was conducted to enhance the ferroelectric properties of the FeFETs. The HPA was conducted at 300 °C for 30 min in a forming gas ambient consisting of 4% H2 and 96% N2. The schematic views of the key fabrication process steps are illustrated in Figure S1, Supporting Information.
Electrical Measurement: A probe station and a semiconductor parameter analyzer (B1500A) were utilized to assess the synaptic memory characteristics of the fabricated FeFETs. The PSD of the FeFET was measured using a Keysight semiconductor device analyzer (B1500A), a Stanford Research Systems low-noise current preamplifier (SR570), and a Keysight dynamic signal analyzer (35670A). First, the B1500A supplied the voltage to the gate. Then, the SR570 converted the output current fluctuation into a voltage fluctuation. Lastly, the 35670A converted the dynamic signal from the SR570 into the PSD.
Network Structure and Training Process: The DQN used to solve the Frozen Lake problem is depicted in Figure S10, Supporting Information. The network consisted of 64 input neurons and four output neurons. The 64 input neurons received the observed data from the environment; the observed data expressed the current location of the agent in the 8 × 8 grid world in one-hot encoding and were applied as input to the network. The four output neurons corresponded to the four possible actions the agent can take. For every training episode, the performance of the network was assessed through 100 test episodes. The parameters of the DQN are shown in Table S1, Supporting Information. The previously proposed synaptic weight initialization method was used. [48] The simulations reflected the measured electrical characteristics of the fabricated synaptic FeFETs.
The network was trained employing a deep Q-learning algorithm, one of the most widely used algorithms in RL, and the previously reported training method approximating the backpropagation algorithm. [18] The entire training process is detailed in Note S2 and Figure S16, Supporting Information. When synaptic weight potentiation was required, the conductance of the excitatory and inhibitory synaptic FeFETs was increased and decreased, respectively; the opposite procedure was followed when synaptic weight depression was necessary. Unlike previously proposed update methods, the proposed update method can train the network well without resetting the conductance of the synaptic devices. [50]
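To make the update rule concrete, the following sketch reduces one training step to its logical skeleton: a deep Q-learning target followed by a differential, non-resetting conductance update that raises one FeFET of the pair and lowers the other. The single-layer weight table, the discount factor, and the fixed, sign-based conductance step DG stand in for the approximated backpropagation of refs. [18,19] and are assumptions for illustration.

```python
import numpy as np

# Highly simplified sketch of one training step (assumptions noted above).

GAMMA = 0.9   # discount factor (assumed)
DG = 0.5e-6   # one conductance update step (assumed)

def q_values(G_exc, G_inh, s):
    return G_exc[s] - G_inh[s]      # differential pair -> signed weight

def train_step(G_exc, G_inh, s, a, r, s_next, done):
    q = q_values(G_exc, G_inh, s)
    target = r if done else r + GAMMA * np.max(q_values(G_exc, G_inh, s_next))
    if target > q[a]:               # potentiation: raise G_exc, lower G_inh
        G_exc[s, a] += DG
        G_inh[s, a] -= DG
    else:                           # depression: the opposite procedure
        G_exc[s, a] -= DG
        G_inh[s, a] += DG
    # in hardware, each update would saturate at the device's conductance range
    np.clip(G_exc, 0.0, None, out=G_exc)
    np.clip(G_inh, 0.0, None, out=G_inh)
```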

Figure 1. Electrical characteristics of synaptic FeFETs. a) Schematic illustration of the biological nervous system and a synaptic FeFET. b) Top optical image of the fabricated FeFET array with dimensions of 12 × 24. c) TEM cross-sectional image of the fabricated FeFET. d) Transfer curves (I_D-V_GS) of the fabricated FeFET. e) Transfer curves with consecutive PGM and ERS pulses. The fabricated FeFET changes to the HRS or LRS depending on the applied pulse due to the polarization change of the HZO layer. f) Average and standard deviation of both the LTP and LTD characteristics of the fabricated FeFET with the number of applied pulses during 20 cycles. The fabricated FeFET exhibits low cycle-to-cycle variation. g) I_D variation over time under various V_GS conditions. h) I_D-normalized PSD of the fabricated FeFET under various V_GS conditions. The I_D-normalized PSD decreases with an increase in V_GS. i) S_ID/I_D^2 sampled at 10 Hz and (g_m/I_D)^2 versus I_D of the fabricated FeFET. The origin of the intrinsic 1/f noise of the fabricated FeFET can be explained by the CNF model since S_ID/I_D^2 and (g_m/I_D)^2 exhibit the same tendency in all I_D operating regions. The 1/f noise level (σ_ID/I_D) of the fabricated FeFET versus j) read time, k) read current, and l) device size. The intrinsic 1/f noise of the fabricated FeFET can be simply adjusted to the optimal level by controlling read time, operating voltage, device size, etc.

Figure 2. In-memory RL system and problem-solving procedure. a) Interaction between the agent and environment in RL. b) Schematic illustration of the in-memory RL system consisting of I&F neurons and a synaptic FeFET array. The observation data of the environment (current state) are applied to the synapse array. The I&F neurons generate postsynaptic spikes proportional to the summed presynaptic spikes. The agent takes an action based on the output neuron with the highest firing frequency. c) Flowchart of the problem-solving procedure of the in-memory RL system. A series of processes is repeated until the given problem is solved.

Figure 3. Operations of the in-memory RL system with the efficient exploration method. a) Number of output spikes generated from the four output neurons for every possible state in the Frozen Lake problem at the early and later stages of training. The intrinsic 1/f noise of the synaptic FeFETs can easily change the priority between the numbers of output spikes in the early stage of training (exploration strategy). However, changing the priority in the later stage of training is difficult since a particular output neuron generates spikes dominantly (exploitation strategy). b) Standard deviation (σ) of the number of output spikes for every possible state in the Frozen Lake problem. c) Average σs for every possible state in the early and later stages of training. The average σ increases by a factor of about 6.5 as the training progresses. A large σ indicates a stable network that remains unaffected by the 1/f noise (exploitation strategy), while a small σ indicates a network whose action selection is highly susceptible to the 1/f noise (exploration strategy). d) Number of output spikes of a well-trained network when the hole is west (state 1) and south (state 2) of the agent, as examples. In the figure on the left, the red arrows on the agent represent the agent's predicted action selection, obtained from the differences between the numbers of output spikes; the length of each arrow is proportional to the number of output spikes. In both arbitrary states, the agent avoids the hole and moves in the direction of the goal tile. e) Change in conductance of all synaptic FeFETs in the network as the training progresses. As the training progresses, particular synaptic weights become dominant, mitigating the effects of the 1/f noise on the network.

Figure 4. Problem-solving results of the proposed in-memory RL system. a) Success rate of the network, b) number of actions required to solve the problem, and c) ACR of the network as the training progresses. The performance of the network with the exploration method leveraging the intrinsic 1/f noise of the synaptic FeFETs (black lines) is similar to that of the conventional ε-greedy method (red lines). In contrast, the network devoid of the 1/f noise (green lines) is poorly trained since insufficient exploration is conducted. Ratio at which the agent passes each tile (passing rate) during the test episodes at the d) early, e) intermediate, and f) later stages of training. The exploration method utilizing the 1/f noise is used for network training. The network without the 1/f noise exhibits a single path (exploitation) without diversity, regardless of the training process. Conversely, the 1/f noise facilitates exploration by searching various paths in the early stage of training. After the policy of the agent is optimized (later stage of training), the 1/f noise no longer affects the action selection (exploitation), reducing the diversity of paths. g) Initial and well-trained policy matrices of the agent in the Frozen Lake problem. The blue tiles represent the path followed by the agent. The nearly random distribution of the initial policy matrix demonstrates difficulty in finding a path to the goal tile, while the well-trained policy matrix clearly delineates the optimal path to the goal tile.

Figure 5. Problem-solving results for various 1/f noise levels. a) Success rate of the network, b) number of actions required to solve the problem, and c) ACR of the network for various 1/f noise levels as the training progresses. An excessively high 1/f noise level hampers the network training and degrades the training speed and the success rate. d) Success rate (black symbols) and number of actions required to solve the problem (red symbols) for various 1/f noise levels. The network exhibits adequate exploration and achieves high performance within the optimal range of 1/f noise levels, while excessively high or low 1/f noise levels degrade the performance. Adjusting the 1/f noise level is essential for achieving high performance in the in-memory RL system. e) Average passing rate of ten separate networks for various 1/f noise levels. The agent has trouble finding an optimal path to the goal tile as the 1/f noise level increases. The movement of the agent follows an almost random trajectory when σ_ID/I_D = 0.2. f) Paths followed by the agents of ten separate networks for various 1/f noise levels. Suboptimal paths are formed at a high 1/f noise level, while every agent of the ten networks finds the optimal path when σ_ID/I_D = 0.01. The agents of most of the networks get stuck in an endless loop or fall into a hole when σ_ID/I_D = 0.2. g) Policy matrices of the agent for σ_ID/I_D = 0.1 and 0.2. When σ_ID/I_D = 0.1, two suboptimal paths (yellow arrows) and endless loops (red arrows) are formed. The variability in action selection attributed to the 1/f noise often enables the agent to escape from the loops. On the contrary, when σ_ID/I_D = 0.2, the agent has difficulty escaping the loops, often falling into a hole.

Figure 6. Problem-solving results for various device-to-device variations. a) Success rate of the network, b) number of actions required to solve the problem, and c) ACR of the network for various device-to-device variations as the training progresses. The proposed in-memory RL system exhibits strong immunity to the device-to-device variation. d) Average value of the last 1000 episodes in the later stage of training. The performance of the network, including the training speed, slightly degrades as σ increases. e) Performance of the network as a function of the maximum number of actions per episode and the number of episodes per problem. A sufficient number of actions per episode and episodes per problem is necessary for high performance.

Figure 7. Effect of the initial conductance distribution on the performance of the network. a) Success rate of the network and number of actions required to solve the problem, b) saturation episode, and c) ACR of the network for various initial conductance distributions (σ/μ). Each result represents the average value of the last 1000 episodes in the later stage of training. The networks exhibit consistent and robust performance with minimal variation, regardless of the initial conductance distribution.