Tolerating Noise Effects in Processing‐in‐Memory Systems for Neural Networks: A Hardware–Software Codesign Perspective

Neural networks have been widely used for advanced tasks ranging from image recognition to natural language processing. Many recent works focus on improving the efficiency of executing neural networks in diverse applications. Researchers have advocated processing-in-memory (PIM) architectures as promising candidates for training and testing neural networks because PIM designs can reduce the communication cost between storage and computing units. However, noise exists in PIM systems, generated by the intrinsic physical properties of both the memory devices and the peripheral circuits. This noise makes it challenging to train the systems stably and to achieve high test performance, e.g., accuracy in classification tasks. This review discusses current approaches to tolerating noise effects for both training and inference in PIM systems and provides an analysis from a hardware-software codesign perspective. Noise-tolerant strategies for PIM systems based on resistive random-access memory (ReRAM), including circuit-level, algorithm-level, and system-level solutions, are explained. In addition, we present selected noise-tolerant cases in PIM systems for generative adversarial networks and physical neural networks.

results. In this article, we discuss circuit-level, algorithm-level, and system-level solutions for building reliable memristor-based designs for neural network training and inference in the presence of noise in the PIM system. Inference with a PIM system means that the forward path of the neural network is implemented in the PIM system: the noise influences inference accuracy, but the backward path and weight update are noise-free. [11] Training with PIM systems, however, is a more complex topic, and even a small mismatch in the system can lead to poor training results. [12] Therefore, the on-chip training algorithm should adapt to the system imperfections. [13,14] In addition, we look into the latest work on noise tolerance in PIM systems. For instance, the photonic PIM system offers high parallelism and high energy efficiency, but the devices suffer from programming noise. Likewise, physical systems have to deal with the nonidealities of both devices and peripheral circuits. Therefore, we show representative noise-tolerant cases in these systems.

Background
In this section, we explain ReRAM-based processing-in-memory designs and analyze the noise in PIM systems. We also cover general approaches to noise in neural networks.

ReRAM-Based Processing-in-Memory Designs
ReRAM is regarded as one of the most promising technologies for neural network acceleration. A ReRAM cell consists of a metal-oxide material layer inserted between two metal electrode layers. ReRAM offers nonvolatility, small size, multilevel cell representation, fast access time, and low access energy. ReRAM is energy efficient because matrix-vector multiplication can be performed directly in the memristive array structure. Beyond that, novel designs such as PRIME, [15] ISAAC, [16] and PipeLayer [7] explore further improvements in energy efficiency.
The benefit of utilizing ReRAM as a neural network accelerator lies in the crossbar structure, which can complete matrix-vector multiplication in place. The horizontal wordlines (WL) and the vertical bitlines (BL) can be treated as matrix rows and columns, and the conductance value at each intersection can be viewed as the element in the corresponding matrix position. The conductance is preprogrammed according to the weight values. Figure 2 illustrates the crossbar array structure. Let G be the conductance matrix of the ReRAM crossbar. When the voltage V_I is applied to the crossbar wordlines, the output current generated through the bitlines is I_O = V_I G according to Kirchhoff's current law and Ohm's law.
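As a minimal sketch (our own illustration, not code from any cited design), the ideal crossbar computation above is just a dot product of the wordline voltages with each column of conductances:

```python
# Ideal crossbar model: I_O = V_I * G. Each bitline current is the sum
# of the per-cell currents (Ohm's law) flowing into that column
# (Kirchhoff's current law).

def crossbar_mvm(voltages, conductance):
    """voltages: wordline voltages (V); conductance: rows x cols matrix (S)."""
    n_rows, n_cols = len(conductance), len(conductance[0])
    assert len(voltages) == n_rows
    return [sum(voltages[r] * conductance[r][c] for r in range(n_rows))
            for c in range(n_cols)]

V = [0.1, 0.2]                      # input vector encoded as voltages
G = [[1e-3, 2e-3], [3e-3, 4e-3]]    # weight matrix encoded as conductances
print(crossbar_mvm(V, G))           # bitline currents, approx. [7e-4, 1e-3] A
```

In a real array, nonidealities such as wire resistance and device noise perturb these currents, which is exactly the problem the rest of this review addresses.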
We use GAN as an example to explain the advantages of ReRAM-based PIM design. In a GAN model, two neural network models (i.e., the generator and the discriminator) are trained simultaneously and therefore consume high computation and storage resources. Moreover, the generator in GAN uses a novel transposed convolutional (TCONV) layer, which inserts many zeros into its input feature maps before performing traditional convolution operations, leading to significant resource underutilization. To address these challenges, ReRAM-based PIM designs have been proposed to reduce the resource requirement and improve the resource utilization rate. For example, ReGAN, [9] a ReRAM-based GAN accelerator, efficiently pipelines the training to reduce on-chip memory access and training latency. In addition, spatial parallelism and computation sharing are proposed to further improve performance. Another work, ZARA, addresses the resource underutilization problem with a novel computation deformation technique that can skip zero-insertions in TCONV. [17] By preclassifying the kernels into subkernels, the proposed design can get rid of the zero-insertion step in TCONV. Most importantly, this deformation design in the forward phase can be extended to the error backpropagation and weight update phases, so the training efficiency can be further improved with this ReRAM-based design.
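To make the underutilization concrete, here is a small illustrative sketch (ours, not ZARA's actual dataflow) of the zero-insertion step that a stride-s TCONV applies to its input feature map before the ordinary convolution:

```python
# Zero-insertion in a transposed convolution: with stride s, s-1 zeros
# are inserted between neighboring input elements, so most of the
# subsequent multiply-accumulate operations touch zero-valued inputs.

def insert_zeros(fmap, stride):
    h, w = len(fmap), len(fmap[0])
    out_h, out_w = (h - 1) * stride + 1, (w - 1) * stride + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(h):
        for j in range(w):
            out[i * stride][j * stride] = fmap[i][j]
    return out

x = [[1, 2], [3, 4]]
up = insert_zeros(x, 2)
zeros = sum(v == 0 for row in up for v in row)
print(up, zeros)    # 5 of the 9 entries in the upsampled map are zero
```

Even in this tiny 2x2 example more than half of the upsampled map is zero, which is why skipping the zero-insertion step pays off.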

Noise in ReRAM-Based PIM Systems
The variation in ReRAM-based neural network computing comprises parameter variation and process variation. The resistance state (high resistance state or low resistance state) of a ReRAM cell changes according to the voltage applied to the electrode layers. When a voltage is applied to the ReRAM cell, a conductive filament consisting of oxygen vacancies may form. The randomness of generating oxygen vacancies is the major source of process variation. [18] Given the same voltage level and equivalent conditions, the shape of the oxygen vacancies is unpredictable, which leads to differences in resistance levels among cycles. Hsu et al. studied the variation in ReRAM resistance and showed the fitted distributions for the high-resistance state and the low-resistance state. [19] For inference in ReRAM-based PIM systems, different kinds of noise should be carefully studied. Thermal noise is generated by the thermal agitation of the charge carriers inside the conductor and depends on temperature and operating frequency. In ReRAM devices, this thermal noise affects the output current; therefore, it can be modeled as a current source in parallel with a passive resistor. [20] Shot noise originates from the discrete charges in the current flow and is related to the operating frequency. [21] Random telegraph noise (RTN) arises in semiconductors and ultrathin oxide films. Programming noise is introduced when writing to ReRAM devices.
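A hedged sketch of the parallel-current-source view of thermal noise follows; it assumes the standard Johnson-Nyquist expression, under which the rms thermal-noise current of a conductance G over bandwidth B at temperature T is sqrt(4*k*T*G*B). The cited works may use more detailed models.

```python
import math
import random

K_B = 1.380649e-23  # Boltzmann constant (J/K)

def thermal_noise_rms(conductance, temperature, bandwidth):
    """Johnson-Nyquist rms noise current (A) for a resistive cell."""
    return math.sqrt(4 * K_B * temperature * conductance * bandwidth)

def noisy_read(voltage, conductance, temperature=300.0, bandwidth=1e9):
    """Ideal read current plus a Gaussian thermal-noise sample."""
    ideal = voltage * conductance
    sigma = thermal_noise_rms(conductance, temperature, bandwidth)
    return ideal + random.gauss(0.0, sigma)

# The noise amplitude grows with temperature and bandwidth (frequency).
print(thermal_noise_rms(1e-4, 300.0, 1e9) < thermal_noise_rms(1e-4, 400.0, 1e9))  # True
```

This captures the qualitative point used later in the review: thermal noise scales with the square root of the conductance, and rising temperature or operating frequency enlarges it.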
To summarize, the potential challenges of noise in the PIM system lie in the following aspects. First, at high operating frequency or high temperature, the thermal noise and shot noise increase and lead to high-amplitude noise; accuracy degradation occurs as the operating frequency or temperature increases. [22] Second, the noise is aggregated when multiple noise sources combine. Third, the tolerable noise margin is reduced when the ReRAM cell resolution is high, such as 8 bit. Therefore, studies of tolerating noise effects in PIM systems should consider these challenges and improve the system's reliability under model perturbation.

General Approaches Towards Noise in Neural Networks
Before moving to PIM-specific solutions, it is important to examine the general approaches to noise in neural networks. The elements involved in software-based neural network training are in floating-point format. However, when implementing a neural network with analog or digital hardware, the data precision of the system should be carefully determined because of two concerns.
First, reducing weight and activation precision introduces quantization noise and can lead to test accuracy degradation. In this setting, the quantization effect is considered only in the forward path. For instance, the deep compression method starts from a full-precision pretrained model and quantizes the model with a fine-tuning procedure. [23] Recent research shows that post-training quantization, which does not involve retraining or fine-tuning the full model, can also effectively quantize the model. [24][25][26][27] Second, training convergence becomes a critical issue when the quantized version of the neural network is used during both training and testing. Chen et al. proposed a hardware accelerator for convolutional neural networks and showed that the training precision should be 32 bits instead of 16 bits to guarantee convergence. [28] To relax the constraints on training precision, Gupta et al. proposed stochastic rounding for gradients so that the computing units can use 16-bit fixed-point data for both the forward and backward paths. [29] However, quantization maps the elements to discrete values, and the performance of gradient-based methods degrades because the continuous space is no longer available. The straight-through estimator (STE) was proposed to support the training of quantized neural networks with ultralow precision. [30] In this method, a copy of the full-precision weight is kept, and the quantized weight is used for the forward-path calculation. During the backward path, the gradient of the quantized weight is passed directly to the gradient of the full-precision weight. Consequently, gradient-based optimization can be performed on the full-precision weight. This STE method has inspired multiple quantized neural network designs. [31][32][33] For generalized quantization robustness, recent works, including gradient L1 regularization and Hessian-based regularization, target various bit widths instead of one specific width and show the effectiveness of the proposed regularization schemes. [34,35]
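The STE mechanics can be sketched on a single scalar weight. This is a toy illustration with our own quantizer, loss, and learning rate, not any cited implementation:

```python
# Straight-through estimator (STE): the forward pass uses the quantized
# weight, while the gradient is applied unchanged to a full-precision copy.

def quantize(w, step=0.25):
    """Uniform quantizer with a fixed step size."""
    return step * round(w / step)

def ste_step(w_fp, x, y_target, lr=0.1):
    w_q = quantize(w_fp)             # forward path uses the quantized weight
    y = w_q * x
    grad_y = 2 * (y - y_target)      # gradient of the squared error
    grad_wq = grad_y * x             # gradient w.r.t. the quantized weight
    # STE: pass grad_wq straight through to the full-precision weight.
    return w_fp - lr * grad_wq

w = 0.4
for _ in range(50):
    w = ste_step(w, x=1.0, y_target=1.0)
print(quantize(w))                   # prints 1.0: the quantized weight reaches the target
```

Without the full-precision copy, the quantized weight would be stuck whenever a single gradient step is smaller than the quantization step; the copy accumulates small updates until a level transition occurs.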

Circuit-Level and Algorithm-Level Solutions
In this section, we analyze circuit-level and algorithm-level approaches for tolerating noise effects in PIM designs. We first go over an example of in situ learning of a single-layer perceptron classifier, as shown in Figure 3. [36] The training set and the weight initialization are the inputs to the training system, and the crossbar (shown in the gray-shaded box) is where the matrix-vector multiplication and gradient update take place. The Delta learning rule and the Manhattan learning rule are adopted for hardware-based in situ learning, and the equations for the learning rules can be found in Figure 3b. [37,38] Here, the Manhattan learning rule is more hardware-friendly and efficient because it applies only the sign information of the result from the Delta learning rule. Considering the switching dynamics of memristive hardware devices, a variable-amplitude writing scheme has been proposed to make the training algorithm more compatible with the circuit design. [39] When the design scales up to a multilayer perceptron, the noise problem becomes more severe because the disturbed output from the first layer serves as the activation for the second layer; thus, the impact of noise in different layers may accumulate. In a circuit-level work, [40] the weight disturbances of memristor cells under different read operation directions are analyzed, and preferred read schemes are proposed to reduce the impact of memristance drift and improve the system reliability. To combat possible noise effects in the system, a closed-loop circuit provides feedback control from each layer's output to the weights. This design takes advantage of the feedback loop to stabilize the weights. In this work, the learning algorithm incorporates the hardware inference result into the update rule. Results show that the closed-loop design can maintain good performance for 14.95× the service time of the open-loop design.
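The contrast between the two learning rules can be sketched as follows. This is an illustrative toy with our own learning rates; the actual equations are those in Figure 3b:

```python
# Delta rule vs. Manhattan rule for in situ perceptron training.
# The Manhattan rule keeps only the sign of the Delta-rule update, so
# every memristor receives the same fixed-amplitude programming pulse.

def sign(v):
    return (v > 0) - (v < 0)

def delta_update(w, x, target, lr=0.05):
    y = sum(wi * xi for wi, xi in zip(w, x))     # crossbar dot product
    return [wi + lr * (target - y) * xi for wi, xi in zip(w, x)]

def manhattan_update(w, x, target, step=0.05):
    y = sum(wi * xi for wi, xi in zip(w, x))
    # Fixed-amplitude pulse per cell, direction given by the Delta-rule sign.
    return [wi + step * sign((target - y) * xi) for wi, xi in zip(w, x)]

w = [0.0, 0.0]
for _ in range(100):
    w = manhattan_update(w, x=[1.0, 0.5], target=1.0)
y = w[0] * 1.0 + w[1] * 0.5
print(round(y, 2))                               # close to the target 1.0
```

The Manhattan rule converges only to within one pulse amplitude of the target (it oscillates around it), which is the price paid for avoiding precisely tuned analog write amplitudes.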
Some solutions target the effect of one specific noise source in the ReRAM-based PIM system. For example, one design strategy is to use a hardware training method to minimize problems caused by process variations, where the parameter tuning strategy is based on the difference between the target value and the noisy output value. [41] Taking programming noise as another example, we can analyze its effect and the possible solutions. Multiple memristor cells can be combined to represent a single weight when the memristor cell's resolution is smaller than the weight's resolution, which is often the case in a ReRAM-based PIM system. [42] Thus, the programming noise of the more significant cells is multiplied by a scalar during integration, and the programming noise in the most significant cell is of the greatest concern. Long et al. proposed a dynamic fixed-point representation to reduce the unused most significant bits of memristor cells. [43] In addition, the noise-aware training algorithm manipulates the disturbed weight in the forward path while keeping the original weight copy for the backward phase, which is similar to the quantization-aware training method. The results show that the combination of the dynamic fixed-point data format and noise-aware training helps find a smoother local optimum in the loss landscape.
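The core loop of noise-aware training can be sketched on a single weight; this is a hedged illustration inspired by the approach just described, not a reproduction of the cited work, with our own noise level and loss:

```python
import random

# Noise-aware training: the forward pass sees a disturbed weight, while
# the gradient update is applied to an undisturbed master copy, in
# analogy with quantization-aware training.

def noise_aware_step(w_clean, x, y_target, sigma=0.05, lr=0.1):
    w_noisy = w_clean + random.gauss(0.0, sigma)   # simple programming-noise model
    y = w_noisy * x                                # forward path with noise
    grad = 2 * (y - y_target) * x                  # backward path as usual
    return w_clean - lr * grad                     # update the clean copy

random.seed(0)
w = 0.0
for _ in range(200):
    w = noise_aware_step(w, x=1.0, y_target=0.8)
print(abs(w - 0.8) < 0.2)                          # True: trained near the target
```

Because every forward pass samples a fresh perturbation, the optimizer is pushed toward solutions whose loss is low in a whole neighborhood of the weights, not just at a single point.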
The importance of weight elements in the neural network is also leveraged in hardware design. Beigi et al. analyzed the importance of weight elements and identified the most effective rows in a weight matrix; in their thermal-aware mapping method, the most effective rows are mapped to cold rows and the least effective rows to hot rows. [44] Furthermore, Shin et al. proposed a thermal-aware optimization framework to reduce the average temperature and temperature variance among memristor arrays. [45] Noise-tolerant techniques previously proposed for memory storage can also be adapted to PIM systems. For instance, Feinberg et al. utilize an error-correction scheme based on AN codes to monitor the results of in-memory computing. [20] The limitation of this method is that it applies only to binary cells.
To summarize, circuit-level and algorithm-level solutions explore the hardware and software possibilities for better design rules to achieve the goal of tolerating noise effects.

System-Level Solutions
To design noise-tolerant, high-performance, energy-efficient PIM systems, system-level approaches are necessary, and the overhead of such approaches should be minimized. One important feature of system-level solutions is that the framework takes the whole design space into consideration. For instance, Xia et al. propose an exploration flow for system parameter selection to overcome the impact of variations. [46] Moreover, they design a training algorithm that leverages the inherent self-healing capability of a neural network to prevent large weights from being mapped to memristors with large variation. For reliability optimization, a simulation framework that enables reliable ReRAM-based accelerators for deep learning (DL-RSIM) can be used to explore the design space of ReRAM-based deep learning accelerators. [18] DL-RSIM proposes a workload-dependent sensing strategy to show how to leverage the framework to build reliability optimization techniques. The simulation findings can assist chip designers in selecting a superior design alternative during the early stages of development.
In recent work, Yang et al. consider the different noise sources existing in PIM systems and target robustness and effectiveness with minimal design exploration cost. [42] They design a ReRAM-based stochastic-noise-aware training method (ReSNA) that includes three major design analyses. First, the distribution of noise under different frequency and temperature settings is analyzed. Thermal noise, shot noise, and programming noise are modeled as Gaussian-distributed noise, while RTN is modeled as Poisson-distributed noise. Programming noise scales linearly with the conductance level, while thermal and shot noise are related to the square root of the conductance value. [20] Second, the relative conductance change of the ReRAM cell is included during training. Note that the relative impact is larger at lower conductance levels, where most weights of a DNN are distributed. Moreover, thermal and shot noise dominate at small conductance, which is troublesome at high frequency. Third, the hardware settings, such as the resolutions of the analog-to-digital converter and the digital-to-analog converter, are also included in the ReSNA algorithm. To achieve robust and efficient ReRAM-based designs, the goal of the design space exploration is to find Pareto-optimal ReRAM-based systems with minimal cost. However, the challenge is that the number of available design configurations is up to 10^7, and each evaluation with ReSNA consumes a large amount of computation. The solution is continuous-fidelity max-value entropy search for multi-objective optimization (CF-MESMO), with the following key steps: Gaussian processes are employed as the statistical model because of their superior uncertainty quantification ability; then the Pareto front is sampled with a surrogate model, and the information gain per unit cost is calculated with a formula derived from the properties of Gaussian processes and Monte Carlo sampling.
Afterward, the next candidate ReRAM design and fidelity pair can be selected with the information-gain formula. The key idea is to find the ReRAM design and fidelity pair that maximizes the information gain per unit cost about the Pareto front at the highest fidelity. In the evaluation of optimization efficiency, CF-MESMO achieves a 90.91% reduction in computation cost compared with the popular multi-objective optimization algorithm NSGA-II when reaching the best performance of NSGA-II.
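The target of this exploration, a Pareto front over competing objectives, can be illustrated with a toy dominance filter (our illustration, not CF-MESMO itself), where each hypothetical design is scored by (accuracy, energy efficiency), both to be maximized:

```python
# Pareto-front extraction over a toy design space.

def dominates(a, b):
    """a dominates b if it is >= in every objective and > in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(designs):
    return [d for d in designs
            if not any(dominates(other, d) for other in designs if other is not d)]

# (accuracy, efficiency) for four hypothetical configurations A-D
candidates = {"A": (0.95, 0.2), "B": (0.90, 0.6), "C": (0.85, 0.5), "D": (0.80, 0.9)}
front = pareto_front(list(candidates.values()))
print(sorted(front))    # C is dominated by B; A, B, and D form the front
```

With up to 10^7 configurations and costly ReSNA evaluations, exhaustively scoring every point like this is infeasible, which is why CF-MESMO instead spends evaluations where the expected information gain about the front per unit cost is highest.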
Generally, system-level solutions for tolerating noise effects in PIM systems can successfully explore the design space for the optimal design choice while maintaining the test performance under stochastic noise.

Representative Noise-Tolerant Case Study
In this section, we present the latest work on noise-tolerating designs in PIM systems, covering a photonic system and a physical system. Beyond these cases, SRAM-based and FeFET-based PIM designs can also tolerate noise effects through different design techniques. [47,48]

Photonic Systems: Harnessing Noise Effect in Generative Adversarial Network
In a photonics-based generative adversarial network design (shown in Figure 4), Wu et al. propose an efficient weight-compensatory training method that harnesses the noise effect to successfully generate diverse image patterns. [49] Compared with the previously mentioned designs, the uniqueness of this design is that generative adversarial networks take random noise as an input. Therefore, the noise in the computing kernel is added to the noise from the input vector; moreover, the network itself takes advantage of the noise effect to produce different patterns to fool the discriminator. The photonic computation kernel consists of an array of programmable phase-change mode converters (PMMC). During the programming of the kernel weights, the noise distribution can be estimated as Gaussian. During the training process, noise of a specific standard deviation is included in the forward path, as shown in Figure 4b. In each layer, the input signals pass through the photonic tensor core and are converted to the electrical domain by photodetectors (PDs). After postprocessing, the data are converted back into the optical domain and transferred to the next layer. This data transformation process can also introduce noise into the result. The experimental results show that noise-aware training outperforms noiseless training with respect to image quality and diversity. The results of the noise-impact study show that the weight-compensatory GAN built on photonic hardware with practical noise outperforms the noiseless hardware in inference. The inference accuracy of a discriminative network, on the other hand, inevitably decreases as the hardware becomes noisier. [50] Despite the unavoidable optoelectronic disturbances and faults, this unexpected increase in performance reveals the promise of photonic neural networks in generative models.
However, the integration of photonic systems with complementary metal-oxide-semiconductor (CMOS) technology is still far from mature. Specifically, processing-in-memory photonic systems require cooperative operation of the photonic circuit and the electronic circuit to perform high-fidelity optical signal processing with a variety of optoelectronic functions. Monolithic integration of the photonic chip in proximity to the electronic circuits is thus necessary. One challenge toward this monolithic integration is the lack of a suitable material for photonic functions either in bulk CMOS or in state-of-the-art CMOS technologies below the 28 nm transistor node. [51] Another challenge is that several necessary photonic building blocks, such as on-chip circulators, isolators, and amplifiers, have still not proliferated to the extent that they have in bulk fiber-optics technologies. [52][53][54][55] Despite this, it is worth noting that efforts toward monolithic photonic-electronic systems have been demonstrated and have paved the way to integrating photonics with state-of-the-art CMOS technologies. [51]

Physical Systems: Redesigning Backpropagation to Tolerate Noise
To eliminate the effect of device asymmetry, an alternative stochastic gradient descent method called the Tiki-taka algorithm is proposed and validated with various neural networks. [56] The Tiki-taka algorithm can simultaneously minimize the training loss and the hidden cost introduced by the asymmetric update. Therefore, the adaptive learning algorithm can be beneficial for the training process with real devices.
In recent work, a physics-aware training method (PAT) has been proposed to support backpropagation in physical neural networks, as shown in Figure 5. [57] Physical neural networks include mechanical, optical, and electrical neural networks. PAT is capable of creating accurate hierarchical classifiers that take advantage of each system's unique physical transformations while intrinsically mitigating each system's noisy processes and flaws.

Figure 5. The physics-aware training method (PAT) is a hybrid in situ-in silico algorithm that applies backpropagation to train controllable physical parameters, so that physical systems perform machine-learning tasks accurately even in the presence of modeling errors and physical noise. Instead of performing the training solely within a digital model (in silico), PAT uses the physical systems to compute forward passes. Although only one layer is depicted in (a), PAT generalizes naturally to multiple layers. Reproduced with permission. [57] Copyright 2022, Springer Nature. Reproduced with permission. [49] Copyright 2022, AAAS.

The difference between PAT and the previously mentioned noise-aware training methods is that PAT uses different models for the forward and backward paths. Because the physical model is not differentiable, the approximated differentiable equation in Figure 5d is generated with a neural-architecture-search approach to mimic the transformation from input to output. The in situ part generates the output during PAT, and the in silico part produces the estimated gradient. The training convergence of PAT has been demonstrated experimentally and is superior to pure in silico training. This physics-aware training method removes the need for analytical gradients and improves efficiency. Furthermore, the in silico part of PAT helps tolerate the noise effect by providing precise gradient updates.
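The hybrid structure of PAT can be sketched in a deliberately simplified form. In this toy of ours (in the spirit of the cited work, not its implementation), the "physical" system is a noisy black box used only for forward passes, while gradients come from an imperfect differentiable surrogate:

```python
import random

def physical_forward(theta, x):
    """In situ forward pass: a noisy, non-differentiable black box
    (here a quadratic system with Gaussian readout noise)."""
    return theta * x + 0.1 * x ** 2 + random.gauss(0.0, 0.02)

def surrogate_grad_theta(theta, x, y, y_target):
    """In silico backward pass: gradient of the squared error under the
    surrogate model y_hat = theta * x, a deliberately imperfect
    approximation of the physical system."""
    return 2 * (y - y_target) * x

random.seed(1)
theta = 0.0
for _ in range(300):
    x = random.uniform(0.5, 1.5)
    y = physical_forward(theta, x)       # physical output, noise included
    theta -= 0.05 * surrogate_grad_theta(theta, x, y, y_target=x)

# The parameter compensates for the surrogate's modeling error because
# the error signal is computed from the *physical* output.
print(abs(physical_forward(theta, 1.0) - 1.0) < 0.15)   # True
```

The key point mirrors PAT: because the loss is evaluated on the real (noisy, imperfectly modeled) forward pass, the learned parameters absorb the mismatch between the surrogate and the physical system.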

Future Work
The idea driving noise-aware training is that there is a smoother surface in the loss space. However, theoretical examination of the existence of such a surface is lacking; a theoretical bound or guarantee would aid researchers in better comprehending the concept of noise tolerance. [58] Previous noise-tolerance research focuses on noise-aware training strategies. However, the training process can incur high computational cost. [42] Moreover, the training performance relies on precise modeling of the noise; that is, training with one specific noise may not help the test accuracy under another kind of noise. The noise distribution can only capture the characteristics of a collection of data points, but the noise in each trial is unpredictable. This problem becomes more severe as the cycle-to-cycle variation grows. Even though noise-aware training can help with the average performance, dynamic adaptation to each cycle is not well studied. Therefore, tolerating cycle-to-cycle variation is an important future topic to explore.
Another critical issue is how to deal with the noise effect in the heterogeneous PIM system. For instance, the processing subarrays may have different noise levels. How to dynamically monitor the noise level in different subarrays and automatically adjust the algorithm is worth exploring.
In addition, noise can sometimes be a desirable factor for the system. For instance, in Bayesian inference, noise in the PIM system can serve as a regularizer and improve performance. [59] This leads us to explore whether we can take advantage of noise to implement Bayesian neural networks with high energy efficiency.
Furthermore, if we consider noise tolerance to be a performance metric for the neural network, we can incorporate it into neural architecture search to identify the most resilient network architecture. The search space consists of the network design parameters and the hardware configurations. Each sampled neural network is evaluated under hardware noise. Therefore, the automatic search can provide guidelines for designing noise-tolerant networks and hardware platforms.