A Multifault-Tolerant Training Scheme for Nonideal Memristive Neural Networks

The memristor crossbar is extensively investigated as an energy-efficient accelerator for neural network (NN) computations. However, hardware implementation of NNs using realistic memristors is challenging due to the ubiquity of faults (mainly classified into hard and soft faults) in memristors. Herein, a hardware-friendly, low-power multifault-tolerant training (MFTT) scheme capable of addressing both hard and soft faults simultaneously for memristive NNs is proposed. The MFTT scheme consists of multifault detection, targeted weight pruning, and in situ training with the Manhattan update rule. Specifically, multifault detection is first conducted to detect both hard and large soft faults. The detected faulty weights are subsequently pruned to prevent them from disturbing the NN. The sparsified NN after pruning is trained in situ so that the hard and large soft faults can be effectively tolerated using the sparsity and self-adaptivity of NNs. In addition, the remaining small soft faults can be well tolerated by the Manhattan update rule. Experimentally, MFTT demonstrates the lowest accuracy losses among several representative fault-tolerant schemes not only in the hard fault-only (when the hard fault ratio exceeds 10%) and soft fault-only (when the soft faults are large) cases, but also in the case where both types of faults coexist.

Tremendous research efforts have been devoted to improving the fault tolerance of memristor crossbar-based NNs. To tolerate hard faults, both hardware- and software-level solutions have been proposed. The hardware-level solutions include switching off the individual access transistors connected to the faulty memristors [15] and replacing the faulty memristors with redundant ones. [16,17] These hardware-level solutions, however, introduce large area and routing overheads and complicate the design of peripheral circuits. On the other hand, the software-level solutions typically contain some or all of the following steps: adaptive weight mapping that maps the significant weights onto the fault-free memristors, weight pruning that fixes the weights mapped onto the faulty memristors to zero (or reduces these weights), and ex situ training (or retraining). [18-20] The software-level solutions can compensate for the accuracy losses induced by hard faults by exploiting the inherent sparsity of NNs, and they also save hardware overhead. Therefore, the software-level solutions have become the mainstream approach to tolerating hard faults.
For the tolerance of soft faults, in situ training, which can adjust the weights self-adaptively on chip, is an effective approach. [21] The fault tolerance can be further improved using modified weight update rules, such as the Manhattan update rule, [22,23] the stochastic update rule, [6,24] stochastic sparse update with momentum adaption, [25] and the sign-based backpropagation (BP) algorithm. [26,27] The main idea of these modified weight update rules is to move the weights in the general direction that reduces the loss function without regulating the step size. [28] This makes precise control over the weight update unnecessary, and hence the soft faults can be tolerated.
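As a rough illustration (all names and values below are assumptions for this sketch, not taken from the cited works), the difference between a gradient descent step and a Manhattan step can be written in a few lines of Python:

```python
# Minimal sketch contrasting gradient descent with the Manhattan update
# rule. All names (eta, grad, weights) and values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.uniform(-0.2, 0.2, size=(4, 3))   # signed weights of one layer
grad = rng.normal(size=weights.shape)           # backpropagated dJ/dw

eta = 0.0125  # one conductance step expressed in weight units

# Gradient descent scales the step by the gradient magnitude, which
# requires precise analog weight updates.
weights_gd = weights - eta * grad

# The Manhattan rule moves each weight by exactly one step in the
# direction that reduces the loss; the step size never depends on the
# gradient magnitude, so imprecise analog updates are tolerated.
weights_manhattan = weights - eta * np.sign(grad)
```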
As introduced earlier, the hard and soft faults have their respective solutions, but these solutions may not work properly when both faults are present simultaneously. For example, the mainstream solutions to hard faults typically use ex situ training, with which it is difficult to load the software-computed weights onto the memristor crossbar accurately in the presence of soft faults. [11] On the other hand, the solutions to soft faults adopt in situ training, whose performance can be deteriorated by hard faults. This is because the in situ training attempts to update the weights mapped onto the memristors with SA0 and SA1 faults, while these weights are indeed untunable. [20] Therefore, a general fault-tolerant solution capable of addressing both hard and soft faults simultaneously is urgently needed for the memristor implementation of NNs.
In this article, we propose a multifault-tolerant training (MFTT) scheme for rescuing the accuracies of memristive NNs with both hard and soft faults. This scheme includes three key steps: multifault detection, targeted weight pruning, and in situ training with the Manhattan update rule. The main contributions of this scheme are as follows. 1) Unlike previously proposed fault detection methods, which mainly detect hard faults, a multifault detection method is proposed to detect both hard and large soft faults. 2) A weight represented by a pair of memristors is regarded as faulty if either memristor in the pair is detected to be faulty. Pruning is applied to the faulty weight by tuning the conductance of the normal memristor in the pair to be the same as (or close to) that of the faulty memristor. The pruned weight is thus zero (or close to zero), and it will no longer be updated during training. This process is called targeted weight pruning, and it effectively prevents the faulty weights from disturbing the NN. 3) The sparsified NN after pruning is trained in situ with the Manhattan update rule to recover the accuracy. Thanks to pruning, the accuracy losses caused by hard and large soft faults can be recuperated during the self-adaptive in situ training. Meanwhile, the remaining small soft faults can be well tolerated by the hardware-friendly Manhattan update rule.
The proposed MFTT scheme is applied to prototype single-layer perceptron (SLP) and multilayer perceptron (MLP) networks for Modified National Institute of Standards and Technology (MNIST) image recognition. In the hard fault-only case, MFTT achieves an accuracy loss of 2.7% at a hard fault ratio of 20%, which is better than the performances of previous hard fault-tolerant schemes. In the soft fault-only case, MFTT outperforms previous in situ training-only schemes, particularly when the soft faults are large. Moreover, when both hard and soft faults are present simultaneously, MFTT exhibits the highest accuracy among all the investigated schemes.

Memristor Crossbar-Based NN
A memristor is a two-terminal passive circuit element whose conductance can be continuously tuned by applied electrical pulses. The memristor conductance can be used to represent a synaptic weight owing to its multivalue nature, nonvolatility, and tunability. An array of interconnected memristors forms a memristor crossbar, which can store a matrix of weights as conductances. Based on Ohm's and Kirchhoff's laws, when a vector of voltage pulses is applied along the rows of a memristor crossbar, the currents collected along the columns are

$$I_j = \sum_i V_i G_{ij} \quad (1)$$

where I_j is the output current along the j-th column, V_i is the input voltage along the i-th row, and G_ij is the memristor conductance at the cross point of the i-th row and j-th column. Equation (1) indeed represents a matrix-vector multiplication (MVM) operation, and its time complexity is reduced to O(1) using the memristor crossbar. Because G_ij is always positive, a pair of memristors [22] or additional reference memristors [13] are needed to represent a signed weight w_ij. Here, two differential crossbars with memristor pairs are used, as shown in Figure 1a. The difference between G^+_ij and G^-_ij can thus represent a signed weight w_ij.
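A minimal sketch of Equation (1) and the differential weight representation, assuming illustrative conductance values and array sizes, is given below; numpy only emulates the analog MVM that the crossbar performs in a single step:

```python
import numpy as np

rng = np.random.default_rng(1)

G_min, G_max = 1e-6, 1e-5          # siemens; assumed ON/OFF ratio of 10
n_rows, n_cols = 784, 10

# Two differential crossbars; each signed weight is G_plus - G_minus.
G_plus = rng.uniform(G_min, G_max, size=(n_rows, n_cols))
G_minus = rng.uniform(G_min, G_max, size=(n_rows, n_cols))

V = rng.uniform(0.0, 0.2, size=n_rows)   # input voltage pulses, one per row

# Equation (1): I_j = sum_i V_i * G_ij, evaluated for both crossbars.
# Physically this MVM happens in one step (O(1) in time); the matrix
# product below only emulates it.
I = V @ (G_plus - G_minus)               # differential column currents
```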
Here, one op-amp circuit is assumed for each column, but for a large-scale crossbar, a multiplexer may be used to allow several columns to share one op-amp circuit. Using the op-amp circuits, the output currents are converted to voltages, which may be further fed to analog-to-digital converters (ADCs) to produce digital data. The digital data are stored and used for further computations (e.g., the activation function). The multiplexer, ADCs, and digital logic circuits needed for this purpose are not shown in Figure 1a; they can be found in other studies. [13,29]

Hard and Soft Faults of Memristors
Hard faults include SA1 and SA0 faults, manifesting as memristors stuck at the ON and OFF states, respectively. Typically, the SA1 fault results from overforming defects, while the SA0 fault stems from open-switch defects. [30] The SA1 and SA0 faults can be randomly distributed or clustered in a row (or column) of a memristor crossbar. Here, only the random distribution is considered. Because a pair of memristors is used to represent a signed weight, one or both of the memristors may be faulty. Table 1 shows the possible SA0-SA1 combinations in the differential crossbars [18] when only hard faults are considered. C1 denotes the case where both positive and negative memristors are fault free. In this case, weights in the whole range of [−w_max, +w_max] can be successfully mapped. C2-C5 denote the cases where one of the memristors has either an SA0 or SA1 fault, while the other one is fault free. In these cases, weights in only half of the range of [−w_max, +w_max], that is, [−w_max, 0] or [0, +w_max], can be mapped. C6-C9 denote the cases where both memristors have either an SA0 or SA1 fault. In these cases, weights are fixed at 0, −w_max, or +w_max.
Soft faults refer to those causing the memristor conductance to deviate from its ideal value (while the conductance can still be tuned). The soft faults include cycle-to-cycle (C2C) and device-to-device (D2D) variations, writing nonlinearity, a finite number of conductance states, and a limited ON/OFF ratio. For a realistic memristor, the conductance can be tuned only in a finite range of [G_min, G_max] by applying writing pulses, and this range is quantified by the ON/OFF ratio (G_max/G_min). In addition, the number of conductance states (n_s) available in the range of [G_min, G_max] is also finite. During writing, the conductance change depends nonlinearly on the number of writing pulses, which is referred to as the writing nonlinearity (v). Every writing pulse can introduce a noise to the conductance change. This fault is called the C2C variation (σ_C2C), and σ_C2C typically follows a Gaussian distribution. [13] In addition, the conductance change induced by the same number of pulses can vary from device to device, a fault known as the D2D variation (σ_D2D). σ_D2D is difficult to quantify because differences in G_max/G_min, n_s, v, and σ_C2C among devices can all contribute to it. The aforementioned soft faults are schematically shown in Figure 1b.
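The following sketch illustrates, under assumed parameter values, how randomly distributed SA0/SA1 hard faults and a C2C writing noise could be injected into a simulated conductance matrix; it is an illustration of the fault models, not the authors' simulator:

```python
import numpy as np

rng = np.random.default_rng(2)
G_min, G_max, n_s = 1e-6, 1e-5, 32     # assumed device parameters
shape = (784, 10)

G = rng.uniform(G_min, G_max, size=shape)

# Randomly distributed hard faults: equal numbers of SA0 (stuck at OFF)
# and SA1 (stuck at ON) for a total hard fault ratio gamma_HF.
gamma_HF = 0.10
r = rng.random(shape)
sa0 = r < gamma_HF / 2
sa1 = (r >= gamma_HF / 2) & (r < gamma_HF)
G[sa0] = G_min
G[sa1] = G_max

# A small soft fault: Gaussian C2C noise on each conductance update,
# expressed as a fraction of one ideal conductance step.
sigma_c2c = 0.05
step = (G_max - G_min) / (n_s - 1)
dG = step * (1 + sigma_c2c * rng.standard_normal(shape))

# Only tunable (non-stuck) devices respond to the writing pulse.
tunable = ~(sa0 | sa1)
G_written = G.copy()
G_written[tunable] = np.clip(G[tunable] + dG[tunable], G_min, G_max)
```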

Previous Fault-Tolerant Schemes
Most previous fault-tolerant schemes addressed only one type of fault, either hard or soft. To tolerate hard faults, both hardware- and software-level solutions have been investigated. Chen et al. [15] proposed using the 1T1R array architecture for the memristor crossbar, whereby the faulty memristors could be pruned by switching off the individual access transistors connected to them. Liu et al. [16] utilized redundant columns of memristors to replace the faulty columns onto which significant weights were mapped. Although the hardware-level solutions can alleviate the accuracy losses caused by hard faults, they introduce large area and routing overheads and also increase the design complexity of peripheral circuits. To circumvent these issues, a variety of software-level solutions have been proposed. Liu et al. [16] applied a fault-aware retraining scheme to an SLP on the MNIST dataset (note: the MNIST dataset is always used hereafter unless otherwise specified), recovering the accuracy to 98.1% of the ideal value in the presence of 20% hard faults. Chen and Song et al. [15,19] used a bipartite-matching algorithm to map significant weights to the fault-free memristors; they further reduced the large weights mapped onto the faulty memristors and then retrained the network. Xia et al. [20] proposed a scheme combining online fault detection and fault-tolerant training. In the training phase, fault-blind weight pruning was performed in software, followed by genetic algorithm-based remapping. Their scheme could recover the accuracy of the VGG-11 network (on the CIFAR-10 dataset) from 37% to 83% when 14% hard faults were present. Jin et al. [18] modified Xia et al.'s scheme using targeted weight pruning followed by retraining of the sparsified network. Their scheme, implemented on a four-layer convolutional NN (CNN), could reduce the accuracy loss to within 3% at a hard fault ratio of 20%. Although all the software-level solutions could effectively tolerate hard faults, their performance may degrade significantly when soft faults are present simultaneously. This is because (re)training is performed ex situ, and errors can occur when loading the software-computed weights to the memristor crossbar in the presence of soft faults.
On the other hand, the tolerance of soft faults mainly relies on in situ training, which can self-adaptively adjust the weights. Chen and Agarwal et al. [13,31] investigated the ranges of soft faults that could be tolerated by in situ training. To further improve the fault tolerance, various modified weight update rules have been studied. Lim et al. [22] used the Manhattan update rule for in situ training, where each weight was increased or decreased by a single step according to the sign of the backpropagated error. The MLP trained with the Manhattan update rule achieved accuracies close to the ideal values when v was smaller than 3 and n_s was larger than 32. Gokmen et al. [6] proposed a stochastic update rule that translated the inputs and errors into stochastic bit streams and reduced the input-error multiplication to a simple AND operation. The stochastic update rule resulted in an accuracy loss of 0.3% for an MLP at a noise of 10%. Zhang et al. [27] applied a sign-based BP algorithm, as proposed and analyzed in previous research, [32] with the addition of stochastic noise. [33] This algorithm could mitigate the effects of soft faults and meanwhile reduce the area and energy costs for calculating and storing the intermediate data.
For all these modified weight update rules, the main idea is to move the weights in the general direction of the error gradient without regulating the step size. [28] As a result, precise control over the weight update is unnecessary, and hence the soft faults can be tolerated. However, it is still difficult for the in situ training-only schemes (even those using modified weight update rules) to address large soft faults (e.g., large v and σ_C2C, and small G_max/G_min and n_s). In addition, when hard faults are present simultaneously, the performance of in situ training may further degrade. The reason is that the in situ training attempts to tune the weights mapped onto the memristors with SA0 and SA1 faults, while these weights are indeed untunable. [20]

Motivation
As introduced in Section 2.3, although the hard and soft faults have their respective solutions, these solutions may have limited performance when both faults are present simultaneously. This motivates us to develop a so-called MFTT scheme capable of addressing both hard and soft faults simultaneously. Our first consideration is that the inherent sparsity of NNs should be fully exploited. We therefore propose to first detect both hard and large soft faults using a multifault detection method and then prune the detected faulty weights. The sparsified NN after pruning can be trained in situ to recover the accuracy owing to its self-adaptivity. In addition, the small soft faults remaining in the NN can be well tolerated using the Manhattan update rule for in situ training. Therefore, both hard and soft faults can be addressed by the proposed MFTT scheme, which combines multifault detection, targeted weight pruning, and in situ training.

MFTT Scheme
Overview

Figure 2 illustrates the overall flow of the proposed MFTT scheme. The multifault detection is first carried out to detect both hard and large soft faults. Then, the targeted weight pruning is performed to force the detected faulty weights to zero (or close to zero) based on the self-compensation mechanism. After that, the in situ training with the Manhattan update rule is implemented on the sparsified NN. The detailed methods of the multifault detection, targeted weight pruning, and in situ training are described in the next three subsections.

Multifault Detection
Some efficient methods have recently been proposed to detect the types and locations of faults in memristor crossbars, but they mainly focused on hard faults, leaving soft faults unaddressed. [20,34,35] Song et al. [19] proposed to detect the conductance variation (i.e., soft fault) of every memristor and record it, together with the location of this memristor, in a buffer. This bit-wise detection method is relatively slow, and storing the conductance variation value of every memristor is redundant.
Here, we propose a simple multifault detection method to detect the locations of both hard and large soft faults. The proposed multifault detection can be performed row by row for an individual crossbar. The connection between the two differential crossbars can be temporarily cut off using switches in the circuit (not shown in Figure 1a). We first initialize the conductances of a row of memristors (e.g., the first row) to G_max by applying a sufficient number of potentiation pulses. If an ideal memristor has n_s conductance states in the range of [G_min, G_max], applying P_1 = (n_s − 1) potentiation pulses is sufficient to increase the conductance to G_max. Then, P_2 depression pulses (e.g., P_2 = 16 for n_s = 32) are applied along this row. For an ideal memristor, its conductance after depression would become

$$G_{ideal} = G_{max} - \frac{P_2}{n_s - 1}\,(G_{max} - G_{min}) \quad (2)$$

However, the conductance of an actual memristor (G_actual) can deviate from G_ideal significantly in the following cases: 1) it has either an SA0 or SA1 fault; 2) it has a writing nonlinearity v or 3) a writing noise σ_C2C much larger than zero; 4) it has an ON/OFF ratio G_max/G_min or 5) a number of conductance states n_s much smaller than the respective ideal value. Case (1) represents the hard faults and Cases (2)-(5) represent the soft faults; their schematic illustrations are shown in Figure 3a-g. For Case (4), because memristors typically exhibit a small variation of G_max after electroforming with appropriate voltages and compliance currents, [36] only the variation of G_min is considered to be responsible for the variation of G_max/G_min.
To characterize the deviation between G_actual and G_ideal, a reading pulse is applied along the first row of the individual crossbar and the output voltages along different columns are obtained in parallel as

$$V_{o,j} = V_R G_{1j} R_0 \quad (3)$$

where V_o,j is the output voltage at the j-th column, G_1j is the actual conductance of the j-th memristor in the first row, V_R is the reading voltage, and R_0 is a constant resistance used in the op-amp circuit (see Figure 1a). Note that Equation (3) applies to the case where an individual crossbar is being detected, while its connection to the neighboring crossbar is temporarily cut off. In addition, because the op-amp circuit (shown in Figure 1a) outputs a voltage with a negative sign, another op-amp circuit is needed to invert the sign of the output voltage. The output voltage V_o,j is subsequently compared with a reference voltage V_f given by

$$V_f = V_R G_{ideal} R_0 \quad (4)$$

The difference between V_o,j and V_f can thus be used to screen out the memristors belonging to Cases (1)-(5). Specifically, when |V_o,j − V_f| > ε (ε is a threshold value), the memristor shall have either a hard fault or a large soft fault, as described in Cases (1)-(5); otherwise, it is free from hard and large soft faults. This screening operation can be implemented using the circuit shown in Figure 3h. Note that this circuit may have limitations, such as susceptibility to noise and difficulty settling when the input is very close to ±ε, which are not considered in this work.
Repeating the above four steps (initialization, writing, reading, and screening) for the remaining rows, all the memristors with hard and large soft faults in a crossbar are detected. Their locations are stored in a buffer. The remaining memristors either have small soft faults or are fault free. Note that the proposed method can detect multifaults without identifying the exact fault types. In addition, the conductance variation values are not stored. Therefore, our method appears to be simpler and more efficient than the previous fault detection methods when dealing with multifaults. [19,20]
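A compact numerical sketch of the four detection steps is given below; the device parameters, the threshold ε, and the two injected faults are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
G_min, G_max, n_s = 1e-6, 1e-5, 32   # assumed device parameters
V_R, R_0 = 0.2, 1e4                  # assumed reading voltage and resistance
n_rows, n_cols = 8, 10

P2 = n_s // 2                        # e.g. 16 depression pulses for n_s = 32
step = (G_max - G_min) / (n_s - 1)
G_ideal = G_max - P2 * step          # Eq. (2): ideal conductance after P2 pulses

# Simulated crossbar after the initialize-then-depress sequence: most
# cells land near G_ideal, but cell (0, 3) is stuck at G_max (SA1) and
# cell (2, 7) carries a large soft fault.
G_actual = G_ideal * (1 + 0.01 * rng.standard_normal((n_rows, n_cols)))
G_actual[0, 3] = G_max               # SA1 hard fault
G_actual[2, 7] = G_ideal * 1.6       # large soft fault (e.g. strong nonlinearity)

# Reading and screening (done row by row on hardware; vectorized here).
V_o = V_R * G_actual * R_0           # Eq. (3): parallel column readout
V_f = V_R * G_ideal * R_0            # Eq. (4): reference voltage
eps = 0.1 * V_f                      # assumed screening threshold

faulty = np.abs(V_o - V_f) > eps     # True where a hard or large soft fault sits
```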

Targeted Weight Pruning
After multifault detection, the targeted weight pruning is performed on the two differential crossbars based on the self-compensation mechanism. It has been demonstrated that more than 50% of the weights in an NN (particularly in a fully connected NN) can be pruned while causing almost no accuracy loss in certain tasks. [37] This motivates us to fix a faulty weight to zero (if either memristor in a pair is detected to be faulty, the corresponding weight is regarded as faulty) by using the normal memristor in the pair to compensate the faulty one (i.e., self-compensation). Before performing the pruning, a clear understanding of the possible fault combinations in the memristor pairs is needed. When both hard and soft faults are present simultaneously, the fault combinations considered in previous studies [18] (see Table 1) should be extended. Table 2 shows the possible combinations of hard faults (SA0 and SA1 faults), large soft faults, and "normal" devices in the differential crossbars. Here, "normal" denotes that the memristor is fault free or has only a small soft fault. The normal memristors have been distinguished from those with hard and large soft faults using the earlier-described multifault detection. The multifault detection, however, does not further distinguish among the memristors with SA0 faults, SA1 faults, and large soft faults. Nevertheless, they are listed as independent cases in Table 2 because they respond differently to the writing pulses applied in the pruning step (described later).
According to Table 2, one can use a pair of normal memristors (C1) to map weights in the whole range of [−w_max, +w_max]. The NN training can thus fully rely on these memristors. In C2-C3, where both memristors have SA0 or SA1 faults, the weights are naturally pruned. In C4-C10, where at least one memristor has an SA1 or large soft fault, the weights cannot be mapped appropriately. These weights need to be pruned, which can be realized by writing the conductances of the paired memristors to G_max. Likewise, in C11-C14, where at least one memristor has an SA0 fault, weight pruning is also needed, which can be realized by writing the conductances of the paired memristors to G_min. Note that in C4-C14, the weights after pruning may not be exactly zero, because the conductance of a memristor with a soft fault may not be tuned to be exactly the same as that of its paired memristor with a hard or soft fault; nevertheless, a weight fixed to a close-to-zero value is still called a pruned weight. In the last two cases (C15-C16), where one memristor has an SA0 fault while the other has an SA1 fault, the weights are fixed to +w_max or −w_max and cannot be pruned.
To implement the targeted weight pruning efficiently, a row-by-row manner is adopted. Specifically, taking the first row of the two differential crossbars as an example, a sufficient number of potentiation pulses [P_3 = (n_s − 1)] are applied to the selected memristor pairs through the columns to set the conductances of these memristors to G_max. The memristor pairs are selected if either device in the pair is detected to be faulty (C2-C16 in Table 2). This requires the use of control signals based on the stored fault locations. Then, a reading pulse is applied along the first row of the two differential crossbars and the output voltages along different columns are obtained in parallel as

$$V'_{o,j} = V_R (G^+_{1j} - G^-_{1j}) R_0 \quad (5)$$
Because the writing has been performed beforehand, the tunable G^+_1j and G^-_1j are set to be around G_max, while those corresponding to hard faults remain unchanged. Therefore, V′_o,j can be exactly zero or close to zero for the cases of C2-C5 in Table 2. In contrast, V′_o,j can deviate from zero significantly for the cases of C6-C16 in Table 2. C6-C16 can be further divided into two groups according to the degree of deviation: 1) C6-C10, where at least one memristor has a large soft fault while the other has no SA0 fault, and 2) C11-C16, where one memristor has an SA0 fault. Apparently, the deviation of V′_o,j from 0 (i.e., |V′_o,j|) should be larger in C11-C16 than in C6-C10.
To distinguish the above cases according to |V′_o,j|, one can use a sufficiently large threshold value ε′ and a circuit similar to that shown in Figure 3h. |V′_o,j| ≤ ε′ screens out the cases of C2-C10. The weights leading to |V′_o,j| ≤ ε′ are regarded as already pruned, and no further operation is performed on them. In contrast, |V′_o,j| > ε′ screens out the cases of C11-C16. The weights leading to |V′_o,j| > ε′ may be further adjusted because the memristors with soft faults are still tunable.
Rewriting is thus performed for the first row by applying a sufficient number of depression pulses [P_4 = (n_s − 1)] to each column where |V′_o,j| > ε′ is detected. After rewriting, the weights in C11-C14 can be pruned, while those in C15-C16 remain stuck.
Repeating the above four steps (writing, reading, screening, and rewriting) for the remaining rows, the weights in the differential crossbars belonging to C2-C14 can all be pruned. These pruned weights will no longer be adjusted in the weight update step of the in situ training, which is described in the next subsection.
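To make the write-read-screen-rewrite flow concrete, the following sketch emulates the pruning of one row of a differential pair; the conductance values, the stuck-fault pattern, and ε′ are illustrative assumptions, and real hardware applies pulses instead of assigning conductances:

```python
import numpy as np

G_min, G_max = 1e-6, 1e-5
V_R, R_0 = 0.2, 1e4
eps_prime = 0.3 * V_R * G_max * R_0       # assumed screening threshold

def write_to(G, target, stuck):
    """Tunable cells follow the writing pulses; stuck cells do not."""
    return np.where(stuck, G, target)

# One row of a differential pair; True marks memristors that the
# multifault detection flagged and that are hard (untunable) faults.
G_plus  = np.array([5e-6, 1e-6, 1e-5, 1e-6])
G_minus = np.array([4e-6, 1e-5, 3e-6, 1e-6])
stuck_p = np.array([False, False, True, True])    # e.g. SA1, SA0
stuck_m = np.array([False, True, False, False])   # e.g. SA1
selected = np.array([False, True, True, True])    # pairs with a detected fault

# Step 1 (writing): drive both memristors of each selected pair to G_max.
G_plus  = np.where(selected, write_to(G_plus, G_max, stuck_p), G_plus)
G_minus = np.where(selected, write_to(G_minus, G_max, stuck_m), G_minus)

# Step 2 (reading/screening): near-zero differential output means the
# weight is already pruned; a large |V'| reveals an SA0 device (C11-C16).
V_diff = V_R * (G_plus - G_minus) * R_0   # Eq. (5)
needs_rewrite = selected & (np.abs(V_diff) > eps_prime)

# Step 3 (rewriting): drive the pairs that failed the screen to G_min.
G_plus  = np.where(needs_rewrite, write_to(G_plus, G_min, stuck_p), G_plus)
G_minus = np.where(needs_rewrite, write_to(G_minus, G_min, stuck_m), G_minus)
# Pairs with one SA0 and one SA1 memristor (C15-C16) remain unpruned.
```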

In Situ Training with the Manhattan Update Rule
The in situ training of a memristor crossbar-based NN mainly includes four steps: forward propagation, backward propagation, gradient calculation, and weight update. Using memristor pairs (G^+_ij − G^-_ij) to represent the signed weights (w_ij), the forward propagation is implemented by applying voltages encoded by the inputs x_i (or activations a_i) along the rows of the crossbars and collecting the output currents along the columns; the backward propagation propagates the errors through the crossbars in a similar manner. As seen earlier, the memristor crossbars are used to implement the MVM operations during both forward and backward propagations. All other operations involved are hardware implementable using digital logic circuits. [29] Now, let us focus on the weight update. The Manhattan update rule is used for its ease of hardware implementation and its fault tolerance. [22,28] In the Manhattan update rule, a weight is increased or decreased by only one step according to the sign of ∂J/∂w_ij, as expressed by

$$\Delta w_{ij} = -\eta \, \mathrm{sgn}\!\left(\frac{\partial J}{\partial w_{ij}}\right) \quad (7)$$

where η is the minimum allowed weight update, corresponding to the conductance change induced by only one potentiation/depression pulse. However, some weights should not be updated because they are already pruned. In addition, to avoid the oscillation issue and reduce the number of update events, there is no need to update a weight with a small magnitude of ∂J/∂w_ij. We therefore use two mask matrices M^1_ij and M^2_ij, whose elements are either 0 or 1, to regulate the weight update. In M^1_ij, the element 0 (1) indicates that the corresponding weight is pruned (not pruned). In M^2_ij, the element 0 (1) indicates that the corresponding |∂J/∂w_ij| is smaller (larger) than a threshold value ε_grad. M^1_ij is fixed because its elements are generated from the stored fault locations, while M^2_ij can vary from iteration to iteration because its elements are generated from the intermediate results. Taking M^1_ij and M^2_ij into account, Equation (7) is modified as

$$\Delta w_{ij} = -\eta \, M^1_{ij} M^2_{ij} \, \mathrm{sgn}\!\left(\frac{\partial J}{\partial w_{ij}}\right) \quad (8)$$

The weight update based on Equation (8) can be performed row by row on the differential crossbars. Using the first row as an example, M^1_1j and M^2_1j are first used as the control signals to select the columns intended to be updated. Then, two-phase writing is performed, as schematically shown in Figure 4. The first phase deals with the case of sgn(∂J/∂w_ij) = −1: G^+_ij will be increased by one step if G^+_ij < G_max, while G^-_ij will be decreased by one step if G^+_ij = G_max. The second phase deals with the case of sgn(∂J/∂w_ij) = +1: G^-_ij will be increased by one step if G^-_ij < G_max, while G^+_ij will be decreased by one step if G^-_ij = G_max. Apparently, prior to the two-phase writing, a reading step is required to check whether G^+_ij and G^-_ij have reached G_max. Repeating the above two steps of reading and two-phase writing for the remaining rows completes the weight update for one iteration. The total number of iterations is set appropriately so that convergence can be reached.
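A behavioral sketch of the masked update of Equation (8) with the two-phase writing is given below; the array sizes and values are illustrative assumptions, and stuck (hard-fault) devices are not modeled because their weights are already excluded via M^1:

```python
import numpy as np

rng = np.random.default_rng(4)
G_min, G_max, n_s = 1e-6, 1e-5, 32
step = (G_max - G_min) / (n_s - 1)   # conductance change of one pulse (eta)

shape = (6, 4)
G_plus = rng.uniform(G_min, G_max, size=shape)
G_minus = rng.uniform(G_min, G_max, size=shape)
grad = rng.normal(size=shape)        # dJ/dw from backpropagation

M1 = rng.random(shape) > 0.15        # False where the weight is pruned (fixed)
M2 = np.abs(grad) > 0.002            # False where |dJ/dw| < eps_grad
update = M1 & M2
s = np.sign(grad)

# Reading step: check which devices have already reached G_max.
plus_sat = G_plus >= G_max
minus_sat = G_minus >= G_max

# Phase 1: sgn(dJ/dw) = -1 -> raise the weight by one step
# (potentiate G_plus, or depress G_minus if G_plus is saturated).
inc = update & (s < 0)
G_plus = np.where(inc & ~plus_sat, G_plus + step, G_plus)
G_minus = np.where(inc & plus_sat, G_minus - step, G_minus)

# Phase 2: sgn(dJ/dw) = +1 -> lower the weight by one step
# (potentiate G_minus, or depress G_plus if G_minus is saturated).
dec = update & (s > 0)
G_minus = np.where(dec & ~minus_sat, G_minus + step, G_minus)
G_plus = np.where(dec & minus_sat, G_plus - step, G_plus)

G_plus = np.clip(G_plus, G_min, G_max)
G_minus = np.clip(G_minus, G_min, G_max)
```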
Simulation Settings

Table 3 lists the simulation details. The efficiency of the MFTT scheme is evaluated on an SLP (784 × 10) and an MLP (784 × 256 × 10) for image recognition. 60 000 and 10 000 images of handwritten digits from the MNIST dataset [38] are used for training and testing, respectively. The SLP is implemented using a pair of crossbars (each containing 7840 memristors) together with 784 input neurons and 10 output neurons (using the softmax function). The MLP is implemented using a 200 704-memristor crossbar pair and a 2560-memristor crossbar pair for the first and second layers, respectively, together with 784 input neurons, 256 hidden neurons (using the ReLU function), and 10 output neurons (using the softmax function). The crossbar is assumed to be sufficiently large, and thus the tiled architecture is not used. Only the effects of memristor faults on the network performance are investigated, while other effects (such as ADC precision and sneak paths) are not considered.
A fault-free memristor is assumed to have a writing nonlinearity (v) of 0, a number of conductance states (n_s) of 32, an ON/OFF ratio (G_max/G_min) of 10, a writing noise (σ_C2C) of 0, and a device-to-device variation (σ_D2D) of 0. The soft faults are introduced by increasing v and σ_C2C and decreasing n_s and G_max/G_min. For G_max/G_min, only G_min is varied while G_max is assumed to be unchanged, as mentioned earlier. To introduce σ_D2D, the memristors are assumed to be independent and each individual device has a random set of v, n_s, G_max/G_min, and σ_C2C. The v, n_s, G_max/G_min, and σ_C2C values are assumed to obey uniform distributions. For example, all the memristors have independent v values, and these v values are uniformly distributed in the range of [0, v_max]. v_max = 0 means that all the memristors have the ideal v value of 0 and no σ_D2D exists. As v_max increases, both the average v and σ_D2D become larger. Therefore, σ_D2D can be reflected by the distribution range of soft faults, and hence it will not be specifically investigated hereafter. In terms of the hard faults, the SA0 and SA1 faults are assumed to be equal in number and randomly distributed among all the memristors. The hard fault ratio (γ_HF) is defined as the percentage of memristors with SA0 and SA1 faults.
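As a sketch of how such independent per-device fault parameters could be drawn (the uniform ranges here are illustrative, borrowed from the fault-coexistence case described later), consider:

```python
import numpy as np

rng = np.random.default_rng(5)
n_dev = 784 * 10 * 2          # memristors in the two differential SLP crossbars

# Each device draws its own soft-fault parameters from uniform ranges;
# the spread of these independent draws is what plays the role of the
# D2D variation in the simulations. Range limits are illustrative.
v = rng.uniform(0.0, 10.0, size=n_dev)          # writing nonlinearity
n_states = rng.integers(18, 33, size=n_dev)     # conductance states in [18, 32]
on_off = rng.uniform(2.7, 10.0, size=n_dev)     # ON/OFF ratio G_max / G_min
sigma_c2c = rng.uniform(0.0, 0.1, size=n_dev)   # C2C writing noise

G_max = 1e-5                  # G_max fixed after forming; only G_min varies
G_min = G_max / on_off
```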
The conductance difference (G^+_ij − G^-_ij) is mapped to a weight in the range of [−w_max, +w_max], with w_max set to 0.2 in this study. Note that any finite value can be used for w_max, but the w_max value should be fixed throughout the simulation unless the G_max and G_min values of the memristors change. When performing the multifault detection and targeted weight pruning, the threshold values ε and ε′ are varied in different cases.
The percentage of pruned weights is denoted by γ_pw. In contrast to ε and ε′, the threshold value ε_grad for the weight update is fixed at 0.002. To make a comprehensive comparison, several controls are used. The first control is the conventional ex situ training (EXT). The second and third controls are two representative hard fault-tolerant training schemes: one, named HFTT-1, combines heuristic algorithm-based weight mapping with targeted weight pruning, [18] while the other, named HFTT-2, combines online fault detection with fault-blind weight pruning. [20] Both HFTT-1 and HFTT-2 use ex situ training. In addition, in situ training-only schemes using the conventional gradient descent learning rule (ISTGD) and the Manhattan update rule (ISTM) are used as another two controls. The last control combines ISTGD with multifault detection and targeted weight pruning and is called the ISTGD+ scheme. Note that combining ISTM with multifault detection and targeted weight pruning indeed forms the proposed MFTT scheme; the difference between MFTT and ISTGD+ is that the former uses the Manhattan update rule, while the latter uses the conventional gradient descent learning rule.
All the schemes use minibatch training with a batch size of 250 and 30 training epochs. For EXT, HFTT-1, and HFTT-2, the optimal weights obtained during ex situ training are mapped onto the memristors to generate the highest test accuracy. In contrast, for ISTGD, ISTGD+, ISTM, and MFTT, the test accuracies are averaged over different epochs and random fault distributions. Note that unless otherwise specified, the accuracy mentioned in the rest of this article refers to the classification accuracy on the MNIST test set.

Results and Analysis
The MFTT scheme and the control schemes are first implemented on the SLP for the cases where only one type of fault exists (Figures 5-9). Figure 5 shows the accuracies of various schemes achieved in the case where only hard faults exist. As assumed earlier, the hard faults contain an equal number of SA0 and SA1 faults and are randomly distributed among all memristors. Let us first look at the two schemes designed specifically for hard faults: HFTT-1 and HFTT-2. HFTT-1 achieves higher accuracies than HFTT-2 in the whole investigated range of γ_HF, consistent with the observations in the study by Jin et al. [18] HFTT-1 outperforms HFTT-2 because HFTT-1 uses targeted weight pruning, while HFTT-2 uses fault-blind weight pruning, and the former pruning method is more effective in tolerating hard faults. Then, let us turn to the in situ training schemes (ISTGD, ISTGD+, ISTM, and MFTT). Although they are not designed specifically for hard faults, they still possess some hard-fault tolerance because of the self-adaptivity of in situ training. Moreover, MFTT and ISTGD+ achieve higher accuracies than ISTM and ISTGD, respectively, indicating that the targeted weight pruning can help to improve the hard-fault tolerance. This is not unexpected, because all the weights mapped onto the memristors with hard faults are pruned (except C8-C9 shown in Table 1) and no attempts will be made to update these weights; therefore, errors caused by these weights can be minimized. Adjusting only the weights surrounding the pruned ones is sufficient to achieve high accuracy. Figure 5 also shows that MFTT exhibits higher accuracies than ISTGD+, revealing the advantage of the Manhattan update rule over the conventional gradient descent learning rule. The Manhattan update rule uses the minimum step size to update the adjustable weights to better accommodate the pruned and stuck weights. [39,40] In contrast, the conventional gradient descent learning rule may introduce excess weight movements, thus lowering the accuracy.
We next focus on the comparison between MFTT and HFTT-1. At γ_HF ≤ 10%, HFTT-1 exhibits higher accuracies than MFTT. In contrast, as γ_HF exceeds 10%, the accuracies of HFTT-1 are surpassed by those of MFTT. The accuracy losses of HFTT-1 and MFTT at γ_HF = 20% are specifically calculated and compared, using the accuracy obtained from fault-free memristors (i.e., 91.6%) as the reference. The accuracy loss of HFTT-1 at γ_HF = 20% is 3.9%, which is close to that reported previously. [18] When MFTT is used, an even lower accuracy loss, that is, 2.7%, is obtained. The reason why MFTT can achieve a lower accuracy loss than HFTT-1 at a large γ_HF may be interpreted as follows. At a large γ_HF, the number of significant weights mapped onto the faulty memristors becomes large. HFTT-1 forces these weights to zero (i.e., pruning) in software and subsequently performs retraining. However, some of these weights are stuck at +w_max and −w_max (see C8-C9 in Table 1) and thus cannot be forced to zero. In contrast, in MFTT, the weights stuck at +w_max and −w_max, together with the pruned weights, can be well accommodated by self-adjusting the remaining weights. Therefore, the capability of MFTT to compensate for the accuracy loss at a large γ_HF is better than that of HFTT-1.
While it has been demonstrated above that MFTT can well tolerate hard faults, how it performs in the presence of soft faults is of great interest. The writing nonlinearity (v) is the first type of soft fault to be investigated. It is assumed that all the memristors have a uniform distribution of v values in the range of [0, v_max]. v is defined following a memristor behavioral model suggested in a previous study: [13]

$$\text{Long-term potentiation (LTP)}:\; G_{actual} = B\left(1 - e^{-vP/P_5}\right) + G_{min} \quad (9)$$

$$\text{Long-term depression (LTD)}:\; G_{actual} = G_{max} - B\left(1 - e^{-vP/P_5}\right) \quad (10)$$

where G_actual is the conductance at the pulse number P, G_min is the minimum conductance, G_max is the maximum conductance, v is the writing nonlinearity, B is a conductance range-related parameter equaling (G_max − G_min)/(1 − e^{−v}), and P_5 is the total number of pulses used to tune the conductance from G_min to G_max (or from G_max to G_min). As seen from Equations (9) and (10), the larger v, the more nonlinear the LTP and LTD curves will be; v = 0 indicates the linear case.
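Equations (9) and (10), as reconstructed above, can be evaluated directly; the parameter values in this sketch are illustrative:

```python
import math

def G_LTP(P, v, G_min=1e-6, G_max=1e-5, P5=31):
    """Equation (9): conductance after P potentiation pulses from G_min."""
    B = (G_max - G_min) / (1 - math.exp(-v))
    return B * (1 - math.exp(-v * P / P5)) + G_min

def G_LTD(P, v, G_min=1e-6, G_max=1e-5, P5=31):
    """Equation (10): conductance after P depression pulses from G_max."""
    B = (G_max - G_min) / (1 - math.exp(-v))
    return G_max - B * (1 - math.exp(-v * P / P5))

# The larger v, the more the curves bend away from the linear case
# (v -> 0 recovers a linear staircase of P5 equal steps).
for v in (0.1, 3.0, 10.0):
    print(f"v = {v:4}: G_LTP(10) = {G_LTP(10, v):.2e} S, "
          f"G_LTD(10) = {G_LTD(10, v):.2e} S")
```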
Using the fault detection with a threshold value ε, memristors with v > v_th can be detected (as described in Section 3.2). Then, the weights mapped onto memristors with v > v_th are pruned by setting the conductances of the memristor pairs to G_max (as described in Section 3.3). It is noteworthy that if either memristor in a pair has v > v_th, the corresponding weight is pruned. By decreasing ε, v_th is reduced and consequently the ratio of pruned weights (γ_pw) increases (see Figure S1, Supporting Information, for details). Figure 6a illustrates the accuracy evolution with varying γ_pw for MFTT and ISTGD+ at a given v_max of 10 for the distribution range of [0, v_max]. For both schemes, the accuracy first increases and then decays as γ_pw increases, forming a peak at a certain γ_pw. Why the accuracy first rises to a peak and then falls with increasing γ_pw (i.e., decreasing ε) can be explained as follows. With the decrease in ε, v_th is gradually reduced; consequently, weights with large v are pruned first, followed by those with small v. Weights with large v are detrimental to the accuracy of the NN because they are typically far away from the target values after programming and thus cause large errors. Pruning them can therefore increase the accuracy by making use of the sparsity of the NN. In contrast, weights with small v can function almost normally. Further pruning these weights decreases the accuracy because not enough weights remain to maintain the proper operation of the NN. Therefore, pruning an appropriate number of weights with large v is beneficial to the accuracy, but pruning too many weights will deteriorate it.
Similar accuracy evolutions of MFTT and ISTGD+ with γ_pw are observed for other distribution ranges of [0, v_max], and the optimal γ_pw values leading to peak accuracies are shown in Figure S2, Supporting Information. The peak accuracies of MFTT and ISTGD+ are plotted against v_max in Figure 6b, along with the accuracies of ISTM, ISTGD, and EXT (note: because HFTT-1 and HFTT-2 can address only hard faults, their performance on soft faults is indeed the same as that of EXT and thus not shown hereafter). At a small v_max (e.g., v_max = 2), MFTT, ISTGD+, ISTM, and ISTGD achieve almost the same accuracy (around 90%), suggesting that small v can be well tolerated by the in situ training. However, EXT achieves a lower accuracy (88.8%), probably because even a small v can cause sizable errors at the weight loading step of ex situ training. As v_max increases from 2 to 10, the accuracies of ISTM and ISTGD degrade significantly, while those of MFTT and ISTGD+ decrease smoothly and remain above 86%. It is therefore demonstrated that the pruning can effectively mitigate the accuracy loss caused by large v. Figure 6b further shows that MFTT achieves a higher accuracy than ISTGD+ at a large v. This is probably because the v-induced deviation in the weight update can be minimized by the Manhattan update rule. Therefore, by combining both pruning and the Manhattan update rule, MFTT achieves the best tolerance against v among all the investigated schemes.
Similar to v, the three other types of soft faults, that is, the limited number of conductance states (n_s), the narrowed dynamic range (G_max/G_min), and the writing noise (σ_C2C), are investigated separately. The n_s, G_max/G_min, and σ_C2C values are all assumed to have uniform distributions, and the detailed experimental settings are presented in Supplementary Note 1, Supporting Information. Figures 7a, 8a, and 9a show the accuracy evolutions with respect to γ_pw for MFTT and ISTGD+ at a given n_s,min of 2, (G_max/G_min)_min of 1.37, and σ_C2C,max of 0.5, respectively. Accuracy peaks at certain γ_pw are observed for all three types of soft faults, similar to that observed for v (see Figure 6a). The origins of these accuracy peaks can be explained in a similar way. In brief, pruning the weights with large soft faults (small n_s, small G_max/G_min, and large σ_C2C) first is beneficial to the accuracy, but further pruning the weights with small soft faults (large n_s, large G_max/G_min, and small σ_C2C) deteriorates the accuracy.
Figure 7b summarizes the peak accuracies of MFTT, ISTGD+, ISTM, ISTGD, and EXT with n_s,min varying from 2 to 20 for the distribution ranges of [n_s,min, 32]. For all the investigated n_s,min values, the accuracies of MFTT and ISTGD+ are above 87.4% and decrease only slightly with decreasing n_s,min. In contrast, both ISTM and ISTGD show a dramatic accuracy decay when n_s,min becomes smaller than certain values (6 and 9 for ISTM and ISTGD, respectively). These results demonstrate that pruning is particularly useful to suppress the accuracy loss caused by small n_s. In addition, MFTT achieves a higher accuracy than ISTGD+, which can be attributed to the use of the Manhattan update rule. For the Manhattan update rule used in MFTT, the minimum magnitude of weight update is always used; however, for the conventional gradient descent learning rule used in ISTGD+, the magnitude of weight update may be exaggerated for small n_s, thus causing accuracy loss.
Figure 8b shows the peak accuracies of the various schemes with varied (G_max/G_min)_min for different distribution ranges of [(G_max/G_min)_min, 10]. As clearly seen, MFTT and ISTGD+ exhibit higher accuracies than ISTM and ISTGD, respectively, indicating that pruning can reduce the accuracy loss induced by a small G_max/G_min. In addition, the accuracy of MFTT is higher than that of ISTGD+, which may again be ascribed to the merit of the Manhattan update rule.
Figure 9b shows the peak accuracies of the various schemes tested with varying σ_C2C,max for the distribution ranges of [0, σ_C2C,max]. At all the investigated σ_C2C,max values, MFTT and ISTGD+ exhibit higher accuracies than ISTM and ISTGD, respectively, confirming the effective role of pruning in mitigating the accuracy loss caused by large σ_C2C. However, the accuracies of MFTT and ISTGD+ are not much different, probably because noise causes random deviations in the descent direction, which can hardly be addressed by either the Manhattan update rule or the conventional gradient descent learning rule.
Having demonstrated the good performance of the MFTT scheme in tolerating only hard faults or one type of soft fault, how it performs when hard faults and various types of soft faults are present simultaneously is of great interest. For the case where both hard and soft faults coexist, the performances of MFTT and the other schemes are evaluated on both the SLP and the MLP. The hard and soft faults are assumed to be randomly distributed among all the memristors. The hard fault ratio γ_HF is assumed to be 10%, and the distribution ranges of v, n_s, G_max/G_min, and σ_C2C are assumed to be [0, 10], [18, 32], [2.7, 10], and [0, 0.1], respectively. As shown in Figure 10, HFTT-1, HFTT-2, and EXT exhibit rather low accuracies on both the SLP and the MLP because they cannot handle the soft faults. The accuracies of ISTM and ISTGD on both networks are also unsatisfactory, probably due to their limited tolerance against hard and large soft faults. Notably, MFTT and ISTGD+ rank as the top two among all the investigated schemes on both the SLP and the MLP, demonstrating that the combination of multifault detection, targeted weight pruning, and in situ training can well address both hard and large soft faults using the sparsity and self-adaptivity of NNs. Moreover, MFTT achieves higher accuracies than ISTGD+ on both networks, suggesting the good tolerance of the Manhattan update rule against the remaining small soft faults. It is therefore demonstrated that the MFTT scheme can tolerate both hard and soft faults simultaneously, which is further verified with different experimental settings (see Figure S3, Supporting Information). This capability distinguishes MFTT from most previous fault-tolerant schemes, which could address only hard or soft faults. However, the effectiveness of MFTT on more complex NNs [41] (such as convolutional NNs for the CIFAR-10 and ImageNet datasets) remains an open question, which will be investigated in further research.
Last but not least, the multifault detection and targeted weight pruning methods used in MFTT are simple and would not introduce large hardware overheads. The Manhattan update rule used in MFTT also simplifies the hardware implementation, as it requires no computational resources to calculate the applied pulse numbers. In addition, there is no need to update the pruned weights and the weights with small |∂J/∂w_ij| during the weight update step, which can significantly reduce the energy consumption. For example, in the case shown in Figure 10, an average of 6690 weights (out of 7840 in total) are not updated during the 30 training epochs in MFTT, thereby reducing the writing energy by 85.3%. Therefore, the MFTT scheme provides a hardware-friendly and low-power solution to mitigate the accuracy losses caused by both hard and soft faults in nonideal memristive NNs.

Conclusion
In this article, we have proposed an MFTT scheme for addressing both hard and soft faults simultaneously in memristor crossbar-based NNs. The proposed scheme consists of three steps: multifault detection, targeted weight pruning, and in situ training with the Manhattan update rule. Pruning the detected faulty weights with hard and large soft faults, followed by in situ training, can effectively tolerate these faults by making use of the sparsity and self-adaptivity of NNs. In addition, the remaining small soft faults can be well tolerated by the Manhattan update rule. Experiments show that in the hard fault-only case, MFTT achieves smaller accuracy losses than previous hard fault-tolerant schemes when the hard fault ratio is large (>10%). In the soft fault-only case, MFTT outperforms previous in situ training-only schemes, particularly when the soft faults are large. When both hard and soft faults are present simultaneously, MFTT achieves the highest accuracy among all investigated schemes. Therefore, the proposed MFTT scheme has good tolerance against both hard and soft faults, making it promising for application in memristor crossbar-based NNs.

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.