Efficient Training of the Memristive Deep Belief Net Immune to Non-Idealities of the Synaptic Devices

The tunability of the conductance states of various emerging non-volatile memristive devices emulates the plasticity of biological synapses, making these devices promising for the hardware realization of large-scale neuromorphic systems. The inference of the neural network can be greatly accelerated by the vector-matrix multiplication (VMM) performed within a crossbar array of memristive devices in one step. Nevertheless, the implementation of the VMM needs complex peripheral circuits, and the complexity further increases since non-idealities of memristive devices prevent precise conductance tuning (especially for online training) and largely degrade the performance of deep neural networks (DNNs). Here, we present an efficient online training method for the memristive deep belief net (DBN). The proposed memristive DBN uses stochastically binarized activations, reducing the complexity of peripheral circuits, and uses the contrastive divergence (CD) based gradient descent learning algorithm. The analog VMM and digital CD are performed separately in a mixed-signal hardware arrangement, making the memristive DBN highly immune to non-idealities of synaptic devices. The number of write operations on memristive devices is reduced by two orders of magnitude. A recognition accuracy of 95% to 97% can be achieved for the MNIST dataset using pulsed synaptic behaviors of various memristive synaptic devices.


Introduction
The separation of memory and computing units in conventional von Neumann computing systems, which causes the memory wall bottleneck, is the main issue preventing the artificial neural network from competing with the human brain in efficiency and in intelligence. [1,2] Emerging non-volatile memory devices which have tunable resistance, i.e., memristive devices, including resistive random-access memory (RRAM), [3,4] phase-change memory (PCM), [5] and ferroelectric random-access memory (FeRAM), are promising techniques to solve the memory wall issue. [6][7][8] They can store information in an analog way and process information at the same location, acting as artificial synaptic devices and enabling in-memory computation just like what happens in the human brain. [9,10] Furthermore, an array of memristive devices can efficiently perform the vector-matrix multiplication (VMM), which is the computational kernel of a deep neural network (DNN), via Ohm's law and Kirchhoff's current law in one step, [11][12][13] making it a promising way to greatly accelerate the DNN and to power future artificial intelligence. [14][15][16] However, several issues remain before this promise comes true. Firstly, since the memristive VMM operations are conducted in the analog domain, expensive analog-to-digital and digital-to-analog converters (ADCs and DACs) and additional circuits for the neuron's non-linear activation functions are needed for the communication between adjacent layers of the DNNs. [17,18] To avoid the use of high-precision and expensive ADCs and DACs, novel spiking rate-coded neurons have been proposed. [19,20] However, the spiking rate-coded neuron circuit is still complex and informationally inefficient. 
[21] Secondly, the online training is usually performed by tuning the conductance of the synaptic devices in a closed-loop write method.
In this paper, we investigated the hardware implementation of the memristive deep belief net (DBN) based on the learning algorithm of contrastive divergence (CD). [45] The memristive DBN is composed of stacked restricted Boltzmann machines (RBMs), [46] where the VMM operations have binary inputs and stochastically binarized outputs, needing no ADCs or DACs in the peripheral circuits. The RBM is trained by accumulating the CD in a separate digital array and updating the synaptic weights periodically via the open-loop write method on the memristor array. The training of the DBN needs no additional cache memory to store the intermediate states of hidden layers, nor dedicated circuits for non-linear activation functions.
The proposed memristive DBN shows high immunity to non-idealities of the synaptic devices, greatly relaxing the specifications for memristive synaptic devices in multiple dimensions.

The DBN and RBMs
The structure of the investigated DBN is shown in Figure 1a; it consists of three stacked RBMs. [45] Each RBM has a visible layer, a hidden layer, and a weight matrix (w) connecting them. For supervised learning tasks, taking the MNIST dataset as an example in this work, [40] the images are fed into the visible layer of the first RBM, and the labels are part of the visible layer of the last RBM. Unlike conventional DNNs based on the error backpropagation algorithm, the DBN relies on the consecutive training of each RBM via the CD algorithm.
The CD is obtained by alternating Gibbs sampling between the visible layer and the hidden layer within each RBM, which requires both forward and backward VMMs as well as binary sampling operations. All input signals are binary digital signals, which can be easily generated by digital circuits. For instance, in the first RBM layer (RBM 1, Figure 1a and 1b), each image in the MNIST dataset was binarized (pixel value either '0' or '1') and converted to a one-dimensional vector as the states of the visible neurons (v). After alternating Gibbs sampling (see Methods for more details), the hidden neuron states (h), the reconstructed visible neuron states (v'), and the reconstructed hidden neuron states (h'), which are the local information needed to calculate the CD and update the weight matrix, are obtained.
After the first RBM layer is trained, the states of the hidden neurons (h) become the input of the second RBM layer (RBM 2 in Figure 1a) for its training. The last RBM layer (RBM 3) takes both the states of the hidden neurons of the previous RBM layer and the label vector (l) as the input (see Methods for more details). The weight matrix in RBM 3 is partitioned into two parts (w3 and w4 for clarification). Learning the DBN by consecutively training the stacked RBM layers is called the "greedy learning" method. [45] In a conventional DNN with the error backpropagation algorithm, the error propagated from the last layer gradually vanishes, which makes hardware implementation harder.
Additionally, the gradient descent of the weight relies on both the input of the layer and the error backpropagated from the last layer, which raises the issue of data dependency. In other words, the states of the neurons in all layers need to be stored until the backpropagated error arrives and the weight is updated. In the DBN, by contrast, all the neuron states are binarized ('0' or '1') and the CD elements are ternary values ('-1', '0', and '1', see Eq. 10 in Methods), making them easier to process in hardware. Moreover, the calculation of the gradient descent, i.e., the CD, depends only on the local information of the neurons, further simplifying the memory requirements and hardware design in the training stage.

Implement VMM with in-situ stochastic activations
The Gibbs sampling operation in an RBM can be performed fully in hardware by the memristive crossbar array with an additional noise current at each output node, as shown in Figure 1c. Figure 1c illustrates the forward VMM and output sampling from the visible neurons to the hidden neurons (Eq. 6 and Eq. 8). The binary states of the visible neurons (input digital signal) are converted to the read voltages ($V_i$, $i \in \{1, 2, \ldots, m\}$) as the input of the memristive array with the size of m-by-n, which performs the VMM via Ohm's law and Kirchhoff's current law. Since the input of the VMM operation is a binarized vector, only a level shifter is needed (i.e., 1-bit DAC). The current output of the memristive array can be denoted as $I_j = \sum_{i=1}^{m} V_i G_{ij}$, where $j \in \{1, 2, \ldots, n\}$ is the column index of the memristive array, and $G_{ij}$ is the conductance of the device in the ith row and jth column. A separate column of memristive devices with a fixed reference conductance ($G_{ref}$) is used to provide the reference current, $I_{ref} = \sum_{i=1}^{m} V_i G_{ref}$. A noise current [$I_{noise} \in N(0, \sigma^2)$] is injected into each output node of the memristive array.
The output current is then converted to a voltage by a trans-impedance amplifier (TIA) and compared with the voltage output of the reference column by a comparator (i.e., 1-bit ADC).
The hidden neuron states thus can be written as $h_j = 1$ if $I_j + I_{noise} > I_{ref}$, and $h_j = 0$ otherwise. We have separately simulated and verified the circuit functionality of the noise current generation, trans-impedance amplifier, comparator, and level shifter. However, simulation of the CMOS peripheral circuit for a specific technology, including specific limitations such as operational voltage and parasitic capacitance, to explore the bandwidth and latency of the designed memristive RBM and DBN needs further investigation and should be the next step of the work.
The stochasticity of visible or hidden neurons can also be provided by the intrinsic read noise of the memristive device by properly tuning the signal-to-noise ratio, [47] which can further simplify the hardware implementation of the DBN. Here, we utilize the external noise such that we can turn the noise current off for fast inference.
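The forward sampling path described above (level shifter, crossbar VMM, reference column, injected noise, comparator) can be sketched behaviorally in a few lines. This is a minimal sketch under stated assumptions, not the circuit simulated in this work: the function name `stochastic_vmm`, the read voltage, the noise amplitude, and the Gaussian form of the noise are illustrative choices.

```python
import numpy as np

def stochastic_vmm(v, G, g_ref, v_read=0.1, sigma=1e-6, rng=None):
    """Binary-input crossbar VMM followed by a noisy 1-bit comparison.

    v      : (m,) array of 0/1 visible neuron states
    G      : (m, n) conductance matrix in siemens
    g_ref  : conductance of every cell in the reference column
    v_read : read voltage for a '1' input (hypothetical value)
    sigma  : std of the injected noise current (hypothetical value);
             sigma=0 switches the noise off (deterministic inference)
    """
    rng = np.random.default_rng() if rng is None else rng
    V = v_read * np.asarray(v, dtype=float)   # level shifter, i.e., 1-bit DAC
    i_out = V @ G                             # Ohm's law + Kirchhoff's current law
    i_ref = V.sum() * g_ref                   # current of the reference column
    if sigma > 0:
        i_noise = rng.normal(0.0, sigma, size=i_out.shape)  # assumed Gaussian noise
    else:
        i_noise = 0.0
    # Comparator, i.e., 1-bit ADC: neuron fires when the noisy column
    # current exceeds the reference current.
    return (i_out + i_noise > i_ref).astype(np.uint8)
```

Calling the function with `sigma=0` corresponds to turning the noise current off, so the same array model covers both the sampling mode used in training and the fast deterministic mode.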

Memristive array and CD accumulation array
To enable the learning of memristive DBN tolerant to non-idealities of synaptic devices, we used a mixed-signal hardware design of the RBM layer composed of an analog memristor array and a signed digital counter array (Figure 1d). The memristor array is composed of a crossbar array of memristors with the conductance of Gij and reference cells with the conductance of Gref.
Only the memristor array participates in the VMM, as detailed in Figure 1c. The forward and backward VMMs and stochastic excitation result in two sets of binarized visible neuron states and hidden neuron states (v and h, v' and h'). The digital counter array performs the outer product calculation of the CD matrix (CDij = vi hj - v'i h'j) and accumulates the ternary CD values in its cells, each of which is a signed digital counter. An identical pulse is applied to the memristor cell to potentiate or depress its weight (Gij) when the corresponding accumulated CDij in the digital array reaches the positive threshold (≥CDth) or the negative threshold (≤-CDth), respectively. This results in a positive or negative conductance change (ΔG) on the memristor cell, defined by the memristive synaptic behavior. No verifying read operations are needed. The training procedure of the memristive RBM is shown in Figure 1e, where the analog VMM and neuron state sampling steps, as well as the weight updates, are colored light blue and the digital CD calculation and accumulation step is colored light yellow, corresponding to the colored components in Figure 1d.
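The counter-based update rule above can be illustrated with a short behavioral sketch. Several details are assumptions made only for illustration: the function name `accumulate_and_update`, the fixed linear step `dG`, the conductance bounds, and the choice to reset a counter to zero once its cell has received a pulse (the text does not state what happens to the counter after a write).

```python
import numpy as np

def accumulate_and_update(counter, G, v, h, v_rec, h_rec, cd_th=4,
                          dG=1e-7, g_min=1e-6, g_max=1e-5):
    """Accumulate ternary CD values and fire open-loop write pulses.

    counter : (m, n) signed integer array (the digital counter array)
    G       : (m, n) float array of conductances, modified in place
    Returns the number of write pulses issued in this step.
    """
    v, h = np.asarray(v, dtype=int), np.asarray(h, dtype=int)
    v_rec, h_rec = np.asarray(v_rec, dtype=int), np.asarray(h_rec, dtype=int)
    cd = np.outer(v, h) - np.outer(v_rec, h_rec)   # entries in {-1, 0, 1}
    counter += cd.astype(counter.dtype)

    pot = counter >= cd_th      # cells that receive one potentiation pulse
    dep = counter <= -cd_th     # cells that receive one depression pulse
    # Ideal linear device (assumption): one identical pulse changes G by dG.
    G[pot] = np.minimum(G[pot] + dG, g_max)
    G[dep] = np.maximum(G[dep] - dG, g_min)
    counter[pot | dep] = 0      # assumed reset after a write, no verify read
    return int(pot.sum() + dep.sum())
```

Because a cell is written only after its counter crosses the threshold, most training steps issue no write at all, which is the mechanism behind the reduced number of write operations.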
The proposed mixed-signal approach for memristive DBN training shares some similarities with the state-of-the-art techniques recently proposed to improve the training performance of the deep neural network, [48,49] but also shows distinct features. S. Ambrogio et al. [48] proposed a hybrid synaptic cell composed of non-volatile memristive devices and volatile capacitor-gated transistors (2PCM + 3T1C) for a DNN implementation. The capacitor-gated transistor branch of the synaptic cell has high linearity for weight updating and performs both VMMs and weight updates. The accumulated weight updates were transferred to the non-volatile memristive devices periodically. In our proposed neural network, by contrast, the CD counter array only accumulates the gradient descent (weight update requests), and the memristive array alone performs the VMMs. S. R. Nandakumar et al. [49] proposed a mixed-precision approach where each layer of a DNN is composed of a low-precision memristive array and a high-precision digital part. The digital part computes and accumulates the weight update requests in floating-point numbers, and the conductance of the elements in the memristive array is updated when the accumulated weight update request in the high-precision digital part reaches a threshold. The memristive array performs the VMMs in an analog fashion and deals with the small inputs and outputs of error backpropagation, which requires high-performance DACs and ADCs. The high-precision digital part is more complex than our digital counter array, since in our proposal the weight update requests (CD) consist only of integers. The comparison of the learning algorithm and training method with previously reported works on memristive deep neural networks can be seen in the Supplementary Information Table S1. According to the literature, [14] the ADCs and DACs may account for 75% of the area and 87% of the energy consumption of the macro core consisting of the memristive array and peripheral circuits.
Thus, a significant reduction in energy consumption compared with the conventional design of memristive deep neural networks is expected.
Capacitor-gated transistors [48] or other emerging electrolyte-gated mem-transistors [50] with highly linear behaviors may also be used as the CD accumulation cells, replacing the digital counters. Since the CD accumulation array is only needed in the training stage, it can be powered off at the inference stage and does not require long-term non-volatility.
[Figure: accuracy of the fast deterministic inference versus the repeated sampling inference when using the well-trained DBN to recognize the handwritten digit images in the MNIST dataset.]

Memristive DBN training
We first use a synaptic behavior with an ideally symmetric and linear weight update ability. Then, the DBN is fine-tuned by the wake-sleep algorithm for 30 epochs (Supplementary Movie 2, Figure S2), [51] which can be performed in the same hardware as the greedy learning (see Methods for more details). Note that replacing the fully connected RBM layer with a convolutional RBM layer can effectively reduce the size of both the memristive array and the CD accumulation array while resulting in better accuracy, [52] which, however, is beyond the scope of the current work. To scale up the DBN for larger datasets, e.g., CIFAR-10, convolutional RBM layers are also necessary. [53]

Inferences with binarized activations
The inference of the DBN, i.e., pattern recognition from the input to the label, can be performed either as a fast deterministic inference or as a repeated sampling inference. Figure S1b shows the test accuracies of the deterministic inference and the 50-times sampling inference (the following training results use this metric) as a function of training epochs.

Immunity to non-idealities
To simulate more non-idealities of the memristive synaptic devices, an empirical model capturing the conductance levels (Np and Nd for the potentiation and depression phases, respectively), the on/off ratio (Gmax/Gmin), the non-linearities of potentiation and depression, and the asymmetry between potentiation and depression (different numbers of levels and different non-linearities for the two phases) is proposed and shown in Figure 4a. The model is written without cycle-to-cycle and device-to-device variations, with separate expressions for potentiation and depression. Figure S3 shows example traces of conductance evolution obtained from the model when randomly generated potentiation and depression pulses are applied. With this model in hand, we check the effects of various non-idealities of memristive devices on the performance of the memristive DBN.

Limited conductance levels
In contrast to the ideal analog conductance tunability, most memristive devices only show two conductance levels, i.e., the low-resistance state (LRS) and the high-resistance state (HRS). [2] Multiple conductance states are generally more promising in RRAM and PCM devices. [54] However, these multiple conductance states are usually obtained with external controlling stimuli, e.g., compliance currents or the closed-loop read-write-read verify technique. [22,55] The deterioration in the performance at higher conductance levels can be compensated if more training epochs are conducted (Figure 4b).

Non-linear weight update
Non-linear weight update behavior is another major source of performance loss when using memristive synaptic devices for the training of a neural network. [42,56] To verify the effect of weight update non-linearity on the training of the memristive DBN, we vary the non-linearities of both potentiation and depression in the model while keeping them equal (Supplementary Figure S5a and S5b). Figure 4c shows that increasing the non-linearity of the weight update slightly decreases the training accuracy. Increasing the CDth can partially compensate for this deterioration.

Asymmetric weight updates
The weight updates for potentiation and depression generally do not have the same degree of non-linearity, i.e., the non-linear weight updates are asymmetric. To test the effect of asymmetric weight updates, we fix the non-linearity of the depression phase and vary only the non-linearity of the potentiation phase (Supplementary Figure S5c and S5d). Figure 4d shows the performance of the memristive DBN as a function of the asymmetry between the weight update non-linearities of the potentiation and depression phases.
[Figure 4, panels (partial): e) cycle-to-cycle variation, f) device-to-device variation, g) yield, and h) read noise.]

Write variations
Another source of performance degradation of the memristive neural network comes from the cycle-to-cycle and device-to-device variations when writing to memristive devices. [57,58] The cycle-to-cycle write variations are modeled by adding Gaussian noise to the conductance change, with its standard deviation proportional to the ideal conductance change for each weight update operation. Surprisingly, a higher device-to-device variation does not degrade the performance of the neural network; we see a slight increase in the recognition accuracy.
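The cycle-to-cycle model described above (Gaussian noise whose standard deviation is proportional to the ideal conductance change) can be written compactly. This is an illustrative sketch; the function name `noisy_write` and the parameter name `sigma_c2c` (the proportionality factor) are hypothetical and not taken from the original model equations.

```python
import numpy as np

def noisy_write(dG_ideal, sigma_c2c, rng=None):
    """Return the actual conductance change for each write operation.

    dG_ideal  : ideal conductance change(s); sign encodes potentiation/depression
    sigma_c2c : noise std expressed as a fraction of |dG_ideal| (hypothetical name)
    """
    rng = np.random.default_rng() if rng is None else rng
    dG_ideal = np.asarray(dG_ideal, dtype=float)
    # Standard normal samples scaled so that std = sigma_c2c * |dG_ideal|
    noise = rng.normal(0.0, 1.0, size=dG_ideal.shape) * sigma_c2c * np.abs(dG_ideal)
    return dG_ideal + noise
```

With `sigma_c2c = 0` the function returns the ideal change unchanged, which gives a simple baseline for sweeps over the variation level.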

Device yield
In memristive devices, especially RRAM devices, device yield is the major issue preventing their application in data storage and neuromorphic computation on a large scale. [55,59] A memristive device may not work due to process variation, or in other cases the synaptic device may initially work well but then become stuck at the HRS or LRS during the following write operations. In the simulation shown in Figure 4g, we assume that a percentage of the devices are not working (half of them stuck in the HRS and the other half stuck in the LRS). From Figure 4g, we see that when the device yield is higher than 90%, the performance of the memristive DBN does not degrade. When the yield is less than 90%, however, the accuracy of the memristive DBN training quickly drops to 20%. Two factors cause the accuracy drop for low device yield: 1) low device yield prevents accurate layer-by-layer greedy learning; 2) the fine-tuning after the greedy learning is more sensitive to the non-idealities of the memristive devices and thus ruins the recognition ability previously learned through the greedy learning algorithm (Supplementary Information Figure S7a).
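The yield experiment above (a fraction of devices not working, half stuck in the HRS and half stuck in the LRS) can be mimicked with a simple fault-injection helper. This is a sketch under assumptions: the function name `apply_stuck_faults`, the random placement of the faulty cells, and the mapping of the HRS to `g_min` and the LRS to `g_max` are illustrative choices.

```python
import numpy as np

def apply_stuck_faults(G, yield_frac, g_min, g_max, rng=None):
    """Return a copy of G with stuck devices and a mask marking them.

    yield_frac : fraction of working devices, e.g. 0.9 for 90% yield
    g_min      : conductance of a device stuck in the HRS
    g_max      : conductance of a device stuck in the LRS
    """
    rng = np.random.default_rng() if rng is None else rng
    shape = np.shape(G)
    flat = np.asarray(G, dtype=float).flatten()        # flatten() always copies
    n_bad = int(round((1.0 - yield_frac) * flat.size))
    bad = rng.choice(flat.size, size=n_bad, replace=False)  # random fault sites
    flat[bad[: n_bad // 2]] = g_min                     # half stuck in the HRS
    flat[bad[n_bad // 2:]] = g_max                      # other half stuck in the LRS
    stuck = np.zeros(flat.size, dtype=bool)
    stuck[bad] = True
    return flat.reshape(shape), stuck.reshape(shape)
```

The returned mask is meant to be reused during training so that later write pulses leave the stuck cells unchanged.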

Read noise
Multiple sources of noise can induce inaccuracy in the reading of the memristive devices, for instance, flicker noise, random telegraph noise, and white noise. [60][61][62] The read instability could also originate from the sense amplifiers and other peripheral circuits. An inaccurate read current results in an inaccurate output of the VMM. However, since the proposed memristive DBN has stochastic outputs, the read noise could be a beneficial factor that makes the hardware implementation easier. As discussed earlier, the probabilistic behavior of the neurons in the RBM, induced by the noise current injected into each column of the memristive array in Figure 1c, can instead be realized by properly tuning the signal-to-noise ratio of the memristive device reading. [47] Here, we test the effect of read noise by adding a current noise to each of the memristive devices and test two cases, i.e., without and with the noise current injected into the neuron. Figure 4i shows the performance of the memristive DBN as a function of the read noise level for the two cases. With the noise current injected into the neuron, i.e., probabilistic neurons as designed earlier, the read noise of the memristive device slightly lowers the training performance. Without the noise current injected into the neuron, the memristive DBN shows highly degraded recognition accuracy after training when the read noise is small. A certain noise level is therefore beneficial to the neural network.

Nonlinear I-V characteristic
Another common non-ideal behavior of the memristive device is its non-linear I-V characteristic. [13,63] This prevents the direct implementation of the multiplication in the analog domain, since the conductance of the device is not constant at different read voltages.
Pulse width or pulse number modulation is usually used to represent the analog input of the VMM in the implementation of a neural network. [18,64] Operating the devices in a small dynamic range to avoid non-Ohmic conduction, or in the small-signal domain, has also been proposed. [65,66] All these solutions come at the price of complicating the readout circuit for VMM operations. In the proposed memristive DBN structure, however, the input of the VMM is binary-valued, so each device is read at a single voltage and the design is inherently immune to the non-linear I-V characteristic of memristive devices.

Memristive DBN with real memristive synaptic devices
The synaptic model in Figure 4a is used to fit various memristive synaptic behaviors of SiGe epiRAM [67], PCMO [23], ECRAM [50], OxRRAM [68], and PCM [69] devices (Figure 5a-e). For some of these devices, including OxRRAM [68] and PCM [69], we use the differential pair, where each synaptic cell contains two devices (Supplementary Figure S8). All the memristive synaptic behaviors obtained by identical potentiation and depression pulses enable the successful training of the memristive DBN, with accuracies ranging from 95% to 97% (Figure 5f and Supplementary Figure S9). Note that in Section 4, to validate that the learning algorithm accommodates all synaptic behaviors, the simulation includes extreme cases where the non-idealities of the memristive devices are unusually high. In these cases, the fine-tuning would ruin the previously learned recognition ability (Figure S4, S5, and S7). For the synaptic behavior of real memristive devices, as shown in Figure 5 and Figure S9, we see that the fine-tuning always improves the performance of the neural network.

Relaxed specifications for memristive synapses
By properly balancing the parameters of the various non-idealities, we obtained a set of parameters required for the memristive synaptic devices to achieve 95% accuracy on the MNIST dataset, as listed in Table 1. The specifications of memristive devices in the literature [26,[42][43][44] are also listed in the table for comparison. The proposed memristive DBN has highly relaxed specifications for memristive devices. From our simulations of the various non-idealities, we found that the non-linearity of the weight update is the most important factor that needs to be taken care of. Reducing the non-linearity continuously improves the performance of the neural network. The number of conductance levels and the device yield need special care, since a low number of conductance levels or a low device yield suddenly deteriorates the performance of the neural network. However, once enough conductance levels and sufficient device yield are available, further improvement of these metrics does not significantly benefit the neural network's performance. The other non-idealities are highly relaxed, and their requirements can be easily met by most of the existing memristive device technologies.
Table 1 (as extracted). Column headers: P.Y. Chen, et al., 2017 [43]; C.C. Chang, et al., 2018 [42]; S. Agarwal, et al., 2016 [44]; This work
Conductance levels: ≥1000 ≥64 ≥256 (8 bits) ----≥100
DAC/ADC accuracy: 9 bits -8 bits -1 bit
Nonlinear activation: Software Software Software Digital core Not required
'-': Not discussed; a Could be removed if differential pairs are used; b Converted from the original value since a different metric of non-linearity is used in the reference.

Conclusion
A memristive DBN composed of mixed-signal RBM layers for efficient online training is proposed. The mixed-signal RBM layer consists of an analog memristive array for the stochastic VMM and a digital counter array for the accumulation of the CD. The proposed memristive DBN has stochastically binarized activations and is free from the need for complex peripheral circuits with expensive DACs and ADCs. It shows high immunity to various non-idealities of the memristive synaptic devices. The endurance requirement of the memristive devices is also highly relaxed.

Methods
Training of a single RBM layer: We used the first order CD for the training of memristive RBM.
The input (state of the visible units, v) is first multiplied by the weight matrix of RBM 1 (w1) to obtain the probability of the state of the hidden units (h), $P(h_j = 1 \mid v) = \sigma\left(\sum_i v_i w_{ij}\right)$, where $\sigma(x) = 1/(1 + e^{-x})$. The hidden states are then sampled and passed backward through the same weight matrix to obtain the reconstructed state of the visible units (v'). After that, the reconstructed state of the visible units (v') is again multiplied by the weight matrix to obtain the probability of the reconstructed state of the hidden units (h'). The CD is then calculated by the difference between the outer products of the two sets of visible-hidden neuron states, $CD_{ij} = v_i h_j - v'_i h'_j$. The CD matrix, which acts as the gradient descent of the weight matrix, is used to update the weight matrix such that the energy of the RBM is minimized. In this work, we do not update the weight matrix directly. We accumulate the CD matrix (request for weight updates), and only update the elements in the weight matrix when the corresponding elements in the CD accumulation (signed integers stored in digital counters) reach a threshold. For the label neurons of the last RBM layer, the excitation probability of each label neuron is normalized by the sum $\sum_k e^{x_k}$ over all label neurons, where $x_k$ denotes the weighted input of label neuron k, which makes sure that, statistically, only one label neuron will be excited. The CD matrix and its accumulation are implemented separately for the weight matrices w3 and w4.
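The first-order CD step for a single RBM layer described above can be sketched in software as follows. This is a minimal illustration under assumptions, not the simulation code of this work: biases are omitted, the logistic sigmoid is assumed for the activation probability, and the names `sigmoid`, `sample`, and `cd1` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    """Stochastic binarization: 1 with probability p, else 0."""
    return (rng.random(p.shape) < p).astype(np.int8)

def cd1(v, w, rng):
    """One step of first-order contrastive divergence (biases omitted).

    v : (m,) binary visible states, w : (m, n) weight matrix
    Returns the ternary CD matrix and the sampled hidden states h.
    """
    h = sample(sigmoid(v @ w), rng)          # forward VMM + sampling
    v_rec = sample(sigmoid(h @ w.T), rng)    # backward VMM + sampling
    h_rec = sample(sigmoid(v_rec @ w), rng)  # second forward VMM + sampling
    cd = np.outer(v, h) - np.outer(v_rec, h_rec)  # entries in {-1, 0, 1}
    return cd, h
```

In the mixed-signal scheme of this work, the returned `cd` matrix would not be applied to the weights directly but added to the signed counters, and `h` would serve as the visible input of the next RBM layer once the current layer is trained.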
Image generation by the trained DBN: After fine-tuning with the generation path, the memristive DBN can be used to generate images when given only a label input (Figure S10). As shown in Figure S10a, a random noise image and a label are input to the DBN, and the top RBM layer performs the Gibbs sampling over multiple iterations. Taking the reconstructed visible neuron states of the top RBM layer as input, the generation path outputs the digit image corresponding to the label (Figure S10b).

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.