Hardware-Friendly Stochastic and Adaptive Learning in Memristor Convolutional Neural Networks

making the method highly hardware friendly and energy efficient. In addition, the probability updating algorithm has been intrinsically embedded into our method, in which all the nonlinear and stochastic updating can also be conducted by the memristor network itself; therefore, it is fully hardware friendly without complex peripheral circuits when compared with the traditional nonlinear SGD algorithm or the PL algorithm. The effectiveness of the proposed method has been carefully investigated on the training of a classical LeNet-5 CNN. The demonstrated network exhibits a high accuracy of about 93.88% (statistical data) on the MNIST test, which is close to that of a network with a complex updating method, such as the PL method (94.7%), and is higher than the original nonlinear SGD algorithm (90.14%).


Introduction
Neural networks have been widely explored in artificial intelligence (AI)-based applications such as computer vision and speech recognition. [1][2][3] However, AI algorithms running on traditional complementary metal-oxide-semiconductor (CMOS) digital platforms are largely constrained by the approaching end of Moore's law and the limitations of the von Neumann architecture. [4] Novel neuromorphic computing architectures beyond CMOS technology are therefore strongly desired.
classification. [11] Li et al. developed multilayer memristor neural networks for image processing. [16] Recently, Yao et al. reported a fully hardware implementation of memristor-based convolutional neural networks (CNNs). [13] However, most previous works focused only on the inference process; the training process is rarely addressed. As stated in Yao's work, [13] "In contrast to the pure in-situ training solution, the ex-situ training method appears to be a shortcut that takes advantage of existing high-performing parameters." This is because several bottlenecks remain to be solved for the training process. First, the training process is difficult to implement in memristor hardware itself and is usually conducted by a host computer. [12,15] Second, the online learning process demands great computational resources to calculate the parameters in weight updating, which increases not only the complexity of the peripheral circuit design but also the power consumption. Third, the nonlinear and asymmetric conductance-updating properties largely degrade the network performance. [10] Li et al. developed a two-pulse conductance-programming scheme [12] using the SGD algorithm to address the challenges in memristor conductance programming, which enables the network to continuously adapt its knowledge and significantly improves its accuracy and defect tolerance. However, the nonlinear property remains a great challenge. Wang et al. proposed a piecewise linear (PL) method [19] to mitigate the weight-update error caused by the nonlinearity of memristors. Compared with the classical nonlinear SGD method, it provides higher overall accuracy. However, it involves complex peripheral circuits to calculate and store the updating parameters, which is not energy efficient.
Generally, the performance of memristor networks can be improved by ex situ training, but most such approaches require the parameters to be tuned based on specific knowledge of the hardware (e.g., peripheral circuitry) and the memristor array (e.g., device defects). In this work, we propose a concise and effective learning method for memristor network weight updating. The in situ training adapts and compensates the weights automatically and is thus more powerful and energy efficient in training neural networks.
The proposed method in this work is inspired by the traditional SGD method, [20] but no learning-rate parameter (an important parameter in the SGD algorithm) is needed here. In each updating process, only the updating direction (increase or decrease) is provided. [11] Therefore, the method is self-adaptive [12] and free of calculating specific numbers of electrical pulses. In addition, the probability updating algorithm is intrinsically embedded in our method, and the nonlinear and stochastic updating can be fully implemented by the memristor network itself. Compared with the traditional nonlinear SGD updating algorithm and the PL updating method used in memristor neural networks, our method is fully hardware friendly without complex peripheral circuits. The effectiveness of the proposed method has been carefully investigated on the training of a classical LeNet-5 CNN, [20] and the results show that our stochastic-adaptive learning method performs comparably to the PL method and better than the nonlinear SGD method.

Memristor CNN
The CNN was first proposed in the 1980s. However, it was not widely used at first because training it was difficult. LeCun et al. applied a gradient-based learning algorithm to train a CNN and obtained successful results. [20] After that, researchers further improved the CNN and reported strong results in the field of image recognition. [2,3,[21][22][23] To date, the CNN remains a top research subject for AI applications due to its great performance. The overall architecture of a CNN is structured as a series of convolution layers. The convolution layer exploits the property that many natural signals are compositional hierarchies, in which higher-level features can be obtained by composing lower-level ones. [1] For example, edges form motifs, motifs assemble into parts, and parts form objects. In deep CNNs, these specific features are detected by a series of convolution kernels (filters), in which vector multiplication is used to measure the similarity between the input features and the convolution filters.
LeNet-5 is the first successful CNN, designed for handwritten and machine-printed character recognition [20] in the 1990s. As shown in Figure 1a, LeNet-5 contains two convolutional layers and three fully connected (FC) layers. The FC layer can also be treated as a convolution operation with a 1 × 1 filter; therefore, all the calculations in the CNN are equivalent to a matrix-vector multiplication (MVM) between a weight matrix of size (K_y × K_x × CH_in) × CH_out and the unrolled input patches, in which the kernel size of the convolution weight is K_y × K_x, the input channel is CH_in, the output channel is CH_out, and the height and width of the input feature are H and W, respectively. The matrix multiplication can be directly implemented using the memristor network, [8,9,18] as shown in Figure 1b. For the forward propagation, the multiplication operation can be directly obtained using Ohm's law in terms of I_ij = V_i × G_ij, whereas the accumulation operation can be done by Kirchhoff's law, I_out,j = Σ_i V_i × G_ij, at the output neuron, where G_ij represents the corresponding conductance of the memristor.
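The Ohm/Kirchhoff forward pass above can be sketched in a few lines of NumPy. The function name `crossbar_mvm` and the toy voltages and conductances are only illustrative:

```python
import numpy as np

def crossbar_mvm(V_in, G):
    """Simulate one forward pass through a memristor crossbar.

    Ohm's law gives the per-device current I_ij = V_i * G_ij, and
    Kirchhoff's current law sums the currents along each output
    column: I_j = sum_i V_i * G_ij, i.e., a matrix-vector product.

    V_in : (n_rows,) input voltages
    G    : (n_rows, n_cols) device conductances
    """
    return V_in @ G  # column-wise current summation

# Toy example: 3 input lines, 2 output columns.
V = np.array([0.1, 0.2, 0.3])
G = np.array([[1.0, 2.0],
              [3.0, 0.5],
              [0.0, 1.0]])
I = crossbar_mvm(V, G)
print(I)  # [0.7 0.6]
```

Every weight layer of the network (convolution or FC) reduces to repeated calls of this single primitive on the unrolled inputs.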
For the back propagation and weight updating, the desired conductance of each memristor can be translated into a certain number of electrical pulses for each device, determined by the standard SGD algorithm. [9,10] Figure 1c shows the typical performance of an electronic synaptic device (such as a memristor or a synaptic transistor [24] ). The conductance of the device can be well modulated by the input electrical pulses (including the pulse number, duration, direction, and amplitude). When the electrical pulse is positive, the conductance shows a trend of continuous increase, corresponding to long-term potentiation (LTP, the strengthening of synapses based on recent patterns of activity). With a negative pulse, the device conductance is gradually depressed, corresponding to long-term depression (LTD, a decrease in synaptic strength). LTP and LTD are the core parts of synaptic plasticity, meaning that the synaptic weight can be modulated and thus applied in artificial neural networks.
An ideal synaptic device exhibits a "linear" relationship between the conductance and the number of programmed voltage pulses, namely, G_ideal = k × N_pulse. However, practical devices reported in the literature do not follow such an ideal trajectory, exhibiting "exponential-type" updating characteristics. [10] What is worse, the precision of the conductance is not high, usually limited to 4-8 bits (corresponding to 16-256 conductance states). Random conductance variation is inevitable due to physical limitations, including the inherent drift and diffusion dynamics of the ions/vacancies in the device. [7] Thus, it is difficult to train a memristor neural network in situ.
To analyze the impact of nonlinear weight updating during network training, a general behavior model [10] is adopted. The device properties can be well described with five parameters: G_max (the maximum conductance), G_min (the minimum conductance), P_max (the number of conductance states), A_p (nonlinearity of the LTP updating), and A_d (nonlinearity of the LTD updating). The conductance can then be deduced as

G(x) = G_min + (G_max − G_min) × (1 − e^(−A·x/P_max)) / (1 − e^(−A))

in which A is A_p (for the LTP updating) or A_d (for the LTD updating) and x represents the current pulse state, ranging from 0 to P_max. G_max, G_min, and P_max can be directly measured in the electrical characterization process, and A_p and A_d can be extracted from the conductance-pulse curve. For further numerical analysis, the normalized conductance-pulse curve of the model device is shown in Figure 2, in which different A_p and A_d values are plotted. It is worth mentioning that the A_p and A_d of one device are usually unrelated, and for most practical devices, the LTP nonlinearity A_p is empirically better (smaller in magnitude) than the LTD nonlinearity A_d. Taking the device in Figure 1c as an example, G_max, G_min, P_max, A_p, and A_d are 20, 3.5, 32, 0.1, and −3.5, respectively.
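The behavior model can be sketched as a small Python function. The exact closed form and the scaling of A relative to P_max are assumptions here (a common exponential fit using the stated five parameters); only the endpoints G(0) = G_min and G(P_max) = G_max and the "exponential-type" shape are taken from the text:

```python
import numpy as np

def conductance(x, A, G_max, G_min, P_max):
    """Nonlinear conductance vs. pulse state x in [0, P_max].

    Assumed exponential behavior model: A = A_p (LTP) or A = A_d
    (LTD, negative).  As A -> 0 the curve approaches the ideal
    linear response G_min + (G_max - G_min) * x / P_max.
    """
    if abs(A) < 1e-9:  # linear limit
        return G_min + (G_max - G_min) * x / P_max
    num = 1.0 - np.exp(-A * x / P_max)
    den = 1.0 - np.exp(-A)
    return G_min + (G_max - G_min) * num / den

# Device of Figure 1c: G_max = 20, G_min = 3.5, P_max = 32.
x = np.arange(33)
g_ltp = conductance(x, 0.1, 20.0, 3.5, 32)   # A_p = 0.1: nearly linear
g_ltd = conductance(x, -3.5, 20.0, 3.5, 32)  # A_d = -3.5: strongly bent
```

Both traces share the same endpoints; the depression branch is read with the state decreasing from P_max back to 0.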

Stochastic and Adaptive Learning in Memristor Networks
Next, we turn to the memristor network learning process. It should first be pointed out that the synaptic weight in a neural network can be either positive or negative, whereas the conductance of a device is always positive. To map the positive conductances into both positive and negative weights, many previous works [9,[11][12][13] have taken advantage of the differential-pair method (Figure 3a,b).

Figure 2. Normalized nonlinear conductance-pulse curve of the memristor. The device exhibits "exponential-type" updating characteristics in both LTP and LTD.

www.advancedsciencenews.com www.advintellsyst.com

Generally, the output currents of the original memristor network are first translated into voltages by a current-voltage converter, namely, V_out^+ = −R × I_out^+ and, similarly, V_out^− = −R × I_out^−. Then, the final output can be obtained using the integrated operational-amplifier circuit as V_out = V_out^− − V_out^+ = R × (I_out^+ − I_out^−). For the weight updating process, if the weight is calculated (according to the SGD algorithm) to be increased, several positive electrical pulses are applied to device w_1 while device w_2 is untouched. If the weight has to be decreased, several negative electrical pulses are applied to device w_1, and device w_2 is again untouched. The above process is the standard weight updating method used in today's memristor networks (named "mode-0" here). However, several shortcomings exist. First, the number of electrical pulses must be calculated before updating, which largely increases the computing complexity. [14,15,25,26] Worse, in some cases the form of the pulses is variable: the direction, amplitude, or duration of the pulse has to be changed for higher accuracy, which increases the design complexity. [10,12,16] Second, the second memristor (device w_2) in the differential pair is never updated. [12] Third, the nonlinear characteristic of the device remains inextricable.
An adaptive learning system is an automatic control system that preserves its operational capability under conditions of unforeseen change in the properties of the controlled system. For a neural network, an adaptive system requires self-optimization capabilities [12] and the ability to continuously adjust its parameters. On this basis, we propose new learning algorithms to address the aforementioned issues. Four different updating modes are developed, as shown in Table 1.
To improve the network accuracy with an elaborate updating algorithm, "mode-1" and "mode-2" (the latter as a normal control group) are proposed with the 1-transistor-1-memristor plus 1-transistor-1-memristor (1T1R-1T1R) structure, which is the same as that of the classical nonlinear SGD algorithm ("mode-0"). The difference is that in "mode-0" the two memristors in the differential pair are exactly the same, whereas in our design (both "mode-1" and "mode-2") the two have different device sizes. (Note that this does not introduce additional issues during the device/chip design and fabrication process.)
To reduce the circuit area, "mode-3" (as the normal control group) and "mode-4" are designed using the 1-transistor-1-memristor plus 1-resistor (1T1R-1R) structure. The 1T1R-1T1R structure requires twice the number of devices, whereas the 1T1R-1R structure can share the resistor among different differential pairs; [24] in this way, about 50% of the circuit area of the memristor network can theoretically be saved. The difference between "mode-3" and "mode-4" lies in the initialization of the resistor's resistance. In "mode-3," all the resistances (weights w_2) are randomly initialized at the very beginning and then fixed during the whole updating process, exactly as in "mode-0." In "mode-4," the resistance is always fixed at (G_max/2 + G_min/2).
The weight updating method in the four modes above differs from that of "mode-0" mainly in that it is not necessary to calculate the specific number of voltage pulses, and the form of the pulse is fixed, with no change in duration, amplitude, or direction. All that is needed is the sign of the update Δw, either positive or negative, for the training. These simple learning rules make the memristor network capable of updating its knowledge adaptively, thus gaining adaptation and self-optimization capabilities for more conditions.

Figure 3. a) Differential-pair design of the typical 1T1R-1T1R memristor network, in which 1T1R means the 1-transistor-1-memristor structure. b) Weight updating in the differential pair.
The updating algorithm is easily explained for "mode-2," "mode-3," and "mode-4." In "mode-2," the weight is w = w_1 − w_2. If the weight of the differential pair w needs to be increased, we send one positive pulse to device w_1 and one negative pulse to device w_2. If the weight w needs to be decreased, we send one negative pulse to w_1 and one positive pulse to w_2. For "mode-3" and "mode-4," the weight is w = w_1 − R. If the weight needs to be increased, we just send one positive electrical pulse to the first memristor (w_1), and vice versa. These three updating methods are deterministic, because the practical weight updating directions are always the same as those suggested by the algorithm.
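The deterministic rules for these three modes can be summarized in a short sketch; the encoding of pulses as +1/−1/0 is just an illustrative convention:

```python
def one_pulse_update(mode, dw_sign):
    """Return the pulse polarity applied to (w1, w2) for a requested
    weight change of sign dw_sign (+1 to increase, -1 to decrease).

    mode 2:     w = w1 - w2, both devices pulsed in opposite directions.
    mode 3 / 4: w = w1 - R,  only w1 is pulsed; the resistor is untouched.
    """
    if mode == 2:
        return (dw_sign, -dw_sign)
    if mode in (3, 4):
        return (dw_sign, 0)
    raise ValueError("deterministic rule defined for modes 2-4 only")

print(one_pulse_update(2, +1))  # (1, -1): +pulse to w1, -pulse to w2
print(one_pulse_update(3, -1))  # (-1, 0): -pulse to w1 only
```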
For "mode-1," the updating algorithm is stochastic and can be described in the following steps.
Step 0: Randomly divide the network weights into two parts: "default positive" and "default negative." "Default positive" means that the differential pair in hardware is designed to be w = w_1 − w_2, and "default negative" means that it is designed to be w = w_2 − w_1, where the devices w_1 and w_2 have different areas, with the area of device w_1 larger than that of device w_2.
Step 1: Random initialization. Though the size of w_1 is larger than that of w_2, the value of w = w_1 − w_2 is not always positive after random initialization. The probability can be calculated as follows. Suppose w_1 and w_2 are randomly initialized with a distribution f(x), the range of w_1 is [G_min1, G_max1], and the range of w_2 is [G_min2, G_max2]; then, the probability of w = w_1 − w_2 > 0 can be expressed as

P_{w1>w2} = ∫_{G_min2}^{G_max2} f(w_2) [ ∫_{w_2}^{G_max1} f(w_1) dw_1 ] dw_2

For example, if f(x) is a uniform distribution, and G_min1 = G_min2 = 0, G_max1 = 2, and G_max2 = 1, the probability is P_{w1>w2} = 75%.
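The 75% figure of the worked example can be checked numerically; the Monte Carlo estimate below is only a sanity check of the double integral:

```python
import random

def prob_w1_gt_w2(n=200_000, seed=1):
    """Estimate P(w1 > w2) for w1 ~ U[0, 2], w2 ~ U[0, 1].
    The closed-form value of the double integral is 3/4."""
    rng = random.Random(seed)
    hits = sum(rng.uniform(0.0, 2.0) > rng.uniform(0.0, 1.0)
               for _ in range(n))
    return hits / n

p = prob_w1_gt_w2()
print(round(p, 2))  # close to 0.75
```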
Step 2: One-pulse updating. Using the standard SGD algorithm, we obtain the updating direction, i.e., whether Δw is positive or negative. If Δw > 0, one positive pulse is sent to both w_1 and w_2 for the "default positive" differential-pair devices (see step 0), and one negative pulse is sent to both w_1 and w_2 for the "default negative" differential-pair devices. If Δw < 0, one negative pulse is sent to both w_1 and w_2 for the "default positive" devices and one positive pulse to both w_1 and w_2 for the "default negative" devices.
Step 3: Stop the learning process when all the loops (including the batch-number loop, layer-index loop, and device-number loop) are finished.
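Steps 0-3 condense into a one-line rule: both devices of a pair always receive the same single pulse, whose polarity depends only on the sign of Δw and on whether the pair was wired "default positive" or "default negative." A minimal sketch (the +1/−1 encoding is illustrative):

```python
def mode1_pulse(default_positive, dw_sign):
    """Polarity of the single pulse sent to BOTH w1 and w2 ("mode-1").

    default_positive : True for w = w1 - w2, False for w = w2 - w1
    dw_sign          : +1 if the SGD direction says increase w, else -1

    The larger area of w1 makes the pair's net weight follow the
    requested direction with high probability.
    """
    return dw_sign if default_positive else -dw_sign

print(mode1_pulse(True, +1))   # +1: positive pulse to both devices
print(mode1_pulse(False, +1))  # -1: negative pulse to both devices
```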
The aforementioned updating algorithm can be easily conducted in both software simulation and hardware implementation (Table 2). However, one uncertain point must be emphasized: a one-pulse update may fail due to the random initialization of w_1 and w_2. For instance, how can we guarantee that the weight w_new is really larger than w_old in the "default positive" devices (w = w_1 − w_2) after a one-pulse positive update? Figure 4a shows the typical updating process for Δw > 0 in a "default positive" device as an example. The condition for successful learning after the one-pulse update is

G_1(x_1 + 1) − G_2(x_2 + 1) > G_1(x_1) − G_2(x_2)

where x is the memristor pulse state. This condition can be further simplified using Taylor's expansion with a low-order approximation; to first order, it is equivalent to

G'(x_1) − G'(x_2) > 0

The probability of G'(x_1) − G'(x_2) > 0 is then calculated.

Table 1. Different weight updating methods. (Note 1: "↑" means increase and "↓" means decrease. Note 2: In "mode-0," "mode-3," and "mode-4," the weight is written as "w = w_1 − R" because w_2 is always fixed to a constant value of R.)

where P_max, A_1, G_max1, G_min1 and A_2, G_max2, G_min2 are the attributes of devices w_1 and w_2, respectively, and the remaining temporary variables are defined from them. The probability of G'(x_1) − G'(x_2) > 0 shows a strong dependence on the ratio G_max1/G_max2. When G_max1/G_max2 = 1, the result is 50%, which is consistent with intuition. If G_max1/G_max2 = 2, the probability can reach as high as 95.4%, as shown in Figure 4c (black dots), which means that the conductance of w_new has a significantly high probability of being practically larger than w_old after the one-pulse update. A similar analysis for Δw < 0 can be made, as shown in Figure 4b.

Figure 4. a) Analyses of the updating process for Δw > 0 in "default positive" devices. b) Updating process for Δw < 0. c) Black dots: the probability of a successful one-pulse update at different G_max1/G_max2. Blue dots: the effective weight precision at different G_max1.

Table 2. Implementation of the stochastic updating algorithm ("mode-1"). Step 0: randomly design each differential pair to be "default positive" (w = w_1 − w_2) or "default negative" (w = w_2 − w_1); w_1 and w_2 have different device sizes (the size of w_1 is larger than that of w_2), so G_w1 > G_w2 with a high probability_1 (e.g., 80%) according to the sizes. Step 2: the weight update needs only one pulse in all cases; if Δw > 0, send one positive pulse to w_1 and w_2 ("default positive" pairs) or one negative pulse to w_1 and w_2 ("default negative" pairs); if Δw < 0, reverse the pulse polarities. An update may fail at a given step due to the random distribution of w_1 and w_2, thus introducing a second probability_2 (e.g., 95%) according to the design. Step 3: run the loops.

However, it should be noted that with an increasing G_max1/G_max2 ratio, the effective weight precision (defined as 1/(G_max1 − G_max2)) decreases because the number of conductance states is fixed, and this degrades the network performance, as discussed later.
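The ratio dependence can be probed with a small Monte Carlo experiment on an exponential device model. The model form, A value, and uniform state distribution are assumptions, so only the symmetric ratio = 1 case has a parameter-free answer of 50%; the exact high-ratio probability (95.4% in the text) depends on the real device parameters:

```python
import numpy as np

def G(x, A, G_max, G_min, P_max):
    # Assumed exponential behavior model (see text).
    return G_min + (G_max - G_min) * (1 - np.exp(-A * x / P_max)) / (1 - np.exp(-A))

def success_rate(ratio, n=200_000, A=3.0, P_max=32, seed=0):
    """Fraction of one-pulse updates (dw > 0, "default positive" pair,
    one positive pulse to both devices) for which w = G1 - G2 really
    increases, i.e., the increment of G1 exceeds that of G2.
    G_max1/G_max2 = ratio; pulse states drawn uniformly."""
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(0, P_max - 1, n)  # current pulse states
    x2 = rng.uniform(0, P_max - 1, n)
    d1 = G(x1 + 1, A, ratio, 0.0, P_max) - G(x1, A, ratio, 0.0, P_max)
    d2 = G(x2 + 1, A, 1.0, 0.0, P_max) - G(x2, A, 1.0, 0.0, P_max)
    return float(np.mean(d1 > d2))

print(round(success_rate(1.0), 2))        # ~0.50 by symmetry
print(success_rate(2.0) > success_rate(1.0))  # True: a larger ratio helps
```

The qualitative trend matches Figure 4c: identical devices give a coin flip, while a larger G_max1/G_max2 biases each single pulse toward the requested direction.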

Network Performance
Before training the network using our stochastic-adaptive method, the linear scaling factor k of each layer should be determined first, because the distribution range of the weights largely depends on the k-value. [11][12][13] It is best to make the range of the network weight parameters consistent with the memristor's conductance range.
In software, k is the scaling factor of the Rectified Linear Unit (ReLU) activation function.
In hardware, however, k is mainly set by the value of the resistance R in the current-voltage converter, as shown in Figure 3a; the activation in the circuit is accordingly designed as a scaled ReLU, f(x) = k × max(0, x). The effectiveness of the proposed method has been investigated on the classical LeNet-5 network. Figure 5 shows the comparison of our method (this work) with the standard linear SGD algorithm (ideal performance) and the nonlinear SGD algorithm (baseline), in which the k-values of the conv1, conv2, FC1, FC2, and FC3 layers are 1, 1, 0.01, 0.1, and 0.5, respectively, and batch_size = 128. Our method (red dots) shows comparably high accuracy with the standard linear SGD algorithm (gray dots) and is meanwhile higher than the nonlinear SGD algorithm (blue dots). Based on the extremely simple updating algorithm, our network is able to continuously adjust its parameters and exhibits great self-optimization capabilities. Remarkably, it exhibits particularly good performance at small training epochs. For example, the training accuracy can reach as high as about 85% at iteration = 50 (corresponding to only 6400 training images), which means the system is able to learn useful knowledge from a small number of training samples, confirming the high performance of our stochastic-adaptive learning system. [12] Next, more detailed information on network training has been studied using "mode-1." Figure 6 shows the network accuracy for different values of G_max, G_min, LTP nonlinearity, and LTD nonlinearity. When the G_max of w_2 is fixed at a normalized value of 1, the best G_max of w_1 is found to be around 2, and the best G_min ranges from 0 to 0.1, which means the on/off ratio of the original memristor is about 20 and can easily be achieved in a practical device.
In addition, when the nonlinearity |LTP| and |LTD| is less than 2, the recognition accuracy can reach about 90%, indicating good network performance. [10,12,19,25] The results for the other modes are shown in Figure 7. "Mode-1" exhibits the best performance. "Mode-2" performs worse than "mode-1" because of the excessive amount of compensation in updating, which is similar to using a large learning rate in the SGD algorithm. "Mode-3" and "mode-4" are designed to reduce the circuit area; however, the second weight (w_2) in the differential pair is untouched during weight updating, so their performance is also lower than that of "mode-1." It should be noticed that "mode-4" has a higher accuracy than "mode-3," mainly because "mode-4" has a more uniform and symmetric weight distribution. Set G_max = 1 and G_min = 0, with 11 weight states. In "mode-4," w_2 = (G_max + G_min)/2 = 0.5; thus, the weight value of each differential pair (W = w_1 − w_2 = w_1 − 0.5) can be either negative or positive (weight set: W = −0.5, −0.4, −0.3, −0.2, −0.1, 0, 0.1, 0.2, 0.3, 0.4, 0.5). In "mode-3," however, the distribution of W is usually asymmetric. For example, if w_2 is (randomly) initialized to w_2 = G_max (or G_min), the weight value of the differential pair (W = w_1 − w_2) will always be negative (or positive), which is detrimental to high-accuracy online learning.
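The symmetric-versus-one-sided weight sets of the two modes can be made explicit with the same numbers as in the text (G_max = 1, G_min = 0, 11 states):

```python
import numpy as np

G_max, G_min, n_states = 1.0, 0.0, 11
w1 = np.linspace(G_min, G_max, n_states)  # 0.0, 0.1, ..., 1.0

# Mode-4: w2 fixed at (G_max + G_min)/2 = 0.5 -> symmetric weight set.
w_mode4 = w1 - (G_max + G_min) / 2        # -0.5 ... +0.5

# Mode-3 worst case: w2 randomly initialized to G_max and then fixed,
# so every reachable weight is non-positive.
w_mode3 = w1 - G_max                      # -1.0 ... 0.0

print(np.allclose(w_mode4, -w_mode4[::-1]))  # True: symmetric about 0
print(w_mode3.max())                         # 0.0: one-sided
```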
Finally, the network performance with different training methods is shown in Table 3, where all the listed data represent the average results obtained from ten duplicated tests with random weight initialization for each mode. The proposed stochastic-adaptive learning method performs much better than the nonlinear SGD algorithm and is close to the network with a complex updating method, such as the PL method. A comparison of hardware implementation methods for memristor network training using different updating algorithms is shown in Table 4. Our method offers the most hardware-friendly features.

Conclusion
In summary, we developed a stochastic and adaptive learning method to train memristor CNNs, and four different modes have been proposed to either improve the performance or reduce the chip area. In the proposed learning method, only the updating direction provided by the learning algorithm is required; the complex calculation of specific conductance variations, fine-tuning of the pulse amplitude and duration, and specific numbers of electrical pulses are all exempted, making the method highly hardware friendly and energy efficient. In addition, the probability updating algorithm is intrinsically embedded in our method, in which all the nonlinear and stochastic updating can be conducted by the memristor network itself; therefore, it is fully hardware friendly without complex peripheral circuits when compared with the traditional nonlinear SGD algorithm or the PL algorithm. The effectiveness of the proposed method has been carefully investigated on the training of a classical LeNet-5 CNN. The demonstrated network exhibits a high accuracy of about 93.88% (statistical data) on the MNIST test, which is close to that of a network with a complex updating method, such as the PL method (94.7%), and higher than the original nonlinear SGD algorithm (90.14%).

Figure 6. The influence of G_max and G_min on the network performance trained with the "mode-1" method, in which G_max1 is the variable on the horizontal axis and G_min2 is the variable on the vertical axis. In each panel, there are 49 results corresponding to the 49 combinations of A_p (nonlinearity varying from 0 to 6) and A_d (nonlinearity varying from 0 to −6). The best performance occurs at an on/off ratio of about 20 or more, corresponding to panels a2, a3, b2, and b3. Note: G_max1:G_min1 = G_max2:G_min2, and G_max2 = 1.

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.