Direct Gradient Calculation: Simple and Variation-Tolerant On-Chip Training Method for Neural Networks

On‐chip training of neural networks (NNs) is regarded as a promising training method for neuromorphic systems with analog synaptic devices. Herein, a novel on‐chip training method called direct gradient calculation (DGC) is proposed to substitute conventional backpropagation (BP). In this method, the gradients of a cost function with respect to the weights are calculated directly by sequentially applying a small temporal change to each weight and then measuring the change in cost value. DGC achieves a similar accuracy to that of BP while performing a handwritten digit classification task, validating its training feasibility. In particular, DGC can be applied to analog hardware‐based convolutional NNs (CNNs), which is considered to be a challenging task, enabling appropriate on‐chip training. A hybrid method is also proposed that efficiently combines DGC and BP for training CNNs, and the method achieves a similar accuracy to that of BP and DGC while enhancing the training speed. Furthermore, networks utilizing DGC maintain a higher level of accuracy than those using BP in the presence of variations in hardware (such as synaptic device conductance and neuron circuit component variations) while requiring fewer circuit components.

DOI: 10.1002/aisy.202100064

However, conventional backpropagation (BP) has several drawbacks. These include the requirement for additional circuits and memory devices to implement the backward path that propagates the error values, and the difficulty of designing the backward path in complex networks. [30] Furthermore, accuracy degradation due to large variations in hardware remains an unsolved challenge for BP. [28] We therefore propose a novel on-chip training method for neuromorphic systems, called direct gradient calculation (DGC), that avoids the additional circuitry required for the backward propagation of error values. DGC is a method that obtains the gradient values of the weights in hardware-based NNs.
In this method, the update quantity of the weights in the gradient descent method is calculated by sequentially applying a small change to each weight and then measuring the change in the cost value. The scope and conditions in which DGC can be used are the same as those of BP. DGC achieves accuracy equal to that of BP with the following advantages: 1) a reduction in the area required for on-chip training of NNs by eliminating the backward-path circuitry of BP; 2) applicability to on-chip training of diverse hardware-based NNs for which designing BP circuitry is challenging; and 3) stronger immunity to device variations than BP.
This article is organized as follows. Section 2 presents the underlying principle and algorithm details of DGC and provides the hardware implementation of the proposed training method. Section 3 provides verification of the training feasibility of DGC, an analysis of the network accuracy depending on the magnitude of a small change applied to the weights, verification of DGC on a deeper and more complex network, and the hardware variation tolerance of DGC compared with BP.

Limitations of On-Chip BP
The gradient descent method updates the weight values of a network by calculating the gradients of a cost function with respect to the weights, with the aim of minimizing the cost value for a given training dataset. [31] For a cost function C, weight W_i, and learning rate η, the update quantity of the weight is calculated as follows

ΔW_i = −η (∂C/∂W_i)    (1)

where i represents the index of the specific weight among the N total weights of the network. In BP, the gradient value of each weight is calculated by propagating the error value δ backward using the chain rule. [32] However, there are two major issues when implementing BP in a hardware system. First, although on-chip BP training has stronger immunity to variations in hardware than off-chip training, variations in hardware and noise can still affect the accuracy of the network, because the update quantities pass through several sequential layers. [28] BP assumes that the hardware implementation of its equations, involving summation, multiplication, and other operations such as activation functions, is ideal and that device-to-device uniformity is preserved. However, discrepancies between the ideal and the real hardware environment cannot be completely corrected by BP, because the devices in the backward path also exhibit nonideal variations in their characteristics. Second, the backward path requires a considerable amount of hardware resources. The gradient value of each weight is obtained by multiplying the error value of the postsynaptic neuron by the activation value of the presynaptic neuron, which requires as many multiplier elements as there are weights. In addition, each neuron requires a memory device to store its activation value, which is necessary to propagate the error values backward. [30]
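For concreteness, the chain-rule computation that BP performs can be illustrated in software. The following is a minimal NumPy sketch of Equation (1) for a tiny fully connected network, not the article's hardware implementation; the network size, target function, and variable names are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 2-4-1 network and a toy regression target, for illustration only.
W1 = rng.normal(scale=0.5, size=(2, 4))
W2 = rng.normal(scale=0.5, size=(4, 1))
x = rng.normal(size=(8, 2))
y = np.tanh(x @ np.array([[1.0], [-1.0]]))  # toy target function

def forward(W1, W2, x):
    h = np.tanh(x @ W1)   # hidden activations (stored for the backward path)
    return h, h @ W2      # linear output layer

def cost(out):
    return float(0.5 * np.mean((out - y) ** 2))

eta = 0.1
initial_cost = cost(forward(W1, W2, x)[1])
for _ in range(200):
    h, out = forward(W1, W2, x)
    delta2 = (out - y) / len(x)             # output-layer error delta
    grad_W2 = h.T @ delta2                  # chain rule, layer 2
    delta1 = (delta2 @ W2.T) * (1 - h**2)   # propagate delta backward (tanh')
    grad_W1 = x.T @ delta1                  # chain rule, layer 1
    W2 -= eta * grad_W2                     # Delta W = -eta * dC/dW, Eq. (1)
    W1 -= eta * grad_W1
final_cost = cost(forward(W1, W2, x)[1])
```

Note how the backward path needs both the stored activations `h` and the multiplications producing `grad_W1` and `grad_W2`; these are precisely the memory and multiplier resources that become costly in hardware.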

DGC Method
We propose a novel method for obtaining the weight gradient values. With DGC, the weight gradient values are determined directly by measuring the rate of change in the cost value with respect to a change in the weight value, instead of propagating the error backward. Because DGC does not require a backward path or the implementation of derivative functions, the above-mentioned problems related to BP can be overcome. The essence of DGC can be summarized as follows

∂C/∂W_i ≈ δC/δW = [C(W_i + δW, θ) − C(W_i, θ)] / δW    (2)

where θ represents the other parameters in the network except W_i. In this article, δW is the temporal weight change for the gradient calculation, and ΔW is the weight update value. If the value of δW is sufficiently small, the measured gradient value for each weight is close to the actual gradient value, which enables the weights to be updated toward the minimum cost value. The complete sequences for learning a single training datum in a network with N weights using BP and the two DGC methods (which use different updating processes) are presented in Figure 1. According to Equation (2), if δW is infinitesimally small and the update is performed simultaneously after calculating the gradients for all the weights, DGC updates all the weights identically to BP, as shown in Figure 1a,b. However, retaining all the calculated weight gradients requires a significant amount of memory. To avoid this problem, we propose that each weight be updated one by one after calculating the gradient of that weight, instead of simultaneously updating all the weights, as shown in Figure 1c. Updating the weights simultaneously and one by one is referred to as DGC-S and DGC-O, respectively. The validation of updating the weight values one by one is presented in Section 3.1.
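The DGC-O sequence above (perturb one weight, measure δC, update that weight immediately, move on) can be sketched in software. This is a minimal illustration on an arbitrary scalar cost function; the function name `dgc_o_step` and the toy quadratic cost are our own, not from the article:

```python
import numpy as np

def dgc_o_step(weights, cost_fn, delta_w=0.01, eta=0.1):
    """One DGC-O pass over all weights: perturb each weight in turn,
    measure the resulting cost change, and update that weight
    immediately (one-by-one update, Figure 1c)."""
    for i in range(len(weights)):
        c0 = cost_fn(weights)          # initial cost C
        weights[i] += delta_w          # apply temporal change dW, Eq. (2)
        c1 = cost_fn(weights)          # perturbed cost C + dC
        weights[i] -= delta_w          # restore the weight
        grad_i = (c1 - c0) / delta_w   # measured gradient dC/dW
        weights[i] -= eta * grad_i     # immediate update (DGC-O)
    return weights

# Toy quadratic cost with its minimum at w = (1, -2, 3).
target = np.array([1.0, -2.0, 3.0])
cost = lambda w: float(np.sum((w - target) ** 2))

w = np.zeros(3)
for _ in range(100):
    dgc_o_step(w, cost)
# w converges toward `target` (offset slightly because delta_w is finite).
```

Only forward evaluations of the cost appear in the loop; no error is propagated backward and no per-weight gradient needs to be stored, which is the memory advantage of DGC-O over DGC-S.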
However, both DGC variants need more time than BP to train an NN on real hardware, because the forward-path computation in DGC must be repeated many more times than in BP. Hence, DGC can be considered an algorithm that enables on-chip training with a lower area requirement and realizes on-chip training of complex NNs (verified in Section 3.3) at the expense of training time. Because the training mechanism of DGC differs completely from that of BP, actual hardware of the complete circuitry would be necessary to compare the exact training speed and power consumption of each method. However, a complete circuit for neuromorphic systems with analog synaptic devices that utilize BP has not yet been developed, so an exact comparison of training speed and power consumption between DGC and BP is currently impossible. Instead, a qualitative comparison of the methods regarding circuit and memory requirements and training speed is shown in Table 1 using big-O notation. Because the neurons are placed at one side of the synapse array, the number of neurons is O(N^(1/2)).

Hardware Implementation of DGC
In a neuromorphic system with analog synaptic devices, a single weight requires a differential pair of synaptic devices with conductances G⁺ and G⁻ to represent both positive and negative weights. [33] In Figure 2a, the odd and even rows of each output (V_i^l) represent G⁺ and G⁻, respectively. To implement DGC, we propose a conventional n-channel FG-MOSFET operating in the ohmic region as a synaptic device. The conductance of this device is represented as follows

G = K (V_GS − V_TH)    (3)

where K is the device parameter, V_GS is the external gate voltage, and V_TH is the threshold voltage. The hardware weight corresponding to the software weight in this architecture is as follows

W = (R_2/R_1) R_ref1 (G⁺ − G⁻)    (4)

where R_1, R_2, and R_ref1 are shown in Figure 2a. Each neuron circuit consists of three operational amplifiers, U1, U2, and U3, where U1 summates the currents from the G⁻ devices. Note that R_ref2 should be sufficiently small to ensure that the output of U1 is unsaturated. Subsequently, U2 summates the currents from the G⁺ devices and subtracts the output of U1 from the sum. The neuron circuit exhibits a hard tanh activation function due to the supply-voltage limits of U2. Finally, U3 reduces the voltage by a ratio of R_2/R_1. Because the reduced voltage is the input of the subsequent layer, it should be sufficiently small for the FG-MOSFETs to operate in the ohmic region. During the updating phase, the weight values of the synaptic devices are updated through a permanent change in V_TH. This is achieved by applying a program (or erase) pulse to the FG-MOSFETs with G⁺ or G⁻ conductance while maintaining V_GS at a constant bias voltage during the forward path. [28] The program (or erase) pulse stores or removes charge in the floating gate, resulting in a permanent change in V_TH.
Conversely, δW should be applied temporarily during the forward path in DGC. However, several problems arise if δW is realized by changing V_TH. First, applying a program (or erase) pulse to the device requires a considerable amount of power. Second, repeated program/erase pulses degrade the endurance of the device. Finally, the weight value should return to its original value after the gradient calculation, but restoring the original V_TH value is challenging due to the asymmetric characteristics of V_TH modulation. [34] These problems can be avoided by changing V_GS instead of V_TH, which is a more efficient way of applying δW. The conductance G depends symmetrically on V_GS and V_TH, as shown in Equation (3). Thus, changing V_GS is equivalent to changing V_TH by the same amount (in terms of the change in G), as shown in Figure 2b by ② and ①, respectively.
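The interchangeability of a gate-voltage offset and an equal-and-opposite threshold shift follows directly from the linear form of Equation (3), as a quick numerical check illustrates. The parameter value K and the bias voltages below are arbitrary placeholders, not device values from the article:

```python
# G = K * (V_GS - V_TH): FG-MOSFET conductance in the ohmic region, Eq. (3).
def conductance(v_gs, v_th, k=1e-4):
    return k * (v_gs - v_th)

v_gs, v_th, delta = 1.2, 0.4, 0.05  # illustrative bias values only

# Temporarily raising V_GS by delta changes G exactly as much as
# permanently lowering V_TH by delta would (paths 2 and 1 in Figure 2b).
g_gate_shift = conductance(v_gs + delta, v_th)
g_vth_shift = conductance(v_gs, v_th - delta)
assert abs(g_gate_shift - g_vth_shift) < 1e-15
```

This is why δW can be applied through the low-power, easily reversible gate bias rather than through a program/erase pulse.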
Changing V_GS is also considerably faster than changing V_TH during the gradient calculation. Applying a small change to a weight value can be implemented by simply adding a small voltage (δV) to the read bias (V_r) at the gate terminal of the target synaptic devices, marked as the thick red terminals in Figure 2a. The changes in V_GS and V_TH (corresponding to a single weight) and the resulting change in the cost value are shown in Figure 2c. Here, the other weight values remain unchanged while the single weight is updated. For convenience, it is assumed that the G⁻ value of the weight is fixed and only the G⁺ value is updated. In this case, a program pulse with a voltage of V_PGM is applied to the gate terminal of the synaptic device to increase V_TH and decrease the weight value, because the cost value increases when a positive δW is applied in the forward phase. The update quantity of V_TH is proportional to δC/δW. Conversely, if the cost value decreases with a positive δW, an erase pulse is applied to the synaptic device to decrease V_TH and increase the weight value. Figure 2d presents the schematic of the cost value perturbation detector and the weight update circuit. The initial cost value (C) is stored in C1, whereas the perturbed cost value (C + δC) due to the applied δW is stored in C2 by controlling the switches. The voltage subtractor calculates the cost value change (δC), which is converted to an update pulse through a pulse-width modulation (PWM) circuit. [35] As mentioned earlier, DGC-S requires a large number of memory devices to temporarily store the weight update values, whereas DGC-O does not. Note that a full hardware implementation of DGC requires significantly fewer devices than BP; in particular, DGC-O requires considerably fewer memory devices. Furthermore, DGC is more suitable for mini-batch learning.
In BP, every neuron requires as many memory devices as the mini-batch size. In contrast, DGC requires only one additional capacitor (C3) for mini-batch learning, which stores the accumulated cost value change over the training data.

Results and Discussion
We designed a fully connected NN (FCNN) of size 64-64-10 and a convolutional NN (CNN) for the Modified National Institute of Standards and Technology (MNIST) database classification task. A Python simulation was conducted to verify and analyze the proposed method. The training dataset for one epoch consisted of 60 000 training images, and the accuracy was evaluated as the ratio of correctly classified images out of 10 000 test images. The cross-entropy function was adopted as the cost function, and the networks were trained for 50 epochs with a mini-batch size of 100 using the mini-batch gradient descent method. Gradient descent methods with momentum, such as RMSProp or Adam, [36,37] were not used, as they require additional memory and circuitry in hardware. Furthermore, optimization techniques such as dropout [38] were not used due to their complexity when implemented with analog circuits. Optimal learning rates were obtained for each network utilizing each method, and time-based decay was adopted as the learning rate schedule. Each network had the same optimal learning rate for all methods, so the same learning rate was used for training. To reflect the limits of the synaptic conductances and voltage values in hardware, the absolute values of the weights were constrained to be at most W_max = 1, and the hard tanh function was adopted as the activation function in every layer.

Trainability Validation
Simulations were conducted to train FCNNs using three different training methods: BP, DGC-S, and DGC-O. Each simulation used an identical initial weight value set for all three methods and was repeated five times with different initial weight value sets. The training curves of BP, DGC-S, and DGC-O are presented in Figure 3. The accuracies and cost values at each epoch were averaged over the five runs and are indicated by symbols, with their deviations indicated by error bars. The different initial weight value sets cause the network to converge to different local minima, leading to deviations in accuracy. The average accuracies of DGC-S and DGC-O (95.05% and 95.06%, respectively) were similar to that of BP (95.03%). DGC-S has almost the same convergence speed as BP in terms of epochs, because they have identical update sequences. However, in the early epochs, DGC-O achieves a higher accuracy (lower cost) than BP, that is, faster convergence. This is mainly attributed to the frequent updates in DGC-O: each weight is updated against a cost already lowered by the preceding updates, resulting in a faster convergence speed than that of BP and DGC-S. Figure 4a shows the weight matrices of the first fully connected layer in the 50-epoch-trained networks for each method, obtained under one of the five simulation conditions with different sets of initial weight values. All three networks started with the same set of initial weight values and eventually converged to similar accuracies but different weight values. BP and DGC-S converged to similar but slightly different weight values, because δW was set to 0.01, which is not infinitesimally small. However, because the weight update process of DGC-O differs from that of BP and DGC-S, the weight values at which DGC-O converged differed significantly from those of BP and DGC-S.
The true positive rate (TPR), positive predictive value (PPV), true negative rate (TNR), and modified confusion entropy (MCEN) were also measured for these three FCNNs using the 10 000 test images. [39] Figure 4b,c shows that the TPR and PPV values of BP and DGC-S differed from those of DGC-O for some labels due to convergence at different weight values. However, as shown in Figure 4d, all three methods exhibit similar TNRs. Furthermore, as shown in Figure 4e, the three methods exhibit similar MCENs; because MCEN considers the degree of confusion between classes, it indicates the overall performance of a classifier and thus also verifies that the three methods perform equally.
The results above verify the feasibility of training with DGC using the two different updating processes. However, as described in Section 2.2, DGC-O has the advantage of a lower memory requirement than DGC-S. Because this article focuses on the hardware implementation of DGC, we adopt the updating process of DGC-O for the general DGC method; from this point onward, DGC refers to DGC-O.

Accuracy Dependency on the WCR
We define the weight change ratio (WCR) as

WCR ≡ δW / W_max    (5)

Figure 5 shows the accuracy of the FCNN after 50 epochs as a function of the WCR in DGC. For a WCR greater than 0.1, the accuracy decreases significantly as the WCR increases. For a WCR less than 0.1, the accuracy of the network saturates as the WCR decreases, because, referring to Equation (2), the measured gradient value converges to the ideal value as δW decreases. However, in practical hardware implementations, extremely small values of δW would be indistinguishable from external noise, and the cost value change (δC) could also become intractable due to the resolution limit of the cost value perturbation detector circuit. Therefore, we consider 0.01 a reasonable WCR for practical applications of DGC, and the default WCR is set to 0.01 for the remaining simulations in this article.
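The saturation below WCR ≈ 0.1 mirrors the behavior of a forward-difference gradient estimate, which can be illustrated on a simple differentiable cost. The quadratic below is a toy stand-in of our own choosing, not the network's cross-entropy cost:

```python
# Forward-difference gradient error versus perturbation size delta_w,
# for a cost C(w) = w**2 with known gradient dC/dw = 2w; at w = 0.5
# the true gradient is 1.0.
cost = lambda w: w ** 2
w, true_grad = 0.5, 1.0

errors = {}
for delta_w in (0.5, 0.1, 0.01, 0.001):   # WCR values for W_max = 1
    est = (cost(w + delta_w) - cost(w)) / delta_w   # Eq. (2) estimate
    errors[delta_w] = abs(est - true_grad)

# For this cost the estimate is 2w + delta_w, so the error equals
# delta_w: a large WCR gives a strongly biased gradient, while a small
# delta_w approaches the true value.
```

In hardware, this shrinking bias eventually competes with noise and detector resolution, which is why an intermediate value such as 0.01 is a practical choice.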

DGC for CNNs
To verify the performance of DGC in deeper and more complex networks, we designed a CNN structured as a variant of LeNet-5, [40] as shown in Figure 6a. All convolutional layers had a stride of 1, and zero padding was not used. Here, we also propose a hybrid method that uses both BP and DGC. In a CNN trained with BP or DGC alone, all the weight gradients in the convolutional and fully connected layers are obtained by that single method. In the hybrid method, by contrast, the gradients of the weights in the convolutional layers are obtained by DGC, whereas the gradients of the weights in the fully connected layer are obtained by BP. During the updating phase of one image set, the weights of the convolutional layers are sequentially updated by DGC, and then the weights in the fully connected layer are simultaneously updated by BP, as shown in Figure 6b.
Three networks with the same structure were trained using BP, DGC, and the hybrid method. The simulation was conducted for five different initial weight value sets, and the cost values and accuracies were averaged as in Section 3.1. As observed in Figure 6c, the accuracies of DGC and the hybrid method are 98.04% and 97.98%, respectively, which are similar to the accuracy of BP, 98.00%. Notably, DGC enables on-chip training in convolutional layers. This is remarkable because, to the best of our knowledge, an on-chip training method suitable for the convolutional layers of CNNs has been absent. The FCNN in Section 3.1 achieves an accuracy of 95% using 4810 total weights, whereas the CNN achieves approximately 3%p higher accuracy using 2380 total weights. Thus, a CNN that adopts DGC and is capable of on-chip training can achieve high accuracy in edge devices that require small area occupancy. Furthermore, the hybrid method efficiently combines BP and DGC to enable on-chip training of CNNs while enhancing the training speed compared with standalone DGC. In the hybrid method, only the convolutional layers, which are intractable in hardware that utilizes BP, are trained by DGC.
Especially for CNN architectures with a small proportion of weights in the convolutional layers and a large proportion of weights in the fully connected layers (such as VGGNet), [41] the hybrid method is expected to reduce the training time significantly, because the training time required by DGC is proportional to the number of weights. The hybrid method is also expected to be an efficient method of training a network in a reasonable time for large images (such as the CIFAR-10/100 dataset), providing a practical solution for on-chip training of hardware-based CNNs.
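The hybrid update sequence (sequential DGC on the convolution-side weights, then one simultaneous BP update on the fully connected weights) can be sketched on a drastically simplified stand-in network. The elementwise "conv-like" layer, the data, and all names below are our own illustrative constructions, not the article's LeNet-5 variant:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a CNN: "conv" weights w_c feed a fully connected weight
# matrix w_f. Hybrid training: DGC for w_c, analytic BP for w_f.
w_c = rng.normal(scale=0.1, size=4)        # treated as hard to backprop
w_f = rng.normal(scale=0.1, size=(4, 1))   # fully connected layer
x = rng.normal(size=(16, 4))
y = (x @ np.ones((4, 1)) > 0).astype(float)

def forward(w_c, w_f, x):
    h = np.tanh(x * w_c)        # elementwise "conv-like" layer
    return h, h @ w_f           # linear fully connected output

def cost(w_c, w_f):
    _, out = forward(w_c, w_f, x)
    return float(0.5 * np.mean((out - y) ** 2))

eta, delta_w = 0.2, 0.01
c_start = cost(w_c, w_f)
for _ in range(200):
    # 1) DGC-O on the conv-like weights, one by one.
    for i in range(len(w_c)):
        c0 = cost(w_c, w_f)
        w_c[i] += delta_w                        # temporal change dW
        grad = (cost(w_c, w_f) - c0) / delta_w   # measured gradient
        w_c[i] -= delta_w + eta * grad           # restore, then update
    # 2) One simultaneous BP update on the fully connected weights.
    h, out = forward(w_c, w_f, x)
    w_f -= eta * (h.T @ ((out - y) / len(x)))
c_end = cost(w_c, w_f)
```

The BP step touches only the final layer, so its circuitry would stay simple, while the cost-perturbation loop handles the layer for which a backward path would be awkward to build.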

Hardware Variation Tolerance
In hardware-based systems, the physical parameters of a device, such as its width, length, thickness, doping concentration, and mobility, can deviate randomly from the ideal target values. [42] Hence, the variations in G, R_ref1, and the voltage scaling gain described in Section 2.3 are modeled as a random normal distribution N(1, σ²). The standard deviation (σ) of the distribution was varied from 0 to 0.4 in increments of 0.1, and each method was simulated five times for each σ value. Figure 7 shows the dependency of the accuracy on hardware variations in 50-epoch-trained FCNNs utilizing BP and DGC. As the random variations increase, DGC exhibits a smaller performance degradation than BP. For σ = 0.4, the accuracies of the networks utilizing BP and DGC decrease by 2.41%p and 1.08%p, respectively. In BP, hardware variations affect the circuits in both the forward and backward paths, whereas in DGC, the variations affect only the circuit in the forward path. As a result, DGC is approximately twice as tolerant to hardware variations as BP.
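The variation model amounts to multiplying each nominal parameter by a factor drawn from N(1, σ²). The following toy linear classifier is our own stand-in for the FCNN (only the forward-path parameters are perturbed, as in DGC); it merely illustrates the simulation procedure, not the reported accuracy numbers:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy linear classifier with "ideal" trained weights; device variation
# is modeled as a multiplicative factor from N(1, sigma^2) per device.
w_ideal = np.array([2.0, -1.5, 0.5])
x = rng.normal(size=(1000, 3))
labels = (x @ w_ideal > 0)   # ground truth defined by the ideal weights

def accuracy(w):
    return float(np.mean((x @ w > 0) == labels))

for sigma in (0.0, 0.1, 0.2, 0.4):
    runs = []
    for _ in range(5):                       # five runs per sigma value
        variation = rng.normal(1.0, sigma, size=w_ideal.shape)
        runs.append(accuracy(w_ideal * variation))
    mean_acc = np.mean(runs)                 # accuracy under variation
```

A BP simulation would additionally apply such factors to the backward-path elements, which is the source of its larger degradation.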

Conclusion
We have proposed DGC, a novel on-chip training method for NNs that is simple and tolerant of variations in the circuit implementation. Prior research on on-chip training, which adapts to a nonideal circuit environment during training, has mostly utilized BP as the training algorithm. However, BP requires an additional backward path, which increases the circuit area and memory requirements and complicates the circuit design when implemented in hardware. Furthermore, the accuracy degradation of hardware BP due to large variations in hardware remains to be solved. DGC, which can substitute for conventional BP, is a method for obtaining the gradient values of the weights in hardware-based NNs. Unlike BP, DGC uses the cost value change corresponding to a weight value change in the forward path to obtain the gradient values; hence, it does not require a backward path. The validity of DGC, especially with the weights updated one by one, was verified on the MNIST database classification example. The accuracies of DGC in the FCNN and CNN, 95.06% and 98.04%, are similar to those of BP, 95.03% and 98.00%, respectively. On-chip training of hardware CNNs is considerably difficult due to the complexity of the BP circuit design, and utilizing DGC is therefore an appropriate approach. In addition, DGC is more suitable for mini-batch learning than BP. Moreover, DGC shows greater tolerance to hardware variations in the FCNN than BP: for a normal variation with a standard deviation of 0.4, the accuracy of DGC decreases by 1.08%p, whereas that of BP decreases by 2.41%p.
The main issue related to DGC is the relatively slow training speed. If this issue can be resolved, DGC is expected to be an efficient on-chip training method for various types of NNs. As one of the solutions, a hybrid method that uses both DGC and BP for on-chip training CNNs can effectively reduce the training time compared with that of standalone DGC. Due to the simplicity of BP in fully connected layers compared with that of convolutional layers, BP circuitry of the hybrid method is significantly simpler than that of standalone BP. Thus, the hybrid method can efficiently make up for the shortcomings of DGC and BP. As CNNs have superior performance in various fields compared with that of other NNs, the results in this article enable neuromorphic systems that can achieve software-comparable performance with less chip area and power consumption.