Spin‐Transfer‐Torque Magnetic Tunnel Junction Nonlinear In‐Sensor Computing Synapse for Improving the Performance of the Feedforward Neural Network

In-sensor computing architecture has a great advantage, especially in massive data sampling, transfer, and processing, compared with separated intelligent sensor systems. However, most in-sensor computing devices are proposed based on the traditional neural network model, in which the synapse performs a linear multiplication of input and weight. This approach fails to exploit the intrinsic nonlinearity of in-sensor computing devices. Therefore, in this article, a modified feedforward neural network model with a nonlinear in-sensor computing synapse (NSCS) located at the input layer is first presented, and the backpropagation (BP) algorithm is modified to train the network. Then, the nonlinear characteristics of the NSCS, composed of spin-transfer-torque magnetic tunnel junction (STT-MTJ) devices and a simple complementary metal-oxide-semiconductor (CMOS) circuit, are analyzed. Based on the nonlinear response of the STT-MTJ NSCS, a small-scale network with NSCS synapses is evaluated on the Modified National Institute of Standards and Technology (MNIST) dataset and compared with a traditional network of the same size. The simulation results show that better performance can be achieved with the STT-MTJ NSCS, including a 2–15 times improvement in convergence speed and a 2.5%–5.1% increase in accuracy.


Introduction
However, in traditional intelligent sensing systems, the sensor, analog processing front end, analog-to-digital converter, digital signal processor, and memory unit are physically isolated from each other.[3] The redundant data migration among sensor, processor, and memory inevitably leads to meaningless time delay and energy cost.[2,4] This data-transportation bottleneck between the sensor and the processor can be termed the acquisition wall. Therefore, the traditional architecture cannot meet the time and power requirements of the big-data era. In-sensor computing, an emerging direction[2-6] that implements signal processing inside the sensor, has been regarded as one of the most promising architectures to break the acquisition wall, since it fundamentally reduces redundant data processing and movement, thereby improving sampling efficiency and reducing delay and energy consumption.[7] The core foundation of in-sensor computing is the construction of neural networks using artificial synapses and neurons. Research has attempted to implement in-sensor computing synapses for artificial neural networks (ANNs) using various new types of devices, such as optoelectronic,[8,9] ferroelectric,[10] memristor,[11] and 2D-material devices,[12] as well as spintronic devices.[5,13] In these works, most of the proposed in-sensor computing synapses mainly perform multiplication,[4,14] while the nonlinearity of hardware synapses is viewed as a defect. The principal reason is that synapse nonlinearity degrades accuracy under the traditional network and training algorithm.[15] Therefore, an additional linearization of the synapse is introduced, either by narrowing the conductance window or by compensating the nonlinearity with extra hardware circuit design.[3] These approaches introduce new problems, including a low on/off ratio, complex circuit design, and larger synapse size. By contrast, neurobiology studies have proved that the biological synapse has a complex nonlinear response.[16,17] Hence, it would be more effective to exploit this intrinsic nonlinearity, thus releasing the potential of analog computing, and to alter the traditional network architecture and the corresponding training algorithm accordingly.
Actually, the nonlinear response of biological synapses has attracted much attention, and previous work has shown the advantages of ANNs with nonlinear synapses. As early as 1992, Jerzy B. Lont et al. used a simple three-transistor nonlinear synapse to replace linear multipliers in order to maximize density.[18] In 1994, they further constructed a large-scale (26 000 weights and 110 neurons) neural network with nonlinear synapses.[19] Encouragingly, their results showed that the convergence speed of the nonlinear network was faster than that of the network with linear multipliers. Later, in 2003, Momchil Milev et al. implemented a simple analog nonlinear synapse model[20] that avoids weight oscillation around the optimum solution by exploiting the synapse's inherent quadratic nonlinearity. Furthermore, they also demonstrated that the classification success rate of the new network with nonlinear synapses was better in many instances. It is thus clear that the advantages of nonlinear synapses are instructive. However, their work was based on complementary metal-oxide-semiconductor (CMOS) devices, which cannot perceive external physical fields unless an additional sensor is introduced.
Meanwhile, to mimic the nonlinear interaction among the inputs to the dendrites, Yuki Todo et al. proposed a nonlinear dendritic neural model (DNM) in 2019.[21] Benefiting from the rich nonlinearity of the DNM, a single DNM suffices to solve linearly non-separable problems. However, this model is relatively complex to use, especially in large-scale networks, and it remains conceptual, without consideration of physical devices. Inspired by the previous work, it would be quite interesting and practical to implement a nonlinear in-sensor computing synapse (NSCS) that combines the merits of the in-sensor computing architecture and the nonlinear synapse by using a device's inherent physical response. As a fundamental device of spintronics, the magnetic tunnel junction (MTJ) with the spin-transfer torque (STT) effect is rich in nonlinear dynamics, providing a new approach to nonlinear synapses built from physical devices. However, so far, researchers have paid little attention to this possibility.
In this article, we first briefly contrast the traditional McCulloch-Pitts (M-P) neuron model with a modified neuron model containing a nonlinear synapse. Then, based on these two models, we propose a new feedforward neural network architecture with the nonlinear synapse located at the input layer. After that, a modified backpropagation (BP) algorithm adapted to this architecture is demonstrated. Following this, the basic characteristics of the STT-MTJ device and the nonlinear characteristics of the NSCS based on the STT-MTJ/CMOS circuit are presented. Finally, the performance of the network based on the STT-MTJ/CMOS circuits is compared with that of the traditional network on the same Modified National Institute of Standards and Technology (MNIST) dataset. The simulation results show that the convergence speed of the network with STT-MTJ/CMOS circuits is largely increased compared with that of the traditional linear synapse.

The Modified Neuron Model and Feedforward Neural Network
The M-P neuron is the most fundamental unit of an ANN,[22] and it can be divided into two parts, as shown in Figure 1a. The first part (in green) performs the weighted summation of the input x_i and the synapse weight w_ij. The second part (in blue) is the threshold activation function, which introduces nonlinearity. Therefore, the output of the M-P computation model can be expressed as

$$ y_j = f\Big(\sum_{i=1}^{n} w_{ij}\,x_i + b_j\Big) \qquad (1) $$

where b_j is the bias value, f(·) is the activation function of the neuron node, and the subscripts i = 1, 2, …, n and j = 1, 2, …, m index the input and output nodes, respectively. By contrast, the biological synapse produces a nonlinear response to its input.[16] Hence, we propose a new type of modified neuron model with nonlinear synapses, as shown in Figure 1b, where the NSCSs are marked as triangles (▴). The nonlinear activation function f(·) is the same as in the traditional M-P model, but the response g(·) of the nonlinear synapse depends on the intrinsic physical characteristics of the NSCS. Compared with the traditional M-P model, the output of this modified neuron can be expressed as

$$ y_j = f\Big(\sum_{i=1}^{n} z_{ij} + b_j\Big), \qquad z_{ij} = g(x_i, w_{ij}) \qquad (2) $$

where z represents the output of the nonlinear synapse and the indices i and j are the same as in Equation (1). As can be seen from Equation (2), the nonlinear in-sensor computing unit works as a synapse that introduces more complicated nonlinearity; this can also be interpreted as combining the preprocessing of the input data with the multiplication of input and weight. Subsequently, to sense and process external physical information effectively, we propose a new two-layer feedforward network architecture with the NSCS as the input-layer synapse, shown in Figure 2b. As a comparison, the traditional feedforward network is shown in Figure 2a. Both networks are composed of an input layer, a hidden layer, and an output layer; the only difference between them is the synapse between the input layer and the hidden layer. The input layer of the traditional network first receives normalized data, whereas the network with the NSCS (▴) at the input can sense the external information directly and implement the nonlinear transformation at the same time. Therefore, the feedforward process of the two-layer network with the NSCS starts with Equation (2) and then follows the computation

$$ Y_o = f_o\Big(\sum_{j=1}^{m} Z_{jo} + B_o\Big), \qquad Z_{jo} = W_{jo}\,y_j \qquad (3) $$

where Z represents the synapse calculation result, the index o = 1, 2, …, p labels the output neurons, B_o is the bias value, and f_o(·) is the activation function of the output neuron.
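To make the distinction concrete, the following minimal NumPy sketch (a hedged illustration, not the authors' code) contrasts the two forward passes; the response g is a placeholder for the device characteristic discussed later.

```python
import numpy as np

def mp_forward(x, w, b, f):
    """Traditional M-P layer: weighted sum of inputs plus bias, then
    activation (Equation (1)). x: (n,), w: (n, m), b: (m,)."""
    return f(x @ w + b)

def nscs_forward(x, w, b, f, g):
    """Modified layer: every synapse first applies the nonlinear device
    response z_ij = g(x_i, w_ij); the neuron then sums z over its inputs
    before bias and activation (Equation (2))."""
    z = g(x[:, None], w)            # elementwise synapse outputs, shape (n, m)
    return f(z.sum(axis=0) + b)     # per-neuron summation, bias, activation

# Sanity check: with g(x, w) = x * w the modified layer reduces to the M-P layer.
relu = lambda v: np.maximum(0.0, v)
x, w, b = np.ones(4), np.full((4, 3), 0.5), np.zeros(3)
assert np.allclose(mp_forward(x, w, b, relu),
                   nscs_forward(x, w, b, relu, lambda xi, wij: xi * wij))
```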

The Modified BP Learning Algorithm
The proposed network architecture with the NSCS is modified from the traditional feedforward neural network. Thus, the BP algorithm can be used to train the network as long as both the activation function f(·) and the input/output (I/O) response g(·) of the NSCS are differentiable. The activation function chosen for the hidden layer is the rectified linear unit (ReLU), that is, f(z) = max(0, z), and the SoftMax function, $f_o(z_i) = e^{z_i} / \sum_j e^{z_j}$, is used at the output layer to transform the raw outputs into probabilities. As for the response g(·) of the NSCS, it depends strongly on the physical characteristics of the device, which will be discussed in detail later in the article. Further, to reduce the risk of being stuck in local minima, the mini-batch gradient descent algorithm is used in this work.
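For reference, the two stated activation functions in NumPy (a straightforward sketch):

```python
import numpy as np

def relu(z):
    """Hidden-layer activation: f(z) = max(0, z)."""
    return np.maximum(0.0, z)

def softmax(z):
    """Output-layer activation: f_o(z_i) = e^{z_i} / sum_j e^{z_j}.
    The max is subtracted for numerical stability; the result is unchanged."""
    e = np.exp(z - np.max(z))
    return e / e.sum()
```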
The detailed training process flow can be seen in Section S1, Supporting Information. At each epoch, the samples in a mini-batch are presented to the NSCS synapses one by one, and the output values Y_o are calculated following Equations (2) and (3). All output responses in the mini-batch are compared with the desired output Y_label, to which one-hot encoding is applied. The average batch error is evaluated with the cross-entropy loss function

$$ L^{b} = -\frac{1}{N_b}\sum_{s=1}^{N_b}\sum_{o=1}^{p} Y_{\mathrm{label},o}^{(s)}\,\ln Y_{o}^{(s)} \qquad (4) $$

where the superscript b indicates the batch number and N_b is the batch size. Based on the error in Equation (4), the synaptic parameters are adjusted by gradient descent,

$$ W_{jo} \leftarrow W_{jo} - \eta_L \frac{\partial L}{\partial W_{jo}}, \qquad I_{ij} \leftarrow I_{ij} - \eta_N \frac{\partial L}{\partial I_{ij}} $$

where η_L and η_N represent the learning rates for the linear synapse and the NSCS, respectively. The gradient of the typical synapse is calculated directly with the chain rule; similarly, the gradient of the NSCS is obtained by propagating the error through the nonlinear response g(·), whose derivative with respect to the weight current (Equation (18)) is determined by the response characteristic of the NSCS and will be discussed later. Thus, the weights of the linear synapse and the NSCS are adjusted following Equations (2)-(18) to minimize the loss L until the number of training epochs reaches the set value. For details about the training algorithm, see Section S1, Supporting Information.
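Below is a schematic sketch of the NSCS weight update implied by this chain rule. It is a hedged illustration under two assumptions: `dg_dI` stands in for the device derivative of Equation (18), and `delta_hidden` is assumed to already contain the error back-propagated through the output layer and the ReLU derivative.

```python
import numpy as np

def update_nscs_weights(H, I, delta_hidden, eta_N, dg_dI):
    """One chain-rule gradient step for the NSCS bias currents (the weights).

    H:            (n,) magnetic-field inputs seen by the NSCS synapses
    I:            (n, m) bias currents acting as weights
    delta_hidden: (m,) dL/d(pre-activation) of the hidden neurons, assumed
                  to already include the ReLU derivative
    dg_dI:        callable returning dV/dI at each (H_i, I_ij) pair
    """
    # Since y_j = f(sum_i z_ij + b_j) with z_ij = g(H_i, I_ij),
    # dL/dI_ij = delta_j * dz_ij/dI_ij, where dz/dI comes from the device
    # response (Equation (18)).
    grad_I = dg_dI(H[:, None], I) * delta_hidden[None, :]
    return I - eta_N * grad_I
```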

NSCS Based on the STT-MTJ Device
Spintronic devices are regarded as potential candidates for implementing in-sensor computing owing to their unique combination of multi-physics-field sensing and nonlinear characteristics.[5,23,24] STT-MTJ devices that can be kept in steady oscillation are referred to as STT oscillators (STOs). The oscillation frequency and output amplitude of an STO device can be regulated by both the DC bias current and an external DC magnetic field. Therefore, the STO device has the potential to perform in-sensor computing, with the external magnetic field and the bias current working as the input and weight, respectively.

The STO device used in this work has a core magnetic stack consisting of CoFe/Ru/CoFeB/MgO/CoFeB. The magnetic stacks were fabricated into pillar-shaped devices with nominal dimensions of 75 × 75 nm. The top view of the STO device is shown in Figure 3a. A GSG 40 GHz probe was used to measure the microwave emission characteristics of the STO devices; the diagram of the measurement system is shown in Figure 3b. The positive current is defined as the current flowing from the free layer to the reference layer, and a positive external magnetic field (H_ext) stabilizes the parallel (P) state. The DC current is injected into the sample through a bias tee; the microwave signal is transmitted through the AC port of the bias tee and then collected by the oscilloscope. The measurement was performed at room temperature. The resistance of the STO devices as a function of the in-plane external magnetic field (along the easy axis of the device) was obtained at a bias current I_bias = 10 μA. The resistance curve shown in Figure 3c reveals a tunneling magnetoresistance (TMR) ratio of 80%. Figure 3d shows the microwave voltage spectra under I_bias = −0.2 mA and H_ext = 0 mT; the inset is the microwave voltage in the time domain measured by the oscilloscope. As can be seen from Figure 3d, a single oscillation peak is observed, with a fundamental frequency of 0.59 GHz. The magnetic sensing characteristics of the STO device were simulated based on our previous work.[25] See Section S2, Supporting Information, for details about the simulation parameters. With the external magnetic field applied perpendicular to the device, the corresponding change of the oscillation frequency with the external field is presented in Figure 3e. A frequency shift of 1.8 GHz is exhibited over a field change of 72 mT, resulting in a frequency sensitivity of 25 MHz mT⁻¹.

The Character of STT-MTJ NSCS
To analyze the performance of the network and the algorithms described in Section 2, a nonlinear in-sensor computing unit that performs a nonlinear multiplication of the input magnetic field and the weight current is proposed. The unit mainly consists of the STT-MTJ device, a current source, a bias tee, a rectifier diode, and an RC integrator circuit, as shown in Figure 4a. The STT-MTJ device generates sustained oscillation through the STT effect when an appropriate external magnetic field and bias current are applied. The AC voltage passing through the capacitor is then transferred to the subsequent detection circuit, and finally the corresponding DC voltage is obtained. Details about the proposed in-sensor computing unit can be found in Section S3, Supporting Information.
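As a rough behavioral picture of this detection chain (not a circuit-level model; the waveform, amplitude, and time constant below are illustrative assumptions), the rectifier clips the STO microwave voltage and the RC stage extracts its DC level:

```python
import numpy as np

def detect_dc(v_ac, dt, tau_rc=1e-6):
    """Behavioral model of the detection chain: an ideal half-wave rectifier
    followed by a first-order RC low-pass that settles to the DC level."""
    v_rect = np.maximum(v_ac, 0.0)    # ideal rectification
    alpha = dt / (tau_rc + dt)        # discrete first-order low-pass coefficient
    v_out = 0.0
    for v in v_rect:
        v_out += alpha * (v - v_out)
    return v_out

# Example: a 0.59 GHz oscillation (the measured fundamental frequency).
t = np.arange(0.0, 5e-6, 1e-10)
v_dc = detect_dc(0.01 * np.sin(2 * np.pi * 0.59e9 * t), dt=1e-10)
```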
Using the MTJ/CMOS simulation method proposed previously,[26] the output DC voltage V of this in-sensor computing unit, regulated by the bias current I and the external magnetic field H, can be obtained. The black dots in Figure 4b show the output voltage of the proposed NSCS unit under different values of I and H. Evidently, the output voltage V increases monotonically with I and H. In this way, the function of the nonlinear synapse can be realized by the in-sensor computing unit, with H, I, and V working as the input, weight, and output, respectively.
In addition, to use this in-sensor computing circuit as an NSCS for the training of neural networks, the I/O response must be differentiable. Therefore, the response was fitted with a sigmoid-like function; the fitting coefficients a, b, c, and d were −4.20 × 10², −1.23 × 10⁻², 8.49 × 10⁻¹, and −3.68 × 10⁻², respectively, with a corresponding R² of 0.99. It should be pointed out that this fit applies only within the simulated range.
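The exact fitted expression is not reproduced above, so the sketch below uses a hypothetical sigmoid-like form, chosen only so that dV/dI depends on H alone while dV/dH depends on both H and I, matching the trends reported in the Results; both the functional form and the way the coefficients enter it are illustrative assumptions, not the paper's fit.

```python
import numpy as np

# Hypothetical stand-in for the fitted sigmoid-like response V = g(H, I).
# It only mimics the qualitative behavior of the reported fit: monotonic
# in H and I within the simulated range, and differentiable everywhere.
def nscs_response(H, I, a, b, c, d):
    s = 1.0 / (1.0 + np.exp(-(a + b * H)))   # field-controlled sigmoid term
    return (c + d * I) * s                   # current scales the amplitude

def dV_dI(H, I, a, b, c, d):
    # Depends only on H, consistent with Figure 5c.
    return d / (1.0 + np.exp(-(a + b * H)))

def dV_dH(H, I, a, b, c, d):
    # Depends on both H and I, consistent with Figure 5d.
    s = 1.0 / (1.0 + np.exp(-(a + b * H)))
    return (c + d * I) * b * s * (1.0 - s)
```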
Furthermore, because the network with the NSCS located at the front is limited by the device's physical properties, the input data x_i (working in the range [−1, 1]) need to be linearly transformed to the practical magnetic field range at the beginning, i.e.,

$$ H_i = H_{\min} + \frac{x_i + 1}{2}\,(H_{\max} - H_{\min}) $$

where H_i represents the magnetic field corresponding to x_i, and H_min and H_max denote the minimum and maximum of the NSCS's sensitive field range. For our simulation, H_min and H_max are set to 2 and 72 mT, respectively. In addition, it needs to be pointed out that traditional synapses can be either positive or negative, causing excitatory or inhibitory effects on the corresponding output.[20] However, the weight value in this work, represented by the current I, is restricted to a particular range (from −0.10 to −0.35 mA) in order to sustain the continuous oscillation of the STT-MTJ device.[27] Moreover, the rectified voltage remains positive unless additional processing is added; therefore, only an excitatory response can be obtained in the current scheme. To address this issue, the voltage is mapped to the range −1 to 1 by the linear transformation

$$ V' = \frac{2\,(V - V_{\min})}{V_{\max} - V_{\min}} - 1 $$
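Both range mappings are simple linear transformations; a minimal sketch under the stated ranges (V_min and V_max are the output-voltage extremes of the simulated response, assumed known):

```python
H_MIN, H_MAX = 2.0, 72.0       # mT, sensitive field range of the NSCS
I_MIN, I_MAX = -0.35, -0.10    # mA, bias-current (weight) range

def input_to_field(x):
    """Map a normalized input x in [-1, 1] onto the field range [H_MIN, H_MAX]."""
    return H_MIN + (x + 1.0) / 2.0 * (H_MAX - H_MIN)

def voltage_to_bipolar(V, V_min, V_max):
    """Map the always-positive rectified voltage onto [-1, 1] so that both
    excitatory and inhibitory synaptic responses can be expressed."""
    return 2.0 * (V - V_min) / (V_max - V_min) - 1.0
```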

Results
To analyze the influence of the NSCS on neural network performance, the MNIST dataset is used to compare the performance of the two kinds of neural networks shown in Figure 2a,b. There are 60 000 training samples and 10 000 testing samples in this dataset. The structure and size of both networks are set to be the same. The input layer has 28 × 28 nodes, which send information to the hidden layer through traditional linear synapses and the newly proposed NSCSs, respectively. There are 30 nodes in the hidden layer, which in turn feed 10 output nodes. To evaluate the performance of the two networks, the BP algorithm is run with the same hyperparameters. The batch size is set to 100, and the learning rate for the linear synapse is (1 − (−1))/2 × 0.001. As the bias current represents the synapse weight, the learning rate of the NSCS is given as ((−0.10) − (−0.35)) mA/2 × 0.001; that is, the relative learning rate of the NSCS is consistent with that of the linear synapse.
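In other words, each learning rate is scaled by half the span of its parameter's allowed range, keeping the relative step sizes of the two synapse types consistent; spelled out:

```python
BASE_RATE = 0.001

# Half the span of each parameter range times the base rate.
eta_linear = (1.0 - (-1.0)) / 2.0 * BASE_RATE        # = 1e-3  (dimensionless weight)
eta_nscs = ((-0.10) - (-0.35)) / 2.0 * BASE_RATE     # = 1.25e-4 (in mA of bias current)
```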
As shown in Figure 5a,b, the testing accuracy and training error of the neural network with the traditional linear synapse are compared with those of the NSCS. The final accuracy of the network with the NSCS is slightly higher than that of the linear one, reaching 92.2% at 150 epochs, while the benchmark accuracy of the network with the typical synapse is 90.2%. In particular, the accuracy of the NSCS network increases from 28.0% to 87.0% within only 15 epochs, while that of the linear synapse grows only from 44.9% to 77.3% over the same epochs. These results indicate that the convergence speed is improved by using the NSCS. To further explain this result, the partial derivatives of the fitted I/O response are calculated. The resulting partial derivatives of the NSCS with respect to I (dV/dI) and H (dV/dH) are shown in Figure 5c,d, respectively. It can be seen from Figure 5c that dV/dI is associated only with H, and |dV/dI| increases as H increases. Moreover, dV/dH depends on both H and I (Equation (23)). By contrast, the output y of the linear synapse is the product of the weight w and the input x, so the partial derivatives with respect to the weight (dy/dw = x) and the input (dy/dx = w) are both linear, as shown in Figure 5e,f. As indicated in Equations (11) and (18), the update rate of the weight w_ij is governed by the partial derivative with respect to the weight, that is, dV/dI for the NSCS and dy/dw for the typical linear synapse. Comparing the partial derivative of the NSCS with that of the linear synapse in Figure 5c,e, it is evident that the uneven variation of dV/dI induced by the nonlinear response of the NSCS greatly changes the training process and leads to the discrepancy in convergence speed.
Different network sizes are further evaluated to confirm this conclusion. For convenience, we use an array to represent the size of a network, taking [784, 5, 5, 10] as an example: the network has four layers, the first number indicates the number of input-layer nodes (784), the last number represents the number of output-layer nodes (10), and the middle numbers denote the sizes of the hidden layers (here, two hidden layers with five nodes each). Figure 6a-h shows the testing accuracy and training error of the different networks over the epochs. It is displayed clearly in Figure 6 that the network convergence speed is enhanced significantly across the different network sizes, as long as the NSCS is located at the front, and the final accuracy is also slightly improved. These results suggest that the benefits of the NSCS may be universal for feedforward neural networks. To quantify the network performance, we define the convergence speed as the epoch at which the accuracy first exceeds 80%. From the results in Figure 6, the accuracy and convergence speed of the different networks are obtained and listed in Table 1. As indicated in Table 1, the convergence speed is enhanced by 2 to 15 times, varying with network size. In addition, the final accuracy of all networks with the NSCS is increased by 2.5% to 5.1% compared with the traditional linear synapse of the same size. These simulation results fully confirm that the proposed NSCS improves the convergence speed effectively. Furthermore, with the NSCS located at the network front, sensing and nonlinear preprocessing are combined effectively. This new concept may greatly change the traditional intelligent-sensor information-processing method.
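The size-array convention maps directly to parameter arrays; a minimal sketch (the initialization scale is an arbitrary choice, not taken from the paper):

```python
import numpy as np

def build_network(sizes, seed=0):
    """Create weight matrices and bias vectors from a size array such as
    [784, 5, 5, 10]: 784 inputs, two hidden layers of 5 nodes, 10 outputs."""
    rng = np.random.default_rng(seed)
    weights = [rng.normal(0.0, 0.1, size=(n_in, n_out))
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(n_out) for n_out in sizes[1:]]
    return weights, biases

weights, biases = build_network([784, 5, 5, 10])
```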
To evaluate the generalization ability of the network with the NSCS, the neural network of size [784, 30, 10] is used as an example. First, the original dataset (Figure 7a) is used to train and test the network as a baseline. Figure 7b,c shows the training error and testing accuracy on the original dataset; the red curve indicates the result of the NSCS, while the blue one represents that of the linear synapse. As the epochs progress, the error gradually decreases and the accuracy increases for networks with both kinds of synapses. Finally, the accuracies of the networks with the linear synapse and the NSCS are 88.5% and 92.8% at 100 epochs, respectively. Second, uniformly distributed random noise is deliberately introduced into the testing dataset; the original data without noise and the data with a noise peak value of 0.1 are shown in Figure 7a. The network is trained on the original data and tested on the noisy data, with the simulation results shown in Figure 7d,e. On one hand, the training error of both networks decreases as the epochs increase. On the other hand, the network with the traditional linear synapse performs well on the noisy testing data, while the accuracy of the NSCS grows rapidly at first and then declines gradually to 51.5%. This result illustrates that the network with the traditional linear synapse has better generalization ability, and the introduction of the NSCS is responsible for the overfitting. The underlying reason can be explained from two aspects: first, the complexity of the neural network is increased by the nonlinear synapse; second, the convergence speed may be too fast for the network with the nonlinear synapse, so the network memorizes too many details of the training dataset but misses the true regularities. To prevent overfitting caused by the nonlinear synapse, several measures could be adopted, such as enlarging the training set, early stopping, weight decay, reducing the network capacity, and dropout. In addition, the influence of the weight-value limitation of the NSCS is discussed in Section S4, Supporting Information.
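A minimal sketch of the noise test follows; the pixel range [−1, 1] mirrors the input normalization stated earlier, and the clipping step is an assumption of this sketch.

```python
import numpy as np

def add_uniform_noise(images, peak=0.1, seed=0):
    """Perturb test images with uniformly distributed noise of the given
    peak value, then clip back to the assumed normalized range [-1, 1]."""
    rng = np.random.default_rng(seed)
    noisy = images + rng.uniform(-peak, peak, size=images.shape)
    return np.clip(noisy, -1.0, 1.0)
```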

Discussion and Conclusion
This work focuses on the simulation verification of the proposed architecture, but, from our point of view, the proposal should be reliable in a real network, because STT-MTJ devices have advantages in reliability and variability compared with other kinds of neuromorphic devices. First, the physical mechanisms of the device mainly comprise the TMR effect and the STT effect, whose theory is complete and clear. Second, the deposition technology of the magnetic stacks and the fabrication processes of STT-MTJ devices are mature; at present, large-scale preparation of STT-MTJ/CMOS circuits with high consistency has been realized in industry. In this article, a modified feedforward neural network model with the NSCS located at the input layer is proposed to improve feedforward neural network performance, and the BP algorithm is modified for the proposed architecture. To examine the performance of the proposed architecture, an NSCS consisting of STT-MTJ and CMOS devices is used. The simulation results on the MNIST dataset show an obvious improvement in convergence speed, which is attributed to the nonlinear response of the NSCS. In contrast to traditional feedforward neural networks, our proposed network can not only sense the external magnetic field directly but also further improve the network performance. This work is an initial attempt to realize and improve the traditional neural network by taking advantage of nonlinear in-sensor computing devices. Further work on the integrated design of in-sensor computing devices and hardware neural networks is certainly an interesting direction toward better intelligent sensor systems.

Figure 1. a) The McCulloch-Pitts neuron model for computation and b) the schematic of the neuron model with the NSCS.

Figure 2. The schematic of a) the traditional feedforward network and b) the new feedforward network architecture with the NSCS as the input-layer synapse. The notation w_ij represents the weight between input x_i and hidden neuron y_j, and W_jo is the weight connecting hidden neuron y_j and output neuron Y_o.

Figure 3. a) Top view of the spin-transfer nano-oscillator (STNO) device, b) diagram of the measurement system, c) resistance as a function of the in-plane magnetic field, and d) microwave spectra under I_bias = −0.2 mA and H_ext = 0 mT (inset: the transient microwave voltage). e) The STNO fundamental oscillation frequency as a function of H_ext (bias current I = −0.2 mA).

Figure 4. a) The schematic of the NSCS circuit; b) the output voltage of the circuit under different magnetic fields H (input) and currents I (weight). The black dots and colored surface indicate the simulated voltage and fitted result, respectively.


Figure 5. a) The testing accuracy and b) training error of the neural network with the traditional linear synapse (blue curve) and the NSCS (red curve). The partial derivatives of the NSCS: c) V with respect to I; d) V with respect to H. The partial derivatives of the linear synapse: e) y with respect to x; f) y with respect to w.

Table 1. The performance of feedforward neural networks with different sizes and different synapses.