Spike-Enabled Audio Learning in Multilevel Synaptic Memristor Array-Based Spiking Neural Network

Speech recognition requires learning audio signals, which are closely tied to sequences of events. Although speech recognition has been widely implemented in software neural networks, a hardware implementation based on an energy-efficient computing architecture is still missing. Herein, W/MgO/SiO2/Mo memristor arrays with multilevel resistance states are fabricated, where the weights of the artificial synapses in the memristor array can be tuned precisely by voltage pulses. Based on the array, speech recognition in memristive spiking neural networks (SNNs) with an improved supervised tempotron algorithm is conducted on the Texas Instruments digit sequences (TIDIGITS) dataset, demonstrating software-comparable accuracy for speech recognition in the memristive SNN. It is envisioned that such memristive SNNs can pave the way to building a bioinspired spike-based neuromorphic system for audio learning.


Introduction
Speech recognition is one of the key abilities that allow artificially intelligent machines to understand human speech and carry out assigned tasks through the voice interfaces of mobile devices, smart household appliances, and wearable electronics in our daily lives. The large volume of voice data in speech recognition tasks leads to long latency and large storage requirements in the existing von Neumann architecture built on conventional complementary metal oxide semiconductor (CMOS) platforms. At present, the speech recognition accuracy of artificial neural networks (ANNs), [1][2][3][4][5][6][7] especially deep neural networks, is approximately the same as that of commercial software, but the high power consumption of existing systems makes them unsuitable for edge devices.
In this work, we have fabricated W/MgO/SiO2/Mo memristors with nonvolatile, analog switching characteristics. The conductance of the memristor, which serves as the synaptic weight, can be tuned precisely across multiple levels. In addition, the W/MgO/SiO2/Mo memristor-based array is utilized for building a single-layer hardware spiking neural network (SNN) for speech recognition. Before the training process, the audio signals of the TIDIGITS dataset were processed with the maximum-margin tempotron algorithm, [34] and a 1D self-organizing map (SOM) network [35] was used for effective and sparse spatiotemporal feature extraction. Afterward, the acoustic features of the audio signals were further classified by the memristor array-based SNN with offline learning methods. A high speech recognition accuracy of 94% was achieved, which is equivalent to the recognition accuracy in software. We envision that the results can pave the way toward hardware SNNs based on memristors for energy-efficient speech recognition tasks.
Figure 1 shows the overall principle of biomimetic neuromorphic computing for audio signal recognition. In biological systems, the neural network involves specialized neurons that convert analog audio signals into corresponding spike trains. The spikes are subsequently processed by the central nervous system, as shown in Figure 1a. Similarly, memristors can mimic the dynamics of biological synapses and neurons, such as spike-timing-dependent plasticity (STDP). Figure 1b shows the corresponding neuromorphic hardware, which aims to recognize audio signals based on memristor crossbar arrays.

Results and Discussion
To achieve analog weights and allow learning, we have fabricated synaptic devices based on W/MgO/SiO2/Mo memristors in this work. Figure 2a shows a schematic diagram of the W/MgO/SiO2/Mo memristive device. To verify the analog switching behavior, we have measured the typical I–V characteristics of the device (Figure 2b). In the pristine state, the device is in a low-conductance state, whereas the on state can be obtained by applying a +1.85 V set voltage without a forming operation. In addition, a −1.85 V reset voltage is required to switch the memristor back to the off state. When a positive voltage is applied to the Mo top electrode, oxygen ions are driven toward the Mo interface, leading to oxygen vacancy-based conductive filaments in the MgO/SiO2 layer. A negative voltage applied to the top electrode leads to recombination of oxygen vacancies with ions in the MgO/SiO2 layer, so that the filament is gradually ruptured and the device is switched to the off state. The W/MgO/Mo device exhibited abrupt switching in our previous work, [36] revealing that the inserted layer of sputtered SiO2 plays a role in limiting the rate of the oxygen migration process. [37] To further elucidate the switching mechanism, we fabricated additional control devices with a structure of Mo/SiO2 (20 nm)/W and studied their characteristics. Typical current–voltage curves were measured with 1.45 V/−1.45 V write/erase voltages in 50 DC sweeps, where a forming process with a 0–3 V DC sweep was required first to initialize the resistance switching, as shown in Figure S1, Supporting Information. Based on the results on Mo/SiO2/W devices and our previous study on W/MgO/Mo devices, [36] both the MgO and SiO2 layers can lead to resistive switching behaviors; however, such single-layer devices showed unstable states and apparent variations in switching voltages.
The introduction and formation of oxide bilayers are crucial for stabilization of resistance switching and alleviation of the variation. Thus, gradual oxygen vacancy movement can be obtained in the MgO/SiO2 bilayer under an external electric field, leading to gradual resistive switching, which is desirable for synaptic devices.
We have experimentally measured the retention of the high resistance state (HRS) and low resistance state (LRS) of the Mo/SiO2/MgO/W device with a 0.2 V read voltage over 10^4 s at room temperature, as shown in Figure 2c. In addition, we have measured the endurance characteristics under (1.2 V/−1.2 V, 1 ms) write/erase voltage pulses over 10^4 consecutive cycles at room temperature, as shown in Figure 2d. The devices exhibit excellent retention and endurance characteristics.
To evaluate the multilevel performance of the memristor, successive DC sweeps with gradually increased voltages were applied in the set and reset processes with a 0.05 V/−0.05 V step. As shown in Figure 2e, multiple states are achieved stably in the set process by controlling the positive sweeping voltage. Similarly, multiple states can be obtained in the reset process by controlling the negative sweeping voltage (inset of Figure 2e). To verify the tunable conductance of the memristor, long-term potentiation (LTP) and long-term depression (LTD) were investigated in the W/MgO/SiO2/Mo memristive devices using pulse stimulation. Consecutive voltage pulses are applied to the device, and Figure 2f shows the potentiation and depression characteristics of the device under 50 consecutive positive pulses (1.2 V) and 50 consecutive negative pulses (−1.25 V), with pulse widths of 1 μs (red), 10 μs (purple), 100 μs (blue), and 1 ms (green), showing well-controlled analog states. Moreover, we have measured the cumulative distribution of the resistance states of the device. It can be seen from Figure S2, Supporting Information that the device exhibits four clearly separable resistance states.
We have fabricated a 10 × 10 memristor array based on the W/MgO/SiO2/Mo structure, as shown in Figure 3a. Vector–matrix multiplication can be conducted in the memristor array through input voltage pulses in a fast, highly parallel, spike-based manner. The conductance of each cell at a crosspoint can be updated by applying programming voltages from the row and column at the same time. As the device performance in the memristor array determines the capability of the memristive neural network, we have characterized the cycle-to-cycle (C2C) and device-to-device (D2D) performance of the memristors in the array.
It can be seen from Figure 3c,d that the W/MgO/SiO2/Mo memristor array shows ultralow C2C and D2D variations, which are highly desirable for building hardware SNNs.
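The vector–matrix multiplication performed by the array can be illustrated with a minimal sketch of an ideal crossbar, in which each column current is the sum of the row voltages weighted by the cell conductances (Ohm's and Kirchhoff's laws). The function name `crossbar_vmm`, the conductance range, and the choice of active rows are illustrative assumptions and are not taken from the measurements; wire resistance and device variation are ignored.

```python
import numpy as np

def crossbar_vmm(voltages, conductances):
    """Column currents of an ideal crossbar: I_j = sum_i V_i * G_ij."""
    voltages = np.asarray(voltages, dtype=float)
    conductances = np.asarray(conductances, dtype=float)
    return voltages @ conductances

# Hypothetical 10 x 10 conductance matrix in siemens (values are assumptions).
rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(10, 10))

# Input spikes on three rows, modeled as 0.2 V pulses (assumed amplitude).
v_in = np.zeros(10)
v_in[[1, 4, 7]] = 0.2

i_out = crossbar_vmm(v_in, G)  # one current per column, read out in parallel
```

All ten column currents are obtained in a single read step, which is the source of the parallelism mentioned above.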
To implement the task of speech signal recognition in SNNs, analog audio signals have to be encoded into discrete spikes.
In addition, features of the raw audio signals should be extracted and transformed into spike sequences as inputs for the neural network. Here, we use Mel-frequency cepstral coefficients (MFCC) [38] for feature extraction from speech signals, which is one of the most common feature extraction methods in machine learning. [39] MFCC is a cepstral parameter extracted in the frequency domain of the Mel scale, which describes the nonlinear frequency response of the human ear. The logarithmic relation between the Mel frequency scale and the actual frequency is more in line with human acoustic characteristics, so that the speech signal can be better represented. Figure 4a shows the raw sound signals, which were preprocessed by the following steps: 1) pre-emphasis to amplify high-frequency components; 2) segmentation of continuous sound signals into overlapped frames of suitable length to better capture the temporal variations of the sound signal; and 3) application of the Hamming window function to these frames to reduce the effect of spectral leakage. A short-time Fourier transform (STFT) was then applied to the windowed frames, and the power spectrum was computed to mimic the human auditory front end. Next, 20 logarithmic Mel-scaled filters were applied to the resultant power spectra to generate a compressed feature representation of the acoustic characteristics of each sound frame. As shown in Figure 4c,d, log filterbank energies and Mel-frequency cepstra were obtained when the waveform of a speech signal (Figure 4b) was subjected to MFCC processing.
Thereafter, the latency code [40] was applied to convert the spectral energy of each frequency into spikes, as shown in Figure 5a. As a result, the speech information was encoded precisely in the spike timing. In our experiment, we found that the raw-encoded spike train patterns contain too many spikes, beyond the learning ability of the adopted SNN.
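The front end described above (pre-emphasis, framing, Hamming window, power spectrum, 20 Mel filters, logarithm, latency code) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the sampling rate, frame length, frame step, FFT size, pre-emphasis coefficient, and the linear energy-to-latency mapping (higher energy fires earlier) are all assumed values, and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def log_mel_energies(signal, sample_rate=16000, frame_len=0.025,
                     frame_step=0.010, n_fft=512, n_filters=20, alpha=0.97):
    """Log Mel filterbank energies; assumes the signal spans at least one frame."""
    # 1) Pre-emphasis amplifies high-frequency components.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # 2) Segmentation into overlapping frames.
    flen = int(round(frame_len * sample_rate))
    fstep = int(round(frame_step * sample_rate))
    n_frames = 1 + (len(emphasized) - flen) // fstep
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    # 3) Hamming window to reduce spectral leakage.
    frames = emphasized[idx] * np.hamming(flen)
    # Power spectrum of each frame (short-time Fourier transform).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Mel filtering followed by the logarithm.
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    return np.log(energies + 1e-10)

def latency_encode(log_energies, t_window=1.0):
    """Assumed latency code: higher energy maps to an earlier spike time."""
    lo, hi = log_energies.min(), log_energies.max()
    norm = (log_energies - lo) / (hi - lo + 1e-12)
    return t_window * (1.0 - norm)
```

Each row of the result corresponds to one sound frame and each column to one Mel channel, so `latency_encode(log_mel_energies(x))` yields one spike time per channel per frame.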
To solve this problem, a SOM was applied for feature extraction from the latency-encoded filter bank output vectors. By applying the SOM, the spike train patterns become sparser, and the extracted features facilitate the maximum-margin tempotron temporal learning. The change of the weights of the SOM synapses is shown in movie 1 in Supporting Information. To study the impact of the number of SOM neurons, we gradually increased the number of neurons during training. Speech recognition accuracy improves rapidly with more SOM neurons and reaches about 94.0% on the testing dataset with about 256 neurons, as shown in Figure 5b. Moreover, we conducted experiments with different numbers of activated output neurons in the SOM for each sound frame. With more activated output neurons, the SNN achieves lower classification accuracy, as shown in Figure S4b of Supporting Information. The results show clearly that the number of activated output neurons in the SOM has a large impact on accuracy.
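A 1D SOM of the kind used for this sparsification step can be sketched as follows. The sketch is a generic Kohonen map under assumed settings: the linearly decaying learning rate, the Gaussian neighborhood, and the names `train_som_1d` and `som_activate` are illustrative choices, not details taken from the paper. The `n_active` argument mirrors the number of activated output neurons per sound frame discussed above.

```python
import numpy as np

def train_som_1d(data, n_neurons=256, epochs=5, lr0=0.5, sigma0=None, seed=0):
    """Train a 1D self-organizing map; returns weights of shape (n_neurons, dim)."""
    rng = np.random.default_rng(seed)
    sigma0 = n_neurons / 4 if sigma0 is None else sigma0
    weights = rng.uniform(data.min(), data.max(), size=(n_neurons, data.shape[1]))
    positions = np.arange(n_neurons)  # neuron coordinates on the 1D map
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                  # decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighborhood
            # Best-matching unit: neuron whose weight vector is closest to x.
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Gaussian neighborhood around the BMU on the 1D map.
            h = np.exp(-((positions - bmu) ** 2) / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights

def som_activate(weights, x, n_active=1):
    """Indices of the n_active neurons closest to x (the ones that emit spikes)."""
    dists = np.linalg.norm(weights - x, axis=1)
    return np.argsort(dists)[:n_active]
```

With `n_active=1`, each frame activates a single SOM neuron, so a spike train over many frequency channels is reduced to one spike per frame on the winning neuron's input line.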
To classify spatiotemporal spike patterns using the maximum-margin tempotron algorithm, the features learned by the SOM (i.e., the spatiotemporal output of the SOM) were input into a single-layer fully connected neural network. The method combines the tempotron [41] with the maximum-margin classifier, [42] and is suitable for multiclass classification tasks owing to the one-against-all strategy. During training, we trained one output neuron for each individual class. The tempotron learning rule is a biologically plausible, membrane-potential-based supervised learning rule that uses stochastic gradient descent (SGD). The weight update method can be summarized as follows.
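For reference, the standard tempotron rule can be written in the following form. This is the usual textbook formulation, assumed here rather than taken from the authors' own equations; the symbol t_max, the time at which the membrane potential of the output neuron reaches its maximum, is introduced only for this sketch.

```latex
\Delta w_i =
\begin{cases}
+\lambda \sum\limits_{t_i < t_{\max}} K(t_{\max} - t_i), & \text{the desired neuron fails to fire},\\[4pt]
-\lambda \sum\limits_{t_i < t_{\max}} K(t_{\max} - t_i), & \text{a wrong neuron fires},\\[4pt]
0, & \text{otherwise},
\end{cases}
\qquad
K(t - t_i) = K_0\left[\exp\!\left(-\frac{t - t_i}{\tau_m}\right) - \exp\!\left(-\frac{t - t_i}{\tau_s}\right)\right], \quad t > t_i
```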
where λ denotes a constant learning rate and K_0 is a normalization factor that ensures that the peak amplitude of the kernel K(t − t_i) is 1. τ_m and τ_s correspond to the decay time constants of the membrane integration and synaptic currents, respectively, which together determine the shape of the kernel function. K(t − t_i) is a causal filter that considers only spikes before time t. Furthermore, we used a maximum margin Δ to realize an adaptive firing threshold voltage of the neuron during the training phase. Specifically, for the desired output neuron, the threshold voltage is increased by an amount, while it is decreased by the same amount for wrong output neurons. As a result, the firing probability of the neuron can be adjusted by the adaptive threshold. In this work, we investigated the impact of Δ on the classification accuracy using the TIDIGITS dataset by increasing Δ from 0 to 1 in steps of 0.1. We found that Δ improves the classification accuracy of the tempotron learning rule, as shown in Figure S4a in Supporting Information, and the best accuracy is achieved with a Δ value of 0.5.
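The role of the kernel, the learning rate, and the margin can be made concrete with a short sketch of one tempotron update for a single output neuron. The double-exponential kernel and the update at the time of maximum membrane potential follow the standard tempotron formulation, which is assumed here; the time constants, the additive form of the margin, the threshold value, and all function names are illustrative assumptions.

```python
import numpy as np

def make_kernel(tau_m=20.0, tau_s=5.0):
    """Causal double-exponential kernel, normalized so that its peak is 1."""
    t_peak = tau_m * tau_s / (tau_m - tau_s) * np.log(tau_m / tau_s)
    k0 = 1.0 / (np.exp(-t_peak / tau_m) - np.exp(-t_peak / tau_s))

    def kernel(dt):
        dt = np.asarray(dt, dtype=float)
        causal = dt > 0  # only spikes before time t contribute
        safe = np.where(causal, dt, 0.0)
        value = k0 * (np.exp(-safe / tau_m) - np.exp(-safe / tau_s))
        return np.where(causal, value, 0.0)

    return kernel, t_peak

def membrane_potential(weights, spike_times, t_grid, kernel):
    """V(t) = sum_i w_i * sum over spikes of K(t - t_i); one spike list per synapse."""
    v = np.zeros_like(t_grid, dtype=float)
    for w, times in zip(weights, spike_times):
        for t_i in times:
            v += w * kernel(t_grid - t_i)
    return v

def tempotron_update(weights, spike_times, t_grid, kernel, should_fire,
                     threshold=1.0, margin=0.0, lr=0.01):
    """One error-driven update with an assumed additive margin on the threshold."""
    # Harder threshold for the desired neuron, easier one for wrong neurons.
    theta = threshold + margin if should_fire else threshold - margin
    v = membrane_potential(weights, spike_times, t_grid, kernel)
    fired = v.max() >= theta
    if fired == should_fire:
        return weights.copy()  # no error, no update
    t_max = t_grid[np.argmax(v)]
    sign = 1.0 if should_fire else -1.0  # potentiate a miss, depress a false alarm
    delta = np.array([kernel(t_max - np.asarray(times, dtype=float)).sum()
                      for times in spike_times])
    return weights + sign * lr * delta
```

In a one-against-all setting, this update would be applied to each output neuron in turn, with `should_fire=True` only for the neuron that represents the label of the input pattern.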
Here, we use a memristor crossbar to implement the improved tempotron algorithm for classification in the SNN, using the experimental, normalized LTP and LTD data in Figure 5d. The simulation process is shown in Figure S6a, Supporting Information. In the training phase, we utilized the one-against-all strategy to train one output neuron for each class. In other words, output neuron i represents the ith class and triggers a weight update whenever an error occurs. In forward passes, the spike trains are fed into the memristor crossbar and converted to weighted currents. The neuron then integrates the currents and changes its membrane potential, generating a spike when the potential reaches the firing threshold. If the output neurons do not fire, the threshold voltage of the desired output neuron is decreased by an amount, whereas it is increased by the same amount for wrong output neurons. If an output neuron fires, it is checked against the label to determine whether it is the desired output neuron. If the fired output neuron matches the corresponding label, the classification is correct; otherwise, it is incorrect. In the latter case, the weights are updated during backward propagation. Figure S6b, Supporting Information shows the flow chart of the training process for supervised learning in the SNN. When the desired output neuron fails to fire in response to the input pattern, LTP is triggered. Similarly, LTD is triggered when a wrong output neuron fires. For convenience, the weight W_ij of each synapse was scaled to [−1, 1], and the initial value was chosen randomly. The conductance modulation curves of the devices in response to the applied pulse number, including LTP and LTD, were first fitted by the following equations.
Although we adopted the improved tempotron algorithm with SGD to obtain updated weight values, the corresponding pulse number is hard to obtain from Equation (3). To address the nonlinearity of the memristors, which complicates hardware implementation, the fitting of the conductance modulation curves is replaced by a piecewise linear approximation [Equation (4)], where P_i refers to the number of input pulses applied to the synapse. During training, the pulse numbers applied to modify the weights were determined according to Equation (4). The actual weights of the devices after updating were then calculated by Equation (3). These steps were iterated until the end of training. After training, the distribution of learned synaptic weights is shown in Figure 5e. The details of the weight distribution during the training process are shown in movie 2 in Supporting Information. Thereafter, the test set of the TIDIGITS dataset was applied to the SNN for inference. A high classification accuracy of about 94% was obtained after 10 epochs, which is comparable with results implemented in software, as shown in Figure 5f. Notably, the SNN-based speech recognition system in the present work shows significant advantages in simplified structure, fewer iterations, and high energy efficiency compared with other systems based on ANNs.
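The two-step update described above, in which a piecewise linear approximation gives the pulse number and the nonlinear curve gives the resulting device weight, can be sketched as follows for the LTP branch. The exponential curve `ltp_curve`, its nonlinearity parameter, and the segment boundaries in `KNOTS` are hypothetical stand-ins for the fitted Equations (3) and (4), which are not reproduced here; only the 50-pulse range and the [−1, 1] weight scaling come from the text.

```python
import numpy as np

P_MAX = 50    # pulses in one potentiation ramp (50 pulses, as in the measurement)
A_NL = 15.0   # hypothetical nonlinearity parameter of the assumed curve
KNOTS = np.array([0, 10, 25, 50])  # hypothetical segment boundaries

def ltp_curve(p):
    """Assumed nonlinear LTP curve: weight in [-1, 1] after p pulses."""
    p = np.clip(np.asarray(p, dtype=float), 0, P_MAX)
    return -1.0 + 2.0 * (1.0 - np.exp(-p / A_NL)) / (1.0 - np.exp(-P_MAX / A_NL))

def segment_slope(p):
    """Weight change per pulse on the linear segment that contains pulse index p."""
    k = np.clip(np.searchsorted(KNOTS, p, side="right") - 1, 0, len(KNOTS) - 2)
    return (ltp_curve(KNOTS[k + 1]) - ltp_curve(KNOTS[k])) / (KNOTS[k + 1] - KNOTS[k])

def pulses_for_update(p_now, delta_w):
    """Pulse number for a desired weight increase, from the linear approximation."""
    n = int(np.rint(delta_w / segment_slope(p_now)))
    return int(np.clip(n, 0, P_MAX - p_now))

def apply_ltp(p_now, delta_w):
    """Apply the pulses, then read the actual weight back from the nonlinear curve."""
    p_new = p_now + pulses_for_update(p_now, delta_w)
    return p_new, float(ltp_curve(p_new))
```

The LTD branch would be handled in the same way with a depression curve and negative weight changes. Because the pulse number is rounded and the curve is nonlinear, the weight actually reached differs slightly from the requested one, which is why the device weight is recomputed from the curve after every update.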

Conclusion
In this work, we experimentally demonstrated a hardware implementation of an SNN using W/MgO/SiO2/Mo memristive devices as synapses for speech recognition, adopting the improved supervised tempotron algorithm on the TIDIGITS dataset. The SOM network was used for feature extraction, which is an essential operation to achieve high performance and simplify the SNN classifier. The SNN can be successfully trained on the audio data in software and conduct inference in hardware with high classification accuracy, which represents a promising direction for building neuromorphic spiking systems for audio learning.

Experimental Section
Device Fabrication: W/MgO/SiO2/Mo devices were fabricated on SiO2 substrates. A tungsten layer (45 nm) was deposited by sputtering, followed by a lift-off process, to form the bottom electrode. Then, magnesium oxide (10 nm) and silicon oxide (20 nm) switching layers were sputtered at room temperature. After the second photolithography process, the molybdenum layer (45 nm) was deposited by sputtering, and the patterns (100 × 100 μm^2) were formed with a lift-off process, forming the 10 × 10 memristor array with W/MgO/SiO2/Mo devices.
Electrical Measurements: The surface morphology of the devices was characterized using scanning electron microscopy (SEM, JEOL 7800F). Electrical measurements were carried out using an Agilent B1500A semiconductor parameter analyzer in the cleanroom at room temperature, with electrical signals applied to the Mo electrode.

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.