Energy‐Efficient Memristive Euclidean Distance Engine for Brain‐Inspired Competitive Learning

Inspired by the competitive rules of nature, competitive learning contributes to the specialization of the human brain and the general creativity of mankind. However, the construction of hardware competitive learning neural networks still faces great challenges due to the lack of an accurate distance computation method and a self‐adaptive in situ training scheme. Herein, a fully memristive Euclidean distance (ED) engine based on analog multiply‐accumulate operation in a 32 × 32 TiN/TaOx/HfOx/TiN 1T1R array is demonstrated. The dual‐layer devices perform multilevel modulation under the target‐aware programming method with excellent read linearity in a dynamic range of 10–100 μS. The ED calculation is verified experimentally on a test board with an O(1) temporal complexity. Furthermore, in situ training and offline inference schemes for competitive learning, based on the ED engine, are developed, and the simulated results show success rates comparable with those obtained by CPU‐based software. Compared with a state‐of‐the‐art RTX6000 GPU (0.5 TOPS W−1), the energy efficiency of competitive learning models on ED engines can improve by more than 100× when optimized memristive devices are used.


Introduction
Competitive rules are widely observed in nature and dictate the evolution of organisms and cells via natural selection. [1] In the brain, a competition mechanism exists among neurons wherein synaptic connections with high spiking frequencies and strong inputs are retained and strengthened, while the connections with low frequencies and weak inputs are pruned or decayed [2][3][4] (Figure 1a). Inspired by the information processing mechanisms of the biological brain, competitive learning neural networks (CLNNs) have received widespread attention. [5][6][7][8][9] Their corresponding network structure is shown in Figure 1b. [10] As a traditional artificial neural network (ANN) model, the CLNN is used to discover patterns in the distribution of data mainly through unsupervised learning, based on similarity measurements between input samples and weight vectors. [11][12][13][14] In the era of artificial intelligence and the Internet-of-Things (AIoT), similarity measurements are commonly accepted in machine learning algorithms and are extensively used in recommendation systems, pattern recognition, data queries, and other applications. [15][16][17][18][19] Euclidean distance (ED) is one of the most well-established methods for similarity measurements and has been frequently applied for image recognition, natural language processing, data mining, wireless localization, and other applications. [20][21][22][23][24] It is used to quantitatively represent the distance between two vectors in Euclidean space, as shown in Figure 1c. The absolute value of the ED calculation can be used to determine the degree of similarity, and the sample with the smallest ED value is considered to yield the best match.
With the dramatic increase in data dimensions in the AIoT era, ED-based applications have encountered huge challenges on resource-constrained edge computing platforms due to serious bottlenecks in computing power and efficiency. [25] Memristive in-memory computing has emerged as a promising solution for energy-efficient non-von Neumann computing paradigms. [26][27][28] The frequent vector-matrix multiplication (VMM) operations in ANNs, such as multilayer perceptrons and convolutional neural networks, have been accelerated considerably owing to in-memory computing, in which multiply-accumulate (MAC) operations can be executed in a single step using Ohm's and Kirchhoff's laws in a memristive crossbar array. [29][30][31][32][33][34] Therefore, for CLNNs, building an ED engine that utilizes a single-clock VMM operation in memristor arrays (Figure 1d) is essential to overcome traditional computational limitations. Currently, building a functional, fully memristive ED engine is challenging, although there are initiatives to implement ED calculations on memristor arrays. [35][36][37][38] Specifically, the remaining issues are mainly the lack of 1) a complete expression for ED calculation on a hardware platform, 2) an efficient in situ training scheme for CLNNs with hardware ED engines, and 3) versatility across different ED-based algorithms.
In this study, a fully memristive ED engine with high hardware computational efficiency and flexibility was demonstrated for the first time. ED calculations for data with five dimensions were implemented on a TiN/TaOx/HfOx/TiN 1T1R array to verify the reliability of the ED engine. The favorable analog behavior and the excellent dynamic-range read linearity of the memristor cells ensure accurate data mapping as well as precise analog computing results. By utilizing the memristive ED engine, CLNNs were demonstrated in prototype clustering tasks. With in situ training and optional offline inference schemes, the clustering task based on the memristive ED engine yielded results for the IRIS and the breast cancer datasets equivalent to those obtained with full-precision software. The memristive ED engine provides a robust and general solution for the hardware implementation of competitive learning, which completes ED calculations within constant time and features fully hardware-based online weight updating.

Principle of Memristive Euclidean Distance Engine
Mathematically, the ED of two data vectors S and w is calculated using Equation (1)

$$D(S, w) = |S - w|^2 = S^2 - 2S \cdot w + w^2, \qquad S = [s_1, s_2, \ldots, s_m], \quad w = [w_1, w_2, \ldots, w_m] \tag{1}$$

where S and w are the m-dimensional vectors in Euclidean space. In CLNNs, S represents the set of sample vectors to be classified, and w represents the weight vector of the networks. From Equation (1), the ED calculation contains the dot product term (−2S·w) and two non-negligible squared terms (S², w²). Some specific competitive learning algorithms, such as self-organizing maps [35] and K-means, [36] have been implemented on the memristor crossbar array by ignoring one or two squared terms.

Figure 1. a) The neuron with the heaviest weight connection outputs a spike with a higher intensity and suppresses the other channels. b) Structure of competitive learning model with WTA as learning rule in which the distance (similarity) calculation plays the key role for the competitive process. c) Illustration of ED in Euclidean space, which is essential for competitive learning. A point in an N-dimensional Euclidean space is represented by an N-dimensional vector. d) Single-step vector-matrix multiplication operation on the memristor array owing to Ohm's and Kirchhoff's laws.
www.advancedsciencenews.com www.advintellsyst.com Sheridan et al. [37] showed that the ED calculation can be simplified to a dot product of vectors for pattern-matching tasks. The intensive MAC operations were then accelerated by memristor arrays by mapping weights to memristor conductance. However, when the vectors are not normalized, the dot product term no longer expresses the true distance between vectors accurately. Subsequently, Jeong et al. [36] experimentally demonstrated a scheme for the direct comparison of Euclidean distances without normalizing the weights on a memristor crossbar. The squared term of each input sample vector is ignored in this scheme because it is a constant for all weight vectors. Thus, a direct, single-step comparison of the ED can be implemented on a memristor array by adding an extra row on the array to store the squared term of the weight vectors.
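The difference between the three comparison strategies discussed above can be checked numerically. The following is a minimal sketch in plain Python, not code from the cited works: the vectors are illustrative values chosen here, and the helper names (`full_ed`, `reduced_score`, `argmin`, `argmax`) are hypothetical. It shows that dropping the sample's squared term shifts every distance by the same constant, so the winner is unchanged, whereas a bare dot product can select the wrong winner when the weight vectors are not normalized.

```python
def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def full_ed(s, w):
    """Full expression of Equation (1): S^2 - 2 S.w + w^2."""
    return dot(s, s) - 2 * dot(s, w) + dot(w, w)


def reduced_score(s, w):
    """Score with the sample's squared term dropped: w^2 - 2 S.w."""
    return dot(w, w) - 2 * dot(s, w)


def argmin(values):
    return min(range(len(values)), key=values.__getitem__)


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Illustrative, non-normalized vectors (assumed values, not from the paper).
sample = [0.2, 0.9, 0.4]
weights = [[0.1, 0.8, 0.5], [0.9, 0.1, 0.3], [1.0, 1.0, 1.0]]

full = [full_ed(sample, w) for w in weights]
reduced = [reduced_score(sample, w) for w in weights]
dots = [dot(sample, w) for w in weights]

winner_full = argmin(full)        # smallest true distance
winner_reduced = argmin(reduced)  # same ranking, S^2 omitted
winner_dot = argmax(dots)         # largest dot product, biased by |w|
offsets = [f - r for f, r in zip(full, reduced)]  # each equals S.S
```

In this example the long vector `[1.0, 1.0, 1.0]` wins the dot-product comparison although it is not the closest one, while the reduced score agrees with the full ED. The reduced score, however, no longer carries the absolute distance.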
Notably, this improved ED calculation solution can only perform the forward inference of CLNN tasks, whereas online updating, which enables a fully hardware implementation and self-adaptation of the network, has hardly been achieved on memristors. In addition, this scheme is only suitable for applications that require a comparison of relative ED values, not absolute ED values. Therefore, it is of great importance to compute the full expression of ED in memristor arrays and to implement online weight updating. Herein, we address these problems by devising a fully memristive ED engine. In the traditional CPU architecture, the basic operation for ED calculation involves a serial subtractor, multiplier, adder, and accumulator, which can be replaced by the memristive ED engine in a single-clock step (Figure 2a). First, Equation (1) can be rewritten as Equation (2)

$$D(S, w) = S^2 - 2S \cdot w + w^2 = S \cdot (S - w) + (-w) \cdot (S - w) \tag{2}$$

By taking (S − w) as a whole, the ED calculation can be converted into a sum of two dot product terms. Accordingly, based on the principle of Equation (2), a fully memristive ED engine was designed, as shown in Figure 2b. The item (S − w) was mapped to the differential conductance rows in the memristor array. The elements of the two vectors were mapped as the memristive conductance in the two rows. The dimension of the vectors determined the number of columns of the array. The subtraction of the two vectors was then achieved by the differential conductance pairs, which is a common method used to implement negative weights in various memristive neural networks. [39][40][41]

To achieve the summation of the two dot product terms in Equation (2), the two horizontal (S − w) conductance kernels are iteratively programmed, and the other vectors, namely, S and −w, are mapped as input voltage signals (encoded as the number of voltage pulses with a fixed amplitude, or as single pulses of varying amplitudes to denote different vector elements) and multiplied by each of the two conductance kernels. Therefore, similar to distributed kernels, [34] absolute ED calculations can be completely expressed on memristor arrays to leverage the parallelism of the VMM operations and achieve a single-step operation. The time complexity of the ED calculation is lowered from O(n) for CPU calculations to O(1) for our designed memristive ED engine. That is, the feedforward process of CLNNs that utilize the ED calculation can be accelerated directly on the memristor arrays. Moreover, the fully calculated EDs can find a wide spectrum of applications, such as kernel function calculations and wireless localization. [42][43][44] Furthermore, the designed ED engine was also fitted to the online updating rule of competitive learning tasks. In CLNNs based on the learning rule of Winner Takes All (WTA), only one winning neuron updates its weight while the other neurons remain unchanged, following Equation (3)

$$\Delta v_w = \alpha(t)\,(v_{In} - v_w) \tag{3}$$

where α(t) denotes the learning factor of the network, and v_In and v_w indicate the input and the weight vectors that correspond to the input pattern and the winning neuron used for the update, respectively. In general, the vectors in the sample set (S) are used as the input pattern, and the weight vectors (w) are used as the weight map for CLNNs. Therefore, as shown in Figure 2c, a backward online update can also be performed on the designed memristive ED engine. The learning factor α(t) was encoded with a fixed voltage amplitude, which was input to the rows storing S and w with positive and negative pulses, respectively. This reverse VMM operation maps exactly onto the calculation of Equation (3).
Therefore, in the designed memristive ED engine, the complete expression of ED calculation was associated with high computational parallelism and low time complexity, and an online update for CLNNs also became possible.
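The two-kernel evaluation of Equation (2) can be illustrated with a small behavioral model. This is a minimal sketch under stated assumptions rather than the authors' implementation: vector elements are assumed to lie in [0, 1], the linear value-to-conductance and value-to-voltage mappings and all function names are hypothetical, and the devices are ideal (no noise or quantization). The 10–100 μS window and the 0.4 V ceiling are the figures quoted in the text.

```python
G_MIN, G_MAX = 10e-6, 100e-6  # siemens; conductance window quoted in the text
V_MAX = 0.4                   # volts; input-voltage ceiling quoted in the text


def to_conductance(x):
    """Assumed linear mapping of values in [0, 1] onto the conductance window."""
    return [G_MIN + xi * (G_MAX - G_MIN) for xi in x]


def to_voltage(x):
    """Assumed linear amplitude encoding of values in [0, 1]."""
    return [xi * V_MAX for xi in x]


def row_current(voltages, conductances):
    """Ohm's law per cell, Kirchhoff's current law along the output line."""
    return sum(v * g for v, g in zip(voltages, conductances))


def ed_engine(s, w):
    """Evaluate S.(S - w) + (-w).(S - w) with differential conductance rows."""
    g_s, g_w = to_conductance(s), to_conductance(w)
    v_s, v_w = to_voltage(s), to_voltage(w)
    # Kernel 1: S applied as voltages; differential current ~ S.(S - w).
    i1 = row_current(v_s, g_s) - row_current(v_s, g_w)
    # Kernel 2: w applied as positive voltages; the sign is handled by the
    # external subtraction, giving -w.(S - w).
    i2 = row_current(v_w, g_s) - row_current(v_w, g_w)
    scale = V_MAX * (G_MAX - G_MIN)  # converts amperes back to value units
    return (i1 - i2) / scale


s = [0.1, 0.5, 0.9, 0.3, 0.7]  # illustrative five-dimensional vectors
w = [0.2, 0.4, 0.6, 0.8, 1.0]
distance = ed_engine(s, w)
reference = sum((a - b) ** 2 for a, b in zip(s, w))
```

Because the offset G_MIN appears in both rows of a differential pair, it cancels in the subtraction, which is why the result depends only on S − w.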

Experimental Demonstration of a Memristive ED Engine
To experimentally verify the feasibility of the designed memristive ED engine, a field-programmable gate array (FPGA) test board containing a 32 × 32 1T1R TiN/TaOx/HfOx/TiN memristor array was used; the packaged 1T1R array is shown in Figure 3a. The optical image of the test board is shown in Figure S1, Supporting Information. The 1T1R cell was formed by growing a TiN/TaOx/HfOx/TiN memristor on the drain side of an N-type metal-oxide-semiconductor transistor, and the details of fabrication are given in the Experimental Section. Typical bipolar resistive switching characteristics of a 1T1R memristor cell are shown in Figure 3b. Utilizing the transistor as a selector and a current limiter, 1T1R arrays are immune to the sneak path issue. The SET operation applied positive voltages to the gate and drain, while RESET applied positive voltages to the gate and source. Notably, negative drain or source voltages are usually not available for 1T1R operations owing to the limits of the transistors. Subject to this constraint, the implementation of the designed memristive ED engine on a 1T1R array is shown in Figure 3c. Differential pairs of S and w were stored in two separate kernels, and the positive voltage signals with different amplitudes, encoded from S and w, were then input to each of the two kernels. Finally, the two differential currents were passed through a subtractor, and the output current was proportional to the actual calculated ED value of the array. Our 1T1R cells have demonstrated a continuously tunable conductance, as shown in Figure 3d, which indicates the analog computing capability. The conductive filament of the memristor cells is enhanced or weakened under SET or RESET pulses, respectively, thus resulting in cycling potentiation or depression behaviors. Taking the target matrix shown in Figure 3e as an example, the ED calculation was verified on the 1T1R array. The mapping lists for memristive conductance and input voltage amplitudes are shown in Table SI, Supporting Information. The vectors were encoded as voltage amplitudes from 0 to 0.4 V, and a linear I-V relationship was observed on memristors of different conductance (Figure 3f), yielding accurate readout results.

Figure 3. c) The testing structure of the proposed ED engine in a 1T1R array. d) Pulse-induced conductance tuning behavior exhibits its potential for use in analog computing. e) The target matrix to be programmed in the array. The conductance will be programmed by the target-aware method to achieve the target values. f) The linear I-V relationship for a read voltage of 0.4 V. It proves that the encoded input voltage below 0.4 V does not affect the conductance state. g) Ten-level retention properties for the measured memristor. h) Experimental programmed values on the 1T1R test board. The maximum error for programming is less than 6% while the average error is approximately 2.8%.
The memristive conductance exhibited 10 linear discrete levels from 11.1 to 100 μS. A target-aware method was adopted to program them accurately, as shown in Figure S2, Supporting Information, and the obtained stable ten-level retention is shown in Figure 3g. The conductance of the cells was controlled by SET and RESET voltages with varying amplitudes until the programming error was within the target error ΔG. The matrix was then programmed on the 1T1R array with a maximum error of 6% and an average error of 2.8%, as shown in Figure 3h. The detailed programming data can be found in Table SII, Supporting Information. Moreover, to investigate the accuracy of the ED engine, input voltages were applied to the memristor crossbar array with the programmed conductance to obtain the experimental MAC results. Figure 4a shows the MAC values for 100 temporal cycles, which represent the ED values of S and w. The tested values fluctuated stably around the mean value. The raw MAC results measured in the 1T1R array are shown in Figure S3a, Supporting Information. Owing to the differential pair used for the ED calculation, the standard deviation of the ED results is larger than that of the raw measurements. However, it remains two orders of magnitude smaller than the measured data, which indicates relatively stable measurement results. The analyzed test results are shown in Figure 4b. The relative error (the difference between each experimental MAC result and the mean, divided by the mean) followed a Gaussian distribution with an average value μ (≈0) and a variance σ (≈0.039), thus indicating a reliable experimental ED calculation. More simulated EDs under the same condition are analyzed in Figure S3b, Supporting Information, where a similar distribution of the relative errors is observed.
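The program-and-verify idea behind the target-aware method (pulse, read back, and stop once the error is within ΔG) can be sketched as a simple loop. The sketch below is an illustration only: the device response model, the even spacing of the ten target levels, the 1 μS tolerance, and the function name `program_cell` are all assumptions made here and are not taken from Figure S2 or from the measured devices.

```python
import random


def program_cell(target_us, start_us, tol_us, rng, max_pulses=200):
    """Write-verify loop: read back, stop if within tolerance, else pulse."""
    g = start_us
    for pulses in range(max_pulses):
        error = target_us - g  # verify step (read-back)
        if abs(error) <= tol_us:
            return g, pulses
        # Toy device response (assumption): a SET pulse (error > 0) or a
        # RESET pulse (error < 0) closes a noisy fraction of the remaining gap.
        g += 0.5 * error * (1.0 + rng.gauss(0.0, 0.2))
    return g, max_pulses


rng = random.Random(0)
# Ten target levels between 11.1 and 100 uS; even spacing is an assumption.
targets = [11.1 + i * (100.0 - 11.1) / 9 for i in range(10)]
results = [program_cell(t, start_us=11.1, tol_us=1.0, rng=rng) for t in targets]
```

Each entry of `results` holds the final conductance in μS and the number of pulses used. In a real array the step produced by each pulse depends on the pulse amplitude and on the device state, so the update line would be replaced by the actual SET/RESET pulse selection.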
In addition, this method performed well on the array with 88.5% errors <0.1 at different points (programming on different array regions, namely, device-to-device), as shown in Figure 4c. Furthermore, the effects of non-ideal factors of the memristor device, including the available number of conductance states and write variations, on the ED calculation were investigated via simulation (Figure 4d). The results suggest that more accurate conductance programming and smaller write-state fluctuations are beneficial to the accuracy of the calculation. Specifically, the 6-bit precision of the conductance states and 4% write variation for programming are sufficient to control the relative error of the calculation within 10%. Notably, even if the write variation is reduced to zero, or the conductance precision is as high as 10 bits, the relative error is not likely to be zero. This is because the true values (real number) can only be stored by discrete quantified conductance states. In addition, simulation results considering more factors apart from device properties are shown in Figure S3c and S3d, Supporting Information, including the input encoding noise and the stuck-at fault (SAT) of the memristor array.
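The qualitative trends described above (fewer conductance states and larger write variation both increase the ED error, and the error does not vanish even without write variation) can be reproduced with a toy Monte Carlo model. The following sketch is not the simulation used for Figure 4d: it assumes ideal input voltages, values in [0, 1], uniform quantization, and multiplicative Gaussian write noise, and its numbers are not calibrated to the measured devices.

```python
import random


def quantize(x, bits):
    """Round x in [0, 1] to the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return round(x * levels) / levels


def stored(x, bits, write_sigma, rng):
    """Value actually held by a cell: quantized, then perturbed on write."""
    return quantize(x, bits) * (1.0 + rng.gauss(0.0, write_sigma))


def mean_relative_error(bits, write_sigma, trials=2000, dim=5, seed=0):
    """Average |ED_measured - ED_exact| / ED_exact over random vector pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        s = [rng.random() for _ in range(dim)]
        w = [rng.random() for _ in range(dim)]
        exact = sum((a - b) ** 2 for a, b in zip(s, w))
        # Inputs (S - w as voltages) are ideal; only the stored rows are not.
        measured = sum(
            (a - b)
            * (stored(a, bits, write_sigma, rng) - stored(b, bits, write_sigma, rng))
            for a, b in zip(s, w)
        )
        total += abs(measured - exact) / exact
    return total / trials


err_2bit = mean_relative_error(bits=2, write_sigma=0.0)
err_6bit = mean_relative_error(bits=6, write_sigma=0.0)
err_10bit = mean_relative_error(bits=10, write_sigma=0.0)
err_6bit_noisy = mean_relative_error(bits=6, write_sigma=0.04)
```

Even at 10 bits with zero write variation the error stays above zero in this model, because real-valued elements can only be stored on a discrete conductance grid.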

Hardware Mapping of Competitive Learning Models
An in situ CLNN for clustering, as well as its optional offline inference process, is demonstrated and simulated based on the aforementioned investigation of the memristive ED engine. The control logic and data flow during the training are shown in Figure S4, Supporting Information. The mapping rule for the two-layer competitive learning model based on the ED engine is illustrated in Figure 5a. The input neurons represent the input pattern, and the competitive mechanism is introduced to the neurons, where the output neurons of the network compete with each other and adhere to the WTA principle by measuring the EDs of the input vector and the adjustable weights. The M weights associated with k competitive neurons were mapped to a k × 2M memristor crossbar array owing to the distributed kernels. Both the input sample vector and the weight vectors were programmed twice. The input sample vector S was stored in the 0th row. Every weight vector from the weight matrix W was programmed in the other n rows in sequence. The stored sample vector S and the ith weight vector w_i (to be compared) were then encoded as the input voltages with different amplitudes, as shown in Figure S5, Supporting Information. The readout currents I_i and I_0 were output to the external circuit to calculate the differential currents. The ith differential current represents the ED between the input vector and the ith competitive neuron. The stored sample vector S was compared with all of the weight vectors, and the output differential current was stored in a buffer sequentially until the minimum ED value was obtained by a WTA circuit outside the array. The flow chart of the training process for a two-layer CLNN is shown in Figure S6, Supporting Information. For the memristive ED engine, the calculation and comparison of a sample vector with all weight vectors were both performed serially, which was relatively time-consuming.
This can be resolved by the asynchronous comparison circuit shown in Figure S7, Supporting Information, which speeds up the comparison for competitive learning tasks. In this study, prototype clustering algorithms were used as typical applications of competitive learning models based on ED calculations. The in situ training of a competitive layer learns the features to cluster different classes of inputs automatically, which captures the essence of unsupervised K-means. The IRIS dataset, an extensively used machine learning dataset, [47] was adopted to verify the online training of a CLNN with an ED engine. Figure 5b shows the clustering results after training by modeling the experimental ED engine performance. The success rate of the network reached 92.6%, comparable to the 94% obtained in software. The convergence traces are shown in Figure 5c. With only ten epochs, the success rate quickly saturated and fluctuated within a small range. The fluctuations of the success rate originated from the inevitable read and write variations, whereby the variations may cause the network to skip the best weights and lead to a set of unstable trained states. This uncertainty can be reduced by using an adaptive learning rate during training. Furthermore, the robustness of the on-chip implementation was also explored. As the conductance accuracy increased, the success rate improved until a plateau was reached at 6-bit accuracy (Figure 5d). The 6-bit requirement for the memristor states is demanding for general ED-based applications. This limitation can be compensated for by device optimization or by the cooperation of multiple low-precision chips. Increasing read or write variations inevitably degrade the success rate. However, the write variation has a greater impact on the success rate than the read variation in terms of online learning (Figure 5e).
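The training loop described above (find the neuron with the smallest ED, then move only that neuron toward the input) can be sketched in software as follows. This is a minimal illustration, not the simulation platform used in this study: it uses synthetic three-attribute clusters instead of the IRIS data, ideal distance calculations without read or write variation, an assumed standard form of the winner update w ← w + α(v_In − w), and seeds each neuron with one sample from each cluster for simplicity. The learning rate of 0.1 and the 50 training cycles follow the values stated in the Experimental Section.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def winner_index(sample, weights):
    """WTA step: index of the neuron with the smallest ED to the sample."""
    return min(range(len(weights)), key=lambda i: sq_dist(sample, weights[i]))


def train_wta(samples, init_weights, lr=0.1, epochs=50, seed=0):
    rng = random.Random(seed)
    weights = [list(w) for w in init_weights]
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for idx in order:
            s = samples[idx]
            w = weights[winner_index(s, weights)]
            for j in range(len(w)):
                w[j] += lr * (s[j] - w[j])  # only the winner is updated
    return weights


def make_blobs(centers, per_class=50, sigma=0.05, seed=1):
    """Synthetic stand-in data: Gaussian clusters around the given centers."""
    rng = random.Random(seed)
    samples, labels = [], []
    for label, center in enumerate(centers):
        for _ in range(per_class):
            samples.append([rng.gauss(m, sigma) for m in center])
            labels.append(label)
    return samples, labels


centers = [[0.2, 0.2, 0.2], [0.5, 0.8, 0.5], [0.9, 0.3, 0.8]]
samples, labels = make_blobs(centers)
init = [samples[0], samples[50], samples[100]]  # one seed sample per cluster
weights = train_wta(samples, init)
predicted = [winner_index(s, weights) for s in samples]
success_rate = sum(p == l for p, l in zip(predicted, labels)) / len(labels)
```

To study the influence of device non-idealities in such a model, `sq_dist` could be replaced by a noisy or quantized distance function, and the update line by a perturbed write.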
In situ training of a CLNN offers the possibility of self-adaptive application scenarios, such as autonomous driving, in which the weights are updated with real-time input to the network. However, for some competitive learning tasks, such as semisupervised learning vector quantization, not only is in situ training important, but the inference phase after training is also critical. Therefore, in Figure 6a, an alternative mapping design for parallel ED calculation is demonstrated, enabled by the reconfigurability of the memristor arrays. Utilizing the training rules, the network was first trained on the simulation platform. The detailed in situ training results are shown in Figure S8, Supporting Information. After the online training, the trained weight vectors were fixed for inference. To calculate the EDs between a sample vector and all of the trained weight vectors, the sample vector was then programmed n times to cover the left n rows of the distributed kernels that stored the trained weight vectors. Moreover, one additional column was added to store the squared L2 norm of the trained weights (the detailed operation processes are shown in Figure S9, Supporting Information). Herein, the breast cancer dataset, [48] which contains more samples than IRIS, was adopted as the benchmark. Figure 6b shows the clustering results of the breast cancer dataset in different situations, including software simulation, online training, and offline inference on the ED engine. The success rates based on the ED engine were slightly lower than those obtained using software. Moreover, offline inference overall yielded a higher success rate than online training, especially for the "Malignant" class, consistent with previous publications, [30,32] which indicates the robustness of the memristor-based inference platform. As an extension, multisample vectors could also be compared with the same weight vector, as shown in Figure S10, Supporting Information, by storing the sample vector set on the right part of the memristor array and one weight vector on the left part repeatedly. The multichip scheme provides a matrix-to-matrix ED calculation method at the expense of a larger chip area.

Figure 6. a) Reconfigured structure used for offline inference based on the memristive ED engine after online training. One additional row (with green background) is utilized to store the squared L2 norm of the trained weights W. This structure can calculate the EDs between the input vector S and the trained weights W in parallel. b) The success rates for learning vector quantization with the breast cancer dataset in different situations, including pure software simulation, offline inference, and online training on the memristive ED engine, respectively.

Utilizing the mature GPU platform as a benchmark, Table 1 shows the projected inference efficiency of the proposed memristive ED engine using various memristive devices (the detailed calculation process is shown in the Supporting Information). In this study, the efficiency reaches 1.835 TOPS W−1, which is much higher than that of a high-performance GPU (0.5 TOPS W−1). [49] Potentially, utilizing state-of-the-art memristive devices, the proposed inference engine is expected to yield energy efficiency improvements that exceed 100× (181.3 TOPS W−1). [34,50]
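The efficiency figures quoted above can be turned into improvement factors and per-operation energies with simple arithmetic. The short sketch below only restates the three TOPS W−1 values given in the text; the function names are hypothetical, and the conversion relies on the identity that 1 TOPS W−1 corresponds to 10^12 operations per joule, that is, 1 pJ per operation.

```python
GPU_TOPS_PER_W = 0.5           # RTX6000 value quoted in the text
ENGINE_TOPS_PER_W = 1.835      # ED engine with the devices of this study
OPTIMIZED_TOPS_PER_W = 181.3   # projection with state-of-the-art devices


def gain(engine, baseline=GPU_TOPS_PER_W):
    """Improvement factor of an engine over the GPU baseline."""
    return engine / baseline


def energy_per_op_pj(tops_per_w):
    """1 TOPS/W = 1e12 operations per joule, so E_op = 1 / (TOPS/W) in pJ."""
    return 1.0 / tops_per_w


gain_this_work = gain(ENGINE_TOPS_PER_W)     # 1.835 / 0.5
gain_optimized = gain(OPTIMIZED_TOPS_PER_W)  # 181.3 / 0.5
energy_gpu_pj = energy_per_op_pj(GPU_TOPS_PER_W)
energy_optimized_pj = energy_per_op_pj(OPTIMIZED_TOPS_PER_W)
```

With these inputs the improvement is about 3.7× for the devices of this study and about 363× for the optimized-device projection, which is consistent with the statement that the projected improvement exceeds 100×.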

Conclusion
In conclusion, a fully memristive ED engine was demonstrated to compute the full expression of ED in a single-step MAC operation. Experimental verifications with 5D data were implemented on the 1T1R crossbar array, and the constant time complexity was shown to hold regardless of the data dimensions. In situ training and offline inference schemes for the competitive learning model were developed and verified via simulated ED engines. Our results showed that the ED engine could accomplish clustering tasks with great tolerance of device variation and limited conductance states. Its performance also parallels that of the software. Moreover, the projected energy efficiency for competitive learning is substantially higher than that of a traditional GPU. The ED engine and the memristive competitive learning models shed light on potential edge applications that exploit memristor-based analog computing.

Experimental Section
Device Fabrication and Characterization: The basic component of the 1 kb array used in this work was a hybrid integration of a metal-oxide-semiconductor field-effect transistor (MOSFET) and a TiN/HfOx/TaOx/TiN memristor device. The MOSFET was fabricated with a standard 0.18 μm logic process at SMIC, and the channel width and length were 10 and 0.35 μm, respectively. The sandwiched memristor structure was grown on the drain of the MOSFET in the following steps. The bottom TiN (40 nm) electrode was deposited on a polished W plug with reactive sputtering. HfOx (10 nm) and TaOx (50 nm) switching layers were grown by atomic layer deposition and physical vapor deposition, respectively. A 30 nm TiN layer was then deposited as the top electrode. Finally, the memristor sandwich structure was patterned using a dry etching method. The effective size of the memristor device was approximately 1 μm × 1 μm, which was defined by the etching pattern.
Electrical Measurement (the FPGA Test Board): In this study, a versatile and portable hardware platform was developed to test the resistive switching characteristics of the memristor and perform analog computing functions within the 1T1R crossbar array. The platform consisted of an FPGA-based controller, high-speed analog-to-digital and digital-to-analog converter circuits used to generate programmable pulses for reading and writing the memristors, parallel 32-channel excitation and measurement circuits for computing, independent gate voltage control circuit, two switch matrices, DDR3 circuits for data buffering, and a USB 3.0 interface to exchange data with the laptop. The platform can generate positive or negative programming pulses with a maximum amplitude of 5 V and a resolution of 10 mV. The pulse-width resolution can reach 1 ns, and the minimum rising edge is 10 ns. By utilizing the configurable feedback signal condition circuit, the weight measurement ranged from 100 Ω to 30 MΩ. The calibration algorithm was integrated into the Kintex-7 FPGA to correct the channel mismatches of the 32-channel excitation and measurement circuit, which guaranteed the accuracy of analog computing.
In Situ Training of CLNN: A two-layer competitive network was implemented in the open-source Python language (version 3.6). Some open-source libraries, including NumPy and pandas, were used to build the simulation platform. The IRIS dataset used for online training contained three classes of 50 instances each, and each instance contained four attributes: sepal length and width and petal length and width. In this study, only the three most effective attributes (sepal width and petal length and width) were used to verify the performance of the competitive network for online clustering tasks. During the training process, the updating rule obeyed Equation (3), and the learning rate was fixed at 0.1. The maximum number of training cycles was preset to 50; training stopped once the 50th cycle was reached, even if the error had not reached the target error.
Offline Inference Tasks of CLNN: The training process for offline inference verification was also implemented on the same CLNN simulation platform. The breast cancer dataset had 699 samples, 16 of which had missing values. Four hundred samples were used as the training set while the remaining samples were used as the testing set. Each sample had nine attributes, each of which was preprocessed to quantized values ranging from 1 to 10. The learning rate decreased or increased automatically based on the training epochs and classification results to keep the training results near the Bayesian boundary and to ensure convergence. The starting learning rate was 0.3, as recommended in the study by Kohonen et al., [51] and the learning rate was never allowed to exceed its initial value. The training samples were randomly selected from the original training dataset.

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.