Ex Situ Transfer of Bayesian Neural Networks to Resistive Memory‐Based Inference Hardware

Neural networks cannot typically be trained locally in edge-computing systems due to severe energy constraints. It has, therefore, become commonplace to train them "ex situ" and transfer the resulting model to dedicated inference hardware. Resistive memory arrays are of particular interest for realizing such inference hardware, because they offer an extremely low-power implementation of the dot-product operation. However, the transfer of high-precision software parameters to the imprecise and random conductance states of resistive memories poses significant challenges. Here, it is proposed that Bayesian neural networks can be more suitable for model transfer, because, like device conductance states, their parameters are described by random variables. The ex situ training of a Bayesian neural network is performed, and the resulting software model is then transferred in a single programming step to an array of 16 384 resistive memory devices. On an illustrative classification task, it is observed that the decision boundaries and the prediction uncertainties of the software model are well preserved after transfer. This work demonstrates that resistive memory-based Bayesian neural networks are a promising direction in the development of resistive memory-compatible edge inference hardware.

DOI: 10.1002/aisy.202000103
Fortunately, however, this randomness follows stereotyped probability distributions, which allows resistive memory devices instead to be used as physical random variables. [24,25] Previous work has used the probability distributions that emerge from the volatile random switching properties of magnetic RRAM [26,27] and stochastic electronic circuits, [28,29] for example, to perform Bayesian inference.
In this article, working within this framework of probabilistic programming and Bayesian modeling, [30,31] we propose and experimentally demonstrate an approach for the ex situ training and subsequent transfer of a Bayesian neural network into the nonvolatile conductance states of resistive memory-based inference hardware. In Bayesian neural networks, like resistive memory conductance states, model parameters are not single high-precision values but probability distributions, suggesting a more natural pairing of algorithm and technology. In this setting, the objective is no longer to precisely transfer a single parameter from a software model to the corresponding device in a resistive memory array but to transfer a probability distribution from the software model into a distribution of device conductance states.
In this work, we first propose to use an expectation-maximization algorithm to decompose a probability distribution into a small number of random variable components, each corresponding to the cycle-to-cycle conductance distribution of RRAM programmed under a given programming current. We then experimentally demonstrate, with an array of 16 384 fabricated hafnium dioxide 1T1R structures, that these random variable components can be used to transfer the original probability distribution into the conductance states of a column of RRAM devices. We show that this approach can be leveraged to transfer a full Bayesian neural network in a single programming step. We also describe an RRAM-based hardware capable of storing, and performing inference with, such a Bayesian neural network. Finally, a Bayesian neural network model is trained ex situ and transferred, using the proposed technique, to an experimental array. The transferred model is used for inference in a simple illustrative classification task, where the decision boundaries that were learned ex situ are seen to be well preserved in the model transferred into the inference hardware.
Resistive memory devices store information in their nonvolatile conductance states. These conductance states can be programmed by applying voltage or current waveforms across the device in a fashion specific to the type of memory technology. Here, we consider an oxide-based resistive random access memory (OxRAM) composed of a thin film of hafnium dioxide sandwiched between top and bottom electrodes of titanium and titanium nitride, respectively. By applying a positive SET voltage between the top and bottom electrodes, a filament of conductive oxygen vacancies is formed within the oxide between the electrodes. This filament can thereafter be disrupted by applying a negative RESET voltage. By applying successive SET and RESET pulses, RRAM devices are cycled between their high-conductance state (HCS) and low-conductance state (LCS). We have cointegrated such devices in a standard 130 nm complementary metal-oxide-semiconductor (CMOS) process [32] to realize a fabricated array of 16 384 RRAM devices, which we use as our experimental platform throughout this article. Each device is connected in series with an n-type transistor, realizing a 1T1R structure. This structure allows each device to be individually selected for reading and programming.
In RRAM, the random mechanisms governing the distribution of vacancies within the oxide dictate that, between successive programming operations, the device will assume a conductance state different from the previous one. If a device is repeatedly cycled under the same programming conditions, a normally distributed cycle-to-cycle conductance variability emerges for the HCS. [25] In addition, the median conductance of this normal distribution can be determined by limiting the current that flows during the SET operation. [33] As an example, the cycle-to-cycle conductance variability distributions of a single device under three different SET programming currents are shown in Figure 1a. Notably, the standard deviation of each normal distribution is intrinsically tied to its median: the standard deviation reduces as the conductance median increases. This result is summarized by plotting the average relationship between the cycle-to-cycle conductance standard deviation and the median for the full population of 16 384 devices in the 1T1R array, which can be approximated with a linear function (Figure 1b). Each RRAM device is, therefore, a normal physical random variable: [25] the conductance states that result from SET operations are analogous to samples drawn from a normal distribution with a median and standard deviation determined by the programming current. Bayesian neural networks are variants of conventional neural networks whereby parameters are not single values but probability distributions. [31] The distribution of each parameter encapsulates the uncertainty in its estimation, which allows a model to avoid overfitting given, for example, a small training dataset or noisy sensory observations. [30] The challenge, therefore, is to transfer these software-based probability distributions into a plurality of device conductance states on an RRAM-based inference hardware.
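This device model can be sketched in a few lines. The snippet below treats a single RRAM device as a normal physical random variable whose standard deviation follows a linear function of its median; the coefficients `A` and `B` are illustrative placeholders, not the fitted values from Figure 1b, which are chip-specific.

```python
import numpy as np

# Illustrative coefficients for the linear law sigma = A * median + B
# (the real values come from the fit in Figure 1b and are chip-specific).
A, B = -0.05, 8.0  # conductances in microsiemens

def sample_hcs_conductances(median_us, n_cycles, rng):
    """Model one RRAM device as a physical random variable: every SET
    operation draws a fresh HCS conductance from a normal distribution
    whose spread is tied to the programmed median."""
    sigma_us = A * median_us + B
    return rng.normal(median_us, sigma_us, size=n_cycles)

rng = np.random.default_rng(0)
low = sample_hcs_conductances(40.0, 10_000, rng)    # low SET current: low median, wide spread
high = sample_hcs_conductances(100.0, 10_000, rng)  # high SET current: high median, narrow spread
```

Under this law, the low-median device exhibits a wider cycle-to-cycle distribution than the high-median one, reproducing the trend in Figure 1b.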
The fundamental insight of this article is that, because RRAM conductance states are also probability distributions, they lend themselves more naturally to the transfer of ex situ trained Bayesian neural network models than to that of deterministic ones.
We propose that the probability distribution of each Bayesian neural network parameter can be approximated by a linear combination of weighted normal random variable components, determined using a Gaussian mixture modeling approach. [34] In a Gaussian mixture model, each of the K Gaussian distributions, also referred to as normal random variable components, is characterized by a median, a standard deviation, and a weighting factor. These three parameters per component are updated iteratively using the expectation-maximization algorithm until a mixture of components is found that best "explains" the target parameter distribution. [34] As shown in the previous section, OxRAM devices are normal physical random variables in the HCS [25] (Figure 1a). However, although the median of each RRAM random variable component can be freely determined by the SET programming current (Figure 1a), the standard deviation is intrinsically tied to this value (Figure 1b). This requires that, instead of treating the standard deviation of each component as a free parameter during the expectation-maximization algorithm, its value be assigned based on the known relationship with the median (here, the equation in Figure 1b). This may require, for example, additional circuitry on a practical chip to perform an initial calibration step, owing to the die-to-die variability that exists across chips on a wafer as well as between wafers. [35] We apply this technique to decompose the single target probability distribution plotted in green in Figure 2a into K physical random variable components. These components, determined through expectation-maximization, can then be used to experimentally program a column of N RRAM memory cells, such that the distribution of conductance states in the column approximates the target distribution.
This result is achieved by programming subsets of devices in the column with a SET programming current, such that their conductance states are sampled from the Gaussian corresponding to each component. The number of devices programmed per component is equal to the nearest integer resulting from the multiplication of the total number of available devices by its weighting factor. For this target distribution, it was found that five normal components (K = 5) were required to approximate the target distribution well. This result was obtained by performing the expectation-maximization algorithm over a sweep of K and observing at which value of K the resulting log-likelihood of the mixture saturated, as plotted in Figure 2b. The five resulting RRAM random variable components are superimposed over the original target distribution in Figure 2a. These five components are then used experimentally to program a column of 1024 1T1R RRAM devices as described; the number of devices programmed per component is specified in the caption of Figure 2a. The resulting probability distribution, plotted as a histogram in Figure 2a, is seen to approximate the original target distribution well. This also suggests that the linear approximation of the relationship between conductance median and standard deviation, which has a nonnegligible error at the conductance extremities, is not detrimental.
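A minimal sketch of such a constrained expectation-maximization step is given below: each component's standard deviation is slaved to its median through the assumed linear law rather than fitted freely, and the final weights are converted to per-component device counts by nearest-integer rounding. The coefficients and the synthetic data are illustrative, not the experimental values.

```python
import numpy as np

A, B = -0.05, 8.0  # assumed linear law sigma = A * median + B (cf. Figure 1b)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def constrained_em(x, K, iters=200):
    """EM for a K-component Gaussian mixture in which each component's
    standard deviation is NOT a free parameter but is slaved to its
    median through sigma_k = A * mu_k + B."""
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)  # spread initial medians over the data
    w = np.full(K, 1.0 / K)                        # component weighting factors
    for _ in range(iters):
        sigma = A * mu + B
        # E-step: responsibility of each component for each sample
        r = w * normal_pdf(x[:, None], mu, sigma)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights and medians only; sigma follows mu
        w = r.mean(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    return w, mu, A * mu + B

def devices_per_component(w, n_devices):
    """Nearest-integer allocation of the column's devices to each component."""
    return np.rint(np.asarray(w) * n_devices).astype(int)

# Illustrative target: a 30/70 mixture whose spreads obey the assumed law
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(40.0, 6.0, 3000), rng.normal(100.0, 3.0, 7000)])
w, mu, sigma = constrained_em(x, K=2)
```

For a 1024-cell column, `devices_per_component(w, 1024)` then gives the number of devices to program with each component's SET current.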
To quantify the closeness of the approximation transferred to the hardware, we evaluate the Kullback-Leibler (KL) divergence from the transferred to the target distribution over a range of column sizes. The resulting mean KL divergence, over ten experimental transfers, is plotted in Figure 2c for an increasing number of RRAM cells per column. The KL divergence reduces rapidly as the number of devices in the RRAM column is increased, consistent with the law of large numbers. [36] Before applying the presented technique to the transfer of a full Bayesian neural network model, we first describe how to perform the ex situ training of an RRAM-based Bayesian neural network and how the parameters of this software model can be represented using an array of resistive memory devices. [25] In the Bayesian framework, training is typically performed with Markov chain Monte Carlo (MCMC) sampling or using variational inference algorithms. [37] In this article, we use the No-U-Turn sampler (NUTS) MCMC algorithm. [38] In contrast to gradient-based approaches, which result in a single deterministic, locally optimal model, NUTS MCMC results in a collection of sampled models, each with its own parameters (synaptic weights and biases). The distribution of each learned parameter, in other words, the distribution of that parameter over all of the sampled models, can then be transferred to a distribution of device conductances in a column of RRAM cells. In contrast to deterministic models, which generally require a single device per parameter, the use of a distribution comprising multiple devices per parameter allows uncertainty to be incorporated into its estimation. The ability to represent uncertainty in this manner is an advantage of Bayesian models, permitting them to account for factors such as sensory noise and small training dataset size, as well as to propagate uncertainty into their output predictions.
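The KL evaluation described above can be approximated in software with a simple histogram estimate over the usable conductance window. The target below is a single illustrative normal distribution, not the experimental one; the sketch only reproduces the qualitative law-of-large-numbers trend of Figure 2c.

```python
import numpy as np

def kl_divergence(samples, target_mu, target_sigma, bins=50, lo=20.0, hi=120.0):
    """Histogram estimate of D_KL(transferred || target) over the usable
    conductance window (values in microsiemens)."""
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(samples, bins=edges, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    q = np.exp(-0.5 * ((centers - target_mu) / target_sigma) ** 2) \
        / (target_sigma * np.sqrt(2.0 * np.pi))
    width = edges[1] - edges[0]
    nz = p > 0  # empty bins contribute 0 * log(0) = 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz]) * width))

rng = np.random.default_rng(1)
mu, sd = 70.0, 10.0
kl_32 = kl_divergence(rng.normal(mu, sd, 32), mu, sd)      # small column: coarse estimate
kl_4096 = kl_divergence(rng.normal(mu, sd, 4096), mu, sd)  # large column: close to target
```

As in the experiment, the divergence shrinks rapidly as more devices (samples) are used per column.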
[30,31] One neuron in an RRAM-based Bayesian neural network can be realized as shown in Figure 3b. The neuron receives input synapses from M (here, three) neurons in the previous network layer (Figure 3a), each connecting to one of its three columns. The distribution of each of the three input synaptic parameters is stored in a column of size N, therefore necessitating an N × M array of 1T1R structures per neuron.
By applying a voltage vector V across these M columns, corresponding to the activations of the neurons in the previous network layer (or to the input data for neurons in the first layer), a current equal to V ⋅ g_n flows out of each array row and into a neuron circuit. As the conductance values of g_n are on the order of microsiemens, the neuron circuit must first multiply this current value by a scaling factor S and then apply an activation function h(·) to the scaled quantity, resulting in a neuron output voltage z_n = h(S(V ⋅ g_n)). This voltage can then, in turn, be applied to a column of each of the neuron arrays in the next layer. The distribution of these N neuron activation voltages, z, constitutes the output distribution of the neuron. In this article, we use the hyperbolic tangent activation function for all neurons besides those in the output layer, where the softmax function is used.
Use of the softmax function at the output allows the likelihood of the model to be evaluated as a categorical random variable during training, such that it can be applied to multiclass datasets. In practice, as each parameter distribution can assume positive and negative values, each model parameter should be described by the difference between positive and negative distributions: p(g) = p(g+) − p(g−) (Figure 3b). Therefore, during MCMC sampling, the sampled parameters are p(g+) and p(g−), not p(g) directly. In addition, each neuron of the Bayesian neural network requires a bias distribution p(g_b), which can be realized with an extra column of devices, identical to the others, to which a constant voltage V_b is applied.
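Putting these pieces together, the computation of one neuron's output distribution, with differential conductance pairs and a bias column, might be sketched as follows. The scale factor S, the input voltages, and all conductance statistics are illustrative assumptions, not measured values.

```python
import numpy as np

def bayesian_neuron(V, g_pos, g_neg, gb_pos, gb_neg, V_b=1.0, S=1e4):
    """One RRAM Bayesian neuron holding N sampled rows of M parameters.
    V: (M,) input voltages; g_pos/g_neg: (N, M) conductances in siemens.
    Each row's current is V . g_n plus the bias column's contribution;
    the (shared) neuron circuit scales it by S and applies tanh."""
    I = (g_pos - g_neg) @ V + (gb_pos - gb_neg) * V_b  # one current per row
    return np.tanh(S * I)                              # N activation voltages z_n

# Illustrative sampled conductances: normal around 80 uS, spread 20 uS
rng = np.random.default_rng(2)
N, M = 1024, 3
g_pos = rng.normal(80e-6, 20e-6, (N, M))
g_neg = rng.normal(80e-6, 20e-6, (N, M))
gb_pos = rng.normal(80e-6, 20e-6, N)
gb_neg = rng.normal(80e-6, 20e-6, N)
z = bayesian_neuron(np.array([0.1, -0.2, 0.05]), g_pos, g_neg, gb_pos, gb_neg)
```

The vector `z` is the neuron's output distribution: N activation voltages, one per stored sample, whose spread reflects the parameter uncertainty.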
One further technological constraint must be considered. Each RRAM device has a limited conductance range; in the technology used here, it extends approximately from 20 to 120 μS (Figure 1b). As a result, the sampled distributions of each parameter, p(g+) and p(g−), of the Bayesian neural network must be bounded within these limits during the ex situ training. Fortunately, in the Bayesian framework, such a bounding can be achieved naturally by placing an appropriate prior distribution over each parameter. We therefore place a normal prior over each parameter, with a median of 80 μS and a standard deviation of 20 μS, such that the sampled distributions exist within the limited conductance range.
(Figure 2 caption, remaining panels: b) Maximum value of the log-likelihood obtained for an increasing number of components, K; for this target distribution, five components were required (red dashed line). c) KL divergence from the target to the transferred distributions for an increasing number of RRAM memory cells per column; for each number of memory cells, the distribution was transferred ten times, with green vertical bars indicating one standard deviation at each point; example transferred distributions for 32 and 4096 devices are plotted as an inset.)
We now combine the ideas of the two previous sections and present an approach to achieve the transfer of an ex situ trained Bayesian neural network onto the RRAM-based hardware shown in Figure 4a.
The detailed methodology of the transfer is presented in Note 1 and Figure S1, Supporting Information; here, we present only the core principles.
To transfer the ex situ trained Bayesian neural network to the inference hardware, the software model resulting from NUTS MCMC must be processed in two core steps. First, the expectation-maximization algorithm is applied to each parameter of the software model to decompose each parameter distribution into K components. The identified components for each parameter are then used to quantize the software model by setting each of the sampled values equal to the median of the closest normal component. Second, the quantized software model is transferred in a row-wise fashion to the RRAM-based hardware. Each RRAM device is programmed with a SET current, such that the device assumes a conductance value sampled from a normal distribution centered on the corresponding software value. It is important to perform the transfer row-wise, because the values of the different parameters in a row (i.e., one model sampled during NUTS MCMC) are correlated. If we were to program each column independently, as in the case of the single distribution (Figure 2a), the correlation between the parameters of each sample would be lost. Note that this would not be required for approaches where parameters do not have a covariance, as in variational inference, for example. [37] After the model has been transferred, the hardware can perform inference on previously unseen data points, whose features are presented as voltages to the columns of the neuron arrays in the first hidden layer. These voltages drive the forward propagation of neuron activation voltage distributions through the subsequent network layers, finally resulting in an activation distribution per output neuron. These output distributions can then be used to predict to which output class the input data point belongs. In addition, the standard deviation of each prediction distribution can be calculated and used to quantify the uncertainty in the prediction of each output neuron.
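The two core steps above, quantization to the nearest component median followed by one open-loop SET per device, can be sketched as follows. The component medians, the sigma(median) law, and the MCMC samples below are all illustrative stand-ins.

```python
import numpy as np

def quantize_to_components(samples, component_medians):
    """Step 1: snap each MCMC-sampled parameter value to the median of
    its nearest normal random-variable component."""
    medians = np.asarray(component_medians)
    idx = np.argmin(np.abs(samples[:, None] - medians), axis=1)
    return medians[idx]

def program_row_wise(quantized, sigma_of, rng):
    """Step 2: one SET pulse per device; each conductance is a single
    open-loop draw from the normal distribution centred on the quantized
    software value. Processing whole rows preserves the correlations
    within each MCMC sample."""
    return rng.normal(quantized, sigma_of(quantized))

sigma_of = lambda mu: -0.05 * mu + 8.0           # assumed sigma(median) law, cf. Figure 1b
rng = np.random.default_rng(3)
samples = rng.normal(70.0, 12.0, (256, 4))       # 256 MCMC samples x 4 parameters (toy model)
medians = np.array([50.0, 65.0, 80.0, 95.0])     # illustrative component medians
q = quantize_to_components(samples.ravel(), medians).reshape(samples.shape)
g = program_row_wise(q, sigma_of, rng)           # conductances landed on the array
```

Each row of `g` corresponds to one sampled model, so the parameter correlations of each sample survive the transfer, which column-independent programming would destroy.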
As the pre-synaptic distributions of each of the neurons in a Bayesian neural network are stored in multiple rows (i.e., multiple samples), but the physical connections between neurons consist only of single metal wires, inference with a Bayesian neural network must be performed one row (sample) at a time, because each sample produces a different activation voltage that must be propagated to the corresponding sample in the next layer. At the output layer, a separate memory structure is then required to temporarily store each of the output neuron activations that result from the independent forward propagation of an input data point through each of these transferred samples. In this fashion, after all of the rows have been read in an inference, the prediction distribution of each output neuron is readily available. To achieve this, we propose, in Figure 4a, that each neuron array contains only a single neuron circuit that is multiplexed between each of the N rows sequentially. By applying voltage pulses to the gates of the devices in only one row, while grounding the others, this multiplexing can be achieved cheaply and without a dedicated multiplexing circuit. In addition, the use of a shared neuron circuit reduces, by a factor of N, the circuit overhead required to implement the activation function or to perform any required analogue-to-digital conversions. For example, by applying the red pulses in Figure 4a to row N = 0 of all of the neuron arrays simultaneously at time t_0, each neuron in the output layer will produce a voltage activation z_0. In other words, these output activations result from the forward propagation of the input through the devices in row N = 0 of each array only. Thereafter, applying the green pulses at t_1, the outputs z_1 will result from the forward propagation through the devices in row N = 1, and so on. By proceeding in this fashion up to row N − 1, an output distribution p(z) will be available for each output neuron at t_{N−1}. The mean value and standard deviation of this distribution can be used to, respectively, make a prediction and quantify prediction uncertainty.
(Figure 4 caption: a) Each row of the array uses devices coding for positive (g+) and negative (g−) values, enabling each parameter to be positive or negative. The inputs to the columns are the output voltages generated by the M neurons in the previous network layer. As a result of these input voltages, two currents flow out of each row and into a neuron circuit, which subtracts them and then evaluates an activation function, producing an output voltage that can, in turn, be applied to the columns of the neuron arrays in a subsequent layer. The distribution of the N output voltages is the output distribution of the neuron. The outputs of three hidden-layer neuron arrays, corresponding to neurons one, two, and eight in b), are connected to the inputs of three columns of RRAM of another neuron array, neuron one in the output layer in b). As a function of the input data feature voltages, the hidden-layer neurons produce activation voltages that are, in turn, applied across the columns of the output-layer neurons, causing them to produce activation voltages; this forward propagation of voltage continues for an arbitrary number of network layers until the output layer is reached. By sequentially applying gate voltage pulses to each row of all the arrays (the red pulses at t_0, the green pulses at t_1, and, finally, the blue pulses at t_{N−1}), output neuron one sequentially produces voltage activations z_0, z_1, ..., z_{N−1}; the distribution of all activations, p(z), is its output distribution. b) (Center) A single hidden-layer feedforward Bayesian neural network; circles and lines in bold correspond to the neuron arrays and connections shown in a). (Left) Probability density histograms and kernel density estimates for a synaptic parameter (green) using 16, 128, and 1024 memory cells per column. (Right) Predictive probability contours of neuron one (recognizing points from the red moon) and neuron two (blue moon) for 16, 128, and 1024 memory cells per column; each red and blue moons data point is described by two feature voltages applied as inputs to the columns of the green neuron arrays.)
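The row-at-a-time inference scheme above can be sketched in software: each "row" is one complete sampled model, the input is propagated through each of them independently, and the collected outputs form the predictive distribution p(z). The network shape and the random weights below are illustrative, standing in for the transferred conductances.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # subtract max for numerical stability
    return e / e.sum()

def forward_one_row(x, layers):
    """Forward propagate one input through ONE sampled model, i.e. one row
    of every neuron array, as read out by the shared multiplexed circuit."""
    for k, (W, b) in enumerate(layers):
        x = W @ x + b
        x = np.tanh(x) if k < len(layers) - 1 else softmax(x)
    return x

def bayesian_inference(x, sampled_models):
    """Read rows 0..N-1 sequentially, storing one output activation per row;
    the collection approximates the predictive distribution p(z)."""
    outs = np.array([forward_one_row(x, m) for m in sampled_models])
    return outs.mean(axis=0), outs.std(axis=0)  # prediction and its uncertainty

# Illustrative stand-in: N sampled 2-8-2 models with random weights
rng = np.random.default_rng(4)
N = 64  # number of rows, i.e. MCMC samples kept in the arrays
sampled_models = [[(rng.normal(0, 1, (8, 2)), rng.normal(0, 1, 8)),
                   (rng.normal(0, 1, (2, 8)), rng.normal(0, 1, 2))]
                  for _ in range(N)]
pred_mean, pred_std = bayesian_inference(np.array([0.3, -0.7]), sampled_models)
```

The mean of the collected softmax outputs gives the class prediction, and their standard deviation quantifies the prediction uncertainty, mirroring the hardware's use of t_0 through t_{N−1}.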
To demonstrate this technique, we perform the ex situ training of the Bayesian neural network shown in Figure 4b. We apply it to an illustrative example, the moons classification task: [39] each of the two output neurons of the network must learn a nonlinear decision boundary that separates its respective class of noisy data points from the other. To evaluate the transfer of the Bayesian neural network model, we perform a hybrid hardware/software experiment. After termination of NUTS MCMC, the normal random variable components required for all of the model parameters are identified using expectation-maximization. Then, 1024 devices in the experimental RRAM array are programmed using the SET programming currents corresponding to the median value of each of these random variable components. The resulting conductance values are then used to build a computer model of the proposed hardware shown in Figure 4a to perform inference. This is required because the experimental 1T1R array features source and bit lines running in parallel, instead of orthogonal source and bit lines, such that devices can only be addressed individually for reading or programming. Each RRAM cell of this computer model is randomly assigned one of the 1024 transferred conductances that resulted from the SET programming current that would have been used to program the equivalent device on the physical array. Examples of the resulting distributions transferred to the synaptic parameter highlighted in green in Figure 4b are plotted for 1024, 128, and 16 rows. On average, based on the measured SET programming currents, the programming energy required to transfer the full Bayesian neural network model to the array was 1.37 μJ, 172 nJ, and 21.5 nJ for the models based on 1024, 128, and 16 rows, respectively.
Upon performing inference with the hybrid hardware/software model, the decision boundaries for each of the two output neurons, for the model transferred to the 1024, 128, and 16 row arrays, arise as shown in Figure 4b. The output neurons appear, in all situations, capable of discerning the underlying structural separation between the two types of data point that was learned in the software model. The probability contours of the two output neurons are largely similar for the case of 1024 and 128 rows, whereas those for 16 rows appear more erratic. Despite this appearance, however, the boundaries drawn at the interface of the two moons with N = 16 rows still capture the fundamental curvature of their division. Based on the read currents of the programmed devices, the energy required to read all of the device conductances during inference was 110 nJ, 13.7 nJ, and 1.72 nJ for the models transferred to the 1024, 128, and 16 row arrays, respectively. However, it is important to note that the energy required by the read circuitry, analogue-to-digital and digital-to-analogue conversions, and the circuits implementing the neuron activation functions has not been considered and would lead to a considerable increase in these values, depending on design choices.
The prediction uncertainty of each of these transferred Bayesian models is plotted in Figure S2, Supporting Information. This uncertainty, captured in the distribution of each synaptic parameter, naturally propagates through a Bayesian neural network to the output layer, where, as might be expected, it is seen to be greatest at the interface between the red and blue points. While the prediction uncertainty contours are largely similar for N = 1024 and N = 128, they are once again degraded for N = 16. In safety-critical edge inference applications, the ability of a Bayesian neural network to quantify uncertainty, in contrast to deterministic models, is potentially invaluable and, perhaps, indispensable from an ethical perspective. [40] For example, in a medical system such as an implantable cardioverter-defibrillator, [41] these prediction uncertainties can be leveraged to avoid the erroneous application of an electric shock to the heart, which can, in some instances, prove fatal. [42] If the system were presented with a data point close to a noisy decision boundary (as in Figure S2, Supporting Information) or with a data point from a location in the feature space that the model had not observed during training, perhaps due to a damaged or drifting sensor, the prediction uncertainty of the model would be large. By placing a threshold on the tolerated level of prediction uncertainty, above which the system should not take action, erroneous interventions can be avoided.
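A thresholding rule of this kind can be sketched in a few lines; the threshold value below is an arbitrary illustration that would, in practice, be set per application and validated clinically.

```python
import numpy as np

def decide(mean_probs, std_probs, uncertainty_threshold=0.2):
    """Act only when the predictive uncertainty of the winning class is
    below a tolerated threshold; otherwise abstain (take no action).
    The threshold value here is purely illustrative."""
    c = int(np.argmax(mean_probs))
    if std_probs[c] > uncertainty_threshold:
        return None  # abstain: e.g. withhold an automatic intervention
    return c

# Confident prediction far from the decision boundary: act on class 0
confident = decide(np.array([0.95, 0.05]), np.array([0.03, 0.03]))
# Ambiguous point near the boundary (high output spread): abstain
ambiguous = decide(np.array([0.55, 0.45]), np.array([0.30, 0.30]))
```

Deterministic networks provide only `mean_probs`, so this abstention mechanism is something the Bayesian output distribution uniquely enables.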
In this article, we have presented, and demonstrated in a hybrid hardware/software experiment, a method for transferring an ex situ trained Bayesian neural network model onto resistive memory-based inference hardware. Unlike previous transfer approaches, in which iterative closed-loop programming (program-verify) schemes are used, an expectation-maximization-based approach facilitated the transfer of a Bayesian neural network in a single programming step. This is particularly important, because Bayesian neural networks use multiple devices to describe the probability distribution of each parameter. We have also found that, in the simple illustrative task addressed, despite the fact that each device was programmed only once without verification, the decision boundaries of the software model were well preserved. Furthermore, it was demonstrated and discussed how the prediction uncertainty available in this Bayesian modeling approach could be an important facet in the ethical application of ex situ trained models in edge inference.
Going forward from this initial proposal and experimental demonstration, future work will focus on understanding how the proposed technique can scale to larger network models and to higher-complexity datasets as well as exploring further Bayesian ex situ training algorithms such as variational inference. [37] It will also be instructive to perform a quantitative comparison between ex situ trained RRAM-based Bayesian and deterministic neural network models to understand the advantages and trade-offs between the two approaches in terms of inference accuracy, the energy and latency incurred in model transfer and inference, and the memory requirements.
Ultimately, this article proposes a new approach to the deployment of ex situ trained software models at the edge based on Bayesian neural networks. Such models offer advantages such as increased compatibility with resistive memory properties [25] as well as the ability to represent uncertainty, which has important implications for ethical edge inference.