Efficient and robust bitstream processing in binarised neural networks

In the neural network context, used in a variety of applications, binarised networks, which describe both weights and activations as single-bit binary values, provide computationally attractive solutions. A lightweight binarised neural network system can be constructed using only logic gates and counters together with a two-valued activation function unit. However, binarised neural networks represent the weights and the neuron outputs with only one bit, making them sensitive to bit-flipping errors. To cope with this error sensitivity, binarised weights and neurons are manipulated with bitstream processing in the spirit of stochastic computing. Stochastic computing is shown to provide robustness to bit errors on the data while relying on a hardware structure whose implementation is simplified by a novel subtraction-free implementation of the neuron activation.

✉ Email: ayguns@itu.edu.tr
Introduction: Emerging technologies such as nanoscale devices pave the way for a better trade-off between computation, power, and area. However, these hardware systems can be vulnerable to stuck-at-fault and bit-flip errors. Therefore, it is crucial to design fault-robust and noise-robust algorithms. Stochastic computing (SC) has introduced a hardware-efficient bitstream computational paradigm that manipulates data in the form of non-stationary Bernoulli sequences [1], giving the same level of significance to all bits and thereby being robust to bit-flipping errors [2]. Moreover, SC can reduce hardware resource utilisation by using simple logic gates instead of sophisticated arithmetic units [3-6].
In the context of neural networks, now widely used in a large variety of applications, binarised networks, describing both weights and activations as binary values, provide computationally attractive solutions with competitive model prediction accuracy. However, in binarised neural networks (BNN) [7], the 1-bit representation of the weights is vulnerable to soft errors that are inherent to high-throughput nanoscale environments. Therefore, this article proposes to adopt bitstream encoding and processing to handle the operations of the neural network. The bitstream-processing paradigm using random streams is also named stochastic computing in the literature. In terms of computational units, BNNs involve a sign function at each neuron of the hidden layers to obtain +1 or −1 activation values. Moreover, each weight is restricted to +1 or −1 during training. Thus, multiplication and summation operations reduce to simple digital logic elements, namely an XNOR gate and a counter, respectively.
A few studies have recently investigated the use of bitstream processing in the context of neural networks [2, 4-6]. Most of the preceding bitstream-processing neural networks operate at either full precision or 2-or-more-bit quantisation of the weights and activations. Stochastic streams were first considered in the context of BNNs by Hirtzlin et al. [8], who proposed an improved training strategy by introducing input data streams into the network during training. For the MNIST Fashion dataset [9], they brought their stochastic network accuracy closer to the deterministic one. They employed the XNOR gate, the counter, and the subtraction unit of the activation for the hardware construction. In contrast, we propose a computational framework that derives the neuron activation from the accumulation unit, and thereby avoids the subtraction generally implemented in the neuron activation unit. In the simulation section, we underline how our solution is robust to errors injected into the images and the network weights, which is the main theme of this article.

Background: In bitstream processing, numbers are represented by a binary encoding scheme before being processed as a bit flow by simple logic hardware to implement numerical operations. In this work, arguments in italic fonts denote scalar numbers, while bold italic fonts denote their corresponding binary streams. Let x be a scalar value. Its binary stream representation is composed of N binary elements, x = {x_1, ..., x_N}, where N ∈ Z+, N ≥ 2, defines the stream size, and any mth element out of N, x_m, is either binary 1 or binary 0. Let K denote the number of 1s in a bitstream, K = Σ_{m=1}^{N} x_m. Unipolar encoding (UPE) and bipolar encoding (BPE) are generally considered to encode any scalar into a bitstream, and fractional number encoding is essential in neural network applications. A positive-only fractional number x in the range [0, 1] is encoded via UPE such that x = K/N, whereas BPE encodes a fractional number x in the range [−1, 1] such that x = (2K − N)/N. In a BPE stream, a majority of 1s, that is, of high values, indicates a positive sign, whereas a majority of 0s indicates a negative sign. Thus, BPE is superior to UPE in terms of sign-estimation robustness. In both schemes, at encoding, a Bernoulli sequence can be used to convert the scalar number into a random sequence for perfect randomness of each bit [3]. In this work, data are represented using the BPE format in the proposed framework.
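To make the encoding concrete, the following Python sketch (ours, not part of the original design; the function names upe_encode and bpe_encode and the use of NumPy are illustrative assumptions) generates UPE and BPE streams as Bernoulli sequences of length N and decodes them back from the count of 1s, K.

```python
import numpy as np

rng = np.random.default_rng(42)

def upe_encode(x, N):
    """Unipolar encoding: each bit is 1 with probability x, for x in [0, 1]."""
    return (rng.random(N) < x).astype(np.uint8)

def bpe_encode(x, N):
    """Bipolar encoding: each bit is 1 with probability (x + 1) / 2, for x in [-1, 1]."""
    return (rng.random(N) < (x + 1) / 2).astype(np.uint8)

def upe_decode(stream):
    """UPE estimate of the encoded value: x ~ K / N."""
    return int(stream.sum()) / stream.size

def bpe_decode(stream):
    """BPE estimate of the encoded value: x ~ (2K - N) / N."""
    k = int(stream.sum())
    return (2 * k - stream.size) / stream.size

# Encode 0.75 over N = 16 bits with both schemes and decode the estimates back.
print(upe_decode(upe_encode(0.75, 16)), bpe_decode(bpe_encode(0.75, 16)))
```

The decoded values only approximate 0.75 for short streams; the estimate is exact in expectation and its variance decreases as N grows.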
Proposed framework: In this study, the acronym BNN refers to the conventional deterministic binarised neural network [7]. In contrast, the 'BSBNN' acronym denotes the bitstream-processing BNN. We first revisit how BNN works, and then introduce the proposed BSBNN implementation.
In any deterministic neural network, S^(n)_m denotes the mth neuron pre-activation of the nth hidden layer (HL). The pre-activation is computed by multiplying and accumulating the preceding neuron (or input) values, x, with the weights, w, as S^(n)_m = Σ_{i=1}^{p} w^(n)_{m,i} x_i, where p is the number of neurons in the (n − 1)th layer. In the hidden layers of the deterministic BNN, decimal weight and neuron output values, restricted to ±1 during training, are mapped to 1-bit binary values such that (+1)_10 ≡ (1)_2 and (−1)_10 ≡ (0)_2. This makes it possible to implement bitwise multiplication using a single XNOR gate. Table 1 compares decimal and bitwise multiplications and shows how decimal multiplication reduces to a simple logic operation via the XNOR gate.
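As a minimal check of this mapping (our own sketch, not the article's Table 1 itself), the following code verifies that an XNOR of the 1-bit encodings reproduces the decimal product of ±1 values:

```python
# Mapping used by the BNN: decimal +1 -> bit 1, decimal -1 -> bit 0.
to_bit = {+1: 1, -1: 0}

def xnor(a, b):
    # XNOR of two bits: 1 when the bits are equal, 0 otherwise.
    return 1 - (a ^ b)

# The XNOR of the bit encodings matches the encoding of the decimal product.
for x in (+1, -1):
    for w in (+1, -1):
        assert xnor(to_bit[x], to_bit[w]) == to_bit[x * w]
        print(f"{x:+d} * {w:+d} = {x * w:+d}  <->  "
              f"XNOR({to_bit[x]}, {to_bit[w]}) = {xnor(to_bit[x], to_bit[w])}")
```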
However, in the deterministic BNN, accumulation is performed by counting the population of 1s. For this operation, a modulus (MOD) digital counter circuit is utilised. Let T be the total number of states of the counter. Including a zero initial accumulation value, the counter can count up to the final state, (T − 1). The term popcount, short for population count, is generally used to denote the basic counter logic hardware [7, 8]. Figure 1 depicts an example of an asynchronous counter based on D-type flip-flops (ffs) for the popcount. Binary inputs to be accumulated are fed sequentially to the clock input of the first flip-flop, and the base-2 output, (Q3 Q2 Q1)_2, gives the total number of 1s in the input binary word. The counter naturally resets to zero when the total number of 1s presented to the input reaches two to the power of the number of flip-flops, that is, 2^(number of ffs).
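The behaviour of this popcount can be mimicked in software. The sketch below is our own behavioural model, assuming three flip-flops as in Figure 1: each input 1 increments a 3-bit state that naturally wraps to zero after 2^3 = 8 ones.

```python
def popcount_3ff(bitstream):
    """Behavioural model of a 3-flip-flop asynchronous popcount.

    Each input 1 increments the 3-bit state (Q3 Q2 Q1); the state wraps
    to zero after 2**3 = 8 ones, like the hardware counter of Figure 1.
    """
    state = 0
    for bit in bitstream:
        if bit == 1:
            state = (state + 1) % 8  # natural wrap-around of 3 flip-flops
    return state

# Nine 1s in the stream: the counter wraps once and ends in state 1.
print(popcount_3ff([1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]))
```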

Fig. 3 Popcount presented in Figure 1, truncated from T = 8 states to T = 5. It outputs a (1)_2 and resets itself whenever the sixth occurrence of a high value is observed in the input bitstream
When considering our BSBNN framework, we implement the multiplication and accumulation operations using bitstream-sourced logic elements. Distinct from the prior art [8], bitstreams are used not only to represent the image pixel values but also to define all the neuron outputs and weights handled throughout the hidden layers. Thus, a −1 weight value in the deterministic BNN is converted into w = {0, 0, ..., 0} (N zeros) for bitstream processing; likewise, +1 is converted into w = {1, 1, ..., 1} (N ones). Our experiments show that this way of representing the weights and hidden values improves robustness to bit-flip data errors.
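This weight conversion is a one-line operation; the helper below (hypothetical naming, ours) states it explicitly:

```python
def weight_to_stream(w, N):
    """Map a binarised weight to its bitstream: +1 -> N ones, -1 -> N zeros."""
    return [1] * N if w == +1 else [0] * N

print(weight_to_stream(-1, 8))  # [0, 0, 0, 0, 0, 0, 0, 0]
print(weight_to_stream(+1, 8))  # [1, 1, 1, 1, 1, 1, 1, 1]
```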
Next, the implementation of the proposed BSBNN bitstream operations is detailed. In bitstream processing, the multiplier is either an AND gate for UPE bitstreams or an XNOR gate for BPE bitstreams. Since the weights can be +1 or −1 in the binarised network case, BPE streams are adopted, and the multiplication is implemented by the XNOR gate. Figure 2 depicts an example of XNOR-based multiplication for x and w BPE bitstreams.
In the BSBNN, the neuron activation can be computed directly by processing the concatenation of all multiplication bitstreams, denoted as [(x_1 w^(n)_{m,1}), ..., (x_p w^(n)_{m,p})], where (x_i w^(n)_{m,i}) represents the multiplication stream related to the ith neuron of the preceding layer. The concatenated bitstream is fed to a counter, which decides whether the output of the neuron should be activated, depending on whether the concatenated bitstream contains a majority of 1s. Our proposed solution uses a counter that resets itself and forwards a positive neuron activation as soon as the number of 1s observed in the concatenated bitstream reaches a threshold corresponding to half the size of this concatenated bitstream. Since the total number of states is not always a power of two, the counter combines a popcount with a specific masking logic that resets the popcount once it reaches the predefined hard-coded value T [10]. This concept is denoted as a truncated MOD-T counter in the following. The term truncated refers to the fact that the counter resets before the maximal value of the popcount is reached, while MOD-T indicates that T is the modulus value of the counter. In practice, at every increment, the current total sum, (Q3 Q2 Q1)_2, is presented to the bit-masking logic to check whether the threshold has been attained. The counter truncation is exemplified in Figure 3 for the popcount presented in Figure 1. The output (Q3 Q2 Q1)_2 goes through five states, from (000)_2 to (100)_2, and the counter then resets itself by setting the clear pin. To reset after five states, the output of the popcount is masked by a gate that outputs a (1)_2 only if Q1 = (1)_2, Q2 = (0)_2, and Q3 = (1)_2. This can be implemented with an AND gate, as shown in Figure 3. When 1s are in the majority in the input binary stream, that is, whenever the sixth high value is seen on the input, the masking logic outputs a (1)_2.
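The truncated MOD-T counter can be modelled behaviourally as follows. This is our own bit-level approximation of the gate-level logic described above; the threshold is passed as a parameter rather than being hard-coded in a masking gate.

```python
def truncated_mod_counter(bitstream, threshold):
    """Behavioural model of the truncated MOD-T counter.

    Counts the 1s of the incoming (concatenated) bitstream and, as soon as
    the count reaches `threshold`, emits the activation bit 1 and resets.
    Returns the neuron activation bit (1 = +1, 0 = -1 in BPE terms).
    """
    count = 0
    for bit in bitstream:
        if bit == 1:
            count += 1
            if count >= threshold:   # masking logic detects the hard-coded state
                return 1             # activation fired; the counter clears itself
    return 0                         # the majority of 1s was never reached

# 12-bit concatenated stream with 7 ones; threshold at half the length (6).
print(truncated_mod_counter([1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0], 6))  # -> 1
```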
As depicted in Figure 4, our proposal to use a truncated MOD-T counter simplifies the hardware in charge of turning the concatenated bitstream into a neuron activation. Former solutions, given in Figure 4(a), simply use a register to accumulate the number of 1s counted in each of the input products and then rely on a subtraction to compare the accumulated sum with the pre-encoded threshold. In contrast, our proposed solution, shown in Figure 4(b), parses the concatenated product bitstream and embeds the pre-encoded threshold T directly in the truncated MOD-T counter. It needs no extra register and no subtraction beyond the masking logic. In Figure 5, a numerical example is provided to clarify the overall proposal. Assuming this is the input layer of a neural network, the input values are exemplified as 1, 0.5, 0.5, and the weight values as 1, −1, 1. The overall concatenated bitstream size is 12 bits, because there are three neurons and each stream is 4 bits. The activation threshold is 6, half the size of the concatenated stream: the truncated MOD-5 counter of Figure 3, whose masking logic is hard-coded to (101)_2 (T = 5, including the zeroth state), fires on the sixth 1 observed in the stream. To control the neuron activation bit, that is, the sign bit, the masking logic is applied at the output of the truncated MOD-5 counter shown in Figure 3. To conclude, the hardware units of the deterministic BNN and the BSBNN are compared in Table 2. In summary, it reveals that the BSBNN does not require subtraction for the activation operation while using bitstream processing.
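Putting the pieces together, the sketch below (ours) reproduces the spirit of the numerical example of Figure 5: BPE input streams of length N = 4 are multiplied by the deterministic weight streams with XNOR, the twelve product bits are concatenated, and the neuron activates if the count of 1s reaches half the concatenated length. The particular 4-bit realisations chosen for the 0.5 inputs are illustrative assumptions (0.5 in BPE corresponds to a probability of 1s of 0.75).

```python
# Example values of Figure 5: inputs 1, 0.5, 0.5 and weights 1, -1, 1, with N = 4.
inputs  = [[1, 1, 1, 1],   # 1.0 in BPE -> all ones
           [1, 1, 1, 0],   # 0.5 in BPE -> three ones (one possible realisation)
           [1, 1, 0, 1]]   # 0.5 in BPE -> three ones (another realisation)
weights = [[1, 1, 1, 1],   # +1 -> all ones
           [0, 0, 0, 0],   # -1 -> all zeros
           [1, 1, 1, 1]]   # +1 -> all ones

# Bitwise XNOR products, concatenated into a single 12-bit stream.
products = [1 - (x ^ w) for xs, ws in zip(inputs, weights) for x, w in zip(xs, ws)]

# Truncated counter: activation fires once the count of 1s reaches half the length.
threshold = len(products) // 2
count, activation = 0, 0
for bit in products:
    count += bit
    if count >= threshold:
        activation = 1
        break

print(products, "-> activation bit:", activation)
```

The decimal check agrees: 1·1 + 0.5·(−1) + 0.5·1 = 1 > 0, so the neuron output is +1, and the sketch prints an activation bit of 1.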
Network simulation: After discussing how the BSBNN can be implemented efficiently, with reduced hardware logic, we analyse how the BSBNN performs compared to the BNN in the presence of input or model bit-flip errors. This section simulates our proposed network using the MNIST handwritten digit dataset [11] and presents the obtained results. Figure 6 presents the accuracy obtained on the MNIST digit dataset by previous bitstream-based neural network architectures as a function of the bitstream length. Most studies in the literature use a multi-layer perceptron with two hidden layers (HL), as is also considered in this study.

Fig. 7 Bit-flip soft error injections applied to BNN, BSBNN8, and BSBNN16. Error injection is defined in terms of the percentage of flipped binary symbols. (a) The BNN model trained without errors is tested against image bit-flips. (b) The BNN model trained without errors is tested against weight bit-flips. (c, d) Models are trained in the presence of image or weight errors, with the same processing paradigm as the one used at test time
During our simulations, we first consider a single model, obtained using the deterministic BNN in the training phase. The forward path of the network uses binarised weights; however, the weight update is performed on a floating-point version of the weights, which are then binarised to define the forward path. The network is made of four layers with 784, 1024, 1024, and 10 neurons, respectively. The squared hinge loss is chosen, as it outperforms the cross-entropy loss in terms of validation accuracy. An exponential learning-rate decay is applied in the range [3 × 10^-7, 3 × 10^-3]. The batch size is set to 100. Dropout is applied with p = 0.2 for the input layer and p = 0.5 for the hidden ones. The best-performing model is saved according to the best validation accuracy, 98.43%, reached at the 968th epoch out of 1000 epochs. The deterministic BNN test accuracy is 98.27%. From a data-representation point of view, traditional binary data in the BNN is known to be quite vulnerable to bit-flipping errors, because errors affecting the most significant bits of the input data, or the 1-bit weights, have a large impact. In contrast, we expect increased robustness from the bitstream-processing paradigm. Hence, bit-flip error injection has been considered to measure the robustness of the proposed bitstream-processing framework to errors that can occur during memory accesses of images and weights [4, 8]. Since binary symbol errors affect the weights and the pixel values differently, our study investigates those two kinds of errors separately.
For the case in which errors affect pixel values, some binary symbols representing the pixels are randomly selected. The pixels of the test images are manipulated in the traditional 8-bit binary representation for the deterministic BNN. For the BSBNN, the image data are represented by streams of size N = 8 (BSBNN8) or N = 16 (BSBNN16). Figure 7(a) depicts how the network accuracy decreases with the percentage of flipped symbols. We observe that the conventional processing paradigm suffers from dramatic accuracy losses, while the BSBNN achieves remarkable robustness. Regarding binary weight flipping, we observe in Figure 7(b) that BNN and BSBNN8 achieve close performance, while BSBNN16 demonstrates a non-negligible, but significantly reduced, sensitivity to weight bit errors.
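As an illustration of the kind of error injection described above (not the exact experimental code), the following sketch flips a given fraction of randomly chosen binary symbols, whether they belong to the 8-bit pixel words used by the deterministic BNN or to a BPE bitstream of the BSBNN; the function name inject_bit_flips is ours.

```python
import numpy as np

def inject_bit_flips(bits, flip_rate, rng=None):
    """Flip a given fraction of randomly chosen binary symbols.

    `bits` is a flat array of 0/1 symbols (e.g. the 8-bit words of an image,
    or a BPE bitstream); `flip_rate` is the fraction of symbols to flip.
    """
    rng = rng or np.random.default_rng()
    bits = bits.copy()
    n_flips = int(round(flip_rate * bits.size))
    idx = rng.choice(bits.size, size=n_flips, replace=False)
    bits[idx] ^= 1  # bit-flip: 0 <-> 1
    return bits

# Example: flip 10% of the binary symbols of a 784-pixel image stored as 8-bit words.
rng = np.random.default_rng(0)
image_bits = rng.integers(0, 2, size=784 * 8, dtype=np.uint8)
noisy_bits = inject_bit_flips(image_bits, 0.10, rng)
print((image_bits != noisy_bits).mean())  # ~0.10
```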
As a second round of simulations, we investigated how accounting for the errors and the processing paradigm at training time affects the robustness to errors at test time. Therefore, different models have been trained with different bit-flipping rates (5, 10, or 20%) during training. This is done using either the conventional or the bitstream-processing computational paradigm, and the errors affect either the image pixels or the weights. The resulting models are tested with the same computational paradigm as the one used at training, and their test accuracy is reported in Figure 7(c) and (d) as a function of the bit-flip percentage at test time. In Figure 7(c), we observe that injecting errors on image pixels at training increases the robustness to errors at test time by a large margin (about 30% accuracy improvement at a 30% error rate), and that the bitstream-processing paradigm is significantly more robust to errors than the conventional one. With regard to errors affecting the weights, we observe in Figure 7(d) that accounting for errors is largely beneficial when they are also present at test time, but the performance is penalised in the absence of errors at test time.
To compare the complexity of the conventional and bitstream-based computational paradigms, a hardware simulation framework was constructed in MATLAB Simulink with design primitives from Xilinx System Generator. For both the BNN and the BSBNN hardware simulations, the input-layer data were copied from the MATLAB workspace. BPE-based random bitstreams were prepared on the software side, adopting a Bernoulli distribution for the BSBNN following the procedure in [4]. Parallel co-simulation was performed by feeding the hardware model associated with the feed-forward hidden layers. For the single-neuron hardware shown in Figure 4, the total count of look-up tables and ff logic was 101 for the N = 8-bit BSBNN and 147 for the BNN, a gain of approximately 30%, according to the design synthesis in the Vivado tool with the target device set to Zynq-7000.
Conclusion: A stochastic bitstream-processing binarised neural network has been presented. The first contribution of this study is a subtraction-free activation using the truncated MOD-T counter together with bitstream processing. The second contribution is underlining the data-error robustness of bitstream processing over the deterministic approach. Injecting random bit-flips into the data of the deterministic BNN decreases the accuracy significantly, whereas the stochastic bitstream representation of image pixels and weights provides noticeable robustness to soft errors on the data.