Device Variation Effects on Neural Network Inference Accuracy in Analog In-Memory Computing Systems



Introduction
Deep neural networks (DNNs) have achieved unprecedented capabilities in tasks such as image and voice analysis and recognition and have been widely adopted. However, the computation requirements and the associated energy consumption of neural network workloads keep increasing. Although accelerators such as GPUs and FPGAs have shown significant improvements over traditional CPUs in both computing power and energy efficiency, continued innovation is necessary to meet the growing demand. In particular, DNN inference workloads on edge computing platforms like mobile and IoT devices have stringent energy efficiency requirements due to limited energy supply, and unconventional approaches like analog computing may prove more advantageous in meeting this requirement.

We propose that architecture-aware training can be considered at 3 levels. Level 1 is the standard quantization-aware training method [10], where high-precision weights are passed through a fake-quantization function before computation. At Level 2, device-aware training, the signed weight representation in memory cells and the limited on/off ratio are considered. At Level 3, tile-aware training, the limited memory array size and ADC precision limitation are also considered. In addition, for all 3 levels, variation can be injected on a per-mini-batch basis to mimic the effect of programming variation.

The Necessity of the Tiled-Architecture
There are 3 important types of non-idealities in analog IMC systems for VMM operations: interconnect parasitics, ADC limitations, and memory device non-idealities. Because energy efficiency is the most important target, the ADC operating frequency is likely to be limited to below ~100 MHz [11]. At this speed, there is more than 10 ns of hold time between input changes. Under these constraints, the effects of these non-idealities have been studied in tiled analog IMC systems. In such a system, neural network models are trained off-line and programmed onto memory arrays, and large neural network layers are mapped onto multiple memory arrays, where partial sums (Psums) are produced by ADCs at each array and summed in the digital domain [15] (Figure 1a). Signed weights are represented in two cells on two different columns that receive the same input activations. Currents from the two columns are quantized by ADCs individually, then the digital output of the negative column is subtracted from that of the positive column (Figure 1b).
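As an illustration of this dataflow, the minimal NumPy sketch below maps a signed weight matrix onto positive and negative columns, digitizes each tile's column currents with an 8-bit ADC, and combines the partial sums digitally. The function names and the scaling of activations and conductances are hypothetical, and the per-array multipliers and other calibration steps used in the actual pipeline are omitted.

```python
import numpy as np

def adc_8bit(i_col, i_max=45e-6, bits=8):
    """Quantize analog column currents to digital codes over the 0 ~ i_max range."""
    levels = 2 ** bits - 1
    return np.round(np.clip(i_col, 0.0, i_max) / i_max * levels)

def tiled_signed_vmm(x, w, g_max=3e-6, rows_per_tile=256):
    """VMM with signed weights split across two columns on a tiled array.

    x: input activations, assumed scaled to [0, 1]
    w: signed weights, assumed scaled to [-1, 1]
    Each tile of rows produces its own ADC outputs (partial sums); the
    negative-column result is subtracted from the positive-column result
    in the digital domain before accumulation across tiles.
    """
    g_pos = np.maximum(w, 0.0) * g_max    # conductances on the positive columns
    g_neg = np.maximum(-w, 0.0) * g_max   # conductances on the negative columns
    psum = np.zeros(w.shape[1])
    for r in range(0, w.shape[0], rows_per_tile):
        xs = x[r:r + rows_per_tile]
        i_pos = xs @ g_pos[r:r + rows_per_tile]    # analog column currents
        i_neg = xs @ g_neg[r:r + rows_per_tile]
        psum += adc_8bit(i_pos) - adc_8bit(i_neg)  # digital Psum combination
    return psum
```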

Architecture-Aware Training
In general, many of the device and circuit non-ideality effects can be effectively mitigated through architecture-aware training methods [15], where hardware details are mimicked in the training process. For architecture-aware training, we developed a simulator based on Google's TensorFlow deep learning framework by modifying the training graph from the standard floating-point pipeline. To compare the impact of different hardware non-idealities, we consider 3 inference pipelines, Level 1 through Level 3, and their corresponding training topologies (Figure 2). In Level 1, only the quantization of weights and activations is considered. In Level 2, the effects of the signed weight representation on two cells and limited device on/off ratios are introduced. For both training and inference in Level 1 and Level 2, we used the common scheme for quantization-aware training [10], where the weights pass through a fake-quantization function before calculations are conducted. The fake-quantization function does not change the overall range of the weights; instead, it rounds the weights to a set of fixed values determined by the range and resolution set for the function, and these parameters can be different for each layer. For the actual hardware representation of weights, where the conductance range is fixed for the whole system, the outputs of each layer need to be multiplied by a high-precision scaler to match those of the software model. In Level 3, the physical range of memory cells, the multiplier, the limited memory array size, and the ADC precision limitation are introduced. As described in Figure 2, separate multipliers are assigned to each array and trained during the training process. By sequentially introducing different levels of architecture details during the training process, the neural network model can potentially account for these architecture and device factors and recover the desired model accuracy [15]. However, high levels of device programming variation, which are representative of today's analog memory devices, still cause considerable inference accuracy degradation.
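As a minimal sketch of the fake-quantization step (assuming a simple uniform quantizer; the per-layer range selection and the straight-through gradient used during training are not shown), the function below rounds weights to a fixed grid while preserving their overall range:

```python
import numpy as np

def fake_quantize(w, q_min, q_max, bits=8):
    """Round values onto 2**bits uniform levels spanning [q_min, q_max].

    The overall range of the weights is unchanged; only the resolution is
    limited. q_min, q_max, and bits can be set differently for each layer.
    """
    step = (q_max - q_min) / (2 ** bits - 1)
    w_clipped = np.clip(w, q_min, q_max)
    return q_min + np.round((w_clipped - q_min) / step) * step
```

During training, the rounding is typically bypassed in the backward pass (a straight-through estimator) so that gradients continue to update the underlying high-precision weights.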
Table 1. Models used for benchmarking. Only the CNN and fully connected layers are shown. An RRAM array size of 256x64 is used.

Networks Used for Benchmarking
In this study, we chose 3 neural network and dataset combinations of varying complexity to investigate the impact of realistic device non-idealities on analog IMC accuracy for different network and dataset complexities (Table 1). The first network is a relatively simple VGG-block-based model trained on the CIFAR-10 dataset. This model contains only convolution (Conv) layers, a fully connected (FC) layer, and MaxPool layers. The second network is the Wide ResNet 16-8 model (WRN) [17]. This network uses residual connections and batch normalization in addition to convolution and fully connected layers. We used the WRN 16-8 network for the CIFAR-10 dataset and the more complex CIFAR-100 dataset to test the effects on more challenging tasks.

Hardware Characteristics
We used an 8-bit ADC in our study because it has been found to offer a good balance between energy efficiency and resolution, as reducing the resolution further does not appear to yield a meaningful improvement in energy/sample [18]. RRAM cells with an analog read current range of 0.3 µA ~ 3 µA, an ADC input range of 0 ~ 45 µA, and an array size of 256x64 were considered for the tiled implementation.
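The sketch below shows one plausible way a signed weight could be mapped onto the two cells within this current range; the exact mapping used in our simulator is not reproduced here, so the convention (magnitude on the "on" cell, the paired cell held at the minimum current set by the on/off ratio) is an assumption for illustration.

```python
def weight_to_cell_currents(w, i_max=3e-6, on_off_ratio=10.0):
    """Map a signed weight in [-1, 1] to (positive-column, negative-column)
    read currents. A finite on/off ratio keeps the 'off' cell at
    i_min = i_max / on_off_ratio rather than at zero."""
    i_min = i_max / on_off_ratio
    i_on = i_min + abs(w) * (i_max - i_min)   # cell carrying the weight magnitude
    if w >= 0:
        return i_on, i_min    # negative column idles at i_min
    return i_min, i_on        # positive column idles at i_min
```

With such a mapping, the signed weight is carried by the difference of the two currents, while the common i_min contribution from every cell pair accumulates along both columns and consumes ADC input range, which is one way a low on/off ratio can degrade accuracy.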

Training Process
We first obtain floating-point models using standard practice. For the VGG-block-based [17] models, we trained for 150 epochs using the stochastic gradient descent (SGD) optimizer with a learning rate of 0.001 and a momentum of 0.9. For the WRN models, we follow the parameters described in [19]. Then, different levels of hardware details are progressively introduced during training, as schematically shown in Figure 3, along with the parameters used during the training processes. Specifically, Level-1 models are fine-tuned from the floating-point models, Level-2 models are fine-tuned from Level-1 models, and Level-3 models are fine-tuned from Level-2 models. We found this approach leads to better model inference accuracy compared with training the Level-2 or Level-3 models directly from scratch with random weights [15]. In fact, in previous studies we found that Level-3 MNIST models trained from random weights reached only 77.78% accuracy (compared to 99.13% for a model fine-tuned from the Level-2 and floating-point models), and Level-3 VGG and WRN models produced accuracies of only 10%, which is no better than chance for the CIFAR-10 dataset. In the fine-tuning process, we used a learning rate of 0.001 for the VGG-block-based model and trained for 20 epochs. For the WRN models, we used a learning rate schedule, where the learning rate starts at 0.003 and then steps down to 0.001, 0.0005, and 0.0002 after 5, 10, and 15 epochs, and we trained for a total of 40 epochs.
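For reference, the WRN fine-tuning schedule described above can be written as a simple epoch-indexed step function, shown here as a hypothetical Keras callback rather than our exact training script:

```python
import tensorflow as tf

def wrn_finetune_lr(epoch, lr=None):
    """Step schedule for WRN fine-tuning: 0.003, stepping down to 0.001,
    0.0005, and 0.0002 after epochs 5, 10, and 15 (40 epochs in total)."""
    if epoch < 5:
        return 3e-3
    if epoch < 10:
        return 1e-3
    if epoch < 15:
        return 5e-4
    return 2e-4

lr_callback = tf.keras.callbacks.LearningRateScheduler(wrn_finetune_lr)
# model.fit(train_data, epochs=40, callbacks=[lr_callback])
```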

Effects of Computation Errors in Analog IMC Systems
First, we present the effects of deterministic errors, including weight and activation quantization, signed weight representation, limited RRAM array size, ADC precision limitations, and RRAM cell on/off ratios (Figure 4). For the 3 network-dataset combinations we studied, when only quantization (activations and weights quantized to 8 bits) and the signed weight representation were considered during inference (Level 2), there is minimal accuracy drop compared with just using the quantization-aware trained models [10] (Level 1). We do note that the activation quantization range in the inference pipelines must correspond to the input range of the activation function used during training (ReLU6, etc.); otherwise, there is severe accuracy degradation because the limited quantization range clips the activations.
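A hypothetical numerical example of this range mismatch, using a simple uniform fake-quantizer: with ReLU6 activations, a 0 ~ 6 quantization range preserves the values, whereas a 0 ~ 1 range clips everything above 1 and destroys most of the activation information.

```python
import numpy as np

def fake_quantize(x, q_min, q_max, bits=8):
    step = (q_max - q_min) / (2 ** bits - 1)
    return q_min + np.round((np.clip(x, q_min, q_max) - q_min) / step) * step

relu6_out = np.array([0.0, 0.8, 2.5, 5.9])    # activations after a ReLU6 layer
act_cr = fake_quantize(relu6_out, 0.0, 6.0)   # correct range: values preserved
act_ir = fake_quantize(relu6_out, 0.0, 1.0)   # incorrect range: 2.5 and 5.9 clip to 1.0
```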
However, in the presence of a low device on/off ratio and/or array size and ADC limitations, the quantization-aware trained models cannot produce acceptable accuracies. By introducing the finite on/off ratio properties into the training pipeline, device-aware trained models can successfully mitigate the effect of on/off ratios as low as 10, along with any effects due to the two-column signed weight representation, as shown in Figure 4. On the other hand, in the presence of array size and ADC limitations, device-aware training (i.e., the Level-2 training pipeline) results in poor accuracy for the more complex models or datasets such as WRN. Acceptable results may be produced by Level-2 training for simpler models such as the VGG-block-based model, since the use of only Conv and FC layers makes them generally more resilient to errors. As a result, tile-aware training (i.e., the Level-3 pipeline) must be used for the more complex models or datasets to produce good accuracy, as shown in Figure 4. We believe the more complicated model structure with residual connections and the use of batch normalization layers make the WRN models more sensitive to errors. In particular, models with batch normalization layers are sensitive to changes in activation distribution, and the quantization of partial sums due to ADC precision and range limitations produces a shift in activation distributions [20].


Programming Variation Effects on Inference Accuracy
Next, we examine the effects of device variations on network inference accuracy. Neural network models are trained off-line and then programmed onto memory arrays for inference, and the weights do not change during the inference process. Combined with analog computation, this means any deviations that occur during the device programming process cause inference to be conducted on models that are effectively different from the trained models, leading to potential accuracy degradation. Different from the deterministic errors discussed earlier, the randomness of device variations means each programmed chip maps an essentially different model. Re-training each chip individually may potentially recover the accuracy, but would be very expensive and impractical. In the following, we investigate the impact of device programming variation on the inference accuracy of large-scale DNNs, the effectiveness of mitigation methods, and the factors that affect network robustness against device variation under realistic device and circuit conditions.
We examined the effect of weight variations using models trained with the Level-1, -2, and -3 pipelines, and studied the model accuracy under the corresponding inference conditions (i.e., when only quantization effects; quantization + device on/off ratio; or quantization, on/off ratio, and finite array size and ADC precision effects are present during inference, respectively) (Figure 5). Previous studies have shown that the VGG-block-based model had minimal accuracy drop even at relatively high variation levels, while more complex models show severe accuracy degradation [15]. In this section, we therefore used the more complex WRN-16-8 models for the CIFAR-10 dataset to highlight the effects of device variations.
In the accuracy test, after weight storage, variations were applied additively as Gaussian distributions with a constant standard deviation across all weights (i.e., 4% variation means the standard deviation is 4% of the dynamic range of the memory cells). This variation distribution was chosen as a generic example, since memory technologies have substantially different characteristics, and it represents a near-worst-case scenario. On one hand, many emerging resistive switching devices exhibit state-dependent programming variation, where lower conductance states are associated with lower variations [5,21], which is less detrimental to inference accuracy. On the other hand, programming variations in multi-bit Flash memories are generally more state-independent while also suffering from additional non-linear behaviors [22,23]. In the Level-2 and Level-3 inference pipelines, where signed weights are represented in two columns (Figure 1b), the variations are applied independently to each cell. This is different from variations that are applied directly to the signed weights (Level-1) and means the impacts of weight variations are not equivalent between Level-1 and the other pipelines.

As a natural extension of the architecture-aware training approach, we hypothesize that injecting noise during training may improve inference accuracy. Specifically, we used weight noise injection during training to mimic device programming variations, to produce trained DNN models that give better inference accuracy in the presence of variations. In this implementation, weight noise is added after each mini-batch during training, where an error is drawn from a Gaussian distribution for each weight and then added to it. The standard deviation of the Gaussian distribution is defined relative to the dynamic range of the memory cells.
For example, 1.56% noise injection means the Gaussian distribution has a standard deviation that is 1.56% of the dynamic range of the memory cells. From a general neural network training perspective, noise injection at the inputs, hidden units, and weights during training has long been proposed as a method to improve the generalization ability of neural networks [24-28]. In particular, weight noise injection has been shown mathematically to improve fault tolerance, as it produces networks with a smoother input-output mapping whose output is less sensitive to noise [26]. Recent studies have also applied this method to analog computing systems [29-32]. However, these prior studies are generally limited to small-scale networks or did not consider realistic system limitations like ADC characteristics, device on/off ratios, and especially array size limitations. The improvements in inference accuracy from weight noise injection during training can be observed in Figure 5, and the trend in improvements is consistent across different inference pipelines. In general, higher levels of noise injection lead to better accuracy recovery. For high device variations, noise injection not only allows the average and peak accuracy to recover but also reduces the variation in performance between different runs.
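A minimal sketch of this per-mini-batch noise injection, written as a hypothetical Keras callback. Here the per-layer weight range is used as a stand-in for the dynamic range of the memory cells, which is an assumption of this sketch rather than the exact implementation in our simulator:

```python
import tensorflow as tf

class WeightNoiseInjection(tf.keras.callbacks.Callback):
    """Add Gaussian noise to every trainable weight after each mini-batch."""

    def __init__(self, noise_fraction=0.0156):   # e.g. 1.56% noise injection
        super().__init__()
        self.noise_fraction = noise_fraction

    def on_train_batch_end(self, batch, logs=None):
        for var in self.model.trainable_weights:
            # Standard deviation defined relative to this weight tensor's range
            # (used here as a proxy for the memory-cell dynamic range).
            w_range = tf.reduce_max(var) - tf.reduce_min(var)
            noise = tf.random.normal(tf.shape(var),
                                     stddev=self.noise_fraction * w_range)
            var.assign_add(noise)
```

For a higher injection level, noise_fraction would simply be raised accordingly; model.fit(..., callbacks=[WeightNoiseInjection()]) applies the injection during fine-tuning.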
The improvements from noise injection can also be observed from the model outputs directly. Figure 6a shows the error in model outputs caused by device programming variation with the CIFAR-10 validation dataset as input. For inference with a programming variation level of

Effect of Higher Target Programming Resolution
Although an 8-bit programming target resolution cannot be reliably represented by devices with a variation even as low as 1.56%, we found that, compared to a 4-bit programming target, the higher target resolution produces models that are more robust to programming variations (Figure 7). Thus, an 8-bit programming target resolution was used in this work. We believe this is because,


Impact of Learning Rate
When a simple constant learning rate of 0.0002 is used in the fine-tuning process instead of the learning rate schedule described in Section 2.2, the models achieved similar accuracies with no weight variation. When weight variation is introduced, device-aware trained models (Level 2) showed no clear pattern between the two different learning rates, while tile-aware trained

Conclusion
In this work, we took a systematic look at the weight variation effect caused by memory device programming in analog IMC systems, which appears to be the most difficult error source to mitigate, under realistic system conditions including device on/off ratios, memory array size, ADC characteristics, and signed weight representations. We also show that proper noise injection can improve model robustness against weight variations. However, in the presence of moderate to high variations and for complex tasks and models, these methods may not be able to fully recover the accuracy drop. Thus, further developments in algorithms that produce neural networks more robust against weight variations could be critical for the practical deployment of analog IMC systems for neural network workloads.

Figure 1. a) Tiled analog in-memory computing systems. Large DNN layers are mapped onto multiple memory arrays. Analog outputs of each array are digitized by ADCs to produce partial sums. The partial sums are then summed in the digital domain to produce the final layer output. b) Signed weights are represented on two memory cells in two different columns. c) Device characteristics considered in this study. d) Neural network models are trained off-line and then programmed onto memory arrays for inference.


Figure 3. Architecture-aware training process and parameters used during training.


Figure 4. Effect of signed weights represented in two cells, on/off ratio, ADC, and array size limitations on inference accuracy. Act Quant CR: activation and weight quantization to 8 bits, with the activation quantization range corresponding to the ReLU6 range used during training, on/off ratio 1000. Act Quant IR: activation and weight quantization to 8 bits, with an activation quantization range of 0-1 that does not correspond to the ReLU6 range used during training. Act Quant CR + On/Off: low on/off ratio of 10. Act Quant CR + On/Off + Array + ADC: array size 256x64, 8-bit ADC. Floating-point accuracies for the CIFAR-10 VGG, CIFAR-10 WRN, and CIFAR-100 WRN models are 83.73%, 95.11%, and 74.74%, respectively.

Figure 5. Variation effect under different inference pipelines for the WRN-16-8 network on the CIFAR-10 dataset, for models trained with different levels of noise injection. The variation level is defined as the standard deviation relative to the dynamic range of the weights. The boxplots show the model inference accuracy distribution from 40 runs. Legend: variation injected during training. Orange lines: floating-point baseline. Ideal Array and ADC: no array size limitation, no quantization or range limitation of the output. Realistic ADC: 8-bit ADC with a 0 ~ 45 µA input range as described in Section 3.2, array size 256x64.

Figure 6. The proportional error of the network model inference output with weight programming variation compared to the inference output without variation, for the Level-3 model in the Level-3 inference pipeline with an on/off ratio of 10. Programming variations are simulated 40 times, and the results are aggregated. Tr. Var.: variation injected during training. Inf. Var.: variations experienced in the device programming process. (a) Images from the validation dataset as input to the network. (b) Random pattern as input to the network.

Figure 9. Effects of learning rates. a) Device-aware models trained with a learning rate of 0.0002, evaluated in the tiled inference pipeline. b) 75th percentile accuracies in Figure 9a normalized to those of models trained with the learning rate schedule (Figure 5b). c) 75th percentile accuracies of tile-aware models evaluated in the tiled pipeline with an on/off ratio of 10. d) Accuracies in Figure 9c normalized to those of models trained with the learning rate schedule (Figure 5d).