Optimization of Projected Phase Change Memory for Analog In‐Memory Computing Inference

Phase change memory (PCM) is one of the most promising candidates for non‐von Neumann based analog in‐memory computing, particularly for inference of previously‐trained deep neural networks (DNNs). It is shown that PCM electrical properties can be tuned systematically using a projection liner, which is designed for resistance drift mitigation, in the manufacturable mushroom PCM. A systematic study is performed of the electrical properties, including resistance values, memory window, resistance drift, and read noise, and of their impact on the accuracy of large neural networks of various types with tens of millions of weights. It is shown that the PCM with liner improves DNN accuracy both shortly after programming and in the long term, owing to reduced read noise and reduced resistance drift, respectively, despite the trade‐off of a reduced memory window. The liner conductance, PCM device characteristics, and network inference accuracy are correlated with the PCM memory window and reset state conductance, which allows us to identify the device optimization space for better short term and long term accuracy in large neural networks.


Introduction
Phase change memory (PCM) is a promising candidate for non-von Neumann based analog in-memory computing-

www.advelectronicmat.de
different structure, complexity, and type of nonlinear activation functions employed, etc. We also study the impact of various device characteristics on network accuracy and identify a range of target device specifications for PCM with liners for further improvement.
Previously, without benchmarking against large neural network inference accuracy, PCM devices for in-memory computing were difficult to compare, because their many device characteristics have trade-offs and correlations. This work solves this problem by providing extensive benchmarking of PCM devices in large neural networks and provides guidelines for projected PCM device optimization.

PCM Device Structure and Characteristics
In this work, we investigate the mushroom type PCM device with liner (Figure 1a). The mushroom type of PCM has better manufacturability than the confined cells, which require filling PCM materials into a hole 10 to 30 nm in diameter and 100 to 200 nm deep, [13] typically using atomic layer deposition (ALD) techniques. The mushroom PCM, on the other hand, can be fabricated with much simpler processes using only physical vapor deposition (PVD) without additional masks. It is also more manufacturable than the bridge cell, which requires patterning and etching the PCM into a bridge tens of nanometers wide [6] and hundreds of nanometers long. At these dimensions, it is challenging to etch the PCM with a clean edge and uniform width. In the fabrication process of the mushroom PCM with projection liner, the liner layer is first deposited on top of the bottom electrode (heater), followed by PCM material deposition on top of the liner. The top electrode is then deposited on top of the PCM material. The phase change material is a doped GeSbTe (Figure 1a).
Schematics of the current flow in the low voltage read and high voltage write conditions for the full reset state are illustrated in Figure 1b,c, respectively. In the full reset state, the PCM has an amorphous dome, whose resistivity is much larger than that of the crystalline PCM. [14] The liner conductivity is chosen to be higher than that of the amorphous PCM but lower than that of the crystalline PCM. During PCM read, the current bypasses the amorphous dome (Figure 1b), flowing laterally in the liner and then vertically into the crystalline PCM region. Because the amorphous dome region, which has a high drift coefficient, is bypassed by the current, the resistance drift is reduced. In the PCM write condition, all the pulse current can go through the amorphous dome (Figure 1c), due to the much reduced PCM resistance at voltages higher than the threshold switching voltage.
The projection liner not only changes the drift property of the PCM devices, but also affects other properties such as the memory window and noise. These metrics often impose important trade-offs. For example, the PCM with liner can be modelled using a finite element model of a distributed resistance network (Figure 1d). [15] Among other things, there is a trade-off between the memory window and the drift coefficient (Figure 1e). The liner conductance needs to be carefully chosen to give the best balance between the memory window and the drift coefficient. In this paper, we systematically vary the liner conductivity and thickness to tune and evaluate PCM performance. We first characterize the key PCM performance metrics and then study the implications of these PCM device characteristics on DNN accuracy across a variety of networks.
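The trade-off between memory window and drift coefficient can be illustrated with a deliberately simplified lumped model. This is a sketch, not the finite element network of Figure 1d: it assumes that the RESET read path is a drifting amorphous resistance in parallel with a non-drifting liner resistance, and that the SET resistance is unaffected by the liner. The function names and all numerical values are hypothetical.

```python
import math

def reset_resistance(t, r_amorphous_0, r_liner, nu_amorphous, t0=1.0):
    """Read resistance at time t: drifting amorphous dome in parallel with a non-drifting liner."""
    r_amorphous = r_amorphous_0 * (t / t0) ** nu_amorphous
    return 1.0 / (1.0 / r_amorphous + 1.0 / r_liner)

def effective_drift(r_amorphous_0, r_liner, nu_amorphous, t1=1.0, t2=1000.0):
    """Effective drift coefficient: slope of log R versus log t between t1 and t2."""
    r1 = reset_resistance(t1, r_amorphous_0, r_liner, nu_amorphous)
    r2 = reset_resistance(t2, r_amorphous_0, r_liner, nu_amorphous)
    return math.log(r2 / r1) / math.log(t2 / t1)

# All values are hypothetical and chosen only for illustration.
R_SET = 1.0e4        # SET resistance (ohm), assumed unaffected by the liner
R_AMORPHOUS = 2.0e6  # amorphous dome resistance at t0 (ohm)
NU_AMORPHOUS = 0.1   # drift coefficient of the amorphous dome

for r_liner in (math.inf, 1.0e6, 2.0e5):  # math.inf stands for "no liner"
    window = reset_resistance(1.0, R_AMORPHOUS, r_liner, NU_AMORPHOUS) / R_SET
    nu_eff = effective_drift(R_AMORPHOUS, r_liner, NU_AMORPHOUS)
    print(f"R_liner = {r_liner:.1e} ohm: memory window = {window:.0f}x, effective nu = {nu_eff:.3f}")
```

In this toy model, a more conductive liner (smaller `r_liner`) lowers both the memory window and the effective drift coefficient, which is the qualitative behaviour the liner is designed to exploit.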
The resistance versus pulse voltage programming curves (Figure 2a) show that the RESET resistance is modified by projection liners with different sheet conductance. As the liner conductance is increased from liner-A to liner-C, the corresponding RESET resistance and memory window decrease. The read resistance versus time is also recorded in a log-log plot (Figure 2b). The slope of the log R versus log t curve is the drift coefficient ν, given by the power law $R(t) = R_0 (t/t_0)^{\nu}$. The device without liner has a higher slope, which corresponds to a higher resistance/conductance drift coefficient. The devices with liner have lower drift coefficients. There is also a clear difference in the read noise: the PCM device with liner has smaller fluctuations in the read resistance values, and thus smaller read noise. We also characterize the endurance of the phase change memory without and with projection liner (Figure 2c). The PCM with liner shows less change in the RESET resistance up to $10^8$ cycles of SET and RESET programming, because the RESET resistance of the PCM with liner is determined primarily by the liner material, as opposed to the PCM material.
The resistance (conductance) drift coefficient and read noise are extracted from the resistance versus time curves for various programmed resistance values throughout the available resistance range. The drift and noise extraction is described in Section S1 and Figure S1 (Supporting Information). The resistance versus time data is first fit with a power law $R(t) = R_0 (t/t_0)^{\nu}$ to get the drift coefficient ν. The fitted drift trend is then subtracted from the resistance versus time data. The remaining fluctuation of the resistance versus time is due to the read noise. The spectrum analysis shows that the source of the read noise is mainly 1/f noise (Figure S1, Supporting Information).
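The extraction procedure described above can be sketched as a linear fit in log-log space followed by removal of the fitted trend. This is a minimal illustration on synthetic data, not the procedure of Section S1: the function name is hypothetical, the reference time is assumed to be 1 s, the device parameters are made up, and white Gaussian noise is used for simplicity instead of the 1/f noise observed in the measurements.

```python
import numpy as np

def extract_drift_and_noise(t, r):
    """Fit R(t) = R0 * (t/t0)^nu with t0 = 1 s; return nu, R0, and the relative noise STD."""
    nu, log_r0 = np.polyfit(np.log(t), np.log(r), 1)  # slope and intercept in log-log space
    trend = np.exp(log_r0) * t ** nu
    residual = r / trend - 1.0  # relative fluctuation left after removing the drift trend
    return nu, np.exp(log_r0), residual.std()

# Synthetic read-out: one read per second for 1000 s (all device values hypothetical).
rng = np.random.default_rng(0)
t = np.arange(1.0, 1001.0)
true_nu, true_r0, true_noise = 0.05, 1.0e6, 0.01
r = true_r0 * t ** true_nu * (1.0 + true_noise * rng.standard_normal(t.size))

nu, r0, noise = extract_drift_and_noise(t, r)
print(f"nu = {nu:.3f}, R0 = {r0:.3g} ohm, relative noise STD = {noise:.3f}")
```

Fitting in log-log space turns the power law into a straight line, so the drift coefficient is simply the slope; dividing by the fitted trend, rather than subtracting it, gives a noise figure that is already normalized to the resistance level.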
We show that the PCM device resistance drift coefficient can be tuned systematically by changing the projection liner conductance (Figure 3a). The memory window changes at the same time, as can be seen from the x-axis of Figure 3a. The drift variability is also measured (Figure 3b). The devices with liners show a reduced drift coefficient as well as reduced drift coefficient variability for the Set, Reset, and Intermediate states.
The noise can be characterized as the standard deviation of the read conductance over time after subtracting the drift. [16]

The device resistance/conductance was read every second for 1000 s. A power law drift coefficient is fitted from the time series data. After the drift trend is subtracted from the time series data, the remaining data has no drift component and shows only the random read value fluctuation due to noise. This read fluctuation is plotted (Figure 3c) as the normalized standard deviation (STD) of the read conductance versus the conductance value. It is clear that the PCM devices with liners have lower read noise than the PCM device without liner.

DNN Inference Accuracy Investigation
The PCM devices were then simulated as the analog weight elements for the matrix-vector multiplication operation (Figure 4a) in DNNs, including LSTM, ResNet-32, and BERT. Different weight mapping schemes are considered in this work (Figure 4b-d). The 1 PCM Direct Weight Mapping scheme (Figure 4b) uses two PCMs in a differential configuration in one unit cell. One PCM represents positive weights and the other represents negative weights. Since only 1 PCM is used at a time, we call it 1 PCM Direct Weight Mapping in this paper. Similarly, the 2 PCM Direct Weight Mapping scheme (Figure 4c) uses four PCMs in one unit cell. Two PCMs represent positive weights and the other two represent negative weights. Each of the two PCMs in use represents 50% of the weight value. Since only 2 PCMs are used at a time, we call it 2 PCM Direct Weight Mapping in this paper. The 4 PCM Optimized Weight Mapping (Figure 4d) is an advanced scheme that optimizes the weight mapping over the entire array and applies advanced drift mitigation to greatly improve the network accuracy over time. In this scheme, we employ a weight programming optimization strategy that minimizes weight errors, including their dependence on time due to conductance drift, by exploring a vast search space in which weights can be distributed variably across the multiple conductances. This results in an optimization problem where we aim to minimize weight errors such that $\beta W \approx F\,(G^{+} - G^{-}) + g^{+} - g^{-}$, where W is the unitless weight, β is the unitless-to-hardware rescaling factor, F is the significance factor between the two conductance pairs, and the value of each conductance can be a function of the unitless weight. We optimized a time-averaged and normalized mean squared error metric to anticipate and minimize programming errors, drift errors, and read noise. The anticipated drift is also factored into the initial weight mapping. The F factor is optimized over discrete values such as 1, 2, and 4. For a more detailed description, please refer to Ref. [12]. Since 4 PCMs are needed to represent a weight at all times, we call it 4 PCM Optimized Weight Mapping. More details of the hardware implementation are listed in Section S2 and Table S1 (Supporting Information).
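The two direct mapping schemes can be sketched as follows. This is an illustrative reading of the description above, not the simulator's implementation: the function names, the conductance limits, and the choice to park unused devices at the minimum conductance are assumptions. The 4 PCM optimized scheme of Ref. [12] involves a full optimization and is not reproduced here.

```python
import numpy as np

def map_direct(weights, g_min, g_max, pcm_per_sign=1):
    """Map unitless weights onto differential PCM conductances.

    pcm_per_sign=1 mimics 1 PCM Direct Weight Mapping; pcm_per_sign=2 mimics
    2 PCM Direct Weight Mapping, where each device carries half of the weight.
    Returns (g_plus, g_minus, beta); both arrays have shape (pcm_per_sign, n_weights).
    """
    w = np.asarray(weights, dtype=float)
    # Rescaling factor: the largest |w| uses the full programmable range of every device.
    beta = pcm_per_sign * (g_max - g_min) / np.max(np.abs(w))
    share = beta * np.abs(w) / pcm_per_sign            # conductance above g_min per device
    g_plus = np.where(w > 0, g_min + share, g_min)     # unused devices stay at g_min (assumed)
    g_minus = np.where(w < 0, g_min + share, g_min)
    return np.tile(g_plus, (pcm_per_sign, 1)), np.tile(g_minus, (pcm_per_sign, 1)), beta

def read_weights(g_plus, g_minus, beta):
    """Recover unitless weights from the summed differential conductances."""
    return (g_plus.sum(axis=0) - g_minus.sum(axis=0)) / beta

example_w = np.array([-1.0, -0.25, 0.0, 0.5])
for n in (1, 2):
    gp, gm, beta = map_direct(example_w, g_min=0.5, g_max=25.0, pcm_per_sign=n)  # hypothetical limits
    print(n, "PCM direct:", read_weights(gp, gm, beta))
```

Because the weight is read as a difference of conductances, the common `g_min` offset cancels, and splitting a weight over two devices doubles the conductance span available to represent it.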
We examine the accuracy of DNNs using these PCMs with and without liners using IBM's analog AI simulation tool. [17] We compare all devices across 3 DNNs and 4 datasets: ResNet-32 evaluated on the CIFAR-10 dataset, a 2-layer LSTM evaluated on the Penn Treebank dataset, the BERT network evaluated on the MRPC dataset, and the BERT network evaluated on the MNLI dataset.
The inference accuracy for ResNet-32 (Figure 5a) and LSTM (Figure 5b) is plotted versus inference time after programming for 1 PCM Direct Weight Mapping and 4 PCM Optimized Weight Mapping. The time range considered here is from 1 s to 1 month after programming. The well designed liner devices consistently perform better than the no liner devices in both initial accuracy and long term accuracy. The better long term accuracy of the liner devices is due to the reduced drift coefficient and reduced drift variability. For the 1 PCM Direct Weight Mapping, the liner devices show very significant long term inference accuracy improvement. The improvement of accuracy after 1 month is much more significant than the accuracy improvement right after programming due to the reduced resistance drift. For the 4 PCM Optimized Weight Mapping case, where there is already advanced drift compensation, there is still a respectable improvement in long term inference accuracy. It should be noted that achieving higher accuracy is very desirable for most applications even if the improvement is a few percent.
Figure 4. ... Only two PCMs are used to represent either a positive or a negative weight by splitting the weight in two equal halves, and d) optimized weight mapping scheme (4 PCM Optimized), where each weight is mapped onto four PCMs, G+, G−, g+, and g−, with varying significance defined by F. A weight programming optimization framework captures all memory imperfections and hardware compensatory techniques, and produces optimal weight programming strategies as described in reference [12].
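The role of the drift coefficient and of its variability in the long term behaviour can be illustrated with a small Monte Carlo sketch. It is not the analog AI simulation used in this work: it assumes a conductance power law $G(t) = G_0 (t/t_0)^{-\nu}$ with a normally distributed ν per device, uses a single global rescaling as a simple stand-in for drift compensation, and all drift values and conductance ranges are hypothetical.

```python
import numpy as np

ONE_MONTH_S = 30 * 24 * 3600  # seconds

def conductance_error_after_drift(nu_mean, nu_std, t, n_devices=10_000, t0=1.0, seed=0):
    """Relative RMS conductance error at time t, without and with a global drift rescaling."""
    rng = np.random.default_rng(seed)
    g0 = rng.uniform(1.0, 25.0, size=n_devices)       # hypothetical programmed conductances
    nu = rng.normal(nu_mean, nu_std, size=n_devices)  # device-to-device drift variability
    g_t = g0 * (t / t0) ** (-nu)                      # conductance decays as resistance drifts up
    g_rescaled = g_t * (t / t0) ** nu_mean            # undo the mean drift only
    norm = np.sqrt(np.mean(g0 ** 2))
    raw = np.sqrt(np.mean((g_t - g0) ** 2)) / norm
    rescaled = np.sqrt(np.mean((g_rescaled - g0) ** 2)) / norm
    return raw, rescaled

# Hypothetical drift statistics for a device without and with a liner.
for label, nu_mean, nu_std in (("no liner", 0.06, 0.02), ("liner", 0.01, 0.005)):
    raw, rescaled = conductance_error_after_drift(nu_mean, nu_std, ONE_MONTH_S)
    print(f"{label}: raw error = {raw:.3f}, after global rescaling = {rescaled:.3f}")
```

In this sketch the global rescaling removes the mean drift but not the device-to-device spread, so a smaller drift variability still helps even when drift is compensated.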
It is clear that the liner device performance depends heavily on the liner parameters, which need to be carefully chosen for the best balance of the overall device characteristics to achieve the best accuracy. Although the liner-C device has the lowest drift coefficient due to its higher liner conductivity, its performance in DNNs is worse (Figure 5a) than that of liner-A and liner-B, due to its much lower memory window.
Besides the long term accuracy, the initial accuracy of the liner devices is also significantly improved compared with the no liner devices. The initial accuracy improvement is not due to the reduced drift, as drift does not affect the accuracy right after programming. It is instead due to the reduced noise of the liner devices. Since the liner devices have both reduced drift and reduced noise, we need to separate the two effects in order to see the effect of the noise reduction. To illustrate this, we show the inference error of liner devices in ResNet-32 (Figure 5c) using the experimentally measured noise (blue) and a hypothetically higher noise (red), which is set equal to that of the no liner device. When we simulate using the higher noise (red), the inference error is higher across the considered time range. When we simulate using the actual reduced noise of the PCM device with liner, the inference error is much lower both right after programming and in the long term. While the noise reduction is more helpful for 1 PCM Direct Weight Mapping than for 4 PCM Optimized Weight Mapping, it still shows a clear benefit for the optimized weight mapping scheme.
The effect of the reduced noise also differs between liner devices. The noise reduction for the liner-C device has a bigger impact on accuracy (Figure 5c) than for liner-A and liner-B, because liner-C has the smallest memory window and is therefore more sensitive to noise.
The inference accuracy for BERT with the MRPC dataset (Figure 6a) and the MNLI dataset (Figure 6b) is plotted versus inference time after programming for the 1 PCM Direct Weight Mapping, 2 PCM Direct Weight Mapping, and 4 PCM Optimized Weight Mapping schemes. The general trend is the same as for the ResNet-32 and LSTM networks (Figure 5). Compared with 1 PCM Direct Weight Mapping, the 2 PCM mapping has a doubled memory window due to splitting the weight equally onto two PCM devices. As a result, 2 PCM Direct Weight Mapping is more tolerant of drift and its inference error is typically lower than that of 1 PCM Direct Weight Mapping. This is especially significant for the liner-C device, whose memory window is the smallest. The 4 PCM Optimized Weight Mapping with the advanced drift compensation shows much better performance than the direct weight mapping. However, using projected PCMs with liners in direct weight mapping can achieve performance similar to that of the 4 PCM optimized weight mapping using the conventional no liner PCM devices.
Figure 5. DNN inference test error versus inference time after programming for: a) ResNet-32 network evaluated on CIFAR-10, b) 2-layer LSTM network evaluated on Penn Treebank, c) Inference error of ResNet-32 using different noise scales in the simulation. The blue line represents the results using the actual experimental noise for liner devices. The red line represents the results using the same noise as the no liner device. This simulation separates the noise reduction from the drift reduction and shows its importance for initial error reduction.

Optimization for Improved Inference Accuracy
As discussed above, an optimization scheme is needed to achieve the best overall balance of these characteristics for the best DNN inference accuracy. The PCM resistance drift, read noise, and memory window can all be correlated with the projection liner conductance (Figure 7). While the drift coefficient (Figure 7a) and read noise (Figure 7b) are reduced with increased liner conductance, the memory window (Figure 7c) is also reduced with increased liner conductance. It is important to note that the memory window reduction in liner devices stems primarily from an increase in $G_{\min}$, while $G_{\max}$ is kept almost the same (Figure 7d). When $G_{\min}$ is much smaller than $G_{\max}$, the PCM conductance range ($G_{\max} - G_{\min}$) available for programming changes only slightly with $G_{\min}$, although the memory window ($G_{\max}/G_{\min}$) changes much more significantly. This is an important factor that allows the projected PCM devices to improve network inference accuracy over a wide range of liner conductance.
As the projection liner modulates the PCM read resistance at the reset state instead of the set state, all the changes in device characteristics can be related to the reset state conductance $G_{\min}$. Since the maximum PCM conductance $G_{\max}$ changes little, the change in $G_{\min}$ can be correlated to the change in memory window as well. Since the memory window is much easier to measure than the liner conductance, we use the change in memory window as an indicator of the change in liner conductance. Therefore, we examine the optimization by looking at the dependence of the measured device characteristics and network inference accuracy on memory window (Figure 8). We can also directly use $G_{\min}$ instead of memory window in Figure 8, which is discussed in Section S3 and Figure S2 (Supporting Information).
As shown in Figure 8a, although the memory window ($G_{\max}/G_{\min}$) is reduced from ≈200x for the no liner device to ≈40x for the liner-A device or ≈20x for the liner-B device due to the increase in $G_{\min}$, the reduction in the conductance programming range ($G_{\max} - G_{\min}$) is < 10%. This is a good range in which to operate, because drift and noise are significantly reduced while a reasonable conductance programming range is kept. There is clearly a trade-off between the memory window and the drift and noise characteristics. When the memory window shrinks because a more conductive liner is used, the other characteristics, including drift coefficient (Figure 8b), variability of drift coefficient (Figure 8c), and read noise (Figure 8d), all improve.
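The relation between memory window and programming range can be checked with a short calculation. The sketch below assumes that $G_{\max}$ is exactly unchanged by the liner, which the text states only approximately, and takes the ≈200x, ≈40x, and ≈20x windows quoted above; the function name is hypothetical.

```python
def programming_range_reduction(window_ref, window_new):
    """Fractional loss of (G_max - G_min) when the window G_max/G_min shrinks at fixed G_max."""
    range_ref = 1.0 - 1.0 / window_ref  # (G_max - G_min) / G_max for the reference device
    range_new = 1.0 - 1.0 / window_new
    return 1.0 - range_new / range_ref

for name, window in (("liner-A", 40.0), ("liner-B", 20.0)):
    loss = programming_range_reduction(200.0, window)
    print(f"{name}: window {window:.0f}x, programming range reduced by {100 * loss:.1f}%")
```

Under this assumption the programming range shrinks by about 2% for a 40x window and about 4.5% for a 20x window, even though the memory window itself drops by a factor of 5 to 10.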

We compare the initial error (1 s after programming) and the long term inference error at 1 month after programming for various networks as a function of PCM memory window (Figure 8e-h). As the liner conductance increases, the DNN inference error first decreases, for memory windows above 10x (Figure 8e-h), both initially and in the long term. In this range, improved device drift and noise characteristics outweigh the reduction in the memory window. The initial error decrease is due to reduced noise. The long term error decrease is due to reduced drift. However, when the memory window decreases further to well below 10x, the DNN inference error increases (Figure 8e-h). There is clearly an optimum memory window range where the overall PCM performance leads to the best DNN accuracy. This optimum point depends on many factors, including the actual conductance values of the PCMs and the weight mapping scheme. The optimization for different weight mapping schemes is discussed in Section S4 and Figure S3 (Supporting Information).
Another knob that can be tuned is the dependence of the drift coefficient on PCM conductance. We simulate devices with various hypothetical drift versus conductance profiles that have upticks of drift coefficient at various locations (Figure 9a). The objective is to understand where in the PCM conductance range an increase in drift coefficient causes the most network accuracy degradation. In this study, the network inference error versus time of ResNet-32 for CIFAR-10 (Figure 9b) and LSTM for PTB (Figure 9c) is investigated. In the case of ResNet-32, the device with the drift coefficient uptick near the high conductance range degrades the network accuracy the most. In the case of the LSTM network, the drift coefficient uptick near the intermediate states has the biggest influence on the accuracy. These results can be correlated to the weight distribution in those networks: weights tend to be more uniformly distributed in the ResNet-32 case, making the higher conductance values more prevalent and important to overall network accuracy, whereas weights in the LSTM network are more normally distributed, with a much lower density of weights at the extremes. The weight distributions of ResNet-32 and LSTM are shown as insets in Figure 9b and Figure 9c, respectively.
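The link between weight distribution and the most drift-sensitive conductance range can be illustrated with synthetic weights. The uniform and standard normal samples below are stand-ins, not the actual ResNet-32 or LSTM weights; the function name, the assumption that the largest weight magnitude maps to the maximum conductance, and the 50% threshold are all hypothetical choices.

```python
import numpy as np

def fraction_in_upper_range(weights, threshold=0.5):
    """Fraction of weights whose magnitude exceeds `threshold` of the largest magnitude."""
    w = np.abs(np.asarray(weights, dtype=float))
    return float(np.mean(w / w.max() > threshold))

rng = np.random.default_rng(0)
uniform_w = rng.uniform(-1.0, 1.0, 100_000)  # stand-in for a broad, uniform-like distribution
normal_w = rng.standard_normal(100_000)      # stand-in for a bell-shaped distribution

print(f"uniform-like: {fraction_in_upper_range(uniform_w):.1%} of weights in the upper half")
print(f"normal-like:  {fraction_in_upper_range(normal_w):.1%} of weights in the upper half")
```

With a uniform-like distribution roughly half of the weights land in the upper half of the conductance range, whereas with a bell-shaped distribution only a few percent do, so a drift uptick at high conductance affects far fewer weights in the second case.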
We also investigated the trade-off between the noise and memory window in more detail (Figure 9d-f). PCM devices with various hypothetical noise scales and memory windows are considered (Figure 9d). The error of ResNet-32 (Figure 9e) and the perplexity of LSTM (Figure 9f) versus memory window show that a smaller memory window from using a liner does not necessarily hurt the network accuracy, unless the memory window is too small (≈10x), as is the case for the liner-C device studied here. For liner-A and liner-B, due to the noise reduction of 50% compared to no liner devices, the accuracy of the network actually increases despite the reduction in memory window. In this case the reduction in programming range $G_{\max} - G_{\min}$ is smaller than the reduction in the read noise, which means more distinguishable states for the liner devices. The trend is the same for ResNet-32 and LSTM.
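The argument about distinguishable states can be made concrete with a rough figure of merit, the programming range divided by the read noise, taken relative to a reference device. This is a sketch under strong assumptions: $G_{\max}$ is fixed, the noise is a single conductance-independent number, the reference window is 200x, and the function name is hypothetical. It captures only the range-versus-noise argument and does not reproduce the accuracy loss reported for very small windows such as liner-C.

```python
def relative_distinguishable_states(window, noise_scale, window_ref=200.0):
    """(G_max - G_min) / sigma relative to a reference device with the same G_max.

    noise_scale is the read noise of the device divided by that of the reference.
    """
    range_ratio = (1.0 - 1.0 / window) / (1.0 - 1.0 / window_ref)
    return range_ratio / noise_scale

# Windows of about 40x and 20x with 50% lower noise, compared with a 200x no liner device.
for window in (40.0, 20.0):
    ratio = relative_distinguishable_states(window, noise_scale=0.5)
    print(f"window {window:.0f}x, half the noise: {ratio:.2f}x the distinguishable states")
```

Because the programming range shrinks by only a few percent while the noise is halved, this ratio nearly doubles for both windows.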

Conclusion
We systematically study PCM devices with varied projection liners to understand the changes and trade-offs in the device performance, as well as the implications for DNN accuracy across a variety of large deep neural networks. This is achieved by characterizing and incorporating conductance dependent drift and noise models into neural network simulations to quantify accuracy using these PCM devices. The optimized target specifications require a balance of various factors, including resistance drift at various states, dynamic window, and read noise. Projection liners can be used to tune the PCM device to achieve these targets. Properly designed mushroom PCM devices with liner can achieve better initial accuracy and long term accuracy than the PCM without liner, due to reduced resistance drift and read noise, despite the trade-off of reduced memory window. A method to correlate the PCM device characteristics with the memory window and the reset state conductance is identified to enable the optimization of the PCM with liner devices for better in-memory computing inference accuracy.

Experimental Section
The analog memory characteristics reported in this work stem from mushroom-type phase-change memory (PCM) devices comprised of doped germanium-antimony-tellurium (GST). PCM devices were initially conditioned using $10^5$ full RESET pulses, which melt and rapidly quench the PCM material into an amorphous (i.e., minimum conductance) state. This was achieved using a 100 ns pulse duration with amplitudes of 4.5 V. A full SET (i.e., maximum conductance) pulse has a voltage amplitude of approximately 2 V, a pulse duration of 1 µs, and a pulse trailing edge of 1 µs. The programming of the full SET state and intermediate analog states was achieved through careful optimisation of SET and RESET pulses, which included exploring various combinations of SET pulse voltages, durations, and trailing edges. The RESET pulse amplitude was scanned from 1 to 4.5 V in the resistance-voltage (R-V) characterization. The endurance test was done by programming the device with SET and RESET pulses up to $10^8$ cycles. The PCM device was read 10 times per decade of increased pulse number.
Conductance (resistance) drift was measured over a period of 1000 s using 20 points per decade. Drift coefficients are obtained by fitting conductance versus time using a power-law dependence. The mean and standard deviation in drift coefficients were extracted for various conductance values to produce the conductance-dependent drift characteristics reported in this work.
Read noise was characterized by performing 1000 sequential current measurements using a one second time spacing. Read noise is then extracted after subtracting drift from the conductance versus time data. This was performed for a range of conductance values to produce conductance-dependent read noise characteristics. In all measurements, PCM conductances are measured by applying a fixed read voltage of 0.2 V and measuring the resulting current.

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.