Demonstration of Differential Mode Ferroelectric Field‐Effect Transistor Array‐Based in‐Memory Computing Macro for Realizing Multiprecision Mixed‐Signal Artificial Intelligence Accelerator

Harnessing multibit precision in nonvolatile memory (NVM)‐based synaptic cores can accelerate the multiply and accumulate (MAC) operations of deep neural networks (DNNs). However, NVM‐based synaptic cores suffer from a trade‐off between bit density and performance. The undesired performance degradation with scaling, limited bit precision, and asymmetry associated with weight update pose a severe bottleneck in realizing a high‐density synaptic core. Herein, the following are demonstrated: 1) evaluation of a novel differential mode ferroelectric field‐effect transistor (DM‐FeFET) bitcell on a crossbar array of 4k devices; 2) validation of the weighted sum operation on a 28 nm DM‐FeFET crossbar array; 3) a bit density of 223 Mb mm−2, which is an ≈2× improvement compared to a conventional FeFET array; 4) an energy efficiency of 196 TOPS/W for the VGG‐8 network; and 5) superior bit error rate (BER) resilience, with ≈94% training and 88% inference accuracy at 1% BER.

(also known as the memory wall) for successfully deploying full-scale networks using IoT devices. To address this limitation, two architectural approaches have been developed: 1) near-memory computing and 2) in-memory computing (IMC). [4] Recent progress in the field of emerging nonvolatile memory (eNVM) has led to several successful demonstrations of IMC architectures showing significant gains in energy efficiency and computation speed. [5][6][7][8] Hafnium oxide (HfO2)-based ferroelectric (Fe) memories have shown strong potential as eNVM owing to their compatibility with standard CMOS technology, superior endurance, and data retention. This has enabled very-large-scale integration (VLSI) of ferroelectric field-effect transistor (FeFET)-based macros. [9,10] However, the major drawback lies in the device-to-device (D2D) variation in the drain current (ΔI_d^D2D) of the FeFET cells, especially for the low-voltage threshold (LVT) state, [9][10][11] which hinders high-precision arithmetic operations. Although the scientific community has spent enormous effort improving this from the process and device points of view, most demonstrations are limited to standalone devices. [12,13] A previous report on 1F-1R devices showed mitigation of ΔI_d^D2D in FeFETs, focusing on 1 bit per cell operation. [14] In this work, we focus on the evaluation of a differential mode FeFET (DM-FeFET) array macro for performing multilevel multiply and accumulate (MAC) operations. Key contributions of this study include the following: 1) A novel differential bitcell architecture with multibit precision capability and high error resilience is proposed. The proposed bitcell can be adapted into any standard FeFET array with only periphery-level changes. 2) The error resilience of the bitcell for current-sensing-based strategies is validated using an image encoded to a 2-bit representation, showcasing a BER improvement of up to 2×. 3) The proposed bitcell is validated for multiple precisions, demonstrating inference accuracies >88% on the CIFAR-10 dataset [15] using the VGG-8 network. [16] 4) The proposed bitcell achieves excellent energy efficiency (196 TOPS/W), comparable to state-of-the-art realizations based on NVM.
Figure 1a shows the stack of the device used for the study, with a scanning electron microscope (SEM) image shown in Figure 1b. The detailed fabrication flow has been explained in the prior literature. [17] Program-erase operations were performed over 60 devices across a 300 mm wafer with 500 ns pulses, showing a memory window of 1.5 V (see Figure 1c). Distributions of the I_read states based on the program-erase operations are shown in Figure 1d. For the current study, a memory array with 4k devices was fabricated along with peripheral circuits and analog-to-digital converters (ADCs) using 28 nm HKMG technology (shown in Figure 2a,b). A schematic of the IMC array is shown in Figure 2c. Current-limiter (CL) transistors were used at the end of each column to reduce the impact of ΔI_d^D2D on the bit line (BL) current (I_BL). The sneak-path issue was mitigated by deploying WRITE-inhibit and READ-inhibit operations for the array. [18,19] The MAC operation can be performed by I_BL accumulation along the column, which is then used as the input for a 3-bit ADC.

Multilevel Cell Characterization: Techniques and Limitations
To understand the limitations of the current architecture, we first experimentally characterized standalone devices through extensive program-erase operations. The experiment utilized 500 ns pulses of varying amplitudes. After each program-erase operation, the state was sensed using two strategies: threshold-voltage-based (TVB) sensing and drain-current-based (DCB) sensing. In TVB sensing, a constant-current (CC) scheme reads the threshold-voltage (V_T) state at a constant current (I_read) (see Figure 3a). Multilevel coding (MLC) based on V_T states is obtained using the CC scheme for bidirectional "WRITE" operations (Figure 3b). However, MLC is difficult to achieve for DCB sensing (Figure 3c), where the drain current is sensed at a constant gate voltage (V_GATE). The variation in drain current arising from the random variation of ferroelectric domains, the channel percolation path, and the intrinsic defect sites in HfO2 limits current-based MLC, especially for array-level operation. Figure 3d,e shows the distribution of currents for the current-based sensing scheme. The three least significant I_read states (LSS) show a meager noise margin, and the most significant state (MSS) shows high variability. The I_read values for the three LSS with different V_T are almost identical, which makes these MLC states unusable for current-based sensing of MAC operations. In addition, the variation in the I_read of the MSS accumulates during the MAC operation, leading to erroneous MAC output.
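The constant-current readout described above can be illustrated with a minimal sketch: given a gate-voltage sweep, the V_T state is taken as the gate voltage at which the drain current crosses I_read. The function name, the linear interpolation between sweep points, and the example values are assumptions made for illustration and are not taken from the measurement setup.

```python
def vt_constant_current(v_gate, i_drain, i_read):
    """Return the gate voltage at which the drain current first crosses i_read.

    v_gate and i_drain are equal-length lists from a single I_D-V_G sweep
    (hypothetical data). Linear interpolation between neighbouring sweep
    points is an assumption of this sketch.
    """
    points = list(zip(v_gate, i_drain))
    for (v0, i0), (v1, i1) in zip(points, points[1:]):
        if i0 < i_read <= i1:
            # Interpolate between the two samples that bracket i_read.
            return v0 + (i_read - i0) * (v1 - v0) / (i1 - i0)
    return None  # the sweep never reaches i_read


# Illustrative sweep only; these numbers are not measured values.
sweep_v = [0.0, 0.5, 1.0]
sweep_i = [0.0, 1e-7, 3e-7]
vt = vt_constant_current(sweep_v, sweep_i, i_read=2e-7)
```

DCB sensing would instead fix the gate voltage and read the current directly, which is why overlapping current distributions translate straight into read errors.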

Proposed DM-FeFET Bitcell
Differential bitcell structures have been widely studied in the context of RRAM devices, both to realize error resilience and to facilitate the realization of negative weights/inputs. The proposed DM-FeFET bitcell architecture uses these factors as motivations and further addresses the limitations of MLC sensing discussed in the above section. Figure 4a shows the realization of the DM-FeFET cell over a fabricated FeFET crossbar through mapping by sharing the cells across a column. Compared to RRAM-based designs, the proposed bitcell is capable of realizing multiprecision weights owing to the stable MLC programming capability demonstrated by FeFET devices. An additional advantage of the proposed bitcell is its area compared to 2T-2R differential bitcells, which utilize a separate selector and device. The improvement in sensing margin at a constant V_read using the proposed scheme is shown in Figure 4c-e for program-based MLC and Figure 4e,f for erase-based MLC. The mapping schemes utilized for achieving 2-bit states using the conventional and proposed schemes are shown in Figure 4g,h, respectively. A detailed comparison of the schemes with a flowchart and circuits can be found in Figure S1, Supporting Information. The resulting 2-bit states from the DM-FeFET configuration show a well-separable sense margin. Table 1 compares the proposed synaptic cell with other state-of-the-art memory cells at the array level. Next, we evaluated BER/error probabilities for the proposed DM-FeFET bitcells for 2b states using both program and erase (Figure 5a). We observe a BER improvement on the order of 3× compared to the conventional scheme. As an application demonstration, we program 32 × 32 crossbar arrays to represent an image using conventional MLC states and DM-FeFET and further validate the BER improvement (see Figure 5b), with the proposed scheme showing BER < 1%.

DM-FeFET-Based IMC: Concept and Results
In the previous sections, we discussed the primary motivations and error resilience of the proposed DM-FeFET bitcell. For realizing IMC-based MAC operations, inputs are applied in the form of differentially encoded read pulses across the word lines (WLs), with weight vectors stored along the columns. The final output of the MAC operation is realized by integrating the current along the BL (I_BL). Initial characterization using only binary input values was performed to validate the applicability of the fabricated circuit for the current application. Experimentally characterized memory states for the proposed bitcell were then used to perform a large-scale simulation for a DNN. To incorporate device variability, the BER of the MAC output was utilized as a modeling parameter in order to facilitate accelerated simulations.
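As a rough illustration of the column-wise MAC described above, the sketch below models one bit line whose cells are grouped into differential pairs. It assumes one plausible reading of the scheme: each binary input drives a pair of word lines with complementary read pulses, each binary weight is stored as complementary LVT/HVT states in the pair, and the cell currents simply add on the bit line. The current values `I_ON` and `I_OFF` and all function names are hypothetical, not measured or taken from the fabricated macro.

```python
# Hypothetical read currents in amperes; illustrative only, not measured values.
I_ON = 1.0e-6   # assumed current of a cell in the LVT (conducting) state
I_OFF = 0.0     # assumed current of a cell in the HVT (non-conducting) state


def encode_input(x):
    """Binary input -> complementary read pulses on the pair's two word lines."""
    return (1, 0) if x == 1 else (0, 1)


def encode_weight(w):
    """Binary weight -> read currents of the two cells forming the pair."""
    return (I_ON, I_OFF) if w == 1 else (I_OFF, I_ON)


def bitline_current(inputs, weights):
    """Sum the currents of all selected cells along one column."""
    total = 0.0
    for x, w in zip(inputs, weights):
        wl_a, wl_b = encode_input(x)
        i_a, i_b = encode_weight(w)
        total += wl_a * i_a + wl_b * i_b
    return total


def mac_count(inputs, weights):
    """Convert the accumulated current back to an integer MAC result."""
    return round(bitline_current(inputs, weights) / I_ON)


# Four differential pairs occupy eight rows of one column.
result = mac_count([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of the 4 pairs conduct
```

Under these assumptions a pair conducts only when input and weight agree, so the bit-line current counts matching pairs. Device-to-device variation would appear as a spread in `I_ON`, which this idealized sketch deliberately omits.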

Binary MAC Operation Validation
To validate the applicability of the fabricated hardware for performing MAC operations, initial experiments focusing on only binary inputs and weights were performed. Figure 6a shows a highly linear and V_WL-independent MAC operation. Figure 6b, demonstrating the MAC operation from a single tile, shows negligible leakage in I_BL with the biasing scheme. The MAC operation was performed over 20 different tiles for statistical modeling. Figure 6c shows stable MAC operation over 20 different tiles from the crossbar array, with a maximum standard deviation of 5% from the mean value for any state. Figure 6d shows the stability of the MAC output over ≈3 h of testing, with no deviation. The column of a single tile is terminated with a current-mode ADC to sense I_BL. The 3-bit low-precision current-mode ADC fabricated as a part of the IMC is operated with a reference current (I_ref) of 100 nA to perform binarized MAC operation sensing. While I_BL is smaller than I_ref, the first current mirror in the ADC maintains a high V_out. As I_BL rises above I_ref, V_out drops to a lower value. An encoder follows the ADC to generate the binary output of the MAC operation.
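The sensing chain above can be approximated behaviourally as a bank of current comparators followed by an encoder. The sketch below is a simplification under stated assumptions: it uses seven comparator stages for the 3-bit ADC with thresholds at integer multiples of I_ref = 100 nA. The number of stages, the threshold spacing, and the function names are hypothetical and do not describe the fabricated circuit.

```python
I_REF = 100e-9   # reference current from the text (100 nA)
N_STAGES = 7     # assumption: 2**3 - 1 comparator stages for a 3-bit ADC


def comparator_outputs(i_bl):
    """Return V_out per stage: 1 (high) while I_BL is below the stage
    threshold, 0 (low) once I_BL exceeds it.

    Thresholds at k * I_REF are an assumption of this sketch.
    """
    return [0 if i_bl > k * I_REF else 1 for k in range(1, N_STAGES + 1)]


def encode(v_out):
    """Encoder: count the stages whose output has dropped low."""
    return sum(1 for v in v_out if v == 0)


def adc_code(i_bl):
    """3-bit output code for a given bit-line current."""
    return encode(comparator_outputs(i_bl))
```

For example, a bit-line current of 250 nA exceeds the first two assumed thresholds only, so `adc_code(250e-9)` evaluates to 2.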

DM-FeFET-Based IMC for NN: Simulation-Based Validation
To perform NN simulations using the proposed DM-FeFET bitcell, a weight-mapping strategy was devised to program the individual FeFET cells to realize binary, ternary, and 3-bit weight values. In the current evaluation, as applied inputs are intended to be constant voltage pulses, inputs and outputs are both assumed to be binary for the MAC operation. Through empirical simulations of over 100 trials, the feasibility of performing MAC operations using binary activations with varying weight precision using an 8 × 8 tile, i.e., a 4 × 8 weight matrix, was validated. Results of MAC outputs as sensed voltages are shown in Figure 7a. Figure 7b shows the connections on a 4 × 4 FeFET array to realize the MAC operation using the IMC array. For application validation, a VGG-8-based convolutional neural network (CNN) architecture is utilized (see Figure 7c). The proposed methodology shows highly reliable sensing operation for both binary and ternary. The network was trained over the CIFAR-10 dataset after first binarizing the input data using thermometric encoding. The three network schemes are described based on the precision of weight and activation as XWBA, where X is B: binary, T: ternary, and Q: 3-bit. The evolution of training accuracy and loss throughout 30 epochs with different weight precisions is shown in Figure 8a,b, respectively. Next, we evaluated the impact of device programming variability on the accuracy of binary MAC operations using different weight precisions. As MAC operations rely on current accumulation over the column, the array size can impact overall output variability. Output BER as a function of array size for all three precisions is shown in Figure 8c. Both binary and ternary weights show saturation with increasing array size, while 3-bit weights show a constant error profile. Finally, we evaluate the impact of BER on network accuracy (Figure 8d). Cutoff BER = 1% is achieved across all weight precisions. A comparison of the proposed DM-FeFET IMC macro using the fabricated FeFET array with recent IMC array demonstrations in the literature is shown in Table 2. The proposed macro achieves a very high energy efficiency, storage density, and accuracy on CIFAR-10 compared to other implementations in the literature.
Table 1. Comparison of current work with FeFET-based state-of-the-art synaptic bitcells. References: 1T1R [23], 2FeFET + 1T [24], 1T-1FeFET [25], 2T1C [26], 2FeFET.
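The thermometric encoding used to binarize the input data can be sketched as follows. The number of levels, the evenly spaced thresholds, and the function name are assumptions for illustration; the text does not specify the encoding resolution.

```python
def thermometer_encode(value, levels=8, max_value=255):
    """Encode an intensity as a unary (thermometer) code of `levels` bits.

    Bit k is 1 when the value exceeds the k-th evenly spaced threshold.
    The level count and the threshold spacing are assumptions of this sketch.
    """
    thresholds = [(k + 1) * max_value / (levels + 1) for k in range(levels)]
    return [1 if value > t else 0 for t in thresholds]


# A mid-range 8-bit pixel with 4 levels (thresholds at 51, 102, 153, 204).
code = thermometer_encode(128, levels=4)  # [1, 1, 0, 0]
```

Each pixel becomes a vector of binary values that can be applied as constant-amplitude read pulses, which is consistent with the binary-input assumption made for the MAC operation.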

Discussions
Quintessentially, IMC is based on MAC operations, where current integration is performed along the column or row. Therefore, a fault in a single cell and the resulting current deviation can accumulate, leading to erroneous MAC operations. Hence, evaluation of the reliability aspect is crucial to successful VLSI. Reliability analysis focusing on two aspects was performed as part of this study: 1) time-dependent dielectric breakdown (TDDB) and 2) flicker noise. TDDB characteristics can be utilized to ascertain the impact on device endurance of repeated program-erase operations. In the case of FeFET devices, TDDB can lead to high leakage or even loss of ferroelectric properties, resulting in a complete loss of functional capabilities. [20] While the focus of this study is a read-centric workload, WRITE operations may become a concern over the lifetime because weights would need to be updated periodically depending on the evolution of network architectures or even data. TDDB characteristics for the device used in the study are summarized in Figure 9a-c. Another important parameter for evaluating the accuracy of the MAC operations is the low-frequency noise (LFN) characteristics, i.e., flicker noise, to ascertain thermal noise resulting when reading a device in the high-voltage threshold (HVT)/LVT states. [11] Figure 9d,e presents the characterization of flicker noise for devices in the program and erase states up to a frequency of 10 kHz. Based on the analysis, the fabricated FeFET devices show very low noise (≈pA) at an operating frequency of 1 kHz.

Conclusion
A 28 nm HKMG technology-based IMC macro with FeFET-based IMC core has been demonstrated. High-precision, linear MAC operation is conducted on the chip. The differential mode of operation in the FeFET-based crossbar array shows an improvement in current-based sensing and MLC-MAC operation. An AI application workload (CNN) was used to benchmark the performance of the DM-FeFET crossbar array over multiple precisions, yielding the highest training accuracy of 94% over the CIFAR-10 dataset. Extensive characterization of reliability factors such as TDDB, flicker noise, and D2D variability resulting in MLC state BER was performed. DM-FeFET shows improvement in BER performance in MAC operations by up to 3× for 2b storage. The impact of D2D variability on binary MAC operations was also studied, showcasing that array scaling can lead to a reduction in BER. The proposed DM-FeFET IMC macro achieves a very high energy efficiency, storage density, and comparable accuracy on CIFAR-10 compared to other implementations in the literature.
Table 2 (column headers): This work, IEDM 2021 [27], VLSI 2021 [28], IEDM 2020 [29], JSSCC 2022 [30], ISSCC 2020 [31], Nature 2022 [32].

Experimental Section
Device Fabrication: Crossbar arrays and memory cell test structures were fabricated on 300 mm wafers at GlobalFoundries with 28 nm HKMG technology. [9][10][11][12] An 8 nm-thick silicon-doped HfO2 (Si:HfO2) was used as the ferroelectric layer with a 1 nm interfacial layer of silicon dioxide (SiO2).
Electrical Characterization: FeFET IMC arrays were characterized using a National Instruments (NI) PXI-Express system. Contacts of the IMC array were controlled using the Pin Parametric Measurement Unit (PPMU) of the NI PXIe-6570 and the Source Measurement Unit (SMU) of the NI PXIe-4143. A custom switch matrix was utilized to select WLs to connect the contact pads of the array via a probe card. The details of the characterization are described in our previous publications. [19,21,22]
Programming Scheme: The WRITE operation was conducted row-wise with a 4.5 V pulse of 500 ns. The memory array was block-wise erased by applying a 5 V pulse of 40 μs to the bulk while applying 0 V at WL, BL, and SL.
Simulation Framework: The QKeras framework (based on TensorFlow) was utilized for performing quantization-aware training using binary activation functions and three precisions (binary, ternary, 3-bit) for the weights. To model device error, BER was first derived for each weight precision on a block size of 4 × 8. This BER value was then modeled as part of a modified binary activation function during inference for estimating accuracy.
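A minimal sketch of how a BER value could be folded into a binary activation is given below. It is written in plain Python rather than QKeras, and it assumes a simple model in which each activation bit is flipped independently with probability equal to the BER. The function names and the independent-flip model are assumptions and may differ from the modified activation function actually used.

```python
import random


def binary_activation(pre_activation):
    """Ideal binary activation: 1 for non-negative inputs, 0 otherwise."""
    return 1 if pre_activation >= 0 else 0


def noisy_binary_activation(pre_activation, ber, rng):
    """Binary activation whose output bit is flipped with probability `ber`.

    `rng` is a random.Random instance so that runs are reproducible.
    Independent bit flips are an assumption of this sketch.
    """
    bit = binary_activation(pre_activation)
    if rng.random() < ber:
        bit ^= 1  # emulate an erroneous MAC readout
    return bit


def observed_flip_rate(ber, trials, seed=0):
    """Empirically estimate the fraction of flipped outputs."""
    rng = random.Random(seed)
    flips = sum(noisy_binary_activation(1.0, ber, rng) == 0 for _ in range(trials))
    return flips / trials
```

Sweeping `ber` during inference and recording accuracy at each value would reproduce the kind of BER-versus-accuracy study described above, at far lower cost than simulating every device individually.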

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.