Nonvolatile Capacitive Crossbar Array for In-Memory Computing

Conventional resistive crossbar array for in‐memory computing suffers from high static current/power, serious IR drop, and sneak paths. In contrast, the “capacitive” crossbar array that harnesses transient current and charge transfer is gaining attention as it 1) only consumes dynamic power, 2) has no DC sneak paths and avoids severe IR drop (thus, selector‐free), and 3) can be fabricated on top of complementary metal–oxide–semiconductor (CMOS) circuits for 3D‐stacking. For the first time, ferroelectric Hf0.5Zr0.5O2 (HZO) capacitive crossbar arrays are experimentally demonstrated. Asymmetry of the HZO electrode interfaces leads to small‐signal capacitance on/off ratio >110% that can achieve read‐disturb‐free operation. The vector matrix multiplication (VMM) experiments are conducted on the fabricated capacitive crossbar array, showing a linear weighted sum versus numbers of input or on‐state weight. The array‐level VMM operation could maintain weight pattern reprogramming after 1) thousands of 1 ms/3 V pulses and 2) an extrapolated 10‐year retention at 85 °C. Array‐level circuit simulation at 22 nm node shows the energy consumption of a capacitive crossbar array is 20–200× lower than the resistive crossbar array counterpart. Moreover, analog‐shift‐and‐add circuits are designed for multibit weight summation, achieving 16.6% less area and 26.9% lower energy consumption than digital‐shift‐and‐add circuits.


Introduction
Resistive crossbar arrays have been intensively studied for in-memory computing, [1] where the parallel VMM is significantly faster and more energy efficient than in the traditional von Neumann architecture. The representative resistive synapses include two-terminal phase-change memory (PCM), resistive random access memory (RRAM), magnetic random access memory (MRAM), and three-terminal flash transistors or ferroelectric field-effect transistors (FeFETs), in which the channel conductance is encoded as the weight. However, with multiple rows turned on simultaneously and the synapse resistance usually being only a few kΩ, a high static current conducts through the array and consumes large static power. When the array size increases, the wire resistance (R_wire) becomes comparable with the on-state resistance of the memory cell (R_on), resulting in a serious IR drop along the wires. Furthermore, the low R_on also induces sneak-path current, further degrading the inference accuracy of the neural network. To counter the sneak paths, an extra access transistor is used to form a one-transistor-one-resistor (1T1R) cell, but the access transistor is usually sized up to deliver the high write current of the resistive synapse, [2] leading to a large cell area (>60F², where F is the feature size of a given technology node [3]) and a relatively low area efficiency.
Inspired by the charge transfer principle, the capacitive crossbar array was recently proposed to overcome the disadvantages of the resistive crossbar array [4,5] (Figure 1). The capacitive array uses programmable small-signal capacitance states (at DC zero bias) as weights. The high static power is eliminated because capacitors consume only dynamic power. Moreover, the open-circuit nature of a capacitor effectively blocks the sneak-path current and prevents IR drop along the wires. In other words, the capacitive design can be scaled up more easily than the resistive array because it is free of IR drop and sneak paths. Process variation within an array remains a challenge when scaling up, but this applies to all designs, including the resistive ones. Without the need for an access transistor, the cell size of a capacitive synapse can ideally be kept at the minimum 4F². As HZO capacitors have been proven to be compatible with both CMOS and back-end-of-line (BEOL) processes, [6] the crossbar structure can potentially be fabricated on top of the peripheral circuits, leading to a high area efficiency.
Theoretically, a perfect ferroelectric capacitor should not exhibit a capacitance window at DC zero bias (with displacement charges), but should show one only in a transient sweep that involves polarization switching. Therefore, interface engineering is required to break the symmetry and open up a nonzero capacitance window at DC zero bias. Prior work on the HfO2-based capacitive synapse [7] needs a DC bias of ≈1.5 V to open the capacitance window, which is not realistic because of the potential read-disturb. In addition, the difference between its high- and low-capacitance states is as low as ≈1%. [7] In this work, we demonstrate an HZO capacitive synapse device with a >110% capacitance window, which is free from read-disturb because the states are read at DC zero bias. Here, for the first time, we experimentally integrated the HZO capacitors into a crossbar array structure and set up a measurement system for performing VMM based on the charge transfer mechanism. The experimental results show that the output voltage is highly linear versus both the inputs and the weight values. Through circuit simulation, array-level and system-level benchmarking results indicate substantial benefits of the capacitive crossbar array over its resistive counterparts at advanced nodes.
Finally, we propose to shift-and-add the weighted sum for multibit weight in an analog manner instead of a conventional digital manner to further reduce area and energy consumption.

Device Characteristics of Nonvolatile Capacitive Synapse
This section presents the small-signal capacitance-voltage (C-V) characteristics of single devices from the fabricated HZO crossbar capacitor array. The process flow of the crossbar array can be found in the Experimental Section.
The asymmetric "small-signal" C-V characteristics in Figure 2a show high- and low-capacitance states (HCS/LCS) at DC 0 V. Here the dielectric constant (ε_r) represents the capacitance states. In our previous work, [8] the small-signal capacitance asymmetry was attributed to excessive oxygen vacancies at the bottom electrode (BE) interface, which induce the domain wall (DW) pinning effect (down-polarized even after positive sweep), resulting in more DWs and thus more charges in HCS (Figure 2b).
Figure 1. Nonvolatile a) resistive array and b) capacitive crossbar array architecture with HZO capacitor for in-memory computing that 1) only consumes dynamic energy due to the capacitive nature, 2) shows low IR drop due to minimum sneak-path current and high ratio between R_HZO and R_wire, 3) can be fabricated with high density on top of the peripheral circuits, 4) can be designed without access transistors, and 5) allows zero read-disturb due to its read operation at DC 0 V.
To verify this capacitor as a read-disturb-free and programmable capacitive memory, we further performed small-signal AC measurement at DC = 0 V directly after program and erase with +3 and −3 V write pulses, respectively (Figure 3a), where the cycling endurance shows a steady window even after thousands of strong 3 V/1 ms programming pulses. For an inference engine, weight programming is relatively infrequent; thus, an endurance of 1000 cycles is sufficient for this application. In Figure 3b, a practical read operation applies a 100 mV pulse and integrates the current flowing onto the capacitor. The integrated charges show a distinct margin and an on/off ratio of 113%, similar to the initial capacitance window in Figure 3a, because the integrated charge equals the capacitance × the small-signal input voltage. It should be noted that the downward-shifting capacitances over endurance cycles can be a minor issue in an inference-only application. Although the weights need to be occasionally adjusted after deployment, a dummy column of capacitors or on-chip sensing circuits can be used to shift the reference voltage in the sense amplifiers correspondingly.
Finally, even though the capacitive crossbar array is immune to sneak paths, the half-select write-disturb is still a potential concern. We analyzed the effect of write-disturb under the 1/3 write voltage (V_w) scheme. The disturbance compared with the total margin is shown in Figure 4a, where the capacitance change (disturbance) is <10% of the total margin with V_w = 2.5 V or 3 V after programming stress. The 1/3 V_w scheme is illustrated in Figure 4b and is described as follows. The selected cell experiences the full V_w drop across the capacitor, whereas the unselected cells experience 1/3 V_w. The programming protocol for the analysis of the write disturbance in Figure 4a is shown in Figure 4c. We first precycled the device with one hundred 3 V/1 ms pulses. Then, we applied V_w and measured the small-signal capacitance without disturbance. After that, we applied V_w followed by 1/3 V_w with opposite polarity and finally measured the small-signal capacitance with the disturbance. The "disturb" is calculated as the percentage difference between the capacitances with and without disturbance. There were no reset pulses between the measurements of the capacitances with and without disturbance.
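The 1/3 V_w biasing described above can be sketched numerically. The following is a minimal illustration, assuming the common V/3 line biasing (selected WL at V_w, selected BL at 0 V, unselected WLs at V_w/3, unselected BLs at 2V_w/3). The line voltages, the function names `third_vw_cell_voltages` and `disturb_percent`, and the normalization of the disturb to the undisturbed capacitance are assumptions for illustration, not details taken from the measurement.

```python
import numpy as np

def third_vw_cell_voltages(n_rows, n_cols, sel_row, sel_col, v_w):
    """Voltage across every cell (WL minus BL) under an assumed V/3 write scheme."""
    wl = np.full(n_rows, v_w / 3.0)        # unselected word lines at V_w/3
    bl = np.full(n_cols, 2.0 * v_w / 3.0)  # unselected bit lines at 2*V_w/3
    wl[sel_row] = v_w                      # selected word line at V_w
    bl[sel_col] = 0.0                      # selected bit line grounded
    return wl[:, None] - bl[None, :]

def disturb_percent(c_without, c_with):
    """Percentage difference between capacitances with and without disturbance
    (normalized to the undisturbed value; this normalization is an assumption)."""
    return 100.0 * abs(c_with - c_without) / c_without

# Example: a 12 x 12 array with V_w = 3 V, writing the cell at row 3, column 5.
v = third_vw_cell_voltages(12, 12, sel_row=3, sel_col=5, v_w=3.0)
print(v[3, 5])   # selected cell sees the full write voltage
print(v[0, 0])   # a fully unselected cell sees V_w/3 with opposite polarity
```

In this sketch the half-selected cells on the selected row and column see +V_w/3, while all remaining cells see −V_w/3, which matches the opposite-polarity 1/3 V_w stress used in the disturb protocol.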

Array-Level Measurement
We fabricated a small-scale 12 × 12 HZO-based capacitive crossbar array to demonstrate the in-memory computing functionality as a proof of concept. Figure 5 shows the microscopic image of the fabricated array. The active size of each cross-point cell ranges from 1 × 1 μm² to 100 × 100 μm². The details of the fabrication process flow are discussed in the Experimental Section.
Before moving to the array-level measurement, we describe the operating principle of the capacitive array. [9] Figure 6 illustrates the operating principle in two steps. In the first step (Figure 6a), the word-line (WL) voltages, represented by step functions with amplitude = 100 mV as the input vector, propagate through the WL multiplexer (MUX) and charge the array of ferroelectric capacitors (C_FE), which are preprogrammed to different capacitances to represent the values in the weight matrix. The product of one input voltage value and one weight capacitance value is encoded as the charge on each C_FE. In the second step (Figure 6b), the input voltages return to the common-mode voltage (V_CM) and become the same as the negative input of the operational amplifiers (OPAMPs). At this moment, the voltage drop on each C_FE becomes 0 V, so the charges are forced to transfer along the corresponding bit line (BL) onto the reference capacitor (C_ref). The amount of charge on C_ref is the weighted sum along the BL, and the resulting output voltage (V_out) determines an entry in the output vector. The ideal analytical equation for the weighted sum is as follows.
where i and j denote the row and column indices, respectively.
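The two-step description above implies a weighted sum of the following form. This is a reconstruction under the assumptions of an ideal OPAMP, complete charge transfer, and negligible parasitic capacitance; the symbols V_in,i and C_FE,ij and the sign convention (V_out taken as a magnitude relative to V_CM) are assumptions introduced here.

```latex
V_{\mathrm{out},j} \;=\; \frac{1}{C_{\mathrm{ref}}} \sum_{i} V_{\mathrm{in},i}\, C_{\mathrm{FE},ij}
```

Here V_in,i is the input step amplitude on the i-th WL (100 mV or 0 V in the experiments) and C_FE,ij is the programmed small-signal capacitance of the cell at row i and column j.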
In the following experiment, we first demonstrate VMM operations with eight capacitive synapses in one column with all the inputs turned on. The detailed description of the experimental setup is provided in Figure 13. Figure 7a shows the measured weighted sum, V_out, sensed at the output node of the OPAMP, over 12 measurement trials. The array consists of eight 50 × 50 μm² ferroelectric capacitive synaptic cells with all inputs being "1." The tight distribution in Figure 7a implies low cycle-to-cycle variation. The average V_out in Figure 7b shows high linearity versus the number of HCS cells. Arrays with smaller synapses are also demonstrated in Figure 7c. A highly linear relationship between V_out and the number of turned-on WLs is also shown (Figure 8a) with an ultratight distribution (Figure 8b). In practice, it is challenging to measure the response of a sub-10 μm capacitor because of parasitics and the sensing limit of off-chip instruments. Therefore, the performance of nanoscale capacitive crossbar arrays with integrated peripheral circuits is projected by simulations in Section 4.
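The linear dependence of V_out on the number of HCS cells follows from ideal charge transfer and can be illustrated with a short numerical sketch. It assumes an ideal OPAMP and complete charge transfer; the capacitance values `C_LCS`, `C_HCS` (chosen with a ratio of 1.13, one possible reading of the reported on/off ratio), and `C_REF` are hypothetical and are not measured values.

```python
import numpy as np

V_IN = 0.1         # input step amplitude in volts (100 mV, as in the text)
C_LCS = 1.00e-12   # hypothetical low-capacitance state, farads
C_HCS = 1.13e-12   # hypothetical high-capacitance state (assumed ratio 1.13)
C_REF = 10e-12     # hypothetical reference capacitor, farads

def weighted_sum(inputs, weights):
    """Ideal charge-transfer VMM.

    inputs:  binary vector of shape (rows,), 1 = WL pulsed with V_IN
    weights: binary matrix of shape (rows, cols), 1 = HCS, 0 = LCS
    returns: output voltage magnitude per column
    """
    inputs = np.asarray(inputs, dtype=float)
    caps = np.where(np.asarray(weights) == 1, C_HCS, C_LCS)
    charge = (V_IN * inputs) @ caps   # total charge collected on each BL
    return charge / C_REF

# One column of eight cells, all inputs on, sweeping the number of HCS cells.
n_cells = 8
v_out = []
for n_hcs in range(n_cells + 1):
    column = np.array([1] * n_hcs + [0] * (n_cells - n_hcs)).reshape(-1, 1)
    v_out.append(weighted_sum(np.ones(n_cells), column)[0])
v_out = np.array(v_out)

# Each additional HCS cell adds the same voltage step in the ideal case.
steps = np.diff(v_out)
print(v_out)
print(steps)
```

In this idealized model every step equals V_IN × (C_HCS − C_LCS) / C_REF, so deviations from a straight line in a measurement would point to device variation or parasitics rather than to the charge-transfer principle itself.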
Moreover, reliability tests of the capacitive crossbar array were performed in terms of endurance and retention characteristics at the array level by monitoring V_out. Figure 9a shows the cycling endurance with 3 V/1 ms pulses to reprogram the weight pattern. Even after thousands of such pulses, a sense margin at V_out still exists. Figure 9b shows the 15-hour retention at 85 °C, where a clear V_out sense margin can be extrapolated to 10 years. The decrease of V_out in both HCS and LCS over time implies a decreasing capacitance. This might be a result of the imprint effect that is commonly observed in ferroelectrics. As shown in Figure 9c, hours after being programmed to HCS, the small-signal C-V curve shifts to the negative side, resulting in a lower HCS capacitance. On the other hand, after being erased to LCS, the C-V curve shifts to the positive side over time (Figure 9d), causing the LCS capacitance to decrease during the retention test.

Simulations Toward Large-Scale Systems
To evaluate the latency, energy, and equivalent number of bits (ENOB) under thermal noise of the nonvolatile capacitive array at an advanced technology node, we ran simulation program with integrated circuit emphasis (SPICE) simulations with 22 nm low-power (LP) transistor models, taking wire parasitics into account. The array schematic and the OPAMP circuits are shown in Figure S1a,b, Supporting Information. The array size and other key parameters in the simulation are listed in Table S1, Supporting Information.
The SPICE simulation results showed an ENOB = 4.3 in Figure S1c, Supporting Information, and the latency for the charge transfer is ≈5.1 ns, where latency is defined as the time for V_out to reach 80% of the steady-state voltage. An ENOB larger than 4 is thus achievable, which supports a 4-bit partial sum quantization that has been reported to keep reasonable inference accuracy for the CIFAR-10 dataset. [10] Subsequently, we compared the capacitive subarray results with those of the resistive subarrays in terms of energy and latency. Owing to the suppression of static energy during the steady-state readout, the capacitive subarray consumes 20-200× lower total energy than 1-bit/cell 1T-RRAM, [3] 1T-1MRAM, [11] and 1T-1FeFET, [12] as shown in Table 1. The subarray energy is normalized to 1-bit multiply-accumulate (MAC). Based on the array-level SPICE simulation result, we benchmarked the capacitive crossbar array with the open-source simulator DNN+NeuroSim [10] at 22 and 7 nm to evaluate its system-level performance in Table 2. The simulation setting for NeuroSim is described in Supporting Information. With the capacitive array assumed to be on top of the peripheral circuits, it is benchmarked against 22 nm 1T-RRAM, 1T-1MRAM, 1T-1FeFET, and 22 nm/7 nm static random-access memory (SRAM). The benchmarking results of the 22 nm capacitive array show a higher energy efficiency and compute efficiency than those of the resistive counterparts and the 22 nm SRAM array. The projection of a 7 nm capacitive array also shows a substantial ≈2× energy efficiency boost over the 7 nm SRAM.

Parallel Processing with Analog-Shift-and-Add
Weights in deep learning algorithms are typically quantized before being mapped to a synaptic array. Conventional hardware implementation of n-bit quantized weight groups n binary synaptic cells together to represent an n-bit weight value. The n cells in one group are in adjacent columns in the same row, where the n outputs of the weighted sums need to be combined later by peripheral circuits based on the bit significance of each cell representation. A common setting of the peripheral circuits is shown in Figure 10a, where MUX selects one bit-line (BL) of the synaptic array among n BLs (assume n-bit weights are mapped to the hardware). Here, n is equal to 4 in this work. The selected BL value is then transformed from an analog representation to a digital representation by the following ADC. The n digital values are processed sequentially and summed based on their bit significance by a digital-shift-and-add circuit.
However, this setting of sequential processing with digital-shift-and-add requires n cycles to combine the weighted sums, which consumes n cycles of energy. The sequential processing also requires an array of MUXes to select one BL at a time as the input of the ADC, which adds area overhead.
To improve the energy and area efficiency, we propose to sum the n BLs with different bit significances in parallel in one cycle using an analog-shift-and-add circuit instead of the digital case. The high-level schematic is shown in Figure 10b, where we can sum up the n BLs in parallel with an analog-shift-and-add circuit and feed the output to an ADC. [13] The parallel processing avoids the need of using MUX and saves the energy and latency by reducing the processing time from n cycles to a single cycle.
Table 1. Subarray-level evaluation with SPICE simulation, compared with those of the representative crossbar arrays obtained using the DNN+NeuroSim framework. The energy and latency of the resistive arrays include those from the crossbar structure, while the energy and latency of the capacitive array include those of the crossbar structure and OPAMPs. The results are simulated and averaged assuming all inputs turned on and weight patterns from a pretrained VGG-8 model. Delay of the capacitive array is defined as the time for the output voltage to reach 80% of the steady-state value. (Subarray energy includes the static and dynamic energy of the crossbar structure. Subarray energy is normalized to 1-bit VMM.)
C_on/R_on [aF/Ω]: 120 (capacitive, this work); 6 k (1T-1RRAM [3]); 2.5 k (1T-1MRAM [11]); 67 k (1T-1FeFET [12]).
Table 2. System-level benchmarking results show that the capacitive array has the potential of outperforming its resistive counterparts and can be more competitive than 7 nm SRAM. (Subarray size = 128 × 128; F = 7 nm or 22 nm for normalizing cell area; thus, it does not indicate the physical feature size. The 7 nm projection of "capacitive" applies 7 nm peripheries while keeping the same cell area as the 22 nm one.)
The circuits of the analog-shift-and-add are shown in Figure 11a. The array of analog voltage buffers serves two purposes. The first is to shield the charges at the outputs of the OPAMPs from leaking, which would lead to significant information loss due to the voltage drop. The second is to downshift the offset voltage so that the total charge transferred to the next stage is reduced; the smaller amount of charge results in lower latency. The output voltage of the buffer representing the weighted sum from each BL is connected to an array of capacitors. The voltage representing the least significant bit (LSB) is connected to the capacitor with unit capacitance C, while the voltage representing the most significant bit (MSB) is connected to the capacitor with capacitance 2^(n−1)C. In this way, the MSB voltage is correctly weighted and results in a correspondingly larger amount of charge than the LSB voltage. After the voltage values are weighted based on their bit significance, the charges are transferred to the summed-up capacitance (C_sum) shunting an OPAMP whose output is connected to an ADC. The value of C_sum should be designed based on the value of n (n-bit weight) and the desired output voltage range. It should be pointed out that the capacitor array of the analog-shift-and-add requires voltage as input, which is compatible with the output of the capacitive crossbar array. In contrast, a resistive array, which outputs current, requires additional circuits to transform the current to voltage, with additional area and energy overhead.
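The binary-weighted capacitor array can be checked against the conventional digital-shift-and-add with a small sketch. It assumes ideal buffers and complete charge transfer onto C_sum; the function names and the default choice of `c_sum` equal to the total array capacitance are assumptions for illustration, and only the 100 aF unit capacitance and n = 4 follow the values quoted in this work.

```python
import numpy as np

def analog_shift_and_add(v_bl, c_unit=100e-18, c_sum=None):
    """Sum buffered BL voltages through binary-weighted capacitors.

    v_bl[k] is the voltage of the BL with bit significance 2**k (k = 0 is the LSB).
    """
    v_bl = np.asarray(v_bl, dtype=float)
    caps = c_unit * 2.0 ** np.arange(v_bl.size)   # C, 2C, ..., 2^(n-1) C
    if c_sum is None:
        c_sum = caps.sum()   # assumed sizing so the output stays in the input range
    return float(np.dot(caps, v_bl) / c_sum)

def digital_shift_and_add(codes):
    """Sequentially combine quantized partial sums; codes[k] has significance 2**k."""
    total = 0
    for k, code in enumerate(codes):
        total += int(code) << k
    return total

# Example with n = 4 bit-lines (LSB first).
v_bl = [0.10, 0.20, 0.05, 0.15]
v_sum = analog_shift_and_add(v_bl)
print(v_sum)
print(digital_shift_and_add([1, 2, 0, 3]))
```

The analog version produces the same weighting as the digital loop in a single charge-sharing step, which is the source of the cycle-count reduction from n to one; the sketch ignores buffer offset, noise, and capacitor mismatch.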
To evaluate the advantages of the analog-shift-and-add at the system level, we extracted the energy and latency from SPICE simulation and estimated the area of the analog-shift-and-add circuit, from the voltage buffers to the output of the OPAMP shunting C_sum. In the system-level evaluation, we compare the performance of designs based on digital- and analog-shift-and-add. Both systems apply a 4-BL-sharing-1-ADC setting and an array size of 128 × 128. The unit capacitance for the analog-shift-and-add is 100 aF. For the ADC precision in the context of digital-shift-and-add, although the number of rows connecting to one column in the synaptic array is equivalent to 7 bits, a 3-bit loss at the ADC side can be tolerated without severely degrading the inference accuracy according to the analysis in the study by Jiang et al. [13] Hence, a 4-bit flash ADC (7 − 3 = 4) is applied for the digital case. For the analog case, the number of rows is also equivalent to 7 bits. However, 4 columns are merged before connecting to the ADC, so the total number of bits is 11. According to the study by Jiang et al., [13] because the analog approach avoids the additional information loss during digital quantization, a 7-bit reduction (more than the 3-bit reduction in the digital case) at the ADC side is tolerable without severely degrading the inference accuracy. Therefore, the ADC precision is also set to 4 bits (11 − 7 = 4). The analog-shift-and-add takes less than 2 ns from receiving the output voltages of the capacitive array to delivering the final output. Figure 11b shows the performance improvement achieved by the analog-shift-and-add over the digital-shift-and-add. Because of the parallel configuration, the read dynamic energy is 26.9% lower. The overall area is 16.6% lower because the BL MUX is no longer needed and the analog-shift-and-add circuit is relatively compact.
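The bit-budget arithmetic behind the 4-bit ADC choice can be written out explicitly. This sketch only restates the numbers given above (128 rows, 4 merged columns, and tolerated reductions of 3 and 7 bits); the helper name `adc_bits` and its parameter names are hypothetical.

```python
import math

def adc_bits(n_rows, merged_weight_bits, tolerated_loss_bits):
    """ADC precision = bits from row summation + merged weight bits - tolerated loss."""
    row_bits = math.ceil(math.log2(n_rows))   # 128 rows -> 7 bits
    return row_bits + merged_weight_bits - tolerated_loss_bits

# Digital case: one BL per conversion, 3-bit loss tolerated.
digital = adc_bits(128, merged_weight_bits=0, tolerated_loss_bits=3)

# Analog case: 4 BLs merged before the ADC, 7-bit loss tolerated.
analog = adc_bits(128, merged_weight_bits=4, tolerated_loss_bits=7)

print(digital, analog)
```

Both cases land on the same ADC precision, so the comparison between the two schemes isolates the effect of the shift-and-add circuit and the removed BL MUX rather than a difference in ADC resolution.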

Conclusion
In this work, the nonvolatile capacitive crossbar array was experimentally demonstrated for in-memory computing for the first time. The capacitor crossbar structure is based on ferroelectric HZO sandwiched between top and bottom TiN electrodes. Owing to the intrinsically asymmetric interfaces of the plasma-enhanced atomic layer deposition (PEALD)-grown HZO capacitor, nonidentical capacitance values at DC 0 V were obtained after program and erase pulses, which can be used as two distinct memory states. Depending on the capacitance value, a corresponding charge is stored in the ferroelectric capacitor, which is later transferred to the reference capacitor shunting the OPAMP and generates the respective output voltage. Based on the charge transfer mechanism, our experimental results of the crossbar array showed high linearity with respect to both the inputs and the weight values when performing the VMM. Moreover, the reliability test of the capacitive crossbar array showed promising features for the inference engine, with a cycling endurance of >1000 cycles and an extrapolated retention of >10 years at 85 °C.
To evaluate the key parameters, that is, latency, energy, ENOB, etc., the subarray-level SPICE simulation was conducted while considering wire parasitics. Based on the SPICE simulation results, we benchmarked the capacitive crossbar array using DNN+NeuroSim at 22 and 7 nm to evaluate its system-level performance. As capacitors consume only dynamic power and negligible static power, a ≈2× energy efficiency boost could be achieved in the 22 nm capacitive array when compared with the SRAM at 7 nm. Furthermore, the capacitive crossbar array can be selector free because of its capacitive nature without DC sneak paths and can also take advantage of the BEOL process, thus achieving high area efficiency compared with the resistive counterparts and the conventional SRAM array. Finally, peripheral circuits with analog-shift-and-add, compatible with the capacitive crossbar array, are proposed. Compared with the digital-shift-and-add design, the analog approach achieves 16.6% lower area and 26.9% lower energy consumption.

Experimental Section
Fabrication Method: Figure 12a shows the key fabrication process of the HZO crossbar capacitor structure. The capacitive crossbar array was fabricated on 100 nm of thermal oxide on top of a p-type (100) silicon wafer. For the capacitor stacks, we used the Fiji G2 ALD system from Veeco. First, 25 nm of PEALD TiN was deposited at 250 °C to form the BE. Then, 21 nm of PEALD Al2O3 was deposited as a hard mask. The photoresist (S1805) was spin coated and baked on the sample to define the BE structure. The Heidelberg MLA150 was used for photolithography and patterning the sample. … afterward with the aforementioned chemicals. Then, 10 nm of the ferroelectric layer, PEALD HZO, and 25 nm of the top electrode (TE), PEALD TiN, were deposited. Again, 21 nm of PEALD Al2O3 was deposited to form the hard mask for TE patterning. After patterning the TE with the same procedure as for the BE, the pad area was opened using diluted HF. Finally, the sample was rapidly annealed at 450 °C for 30 s to crystallize the ferroelectric orthorhombic phase of the HZO layer. Figure 12b shows the schematics of a single crossbar capacitor in the fabricated array from lateral and top viewpoints.
Array-Level Measurement Setup: Figure 13a-b shows the schematic of our measurement setup for "reading weighted sum," where the rows received input voltages in parallel through a Keithley 707B semiconductor switch matrix. The columns were externally connected to OPAMPs on a printed circuit board. VMM was performed in two phases: 1) charging the array and 2) charge transfer to the output. In our setup, input pulses of 100 mV and 0 V, generated by a Keysight 81150A pulse function arbitrary generator, represented binary 1 and 0, respectively, in the input vector. The input vector was multiplied with a column of capacitive weights, resulting in a product stored as charge on each capacitor. After the input was returned to ground, there was no voltage drop across each capacitor; thus, every entry of the product (charge) was transferred to the reference capacitor (C_ref) shunting the OPAMP. The transferred charges resulted in the output voltage of the OPAMP (V_out) as the weighted sum, which was read from the oscilloscope. With more devices in HCS or more inputs activated, there were more charges and thus a higher V_out was obtained. In Figure 13c, the "program/erase" sequence of the capacitive crossbar array for write was processed with a Keithley 4200-SCS parameter analyzer and the Keithley 707B semiconductor switch matrix. The write pulse was applied from the 4200-SCS parameter analyzer and multiple cells were selected using the switch matrix. The Cascade-12K semiauto probe station was used for probing the sample. A photograph of the entire setup for the array-level measurement is shown in Figure 13d.

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.