Device and Circuit Architectures for In-Memory Computing



Introduction
Data processing in digital computers is generally carried out by a sequence of Boolean logic operations executed in silicon by complementary metal-oxide-semiconductor (CMOS) technology. The CMOS transistor has been scaled steadily for the last 40 years according to Moore's law, where the reduction of the transistor size results in less area consumption, hence lower fabrication cost. Transistor size scaling was accompanied by a reduced power consumption and an increase in the operation frequency, thus leading to an improvement in circuit performance generation after generation. [1] The increase in CMOS logic performance has been challenged by the increase in data processing needs and is even more stressed by the exponential growth of data circulating on the internet and provided by always-on and ubiquitous sensors. Unfortunately, reducing the device area also causes an increase in power density, which has slowed down the CMOS scaling trend in the last decade. [2] Conducting AI learning tasks is also heavily demanding in terms of energy consumption, which raises a world-scale concern in view of ubiquitous AI tasks such as image tagging, traffic monitoring, and vocal assistants. [3,4]
Compared with digital computers, the human brain processes information at extremely low power (about 20 W) and low frequency (typically in the few-Hz range). [5] The human brain thus appears as a living biological example to help introduce novel energy-efficient computing paradigms to tackle data-intensive and AI tasks. One of the main assets of the human brain which enables low energy consumption is its peculiar architecture, where memory and computation are colocated. [6] This is in contrast with the conventional computer architecture, where computing takes place in a central processing unit (CPU) operating on programs and data fetched from a working memory, as in the von Neumann architecture. [7] The working memory, most typically a dynamic random-access memory (DRAM), is generally located on a physically separate chip, thus resulting in long latency and high energy consumption for data-intensive tasks. Similar to the human brain, in-memory computing (IMC) instead conducts data processing in situ within a suitable memory circuit. [8] IMC suppresses the latency for data/program fetch and for uploading output results to the memory, thus solving the memory (or von Neumann) bottleneck of conventional computers. Another key advantage of IMC is the high computing parallelism, thanks to the specific architecture of the memory array, where computation can take place along several current paths at the same time. IMC also benefits from the high density of the memory arrays with computational devices, which generally feature excellent scalability and the capability of 3D integration. [14][15] Thanks to the combination of in situ, high-density, parallel, physical, and analogue data processing, IMC appears as one of the most promising novel approaches for computing in the frame of AI and big data.
In addition to analogue computing, digital-type IMC supported by physical properties of the memory devices has also been shown. [16][17][18][19] This approach can improve the density of logic gates and suppress the latency associated with data transfer for digital computation. On the other hand, digital IMC suffers from an increased energy per operation due to the need to change the state of a device during computation. The state switching also increases the time for logic operation and critically limits the lifetime of the circuit due to endurance constraints. For these reasons, a device technology breakthrough might be needed to support the development of largely scaled, low-energy, high-performance logic IMC processors.
This work presents an overview of IMC in terms of device technologies and circuit architectures. Within the extremely large scenario of IMC concepts, we focus our attention on analogue-type computing based on matrix-vector multiplication (MVM) in the memory array. In Section 2, we provide an overview of devices for IMC, covering both two-terminal and three-terminal devices that have emerged recently. In Section 3, we describe the main memory structures which are used in IMC circuits. In Section 4, we focus on the programming operation, where a set of conductance values is stored in the memory circuit to serve a given IMC function. In this respect, we highlight the main programming methodologies as well as the most typical nonidealities which affect the accuracy of the IMC operation during either the offline or online training of the memory array. Section 5 addresses the nonidealities of the memory circuit. Finally, Section 6 presents the main architectures that have been proposed for IMC, including crosspoint arrays and other computational memory arrays, which are relevant for various types of neural networks and general-purpose algebraic computing tasks.

Memory Devices for IMC
Recently, several memory technologies based on the material modification at the nanoscale have emerged as high-density, low-power, low-cost, and high-speed devices for storage and computing. [7,8,15,20,21] In general, the material modification, such as a local change in the chemical composition or phase structure, causes a major change in the device resistivity which can be easily sensed by the peripheral circuit via electrode wires. In particular, these two-terminal devices offer the advantage of scalability to only a few nm [22][23][24] and integration in 3D, [25,26] thus supporting the ultrahigh density of memory needed for computing applications.
Figure 1 shows a summary of two-terminal devices which are currently considered for storage and computing. Device technologies include the resistive-switching random-access memory (RRAM), the phase-change memory (PCM), the magnetic random-access memory (MRAM), and the ferroelectric random-access memory (FERAM).

RRAM Devices
[29] The resulting metal-insulator-metal (MIM) structure shows a relatively large resistance, thanks to the insulating nature of the oxide layer. This is sometimes replaced with an alternative high-resistance material, such as a nitride layer, [30] a chalcogenide material, [31][32][33] or 2D transition metal dichalcogenides (TMDs). [34] The MIM device is first electrically formed by a soft breakdown operation, causing a local modification of the material composition or an increase in the defect concentration, such as oxygen vacancies in the metal oxide.
The forming operation generally causes the buildup of a conductive filament, where the conductance is higher than that in the original insulating layer, thus resulting in a low resistance state (LRS) of the device. The conductivity of the conductive filament can be electrically reduced by the reset operation, leading to a high resistance state (HRS) of the device, or increased by the set transition, to recover the LRS. In a bipolar RRAM device, the set and reset transitions are induced by voltage pulses of opposite polarities, whereas the polarity of set/reset operations is the same in unipolar RRAM devices. [35] Uniform-switching RRAM devices also exist where the oxide layer modification extends throughout the whole area instead of a localized filament region. [36,37]

PCM Devices
[40][41] The amorphous phase shows a disorder-induced high resistivity, in contrast with the low resistivity of the crystalline phase; thus, the PCM state can be identified by a simple voltage/current sensing.
Compared with the filamentary switching process of the RRAM, the PCM relies on the bulk properties of the active material, which generally leads to a larger resistance window and the ability to operate the device with a multilevel cell (MLC) scheme. [42,43] On the other hand, a large Joule heating is generally needed to accelerate the phase transitions, such as melting and crystallization, which results in relatively large currents for programming/erasing the device. [46] A significant problem for the PCM is the resistance drift, where the device resistance increases with time after programming due to the structural relaxation of the amorphous phase. [47] Device technologies with improved stability against drift have been developed [48] and demonstrated in IMC. [49]

MRAM Devices
Figure 1c shows the MRAM device, where the magnetic polarization within a layer of ferromagnetic material such as CoFeB is changed by electrical manipulation. The residual polarization in the ferromagnetic material can be sensed via the magnetic tunnel junction (MTJ), namely, a stack made of a thin insulating layer, usually a highly crystalline metal oxide such as MgO, sandwiched between a reference ferromagnetic layer with fixed polarization and a free ferromagnetic layer with variable polarization. When the two layers have parallel magnetization directions, the resistance of the MTJ is relatively low, whereas the MTJ resistance is relatively high for antiparallel magnetization. [50] The magnetization direction in the free layer can be written by field-induced switching, where a current pulse is applied across suitable write lines to create a local magnetic field, [51] or spin-transfer torque (STT), where the current pulse is applied directly across the MTJ. [52,53] STT-MRAM devices have the advantage of fast switching in the few-ns range, which makes them a strong candidate for last-level cache (LLC) static RAM. [54] On the other hand, MRAM generally displays a limited resistance window of about a factor of 2, which makes it difficult to implement some IMC algorithms. [55]

FERAM Devices
Figure 1d shows the FERAM device concept, which is based on ferroelectric (FE) materials where the electrostatic polarization can be reversibly switched by the application of an external electric field. Historically, the most typical FE materials include perovskite oxides such as PbZr₁₋ₓTiₓO₃ (PZT) and SrBi₂Ta₂O₉ (SBT). [56] These materials, however, have a relatively low bandgap, high leakage, and low compatibility with the CMOS process line. Most recently, FE phases of doped HfO₂ have been discovered, [57] which have revived the interest in FE phenomena and materials for both storage and IMC applications. Similar to the MTJ, an FE tunnel junction (FTJ) is able to convert a residual FE polarization into a resistance signal, by placing the FE switching layer in series with a dielectric layer. [58,59] The FTJ structure can be easily programmed by application of voltage pulses. Despite the nonfilamentary switching within the FE layer, FERAM uniformity can be affected by local variation in the coercive fields among various crystalline grains and domains within the FE material. [60]

Three-Terminal Devices
Although the two-terminal structure is strongly promising for crosspoint architectures with high densities, three-terminal devices might trade off density for other properties such as a better control of the conductance state or an easier cell selection within the array. Figure 2 shows a summary of three-terminal devices that have been considered for IMC. The Flash device (Figure 2a) is at the basis of most nonvolatile memory devices used for high-density storage in solid-state drives (SSDs). The Flash memory essentially consists of a metal-oxide-semiconductor (MOS) transistor with a floating gate (FG) between the contacted gate and the substrate. The charge stored in the FG can be electrically manipulated by high-field tunneling of electrons to/from the substrate. [61] Once stored in the FG, the charge affects the transistor threshold voltage, namely, a larger number of electrons in the FG results in a higher value of the threshold voltage. Alternatively, a different amount of charge also corresponds to a different channel conductance, which can be used as a variable resistance for IMC. This concept was used for hardware accelerators of neural networks with arrays of Flash memories. [62] Also, unsupervised learning by spike-timing-dependent plasticity (STDP) was demonstrated with Flash memories. [63,64]
Figure 2b shows the typical structure of a DRAM, which represents the standard device for working memory in digital computers. Different from the Flash memory, in a DRAM, the charge is stored at a capacitor at the gate of the conduction transistor. Another pass transistor is generally kept in the off state, except during the programming operation, when the pass transistor is switched on. The charge across the capacitor can be tuned to control the threshold voltage and conductance of the conduction transistor for analogue IMC. [65] To increase the retention time of the capacitor charge, which is around 1 ms in DRAM, the pass transistor can be fabricated with low-mobility semiconductors such as InZnGaO. [65] The lower subthreshold channel conductance helps enhance the retention time so that the analogue DRAM can be used for practical IMC applications.
The ferroelectric field-effect transistor (FEFET), shown in Figure 2c, is a transistor concept where the threshold voltage is varied by the remnant polarization in the FE gate-insulating layer. [66,67] FEFETs can be arranged with NAND architecture in either 2D [68] or 3D, [69] which may allow them to reach a density similar to that of Flash memories. The interest in FEFET has significantly increased after the discovery of FE phases in HfO₂, [57] thanks to the better CMOS compatibility of HfO₂ with respect to FE ternary/quaternary oxides.
Figure 2d shows the spin-orbit torque (SOT) MRAM, consisting of an MTJ deposited on top of a heavy metal (HM) line such as Ta [70] or Pt. [71] In the SOT-MRAM, the parallel/antiparallel states of the MTJ can be manipulated by applying an in-plane current pulse along the HM line via SOT induced by the spin Hall or Rashba effect. Sub-ns switching speed has been demonstrated with current densities in the range of a few hundred MA cm⁻². [71] The main advantage of SOT-MRAM with respect to the STT structure is that the programming operation does not involve any current across the MTJ, which was the major source of degradation and endurance failure in STT-MRAM. This advantage comes at the price of a three-terminal structure, hence a larger device area. Similar to STT-MRAM, the SOT-MRAM also typically shows binary switching between the parallel and the antiparallel state, which is not suitable for analogue-type IMC.
Figure 2e shows the ionic transistor, also known as the electrochemical random-access memory (ECRAM). In a Li-based ionic transistor, the gate dielectric consists of an ionic conductor for Li⁺ such as lithium phosphorus oxynitride (LiPON). [72] The transistor channel instead consists of a material such as LiCoO₂ where Li⁺ intercalation and deintercalation can induce a change in channel conductivity. For instance, the application of a positive gate voltage leads to Li⁺ migration and channel lithiation, which leads to a reduction in conductivity. [72] Ionic transistors have also been developed based on organic materials where H⁺ was the migrating ion. [73] The Li-based ionic transistor has shown a strong linearity where an applied gate pulse causes a fixed increase or decrease in conductivity. [74] A potential problem of the Li⁺-based synaptic transistor is the leaky gate, due to the relatively high conductance of the solid-state electrolyte. To prevent the corresponding leakage, a selector device has to be connected to the gate of the ionic transistor, which significantly increases the array complexity. [75] Another potential issue is the lack of compatibility with the CMOS process line, for which Li⁺ is considered a concern. To solve both these issues, a metal-oxide-based ionic transistor was recently proposed. [76] In this device, the migration of oxygen vacancies across a trilayer metal oxide causes a change in the conductance of the WO₃ channel. Thanks to the insulating property of the metal-oxide stack, no selector is needed in series with the gate.
Figure 2f shows the memtransistor, a contraction of memristive transistor, consisting of a MOS transistor with a 2D semiconductor channel, such as MoS₂. [77] In this structure, the application of a large source-drain voltage leads to a permanent change in conductivity as a result of the migration of grain boundaries [78] or Li⁺ impurities in the MoS₂ channel. [79] The gate can be used to control the channel conductance, e.g., to activate and deactivate the defect migration induced by the source-drain voltage. The use of a 2D semiconductor makes the memtransistor highly scalable and suitable for 3D integration in the back end.

Memory Structures
Figure 3 shows the possible memory array structures for two-terminal devices. In the one-resistance (1R) structure (Figure 3a), the memory device is tied to a row wire by the TE and a column wire by the BE or vice versa. This is the conventional crosspoint array, [15] which allows the maximum density of packing memory devices on the chip. The minimum theoretical area for the 1R device is $4F^2$, where $F$ is the lithographic feature size, which dictates the width of the row/column and their spacing. This density can be further increased in case of the 3D stacking of more crosspoints. [80] For instance, the effective device area becomes $2F^2$ for a two-layer crosspoint and only $F^2$ for a four-layer crosspoint. Both horizontal stacking of crosspoint arrays and vertical arrays can be realized, the latter achieving a higher density, thanks to the increased stackability due to the easier patterning process of vertical wires. [81] Thanks to the close packing of the crosspoint structures, a memory density of 4.5 Tb per square inch has been demonstrated in one layer. [24] Assuming that conductance values $G_{ij}$ are stored in the memory devices at row $i$ and column $j$, the application of column voltages $V_j$ will induce a current $G_{ij} V_j$ in each device, according to Ohm's law. Based on Kirchhoff's law, the row current reads
$I_i = \sum_j G_{ij} V_j$ (1)
[10][11] The passive crosspoint array is thus capable of executing a parallel MVM in the analogue domain, which would instead require a huge number of multiply-accumulate (MAC) operations in a conventional digital computer.
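The analogue MVM described above, where each row current is the Kirchhoff sum of the Ohmic currents $G_{ij} V_j$, can be illustrated with a minimal Python sketch. The function name `crosspoint_mvm` and the numerical values of the conductances and voltages are hypothetical and chosen only for illustration; wire resistance and device nonidealities are ignored.

```python
def crosspoint_mvm(G, V):
    """Ideal crosspoint MVM: row current I_i = sum_j G[i][j] * V[j].

    G is a list of rows of conductances (siemens), V a list of column
    voltages (volts). Parasitic wire resistance is neglected.
    """
    if any(len(row) != len(V) for row in G):
        raise ValueError("each row of G needs one conductance per column voltage")
    return [sum(g * v for g, v in zip(row, V)) for row in G]


# Hypothetical 2x3 array (values are illustrative assumptions).
G = [[10e-6, 20e-6, 30e-6],
     [5e-6, 0.0, 15e-6]]
V = [0.1, 0.2, 0.3]

I = crosspoint_mvm(G, V)  # approximately [14e-6, 5e-6] amperes
```

In the physical array all products and sums occur simultaneously, whereas this digital emulation needs one MAC operation per device.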
During MVM, voltages are applied simultaneously to all columns whereas currents are collected at the grounded row terminals. Ideally, assuming negligible voltage drop as a result of parasitic wire resistances, the MVM operation should not suffer from any cell-to-cell disturb or sneak-path effect. [82] On the other hand, when individual devices are programmed, such as for executing forming, set, and reset operations in the array, disturbs might become a significant problem. For instance, application of a positive voltage at a certain column of a crosspoint array of RRAM devices might potentially induce set operation on all cells in the row, unless specific biasing schemes are adopted.
Figure 3. Illustration of memory array structures for two-terminal devices. a) Passive crosspoint array, consisting of 1R elements with conductance $G_{ij}$, each connected between a row and a column. b) V/2 biasing scheme for 1R arrays, where a voltage $V$ is applied across the selected cell (blue), whereas half-selected cells (red) sharing the row/column of the selected cells are biased at voltage $V/2$. c) 1S1R array, where each memory element is connected to an individual selector to prevent sneak paths. d) 1T1R array, where the select transistors allow the selection of a cell at the crossing between the selected wordline and bitline.
[85] In this biasing scheme, voltages $V/2$ and $-V/2$ are applied to the selected column and row, respectively, whereas all other lines are grounded. As a result, the bias voltage across the selected cell is $V$, whereas all other unselected cells are biased at 0 V, and half-selected cells, sharing the same row or column of the selected cell, are biased at $V/2$ or $-V/2$. The voltage drop across unselected and half-selected cells is therefore significantly lower than the one across the selected cell, thus preventing any disturb within the array. To read an individual cell, a voltage $V_R$ is applied to the cell row, whereas all other rows and all columns are grounded. The current at the selected column will reveal the resistance $R$ of the selected cell according to $I = V_R/R$. [85] Note that reading individual cells is essential to make sure that a conductance value $G_{ij}$ is stored correctly in the crosspoint array.
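The V/2 biasing scheme can be checked with a short Python sketch that computes the voltage across every cell of a passive array. The function name `v_half_drops` is hypothetical, and the cell voltage is defined here, as an assumption, as column potential minus row potential; with this convention both kinds of half-selected cells see $+V/2$, while the opposite convention would flip the sign.

```python
def v_half_drops(n_rows, n_cols, sel_row, sel_col, V):
    """Voltage across each cell under the V/2 scheme.

    The selected column is held at +V/2, the selected row at -V/2,
    and all other lines are grounded. The cell voltage is taken as
    (column potential - row potential), an assumed sign convention.
    """
    col_v = [V / 2 if j == sel_col else 0.0 for j in range(n_cols)]
    row_v = [-V / 2 if i == sel_row else 0.0 for i in range(n_rows)]
    return [[col_v[j] - row_v[i] for j in range(n_cols)]
            for i in range(n_rows)]


# Hypothetical 3x3 array, selecting the cell at row 1, column 2 with V = 2 V.
drops = v_half_drops(3, 3, sel_row=1, sel_col=2, V=2.0)
# drops[1][2] is the full V; cells sharing row 1 or column 2 see V/2;
# all remaining cells see 0 V.
```

The sketch ignores line resistance, so it only illustrates the ideal voltage partition and not the standby current drawn by the half-selected cells.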
Although the set, reset, and read operations appear feasible with the V/2 biasing scheme, the 1R crosspoint architecture becomes impractical because of the large standby current flowing during set and reset. Also, as the selected device is being set at $V$, the voltage $V/2$ and the corresponding current flowing across half-selected devices in the LRS might be sufficiently high to disturb the device, thus modifying the previously stored conductance.

1S1R Structure
[88] In this structure, the memory device is connected to a selector device with a strongly nonlinear I-V characteristic, where the current is virtually zero below a threshold voltage $V_t$. As a result, as a voltage $V > V_t$ is applied to the selected cell to induce set/reset processes, the half-selected voltage $V/2 < V_t$ will not induce any disturb. Both silicon-based and nonsilicon-based selectors have been proposed, the latter category being favored as it enables the back-end-of-line (BEOL) process and 3D stacking. [97] OTS selectors are characterized by low subthreshold leakage, large $V_T$, and negative differential resistance (NDR), which allows an excellent nonlinearity factor of several orders of magnitude between the off-state and on-state currents. In addition, OTS shows good endurance above $10^{11}$ cycles [98] and the ability for stacking at least two layers. [95,97]
The 1S1R concept is very promising for creating a new memory market named storage-class memory (SCM), combining nonvolatile storage, a density higher than DRAM, and a performance better than Flash memories. Because of these properties, the 1S1R structure seems an ideal vehicle for IMC applications, although the nonlinear behavior of threshold-switching selectors and the corresponding large current in the on state have to be carefully considered in the architecture design.
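The two conditions stated for the selector, $V > V_t$ on the selected cell and $V/2 < V_t$ on half-selected cells, together define an operating window $V_t < V < 2V_t$ for the applied voltage. A minimal Python sketch of this check follows; the function names are hypothetical, and the model is idealized in that it treats the selector as a perfect switch and neglects the voltage needed by the memory element itself.

```python
def selector_window(v_t):
    """Return the (lower, upper) bounds on V for disturb-free selection.

    Derived from V > V_t (selected cell turns on) and V/2 < V_t
    (half-selected cells stay off), assuming an ideal threshold selector.
    """
    return (v_t, 2.0 * v_t)


def is_disturb_free(v, v_t):
    """True if V switches the selected cell without turning on half-selected ones."""
    low, high = selector_window(v_t)
    return low < v < high


# Hypothetical threshold of 1.0 V: 1.5 V is inside the window,
# 2.5 V would also turn on half-selected cells, 0.8 V selects nothing.
examples = [is_disturb_free(v, 1.0) for v in (1.5, 2.5, 0.8)]
```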

1T1R Structure
Figure 3d shows the one-transistor/one-resistor (1T1R) structure, where the memory device is connected to an MOS transistor for selection. With respect to the 1R and the 1S1R structures, the 1T1R structure is more complicated in that a third terminal and a corresponding wire must be dedicated to the transistor gate. The presence of the gate terminal makes the selection and unselection of the array device extremely straightforward. The gate line is perpendicular to the TE line; therefore, only the device at the intersection between the selected gate line and the selected TE line is addressed during set, reset, and read. In addition, the transistor allows for a proper current limitation during forming and set transition of RRAM devices to control the resistance state of the LRS. [99,100] During reset and read, instead, the gate terminal is biased to a relatively high voltage to reduce the parasitic resistance of the MOSFET, which might degrade the precision and dynamic range of the conductance G for analogue MVM. The larger flexibility, however, comes at the expense of a larger device area and higher complexity of the array. Despite these drawbacks, the 1T1R structure is by far the preferred structure for IMC applications.
The circuit structures of Figure 3 are limited to two-terminal devices, although the 1T1R structure can be adapted for three-terminal devices, such as a three-terminal Flash memory array. This is the so-called NOR structure, where applying a pulse at a given gate (word) line and a given drain (bit) line results in the programming of the device, without affecting all other devices in the array. In general, however, dedicated array structures might be needed for correct programming, reading, and computing with three-terminal devices.

Computational Memory Programming
One of the strongest advantages of IMC is the ability to parallelize analogue MVM within a memory array, according to Figure 3a. [11][100] Each layer of the network can be thus mapped into a memory array, where each memory element stores a synaptic weight. On the other hand, nonlinear activation functions are generally achieved by an external analog or digital circuit. Similarly, memory-based MVM in the crosspoint array can accelerate other types of computations, such as linear algebra and image processing. [10] For all these IMC applications, which we refer to as "computational memory," the device requirements are different from those of a simple memory, in at least three aspects. First, a high precision in the stored conductance values $G_{ij}$ of the computational memory is essential, to compete with the floating-point precision of digital MAC. Although such a high precision of conductance is not strictly necessary for memory or storage applications, which are generally limited to 1- or 2-bit precision, the analogue-type accuracy of conductance is instead a key requirement for IMC. The second requirement is that of a relatively high resistance, to limit the overall summation of all the individual computational memory currents according to Kirchhoff's law. In fact, a large current would result in a large size of the transistor for column selection. To reduce the current, each computational memory device should have a relatively high resistance, which would also help reduce the parasitic voltage drop across the array rows/columns. On the other hand, the read current for memory applications should be as large as possible, to enable fast random readout and easy design of the sense amplifiers (SAs). The third difference which distinguishes computational and conventional memories is the required performance in terms of the programming time. The programming time for a computational memory element is generally relaxed with respect to the case of the conventional memory, as programming is needed only at the beginning of the IMC operations and state reconfiguration is generally rare. This is the case of "offline training," where the conductance matrix $G_{ij}$ needed to solve a certain task is stored at time zero in the memory array and later reconfigured only if/when needed. The values G might consist of either input data obtained from sensors, e.g., the genes from a DNA sequencer, or synaptic parameters obtained from the backpropagation algorithm to train a fully connected neural network.
In contrast to offline training, the "online training" approach consists of iteratively adjusting the memory conductance directly on the hardware memory array, e.g., by adopting standard gradient descent techniques such as the backpropagation algorithm. This approach makes it possible to take advantage of the IMC energy benefit in both the training and the inference tasks.

Offline Training
To address offline training procedures and the corresponding sources of nonideality, we consider an RRAM device with 1T1R structure. [101] Figure 4a shows the I-V characteristics of the RRAM device for increasing gate voltage of the select transistor. The RRAM device consists of an active HfO₂ layer sandwiched between a Ti TE and a C BE, the latter connected to the drain of the transistor according to the structure in the inset of Figure 4. Set transition takes place as the applied voltage across the 1T1R structure reaches a characteristic voltage $V_{set}$ of about 2.2 V.
During set transition, the gate voltage controls the saturated transistor current, which in turn controls the final conductance of the LRS. [99] Then, the application of a negative voltage causes the reset transition to the HRS.
From the results in Figure 4a, the gate voltage appears to be the most suitable parameter to control the conductance G of the RRAM for IMC applications. This is confirmed in Figure 4b, which reports the measured G after the application of a set pulse with increasing gate voltage $V_G$. [101] The individual traces for 100 experiments from the same device are shown and compared with the average conductance. The average conductance increases almost linearly with $V_G - V_T$, where $V_T = 0.7$ V is the threshold voltage of the transistor. However, the individual traces display noisy characteristics due to the stochastic ionic migration during the physical set process. [102,103] Figure 4c shows the distributions of G for increasing $V_G$, indicating a normal shape with a standard deviation $\sigma_G = 3.8$ μS, independent of the programming level. These results suggest that accurate program/verify algorithms are needed to correctly tune the conductance for IMC.
In addition to the cycle-to-cycle variability displayed by individual devices, there is also a device-to-device variability arising from differences in the composition, structure, and geometry of the cells within the array. Figure 4d shows the distributions of read current at $V_{read} = 0.5$ V for RRAM cells with the HfO₂ switching layer, which were programmed with four different levels (L2-L5) of compliance current. [100] The lowest current level L1 corresponds instead to the HRS. All distributions show a significant variation in current, which includes both cycle-to-cycle and device-to-device contributions.
Figure 4. a-c) Reproduced with permission. [101] Copyright 2020, IEEE. d) Cell-to-cell distributions of measured current at $V_{read} = 0.5$ V in a 1T1R array for five programmed levels. Reproduced with permission. [100] Copyright 2019, AIP Publishing. e) Time-dependent fluctuations of resistance R for an RRAM device in HRS, indicating both RW and RTN phenomena. Reproduced with permission. [110] Copyright 2015, IEEE. f) Resistance drift of a PCM device programmed at four levels. Adapted with permission. [48] Copyright 2013, IEEE.
Programming variability effects can be alleviated or even suppressed by accurate program-verify techniques. For instance, in the scheme of Figure 4b, one might gradually increase the gate voltage to reach a certain target G. If G is exceeded by an error ΔG which exceeds the tolerable window, corresponding, e.g., to an accuracy of 8 bits, then the device can be reinitialized to the HRS and a new $V_G$ ramp is attempted. Instead of restarting from the HRS, one might apply suitable negative voltage pulses to gradually decrease G, until the error becomes smaller than the tolerance. [104,105] This approach takes advantage of the ability of RRAM devices to gradually increase and decrease G by application of positive and negative voltage pulses, respectively. Although the energy and time needed to conduct such an accurate program-verify technique may be considerable, the overhead might still be tolerable, as long as the device conductance is not frequently updated. For instance, some memory arrays might be programmed only once for neural network accelerators, so that the programming time/energy might be amortized over the whole chip lifetime.
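The first program-verify scheme described above (ramp the gate voltage, verify, and restart from the HRS on overshoot) can be sketched in Python as follows. This is a toy model under stated assumptions: the conductance after each set pulse is drawn from a normal distribution whose mean grows linearly with $V_G - V_T$ and whose standard deviation is the 3.8 μS reported in the text, while the slope `SLOPE`, the HRS conductance `G_HRS`, the voltage step, and all function names are hypothetical.

```python
import random

V_T = 0.7          # transistor threshold voltage (V), from the text
SIGMA_G = 3.8e-6   # cycle-to-cycle spread of G (S), from the text
SLOPE = 100e-6     # assumed dG/dV_G (S/V), hypothetical value
G_HRS = 1e-6       # assumed HRS conductance (S), hypothetical value


def set_pulse(v_g, rng):
    """Toy set operation: G is normally distributed around SLOPE * (V_G - V_T)."""
    mean = SLOPE * max(v_g - V_T, 0.0)
    return max(rng.gauss(mean, SIGMA_G), 0.0)


def program_verify(target, tol, rng, v_start=0.8, v_step=0.01,
                   v_max=3.0, max_pulses=10000):
    """Ramp V_G until |G - target| <= tol; on overshoot, reset and restart.

    Returns (final conductance, number of set pulses, success flag).
    """
    v_g = v_start
    g = G_HRS
    for pulses in range(1, max_pulses + 1):
        g = set_pulse(v_g, rng)          # program
        if abs(g - target) <= tol:       # verify
            return g, pulses, True
        if g > target + tol or v_g >= v_max:
            g = G_HRS                    # reset to HRS, start a new ramp
            v_g = v_start
        else:
            v_g += v_step                # undershoot: raise the gate voltage
    return g, max_pulses, False


rng = random.Random(0)  # fixed seed for repeatability
g_final, n_pulses, ok = program_verify(target=50e-6, tol=1e-6, rng=rng)
```

Because the tolerance in this example is smaller than the assumed spread, several ramps are typically needed, which illustrates the time and energy overhead mentioned above. The alternative scheme with gradual negative pulses is not modeled here.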
[108] Another possibility is that the RRAM device cannot be formed, thus resulting in an extremely low value of G, even lower than the HRS value. In all these cases, the matrix $G_{ij}$ cannot be stored correctly in the memory array. These problems can be solved with suitable redundancy schemes, where the individual cell, or most typically its entire row/column, is disabled and replaced by a spare one. Error-tolerant online training schemes have also been proposed to correctly compensate these stuck memory elements. [109]
Even if the programming operation appears successful at time zero, the conductance might still change after the programming step as a result of subsequent relaxation or fluctuation of the microscopic structure of the device. Figure 4e shows a typical fluctuation of resistance, following a reset pulse on an RRAM device. [110] Three devices with the same initial resistance were chosen and measured at increasing time. The devices show abrupt steps of resistance, referred to as random walk (RW) and random telegraph noise (RTN). As a result, the cell resistance can increase, decrease, or stay unchanged.
Another typical phenomenon of unstable resistance is the drift process of PCMs. Figure 4f shows the measured resistance of PCM as a function of time after the reset process for four different levels of an MLC. Various resistance levels in the PCM can be obtained, e.g., by amorphizing an increasing volume of the PCM. [111] The resistance increases with the amount of amorphous volume in the PCM, as the amorphous phase has a higher resistivity than the crystalline one. The increase of the PCM resistance with time in the figure can be attributed to the structural relaxation of the amorphous phase, [112] consisting of an annihilation of defects, such as Ge-Ge wrong bonds, [113] and the consequent increase in the mobility gap. [114] Both resistance fluctuation and drift clearly represent significant problems for analogue MVM, where the conductance G of all elements in the array should remain stable.

Online Training
Figure 5a shows a typical three-layer multilayer perceptron (MLP), where input signals propagate from left to right. In the forward propagation, a neuron $n_j$ of a generic layer generates a signal $x_j$ that is sent out to all output neurons $m_i$ in the next layer after multiplication with the synaptic weights $w_{ij}$ connecting neuron $n_j$ with neuron $m_i$. The signal received by any neuron $m_i$ is given by the accumulation of all weighted signals from the previous layer, which thus reads

$$y_i = \sum_j w_{ij}\, x_j \qquad (2)$$

This formula perfectly matches Equation (1), namely the analogue MVM executed by the memory array of Figure 3a.15,116] It has been estimated that, thanks to the suppression of data movement in the IMC architecture, the energy consumption is reduced by more than 10 000 times in an RRAM array with respect to the conventional MAC approach in digital computers. [117] To correctly map a neural network onto a memory array, however, the conductance G should be able to implement both positive and negative values of the synaptic weight $w_{ij}$. To this purpose, two circuits are generally adopted. In the first circuit, the current $I = VG$ is compared with the current $I_{\mathrm{ref}} = VG_{\mathrm{ref}}$, obtained from a reference cell biased at the opposite voltage (Figure 5b). The two currents are subtracted directly by Kirchhoff's law, and the resulting current can be used to feed the activation function of the output neuron, together with all current contributions from other synapses. In this scheme, the effective synaptic weight is given by $G - G_{\mathrm{ref}}$, which can thus be positive or negative depending on the value of G with respect to $G_{\mathrm{ref}}$. In the second circuit, the synaptic weight is mapped by a pair of conductances $G^+$ and $G^-$, which are biased at positive and negative voltages, respectively. [115,116] The equivalent conductance is $G^+ - G^-$, which can again have either a positive or negative sign.
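The differential mapping of signed weights onto a conductance pair can be illustrated with a short NumPy sketch. The function names, the linear weight-to-conductance scaling, and the 1 to 100 µS window are assumptions made for the example; the source only states that the equivalent conductance is $G^+ - G^-$.

```python
import numpy as np

def weights_to_conductances(w, g_min=1e-6, g_max=1e-4):
    """Map signed weights to a (G_plus, G_minus) pair of positive conductances.

    The linear scaling and the conductance window are illustrative assumptions.
    """
    scale = (g_max - g_min) / np.max(np.abs(w))
    g_plus = g_min + scale * np.clip(w, 0, None)    # carries the positive part
    g_minus = g_min + scale * np.clip(-w, 0, None)  # carries the negative part
    return g_plus, g_minus, scale

def differential_mvm(g_plus, g_minus, v):
    """Output currents when G_plus is biased at +v and G_minus at -v."""
    return g_plus @ v + g_minus @ (-v)              # currents add by Kirchhoff's law

w = np.array([[0.5, -1.0], [-0.25, 0.75]])
v = np.array([0.1, 0.05])
g_plus, g_minus, scale = weights_to_conductances(w)
currents = differential_mvm(g_plus, g_minus, v)
```

Dividing `currents` by `scale` recovers the signed product `w @ v`, even though every stored conductance is positive.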
The memory array can accelerate not only the forward propagation from input to output layers during the inference mode, but also the so-called backpropagation algorithm for online training. [115,118] In this approach, the synaptic weights are updated after the submission of a whole (or part of the) dataset, and the iterative repetition of the update allows the error to be minimized and the accuracy of the network to be improved. Referring to the network of Figure 5a, the online training process consists of three phases, namely 1) forward propagation, 2) backward propagation, and 3) weight update. In the first operation, an input sample of the dataset is presented at the input and propagated throughout the network, thus leading to results $y_j$ appearing at the output layer. These results are compared to the ideal results $o_j$, thus yielding a set of errors $\delta_j = y_j - o_j$. At this point, one should backpropagate the error and update the value of each synaptic weight $w_{ij}$, according to the weight update rule of Equation (3), in which the weight change $\Delta w_{ij}$ is proportional to the product $\eta\, x_i\, \delta_j$, where $x_i$ is the signal at the synapse during the forward propagation and η is the learning rate. [118,119] In this scheme, the weight must be updated with the least amount of time and energy, for best efficiency of the online training process. Thus, the weight should be updated without any preliminary read or following verify pulse; rather, a single update pulse at fixed voltage and time should be applied.
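The three phases of online training can be sketched for a single linear layer as follows. This is a minimal software illustration, not the hardware procedure of refs. [115,118]: the function name and the toy data are hypothetical, the array index convention is `w[output, input]`, and the minus sign in the update is the usual gradient-descent choice for an error defined as output minus target, which is an assumption about the sign convention of Equation (3).

```python
import numpy as np

def online_training_step(w, x, o, eta=0.1):
    """One forward / backward / update cycle for a single linear layer."""
    y = w @ x                                 # 1) forward propagation (the in-memory MVM)
    delta = y - o                             #    output error, delta = y - o
    backpropagated = w.T @ delta              # 2) backward propagation (transposed MVM)
    w_updated = w - eta * np.outer(delta, x)  # 3) update proportional to eta * x * delta
    return w_updated, delta, backpropagated

w = np.zeros((2, 3))
x = np.array([1.0, 0.5, -0.5])                # toy input sample (hypothetical)
o = np.array([0.2, -0.4])                     # toy ideal output (hypothetical)
for _ in range(100):
    w, delta, _ = online_training_step(w, x, o)
```

Note that the update uses only the locally available `x` and `delta`, which is what makes a blind, single-pulse update without read or verify attractive in hardware.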
To test the compatibility of a memory device with online training, the standard approach consists of the application of a train of positive voltage pulses for weight increase, followed by a train of negative voltage pulses for weight decrease. This is shown in Figure 6 for a typical bipolar switching memory capable of weight update under both positive and negative voltage pulses. Figure 6a shows the ideal behavior of the memory device, where the conductance G increases and decreases linearly with the number of pulses. In this case, the weight update $\Delta G = \Delta w_{ij}$ is constant, irrespective of the initial conductance G, thus allowing for a weight update according to Equation (3) without any preliminary measurement of G. In general, however, memory devices show a nonlinear weight update, such as the one shown in Figure 6b. Here, the initial pulses cause a steep increase in conductance, followed by a saturation as more pulses are applied. The same occurs for negative pulses. This is the behavior generally observed for bipolar RRAM devices. [116] In this implementation, the synapse can have the structure of Figure 5b where $G_{\mathrm{ref}}$ is kept constant, whereas G is increased or decreased to change the overall synaptic weight.
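The steep-then-saturating response of Figure 6b can be mimicked with a simple phenomenological model. The exponential-saturation form below, in which each pulse closes a fixed fraction `alpha` of the remaining distance to the conductance limit, is a common modelling assumption and is not taken from the source; the normalized window and parameter values are illustrative.

```python
def pulse_train(n_up, n_down, g_min=0.0, g_max=1.0, alpha=0.1):
    """Conductance after each pulse of a potentiation/depression train.

    Assumed model: each pulse covers a fraction alpha of the distance
    between the present conductance and the limit it is driven toward.
    """
    g = g_min
    trace = [g]
    for _ in range(n_up):                 # positive pulses: G rises toward g_max
        g += alpha * (g_max - g)
        trace.append(g)
    for _ in range(n_down):               # negative pulses: G falls toward g_min
        g -= alpha * (g - g_min)
        trace.append(g)
    return trace

trace = pulse_train(n_up=20, n_down=20)
first_step = trace[1] - trace[0]          # large step at low conductance
last_up_step = trace[20] - trace[19]      # much smaller step near saturation
```

Because the step size depends on the present conductance, the same pulse no longer produces the same weight change, which is why a blind update following Equation (3) becomes inaccurate.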
In addition to nonlinear update, the weight increase and decrease might also display asymmetric shapes due to different linearity factors for positive and negative applied pulses (Figure 6c). A consequence of the asymmetric weight update is that more pulses might be needed to increase the conductance by ΔG than to decrease it by the same amount. There is only one conductance value $G_{\mathrm{sym}}$, in general, where the derivatives of the increase and decrease characteristics are the same. [120] In the zero-shifting technique, the reference conductance $G_{\mathrm{ref}}$ is chosen to be equal to $G_{\mathrm{sym}}$, so that the symmetric response is obtained for $G_{\mathrm{ref}} = G_{\mathrm{sym}}$, corresponding to $G = 0$. [76,120] An extreme case of asymmetric update is the PCM device, where G can gradually increase via crystallization, whereas the conductance decrease induced by phase amorphization is generally abrupt and nongradual. [118] In this case, the synaptic weight has the structure of Figure 5c, where the crystallization-induced increase in $G^+$ causes an overall increase in weight, whereas the crystallization-induced increase in $G^-$ causes an overall decrease in weight. A change in G can thus be achieved by unidirectional updates in $G^+$ and $G^-$, i.e., an increase of G is achieved by an increase in $G^+$, whereas a decrease of G is achieved by an increase in $G^-$. A significant problem of the unidirectional update scheme is the limited increase in $G^+$ and $G^-$, which can never exceed the maximum value corresponding to the fully crystalline state.
[118] When one of the two conductances reaches the maximum value, a reset operation is necessary to allow for further update operations. For instance, if $G^+$ reaches the maximum value $G_{\max}$, then both $G^+$ and $G^-$ should be reduced to keep a constant $G = G^+ - G^-$, while allowing for a further increase in $G^+$. This type of unidirectional update is schematically reported in the diamond plot of Figure 7a, showing $G^+$ as a function of $G^-$ on axes rotated by ±45°. In the diamond plot, the net value G is represented by the position along the vertical axis. For a unidirectional device, where $G^+$ and $G^-$ can only increase, the position on the plot can only move toward the right along the $G^+$ or $G^-$ axis. [74][75][76] In the bidirectional case (Figure 7b), the position on the diamond plot can move in any direction; thus, resetting to a lower G is generally not necessary.
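The unidirectional update with occasional reset can be sketched as follows. The function `update_pair`, the normalized conductance units, and the specific refresh policy (reset both devices, then reprogram only the net value on one of them) are assumptions chosen for illustration; the source only states that both conductances must be reduced while keeping $G^+ - G^-$ constant.

```python
def update_pair(g_plus, g_minus, dw, g_max=1.0):
    """Apply a signed weight change dw using only conductance increases."""
    grow_plus = dw >= 0
    needed = (g_plus if grow_plus else g_minus) + abs(dw)
    if needed > g_max:
        # Refresh (assumed policy): reset both devices and reprogram the net
        # weight on a single device, which preserves G_plus - G_minus.
        net = g_plus - g_minus
        g_plus, g_minus = max(net, 0.0), max(-net, 0.0)
    if grow_plus:
        g_plus = min(g_plus + abs(dw), g_max)    # weight increase: raise G_plus
    else:
        g_minus = min(g_minus + abs(dw), g_max)  # weight decrease: raise G_minus
    return g_plus, g_minus

g_plus, g_minus = 0.0, 0.0
for _ in range(4):                               # alternate +0.3 and -0.2 updates
    g_plus, g_minus = update_pair(g_plus, g_minus, +0.3)
    g_plus, g_minus = update_pair(g_plus, g_minus, -0.2)
net_weight = g_plus - g_minus
```

In this example both conductances creep upward although the net weight grows by only 0.1 per cycle, so a refresh is triggered in the fourth cycle; the net weight is unaffected by it.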
In general, the memory conductance does not only have an upper limit $G_{\max}$, but also a lower limit $G_{\min}$, which possibly creates an additional constraint on the net conductance G. This is schematically shown in Figure 6d, indicating a bidirectional update of G limited between $G_{\min}$ and $G_{\max}$. In such a case, the differential synapse of Figure 5c is useful, as the zero conductance $G = 0$ can be achieved by carefully tuning $G^+$ and $G^-$ so that equal values are obtained to ensure the weight annihilation according to $G^+ - G^- = 0$. While this situation is straightforward with $G^+ = G^- = 0$, the presence of a minimum G might make the achievement of null G rather difficult.
Other sources of nonideality are the stochastic variations of conductance of Figure 6e, where an applied pulse can cause a relatively large, random increase or decrease in conductance, similar to Figure 4b. The weight-update granularity (i.e., the dynamic range of conductance is covered by only a few individual increase/decrease steps) and stochasticity (i.e., the amplitude of each step is random) prevent the fine control of the weight, hence degrading the network accuracy. A possible solution to large granularity as well as asymmetric weight update is the hybrid CMOS/PCM synapse of Figure 8. The hybrid synapse includes two differential synapses, one storing the most significant pair (MSP) and the other storing the least significant pair (LSP). Each element of the LSP synapse is a three-transistor, one-capacitor element for linear weight update, whereas the differential MSP synapse consists of two PCM memories with a 1T1R structure with nonvolatile storage. [122,123] In this way, the fine weight update is conducted in the highly linear capacitor with conductance g, which is then periodically aggregated to the PCM weight of conductance G. At each time, the conductance is given by $FG + g$, where F is a gain factor, usually $F = 3$. This circuit structure largely improves the accuracy of online training toward the one achieved by software offline training in a previous study. [122] While the gradual update of the synaptic weight is generally beneficial for offline and online training, some memory devices show binary switching with abrupt increase and decrease in conductance, as shown in Figure 6f. This is the case for STT-MRAM, for instance, where magnetic polarization switches as a macrospin throughout the whole device area; thus, partial polarizations are generally not possible.
[124][127] In this case, the resulting neural network is inherently digital, which is referred to as the binarized neural network (BNN). Note that the gradual update of Figure 6a-e is not possible in BNNs, thus making online training particularly challenging. A stochastic version of online training can still be conducted in BNNs, utilizing RRAM devices where an internal state variable can be controlled by the application of voltage pulses. [125] Two synaptic weights can thus be associated with the RRAM device, namely, an internal, nonobservable weight $W_{\mathrm{int}}$ and an external, measurable $W_{\mathrm{ext}}$. The internal weight maps the state variable of the device, e.g., the defect density and configuration within the filament region in Figure 9a, whereas $W_{\mathrm{ext}}$ is the device conductance, which is high for the filament connecting the two electrodes and zero for all other configurations. [125] The application of pulses results in a continuous change of $W_{\mathrm{int}}$, although $W_{\mathrm{ext}}$ will only change as $W_{\mathrm{int}}$ reaches a certain threshold. Note that the transition across the threshold is highly stochastic, as a certain $W_{\mathrm{int}}$ can correspond to various configurations of defects. The BNN can thus be trained with the backpropagation algorithm, similar to an analogue network. [125] In a similar approach, $W_{\mathrm{ext}}$ can be generated based on a measurable $W_{\mathrm{int}}$, thus combining the benefits of the gradual update of the analogue weight and higher precision of the BNN. [126]

Figure 7. Illustration of the weight update for the differential synaptic memory of Figure 5c. a) Unidirectional update characteristic, where $G^+$ and $G^-$ show a gradual increase and abrupt decrease. b) Bidirectional update characteristic, where $G^+$ and $G^-$ show both gradual increase and gradual decrease. In both cases, the equivalent conductance $G = G^+ - G^-$ can be seen on the vertical axis, while $G^+$ and $G^-$ are measured along the axes at +45° and −45°, respectively, with respect to the horizontal axis.

Figure 8. Illustration of the hybrid CMOS/resistive synapse. The hybrid synapse includes a differential synapse with two three-transistor, one-capacitor elements for linear weight update, combined with a differential synapse with two 1T1R elements of PCM devices. The weight update term $g^+ - g^-$ contains the LSP in volatile memories, whereas the equivalent conductance $G^+ - G^-$ stores the MSP in nonvolatile memories. The total equivalent weight is given by $F(G^+ - G^-) + g^+ - g^-$, where a gain $F = 3$ is usually assumed. Reproduced with permission. [123] Copyright 2019, RSC Publishing.

Another approach to online training with binary switching devices is the concept of multidevice synapse, where a single synapse including several binary devices in parallel effectively behaves as an analogue synapse. [127] Figure 9b shows simulation results for the update characteristics for an increasing number of memory elements. As the number of devices increases, the synapse update becomes increasingly analogue, thanks to the stochastic switching of individual elements. [127] In general, multidevice synapses also benefit from the better averaging of stochastic variations (Figure 6e), thus improving the weight controllability and the resulting network accuracy. [128] As a final remark, the main advantages of online training of neural networks are 1) the energy efficiency, thanks to conducting the computation in the memory, thus taking advantage of in-memory MVM for forward propagation, and 2) the possibility of adapting the training to the specificity of the memory array, e.g., the presence of defects and device-to-device variations.
[129] At the same time, online training for each individual neural network becomes energetically unfeasible; thus, the best approach is to conduct online training on a specific task on a master neural network, and then transfer all synaptic weights to all other hardware samples. Techniques for defect-aware training have been proposed, e.g., by introducing random stuck short/open devices within the simulated network. [109]
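The multidevice synapse discussed in this section, where several binary devices in parallel act as one analogue weight, can be illustrated with a small Monte Carlo sketch. The switching probability per pulse, the unit on-state conductance, and the function name are assumptions for the example and do not reproduce the simulation of ref. [127].

```python
import random

def multidevice_potentiation(n_devices, n_pulses, p_switch=0.2, g_on=1.0, seed=1):
    """Total conductance of n_devices parallel binary devices after each pulse.

    Each device that is still off switches on with probability p_switch at
    every pulse (assumed stochastic model); devices that are on stay on.
    """
    rng = random.Random(seed)
    states = [0] * n_devices
    trace = []
    for _ in range(n_pulses):
        states = [1 if s or rng.random() < p_switch else 0 for s in states]
        trace.append(g_on * sum(states))   # parallel conductances add up
    return trace

single_device = multidevice_potentiation(n_devices=1, n_pulses=30)
sixteen_devices = multidevice_potentiation(n_devices=16, n_pulses=30)
```

A single device can only take two values, whereas sixteen devices give up to seventeen levels, so the aggregate update looks progressively more gradual as devices are added.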

IMC Circuit Nonidealities
Various nonidealities at the device level, such as device variations, fluctuations, drift, and stuck open/short states, all affect the performance of the IMC circuit. For instance, the accuracy of the neural network, namely, the ability to recognize objects or speech, might be degraded with respect to the ideal software accuracy for a certain set of synaptic weights. It has been shown that neural networks with a relatively large number of neurons for each layer display the highest resilience to variations, thanks to the better parallelism and the larger number of parameters to represent the data at each layer. On the other hand, relatively deep neural networks are instead more prone to device variations, due to the accumulation of errors during feed-forward propagation along the numerous layers of the deep neural network. [130] In addition to device nonidealities, array parasitics can also represent a serious concern for the IMC circuits. One of the major sources of circuit nonideality is the parasitic wire resistance in the array, causing current-resistance (IR) drop along the rows and columns of the memory array. This is shown in Figure 10a, where the wire resistance r between each cell is evidenced. Assuming a typical read voltage of 0.1 V, which is limited by noise, possible offsets of the voltage references and amplifiers, and possible mismatches in the CMOS periphery, and assuming an average device resistance $R = 100\ \mathrm{k\Omega}$, each device is expected to carry an average current $I = 1\ \mathrm{\mu A}$. Assuming the same current I for each device, the overall voltage drop across the wire [131] is around 5 mV, which is a significant contribution to the overall $V_R$. In addition to the large IR drop, the large total current NI also raises concerns in terms of power consumption, size of the decoder transistors, and of the SAs.
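The order of magnitude of the IR drop quoted above can be checked with a short calculation. The array size N = 100 and the wire resistance r = 1 Ω per segment are assumptions chosen for this sketch because they reproduce a drop of about 5 mV; the simple model in which segment k carries the current of k cells is likewise an assumption.

```python
def ir_drop(n_cells, r_wire, v_read=0.1, r_cell=100e3):
    """Cumulative voltage drop along one line of an array (simplified model).

    Every cell is assumed to draw the same current v_read / r_cell, and the
    k-th wire segment (counted from the far end) carries the current of k cells.
    """
    i_cell = v_read / r_cell                       # 1 uA for 0.1 V and 100 kOhm
    return sum(k * i_cell * r_wire for k in range(1, n_cells + 1))

drop = ir_drop(n_cells=100, r_wire=1.0)            # assumed N = 100, r = 1 Ohm
relative = drop / 0.1                              # fraction of the read voltage
```

With these assumed values the drop is about 5% of the 0.1 V read voltage, and it grows roughly with the square of the array size.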
Figure 9. a) The defect configuration in the filamentary path is described by $W_{\mathrm{int}}$, whereas the connection/disconnection of the filament to the TE/BE dictates the binary value of $W_{\mathrm{ext}}$. Adapted with permission. [125] Copyright 2017, IEEE. b) Multidevice synapse, where the combination of the conductance of various binary memory devices can lead to an overall analogue synapse suitable for gradual weight update. Reproduced with permission. [127] Copyright 2015, IEEE.

To reduce the line current and the corresponding IR drop, the average device resistance should be increased as much as possible, e.g., in the MΩ [115] or GΩ range. [65] A large device resistance, however, might be more heavily impacted by resistance variations, fluctuations, and drift, which are generally most relevant for HRSs. [102,103] Also, offline training with programming/verifying of such a large resistance becomes challenging, due to the relatively long time needed to sense the extremely low read current of a very high resistance. Instead of increasing the memory resistance, one may also reduce the size of the individual memory arrays to reduce the overall IR drop. A tiled-RRAM architecture has been proposed to conveniently reduce the maximum array dimension of state-of-the-art RRAM devices with typical resistances. [132] However, reducing the array size also results in an increase in the number of the necessary analog-digital converters (ADCs), digital-analog converters (DACs), and other peripheral digital circuits, thus resulting in an overhead in terms of circuit area and power consumption. Another solution to partially solve the issue of the IR drop is the current-controlled synaptic element of Figure 10b.
[65] Here, a three-terminal device is considered, such as a FEFET, a Flash memory, or an ionic transistor, which serves as a current-controlled synapse operating in the saturated regime. The saturated current can be programmed by either online or offline training techniques and represents the synaptic weight, whereas the input information is encoded in the pulse width of the applied gate pulse. The synaptic currents are summed by Kirchhoff's law and used to discharge a pre-charged line or integrated on a capacitor. Note that this is an alternative way of conducting the MVM of Equation (1), where the pulse amplitude is replaced by the pulse width as input vector, the synapse conductance is replaced by the saturated current as weight matrix, and the summed current is replaced by the integrated charge as output, according to $Q_i = \sum_j I_{ij}\, t_j$, with $Q_i$ the integrated charge, $I_{ij}$ the saturated current, and $t_j$ the pulse width. As shown in Figure 10c, the IR voltage drop has a much smaller impact on the saturated characteristics of the synaptic transistors, compared with the linear characteristics of two-terminal memory elements. This scheme has the additional advantages of digital input voltages at the gate, as well as the possibility of operating each transistor in the subthreshold regime, to enable low-current IMC.
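The pulse-width-encoded MVM described above amounts to replacing voltages by times and conductances by saturated currents. A minimal NumPy sketch follows; the function name, the maximum pulse width of 1 µs, and the example currents are hypothetical values chosen for illustration.

```python
import numpy as np

def pulse_width_mvm(i_sat, x, t_max=1e-6):
    """Charge integrated on each output line for normalized inputs x in [0, 1].

    i_sat[i, j] is the programmed saturated current (A) of the synapse linking
    input j to output i; each input is encoded as a gate pulse of width x * t_max.
    """
    t = np.clip(x, 0.0, 1.0) * t_max     # pulse widths (s) play the role of voltages
    return i_sat @ t                     # charge (C): sum over j of I_sat[i, j] * t[j]

i_sat = np.array([[1e-6, 2e-6],
                  [0.5e-6, 0.0]])        # hypothetical saturated currents
x = np.array([0.5, 1.0])
charge = pulse_width_mvm(i_sat, x)
```

Since the output depends on the saturated current rather than on the voltage across the device, a small IR drop along the line leaves the result essentially unchanged in this idealized picture.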

IMC Circuit Architectures
Figure 11 shows various IMC architectures that have been developed to address application-specific computing problems. All architectures take advantage of the possibility of building compact memory arrays in a matrix shape and programming each memory device with an arbitrary analog value. The most popular architecture is the memory array for MVM acceleration in the analogue domain, [9][10][11] although other architectures can be built such as the content addressable memory (CAM) [133] and analogue IMC accelerators for solving inverse problems in one computing step. [104,134]

MVM Accelerators
Figure 11a shows a typical architecture for performing the MVM, namely $x = A \times b$. [9][10][11] The input vector b is generally converted into the analog domain voltage vector V with a DAC; then, it is applied to the crosspoint rows. The matrix A is mapped as conductance values of the memory elements in the crosspoint array. In principle, any of the cell structures of Figure 3, namely 1R, 1T1R, and 1S1R, can be used in the memory array. Array columns are connected to virtual ground such that the resulting current in each column is given by Equation (1) for a crosspoint of a given size N. Each current is converted into a voltage signal by a transimpedance amplifier (TIA), then converted into the digital domain by an ADC. This simple architecture can conduct MVM in one operational step with constant time independent of the size N of the problem, namely O(1) time complexity.
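The DAC, crosspoint, TIA, and ADC chain just described can be modelled in software to see how converter resolution limits the result. The sketch below is an idealized model under stated assumptions: uniform 8-bit converters, inputs normalized to [0, 1], a 0.1 V full-scale input, a 1 kΩ TIA gain, and no device or wire nonidealities; the function names are hypothetical.

```python
import numpy as np

def quantize(x, bits, full_scale):
    """Uniform quantizer on [0, full_scale] with 2**bits levels."""
    levels = 2 ** bits - 1
    return np.round(np.clip(x, 0, full_scale) / full_scale * levels) / levels * full_scale

def crosspoint_mvm(g, b, v_max=0.1, dac_bits=8, adc_bits=8, r_tia=1e3):
    """Estimate g @ b through an idealized DAC -> array -> TIA -> ADC chain."""
    v = quantize(b * v_max, dac_bits, v_max)        # DAC: digital input to voltages
    i = g @ v                                       # analogue MVM by Ohm's and Kirchhoff's laws
    v_out = r_tia * i                               # TIA: current to voltage
    full_scale = r_tia * g.sum(axis=1).max() * v_max
    digital = quantize(v_out, adc_bits, full_scale) # ADC: back to the digital domain
    return digital / (r_tia * v_max)                # rescale to the units of g @ b

rng = np.random.default_rng(0)
g = rng.uniform(1e-6, 1e-4, size=(4, 4))            # hypothetical conductance matrix (S)
b = rng.uniform(0, 1, size=4)
estimate = crosspoint_mvm(g, b)
error = np.abs(estimate - g @ b)
```

In this model the analogue product itself takes a single step; the accuracy is set entirely by the converters, which is also where much of the area and power overhead resides.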
Figure 10 (partial caption). MVM is executed by multiplying the pulse width of the input signal with the saturated current of a synaptic transistor. c) Impact of IR drop for current-controlled synapses and ohmic devices. Adapted with permission. [65] Copyright 2019, IEEE.

MVM is the building block for accelerating neural networks, where sums of products must be executed many times during forward propagation. Here, vector b can be seen as the output neuron signal at a given layer, whereas the conductance matrix G maps the synaptic weights. IMC-based neural network accelerators have been widely demonstrated both for inference with offline supervised training [135,136] and for online training, [11,137] where MVM in the crosspoint array can be used to accelerate both the network evaluation and the training. Online training also exposes the network to device nonidealities such as programming variations, limited window, and stuck open/short devices, thus resulting in a relatively high accuracy. [137] The nonlinear neuron activation is generally performed within the digital domain. The architecture is thus agnostic with respect to the type of training, which can span various learning algorithms such as supervised learning, [11,136,137] unsupervised learning, [138] and reinforcement learning. [139] Multilayer architectures, such as convolutional neural networks (CNN), can be accelerated within crosspoint memory arrays using separate arrays for each network layer, [122] arranging all networks in different locations within the same array [11] or even breaking each layer into several subarrays or tiles. [132] Integrated circuits comprising crosspoint memory arrays, DAC, TIA, and ADC have already been presented for neural network training, [140][141][142][143] showing software-equivalent accuracy and a performance density above 1 TOP s$^{-1}$ mm$^{-2}$. [140][146][147][148][149] In the latter case, one can consider the MVM accelerator as a Hopfield-type RNN. [150,151] Hopfield RNNs are brain-inspired networks that can perform cognitive computing tasks on attractors, which are memory states that represent a minimum energy value in the landscape described by the network connectivity. Cognitive tasks in RNNs include attractor learning, attractor recall, and probabilistic model training. [152,153] When performing a recall operation, the Hopfield RNN converges to a stable state by minimizing the energy function. [154] Thus, by programming the conductance matrix G with a function to optimize, the Hopfield RNN can iteratively find the minimum energy E. [150,151] However, many optimization problems have a nonconvex energy landscape, meaning that many local minima are present. As a result, a Hopfield RNN cannot solve the problem efficiently. This class of constraint satisfaction problems (CSPs) includes Max-SAT, Max-Cut, and the generic multidimensional expression of Sudoku. [155] To make the system capable of solving such nonconvex problems, computational annealing techniques are conducted by introducing noise in the system, which is equivalent to increasing the temperature in an annealing experiment. Simulated annealing allows the system to escape from local minima and reach the global minimum. The intrinsic noise in memory devices has been used as an experimental tool to accelerate computational annealing, [146,147] allowing for a speedup of the solution by a factor of 30× compared with a GPU [146] in a low-power RNN.
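Attractor recall in a Hopfield network can be illustrated with a few lines of NumPy. The sketch assumes the standard textbook form of the Hopfield energy, E = −½ Σ w_ij s_i s_j with ±1 states, and a Hebbian rule for storing one pattern; these choices, the function names, and the 8-bit pattern are assumptions for the example. Noise-driven annealing is not modelled here.

```python
import numpy as np

def hebbian_weights(pattern):
    """Weight matrix storing one +/-1 pattern (Hebbian rule, zero diagonal)."""
    w = np.outer(pattern, pattern).astype(float)
    np.fill_diagonal(w, 0.0)
    return w

def energy(w, s):
    """Standard Hopfield energy of state s."""
    return -0.5 * s @ w @ s

def recall(w, s, sweeps=5):
    """Asynchronous updates: each neuron aligns with its weighted input (an MVM row)."""
    s = s.copy()
    for _ in range(sweeps):
        for k in range(len(s)):
            s[k] = 1 if w[k] @ s >= 0 else -1
    return s

stored = np.array([1, -1, 1, 1, -1, -1, 1, -1])
w = hebbian_weights(stored)
noisy = stored.copy()
noisy[[0, 3]] *= -1                      # corrupt two bits of the stored pattern
recovered = recall(w, noisy)
```

Each neuron update requires one row of the product between the weight matrix and the state, which is the operation the crosspoint array performs in the analogue domain.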
Analogue MVM can also be used to implement spiking neural networks (SNNs), which aim at mimicking the type of computation that takes place in the brain. In fact, while many SNNs have been developed based on standard CMOS technology, [5,6,156,157,158] it has been recognized that IMC allows for a more direct implementation of the neural network structure, as well as providing a better resemblance of the learning and spiking mechanisms of the brain. For instance, the biological learning rules, such as the STDP [159] and the Bienenstock-Cooper-Munro (BCM) rule for triplet-based learning, [160] can be naturally replicated in memory devices. In particular, STDP has been demonstrated in a relatively simple 1R structure, [161][162][163] one-transistor structures, [63,64] 1T1R structures, [164][165][166] and two-transistor/one-resistor (2T1R) structures. [167,168] The time-dependent dynamics of volatile RRAM [169,170] was also shown to feature bioinspired processes, such as STDP learning, [171] BCM learning, [172] short-term plasticity, [173] and oscillating neurons. [174,175] This type of neuromorphic, brain-inspired IMC is highly promising for ultralow-power smart sensors and biomedical devices interfacing with the brain, such as neuromorphic neuroprostheses.
Finally, analogue MVM in the memory can naturally accelerate algebraic computing problems such as image processing, [10] sparse coding, [176] and the solution of linear systems and differential equations. [177,178] In the latter case, numerical algorithms are adopted to break the algebraic problem into several iterative steps, including MVM within the memory architecture and a separate operation performed on a digital computer with floating point precision.80]

Analogue Computing Accelerators
Recently, it has been shown that the crosspoint array can be properly connected in a feedback loop with operational amplifiers (OAs) to solve a linear system of equations in one step without any iteration. [104] Figure 11c shows the circuit architecture for solving the linear system $Ax = b$ in one step. The core architecture is the same as Figure 11a, namely a crosspoint architecture performing MVM between column voltages and conductance values G representing matrix A stored in the memory array. Vector b is applied as input analogue current i, obtained as a DAC output signal applied to an input conductance $G_0$ connected to the virtual ground. Virtual ground is obtained at the input terminal of an OA whose output is connected to the array column in a feedback configuration. The OA generates an output voltage vector v such that $Gv + i = 0$, as required by Kirchhoff's and Ohm's laws. By rearranging this equation, one can obtain the unknown vector $v = -G^{-1} i$ in one step, which is the solution $x = v$ to the linear system $Ax = b$.
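The steady state of the feedback circuit can be emulated numerically to check the relation $v = -G^{-1} i$. In the sketch below the conductance per unit matrix entry and the sign of the injected current are assumptions, chosen so that the output voltage equals the solution x; a digital solver replaces the analogue feedback loop.

```python
import numpy as np

def feedback_solver(g, i):
    """Output voltages satisfying G v + i = 0, i.e. v = -G^{-1} i."""
    return -np.linalg.solve(g, i)

a = np.array([[2.0, 1.0],
              [1.0, 3.0]])               # example matrix with positive entries
b = np.array([1.0, 2.0])
g_unit = 1e-4                            # assumed conductance per unit entry (S)
i = -g_unit * b                          # assumed input sign so that v equals x
v = feedback_solver(g_unit * a, i)
```

The circuit reaches this solution physically through the feedback loop, whereas the sketch only verifies the algebra of the final state.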
The circuit can be extended to matrices A which contain both positive and negative entries, by inverting the output voltage v of the OAs and applying it to a second crosspoint array $G'$ parallel to G. As a result, one can solve a generic linear system $(B - C)x = b$, where B and C are mapped in the crosspoint arrays G and $G'$, respectively. As a special case, if the matrix $G'$ is replaced by the diagonal matrix λI, where I is the identity matrix, and if the input vector is assumed $i = 0$, then the problem reads $(A - \lambda I)x = 0$, where the unknown is the eigenvector of the matrix A. These linear algebra problems can be extended to differential equations, such as the Fourier equation or the Schrödinger equation, which can be solved in one step within a crosspoint array. [104] Note, however, that the circuit can only calculate the eigenvector for the maximum eigenvalue, which should be known in advance to allow for circuit implementation. This is the case, for instance, of the Pagerank, which is an algorithm for ranking webpages, where the maximum eigenvalue $\lambda = 1$ is always known. [181] To perform Pagerank, the matrix G of the connections between webpages is programmed into the memory array, and the eigenvector corresponding to the maximum eigenvalue is computed. Crosspoint circuits have been used to compute the Pagerank problem. [101,104] While matrix A is always square in the circuit of Figure 11c, rectangular problems where the number of equations exceeds the number of unknowns can also be addressed with dedicated IMC architectures.
[134] For instance, Figure 11d shows a double-feedback circuit to compute regression in one step. A current input vector y is applied as input current i by a DAC connected to the input conductance $G_0$. According to Kirchhoff's law, the total current $G_X v + i$, where $G_X$ is the conductance matrix of the left crosspoint array and v is the output voltage of the second stage of OAs, is converted to the voltage $v_R = (G_X v + i)/G_T$ by the TIAs and applied to the right matrix. The right array encodes the same conductance $G_X$ as the left array; thus, the output current is given by $G_X^T (G_X v + i)/G_T$, which must be equal to zero due to the infinite input resistance of the second-stage OAs. As a result, the output voltage reads $v = -(G_X^T G_X)^{-1} G_X^T i$, which represents the Moore-Penrose inverse $w = (X^T X)^{-1} X^T y$, where matrix X is encoded into the conductance of the crosspoint array $G_X$.
The solution is given in one step regardless of the matrix size, without any iteration.
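The closed-form output of the double-feedback circuit can be checked against a standard least-squares solver. In this sketch the data are hypothetical, conductances are used in normalized units, and the input current is taken as −y so that the output voltage equals the regression weights; that sign choice is an assumption.

```python
import numpy as np

def regression_circuit_output(g_x, i):
    """Steady-state output v = -(G_X^T G_X)^{-1} G_X^T i of the double-feedback circuit."""
    return -np.linalg.solve(g_x.T @ g_x, g_x.T @ i)

# Four data points; the first column of ones provides the intercept term.
x = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])           # hypothetical noisy samples of a line
w = regression_circuit_output(x, -y)         # assumed input sign so that v equals w
reference = np.linalg.lstsq(x, y, rcond=None)[0]
```

Here `w` holds the intercept and slope of the best-fitting line, which agree with the reference least-squares solution.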
The Moore-Penrose inverse can be used to compute the linear regression of a given set of data. By storing the independent variables X in the crosspoint array and applying the dependent variable y as the input current, the circuit output voltage is $v = w$, which represents the linear coefficients of the best fitting line (or plane or hyperplane, depending on the number of dimensions). [134] The same concept can be extended to other types of regressions, such as polynomial regression and logistic regression. The latter can act as a building block for large-scale classification systems. [134] The feedback configuration of the IMC circuits of Figure 11c,d allows for physical iteration in the analogue domain to find the solution of the problem with a relatively large size N. In principle, the solution time does not depend on the size N of the problem, thus resulting in O(1) complexity. This low complexity makes IMC extremely promising for machine learning and other areas which rely on linear matrix computation. However, due to the nonidealities at the device level (e.g., device-to-device variations, drift, etc.) and circuit level (e.g., IR drop, etc.), it appears challenging for the IMC technology to reach the same precision as conventional digital circuits with floating-point precision. A more general study at the system level is still needed to meet these challenges and take full advantage of the low complexity and high energy efficiency of IMC.

Content Addressable Memories
Memory arrays are usually accessed by an address, which allows a certain memory bit to be selected and its content data retrieved. This operation is unambiguous, i.e., a single data bit corresponds to any specific address. However, many computing tasks require the opposite operation, namely searching the position, or multiple positions, where a given piece of information is stored in the memory. This memory architecture, which is referred to as CAM, returns the data address in one clock cycle, independently of the memory size, thus allowing for an acceleration of data search with respect to software and other hardware approaches. CAM has been used to accelerate multiple computing tasks such as IP routing, image coding, and regular expression matching. [133] The conventional CMOS-based CAM requires a large area and a complex circuit structure that limit its hardware implementation. On the other hand, CAM can be naturally implemented with IMC using two-terminal memory devices to allow for a significant increase in density.
Figure 11b shows a 2 × 2 ternary CAM (TCAM) array implemented with RRAM devices, where the single cell is highlighted.[184][185][186] Two operations can be performed on the TCAM array, namely writing and searching. [184] Signals SX1 and ND control the access transistor for the write operation, whereas the selection transistor gate (WL1) is biased constantly at $V_{DD}$. To set the RRAM device on the right (M1), $V_{\mathrm{set}}$ is applied to SL1, with $\overline{\mathrm{SL1}}$ kept at $V_{DD}$ to turn off the left transistor, corresponding to device M2. The compliance current is regulated by the control voltage SX1 whereas ND is grounded. To reset the device M1, SL1 is grounded whereas ND is biased at $V_{\mathrm{reset}}$. The same scheme can be applied to write M2 by inverting the signals SL1 and $\overline{\mathrm{SL1}}$. State "0" corresponds to M1 in HRS and M2 in LRS, whereas state "1" corresponds to M1 in LRS and M2 in HRS. To write state "X," corresponding to "don't care," both M1 and M2 should be in the HRS. During the search operation, the match line 1 (ML1) is precharged to $V_{DD}$ whereas the search bit is applied at SL1. If there is a match, then ML1 remains in the charged state. In fact, assuming that a "1" is searched whereas a "1" is stored in the cell, device M2 in HRS prevents discharge of ML1 to the grounded $\overline{\mathrm{SL1}}$, thus maintaining the charged state of ML1. On the other hand, if there is no match, e.g., a "1" is searched whereas a "0" is stored in the cell, then device M2 in LRS connects ML1 to ground, thus inducing a fast discharge of the line. If state "X" is stored in the cell, then ML1 remains charged regardless of the input vector, as both M1 and M2 in HRS prevent connection of ML1 to SL1 and $\overline{\mathrm{SL1}}$.
Thanks to its modular implementation, the TCAM cell can be easily arranged in an array to search for large data patterns, as shown in Figure 11b. Given an input word on SL1 and SL2, ML1 and ML2 remain charged only if the data in each column match the input data or "X" is stored in the column. The potentials of ML1 and ML2 are rapidly probed by an SA, which detects a discharge if ML1 or ML2 drops below a certain threshold.
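At the word level, the array search can be sketched as a ternary comparison of the input word against every stored word. The function names `ml_stays_charged` and `tcam_search` and the stored words are hypothetical, and the loop only emulates what the hardware evaluates on all match lines at the same time.

```python
from typing import List


def ml_stays_charged(stored_word: str, search_word: str) -> bool:
    """A match line stays charged only if no cell on it reports a mismatch."""
    if len(stored_word) != len(search_word):
        raise ValueError("stored and search words must have the same length")
    # A stored "X" matches any search bit; otherwise the bits must be equal.
    return all(s == "X" or s == q for s, q in zip(stored_word, search_word))


def tcam_search(stored_words: List[str], search_word: str) -> List[int]:
    """Return the indices of all match lines still charged after the search."""
    return [i for i, word in enumerate(stored_words)
            if ml_stays_charged(word, search_word)]


if __name__ == "__main__":
    # Illustrative 2 x 2 array: one stored word per match line (ML1, ML2).
    array = ["10", "1X"]
    print(tcam_search(array, "10"))  # [0, 1]
    print(tcam_search(array, "11"))  # [1]
    print(tcam_search(array, "00"))  # []
```

The returned indices play the role of the addresses reported by the SAs. More than one line can match, which is the multiple-position case mentioned earlier.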
Interestingly, TCAM can be used to accelerate computing problems without the need for area- and power-consuming ADCs and DACs, and can be directly connected to a memory module such as DRAM. [186] The match line (ML) discharge speed contains important information on the similarity between the search and stored values. For instance, if a weak LRS is written on M1, the discharge time while searching for "1" will be longer than the time corresponding to a full LRS. The difference between these two values can be translated into a Hamming distance and used to accelerate custom neural network training. [185] Moreover, analog resistive CAM circuits have been proposed, [187] where the stored values represent a range and the ML stays charged if the analog input signal is within the stored range. Analog CAM can be used as the inference machine for machine learning problems, such as decision trees and random forests. [187,188]
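The link between ML discharge time and similarity can be illustrated with a simple RC model. The sketch below assumes that each mismatching cell adds one LRS path to ground, each matching cell adds one HRS leakage path, and the access transistors are ideal. All component values and function names are hypothetical and are not taken from the cited works.

```python
import math

# Illustrative component values (assumptions, not taken from the text).
R_LRS = 10e3     # ohm, low-resistance state
R_HRS = 1e6      # ohm, high-resistance state
C_ML = 100e-15   # farad, match-line capacitance
V_DD = 1.0       # volt, precharge level of the match line
V_TH = 0.5       # volt, sense-amplifier threshold


def discharge_time(n_cells: int, n_mismatch: int) -> float:
    """Time for the match line to decay from V_DD to V_TH.

    The line is modeled as a capacitor discharging through the parallel
    conductance of all cells: V(t) = V_DD * exp(-t * G_total / C_ML).
    """
    g_total = n_mismatch / R_LRS + (n_cells - n_mismatch) / R_HRS
    if g_total == 0:
        return math.inf
    return C_ML * math.log(V_DD / V_TH) / g_total


def hamming_from_time(n_cells: int, t_discharge: float) -> int:
    """Invert the RC model to estimate the number of mismatching cells."""
    g_total = C_ML * math.log(V_DD / V_TH) / t_discharge
    k = (g_total - n_cells / R_HRS) / (1 / R_LRS - 1 / R_HRS)
    return round(k)


if __name__ == "__main__":
    n = 8  # cells on one match line
    for k in range(5):
        t = discharge_time(n, k)
        print(f"mismatches={k} t={t:.3e} s estimate={hamming_from_time(n, t)}")
```

In this model a larger number of mismatches shortens the discharge time, whereas a weaker LRS (larger `R_LRS`) lengthens it, in line with the behavior described above.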

Conclusions
This work provides an overview of the devices, circuits, and architectures that enable data processing directly within the memory according to the so-called IMC paradigm. Emerging memory devices, including two-terminal and three-terminal devices, are first reviewed to clarify the operation principle and the associated advantages and disadvantages for computing. The device structures, including selector-free 1R, 1T1R, and 1S1R structures, are discussed and compared. The most typical nonidealities of the memory concept are discussed with reference to different training processes, namely offline training, consisting of a memory programming operation, and online training, where the synaptic weights are updated in situ. Nonidealities at the array level are then considered, such as the IR drop along the array wires, which dictates additional requirements for the memory resistance. Finally, the IMC architectures are reviewed with a focus on MVM, TCAM, and analogue accelerators for solving linear algebra problems. Owing to its advantages in performance, energy efficiency, and complexity, IMC appears extremely promising for accelerating many data-intensive computing tasks. Improvements in device state control and resistance window are, however, needed to compensate for the device nonidealities and improve the accuracy of IMC.

Daniele Ielmini is a full professor at the Dipartimento di Elettronica, Informazione e Bioingegneria of Politecnico di Milano. He received his Ph.D. degree from Politecnico di Milano in 2000. He conducts research on emerging nanoelectronic devices, such as PCM and RRAM, and on novel computing with memory devices.

Giacomo Pedretti received his B.S., M.S., and Ph.D. (cum laude) degrees in electronics engineering from Politecnico di Milano, Milan, Italy, in 2013, 2016, and 2020, respectively, where he is currently a postdoctoral research associate. His research interests include the design of memristive circuits for optimization and analog computing.

Figure 1. Illustration of the two-terminal memory devices for storage and computing. a) RRAM, where the device resistance is controlled by field-modulated filamentary paths in the dielectric layer. b) PCM, where the device resistance is controlled by the amorphous/crystalline phase in the chalcogenide active layer. c) STT-MRAM, where the device resistance is controlled by the parallel/antiparallel polarization of the ferromagnetic layers in the MTJ. d) FERAM, where the electrostatic polarization is controlled by the orientation of FE domains in the ferroelectric active layer.

Figure 2. Illustration of three-terminal memory devices for storage and computing. a) Flash, where the threshold voltage of the transistor is controlled by the charge stored within the FG. b) Analog DRAM, where the threshold voltage of the transistor is controlled by the charge stored across an independent capacitor. c) FEFET, where the threshold voltage is controlled by the orientation of the FE dipoles within the gate insulator. d) SOT-MRAM, where the MTJ resistance can be electrically manipulated by the in-plane current $I_{\mathrm{pol}}$ along an HM line. e) ECRAM, where the channel conductance is manipulated by the field-induced ionic migration. f) Memtransistor, where the conductance is controlled by the migration of defects across a 2D semiconductor channel.

Figure 4. Offline training of a computational RRAM. a) I-V characteristics of a bipolar RRAM with 1T1R structure (inset) for increasing gate voltage of the select transistor. $V_G$ controls the compliance current $I_C$ during the set transition, hence the device conductance $G$. b) Measured RRAM conductance as a function of the gate voltage $V_G$, indicating an almost linear increase on average. Note the relatively large cycle-to-cycle variations of $G$. c) Distributions of conductance for seven levels of LRS and one level of HRS (inset). The standard deviation is $\sigma_G = 3.8$ μS, independent of the programmed level. a-c) Reproduced with permission. [101] Copyright 2020, IEEE. d) Cell-to-cell distributions of measured current at $V_{\mathrm{read}} = 0.5$ V in a 1T1R array for five programmed levels. Reproduced with permission. [100] Copyright 2019, AIP Publishing. e) Time-dependent fluctuations of resistance $R$ for an RRAM device in HRS, indicating both RW and RTN phenomena. Reproduced with permission. [110] Copyright 2015, IEEE. f) Resistance drift of a PCM device programmed at four levels. Adapted with permission. [48] Copyright 2013, IEEE.

Figure 5. Neural network implementation with memory arrays. a) Schematic illustration of an MLP with three synaptic layers. The outputs $y_1$, $y_2$, etc. are compared with the true outputs $o_1$, $o_2$, etc., to yield the errors $\varepsilon_1$, $\varepsilon_2$, etc., which can be backpropagated to train the network. b) A possible implementation of the synaptic weight by a 1T1R memory, where the current $I$ is compared with a reference current $I_{\mathrm{ref}}$ across a common conductance $G_{\mathrm{ref}}$, thus resulting in an equivalent conductance $G - G_{\mathrm{ref}}$ to enable mapping of both positive and negative weights. c) Another possible implementation of the synaptic weight with two 1T1R elements, where the synaptic weight is described by the equivalent conductance $G^{+} - G^{-}$.

Figure 9. Illustration of stochastic synaptic memories with binary devices. a) Stochastic weight update, where positive/negative pulses (top) are applied, thus resulting in a gradual change of the internal weight and a binary update of the external weight $W_{\mathrm{ext}}$ (center). The defect configuration in the filamentary path is described by $W_{\mathrm{int}}$, whereas the connection/disconnection of the filament to the TE/BE dictates the binary value of $W_{\mathrm{ext}}$. Adapted with permission. [125] Copyright 2017, IEEE. b) Multidevice synapse, where the combination of the conductance of various binary memory devices can lead to an overall analogue synapse suitable for gradual weight update. Reproduced with permission. [127] Copyright 2015, IEEE.

Figure 10. Parasitic voltage drop across array wires. a) Illustration of the voltage drop across a three-cell row with memory resistance $R$ and intermemory wire resistance $r$. b) Current-controlled synaptic element, where MVM is executed by multiplying the pulse width of the input signal with the saturated current of a synaptic transistor. c) Impact of IR drop for current-controlled synapses and ohmic devices. Adapted with permission. [65] Copyright 2019, IEEE.

Figure 11. Array architectures for IMC. a) MVM accelerator including DAC at the input, TIAs for current-voltage conversion, and ADC at the output. b) TCAM array using two-terminal resistive memory devices. c) IMC accelerator for solving a linear system $Ax = b$. d) IMC accelerator for linear and logistic regression.