In‐Memory Database Query

In recent years, several in‐memory logic primitives were proposed where bit‐wise logical operations are performed in memory by exploiting the physical attributes of memristive devices organized in a crossbar array. However, a convincing real‐world application for in‐memory logic and its experimental validation are still lacking. Herein, the application of database query where a database is stored in an array of binary memristive devices is presented. The queries are formulated in terms of bulk bit‐wise operations and are executed in memory by exploiting Kirchhoff's current summation law. The concept is experimentally demonstrated by executing error‐free queries on a small 4 × 8 selector‐less phase‐change memory crossbar. The impact of crossbar size, resistance of routing wires, and interdevice variability on the accuracy of the logical operations are studied through numerical and circuit‐level simulations. Finally, a system for cascaded query is proposed that combines the in‐memory logic with conventional digital logic and its functionality is verified on a healthcare‐related database. It is estimated that an 11‐step long query is executed in 36 ns, consuming 560 μW, thus achieving an energy efficiency of 166 TOPS/W.


Introduction
Today's computing systems are primarily built on the principle, known as the von Neumann architecture, that data storage and processing units are physically separated. Hence, during the execution of computational tasks, data have to be shuttled back and forth between the processing and memory units, incurring significant costs in latency and energy. The energy and latency associated with accessing data from memory units is a key performance bottleneck for a range of applications, in particular the increasingly prominent data-centric computing tasks. [1] In-memory computing is an emerging non-von Neumann paradigm, where certain computational tasks are performed in the memory itself by exploiting the physical attributes of the memory devices as well as their array-level organization. [2][3][4] It has recently been shown that computational primitives such as bulk bit-wise logical operations [5][6][7] and matrix-vector multiply operations [8,9] can be efficiently realized in memory using an emerging class of memory devices known as resistive memory devices or memristive devices. Even though the matrix-vector multiply primitive is gaining traction in a range of applications such as deep learning inference, [10,11] sparse coding, [12] linear solvers, [13] and hyperdimensional computing, [14] there are few experimental demonstrations of end-to-end applications involving in-memory logic operations.
A potential application involving a high percentage of logical operations is data analytics in database management systems. When performing queries on large databases, one of the main challenges is to retrieve the stored data and bring it to the processor that will execute the query. The latency and energy associated with moving data to the processor is a key performance bottleneck. Conventional acceleration methods range from software-oriented schemes that solve queries approximately [15] to hardware-oriented systems that parallelize operations using graphics processing units. [16] However, these approaches do not tackle the main bottleneck of data movement caused by the limited cache capacity of processing units and limited memory bandwidth. This limitation is further exacerbated by the constantly increasing size of real-world databases. Near-memory computing, which incorporates complementary metal-oxide-semiconductor (CMOS)-based computing units close to the main memory, is a promising approach to reduce the cost of data movement. [17] However, it does not completely eliminate the separation between memory and processing units and involves additional challenges such as integrating the logic units into the main memory chip. Another tantalizing prospect is that of storing large databases in dense arrays of nonvolatile resistive memory devices (also referred to as memristive devices) and executing the queries directly in memory by exploiting the analog properties of memory devices and circuits. In a previous work, a novel resistive memory-based architecture was proposed to support a variety of query functionalities (including bit-wise logical operations) and perform them explicitly in memory. [18] However, the proposed design, based on a content-addressable memory structure, does not support efficient cascading of multiple in-memory query operations consisting of different logical operands.
Moreover, a validation of the feasibility of realizing accurate query operations using existing resistive memory crossbar arrays has been missing.
In this article, we present the realization of database queries based on an in-memory read logic primitive. We experimentally demonstrate the concept on a small fabricated phase-change memory (PCM) crossbar array by successfully executing multiple queries in memory involving AND and OR operations. We quantify the impact of interdevice variability and finite wire resistance on the system's accuracy for different array sizes via circuit simulations. Subsequently, we introduce a design that executes cascaded queries of arbitrary length and complexity by combining in-memory computations with near-memory logic circuits in the crossbar's periphery. Using the Cleveland heart disease database as a case study, we demonstrate the functionality of the proposed system via accurate HSPICE circuit simulations calibrated on the hardware experiments and evaluate its throughput and energy efficiency.

The Concept of In-Memory Database Query
Query operations can be performed on databases that are structured collections of attributes or features, associated with different subject or item entries. The objective of a query is to retrieve the entries from the database that satisfy certain constraints related to the attributes. When databases are represented in a bitmap representation (vectors of the logical "0" and "1"), it is possible to formulate the queries as bulk bit-wise operations on the attribute vectors (see Figure 1a). The key idea of in-memory database query is to store the database entries in arrays of memristive devices using their conductance as the logic state variable. Memristive devices can be scaled down to nanoscale dimensions [19,20] and their nonvolatile storage capability ensures that no energy is spent to retain the stored information unlike in dynamic random access memory or static random access memory. Subsequently, the bulk bit-wise operations associated with the query operations are performed in place in the memristive arrays using in-memory logic.
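The bitmap formulation described above can be illustrated with a minimal Python sketch. The attribute names and bit values below are hypothetical placeholders rather than the contents of Figure 1a, and the helper names `bulk_and` and `bulk_or` are introduced here only for illustration.

```python
# Hypothetical bitmap: one row per attribute, one column per database entry.
# The bit values are illustrative and are not taken from the paper.
bitmap = {
    "A": [1, 0, 1, 1, 0],
    "B": [0, 1, 1, 0, 0],
    "C": [1, 1, 0, 1, 0],
    "D": [1, 0, 0, 1, 1],
}


def bulk_and(*rows):
    """Entry-wise AND over attribute vectors of equal length."""
    return [int(all(bits)) for bits in zip(*rows)]


def bulk_or(*rows):
    """Entry-wise OR over attribute vectors of equal length."""
    return [int(any(bits)) for bits in zip(*rows)]


# Query: which entries satisfy attribute A AND attribute D?
a_and_d = bulk_and(bitmap["A"], bitmap["D"])
# Query: which entries satisfy attribute A OR attribute B?
a_or_b = bulk_or(bitmap["A"], bitmap["B"])
# Indices of the entries returned by the first query.
matching_entries = [i for i, hit in enumerate(a_and_d) if hit]
```

Each query produces one result bit per entry, so a single bulk operation answers the query for the whole database at once.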
For in-memory logic, we exploit the nonvolatile binary storage capability of the memristive devices. For example, the 0s and 1s can be represented by the memristive devices' low and high conductance states, respectively. Several logical operations can be enabled through the interaction between the voltage and conductance state variables. [21,22] One particularly interesting characteristic of certain memristive logic families is statefulness, where both the operands and the result are stored in terms of the conductance state variable. [2,23,24] However, stateful logic operations involve the repeated writing or programming of memristive devices. [25,26] This is highly undesirable given the limited cycling endurance of these devices as well as the large energy footprint associated with writing these devices (typically on the order of picojoules per write event).
Hence, for the database query problem, we resort to nonstateful logic operations where the logical operands are stored as conductance values, but the result of the logical operation is obtained as a current signal. This method was first implemented in Pinatubo [27] and has inspired the concept of scouting logic. [28,29] The operands stay fixed in the memory array and the devices need not be programmed during the evaluation of the logical operation. Figure 1b shows the realization of bit-wise logical operations using scouting logic. When memristive devices are organized in a crossbar configuration, it is possible to implement logical operations such as AND and OR by simultaneously activating multiple rows and choosing appropriate reference currents (I_ref) for the sense amplifiers (SAs). This enables us to execute the database query operations. For example, a query that requests the input entries that satisfy attributes "A" AND "D" is performed by simultaneously biasing the crossbar rows that correspond to A and D with a specific read voltage (V_Read). The resulting read current (I_Read) at each column is proportional to the summed conductance of the memristive devices, according to Kirchhoff's circuit laws. The logical output is obtained by comparison with predefined reference currents via one SA per column. Therefore, the logical result is calculated in place without having to move the contents to an external processing unit. Moreover, the output is a vector with length equal to the number of crossbar columns and contains the query response for all the entries, thus facilitating the execution of queries with high parallelism. Note that although the concept is not limited to two attributes per operation, only one kind of logical operation can be executed at once. For example, "A" AND "C" AND "D" is executable with appropriate adjustments to the reference current values and the SA's precision as a function of the given number of operands. It appears that AND, OR, and XOR are the key logical operations needed for database query. [18] XOR is also compatible with scouting logic but would require an SA that has three inputs and is designed to output a logical 1 when I_Read is between I_ref^OR and I_ref^AND. [29]
Figure 1. In-memory database query. a) An example database consisting of five subject or item entries, each with six attributes expressed in a binary format (typically referred to as a bitmap representation). Queries consist of performing logical operations between the attributes and can be executed as bit-wise logical operations. b) A schematic illustration of scouting logic used for bit-wise logical operations. The operands are stored in terms of the conductance states of memristive devices organized in a crossbar configuration. To perform logical operations, multiple rows are biased simultaneously and the resulting current is sensed per column using variable-reference SAs. According to the input query, a controller provides the appropriate reference current to execute AND or OR logical operations in place, without having to move the operands into an external processing unit.
www.advancedsciencenews.com www.advintellsyst.com
Processors based on memristive devices with logic capabilities have been demonstrated in the past [30][31][32][33][34] and even specifically using the scouting logic concept. [35] However, several aspects related to the functionality of the database application at the array level have been overlooked by previous studies, especially in the case of selector-less crossbars. In this work, we analyze the limitations arising as the crossbar size increases and we set boundaries for successful scouting logic as a function of size and routing wire resistance. Moreover, we develop and calibrate a dynamic circuit model that accurately simulates the read and write operations on PCM crossbars, to derive the maximum selector-less crossbar size that can reliably perform database queries. Another fundamental shortcoming of previous scouting logic demonstrations is the selection of the reference currents, which is arguably the most important parameter of scouting logic. The reference current is typically chosen to be the geometrical mean of the expected I_Read values, as shown in Figure 1b, which is not optimal in the presence of interdevice conductance variations.
In the following section, we analytically derive the optimal I_ref values, and based on this, an end-to-end experimental demonstration of the database query application on a fabricated selector-less PCM crossbar is presented.
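The scouting logic read described in this section can be sketched behaviourally in Python as follows. This is an idealized illustration, not the authors' circuit: it assumes zero wire resistance, no sneak paths, and nominal conductances of 50 μS and 1 μS with a 0.1 V read voltage (the values used in the statistical analysis later in the text). The reference currents are placed at the geometric mean of adjacent nominal current levels, which the text describes as the typical choice and not the optimal one. All function names are hypothetical.

```python
import math

V_READ = 0.1     # read voltage in volts (assumed)
G_SET = 50e-6    # nominal high conductance (logical 1), in siemens (assumed)
G_RESET = 1e-6   # nominal low conductance (logical 0), in siemens (assumed)


def to_conductances(bit_rows):
    """Map a list of bit rows onto nominal device conductances."""
    return [[G_SET if bit else G_RESET for bit in row] for row in bit_rows]


def column_currents(conductances, active_rows):
    """Kirchhoff summation: each column current is V_READ times the
    summed conductance of the devices on the activated rows."""
    n_cols = len(conductances[0])
    return [
        V_READ * sum(conductances[r][c] for r in active_rows)
        for c in range(n_cols)
    ]


def reference_currents(n_operands):
    """Geometric-mean references between adjacent nominal current levels."""
    # levels[k] is the nominal current when k of the operands are in SET.
    levels = [
        V_READ * (k * G_SET + (n_operands - k) * G_RESET)
        for k in range(n_operands + 1)
    ]
    i_ref_or = math.sqrt(levels[0] * levels[1])     # none vs. at least one SET
    i_ref_and = math.sqrt(levels[-2] * levels[-1])  # all but one vs. all SET
    return i_ref_or, i_ref_and


def scouting_logic(conductances, active_rows, op):
    """Evaluate one logical operation on the activated rows, per column."""
    i_ref_or, i_ref_and = reference_currents(len(active_rows))
    currents = column_currents(conductances, active_rows)
    if op == "AND":
        return [int(i > i_ref_and) for i in currents]
    if op == "OR":
        return [int(i > i_ref_or) for i in currents]
    if op == "XOR":
        if len(active_rows) != 2:
            raise ValueError("the window comparison is defined for two operands")
        return [int(i_ref_or < i < i_ref_and) for i in currents]
    raise ValueError(f"unsupported operation: {op}")


# Two attribute rows over five entries (illustrative bits only).
crossbar = to_conductances([[1, 0, 1, 1, 0], [1, 0, 0, 1, 1]])
and_result = scouting_logic(crossbar, [0, 1], "AND")
or_result = scouting_logic(crossbar, [0, 1], "OR")
xor_result = scouting_logic(crossbar, [0, 1], "XOR")
```

The same stored operands serve all three operations; only the reference currents applied at the sense amplifiers change.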

Experimental Results on Small-Scale Crossbars
It is possible to fabricate memristive crossbars with or without selector devices. When each memristive device is placed in series with a selector such as a diode or a transistor, it greatly simplifies the operational challenges associated with the crossbar but at the expense of higher fabrication cost and areal footprint. [36][37][38] Even though the proposed in-memory database query approach can be implemented using both types of crossbars, here we study the more challenging case of selector-less crossbars.
Experimental results are presented using small-scale crossbars of PCM devices. PCM is one of the most advanced resistive memory technologies [39][40][41] and is being actively explored for in-memory computing. [42][43][44] We fabricated a crossbar array using projected PCM devices. A projected PCM device consists of a segment of a noninsulating material, referred to as the projection material, in parallel to the phase-change material segment (see Figure 2a). The essential idea is to decouple the read and write processes in a PCM device by taking advantage of the highly nonlinear current-voltage (I-V) characteristics of amorphous phase-change materials. The sheet resistance of the projection material is such that the current bypasses the highly resistive amorphous phase during the low-field "read" operation and flows through it only during the high-field "write" process. Thus, the read signal is only marginally affected by the nonideal electronic properties of the amorphous phase, which makes the projected PCM device remarkably immune to conductance variations arising from structural relaxation, 1/f noise, and temperature variations. [45,46] In addition, by tuning the sheet resistance of the projection layer, one can alter the conductance ratio between the high and low conductance states.
Figure 2. Projected PCM devices. a) Each projected PCM device comprises a noninsulating segment of projection material parallel to a segment of phase-change material. We assign to logical "1" the highly conductive SET state in which the phase-change material is in the crystalline phase and to logical "0" the highly resistive RESET state where a portion of the phase-change material is amorphous. The projected PCM devices are fabricated with a fixed resistive element in series to set a compliance current during writing. b) The I-V characteristic curve of the projected PCM device capturing a transition from RESET to SET state. The threshold switching voltage (V_th) is critical for selector-less crossbar operation. c) Successive SET/RESET switching behavior of a projected PCM device. An iterative programming scheme was used to reduce the SET and RESET variability.
The fabricated projected PCM devices were electrically characterized. An appropriate programming pulse in the 100 ns timescale was applied to provide adequate power to melt the phase-change material. The trailing edge of the pulse is sufficiently short to quench the molten volume and create the amorphous phase. This process is typically referred to as "RESET" and the same term describes the achieved low conductance state. A lower-amplitude pulse crystallizes the amorphous volume in a process referred to as "SET" and leads to a high conductance state. The nonlinear I-V characteristics of the amorphous-to-crystalline phase transition are shown in Figure 2b. At low bias voltage, the PCM devices exhibit a fairly Ohmic behavior in both the SET and RESET states. [47] Moreover, one of the main performance benefits of the projected PCM devices is the notably weaker field dependence over a wider biasing range. [46] The resistor in series (R_S) with the device serves as a current-control element in a selector-less implementation.
Although the series resistor consumes some extra power during scouting logic operations (up to 20% of the total power in our experiments), controlling the current during threshold switching is essential for the proper operation and endurance of the devices. The threshold voltage V_th is a particularly critical parameter for the crossbar operation; therefore, it will be studied thoroughly in Section 3.2. The conductance variability associated with the conductance states upon repeated switching can be mitigated by an iterative programming scheme involving a sequence of program-and-verify routines (see Section 5). With this scheme, the conductance values of the SET and RESET states over multiple cycles were roughly uniformly distributed (see Figure 2c).
A scanning electron micrograph of a fabricated 4 × 8 crossbar is shown in Figure 3a. The projected PCM devices are fabricated in planar configuration and hence the design differs from the conventional cross-point structure, where vertical memristive devices are fabricated at the overlapping area of bit-line (BL) and word-line (WL) wires. Although the areal efficiency may be reduced, there is no functional difference between this crossbar architecture and the conventional cross-point one, because the left and right terminals of this lateral device are connected to column and row wires, respectively. Each cross-point is populated with a metallic resistor in series with a projected PCM device, as shown in Figure 2a. A crossbar without selectors suffers inherently from current sneak-paths that depend on the states of the other devices in the array. [48] Special biasing schemes have to be used for each operation, to limit this effect and make the selector-less crossbars functional (see Figure 3b). To successfully write an individual device through the selected cross-point, pulses with half the programming voltage have to be applied simultaneously to all rows and columns, except the row and column on which the selected device is located. The programming pulse is applied to the selected row, whereas the selected column is grounded. A fundamental prerequisite for this method is that the threshold voltage is higher than half of the programming pulse amplitude; otherwise, the state of the nonselected devices along the selected row and column will be disturbed. On the other hand, precise current measurement is required for both read and scouting logic operations, in which the nonselected rows are grounded so that the contribution of sneak-paths to the selected column current is minimized.
The basic goal is to suppress unwanted voltage differences across nonselected devices by giving them additional grounding paths other than the column read-out node. [49][50][51] Prior to presenting the database query experiment, we investigate two primary sources of inaccuracies related to the in-memory read logic. A scouting logic operation involves simultaneously reading two devices per column. Given that there are three possible combinations, namely, both devices are in a RESET state, both devices are in a SET state, and the two devices are in SET and RESET states, there are three distinct nominal values for the column currents.
The first source of inaccuracy arises from nonideal routing wires. The finite wire resistance causes voltage drops that lead to a reduced effective read voltage across the selected PCM devices. In particular, it becomes crucial for the devices that are positioned the farthest away from the voltage supply node and require the longest routing wires. The effect of this voltage drop is larger for the SET state because the routing wire resistance may become comparable with the SET-state resistance as the crossbar size increases. The finite resistance of the routing wires also results in sneak-path currents that tend to increase the effective read current from the RESET states by adding parasitic contributions from alternate devices in the array. Although the row-grounding method is used, sneak-paths cannot be entirely suppressed, because voltage gradients along the row wires bias devices that otherwise would have grounded terminals. Combined, these intertwined effects diminish the effective read-current ratio, to a point where the three possible column currents cannot be differentiated reliably by an SA.
The resistance of the routing path is a critical fabrication parameter that depends on both the physical distance between the devices and the sheet resistance of the routing metal. Ultimately, the wire resistance per unit cell is a key factor that determines the maximum size of a functional crossbar. We used a circuit simulator to study the reduction of the read-current ratio as a function of wire resistance and crossbar size. We took the worst possible scenario in which 1) all devices are programmed to the SET state to maximize the sneak current and 2) the selected device is at the outermost corner that has the longest routing path. The selected device is programmed sequentially to both logical states and the corresponding read currents are estimated. We used these values to reconstruct and compare the ratios of the column currents I_00, I_01, and I_11 that correspond to the logical combinations of (0,0), (0,1), and (1,1). The resolution limits were set according to the rule that the SA needs a signal that is at least 20% higher than its reference current, to work reliably against process-related variations and voltage-level imprecision. The failure points for AND and OR operations are met when I_11/I_01 ≤ 1.2 and I_01/I_00 ≤ 1.2, respectively. AND requires higher precision and hence fails slightly earlier than OR (see Figure 4a). One approach to increase the size of functional crossbars is to use an adaptive SA unit that adjusts the reference currents proportionally, to compensate for the changes in readout current along the routing path. An alternate approach is to use crossbars with nonunitary aspect ratio (see Figure 4b). Note that real-world databases have many more columns (entries) than rows (attributes); therefore, the number of rows can be reduced to incorporate larger numbers of columns.
Figure 4. a) [...] The study was based on the worst-case scenario where the read device lies at the longest routing point and the rest of the crossbar is in the SET state, maximizing the sneak currents. Resistance/size combinations within the white area correspond to functional systems. AND requires higher precision and thus fails earlier than OR. b) Circuit simulations (similar to (a)) of crossbars with nonunitary aspect ratio and wires with 1 and 5 Ω per unit cell. A reduced number of rows (attributes) results in functional crossbars with larger numbers of columns (entries). c) Probability density function (PDF) of the sum of read currents from two devices for the three possible state combinations. Each device conductance is assumed to be uniformly distributed with the standard deviation expressed as a percentage of its mean value. The column current is calculated assuming a V_Read = 0.1 V bias at both devices. The reference currents for AND and OR operations are set according to Equations (1) and (2). d) Numerical simulations of the three possible column currents given by biasing two devices with V_Read = 0.1 V, each of them being programmed to a conductance state that follows a uniform distribution. The same relative standard deviations are used to confirm the statistical analysis shown in (c).
The second key challenge for accurate in-memory read logic is the interdevice conductance variations. During a logical operation, the column current is proportional to the sum of the two device conductance values representing the operands; for example, for one low and one high conductance, I_Read = V_Read (G_0 + G_1). Because of the inaccuracies in programming G_0 and G_1, they will be randomly spread around their respective target values (see Figure 2c). The probability distribution of the sum of two or more independent random variables is the convolution of their individual distributions. The distributions of the three possible column currents were analytically derived (see Supporting Information) and are shown in Figure 4c. The distributions of G_1 and G_0 were assumed to be uniform with mean values (μ) of 50 μS and 1 μS, respectively. We set V_Read = 0.1 V and calculated the expected distribution of the three possible column currents for three different values of standard deviation (σ), expressed as a percentage of their mean conductance values. Based on the observation that the SET-state variability has a much higher relative impact on the column current, it is also possible to derive the optimal current reference levels for AND and OR operations (see Supporting Information). The statistical analysis is confirmed by numerical simulations of 20 000 devices (see Figure 4d). Every pair of devices is programmed to conductance states uniformly distributed around the mean values μ_G1 = 50 μS and μ_G0 = 1 μS.
Histograms corresponding to I_00, I_01, and I_11 show a perfect match with the analytical derivation.
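A numerical experiment in the spirit of the one described above can be sketched with the Python standard library. The sketch assumes a relative standard deviation of 10% (the three values used in the paper are not stated in this section), 10 000 device pairs per state combination, and an arbitrary random seed. A uniform distribution with standard deviation σ has half-width √3·σ, which is how the sampling bounds are derived. Function names are hypothetical.

```python
import math
import random
import statistics

V_READ = 0.1    # read voltage in volts, as stated in the text
MU_G1 = 50e-6   # mean SET conductance in siemens, as stated in the text
MU_G0 = 1e-6    # mean RESET conductance in siemens, as stated in the text


def sample_uniform(mean, rel_sigma, rng):
    """Draw from a uniform distribution with the given mean and a standard
    deviation of rel_sigma * mean (half-width = sqrt(3) * sigma)."""
    half_width = math.sqrt(3) * rel_sigma * mean
    return rng.uniform(mean - half_width, mean + half_width)


def simulate_column_currents(rel_sigma, n_pairs=10_000, seed=0):
    """Monte Carlo samples of the three possible two-device column currents."""
    rng = random.Random(seed)
    samples = {"I00": [], "I01": [], "I11": []}
    for _ in range(n_pairs):
        g0_a = sample_uniform(MU_G0, rel_sigma, rng)
        g0_b = sample_uniform(MU_G0, rel_sigma, rng)
        g1_a = sample_uniform(MU_G1, rel_sigma, rng)
        g1_b = sample_uniform(MU_G1, rel_sigma, rng)
        samples["I00"].append(V_READ * (g0_a + g0_b))
        samples["I01"].append(V_READ * (g0_a + g1_a))
        samples["I11"].append(V_READ * (g1_a + g1_b))
    return samples


currents = simulate_column_currents(rel_sigma=0.10)
mean_i11 = statistics.mean(currents["I11"])
# The AND decision is reliable only if the (0,1) and (1,1) clusters do not overlap.
and_clusters_separated = max(currents["I01"]) < min(currents["I11"])
# The OR decision needs the (0,0) and (0,1) clusters to stay apart.
or_clusters_separated = max(currents["I00"]) < min(currents["I01"])
```

Plotting histograms of the three sample lists would show the triangular shapes expected from convolving two uniform distributions for I_00 and I_11, and a nearly uniform shape for I_01, where the SET-state spread dominates.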
For the experimental demonstration, we used a database with information about scientific awards presented to some of the worldwide IBM research laboratories. We programmed the devices of a 4 × 8 crossbar according to the bitmap representation of the database. SET states follow a uniform distribution around μ = 51 μS with σ = 3 μS and RESET states have μ = 0.8 μS and σ = 0.1 μS (see Figure 5a). Database queries are used to search for laboratories that meet certain criteria involving the geographical location, awards received, etc. We constructed two simple queries and executed them using the crossbar. The appropriate reference currents are set according to the logical operation associated with the query, as given by Equations (1) and (2). The eight column currents are measured and plotted against the names of the laboratories assigned to each column node (see Figure 5b). They are sufficiently separated for reliable detection by an SA.
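As a rough plausibility check of the separation reported above, the following sketch computes worst-case column-current ratios from the stated SET and RESET statistics. It rests on several assumptions that are not in the source: the quoted σ values are treated as standard deviations of uniform distributions (so each distribution is bounded at μ ± √3·σ), wire resistance and sneak currents are ignored, and the 1.2 ratio criterion from the wire-resistance study is reused here as a margin test. The ratios do not depend on the read voltage, which therefore cancels out.

```python
import math

SA_MARGIN = 1.2  # minimum current ratio assumed for reliable sensing


def uniform_bounds(mean, sigma):
    """Lower and upper bounds of a uniform distribution with given mean and std."""
    half_width = math.sqrt(3) * sigma
    return mean - half_width, mean + half_width


# Conductance statistics quoted for the 4 x 8 crossbar, in siemens.
set_lo, set_hi = uniform_bounds(51e-6, 3e-6)
reset_lo, reset_hi = uniform_bounds(0.8e-6, 0.1e-6)

# Worst-case summed conductances for the three state combinations.
g00_max = 2 * reset_hi
g01_min = set_lo + reset_lo
g01_max = set_hi + reset_hi
g11_min = 2 * set_lo

# Current ratios equal conductance ratios because V_Read cancels.
and_ratio = g11_min / g01_max   # smallest possible I_11 / largest possible I_01
or_ratio = g01_min / g00_max    # smallest possible I_01 / largest possible I_00

and_ok = and_ratio > SA_MARGIN
or_ok = or_ratio > SA_MARGIN
```

Under these assumptions the AND ratio is about 1.6 and the OR ratio is above 20, which is consistent with the observation that AND is the more demanding operation.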

Circuit-Level Simulation Results on Large-Scale Crossbars
The requirement to program the PCM devices at the point of storing the database entries prior to executing the queries introduces an additional limitation to the size of functional crossbars, again unique to selector-less arrays. A consequence of the voltage drop along the wires is that the programming voltage should be adequately increased to successfully program the outermost devices. This comes with the penalty of over-biasing the devices nearest to the voltage source, potentially beyond V_th, which inevitably leads to state disturbance. According to the half-selection programming scheme, all the devices along the selected row and column receive a pulse with half the amplitude of the one applied to the selected device (see Figure 3b). To study the influence of programming on the half-selected devices, we developed a circuit model that emulates the behavior of the projected PCM devices based on conventional electronic components such as transistors, capacitors, and resistors (see Figure 6a and Section 5). It is a two-terminal component designed to behave like a projected PCM cell that is initially in the amorphous phase and, depending on the voltage level, may (or may not) switch to the crystalline phase. The device (M) is modeled by an N-MOS transistor (N_M) driven by a circuit of switches. The exact values of these components are finely tuned to match the experimental device characteristics. A set of 64 experimental I-V curves are measured on different projected PCM devices and the emulator is configured to capture their mean behavior (see Figure 6b). Threshold switching is controlled by Schmitt triggers, which match the experimentally measured V_th = 2.1 V. The circuit is designed in a fully symmetric manner, to capture the unipolar nature of PCM devices. Nonvolatility is ensured by the capacitor C_M that retains the gate voltage required for each conductance state during read or scouting logic operations.
We implemented this dynamic model in a Synopsys HSPICE circuit simulator equipped with a 65 nm CMOS library, to test the efficacy of the concept on a practical problem with a much larger database. The Cleveland heart disease database, which is available on the UCI machine learning repository, consists of medical metrics obtained from 303 patients with heart-related health problems. [52] After binarization, the database featured 41 attributes requiring a 41 × 303 crossbar array, which can be interconnected using a wide range of wire resistances according to Figure 4b. However, when the device is in the high conductance state, there is a significant voltage drop across the series resistance, which implies a high programming voltage at the row node. A representative programming pulse obtained during the small-scale experiments shows that approximately half of the programming power is dissipated in R_S (see Figure 6c). Given these power requirements, the minimum voltage for reliable programming is set at V_Prog = 3.6 V. According to the write scheme (see Figure 3b), the voltage across nonselected devices is half the programming one; therefore, the condition for undisturbed programming is V_Prog/2 < V_th = 2.1 V. As a consequence, the maximum level of the node voltage to compensate for the losses is V_Prog^max = 4.2 V. We simulated the system in the worst possible scenario, in which the selected device is the outermost one. The other devices are in the high conductance state, apart from one control device that is at the nearest position to the voltage source (see Figure 6d). We measured the programming power delivered to the selected device using wires with resistance in the range from 0.2 to 1.25 Ω per unit cell. V_Prog^max was applied at the selected node, ensuring that the voltage across the control device did not exceed V_th.
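The half-selection write scheme and the disturbance condition V_Prog/2 < V_th can be made concrete with a small sketch. It assumes ideal wires (no voltage drop along rows or columns), so it captures only the nominal bias pattern and not the worst-case circuit simulation described above. The function names are hypothetical.

```python
V_TH = 2.1  # threshold switching voltage in volts, as measured in the text


def write_bias_map(n_rows, n_cols, sel_row, sel_col, v_prog):
    """Voltage across every device under the half-selection write scheme.

    The selected row is driven to v_prog and the selected column is grounded;
    all other rows and columns are held at v_prog / 2.
    """
    row_v = [v_prog / 2] * n_rows
    col_v = [v_prog / 2] * n_cols
    row_v[sel_row] = v_prog
    col_v[sel_col] = 0.0
    return [[row_v[r] - col_v[c] for c in range(n_cols)] for r in range(n_rows)]


def is_disturb_free(bias_map, sel_row, sel_col, v_th=V_TH):
    """True if no nonselected device sees a voltage at or above v_th."""
    return all(
        abs(v) < v_th
        for r, row in enumerate(bias_map)
        for c, v in enumerate(row)
        if (r, c) != (sel_row, sel_col)
    )


# 4 x 8 array, programming the device at row 1, column 5.
bias_at_minimum = write_bias_map(4, 8, 1, 5, v_prog=3.6)
safe_at_minimum = is_disturb_free(bias_at_minimum, 1, 5)
# A programming voltage above 2 * V_TH disturbs the half-selected devices.
bias_too_high = write_bias_map(4, 8, 1, 5, v_prog=4.4)
safe_too_high = is_disturb_free(bias_too_high, 1, 5)
```

In this idealized picture, devices sharing a row or column with the selected one see V_Prog/2, while all remaining devices see zero volts, so the usable programming window lies between the 3.6 V minimum and the 4.2 V limit set by 2·V_th.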
Results indicate that for reliable programming under the worst conditions, the 41 × 303 crossbar should be split in half and the wire resistance cannot exceed 0.55 Ω per unit cell (see Figure 6e). The latter implies that as the number of entries increases, the database should expand into more than one crossbar array. In the context of in-memory logic, this is not a particularly detrimental problem, because the crossbar can be split into smaller segments along the rows without influencing the scouting logic operation, at the cost of a marginal increase in circuit-level complexity. Nevertheless, it is clear that a crossbar with selector devices would offer significantly better control and flexibility at the initial set of write operations, in addition to the major power consumption benefits associated with not using the "write scheme" at the deselected devices.
Figure 6. a) A circuit model that emulates the behavior of the projected PCM device. b) The experimentally obtained I-V measurements from the projected PCM devices are compared with those obtained from the emulator. c) The RESET programming pulse, which is 100 ns long with a trailing edge of 10 ns. Approximately half of the total programming power is consumed in the series resistance. d) Schematics of a crossbar biased according to the "write scheme." This combination of stored values and selected device's position with respect to the voltage node is the most prone to failure. e) The maximum programming power that can be provided to the farthest-positioned device without exceeding V_th at any nonselected device, as a function of wire resistance per unit cell.
We generated SET and RESET conductance states according to the Cleveland heart disease database. The SET (RESET) conductance states, representing logical "1" ("0"), were uniformly distributed with a mean of 50 (0.8) μS and a standard deviation of 2 (0.1) μS (see Figure 7a). These distributions were mapped to the 41 × 152 and 41 × 151 subcrossbars (see Figure 7b) and replicate the experimental results of Figure 2c on a larger scale. The routing wires' resistance was set to 0.2 Ω per unit cell. This value corresponds to the sheet resistance of the two lowest-level metals in the 65 nm node, assuming the most compact design (4F^2 footprint), in which the distance between devices is only twice the minimum width of the wire. Queries were constructed for all possible permutations of the 41 attributes and applied to the crossbar. Statistical analysis of the simulated column currents shows error-free query response for both OR and AND logic (see Figure 7c). To mitigate yield or early-device-failure issues, techniques such as avoiding columns that correspond to failed devices and rewriting the content in redundant ones can be applied. The explicit case of selector-less crossbars with devices that fail into an electrical short would require deactivation of the corresponding row as well, although this failure mode is less common in PCM technology.
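The splitting of the 303 entries across two subcrossbars, mentioned earlier in this section, can be illustrated at the logical level. The sketch below uses randomly generated placeholder bits (not the Cleveland data) and hypothetical attribute names, and it only checks that a column-wise query gives the same answer whether it is executed on one array or on concatenated segments.

```python
import random


def segment_sizes(n_cols, n_parts):
    """Split n_cols entries into n_parts nearly equal segments."""
    base, extra = divmod(n_cols, n_parts)
    return [base + 1] * extra + [base] * (n_parts - extra)


def split_columns(rows, n_parts):
    """Partition every attribute row into the same column segments."""
    n_cols = len(next(iter(rows.values())))
    segments, start = [], 0
    for size in segment_sizes(n_cols, n_parts):
        segments.append(
            {name: bits[start:start + size] for name, bits in rows.items()}
        )
        start += size
    return segments


def query_and(rows, attr_a, attr_b):
    """Bit-wise AND of two attribute rows, one result bit per entry."""
    return [x & y for x, y in zip(rows[attr_a], rows[attr_b])]


# Placeholder database: 41 attributes x 303 entries of random bits.
rng = random.Random(1)
database = {
    f"attr{i}": [rng.randint(0, 1) for _ in range(303)] for i in range(41)
}
segments = split_columns(database, n_parts=2)
# Execute the query per segment and concatenate the per-entry results.
merged = [
    bit for seg in segments for bit in query_and(seg, "attr0", "attr7")
]
```

Because each column result depends only on the devices in that column, the segments can be queried independently and their outputs simply concatenated.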

Cascaded Database Query
Real-world database queries consist of a multitude of subqueries with associated logical operations rather than a single query.
Solving such a query in the previously demonstrated fashion could yield an inefficient system. The main challenge is that it requires an additional memory unit for temporary storage of intermediate logical results and subsequent fetching for further processing along with the next set of logical outputs. To address this, a configurable computing system is introduced that combines in- and near-memory computations. Note that any query can be expressed as a sum of products (SOP), $(a_1 \times b_1) + (a_2 \times b_2)$, or a product of sums (POS), $(a_1 + b_1) \times (a_2 + b_2)$, where the sum and product operators correspond to OR and AND, respectively. However, the order of the operators is dictated by the query and does not necessarily alternate between OR and AND. Hence, an arbitrary query function $F(A)$ can be expressed as a combination of POS and SOP, as given by Equation (3), where $F(S)_i = a_i * b_i$, $*$ is an OR or AND operator, and $p$ depends on the query length.
The key idea of the proposed cascaded logic computing system is to perform a logical operation both in-memory and near-memory simultaneously. While an in-memory analog computation using scouting logic is executed at the memristive crossbar, a near-memory digital logic operation is conducted at the periphery of the memory array using conventional CMOS-based gates (see Figure 8a). But rather than independently computing in parallel, the system executes the decomposed query, expressed in terms of Equation (3), in a cascaded manner. At a given clock cycle, the control unit selects the crossbar rows that correspond to the queried attributes and configures the SA by selecting the appropriate reference current (switch S1), as instructed by the query operator. Subsequently, the logical results obtained at the SA output nodes are stored in a buffer. This will serve as the first input to the digital gate, which can be either an OR or an AND gate, depending on switch S2 that enables the corresponding flip-flop and multiplexer channel. The second input to the digital gate is the accumulated logical result of all the previously executed logical operations. The output of the digital gate will serve as the new intermediate result that gets buffered in a delay circuit. At the subsequent clock cycle this signal will in turn be buffered and will be input to the gate, along with the new crossbar output. In other words, at every cycle, the digital output gets updated with the subsequent result of the logical operation obtained from the crossbar, until the query function gets fully executed.

[…] 2 Ω per unit cell and 0 Ω. It can be seen that the nonzero wire resistance reduced the column currents, especially for the logical combinations of 01/10 and 11. However, it is still possible to execute error-free queries. The minor deviation of the column current distributions from that predicted by the theoretical analysis is attributed to the variable routing length depending on the position of each device.

Figure 8. Cascaded logic operations. a) Schematic illustration of the cascaded system showing how the analog and digital computations are distributed to the crossbar and periphery, respectively. The digital circuitry, located at each column node, consists of an SA, CMOS logic gates, a multiplexer, and the switches that configure the circuit according to the query-instructed operation. The SA converts the analog result from each crossbar column to digital and feeds it to the selected digital gate. There it cascades with the preceding logical products, until the final result is calculated at $V_o$ after as many cycles as the analog operations. b) An example query with three operations applied to the 41 × 303 crossbar, which is programmed according to the heart disease database. The final operation C = A AND B uses the partial results A, B that correspond to OR operations. A simple illustration of the cascaded logic system in the "POS" configuration that outputs the result C after the end of cycle #2. c) Simulated waveforms of the digital circuitry solving an 11-step cascaded database query. The three nodes correspond to one column periphery and their positions are marked in (a). The final result can be read at the $V_o$ node right after the end of the last clock cycle (#6). d) Simulated computational metrics for solving the 11-step example query on the heart disease-related database.
To demonstrate the concept, we created an example based on the Cleveland heart disease database and the emulator presented earlier. The query comprises the AND of two OR operations (see Figure 8b). At the first clock cycle, rows 3 and 41 are biased with $V_{\text{Read}}$ and each column current is measured by an SA that is configured to perform the OR scouting logic operation. The result of OR is input to the digital AND gate. For the first cycle, the second input to the digital AND gate would be initialized to logical level 1. At the subsequent cycle, rows 1 and 2 are activated and their partial logical result (OR) is input to the digital AND gate along with the buffered OR result from cycle #1. The final query response is the binary vector that consists of the 303 logical results, as obtained from the output nodes of the digital gates right after the end of cycle #2.
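The two-cycle example above can be mimicked in software. The following is a minimal Python sketch of the cascade, not the circuit simulator used in this work. Only the mean SET/RESET conductances and the initialization of the AND accumulator to logical 1 are taken from the text. The read voltage `V_READ`, the midpoint placement of the reference currents in `I_REF`, the initialization of an OR accumulator to 0, the 4 × 5 toy database, and all function names are assumptions made for illustration.

```python
# Mean device conductances quoted in the text (siemens).
G_SET, G_RESET = 50e-6, 0.8e-6
V_READ = 0.2  # ASSUMPTION: read voltage in volts, not given in the source

# ASSUMPTION: SA reference currents sit midway between the ideal current
# levels of the stored pairs 00, 01/10, and 11.
I_00 = V_READ * 2 * G_RESET
I_01 = V_READ * (G_SET + G_RESET)
I_11 = V_READ * 2 * G_SET
I_REF = {"OR": (I_00 + I_01) / 2, "AND": (I_01 + I_11) / 2}


def scouting(db, rows, op):
    """In-memory step: bias two rows, sum the currents on every column
    (Kirchhoff's law), and threshold each sum with the SA reference."""
    result = []
    for col in range(len(db[0])):
        i_col = sum(V_READ * (G_SET if db[r][col] else G_RESET) for r in rows)
        result.append(int(i_col > I_REF[op]))
    return result


def cascaded_query(db, steps, digital_op):
    """Near-memory step: fold each SA output into the buffered result.
    digital_op = "AND" gives a POS query, "OR" gives an SOP query."""
    # AND accumulator starts at 1 (as in the text); OR at 0 (assumption).
    acc = [1 if digital_op == "AND" else 0] * len(db[0])
    for rows, analog_op in steps:  # one clock cycle per analog operation
        sa_out = scouting(db, rows, analog_op)
        if digital_op == "AND":
            acc = [a & s for a, s in zip(acc, sa_out)]
        else:
            acc = [a | s for a, s in zip(acc, sa_out)]
    return acc


# Hypothetical toy database: 4 attributes (rows) x 5 entries (columns).
toy_db = [
    [1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0],
]

# (row 2 OR row 3) AND (row 0 OR row 1), evaluated in two cycles.
steps = [((2, 3), "OR"), ((0, 1), "OR")]
result = cascaded_query(toy_db, steps, "AND")
print(result)  # [1, 0, 0, 1, 0]
```

Each tuple in `steps` stands for one clock cycle, so the number of cycles equals the number of analog operations, while the digital gate contributes one further operation per cycle at no extra time cost.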
The system offers massive parallelism and requires no high-power device programming, and is thus expected to execute queries with remarkable efficiency. We used our 41 × 303 crossbar emulator to provide computational metrics based on the experimentally measured power consumption, on which we built the cascaded logic digital circuit using 65 nm CMOS components. Delays related to charging the selected row and column wires and other parasitic capacitors have been well compensated by setting the clock period to 6 ns, which is 2.2× longer than the time required for the accurate operation of the current-latched SA used here. [53] The simulated system was clocked at 167 MHz and configured to solve POS. We chose an example query comprising 11 consecutive OR and AND operations, of which 6 are performed in the analog domain. A maximum of two operations are performed in each clock period, which means that the total time required is 36 ns (see Figure 8c). Simulated performance metrics revealed that both the PCM-based crossbar and the digital-logic circuitry (gates, flip-flops, multiplexer) have a very low energy impact compared with the 303 SA units that dominate the time and power consumption. The total average power consumption of the system is 558 μW and the total required energy for the fully cascaded query is 20 pJ (3.3 pJ cycle⁻¹). These numbers refer explicitly to the core components, which means that the control unit and any postprocessing circuits are excluded from the calculation. The achieved throughput was 92.6 GOPS and the energy efficiency reached 166 TOPS/W. The performance metrics are shown in Figure 8d for the total system. Note that the crossbar's power consumption is not deterministic but depends on both the stored database and the query itself. For example, the percentage of stored 1s at the queried rows of this demonstration was higher than the database's average, a fact that directly affects the crossbar's power consumption.
Nevertheless, it was measured to be 30× lower than the power consumed by the SAs. In addition, we notice that the strict conditions for successful programming, which implied the use of highly conductive wires and the split of the crossbar into two subarrays (see Section 3.2), almost completely suppressed the drawbacks associated with the lack of selectors. The uncontrollable voltage gradients along the wires are kept below 2 μV even at the outermost areas of the crossbar. Consequently, only 0.1% of the total power was wasted at the grounded rows as undesired signal, and the capacitive currents are considered negligible. Based on these results, using crossbars with selectors as a way to suppress sneak paths would only marginally increase the power efficiency during scouting logic operations. Finally, 18% of the crossbar's power consumption is due to the series resistors (3.2 μW). This number is strongly data dependent, given that for each scouting logic operation, the fraction of power dissipated in the series resistors is 0.4% when both stored values are 0s and can reach 20% when both are 1s.
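The headline metrics and the series-resistor fractions quoted above can be cross-checked with a few lines of arithmetic. The Python sketch below is a back-of-the-envelope check, not the simulation itself. It assumes that one "operation" is one logical step on one column, so the 11-step query counts as 11 × 303 operations. It also assumes that the 5 kΩ series resistance of the device model applies here and that the quoted mean conductances describe the device alone. Variable and function names are illustrative.

```python
# Values quoted in the text.
CLOCK_PERIOD_S = 6e-9   # clock period
N_CYCLES = 6            # cycles needed for the 11-step query
N_STEPS = 11            # logical operations in the query
N_COLUMNS = 303         # database entries processed in parallel
P_TOTAL_W = 558e-6      # total average power of the core components

query_time_s = N_CYCLES * CLOCK_PERIOD_S
energy_j = P_TOTAL_W * query_time_s
energy_per_cycle_j = energy_j / N_CYCLES

# ASSUMPTION: one operation = one logical step applied to one column.
throughput_ops = N_STEPS * N_COLUMNS / query_time_s
efficiency_ops_per_w = throughput_ops / P_TOTAL_W

# ASSUMPTION: 5 kOhm series resistor in series with a purely Ohmic device.
R_SERIES_OHM = 5e3


def series_power_fraction(g_device_s):
    """Fraction of the branch power dissipated in the series resistor."""
    r_device_ohm = 1.0 / g_device_s
    return R_SERIES_OHM / (R_SERIES_OHM + r_device_ohm)


print(f"query time      : {query_time_s * 1e9:.0f} ns")
print(f"energy          : {energy_j * 1e12:.1f} pJ "
      f"({energy_per_cycle_j * 1e12:.1f} pJ per cycle)")
print(f"throughput      : {throughput_ops / 1e9:.1f} GOPS")
print(f"efficiency      : {efficiency_ops_per_w / 1e12:.0f} TOPS/W")
print(f"R_S share, '0's : {series_power_fraction(0.8e-6) * 100:.1f} %")
print(f"R_S share, '1's : {series_power_fraction(50e-6) * 100:.1f} %")
```

Under these assumptions the script reproduces the 36 ns query time, roughly 20 pJ of energy, 92.6 GOPS, 166 TOPS/W, and the 0.4% and 20% series-resistor shares.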

Conclusion
We have presented the concept of in-memory database query where the database is stored in a dense array of memristive devices and the queries are performed in place using in-memory logic operations. We fabricated and characterized a 4 × 8 crossbar and experimentally demonstrated the execution of error-free database queries. Moreover, we developed a circuit model that emulates the nonlinear electrical behavior and the nonvolatility of these devices. The emulator was incorporated into a circuit simulator that successfully solved a real-world database query problem in memory. In addition, we studied the operational challenges of selector-less crossbars for this application and provided the array size limitations as a function of wire resistance and aspect ratio. Finally, we introduced the concept of cascaded logic that combines in-memory computations at resistive crossbars with peripheral CMOS logic and executes queries of arbitrary size and complexity. The proposed system can process database queries with massive parallelism, high throughput, and low energy consumption.

Experimental Section
Device Fabrication: Projected PCM devices were fabricated on an insulating 100 nm SiO₂ layer, thermally grown on a 525 μm n-type silicon wafer. The projection layer was a metal nitride formed by reactive sputtering. The pure metal target was sputtered in the presence of gaseous N₂. Note that in this step the sheet resistance of this film was tuned by adjusting the N₂ flow rate in the sputtering chamber. Subsequently, an ultrathin film of phase-change material (Sb₂Te₃, 10 nm) was sputtered and protected by a 5 nm SiO₂ capping layer. These three layers were deposited sequentially in the same sputtering chamber to avoid oxidation of the nitride interface and the PCM material arising from exposure to atmospheric air. Several e-beam lithography routines were executed to 1) pattern the stacked layers via ion milling to the design shown in Figure 2a and open contact areas through the capping layer, 2) define the contact electrodes, the series resistors, and the first-level metal wires by reactive ion etching (RIE) of a sputtered tungsten layer, and finally 3) form the second-level metal and robust top-level pads for the probe card used in the electrical characterization setup.
Electrical Characterization: A four-channel Keithley 2606B source-measure unit (SMU) was used to apply the DC bias and measure the column currents. The nanosecond-long programming pulses were applied by an Agilent 81110A pulse generator and monitored by an Agilent MSO6104A digital oscilloscope. A Celadon T40AF probe card with 25 probes arranged in line with a 100 μm pitch was used to contact the crossbar pads, and the signals were distributed to the specific probe with a Keithley 707A switching matrix.

Iterative Programming Scheme: The verification condition was met when the device conductance fell within the target range given by $G_{\text{target}} \pm \delta G$ according to specifications. If the programming pulse failed to create an appropriate conductance state, the voltage amplitude was readjusted following the equation $V_{i+1} = V_i + \theta \cdot [\log(G_i) - \log(G_{\text{target}})]$, where $\theta$ is a parameter that has to be set according to the programming characteristics of the devices (in our case $\theta = 0.2$).
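The program-and-verify loop described above can be sketched as follows. This is a minimal Python illustration of the update rule only. The base-10 logarithm, the starting amplitude, the iteration cap, and in particular the `toy_device` response (conductance falling by one decade per 0.5 V above 1 V) are assumptions chosen so that the loop converges; they do not represent measured device behavior.

```python
import math

THETA = 0.2  # update parameter quoted in the text


def next_amplitude(v_i, g_i, g_target, theta=THETA):
    """V_{i+1} = V_i + theta * [log(G_i) - log(G_target)].
    ASSUMPTION: base-10 logarithm; the source does not state the base."""
    return v_i + theta * (math.log10(g_i) - math.log10(g_target))


def toy_device(v_pulse):
    """HYPOTHETICAL device response: conductance in siemens after a pulse
    of amplitude v_pulse. Higher amplitude yields lower conductance."""
    g_max = 50e-6
    return g_max * 10 ** (-2.0 * max(v_pulse - 1.0, 0.0))


def program(g_target, delta_g, v_start=1.5, max_iter=50):
    """Apply pulses until the conductance lies within g_target +/- delta_g.
    Returns (final amplitude, final conductance, number of corrections)."""
    v = v_start
    for iteration in range(max_iter):
        g = toy_device(v)                      # apply pulse, then read back
        if abs(g - g_target) <= delta_g:       # verification condition
            return v, g, iteration
        v = next_amplitude(v, g, g_target)     # readjust the amplitude
    raise RuntimeError("target conductance not reached")


v_final, g_final, n_corrections = program(g_target=2e-6, delta_g=0.1e-6)
print(f"V = {v_final:.3f} V, G = {g_final * 1e6:.2f} uS, "
      f"corrections = {n_corrections}")
```

Because the correction is proportional to the logarithmic conductance error, the amplitude changes shrink as the device approaches the target, which keeps the loop from overshooting when the device response is roughly exponential in voltage.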
Circuit Model and Simulator: Under low electric field, both phases had linear I-V characteristics. [54] Because scouting logic is carried out under low-field biasing, the devices were represented by Ohmic resistors. The device model used in Section 3.2 consisted of circuit components drawn from the TSMC 90 nm library (see Figure 7a). The N-MOS transistor $N_M$ has a W/L ratio of 30 and the series resistance is $R_S = 5$ kΩ. The switching unit consists of two Schmitt triggers to ensure symmetrical behavior independent of the bias node ($V_{\text{Row}}$ or $V_{\text{Col}}$). They are supplied with external DC power V = 4 V, to keep the state unaffected by the input signals. The switches require additional components, which are the resistors $R_\alpha$, $R_\gamma = 18$ kΩ, $R_\beta$, $R_\delta = 11$ kΩ, and the P-MOS transistors that have W/L{$P_\zeta$, $P_\xi$, $P_\chi$, $P_\psi$} = 10. The gate voltage $V_M$ is built up by charging the capacitor $C_M = 30$ fF through a resistor $R_M = 300$ kΩ.

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.