Wide word‐length carry‐select adder design using ripple carry and carry look‐ahead method based hybrid 4‐bit carry generator

This research aims to fill the research gap in energy‐efficient transistor‐level wide word‐length carry generator circuits by using a ripple carry (RC) and carry look‐ahead (CLA) method‐based hybrid 4‐bit carry generation process for the wide word‐length carry‐select adder (CSLA). Compared to the existing 4‐bit CLA architectures, the proposed RC‐CLA method‐based hybrid 4‐bit carry generator showed performance improvement in terms of power and power delay product (PDP). Later, the 4‐bit carry architectures (existing and proposed) were used as a base to implement a 16‐bit CSLA in order to investigate and compare the effect of using the proposed hybrid RC‐CLA based 4‐bit carry generator in large structures. As in the 4‐bit case, the proposed design displayed the best performance in power and PDP for the 16‐bit CSLA extension, which demonstrates its effectiveness in wide word‐length adder structures.

more arithmetic operations need addition in their intermediate stages. Ensuring the computing speed of these digital blocks is an obligatory requirement to maintain processor performance. Since the adder plays a pivotal part in several binary arithmetic operations, a high-speed but energy-efficient design of adder circuits will assist in improving the overall ALU performance.10 Modern microprocessors require multiple-bit operation. The ripple carry adder (RCA) is the simplest way of implementing multiple-bit addition.11 However, present-day high-speed microprocessors usually do not use the RCA method because of its slow operation. In an RCA, the carry output of one stage works as the input to the next stage. Therefore, one stage cannot start its operation until the input carry arrives from the previous stage. Due to this speed issue, the application of the RCA method has become limited. To improve adder performance, the carry generation method in multiple-bit operation becomes the target optimization parameter. To improve carry propagation speed, the concept of the parallel adder has come into existence.12,13 Among the several types of parallel adders, the CLA is highly popular.14,15 Weinberger and Smith first introduced the CLA method in Ref. 16. Since then, several CLA designs have been proposed by researchers in order to cope with rapidly changing technologies.17-19 As a result, the performance of a fast wide word-length adder is highly dependent on its 4-bit fundamental unit cell. Thus, overall improvement of a wide word-length adder structure can be realized by using efficient 4-bit unit cells.
In the existing literature, researchers have strongly emphasized CLA algorithms and their logic-level designs, whereas very few attempts have been made to improve transistor-level designs. This research aims to fill the existing research gap in energy-efficient transistor-level representation of wide word-length fast adder designs. First, an RCA and CLA based hybrid 4-bit carry generator is proposed. Later, the proposed hybrid 4-bit cell is used as a building block for a CSLA. The performance of the proposed hybrid 4-bit unit cell-based CSLA has been analyzed, simulated, and compared with similar existing structures. The proposed design showed significant improvement in average power (AP) and power delay product (PDP) while maintaining quite satisfactory speed.

LITERATURE REVIEW OF EXISTING 4-BIT CLA
To describe the existing 4-bit CLA unit cells, the carry terms are expressed as $C_i$ ($C_0$: input carry, $C_1$-$C_4$: output carries) and the input bits as $A_i$ and $B_i$, where $0 \le i \le 3$. Descriptions of the existing transistor-level 4-bit unit CLA cells are provided in the following sub-sections.

Static CMOS logic based conventional 4-bit CLA
The $G_i$ and $P_i$ terms are the main basis of the static CMOS logic based conventional 4-bit CLA.20 $G_i$ represents the AND operation of the input terms $A_i$ and $B_i$ ($G_i = A_i B_i$).21 On the other hand, $P_i$ represents the XOR operation of the input terms $A_i$ and $B_i$ ($P_i = A_i \oplus B_i$).20 Based on the $G_i$ and $P_i$ terms, the Boolean representation of the carry terms follows the recurrence21:

$$C_{i+1} = G_i + P_i C_i, \qquad 0 \le i \le 3$$

The transistor-level representations of the $G_i$ and $P_i$ terms are portrayed in Refs. 21,22.
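As an illustration of the generate/propagate formulation, the following Python sketch expands the standard look-ahead recurrence into explicit expressions for $C_1$-$C_4$ and checks them against a plain ripple computation. It is a behavioural model only, not the transistor-level circuit of Refs. 21,22; the function names (`gp_terms`, `cla_carries`, `ripple_carries`, `to_bits`) are hypothetical and chosen for this example.

```python
def to_bits(value, width=4):
    """Split an integer into a list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def gp_terms(a_bits, b_bits):
    """Per-bit generate (AND) and propagate (XOR) terms."""
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    return g, p


def cla_carries(a_bits, b_bits, c0):
    """C1..C4 from the expanded look-ahead equations (no carry rippling)."""
    g, p = gp_terms(a_bits, b_bits)
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]


def ripple_carries(a_bits, b_bits, c0):
    """C1..C4 computed stage by stage, each stage waiting for the previous carry."""
    carries, c = [], c0
    for a, b in zip(a_bits, b_bits):
        c = (a & b) | ((a ^ b) & c)
        carries.append(c)
    return carries


if __name__ == "__main__":
    print(cla_carries(to_bits(0b1011), to_bits(0b0110), 0))
```

Both functions return the same carries for every input; the difference the paper is concerned with is structural, since the look-ahead form computes each carry directly from the inputs at the cost of larger gates.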

Compact 4-bit CLA cell
Ruiz and Granda provided the design of a 4-bit CLA cell (named Compact CLA) in Ref. 32. This 4-bit CLA cell utilizes the conventional static CMOS based $P_i$ circuits. However, $G_1$ circuits are not utilized. The feedback mechanism is the main problem associated with this 4-bit cell. To provide a full swing signal, some of the internal nodes use feedback circuits. However, these feedback mechanisms take time to act, which causes speed issues for this 4-bit CLA cell.

Novel 4-bit CLA cell
The design of a 4-bit CLA cell (named Novel CLA) with novel Boolean expressions for the CLA circuits is shown in Ref. 33. No $G_i$ and $P_i$ circuits were used in this design. Rather, $\overline{G}_i$ (the inverted $G_i$ signal) was used, and NOR gates were used rather than AND gates. The CLA circuits in Ref. 33 are quite unique and complex.

PROPOSED 4-BIT HYBRID RCA-CLA
Conventional CLA designs use four successive CLA circuits to generate the $C_1$-$C_4$ signals, which results in increased transistor count, silicon area, and power dissipation. The proposed design uses a different approach. The hybrid RC-CLA based 4-bit carry generation circuit utilizes the RC method for the first three bits to reduce transistor count and power consumption, while the speed of the 4-bit carry generation process is ensured by applying a CLA circuit for the fourth bit. Rather than using conventional Carry Generate ($G_i$) and Carry Propagate ($P_i$) circuits, the proposed hybrid 4-bit carry generator uses the inverted signals $\overline{G}_i$ and $\overline{P}_i$, which are simply the NAND and XNOR of $A_i$ and $B_i$. A NAND-XNOR module is therefore proposed to generate them. The block diagram of the proposed 4-bit hybrid RCA-CLA architecture is shown in Figure 1. In the 4-bit structure, the first three carry stages ($C_1$-$C_3$) are generated in RCA style, and the final output $C_4$ is generated using a CLA circuit. Since $C_4$ is the final output of a 4-bit carry circuit, the speed of the circuit is fully dependent on $C_4$; for this reason, the CLA architecture is used to generate $C_4$. On the other hand, $C_1$-$C_3$ are generated in simple RCA style so that transistor count and power consumption can be reduced. Detailed descriptions of the NAND-XNOR module, the $C_1$-$C_3$ signal generation in RCA style, and the NAND-XNOR signal based CLA circuit for $C_4$ generation are provided in the following sub-sections.
From the XNOR logic conditions of the NAND-XNOR module (Section 3.1), it can be observed that the XNOR signal node gets at least one full swing output signal for each of the four logic conditions. Thus, the XNOR signal can provide a full swing output level without any voltage degradation. The NAND circuit is similar to a static CMOS logic-based NAND circuit except for the shared node $A_i$. The operation of the NAND signal generation can be understood as follows:
NAND Logic Condition 1 ($A_i B_i = 00$): For $A_i B_i = 00$, $p_1$ and $p_2$ are switched ON, through which strong logic 1 is passed towards the NAND signal node. As a result, the NAND signal becomes logic 1.
NAND Logic Condition 2 ($A_i B_i = 01$): For $A_i B_i = 01$, $p_1$ is switched ON, through which strong logic 1 is passed towards the NAND signal node. As a result, the NAND signal becomes logic 1.
NAND Logic Condition 3 ($A_i B_i = 10$): For $A_i B_i = 10$, $p_2$ is switched ON, through which strong logic 1 is passed towards the NAND signal node. As a result, the NAND signal becomes logic 1.
NAND Logic Condition 4 ($A_i B_i = 11$): For $A_i B_i = 11$, $n_3$ and $n_4$ are switched ON, through which strong logic 0 is passed towards the NAND signal node. As a result, the NAND signal becomes logic 0.
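The logic-level behaviour of the NAND-XNOR module can be summarised with a short truth-table sketch. This models only the Boolean outputs (the inverted generate and propagate signals), not the transistors $p_1$-$p_5$ and $n_1$-$n_4$ or any signal-swing effects; the function name `nand_xnor` is hypothetical.

```python
def nand_xnor(a, b):
    """Return (G_bar, P_bar) for one bit position.

    G_bar is the NAND of the inputs (inverted generate),
    P_bar is the XNOR of the inputs (inverted propagate).
    """
    g_bar = 1 - (a & b)
    p_bar = 1 - (a ^ b)
    return g_bar, p_bar


if __name__ == "__main__":
    # Enumerate the four logic conditions A_i B_i = 00, 01, 10, 11.
    for a in (0, 1):
        for b in (0, 1):
            g_bar, p_bar = nand_xnor(a, b)
            print(f"A={a} B={b} -> NAND={g_bar} XNOR={p_bar}")
```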

3.2 Schematic of 1-bit carry generator unit for $C_1$, $C_2$ and $C_3$
As per Figure 1, the proposed architecture includes a 3-bit RCA carry generator, which is made up of three 1-bit carry generator units arranged in simple RCA style. The 1-bit carry generator unit is taken from Ref. 34, where its operating principle is explained. The schematic of the 1-bit carry generator unit is shown in Figure 3.

Proposed CLA circuit for C 4
The schematic of the proposed CLA circuit for $C_4$ is shown in Figure 4. Since the proposed hybrid RCA-CLA architecture is not based on $G_i$ and $P_i$ terms, a new CLA circuit for $C_4$ is proposed which takes the $\overline{G}_i$ and $\overline{P}_i$ signals as input.
Although the circuit in Figure 4 looks somewhat similar to the conventional static CMOS logic based CLA circuit, the input signals, the gate control of the transistors, and the interconnections among the transistors are quite different. The main advantage of the proposed $C_4$ circuit over the conventional static CMOS logic based $C_4$ circuit in Section 2.1 is that it eliminates an inverter stage from the operation. This helps the circuit to achieve faster speed than the conventional design.
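To make the hybrid arrangement concrete, the sketch below models it at the Boolean level: $C_1$-$C_3$ ripple through three 1-bit stages, while $C_4$ is computed in one step from the inverted generate and propagate signals. Several points are assumptions made for illustration and are not taken from the paper: the 1-bit stage is modelled as a generic majority function rather than the circuit of Ref. 34, the $C_4$ expression is simply the De Morgan form of the standard look-ahead equation (a single inverting function of the inverted inputs), an inverted input carry is assumed to be available, and the name `hybrid_carries` is hypothetical.

```python
def hybrid_carries(a_bits, b_bits, c0):
    """Behavioural model of a hybrid RC-CLA 4-bit carry generator.

    a_bits, b_bits: four input bits each, least significant bit first.
    Returns [C1, C2, C3, C4].
    """
    # Inverted generate (NAND) and inverted propagate (XNOR) per bit.
    g_bar = [1 - (a & b) for a, b in zip(a_bits, b_bits)]
    p_bar = [1 - (a ^ b) for a, b in zip(a_bits, b_bits)]

    # C1..C3: ripple style, each stage consumes the previous carry.
    carries, c = [], c0
    for i in range(3):
        c = (a_bits[i] & b_bits[i]) | ((a_bits[i] ^ b_bits[i]) & c)
        carries.append(c)

    # C4: look-ahead from the inverted signals and the input carry only.
    # NOT(G3 + P3(G2 + P2(G1 + P1(G0 + P0 C0)))) written with inverted inputs,
    # followed by a single inversion to obtain C4 itself.
    c0_bar = 1 - c0
    inner = g_bar[0] & (p_bar[0] | c0_bar)
    for i in (1, 2, 3):
        inner = g_bar[i] & (p_bar[i] | inner)
    carries.append(1 - inner)
    return carries


if __name__ == "__main__":
    a = [1, 1, 0, 1]  # 0b1011, LSB first
    b = [0, 1, 1, 0]  # 0b0110, LSB first
    print(hybrid_carries(a, b, 0))
```

Note that `C4` here never reads `C3`, which is the property that keeps the critical output off the ripple chain.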

WIDE WORD-LENGTH CSLA USING 4-BIT ARCHITECTURE
To analyze the effect of using the proposed hybrid RCA-CLA based 4-bit architecture in a wide word-length structure, the proposed design is used as a building block for a 16-bit CSLA implementation. The block diagram of the 16-bit CSLA implemented using the 4-bit carry generator is shown in Figure 5.

CIRCUIT SIMULATION PARAMETERS IN CADENCE
A proper simulation environment must be ensured when simulating digital circuits.35,36 Therefore, to verify the proposed architecture and to compare it with the designs mentioned in Section 2, all designs are analyzed under a standard simulation environment.37,38 The following sub-sections elaborate on these topics.

Circuit simulation and comparison parameters
The circuits are implemented in a 45 nm standard CMOS process using the Cadence design toolkit. The supply voltage for all simulations is set to 1 V and the frequency is set to 100 MHz. To compare the performance of different circuits, the following performance parameters are considered.39,40
• Average power consumption: Since different input-output combinations in digital circuits lead to different power dissipations, it becomes necessary to analyze all input-output transitions. After the power dissipation data of all input-output transitions are obtained, the average of the power dissipations due to the different input-output combinations is taken as the average power of a circuit.41
• Propagation delay: As with average power, different input-output combinations in digital circuits lead to different propagation delays. However, there exists one input-output pattern for which the highest propagation delay occurs. Therefore, to calculate the propagation delay of any 4-bit or 16-bit circuit (existing or proposed), all possible input-output combinations are applied to the circuit and the circuit delay is calculated for each of them. The highest circuit delay is taken to be the propagation delay, or path delay, of a design.42
• Power delay product (PDP): PDP is simply the product of power consumption and delay. Therefore, to obtain the PDP, the average power is multiplied by the propagation delay of a circuit.43
FIGURE 5 Wide word-length CSLA implementation using 4-bit carry generator.
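The three comparison parameters defined above can be sketched as a small post-processing step over per-transition simulation data. The function name `summarize` and the sample numbers are hypothetical placeholders for illustration; they are not results from this paper.

```python
from statistics import mean


def summarize(power_watts, delay_seconds):
    """Reduce per-transition measurements to the three comparison metrics.

    power_watts:   power dissipation for each input-output transition (W)
    delay_seconds: circuit delay for each input-output transition (s)
    Returns (average power, propagation delay, PDP).
    """
    average_power = mean(power_watts)       # mean over all transitions
    propagation_delay = max(delay_seconds)  # worst-case (highest) delay
    pdp = average_power * propagation_delay
    return average_power, propagation_delay, pdp


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    powers = [1.0e-6, 2.0e-6, 3.0e-6]
    delays = [50e-12, 80e-12, 65e-12]
    ap, tpd, pdp = summarize(powers, delays)
    print(f"AP = {ap:.3e} W, delay = {tpd:.3e} s, PDP = {pdp:.3e} J")
```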

Simulation test bench
The simulation test bench described in Ref. 29 is used for simulating all the circuits (existing and proposed). Details and explanations regarding the simulation test bench can be found in Ref. 29.

RESULT AND DISCUSSION
First, the 4-bit architectures (existing and proposed) are simulated, and the results are extracted and compared. Later, the effect of using the 4-bit architectures (existing and proposed) in a wide word-length CSLA is analyzed. The following sub-sections contain the simulated results of the 4-bit architectures and their 16-bit extended versions in CSLA style.

Simulation result of 4-bit architectures
Simulation results of the proposed and existing 4-bit architectures are provided in Table 1. Although the proposed design did not obtain the best performance in speed, the result is quite satisfactory if we carefully observe and compare the data presented in Table 1. Therefore, the proposed design is an attractive alternative to the existing ones. The proposed RCA-CLA based hybrid design for 4-bit carry generation provides better results in power consumption because of its low transistor count. The low transistor count corresponds to low switching activity, which minimizes dynamic power. A lower transistor count also results in lower leakage power. As a result, an overall improvement in power consumption can be achieved compared to the existing designs. Also, the conventional 4-bit CLA circuit provides a $\overline{C}_4$ signal at the output terminal, so an inverter must be added in series with the output to generate $C_4$ from $\overline{C}_4$. In the proposed circuit, however, because the NAND-XNOR based $\overline{G}_i$ and $\overline{P}_i$ signals are used as inputs to the CLA circuit, the design generates the $C_4$ signal directly. As a result, the proposed design completely removes an inverter stage from the output circuit, which ensures better speed compared to the conventional and modified CLA circuits. In addition, due to the low transistor count, the area of the proposed design is quite low compared to the existing ones. Therefore, the design would be quite suitable for systems with area constraints.

Simulation result of 16-bit CSLA implemented using 4-bit architectures
After the 4-bit architectures are extended in CSLA style, the circuits are simulated to extract results. Results for the different circuits are provided in Table 2. From the simulation results presented in Table 2, it can be seen that the 16-bit CSLA based on the proposed hybrid RCA-CLA 4-bit architecture obtained better performance in power and PDP. In terms of speed, the CLA implemented without $P_i$ and $G_i$ circuits obtained better results. However, from the comparison presented in Table 2, it can be observed that the proposed design has a similar level of propagation delay to the designs presented in Refs. 30,31. Thus, the performance of the proposed design in a wide word-length structure is quite satisfactory when compared with the existing designs. Although the comparative performance improvements look marginal, the proposed design requires only 572 transistors to reach these performance parameters. Among the existing designs, the Compact CLA in Ref. 32 has the smallest number of transistors (628 transistors). However, its performance is quite low compared to the proposed design. Therefore, the proposed design offers far better performance with 8.91% fewer devices. If the comparison is made with the conventional CLA, the design presented in this research provides far superior performance with 39.67% fewer transistors. Among the existing designs, the CLA without $P_i$, $G_i$ terms in Ref. 30 has the best performance parameters. If we compare the proposed design with Ref. 30, it can be observed that the proposed circuit offers a marginal performance improvement with a 20.99% lower transistor count. Nowadays, to enhance integration density, designers look for circuits offering optimal performance with the smallest number of devices. For this reason, although its improvement over the designs presented in Refs. 30,31 is marginal, the proposed design can be an excellent alternative due to its ability to provide better performance with far fewer transistors.
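The device-count comparison against the Compact CLA can be reproduced with a one-line calculation. The sketch below uses only the two transistor counts stated in the text (572 and 628); the function name `reduction_percent` is hypothetical.

```python
def reduction_percent(reference_count, proposed_count):
    """Percentage by which proposed_count is smaller than reference_count."""
    return 100.0 * (reference_count - proposed_count) / reference_count


if __name__ == "__main__":
    # Proposed 16-bit CSLA (572 transistors) versus Compact CLA based CSLA (628).
    saving = reduction_percent(628, 572)
    print(f"{saving:.3f} % fewer transistors")
```

The result is roughly 8.9%, in line with the figure quoted in the text.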

CONCLUSION
This research presented the design of a hybrid 4-bit carry generator that integrates the RCA and CLA methods within the same block. The hybrid 4-bit structure was then used as a basic building block to implement a 16-bit CSLA. The performance of the design is justified by comparison with 9 existing designs. First, the designs were simulated as 4-bit structures. Later, the designs were used as building blocks to implement 16-bit CSLAs. The proposed hybrid RCA-CLA based 4-bit architecture and its 16-bit extended version in CSLA style showed the best performance in power and PDP. Also, the propagation delay of the proposed design is at a similar level to that of the existing circuit that obtained the best performance in speed. Hence, due to these performance parameters, the proposed hybrid design offers a better alternative to the existing designs for high-performance computing.

FIGURE 1 Proposed 4-bit hybrid RCA-CLA architecture.
FIGURE 2 Proposed NAND-XNOR module for $\overline{G}_i$ and $\overline{P}_i$ signal generation.

3.1 Proposed NAND-XNOR module for $\overline{G}_i$ and $\overline{P}_i$ signal generation
The schematic of the proposed NAND-XNOR module for the hybrid RCA-CLA architecture is shown in Figure 2. Rather than using separate circuits for NAND and XNOR, this paper proposes an integrated NAND-XNOR circuit that shares circuit components between the NAND and XNOR operations. As per the proposed circuit diagram of the NAND-XNOR module, the following logic applies to the XNOR signal.
XNOR Logic Condition 1 ($A_i B_i = 00$): For $A_i B_i = 00$, $p_4$ and $p_5$ are switched ON, through which strong logic 1 is passed towards the XNOR signal node. As a result, the XNOR signal becomes logic 1.
XNOR Logic Condition 2 ($A_i B_i = 01$): For $A_i B_i = 01$, $n_2$ is switched ON, through which $A_i = 0$ is passed towards the XNOR signal node. As a result, the XNOR signal becomes logic 0.
XNOR Logic Condition 3 ($A_i B_i = 10$): For $A_i B_i = 10$, $n_1$ is switched ON, through which $B_i = 0$ is passed towards the XNOR signal node. As a result, the XNOR signal becomes logic 0.
XNOR Logic Condition 4 ($A_i B_i = 11$): For $A_i B_i = 11$, $p_3$ is switched ON by the gate control signal $\overline{A}_i = 0$. Through $p_3$, strong logic 1 is passed towards the XNOR signal node. As a result, the XNOR signal becomes logic 1.

FIGURE 3 Schematic of the 1-bit carry generator used for the proposed hybrid RCA-CLA 4-bit architecture.
FIGURE 4 Schematic of the proposed $C_4$ circuit based on NAND-XNOR module.
From Figure 5, it can be seen that a 4-bit carry generator works as the initial circuit, which generates the $C_4$ signal. For the $C_5$-$C_8$, $C_9$-$C_{12}$ and $C_{13}$-$C_{16}$ signals, parallel 4-bit units are used. To understand the operation, consider the $C_5$-$C_8$ signal generation part. The parallel 4-bit unit for $C_5$-$C_8$ on the left side generates its output signal assuming the input carry is logic 0. The parallel 4-bit unit for $C_5$-$C_8$ on the right side generates its output signal assuming the input carry is logic 1. Based on the input carry from the previous stage (the $C_4$ signal), the 2:1 MUX decides which $C_8$ signal is passed towards the output node. If $C_4 = 0$, the $C_8$ generated by the left-side 4-bit carry generator unit is passed on to the output node of that stage. On the other hand, if $C_4 = 1$, the $C_8$ generated by the right-side 4-bit carry generator unit is passed on to the output node of that stage. The subsequent 4-bit stages compute their output signals in exactly the same way. A transmission gate based MUX is used for all the 2:1 MUXs.
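The carry-select scheme described above can be sketched behaviourally as follows. Each 4-bit block is modelled by integer addition rather than by the transistor-level carry generator, and the 2:1 MUX is modelled as a conditional selection; the names `block_carry_out` and `csla16_carries` are hypothetical.

```python
def block_carry_out(a4, b4, carry_in):
    """Carry out of one 4-bit block (behavioural stand-in for a 4-bit carry generator)."""
    return (a4 + b4 + carry_in) >> 4


def csla16_carries(a, b, c0=0):
    """Return [C4, C8, C12, C16] for 16-bit operands using carry selection."""
    mask = 0xF
    # Initial block: computes C4 from the real input carry.
    carry = block_carry_out(a & mask, b & mask, c0)
    carries = [carry]
    for k in (1, 2, 3):
        a4 = (a >> (4 * k)) & mask
        b4 = (b >> (4 * k)) & mask
        # Two parallel units evaluate the block for both possible input carries.
        out_if_zero = block_carry_out(a4, b4, 0)
        out_if_one = block_carry_out(a4, b4, 1)
        # 2:1 MUX: the carry from the previous stage selects the valid result.
        carry = out_if_one if carry else out_if_zero
        carries.append(carry)
    return carries


if __name__ == "__main__":
    print(csla16_carries(0xFFFF, 0x0001))
```

Because both candidate results of a block are available before the previous carry arrives, only the MUX delay is added per stage once $C_4$ is known.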

FIGURE 6 Average power comparison.
FIGURE 7 Propagation delay comparison.
FIGURE 8 PDP comparison.

Several 4-bit CLA cells based on modified circuits for the $G_i$ and $P_i$ terms are presented in Refs. 22-25. In these 4-bit CLA cells, the CLA blocks along with their Boolean equations are identical to the static CMOS logic based conventional 4-bit CLA. The only difference lies in the circuits for the $G_i$ and $P_i$ terms. Rather than using static CMOS logic-based circuits, the 4-bit CLA cells in Refs. 22-25 used hybrid logic-based circuits. With this approach, the number of transistors could be reduced and several performance parameters could be improved. The CLA cells in Refs. 22,23 used the hybrid AND and XOR gates in Refs. 26,27, respectively. On the other hand, the CLA cells in Ref. 24 used the AND gate presented in Ref. 28, whereas the XOR gate is taken from Ref. 29.

2.3 4-bit CLA without $G_i$ and $P_i$ terms
Two 4-bit CLA unit cells that do not utilize $G_i$ and $P_i$ terms are portrayed in Refs. 30,31. The circuits in Refs. 30,31 focus on generating $C_1$-$C_4$ directly from the input bits $A_i$ and $B_i$. Rather than dividing the 4-bit CLA cell into $G_i$, $P_i$ circuits and CLA circuits, the designs in Refs. 30,31 use a complex mirror network of pull-up and pull-down circuits to implement $C_1$-$C_4$. Since the outputs are generated by skipping the $G_i$ and $P_i$ terms, the 4-bit CLA cells in Refs. 30,31 do not experience the delay caused by the $G_i$ and $P_i$ circuits. In this way, speed could be improved.
Also, Figures 6, 7, and 8 present graphical representations of the data provided in Table 1. It is clear from Table 1 that the proposed hybrid RCA-CLA based 4-bit architecture obtained better performance in average power and PDP compared to the existing ones. The CLA architectures without $P_i$ and $G_i$ terms in Refs. 30,31 have better speed than the proposed one.
TABLE 1 Simulation results of 4-bit architectures.
TABLE 2 Simulation results of 16-bit CSLA implemented using 4-bit architectures.