Vedic division methodology for high-speed very large scale integration applications

A transistor-level implementation of a division methodology based on ancient Vedic mathematics is reported in this Letter. The 'Dhvajanka (on top of the flag)' formula was adopted from Vedic mathematics to implement a divider for practical very large scale integration applications. The division was implemented using half of the divisor bits instead of the actual divisor, together with subtraction and a small amount of multiplication. The propagation delay and dynamic power consumption of the divider circuitry were reduced significantly through the stage reduction offered by the Vedic division methodology. The functionality of the division algorithm was checked, and performance parameters such as propagation delay and dynamic power consumption were calculated, using Spice Spectre with 90 nm complementary metal oxide semiconductor technology. The propagation delay of the resulting (32 ÷ 16) bit divider circuitry was only ∼300 ns, and it consumed ∼32.5 mW of power for a layout area of 17.39 mm². By combining Boolean arithmetic with ancient Vedic mathematics, the number of iterations was reduced substantially, resulting in ∼47, ∼38 and ∼34% reductions in delay and ∼34, ∼21 and ∼18% reductions in power compared with the most commonly used architectures (digit-recurrence, Newton–Raphson and Goldschmidt, respectively).


Introduction
Division is a fundamental operation in many scientific and engineering applications, such as arithmetic computation, signal processing, artificial intelligence and computer graphics [1][2][3]. Generally, division operations are computed sequentially and are therefore costlier in terms of propagation delay (latency) than other mathematical operations such as addition, subtraction and multiplication [4].
A substantial amount of work has so far been carried out by various researchers to implement high-speed dividers [1-15], such as the digit recurrence (DR) methodology (restoring [1,3,5], nonrestoring [2,6,9]), division by convergence (Newton-Raphson (N-R) method [10][11][12]) and division by series expansion (Goldschmidt (G-S) algorithm [13,14]). Generally, division architectures can be classified into two categories: (i) iteration based and (ii) multiplication based. Iterative division consists of shift-and-subtract operations and generates one quotient bit in each iteration, as in radix-2 restoring and nonrestoring division. Thus, in iterative division, after each subtraction cycle it is necessary to check whether the resulting remainder is smaller than the divisor or negative. The computational complexity of DR algorithms [1-3,5,6,9] is low, but the number of iterations is large, so the latency is high. Some researchers rely on higher-radix implementations of the DR algorithm [6,7,10] to reduce the number of iterations; the latency is thereby improved over earlier reports [1-3,5,9], but these schemes increase the hardware complexity. Other attractive ideas are based on functional iteration, such as the N-R [10][11][12] and G-S [13][14][15] algorithms, which use multiplication along with series expansion, so that the number of quotient bits obtained doubles in each iteration. These methods converge quadratically towards the quotient, but the latency becomes high as the number of iterations increases. Each iteration of the N-R and G-S methods involves two dependent multiplications, namely the product of the first multiplication is one of the operands of the second, so it cannot be optimised like a parallel multiplier [13]. The drawbacks of these methods are that the operands must be normalised beforehand, the most used primitive is multiplication and the remainder is not directly obtained.
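As an illustration of the functional-iteration approach described above, the following is a minimal sketch of Newton-Raphson reciprocal division in floating point. It is not taken from the cited references: the function names, the linear initial estimate 48/17 − (32/17)d and the fixed iteration count are assumptions chosen for illustration. It shows the normalisation of the divisor, the two dependent multiplications per iteration and the fact that only a quotient (no remainder) is produced.

```python
import math


def newton_raphson_reciprocal(d, iterations=4):
    """Approximate 1/d for a normalised divisor d in [0.5, 1).

    The initial estimate is an assumed, commonly used linear fit;
    each iteration roughly doubles the number of correct bits.
    """
    if not 0.5 <= d < 1.0:
        raise ValueError("divisor must be normalised to [0.5, 1)")
    x = 48.0 / 17.0 - (32.0 / 17.0) * d
    for _ in range(iterations):
        t = d * x            # first multiplication
        x = x * (2.0 - t)    # second multiplication depends on the first
    return x


def nr_divide(numerator, divisor):
    """Quotient numerator/divisor for divisor > 0; no remainder is produced."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    mantissa, exponent = math.frexp(divisor)   # divisor = mantissa * 2**exponent
    reciprocal = newton_raphson_reciprocal(mantissa)
    return math.ldexp(numerator * reciprocal, -exponent)
```

For example, `nr_divide(38982.0, 73.0)` returns a floating-point value close to 534; obtaining an exact integer quotient and remainder would need an additional correction step.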
At the algorithmic and structural levels, a substantial number of division techniques have so far been developed to reduce the propagation delay and power consumption of the divider circuitry by reducing the number of iterations, aiming at high-speed operation, but the principle behind the division techniques is the same in all cases. Vedic mathematics [16] is the ancient system of mathematics which has unique computation techniques based on 16 sutras (formulae). Recently, we [17] reported a Vedic divider based on 'Nikhilam Navatascaramam Dasatah' for some specific numbers, namely those where the divisor is very close to the base of operation. That implementation reduces the number of iterations if the divisor is close to the base of operation but otherwise increases the iterations, which is a serious bottleneck of the algorithm. In this Letter, we report a division technique based on the same ancient mathematics and its transistor-level implementation. 'Dhvajanka', a Sanskrit term meaning 'on top of the flag', is adopted from the Vedas, and this formula is used to implement the division circuitry. In this approach, the division is transformed into a division by a small part of the divisor instead of the actual divisor, together with subtraction and a few multiplications; this reduces the number of iterations and leads to a substantial reduction in propagation delay. The transistor-level (application specific integrated circuit (ASIC)) implementation of the division circuitry was carried out by combining Boolean arithmetic with Vedic mathematics. Performance parameters such as propagation delay and dynamic switching power consumption of the proposed method were calculated using Spice Spectre in 90 nm complementary metal oxide semiconductor (CMOS) technology and compared with other designs, namely DR- [9], N-R- [11] and G-S- [15] based implementations. The calculated results revealed that the (32 ÷ 16) bit divider circuitry has a propagation delay of ∼300 ns with ∼32.53 mW dynamic switching power for a layout area of 17.39 mm².

Vedic division methodology
The contributions of ancient Indian mathematics to the world history of mathematical science are not well recognised. The contributions of the mathematician Sri Bharati Krsna Thirthaji Maharaja in the field of number theory, in the form of Vedic sutras (formulae) [16], are significant for calculations. He explored the mathematical potential of the Vedic primers and showed that mathematical operations can be carried out mentally to produce fast answers using the sutras (formulae). In this Letter, we use only the 'Dhvajanka' formula to implement the division algorithm and its architecture.

Numerical example of 'Dhvajanka' sutra
In the example shown in Fig. 1a, the dividend is 38 982 (a five-digit number) and the divisor is 73 (a two-digit number). Out of the divisor 73, we put down only the first digit (i.e. 7) in the divisor column and put the other digit (i.e. 3) 'on top of the flag'. In the example shown in Fig. 1b, the dividend is 135 791 and the divisor is 1632. The entire division for Fig. 1a is carried out by 7, and for Fig. 1b by 16. The chart implementation procedure is described in Table 1.

Table 1 Chart implementation procedure; the example has been considered from Fig. 1

| Implementation steps of Fig. 1a | Implementation steps of Fig. 1b |
| --- | --- |
| 1. One digit of the divisor has been put on top; we allot one place (at the right end of the dividend) to the remainder portion of the answer and mark it off from the digit by a vertical line | 1. Two digits have been put on top; we allot two places (at the right end of the dividend) to the floating point portion of the answer and mark it off from the digit by a vertical line |
| 2. 38 is divided by the most significant digit (MSD) of the divisor (i.e. 7). The quotient is 5 and the remainder is 3. This remainder will be used for the next step of the division | |

Algebraic proof of 'Dhvajanka' sutra
The algebraic proof of the formula is shown in Fig. 2, where x stands for 10. It explains the steps taken in Fig. 1a, by means of which 38 982 is divided by 73. Algebraically, the dividend is represented as $38x^3 + 9x^2 + 8x + 2$ and the divisor is $7x + 3$. Now, let us proceed with the division in the usual manner.
1. If we try to divide $38x^3$ by $7x$, our first quotient digit is $5x^2$. Multiplying the divisor by $5x^2$, we obtain the product $35x^3 + 15x^2$, which gives us the remainder $3x^3 + 9x^2 - 15x^2$; since $3x^3$ is equal to $30x^2$, this is $24x^2$.
2. The first-step remainder (i.e. $24x^2$) plus $8x$ is our second-step dividend. We multiply the divisor by the second quotient digit ($3x$), subtract the product $21x^2 + 9x$ and obtain $3x^2 - x$ as the remainder.
3. However, this $3x^2$ is equal to $30x$, which (with $-x + 2$) gives us $29x + 2$ as the last-step dividend. Again multiplying the divisor by 4, we obtain the product $28x + 12$; subtracting this $28x + 12$, we obtain $x - 10$ as the remainder. Since x is 10, the remainder vanishes.
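The numerical example of Fig. 1a and the algebraic steps above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `dhvajanka_divide` is hypothetical, only one flag digit is used (the last digit of the divisor in the chosen base) and an over-estimated quotient digit is corrected by reducing it by one whenever the next gross dividend would be negative.

```python
def dhvajanka_divide(dividend, divisor, base=10):
    """Flag-style division sketch with a single flag digit.

    The divisor is split as lead * base + flag. Every division is done
    by `lead` only; `flag` enters through a multiply-and-subtract step.
    Returns (quotient, remainder).
    """
    if dividend < 0 or divisor < base:
        raise ValueError("need dividend >= 0 and divisor >= base")
    lead, flag = divmod(divisor, base)

    # Dividend digits, most significant first.
    digits = []
    n = dividend
    while n:
        digits.append(n % base)
        n //= base
    digits.reverse()
    if not digits:
        digits = [0]

    quotient = 0
    gross = digits[0]            # current gross dividend
    for d in digits[1:]:
        q, carry = divmod(gross, lead)       # divide by the lead part only
        gross = carry * base + d - q * flag  # bring down a digit, subtract q * flag
        while gross < 0:                     # q was too large: reduce it by one
            q -= 1
            gross += divisor                 # equals carry += lead in the line above
        quotient = quotient * base + q
    return quotient, gross                   # the last gross dividend is the remainder


print(dhvajanka_divide(38982, 73))     # (534, 0), the example of Fig. 1a
print(dhvajanka_divide(135791, 1632))  # (83, 335)
```

For 38 982 ÷ 73 the intermediate gross dividends are 38, 24, 29 and 0, matching the coefficients in the algebraic steps. Calling the function with `base=100` treats the last two decimal digits of the divisor as a single flag value, which loosely mirrors the two-digit flag of Fig. 1b but is not the same procedure as that chart.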

Mathematical modelling of 'Dhvajanka' sutra
Let us assume that the number $A = \sum_{i=0}^{n-1} a_i x^i$ is the dividend and $B = \sum_{i=0}^{m-1} b_i x^i$ is the divisor, where x is the radix of the number system. 'A' can then be expressed in terms of 'B' (see (2) and (3)).

Illustration of Dhvajanka sutra
Consider the dividend $f(x) = a_3 x^3 + a_2 x^2 + a_1 x + a_0$ and the divisor $g(x) = b_1 x + b_0$, where 'x' is the radix. We have to compute $f(x)/g(x)$ with the help of the 'on top of the flag' sutra. Mathematically, $f(x)/g(x) = (a_3 x^3 + a_2 x^2 + a_1 x + a_0)/(b_1 x + b_0)$ can be represented as shown in (4), (5) and (6). Through an algebraic identity, these equations can be rewritten.
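As a worked sketch of this illustration (not necessarily the form of equations (4)-(6)), the division of the cubic by the linear divisor can be written as a chain of small divisions by $b_1$ alone. The symbols $q_2, q_1, q_0$ (quotient digits), $r_2, r_1, r_0$ (carried remainders) and $R$ (final remainder) are notation assumed here, and each left-hand side is assumed to be non-negative (otherwise the preceding quotient digit is reduced by one).

```latex
\begin{align}
a_3 &= q_2 b_1 + r_2, \\
r_2 x + a_2 - q_2 b_0 &= q_1 b_1 + r_1, \\
r_1 x + a_1 - q_1 b_0 &= q_0 b_1 + r_0, \\
R &= r_0 x + a_0 - q_0 b_0, \\
f(x) &= (q_2 x^2 + q_1 x + q_0)\, g(x) + R .
\end{align}
```

Substituting the first four relations into the right-hand side of the last one cancels every $r_i$ and $q_i b_0$ term and returns $a_3 x^3 + a_2 x^2 + a_1 x + a_0$. With $x = 10$, $a_3 = 38$, $a_2 = 9$, $a_1 = 8$, $a_0 = 2$, $b_1 = 7$ and $b_0 = 3$, the chain gives $q_2 = 5$, $q_1 = 3$, $q_0 = 4$ and $R = 0$, in agreement with 38 982 ÷ 73 = 534.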

Flowchart diagram of the algorithm
In this section, the divider implementation algorithm leading towards high-speed operation is discussed. The flowchart of the algorithm is shown in Fig. 3, where the dividend (A) and divisor (B) are considered as n-bit and m-bit numbers, respectively. The implementation procedure using the flowchart is described in Table 2, where two examples are considered.
Example 1 is a perfect division (remainder = 0) and Example 2 is an imperfect division (remainder ≠ 0). For simplicity, an (8 ÷ 4) bit divider example is considered; examples with a higher number of bits can be implemented in a similar manner.

Divider implementation technique
The proposed divider implementation technique is shown in Fig. 4. The architecture has been implemented via (3). For simplicity, let us assume that the dividend is longer than the divisor. The divisor is broken into two parts, that is, the most significant part (L) and the least significant part (R). L is compared with an equal number of bits of the dividend taken from the most significant bit (MSB) side. If these dividend bits are greater than L, they are divided directly by L; otherwise they are concatenated with the next significant bit of the dividend. The division is implemented through a subtractor. The difference acts as the remainder, and the borrow acts as the selector input of the multiplexer: if the borrow is '0' the quotient bit is '1', otherwise it is '0'. The remainder is then concatenated with the next MSD of the dividend, and the cross-multiplication result of the quotient bits and the least significant bits of the divisor is subtracted from it. If the result is negative, the quotient is reduced by '1' and the new quotient bits are set; otherwise, for a positive result, it is passed to the next stage. The rest of the division proceeds similarly. Consider the number $A = \sum_{i=0}^{n-1} a_i 2^i$ to be divided by $B = \sum_{i=0}^{m-1} b_i 2^i$, where $a_i, b_i \in \{0, 1\}$. To execute the division easily through the 'Dhvajanka (on top of the flag)' methodology, it is assumed that the length of the dividend is greater than the length of the divisor.
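The subtractor and multiplexer stage described above can be sketched as follows. This is an illustrative software model, not the transistor-level design: the function names are hypothetical, and only the division by the most significant part L is modelled (the cross-multiplication with R and the quotient correction are left out). A borrow of 0 selects quotient bit 1 and the difference as remainder; a borrow of 1 selects quotient bit 0 and keeps the previous value.

```python
def subtract_and_select(partial, divisor_part):
    """Model one subtractor followed by a 2:1 multiplexer.

    Returns (quotient_bit, remainder).
    """
    difference = partial - divisor_part
    borrow = 1 if difference < 0 else 0
    quotient_bit = 0 if borrow else 1               # borrow drives the selector
    remainder = partial if borrow else difference   # keep or replace the partial value
    return quotient_bit, remainder


def divide_by_part(value, divisor_part, width):
    """Bit-serial division of a `width`-bit value by `divisor_part` (> 0)."""
    if divisor_part <= 0:
        raise ValueError("divisor part must be positive")
    quotient = 0
    remainder = 0
    for i in reversed(range(width)):
        # Concatenate the next most significant bit of the value.
        remainder = (remainder << 1) | ((value >> i) & 1)
        bit, remainder = subtract_and_select(remainder, divisor_part)
        quotient = (quotient << 1) | bit
    return quotient, remainder


print(divide_by_part(0b100110, 0b11, 6))  # (12, 2), since 38 = 12 * 3 + 2
```

In this model a partial value equal to L produces no borrow, so the quotient bit is 1 in that case as well.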

Implementation procedure
Step 1: Consider the most significant part of the dividend.
Step 2: Determine
Suppose the first borrow is '0'; then, through the multiplexer, the quotient bit ($Q_n$) is set to '1' and the remainder is 'R'.
Step 4: Determine
Divide again by a similar procedure (step 1). Set the quotient bit $Q_{n-1}$ and the remainder 'R'.

Latency of the divider
The hardware cost of the architecture can be computed from the number of complex operations performed in its critical path, from which the total propagation delay can be estimated. Division using the reported Vedic architecture is computed in five stages, shown in Fig. 5, with a maximum of 'n' iterations (for imperfect division). The total latency can therefore be computed as the sum of the propagation delays of the individual stages, repeated over 'n' iterations. The total propagation delay of the proposed architecture ($t_{pd}$) can be computed as

$t_{pd} = n\,(t_{stage1} + t_{stage2} + t_{stage3} + t_{stage4} + t_{stage5})$

where $t_{stage1}$, $t_{stage2}$, $t_{stage3}$, $t_{stage4}$ and $t_{stage5}$ are the propagation delays of stage 1, stage 2, stage 3, stage 4 and stage 5, respectively.
Stage 1 contains only a comparator [18], which is implemented through two stages of parallel adders and two stages of XOR gates. For an m-bit divisor, at most an m/2 bit comparator is required, and hence at most an m/2 bit parallel adder in each case. The critical path of a full adder is equal to 2 XOR gate delays, so the critical path of an m/2 bit parallel adder is (m/2) × 2 = m XOR gate delays. Since two stages of parallel adders and two XOR stages are required to implement a comparator, its total propagation delay equals (2m + 2) XOR gate delays. The second stage contains only an m/2 bit parallel subtractor; the critical path of a 1 bit subtractor equals 3 XOR gate delays, so the total critical path delay of the m/2 bit subtractor may be estimated as (m/2) × 3 XOR gate delays. The third stage contains only an n-bit parallel adder; assuming that one full adder requires 2 XOR gate delays, the total propagation delay of the n-bit parallel adder is n × 2 XOR gate delays. The fourth stage contains an m/2 bit multiplier and an n-bit subtractor in the feedback path. The critical path delay of the n-bit subtractor is assumed to equal 3 × n XOR gate delays. To implement the multiplier, three stages are required, namely (i) partial product generation, (ii) partial product addition and (iii) final addition [18]. In the partial product generation stage, the maximum depth of a column of partial products is equal to m/2, so generating the partial products requires m/2 XOR gate delays (assuming that XOR gate and AND gate delays are equal). The first stage of partial product addition may require (m/(2 × 3)) × 2, that is, m/3 XOR gate delays, the second stage requires m/6 XOR gate delays, and so on; thus the total multiplier delay may be approximated as m + (m/2) = 3m/2 XOR gate delays. In the fifth stage, an m/2 bit subtractor is required, whose critical path delay equals (m/2) × 3 XOR gate delays.
This is an open access article published by the IET under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/)
Thus, the total propagation delay of each iteration may be approximated as (5n + (13m/2) + 2) XOR gate delays. Thereby, n iterations may consume n(5n + (13m/2) + 2) XOR gate delays.
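The stage-by-stage estimate above can be tallied with a short script. This is only a bookkeeping sketch in units of XOR gate delays, using the per-stage figures given in the text; the function and key names are hypothetical, and no conversion to nanoseconds is attempted.

```python
from fractions import Fraction


def stage_delays(n, m):
    """Per-iteration delay of each stage, in XOR gate delays.

    n: dividend width in bits, m: divisor width in bits.
    """
    half_m = Fraction(m, 2)
    return {
        "stage1_comparator": 2 * m + 2,
        "stage2_subtractor": 3 * half_m,
        "stage3_adder": 2 * n,
        "stage4_multiplier_and_subtractor": 3 * half_m + 3 * n,
        "stage5_subtractor": 3 * half_m,
    }


def iteration_delay(n, m):
    """Sum over the five stages: 5n + 13m/2 + 2 XOR gate delays."""
    return sum(stage_delays(n, m).values())


def worst_case_delay(n, m):
    """Delay of n iterations (imperfect division), in XOR gate delays."""
    return n * iteration_delay(n, m)


print(iteration_delay(32, 16))   # 266
print(worst_case_delay(32, 16))  # 8512
```

For the (32 ÷ 16) bit case the sum is 5 × 32 + 13 × 16/2 + 2 = 266 XOR gate delays per iteration, and 32 × 266 = 8512 for 32 iterations.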

Results and discussion
The advantages of CMOS transmission gate (TG) logic over conventional CMOS and complementary pass transistor logic (CPL) [19,20] are well established. As the CMOS TG consists of one p-channel MOSFET (PMOS) and one n-channel MOSFET (NMOS) connected in parallel, its 'ON' resistance is smaller than that of even a single NMOS. Proper modifications at the device, circuit and architectural levels of the design hierarchy have been implemented to reduce the energy delay product (EDP) and power delay product (PDP) of the proposed design. TGs are used in the design of the different modules for faster operation and better logic transformation. A dual threshold voltage ($V_T$) operating mode was considered in the simulation to determine the performance parameters. The proper choice of threshold voltage for a particular transistor in the circuit is based on a number of considerations, as described below: (i) Placement of high-$V_T$ transistors on the leakage path directly between supply and ground reduces the subthreshold leakage current and hence the static power. The entire algorithm in this Letter was simulated, and its functionality was examined, using the Spice Spectre simulator. Performance parameters such as propagation delay and dynamic power consumption were calculated using standard 90 nm CMOS technology with a 1 V power supply, operated at 250 MHz. As shown, the application of the Vedic division methodology reduces the number of iterations, which results in a reduction of the propagation delay and dynamic switching power consumption.
To implement Vedic dividers such as the (4 ÷ 4), (4 ÷ 8), (4 ÷ 16), (8 ÷ 4), (8 ÷ 8), (8 ÷ 16) etc. bit dividers, all the individual modules such as the subtractor, adder and cross-multiplier were implemented through TGs to make the circuit faster. The individual performance parameters, namely propagation delay, dynamic switching power consumption, EDP and PDP, have been computed for the different circuit modules. With the help of all the modules, the final simulation has been carried out and the performance parameters have been calculated. A comparative study between different architectures and the proposed architecture for the (4 ÷ 4), (4 ÷ 8), (4 ÷ 16), (8 ÷ 4), (8 ÷ 8), (8 ÷ 16) etc. bit dividers is shown in Table 3. Proper modifications at the device, circuit and architectural levels of the design hierarchy have been analysed in terms of propagation delay, average power dissipation and their products. The values of delay, power, EDP and PDP of the different architectures are measured and tabulated in Table 3. The EDP (in units of 10⁻²¹ J s) and PDP (in units of 10⁻¹² J) are quantitative measures of efficiency and of the compromise between speed and power dissipation. EDPs and PDPs are particularly important when high-speed operation is needed; they are compared at a 1 V supply voltage with 90 nm CMOS technology. Input data were taken in a regular fashion for experimental purposes. For each transition, the delay is measured from 50% of the input voltage swing to 50% of the output voltage swing.
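The figures of merit used in Table 3 can be computed from a power and a delay value as sketched below. The definitions assumed here are the conventional ones (PDP = power × delay, EDP = PDP × delay); the Letter does not spell them out, and the function names are hypothetical. The `percent_reduction` helper expresses an improvement relative to a reference design; the reference value in the usage line is a made-up placeholder, not a figure from Table 3.

```python
def power_delay_product(power_w, delay_s):
    """PDP in joules: average power (W) times propagation delay (s)."""
    return power_w * delay_s


def energy_delay_product(power_w, delay_s):
    """EDP in joule-seconds: PDP times propagation delay."""
    return power_w * delay_s * delay_s


def percent_reduction(reference, proposed):
    """Reduction of `proposed` relative to `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference


# Values quoted in the text for the (32 / 16) bit divider: ~32.53 mW, ~300 ns.
pdp = power_delay_product(32.53e-3, 300e-9)   # about 9.76e-9 J
edp = energy_delay_product(32.53e-3, 300e-9)  # about 2.93e-15 J s

# Hypothetical reference delay of 100 units against a proposed 66 units.
print(percent_reduction(100.0, 66.0))  # 34.0
```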
It is worth mentioning here that we have taken the implementation methodologies from different references [9,11,15], implemented them in the same technological environment (Spice Spectre with standard 90 nm CMOS technology) and then compared the performance parameters. The propagation delay and switching power are the worst-case delay and power over all possible bit combinations. It can be observed from Table 3 that the (32 ÷ 16) bit divider requires ∼300 ns to propagate a signal and consumes ∼32.53 mW of power for a layout area of ∼17.39 mm². The proposed architecture offered ∼47.3, ∼38.4 and ∼34% faster operation (propagation delay) than the DR [9], N-R [11] and G-S [15] architectures, respectively. On the other hand corresponding reduction of power consumption

Fig. 4 Hardware implementation of divider using 'Dhvajanka' formula