Low-power fast Fourier transform hardware architecture combining a split-radix butterfly and efficient adder compressors

Fast Fourier transform (FFT) is the most common low-complexity implementation of the discrete Fourier transform, intensively employed to process real-world signals in smart sensors for the internet of things. Butterflies play a central role as the FFT computing core data path since they calculate complex terms employing several multipliers. A low-power FFT hardware architecture combining a split-radix decimation-in-time butterfly and 5-2 adder compressors (ACs) is proposed and implemented. The circuits are described in Verilog hardware description language and synthesized using the Cadence Genus synthesis tool. The circuits are mapped onto a 65-nm CMOS ST standard cell library. Results reveal that the proposed FFT hardware architecture using the split-radix butterfly is 13.28% more power efficient than the radix-4 one. The results further show that, by combining 5-2 ACs within the split-radix butterfly, our proposal saves up to 43.1% of the total power dissipation considering the whole FFT hardware architecture, compared with the state-of-the-art radix-4 butterfly employing the adder automatically selected by the logic synthesis tool.


| INTRODUCTION
In the current internet of things (IoT) era, low-power implementations of smart sensors are among the primary design concerns. Smart sensors are capable of locally processing the captured signal to reduce the burden on the RF (radio frequency) communication channel for transferring massive amounts of data. Signal processing in the sensor can drastically minimize the amount of energy consumed in both industrial [1] and consumer electronic wireless communication. For such applications, the ASIC design of digital signal processing (DSP) algorithms is of paramount importance to meet the energy consumption constraints.
Nowadays, power dissipation in VLSI DSP circuits has gained special attention, mainly due to the proliferation of battery-powered high-performance devices such as smartphones, portable monitors, and notebooks. Efficient DSP algorithms are applied to process data in smart IoT devices [2]. Among those, the fast Fourier transform (FFT) is one of the most widely employed to transform the signal from the time to the frequency domain. Smart sensors for IoT commonly employ FFT to perform the spectral analysis of the signals, as shown in [3]. When the energy-efficiency of devices with power restrictions becomes an issue, low-power VLSI architectures must be developed [4]. Therefore, designing low-power FFT hardware architectures is essential to maximize the efficiency of this type of smart sensors, as we are proposing here.
FFT is the most common low-complexity implementation of the DFT (discrete Fourier transform). FFT is often employed to support the data classification or compression in the IoT smart sensors [1]. The butterfly is the mathematical core of the FFT and performs its complex operations. Butterflies perform the complex multiplication of input data by the twiddle factors [5] (i.e., appropriate coefficients). The twiddle factors are multiple values of N/2, where N represents the number of FFT points. One of the main challenges in the FFT hardware architecture is the optimization of the butterfly by using efficient arithmetic operators, which is the goal of the architecture proposed here.
The FFT algorithm splits the processing into butterflies and performs a significant part of them without using the last points of the input array [6]. The literature proposes different butterfly designs that address both of these characteristics, such as radix-2, radix-4, mixed-radix, and split-radix [7][8][9][10][11][12][13]. Among these, the split-radix algorithm requires fewer operations than the radix-2 and radix-4 algorithms [14]. The key idea of a split-radix butterfly is to mix the decompositions of both radix-2 and radix-4 butterflies. The hardware architecture of the split-radix butterfly enables the simultaneous addition of complex terms. For this task, employing adder compressors (ACs) is a promising alternative for low-power designs. AC parallelization schemes reduce the hardware architecture's internal switching activity, which proportionally reduces the dynamic power dissipation, compared with a combination of conventional carry-propagating adders [15,16].
By performing a careful analysis of the complex equations, composed of real and imaginary multiplications, additions, and subtractions, we propose a 5-2 AC in the butterfly of a split-radix with decimation in time (DIT). According to [14], the DIT FFT butterfly involves less computation time than the DIF one. Therefore, we combine the optimized DIT split-radix butterfly with ACs in a fully-sequential 16-point FFT implementation to investigate the entire FFT architecture's power efficiency. Furthermore, by implementing the DIT butterfly, we can compare our results with the literature solutions.
The hardware architectures were described in Verilog hardware description language (HDL) and synthesized for ASIC employing the standard cells of the ST 65 nm CMOS process. We used a power extraction methodology flow to perform a realistic power estimation that accounts for glitching. The results show that our split-radix butterfly proposal employing 5-2 ACs is more area and power efficient than both a baseline split-radix butterfly (without AC) and radix-4 butterflies. Furthermore, the FFT hardware architecture with the proposed split-radix butterfly results in higher area and power efficiency than the FFT employing the other aforementioned butterflies.
The main contributions of this work are as follows:
1. A hardware architecture for FFT employing a split-radix butterfly, which is more power-efficient than previous works.
2. Demonstration of the efficient design of 5-2 ACs within the split-radix butterfly used in the complete FFT hardware architecture.
This article is organized as follows: Section 2 presents a background on the FFT as well as an overview of split-radix butterfly and ACs. The most relevant works from the literature are also discussed in this section and compared with our proposals. Section 3 presents the architecture of the proposed FFT, combining the split-radix butterfly and the ACs. Section 4 shows the main results achieved by the proposed solutions. Finally, Section 5 highlights our conclusions.

| BACKGROUND
This section presents a background on the FFT, the split-radix butterfly structure, an overview of 5-2 AC, and finally, the most relevant works from the literature.

| FFT
FFT is one of the most important algorithms in signal processing and data analysis. The FFT can be applied to solve a myriad of problems such as speech and image processing, signal analysis, and communication systems. The work in [14] surveys the advances in FFT algorithms and applications during the last 50 years.
The DFT is one of the most used mathematical tools in DSP and communication systems. The DFT is an accurate technique that converts temporal or spatial data into frequency domain data. The discrete version of the FT views both the time domain and frequency domain as periodic. In other words, it establishes a relationship between the time and the frequency domain representations. However, the DFT is not computed directly due to the large number of arithmetic operations and data transfers involved in its process. Thus, the FFT is preferred since it resorts to algorithms with lower computational cost, evaluating the DFT efficiently [14]. The FFT reduces the number of computations needed for a problem of size N from $O(N^2)$ to $O(N \log_2 N)$.
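The gap between the two complexity figures can be illustrated with a short sketch. It evaluates the DFT definition directly and compares rough complex-multiplication counts; the function names are hypothetical, and the FFT count assumes a plain radix-2 algorithm with (N/2) log2 N twiddle multiplications for a power-of-two N.

```python
import cmath
import math


def naive_dft(x):
    """Direct O(N^2) evaluation of the DFT definition."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
        for k in range(n)
    ]


def dft_complex_mults(n):
    """Complex multiplications in the direct DFT: one per (k, t) pair."""
    return n * n


def fft_complex_mults(n):
    """Rough count for a radix-2 FFT (assumption): (N/2) * log2(N)."""
    return (n // 2) * int(math.log2(n))


if __name__ == "__main__":
    for n in (16, 256, 1024):
        print(n, dft_complex_mults(n), fft_complex_mults(n))
```

For the 16-point case studied in this work, the direct form needs 256 complex multiplications against 32 in this radix-2 estimate.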
One of the drawbacks of conventional FFT hardware architectures is the presence of multiplier blocks, which raise hardware complexity, increase power consumption, and limit the maximum operating clock frequency. However, some techniques have addressed this problem by using strategies such as the coordinate rotation digital computer [12] or more efficient ACs, as in this work.
The most commonly used FFT algorithm is the Cooley-Tukey algorithm [5]. It is a divide-and-conquer algorithm for the computation of complex Fourier series. The main characteristic of this algorithm is to break the overall DFT into smaller DFTs. The basic idea uses the radix-2 butterfly block, that is, one calculates the FFT $X(k)$ of a signal $x(n)$ using (1), where $W_N$ is the N-th twiddle factor, $j$ is the imaginary unit, and $N$ is the number of points of the FFT [5].
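A minimal recursive sketch of the radix-2 decomposition is given below. It assumes the standard DFT definition X(k) = sum over n of x(n) W_N^(nk) with W_N = exp(-j 2 pi / N), and a power-of-two input length; the function name is hypothetical and the code is a software illustration, not the hardware data path.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 DIT FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])  # half-length DFT of even-indexed samples
    odd = fft_radix2(x[1::2])   # half-length DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Radix-2 butterfly: one twiddle multiplication, one add, one subtract
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

Each loop iteration is one radix-2 butterfly, which is the block that the split-radix butterfly discussed later replaces.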
By referring to the FFT, one is not considering just a single algorithm or mathematical expression but a family of algorithms and methods to efficiently calculate the DFT. However, in all of these methods, the reasoning is similar and can be explained mathematically in several ways. Recent advances in the FFT algorithm take into account higher radix and split-radix.
Standard radix-2 algorithms use two half-length DFTs, and the radix-4 algorithms are based on four quarter-length DFTs. The split-radix algorithm uses both one half-length DFT and two quarter-length DFTs. It is possible because, in the radix-2 computations, the even-indexed points are independent of the odd ones. The split-radix algorithm uses the radix-4 algorithm to compute the odd-numbered points. Therefore, the N-point DFT is decomposed into one N/2-point DFT and two N/4-point DFTs [12]. The advantage of split-radix is that it has a reduced number of arithmetic computations compared with those of radix-4 and radix-2 FFT. The split-radix also has advantages such as a regular structure and no reordering of internal signals except for outputs.
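The decomposition into one N/2-point and two N/4-point DFTs can be sketched recursively as follows. This is an illustrative software model under the usual convention W_N = exp(-j 2 pi / N), with a hypothetical function name; it is not the architecture evaluated in this work.

```python
import cmath


def fft_split_radix(x):
    """Recursive DIT split-radix FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    u = fft_split_radix(x[0::2])   # N/2-point DFT of even-indexed samples
    z = fft_split_radix(x[1::4])   # N/4-point DFT of samples x[4m+1]
    zp = fft_split_radix(x[3::4])  # N/4-point DFT of samples x[4m+3]
    q = n // 4
    out = [0j] * n
    for k in range(q):
        a = cmath.exp(-2j * cmath.pi * k / n) * z[k]       # W_N^k  * Z_k
        b = cmath.exp(-2j * cmath.pi * 3 * k / n) * zp[k]  # W_N^3k * Z'_k
        s = a + b
        d = -1j * (a - b)
        out[k] = u[k] + s
        out[k + 2 * q] = u[k] - s
        out[k + q] = u[k + q] + d
        out[k + 3 * q] = u[k + q] - d
    return out
```

Only two twiddle multiplications are needed per group of four outputs, which is where the operation savings over radix-2 come from.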

| Split-radix with DIT overview
The split-radix butterfly's main idea is twofold: to use one radix for one decimation product of a sequence and other radices for other decimation products of the sequence [12]. In practice, the split-radix FFT mixes radix-2 and radix-4 decomposition (as in Figure 1).
The DIT split-radix butterfly comprises two complex multiplications and other complex additions and subtractions (see Figure 1). The development of the complex equations (Figure 1) and the terms' grouping into real and imaginary parts lead to the expressions (2)-(9). The split-radix implementation, with eight multipliers and 16 adders/subtractors, is derived directly from the previous expression's factorization, shown in Figure 2.
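The operator count can be checked with a real-arithmetic sketch of a standard DIT split-radix butterfly. The assignment of inputs is an assumption made for illustration: `p0` and `p1` stand for the half-length DFT outputs, `p2` and `p3` for the quarter-length DFT outputs, and `w1` and `w3` for the two twiddle factors, all given as `(real, imaginary)` pairs. The exact grouping in expressions (2)-(9) and Figure 2 may differ.

```python
def split_radix_butterfly(p0, p1, p2, p3, w1, w3):
    """DIT split-radix butterfly on (re, im) tuples, in real arithmetic."""
    # Two complex multiplications: 8 real multipliers, 4 adders/subtractors
    ar = w1[0] * p2[0] - w1[1] * p2[1]
    ai = w1[0] * p2[1] + w1[1] * p2[0]
    br = w3[0] * p3[0] - w3[1] * p3[1]
    bi = w3[0] * p3[1] + w3[1] * p3[0]
    # Sum and difference of the two products: 4 adders/subtractors
    sr, si = ar + br, ai + bi
    dr, di = ar - br, ai - bi
    # Output stage: 8 adders/subtractors; -j * (dr + j*di) = di - j*dr
    x0 = (p0[0] + sr, p0[1] + si)
    x2 = (p0[0] - sr, p0[1] - si)
    x1 = (p1[0] + di, p1[1] - dr)
    x3 = (p1[0] - di, p1[1] + dr)
    return x0, x1, x2, x3
```

Counting the operators in the comments gives 8 multipliers and 4 + 4 + 8 = 16 adders/subtractors, matching the totals quoted above.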
According to the literature [14], the split-radix yields an algorithm with about one-third fewer multipliers than those of the radix-2 FFT. The split-radix FFT has lower complexity than the radix-4, or any higher radix power-of-two FFT. We intend to reduce power in the split-radix butterfly by using efficient ACs.

| Overview of ACs
The AC structures are widely used in fast and low-power multiplier architectures and are composed of exclusive-or (XOR) gates and multiplexers (MUX).
We propose to apply 5-2 AC [12] in the butterfly. Figure 3(a) shows the internal structure of the AC. Note that the 5-2 AC can be power advantageous because of its reduced critical path, composed of just four XOR gates for the simultaneous addition of five operands.
The 5-2 AC has seven inputs and four outputs, as presented in Figure 3(a). The seven inputs are the five operands A, B, C, D, E, and two input carry signals $C_{in0}$ and $C_{in1}$. The four output signals are Sum, Carry, $C_{out0}$, and $C_{out1}$, where the Carry and both $C_{out0}$ and $C_{out1}$ terms have 1-bit higher order than the Sum. We implemented the 5-2 compressor cell, as shown in Figure 3(a), at the logic gate level using six XOR and three MUX gates.
The implementation of an N-bit 5-2 AC, shown in Figure 3(b), uses N 5-2 compressor cells of Figure 3(a). Recombination of the partial Sum, Carry, $C_{out0}$, and $C_{out1}$ terms must be realized along with the cells. An extra recombination line recombines the Sum and Carry terms. This line is shown in Figure 3(b) as a cascade of full adders forming a ripple carry adder. In the VLSI synthesis results of this work, the final sum's recombination line is implemented by the (+) macro function operator selected by the synthesis tool.
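The behavior described above can be modeled in a few lines. The gate wiring below is one common arrangement with six XOR gates, three MUXes, and a four-XOR critical path; it is an assumption for illustration and may not match Figure 3(a) gate for gate. The helper names are hypothetical, and the N-bit model handles unsigned operands only.

```python
def mux(sel, a, b):
    """2:1 multiplexer: returns b when sel is 1, otherwise a."""
    return b if sel else a


def compressor_5_2_cell(a, b, c, d, e, cin0, cin1):
    """1-bit 5-2 compressor cell: returns (sum, carry, cout0, cout1)."""
    p = a ^ b        # XOR 1
    q = c ^ d        # XOR 2
    r = p ^ q        # XOR 3
    u = e ^ cin0     # XOR 4
    w = r ^ u        # XOR 5
    s = w ^ cin1     # XOR 6 (path p -> r -> w -> s crosses four XORs)
    cout0 = mux(p, a, c)     # carry of a + b + c, independent of cin0/cin1
    cout1 = mux(r, d, cin0)  # carry of (a^b^c) + d + cin0
    carry = mux(w, e, cin1)  # carry of (a^b^c^d^cin0) + e + cin1
    return s, carry, cout0, cout1


def add5(a, b, c, d, e, width):
    """Add five unsigned width-bit operands with a row of 5-2 cells."""
    sum_bits = 0
    carry_bits = 0
    cin0 = cin1 = 0
    for i in range(width):
        s, carry, cout0, cout1 = compressor_5_2_cell(
            (a >> i) & 1, (b >> i) & 1, (c >> i) & 1,
            (d >> i) & 1, (e >> i) & 1, cin0, cin1)
        sum_bits |= s << i
        carry_bits |= carry << (i + 1)  # Carry has one bit higher order
        cin0, cin1 = cout0, cout1       # couts feed the next cell
    # Carries leaving the top cell have weight 2**width
    carry_bits += (cin0 + cin1) << width
    # Final recombination line, here modeled by the "+" operator
    return sum_bits + carry_bits
```

Because `cout0` depends only on the operand bits, the carries between neighboring cells do not ripple along the whole word; only the final recombination adder propagates a carry.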

| Related work
The low-power implementations of FFT algorithms from the literature try to optimize the entire FFT architecture by using pipelining, reusing butterflies, proposing sequential and semiparallel structures, or even reordering the twiddle factors, such as in [12] and [12]. Most of the prior works are based on radix-2 or radix-4 butterflies, such as in [9][10][11][12][13]. Other hybrid architectures use radix-2 and radix-4 butterflies, such as the split-radix butterfly, simultaneously. Table 1 summarizes the related work compared to our work that explores split-radix butterfly employing ACs for the FFT VLSI design. The primary goal of the different solutions is to reduce the critical path and, consequently, improve the entire FFT architecture's performance, as seen in [14], whose split-radix structure uses the radix-4 butterfly, but in which no power results are presented. The work in [27] also proposes a split-radix FFT based on the radix-4 butterfly. However, the results are only FPGA based and with no evidence of its efficiency since it was only compared against a 1024-point radix-2 FFT. A split-radix FFT based on a radix-2 butterfly was proposed in [28]. The modified radix-2 butterfly unit exploits the multiplier-gating technique to save the dynamic power at the expense of using more hardware resources. However, there are only results for the individual components of a 1024-point FFT in ASIC.
As the main characteristic of the split-radix butterfly is to reduce the number of arithmetic operators, the works in [29] and [30] propose a multiplierless architecture of the split-radix FFT algorithm using new distributed arithmetic. The work in [29] presents power results in both FPGA and ASIC implementations. However, its authors present no explanation about the methodology used for power estimation nor about the type of arithmetic operators used. On the other hand, the work in [30] uses parallel prefix adders in the split-radix butterfly, but its results are only for FPGA implementation and no power dissipation data is presented. The work in [31] also explores the use of distributed arithmetic (DA) for a 256-point split-radix FFT. The approach in [31] incorporates a method to overcome the result overflow problem introduced by the DA method.

FIGURE 2 Split-radix butterfly hardware architecture in [14]
FIGURE 3 5-2 AC for (a) 1-bit (b) N-bits [12]
FERREIRA ET AL.
The DA strategy has the advantage of substituting the multiplications in the butterfly. However, its bit-serial characteristic can negatively affect performance since the total number of cycles required to complete the multiplication is proportional to the number of input bits [31]. Therefore, other strategies have tried to make the split-radix FFT more efficient, such as in [32], proposing an FPGA-based shared-memory architecture. The work in [33] also presents a shared-memory low-power split-radix FFT processor architecture, which is computed by using a modified radix-2 butterfly unit. Although both works claim that the solutions are low power, there are no comparison results to prove this, and only FPGA results are available. The work presented in [34] also proposes a shared-memory-based split-radix FFT processor. Although [34] concludes that the proposed method reduces the dynamic power consumption at the expense of more hardware circuit area, there are no power dissipation figures in [34] to sustain it. The same holds for the work in [35], which proposes a low-power 128-point split-radix FFT but also lacks an investigation of the power dissipation.
The work in [36] promises the implementation of a low-power split-radix FFT processor. However, there is no evidence that the proposed approach is a low-power solution. Furthermore, there are only results in FPGA. An ASIC implementation of a 32-point split-radix FFT is presented in [37]. Although the authors claim that the proposed method has advantages in terms of chip area, short execution time, and power consumption, there are no low-power strategies applied nor comparisons to prove it.
A different scheme for split-radix FFT was presented in [38][39][40][41], where DIT radix-3 and radix-6 FFT can calculate a 12-point DFT. However, the results are only in FPGA with no power dissipation results. A split-radix 4/8 FFT algorithm is proposed in [42], but with no power dissipation savings over the prior literature. Furthermore, the authors present no comments about the methodology for a realistic power extraction report.
Here we extend our prior work in [43], investigating the area and power efficiency of our new split-radix butterfly proposal in the whole FFT architecture. In contrast to the related work, we also show how the FFT architectures behave in a real-world scenario by submitting them to a thorough power-estimation methodology, as Section 4.1 will address. It is noticeable in Table 1 that only our previous work in [43] used a detailed synthesis flow to obtain results as we present here. Finally, none of the related work optimizes the FFT architectures by employing efficient ACs within the split-radix butterfly, as we propose here. The work in [44] suggests an FFT using modified 4-2 and 7-2 ACs while omitting the entire FFT implementation. Moreover, no power dissipation investigation is made in that article to sustain the claimed efficiency of employing ACs. In summary, in contrast to the literature solutions presented in Table 1, our work brings an FFT implementation with an optimized split-radix butterfly using efficient 5-2 ACs.

| FFT HARDWARE ARCHITECTURE PROPOSAL
This section shows the split-radix butterfly employing 5-2 AC and indicates the FFT architecture with the optimized split-radix butterfly.

| Split-radix butterfly employing ACs
The split-radix butterfly has a well-balanced structure, as shown in Figure 2. This aspect can contribute to glitching reduction, that is, reducing unnecessary switching activity, thus favorably impacting dynamic power reduction. Therefore, the split-radix structure is a prime candidate for a power-efficient implementation employing ACs.

TABLE 1 Related works on FFT VLSI design (related work about split-radix FFT)
By a careful analysis of the expressions in (2), it is noticeable that the final values of the $X_N$ terms are composed of the sum of five terms, making the 5-2 AC a suitable option for this implementation. The five terms to be used as inputs for the 5-2 AC are the first rows of multipliers that compute the $W_x \cdot P_y$ terms. Figure 4 presents the architecture proposed in [41] for the split-radix using 5-2 AC.
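To make the five-term structure concrete, the block below expands a standard DIT split-radix butterfly into real and imaginary parts. It is a sketch under assumed notation, not a reproduction of the paper's numbered expressions: $P_0$ and $P_1$ are taken as the half-length DFT outputs, $P_2$ and $P_3$ as the quarter-length DFT outputs, $W_1$ and $W_3$ as the two twiddle factors, and the superscripts $r$ and $i$ mark real and imaginary parts.

```latex
\begin{aligned}
X_0^{r} &= P_0^{r} + W_1^{r}P_2^{r} - W_1^{i}P_2^{i} + W_3^{r}P_3^{r} - W_3^{i}P_3^{i} \\
X_0^{i} &= P_0^{i} + W_1^{r}P_2^{i} + W_1^{i}P_2^{r} + W_3^{r}P_3^{i} + W_3^{i}P_3^{r} \\
X_1^{r} &= P_1^{r} + W_1^{r}P_2^{i} + W_1^{i}P_2^{r} - W_3^{r}P_3^{i} - W_3^{i}P_3^{r} \\
X_1^{i} &= P_1^{i} - W_1^{r}P_2^{r} + W_1^{i}P_2^{i} + W_3^{r}P_3^{r} - W_3^{i}P_3^{i}
\end{aligned}
```

The remaining outputs $X_2$ and $X_3$ follow from $X_0$ and $X_1$ by flipping the sign of the four product terms. Under this formulation each of the eight real outputs is one data operand plus four signed products, that is, five terms per output.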

| Employing the optimized split-radix butterfly within the FFT hardware architecture
Our proposed 16-point FFT implementation uses a split-radix butterfly, Figure 5-a, with DIT in its baseline structure. To draw a proper comparison between the FFTs, we also implemented an FFT using the radix-4 butterfly.
Implementing a fully parallel FFT for a large number of points is impractical. The current literature proposes several topologies to address this problem. Although there are several ways of implementing massive FFT architectures, most of the topologies use butterflies as their primary operator in a sequential structure.
As previously mentioned, the DIT algorithm is the basis for this architecture. First, to use this algorithm, we have to reorder the inputs into bit-reversed index order. Second, we separate the elements into groups. We split the first group according to the least significant bit. The same process applies to the second group, which we divide according to the second least significant bit, and so on.
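The reordering step can be sketched as follows for a power-of-two number of points. The function names are hypothetical, and the sketch only illustrates the index permutation, not how the hardware addresses its registers.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Input ordering for an n-point DIT FFT (n a power of two)."""
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]


if __name__ == "__main__":
    print(bit_reversed_order(16))
```

Splitting by the least significant bit first is what places all even-indexed samples in the first half of the reordered array.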
The developed architecture's data path is shared across split-radix and radix-4 butterfly types, differing only in the control signals (omitted in Figure 5 for simplification). The control path is composed of one finite state machine (FSM) for the split-radix and one for the radix-4 butterflies. The main difference is the number of clock cycles required to compute the FFT correctly in each butterfly design. The FSM is also responsible for enabling the correct MUX groups, register sets, and activating the ROM module's coefficients. As the twiddle factors depend only on the FFT size, we can find the real and imaginary parts of these twiddle factors earlier and store them in a ROM memory.
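Because the twiddle factors depend only on N, their values can be generated offline. The sketch below assumes a 16-bit signed fixed-point format with 15 fractional bits and saturation at the largest positive code; both the format and the function name are assumptions for illustration, and a real ROM would typically store only the factors the control FSM actually addresses.

```python
import math


def twiddle_rom(n=16, frac_bits=15):
    """Return [(re, im)] fixed-point twiddle factors W_n^k, k = 0..n-1."""
    scale = 1 << frac_bits
    max_val = scale - 1  # largest positive value in the signed format
    rom = []
    for k in range(n):
        angle = -2.0 * math.pi * k / n
        re = min(max_val, round(math.cos(angle) * scale))
        im = min(max_val, round(math.sin(angle) * scale))
        rom.append((re, im))
    return rom
```

The returned list can be dumped to a memory initialization file and indexed by the FSM during each stage.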
In the proposed sequential architecture, we use only one butterfly for processing the complex terms of the transform. A set of registers reads the real (Figure 5-b) and imaginary (Figure 5-c) parts of the data. These same sets of controlled registers store the results obtained from the computation of each stage. The proposed architecture also uses four groups of MUX circuits targeting better control of the system. While the second line of the MUX group (with A to D outputs) feeds the butterfly with data from the sets of registers at different periods, the last line of the MUX group (with $P_0$ to $P_3$ outputs) is responsible for organizing that data in the correct input ordering for the butterfly. Another group of these MUX circuits, the first line of MUX, receives the data from the TRUNC modules and orders them before storing them in the register sets.
The TRUNC modules, connected to the butterfly outputs, are composed mainly of buffers. As the design is a 16-bit architecture, these blocks truncate their inputs to the most significant 16 bits of the butterfly calculation results. These 16 bits represent the most significant part of the fractional value between -1 and 1. Finally, Figure 5-d shows the internal structure of the controlled-input set of registers, which allows greater control over the register group by selecting either previously calculated results or the new FFT inputs every 15 clock cycles.
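The truncation can be modeled as a simple bit selection. The sketch below assumes a signed two's-complement value of known width and keeps its 16 most significant bits; the function name and the example widths are assumptions, since the internal word length of the butterfly is not stated here.

```python
def trunc16(value, in_bits):
    """Keep the 16 most significant bits of a signed in_bits-wide value.

    Python's >> on integers is an arithmetic shift, so the sign is kept
    and the discarded low-order bits are simply dropped (no rounding).
    """
    return value >> (in_bits - 16)


if __name__ == "__main__":
    print(hex(trunc16(0x12345678, 32)))
```

Dropping the low-order bits without rounding is consistent with a block built mainly from buffers, since no extra adder is needed.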

| RESULTS AND DISCUSSIONS
This section presents the results of our work. We show the synthesis results for both the investigated butterflies and their use in the FFTs. The section begins with the technical details of the power extraction methodology purposefully used to obtain a realistic estimation. Our method is carefully described to assure the reproducibility of the results herein presented.

| Power extraction methodology flow
A realistic power estimation methodology is used here, which considers the signal transitions and the circuit's node capacitance loads estimated by a pre-placement of the standard cells. Hence, our power estimation is layout aware, as it uses the information on gates and interconnections delays as inputs for the synthesis tool [45].

FIGURE 4 Our butterfly architecture proposal: Combining 5-2 AC within the split-radix [5]
1. Standard delay format (SDF) generation occurs after the first synthesis. This format enables the precise estimation of temporal glitches since it considers the delay for all transitions. The SDF is highly recommended for a realistic power dissipation estimation. The results presented here show the difference between power results with and without considering the delay files.
2. Physically aware layout estimation (PLE) mode estimates the length of all nets and considers the load capacitance effects in the power dissipation, finding a relatively pessimistic layout estimation. The library exchange format (LEF) files for all cells are used in the estimate. These files contain the library's physical layout information. The LEF macro consists of the internal logic cell capacitance, and the tech LEF includes the process metal capacitance for the interconnection capacitance estimation [46]. Another file with the same information as LEF is named CapTable and contains the routing capacitance table. The CapTable file is more precise than LEF, however, since it considers the process variations [46].

Finally, the standard cell library files (in Liberty format) are employed for power dissipation estimation. These files contain the information of the power dissipation for each slew rate and load capacitance. The results given here are from industrial-strength cell libraries from the ST 65 nm process [47].
To provide a more precise power estimate, we generate the stimulus file from the gate-level netlist simulation, running a testbench with real signal inputs. The simulation runs on the gate netlist, that is, the Verilog generated after the first logic synthesis [47]. This method guarantees precise area and power values, since the simulation model is the same as the final circuit. The stimulus file formats supported by the commercial tool are the value change dump, the toggle count format, and the switching activity interchange format.
A new split-radix butterfly using 5-2 AC is herein proposed and it is compared with other solutions from the literature in the context of the full FFT VLSI design. The circuits were described in Verilog HDL (Hardware Description Language) and synthesized using the Cadence Genus synthesis tool [48]. The circuits were mapped onto a 65-nm CMOS ST standard cell library [49] using 1.0 V voltage supply and at the 125°C temperature corner. For fair power comparisons, a 10 MHz operating clock frequency was set for all butterfly circuits compared in Table 2, while considering 100k random input test vectors and real logic circuit delays. We present a comparison regarding gate count (relative to 2-input NAND gates), total cell area, critical path delay, and power dissipation initially for the butterflies only in Table 2.

| Butterflies' synthesis results comparison
The single butterflies were individually synthesized to investigate our proposal's efficiency compared with others previously published. The following butterflies were designed, synthesized, and evaluated at 65 nm, according to the method previously described:
1. Radix-4 butterfly in [10]
2. Radix-4 butterfly in [14]
3. Original split-radix butterfly in [34] (without ACs)
4. Our proposal: split-radix butterfly with 5-2 AC

Table 2 shows circuit area and critical path delay synthesis results for these four butterflies. The radix-4 architecture is composed of 12 multipliers and 22 adder/subtractors [8], which results in a total area of 44078.84 μm². On the other hand, the optimized version of the radix-4 decreases the number of multipliers and adders, enabling a reduction of 7.25% in its total area. As the split-radix butterfly has only 8 multipliers and 16 adders/subtractors, it results in a total area of 30657.64 μm², which reveals a reduction of 30.45%, compared with the original radix-4 butterfly. Note that the impressive decrease in the arithmetic operators has allowed the split-radix butterfly to also present an area reduction of around 25% compared with the optimized radix-4 butterfly.
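The quoted percentages can be cross-checked from the two reported areas. In the sketch below, the optimized radix-4 area is not a reported figure: it is derived from the stated 7.25% reduction, so the last result is only as precise as that rounding. The variable and function names are hypothetical.

```python
def reduction_percent(reference, value):
    """Relative reduction of `value` with respect to `reference`, in %."""
    return 100.0 * (reference - value) / reference


radix4_area = 44078.84       # um^2, original radix-4 butterfly (reported)
split_radix_area = 30657.64  # um^2, original split-radix butterfly (reported)
# Derived, not reported: area implied by the stated 7.25 % reduction
radix4_opt_area = radix4_area * (1 - 0.0725)

vs_original = reduction_percent(radix4_area, split_radix_area)
vs_optimized = reduction_percent(radix4_opt_area, split_radix_area)

if __name__ == "__main__":
    print(round(vs_original, 2), round(vs_optimized, 2))
```

The first value reproduces the 30.45% figure, and the second lands at about 25%, in line with the comparison against the optimized radix-4 butterfly.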
The use of ACs has been advantageous for the split-radix butterfly, with an impressive reduction in the area, as can be seen in Table 2. This reduction occurs because the ACs have a simplified structure based on XOR and MUX gates only, with a slight delay penalty, as shown in Figure 3. We could also substitute the 16 adders/subtractors in the original split-radix butterfly by eight 5-2 ACs in the optimized one.
For a realistic power dissipation evaluation, we have included results considering the SDF files, as shown in Table 3.
The SDF files are essential to enable a precise estimation of the temporal glitches, which is crucial for the dynamic power dissipation.
To the best of our knowledge, works in the current literature lack a thorough power analysis of the butterflies or even the entire FFT, considering a realistic power extraction. A meaningful estimation is necessary, at least based on vector-driven timed simulations capable of capturing logic glitches, or on realistic layout estimation, all of which affect the power dissipation.
To show the importance of presenting results with realistic power extraction, we offer in Table 3 power results with and without the SDF file. Note that, unlike the original radix-4 and the split-radix architectures, the optimized radix-4 butterfly is an irregular structure, which makes it prone to higher glitching activity [8]. When synthesized without the SDF files, the optimized radix-4 saves 5.36% in power dissipation. However, when the synthesis with the SDF files is taken into account, its dynamic power increases more than four times, which indicates a massive influence of glitching activity in the entire architecture.
Although the proposed split-radix butterfly with 5-2 AC has a slightly longer critical path, it presents a substantial power reduction of more than 30% compared with the original split-radix. Compared with the original radix-4 butterfly, the proposed butterfly architecture shows an even more significant improvement in power, with savings of 55.97% and up to 47.82%, without and with SDF analysis, respectively. This is due to our AC optimized design, which uses a reduced number of transistors with transmission gate MUXes or pass-transistor logic [50].

| Synthesis results for the FFT comparison
Results for the entire 16-point FFT architectures, with the butterflies designed in the previous section, are presented next. The architectures were synthesized for a maximum clock frequency of 150 MHz. As the butterfly results made clear the importance of using a realistic power extraction method, we present power results using a precise analysis with the SDF file. The FFT architecture with the split-radix butterfly is more efficient than the one with the radix-4 butterfly in both circuit area and power dissipation. When considering the use of 5-2 AC, the results are further improved, making it the most power-efficient combination among all structures. This occurs due to the combination of the low-power aspect of the efficient AC and the reduced number of arithmetic operators in the split-radix butterfly structure.
Reducing two multipliers leads to the smaller area presented by the optimized split-radix butterfly, compared with the original one, even though the optimized structure shows two more adders/subtractors. It has an impact on reducing the area of the FFT architecture, as can be compared in Table 4. It is noticeable that the area saving is smaller in the full FFT architecture with our optimized split-radix butterfly (Table 4) compared with the single butterfly (Table 2). This occurs because, in the FFT implementation, other blocks contribute to the power dissipation (registers, MUX, ROM, control state machine). Mainly, the ROM and the state machines differ in the FFT implementations with either split-radix or radix-4 butterflies.
As the implemented FFT is a fully sequential structure, with the processing of one butterfly per clock cycle, the use of an optimized butterfly has a considerable contribution to the power reduction. In this aspect, using a split-radix butterfly has been advantageous for power reduction, as shown in Table 4. This reduction is more significant when the FFT uses the optimized split-radix butterfly. As it has a longer critical path, the optimized butterfly presents a slightly higher delay than the original structure. Despite the critical delay disadvantage, the optimized split-radix butterfly offers gains in terms of dynamic power, which leads to the reduced total power presented by its design. The use of ACs whose optimized internal structure relies on XOR and MUX gates only contributes to the power dissipation reduction.

| Discussions and comparisons with the literature
Comparing our proposed FFT directly with others from the literature is not easy since, to the best of our knowledge, there are no works that present VLSI power results only for the split-radix butterfly. Therefore, we synthesized prior work architectures (as shown in the top 3 lines of Table 4) with the same 65 nm VLSI cell library, and estimated area and power for four of them with the same method. Table 4 summarizes area and power results from prior published entire FFT architectures (not just the butterflies) to enable comparing them with our proposed FFT architecture.
According to the results in Table 4, our FFT with the split-radix butterfly is more area and power efficient than the solutions with the radix-4 butterfly. Compared with the FFT with the split-radix butterfly from the literature [34], our design is more area and power efficient for the same size of 16 points. This occurs because our split-radix butterfly uses efficient 5-2 ACs. Some other works have presented power results for the entire FFT with a split-radix butterfly, but some are FPGA-based reports, such as [27][28][29]. The work in [6] presents a reduced power value, but it only shows full results for the entire FFT in FPGA. It is known that FPGAs are not the most appropriate technology for a realistic and precise low-power extraction method. Regarding ASIC-based FFT, there are solutions in the literature that present power reports, such as [28], which proposes a 1024-point FFT with a split-radix butterfly based on a radix-2 one. In that scheme, clock-gating registers are placed in the multiplier path to prevent unnecessary switching activity, and a few registers are placed at the address port of the memory banks to synchronize the whole design. The method therefore reduces the dynamic power consumption at the expense of more hardware resources. Here, the switching activity is reduced with no additional hardware resources, since the ACs present small switching activity stemming from the reduced number of transistors in the XOR and MUX gates. Therefore, our FFT hardware architecture proposal presents a reduced power dissipation compared with the other ASIC solutions in the literature, as well as a competitive maximum clock frequency of 150 MHz.
TABLE 4 Circuit area and power dissipation comparison between our split-radix FFT proposal and prior state-of-the-art
The literature presents different split-radix butterflies, such as in [42], which proposes a split-radix butterfly based on radix-4 and radix-8 butterflies. That work offers a tool that can generate the VHDL descriptions of the required N-point fully parallel split-radix-4/8 FFT design, which can be synthesized on an ASIC and/or FPGA platform. However, the developed architectures target high-throughput and low-latency demands, without the savings in power dissipation that are possible with other butterflies. This solution, synthesized for the Nangate 45 nm open-source process library and operating at 200 MHz, presents a high power dissipation.
Our work shows the methodology employed for a realistic power dissipation extraction for either the butterflies alone or the entire FFT. This methodology improves the reproducibility of the results and their comparability with prior architectures from other researchers. We use the more precise, placement-aware SDF file to include real delays, thus precisely accounting for the glitching contributions to the power results. We showed the power dissipation reduction obtained when employing efficient ACs in the split-radix butterfly within the FFT.

| CONCLUSION
We presented a low-power FFT hardware architecture combining 5-2 ACs within a split-radix butterfly. We applied a realistic methodology for the power dissipation estimation, including the circuit glitching activity. The results showed that the FFT employing the proposed split-radix butterfly architecture dissipates considerably less power than the FFT with the original and optimized radix-4 butterflies. Moreover, combining the split-radix butterfly with 5-2 ACs increased the power savings even further in both the split-radix butterflies and the entire FFT architecture, reaching up to 47.28% and 43.10% savings, respectively. Therefore, our VLSI FFT architecture proposal contributes to maximizing the power efficiency of smart sensors for IoT that commonly employ the FFT to perform the spectral analysis of signals.