Static power model for CMOS and FPGA circuits

Natural Sciences and Engineering Research Council (NSERC)

Abstract

In Ultra-Low-Power (ULP) applications, power consumption is a key parameter for process-independent architectural-level design decisions. Traditionally, time-consuming SPICE simulations are used to measure static power consumption. Herein, a technology-independent static power estimation model is presented, which can estimate static power with reasonable accuracy in much less time. It is shown that active area alone is not a good indicator of static power consumption; hence, the model considers the effects of transistor sizing, transistor stacking, gate boosting and voltage change. The procedure to apply this model to processors and FPGAs is demonstrated. Across different process technologies, compared to traditional SPICE simulation, this model estimates the static power consumption of a processor with an error of 1%–4% and that of an FPGA system with an error of 1%–15%.


| INTRODUCTION
In recent years, the demand for Ultra-Low-Power (ULP) applications has grown significantly. In ULP design, power consumption is one of the most important parameters, and the availability of power dissipation data is an absolute necessity for architectural-level design decisions [1]. The performance of different systems can be compared using either simulation tools or estimation models. In the literature, the Elmore delay model [2] is available for delay estimation; similarly, models such as the VPR area model and the COFFE area model [3] are available for area estimation. Dynamic power can be estimated based on the capacitance of the system, but to the best of our knowledge, there is no reliable process-independent static power model for CMOS or pass-transistor devices. It is commonly thought that static power can be estimated based on the active area, but in this work we show that this is not true in all cases. In the FPGA research community, a common practice for static power is to use SPICE simulation [4][5][6]. SPICE simulations are not only time-consuming, but new simulation results also have to be generated for each process technology. In this work, we propose a process-independent static power estimation model that can reliably predict the static power consumption of heterogeneous systems, to assist the process-independent architectural-level investigation of design trade-offs between fixed-logic CPUs and reconfigurable-logic FPGAs.
To meet performance requirements, designers have to decide between heterogeneous systems such as ASICs, FPGAs or CPUs. For example, it is well known that FPGAs have lower latency and are more energy-efficient than general-purpose processors [7]. Currently, however, most power-limited applications, such as wireless sensor networks and implantable devices, prefer custom ASICs [8] or low-power micro-controllers [9]. In recent years, FPGA manufacturers have offered some low-power devices such as the iCE FPGA and Actel Igloo; these low-power FPGAs have gained the attention of designers, and a few published results use them for ULP applications [10]. Now, in a power-limited scenario, if a designer wants to replace a processor with a custom-designed low-power FPGA fabric, the first question that needs to be answered is how many FPGA tiles can be turned on within the available power budget, and then whether that number of tiles is enough for the targeted application.
Similarly, power dissipation also influences the choice among homogeneous devices. For example, FPGAs generally use an NMOS pass transistor switch, but it was shown in [5] that for low-voltage and low-power applications, Transmission Gate (TG)-based FPGAs can achieve a 26% better power delay product (PDP) compared to NMOS pass transistor-based FPGAs.
In this work, we investigate procedures to estimate the static power consumption of both processors and FPGAs in a process-independent manner. Moreover, for FPGAs we demonstrate the procedure for both NMOS-based and TG-based FPGAs. To verify the validity of our model, we measured the static power consumption of the DLX processor and of an FPGA system implementing different benchmark circuits, and then compared our model with the measured values. For simulations we used Predictive Technology Models (PTM) [11] for 130, 90, 65, 45 and 32 nm. Table 1 tabulates the physical parameters used for each process technology. Default values of the physical parameters are used; refer to [12] for more explanation. The DLX processor was selected because of its simple architecture and easily available open-source implementation.
Herein, Section 2 briefly reviews the accuracy of currently used performance estimation models for FPGAs. Section 3 explains the procedure to measure the static power consumption of the DLX processor and how our static power model can be used for CMOS circuits. Section 4 explains the methodology adopted to measure the static power consumption of the FPGA Logic Element, and also how to apply our static power model to pass transistor circuits. In Section 5, the use of our static power model is shown with a case study of replacing a CPU with an FPGA. In Section 6, we analyse the accuracy of the static power model by implementing different benchmark circuits on an FPGA system. Section 7 concludes the paper.

| ACCURACY OF PERFORMANCE ESTIMATION MODELS FOR FPGAS
An FPGA device consists of billions of transistors. It is not practical to analyse the complete FPGA device using SPICE-based simulation alone; therefore, higher-level Computer-Aided Design (CAD) tools are required for performance analysis [13]. CAD tools provided by the vendors can accurately and quickly estimate the performance of an application on an existing FPGA, but they cannot be used to study novel FPGA architectures [14]. Verilog-to-Routing (VTR) [15] (previously known as Versatile Packing, Placement and Routing [VPR]) is one of the most widely used open-source FPGA CAD tools and offers a variety of options to explore new FPGA architectures. VTR uses a combination of SPICE simulations and mathematical estimation models to compute the area, delay and power of any circuit implemented on the targeted FPGA [16].
In VTR, the minimum width transistor area model is used to estimate the area of the FPGA device. However, it was shown by Ref. [17] that the error percentage of the VPR area model for some components is even greater than 50%. Table 2 shows the error percentage of the VPR area model for different components. Similarly, VPR uses the Elmore delay model to estimate the delay across different components in the FPGA device and the error percentage of Elmore delay is also considerably high ranging from 0.7% to 24%, as shown in Table 3.
VTR provides architects with useful information to evaluate their designs, therefore despite having this high error percentage VTR is still one of the most widely used CAD tools in academia for FPGA research. Accuracy of models is important for the simulations and final stage of the design, but it's also desirable to have simple mathematical models to facilitate high-level decision between alternate designs. At an early stage of the design, a useful level of abstraction for application is more important than the absolute accuracy of these models [1]. In this work, we are investigating if the static power consumption of FPGAs and CMOS circuits can be estimated using mathematical models with reasonable accuracy.
Both the delay and area estimation models used in VTR are process independent, but to estimate power VTR requires a process technology file. The power consumption of each FPGA subcircuit is measured using SPICE simulation, and VTR then estimates the total power of the circuit based on the utilization of each subcircuit [16]. Development of a new FPGA is an iterative process that involves transistor-level modifications (resizing) to tune the area, delay and power [3]. This means that each time the transistor size is changed in a subcircuit, a new SPICE simulation is required. The availability of a mathematical power estimation model can help speed up this iteration process; furthermore, it allows FPGA architects to evaluate their designs independently of process technology. Although it is outside the scope of this work, integrating a power estimation model into VTR could offer users the following three modes:
• A quick process-independent power estimation.
• An initial measurement of power using accurate SPICE simulations, followed by use of a mathematical model during the iterative performance tuning process.
• SPICE simulations only, for more accurate results (currently available feature).

| STATIC POWER MEASUREMENT OF DLX PROCESSOR
Many open-source VHDL implementations of the DLX processor are available; in this work we use the implementation of [18]. We measured the static power consumption of the DLX processor across different process technologies by adopting the following procedure:
• We synthesised the VHDL implementation of [18] using Synopsys Design Vision (we are not disclosing the process technology here because of an NDA).
• A detailed cell report was generated to find out the type and quantity of standard cells required to implement the DLX processor.
• We then used PTM to design the cells reported by Synopsys. We designed cells in the PTM 130, 90, 65, 45 and 32 nm process technologies.
• Using SPICE simulations, the static power consumption of individual cells was measured (the average static power over all possible input combinations).
• Once the static power consumption and quantity of each cell were known, we found the total static power consumption of the DLX processor by adding up the static power consumption of the individual cells.
Table 4 shows the static power consumption of the DLX processor, along with some of its building blocks, across different process technologies. It is usually expected that the power consumption of a device reduces as the process technology shrinks, but according to [19] static power dissipation per device increases as the process technology shrinks.
It is also important to mention here that in smaller process technologies devices operate at a much faster rate; as a result, it is the total energy consumption per device that reduces with process shrinking, not necessarily the power consumption [19].
Interestingly, however, Table 4 shows that the static power consumption of the DLX processor remains almost the same across different process technologies. To explain why this happens, we need to look at the subthreshold leakage current (the cause of static power), which is given by Equation (1), where $\mu_0$ is the zero-bias mobility.
From Equation (1) it is clear that:
• $I_{sub}$ increases with the reduction of $V_{TH}$.
• $I_{sub}$ increases with the increase in $C_{OX}$.
• $I_{sub}$ decreases with the reduction of the supply voltage $V_{DD}$ (because $V_G$ and $V_{DS}$ will decrease).
As the technology shrinks, $V_{TH}$ decreases and $C_{OX}$ increases (because the oxide thickness decreases), which causes an increase in $I_{sub}$. On the other hand, as the technology shrinks it is necessary to decrease the supply voltage to suppress the power consumption. Thus two parameters ($V_{TH}$ and $C_{OX}$) cause an increase in $I_{sub}$ while the third parameter, $V_{DD}$, causes a decrease in static power consumption; therefore, we do not observe much change in static power consumption from one process technology to another.
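The summation procedure described above can be sketched in a few lines of Python. This is a minimal illustration only: the cell names, cell counts and per-state leakage values below are hypothetical placeholders, not values from the cell report or from Table 4.

```python
from statistics import mean


def average_cell_power(power_per_input_state):
    """Average static power of one cell over all of its input combinations."""
    return mean(power_per_input_state)


def total_static_power(cell_counts, cell_state_powers):
    """Sum (cell count x average cell static power) over every cell type."""
    return sum(
        count * average_cell_power(cell_state_powers[name])
        for name, count in cell_counts.items()
    )


# Hypothetical cell report and per-state leakage values in watts (illustration only).
cell_counts = {"INV": 100, "NAND2": 50}
cell_state_powers = {
    "INV": [1.0e-9, 3.0e-9],                    # input = 0, 1
    "NAND2": [1.0e-9, 2.0e-9, 2.0e-9, 3.0e-9],  # inputs = 00, 01, 10, 11
}

dlx_total = total_static_power(cell_counts, cell_state_powers)
print(f"Total static power: {dlx_total:.3e} W")
```

In practice the per-state values would come from SPICE runs for each process technology, while the cell counts stay fixed by the synthesis report.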
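For reference, one widely used textbook form of the subthreshold leakage current that is consistent with the three trends discussed above is sketched below. The symbols $m$ (subthreshold swing coefficient), $v_T$ (thermal voltage) and $W/L$ (transistor aspect ratio) are assumptions of this sketch, and the notation may differ from that of Equation (1).

```latex
I_{sub} = \mu_0 \, C_{OX} \, \frac{W}{L} \, (m - 1) \, v_T^{2} \;
          e^{\frac{V_G - V_{TH}}{m \, v_T}}
          \left( 1 - e^{-\frac{V_{DS}}{v_T}} \right)
```

In this form, lowering $V_{TH}$ or raising $C_{OX}$ increases $I_{sub}$, while lowering $V_G$ and $V_{DS}$ (through a lower $V_{DD}$) decreases it.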

| Active area model to predict static power
The static power consumption of CMOS circuits is usually measured using SPICE simulations, which is not only a time-consuming process but also requires transistor models for each process technology. In this part of our work, we propose a static power estimation method that is independent of process technology and can be used by designers to estimate static power consumption at an early stage of design. It is generally thought that active area is a good predictor of static power consumption, but this is not true in most cases. Consider the example of transistors connected in series: their power does not increase proportionally to the active area because of the stacking effect [20]. Similarly, for N transistors connected in parallel, if all the transistors are in the OFF state then their static power is proportional to area, but if even one of them is in the ON state then the total static power consumption will be almost zero ($V_{DS}$ will be 0). So, on average, active area is not a good indicator of static power consumption for parallel transistors either. To strengthen this point, we use an active area model to predict the static power of different CMOS logic gates and see how far off it is. We use the following steps to estimate static power from the active area:
1. Use the COFFE [3] area model to estimate the active NMOS area ($A_N$) and PMOS area ($A_P$) in the circuit.
2. Measure the static power consumption of the NMOS ($P_N$) and PMOS ($P_P$) in the inverter configuration (the PMOS is sized according to the PN ratio). Table 5 tabulates $P_N$, $P_P$ and the PN ratios for the process technologies used in our work.
3. Use Equation (2) to estimate the static power from $A_N$, $A_P$, $P_N$ and $P_P$.
The static power prediction using Equation (2) for different logic gates and the DLX processor is shown in Table 6.
The model works well for smaller gates (one to two inputs), but as mentioned earlier it does not account for the stacking effect or the parallel circuit effect, so it overestimates the static power for logic gates with three or more inputs. Moreover, active area increases with sizing (width) much more slowly than leakage current does (see Equation (1)), so a static power model based on active area underestimates the effect of sizing. Interestingly, since it underestimates for some logic gates and overestimates for others, this simple static power model has only 1%-15% error for the DLX processor across different process technologies.

| Our process independent static power model
This model accounts for the effect of transistors in series or parallel. In this model, we treat a transistor as a resistor whose resistance is very high in the OFF state and very low (almost 0 Ω) in the ON state. We start explaining our model with the example of the inverter shown in Figure 1. An inverter has two possible states; we calculate the static power consumption for each state (Figures 1b and 1c) and then take the average over all the states (Figure 1a). Applying the same concept, we can estimate the static power consumption of any logic gate. For illustration purposes, another example, a 2-input NAND gate, is shown in Figure 2. At this point we are ignoring many factors, such as gate leakage and the effect of $V_{DS}$ on the transistor's resistance, but ignoring them makes the problem very simple to solve, and at a later stage we add a correction factor. The average static power of the NAND gate follows from Figure 2 by averaging over its four input states.
Most basic logic gates are a combination of series and parallel transistors. If one of the pull-up or pull-down networks consists of series-connected transistors, then the other has parallel-connected transistors, and vice versa (this is not true for complex gates such as XOR or XNOR, for which the detailed analysis explained in Figure 2 has to be followed). Therefore, rather than deriving equations for individual logic gates, we devise generic equations for N transistors connected in series or parallel. Applying the procedure explained in Figures 1 and 2, the static power consumption for 2, 3 and 4 NMOS transistors connected in series and in parallel is shown in Table 7. The derivation for PMOS-based circuits follows the same procedure, and the average static power for PMOS circuits has similar equations except that $P_N$ is replaced with $P_P$.
Based on Table 7, the closest generalized equations for static power as a function of N can be written as Equation (5) for the parallel configuration and Equation (6) for the series configuration. Here, depending on the configuration, $P_{P/N}$ is the power of either the PMOS or the NMOS used in the unit inverter. W in the ...
To verify the equations, we simulated both series-connected and parallel-connected transistors; the results are shown in Figure 3. The accuracy of Equation (5) for the PMOS parallel circuit can be seen in Figure 3a, where the equation fits the measured values exactly across all process technologies. For the other three configurations, a slight variation across different process technologies was observed. These variations can be attributed to the factors that we ignored in the initial stage of our model. We did not consider the effect of gate leakage, and since gate leakage in NMOS is much higher than in PMOS [21], more variation is observed in NMOS. Simulation results also showed that the actual reduction in static power due to the stacking effect was much greater than what we estimated in Equation (6) [22]. Based on the simulation results, we add a correction factor that gives the minimum average error across all process technologies.
The green dotted line in Figure 3 is the graph of the corrected static power estimation equations for the parallel and series configurations, Equations (7)-(10). Table 8 shows the power estimation of different logic gates and the DLX processor using our static power model. The error for larger gates (4 inputs) is still in the range of 2%-31%, but a significant improvement is achieved compared to the area-based model presented in Table 6.
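The state-averaging idea described above can be illustrated with a short Python sketch. It implements only the idealized resistor view (an OFF transistor is a large resistor, an ON transistor is a short) before any correction factor, so its outputs are an assumption-laden illustration and are not claimed to reproduce the entries of Table 7. The function names are hypothetical.

```python
from math import comb


def avg_series(n, p_unit):
    """Average leakage of n series transistors over all 2**n gate states.

    With k >= 1 transistors OFF, the stack behaves like k OFF resistors in
    series and leaks p_unit / k. With all n ON, the stack conducts and
    contributes no leakage of its own.
    """
    total = sum(comb(n, k) * p_unit / k for k in range(1, n + 1))
    return total / 2 ** n


def avg_parallel(n, p_unit):
    """Average leakage of n parallel transistors over all 2**n gate states.

    Only the all-OFF state leaks (n OFF resistors in parallel, n * p_unit);
    any ON transistor shorts the network, so V_DS is approximately 0.
    """
    return n * p_unit / 2 ** n


def avg_nand(n, p_n, p_p):
    """n-input NAND: series NMOS pull-down plus parallel PMOS pull-up."""
    return avg_series(n, p_n) + avg_parallel(n, p_p)


# An inverter is the n = 1 case: the average of the two states, (P_N + P_P) / 2.
print(avg_nand(1, p_n=1.0, p_p=1.0))
# 2-input NAND under the same idealized assumptions.
print(avg_nand(2, p_n=1.0, p_p=1.0))
```

A NOR gate would swap the roles of the two networks (parallel NMOS, series PMOS) under the same assumptions.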

| STATIC POWER MEASUREMENT OF AN FPGA LOGIC ELEMENT
An FPGA architecture consists of an array of identical tiles. Each FPGA tile consists of a Logic Block (LB), connection blocks (CB) and a switch block (SB). Figure 4 shows the typical FPGA tile used in VPR. In this work, we use a Look-up Table (LUT) size (K) of four and a cluster size (N) of four, because a K of four or five is the most area-efficient and beyond a cluster size of four there is no significant improvement in the total area of an FPGA [23]. The approach we use to measure the static power consumption of the FPGA Logic Element (LE) is very similar to the approach we used for the processor. We first design the individual building blocks (i.e. multiplexers, buffers, SRAM, D flip-flop, etc.), then combine the building blocks to design the LE and use a SPICE simulator to measure the static power consumption. The multiplexer (MUX) is a key element in an SRAM-based FPGA architecture; it is used to implement the LUT and the routing network [20]. Multiplexers in FPGAs are implemented using either an NMOS pass transistor switch or a Transmission Gate (TG) switch. In this work, we designed both an NMOS-based FPGA tile and a TG-based FPGA tile. For both tiles the same LE architecture, shown in Figure 5, is used. The design of the individual blocks differs slightly between the two types, and we briefly highlight the main differences in the following sections.

Logic Element based on NMOS Pass Transistors (LE-NMOS):
An NMOS pass transistor-based multiplexer is not good at passing a '1' and requires gate boosting to overcome the high static power consumption of the downstream CMOS circuit [13]. As can be seen from Figure 4, the multiplexers used in the routing resources (CB, SB and local routing) are SRAM controlled (an SRAM cell is connected to the gate of each transistor). Gate boosting in SRAM-controlled multiplexers is relatively easy because the SRAM cells can be powered by a higher voltage, which results in a boosted voltage at the gate. We used two supply voltages in the LE-NMOS implementation: $V_{SRAM}$ powers the SRAM cells in the routing resources, while the rest of the circuit is powered by $V_{DD}$. SPICE simulations showed that the best PDP is achieved when $V_{SRAM}$ is approximately $V_{TH}$ higher than $V_{DD}$; hence we use this value for $V_{SRAM}$. The multiplexer used in a LUT is not SRAM controlled. Therefore, Refs. [6,20] used a modified buffer at the output of the LUT, which acts as a level-restorer (LR). The CMOS circuit following the LR receives a strong '1' and, as a result, does not consume very high static power. We also use an LR for the LUT in LE-NMOS; the circuit diagram of the LR is shown in Figure 6.
Logic Element based on Transmission Gate (LE-TG):
Another way of implementing a multiplexer is to use a Transmission Gate (TG). A transmission gate switch can provide a full rail-to-rail swing (it is equally good at passing '0' or '1'), and unless very high performance is required, TG-based FPGAs can be operated without gate boosting. Therefore, we use a single voltage source for LE-TG, and at the LUT's output a normal buffer is used instead of an LR.
Measurement results for both LE-NMOS and LE-TG are shown in Table 9. Unless otherwise stated, throughout this paper LE performance refers to the performance of a complete logic cluster. With reference to Figure 4, the delay reported in Table 9 is measured from the input of the routing multiplexer to the output buffer of the Basic Logic Element (BLE); similarly, the reported static power includes the power of four BLEs and the sixteen routing multiplexers connected at the BLE inputs.

| Static power model for FPGA blocks
We use the model described in Section 3.2 to estimate the static power of the FPGA's blocks. For CMOS circuits in an FPGA, such as inverters and buffers, we can directly apply Equations (7)-(10). In this section we explain the procedure for pass transistor-based circuits and how to deal with the gate boosting effect.

| 2x1 Multiplexer
The circuit diagrams of both the NMOS-based and the TG-based 2x1 multiplexer are shown in Figure 7. Considering the pass transistor as a three-input device, Table 10 shows the static power estimation for each state of the NMOS pass transistor (N-Pass), the NMOS-based 2x1 multiplexer ($2\times1_{NMOS}$) and the TG-based 2x1 multiplexer ($2\times1_{TG}$). The average static power of each is obtained by averaging over the states in Table 10. Simulation results showed that $P_{2\times1_{TG}}$ does not require any correction (as shown in Table 11, $P_{2\times1_{TG}}$ has an error of less than 1%), but $P_{2\times1_{NMOS}}$ requires a correction factor. $2\times1_{NMOS}$ consumes on average 0.8 times $P_N$ across different process technologies. This lower than anticipated static power consumption can be attributed to the fact that the NMOS-based multiplexer produces a weak '1', so the maximum value of $V_{DS}$ for an OFF transistor is less than $V_{DD}$, resulting in lower average static power consumption. In our model, we use the following equation for $P_{2\times1_{NMOS}}$:

$$P_{2\times1_{NMOS}} = 0.8\,P_N \qquad (14)$$

| Multi-Tree Multiplexer
The LUT of an FPGA is implemented using an encoded multiplexer, while the routing network is implemented using a 2-level multiplexer [3,5,17]. It can be observed from Figure 8 that implementing an N-input encoded multiplexer requires $N-1$ 2x1 multiplexers, while implementing an N-input 2-level multiplexer requires $\frac{N}{2}+1$ 2x1 multiplexers. According to [24], if a larger multiplexer is built from a combination of smaller multiplexers, then the power of the larger multiplexer can be found by adding up the power of the smaller multiplexers. Using the same concept, the static power consumption of an encoded multiplexer ($P_{MUX_{Enc}}$) and a 2-level multiplexer ($P_{MUX_{2Lev}}$) can be found using the following equations, where $P_{2\times1}$ denotes the average static power of the 2x1 multiplexer in use ($P_{2\times1_{NMOS}}$ or $P_{2\times1_{TG}}$):

$$P_{MUX_{Enc}} = (N-1)\,P_{2\times1} \qquad (15)$$

$$P_{MUX_{2Lev}} = \left(\frac{N}{2}+1\right) P_{2\times1} \qquad (16)$$
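The composition rule described above can be sketched in Python as follows. The function names and the numeric value used for $P_N$ are hypothetical; the 2x1 multiplexer counts ($N-1$ for encoded, $N/2+1$ for 2-level) and the 0.8 factor for the NMOS-based 2x1 multiplexer are taken from the text. The sketch assumes an even number of inputs for the 2-level case.

```python
def mux_2x1_count(n_inputs, style):
    """Number of 2x1 multiplexers needed to build an N-input multiplexer."""
    if style == "encoded":
        return n_inputs - 1
    if style == "two_level":
        if n_inputs % 2 != 0:
            raise ValueError("this sketch assumes an even number of inputs")
        return n_inputs // 2 + 1
    raise ValueError(f"unknown multiplexer style: {style}")


def mux_static_power(n_inputs, style, p_2x1):
    """Static power of a larger multiplexer as the sum of its 2x1 multiplexers."""
    return mux_2x1_count(n_inputs, style) * p_2x1


# Hypothetical unit NMOS leakage (arbitrary units); 0.8 * P_N is the
# corrected NMOS-based 2x1 multiplexer power stated in the text.
p_n = 1.0
p_2x1_nmos = 0.8 * p_n

# A 4-input LUT selects among 16 SRAM bits, i.e. a 16-input encoded multiplexer.
print(mux_static_power(16, "encoded", p_2x1_nmos))
print(mux_static_power(4, "two_level", p_2x1_nmos))
```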

| Effect of sizing
Multiplexers in an FPGA need to be sized according to the load they are driving, and it is important for the static power model to account for transistor sizing. Figure 9 shows the effect of transistor sizing on the static power consumption of a multiplexer: the static power increases linearly with size, and this trend is consistent across all process technologies. Therefore, the generalized equation for the static power consumption of a multiplexer scales linearly with the transistor size.

Figure 10 shows that an SRAM cell consists of two back-to-back connected inverters and two pass transistors, and its total static power can be estimated from these components (Equation (18)). It is important to mention here that, to reduce the static power consumption of SRAM cells, thick-oxide transistors are used in them [4]; $P_N$ and $P_P$ in Equation (18) refer to thick-oxide transistors. Moreover, the PMOS transistors used in an SRAM cell are of unit size, so $P_P$ must be divided by the PN ratio of the process technology.
To avoid the loading effect in SRAM cells, an additional inverter is sometimes added at the output. In our design, we use such a setup only for the SRAM in the LUT. Its static power can be calculated by adding the power of the inverter, $P_{INV}$, to that of the SRAM cell.

| Gate boosting effect
As mentioned earlier, the SRAM cells used in the routing resources are powered with a higher voltage, and we need to account for the effect of this voltage boosting in our model. Figure 11 shows the effect of supply voltage on the power consumption of an SRAM cell; it can be observed that the power consumption doubles with every 0.1 V increase. This trend is very consistent across all process technologies under test. Therefore, the static power of a boosted SRAM cell is scaled by a factor of $2^{V_B/0.1}$, where $V_B$ (in volts) refers to the boosted voltage, which is normally equal to $V_{TH}$.
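The doubling trend described above can be expressed as a small Python helper. This is a sketch under the assumption that the "doubles every 0.1 V" observation is applied as a continuous power-of-two scaling; the function name and the example values are hypothetical.

```python
def boosted_sram_power(p_sram, v_boost, doubling_step=0.1):
    """Scale SRAM static power, doubling it for every `doubling_step` volts of boost.

    p_sram:  static power of the SRAM cell at the nominal supply voltage
    v_boost: boost voltage V_B in volts (normally about V_TH)
    """
    if v_boost < 0:
        raise ValueError("this sketch only covers non-negative boost voltages")
    return p_sram * 2.0 ** (v_boost / doubling_step)


# Hypothetical example: a 0.3 V boost gives roughly an 8x increase.
print(boosted_sram_power(1.0, 0.3))
```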

FIGURE 4 Typical FPGA tile of VPR [17]
FIGURE 5 Conventional logic element
Boosting of a multiplexer is much trickier because only the gate voltage changes while $V_{DS}$ remains the same. Referring back to the subthreshold leakage current in Equation (1), leakage current increases exponentially with increasing gate voltage $V_G$. Moreover, in the boosted situation the gate voltage is higher than the normal rated voltage; this higher gate voltage also increases the gate leakage current exponentially [21], and as a result the overall rate of increase of leakage current is more than just $e^{\Delta V_G}$ ($V_B = \Delta V_G$). Considering these two factors, we use Equation (21) for the boosted multiplexer.
Figure 12 shows the graphical comparison between the measured and estimated static power of the boosted multiplexer. Except for the 32 nm process technology, Equation (21) estimates the static power of the boosted multiplexer with a maximum error of 26%. For the 32 nm process technology, however, the error is high at 68%.
Table 11 shows the comparison between the measured and estimated static power of the different components used in the LE, while the graphical comparison between the measured and estimated power of both LEs (NMOS and TG) across different process technologies is shown in Figure 13. Our static power model estimates power very accurately for non-boosted components, and since LE-TG has no gate boosting, its error percentage is less than 1%. However, since the error percentage for boosted components such as the 2x1 multiplexer is up to 68% and a major part of LE-NMOS consists of boosted circuits, the estimated static power of LE-NMOS has a relatively high error percentage, ranging from 1% to 14%.

| REPLACING DLX PROCESSOR WITH FPGA
The power budget plays a big role in deciding the logic capacity of an FPGA device. For example, over 5.  to 5000 logic elements [26]. Before implementing an application on an FPGA device, the designer needs to know whether the required logic capacity for the application can be achieved within the available power budget. Consider a case where a designer wants to replace a DLX processor with a reconfigurable-logic FPGA, with the constraint that the FPGA fabric should operate within the power budget of the DLX processor. The designer wants to find the maximum logic capacity that can be achieved in the given power budget. To validate our static power model and show how it can be used, we answer this question using both measured and estimated values.
In this case study, we use a simple FPGA architecture consisting of the LE-NMOS presented in Section 4, a routing channel width ($W$) of 20, directional wires in the routing channels, logic cluster input pin connectivity $F_{C_{in}} = 0.5W$ and switch block flexibility $F_S = 3$. Furthermore, it is assumed that at the rated voltage, static power accounts for 50% of the total power consumption of both the processor and the FPGA. Table 12 shows the number of FPGA tiles that can be turned ON within the power budget of a DLX processor. The number of tiles predicted using our static power model is very close to the measured values; the maximum error percentage across all process technologies is 9%. Based on these results, we can say that our static power model can be used in the early stage of design for process-independent architectural-level decisions.

We use the VTR CAD tool to measure the static power consumption of a whole FPGA system. Figure 14 shows the flow diagram used by VTR to measure the power consumption of an FPGA system. We briefly discuss some of the major steps; for a detailed explanation of each step, refer to [15,16,27]. VTR requires the following three input files:
1) Verilog Benchmark Circuit: The user needs to provide the Verilog code of the circuit to be implemented on the FPGA. The VTR package already includes some benchmark circuits that come from a variety of real applications. The 12 benchmark circuits used in this work are listed in Table 13.
2) Architecture Description File: Templates of different fully designed architecture description files are also included in the VTR package. These architectures can either be modified to test new architectures or be used unmodified. We use the same architecture described in Section 5, which consists of four BLEs per cluster, 10 inputs per cluster and four inputs for each LUT.
3) SPICE CMOS Technology File: This file describes the properties of the NMOS and PMOS transistors. We use the same PTM files for 130, 90, 65, 45 and 32 nm described in Section 1.
In the first step of the VTR CAD flow, the hardware description code is synthesised and mapped into the logic primitives (LUTs, flip-flops, etc.) defined by the architecture description file. SPICE simulations (external software required) are performed to measure the power consumption of each subcircuit (primitive). Then, based on the utilization of each subcircuit and the net activity, VTR calculates the total power of the targeted circuit.
Our static power model first estimates the static power consumption of each subcircuit following the procedure explained in Section 4, then uses the subcircuit count and utilization reported by VTR to estimate the static power of the whole FPGA system. It is important to mention here that we verify our model using a simple FPGA architecture, but since the basic building blocks of all FPGA devices are the same, our model can be applied generically to other FPGA architectures as well. Multiplexers, buffers and SRAM cells are the three main components, used in different sizes, from which an FPGA device is designed. In Section 4 it was shown that for these basic blocks our static power model can estimate power with reasonable accuracy. Some FPGA devices also have additional functional blocks such as multipliers and adders. These functional blocks are mostly designed using CMOS logic, and Section 3 showed the validity of the proposed static power model for CMOS logic circuits. Table 13 shows the comparison between the measured and estimated static power of the 12 benchmark circuits implemented on the FPGA. The results show that across all the process technologies under test, our static power model shows reasonable accuracy, with the error ranging from 1% to 15%.
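The question posed above reduces to dividing the available static power budget by the static power of a single FPGA tile and rounding down. A minimal Python sketch follows; the function name and the numbers in the example are hypothetical and are not taken from Table 12.

```python
import math


def max_tiles(static_power_budget, tile_static_power):
    """Largest whole number of FPGA tiles whose static power fits in the budget."""
    if tile_static_power <= 0:
        raise ValueError("tile static power must be positive")
    return math.floor(static_power_budget / tile_static_power)


# Hypothetical values in arbitrary units: a budget of 9 and a tile cost of 2.
print(max_tiles(9.0, 2.0))
```

The same function can be evaluated once with measured tile power and once with estimated tile power to compare the resulting tile counts.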
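The system-level estimate described above, per-subcircuit static power weighted by the subcircuit counts reported by the CAD flow, can be sketched as follows, together with the error percentage used for comparison against measurement. The subcircuit names, counts and power values are hypothetical placeholders and do not correspond to any benchmark in Table 13.

```python
def fpga_static_power(subcircuit_counts, subcircuit_power):
    """Total static power as the sum of count x per-subcircuit static power."""
    return sum(
        count * subcircuit_power[name]
        for name, count in subcircuit_counts.items()
    )


def percent_error(measured, estimated):
    """Absolute error of the estimate relative to the measured value, in percent."""
    return abs(estimated - measured) / measured * 100.0


# Hypothetical subcircuit usage and per-subcircuit static power (arbitrary units).
counts = {"mux": 10, "sram": 20, "buffer": 5}
power = {"mux": 2.0, "sram": 0.5, "buffer": 1.0}

estimated = fpga_static_power(counts, power)
print(estimated)
print(percent_error(measured=40.0, estimated=estimated))
```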

| CONCLUSION
In this study, we presented a process-independent static power estimation model. We showed that circuit configuration (transistors in parallel or series) plays a big role in static power consumption, and that merely using the active area to estimate static power results in a high percentage of error; for example, for a 4-input NOR gate the error was almost 241%-438%. For that reason, our static power model considers circuit configuration and estimates power much more accurately. For the basic CMOS logic gates, our model predicts the static power consumption of individual gates with an accuracy of 70% or more. For pass transistor-based circuits, our model considers the effect of gate boosting and can estimate the static power consumption of FPGA systems with an accuracy of more than 85%.

FIGURE 13 Comparison between measured and estimated static power of LE