Performance evaluation of the SM4 cipher based on field-programmable gate array implementation

Information security is essential to protect the sensitive data exchanged by resource-constrained devices (RCDs), which are used widely in the Internet of things (IoT). These RCDs require special cipher implementations because they have many limitations and constraints, such as low power/energy dissipation and limited hardware resources. The SM4 cipher is one of the common block ciphers, which can be easily implemented and offers a high level of security. The objective of this study is to determine the optimum field-programmable gate array (FPGA) design for SM4 to facilitate reconfiguring the FPGA with an optimum design during operation. Various FPGA design options for SM4 ciphers are examined, and the performance metrics are modeled: power, energy, area, and speed. Scalar and pipelined designs with one or multiple hardware rounds are considered without altering the cipher algorithm. The results show that the best scalar implementation utilises fewer resources than the pipelined implementations by 7%. Conversely, pipelined implementations perform better regarding speed and energy dissipation by 10 times and 40% of the scalar implementation, respectively. The pipeline implementations with eight or 16 rounds are optimum for continuous streams of data, and the two-round design is the optimum design across ciphers.


| INTRODUCTION
With the vast development of resource-constrained devices (RCDs), special cipher implementations should be considered to meet the constraints while ensuring a high level of security [1]. Ciphers are used for performing encryption and decryption operations in various devices to ensure the security of the data and information exchanged between the connected devices. There are two types of cipher algorithms: asymmetric and symmetric [2]. Asymmetric cipher algorithms use two keys (public and private keys), whereas symmetric algorithms use one private shared key. Typically, asymmetric ciphers are more complex and secure than symmetric ones [3]. Rivest-Shamir-Adleman (RSA) and Elliptic-curve cryptography (ECC) are common examples of asymmetric ciphers [4], while Data Encryption Standard (DES) and Advanced Encryption Standard (AES) [5] are examples of symmetric ciphers. Symmetric ciphers are used in RCDs because they provide privacy for the devices in addition to their high performance.
Hardware design is realised through application-specific integrated circuit (ASIC) or field-programmable gate array (FPGA) technology [2]. Although ASIC has better performance, FPGA provides more flexibility, reconfigurable features, and low-cost designs. Furthermore, FPGA provides other appealing features for cipher designs, such as algorithm agility, upload, and enhancement, as well as numerous security features [2].
The SM4 cipher is a symmetric block cipher (also known as the SMS4) [6], which was released by the Office of the State Commercial Cryptography Administrator in China in January 2006 based on the advantages of the AES architecture in terms of security and performance [7]. The size of the block and key of the SM4 cipher is 128 bits, which is similar to the AES128 cipher. The SM4 also implements 32 rounds of nonlinear iteration [8]. Initially, it was designed for protecting wireless networks, especially WLAN.
Then, it was applied to many fields, such as the Internet of things (IoT) and smart cards, because its implementation is easy and it offers high security [8]. Since its publication, many research studies have focused on implementing and optimising the SM4 cipher for different purposes on many platforms, such as GPU, ASIC, and FPGA, and it has demonstrated great performance.
Our motivation for this work is to enable running an optimum SM4 design on FPGA hardware to combine a high level of security with flexibility and reconfigurability at the minimum energy and resource utilisation of RCDs in IoT applications. Two types of FPGA designs are considered: a scalar design targeting intermittent data and a pipeline design aimed at continuous streams of data.
In this work, several FPGA implementations of the SM4 cipher were presented and applied using multiple rounds. Different performance metrics were modeled, including area, throughput, power, and energy, while focusing on modeling the energy/area metric with more emphasis on energy. The optimum design implementation was determined, and guidelines were drawn. Modifying the cipher architecture can improve its performance; however, it could affect its security and other metrics. Because of this, we focus on how to best implement the SM4 cipher without altering its internal architecture. In fact, several studies demonstrated that examining architectural and design options for cipher implementations leads to better performance, in particular energy consumption [1][25][26][27].
The main contributions of this paper are as follows:
- Implementing the scalar and the pipelined design of the SM4 cipher with one and multiple rounds (i.e. 1, 2, 4, 8, 16, and 32).
- Determining and modeling the performance metrics for the scalar and pipeline implementations.
- Determining the best design options that are appropriate for each performance metric and the optimum design using the energy/area metric while focusing more on energy, which is the most critical factor for RCDs.
- Comparing our conclusions with related research.
The rest of the paper is organised as follows. Section 2 presents the related work published in the literature. The design methodology is shown in Section 3. The SM4 cipher algorithm is illustrated in Section 4. Section 5 briefly describes the proposed approach. The results are summarised and discussed in Section 6. Finally, the conclusions and future trends are presented in Section 7.

| RELATED WORK
The SM4 block cipher has been implemented in different hardware platforms and was optimised for efficient area, throughput, and low power consumption. The first hardware implementations for the SM4 cipher were introduced in Reference [11]. Two FPGA implementations of the SM4 algorithm were examined: the pipelined and folded designs. The obtained throughput was 24 Gbits/s, whereas the optimised area was 380 CLB slices. The authors did not consider the power analysis, and the pipelined design was implemented with 32 stages only.
The SM4 cipher design in Reference [12] includes two implementations to balance the performance and area: the rolling and unrolling implementations. The clock frequency of 162 MHz and throughput of 20,736 Mbit/s were achieved. The implemented designs do not consider different numbers of rounds or even the power calculation.
In Reference [13], improved iteration and pipelined architectures of the SMS4 cipher were evaluated on the Altera Stratix II FPGA device. The proposed design used the similarity between the key expansion and encryption algorithms. The performance results demonstrated a better area utilisation for the iteration architecture with a cost of 1158 ALMs, whereas a higher throughput of 21,760 Mbit/s was achieved in the pipelined architecture compared to the previous work. However, power and energy metrics were not considered. The improved architectures were implemented without considering the number of hardware rounds.
A low-area implementation is proposed in Reference [7] using a normal basis in the composite field to implement the S-box instead of lookup tables (LUTs). The proposed architecture saved hardware resources because it required only 37 slices and 68 four-input LUTs, which are much lower than the previous designs relying on LUTs. This work focuses only on the hardware area without considering low power/energy consumption, which is one of the main issues for RCDs. Additionally, this work did not estimate the design throughput.
The authors in Reference [17] proposed a fully unrolled pipeline to adapt SM4 to the high throughput requirement in the XTS-SM4 module architecture. The area of the S-box is optimised, and the quantity of registers is controlled as well. The results showed that the proposed architecture had attained a maximum throughput of 33.68 Gbps and an efficiency of 325.12 Mbps/Kgate, which are twice better compared with other XTS-AES designs with the same technology. However, there is no power/energy analysis or area-reduction technique.
Multiple hardware designs have been proposed to examine the trade-off between the area and speed as given in Reference [14]: The resource-first design required the lowest number of resources: 687 logical units and 448 registers, with a throughput rate of 815.6 Mbps. A 27 Gbps throughput rate was achieved using the speed-first design. An improvement was achieved compared with the published solutions. This work provided the designers with a good view on how to choose the proper hardware design based on their requirements. The authors in this work were concerned more about using area and throughput.
The SM4 S-box is optimised as given in Reference [19] by adopting the PPRM circuit structure to reduce the power of the S-box, which is the main component in the SM4 cipher. The presented results showed that the design reduced the power consumption by 50% and the delay by 44% at the cost of a 20% increase in area using SMIC 0.18 μm technology. A pipeline architecture with four stages was also implemented. The achieved throughput rate was 2 Gbps, and the power consumption was 3.65 mW at 10 MHz with a moderate cost of hardware. The authors implemented only a four-stage pipelined architecture without considering other numbers of stages.
The authors in Reference [20] designed the first asynchronous architecture of the SM4 cipher algorithm using an asynchronous dual-rail pipeline. The results demonstrate a 20 times improvement in power consumption compared with the synchronous designs as given in Reference [19]. Only a full pipeline architecture (i.e. one round per stage) was adopted without examining the other number of rounds.
A new FPGA implementation of the SM4 cipher was proposed in Reference [8] based on separating the key expansion function and round function. The keys are generated in advance on the host computer, while the encryption process is accomplished on the FPGA. Additionally, a dual-cascade implementation architecture was proposed in which only 16 cycles are required instead of the 32 cycles of the iterative architecture of the SM4. The results show a 50% improvement in throughput compared with existing work. Fewer resources are required compared to other designs in the same technology. Power and energy dissipation were not evaluated in this work.
To summarise, the existing research work on the SM4 cipher is concerned with optimising the area, such as in Reference [7], improving the throughput, such as in References [8,17], or optimising both the area and throughput, such as in References [11][12][13][14]. However, none of these studies focused on optimising the power/energy consumption, which is the dominant issue in RCDs and is more important than area utilisation [2]. Most of them focus on implementing the full pipeline architecture (32 stages) because it improves the throughput. However, in most cases, it increases the power and energy dissipation. Few studies have focused on achieving low power consumption, such as References [19,20]. However, only one pipelined design is considered in each: either a full pipeline architecture (32 stages), such as in Reference [20], or a partial pipeline architecture (four stages), such as in Reference [19]. Different numbers of stages are not examined, even though they can affect the results. Our work is difficult to compare with the works in References [19,20] because the design parameters and technology used are different, so the comparison would not be fair. To the best of our knowledge, this is the first research that implements and optimises the SM4 cipher in two main architectures, the iterative (scalar) and pipelined architectures, by varying the number of rounds in each one to determine the best design in terms of different performance metrics while focusing on the power/energy analysis. The optimum design/designs suited for implementation in RCDs are also determined. The performance models are generated and optimised based on the implementation results.

| DESIGN METHODOLOGY
To accomplish the research goals, the FPGA design methodology was followed, as shown in Figure 1. Many studies have followed the same approach [25][26][27][28][29].
The performance metrics were computed according to the following reports generated by Altera FPGA software package Quartus-II: timing report, resource utilisation report, and power dissipation report. The environmental setup for compiling and simulating is shown in Table 1. We highlight the following observations.
In previous work [25][26][27], the Cyclone-II FPGA device was chosen to evaluate the performance of the indicated ciphers (i.e. Simon, Hight, and Katan/Ktantan) as it was suitable for implementing such ciphers. However, it is not suitable for the SM4 cipher because SM4 is more complex and includes more logic elements. Figure 2 compares the resource utilisation of SM4 with Katan [27], Simon [25], Hight [26], and AES [30,31]. Clearly, SM4 requires a significant number of resources. Moreover, when compiling SM4 with various design options, Quartus requires a significant amount of memory and CPU time, and significant routing resources were required. It was clear that we needed a larger FPGA; hence, we chose the Stratix-IV device because it is more suitable for implementing complex designs.

FIGURE 1 Design methodology
For frequency constraints, the initial Quartus compilation and analysis were performed at 50 MHz. If the timing results show that the implementation runs at a frequency that is slower than 50 MHz, then a second run is conducted using the frequency obtained from the first run. This approach reasonably constrains the design and avoids over-constraining, which could produce unexpected compilation results.

| BACKGROUND ON SM4 CIPHER ALGORITHM
The SM4 is a Chinese symmetric-key block cipher (also known as SMS4), which is designed based on the unbalanced Feistel network [7]. The plaintext input, ciphertext output, block data, and key are each 128 bits. The 128-bit input plaintext is split into four 32-bit words, where the word is the smallest operation unit [14]. The SM4 encryption algorithm is composed of the round function and key expansion function, and it takes 32 rounds of nonlinear substitutions [20]. In each encryption round, a new block of 32 bits is generated from the 128-bit input block. In each iteration, the new 32-bit block is combined with the previous 96-bit input data to form the new 128-bit input block for the next iteration [20]. A new round key of 32 bits is also produced by the key expansion function for each iteration and is used in the corresponding round operation.
The ciphertext is generated as an output of the round encryption function after 32 rounds of iterations. The encryption iterative architecture of SM4 is shown in Figure 3.
The SM4 cipher notations used throughout this section are shown in Table 2.

| Encryption round function
The structure of the round function, F, is shown in Figure 4, which is the basic operation unit in the SM4 cipher algorithm.
The round function F algorithm, illustrated in Algorithm 1, uses the mixed substitution T function shown in Algorithm 2. The T function consists of two main functions: the nonlinear transform τ and the linear transform L. The nonlinear transform τ applies four S-boxes in parallel. The S-box is implemented with a lookup table as a 16 × 16 array, as shown in Figure 5; the outputs of the S-boxes are combined into one 32-bit word, which is the input of the linear transform that applies the cyclic left-shift and XOR operations [14]. Because of the confusion effect of the nonlinear transformation, the SM4 cipher's security is greatly improved [32]. The plaintext input block X of 128 bits is divided into four words (X_0, X_1, X_2, and X_3), each with 32 bits, and a new block of data X_{i+4} of 32 bits is generated in each round iteration and is combined with the previous 96-bit block (X_{i+1}, X_{i+2}, and X_{i+3}) to form the input for the next iteration. This round operation is repeated until all round iterations are completed (i.e. 32 rounds), as displayed in Algorithm 1. The ciphertext output Y is obtained from the outputs of the last four rounds by reversing their order using the final-round transformation R function. This function maintains the consistency of the encryption and decryption structure [14].
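The round structure described above can be sketched in Python as follows. This is a minimal illustration, not the hardware design evaluated in this paper. The rotation amounts of the linear transform L (2, 10, 18, and 24) are taken from the published SM4 specification rather than from the text above, and the S-box is passed in as a parameter: PLACEHOLDER_SBOX is a hypothetical stand-in, not the real SM4 table, so the values produced are not real SM4 ciphertext.

```python
from typing import List, Sequence

MASK32 = 0xFFFFFFFF

# Hypothetical stand-in permutation; NOT the real SM4 S-box table.
PLACEHOLDER_SBOX = [(7 * v + 3) & 0xFF for v in range(256)]


def rotl32(x: int, n: int) -> int:
    """Cyclic left shift of a 32-bit word by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def tau(word: int, sbox: Sequence[int]) -> int:
    """Nonlinear transform: four byte-wide S-box lookups in parallel."""
    out = 0
    for shift in (24, 16, 8, 0):
        out |= sbox[(word >> shift) & 0xFF] << shift
    return out


def linear_l(b: int) -> int:
    """Linear transform L: XOR of the word with four rotated copies."""
    return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24)


def t_func(word: int, sbox: Sequence[int]) -> int:
    """Mixed substitution T = L(tau(word))."""
    return linear_l(tau(word, sbox))


def crypt_block(x: Sequence[int], round_keys: Sequence[int],
                sbox: Sequence[int]) -> List[int]:
    """Run one round per round key, then apply the reversing transform R."""
    state = list(x)  # four 32-bit words X_i .. X_{i+3}
    for rk in round_keys:
        new_word = state[0] ^ t_func(state[1] ^ state[2] ^ state[3] ^ rk, sbox)
        state = state[1:] + [new_word]  # keep 96 bits, append X_{i+4}
    return state[::-1]  # R: reverse the order of the last four words
```

Because of the Feistel structure and the final reversal R, calling crypt_block again with the round keys in reverse order recovers the input, whatever S-box is supplied. This is the consistency between encryption and decryption mentioned above.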

| Key expansion function
The key expansion function structure is presented in Figure 6. Algorithm 3 clarifies the key expansion function, which uses the mixed substitution T′ function illustrated in Algorithm 4. It has a structure similar to that of the round function except that the linear transformation L′ has a different bit-shift operation. As with the plaintext input, the master key (MK) of 128 bits is divided into four 32-bit sub-keys (MK_0, MK_1, MK_2, and MK_3). These sub-keys are XORed with four constant parameters (FK_0, FK_1, FK_2, and FK_3) to generate four sub-keys (K_i, K_{i+1}, K_{i+2}, and K_{i+3}). In each iteration i, a new round key rk_i of 32 bits (which represents the K_{i+4} sub-key) is derived from the original input key MK in Algorithm 3 to be used as an input for each encryption round function operation in Algorithm 1. After all rounds are done, 32 round keys are generated.
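The key expansion just described can be sketched in Python along the same lines. This is an illustrative sketch only. The FK constants, the rule for the round constants CK_i, and the rotation amounts of L′ (13 and 23) come from the published SM4 specification, not from the text above. The S-box is again a parameter, and PLACEHOLDER_SBOX is a hypothetical stand-in rather than the real SM4 table, so the round keys produced are not real SM4 round keys.

```python
from typing import List, Sequence

MASK32 = 0xFFFFFFFF

# System parameters FK_0..FK_3 from the published SM4 specification.
FK = (0xA3B1BAC6, 0x56AA3350, 0x677D9197, 0xB27022DC)

# Hypothetical stand-in permutation; NOT the real SM4 S-box table.
PLACEHOLDER_SBOX = [(7 * v + 3) & 0xFF for v in range(256)]


def rotl32(x: int, n: int) -> int:
    """Cyclic left shift of a 32-bit word by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def ck(i: int) -> int:
    """Round constant CK_i: byte j equals ((4*i + j) * 7) mod 256."""
    word = 0
    for j in range(4):
        word = (word << 8) | (((4 * i + j) * 7) & 0xFF)
    return word


def t_prime(word: int, sbox: Sequence[int]) -> int:
    """Mixed substitution T': S-box lookups followed by the transform L'."""
    b = 0
    for shift in (24, 16, 8, 0):
        b |= sbox[(word >> shift) & 0xFF] << shift
    return b ^ rotl32(b, 13) ^ rotl32(b, 23)


def expand_key(mk: Sequence[int], sbox: Sequence[int]) -> List[int]:
    """Derive 32 round keys rk_0..rk_31 from the four master-key words."""
    k = [mk[j] ^ FK[j] for j in range(4)]  # K_0..K_3 = MK_j XOR FK_j
    round_keys = []
    for i in range(32):
        rk = k[0] ^ t_prime(k[1] ^ k[2] ^ k[3] ^ ck(i), sbox)  # rk_i = K_{i+4}
        round_keys.append(rk)
        k = k[1:] + [rk]
    return round_keys
```

In the scalar hardware design, one such iteration corresponds to one pass through the key generation block, which is why that block can run in step with the round block.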

Algorithm 1 Encryption round Function F

Algorithm 3 Key Expansion Function
3: return C

| PROPOSED APPROACH
This section discusses the proposed approach, including the scalar design implementation and pipelined implementation of the SM4 block cipher. Initially, the scalar FPGA design of the SM4 block cipher was implemented with one round. Then, the scalar design was represented with multiple rounds. Finally, the pipelined design was illustrated to examine and optimise the considered performance issues, including area, power, and energy. The equation notations used throughout the paper are listed in Table 3.

| Scalar design
Scalar designs are suitable to encrypt intermittent data. The scalar design with one round is an iterative architecture where one hardware round is implemented. Figure 7 illustrates the scalar design of the SM4 cipher with one round, which includes three blocks: the control logic block, round block, and key generation block. The control block is the interface of the system, which manages the external and internal functionalities of the design. Three main registers are considered in this block: key, round counter, and X register. It also includes the finite-state machine (FSM), which is responsible for handling the sequence order of the system activities and functionalities. When the start input signal is set to one, it informs the control block to start the encryption process. The plaintext value is stored in the X register, and the round counter is set to zero to start processing the round logic. Simultaneously, the MK is loaded into four sub-key registers to execute the key generation function. The sub-keys (K_i, K_{i+1}, K_{i+2}, K_{i+3}) are assigned to the key generation block, while X_in and the round counter values are assigned to the round block through the control block. The specific round key (rk_i) is generated by the key generation block and is supplied to the round block, which produces the X_out output and updates the X register each round. This process is repeated for 32 rounds, and when the counter reaches the value of 31, the encryption process is completed, and the done output signal is set to one by the control block. The output ciphertext is stored in the X register after performing the R function. The key generation and round logic blocks contain the T′ and T substitution functions, respectively, as illustrated previously in Section 4.
The single hardware round in the scalar design is extended to implement multiple hardware rounds r. One clock cycle is required to perform one iteration/cipher round I in the basic design; thus, 32 cycles are required for one hardware round in addition to the idle cycles (i.e. C_idle). Increasing the number of hardware rounds decreases the number of iterations I by a factor of r, as expressed by Equation (1):
$I = 32 / r$ (1)
where r = 1, 2, 4, 8, 16, 32. To encrypt one block of data, the number of clock cycles required (C_B) is obtained using Equation (2):
$C_B = I + C_{idle}$ (2)
The implementation of two hardware rounds of the SM4 cipher is shown in Figure 8. The differences between the scalar design with one round and multiple rounds are listed below:
- r round logic blocks are executed simultaneously [i.e. Round_{i+0}, … , Round_{i+(r-1)}] as a replacement for only one round. As a result, the hardware requires 32/r iterations instead of 32 iterations.
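The relation between the number of hardware rounds and the cycles per block can be illustrated with a short Python sketch. It assumes that the cycle count per block is the number of iterations plus the idle cycles, as the text suggests. The function names and the idle-cycle value used in the example are hypothetical, since the number of idle cycles is not stated here.

```python
TOTAL_ROUNDS = 32  # SM4 always performs 32 cipher rounds


def iterations(r: int) -> int:
    """Number of iterations I when r hardware rounds run per clock cycle."""
    if r <= 0 or TOTAL_ROUNDS % r != 0:
        raise ValueError("r must be a positive divisor of 32")
    return TOTAL_ROUNDS // r


def cycles_per_block(r: int, c_idle: int) -> int:
    """Clock cycles C_B to encrypt one block: iterations plus idle cycles."""
    return iterations(r) + c_idle


if __name__ == "__main__":
    ASSUMED_IDLE_CYCLES = 2  # hypothetical value, for illustration only
    for r in (1, 2, 4, 8, 16, 32):
        print(r, iterations(r), cycles_per_block(r, ASSUMED_IDLE_CYCLES))
```

The sketch makes one point visible: as r grows, the fixed idle cycles account for a larger share of the cycles per block, so doubling r does not quite halve the cycle count.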

| Pipelined design
The scalar design of the SM4 cipher with 32 hardware rounds is used to implement our pipelined design by placing registers between pipeline stages. Pipelining means performing more than one job simultaneously, which improves the design throughput. Pipelined designs are used to process a continuous stream of data, as a new block of data is supplied to the system each clock cycle. In the scalar design, however, blocks can be fed only after C_B clock cycles. One or multiple rounds per pipelined stage were implemented to examine the influence of varying the number of rounds. The SM4 pipelined design with one round per stage is shown in Figure 9. It requires 32 stages and 32 registers inserted between them for the round logic block.
The key generation block is pipelined in the same manner to supply each round with its appropriate round key, and an additional 32 registers must be inserted between its stages. The SM4 is implemented by doubling the number of rounds r (i.e. 1, 2, 4, 8, 16, and 32) in each pipelined stage. As a result, the number of stages and the number of registers inserted between them are halved with each doubling, as illustrated by Equation (3). Figure 10 explains the implementation of two rounds per stage.
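The effect of the number of rounds per stage on the pipeline depth can be sketched in Python as follows. The stage count of 32/r follows from the description above. The cycle-count model (a fill latency of one cycle per stage, then one finished block per cycle) is a textbook pipeline assumption rather than a result of this paper, and the function names are hypothetical.

```python
TOTAL_ROUNDS = 32  # SM4 always performs 32 cipher rounds


def pipeline_stages(r: int) -> int:
    """Number of pipeline stages when each stage holds r cipher rounds."""
    if r <= 0 or TOTAL_ROUNDS % r != 0:
        raise ValueError("r must be a positive divisor of 32")
    return TOTAL_ROUNDS // r


def cycles_for_stream(n_blocks: int, r: int) -> int:
    """Cycles to encrypt n_blocks back to back in the pipelined design.

    Assumes one clock cycle per stage and one new block accepted per cycle.
    """
    if n_blocks <= 0:
        return 0
    return pipeline_stages(r) + n_blocks - 1


if __name__ == "__main__":
    for r in (1, 2, 4, 8, 16, 32):
        print(r, pipeline_stages(r), cycles_for_stream(1000, r))
```

Under this model, the fill latency matters only for short bursts. For a long stream, the cycle count approaches one cycle per block regardless of r, which is why the achievable clock frequency becomes the deciding factor for throughput.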

| RESULTS AND DISCUSSION
In this section, we present the results of the scalar and pipelined implementations of the SM4 cipher after simulating and running the implementations on ModelSim and Quartus, respectively. The presented results examine the following performance metrics: maximum frequency, throughput, utilised resources, power, and energy dissipation. The maximum frequency is reported in megahertz (MHz), power in milliwatts (mW), and energy in picojoules (pJ). The throughput is computed as the number of encrypted bits per second. The resources are represented by the number of LEs utilised. Performance metrics are calculated and modeled for all implementations, and the average model error is calculated for each modeled equation using Equation (4).
FIGURE 8 SM4 scalar design with two hardware rounds
The FPGA scalar and pipelined results of our approach are shown in Tables 4 and 5, respectively. For the FPGA scalar design, SM4 with r hardware rounds is represented as SM4_r, with SM4_r = SM4_{2^k}. For pipelined designs, SM4 with r hardware rounds per pipeline stage is represented as SM4P_r, with SM4P_r = SM4P_{2^k}, where r = 2^k. In Tables 4 and 5, the performance metrics for each implementation are presented, including the maximum operating frequency, logic elements, power consumption, and energy dissipated to process one block. To collect such measurements, each implementation was designed and analysed thoroughly.

| Speed and throughput
Scalar implementations:
The maximum frequency versus the number of implemented rounds for the scalar implementations is shown in Figure 11. Doubling the number of rounds clearly lowers the frequency, which means a longer cycle time. The frequency decays by an average of 10 MHz per doubling, which is 17% of the SM4_1 frequency. The frequency trend (f) for the scalar implementations is modeled by Equation (5) with an average model error of 10%.

Pipelined implementations:
The frequency trend for the pipelined designs is similar to that of the scalar designs, as clearly shown in Figure 12. It decreases by an average of 17% of the SM4P_1 frequency as the number of rounds doubles. The reason is that doubling the number of rounds in one pipeline stage increases the logic levels and hence lengthens the critical timing paths. The frequency trend (f) for the pipelined implementations can be modeled using Equation (6) with a 5% average model error.
where f(SM4_1) = 66 MHz. Comparing the scalar and pipelined implementations in terms of speed, the fastest designs are clearly SM4_1, SM4_2, and SM4P_1. They have the highest frequency, a smaller number of implemented hardware rounds, and less logic implemented in one clock cycle.
The encrypted bits per second are calculated to measure the throughput for all scalar and pipelined designs, as shown in Figure 13. Pipelined designs have 10 times better throughput than scalar designs. This is because the pipeline design encrypts one block in each clock cycle. The SM4P_1 provides the best throughput overall, whereas the SM4_8 has the best throughput among the scalar implementations.
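The throughput comparison can be made concrete with a small Python sketch. It assumes that the scalar design delivers one 128-bit block every C_B cycles and that the pipelined design delivers one block per cycle once the pipeline is full. The frequencies and cycle counts in the example are hypothetical placeholders, not the measured values reported in Tables 4 and 5.

```python
BLOCK_BITS = 128  # SM4 block size in bits


def throughput_scalar_mbps(f_mhz: float, cycles_per_block: int) -> float:
    """Scalar throughput in Mbit/s: one block every cycles_per_block cycles."""
    return BLOCK_BITS * f_mhz / cycles_per_block


def throughput_pipelined_mbps(f_mhz: float) -> float:
    """Pipelined throughput in Mbit/s: one block per clock cycle."""
    return BLOCK_BITS * f_mhz


if __name__ == "__main__":
    # Hypothetical operating points, for illustration only.
    scalar = throughput_scalar_mbps(f_mhz=50.0, cycles_per_block=32)
    pipelined = throughput_pipelined_mbps(f_mhz=50.0)
    print(f"scalar: {scalar:.1f} Mbit/s, pipelined: {pipelined:.1f} Mbit/s")
```

The expressions show the trade-off behind the results: adding hardware rounds to a scalar design reduces the cycles per block but also lowers the clock frequency, so the best scalar throughput is reached at an intermediate number of rounds.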

| Logic elements (LEs) and power
Scalar implementations: Resource utilisation in the Stratix-IV FPGA is represented by the number of LEs based on the combinational adaptive LUT (ALUT) and dedicated register. Figure 14 shows the resource utilisation of scalar implementations by varying the number of hardware rounds.
The total number of LEs increases by an average of 71.4% as the number of rounds doubles. To gain a better understanding of this trend, Figure 14 also plots the types of LEs: combinational LEs (i.e. ALUT) and register LEs (i.e. a dedicated register). The combinational LEs grow by an average of 79%, while the number of register LEs is constant. The increase in combinational LEs is due to the synthesis tool, as it reuses the combinational logic and connections as the number of implemented rounds increases [25]. The number of registers in the control block remains the same as stated previously, except for the round counter register, which gets smaller by one bit. Equation (7) presents the modeling equation of the number of LEs with an average model error of 7%.
where LE((SM4_1)Comb) = 793 and LE((SM4_1)Register) = 265. The power consumption trend is described in Figure 15. The total power demonstrates an increasing trend as the number of rounds changes. The power components, namely the combinational and sequential (i.e. registers and control block signals) power, are shown in Figure 15 to explain the total trend. As the number of rounds doubles, combinational power increases by an average of 70%. However, the sequential logic power increases slightly by an average of 15%. The main cause of the growing trend of total power is the combinational power. The power trend (P) for scalar implementations is modeled by Equation (8) with an average error of 12%, where P((SM4_1)Comb) = 8.68 mW and P((SM4_1)Seq) = 6.19 mW.
A clear connection exists between power consumption and the number of LEs (resource utilisation). Figure 16 plots the power components (combinational and sequential) with the LE components (combinational ALUTs and registers) to describe the relationship between them as the number of rounds doubles. The following observations are noted:
- A slight increase occurs in sequential power, while the register LEs demonstrate a constant trend. This is attributable to the sequential logic activities when the number of rounds doubles.
- The growth in combinational power is higher than the growth in combinational LEs except for the last point (i.e. the SM4_32 implementation).
The higher growth is attributable to additional factors that raise the power, namely the interconnect and glitch¹ power, which increase when the number of rounds doubles due to the increased connections and routing logic [26]. In contrast, the lower power growth is due to the synthesis tool because it tends to optimise larger logic better, as more chances for optimisation exist than for smaller logic. This explains the different power growth at the last point (i.e. SM4_32).
Pipelined implementations: Figure 17 displays the number of LEs when doubling the implemented rounds in a pipelined stage. The trend of the total number of LEs is decreasing. To understand this trend, Figure 17 also shows the LE components: combinational and register LEs. The number of combinational LEs shows a slight decay from SM4P_1 to SM4P_16, decaying again at SM4P_32. The tool tends to minimise and share the logic combinations and optimise the connections better as the number of rounds per stage doubles. Moreover, the number of register LEs decreases at an average of 55%. The number of registers decreases (thus, the number of stages decreases) when the number of rounds per stage doubles. Therefore, the largest number of registers is in the SM4P_1 design, as it contains the largest number of pipeline stages (i.e. 32 stages) and a partial result is pipelined each round. The smallest number of registers is in the SM4P_32 design because it has only one pipeline stage. Accordingly, the resource utilisation trend (LE) is modeled using Equation (9) with 2% average model error, where LE((SM4P_1)Comb) = 15,406 and LE((SM4P_1)Register) = 5226.
¹ Glitches are spurious transitions created by unbalanced arrival times (i.e. delays) of input signals to the gates, which results in switching the output unnecessarily to intermediate voltages before it settles to the final correct voltage value. While they do not impact the correctness of the design, glitches increase the dynamic power [33].
Figure 18 depicts the power trend versus the number of implemented rounds in a pipeline stage. The total power consumption initially drops at SM4P_2, which is the minimum value, then increases again until reaching the maximum value at SM4P_32. The justification of this dish-like shape is shown by the power component trends in Figure 18.
Clearly, the combinational power decreases slightly at the beginning in SM4P_2, then starts growing until reaching the maximum value at SM4P_32. Thus, the main descent in the total power at SM4P_2 is because of the combinational power. In contrast, the sequential power decreases by an average of 30% when the number of rounds doubles. The oppositely trending curves (combinational and sequential) create a dish-like shape for the total power.
Equation (10) shows the power trend (P) for the pipelined implementations, and the power model has an average error rate of 4%, where P((SM4P_1)Comb) = 133.57 mW and P((SM4P_1)Seq) = 45.92 mW. Figure 19 shows the relation between the power components and LE components for the pipeline implementations. A 30% decrease was observed in sequential power, while the sequential logic (i.e. register LEs) decayed by an average of 55%. Moreover, the figure shows a slight increase in combinational power by an average of 4% and a slight decrease in combinational LEs by an average of 2%.
The growth of the combinational power is a little higher than that of the combinational LEs due to the interconnect and glitch power. In general, an obvious correlation exists between dynamic power and the utilised resources. This is because the dynamic power is proportional to the design area [26]. Moreover, other factors, such as the interconnect and glitch power, optimisation of synthesis tools, and signal switching activity, make some differences.
Comparing scalar and pipelined implementations, the best design in terms of LEs is the scalar SM4 1, which uses only 7% of the LEs of the pipelined implementation with the fewest LEs, SM4P 32. In terms of power consumption, the best design is also SM4 1, which consumes only 9% of the power of the best pipelined implementation (i.e. SM4P 2). Thus, the scalar implementations generally perform better in terms of power and resource utilisation, as shown in Figure 20 and Figure 21, respectively.

| Energy
Scalar implementations: Figure 22 shows the "energy per block" trend curves for the scalar designs against the number of implemented rounds. The energy per block is computed using Equation (11), where T block is the time to encrypt one block and T cycle is the cycle time.
The energy decreases by an average of 11% per doubling of the number of rounds from SM4 1 to SM4 4, where SM4 4 has the lowest energy consumption. The energy then increases from SM4 4 to SM4 16, which is the maximum, and then slightly decreases again at SM4 32. For a better understanding of this trend, the energy components that contribute to the total energy are also shown in Figure 22. The combinational energy grows by an average of 10%, except at SM4 2 and SM4 32. The sequential energy decays by an average of 31% as the implemented rounds double. To understand the behaviour of the energy trend, two facts should be noted as the number of rounds r increases:
- Combinational power and sequential power increase.
- The time to process one block (T block) decreases as the number of cycles needed to encrypt the block decreases.
In addition to these facts, sometimes, for larger logic, the synthesis tool accomplishes better routing optimisation, resulting in less routing power and thus less energy, such as in the SM4 2 and SM4 32 implementations.
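The interplay between power and block time described above can be illustrated with a minimal sketch. It assumes the usual relation, energy per block = average power × T block, with T block = (32 / r) × T cycle for a scalar design that computes r rounds per clock cycle and needs no extra overhead cycles. The exact form of the paper's Equation (11) is not reproduced here, and the function names and sample numbers are hypothetical.

```python
# Energy per block for a scalar SM4 design, under the assumptions that
# E_block = P_avg * T_block and T_block = (32 / r) * T_cycle.

SM4_ROUNDS = 32  # SM4 applies 32 rounds to each 128-bit block


def cycles_per_block(rounds_per_cycle):
    """Clock cycles needed to encrypt one block with r rounds per cycle."""
    if rounds_per_cycle <= 0 or SM4_ROUNDS % rounds_per_cycle:
        raise ValueError("rounds_per_cycle must divide 32")
    return SM4_ROUNDS // rounds_per_cycle


def energy_per_block_nj(power_mw, t_cycle_ns, rounds_per_cycle):
    """Energy per encrypted block in nanojoules."""
    t_block_ns = cycles_per_block(rounds_per_cycle) * t_cycle_ns
    # mW * ns = pJ, so divide by 1000 to obtain nJ.
    return power_mw * t_block_ns / 1000.0


if __name__ == "__main__":
    # Hypothetical operating points, not measurements from the paper:
    # more rounds per cycle -> higher power and longer cycle, fewer cycles.
    for r, p_mw, t_ns in [(1, 10.0, 5.0), (2, 14.0, 9.0), (4, 22.0, 17.0)]:
        e = energy_per_block_nj(p_mw, t_ns, r)
        print(f"SM4 {r}: {cycles_per_block(r)} cycles, {e:.3f} nJ per block")
```

Whether the energy rises or falls as r doubles depends on whether the growth in power and cycle time outweighs the halving of the cycle count, which is the trade-off behind the non-monotonic curve in Figure 22.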
Pipelined implementations: The energy-per-encrypted-block trend for the pipelined implementations is shown in Figure 23, which plots the total energy and the energy components. For pipelined implementations, the energy is computed using Equation (13).
The computed energy is highly dependent on the average power. Thus, the total energy has a trend curve similar to that of the power (i.e. a dish-like shape), with the least energy dissipation at SM4P 2 and the highest at SM4P 32. Figure 23 clearly illustrates that the combinational energy increases slightly by an average of 5%, whereas the sequential energy decays by an average of 28%. This explains the shape of the total energy curve. The glitch and interconnect power are the reasons behind the increase in combinational energy, as stated previously. However, the decrease in the number of registers inserted between stages results in the decay in the sequential power.
The modeled equation for the energy (E) trend of the pipelined designs is presented in Equation (14). Comparing scalar and pipelined implementations, it is clear that the pipelined designs perform better in terms of energy, as shown in Figure 24. The best pipelined implementation (i.e. SM4P 2) consumes only 40% of the energy of the best scalar implementation (i.e. SM4 4).
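One reason pipelining lowers the energy per block for continuous data can be sketched as follows. The sketch assumes a textbook pipeline in which a design with S stages needs S + N - 1 cycles to process N blocks, so that in steady state one block completes per cycle and the energy per block approaches average power × T cycle. This is an assumption for illustration and may differ from the paper's Equation (13); the names and numbers are hypothetical.

```python
# Amortised energy per block for a pipelined design, assuming a classic
# S-stage pipeline that finishes N blocks in S + N - 1 clock cycles.


def pipeline_cycles(stages, n_blocks):
    """Cycles to push n_blocks through a pipeline with the given stages."""
    if stages < 1 or n_blocks < 1:
        raise ValueError("stages and n_blocks must be positive")
    return stages + n_blocks - 1


def energy_per_block_pj(power_mw, t_cycle_ns, stages, n_blocks):
    """Average energy per block in picojoules (mW * ns = pJ)."""
    total_time_ns = pipeline_cycles(stages, n_blocks) * t_cycle_ns
    return power_mw * total_time_ns / n_blocks


if __name__ == "__main__":
    # Hypothetical 16-stage design at 100 mW with a 10 ns cycle.
    for n in (1, 16, 1024, 1_000_000):
        e = energy_per_block_pj(100.0, 10.0, 16, n)
        print(f"{n:>9} blocks: {e:10.2f} pJ per block")
```

For a single block the pipeline fill time dominates, while for long streams the cost per block converges to one cycle's worth of energy, which is consistent with pipelined designs being preferred for continuous streams of data.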

| Optimum design
To determine the best design of the SM4 cipher implementation, the most important performance metrics should be considered. As SM4 is a block cipher originally targeted at WLAN, which is one of the RCD applications, the most critical performance metrics are energy and area. The design of the cipher should have the fewest LEs and the least energy dissipation. However, energy is becoming the dominant issue because transistor feature sizes are constantly shrinking [2], while the need to extend battery lifetime is increasing. Therefore, the optimum performance metric of Equation (15) [26] is applied to all scalar and pipelined implementations of the SM4 cipher to present a clear comparison regarding area and energy while emphasising minimal energy, where μ is the energy emphasis factor (μ = 1.0, 1.2, 1.4, 1.6, 1.8, and 2.0). Table 6 lists the optimum designs under this metric for the different μ values. For the scalar design, the one-round implementation (closely followed by the two-round implementation) is the best for all μ values except μ = 2.0, where two rounds are slightly better than one round. In contrast, the best pipelined implementation is the eight-round design, followed closely by the 16-round design, for all μ values. In general, to determine the best number of implemented rounds, the highest emphasis factor (μ = 2.0) is considered because energy is the dominant issue. The best implementations are then the two- and one-round scalar designs, closely followed by the eight- and 16-round pipelined designs (SM4 2, SM4 1, SM4P 8, and SM4P 16). The best scalar implementation uses only about 7% of the LEs and 9% of the power of the best pipelined implementations, which are very low percentages. That is why the scalar design is optimal: it requires the least area with the least power consumption.
However, the scalar design is only suited for intermittent-data applications, whereas pipelined implementations are used for continuous streams of data. Therefore, the best implementation depends on the application. Table 7 presents the number of rounds of the optimum design for this work and for previously published research. The number of rounds is not the same because the design complexity is not the same, as illustrated in Figure 2. We can draw the following conclusions regarding the number of rounds for the optimum design:
- For scalar designs, increased complexity increases the number of rounds.
- For pipelined designs, two-round designs seem to be the best; however, increased complexity might reduce the number of rounds.
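The effect of the energy emphasis factor μ on the choice of design can be illustrated with a minimal sketch. The form of the paper's Equation (15) is not reproduced in this text, so the sketch assumes a simple cost of the form area × energy^μ, where lower is better. The actual metric from [26] may be defined differently (for example, as a reciprocal or with normalisation). The design names and values below are hypothetical and are not taken from Table 6.

```python
# Ranking designs with an energy-emphasised area/energy cost.
# ASSUMPTION: cost = area * energy ** mu (lower is better). This stands in
# for the paper's Equation (15), whose exact form is not given here.


def cost(area_le, energy, mu):
    """Assumed cost: area weighted by energy raised to the emphasis mu."""
    return area_le * energy ** mu


def best_design(designs, mu):
    """Name of the design with the lowest assumed cost for a given mu."""
    return min(designs, key=lambda name: cost(*designs[name], mu))


# Hypothetical (area in LEs, energy per block in arbitrary units):
# design_a is smaller, design_b is more energy-efficient.
HYPOTHETICAL_DESIGNS = {
    "design_a": (1000, 2.0),
    "design_b": (1500, 1.5),
}

if __name__ == "__main__":
    for mu in (1.0, 1.2, 1.4, 1.6, 1.8, 2.0):
        print(f"mu={mu:.1f}: best is {best_design(HYPOTHETICAL_DESIGNS, mu)}")
```

With these made-up values the smaller design wins at μ = 1.0 and the more energy-efficient design wins at μ = 2.0, mirroring how a stronger energy emphasis can shift the optimum from one design to a close competitor.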

| CONCLUSION
Our objective in this work was to enable running the optimum SM4 design on FPGA hardware to combine a high level of security with flexibility and reconfigurability. The optimum design reduces the energy and resource utilisation of RCDs in IoT applications. The SM4 block cipher was implemented using different design options: scalar and pipelined implementations with different numbers of hardware rounds. Different performance metrics were measured and modeled, including area, speed, power, and energy. The implementation results show that the best scalar implementation outperforms the pipelined implementations in terms of resources (7% of the LEs) and power dissipation (9% of the power). Pipelined implementations perform 10 times better in terms of speed compared with scalar implementations. Moreover, the best pipelined design consumes only 40% of the energy of the best scalar implementation.
Determining the optimum design regarding the energy/area metrics depends on the target application, as the scalar design is generally targeted at intermittent-data applications, while the pipelined implementations are used for continuous streams of data. Thus, the optimum scalar designs are SM4 2 followed by SM4 1, whereas SM4P 8 followed by SM4P 16 are the best pipelined designs of the SM4 cipher.
Compared with other ciphers, the number of rounds for the optimum scalar design increases with complexity. For pipeline designs, the two-round design is optimum across ciphers.  Future research could potentially extend this work to investigate and design a power-aware SM4 encryption system, where the FPGA is reconfigured with a specific SM4 implementation based on the system power and energy status [32]. The potential savings of power and energy could be substantial and worth the study. Another area of future investigation is to examine the vulnerability of the SM4 design to hardware Trojans. Multiple replicas of SM4 designs with different implementations could be considered a potent detection technique for vulnerable systems.