• security, cryptographic algorithms;
  • computer architecture;
  • instruction set extensions



In this paper, we design, implement, and realize a cryptographic unit (CU) that can easily be integrated into any reduced instruction set computing (RISC)-type processor for the safe and efficient execution of cryptographic algorithms. The design of the CU takes a novel approach to the execution of cryptographic algorithms compared with cryptographic accelerators and architectural enhancements. Although it is integrated into the pipeline of an embedded RISC processor, it is a partially autonomous unit with its own resources, analogous in this sense to a floating point unit. It provides new instructions to accelerate cryptographic algorithms, and its associated area cost is acceptable and justified by the improvement in performance and efficiency. The CU can also be instrumental in protecting the cryptographic computation against active and passive attacks as well as other malicious processes running simultaneously. We demonstrate that Advanced Encryption Standard (AES) encryption can be performed entirely inside the CU, which prevents secret and/or sensitive information from leaving the CU during the cryptographic computation. Copyright © 2012 John Wiley & Sons, Ltd.



Implementation issues of cryptographic algorithms have been the focal point of major research efforts for the last two decades. Although performance and efficiency in terms of speed and used resources were the only focus in the early years of the discipline, concerns about the secure implementation of cryptographic algorithms have become at least as important. A great deal of research work has already been dedicated to performance issues of cryptographic algorithms.

Efficiency, which can be thought of as a combined performance metric, is usually measured using the area–time product, which primarily shows how effectively the design space is utilized.

Secure implementation of cryptographic algorithms is another (and increasingly important) goal because side-channel attacks [1-3] are shown to compromise security entirely, independently of the theoretical/mathematical strength of the underlying cryptographic construction. Although a multitude of countermeasures at different levels, from circuit through architectural to algorithmic, has been proposed to harden implementations against side-channel attacks, the problem has not been fully solved. This is partly due to the difficulty of the problem and, more importantly, due to the fact that there is no panacea for absolute protection [4]. Almost every proposed remedy, therefore, has its own limitations and shortcomings.

A particular class of side-channel attacks, concerning mostly software implementations, exploits the computational residues left in micro-architectural constituents [5-9]. This class of attacks clearly underlines the fact that general-purpose processors, designed under totally different constraints, are inadequate in providing the computational primitives for secure execution of cryptographic algorithms from a purely software implementation point of view. This necessitates a paradigm shift in the design process of general-purpose microprocessors to accommodate security requirements.

General-purpose processors are designed to be versatile in fulfilling a diverse set of tasks, which results in the sharing of resources by different, simultaneously executing processes. The main protection that keeps processes from interfering with each other is process isolation, enforced by the operating system. The operating system that implements this core protection mechanism, however, is overly complex and usually too full of deficiencies (e.g., software bugs exploited by attackers) to provide an adequate level of security. Cryptographic applications, in particular, require special care in process isolation because they have access to private/confidential information such as secret keys.

Major microprocessor manufacturers (Intel, AMD, and ARM) already introduced hardware extensions to their processor cores that allow isolated execution of programs [10-12]. This can be achieved by making the portions of memory, cache, and translation lookaside buffer (TLB) used by a program inaccessible to other programs. The techniques proposed in Refs. [13, 14] deal with performance problems in isolated execution of security-sensitive codes by minimizing trusted code base and by proposing some hardware extensions.

A highly favored practice is to implement cryptographic algorithms on dedicated hardware (e.g., cryptographic accelerators or trusted platform modules as defined in Ref. [15]). A cryptographic processor with limited functionality can easily overcome the aforementioned difficulties. Furthermore, being free from the overly complicated subtleties of process execution semantics on general-purpose processors, hardware implementations provide a simple yet perfectly isolated execution environment for cryptographic algorithms. Lastly, it may not be possible to outperform hardware implementations in terms of speed.

Hardware implementations, however, suffer from a different set of disadvantages. First, a dedicated cryptographic processor is essentially a coprocessor and relies on a host (general-purpose) processor with which it needs to communicate. The communication not only results in considerable overhead but may also introduce new security risks inherent in processor/coprocessor settings. Finally, dedicated hardware implementations do not provide a sufficient level of flexibility and scalability, which are often needed in cryptographic applications as security requirements and awareness change over time.

The alternative to hardware implementations is enhancing general-purpose processors in such a way that they not only provide performance, flexibility, and scalability but also introduce secure execution semantics into the computation core. The advantages are obvious and manifold: (i) tight integration into the processor core, with no communication overhead or its associated security risks; (ii) relatively moderate area compared with cryptographic accelerators; and (iii) high degrees of flexibility, scalability, and agility that go far beyond fixed-function hardware such as a coprocessor, because the architecture can still be used for general-purpose computing. A perfect example is the new algorithms [16, 17] for direct anonymous attestation protocols in the context of trusted computing, which combine elliptic-curve cryptography (ECC) and pairing operations.

The main novelty in this paper is the idea of using a cryptographic unit (CU) in the computation of cryptographic operations and in handling secret information. The CU, a preliminary prototype of which was originally proposed in Refs. [18, 19], is a relatively low-cost execution unit that can be integrated into many RISC processors because its interface complies with the conventions of RISC-style instruction set architecture (ISA). It is akin to a floating point unit, which can similarly be integrated into the execution pipeline of a general-purpose processor while providing different sets of instructions via new functional units. The CU can also be implemented as a separate core in a multicore design. While the sensitive and speed-critical parts of the cryptographic computation are executed in the CU, the other parts of the computation can be executed on the base processor. The CU is flexible and can be programmed to implement any cryptographic algorithm with any key size.

In this paper, we design, implement, and realize such a CU and demonstrate its advantages. First, we report the speedups achieved for basic arithmetic and cryptographic operations through the use of the CU. Then, we provide the area requirements of the CU when implemented in both application-specific integrated circuit (ASIC) and field-programmable gate array (FPGA). To demonstrate its efficiency, we provide the time–area product metric obtained for elliptic-curve cryptography (ECC) and RSA operations. We also demonstrate that the AES algorithm can be efficiently and securely executed in an isolated environment established through the CU. Providing full protection for other, more complicated cryptographic algorithms is beyond the scope of this work.

Related work and our contribution

Previous works [20-24] propose various enhancements to accelerate cryptographic operations. For instance, the authors in Ref. [20] propose five custom instructions to accelerate arithmetic operations in both GF(p) and GF(2n) on an MIPS32 core to benefit ECC, whereas the ISA extensions in Ref. [22] aim to accelerate pairing-based cryptography. The authors in Ref. [24] explore the effect of on-chip memory on the execution time of s-box computations in symmetric key cryptography. More recently, works such as those in Refs. [25-29] mainly propose cryptographic blocks and/or cores dedicated to accelerating certain cryptographic algorithms such as RSA, AES, Kasumi, and the like.

In another work, Ref. [30], which aims to provide protection as well as performance, new instructions that accelerate the bit-sliced implementation of AES also protect the AES computation against known cache-based attacks.

In our earlier work in Ref. [18], we proposed a CU that accelerates cryptographic algorithms such as RSA and ECC. The proposed CU provides new and powerful custom instructions to accelerate multiplication and inversion in the prime finite field GF(p) and, by extension, the cryptographic operations of ECC and RSA. The CU was also shown, in Ref. [19], to be instrumental in implementing AES in software in a manner resistant to cache-based side-channel attacks. The CU was prototyped on an embedded processor core by Tensilica [31], and estimates of area usage and acceleration rates for RSA, ECC, and AES were given in Refs. [18, 19].

Contributions of this paper along with those in Refs. [18, 19] can be summarized as follows (indicating what is new and improved in this paper):

  • The results of Refs. [18, 19] are given together in a coherent way (new).
  • We fully implement and realize the proposed CU and integrate it into an embedded processor (improved).
  • We report on the cost of the CU in terms of area and time overhead in an embedded processor core by Tensilica [31] for both ASIC and FPGA implementations (improved).
  • We provide architectural details of the CU to an extent, which are not provided in Refs. [18, 19] (new).
  • We provide the speedup values obtained through the use of the CU for RSA and ECC operations.
  • We compare our implementation with a hand-optimized assembly language implementation on a well-known embedded processor (ARM7TDMI) and demonstrate that significant speedups can be obtained using straightforward C language implementation on the CU (new).
  • We compare the performance of our implementations against those provided by dedicated crypto-blocks/cores (new).
  • We provide implementation results for five-stage and seven-stage pipeline organizations to see if a deeper pipeline provides any advantage (new).
  • We provide implementation results for RSA and ECC operations when precomputation techniques are used (new).
  • We analyze the efficiency of the CU versus its accompanying overhead using the time × area metric (new).
  • We implement the AES using the CU in a protected manner and show that confidential values (e.g., secret keys, intermediate values obtained in different phases of the computation) never leave the protected zone. During the computation, unprotected functional units can also be used for nonsecret values of the computation. But memory, cache, and architectural registers cannot be used for confidential values at any time because memory is generally not a safe place, we do not have any control over the cache functioning, and architectural registers are subject to automatic spilling at any time.
  • We provide the execution time and throughput values for secure implementation of AES algorithm, which are comparable with similar implementations.

The CU alone cannot ensure an isolated execution zone for cryptographic applications, but it can be a very useful component. We only implement the AES algorithm in an isolated fashion because it utilizes simple operations that can be performed entirely inside the CU. Other, more complicated cryptographic applications such as RSA and ECC require a small amount of cache-like on-chip memory to keep the sensitive values during the calculation. In addition, software support is also necessary for protecting the sensitive values during the cryptographic computation because other simultaneously executing processes can in principle access them. Therefore, the CU usage may need to be restricted to the privileged protection rings, and access to the CU by other processes has to be strictly regulated. However, these issues are either beyond the scope of this paper or left as future work.



In this section, we provide the details about the reconfigurable processor and our basic enhancements to it.

Configurable processors

A typical configurable processor consists of a predefined processor core that can be enhanced to meet specific application requirements. Configuring these processor cores generally includes modifying, adding, or removing processor peripherals, memories, memory bus widths, and handshake protocols to deliver performance improvements for a given set of applications. Once configured, the processor is synthesized as RTL code and can be mapped to ASICs or FPGAs. ARC, Improv, and Tensilica are some of the major companies that offer configurable processor cores.

We prefer Tensilica's Xtensa configurable processor cores as the target embedded processor in our work because they are among the configurable cores that offer a full software development tool chain.

We chose Xtensa's LX2 core as our base processor, a 32-bit processor architecture featuring a compact instruction set optimized for embedded system designs. The base architecture includes a 32-bit arithmetic logic unit (ALU), up to 64 general-purpose physical registers, and 80 base instructions [31]. Furthermore, the LX2 core has two essential features, namely, configurability and extensibility, which are utilized in the process of generating our CU.

The configurability attribute of the LX2 core allows designers to adjust their designs for specific applications by modifying the processor core according to the specifications. Modifications can be made by defining the width and number of execution units, data interfaces, and optional data paths. With the extensibility feature, custom execution units, registers, register files, and single-instruction multiple-data functional units can be added to the processor data path. Extensions to the data path are achieved through the Tensilica instruction extension (TIE) language, a Verilog-like language used to describe instruction set extensions to the processor core. Functional behaviors of the desired extensions are defined in TIE, and the TIE compiler generates and places the RTL-equivalent blocks into the processor data path.

Cryptographically enhanced processor

The processor core that incorporates the CU is referred to throughout this paper as either the cryptographically enhanced processor or the enhanced processor.

The design process of creating such a processor consists of two steps. First, the LX2 processor core is configured into the so-called base processor, and then the base processor is extended with the CU by using the TIE language to build the final configuration.

Base processor. In order to demonstrate that it is feasible to integrate the proposed CU into any RISC processor core, we chose to configure a (resource-wise) very conservative embedded processor as the base processor. To keep the processor size as small as possible, we removed unnecessary units from the LX2 core. For instance, the floating point and 32-bit integer divider units are removed because they are not relevant in this context. Data and instruction caches are configured as direct-mapped caches of reasonable size. To increase the processor's performance, we chose the memory–cache interfaces and processor interface (PIF) as 128 bit (the largest available) to increase the bandwidth of the front-side bus. The configuration of the base processor is presented in Table 1.

Table 1. Configuration of the base processor.
Multiply unit | 32 bit
Register file | 32 × 32 bit
Data memory/cache interface | 128 bit
PIF | 128 bit
Data cache | 8 kb, direct mapped, 16-byte line size
Instruction cache | 8 kb, direct mapped, 16-byte line size

The pipeline length of the LX2 core is also configurable, and two versions of the base processor are generated, with five-stage and seven-stage pipelines. The hardware costs of the five-stage and seven-stage pipelined versions of the base processor in 0.13 µm CMOS technology are approximately 102 000 and 115 000 equivalent gates, respectively.

Figure 1 shows the integration of the CU into the general architecture of the Xtensa LX2. The CU consists of two parts: the cryptographic register file (CRF) and the cryptographic execution unit (CEU). In the following sections, the CRF and the CEU are explained in detail.


Figure 1. General architecture of enhanced embedded core.


Cryptographic register file

The CRF is an array of 32 registers, each 128 bits wide, used to store operands and temporary results of arithmetic operations in cryptographic algorithms. Storing these values in the CRF significantly reduces the execution time because the number of time-consuming memory access operations is reduced. Besides, the CRF can be used to store sensitive information such as secret keys and small lookup tables to increase the security level of cryptographic algorithms. In subsequent sections, we show that the CRF is of crucial importance for protecting the software implementation of AES from cache-based side-channel attacks.

Cryptographic execution unit

The CEU is the new execution unit designed to utilize the 128-bit PIF and the CRF during cryptographic operations. By choosing the interface precision as 128 bit, we increase the word length for cryptographic operations to 128 bit instead of the 32-bit word size of general-purpose processors. Using the 32-bit ALU in the core processor would be inefficient for these operations; therefore, the CEU is designed to be used as a functional unit for cryptographic operations. The functional units of the CEU take 128-bit operands from the CRF.

The CEU is composed of three parts: an integer unit (IU), a shifter circuit, and a multiply unit. The IU performs addition, subtraction, and comparison of two 128-bit integers, whereas the shifter circuit performs shift operations in both directions on a 128-bit register. The final functional unit in the CEU is the multiply unit, which performs 128-bit multiplications, generates a 256-bit result, and stores the most and least significant 128 bits of the result in the special-purpose registers HI and LO, respectively. Figure 2 shows the detailed architecture of the CU and the functional units inside the CEU.

  • Adder unit. We design a 128-bit adder unit using three 64-bit adders, as shown in Figure 3, to keep the critical path delay low.
  • Multiply unit. The multiply unit is the most crucial functional unit of the CEU for accelerating modular multiplication, which is performed a great number of times in RSA and ECC, as well as in many other public key algorithms.
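The adder unit above can be modeled in software to show why three 64-bit adders keep the critical path short: one adder sums the low halves, while the other two evaluate both carry-in cases of the high halves in parallel, so the low-half carry only drives a selection. The `u128` type and function name in the following C sketch are ours, an illustrative model rather than the CU's actual RTL:

```c
#include <stdint.h>

/* Software model of the 128-bit adder unit built from three 64-bit
 * adders in a carry-select arrangement (an assumption about the
 * organization; the paper only states that three adders are used). */
typedef struct { uint64_t hi, lo; } u128;

static u128 add128(u128 a, u128 b, int *carry_out) {
    u128 r;
    r.lo = a.lo + b.lo;
    int c0 = (r.lo < a.lo);        /* carry out of the low-half adder */
    /* two high-half adders evaluate both carry-in cases in parallel */
    uint64_t h0 = a.hi + b.hi;     /* assuming carry-in = 0 */
    uint64_t h1 = a.hi + b.hi + 1; /* assuming carry-in = 1 */
    int co0 = (h0 < a.hi);
    int co1 = (h1 <= a.hi);        /* <= because of the extra +1 */
    r.hi = c0 ? h1 : h0;           /* select on the low-half carry */
    *carry_out = c0 ? co1 : co0;
    return r;
}
```

The selection step replaces a full 64-bit carry propagation on the critical path, which is the usual motivation for a carry-select organization.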

Figure 2. Detailed architecture of the CU.



Figure 3. Adder unit for 128-bit addition operation.


A 128-bit multiplication can be decomposed into four 64-bit multiplications, as shown in Figure 4. Each 64-bit multiplication produces a 128-bit partial product, and in the end, all partial products are aligned and summed to compute the final product. The 256-bit final product is stored in the HI and LO special-purpose registers.


Figure 4. Dividing 128-bit multiplication into four 64-bit multiplications.


In a 64-bit multiplication, four 32-bit multiplications are performed in parallel in one clock cycle using four 32-bit multipliers. Figure 5 illustrates the operations performed for the multiplication of a3a2 and b1b0 to obtain the partial product p2. As seen in the figure, the partial product calculation finishes in three clock cycles, two of which are spent on the subsequent alignment and addition operations within the partial product. The same operations are repeated for the other three partial products.


Figure 5. Partial product computation.


The four 128-bit partial products, namely, p0, p1, p2, and p3 (Figure 6), calculated in the previous step are stored temporarily in four 128-bit registers. The final product is computed after three iterations of successive additions of the partial products into the HI and LO registers. These iterations are summarized in Figure 6 and explained below.

  • First iteration. Partial products p1 and p2 are added, and the result (t) is stored temporarily in a register. In the following iterations, t is divided into two halves, tH and tL, each used as an operand of the addition operations in the second and third iterations. The carry of the addition is stored in a 1-bit carry register, C1, as it is used in the calculation of the result in the HI register in the final iteration.
  • Second iteration. The lower half of the partial sum calculated in the first iteration, tL, is added to p0, and the result, the lower half of the final product, is stored in the LO register. Again, the carry-out from this step is stored in a carry register, C2, and is used in the final iteration.
  • Third iteration. In the final step, the upper half of the partial sum of the first iteration, tH, is concatenated with C1 and added to p3. During the addition, the carry of the second iteration, C2, is used as the carry-in value. The result of the addition is stored in the HI register, completing the final product in HI and LO.
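The decomposition and the three accumulation iterations above can be traced in C at quarter scale, with a 32-bit operand standing in for a 128-bit register and its 16-bit halves for the 64-bit halves; the partial products p0..p3 and the carries C1 and C2 play exactly the roles just described. This is an illustrative model, not the CU's implementation:

```c
#include <stdint.h>

/* Quarter-scale model of the CEU multiply scheme: 32-bit "registers",
 * 16-bit halves, four partial products, and a three-iteration
 * accumulation into HI/LO with carries C1 and C2. */
static void mul_hi_lo(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo) {
    uint32_t aL = a & 0xFFFFu, aH = a >> 16;
    uint32_t bL = b & 0xFFFFu, bH = b >> 16;

    uint32_t p0 = aL * bL;                 /* four partial products */
    uint32_t p1 = aH * bL;
    uint32_t p2 = aL * bH;
    uint32_t p3 = aH * bH;

    uint32_t t  = p1 + p2;                 /* iteration 1: t, carry C1 */
    uint32_t C1 = (t < p1);

    *lo = p0 + (t << 16);                  /* iteration 2: LO, carry C2 */
    uint32_t C2 = (*lo < p0);

    /* iteration 3: HI = p3 + (C1 || tH) + C2 */
    *hi = p3 + ((C1 << 16) | (t >> 16)) + C2;
}
```

Because the full product is p0 + ((p1 + p2) << 16) + (p3 << 32), aligning C1||tH against p3 and feeding C2 as carry-in reproduces the exact product, just as in the 128-bit unit.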

Figure 6. Alignment and addition of partial sums.


Proposed instructions

A new family of instructions is introduced to the processor ISA to fully employ the CEU. These instructions operate on 128-bit operands and conform to the instruction types and formats of the LX2 core, which uses RISC instruction encoding. Therefore, the new instructions are encoded as RISC instructions with a slight difference: the common notation of source, target, and destination registers (denoted rs, rt, and rd, respectively) in RISC encoding is adjusted so that the functional units in the CEU use operands stored in the CRF. Accordingly, the source, target, and destination registers of the CRF are designated c_rs, c_rt, and c_rd.

Some of the proposed instructions are presented in Table 2. The ADD_CREG and SUB_CREG operations perform unsigned addition and subtraction, respectively. Both operations take their operands from the c_rs and c_rt registers and write the result back to the c_rd register. The COMP_CREG operation compares the values of the c_rs and c_rt registers; if the value of c_rs is greater than that of c_rt, it writes 1 to c_rd; otherwise, it writes 0. The SHL_CREG and SHR_CREG operations perform 1-bit shift operations in either direction. Because the CRF has two read ports and one write port, only the value of the c_rs register can be changed, whereas the value in the c_rt register remains unchanged. The MUL_CREG operation performs 128-bit unsigned multiplication and writes the product into the HI and LO special-purpose registers. Finally, the LOAD_CREG and STORE_CREG operations perform data transfers between memory and the CRF for a given memory address.

Table 2. List of new instructions.
ADD_CREG c_rd, c_rs, c_rt | Unsigned addition | (Cout, c_rd) := c_rs + c_rt + Cin
SUB_CREG c_rd, c_rs, c_rt | Unsigned subtraction | (Bout, c_rd) := c_rs − c_rt − Bin
COMP_CREG c_rd, c_rs, c_rt | Comparison | c_rd := (c_rs > c_rt) ? 1 : 0
SHL_CREG c_rs, c_rt | Shift together left | c_rs := c_rs[126:0] || c_rt[127]
SHR_CREG c_rs, c_rt | Shift together right | c_rs := c_rt[0] || c_rs[127:1]
MUL_CREG c_rs, c_rt | Unsigned multiplication | (HI, LO) := c_rs × c_rt
LOAD_CREG c_rd | Load data from memory | c_rd := Memory[address]
STORE_CREG c_rd | Store data to memory | Memory[address] := c_rd
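As a sketch of the shift-together semantics in Table 2, the following C model of SHL_CREG shows the intended use: shifting a multi-register value left one bit by bringing in the top bit of the next-lower register. The `u128` struct and function name are illustrative, not part of the ISA:

```c
#include <stdint.h>

/* Model of SHL_CREG: c_rs := c_rs[126:0] || c_rt[127], i.e., shift c_rs
 * left by one bit and insert bit 127 of c_rt at the bottom. Chained over
 * adjacent CRF registers, this implements a multi-word left shift. */
typedef struct { uint64_t hi, lo; } u128;

static u128 shl_creg(u128 rs, u128 rt) {
    u128 r;
    r.hi = (rs.hi << 1) | (rs.lo >> 63);   /* bit 63 moves up */
    r.lo = (rs.lo << 1) | (rt.hi >> 63);   /* bit 127 of c_rt comes in */
    return r;
}
```

To shift a 1024-bit value held in eight CRF registers, the instruction would be applied once per register from the most significant downward, each call consuming the top bit of its lower neighbor.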



In this section, we provide the implementation details of two important arithmetic operations in many cryptographic algorithms, namely, multiprecision modular multiplication and modular inversion.

Modular multiplication

We use the separated operand scanning (SOS) method given in Ref. [32], which consists of two phases: (i) the schoolbook multiplication of two big integers and (ii) the Montgomery reduction. Even though the SOS requires more memory space than the other methods in Ref. [32], it does not use all the variables at the same time. Consequently, because each phase in isolation requires less memory than the coarsely integrated operand scanning (CIOS) method, which interleaves the two phases, all the operands needed in the SOS phases fit in the CRF for up to 1024-bit multiplications. Table 3 summarizes the implementation results of the CIOS algorithm on the base processor and the SOS algorithm on the enhanced processor in terms of the number of clock cycles.

Table 3. Implementation results for modular multiplication in five-stage pipeline.
Precision | CIOS (base) | SOS (with CU) | Speedup
512 | 25 605 | 2365 | 10.8
1024 | 100 304 | 7654 | 13.1
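The two SOS phases can be sketched at word level as follows. The limb count and 32-bit limb size are illustrative (the CU operates on 128-bit words), and `n0inv` is the usual Montgomery constant −n⁻¹ mod 2³²; this is a minimal sketch of the method of Ref. [32], not our optimized implementation:

```c
#include <stdint.h>

#define S 2  /* limbs per operand; two 32-bit limbs for the demo */

/* SOS Montgomery multiplication: r = a * b * R^-1 mod n, R = 2^(32*S).
 * Phase 1 is the schoolbook product; phase 2 the Montgomery reduction. */
static void sos_montmul(const uint32_t a[S], const uint32_t b[S],
                        const uint32_t n[S], uint32_t n0inv, uint32_t r[S]) {
    uint32_t t[2 * S + 1] = {0};

    /* phase 1: t = a * b (schoolbook) */
    for (int i = 0; i < S; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < S; j++) {
            uint64_t acc = (uint64_t)a[i] * b[j] + t[i + j] + carry;
            t[i + j] = (uint32_t)acc;
            carry = acc >> 32;
        }
        t[i + S] = (uint32_t)carry;
    }

    /* phase 2: Montgomery reduction of t modulo n */
    for (int i = 0; i < S; i++) {
        uint32_t m = t[i] * n0inv;        /* m = t[i] * (-n^-1) mod 2^32 */
        uint64_t carry = 0;
        for (int j = 0; j < S; j++) {
            uint64_t acc = (uint64_t)m * n[j] + t[i + j] + carry;
            t[i + j] = (uint32_t)acc;
            carry = acc >> 32;
        }
        for (int k = i + S; carry && k <= 2 * S; k++) {  /* propagate */
            uint64_t acc = (uint64_t)t[k] + carry;
            t[k] = (uint32_t)acc;
            carry = acc >> 32;
        }
    }

    /* conditional final subtraction: r = u - n if u >= n, u = t >> 32*S */
    int ge = (t[2 * S] != 0), decided = ge;
    for (int i = S - 1; !decided && i >= 0; i--) {
        if (t[S + i] != n[i]) { ge = (t[S + i] > n[i]); decided = 1; }
    }
    if (!decided) ge = 1;                 /* u == n: subtract to get 0 */
    uint64_t borrow = 0;
    for (int i = 0; i < S; i++) {
        if (ge) {
            uint64_t d = (uint64_t)t[S + i] - n[i] - borrow;
            r[i] = (uint32_t)d;
            borrow = d >> 63;             /* top bit set on underflow */
        } else {
            r[i] = t[S + i];
        }
    }
}
```

For the two-limb demo modulus n = 2³² + 1 (so n0inv = 0xFFFFFFFF), R = 2⁶⁴ ≡ 1 (mod n), and the Montgomery product coincides with the plain modular product, which makes the sketch easy to check by hand.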

Modular inversion

Inversion is a relatively slow operation needed in RSA (e.g., in key generation or the Chinese remainder theorem (CRT) method), ECC, and pairing-based cryptography. Although it is possible to avoid most inversion operations in many cases (e.g., projective coordinates in ECC), there are other situations where fast inversion is useful. The best way to compute a multiplicative inverse is the binary extended Euclidean algorithm and its variant, the Montgomery inversion algorithm [33]. The Montgomery inversion is implemented for both the base and the enhanced processors. The results are enumerated in Table 4 in terms of the number of clock cycles.

Table 4. Implementation results for modular inversion in five-stage pipeline.
Precision | Inversion (base) | Inversion (with CU) | Speedup
160 | 78 174 | 36 978 | 2.11
192 | 106 082 | 43 864 | 2.42
256 | 172 407 | 57 168 | 3.02
512 | 579 878 | 141 836 | 4.09
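The ladder underlying these timings can be illustrated at word scale with the classical binary extended Euclidean inversion (the Montgomery inversion of Ref. [33] is a two-phase variant of the same idea). The sketch below computes a⁻¹ mod p for odd p < 2⁶³ with gcd(a, p) = 1; the name and single-word precision are illustrative:

```c
#include <stdint.h>

/* Binary extended Euclidean inversion: only shifts, additions, and
 * subtractions, i.e., exactly the operations the CU accelerates.
 * Invariants: x1 * a == u (mod p) and x2 * a == v (mod p). */
static uint64_t binv(uint64_t a, uint64_t p) {
    uint64_t u = a, v = p, x1 = 1, x2 = 0;
    while (u != 1 && v != 1) {
        while ((u & 1) == 0) {            /* halve u, keep invariant */
            u >>= 1;
            x1 = (x1 & 1) ? (x1 + p) >> 1 : x1 >> 1;
        }
        while ((v & 1) == 0) {            /* halve v, keep invariant */
            v >>= 1;
            x2 = (x2 & 1) ? (x2 + p) >> 1 : x2 >> 1;
        }
        if (u >= v) { u -= v; x1 = (x1 >= x2) ? x1 - x2 : x1 + p - x2; }
        else        { v -= u; x2 = (x2 >= x1) ? x2 - x1 : x2 + p - x1; }
    }
    return (u == 1) ? x1 % p : x2 % p;
}
```

The data dependence between consecutive halving and subtraction steps is what makes this ladder hard to parallelize, consistent with the pipeline observations in the next subsection.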

Modular multiplication and inversion on seven-stage pipeline

In order to show that the proposed CU can easily be integrated into RISC processors with different properties, we implement it on a seven-stage pipelined version of the Xtensa processor. In particular, it is important to demonstrate that a long latency instruction such as the 128-bit integer multiplication does not cause significant performance degradation due to true data dependencies. The timing results of modular multiplication and inversion for both the base processor and the enhanced processor with the CU are given in Tables 5 and 6, respectively. The last columns of the tables enumerate the performance loss, in clock counts, in the algorithm executions due to the deeper pipeline when compared with the five-stage pipeline.

Table 5. Implementation results for modular multiplication in seven-stage pipeline.
Precision | CIOS (base) | SOS (with CU) | Performance loss
160 | 3032 | 1132 | 9.66%/8.12%
192 | 3747 | 1282 | −3.25%/7.19%
256 | 7310 | 1013 | 9.25%/8.81%
512 | 27 856 | 2598 | 8.83%/9.85%
1024 | 108 493 | 8418 | 8.16%/9.98%
Table 6. Implementation results for modular inversion in seven-stage pipeline.
Precision | Inversion (base) | Inversion (with CU) | Performance loss
160 | 94 174 | 40 789 | 20.27%/10.31%
192 | 127 296 | 48 485 | 19.98%/10.53%
256 | 207 939 | 63 354 | 20.61%/10.82%
512 | 698 104 | 153 821 | 20.39%/8.45%

As can be observed from Tables 5 and 6, the modular multiplication and inversion operations take longer in the seven-stage pipeline than in the five-stage pipeline. This result is expected because true data dependencies have a higher negative impact in deeper pipelines, a fact confirmed by our implementations of Montgomery multiplication and inversion on the seven-stage pipeline. It is well established that modular multiplication is much easier to parallelize than modular inversion. In our software implementations on the base processor with the seven-stage pipeline, the Montgomery inversion operation accrues, on average, twice the overhead of the modular multiplication because of the true data dependencies in the algorithm. However, when implemented with the proposed CU integrated into the seven-stage pipeline, both algorithms suffer equivalently (i.e., about 10%). These results clearly show that the long latency instructions implemented in the CU have no particular negative effect in longer pipelines.



In this section, we provide the speedup values obtained for both RSA and ECC.

We implemented a simple 1024-bit RSA using two methods: the windowing method with a 4-bit window size and no windowing. On the base processor, 1024-bit RSA with 4-bit windows takes, on average, 132 361 636 clock cycles, 97.48% of which are spent on modular multiplication. Consequently, the speedup on the processor with the CU is found to be 11.26. RSA (1024 bit) with no windowing takes 156 812 860 clock cycles, 97.86% of which are spent on modular multiplication. The speedup for this case is found to be 11.47. To demonstrate the effectiveness of our architecture, we cite another 1024-bit RSA implementation on an Xtensa processor [34], which takes 24.4 million clock cycles to finish one RSA operation. Because our slowest 1024-bit RSA operation takes about 13 642 547 clock cycles, this translates to a speedup of almost 44%.
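The 4-bit windowing method above trades 16 precomputed powers for one table multiplication per four exponent bits instead of one conditional multiplication per bit. A word-scale C sketch (32-bit operands and plain modular arithmetic instead of 1024-bit Montgomery arithmetic; the names are ours):

```c
#include <stdint.h>

/* Left-to-right fixed 4-bit window modular exponentiation. */
static uint32_t mulmod(uint32_t a, uint32_t b, uint32_t n) {
    return (uint32_t)(((uint64_t)a * b) % n);
}

static uint32_t powmod_w4(uint32_t base, uint32_t exp, uint32_t n) {
    uint32_t tbl[16];
    tbl[0] = 1 % n;                        /* precompute base^0..base^15 */
    for (int i = 1; i < 16; i++) tbl[i] = mulmod(tbl[i - 1], base, n);

    uint32_t r = 1 % n;
    for (int shift = 28; shift >= 0; shift -= 4) {
        /* square 4 times, then multiply by the entry for this window */
        for (int s = 0; s < 4; s++) r = mulmod(r, r, n);
        r = mulmod(r, tbl[(exp >> shift) & 0xF], n);
    }
    return r;
}
```

At 1024-bit scale the same structure applies with Montgomery multiplication in place of `mulmod`, which is why nearly all of the cycle count is spent in modular multiplication.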

Similarly, we implemented elliptic-curve scalar point multiplication with Jacobian coordinates [35], and the implementation results are given in Table 7.

Table 7. Implementation results for elliptic-curve point multiplication.
Precision | Point multiplication (base) | Point multiplication (with CU) | % of modular multiplication | Speedup
160 | 5 684 844 | 2 695 097 | 87.00% | 2.11
192 | 9 774 069 | 3 673 000 | 90.17% | 2.66
256 | 21 509 576 | 4 412 633 | 92.49% | 4.87
512 | 160 109 439 | 19 798 812 | 96.51% | 8.08

It is a common tendency to think that there is no need to speed up the inversion operation because of the use of projective coordinates (e.g., Jacobian coordinates) [35]. The use of projective coordinates eliminates all but one inversion from elliptic-curve point operations at the expense of more multiplications; only a single inversion is needed to convert the resulting point from projective coordinates back to affine coordinates. We demonstrate in this section that the time spent even on a single inversion may be significant, especially when the modular multiplication is performed on our enhanced processor.

Using projective coordinates, one elliptic-curve scalar point multiplication takes approximately 2 695 097 clock cycles for a 160-bit elliptic curve on the enhanced processor. Our implementation of Montgomery inversion for 160-bit operands would consume, on the other hand, 78 174 clock cycles if the CU is not utilized. The inversion thus consumes only about 2.9% of all clock cycles spent on the scalar point multiplication, including the conversion. This does not call for speeding up the inversion operation because any improvement in inversion would speed up the entire operation only marginally.

There are, however, precomputation techniques that significantly improve the elliptic-curve point operations. For example, with fixed-base comb method, it is possible to perform one scalar point multiplication in 342 901 clock cycles on the enhanced processor. This time, the inversion operation would consume about 22.80% of clock cycles without the CU, which is a good motivation for speeding up inversion operation. Consequently, this would be translated into about 10% speedup in one point multiplication because of the improvement in inversion calculations.
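The two percentages quoted above follow directly from the cycle counts in the text; a one-line check (the function name is ours):

```c
/* Fraction of one scalar point multiplication spent on a single
   Montgomery inversion, from the cycle counts given in the text. */
double inv_fraction(double inv_cycles, double pm_cycles) {
    return inv_cycles / pm_cycles;
}
```

With the Jacobian-coordinate figures, inv_fraction(78174, 2695097) gives about 0.029 (2.9%); with the fixed-base comb figures, inv_fraction(78174, 342901) gives about 0.228 (22.8%).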



In this section, we give the details of secure AES implementation on our enhanced processor. The results are also presented in Ref. [19].

Efficient software implementations of many block ciphers rely on the use of lookup tables, which naturally makes them vulnerable to cache-based side-channel attacks [5, 7-9]. The lookup tables are used to implement the nonlinear functions (s-boxes) of the block cipher, which may be the most important part of the algorithm, both performance- and security-wise. These tables usually fit in the first-level or second-level caches of modern processors. The most efficient software implementation of AES to the best of our knowledge (not relying on hardware support or special instructions) [36] uses four 1-KB tables for the first nine rounds of 128-bit AES. Another table of the same size is used for the last round. Many cache-based attacks [5-7] exploit the access patterns of the cryptographic process to cache lines, which may contain the desired table item (cache hit) or not (cache miss). A spy process running simultaneously with the cryptographic computation can find out the access patterns of the cryptographic process by creating carefully timed access patterns of its own to the same cache.

Our CU addresses the security issues related to cache-based attacks through a couple of useful practices, as explained in the following:

  • We keep the confidential values, including secret keys and intermediate results (e.g., the 128-bit AES state), within the CU in special-purpose registers. A small number of special registers is sufficient to hold all the secret values during the computation.
  • The lookup tables are implemented in the CRF to avoid cache access.
  • The architectural registers are generally not used to hold confidential values. They are only used to keep public data such as loop indices and some temporary variables. If temporary variables are stored in architectural registers even for a while, they are used in such a way that they are never spilled to memory (e.g., no function call is made before the used register is reset in rotating register architectures).

Here, we rely on the assumption that the CU can be made tamper proof.

Architectural enhancements allowing protected execution of AES

Although larger s-boxes may be preferable from the security point of view, they are usually chosen to be of moderate size in practice because of implementation concerns. Table 8 lists the lookup table sizes needed to implement the s-boxes of some well-known block cipher algorithms. The lookup tables for these block cipher algorithms can be implemented in our CRF, shown in Figure 7(a), because the CRF is 512 bytes in total size. We use the first 16 cryptographic registers (cr0 to cr15) to hold the lookup table entries of AES in our implementation.

Table 8. Lookup table sizes of well-known block cipher algorithms.
Block cipher algorithm | Lookup table size
DES/3DES | 256 bytes
AES | 256 bytes
Twofish | 512 bytes (two 8 × 8 permutation tables)
Serpent | 256 bytes

Figure 7. Cryptographic register file used to implement secure lookup table for s-box.


The CRF can be used to perform 128-bit integer arithmetic (e.g., addition, subtraction, and multiplication) and bitwise logical operations (e.g., AND, OR, XOR) with a single native instruction. Bitwise logical operations are especially useful in the AES implementation. On the other hand, some other special-purpose registers, listed in Figure 7, are needed to perform the table lookup operation through the CRF.

A cryptographic register in the CRF can be considered as a register capable of holding 16 bytes of the AES lookup table, as demonstrated in Figure 8. For instance, the cryptographic register (cr0) stores the s-box entries for the input bytes from 0 to 15. In the figure, (crx) represents any one of the cryptographic registers in the CRF, and (crx[i]) stands for an individual byte within the register (crx). The individual bytes of cryptographic registers are not directly accessible. As explained in the following, only special instructions can transfer an individual byte of a cryptographic register into a special register (s_out), as shown in Figure 7. Direct access to an individual byte and its transfer to any other register would overly complicate the design and incur high performance penalties.


Figure 8. One cryptographic register holding 16 bytes of lookup table.


Figure 7(b) shows a small register file that can hold two versions of the AES state (each of which is 128 bits) in each round, one for the old and one for the newly computed AES state (or block). Recall that each AES round takes a 128-bit block as input and generates a new 128-bit block as output. For the 192- and 256-bit implementations of AES, only the state register file needs to be modified, whose area overhead is negligible compared with the overall area of the processor.

In every round, we perform 16 s-box operations for 128-bit AES. The AES state is in (st0, st1, st2, st3) at the beginning of each round. The first 32-bit part of the state is transferred from (st0) to (index) register (Figure 7(c)) using the instruction (mv_st2index st0). The least significant byte of the (index) register is used to access the s-box output, which is stored in the CRF.

The upper 4 bits of the least significant byte of the (index) register are used to determine which cryptographic register (crx) holds the desired s-box output. The least significant 4 bits of (index) serve as the offset value within (crx). Once the cryptographic register that holds the s-box output is known, the instruction (rd_tab_creg crx) reads the s-box output from (crx) and puts it in a special register (s_out) (Figure 7(d)). The instruction (mv_sout2st stx) first rotates (stx) to the right by 1 byte and then puts the content of the (s_out) register in the most significant byte of the state register (stx).

When one table lookup operation for 1 byte of the AES block is completed, the content of the (index) register is shifted to the right by 1 byte. As a result, the table lookup operation for the second byte can start. When the lookup operations for the 4 bytes read from the state register into the (index) register are completed, the next 4 bytes are transferred from the next state register (i.e., st1) to the (index) register, and the same operations are applied to the (index) register content as well. The table lookup operation for one round finishes when 16 table accesses are completed. Note that except for the CRF, the other registers are of special type and cannot be accessed directly. Only the special instructions shown in Table 9 can access the content of these registers.
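The table-lookup sequence described above can be modeled in software as follows. This is our own illustrative C model of the described data flow, not the hardware implementation; register and instruction names follow the text, and we simplify by updating the state registers in place rather than keeping separate old and new copies:

```c
#include <stdint.h>

static uint8_t  crf[16][16];  /* cr0..cr15: 16 s-box bytes each      */
static uint32_t st[4];        /* st0..st3: the 128-bit AES state     */
static uint32_t index_reg;    /* the special (index) register        */
static uint8_t  s_out;        /* the special (s_out) register        */

static void rd_tab_creg(void) {
    uint8_t b = index_reg & 0xFF;   /* least significant byte        */
    s_out = crf[b >> 4][b & 0xF];   /* upper nibble selects crx,
                                       lower nibble is the offset    */
}

static void mv_sout2st(int x) {     /* rotate stx right by one byte,
                                       insert s_out at the top       */
    st[x] = (st[x] >> 8) | ((uint32_t)s_out << 24);
}

/* the 16 lookups of one round: 4 bytes per state register */
static void table_lookup_round(void) {
    for (int w = 0; w < 4; w++) {
        index_reg = st[w];          /* mv_st2index stw               */
        for (int b = 0; b < 4; b++) {
            rd_tab_creg();
            mv_sout2st(w);
            index_reg >>= 8;        /* expose the next byte          */
        }
    }
}

/* self-test with a toy "s-box" (bitwise complement of the input) */
int lookup_selftest(void) {
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 16; j++)
            crf[i][j] = (uint8_t)~((i << 4) | j);
    st[0] = 0x03020100; st[1] = st[2] = st[3] = 0;
    table_lookup_round();
    return st[0] == 0xFCFDFEFF;     /* every byte complemented       */
}
```

After four rotate-and-insert steps, each 32-bit state word has been fully replaced by the table outputs of its four bytes, which mirrors the instruction sequence walked through above.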

Table 9. Special instructions for AES implementation.
Instruction name | Syntax | Definition
rd_tab_creg | rd_tab_creg crx | s_out := crx[index & 0xF]
shlmod | shlmod std, sts, arx | std[i] := (sts[i] << 1) ⊕ (sts[i]_7 · arx), i = 0,1,2,3
rowop | rowop str, sts | s_out := index_7·sts[3] ⊕ index_6·sts[2] ⊕ index_5·sts[1] ⊕ index_4·sts[0] ⊕ index_3·str[3] ⊕ index_2·str[2] ⊕ index_1·str[1] ⊕ index_0·str[0]
mv_st2index | mv_st2index std | index := (0,0,0,std[3]) and std := (0,std[3],std[2],std[1])
mv_sout2st | mv_sout2st std | std := (s_out,std[3],std[2],std[1])
mv_cr2st | mv_cr2st std, crs | std := index_0·(crs[3],crs[2],crs[1],crs[0]) ⊕ index_1·(crs[7],crs[6],crs[5],crs[4]) ⊕ index_2·(crs[11],crs[10],crs[9],crs[8]) ⊕ index_3·(crs[15],crs[14],crs[13],crs[12])

Note that the instructions discussed so far are not designed to benefit the AES algorithm in particular. They can benefit many block cipher algorithms that utilize relatively small s-boxes such as DES/3DES, AES (Rijndael), Serpent, and Twofish. One instruction that may be considered specific to AES is shlmod std, sts, arx (Table 9), which performs four simultaneous shift-left operations by 1 bit in GF(2^8). If the irreducible polynomial of the GF(2^8) we work in is p(x) = x^8 + r(x), then the architectural register (arx) in the instruction is initialized to r(x). For instance, the irreducible polynomial of GF(2^8) used in AES is x^8 + x^4 + x^3 + x + 1, and therefore (arx := 0x1B). In Table 9, std[i] stands for the ith byte of the destination state register (std), and sts[i]_7 for the most significant bit of the ith byte of the source state register (sts). The instruction (shlmod) works for any irreducible polynomial and can benefit applications using GF(2^8) arithmetic.
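On a single byte, the operation performed by shlmod is the familiar xtime step of GF(2^8) arithmetic; a minimal sketch (ours), parameterized by r(x) exactly as the instruction is:

```c
#include <stdint.h>

/* One-byte version of the shlmod step: multiply by x in GF(2^8)
   modulo x^8 + r(x).  The hardware instruction applies this to
   four bytes of a state register at once. */
uint8_t xtime_mod(uint8_t b, uint8_t r) {
    /* shift left by one bit; if the high bit falls out, reduce by r(x) */
    return (uint8_t)((b << 1) ^ ((b & 0x80) ? r : 0));
}
```

With the AES polynomial, r = 0x1B, so for example xtime_mod(0x57, 0x1B) yields 0xAE.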

Another instruction used in our AES implementation is (rowop str, sts), which takes two words (32-bit variables) stored in the (str) and (sts) registers and XORs certain bytes of these two words, determined by the bits of the (index) register, where (index_i) stands for the ith least significant bit of the (index) register (Table 9). The resulting byte is stored in the (s_out) register. This instruction is useful in matrix arithmetic where the elements of the matrix are in GF(2^8).
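A software model of rowop, following the bit-selection rule given in Table 9 (bit i of index selects str[i], bit 4+i selects sts[i]; the C function itself is our sketch):

```c
#include <stdint.h>

/* rowop model: XOR together the bytes of two 32-bit registers
   that are selected by the low eight bits of the index register. */
uint8_t rowop_model(uint32_t str_reg, uint32_t sts_reg, uint8_t idx) {
    uint8_t acc = 0;
    for (int i = 0; i < 4; i++) {
        if (idx & (1u << (4 + i)))           /* index_4..7 -> sts[i] */
            acc ^= (uint8_t)(sts_reg >> (8 * i));
        if (idx & (1u << i))                 /* index_0..3 -> str[i] */
            acc ^= (uint8_t)(str_reg >> (8 * i));
    }
    return acc;
}
```

For example, with idx = 0x11 only str[0] and sts[0] are XORed; with idx = 0xFF all eight bytes are accumulated, as needed when computing one output byte of a GF(2^8) matrix-vector product.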

The rest of the new instructions in Table 9 are useful for moving data between the special registers and the cryptographic registers. They are generic because data must be moved around whenever the CU is used. They are easy to implement, do not incur significant area overhead, and are definitely not in the critical path of the processor.

As can be observed from the discussions in this section, our approach is not to integrate powerful instructions that can provide superior performance, specific to the cryptographic algorithm in question and expensive to implement. Our design principle is to propose simple and inexpensive instructions that can benefit a wide range of cryptographic algorithm implementations while providing a secure and isolated execution.

Time performances of different implementations of the AES algorithm

In this section, we compare the time performances of four different (and state-of-the-art) implementations of AES.

The first implementation is taken from Ref. [36], which is one of the most efficient (i.e., the fastest) implementations of AES in software; it will be referred to as the large lookup table implementation because it uses relatively large tables. Naturally, this implementation is vulnerable to cache-based side-channel attacks. The second implementation is referred to as the small lookup table implementation and uses a 256-byte lookup table. This is a straightforward implementation and may be vulnerable to cache-based attacks as well.

The third implementation, referred to as hardened in Ref. [18], utilizes the CRF to store the lookup table. It is secure against cache attacks but does not run in an isolated zone. The hardened implementation in Ref. [18] gives the overhead in number of clock cycles per round to protect a particular round. The most powerful attacks focus either on the first round of AES as in Ref. [9] or on the last round as in Ref. [6] because these rounds directly interact with the outside world by taking the plaintext or outputting the ciphertext, both of which are easily observable by an adversary. Therefore, it is of utmost importance to protect especially the first and last rounds. All the same, it would be prudent to protect the first two and the last two rounds of AES in case effective attacks are discovered against the second (or the ninth) round of an AES implementation whose first and last rounds are already protected.

Table 10 lists the overhead values (in number of clock cycles) round-wise for a single block encryption of 128 bits.

Table 10. Overhead of protecting the rounds of AES against cache attacks (in clock cycles).
Baseline [36] | First | Last | First + last | Per round
796 | 171 (21.5%) | 33 (4.5%) | 199 (25%) | 178 (≈22.4%)

Finally, the last implementation, which we prefer to call isolated, uses the lookup table in the CRF and does not use the memory, cache, or architectural registers to store confidential data. None of the confidential values such as the secret key, round keys, and intermediate blocks from AES rounds leaves the protected zone during the AES computation. Under the assumption that the CU is manufactured as tamper proof, the isolated implementation of AES can even withstand cold-boot attacks.

The time performances of the four aforementioned AES implementations are given in Table 11. As can be observed from the figures in the table, the large lookup table implementation performs much better than the other three implementations. This is because the large lookup table implementation mainly consists of lookup operations on the five large tables stored in memory. As long as the processor provides fast memory accesses through the use of first-level or second-level caches, it is almost impossible to provide a better performance except for implementations exploiting hardware support (e.g., [37]). This implementation, however, has been demonstrated to be vulnerable to cache attacks.

Table 11. Time performance of the four software implementations of AES.
Implementation | Time performance (clock cycles) | Characteristics
Large lookup table [36] | 796 | Fast, insecure
Small lookup table | 2654 | Moderate speed, insecure, no isolation
Hardened (first and last rounds protected) [18] | 995 (est.) | Relatively fast, secure, no isolation
Hardened (first two and last two rounds protected) [18] | 1356 (est.) | Relatively fast, secure, no isolation
Hardened (all rounds protected) [18] | 2424 (est.) | Moderate speed, secure, no isolation
Isolated (this work) | 2620 | Moderate speed, secure, isolated

The small lookup table implementation, which can generally be considered insecure against cache attacks, provides a moderate performance, whereas the hardened implementation (all rounds protected) provides, as expected, an 8.6% improvement. One can always selectively protect the AES rounds in order to increase the performance of the hardened AES implementation, which can increase the speedup over the small lookup table implementation.

The final AES implementation executes the AES encryption operation in complete isolation within the CU. Its performance is comparable with those of the small lookup table and hardened implementations of AES. Because it does not use any shared functional units (memory, cache, architectural registers) other than those inside the CU for computations involving security-sensitive values, it provides very extensive security for a software implementation. A slight performance degradation compared with the hardened implementation is unavoidable because the latter uses the best of both the CU and the other existing (and speed-optimized) functional units.



Adding new instructions and functional units inevitably introduces additional costs in terms of area and time complexity. For an embedded processor, it is essential that the extra hardware cost not exceed the benefits of the enhancements. It is, however, difficult to give a complete perspective on cost-benefit issues when some of the benefits are of qualitative nature. The advantages of the proposed CU can indeed be threefold: (i) speed, (ii) security, and (iii) isolated execution. Not all cryptographic algorithms utilizing the CU can benefit from all the advantages. For instance, RSA and ECC gain up to 10 times acceleration because the CU is originally designed to support multiprecision arithmetic. AES, on the other hand, can mainly benefit from the latter two advantages: security and isolated execution without increasing the execution time.

Although RSA and ECC can also benefit from these two advantages, the current implementation does not execute RSA and ECC operations in complete isolation because they require a relatively small amount of cache-like on-chip memory to hold sensitive information during the cryptographic calculations. Cryptographic calculations definitely benefit from fast memory accesses via on-chip storage. However, on-chip memory for cryptographic computation is an important subject per se; it is therefore beyond the scope of this work and left as future work.

We synthesized our design into both ASIC and FPGA target devices using the Xtensa LX2 embedded processor core by Tensilica [31]. For the ASIC implementation, we used the estimates provided by the Tensilica tools for 0.13 µm CMOS technology. Table 12 shows the hardware costs of the various functional units in terms of equivalent gate count.

Table 12. Hardware cost of cryptographic unit in ASIC implementation.
Area utilization | Without CU | With CU
Base processor | 77 000 | 77 000
New operations | 0 | 27 195
New state registers | 0 | 10 004
New register files | 0 | 60 496
New functions | 0 | 4399
Total area | 77 000 | 179 094
Max clock frequency | 270 MHz | 270 MHz

The base processor in Table 12 refers to a simple 32-bit embedded processor optimized for ASIC implementation. Note that the figures in the table are estimates, and the additional units in the CU are not optimized for ASIC implementation. Full ASIC realization of reconfigurable Tensilica processors requires further work, and the required tools are not available to the research community.

Note also that there is an area increase for the seven-stage pipelined versions of both the base and enhanced processors. For the base processor, the increase is approximately 11.72%, whereas the increase for the CU is 6.82%. Consequently, the increase in the total area of the enhanced processor is 9.31% if it is implemented as a seven-stage pipeline.

We also synthesized the design into an FPGA target device and generated the bit files to program the Avnet LX200 board, which features a Xilinx Virtex-4 FPGA. This implies that the implementation figures are obtained after placement and routing. The timing constraint is chosen as 30 ns, as suggested by Tensilica. The implementation results are listed in Table 13.

Table 13. Hardware cost of cryptographic unit in FPGA implementation.
Logic utilization | Without CU | With CU
Total number of slice registers | 9582 | 17 939
Number of occupied slices | 14 363 | 31 381
Total number of four-input lookup tables (LUTs) | 22 487 | 49 089
Total equivalent gate count | 224 014 | 475 308
Max clock frequency | 50.914 MHz | 50.594 MHz

Because the maximum clock frequency that can be applied to the Avnet LX200 board is specified as 50 MHz, the proposed CU is not on the critical path of the processor, which is one of our design goals. There is a significant increase in area usage, for which we have several justifications:

  • The CU provides extensive and multifaceted support for cryptographic algorithm implementations in software. It has its own execution pipeline, functional units, register files, special registers, and more importantly, execution semantics that ensure security and isolation.
  • The CU is not optimized for the FPGA in question. A better analysis of the proposed changes and careful mapping to FPGA resources can decrease the area overhead.
  • Increase in area is compensated by the increase in the speedup in certain classes of cryptographic algorithms such as ECC and RSA where the time performance is much more important.

In order to clarify the last point, we use the time × area metric to measure the true advantage of the CU for the ECC and RSA algorithms as far as the time performance is concerned. We choose the ECC and RSA algorithms because the unit is primarily designed to accelerate them. Table 14 provides the time × area metric of the 1024-bit RSA implementation for both the no-windowing and 4-bit windowing options. The time × area metric is normalized to the 4-bit windowing implementation of RSA on the processor with the CU.

Table 14. Time × area product for RSA.
Core | Operation | Area (slices) | Clock cycles | Time × area (normalized)
Base | RSA (4-bit w.) | 14 363 | 132 361 636 | 5.15
With CU | RSA (4-bit w.) | 31 381 | 11 753 299 | 1.00
Base | RSA (no w.) | 14 363 | 156 812 860 | 6.11
With CU | RSA (no w.) | 31 381 | 13 642 547 | 1.16
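The normalized figures in Table 14 can be reproduced directly from the raw slice and cycle counts; for the base-core 4-bit windowing row (the function name is ours):

```c
/* Normalized time x area: (area * cycles) of a configuration divided
   by (area * cycles) of the reference configuration (Table 14 uses
   the CU core with 4-bit windowing as the reference). */
double time_area_norm(double area, double cycles,
                      double ref_area, double ref_cycles) {
    return (area * cycles) / (ref_area * ref_cycles);
}
```

For example, time_area_norm(14363, 132361636, 31381, 11753299) evaluates to about 5.15, matching the first row of the table.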

Table 15 provides the time × area values of point multiplication operations for ECC. Note that the time × area metric is normalized to the 160-bit point multiplication operation on the processor with the CU.

Table 15. Time × area for elliptic-curve cryptography.
Core | Operation | Area (slices) | Clock cycles | Time × area (normalized)
Base | 160-bit ECC | 14 363 | 5 684 844 | 0.97
With CU | 160-bit ECC | 31 381 | 2 695 097 | 1.00
Base | 192-bit ECC | 14 363 | 9 774 069 | 1.66
With CU | 192-bit ECC | 31 381 | 3 673 000 | 1.36
Base | 256-bit ECC | 14 363 | 21 509 576 | 3.65
With CU | 256-bit ECC | 31 381 | 4 412 633 | 1.64
Base | 512-bit ECC | 14 363 | 160 109 439 | 27.19
With CU | 512-bit ECC | 31 381 | 19 748 196 | 7.33

Finally, Table 16 presents the absolute performance improvements for the 1024-bit RSA and ECC point multiplication operations. Except for the 160-bit ECC, the benefit of the CU far exceeds its associated cost.

Table 16. Improvements for RSA and elliptic-curve cryptography.
Operation | Improvement
RSA (4-bit w.) | 5.15
RSA (no w.) | 6.50
160-bit ECC | 0.97
192-bit ECC | 1.22
256-bit ECC | 2.23
512-bit ECC | 3.71

Comparing against software implementations on other architectures

In order to give an idea as to how our implementations in the CU compare with those on better-known and popular architectures, we implemented the Montgomery multiplication and elliptic-curve point operations on the ARM7TDMI processor. The CIOS version of the Montgomery multiplication operation is written in ARM assembly language and highly optimized. Our implementation on the Xtensa processor is written only in the C language with no assembly support. As can be observed from Table 17, our implementations of the ECC and RSA operations on the CU provide significant speedups over those implemented on the ARM7TDMI. The precomputation method in the table refers to the fixed-base comb method, which uses a relatively large amount of precomputed lookup tables.

Table 17. Comparison with ARM7TDMI (figures in number of clock cycles).
Operation | Assembly on ARM7TDMI | C on the CU @ Xtensa | Improvement
160-bit Montgomery multiplication | 1464 | 1047 | 1.40
192-bit Montgomery multiplication | 2008 | 1196 | 1.68
256-bit Montgomery multiplication | 3384 | 931 | 3.63
1024-bit Montgomery multiplication | 45 600 | 7654 | 5.96
1024-bit RSA | 70 041 600 (est.) | 11 753 299 | 5.96
160-bit ECC point multiplication | 3 554 000 | 2 695 097 | 1.32
192-bit ECC point multiplication | 5 688 000 | 3 673 000 | 1.55
256-bit ECC point multiplication | 11 696 000 | 4 412 634 | 2.65
512-bit ECC point multiplication | 84 352 000 | 19 798 723 | 4.26
160-bit ECC point multiplication with precomputation | 688 000 | 342 901 | 2.01
192-bit ECC point multiplication with precomputation | 1 096 000 | 462 840 | 2.37
256-bit ECC point multiplication with precomputation | 2 320 000 | 545 860 | 4.25
512-bit ECC point multiplication with precomputation | 15 448 000 | 2 409 272 | 6.41
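The CIOS Montgomery multiplication mentioned above can be sketched as follows, after the well-known coarsely integrated operand scanning formulation. The two-word (64-bit) setup and all names are purely illustrative; the paper's implementation works with 1024-bit operands:

```c
#include <stdint.h>

#define W 2  /* 32-bit words per operand; 32 words for 1024 bits */

/* CIOS Montgomery multiplication: r = a * b * R^-1 mod n, where
   R = 2^(32*W), n is odd, a, b < n, and n0inv = -n^-1 mod 2^32. */
void mont_mul_cios(uint32_t r[W], const uint32_t a[W], const uint32_t b[W],
                   const uint32_t n[W], uint32_t n0inv)
{
    uint32_t t[W + 2] = {0};
    for (int i = 0; i < W; i++) {
        uint64_t cs = 0;                     /* t += a * b[i]         */
        for (int j = 0; j < W; j++) {
            cs = (uint64_t)t[j] + (uint64_t)a[j] * b[i] + (cs >> 32);
            t[j] = (uint32_t)cs;
        }
        cs = (uint64_t)t[W] + (cs >> 32);
        t[W] = (uint32_t)cs;
        t[W + 1] = (uint32_t)(cs >> 32);
        uint32_t m = t[0] * n0inv;           /* make low word vanish  */
        cs = (uint64_t)t[0] + (uint64_t)m * n[0];
        for (int j = 1; j < W; j++) {        /* t = (t + m*n) / 2^32  */
            cs = (uint64_t)t[j] + (uint64_t)m * n[j] + (cs >> 32);
            t[j - 1] = (uint32_t)cs;
        }
        cs = (uint64_t)t[W] + (cs >> 32);
        t[W - 1] = (uint32_t)cs;
        t[W] = t[W + 1] + (uint32_t)(cs >> 32);
    }
    uint32_t d[W];                           /* final subtraction     */
    uint64_t borrow = 0;
    for (int j = 0; j < W; j++) {
        uint64_t diff = (uint64_t)t[j] - n[j] - borrow;
        d[j] = (uint32_t)diff;
        borrow = (diff >> 32) & 1;
    }
    for (int j = 0; j < W; j++)
        r[j] = (t[W] || !borrow) ? d[j] : t[j];
}

static uint32_t neg_inv32(uint32_t n0) {     /* -n0^-1 mod 2^32       */
    uint32_t x = 1;
    for (int k = 0; k < 5; k++) x *= 2 - n0 * x;  /* Newton iteration */
    return 0u - x;
}

/* round trip: mont(mont(a, 1), R^2 mod n) must give back a */
int mont_selftest(void) {
    const uint64_t n64 = 0x1234567890ABCDEFULL;   /* odd toy modulus  */
    const uint64_t a64 = 0x0FEDCBA987654321ULL;
    uint64_t rm = (uint64_t)(((unsigned __int128)1 << 64) % n64);
    uint64_t r2 = (uint64_t)((unsigned __int128)rm * rm % n64);
    uint32_t n[W]   = { (uint32_t)n64, (uint32_t)(n64 >> 32) };
    uint32_t a[W]   = { (uint32_t)a64, (uint32_t)(a64 >> 32) };
    uint32_t one[W] = { 1, 0 };
    uint32_t r2w[W] = { (uint32_t)r2, (uint32_t)(r2 >> 32) };
    uint32_t t1[W], t2[W];
    uint32_t ninv = neg_inv32(n[0]);
    mont_mul_cios(t1, a, one, n, ninv);      /* a * R^-1 mod n        */
    mont_mul_cios(t2, t1, r2w, n, ninv);     /* back to a             */
    return t2[0] == a[0] && t2[1] == a[1];
}
```

The interleaving of the multiplication and reduction inner loops is what makes CIOS attractive on register-starved embedded cores: only W + 2 result words are live at any time.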

Finally, to give an idea of how fast the elliptic-curve operations would execute in a real application scenario, we provide the actual times in milliseconds for an FPGA device running at a relatively low clock speed of 50 MHz. For instance, a 160-bit EC point multiplication operation can be finished in as little as about 6.86 ms in the CU implemented on the Xtensa LX2 processor, if the precomputation method is used.
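The conversion from cycle counts to wall-clock time is simply t = cycles / f; for the precomputation-based 160-bit point multiplication quoted above (the function name is ours):

```c
/* Execution time in milliseconds for a cycle count at clock f_hz */
double exec_ms(double cycles, double f_hz) {
    return 1000.0 * cycles / f_hz;
}
```

For example, exec_ms(342901, 50e6) evaluates to about 6.86 ms, the figure given above.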

Comparing against implementations on other architectures with dedicated crypto-blocks/cores

Another approach to accelerate cryptographic operations is to utilize crypto-blocks, each of which is dedicated to speed up one type of cryptographic algorithm. For instance, a RISC-like crypto-processor in Ref. [25] supports both RSA and Secure Hash Algorithm 1 (SHA-1) algorithms in one unified functional unit. Ref. [26] proposes a RISC processor extension only for the KASUMI encryption algorithm. Likewise, an embedded cryptosystem in Ref. [28] hosts a number of coprocessors to accelerate ECC, SHA-1, RSA, and AES. Finally, a similar crypto-processor in Ref. [29] features crypto-blocks for AES, KASUMI, triple-DES, ECC, and RSA. Naturally, these architectures provide superior time performance over the proposed CU because they take full advantage of hardware architectures that are tailored to specific needs. However, they lack the flexibility, agility, and programmability, which are some of the foremost advantages of our design. For instance, our CU can be utilized to accelerate pairing operation over prime fields, where similar speedup values to those of ECC and RSA can be anticipated.

All the same, it is beneficial to demonstrate the performance figures of our architecture together with those obtained from dedicated architectures to better appreciate the advantages and disadvantages of the two approaches. Some of the relevant figures for comparison are listed in Table 18.

Table 18. Comparison with dedicated crypto-blocks/cores.
Architecture | Cryptographic algorithm | Technology | Frequency (MHz) | Execution time
Dedicated [25] | RSA-1024 | ASIC (0.13 µm CMOS) | 196 | 190 ms
Dedicated [25] | RSA-1024 | FPGA | 100.25 | 31.93 ms
Dedicated [28] | ECC-163 (binary Koblitz) | (Altera Stratix) | 95 | 92 µs
Dedicated [28] | AES encryption | EP1S40F780C5 | 100 | 0.43 µs
Dedicated [28] | RSA-1024 | FPGA | 28 | 58.9 ms
Dedicated [29] | ECC-146 (binary) | — | 50 | 7.28 ms
Dedicated [29] | RSA-1024 | FPGA | 50 | 235 ms
This work | ECC-160 (prime) | (Xilinx Virtex-4) | 50 | 53.9 ms
This work | AES encryption | (Xilinx Virtex-4) | 50 | 52.4 µs

Note that a direct comparison between the figures in Table 18 is unfair because many issues such as the target technology, the clock frequency used, the mathematical structures employed, and the architectural approaches adopted have a significant impact on the time performance of each implementation. For instance, the ECC implementation in Ref. [28] makes use of the highly efficient binary Koblitz curves, whereas we use generic curves over prime fields without exploiting any optimization in the underlying finite field structure. Even when all these factors are considered, our architecture cannot provide the same time performance as those with dedicated blocks. This places our approach somewhere in between purely software implementations, which enjoy flexibility, agility, and programmability, and implementations on dedicated crypto-blocks, which exploit all the advantages of hardware architectures.

A note on AES time performance and comparison with other implementations

As pointed out earlier, speed is not our main design goal for the AES implementation; hence, we only add instructions to better exploit the CU for secure and isolated execution of AES. From another standpoint, however, the implementation should also give a decent time performance. In order to evaluate the time performance of our AES implementation, we enumerate the time performances of other state-of-the-art AES implementations in Table 19.

Table 19. Comparison of AES implementations.
Implementation | Hardware support | Performance (cycles)
[38] on ARM7TDMI | — | 1675
[39] on AMD Opteron | — | 2699
[30] on CRISP | Bit-sliced + lookup tables | 2203
[30] on CRISP | Bit-sliced + lookup tables + bit-level permutation | 1222
This work | On CU | 2620

Time performances of different implementations vary for several reasons. First, different processor architectures can render certain optimizations easier. Second, if an implementation is secure against cache attacks, this protection incurs an overhead in the cycle count (e.g., Ref. [39] utilizes the bit-slicing technique to protect AES against cache attacks). Third, the types and extent of the hardware support play a very decisive role in the time performance. For instance, the implementation in Ref. [30] combines three hardware-based techniques to accelerate the AES implementation to a great extent. Our implementation, even though not providing the best time performance, does not suffer a considerable deterioration in speed. Furthermore, if the same hardware-based techniques, such as bit permutation, were applied in our design, a considerable speedup would be obtained.



We designed and implemented a CU, for the secure and fast execution of a wide range of cryptographic algorithms, that can be integrated into any RISC processor architecture. To show the design's efficiency and applicability, we integrated the proposed CU into the execution pipeline of a low-cost, extensible, embedded processor core. We obtained considerable speedups for basic multiprecision arithmetic operations such as modular multiplication and inversion, which are the dominant operations in many public key cryptosystems. We found that the speedup values gained through the CU for ECC and RSA are up to 8 and 11 times, respectively. We demonstrated that the CU can also be used to harden, with low overhead, software implementations of symmetric ciphers against certain side-channel attacks (i.e., cache-based attacks). We also established that the hardware overhead of the proposed CU in terms of chip area is acceptable even for embedded processors. A comparison of the obtained speedup values and the incurred hardware overhead clearly confirms that the benefits of the CU far exceed its cost.

The main structure of the CU resembles the IU of a RISC-style general-purpose processor, and its interface is the same as that of a typical RISC ISA. For instance, instructions are register based and take at most three operands. To demonstrate that the CU can provide a similar benefit for different implementations of a RISC processor, we integrated it into both the five-stage and seven-stage pipelined versions of the same embedded processor. Although there is a slight degradation in area and speed (in terms of clock cycle count) for the seven-stage pipeline, which is expected, the degradation is less in the CU than in the other parts of the processor. This clearly shows that the CU design is indeed generic and especially beneficial for cryptographic operations.

We realized the enhanced processor on an FPGA device and determined that the CU is not on the critical path of the processor; therefore, we were able to achieve the maximum clock frequency of 50 MHz for the target device.

We demonstrated, in the particular case of AES, that a cryptographic algorithm can be executed inside the CU in complete isolation. For other and more complicated public key algorithms, more extensive support within the CU is needed. For example, the isolated implementation of AES was only made possible by an assembly implementation. Implementing RSA and ECC in a secure and isolated fashion would be overly complicated in assembly; therefore, support for high-level language implementations of public key algorithms is deemed necessary. In addition, a small on-chip, noncached scratch memory is needed to store the excessive amount of sensitive data in the CU. These and similar issues are left as future work.

  • An instruction that needs the result of a 128-bit multiplication instruction may have to wait for the result, and this may lead to long stalls in the pipeline (hence, data dependency).


References
1. Kocher PC. Timing attacks on implementations of Diffie–Hellman, RSA, DSS, and other systems. In CRYPTO, ser. Lecture Notes in Computer Science, Vol. 1109, Koblitz N (ed). Springer: Santa Barbara, California, USA, 1996; 104–113.
2. Kocher PC, Jaffe J, Jun B. Differential power analysis. In CRYPTO, ser. Lecture Notes in Computer Science, Vol. 1666, Wiener MJ (ed). Springer: Santa Barbara, California, USA, 1999; 388–397.
3. Coron J-S, Naccache D, Kocher PC. Statistics and secret leakage. ACM Trans. Embedded Comput. Syst. 2004; 3(3): 492–508.
4. Oswald E, Mangard S. Template attacks on masking—resistance is futile. In CT-RSA, ser. Lecture Notes in Computer Science, Vol. 4377, Abe M (ed). Springer: San Francisco, CA, USA, 2007; 243–256.
5. Bernstein D. Cache-timing attacks on AES. Website, 2005, accessed on April 2, 2012.
6. Aciiçmez O, Koç ÇK. Trace-driven cache attacks on AES (short paper). In ICICS, ser. Lecture Notes in Computer Science, Vol. 4307, Ning P, Qing S, Li N (eds). Springer Verlag: Berlin, 2006; 112–121.
7. Aciiçmez O, Schindler W, Koç ÇK. Cache-based remote timing attacks on the AES. In CT-RSA, ser. Lecture Notes in Computer Science, Vol. 4377, Abe M (ed). Springer Verlag: Berlin, 2007; 271–286.
8. Blömer J, Krummel V. Analysis of countermeasures against access driven cache attacks on AES. In Selected Areas in Cryptography, ser. Lecture Notes in Computer Science, Vol. 4876, Adams CM, Miri A, Wiener MJ (eds). Springer: Ottawa, Canada, 2007; 96–109.
9. Osvik DA, Shamir A, Tromer E. Cache attacks and countermeasures: the case of AES. In CT-RSA, ser. Lecture Notes in Computer Science, Vol. 3860, Pointcheval D (ed). Springer: San Jose, California, USA, 2006; 1–20.
10. Advanced Micro Devices. AMD64 virtualization: secure virtual machine architecture manual. AMD Publication no. 33047, rev. 3.01, May 2005.
11. Intel. LaGrande technology preliminary architecture specification. Intel Publication no. D52212, May 2006.
12. ARM. TrustZone technology overview, accessed on April 2, 2012.
13. McCune JM, Parno B, Perrig A, Reiter MK, Isozaki H. Flicker: an execution infrastructure for TCB minimization. In EuroSys, Sventek JS, Hand S (eds). ACM: Glasgow, Scotland, 2008; 315–328.
14. McCune JM, Parno B, Perrig A, Reiter MK, Seshadri A. How low can you go? Recommendations for hardware-supported minimal TCB code execution. In ASPLOS, Eggers SJ, Larus JR (eds). ACM: Seattle, WA, USA, 2008; 14–25.
15. TCG Software Stack (TSS). Specification version 1.2, level 1. Part 1: commands and structures. Trusted Computing Group, Incorporated, January 6, 2006.
16. Chen L, Morrissey P, Smart NP. Pairings in trusted computing. In Pairing, ser. Lecture Notes in Computer Science, Vol. 5209, Galbraith SD, Paterson KG (eds). Springer: Egham, UK, 2008; 1–17.
17. Brickell E, Chen L, Li J. A new direct anonymous attestation scheme from bilinear maps. In TRUST, ser. Lecture Notes in Computer Science, Vol. 4968, Lipp P, Sadeghi A-R, Koch K-M (eds). Springer: Villach, Austria, 2008; 166–178.
18. Kocabaş Ö, Savaş E, Großschädl J. Enhancing an embedded processor core with a cryptographic unit for speed and security. In RECONFIG '08: Proceedings of the 2008 International Conference on Reconfigurable Computing and FPGAs. IEEE Computer Society: Washington, DC, 2008; 409–414.
  • 19
    Yumbul K, Savas E. Efficient, secure, and isolated execution of cryptographic algorithms on a cryptographic unit. In SIN, Elçi A, Makarevich OB, Orgun MA, Chefranov AG, Pieprzyk J, Bryukhomitsky YA, Örs SB (eds). ACM: Gazimagusa, North Cyprus, 2009; 143151.
  • 20
    Großschädl J, Savaş E. Instruction set extensions for fast arithmetic in finite fields GF(p) and GF(2m). In CHES, ser. LNCS, Vol. 3156, Joye M, Quisquater J-J (eds). Springer: Cambridge, MA, USA, 2004; 133147.
  • 21
    Großschädl J, Tillich S, Szekely A. Performance evaluation of instruction set extensions for long integer modular arithmetic on a SPARC V8 Processor. In DSD. IEEE, 2007; 680689.
  • 22
    Vejda T, Page D, Großschädl J. Instruction set extensions for pairing-based cryptography. In Pairing, ser. LNCS, Vol. 4575, Takagi T, Okamoto T, Okamoto E, Okamoto T (eds). Springer: Tokyo, Japan, 2007; 208224.
  • 23
    Tillich S, Großschädl J. Instruction set extensions for efficient AES implementation on 32-bit processors. In CHES, ser. LNCS, Vol. 4249, Goubin L, Matsui M (eds). Springer: Yokohama, Japan, 2006; 270284.
  • 24
    Fiskiran AM, Lee RB. On-Chip lookup tables for fast symmetric-key encryption. In ASAP. IEEE Computer Society: Samos, Greece, 2005; 356363.
  • 25
    Huang W, You K, Zhang S, Han J, Zeng X. Unified low cost crypto architecture accelerating RSA/SHA-1 for security processor. In ASIC, 2009. ASICON '09. IEEE 8th International Conference on, Oct. 2009; 151154.
  • 26
    Balderas-Contreras T, Cumplido R, Uribe R. On the design and implementation of a RISC processor extension for the KASUMI encryption algorithm. Computers and Electrical Engineering 2008; 34(6): 531546.
  • 27
    Kim H-W, Choi Y-J, Kim M-S. Design and implementation of a crypto processor and its application to security system. Online—Last Accessed, 01 August 2011,, accessed on April 2, 2012.
  • 28
    Hani MK, Wen HY, Paniandi A. Design and implementation of a private and public key crypto processor for next-generation it security applications. Malaysian Journal of Computer Science 2006; 19(1): 2945.
  • 29
    Kim HW, Lee S. Design and implementation of a private and public key crypto processor and its application to a security system. Consumer Electronics, IEEE Transactions on 2004; 50(1): 214224.
  • 30
    Grabher P, Großschädl J. Page, light-weight instruction set extensions for bit-sliced cryptography. In CHES, ser. Lecture Notes in Computer Science. Vol. 5154, Oswald E, Rohatgi P (eds). Springer: Washington, D.C., USA, 2008; 331345.
  • 31
    Tensilica. Xtensa LX2 embedded processor core. Website,, accessed on April 2, 2012.
  • 32
    Koç ÇK, Acar T, Kaliski BS Jr. Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro 1996; 16(3): 2633.
  • 33
    Kaliski BS Jr. The Montgomery inverse and its applications. IEEE Transactions on Computers 1995; 44(8): 10641065.
  • 34
    Ravi S, Raghunathan A, Potlapally NR, Sankaradass M. System design methodologies for a wireless security processing platform. In DAC. ACM, 2002; 777782.
  • 35
    Cohen H, Miyaji A, Ono T. Efficient elliptic curve exponentiation using mixed coordinates. In ASIACRYPT, ser. LNCS, Vol. 1514, Ohta K, Pei D (eds). Springer: Beijing, China, 1998; 5165.
  • 36
    Barreto P. The AES block cipher in C++. Website, 2003,
  • 37
    Intel. Advanced encryption standard (AES) instructions set, April 2008,
  • 38
    Bertoni G, Breveglieri L, Fragneto P, Macchetti M, Marchesin S. Efficient software implementation of AES on 32-bit platforms. In CHES, ser. Lecture Notes in Computer Science, Vol. 2523, Kaliski BS Jr, Koç ÇK, Paar C (eds). Springer: Redwood Shores, CA, USA, 2002; 159171.
  • 39
    Könighofer R. A fast and cache-timing resistant implementation of the AES. In CT-RSA, ser. Lecture Notes in Computer Science, Vol. 4964, Malkin T (ed). Springer: San Francisco, CA, USA, 2008; 187202.