Efficient software implementations of many block ciphers rely on lookup tables, which naturally makes them vulnerable to cache-based side-channel attacks [5, 7-9]. The lookup tables implement the nonlinear functions (s-boxes) of the block cipher, which may be the most important part of the algorithm with respect to both performance and security. These tables usually fit in the first-level or second-level caches of modern processors. To the best of our knowledge, the most efficient AES implementation in software (not relying on hardware support in the form of special instructions) uses four 1-KB tables for the first nine rounds of 128-bit AES. Another table of the same size is used for the last round. Many cache-based attacks [5-7] exploit the access patterns of the cryptographic process to cache lines, which may contain the desired table item (cache hit) or not (cache miss). A spy process running concurrently with the cryptographic computation can learn the access patterns of the cryptographic process by creating carefully timed access patterns of its own to the same cache.
Architectural enhancements allowing protected execution of AES
Although larger s-boxes may be preferable from a security point of view, in practice they are usually of moderate size because of implementation concerns. Table 8 lists the lookup table sizes needed to implement the s-boxes of some of the well-known block cipher algorithms. The lookup tables for these block cipher algorithms can be implemented in our CRF shown in Figure 7(a) because the CRF is 512 bytes in total. We use the first 16 cryptographic registers (from cr0 to cr15) to hold the lookup table entries of AES in our implementation.
Table 8. Hardware cost of cryptographic unit.
|Block cipher algorithm||Lookup table size|
|Twofish||512 bytes (two 8 × 8 permutation tables)|
The CRF can be used to perform 128-bit integer arithmetic (e.g., addition, subtraction, and multiplication) and bitwise logical operations (e.g., AND, OR, XOR) with a single native instruction. Bitwise logical operations are especially useful in AES implementations. On the other hand, table lookup through the CRF requires some additional special-purpose registers, which are shown in Figure 7.
A cryptographic register in the CRF can be considered a register capable of holding 16 bytes of the AES lookup table, as demonstrated in Figure 8. For instance, the cryptographic register (cr0) stores the s-box entries for the input bytes from 0 to 15. In the figure, (crx) represents any one of the cryptographic registers in the CRF, and (crx[i]) stands for an individual byte within the register (crx). The individual bytes of cryptographic registers are not directly accessible. As explained in the following, only special instructions can transfer an individual byte of a cryptographic register into a special register (s_out), as shown in Figure 7. Direct access to an individual byte and its transfer to any other register may overly complicate the design and incur high performance penalties.
Figure 7(b) shows a small register file that can hold two versions of the AES state (each of which is 128 bits) in each round, one for the old and one for the newly computed AES state (or block). Recall that each AES round takes a 128-bit block as input and generates a new 128-bit block as output. For 192- and 256-bit implementations of AES, only the state register file needs to be modified, and its area overhead is negligible compared with the overall area of the processor.
In every round, we perform 16 s-box operations for 128-bit AES. The AES state is in (st0, st1, st2, st3) at the beginning of each round. The first 32-bit part of the state is transferred from (st0) to the (index) register (Figure 7(c)) using the instruction (mv_st2index st0). The least significant byte of the (index) register is used to access the s-box output, which is stored in the CRF.
The upper four bits of the least significant byte of the (index) register determine which cryptographic register (crx) holds the desired s-box output. The least significant 4 bits of (index) serve as the offset within (crx). Once the cryptographic register that holds the s-box output is known, the instruction (rd_tab_creg crx) reads the s-box output from (crx) and puts it in a special register (s_out) (Figure 7(d)). The instruction (mv_sout2st stx) first rotates (stx) to the right by 1 byte and then puts the content of the (s_out) register in the most significant byte of the state register (stx).
When the table lookup operation for 1 byte of the AES block is completed, the content of the (index) register is shifted to the right by 1 byte, so that the table lookup operation for the second byte can start. When the lookup operations for the 4 bytes read from the state register into the (index) register are completed, the next 4 bytes are transferred from the next state register (i.e., st1) to the (index) register, and the same operations are applied to the new (index) register content. The table lookup operation for one round finishes when 16 table accesses are completed. Note that, except for the CRF, the registers involved are of a special type and cannot be accessed directly. Only the special instructions shown in Table 9 can access the content of these registers.
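The lookup sequence described above can be modelled in software as follows. This is only an illustrative sketch, not the hardware design: the helper names (`build_sbox`, `sub_word`) are hypothetical, the s-box is generated from its standard definition rather than taken from the paper, and the model follows the prose description, in which the whole 32-bit word is copied into (index) and then consumed byte by byte.

```python
# Software model of the CRF-based s-box lookup (illustrative assumptions only).

def rotl8(v, s):
    """Rotate an 8-bit value left by s bits."""
    return ((v << s) | (v >> (8 - s))) & 0xFF

def gf_mul(a, b, r=0x1B):
    """Multiply two bytes in GF(2^8) modulo x^8 + r(x)."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= r
        b >>= 1
    return p

def build_sbox():
    """Generate the AES s-box: field inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):      # x^254 is the inverse of x
                inv = gf_mul(inv, x)
        y = inv
        for s in range(1, 5):
            y ^= rotl8(inv, s)
        sbox.append(y ^ 0x63)
    return sbox

SBOX = build_sbox()
# cr0..cr15: each cryptographic register holds 16 consecutive s-box bytes.
CRF = [SBOX[16 * r:16 * r + 16] for r in range(16)]

def sub_word(st):
    """Apply the s-box to the four bytes of one 32-bit state register."""
    index = st                              # mv_st2index (modelled per the prose)
    for _ in range(4):
        low = index & 0xFF
        crx = CRF[low >> 4]                 # upper nibble selects the register
        s_out = crx[low & 0xF]              # rd_tab_creg: lower nibble is the offset
        st = (st >> 8) | (s_out << 24)      # mv_sout2st: rotate right, s_out in the MSB
        index >>= 8                         # move on to the next byte
    return st

print(hex(sub_word(0x00010203)))
```

After four iterations each substituted byte has rotated back to its original position, so one round needs four such calls, one per state register.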
Table 9. Special instructions for AES implementation.
|rd_tab_creg||rd_tab_creg crx||s_out:=crx [index ∧ 0xF]|
|shlmod||shlmod std, sts, arx||std [i]:=sts [i] ⊕ (sts [i]_7) ∧ arx|
| || ||i = 0,1,2,3|
|rowop||rowop str, sts||s_out := index_7∧sts  ⊕|
| || ||index_6∧sts  ⊕ index_5∧sts  ⊕|
| || ||index_4∧sts  ⊕ index_3∧str  ⊕|
| || ||index_2∧str  ⊕ index_1∧str  ⊕|
| || ||index_0∧str |
|mv_st2index||mv_st2index std||index := (0,0,0,std ) and|
| || ||std := (0,std ,std ,std )|
|mv_sout2st||mv_sout2st std||std := (sout,std ,std ,std )|
|mv_cr2st||mv_cr2st std, crs||std := index_0∧(crs ,crs ,crs ,crs )|
| || ||⊕ index_1∧(crs ,crs ,crs ,crs )|
| || ||⊕ index_2∧(crs ,crs ,crs ,crs )|
| || ||⊕ index_3∧(crs ,crs,crs ,crs )|
Note that the instructions discussed so far are not designed to benefit the AES algorithm in particular. They can benefit many block cipher algorithms that use relatively small s-boxes, such as DES/3DES, AES (Rijndael), Serpent, and Twofish. One instruction that may be considered specific to AES is (shlmod std, sts, arx) (Table 9), which performs four simultaneous shift-left operations by 1 bit in GF(2^8). If the irreducible polynomial of the GF(2^8) we work in is p(x) = x^8 + r(x), then the architectural register (arx) in the instruction is initialized to r(x). For instance, the irreducible polynomial of GF(2^8) used in AES is x^8 + x^4 + x^3 + x + 1, and therefore (arx := 0x1B). In Table 9, std[i] stands for the ith byte of the destination state register (std) and sts[i]_7 for the most significant bit of the ith byte of the source state register (sts). The instruction (shlmod) works for any irreducible polynomial and can benefit applications using GF(2^8) arithmetic.
Another instruction used in our AES implementation is (rowop str, sts), which takes two words (32-bit variables) stored in the (str) and (sts) registers and XORs certain bytes of these two words. The bytes are selected by the bits of the (index) register, where (index_i) stands for the ith least significant bit of the (index) register (Table 9). The resulting byte is stored in the (s_out) register. This instruction is useful in matrix arithmetic where the elements of the matrix are in GF(2^8).
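To make the roles of (shlmod) and (rowop) concrete, the sketch below models both and combines them to compute one output byte of the AES MixColumns matrix product. It rests on assumptions that the text does not confirm: (shlmod) is modelled as a per-byte left shift followed by a conditional XOR with (arx), and the byte selection in (rowop) is assumed to be index bits 0 to 3 for str[0..3] and index bits 4 to 7 for sts[0..3]. The Python function names are hypothetical stand-ins for the instructions.

```python
# Illustrative model of shlmod and rowop; the byte ordering is an assumption.

def shlmod(sts, arx=0x1B):
    """Multiply each of the four bytes of sts by x in GF(2^8) mod x^8 + arx."""
    std = 0
    for i in range(4):
        b = (sts >> (8 * i)) & 0xFF
        shifted = (b << 1) & 0xFF
        if b & 0x80:                 # sts[i]_7: reduce when the top bit is set
            shifted ^= arx
        std |= shifted << (8 * i)
    return std

def rowop(str_, sts, index):
    """XOR the bytes of str_ and sts selected by the bits of index."""
    s_out = 0
    for i in range(4):
        if (index >> i) & 1:         # assumed: bits 0..3 select str_[i]
            s_out ^= (str_ >> (8 * i)) & 0xFF
        if (index >> (4 + i)) & 1:   # assumed: bits 4..7 select sts[i]
            s_out ^= (sts >> (8 * i)) & 0xFF
    return s_out

# One state column (a0, a1, a2, a3) = (db, 13, 53, 45), with a0 in the low byte.
col = 0x455313DB
dbl = shlmod(col)                    # every byte multiplied by 2

# Row 0 of MixColumns: 2*a0 ^ 3*a1 ^ a2 ^ a3
#   = dbl[0] ^ dbl[1] ^ a1 ^ a2 ^ a3  ->  index = 0b0011_1110
row0 = rowop(col, dbl, 0x3E)

# Row 1 of MixColumns: a0 ^ 2*a1 ^ 3*a2 ^ a3
#   = dbl[1] ^ dbl[2] ^ a0 ^ a2 ^ a3  ->  index = 0b0110_1101
row1 = rowop(col, dbl, 0x6D)

print(hex(row0), hex(row1))
```

Under these assumptions, one (shlmod) plus four (rowop) operations with different (index) values would produce a full MixColumns output column, which is consistent with the remark that the instruction suits matrix arithmetic over GF(2^8).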
The rest of the new instructions in Table 9 are useful for moving data between the special registers and cryptographic registers. They have generic usage because we need to move the data around if we want to use the CU. They are easy to implement, do not incur significant overhead in the area, and definitely are not in the critical path of the processor.
As can be observed from the discussions in this section, our approach is not to integrate powerful instructions that can provide superior performance, specific to the cryptographic algorithm in question and expensive to implement. Our design principle is to propose simple and inexpensive instructions that can benefit a wide range of cryptographic algorithm implementations while providing a secure and isolated execution.
Time performances of the different implementations of AES algorithm
In this section, we compare the time performances of four different (and state-of-the-art) implementations of AES.
The first implementation is taken from Ref. , which is one of the most efficient (i.e., the fastest) implementations of AES in software; it will be referred to as the large lookup table implementation because it uses relatively large tables. Naturally, this implementation is vulnerable to cache-based side-channel attacks. The second implementation is referred to as the small lookup table implementation and uses a 256-B lookup table. This is a straightforward implementation and may be vulnerable to cache-based attacks as well.
The third implementation, referred to as hardened in Ref. , utilizes the CRF to store the lookup table. It is secure against cache attacks but does not run in an isolated zone. The hardened implementation in Ref.  gives the overhead, in number of clock cycles per round, of protecting a particular round. The most powerful attacks focus either on the first round of AES as in Ref.  or on the last round as in Ref.  because these rounds directly interact with the outside world by taking the plaintext or outputting the ciphertext, which are easily observable by an adversary. Therefore, it is of utmost importance to protect the first and last rounds in particular. All the same, it would be prudent to protect the first two and the last two rounds of AES in case effective attacks are discovered against the second (or the ninth) round of an AES implementation whose first and last rounds are already protected.
Table 10 lists the per-round overhead values (in number of clock cycles) for a single 128-bit block encryption.
Table 10. Overhead of protecting the rounds of AES against cache attacks (in clock cycles).
|||First||Last||First and last||Per round|
|796||171 (21.5 %)||33 (4.5 %)||199 (25 %)||178 (≈ 22.4 %)|
Finally, the last implementation, which we call isolated, uses the lookup table in the CRF and does not use memory, cache, or architectural registers to store confidential data. None of the confidential values, such as the secret key, round keys, and intermediate blocks from AES rounds, leave the protected zone during the AES computation. Under the assumption that the CU is manufactured to be tamper proof, the isolated implementation of AES can even withstand cold-boot attacks.
The time performances of the four aforementioned AES implementations are given in Table 11. As can be observed from the figures in the table, the large lookup table implementation performs much better than the other three implementations. This is because the large lookup table implementation consists mainly of lookup operations to the five large tables stored in memory. As long as the processor provides fast memory accesses through the use of first-level or second-level caches, it is almost impossible to provide better performance except with implementations exploiting hardware support (e.g., ). This implementation, however, has been demonstrated to be vulnerable to cache attacks.
Table 11. Time performance of the four software implementations of AES.
|Implementation||Time performance (clock cycles)||Characteristics|
|Large lookup table ||796||Fast, insecure|
|Small lookup table||2654||Moderate speed, insecure, no isolation|
|Hardened (first and last rounds protected) ||995 (est.)||Relatively fast, secure, no isolation|
|Hardened (first two and last two rounds protected) ||1356 (est.)||Relatively fast, secure, no isolation|
|Hardened (All rounds protected) ||2424 (est.)||Moderate speed, secure, no isolation|
|Isolated (this work)||2620||Moderate speed, secure, isolated|
The small lookup table implementation, which is generally considered insecure against cache attacks, provides moderate performance, whereas the hardened implementation (all rounds protected) is expected to improve on it by 8.6%. One can always protect the AES rounds selectively to increase the performance of the hardened AES implementation, which increases the speedup over the small lookup table implementation.
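The estimated cycle counts for the hardened variants can be reproduced from the overheads in Table 10. The derivation below is one reading that happens to match the figures in Table 11; the text does not spell out how the estimates were obtained, so the formulas and variable names are assumptions.

```python
# Reproducing the Table 11 estimates from the Table 10 overheads
# (assumed derivation; all values are clock cycles from the tables).

BASE = 796             # large lookup table implementation, unprotected
FIRST = 171            # overhead of protecting the first round
LAST = 33              # overhead of protecting the last round
FIRST_AND_LAST = 199   # combined overhead as listed in Table 10
PER_ROUND = 178        # overhead of protecting one inner round
SMALL_TABLE = 2654     # small lookup table implementation

# First and last rounds protected: uses the combined figure.
first_last = BASE + FIRST_AND_LAST

# First two and last two rounds: first + last + two inner rounds.
two_and_two = BASE + FIRST + LAST + 2 * PER_ROUND

# All ten rounds: first + last + eight inner rounds.
all_rounds = BASE + FIRST + LAST + 8 * PER_ROUND

# Relative improvement of the fully hardened version over the small table.
improvement = (SMALL_TABLE - all_rounds) / SMALL_TABLE

print(first_last, two_and_two, all_rounds, f"{improvement:.2%}")
```

Note that the combined "first and last" figure (199) differs slightly from the sum of the separate first and last overheads (171 + 33 = 204); under this reading, the first estimate uses the combined figure while the other two use the separate ones.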
The final AES implementation executes the AES encryption operation in complete isolation within the CU. Its performance is comparable with that of the small lookup table and hardened implementations of AES. Because it uses no shared functional units (memory, cache, architectural registers) other than those inside the CU for computations involving security-sensitive values, it provides very extensive security for a software implementation. A slight performance degradation compared with the hardened implementation is unavoidable because the latter uses the best of both the CU and the other existing (and speed-optimized) functional units.