Design and implementation of fast and hardware-efficient parallel processing elements to set full and partial permutations in Beneš networks

A new design for parallel and distributed processing elements (PEs) is proposed to configure Beneš networks based on a novel parallel algorithm that realises full and partial permutations in a unified manner with very little overhead time and extra hardware. The proposed design reduces the hardware complexity of the PEs from O(N^2) to O(N (log2 N)^2) due to a distributed architecture. In addition, asynchronous operation was introduced in part to reduce the time complexity per PE stage to O(1) within a certain range of N, whereas conventional algorithms take O(log2 N) time per PE stage. A prototype of the parallel and distributed PEs was constructed in a field programmable gate array to investigate performance for switch sizes of N = 4 to 32. The experimental results demonstrate that the proposed design outperforms a recent method by at least several times in terms of both hardware and processing time complexities.


INTRODUCTION
The Beneš network (BNW) is a typical multistage switching network that provides unicast connections between N inputs and N outputs, where N = 2^n. An N × N BNW comprises 2n − 1 stages, each of which comprises N/2 2 × 2 switch elements (SEs) [1]. There are O(N log2 N) SEs in total, which satisfies the theoretical lower bound of nonblocking switch complexity [2]. The BNW is rearrangeably nonblocking; that is, existing connections can be diverted to alternative routes to create space for a new call. Meanwhile, when all connection requests are given simultaneously, the BNW can be configured without such rearrangements using complex routing algorithms. These characteristics make BNWs effective for interconnection networks that provide a limited number of permutations for which sets of switch control bits (SCBs) are precomputed and indexed [3,4]. In fact, the switch control complexity of the BNW is O(N log2 N) under sequential algorithms [5,6], and this O(N log2 N) time simply reflects the number of SEs that must be set in the BNW. This is much longer than the O(1) time of crossbar switches, which, in contrast, have up to O(N^2) SEs [7]. Several parallel algorithms have been investigated to apply BNWs to high-speed time-division multiplexed systems that respond agilely to arbitrary permutations [8]. The initial work emerged concurrently and independently in two different areas. In 1981, Lev et al. formulated parallel algorithms for Beneš/Clos networks based on a mathematical (graph theory) approach [9], and, in 1982, Nassimi and Sahni developed a parallel algorithm for BNWs based on an engineering (parallel computing) approach [10]. Both techniques achieved O((log2 N)^2) time complexity on completely interconnected parallel computers for full permutations, where each input corresponds to a unique output.
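As a quick illustration of these counts, the following Python sketch (the helper name `benes_params` is ours, not the paper's) computes the stage and SE totals for a given switch size:

```python
# Stage and switch-element counts of an N x N Benes network (N = 2^n):
# 2n - 1 stages of N/2 2x2 switch elements each, as stated in the text.

def benes_params(N):
    n = N.bit_length() - 1
    assert N == 1 << n and N >= 2, "N must be a power of two"
    stages = 2 * n - 1                  # 2n - 1 stages
    ses_per_stage = N // 2              # N/2 SEs per stage
    return {"n": n, "stages": stages,
            "ses_per_stage": ses_per_stage,
            "total_ses": stages * ses_per_stage}

# A 16 x 16 BNW (n = 4) has 7 stages of 8 SEs, i.e. 56 SEs in total.
assert benes_params(16) == {"n": 4, "stages": 7,
                            "ses_per_stage": 8, "total_ses": 56}
```

The O(N log2 N) total SE count quoted above is exactly `total_ses` growing with N.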
In many practical applications, some inputs may be idle, resulting in partial permutations. Parallel algorithms for BNWs should handle both full and partial permutations [11]. However, the original parallel algorithms suspend processing when they encounter an idle connection. There are two major approaches to address this issue. First, in 1995, Lee and Oruc introduced quadruple datasets to match idle inputs with idle outputs as dummy destinations [12]. Once each input is assigned a unique destination output, conventional parallel algorithms work effectively. However, this approach requires considerable pre-processing and a complicated data structure. Second, in 2002, Lee and Liew proposed an additional merging process that fits efficiently into the original algorithms [13]. In 2017, Jiang and Yang implemented Lee's algorithm in a field programmable gate array (FPGA) for N = 8 to 32 [14]. They found that it worked well but incurred significant overhead time. For example, the first stage of a parallel control unit (PCU) for N = 16, whose main process can be completed in only four (i.e. log2 N) clocks, required up to 17 clocks. In addition, this approach incurs a steep O(N^2) increase in PCU hardware due to a crossbar-like centralised architecture [15].
In 2009, Kai et al. preceded Jiang and Yang in designing a PCU to configure BNWs. Their design focused only on the first stage of the PCU for a 16 × 16 BNW [16]. However, they employed a distributed architecture rather than a centralised one, which reduced the hardware complexity from O(N^2) to O(N (log2 N)^2). In addition, they suggested that the time complexity could be reduced from O((log2 N)^2) to O(log2 N) using the pipeline architecture suggested in Ref. [12]. Despite these potential advantages, their algorithm could not handle partial permutations, and no detailed experimental results were presented.
In contrast to Ref. [16], we propose a parallel algorithm that realises partial permutations as well as full permutations. We also implement a whole PCU in an FPGA and present detailed experimental results. Our approach is to construct a PCU with distributed PEs that generate SCBs in a pipelined and partly asynchronous manner. In Ref. [12], partial permutations were addressed by matching the idle inputs to idle outputs as dummy destinations; our algorithm needs no such overhead processing and handles partial permutations in a unified manner with full permutations.
The remainder of this paper is organised as follows. In Section 2, beginning with several definitions, we present our routing principle in the form of sequential processing as an introduction to our parallel algorithm. In Section 3, we describe a design of parallel and distributed PEs to be implemented in an FPGA, with an emphasis on realising full and partial permutations in a simple but unified manner. In addition, we discuss two design options from speed and hardware complexity perspectives. In Section 4, we describe the implementation of the parallel algorithm in an FPGA to realise speeds higher than the clock rate using partly asynchronous operation, and we show experimental results that highlight the performance of the design. Finally, the paper is concluded in Section 5.

DEFINITIONS AND PROPOSED ROUTING PRINCIPLE
Here, we consider parallel construction of BNWs [17]. Figure 1 shows a primitive model of an N × N BNW in a three-stage structure, where a parallel pair of half-sized (i.e. N/2 × N/2) matrices lies between two sets of N/2 SEs. Each N/2 × N/2 matrix may in turn be replaced by a three-stage BNW with a pair of N/4 × N/4 matrices. The reduction proceeds until the minimum-size matrices, that is, 2 × 2 SEs, appear in the centre stage. As a result, a complete BNW is obtained (Figure 2), where each SE has two connection modes, bar (=) and cross (×), flipped by an SCB. Note that S p, q represents the q-th SE from the top at the p-th stage from the left, where 0 ≤ p ≤ 2n − 2 and 0 ≤ q ≤ N/2 − 1. Let s p, q ∈ {0, 1} indicate its status; s p, q = 0 makes S p, q bar, and s p, q = 1 makes it cross.
Let i and j, where 0 ≤ i ≤ N − 1 and 0 ≤ j ≤ N − 1, be the input and output port numbers in an N × N BNW. For now, we consider a full permutation π 0 as follows:

π 0 = ( 0   1   ⋯   N − 1
        j 0  j 1  ⋯  j N−1 )

where j i is the designated output at input i.
We express a duplet i : j as an individual corresponding pair of input i and output j. Let b(j) be the binary address of output j, defined by

b(j) = b n−1 (j) b n−2 (j) ⋯ b 0 (j).

Each SE in the last stage in Figure 2 has a pair of outputs j and j c , whose LSBs b 0 (j) and b 0 (j c ) are different (or complementary). We define m(j) as the number of the output SE to which j belongs, where 0 ≤ m(j) ≤ N/2 − 1, and we have

m(j) = ⌊ j/2 ⌋.

In addition, in binary form, let b̄(j) be the truncated binary address of b(j), defined by

b̄(j) = b n−1 (j) b n−2 (j) ⋯ b 1 (j).

Then we have

m(j) = b̄(j).

Similar relationships hold between i and i c in an input SE, which has a pair of duplets i : j and i c : j a , where j a is the associated destination output in the input SE. Note that we use j and j c (and i and i c ) interchangeably, i.e. their positions in an SE are not fixed. Here, j c and j a may coincide for a given j; for example, in Figure 1, j c = j a = 2 holds for j = 3 at S 0, 0 . The original permutation π 0 is divided into a pair of sub-permutations as follows:

π 0, u 0 = ( u 0 0  u 0 1  ⋯  u 0 N/2−1 ),  π 0, d 0 = ( d 0 0  d 0 1  ⋯  d 0 N/2−1 ),

where u 0 l and d 0 l (0 ≤ u 0 l ≤ N/2 − 1 and 0 ≤ d 0 l ≤ N/2 − 1 for 0 ≤ l ≤ N/2 − 1) are the output SE numbers to which the designated outputs j belong, as shown in Figure 1. Previously, a sub-permutation has been referred to as a complete residue system [6] or an equivalence class [10]; however, sub-permutations constitute systems of distinct representatives (SDRs) [18,19], and we refer to the sub-permutations as SDRs.
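These definitions can be made concrete with a short Python sketch (the helper names `b`, `complement`, `m` and `b_bar` are ours); it illustrates that j and j c differ only in their LSB and therefore share an output SE, and that m(j) equals the truncated address:

```python
# Hypothetical helpers mirroring the definitions above: b(j) is the n-bit
# binary address of output j, j_c is its complementary output (LSB flipped),
# m(j) is the output SE number, and b_bar(j) is the truncated address.

def b(j, n):
    return format(j, f"0{n}b")          # n-bit binary address of output j

def complement(j):
    return j ^ 1                        # j_c: flip the LSB

def m(j):
    return j >> 1                       # output SE number, 0 <= m(j) <= N/2 - 1

def b_bar(j, n):
    return b(j, n)[:-1]                 # address truncated by its LSB

n = 3                                   # 8 x 8 example, as in Figure 1
assert complement(3) == 2               # j = 3 pairs with j_c = 2
assert m(5) == m(4) == 2                # j = 5 and j_c = 4 share output SE 2
assert int(b_bar(5, n), 2) == m(5)      # m(j) equals the truncated address
```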
Assume that the three-stage structure shown in Figure 1 is the 0th reduction. Generally, at the k-th reduction, where 0 ≤ k ≤ n − 2, there are 2^k sub-permutation pairs (π k, u 0 , π k, d 0 ), …, (π k, u 2^k −1 , π k, d 2^k −1 ), where each sub-permutation has N/2^(k+1) elements. The width of the binary address of these elements is virtually n − k − 1. Note that, throughout this paper, an output address of n bits is transferred to the next sub-permutation as is, while the region of interest (ROI) in the address at the k-th reduction, denoted by R(b(j), k), is reduced to n − k − 1 bits, truncated by one bit at each reduction, and is expressed as follows:

R(b(j), k) = b n−1 (j) b n−2 (j) ⋯ b k+1 (j).

As shown in Figure 2, the BNW is equivalent to a back-to-back concatenation of two baseline networks [20]. Similarly, our routing algorithm is broken down into two parts. In the first part, the permutation is divided recursively in half while satisfying the SDR constraints, as shown in Figure 1. Although our division process is essentially the same as that of the looping algorithm [5], it is described briefly in the following as an introduction to our parallel algorithm. This process is recursive in nature; thus, we primarily focus on the first stage to illustrate how π 0 is divided into π 0, u 0 and π 0, d 0 .
(i) Select an arbitrary SE that is not yet set and set it to bar by default. In the initial state, we select S 0, 0 and set s 0, 0 = 0. Let the upper duplet be i : j (0 : 3 in Figure 1).
(ii) Search for an SE to which j c belongs (j c = 2 in Figure 1). Here, we find j c in S 0, 0 itself; in this case, we return to the beginning SE and a cycle ends [10].
(ii') Resuming a new cycle at S 0, 1 , whose upper duplet is i′ : j with j = 5, we search for the SE to which j c = 4 belongs and find S 0, 2 . The pair of outputs j = 5 and j c = 4 shares the output SE S 4, 2 , and they must be distributed to different sub-permutations due to the SDR constraints. As m(j) = 2 has already been included in π 0, u 0 at i′, m(j c ) = 2 must be in π 0, d 0 . As a result, we set s 0, 2 = 1. Here, we have π 0, u 0 = {1, 2, 3} and π 0, d 0 = {1, 0, 2}. Note that s 0, 2 is not equal to s 0, 1 because the relative positions of the pair of duplets including j = 5 and j c = 4 (i.e. either the upper or lower input in the respective SE) are identical [10].
(iii) Next, identify j a = 6 in S 0, 2 as the associate of j c = 4.
Then, substitute j with j a and repeat steps (ii') to (iii) until the cycle ends. When a cycle ends, resume a new cycle from step (i) until all SEs are set.
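The sequential division above can be sketched in software as follows (a simplified Python model; the names `divide`, `where` and `side` are ours, and the sketch assumes a full permutation `pi` with inputs 2q and 2q + 1 attached to SE S 0, q):

```python
# Divide a full permutation into two sub-permutations of output SE numbers
# satisfying the SDR constraints, following steps (i)-(iii) of the text.

def divide(pi):
    N = len(pi)
    where = {j: i for i, j in enumerate(pi)}     # input holding output j
    s = [None] * (N // 2)                        # input-SE statuses (0=bar, 1=cross)
    side = {}                                    # half ('u' or 'd') chosen per input
    for q0 in range(N // 2):                     # step (i): pick an unset SE
        if s[q0] is not None:
            continue
        s[q0] = 0                                # set it to bar by default
        side[2 * q0], side[2 * q0 + 1] = "u", "d"
        j = pi[2 * q0]
        while True:
            jc = j ^ 1                           # step (ii): partner output j_c
            ic = where[jc]                       # input carrying j_c
            q = ic // 2
            if q == q0:                          # back at the beginning SE:
                break                            # the cycle ends
            # j_c must go to the opposite half of j (SDR constraint)
            want = "d" if side[where[j]] == "u" else "u"
            upper = (ic % 2 == 0)
            s[q] = 0 if (want == "u") == upper else 1
            side[ic] = want
            ia = ic ^ 1                          # step (iii): the associate input
            side[ia] = "u" if want == "d" else "d"
            j = pi[ia]                           # substitute j with j_a and repeat
    up = [pi[i] // 2 for i in sorted(side) if side[i] == "u"]
    dn = [pi[i] // 2 for i in sorted(side) if side[i] == "d"]
    return s, up, dn

s, up, dn = divide([7, 5, 2, 0, 6, 3, 4, 1])
assert sorted(up) == sorted(dn) == [0, 1, 2, 3]  # each half is an SDR
```

Each half contains every output SE number exactly once, which is precisely the SDR property the recursion relies on.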
The second part of our routing algorithm relies on destination-tag routing (DTR), which correlates to a superposed binary tree structure of the reverse baseline network (RBN) [20] (Figure 2). Here, let d(i) be the binary destination output imposed at the i-th input of the RBN, as follows:

d(i) = d n−1 (i) d n−2 (i) ⋯ d 0 (i).
We label each pair of outlets of SEs in the RBN with 0 and 1, as shown in Figure 2. Subsequently, we trace a route from i to d(i) as follows. When d(i) appears at an input of S p, q , it is forwarded to either outlet 0 or 1 of S p, q depending on whether the routing bit of d(i) specified for stage p (Equation (10)) is 0 or 1. The DTR is so simple and straightforward that we omit a detailed description of its hardware implementation.
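A minimal software model of the DTR (the function name `dtr_outlets` is ours; we assume the common MSB-first bit order, while the paper's exact per-stage bit selection is given by its Equation (10)):

```python
# Destination-tag routing through the log2(N) stages of the reverse baseline
# network: each stage forwards the address to outlet 0 or 1 according to one
# bit of the destination address d(i), consumed MSB first in this sketch.

def dtr_outlets(d, n):
    bits = format(d, f"0{n}b")      # binary destination address d(i)
    return [int(x) for x in bits]   # outlet (0 or 1) chosen at each RBN stage

# Routing towards output 5 (binary 101) in an 8 x 8 network:
assert dtr_outlets(5, 3) == [1, 0, 1]
```

Because each stage only inspects a single precomputed bit, the per-stage work is O(1), matching the constant-time claim for the second part.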

Design overview
The distributed architecture of parallel processing elements (PEs) is shown in Figure 3, where the upper and lower halves are the switch body of an N × N BNW and the proposed PCU, respectively. The PCU accepts address information (A in ) slot by slot, computes routes and generates SCBs stage by stage in parallel. A PE at a stage is connected to a pair of PEs at the next stage in the same manner as its counterpart SE. A PE accepts a pair of addresses and transfers them to the next PEs in the same mode (i.e. bar or cross) as the SE. The PCU has up to O(N log2 N) PEs and is similar to a massively parallel system. Fortunately, as described in Section 3.2, the PEs can be composed of a few kinds of primitive logic circuits and memories, for example, comparators, multiplexers and registers, unlike [14,21], where arithmetic units and shared memories are required. As shown in Figure 3, there are two types of PEs, that is, PE k, h in the first part, where 0 ≤ k ≤ n − 2 and 0 ≤ h ≤ N/2 − 1, and PE k', h in the second part, where n − 1 ≤ k' ≤ 2n − 2 and 0 ≤ h ≤ N/2 − 1, according to the two parts of the routing algorithm described above. Note that the first stage of the PCU corresponds to the index k = 0. The PEs in the k-th stage of the first part (0 ≤ k ≤ n − 2) are divided into 2^k independent groups. Thus, a PE group in the k-th stage includes N/2^(k+1) PEs and accepts a total of N/2^k addresses, of which the bit width of the ROI becomes n − k − 1. The PEs in a group are interconnected with N/2^(k+1) buses, shown as bold lines in the first stage of the PCU in Figure 3, to communicate within the group.
In the first part of our routing algorithm, the division process of sub-permutations is common to all PE groups, although the number of PEs in a group (i.e. the size of the sub-permutations) decreases with the stage. In the second part, all PEs have uniform, low complexity. As a result, the PEs in the first stage of the PCU have the most complex hardware. As discussed in Section 3.2, a PE group in the k-th reduction, where 0 ≤ k ≤ n − 2, takes O(log2 (N/2^k)) time, whereas all PEs in the second part require a small constant time of O(1). Consequently, the processing time and hardware complexity are most critical in the first stage of the PCU. In the following, we focus on the PEs in the first stage, as in Ref. [16], and describe our parallel algorithm in detail.

Design of PEs for full permutation
In the first stage of the PCU, there is a single PE group containing N/2 PEs. Here, we assign binary suffixes to the N/2 PEs as PE b(p) (e.g. PE 00 ), where 0 ≤ p ≤ N/2 − 1. Note that we also re-assign binary suffixes to the statuses of the N/2 SEs (e.g. s 00 ). The PEs are interconnected with the same number of buses B b(r) (e.g. B 00 ), where 0 ≤ r ≤ N/2 − 1 (Figure 4(a)). In Figure 4, we assume the following full permutation:

π 0 = ( 0 1 2 3 4 5 6 7
        7 5 2 0 6 3 4 1 )    (11)
A simplified functional block diagram of the PEs is shown in Figure 5; the function of each block is described as follows. In the division process, PE b(p) puts its own data, for example, b̄(j), b̄(j c ) and b̂(p), onto B b(p) . Here, b̂(p) is an extended PE suffix that is identical to b(p) for full permutations and is redefined for partial permutations in Section 3.3. Each bus comprises data information of n − 1 bits and a data valid flag (f v ) of 1 bit. Although each PE can receive multiple data simultaneously, at most one pair of eligible data is accepted through the multiplexers in the bus interface (BIF) in Figure 5. Thus, the internal processing speed of the PEs remains constant, independent of the PE group size. The division process for full permutations has three phases, which are described in the following.

First phase

The first phase is a parallel process for the neighbour search.
Here, assume PE b(p) has two destination addresses b(j) and b(j a ), as shown in Figure 5. The neighbour search is performed to find the PEs with truncated addresses b̄(j) and b̄(j a ). If b̄(j) = b̄(j a ) holds in a PE (step (ii); Section 2), the neighbour PE is identical to the original PE, and no addresses are launched onto the bus. The other PEs compare their own truncated addresses with the incoming b̄(j) and b̄(j a ) using the comparators in the BIF. For example, as shown in Figure 4(b), PE 00 first broadcasts 11 from its upper inlet to B 00 , and the other PEs compare their truncated addresses with 11. PE 10 discovers PE 00 as a neighbour because it has 110 at its upper inlet. Note that 111 and 110 are both at upper inlets and must be in different sub-permutations; therefore, s 10 is 'not equal to' s 00 (Figure 4(c)). Similarly, when PE 00 launches 10 from its lower inlet, PE 11 discovers PE 00 as a neighbour because it has 100 at its upper inlet. Thus, s 11 is 'equal to' s 00 .
The 'equal to' and 'not equal to' relations are illustrated in Figures 4(b,c) by solid and dashed links, respectively. We refer to both relations collectively as the link status. The link status flag (f s ) and the neighbour PE suffixes are saved in registers, as shown in Figure 5. In practice, a pair of counter-rotating links is implemented with a portion of each bus in the third phase. The link status between two adjacent PEs is implemented in our FPGA design by an inverting or non-inverting gate, as shown in Figure 4(d), where the link originates and terminates at PE 11 as the result of the second phase described below. The neighbour search is realised in parallel, and its time complexity is O(1).
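The first-phase search can be modelled in Python as follows (a software sketch with our own names `neighbour_search` and `addrs`; comparing full addresses against j XOR 1 is equivalent to matching truncated addresses, since j and j c differ only in their LSB):

```python
# Each PE holds the destination addresses at its upper (pos 0) and lower
# (pos 1) inlets. A launched address finds the PE holding its partner j_c;
# the resulting link is 'neq' (inverting) when both inlet positions match,
# and 'eq' (non-inverting) otherwise.

def neighbour_search(addrs):
    links = []
    for p, pair in enumerate(addrs):
        for pos, j in enumerate(pair):
            for p2, pair2 in enumerate(addrs):
                for pos2, j2 in enumerate(pair2):
                    if p2 != p and j2 == j ^ 1:   # partner found in another PE
                        links.append((p, p2, "neq" if pos == pos2 else "eq"))
    return links

# Figure 4 example (8 x 8): PE_00 .. PE_11 hold the addresses of the
# permutation (7 5 2 0 6 3 4 1), two per input SE.
links = neighbour_search([(7, 5), (2, 0), (6, 3), (4, 1)])
assert (0, 2, "neq") in links   # 111 and 110, both at upper inlets: s_10 != s_00
assert (0, 3, "eq") in links    # 101 (lower) and 100 (upper): s_11 = s_00
```

In hardware this all happens at once over the buses, which is why the phase costs O(1) time rather than the nested loops of the software model.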

Second phase
The second phase is an iterative procedure to find a representative PE in each cycle. In a cycle (e.g. Figure 4(c)), the PEs are correlated with each other, and the system of relations is referred to as an initialising equation, which is conventionally resolved in a relatively complex manner [13]. Our solution to the initialising equation is based on a plain but unique observation: if we appoint a PE as a representative in a link diagram (e.g. PE 11 in Figure 4(d)) and set the corresponding SE to the default state, the statuses of the other SEs are generated automatically in the link diagram, which cannot easily be achieved with software alone. In other words, we borrow a concept from evolving hardware/software codesign techniques [22]. We define the representative PE as the one with the maximum suffix in a cycle (e.g. PE 11 in Figure 4(b)). Our iterative solution proceeds as follows. In the initial state, each PE in a cycle is a candidate for the final representative, and this status is indicated by a double circle in Figure 6(a), where only the binary suffixes of the PEs, that is, b(p), are shown for simplicity. In the first step, each PE sends its suffix to its two neighbour PEs, as shown in Figure 6(b), and each PE compares its own suffix y 0 with the incoming suffixes y 1 and y 2 . A similar situation occurs in the subsequent steps, and there are three possible cases.
(i) If (y 0 < y 1 ) ∨ (y 0 < y 2 ), the PE is lost; y 1 and y 2 are transferred to the next neighbour PEs, for example, PE 00 and PE 01 in Figure 6(b) and PE 10 in Figure 6(c), where lost PEs are shown by dashed circles. Note that lost PEs become transparent in the subsequent steps.
(ii) If (y 0 > y 1 ) ∧ (y 0 > y 2 ), the PE remains a potential representative and discards both y 1 and y 2 , for example, PE 11 and PE 10 in Figure 6(b).
(iii) If y 0 = y 1 = y 2 , the PE becomes the final representative and proceeds to the third phase, for example, PE 11 in Figure 6(d).
Lost PEs bypass incoming data through simple transfer switches embedded in the BIF; thus, the bypass delay is negligible. The comparison result at each PE is updated in a representative flag (f r ), as shown in Figure 5. The neighbour PE suffixes are referred to by the BIF to specify the bus from which to receive the suffix data. Note that each individual cycle has a maximum suffix, and there can be several representatives in a stage (e.g. Figure 2). As a result, the number of candidates for a representative is reduced by one-half after each iteration; thus, the second phase for the first stage completes in at most O(log2 N) time.
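The iterative election can be simulated in Python (a behavioural sketch with our own names; we assume distinct suffixes and model transparent lost PEs by simply skipping them in the ring):

```python
# Elect the maximum suffix on one cycle. Each step, every surviving candidate
# compares its suffix y0 with the suffixes y1, y2 of its nearest surviving
# neighbours; a PE with a smaller suffix is lost and becomes transparent.

def elect_representative(suffixes):
    ring = list(suffixes)                        # PE suffixes in cycle order
    alive = set(ring)
    steps = 0
    while len(alive) > 1:
        steps += 1
        live = [y for y in ring if y in alive]   # lost PEs are bypassed
        survivors = set()
        for idx, y0 in enumerate(live):
            y1 = live[idx - 1]                   # suffix from one neighbour
            y2 = live[(idx + 1) % len(live)]     # suffix from the other
            if y0 > y1 and y0 > y2:              # case (ii): still a candidate
                survivors.add(y0)
        alive = survivors
    return alive.pop(), steps

rep, steps = elect_representative([0b00, 0b01, 0b10, 0b11])
assert rep == 0b11                               # PE_11 becomes representative
```

Only local maxima survive each step, so the candidate count at least halves per iteration, which reproduces the O(log2 N) bound stated above.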

Third phase
The third phase is performed to determine the status of each SE in each cycle according to the direction of the representative and to pass addresses on to the next stage. For example, PE 11 in Figure 4(d) sets an initial value of s 11 = 0, which also serves as a trigger to the link diagram. Note that each link between neighbour PEs has already been established over the buses at the beginning of the third phase. In addition, the inverting and non-inverting gates, which reside in the SCB generator in Figure 5, have been configured according to f s . Here, a pair of input addresses b(j) and b(j a ) is transferred to the PEs in the next stage via a 2 × 2 switch (Figure 5), which is set to bar or cross by the SCB. The propagation delay in the link diagram is very small compared with a clock period; thus, the time complexity of the third phase is estimated as O(1).
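The ripple of SCBs through the configured gates can be sketched as follows (a software model; `propagate_scbs`, `order` and `rels` are our names, with 'eq' standing for a non-inverting gate and 'neq' for an inverting one):

```python
# The representative seeds s = 0 (bar) and the remaining SCBs ripple through
# the link diagram combinationally: a non-inverting gate copies the status,
# an inverting gate flips it.

def propagate_scbs(order, rels):
    # order: PE indices around the cycle, starting at the representative
    # rels[t]: relation between order[t] and order[t + 1]
    s = {order[0]: 0}                    # representative sets bar by default
    for t in range(len(rels)):
        cur, nxt = order[t], order[t + 1]
        s[nxt] = s[cur] if rels[t] == "eq" else 1 - s[cur]
    return s

# Hypothetical 3-PE cycle: an inverting gate followed by a non-inverting one
assert propagate_scbs([2, 0, 1], ["neq", "eq"]) == {2: 0, 0: 1, 1: 1}
```

In hardware this is a pure gate-propagation path, which is why the whole phase fits in O(1) time well inside one clock period.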
In summary, the total processing time in the first stage is O(log2 N). Generally, the total processing time in the k-th reduction, where 0 ≤ k ≤ n − 2, is O(log2 (N/2^k)) because the size of the permutation in the k-th reduction is N/2^k. It is evident that the first stage has the maximum processing time, which decides the maximum throughput of the pipelined PCU.

FIGURE 7  (a) An input in Figure 4(a) becomes idle and is represented as x; (b) the link diagram is disconnected between PE 00 and PE 10 ; (c) the disconnection is protected by the loop-back mechanism (similar to self-healing ring networks); (d) alternatively, each PE is provided with an extended suffix to mask the disconnection

Unified parallel algorithm for full and partial permutations
Here, we begin by observing how an idle connection affects our algorithm. We assume the following partial permutation:

π 0 = ( 0 1 2 3 4 5 6 7
        x 5 2 0 6 3 4 1 )

where x indicates an idle address. This partial permutation is the same as the full permutation in the previous section, that is, Equation (11), except that the first element is changed to x. As shown in Figures 7(a,b), there are no links between PE 00 and PE 10 due to the idle address. Thus, no representative PE can be determined because y 1 and y 2 (step (iii); Section 3.2.2) are missing due to the link disconnection. We have two design options to overcome this problem. One option is a structural adaptation that loops back the disconnected links at the two end PEs (Figure 7(c)). This loop-back mechanism allows all PEs to have a pair of neighbour PEs, as in full permutations. Although the loop-back causes no change in the comparison operation, it nearly doubles the length of a cycle and increases processing time.
The other option is a behavioural adaptation that masks the disconnection by blocking the received suffix data at the two end PEs. For this purpose, we redefine b̂(p) as follows. If a PE has a single neighbour PE, like PE 00 and PE 10 , which reside at the two ends in Figure 7, b̂(p) is obtained by prefixing b(p) with an MSB of 1:

b̂(p) = 1 ∥ b(p).    (13)

If a PE has two neighbour PEs, like PE 01 and PE 11 , which reside between the two end nodes, b̂(p) is obtained by prefixing b(p) with an MSB of 0:

b̂(p) = 0 ∥ b(p).    (14)

It is evident from Equations (13) and (14) that the two end PEs have greater suffixes than the intermediate PEs; thus, they block the intermediate PE numbers and survive the competition to the final iteration. Eventually, they exchange suffixes over the available links, and either becomes the final representative. Here, we must add a pair of termination conditions to step (iii) (Section 3.2.2): an end PE that receives a suffix y with b̂(y) MSB = 1 becomes the final representative if its own suffix is the greater of the two, and is lost otherwise. Note that b̂(y) MSB = 1 indicates that PE b̂(y) is an end PE.
Note that the latter design incurs very little additional processing time and hardware while maintaining the comparison operation used for full permutations. In this study, we developed the PCU using the latter design option to realise high speed and low hardware complexity.
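The extended-suffix rule can be illustrated in Python (our own sketch of the reconstruction above: an end PE prepends an MSB of 1 to its suffix, an intermediate PE prepends 0; in this example PE 10, having the greater extended suffix of the two end PEs, would win the election):

```python
# Extended suffix: an end PE (single neighbour) sets the extra MSB to 1,
# so both end PEs outrank every intermediate PE during the election.

def extended_suffix(p, n, is_end):
    return ((1 << (n - 1)) | p) if is_end else p

n = 3                                    # N = 8: suffixes b(p) of n - 1 = 2 bits
ends = [0b00, 0b10]                      # PE_00 and PE_10 are end PEs (Figure 7)
mids = [0b01, 0b11]                      # PE_01 and PE_11 sit in between
ext_ends = [extended_suffix(p, n, True) for p in ends]
ext_mids = [extended_suffix(p, n, False) for p in mids]
assert min(ext_ends) > max(ext_mids)     # intermediate suffixes are blocked
assert max(ext_ends + ext_mids) == extended_suffix(0b10, n, True)
```

A one-bit extension and an unchanged comparator are all that is needed, which is why this option costs almost no extra hardware.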

Design environment and alternatives
FPGAs are an efficient hardware target for rapid prototyping [23]. In this study, we used a Xilinx midrange FPGA (XC6SLX45), which has 6,822 configurable logic blocks (CLBs) operating at 100 MHz. We also used the ISE Design Suite 14.7 development tool for Windows 10, and the code for the PEs was written in VHDL with the IEEE Std_Logic_1164 package. Note that the synthesis options were set to their defaults, that is, speed priority mode, normal optimisation effort, etc. A constraint file was used to allow internal signals, for example, counter outputs, f r and SCBs, to be output to I/O pins to monitor their logic status and measure delay times. In addition, the input addresses were generated within the FPGA. Note that we needed to redesign the FPGA whenever the permutations changed; however, external digital pattern generators were not required. We referred to the device utilisation summary to analyse hardware usage.

Note that each PE in a given stage operates synchronously to a common clock. In our algorithm, three processing steps operate synchronously. The first step is an initialising process, where destination addresses are imported and flags and registers are reset. The second step comprises the three phases described in Section 3. The third step is a terminating process, where SCBs and addresses are exported. However, to increase processing speed, we introduced asynchronous hardware operation in part, as follows, while making no change to the parallel algorithm described in Section 3. In the first part of the PCU, the iterative procedure to find a representative (Section 3.2) can operate asynchronously by referring to the f v flags and comparison results. We added an additional set of buses to Figure 4(a) for this purpose, and each pair of input addresses at the PEs was processed simultaneously. In the second part of the PCU, the DTR was also implemented asynchronously by referring to the specified routing bit given by Equation (10).
Note that these asynchronous operations were described easily in VHDL using if-statements, and the synchronous operations were realised using when-statements. Figure 8 shows the number of occupied slices in the first and second parts of the PCU for N = 4 to 32, denoted HW 1 and HW 2 , respectively. As expected in Section 2, the first part accounts for a large portion of the PCU hardware, whereas the second part requires significantly less hardware. We estimate the hardware amount of a single PE in the first part (Figure 5) as O(log2 N) because most parts of a PE have a dimension of log2 N. Recall that a single stage comprises N/2 PEs; thus, the hardware complexity per stage is O(N log2 N). Therefore, the total hardware complexity over the log2 N − 1 stages, denoted HW 1 , is O(N (log2 N)^2). Similarly, the total hardware complexity of the second part, denoted HW 2 , is also O(N (log2 N)^2).

FIGURE 9  Processing time for the first stage in the first part and for the DTR in the second part

As shown in Figure 8, the experimental results agree with the theoretical estimation, that is, O(N (log2 N)^2). When N = 32, the number of occupied CLBs was less than 1,000 (a CLB utilisation of approximately 15%), which was estimated to be approximately 6,400 cells [24]. Note that the conventional hardware complexity increases as O(N^2) [14]; the corresponding number of cells for N = 32 is up to 36,200 (approximately six times greater than in our design). These results indicate that the proposed design is significantly more efficient in terms of hardware cost than existing methods. As for time complexity, we focus on the processing time in the first stage because it accounts for the largest portion and correlates with the total processing time, as described at the end of Section 3.2. Specifically, we consider the processing time of the second phase (Section 3.2.2), denoted t f , which accounts for most of the processing time in the first stage. Figure 9 shows t f vs N for the proposed design. It can be seen that t f remains less than a clock period (10 ns) up to a certain switch size due to the asynchronous operation and increases approximately as O((log2 N)^0.16). We found that the asynchronous operation causes a dramatic decrease in processing time. Note that this attenuating effect depends on the device, and t f can be generalised as O((log2 N)^α), α < 1. In Figure 9, it can also be seen that t f for the special case of N = 4 shows a sharp drop. We surmise that this is because a 4 × 4 BNW has only two SEs in the first stage, making it easy to fix a representative quickly without an iterative search. This sharp drop in processing time at N = 4 was also observed in a previous study [14]. Figure 9 also shows the processing time for the DTR in the second part of the proposed design, denoted t s , for reference.
As can be seen, t s is also less than a clock period due to the asynchronous DTR operation. It is significantly less than t f due to the simplicity of the DTR and is negligible compared with t f , as suggested at the end of Section 2.
It is interesting to note that t f and t s follow the same approximation curve, O((log2 N)^0.16). Recall that the time complexity of the first stage is O(log2 N), as described at the end of Section 3.2.3. The processing for the DTR in the second part takes O(1) time per stage, as described at the end of Section 2. Since the second part has log2 N stages, its total processing time complexity is also O(log2 N). However, both processes of O(log2 N) complexity were reduced to O((log2 N)^0.16) due to the asynchronous operation, as predicted in Section 4.1. Figure 10 shows the total processing time in clock cycles for the first stage. The conventional algorithm [15] requires O(log2 N) clock cycles for the first stage, while the proposed design demonstrated O(1) clock cycles up to N = 128 because the most time-consuming process of the proposed algorithm completes within a single clock cycle, as shown in Figure 9. In fact, the proposed algorithm requires five clock cycles, broken down as follows: one clock for the initialising step, three clocks for the first to third phases, and the last clock for the terminating step described in Section 4.1. For example, in the N = 16 case, our design is more than three times faster than the previous method, and the performance difference increases with N. Note that the number of clock cycles for N > 128 is estimated to increase by only one clock cycle due to the asynchronous operation. Consequently, the proposed design is at least several times faster than the previous method [15] in terms of clock cycles.

CONCLUSION
In this paper, we have proposed a new design for parallel and distributed PEs to set full and partial permutations in Beneš networks based on a novel parallel algorithm. Our algorithm realises both full and partial permutations in a unified manner with little overhead time and additional hardware cost. Specifically, the proposed design reduces hardware complexity from O(N^2) to O(N (log2 N)^2) due to its distributed architecture and reduces time complexity from O((log2 N)^2) to O(log2 N) due to its pipeline architecture. In addition, we introduced partly asynchronous operation to further increase speed, and the time complexity is reduced to O((log2 N)^α), α < 1. This result suggests that a conventional process of O(log2 N) time complexity can be reduced to O(1) within a certain range of N. We built prototype parallel and distributed PEs in an FPGA to investigate performance for N = 4 to 32. The experimental results demonstrate that the proposed design outperforms a conventional method by at least several times in terms of hardware and processing time complexities. We expect that the proposed algorithm and design will extend the application areas of Beneš networks and stimulate parallel processing techniques implemented in hardware.