Low latency group-sorted QR decomposition algorithm for larger-scale MIMO systems

Sorted QR decomposition (SQRD) has been extensively adopted in various multiple-input-multiple-output (MIMO) detectors, but its sorting process incurs severe latency in larger-scale MIMO systems. This paper proposes a group-SQRD (GSQRD) algorithm to alleviate the latency problem of general SQRD architectures for larger-scale MIMO systems. By predictively sorting a group of 4 columns at one stage, the GSQRD reduces the processing latency by 41% when decomposing 16 × 16 complex-valued matrices, and this percentage rises to 68% for 128 × 128 matrices. To analyse the side effects, the GSQRD is applied to various MIMO detectors in a simulation link, which exhibits a negligible performance degradation for MIMO detection. Moreover, GSQRD is a hardware-friendly algorithm because its division and square root operations are converted to multiplications, which simplifies the hardware implementation. Based on this algorithm, two corresponding hardware architectures, which contain 2 and 4 columns respectively in a sorting group, are also implemented in 65-nm CMOS technology. These architectures can work at 513 MHz to decompose 16 × 16 complex-valued matrices. The processing latencies are 0.32 and 0.26 μs respectively, superior to the state-of-the-art designs.


INTRODUCTION
The MIMO technique is extensively studied in wireless communication for its high spectrum efficiency [1], and the larger-scale MIMO technique is regarded as a potential solution to the conflict between ever-rising data services and limited spectrum resources [2]. In MIMO systems, QR decomposition (QRD) is an essential component of various MIMO precoding and detection methods, from linear methods such as the zero-forcing (ZF) algorithm [3] and the minimum-mean-square-error (MMSE) algorithm [4,5] to tree-search methods such as the K-best algorithm [6,7]. Based on QRD, SQRD was proposed by D. Wubben et al. in [8]; it incorporates the sorting process into the QRD steps to generate sorted upper triangular matrices R. As the rear diagonal elements of R become larger after sorting in SQRD, the noise can be suppressed more effectively in MIMO detectors. Therefore, SQRD can lead to better bit error rate (BER) performance for various MIMO detectors than QRD, and the quality of SQRD directly impacts the BER performance and throughput of MIMO systems. Moreover, SQRD is also employed in lattice reduction (LR) techniques to help generate better-conditioned matrices and reduce the number of iterations [9,10]. Notably, LR is a significant preprocessing method for future MIMO detection, which remarkably improves the detection accuracy at an acceptable hardware cost. Therefore, a high-performance SQRD algorithm is significant for future MIMO systems.
Existing QRD/SQRD algorithms are mainly based on four methods: the Householder transformation (HT) method [11,12], the Gram-Schmidt (GS) method [13-15], the Givens rotation (GR) method [6,16,17], and the Cholesky method [7], of which the complexity is analysed in [7,18]. As the HT method is undesirable for hardware implementation, the practical QRD/SQRD architectures in wireless communication are mainly based on the remaining three methods. The GS method is known for its low processing latency (PL) because the matrices are processed column-wise. In the GS method, the norm value of each column is explicitly available during the whole decomposition and can be directly utilized as the sorting basis; therefore, the GS method is convenient for designing SQRD algorithms. However, the hardware overhead of the GS method is relatively high. In [19] and [20], a Permutation-Robust QRD (PR-QRD) is proposed to reduce the complexity of the GS method for MIMO detection. The GR method achieves a high clock frequency with the help of the coordinate rotation digital computer (CORDIC). In the GR method, the calculation of matrix Q is replaced by applying the same rotation operations as those for generating matrix R to the received signal vectors. However, the GR method suffers from long latency due to the CORDIC chains, especially when processing larger-scale matrices. The Cholesky method [7] combines low complexity with low latency, but it can only generate matrix R, so an additional circuit is required if matrix Q is explicitly needed. In each of these methods, SQRD is achieved by applying a sequence of sorting operations and unitary operations. Sorting operations swap the columns of the channel matrix H, while unitary operations nullify the lower-left elements of matrix H. In this paper, the unitary operations for decomposing one column of matrix H are regarded as a reduction stage (RS).
In addition, one sorting operation and one RS constitute a computation iteration, and N iterations constitute the whole SQRD process for N×N matrices.
The existing literature on SQRD mainly focuses on the RS module, while the sorting module is neglected because it is inconspicuous in small-scale MIMO systems. As the number of antennas increases, however, the PL of the sorting process becomes more obvious, because in every iteration the sorting process introduces a latency of at most N clock cycles. Taking the GS method as an example, and assuming that the input matrix H is generated column by column, the scheme of general SQRD can be demonstrated as Figure 1. As shown in Figure 1, an Initial module is typically utilized to calculate the norm value of each column. Then in each Ite_i module, the column flow is sorted by the sorting module, after which it is sent to the RS module to perform the nullifying process. The RS modules are the same for each iteration, taking about 6 clock cycles, while the sorting modules vary in PL across iterations. In the ith iteration, as depicted in Figure 1, the sorting module must wait for (N − i + 1) column vectors to determine the sorting result, thus its PL is (N − i + 1) clock cycles, where N represents the matrix size. Therefore, for the whole SQRD, the RS modules and sorting modules take about 6N and 0.5(N² + N) clock cycles, respectively. As the PL of the sorting modules increases with the square of N, the sorting modules play a predominant role in SQRD as N grows and lead to severe latency. It is estimated that when N equals 16 and 64, the sorting modules take 58% and 84% of the whole PL in SQRD, respectively, and when N rises to 128, this share goes up to 91%. The severe latency of the sorting modules has impeded the application of SQRD in massive MIMO systems. This paper aims at optimizing the sorting process to cut down the overall PL of larger-scale SQRD.
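The quoted shares can be checked with a short calculation. The sketch below is only an illustration of the cycle counts stated above (6 clock cycles per RS and (N − i + 1) clock cycles for the ith sorting module); the function name `sorting_share` is hypothetical.

```python
def sorting_share(n, rs_cycles=6):
    """Fraction of the general SQRD latency spent in the sorting modules.

    Assumes, as stated in the text, rs_cycles clock cycles per reduction
    stage and (n - i + 1) clock cycles for the i-th sorting module.
    """
    sorting = sum(n - i + 1 for i in range(1, n + 1))  # equals 0.5 * (n^2 + n)
    reduction = rs_cycles * n
    return sorting / (sorting + reduction)


for n in (16, 64, 128):
    print(n, f"{100 * sorting_share(n):.1f}%")
```

Under these assumptions the shares come out at roughly 58.6%, 84.4% and 91.5%, in line with the 58%, 84% and 91% quoted in the text.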
In this paper, a GSQRD algorithm is proposed based on the GS method, which predicts the sorting results ahead of time and thus cuts down the overall latency remarkably. In this algorithm, a group of columns with the minimum norms is selected at once and then swapped to the front of the matrix. Therefore, the following iterations can immediately start the RS procedures without waiting for the sorting results. The latency reduction efficiency of this strategy is analysed in terms of different matrix sizes and group configurations, which suggests that the GSQRD with a parameter of 4 could reduce the overall latency by approximately 41% when decomposing 16×16 complex-valued matrices. More importantly, this method works more efficiently on larger matrices, making GSQRD appropriate for massive MIMO applications. Since the sorting results in GSQRD are predicted ahead of time, false predictions inevitably cause some side effects. To evaluate these side effects on MIMO detection accuracy, GSQRD is applied as a preprocessing algorithm for various MIMO detectors and LR-aided MIMO detectors in a MIMO simulation link, and the corresponding performance is compared with that of general SQRD. These simulations are conducted for 16, 64 and 128 antennas, respectively, and indicate that the side effects are acceptable for the successive interference cancellation (SIC) and K-best MIMO detectors, and negligible for LR algorithms. Based on this algorithm, two corresponding hardware architectures are also implemented for 16×16 complex-valued matrices using 65-nm CMOS technology. In these architectures, the numbers of columns in a group are set to 2 (called G2SQRD) and 4 (called G4SQRD), respectively. Complicated division and square root operations are converted to multiplications by the reciprocal square root (RSR) modules to simplify the VLSI implementation.
The word length (WL) of the registers is determined by a fixed-point simulation to maintain sufficient precision for MIMO detection, and hardware reuse is employed to save area. Synthesis results show that both G2SQRD and G4SQRD achieve an excellent frequency performance of 513 MHz and decompose one 16 × 16 complex-valued matrix every 16 clocks. The PLs of G2SQRD and G4SQRD are 0.32 and 0.26 μs respectively, superior to the state-of-the-art designs.
The rest of this paper is organized as follows. Section 2 briefly introduces the general SQRD algorithm and specifies the proposed GSQRD algorithm, as well as the latency reduction analysis. Section 3 presents the software simulation of the side effects of the GSQRD algorithm on various MIMO detectors and the LR algorithm. Section 4 demonstrates two hardware architectures for GSQRD. Section 5 illustrates the implementation results and their comparisons with the state-of-the-art designs. Finally, Section 6 draws the conclusions. In this paper, a bold uppercase letter such as A denotes a matrix. A bold lowercase a_j, also written a⃗_j, denotes the jth column vector of matrix A. a_ij denotes the element in the ith row and jth column of matrix A, and a_(m:n, j) denotes the elements from the mth row to the nth row of the jth column. In addition, (⋅)^H and (⋅)^−1 denote the conjugate transposition and inversion, respectively. |⋅| denotes the norm of a complex-valued element or vector. The PL is defined as the period between starting a matrix input and outputting the last result data of this matrix.
ALGORITHM 1 General SQRD algorithm based on GS method (line 7: swap the ith and mth columns in A, R, P, norm)

General SQRD algorithm
In MIMO systems, QRD is widely utilised as the preprocessing algorithm to decompose an estimated channel matrix H into a unitary matrix Q and an upper triangular matrix R. The majority of QRD algorithms are based on the GS method, the GR method, the Householder method and the Cholesky algorithm, of which the complexity is analysed in [18]. Based on QRD, SQRD incorporates the sorting process (lines 6 and 7 of Algorithm 1) into plain QRD to decompose a matrix H as (1), where P is a column permutation matrix.
Taking the GS method as an example, the general SQRD algorithm is demonstrated in Algorithm 1. As shown in Algorithm 1, the channel matrix H is decomposed column by column, and the result matrix R is generated row by row. In each iteration, the column with the minimum norm is first swapped to the front for the current decomposition, making the relevant diagonal element r_ii as small as possible. As the matrix H is fixed, the product of the diagonal elements of R is a constant. Therefore, the remaining diagonal elements after r_ii tend to be larger, meaning that matrix R is better sorted. For MIMO detection, a larger diagonal element r_ii leads to better immunity to noise and signal interference [7], so SQRD helps MIMO detectors improve the detection accuracy beyond that of plain QRD. Besides, the sorted matrices R help the LR algorithm generate better-conditioned matrices [9,10] and simultaneously reduce the number of LR iterations. Therefore, compared with plain QRD, SQRD also brings more performance gain for LR-aided MIMO systems while reducing the system complexity. However, SQRD suffers from a longer PL than QRD. As shown in Algorithm 1, in the ith iteration there are (N − i + 1) columns to be compared before decomposition (line 6). Assuming that the matrices from the prior iteration are generated column by column, the sorting module must wait at least (N − i + 1) clock cycles before determining the comparison result. According to this deduction, all sorting operations in SQRD cause an extra PL of 0.5(N² + N) clock cycles in total compared with QRD. Since this PL increases with the square of N, the severe PL would impede the application of SQRD in future communication as N grows in massive MIMO situations.
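The procedure described above can be sketched in a few lines. This is a minimal floating-point illustration of GS-based SQRD as described in the text, not a transcription of Algorithm 1; the function name `sqrd_gs`, the list-of-rows input format and the 0-based indexing are assumptions made for the example.

```python
import math
import random


def sqrd_gs(H):
    """Sorted QR decomposition by modified Gram-Schmidt (illustrative sketch).

    H is a list of rows of complex numbers. Returns (Q, R, perm), where Q is
    a list of orthonormal columns, R is upper triangular (list of rows) and
    perm[c] is the original index of the column placed at position c.
    """
    n_rows, n = len(H), len(H[0])
    A = [[H[r][c] for r in range(n_rows)] for c in range(n)]  # column-wise copy
    R = [[0j] * n for _ in range(n)]
    perm = list(range(n))
    norm = [sum(abs(v) ** 2 for v in col) for col in A]  # squared column norms
    for i in range(n):
        # Sorting step: bring the minimum-norm remaining column to position i.
        m = min(range(i, n), key=lambda j: norm[j])
        A[i], A[m] = A[m], A[i]
        norm[i], norm[m] = norm[m], norm[i]
        perm[i], perm[m] = perm[m], perm[i]
        for row in R:
            row[i], row[m] = row[m], row[i]
        # Reduction stage: normalise column i and remove it from the rest.
        r_ii = math.sqrt(norm[i])
        R[i][i] = r_ii
        A[i] = [v / r_ii for v in A[i]]
        for j in range(i + 1, n):
            r_ij = sum(q.conjugate() * a for q, a in zip(A[i], A[j]))
            R[i][j] = r_ij
            A[j] = [a - r_ij * q for a, q in zip(A[j], A[i])]
            norm[j] -= abs(r_ij) ** 2
    return A, R, perm


random.seed(0)
H = [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(4)]
     for _ in range(4)]
Q, R, perm = sqrd_gs(H)
print("column order:", perm)
```

Each pass of the outer loop needs the norms of all remaining columns before it can pick `m`, which is the source of the (N − i + 1)-cycle wait discussed above.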

Proposed GSQRD algorithm
In the general SQRD algorithm of Algorithm 1, the ith iteration can be simplified as a sorting operation followed by subtracting the vector component in the direction of a⃗_i from the rear vectors {a⃗_j}_i (j = i + 1 : N), where {⋅}_i denotes the signals in the ith iteration. Because each vector of {a⃗_j}_i has at most N components and every component is random, subtracting one component from all {a⃗_j}_i has little influence on the sorting sequence of {a⃗_j}_i when N is large. Therefore, the sorting sequence of {a⃗_j}_i before the subtraction (line 12) can be utilised to predict the sorting result (namely m on line 6) for the (i + 1)th iteration. Thus the sorting operation (lines 6 and 7) of the (i + 1)th iteration can be done ahead of time and the corresponding PL is hidden. Even if the prediction fails, meaning that the vector chosen from {a⃗_j}_(i+1) is not the one with the minimum norm, its norm still tends, statistically, to be smaller than that of most vectors in {a⃗_j}_(i+1). Therefore, a false prediction has little impact on the sorting quality. Furthermore, this prediction can also be extended to the (i + 2)th, (i + 3)th and later iterations, which means that one sorting module can select a group of columns with the minimum norms for several following iterations. Using the above prediction, the GSQRD is proposed as Algorithm 2. In GSQRD, one sorting operation selects the column with the minimum norm for the current decomposition and simultaneously predicts the sorting sequence for the following (g − 1) iterations, as listed on lines 8 and 9. Therefore, the sorting operations of the next (g − 1) iterations can be done ahead of time, and the corresponding PL is hidden.
In this paper, the columns that are selected at one time are regarded as a group, and the parameter g in Algorithm 2 represents the number of columns in a group. Additionally, the GSQRD with parameter g is represented as GgSQRD.
ALGORITHM 2 Proposed GSQRD algorithm (sorting step: swap the (m_1, m_2, …, m_g)th columns to the front of the (i∼N)th columns in A, R, P, norm)
In GSQRD, to further improve the parallelism, the steps 8-12 of Algorithm 1 are transformed as (2) in Algorithm 2, as listed on line 14-24.
In Algorithm 2, the RSR functions as (3), which converts the dividing and square root operations of Algorithm 1 into multiplications to decrease the complexity for hardware implementation.
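A rough software model of the grouped sorting and the multiplication-only reduction stage is given below. It is a sketch built from the description in this section, not a reproduction of Algorithm 2: the function name `gsqrd`, the 0-based indexing and the choice to keep the unselected columns in their incoming order are assumptions, and `1.0 / math.sqrt(...)` merely stands in for the RSR module.

```python
import math
import random


def gsqrd(H, g):
    """Group-sorted QR decomposition sketch; g = 1 gives the general SQRD.

    H is a list of rows of complex numbers. Returns (Q, R, perm) with Q as a
    list of columns, R as a list of rows and perm as the column order.
    """
    n_rows, n = len(H), len(H[0])
    A = [[H[r][c] for r in range(n_rows)] for c in range(n)]
    R = [[0j] * n for _ in range(n)]
    perm = list(range(n))
    norm = [sum(abs(v) ** 2 for v in col) for col in A]
    for i in range(n):
        if i % g == 0:
            # One sorting operation per group: move the g smallest-norm
            # columns among i..n-1 to the front, in ascending norm order.
            chosen = sorted(range(i, n), key=lambda j: norm[j])[:g]
            order = chosen + [j for j in range(i, n) if j not in chosen]
            A[i:] = [A[j] for j in order]
            norm[i:] = [norm[j] for j in order]
            perm[i:] = [perm[j] for j in order]
            for row in R:
                row[i:] = [row[j] for j in order]
        # Reduction stage using only multiplications once rsr is known.
        rsr = 1.0 / math.sqrt(norm[i])      # stand-in for the RSR module
        R[i][i] = norm[i] * rsr             # sqrt(norm) as norm * rsr
        a_i = A[i]
        q_i = [v * rsr for v in a_i]
        for j in range(i + 1, n):
            pt = sum(x.conjugate() * y for x, y in zip(a_i, A[j]))
            r_ij = pt * rsr
            R[i][j] = r_ij
            A[j] = [y - r_ij * x for y, x in zip(A[j], q_i)]
            norm[j] -= r_ij.real ** 2 + r_ij.imag ** 2
        A[i] = q_i
    return A, R, perm


random.seed(0)
H = [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(8)]
     for _ in range(8)]
for g in (1, 2, 4):
    print(g, gsqrd(H, g)[2])
```

Only the iterations with `i % g == 0` perform a sorting operation, so the remaining iterations can start their reduction stage immediately; the printed column orders show how the predicted order may differ from the exact order obtained with g = 1.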
In the RSR module, the variable z is first scaled to the left-closed, right-open interval [1, 4) by shifting it by 2n bits as (4), and the scaled value is denoted x.
Then 1∕ √ x is approximated via the first-order Taylor expansion at 48 points of interval [1,4) as (5), where a 1 and a 0 denote the Taylor coefficients corresponding to the specific expanding point. In hardware implementation, the values of a 1 and a 0 for various expansion points can be stored in a look up table (LUT).
Finally, the approximate value y n is operated by Newton-Raphson iteration [21] to improve its accuracy as (6).
The Newton-Raphson iteration of (6) is derived from the function defined as (7). According to (5), the value of this function at y_n is close to 0. Thus the operation of (8) brings its value at y_(n+1) closer to 0, which means that y_(n+1) is closer to the accurate value of 1/√x.
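The three steps of the RSR module (scaling, table-based first-order approximation, Newton-Raphson refinement) can be modelled in floating point as follows. This is a sketch under stated assumptions: the 48 expansion points are placed at the centres of 48 equal segments of [1, 4), the Newton-Raphson step is the standard one for the reciprocal square root, y ← y(3 − x·y²)/2, and the names `LUT` and `rsr` are hypothetical. The equations (3) to (8) themselves are not reproduced here.

```python
SEGMENTS = 48
WIDTH = 3.0 / SEGMENTS  # [1, 4) split into 48 equal segments (assumption)

# Look-up table of first-order Taylor coefficients (a1, a0) of 1/sqrt(x),
# expanded at the centre c of each segment: 1/sqrt(x) ~= a1 * x + a0.
LUT = []
for k in range(SEGMENTS):
    c = 1.0 + (k + 0.5) * WIDTH
    LUT.append((-0.5 * c ** -1.5, 1.5 * c ** -0.5))


def rsr(z, newton_steps=1):
    """Approximate 1/sqrt(z) for z > 0 without division or square root."""
    if z <= 0:
        raise ValueError("z must be positive")
    # Step 1: scale z into [1, 4) by powers of 4 (a shift by 2n bits).
    x, n = z, 0
    while x >= 4.0:
        x *= 0.25
        n += 1
    while x < 1.0:
        x *= 4.0
        n -= 1
    # Step 2: first-order approximation from the look-up table.
    a1, a0 = LUT[min(int((x - 1.0) / WIDTH), SEGMENTS - 1)]
    y = a1 * x + a0
    # Step 3: Newton-Raphson refinement, multiplications only.
    for _ in range(newton_steps):
        y = 0.5 * y * (3.0 - x * y * y)
    # Undo the scaling: 1/sqrt(z) = 2^(-n) / sqrt(x).
    return y * 2.0 ** (-n)


print(rsr(2.0), 2.0 ** -0.5)
```

With this table layout the first-order estimate is accurate to a few parts in 10⁴, and a single Newton-Raphson step roughly squares that error.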

Latency reduction analysis
To estimate the latency reduction efficiency of GSQRD, the PLs of general SQRD and the proposed GSQRD are calculated and compared for different antenna numbers and parameters. In this calculation, the input matrices H are assumed to be N×N complex-valued matrices that are generated column by column. For the general SQRD algorithm, in the ith iteration, the RS module takes 6 clock cycles while the sorting module takes (N − i + 1) clock cycles, as exhibited in the hardware design below. Thus the total PL of general SQRD can be represented as (9).
For the GSQRD algorithm with a parameter g, the PLs of RS modules are the same as in general SQRD, while the PLs of sorting modules are accumulated every g iterations. Therefore, the total PL of GSQRD can be represented as (10). Comparing (9) and (10), the latency reduction efficiency of GSQRD can be defined as (11), which is intuitively presented in Figure 2.
In Figure 2, the x axis represents the matrix size, and the y axis represents the PL reduction efficiency of GSQRD. GgSQRD indicates the GSQRD algorithm with a parameter g. Figure 2 shows that the PL reduction efficiency increases with the matrix size and that a larger parameter g leads to a better PL reduction efficiency. The blue line shows that in the G2SQRD case, even if the matrix size is as small as 16, the proposed GSQRD algorithm reduces the PL by a considerable 28% compared with general SQRD. When the matrix size goes up to 128, this percentage rises to 45%. More importantly, in the LR-aided MIMO detections demonstrated below, a larger parameter g is tolerable, indicating that a notable PL reduction can be realized with few side effects. For example, with G4SQRD (the pink line), when N reaches 128, the GSQRD algorithm eliminates up to 68% of the overall PL of general SQRD, which would provide a noteworthy benefit for massive MIMO systems.
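The percentages read from Figure 2 can be reproduced with a simple cycle-count model. The sketch below is one reading of the description of (9) to (11), assuming 6 clock cycles per RS, (N − i + 1) clock cycles for a sorting module at iteration i, and one sorting module every g iterations in GSQRD; the function names are hypothetical.

```python
def sqrd_latency(n, rs_cycles=6):
    """Total PL of general SQRD in clock cycles (sorting in every iteration)."""
    return sum(rs_cycles + (n - i + 1) for i in range(1, n + 1))


def gsqrd_latency(n, g, rs_cycles=6):
    """Total PL of GgSQRD in clock cycles (sorting once every g iterations)."""
    sorting = sum(n - i + 1 for i in range(1, n + 1, g))
    return rs_cycles * n + sorting


def latency_reduction(n, g):
    """Relative PL saved by GgSQRD with respect to general SQRD."""
    return 1.0 - gsqrd_latency(n, g) / sqrd_latency(n)


for n in (16, 128):
    for g in (2, 4):
        print(f"N={n}, g={g}: {100 * latency_reduction(n, g):.1f}%")
```

Under these assumptions the model gives about 27.6% and 41.4% for N = 16 (g = 2 and 4) and about 45.4% and 68.1% for N = 128, which agrees with the 28%, 41%, 45% and 68% quoted in the text.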
To sum up, the proposed GSQRD has two main advantages: (1) As the PL of some sorting modules is eliminated, GSQRD is a promising algorithm for decreasing the PL of SQRD. Furthermore, this PL reduction becomes more pronounced for larger matrices, making GSQRD appropriate for massive MIMO situations. (2) As the division and square root operations are all converted to multiplications by the RSR module, the proposed GSQRD is a hardware-friendly algorithm that achieves a high clock frequency and thus improves the throughput. Note that although the GSQRD algorithm is presented for N × N matrices, it is also applicable to rectangular matrices. Taking an N_r × N_t (N_r > N_t) matrix as an example, the last (N_r − N_t) columns are simply set to zero and their norms are initialised to a large number, so that the matrix can be processed by the GSQRD algorithm.
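The padding of a rectangular matrix can be illustrated as follows. This is only a sketch of the idea stated above; the helper name `pad_tall_matrix` and the constant used as the "large number" are assumptions.

```python
def pad_tall_matrix(H, big=1e30):
    """Pad an N_r x N_t matrix (N_r > N_t) with zero columns to N_r x N_r.

    Returns the padded matrix (list of rows) and the initial squared column
    norms, where the padded columns get the large value `big` so that the
    sorting never selects them before a real column.
    """
    n_r, n_t = len(H), len(H[0])
    padded = [list(row) + [0j] * (n_r - n_t) for row in H]
    norms = [sum(abs(H[r][c]) ** 2 for r in range(n_r)) for c in range(n_t)]
    norms += [big] * (n_r - n_t)
    return padded, norms


H = [[1 + 1j, 2], [0, 1j], [3, -1]]
padded, norms = pad_tall_matrix(H)
print(norms)
```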

SOFTWARE PERFORMANCE SIMULATION
In GSQRD, the sorting results are predicted ahead of time, thus false predictions inevitably cause some performance loss for MIMO detection. To estimate the side effects, a 16×16 MIMO simulation link is designed as Figure 3, in which GSQRD is applied to test its impact on the BER performance. As shown in Figure 3, a 64-QAM modulation scheme and a rate-1/2 industry-standard convolutional code with a [133 171] polynomial are employed in the link, along with an interleaver. The coding is performed over 160 symbols, and each simulation is conducted for 100,000 frames. The channel is assumed to exhibit Rayleigh fading, and the channel matrix H is assumed to be accurately estimated. In the simulation, the channel matrix H is composed of complex-valued elements drawn from a normal distribution with mean 0 and variance 0.5. GSQRD is applied as a preprocessing algorithm in the simulation, together with the general SQRD algorithm for comparison. Various MIMO detectors are employed to test the side effects of GSQRD on them. Moreover, GSQRD is also combined with the LR algorithm in the simulation to test the impact on LR-aided MIMO detection. Finally, to confirm that the proposed GSQRD maintains its properties in larger-scale MIMO situations, the simulation is also conducted for higher-order antenna settings, such as 64×64 and 128×128 MIMO systems. The simulation results are analysed in the following subsections.
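A channel matrix of the kind described above can be generated as in the sketch below. It assumes that the variance of 0.5 applies separately to the real and imaginary parts, so that each element has unit average power; the function name `rayleigh_channel` and the seed are illustrative only.

```python
import math
import random


def rayleigh_channel(n_rx, n_tx, rng):
    """Draw an n_rx x n_tx channel matrix with i.i.d. complex Gaussian entries.

    Real and imaginary parts are N(0, 0.5) each (assumption), which gives
    unit average power per element, as in a Rayleigh fading channel.
    """
    sigma = math.sqrt(0.5)
    return [[complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
             for _ in range(n_tx)]
            for _ in range(n_rx)]


H = rayleigh_channel(16, 16, random.Random(0))
print(len(H), len(H[0]))
```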

The effects of GSQRD on SIC and K-best MIMO detectors
To test the effects of the GSQRD algorithm on various MIMO detectors, the proposed GSQRD is applied as the preprocessing algorithm, together with general SQRD, in a 16×16 MIMO simulation link. In this link, SIC and K-best detectors are adopted in turn to test the BER performance. The system model of this link is defined as (12), where x represents the symbol vector from the 64-QAM constellation and H denotes the channel matrix. In this link, the SQRD operation of (1) converts the system model to (13). Thus ỹ, R, and s are the useful information for MIMO detectors.
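The conversion can be written out step by step. The derivation below assumes that (1) has the usual sorted-QRD form HP = QR with a permutation matrix P and a unitary Q, and that the noise vector is denoted n; these symbols are assumptions where they are not given in the text.

```latex
\begin{aligned}
\mathbf{y} &= \mathbf{H}\mathbf{x} + \mathbf{n}, \\
\mathbf{H}\mathbf{P} = \mathbf{Q}\mathbf{R}
  \;&\Rightarrow\; \mathbf{H} = \mathbf{Q}\mathbf{R}\mathbf{P}^{T}
  \quad (\mathbf{P}^{-1} = \mathbf{P}^{T}), \\
\tilde{\mathbf{y}} = \mathbf{Q}^{H}\mathbf{y}
  &= \mathbf{R}\,\mathbf{s} + \mathbf{Q}^{H}\mathbf{n},
  \qquad \mathbf{s} = \mathbf{P}^{T}\mathbf{x}.
\end{aligned}
```

Under this reading, s is the transmitted vector in the sorted order, and the original order is recovered as x = Ps after detection.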
Based on (13), the SIC MIMO detector in the simulation is designed as Algorithm 3, where N = 16. In each iteration of Algorithm 3, the signal from one antenna is detected and the corresponding interference is cancelled from the other antennas. The proposed GSQRD and general SQRD algorithms are adopted in turn in step 3 of Algorithm 3, and the corresponding BER performances of MIMO detection are illustrated in Figure 5 with solid lines. The K-best MIMO detector with K = 10 used in the simulation is demonstrated in Figure 4; it utilizes the distributed sorting method [22] to select the K best child nodes at each stage. As shown in Figure 4, the K-best detector consists of 16 stages to solve the output signal ŝ for a 16×16 MIMO system. First, the unconstrained solution of ŝ_16 is calculated as (14):
$\hat{s}_{16} = \tilde{y}_{16} / r_{16,16}$ (14)
Then, the four points nearest to ŝ_16 on the 64-QAM constellation are selected as the child nodes of the 1st stage. Stages 2 to 15 are similar in processing. In each of these stages, the child nodes of the prior stage are first regarded as the parent nodes of the current stage, and an interference cancellation process is performed for these parent nodes. Then, each of these parent nodes is expanded to obtain 4 candidate child nodes, and all candidate child nodes are sorted to select the 10 best child nodes with the smallest partial Euclidean distances (PED) for the current stage. Stage 16 differs from stages 2 to 15 in that it selects only the one child node with the smallest PED, whose path is regarded as the solution vector ŝ. Various SQRD algorithms are adopted for the K-best detector and the corresponding BER performances are illustrated in Figure 5 with dashed lines.
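The SIC principle described above amounts to back-substitution with a hard decision at each layer. The sketch below illustrates that principle only and is not a transcription of Algorithm 3; it assumes an unnormalised 64-QAM constellation with levels ±1, ±3, ±5, ±7 per dimension, and the names `slice_64qam` and `sic_detect` are hypothetical.

```python
LEVELS = (-7, -5, -3, -1, 1, 3, 5, 7)  # unnormalised 64-QAM levels (assumption)


def slice_64qam(v):
    """Hard decision: nearest 64-QAM point to the complex value v."""
    re = min(LEVELS, key=lambda level: abs(v.real - level))
    im = min(LEVELS, key=lambda level: abs(v.imag - level))
    return complex(re, im)


def sic_detect(R, y_tilde):
    """Detect s from y_tilde = R s + noise, starting from the last layer."""
    n = len(R)
    s_hat = [0j] * n
    for i in range(n - 1, -1, -1):
        # Cancel the interference of the layers that are already detected.
        interference = sum(R[i][j] * s_hat[j] for j in range(i + 1, n))
        s_hat[i] = slice_64qam((y_tilde[i] - interference) / R[i][i])
    return s_hat


R = [[2.0, 0.5 + 0.5j, -1j],
     [0.0, 1.5, 0.25],
     [0.0, 0.0, 1.0]]
s = [3 - 5j, -7 + 1j, 1 + 7j]
y_tilde = [sum(R[i][j] * s[j] for j in range(3)) for i in range(3)]
print(sic_detect(R, y_tilde))
```

Because the last layers are detected first and their decisions propagate upwards, larger rear diagonal elements of R directly reduce error propagation, which is why the sorting quality matters for this detector.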
In Figure 5, the performances of the SIC (solid lines) and K-best (dashed lines) MIMO detectors are demonstrated to analyse the effects of GSQRD on various MIMO detectors. For each detector, the performance of GSQRD with various parameters g is shown together with that of general SQRD. Compared with general SQRD, GSQRD indeed causes some performance degradation for MIMO detection, and a larger parameter g leads to more severe side effects. To quantify the impact, the BER in Figure 5 is fixed at 10⁻⁴, and the side effect is quantified as the corresponding SNR difference between GSQRD and general SQRD. By this standard, Figure 5 shows that G2SQRD leads to side effects of 0.28 and 0.48 dB on the K-best and SIC MIMO detectors, respectively, and that G4SQRD causes side effects of approximately 0.6 and 1.3 dB on the K-best and SIC detectors, respectively. Considering that G2SQRD reduces the PL by 28%, as shown in Figure 2, the degradation of 0.28 dB is acceptable for MIMO systems. More importantly, since the PL reduction efficiency of GSQRD grows while the side effects remain similar for larger-scale MIMO systems, as demonstrated in the following simulation, the proposed GSQRD algorithm provides more benefit for larger-scale MIMO systems.

The effects of GSQRD on LR algorithm
LR is a favourable preprocessing technique in MIMO systems for achieving high performance with polynomial complexity [23]. In the LR algorithm, SQRD is extensively employed because, compared with plain QRD, it helps generate better-conditioned matrices and simultaneously decreases the number of iterations. Therefore, evaluating the effects of GSQRD on the LR algorithm is necessary for the application of GSQRD. Since the majority of existing LR algorithms in MIMO systems are modified from the Lenstra-Lenstra-Lovász (LLL) algorithm [24], LLL is also employed as the LR algorithm in this paper. Generally, the combination of SQRD and LR algorithms can be expressed as (15), and the details of the LR algorithm and its parameter used in the simulation are depicted in Algorithm 4.
In Algorithm 4, GSQRD is applied in step 4, and it mainly has two influences on the LR algorithm: (1) an influence on the quality of R_L, and (2) an influence on the number of iterations. The first influence can be exhibited by sending R_L to MIMO detectors and observing the BER performance, while the second influence can be described by the average number of iterations of Algorithm 4.
First, the impact on the performance of LR is simulated. In the simulation, the LR algorithms based on GSQRD and general SQRD are employed in turn as the preprocessing algorithm for a K-best MIMO detector, and the corresponding BER performances are presented in Figure 6 for comparison. In Figure 6(a), the parameter of Algorithm 4 is set to 0.5, while in Figure 6(b) it is set to 0.25. According to these figures, the LR algorithm with a smaller parameter is more sensitive to the side effects of GSQRD. In Figure 6(b), G2SQRD causes an SNR degradation of 0.4 dB compared with general SQRD at a BER of 10⁻⁴, whereas the side effect of G2SQRD in Figure 6(a) is almost negligible. Notably, G8SQRD in Figure 6(a) causes an SNR degradation of merely 0.28 dB, which indicates a remarkable PL reduction with few side effects. As for the parameter itself, it theoretically lies between 0.25 and 1 in the LLL algorithm, and a larger value always leads to better MIMO performance and simultaneously to more iterations. In practice, the parameter is generally set to 0.5 because this is a preferable trade-off between performance and complexity, and because the multiplications with it in Algorithm 4 can then be simplified to shift operations. When the parameter equals 0.5, as in Figure 6(a), the proposed GSQRD algorithm provides a notable benefit with few side effects for LR-aided MIMO systems.
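The role of the parameter can be illustrated with the swap test that LLL-type algorithms commonly apply to the triangular factor. The sketch below uses the widely used form of the Lovász condition on R and is not taken from Algorithm 4; the names `lovasz_swap_needed` and `delta`, the 0-based column index `k` and the fixed-point example are assumptions.

```python
def lovasz_swap_needed(R, k, delta=0.5):
    """Return True if columns k-1 and k of R violate the Lovasz condition.

    Uses the common form delta * |r[k-1][k-1]|^2 > |r[k][k]|^2 + |r[k-1][k]|^2
    with 0-based k >= 1. A larger delta triggers swaps more often, which
    costs more iterations.
    """
    lhs = delta * abs(R[k - 1][k - 1]) ** 2
    rhs = abs(R[k][k]) ** 2 + abs(R[k - 1][k]) ** 2
    return lhs > rhs


# With delta = 0.5, the multiplication becomes a one-bit right shift in
# fixed-point arithmetic, for example on an unsigned integer power value:
power_fixed_point = 0b1011000          # hypothetical register content
half_power = power_fixed_point >> 1    # same as multiplying by 0.5

print(lovasz_swap_needed([[2.0, 0.5], [0.0, 1.0]], 1))
print(lovasz_swap_needed([[1.0, 0.5], [0.0, 1.0]], 1))
```

A well-sorted R from SQRD tends to satisfy this condition for more column pairs from the start, which is consistent with the reduced number of LR iterations reported in the text.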
To test the influence of GSQRD on the complexity of the LR algorithm, the LR algorithms based on GSQRD and general SQRD are each utilized to process 10,000 random 16×16 channel matrices H, and the average number of LR iterations for each case is depicted in Figure 6(c). In this experiment, the elements of H are drawn from a normal distribution with mean 0 and variance 0.5. As shown in Figure 6(c), the side effects of GSQRD are negligible when the parameter is smaller than 0.65. Considering that the parameter is generally set around 0.5, GSQRD fully satisfies practical applications.
To sum up, the proposed GSQRD has negligible side effects on the performance and complexity of the LR algorithm when the parameter is set around 0.5. The side effects on the performance of LR become noticeable only when the parameter is as small as 0.25, and the side effects on the complexity of LR become visible only when the parameter is larger than 0.65. Considering that the parameter is generally set around 0.5 in practice, the proposed GSQRD causes negligible side effects on LR-aided MIMO systems while reducing the PL.

The effects of GSQRD on larger-scale MIMO situations
All the simulations presented above are based on 16×16 MIMO systems and indicate that the proposed GSQRD algorithm reduces the overall PL with small side effects on MIMO detectors and negligible side effects on the LR algorithm. To confirm that GSQRD maintains its properties for higher-order MIMO systems, Figure 7 shows the simulation results for different antenna configurations such as 64×64 and 128×128. According to Figure 7(a), in 64×64 MIMO systems, the G2SQRD-based SIC and K-best detectors suffer from extra SNR degradations of 0.3 and 0.32 dB respectively, which are almost the same as in 16×16 MIMO situations. In addition, Figure 7(b,c) demonstrates that the side effects of GSQRD on the LR algorithm are also negligible for 128×128 MIMO systems. Therefore, the proposed GSQRD algorithm is well suited to higher-order MIMO situations.

HARDWARE ARCHITECTURE
To evaluate the frequency performance of the GSQRD algorithm, two hardware architectures for G2SQRD and G4SQRD are implemented in 65-nm CMOS technology. As a concrete example, the VLSI architectures are designed for 16×16 MIMO systems, and the methods for other larger-scale MIMO systems are similar. In the hardware, fixed-point is employed as the data format, and the word length (WL) of each register is chosen to ensure the same BER performance as the double-precision floating-point model. Figure 8 shows the top-level scheme of these architectures. As shown in Figure 8, each RS unit decomposes one column of H. Therefore, after each RS unit, the number of valid columns to be processed declines by one. For example, for the RSi unit, there are (17 − i) input columns in total to be resolved. After these inputs, there are (i − 1) empty clocks without valid inputs. According to this deduction, for RS9 there are 8 clocks of valid inputs, which are followed by 8 clocks of empty inputs. In order to increase the hardware efficiency, the outputs of RS9 are sent back to fill the empty clocks, meaning that RS9 and RS10 share one RS hardware module. In the same manner, after RS10, RS(2n − 1) and RS(2n) also share one RS hardware module. Thus 4 RS hardware modules can be saved, which constitute approximately 25% of the total area. The processing and architectural details of each unit are discussed in the following subsections.
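The slot counting behind this hardware reuse can be checked with a few lines. The sketch only restates the counts given above, (17 − i) valid input columns and (i − 1) empty clocks for RSi in a 16-clock window; the variable and function names are hypothetical.

```python
N = 16  # matrix size, one column per clock


def valid_inputs(i, n=N):
    """Number of valid input columns of unit RS_i (1-based i)."""
    return n + 1 - i


def empty_clocks(i, n=N):
    """Number of idle clocks of unit RS_i within an n-clock window."""
    return i - 1


# RS(2k-1) can host RS(2k) when the latter's inputs fit into its idle clocks.
shared_pairs = [(a, a + 1) for a in range(9, N, 2)
                if valid_inputs(a + 1) <= empty_clocks(a)]
modules_needed = N - len(shared_pairs)

print(shared_pairs)
print(modules_needed, "RS modules instead of", N)
```

Before RS9 the idle clocks of a unit are too few to hold the inputs of the next unit (for example, RS7 has 6 idle clocks but RS8 needs 9), which is why the sharing starts at RS9.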

Initial unit
The goal of the initial unit is to calculate the norm of each column of the input matrices H. In the initial unit, as shown in Figure 9, the signal vector a⃗_i is connected to column h⃗_i, and a_(k,i) is a complex-valued element of a⃗_i. The initial unit takes 2 clocks to process one column. In the first clock, all the elements of a⃗_i are processed by the PE-A units simultaneously to generate the element norms. In the second clock, these element norms are accumulated by a tree adder to calculate the vector norm. In addition, each signal a⃗_i is delayed by two clocks before being output in order to align it with its norm.
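The two steps of the initial unit can be modelled as below. This is a behavioural sketch only: it assumes that the "norm" handled by the hardware is the squared magnitude (the square root is taken later through the RSR module), and the function names are hypothetical.

```python
def element_norms(column):
    """First clock: squared magnitude of every element (one PE-A per element)."""
    return [v.real * v.real + v.imag * v.imag for v in column]


def tree_add(values):
    """Second clock: pairwise (tree) accumulation of the element norms."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0.0)  # pad odd levels so every adder has two inputs
        level = [level[k] + level[k + 1] for k in range(0, len(level), 2)]
    return level[0]


column = [complex(k, -k) for k in range(16)]
print(tree_add(element_norms(column)))
```

For 16 elements the tree has four adder levels, compared with 15 sequential additions in a chain.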

Sorting unit
After the initial unit, the columns of matrix A are successively transferred to the sorting unit together with the corresponding norms. The goal of the sorting unit is to select a group of columns with the minimum norms and swap them to the front for decomposition, as listed on lines 7 to 12 of Algorithm 2. The architectures of the sorting units for G2SQRD and G4SQRD are similar, and the case of G2SQRD is taken as an example to depict the block details in Figure 10. In G2SQRD, all the sorting units have the same structure except for the number of Reg-A registers, as shown in Figure 10. Since the sorting processes for the matrices A, P, R and norm of Algorithm 2 work in the same manner, the signals A, P and R are omitted and only the architecture for sorting norm is exhibited in the figure. As shown in Figure 10, the Reg-A register chain is first utilised to store the successive inputs of vector norms. Meanwhile, each norm_i is sent to the comparator (CMP) module to be compared with the current minimum norms, which are stored in the two Reg-C registers. If norm_i is smaller than either of the Reg-C values, one Reg-C is updated with norm_i through multiplexer Mul-1 or Mul-2. After all the valid norms have arrived, the two minimum norms are held in the Reg-C registers, and the output sequence is determined. During the first two output clock cycles, the minimum norms in the Reg-C registers are selected successively by Mul-3 and sent out by Mul-4. Meanwhile, two Reg-B registers are utilized to buffer the shifted data of Reg-A. After the minimum norms have been output, the address of Mul-4 is set to 0 for outputting the other norms. If the norm currently selected by Mul-4 has already been output from Reg-C, the address of Mul-4 is increased by 1 to select the next register. The specific PL of a sorting module depends on its location in the top scheme.
For example, if a sorting unit is located before RSi of Figure 8(a), since there are (16 − i + 1) valid columns to be sorted, the total latency is (16 − i + 1) clock cycles. So far, the description is about G2SQRD. As for G4SQRD, the number of Reg-B and Reg-C registers is 4, and other details of sorting units in G4SQRD are the same as that of G2SQRD.
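The streaming selection performed by the G2SQRD sorting unit can be illustrated with a short behavioural sketch. It is an assumption-laden model rather than the circuit itself: the list best stands in for the two Reg-C registers, the function names are hypothetical, and ties are resolved in favour of the earlier column, which the text does not specify.

```python
def select_two_smallest(norms):
    """Track the two smallest norms while the inputs stream in (Reg-C model)."""
    best = [None, None]             # (norm, index) pairs, smallest first
    for idx, n in enumerate(norms):
        if best[0] is None or n < best[0][0]:
            best = [(n, idx), best[0]]
        elif best[1] is None or n < best[1][0]:
            best[1] = (n, idx)
    return [b[1] for b in best if b is not None]


def output_order(norms):
    """Minimum-norm columns first, then the rest in their original order."""
    first = select_two_smallest(norms)
    rest = [i for i in range(len(norms)) if i not in first]
    return first + rest


order = output_order([5.0, 2.0, 7.0, 1.0, 3.0])
```

The same permutation would be applied to the columns of A, P and R, since the text states that they are sorted in the same manner as the norms.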

Reduction stage (RS) unit
The goals of an RS unit are to decompose a matrix column and to update the subsequent columns of matrix A, as listed on lines 14 to 24 of Algorithm 2. The RS units in G2SQRD and G4SQRD are the same, and RSi is taken as the example to depict the architectural details in Figure 11. In Figure 11, $\vec{a}_{i:16}$ and $norm_{i:16}$ represent the successive input columns of matrix A and the corresponding norms, and $\vec{a}^{dly3}_{i:16}$ represents $\vec{a}_{i:16}$ delayed by three clock cycles. Blue registers are used as memories, while the green blocks indicate the function modules. After receiving the first valid column $\vec{a}_i$, the RSi unit works until the last column $\vec{a}_{16}$ has been updated. During the first clock, the signal $norm_i$ is sent to the RSR module to calculate its reciprocal square root, denoted rsr in Algorithm 2. The RSR module takes three clock cycles, and the resulting signal rsr is stored in Reg-2. Also in the first clock, the first vector $\vec{a}_i$ is transferred through Mul-3 and stored in Reg-1. Next, the subsequent vectors $\vec{a}_j$ ($j = i+1 : 16$) are continuously sent to the M-1 module together with $\vec{a}_i$. The M-1 module is composed of 32 real-valued multipliers and, combined with the following adder (Add) module, calculates the inner product pt of $\vec{a}_i$ and $\vec{a}_j$ in two clocks, as listed on line 19 of Algorithm 2. During clock 4, rsr is multiplied with pt to calculate $r_{ij}$, and simultaneously rsr is multiplied with vector $\vec{a}_i$ in the M-2 module to calculate $\vec{q}_i$, as listed on lines 20 and 15 of Algorithm 2. The value of $\vec{q}_i$ is stored in the Reg-3 register. From clock 5, the M-2 module is reused to multiply $r_{ij}$ with vector $\vec{q}_i$, and the following subtractor uses this product to update the vector $\vec{a}_j$, as listed on line 21 of Algorithm 2. Also in clock 5, $r_{ij}$ is multiplied with itself, and the product is subtracted from $norm_j$ to update the norm value, as listed on line 22 of Algorithm 2.
After that, the updated vectors $\vec{a}^{new}_{i+1:16}$ and the corresponding norms are continuously transferred to the RS(i + 1) unit for further decomposition. The total latency of an RS module is 5 clocks, and the first valid vector $\vec{a}^{new}_{i+1}$ and the corresponding norm are sent out in the 6th clock.
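The arithmetic carried out by one RS unit can be summarised with the following sketch. Algorithm 2 is not reproduced in this section, so the step order is reconstructed from the prose above and should be read as an assumption. In particular, "multiplied with itself" is interpreted as the squared magnitude $|r_{ij}|^2$ for complex values, and math.sqrt stands in for the RSR module. The function name reduction_stage is hypothetical.

```python
import math


def reduction_stage(cols, norms):
    """One Gram-Schmidt reduction stage applied to the first column of cols."""
    a_i, norm_i = cols[0], norms[0]
    rsr = 1.0 / math.sqrt(norm_i)            # stand-in for the RSR module
    r_ii = norm_i * rsr                      # equals sqrt(norm_i), no division
    q_i = [rsr * x for x in a_i]             # normalised column
    r_row, new_cols, new_norms = [r_ii], [], []
    for a_j, norm_j in zip(cols[1:], norms[1:]):
        pt = sum(x.conjugate() * y for x, y in zip(a_i, a_j))  # inner product
        r_ij = rsr * pt
        new_cols.append([y - r_ij * q for y, q in zip(a_j, q_i)])
        new_norms.append(norm_j - abs(r_ij) ** 2)   # assumed |r_ij|^2 update
        r_row.append(r_ij)
    return q_i, r_row, new_cols, new_norms


q, r, cols_out, norms_out = reduction_stage(
    [[3 + 0j, 4 + 0j], [1 + 0j, 2 + 0j]], [25.0, 5.0]
)
```

Updating the norm by subtraction avoids recomputing the column norm from scratch in the next stage, which is what lets the following sorting unit start as soon as the updated columns arrive.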

RSR unit
RSR is a crucial component of the RS units. It takes three clocks to calculate the reciprocal square root of the input signal, as defined in (3). Figure 12 presents the architecture of the RSR module, which consists of a scale block, a look-up table (LUT), four multipliers and a series of adders and shifters. The LUT block stores the 48 pairs of Taylor coefficients $a_1$ and $a_0$, corresponding to the 48 Taylor expansion points on the interval [1,4), as presented in (5). The input signal z is connected to signal $norm_i$ in the RS unit, and the scale and LUT modules share one clock with the previous module. As shown in Figure 12, the input signal z is first scaled into the interval [1,4) by shifting by $2n$ bits in the scale module, giving x of (4). Then the highest five bits of signal x are used as the address of the LUT module to find the nearest Taylor expansion point and output the corresponding expansion coefficients $a_1$ and $a_0$ for the calculation of $y_n$ as in (5). $y_n$ is calculated in clock 1, and the Newton-Raphson iteration of (6) starts at clock 2. In clock 2, $y_n$ is squared to calculate $y_n^2$, and in the same clock it is multiplied with x to generate $\sqrt{x}$. Then in clock 3, $y_{n+1}$ of (6), which is labelled rsr in Figure 12, is calculated by the last multiplier and subtractor. Finally, the output signal $\sqrt{x}$ is used to calculate $r_{ii}$ of Algorithm 2 by shifting by $n$ bits, and signal rsr is transferred to other units together with the scaling factor $n$. In this architecture, because $r_{ii}$ is generated as a by-product of the Newton-Raphson iteration, no dedicated hardware is needed to calculate $r_{ii}$.
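The scale, look-up and iterate sequence of the RSR module can be modelled in floating point as below. Equations (3) to (6) are not reproduced in this section, so several details are assumptions made only for illustration: the 48 expansion points are taken to be uniformly spaced segment centres on [1,4), the Newton-Raphson step is taken in its standard form $y_{n+1} = y_n(3 - x y_n^2)/2$, and the names build_lut and rsr_model are hypothetical.

```python
import math

SEGMENTS = 48                        # expansion points on [1, 4), as in the text
STEP = 3.0 / SEGMENTS                # assumed uniform spacing


def build_lut():
    """First-order Taylor coefficients (a1, a0) of x**-0.5 per segment."""
    lut = []
    for k in range(SEGMENTS):
        x0 = 1.0 + (k + 0.5) * STEP  # assumed: expand at the segment centre
        a1 = -0.5 * x0 ** -1.5       # derivative of x**-0.5 at x0
        a0 = x0 ** -0.5 - a1 * x0
        lut.append((a1, a0))
    return lut


LUT = build_lut()


def rsr_model(z):
    """Return (approx 1/sqrt(z), approx sqrt(z)) for z > 0."""
    n, x = 0, z
    while x >= 4.0:                  # scale: z = x * 4**n with x in [1, 4)
        x /= 4.0
        n += 1
    while x < 1.0:
        x *= 4.0
        n -= 1
    a1, a0 = LUT[int((x - 1.0) / STEP)]
    y0 = a1 * x + a0                 # LUT-based initial estimate of 1/sqrt(x)
    sqrt_x = x * y0                  # by-product used for r_ii
    y1 = 0.5 * y0 * (3.0 - x * y0 * y0)   # one Newton-Raphson step (assumed form)
    return y1 * 2.0 ** -n, sqrt_x * 2.0 ** n
```

In this model the square-root by-product is formed from the initial estimate, as the text describes, so it is less accurate than the refined reciprocal square root. Whether that matters in practice depends on the fixed-point word length, which this floating-point sketch does not capture.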

Word length design
In this paper, fixed-point is employed as the data format of the registers in the G2SQRD and G4SQRD architectures. To determine a proper WL for the integer and fractional parts of each register, a hardware simulation system is designed based on the MIMO link of Figure 3. In the simulation system, the G2SQRD and LR algorithms are employed for channel preprocessing, and the K-best algorithm is employed for MIMO detection. The channel matrix H consists of complex-valued elements drawn from a normal distribution with a mean of 0 and a variance of 0.5. All the modules except G2SQRD are calculated with double-precision floating-point numbers to ensure sufficient precision, while G2SQRD has two models: a reference model and a fixed-point model. The reference model uses double-precision floating-point numbers in order to evaluate the theoretical precision. In the fixed-point model, the number of fractional bits of the registers varies from 14 to 20, and the number of integer bits of each register is set large enough to avoid overflow. The two models are each integrated into the MIMO link, and the corresponding BER performance is simulated, as shown in Figure 13. According to Figure 13, the case with 18 fractional bits achieves a BER performance very close to that of the reference model, hence the WL of the fractional part is set to 18 bits. In addition, the WL of the integer part of each register is determined by avoiding overflow. The WL structures of some significant registers in this design are listed in Table 1. Although the WL is constructed based on G2SQRD, it is also used for

IMPLEMENTATION RESULTS AND COMPARISONS
The goals of the proposed GSQRD algorithm are to alleviate the severe latency of general SQRD algorithms by optimizing the sorting strategy, and to decrease the hardware complexity by eliminating the complicated division and square root operations and by reusing modules. According to Figure 2, the proposed G2SQRD and G4SQRD reduce the PL of general SQRD by approximately 28% and 41% respectively in 16×16 MIMO systems, and these rates rise to 45% and 68% respectively in 128×128 MIMO systems. Moreover, to evaluate the side effects of the GSQRD algorithm, the proposed GSQRD and general SQRD are both applied to various MIMO detectors and to the LR algorithm in a 16×16 MIMO simulation link. Simulation results show that GSQRD has acceptable side effects on the performance of SIC and K-best MIMO detectors and negligible side effects on the performance and complexity of the LR algorithm. Additionally, a similar simulation is performed for larger-scale MIMO systems such as 64×64 and 128×128, which indicates that GSQRD maintains its properties in larger-scale MIMO situations. To evaluate the frequency performance of GSQRD, the corresponding hardware architectures for G2SQRD and G4SQRD are also designed for 16×16 complex-valued matrices, and the architectures for larger-scale matrices are similar. In the hardware implementation, the RSR module is employed to eliminate the division and square root operations for a hardware-friendly design, and module reuse is applied to save area. Besides, a fixed-point simulation system is designed to determine the word length of the registers. Based on this simulation, the fractional part of each register is set to 18 bits, while the length of the integer part varies between registers, as listed in Table 1. Finally, the two architectures are implemented in 65 nm CMOS technology.
Synthesis results show that the hardware can operate at 513 MHz and decompose one 16×16 complex-valued matrix every 16 clocks, so the throughput is 32 MQRD/s. The overall PLs for G2SQRD and G4SQRD are respectively 163 clocks (including 72 for sorting) and 131 clocks (including 40 for sorting), namely 0.32 and 0.26 μs. In the G4SQRD architecture, the four sorting units take 16, 12, 8 and 4 clock cycles respectively, namely 40 clock cycles in total. By contrast, 16 sorting units would be used without the group-based sorting method, namely (16 + 15 + ⋯ + 2 = 135) clock cycles in total. Accordingly, the G4SQRD saves 95 clock cycles, approximately 42% of the total, which is consistent with the theoretical analysis of Figure 2. The area overheads are equivalent to 5486K (including 484K for sorting modules) and 5258K (including 256K for sorting modules) two-input NAND gates, respectively. The implementation details of GSQRD are listed in Table 2 together with similar state-of-the-art designs presented in [7,14,25-28]. Since different technologies and matrix sizes are used in these works, the throughputs and PLs are normalized to 65 nm technology and 16×16 complex-valued matrices for fair comparison, as shown in Table 2.
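The headline figures above follow from simple arithmetic, which the short check below reproduces. It is only a sanity check on the numbers quoted in the text. The interpretation of "the total" as the G4SQRD latency plus the saved sorting cycles (131 + 95 = 226 clocks) is an assumption, as are the variable names.

```python
F_CLK_MHZ = 513          # synthesis clock frequency from the text
N = 16                   # matrix dimension

throughput_mqrd = F_CLK_MHZ / N              # one matrix every 16 clocks
latency_g2_us = 163 / F_CLK_MHZ              # clocks / MHz gives microseconds
latency_g4_us = 131 / F_CLK_MHZ

sort_g2 = sum(range(2, N + 1, 2))            # 16 + 14 + ... + 2
sort_g4 = sum(range(4, N + 1, 4))            # 16 + 12 + 8 + 4
sort_plain = sum(range(2, N + 1))            # 16 + 15 + ... + 2

saved = sort_plain - sort_g4
saved_fraction = saved / (131 + saved)       # assumed meaning of "the total"

print(round(throughput_mqrd), round(latency_g2_us, 2), round(latency_g4_us, 2))
print(sort_g2, sort_g4, sort_plain, saved, round(saved_fraction, 2))
```

The 72 sorting clocks quoted for G2SQRD match the sum 16 + 14 + ⋯ + 2, which suggests eight sorting units of decreasing length in that architecture.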
In Table 2, the architectures in [14,25-28] are designed for smaller-scale MIMO systems, while those in [7] and the proposed GSQRD are designed for larger-scale MIMO systems. The work of [14] presents a 4×4 SQRD scheme based on the GS method. By carefully scheduling the tasks of SQRD, the hardware efficiency in [14] is significantly improved, and it achieves the highest gate efficiency of 1.30. However, the WL of [14] is set to 18 bits, which is not enough for larger-scale MIMO applications. Therefore the gate efficiency of [14] would be somewhat reduced if [14] were extended to larger matrices. The work of [25] applies a multi-dimension CORDIC scheme to the GR-based SQRD architecture, and achieves a gate efficiency similar to that of the proposed GSQRD. In [26], an MMSE-SQRD scheme for the K-best MIMO detector is proposed, which achieves a gate count of 1529 (with detector) and an excellent throughput of 69 MQRD/s. The work of [27] presents an adaptive QRD processor that provides designers with multiple levels of tradeoff between power and performance. Because of the extra overhead of the adaptive design, the gate efficiency and normalised latency of [27] are worse than those of the other designs. The work of [28] implements an SQRD architecture with an area overhead of merely 65K gates, and achieves an excellent gate efficiency of 0.58. However, as the hardware is heavily shared among numerous tasks, the throughput of [28] is much lower than that of the other designs. The work of [7] presents a novel SQRD structure based on the Cholesky method for 16×16 MIMO systems and achieves a superior frequency of 588 MHz. Additionally, it achieves a much smaller area overhead of 3700K gates compared with the proposed GSQRD. However, the resulting matrix Q in [7] is output in the form of $H^H$ and $R^{-1}$. Hence, for applications where matrix Q is explicitly required, the method of [7] requires extra hardware for multiplying $H^H$ with $R^{-1}$.
Compared with these works, the proposed GSQRD is designed for larger matrices. Hence, the hardware overhead is naturally higher than that of [14,25-28]. Notably, [7] also decomposes 16×16 matrices but achieves a lower gate count of 3700K. This is mainly because [7] replaces the resulting matrix Q with $H^H$ and $R^{-1}$ and thereby saves a matrix multiplication. However, this method has the side effect that one more matrix-vector multiplication is required during the subsequent projection process, and the corresponding hardware overhead is not negligible. For a fair comparison of throughput, the normalized throughput (NT) is introduced, which takes the matrix size and technology into consideration. As illustrated in Table 2, the proposed GSQRD achieves an NT of 32 MQRD/s, notably higher than that of [14,25-28] and slightly lower than that of [7], which we believe is due to the difference in register lengths. Moreover, the G4SQRD achieves a gate efficiency of 0.61, which is superior to that of [25-28]. Although the design of [14] achieves a higher gate efficiency than GSQRD, the scheduling strategy in [14] is rather complicated for larger matrices. In terms of latency, the PL is also normalized before comparison. The G4SQRD achieves a normalized latency (NL) of 0.26 μs, which is respectively 39% and 37% of the NLs in [27] and [28]; and is remarkably lower than those in [26,27]. The NL of G4SQRD appears to be only slightly lower than that of [14]. However, this advantage is underestimated, because the table assumes that the PL of each SQRD increases linearly with the matrix size N, while it actually increases with the square of N, as presented in Section 2.3. For example, the PL of [14] is 32 clocks for decomposing 4×4 matrices, and is normalized to 128 clocks for decomposing 16×16 matrices in Table 2.
However, according to the timing schedule presented in [14], the PL is actually about 230 clocks when extended to 16×16 matrices. Compared with this figure, the latency advantage of the proposed GSQRD is larger than that listed in Table 2. To sum up, the proposed G2SQRD and G4SQRD both achieve excellent throughput and gate efficiency. Additionally, their latency is lower than that of the other designs.

CONCLUSION
This paper proposes a group-sorted QR decomposition algorithm for larger-scale MIMO detection. By predicting the sorting results ahead of time, this algorithm remarkably reduces the processing latency with negligible side effects. Moreover, the algorithm is hardware-friendly because the complicated square root and division operations are all converted to multiplications. Based on this algorithm, two corresponding hardware architectures are designed for decomposing 16×16 complex-valued matrices. Synthesis results indicate an outstanding throughput and a lower latency than other designs. Future work will focus on reducing the circuit size and on combining the algorithm with efficient LR algorithms.