Early results on deep unfolded conjugate gradient-based large-scale MIMO detection

Deep learning (DL) is attracting considerable attention in the design of communication systems. This paper derives a deep unfolded conjugate gradient (CG) architecture for large-scale multiple-input multiple-output (MIMO) detection. The proposed technique combines the advantages of a model-driven approach, which readily incorporates domain knowledge, and deep learning, which performs effective parameter learning. The parameters are trained via backpropagation over a data flow graph inspired by the iterative conjugate gradient method. We derive the closed-form expressions of the gradients for parameter training and discuss early results on the performance in an independent and identically distributed (i.i.d.) channel where the training overhead is considerably low. It is worth noting that the loss function is based on the residual error, which is not an explicit function of the desired signal; this makes the proposed algorithm blind. As an initial framework, we point to the inherent issues and future directions.

INTRODUCTION
It has been argued that massive MIMO implementation continues to be at least as exciting as massive MIMO theory. Massive MIMO is a form of multiuser MIMO where the number of serving antennas at the base transceiver station (BS) is an order of magnitude larger than the number of user terminals served within each radio resource element. Given a large number of antennas, reliance on time division duplex (TDD) channel reciprocity is essential [3].
Because of its advantages in terms of very high spectral efficiency (sum rates), increased reliability and power efficiency, massive MIMO has been the subject of a large number of research activities [7]. Under favourable channel conditions and/or as the number of antennas increases, the users' channels become mutually orthogonal, which makes linear processing based on maximum ratio combining (MRC), zero-forcing (ZF) detection or minimum mean squared error (MMSE) detection a suitable and optimal choice [7][8][9]. Many works on linear and low-complexity processing are proposed in [9][10][11][12], raising the performance bar even higher. The detection/precoding problem based on the ZF or MMSE technique is an arithmetic operation with cubic computational complexity in the matrix dimension. To reduce the implementation complexity, matrix inversion approximations such as the Neumann series expansion (NSE) have been proposed [9]. A technique based on Gauss-Seidel (GS) was shown to outperform NSE due to its fast convergence at considerably low computational complexity [10]. However, this comes at the expense of higher latency and lower throughput [10]. It has actually been shown that the NSE performance degrades as the number of user terminals increases [11]. To counter the load increase effect, GS can still afford using more iterations while maintaining lower computational complexity, albeit at the expense of reduced throughput [10]. It has, therefore, been argued to resort to exact matrix inversion [12]. On the other hand, it has also been argued that these centralized processing techniques still impose stringent constraints on the interconnects' bandwidth between the massive MIMO radio heads and the central processing unit. Distributed, or decentralized, massive MIMO processing has been introduced to overcome such limitations [13,14]. Moreover, to support ultra-reliable low-latency communications, low-latency and high-throughput processing is required.
As such, we introduce a recursive Gram matrix inversion update method wherein the inversion of the Gram matrix is performed by exploiting the matrix inversion update of a matrix of the form H^H H when a new column is added/updated to a complex-valued matrix H [12].
It is worth mentioning that these signal detection schemes can be readily used for downlink precoding in a single-cell scenario. In a multi-cell scenario, the reader is referred to [15][16][17][18]. In [18], Kazemi et al. considered a multi-cell scenario with noisy CSI where a fully cooperative cellular structure is presented first, and then, to mitigate the overhead, a limited cooperation setting, where the amount of the exchanged CSI among the cells is significantly decreased, is proposed. On the other hand, a low-overhead centralized construction for constant envelope precoding (CEP), which employs limited cooperation among the cells while providing higher system throughput, is discussed in [15]. To achieve the minimum feedback overhead, a distributed realization of CEP is proposed. Furthermore, a new optimization problem is solved to compensate for the effects of pilot contamination [15,16].
Recent contributions seem to advocate for the potential of using DL for communication system design [1,[19][20][21]. Even if most signal processing algorithms have solid, well-established roots in statistics and information theory for tractable mathematical models, it remains that a practical system has many impairments and non-linearities which such models can only roughly capture [22]. For this reason, a DL-based communications system, tailored for a specific hardware configuration and channel, might be able to better optimize in the presence of such impairments. It has been shown that neural networks (NNs) are universal function approximators, and they have shown a notable ability for algorithmic learning [23]. As such, some initial insights and findings, using state-of-the-art DL tools, on signal compression [20] and channel decoding [21] have been revealed. On the other hand, massively parallel processing architectures, such as graphics processing units, have shown to be very energy efficient with remarkable computational capabilities when fully exploited by concurrent algorithms [24].
So far, the goal in introducing DL is to either improve parts of existing algorithms or to completely replace them with an end-to-end approach [25,26]. As an example, O'Shea and Hoydis [1] have discussed several promising new applications of DL to the physical layer. They have introduced a new way of addressing a communication system as an end-to-end reconstruction optimization task using autoencoders. On the other hand, two different deep architectures for point-to-point MIMO detection are introduced in [19], wherein the more promising architecture relies on unfolding the iterations of a projected gradient descent algorithm into a network.
Despite the historical context and related works (refer to the introduction sections of [1] and [19]), our primary approach relies on the deep unfolding of existing iterative algorithms, mainly by interpreting every iteration as a set of layers. The deep unfolding of existing iterative algorithms is discussed in [27]. It has recently been applied in the context of MIMO detection and channel decoding in [19] and [28], respectively.
There is a general consensus that ultra-large-scale MIMO, at the infrastructure level, and machine-learning-aided physical layer algorithms, at the protocol/algorithmic level, are key enablers for the next generation of communication systems [7,32]. There is also a common interest in investigating different ML and DL architectures (such as deep NNs (DNNs), deep unfolding, etc.) and frameworks (such as reinforcement learning, transfer learning, etc.) for the sole purpose of gaining more insight into the potential of leveraging these tools and frameworks to outperform the baseline models [29,33]. As a major contribution, we propose a basic deep unfolded (model-driven) conjugate gradient (CG) architecture wherein the parameters are learned via backpropagation. We first unfold the iterative CG method to infer the data flow graph over which the gradient is computed. We then adopt a loss function based on the residual error, which is not an explicit function of the training data. As such, the architecture exhibits the following key features.
1. The parameters are trained rather than being explicitly computed, using the data flow graph based on unfolding the iterative CG algorithm. The closed-form expressions of the gradients of the loss function with respect to the parameters are derived. This enables (for future works) the use of state-of-the-art methods mainly used in transparent ML for interpreting and understanding such networks [35].
2. The loss function is set to be the squared norm of the residual error term, which is not an explicit function of the training data. Therefore, the approach is blind, in the sense that no explicit and dedicated training data is required. Nevertheless, the appendix addresses another approach where the loss function is an explicit function of the training data.
3. The architecture can be readily incorporated as part of an end-to-end communication system learning process where the propagation of the gradient is seamlessly supported (see our work [34]). This is in line with the visionary statement in [2], wherein it is argued that the optimal design of smart radio environments needs to be tackled by taking the benefits of both model-based and data-driven (or simulation-driven) approaches and by leveraging the concept of transfer learning [34]. In addition, the regularity of the network lends itself to transfer learning, updating the last stages only to enable on-line training [33].
4. The deep unfolding can be seen as a means of incorporating expert domain knowledge in order to speed up the training phase. For instance, Nachmani et al. [29] used a radio transformer network (RTN) as a domain knowledge expert to compensate for the carrier frequency offset. It is argued that the benefit of such an arrangement may be a reduced complexity and more flexibility regarding imprecise knowledge about the channel [29].
Notations: This paper adopts the following notations: (·)^H represents the Hermitian transpose operator while (·)^T and (·)^−1 represent the transpose and the matrix inverse operators, respectively. Matrices and column vectors are denoted by boldface capital and boldface small letters, respectively. We convert a K × K complex-valued matrix Ā ∈ ℂ^(K×K) to a 2K × 2K real-valued one A ∈ ℜ^(2K×2K), and a complex-valued vector v̄ ∈ ℂ^(K×1) to a real-valued one v ∈ ℜ^(2K×1), using the following transformations:

A = [ Re(Ā)  −Im(Ā) ; Im(Ā)  Re(Ā) ],    v = [ Re(v̄)^T  Im(v̄)^T ]^T,

where Re(·) and Im(·) denote the element-wise real and imaginary parts, respectively. These transformations are used in Section 2.2. The paper is organized as follows: Section 2 presents the uplink signal model and the CG-based detection technique. Section 3 details the proposed deep unfolded CG method. Early performance results are discussed in Section 4. Finally, the conclusions are drawn and some future research directions are outlined in Section 5.
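As a concrete illustration, the complex-to-real transformation above can be sketched as follows (function names are ours):

```python
import numpy as np

def complex_to_real_matrix(A_bar):
    """Map a K x K complex matrix to its equivalent 2K x 2K real matrix."""
    return np.block([[A_bar.real, -A_bar.imag],
                     [A_bar.imag,  A_bar.real]])

def complex_to_real_vector(v_bar):
    """Stack real and imaginary parts into a 2K real vector."""
    return np.concatenate([v_bar.real, v_bar.imag])
```

The transformation preserves matrix-vector products: applying the real-valued matrix to the real-valued vector gives the real-valued image of the complex product.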

Signal model
We consider an uplink transmission where K single-antenna users communicate with a BS equipped with M antennas (where M ≫ K) in TDD mode using the OFDM modulation scheme. For the sake of simplicity, we consider a baseband equivalent channel and expressions per subcarrier, where the subcarrier index is suppressed. The data signal of the kth user is denoted by s̄_k ∈ ℂ and is normalized to unit power. The vector h̄_k ∈ ℂ^(M×1) represents the corresponding channel, which is modelled, for simulation purposes, as a flat Rayleigh fading channel vector whose entries are assumed to be independent and identically distributed (i.i.d.) with zero mean and unit variance. We model the received signal at the BS as

ȳ = H̄ s̄ + n̄,   (1)

where H̄ = [h̄_1 h̄_2 ⋯ h̄_K] ∈ ℂ^(M×K) is the channel matrix and s̄ = [s̄_1 s̄_2 ⋯ s̄_K]^T. n̄ ∈ ℂ^(M×1) represents the additive receiver noise vector whose entries have a zero mean and a variance equal to σ². The linear detection problem then amounts to solving the normal equations

Ā s̄ = ȳ^MF,   (2)

where ȳ^MF = H̄^H ȳ is the matched-filter output and Ā = H̄^H H̄ is the Gram matrix.
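A minimal simulation sketch of this signal model (QPSK symbols are used here as a unit-power stand-in for the 64-QAM of Section 4; all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 128, 20                   # BS antennas, single-antenna users
snr_db = 20.0
sigma2 = 10 ** (-snr_db / 10)    # noise variance for unit-power symbols

# i.i.d. flat Rayleigh channel: CN(0, 1) entries
H = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)

# unit-power QPSK data symbols, one per user
s = (rng.choice([-1, 1], K) + 1j * rng.choice([-1, 1], K)) / np.sqrt(2)

# additive receiver noise with per-entry variance sigma2
n = np.sqrt(sigma2 / 2) * (rng.standard_normal(M) + 1j * rng.standard_normal(M))

y = H @ s + n                    # received signal at the BS
```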

CG technique
Equation (2) can be readily solved iteratively using CG techniques (refer to page 214 of [30]). For convenience, we first convert the baseband complex-valued problem to a real-valued one, reformulating Equation (2) as

A s = y^MF,   (3)

where A ∈ ℜ^(2K×2K), s ∈ ℜ^(2K×1) and y^MF ∈ ℜ^(2K×1) are obtained using the transformations given in Section 1. The CG technique is, therefore, summarized in Table 1.
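For reference, a plain real-valued CG solver for A s = y^MF can be sketched as below. This is a generic textbook implementation; the exact step numbering of Table 1 is not reproduced here:

```python
import numpy as np

def cg_detect(A, y_mf, n_iter):
    """Plain conjugate gradient for A s = y_mf, A symmetric positive definite."""
    s = np.zeros_like(y_mf)
    r = y_mf - A @ s              # initial residual
    p = r.copy()                  # initial search direction
    for _ in range(n_iter):
        q = A @ p
        alpha = (r @ r) / (p @ q)             # step size
        s = s + alpha * p                     # data signal update
        r_new = r - alpha * q                 # residual update
        beta = (r_new @ r_new) / (r @ r)      # direction coefficient
        p = r_new + beta * p                  # new search direction
        r = r_new
    return s
```

In exact arithmetic CG converges in at most 2K iterations for a 2K × 2K system; in practice far fewer iterations suffice for well-conditioned Gram matrices.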

Data flow graph for the CG algorithm
Prior to the deep unfolding transformation, Equation (3.7) in Table 1 is written as p^(ℓ) = x^(ℓ) + β^(ℓ) p^(ℓ−1) so that step (3.2) is skipped and the problem has 2L_MAX tuning parameters to train, namely α^(ℓ) and β^(ℓ) for ℓ = 1, 2, …, L_MAX. In addition, the matrix A can be replaced with B^(ℓ) = A + λ^(ℓ) I_2K to consider L_MAX additional tuning parameters λ^(ℓ) for ℓ = 1, 2, …, L_MAX. The parameters α^(ℓ) and β^(ℓ) preserve the CG algorithm structure while λ^(ℓ) can be viewed as a parameter that learns the inherent noise variance (as in the MMSE problem formulation). For the sake of simplicity in introducing a simple canonical framework for the deep unfolded CG-based detection method, we limit the current structure to these 3L_MAX parameters. The algorithm structure can be enhanced to include more parameters and non-linear activation functions.
Deep unfolding the CG algorithm entails unfolding the L MAX iterations into L MAX stages (layers) as shown in Figure 1.
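The unfolded forward pass can be sketched as below, with the per-layer parameters α^(ℓ), β^(ℓ), λ^(ℓ) held fixed after training. Taking the residual as the direction input x^(ℓ) is our assumption for this sketch:

```python
import numpy as np

def unfolded_cg_forward(A, y_mf, alphas, betas, lambdas):
    """Forward pass through L_MAX unfolded CG layers with learned,
    fixed per-layer parameters alpha, beta, lambda."""
    K2 = y_mf.shape[0]
    s = np.zeros(K2)
    r = y_mf.copy()               # residual for s = 0
    p = r.copy()                  # initial direction
    for al, be, lam in zip(alphas, betas, lambdas):
        B = A + lam * np.eye(K2)  # B^(l) = A + lambda^(l) I_2K
        q = B @ p
        s = s + al * p            # data signal update (3.3)
        r = r - al * q            # residual update (3.4)
        p = r + be * p            # direction update with learned beta
    return s, r
```

With β^(ℓ) = λ^(ℓ) = 0 and a small fixed α^(ℓ), the recursion reduces to a Richardson iteration, so the residual norm shrinks monotonically for a well-conditioned A.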

Network training and gradient computation
The loss function is based on the L2 norm applied to r^(L_MAX), representing the mean squared error, so that

Loss(Θ; A, y^MF) = ‖r^(L_MAX)‖²₂.   (4)

The gradient of the loss function is computed with respect to every parameter using backpropagation over the deep network of Figure 1. In the forward pass, we process the data in the ℓth stage as shown in Figure 1(b), while the gradients are computed in the reverse direction (the backward pass). For the sake of simplicity, we compute the gradient for L_MAX = 5 and then generalize for any ℓth stage.
The gradient of the loss function Loss(Θ; A, y^MF) w.r.t. α^(5) is

∂Loss(Θ; A, y^MF)/∂α^(5) = −2 r^(5)T q^(5).   (5)

Applying the chain rule, one can compute the gradient w.r.t. α^(4) by propagating through r^(5), q^(5) and p^(4); its generalization to any α^(ℓ), for ℓ < L_MAX − 1, is given in Equation (6). Similarly, the gradient w.r.t. β^(ℓ) is obtained by the chain rule and is given in Equation (8), and the gradient w.r.t. λ^(ℓ) is computed in the same manner. Note that the loss function (Equation (4)) is not an explicit function of the desired signal s = [Re(s̄)^T Im(s̄)^T]^T ∈ ℜ^(2K×1), which makes the proposed architecture blind. It would not be the case if the loss function were defined as Loss(Θ; A, y^MF, s) = ‖s^(L_MAX) − s‖²₂, which is an explicit function of the desired signal. In fact, our first attempts were based on computing the gradients of Loss(Θ; A, y^MF, s); they ended up yielding a similar performance to the loss function in Equation (4). The computation of the gradients of Loss(Θ; A, y^MF, s) is straightforward using the data flow graph in Figure 1(a) and (b) and is reported in the appendix.
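The last-layer gradient −2 r^(L_MAX)T q^(L_MAX), as reconstructed here, can be verified numerically; the sketch below assumes β^(ℓ) = λ^(ℓ) = 0 for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)
K2, L = 8, 5
G = rng.standard_normal((K2, K2))
A = G.T @ G + np.eye(K2)          # symmetric positive definite stand-in for A
y_mf = rng.standard_normal(K2)
alphas = 0.05 * np.ones(L)

def loss_and_last_q(alphas):
    """Blind loss ||r^(L)||^2 after L unfolded layers (beta = lambda = 0)."""
    r = y_mf.copy()
    p = r.copy()
    q = None
    for al in alphas:
        q = A @ p
        r = r - al * q
        p = r.copy()              # beta = 0: direction is the residual
    return r @ r, r, q

loss, r, q_last = loss_and_last_q(alphas)
analytic = -2.0 * (r @ q_last)    # dLoss/dalpha^(L) = -2 r^(L)^T q^(L)

# central finite difference on the last alpha (exact for this quadratic loss)
eps = 1e-6
ap = alphas.copy(); ap[-1] += eps
am = alphas.copy(); am[-1] -= eps
numeric = (loss_and_last_q(ap)[0] - loss_and_last_q(am)[0]) / (2 * eps)
```

The analytic and numeric values agree to floating-point precision, since the loss is quadratic in the last-layer step size.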
The initialization of the parameters is based on the first forward pass using steps 2.5 and 2.6 in Table 1.
Many works have addressed the matrix inversion issue inherent in MMSE- or ZF-based detection [10][11][12]. It is also well known that the CG technique [31, p. 214] is a low-complexity iterative method compared to a direct matrix inversion using the Cholesky decomposition. Wu et al. [11] and [12] have studied this matter in detail, which is why we do not revisit it in this paper. Note that the algorithm in Table 1 does not involve any matrix-matrix multiplications. The matrix-vector multiplications in (3.1) and (3.5) are equivalent to the detection phase in MMSE and ZF after an implicit matrix inversion.

Discussion and potential future work
Even though the proposed architecture might seem simple, it paves the way towards exploring the features outlined at the end of the introduction section. Prior to outlining the future works, let us infer a few key points from the closed-form solutions of the gradients. Looking at Equations (3.3), s^(ℓ) = s^(ℓ−1) + α^(ℓ) p^(ℓ−1), and (3.4), r^(ℓ) = r^(ℓ−1) − α^(ℓ) q^(ℓ), it is clear that the data signal update is in the opposite direction compared to the residual error update. The proportion of the update is given by Equation (5) for the last layer and by Equation (6) for the subsequent layers. Substituting q^(ℓ) = A p^(ℓ−1) into Equation (6), the parameter update for α^(ℓ) can be expressed as in Equation (10). For interpretation purposes, let us assume that β^(ℓ) = 0. Equation (10) then reduces to Equation (11), whose second line follows from Equation (3.1) and whose last line follows from the fact that A is a symmetric matrix. This shows that the parameter update for α^(ℓ) is the per-layer modified norm of the projected residual error. The plot of the residual error is depicted in Figure 5 in Section 4.4.
For a non-zero value of β^(ℓ), the second term in p^(ℓ) = x^(ℓ) + β^(ℓ) p^(ℓ−1) plays the role of a smoothing/averaging term.
Similarly, the update term for β^(ℓ) in Equation (8) depends on the per-layer modified norm of the projected residual error which is, in turn, weighted by the product of the past layers' parameters ∏_{n=ℓ+1}^{L_MAX−1} α^(n). One can perceive this term as a factor that controls the trade-off between stability and convergence speed. This is an intuitive interpretation of the interaction between the parameters' updates (gradients), the data signal and the residual error updates. As future work, we envisage an in-depth look at the deep CG architecture via state-of-the-art methods and tools mainly used in transparent ML [35] for interpreting and understanding such networks.
It shall be noted that the proposed deep unfolded architecture is simple, as only a few parameters are trained, but it can be augmented by considering the following points for future works.
1. Adding extra parameters, such as replacing a single-parameter update with a higher-dimensional weight matrix W^(ℓ) ∈ ℜ^(2K×4K) operating on a stacked real-valued 4K × 1 vector made by concatenating x^(ℓ−1) and p^(ℓ−1). Such a proposal will increase the algorithm's learning degrees of freedom close to what current DL architectures (such as deep NNs, DNNs) are using.
2. Adding non-linear activation functions to enable the algorithm structure to capture the channel and hardware non-linearities.
3. Now that the deep unfolded architecture exposes the values of the gradients at every layer, one can adopt state-of-the-art methods and tools mainly used in the transparent ML framework [35] for interpreting and understanding such networks. To the authors' best knowledge, no work has been conducted to infer the optimal size of deep unfolded networks, to discuss analysis frameworks such as activation maximization and sensitivity to identify the most important input features via relevance scores [35], or to consider other backward propagation techniques such as layer-wise relevance propagation, which incorporates filtering to form a separate explanation for (1) what is specifically relevant to a given task (think of learning the modulation and radio resource assignment as a learning task) and (2) what is commonly relevant to all tasks.

PERFORMANCE RESULTS AND ANALYSIS
This section discusses the performance of the proposed deep unfolded CG technique under an i.i.d. channel with zero mean and unit variance. The simulation results cover (i) the required training symbol overhead (where explicit knowledge of the symbols themselves is not needed) and (ii) the performance in terms of error vector magnitude (EVM) as a function of the number of users (K) and the SNR. We adopt EVM instead of bit error rate (BER) as a performance metric since the initial simulation platform is designed to evaluate transmit precoding; however, there is a direct link between EVM and BER [31]. The system is operating in a massive MIMO regime where the BS is equipped with 128 antennas while the number of served single-antenna users is an order of magnitude lower than the number of antennas at the BS. Any QAM-based modulation can be used; herein, 64-QAM (i.e., M_QAM = 64) is used unless otherwise stated (so far, only QAM-based modulations have been tested).
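For clarity, RMS EVM can be computed as sketched below; normalization by the reference-signal power is assumed here, and the paper's exact convention may differ:

```python
import numpy as np

def rms_evm_percent(s_hat, s_ref):
    """RMS error vector magnitude in percent, relative to reference power."""
    err_power = np.linalg.norm(s_hat - s_ref) ** 2
    ref_power = np.linalg.norm(s_ref) ** 2
    return 100.0 * np.sqrt(err_power / ref_power)
```

For example, a detected vector that deviates from a unit-power reference by 10% in amplitude yields an RMS EVM of 10%.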

How many training symbols are needed?
Based on the simulation parameters stated above, the SNR and the number of users are fixed to 20 dB and 20 users, respectively. By varying N_Train ∈ {10, 50, 100, 500, 1000}, Figure 2 shows that the optimal training size is as low as 100 symbols under such i.i.d. (favourable) channel conditions. This is quite encouraging and enables effective training of the basic structure's parameters using one RB in the first OFDM symbol only. Depending on the channel coherence time, this represents a sub-1% training overhead in slowly time-varying channel scenarios. Therefore, it is worth noting that the overall computational complexity, due to backpropagation processing, of the basic deep-unfolded CG is not substantial. This, in fact, supports the expectation in [29] wherein incorporating domain knowledge experts is key to reducing the training overhead. We attribute the high EVM at larger numbers of training symbols to overfitting and potential numerical divergence.

Model-based CG versus basic deep-unfolded CG
As one would expect, the model-based CG detection is almost optimal under i.i.d. channel conditions. In the model-based CG, the parameters are explicitly computed for every symbol over L_MAX iterations; these parameters differ from one symbol to another. In the deep-unfolded CG, by contrast, the parameters are computed during the training phase and kept constant over the rest of the transmission frame. Figure 3(a) compares the model-based CG with state-of-the-art methods such as NSE [9] and GS [10] and other reference methods such as ZF with direct matrix inversion and MRC.

Does the basic deep-unfolded CG depend on the number of antennas as the system load increases?
Figure 4 depicts the EVM as a function of the number of users, where L_MAX = log₂(M_QAM)K for both the deep CG and the regular CG algorithms. Extensive simulations revealed that dimensioning the architecture as a function of the modulation depth and the number of users preserves the performance within a certain EVM threshold. This is the reason why Figure 4 shows an almost flat EVM performance (RMS EVM within ±0.5%) as long as L_MAX = log₂(M_QAM)K. One can, therefore, state that to preserve the system performance as the system load (K) increases, only the number of iterations/layers needs to be scaled according to L_MAX = log₂(M_QAM)K, while the number of serving antennas at the BS is kept unchanged.
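The dimensioning rule above is a one-liner; for instance, 64-QAM with K = 20 users gives L_MAX = 6 × 20 = 120 layers:

```python
import math

def l_max(m_qam, n_users):
    """Layer/iteration count per the dimensioning rule L_MAX = log2(M_QAM) * K."""
    return int(math.log2(m_qam)) * n_users
```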

Convergence behaviour through residual error
To complement our discussion in Section 3.3, Figure 5 depicts the residual error as a function of the number of layers. The fast-decaying residual error curves demonstrate the fast convergence behaviour of the learning process, which depends, among other system parameters, on the modulation type. The arrows point to the iteration number that achieves a given RMS EVM so that the system engineer can infer the optimal number of layers for a given RMS EVM value.

A note on the computational complexity
The regular CG method implements Equations (3.1)-(3.7) iteratively over L_MAX iterations. However, the proposed method does not compute (3.2), (3.6) and the division in (3.7), which amount to 4K multiplications/additions and 2 divisions per iteration, respectively. This is a total of 4K N_Frame multiplications/additions and 2N_Frame divisions per iteration over a frame of N_Frame symbols. On the other hand, the proposed method computes Equations (6) and (8) to determine the fixed parameters per layer, which amounts to 4K multiplications/additions for each of (6) and (8). Note that the matrix-vector multiplication in (8) is already explicitly computed in (3.1). We, therefore, expect a total of 8K N_Train multiplications/additions per layer. This results in a computational reduction by a factor of N_Frame/(8N_Train). Using our simulation parameters, N_Frame = 1200 × 14 and N_Train = 100, the complexity reduction factor is 21.
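The quoted reduction factor follows directly from the stated counts (values taken from the simulation setup):

```python
n_frame = 1200 * 14                    # symbols per frame (N_Frame)
n_train = 100                          # training symbols (N_Train)
reduction = n_frame / (8 * n_train)    # N_Frame / (8 N_Train)
print(reduction)                       # 21.0
```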

CONCLUSION
This paper has proposed a basic deep unfolded implementation of the iterative model-based CG wherein the parameters are trained via backpropagation. We derived the closed-form expressions of the gradients of the loss function w.r.t. the parameters. The loss function is based on the squared norm of the residual error, which is not an explicit function of the desired transmitted symbols. The simulation results reveal interesting insights: (i) the training overhead is very low, which makes the deep unfolded CG a good implementation alternative to the iterative model-based CG, and (ii) the basic deep unfolded structure does not depend on the number of antennas as the load (the number of users) increases, as long as the network structure is dimensioned based on the number of users and the modulation depth. We do not attempt to outperform the iterative model-based CG in a favourable i.i.d. channel. However, the following insights can be deduced: the deep unfolded CG structure can be incorporated in an end-to-end communication system learning process [2] as the structure readily backpropagates the gradient for the parameters' optimization. The algorithm structure is preserved by the deep unfolded structure as a domain knowledge expert which, in turn, explains the low number of training samples (fast training phase). This is in line with what the literature expects from incorporating domain knowledge and RTNs.
As future work, one can envisage extending the architecture to consider more realistic time-varying channels and hardware impairments (and non-linearities). As in many DL approaches, the performance is sensitive to the initial values of the parameters (which we noted through extensive simulations), and this sensitivity seems to depend on system parameters such as the SNR, the number of users and the number of stages. It, therefore, needs an in-depth investigation.