Multi-agent deep reinforcement learning-based energy efficient power allocation in downlink MIMO-NOMA systems

NOMA and MIMO are considered promising technologies to meet the huge access demands and high data rate requirements of 5G wireless networks. In this paper, the power allocation problem in a downlink MIMO-NOMA system is investigated with the aim of maximizing energy efficiency while ensuring the quality-of-service of all users. Two deep reinforcement learning-based frameworks are proposed to solve this non-convex and dynamic optimization problem, referred to as the multi-agent DDPG/TD3-based power allocation frameworks. In particular, with the current channel conditions as input, every agent of the two multi-agent frameworks dynamically outputs the optimal power allocation policy for all users in one cluster via the DDPG/TD3 algorithm, and an additional actor network is added to the conventional multi-agent model in order to adjust the power volumes allocated to clusters and improve the overall performance of the system. Finally, both frameworks adjust the entire power allocation policy by updating the weights of the neural networks according to the feedback of the system. Simulation results show that the proposed multi-agent deep reinforcement learning-based power allocation frameworks can significantly improve the energy efficiency of the MIMO-NOMA system under various transmit power limitations and minimum data rates compared with other approaches, including a performance comparison with MIMO-OMA.


INTRODUCTION
Due to the explosive growth of Internet-of-Things-based large-scale heterogeneous networks and the emergence of fifth generation (5G) wireless networks, communication requirements such as low latency, high speed and massive connectivity have become increasingly strict [1]. In order to meet these stringent communication requirements and provide high-quality communication services, non-orthogonal multiple access (NOMA) has been proposed. NOMA is widely recognized as one of the most promising multiple access technologies for 5G wireless networks [2]. Different from conventional orthogonal multiple access (OMA), multiple users in NOMA systems can be served on the same frequency-time resources via power-domain multiplexing or code-domain multiplexing, which can increase the network capacity and fulfil the requirements of low latency, high throughput and massive connectivity [3]. Research on NOMA has focused mainly on power allocation [4] and user matching [5], most often with the objectives of improving user fairness [6], spectrum efficiency (SE) [7] and energy efficiency (EE) [8]. Power-domain NOMA can serve multiple users in the same time slots and has become a strong candidate in the development of 5G by allocating different power levels to different users to achieve multiple access [9]; the EE of a system can thus be improved by using this technology. Power-domain NOMA, with superposition coding (SC) at the transmitter side and successive interference cancellation (SIC) at the receiver side, allows users to share the same resources by multiplexing multiple users' signals with different allocated powers. In power-domain NOMA, powers are allocated to users according to their channel gains, i.e. high and low powers are respectively assigned to users with worse and better channel gains, and the user with the better channel gain then removes the interference from other users by SIC. For the purpose of further improving system performance, NOMA is capable of being combined with various technologies such as multiple-input multiple-output (MIMO) [10][11][12][13][14], device-to-device [15][16][17], deep learning [18][19][20][21][22][23][24][25][26][27] and cognitive radio [28,29]. The combination of MIMO and NOMA, which is defined as MIMO-NOMA, is one of the most active research areas. MIMO is a conventional technology where the base station (BS) and the user both have several antennas, which can exploit parallel independent channels through space-division multiplexing [30]. MIMO-NOMA transmission can achieve high SE and low latency through the power domain in NOMA and spatial diversity in MIMO [31]. Some research results show that MIMO-NOMA can further improve the throughput of communication systems [32]. In this paper we mainly focus on the issue of power allocation in power-domain NOMA with MIMO.
Designing a proper resource allocation algorithm is critically important for improving the SE or EE of a MIMO-NOMA system [33]. Many model-based power allocation algorithms have been proposed to increase the SE or EE of NOMA systems [34,35], and some studies have investigated joint subchannel assignment and power allocation in order to maximize the EE of multicarrier NOMA systems [36,37]. Due to the dynamics and uncertainty inherent in wireless communication systems, the complete knowledge or mathematical models required by these conventional power allocation approaches are often difficult or even impossible to obtain in practice [38]. Moreover, because of their high computational complexity, these algorithms are inefficient or even inapplicable for future communication networks. In addition, dynamic resource allocation is another challenge under fast-changing channel conditions, and optimizing energy consumption with efficient resource allocation is an important research issue in NOMA. Intelligent learning methods have been extensively developed to deal with these challenges. Some studies have used deep learning (DL) as a model-free and data-driven approach to reduce the computational complexity with available training inputs and outputs [39]. As one main branch of machine learning (ML), DL is a useful method that can be applied to the resource allocation problem in the case of multiple users [21,24,40]. DL has been used to solve resource allocation problems by first training neural networks offline with simulated data and then outputting results with the well-trained networks during the online process [26,27,41]. However, obtaining correct data sets or optimal solutions for training can be difficult, and the training process itself is time-consuming.
In mobile environments, accurate channel information and a complete model of the environment's evolution are unknown; therefore, model-free reinforcement learning (RL) methods have been used to solve stochastic optimization problems in wireless networks. RL, as another main branch of ML, can be used to solve dynamic resource allocation problems [18,22,42]. RL is able to provide solutions for sequential decision-making problems where the environment is unknown and the optimal policy is learned through interactions with the environment [43,44]. When RL is applied to wireless networks, the agent (e.g. the base station or a user) observes the environment (e.g. wireless channel) states and discovers which actions (e.g. subchannel assignment or power allocation decisions) yield the highest numerical reward (e.g. the immediate EE) by trying them, and finally generates a policy mapping states to actions. That is, the agent selects an action for the environment, and the state (e.g. the current channel condition) changes after the environment accepts the action. At the same time, the reward is generated and fed back to the agent. Finally, the agent selects the next action based on the immediate EE and the current channel condition in order to ensure energy efficiency in wireless communications. The goal of RL is to adjust the parameters dynamically so as to maximize the reward. Moreover, instead of optimizing only the current benefit, RL can generate a near-optimal decision policy that maximizes the long-term performance of the system through constant interactions. Therefore, RL has demonstrated enormous advantages in many fields. Existing works applying the model-free RL framework usually discretise the continuous values of network parameters into a finite set of discrete levels or learn a stochastic policy. Different from most existing work using RL algorithms, we consider a deterministic policy gradient-based actor-critic reinforcement learning algorithm to solve the power allocation optimization problem with continuous-valued actions and states in MIMO-NOMA systems.
In this paper, we investigate the dynamic power allocation problem with multi-agent DRL-based methods in a downlink MIMO-NOMA system. Motivated by the aforementioned considerations, we propose two multi-agent DRL-based frameworks (i.e. a multi-agent DDPG-based and a multi-agent TD3-based framework) to improve the long-term EE of the MIMO-NOMA system while ensuring the minimum data rates of all users. We construct DRL networks on the basis of the multi-agent model [45], including an additional actor network which does not have its own critic network and is updated by the combined influence of every agent's critic network. We refer to these two frameworks as the multi-agent DDPG-based power allocation (MDPA) framework and the multi-agent TD3-based power allocation (MTPA) framework.
The main contributions of this paper are summarized as follows:
• We develop two model-free DRL-based power allocation frameworks to solve the power allocation problem in a downlink MIMO-NOMA system. They are multi-agent frameworks based on the DDPG and TD3 algorithms, which build on deterministic policy gradient methods with continuous action spaces, since stochastic policy gradient methods cannot properly handle the dynamics and uncertainty inherent in wireless communication systems.
• In our multi-agent frameworks, every single agent dynamically outputs the power allocation policy for all users in one cluster of the MIMO-NOMA system, and we add an additional actor network to the conventional multi-agent model in order to adjust the power volumes allocated to clusters and improve the overall performance of the MIMO-NOMA system. To the best of our knowledge, no such method for power allocation in MIMO-NOMA systems has been studied in the existing literature.
• We provide a performance analysis of the proposed multi-agent deterministic policy gradient-based power allocation frameworks in a two-user-per-cluster scenario of the MIMO-NOMA system, and compare the EE of the proposed frameworks under various power limitations and minimum data rates with single-agent DRL-based power allocation, a discrete DRL-based scheme and a fixed power allocation strategy. We also verify the energy performance advantage of our proposed frameworks over the MIMO-OMA system. The simulation results show that the TD3-based framework achieves the best performance.
The rest of this paper is organized as follows. The related work is reviewed in Section 2. Section 3 introduces the system model and problem formulation of a downlink MIMO-NOMA system. In Section 4, we propose two multi-agent DRL-based algorithms to solve the dynamic power allocation problem of the MIMO-NOMA system. The simulation results are discussed in Section 5, and we conclude this study in Section 6.

RELATED WORK
The conventional RL algorithms suffer from slow convergence and become less efficient for problems with large state and action spaces. In order to overcome these issues, deep reinforcement learning (DRL), which combines deep learning with RL, has been proposed. A famous off-policy DRL algorithm, deep Q-learning [46], uses a deep Q-network (DQN) that applies deep neural networks as function approximators to conventional RL. DRL has already been used in many areas such as power control in NOMA systems [47,48], resource allocation in heterogeneous networks [25,49] and the Internet of Things (IoT) [50]. DQN uses a replay buffer to store tuples of historical samples in order to stabilize training and make efficient use of hardware optimization; mini-batches are randomly drawn from the replay buffer to update the weights of the networks during training. However, the main drawback of DQN is that its output decisions can only be discrete, which introduces quantization error for continuous action tasks. Besides, the output dimension of DQN increases exponentially for multi-action and joint optimization tasks. Some works that introduced the model-free RL framework to solve stochastic optimization problems typically discretise the continuous variables of the studied scenario into a finite set of discrete values (levels), such as quantized energy or power levels. However, such methods destroy the completeness of the continuous space and introduce quantization noise; thus, they are incapable of finding the true optimal policy. RL was initially studied only with discrete action spaces; however, practical problems sometimes require control actions in a continuous action space [51]. Concerning our energy efficiency problem, both the environment state (i.e. the wireless channel condition) and the action (i.e. the transmission power) have continuous spaces. For problems with continuous-valued actions, policy gradient methods [52] are very effective and, in particular, can learn stochastic policies. In order to overcome the inefficiency and high variance of policy evaluation in policy gradient methods, the actor-critic learning algorithm was introduced to solve optimal decision-making problems with an infinite horizon [53]. The actor-critic algorithm is a well-known architecture based on the policy gradient theorem which allows applications in continuous spaces [54]. In the actor-critic method [55], the actor is used to generate stochastic actions whereas the critic is used to estimate the value function and criticize the policy that is guiding the actor, and the policy gradient evaluated by the critic is used to update the policy.
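As a concrete illustration of the replay mechanism described above, the following minimal Python sketch stores transition tuples and draws uncorrelated mini-batches; the class and method names are illustrative rather than taken from a specific library.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer of (s, a, r, s_next) transitions, as used by DQN/DDPG/TD3."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest samples are evicted automatically

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions, which stabilizes training.
        return random.sample(self.buffer, batch_size)
```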
In [51], the authors developed the deep deterministic policy gradient (DDPG) method to improve the trainability of the stochastic policy gradient method on continuous action space problems. DDPG is an enhanced version of the deterministic policy gradient (DPG) algorithm [56] based on the actor-critic architecture, which uses an actor network to output a deterministic action and a critic network to evaluate the action [46]. DDPG takes advantage of the experience replay buffer and target network strategies from DQN to improve learning stability. By combining the actor-critic algorithm, a replay buffer, target networks and batch normalization, DDPG is able to perform continuous control while the training efficiency and stability are enhanced compared with the original actor-critic network [51]. These features make DDPG efficient for dynamic resource allocation problems with continuous action spaces, and also make it an attractive candidate for application in industrial settings [42,57]. A remaining concern is that the overestimation bias of the critic can pass to the policy through the policy gradient in DDPG; even if the actor is updated slowly, the target critics remain too similar to the current critics to avoid overestimation bias. Many algorithms have been proposed to address the overestimation bias in Q-learning. Double DQN [58] addresses the overestimation problem by using two independent Q-functions; nevertheless, it is difficult to find appropriate weights for different tasks or environments. The most common failure mode of DDPG is that the learned Q-function begins by overestimating Q-values and eventually breaks down the policy, because the policy exploits the errors in the Q-function. Overestimation bias is a property of Q-learning caused by the maximization of a noisy value estimate. The value target estimate depends on the target actor and the target critic. As the target actor is constantly updated, the next action used in the one-step value target estimate also changes; the update of the target actor may therefore cause a big change in the target value estimate, which causes instability in critic training [59]. In order to reduce overestimation bias, the authors in [60] extended DDPG to the twin delayed deep deterministic policy gradient algorithm (TD3), which estimates the target Q value using the minimum of two target Q values, called clipped double Q-learning. The DPG and DDPG algorithms, together with the successful work on DQN, paved the way for TD3 [60,61]. TD3 adopts two critics to obtain a less optimistic estimate of an action value by taking the minimum of the two estimates. In TD3, there is just one actor, which is optimized with respect to the smaller of the two critics. In order to keep the TD-error small, the actor is updated at a lower frequency than the critic, which results in higher-quality policy updates in practice [59].
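The bootstrapped target and the slow-moving target weights that give DDPG (and TD3) their stability can be captured in two short helpers. The following PyTorch-style sketch assumes generic target_actor/target_critic modules and is illustrative rather than a reference implementation.

```python
import torch

@torch.no_grad()
def bootstrap_target(r, s_next, target_actor, target_critic, gamma=0.99):
    # DDPG target: evaluate the target actor's action with the target critic.
    # Both target networks are frozen copies, so the target moves slowly.
    return r + gamma * target_critic(s_next, target_actor(s_next))

@torch.no_grad()
def soft_update(target_net, net, tau=0.005):
    # Polyak averaging: target weights creep toward the online weights,
    # which keeps the bootstrapped targets above from shifting abruptly.
    for tp, p in zip(target_net.parameters(), net.parameters()):
        tp.mul_(1.0 - tau).add_(tau * p)
```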
Since we focus on the issue of power allocation in MIMO-NOMA systems, which are generally comprised of several clusters according to the users' propagation channel conditions, we also investigate multi-agent RL methods, where a single agent can output the policy for one cluster of the MIMO-NOMA system. With the development of DRL, single-agent algorithms have matured and addressed plenty of difficult problems, giving researchers more powerful tools for studying high-dimensional and more complex action spaces. By studying single-agent RL further, researchers realized that the introduction of multiple agents could achieve higher performance in complex mission environments [62]. The multi-agent DDPG algorithm, which extends DDPG to the multi-agent domain, was proposed for mixed cooperative-competitive environments in [63], although it did not account for joint overestimation in multi-agent environments. In [45], the multi-agent TD3 algorithm was proposed to reduce the overestimation error of the critic networks, and it was shown to outperform the multi-agent DDPG algorithm in complex mission environments; the idea extends naturally to other multi-agent RL algorithms.

System model
In this paper we consider a downlink multi-user MIMO-NOMA system, in which the BS, equipped with $M$ antennas, sends data to multiple receivers, each equipped with $N$ antennas. The total number of users in the system is $M \times L$, grouped randomly into $M$ clusters with $L$ ($L \ge 2$) users per cluster. NOMA is applied among the users in the same cluster. For the signal to the $l$th user in cluster $m$, denoted by $\mathrm{UE}_{m,l}$, the BS precodes the superposed signal $\sqrt{p_{m,l}}\,x_{m,l}$ by using a transmit beamforming matrix $\mathbf{F}$, where $p_{m,l}$ is the power allocated to the $l$th user in cluster $m$, and $x_{m,l}$ denotes the normalized transmit signal. We consider a composite channel model with both Rayleigh fading and large-scale path loss. In particular, the channel matrix $\mathbf{H}_{m,l}$ from the BS to $\mathrm{UE}_{m,l}$ can be represented as [13]:

$$\mathbf{H}_{m,l} = \sqrt{L(d_{m,l})}\,\mathbf{G}_{m,l}, \qquad L(d_{m,l}) = \frac{1}{1 + (d_{m,l}/d_0)^{\zeta}}, \qquad (1)$$

where $\mathbf{G}_{m,l} \in \mathbb{C}^{N \times M}$ represents the Rayleigh fading channel gain, $L(d_{m,l})$ denotes the path loss of $\mathrm{UE}_{m,l}$ located at a distance $d_{m,l}$ (km) from the BS and assumed to be the same at each receive antenna, $d_0$ is the reference distance according to the cell size, and $\zeta$ denotes the path loss exponent. The precoding matrix used at the BS is denoted as $\mathbf{F} \in \mathbb{C}^{M \times M}$, which implies that antenna $m$ serves cluster $m$ with the power $p_m = \sum_{l=1}^{L} p_{m,l}$ for any $m$. At the receiver side, $\mathrm{UE}_{m,l}$ employs the unit-norm receive detection vector $\mathbf{v}_{m,l} \in \mathbb{C}^{N \times 1}$ to suppress the inter-cluster interference. In order to completely remove inter-cluster interference, the precoding and detection matrices need to satisfy the following constraints:

$$\mathbf{v}_{m,l}^{H}\mathbf{H}_{m,l}\mathbf{f}_{k} = 0, \quad \forall k \ne m, \qquad (2)$$

where $\mathbf{f}_k$ is the $k$th column of $\mathbf{F}$ [12]. It should be noted that the number of antennas should satisfy $N \ge M$ to make this feasible. Because of the zero-forcing (ZF)-based detection design, the inter-cluster interference can be removed even when there exist multiple users in a cluster. Note that only a scalar value $|\mathbf{v}_{m,l}^{H}\mathbf{H}_{m,l}\mathbf{f}_{m}|^2$ needs to be fed back to the BS from $\mathrm{UE}_{m,l}$. For the considered MIMO-NOMA scheme, the BS multiplexes the intended signals for all users on the same frequency and time resource. Therefore, the corresponding transmitted signal from the BS can be expressed as:

$$\mathbf{s} = \mathbf{F}\mathbf{x}, \qquad (3)$$

where the information-bearing vector $\mathbf{x} \in \mathbb{C}^{M \times 1}$ can be further written as:

$$\mathbf{x} = \Big[\sum_{l=1}^{L}\sqrt{p_{1,l}}\,x_{1,l},\ \ldots,\ \sum_{l=1}^{L}\sqrt{p_{M,l}}\,x_{M,l}\Big]^{T}, \qquad (4)$$

where $\sum_{m=1}^{M}\sum_{l=1}^{L} p_{m,l} \le p_{max}$ and $p_{max}$ denotes the total transmit power at the BS. Accordingly, the observed signal at $\mathrm{UE}_{m,l}$ is given by

$$\mathbf{y}_{m,l} = \mathbf{H}_{m,l}\mathbf{F}\mathbf{x} + \mathbf{n}_{m,l}, \qquad (5)$$

where $\mathbf{n}_{m,l} \sim \mathcal{CN}(\mathbf{0}, \sigma^{2}\mathbf{I})$ is the independent and identically distributed (i.i.d.) additive white Gaussian noise (AWGN) vector. By applying the detection vector $\mathbf{v}_{m,l}$ to the observed signal, Equation (5) can be expressed as [10]:

$$\mathbf{v}_{m,l}^{H}\mathbf{y}_{m,l} = \mathbf{v}_{m,l}^{H}\mathbf{H}_{m,l}\mathbf{f}_{m}x_{m} + \sum_{k \ne m}\mathbf{v}_{m,l}^{H}\mathbf{H}_{m,l}\mathbf{f}_{k}x_{k} + \mathbf{v}_{m,l}^{H}\mathbf{n}_{m,l}, \qquad (6)$$

where the second term is the interference from other clusters, and $x_k$ denotes the $k$th element of $\mathbf{x}$. Owing to the constraint (2) on the detection vector, i.e. $\mathbf{v}_{m,l}^{H}\mathbf{H}_{m,l}\mathbf{f}_{k} = 0$ for any $k \ne m$, Equation (6) can be simplified as:

$$\mathbf{v}_{m,l}^{H}\mathbf{y}_{m,l} = \mathbf{v}_{m,l}^{H}\mathbf{H}_{m,l}\mathbf{f}_{m}x_{m} + \mathbf{v}_{m,l}^{H}\mathbf{n}_{m,l}. \qquad (7)$$

Without loss of generality, the effective channel gains are ordered as:

$$|\mathbf{v}_{m,1}^{H}\mathbf{H}_{m,1}\mathbf{f}_{m}|^{2} \ge |\mathbf{v}_{m,2}^{H}\mathbf{H}_{m,2}\mathbf{f}_{m}|^{2} \ge \cdots \ge |\mathbf{v}_{m,L}^{H}\mathbf{H}_{m,L}\mathbf{f}_{m}|^{2}. \qquad (8)$$

At the receiver, each user conducts SIC to remove the interference from the users with worse channel gains, i.e. the interference from $\mathrm{UE}_{m,l+1}, \ldots, \mathrm{UE}_{m,L}$ is removed by $\mathrm{UE}_{m,l}$. As a result, the achieved data rate at $\mathrm{UE}_{m,l}$ is given by

$$R_{m,l} = B_{m,l}\log_{2}\Bigg(1 + \frac{p_{m,l}\,|\mathbf{v}_{m,l}^{H}\mathbf{H}_{m,l}\mathbf{f}_{m}|^{2}}{|\mathbf{v}_{m,l}^{H}\mathbf{H}_{m,l}\mathbf{f}_{m}|^{2}\sum_{k=1}^{l-1}p_{m,k} + \sigma^{2}}\Bigg), \qquad (9)$$

where $B_{m,l}$ is the bandwidth assigned to $\mathrm{UE}_{m,l}$.
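To make Equation (9) concrete, the following Python sketch computes the per-user SIC rates within one cluster; the function name and the example numbers are illustrative assumptions, not part of the system specification.

```python
import numpy as np

def cluster_rates(p, g, bandwidth, noise_power):
    """Per-user NOMA rates in one cluster (a sketch of Equation (9)).

    p: powers p_{m,1..L}, ordered so that user 1 has the best effective
       channel gain; g = |v^H H f_m|^2, sorted in descending order.
    After SIC, user l only sees residual interference from users 1..l-1.
    """
    L = len(p)
    rates = np.empty(L)
    for l in range(L):
        interference = g[l] * p[:l].sum()     # intra-cluster interference left after SIC
        sinr = p[l] * g[l] / (interference + noise_power)
        rates[l] = bandwidth * np.log2(1.0 + sinr)
    return rates

# Example: two users; the weaker-channel user gets more power.
print(cluster_rates(np.array([0.2, 0.8]), np.array([1.0, 0.25]),
                    bandwidth=1e6, noise_power=1e-7))
```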

Problem formulation
The total power consumption at the BS comprises two parts: the fixed circuit power consumption $p_c$, and the flexible transmit power $\sum_{m=1}^{M}\sum_{l=1}^{L} p_{m,l}$. In this work, we determine the power allocation coefficient $q_{m,l}$ instead of the power volume $p_{m,l}$, i.e. $q_{m,l} = p_{m,l}/p_m$ is the power allocation coefficient for $\mathrm{UE}_{m,l}$ in cluster $m$, where $p_m$ is the power allocated to cluster $m$ in a cell. Thus, Equation (9) can be rewritten as:

$$R_{m,l} = B_{m,l}\log_{2}\Bigg(1 + \frac{q_{m,l}\,p_{m}\,|\mathbf{v}_{m,l}^{H}\mathbf{H}_{m,l}\mathbf{f}_{m}|^{2}}{p_{m}\,|\mathbf{v}_{m,l}^{H}\mathbf{H}_{m,l}\mathbf{f}_{m}|^{2}\sum_{k=1}^{l-1}q_{m,k} + \sigma^{2}}\Bigg). \qquad (10)$$

Similar to [11,34], we define the EE of the system in cluster $m$ as:

$$\eta_{m} = \frac{R_{m}}{p_{m} + p_{c}}, \qquad (11)$$

where $R_m = \sum_{l=1}^{L} R_{m,l}$ denotes the achievable sum rate in cluster $m$. We aim to maximize the EE of the MIMO-NOMA system when each user has a predefined minimum rate $R_{m,l}^{min}$. Thus, our considered problem can be formulated as:

$$\begin{aligned} \max_{\{p_m,\,q_{m,l}\}} \quad & \sum_{m=1}^{M}\eta_{m} \\ \text{s.t.} \quad & \mathrm{C1:}\; R_{m,l} \ge R_{m,l}^{min}, \;\forall m, l, \\ & \mathrm{C2:}\; \textstyle\sum_{m=1}^{M} p_{m} \le p_{max}, \\ & \mathrm{C3:}\; \textstyle\sum_{l=1}^{L} q_{m,l} \le 1, \;\forall m, \\ & \mathrm{C4:}\; q_{m,1} \le q_{m,2} \le \cdots \le q_{m,L}, \;\forall m, \\ & \mathrm{C5:}\; p_{m} \ge 0,\; 0 \le q_{m,l} \le 1, \;\forall m, l, \end{aligned} \qquad (12)$$

where C1 represents the users' minimum rate requirements, C2 and C3 respectively represent the transmit power constraints in a cell and in a cluster, C4 indicates that the user with the worse channel condition is allocated more power, and C5 gives the inherent constraints on $p_m$ and $q_{m,l}$. This optimization problem is non-convex and NP-hard, and the global optimal solution is usually difficult to obtain in practice due to the high computational complexity and the randomly evolving channel conditions. More importantly, the conventional model-based approaches can hardly satisfy the requirements of future wireless communication services. Therefore, we propose two DRL-based frameworks in the following sections to deal with this problem.

FIGURE 1 The architecture of wireless network with small-cells and reinforcement learning formulation
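Building on the cluster_rates sketch above, the per-cluster EE of Equation (11) can be evaluated in a few lines; note that attributing the full circuit power $p_c$ to each cluster's denominator is an assumption of this sketch.

```python
def cluster_ee(q, p_m, g, bandwidth, noise_power, p_c):
    """EE of cluster m as in Equation (11), reusing cluster_rates above.

    q: power allocation coefficients (sum <= 1, constraint C3). How the
    circuit power p_c is split across clusters is assumed, not specified.
    """
    rates = cluster_rates(q * p_m, g, bandwidth, noise_power)
    return rates.sum() / (p_m + p_c)
```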

Deep reinforcement learning formulation for MIMO-NOMA systems
In this section, the optimization problem of power allocation in a downlink MIMO-NOMA system is modelled as a reinforcement learning task, which consists of an agent and an environment interacting with each other. Figure 1 describes the architecture of a wireless network with small cells which are ultra-densely deployed and have the same number of antennas as user handsets, or even fewer. As shown in Figure 1, the base station is treated as the agent and the wireless channel of the MIMO-NOMA system is the environment. The action, the transmit power from the BS to the users, taken by the DRL controller (at the BS), is based on the state, which is the collective channel condition information from the users. Then, at each step, based on the observed state of the environment, the agent performs an action from the action space to allocate power to the users according to the power allocation policy, where the policy is learned by the DRL algorithms. With the obtained transmit power, the power allocation is conducted and the step reward can be computed and fed back to the agent. This reward is the energy efficiency of the MIMO-NOMA system. The agent can thus learn power allocation actions optimized for given channel information through repeated learning of the selection process that maximizes the reward.

FIGURE 2 Multi-agent DRL-based MIMO-NOMA power allocation system model
In this paper, we propose two multi-agent DRL-based frameworks (i.e. MDPA and MTPA) for the downlink MIMO-NOMA system to derive the power allocation decision, based respectively on the DDPG and TD3 algorithms. The multi-agent DRL-based network comprises M pairs of actor and critic networks, where each pair outputs the power allocation coefficients for all users in a cluster, as well as one additional actor network which decides the power volume allocated to every cluster. In Figure 2, we show the multi-agent DRL-based power allocation network mechanism with M clusters in the MIMO-NOMA system (the target networks for soft weight updates are not shown), where actor network m (m = 1, …, M) decides the power allocation coefficients for all users in its cluster and actor network 0 decides the power volume allocated to every cluster. Compared with the mechanisms in previous works [42,48,57], we not only adopt the off-policy multi-agent method in accordance with the properties of MIMO-NOMA systems, but also create an additional actor network (actor 0 in Figure 2), which is updated in cooperation with the other agents' critic networks to adjust the power volumes allocated to clusters, improving the overall performance of the MIMO-NOMA system.
In order to better illustrate our algorithm, we first briefly introduce the background of DRL. A general RL model consists of four parts: the state space $\mathcal{S}$, the action space $\mathcal{A}$, the immediate reward $\mathcal{R}$ and the transition probability $\mathcal{P}_{ss'}$ between the current state $s$ and the next state $s'$. In every TS (time slot) $t$, one or several agents take an action $a_t$ (e.g. power allocation decisions) under the current state $s_t$ (e.g. the current channel condition), and then receive an immediate reward $r_t$ (e.g. the immediate EE) and a new state $s_{t+1}$; these are defined as follows [42,57].
States: The state $s_t \in \mathcal{S}$ in TS $t$ is defined as the current channel gain of all users,

$$s_t = \{s_t^1, \ldots, s_t^M\}, \qquad s_t^m = \{h_{m,1}(t), \ldots, h_{m,L}(t)\}, \qquad (13)$$

where $s_t^m$ ($m = 1, \ldots, M$) represents the current channel gains of all users in cluster $m$, and $h_{m,l}(t)$ represents the channel gain between the BS and user $\mathrm{UE}_{m,l}$ in TS $t$. They are assumed to be obtained at the beginning of the TS. The state-space complexity is related to the number of UEs in the system. Since the total number of users in the system is $M \times L$, grouped randomly into $M$ clusters with $L$ users per cluster, the space complexity of $s_t$ is $O(M \times L)$ and that of $s_t^m$ is $O(L)$.

Actions: The action space $\mathcal{A}$ should contain all power allocation decisions, so the action $a_t$ in TS $t$ is

$$a_t = \{a_t^1, \ldots, a_t^M\}, \qquad a_t^m = (p_t^m, q_t^m), \qquad (14)$$

where $p_t^m \in \mathbb{R}^1$ is the power volume allocated to cluster $m$ in a cell, $q_t^m \in \mathbb{R}^L$ holds the power allocation coefficients for all users in cluster $m$, and $q_{m,l}(t) \in \mathbb{R}^1$ is the power allocation coefficient for $\mathrm{UE}_{m,l}$ in TS $t$. The power allocation decision network outputs power allocation coefficients for every user and a power volume for every cluster; hence, each user receives its own coefficient's share of the power volume allocated to the corresponding cluster, i.e. $p_{m,l}(t) = q_{m,l}(t)\,p_t^m$. Therefore, the action-space complexity of the power allocation network is $O(M \times (L + 1))$. The transmit power is a continuous variable taking infinitely many values, so we develop the MDPA network and the MTPA network to solve the aforementioned problems.
Rewards: We use the EE of the MIMO-NOMA system in cluster $m$ as the immediate reward $r_t^m \in \mathcal{R}$ returned after choosing action $a_t$ in state $s_t$, that is

$$r_t^m = \eta_m(t). \qquad (15)$$

We aim to maximize the long-term cumulative discounted reward defined as:

$$R_t^m = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}^{m}, \qquad (16)$$

with discount factor $\gamma \in (0, 1)$. In order to achieve this goal, a policy $\pi_m$ for cluster $m$, defined as a function mapping states to actions ($\pi_m: \mathcal{S} \to \mathcal{A}$), is needed. The policy $\pi_m$ acts like a guidance, i.e. it tells the agent which $a_t^m$ should be taken in a specific $s_t^m$ to achieve the expected $R_t^m$; maximizing Equation (16) is thus equivalent to finding the optimal policy, represented as $\pi_m^{*}$. For a typical RL problem, the Q value function [43], which describes the expected cumulative reward $R_t^m$ of starting from $s_t^m$, performing action $a_t^m$ and thereafter following policy $\pi_m$, is instrumental in solving RL problems, and is defined as:

$$Q_{\pi_m}(s_t^m, a_t^m) = \mathbb{E}_{\pi_m}\big[R_t^m \,\big|\, s_t^m, a_t^m\big]. \qquad (17)$$

The optimal policy $\pi_m^{*}$ which maximizes Equation (16) also maximizes Equation (17) for all states and actions in cluster $m$, and the corresponding optimal Q value function following $\pi_m^{*}$ is given as:

$$Q_{\pi_m^{*}}(s_t^m, a_t^m) = \max_{\pi_m} Q_{\pi_m}(s_t^m, a_t^m). \qquad (18)$$

Once we have the optimal Q value function in Equation (18), the agent knows how to select actions optimally.
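The state-action-reward loop above can be wired into a minimal environment sketch; the class below is a hypothetical simulator (random Rayleigh gains as a placeholder state) that reuses the cluster_rates helper from Section 3 and pays a zero reward to clusters that violate the QoS constraint C1, mirroring the reward rule used in Algorithm 1 later.

```python
import numpy as np

class MimoNomaPowerEnv:
    """Minimal sketch of the RL loop in Figure 1 (hypothetical simulator)."""
    def __init__(self, M=2, L=2, p_c=0.5, bandwidth=1e6,
                 noise_power=1e-7, r_min=1e5):
        self.M, self.L, self.p_c, self.r_min = M, L, p_c, r_min
        self.bandwidth, self.noise_power = bandwidth, noise_power

    def observe(self):
        # State: effective channel gains of all M*L users, sorted per cluster
        # in descending order to match the ordering in Equation (8).
        self.gains = np.sort(np.random.rayleigh(1.0, (self.M, self.L)))[:, ::-1]
        return self.gains

    def step(self, p, q):
        # Action: per-cluster power volumes p and coefficient vectors q.
        rewards = np.zeros(self.M)
        for m in range(self.M):
            rates = cluster_rates(q[m] * p[m], self.gains[m],
                                  self.bandwidth, self.noise_power)
            if (rates >= self.r_min).all():                   # QoS constraint C1
                rewards[m] = rates.sum() / (p[m] + self.p_c)  # per-cluster EE reward
        return rewards, self.observe()                        # reward and next state
```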

Multi-agent DDPG-based power allocation (MDPA) network
We use the off-policy multi-agent DDPG network based on the actor-critic structure to solve the dynamic power allocation problem, where the actor part is an enhanced DPG network and the critic part is a DQN. As mentioned before, the centralized controller receives the channel gains of all users as $s_t = \{s_t^1, \ldots, s_t^M\}$ at the beginning of each TS. With input $s_t$, every DNN named actor network with weights $\theta_k$ ($k = 0, 1, \ldots, M$) outputs a deterministic action rather than a stochastic probability over actions, which removes the further sampling and integration operations required in other actor-critic-based methods [46], that is

$$p_t^m = \mu_0^{(m)}(s_t; \theta_0), \qquad q_t^m = \mu_m(s_t^m; \theta_m), \qquad (19)$$

where $\mu_0$ is the policy corresponding to Actor 0, which decides the power volume allocated to every cluster in a cell, $\mu_0^{(m)}$ denotes the $m$th element of $\mu_0(s_t; \theta_0)$, and the policy $\mu_m$ decides the power allocation coefficients for all users in cluster $m$.
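A minimal sketch of such a deterministic per-cluster actor is shown below; the layer sizes are hypothetical, and a sigmoid output layer keeps every coefficient inside the (0, 1) interval used for $q_t^m$ in Equation (21).

```python
import torch.nn as nn

class ClusterActor(nn.Module):
    """Deterministic policy mu_m: cluster channel state -> coefficients q^m.

    Hypothetical layer sizes; the sigmoid bounds each output in (0, 1),
    matching the action restriction applied to q below.
    """
    def __init__(self, L, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(L, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, L), nn.Sigmoid(),
        )

    def forward(self, s_m):
        return self.net(s_m)
```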
However, a major challenge of learning with deterministic actions is the exploration of new actions [49]. Fortunately, for off-policy algorithms such as DDPG, exploration can be treated independently from the learning process. We construct the exploration policy by adding noise to the original output action, similar to the random selection of the $\epsilon$-greedy method in DQN, that is

$$p_t^m = \mu_0^{(m)}(s_t; \theta_0) + \mathcal{N}_0^{(m)}, \qquad (20)$$
$$q_t^m = \mu_m(s_t^m; \theta_m) + \mathcal{N}_m, \qquad (21)$$

where $\mathcal{N}_0 \in \mathbb{R}^M$ and $\mathcal{N}_m \in \mathbb{R}^1$ represent the noise and follow a normal distribution. The action $p_t^m$ is restricted to the interval $(0, p_{max})$ and $q_t^m$ is restricted to the interval $(0, 1)$. After executing the action $a_t$ and receiving $r_t = \{r_t^1, \ldots, r_t^M\}$, the system moves to the next state $s_{t+1}$. Since the action $a_t$ is generated by a deterministic policy network, we rewrite Equation (17) as:

$$Q_{\mu_m}(s_t^m, a_t^m) = \mathbb{E}\big[r_t^m + \gamma\, Q_{\mu_m}\big(s_{t+1}^m, \mu_m(s_{t+1}^m)\big)\big]. \qquad (22)$$

Since it is impractical to calculate the Q value of every step in this way, we also use another DNN named the critic network with weights $\phi_m$ ($m = 1, 2, \ldots, M$) to output the approximate Q value $Q_m(s_t^m, a_t^m; \phi_m)$ to evaluate the selected action $a_t^m$. We utilize the experience replay method to allow the networks to benefit from learning across a set of uncorrelated tuples. With $N_b$ random tuples $(s_i, a_i, r_i, s_{i+1})$ sampled from the replay memory $\mathcal{D}$, the actor networks update their weights in the direction of larger Q values according to the deterministic policy gradient theorem in [49]:

$$\nabla_{\theta_m} J(\mu_m) \approx \frac{1}{N_b}\sum_{i}\nabla_{a^m} Q_m(s_i^m, a^m; \phi_m)\big|_{a^m = \mu_m(s_i^m; \theta_m)}\,\nabla_{\theta_m}\mu_m(s_i^m; \theta_m), \qquad (23)$$
$$\nabla_{\theta_0} J(\mu_0) \approx \frac{1}{N_b}\sum_{i}\sum_{m=1}^{M}\nabla_{p^m} Q_m(s_i^m, a^m; \phi_m)\,\nabla_{\theta_0}\mu_0^{(m)}(s_i; \theta_0), \qquad (24)$$

where $J(\mu_m)$ and $J(\mu_0)$ respectively represent the expected total reward of following policy $\mu_m$ in all states of cluster $m$ and policy $\mu_0$ in all states in a cell. We then use the target network architecture from DQN to solve the unstable learning issue caused by using only one network to calculate target Q values and update weights at the same time. We create copies of the actor network and the critic network as $\bar{\mu}_m(s^m; \bar{\theta}_m)$ and $\bar{Q}_m(s^m, a^m; \bar{\phi}_m)$, named the target actor network and the target critic network. Thus, the target Q value can be generated by these two networks, that is

$$y_i^m = r_i^m + \gamma\, \bar{Q}_m\big(s_{i+1}^m,\ \bar{\mu}_m(s_{i+1}^m; \bar{\theta}_m);\ \bar{\phi}_m\big).$$

Algorithm 1 The multi-agent DDPG-based power allocation (MDPA) in downlink MIMO-NOMA
1. Initialize the replay memory $\mathcal{D}$ with capacity $|\mathcal{D}|$.
2. Initialize the multi-agent DDPG-based power allocation actor networks $\mu_0(s_t; \theta_0)$, $\mu_m(s_t^m; \theta_m)$ and critic networks $Q_m(s_t^m, a_t^m; \phi_m)$.
3. Initialize the target actor networks $\bar{\mu}_0(s_t; \bar{\theta}_0)$, $\bar{\mu}_m(s_t^m; \bar{\theta}_m)$ and target critic networks $\bar{Q}_m(s_t^m, a_t^m; \bar{\phi}_m)$ with parameters $\bar{\theta}_0 = \theta_0$, $\bar{\theta}_m = \theta_m$ and $\bar{\phi}_m = \phi_m$.
4. Initialize the random processes $\mathcal{N}_0$, $\mathcal{N}_m$ for the DDPG action exploration, the terminal TS $T_{max}$ and the parameter update interval $C$.
5. The controller at the BS receives the first channel condition information of all users as the initial state $s_1$.
6. for $t = 1, 2, \ldots, T_{max}$ do
7. The controller selects the power allocation action $a_t^m = (p_t^m, q_t^m)$ according to Equations (20) and (21).
8. The controller broadcasts the power allocation action $a_t^m$ to all users, and the users transmit their signals with the specified power.
9. If the action $a_t^m$ for any $m$ satisfies the minimum rate requirement, the controller receives the current EE as the reward $r_t^m$; otherwise, it receives no reward.
10. The controller receives the next state $s_{t+1}$ as the users move to their next positions.
11. Store the tuple $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$.
12. Randomly sample a mini-batch of $N_b$ tuples $(s_i, a_i, r_i, s_{i+1})$ from $\mathcal{D}$.
13. The critic networks $Q_m(s_t^m, a_t^m; \phi_m)$ update $\phi_m$ by minimizing the loss function in Equation (27).
14. The actor networks $\mu_m(s_t^m; \theta_m)$ update $\theta_m$ according to Equation (23).
15. The critic networks $Q_m(s_t^m, a_t^m; \phi_m)$ update $\phi_m$ by minimizing the loss function in Equation (28).
17. Update the target actor networks $\bar{\theta}_0$, $\bar{\theta}_m$ and target critic networks $\bar{\phi}_m$ according to Equation (29) every $C$ TSs.
end for
We use the same method as in DQN to update the critic network weights by minimizing the loss function defined as:

$$L(\phi_m) = \frac{1}{N_b}\sum_{i}\big(y_i^m - Q_m(s_i^m, a_i^m; \phi_m)\big)^{2}. \qquad (27)$$

Instead of directly copying the weights to the target networks, we update the weights $\bar{\theta}_k$ and $\bar{\phi}_m$ in a soft manner to make sure the weights change slowly, which greatly improves the stability of learning, that is

$$\bar{\theta}_k \leftarrow \tau\theta_k + (1-\tau)\bar{\theta}_k, \qquad \bar{\phi}_m \leftarrow \tau\phi_m + (1-\tau)\bar{\phi}_m, \qquad (29)$$

where $0 < \tau < 1$. The MDPA algorithm in a downlink MIMO-NOMA system is summarized in Algorithm 1.
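To make the interplay between the per-cluster agents and the shared actor 0 concrete, the following PyTorch-style sketch performs one MDPA training step on a sampled mini-batch. All container names, tensor shapes and the exact way actor 0's loss aggregates the cluster critics are illustrative assumptions, not the paper's reference implementation; in particular, holding the executed coefficients from the batch fixed while optimizing actor 0 is a design choice of this sketch.

```python
import torch

def mdpa_update(actor0, actors, critics, t_actor0, t_actors, t_critics,
                opt0, actor_opts, critic_opts, batch, gamma=0.99, tau=0.01):
    """One MDPA training step; `actors`/`critics` are lists indexed by cluster m."""
    s, p, q, r, s_next = batch          # s: [N_b, M, L] channel gains
    M = len(actors)

    for m in range(M):
        # Per-cluster critic update against the target networks (cf. Equation (27)).
        with torch.no_grad():
            p_next = t_actor0(s_next.flatten(1))[:, m:m+1]
            q_next = t_actors[m](s_next[:, m])
            a_next = torch.cat([p_next, q_next], dim=1)
            y = r[:, m:m+1] + gamma * t_critics[m](s_next[:, m], a_next)
        a = torch.cat([p[:, m:m+1], q[:, m]], dim=1)
        critic_loss = ((y - critics[m](s[:, m], a)) ** 2).mean()
        critic_opts[m].zero_grad(); critic_loss.backward(); critic_opts[m].step()

        # Per-cluster actor update along the deterministic policy gradient (cf. Equation (23)).
        a_pi = torch.cat([p[:, m:m+1], actors[m](s[:, m])], dim=1)
        actor_loss = -critics[m](s[:, m], a_pi).mean()
        actor_opts[m].zero_grad(); actor_loss.backward(); actor_opts[m].step()

    # Actor 0 (cluster power volumes) is updated by the combined effect of
    # every agent's critic (cf. Equation (24)); q is held fixed from the batch.
    p_all = actor0(s.flatten(1))        # [N_b, M]
    loss0 = -sum(critics[m](s[:, m],
                            torch.cat([p_all[:, m:m+1], q[:, m]], dim=1)).mean()
                 for m in range(M))
    opt0.zero_grad(); loss0.backward(); opt0.step()

    # Soft update of all target networks (cf. Equation (29)).
    pairs = [(t_actor0, actor0)] + list(zip(t_actors, actors)) + list(zip(t_critics, critics))
    for t_net, net in pairs:
        for tp, pr in zip(t_net.parameters(), net.parameters()):
            tp.data.mul_(1 - tau).add_(tau * pr.data)
```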

Multi-agent TD3-based power allocation (MTPA) network
Twin-delayed deep deterministic policy gradient (TD3) is an off-policy model with continuous high-dimensional spaces which has recently achieved breakthroughs in artificial intelligence. RL algorithms characterized as off-policy generally utilize a separate behaviour policy which is independent of the policy being improved upon. The key advantage of this separation is that the behaviour policy can operate by sampling all actions, while the estimation policy can be deterministic [61]. TD3 was built on the DDPG algorithm to increase stability and performance with consideration of function approximation error [60]. The uniqueness of the TD3 algorithm lies in its combination of three powerful DRL techniques: continuous double deep Q-learning [58], policy gradient [56] and actor-critic [54]. Even though DDPG can sometimes achieve good performance, it is very sensitive to hyper-parameters and other types of adjustments. Overestimation of the Q-value at the beginning of the Q-function learning process is a common failure mode of DDPG. The overestimation bias is a property of Q-learning in which the maximization of a noisy value estimate induces a consistent overestimation [60,64]. This noise is unavoidable given the inaccuracy of the estimator in function approximation settings; therefore, overestimation can occur due to inaccurate action values. TD3 is an algorithm that solves this problem by introducing three key techniques [60]. The first is clipped double Q-learning: TD3 learns two Q-functions and uses the smaller of the two Q-values to form the target of the Bellman error. The second is the delayed policy update, where the policy network parameters are updated only after the two Q-function networks are updated. The third is target policy smoothing, which adds clipped noise to the target action so that the value estimate is smoothed over a small range of similar actions [45].
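The three techniques can be summarized in a few lines of code. The sketch below uses hypothetical network handles and the default hyper-parameters from [60]; it is a generic TD3 step (with actions normalized to [-1, 1]), not the MTPA-specific update derived next.

```python
import torch

def td3_step(step, batch, actor, critics, t_actor, t_critics,
             actor_opt, critic_opts, gamma=0.99, tau=0.005,
             policy_delay=2, noise_std=0.2, noise_clip=0.5):
    """One TD3 update on a mini-batch; `critics`/`t_critics` are pairs of networks."""
    s, a, r, s_next = batch

    with torch.no_grad():
        # Target policy smoothing: perturb the target action with clipped noise.
        noise = (torch.randn_like(a) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = (t_actor(s_next) + noise).clamp(-1.0, 1.0)
        # Clipped double Q-learning: bootstrap from the smaller target estimate.
        y = r + gamma * torch.min(t_critics[0](s_next, a_next),
                                  t_critics[1](s_next, a_next))

    for critic, opt in zip(critics, critic_opts):
        loss = ((y - critic(s, a)) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

    if step % policy_delay == 0:
        # Delayed policy update: the actor follows the first critic only, and
        # the target networks are refreshed at the same reduced frequency.
        actor_loss = -critics[0](s, actor(s)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        for t_net, net in zip(list(t_critics) + [t_actor], list(critics) + [actor]):
            for tp, p in zip(t_net.parameters(), net.parameters()):
                tp.data.mul_(1 - tau).add_(tau * p.data)
```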
In this paper, we also use the multi-agent TD3 network to derive the power allocation decision in the MIMO-NOMA system, where the network mechanism is the same as that of the DDPG network above. Every agent $m$ ($m = 1, 2, \ldots, M$) learns two Q-functions to obtain a less optimistic estimate of an action value by taking the minimum of the two estimates, so the two Q-functions with weights $\phi_m^j$ ($j = 1, 2$) for every actor network can be expressed as $Q_m^j(s_t^m, a_t^m; \phi_m^j)$. Similar to our DDPG-based method, we utilize the experience replay memory, and the target Q value $y_i^m$ is then given by:

$$y_i^m = r_i^m + \gamma \min_{j=1,2} \bar{Q}_m^j\big(s_{i+1}^m,\ \bar{\mu}_m(s_{i+1}^m; \bar{\theta}_m) + \epsilon_m;\ \bar{\phi}_m^j\big),$$

where the added noise terms in Equations (32) and (34) are clipped to keep the target close to the original action. The critic networks are then updated by minimizing the loss functions defined as:

$$L(\phi_m^j) = \frac{1}{N_b}\sum_{i}\big(y_i^m - Q_m^j(s_i^m, a_i^m; \phi_m^j)\big)^{2}.$$

We then update the actor network $\mu_m$ in the direction of larger Q values with $Q_m^1$ by Equations (37) and (38). Finally, we update the weights $\bar{\theta}_0$, $\bar{\theta}_m$ and $\bar{\phi}_m^j$ in a soft manner to make sure the weights change slowly, that is

$$\bar{\theta}_k \leftarrow \tau\theta_k + (1-\tau)\bar{\theta}_k, \qquad (39)$$
$$\bar{\phi}_m^j \leftarrow \tau\phi_m^j + (1-\tau)\bar{\phi}_m^j, \qquad (40)$$

The MTPA algorithm in a downlink MIMO-NOMA system is summarized in Algorithm 2.

Algorithm 2 The multi-agent TD3-based power allocation (MTPA) in downlink MIMO-NOMA
2. Initialize the multi-agent TD3-based power allocation actor networks.
⋮
10. The controller receives the next state $s_{t+1}$ as the users move to their next positions.
11. Store the tuple $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$.
12. Randomly sample a mini-batch of $N_b$ tuples $(s_i, a_i, r_i, s_{i+1})$ from $\mathcal{D}$.
⋮
19. if $t \bmod C$ then
⋮

SIMULATION RESULTS
In this section, we present the simulation results of the two multi-agent DRL-based power allocation algorithms (i.e. MDPA and MTPA). The results are simulated in a downlink MIMO-NOMA system, where the BS is located at the centre of the cell and 4 users are randomly distributed in the cell with a radius of 500 m. The specific values of the adopted simulation parameters are summarized in Table 1. If not specified, $T_{max} = 5000$ TSs, $p_{max} = 1$ W and all users move randomly with a speed of $v = 1$ m/s. Our multi-agent DRL-based framework for simulation has five networks (three actor networks and two critic networks), not counting the target networks. The soft update rate $\tau$ in Equations (29), (39) and (40) is set to 0.01, the noise processes in Equations (20) and (21) follow a normal distribution $\mathcal{N}(0, 1)$, and the noise parameters in Equations (32) and (34) are set to 0.1. In order to compare the performance of the proposed frameworks, we consider several alternative approaches: (1) single-agent DDPG/TD3-based power allocation methods (SDPA/STPA), which output the power allocation policy for every cluster by independent single agents based on the DDPG/TD3 algorithm to maximize the sum EE of the system; (2) the DQN-based discrete power allocation strategy (QPA), which uses the DQN method to output the allocated power of all users by quantizing the power into 10 levels between 0 and $p_{max}$ in order to fit the output layer of the DQN, because the transmit power is a continuous variable and the action space of a DQN has to be finite; (3) the fixed power allocation strategy (FPA), which chooses the action that maximizes the sum EE with maximum transmit power by exhaustive search in each TS to ensure a high quality-of-service (QoS) of the system; (4) multi-agent DDPG/TD3-based power allocation methods for MIMO-OMA systems (MTPA-OMA/MDPA-OMA), where the MIMO-OMA system model is referenced from [14].
QPA is evaluated by exhaustive search over its discrete action set in each TS; because it does not rely on global trajectory data, it easily falls into a local optimum. Moreover, such quantization leads to a serious problem: a huge action space of possible power selections. In our QPA scenario with 4 users and a quantization level $k = 10$, each user can be allocated one of 10 possible power levels, so the number of selectable actions already reaches $10^4 = 10{,}000$. It should be noted that this is only a small system; in future-generation wireless networks with a high density of users, the size of the action space grows exponentially with the number of users. Such a large action space leads to poor performance, because the DQN agent needs plenty of time to explore the entire action space to find the best power allocation option, and due to the randomly evolving nature of the communication environment, it may be difficult to choose the option that leads to the best performance. In addition, such quantization also discards useful information that is essential for finding the optimal power allocation. This is also the reason why we use the off-policy network based on the actor-critic structure for power allocation tasks. The convergence complexity of a DRL algorithm depends on the size of the action-state space. For the QPA framework, the size of the output action set is $O(k^{L \times M})$, which increases exponentially with the number of users and the quantization levels of the power allocation coefficients, whereas for our proposed frameworks the size is $O(M \times (L + 1))$. The state space needs some space to be stored; hence, the corresponding space complexity is $O(M \times L)$ for both QPA and the proposed frameworks. Therefore, the proposed algorithms can improve the energy efficiency of MIMO-NOMA systems with low complexity.
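The action-space arithmetic in this comparison is easy to verify; the snippet below simply reproduces the counts discussed above for the 4-user scenario.

```python
# Action-space sizes from the complexity discussion above (illustrative only).
M, L, k = 2, 2, 10                 # 2 clusters x 2 users, 10 quantization levels

qpa_actions = k ** (L * M)         # joint discrete choices a DQN must enumerate
proposed_dim = M * (L + 1)         # continuous outputs: q coefficients + cluster powers

print(qpa_actions)                 # 10000, matching the 10^4 figure in the text
print(proposed_dim)                # 6 continuous action dimensions
```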
First, we evaluate the convergence of the proposed two multi-agent frameworks under a fixed learning rate of 0.01, which provided the best learning performance in our simulations. We fix the locations of all users to determine how many TSs are required for our frameworks to find the optimal power allocation policy. As shown in Figure 3, the loss function value decreases and tends to a stable value within 200 TSs for both frameworks, and the value is small enough to accurately predict the Q value.
Then, we compare the proposed two multi-agent algorithms with different power allocation approaches in the random moving system. It should be noted that all results are averaged by taking a moving window of 100 TSs to make the comparison clearer. For the objective of sum EE maximization, we investigate the performance of our multi-agent DRL-based frameworks. Figure 4 shows that the MTPA framework achieves better averaged sum EE performance than the other approaches. The DRL-based frameworks are able to dynamically choose the transmit power of all users with continuous control according to their current channel conditions in every TS. Specifically, the averaged sum EE achieved by the MTPA framework over all 5000 TSs is 21.47% higher than that of the FPA approach, whereas the value for the MDPA framework is 17.3% higher; for the STPA and SDPA frameworks, the values are 15.8% and 12.39% higher, respectively. On the other hand, the value for the QPA framework based on the discrete DRL method is only 11.13% higher. More importantly, as long as the data rate requirements of all users are satisfied, it is unnecessary and energy-wasting to use the maximum power for transmission, since this reduces the EE of the system; this also shows the importance of power allocation for the performance improvement of the NOMA system. We also verify the performance advantage of the proposed algorithms over MIMO-OMA. As shown in Figure 4, the numerical results show that the proposed algorithms achieve higher energy efficiency than their MIMO-OMA counterparts: in our simulation scenario, MTPA achieves 18.04% higher EE than MTPA-OMA and MDPA achieves 16.86% higher EE than MDPA-OMA.
We investigate the averaged sum EE performance of the two frameworks over 5000 TSs under different transmit power limitations. As shown in Figure 5, the averaged results of the DRL-based frameworks grow and tend to stable values as the maximum available power increases, while the FPA approach increases slightly and then continues to drop, since it always uses full power for signal transmission. This is because, as long as the data rate requirement is satisfied, our algorithms no longer need full power for transmission and can dynamically allocate the power based on the communication conditions to optimize the sum EE no matter how $p_{max}$ changes. Our proposed power allocation methods based on the deterministic policy gradient with continuous action spaces clearly outperform the discrete DRL-based approaches and conventional methods under most power limitation conditions, and the MTPA framework achieves the best average sum EE performance compared with the other approaches; we also again verify the improved performance of the proposed algorithms for MIMO-NOMA over MIMO-OMA. Figure 6 illustrates how the EE varies with the minimum required data rate $R_{m,l}^{min}$. As shown in Figure 6, our proposed frameworks show outstanding performance compared with the other approaches as well as over MIMO-OMA, even though the EE decreases with $R_{m,l}^{min}$. This can be explained by connecting $R_{m,l}^{min}$ with the transmit power, i.e. a lower $R_{m,l}^{min}$ has the same influence on the EE as a higher transmit power. The obtained results indicate that the average system energy efficiency decreases with increasing $R_{m,l}^{min}$, because as $R_{m,l}^{min}$ increases, the system reaches the infeasibility case more rapidly and becomes unable to satisfy $R_{m,l}^{min}$ for all the system users.

CONCLUSION
In this paper, we have studied the dynamic power allocation problem in a downlink MIMO-NOMA system, and two multi-agent DRL-based frameworks (i.e. the multi-agent DDPG/TD3-based frameworks) have been proposed to improve the long-term EE of the MIMO-NOMA system while ensuring the minimum data rates of all users. Each agent of the two multi-agent frameworks dynamically outputs the optimal power allocation policy for all users in its cluster via the DDPG/TD3 algorithm, and the additional actor network adjusts the power volumes allocated to clusters to improve the overall performance of the MIMO-NOMA system. Finally, both frameworks adjust the entire power allocation policy by updating the weights of the neural networks according to the feedback of the system. Compared with other approaches, our multi-agent DRL-based power allocation frameworks can significantly improve the EE of the MIMO-NOMA system under various transmit power limitations and minimum data rates by adjusting the parameters of the networks. We have also verified the energy performance advantage of the proposed algorithms over the MIMO-OMA system. In future work, we will study the joint subchannel selection and power allocation problem for more practical scenarios of MIMO-NOMA systems by applying more powerful DRL approaches.