Deep Multi-Agent Reinforcement Learning for Cost Efficient Distributed Load Frequency Control

The rise of microgrid-based architectures is heavily modifying the energy control landscape in distribution systems making distributed control mechanisms necessary to ensure reliable power system operations. In this paper, we propose the use of Reinforcement Learning techniques to implement load frequency control without requiring a central authority. To this end, we approximate the optimal solution of the primary, secondary, and tertiary control with the use of the Multi- Agent Deep Deterministic Policy Gradient (MADDPG) algorithm. Generation units are characterised as agents that learn how to maximise their long-term performance by acting and interacting with the environment to balance generation and load in a cost efficient way. Network effects are also modelled in our framework for the restoration of frequency to the nominal value. We validate our Reinforcement Learning methodology through numerical results and show that it can be used to implement the load frequency control in a distributed and cost efficient way.


I. INTRODUCTION
Electrical systems are undergoing major changes; there is a large number of deployed distributed generation that is slowly substituting large electromechanical generators [1]. In the past, the majority of the load was met by large generation units, such as coal or nuclear plants. Nowadays, every single house can be a prosumer, i.e., produce and consume energy; and deliver excess energy to the network. This is facilitated by new market designs, e.g., peer-to-peer markets.
This paradigm shift is shaping our understanding of energy and bringing us a whole new branch of opportunities as well as challenges. In this context of decentralisation, coordination amongst generators to balance generation and load [2], [3] is more challenging. Traditionally, a hierarchical control system is used to meet this objective, i.e., primary, secondary and tertiary frequency control. Primary control keeps frequency between some acceptable limits; secondary control restores frequency to the nominal value; and tertiary control performs so in a cost efficient way. Secondary and tertiary control layers need a central authority to send appropriate control signals to generators to shift their generation to meet load. However, in this new paradigm where there are numerous generators participating in frequency control, the centralised approach has central limitations in terms of computation and privacy concerns. In this regard, new distributed schemes are necessary to deal with the aforementioned challenges [4].
Different approaches have attempted to tackle this problem by implementing the traditional hierarchical control in a distributed manner (see, e.g., [4], [5], [6]). In [7], the authors propose a methodology for primary control to mimic droop control strategies which are by nature decentralised algorithms that act upon each generator. The proposed methodology explicitly represents the modified system dynamics of having electronic inverters instead of large turbines. Moreover, efforts have been made to implement a decentralised secondary control scheme, e.g., the centralised averaging PI (CAPI) presented in [8] and distributed averaging PI (DAPI) given in [9]. These algorithms use weighted averages of the frequency as the integral feedback. Despite their theoretical appeal, they suffer from lack of robustness, and their communication demands make them difficult to implement in real-life scenarios [10]. Recently, several nature-inspired optimisation algorithms have been proposed to solve the primary and secondary layers of the load frequency control problem. Some of the most relevant ones are the water cycle algorithm (WCA) (see, e.g., [11], [12]), the yellow saddle goatfish (YSGA) (see, e.g., [13]) and the butterfly optimisation algorithm (BOA) (see, e.g., [14]). However, none of these techniques take the economic cost into consideration. Regarding tertiary control, it is usually common to solve a primal-dual algorithm that converges to the solution of the dual problem (see, e.g., [15], [5], [16], [17], [18]) where the communication between nodes enables joint global actions. Nevertheless, as with other approaches, communication is intense between nodes and the system may become too complex. Multi-Agent Reinforcement Learning (MARL) is a promising alternative to implement load frequency control in a decentralised way addressing the aforementioned challenges (see, e.g., [19], [20]). The main drawback of these methods is their computational complexity, that grows exponentially in the number of agents. However, the rise of Deep Learning has opened the door to new techniques and algorithms that address these scalability issues in the load frequency control problem (see, e.g., [21], [22]).
In MARL, various software agents learn optimal policies by negotiating, cooperating, and/or competing [23]. In this work we formulate the primary, secondary and tertiary control layers as an MARL problem so that the agents, i.e., generation units, learn to keep generation and load balanced in a cost efficient way by controlling the energy supply while minimising information exchange. More specifically, we recast the load frequency control problem as a Markov Decision Process (MDP), as is usually the practise in reinforcement learning problems. We define the states which are the frequency deviation, the control action of each generator, and their action space. We model the dynamic behaviour of the generators and the network to determine the probability state transition function of the MDP. We design the reward function of the agents so that frequency deviation and total cost are minimised. The design of the reward function is critical, since it determines the behaviour that each agent will learn. In order to determine the reward function we make use of the frequency deviation as well as optimality conditions of the economic dispatch problem to incorporate the cost component in the proposed framework. We use this setup to estimate the action-value function of each state-action pair with the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm. MADDPG is an actor-critic algorithm; this means that the architecture of each agent is split into two. First, the actor directly estimates an action while secondly, the critic assesses the suitability of an action by estimating the actionvalue function of the state-action pair. In MADDPG, the critics use central information to teach each actor the dynamics of the environment as well as the behaviour of the rest of the agents. In operation, actors only use local information since they have learnt how other actors behave during the training phase. Each actor and critic is modelled with a Long Short-Term Memory Network (LSTM) so that previous history is stored and acted upon. To summarise, the contributions of the paper are: i) reformulation of the load frequency control problem as a Markov Decision Process (MDP); ii) use of a detailed model taking into account the network, renewable-based generation and generator rate constraints; iii) design of the reward function of the agents so that frequency deviation and total cost are minimised; iv) development of the proposed framework to solve the optimal load frequency control in a fully distributed manner with only the use of local information; and v) validation of its robustness against uncertainty introduced from renewable-based generation. This problem was initially introduced in [21] and is extended in this paper to implement tertiary control or economic dispatch, i.e., the generation units modify their output to meet the change in load in a cost efficient way; to include a detailed power systems' model by explicitly incorporating the network, windbased generation and a more realistic model of synchronous generator by including the generator rate constraints (GRC).
The remainder of the paper is organised as follows. In Section II we describe the power system model that we adopt to develop our analysis framework. In Section III, we formalise the frequency control problem as an MARL problem. In Section IV, MADDPG is used to implement primary, secondary and tertiary control in a multi-agent problem. In Section V, we present numerical studies to demonstrate that the proposed methodology is a valid alternative to solve load frequency control in a distributed and cost efficient manner. In Section VI, we summarise the results and make some concluding remarks.

II. PRELIMINARIES
In this section, we introduce the secondary and tertiary control models that we utilise to develop our framework. More specifically, we introduce dynamic models for synchronous generators, the automatic generation control (AGC) system, the network, and the economic dispatch.
The system frequency indicates if supply and demand are properly balanced. When the generated power exceeds the load the system frequency increases. Similarly, the system frequency decreases if generation is not sufficient to meet the load. Thus, controlling the frequency of the system is a standard approach to balance demand and supply [24]. The frequency control is divided in a hierarchy of three layers: primary, secondary and tertiary control. In primary control, generation and demand are rapidly balanced since the synchronous generators are either speeding up or slowing down due to the load generation imbalance. This is achieved by a decentralised proportional control mechanism called droop control [25]. Then, a secondary control layer implements an integral control that compensates the steady-state error derived from droop control. Automatic Generation Control (AGC) [26] implements the secondary control layer collecting information from all generation units in a centralised way. Last, the tertiary control layer is related to the economic aspect of power system operations. This layer establishes the load sharing between the sources so that the operational costs are minimised [27]. Tertiary control is implemented through the economic dispatch, which calculates the optimal operating point in an offline process. Next, we present two models, i.e., Model I and Model II, for the description of the power system dynamics. These two models will be used to formulate the frequency control problem of a power system with n generators denoted by G = {G 1 , . . . , G n }.

A. Model I: Balancing Authority (BA) Area Model
It is common in power systems operations to model the dynamic behaviour of the entire balancing authority area instead of individual generators. In this regard, we define by ∆ω the deviation of the centre of inertia speed from the synchronous speed; the total mechanical power produced P SV = i∈G P SVi , with P SVi the mechanical power of generator i; and the total secondary command Z G = i∈G z i , with z i the participation of generator i to AGC. Then the BA area dynamics are: where M = 2H ωs , with H the system inertia constant and ω s the synchronous speed; T SV = i∈G T SV i n , T SVi the time constant of the mechanical power dynamics of generator i; D = i∈G D i , with D i the machine i damping coefficient; with R Di the governor droop of generator i. In this case we neglect the network effects and set P G = P L (1 + ρ), where P L is the system load and ρ denotes the sensitivity of the losses with respect to the system load. The normalised participation factor of bus load changes ∆P Li with respect to total system load change ∆P L is denoted by σ i , the output of generator i is denoted by P i , then ρ, which denotes the sensitivity of the losses with respect to the system load is

B. Model II: Synchronous Generator Dynamics
In Model II the individual generators' dynamics are represented. For the i th synchronous generator, the three states are the rotor electrical angular position δ i , the deviation of the rotor electrical angular velocity from the synchronous speed ∆ω i , and the mechanical power P SVi . We denote by z i the participation of each generator i in AGC. The evolution of the three states of the generator i is determined by: where the inertia constant is H i ; the synchronous speed is ω s and M i = 2Hi ωs ; the machine damping coefficient is D i ; the governor droop is R Di ; and the parameter z i is an input provided by the AGC. The definitions of the machine parameters may be found in [25]. The output of generator i P i is determined by (11).

C. Network
Let us consider a power system with N nodes and P Li represents the real power load at bus i. Further, let Q i and Q Li denote the reactive power supplied by the synchronous generator and demanded by the load at bus i, respectively. Then, we model the network using the standard nonlinear power flow formulation (see, e.g., [25]); thus, for the i th bus, we have that where G ik +jB ik is the (i, k) entry of the network admittance matrix and θ ik = θ i − θ k . We assume that i) bus voltage magnitudes are |V i | = 1p.u for i = 1, . . . , N , ii) lines are lossless and characterised by their susceptances B ik = B ki > 0 for i, k = 1, . . . , N with i = k, iii) reactive power flows do not affect bus voltage phase angles and frequencies, and iv) coherency between the internal and terminal voltage phase angles of each generator so that these angles tend to swing together, i.e., δ i = θ i . As a result, we neglect (8) and simplify (7) to be: If bus i does not contain a generator then P i = 0.
In order to increase the accuracy of (9) we can slightly modify it by incorporating an approximation of the losses. We define the normalised participation factor of bus load changes ∆P Li with respect to total system load change ∆P L by σ i , then ρ i , which denotes the sensitivity of the losses with respect to the system load at bus i is Then (9) becomes:

D. Economic Dispatch
The economic dispatch process is formulated as an optimisation problem, where the objective function that needs to be minimised is the sum of the individual costs of all generating units, c i (P i ), for i ∈ G ; this is typically a quadratic function that computes the production cost of each generation unit. Here, the constraint is that the system has to keep generation and load balanced; if generation and load are balanced then frequency is also nominal. The economic dispatch problem may be formulated as:

E. Wind-based generation
The increasing penetration of renewable-based resources in the system introduces a source of uncertainty in power system operations and thus in load frequency control problems. In this regard, we investigate the effect of wind-based generation units in the proposed framework. The relationship between the wind speed and the generated power can be efficiently modelled as a linear dynamical system [28]. More specifically, P W denotes the real wind generation power output; ∆v the variation of the wind speed; α W1 and α W2 are parameters that depend on the wind turbine characteristics; W t is a Wiener process and β W1 and β W1 coefficients that represent prior knowledge of the wind speed probability distribution. Then, the dynamics of the wind-based generation power output are formulated as follows: III. MULTI-AGENT REINFORCEMENT LEARNING FOR LOAD FREQUENCY CONTROL In this section we formulate the load frequency control problem as an MARL problem. Reinforcement Learning (RL) is an area of Machine Learning strongly related with the notion of software agents [29]. RL studies how software agents interact in an environment to maximize their long-term performance. We use MARL to train a collection of agents how to implement the load frequency control problem in a distributed way. In this paper, an agent physically represents the controller of a generation unit. As such, by using the MARL scheme, which allows for a fully distributed control architecture, the load frequency control can be achieved with no communication infrastructure between the agents, i.e., generating units. The controller physical circuit of each generator does not have to be connected with any of the other controllers, thus allowing a physical "distribution" of the load frequency control system. A diagram of the entire architecture and how the reinforcement learning based AGC design fits in the power system dynamics is shown in Fig. 1.
RL problems are mathematically formalized through a Markov Decision Process (MDP) [30], that is defined as the tuple: where each term is: • S or state space: all possible states where the agent can be in the environment. There are two continuous states in the load frequency control: the deviation from synchronous speed which is quantified by ∆ω i for each generator i or ∆ω in the case of the BA area model; and z i , the current control action of each generator i. These states provide to the agent information about the difference between demand and supply and how much they are contributing to the total generation. • A or action space: all possible actions that each agent takes in every state. Our agents-generators can increase or decrease the control action z i in order to modify the state of the environment. • P or probability state transition function: it defines the dynamics of the environment, modelling the transition between states. This is determined by the modelling approach we are following, i.e., Model I as described in Section II-A or Model II given in Section II-B. For the BA area model or Model I described in Section II-A the transition equations derived from (1) and (2) are: For the detailed modelling of Model II given in Section II-B the transition equations based on the (4),(5), (6) and (11) are as follows: where ∆z i is the increase or decrease in power generation by each unit i in G estimated by each agent. MADDPG is used to estimate ∆z i , as described in Section IV. • R or reward function: it defines a numerical signal or reward expressing the value of being in a state and performing an action. The reward function considers two different dimensions in our case: frequency deviation and operational costs. MARL attempts to learn an optimal policy π : S → A that maximises the cumulative reward, or return. However, the reward is instantaneous and does not address the global nature of the task, i.e., one bad action can lead to an extremely good position from which the agent can obtain a good reward. Thus, action-value functions Q π are used in RL to express the expected long-term reward achievable from being in an state, taking an action and following a policy π: where E[·] is the expectation operator, γ is the discount factor, which expresses the fidelity in long-term predictions of Q π , the cumulative reward achievable in the long run R t , and the reward r t at time t. Most algorithms strongly support their learning process in value functions, such as Q-learning [31].
The action-value function associates a value Q π to each state-action pair. However, when the number of states and actions is too large, it becomes computationally challenging to estimate them efficiently. Recent work has merged the field of RL with Deep Learning, giving birth to a powerful algorithm called Deep Q-learning (DQN) [32]. This algorithm uses deep neural networks as parametric function approximators to estimate the action-value function of each state-action pair.
The spectrum of existing algorithms to solve MARL problems is wide. Most of them use game-theoretic approaches to augment Q-learning, i.e., Nash Q-learning or minimax Qlearning [33]. In our problem, state and action spaces are continuous and the interaction of various agents is required. This limits the range of algorithms available in the literature. MADDPG addresses both problems at the same time [34].

IV. MULTI-AGENT DEEP DETERMINISTIC POLICY GRADIENT
In this section, we present the selection of the appropriate multi-agent actor-critic algorithm that takes into account the MADDPG is an actor-critic algorithm. This means that the architecture of each agent, or generation unit, is split into two. First, the actor directly estimates an action while secondly, the critic assesses the value of the action by estimating the actionvalue function Q π of the state-action pair. The Q π estimated by the critic is used by both the critic and the actor to learn how to behave in the environment. In MADDPG, the critics use central information to teach each actor the dynamics of the environment as well as the behaviour of the rest of the agents. In operation, actors only use local information because they have learnt how other actors will behave.
We describe the actor-critic algorithm for the BA model or Model I. This is the same for the detailed modelling of Model II presented in Section II-B, the only difference is that instead of the deviation from the centre of inertia the input to the actor and the critic is the deviation of the rotor speed from the synchronous speed of each generator ∆ω i . For the BA area model given in Section II-A we have each actor i that estimates ∆z i given the state of the environment ∆ω and its current z i . Each critic assesses each state-action pair defined by the environment and the actions of all the actors. The critic estimates each state-action action-value that is used during actor's training, as it can be seen in Fig. 2. We denote by ∆z −i (∆z −j ) the action predicted by all other actors besides i (j) and z −i (z −j ) the control action state of all other actors besides i (j).
Deep Recurrent Neural Networks, particularly Long Short-Term Memory Networks (LSTMs) [35], are used to model each actor and critic. LSTMs implement memory so that previous history is stored and acted upon [36]. In MDPs the Markov assumption states that the current state comprises all information needed to choose an action. However, in the frequency control problem the dynamics are quite complex and the Markov assumption may not hold. Thus, LSTMs help us correcting the violation of the Markov assumption.
The actor network, see Fig. 3, has as inputs ∆ω and z i and computes ∆z i . The critic network, depicted in Fig. 4, has as inputs the frequency state of the network ∆ω; the secondary control action z i ; the change in the action predicted by the actor associated to that critic ∆z i ; the secondary control action z −i ; and the change in the action predicted by all other actors ∆z −i . The critic network then computes the Q π (·) value of the state-action pair estimated by the actor associated with that critic. As seen in the respective figures, both networks have a 100-neuron LSTM that implements memory, and three more 1000, 100 and 50 fully-connected hidden layers. Generation rate constraints can be easily introduced in this neural network based approach. More specifically, the output ∆z i of each actor can be bounded by applying a non-linear function (e.g., sigmoid function, hyperbolic tangent, etc.) at the output of the network, so the agent has to learn that it cannot generate at an unrealistic rate.
The design of the reward function is critical, since it determines the behaviour that agents will learn. Ideally, the reward function incorporates two different components: (i) the frequency state of the environment to solve the primary and secondary problem; and (ii) the operational cost associated with the system to solve the tertiary control problem. Taking into account the frequency component in the reward function is straightforward since we set a higher reward for smaller frequency deviations. Next, we need to determine how to define the reward function in order to take into account the cost component. In this regard, we study the case where the cost functions of generators are of the form c i (P i ) = a i P 2 i + β i P i + γ i for i ∈ G [5]. The cost minimisation is part of the tertiary control in the hierarchical control setting; the formulation of which may be found in (12). For quadratic cost functions under no generation limits we may find the optimal solution in an analytical way [24]. The Lagrangian may be written as where λ is the dual variable of the power balance constraint. The necessary conditions for a minimum are The solution to the problem above defines the base point operation of tertiary control. We now define with the aid of participation factors how would a generator participate in a load change so that the new load is served in a cost efficient way. We start from a given base point λ 0 as found from (31). Assume the change in load is ∆P L ; the system incremental cost moves from λ 0 to λ 0 + ∆λ. For a small change in power output on unit i ∆P i we have  Thus we wish that each generator i changes its output so the following holds i.e., for each generator the change in the action ∆z i , i ∈ G we wish that Now we use these two conditions, i.e., frequency deviation and cost information, to determine the reward functions for each modelling approach.

A. Reward function -Model I
We construct two conditions that will be used in the formulation of the reward function. The first condition is: where 1 is some selected tolerance; this condition ensures that the reward function r will reward actions that help in frequency restoration. The second condition is: where 2 is some selected tolerance; this condition ensures that r will reward actions that follow the cost efficient path. When only the primary and secondary control problems need to be solved, the reward function may be formulated using C1 as where d is a constant. On the other hand, by taking these two conditions into account we may formulate a general form of the reward function to solve all levels of control from primary to tertiary as where ∧ is the logical and; ∨ is the logical or; d 1 , d 2 are constants with d 2 < d 1 . This reward function both helps in frequency restoration as well as performs the latter in a cost efficient way, since the critic values actions higher that if both purposes are met.

B. Reward function -Model II
In order to ensure frequency restoration, i.e., secondary control, we wish that |∆ω i | < for all i ∈ G . To this end, we formulate the reward function as follows: where ∧ is the logical and sign; d 1 , d 2 , . . . , d n are constants with d 1 < d 2 < d 3 < · · · < d n . This formulation ensures that the reward is higher when more generators' frequency deviation is smaller than a specified tolerance. In this work we have not performed the tertiary control for Model I, which is part of future work.
A flow diagram of the training process of determining the RL-based control for each agent is shown in Fig. 5.

V. NUMERICAL RESULTS
We validate the MARL methodology using three test systems. We formulate the reward function and present the results of the primary and secondary control problem for Model I and the detailed modelling of Model II taking into account the network effects. We also demonstrate the flexibility of the proposed methodology to incorporate generation rate constraints as well as its robustness against the uncertainty introduced due to wind-based generation. Next, we formulate the reward function and present the results of all levels of control for one single BA area using Model I. We demonstrate that the generators are able to restore the system frequency back to nominal and operate at a point close to optimal when a change in load occurs in a distributed way. We compare the results with an standard distributed optimal load frequency control alternative [15].

A. Secondary Control -Model I
The test case to validate the secondary control using Model I comprises of a group of eight generating units or agents that interact with a load; the parameters of the environment can be found in Table I   varies around a nominal set point randomly. The modification is indicated by P L ± ∆P L = 3 ± β pu, where β follows a uniform distribution. The reward function has been derived following (35) and is defined by: During operation, only the actors interact with the environment. They only observe local information about the frequency of the system and the control action that they are executing. As a consequence of the training phase, they know how to act according to the states of the environment in order to keep load and generation balanced. The validation of the training is tested by changing the load by 0.15 pu and observing how the generators modify their output. In Fig. 6a the cumulative reward obtained by the agents is depicted. The agents can obtain 1, 000 at maximum per episode, i.e., the maximum reward per step is 10 and the number of steps per episode is 100. The agents learn how to obtain higher rewards as the number of episodes increases, since if that was not the case the cumulative reward function would oscillate around small values near zero.
In Figs. 6b, 6c the centre of inertia speed and power response of an eight-generator system when a single load increases by 0.15 pu is depicted. However, as it can be seen in Fig. 6d, the solution may be unrealistic given that the operational cost component is neglected in this test case. As such, one agent learns to balance the entire system while the others have zero output. Further analysis on secondary control using Model I may be found in [21].
In order to highlight the ability of the proposed framework to implement generation rate constraints, we have used Model I to conduct numerical analyses of the response of a twoagent system, whose data can be found in Table II and is depicted in Fig. 7, when the generation rate of each unit is   bounded by different values. We modify the load by 0.15 pu and limit the output of each actor ∆z i to 0.1, 0.05, and 0.01 pu respectively by using a hyperbolic tangent function. As depicted in Fig. 8a, although the system manages to meet the new load, the generation rate constraints affect the elapsed time until the new steady state is reached. This can be also inferred by observing Fig. 8b, where the actual values of ∆z i are shown. When the generation rate is more constrained, i.e., the generators are allowed to modify their output in smaller increments, the system spends more time balancing generation and demand, as expected.
Next, we validate that the proposed framework is robust against the uncertainty introduced by wind-based generation, as described in Section II. We model a wind-based generator as a stochastic process with parameters α W1 = −0.002, α W2 = 0.01, β W1 = −0.5, and β W2 = −0.4,. We train the two-agent system of Table III to balance the load under such conditions. In Figs. 9a, 9b, it can be seen that the load is met and the frequency is close to nominal under the scenario that the windbased generation evolves randomly. More specifically, minor variations appear in the frequency response as the agents adapt to re-balance the load.

B. Secondary Control -Model II
Analogously, we have designed a test case to validate the performance of the proposed solution using the detailed Model II. The dynamic behaviour of two generators that are part of a BA area is explicitly taken into account as well as the network. The configuration of the system that has two loads, i.e., P L1 and P L2 , can be found in Fig. 10. The parameters of the environment can be found in Table III. In each training episode, each load varies around a nominal set point randomly. The modification of each load is indicated by P Li ± ∆P Li = 1.5 ± β pu, where β follows a uniform distribution.
The reward function has been derived following (37). Setting = 0.05 pu; d 1 = 100; and d 2 = 200. The reward function is formulated as follows: In Fig. 11a the cumulative reward obtained by the agents during training can be seen. Again, we notice that the agents are learning and have discovered how to obtain higher rewards. In this case, the agents learn how to jointly balance generation and demand.
Following the same schema, we change both loads by 0.15 pu and then we observe how the frequency and each generator output change. The rotor electrical angular velocity of each generator is restored as it may be seen in Fig. 11b. The generation output of the two generators may be seen in Figs 11c and 11d; Fig. 11c is a zoomed in version of Fig. 11d. In Fig. 11c, where the timescale is up to 100 s, we notice that the total power of the generators meets the new load; thus restoring frequency. However, the secondary control system sends signals to the generators to modify their output as seen in Figs 11d and 11e. The system frequency is nominal since even if the output of the two generators changes the summation of the output remains constant and equal to the new load.
We demonstrated that the proposed framework may be applied to solve primary and secondary control problems with the detailed modelling with Model II. This is achieved in a distributed way, i.e., without centralising any kind of information, the agents learn how to balance the system. Here, the agents learn that keeping ∆ω i close to 0 for all generators is associated with high rewards.

C. Tertiary Control -Model I
The test case designed to test the performance of all levels of load frequency control in a single BA area comprises of two generation units or agents that interact with a load whose configuration during training may be found in Fig. 7. The parameters of the environment can be found in Table II with cost functions for generator 1 c 1 = 2P 2 1 [£/pu] and generator 2 c 2 = P 2 2 [£/pu]. In each episode, or training simulation, the load varies randomly around a nominal set point. The load varies as P L ± ∆P L = 3 ± β pu, where β follows a uniform distribution.
The reward function has been derived following (36). We set 1 = 0.05 pu, 2 = 0.2 pu, d 1 = 200, and d 2 = 100. Thus we have the two conditions: C1 : |∆ω| < 0.05, Taking these two conditions into account we may formulate the reward function as The reward function is used only during the training period. In the operation phase, the actors interact with the environment without experiencing any reward. Agents only observe the frequency of the system and their own control action z i . They have learnt during training how to behave according to the evolution of the environment to balance supply and demand while minimising operational costs. For the operation phase, we change the load by 0.15 pu and then, we observe how the agents restore the system frequency.    We can observe in Fig. 12a the cumulative reward obtained by the agents. The agents can obtain 20, 000 at maximum per episode, i.e., the maximum reward per step is 200 and the number of steps per episode is 100. The agents learn how to obtain higher rewards as the number of episodes increases. If that was not the case the cumulative reward function would oscillate around small values near zero.
In Figs 12b, 12d we see how the agents restore the frequency to the nominal set point, thus balancing supply and demand. In  Fig 12c the rate of change in frequency (RoCoF) that measures the dynamic performance of the system is depicted. The maximum, minimum and mean RoCoF values are 0.607, −0.712 and 0.002 respectively, thus being within the admissible limits of 1 Hz/s recommended by ENTSO-E [37]. Actors learn how to balance generation and demand without exchanging information. The agents have learnt that keeping ∆ω close to 0 is the key to obtain high rewards. Thus, the agents may perform primary and secondary control in a totally distributed manner.
In order to test the optimality of the solution provided by the proposed approach in terms of cost, we need to calculate the optimal point when the load in the system is P L = 3 + ∆P L = 3.15 pu for the cost functions given in Table II. By solving the economic dispatch problem as given in (12) we have P 1 = 1.05 pu and P 2 = 2.10 pu. In Figs 12e, 12f, the behaviour of each generator output and its associated cost are depicted. It can be observed that the agents operate near the optimal solution, which is that the generation of generator 2 is twice that of generator 1. As seen in Fig. 12e, the control action of agent 1 stabilises around a set point that is approximately half of the control action of agent 2. This does not coincide with the optimal solution (slightly above half the production, i.e., 60%), but through the training process the agents learn how to keep load and supply balanced in a fully distributed cost efficient way. The performance of the agents is determined by what actions they learn during training that lead to high rewards. Thus, the reward function is the main tool to show each agent what is the optimal action. The reward function defined in (38) builds a reward combining two different dimensions: cost and frequency. This means that the reward function can show various maxima depending on the combination of both reward dimensions. The  agents learn by trial and error a behavioural heuristic to obtain high rewards, but they may converge to a local optimum that may be different from the global one. An improvement of the reward function (38) could help the agents to improve the results showed here and to get closer to the optimal solution.
We compare the proposed framework with [15], neglecting the network effects so that the comparison is fair since in Model I we do not consider the network. In [15], a distributed load frequency control algorithm that restores system frequency in a cost effective way is presented. This is achieved by exchanging some information between the generators during the operation phase. The algorithm is based on a partial primaldual gradient scheme to solve the optimal load frequency control problem, this type of dual-based approach is fairly common in the literature of the minimum cost load frequency control problem. We refer to this algorithm as benchmark algorithm. In Figs. 13a, 13b it can be seen that the benchmark algorithm manages to balance generation and demand, although it converges slightly slower than the proposed approach. In Figs. 13c, 13d the secondary control action and cost of each  agent are shown. The response of the benchmark algorithm is smoother than the proposed approach and the generation cost is minimised. However, this solution still needs to share dual information across units. On the other hand, although there are no optimality guarantees in the proposed framework, the results show that a sub-optimal solution is reached, it is fully distributed, i.e., no information is shared between the agents, and they only use local information.
We also run a numerical experiment implementing a more realistic scenario, where an initial load increase of 0.15 pu is followed by a continuous change in the load sampled from a uniform distribution defined in the [−0.1, 0.1] interval. We can observe in Figs. 14a, 14c that the agents manage themselves to keep generation and demand balanced, although the load is continuously changing. The dynamic behaviour of the system is depicted in Fig. 14b, where it can be seen that the RoCoF does not go beyond of the admissible 1 Hz/s bound (maximum, minimum and mean RoCoF 0.595, −0.708 and 0.002 respectively). Interestingly, it is shown in Figs. 14d, 14e that the agents keep generating in a close-to-optimal ratio  despite the continuous change in the load that increases the difficulty of this task.
In the numerical studies we showed that load frequency control may be performed efficiently in a distributed manner. More specifically, we showed that instead of solving the economic dispatch to obtain the optimal operating point, the MARL framework may be used to infer the production costs and the necessity of balancing demand and supply from the reward function and enclose this information in the behaviour of the actors. The benefit of the proposed approach is that these agents can act in real time in a distributed way. Once trained, they do not need to centralise information at all. Dynamics are embedded in the agents that only use local information.

VI. CONCLUDING REMARKS
In this paper, we proposed an MARL alternative to implement load frequency control in a distributed cost efficient way. To this end, we expressed the load frequency control problem in an MARL setup and designed the reward functions based on insights on the economic dispatch problem. We chose an appropriate algorithm, i.e., MADDPG, to solve this problem. Through the numerical examples, we showed that the proposed framework performs load frequency control in a satisfactory way. In particular, we demonstrated that all levels of control are achieved using Model I, i.e., restored frequency to the nominal value in a cost efficient way; and that secondary control is performed under detailed modelling of Model II. Moreover, we showed that the proposed methodology can cope with generation rate constrains and uncertainty sources efficiently.
There are natural extensions of the work presented here. For instance, different elements of the MARL paradigm can be enhanced, i.e., the reward function, the LSTM architecture and the introduction of domain knowledge could be further analysed to come up with agents that are able to improve their performance. More specifically, other architectures such as gated recurrent units (GRU) could be used instead of a LSTM. An exhaustive search of the appropriate architecture, parameters and hyper-parameters are left to be explored in future work. Another obvious extension consists of adding the tertiary control layer to the network model. In our future studies, we also plan on studying the applicability and scalability of these techniques in more complex scenarios. In addition, we will investigate the performance of MADDPG to deal with different types of generation resources. We will report on these developments in future papers.