An earlier version of this article appears in the Proceedings of the 24th Australasian Joint Conference on Artificial Intelligence (AI2011)—Chao Yu, Minjie Zhang and Fenghui Ren. Coordinated Learning for Loosely Coupled Agents with Sparse Interactions, Lecture Notes in Artificial Intelligence, Springer, LNAI 7106, pp. 392–401, Perth, Australia, 2011.

Research Article

# Coordinated learning by exploiting sparse interaction in multiagent systems^{†}

Article first published online: 18 OCT 2012

DOI: 10.1002/cpe.2947

Copyright © 2012 John Wiley & Sons, Ltd.

Issue

## Concurrency and Computation: Practice and Experience

Volume 26, Issue 1, pages 51–70, January 2014

Additional Information

#### How to Cite

Yu, C., Zhang, M. and Ren, F. (2014), Coordinated learning by exploiting sparse interaction in multiagent systems. Concurrency Computat.: Pract. Exper., 26: 51–70. doi: 10.1002/cpe.2947


#### Publication History

- Issue published online: 12 DEC 2013
- Article first published online: 18 OCT 2012
- Manuscript Accepted: 23 SEP 2012
- Manuscript Revised: 18 SEP 2012
- Manuscript Received: 19 MAY 2012


### Keywords:

- multiagent learning;
- reinforcement learning;
- coordination;
- sparse interaction

### SUMMARY


Multiagent learning provides a promising paradigm to study how autonomous agents learn to achieve coordinated behavior in multiagent systems. In multiagent learning, the concurrency of multiple distributed learning processes makes the environment nonstationary for each individual learner. Developing an efficient learning approach to coordinate agents’ behavior in this dynamic environment is a difficult problem especially when agents do not know the domain structure and at the same time have only local observability of the environment. In this paper, a coordinated learning approach is proposed to enable agents to learn where and how to coordinate their behavior in loosely coupled multiagent systems where the sparse interactions of agents constrain coordination to some specific parts of the environment. In the proposed approach, an agent first collects statistical information to detect those states where coordination is most necessary by considering not only the potential contributions from all the domain states but also the direct causes of the miscoordination in a conflicting state. The agent then learns to coordinate its behavior with others through its local observability of the environment according to different scenarios of state transitions. To handle the uncertainties caused by agents’ local observability, an optimistic estimation mechanism is introduced to guide the learning process of the agents. Empirical studies show that the proposed approach can achieve a better performance by improving the average agent reward compared with an uncoordinated learning approach and by reducing the computational complexity significantly compared with a centralized learning approach.

### INTRODUCTION


Multiagent learning (MAL) is one of the most important issues in multiagent system (MAS) research [1] and has been applied to solve practical problems in a variety of domains ranging from robotics [2, 3], distributed control [4, 5], and resource management [6, 7] to automated trading [8]. MAL uses techniques and concepts coming from areas such as artificial intelligence, game theory, psychology, cognition, and sociology, and has attracted a great deal of interest in the research community in recent years [9-12]. In MAL, the distributed learning processes are carried out concurrently so that an agent's decision making can be influenced by other agents’ decisions. These interdependencies between agents make the learning environment nonstationary for the individual learner. Developing an efficient learning approach to coordinate agents’ behavior in such a dynamic environment is a challenging issue because of three main factors: (i) decentralization: practical problems are decentralized in nature, and there is no central controller to control the whole learning process, while at the same time communication and computation capabilities of agents are highly restricted; (ii) computational complexity: the information that an agent keeps during its learning process increases dramatically with the complexity of the problem (e.g., number of agents and continuity of states); and (iii) local observability: an agent usually has only local observability of the environment and thus cannot receive full state and action information from other agents to guide its own decision-making process.

This paper focuses on solving MAL problems in loosely coupled MASs where sparse interactions of agents mean that agents only need to coordinate with each other under some particular circumstances [13]. This kind of MAS has a wide range of applications in practical problems. One example might be that of two robots fighting a fire in a building with a doorway connecting two rooms. We assume the doorway is too narrow to let both robots pass at the same time. In this case, most of the time the robots can move around independently according to their own decisions. It is only when both robots come near the doorway that they need to coordinate their behavior. In this type of MAS, coordinated behavior can be confined to certain specific parts of the environment, for example, the area around the doorway. Agents should learn from experience to determine the parts where coordination is most beneficial and how to coordinate their behavior once these parts are determined. However, because of the uncertainties caused by agents’ local observability and the dynamics of concurrent distributed learning processes, contriving an efficient coordinated learning (CL) approach by exploiting sparse interactions of agents is still a challenging research issue in MAL.

In recent years, a number of approaches have been developed to deal with coordinated learning by exploiting sparse interactions of agents, using techniques such as coordination graphs [14, 15], statistical learning [16, 14], and learning automata [17]. Most of these approaches, however, require certain assumptions: (i) the specific situations where coordination is necessary must be predefined, and it is assumed that agents know these situations beforehand [15, 18]; (ii) each agent must have certain prior knowledge, for example, an individual optimal policy [16]; and (iii) each agent must have full observability of the joint states and/or joint actions of other agents [14, 15, 17]. These assumptions cannot always be met in practice, and this means that these approaches are limited in real-world applications.

To overcome the current limitations of the existing approaches, this paper provides a MAL approach that enables agents to learn where and how to coordinate their behavior on the basis of each agent's local observability of the environment. In this approach, an agent first learns independently in an environment using a single-agent learning method and keeps statistical information of the rewards so as to discover the states where coordinated behavior is most needed. After that, the agent learns to coordinate its behavior with others through its local observability of the environment according to different scenarios of the transitions between states. To deal with the uncertainties caused by agents’ local observability, an optimistic estimation mechanism is introduced to guide the agents’ learning process. This approach can capture the feature of sparse interactions between agents and thus decompose the learning process into subprocesses efficiently to reduce the high complexity of decision making in MAL. Agents using the proposed approach do not need to have any prior knowledge of the domain structure and assumptions of individual optimal policy and global observability of the environment. Empirical studies show that this approach can achieve a better performance by improving average agent rewards compared with a learning approach that does not consider agent interactions and by reducing the computational complexity significantly compared with a centralized learning approach.

The remainder of this paper is organized as follows. Section 2 gives a brief overview of some related work. Section 3 describes the general theoretical models in MAL, which are the foundation of this paper. Section 4 formally gives the problem description and definitions. Section 5 presents the proposed approach in detail, and Section 6 shows the experimental results and analysis. Finally, Section 7 concludes the paper and lays out some directions for future research.

### RELATED WORK


Much attention has been paid to coordinated MAL in recent MAS research. A number of approaches have been proposed to exploit the structural or networked dependence of agents for efficient CL. Ghavamzadeh *et al*. [19, 20] proposed hierarchical MAL approaches to solve agent coordination problems where an overall task can be subdivided into a hierarchy of subtasks, each of which is restricted to the states and actions relevant to some particular agents. Coordination graph-based approaches [21, 15] took advantage of an additive decomposition of the joint reward function into local value functions, each of which is related only to a small number of neighboring agents, to efficiently construct jointly optimal policies. Schneider *et al.* [22] proposed a learning approach based on distributed value functions to take advantage of the dependence between networked agents so that each agent can learn a value function to estimate a weighted sum of future rewards of all the agents in a network. Zhang and Lesser [23] proposed a coordinated MAL approach to distribute the learning by exploiting structured interactions in *networked distributed partially observable Markov decision processes (ND-POMDP)*. All the aforementioned approaches, however, focused on solving CL when an explicit representation of agent independence is given beforehand. The agent independence is represented either through a predefined decomposition of the main task or a fixed interaction network topology. Our work differs from these approaches in that no explicit representation of the agent independence is assumed so that agents should build such a representation from experience. In our work, this representation of agent independence is indicated by the coordinated states detected through a statistical learning process.

Many approaches have also been proposed for an efficient CL by exploiting sparse interactions between agents in a fully cooperative environment. Kok and Vlassis [18] proposed an approach called *sparse tabular Q-learning* to learn joint-action values on those coordinated states where coordination is beneficial. The action space is reduced significantly because agents can learn individually without taking into account the other agents in most situations but only need to conduct CL in the joint-action space when dependence between agents exists. These coordinated states, however, need to be specified beforehand, and agents are assumed to have prior knowledge of these states. In later work, Kok *et al*. further extended their approach to enable the agents to coordinate their actions when there are more complicated dependence between agents [15], and used statistical information about the obtained rewards to learn this dependence [14]. All these studies are based on the framework of *collaborative multiagent MDP* [24], in which agents make decisions by using the joint-state information. This full observability of domain state is not assumed in our work as the problems studied here are framed as decentralized MDPs (Dec-MDPs), where each agent has only local observability of the environment; as a result, the joint state cannot always be determined through an agent's local observation.

In recent years, a number of approaches have been developed with the aim of learning when coordination is beneficial in loosely coupled MASs (particularly in robot navigation problems). De Hauwere *et al*. [17] proposed a solution to CL problems called *2observe*. This approach decouples a MAL process into two separate layers. One layer learns where it is necessary to observe the other agent, and the other layer adopts a corresponding learning technique to avoid conflicts. In their work, the first layer uses a *generalized learning automaton* (GLA) to decide whether agents should act cooperatively. The GLA receives the Manhattan distance between two agents as an input, and on the basis of this distance and the rewards, agents learn when coordination would be beneficial. Then, in the second layer, a random action selection mechanism is performed to decrease the probability of collision in the coordinated states. This approach, however, implies that the environment has to inform the GLA of the Manhattan distance between the two agents. This means that each agent's full observability of the environment is assumed. This assumption, which is undesirable in many systems, is not required in our approach. Furthermore, the coordination mechanism in 2observe is too simple to reflect the complex interactions between agents. In our approach, however, agents reduce the conflicts in coordinated states through a CL process so that coordinated behavior can be achieved autonomously. Spaan and Melo [25] introduced a model for solving the coordination problem in loosely coupled MASs called *interaction-driven Markov games* (IDMG). In IDMG, the states where agents should coordinate with each other are specified in advance, and a fully cooperative Markov game is defined in these coordinated states such that agents can compute the game structure and Nash equilibria to choose their actions accordingly.
IDMG is based on the game-theory solution to resolve the planning problem that requires the computation of multiple equilibria, which is computationally demanding. Later, Melo and Veloso [26, 13] proposed a two-layer extension of the Q-learning algorithm to enable agents to learn where coordination is beneficial by augmenting the action space with a pseudo-coordination action. In their approach, agents are able to learn a trade-off between the benefits arising from good coordination and the cost of choosing the pseudo-coordination action. As a result, agents can learn to use the pseudo-coordination action in states only when it is necessary. The learning performance in [26, 13], however, can be affected by the cost of choosing the pseudo-coordination action. In [16], an algorithm called *CQ-learning* was proposed to enable agents to adapt the state representation in order to coordinate with other agents. CQ-learning, however, depends on the assumption that each agent has already had an optimal individual policy so that every agent can have a model of its expected rewards. In this way, those states where the expected rewards differed significantly from the observed rewards are marked as dangerous states in which the other agent's state information should be considered for decision making. Our approach does not require such an assumption of individually learnt optimal policy, and enables agents to learn the circumstances when coordination is necessary through a statistical learning process, and to learn coordinated behavior after these circumstances are determined.

### GENERAL THEORETICAL MODELS IN MULTIAGENT LEARNING


In this section, we review some of the general theoretical models in MAL. We start with the single-agent MDP model before moving on to the extended multiagent decentralized MDPs model. We clarify some fundamental concepts and build the notations for further description.

#### Markov decision process

An MDP describes a single-agent sequential decision-making problem in which an agent must choose an action at every time step to maximize a reward-based function [27]. Formally, an MDP can be defined by a four-tuple *M* = (*S*,*A*,*P*,*R*), where *S* is a set of states representing the finite state space, *A* is the set of actions available to the agent, *P* : *S* × *A* × *S* → [0,1] is the Markovian transition function, and *R* : *S* × *A* → ℝ is a reward function that returns the immediate reward *R*(*s*,*a*) to the agent after taking action *a*, resulting in a transition from state *s* to state *s* ′ according to the transition function *P*(*s*,*a*,*s* ′ ). An agent's *policy π* : *S* × *A* → [0,1] maps a state *s* ∈ *S* to an action *a* ∈ *A* that the agent will take.

The goal of an agent in an MDP model is to learn a policy *π* so as to maximize the expected discounted reward *V* ^{π}(*s*) for each state *s* ∈ *S*.

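$$V^{\pi}(s) = E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R\big(s_t, \pi(s_t)\big) \,\middle|\, s_0 = s\right] \tag{1}$$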

where *E*_{π} is the expectation under policy *π*, *s*_{t} is the state at time *t*, and *γ* ∈ [0,1) is a discount factor.

For any finite MDP, there is at least one *optimal policy π*^{ * }, such that *V* ^{π * }(*s*) ≥ *V* ^{π}(*s*) for every policy *π* and every state *s* ∈ *S*. To involve the action information, we can use the Q value to represent the value of each state-action pair, given as follows:

$$Q^{*}(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s,a,s') \max_{a' \in A} Q^{*}(s',a') \tag{2}$$
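When the transition and reward functions are known, the optimal Q values referred to in Equation (2) can be computed by repeatedly applying the Bellman optimality backup. The following is a minimal sketch of that dynamic programming idea, not a method from this paper. The function name `q_value_iteration`, the dictionary-based encoding of *P*, and the two-state example MDP are all assumptions made for illustration.

```python
def q_value_iteration(P, R, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Compute Q* for a known finite MDP by repeated Bellman backups.

    P[s][a] is a dict {next_state: probability}; R[s][a] is the immediate reward.
    """
    n_states = len(P)
    Q = [[0.0 for _ in P[s]] for s in range(n_states)]
    for _ in range(max_iter):
        delta = 0.0
        Q_new = []
        for s in range(n_states):
            row = []
            for a in range(len(P[s])):
                # R(s,a) + gamma * sum_s' P(s,a,s') * max_a' Q(s',a')
                backup = R[s][a] + gamma * sum(
                    p * max(Q[s2]) for s2, p in P[s][a].items()
                )
                delta = max(delta, abs(backup - Q[s][a]))
                row.append(backup)
            Q_new.append(row)
        Q = Q_new
        if delta < tol:
            break
    return Q


# Hypothetical MDP: state 1 is absorbing and pays 1 per step;
# in state 0, action 0 stays put and action 1 moves to state 1.
P = [
    [{0: 1.0}, {1: 1.0}],
    [{1: 1.0}, {1: 1.0}],
]
R = [
    [0.0, 0.0],
    [1.0, 1.0],
]
Q_star = q_value_iteration(P, R, gamma=0.9)
```

In this toy example the absorbing state is worth 1/(1 − 0.9) = 10, so moving there from state 0 is worth 0.9 × 10 = 9, and staying in state 0 is worth 0.9 × 9 = 8.1.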

An MDP problem can be solved by using linear programming or dynamic programming techniques if an agent fully knows the environment structure in terms of the reward and transition functions. When these functions are unknown to the agent, the problem can be solved by using *reinforcement learning* (RL) [28] methods in which an agent learns through trial-and-error interactions with the environment. One of the most important and widely used RL approaches is Q-learning [29], which is an off-policy model-free temporal difference control approach. Its one-step updating rule is given by Equation (3), where *α* ∈ (0,1] is the learning rate.

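$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[R(s,a) + \gamma \max_{a' \in A} Q(s',a') - Q(s,a)\right] \tag{3}$$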

The Q value of every state-action pair is stored in a table for a discrete state-action space. It has been proved that this tabular Q-learning converges to the optimal *Q*^{ * }(*s*,*a*) w.p.1 when all state-action pairs are visited infinitely often, and an appropriate exploration strategy and learning rate are chosen [29].
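The one-step Q-learning update can be illustrated with a small tabular learner. This is a sketch under stated assumptions rather than the configuration used in this paper: the five-state chain environment, the function names `step` and `q_learning`, the ε-greedy exploration, and all parameter values are hypothetical choices for illustration.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = (0, 1)  # 0 = move left, 1 = move right (hypothetical chain world)


def step(state, action):
    """Deterministic chain dynamics; reaching GOAL pays 1 and ends the episode."""
    nxt = max(state - 1, 0) if action == 0 else state + 1
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL


def q_learning(episodes=1000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _episode in range(episodes):
        s = 0
        for _t in range(100):  # cap on episode length
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)  # explore
            else:
                best = max(Q[s])         # exploit, breaking ties at random
                a = rng.choice([b for b in ACTIONS if Q[s][b] == best])
            s2, r, done = step(s, a)
            # One-step temporal difference target; no bootstrap from a terminal state.
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
            if done:
                break
    return Q


Q = q_learning()
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
```

After training, the greedy action in every non-goal state of this toy chain is to move right, and the table uses only |*S*| × |*A*| entries, which is what makes the single-agent case cheap compared with learning over joint states and actions.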

#### Decentralized Markov decision processes

A Dec-MDPs [30] model is an extension of the aforementioned single-agent MDP model to allow decentralized decision making by multiple agents. In Dec-MDPs, at each step, each agent receives a local observation and subsequently chooses an action. The state transitions and rewards depend on the actions of all the agents. More formally, a Dec-MDPs model can be defined by a tuple *N* = ({*α*_{i}},*JS*,{*A*_{i}},*P*,*R*), where each *α*_{i} (*i* ∈ [1,*n*]) represents an agent, *JS* is a finite set of joint states of all agents, *A*_{i} (*i* ∈ [1,*n*]) is a finite set of actions available to agent *α*_{i}, *P*(*js*,*ja*,*js* ′ ) represents the transition probability from state *js* to state *js* ′ when the joint action *ja* = ⟨*a*_{1},…,*a*_{n}⟩ (*a*_{i} ∈ *A*_{i}) is taken, and *R*(*js*,*ja*) represents the reward received by the agents for taking the joint action *ja* in state *js*. The joint action *ja* should be in space *JA*, where *JA* = *A*_{1} × … × *A*_{n} is the set of all possible joint actions. We write *a*_{ − i} = ⟨*a*_{1},…,*a*_{i − 1},*a*_{i + 1},…,*a*_{n}⟩ to denote the *reduced action* of agent *α*_{i}; thus, the joint action *ja* can be simply represented as *ja* = ⟨*a*_{i},*a*_{ − i}⟩.
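The joint action space and the reduced action can be made concrete with a few lines of code. The sketch below is illustrative only; the helper names `joint_action_space`, `reduced_action`, and `compose`, as well as the example action sets, are assumptions. Agents are indexed from 0 here, whereas the text indexes them from 1.

```python
from itertools import product


def joint_action_space(action_sets):
    """All joint actions: the Cartesian product of the individual action sets."""
    return list(product(*action_sets))


def reduced_action(ja, i):
    """The joint action with agent i's own component removed."""
    return ja[:i] + ja[i + 1:]


def compose(a_i, a_minus_i, i):
    """Rebuild the joint action from agent i's action and the reduced action."""
    return a_minus_i[:i] + (a_i,) + a_minus_i[i:]


# Hypothetical action sets for three agents.
A = [("N", "S"), ("E", "W"), ("go", "stay")]
JA = joint_action_space(A)

ja = ("N", "W", "stay")
a_minus_1 = reduced_action(ja, 1)        # drop the second agent's action
rebuilt = compose("W", a_minus_1, 1)     # put it back
```

The size of the joint action space is the product of the individual set sizes (2 × 2 × 2 = 8 in this example), so it grows exponentially with the number of agents.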

Throughout this study, we focus on a more specialized version of the Dec-MDPs model, that is, *factored Dec-MDPs* [31, 32], in which the system state can be factored into *n* + 1 distinct individual components so that *JS* = *S*_{0} × *S*_{1} × … × *S*_{n}, where *S*_{0} denotes an agent-independent component of the state (i.e., assumed common knowledge among all agents), and *S*_{i} (*i* ∈ [1,*n*]) is the state space of agent *α*_{i}. For each agent, its local state is defined as the pair ⟨*s*_{0},*s*_{i}⟩, where *s*_{i} ∈ *S*_{i} represents the portion specific to agent *α*_{i} and *s*_{0} ∈ *S*_{0} represents the portion shared among all agents. In the remainder of the paper, unless otherwise specified, we also directly use *s*_{i} to denote the local state of agent *α*_{i} by omitting *s*_{0} to simplify notation.

The Dec-MDPs model is a particular case of the more general *decentralized partially observable MDP* model [33], where every agent has *partial observability* so that the agent can only determine its local state ambiguously through its local observation. In Dec-MDPs, however, an agent can have *local full observability* [13], which means that each agent can infer the corresponding local state unambiguously from its local observations. In other words, all the agents in Dec-MDPs, together, have *joint full observability* or *collective observability* [34], which means that each agent observes a part of the state, and the combined observations of all agents uniquely identify the overall state. This is in contrast with *multiagent MDPs* [35, 36], where each agent already has *individual full observability* [13] so that the individual observation of an agent can allow the agent to recover the overall state unambiguously. It has been shown that the decision complexity for a finite-horizon Dec-MDPs problem is NEXP-complete, even in the two-agent case, versus the P-completeness of a single-agent MDP and of multiagent MDPs [30]. This high complexity is attributed to the decentralized decision-making process in Dec-MDPs, where agents are coupled with a shared transition and reward function, but no individual agent can access the global state information because of the agent's limited observability of the environment.

### PROBLEM DESCRIPTION AND DEFINITIONS


This section gives a brief description of the robot navigation problem that this paper aims to solve. Some important concepts, that is, agent independence and coordinated states, are then formally defined for the purpose of introducing our approach.

#### Robot navigation problem

Figure 1 illustrates three very simple domains in which two robots are navigating in an environment, each trying to reach its own goal. In this figure, *R*_{1}, *R*_{2} represent two robots and *G*_{1}, *G*_{2} are their goals, respectively (in Figure 1(a), *G*_{1} and *G*_{2} are in the same grid cell, denoted by *G*). This kind of MAS can be modeled as Dec-MDPs, in which each agent has to make a decision to optimize a global value function through its individual observation. Because of its local observability of the environment, an agent cannot observe other agents’ information to determine the overall state of the Dec-MDPs when these agents are beyond the agent's observability limit. Learning to achieve an efficient coordinated policy in this kind of domain is a challenging problem. In recent years, many approaches have been proposed to solve this problem under different assumptions and with different emphases [13, 16, 26, 17, 25].

In this kind of MASs, agents are loosely coupled, and sparse interactions of agents make coordination *local*, which means that, in general, each agent can make its own decision without regard to other agents’ state and/or action, but in certain specific situations (i.e., some local parts of the environment), the agents are tightly coupled and hence must coordinate with each other to achieve a better performance. In such a case, it is possible to decompose the learning problem into two distinct subproblems. One is to let each agent learn its own optimal policy and completely disregard the existence of the other agent. Each agent thus can be modeled independently as an MDP learner in this situation, and the policy of each agent can be learnt by using a single-agent learning approach (e.g., standard Q-learning). The other is to let the agent coordinate its behavior with other agents when interactions (in terms of coordination) are necessary. This occurs when an agent's transition function and/or reward function may depend on other agents under some particular circumstances. A typical example is when multiple agents attempt to simultaneously use a common resource such as space, tools, or shared communication channels [25]. As shown in Figure 1, if both robots choose their optimal policies to achieve their individual goals and cross the doorway simultaneously, they may crash into each other and get stuck there. In many applications, this kind of conflict may prevent agents (robots) from achieving their goals, and in some domains such as agent-based disaster management and emergency rescue systems, such conflicts must be avoided. In this situation, the robots must learn to coordinate their behavior when they are in an area of potential conflicts (possible areas are exemplified by the shadowy states in Figure 1).
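The idea that interaction is confined to a small part of the environment can be sketched with a deliberately simplified doorway model. This is not the grid domain of Figure 1: each robot is assumed to walk along its own path of five cells, only the middle cell (the doorway) is shared, and the penalty of −1 for a conflict is an arbitrary choice. All names (`local_step`, `joint_step`, `run`, `selfish`, `polite`) are hypothetical.

```python
DOOR, GOAL = 2, 4
ACTIONS = ("wait", "forward")


def local_step(s, a):
    """Individual dynamics of one robot, ignoring the other robot."""
    return min(s + 1, GOAL) if a == "forward" else s


def joint_step(js, ja):
    """Joint dynamics: identical to the local ones except at the doorway."""
    nxt = tuple(local_step(s, a) for s, a in zip(js, ja))
    rewards = [1.0 if n == GOAL and s != GOAL else 0.0 for s, n in zip(js, nxt)]
    if nxt[0] == DOOR and nxt[1] == DOOR:
        # Conflict: both robots would occupy the doorway, so nobody moves
        # and both are penalised.
        nxt, rewards = js, [-1.0, -1.0]
    return nxt, rewards


def run(joint_policy, horizon=10):
    """Roll out a joint policy from the start state and sum each robot's reward."""
    js, totals = (0, 0), [0.0, 0.0]
    for _ in range(horizon):
        js, rewards = joint_step(js, joint_policy(js))
        totals = [t + r for t, r in zip(totals, rewards)]
    return js, totals


def selfish(js):
    """Each robot follows its individually optimal policy."""
    return ("forward", "forward")


def polite(js):
    """Robot 2 yields only when both robots are about to enter the doorway."""
    if js == (DOOR - 1, DOOR - 1):
        return ("forward", "wait")
    return ("forward", "forward")


stuck = run(selfish)
coordinated = run(polite)
```

With the individually optimal policies both robots block each other in front of the doorway for the rest of the episode, whereas deviating from the individual policy in a single joint state is enough for both to reach their goals. Everywhere else the joint dynamics coincide with the local ones, which is the structure a learner can exploit.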

Learning to identify the areas of such potential conflicts so that a coordinated policy can be achieved in these areas is not a trivial task. This is because all the agents are learning concurrently so that the environment is nonstationary from the perspective of each agent. Because of this nonstationarity, it is difficult for an agent to build an explicit representation of the environment. This difficulty is made worse when each agent usually has only local observability of the environment. For this reason, a number of approaches have been put forward to simplify robot navigation problems by predefining the conflicting areas to the robots [25], or by imposing some assumptions on each robot, either its individual optimal policy [16] or full observability of the environment [17]. This paper, however, aims to propose an approach that enables the robots to learn a coordinated policy based on limited observability of the environment and without any prior knowledge about either the domain structure or the robot itself. Before introducing our approach, it is necessary to give a formal description of several important concepts to help understand how a general Dec-MDPs problem can be decomposed efficiently into subproblems by exploiting the sparse interactions (i.e., independence) of agents.

#### Formalization of agent independence and coordinated states

It has been recognized that decision making in a Dec-MDPs model is generally computationally prohibitive [30]. To reduce the high complexity, one approach is to identify subclasses of the Dec-MDPs model by making reasonable assumptions about the model. As observed in many practical multiagent problems (e.g., the robot navigation problem studied here), the state transitions and rewards usually involve certain dependence so as to allow a more compact problem representation and lower computational complexity. In [37, 38], the authors defined a specific subclass model of Dec-MDPs, that is, transition-independent Dec-MDPs, in which the overall transition function *P* can be separated into *n* distinct individual transition functions *P*_{i} (*i* ∈ [1,*n*]). For any next state *s*_{i} ′ ∈ *S*_{i} of agent *α*_{i}, *P*_{i} is given by *P*_{i}(*s*_{i} ′ | *js*, *ja*, *s*_{ − i} ′ ) = *P*_{i}(*s*_{i} ′ | *s*_{i}, *a*_{i}). In other words, the next local state of each agent is independent of the local states of all other agents, given its previous local state and individual action. It has been shown that a transition-independent Dec-MDPs problem has an NP-complete complexity, which is a significant reduction from general Dec-MDPs [38, 37]. However, it is argued that not all multiagent domains are fully transition-independent [32]. Furthermore, because of the shared reward component, it is still a nontrivial task to solve a transition-independent Dec-MDPs problem.

Similarly, Dec-MDPs can also be reward-independent when the joint reward function *R* can be represented as a function of individual reward functions *R*_{1},…,*R*_{n} [5, 31]. Interestingly, it was recently shown that a reward-independent Dec-MDPs problem retains an NEXP-complete complexity [39, 31]. However, when combined with transition independence, reward independence implies that an *n*-agent Dec-MDPs problem can be decomposed into *n* independent MDP subproblems, each of which can be solved separately. The complexity of this class of problems thus reduces to that of standard MDPs (P-complete).
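Under full transition and reward independence, the joint model is built entirely from per-agent pieces. The sketch below illustrates this under two assumptions that go beyond the text: the joint transition probability is taken to be the product of the individual transition probabilities, and the joint reward is taken to be the sum of the individual rewards (one possible function of *R*_{1},…,*R*_{n}). The names `joint_transition` and `joint_reward` and the two-state example agent are hypothetical.

```python
from itertools import product
from math import prod


def joint_transition(P_ind, js, ja, js_next):
    """P(js' | js, ja) as a product of individual transition probabilities."""
    return prod(
        P_i[s][a].get(s2, 0.0)
        for P_i, s, a, s2 in zip(P_ind, js, ja, js_next)
    )


def joint_reward(R_ind, js, ja):
    """One possible reward-independent form: the sum of individual rewards."""
    return sum(R_i[s][a] for R_i, s, a in zip(R_ind, js, ja))


# Hypothetical two-state agent: action 0 stays, action 1 switches state w.p. 0.8.
P_agent = [
    [{0: 1.0}, {1: 0.8, 0: 0.2}],
    [{1: 1.0}, {0: 0.8, 1: 0.2}],
]
R_agent = [[0.0, -0.1], [1.0, 0.9]]

P_ind, R_ind = [P_agent, P_agent], [R_agent, R_agent]
js, ja = (0, 1), (1, 1)

p = joint_transition(P_ind, js, ja, (1, 0))
total = sum(
    joint_transition(P_ind, js, ja, nxt) for nxt in product(range(2), repeat=2)
)
r = joint_reward(R_ind, js, ja)
```

Because neither function ever looks at another agent's state or action when evaluating agent *i*'s factor, each agent's MDP can be solved on its own, for example with the single-agent methods of the previous section.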

Although assuming full transition and reward independence in a Dec-MDPs problem can significantly reduce the complexity of solving such a problem, this assumption is hardly feasible in practical applications because in many MASs, agents are coupled with each other in certain specific situations. Furthermore, because the individual functions of an agent are potentially affected by other agents through their interdependencies, an agent's local state and individual action usually cannot determine the individual functions fully, as is assumed in general transition-independent and reward-independent Dec-MDPs. To better reflect the feature of uncertainties in Dec-MDPs, we decompose the individual transition function *P*_{i} into two components: a local individual transition component, which depends only on agent *α*_{i}'s local state and individual action, and an interaction transition component *P*_{I}(*js*,*ja*,*js* ′ ), which depends on all the agents. Similarly, the individual reward function *R*_{i} is decomposed into a local individual reward component and an interaction reward component *R*^{I}(*js*,*ja*). On the basis of this factorization, we formally define the independence between agents as given by Definition 1.

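Definition 1. *Agent independence* An agent *α*_{i} is independent of the remaining agents *α*_{ − i} in a state *s*_{i} ∈ *S*_{i} if, for every joint state *js* = ⟨*s*_{i},*s*_{ − i}⟩ and every joint action *ja* = ⟨*a*_{i},*a*_{ − i}⟩, its individual transition function and its individual reward function are fully determined by their local components, so that the interaction components *P*_{I}(*js*,*ja*,*js* ′ ) and *R*^{I}(*js*,*ja*) make no contribution.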

As given in the definition, an agent *α*_{i} is independent of the remaining agents in a state if in this state, both individual transition and reward functions of agent *α*_{i} can be fully determined by its local information (i.e., local state and individual action *a*_{i}). This means that the dynamics of the learning environment caused by other learning agents has no influence on agent *α*_{i} such that agent *α*_{i} can make its own decision individually without regard to the existence of other agents. To be more explicit, let us look at the value function decomposed as follows:

Similarly, we can write the value function for a policy *π* as Equation (4).

$$V_i^{\pi}(js) = \underbrace{V_i^{\pi_i}(s_i)}_{\text{local component}} + \underbrace{V_I^{\pi}(js)}_{\text{interaction component}} \tag{4}$$

From Equation (4), it is clear that if an agent is independent of the other agents as defined by Definition 1, the interaction component vanishes; thus, the agent's individual value function, which depends on the joint policy *π* of all agents, can be fully determined by the agent's local component, which depends only on its individual policy *π*_{i}.

Agent dependence can be defined analogously to Definition 1: agent *α*_{i} is dependent on the remaining agents *α*_{ − i} in a state when the condition of Definition 1 does not hold in that state. On the basis of these definitions, we can formalize *uncoordinated states* and *coordinated states* as follows.

Definition 2. *Uncoordinated states* Uncoordinated states are a set of adjacent states where agent *α*_{i} can act independently, which can be defined as , where .

Definition 3. *Coordinated states* Coordinated states are a set of adjacent states where agent *α*_{i} needs to coordinate with other agents, which are defined as , where .

If it is considered that , an *n*-agent Dec-MDPs model is equivalent to a full transition-independent and reward-independent Dec-MDPs model as stated earlier. In this case, the Dec-MDPs model reduces to *n* independent MDPs, each of which can be solved independently. If instead , a Dec-MDPs model can be transformed into an MMDP [35, 36], which can be solved by assuming a centralized controller or agents’ global observability of the environment so that each agent can make a decision based on joint-state/action information of all the agents. In the former case, solving the Dec-MDPs problem has a low complexity (P-complete, as for an MDP) but does not capture the interactions between agents and thus may result in poor performance. In the latter case, an optimal learning result might be achieved, but the required assumptions are infeasible because an agent usually has limited observability of the environment and practical problems are decentralized in nature. A direct combination of these two extreme cases is to decompose the Dec-MDPs model into independent MDPs where agents are genuinely independent of each other and into an MMDP where agents must take each other into account to avoid uncoordinated behavior. In this way, a Dec-MDPs problem can be solved efficiently without considering all the state-action information during decision making, most of which is redundant as indicated by the context-specific independence.

### A COORDINATED LEARNING APPROACH

A CL approach is proposed in this section to solve robot navigation problems by capturing sparse interactions (i.e., local independencies) between these robots. The main idea of the approach is (i) to identify the coordinated states by collecting statistical information during agent learning; and (ii) to coordinate agents’ behavior based on each agent's local observability of the environment after the coordinated states are determined. The sketch of our learning approach is given by Algorithm 1, with details given in the following subsections.

#### Learning the coordinated states

When agents learn independently in an environment, they may have conflicts in any state during the learning process. Because the agents have no knowledge about the domain structure, which means they do not know in which states they must consider other agents for coordination, they need to learn these coordinated states from experience. In an RL setting, the only feedback from the environment is the reward. When an agent receives a severe penalty in a state, the environment is signaling that coordinated behavior is required in this potentially conflicting state. Otherwise, the agent can act independently according to its individual information. However, because agents explore the environment during learning, learning is a stochastic process, and a single conflict in a state is not sufficient to reveal the true structure of the environment. Nevertheless, from a statistical point of view, more frequent conflicts in a state indicate that this state is more likely to be one of the coordinated states. Furthermore, if agents conflict in a state, they are also likely to conflict in the neighboring states. The contributions of the neighboring states to the conflict in a state can be determined by the similarities between those states and the state where a conflict occurs.

On the basis of the considerations stated earlier, we choose the *kernel density estimation* (KDE) technique to collect statistical information to determine the coordinated states. The basis of KDE is the *kernel* function *F*, which satisfies $\int F(x)\,dx = 1$. When a variable (or an event) is observed, an estimate of the density is formed by centering the kernel at this observation, and the overall estimate is the sum of all the overlapping kernels. In our problem, an observation means that agents conflict with each other in a state. The state with the highest density is the location where coordination is most required. Let *s* be a local state of an agent, which can be represented by a grid cell in Figure 1. Let *P*_{s}(*x*_{s},*y*_{s}) represent the central point of state *s* with the coordinate (*x*_{s},*y*_{s}), and let *F*_{P}(*x*,*y*) be the kernel function centered at point *P*. The overall estimate *Den*_{s} for state *s* is calculated by summing up all the overlapping kernels given by . After the statistical collecting period, agents can determine the coordinated states *S*^{c} according to Algorithm 2.
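The density computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the two-dimensional standard normal kernel that the experiments later adopt, treats each observed conflict as one kernel centred on the cell where it occurred, and the names `gaussian_kernel`, `conflict_density`, and the 5 × 5 grid are hypothetical.

```python
import math

def gaussian_kernel(dx, dy):
    """Two-dimensional standard normal density at offset (dx, dy)."""
    return math.exp(-(dx * dx + dy * dy) / 2.0) / (2.0 * math.pi)

def conflict_density(states, conflicts):
    """Return {state: density}, summing one kernel per observed conflict.

    states    -- iterable of (x, y) cell centres
    conflicts -- list of (x, y) cell centres where a conflict was observed
                 (a repeated entry means a repeated conflict)
    """
    return {
        (x, y): sum(gaussian_kernel(x - cx, y - cy) for cx, cy in conflicts)
        for x, y in states
    }

# Hypothetical 5x5 grid with conflicts clustered around cell (2, 2).
grid = [(x, y) for x in range(5) for y in range(5)]
observed = [(2, 2), (2, 2), (2, 1), (2, 3), (2, 2)]
density = conflict_density(grid, observed)
hottest = max(density, key=density.get)  # state with the highest density
```

Because every conflict also contributes to nearby cells, neighbours of a frequently conflicting state receive a nonzero density even if no conflict was ever recorded in them.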

In Algorithm 2, *s*^{ * } is the state with the highest density and *R* is the scanning distance of the agents. As there may be more than one area of coordinated states, we let *i* be the index of each area of coordinated states . Agent *k* first determines possible areas of coordinated states by comparing the density of each domain state with the highest density. If this difference is greater than a threshold *δ*, the state is considered to be the central state of an area of the coordinated states (lines 2–4, where is the set of central states of corresponding areas in the coordinated states). Agent *k* then computes its individual coordinated states by including the states that lie within its scanning distance (line 10). If a candidate state in already belongs to a coordinated state area, this state is no longer considered when computing the individual coordinated states (line 11). However, not all the states in are the causes of a conflict in central state . As a simple example, suppose an agent transits from *s*_{1} and *s*_{2} to *s*^{ * }, causing the conflict in *s*^{ * }, and then transits from *s*^{ * } to *s*_{3} and *s*_{4}. It is obvious that *s*_{3} and *s*_{4}, which have the same densities as *s*_{1} and *s*_{2}, are not the causes of the conflict in *s*^{ * } and thus should be eliminated from . An elimination mechanism is applied to eliminate such states (line 13), which is given by Algorithm 3 in detail. Finally, the overall coordinated states are the union of the coordinated states of all the agents (line 14).
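Algorithm 2 itself is not reproduced in the text, so the following is only a rough Python sketch of the area-selection step as the prose describes it. Two points are assumptions rather than the authors' specification: the comparison with the highest density is implemented as a ratio test (density at least *δ* times the peak), and distance is measured as Manhattan distance on grid coordinates. The function names and the example densities are hypothetical.

```python
def manhattan(a, b):
    """Grid distance between two (x, y) cells (an assumed metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def coordinated_areas(density, delta, scan_radius):
    """Group states into candidate coordinated areas around central states."""
    peak = max(density.values())
    # Central states: density within a factor delta of the peak (assumed test).
    centres = sorted(
        (s for s, d in density.items() if d >= delta * peak),
        key=lambda s: -density[s],
    )
    areas, claimed = [], set()
    for centre in centres:
        if centre in claimed:
            continue  # already part of an earlier area
        # States within the scanning distance that no other area has claimed.
        area = {s for s in density
                if manhattan(s, centre) <= scan_radius and s not in claimed}
        claimed |= area
        areas.append((centre, area))
    return areas

# Hypothetical densities along a corridor with two separate hot spots.
example = {(0, 0): 1.0, (1, 0): 0.5, (2, 0): 0.1, (5, 0): 0.97, (6, 0): 0.4}
areas = coordinated_areas(example, delta=0.95, scan_radius=1)
```

The areas returned here are only candidates; the elimination step discussed next would still prune states that are not causes of the conflict.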

Algorithm 3 illustrates the process of the elimination mechanism. States in are sorted in descending order according to the density derived from the KDE process (line 1), which reflects how strongly coordination is needed in the corresponding state. For each sorted state , the agent determines whether its neighboring state is the cause of the conflict in state . This can be carried out by collecting historical information of the transitions between states and (lines 3–5). If the agent transits from the state to a neighboring state more often than the reverse (line 4, where represents the number of transitions from *s* to *s* ′ ), state is not the cause of conflict in and should be eliminated from (line 4). is a temporary set that stores the previously computed states in so that these states are not eliminated by a state with a lower density (line 6). In this way, the individual coordinated states can be computed by considering both the direct causes of the conflicts in a state and each state's different role in causing the conflicts.
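As a hedged illustration of the elimination idea (not a transcription of Algorithm 3, which is not reproduced in the text), the sketch below walks the candidate states from highest to lowest density and drops a neighbour when the agent moved from the current state to that neighbour more often than the other way round. The names `eliminate_non_causes`, `transitions`, and `neighbours`, and the toy counts, are hypothetical.

```python
def eliminate_non_causes(candidates, density, transitions, neighbours):
    """Drop states that are mostly reached *from* a denser state.

    candidates  -- set of states in one candidate coordinated area
    density     -- {state: KDE density}
    transitions -- {(s, s_next): count} observed during learning
    neighbours  -- function mapping a state to its adjacent states
    """
    kept = set(candidates)
    protected = set()  # processed states; a lower-density state cannot remove them
    for s in sorted(candidates, key=lambda st: -density[st]):
        if s not in kept:
            continue
        protected.add(s)
        for n in neighbours(s):
            if n not in kept or n in protected:
                continue
            out_of_s = transitions.get((s, n), 0)
            into_s = transitions.get((n, s), 0)
            if out_of_s > into_s:
                kept.discard(n)  # n is where the agent goes after s, not a cause
    return kept

# The example from the text: s1 and s2 lead into s*, s3 and s4 lead out of it.
adjacent = {"s*": ["s1", "s2", "s3", "s4"],
            "s1": ["s*"], "s2": ["s*"], "s3": ["s*"], "s4": ["s*"]}
dens = {"s*": 1.0, "s1": 0.5, "s2": 0.5, "s3": 0.5, "s4": 0.5}
counts = {("s1", "s*"): 10, ("s2", "s*"): 8, ("s*", "s3"): 9, ("s*", "s4"): 9}
causes = eliminate_non_causes(set(dens), dens, counts, adjacent.__getitem__)
```

In this toy run only `s*`, `s1`, and `s2` survive, which matches the informal argument that `s3` and `s4` are consequences rather than causes of the conflict.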

#### Learning of coordination

After determining coordinated states and uncoordinated states , agent *k* needs to learn how to coordinate its behavior with other agents to avoid potential conflicts. In this section, a Q-learning-based updating rule is proposed to guide agents’ learning processes for coordinated behavior. As described in Section 4, when an agent is in an uncoordinated state, its reward and transition functions are independent of other agents so that the agent can learn independently according to its own policy. The single-agent Q-learning approach is applied to update the Q value in this case. Situations become more complicated when an agent comes to a coordinated state where the agent's reward and transition functions are tightly coupled with other agents. Learning of coordination in this situation is difficult when the agent only has local observability of the whole environment. Here, we assume that each agent has *limited full observability*, that is, Distance(*s*_{i},*s*_{j}) *⩽ R* : *P*[*S*(*t*) = *s* | *O*_{i}(*t*) = *o*_{i}] = 1, where is the overall domain state determined by agent *i* and agent *j*, *o*_{i} is the individual observation of agent *i*, and *R* is the agent's scanning distance. This means only when agent *j* is in the scanning distance of agent *i* (i.e., Distance(*s*_{i},*s*_{j}) *⩽ R*) can agent *i* recover the overall state unambiguously through its individual observation. The perception process can be carried out either through the agent's limited observing capability of the environment or by using an explicit communication. For example, in robot navigation problems, one robot can rely on its sensor to localize the other robot or just send a message to require the other robot to divulge its location [13]. Communication is assumed to be unlimited and noise-free in the coordinated states. This assumption is reasonable because the coordinated states are determined by the scanning distance of the agent as stated in Section 5.1. 
Furthermore, assuming communication to be confined to local parts of the environment is common in practical applications, for example, in robot navigation domains where two robots equipped with wireless communication devices are able to communicate only when they are spatially close to each other. Our approach disregards the interactions between agents in the uncoordinated states and enables agents to communicate locally in the coordinated states. In this way, the computational complexity of solving a Dec-MDPs problem decreases dramatically because, owing to the sparse interactions of agents, the coordinated states usually account for a small proportion of the whole state space. As a result, the demand for communication is greatly reduced, and agents can achieve a good learning performance while keeping minimal information during the learning process.
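The limited full observability assumption can be pictured with a very small Python sketch: an agent recovers the joint state only when the other agent is within its scanning distance *R*. The distance measure is not specified in the text, so Manhattan distance on grid coordinates is an assumption here, and `observed_joint_state` is a hypothetical helper name.

```python
def observed_joint_state(own_state, other_state, scan_radius):
    """Return the joint state if the other agent is within range, else None."""
    distance = (abs(own_state[0] - other_state[0])
                + abs(own_state[1] - other_state[1]))  # assumed Manhattan metric
    if distance <= scan_radius:
        return (own_state, other_state)  # joint state recovered unambiguously
    return None  # the other agent is unobserved; only local information remains

near = observed_joint_state((1, 1), (2, 2), scan_radius=2)
far = observed_joint_state((0, 0), (3, 0), scan_radius=2)
```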

At the beginning of learning, each agent maintains a single-state-action Q-value table denoted by *Q*_{k}(*s*_{k},*a*_{k}) (*s*_{k} ∈ *S*_{k}) for all states, where *s*_{k} is agent *k*'s local state and *a*_{k} is its individual action. After the coordinated states have been determined, a joint-state-action Q-value table for the coordinated states is created by combining all the Q-value information from the single-agent learning process. Suppose there are a total of *N* agents in the environment. Let and *A*_{k} be the coordinated state space and the action space of agent *k*, respectively. The joint-state space of all agents in coordinated states can be given by , and the joint-action space of all agents is . The joint Q value *Q*_{c}(*js*^{c},*ja*^{c}) can be initialized by summing up the single Q values *Q*_{k}(*s*_{k},*a*_{k}) (*s*_{k} ∈ *S*^{c}) of each agent, as given by Equation (5), where *s*_{k} ∈ *S*^{c}, *js*^{c} ∈ *JS*^{c} and *ja*^{c} ∈ *JA*^{c}.

$$Q_{c}(js^{c}, ja^{c}) = \sum_{k=1}^{N} Q_{k}(s_{k}, a_{k}) \qquad (5)$$
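The initialization by summation can be sketched in Python as follows. This is a minimal illustration, assuming for simplicity that all agents share the same coordinated states and the same individual action set; `init_joint_q` and the toy Q values are hypothetical.

```python
from itertools import product

def init_joint_q(single_q, coordinated_states, actions):
    """Build the joint table by summing each agent's single-agent Q value.

    single_q           -- list of dicts, one per agent: {(state, action): value}
    coordinated_states -- coordinated states (shared by all agents in this sketch)
    actions            -- individual actions (shared by all agents in this sketch)
    """
    n = len(single_q)
    joint_q = {}
    for js in product(coordinated_states, repeat=n):   # every joint state
        for ja in product(actions, repeat=n):          # every joint action
            joint_q[(js, ja)] = sum(
                single_q[k].get((js[k], ja[k]), 0.0) for k in range(n)
            )
    return joint_q

# Two agents, two coordinated states, two actions (toy values).
q_a = {("c1", "E"): 1.0, ("c1", "W"): 0.5, ("c2", "E"): 0.2, ("c2", "W"): 0.0}
q_b = {("c1", "E"): -1.0, ("c1", "W"): 2.0, ("c2", "E"): 0.3, ("c2", "W"): 0.7}
joint = init_joint_q([q_a, q_b], ["c1", "c2"], ["E", "W"])
```

With two agents, two coordinated states, and two actions, the joint table holds 2² × 2² = 16 entries, which shows how quickly the joint representation grows even for a small coordinated region.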

After adding the joint Q value *Q*_{c}(*js*^{c},*ja*^{c}), agents can coordinate their behavior according to this Q value when in the coordinated states. The basic idea of our learning approach is to let agents act optimistically when facing uncertainties caused by their local observability of the environment. In more detail, when all agents are in the coordinated states at the same time, they can observe the overall state *js*^{c} and choose joint action *ja*^{c} according to the joint Q value *Q*_{c}(*js*^{c},*ja*^{c}). But when there are agents out of the coordinated states, those agents in the coordinated states cannot receive the joint-state-action information of all the agents to determine their actions from the joint Q value *Q*_{c}(*js*^{c},*ja*^{c}) because agents can only observe each other fully in the coordinated states (i.e., agents have limited full observability of the environment). To solve this tricky issue, an optimistic estimation mechanism is proposed so that agents can act optimistically by giving a best estimation of those unobserved agents. This means agents will always act according to the highest *Q* value based only on the available state-action information. Let *I*_{i} stand for the state and action information of a group of *i* agents. The *optimistic estimation* is formally defined by Definition 4.

Definition 4. A group of *i* agents’ *optimistic estimation* about the other *j* agents, *OE*(*I*_{j} | *I*_{i}), is defined to be a set of state and action information that makes the joint Q value maximal, namely, $OE(I_j \mid I_i) = \arg\max_{I_j \in I_J} Q_c(I_i, I_j)$, where *I*_{J} is the set of all possible state and action information for the *j* agents.
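One way to read the optimistic estimation is as a maximization over every joint-table entry that agrees with what is actually observed. The Python sketch below follows that reading; it conditions only on the observed agents' states, which is a simplifying assumption, and `optimistic_estimate` and the toy table are hypothetical.

```python
def optimistic_estimate(joint_q, observed):
    """Pick the best-case completion of a partially observed joint state.

    joint_q  -- {(joint_state, joint_action): value}
    observed -- {agent_index: state} for the agents whose state is known
    Returns (value, joint_state, joint_action) with the highest Q value among
    all entries consistent with the observed states, or None if none match.
    """
    best = None
    for (js, ja), value in joint_q.items():
        if all(js[k] == s for k, s in observed.items()):
            if best is None or value > best[0]:
                best = (value, js, ja)
    return best

# Toy two-agent table; only agent 0's state ("c1") is observed.
table = {
    (("c1", "c1"), ("E", "E")): -10.0,
    (("c1", "c2"), ("E", "W")): 3.0,
    (("c1", "c2"), ("W", "W")): 5.0,
    (("c2", "c1"), ("E", "E")): 9.0,
}
best = optimistic_estimate(table, {0: "c1"})
```

The entry with value 9.0 is ignored because it contradicts the observation, so the estimate settles on the best remaining entry. This also illustrates why the mechanism can overestimate: it always assumes the unobserved agent is in the most favourable situation.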

Let *s*_{k} be the current state of the agent at step *t* and be the state in the next step *t* + 1. denotes the joint-action space of *m* agents in the coordinated states, and denotes their joint-state space. and are their joint action and joint state, respectively. There are mainly two scenarios based on the current state of agent *k*, and different learning processes can be applied according to the transitions between states in each scenario.

Scenario 1. *s*_{k} ∉ *S*^{c}.

In this scenario, agent *k* is in an uncoordinated state. It looks up its own single Q-value table *Q*_{k}(*s*_{k},*a*_{k}) and takes an action *a*_{k} that has the highest Q value to transit into a new state. If the new state is still in the uncoordinated states, normal single-agent Q-learning can be applied to update the Q value as given by Equation (3).

However, if the new state is in the coordinated states, the agent needs to back up its Q value by adding the expected future reward from the coordinated state . Note that in the coordinated states, agents only maintain a joint Q-value table *Q*_{c}(*js*^{c},*ja*^{c}), which represents the overall expected reward when all the agents are in the coordinated states with joint state *js*^{c} and joint action *ja*^{c}. However, agent *k* only has local full observability of the coordinated states and thus cannot observe the state-action information of those agents that are out of the coordinated states. As a result, the joint state *js*^{c} cannot be determined to choose a joint action *ja*^{c} that maximizes the Q value. Suppose there are *m* agents in the coordinated states and *n* agents in the uncoordinated states at step *t* + 1. Agent *k* observes the joint state of the *m* agents and chooses the highest *Q*_{c}(*js*^{c ′ },*ja*^{c ′ }) based on this information according to the optimistic estimation mechanism. The value of *Q*_{c}(*js*^{c ′ },*ja*^{c ′ }) represents the overall expected reward and can be averaged by the total number of agents *N*. The Q-value updating rule is formally given by Equation (6).

- (6)

where *ja*^{c ′ } is selected on the basis of the optimistic estimation mechanism given by Equation (7).

- (7)

Equation (7) means that for all the unobserved information and , there is at least a state-action pair in the Q table that maximizes the joint Q value based on the available information and of the *m* agents in the coordinated states.

Scenario 2. *s*_{k} ∈ *S*^{c}.

In this scenario, agent *k* is in a coordinated state at step *t*. It observes the whole coordinated states to gain the state information of the other agents that are currently in the coordinated states. Assume there are now *m* (*m ⩽ N*) agents in the coordinated states with the joint state and other *n* agents in the uncoordinated states. The *m* agents will look up the joint Q-value table *Q*_{c}(*js*^{c},*ja*^{c}) and choose the joint action with the highest Q value simultaneously according to their optimistic estimation of the other *n* agents by Equation (8).

- (8)

After taking the joint action , each agent jumps to a new state. Suppose among the *m* agents, there are *p* (*p ⩽ m*) agents still in the coordinated states and other *q* = (*m* − *p*) agents moving out to uncoordinated states. The *m* agents should back up the future rewards from *Q*_{c} according to the joint state of the *p* agents and from *Q*_{k} according to the state of each agent that jumps out of the coordinated states. The joint Q value can be updated by Equation (9).

- (9)

where is the sum of the rewards of the *m* agents and *ja*^{c ′ } is selected according to Equation (7) based on the *p* agents’ state information. In Equation (9), (i) is the expected reward of all the *N* agents based on the information of the *p* agents. This value multiplied by represents the expected reward of the *p* agents; (ii) is the expected reward of each agent that moves out of the coordinated states. Summing up these values represents all the expected reward of the *q* agents; (iii) is the expected reward of the *m* agents. This value multiplied by represents the expected reward of all the *N* agents; and (iv) means that this Q-value updating is applied for all the joint-state action of the *n* agents. In this way, the joint *Q*_{c} value can be updated using the available information among the *m* agents.
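To make the Scenario 1 backup more concrete, here is a hedged Python sketch. The exact forms of Equations (3) and (6) are not reproduced in the text, so the code assumes the standard Q-learning update and only follows the prose: when the next state is coordinated, the future value is the optimistic joint Q value divided by the number of agents *N*; otherwise it is the usual maximum over the agent's own table. The function name and arguments are hypothetical, and the default rates are the ones reported in the experimental setting.

```python
def scenario1_update(q_k, s, a, reward, s_next, next_is_coordinated,
                     optimistic_joint_value, n_agents, actions,
                     alpha=0.05, gamma=0.95):
    """Update agent k's single Q value when it starts in an uncoordinated state.

    optimistic_joint_value is the highest joint Q value consistent with what
    the agent can observe in the next state (used only if that state is
    coordinated). The standard Q-learning form is an assumption.
    """
    if next_is_coordinated:
        # The joint value covers all agents, so take an equal per-agent share.
        future = optimistic_joint_value / n_agents
    else:
        future = max(q_k.get((s_next, b), 0.0) for b in actions)
    old = q_k.get((s, a), 0.0)
    q_k[(s, a)] = old + alpha * (reward + gamma * future - old)
    return q_k[(s, a)]

moves = ["E", "W"]
q_enter = {}
entered = scenario1_update(q_enter, "u1", "E", 0.0, "c1", True, 20.0, 2, moves)
q_stay = {("u2", "W"): 2.0}
stayed = scenario1_update(q_stay, "u1", "E", -1.0, "u2", False, 0.0, 2, moves)
```

The Scenario 2 update would combine the same ingredients in the other direction (joint table updated from joint and single-agent values), but its weighting terms are not recoverable from the prose, so it is not sketched here.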

### EXPERIMENT

In this section, experiments are carried out to demonstrate the performance of our learning approach, which is denoted as *CL*. Two other approaches are selected as the basic benchmarks for comparison of learning performance.

*Independent learning* (IL) [40]. In this approach, each agent treats the remaining agents simply as a part of the environment by ignoring their actions and rewards and learns its policy independently. The decision-making process is forcibly decomposed into *n* separate MDPs. This results in a large reduction in the state-action representation; however, at the same time, a poor learning outcome might occur because of the lack of coordination. IL provides a valid perspective to study the so-called *moving target* effect [1] that complicates general MAL problems. Despite the lack of guaranteed optimal performance, this method has been applied successfully in certain cases [41, 42].

*Joint-state-action learning* (JSAL). This approach provides the other extreme scenario against IL. Agents either communicate freely with a central controller and select their individual actions indicated by the central controller or have full observability of the environment to receive the joint-state-action information of all the agents and update learning information synchronously. With a sufficient learning period, JSAL can achieve an optimal performance as the decision-making process is considered as a single MDP where agents learn in a static environment. However, JSAL faces some tricky issues inherent in MAL, namely, (i) the curse of dimensionality: the search space grows rapidly with the complexity of agent behavior, the number of agents involved, and the size of domains; (ii) local observability: the agents might not have access to the needed information for the learning update because they are not able to observe the states, actions, and rewards of all the agents; and (iii) slow convergence: it takes many time steps to explore the whole joint-state-action space, which results in a slow convergence to the optimal performance.

#### Experimental setting

We first test all the approaches in the three small grid-world domains given in Figure 1. In each of these domains, the state space is relatively small, and in some regions of the domain, coordination might be heavily needed to avoid conflicts. The approaches are then applied to some larger domains as given in Figure 2, which originate from Figure 3 in [13]. These larger domains have a variety of state space sizes (from 43 individual states in the ISR domain to 133 in the CMU domain, corresponding to 1,849 and 17,689 joint states, respectively). The interactions between robots are much sparser than those in the small domains. In all these domains, robots navigate with four actions, that is, *Move East*, *Move South*, *Move West*, and *Move North*. Each action moves the robot in the corresponding direction deterministically. Some other studies used a nondeterministic transition setting in which actions fail with a low probability, so that the agents move in an unintended direction chosen with uniform probability; from a statistical point of view, that setting can achieve the same learning performance as the setting we use here. When robots collide with a wall, they rebound and stay where they were. If they collide with each other, both robots break down and are transferred back to their original states. The exploration policy we adopt is the fixed *ϵ*-greedy policy with *ϵ* = 0.1. The learning rate is *α* = 0.05, the discount factor is *γ* = 0.95, and rewards are given as follows: +20 for reaching the goal state, −1 for colliding with a wall, and −10 for colliding with the other robot. To use our approach, we choose a two-dimensional standard normal distribution function as the kernel function because of its simplicity of implementation. The value of *δ* is set to 95%. To show the effect of the scanning distance *R* on the learning performance, two different cases of CL are studied, that is, CL_{1} in which *R* = 2 and CL_{2} in which *R* = 4. In both cases, the first 1000 (i.e., *N*_{1} = 1000) episodes are used to collect statistical information to determine the coordinated states. We run all approaches for 10,000 episodes and average the last 2000 episodes to compute the overall performance. All results are then averaged over 25 runs.
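The environment rules above can be illustrated with a small two-robot step function in Python. It is a sketch under assumptions that the text does not settle: a collision is taken to mean that both robots end the step in the same cell (robots swapping cells is not treated as a collision), ordinary moves yield zero reward, and *South* increases the *y* coordinate. The names `step`, `MOVES`, and the corridor layout are hypothetical.

```python
MOVES = {"East": (1, 0), "South": (0, 1), "West": (-1, 0), "North": (0, -1)}
GOAL_REWARD, WALL_PENALTY, COLLISION_PENALTY = 20, -1, -10

def step(positions, actions, starts, goals, free_cells):
    """Advance both robots one step; return (new_positions, rewards)."""
    proposed, rewards = [], []
    for pos, act in zip(positions, actions):
        dx, dy = MOVES[act]
        target = (pos[0] + dx, pos[1] + dy)
        if target in free_cells:
            proposed.append(target)
            rewards.append(0)             # assumed zero reward for a plain move
        else:
            proposed.append(pos)          # rebound: stay in place
            rewards.append(WALL_PENALTY)
    if proposed[0] == proposed[1]:
        # Same-cell collision: both robots are sent back to their start cells.
        return list(starts), [COLLISION_PENALTY, COLLISION_PENALTY]
    for i, goal in enumerate(goals):
        if proposed[i] == goal:
            rewards[i] = GOAL_REWARD
    return proposed, rewards

# Hypothetical three-cell corridor in which the robots must pass each other.
corridor = {(0, 0), (1, 0), (2, 0)}
starts, goals = [(0, 0), (2, 0)], [(2, 0), (0, 0)]
crash = step(starts, ["East", "West"], starts, goals, corridor)
bump = step(starts, ["West", "West"], starts, goals, corridor)
```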

#### Results and analysis

The learning curves in terms of average rewards gained by both robots are given in Figure 3, where the *x*-axis indicates the learning episode and the *y*-axis indicates the reward averaged every 50 episodes. As can be seen from the figure, the JSAL approach can converge to an optimal value because it can receive joint-state-action information of both robots. The optimal reward is a bit lower than 20 because of the stochastic exploration during the last 2000 episodes. However, the JSAL approach is not applicable in practice because the search space grows exponentially with the number of agents, and the assumptions of a central controller or agents’ global observability are infeasible in most practical problems. By contrast, the IL approach can only achieve a very low reward because of the lack of coordination. When a robot learns from the individual learner perspective, the environment is nonstationary, but the robot does not account for this nonstationarity. Thus, the learning robot continuously tries to *catch up* with the dynamic and adaptive environment, causing a suboptimal performance that is much lower than that of the optimal JSAL. The performances of CL_{1} and CL_{2} are almost the same as that of IL during the first 1000 episodes because both robots are learning independently. But CL_{1} and CL_{2} quickly outperform IL after the coordinated states are determined, because the robots can then learn to coordinate with each other to avoid possible conflicts. The different performances of CL_{1} and CL_{2} will be explained later.

Table 1 gives the overall performances of these approaches in more detail. The state and action spaces of each approach are also laid out to show their computational complexities. As an example, when using CL_{1} in domain TTG, there are three coordinated states (shaded area in Figure 1(a)) among the 25 states. The state space each robot keeps in domain TTG can thus be calculated as 22 + 3^{2} ∕ 2 = 26.5, and the action space is 4 × 22 ∕ 25 + 4^{2} × 3 ∕ 25 = 5.44. When using CL_{2}, robots have a longer scanning distance so that there are seven coordinated states in domain TTG. The state and action spaces can be calculated accordingly. As can be seen from Table 1, our approach reduces the computational complexity significantly compared with JSAL. This reduction is more desirable in larger scale domains where the computational complexity using JSAL is too high to be implemented. This is verified by the results in larger domains given later. In terms of reward (collision percentage), CL_{1} and CL_{2} score much higher (lower) than IL because coordination between robots is considered, but still a bit lower (higher) than the optimal JSAL because the optimistic estimation mechanism introduced in our approach makes a robot overestimate the other robot's behavior.

Domain | Approach | State | Action | Q values | Reward | Collision(%) | Step |
---|---|---|---|---|---|---|---|
TTG | IL | 25 | 4 | 100 | 6.77 ± 0.21 | 0.42 ± 0.01 | 12.53 ± 0.04 |
TTG | CL_{1} | 26.5 | 5.44 | 144.16 | 16.60 ± 0.08 | 0.10 ± 0.00 | 16.92 ± 0.11 |
TTG | CL_{2} | 42.5 | 7.36 | 312.8 | 16.81 ± 0.16 | 0.09 ± 0.00 | 20.33 ± 0.64 |
TTG | JSAL | 625 | 16 | 10,000 | 18.22 ± 0.29 | 0.00 ± 0.00 | 22.42 ± 3.12 |
HG | IL | 21 | 4 | 84 | 0.16 ± 0.19 | 0.66 ± 0.01 | 12.50 ± 0.17 |
HG | CL_{1} | 22.5 | 5.71 | 128.48 | 11.20 ± 0.15 | 0.24 ± 0.01 | 17.56 ± 0.36 |
HG | CL_{2} | 52.5 | 7.68 | 403.2 | 15.77 ± 0.26 | 0.12 ± 0.01 | 20.39 ± 1.55 |
HG | JSAL | 441 | 16 | 7,056 | 17.10 ± 0.49 | 0.05 ± 0.02 | 21.66 ± 2.79 |
TR | IL | 36 | 4 | 144 | 6.71 ± 2.29 | 0.43 ± 0.08 | 13.66 ± 0.53 |
TR | CL_{1} | 48 | 6 | 288 | 15.43 ± 0.16 | 0.16 ± 0.01 | 27.82 ± 1.35 |
TR | CL_{2} | 148 | 9.33 | 1381.33 | 16.92 ± 0.24 | 0.07 ± 0.01 | 28.59 ± 2.03 |
TR | JSAL | 1296 | 16 | 20,736 | 18.19 ± 0.46 | 0.01 ± 0.01 | 29.94 ± 8.63 |
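The worked TTG figures can be reproduced with a short Python helper. The formulas are read off the two-robot TTG example in the text (uncoordinated states counted once, coordinated joint states shared between the two robots, and the action count averaged over all states); applying them to other rows of the table is an assumption, and `cl_table_sizes` is a hypothetical name.

```python
def cl_table_sizes(n_states, n_coordinated, n_actions=4):
    """Per-robot state, action and Q-value counts for two robots under CL."""
    uncoordinated = n_states - n_coordinated
    # Joint states over the coordinated cells are shared by the two robots.
    state_space = uncoordinated + n_coordinated ** 2 / 2
    # Average number of (joint) actions per state of the domain.
    action_space = (n_actions * uncoordinated
                    + n_actions ** 2 * n_coordinated) / n_states
    return state_space, action_space, state_space * action_space

ttg_cl1 = cl_table_sizes(25, 3)   # three coordinated states (CL1)
ttg_cl2 = cl_table_sizes(25, 7)   # seven coordinated states (CL2)
jsal_q_values = 25 ** 2 * 4 ** 2  # full joint table for comparison
```

For TTG this gives roughly 26.5 states and 5.44 actions (about 144 Q values) for CL_{1} and 42.5 states and 7.36 actions (about 313 Q values) for CL_{2}, against 10,000 Q values for the full joint table.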

Another important aspect showing the different performances of these approaches is the number of steps for both robots to reach their own goals. We count the steps only in those episodes in which the robots do not collide with each other and can therefore both reach their goals. The results show that robots in IL always find the shortest path to their goals, which, in turn, causes the high probability of collision because robots do not coordinate with each other when both come to the coordinated states. In JSAL, a central controller receives the joint-state-action information of both robots, or the robots learn their joint policies based on their full observability of the environment. As a result, a safe detour strategy will be adopted by the robots to reduce the probability of collision, which accordingly increases the steps to the goals. As can be seen from the 95% confidence intervals of the number of steps to goals in Table 1, JSAL is the most unstable approach, which implies that the learnt policy in JSAL can be affected by the stochastic learning process. This means that if the robots are *lucky* enough at the early stage of exploring the environment, they can identify a short path to the goal. On the contrary, a much longer path will be learnt if the learning process deviates from this trajectory. Our approach, however, combines the merits of both IL and JSAL, allowing robots to find the shortest path to the goals with higher certainty while only making a small detour around the coordinated states. This is why the number of steps to goals in our approach is higher than that in IL but lower than that in JSAL, and the stability is in between the two approaches.

As previously mentioned, the different ranges of coordinated states affect the learning performances in different domains. In the TTG domain, CL_{1} already obtains a very good performance with high rewards, whereas expanding the coordinated states in CL_{2} does not improve the results much further. This is because in this domain, when both robots are around the entrance of the tunnel, only one step of detour is sufficient to avoid the conflict at the entrance. Thus, three coordinated states (the shaded area in Figure 1(a)) around the entrance using CL_{1} are enough to capture the most hazardous situation in this domain. This is illustrated in Figure 4(c) and (d), where in both cases, the robots can avoid a collision near the entrance to the tunnel. However, the situation is quite different in the HG domain because three coordinated states (the shaded area in Figure 1(b)) are not sufficient to let the robots make a big enough detour to avoid the conflicts when both robots are near the doorway simultaneously. As can be seen from Figure 4(a), robots still have a high probability of colliding with each other near the doorway. That is why the performance of CL_{1} in this domain is higher than that of IL but still much lower than that of the optimal JSAL. However, when the robots learn with a wider range of coordinated states as shown in Figure 4(b), they can make a large enough detour to avoid the collision, which correspondingly increases the learning performance further in CL_{2}.

Table 2 gives the overall performances of these learning approaches in the larger domains. The fundamental difference between these larger domains and those in Figure 1 is that interactions in these larger domains are much sparser than those in the smaller domains. This means explicit coordination is not heavily required in these large domains, which explains why in some domains such as MIT, CIT, SUNY and CMU, the independent learning approach can already achieve a very good performance. Even in these domains, our learning approach can still improve the performance further, except in the SUNY domain where coordination is not necessary at all. The minor difference between the results of our learning approach and those of IL in the SUNY domain is caused by the extra exploration introduced by the CL process in our approach. However, in all other domains where coordination is more necessary, the benefit that this CL process brings outweighs the uncertainty it causes. Another interesting finding in Table 2 is that as the state and action sizes grow, the performance of JSAL decreases. This is because the JSAL approach searches in the whole state-action space, and the robots are unable to learn an optimal policy in 10,000 learning episodes. For example, in the CMU domain, there are 17,689 × 16 = 283,024 Q values to be estimated. Figure 3(d) plots the learning process in the CMU domain, from which we can see that the JSAL approach converges too slowly to reach an optimal value. Our learning approach, however, only needs to consider the joint-state-action information when a robot is in the coordinated states, which usually account for a small proportion of the whole domain state space. In this way, the search space is reduced substantially compared with JSAL.

Domain | Approach | State | Action | Q Values | Reward | Collision(%) | Step |
---|---|---|---|---|---|---|---|
ISR | IL | 43 | 4 | 172 | 10.11 ± 2.61 | 0.32 ± 0.09 | 6.28 ± 0.26 |
ISR | CL_{1} | 47 | 5.12 | 240.64 | 14.05 ± 0.17 | 0.15 ± 0.01 | 12.54 ± 0.42 |
ISR | CL_{2} | 60.5 | 5.95 | 359.98 | 15.76 ± 0.26 | 0.10 ± 0.02 | 12.96 ± 0.63 |
ISR | JSAL | 1849 | 16 | 29,584 | 16.86 ± 0.30 | 0.06 ± 0.01 | 14.56 ± 5.10 |
MIT | IL | 49 | 4 | 196 | 16.84 ± 1.23 | 0.08 ± 0.02 | 23.15 ± 0.98 |
MIT | CL_{1} | 61 | 5.50 | 345 | 16.92 ± 0.55 | 0.07 ± 0.01 | 29.56 ± 1.53 |
MIT | CL_{2} | 193 | 8.41 | 1622.78 | 16.98 ± 0.67 | 0.07 ± 0.01 | 31.25 ± 1.65 |
MIT | JSAL | 2401 | 16 | 38,416 | 16.49 ± 0.38 | 0.02 ± 0.01 | 44.02 ± 4.57 |
PTG | IL | 52 | 4 | 208 | 12.18 ± 3.75 | 0.24 ± 0.13 | 10.04 ± 1.06 |
PTG | CL_{1} | 53.5 | 4.69 | 250.92 | 15.25 ± 0.35 | 0.13 ± 0.03 | 12.51 ± 1.36 |
PTG | CL_{2} | 64 | 5.38 | 344.62 | 16.24 ± 0.41 | 0.09 ± 0.02 | 12.84 ± 1.71 |
PTG | JSAL | 2704 | 16 | 43,264 | 17.76 ± 0.43 | 0.04 ± 0.01 | 14.32 ± 4.30 |
CIT | IL | 70 | 4 | 280 | 15.10 ± 2.98 | 0.12 ± 0.10 | 20.87 ± 0.80 |
CIT | CL_{1} | 71.5 | 4.51 | 322.47 | 15.65 ± 0.55 | 0.10 ± 0.08 | 22.26 ± 1.36 |
CIT | CL_{2} | 82 | 5.03 | 412.34 | 15.85 ± 0.52 | 0.08 ± 0.03 | 22.35 ± 1.42 |
CIT | JSAL | 4900 | 16 | 78,400 | 16.81 ± 0.45 | 0.03 ± 0.01 | 22.42 ± 1.92 |
SUNY | IL | 74 | 4 | 296 | 19.10 ± 0.42 | 0.01 ± 0.01 | 12.17 ± 0.14 |
SUNY | CL_{1} | 75.5 | 4.49 | 339.00 | 18.77 ± 0.45 | 0.02 ± 0.01 | 14.46 ± 1.36 |
SUNY | CL_{2} | 81.5 | 4.81 | 392.08 | 18.97 ± 0.39 | 0.02 ± 0.02 | 16.26 ± 1.98 |
SUNY | JSAL | 5476 | 16 | 87,616 | 17.60 ± 0.32 | 0.02 ± 0.01 | 26.47 ± 3.98 |
CMU | IL | 133 | 4 | 532 | 17.03 ± 0.86 | 0.03 ± 0.02 | 42.65 ± 1.81 |
CMU | CL_{1} | 145 | 4.54 | 658.3 | 17.25 ± 0.65 | 0.02 ± 0.02 | 47.56 ± 2.36 |
CMU | CL_{2} | 193 | 5.08 | 980.96 | 17.65 ± 0.54 | 0.01 ± 0.01 | 53.02 ± 3.04 |
CMU | JSAL | 17,689 | 16 | 283,024 | -9.96 ± 1.41 | 0.05 ± 0.01 | 236.50 ± 9.62 |
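The Q-value counts reported for IL and JSAL in Table 2 can be reproduced with simple arithmetic. The sketch below assumes two robots, four individual actions per robot, and a JSAL table indexed by the Cartesian product of both robots' states and actions; the function and variable names are illustrative and do not come from the original experiments.

```python
# Single-agent state counts per domain, taken from the IL rows of Table 2.
SINGLE_AGENT_STATES = {
    "ISR": 43, "MIT": 49, "PTG": 52, "CIT": 70, "SUNY": 74, "CMU": 133,
}
ACTIONS_PER_AGENT = 4  # individual action count in the IL rows of Table 2


def q_table_sizes(n_states, n_actions=ACTIONS_PER_AGENT, n_agents=2):
    """Return (IL, JSAL) Q-table sizes for one domain.

    IL keeps one entry per (own state, own action) pair.
    JSAL is assumed to keep one entry per (joint state, joint action) pair,
    so both factors grow exponentially with the number of agents.
    """
    il_size = n_states * n_actions
    jsal_size = (n_states ** n_agents) * (n_actions ** n_agents)
    return il_size, jsal_size


if __name__ == "__main__":
    for domain, n_states in SINGLE_AGENT_STATES.items():
        il_size, jsal_size = q_table_sizes(n_states)
        print(f"{domain:5s} IL={il_size:4d}  JSAL={jsal_size:7,d}  "
              f"ratio={jsal_size / il_size:.0f}x")
```

For the CMU domain this gives 133 × 4 = 532 entries for IL and 133² × 4² = 283,024 entries for JSAL, matching the corresponding rows of Table 2. The CL rows cannot be derived this way because their sizes depend on how many coordinated states each run detects.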

In conclusion, the experimental results show that robots using the proposed CL approach can learn where and how to coordinate their behavior by capturing agents’ sparse interactions in loosely coupled MASs. CL outperforms the uncoordinated approach IL by considering coordination when necessary (i.e., in the coordinated states). Compared with the JSAL approach, CL reduces the state-action space considerably and enables robots to learn a shorter path to the goal with higher certainty. By removing the assumption of a central controller or of agents’ global observability of the environment, CL solves more realistic problems than JSAL and achieves a good performance while requiring only limited information during the learning process.

### CONCLUSION AND FUTURE WORK

In this paper, we propose a CL approach that enables agents to learn where and how to coordinate their behavior in loosely coupled MASs based on each agent's local observability of the environment. Through a statistical learning period, agents can detect those states where coordinated behavior is most necessary. Then, a Q-learning-based approach is applied to coordinate agents’ behavior to avoid potential conflicts according to different transitions between states. To deal with the uncertainties caused by agents’ local observability of the environment, an optimistic estimation mechanism is introduced to guide the agents’ learning process. Our approach captures the feature of sparse interactions of agents and decomposes the learning process into two subprocesses, that is, independent learning in uncoordinated states and CL in coordinated states. Through this decomposition, our approach reduces the complexity of decision making in Dec-MDPs. Experimental results show that our approach can achieve a good performance by improving the average agent rewards compared with the uncoordinated learning approach and by reducing the computational complexity significantly compared with the centralized learning approach.
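The decomposition summarized above can be illustrated with a small tabular sketch. It is not the update rule of this paper: it is plain Q-learning that switches between a local table and a joint table, and it stands in for the optimistic estimation mechanism with a simple assumption, namely that when the other agent is not observed in a coordinated state, the learner bootstraps from the best value over the other-agent states it has seen there. The class name, method names and hyperparameters are all hypothetical.

```python
from collections import defaultdict


class SparseCoordinationLearner:
    """Illustrative two-table Q-learner (an assumption, not the paper's rules)."""

    def __init__(self, actions, coordinated_states, alpha=0.1, gamma=0.95):
        self.actions = list(actions)
        self.coordinated = set(coordinated_states)
        self.alpha, self.gamma = alpha, gamma
        self.q_local = defaultdict(float)    # key: (own_state, action)
        self.q_joint = defaultdict(float)    # key: ((own_state, other_state), action)
        self.seen_others = defaultdict(set)  # own_state -> observed other_states

    def _table_and_key(self, own, other):
        # Joint information is used only in coordinated states, and only
        # when the other agent is actually observed.
        if own in self.coordinated and other is not None:
            return self.q_joint, (own, other)
        return self.q_local, own

    def q(self, own, other, action):
        table, key = self._table_and_key(own, other)
        return table[(key, action)]

    def state_value(self, own, other):
        # Assumed optimistic estimate: if the other agent is unobserved in a
        # coordinated state, take the best value over other-agent states seen.
        if own in self.coordinated and other is None and self.seen_others[own]:
            return max(self.state_value(own, o) for o in self.seen_others[own])
        return max(self.q(own, other, a) for a in self.actions)

    def update(self, own, other, action, reward, next_own, next_other):
        if own in self.coordinated and other is not None:
            self.seen_others[own].add(other)
        table, key = self._table_and_key(own, other)
        target = reward + self.gamma * self.state_value(next_own, next_other)
        table[(key, action)] += self.alpha * (target - table[(key, action)])


if __name__ == "__main__":
    learner = SparseCoordinationLearner(actions=range(4), coordinated_states={5})
    learner.update(own=1, other=None, action=0, reward=1.0, next_own=2, next_other=None)
    learner.update(own=5, other=7, action=2, reward=-10.0, next_own=6, next_other=None)
    print(learner.q(1, None, 0), learner.q(5, 7, 2))
```

Only the entries for coordinated states carry the other agent's state, so the joint table stays small when such states are rare, which is the source of the reduction in search space discussed in the experiments.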

This paper provides several interesting directions to be explored further. Firstly, in this paper, we simplified the solution by collecting statistical information in predefined episodes at the beginning of the learning period. This can be improved by making the process dynamic and online. One solution might be that, every time agents collide with each other, they mark an area of coordinated states together with a belief that coordination is needed in these states. As the learning process moves on, agents become more certain about the structure of the environment and can thus decide whether to coordinate with each other according to their updated beliefs. Secondly, although the two-agent scenarios studied here demonstrate the effectiveness of our approach, and the approach itself is not restricted to two agents, it is still necessary to test its scalability and performance in situations involving more than two agents. Finally, as shown in the experimental results, our approach cannot obtain an optimal performance because the optimistic estimation mechanism overestimates other agents’ behavior. It is thus possible to introduce an extra mechanism to improve the learning update rules further so that an optimal performance can be guaranteed. We leave these issues for future work.

### ACKNOWLEDGEMENTS

The authors would like to thank the anonymous reviewers for their valuable comments. Chao Yu is supported by a scholarship under the China Scholarship Council (CSC)–University of Wollongong (UoW) Joint Postgraduate Scholarships Program. The kind assistance of Dr. Madeleine Cincotta in the copy editing process is also gratefully acknowledged.
