Policy Consensus-Based Distributed Deterministic Multi-Agent Reinforcement Learning
Reinforcement Learning (RL) has achieved significant breakthroughs in recent years in fields such as robotics, smart grids, and autonomous driving. Many real-world problems, however, involve multiple cooperating agents and are tackled with Multi-Agent Reinforcement Learning (MARL). The core challenge in this area lies in designing efficient MARL algorithms, especially under constraints such as limited communication capabilities or privacy protection. Most current MARL algorithms follow the widely adopted paradigm of Centralized Training with Decentralized Execution (CTDE). Although this paradigm effectively mitigates the non-stationarity of the environment, its reliance on heavy communication and centralized processing creates difficulties in real deployments, such as link failures and bandwidth limitations. Studying distributed MARL algorithms that operate under reduced communication requirements is therefore particularly important.
This research focuses on addressing the above challenges by designing a policy consensus-based distributed MARL algorithm to overcome the limitations of existing methods. The research team is composed of Yifan Hu, Junjie Fu, and Guanghui Wen from the School of Mathematics at Southeast University, as well as Changyin Sun from the School of Artificial Intelligence at Anhui University. This study was published in the January 2025 issue of IEEE Transactions on Artificial Intelligence.
Research Background and Objectives
Existing MARL algorithms face several bottlenecks when dealing with high-dimensional continuous state and action spaces. Most current methods focus on discrete spaces and lack theoretical analysis of learning performance in continuous state and action spaces. Moreover, many algorithms assume an undirected communication graph, whereas communication networks in real tasks are often directed. In addition, the learning capability of distributed MARL algorithms still needs substantial improvement, especially when compared with state-of-the-art centralized training (CT)-based baselines.
To tackle these challenges, this study proposes a distributed deterministic actor–critic algorithm based on Deterministic Policy Gradient (DPG) techniques. The primary objective is to achieve collaborative learning among agents over high-dimensional continuous state-action spaces by incorporating parameter consensus mechanisms into the policy and value-function updates. The study also aims to provide theoretical convergence guarantees and to enhance scalability, exploration, and data efficiency through a Deep Reinforcement Learning (DRL) training framework.
Workflow and Research Methods
This research proceeds from theoretical formulation to practical algorithm implementation, encompassing the following key stages:
1. Design of Theoretical Distributed Algorithm
First, the research team proposed a local DPG theorem adapted to distributed MARL, building on the classic deterministic policy gradient theorem. The theorem is formulated for observation-based policies, and the resulting algorithm incorporates parameter consensus updates for both the policy and critic parameters. Assuming strongly connected directed communication graphs and employing stochastic approximation theory, the study establishes the asymptotic convergence of the theoretical algorithm under certain assumptions.
The core update rule of the algorithm consists of two components: a critic update and an actor update. In the critic update, local temporal difference (TD) errors are combined with consensus steps so that the agents iteratively approximate the joint Q-function. In the actor update, local estimates of the deterministic policy gradient are combined with consensus steps so that the policy parameters reach agreement across agents. A schematic form of these updates is given below.
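The paper states these updates for linear function approximation; the display below is a hedged schematic written for this report rather than the paper's exact notation. The step sizes $\beta_\omega$ and $\beta_\theta$, the mixing weights $c_{ij}$ induced by the directed communication graph, and the in-neighbor sets $\mathcal{N}_i$ are illustrative symbols, and the policy parameterization is assumed to be identical across agents, as the policy-consensus setting requires. Each agent $i$ holds critic parameters $\omega_i$ and policy parameters $\theta_i$:

$$\tilde{\omega}_i = \omega_i + \beta_\omega\, \delta_i\, \nabla_{\omega} Q_{\omega_i}(s, a), \qquad \omega_i \leftarrow \sum_{j \in \mathcal{N}_i} c_{ij}\, \tilde{\omega}_j,$$

where $\delta_i$ is agent $i$'s local TD error computed from its own reward, and

$$\tilde{\theta}_i = \theta_i + \beta_\theta\, \nabla_{\theta}\mu_{\theta_i}(o_i)\, \nabla_{a_i} Q_{\omega_i}(s, a)\big|_{a_i = \mu_{\theta_i}(o_i)}, \qquad \theta_i \leftarrow \sum_{j \in \mathcal{N}_i} c_{ij}\, \tilde{\theta}_j.$$

In words: each agent first takes a local gradient step using only its own reward and observation, and then averages its parameters with those received from its in-neighbors; this repeated mixing is what drives the networked agents toward a common critic and a common policy.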
2. Design of Practical Distributed Algorithm
Although the theoretical distributed algorithm offers convergence guarantees, its learning performance may be constrained by its standard assumptions, such as linear function approximation, diminishing learning rates, and purely deterministic policies. To address these limitations, the research team further introduced a practical Distributed Deep Deterministic Actor–Critic algorithm (D3-AC) built on the DRL training architecture. Key improvements include:
- Network Design: Both the actor and the critic use scalable neural networks (NNs). The critic employs a Graph Convolutional Network (GCN) to capture the interactions among agents, which mitigates the scalability issues that arise as the number of agents increases.
- Experience Replay Mechanism: To improve sample efficiency, each agent maintains a replay buffer and leverages target networks to reduce training oscillations.
- Noise Augmentation Strategy: Gaussian noise is introduced to enhance exploration capabilities.
This algorithm achieves distributed learning by combining local parameter updates with distributed consensus updates during training.
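To make the training round concrete, the following minimal PyTorch sketch, written for this report and not taken from the paper, shows one D3-AC-style round: each agent performs a local DDPG-style update using only its own reward, then averages its actor and critic parameters with its in-neighbors over a directed graph. The names `Agent`, `local_update`, and `consensus_mix`, the problem dimensions, the directed ring used as the communication graph, and the random tensors standing in for replay-buffer samples are all illustrative assumptions; the paper's GCN critic is simplified to an MLP, and exploration noise and target-network soft updates are omitted for brevity.

```python
# Minimal sketch of one D3-AC-style training round (illustrative, not the authors' code).
# Each agent: (1) takes a local DDPG-style critic and actor step using only its own
# reward, then (2) averages its parameters with those of its in-neighbors on a
# directed communication graph via row-stochastic consensus weights.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, N_AGENTS, GAMMA = 8, 2, 4, 0.95

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

class Agent:
    def __init__(self):
        # Observation-based deterministic policy; the paper's GCN critic over the
        # joint state-action is simplified to a plain MLP here.
        self.actor = mlp(OBS_DIM, ACT_DIM)
        self.critic = mlp(N_AGENTS * (OBS_DIM + ACT_DIM), 1)
        self.t_actor = mlp(OBS_DIM, ACT_DIM)
        self.t_critic = mlp(N_AGENTS * (OBS_DIM + ACT_DIM), 1)
        self.t_actor.load_state_dict(self.actor.state_dict())
        self.t_critic.load_state_dict(self.critic.state_dict())
        self.a_opt = torch.optim.Adam(self.actor.parameters(), lr=1e-3)
        self.c_opt = torch.optim.Adam(self.critic.parameters(), lr=1e-3)

def local_update(agent, i, obs, act, rew, next_obs):
    B = obs.shape[0]
    # Bootstrap target built from the agent's own target networks and its own reward.
    with torch.no_grad():
        next_act = torch.stack([agent.t_actor(next_obs[:, j]) for j in range(N_AGENTS)], dim=1)
        target_q = rew[:, i:i + 1] + GAMMA * agent.t_critic(
            torch.cat([next_obs.reshape(B, -1), next_act.reshape(B, -1)], dim=-1))
    # Local critic (TD) step.
    q = agent.critic(torch.cat([obs.reshape(B, -1), act.reshape(B, -1)], dim=-1))
    critic_loss = ((q - target_q) ** 2).mean()
    agent.c_opt.zero_grad(); critic_loss.backward(); agent.c_opt.step()
    # Local deterministic policy gradient step: substitute agent i's action with its policy output.
    new_act = act.clone()
    new_act[:, i] = agent.actor(obs[:, i])
    actor_loss = -agent.critic(torch.cat([obs.reshape(B, -1), new_act.reshape(B, -1)], dim=-1)).mean()
    agent.a_opt.zero_grad(); actor_loss.backward(); agent.a_opt.step()

def consensus_mix(agents, weights, attr):
    # weights[i][j] > 0 only if agent j is an in-neighbor of agent i; each row sums to 1.
    params = [getattr(a, attr).state_dict() for a in agents]
    for i, a in enumerate(agents):
        mixed = {k: sum(weights[i][j] * params[j][k] for j in range(N_AGENTS)) for k in params[0]}
        getattr(a, attr).load_state_dict(mixed)

agents = [Agent() for _ in range(N_AGENTS)]
# Directed ring: each agent mixes with exactly one in-neighbor (strongly connected yet sparse).
ring = [[0.5 if j in (i, (i - 1) % N_AGENTS) else 0.0 for j in range(N_AGENTS)] for i in range(N_AGENTS)]
# Random tensors stand in for a replay-buffer mini-batch of joint transitions.
obs, act = torch.randn(32, N_AGENTS, OBS_DIM), torch.randn(32, N_AGENTS, ACT_DIM)
rew, next_obs = torch.randn(32, N_AGENTS), torch.randn(32, N_AGENTS, OBS_DIM)
for i, ag in enumerate(agents):
    local_update(ag, i, obs, act, rew, next_obs)
for attr in ("actor", "critic"):
    consensus_mix(agents, ring, attr)
```

The design point visible in this sketch is that agents exchange only network parameters with a few neighbors, never raw experience or a centrally maintained critic, which is where the communication savings over CTDE-style training come from.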
Experimental Design and Results Analysis
Task Description
The study selected the Multi-Agent Particle Environment (MPE) as the experimental platform and designed three multi-robot coordination tasks:
- Coverage Control: Agents need to cover target regions while avoiding collisions with one another.
- Circle Control: Agents should evenly position themselves along the perimeter of a circle centered on a specified landmark while avoiding collisions.
- Square Control: Agents should evenly position themselves along the sides of a square defined by landmarks while ensuring collision avoidance.
Each task includes scenarios with eight and sixteen agents.
Algorithm Comparison
The research team compared the D3-AC algorithm with the following baseline algorithms:
- PIC: A centralized training-based deterministic policy algorithm, where all agents share a single global policy.
- MATD3: An algorithm where each agent independently trains its policy but shares a centralized critic network.
- D2-AC: A distributed actor–critic algorithm that incorporates policy consensus and stochastic policy gradient techniques.
Results show:
- Performance: The centralized baselines (PIC and MATD3) exhibit strong stability and performance across all tasks. D3-AC attains learning performance close to these centralized baselines in large-scale scenarios while significantly reducing communication requirements.
- Comparison with D2-AC: D3-AC is markedly more stable in continuous action spaces and significantly outperforms D2-AC.
- Communication Efficiency: By communicating over sparse graphs, D3-AC reduces the communication overhead for each agent, showcasing strong deployment potential in real-world scenarios with limited bandwidth.
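The communication-efficiency point can be made concrete with a toy calculation: under consensus-based training, each agent receives one set of parameters from each in-neighbor per consensus round, so the per-agent message count scales with the in-degree of the graph rather than with the team size. The helper functions below are illustrative and not part of the paper.

```python
# Illustrative only: per-agent message count under a sparse directed ring-style graph.
def ring_in_neighbors(n_agents, k):
    """In-neighbors of each agent: its k predecessors on a directed ring (strongly connected for k >= 1)."""
    return {i: [(i - d) % n_agents for d in range(1, k + 1)] for i in range(n_agents)}

def row_stochastic_weights(neighbors, n_agents):
    """Equal-weight mixing matrix: each agent averages itself with its in-neighbors."""
    w = [[0.0] * n_agents for _ in range(n_agents)]
    for i, nbrs in neighbors.items():
        share = 1.0 / (len(nbrs) + 1)
        w[i][i] = share
        for j in nbrs:
            w[i][j] = share
    return w

nbrs = ring_in_neighbors(n_agents=16, k=2)
w = row_stochastic_weights(nbrs, n_agents=16)
# Each of the 16 agents receives just 2 parameter vectors per consensus round.
print(len(nbrs[0]), sum(1 for x in w[0] if x > 0) - 1)
```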
Ablation Experiments
- Effect of Neighbor Number: Varying the number of neighbors (i.e., the communication density) shows that D3-AC achieves better learning outcomes under a moderately sparse communication network.
- Intermittent Communication: Simulations of an intermittent communication environment validate the robustness of D3-AC against network link failures (a toy emulation of this setting is sketched after this list).
- Local Observation Constraints: An adapted version of D3-AC under local observability (D3-AC-L) demonstrates strong learning capabilities even with constrained local observations. However, without inter-agent communication, D3-AC-L fails to learn effective policies, emphasizing the importance of local information sharing.
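One simple way to emulate the intermittent-communication setting flagged above, offered here only as an assumption-laden sketch rather than the paper's protocol, is to drop each in-link independently in every consensus round; the drop probability and the helper name are illustrative.

```python
import random

def sample_active_links(neighbors, p_fail=0.3):
    """Emulate intermittent communication: each in-link fails independently with probability p_fail."""
    return {i: [j for j in nbrs if random.random() > p_fail] for i, nbrs in neighbors.items()}

# Tiny example over a 4-agent directed ring; an agent whose links all fail this round
# simply keeps its local parameters and skips the consensus mixing step.
ring = {0: [3], 1: [0], 2: [1], 3: [2]}
print(sample_active_links(ring))
```

Re-weighting the surviving links each round keeps the mixing step row-stochastic; when an agent hears from no neighbor, it falls back to its purely local update for that round.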
Research Conclusions and Implications
This study proposed a novel policy consensus-based distributed deep deterministic actor–critic algorithm (D3-AC). Theoretically, the algorithm combines local deterministic policy gradients and distributed consensus mechanisms to overcome the communication limitations of traditional centralized frameworks, offering asymptotic convergence guarantees for directed graphs and continuous spaces. Practically, the D3-AC algorithm demonstrates efficiency, scalability, and stability in complex multi-agent tasks.
Research Highlights
- A distributed learning solution for high-dimensional continuous spaces and directed communication graphs.
- Validation of parameter consensus updates for supporting local learning both theoretically and practically.
- A paradigm for achieving efficient multi-agent collaboration under constrained communication resources.
Practical Value
Ultimately, D3-AC provides theoretical foundations and practical guidance for real-world distributed multi-agent systems with limited communication capabilities, such as drone swarms, distributed sensor networks, and intelligent transportation. Future work will focus on enhancing the algorithm’s performance under local observability constraints and extending it to distributed safe MARL domains.