Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning

Deep Reinforcement Learning Empowers Agile Soccer Skills for Bipedal Robots

(Figure: overview of the training steps)

Background Introduction

One of the long-term goals of artificial intelligence (AI) research is to build agents that exhibit agility, dexterity, and understanding in the physical world, the way animals and humans do: they not only execute complex physical movements smoothly but also perceive and understand their environment, using their bodies to accomplish complex goals in the world. There have long been attempts to create embodied agents with sophisticated motor abilities, in both simulated and real environments. Progress has accelerated in recent years, driven largely by learning-based methods, and deep reinforcement learning (deep RL) in particular has proven effective at solving complex motor-control problems for both simulated characters and physical robots.

For humanoid and bipedal robots, however, learning-based methods have seen limited use because of particular challenges in stability, robot safety, the large number of degrees of freedom, and hardware availability. State-of-the-art work still relies largely on carefully tailored model predictive control, which limits the generality of these methods.

In this paper, Tuomas Haarnoja and collaborators from Google DeepMind use deep reinforcement learning (deep RL) to train low-cost, off-the-shelf miniature humanoid robots to play simplified one-versus-one soccer, exploring how well learned policies handle complex, dynamic whole-body control. The work both pushes the current limits of bipedal-robot motor control and demonstrates the effectiveness and potential of deep RL in this setting.

Source of the Paper

The paper, by Tuomas Haarnoja, Ben Moran, Guy Lever, and colleagues at Google DeepMind, was published in Science Robotics on April 10, 2024, and updated on April 17, 2024.

Research Process and Methods

Research Process

The research process in this paper comprises the following two main stages:

First Stage: Skill Training

In the first stage, the authors trained two separate skill policies: getting up from the ground and scoring goals. During scoring-skill training, the robot's objective was to score as many goals as possible against an untrained dummy opponent. A weighted reward function encouraged forward speed and interaction with the ball, while adding the constraints needed on physical robots to reduce the risk of hardware damage. For the get-up skill, key target postures were collected, and the policy was guided through these target poses so that the robot could stand up stably and without collisions.
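
As a rough illustration of how such a weighted reward could be combined, the Python sketch below sums hypothetical shaping terms; the term names and weights are invented for illustration and are not the coefficients used in the paper.

```python
# Hypothetical reward terms and weights -- illustrative only; the paper's
# actual shaping terms and coefficients differ.
REWARD_WEIGHTS = {
    "scoring": 1000.0,        # sparse bonus when a goal is scored
    "forward_velocity": 0.1,  # encourages moving toward the ball/goal
    "ball_interaction": 0.5,  # rewards getting close to and touching the ball
    "upright": 0.02,          # regularizer: keep the torso upright
    "joint_torque": -0.01,    # penalty to reduce the risk of hardware damage
}

def shaped_reward(terms: dict) -> float:
    """Combine per-timestep reward terms into a single weighted scalar."""
    return sum(REWARD_WEIGHTS[name] * value for name, value in terms.items())

# Example: one timestep in which the robot advances and touches the ball.
r = shaped_reward({
    "scoring": 0.0,
    "forward_velocity": 0.8,
    "ball_interaction": 1.0,
    "upright": 1.0,
    "joint_torque": 2.5,
})
print(r)
```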

Second Stage: Distillation and Self-Play

In the second stage, the researchers combined the first-stage skills through distillation and continued training in a multi-agent self-play setting, eventually producing a single agent capable of playing a full 1v1 match. During self-play, the opponent for each match was sampled at random from snapshots of the agent saved earlier in training. The individual skills were fused and strengthened through weighted skill-based reward shaping and multi-agent adversarial training.
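
A minimal sketch of sampling opponents from earlier snapshots during self-play might look like the following; `SnapshotPool` and its methods are hypothetical names, not code from the paper.

```python
import random

class SnapshotPool:
    """Stores periodic copies ("snapshots") of the learning agent's policy
    parameters and samples one uniformly as the opponent for each match."""

    def __init__(self):
        self._snapshots = []

    def add(self, policy_params):
        self._snapshots.append(policy_params)

    def sample_opponent(self):
        if not self._snapshots:
            raise ValueError("no snapshots stored yet")
        return random.choice(self._snapshots)

pool = SnapshotPool()
pool.add({"step": 1_000_000, "weights": "..."})   # placeholder parameters
pool.add({"step": 2_000_000, "weights": "..."})
opponent_params = pool.sample_opponent()          # opponent for the next match
```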

Training Details

The task was formulated as a partially observable Markov decision process (POMDP) and trained with the Maximum a Posteriori Policy Optimization (MPO) algorithm. Policies were first trained and evaluated in simulation and then transferred to the low-cost physical robots. The policy's observations included the robot's posture, linear acceleration, angular velocity, and gravity direction, together with the game state (relative positions and velocities of the robot, ball, opponent, and goals). Domain randomization and random perturbations during training made the resulting policy more robust and better able to transfer across domains.
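
The sketch below illustrates the general idea of domain randomization and observation noise under assumed parameter names and ranges; the quantities actually randomized in the paper, and their distributions, differ.

```python
import numpy as np

rng = np.random.default_rng(0)

class SimParams:
    """Container for a few illustrative simulator parameters."""
    floor_friction = 1.0
    torso_mass_scale = 1.0
    joint_damping_scale = 1.0
    control_latency_s = 0.0

def randomize_dynamics(sim: SimParams) -> None:
    """Resample physics parameters at episode start so the policy does not
    overfit to a single exact simulation model (ranges are illustrative)."""
    sim.floor_friction = rng.uniform(0.5, 1.0)
    sim.torso_mass_scale = rng.uniform(0.9, 1.1)
    sim.joint_damping_scale = rng.uniform(0.8, 1.2)
    sim.control_latency_s = rng.uniform(0.0, 0.03)

def perturb_observation(obs: np.ndarray) -> np.ndarray:
    """Add small Gaussian noise to the observations the policy receives."""
    return obs + rng.normal(scale=0.01, size=obs.shape)
```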

Experimental Results

Comparison and Performance Evaluation

The research team deployed the trained policy on real robots and ran a series of comparative experiments to evaluate its performance, generalization, and stability. The behaviors compared included walking, turning, getting up, and kicking. Relative to scripted, manually designed baseline controllers, the deep RL policy walked 181% faster, turned 302% faster, took 63% less time to get up, and kicked the ball 34% faster.
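
For clarity on how such figures are read, a 181% increase means the learned policy's walking speed is roughly 2.81 times the baseline's; the snippet below shows the usual relative-improvement formula with made-up speeds, not measurements from the paper.

```python
def relative_improvement(baseline: float, learned: float) -> float:
    """Percentage improvement of the learned policy over a scripted baseline."""
    return (learned - baseline) / baseline * 100.0

# A 181% increase in walking speed means the learned policy walks roughly
# 2.81x as fast as the baseline (the speeds here are illustrative only).
print(relative_improvement(0.10, 0.281))  # ~181.0
```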

Visualizing the robots' behavior trajectories with Uniform Manifold Approximation and Projection (UMAP) showed that the deep RL policy moves through pose space more continuously and flexibly than the scripted controllers. In addition, when episodes were initialized randomly within a range of starting conditions, the learned policy adapted flexibly to the opponent, exhibiting a variety of responses including interception and dynamic adjustment of its stride.
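
A minimal sketch of this kind of analysis using the umap-learn package is shown below; the trajectory data here is random and merely stands in for the behaviors recorded from the robots.

```python
import numpy as np
import umap  # from the umap-learn package

# Stand-in data: each row is one timestep's joint-configuration vector;
# a real analysis would use trajectories recorded from the robots.
trajectories = np.random.rand(5000, 20)

# Project the high-dimensional pose sequence to 2D so that the continuity of
# the learned policy's motions can be compared against scripted behaviors.
embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(trajectories)
print(embedding.shape)  # (5000, 2)
```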

Conclusion and Significance

This paper demonstrates the potential of deep RL for dynamic, complex whole-body control by training low-cost bipedal robots to play simplified 1v1 soccer. It shows that with appropriate regularization, domain randomization, and noise injection during training, high-quality policy transfer from simulation to the real world is achievable even on low-cost hardware. The study pushes the limits of bipedal-robot motor control and further validates the potential of deep reinforcement learning for dynamic robotic tasks.

Highlights of the Research

  1. Impressive Performance: Compared with manually designed baseline controllers, the deep RL policy performs markedly better, with significant advantages in acceleration, turning, getting up, and other behaviors.
  2. Intelligent Response Strategies: The agent autonomously discovered behaviors suited to specific game situations, including blocking shots, goalkeeping, and defensive positioning, which would be difficult to script by hand.
  3. Seamless Skill Integration: Pre-training the get-up and scoring skills improved exploration efficiency, and the final agent transitions smoothly between actions as the situation changes.

Future Work and Development Directions

This research demonstrates the feasibility of transferring learned robot motion control from simulation to reality and points to new directions for future work, such as multi-agent cooperative training and robot decision-making driven directly by visual information. Reducing dependence on external state information and further broadening the range of dynamic behaviors agents can exhibit will be important topics for future research.