Robust Multiobjective Reinforcement Learning Considering Environmental Uncertainties

Background

In recent years, Reinforcement Learning (RL) has proven effective on a wide range of complex tasks. However, many real-world decision-making and control problems involve multiple conflicting objectives, whose relative importance (preference) must be weighted differently in different scenarios. While Pareto optimal solutions are the ideal, environmental uncertainties (such as environmental changes or observation noise) can push agents toward suboptimal policies.

To address this issue, Xiangkun He, Jianye Hao, and colleagues published a paper titled “Robust Multiobjective Reinforcement Learning Considering Environmental Uncertainties,” which proposes a new multiobjective optimization paradigm: Robust Multiobjective Reinforcement Learning (RMORL), which explicitly accounts for environmental uncertainties. The paper was published in IEEE Transactions on Neural Networks and Learning Systems.

Source of the Paper

The authors of this paper are Xiangkun He, Jianye Hao, Xu Chen, Jun Wang, Xuewu Ji, and Chen Lv, affiliated with Nanyang Technological University, Tianjin University, Renmin University of China, University College London, and Tsinghua University. The manuscript was received on February 3, 2023, revised on August 7, 2023 and November 7, 2023, and accepted on May 1, 2024.

Research Process

Overview of the Research Process

  1. Modeling Environmental Perturbations: Environmental perturbations were modeled as an adversarial agent acting across the whole preference space by introducing a zero-sum game into the Multiobjective Markov Decision Process (MOMDP); a schematic form of the resulting objective appears after this list.
  2. Adversarial Defense Against Observation Perturbations: Designed an adversarial defense technique to handle perturbed observations, ensuring that the policy change caused by an observation perturbation stays within a specified range for any given preference.
  3. Experimental Evaluation: Evaluated the effectiveness of the proposed approach in five multiobjective environments with continuous action spaces.
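
As a rough schematic (an assumption about the formulation, not the paper's exact equations), the zero-sum game over the preference space can be read as a preference-conditioned max-min problem on a linearly scalarized return, where ω is a preference vector from the preference space Ω, π is the agent's policy, and ν is the adversary's perturbation policy:

```latex
% Schematic robust multiobjective objective (assumed linear scalarization):
% the agent maximizes the preference-weighted return while the adversary minimizes it,
% and this must hold for every preference vector in the preference space Omega.
\max_{\pi}\,\min_{\nu}\;
\mathbb{E}_{\pi,\nu}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\,
\boldsymbol{\omega}^{\top}\mathbf{r}(s_t,a_t)\right],
\qquad \forall\,\boldsymbol{\omega}\in\Omega
```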

Detailed Steps of the Experiment

  1. Modeling Environmental Perturbations with an Adversarial Agent:
    • Defined an adversarial-agent model of environmental perturbations to simulate harsh conditions (i.e., the worst case) across the entire preference space.
  2. Design of the Adversarial Defense Technique:
    • Formulated an adversarial defense technique based on a nonlinear constraint that limits the policy change caused by adversarial attacks on observations to a specified range.
    • Solved the resulting constrained optimization problem, which couples adversarial observation uncertainties with the agent's preference space, using Lagrangian dual theory.
  3. Algorithm Design:
    • Implemented the method on top of Deep Deterministic Policy Gradient (DDPG), yielding Robust Multiobjective DDPG (RMO-DDPG); a minimal sketch of such an update follows this list.
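
To make the moving parts concrete, the sketch below is a hypothetical, minimal PyTorch illustration (not the authors' implementation) of how a preference-conditioned actor, an observation adversary, and a Lagrange multiplier enforcing the policy-shift constraint could be combined in a single update. The critic's TD update, target networks, and the replay buffer are omitted, and all network sizes, the attack budget EPS, and the constraint threshold MAX_SHIFT are assumptions.

```python
# A minimal, hypothetical sketch of an RMO-DDPG-style update (not the authors' code).
# Assumed pieces: a preference-conditioned actor and critic, an observation adversary
# bounded by the budget EPS, and a Lagrange multiplier (dual variable) penalizing the
# policy shift caused by the perturbation.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, N_OBJ, EPS = 8, 2, 2, 0.05   # hypothetical sizes and attack budget
MAX_SHIFT = 0.1                                 # allowed action change under perturbation

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

actor = mlp(OBS_DIM + N_OBJ, ACT_DIM)            # pi(s, w) -> action
critic = mlp(OBS_DIM + ACT_DIM + N_OBJ, N_OBJ)   # Q(s, a, w) -> vector of objective returns
adversary = mlp(OBS_DIM + N_OBJ, OBS_DIM)        # nu(s, w) -> observation perturbation
log_lam = torch.zeros(1, requires_grad=True)     # dual variable, kept positive via exp()

opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)
opt_dual = torch.optim.Adam([log_lam], lr=1e-3)

def update(obs, pref):
    """One illustrative actor / adversary / dual-variable update on a batch."""
    sw = torch.cat([obs, pref], dim=-1)

    # Adversary step: craft a bounded observation perturbation that minimizes the
    # preference-weighted (linearly scalarized) Q-value of the resulting action.
    delta = EPS * torch.tanh(adversary(sw))
    a_pert = torch.tanh(actor(torch.cat([obs + delta, pref], dim=-1)))
    q_pert = (critic(torch.cat([obs, a_pert, pref], dim=-1)) * pref).sum(-1)
    opt_adv.zero_grad()
    q_pert.mean().backward()                     # adversary descends the scalarized return
    opt_adv.step()

    # Actor step: maximize the scalarized Q-value, penalized (via the dual variable)
    # when the action under attacked observations drifts too far from the clean action.
    a_clean = torch.tanh(actor(sw))
    a_attacked = torch.tanh(actor(torch.cat([(obs + delta).detach(), pref], dim=-1)))
    q_clean = (critic(torch.cat([obs, a_clean, pref], dim=-1)) * pref).sum(-1)
    shift = (a_clean - a_attacked).pow(2).sum(-1).mean()
    actor_loss = -q_clean.mean() + log_lam.exp().detach() * (shift - MAX_SHIFT)
    opt_actor.zero_grad()
    actor_loss.backward()
    opt_actor.step()

    # Dual ascent: grow the multiplier while the constraint is violated, shrink it otherwise.
    dual_loss = -(log_lam.exp() * (shift.detach() - MAX_SHIFT))
    opt_dual.zero_grad()
    dual_loss.backward()
    opt_dual.step()

# Example call on a random batch, with preferences drawn from the 2-D probability simplex.
obs = torch.randn(32, OBS_DIM)
pref = torch.rand(32, N_OBJ)
pref = pref / pref.sum(-1, keepdim=True)
update(obs, pref)
```

Parameterizing the multiplier as exp(log_lam) keeps it positive, and the dual-ascent step raises the penalty only while the measured policy shift exceeds the allowed threshold.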

Research Results

In various experimental environments, the main results are as follows:

  1. Learning Speed and Final Performance: Compared with classical and state-of-the-art baselines, RMO-DDPG achieved higher hypervolume in all experimental environments (the hypervolume metric is illustrated after this list), with a particularly large improvement over the baselines in MO-Hopper-v2.
  2. Policy Robustness: Across the five experimental environments, RMO-DDPG obtained higher robustness scores than the baseline methods; in the MO-Swimmer-v2 task, for example, its robustness index improved markedly over the baselines.
  3. Computational Cost: RMO-DDPG is computationally more expensive because it additionally optimizes the adversarial model and the dual variables during training.
  4. Pareto Front: RMO-DDPG approximates a broader range of Pareto solutions and recovers both convex and concave portions of the Pareto front in all tasks.
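
For context, hypervolume measures the size of the objective-space region dominated by an approximate Pareto front, relative to a fixed reference point; larger is better. A minimal two-objective sketch with illustrative values (not the paper's data) is:

```python
# 2-D hypervolume: area dominated by a set of solutions relative to a reference point.
# Both objectives are maximized and the reference point is dominated by every solution.
def hypervolume_2d(points, ref):
    pts = sorted(points, key=lambda p: p[0], reverse=True)  # sweep by decreasing first objective
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:                        # each non-dominated point adds a new strip of area
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

front = [(8.0, 1.0), (6.0, 4.0), (3.0, 7.0)]   # hypothetical Pareto-front approximation
print(hypervolume_2d(front, ref=(0.0, 0.0)))   # 35.0; a larger hypervolume means a better front
```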

Conclusion and Value

The RMORL algorithm proposed in this study can generate a robust Pareto-optimal policy for any given preference. The work not only fills a gap in existing multiobjective RL methods regarding environmental uncertainties and observation perturbations, but its performance across multiple experimental tasks also demonstrates its potential for improving Pareto-front quality and policy robustness.

Highlights of the Study

  1. Novelty of the Method: By introducing a zero-sum game into the MOMDP, a new multiobjective optimization paradigm was proposed, allowing a single trained model to approximate robust Pareto-optimal policies under environmental perturbations and observation disturbances.
  2. Adversarial Defense Technique: The proposed defense technique effectively bounds policy changes under observation perturbations, enhancing policy robustness across different preferences.
  3. Comprehensive Experiments: The effectiveness of the approach was demonstrated in five multiobjective environments, and its superiority over competitive baselines was shown on multiple metrics.

Additional Content

The paper also provides a detailed theoretical proof of the convergence of RMO-PI (robust multiobjective policy iteration) and explains the design of the multiobjective reward functions used in the experimental environments.