NPE-DRL: Enhancing Perception-Constrained Obstacle Avoidance with Nonexpert Policy-Guided Reinforcement Learning
Research on Improving UAV Obstacle Avoidance in Vision-Constrained Environments Based on Nonexpert Policy Reinforcement Learning
In recent years, unmanned aerial vehicles (UAVs) have gained widespread application in civilian fields such as package delivery, risk assessment, and emergency rescue, owing to their superior maneuverability and versatility. However, as the complexity, scope, and duration of UAV missions increase, autonomous navigation becomes significantly harder, particularly in crowded and highly uncertain environments. Traditional global navigation methods rely on complete environmental information, making them ill-suited to obstacle avoidance under vision-constrained conditions. This study aims to address obstacle avoidance in such settings and enhance UAV real-time navigation capabilities.
Obstacle avoidance algorithms based on deep reinforcement learning (DRL) offer end-to-end processing, reduced computational complexity, and improved adaptability and scalability, but they suffer from low sample efficiency, requiring large numbers of environment interactions before the policy converges. Meanwhile, existing sample-efficient methods rooted in imitation learning depend heavily on offline expert data, which can be difficult to acquire in hazardous environments. Improving UAV obstacle avoidance under limited data quality therefore remains an urgent scientific problem. Against this backdrop, a research team from Nanyang Technological University and Nanjing University of Aeronautics and Astronautics proposed a novel obstacle avoidance method based on Nonexpert Policy Enhanced DRL (NPE-DRL). The article was published in the January 2025 issue of IEEE Transactions on Artificial Intelligence, authored by Yuhang Zhang, Chao Yan, Jiaping Xiao, and Mir Feroskhan.
Research Background and Problem Statement
Traditional obstacle avoidance methods like Simultaneous Localization and Mapping (SLAM) often require extensive computational resources and perform poorly in low-texture environments. Additionally, due to size and payload constraints, active sensors such as radar, LiDAR, and RGB-D cameras are unsuitable for micro-UAVs, leaving monocular cameras as the preferred option. However, monocular cameras are limited in 3D spatial representation and obstacle detection. To enhance sample efficiency and address these technical challenges, the authors use nonexpert example data to guide the initial phase of reinforcement learning, combining this guidance with the strengths of conventional Deep Q-Network (DQN) methods to improve obstacle avoidance performance.
Method Overview and Model Architecture
Overall Framework and Workflow
The core of the NPE-DRL model comprises two main components: 1) a core DRL algorithm; and 2) a manually designed nonexpert teacher. The workflow proceeds in three stages (a minimal sketch of the guided action selection appears after the list):
1. Nonexpert Policy Generation: The Artificial Potential Field (APF) method is first used to generate a nonexpert policy for initial guidance. Instead of relying on high-quality expert samples, this heuristic obstacle avoidance approach provides foundational guidance for the DRL model's initial learning and exploration.
2. Learning and Exploration: During the early training phases, the reinforcement learning agent primarily mimics the nonexpert policy's behavior. In later phases, it gradually transitions to independent environmental exploration, thereby improving flexibility and adaptability.
3. Action Discretization: Continuous action spaces are mapped into discrete action spaces using fuzzy logic, significantly improving sample efficiency and reducing policy convergence time.
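As a rough illustration of this workflow, the sketch below shows how an APF-style nonexpert teacher could suggest actions and how the agent might follow it with a probability that anneals over training. The function names, force gains, nearest-action mapping (a stand-in for the fuzzy-logic discretization), and the linear annealing schedule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Illustrative sketch of APF-guided action selection; positions are 2D numpy arrays.
def apf_nonexpert_action(uav_pos, goal_pos, obstacles,
                         k_att=1.0, k_rep=2.0, influence_radius=3.0):
    """Suggest a heading from a simple attractive/repulsive potential field."""
    force = k_att * (goal_pos - uav_pos)                  # attraction toward the goal
    for obs in obstacles:
        diff = uav_pos - obs
        dist = np.linalg.norm(diff) + 1e-6
        if dist < influence_radius:                       # repulsion only near obstacles
            force += k_rep * (1.0 / dist - 1.0 / influence_radius) * diff / dist**3
    return np.arctan2(force[1], force[0])                 # desired heading angle (rad)

def to_discrete_action(heading, action_headings):
    """Map the continuous APF heading to the nearest discrete action
    (a crude stand-in for the fuzzy-logic discretization described above)."""
    diffs = [abs((heading - h + np.pi) % (2 * np.pi) - np.pi) for h in action_headings]
    return int(np.argmin(diffs))

def select_action(q_values, nonexpert_action, episode, guide_episodes=500):
    """Follow the nonexpert teacher with a probability that decays over training."""
    p_guide = max(0.0, 1.0 - episode / guide_episodes)    # assumed linear annealing
    if np.random.rand() < p_guide:
        return nonexpert_action                           # imitate the nonexpert policy early on
    return int(np.argmax(q_values))                       # later, act greedily on learned Q-values
```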
Network Structure Design
To address partial observability caused by the limited field of view (FOV) of monocular cameras, the research team designed a dual-input deep neural network architecture (a rough sketch follows the list):
1. Inputs: RGB images captured by a forward-facing monocular camera, together with the relative position between the UAV and the target (distance and angle). The images are normalized to 224×224 pixels and augmented with Gaussian noise for robustness.
2. Encoder: two 2D convolutional layers extract image features. The resulting feature vector is concatenated with the positional vector and passed through fully connected layers for further processing.
3. Q-head: the model incorporates Double DQN and Dueling DQN architectures, separating the state-value function and the action-advantage function. This structure improves learning efficiency and the accuracy of the obstacle avoidance policy approximation.
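A minimal PyTorch sketch of such a dual-input dueling architecture is given below. Layer widths, kernel sizes, and the number of discrete actions are assumptions for illustration; the article summary does not specify them.

```python
import torch
import torch.nn as nn

class DualInputDuelingDQN(nn.Module):
    """Sketch of a dual-input dueling Q-network: image encoder + (distance, angle) goal vector.
    Layer sizes are illustrative assumptions, not the authors' exact architecture."""
    def __init__(self, num_actions=5):
        super().__init__()
        # Two-layer convolutional encoder for the 224x224 RGB image.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        enc_dim = self._encoded_dim()
        # Fuse image features with the 2-D relative-goal vector.
        self.fc = nn.Sequential(nn.Linear(enc_dim + 2, 256), nn.ReLU())
        # Dueling heads: state value V(s) and action advantage A(s, a).
        self.value_head = nn.Linear(256, 1)
        self.advantage_head = nn.Linear(256, num_actions)

    def _encoded_dim(self):
        with torch.no_grad():
            return self.encoder(torch.zeros(1, 3, 224, 224)).shape[1]

    def forward(self, image, goal_vec):
        h = torch.cat([self.encoder(image), goal_vec], dim=1)
        h = self.fc(h)
        v = self.value_head(h)
        a = self.advantage_head(h)
        # Standard dueling combination: Q = V + (A - mean(A)).
        return v + a - a.mean(dim=1, keepdim=True)
```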
Simulations and Experiments
Simulation Setup
To validate the effectiveness of NPE-DRL, the research team tested the algorithm in both simple and complex simulated environments:
1. Simple environment: 10 cylindrical obstacles, each 1 m in diameter and 2 m tall, within a 30×15 m test area.
2. Complex environment: furniture-shaped obstacles ranging from 0.4 m to 2 m in size, within the same area.
Evaluation metrics include success rate, collision rate, timeout rate, step count, total flight distance, and total energy consumption. During training, the Adam optimizer was used with a learning rate of 0.0001, an experience replay buffer size of 100,000, and a mini-batch size of 64.
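For reference, the reported training hyperparameters could be wired up roughly as follows. Only the optimizer choice, learning rate, replay buffer size, and mini-batch size come from the text; the buffer and sampling helpers are placeholders, and the network is the dual-input sketch from the previous section.

```python
import random
from collections import deque
import torch

# Hyperparameters reported in the article; everything else is a placeholder sketch.
LEARNING_RATE = 1e-4
REPLAY_BUFFER_SIZE = 100_000
BATCH_SIZE = 64

replay_buffer = deque(maxlen=REPLAY_BUFFER_SIZE)            # experience replay buffer
q_network = DualInputDuelingDQN()                            # dual-input network sketched above
optimizer = torch.optim.Adam(q_network.parameters(), lr=LEARNING_RATE)

def store(transition):
    """Append a (state, action, reward, next_state, done) tuple to the buffer."""
    replay_buffer.append(transition)

def sample_minibatch():
    """Draw a random mini-batch once enough experience has been collected."""
    if len(replay_buffer) < BATCH_SIZE:
        return None
    return random.sample(replay_buffer, BATCH_SIZE)
```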
Simulation Results
Compared to baseline algorithms (including Behavioral Cloning, D3QN, and D3QN-LfD), NPE-DRL achieved significantly higher success rates, particularly in environments with dense and complex obstacles (e.g., a 72% success rate versus 34% for D3QN and 39% for D3QN-LfD). The simulation results showed:
1. NPE-DRL generates smoother trajectories, avoiding large deviations and greatly enhancing navigation efficiency.
2. Compared to traditional reinforcement learning methods with random initialization, NPE-DRL converged quickly (around 500 episodes), reflecting its superior sample efficiency.
Real-World Physical Experiments
To further validate the model’s generalization, the authors conducted flight experiments in real indoor environments. The test area was an 8×7×4 m flight space containing white rectangular obstacles and a marked box as the goal. A Tello EDU drone equipped with a monocular camera was used, and the experiments were monitored via the OptiTrack motion capture system for real-time tracking of UAV and obstacle positions.
In 60 trials, the UAV achieved a success rate of 81.67%, demonstrating robust adaptability and strong agreement with the simulation results. The slight performance gap relative to simulation was attributed to hardware limitations and the sim-to-real gap.
Research Conclusions and Significance
This study introduces the NPE-DRL algorithm, providing an efficient obstacle avoidance solution for monocular UAVs in vision-constrained environments. By incorporating nonexpert knowledge into reinforcement learning, the algorithm significantly enhances sample efficiency and obstacle avoidance performance. Moreover, the innovative use of fuzzy logic for action space discretization facilitates more efficient real-time decision-making. Experimental results show strong robustness and generalization capabilities across diverse environments.
The research presents the following significance:
1. Scientific value: provides a novel theoretical reference for UAV obstacle avoidance under vision-constrained conditions.
2. Practical applications: applicable to complex UAV tasks such as emergency rescue and package delivery.
3. Methodological insights: combining a nonexpert policy with DRL offers a fresh approach to learning optimization with low-quality data.
Future research directions may include extending NPE-DRL to multi-agent systems, exploring collaborative perception and decision-making among UAVs to tackle more challenging dynamic environments.