Episodic Memory-Double Actor–Critic Twin Delayed Deep Deterministic Policy Gradient

Academic Background

Deep Reinforcement Learning (DRL) has achieved remarkable success in fields such as gaming, robotics, navigation, computer vision, and finance. However, existing DRL algorithms generally suffer from low sample efficiency, requiring vast amounts of data and training steps to reach the desired performance. In continuous action tasks in particular, the high-dimensional state-action space makes it difficult for traditional DRL algorithms to use episodic memory effectively to guide action selection, which further reduces sample efficiency.

Episodic memory is a non-parametric control method that improves sample efficiency by memorizing high-reward historical experiences. In discrete action tasks, episodic memory can directly evaluate every possible action and select the one with the highest estimated value. In continuous action tasks, however, the action space is infinite, so traditional episodic memory methods cannot be applied directly to action selection. Effectively leveraging episodic memory in continuous action tasks to improve sample efficiency has therefore become an important open problem in DRL research.

Source of the Paper

This paper is co-authored by Man Shu, Shuai Lü, Xiaoyu Gong, Daolong An, and Songlin Li, affiliated with the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education at Jilin University, the Changchun Institute of Optics, Fine Mechanics and Physics, and the College of Computer Science and Technology at Jilin University. The paper was published in 2025 in the journal Neural Networks, titled Episodic Memory-Double Actor–Critic Twin Delayed Deep Deterministic Policy Gradient.

Research Content

Research Process

1. Research Problem and Objective

The primary objective of this study is to address the issue of low sample efficiency in DRL algorithms for continuous action tasks. The authors propose a novel framework called “Episodic Memory-Double Actor-Critic (EMDAC),” aiming to guide action selection through episodic memory, thereby improving sample efficiency. Specifically, the EMDAC framework combines episodic memory and dual critic networks to evaluate the value of state-action pairs, reducing the negative impact of critic network estimation bias on sample efficiency.

2. EMDAC Framework Design

The core of the EMDAC framework lies in using episodic memory together with dual critic networks to evaluate action values. The specific process is as follows:

  - Dual Actor Networks: The EMDAC framework contains two actor networks, each paired with its own critic network. Each actor network outputs a candidate action.
  - Episodic Memory: Episodic memory stores value estimates of high-reward state-action pairs encountered in the past. The authors design a Kalman Filter-based method for updating episodic memory, enabling more accurate estimation of state-action values.
  - Action Selection: During action selection, the EMDAC framework combines episodic memory and the critic networks to evaluate the two candidate actions and selects the one with the higher estimated value, as illustrated in the sketch below.
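To make the selection step concrete, here is a minimal Python sketch. The names `actor1`, `actor2`, `critic1`, `critic2`, and `em_lookup`, the additive combination of critic and memory estimates, and the weight `alpha` are illustrative assumptions, not the authors' exact interface or formula.

```python
import torch

def select_action(state, actor1, actor2, critic1, critic2, em_lookup, alpha=0.5):
    """Pick between the two actors' candidate actions by a combined value estimate.

    `em_lookup(state, action)` is a hypothetical episodic-memory query returning a
    stored value estimate for the most similar state-action entry; `alpha` is an
    assumed weighting hyperparameter.
    """
    with torch.no_grad():
        a1 = actor1(state)  # candidate action from the first actor network
        a2 = actor2(state)  # candidate action from the second actor network
        # Score each candidate with its paired critic plus the episodic-memory estimate.
        v1 = critic1(state, a1).item() + alpha * em_lookup(state, a1)
        v2 = critic2(state, a2).item() + alpha * em_lookup(state, a2)
    return a1 if v1 >= v2 else a2
```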

3. Kalman Filter-Based Episodic Memory

Traditional mean updating methods assign equal weight to experiences collected at early and later stages when updating episodic memory, leading to significant estimation bias. To address this issue, the authors propose a Kalman Filter-based episodic memory updating method. This method assigns different weights to experiences collected at different training stages, thereby improving the accuracy of episodic memory.
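As an illustration, a scalar Kalman-filter update of a single memory entry might look like the sketch below; `init_var` and `obs_var` are assumed hyperparameters, and the exact formulation in the paper may differ.

```python
class KalmanMemoryEntry:
    """One episodic-memory slot holding a value estimate and its uncertainty.

    A scalar Kalman-filter sketch: each observed return is weighted by the Kalman
    gain rather than the equal weight used by plain mean updating.
    """
    def __init__(self, init_value=0.0, init_var=1.0):
        self.value = init_value  # current value estimate for the state-action pair
        self.var = init_var      # uncertainty of that estimate

    def update(self, observed_return, obs_var=0.1):
        # Gain is large when the estimate is uncertain or the observation noise is low.
        gain = self.var / (self.var + obs_var)
        self.value += gain * (observed_return - self.value)
        self.var *= (1.0 - gain)  # posterior uncertainty shrinks after each update
        return self.value
```

Under this sketch, passing a smaller `obs_var` for returns collected later in training (when the policy is better) gives those later experiences a larger effective weight than early ones.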

4. Intrinsic Reward Based on Episodic Memory

To enhance the exploration capability of the agent, the authors design an intrinsic reward based on episodic memory. This reward encourages the agent to explore more novel state-action pairs, thereby avoiding local optima.
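A simple count-based sketch of such a bonus is shown below; the inverse-square-root form, the neighbourhood size `k`, and the scale `beta` are assumptions for illustration, not the paper's exact reward.

```python
import numpy as np

def intrinsic_reward(query, memory_keys, memory_counts, beta=0.1, k=5):
    """Novelty bonus derived from episodic memory (illustrative sketch).

    `memory_keys` is an array of stored state-action vectors and `memory_counts`
    their visit counts; the bonus shrinks as similar pairs accumulate in memory.
    """
    if len(memory_keys) == 0:
        return beta                                # everything is novel at the start
    dists = np.linalg.norm(memory_keys - query, axis=1)
    nearest = np.argsort(dists)[:k]                # k most similar stored entries
    visit_count = memory_counts[nearest].sum()     # pseudo-count of similar experience
    return beta / np.sqrt(visit_count + 1.0)       # rarer pairs earn a larger bonus
```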

5. EMDAC-TD3 Algorithm

The authors apply the EMDAC framework, Kalman Filter-based episodic memory, and intrinsic reward to the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, proposing the EMDAC-TD3 algorithm. The algorithm is evaluated in the Mujoco environments of OpenAI Gym, and the results show that it outperforms baseline algorithms in terms of sample efficiency.
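The pieces above can be tied together in a single interaction step. The outline below uses hypothetical interfaces (`agent`, `memory`, `replay_buffer`, the bonus weight `eta`, and the classic Gym step API); it is a sketch of how the components might fit into TD3, not the authors' implementation.

```python
import numpy as np
import torch

def emdac_td3_step(env, state, agent, memory, replay_buffer, eta=1.0):
    """One environment step of an EMDAC-TD3-style agent (assumed interfaces)."""
    state_t = torch.as_tensor(state, dtype=torch.float32)
    # EMDAC action selection: take the better of the two actors' candidates.
    action = select_action(state_t, agent.actor1, agent.actor2,
                           agent.critic1, agent.critic2, memory.lookup)
    action_np = action.cpu().numpy()
    next_state, ext_reward, done, _ = env.step(action_np)  # classic Gym step API assumed
    # Augment the extrinsic reward with the episodic-memory novelty bonus.
    bonus = intrinsic_reward(np.concatenate([state, action_np]),
                             memory.keys, memory.counts)
    replay_buffer.add(state, action_np, ext_reward + eta * bonus, next_state, done)
    agent.update(replay_buffer)   # standard TD3 critic/actor updates for both pairs
    memory.add(state, action_np)  # record the pair for future lookups and bonuses
    return next_state, done
```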

Key Results

1. Improvement in Sample Efficiency

Experimental results in the Mujoco environments demonstrate that the EMDAC-TD3 algorithm significantly outperforms the baseline TD3 algorithm in terms of sample efficiency. Specifically, EMDAC-TD3 achieves higher rewards with the same number of training steps or reaches the same performance with fewer training steps.

2. Final Performance Comparison

Compared to state-of-the-art episodic control and actor-critic algorithms, EMDAC-TD3 excels on final-performance metrics, including the median, interquartile mean, and mean of final rewards. On average, EMDAC-TD3 improves performance by 11.01% over TD3.
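As a concrete example of one of these aggregates, the interquartile mean can be computed as in the generic sketch below (not code from the paper).

```python
import numpy as np

def interquartile_mean(final_scores):
    """Interquartile mean (IQM): the mean of the middle 50% of scores across runs."""
    scores = np.sort(np.asarray(final_scores, dtype=float))
    q1, q3 = np.percentile(scores, [25, 75])
    middle = scores[(scores >= q1) & (scores <= q3)]  # drop the top and bottom quartiles
    return middle.mean()

# Example with ten hypothetical final returns from independent training runs.
print(interquartile_mean([310, 295, 342, 288, 401, 275, 330, 318, 299, 360]))
```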

3. Effectiveness of Episodic Memory

By comparing the performance of EMDAC-TD3 with its variant algorithms, the authors validate the effectiveness of episodic memory in improving sample efficiency. The results show that combining episodic memory and critic networks to evaluate state-action pair values significantly enhances sample efficiency.

4. Exploration Capability of Intrinsic Reward

Experimental results in the SparseMujoco environments demonstrate that the intrinsic reward based on episodic memory effectively enhances the agent’s exploration capability, enabling better performance in sparse reward tasks.

Conclusion

This study proposes a novel EMDAC framework that improves the sample efficiency of DRL algorithms in continuous action tasks by combining episodic memory and dual critic networks. The Kalman Filter-based episodic memory updating method and intrinsic reward design further enhance the algorithm’s performance. Experimental results show that EMDAC-TD3 outperforms state-of-the-art algorithms in both sample efficiency and final performance.

Research Highlights

  1. Innovative Framework: The EMDAC framework is the first to combine episodic memory with dual critic networks to evaluate action values in continuous action tasks, overcoming the difficulty that traditional episodic memory methods cannot be applied directly to continuous action selection.
  2. Kalman Filter-Based Episodic Memory: By assigning different weights to experiences collected at different stages, the accuracy of episodic memory is improved.
  3. Intrinsic Reward Design: The intrinsic reward based on episodic memory enhances the agent’s exploration capability, enabling better performance in sparse reward tasks.
  4. Extensive Experimental Validation: Experimental results in the Mujoco and SparseMujoco environments demonstrate that EMDAC-TD3 outperforms state-of-the-art algorithms in both sample efficiency and final performance.

Research Value

This study not only proposes a novel DRL framework theoretically but also validates its effectiveness in practical tasks through experiments. The introduction of the EMDAC framework provides new insights for DRL algorithms in continuous action tasks, with broad application prospects, particularly in fields such as robotics control, autonomous driving, and financial trading.