Geometry-enhanced pretraining on interatomic potentials

Introduction

Molecular dynamics (MD) simulations play an important role in physics, chemistry, biology, and materials science, providing insight into atomic-level processes. Their accuracy and efficiency depend on the interatomic potential chosen to describe the interactions within the molecular system. Classical MD uses empirical force fields with fitted parameters, which are computationally cheap but limited in accuracy; first-principles MD obtains accurate potentials by solving the Schrödinger equation, but at enormous computational cost. Machine learning interatomic potentials (MLIPs), which fit machine learning models to energies and forces from first-principles calculations, have therefore emerged as a promising alternative, achieving near ab initio accuracy at far lower cost.

The performance and generality of MLIPs are limited by the scarcity of labeled data, since producing labels requires expensive first-principles calculations. Various self-supervised learning methods have been explored that learn general representations from large amounts of unlabeled data and then fine-tune on limited labeled data to extract task-specific information. In the MLIP domain, however, existing methods struggle both to obtain suitable pretraining datasets and to design effective pretraining tasks.

Paper Overview

This paper proposes GPIP, a geometry-enhanced self-supervised learning framework for MLIPs. The framework consists of two core components:

  1. Geometric Structure Generation: Efficiently generate large-scale molecular geometric structures as unlabeled pretraining data through classical molecular dynamics simulations with empirical force fields.

  2. Geometry-enhanced Pretraining: Design three complementary self-supervised pretraining tasks – masking, denoising, and contrastive learning – that together capture both topological and spatial structural information from the generated unlabeled structures.

Together, these two steps let MLIPs achieve significant performance gains at only a small additional computational cost. GPIP does not rely on any pre-existing dataset: it only requires generating MD trajectories for the target molecular system, so it is not constrained by the limited coverage of existing datasets and generalizes well.

The paper evaluates GPIP on a wide range of benchmarks, from small molecules to complex periodic systems, demonstrating its effectiveness and robustness. The authors also develop a new electrolyte dataset, containing more element types and more complex configurations, for a more comprehensive evaluation of MLIP capabilities.

Research Workflow

a) Overview

  1. Use classical molecular dynamics simulations to generate a large number of geometric configurations for the target molecular system as unlabeled data.

  2. Apply three geometry-enhanced self-supervised learning tasks – masking, denoising, and contrastive learning – to the generated unlabeled configurations, pretraining a graph neural network (GNN) to capture topological and spatial structural information.

  3. Fine-tune the pretrained GNN on a small amount of data with first-principles labels to learn task-relevant information.

b) Details

Unlabeled Data Generation

In the paper, for four molecular systems of varying complexity (MD17, ISO17, liquid water, and electrolytes), the classical molecular dynamics package LAMMPS is used with empirical force fields (e.g., OPLS-AA and TIP3P) to run MD trajectories at different temperatures, from which a large number of molecular configurations are sampled as unlabeled pretraining datasets.
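
As a concrete illustration, configurations can be subsampled from such a trajectory with a few lines of Python. This is a minimal sketch, not the paper's tooling: the file name, stride, and output format are assumptions.

```python
# Sample unlabeled configurations from a LAMMPS dump file using ASE.
from ase.io import read, write

# Read every 10th frame of a text-format LAMMPS dump produced by an
# empirical force-field run (file name and stride are illustrative).
frames = read("trajectory.lammpstrj", index="::10", format="lammps-dump-text")

# Store positions only; no energies or forces are needed for pretraining.
write("pretrain_configs.extxyz", frames)
print(f"Sampled {len(frames)} unlabeled configurations.")
```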

Self-Supervised Learning Tasks

  1. Masking: Randomly mask a portion of the atom features and perturb the corresponding coordinates with noise, then train the GNN to infer the masked features from the visible atoms, forcing it to capture topological information.

  2. Denoising: Randomly mask portions of the atom features, add noise to the coordinates of the whole configuration, and train the GNN to predict the added noise rather than reconstruct the original configuration, forcing it to capture spatial structural information.

  3. Contrastive Learning with a 3D Network: Build a 3D network that captures the molecule's global 3D structure and maximize the mutual information between the GNN and 3D-network outputs, so that the GNN also learns global 3D information (a minimal sketch of all three objectives follows this list).
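
The sketch below illustrates the three objectives in PyTorch. It is a toy, not the paper's implementation: the encoder, heads, masking ratio, noise scale, and the InfoNCE surrogate for the mutual-information objective are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class ToyEncoder(nn.Module):
    """Stand-in for the pretrained GNN: embeds atom types, mixes in coordinates."""
    def __init__(self, num_types=10, dim=64):
        super().__init__()
        self.embed = nn.Embedding(num_types, dim)
        self.coord = nn.Linear(3, dim)
        self.mix = nn.Linear(dim, dim)

    def forward(self, types, pos):          # types: (B, N), pos: (B, N, 3)
        h = self.embed(types) + self.coord(pos)
        return torch.relu(self.mix(h))      # per-atom features, (B, N, dim)

B, N, D = 8, 20, 64                         # batch of 8 toy configurations
enc = ToyEncoder(dim=D)
type_head = nn.Linear(D, 10)                # recovers masked atom types
noise_head = nn.Linear(D, 3)                # predicts per-atom coordinate noise
proj = nn.Linear(D, D)                      # projection for the contrastive term

types = torch.randint(1, 10, (B, N))        # type index 0 is reserved as [MASK]
pos = torch.randn(B, N, 3)

# 1) Masking: hide a fraction of atom features, recover them from the context.
mask = torch.rand(B, N) < 0.15
h_masked = enc(types.masked_fill(mask, 0), pos)
loss_mask = F.cross_entropy(type_head(h_masked[mask]), types[mask])

# 2) Denoising: perturb all coordinates, predict the added noise itself.
noise = 0.1 * torch.randn_like(pos)
h_noisy = enc(types, pos + noise)
loss_denoise = F.mse_loss(noise_head(h_noisy), noise)

# 3) Contrastive: align pooled features of two views of each configuration
#    (an InfoNCE surrogate for the GNN / 3D-network mutual-information term).
z1 = F.normalize(proj(h_masked.mean(dim=1)), dim=-1)
z2 = F.normalize(proj(enc(types, pos).mean(dim=1)), dim=-1)
logits = z1 @ z2.T / 0.1                    # (B, B) similarity matrix
loss_contrast = F.cross_entropy(logits, torch.arange(B))

loss = loss_mask + loss_denoise + loss_contrast
loss.backward()                             # gradients for one pretraining step
```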

Fine-tuning

After self-supervised pretraining with the three tasks, the pretrained GNN is fine-tuned on a small amount of data with first-principles labels to learn task-relevant information, namely energies and forces.
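
Fine-tuning typically fits both quantities jointly, with forces obtained as the negative gradient of the predicted energy with respect to atomic positions. The snippet below continues the previous sketch (same imports, encoder, and toy batch); the energy head and force weight `w_f` are assumptions, not the paper's settings.

```python
energy_head = nn.Linear(D, 1)  # maps per-atom features to per-atom energies

def finetune_loss(types, pos, e_ref, f_ref, w_f=100.0):
    """Joint energy/force loss; forces come from autograd, F = -dE/dpos."""
    pos = pos.clone().requires_grad_(True)
    h = enc(types, pos)                          # (B, N, D) per-atom features
    energy = energy_head(h).sum(dim=(1, 2))      # total energy per configuration
    forces = -torch.autograd.grad(energy.sum(), pos, create_graph=True)[0]
    return F.mse_loss(energy, e_ref) + w_f * F.mse_loss(forces, f_ref)

# Toy labeled batch standing in for first-principles data.
e_ref, f_ref = torch.randn(B), torch.randn(B, N, 3)
finetune_loss(types, pos, e_ref, f_ref).backward()
```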

c) Research Conclusions

  1. GPIP significantly improves the accuracy and generalization ability of MLIPs across various benchmarks.

  2. GPIP incurs very low computational cost, making it a more cost-effective route to accuracy than generating additional first-principles labeled data.

  3. The three self-supervised tasks are complementary; a single task has limited effectiveness, but combining them effectively captures both topological and spatial information of the configurations.

  4. GPIP generalizes well: because it does not rely on any pre-existing dataset, it is not constrained by the limited coverage of such datasets.

d) Research Significance

  1. Scientific Significance: Proposes a low-cost, efficient, and general pretraining paradigm for MLIPs, addressing limitations in pretraining data and task design.

  2. Application Value: Improves the simulation accuracy of MLIPs for various molecular systems, advancing the application of MD simulations in multiple fields.

e) Research Innovations

  1. The novel idea of pretraining on unlabeled MD configuration data avoids expensive first-principles calculations.

  2. Unique design of a multi-task self-supervised learning framework combining masking, denoising, and contrastive tasks.

  3. Development of a new electrolyte dataset for more comprehensive evaluation of MLIPs capabilities.

  4. Comprehensive experimental evaluation covering a wide range of benchmarks and molecular complexities.

This research provides an effective route to low-cost, high-performance MLIPs, with innovations spanning both self-supervised learning and molecular simulation.