Decoupled Peak Property Learning for Efficient and Interpretable Electronic Circular Dichroism Spectrum Prediction
Academic Background
Electronic circular dichroism (ECD) spectroscopy is a crucial tool for studying molecular chirality, particularly in asymmetric organic synthesis and the pharmaceutical industry, where it is used to determine the absolute configurations of chiral molecules. However, existing ECD spectrum prediction methods face two main challenges, data scarcity and insufficient interpretability, which together lead to low trust in prediction results. At present, ECD spectrum prediction relies on time-consuming quantum chemical calculations that involve molecular structure extraction, conformational search, structural optimization, time-dependent density functional theory (TD-DFT) calculations, and Boltzmann weighting. This workflow demands deep expertise from experimental chemists and consumes significant computational resources and time. Accelerating theoretical ECD spectrum calculations while improving prediction accuracy and interpretability has therefore become an urgent problem.
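The final Boltzmann-weighting step of the conventional workflow can be illustrated with a short sketch. It assumes that relative conformer energies are given in kcal/mol, that every conformer spectrum is sampled on the same wavelength grid, and that the temperature is 298.15 K; the function names and the example numbers are hypothetical and are not taken from the paper.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)

def boltzmann_weights(rel_energies_kcal, temperature=298.15):
    """Return normalized Boltzmann populations for conformer energies (kcal/mol)."""
    e_min = min(rel_energies_kcal)
    factors = [math.exp(-(e - e_min) / (R_KCAL * temperature)) for e in rel_energies_kcal]
    total = sum(factors)
    return [f / total for f in factors]

def weighted_spectrum(spectra, weights):
    """Average per-conformer spectra that share one wavelength grid."""
    n_points = len(spectra[0])
    return [sum(w * s[i] for w, s in zip(weights, spectra)) for i in range(n_points)]

# Illustrative values only: three conformers, three grid points each
energies = [0.0, 0.5, 2.0]
spectra = [[1.0, -2.0, 0.5], [0.8, -1.5, 0.2], [-0.3, 0.4, 0.1]]
weights = boltzmann_weights(energies)
averaged = weighted_spectrum(spectra, weights)
```

Because the lowest-energy conformer dominates the population, its spectrum contributes most to the averaged curve, which is why every relevant conformer has to be optimized and computed before the final spectrum is available.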
Source of the Paper
This paper is co-authored by Hao Li, Da Long, Li Yuan, Yu Wang, Yonghong Tian, Xinchang Wang, and Fanyang Mo, from Peking University Shenzhen Graduate School, Xiamen University, and Peking University. The paper was published on December 4, 2024, in the journal Nature Computational Science.
Research Process
1. Dataset Construction
Process Description
To address data scarcity in ECD spectrum prediction, the research team first constructed a large-scale ECD spectrum dataset called CMCDS. The dataset contains the ECD spectra and Simplified Molecular Input Line Entry System (SMILES) strings of 22,190 chiral molecules. The spectra were calculated with the Gaussian 16 software package in two steps: molecular structure optimization (B3LYP/6-31G level) followed by ECD spectrum calculation (CAM-B3LYP/6-31G(d) level, nstates=20, that is, 20 excited states per molecule).
Research Subjects and Processing
The subjects of study were chiral molecules extracted from the literature on asymmetric catalysis. Molecular structures were converted to MDL Molfile format using the RDKit package, and Gaussian input files were then generated from them in batches.
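A minimal sketch of the batch-generation step is given below. It only formats Gaussian input text from atom symbols and Cartesian coordinates; in the workflow described above those coordinates would come from the RDKit-generated structures, a step omitted here to keep the example dependency-free. The helper name, the file-naming scheme, and the example molecule are hypothetical, and the route lines are one plausible way to express the stated levels of theory, not the authors' actual input files.

```python
def gaussian_input(title, atoms, route, charge=0, multiplicity=1):
    """Format one Gaussian job: route, title, charge/multiplicity, coordinates."""
    lines = [route, "", title, "", f"{charge} {multiplicity}"]
    for symbol, x, y, z in atoms:
        lines.append(f"{symbol:<2} {x:12.6f} {y:12.6f} {z:12.6f}")
    lines.append("")  # Gaussian expects a blank line after the geometry
    return "\n".join(lines) + "\n"

# Assumed route sections matching the levels of theory quoted in the text
OPT_ROUTE = "# opt b3lyp/6-31g"
ECD_ROUTE = "# td=(nstates=20) cam-b3lyp/6-31g(d)"

# Hypothetical placeholder geometry (water), used only to show the file layout
molecules = {
    "mol_0001": [("O", 0.0, 0.0, 0.117), ("H", 0.0, 0.757, -0.469), ("H", 0.0, -0.757, -0.469)],
}

# Build one optimization job per molecule, keyed by a hypothetical file name
jobs = {f"{name}_opt.gjf": gaussian_input(name, atoms, OPT_ROUTE)
        for name, atoms in molecules.items()}
```

The ECD job for each molecule would be built the same way with `ECD_ROUTE`, using the optimized geometry from the first job.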
Experimental Results
The CMCDS dataset was generated through large-scale theoretical calculations, providing high-quality data support for subsequent deep learning model training.
2. Construction of the ECDformer Model
Process Description
The research team proposed a Transformer-based deep learning model, named ECDformer, for efficient and interpretable ECD spectrum prediction. ECDformer decomposes ECD spectra into peak entities and predicts the number, position, and symbol (positive or negative sign) of the peaks separately. The model architecture includes four main modules:
- Molecular Feature Extraction Module: Based on a geometry-enhanced graph neural network (GeoGCN), it extracts geometric and descriptor information from the molecule’s atom-bond and bond-angle graphs.
- Peak Property Learning Module: Uses a Transformer encoder structure to extract peak-related information from molecular features.
- Peak Property Prediction Module: Predicts the number, position, and symbol of peaks separately.
- Spectrum Rendering Module: Reconstructs the ECD spectrum from the predicted peak properties.
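The idea of treating a spectrum as a set of peak entities can be sketched as follows. This is an illustration under stated assumptions, not the model's actual rendering module: each peak is drawn as a unit-height Gaussian band of fixed width, and peaks are recovered as local extrema of the curve. The function names, the band width, and the example peaks are hypothetical.

```python
import math

def extract_peaks(wavelengths, intensities):
    """Return (position, symbol) for each local extremum; symbol is +1 or -1."""
    peaks = []
    for i in range(1, len(intensities) - 1):
        prev, cur, nxt = intensities[i - 1], intensities[i], intensities[i + 1]
        if cur > 0 and cur > prev and cur > nxt:
            peaks.append((wavelengths[i], +1))
        elif cur < 0 and cur < prev and cur < nxt:
            peaks.append((wavelengths[i], -1))
    return peaks

def render_spectrum(peaks, wavelengths, width=10.0):
    """Rebuild a curve as a sum of signed, unit-height Gaussian bands (assumed shape)."""
    return [
        sum(sym * math.exp(-((w - pos) ** 2) / (2 * width ** 2)) for pos, sym in peaks)
        for w in wavelengths
    ]

grid = list(range(200, 401, 5))        # wavelength grid in nm (illustrative)
predicted = [(230, +1), (290, -1)]     # hypothetical predicted peak properties
curve = render_spectrum(predicted, grid)
recovered = extract_peaks(grid, curve)  # round trip back to peak properties
```

In this toy round trip the number, positions, and symbols of the peaks fully determine the rendered curve, which is the property that makes a peak-level prediction target easier to inspect than a point-by-point intensity vector.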
Research Subjects and Processing
The input consists of the target molecule’s atom-bond-angle features and molecular descriptors. The model performs molecular representation learning through geometry-enhanced GNNs and extracts peak properties using a Transformer encoder.
Experimental Results
ECDformer demonstrates excellent performance in predicting peak properties, improving peak symbol accuracy from 37.3% to 72.7% and reducing spectrum prediction time from an average of 4.6 CPU hours to 1.5 seconds.
3. Model Performance Evaluation
Process Description
The research team evaluated the performance of ECDformer using three sets of peak-based evaluation metrics: Number-Root Mean Square Error (Number-RMSE), Position-Root Mean Square Error (Position-RMSE), and Symbol Accuracy (Symbol-Acc).
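One plausible reading of these three metrics is sketched below. The pairing rule is an assumption made for illustration: peaks are matched in wavelength order and unmatched extra peaks are ignored, which may differ from the exact definitions used in the paper. The function name and the example peaks are hypothetical.

```python
import math

def peak_metrics(predictions, references):
    """Each item is a list of (position_nm, symbol) peaks for one molecule."""
    count_sq, pos_sq, sym_hits, matched = 0.0, 0.0, 0, 0
    for pred, ref in zip(predictions, references):
        count_sq += (len(pred) - len(ref)) ** 2
        # Assumption: pair peaks in wavelength order and ignore unmatched extras
        for (p_pos, p_sym), (r_pos, r_sym) in zip(sorted(pred), sorted(ref)):
            pos_sq += (p_pos - r_pos) ** 2
            sym_hits += int(p_sym == r_sym)
            matched += 1
    return {
        "number_rmse": math.sqrt(count_sq / len(references)),
        "position_rmse": math.sqrt(pos_sq / matched) if matched else float("nan"),
        "symbol_acc": sym_hits / matched if matched else float("nan"),
    }

# Illustrative values only: two molecules
preds = [[(232, +1), (288, -1)], [(250, -1)]]
refs = [[(230, +1), (290, +1)], [(254, -1), (310, +1)]]
scores = peak_metrics(preds, refs)
```

Scoring peak properties rather than raw intensities keeps the evaluation aligned with what chemists read from an ECD spectrum: how many bands there are, where they sit, and which way they point.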
Research Subjects and Processing
The subjects of evaluation were chiral molecules from the CMCDS dataset, with the model’s predicted peak properties compared against those of the ground-truth spectra.
Experimental Results
ECDformer outperforms baseline models across all evaluation metrics and performs particularly well on complex spectra (those with more than five peaks). The distributions of position and symbol differences also show that ECDformer’s predictions are closer to the true values.
4. Interpretability and Generalization Capabilities of the Model
Process Description
Using the integrated gradients method, the research team highlighted the regions of each molecule that contribute most to spectrum generation and found that chromophore structures play a crucial role in peak prediction. Additionally, ECDformer generalizes well to the prediction of infrared (IR) spectra and mass spectra (MS).
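The integrated gradients technique itself can be illustrated on a toy function. The sketch below approximates the path integral with a midpoint Riemann sum and estimates gradients by central finite differences; a real application would use the automatic differentiation of the model's framework instead. The toy model, the all-zero baseline, and the feature values are hypothetical stand-ins and have no connection to ECDformer's actual inputs.

```python
def integrated_gradients(model, inputs, baseline, steps=50, eps=1e-5):
    """Approximate integrated gradients along the straight path baseline -> inputs."""
    attributions = [0.0] * len(inputs)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th path segment
        point = [b + alpha * (x - b) for x, b in zip(inputs, baseline)]
        for i in range(len(inputs)):
            up = point[:i] + [point[i] + eps] + point[i + 1:]
            down = point[:i] + [point[i] - eps] + point[i + 1:]
            grad_i = (model(up) - model(down)) / (2 * eps)  # central difference
            attributions[i] += grad_i * (inputs[i] - baseline[i]) / steps
    return attributions

def toy_model(features):
    # Hypothetical stand-in for a predictor: the third feature has no influence
    a, b, c = features
    return 3.0 * a + a * b + 0.0 * c

x = [1.0, 2.0, 5.0]
base = [0.0, 0.0, 0.0]
attr = integrated_gradients(toy_model, x, base)
```

The attributions sum to the difference between the model output at the input and at the baseline, and the feature that does not affect the output receives zero attribution. Applied to atom-level features, the same principle is what allows high-attribution regions of a molecule to be marked.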
Research Subjects and Processing
The subjects of study include various natural products and pharmaceutical molecules, such as compounds with antiviral, antagonistic, and anti-inflammatory effects.
Experimental Results
ECDformer accurately predicts the ECD spectra of these complex natural products and demonstrates excellent generalization performance in IR and MS prediction tasks.
Research Conclusions
Significance and Value of the Research
The core contribution of this study lies in proposing an efficient and interpretable ECD spectrum prediction framework, addressing the deficiencies of existing methods in data scarcity and interpretability. By constructing a large-scale dataset and introducing a deep learning model, ECDformer significantly improves the accuracy and efficiency of spectrum prediction. Moreover, the model’s peak-decoupling approach not only enhances prediction accuracy but also provides greater interpretability for the spectrum generation process.
Innovations of the Research
- Large-Scale CMCDS Dataset: Fills the gap in ECD spectrum data for chiral molecules, providing high-quality training data for deep learning models.
- ECDformer Model: Through peak decoupling and property prediction, it significantly improves the accuracy and efficiency of spectrum prediction.
- Generalization Capabilities: ECDformer accurately predicts IR and mass spectra, demonstrating its broad application potential in different spectrum prediction tasks.
Other Valuable Information
The research team also explored the potential of ECDformer for molecular structure inference. Although the current model cannot fully reconstruct molecular structures from spectra, it shows some ability to identify molecular orbitals and functional groups. Future research will further improve the dataset, particularly by adding data on molecules with multiple chiral centers, to enhance the model’s comprehensive characterization of chiral structures.