A Deep Learning Approach for Rational Ligand Generation with Toxicity Control

Latest Research on Deep Learning Applied to Target Protein Ligand Generation: Proposal and Validation of the DeepBlock Framework

Background and Research Problem

In the drug discovery process, finding ligand molecules that bind to specific proteins has always been a core objective. However, current virtual screening methods are limited by the size of the compound libraries they search, which cover only a small part of chemical space, so they struggle to find novel compounds with target-specific properties. In contrast, de novo drug design generates molecular structures from scratch, enabling exploration beyond existing compound libraries.

In recent years, deep generative models have made significant progress in chemical molecule generation, including autoregressive models, variational autoencoders (VAEs), generative adversarial networks (GANs), normalizing flow models, and diffusion models. However, a common limitation of these models is that they primarily learn the overall distribution of chemical space and cannot directly design molecules for a specific target. They often require additional virtual screening or reinforcement learning to assess whether the generated molecules actually bind the target protein.

To address these issues, a collaborative team from Xidian University, Xi’an Jiaotong University, Macao Polytechnic University, University of Tsukuba, and Hunan University proposed a deep learning method called DeepBlock. Inspired by DNA-encoded compound library technology, the method adopts a modular building-block strategy to generate ligands from target protein sequences while controlling molecular properties. The study was published in Nature Computational Science under the title “A deep learning approach for rational ligand generation with toxicity control via reactive building blocks.”

Research Design and Innovative Framework

Research Process of DeepBlock

The DeepBlock framework proposed in this study completes molecule generation through a two-step process: first, generating molecular building blocks, and then reconstructing these blocks into complete molecules. This design aims to address the inconsistencies in chemical structures caused by multi-step processing in traditional molecule generation methods, while enabling control over the chemical reactions between modules and molecular properties.

1. Molecule Fragmentation and Reconstruction Algorithm

Building on BRICS (Breaking of Retrosynthetically Interesting Chemical Substructures), a fragmentation scheme based on retrosynthetic chemistry, the research team designed a graph-based molecule fragmentation and reconstruction algorithm. The algorithm decomposes each molecule into a sequence of independent blocks and has the following features:
- During fragmentation, the algorithm strictly manages bond-breaking rules and records the affected nodes and edges, so that fragmentation and reconstruction are exact inverses of each other: a molecule yields one block sequence, and that sequence rebuilds the same molecule.
- In validation on the ChEMBL dataset, the algorithm failed on only 70 of 2,205,345 molecules, a success rate of 99.99683%.
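
The round trip described above can be illustrated with a minimal sketch. It is not the authors' algorithm: the molecule is a toy graph of integer atom ids, the bonds to cut are passed in by hand rather than chosen by BRICS rules, and the names fragment and reconstruct are hypothetical. It only shows why recording the broken bonds makes reconstruction unambiguous.

```python
from collections import defaultdict


def fragment(num_atoms, bonds, cut_bonds):
    """Split a molecular graph into blocks by removing the bonds in cut_bonds.

    Returns (blocks, cut_record). Each block holds its atom ids and internal
    bonds; cut_record lists the broken bonds so they can be restored later.
    """
    kept = set(bonds) - set(cut_bonds)
    adjacency = defaultdict(set)
    for i, j in kept:
        adjacency[i].add(j)
        adjacency[j].add(i)

    seen, blocks = set(), []
    for start in range(num_atoms):
        if start in seen:
            continue
        # Depth-first search collects one connected component (one block).
        stack, members = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            members.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        internal = {b for b in kept if b[0] in members and b[1] in members}
        blocks.append({"atoms": sorted(members), "bonds": internal})
    return blocks, sorted(cut_bonds)


def reconstruct(blocks, cut_record):
    """Rebuild the full bond set from the blocks plus the recorded cuts."""
    bonds = set(cut_record)
    for block in blocks:
        bonds |= block["bonds"]
    return bonds


# Toy five-atom chain 0-1-2-3-4, cut between atoms 1 and 2 (hypothetical input).
bonds = {(0, 1), (1, 2), (2, 3), (3, 4)}
blocks, cut_record = fragment(5, bonds, {(1, 2)})
restored = reconstruct(blocks, cut_record)
```

Here the chain splits into two blocks, and restoring the recorded cut gives back exactly the original bond set. A real implementation would also track bond orders, atom types, and the BRICS attachment-point labels.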

2. Design of Block Generative Network (BGNet)

BGNet is the core generative network of DeepBlock, designed using a Conditional Variational Autoencoder (CVAE) to generate molecule block sequences based on protein sequence information. Its key features include:
- Dual encoding scheme: BGNet independently encodes ligand block sequences and protein sequences, then utilizes a binding contribution perception module to predict the binding contribution values of protein residues. This design lets the model work from sequence alone when no 3D structure of the protein is available.
- Model pretraining: Pretraining on the ChEMBL dataset exposes the model to a much larger chemical space, mitigating the risk of overfitting due to the limited size of the protein-ligand dataset.
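
For readers unfamiliar with conditional VAEs, the sketch below shows the three generic ingredients of any CVAE, using only the Python standard library: sampling a latent code with the reparameterization trick, the KL term that regularizes the latent space, and conditioning the decoder by concatenating the latent code with a condition vector. It is an assumption-laden illustration, not BGNet itself; the function names and the three-dimensional "protein embedding" are hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence from a diagonal Gaussian N(mu, sigma^2) to N(0, I)."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def decoder_input(z, condition):
    """A conditional decoder sees the latent code joined with the condition."""
    return list(z) + list(condition)


rng = random.Random(0)
mu, log_var = [0.0, 0.0], [0.0, 0.0]        # hypothetical encoder output
protein_embedding = [0.3, -1.2, 0.8]        # hypothetical condition vector
z = reparameterize(mu, log_var, rng)
features = decoder_input(z, protein_embedding)
```

At generation time a CVAE samples z from the prior and decodes it together with the condition, which is what allows different ligand candidates to be produced for the same protein.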

3. Integration of Optimization Algorithms

The research team combined BGNet with the Simulated Annealing (SA) algorithm and Bayesian Optimization (BO) to optimize additional properties such as molecular toxicity. During the optimization process, BGNet generates potential neighboring candidate molecules, which are then explored and filtered using the optimization algorithms. The generated molecules exhibit strong binding affinity to the target protein while maintaining good synthetic feasibility.
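
The simulated annealing loop can be sketched as follows. This is a generic textbook version under stated assumptions, not the paper's implementation: propose_neighbor is a hypothetical stand-in for BGNet generating a neighboring candidate, toy_cost is a hypothetical stand-in for a predicted toxicity score (lower is better), and the geometric cooling schedule and its parameters are arbitrary choices.

```python
import math
import random


def simulated_annealing(start, propose, cost, steps=200,
                        t_start=1.0, t_end=0.01, seed=0):
    """Minimize cost by accepting worse candidates with a shrinking probability."""
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = propose(current, rng)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost


def propose_neighbor(latent, rng):
    # Hypothetical stand-in for the generator producing a nearby candidate.
    i = rng.randrange(len(latent))
    neighbor = list(latent)
    neighbor[i] += rng.gauss(0.0, 0.3)
    return neighbor


def toy_cost(latent):
    # Hypothetical stand-in for predicted toxicity (lower is better).
    return sum(x * x for x in latent)


start = [2.0, -1.5, 1.0]
best, best_cost = simulated_annealing(start, propose_neighbor, toy_cost)
```

In a multi-objective setting, the cost function would combine the toxicity prediction with a term that penalizes loss of predicted binding affinity, so that the search lowers toxicity without drifting away from the target.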

Experimental Design and Testing

The team used 100,000 protein-ligand pairs from the CrossDocked 2020 dataset for model training and generated ligand molecules for 100 test proteins for evaluation. The generated molecules were compared with those from existing models on the following metrics:
1. Binding Affinity: The Vina docking score, an estimate of binding energy in which lower (more negative) values indicate stronger predicted binding, was used to evaluate how well each molecule binds its target.
2. Drug Likeness and Synthetic Feasibility: The potential of molecules for drug development was quantified with the QED score, and the difficulty of actually synthesizing them with the synthetic accessibility (SA) score.
3. Molecular Property Distribution and Diversity: The chemical property distribution of generated molecules and its consistency with the reference molecule library were analyzed.
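
The article does not say how diversity was computed. One common choice, shown here purely as an assumption, is internal diversity: one minus the average pairwise Tanimoto similarity between molecular fingerprints. In this sketch each fingerprint is a plain Python set of hypothetical feature ids rather than a real cheminformatics fingerprint.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def internal_diversity(fingerprints):
    """One minus the mean pairwise Tanimoto similarity of a set of molecules."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Hypothetical fingerprints for three generated molecules.
generated = [{1, 2, 3}, {2, 3, 4}, {7, 8, 9}]
diversity = internal_diversity(generated)
```

A value near 0 means the generated molecules are near-duplicates, while a value near 1 means they share almost no features.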

Research Results and Key Findings

Results and Analysis

  1. Binding Affinity of Generated Molecules
    Molecules generated by DeepBlock showed strong binding affinity in terms of the Vina score, with a more concentrated score distribution, indicating that the generated candidates are more consistent in quality. On these measures, DeepBlock compared favorably with baseline models such as Pocket2Mol and TargetDiff.

  2. Drug Likeness and Synthetic Feasibility
    Molecules generated by DeepBlock not only exhibited high binding affinity but also performed better in terms of drug likeness (QED score) and synthetic feasibility (synthetic accessibility score, SA score). High affinity did not come at the cost of molecular realism, demonstrating the model’s ability to generate practically feasible molecules.

  3. Optimization Control of Molecular Properties
    Toxicity control experiments based on the simulated annealing or Bayesian optimization algorithms reduced the toxicity levels of generated molecules while retaining target protein binding ability, further validating the practicality of DeepBlock in multi-objective tasks.

  4. Generalization Ability of Structural Information
    In cases where target structural information was lacking, molecules designed by DeepBlock based on protein sequences exhibited similar key binding structures to known inhibitors, highlighting its potential in novel target drug discovery.

Highlights of the Research

  • For the first time, a modular approach was applied to molecule generation, combining the concept of DNA-encoded chemical libraries to achieve structured and controllable molecule generation.
  • The modular molecule generation method has broad application value in synthetic chemistry and drug development, particularly in avoiding generated molecules that cannot be synthesized.
  • Experiments validated the feasibility of the model in toxicity optimization and binding affinity improvement, providing new insights for “multi-property optimization” in drug design.

Research Significance and Future Directions

DeepBlock addresses a shortcoming of existing drug design models: they explore chemical space without taking the target protein into account. Through block generation and reconstruction, the model balances the chemical validity of generated structures with control over multiple molecular properties. This provides a new tool for research and supports both drug development for novel targets and the design of low-toxicity drugs.

In the future, the team plans to further optimize in the following directions:
1. Exploring de novo generation algorithms for blocks to enhance molecular diversity and innovation.
2. Upgrading 2D molecule generation to three-dimensional (3D) molecular structures to better meet the needs of drug discovery environments.
3. Extending the practical application scenarios of the DeepBlock model in large-scale drug development projects.

This research expands the use of deep learning in drug design and offers a systematic method for innovative drug discovery, with potential impact in the fields of chemical biology and artificial intelligence.