SP-DTI: Subpocket-Informed Transformer for Drug–Target Interaction Prediction
Academic Background
Drug-Target Interaction (DTI) prediction is a critical step in drug discovery because it can substantially reduce the cost and time of experimental screening. Although deep learning has improved the accuracy of DTI prediction, existing methods still face two major challenges: limited generalizability and neglect of subpocket-level interactions. First, the performance of existing models declines markedly on unseen proteins and in cross-domain settings. Second, current molecular relational learning often overlooks subpocket-level interactions, which are crucial for understanding binding sites in detail. To address these issues, the authors propose a model called SP-DTI, which aims to improve the accuracy and generalizability of DTI prediction by combining subpocket analysis with pre-trained language models.
Source of the Paper
This paper was co-authored by Sizhe Liu, Yuchen Liu, Haofeng Xu, Jun Xia, and Stan Z. Li. They are affiliated with the Department of Computer Science and the Department of Quantitative and Computational Biology at the University of Southern California, as well as the School of Engineering at Westlake University. The paper, titled SP-DTI: Subpocket-Informed Transformer for Drug–Target Interaction Prediction, was published in 2025 in the journal Bioinformatics.
Research Process
1. Problem Definition
DTI prediction is framed as a binary classification task, aiming to predict whether an interaction exists between a drug and a target protein. Drugs are represented by their SMILES (Simplified Molecular Input Line Entry System) notation, while target proteins are represented by amino acid sequences. The core of the task is to learn a function that maps drug-target pairs to a binary interaction score, where 0 indicates no interaction and 1 indicates the presence of interaction.
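The problem setup above can be sketched in a few lines of Python. This is only an illustration of the data layout and of a standard binary classification objective: the names `DTIPair`, `binary_cross_entropy`, and `to_binary` are hypothetical, the labels and probabilities are toy values, and the use of binary cross-entropy as the training loss is an assumption rather than something stated here.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class DTIPair:
    smiles: str    # drug as a SMILES string
    sequence: str  # target protein as an amino acid sequence
    label: int     # 1 = interaction, 0 = no interaction

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

def to_binary(prob, threshold=0.5):
    """Turn a predicted interaction probability into a 0/1 decision."""
    return 1 if prob >= threshold else 0

# Toy pairs: short made-up sequences and made-up labels, not real interaction data.
pairs = [
    DTIPair("CC(=O)OC1=CC=CC=C1C(=O)O", "MKTAYIAKQR", 1),
    DTIPair("CCO", "MVLSPADKTN", 0),
]
probs = [0.9, 0.2]  # stand-in for the output of a trained model
loss = binary_cross_entropy(probs, [p.label for p in pairs])
decisions = [to_binary(p) for p in probs]
```

In this sketch the model itself is left out on purpose: any function that maps a `(smiles, sequence)` pair to a probability in [0, 1] fits the task definition.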
2. Model Design
The SP-DTI model consists of three main modules:
a) Subpocket Modeling Module (SMM)
This module aims to capture the intricate interactions between drugs and proteins at the atomic level. The three-dimensional structures of proteins are generated using AlphaFold2, and potential binding pockets are identified using the CAVIAR algorithm, which further decomposes them into subpockets. Each subpocket is assigned a score indicating its likelihood of being a ligand binding site. Subsequently, individual graphs are generated for each subpocket and processed using Graph Convolutional Networks (GCNs), resulting in a detailed subpocket feature embedding.
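The graph-encoding step can be illustrated with a minimal, dependency-free sketch of one graph convolution over a single subpocket graph, followed by mean pooling into one embedding. It assumes the widely used symmetric normalization with self-loops; the exact GCN variant, node features, and pooling used by SP-DTI are not specified here, so the adjacency matrix, features, and weights below are hypothetical. Structure prediction with AlphaFold2 and pocket detection with CAVIAR are not reproduced.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, feats, weight):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

def mean_pool(node_feats):
    """Average node features into a single graph-level embedding."""
    n = len(node_feats)
    return [sum(col) / n for col in zip(*node_feats)]

# Toy subpocket graph: three nodes connected in a path 0-1-2.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical node features
weight = [[1.0, 0.0], [0.0, 1.0]]             # identity weights for readability
subpocket_embedding = mean_pool(gcn_layer(adj, feats, weight))
```

In practice one such embedding would be computed per subpocket, giving the set of subpocket vectors that the later modules consume.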
b) Seq-Graph Fusion Module (SGFM)
This module strengthens the drug and protein encoders by integrating pre-trained language models with Graph Neural Networks (GNNs). Protein sequences are processed by the ESM-2 language model and drug SMILES strings by ChemBERTa, and the resulting embeddings are added as node features to the GNNs. The final output is a unified representation of proteins and drugs.
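One simple way to picture this fusion is to attach each node's language-model embedding to its structural features before the GNN runs. The sketch below assumes concatenation; whether SP-DTI concatenates, sums, or projects the embeddings is not stated here. The function name `fuse_node_features` is hypothetical, and the small vectors are placeholders rather than real ESM-2 or ChemBERTa output.

```python
def fuse_node_features(structural_feats, lm_embeddings):
    """Concatenate each node's structural features with its language-model embedding."""
    if len(structural_feats) != len(lm_embeddings):
        raise ValueError("expected one language-model embedding per graph node")
    return [list(s) + list(e) for s, e in zip(structural_feats, lm_embeddings)]

# Toy protein graph with 3 residues and 2 structural features each (hypothetical).
structural = [[0.1, 1.0], [0.4, 0.0], [0.7, 1.0]]
# Placeholder per-residue embeddings; real language-model embeddings are much wider.
lm = [[0.2, -0.1, 0.5], [0.0, 0.3, -0.4], [0.9, 0.1, 0.0]]
fused = fuse_node_features(structural, lm)
```

The fused rows would then serve as the input node features of the GNN, so message passing operates on both structural and sequence-derived information.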
c) Interaction Module
This module captures the interactions between drugs, proteins, and subpockets using a Transformer model. First, the embeddings of drugs, proteins, and subpockets are merged into a matrix, and positional encoding is introduced to capture the relationships between subpockets and pockets. Next, the embeddings are updated through multi-head attention, and the probability of interaction is predicted using a Multilayer Perceptron (MLP).
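The attention step can be sketched as follows. This is a deliberately reduced version of the module described above: it uses a single attention head with identity projections, omits the positional encoding, and replaces the MLP with a linear readout followed by a sigmoid. All vectors and the `readout` weights are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections."""
    d = len(tokens[0])
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens])
        for q in tokens
    ]
    updated = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return updated, weights

def interaction_probability(updated, readout):
    """Mean-pool the updated tokens and apply a linear readout plus sigmoid."""
    d = len(readout)
    pooled = [sum(tok[j] for tok in updated) / len(updated) for j in range(d)]
    logit = sum(p * r for p, r in zip(pooled, readout))
    return 1.0 / (1.0 + math.exp(-logit))

# Token matrix: one drug token, one protein token, two subpocket tokens.
drug = [1.0, 0.0, 0.5, 0.0]
protein = [0.0, 1.0, 0.0, 0.5]
subpockets = [[0.9, 0.1, 0.4, 0.0], [0.0, 0.2, 0.1, 0.9]]
tokens = [drug, protein] + subpockets

updated, weights = self_attention(tokens)
prob = interaction_probability(updated, readout=[0.5, -0.25, 0.5, 0.25])
```

Row 0 of `weights` holds the attention the drug token pays to every token, including each subpocket. In this toy input the drug attends more strongly to the first subpocket, whose vector resembles its own.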
3. Experiments and Results
a) Datasets and Evaluation Metrics
The study utilized two datasets, Biosnap and Davis, containing 4,510 drugs and 2,181 proteins, and 68 drugs and 379 proteins, respectively. Evaluation metrics included ROC-AUC (Area Under the Receiver Operating Characteristic Curve) and PR-AUC (Area Under the Precision-Recall Curve).
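For readers unfamiliar with the two metrics, the sketch below computes them from scratch on toy data. ROC-AUC is computed as the fraction of positive-negative pairs ranked correctly, and PR-AUC is approximated by average precision. This is one common estimator, and the paper may compute the area differently. Both functions assume at least one positive and one negative label and do not treat tied scores specially in the average-precision case.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy predictions, not results from the paper.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

On this toy input, three of the six positive-negative pairs are ordered correctly, so `auc` is 0.5, while `ap` is higher because the top-ranked example is a true positive.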
b) Random Split Testing
In the random split setting, SP-DTI achieved ROC-AUC scores of 0.931 on Biosnap and 0.934 on Davis, outperforming all baseline models.
c) Unseen Drug/Protein Split Testing
SP-DTI maintained high performance in the unseen drug and unseen protein settings. In the unseen protein setting in particular, its ROC-AUC reached 0.873, a drop of only about 6% relative to the random split setting, whereas the baseline models dropped by more than 12%.
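An unseen protein split can be sketched as holding out whole proteins, so that no test protein ever appears in training. The function below is a generic illustration with a hypothetical name and toy identifiers; the actual split files come from the benchmark repositories and may be built differently. The last lines check the reported drop under the assumption that 0.873 is compared with the Biosnap random-split score of 0.931.

```python
import random

def unseen_protein_split(pairs, test_fraction=0.2, seed=0):
    """Split (drug, protein, label) triples so test proteins never occur in training."""
    proteins = sorted({protein for _, protein, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(proteins)
    n_test = max(1, int(len(proteins) * test_fraction))
    held_out = set(proteins[:n_test])
    train = [x for x in pairs if x[1] not in held_out]
    test = [x for x in pairs if x[1] in held_out]
    return train, test

# Toy triples with made-up identifiers and labels.
pairs = [
    ("drug_1", "prot_A", 1), ("drug_2", "prot_A", 0),
    ("drug_1", "prot_B", 0), ("drug_3", "prot_C", 1),
    ("drug_2", "prot_D", 1), ("drug_3", "prot_E", 0),
]
train, test = unseen_protein_split(pairs)

# Relative drop from the random split to the unseen protein split (assumed baseline: 0.931).
relative_drop = (0.931 - 0.873) / 0.931
```

Under that assumption, `relative_drop` comes out at roughly 0.062, which matches the figure of about 6%.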
d) Cross-Domain Split Testing
In cross-domain testing, SP-DTI achieved an ROC-AUC of 0.773, further supporting its generalizability to data that differ from the training distribution.
e) Model Interpretability
Through the attention mechanism, SP-DTI can predict which protein binding sites are most likely to interact with a given ligand. The study used the binding of HIV protease D545701 with GW0385 as an example to demonstrate how the model accurately identifies experimentally verified binding sites.
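As a rough illustration of how attention can be read out for interpretation, the sketch below ranks subpockets by the attention weight that the drug token assigns to each of them. The function name, subpocket identifiers, and weights are hypothetical, and attention weights are only a heuristic signal, not a measured binding affinity.

```python
def rank_subpockets(drug_attention, subpocket_ids):
    """Sort subpockets by the attention weight the drug token assigns to them."""
    if len(drug_attention) != len(subpocket_ids):
        raise ValueError("expected one attention weight per subpocket")
    return sorted(zip(subpocket_ids, drug_attention), key=lambda t: t[1], reverse=True)

# Hypothetical attention weights from the drug token to three subpockets.
attention = [0.12, 0.55, 0.33]
ids = ["subpocket_A", "subpocket_B", "subpocket_C"]
ranking = rank_subpockets(attention, ids)
```

The top-ranked entries are the candidate binding sites that would then be compared with experimentally verified ones.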
4. Ablation Study
The ablation study showed that the pre-trained language model had the greatest impact on model performance, followed by the subpocket encoder, interaction module, and fusion module. Removing any component resulted in a performance decline, further validating the importance of each module.
Conclusion and Significance
The SP-DTI model significantly enhances the accuracy and generalizability of DTI prediction by introducing subpocket information and the seq-graph fusion module. The results show that SP-DTI outperforms state-of-the-art models in random splits, unseen drug/protein splits, and cross-domain settings. Additionally, the interpretability of the model provides crucial insights for drug discovery, helping scientists understand the mechanisms of predicted interactions and thereby accelerating the drug development process.
Research Highlights
- Subpocket-Level Modeling: Introduces subpocket information for the first time in DTI prediction, providing a finer analysis of binding sites.
- Seq-Graph Fusion: Integrates pre-trained language models with GNNs for the first time, enhancing the model’s generalizability.
- Cross-Domain Performance: Demonstrates outstanding performance in cross-domain settings, showcasing the model’s potential in real-world applications.
- Model Interpretability: Provides visualizations of binding sites through the attention mechanism, improving the model’s interpretability.
Code and Data Availability
The code for SP-DTI is open source and available on GitHub: https://github.com/steven51516/sp-dti. Dataset split information can be obtained from the GitHub repositories of MolTrans and DrugBAN.
Acknowledgements
The authors thank the anonymous reviewers for their valuable suggestions.
Author Contributions
Sizhe Liu and Yuchen Liu are co-first authors, responsible for conceptualization, methodology, software development, and writing. Haofeng Xu participated in software development and paper review. Jun Xia supervised and validated the study. Stan Z. Li managed the project and secured funding.
Funding
This research was supported by the National Natural Science Foundation of China, the Center for Synthetic Biology and Integrated Bioengineering at Westlake University, and the Westlake University Industries of the Future Research Fund.