Hierarchical Negative Sampling Based Graph Contrastive Learning Approach for Drug-Disease Association Prediction

Predicting drug-disease associations (RDAs) plays a critical role in uncovering disease treatment strategies and promoting drug repurposing. However, existing methods rely mainly on limited domain-specific knowledge when predicting candidate associations between drugs and diseases, which limits their effectiveness. Moreover, treating every unknown drug-disease pair as a negative sample is inherently unreliable, because some of those pairs may be true associations that have not yet been discovered. To overcome these challenges, this paper proposes a novel graph contrastive model based on hierarchical (layered) negative sampling, called HSGCL-RDA, for predicting potential associations between drugs and diseases.

Research Background and Questions

Drug development and disease control are lengthy and expensive, and as the number and variety of diseases keep growing, so does the demand for effective drugs. Global disease outbreaks (e.g., COVID-19) have exposed the limits of treatment with existing drugs and created an urgent need to develop new therapies quickly. Finding new uses for existing drugs is one response, but it involves numerous challenges. Although existing algorithmic models have reduced the cost and time of drug development to some degree, they have the following limitations:

  1. Insufficient Similarity Measures: Many models fail to adequately consider the multidimensional features of different objects, which introduces noise and information loss during the calculation process.
  2. Negative Sample Selection Issues: Most models base their predictions on positive samples from known associations and ignore both the sparsity of the association network and the characteristics of the unknown samples. Simply labeling all unknown pairs as negative samples is insufficient for predicting potential drug-disease associations; choosing more reliable negative samples is key to achieving satisfactory predictive results.
  3. Insufficient Application of Contrastive Learning: Although contrastive learning has proven effective in many graph representation learning scenarios, it has not yet been applied to the prediction of potential drug-disease associations.

Source of the Study

This paper is authored by Yuanxu Wang, Jinmiao Song, Qiguo Dai, and Xiaodong Duan, who are from Xinjiang University and Dalian Minzu University. The research was published in the May 2024 issue of the IEEE Journal of Biomedical and Health Informatics.

Research Process

Constructing Heterogeneous Networks

  1. Constructing Similarity Networks of Different Biological Molecules: Calculate similarity information for drugs, diseases, and proteins, and extract effective feature information through regularized matrix decomposition. The paper first uses the Gaussian Interaction Profile (GIP) kernel similarity method, which has been widely adopted in recent years for similarity calculations between biological molecules. To enhance feature representation capability, it additionally computes disease semantic similarity, protein sequence similarity, and drug Jaccard similarity.

  2. Fusing Similarity Matrices: Obtain similarity information of different biological molecules through various similarity calculation methods and construct a comprehensive feature network through feature fusion methods. Use regularized matrix decomposition to obtain low-dimensional vector representations to effectively capture node feature information.

  3. Layered Negative Sampling Strategy: Use a layered sampling algorithm based on the similarity networks. First, apply the PageRank algorithm to score and rank the nodes of the drug, disease, and protein similarity networks and extract the most highly associated biological entities. Then, use the known association information to retrieve the proteins linked to the diseases, and filter the data against the protein-drug association network to obtain a reliable negative sample dataset.
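
The similarity and ranking steps above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the GIP kernel follows the commonly used form exp(-gamma * ||a - b||^2) with the bandwidth normalized by the mean squared profile norm, and the bandwidth parameter gamma_prime = 1, the PageRank damping factor 0.85, the function names, and the toy association matrix are all assumptions made for illustration. The protein-based filtering stage of the layered sampling is not shown, because this summary does not specify it in enough detail.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """GIP kernel similarity between the rows of a 0/1 association matrix."""
    n = len(profiles)
    # Bandwidth is normalised by the mean squared norm of the profiles.
    mean_sq_norm = sum(sum(v * v for v in row) for row in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq_norm
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-gamma * sq_dist)
    return sim


def jaccard(a, b):
    """Jaccard similarity between two 0/1 feature vectors."""
    set_a = {k for k, v in enumerate(a) if v}
    set_b = {k for k, v in enumerate(b) if v}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def pagerank(weights, damping=0.85, iters=100, tol=1e-10):
    """Power-iteration PageRank on a weighted similarity matrix."""
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            for j in range(n):
                if out_sum[i] == 0:
                    # Dangling node: spread its rank uniformly.
                    new[j] += damping * rank[i] / n
                else:
                    new[j] += damping * rank[i] * weights[i][j] / out_sum[i]
        converged = sum(abs(a - b) for a, b in zip(new, rank)) < tol
        rank = new
        if converged:
            break
    return rank


# Toy drug-disease association matrix (rows: drugs, columns: diseases).
drug_disease = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
drug_sim = gip_kernel(drug_disease)
scores = pagerank(drug_sim)
# Drugs ordered from most to least central in the similarity network.
ranked_drugs = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
```

In this toy example the first two drugs share an identical interaction profile, so their GIP similarity is 1 and they receive the same PageRank score, which is higher than that of the third drug.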

Graph Contrastive Module

  1. Intra-Domain Meta-Path Information Aggregation Module: Use graph attention network layers to learn the importance of the interactions between nodes connected by a meta-path and obtain node embeddings. By learning attention weights, capture drug and disease node representations for each meta-path.

  2. Inter-Domain Meta-Path Information Aggregation Module: Because different meta-paths yield different feature representations, further aggregate the semantic feature information across these meta-paths, assigning a different weight to each meta-path, to enhance feature validity.

  3. Dual-Channel Network Feature Graph Contrastive Module: Because deeper-level feature information exists between drugs and diseases, use GCN and SoGCN to build global feature graphs and local feature graphs, respectively, and fully extract their internal representation information. Apply self-supervised graph contrastive learning by defining positive and negative samples based on the global and local feature graphs and calculating a contrastive loss.
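
The idea of contrasting a global view with a local view can be sketched as follows. This summary does not give the paper's exact loss, so the example uses a generic InfoNCE-style objective as a stand-in: the global and local embeddings of the same node form the positive pair, and the local embeddings of all other nodes act as negatives. The function names, the temperature of 0.5, and the toy embeddings are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cross_view_contrastive_loss(global_emb, local_emb, temperature=0.5):
    """InfoNCE-style loss between two views of the same set of nodes.

    Positive pair: (global_emb[i], local_emb[i]).
    Negatives:     (global_emb[i], local_emb[j]) for j != i.
    """
    n = len(global_emb)
    total = 0.0
    for i in range(n):
        logits = [cosine(global_emb[i], local_emb[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / n


# Toy embeddings: two nodes, two dimensions.
global_view = [[1.0, 0.0], [0.0, 1.0]]
local_aligned = [[1.0, 0.0], [0.0, 1.0]]   # agrees with the global view
local_swapped = [[0.0, 1.0], [1.0, 0.0]]   # node identities mismatched

loss_aligned = cross_view_contrastive_loss(global_view, local_aligned)
loss_swapped = cross_view_contrastive_loss(global_view, local_swapped)
```

The loss is small when each node's two views agree and large when they are mismatched, which is the signal that pulls the two channels toward consistent representations during training.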

Model Optimization and Experiment

In the optimization stage, the learned node representations are updated by a multi-layer perceptron (MLP) and normalized with the log-softmax function. For the experiments, 5-fold cross-validation was employed, and the model’s performance was evaluated using several metrics, including AUC, AUPR, accuracy, recall, and F1 score. Additionally, comparative experiments and analyses were conducted on hyperparameters, negative sample selection, and the GCN and SoGCN layer settings.
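
The normalization and evaluation steps can be sketched in a few lines of plain Python. This is an illustrative outline under stated assumptions, not the paper's evaluation code: the function names, the random seed, and the toy labels and scores are hypothetical, and AUC is computed directly from its pairwise definition (the probability that a positive sample is scored above a negative one) rather than with a library routine.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over a list of class logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def roc_auc(labels, scores):
    """AUC from its pairwise definition; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for shuffled k-fold splitting."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


# Toy example: class log-probabilities for one drug-disease pair.
log_probs = log_softmax([2.0, 0.0])

# Toy example: AUC of four scored pairs (1 = known association).
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])

# Five train/test splits over ten samples.
splits = list(k_fold_indices(10, k=5))
```

In a full experiment, each of the five splits would train the model on the training indices and compute the metrics on the held-out indices, and the reported value would be the average over the folds.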

Main Research Results

The experimental results of HSGCL-RDA on multiple benchmark datasets indicate that the proposed method outperforms existing methods in predicting drug-disease associations. Notably, by optimizing the contrastive joint cost function on the preliminary positive and negative sample feature networks and employing a layered negative sampling strategy, the model’s ability to capture graph structure information in non-Euclidean space was effectively enhanced.

Research Significance and Value

HSGCL-RDA not only demonstrates excellent performance in predicting drug-disease associations but also helps discover potential therapeutic uses of existing drugs, which gives it significant application value. The proposed method provides an effective approach to core issues in drug-disease association prediction; its innovation lies in the improved negative sample selection method and the application of contrastive learning to heterogeneous networks.

Key Highlights

  1. Layered Negative Sampling Strategy: Selecting more reliable negative samples through a layered negative sampling method improves the model’s prediction effectiveness in sparse association networks.
  2. Intra- and Inter-Domain Meta-Path Information Aggregation: The two aggregation modules effectively capture multidimensional node information in heterogeneous networks, enhancing feature representation capability.
  3. Dual-Channel Network Feature Graph Contrastive Learning: Mining deeper-level associations between drugs and diseases through global and local feature graphs improves the model’s prediction performance.
  4. Validation and Evaluation: A series of experiments across different datasets, hyperparameter ranges, and negative sample selection strategies demonstrates the applicability and effectiveness of HSGCL-RDA.

Conclusion

This paper proposes a graph contrastive learning method based on layered negative sampling (HSGCL-RDA), effectively enhancing the prediction performance of drug-disease associations by optimizing negative sample selection strategies and graph contrastive structures. For future laboratory experiments, this research provides a reliable predictive foundation for determining actual drug-disease associations.