GCLink: A Graph Contrastive Link Prediction Framework for Gene Regulatory Network Inference
Research Background
Gene Regulatory Networks (GRNs) are crucial tools for understanding the complex biological processes within cells. They describe the interactions between Transcription Factors (TFs) and target genes, interactions that control gene transcription and regulate cellular behavior. With advances in single-cell RNA sequencing (scRNA-seq) technology, researchers can now obtain gene expression data at single-cell resolution, offering unprecedented opportunities for GRN inference. However, the sparsity and high variability of scRNA-seq data make this inference difficult.
Existing GRN inference methods can be broadly categorized into two types: unsupervised methods based on correlation or mutual information, and supervised methods that train machine learning models on known regulatory interactions. While these methods have shown promise in certain scenarios, they often rely on pairwise gene correlations and overlook global information, which limits their generalization capabilities. Additionally, many methods struggle with data noise and sparsity, especially when known regulatory interactions are limited.
To address these challenges, researchers have proposed methods based on Graph Neural Networks (GNNs). GNNs are adept at handling graph-structured data and have demonstrated excellent performance in tasks such as node classification, graph classification, and link prediction. However, existing GNN methods still face challenges when dealing with limited known regulatory interactions or noisy networks.
Research Team and Publication Information
This study was conducted by Weiming Yu, Zerun Lin, and Miaofang Lan of Shenzhen University and Le Ou-Yang of Shenzhen MSU-BIT University. The paper, titled “GCLink: A Graph Contrastive Link Prediction Framework for Gene Regulatory Network Inference,” was published on February 17, 2025, in the journal Bioinformatics. The research was supported by several grants, including grants from the National Natural Science Foundation of China, the Guangdong Basic and Applied Basic Research Foundation, and the Shenzhen Science and Technology Program.
Research Framework and Methods
Problem Definition
A GRN can be represented as a graph \( G = (V, E) \), where \( V \) denotes the node set and \( E \) denotes the edge set. scRNA-seq data can be represented as a gene expression matrix \( X \in \mathbb{R}^{m \times n} \), where \( m \) represents the number of genes and \( n \) represents the number of cells. Known gene regulatory interactions can be described by an adjacency matrix \( A \in \mathbb{R}^{m \times m} \), where \( A_{ij} = 1 \) indicates a regulatory relationship between gene \( i \) and gene \( j \), and \( A_{ij} = 0 \) otherwise. The primary goal of this study is to infer potential regulatory relationships based on known interactions, which can be formulated as a link prediction problem.
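The setup above can be made concrete with a toy example. The following is a minimal sketch, not the authors' code: the gene names, expression values, and known edges are invented for illustration, and treating each edge as directed (regulator to target) is an assumption.

```python
from itertools import permutations

# Toy expression matrix X: m = 4 genes (rows) by n = 3 cells (columns).
# Gene names and values are made up for illustration.
genes = ["TF1", "TF2", "G3", "G4"]
X = [
    [0.0, 2.1, 1.3],
    [1.7, 0.0, 0.4],
    [0.0, 0.0, 3.2],
    [0.9, 1.1, 0.0],
]

# Known regulatory interactions as (regulator, target) index pairs.
# Directed edges are an assumption of this sketch.
known_edges = {(0, 2), (0, 3), (1, 2)}


def build_adjacency(num_genes, edges):
    """Return the m x m adjacency matrix A with A[i][j] = 1 for known edges."""
    A = [[0] * num_genes for _ in range(num_genes)]
    for i, j in edges:
        A[i][j] = 1
    return A


A = build_adjacency(len(genes), known_edges)

# Link prediction asks for a score on every ordered gene pair
# that is not already a known interaction.
candidates = [(i, j) for i, j in permutations(range(len(genes)), 2)
              if A[i][j] == 0]
```

Here `candidates` holds the gene pairs whose regulatory status the model would have to predict.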
Graph Augmentation
To enhance the model’s ability to handle sparse networks, the researchers employed a graph augmentation strategy that produces two views of the network: the original graph, kept intact, and a perturbed graph generated by randomly removing some of its edges. This approach allows the model to adapt to extremely sparse scenarios while preserving known information.
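Random edge removal can be sketched in a few lines. The function name `drop_edges` and the choice to remove each edge independently with a fixed probability are assumptions of this sketch; the summary only states that some edges are removed at random.

```python
import random


def drop_edges(edges, removal_prob, seed=None):
    """Return a perturbed copy of `edges` in which each edge is removed
    independently with probability `removal_prob` (an assumed scheme)."""
    rng = random.Random(seed)
    return [edge for edge in edges if rng.random() >= removal_prob]


# The original view is kept intact; the perturbed view is a sparser copy.
original_view = [(0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
perturbed_view = drop_edges(original_view, removal_prob=0.2, seed=0)
```

Because the original list is never modified, both views remain available to the later contrastive step.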
Gene Representation Learning
The researchers used Graph Attention Networks (GAT) to extract low-dimensional representations of genes from gene expression data. Through a self-attention mechanism, GAT assigns a weight to each of a gene’s neighbors and aggregates the neighbors’ information according to those weights. Multi-head attention stabilizes the learning of gene representations.
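The attention step can be illustrated with a single-head layer in plain Python. This is a simplified sketch of the standard GAT formulation rather than the GCLink implementation: the weights `W` and `a` are arbitrary fixed numbers instead of learned parameters, the output nonlinearity is omitted, and each gene is assumed to list itself among its neighbors.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z


def gat_layer(features, neighbors, W, a):
    """Single-head graph attention layer.

    features:  list of per-gene input vectors
    neighbors: dict mapping gene index -> non-empty list of neighbor indices
    W:         weight matrix (out_dim x in_dim)
    a:         attention vector of length 2 * out_dim
    Returns (embeddings, attention), one row per gene.
    """
    h = [matvec(W, x) for x in features]
    embeddings, attention = [], []
    for i in range(len(features)):
        # Unnormalised scores e_ij = LeakyReLU(a^T [W x_i || W x_j]).
        scores = [leaky_relu(sum(ak * v for ak, v in zip(a, h[i] + h[j])))
                  for j in neighbors[i]]
        # Softmax over the neighborhood (shifted by the max for stability).
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Attention-weighted sum of the transformed neighbor features.
        embeddings.append([
            sum(alpha * h[j][k] for alpha, j in zip(alphas, neighbors[i]))
            for k in range(len(h[i]))
        ])
        attention.append(alphas)
    return embeddings, attention


features = [[0.0, 2.1, 1.3], [1.7, 0.0, 0.4], [0.0, 0.0, 3.2]]
neighbors = {0: [0, 2], 1: [1, 2], 2: [2, 0, 1]}
W = [[0.5, -0.1, 0.2], [0.3, 0.4, -0.2]]
a = [0.1, -0.3, 0.2, 0.4]
embeddings, attention = gat_layer(features, neighbors, W, a)
```

A multi-head variant would run several such layers with separate `W` and `a` and then concatenate or average their outputs.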
Graph Contrastive Learning
After obtaining low-dimensional gene representations, the researchers further optimized these representations using graph contrastive learning. They employed an inter-view contrastive loss that maximizes the agreement between the representations of the same gene across views while distinguishing them from the representations of other genes. This allows the model to learn high-quality gene representations even when known regulatory interactions are limited.
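One common way to write such an inter-view objective is an InfoNCE-style loss. The summary does not give the exact form used in GCLink, so the cosine similarity, the temperature value, and the function name `inter_view_loss` below are assumptions of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def inter_view_loss(view_a, view_b, temperature=0.5):
    """Average InfoNCE-style loss (assumed form): gene i in view_a should
    match gene i in view_b and differ from every other gene in view_b."""
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Log-sum-exp of all logits, shifted by the max for stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability of picking the matching gene.
        total += log_denominator - logits[i]
    return total / len(view_a)


# Embeddings of three genes in the original view and in a perturbed view.
view_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
view_b = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
aligned_loss = inter_view_loss(view_a, view_b)

# Mismatching the genes across views should increase the loss.
shuffled_loss = inter_view_loss(view_a, [view_b[1], view_b[2], view_b[0]])
```

The loss is small when each gene's two embeddings agree and grows when a gene's embedding in one view resembles a different gene in the other.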
Link Prediction
To infer potential regulatory relationships between genes, the researchers fed the low-dimensional gene representations into Multilayer Perceptrons (MLPs) and calculated link scores between genes using dot product operations. These scores were then mapped to probabilities between 0 and 1 using a sigmoid function, representing the likelihood of regulatory relationships between genes.
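The scoring step described above can be sketched as follows. The use of separate MLPs for regulator and target genes, the single hidden layer with ReLU, and all weight values are assumptions made for illustration; in practice the weights would be learned.

```python
import math


def mlp(x, W1, b1, W2, b2):
    """One-hidden-layer perceptron with ReLU activation."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * hi for w, hi in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def link_probability(regulator, target, regulator_mlp, target_mlp):
    """Project both gene embeddings, take their dot product as the link
    score, and squash the score into (0, 1) with a sigmoid."""
    u = mlp(regulator, *regulator_mlp)
    v = mlp(target, *target_mlp)
    score = sum(a * b for a, b in zip(u, v))
    return sigmoid(score)


# Hypothetical parameters as (W1, b1, W2, b2) tuples.
regulator_mlp = ([[0.5, -0.2], [0.1, 0.3]], [0.0, 0.1],
                 [[0.4, 0.6], [-0.3, 0.2]], [0.0, 0.0])
target_mlp = ([[0.2, 0.7], [-0.4, 0.5]], [0.1, 0.0],
              [[0.3, -0.1], [0.6, 0.2]], [0.0, 0.0])

probability = link_probability([1.0, 0.5], [0.2, 0.8],
                               regulator_mlp, target_mlp)
```

Using different projections for the two roles makes the score asymmetric, so the probability for gene i regulating gene j can differ from the reverse direction.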
Experimental Results
Performance on Benchmark Datasets
The researchers evaluated the performance of GCLink on multiple scRNA-seq datasets and compared it with six baseline methods. The experimental results showed that GCLink outperformed the baselines in terms of AUROC (Area Under the Receiver Operating Characteristic Curve) and AUPRC (Area Under the Precision-Recall Curve) on most datasets. In particular, GCLink demonstrated outstanding performance on cell-type-specific ChIP-seq networks.
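For readers unfamiliar with the two metrics, the sketch below computes them on a toy ranking. The labels and scores are invented, AUROC is computed from pairwise comparisons, and AUPRC is approximated by average precision, which is one common estimator; the summary does not say how the paper computed either value. Both functions assume at least one positive and one negative example, and the average precision assumes untied scores.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative
    (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits


# 1 = true regulatory interaction, 0 = no interaction (toy data).
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]

roc = auroc(labels, scores)              # 5 of 6 positive/negative pairs ordered correctly
ap = average_precision(labels, scores)   # (1/1 + 2/3) / 2
```

On sparse networks, where true interactions are rare, AUPRC tends to be the more demanding of the two metrics because it is sensitive to false positives among the top-ranked pairs.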
Few-Shot Studies
To validate GCLink’s generalization ability in scenarios with limited known regulatory interactions, the researchers conducted few-shot experiments. They pre-trained the model on a source cell line with abundant known regulatory interactions and then fine-tuned it on the target cell line. The results indicated that GCLink performed well in these few-shot scenarios, showing strong transferability.
Hyperparameter Analysis
The researchers also analyzed the impact of different hyperparameters on model performance, particularly the probability of randomly removing edges. The experimental results showed that setting the edge removal probability to 0.2 yielded optimal performance on most datasets.
Case Studies
The researchers applied GCLink to the human embryonic stem cell (hESC) dataset and successfully inferred several novel regulatory interactions. These results demonstrate that GCLink not only accurately infers known regulatory relationships but also predicts potential regulatory interactions.
Discussion and Significance
By combining graph attention and contrastive learning, GCLink improves the accuracy of GRN inference, especially when known regulatory interactions are limited. The method is designed to cope with the sparsity, noise, and variability of scRNA-seq data. Moreover, GCLink’s performance in few-shot scenarios highlights its transferability and generalization capabilities.
However, GCLink still relies on high-quality known regulatory interaction networks, and its performance may degrade when the network contains noise. Future research could further explore how to enhance the model’s transferability in fully unsupervised scenarios and improve graph augmentation methods to enhance model stability.
Conclusion
GCLink provides a novel solution for GRN inference, particularly excelling in handling complex, sparse gene expression data at single-cell resolution. This method not only significantly improves inference accuracy but also maintains high generalization capabilities when known regulatory interactions are limited, offering a powerful tool for biological research.