Gene-Level Alignment of Single-Cell Trajectories
Gene-level Single-cell Trajectory Alignment: A New Method Based on Dynamic Programming
The advent of single-cell RNA sequencing (scRNA-seq) has dramatically advanced biological research, enabling scientists to observe dynamic changes at single-cell resolution over time and space. However, comparing such dynamics across samples or conditions (e.g., control vs. treatment, in vitro vs. in vivo experiments, healthy vs. disease states) remains a major challenge. This study addresses key issues in single-cell trajectory alignment by developing a new tool called “genes2genes,” which aligns expression dynamics between trajectories at the level of individual genes.
This study was conducted by researchers from institutions including the Wellcome Sanger Institute, University of Cambridge, and Columbia University, with Professor Sarah A. Teichmann serving as the corresponding author. The paper was published in Nature Methods on September 19, 2024. It shows how a framework combining Bayesian information theory with dynamic programming can identify both matches and mismatches between single-cell trajectories, addressing limitations of current methods such as their reliance on strong assumptions and their inability to capture insertions and deletions.
Background and Technical Challenges
Single-cell trajectory alignment involves comparing the dynamic patterns of gene expression to analyze changes in cell states under different conditions. It builds on “pseudotime trajectory inference,” which orders cells along a temporal axis to capture the continuity of a biological process. However, existing alignment methods, which often rely on Dynamic Time Warping (DTW), face significant limitations:
- They assume that every time point in the reference trajectory has a corresponding time point in the query trajectory.
- They cannot identify mismatches, such as states that are unobserved in one trajectory because of insertions or deletions.
- They use simple metrics (e.g., Euclidean distance) that fail to capture the complexity of gene expression distributions.
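The first two limitations can be seen in a minimal sketch of classic DTW on one-dimensional toy sequences. The function name `dtw_path` and the toy values are illustrative assumptions, not taken from the paper or from any existing tool:

```python
import math

def dtw_path(ref, query):
    """Classic DTW on two 1-D sequences; returns (total_cost, [(i, j), ...])."""
    n, m = len(ref), len(query)
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - query[j - 1])
            # Every step matches a pair of points; there is no "unmatched" move.
            acc[i][j] = d + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    return acc[n][m], path[::-1]

# The query has two extra states (10, 11) that the reference never reaches.
cost, path = dtw_path([0, 1, 2, 3], [0, 1, 2, 3, 10, 11])
print(cost, path)
```

In this toy case DTW has no option but to pair the two extra query points with the last reference point, so the total cost absorbs a large distance instead of reporting a mismatch.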
To address these issues, the study developed a novel framework called genes2genes (g2g), which fuses dynamic programming and information theory to achieve precise matching and mismatching of reference and query trajectories at single-gene resolution.
Study Design and Workflow
1. Overview of Methodology
genes2genes is based on an enhanced version of Gotoh’s dynamic programming algorithm, extended to include five alignment states: one-to-one match (m), expansion match (one-to-many warp, v), compression match (many-to-one warp, w), insertion (i), and deletion (d). This allows trajectory alignment to capture both matches and mismatches between time points.
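As a rough illustration, a five-state alignment can be written as a string of state letters and decoded into matched and unmatched time points. The sketch below assumes uppercase letters, that "one-to-many" means one reference point matched to several query points, and that an insertion is a query point with no reference counterpart; the function `decode_alignment` is hypothetical and not part of the g2g API:

```python
def decode_alignment(states):
    """Decode a state string into matched pairs and unmatched time points.

    Assumed convention (illustrative only):
      M: match the next reference point with the next query point
      V: match the next query point with the previous reference point
      W: match the next reference point with the previous query point
      I: next query point has no reference counterpart
      D: next reference point has no query counterpart
    """
    i = j = 0  # number of reference / query points consumed so far
    pairs, ref_only, query_only = [], [], []
    for s in states:
        if s == "M":
            pairs.append((i, j)); i += 1; j += 1
        elif s == "V":
            if i == 0:
                raise ValueError("V needs a previously consumed reference point")
            pairs.append((i - 1, j)); j += 1
        elif s == "W":
            if j == 0:
                raise ValueError("W needs a previously consumed query point")
            pairs.append((i, j - 1)); i += 1
        elif s == "I":
            query_only.append(j); j += 1
        elif s == "D":
            ref_only.append(i); i += 1
        else:
            raise ValueError(f"unknown state {s!r}")
    return pairs, ref_only, query_only

print(decode_alignment("MVMDI"))
# ([(0, 0), (0, 1), (1, 2)], [2], [3])
```

Here reference point 0 is matched to query points 0 and 1 (an expansion warp), while reference point 2 and query point 3 are left unmatched.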
Additionally, the researchers incorporated the Minimum Message Length (MML) inference method from Bayesian information theory to assess differences in gene expression distributions between reference and query trajectories. Compared to traditional metrics such as Euclidean distance, this method more accurately quantifies differences in means and variances of gene expression.
2. Data Preprocessing and Interpolation
To ensure smoother and more uniformly distributed time points along trajectories, the study employed a distributional interpolation method that includes the following steps:
- Normalize the pseudotime axis to the [0,1] range.
- Choose m equally spaced interpolation points.
- For each interpolation point, estimate the mean and variance of the Gaussian distribution of single-cell gene expression over nearby pseudotime ranges using a kernel-based approach.
This method allows fine-grained interpolation for regions with high variability, leading to more reliable downstream alignments.
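The three steps above can be sketched for a single gene as follows. This is a minimal sketch that assumes a Gaussian kernel with a fixed bandwidth and a weighted (biased) variance estimate; the function name `interpolate_gene`, the `bandwidth` parameter, and the toy data are assumptions and may differ from the kernel and weighting the tool actually uses:

```python
import math

def interpolate_gene(pseudotime, expression, m=5, bandwidth=0.1):
    """Return [(t, mean, variance), ...] at m equally spaced points in [0, 1].

    Assumes m >= 2 and that the pseudotime values are not all identical.
    """
    lo, hi = min(pseudotime), max(pseudotime)
    t = [(p - lo) / (hi - lo) for p in pseudotime]   # step 1: normalize to [0, 1]
    points = [k / (m - 1) for k in range(m)]         # step 2: equally spaced points
    result = []
    for x in points:                                 # step 3: kernel-weighted estimates
        w = [math.exp(-0.5 * ((ti - x) / bandwidth) ** 2) for ti in t]
        total = sum(w)
        mean = sum(wi * e for wi, e in zip(w, expression)) / total
        var = sum(wi * (e - mean) ** 2 for wi, e in zip(w, expression)) / total
        result.append((x, mean, var))
    return result

# Toy data: expression rises with pseudotime.
cells_t = [0.0, 2.0, 4.0, 5.0, 7.0, 8.0, 10.0]
cells_x = [0.1, 0.3, 1.0, 1.4, 2.2, 2.4, 3.0]
for point in interpolate_gene(cells_t, cells_x, m=3, bandwidth=0.2):
    print(point)
```

Cells close to an interpolation point dominate its estimate, while distant cells contribute almost nothing, so each point summarizes the local expression distribution.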
3. Dynamic Programming Scoring and Alignment Algorithm
To achieve precise alignment, the researchers designed a scoring scheme based on an information-theoretic message transmission model. For each pair of reference and query time points, two costs are calculated:
- Match cost: Based on the gene expression distributions at the reference and query time points, computed using the MML framework to assess the difference between the unified and independent models.
- State transition cost: Takes into account the transition probabilities within the five-state alignment model.
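The idea behind the match cost can be illustrated by comparing how compactly the expression values at two time points are described by one shared Gaussian versus two separate Gaussians. The sketch below is a simplified stand-in, not the MML formulation used by the tool: it approximates each message length as a crude parameter penalty plus the Gaussian negative log-likelihood, and all function names are hypothetical:

```python
import math

def fit_gaussian(xs, min_var=1e-6):
    """Maximum-likelihood mean and variance, with a floor on the variance."""
    mu = sum(xs) / len(xs)
    var = max(sum((x - mu) ** 2 for x in xs) / len(xs), min_var)
    return mu, var

def approx_length(xs):
    """Approximate two-part code length in nats (a simplification, not MML)."""
    mu, var = fit_gaussian(xs)
    nll = sum(0.5 * math.log(2 * math.pi * var) + (x - mu) ** 2 / (2 * var) for x in xs)
    penalty = 0.5 * 2 * math.log(len(xs))  # two parameters: mean and variance
    return penalty + nll

def match_cost(ref_vals, query_vals):
    """Length under one shared model minus length under two independent models."""
    unified = approx_length(ref_vals + query_vals)
    independent = approx_length(ref_vals) + approx_length(query_vals)
    return unified - independent

ref = [1.0, 1.1, 0.9, 1.05, 0.95]
similar = [1.02, 0.98, 1.08, 0.93, 1.0]
different = [5.0, 5.1, 4.9, 5.05, 4.95]
print(match_cost(ref, similar), match_cost(ref, different))
```

When the two samples come from similar distributions, the shared model is at least as compact and the cost is low; when they differ in mean or variance, the separate models compress the data better and the cost of declaring a match rises.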
The core of dynamic programming is to calculate a score matrix by propagating optimal scores from previous steps, ultimately deriving the optimal alignment path through backtracking.
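A toy version of this score propagation and backtracking is sketched below. It is a simplified illustration under stated assumptions: absolute difference stands in for the MML match cost, fixed `gap` and `warp` penalties stand in for the state transition costs, and the code does not reproduce the paper's extended Gotoh recurrences. The function `align` and its parameters are hypothetical:

```python
import math

STATES = "MVWID"    # match, expansion warp, compression warp, insertion, deletion
MATCH_LIKE = "MVW"  # states in which the current pair of points is matched

def align(ref, query, gap=1.0, warp=0.2):
    """Return (state_string, total_cost) for two 1-D sequences."""
    n, m = len(ref), len(query)
    # cost[i][j][s]: cheapest alignment of ref[:i] with query[:j] ending in state s
    cost = [[dict.fromkeys(STATES, math.inf) for _ in range(m + 1)] for _ in range(n + 1)]
    back = [[{} for _ in range(m + 1)] for _ in range(n + 1)]
    cost[0][0]["M"] = 0.0  # convention: the empty alignment counts as match-like

    def relax(i, j, s, pi, pj, allowed, step):
        for p in allowed:
            c = cost[pi][pj][p] + step
            if c < cost[i][j][s]:
                cost[i][j][s] = c
                back[i][j][s] = (pi, pj, p)

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                d = abs(ref[i - 1] - query[j - 1])  # stand-in for the match cost
                relax(i, j, "M", i - 1, j - 1, STATES, d)
                # Warps reuse a point that is already matched, so they may only
                # follow a match-like state.
                relax(i, j, "W", i - 1, j, MATCH_LIKE, d + warp)
                relax(i, j, "V", i, j - 1, MATCH_LIKE, d + warp)
            if i > 0:
                relax(i, j, "D", i - 1, j, STATES, gap)  # reference point unmatched
            if j > 0:
                relax(i, j, "I", i, j - 1, STATES, gap)  # query point unmatched

    # Backtrack from the cheapest final state to the origin.
    s = min(STATES, key=lambda t: cost[n][m][t])
    total, i, j, path = cost[n][m][s], n, m, []
    while (i, j) != (0, 0):
        path.append(s)
        i, j, s = back[i][j][s]
    return "".join(reversed(path)), total

print(align([0, 1, 2, 3], [0, 1, 2, 3, 10, 11]))  # ('MMMMII', 2.0)
```

Unlike plain DTW, the two extra query points are reported as insertions rather than being forced onto the last reference point, because leaving them unmatched is cheaper than matching them.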
Research Findings
1. Validation with Simulated Data
The study first validated g2g’s performance on simulated data comprising three types of alignment patterns: matching, trajectory divergence, and trajectory convergence (3,500 trajectory pairs in total). Results showed that g2g significantly outperformed existing methods such as CellAlign and TrAGEDy in gene-level alignment accuracy, achieving accuracy rates of over 99%. In the divergence and convergence cases in particular, g2g accurately captured the correct match and mismatch regions and predicted the distribution of mismatch lengths more precisely.
2. Applications to Real-world Biological Data
a. Analysis of Gene Dynamics in Inflammatory Models
In a dataset of murine bone-marrow-derived dendritic cells under two stimulation conditions (Pam and LPS), g2g identified early mismatches and late high-expression mismatches in key antiviral genes such as IRF7 and STAT2, revealing shifts in cell subpopulation responses during immune activation.
b. Comparison of Cell Differentiation in Pulmonary Fibrosis
The study compared epithelial cell differentiation trajectories from healthy lungs and idiopathic pulmonary fibrosis (IPF) lungs. It found that pathological differentiation into aberrant basaloid cells in IPF is associated with early changes in epithelial-mesenchymal transition (EMT)-related genes like NNMT and CAMK1D, suggesting potential regulatory targets in IPF development.
c. Optimization of T Cell Differentiation In Vitro
When comparing the in vitro differentiation of induced pluripotent stem cells into T cells with in vivo thymic development, g2g revealed that TNF signaling was missing during in vitro T cell maturation. Adding TNF during the late stage of in vitro differentiation made the resulting T cells more closely resemble their in vivo counterparts.
3. Conclusions and Potential Impact
This study introduces an innovative framework for single-cell trajectory alignment and demonstrates its potential applications in uncovering dynamic gene expression patterns and optimizing in vitro experimentation. Its precision in identifying differential gene expression dynamics has implications for disease modeling, organoid optimization, discovering therapeutic targets, and more.
4. Key Highlights
- Novel Algorithm Design: Combines dynamic programming with information theory, overcoming limitations of traditional DTW.
- Single-gene Resolution: Uncovers trajectory differences at the molecular level.
- Wide Applicability: Performs well across simulated and real-world datasets and is applicable to various single-cell contexts.
- User-friendly Tool: An open-source implementation lowers the barrier to single-cell trajectory analysis.
The introduction of g2g opens new directions for single-cell trajectory research, with important implications in fields such as disease modeling, cell state characterization, and optimization of in vitro experiments.