ACImpute: A Constraint-Enhancing Smooth-Based Approach for Imputing Single-Cell RNA Sequencing Data

Single-cell RNA sequencing (scRNA-seq) technology has been widely applied in biological and medical research in recent years. It can reveal the transcriptomic information of individual cells, helping scientists better understand cellular heterogeneity and complexity. However, a common issue in scRNA-seq data is “dropout events,” which result in many genes being recorded as zero in individual cells. These zero values can be categorized into two types: “biological zeros,” indicating that the gene is indeed not expressed in the cell, and “technical zeros,” caused by limitations in sequencing technology that prevent gene expression from being detected. This data sparsity severely affects the accuracy and effectiveness of downstream analyses, such as cell clustering and trajectory inference.

To address this issue, researchers have developed various imputation methods, including model-based imputation, data smoothing, and matrix decomposition. However, existing methods often suffer from oversmoothing when handling large-scale data, which can flatten intercellular heterogeneity and compromise the accuracy of analysis results. Therefore, developing an imputation method that effectively restores gene expression while preserving intercellular heterogeneity has become an important research direction.

Source of the Paper

This paper was co-authored by Wei Zhang, Tiantian Liu, Han Zhang, and Yuanyuan Li from the School of Mathematics and Physics at Wuhan Institute of Technology, with Yuanyuan Li as the corresponding author. Titled “ACImpute: A Constraint-Enhancing Smooth-Based Approach for Imputing Single-Cell RNA Sequencing Data,” it was published in 2025 in the journal Bioinformatics. The code for the paper is open-source on GitHub for researchers to use and improve.

Research Process and Results

1. Data Preprocessing

The first step of the research was preprocessing the raw scRNA-seq data. Because technical factors such as experimental procedures and variations in capture efficiency can affect cells differently during sequencing, the data required normalization. Highly variable genes were then selected from the normalized matrix based on the coefficient of variation to improve the accuracy of subsequent clustering.

Results: The normalized data matrix and highly variable gene matrix laid the foundation for the subsequent imputation analysis.
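
As a rough illustration of this step, the following Python sketch normalizes a cells-by-genes count matrix and keeps the genes with the largest coefficient of variation. The article does not give the exact normalization formula, so the library-size scaling, the log transform, the function names, and the n_top parameter are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def normalize_counts(counts, scale=1e4):
    """Library-size normalize a cells x genes matrix, then log1p-transform.

    Assumption: the paper's exact normalization is not stated in this article.
    """
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=1, keepdims=True)
    lib_size[lib_size == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / lib_size * scale)

def select_hvg_by_cv(norm, n_top=2000):
    """Return sorted column indices of the n_top genes with the highest CV."""
    mean = norm.mean(axis=0)
    std = norm.std(axis=0)
    # Coefficient of variation = std / mean; genes with zero mean get CV 0.
    cv = np.divide(std, mean, out=np.zeros_like(std), where=mean > 0)
    n_top = min(n_top, norm.shape[1])
    return np.sort(np.argsort(cv)[::-1][:n_top])

# Toy data: 50 cells x 200 genes of simulated counts.
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(50, 200))
norm = normalize_counts(counts)
hvg_idx = select_hvg_by_cv(norm, n_top=20)
hvg_matrix = norm[:, hvg_idx]
```

Here hvg_matrix plays the role of the highly variable gene matrix that feeds the later steps.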

2. Calculation of the Markov Transition Matrix

Next, the research team used the highly variable gene matrix to calculate a stable transition probability matrix (Markov transition matrix). First, principal component analysis (PCA) reduced the dimensionality of the data, which suppresses noise and improves computational efficiency. Then, an affinity matrix was calculated with a K-nearest neighbors (KNN) strategy, and the Markov transition matrix was obtained by symmetrizing and normalizing it.

Results: The stable transition probability matrix provided information on intercellular transition probabilities for subsequent imputation.
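
The pipeline described above (PCA, KNN affinity, symmetrization, row normalization) can be sketched in Python as follows. The Gaussian kernel with a per-cell bandwidth equal to the distance to the k-th neighbor is an assumption borrowed from common diffusion-based imputation practice; the article does not specify the kernel, and the function names and default parameters are hypothetical.

```python
import numpy as np

def pca_reduce(x, n_components=10):
    """Project cells onto the top principal components using an SVD."""
    x = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    k = min(n_components, len(s))
    return u[:, :k] * s[:k]

def markov_transition_matrix(x, n_components=10, k=5):
    """Build a row-stochastic cell-to-cell transition matrix from a KNN graph."""
    z = pca_reduce(x, n_components)
    # Pairwise Euclidean distances between cells in PCA space.
    sq = (z ** 2).sum(axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * z @ z.T, 0.0))
    np.fill_diagonal(dist, 0.0)
    n = dist.shape[0]
    k = min(k, n - 1)
    order = np.argsort(dist, axis=1)      # column 0 is normally the cell itself
    neighbors = order[:, :k + 1]          # the cell plus its k nearest neighbors
    # Assumed kernel: Gaussian with bandwidth = distance to the k-th neighbor.
    sigma = dist[np.arange(n), order[:, k]] + 1e-12
    rows = np.repeat(np.arange(n), k + 1)
    cols = neighbors.ravel()
    affinity = np.zeros((n, n))
    affinity[rows, cols] = np.exp(-(dist[rows, cols] / sigma[rows]) ** 2)
    affinity = affinity + affinity.T      # symmetrization
    return affinity / affinity.sum(axis=1, keepdims=True)  # row normalization

rng = np.random.default_rng(1)
x = rng.normal(size=(30, 20))             # toy stand-in for the HVG matrix
P = markov_transition_matrix(x, n_components=5, k=5)
```

Each row of P sums to 1, so row i can be read as the probabilities of moving from cell i to its neighbors.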

3. Calculation of the Power Exponent

To further optimize the imputation results, the research team designed a power exponent matrix based on the negative correlation between gene expression levels and dropout rates. Specifically, genes with lower expression levels have higher dropout rates, so the transition probabilities of low-expression genes should be more constrained during imputation. Through normalization, the range of the power exponent matrix was limited to between 1 and 3.

Results: The power exponent matrix effectively constrained the transition probabilities of genes with different expression levels, preventing oversmoothing.
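
One way to picture this step is the Python sketch below, which maps each gene's mean expression to an exponent in the range 1 to 3. The article gives only the range and the general idea, so the min-max scaling, the choice to give the lowest-expressed gene the largest exponent, and the use of one exponent per gene are assumptions; the paper's actual formula may differ.

```python
import numpy as np

def power_exponents(norm, low=1.0, high=3.0):
    """Map per-gene mean expression to exponents in [low, high].

    Assumption: the lowest-expressed gene receives `high`, the
    highest-expressed gene receives `low`, with linear scaling in between.
    """
    mean = np.asarray(norm, dtype=float).mean(axis=0)
    span = mean.max() - mean.min()
    if span == 0:
        return np.full(mean.shape, low)  # all genes equally expressed
    scaled = (mean - mean.min()) / span  # min-max normalization to [0, 1]
    return high - (high - low) * scaled

# Toy normalized matrix: 2 cells x 3 genes with mean expression 0, 2 and 5.
norm = np.array([[0.0, 1.0, 4.0],
                 [0.0, 3.0, 6.0]])
exponents = power_exponents(norm)  # approximately [3.0, 2.2, 1.0]
```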

4. Imputation of Single-Cell Data

Finally, the research team combined the power exponent matrix and the transition probability matrix to calculate the imputation matrix. The imputation matrix was then reverse-normalized, and its values replaced the zero values in the original matrix to give the final imputation results.

Results: The imputed data matrix effectively restored gene expression while preserving intercellular heterogeneity.
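
The sketch below shows one plausible way to combine the two matrices in Python. It assumes that the exponent is applied element-wise to the transition probabilities for each gene and that the rows are then renormalized, which concentrates the weight on a cell's closest neighbors as the exponent grows. The article does not spell out this formula, so treat it as an illustration only. The sketch also stays in normalized space and leaves out the reverse-normalization step.

```python
import numpy as np

def impute_zeros(norm, transition, exponents):
    """Fill zero entries with exponent-constrained neighborhood averages.

    Assumption: gene g uses transition ** exponents[g], renormalized per row.
    The loop costs O(genes * cells^2), acceptable only for a small example.
    """
    norm = np.asarray(norm, dtype=float)
    imputed = norm.copy()
    for g in range(norm.shape[1]):
        p = transition ** exponents[g]          # element-wise power
        p = p / p.sum(axis=1, keepdims=True)    # rows sum to 1 again
        smoothed = p @ norm[:, g]
        zero = norm[:, g] == 0
        imputed[zero, g] = smoothed[zero]       # only zeros are replaced
    return imputed

# Toy example: 3 cells x 2 genes.
transition = np.array([[0.50, 0.50, 0.00],
                       [0.25, 0.50, 0.25],
                       [0.00, 0.50, 0.50]])
norm = np.array([[2.0, 0.0],
                 [0.0, 1.0],
                 [4.0, 3.0]])
exponents = np.array([1.0, 2.0])
imputed = impute_zeros(norm, transition, exponents)
# imputed is [[2.0, 0.5], [1.5, 1.0], [4.0, 3.0]]
```

Non-zero observations are left untouched, which matches the description of replacing only the zero values in the original matrix.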

Experimental Validation

1. Correlation Analysis

To validate the imputation performance of ACImpute, the research team used two datasets for correlation analysis. The first dataset used ERCC spike-in genes with known concentrations as a reference standard, while the second dataset used bulk RNA sequencing data as a reference standard. The results showed that ACImpute significantly outperformed other imputation methods in restoring gene expression.

Results: ACImpute performed excellently in correlation analysis, effectively restoring gene expression.
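
A comparison of this kind can be sketched in Python as below. The article does not say which correlation coefficient was used or how cells were aggregated, so the Pearson coefficient, the per-gene averaging across cells, and the function name are assumptions.

```python
import numpy as np

def correlation_with_reference(imputed, reference):
    """Pearson correlation between per-gene mean expression and a reference.

    `reference` holds one value per gene, for example known spike-in
    concentrations or bulk RNA-seq expression (assumed layout).
    """
    gene_means = np.asarray(imputed, dtype=float).mean(axis=0)
    reference = np.asarray(reference, dtype=float)
    return float(np.corrcoef(gene_means, reference)[0, 1])

# Toy example: 2 cells x 3 genes whose means are proportional to the reference.
imputed = np.array([[1.0, 2.0, 3.0],
                    [3.0, 4.0, 5.0]])
reference = np.array([20.0, 30.0, 40.0])
r = correlation_with_reference(imputed, reference)
```

A value of r closer to 1 indicates that the imputed expression tracks the reference more closely.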

2. Clustering Analysis

The research team further conducted clustering analysis on six real datasets, using three clustering evaluation metrics: Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and purity. The results showed that ACImpute outperformed other imputation methods in most datasets.

Results: ACImpute performed excellently in clustering analysis, effectively separating different cell types.
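
The three metrics can be computed in Python as sketched below, assuming scikit-learn is available. ARI and NMI come directly from scikit-learn; purity has no built-in function there, so it is derived from the contingency matrix. The helper names are hypothetical.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def purity_score(labels_true, labels_pred):
    """Fraction of cells that belong to the majority true class of their cluster."""
    cm = contingency_matrix(labels_true, labels_pred)  # rows: classes, cols: clusters
    return cm.max(axis=0).sum() / cm.sum()

def clustering_scores(labels_true, labels_pred):
    """Return ARI, NMI and purity for one clustering result."""
    return {
        "ARI": adjusted_rand_score(labels_true, labels_pred),
        "NMI": normalized_mutual_info_score(labels_true, labels_pred),
        "purity": purity_score(labels_true, labels_pred),
    }

# Toy example: one cell of class 0 is placed in the wrong cluster.
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 1, 1]
scores = clustering_scores(labels_true, labels_pred)

# A perfect clustering (up to label renaming) scores 1.0 on all three metrics.
perfect = clustering_scores([0, 0, 1, 1], [1, 1, 0, 0])
```

In the toy example five of the six cells sit in a cluster whose majority class matches their own, so the purity is 5/6.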

3. Trajectory Inference

Finally, the research team used the Monocle2 algorithm to perform trajectory inference analysis on the imputed data. The results showed that ACImpute outperformed other methods in trajectory inference, better reflecting the dynamic changes during cell differentiation.

Results: ACImpute performed excellently in trajectory inference, effectively revealing cell differentiation trajectories.

Conclusion and Significance

This paper proposes a smoothness-constrained imputation method, ACImpute, which effectively prevents oversmoothing by constraining the transition probabilities of genes with different expression levels. Experimental results demonstrate that ACImpute can effectively restore gene expression, preserve intercellular heterogeneity, and perform excellently in clustering analysis and trajectory inference. The development of ACImpute provides new insights for imputing scRNA-seq data, with significant scientific value and application prospects.

Research Highlights

  1. Innovation: ACImpute effectively prevents oversmoothing by constraining the transition probabilities of genes with different expression levels.
  2. Efficiency: ACImpute shows a runtime advantage on large-scale data, enabling rapid imputation analysis.
  3. Wide Applicability: ACImpute’s excellent performance in clustering analysis and trajectory inference makes it highly applicable in biological and medical research.

Future Prospects

Although ACImpute has made significant progress in imputation effectiveness, there is still room for improvement. For example, the choice of the parameter n used in calculating the power exponent matrix may affect the accuracy of imputation results. In the future, the research team plans to further optimize the algorithm so that it adapts better to different datasets and can distinguish between biological zeros and technical zeros.