Benchmarking Algorithms for Gene Set Scoring of Single-Cell ATAC-Seq Data

Authors: Xi Wang, Qiwei Lian, Haoyu Dong, Shuo Xu, Yaru Su, Xiaohui Wu
Affiliations: Pasteurien College (Soochow University Medical College); Department of Automation, Xiamen University; School of Mathematics and Computer Science, Fuzhou University
Corresponding Author: xhwu@suda.edu.cn
Journal: Genomics, Proteomics & Bioinformatics
Publication Date: February 9, 2024 (Online)

Introduction

Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) is a powerful and widely used epigenomic technique for profiling genome-wide chromatin accessibility. More recently, single-cell ATAC-seq (scATAC-seq) has made it possible to study chromatin accessibility at the single-cell level, revealing new cell subpopulations and their chromatin regulatory mechanisms. However, the development of computational methods for scATAC-seq has lagged significantly behind that for single-cell RNA sequencing (scRNA-seq). Gene set scoring (GSS) has been widely applied to RNA-seq data, but few GSS tools are designed specifically for scATAC-seq. To fill this gap, this study comprehensively benchmarked ten GSS tools, including both single-cell and bulk RNA-seq tools, and compared their performance on scATAC-seq data.

Methods

This study systematically evaluated ten GSS tools: four bulk RNA-seq tools (PLAGE, Z-score, ssGSEA, GSVA), five single-cell RNA-seq tools (AUCell, pagoda2, VISION, VAM, unipath), and one scATAC-seq-specific tool (unipathatac). The evaluation used eight independent scATAC-seq datasets and three paired scATAC-seq and scRNA-seq datasets. The workflow consisted of data preprocessing, gene activity transformation (converting peak-level accessibility into gene-level activity scores), GSS application, and evaluation of the results. Because scATAC-seq data are highly sparse, the study also assessed how different imputation methods affect GSS results. Finally, it provided practical guidelines for selecting preprocessing methods and GSS tools in different application scenarios.
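To make the "GSS application" step concrete, the following is a minimal sketch of the combined Z-score approach applied to a gene activity matrix. It is an illustration under stated assumptions, not the implementation benchmarked in the study: the function name `zscore_gene_set_score`, the toy matrix, and the marker genes are hypothetical, and the input is assumed to be a genes x cells matrix that has already gone through gene activity transformation.

```python
import numpy as np

def zscore_gene_set_score(activity, gene_names, gene_set):
    """Combined Z-score for one gene set (illustrative sketch).

    activity:   genes x cells array of gene activity values (assumed input).
    gene_names: row labels of `activity`.
    gene_set:   genes belonging to the set being scored.
    Returns one score per cell.
    """
    activity = np.asarray(activity, dtype=float)
    wanted = set(gene_set)
    idx = [i for i, g in enumerate(gene_names) if g in wanted]
    if not idx:
        raise ValueError("no gene from the gene set is present in the matrix")

    sub = activity[idx, :]
    mean = sub.mean(axis=1, keepdims=True)
    std = sub.std(axis=1, ddof=1, keepdims=True)
    std[std == 0] = 1.0  # constant genes then contribute a z-score of zero

    # Standardize each gene across cells, then combine the z-scores:
    # sum over genes in the set, divided by sqrt(number of genes).
    z = (sub - mean) / std
    return z.sum(axis=0) / np.sqrt(len(idx))

# Hypothetical toy data: 4 genes x 4 cells (two T-cell-like, two B-cell-like).
genes = ["CD3E", "CD3D", "MS4A1", "CD79A"]
activity = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [6.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 7.0, 6.0],
    [1.0, 0.0, 5.0, 6.0],
])

t_cell_scores = zscore_gene_set_score(activity, genes, ["CD3E", "CD3D"])
print(t_cell_scores)
```

In this toy example the first two cells receive higher scores for the T-cell gene set than the last two. On real scATAC-seq data the matrix is far sparser, which is why the effect of imputation before scoring is worth assessing.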

Main Results

  • GSS Tool Applicability Assessment: RNA-seq GSS tools were tested on scATAC-seq data and performed comparably to how they perform on scRNA-seq data. Notably, pagoda2 and PLAGE performed best across multiple datasets and scenarios.
  • Impact of Gene Activity Transformation and Imputation: The choice of gene activity (GA) transformation tool had limited impact on GSS, but imputation significantly improved the performance of almost all GSS tools. Imputation with SCALE and DrImpute performed best.
  • GSS Tool Performance: pagoda2 and PLAGE performed best on raw data, while VISION showed the best overall performance after imputation. The performance of each GSS tool depended on the data and the preprocessing steps.
  • Influence of Gene Sets: The impact of different gene sets was relatively small, but using multiple gene sets for comparative analysis can provide more comprehensive biological interpretations.
  • Computational Speed: VISION and Z-score had the shortest running times and are recommended for small-scale dataset analysis where speed is a priority.
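Since PLAGE is among the best performers on raw data in the results above, a short sketch of its core idea may help: standardize each gene in the set across cells, take the singular value decomposition of that submatrix, and use the first right singular vector as the per-cell activity score. This is a simplified illustration, not the implementation evaluated in the study; the function name `plage_score` and the toy matrix are hypothetical.

```python
import numpy as np

def plage_score(activity, gene_idx):
    """PLAGE-style score for one gene set (illustrative sketch).

    activity: genes x cells array of gene activity values (assumed input).
    gene_idx: row indices of the genes in the set.
    Returns a unit-length vector with one score per cell.
    """
    sub = np.asarray(activity, dtype=float)[gene_idx, :]
    std = sub.std(axis=1, ddof=1, keepdims=True)
    std[std == 0] = 1.0  # avoid division by zero for constant genes
    z = (sub - sub.mean(axis=1, keepdims=True)) / std

    # The first right singular vector captures the dominant pattern of
    # variation of the gene set across cells.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return vt[0]

# Hypothetical toy data: rows 0-1 are higher in cells 0-1, rows 2-3 in cells 2-3.
activity = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [6.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 7.0, 6.0],
    [1.0, 0.0, 5.0, 6.0],
])

scores = plage_score(activity, [0, 1])
print(scores)
```

The sign of a singular vector is arbitrary, so only the contrast between cells is meaningful: here cells 0 and 1 fall on one side and cells 2 and 3 on the other, but which side is positive can vary.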

Conclusions and Application Value

Through systematic benchmarking, this study validated the applicability of RNA-seq GSS tools to scATAC-seq data. The results show that pagoda2 and PLAGE perform best on unimputed raw data and are recommended in that setting, while VISION is the best choice after imputation. Imputation can significantly affect GSS results, with SCALE or DrImpute improving accuracy, whereas the choice of gene activity transformation has a more limited effect. Together, these results provide practical guidelines for selecting scATAC-seq data processing and analysis tools and support future work in single-cell epigenomics.

Experimental Highlights

  • Tool Applicability: Evaluated the applicability of RNA-seq GSS tools to scATAC-seq data, extending established scoring methods to a new data type.
  • Comprehensive Assessment: Systematically analyzed the impact of imputation and gene activity transformation on analysis results, providing detailed comparative data and evaluation metrics.
  • Practical Guidelines: Offered clear tool selection guidelines, providing valuable references for researchers when handling different types of single-cell data.

Other Valuable Information

All datasets used in this study are publicly available, and analysis scripts and detailed data processing workflows are also openly shared, making this study highly reproducible and valuable for widespread application.