Gene Selection for Single Cell RNA-seq Data via Fuzzy Rough Iterative Computation Model

Background

Single-cell RNA sequencing (scRNA-seq) has been widely applied in biomedical research in recent years because it reveals the heterogeneity of gene expression at the single-cell level, making it an important tool for understanding cell types, cell states, and disease mechanisms. However, scRNA-seq data are characterized by small sample sizes, high dimensionality, and high noise, which makes gene selection a necessary step before clustering and classification. Traditional statistical analysis and machine learning methods often run into the “curse of dimensionality” on such high-dimensional data. How to select a representative subset from a vast number of genes has therefore become an active research topic.

To address this issue, the authors propose a gene selection method based on the Fuzzy Rough Iterative Computation Model (FRIC-Model). By introducing a fuzzy symmetric relation and an iterative computation strategy, the method is designed to overcome the limitations of classical rough set models and fuzzy rough set models in handling scRNA-seq data, with the aim of improving the efficiency and accuracy of gene selection.

Source of the Paper

This paper was co-authored by Zhaowen Li, Jie Zhang, Yuxian Wang, Fang Liu, and Ching-Feng Wen, and was published in the journal Artificial Intelligence Review on March 24, 2025. The authors are affiliated with multiple research institutions, including the Chinese Academy of Sciences and Tsinghua University. This research was supported by the National Natural Science Foundation of China.

Research Process

1. Definition and Construction of Fuzzy Symmetric Relation

In the Single Cell Gene Decision Space (SCGD-Space), the authors first define a fuzzy symmetric relation. Traditional rough set models rely on strict equivalence relations, which are difficult to apply to scRNA-seq data because of its high noise and sparsity. The authors therefore replace the equivalence relation with one based on the distance between gene expression values and introduce two variable parameters: one controls the gene subset, and the other governs the distance between gene expression values. In this way, the fuzzy symmetric relation better describes the similarity between gene expression values.
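As a rough illustration of a distance-based fuzzy symmetric relation, the sketch below builds a relation matrix over cells. The summary does not give the paper's formula, so the form used here is an assumption borrowed from common fuzzy rough set practice: per-gene similarity is 1 minus the absolute difference when that difference is within a threshold, 0 otherwise, and the genes in the subset are combined with a minimum. The names fuzzy_symmetric_relation, genes, and lam are hypothetical.

```python
def fuzzy_symmetric_relation(X, genes, lam):
    """Return an n x n fuzzy relation matrix over the cells (rows) of X.

    X     : list of rows; expression values assumed already scaled to [0, 1]
    genes : column indices of the gene subset (first parameter)
    lam   : distance threshold in [0, 1] (second parameter)

    Assumed form, not taken from the paper: similarity on one gene is
    1 - |difference| when the difference is at most lam, else 0; the
    relation on a subset is the minimum over its genes.
    """
    n = len(X)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sim = 1.0
            for g in genes:
                d = abs(X[i][g] - X[j][g])
                sim = min(sim, 1.0 - d if d <= lam else 0.0)
            # Writing both entries at once makes the matrix symmetric.
            R[i][j] = R[j][i] = sim
    return R


# Toy data: three cells, three genes (made-up values).
X = [
    [0.10, 0.90, 0.00],
    [0.15, 0.80, 0.00],
    [0.90, 0.10, 0.70],
]
R = fuzzy_symmetric_relation(X, genes=[0, 1], lam=0.2)
for row in R:
    print([round(v, 2) for v in row])
```

Unlike an equivalence relation, this matrix is reflexive and symmetric but need not be transitive, which is what lets it tolerate small, noisy differences in expression.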

2. Establishment of the Fuzzy Rough Iterative Computation Model (FRIC-Model)

Based on the fuzzy symmetric relation, the authors propose the FRIC-Model. Through an iterative computation strategy, the model defines a series of evaluation functions, including fuzzy rough approximations and dependency functions. These functions dynamically adjust the gene selection computation so that the algorithm converges. In this way, the FRIC-Model addresses the limitations of classical rough set models and fuzzy rough set models in handling scRNA-seq data.
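To make the terms "fuzzy rough approximation" and "dependency function" concrete, here is a minimal sketch using the textbook fuzzy rough definitions with crisp class labels. It is not the paper's iteration-specific formulation, which the summary does not spell out; the function names and the toy relation matrix are assumptions.

```python
def lower_approximation(R, labels, x):
    """Membership of cell x in the fuzzy lower approximation of its own class.

    With crisp labels, inf_y max(1 - R[x][y], D(y)) reduces to the minimum of
    1 - R[x][y] over cells y that carry a different label from x.
    """
    other = [1.0 - R[x][y] for y in range(len(R)) if labels[y] != labels[x]]
    return min(other) if other else 1.0


def dependency(R, labels):
    """Average positive-region membership over all cells.

    For a reflexive relation, the lower approximation of any class other than
    x's own is 0 at x, so the positive region at x is the value computed above.
    """
    n = len(R)
    return sum(lower_approximation(R, labels, x) for x in range(n)) / n


# Toy relation matrix over three cells (made-up values) and their cell types.
R = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.3],
    [0.0, 0.3, 1.0],
]
labels = ["T", "T", "B"]
gamma = dependency(R, labels)
print(round(gamma, 3))
```

A dependency close to 1 means that cells of different types are rarely similar under the chosen gene subset, so that subset separates the classes well.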

3. Design and Implementation of the Gene Selection Algorithm

Based on the FRIC-Model, the authors design a Gene Selection Algorithm (GSA). The algorithm repeatedly updates the fuzzy relation matrix to find the gene subset with the maximum dependency. As the number of iterations increases, the formula for the dependency function is adjusted dynamically so that the algorithm converges. In addition, the authors combine the algorithm with the Fisher Score method to reduce the initial dimensionality and improve classification performance.
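The two stages described above can be sketched as follows. The Fisher score is the standard ratio of between-class to within-class variance. The search step is a plain greedy forward selection, swapped in here as a stand-in because the summary does not detail the paper's own iteration. The dependency_of callback, toy_dependency, and the toy data are all hypothetical.

```python
def fisher_score(values, labels):
    """Fisher score of one gene: between-class over within-class variance."""
    n = len(values)
    mean_all = sum(values) / n
    between = within = 0.0
    for c in set(labels):
        group = [v for v, l in zip(values, labels) if l == c]
        mean_c = sum(group) / len(group)
        var_c = sum((v - mean_c) ** 2 for v in group) / len(group)
        between += len(group) * (mean_c - mean_all) ** 2
        within += len(group) * var_c
    if within == 0.0:
        return float("inf") if between > 0.0 else 0.0
    return between / within


def fisher_prefilter(X, labels, k):
    """Keep the indices of the k genes (columns of X) with the highest score."""
    n_genes = len(X[0])
    scores = [fisher_score([row[g] for row in X], labels) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:k]


def greedy_select(candidates, dependency_of, tol=1e-9):
    """Add the gene that raises dependency the most until nothing improves."""
    selected, best = [], 0.0
    remaining = list(candidates)
    while remaining:
        # Ties on dependency are broken by the larger gene index.
        dep, gene = max((dependency_of(selected + [g]), g) for g in remaining)
        if dep <= best + tol:
            break
        selected.append(gene)
        remaining.remove(gene)
        best = dep
    return selected


# Toy data: four cells, three genes; only gene 0 separates the classes.
X = [[1.0, 5.0, 1.0], [2.0, 5.0, 9.0], [8.0, 5.0, 2.0], [9.0, 5.0, 8.0]]
labels = ["A", "A", "B", "B"]
kept = fisher_prefilter(X, labels, k=1)


def toy_dependency(subset):
    # Hypothetical stand-in for a real dependency function.
    return min(1.0, 0.5 * len(set(subset) & {0, 2}))


chosen = greedy_select([0, 1, 2], toy_dependency)
print(kept, chosen)
```

Running the cheap Fisher filter first shrinks the candidate list, so the more expensive dependency evaluations in the search step are made over far fewer genes.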

4. Experimental Validation and Performance Evaluation

To verify the effectiveness of the proposed algorithm, the authors conducted experiments on multiple publicly available scRNA-seq datasets. The results show that, compared with existing algorithms, the proposed algorithm performs better in both gene selection efficiency and classification accuracy. Specifically, it significantly reduces the number of genes while maintaining high classification accuracy. It also runs quickly and uses less memory, making it suitable for processing large-scale datasets.

Main Results

1. Improvement in Gene Selection Efficiency

The experimental results show that the proposed algorithm significantly reduces the number of genes in all datasets, with a gene reduction ratio (Redr) of up to 97%. This indicates that the algorithm has a strong gene selection capability and can filter out the most representative subset from a vast number of genes.
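For readers unfamiliar with the metric, the snippet below shows one natural reading of the reduction ratio: the fraction of the original genes that are discarded. This definition is an assumption, since the summary does not state the paper's formula, and the gene counts are made up for illustration.

```python
def reduction_ratio(n_total, n_selected):
    """Fraction of genes removed by selection (assumed definition of Redr)."""
    if n_total <= 0 or not 0 <= n_selected <= n_total:
        raise ValueError("need 0 <= n_selected <= n_total and n_total > 0")
    return (n_total - n_selected) / n_total


# Hypothetical counts: keeping 600 of 20,000 genes removes 97% of them.
redr = reduction_ratio(20000, 600)
print(f"{redr:.0%}")
```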

2. Enhancement in Classification Accuracy

Experiments with K-Nearest Neighbor (KNN) and Classification and Regression Trees (CART) classifiers show that the proposed algorithm achieves higher classification accuracy than the raw data on 13 datasets. In particular, the algorithm achieves the highest classification accuracy on 7 datasets. This indicates that the selected gene subset can effectively improve classification performance.
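The comparison between raw data and a selected gene subset can be illustrated with a small leave-one-out KNN evaluation. This is only a sketch under stated assumptions: the evaluation protocol (leave-one-out, squared Euclidean distance), the function name knn_loo_accuracy, and the toy data are not taken from the paper, and CART is not shown.

```python
from collections import Counter


def knn_loo_accuracy(X, labels, genes, k=1):
    """Leave-one-out accuracy of KNN restricted to the given gene columns."""
    correct = 0
    for i, row in enumerate(X):
        # Squared Euclidean distance to every other cell on the chosen genes.
        dists = sorted(
            (sum((row[g] - other[g]) ** 2 for g in genes), j)
            for j, other in enumerate(X)
            if j != i
        )
        votes = Counter(labels[j] for _, j in dists[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(X)


# Toy data: gene 0 carries the class signal; genes 1 and 2 are large-scale
# noise that pairs each cell with a cell of the opposite class.
X = [
    [0.1, 5.0, 1.0],
    [0.2, 1.0, 4.0],
    [0.3, 9.0, 7.0],
    [0.9, 5.1, 1.1],
    [0.8, 1.1, 4.1],
    [0.7, 9.1, 7.1],
]
labels = ["A", "A", "A", "B", "B", "B"]

acc_raw = knn_loo_accuracy(X, labels, genes=[0, 1, 2])
acc_selected = knn_loo_accuracy(X, labels, genes=[0])
print(acc_raw, acc_selected)
```

In this deliberately extreme example the noisy genes dominate the distance, so dropping them raises accuracy. Real datasets show smaller and less uniform gains.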

3. Optimization of Algorithm Efficiency

Compared to existing algorithms, the proposed algorithm performs well in terms of execution speed and memory usage. The experimental results show that the algorithm is highly efficient when processing large-scale datasets, making it suitable for practical applications.

Conclusions and Significance

This paper proposes a gene selection method based on the Fuzzy Rough Iterative Computation Model. By introducing a fuzzy symmetric relation and an iterative computation strategy, the method overcomes the limitations of traditional rough set models in handling scRNA-seq data. The experimental results demonstrate that the algorithm performs well in both gene selection efficiency and classification accuracy, which gives it high application value. The algorithm also runs quickly and uses less memory, making it suitable for processing large-scale datasets.

Research Highlights

  1. Innovative Method: This paper is the first to apply fuzzy rough set theory to gene selection in scRNA-seq data, proposing a novel FRIC-Model that overcomes the limitations of traditional methods.
  2. High Efficiency: The proposed algorithm performs exceptionally well when processing large-scale datasets, significantly reducing the number of genes while maintaining high classification accuracy.
  3. Broad Applicability: The algorithm demonstrates superior performance on multiple publicly available datasets, indicating broad application prospects.

Future Outlook

Although the proposed algorithm has achieved significant results in gene selection, it still faces challenges when processing large-scale scRNA-seq data. Future research will focus on further improving the algorithm’s efficiency through batch updating and exploring its application on inconsistent data. Additionally, this research provides a theoretical foundation for gene selection in the biomedical field, and future studies will explore its value in clinical treatment.


Through this research, the authors not only propose an efficient gene selection method but also open up new directions for the application of fuzzy rough set theory in the biomedical field. This achievement is of great significance for advancing the development of single-cell RNA sequencing technology.