DeepES: Deep Learning-Based Enzyme Screening for Identifying Orphan Enzyme Genes

Academic Background

With the rapid advancement of sequencing technology, scientists have been able to obtain a vast amount of protein sequence data, including many enzyme sequences. However, despite the establishment of large enzyme databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and BRENDA, sequence information for many enzymes is still missing. These enzymes lacking sequence information are referred to as “orphan enzymes.” The existence of orphan enzymes severely hinders sequence similarity-based functional annotation, creating significant gaps in understanding the relationship between sequences and enzymatic reactions.

The issue of orphan enzymes extends beyond the lack of sequence information and affects our understanding of biological processes. For example, many metabolic processes in the human gut microbiota, such as the production of short-chain fatty acids (SCFAs), are closely related to gut inflammation and cancer progression. However, many of these reactions involve orphan enzymes, so the associated genes cannot be identified by sequence similarity. Therefore, developing a method to predict enzyme activity without relying on sequence similarity is crucial to bridging this gap.

Source of the Paper

The paper titled “DeepES: A Deep Learning-Based Enzyme Screening Tool for Identifying Orphan Enzyme Genes” was authored by Keisuke Hirota, Felix Salim, and Takuji Yamada, among others. The research team is affiliated with the School of Life Science and Technology at the Institute of Science Tokyo and collaborated with companies such as Metagen Inc., Metagen Therapeutics Inc., and Digzyme Inc. The paper was published on February 6, 2025, in the journal Bioinformatics and is available as an open-access article.

Research Process

1. Research Objectives and Framework Design

The core objective of DeepES is to develop a deep learning-based tool for identifying orphan enzyme genes. The research team adopted “Reaction Class” (RClass) as the classification standard for enzyme activity. RClass is based on the chemical transformation patterns of substrate-product pairs. Compared to the traditional Enzyme Commission (EC) number, RClass can handle incomplete reactions, making it more suitable for identifying orphan enzymes.

The workflow of DeepES consists of three main steps:

  1. Input Gene Sequences: Extract consecutive genes from the genome.
  2. RClass Classifier Prediction: Use a pre-trained protein language model (ESM-2) to convert each protein sequence into a vector representation, then use a multilayer perceptron (MLP) to predict whether the gene corresponds to a specific RClass.
  3. Biosynthetic Gene Cluster (BGC) Evaluation: Evaluate whether a run of consecutive genes is likely to encode the target enzymes by calculating the geometric mean of their predicted probabilities.
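The third step of the workflow can be illustrated with a minimal sketch. It assumes each gene already has one predicted probability, listed in genomic order, and that candidate clusters are fixed-size windows of consecutive genes. The window size and the function names `geometric_mean`, `score_windows`, and `best_window` are illustrative assumptions, not part of DeepES.

```python
import math


def geometric_mean(probs):
    """Geometric mean of probabilities; 0.0 if any probability is zero."""
    if not probs:
        raise ValueError("probs must be non-empty")
    if any(p <= 0.0 for p in probs):
        return 0.0
    # Average in log space to avoid underflow on long windows.
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


def score_windows(gene_probs, window):
    """Score every run of `window` consecutive genes.

    gene_probs: per-gene probabilities in genomic order (assumed input).
    Returns a list of (start_index, score) tuples.
    """
    if window < 1 or window > len(gene_probs):
        return []
    return [
        (i, geometric_mean(gene_probs[i:i + window]))
        for i in range(len(gene_probs) - window + 1)
    ]


def best_window(gene_probs, window):
    """Return the (start_index, score) of the highest-scoring window, or None."""
    scores = score_windows(gene_probs, window)
    return max(scores, key=lambda s: s[1]) if scores else None
```

Because the geometric mean is pulled down sharply by any single low probability, a window scores highly only when every gene in it is a confident prediction, which fits the goal of finding whole clusters rather than isolated hits.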

2. Dataset and Model Development

The research team obtained prokaryotic gene sequences and RClass data from the KEGG database to construct a training and testing dataset containing 4,413,823 data points. To address the class imbalance issue in RClass, the team adopted a weighted loss function and trained independent binary classifiers for each RClass.
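A weighted loss for one binary RClass classifier might look like the sketch below. The source does not say how the weights were chosen, so the ratio of negatives to positives used in `pos_weight` is an assumption (a common heuristic), and both function names are hypothetical.

```python
import math


def pos_weight(labels):
    """Weight for the positive class: number of negatives per positive.

    This ratio is an assumed heuristic, not the scheme from the paper.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    return n_neg / n_pos


def weighted_bce(probs, labels, w_pos, eps=1e-12):
    """Mean binary cross-entropy with positive examples scaled by w_pos."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total += -(w_pos * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)
```

With one positive among four examples, `pos_weight` returns 3.0, so a missed positive costs three times as much as a missed negative. Training one such classifier per RClass lets each use a weight matched to its own imbalance.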

During model development, the research team used the ESM-2 model, which can convert protein sequences into high-dimensional vector representations. Subsequently, predictions for each RClass were made using a multilayer perceptron. To optimize model performance, the team conducted hyperparameter tuning, including learning rate, hidden layer size, and dropout rate.
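The classifier head can be sketched in plain Python as follows. The per-residue vectors stand in for ESM-2 output, which is not computed here. Mean pooling, a single hidden layer, and ReLU activation are assumptions for illustration, since the source names only the tuned hyperparameters. Dropout is omitted because it is active only during training.

```python
import math


def mean_pool(residue_vectors):
    """Average per-residue vectors into one fixed-length protein vector."""
    n = len(residue_vectors)
    dim = len(residue_vectors[0])
    return [sum(v[j] for v in residue_vectors) / n for j in range(dim)]


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mlp_probability(x, w1, b1, w2, b2):
    """One-hidden-layer MLP returning P(gene belongs to the RClass).

    w1, b1: hidden-layer weights and biases (hidden size is a hyperparameter).
    w2, b2: output-layer weight vector and scalar bias.
    """
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]  # ReLU
    logit = linear(hidden, [w2], [b2])[0]
    return sigmoid(logit)
```

In practice the weights would be learned by gradient descent in a deep learning framework; this forward pass shows only how an embedding becomes a single probability.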

3. Model Validation and Performance Testing

To validate DeepES’s predictive ability when sequence homology is low, the research team constructed a small-scale, non-redundant validation dataset. Through leave-one-out cross-validation, the team found that DeepES could predict enzyme activity with high accuracy without relying on sequence similarity.
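Leave-one-out cross-validation holds out each sample in turn, trains on the rest, and tests on the held-out sample. The sketch below shows the procedure with a toy nearest-centroid classifier standing in for the real model. The classifier and all function names are assumptions chosen to keep the example self-contained.

```python
def leave_one_out(samples, labels, fit, predict):
    """Return the fraction of held-out samples predicted correctly."""
    correct = 0
    for i in range(len(samples)):
        # Train on everything except sample i.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def fit_centroids(xs, ys):
    """Toy stand-in model: the mean vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_centroid(centroids, x):
    """Predict the class whose centroid is closest to x."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist2(centroids[y]))
```

A call such as `leave_one_out(samples, labels, fit_centroids, predict_centroid)` returns an accuracy between 0 and 1. On a non-redundant dataset the held-out sequence has no close homolog in the training set, so this accuracy reflects what the model learned beyond sequence similarity.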

Additionally, the research team conducted a biosynthetic gene cluster (BGC) detection test by treating known enzymes as “pseudo-orphan enzymes” and testing whether DeepES could identify candidate genes for these enzymes. The results showed that DeepES identified these BGCs well, and that predictions became significantly more reliable at high thresholds.
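The effect of the threshold can be illustrated with a small sketch that measures precision and recall over scored candidates. The source does not state which metrics were used, so precision and recall are assumptions, as are the function name `precision_recall_at` and the input format of `(score, is_true_hit)` pairs.

```python
def precision_recall_at(scored, threshold):
    """Precision and recall for candidates with score >= threshold.

    scored: list of (score, is_true_hit) pairs, where is_true_hit marks
            whether the candidate matches the known (pseudo-orphan) enzyme.
    Returns (precision, recall); either is None when undefined.
    """
    called = [hit for score, hit in scored if score >= threshold]
    total_true = sum(1 for _, hit in scored if hit)
    true_pos = sum(1 for hit in called if hit)
    precision = true_pos / len(called) if called else None
    recall = true_pos / total_true if total_true else None
    return precision, recall
```

Raising the threshold typically trades recall for precision: fewer candidates are reported, but a larger share of them are correct.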

4. Application Example: Orphan Enzymes in Human Gut Microbiota

The research team applied DeepES to 4,744 metagenome-assembled genomes (MAGs) from human gut microbiota and successfully identified candidate genes for 236 orphan enzymes. These orphan enzymes are involved in various metabolic pathways, particularly the production of short-chain fatty acids. The team also found that the predicted results for certain orphan enzyme genes were highly consistent with known metabolic functions, such as pathways related to aromatic compounds and isoprenoids.

Research Results and Conclusions

The development of DeepES provides a novel method for identifying orphan enzyme genes. By combining deep learning models with biosynthetic gene cluster information, DeepES can efficiently screen candidate genes without relying on sequence similarity. The research results demonstrate that DeepES has high accuracy and reliability in identifying orphan enzyme genes, particularly when handling data with low sequence homology.

Furthermore, the successful application of DeepES has revealed potential genes for many unknown metabolic pathways in human gut microbiota, especially those related to short-chain fatty acid production. These findings not only enhance our understanding of gut microbial metabolism but also open new research directions for related diseases.

Research Highlights

  1. Innovative Deep Learning Framework: DeepES is the first to combine deep learning models with biosynthetic gene cluster information, providing a novel solution for identifying orphan enzyme genes.
  2. Efficient RClass Classifiers: Through independent RClass binary classifiers, DeepES can predict enzyme activity with high accuracy without relying on sequence similarity.
  3. Broad Application Prospects: DeepES is not only applicable to prokaryotes but also has the potential to be applied to plants and fungi, giving it broad scope in future research.
  4. Bridging the Gap Between Sequence and Function: DeepES’s successful application provides an essential tool for understanding the relationship between sequence data and biological function, particularly in handling orphan enzyme genes.

Research Value and Significance

The development of DeepES not only holds significant scientific value but also provides new tools for biotechnology and medical research. By identifying orphan enzyme genes, scientists can better understand the complexity of metabolic pathways, especially in areas such as human gut microbiota. Additionally, the successful application of DeepES opens new research directions for drug development and disease treatment, particularly in metabolism-related diseases such as gut inflammation and cancer.

DeepES provides a crucial solution for bridging the gap between sequence data and biological function, paving the way for future research.