Comprehensive Prediction and Analysis of Human Protein Essentiality Based on a Pretrained Large Language Model

Academic Background

Human essential proteins (HEPs) are crucial for individual survival and development. However, experimental methods for identifying HEPs are costly, time-consuming, and labor-intensive. In addition, existing computational methods predict HEPs only at the cell line level, even though HEPs vary significantly across live humans, cell lines, and animal models. A computational method that can predict HEPs comprehensively across these levels is therefore needed. Large language models (LLMs) have recently achieved remarkable success in natural language processing, and protein language models (PLMs), which are pretrained on large-scale protein sequence data, have emerged alongside them. However, it remains unknown whether PLMs can significantly improve performance on the task of protein essentiality prediction.

Source of the Paper

This paper was co-authored by Boming Kang, Rui Fan, Chunmei Cui, and Qinghua Cui, with Qinghua Cui as the corresponding author. The team is affiliated with the Department of Biomedical Informatics at the School of Basic Medical Sciences, Peking University, and the School of Sports Medicine at Wuhan Institute of Physical Education. The paper was published in Nature Computational Science in 2024.

Research Process

Data Collection

The research team collected protein essentiality data from multiple public databases, including gnomAD, OGEE-MGI, and the Project Score database. These data were used to train models at the human level (PIC-human), mouse level (PIC-mouse), and cell line level (PIC-cell). Specifically:

  • Human Level: 65,057 protein sequences and their corresponding LOEUF (loss-of-function observed/expected upper bound fraction) values were obtained from the gnomAD database, including 14,146 positive samples and 50,911 negative samples.
  • Mouse Level: 6,050 human protein sequences and their corresponding mouse protein essentiality labels were obtained from the OGEE database, including 443 positive samples and 5,607 negative samples.
  • Cell Line Level: Essentiality labels for 17,185 protein sequences across 323 different human cell lines were obtained from the Project Score database.
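
The sample counts above imply a noticeable class imbalance at both the human and mouse levels. As a rough illustration, the short Python sketch below recomputes the totals and the positive fraction from the reported counts. The dictionary layout and the function name class_balance are illustrative choices, not part of the original work.

```python
# Sample counts as reported in the list above.
DATASETS = {
    "human": {"positive": 14_146, "negative": 50_911},
    "mouse": {"positive": 443, "negative": 5_607},
}


def class_balance(counts):
    """Return (total, positive_fraction, negative_to_positive_ratio)."""
    total = counts["positive"] + counts["negative"]
    pos_fraction = counts["positive"] / total
    neg_pos_ratio = counts["negative"] / counts["positive"]
    return total, pos_fraction, neg_pos_ratio


for level, counts in DATASETS.items():
    total, frac, ratio = class_balance(counts)
    print(f"{level}: total={total}, positive fraction={frac:.3f}, neg/pos={ratio:.1f}")
```

Positives make up roughly 22% of the human-level samples and roughly 7% of the mouse-level samples, which is one reason threshold-free metrics such as AUROC and AUPRC are informative for this task.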

Model Architecture

The research team developed a deep learning model called Protein Importance Calculator (PIC) by fine-tuning a pretrained Protein Language Model (PLM) to predict protein essentiality. The PIC model consists of three main modules:

  1. Embedding Module: The ESM-2 model is used to convert protein sequences into fixed-dimensional numerical feature vectors.
  2. Attention Module: A multi-head attention mechanism captures the importance of amino acids at different positions in the protein sequence.
  3. Prediction Module: A Multi-Layer Perceptron (MLP) generates the predicted probability of the protein sequence.
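
The data flow through the three modules can be sketched in plain Python. This is a simplified illustration, not the authors' implementation: random vectors stand in for the ESM-2 residue embeddings, the attention step is a learned-query attention pooling rather than a full multi-head self-attention layer, and all sizes, weights, and function names (attention_pool, mlp_probability) are assumptions made for the example.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_pool(embeddings, head_queries):
    """Pool per-residue embeddings into one vector using several heads.

    Each head scores every residue against its own query vector, converts
    the scores to weights with softmax, and takes the weighted average of
    the residue embeddings. The head outputs are concatenated.
    """
    dim = len(embeddings[0])
    scale = math.sqrt(dim)
    pooled, weights_per_head = [], []
    for q in head_queries:
        weights = softmax([dot(q, e) / scale for e in embeddings])
        head_out = [sum(w * e[j] for w, e in zip(weights, embeddings))
                    for j in range(dim)]
        pooled.extend(head_out)
        weights_per_head.append(weights)
    return pooled, weights_per_head


def mlp_probability(x, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a sigmoid output."""
    hidden = [max(0.0, dot(row, x) + b) for row, b in zip(w1, b1)]
    logit = dot(w2, hidden) + b2
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical toy sizes; real ESM-2 embeddings are far larger.
rng = random.Random(0)
SEQ_LEN, DIM, HEADS, HIDDEN = 12, 8, 2, 4

# Stand-in for the embedding module output: one vector per residue.
embeddings = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(SEQ_LEN)]
# Randomly initialised parameters (these would be learned during fine-tuning).
head_queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(HEADS)]
w1 = [[rng.gauss(0, 0.5) for _ in range(DIM * HEADS)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w2 = [rng.gauss(0, 0.5) for _ in range(HIDDEN)]
b2 = 0.0

pooled, attn = attention_pool(embeddings, head_queries)
prob = mlp_probability(pooled, w1, b1, w2, b2)
print(f"pooled vector length: {len(pooled)}, predicted probability: {prob:.3f}")
```

The per-head weights in attn sum to 1 over the sequence positions, so they can be read as a rough indication of which residues contribute most to the prediction.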

Model Performance Evaluation

The research team evaluated the PIC model using accuracy, recall, precision, F1 score, Area Under the ROC Curve (AUROC), and Area Under the Precision-Recall Curve (AUPRC). The PIC-human model achieved the highest AUROC (0.9132), followed by the PIC-mouse model (0.8736). For PIC-cell, the median AUROC across cell lines was 0.8579. Compared with existing methods, PIC significantly improved prediction performance.
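
For readers unfamiliar with these metrics, the following Python sketch shows one way to compute AUROC, precision, recall, and F1 from predicted probabilities, and how a median AUROC over several cell lines can be obtained. The scores, labels, cell line names, and the 0.5 threshold are made-up toy values; they are not data or settings from the paper.

```python
from statistics import median


def auroc(scores, labels):
    """AUROC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_recall_f1(scores, labels, threshold=0.5):
    """Threshold the scores, then compute precision, recall, and F1."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy predictions for three hypothetical cell lines: (scores, labels).
toy_cell_lines = {
    "line_A": ([0.9, 0.8, 0.4, 0.6, 0.3, 0.1], [1, 1, 1, 0, 0, 0]),
    "line_B": ([0.7, 0.2, 0.6, 0.1], [1, 1, 0, 0]),
    "line_C": ([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]),
}

per_line_auroc = {name: auroc(s, y) for name, (s, y) in toy_cell_lines.items()}
median_auroc = median(per_line_auroc.values())
print(per_line_auroc)
print(f"median AUROC across toy cell lines: {median_auroc:.4f}")
print(precision_recall_f1(*toy_cell_lines["line_A"]))
```

The pairwise definition used here is quadratic in the number of samples, which is fine for illustration; for datasets of the size described above, a rank-based or library implementation would be the practical choice.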

Protein Essential Score (PES)

The research team defined the Protein Essential Score (PES) based on the probability values output by the PIC model and validated it through a series of biological analyses. PES showed significant positive correlations with several biological metrics: node degree in the protein-protein interaction network, expression level in normal tissue, expression level in cancer tissue, the sequence conservation scores phyloP and phastCons, and the number of related diseases.
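
A correlation check of this kind can be sketched as follows. This summary does not state which correlation coefficient the authors used, so the example assumes a Spearman rank correlation, and the PES values and node degrees are invented toy numbers used only to show the calculation.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy values: PES and PPI node degree for five proteins.
toy_pes = [0.10, 0.40, 0.35, 0.80, 0.90]
toy_degree = [2, 10, 7, 40, 35]

rho = spearman(toy_pes, toy_degree)
print(f"Spearman correlation on toy data: {rho:.2f}")
```

A rank-based coefficient is a natural fit for quantities such as node degree or expression level, whose distributions are typically heavy-tailed, but the same comparison could be run with another coefficient.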

Cross-Level Analysis

The research team also conducted cross-level analyses using PES and found significant differences in protein essentiality across human, cell line, and mouse levels. For example, substantial differences in protein essentiality were observed between non-solid tumors (e.g., acute myeloid leukemia) and solid tumors (e.g., breast cancer). Additionally, the team identified some proteins with high essentiality in specific tissues.

Case Studies

The research team validated the potential of PES in discovering prognostic biomarkers through a breast cancer case study. The results showed that eight out of ten proteins screened by PES effectively predicted breast cancer patient survival in multiple clinical cohorts. Furthermore, the team used PES to quantify the essentiality of 617,462 human microproteins and found that high-essentiality microproteins were primarily involved in fundamental biological processes such as cell division, cellular respiration, and DNA replication.

Conclusions and Significance

The PIC model, by fine-tuning pretrained protein language models, significantly improved the prediction performance of human protein essentiality and provided comprehensive prediction results across human, cell line, and mouse levels. The PES defined by the research team not only quantifies protein essentiality but also demonstrates potential in discovering prognostic biomarkers and drug targets. In the future, the PIC model is expected to play an important role in drug discovery, clinical therapeutics, and synthetic biology.

Research Highlights

  1. Significant Improvement in Prediction Performance: The PIC model outperformed existing methods in predicting protein essentiality at the human, cell line, and mouse levels.
  2. Cross-Level Analysis: The research team systematically analyzed differences in protein essentiality across multiple levels for the first time.
  3. Protein Essential Score (PES): PES provides an effective metric for quantifying protein essentiality and performed well in both the biological analyses and the validation for clinical applications.
  4. Case Study Validation: The breast cancer case study validated the potential of PES in discovering prognostic biomarkers and therapeutic targets.

Additional Value

The research team also developed a user-friendly web server (http://www.cuilab.cn/pic) for researchers to input candidate protein sequences and obtain prediction results of their essentiality at different levels. The server, built on Python 3, Flask, and NumPy, provides an easy-to-use interface and result download functionality.