Protein Structure Prediction: Challenges, Advances, and the Shift of Research Paradigms

Protein structure prediction is an interdisciplinary research topic that has attracted researchers from biochemistry, medicine, physics, mathematics, and computer science. These researchers bring different paradigms to the same problem. Biochemists and physicists try to uncover the principles of protein folding. Mathematicians, especially statisticians, usually assume a probability distribution over protein structures given a target sequence and then search for the most likely structure. Computer scientists treat prediction as an optimization problem: find the conformation with the lowest energy, or minimize the difference between the predicted structure and the native structure. Recently, deep learning has also achieved great success in protein structure prediction. This review surveys efforts in protein structure prediction and compares the research paradigms adopted in different fields, focusing on the paradigm shift in the deep learning era.
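
The statistical and optimization views described above can be written compactly. The notation below is introduced here for illustration and is not taken from the review: x is the target sequence, S a candidate conformation, E an energy function, f_θ a learned predictor with parameters θ, and d some measure of structural difference.

```latex
% Statistical view: choose the most probable structure given the sequence
\hat{S}_{\mathrm{stat}} = \arg\max_{S} \; P(S \mid x)

% Optimization view (energy): choose the lowest-energy conformation
\hat{S}_{\mathrm{opt}} = \arg\min_{S} \; E(S; x)

% Optimization view (learning): fit a predictor to known native structures
\theta^{*} = \arg\min_{\theta} \; d\bigl(f_{\theta}(x),\, S_{\mathrm{native}}\bigr)

% If one additionally assumes a Boltzmann distribution,
%   P(S \mid x) \propto \exp\bigl(-E(S; x) / k_{B} T\bigr),
% then the first two objectives select the same structure.
```

The Boltzmann link in the final comment is an extra assumption, not something the summary above states. Without it, the statistical and energy-based objectives need not agree.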

Author Information and Paper Source

This article was written by Bin Huang, Lupeng Kong, Chao Wang, Fusong Ju, Qi Zhang, Jianwei Zhu, Tiansu Gong, Haicang Zhang, Chungong Yu, Wei-Mou Zheng, and Dongbo Bu, published on March 30, 2023, in the journal Genomics, Proteomics & Bioinformatics. These authors are from institutions including the Key Laboratory of Intelligent Information Processing at the Institute of Computing Technology, Chinese Academy of Sciences, Peking University, University of Chinese Academy of Sciences, and Huawei Noah’s Ark Lab.

Methodological Framework for Protein Structure Prediction

Workflow

Protein structure prediction methods can be categorized into two main types: Template-Based Modeling (TBM) and Free Modeling (FM, also known as ab initio approaches). TBM methods can be further divided into homology modeling and threading methods.

Homology modeling methods: Based on the principle that protein structures are more conserved than sequences during evolution, these methods align the target sequence with the sequences of homologous proteins and build the target structure from the homologs found.
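
The template search step can be illustrated with a minimal sketch: align the target to each candidate sequence with a plain Needleman-Wunsch global alignment and keep the candidate with the highest sequence identity. The scoring values, function names, and the `templates` dictionary are assumptions made for illustration. Real homology modeling tools typically use substitution matrices and profile-based searches rather than this simple match/mismatch scheme.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def sequence_identity(aligned_a, aligned_b):
    """Fraction of alignment columns holding identical residues."""
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return same / len(aligned_a)


def pick_template(target, templates):
    """Return (name, identity) of the most similar sequence in `templates`.

    `templates` is a hypothetical dict mapping a template name to its sequence.
    """
    best = None
    for name, seq in templates.items():
        _, aligned_t, aligned_s = global_align(target, seq)
        identity = sequence_identity(aligned_t, aligned_s)
        if best is None or identity > best[1]:
            best = (name, identity)
    return best
```

Calling `pick_template(target, {"t1": seq1, "t2": seq2})` returns the name of the closest candidate and its identity to the target. A full pipeline would then copy and refine the template coordinates, which this sketch does not attempt.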

Threading methods: Unlike homology modeling methods, which find templates by sequence similarity, threading methods look for proteins that share the target's structural fold by evaluating how well the target sequence fits each template structure.
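
The idea of scoring a sequence against a structure, rather than against another sequence, can be shown with a deliberately simplified sketch. Each template is reduced here to a string of burial states ("B" for buried, "E" for exposed), and a placement scores well when hydrophobic residues land in buried positions and polar residues in exposed ones. The burial encoding, the hydrophobic residue set, and the +1/-1 scoring are all assumptions for illustration. Actual threading programs use much richer scoring functions and allow gaps.

```python
# Simplified, assumed set of hydrophobic residues (one-letter codes).
HYDROPHOBIC = set("AVLIMFWC")


def fit_score(seq, burial):
    """Score a sequence against an equally long burial pattern.

    +1 when a hydrophobic residue is buried or a polar residue is exposed,
    -1 otherwise.
    """
    score = 0
    for residue, env in zip(seq, burial):
        buried = env == "B"
        score += 1 if (residue in HYDROPHOBIC) == buried else -1
    return score


def thread(seq, templates):
    """Slide `seq` along each template without gaps.

    `templates` is a hypothetical dict mapping a fold name to its burial
    string. Returns (name, offset, score) of the best placement, or None
    if no template is long enough.
    """
    best = None
    for name, burial in templates.items():
        for offset in range(len(burial) - len(seq) + 1):
            window = burial[offset:offset + len(seq)]
            score = fit_score(seq, window)
            if best is None or score > best[2]:
                best = (name, offset, score)
    return best
```

Note that nothing in `thread` compares the target with the template's own sequence. Only the structural environment of each position matters, which is what lets threading find folds that sequence comparison would miss.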

Free modeling methods: Based on the principle that a protein tends to adopt the structure with the lowest free energy in its native environment, these methods predict structure by minimizing an energy function or by directly simulating the folding process.
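
Energy minimization can be illustrated on the two-dimensional HP lattice model, a standard toy model in which each residue is either hydrophobic (H) or polar (P) and every pair of H residues that are lattice neighbors but not chain neighbors lowers the energy by one. The sketch below enumerates every self-avoiding chain placement for a short sequence and keeps the lowest-energy one. It is a teaching example under these stated assumptions, not a method described in the review, and exhaustive search is only feasible for very short chains.

```python
def fold_hp(sequence):
    """Exhaustively search 2D lattice conformations of an HP sequence.

    Returns (min_energy, coords), where coords lists one (x, y) per residue.
    Energy is -1 for each pair of H residues adjacent on the lattice but
    not consecutive in the chain.
    """
    if not sequence:
        return 0, []

    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    # Energies are never positive, so 1 serves as an initial upper bound.
    best = {"energy": 1, "coords": None}

    def energy(path):
        index_at = {pos: i for i, pos in enumerate(path)}
        total = 0
        for i, (x, y) in enumerate(path):
            if sequence[i] != "H":
                continue
            for dx, dy in moves:
                j = index_at.get((x + dx, y + dy))
                # j > i + 1 counts each non-bonded H-H contact exactly once.
                if j is not None and j > i + 1 and sequence[j] == "H":
                    total -= 1
        return total

    def extend(path, occupied):
        if len(path) == len(sequence):
            e = energy(path)
            if e < best["energy"]:
                best["energy"], best["coords"] = e, list(path)
            return
        x, y = path[-1]
        for dx, dy in moves:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:  # keep the walk self-avoiding
                path.append(nxt)
                occupied.add(nxt)
                extend(path, occupied)
                path.pop()
                occupied.remove(nxt)

    extend([(0, 0)], {(0, 0)})
    return best["energy"], best["coords"]
```

For example, `fold_hp("HPPH")` finds a conformation that bends the chain into a square so the two H residues touch. Practical free modeling methods replace the exhaustive search with sampling or gradient-based strategies and use far more detailed energy functions.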

Research Paradigms and Main Results

Researchers predict the native structure of proteins through the following methods:

  1. Homology modeling: Tools like Modeller build structural models for input sequences by comparing the sequences of target proteins and homologous proteins.
  2. Threading: Tools such as PROSPECT, RAPTOR, and DeepThreader evaluate the match between target protein sequences and template structures.
  3. Free modeling: Popular methods like AlphaFold2 and RoseTTAFold address the prediction problem through deep learning, while other free modeling methods simulate the folding process.

Paradigm Shift in the Deep Learning Era

In recent years, deep learning techniques have shown extraordinary potential in protein structure prediction:

  1. Algorithmic Modeling: This approach uses deep neural networks to learn the implicit rules of protein sequences from large datasets, without assuming how the data were generated or how they are distributed. It thereby avoids the incorrect assumptions that data modeling methods can introduce.
  2. Language Models: Models like ProteinBERT and ESM use deep neural networks to learn latent patterns in protein sequences, improving the prediction of protein structure and function.
  3. End-to-end Prediction: Methods like AlphaFold2 directly predict the three-dimensional structure of proteins from sequences through end-to-end neural networks, greatly improving prediction accuracy.
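
Prediction accuracy is judged by how far the predicted coordinates lie from the native ones. One simple measure is the root-mean-square deviation (RMSD) over corresponding atoms, sketched below in plain Python. The function assumes the two structures have already been superimposed and that atoms are listed in the same order. In practice, structures are optimally superimposed before the deviation is computed, and scores such as TM-score or GDT_TS are commonly reported as well.

```python
import math


def rmsd(predicted, native):
    """Root-mean-square deviation between two lists of (x, y, z) coordinates.

    Assumes both lists hold the same atoms in the same order and that the
    structures are already superimposed; no alignment is performed here.
    """
    if len(predicted) != len(native) or not predicted:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum((p - q) ** 2
                  for atom_p, atom_n in zip(predicted, native)
                  for p, q in zip(atom_p, atom_n))
    return math.sqrt(squared / len(predicted))
```

A value of zero means the two coordinate sets coincide, and larger values mean larger average displacement per atom.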

Applications and Practical Significance

Advances in deep learning have not only improved the accuracy of protein structure prediction but also opened up new practical applications. For example, structures predicted by AlphaFold2 can be used to improve molecular replacement phasing in crystallography, or combined with cross-linking data to resolve new structures of viral proteins.

Moreover, researchers have found that deep learning models can be used to design protein sequences with specific functions, greatly improving the efficiency of protein engineering. These advances demonstrate the distinctive strengths of algorithmic modeling in the era of deep learning and big data.

Research Highlights and Future Prospects

  1. Single-sequence structure prediction: In nature, a protein folds into its native structure without reference to homologous proteins, which indicates that its sequence alone encodes the structural information. Future research can focus on improving prediction methods that take only a single sequence as input.
  2. Efficient protein sequence design: Deep learning techniques have also shown excellent performance in protein sequence design. Future research can focus on designing proteins with specific functions.
  3. Interpreting neural network models: Although deep learning techniques have made great progress in structure prediction, understanding the intrinsic principles and key features of these models remains an important research direction.

In the era of deep learning and big data, algorithmic modeling has become the dominant research paradigm for protein structure prediction and will continue to play an important role in the future. By integrating the two cultures of statistical modeling, data modeling and algorithmic modeling, we can not only achieve highly accurate structure prediction but also gain a deeper understanding of the mechanisms of protein folding.