EvoAI Enables Extreme Compression and Reconstruction of the Protein Sequence Space

Extreme Compression and Reconstruction of Protein Sequence Space: A Breakthrough Study on EvoAI

Background

Protein design and optimization are central challenges in biotechnology, medicine, and synthetic biology. A protein's function is determined by its sequence and structure, but the functional sequence space is complex and high-dimensional, containing an immense number of possibilities. The critical question is how to efficiently analyze and compress this seemingly infinite space to identify the features essential to function. Experimental strategies such as directed evolution, deep mutational scanning (DMS), and site-saturation mutagenesis have provided crucial insights into the relationship between genotype and phenotype, but they are severely limited in sequence space coverage, accuracy, and the ability to analyze high-dimensional spaces. Computational methods, such as sequence- or structure-based modeling, depend on available training data, which makes it difficult to explore regions of high-dimensional sequence space that experiments have not covered.

This study aimed to develop a novel method to overcome the challenges of existing experimental and computational approaches: how to rapidly scan and compress sequence space, especially in high dimensions, and how to utilize compressed data to reconstruct and predict proteins with improved functionality. To this end, the research team developed a new hybrid experimental-computational approach called “EvoAI.”

Publication Source

This research was conducted by scientists from multiple leading institutions, including Tsinghua University, the Broad Institute of MIT and Harvard, Williams College, and the Massachusetts Institute of Technology (MIT). Ziyuan Ma is among the first authors, and the corresponding author is Shuyi Zhang. The paper was published in Nature Methods on November 11, 2024.

Research Workflow

The study focused on the development and validation of EvoAI, which integrates the experimental technique “EvoScan” with deep learning-based computational methods to establish an innovative workflow for exploring and reconstructing protein sequence space.

1. EvoScan: The Experimental Evolution Scanning System

EvoScan is based on an improved version of Phage-Assisted Noncontinuous Evolution (PANCE) and incorporates the CRISPR-guided DNA polymerase mutagenesis system EvolvR. Together, these tools enabled the rapid and efficient “evolutionary scanning” of sequence space.

  • System Construction and Design
    The core concept of EvoScan is to introduce the target gene into the phage genome and use guide RNAs (gRNAs) to direct mutagenesis segment by segment. This divides the complex high-dimensional sequence space into multiple lower-dimensional subspaces. The experiments used M13 bacteriophages as vectors and employed regulatory circuits linked to gIII expression, so that a variant's functional performance was coupled to phage proliferation. For instance, in the experiment involving Enhanced Green Fluorescent Protein (EGFP)-Nanobody interactions, EGFP-binding domains were fused to CRISPR repressor proteins, and variants were selected through the propagation of phages carrying functional mutations.

  • Validation and Implementation
    Using EGFP-Nanobody interactions as a model, the experiments quickly identified key mutation sites (“anchor points”). For example, after introducing the E103K mutation, the system reverted the mutation to glutamate after just two passages, validating EvoScan’s ability to efficiently pinpoint functionally critical anchors.
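
The effect of splitting one large search into several smaller ones can be illustrated with simple counting. The following is a minimal sketch, not the authors' procedure: the gene length, segment size, and mutation depth are hypothetical values chosen for illustration, and the helper names are invented here. It compares the number of variants carrying up to k substitutions anywhere in a gene with the number obtained when substitutions are confined to one segment at a time.

```python
from math import comb

AMINO_ACIDS = 20  # size of the standard amino acid alphabet


def split_into_segments(length, segment_size):
    """Return (start, end) index pairs that tile positions 0..length-1."""
    return [(s, min(s + segment_size, length))
            for s in range(0, length, segment_size)]


def variants_up_to_k(n_positions, k):
    """Count sequences that differ from the parent at 1..k of n_positions."""
    return sum(comb(n_positions, i) * (AMINO_ACIDS - 1) ** i
               for i in range(1, k + 1))


# Hypothetical numbers, chosen only to make the comparison concrete.
length, segment_size, k = 120, 30, 2

segments = split_into_segments(length, segment_size)
whole_gene = variants_up_to_k(length, k)
segment_by_segment = sum(variants_up_to_k(end - start, k)
                         for start, end in segments)

print(f"{len(segments)} segments")
print(f"whole gene, up to {k} substitutions: {whole_gene}")
print(f"one segment at a time:              {segment_by_segment}")
```

Under these assumed numbers the segment-wise search is roughly a quarter of the size of the whole-gene search, and the gap widens quickly as k grows. The trade-off is that combinations spanning two segments are not sampled directly.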

2. High-Dimensional Exploration and Detailed Case Studies

Beyond the EGFP-Nanobody protein-protein model described above, the broad applicability of EvoScan was validated in two further systems from different functional domains:

  • Protein-Ligand Interactions: Using the SARS-CoV-2 main protease (Mpro) as a model together with the known inhibitors GC376 and nirmatrelvir, systematic scanning revealed a set of key resistance-associated mutations, such as E166V and S144A, while also identifying numerous novel mutations.

  • Protein-Nucleic Acid Interactions: By targeting the transcriptional regulator AmeR from the TetR family, EvoScan efficiently generated 82 high-function anchor variants comprising 52 significant mutation sites. The study also revealed complex positive and negative epistatic effects among mutations.
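
Positive and negative epistasis can be made concrete with a small calculation. The sketch below uses one common definition, the deviation of a double mutant from the additive expectation on a log scale; the article does not state which definition the authors used, and the function names and fitness values here are hypothetical.

```python
import math


def epistasis(f_wt, f_a, f_b, f_ab):
    """Log-scale epistasis between mutations A and B.

    Returns the observed double-mutant effect minus the sum of the two
    single-mutant effects, each measured relative to the wild type.
    """
    expected = math.log(f_a / f_wt) + math.log(f_b / f_wt)
    observed = math.log(f_ab / f_wt)
    return observed - expected


def classify(eps, tol=1e-9):
    """Label an epistasis value as positive, negative, or none."""
    if eps > tol:
        return "positive"
    if eps < -tol:
        return "negative"
    return "none"


# Hypothetical fitness values (for example, fold repression).
print(classify(epistasis(1.0, 2.0, 3.0, 12.0)))  # double mutant beats expectation
print(classify(epistasis(1.0, 2.0, 3.0, 3.0)))   # double mutant falls short
```

With this definition, a double mutant that exactly multiplies the two single-mutant effects has zero epistasis, so only departures from that baseline are counted as interactions.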

3. Deep Learning-Assisted Reconstruction via EvoAI

EvoAI used the experimental results as training data to create a high-accuracy protein design model, showcasing remarkable capability in sequence space reconstruction:

  • Model Architecture
    The model combined the pretrained GeoFitness model with the protein language model ESM-2 (Evolutionary Scale Modeling). A Multilayer Perceptron (MLP) was incorporated to predict complex interactions among mutations. The trained model achieved a high prediction accuracy with a Spearman correlation coefficient of 0.91.

  • Designing and Validating New Proteins
    EvoAI designed 10 novel high-scoring protein variants, all of which exhibited significantly improved fold repression (experimental results showing 10- to 38-fold improvements). These results were superior to both the wild-type protein and control designs generated using traditional DMS approaches.
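
The Spearman coefficient quoted for the model measures how well predicted and measured values agree in rank order, regardless of scale. As a minimal sketch of how such a score is computed (this is a generic standard-library implementation, not the authors' evaluation code, and the example values are invented):

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: measured fold repression vs. model scores.
measured = [1.2, 3.5, 2.1, 8.0, 5.4]
predicted = [0.10, 0.40, 0.35, 0.90, 0.60]
print(round(spearman(measured, predicted), 3))
```

Because only rank order matters, a model can score well on this metric even if its raw outputs are on a different scale from the experimental readout, which suits the task of picking the best candidates to test.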

Main Findings

Through the combination of EvoScan’s experimental data and EvoAI’s computational capabilities, the study demonstrated the extreme compressibility of high-dimensional protein sequence space, reducing a theoretical design space of ~10^50 to just 82 anchor points. The findings not only opened new pathways for protein design but also revealed potential simplification mechanisms in biological evolution.
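
To give a sense of scale for these numbers, the short sketch below works through the arithmetic. The article does not say how the ~10^50 figure was derived, so the calculation only assumes the usual 20-letter amino acid alphabet and asks how many freely variable positions would produce a space of that size; the function names are illustrative.

```python
import math


def log10_sequence_space(n_positions, alphabet=20):
    """log10 of the number of sequences over n freely variable positions."""
    return n_positions * math.log10(alphabet)


def positions_for_space(log10_size, alphabet=20):
    """Number of freely variable positions giving a space of 10**log10_size."""
    return log10_size / math.log10(alphabet)


n = positions_for_space(50)
print(f"10^50 sequences ~ {n:.1f} fully variable positions")

# Ratio between the theoretical space and the 82 anchor points, as a power of ten.
compression = 50 - math.log10(82)
print(f"compression factor ~ 10^{compression:.1f}")
```

Under this assumption, a space of 10^50 corresponds to roughly 38 positions that can each take any of the 20 amino acids, and 82 anchor points represent a reduction of about 48 orders of magnitude.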

Key Highlights and Significance

  1. Extreme Compressibility: The study showed that high-dimensional protein sequence space can be represented and reconstructed through a remarkably small number of anchor points.
  2. Broad Applicability: EvoScan demonstrated versatility in exploring protein-protein, protein-ligand, and protein-nucleic acid interactions.
  3. Efficient Prediction: EvoAI designed variants with significantly enhanced functionality.
  4. Evolutionary Insights: The findings supported the hypothesis that natural selection might leverage sequence space compression mechanisms to optimize functionality.

This technological breakthrough provides a powerful tool for future applications in protein engineering and synthetic biology while also inspiring further exploration in areas such as evolutionary biology.