Contrastive Learning of T Cell Receptor Representations

New Breakthrough in T Cell Receptor (TCR) Specificity Prediction: Introducing the SCEPTR Model

Academic Background

T cell receptors (TCRs) play a crucial role in the immune system by determining the specificity of immune responses through their binding to peptide-MHC complexes (pMHCs). Understanding the interaction between TCRs and specific pMHCs remains a significant challenge in immunology. Despite advancements in high-throughput experimental technologies, which have provided a wealth of TCR sequence data, accurately predicting the binding specificity of TCRs to specific pMHCs is still a difficult task. Currently, protein language models (PLMs) have shown great potential in high-throughput data analysis but underperform in TCR specificity prediction tasks, especially in data-scarce scenarios. Therefore, effectively leveraging unlabeled TCR sequence data to train models has become key to addressing this issue.

Source of the Paper

This paper, co-authored by Yuta Nagano, Andrew G.T. Pyo, Martina Milighetti, and others from renowned institutions such as University College London and Princeton University, was published on January 15, 2025, in Cell Systems, titled “Contrastive Learning of T Cell Receptor Representations”. The study introduces a new TCR language model, SCEPTR (Simple Contrastive Embedding of the Primary Sequence of T Cell Receptors), and proposes a pretraining strategy combining contrastive learning and masked language modeling (MLM), significantly improving the accuracy of TCR specificity prediction.

Research Process

1. Problem Context and Motivation

Predicting TCR-pMHC binding specificity is one of the core challenges in immunology. Although many machine learning methods have been applied to this problem, they generalize poorly to unseen pMHCs, especially in data-scarce scenarios. Previous studies have shown that existing protein language models (e.g., ProtBERT, ESM2) underperform on TCR specificity prediction, even relative to sequence-alignment-based methods such as TCRdist. This study therefore aims to design a protein language model better suited to TCR specificity prediction by introducing contrastive learning.

2. Design of the SCEPTR Model

The core innovation of SCEPTR lies in its pretraining strategy, which combines autocontrastive learning with masked language modeling (MLM). Specifically, SCEPTR represents a TCR by its six complementarity-determining regions (CDRs), three each from the α and β chains, and embeds each amino acid residue with a simple one-hot-style encoding. These vectors are then passed through a three-layer self-attention stack to produce a 64-dimensional TCR representation vector.
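
The paper's exact implementation details differ, but the description above maps onto a small transformer encoder. The PyTorch sketch below is illustrative only: the module names, special-token vocabulary, and head count are assumptions, and a learned embedding stands in for the one-hot-style residue encoding.

```python
import torch
import torch.nn as nn

NUM_AMINO_ACIDS = 20   # standard residues
NUM_SPECIAL = 2        # assumed <cls> and <pad>/<mask> tokens (illustrative)
D_MODEL = 64           # dimensionality of the final TCR representation


class SceptrLikeEncoder(nn.Module):
    """Illustrative encoder in the spirit of the description above: tokens from
    the six CDR loops pass through a 3-layer self-attention stack, and the
    contextualized embedding of a prepended <cls> token is the 64-dim TCR vector."""

    def __init__(self, num_layers: int = 3, num_heads: int = 8):
        super().__init__()
        # A learned embedding stands in here for the paper's one-hot-style scheme.
        self.token_emb = nn.Embedding(NUM_AMINO_ACIDS + NUM_SPECIAL, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, tokens: torch.Tensor, padding_mask: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer-coded residues from all six CDRs,
        # with the <cls> token at position 0; padding_mask marks padded positions.
        x = self.token_emb(tokens)
        x = self.encoder(x, src_key_padding_mask=padding_mask)
        return x[:, 0]  # <cls> output = TCR representation vector
```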

The core idea of autocontrastive learning is to generate two independent "views" of the same TCR, pull their representations together in the embedding space, and push apart the representations of different TCRs (formalized in the loss sketch below). This objective addresses a limitation of pure MLM pretraining, in which much of the sequence variation the model learns to explain reflects the randomness of V(D)J recombination rather than antigen specificity.
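
Concretely, this kind of objective can be written as a SimCLR-style (NT-Xent) loss over a batch, in which the two views of each TCR form a positive pair and every other TCR acts as a negative. The sketch below is a generic formulation under that assumption; the paper's exact loss and temperature may differ.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """Generic NT-Xent objective: z1[i] and z2[i] are the representations of two
    views of the same TCR (a positive pair); all other TCRs in the batch are negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    n = z1.size(0)
    z = torch.cat([z1, z2], dim=0)                 # (2N, d)
    logits = z @ z.T / temperature                 # pairwise cosine similarities
    # A view is never its own positive, so mask out the diagonal.
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=logits.device)
    logits = logits.masked_fill(self_mask, float("-inf"))
    # For row i < n the positive sits at column i + n, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(logits, targets)
```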

3. Implementation of Autocontrastive Learning

In autocontrastive learning, SCEPTR generates two independent views of a TCR by randomly dropping some of its input features (e.g., individual amino acid residues or an entire TCR chain). This data augmentation lets the model learn specificity-relevant features of TCRs in an unsupervised manner. In addition, SCEPTR uses a special <cls>-style token, whose contextualized embedding serves as the representation vector for the whole TCR.
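
A minimal sketch of such view generation follows, assuming residue-level and chain-level dropping; the dropout probabilities and CDR strings are placeholders, not values from the paper.

```python
import random


def make_view(alpha_cdrs, beta_cdrs, p_drop_residue=0.1, p_drop_chain=0.2):
    """Illustrative augmentation: randomly drop residues, and occasionally an
    entire chain, to produce one 'view' of a TCR."""
    def drop_residues(cdrs):
        return ["".join(aa for aa in cdr if random.random() > p_drop_residue)
                for cdr in cdrs]

    alpha, beta = drop_residues(alpha_cdrs), drop_residues(beta_cdrs)
    if random.random() < p_drop_chain:
        if random.random() < 0.5:   # drop one chain at random, never both
            alpha = [""] * len(alpha)
        else:
            beta = [""] * len(beta)
    return alpha + beta  # six (possibly shortened) CDR strings to tokenize


# Two independent views of the same receptor become a positive pair for the loss above.
cdrs_a = ["TSGFYG", "NALDGL", "CAVRDSNYQLIW"]   # illustrative alpha-chain CDR1/2/3
cdrs_b = ["SGHRS", "YFSETQ", "CASSIRSSYEQYF"]   # illustrative beta-chain CDR1/2/3
view_1, view_2 = make_view(cdrs_a, cdrs_b), make_view(cdrs_a, cdrs_b)
```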

4. Model Performance Evaluation

To evaluate SCEPTR’s performance, the research team designed a standardized few-shot prediction task: given a small set of reference TCRs known to bind a specific pMHC, the model must predict whether a query TCR also binds it. The study compared SCEPTR against existing models (e.g., TCR-BERT, ProtBERT, ESM2) and sequence alignment methods (e.g., TCRdist).
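
As a rough illustration of how an embedding model can be used for such few-shot prediction, a query TCR can be scored by its average cosine similarity to the reference set in representation space. This is a generic stand-in, not necessarily the benchmark's exact scoring rule.

```python
import torch
import torch.nn.functional as F


def few_shot_score(query_vec: torch.Tensor, reference_vecs: torch.Tensor) -> torch.Tensor:
    """Score a query TCR against reference TCRs known to bind a given pMHC,
    using mean cosine similarity in representation space."""
    q = F.normalize(query_vec, dim=-1)        # (d,)  query TCR embedding
    r = F.normalize(reference_vecs, dim=-1)   # (k, d) reference TCR embeddings
    return (r @ q).mean()                     # higher = more likely to share specificity
```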

The results show that SCEPTR outperforms existing models in most cases, especially when the number of reference TCRs is limited. For example, when the reference TCR count is 200, SCEPTR performs better than TCRdist for five out of the six tested pMHCs. Furthermore, SCEPTR’s contrastive learning strategy significantly enhances its ability to discriminate between different pMHCs.

5. Model Ablation Studies

To verify the contribution of autocontrastive learning to SCEPTR’s performance, the research team conducted several ablation experiments. The results show that the SCEPTR variant trained solely with MLM performs significantly worse, while the contrastive learning-enabled SCEPTR variant performs on par with TCRdist. Additionally, the study found that SCEPTR’s representation vectors effectively capture sequence features related to TCR specificity, especially for TCR sequences with low generation probabilities (pgen).

Main Results and Conclusions

1. Performance Advantages of SCEPTR

SCEPTR excels in few-shot TCR specificity prediction tasks, significantly outperforming existing models. Particularly in data-scarce scenarios, SCEPTR’s contrastive learning strategy enables better generalization to unseen pMHCs. The study also found that SCEPTR’s representation vectors effectively capture TCR specificity features that sequence alignment methods fail to detect.

2. Scientific Value of Contrastive Learning

Through contrastive learning, SCEPTR learns to place TCRs with the same specificity close together in the representation space while pushing apart TCRs with different specificities. This property gives SCEPTR a significant advantage in TCR specificity prediction, especially in data-scarce scenarios.

3. Application Prospects

The introduction of SCEPTR provides a new paradigm for TCR specificity prediction. The model can not only be used for few-shot prediction tasks but also applied to TCR sequence clustering analysis to identify antigen-specific T cell groups (metaclonotypes). Moreover, SCEPTR’s contrastive learning strategy offers new insights for other protein-related tasks.

Research Highlights

  1. Innovative Pretraining Strategy: SCEPTR significantly improves model performance in TCR specificity prediction tasks by combining contrastive learning and MLM.
  2. Data Efficiency: SCEPTR performs exceptionally well in few-shot tasks, effectively utilizing unlabeled TCR sequence data.
  3. Broad Application Prospects: SCEPTR can be applied not only to TCR specificity prediction but also to TCR sequence clustering analysis and other protein-related tasks.

Summary

By introducing the SCEPTR model, this study proposes a pretraining strategy that integrates contrastive learning and masked language modeling, providing a new solution for TCR specificity prediction. This research not only addresses the generalization issues of existing models in data-scarce scenarios but also offers a new paradigm for training protein language models, holding significant scientific value and application potential.