EpicPred: Predicting Phenotypes Driven by Epitope-Binding TCRs Using Attention-Based Multiple Instance Learning
T-cell receptors (TCRs) play a crucial role in the adaptive immune system by recognizing pathogens through binding to specific antigen epitopes. Understanding TCR-epitope interactions is essential for uncovering the biological mechanisms of immune responses and for developing T cell-mediated immunotherapies. Although the importance of the CDR3 region of TCRs in epitope recognition is widely acknowledged, accurately predicting the TCR-epitope interactions associated with specific diseases or phenotypes remains a challenge. To address this, researchers developed EpicPred, an attention-based multiple instance learning (MIL) model designed to predict TCR-epitope interactions related to cancer or to COVID-19 severity.
Source of the Paper
The paper was co-authored by Jaemin Jeon, Suwan Yu, Sangam Lee, Sang Cheol Kim, Hye-Yeong Jo, Inuk Jung, and Kwangsoo Kim, affiliated with Seoul National University, Yonsei University, Korea National Institute of Health, Kyungpook National University, and Seoul National University Hospital. The paper was published in 2025 in the journal Bioinformatics under the title “EpicPred: Predicting Phenotypes Driven by Epitope-Binding TCRs Using Attention-Based Multiple Instance Learning.”
Research Workflow
1. Data Collection and Preprocessing
The study first collected 244,552 TCR sequences and 105 unique epitopes from six public TCR databases. These data were used to train and test the EpicPred model. To reduce noise, the researchers filtered TCR sequences, excluding those shorter than 8 or longer than 22 amino acids, as well as sequences containing non-standard amino acids.
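The filtering rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing code: the function name `filter_tcr_sequences`, the uppercase normalization, and the example sequences are assumptions introduced here. Only the length bounds (8 to 22 residues, inclusive) and the standard-amino-acid requirement come from the text.

```python
# The 20 standard amino acids, one-letter codes.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def filter_tcr_sequences(sequences, min_len=8, max_len=22):
    """Keep sequences of min_len..max_len residues made only of standard amino acids."""
    kept = []
    for seq in sequences:
        seq = seq.strip().upper()  # assumption: normalize case and whitespace first
        if min_len <= len(seq) <= max_len and set(seq) <= STANDARD_AA:
            kept.append(seq)
    return kept

# Hypothetical inputs: one valid sequence, one too short,
# one with a non-standard residue (X), one too long.
examples = ["CASSLGQAYEQYF", "CASS", "CASSXGQAYEQYF", "C" * 23]
filtered = filter_tcr_sequences(examples)
print(filtered)
```

Only the first example survives: the second is shorter than 8 residues, the third contains `X`, and the fourth is longer than 22 residues.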
2. Open-Set Recognition (OSR)
EpicPred first uses open-set recognition (OSR) to predict and remove TCR-epitope interactions that are unlikely to occur, thereby reducing false positives. The OSR step separates epitope-binding TCRs (EB-TCRs), whether they bind known or unknown epitopes, from non-epitope-binding TCRs (NEB-TCRs).
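The summary does not specify how EpicPred implements OSR, so the following is only a generic illustration of the open-set idea: a classifier over known classes that rejects inputs when its confidence is low. It uses a maximum-softmax-probability threshold, a common OSR baseline and not necessarily the technique used in the paper. The function names, the threshold of 0.9, and the logits are all assumptions.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def open_set_predict(logits, threshold=0.9):
    """Return the predicted class index per row, or -1 when confidence is below threshold."""
    probs = softmax(np.asarray(logits, dtype=float))
    best = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= threshold
    return np.where(confident, best, -1)

# Hypothetical logits for three TCRs over three known epitope classes.
logits = np.array([
    [6.0, 0.5, 0.2],   # confidently class 0
    [1.0, 1.1, 0.9],   # ambiguous, so rejected
    [0.1, 0.3, 5.0],   # confidently class 2
])
labels = open_set_predict(logits)
print(labels.tolist())
```

In this sketch, a rejected row (label `-1`) plays the role of an interaction that is filtered out before the downstream phenotype model sees it.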
3. Multiple Instance Learning (MIL) Model
After predicting EB-TCRs, EpicPred employs a multiple instance learning model to identify TCR-epitope interactions associated with cancer types or COVID-19 severity. The model encodes TCR sequences using BERT (Bidirectional Encoder Representations from Transformers) and applies an attention mechanism to aggregate similar TCR sequences into a representation vector for each sample.
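To make the aggregation step concrete, here is a minimal sketch of attention-based MIL pooling in the commonly used form: each instance embedding receives a score, the scores are normalized with a softmax, and the sample vector is the weighted sum of the embeddings. It is an illustration under stated assumptions, not EpicPred's architecture. Random vectors stand in for BERT embeddings, the parameters `V` and `w` are random rather than trained, and all dimensions are arbitrary.

```python
import numpy as np

def attention_pool(embeddings, V, w):
    """Aggregate instance embeddings (n, d) into one sample vector (d,).

    Each instance gets the score w . tanh(V^T h_i); a softmax over instances
    turns the scores into attention weights that sum to 1.
    """
    hidden = np.tanh(embeddings @ V)        # (n, hidden_dim)
    scores = hidden @ w                     # (n,)
    scores = scores - scores.max()          # stabilize the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    sample_vector = weights @ embeddings    # weighted sum over instances
    return sample_vector, weights

rng = np.random.default_rng(0)
n_tcrs, embed_dim, hidden_dim = 5, 16, 8
embeddings = rng.normal(size=(n_tcrs, embed_dim))  # stand-in for BERT outputs
V = rng.normal(size=(embed_dim, hidden_dim))       # untrained, illustrative
w = rng.normal(size=hidden_dim)                    # untrained, illustrative

sample_vector, weights = attention_pool(embeddings, V, w)
print(sample_vector.shape, weights.round(3))
```

The attention weights are what make such a model inspectable: instances with large weights contribute most to the sample representation, which is the kind of per-TCR score examined in the single-cell analysis reported later.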
4. Phenotype Prediction
The ultimate goal of EpicPred is to predict patient phenotypes, such as cancer or COVID-19 severity. The model calculates the binding probability of each TCR to epitopes, groups similar TCR sequences using K-means clustering, and then trains the phenotype prediction model. Training uses two loss functions: a TCR-specific loss, which captures the relationship between individual TCR sequences and the phenotype, and a sample-specific loss, which detects groups of TCRs associated with the phenotype.
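The grouping step mentioned above relies on K-means. The sketch below implements the standard Lloyd's algorithm with NumPy on synthetic 2-D points. It assumes the clustering operates on numeric feature vectors such as sequence embeddings, which the summary does not state explicitly, and the function name, data, and parameter values are illustrative.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm: alternate nearest-centroid assignment and mean updates."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic groups standing in for TCR feature vectors.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(20, 2))
points = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(points, k=2)
print(labels.tolist())
```

With groups this far apart, the first 20 points end up in one cluster and the last 20 in the other. Real embeddings are higher-dimensional and rarely this clean, so the number of clusters would need to be chosen with care.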
Main Results
1. Prediction of EB-TCRs
EB-TCR prediction was evaluated on both closed and open test sets. On the closed test set, the model achieved an F1 score of 0.97±0.01 in predicting TCR-epitope binding. On the open test set, EpicPred distinguished EB-TCRs from NEB-TCRs with an F1 score of 0.71±0.01.
2. Phenotype Prediction
EpicPred also performed well in predicting COVID-19 severity and the phenotypes of cancer samples. On the COVID-19 dataset, the model achieved an AUROC (area under the receiver operating characteristic curve) of 0.80±0.07 in distinguishing moderate from severe cases. On the cancer dataset, EpicPred achieved an AUROC of 0.78±0.04 in distinguishing healthy from cancerous samples.
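For readers unfamiliar with the metric, AUROC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that pairwise definition. The labels and scores are made up for illustration (1 could stand for a severe case, 0 for a moderate one) and are not data from the paper.

```python
import numpy as np

def auroc(labels, scores):
    """AUROC via its pairwise definition: P(score_pos > score_neg), ties count 0.5."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    diff = pos[:, None] - neg[None, :]   # every positive-negative pair
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Hypothetical labels and predicted scores for six samples.
y_true = [1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.7, 0.6, 0.2, 0.4, 0.5]
value = auroc(y_true, y_score)
print(round(value, 3))
```

Here 7 of the 9 positive-negative pairs are ranked correctly, so the AUROC is 7/9. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the reported 0.80 and 0.78.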
3. Single-Cell Data Analysis
Through the analysis of single-cell RNA sequencing data, EpicPred identified cell subpopulations associated with the severity of COVID-19. The study found that cells with high attention scores exhibited significant differences in recognizing SARS-CoV-2 epitopes, indicating that these cells play a key role in phenotype prediction.
Conclusions and Significance
By combining open-set recognition and multiple instance learning, EpicPred successfully predicted TCR-epitope interactions associated with cancer and the severity of COVID-19. The model not only improved the accuracy of phenotype prediction but also provided new insights into the role of TCRs in immune responses. The development of EpicPred offers an important tool for future immunotherapy and vaccine design, with broad application prospects in personalized medicine and precision immunotherapy.
Research Highlights
- Novel Model Design: EpicPred is the first to combine open-set recognition with multiple instance learning, distinguishing EB-TCRs from NEB-TCRs and reaching AUROCs of 0.80±0.07 for COVID-19 severity and 0.78±0.04 for cancer in phenotype prediction.
- High-Precision Prediction: On multiple public datasets, EpicPred outperformed existing methods in predicting TCR-epitope binding and phenotype classification.
- Single-Cell Data Analysis: Through single-cell RNA sequencing data, EpicPred identified cell subpopulations associated with the severity of COVID-19, providing new insights into the mechanisms of immune responses.
Additional Valuable Information
The software implementation of EpicPred has been open-sourced on GitHub, allowing researchers to freely use and modify the model to further advance research on TCR-epitope interactions. Additionally, the research team plans to extend EpicPred to other disease areas, exploring its potential applications in broader immunological research.
Overall, EpicPred provides a new method for predicting phenotype-associated TCR-epitope interactions and opens new avenues for immunotherapy and vaccine design.