A 5' UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions

The 5’ untranslated region (5’UTR) is a regulatory region at the start of messenger RNA (mRNA) molecules, playing a crucial role in regulating the translation process and affecting protein expression levels. Language models have demonstrated effectiveness in decoding protein and genomic sequence functions. In this study, the authors introduce a language model for 5’UTRs, referred to as UTR-LM.

Research Background The 5’UTR plays an important role in regulating mRNA translation, affecting mRNA stability, localization, and translation efficiency. Previous studies have explored the biological characteristics of 5’UTRs, including their secondary structures, potential interacting RNA-binding proteins, and the effects of 5’UTR mutations on gene expression. The complex functions of mRNA and their potential impact on human health highlight the need for more generally applicable computational methods.

Research Source This research was a collaborative effort between Professor Mengdi Wang’s group from the Department of Electrical and Computer Engineering at Princeton University, Professor Le Cong’s laboratory from the Department of Pathology at Stanford University, RVAC Medicines, and ZipCode Bio. The paper was published in the April 2024 issue of Nature Machine Intelligence.

Research Workflow and Methods (a) Research workflow: The research adopted the following workflow: 1) Collect and preprocess endogenous 5’UTR sequences from multiple species, 5’UTR sequences from synthetic libraries, and endogenous human 5’UTR data; 2) Develop a Transformer-based language model, UTR-LM, and perform self-supervised pretraining on the aforementioned data, including masked nucleotide reconstruction, secondary structure prediction, and minimum free energy prediction tasks; 3) Fine-tune UTR-LM on downstream tasks, such as mean ribosome loading (MRL) prediction, mRNA translation efficiency (TE) prediction, mRNA expression level (EL) prediction, and identification of unannotated internal ribosome entry sites (IRESs); 4) Design and synthesize a library of 211 5’UTR sequences with high predicted TE, and experimentally validate the performance of these sequences through wet experiments (mRNA transfection and luciferase assays); 5) Analyze the attention scores of the language model, revealing known genomic motif patterns and potential new motifs.

(b) Main research results:
1) For the MRL prediction task, UTR-LM improved the Spearman correlation coefficient by 5% compared to the best baseline method;
2) For the TE and EL prediction tasks, UTR-LM improved the Spearman correlation coefficient by up to 8% compared to the best baseline method;
3) For the IRES identification task, UTR-LM increased the area under the precision-recall curve (AUPR) from 0.37 to 0.52, outperforming the best baseline;
4) Experimental validation showed that the top 5’UTR sequences in the designed library increased protein yield by 32.5% compared to a widely used, clinically optimized 5’UTR (NCA-7d-5’UTR);
5) On an independent wet-lab dataset, UTR-LM predicted experimental results under zero-shot conditions with a Spearman correlation coefficient 51% higher than that of the best baseline method;
6) The study revealed known regulatory motif patterns, such as the Kozak sequence and the positive correlation between high GC content and translation efficiency, as well as potential new motifs.
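Several of the results above are reported as Spearman correlation coefficients between predicted and measured values. As a reminder of what that metric computes, the following is a minimal pure-Python sketch: it ranks both lists (tied values share their average rank) and then takes the Pearson correlation of the ranks. The helper names and the toy numbers are illustrative assumptions, not data from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up values standing in for measured and predicted MRL.
    measured = [4.1, 5.3, 6.0, 3.2, 7.4]
    predicted = [4.0, 5.9, 5.5, 3.0, 7.1]
    print(round(spearman(measured, predicted), 3))
```

Because the metric depends only on rank order, it rewards a model for ordering sequences correctly even when its predicted values are on a different scale from the measurements.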

(c) Research conclusions: This study proposed a new self-supervised language model, UTR-LM, for studying mRNA 5’UTRs and decoding their functions, demonstrating superior performance in predicting MRL, TE, EL, and identifying IRESs. The research also successfully designed and experimentally validated a set of highly efficient 5’UTR sequences. This study has the potential to advance our understanding of gene regulation and provide innovative therapeutic interventions.

Research Significance
1) Scientific value: This research proposes an effective computational model for decoding the biological functions of 5’UTRs, providing new insights and tools for a deeper understanding of the regulatory mechanisms of mRNA in the protein biosynthesis process.
2) Application value: The highly efficient 5’UTR sequences designed in this study have the potential to be applied in biotechnology and therapeutic protein production processes, optimizing protein yields.
3) Research features: A language model integrating sequence, secondary structure, and minimum free energy was proposed; high-performing 5’UTR sequences were successfully designed and experimentally validated; known and newly discovered regulatory motif patterns were revealed.

This research provides a novel language model approach for understanding and optimizing 5’UTR functions, offering both significant scientific value and broad application prospects. It represents an innovative research achievement in the field of mRNA regulation.