Multimodal Learning for Mapping Genotype–Phenotype Dynamics

Multimodal Learning Reveals Genotype–Phenotype Dynamics

Background

The complex relationship between genotype and phenotype has long been a central question in biology. Genotype refers to the genetic information of an organism, while phenotype is the manifestation of this genetic information in a specific environment. Although Wilhelm Johannsen introduced these terms in 1909 and attempted to quantify their relationship, over a century later, we still cannot precisely describe how genotypes shape phenotypes through complex gene expression patterns. In recent years, technologies such as single-cell RNA sequencing (scRNA-seq) have enabled us to observe the intricate dynamics of gene expression at the cellular level. However, these technologies still cannot comprehensively map how genotypic combinations lead to phenotypic outcomes.

Current approaches such as forward and reverse genetics can in principle dissect the relationship between genotype and phenotype, but in practice they fall short because of the scale and complexity involved. This is particularly true in human cells, where combinations of thousands of genes create an extremely diverse phenotypic landscape. Additionally, although scRNA-seq can reveal expression changes in thousands of genes across cells, the high dimensionality of the resulting datasets makes it difficult to extract meaningful biological insights. Recent advances in machine learning, especially self-supervised Transformer architectures from natural language processing (NLP), offer a promising way to analyze such complex biological datasets.

Source of the Paper

This paper, titled “Multimodal Learning for Mapping Genotype–Phenotype Dynamics,” is co-authored by Farhan Khodaee, Rohola Zandie, and Elazer R. Edelman. They are affiliated with the Institute for Medical Engineering and Science at the Massachusetts Institute of Technology and the Department of Medicine (Cardiovascular Medicine) at Brigham and Women’s Hospital. The paper was accepted on May 1, 2024, and published online on December 20, 2024, in the journal Nature Computational Science.

Research Process

1. Research Objectives and Method Design

This study aims to develop a computational framework for analyzing the dynamic relationship between gene expression and phenotypic manifestation by integrating high-dimensional genotypic and phenotypic data. To achieve this, the authors introduced a multimodal foundation model called Polygene, which uses self-supervised language modeling to learn genotype and phenotype representations jointly. The core innovation of Polygene lies in combining scRNA-seq data with phenotypic information such as sex, age, tissue type, and cell type, so that gene expression is interpreted within its biological context.

2. Data Preprocessing and Model Input

The study used the Tabula Sapiens single-cell transcriptomic dataset, which includes nearly 500,000 human cells from 24 organs. Gene expression values for each cell were normalized and binned for subsequent analysis. Model inputs included gene expression values and associated phenotypic information, which were encoded as vector representations and fed into the network.
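The normalize-and-bin step can be illustrated with a minimal sketch. The summary does not give the paper's exact scheme, so everything here is an assumption made for illustration: the function name `normalize_and_bin`, library-size scaling to 10,000 counts, the log1p transform, and ten equal-width bins with bin 0 reserved for unexpressed genes.

```python
import math

def normalize_and_bin(counts, target_sum=10_000, n_bins=10):
    """Normalize one cell's raw counts and convert them to integer bins.

    Assumed scheme (not taken from the paper): scale counts to a common
    library size, apply log1p, then use equal-width bins over (0, max].
    Returns 0 for unexpressed genes and 1..n_bins for expressed genes.
    """
    total = sum(counts)
    if total == 0:
        return [0] * len(counts)
    logged = [math.log1p(c * target_sum / total) for c in counts]
    max_val = max(logged)
    bins = []
    for value in logged:
        if value == 0:
            bins.append(0)  # gene not detected in this cell
        else:
            # The largest value in the cell lands in bin n_bins.
            bins.append(min(n_bins, math.ceil(value / max_val * n_bins)))
    return bins

# Toy cell with six genes (raw counts are made up).
cell = [0, 3, 12, 0, 150, 1]
binned = normalize_and_bin(cell)
```

Binning turns continuous expression into a small vocabulary of discrete levels, which is what lets a language-model-style network treat each gene's expression as a token.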

3. Model Architecture and Training

The Polygene model is based on the Transformer architecture and is pretrained with a self-supervised objective. Specifically, the model randomly masks a portion of the gene expression values and then predicts the masked values from the remaining genes, much like masked language modeling in NLP. During training, phenotypes and genotypes (gene expression values) were masked with probabilities of 50% and 15%, respectively, to make the model more robust.
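The masking step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the token strings, the `[MASK]` symbol, and the function name `mask_tokens` are hypothetical, and the sketch always substitutes the mask symbol, whereas the source does not say whether other replacement strategies were also used. Only the two masking probabilities come from the text.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol

def mask_tokens(phenotype_tokens, gene_tokens, p_pheno=0.50, p_gene=0.15, rng=None):
    """Mask phenotype and gene tokens independently at different rates.

    Returns (inputs, labels). labels[i] holds the original token where
    inputs[i] was masked, and None where the position is left visible
    (such positions would be ignored by the training loss).
    """
    rng = rng or random.Random()

    def mask(tokens, p):
        masked, labels = [], []
        for tok in tokens:
            if rng.random() < p:
                masked.append(MASK)
                labels.append(tok)   # target the model must recover
            else:
                masked.append(tok)
                labels.append(None)  # visible token, no target
        return masked, labels

    pheno_in, pheno_lab = mask(phenotype_tokens, p_pheno)
    gene_in, gene_lab = mask(gene_tokens, p_gene)
    return pheno_in + gene_in, pheno_lab + gene_lab

# Made-up example: phenotype tokens followed by binned gene tokens.
phenotypes = ["sex:female", "age:59", "tissue:heart", "cell_type:endothelial"]
genes = ["VWF:7", "H4C3:2", "CD55:4", "KCNH8:1"]
inputs, labels = mask_tokens(phenotypes, genes, rng=random.Random(0))
```

Masking phenotypes far more often than genes means the model frequently has to infer sex, age, tissue, or cell type from expression alone, which plausibly explains why the learned embeddings are useful for classifying those attributes.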

4. Result Analysis and Validation

To evaluate the model’s performance, the authors conducted multi-level analyses of Polygene’s outputs. First, they used the model-generated gene and phenotype embeddings to classify cell types, tissue origins, age, and sex. The results showed that Polygene outperformed other state-of-the-art methods, such as scGPT, in distinguishing closely related cell types and states. Additionally, through cosine similarity analysis, the authors revealed the dynamic functions of genes across different phenotypic contexts. For example, the H4C3 gene exhibited significant similarity across all phenotypes, indicating its fundamental role in cell proliferation and cell cycle progression.
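For readers unfamiliar with the metric, cosine similarity between two embedding vectors can be computed as in the sketch below. The vectors are made-up three-dimensional stand-ins for a gene's embedding in two phenotypic contexts; real model embeddings would have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical embeddings of one gene in two contexts.
gene_in_heart = [0.8, 0.1, 0.3]
gene_in_lung = [0.7, 0.2, 0.4]
similarity = cosine_similarity(gene_in_heart, gene_in_lung)
```

A value near 1 indicates that the gene is represented almost identically in both contexts, while lower values suggest that its role shifts with the phenotype.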

5. Gene Network Reconstruction and Polyfunctionality Analysis

Another key contribution of the study is the illumination of the dynamic structure of gene networks across different phenotypic contexts. By analyzing gene networks in endothelial cells (ECs), the authors found that aging alters the power-law distribution of the networks, suggesting a reorganization of their structure. Furthermore, by analyzing the embeddings of the von Willebrand factor (VWF) gene, the authors discovered that this gene forms two distinct functional clusters in ECs, related to blood coagulation and oxidative stress responses, respectively.
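One simple way to check whether a network's degree distribution resembles a power law is to count how many nodes have each degree and fit a line on log-log axes. The sketch below shows that idea only; it is not the authors' analysis. The edge list and the function names are hypothetical, and a least-squares fit on log-log counts is a crude diagnostic (maximum-likelihood estimators are generally preferred for real data).

```python
import math
from collections import Counter

def degree_counts(edges):
    """Map each degree to the number of nodes that have it."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())

def loglog_slope(counts):
    """Least-squares slope of log(node count) against log(degree).

    For a power law P(k) ~ k^(-gamma), the slope approximates -gamma.
    """
    xs = [math.log(k) for k in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two distinct degrees to fit a slope")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Toy star-shaped network: one hub gene connected to four others.
edges = [("hub", "g1"), ("hub", "g2"), ("hub", "g3"), ("hub", "g4")]
slope = loglog_slope(degree_counts(edges))
```

Comparing the fitted slope between networks built from young and old cells is one way a shift in the degree distribution, such as the one reported for aging endothelial cells, could be quantified.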

Key Findings and Logical Connections

  1. Generation of Gene and Phenotype Embeddings: The Polygene model successfully generated high-dimensional gene and phenotype embeddings that accurately captured differences in cell types, tissue origins, age, and sex. For example, the model excelled at distinguishing cardiac ventricle and atrium tissues, consistent with their close functional and anatomical relationships.

  2. Polyfunctionality of Genes: By analyzing the embeddings of the VWF gene, the study revealed its multifunctionality in endothelial cells. This not only expands our understanding of gene functions but also provides new perspectives for drug discovery and cell therapy.

  3. Reconstruction of Gene Networks: The study demonstrated that aging alters the structure of gene networks in endothelial cells, particularly the roles of low-degree nodes. This finding provides new candidate genes for studying vascular aging, such as KCNH8 and DNAJA4.

Conclusion and Value

By integrating high-dimensional genotypic and phenotypic data, this study developed a multimodal foundation model, Polygene, that reveals complex dynamic relationships between genotype and phenotype. The scientific value of this research lies in providing a computational framework that analyzes gene expression and phenotypic information together, so that gene expression can be interpreted within its biological context. Its practical value is reflected in the model’s potential to discover cross-tissue biomarkers, dissect gene polyfunctionality, and accelerate therapeutic target discovery.

Highlights and Innovations

  1. Multimodal Learning Approach: This study is the first to introduce self-supervised language models into genotype–phenotype relationship research, pioneering a new paradigm of integrated genetics.

  2. Discovery of Polyfunctional Genes: By analyzing the VWF and CD55 genes, the study revealed their multifunctionality across different cells and phenotypic contexts, offering new directions for personalized medicine.

  3. Reconstruction of Gene Networks: The study is the first to report the context-dependent structure of gene networks from RNA expression data, particularly the restructuring of gene networks during aging, providing new insights into vascular aging.

Additional Valuable Information

  1. Public Availability of Data and Code: The research team has publicly released the training script for the Polygene model, the transcriptome tokenizer, and the code for data preprocessing and inference on GitHub and Zenodo, facilitating reproducibility and extension of the results by other researchers.

  2. Future Research Directions: The authors suggest that future research could further optimize data processing techniques to enable more efficient handling of diverse transcriptomic profiles and expand the model’s applications, particularly in personalized medicine and drug discovery.

Overall, this study deepens our understanding of the complex relationship between gene expression and phenotypic manifestation and lays a foundation for future genomic research.