Identification of Clonal Hematopoiesis Driver Mutations Through In Silico Saturation Mutagenesis
Introduction
In healthy hematopoiesis, a pool of hematopoietic stem cells (HSCs) contributes to all blood lineages. With age, however, clonal hematopoiesis (CH) often emerges: the progeny of a single HSC expands until it accounts for a significant fraction of blood cells and platelets. This clonal expansion is driven by somatic mutations that HSCs acquire throughout life, and it is highly prevalent among the elderly. Mutations in CH-related genes give HSCs a growth advantage, so they are positively selected during hematopoiesis (1-13). Recent studies have shown that CH is associated with hematologic malignancies, cardiovascular disease, all-cause mortality, and an increased risk of solid tumors and infections (2, 7, 14-20). Although approximately 60 CH driver genes have been identified (1, 12, 13, 21), our understanding of which mutations in these genes drive clonal expansion remains very limited.
Some research teams have summarized existing knowledge of multiple CH genes as sets of expert-curated rules for selecting the mutations most likely to drive CH. These rules are typically applied together with stringent variant filtering to identify variants in blood samples from healthy individuals. However, the rules have limitations: they cannot learn directly from CH mutation data, they are not updated systematically, and they vary in the number of genes covered and in the depth of knowledge encoded for each gene.
To overcome these limitations, the authors adopted a machine learning approach that builds interpretable models trained on available high-quality CH mutation data. Such models can capture complex patterns in CH mutations and can be extended as more CH mutation datasets become available (28). The goal of the study was to build models for 12 CH driver genes with this approach, to identify CH driver mutations accurately, and to validate the models on nearly 500,000 UK Biobank donors.
Study Source
The paper was authored by Santiago Demajo et al., from institutions including the Institute for Research in Biomedicine (IRB Barcelona), Centro de Investigación Biomédica en Red en Cáncer (CIBERONC), and Universitat Pompeu Fabra. It was published in the September 2024 issue of the journal Cancer Discovery.
Research Process
Experimental Design and Methods
- Data Collection and Processing:
The research team first collected data from over 33,000 cancer patients across three large cancer genomics cohorts (TCGA, HMF, and MSK-IMPACT). High-quality somatic mutations in blood were obtained by reverse calling, which eliminates germline contamination. These mutations were then used to train machine learning models that identify CH driver mutations.
- Model Construction and Validation:
The research team used XGBoost (version 0.90) to train gene-specific machine learning models, named BOOSTDM-CH. Training relied on a set of high-quality positive examples (known CH driver mutations) and synthetic negative examples (neutral mutations). Features included significant clustering of mutations along the linear sequence, clustering in the three-dimensional protein structure, enrichment in functional domains, the consequence type of the mutation, and conservation across vertebrates.
- Experimental Design:
The researchers evaluated model performance by cross-validation and interpreted individual predictions through the contribution of each feature, for example using SHAP values.
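To make the SHAP-based interpretation more concrete, the following is a minimal sketch of the Shapley value idea that SHAP builds on, written with the Python standard library only. It is not the authors' code: the scoring function `toy_score`, the feature names, and the all-zero reference are hypothetical, and real SHAP analyses of tree models use faster tree-specific algorithms rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for a single prediction.

    Features outside a coalition are set to their baseline value.
    The cost grows exponentially with the number of features.
    """
    names = list(instance)
    n = len(names)

    def value(coalition):
        # Score the instance with only the coalition's features "switched on".
        x = {f: (instance[f] if f in coalition else baseline[f]) for f in names}
        return predict(x)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_score(x):
    # Hypothetical driver score (NOT the BOOSTDM-CH model):
    # two additive terms plus one interaction between two features.
    return (0.1
            + 0.5 * x["in_cluster"]
            + 0.2 * x["in_domain"]
            + 0.2 * x["in_cluster"] * x["nonsense"])


# Hypothetical mutation with all three binary features present,
# explained relative to a reference with all features absent.
mutation = {"in_cluster": 1, "in_domain": 1, "nonsense": 1}
reference = {"in_cluster": 0, "in_domain": 0, "nonsense": 0}

phi = shapley_values(toy_score, mutation, reference)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.3f}")
```

In this toy case the additive `in_domain` term receives its full 0.2, the interaction term is split evenly between the two features involved, and the contributions sum to the difference between the mutation's score and the reference score.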
Main Experimental Results
- Model Performance Evaluation:
The BOOSTDM-CH models performed well in cross-validation. For example, for the DNMT3A gene, the model’s F50 score ranged from 0.86 to 0.99. The models also significantly outperformed expert-curated rules in classifying observed CH mutations as drivers or non-drivers.
- Application to Large Cohorts:
Using the UK Biobank dataset, the research team applied the BOOSTDM-CH models to 201,857 candidate mutations identified among 467,202 donors, classifying each as a driver or a non-driver. In 92.5% of cases, model-identified CH consisted of a single driver mutation, similar to what is seen for observed driver mutations. The presence of these driver mutations was significantly associated with clinical characteristics such as age, smoking history, cardiovascular diseases, hematologic malignancies, and all-cause mortality.
- Variation Distribution and Feature Analysis:
With the BOOSTDM-CH models, the research team could analyze CH driver mutations in depth based on model scores. For instance, in DNMT3A, high-confidence mutations (scores ≥0.9) were concentrated in specific regions, suggesting that mutations in these regions disrupt normal protein function.
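The F50 figure reported above is not defined in this summary. Assuming it denotes an F-beta score with beta = 0.5, which weights precision more heavily than recall, it could be computed from thresholded model scores roughly as follows. The labels, the scores, and the 0.5 decision threshold are made up for illustration and are not taken from the paper.

```python
def fbeta(y_true, y_pred, beta=0.5):
    """F-beta score for binary labels; beta < 1 favors precision."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def classify(scores, threshold=0.5):
    """Call a mutation a driver when its score reaches the threshold.

    The 0.5 default is an assumption made for this sketch.
    """
    return [s >= threshold for s in scores]


# Hypothetical held-out fold: 1 = driver, 0 = neutral.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.40, 0.20, 0.10, 0.05]

preds = classify(scores)
f50 = fbeta(labels, preds, beta=0.5)
f1 = fbeta(labels, preds, beta=1.0)
print(f"F50 = {f50:.4f}, F1 = {f1:.4f}")
```

In this example one true driver is missed and no neutral mutation is miscalled, so precision (1.0) exceeds recall (0.75) and the precision-weighted F50 comes out higher than F1.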
Research Conclusion
This paper presented the construction and validation of machine learning models that identify driver mutations in 12 CH driver genes and showed their advantages over traditional expert-curated rules. Besides revealing more complex CH mutation patterns and mechanisms, the BOOSTDM-CH models identified CH driver mutations more accurately than those rules, providing a powerful tool for future large-scale cohort studies of the associations between CH and various conditions.
Research Highlights
- Application of Machine Learning Methods:
This study is the first to successfully apply machine learning methods to identify CH driver mutations, avoiding the subjective biases inherent in traditional expert-curated rules, marking a significant innovation.
- Large-scale Validation:
The study validated the model’s performance in the large UK Biobank cohort, showing that the model could accurately identify CH driver mutations and establish significant associations with various clinical traits.
- In-depth Understanding of CH Mechanisms:
By training and applying the BOOSTDM-CH model, the research provided new insights and tools for understanding the mechanisms of CH mutations in different genes.
Additional Information and Future Prospects
The research team has made the BOOSTDM-CH models and related data publicly available on the IntOGen website (www.intogen.org/ch/boostdm) for community use and plans to expand and refine the models as more datasets become available. The models could be applied widely in large-scale retrospective or prospective clinical studies, helping to monitor the health of high-risk individuals and supporting the development of personalized treatment plans.
Conclusion
This study demonstrates the potential of using the machine learning-based BOOSTDM-CH model to identify and analyze CH driver mutations, offering a novel and effective method for CH research and precise analysis of large-scale cohort data. By tapping into data from large cohorts like the UK Biobank, the BOOSTDM-CH model not only helps scientists better understand the mechanisms of CH but also provides valuable resources for future research and clinical applications.