A Statistical Framework for Multi-Trait Rare Variant Analysis in Large-Scale Whole-Genome Sequencing Studies
A New Framework for Multi-Trait Rare Variant Analysis: Multistaar
Research Background and Problem Statement
With the advancement of next-generation sequencing technologies and the decreasing cost of whole-genome sequencing (WGS), researchers can study the impact of rare variants on complex human traits in greater depth. However, single-trait analysis methods often lack sufficient statistical power to detect rare variant associations, especially when dealing with multi-ethnic samples and complex genetic structures. Additionally, many genetic variants exhibit pleiotropy, where a single variant or gene influences multiple traits, so methods that analyze multiple traits simultaneously are needed to improve detection.
Existing multi-trait rare variant analysis methods have shown higher statistical power than single-trait analyses but face computational bottlenecks when handling large-scale WGS data. They also do not make full use of functional annotation information, which reduces both interpretability and statistical power. To address these issues, researchers have developed a new statistical framework, Multi-trait Variant-set Test for Association using Annotation Information (Multistaar), which aims to improve the detection of rare variant associations in large-scale WGS data by jointly analyzing multiple traits and incorporating multiple functional annotations.
Paper Source
This paper was co-authored by researchers from the Harvard T.H. Chan School of Public Health and Columbia University Medical Center and published in Nature Computational Science. This journal focuses on publishing cutting-edge research in computational science, covering a wide range of topics from foundational theories to practical applications.
Research Workflow and Key Results
Workflow
Data Preparation
Researchers collected WGS data from 61,838 individuals from the Trans-Omics for Precision Medicine (TOPMED) project of the National Heart, Lung, and Blood Institute (NHLBI). These individuals came from 20 multi-ethnic study cohorts and included African American, White, Asian American, Hispanic/Latino American, and other participants. To ensure data quality, strict quality control steps were implemented, including removing low-quality DNA samples and duplicate samples.
Model Construction
The core of Multistaar lies in its two-step workflow:
Fitting Null Models: Researchers fit a multivariate linear mixed model (MLM) under the null hypothesis, using a sparse genetic relatedness matrix (GRM) and ancestry principal components (PCs) to adjust for population structure and relatedness while accounting for correlations among the traits.
Association Testing: Building on the fitted null model, Multistaar improves the detection of rare variant associations by dynamically integrating multiple functional annotations, such as CADD, LINSIGHT, and FATHMM-XF. Multistaar provides three tests:
- Burden test (Multistaar-B)
- SKAT test (Multistaar-S)
- ACAT-V test (Multistaar-A)
Additionally, Multistaar offers an omnibus test (Multistaar-O) that combines the results of the above three tests to achieve higher robustness and statistical power.
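The text above does not say how the omnibus test combines the three results. A minimal sketch of one widely used option, the Cauchy combination (ACAT) rule, is given below as an illustration. Treating it as the combination step here is an assumption, and the function name `cauchy_combine` and the example p-values are hypothetical.

```python
import math


def cauchy_combine(p_values, weights=None):
    """Combine p-values with the Cauchy combination (ACAT) rule.

    Assumption: this illustrates one common way to merge test results;
    it is not confirmed to be the exact rule used by Multistaar-O.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if weights is None:
        weights = [1.0] * len(p_values)
    if len(weights) != len(p_values):
        raise ValueError("weights and p_values must have the same length")
    if any(not (0.0 < p < 1.0) for p in p_values):
        raise ValueError("p-values must lie strictly between 0 and 1")

    total_weight = sum(weights)
    # Map each p-value to a standard Cauchy quantile; a small p gives a large value.
    t = sum(w * math.tan((0.5 - p) * math.pi)
            for w, p in zip(weights, p_values)) / total_weight
    # Convert the weighted average back to a p-value with the Cauchy survival function.
    return 0.5 - math.atan(t) / math.pi


# Hypothetical p-values for one variant set from the burden, SKAT, and ACAT-V tests.
p_burden, p_skat, p_acat_v = 1e-6, 0.5, 0.5
print(cauchy_combine([p_burden, p_skat, p_acat_v]))
```

With equal weights, the combined p-value is driven mostly by the smallest input, which is why a combination of this kind stays powerful when only one of the component tests matches the underlying genetic architecture.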
Experimental Design
To evaluate the performance of Multistaar, researchers conducted extensive simulation studies and real data analyses. In the simulations, they generated datasets of three quantitative traits, each containing 10,000 individuals, with varying proportions of causal variants and effect directions. For the real data analysis, researchers applied Multistaar to perform multi-trait rare variant analysis of three lipid traits in the TOPMED project: low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), and triglycerides (TG).
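To make the simulation setup concrete, the sketch below generates three quantitative traits for 10,000 individuals together with one rare variant whose effect direction differs across traits. Everything beyond the trait count and sample size is an assumption made for illustration: the shared-factor correlation `rho`, the minor allele frequency `maf`, the effect sizes, and the function names are hypothetical and do not reproduce the paper's simulation design.

```python
import math
import random


def simulate_traits(n=10_000, rho=0.5, maf=0.005,
                    effects=(0.3, 0.3, -0.3), seed=0):
    """Simulate one rare variant and len(effects) correlated traits.

    Assumptions (not from the paper): traits share a latent factor that
    gives pairwise noise correlation rho, and each trait has its own
    signed effect of the variant.
    """
    rng = random.Random(seed)
    shared_w = math.sqrt(rho)
    own_w = math.sqrt(1.0 - rho)
    genotypes = []
    traits = [[] for _ in effects]
    for _ in range(n):
        # Alternate-allele count (0, 1, or 2) from two independent draws.
        g = (rng.random() < maf) + (rng.random() < maf)
        genotypes.append(g)
        z = rng.gauss(0.0, 1.0)  # latent factor shared by all traits
        for k, beta in enumerate(effects):
            noise = shared_w * z + own_w * rng.gauss(0.0, 1.0)
            traits[k].append(beta * g + noise)
    return genotypes, traits


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


genotypes, traits = simulate_traits()
print("alternate alleles observed:", sum(genotypes))
print("correlation of traits 1 and 2:", round(pearson(traits[0], traits[1]), 3))
```

Flipping the sign of one entry in `effects` is a simple way to mimic the varying effect directions mentioned above.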
Key Results
Type I Error Rate Control
Using 10^8 simulation runs, researchers evaluated the type I error rate of Multistaar at the α=10^-4, 10^-5, and 10^-6 levels. The results showed that all Multistaar tests controlled the type I error rate, with empirical rates closely matching the nominal significance levels.
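The number of runs matters because very small significance levels need many null replicates before an empirical error rate can be estimated. The short sketch below works through that arithmetic and shows how an empirical type I error rate is computed from null p-values. The helper names are hypothetical.

```python
def expected_false_positives(n_runs, alpha_exponent):
    """Expected count of null replicates with p < 10**-alpha_exponent."""
    return n_runs // 10 ** alpha_exponent


def empirical_type1_error(p_values, alpha):
    """Fraction of null-simulation p-values that fall below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


N_RUNS = 10 ** 8
for k in (4, 5, 6):
    count = expected_false_positives(N_RUNS, k)
    print(f"alpha = 1e-{k}: about {count} rejections expected under the null")
```

At α=10^-6, 10^8 runs yield only about 100 expected rejections under a well-calibrated test, so a much smaller simulation would leave the empirical rate at that level too noisy to judge.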
Power Evaluation
In terms of power evaluation, researchers compared the performance of Multistaar with existing methods such as Burden-MT, SKAT-MT, and ACAT-V-MT. The results indicated that Multistaar exhibited higher statistical power under various genetic architectures, particularly showing strong robustness even when handling non-informative annotations.
Real Data Analysis
In the real data analysis of the TOPMED project, Multistaar identified 51 significant rare variant associations with lipid traits in coding regions, 34 of which remained significant after conditional analysis. Multistaar also discovered 76 significant associations in noncoding regions and ncRNA genes, 6 of which remained significant after conditional analysis. Notably, many of these newly discovered association signals were not detected by single-trait analysis, further demonstrating the effectiveness of Multistaar.
Conclusion and Significance
Conclusion
With the Multistaar framework, researchers addressed two limitations of existing multi-trait rare variant analysis methods on large-scale WGS data: computational bottlenecks and limited use of functional annotations. By jointly analyzing multiple traits and integrating multiple functional annotations, Multistaar increased statistical power and identified rare variant association signals that single-trait analysis missed.
Significance
This study holds significant scientific value and application prospects. First, Multistaar provides new tools and methods for studying the genetic basis of complex traits, helping to uncover the mechanisms of rare variants in disease development. Second, the application scope of Multistaar is not limited to lipid traits but can be extended to other complex trait studies, such as glycemic traits and inflammation markers. Finally, the successful development of Multistaar offers strong support for future large-scale biobank sequencing studies, potentially advancing precision medicine.
Research Highlights
- Innovativeness: Multistaar introduces a novel multi-trait rare variant analysis framework that integrates multiple functional annotations, significantly enhancing statistical power.
- Robustness: Multistaar performs excellently in type I error rate control and power evaluation, especially showing strong robustness when handling non-informative annotations.
- Wide Application: Multistaar can be used not only for lipid trait studies but also extended to other complex trait studies, offering broad application prospects.
- Efficiency: Multistaar has high computational efficiency, capable of analyzing large-scale WGS data within a short time, making it suitable for large biobank sequencing studies.
The development of Multistaar provides new ideas and methods for multi-trait rare variant analysis and is expected to play an important role in future research.