Enabling efficient analysis of biobank-scale data with genotype representation graphs

Research Based on Genotype Representation Graph (GRG): A New Framework to Enhance Biobank-Scale Data Analysis Efficiency

Academic Background and Research Motivation

With the rapid advancement of sequencing technologies, the collection of large-scale genomic data has become increasingly common, especially in the field of human disease association studies, where the demand for genomic data analysis is growing. By the end of 2023, the UK Biobank released approximately 500,000 whole genomes on its cloud computing platform, with 200,000 of them already phased. Such massive datasets provide unprecedented opportunities for research but also bring new challenges: how to efficiently encode and analyze such vast amounts of genomic data? Traditional tabular data structures (e.g., VCF file formats) face bottlenecks in storage and computational efficiency, making it difficult to meet the ever-growing data demands.

In this context, scientists have proposed new data representation and processing methods to optimize compression rates and computational performance. The goal of this research is to develop a novel data structure with higher compression and computational efficiency to address the needs of biobank-scale data analysis.

Source of the Paper

Titled “Enabling Efficient Analysis of Biobank-scale Data with Genotype Representation Graphs,” this paper was authored by Drew Dehaas, Ziqing Pan, and Xinzhu Wei and published in Nature Computational Science. All authors are affiliated with the Department of Computational Biology at Cornell University, with Dehaas and Pan serving as co-first authors and Xinzhu Wei as the corresponding author.

Detailed Research Process and Technical Methods

Core Development: Genotype Representation Graph (GRG)

The research team proposed a data structure called Genotype Representation Graph (GRG), which aims to reconstruct genotypic data using a graph-based approach to address the storage and computational efficiency issues of traditional tabular encodings. GRG is a fully connected hierarchical directed acyclic graph (DAG) that can losslessly represent phased whole-genome polymorphisms.

Key Structural Features of GRG:

  1. Node Types: Nodes are classified into sample nodes, mutation nodes, and internal nodes. Sample nodes represent haploid genomes, while specific mutations (deviations from the reference sequence) are encoded as mutation nodes.
  2. Directed Acyclic Graph Property: There are no duplicate paths in the GRG structure, and there is only one unique path from a mutation node to a sample node.
  3. Hierarchical Design: Internal nodes effectively cover genotype information shared by multiple samples and achieve compression by avoiding redundant relationships.

Research Methods and Experimental Process

The research team designed a series of experimental steps for GRG construction and validation, including algorithm development, simulated data testing, and real biological data application.

(1) GRG Construction Algorithm

The construction algorithm consists of four key steps: 1. Genome Segmentation: The genome is first divided into fixed-length segments, with each segment ranging from 50 to 150 kilobase pairs (kbp). 2. Local Tree Graph (Tree GRG, TGRG) Construction: A local ancestry tree is created for each segment, using Hamming distance to measure the similarity of mutations among samples. 3. Mutation Mapping: Based on the local tree, mutation data for each segment is precisely located and mapped to the corresponding nodes in the local tree graph. 4. Global Graph Merging: All local tree graphs are merged into a global GRG, with node numbering and graph structure optimized.

The algorithm also utilizes efficient Bloom filters and BK-trees to accelerate neighbor node searches, significantly reducing construction costs.

(2) Simulated Data Testing

To test GRG’s performance, the research team used the msprime tool to generate simulated genome data containing 10 to 1 million haploid samples, with mutation and recombination rates set at 10-8 (per base pair per generation). The experiments validated GRG’s construction efficiency, file size, and memory requirements. The results showed that GRG requires only 10GB of memory for 1 million samples, with file sizes ranging from 5 to 26GB per chromosome, demonstrating efficient scalability.

(3) UK Biobank Data Application

The team further tested GRG’s practicality on the UK Biobank dataset containing 200,000 phased genomes. Through multi-threaded parallelization (using 70 CPU cores), GRG construction for all 22 chromosomes took only 14 hours, with file sizes 13 times smaller than VCF files and the total data volume compressed to less than 160GB.

(4) Graph Traversal and Dynamic Computation

GRG’s graph traversal supports dynamic programming algorithms, enabling the reuse of intermediate computational results. For example, in allele frequency and genome-wide association studies (GWAS), traversing nodes upward or downward significantly improves computational speed. Notably, this approach resembles the recursive subproblem-solving strategy in numerical optimization.

Software Implementation and Ecosystem Expansion

The research team developed an open-source tool library, GRGL, to support GRG construction and computation, significantly simplifying the processing of large-scale genomic data.

Research Results and Key Findings

  1. File Storage Efficiency: Compared to traditional VCF and BGEN formats, GRG achieved a 13-fold compression ratio on the UK Biobank dataset of 200,000 genomes, and the compressed files do not require additional decompression, making processing more efficient.
  2. Computational Efficiency: On both simulated and real data, GRG’s dynamic programming approach for allele frequency analysis was 220 times faster than VCF, and GWAS analysis was more than 2.6 times faster than traditional matrix operation tools.
  3. Scalability Verification: GRG supports the processing of genome data with up to a million samples, with file storage and computational performance scaling sublinearly with sample size, demonstrating outstanding scalability on both simulated and UK Biobank datasets.

Conclusion and Research Value

This study presents an efficient genomic data representation method, GRG, offering new possibilities for large-scale biological data analysis. Based on the generative model of biological genetic diversity and graph theory, GRG achieves both extreme data compression and improved computational efficiency. Its potential value includes: 1. Data Compression: Effectively alleviating the storage and transmission pressure of biobank-scale data. 2. Computational Acceleration: Improving the execution efficiency of core genetic statistical tasks such as GWAS and allele frequency calculations through GRG. 3. Future Extensibility: GRG can support not only human data but also extend to other species, including virus sequence data compression and analysis (e.g., SARS-CoV-2).

The GRG proposed in this research not only opens a new path in the field of data structures but also demonstrates the potential application of graph structures in statistical genetics. The GRG-based data analysis framework will have a profound impact on future bioinformatics and genomic studies.