A Telomere-to-Telomere Complete Diploid Genome Assembly for Han Chinese

T2T-YAO: Assembly of a Han Chinese Full-length Diploid Reference Genome

Scientific Background

Since the launch of the Human Genome Project (HGP) some thirty years ago, constructing a complete and accurate human reference genome has been a long-term goal of biomedical research. Limitations in sequencing technology long kept that goal out of reach. Following recent breakthroughs in sequencing technology, the T2T (Telomere-to-Telomere) project released the first full-length haploid human genome, T2T-CHM13v1.1. That assembly filled in the roughly 8% of the genome, mostly highly repetitive regions, that had previously been unresolved, and reached a quality value of Q73.94, which corresponds to about one error per 24.8 megabases.
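The Q value is on the Phred scale, so the quoted error rate follows directly from it. The short Python sketch below shows the conversion; the function names are illustrative and do not come from the paper.

```python
def error_probability(qv: float) -> float:
    """Per-base error probability for a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def bases_per_error(qv: float) -> float:
    """Expected number of bases per single error at the given quality value."""
    return 1 / error_probability(qv)


if __name__ == "__main__":
    # Q73.94 is the value quoted above for T2T-CHM13v1.1.
    print(f"{bases_per_error(73.94) / 1e6:.1f} Mb per error")
```

Every additional 10 Q units corresponds to a tenfold lower error rate, which is why differences of less than one Q unit between assemblies are small in absolute terms.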

Exciting as this achievement is, T2T-CHM13 does not represent a real human individual: it is a haploid genome derived from a complete hydatidiform mole (CHM) cell line, which lacks a Y chromosome. The cell line is of Northern European origin, and its Y chromosome was supplied by HG002, an individual of Ashkenazi (Eastern European) Jewish ancestry, so the reference still does not represent populations worldwide. Likewise, although the Human Pangenome Reference Consortium (HPRC) has released draft genomes from 47 individuals from around the world, that set is still too small to represent all populations.

Against this background, the Han Chinese, the world’s largest population group, are underrepresented in current human genome references such as GRCh38 and the HPRC pangenome, and samples from the regions where the Han population originated are particularly lacking. Constructing a high-quality diploid T2T reference genome for Han Chinese is therefore important for in-depth biological research and medical applications across ethnic groups.

Research Source

The authors of this original research article are from several institutions, including Peking University People’s Hospital and the Beijing Institute of Genomics, Chinese Academy of Sciences. The paper was published online in the journal Genomics, Proteomics & Bioinformatics on August 16, 2023.

Research Process

Selection of Research Sample

To construct a full-length diploid reference genome for Han Chinese, the research team collected samples from a healthy Han Chinese male from an ancient village in Shanxi Province. Han Chinese families have lived there for generations since the Ming Dynasty, so the donor is expected to carry relatively unadmixed Han ancestry.

Sequencing and Data Collection

The paper describes in detail how genomic material was obtained from peripheral blood mononuclear cell (PBMC) samples of a family trio (the son and both parents). Karyotype analysis was performed first to exclude chromosomal abnormalities. Several complementary technologies were then used to ensure sufficient sequencing depth and coverage: PacBio High-Fidelity (HiFi) sequencing, Oxford Nanopore Technologies (ONT) sequencing, Arima chromosome conformation capture (Hi-C) sequencing on the Illumina platform, and Bionano optical mapping.

Genome Assembly and Correction

Trio-based assembly used paternal- and maternal-specific markers to assign the son’s ONT reads to their parental haplotypes, and a gapless graph built from HiFi reads was integrated step by step. Ultra-long ONT reads and low-frequency k-mers were then used to fill the remaining gaps, ultimately yielding a T2T assembly. After several rounds of cross-checking against the different data types, strict strategies were applied to correct single-nucleotide variant (SNV) and structural variant (SV) errors, ensuring the accuracy of the final reference genome.
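The general idea behind trio-based haplotype separation can be illustrated with a toy example: k-mers seen in only one parent act as markers, and each of the child's reads is assigned to the parent whose markers it carries more of. The Python sketch below is a simplified illustration with made-up sequences and hypothetical function names. It is not the pipeline used in the paper, and it ignores reverse complements, sequencing errors, and marker frequency thresholds, all of which real tools handle.

```python
def kmers(seq: str, k: int) -> set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_markers(paternal_reads, maternal_reads, k):
    """Return (paternal-only k-mers, maternal-only k-mers)."""
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    return pat - mat, mat - pat


def assign_read(read, pat_markers, mat_markers, k):
    """Assign a child's read to the parent whose markers it shares more of."""
    read_kmers = kmers(read, k)
    pat_hits = len(read_kmers & pat_markers)
    mat_hits = len(read_kmers & mat_markers)
    if pat_hits > mat_hits:
        return "paternal"
    if mat_hits > pat_hits:
        return "maternal"
    return "unassigned"  # no markers, or a tie


if __name__ == "__main__":
    K = 5  # toy value; real analyses use much longer k-mers
    # Made-up parental sequences that differ at a single base.
    pat_markers, mat_markers = parent_specific_markers(
        ["ACGTACGTTAGC"], ["ACGTACGATAGC"], K
    )
    for read in ["GTACGTTAG", "GTACGATAG", "ACGTACG"]:
        print(read, assign_read(read, pat_markers, mat_markers, K))
```

Reads that fall entirely in sequence shared by both parents stay unassigned, which is one reason long reads help: a longer read is more likely to span at least one parent-specific marker.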

Data Validation and Evaluation

Tools such as Merqury were used to evaluate the completeness and accuracy of the T2T-YAO genome. The assembly reached a quality value (QV) of Q74.69, higher than the Q73.94 of T2T-CHM13, which makes T2T-YAO, at the time of publication, the highest-quality diploid human reference genome reported.
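Merqury estimates QV without a reference by comparing the k-mers in the assembly with those in an accurate read set: k-mers that appear in the assembly but never in the reads are treated as evidence of consensus errors. The Python sketch below reproduces that calculation under stated assumptions. The function name, the toy counts, and the default k = 21 are illustrative; the k-mer size and read sets actually used for T2T-YAO are not given in this summary.

```python
import math


def kmer_based_qv(asm_only_kmers: int, total_asm_kmers: int, k: int = 21) -> float:
    """Estimate a Phred-scaled consensus QV from k-mer counts.

    asm_only_kmers:  k-mers present in the assembly but absent from the reads
    total_asm_kmers: all k-mers present in the assembly
    k:               k-mer size (21 is an assumed default, not the paper's value)
    """
    if asm_only_kmers == 0:
        return math.inf  # no unsupported k-mers were observed
    # Fraction of assembly k-mers that are supported by the reads.
    # A base is taken to be correct with probability
    #   P = (supported fraction) ** (1 / k),
    # and the per-base error rate is E = 1 - P.
    # log1p/expm1 keep precision when E is extremely small.
    log_supported = math.log1p(-asm_only_kmers / total_asm_kmers)
    error_rate = -math.expm1(log_supported / k)
    return -10 * math.log10(error_rate)


if __name__ == "__main__":
    # Toy counts chosen only to show the scale, not taken from the paper.
    print(f"QV = {kmer_based_qv(100, 3_000_000_000):.2f}")
```

Because the estimate depends on the read set used as ground truth, QV values are most comparable between assemblies when they are computed with the same k and the same kind of reads.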

Research Results

Distribution of Han Chinese Genetic Markers

Based on SNP data from the 1000 Genomes Project, the T2T-YAO genome carries predominantly East Asian genetic markers, with small contributions of South Asian, European, and American markers. This profile distinguishes the Han Chinese genome from those of other populations.

Unique Genes and Sequences

Comparison with existing human genomes showed that about 10% of the T2T-YAO sequence is unique. These sequences lie mainly in heterochromatic regions such as centromeres and extend the genetic diversity documented for the Han Chinese genome.

Structural Variations

The study also identified several large structural variations, such as a 4 Mb inversion on the short arm of chromosome 8. This inversion has been reported in earlier genetic studies and illustrates the structural diversity among populations.

Y Chromosome Architecture

The Y chromosome in T2T-YAO (YAO-Y) has a total length of 51 Mb, about 10 Mb shorter than the Y chromosome in the T2T-CHM13 reference (which was supplied by HG002), with most of the difference in the Yq12 region. This difference reflects the length polymorphism of Y chromosomes across populations.

Research Significance

This study constructed the first full-length diploid reference genome for Han Chinese. In future biomedical research, especially precision medicine targeting the Han population, genetic variants can therefore be located and analyzed more precisely. The T2T-YAO genome also provides a practical foundation for future genomics research and drug development.

Conclusion

T2T-YAO represents a significant advance in genome assembly: it is presented as the first truly complete and accurate diploid human genome, and it is expected to play a major role in future medical and biological research. Beyond demonstrating a technological breakthrough, the study provides a detailed and authentic reference genome for the large Han Chinese population, giving it substantial academic value and clear prospects for application.