Principled distillation of UK Biobank phenotype data reveals underlying structure in human variation

In this report, we present a detailed evaluation of a research paper published in the journal Nature Human Behaviour, titled “Principled distillation of UK Biobank phenotype data reveals underlying structure in human variation.” The study, conducted by Caitlin E. Carey, Rebecca Shafee, Robbee Wedow, and colleagues, was published online on XX XX, XXXX. The article is available at https://doi.org/10.1038/s41562-024-01909-5.

Research Background and Significance

With increasing public and private investment in large-scale data collection and integration, biobanks have emerged as data repositories that connect health outcomes with biological samples from thousands of individuals. Biobanks comprise a wealth of detailed variables extracted from electronic health records (EHR), self-reported survey measures, laboratory tests, and physical and cognitive assessments. Although these vast resources now drive discoveries in human health and disease, the breadth and depth of the data can obscure larger patterns present in the biobank. To better understand the landscape relevant to human health, methods are needed that can identify latent structures and reduce the thousands of variables into fewer, more comprehensible constructs.

Dimensionality reduction is a common task in many fields, and thus a variety of methods have been applied to biobank-scale data. Nonetheless, factor analysis has not received widespread attention in biobank analysis. This method models correlations among observed variables as one or more shared continuous latent factors. It is model-based and more conducive to statistical inference than descriptive summaries (e.g., principal component analysis) or “black-box” algorithmic solutions, and it optimizes the extraction of factors that have simple relationships where possible with observed items.
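As a point of reference, the generic common factor model behind this description can be written as follows. The notation is the standard textbook one and is an assumption here, not taken from the paper: x is the vector of observed items, f the vector of latent factors, Λ the loading matrix, Φ the factor covariance matrix (the identity when factors are orthogonal), and Ψ a diagonal matrix of item-specific variances.

```latex
\mathbf{x} = \boldsymbol{\Lambda}\,\mathbf{f} + \boldsymbol{\varepsilon},
\qquad
\operatorname{Cov}(\mathbf{x}) = \boldsymbol{\Lambda}\,\boldsymbol{\Phi}\,\boldsymbol{\Lambda}^{\top} + \boldsymbol{\Psi}
```

A "simple" structure in this sense means that most entries of Λ are close to zero, so that each observed item is explained mainly by one factor.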

In this study, the authors adapted and extended factor analysis methods and applied them to a broad, multimodal set of biobank phenotypic data. Their goal was to assess whether the identified structure is informative, whether it reveals relationships that might otherwise be masked, and whether factor scores can strengthen analyses linking phenotypic and genetic data.

Moreover, the study emphasizes the significance of accounting for the composite nature of constructs such as socioeconomic status, trauma, or physical activity when evaluating public health patterns and considering the human phenome comprehensively.

Research Authors and Institutional Background

The primary authors include Caitlin E. Carey of Harvard Medical School. Co-authors include Rebecca Shafee, Robbee Wedow, Amanda Elliott, Duncan S. Palmer, John Compitello, Masahiro Kanai, Liam Abbott, Patrick Schultz, and Konrad J. Karczewski, among others. They are affiliated with institutions such as the University of California system, New York University, the Broad Institute, and other collaborative research centers.

Research Process and Findings

Next, we will detail each step of the research process and the main findings.

Research Process

The overall research process was carried out through the following key steps:
a) Selection of study subjects: Unrelated individuals of predominantly European genetic ancestry were chosen as the study sample.
b) Data processing and preparation: Diverse phenotypic data from the UK Biobank were processed and organized.
c) Model structure determination: A multi-stage factor analysis approach was employed, including Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA).
d) Factor score calculation: Based on the final factor model, latent factor scores for each individual were computed.

Main Findings

The main findings of this study are as follows:
1) The authors identified 35 orthogonal latent factors covering 505 observed items. These factors capture known disease classifications, disentangle elements of socioeconomic status, highlight the relevance of psychiatric constructs to health, and improve the measurement of pro-health behaviors.
2) Factor scores were associated with future mortality, genetic signals, and health outcomes.
3) Genetic correlations and genetic enrichment analyses of the factors revealed connections between biomarkers and diseases.

Research Conclusions and Value

By adapting factor analysis methods for large-scale biobank data and extracting interpretable, usable latent structures, this study underscores the value of principled dimensionality reduction for understanding human variation. The results highlight the importance of accounting for the structure of that variation and provide a foundation for further discovery in health and well-being research.

The research is particularly significant in the medical field because the extracted factors capture diagnoses, causes, and consequences of diseases such as asthma and coronary artery disease in a hypothesis-free and data-driven manner, simplifying and clarifying disease classifications from a broader relational structure.

Furthermore, the heritability of factor scores and the enhanced power of genetic findings suggest considering multiple cross-phenotypic indicators when studying complex human phenotypes, especially those not easily measurable by experiments.

Research Highlights and Features

By applying model-based data reduction techniques to hundreds of diverse items in the biobank, this study successfully distilled the phenotypic landscape into comprehensible latent constructs with interpretable axes of variation.

Notably, after this decomposition, socioeconomic status was found to be spread across multiple factors rather than captured by a single one. This supports the long-held view that elements such as education, income, occupation, and other facets of social status should be treated separately, and it provides a foundation for identifying comparable constructs in other sociopolitical, cultural, and diagnostic contexts.

Summary

The study’s results indicate that principled factor analysis provides a novel perspective and tool for modeling correlations among phenotypic measures in biobank data. This is crucial for understanding the complex relationships among human health, behavior, and disease.