Medical History Predicts Phenome-Wide Disease Onset and Enables the Rapid Response to Emerging Health Threats

Using Medical Records to Predict Common Disease Incidence and Support Rapid Response to Emerging Health Threats

Research Background and Motivation

The COVID-19 pandemic exposed a global lack of systematic, data-driven guidance, which hampered both the identification of high-risk populations and pandemic preparedness. Assessing an individual’s future disease risk is crucial for guiding preventive interventions, early detection, and the initiation of treatment. However, only a small fraction of common diseases have bespoke risk scores, leaving healthcare providers and individuals without guidance for most conditions. Even where established risk scores exist, there is no consensus on which score to use or which physiological or laboratory measurements it should rely on, so routine medical practice is highly fragmented. In the early stages of the COVID-19 pandemic in particular, the lack of available data made it impossible to use risk scores to identify vulnerable populations.

Most medical decisions, including diagnosis, treatment, and disease prevention, are based on an individual’s medical history. With the digitization of records, this information is already being collected as electronic health records by healthcare providers, insurance companies, and governments. However, because humans can process and interpret only a limited volume of data, the potential of these readily accessible records to improve medical decision-making has been only partly realized.

In existing studies, electronic health records have been used to guide clinical decisions and to conduct etiological, diagnostic, and prognostic research. Although some efforts have combined known clinical predictors with new methods or with other data modalities such as clinical notes, few studies have explored predictive potential across the entire human phenome. The potential for systematically using routinely collected health records to guide medical decisions therefore remains largely untapped.

Research Source

This study was authored by Jakob Steinfeldt, Benjamin Wild, Thore Buergel, Maik Pietzner, Julius Upmeier Zu Belzen, Andre Vauvelle, Stefan Hegselmann, Spiros Denaxas, Harry Hemingway, Claudia Langenberg, Ulf Landmesser, John Deanfield, and Roland Eils. The authors are from several well-known institutions in Germany, the United Kingdom, and the United States. The paper was published in the journal “Nature Communications” in 2024.

Research Process

Data Collection and Description

The study is based on the UK Biobank and All of Us cohorts. The UK Biobank cohort includes 502,460 individuals, primarily of UK descent, with a median age of 58 years; 54.4% are female. Participants were recruited between 2006 and 2010 and followed for a median of 12.6 years. The study defined 1,883 phenome-wide endpoints and used the UK Biobank data to develop and validate models for them. The All of Us cohort includes 229,830 individuals from diverse backgrounds in the United States, with a median age of 54 years; 61.1% are female. Recruitment for this cohort began in 2019, with a median follow-up of 3.5 years.

Model Development and Validation

The study employed a neural network that learns from an individual’s entire medical history to predict the risk of many diseases. A multi-layer perceptron was developed, trained, and validated within the UK Biobank cohort to estimate disease risks from routinely collected health records. Traditional approaches such as linear models or survival trees require a separate model for each disease; this approach instead predicts all endpoints simultaneously with a single neural network, which considerably simplifies the modeling setup.
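To illustrate the idea of one network serving many endpoints, the following is a minimal sketch of a multi-task forward pass: a shared hidden layer feeds one output score per endpoint. It is not the authors' implementation. The layer sizes, the random untrained weights, the helper names (init_layer, linear, relu, predict_all_endpoints), and the toy input vector are all assumptions made for illustration.

```python
import random


def init_layer(rng, n_out, n_in):
    """Return (weights, biases) with small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Apply y = W x + b to a single input vector."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def predict_all_endpoints(x, hidden_layer, output_layer):
    """Shared hidden representation, then one score per endpoint."""
    hidden = relu(linear(x, hidden_layer))
    return linear(hidden, output_layer)


# Toy sizes (assumed); the study's real model is far larger.
N_CONCEPTS, N_HIDDEN, N_ENDPOINTS = 6, 4, 3

rng = random.Random(0)
hidden_layer = init_layer(rng, N_HIDDEN, N_CONCEPTS)
output_layer = init_layer(rng, N_ENDPOINTS, N_HIDDEN)

# Hypothetical medical history vector: 1 = concept present in the record.
history = [1, 0, 1, 1, 0, 0]
scores = predict_all_endpoints(history, hidden_layer, output_layer)
print(len(scores))  # number of endpoint scores produced by the single network
```

The design point is that the hidden layer is computed once and reused by every endpoint, whereas a per-disease approach would repeat the whole model for each endpoint.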

To confirm the generalizability of these models, the study conducted external validation in the All of Us cohort, evaluating model performance across different healthcare systems and populations. The study also explored how the method could be applied to cardiovascular disease prevention and to emerging health threats such as COVID-19 (secondary infection, all-cause mortality).

Data Integration and Analysis

Before further analysis, the study mapped all health records to the OMOP vocabulary. The primary record domains were drugs and observations, followed by conditions, procedures, and devices. Very rare concepts were excluded, leaving 15,595 unique concepts. A multi-task multi-layer perceptron (88.4 million parameters) was then used to predict the onset of all 1,883 endpoints simultaneously, and its performance was compared against benchmark linear models.
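The preprocessing described above can be sketched as follows. This is a hedged illustration, assuming a simple presence/absence (multi-hot) encoding and a count-based rarity threshold; the summary does not specify either choice. The function names, the threshold, and the concept strings are hypothetical placeholders, not real OMOP concept identifiers.

```python
from collections import Counter


def build_vocabulary(histories, min_count):
    """Keep concepts recorded for at least `min_count` individuals."""
    counts = Counter(c for history in histories for c in set(history))
    return sorted(c for c, n in counts.items() if n >= min_count)


def multi_hot(history, vocabulary):
    """Encode one individual's history as a 0/1 vector over the vocabulary."""
    present = set(history)
    return [1 if concept in present else 0 for concept in vocabulary]


# Hypothetical toy records, one list of concepts per individual.
histories = [
    ["drug:statin", "condition:hypertension"],
    ["drug:statin", "procedure:ecg"],
    ["condition:hypertension", "drug:statin", "device:rare_item"],
]

# Assumed threshold: drop concepts seen in fewer than two individuals.
vocabulary = build_vocabulary(histories, min_count=2)
features = [multi_hot(h, vocabulary) for h in histories]

print(vocabulary)
print(features)
```

Each row of features has one position per retained concept, so every individual is represented by a vector of the same length regardless of how many records they have.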

Risk Status and Event Occurrence

To evaluate whether health records can identify high-risk populations, the study analyzed the relationship between the risk estimated by the neural network for each endpoint and the subsequent occurrence of disease. For the vast majority of endpoints, event rates differed significantly between individuals in the top 10% and the bottom 10% of estimated risk. This pattern spanned multiple disease categories and etiologies, including rheumatoid arthritis, ischemic heart disease, and chronic obstructive pulmonary disease.
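The comparison between the extremes of the risk distribution can be sketched for a single endpoint as below. This is a simplified illustration under stated assumptions: events are treated as plain 0/1 outcomes without accounting for censoring or follow-up time, and the function name, the toy risks, and the toy events are hypothetical rather than taken from the study.

```python
def event_rates_at_extremes(risks, events, fraction=0.10):
    """Return (rate in lowest-risk group, rate in highest-risk group)."""
    order = sorted(range(len(risks)), key=lambda i: risks[i])
    k = max(1, int(len(risks) * fraction))  # group size, at least one person
    bottom, top = order[:k], order[-k:]

    def rate(indices):
        return sum(events[i] for i in indices) / len(indices)

    return rate(bottom), rate(top)


# Hypothetical data: 20 individuals with increasing predicted risk.
risks = [i / 20 for i in range(20)]
events = [0] * 16 + [1, 0, 1, 1]  # 1 = endpoint observed during follow-up

low_rate, high_rate = event_rates_at_extremes(risks, events)
print(low_rate, high_rate)
```

A large gap between the two rates indicates that the predicted risk separates individuals who go on to develop the condition from those who do not.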

Research Results

Model Performance

The study found that for 1,774 (94.2%) of the 1,883 endpoints, models based on medical history significantly outperformed baseline models that considered only age and sex. The models were particularly good at distinguishing high-risk from low-risk individuals for common diseases and for conditions with a high societal burden.
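This summary does not name the metric behind "outperformed". One common way to compare time-to-event risk models is a concordance index, which is assumed here purely for illustration. The sketch below implements a basic Harrell-style concordance index for right-censored data and compares a hypothetical informative model against an uninformative baseline; all data values and variable names are invented.

```python
def concordance_index(times, events, scores):
    """Fraction of comparable pairs in which the earlier event has the higher score.

    A pair (i, j) is comparable when individual i had an observed event
    and did so before individual j's event or censoring time.
    Tied scores count as half concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored individuals cannot anchor a comparison
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical follow-up data for one endpoint.
times = [2, 4, 6, 8]        # years until event or censoring
events = [1, 1, 0, 1]       # 1 = event observed, 0 = censored

history_scores = [0.9, 0.7, 0.3, 0.1]   # assumed medical-history model output
baseline_scores = [0.5, 0.5, 0.5, 0.5]  # assumed uninformative baseline

c_history = concordance_index(times, events, history_scores)
c_baseline = concordance_index(times, events, baseline_scores)
print(c_history, c_baseline)
```

Under this assumed metric, a model outperforms the baseline on an endpoint when its concordance index is higher, and such a comparison would be repeated for each endpoint.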

External Validation

External validation in the All of Us cohort showed that for 1,347 (85.9%) of the endpoints evaluated there, models based on medical history again significantly outperformed the baseline models. This indicates that risk prediction based on medical history generalizes well across different healthcare systems and diverse populations.

Disease Prevention and Response to Emerging Health Threats

The study further demonstrated the potential of this approach in cardiovascular disease prevention and responding to emerging health threats such as COVID-19. Risk prediction models based on medical history can identify high-risk populations early, thus helping to optimize prevention and treatment strategies.

Conclusion

This study demonstrates the potential of systematically using routine health records to assess disease risk across the phenome. The resulting risk estimates can be used to respond rapidly to emerging health threats such as COVID-19. The results indicate that the method has not only scientific value but also broad prospects for application in medical practice.

Research Highlights

  1. Novelty of the Method: Using neural networks to simultaneously predict multiple endpoints simplifies the model structure.
  2. Wide Applicability: The model performs well across different healthcare systems and diverse populations.
  3. Practical Significance: The model can be used for cardiovascular disease prevention and responding to emerging health threats such as COVID-19.

This research demonstrates how routinely collected data can be linked to clinical practice to guide preventive interventions, early diagnosis, and treatment, providing new insights for future large-scale population health management.