Electronic Health Record Signatures Identify Undiagnosed Patients with Common Variable Immunodeficiency Disease

Using Electronic Health Record Features to Identify Undiagnosed Patients with Common Variable Immunodeficiency

Research Overview

Johnson and colleagues recently published a study titled “Electronic health record signatures identify undiagnosed patients with common variable immunodeficiency disease” in Science Translational Medicine. The study applies a machine learning algorithm, PheneT, to electronic health records (EHRs) to identify undiagnosed patients with common variable immunodeficiency (CVID), offering a route to earlier diagnosis and treatment.

Research Background and Purpose

Human inborn errors of immunity (IEI) include a range of functional and quantitative antibody deficiencies caused by B cell dysfunction, one of which is common variable immunodeficiency (CVID). CVID is a highly heterogeneous group of diseases with diverse manifestations, including infections, autoimmune diseases, and inflammatory conditions, many of which overlap with common diseases. Because CVID is rare (affecting approximately 1 in 25,000 people) and phenotypically diverse, diagnosis and treatment are often delayed, with an average of 5 to 15 years between symptom onset and diagnosis. This delay increases patient suffering and raises the overall cost to the healthcare system. There is currently no universally recognized single cause of CVID, and genetic testing cannot provide a definitive diagnosis. An effective method for shortening the time to diagnosis is therefore urgently needed so that these patients can be diagnosed and treated early.

Source of the Paper

This study was conducted by Ruth Johnson, Alexis V. Stephens, Rachel Mester, and others from UCLA, and was published in Science Translational Medicine on May 1, 2024.

Research Institutions

The authors of this study come from multiple academic and medical research institutions, including:

  • University of California, Los Angeles (UCLA)
  • University of California, Irvine (UCI)
  • University of California, San Diego (UCSD)
  • Vanderbilt University, Nashville, TN

Research Methods

This study developed a machine learning algorithm named PheneT to identify undiagnosed CVID patients from EHR data.

a) Research Process

  1. Data Preparation:

    • Extracted approximately 3200 candidate patients with immune deficiency-related ICD codes from UCLA’s electronic health record system. After manual review by clinical immunologists, 197 patients meeting CVID criteria were identified as “true” cases for model construction.
  2. Feature Selection:

    • Extracted features from these cases and used the Human Phenotype Ontology (HPO) and Online Mendelian Inheritance in Man (OMIM) databases to map CVID clinical phenotypes to Phecodes (phenotype codes), obtaining 34 Phecodes related to CVID.
    • Refined the feature selection using the training dataset of CVID patients, which yielded a selection of 44 Phecodes.
  3. Model Training:

    • Trained the model on the selected features using marginal logistic regression.
    • Balanced the training data by moderately oversampling to expand the sample size (oversampling ratio of 0.5).
    • Refined the model with IgG lab test results under five-fold cross-validation, which improved its accuracy.
  4. Validation and Application:

    • Conducted external validation on over six million records from five different healthcare systems (including UCLA), showing PheneT’s applicability across different systems.
    • Within UCLA’s EHR data, PheneT flagged CVID patients 244 days (approximately 8 months) before their recorded diagnosis.
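
The weighting and scoring steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes that "marginal" means each feature is fit on its own, and it uses the fact that for a single binary feature the logistic regression coefficient is the log odds ratio of the 2x2 table. The feature labels, the toy patients, the helper names (marginal_log_odds, fit_weights, risk_score), and the 0.5 smoothing constant are all hypothetical, and the oversampling and IgG steps are left out.

```python
import math


def marginal_log_odds(records, labels, feature, smoothing=0.5):
    """Weight for one binary feature, fit on its own (assumed meaning of "marginal").

    For a single binary predictor the logistic coefficient is the log odds
    ratio of the 2x2 table. `smoothing` is added to every cell so the weight
    stays finite when a cell is empty (an illustrative choice).
    """
    case_with = case_without = ctrl_with = ctrl_without = smoothing
    for codes, is_case in zip(records, labels):
        present = feature in codes
        if is_case and present:
            case_with += 1
        elif is_case:
            case_without += 1
        elif present:
            ctrl_with += 1
        else:
            ctrl_without += 1
    return math.log((case_with * ctrl_without) / (case_without * ctrl_with))


def fit_weights(records, labels, features):
    """Fit one marginal weight per feature."""
    return {f: marginal_log_odds(records, labels, f) for f in features}


def risk_score(codes, weights):
    """Sum the weights of the features present in one patient's record."""
    return sum(w for f, w in weights.items() if f in codes)


# Toy data: each record is the set of (hypothetical) feature labels for a patient.
records = [
    {"pneumonia", "sinusitis"},             # case
    {"pneumonia", "itp"},                   # case
    {"sinusitis", "itp", "hypertension"},   # case
    {"hypertension"},                       # control
    {"sinusitis"},                          # control
    {"hypertension"},                       # control
    set(),                                  # control
]
labels = [1, 1, 1, 0, 0, 0, 0]
features = ["pneumonia", "sinusitis", "itp", "hypertension"]

weights = fit_weights(records, labels, features)
high = risk_score({"pneumonia", "itp"}, weights)
low = risk_score({"hypertension"}, weights)
print(weights)
print(high, low)
```

In this toy setup, features seen mostly in cases receive positive weights and features seen mostly in controls receive negative ones, so a patient's score rises with each case-associated code in the record.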

b) Main Results

  • Performance of PheneT:

    • PheneT outperformed the existing phenotype risk score (PheRS) method, improving the AUC-ROC and AUC-PR metrics by 17% and 42%, respectively.
    • The PheneT model accurately and efficiently identified CVID patients using 65 features.
  • Early Diagnosis:

    • PheneT can identify high-risk CVID patients months before diagnosis. In the study, it could detect the disease an average of 244 days before the confirmed diagnosis.
    • Additionally, among the 100 patients with the highest risk scores, 74% were evaluated as highly likely to have CVID, supporting the usefulness of PheneT for prioritizing patients.
  • Cross-Institution Validation:

    • PheneT was applied to EHR data from several University of California medical centers and Vanderbilt University, showing high robustness and applicability across different datasets.
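
The metrics named in these results can be computed as sketched below. This is an illustrative example on made-up scores and labels, not the study's evaluation code: the function names and toy data are hypothetical, AUC-PR is computed here as step-wise average precision (one common estimator, assumed rather than confirmed for the paper), and precision at k mirrors the "top 100" review with k = 4.

```python
def auc_roc(scores, labels):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > n) + 0.5 * (c == n) for c in cases for n in controls)
    return wins / (len(cases) * len(controls))


def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each rank where a case appears
    return total / hits if hits else 0.0


def precision_at_k(scores, labels, k):
    """Fraction of true cases among the k highest-scoring patients."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])[:k]
    return sum(y for _, y in ranked) / k


# Toy risk scores (higher means more suspicious) and true labels (1 = case).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

print(auc_roc(scores, labels))            # 14/15, about 0.933
print(average_precision(scores, labels))  # 2.75/3, about 0.917
print(precision_at_k(scores, labels, 4))  # 0.75
```

For a rare disease, AUC-PR and precision at k are usually the more informative of these numbers, because AUC-ROC can stay high even when most flagged patients are false positives.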

c) Conclusions and Research Value

  • Scientific Value:

    • This study demonstrates the significant potential of machine learning in the medical field, particularly for the early diagnosis of rare diseases.
    • It shows that machine learning on large-scale EHR data can shorten the time to diagnosis for rare diseases, reducing patient suffering and wasted medical resources.
  • Application Value:

    • PheneT provides new methods and tools for clinical diagnosis, helping doctors to identify potential CVID patients earlier, enabling early intervention, and improving patient outcomes.
    • Healthcare systems can use this algorithm to screen a wide population, enhancing the identification rate of rare diseases and optimizing healthcare resource allocation.

d) Research Highlights

  • Innovativeness:

    • The PheneT algorithm combines machine learning and large-scale EHR data to explore complex pathological features not covered by traditional methods.
    • The comprehensive risk scoring model for CVID improves current methods and shows high reliability in cross-institution validation.
  • Clinical Impact:

    • PheneT has the potential to substantially reduce the diagnostic delay of CVID, save medical resources, and improve patients’ quality of life and prognosis.

Through systematic analysis of EHR data, the PheneT algorithm demonstrates significant potential in diagnosing complex rare diseases, providing a valuable reference for the future application of AI in healthcare.