Large Language Models to Identify Social Determinants of Health in Electronic Health Records

Background and Research Motivation

Social determinants of health (SDOH) strongly influence patient health outcomes, yet they are often missing or incompletely recorded in the structured data of electronic health records (EHRs). Large language models (LLMs) could extract SDOH from the narrative text of EHRs at high throughput to support research and clinical care. However, class imbalance and limited data make it difficult to extract this sparsely documented information. The paper investigates how best to use LLMs to extract six categories of SDOH (employment, housing, transportation, parenthood, relationships, and social support) from EHR narrative text.

Research Source

This study was conducted by Marco Guevara, Shan Chen, and collaborating authors from the Artificial Intelligence in Medicine (AIM) Program at Mass General Brigham, Harvard Medical School. Other participating institutions include Brigham and Women’s Hospital, Dana-Farber Cancer Institute, and Boston Children’s Hospital. The paper was published in 2024 in volume 7 of npj Digital Medicine, a journal published in partnership with Seoul National University Bundang Hospital.

Research Process

Subjects and Methods

  1. Subjects: The study included clinical notes from the electronic health records of cancer patients who received radiation therapy (RT). The dataset consisted of 800 clinical notes from 770 patients.

  2. Data Annotation: Interviews with social workers, resource experts, and oncologists were conducted to identify clinically relevant SDOH that are not captured as structured data in the EHR. Six categories of SDOH were selected: employment status (employed, unemployed, underemployed, retired, disabled, student), housing issues (financial situation, homelessness, others), transportation issues (distance, resources, others), parenthood, relationships (married, partner, widowed, divorced, single), and social support (available or not). Annotation covered two tasks: identifying any mention of an SDOH, and identifying mentions of an adverse SDOH.

  3. Data Augmentation: Synthetic data were generated using GPT-3.5 to enhance the diversity of the training set and improve model performance.

  4. Model Development: Multi-label classification was performed with BERT and Flan-T5 models. The Flan-T5 models were fine-tuned with LoRA (low-rank adaptation), a parameter-efficient fine-tuning method. The Flan-T5 variants used were base, large, xl, and xxl.

  5. Model Evaluation: Performance was evaluated on development and test sets using the macro F1 score, computed separately for the any-SDOH-mention task and the adverse-SDOH-mention task.
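
The multi-label setup and the macro F1 metric from the steps above can be illustrated with a minimal Python sketch. The category names follow the six SDOH categories listed in the annotation step. The helper names (`encode`, `macro_f1`), the toy labels, and the convention of scoring a category with no gold and no predicted positives as F1 = 0 are assumptions made for illustration, not the authors' code.

```python
from typing import Iterable, List

# The six SDOH categories named in the annotation step.
CATEGORIES = ["employment", "housing", "transportation",
              "parenthood", "relationship", "social_support"]


def encode(labels: Iterable[str]) -> List[int]:
    """Turn a collection of category names into a multi-hot vector."""
    present = set(labels)
    unknown = present - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if c in present else 0 for c in CATEGORIES]


def f1_for_column(gold: List[List[int]], pred: List[List[int]], col: int) -> float:
    """F1 for a single category, computed from TP/FP/FN counts."""
    tp = sum(1 for g, p in zip(gold, pred) if g[col] == 1 and p[col] == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g[col] == 0 and p[col] == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g[col] == 1 and p[col] == 0)
    denom = 2 * tp + fp + fn
    # Assumption: a category that never occurs and is never predicted scores 0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold: List[List[int]], pred: List[List[int]]) -> float:
    """Unweighted mean of per-category F1, so rare categories count equally."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores = [f1_for_column(gold, pred, i) for i in range(len(CATEGORIES))]
    return sum(scores) / len(scores)


# Toy example (made-up labels for three sentences, not study data).
gold = [encode({"employment"}), encode({"housing", "relationship"}), encode([])]
pred = [encode({"employment"}), encode({"relationship"}), encode({"transportation"})]
print(round(macro_f1(gold, pred), 3))  # 0.333
```

Because the macro average weights every category equally, a model that ignores a sparse category such as housing is penalized as heavily as one that ignores a common category, which is why the metric suits imbalanced SDOH labels.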

Main Experiments and Results

  1. Model Performance: On the radiation therapy test set, the best-performing model for any SDOH mention was Flan-T5 xxl with synthetic data (macro F1 0.71); for adverse SDOH mention, the best model was Flan-T5 xl without synthetic data (macro F1 0.70). Overall, Flan-T5 models outperformed BERT models, and performance generally improved with model size.

  2. Effect of Data Augmentation: Augmenting the training set with synthetic data improved model performance, with often substantial gains in categories with sparse data (e.g., housing, parenthood, transportation).

  3. Model Bias Evaluation: Both the Flan-T5 models and ChatGPT sometimes classified a sentence differently depending on whether it included demographic information. However, the rate of such changes was significantly lower for the Flan-T5 models than for ChatGPT. Among sentences with demographic information, ChatGPT showed a higher proportion of classification changes under descriptions of females and males.

  4. Comparison with Structured EHR Data: The study found that SDOH information extracted through text was more effective in identifying patients with adverse SDOH than related ICD-10 codes in structured EHR data.
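
The bias evaluation in item 3 compares a model's output on a sentence with and without demographic information. A minimal Python sketch of one way to measure this is shown below: the share of paired predictions that change once a descriptor is added. The `PairedPrediction` type, the function names, the descriptor strings, and the toy predictions are all hypothetical and chosen for illustration; the study's exact procedure may differ.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, NamedTuple


class PairedPrediction(NamedTuple):
    descriptor: str               # demographic descriptor added (hypothetical values)
    without_demo: FrozenSet[str]  # labels predicted for the original sentence
    with_demo: FrozenSet[str]     # labels predicted after adding the descriptor


def discrepancy_rate(pairs: List[PairedPrediction]) -> float:
    """Fraction of sentence pairs whose predicted label set changed."""
    if not pairs:
        raise ValueError("at least one pair is required")
    changed = sum(1 for p in pairs if p.without_demo != p.with_demo)
    return changed / len(pairs)


def discrepancy_by_descriptor(pairs: List[PairedPrediction]) -> Dict[str, float]:
    """Discrepancy rate computed separately for each descriptor."""
    groups: Dict[str, List[PairedPrediction]] = defaultdict(list)
    for p in pairs:
        groups[p.descriptor].append(p)
    return {d: discrepancy_rate(g) for d, g in groups.items()}


# Toy predictions (made up for illustration, not study data).
pairs = [
    PairedPrediction("female", frozenset({"housing"}), frozenset({"housing"})),
    PairedPrediction("female", frozenset({"employment"}), frozenset()),
    PairedPrediction("male", frozenset({"housing"}), frozenset({"housing"})),
    PairedPrediction("male", frozenset(), frozenset()),
]
print(discrepancy_rate(pairs))           # 0.25
print(discrepancy_by_descriptor(pairs))  # {'female': 0.5, 'male': 0.0}
```

A lower rate means the model's classification depends less on demographic wording that should be irrelevant to the SDOH label.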

Study Highlights and Conclusions

  1. Highlights:

    • The study demonstrated the potential of large language models in extracting SDOH information from EHRs.
    • Incorporating synthetic data improved the LLMs’ ability to recognize rare SDOH data categories.
    • Flan-T5 models outperformed popular models such as ChatGPT in dealing with data scarcity challenges and exhibited fewer algorithmic biases.
  2. Significance and Value:

    • The research showed the potential of using LLMs to improve real-world SDOH data collection and support resource allocation for patients.
    • The study provides new annotation guidelines and synthetic SDOH datasets for the research community.
    • The proposed method aids in better understanding the driving factors of health disparities and supports identifying patients who could benefit the most from resource and social work interventions.
  3. Future Research Directions:

    • Further optimization of synthetic data generation methods to better mine clinical information from sparse records.
    • Combining with other data sources to improve model generalization capability.

This study offers a new approach to using large language models for the automatic extraction of SDOH information from EHRs, which can make health data more useful and support clinical decision-making. Further details and the model code are available in the related public repositories for further research and application.