Predicting Future Disorders via Temporal Knowledge Graphs and Medical Ontologies
Predicting Future Diseases: Integration of Temporal Knowledge Graphs and Medical Ontologies
Electronic Health Records (EHRs) are indispensable tools in modern medical institutions. They record detailed health histories of patients, including demographics, medications, lab results, and treatment plans. This data not only improves the coordination and continuity of medical services but also helps healthcare providers identify health trends and make data-driven decisions, thereby enhancing the overall quality of patient care. However, most of the data stored in EHRs is unstructured, especially the free-text data written by clinicians describing patient health conditions, which presents significant challenges for information extraction and effective utilization.
To address this challenge, numerous studies have used Natural Language Processing (NLP) to extract relevant information from unstructured data and link it to medical ontologies. Knowledge Graphs (KGs), already applied in recommendation systems, information retrieval, and natural language processing, have recently shown potential for integrating diverse types of patient data from multiple sources. However, traditional static knowledge graphs cannot represent temporal dependencies, so they fail to reflect how a patient's health state changes over time.
Research Background and Objectives
This study, authored by Marco Postiglione, Daniel Bean, Zeljko Kraljevic, Richard JB Dobson, and Vincenzo Moscato, was published in the IEEE Journal of Biomedical and Health Informatics. The research team comprises experts from the University of Naples Federico II and King’s College London, who have conducted a series of pioneering studies in the field.
In this study, the authors propose a temporal knowledge graph framework called MedTKG, which integrates the dynamic clinical history of patients with the static information from medical ontologies. The research frames future disease prediction as identifying the missing object in a quadruple (s, r, ?, t), where s is the patient, r is the patient-disease relationship type, ? is the disease to be predicted, and t is the query timestamp. The study validates the method on clinical notes from the MIMIC-III dataset and demonstrates the role of medical ontologies in enhancing model performance.
Methods and Procedures
Dataset and Preprocessing
The study uses the MIMIC-III dataset, developed by the MIT Lab for Computational Physiology, which contains information about patients in the Beth Israel Deaconess Medical Center intensive care unit from 2001 to 2012. The dataset includes 46,520 patients with a total of 2,083,179 unstructured clinical notes.
To extract concepts, the research team employed the Medical Concept Annotation Toolkit (MedCAT), a tool trained with self-supervised learning that can identify clinical concepts and link them to the SNOMED-CT ontology. The extracted data was then preprocessed in several steps: removing rare diseases occurring fewer than 100 times as well as easily identifiable patient concepts, retaining only biomedical concepts that appear at least twice, excluding parent concepts in the SNOMED-CT ontology that share an “is-a” relationship with concepts already in the timeline, removing repeated concepts within the same day, and excluding medical histories with fewer than 10 concepts.
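The filtering steps above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the timeline format (a list of (day, concept) pairs per patient), the `is_a` mapping, and the function and parameter names are all assumptions made for illustration, and only four of the listed steps are covered (frequency threshold, parent exclusion, same-day deduplication, minimum history length).

```python
from collections import Counter

def preprocess(timelines, is_a, min_count=100, min_len=10):
    """Filter patient timelines (hypothetical format, not the paper's code).

    timelines: dict mapping patient id -> list of (day, concept) pairs.
    is_a:      dict mapping a concept -> set of its direct parent concepts.
    """
    # Assumption: concept frequency is counted over all mentions in the corpus.
    counts = Counter(c for tl in timelines.values() for _, c in tl)
    cleaned = {}
    for patient, tl in timelines.items():
        # Step 1: drop concepts that are too rare in the corpus.
        tl = [(d, c) for d, c in tl if counts[c] >= min_count]
        # Step 2: drop concepts that are an "is-a" parent of another
        # concept present in the same timeline.
        present = {c for _, c in tl}
        parents = set()
        for c in present:
            parents |= is_a.get(c, set()) & present
        tl = [(d, c) for d, c in tl if c not in parents]
        # Step 3: keep each concept at most once per day.
        seen, deduped = set(), []
        for d, c in sorted(tl):
            if (d, c) not in seen:
                seen.add((d, c))
                deduped.append((d, c))
        # Step 4: discard histories that end up too short.
        if len(deduped) >= min_len:
            cleaned[patient] = deduped
    return cleaned
```

The thresholds are parameters so the sketch can be tried on toy data; the values reported in the text correspond to min_count=100 and min_len=10.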
Medical Ontology and Temporal Knowledge Graph
The study established mappings between medical concepts and their corresponding codes through the SNOMED-CT ontology, identifying and analyzing both direct relationships (such as “is-a” relationships) and indirect relationships (such as shared parent concepts). Results showed that integrating medical ontologies with temporal knowledge graphs significantly improved the prediction model’s performance.
MedTKG defines a medical history as a sequence of knowledge graphs, m_t = {g_1, g_2, …, g_t}, where t is the length of the sequence. Each knowledge graph g_t = ⟨v, r, e_t⟩ at timestamp t is a directed heterogeneous graph, where v, r, and e_t denote the entities, the relationships, and the set of facts at timestamp t, respectively. The static knowledge graph g_s models the knowledge embedded in the medical ontology.
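As a rough illustration of this data structure, the Python sketch below groups timestamped facts into one fact set per timestamp and selects the history available before a query time. The function names, the integer timestamps, and the relation label "has_disorder" are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def build_snapshots(quadruples):
    """Group (subject, relation, object, timestamp) facts by timestamp.

    Returns a list of (timestamp, fact_set) pairs in chronological order,
    one entry per knowledge graph in the sequence.
    """
    facts_by_time = defaultdict(set)
    for s, r, o, t in quadruples:
        facts_by_time[t].add((s, r, o))
    return [(t, facts_by_time[t]) for t in sorted(facts_by_time)]

def history_before(snapshots, query_time):
    """Snapshots strictly earlier than the query timestamp."""
    return [(t, facts) for t, facts in snapshots if t < query_time]

# Toy example with a hypothetical relation label.
quads = [
    ("patient_1", "has_disorder", "hypertension", 0),
    ("patient_1", "has_disorder", "diabetes", 1),
    ("patient_1", "has_disorder", "hypertension", 1),
]
snapshots = build_snapshots(quads)
```

A prediction query (s, r, ?, t) would then be answered using only `history_before(snapshots, t)`, so that no fact from the query timestamp or later is visible to the model.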
Model Design and Architecture
The architecture of the MedTKG model, as shown in Figure 2, mainly includes:
Input Module: It starts from the free text in clinical notes, extracting relevant clinical concepts through Named Entity Recognition and Linking (NER+L) tools, linking them to medical ontologies, and representing these extracted medical concepts in the form of temporal knowledge graphs.
Evolution Unit: This unit uses a relation-aware Graph Convolutional Network (GCN) to capture structural dependencies within each knowledge graph and a Temporal Gated Recurrent Unit (GRU) to model how the graph evolves over time. In addition, a static graph constraint component ties the evolving entity embeddings to the static embeddings of the medical ontology, so that the static characteristics of the ontology are retained.
Scoring Function and Loss Function: The scoring function computes the conditional probability of candidate triples given the medical history m_t, using ConvTransE as the decoder. The loss function consists of the entity prediction loss l_e and the medical ontology constraint loss l_s.
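The text does not give the formulas for the two loss terms, so the following Python sketch only shows one plausible shape for them. The softmax cross-entropy over candidate scores, the cosine-based hinge for the ontology constraint, the angle threshold, and the weighting parameter are all assumptions; the candidate scores stand in for the decoder output, and ConvTransE itself is not implemented here.

```python
import math

def entity_loss(scores, target):
    """Cross-entropy of the true object under a softmax over candidate scores."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def static_constraint_loss(evolved, static, max_angle_deg):
    """Hinge penalty once the angle between the evolved and the static
    embedding of an entity exceeds max_angle_deg (assumed form)."""
    return max(math.cos(math.radians(max_angle_deg)) - cosine(evolved, static), 0.0)

def total_loss(l_e, l_s, weight=0.5):
    """Weighted sum of both terms; the weighting scheme is an assumption."""
    return weight * l_e + (1.0 - weight) * l_s
```

Under these assumptions, an entity whose evolved embedding stays within the allowed angle of its ontology embedding contributes no penalty, while larger drift is penalized in proportion to the lost cosine similarity.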
Experiments and Results
Dataset and Medical Ontology Statistics
Medical histories were split into training, validation, and test sets, with the training set accounting for 90% and the validation and test sets each accounting for 5%. Appendix II provides detailed statistics on the dataset and on the graph data generated by the study.
Evaluation Metrics
The study used several evaluation metrics, including Mean Reciprocal Rank (MRR), Top-k Hit Rate (Hits@k), and Mean Recall Rate (MR@k). The results showed that MedTKG significantly outperformed the baseline methods in terms of true positive rate and hit rate, supporting its predictive accuracy in a clinical application setting.
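For readers unfamiliar with the ranking metrics, the short Python sketch below computes MRR and Hits@k from the rank of the correct disease in each query. It uses the standard definitions of these two metrics; the function names and the toy ranks are illustrative, and MR@k is left out because the text does not define it precisely.

```python
def rank_of(scores, target):
    """1-based rank of the target among candidates, highest score first."""
    return 1 + sum(1 for s in scores.values() if s > scores[target])

def mrr(ranks):
    """Mean Reciprocal Rank: average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose correct answer is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Toy example: ranks of the correct disease in three queries.
ranks = [1, 2, 4]
```

With these toy ranks, MRR is (1 + 1/2 + 1/4) / 3, Hits@1 is 1/3, and Hits@3 is 2/3. Both metrics lie between 0 and 1, and higher is better.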
Conclusion and Future Directions
This study proposed the MedTKG framework, which integrates the dynamic information in EHRs with the static information in medical ontologies and demonstrates significant advantages in predicting future diseases. Future research directions include analyzing MedTKG’s interpretability in depth to provide a clear and understandable rationale for its predictions, expanding the scope of the research to new datasets and more types of medical events, and validating the framework’s effectiveness in actual clinical applications through clinical trials.
By leveraging temporal knowledge graphs and medical ontologies, MedTKG provides a powerful modeling tool for the medical field, potentially improving the accuracy of clinical decision-making and thereby enhancing the overall health outcomes of patients.