KG4NH: A Comprehensive Knowledge Graph for Question Answering in Dietary Nutrition and Human Health
Background and Research Motivation
Dietary nutrition is closely tied to human health: research has linked improper nutrition to more than 200 diseases. The interactions between nutrients and diseases are complex, especially once the metabolic activity of gut microbiota is taken into account, which makes them hard to systematize and apply in practice. A comprehensive framework that integrates this knowledge and supports diet-related queries is therefore urgently needed.
Research Source
The paper was written jointly by Chengcheng Fu, Xueli Pan, Jieyu Wu, Junkai Cai, Zhisheng Huang, Frank van Harmelen, Weizhong Zhao, Xingpeng Jiang, and Tingting He. The authors are affiliated with the Key Laboratory of Artificial Intelligence and Intelligent Learning of Hubei Province, the College of Computer Science at Central China Normal University, and the Department of Computer Science at Vrije Universiteit Amsterdam. Some authors are also affiliated with other institutions, such as the Shanghai Pudong New Area Mental Health Center. The article was accepted by the IEEE Journal of Biomedical and Health Informatics, with official publication in 2023.
Research Process
The research consists of four main parts: data collection, triple extraction, knowledge integration and expansion, and development of a question-answering system.
Data Collection
The researchers searched PubMed for articles related to food, nutrition, and human diseases, collecting the titles and abstracts of 230,573 articles published between 2012 and 2022. These texts were processed with Stanford CoreNLP for tokenization and sentence segmentation, which split them into finer-grained units.
Triple Extraction
Concept Recognition
Researchers used the Concept Identification tool (CI) from the EURECA project to recognize and classify nutritional and disease entities in the text. For example, “type 2 diabetes” was identified as a disease entity and associated with various classifications. A total of 46,807 nutritional entities and 47,749 disease entities were identified through concept recognition.
Relation Extraction
For relation extraction, the researchers trained a BioLinkBERT model and tuned its parameters. Applied to the collected sentences, the model automatically extracted a total of 27,873 relationships involving 706 nutrients and 2,705 diseases.
Knowledge Integration and Expansion
Knowledge from multiple sources, such as FDC (FoodData Central) and KEGG (Kyoto Encyclopedia of Genes and Genomes), was integrated. The combined knowledge was stored in the GraphDB graph database and expanded by applying predefined rules based on transitivity and symmetry. The final knowledge graph contains 255,017,496 triples, 154 semantic relationships, and 7,437,819 entities.
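The rule-based expansion can be illustrated with a minimal sketch. It assumes triples are plain (subject, predicate, object) tuples and applies symmetry and transitivity until no new triple appears. The predicate names and the toy triples are hypothetical, chosen only for illustration; the actual rules are defined by the authors and are not reproduced here.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical predicate names, not taken from the KG4NH schema.
SYMMETRIC = {"interacts_with"}
TRANSITIVE = {"subclass_of"}


def expand(triples: Set[Triple]) -> Set[Triple]:
    """Apply symmetry and transitivity rules until a fixed point is reached."""
    closed = set(triples)
    while True:
        new: Set[Triple] = set()
        for s, p, o in closed:
            if p in SYMMETRIC:
                # (s p o) implies (o p s)
                new.add((o, p, s))
            if p in TRANSITIVE:
                # (s p o) and (o p o2) imply (s p o2)
                for s2, p2, o2 in closed:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))
        if new <= closed:
            return closed
        closed |= new


# Toy input, for illustration only.
seed = {
    ("entity_a", "subclass_of", "entity_b"),
    ("entity_b", "subclass_of", "entity_c"),
    ("entity_x", "interacts_with", "entity_y"),
}
expanded = expand(seed)
for triple in sorted(expanded - seed):
    print(triple)
```

In a deployed graph database this kind of inference is usually delegated to the store's own reasoning facilities; the loop above only shows why a handful of rules can multiply the number of triples.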
Development of Question-Answering System
Question Design
Three types of questions were designed around the three key themes in food and health research: nutritional analysis, nutrient metabolism, and the impact of food on human diseases. Templates for descriptive, comparative, and causal questions are provided, and each is translated into a SPARQL query that retrieves the answer from the knowledge graph.
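A template-based approach of this kind might look like the following sketch, which fills named slots in SPARQL strings. The namespace, the predicate names (`ex:containsNutrient`, `ex:hasEntry`, and so on), and the `build_query` helper are all hypothetical; the real KG4NH schema and templates are not shown in this summary.

```python
from string import Template

# Hypothetical namespace; the actual KG4NH vocabulary may differ.
PREFIX = "PREFIX ex: <http://example.org/kg4nh/>\n"

# One illustrative template per question style; predicates are assumptions.
TEMPLATES = {
    "descriptive": Template(
        "SELECT ?nutrient WHERE { ex:$food ex:containsNutrient ?nutrient . }"
    ),
    "comparative": Template(
        "SELECT ?food ?amount WHERE { "
        "VALUES ?food { ex:$food_a ex:$food_b } "
        "?food ex:hasEntry ?entry . "
        "?entry ex:nutrient ex:$nutrient ; ex:amount ?amount . } "
        "ORDER BY DESC(?amount)"
    ),
    "causal": Template(
        "SELECT ?relation WHERE { ex:$nutrient ?relation ex:$disease . }"
    ),
}


def build_query(kind: str, **slots: str) -> str:
    """Fill the slots of the chosen template; raises KeyError if one is missing."""
    return PREFIX + TEMPLATES[kind].substitute(**slots)


query = build_query("causal", nutrient="folate", disease="anemia")
print(query)
```

The resulting string could then be sent to a SPARQL endpoint. Slot values here are inserted verbatim, so a real system would need to validate or escape user input first.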
Benchmark Dataset
The benchmark dataset includes 120 questions covering three primary user groups: patients, doctors and nutritionists, and researchers. The questions were designed by experts and paired with standard answers, and were used to validate and assess the effectiveness of the system.
Main Results
Comparative Experiments
A comparison of the BioLinkBERT, BioBERT, and BlueBERT models showed that BioLinkBERT performed best on the relation extraction task, with a precision of 0.92, a recall of 0.81, and an F1 score of 0.86.
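As a quick consistency check on these figures, the F1 score is the harmonic mean of precision and recall. The sketch below assumes that the reported 0.92 is the precision value (an assumption, since F1 is defined from precision and recall rather than accuracy); the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for BioLinkBERT, with 0.92 read as precision.
f1 = f1_score(0.92, 0.81)
print(round(f1, 2))  # 0.86
```

Under that reading, the three reported numbers agree with one another.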
Interpretation Experiments
By calculating the importance of nutrient nodes in the relationship graph, the researchers found that folate and sucrose had higher importance among various nutrients.
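The summary does not state which importance measure was used. As one possible illustration, the sketch below ranks nutrients by degree centrality in a nutrient-disease graph, that is, by the share of other nodes each one is directly linked to. The measure, the function name, and the placeholder edges are all assumptions, not the authors' method or data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def degree_centrality(edges: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Fraction of the other nodes that each node is directly connected to."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for nutrient, disease in edges:
        neighbors[nutrient].add(disease)
        neighbors[disease].add(nutrient)
    n = len(neighbors)
    if n <= 1:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Placeholder edges, for illustration only.
edges = [
    ("nutrient_a", "disease_1"),
    ("nutrient_a", "disease_2"),
    ("nutrient_a", "disease_3"),
    ("nutrient_b", "disease_1"),
]
scores = degree_centrality(edges)
ranked = sorted(
    (node for node in scores if node.startswith("nutrient_")),
    key=lambda node: scores[node],
    reverse=True,
)
print(ranked)  # ['nutrient_a', 'nutrient_b']
```

Other measures, such as PageRank or betweenness centrality, would fit the same description and could produce a different ranking.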
Ablation Experiments
Ablation experiments evaluated the contribution of different knowledge sources to the question-answering system. The results showed that removing the knowledge integrated from existing sources significantly reduced the system's accuracy and other metrics.
Comparative Discussion
The research team compared the question-answering system with ChatGPT and found that their system had the advantage in accuracy and consistency but still needs improvement in robustness and interpretability.
Quality Assessment
The structural quality of the knowledge graph was evaluated with SHACL constraints, which revealed errors and incomplete concept definitions introduced during data import. These findings help further improve the knowledge graph.
Conclusion and Significance
This study developed a comprehensive, continually updated knowledge graph of dietary nutrition and human health through automated triple extraction and knowledge integration. Based on this knowledge graph, a query-based question-answering system was developed to provide precise answers to three types of questions. Five carefully designed experiments verified the method’s effectiveness. Overall, this study presents a systematic approach to constructing a knowledge graph of dietary nutrition and human health and provides researchers, clinicians, and patients with a powerful tool to explore the complex relationships between diet and health.
In future research, the team plans to further optimize the relation extraction models, integrate large language models and unsupervised learning techniques, and extend the question-answering system to cover more types of questions. Advanced natural language understanding technologies will also be introduced to improve the system's adaptability and responsiveness.