EHR-HGCN: An Enhanced Hybrid Approach for Text Classification Using Heterogeneous Graph Convolutional Networks in Electronic Health Records
Academic Background
With the rapid development of Natural Language Processing (NLP), text classification has become an important research direction. It helps extract the knowledge contained in documents and is widely applied to biomedical texts, including Electronic Health Records (EHR). Existing research mainly relies on deep learning methods such as Bidirectional Encoder Representations from Transformers (BERT) and Convolutional Neural Networks (CNN). However, these methods often face input length limits and high computational cost when handling long medical texts. In addition, typical CNN-based text classifiers extract only nearby contextual features and ignore longer-range relationships in the text.
To address these issues, Heterogeneous Graph Convolutional Networks (HGCNs) have been proposed in recent years as a way to model broader relationships within a text. However, applying GCNs to practical problems such as text classification remains challenging. Against this backdrop, the paper proposes a new hybrid heterogeneous graph convolutional network method, EHR-HGCN. It combines lexical and contextual embeddings with structured sentence-level and word-level relational information to achieve more efficient text classification.
Introduction to the Paper
The paper is co-authored by Guishen Wang, Xiaoxue Lou, Fang Guo, Devin Kwok, and Chen Cao, who are affiliated with Changchun University of Technology, Nanjing Medical University, and McGill University in Canada. It was published in the IEEE Journal of Biomedical and Health Informatics, volume 28, issue 3, in March 2024.
Research Details
Research Workflow
The EHR-HGCN method consists of three main parts: word embedding, heterogeneous graph construction, and heterogeneous graph classification.
1. Word Embedding
First, the paper uses GloVe to generate initial word embeddings. The GloVe model combines global matrix factorization with a local context window framework. A Bidirectional Recurrent Neural Network (BiRNN) then takes these word embeddings as input to capture contextual information and compute sentence embeddings.
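$$L = \sum_{i,j} f(x_{ij})\left(v_i^{\top} v_j - \log x_{ij}\right)^2 \tag{1}$$
$$f(x) = \begin{cases} (x / x_{\max})^{0.75}, & \text{if } x < x_{\max} \\ 1, & \text{if } x \ge x_{\max} \end{cases} \tag{2}$$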
Formula (1) is the GloVe training objective and formula (2) is its weighting function. Together they yield an initial embedding for each word in the text, which the BiRNN then enriches with context.
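To make formulas (1) and (2) concrete, the following is a minimal sketch of the weighting function and the loss in plain Python. The function names, the dictionary-based co-occurrence format, and the default `x_max = 100` (the value used in the original GloVe paper, not one stated here) are assumptions for illustration; it evaluates the loss for fixed vectors and does not train them.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x) from formula (2). x_max is an assumed default."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(cooc, vectors, x_max=100.0):
    """Weighted least-squares loss from formula (1).

    cooc:    dict mapping (i, j) word-index pairs to co-occurrence counts x_ij
    vectors: list of embedding vectors (lists of floats), one per word
    """
    total = 0.0
    for (i, j), x_ij in cooc.items():
        if x_ij <= 0:
            continue  # log(0) is undefined, so only observed pairs contribute
        dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
        total += glove_weight(x_ij, x_max) * (dot - math.log(x_ij)) ** 2
    return total

# Toy usage: three words with 2-dimensional vectors (made-up numbers).
toy_vectors = [[0.5, 0.1], [0.4, 0.2], [-0.3, 0.9]]
toy_cooc = {(0, 1): 12.0, (0, 2): 1.0, (1, 2): 3.0}
print(glove_loss(toy_cooc, toy_vectors))
```

Rare pairs get a small weight and very frequent pairs are capped at 1, so neither dominates the sum.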
2. Heterogeneous Graph Construction
After the contextual word embeddings are obtained, cosine similarity is used to capture long-range relationships between words: if the cosine similarity of two words exceeds a preset threshold, an edge is created between their word nodes. In addition, if a word appears in a sentence, an edge is created between that word and the sentence. Each heterogeneous graph therefore has sentences and words as nodes, and sentence-word and word-word links as edges. Each document is thus converted into a heterogeneous graph, which turns the text classification problem into a graph classification problem.
Fig 3 illustrates this transformation: the words and sentences from the previous step become the nodes of one heterogeneous graph per document, connected by the two edge types.
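The construction rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names, the dictionary layout of the returned graph, and the threshold value `0.5` are all assumptions, since the text only says the threshold is preset.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def build_hetero_graph(sentences, word_vecs, threshold=0.5):
    """Build one heterogeneous graph for a document.

    sentences: list of token lists, one per sentence
    word_vecs: dict mapping each word to its embedding vector
    threshold: assumed value; the source only says it is preset
    """
    words = sorted({w for s in sentences for w in s if w in word_vecs})
    # Word-word edge when the cosine similarity exceeds the threshold.
    word_word = {(a, b) for a, b in combinations(words, 2)
                 if cosine(word_vecs[a], word_vecs[b]) > threshold}
    # Sentence-word edge when the word appears in the sentence.
    sentence_word = {(i, w) for i, s in enumerate(sentences)
                     for w in s if w in word_vecs}
    return {
        "word_nodes": words,
        "sentence_nodes": list(range(len(sentences))),
        "word_word": word_word,
        "sentence_word": sentence_word,
    }

# Toy usage with made-up 2-dimensional embeddings.
doc = [["fever", "cough"], ["pain"]]
vecs = {"fever": [1.0, 0.0], "cough": [0.9, 0.1], "pain": [0.0, 1.0]}
graph = build_hetero_graph(doc, vecs)
print(graph["word_word"], graph["sentence_word"])
```

Because word-word edges depend only on embedding similarity, two related words can be linked even when they appear far apart in the document.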
3. Heterogeneous Graph Classification
The third step uses the Heterogeneous Graph Convolutional Network (HGCN) for graph-level classification. HGCN mainly consists of aggregation operations and heterogeneous graph convolution operations. These operations turn each document graph into a graph representation, and a fully connected layer then outputs the prediction.
$$\hat{h}_i^{l_k} = f\left(\sum_{j} a_{ij}\, \hat{h}_j^{l_{k-1}} \otimes \theta_{ijk}\right)$$
As shown in formulas (5) and (6), HGCN applies a convolution operation for each type of edge in the graph, feeds the results into the graph embedding layer to obtain the graph representation, and finally produces the classification result through the fully connected layer.
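One way to read the layer update is sketched below in plain Python. Several choices here are assumptions rather than details from the paper: the product with θ is interpreted as a matrix multiplication with one weight matrix per edge type, f is taken to be ReLU, and the graph embedding is a simple mean over node vectors. The final fully connected layer is omitted.

```python
def matvec(vec, mat):
    """Row vector (length d_in) times matrix (d_in x d_out)."""
    return [sum(vec[k] * mat[k][c] for k in range(len(vec)))
            for c in range(len(mat[0]))]

def relu(vec):
    return [max(0.0, x) for x in vec]

def hetero_conv(features, edges_by_type, weights_by_type):
    """One heterogeneous convolution layer (assumed form).

    features:        dict node -> feature vector
    edges_by_type:   dict edge type -> list of (i, j, a_ij), j sends to i
    weights_by_type: dict edge type -> d_in x d_out weight matrix
    """
    d_out = len(next(iter(weights_by_type.values()))[0])
    agg = {node: [0.0] * d_out for node in features}
    for etype, edges in edges_by_type.items():
        w = weights_by_type[etype]  # separate parameters per edge type
        for i, j, a_ij in edges:
            msg = matvec(features[j], w)
            agg[i] = [s + a_ij * m for s, m in zip(agg[i], msg)]
    return {node: relu(vec) for node, vec in agg.items()}

def mean_readout(features):
    """Graph embedding as the mean of all node vectors (an assumption)."""
    vecs = list(features.values())
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# Toy usage: two word nodes, one sentence node, identity weights.
feats = {"w1": [1.0, 0.0], "w2": [0.0, 1.0], "s1": [1.0, 1.0]}
edges = {
    "word_word": [("w1", "w2", 1.0), ("w2", "w1", 1.0)],
    "sentence_word": [("s1", "w1", 0.5), ("s1", "w2", 0.5)],
}
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = {"word_word": identity, "sentence_word": identity}
hidden = hetero_conv(feats, edges, weights)
print(mean_readout(hidden))
```

Keeping a separate weight matrix per edge type is what lets the layer treat sentence-word and word-word links differently.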
Main Results
To verify the effectiveness of the proposed method, experiments were conducted on several standard benchmark datasets as well as an EHR application benchmark. The standard benchmarks were 20 Newsgroups, R8, R52, Ohsumed, and Movie Review. The results showed that EHR-HGCN outperformed the traditional deep learning and GCN methods it was compared with in both accuracy and F1-score.
Example Results
For example, on the 20 Newsgroups dataset, the EHR-HGCN method outperformed the second-best method TextGCN in accuracy and F1-score by 1.65% and 4.28%, respectively. On the Ohsumed dataset, only the EHR-HGCN method’s accuracy exceeded 50%, reaching 52.3%.
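For readers unfamiliar with the two reported metrics, here is a minimal sketch of how accuracy and F1-score can be computed from predicted labels. The summary does not say which F1 averaging the paper uses, so the macro average below is an assumption, and the labels are made-up toy values rather than results from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy usage with made-up labels.
truth = [0, 0, 1, 1]
preds = [0, 1, 1, 1]
print(accuracy(truth, preds), macro_f1(truth, preds))
```

Macro F1 weights every class equally, so it can differ noticeably from accuracy on datasets with imbalanced classes.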
Conclusion and Value
The EHR-HGCN method proposed in the paper performed strongly in text classification, especially on EHR data. It improves classification performance by combining contextual information with the structural relationships of the text through a heterogeneous graph convolutional network. The method is valuable for research and also offers a new direction and a practical technique for real-world EHR processing.
Research Highlights
- Innovation: A novel method that combines GloVe, BiRNN, and heterogeneous graph models.
- Performance Improvement: Superior performance on multiple benchmark datasets, especially the EHR dataset.
- Comprehensive Structure: Modeling of complex relationships between words and sentences, leading to more efficient text classification.
Overall, the EHR-HGCN method provides a powerful tool for text classification, especially for electronic health records, and illustrates the potential of integrating big data and artificial intelligence in the medical field. Future research may test and optimize the method on larger datasets to improve its performance in real-world applications.