StrokeClassifier: Ischemic Stroke Etiology Classification by Ensemble Consensus Modeling Using Electronic Health Records

StrokeClassifier: An AI Tool for Etiological Classification of Ischemic Stroke Based on Electronic Health Records

Project Background and Motivation

Identifying the etiology of strokes, particularly acute ischemic stroke (AIS), is crucial for secondary prevention, but it is often very challenging. In the United States, there are nearly 676,000 new cases of ischemic stroke every year, with a quarter of these patients having a history of stroke. This condition has a high recurrence rate and can even lead to death or further disability. The etiology of ischemic stroke can vary widely, including large artery atherosclerosis, cardioembolism, small vessel disease, and other rare causes. However, approximately 20-30% of ischemic stroke patients in the U.S. have an undetermined etiology after evaluation, classified as cryptogenic stroke. These patients have a particularly high risk of recurrent stroke. Therefore, accurately identifying the cause of cryptogenic stroke is essential for optimizing treatment plans and improving patient prognosis. Making an accurate diagnosis requires integrating a vast amount of data, including clinical history, physical examination results, laboratory tests, cardiac rhythm monitoring, and imaging studies. Due to a scarcity of neurovascular specialists, diagnostic capacity may be limited. The authors of this paper aim to develop an automated, AI-based tool using electronic health records (EHR) to classify the etiology of stroke, thereby enhancing diagnostic accuracy and consistency.

Paper and Author Information

The research paper is titled “StrokeClassifier: Ischemic Stroke Etiology Classification by Ensemble Consensus Modeling Using Electronic Health Records” and was written by Ho-Joon Lee, Lee H. Schwamm, Lauren H. Sansing, Hooman Kamel, Adam De Havenon, Ashby C. Turner, Kevin N. Sheth, Smita Krishnaswamy, Cynthia Brandt, Hongyu Zhao, Harlan Krumholz, and Richa Sharma. The article is published in the journal npj Digital Medicine and is part of a special issue in collaboration with Seoul National University Bundang Hospital.

Research Process and Methods

Study Subjects and Data Sources

The study used EHR text data from 2,039 non-cryptogenic AIS patients at two academic hospitals to develop and validate an automated classification tool named StrokeClassifier. An additional 406 discharge summaries from the MIMIC-III dataset, adjudicated by a vascular neurologist, were used for external validation. Natural language processing (NLP) techniques extracted features from the discharge summary texts, and these features were used to build an ensemble consensus model composed of nine machine-learning classifiers. Measured against the diagnoses of vascular neurologists, StrokeClassifier achieved a mean cross-validation accuracy of 0.74 and a weighted F1 score of 0.74 on the multiclass classification task. On the MIMIC-III dataset, the accuracy and weighted F1 score were 0.70 and 0.71, respectively. The five most important features were atrial fibrillation, age, middle cerebral artery occlusion, internal carotid artery occlusion, and frontal lobe stroke location. Using a deterministic heuristic, 788 cryptogenic stroke patients were then classified, reducing the cryptogenic diagnosis rate from 25.2% to 7.2%.

Algorithm and Model Development

The study sample consisted of 3,262 discharge summaries with an AIS diagnosis from YNHH, MGH, and BIDMC. Extracted features included medical history, imaging, cardiac, and laboratory data, as well as UMLS Concept Unique Identifiers (CUIs). The 2,039 non-cryptogenic stroke samples were preprocessed, with MetaMap output used as input for model development. Missing feature values were imputed with MICE in the derivation cohort and with random forest-based imputation in the external validation cohort. Principal component analysis (PCA) was used for dimensionality reduction, and the top ten features for each class were selected for further model development. Several machine-learning algorithms, including logistic regression, support vector machines, random forests, and XGBoost, were trained with hyperparameter optimization. Finally, the base models were combined into an ensemble consensus classification model, “StrokeClassifier.”
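The idea of an ensemble consensus can be illustrated with a minimal sketch. This summary does not state the exact rule StrokeClassifier uses to combine its nine base classifiers, so the example below assumes simple soft voting (averaging per-class probabilities and taking the highest). The class labels, the function name consensus_predict, and the probabilities are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical TOAST-style class labels, used only for illustration.
CLASSES = ["cardioembolism", "large_artery", "small_vessel", "other"]


def consensus_predict(base_probabilities):
    """Average per-class probabilities from several base classifiers.

    Assumption: soft voting. The paper's actual consensus rule may differ.
    Returns (winning_label, averaged_probabilities).
    """
    if not base_probabilities:
        raise ValueError("need at least one base classifier output")
    totals = defaultdict(float)
    for probs in base_probabilities:
        for label in CLASSES:
            totals[label] += probs.get(label, 0.0)
    n = len(base_probabilities)
    averaged = {label: totals[label] / n for label in CLASSES}
    # On ties, max() keeps the first label in CLASSES order.
    winner = max(CLASSES, key=lambda label: averaged[label])
    return winner, averaged


# Made-up outputs from three base models for one discharge summary.
example = [
    {"cardioembolism": 0.6, "large_artery": 0.2, "small_vessel": 0.1, "other": 0.1},
    {"cardioembolism": 0.3, "large_artery": 0.5, "small_vessel": 0.1, "other": 0.1},
    {"cardioembolism": 0.5, "large_artery": 0.3, "small_vessel": 0.1, "other": 0.1},
]
label, averaged = consensus_predict(example)
print(label, round(averaged[label], 3))
```

In this toy case the second model disagrees with the other two, but the averaged probabilities still favor cardioembolism, which is the kind of smoothing an ensemble is meant to provide.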

Research Results

StrokeClassifier was evaluated on the multiclass classification task and compared with its individual base models. The optimized logistic regression (LR), support vector machine (SVC), XGBoost (XGB), and random forest (RF) models all reached high cross-validation accuracy and F1 scores. In external validation, StrokeClassifier showed higher predictive performance than the individual base models. By reducing bias to some degree, the ensemble improved the robustness and generalizability of the output. StrokeClassifier achieved a mean accuracy of 0.744, a balanced accuracy of 0.710, and a weighted F1 score of 0.740 in predicting the etiology of non-cryptogenic strokes, comparable to the diagnoses provided by neurologists.
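The three reported metrics differ in how they treat class imbalance, which matters when some stroke etiologies are rare. The sketch below computes them from scratch using their standard definitions: accuracy is the fraction of correct predictions, balanced accuracy is the mean per-class recall, and weighted F1 is the per-class F1 averaged with weights proportional to class frequency. The function name and the toy labels are hypothetical and not taken from the paper.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return (accuracy, balanced accuracy, support-weighted F1)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    support = Counter(y_true)      # true count per class
    predicted = Counter(y_pred)    # predicted count per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(true_pos.values()) / len(y_true)

    recalls, weighted_f1 = [], 0.0
    for label, n_true in support.items():
        tp = true_pos[label]
        recall = tp / n_true
        precision = tp / predicted[label] if predicted[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        # Weight each class's F1 by its share of the true labels.
        weighted_f1 += f1 * n_true / len(y_true)

    balanced_accuracy = sum(recalls) / len(recalls)
    return accuracy, balanced_accuracy, weighted_f1


# Toy labels: CE = cardioembolism, LAA = large artery, SVD = small vessel.
y_true = ["CE", "CE", "CE", "LAA", "LAA", "SVD"]
y_pred = ["CE", "CE", "LAA", "LAA", "LAA", "CE"]
acc, bal_acc, w_f1 = classification_metrics(y_true, y_pred)
print(round(acc, 3), round(bal_acc, 3), round(w_f1, 3))
```

In the toy data the single SVD case is missed entirely, so balanced accuracy falls below plain accuracy. A gap in the same direction appears in the reported figures (0.744 versus 0.710), which is the pattern expected when less frequent classes are harder to predict.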

Performance and generalizability of the model were further assessed through 300 repeated multilevel cross-validations. Subgroup analyses by age, gender, and race showed that StrokeClassifier performed effectively across these groups.
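Repeated cross-validation can be sketched as follows. This summary does not define the "multilevel" scheme, so the example assumes a common variant: stratified k-fold splitting repeated with fresh shuffles, so that every fold keeps roughly the same class mix. The function name, the fold and repeat counts, and the labels are illustrative assumptions, not details from the paper.

```python
import random
from collections import defaultdict


def repeated_stratified_folds(labels, n_splits=5, n_repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) tuples.

    Assumption: repeated stratified k-fold. Within each repeat, every class
    is shuffled and dealt round-robin across the folds.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            for position, idx in enumerate(shuffled):
                folds[position % n_splits].append(idx)
        for fold, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            yield repeat, fold, train_idx, sorted(test_idx)


# Toy cohort: 10 CE, 5 LAA, and 5 SVD cases.
labels = ["CE"] * 10 + ["LAA"] * 5 + ["SVD"] * 5
splits = list(repeated_stratified_folds(labels, n_splits=5, n_repeats=2))
print(len(splits), "train/test splits")
```

A model would be fit on each train_idx and scored on the matching test_idx, and the scores averaged over all splits give the kind of mean cross-validation figures reported above.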

For misclassified samples, the authors analyzed feature frequencies across the nine base models to understand where the errors arose. For cryptogenic stroke patients, a deterministic heuristic was applied to assign specific TOAST etiologies. Together with the multiclass and external validation results, these analyses support the tool’s performance.

Research Conclusions and Significance

This study demonstrated that StrokeClassifier, an automated machine-learning tool, classifies the etiology of ischemic stroke efficiently and accurately, at a level comparable to that of vascular neurologists. With further training and clinical application, StrokeClassifier could serve as a clinical decision support system, improving the accuracy of stroke etiology diagnosis and facilitating the timely implementation of etiology-specific treatment plans. It could also support clinical research and public health initiatives.

StrokeClassifier classifies stroke etiology automatically and in real time from EHR text data, which gives it broad potential for application. It could be applied to complex data analysis tasks within health systems, especially in settings that lack specialist expertise, improving diagnostic consistency and accuracy while promoting standardized and efficient diagnostic processes.

Through detailed data analysis and model optimization, this research provides a foundation for future studies and for applications in clinical and epidemiological fields, contributing to progress in stroke prevention and treatment.