The Utility of GPT-4 in Chest Radiograph Evaluation
The Potential of GPT-4 in Chest Radiograph Evaluation: A Retrospective Study
Academic Background
In recent years, artificial intelligence (AI) has been increasingly applied in the medical field, particularly in radiology. The introduction of AI tools is transforming clinical practice, especially in imaging diagnostics. However, the widespread adoption of AI tools faces numerous challenges, including insufficient funding, inefficient IT integration, and inadequate validation. Additionally, medical professionals, particularly radiologists, generally lack sufficient statistical knowledge, which further hinders their in-depth understanding and application of AI tools. As radiology research increasingly relies on data-driven techniques, radiologists need to develop the ability to critically evaluate statistical methods and their limitations.
Large language models (LLMs), such as OpenAI’s GPT-4, have gained recognition in radiology due to their ability to understand natural language, reason, and interpret complex information. GPT-4’s Advanced Data Analysis (ADA) extension enables it to analyze data, solve mathematical problems, create charts, and write and execute code. However, the potential of GPT-4 ADA in clinical and academic radiology has not yet been fully explored. This study aims to validate whether GPT-4 ADA can be used for various analytical tasks without specialized statistical and machine learning (ML) expertise, particularly in the evaluation of chest radiographs.
Source of the Paper
This paper was co-authored by Dr. Soroosh Tayebi Arasteh, Dr. Robert Siepmann, Dr. Marc Huppertz, MSc Mahshad Lotfinia, Dr. Behrus Puladi, Dr. Christiane Kuhl, Dr. Daniel Truhn, and Dr. Sven Nebelung. The authors are affiliated with the Department of Diagnostic and Interventional Radiology, the Department of Oral and Maxillofacial Surgery, and the Institute of Medical Informatics at the University Hospital RWTH Aachen, Germany. The paper was published in the journal Radiology in November 2024.
Research Process
Study Subjects and Data
This retrospective study used bedside chest radiograph reports, associated demographic data, and inflammatory laboratory markers from intensive care unit (ICU) patients examined between January 2009 and December 2019. The data were sourced from the local database of the University Hospital RWTH Aachen and comprised 193,566 bedside chest radiographs with their reports, together with laboratory values, from 45,016 patients. To simplify the analysis and avoid sampling bias, only the first available radiograph report per patient was included.
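The "first available report per patient" rule can be sketched in a few lines. This is a minimal illustration, not the authors' code: the record layout, patient IDs, dates, and report texts below are all hypothetical.

```python
from datetime import date

# Hypothetical report records: (patient_id, exam_date, report_text).
reports = [
    ("P001", date(2012, 3, 14), "opacity right lower zone"),
    ("P002", date(2015, 7, 2), "no opacity"),
    ("P001", date(2012, 3, 10), "no opacity"),
    ("P003", date(2018, 11, 30), "bilateral opacities"),
    ("P002", date(2015, 7, 5), "opacity left lower zone"),
]

def first_report_per_patient(records):
    """Keep only the earliest-dated record for each patient."""
    earliest = {}
    for patient_id, exam_date, text in records:
        current = earliest.get(patient_id)
        if current is None or exam_date < current[1]:
            earliest[patient_id] = (patient_id, exam_date, text)
    # Sort by patient ID so the result order is stable.
    return [earliest[pid] for pid in sorted(earliest)]

first_reports = first_report_per_patient(reports)
for row in first_reports:
    print(row)
```

Keeping one record per patient means each patient contributes exactly one observation, so patients with long ICU stays and many follow-up radiographs do not dominate the sample.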
Analysis Tasks
- Data Visualization: GPT-4 ADA was tasked with plotting the utilization rates of chest radiographs over the years and the distributions of laboratory values.
- Basic Statistical Analysis: GPT-4 ADA was asked to summarize and quantify the severity of pulmonary opacities as a function of age and sex.
- Advanced Statistical Analysis: GPT-4 ADA was tasked with quantifying the variables determining the occurrence of pulmonary opacities through binary logistic regression.
- Machine Learning Modeling: GPT-4 ADA was asked to build two advanced AI models to predict the presence of pulmonary opacities, one using all available variables and the other excluding the laboratory values (C-reactive protein (CRP), leukocyte count, and procalcitonin).
Validation Strategy
The research team validated the outputs of GPT-4 ADA through a multi-step process, including reproducibility assessment, methodological verification, coding quality evaluation, and re-execution of the code. Additionally, a head-to-head comparison was conducted between the models generated by GPT-4 ADA and manually developed reference models.
Key Results
Data Visualization
GPT-4 ADA successfully plotted the utilization rates of chest radiographs over the years and the distributions of laboratory values, meeting scientific standards in terms of visual presentation. However, GPT-4 ADA did not indicate trend lines or outliers in the charts, and there were inconsistencies in the style and color of the outputs.
Basic Statistical Analysis
GPT-4 ADA correctly summarized the severity of pulmonary opacities in relation to age and sex. However, it used central tendency measures instead of frequency counts for ordinal variables and did not differentiate between left and right pulmonary opacities.
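The difference between the two summary styles is easy to see on a small example. The sketch below is illustrative only: the four-level severity scale and the records are hypothetical and do not reproduce the study's grading scheme.

```python
from collections import Counter
import statistics

# Hypothetical ordinal severity scale, ordered from least to most severe.
SEVERITY_LEVELS = ["none", "mild", "moderate", "severe"]

# Hypothetical records: (sex, opacity severity).
records = [
    ("F", "none"), ("F", "mild"), ("F", "severe"), ("F", "none"),
    ("M", "moderate"), ("M", "mild"), ("M", "mild"), ("M", "severe"),
]

def severity_counts(rows):
    """Frequency table of severity grades, stratified by sex."""
    table = {}
    for sex, grade in rows:
        table.setdefault(sex, Counter())[grade] += 1
    return table

counts = severity_counts(records)
for sex in sorted(counts):
    # Counter returns 0 for grades that never occur in a stratum.
    print(sex, {level: counts[sex][level] for level in SEVERITY_LEVELS})

# For contrast: the mean of numeric codes hides the shape of the distribution.
codes = {level: i for i, level in enumerate(SEVERITY_LEVELS)}
mean_code_f = statistics.mean(codes[g] for s, g in records if s == "F")
print("mean severity code (F):", mean_code_f)
```

In this toy data the mean code for female patients equals the code for "mild", although only one of four patients has that grade; the frequency table shows the actual split.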
Advanced Statistical Analysis
GPT-4 ADA quantified the variables determining the occurrence of pulmonary opacities through binary logistic regression, providing coefficients and p-values for each variable. The test-retest reliability was excellent, but there were slight deviations from the manual reference results. GPT-4 ADA handled missing values using median imputation but encountered issues with categorical variables.
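As a rough illustration of the two steps named above, the following sketch applies median imputation and then fits a binary logistic regression by plain gradient descent. It is not the code GPT-4 ADA produced: the records are synthetic, the two predictors (age and CRP) and the learning-rate settings are assumptions, and a real analysis would use a statistics package that also reports standard errors and p-values.

```python
import math
import statistics

# Synthetic records: (age in years, CRP or None if missing, opacity 0/1).
rows = [
    (45, 2.0, 0), (52, None, 0), (60, 8.5, 1), (38, 1.2, 0), (71, 12.0, 1),
    (66, None, 1), (47, 3.1, 1), (63, 9.4, 0), (77, 15.3, 1), (41, 0.8, 0),
]

def impute_median(values):
    """Replace missing entries (None) with the median of observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def predict(weights, bias, x):
    z = bias + sum(w * xj for w, xj in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(weights, bias, X, y):
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict(weights, bias, xi)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(y)

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean log-loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        errors = [predict(weights, bias, xi) - yi for xi, yi in zip(X, y)]
        for j in range(len(weights)):
            grad = sum(e * xi[j] for e, xi in zip(errors, X)) / len(y)
            weights[j] -= lr * grad
        bias -= lr * sum(errors) / len(y)
    return weights, bias

age = standardize([r[0] for r in rows])
crp_imputed = impute_median([r[1] for r in rows])
crp = standardize(crp_imputed)
X = [list(pair) for pair in zip(age, crp)]
y = [r[2] for r in rows]

weights, bias = fit_logistic(X, y)
final_loss = log_loss(weights, bias, X, y)
# Odds ratio per one-standard-deviation increase in each predictor.
odds_ratios = [math.exp(w) for w in weights]
print("odds ratios (age, CRP):", [round(o, 2) for o in odds_ratios])
```

Median imputation pulls every missing value to the same point, which shrinks the variable's variance and can bias the coefficients. That is one reason imputation choices warrant statistical oversight.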
Machine Learning Modeling
GPT-4 ADA successfully built two predictive models, one using all available variables and the other excluding laboratory values. The AUC values were 0.76 and 0.75, respectively, and both models reached an accuracy of 72%. In the head-to-head comparison, the models generated by GPT-4 ADA performed similarly to the manually developed reference models in terms of AUC and accuracy but showed significant differences in sensitivity and specificity.
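Two models can agree on accuracy and still trade sensitivity against specificity differently. The sketch below computes all four metrics from scratch on made-up labels and scores for two hypothetical models; the 0.5 decision threshold is an assumption, and none of the numbers come from the study.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Made-up ground truth and predicted probabilities for two models.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.6, 0.4, 0.1, 0.2, 0.3, 0.55, 0.65]
scores_b = [0.9, 0.8, 0.7, 0.45, 0.3, 0.1, 0.2, 0.35, 0.4, 0.6]

THRESHOLD = 0.5  # assumed decision threshold
metrics_a = classification_metrics(y_true, [int(s >= THRESHOLD) for s in scores_a])
metrics_b = classification_metrics(y_true, [int(s >= THRESHOLD) for s in scores_b])
auc_a = auc(y_true, scores_a)
auc_b = auc(y_true, scores_b)

print("model A:", metrics_a, "AUC:", auc_a)
print("model B:", metrics_b, "AUC:", auc_b)
```

Here both models classify 7 of 10 cases correctly, but model A detects more true opacities while model B produces fewer false alarms. A comparison limited to AUC and accuracy would miss this kind of difference.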
Conclusion
This study demonstrates the potential of large language models (e.g., GPT-4 ADA) in complex data analysis in radiology, ranging from basic statistics to advanced machine learning modeling. Although GPT-4 ADA performed well in handling real-life clinical datasets, it faced challenges with statistical complexities such as data imputation, necessitating rigorous statistical oversight. LLMs should serve as a supplement to, rather than a replacement for, professional expertise.
Research Highlights
- Key Findings: GPT-4 ADA was able to autonomously perform complex data analysis tasks, including data visualization, statistical analysis, and machine learning modeling, with performance comparable to manually developed models.
- Methodological Innovation: This study is the first to validate the potential of GPT-4 ADA in radiology, particularly for users who lack specialized statistical and machine learning expertise.
- Application Value: The use of GPT-4 ADA can simplify complex data analysis workflows for radiologists, clinicians, and researchers, promoting patient-centered research strategies.
Limitations and Future Research
The limitations of this study include the inclusion of only the first radiograph report per patient, the unresolved impact of prompts on LLM performance, and the potential bias introduced by data imputation. Future research needs to further evaluate the generalizability, robustness, interpretability, workflow integration, and clinical impact of LLMs in radiology.