Quantitative Assessment of Large Vision-Language Models in Radiology: GPT-4V’s Multimodal and Multi-Anatomic Region Capabilities

Academic Background

In recent years, large language models (LLMs) such as OpenAI’s ChatGPT have made significant progress in text generation. Built on the Transformer architecture and trained on massive amounts of text, these models can produce plausible output with few or no task-specific examples (few-shot and zero-shot learning). LLMs are increasingly applied in medicine, for example to convert free-text radiology reports into standardized templates and to extract data on lung cancer from CT reports. They have also demonstrated a certain level of “knowledge” in radiology board-style examinations and have been shown to help simplify radiology reports.

With the introduction of GPT-4V (GPT-4 with Vision), models can now process both text and image inputs. These large vision-language models (LVLMs) have the potential to become foundation models that are widely applicable to various tasks. Although some studies have shown promising performance of GPT-4V in generating radiology reports from single medical images, they have also highlighted the model’s limitations in interpreting radiological images. Moreover, the widespread availability of these models brings potential risks, especially from unauthorized applications.

Given the potential and risks of GPT-4V, a thorough analysis is crucial. However, peer-reviewed literature on GPT-4V remains scarce. Therefore, this study aims to quantitatively assess the performance of GPT-4V in interpreting radiological images, particularly its accuracy with unseen data.

Source of the Paper

This paper was authored by Quirin D. Strotzer, Felix Nieberle, Laura S. Kupke, and others, from the Institute of Radiology at the University Medical Center Regensburg, the Division of Neuroradiology at Massachusetts General Hospital, Harvard Medical School, and other institutions. The paper was published in November 2024 in the journal Radiology.

Research Process

Data Acquisition

This retrospective study included single representative abnormal and healthy control images from neuroradiology, cardiothoracic radiology, and musculoskeletal radiology (CT, MRI, radiography). Reports were generated with GPT-4V via the OpenAI API. Two aspects were assessed: the factual correctness of the free-text reports and the model’s performance in detecting abnormalities in binary classification tasks. GPT-4V’s performance was compared with that of a first-year non-radiologist in training and four board-certified radiologists.

Experimental Methods

The study selected common pathological findings and imaging modalities, including neuroradiology (ischemic stroke, brain hemorrhage, brain tumor, multiple sclerosis), cardiothoracic radiology (pneumothorax, pulmonary embolism, pneumonia, lung cancer), and musculoskeletal radiology (fracture). Each category included at least 25 images. The images were retrieved from the radiology information systems of the hospitals, and each finding was manually confirmed using all available information (including scan reports, follow-up imaging, and medical records).

Task Design

  1. Free-text Report Generation: Given an image, the model was prompted to generate a radiology report, including the imaging modality, anatomic region, main acute pathological finding and its location, the most likely diagnosis, and the top five differential diagnoses. The correctness of the reports was assessed through manual annotation.
  2. Consistency Test: For a random selection of 25 images, three reports were generated per image, and the variability of the model’s output was evaluated.
  3. Classification Task: A binary classification task was set up to compare the model’s performance with that of human readers in detecting abnormalities. The model was prompted to answer “yes” or “no” and provide a short description.
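
One way to quantify the consistency test is the share of images whose three repeated reports received the same correctness rating. The sketch below assumes that measure and a hypothetical data layout (one list of binary ratings per image). The source does not state which variability metric the authors actually used, and the example ratings are made up for illustration.

```python
from typing import Sequence


def unanimous_fraction(ratings_per_image: Sequence[Sequence[bool]]) -> float:
    """Share of images whose repeated reports all received the same rating.

    Assumption: each inner sequence holds the binary correctness ratings
    (True = report judged correct) of the repeated reports for one image.
    """
    if not ratings_per_image:
        raise ValueError("need at least one image")
    unanimous = sum(
        1 for ratings in ratings_per_image if len(set(ratings)) == 1
    )
    return unanimous / len(ratings_per_image)


# Hypothetical ratings for four images, three reports each (not study data).
example = [
    [True, True, True],
    [True, False, True],
    [False, False, False],
    [False, True, True],
]
print(unanimous_fraction(example))  # 0.5
```

A value of 1.0 would mean that repeated prompts never changed the rating of any image; lower values indicate more variable output.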

Statistical Analysis

All ratings were binary, and accuracy, sensitivity, and specificity were calculated for the free-text reports. Agreement among human readers was evaluated using the Cohen κ statistic.
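
The metrics named above follow standard definitions, which a minimal sketch can make concrete. The function names and the toy labels are hypothetical and are not taken from the study; `cohen_kappa` covers only the two-rater, binary case.

```python
from typing import Dict, Sequence


def binary_metrics(truth: Sequence[bool], pred: Sequence[bool]) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity for binary labels (True = abnormal)."""
    if len(truth) != len(pred) or len(truth) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(truth, pred))
    tp = sum(1 for t, p in pairs if t and p)          # abnormal, flagged
    tn = sum(1 for t, p in pairs if not t and not p)  # normal, not flagged
    fp = sum(1 for t, p in pairs if not t and p)      # normal, flagged
    fn = sum(1 for t, p in pairs if t and not p)      # abnormal, missed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


def cohen_kappa(rater_a: Sequence[bool], rater_b: Sequence[bool]) -> float:
    """Cohen's kappa for two raters giving binary ratings to the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if bool(a) == bool(b)) / n
    p_a = sum(1 for a in rater_a if a) / n  # share of "yes" from rater A
    p_b = sum(1 for b in rater_b if b) / n  # share of "yes" from rater B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    if expected == 1.0:
        return float("nan")  # undefined when chance agreement is already 100%
    return (observed - expected) / (1 - expected)


# Toy labels for eight images (hypothetical, not study data).
truth = [True, True, True, True, False, False, False, False]
pred = [True, True, True, False, True, True, True, False]
print(binary_metrics(truth, pred))
```

In this toy example the reader flags six of eight images, which yields a sensitivity of 0.75 but a specificity of only 0.25.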

Main Results

General Results

The study included 515 images from 470 patients (median age 61 years). GPT-4V correctly identified the imaging modality in all images, and the anatomic region was correctly identified in 99.2% of cases. In free-text reports, diagnostic accuracy varied depending on the pathological finding and imaging modality, ranging from 0% for pneumothorax to 90% for brain tumors. In binary classification tasks, GPT-4V showed sensitivities between 56% and 100% and specificities between 8% and 52%, indicating a clear tendency to overdiagnose.
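
High sensitivity combined with low specificity means that most images, abnormal or not, are flagged as abnormal. The sketch below makes that arithmetic explicit. The 50% prevalence and the pairing of a sensitivity with a specificity are illustrative assumptions; the source reports only the ranges, not which values occurred together.

```python
def flagged_fraction(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Fraction of all images that a reader calls abnormal."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives + false_positives


def expected_accuracy(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Overall accuracy implied by sensitivity, specificity, and prevalence."""
    return sensitivity * prevalence + specificity * (1 - prevalence)


# Hypothetical operating point: the upper end of the reported sensitivity
# range paired with the lower end of the reported specificity range,
# on an assumed balanced test set (prevalence 0.5).
print(round(flagged_fraction(1.00, 0.08, 0.5), 2))   # 0.96
print(round(expected_accuracy(1.00, 0.08, 0.5), 2))  # 0.54
```

Under these assumptions nearly every image is called abnormal, yet accuracy stays close to 50%, which is what a tendency to overdiagnose looks like in numbers.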

Free-text Report Results

The model performed well in identifying the imaging modality and anatomic region but struggled to identify the main pathological findings and diagnoses. For example, it did not correctly identify a single pneumothorax case but performed well in diagnosing brain tumors. The model also had difficulty recognizing normal images, particularly CT scans.

Classification Task Results

GPT-4V performed poorly in binary classification tasks, with an overall accuracy of 55.3%, only slightly above chance. Human readers, by contrast, clearly outperformed the model and reached near-perfect agreement across all tasks.

Conclusion

In its earliest version, GPT-4V reliably identified the imaging modality and anatomic region from single radiological images but failed to detect, classify, or rule out abnormalities. Although the model-generated reports sounded convincing, the model’s reliability in medical image interpretation remains limited. Nevertheless, large vision-language models show potential as foundation models in radiology. Future research should focus on further optimizing the model, particularly for handling three-dimensional medical data and through fine-tuning for specific domains.

Research Highlights

  1. Innovation: This study is among the first peer-reviewed quantitative assessments of GPT-4V’s performance in interpreting radiological images, addressing a gap in the field.
  2. Comprehensiveness: The study covers multiple anatomic regions and imaging modalities, providing a comprehensive performance evaluation.
  3. Practicality: The findings have important implications for the development of future medical image analysis models, particularly in model optimization and clinical applications.

Research Significance

This study provides important insights into the application of large vision-language models in radiology. Although GPT-4V performs well in identifying imaging modalities and anatomic regions, its limitations in detecting and diagnosing pathologies indicate that further optimization is needed. Future research should focus on improving the model’s performance on complex and rare abnormalities and on exploring its practical use in clinical settings.