Evaluation of Multimodal Large Language Models' Accuracy in Interpreting Radiologic Images

Performance of Large Language Models in Radiology Image Interpretation: A Comparative Study with Human Readers

Academic Background

In recent years, large language models (LLMs) have demonstrated powerful capabilities in various fields, particularly in natural language processing. With the development of multimodal LLMs, these models can now handle not only text but also image, audio, and video inputs. Representative multimodal LLMs include OpenAI’s GPT-4 Turbo with Vision (GPT-4V), Google DeepMind’s Gemini 1.5 Pro, and Anthropic’s Claude 3. These models are increasingly being applied in radiology, especially for generating and structuring radiology reports. However, despite their strong performance with text inputs, the ability of LLMs to interpret radiology images remains in question. Previous studies have shown that LLMs perform significantly worse than board-certified radiologists in diagnostic tasks based on patient history and radiology images. Therefore, this study aims to evaluate the accuracy of LLMs in interpreting radiology images, compare it with that of human readers of varying experience levels, and explore factors affecting LLM accuracy.

Source of the Paper

This study was conducted by scholars from the Department of Radiology at Yonsei University College of Medicine, the Department of Radiology at Asan Medical Center in Seoul, and several other research institutions in South Korea. The main authors include Pae Sun Suh, Woo Hyun Shim, Chong Hyun Suh, and others. The study was published in December 2024 in the journal Radiology under the title "Comparing Large Language Model and Human Reader Accuracy with New England Journal of Medicine Image Challenge Case Image Inputs".

Research Process and Results

Research Process

This study retrospectively analyzed cases published in the New England Journal of Medicine (NEJM) Image Challenge section between October 13, 2005, and April 18, 2024. A total of 964 cases were screened, and 272 cases containing radiology images were ultimately included. These cases covered various subspecialties, including neuroradiology, gastrointestinal radiology, chest radiology, musculoskeletal radiology, pediatric radiology, cardiovascular radiology, and genitourinary radiology. The study used four vision-capable LLMs (GPT-4V, GPT-4 Omni, Gemini 1.5 Pro, and Claude 3) to answer these cases and compared their accuracy with that of 11 human readers (including seven junior radiologists, two clinicians, one radiology resident, and one medical student).

Experimental Results

The results showed that GPT-4 Omni performed the best among the LLMs, with an overall accuracy of 59.6% (162/272), significantly higher than that of the medical student (47.1%; 128/272; p < 0.001) but lower than that of the junior radiologists (80.9%; 220/272; p < 0.001) and the radiology resident (70.2%; 191/272; p = 0.003). The accuracy of LLMs was not affected by image inputs but significantly improved with long text inputs (p < 0.001). The accuracy of human readers, however, was unaffected by text length.
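The reported percentages can be reproduced from the correct-answer counts out of 272 cases. The following is a minimal sketch, not the authors' analysis code: the names `CORRECT`, `accuracy_pct`, and `wilson_interval` are hypothetical, and the Wilson score interval is added here only as one common way to express the uncertainty around a proportion. The summary does not state which interval or test the study used.

```python
from math import sqrt

TOTAL_CASES = 272  # cases with radiology images included in the study

# Correct-answer counts as reported in the text above.
CORRECT = {
    "GPT-4 Omni": 162,
    "Medical student": 128,
    "Junior radiologists": 220,
    "Radiology resident": 191,
}


def accuracy_pct(n_correct, n_total=TOTAL_CASES):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * n_correct / n_total, 1)


def wilson_interval(n_correct, n_total=TOTAL_CASES, z=1.96):
    """Wilson score interval for a proportion (illustrative assumption,
    not necessarily the method used in the paper)."""
    p = n_correct / n_total
    denom = 1 + z**2 / n_total
    centre = (p + z**2 / (2 * n_total)) / denom
    half = z * sqrt(p * (1 - p) / n_total + z**2 / (4 * n_total**2)) / denom
    return centre - half, centre + half


for reader, n in CORRECT.items():
    low, high = wilson_interval(n)
    print(f"{reader}: {accuracy_pct(n)}% "
          f"(approx. 95% interval {100 * low:.1f} to {100 * high:.1f})")
```

The intervals are approximate and treat each reader's answers as independent trials, which is a simplification.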

In the subspecialty analysis, junior radiologists outperformed LLMs in most subspecialties, particularly in neuroradiology, gastrointestinal radiology, and musculoskeletal radiology. However, in pediatric radiology, GPT-4 Omni’s accuracy (88%; 22/25) was slightly higher than that of the junior radiologists (76%; 19/25), though the difference was not significant.
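To illustrate why a gap of 88% versus 76% on only 25 pediatric cases may fail to reach significance, the sketch below runs a two-sided Fisher exact test on the two counts. This is an assumption made for illustration: it treats the two groups as independent samples, whereas in the study all readers answered the same cases, so a paired test would be more appropriate, and the summary does not state which test the authors applied. The function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a_correct, a_total, b_correct, b_total):
    """Two-sided Fisher exact p-value for two independent proportions.

    Sums the probabilities of all tables (with the same margins) that are
    no more likely than the observed one. Integer weights are used so the
    comparison is exact.
    """
    correct_total = a_correct + b_correct
    n = a_total + b_total

    def weight(k):
        # Unnormalised hypergeometric weight for k correct answers in group A.
        return comb(a_total, k) * comb(b_total, correct_total - k)

    k_min = max(0, correct_total - b_total)
    k_max = min(a_total, correct_total)
    observed = weight(a_correct)
    extreme = sum(weight(k) for k in range(k_min, k_max + 1)
                  if weight(k) <= observed)
    return extreme / comb(n, correct_total)


# Pediatric radiology: GPT-4 Omni 22/25 vs junior radiologists 19/25.
p_value = fisher_exact_two_sided(22, 25, 19, 25)
print(f"p = {p_value:.2f}")
```

Under these simplifying assumptions the p-value is well above 0.05, consistent with the reported lack of a significant difference in this small subgroup.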

In terms of imaging modalities, LLMs performed better with MRI scans than with CT or X-ray modalities. GPT-4 Omni’s accuracy with MRI scans was comparable to that of the junior radiologists, but in X-ray and CT modalities, the junior radiologists significantly outperformed the LLMs.

Conclusion

The study indicates that LLMs achieve a degree of accuracy in interpreting radiology images from combined text and image inputs, and that their accuracy improves significantly with long text inputs. However, LLMs’ accuracy remains lower than that of experienced radiologists, particularly with short text inputs. Additionally, while LLMs show high accuracy in providing image information (such as imaging modality, plane, anatomic location, and contrast usage), their ability in visual assessment and image interpretation remains uncertain.

Research Highlights

  1. Performance of LLMs in Radiology Image Interpretation: GPT-4 Omni performed the best among LLMs, but its accuracy was still lower than that of experienced radiologists.
  2. Impact of Text Length on LLM Accuracy: LLMs’ accuracy significantly improves with long text inputs, indicating their reliance on the richness of textual information.
  3. Impact of Imaging Modalities: LLMs performed better with MRI scans than with CT and X-ray modalities, suggesting their potential in interpreting complex images.
  4. Accuracy of LLMs in Providing Image Information: LLMs showed high accuracy in providing image information (such as imaging modality, plane, anatomic location, and contrast usage), but their ability in visual assessment and image interpretation remains uncertain.

Significance and Value of the Study

This study provides important insights into the application of LLMs in radiology. Although LLMs achieve a degree of accuracy in interpreting radiology images from combined text and image inputs, their ability in visual assessment and image interpretation remains limited. Therefore, LLMs are unlikely to fully replace radiologists in the near future. However, with further technological advancements, LLMs have the potential to play an auxiliary role in radiology diagnosis, especially in handling large amounts of textual information and complex images.

Other Valuable Information

The study also explored the performance of LLMs in providing image information, finding that GPT-4 Omni significantly outperformed other LLMs in generating MRI sequence information. Additionally, the study noted that LLMs’ performance in answering multiple-choice questions may be overestimated, as radiologists in clinical practice typically do not rely on multiple-choice options for diagnostic decisions.

This study provides important empirical data on the application of LLMs in radiology while also highlighting their limitations in real-world applications. Future research could further explore how to optimize LLMs’ performance in radiology image interpretation and evaluate their potential in real clinical settings.