Empowering PET Imaging Reporting with Retrieval-Augmented Language Models and Reading Reports Database: A Pilot Study

The Application of Large Language Models in PET Imaging Reports: A Single-Center Pilot Study Combining Retrieval-Augmented Generation

With the rapid development of artificial intelligence, large language models (LLMs) have gained widespread attention for their zero-shot learning and natural language processing capabilities in the medical domain. Although LLMs have shown potential to improve efficiency and effectiveness in certain areas of medicine, their application in nuclear medicine, particularly in PET (Positron Emission Tomography) imaging reports, remains in its infancy. This study, conducted by Dr. Hongyoon Choi and his team from Seoul National University Hospital and Seoul National University College of Medicine, was published in the European Journal of Nuclear Medicine and Molecular Imaging.

Study Background and Problem Statement

PET imaging is widely applied across various medical fields, playing a crucial role in disease diagnosis, staging, and therapeutic evaluation. However, PET imaging generates complex and diverse data that require labor-intensive interpretation, often influenced by the subjective judgment of readers. In nuclear medicine imaging reports, there is a lack of tools capable of meeting needs such as rapid referencing of similar cases, supporting differential diagnoses, and providing exemplary cases for educational purposes. Additionally, while large language models like ChatGPT have demonstrated potential in medical report generation, they are unable to access specific medical datasets to provide precise insights related to particular hospitals or cases.

The researchers sought to address this gap by integrating a retrieval-augmented generation (RAG) model with a comprehensive, long-term database of PET imaging reports to explore how LLMs might improve PET imaging report generation and fulfill unmet clinical needs.

Objective of the Study

This study aimed to develop and evaluate a custom LLM framework based on the RAG architecture, with three objectives:

  1. Provide imaging experts with references to previous reports, especially by retrieving and summarizing similar cases.
  2. Support medical education by enabling access to exemplary cases for learning and training.
  3. Leverage existing imaging report data to support expert readers in the differential diagnosis process.

Methods and System Architecture

Dataset

The research team extracted 211,813 PET imaging reports from 118,107 patients spanning from 2010 to 2023, sourced from a clinical data warehouse. Each report included information on the exam date, exam name, patient gender, and birth date in year-month format. All data were de-identified to ensure patient privacy. The study received Institutional Review Board (IRB) approval, and consent requirements were waived due to its retrospective nature.

System Architecture

The team designed a prototype chatbot integrating the RAG model with several modular components, including:

  1. Sentence Embedding and Vectorization: The “paraphrase-multilingual-MiniLM-L12-v2” sentence transformer model converted report texts and user queries into vector representations. The model supports cross-lingual understanding and paraphrasing, which is critical given the bilingual (English and Korean) nature of the dataset.

  2. Vector Storage Mechanism: The Chroma database stored the sentence embeddings as a searchable vector index. Retrieval used cosine similarity to match the query vector against stored vectors, and the top five most relevant reports were returned as context.

  3. Retrieval-Augmented Question/Answer Generation: Retrieved reports served as context and were combined with the user query to form the complete prompt passed to the LLM. For testing, the researchers employed the Llama-3 model (8 billion parameters), implemented with the LangChain framework.

  4. Data Visualization: t-SNE (t-distributed Stochastic Neighbor Embedding) was used for dimensionality reduction and visualization of the embedding vectors. Clusters were analyzed by diagnostic keyword (e.g., “lung cancer”) to demonstrate semantic similarities among reports.
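The retrieval flow in components 1–3 can be sketched minimally: embed documents and query, rank by cosine similarity, keep the top matches, and assemble the prompt. To keep the sketch self-contained and runnable, a toy bag-of-words embedding and an in-memory store stand in for the MiniLM sentence transformer and the Chroma database used in the study; the reports and query below are invented.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the study used the
    # paraphrase-multilingual-MiniLM-L12-v2 sentence transformer instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class VectorStore:
    """In-memory stand-in for the Chroma vector database."""
    def __init__(self):
        self.docs, self.vecs = [], []

    def add(self, doc: str):
        self.docs.append(doc)
        self.vecs.append(embed(doc))

    def query(self, text: str, k: int = 5):
        # Rank stored reports by cosine similarity to the query vector.
        qv = embed(text)
        ranked = sorted(zip(self.docs, self.vecs),
                        key=lambda dv: cosine(qv, dv[1]), reverse=True)
        return [d for d, _ in ranked[:k]]

def build_prompt(query: str, context: list[str]) -> str:
    # Retrieved reports are concatenated as context ahead of the user
    # query, then passed to the LLM (Llama-3 via LangChain in the study).
    joined = "\n---\n".join(context)
    return f"Context reports:\n{joined}\n\nQuestion: {query}\nAnswer:"

store = VectorStore()
for report in [
    "FDG PET: hypermetabolic lung mass, likely lung cancer",
    "FDG PET: breast cancer with internal mammary node metastasis",
    "Ga-68 PSMA-11 PET: prostate cancer, no metastasis",
]:
    store.add(report)

top = store.query("breast cancer internal mammary lymph node", k=2)
prompt = build_prompt("breast cancer internal mammary lymph node", top)
print(top[0])
```

In the real system, `embed` would be replaced by the sentence transformer's encoding step and `VectorStore` by a Chroma collection; the retrieve-then-prompt logic stays the same.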

Core Findings and Experiment Flow

Data Embeddings and Clustering Analysis

By embedding 211,813 PET reports into vector spaces, the research team observed semantic clustering in the embeddings through t-SNE visualizations. Reports with keywords such as “lung cancer,” “breast cancer,” and “lymphoma” demonstrated clustered distributions that reflected clinical similarities. For example, “lung cancer” reports formed tightly-knit clusters, consistent with the high prevalence of lung cancer cases in the dataset. Similarly, reports associated with specific PET modalities (e.g., “C-11 Methionine PET” and “Ga-68 PSMA-11 PET”) exhibited distinct clusters, highlighting the effectiveness of sentence embeddings in capturing semantic similarities.
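The projection step behind these cluster figures can be illustrated with a small t-SNE sketch (assuming scikit-learn and NumPy are available). The random vectors below merely stand in for real report embeddings, with two shifted clusters mimicking distinct diagnostic groups such as “lung cancer” vs. “lymphoma”.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Random 384-dim vectors stand in for real MiniLM sentence embeddings;
# the mean shift creates two synthetic "diagnostic" clusters.
group_a = rng.normal(0.0, 1.0, size=(15, 384))
group_b = rng.normal(5.0, 1.0, size=(15, 384))
embeddings = np.vstack([group_a, group_b])

# Reduce to 2-D for plotting, as the study did for its cluster figures.
coords = TSNE(n_components=2, perplexity=5, init="random",
              random_state=0).fit_transform(embeddings)
print(coords.shape)  # (30, 2)
```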

Querying and Diagnostic Suggestions

The prototype chatbot demonstrated its capability to answer complex medical queries in simulated clinical scenarios. For instance, in response to the query “Identify cases of breast cancer with metastasis to internal mammary lymph nodes,” the system efficiently retrieved relevant cases from the database, providing detailed clinical information. Additionally, when tasked with a differential diagnosis query for “multiple mediastinal lymph nodes with increased FDG uptake without an identified primary site,” the chatbot generated a comprehensive list of potential diagnoses, accompanied by context-based references.

Among the cases reviewed, 84.2% of retrieved results were rated as “fair” or above in relevance by three nuclear medicine physicians. For suggested differential diagnoses, 78.9% were rated as “fair” or above. Additionally, the RAG-enhanced LLM significantly outperformed the non-RAG counterpart in diagnostic suggestion accuracy (Wilcoxon rank-sum test, p < 0.05).
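The group comparison reported here, a Wilcoxon rank-sum test on physician ratings, can be sketched with SciPy. The rating values below are invented for illustration and are not the study's data.

```python
from scipy.stats import ranksums

# Illustrative 1-5 relevance ratings; NOT the study's actual data.
rag_scores    = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4]
no_rag_scores = [2, 3, 2, 3, 2, 1, 3, 2, 2, 3]

# Wilcoxon rank-sum test for a difference between two independent samples.
stat, p = ranksums(rag_scores, no_rag_scores)
print(p < 0.05)
```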

Quantitative Evaluation

The researchers employed ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation, longest-common-subsequence variant) to evaluate the alignment between generated conclusions and reference conclusions from real-world reports. The RAG-enhanced system achieved significantly higher ROUGE-L F-scores than the baseline LLM (0.16 ± 0.08 vs. 0.07 ± 0.03, p < 0.001).
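ROUGE-L scores a candidate text against a reference by the length of their longest common subsequence (LCS), from which precision, recall, and the F-score follow. A minimal pure-Python sketch of the F1 variant; the example sentences are invented:

```python
def lcs_len(a, b):
    # Dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l_f(candidate: str, reference: str) -> float:
    # ROUGE-L F1: harmonic mean of LCS-based precision and recall.
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

cand = "no evidence of metastasis in mediastinal lymph nodes"
ref = "no evidence of distant metastasis in the mediastinal lymph nodes"
print(round(rouge_l_f(cand, ref), 2))  # 0.89
```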

Significance and Future Directions

Scientific and Clinical Contributions

  1. Practical Value: This study demonstrated the feasibility of combining the RAG framework with PET imaging datasets to provide practical support for nuclear medicine reporting workflows. It not only streamlines interpretation but also offers reliable case references for complex cases.
  2. Educational Potential: By enabling access to similar cases and follow-up outcomes, the system provides valuable resources for training and education in clinical and academic settings.
  3. Customized Decision Support: Through its contextual referencing mechanism, the framework enables AI-driven, customized diagnostic and patient management support in nuclear medicine.

Challenges and Further Research

This study highlighted limitations such as suboptimal performance in retrieving rare case data and reliance solely on textual data. Future iterations may address these challenges by incorporating rarity-weighted retrieval mechanisms and leveraging multimodal data, such as integrating imaging data with textual contexts, to enhance system performance further.

Conclusion

This proof-of-concept study underscores the transformative potential of LLMs in the field of nuclear medicine. By combining the RAG framework with a large PET imaging report database, the researchers illustrated the applicability of LLMs in enhancing report generation, differential diagnosis, and case retrieval—thereby improving workflows for nuclear medicine experts. As more sophisticated LLMs and multimodal systems become available, similar AI-assisted frameworks will likely play a crucial role in advancing precision medicine and personalized healthcare.