Large Language Model Ability to Translate CT and MRI Free-Text Radiology Reports into Multiple Languages
Academic Background
In the context of globalization, the increasing mobility of patients necessitates the translation of radiology reports, which are essential tools for disease diagnosis and management, into various languages. However, language barriers can impede the effective use of these reports, potentially compromising timely and accurate patient care. The rise of telemedicine has further exacerbated this challenge, as patients increasingly seek remote expert consultations or second opinions, often submitting reports in languages unfamiliar to local providers. Without accurate translations, these reports may be misinterpreted or neglected, leading to diagnostic delays and potential errors.
Given that human translators with medical expertise are not always readily available, artificial intelligence-based models, particularly large language models (LLMs), offer promising alternatives. Although these models were initially designed for general language processing tasks, they have demonstrated strong results in applications such as translation. However, the capability of LLMs in translating radiology reports remains largely unexplored, especially for low-resource languages. Existing models often exhibit biases, as they are primarily trained on English-language data.
Research Objective
This study aims to evaluate the accuracy and quality of various LLMs in translating radiology reports across high-resource languages (e.g., English, Italian, French, German, and Chinese) and low-resource languages (e.g., Swedish, Turkish, Russian, Greek, and Thai).
Research Methodology
Dataset and Translation Process
The study used a dataset of 100 synthetic free-text radiology reports from CT and MRI scans. Between January 14 and May 2, 2024, 18 radiologists produced reference translations of these reports into nine target languages. Ten LLMs, including GPT-4 (OpenAI), Llama 3 (Meta), and Mixtral models (Mistral AI), then translated the same reports, and their output was scored against the reference translations using the Bilingual Evaluation Understudy (BLEU) score, Translation Error Rate (TER), and Character-Level F-Score (CHRF++). Statistical significance was assessed with paired t-tests and Holm-Bonferroni corrections for multiple comparisons. Additionally, radiologists conducted a qualitative evaluation of the translations using a standardized questionnaire.
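The Holm-Bonferroni procedure mentioned above controls the family-wise error rate when many model and language pairs are compared at once. The study's actual statistical pipeline is not reproduced here; the following is a minimal sketch of the step-down procedure itself, with invented p-values.

```python
# Minimal sketch of the Holm-Bonferroni step-down correction.
# Illustrative only; the p-values below are invented.

def holm_bonferroni(p_values, alpha=0.05):
    """Return (p, rejected) pairs in the original input order."""
    m = len(p_values)
    # Sort p-values ascending, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # Step-down threshold: alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return list(zip(p_values, rejected))

results = holm_bonferroni([0.01, 0.04, 0.03, 0.20])
# Only the smallest p-value (0.01) survives the correction here.
```

Compared with a plain Bonferroni correction (which would test every p-value against alpha/m), the step-down thresholds become progressively less strict, giving the procedure more power at the same family-wise error rate.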
Quantitative Evaluation
The quantitative evaluation employed three widely used linguistic metrics: BLEU score, TER, and CHRF++. The BLEU score measures n-gram overlap between a machine translation and human reference translations, with higher scores indicating greater accuracy. TER counts the edits (insertions, deletions, substitutions, and shifts) required to convert a machine translation into the reference, normalized by reference length, with lower values indicating higher quality. CHRF++ assesses n-gram similarities at both the character and word levels, with higher scores indicating closer alignment with the reference translation.
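To make the edit-based metric concrete, here is a hedged sketch of a simplified TER: word-level Levenshtein distance divided by the reference length. The official TER metric also scores block shifts as single edits; this toy version omits them, so it is illustrative only.

```python
# Simplified TER: word-level edit distance / reference length.
# The real TER metric additionally handles block shifts.

def word_edit_distance(hyp, ref):
    """Levenshtein distance between two token lists."""
    h, r = len(hyp), len(ref)
    d = [[0] * (r + 1) for _ in range(h + 1)]
    for i in range(h + 1):
        d[i][0] = i  # deleting all hypothesis tokens
    for j in range(r + 1):
        d[0][j] = j  # inserting all reference tokens
    for i in range(1, h + 1):
        for j in range(1, r + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[h][r]

def simple_ter(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    return word_edit_distance(hyp, ref) / len(ref)

score = simple_ter("no acute intracranial hemorrhage seen",
                   "no acute intracranial hemorrhage")
# one extra word over a four-word reference -> 0.25
```

A perfect translation yields a TER of 0.0, and scores above 1.0 are possible when the hypothesis diverges heavily from a short reference.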
Qualitative Evaluation
The qualitative evaluation was conducted using a structured questionnaire, assessing criteria such as accuracy of medical terminology, appropriateness for clinical use, clarity and readability, consistency with the original meaning, and grammar and syntax. Each criterion was rated on a five-point Likert scale, with 1 indicating poor performance and 5 indicating excellent performance.
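Aggregating such questionnaire data typically means averaging ratings per criterion across raters. The criterion names below follow the questionnaire described above, but the scores and the averaging scheme are invented for illustration.

```python
# Toy aggregation of five-point Likert ratings per criterion.
# Criterion names follow the questionnaire; scores are invented.
from statistics import mean

ratings = {
    "medical terminology": [3, 4, 3],
    "clinical appropriateness": [4, 4, 5],
    "clarity and readability": [5, 5, 4],
    "consistency with original": [5, 4, 5],
    "grammar and syntax": [4, 5, 5],
}

# Mean rating per criterion, rounded to two decimals.
mean_scores = {criterion: round(mean(vals), 2)
               for criterion, vals in ratings.items()}
```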
Research Findings
Quantitative Evaluation Results
GPT-4 demonstrated the best overall translation quality across multiple languages, particularly excelling in translations from English to German, Greek, Thai, and Turkish. GPT-3.5 showed the highest accuracy in translations from English to French, while Qwen1.5 performed exceptionally well in English-to-Chinese translations. Mixtral 8x22b achieved the best results in Italian-to-English translations.
Qualitative Evaluation Results
The qualitative evaluation revealed that LLMs excelled in clarity, readability, and consistency with the original meaning but showed moderate accuracy in medical terminology.
Conclusion
LLMs exhibited high accuracy and quality in translating radiology reports, although the results varied depending on the model and language pair. GPT-4 performed best across multiple languages, while other models such as GPT-3.5 and Mixtral 8x22b also showed strong performance in specific language pairs. However, no single model was universally applicable across all language pairs, and translation quality for low-resource languages remains an area for improvement.
Research Highlights
- Key Findings: GPT-4 demonstrated the highest translation quality across multiple language pairs, particularly in high-resource languages.
- Methodological Innovation: This study is the first to systematically evaluate the performance of LLMs in translating radiology reports, encompassing both high-resource and low-resource languages.
- Application Value: The findings suggest that LLMs have significant potential in translating medical reports, particularly in situations where human translators are unavailable, thereby supporting global healthcare.
Research Significance
This study provides important empirical data on the application of LLMs in medical translation, particularly in handling multilingual radiology reports. The results underscore the need for further development and optimization of LLMs, especially in improving translation quality for low-resource languages and the accuracy of medical terminology. Additionally, the study offers valuable insights for the future development of multilingual medical translation tools.
Authors and Institutions
The study was conducted by a team of experts from various international institutions, including Aymen Meddeb, Sophia Lüken, Felix Busch, and others. The research team is affiliated with renowned institutions such as Charité–Universitätsmedizin Berlin, Technical University of Munich, and the University of Naples Federico II. The paper was published in the December 2024 issue of Radiology.
References
The study cites numerous relevant publications, including research on the application of LLMs in medical translation, challenges in multilingual translation, and structured translation of radiology reports. These references provide theoretical support and background knowledge for the study.
Data Sharing
Data generated or analyzed during the study are available from the corresponding author upon request.
Conflict of Interest Statement
All authors declare no relevant conflicts of interest.
This study not only validates the potential of LLMs in translating radiology reports but also provides critical insights for the future development of multilingual medical translation tools.