Semi-Supervised Thyroid Nodule Detection in Ultrasound Videos


Research Background

Thyroid nodules are a common thyroid disease. Early screening and diagnosis typically rely on ultrasound examination, a widely used non-invasive imaging method also applied to conditions such as breast cancer and arterial plaques. However, because thyroid nodules appear at low resolution in ultrasound images and exhibit irregular, complex morphologies, ultrasound examination depends heavily on the experience of radiologists, and misdiagnosis and missed diagnosis are common, especially in underdeveloped areas and countries. Developing automated and accurate methods based on Computer-Aided Diagnosis (CAD) is therefore particularly important.

In recent years, deep learning has been introduced into the computer-aided diagnosis of ultrasound images. Although existing methods for detecting thyroid nodules have made progress on static ultrasound images, they do not fully exploit the spatial and temporal information available during the diagnostic process. In clinical screening, radiologists carefully view many consecutive frames to locate a nodule, analyze its characteristics, and complete the diagnosis. Video-based detection can therefore provide richer spatial and temporal information than individual images.

[Figure: Diagram of the neural network structure constructed in this study]

Because thyroid nodules have diverse morphologies and labeling ultrasound images is complex, current detection solutions rely heavily on large amounts of training samples. However, the diverse and complex nodules in low-resolution ultrasound images can only be labeled by experienced radiologists, which makes annotating ultrasound videos even more time-consuming and laborious than labeling single images. Fully exploiting ultrasound videos for thyroid nodule detection under limited annotation therefore remains a challenging task.

Paper Source

This study was conducted by Xiang Luo, Zhongyu Li, Canhua Xu, Bite Zhang, Liangliang Zhang, Jihua Zhu, Peng Huang, Xin Wang, Meng Yang, Shi Chang, who are affiliated with institutions such as Xi’an Jiaotong University, Fourth Military Medical University, Xiangya Hospital, and Central South University. The paper was published on January 1, 2024, in IEEE Transactions on Medical Imaging.

Research Objectives

This paper aims to address two questions:

  1. How to use the spatial and temporal information of ultrasound videos for more accurate detection of thyroid nodules.
  2. How to improve the accuracy of nodule detection through semi-supervised learning when labeled data are limited.

Research Methods

This paper proposes a video-based semi-supervised framework for detecting thyroid nodules in ultrasound videos. The framework contains two main innovations:

  1. Adjacent Frame Guided Network (AFGN): improves the spatial consistency of detection by using adjacent frames to guide inference on the current frame.
  2. Pseudo-label adaptive strategy: generates pseudo-labels for unlabeled frames and adaptively fills in missing ones, so that unlabeled videos are fully exploited and the manual labeling workload is reduced.

Data Preprocessing and Labeling

  1. Data Collection: Collected 1648 transverse view and 1622 longitudinal view ultrasound videos from 1316 patients.
  2. Data Cleaning: Removed poor-quality videos and cropped device information from the video borders, resulting in 996 transverse and 1088 longitudinal view videos.
  3. Frame Selection and Labeling: Reduced the labeling workload by removing near-duplicate frames based on the similarity between adjacent frames (a sketch of this pruning step follows this list). The remaining frames were labeled by two radiologists with over ten years of experience each, and the results were reviewed by another radiologist with over twenty years of experience, ultimately yielding 4730 transverse and 4939 longitudinal view ultrasound images.
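
For illustration, the sketch below shows one plausible way to prune near-duplicate adjacent frames with the structural similarity index (SSIM), assuming frames are grayscale uint8 NumPy arrays; the function name and the 0.9 threshold are illustrative assumptions, not values reported in the paper.

```python
# Minimal sketch of similarity-based frame pruning (illustrative, not the paper's exact procedure).
from skimage.metrics import structural_similarity as ssim

def prune_similar_frames(frames, threshold=0.9):
    """Keep a frame only if it differs enough from the last kept frame."""
    if len(frames) == 0:
        return []
    kept = [frames[0]]
    for frame in frames[1:]:
        # SSIM is close to 1 for near-identical frames.
        score = ssim(kept[-1], frame, data_range=255)
        if score < threshold:  # sufficiently different from the last kept frame
            kept.append(frame)
    return kept
```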

Semi-Supervised Ultrasound Video Detection Framework

To reduce the manual labeling workload, a semi-supervised video detection framework is proposed. It proceeds in three main steps (a rough training sketch follows the list):

  1. Initialization: the student AFGN (Student-AFGN) and teacher AFGN (Teacher-AFGN) are initialized with the same hyperparameter configuration.
  2. Pseudo-label generation: the teacher AFGN is first trained and optimized on labeled videos and then generates pseudo-labels for unlabeled videos, using non-maximum suppression to remove duplicate detections and a confidence threshold to filter out uncertain bounding boxes.
  3. Student training: the student AFGN is trained on unlabeled videos with pseudo-labels together with labeled videos with ground-truth labels, with a parameter λ balancing the supervised and unsupervised terms.
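
As a rough sketch of this teacher-student procedure, the code below assumes a torchvision-style detector interface (predictions as dicts of boxes, scores, and labels in eval mode; a dict of losses in train mode); the names teacher_afgn and student_afgn and the thresholds are illustrative assumptions rather than the paper's implementation.

```python
# Illustrative teacher-student step: pseudo-label generation + combined loss.
import torch
from torchvision.ops import nms

@torch.no_grad()
def generate_pseudo_labels(teacher_afgn, frames, score_thr=0.7, iou_thr=0.5):
    """Run the trained teacher on unlabeled frames and keep confident, de-duplicated boxes."""
    teacher_afgn.eval()
    pseudo_targets = []
    for pred in teacher_afgn(frames):                        # list of {"boxes", "scores", "labels"}
        keep = nms(pred["boxes"], pred["scores"], iou_thr)   # remove duplicate detections
        boxes, scores, labels = pred["boxes"][keep], pred["scores"][keep], pred["labels"][keep]
        conf = scores >= score_thr                           # filter uncertain bounding boxes
        pseudo_targets.append({"boxes": boxes[conf], "labels": labels[conf]})
    return pseudo_targets

def student_loss(student_afgn, labeled_batch, unlabeled_images, pseudo_targets, lam=1.0):
    """Combine supervised and pseudo-label losses: L_total = L_s + lambda * L_u."""
    images_l, targets_l = labeled_batch
    loss_s = sum(student_afgn(images_l, targets_l).values())               # supervised loss L_s
    loss_u = sum(student_afgn(unlabeled_images, pseudo_targets).values())  # unsupervised loss L_u
    return loss_s + lam * loss_u
```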

Pseudo-Label Adaptive Strategy

When generating pseudo-labels, the pre-trained detection backbone may fail to produce predictions on some frames because thyroid nodules vary widely in morphology. An adaptive strategy based on the pseudo-labels of adjacent frames is therefore proposed to fill in the labels of such frames, covering three cases (a sketch of the filling step follows the list):

  1. Unpredicted frames at the beginning or end of a video: compute the structural similarity index with the two closest frames; if both values exceed a set threshold, average the labels of those two frames to generate the pseudo-label for the unpredicted frame.
  2. Middle frames whose previous and next frames already carry pseudo-labels: as in the first case, generate the pseudo-label from the similarity with those adjacent frames.
  3. Middle frames whose adjacent frames carry no pseudo-labels: compute the similarity index with the frames that do carry pseudo-labels and, if the two highest similarity scores both exceed the threshold, use those two frames to generate the pseudo-label.
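
A minimal sketch of the filling step for a single unlabeled frame is given below, assuming SSIM as the similarity measure, boxes given as (x1, y1, x2, y2) arrays, and an illustrative similarity threshold; the paper's exact threshold and box-averaging details may differ.

```python
# Illustrative pseudo-label filling from two similar, already-labeled neighbor frames.
import numpy as np
from skimage.metrics import structural_similarity as ssim

def fill_missing_pseudo_label(frame, neighbor_frames, neighbor_boxes, sim_thr=0.85):
    """Assign a pseudo-box to `frame` by averaging its two nearest pseudo-labeled neighbors."""
    (frame_a, frame_b), (box_a, box_b) = neighbor_frames, neighbor_boxes
    sim_a = ssim(frame, frame_a, data_range=255)   # structural similarity with each neighbor
    sim_b = ssim(frame, frame_b, data_range=255)
    if sim_a >= sim_thr and sim_b >= sim_thr:      # only fill when both neighbors are similar enough
        return (np.asarray(box_a, float) + np.asarray(box_b, float)) / 2.0
    return None                                    # otherwise leave the frame without a pseudo-label
```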

Ultrasound Video Detection Backbone Network

To fully exploit spatial and temporal information, a backbone based on adjacent-frame-guided detection (AFGN) is designed: it enhances the detection of the current frame by selecting and aggregating features from adjacent frames. The main steps are as follows (a sketch follows the list):

  1. Candidate region selection: candidate regions are generated for both the current and adjacent frames, and three indicators (candidate region confidence, frame distance score, and candidate region overlap score) are used to screen the adjacent-frame candidates most closely related to the current frame.
  2. Multi-frame attention module: a relation module enhances the features of the current frame's candidate regions with the selected adjacent-frame features, improving the detection results.
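
The sketch below illustrates how the three indicators might be combined to select related adjacent-frame proposals and how their features could be aggregated in an attention-like manner; the product weighting, top-k selection, and residual addition are assumptions for illustration, not the paper's exact relation module.

```python
# Illustrative proposal scoring and attention-style aggregation across adjacent frames.
import torch
from torchvision.ops import box_iou

def score_adjacent_proposals(cur_box, adj_boxes, adj_scores, frame_offsets, max_offset=5):
    """Combine confidence, frame-distance, and overlap scores for adjacent-frame proposals."""
    conf = adj_scores                                           # detector confidence of each proposal
    dist = 1.0 - frame_offsets.abs().float() / max_offset       # proposals from closer frames score higher
    iou = box_iou(cur_box.unsqueeze(0), adj_boxes).squeeze(0)   # overlap with the current-frame proposal
    return conf * dist * iou                                    # higher score = more related to the current frame

def aggregate_with_attention(cur_feat, adj_feats, scores, top_k=4):
    """Enhance the current-frame proposal feature with its top-k related adjacent-frame features."""
    k = min(top_k, scores.numel())
    weights, idx = scores.topk(k)
    weights = torch.softmax(weights, dim=0)                     # attention weights over the selected proposals
    context = (weights.unsqueeze(1) * adj_feats[idx]).sum(dim=0)  # weighted sum of adjacent features
    return cur_feat + context                                   # residual enhancement of the current feature
```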

Framework Training Details

The optimization loss of the student AFGN combines supervised and unsupervised parts: $L_{total} = L_s + \lambda L_u$, where $L_s$ and $L_u$ denote the supervised and unsupervised losses, respectively.

Experimental Results

Multiple sets of comparative experiments were conducted to validate the proposed method:

  1. Impact of the number of labeled videos: the proposed method performed well across different numbers of labeled videos, and the advantage of the semi-supervised framework was more pronounced when fewer labeled videos were used.
  2. Comparison with other detection models: with 100 transverse and 100 longitudinal labeled videos, the proposed method improved by 8.20% and 5.75% over the best competitor, TransVOD++. In five-fold cross-validation using all labeled videos, the proposed method improved mAP by 0.26%-1.03% over the best competitor, RDN.

Conclusion

This paper proposes a semi-supervised framework for detecting thyroid nodules in ultrasound videos. By introducing the adjacent-frame-guided detection backbone (AFGN) and the pseudo-label adaptive strategy, the method achieves good detection results with a small amount of labeled data and improves significantly over existing methods. The experimental results demonstrate the practical and scientific value of the proposed framework for thyroid nodule detection.