Combating Label Noise with a General Surrogate Model for Sample Selection
Academic Background and Problem Statement
With the rapid development of Deep Neural Networks (DNNs), visual intelligence systems have made significant progress in tasks such as image classification, object detection, and video understanding. However, these breakthroughs rely heavily on the collection of high-quality annotated data, which is often time-consuming and expensive. To address this issue, researchers have turned to using large-scale web data for training, but these data often come with noisy labels, which can degrade the performance of DNNs. The presence of noisy labels leads to a discrepancy between the training and test data distributions, thereby affecting the model’s generalization ability on clean test data.
To tackle this problem, sample selection has emerged as an effective approach. The core idea is to separate clean samples from the rest of the training set according to some criterion. Previous methods relied primarily on the “small loss criterion,” under which samples with small training losses are treated as clean. However, this strategy depends on the learning dynamics of each data instance, and a network can still memorize some noisy samples when the same corrupted pattern occurs frequently, which gives those samples small losses as well. To avoid the effects of memorization, this paper instead uses a training-free surrogate model for selection.
Paper Source and Author Information
This paper is co-authored by Chao Liang, Linchao Zhu, Humphrey Shi, and Yi Yang, affiliated with the ReLER Lab at Zhejiang University, SHI Labs @ UIUC & Oregon, and Picsart AI Research (PAIR). The paper was accepted for publication in the International Journal of Computer Vision on December 1, 2024.
Research Content and Methodology
Research Process
This paper proposes a sample selection method based on the vision-language surrogate model CLIP (Contrastive Language–Image Pretraining) to automatically filter noisy samples. CLIP leverages its text-image alignment capability to assign a confidence score to each sample, thereby helping to identify clean samples. Additionally, the paper introduces a margin adaptive loss to mitigate the selection bias introduced by CLIP, enhancing the model’s robustness to noisy labels.
1. Sample Selection
First, the researchers use the pre-trained CLIP model to score each sample. Given an image ( x ), CLIP extracts the image feature ( v ) with its image encoder and one text feature per class, ( \{t_1, \dots, t_c\} ), with its text encoder, where ( c ) is the number of classes. CLIP's predicted probability for label ( y = i ) is computed as follows:
[ q(y = i \mid x) = \frac{\exp(\cos(v, t_i)/\tau)}{\sum_{j=1}^{c} \exp(\cos(v, t_j)/\tau)} ]
where ( \cos(\cdot, \cdot) ) denotes the cosine similarity, and ( \tau ) is the temperature factor. The researchers propose two selection criteria:
- Prediction Confidence: CLIP's predicted probability for a sample's given (possibly noisy) label is used as that sample's confidence. Samples whose confidence exceeds a predefined threshold ( \rho ) are considered clean.
- Prompt Consistency: Different prompt templates are designed by injecting domain-specific knowledge, and the difference between CLIP's predictions under two prompts is calculated. Samples with small differences are considered clean.
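The two criteria above can be illustrated with a minimal sketch in plain Python. It assumes the image and text features have already been extracted by CLIP and are given as lists of floats; the function names, the default values of `tau`, `rho`, and `max_gap`, and the use of total variation distance to measure the difference between two prompts are illustrative assumptions, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_probs(image_feat, text_feats, tau=0.01):
    """Softmax over cosine similarities scaled by temperature tau.

    text_feats holds one text feature per class. The default tau is an
    assumed placeholder value.
    """
    logits = [cosine(image_feat, t) / tau for t in text_feats]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def is_clean_by_confidence(probs, noisy_label, rho=0.5):
    """Prediction confidence: keep the sample if q(y = noisy_label | x) > rho."""
    return probs[noisy_label] > rho

def prompt_disagreement(probs_a, probs_b):
    """Difference between predictions under two prompts.

    Total variation distance is an assumption made for this sketch.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(probs_a, probs_b))

def is_clean_by_consistency(probs_a, probs_b, max_gap=0.1):
    """Prompt consistency: keep the sample if the two predictions nearly agree."""
    return prompt_disagreement(probs_a, probs_b) < max_gap
```

Because the surrogate model is never trained on the noisy labels, these scores do not depend on how well the downstream network has memorized any particular sample.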
2. Margin Adaptive Loss
Although CLIP helps in selecting clean samples, it may also introduce selection bias. To address this, the researchers design a noise-aware balanced margin adaptive loss. This loss adjusts the model’s output probabilities by incorporating a transition matrix and a class frequency prior, thereby suppressing overconfidence in certain classes and alleviating the class imbalance issue caused by sample selection.
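The exact form of the loss is not given here, so the following is only a rough sketch of how a transition matrix and a class frequency prior can jointly adjust output probabilities. It combines two generic techniques, logit adjustment with a log class-frequency prior and forward correction with a transition matrix, and should not be read as the paper's formulation. All function names are hypothetical, and `transition[i][j]` is assumed to mean the probability that a sample of true class `i` carries label `j`.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def balanced_probs(logits, class_freq):
    """Add the log class frequency to each logit before the softmax.

    Frequent classes receive a larger offset during training, so the raw
    logits are discouraged from being overconfident on them (assumed form).
    """
    adjusted = [z + math.log(f) for z, f in zip(logits, class_freq)]
    return softmax(adjusted)

def noise_aware_probs(clean_probs, transition):
    """Map clean-label probabilities to noisy-label probabilities.

    transition[i][j] is assumed to be P(observed label = j | true label = i).
    """
    c = len(clean_probs)
    return [sum(clean_probs[i] * transition[i][j] for i in range(c))
            for j in range(c)]

def margin_adaptive_nll(logits, label, class_freq, transition):
    """Negative log-likelihood of the observed label after both adjustments."""
    probs = noise_aware_probs(balanced_probs(logits, class_freq), transition)
    return -math.log(max(probs[label], 1e-12))  # clamp to avoid log(0)
```

With a uniform class frequency and an identity transition matrix, this sketch reduces to ordinary cross-entropy, which makes the effect of each ingredient easy to isolate.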
Experimental Results
The effectiveness of the proposed method is validated on multiple real-world and synthetic noisy datasets. The experimental results show that the proposed method achieves significant performance improvements on datasets such as WebVision, Clothing1M, CIFAR-10N, and CIFAR-100N. Particularly under high noise rates (e.g., 90%), the proposed method can still effectively identify clean samples, significantly outperforming existing baseline methods.
1. Real-World Datasets
On the WebVision dataset, the proposed method achieves Top-1 and Top-5 accuracies of 79.08% and 91.96%, respectively, significantly outperforming the DivideMix baseline. On the Clothing1M dataset, the proposed method also demonstrates strong performance, validating its effectiveness in handling real-world noisy labels.
2. Synthetic Datasets
On the CIFAR-10 and CIFAR-100 datasets, the proposed method performs well under various noise rates and noise types. In particular, under a high noise rate of 90%, it achieves Top-1 accuracies of 89.2% on CIFAR-10 and 45.7% on CIFAR-100, significantly outperforming existing baseline methods.
Conclusion and Significance
This paper proposes a sample selection method based on CLIP, which can effectively identify noisy samples memorized by DNNs. By introducing a margin adaptive loss, the proposed method further mitigates the selection bias introduced by CLIP, enhancing the model’s robustness to noisy labels. The experimental results demonstrate significant performance improvements on multiple noisy datasets, showcasing the potential of the proposed method in handling noisy label problems.
Research Highlights
- Innovation: This paper is the first to leverage the off-the-shelf vision-language surrogate model CLIP for sample selection, avoiding the learning bias introduced by the traditional small loss criterion.
- Robustness: By designing a margin adaptive loss, the proposed method effectively mitigates the selection bias introduced by CLIP, enhancing the model’s robustness to noisy labels.
- Wide Applicability: The proposed method demonstrates strong performance on multiple real-world and synthetic noisy datasets, showcasing its broad applicability across different tasks.
Summary
By introducing the CLIP model and a margin adaptive loss, this paper proposes a novel sample selection method that effectively addresses the noisy label problem. The method not only achieves significant performance improvements on multiple datasets but also provides new insights for future research on noisy labels.