Delving Deep into Simplicity Bias for Long-Tailed Image Recognition
Academic Background and Problem Statement
In recent years, deep neural networks have made remarkable progress in computer vision, particularly in tasks such as image recognition, object detection, and semantic segmentation. However, even the most advanced deep models struggle on long-tailed data, where minority (tail) classes have far fewer samples than majority (head) classes. This imbalance is prevalent in many real-world applications, such as pipeline failure detection and face recognition.
The main challenge in long-tailed image recognition is handling this imbalance, and in particular improving generalization on the minority classes. Common remedies include re-sampling, loss re-weighting, and data augmentation, but they often fail to fundamentally address the poor generalization caused by the scarcity of minority-class samples.
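For context, the snippet below sketches one standard loss re-weighting scheme, the "effective number of samples" weighting of Cui et al. (2019). It is illustrative background in PyTorch, not the method proposed in this paper, and the function name is ours:

```python
import torch
import torch.nn.functional as F

def class_balanced_weights(class_counts, beta=0.999):
    """Per-class weights w_c proportional to (1 - beta) / (1 - beta^{n_c}) (Cui et al., 2019)."""
    counts = torch.as_tensor(class_counts, dtype=torch.float)
    effective_num = 1.0 - torch.pow(beta, counts)   # "effective number" of samples per class
    weights = (1.0 - beta) / effective_num
    return weights / weights.sum() * len(counts)    # normalize so weights average to 1

# Usage with a standard classifier:
#   weights = class_balanced_weights([5000, 500, 50])   # head -> tail class counts
#   loss = F.cross_entropy(logits, targets, weight=weights)
```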
This paper investigates the problem of long-tailed image recognition through the lens of Simplicity Bias (SB). Simplicity Bias refers to the tendency of deep neural networks in supervised learning tasks to rely on simple predictive patterns while ignoring more complex features. This bias is particularly pronounced in long-tailed distribution data, where models are more likely to depend on simple features for minority classes, leading to degraded generalization performance.
Paper Source and Author Information
This paper is co-authored by Xiu-Shen Wei, Xuhao Sun, Yang Shen, and Peng Wang, with affiliations spanning Southeast University, Nanjing University of Science and Technology, and the University of Electronic Science and Technology of China. It was submitted on May 12, 2024, accepted on December 26, 2024, and published in the International Journal of Computer Vision in 2025.
Research Methodology and Process
This paper proposes a novel self-supervised learning method called Triple-Level Self-Supervised Learning (3LSSL), specifically designed to handle long-tailed distribution data. The method enhances the model’s ability to learn complex features through three levels of self-supervised learning, thereby mitigating the impact of Simplicity Bias on minority classes.
1. Holistic-Level Self-Supervised Learning (Holistic-Level SSL)
Holistic-level self-supervised learning builds on a classic contrastive learning framework (e.g., MoCo). Two views of each input image are generated via different data augmentations and fed into an encoder and a momentum encoder, respectively; maximizing the cosine similarity between the two views' embeddings drives the model to learn global, complex features.
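A minimal sketch of this MoCo-style objective in PyTorch follows; the function names and hyperparameters are ours, and the paper's exact loss and queue handling may differ:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """EMA update of the momentum (key) encoder from the query encoder."""
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.data.mul_(m).add_(pq.data, alpha=1.0 - m)

def holistic_ssl_loss(q, k, queue, temperature=0.07):
    """InfoNCE between two augmented views of the same images.

    q:     (B, D) embeddings of view 1 from the encoder
    k:     (B, D) embeddings of view 2 from the momentum encoder
    queue: (K, D) stored past keys acting as negatives
    """
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    l_pos = torch.einsum("bd,bd->b", q, k).unsqueeze(-1)             # (B, 1) positives
    l_neg = torch.einsum("bd,kd->bk", q, F.normalize(queue, dim=1))  # (B, K) negatives
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive at index 0
    return F.cross_entropy(logits, labels)
```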
2. Partial-Level Self-Supervised Learning (Partial-Level SSL)
Partial-level self-supervised learning forces the model to mine complementary information from local regions of the image via masking. Specifically, Class Activation Mapping (CAM) identifies the regions that contribute most to classification, and these regions are masked out, compelling the model to attend to the remaining, more complex regions.
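The following sketch shows the CAM-masking idea, assuming an encoder with global average pooling followed by a linear classifier; `cam_mask` and `keep_ratio` are our illustrative names and choices, not necessarily the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

def cam_mask(images, feat_maps, fc_weight, labels, keep_ratio=0.7):
    """Hide the most class-discriminative regions so the model must use the rest.

    images:    (B, 3, H, W) input batch
    feat_maps: (B, C, h, w) last conv feature maps from the encoder
    fc_weight: (num_classes, C) linear classifier weights
    labels:    (B,) class indices used to select each image's CAM
    """
    B = images.size(0)
    # CAM: class-specific weighted sum over feature channels.
    cam = torch.einsum("bc,bchw->bhw", fc_weight[labels], feat_maps)
    cam = F.relu(cam).unsqueeze(1)                                   # (B, 1, h, w)
    cam = F.interpolate(cam, size=images.shape[-2:], mode="bilinear",
                        align_corners=False)
    lo = cam.amin(dim=(2, 3), keepdim=True)
    hi = cam.amax(dim=(2, 3), keepdim=True)
    cam = (cam - lo) / (hi - lo + 1e-6)                              # normalize to [0, 1]
    # Zero out the top-activated (1 - keep_ratio) fraction of pixels.
    thresh = torch.quantile(cam.flatten(2), keep_ratio, dim=2).view(B, 1, 1, 1)
    return images * (cam < thresh).float()
```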
3. Augmented-Level Self-Supervised Learning (Augmented-Level SSL)
Augmented-level self-supervised learning strengthens feature learning for minority classes by supplying pseudo-positive samples derived from the classifier's predictions. Specifically, an augmented queue stores the embeddings of pseudo-positive samples, and each sample is pulled toward the stored embeddings of its class, effectively giving tail classes extra positive pairs.
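Below is a sketch of the augmented-queue idea under our reading of the description; `AugmentedQueue`, its size, and the pull loss are illustrative assumptions, with device handling and queue warm-up simplified:

```python
import torch
import torch.nn.functional as F

class AugmentedQueue:
    """Per-class FIFO queue of embeddings treated as pseudo-positives.

    An embedding enters class c's queue when the classifier predicts c for its
    image; later, samples labeled c are pulled toward these stored embeddings.
    """
    def __init__(self, num_classes, dim, size=64):
        self.queues = torch.zeros(num_classes, size, dim)  # zero slots exert no pull
        self.ptr = torch.zeros(num_classes, dtype=torch.long)
        self.size = size

    @torch.no_grad()
    def enqueue(self, embeddings, pred_labels):
        for emb, c in zip(F.normalize(embeddings, dim=1).cpu(), pred_labels.cpu()):
            self.queues[c, self.ptr[c]] = emb
            self.ptr[c] = (self.ptr[c] + 1) % self.size

    def loss(self, q, labels):
        """Negative mean cosine similarity to the pseudo-positives of each class."""
        q = F.normalize(q, dim=1)
        pos = self.queues[labels.cpu()].to(q.device)       # (B, size, D)
        return -torch.einsum("bd,bsd->bs", q, pos).mean()
```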
Experimental Results and Analysis
Extensive experiments were conducted on five long-tailed image recognition benchmarks: Long-Tailed CIFAR-10/100, ImageNet-LT, Places-LT, and iNaturalist 2018. The results show that the proposed 3LSSL method consistently outperforms existing state-of-the-art methods across all of them.
1. Results on Long-Tailed CIFAR Datasets
On the Long-Tailed CIFAR-10 and CIFAR-100 datasets, 3LSSL achieved the highest classification accuracy under all tested imbalance ratios (100, 50, 10). Notably, on CIFAR-100 with an imbalance ratio of 100, it outperformed the strongest baseline (BCL) by 2.7%.
2. Results on ImageNet-LT Dataset
On the ImageNet-LT dataset, the 3LSSL method achieved classification accuracies of 59.1% and 59.9% using ResNet-50 and ResNeXt-50 as backbones, respectively, significantly outperforming existing state-of-the-art methods.
3. Results on Places-LT Dataset
On the Places-LT dataset, 3LSSL achieved a classification accuracy of 42.0%, which is 0.8% higher than the best prior method (PaCo).
4. Results on iNaturalist 2018 Dataset
On the iNaturalist 2018 dataset, 3LSSL achieved a classification accuracy of 75.8%, significantly outperforming prior state-of-the-art methods such as SADE and PaCo.
Conclusion and Significance
By investigating the impact of Simplicity Bias in long-tailed image recognition, this paper proposes a novel self-supervised method, 3LSSL, which counteracts the bias by driving the model to learn complex features at the holistic, partial, and augmented levels. Experimental results demonstrate significant performance gains across multiple long-tailed recognition benchmarks.
This research not only provides a new solution for long-tailed image recognition but also offers new insights into applying self-supervised learning to long-tailed data. Future work could explore extending 3LSSL to other tasks, such as few-shot learning.
Research Highlights
- In-depth Study of Simplicity Bias: This paper is the first to investigate the impact of Simplicity Bias in long-tailed image recognition tasks and experimentally validates that minority classes are more severely affected by Simplicity Bias.
- Triple-Level Self-Supervised Learning Method: The proposed 3LSSL method effectively mitigates Simplicity Bias through three levels of self-supervised learning, significantly improving the model’s generalization ability on long-tailed data.
- Extensive Experimental Validation: Extensive experiments were conducted on five long-tailed image recognition benchmark datasets, demonstrating the effectiveness and robustness of the 3LSSL method.
Other Valuable Information
The paper also provides visualization analyses demonstrating how 3LSSL mitigates Simplicity Bias: its activation maps show that the model attends to more comprehensive image regions rather than a few simple cues, particularly for minority classes.
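As a hint of how such figures can be reproduced, the short sketch below overlays a normalized activation map (e.g., the CAM computed in the earlier sketch) on its image; the function name is ours:

```python
import matplotlib.pyplot as plt

def show_cam_overlay(image, cam, alpha=0.5):
    """Overlay a CAM heatmap on an image for qualitative inspection.

    image: (H, W, 3) array with values in [0, 1]
    cam:   (H, W) activation map normalized to [0, 1]
    """
    plt.imshow(image)
    plt.imshow(cam, cmap="jet", alpha=alpha)  # semi-transparent heatmap on top
    plt.axis("off")
    plt.show()
```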