An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-training
Academic Background
In recent years, self-supervised learning (SSL) has made significant progress in computer vision. In particular, masked image modeling (MIM) pre-training has been applied with great success to large-scale vision transformers (ViTs), substantially improving their performance on downstream tasks. Existing research, however, focuses mainly on large ViTs; pre-training methods for lightweight ViTs, and their effects, have received comparatively little attention. Moreover, while many studies design intricate lightweight ViT architectures to improve performance, few examine how pre-training strategies could be optimized to further strengthen existing lightweight models. This paper explores whether MIM pre-training remains effective when applied to extremely simple lightweight ViTs, and tackles the question through systematic observation, analysis, and targeted solutions.
Paper Source
The paper was co-authored by Jin Gao, Shubo Lin, Shaoru Wang, and others from multiple institutions, including the Institute of Automation, Chinese Academy of Sciences, the School of Artificial Intelligence, University of Chinese Academy of Sciences, and the School of Information Science and Technology, ShanghaiTech University. It was accepted by the International Journal of Computer Vision in December 2024 and will be officially published in 2025.
Research Content
Research Workflow
The study follows an observation-analysis-solution workflow. It first systematically observes how the performance of various pre-training methods varies with the scale of the downstream fine-tuning data; it then analyzes layer representation similarities and attention maps to reveal the shortcomings of MIM pre-training in learning higher-layer representations; finally, it proposes a decoupled distillation strategy to improve the pre-training of lightweight ViTs.
Experimental Subjects and Sample Size
This study uses a slightly modified version of the lightweight ViT-tiny proposed by Touvron et al. (2021), with 5.7M parameters, as its main experimental model. The recently proposed hierarchical vision transformer Hiera, with 6.5M parameters, is studied as well. The experiments involve the ImageNet-1k, ADE20k, and LaSOT datasets, among others.
Experimental Process
- Adaptation and Comparison of Pre-training Methods: Popular MIM pre-training methods (e.g., MAE, SimMIM, BEiT), contrastive learning (CL) pre-training methods (e.g., MoCo-v3, DINO), and fully supervised pre-training were applied to lightweight ViTs (a generic MAE-style masking sketch follows this list).
- Benchmarking: Fine-tuning evaluations were conducted on pre-trained lightweight models for the ImageNet classification task, and their transfer performance on other datasets was further assessed.
- Linear Probing and Model Analysis: Linear probing, CKA (centered kernel alignment)-based layer representation similarity analysis, and attention map analysis were used to uncover how the different pre-training methods work (a minimal CKA sketch also follows this list).
- Proposal and Validation of a Decoupled Distillation Strategy: A decoupled distillation strategy was proposed to further improve MIM pre-training by separating the reconstruction task from the distillation task.
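As context for the first bullet above, the core operation shared by MAE-style MIM methods is to feed the encoder only a random subset of patch tokens and to reconstruct the masked remainder. The sketch below is a generic illustration using standard MAE defaults (e.g., a 0.75 mask ratio); the function and variable names are illustrative and are not taken from the paper's released code.

```python
import torch

def random_masking(patch_tokens: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style random masking sketch (generic defaults, not the paper's exact code).

    Keeps a random subset of patch tokens for the encoder and returns the
    binary mask plus the indices needed to restore the original patch order.
    """
    batch, num_patches, dim = patch_tokens.shape
    num_keep = int(num_patches * (1 - mask_ratio))

    # Random scores decide which patches are kept (lowest scores are visible).
    noise = torch.rand(batch, num_patches, device=patch_tokens.device)
    ids_shuffle = torch.argsort(noise, dim=1)
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # undoes the shuffle later

    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(
        patch_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, dim))

    # Binary mask over all patches: 0 = visible, 1 = masked (to be reconstructed).
    mask = torch.ones(batch, num_patches, device=patch_tokens.device)
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore
```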
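For the layer representation similarity analysis in the third bullet, a common implementation is linear CKA computed between activations collected from two layers. The following minimal sketch assumes linear CKA on centered mini-batch features; the tensor names (`feats_a`, `feats_b`) are hypothetical.

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between two feature matrices of shape (n_samples, dim).

    Minimal sketch for comparing layer representations; an assumed, standard
    formulation rather than the authors' exact analysis code.
    """
    # Center each feature dimension across the batch.
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    # Similarity of cross-covariance normalized by self-covariances.
    cross = torch.norm(y.T @ x, p="fro") ** 2
    norm_x = torch.norm(x.T @ x, p="fro")
    norm_y = torch.norm(y.T @ y, p="fro")
    return cross / (norm_x * norm_y)

# Usage (hypothetical): feats_a, feats_b are (n_images, embed_dim) activations
# collected with forward hooks from two ViT blocks.
# similarity = linear_cka(feats_a, feats_b)
```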
Key Results
- Proper Pre-training Can Unleash the Great Potential of Lightweight ViTs: Almost all compared pre-training methods outperformed random initialization on the ImageNet classification task, with MIM pre-training showing excellent performance under moderate pre-training costs.
- MIM Pre-training Can Enhance Simple ViT-tiny to Perform on Par with Recent SOTA Lightweight ViT Derivatives on ImageNet: A plain ViT-tiny enhanced by MIM pre-training achieved performance on the ImageNet classification task comparable to that of intricately designed lightweight ViTs.
- Self-supervised Pre-training of Lightweight ViTs Can Hardly Benefit from “LLM-like” Scaling of Data: Scaling up the pre-training data brought little improvement to MIM pre-training, indicating that the limited capacity of lightweight ViTs restricts the representation quality they can attain.
- Although MIM Pre-training Outperforms CL on ImageNet, It Shows Worse Performance in Transfer Learning Evaluation: In particular, on downstream tasks with insufficient data, MIM pre-training performed worse than CL pre-training.
Conclusion
Through systematic observation, analysis, and targeted solutions, this paper develops an improved MIM pre-training method for lightweight ViTs. Specifically, the proposed decoupled distillation strategy not only enables pre-trained lightweight ViTs to learn recognition-relevant semantics in the higher layers but also preserves the useful locality inductive bias brought by MIM pre-training. Experimental results show that the method achieves significant performance improvements on multiple downstream tasks, including ImageNet classification, ADE20k semantic segmentation, and LaSOT single-object tracking.
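To make the decoupled strategy concrete, one plausible form of the training objective combines a masked-pixel reconstruction loss with a teacher-feature distillation loss computed through a separate head and simply sums the two terms. The sketch below is written under that assumption; all names, loss choices, and weights are hypothetical and do not reproduce the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def decoupled_pretrain_loss(student_latent, recon_pred, pixel_target,
                            teacher_feat, distill_head,
                            recon_weight=1.0, distill_weight=1.0):
    """Hedged sketch of a decoupled objective (hypothetical names and weights).

    The masked-pixel reconstruction loss and the feature-distillation loss are
    computed through separate heads and combined with a weighted sum.
    """
    # MIM branch: regress raw pixels of the masked patches (MAE-style target).
    loss_recon = F.mse_loss(recon_pred, pixel_target)
    # Distillation branch: align projected student features with a frozen
    # teacher's higher-layer features.
    loss_distill = F.smooth_l1_loss(distill_head(student_latent),
                                    teacher_feat.detach())
    return recon_weight * loss_recon + distill_weight * loss_distill
```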
Research Highlights
- Key Findings: Proper pre-training can significantly enhance the performance of extremely simple lightweight ViTs, enabling them to achieve state-of-the-art levels on the ImageNet classification task.
- Significance of the Problem: It addresses the bottleneck issue of lightweight ViTs in pre-training, providing new insights for future lightweight model design.
- Methodological Innovation: The proposed decoupled distillation strategy is a novel approach that effectively improves MIM pre-training by separating the reconstruction task from the distillation task.
Other Valuable Information
In addition to the main content above, the paper provides a comprehensive benchmark of various pre-training methods covering multiple downstream tasks, offering rich reference data for future research. The improved code and raw results have also been released, facilitating reproducibility and follow-up work by other researchers.
Through systematic research, this paper reveals the potential of MIM pre-training on lightweight ViTs and proposes an effective decoupled distillation strategy, providing new directions for the design and optimization of lightweight models.