SEAFormer++: Squeeze-Enhanced Axial Transformer for Mobile Visual Recognition
An Efficient Transformer Architecture Designed for Mobile Visual Recognition
Research Background and Problem Statement
In recent years, computer vision has undergone a significant shift from Convolutional Neural Networks (CNNs) to Transformer-based methods. Although Vision Transformers demonstrate excellent global context modeling in many tasks, their high computational cost and memory requirements make them difficult to deploy on mobile devices, especially when processing high-resolution images. To meet the demand for low latency and high efficiency on mobile devices, researchers have proposed various lightweight methods, such as local attention mechanisms, axial attention, and dynamic graph message passing; however, these methods still do not adequately address the high latency associated with high-resolution inputs.
To tackle this challenge, Qiang Wan et al. introduced the Squeeze-Enhanced Axial Transformer (SEAFormer), which aims to achieve efficient mobile semantic segmentation through innovative attention module design, significantly reducing computational complexity while maintaining high performance. Additionally, the authors incorporated a multi-resolution distillation technique based on feature upsampling, further optimizing the model’s inference speed and accuracy.
Paper Source and Author Information
This paper was co-authored by Qiang Wan (Fudan University), Zilong Huang (ByteDance), Jiachen Lu (Fudan University), Gang Yu (Tencent), and Li Zhang (Fudan University), and was published in the International Journal of Computer Vision in January 2025. The research was supported by the National Natural Science Foundation of China (Grant No. 62376060).
Research Content and Experimental Workflow
a) Research Workflow
The study primarily includes the following key components:
1. Core Module Design: Squeeze-Enhanced Axial Attention (SEA Attention)
SEA Attention is the core component of SEAFormer, designed to extract global semantic information and supplement local details through a "squeeze-enhancement" strategy. Specifically:
- Squeeze Phase: compresses the input feature map along the horizontal or vertical axis into compact row or column representations using adaptive squeezing.
- Enhancement Phase: enhances local details via depthwise separable convolution layers and integrates them with the compressed global features to complete the final feature fusion.
- Position Embedding: to address the loss of positional information during compression, the authors introduce a squeeze axial position embedding, enabling the model to perceive positional information in the compressed features.
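To make the squeeze-attend-broadcast idea concrete, here is a minimal single-head PyTorch sketch. It is not the authors' implementation: the adaptive squeeze is simplified to mean pooling, multi-head details are omitted, and all class and parameter names (e.g. `SqueezeAxialAttention`, `max_len`) are ours. The point it illustrates is why squeezing helps: attention cost drops from O((HW)^2) for full self-attention to O(H^2 + W^2) on the squeezed sequences.

```python
import torch
import torch.nn as nn


class SqueezeAxialAttention(nn.Module):
    """Single-head sketch: squeeze -> axial attention -> broadcast -> enhance."""

    def __init__(self, dim, max_len=128):
        super().__init__()
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)
        self.scale = dim ** -0.5
        # Squeeze axial position embeddings, one per axis (assumes H, W <= max_len),
        # compensating for position information lost by the squeeze.
        self.pos_h = nn.Parameter(torch.zeros(1, dim, max_len))
        self.pos_w = nn.Parameter(torch.zeros(1, dim, max_len))
        # Detail-enhancement path: a depthwise + pointwise convolution over q/k/v
        # stands in for the paper's depthwise separable enhancement branch.
        self.enhance = nn.Sequential(
            nn.Conv2d(dim * 3, dim * 3, 3, padding=1, groups=dim * 3),
            nn.Conv2d(dim * 3, dim, 1),
        )

    def _attend(self, q, k, v):
        # q, k, v: (B, C, L) sequences along a single axis.
        attn = torch.softmax(q.transpose(1, 2) @ k * self.scale, dim=-1)  # (B, L, L)
        return v @ attn.transpose(1, 2)  # (B, C, L)

    def forward(self, x):
        B, C, H, W = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        # Squeeze: mean-pool away one axis (the paper uses a learned adaptive
        # squeeze) to get compact (B, C, H) and (B, C, W) sequences.
        qh, kh = q.mean(3) + self.pos_h[..., :H], k.mean(3) + self.pos_h[..., :H]
        qw, kw = q.mean(2) + self.pos_w[..., :W], k.mean(2) + self.pos_w[..., :W]
        # Attend along each axis, then broadcast back over the other axis.
        out = (self._attend(qh, kh, v.mean(3)).unsqueeze(3) +   # (B, C, H, 1)
               self._attend(qw, kw, v.mean(2)).unsqueeze(2))    # (B, C, 1, W)
        # Enhancement: convolution restores the local detail the squeeze discarded.
        return out + self.enhance(torch.cat([q, k, v], dim=1))


x = torch.randn(2, 64, 32, 32)
print(SqueezeAxialAttention(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```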
2. Dual-Branch Network Architecture
SEAFormer adopts a dual-branch structure consisting of a Context Branch and a Spatial Branch:
- Context Branch: focuses on capturing high-level semantic information by stacking multiple SEAFormer layers.
- Spatial Branch: preserves low-level spatial details, with features from the context branch fused in to enrich its semantics.
- Fusion Block: fuses features from the context and spatial branches; sigmoid multiplication was identified as the optimal fusion method.
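A hedged sketch of the sigmoid-multiplication fusion described above, assuming a simple 1x1 conv + BatchNorm embedding on each branch (the exact layout and the names `FusionBlock`, `embed_spatial`, `embed_context` are our illustration, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionBlock(nn.Module):
    """Sigmoid-multiplication fusion: context semantics gate spatial details."""

    def __init__(self, spatial_dim, context_dim):
        super().__init__()
        self.embed_spatial = nn.Sequential(
            nn.Conv2d(spatial_dim, spatial_dim, 1, bias=False),
            nn.BatchNorm2d(spatial_dim),
        )
        self.embed_context = nn.Sequential(
            nn.Conv2d(context_dim, spatial_dim, 1, bias=False),
            nn.BatchNorm2d(spatial_dim),
        )

    def forward(self, spatial_feat, context_feat):
        # Project the low-resolution context features and upsample them
        # to the spatial branch's resolution.
        gate = self.embed_context(context_feat)
        gate = F.interpolate(gate, size=spatial_feat.shape[2:],
                             mode='bilinear', align_corners=False)
        # Sigmoid multiplication: high-level semantics act as an attention
        # gate over the detail-rich spatial features.
        return self.embed_spatial(spatial_feat) * torch.sigmoid(gate)
```

The gating direction is the design point: semantics modulate spatial details multiplicatively rather than being summed with them, which is the variant the ablation above reports as best.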
3. Multi-Resolution Distillation Technique
To further reduce inference latency, the authors proposed a multi-resolution distillation framework based on feature upsampling:
- Student Model: trained with low-resolution inputs; MobileNetV2 modules upsample its features to match the resolution of the teacher model.
- Loss Function: combines a classification loss, a cross-model classification loss, a feature similarity loss, and an output similarity loss, ensuring that the student model effectively mimics the behavior of the teacher model.
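The four-term objective can be sketched as follows. The loss weights, the temperature, the use of MSE for feature similarity, the teacher-argmax form of the cross-model classification loss, and the ignore-index value (a common segmentation convention) are all assumptions for illustration; the paper's exact formulation may differ.

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits,
                      student_feat_up, teacher_feat, target,
                      T=1.0, alpha=1.0, beta=1.0, gamma=1.0):
    """student/teacher logits: (B, num_classes, H, W); target: (B, H, W)."""
    # 1) Classification loss against ground-truth labels
    #    (255 = ignored pixels, a common segmentation convention).
    loss_cls = F.cross_entropy(student_logits, target, ignore_index=255)
    # 2) Cross-model classification loss: here the teacher's hard
    #    predictions serve as an extra label source for the student.
    loss_cm = F.cross_entropy(student_logits, teacher_logits.argmax(dim=1))
    # 3) Feature similarity: the student's upsampled features should
    #    match the teacher's high-resolution features.
    loss_feat = F.mse_loss(student_feat_up, teacher_feat)
    # 4) Output similarity: KL divergence between temperature-softened logits.
    loss_out = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                        F.softmax(teacher_logits / T, dim=1),
                        reduction='batchmean') * T * T
    return loss_cls + alpha * loss_cm + beta * loss_feat + gamma * loss_out
```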
4. Experimental Setup
- Datasets: ADE20K, Cityscapes, Pascal Context, and COCO-Stuff.
- Evaluation Metrics: mIoU (mean Intersection over Union; a minimal computation sketch follows this list), parameter count (Params), floating-point operations (FLOPs), and inference latency (Latency).
- Hardware Platform: all latency figures were measured on a single Qualcomm Snapdragon 865 processor, using only its ARM CPU cores.
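For reference, a minimal NumPy sketch of the mIoU metric as it is commonly computed for segmentation, from a confusion matrix over all pixels; ignore-label handling is omitted, and labels are assumed to lie in [0, num_classes).

```python
import numpy as np


def mean_iou(pred, gt, num_classes):
    """pred, gt: integer label maps of identical shape, values in [0, num_classes)."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    # conf[i, j] counts pixels whose ground truth is i and prediction is j.
    np.add.at(conf, (gt.ravel(), pred.ravel()), 1)
    tp = np.diag(conf).astype(np.float64)             # true positives per class
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp  # pred + gt - intersection
    valid = union > 0                                 # skip classes absent from both
    return (tp[valid] / union[valid]).mean()
```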
b) Key Results
1. Performance of SEAFormer
Experiments on the ADE20K validation set demonstrate that SEAFormer outperforms existing methods across multiple metrics:
- Small model (SEAFormer-Tiny): achieves 36.8% mIoU at a latency of just 41 ms.
- Medium model (SEAFormer-Small): boosts mIoU to 39.7% at 68 ms.
- Large model (SEAFormer-Large): reaches 43.8% mIoU at 369 ms.
Compared to TopFormer (the state-of-the-art lightweight Transformer at the time), SEAFormer not only improves mIoU (by up to +8.3%) but also significantly reduces latency (by at least 16%).
2. Effectiveness of Multi-Resolution Distillation
By incorporating multi-resolution distillation, the SEAFormer++ (KD) variants further optimize performance:
- On the ADE20K validation set, SEAFormer-B++ (KD) achieves 39.5% mIoU while reducing latency to 55 ms.
- Compared to conventional low-resolution distillation, multi-resolution distillation improves mIoU by 3.4 percentage points (35.5 vs. 32.1).
3. Performance on Other Tasks
In addition to semantic segmentation, SEAFormer also excels in image classification and object detection:
- Image Classification: on ImageNet-1K, SEAFormer-L++ achieves 80.6% Top-1 accuracy at a latency of just 61 ms.
- Object Detection: on COCO, SEAFormer-L++ achieves 40.2 AP, far surpassing baselines such as MobileNetV3.
c) Conclusions and Significance
Scientific Value
SEAFormer fills a gap in mobile-friendly, efficient Transformer architectures, achieving a leading balance between performance and efficiency in semantic segmentation through its innovative attention mechanism and dual-branch architecture.
Application Value
SEAFormer is not only suitable for semantic segmentation but can also be extended to various tasks such as image classification and object detection, showcasing its potential as a versatile, mobile-friendly backbone network. Additionally, the multi-resolution distillation technique provides new insights for model optimization in resource-constrained environments.
d) Research Highlights
- Innovative Attention Mechanism: SEA Attention significantly reduces computational complexity through adaptive compression and convolutional enhancement while preserving global semantics and local details.
- Efficient Dual-Branch Architecture: The collaborative design of the context and spatial branches enables the model to capture rich semantic information at different scales.
- Multi-Resolution Distillation Technique: Feature upsampling facilitates knowledge transfer between high- and low-resolution models, significantly reducing inference latency.
- Wide Range of Applications: SEAFormer demonstrates excellent performance in semantic segmentation, image classification, and object detection tasks, proving its versatility and robustness.
e) Other Valuable Information
The authors have publicly released the code and models on GitHub. They also provide detailed analyses of the impact of different upsampling modules and loss function configurations, offering valuable reference points for future research.
Summary
The research on SEAFormer++ not only addresses the performance bottlenecks of high-resolution semantic segmentation on mobile devices but also further optimizes model efficiency through multi-resolution distillation techniques. Its innovative design philosophy and broad applicability set a new benchmark for lightweight model development in the field of computer vision.