Weakly Supervised Semantic Segmentation via Alternate Self-Dual Teaching

(Figure: the ASDT model for WSSS presented in this paper.)

Background Introduction

With the continuous development of computer vision, semantic segmentation has become an important and active research direction. Traditional semantic segmentation methods rely on manually labeled pixel-level annotations; however, obtaining such precise annotations usually requires substantial human effort and time. To address this issue, Weakly Supervised Semantic Segmentation (WSSS) has been proposed in recent years, aiming to achieve effective semantic segmentation from weak labels (such as image-level tags, bounding boxes, or scribbles) while minimizing manual annotation.

This study focuses on weakly supervised semantic segmentation from image-level tags, the most challenging setting among all forms of weak supervision. Current methods primarily rely on image classification models to generate pseudo segmentation masks (PSMs). However, classification features emphasize only the most discriminative object parts, so the resulting pseudo masks respond unevenly across object regions and lack boundary detail. To address this, the paper proposes an Alternate Self-Dual Teaching (ASDT) learning framework, built on a dual-teacher single-student network architecture, to generate high-quality PSMs.

Paper Source

This paper, titled “Weakly Supervised Semantic Segmentation via Alternate Self-Dual Teaching,” was authored by Dingwen Zhang, Hao Li, Wenyuan Zeng, Chaowei Fang, Lechao Cheng, Ming-Ming Cheng, and Junwei Han, and was published in IEEE Transactions on Image Processing. The research was supported by the Key R&D Projects of Guangdong Province and the National Natural Science Foundation of China.

Research Process

Overview

The research process comprises several stages:

  1. Feature Extraction: First, extract features from the image using a backbone network.
  2. Dual-Teacher Learning: Utilize classification and segmentation teacher networks to respectively generate discriminative object part features and full object region features.
  3. Alternate Distillation Learning: Transfer knowledge generated by the dual-teacher models to the student network through an alternate distillation algorithm to guide the generation of pseudo segmentation masks.
  4. Post-processing: Apply Conditional Random Fields (CRFs) to refine the segmentation results (a minimal CRF sketch follows this list).
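
To make the post-processing step concrete, here is a minimal sketch of dense-CRF refinement using the pydensecrf library; the kernel parameters and iteration count are illustrative assumptions, not the settings used in the paper.

```python
# Sketch of dense-CRF refinement for a soft segmentation map, using the
# pydensecrf library. Kernel parameters and iteration count are illustrative
# assumptions, not the settings reported in the paper.
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image: np.ndarray, probs: np.ndarray, n_iters: int = 10) -> np.ndarray:
    """image: (H, W, 3) uint8 RGB; probs: (C, H, W) per-class softmax scores."""
    c, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, c)
    d.setUnaryEnergy(unary_from_softmax(probs))        # unary term: -log p
    d.addPairwiseGaussian(sxy=3, compat=3)             # smoothness kernel
    d.addPairwiseBilateral(sxy=80, srgb=13,            # appearance kernel
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = d.inference(n_iters)                           # mean-field inference
    return np.argmax(np.array(q).reshape(c, h, w), axis=0)  # (H, W) labels
```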

Feature Extraction

Feature extraction uses a fully convolutional network (such as a ResNet) to extract features from the input image. The resulting feature map is shared by the subsequent dual-teacher learning process.
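
As a rough illustration of this step, the following PyTorch sketch truncates a torchvision ResNet-50 before global pooling so that it outputs a spatial feature map (the paper's actual backbone configuration may differ):

```python
# Minimal PyTorch sketch of the feature-extraction step: a torchvision
# ResNet-50 truncated before global pooling so it yields a spatial feature
# map. The paper's exact backbone (strides, dilation, pretraining) may differ.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = resnet50(weights=None)  # load ImageNet weights in practice
        # Keep all layers up to and including the last residual stage;
        # drop the average pooling and the classification head.
        self.features = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.features(x)  # (B, 2048, H/32, W/32)

feat = Backbone()(torch.randn(1, 3, 321, 321))
print(feat.shape)  # torch.Size([1, 2048, 11, 11])
```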

Dual-Teacher Learning

  • Classification Teacher Network (Class-Teacher Branch): This branch generates discriminative object-part features. Global Average Pooling (GAP) and a fully connected layer produce image-level predictions; weighting the feature map with the classifier weights yields Class Activation Maps (CAMs), from which the Trustful Semantic Localization used in the subsequent distillation is derived (see the sketch after this list).

    • Loss Function: Cross-Entropy Loss (Lce).
  • Segmentation Teacher Network (Seg-Teacher Branch): This branch generates full object-region features through dilated convolution layers followed by a softmax operation, and is guided by the discriminative object-part features generated within the same network.

    • Loss Function: Energy-Based Loss.
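
To illustrate how the two teacher heads could sit on top of the shared feature map, here is a minimal PyTorch sketch: a 1x1 convolutional classifier (equivalent to GAP followed by a fully connected layer) whose weights directly yield CAMs, and a dilated-convolution head ending in a softmax. Layer widths, class counts (PASCAL VOC), and the CAM normalization are assumptions for illustration; the paper's exact heads and its energy-based loss are not reproduced.

```python
# Illustrative sketch of the two teacher heads on top of the shared feature
# map. Layer widths, class counts, and the CAM normalization are assumptions;
# the paper's exact heads and its energy-based loss are not reproduced here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassTeacher(nn.Module):
    """1x1 conv classifier (equivalent to GAP + FC); its weights yield CAMs."""
    def __init__(self, in_ch: int = 2048, n_classes: int = 20):
        super().__init__()
        self.fc = nn.Conv2d(in_ch, n_classes, kernel_size=1, bias=False)

    def forward(self, feat):
        cam = self.fc(feat)                                  # (B, C, h, w)
        logits = F.adaptive_avg_pool2d(cam, 1).flatten(1)    # image-level scores
        cam = F.relu(cam)
        cam = cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-5)  # max-normalize
        return logits, cam

class SegTeacher(nn.Module):
    """Dilated conv head producing dense per-pixel class probabilities."""
    def __init__(self, in_ch: int = 2048, n_classes: int = 21):  # +1 background
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=2, dilation=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, n_classes, 1),
        )

    def forward(self, feat):
        return torch.softmax(self.head(feat), dim=1)         # (B, C+1, h, w)
```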

Alternate Self-Dual Teaching

During this stage, the study proposes an alternate distillation mechanism that alternately transfers the knowledge generated by the two teacher branches to the student network. Specifically, a pulse-width (PW) wave-like signal, inspired by Pulse Width Modulation (PWM), controls which teacher the student distills knowledge from at any given time, preventing the student model from falling into local optima. The Alternate Distillation Loss (Lad) combines the distillation losses from the classification teacher to the student and from the segmentation teacher to the student.

Through the alternate distillation mechanism, the student network can more stably acquire reliable pseudo segmentation masks, alleviating the impact of teacher model errors on the student model learning process.
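
As a minimal sketch of this mechanism, assuming a fixed-period square-wave selection signal and a pixel-wise KL distillation loss (both illustrative choices rather than the paper's exact formulation):

```python
# Sketch of the alternation schedule: a square, PW-wave-like signal selects
# which teacher supervises the student at each training step. The period and
# the pixel-wise KL distillation loss are illustrative assumptions.
import torch
import torch.nn.functional as F

def teacher_select(step: int, period: int = 1000) -> str:
    """First half of each period: class teacher; second half: seg teacher."""
    return "class" if (step % period) < period // 2 else "seg"

def distill_loss(student_logits: torch.Tensor, teacher_probs: torch.Tensor) -> torch.Tensor:
    """Pixel-wise KL divergence from the teacher's soft target to the student."""
    return F.kl_div(F.log_softmax(student_logits, dim=1),
                    teacher_probs, reduction="batchmean")

# Inside a training loop (tensor names are placeholders):
#   target = cam_probs if teacher_select(step) == "class" else seg_probs
#   loss_ad = distill_loss(student_logits, target.detach())
```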

Experiments and Results

The effectiveness of the ASDT framework was validated on the PASCAL VOC 2012 and COCO-Stuff 10k datasets. The experimental results demonstrate that the ASDT framework achieves state-of-the-art segmentation performance.

  • PASCAL VOC 2012: Achieved significant performance improvements on both the validation and test sets, reaching mIoU (Mean Intersection over Union) scores of 68.5% and 68.4%, respectively (a sketch of the mIoU computation follows this list).
  • COCO-Stuff 10k: The ASDT framework also performed well on this dataset, improving the mIoU by 0.6% compared to the existing state-of-the-art methods.
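
For reference, a minimal sketch of how the mIoU metric reported above can be computed:

```python
# Minimal sketch of the mIoU metric: per-class intersection over union from a
# confusion matrix, averaged over the classes that actually appear.
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, n_classes: int) -> float:
    """pred, gt: integer label maps of identical shape."""
    idx = n_classes * gt.ravel() + pred.ravel()
    conf = np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    valid = union > 0                      # skip classes absent from both maps
    return float((inter[valid] / union[valid]).mean())
```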

Ablation Studies

Ablation studies analyzed the effects of different self-distillation strategies and confirmed the advantages of the alternate distillation mechanism. The specific results are shown in the following table:

| Distillation Strategy            | Seg-Teacher (mIoU, %) | Student (mIoU, %) | PSM (mIoU, %) |
| -------------------------------- | --------------------- | ----------------- | ------------- |
| Single Teacher (Classification)  | -                     | 62.6              | -             |
| Single Teacher (Segmentation)    | 62.3                  | 30.4              | 48.5          |
| Direct Combination (Max Value)   | 61.4                  | 40.1              | 53.2          |
| Direct Combination (Mean Value)  | 62.3                  | 40.0              | 53.6          |
| Alternate Dual-Teacher           | 63.8                  | 63.8              | 64.0          |

The results show that the alternate distillation mechanism significantly outperforms direct combination methods in training the student network branch.

Conclusion and Applications

The ASDT framework proposed in this paper introduces a novel dual-teacher single-student architecture that combines knowledge of full object regions with knowledge of discriminative object parts. Through the alternate distillation mechanism, it achieves effective knowledge distillation under weak supervision and significantly improves model performance. The method performs strongly on the PASCAL VOC and COCO-Stuff datasets, suggesting broad applicability. In the future, the research team plans to extend the ASDT mechanism to a wider range of weakly supervised learning tasks, such as weakly supervised object detection and instance segmentation.

The contributions of this paper include:

  1. Reassessing the critical factors for generating high-quality pseudo segmentation masks, revealing the importance of discriminative object parts and full object regions in weakly supervised semantic segmentation.
  2. Proposing a novel alternate distillation mechanism that enables the student model to avoid local optima caused by the teachers' errors by alternately distilling the two types of knowledge under weak supervision.
  3. Demonstrating experimentally that the proposed method achieves state-of-the-art segmentation performance on the PASCAL VOC 2012 and COCO-Stuff 10k datasets.