Unsupervised Domain Adaptive Segmentation Algorithm Based on Two-Level Category Alignment
Semantic segmentation aims to predict a category label for each pixel in an image (Liu et al., 2021; Wang et al., 2021) and is widely used in scene understanding, medical image analysis, autonomous driving, geographic information systems, and augmented reality (Strudel et al., 2021; Sun et al., 2023). Although deep neural networks have significantly improved segmentation performance (Chen et al., 2014; Guan et al., 2021; Zhao et al., 2017), these advances require large amounts of pixel-level annotated training data, which is costly to obtain in real-world scenarios (Jiang et al., 2022; Liang et al., 2023). Furthermore, when there is a distribution shift between the testing data and the training data, the performance of most segmentation methods degrades (Huang et al., 2022). To address these issues, researchers have proposed unsupervised domain adaptation (UDA) methods to enhance model generalization (Xu et al., 2021).
Source of the Paper
The title of this paper is “Unsupervised Domain Adaptive Segmentation Algorithm Based on Two-Level Category Alignment,” written by Dong Wenyong and his team from the School of Computer Science at Wuhan University, including Liang Zhixue, Wang Liping, Tian Gang, and Long Qianhui. This paper was published in the journal “Neural Networks” in 2024, under the article number 106399.
Research Background and Problem
Currently, most unsupervised domain adaptation segmentation methods focus primarily on pixel-level local features but ignore category information cues. This limits the segmentation network to learning only global cross-domain invariant features while ignoring fine-grained cross-domain invariant features, leading to degraded segmentation performance. To address this issue, this paper proposes an unsupervised domain adaptation algorithm based on two-level category alignment (UDA$_{CA}^+$) for semantic segmentation tasks.
Research Process and Methods
Overall Architecture
The architecture of UDA$_{CA}^+$ is shown in Figure 1, mainly including the ClassMix module, student network, and teacher network, as well as image-level and pixel-level category alignment modules. The network contains three branches: target domain branch ($B_t$), source domain branch ($B_s$), and mixed domain branch ($B_m$).
Research Objects and Processing Steps
Source and Target Domain Datasets:
- Source Domain Dataset: Daytime scene images from synthetic environments (the GTA and Synthia datasets).
- Target Domain Dataset: Corresponding scene images from real environments (the Cityscapes dataset).
- Processing: All data underwent preprocessing operations such as size scaling, random cropping, random horizontal flipping, and RGB mean normalization.
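The preprocessing steps listed above can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's pipeline: the function name `preprocess`, its parameters, and the 0.5 flip probability are assumptions, and size scaling is omitted for brevity. In practice the label map would need the same crop and flip as the image.

```python
import numpy as np

def preprocess(image, crop_size, mean, rng):
    """Random crop, random horizontal flip, and RGB mean subtraction.

    image: (H, W, 3) array; crop_size: (crop_h, crop_w);
    mean: length-3 per-channel RGB mean (hypothetical values).
    """
    h, w, _ = image.shape
    crop_h, crop_w = crop_size
    # Pick the top-left corner of the crop uniformly at random.
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    out = image[top:top + crop_h, left:left + crop_w].astype(np.float32)
    # Flip along the width axis with probability 0.5 (assumed).
    if rng.random() < 0.5:
        out = out[:, ::-1]
    return out - np.asarray(mean, dtype=np.float32)
```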
Source Domain Model Training:
- Input: Source domain image $x_s$.
- Output: Using the semantic segmentation student network $g_{\theta}$ to obtain prediction $y_s$.
- Loss: Standard cross-entropy loss was used to constrain the student network.
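As an illustration of the supervised source-domain loss, the sketch below computes a pixel-wise cross-entropy from raw class scores. The function name, the `(C, H, W)` layout, and the `ignore_index=255` convention for unlabeled pixels are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def pixel_cross_entropy(logits, labels, ignore_index=255):
    """Mean cross-entropy over labeled pixels.

    logits: (C, H, W) raw class scores from the student network.
    labels: (H, W) integer class ids; ignore_index marks unlabeled pixels.
    """
    # Log-softmax over the class axis, shifted for numerical stability.
    shifted = logits - logits.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    valid = labels != ignore_index
    # Replace ignored labels with a dummy class so indexing stays in range.
    safe = np.where(valid, labels, 0)
    picked = np.take_along_axis(log_probs, safe[None], axis=0)[0]
    return float(-(picked * valid).sum() / max(valid.sum(), 1))
```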
Target Domain Model Training:
- Input: Target domain image $x_t$.
- Output: Using the teacher network $h_{\phi}$ to obtain prediction $y_t$, and further generate pseudo-labels.
- Pseudo-labels: Categories were determined through maximum probability values, and a confidence calculation method was introduced to mitigate negative transfer and over-alignment issues.
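The pseudo-labeling step can be sketched as below. The argmax over class probabilities follows the description above; the confidence weight, computed here as the fraction of pixels whose maximum probability exceeds a threshold, is one common choice in self-training and is only an assumption, since the paper's exact confidence calculation is not given here. The threshold value is hypothetical.

```python
import numpy as np

def pseudo_labels(probs, threshold=0.9):
    """Derive pseudo-labels and a confidence weight from teacher predictions.

    probs: (C, H, W) softmax output of the teacher network.
    Returns per-pixel labels, per-pixel confidence, and a scalar weight.
    """
    labels = probs.argmax(axis=0)          # class with maximum probability
    confidence = probs.max(axis=0)         # that maximum probability
    # Assumed weighting: share of pixels the teacher is confident about.
    weight = float((confidence > threshold).mean())
    return labels, confidence, weight
```

A low weight down-weights the target-domain loss for images where the teacher is mostly unsure, which is one way to limit the effect of noisy pseudo-labels.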
Two-Level Category Alignment Strategy
- Image-Level Category Alignment (IDA): Based on Class Activation Map (CAM), focusing on deep information of categories, such as position, distribution, and feature centers.
- Pixel-Level Category Alignment (PDA): Based on pseudo-labels, focusing on shallow information of categories, such as texture, color, and local context.
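For readers unfamiliar with class activation maps, the sketch below shows the standard CAM construction: each class map is a weighted sum of feature channels using that class's classifier weights. How the paper builds and uses its CAMs may differ; the function name, the ReLU, and the per-class max normalization are assumptions for illustration.

```python
import numpy as np

def class_activation_maps(features, class_weights):
    """Standard CAM: weight feature channels by classifier weights.

    features: (K, H, W) feature maps from the backbone.
    class_weights: (C, K) weights of a linear classifier on pooled features.
    Returns (C, H, W) maps scaled to [0, 1] per class.
    """
    cams = np.einsum("ck,khw->chw", class_weights, features)
    cams = np.maximum(cams, 0.0)           # keep positive evidence only
    peak = cams.reshape(cams.shape[0], -1).max(axis=1)[:, None, None]
    return cams / np.maximum(peak, 1e-8)   # avoid division by zero
```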
Adversarial Learning Strategy
- Feature Space Adversarial Learning: In the feature space, adversarial learning aligns category feature centers between the source domain and the target domain, balancing the feature distribution of different categories.
- Output Space Adversarial Learning: In the output space, adversarial learning further aligns the spatial distribution maps of categories, achieving alignment of both global and local information.
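The adversarial objective behind both alignment steps can be sketched with the common binary cross-entropy formulation: a discriminator learns to tell source from target, and the segmentation network is trained so that target inputs are classified as source. The paper's exact losses are not stated here, so the function names and the label convention (source = 1, target = 0) are assumptions.

```python
import numpy as np

def bce_with_logits(logits, target):
    """Numerically stable binary cross-entropy on raw discriminator scores."""
    loss = np.maximum(logits, 0) - logits * target + np.log1p(np.exp(-np.abs(logits)))
    return float(np.mean(loss))

def discriminator_loss(d_source, d_target):
    """Discriminator objective: source scored as 1, target scored as 0."""
    return 0.5 * (bce_with_logits(d_source, 1.0) + bce_with_logits(d_target, 0.0))

def adversarial_loss(d_target):
    """Segmentation-network objective: make target look like source."""
    return bce_with_logits(d_target, 1.0)
```

The same pair of losses can be applied to discriminator scores computed on feature-space inputs or on output-space inputs; only what is fed to the discriminator changes.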
Mixed Domain Strategy
- Image Mixing Strategy: Using the ClassMix method to generate mixed images $x_m$ and their labels $y_m$, optimizing the UDA segmentation model through adversarial learning and self-training.
- Joint Alignment Strategy: The mixed domain branch uses the IDA and PDA modules to perform adversarial learning in both the feature and output spaces, optimizing the UDA segmentation model.
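The image mixing step can be illustrated with a ClassMix-style sketch: pixels belonging to a random half of the classes in the source label are pasted onto the target image, and the mixed label combines source ground truth with target pseudo-labels. The function name and the rounding of "half" (rounded up so at least one class is chosen) are assumptions; the paper's configuration may differ.

```python
import numpy as np

def class_mix(x_s, y_s, x_t, y_t_pseudo, rng):
    """Paste a random half of the source classes onto the target image.

    x_s, x_t: (H, W, 3) source and target images.
    y_s: (H, W) source labels; y_t_pseudo: (H, W) target pseudo-labels.
    Returns the mixed image x_m, mixed label y_m, and the binary paste mask.
    """
    classes = np.unique(y_s)
    n_chosen = (len(classes) + 1) // 2     # half of the classes, rounded up
    chosen = rng.choice(classes, size=n_chosen, replace=False)
    mask = np.isin(y_s, chosen)            # True where source pixels are pasted
    x_m = np.where(mask[..., None], x_s, x_t)
    y_m = np.where(mask, y_s, y_t_pseudo)
    return x_m, y_m, mask
```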
Results and Analysis
Quantitative Experimental Results
The experimental results on the GTA→Cityscapes and Synthia→Cityscapes tasks show that UDA$_{CA}^+$ significantly improves segmentation performance, surpassing previous SOTA methods. Specifically:
- In the GTA→Cityscapes task, UDA$_{CA}^+$ achieved 69.7% mIoU, improving by 21.4% compared to the baseline model SegFormer.
- In the Synthia→Cityscapes task, UDA$_{CA}^+$ improved by 20.3% and 21.1% in the performance of 16 categories (mIoU16) and 13 categories (mIoU13), respectively.
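For reference, mIoU (mean intersection over union) can be computed from a confusion matrix as sketched below. This is a generic illustration rather than the paper's evaluation code: the function name, the `ignore_index=255` convention, and the choice to average only over classes that appear in the prediction or ground truth are assumptions.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes present in the prediction or the ground truth.

    pred, target: integer arrays of the same shape holding class ids.
    """
    valid = target != ignore_index
    # Encode each (target, pred) pair as one index and count occurrences.
    idx = target[valid].astype(np.int64) * num_classes + pred[valid]
    conf = np.bincount(idx, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    present = union > 0
    return float((inter[present] / union[present]).mean())
```

mIoU16 and mIoU13 restrict the average to a fixed subset of 16 or 13 classes.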
Qualitative Experimental Results
As shown in Figure 6, UDA$_{CA}^+$ predicts scene content more accurately than SOTA methods such as DAFormer, with the most visible improvements on grass, trees, sidewalks, buildings, and walls.
Ablation Experiments
Detailed ablation experiments were conducted to study the impact of the two-level category alignment module and the adversarial learning module. The results show that:
- The combination of image-level and pixel-level category alignment modules improves the algorithm’s performance more significantly than adding any single module alone.
- Joint adversarial learning in the feature space and output space further enhances the segmentation network’s ability to capture cross-domain invariance.
Research Conclusions
The proposed UDA semantic segmentation algorithm successfully mitigates the domain shift problem between the source and target domains through the two-level category alignment strategy in the feature and output spaces. Experimental results verify the effectiveness of the proposed strategy, achieving SOTA performance in two synthetic-to-real adaptive tasks. Future research can further optimize the generation of class activation maps to improve model performance.