Few Annotated Pixels and Point Cloud Based Weakly Supervised Semantic Segmentation of Driving Scenes
Background and Research Issues
Semantic segmentation, a critical task in computer vision, has extensive applications in domains like autonomous driving. However, traditional fully-supervised semantic segmentation methods require exhaustive pixel-level annotations, which are costly to acquire. Weakly Supervised Semantic Segmentation (WSSS) leverages coarse annotations (such as image tags, bounding boxes, or point-level annotations) to achieve pixel-level segmentation, significantly reducing annotation costs.
Existing WSSS methods mostly rely on Class Activation Maps (CAMs) to generate initial segmentation seeds but perform poorly in complex driving scenes. These scenes often contain multiple overlapping objects and categories, which challenges traditional image-level WSSS methods.
To address these challenges, this study proposes a novel WSSS framework combining sparse point annotations and point cloud data, aiming to optimize segmentation results in complex driving scenes. This framework uses a few category-point annotations and point cloud data to generate pseudo-labels for training a semantic segmentation network without requiring additional point cloud annotations.
Paper Source
This study, titled "Few Annotated Pixels and Point Cloud Based Weakly Supervised Semantic Segmentation of Driving Scenes," was published in the International Journal of Computer Vision by Huimin Ma, Sheng Yi, Shijie Chen, Jiansheng Chen, and Yu Wang of the University of Science and Technology Beijing and Tsinghua University. The paper was submitted on January 18, 2024, and accepted on October 9, 2024.
Methodology and Process
1. Framework Overview
The study introduces a multi-dimensional feature fusion framework that integrates 2D RGB image features and 3D point cloud features to optimize pseudo-label generation. The framework consists of three key modules:
- 2D Pseudo-Label Generation Module: Extracts high- and low-level features from RGB images and generates initial pseudo-labels using point annotations.
- 3D Feature Clustering Module: Performs unsupervised clustering on point cloud data to generate instance masks and projects them onto RGB images.
- Multi-Level Feature Fusion Module: Combines 2D pseudo-labels with 3D projection masks to generate more accurate pseudo-labels.
2. Pseudo-Label Generation Methods
2.1 Initial Pseudo-Label Generation
- Feature Extraction: Extracts pixel-level (RGB values, superpixels), appearance-level (color distribution, edge features), and semantic-level (saliency, CAM) features.
- Role of Point Annotations: Uses the spatial information from category-point annotations to compute representative feature vectors for each category using the Expectation-Maximization (EM) algorithm.
- Label Assignment: Assigns a pseudo-label to each pixel whose feature similarity to a category's representative vector exceeds a threshold (a sketch of this step follows the list).
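The EM-based prototype estimation and threshold assignment can be illustrated with a small sketch. The snippet below is a minimal, simplified reading of this step, not the authors' implementation: `features` is assumed to be an (H, W, D) per-pixel feature map (the paper combines pixel-, appearance-, and semantic-level cues), `sigma` and `sim_thresh` are illustrative hyperparameters, and `IGNORE = 255` marks pixels left unlabeled.

```python
# Minimal sketch: EM-style class prototypes from sparse point annotations,
# then threshold-based pseudo-label assignment. All names and parameters
# here are illustrative, not taken from the paper's implementation.
import numpy as np

IGNORE = 255  # pixels below the similarity threshold stay unlabeled

def em_prototypes(features, seed_yx, seed_labels, n_iter=10, sigma=0.1):
    """features   : (H, W, D) per-pixel feature map
       seed_yx    : (N, 2) row/col coordinates of annotated points
       seed_labels: (N,) class index of each annotated point"""
    h, w, d = features.shape
    flat = features.reshape(-1, d).astype(np.float64)
    classes = np.unique(seed_labels)
    seed_idx = seed_yx[:, 0] * w + seed_yx[:, 1]

    # Initialize each class prototype from its annotated pixels.
    protos = np.stack([flat[seed_idx[seed_labels == k]].mean(0) for k in classes])
    for _ in range(n_iter):
        # E-step: responsibilities from squared feature distance
        # (sigma is an assumed bandwidth).
        d2 = ((flat[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
        resp = np.exp(-d2 / (2 * sigma ** 2))
        resp /= resp.sum(1, keepdims=True) + 1e-12
        # Pin the annotated pixels to their given class.
        resp[seed_idx] = 0.0
        resp[seed_idx, np.searchsorted(classes, seed_labels)] = 1.0
        # M-step: responsibility-weighted mean per class.
        protos = (resp.T @ flat) / (resp.sum(0)[:, None] + 1e-12)
    return classes, protos

def assign_pseudo_labels(features, classes, protos, sim_thresh=0.9):
    """Label a pixel only if its cosine similarity to the best
       prototype exceeds sim_thresh; otherwise mark it IGNORE."""
    h, w, d = features.shape
    flat = features.reshape(-1, d).astype(np.float64)
    fn = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-12)
    pn = protos / (np.linalg.norm(protos, axis=1, keepdims=True) + 1e-12)
    sim = fn @ pn.T                       # (H*W, C) cosine similarities
    best = sim.argmax(1)
    labels = np.where(sim.max(1) > sim_thresh, classes[best], IGNORE)
    return labels.reshape(h, w)
```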
2.2 Point Cloud Feature Clustering
- Ground Point Removal: Removes ground points by fitting a plane and eliminating points near it.
- Clustering Algorithm: Uses the DBSCAN algorithm to cluster point cloud data into object instances, forming 3D instance masks.
- Projection onto RGB Images: Projects the 3D instance masks onto the 2D image plane via the camera calibration to create sparse 2D projection masks (a sketch of the 3D stage follows this list).
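The 3D stage maps onto standard tools, as sketched below under a few assumptions: a simple RANSAC loop stands in for the paper's plane fitting, DBSCAN comes from scikit-learn, and `proj` denotes a 3x4 camera projection matrix applied to points already expressed in the camera frame (KITTI provides the required calibration). All thresholds are illustrative.

```python
# Minimal sketch of the 3D stage: ground removal by plane fitting, DBSCAN
# clustering into instances, and projection onto the image plane.
import numpy as np
from sklearn.cluster import DBSCAN

def remove_ground(points, n_iter=100, dist_thresh=0.2, rng=None):
    """Fit a plane with a simple RANSAC loop and drop points near it."""
    rng = np.random.default_rng(0) if rng is None else rng
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iter):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-6:
            continue
        normal /= norm
        dist = np.abs((points - sample[0]) @ normal)
        inliers = dist < dist_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return points[~best_inliers], points[best_inliers]  # non-ground, ground

def cluster_instances(points, eps=0.5, min_samples=10):
    """DBSCAN over 3D coordinates; label -1 marks noise points."""
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)

def project_to_image(points, proj, image_shape):
    """Project 3D points (camera frame) with a 3x4 matrix; return pixel
       coordinates and a mask of points that land inside the image."""
    homo = np.hstack([points, np.ones((len(points), 1))])
    uvw = homo @ proj.T
    uv = uvw[:, :2] / uvw[:, 2:3]
    h, w = image_shape
    valid = (uvw[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
            & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv.astype(int), valid
```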
2.3 Multi-Dimensional Feature Fusion
Combines the 2D pseudo-labels with the 3D projection masks to generate the final pseudo-labels:
- Fusion Rules: Uses the object class proportions within each projected mask to assign final labels, refining noisy areas in the initial pseudo-labels.
- Ground Label Correction: Refines ground-class labels using the projected ground points.
A minimal sketch of this fusion step follows.
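The snippet below sketches one plausible reading of the fusion rule: the dominant class of the initial 2D pseudo-labels inside each projected instance mask is propagated to that mask, and projected ground points overwrite the ground class. The dominance threshold, the IGNORE convention, and the dense `instance_ids_2d` map are assumptions for illustration (the raw projections are sparse).

```python
# Minimal fusion sketch: per-mask majority voting over the initial 2D
# pseudo-labels, plus ground correction from projected ground points.
# min_ratio is an assumed threshold, not a value from the paper.
import numpy as np

IGNORE = 255

def fuse_labels(labels_2d, instance_ids_2d, ground_mask_2d,
                ground_class=0, min_ratio=0.5):
    """labels_2d       : (H, W) initial pseudo-labels (IGNORE = unlabeled)
       instance_ids_2d : (H, W) projected 3D instance ids (-1 = no point)
       ground_mask_2d  : (H, W) bool, pixels hit by projected ground points"""
    fused = labels_2d.copy()
    for inst in np.unique(instance_ids_2d):
        if inst < 0:
            continue
        region = instance_ids_2d == inst
        votes = labels_2d[region]
        votes = votes[votes != IGNORE]
        if votes.size == 0:
            continue
        classes, counts = np.unique(votes, return_counts=True)
        # Propagate the dominant class only if it clearly wins the vote.
        if counts.max() / votes.size >= min_ratio:
            fused[region] = classes[counts.argmax()]
    # Ground label correction: trust the projected ground points.
    fused[ground_mask_2d] = ground_class
    return fused
```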
3. Network Training
The final pseudo-labels are used to train a fully-supervised semantic segmentation network (DeepLab-v2), learning category features across samples.
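A minimal training sketch is shown below. The paper trains DeepLab-v2; torchvision only ships DeepLab-v3, which is used here purely as a stand-in. `pseudo_label_loader` is a hypothetical DataLoader yielding (image, pseudo_label) batches, NUM_CLASSES = 19 assumes the Cityscapes-style label set used by the KITTI semantics benchmark, and the optimizer settings are illustrative rather than the paper's.

```python
# Minimal sketch: train a segmentation network on the fused pseudo-labels.
# DeepLab-v3 from torchvision is a stand-in for the paper's DeepLab-v2;
# hyperparameters and the data loader are assumptions.
import torch
from torch import nn
from torchvision.models.segmentation import deeplabv3_resnet50

NUM_CLASSES = 19  # assumed Cityscapes-style classes on KITTI semantics
model = deeplabv3_resnet50(weights=None, num_classes=NUM_CLASSES)
criterion = nn.CrossEntropyLoss(ignore_index=255)  # skip unlabeled pixels
optimizer = torch.optim.SGD(model.parameters(), lr=2.5e-4,
                            momentum=0.9, weight_decay=5e-4)

def train_one_epoch(model, pseudo_label_loader, device="cuda"):
    model.to(device).train()
    for images, pseudo_labels in pseudo_label_loader:
        images = images.to(device)
        pseudo_labels = pseudo_labels.to(device).long()
        logits = model(images)["out"]             # (B, C, H, W)
        loss = criterion(logits, pseudo_labels)   # IGNORE pixels excluded
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```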
Experiments and Results
1. Dataset and Evaluation Metrics
The experiments are conducted on the KITTI dataset, which contains 200 training images and 200 testing images. The evaluation metric is Mean Intersection over Union (mIoU).
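For reference, mIoU can be computed from a confusion matrix as in the minimal sketch below; this is the standard metric definition, not code from the paper.

```python
# Minimal sketch of mean Intersection over Union (mIoU): per-class IoU from
# a confusion matrix, averaged over classes present in the ground truth.
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """pred, gt: (H, W) integer label maps."""
    valid = gt != ignore_index
    # Confusion matrix via a flat bincount over (gt, pred) index pairs.
    idx = gt[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    conf = np.bincount(idx, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    iou = inter / np.maximum(union, 1)
    return iou[union > 0].mean()   # average over classes that appear
```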
2. Experimental Results
Performance Comparison
The proposed method achieves significant performance improvements over other WSSS methods on the KITTI dataset:
- Training Set: Achieves an mIoU of 25.4% (class) and 46.7% (category), outperforming image-level-based methods.
- Test Set: Achieves an mIoU of 21.6% (class) and 48.0% (category), demonstrating the effectiveness of the framework.
Annotation Efficiency
Compared with fully supervised methods, which require 430.5 hours to annotate 10,000 images, the proposed method needs only 0.9 hours to annotate 19 point-level labels, drastically reducing annotation costs.
3. Ablation Studies
The study evaluates various feature fusion strategies and highlights the critical role of 3D point cloud features. Performance significantly drops without 3D features, underscoring their importance in refining pseudo-labels for complex scenes.
Significance and Highlights
Academic Contributions:
- Introduces a WSSS framework combining 2D and 3D features, significantly improving segmentation performance in complex scenes.
- Proposes unsupervised point cloud clustering and 2D projection methods, offering new insights for future research.
Practical Value:
- Reduces annotation costs for semantic segmentation tasks.
- Enhances segmentation accuracy in domains like autonomous driving that involve complex scenes.
Innovations:
- Leverages point cloud spatial information to optimize pseudo-labels.
- Does not require additional annotations for point cloud data, fully utilizing its latent information.
Conclusion
This study addresses the limitations of existing WSSS methods in complex driving scenes by leveraging 3D point cloud features. The proposed framework effectively refines pseudo-labels using multi-dimensional feature fusion. Future work may extend this approach to other datasets and explore additional dimensions of feature fusion methods.