LiDAR-Guided Geometric Pretraining for Vision-Centric 3D Object Detection
Background
In recent years, multi-camera 3D object detection has garnered significant attention in autonomous driving. However, vision-based methods still struggle to extract precise geometric information from RGB images. Existing approaches typically pretrain the image backbone on depth-related tasks to acquire spatial information, but they overlook the critical view transformation step, which misaligns the spatial knowledge between the image backbone and the view transformation module and degrades performance. To address this issue, the paper proposes a novel geometry-aware pretraining framework, GAPretrain.
Source of the Paper
The paper was authored by Linyan Huang, Huijie Wang, Jia Zeng, et al., affiliated with the Department of Artificial Intelligence at Xiamen University, OpenDriveLab at Shanghai AI Lab, and Shanghai Jiao Tong University. It was published in the International Journal of Computer Vision, received on April 13, 2023 and accepted on January 6, 2025.
Research Workflow and Results
Research Workflow
Unified BEV Representation:
- To bridge the view discrepancies between the sensors, the researchers transformed both image features and point cloud data into a unified bird's-eye-view (BEV) representation. Specifically, the point cloud was processed by a sparse convolutional network and its height dimension was compressed to form a BEV feature map, while the multi-view RGB images were encoded by a 2D backbone and passed through a view transformation module to produce the camera BEV feature map.
- To align the data from the two modalities, the researchers designed a normalization operation that standardizes the BEV feature maps using channel-wise statistics computed over the entire training set (see the sketch after this list).
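The normalization step can be pictured as follows. This is a minimal sketch, not the authors' code; the tensor shapes and the two-pass accumulation scheme are assumptions, but it shows how dataset-level channel-wise statistics standardize BEV feature maps from both modalities.

```python
# Sketch of channel-wise BEV feature normalization: accumulate per-channel
# statistics over the training set, then standardize each channel.
# Assumed input shape: (batch, channels, H, W) BEV feature maps.
import torch

class BEVChannelNormalizer:
    """Standardize BEV feature maps with dataset-level channel statistics."""

    def __init__(self, num_channels: int, eps: float = 1e-6):
        self.eps = eps
        self.sum = torch.zeros(num_channels)
        self.sq_sum = torch.zeros(num_channels)
        self.count = 0

    @torch.no_grad()
    def update(self, bev: torch.Tensor) -> None:
        # bev: (B, C, H, W); accumulate per-channel first and second moments
        b, _, h, w = bev.shape
        self.sum += bev.sum(dim=(0, 2, 3)).cpu()
        self.sq_sum += (bev ** 2).sum(dim=(0, 2, 3)).cpu()
        self.count += b * h * w

    def normalize(self, bev: torch.Tensor) -> torch.Tensor:
        mean = self.sum / self.count
        var = self.sq_sum / self.count - mean ** 2
        std = (var + self.eps).sqrt()
        # broadcast the (C,) statistics over batch and spatial dimensions
        return (bev - mean.view(1, -1, 1, 1).to(bev)) / std.view(1, -1, 1, 1).to(bev)
```

After a first pass calling `update` on every training sample, `normalize` puts the LiDAR and camera BEV maps on a comparable scale before distillation.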
LiDAR-to-Camera Pretraining:
- During the pretraining phase, the researchers first trained the LiDAR model on the 3D object detection task and used its BEV feature maps as pretraining targets. Because value distributions vary arbitrarily across channels, these target feature maps were normalized as described above.
- To better align the BEV representations of the LiDAR and camera models, the researchers designed a LiDAR-guided mask generation module. It projects the LiDAR point cloud onto the BEV grid, counts the points falling into each cell, and applies a Gaussian smoothing kernel to densify the resulting LiDAR attention map (see the sketch after this list). They also designed a target-aware geometry correlation module that extracts instance features and computes their geometric relationships to guide pixel-wise knowledge transfer.
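The mask generation can be sketched as follows. The grid resolution, point-cloud range, and kernel width here are illustrative assumptions, not values from the paper; the idea is that per-cell point counts form a sparse occupancy map that Gaussian smoothing turns into a dense attention mask.

```python
# Hypothetical sketch of LiDAR-guided mask generation: bin LiDAR points
# into BEV cells, count points per cell, then densify the sparse count
# map with a Gaussian kernel to obtain an attention mask in [0, 1].
import numpy as np
from scipy.ndimage import gaussian_filter

def lidar_guided_mask(points: np.ndarray,
                      pc_range=(-51.2, -51.2, 51.2, 51.2),  # x_min, y_min, x_max, y_max (m)
                      grid_size=(128, 128),
                      sigma: float = 2.0) -> np.ndarray:
    """points: (N, 3+) LiDAR points; returns an (H, W) mask in [0, 1]."""
    x_min, y_min, x_max, y_max = pc_range
    h, w = grid_size
    # map each point's (x, y) coordinate to a BEV cell index
    xs = np.floor((points[:, 0] - x_min) / (x_max - x_min) * w).astype(np.int64)
    ys = np.floor((points[:, 1] - y_min) / (y_max - y_min) * h).astype(np.int64)
    valid = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
    counts = np.zeros((h, w), dtype=np.float32)
    np.add.at(counts, (ys[valid], xs[valid]), 1.0)
    # densify the sparse count map and rescale to [0, 1]
    mask = gaussian_filter(counts, sigma=sigma)
    return mask / (mask.max() + 1e-6)
```

In the pretraining loss, such a mask would multiply the per-pixel feature distillation term, so regions densely observed by the LiDAR dominate the knowledge transfer.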
Fine-Tuning:
- In the fine-tuning stage, the researchers reused the pretrained parameters directly, taking only images as input with no LiDAR point clouds required. To keep the camera model's BEV representation aligned with the LiDAR model's, they used an identical detection head architecture and initialized it with the LiDAR head parameters (see the sketch below).
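A minimal sketch of this initialization, assuming hypothetical checkpoint layouts and module names (the `state_dict` keys, the `lidar_head.` prefix, and `camera_model.head` are illustrative, not from the paper):

```python
# Sketch: warm-start the camera detector from the geometry-aware
# pretraining checkpoint, then copy the LiDAR detector's head weights,
# which is possible because the two heads share the same architecture.
import torch

def init_camera_detector(camera_model: torch.nn.Module,
                         pretrain_ckpt: str,
                         lidar_ckpt: str) -> torch.nn.Module:
    # load the pretrained image backbone and view-transformation weights
    pretrained = torch.load(pretrain_ckpt, map_location="cpu")["state_dict"]
    camera_model.load_state_dict(pretrained, strict=False)

    # copy the LiDAR detection head parameters into the camera head
    lidar = torch.load(lidar_ckpt, map_location="cpu")["state_dict"]
    head_weights = {k.removeprefix("lidar_head."): v
                    for k, v in lidar.items() if k.startswith("lidar_head.")}
    camera_model.head.load_state_dict(head_weights, strict=False)
    return camera_model
```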
Key Results
- Experimental Setup: The researchers conducted experiments on the nuScenes dataset, which contains 1,000 driving scenes: 700 for training, 150 for validation, and 150 for testing. Each scene lasts approximately 20 seconds and is sampled at 2 Hz.
- Performance Improvement: The experiments showed that GAPretrain consistently improves existing methods. For example, applied to BEVFormer, it achieved 46.2% mAP and 55.5% NDS on the nuScenes validation set, improvements of 2.7 and 2.1 percentage points over the baseline.
- Ablation Study: Ablation studies validated the contribution of each module. The pretraining distillation alone improved mAP by 2.4%, the LiDAR-guided mask generation further improved object localization accuracy by 5.9%, and the target-aware geometry correlation module added another 0.4% NDS.
Conclusion
This study proposes a new geometry-aware pretraining framework, GAPretrain, which guides the pretraining process of camera models by incorporating rich geometric information from LiDAR. The experimental results show that this method not only enhances the performance of existing methods but also exhibits good generalization ability. Future work can further explore how to generate more representative and robust pretraining targets to improve the detection performance of distant objects.
Research Highlights
- Solving Spatial Knowledge Misalignment in View Transformation: The LiDAR-guided mask generation and target-aware geometry correlation modules effectively improve the accuracy of the spatial information captured by camera models.
- Plug-and-Play Solution: The GAPretrain method can be flexibly applied to various existing multi-view camera models, offering excellent versatility.
- Full Utilization of Unlabeled Data: During the pretraining phase, a large amount of unlabeled data can be leveraged to further enhance model performance.
Through this research, the authors provide an effective pretraining strategy for vision-based 3D object detection, which is expected to drive further advances in autonomous driving technology.