CANet: Context-Aware Multi-View Stereo Network for Efficient Edge-Preserving Depth Estimation
Academic Background and Problem Statement
Multi-View Stereo (MVS) is a fundamental task in 3D computer vision that aims to recover the 3D geometry of a scene from multiple posed images. This technology has broad applications in robotics, scene understanding, augmented reality, and more. In recent years, learning-based MVS methods have achieved significant progress by employing a coarse-to-fine depth estimation framework. However, existing methods still face challenges in recovering depth in featureless areas, object boundaries, and thin structures, primarily due to the poor distinguishability of matching clues in low-textured regions, the inherent smooth properties of 3D Convolutional Neural Networks (3D CNNs) used for cost volume regularization, and the loss of information at the coarsest scale.
To address these issues, this paper proposes a Context-Aware Multi-View Stereo Network (CANet), which leverages contextual cues in images to achieve efficient edge-preserving depth estimation. By introducing the Self-Similarity Attended Cost Aggregation (SAA) module, CANet models long-range dependencies in the cost volume, thereby enhancing the matchability of featureless regions. Additionally, through the Hierarchical Edge-Preserving Residual Learning (HEPR) module, CANet progressively refines multi-scale depth estimation, resulting in delicate depth estimation at edges. To enrich features at the coarsest scale, CANet also introduces a Focal Selection Module (FSM), which enhances the recovery of initial depth with finer details such as thin structures.
Paper Source and Author Information
This paper is authored by Wanjuan Su and Wenbing Tao, both affiliated with the National Key Laboratory of Science and Technology on Multispectral Information Processing, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology. The paper was submitted on May 5, 2024, accepted on December 17, 2024, and published in the International Journal of Computer Vision in 2025.
Research Process and Experimental Design
1. Research Process
The research process of CANet includes the following steps:
1.1 Multi-Scale Feature Extraction
CANet first extracts multi-scale features from input images using the Focal-Aware Multi-Scale Feature Extraction Network. This network, based on UNet, incorporates the Focal Selection Module (FSM) to enhance the expressive power of features at the coarsest scale. The FSM enriches the coarsest scale features by fusing finer-scale features from the encoder and conducting focal selection across both channel and spatial dimensions, thereby amplifying responses in critical regions.
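The fuse-then-select behavior of the FSM can be illustrated with a minimal NumPy sketch. This is a schematic stand-in, not the paper's implementation: the fusion is a simple average-pool-and-add, and the channel/spatial gates are parameter-free sigmoid re-weightings, whereas the actual module uses learned layers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def focal_selection(coarse, finer):
    """Schematic FSM: fuse finer encoder features into the coarsest
    scale, then re-weight the result along channel and spatial axes.
    coarse: (C, H, W) coarsest-scale features
    finer:  (C, 2H, 2W) next-finer encoder features
    """
    c, h, w = coarse.shape
    # downsample finer features by 2x2 average pooling to match resolution
    pooled = finer.reshape(c, h, 2, w, 2).mean(axis=(2, 4))
    fused = coarse + pooled                        # simple additive fusion

    # channel selection: global average pooling -> per-channel gate
    chan_gate = sigmoid(fused.mean(axis=(1, 2)))   # (C,)
    fused = fused * chan_gate[:, None, None]

    # spatial selection: channel-mean map -> per-pixel gate
    spat_gate = sigmoid(fused.mean(axis=0))        # (H, W)
    return fused * spat_gate[None, :, :]
```

The two gates play the "focal selection" role described above: channels and pixels with strong average responses are amplified relative to the rest, so the coarsest features concentrate on critical regions.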
1.2 Self-Similarity Attended Cost Aggregation (SAA)
To address the challenge of matching in featureless regions, CANet introduces the Self-Similarity Attended Cost Aggregation (SAA) module. This module extracts self-similarity information from the reference view using an efficient attention mechanism and uses it to guide the aggregation of cost volumes. Specifically, the SAA module computes self-similarity weights through a cross-covariance attention mechanism and applies these weights to the raw cost volume, generating a context-enriched cost volume.
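The effect of self-similarity guided aggregation can be sketched in a dense, unoptimized form. Note the hedge: the paper achieves efficiency with cross-covariance attention, which avoids materializing the full N x N similarity matrix; the sketch below forms it explicitly for clarity on small inputs.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_similarity_aggregation(feat, cost):
    """Dense sketch of self-similarity guided cost aggregation.
    feat: (C, N) reference-view features, N = H*W pixels
    cost: (D, N) raw matching cost over D depth hypotheses
    """
    c, n = feat.shape
    # l2-normalize features so similarity reflects direction, not magnitude
    f = feat / (np.linalg.norm(feat, axis=0, keepdims=True) + 1e-8)
    # row-stochastic self-similarity weights between all pixel pairs
    sim = softmax(f.T @ f / np.sqrt(c), axis=1)    # (N, N)
    # each pixel's cost becomes a similarity-weighted mix over all pixels,
    # propagating discriminative clues into featureless regions
    return cost @ sim.T                            # (D, N)
```

Because the weights come from the reference features rather than the costs themselves, pixels in low-textured regions borrow matching evidence from contextually similar pixels elsewhere in the image.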
1.3 Hierarchical Edge-Preserving Residual Learning (HEPR)
To preserve edge information in depth estimation, CANet designs the Hierarchical Edge-Preserving Residual Learning (HEPR) module. This module progressively learns depth residual maps, blending high-frequency details into the depth maps predicted by the backbone network, thereby achieving edge-preserving upsampling and depth refinement. The HEPR module performs depth refinement and upsampling simultaneously at intermediate pyramid stages, avoiding the limitations of traditional methods that only refine or upsample the final depth map.
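The progressive refine-and-upsample loop can be summarized in a few lines. This is a structural sketch only: the residual maps here are assumed to come from the learned residual branch (which the paper predicts from image and feature context at each finer scale), and nearest-neighbour upsampling stands in for the interpolation used in practice.

```python
import numpy as np

def upsample2x(depth):
    """Nearest-neighbour 2x upsampling (a stand-in for bilinear)."""
    return np.repeat(np.repeat(depth, 2, axis=0), 2, axis=1)

def hepr_refine(init_depth, residuals):
    """Schematic HEPR pyramid: each stage doubles the resolution of the
    current depth map and blends in that stage's high-frequency residual,
    so refinement and upsampling happen together at intermediate stages.
    init_depth: (H, W); residuals: list of (2H, 2W), (4H, 4W), ...
    ordered coarse -> fine."""
    d = init_depth
    for r in residuals:
        d = upsample2x(d) + r
    return d
```

Adding the residual immediately after each upsampling step is what distinguishes this scheme from pipelines that only refine or upsample the final depth map once at the end.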
1.4 Lightweight Cascade Framework
To maintain high performance while reducing computational resource consumption, CANet adopts a lightweight cascade framework. This framework stacks two stages at the same resolution and concentrates dense depth hypothesis sampling at low resolutions, significantly reducing memory and runtime without sacrificing fine-grained depth sampling or the capacity of the cost volume regularization network.
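The per-stage hypothesis sampling that such a cascade relies on can be sketched as follows. The sampling schedule shown is illustrative only, not the paper's actual configuration: the idea is that coarse stages use many widely spaced hypotheses while later stages use fewer, finer ones centered on the previous prediction.

```python
import numpy as np

def stage_hypotheses(center, interval, num):
    """Uniform per-pixel depth hypotheses for one cascade stage,
    centered on the previous stage's depth prediction.
    center: (H, W) previous depth map
    interval: spacing between adjacent hypotheses at this stage
    num: number of hypotheses; returns (num, H, W)."""
    offsets = (np.arange(num) - (num - 1) / 2.0) * interval   # (num,)
    return center[None, :, :] + offsets[:, None, None]

# hypothetical schedule (num_hypotheses, interval) from coarse to fine:
# dense wide sampling early, sparse narrow sampling at full resolution.
schedule = [(48, 8.0), (32, 2.0), (8, 0.5)]
```

Pushing most of the hypothesis count into the low-resolution stages keeps the cost volumes small where resolution is high, which is the source of the memory and runtime savings reported below.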
2. Experimental Results
2.1 Main Results
CANet has been extensively evaluated on multiple MVS benchmark datasets, demonstrating superior performance in both reconstruction quality and efficiency. Notably, CANet ranks first on the Tanks and Temples Advanced dataset and the ETH3D High-Res benchmark among all published learning-based methods. Specifically, CANet reduces GPU memory consumption by 78.49% and runtime by 57.35%, while achieving comparable reconstruction quality to state-of-the-art methods.
2.2 Conclusions and Significance
The main contributions of CANet include:
1. A novel context-aware multi-view stereo network that fully exploits context information in images for high-quality edge-aware depth estimation, with modest memory and runtime consumption.
2. A self-similarity attended cost aggregation module that propagates discriminative matching clues in cost volumes to featureless regions under the guidance of global context information.
3. A hierarchical edge-preserving residual learning module that supports blur-free depth upsampling.
4. A focal selection module that enables features at the coarsest scale to focus on important regions, yielding a better initial depth.
Research Highlights
- Innovation: CANet significantly improves depth estimation accuracy in featureless and edge regions by introducing the self-similarity attended cost aggregation module and the hierarchical edge-preserving residual learning module.
- Efficiency: Through the design of a lightweight cascade framework, CANet maintains high performance while significantly reducing computational resource consumption.
- Broad Applicability: CANet demonstrates outstanding performance on multiple benchmark datasets, particularly in complex scenes such as Tanks and Temples and ETH3D, showcasing its strong generalization ability.
Summary
CANet proposes an efficient and accurate multi-view stereo depth estimation method by combining context information, self-similarity attention mechanisms, and edge-preserving residual learning. The method not only achieves state-of-the-art performance on multiple benchmark datasets but also excels in computational resource consumption and runtime efficiency, providing a new solution for the field of 3D reconstruction.