DiffuVolume: Diffusion Model for Volume Based Stereo Matching

Research Background and Problem Statement

Stereo matching is a fundamental task in computer vision, with wide applications in autonomous driving, robot navigation, and beyond. Its core objective is to estimate a dense disparity map from a pair of rectified stereo images. In recent years, cost volume-based methods have achieved remarkable success in stereo matching. The cost volume aggregates geometric information from the left and right image features, providing rich matching cues for disparity prediction. However, it also contains a large amount of redundant information, which interferes with model training and limits further performance gains.

To address this issue, researchers have attempted to optimize the design of the cost volume from various perspectives, such as improving feature extraction networks or designing more efficient cost aggregation modules. However, these methods often overlook the filtering of redundant information in the cost volume. Although a few studies have introduced attention mechanisms to screen useful information in the cost volume, these approaches typically require complex multi-stage training processes, leading to high computational costs.

Against this backdrop, Dian Zheng et al. proposed DiffuVolume, a cost volume filtering method based on diffusion models. The method embeds a diffusion model into the stereo matching pipeline, iteratively removing redundant information from the cost volume to achieve higher accuracy with a small parameter overhead.


Paper Source and Author Information

This paper, titled “DiffuVolume: Diffusion Model for Volume Based Stereo Matching,” was authored by Dian Zheng, Xiao-Ming Wu, Zuhao Liu, Jingke Meng, and Wei-Shi Zheng of the School of Computer Science and Engineering, Sun Yat-sen University, with Wei-Shi Zheng as the corresponding author. It was accepted on January 14, 2025, and published in the International Journal of Computer Vision (DOI: 10.1007/s11263-025-02362-1).


Research Details and Workflow

a) Research Workflow

1. Feature Extraction

The study first uses a shared ResNet-like convolutional network to extract features from the left and right images, producing two 320-channel unary feature maps, denoted $F_l$ and $F_r$, of size $320 \times H/4 \times W/4$. The $1/4$ downsampling comes from strided convolutions in the backbone.
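To make the shapes concrete, here is a minimal PyTorch sketch of a shared feature extractor; the layer widths, strides, and module names are illustrative assumptions, not the paper's exact backbone.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Shared ResNet-like backbone (illustrative, not the paper's exact design).
    Two stride-2 convolutions give the 1/4-resolution, 320-channel output."""
    def __init__(self, out_channels: int = 320):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(64, out_channels, 3, padding=1)

    def forward(self, x):
        return self.head(self.stem(x))  # (B, 320, H/4, W/4)

extractor = FeatureExtractor()
left, right = torch.randn(1, 3, 256, 512), torch.randn(1, 3, 256, 512)
F_l, F_r = extractor(left), extractor(right)  # same weights for both views
print(F_l.shape)  # torch.Size([1, 320, 64, 128])
```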

2. Cost Volume Construction

Based on the extracted feature maps, the study constructs a base cost volume. Specifically, it adopts the two common constructions: a 4D concatenation volume and a 3D correlation volume. The two volumes encode geometric correspondence in complementary ways and together form the base cost volume.
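The two constructions below follow the standard recipes from the cost volume literature (concatenating shifted feature pairs, and taking their dot-product correlation); the helper and variable names are ours. For a candidate disparity $d$, the left feature at column $x$ is paired with the right feature at column $x - d$.

```python
import torch

def build_volumes(F_l, F_r, max_disp_q):
    """Standard concatenation (4D) and correlation (3D) cost volumes at 1/4
    resolution; max_disp_q = max disparity / 4 (48 for a 192 search range)."""
    B, C, H, W = F_l.shape
    concat = F_l.new_zeros(B, 2 * C, max_disp_q, H, W)  # (B, 2C, D, H, W)
    corr = F_l.new_zeros(B, max_disp_q, H, W)           # (B, D, H, W)
    for d in range(max_disp_q):
        l = F_l[..., d:]          # left features at columns >= d
        r = F_r[..., :W - d]      # right features shifted left by d
        concat[:, :C, d, :, d:] = l
        concat[:, C:, d, :, d:] = r
        corr[:, d, :, d:] = (l * r).mean(dim=1)
    return concat, corr
```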

3. Diffusion Filtering

This is the core part of DiffuVolume. The study embeds the diffusion model into the cost volume, designing an attention-like diffusion filter. The filter is initialized from the discretized ground-truth disparity map:
$$ dv_0(d/4, x, y) = \mathrm{discretize}(d_{gt}(x, y)), $$
where $d_{gt}(x, y)$ is the ground-truth disparity at pixel $(x, y)$ and $d$ is the maximum disparity (set to 192 during training). The forward diffusion process is then
$$ dv_t = \sqrt{\alpha_t}\, dv_0 + \sqrt{1 - \alpha_t}\, \epsilon, $$
where $\alpha_t$ is the noise schedule coefficient and $\epsilon$ is added Gaussian noise.
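Below is a sketch of this initialization and the forward diffusion step, under our reading that the discretization produces a one-hot indicator along the disparity dimension; the helper names and the precomputed schedule tensor are assumptions.

```python
import torch
import torch.nn.functional as F

def make_dv0(d_gt, max_disp=192):
    """Discretize the ground-truth disparity map into a volume: each pixel gets
    a 1 at its (quantized, 1/4-scale) disparity index and 0 elsewhere.
    d_gt: (B, H, W) float disparities -> (B, D, H, W) with D = max_disp / 4."""
    D = max_disp // 4
    idx = (d_gt / 4).round().long().clamp(0, D - 1)
    return F.one_hot(idx, num_classes=D).permute(0, 3, 1, 2).float()

def diffuse(dv0, t, alphas_cumprod):
    """Closed-form forward diffusion: dv_t = sqrt(a_t)*dv_0 + sqrt(1-a_t)*eps,
    with a_t read from a precomputed noise schedule (alpha_t in the text)."""
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)  # broadcast over (B, D, H, W)
    eps = torch.randn_like(dv0)
    return a_t.sqrt() * dv0 + (1 - a_t).sqrt() * eps
```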

4. Cost Volume Filtering

At each training step, the study randomly samples a time step $t$ and performs element-wise multiplication between the corresponding diffusion filter and the base cost volume:
$$ c_{flt} = c_{base} \odot (dv_t + \mathrm{mlp}(t)), $$
where $c_{flt}$ is the filtered cost volume and $\mathrm{mlp}(t)$ is a small fully connected network that encodes the time step.
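A sketch of the filtering step under the shapes used above; TimeMLP is our stand-in for $\mathrm{mlp}(t)$, and its hidden width is arbitrary.

```python
import torch
import torch.nn as nn

class TimeMLP(nn.Module):
    """Stand-in for mlp(t): maps the scalar time step to a per-disparity
    bias that is broadcast over the cost volume."""
    def __init__(self, num_disp):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(inplace=True), nn.Linear(64, num_disp))

    def forward(self, t):                      # t: (B,) integer time steps
        e = self.net(t.float().unsqueeze(-1))  # (B, D)
        return e[:, None, :, None, None]       # (B, 1, D, 1, 1)

def filter_volume(c_base, dv_t, time_mlp, t):
    """c_flt = c_base * (dv_t + mlp(t)), element-wise over the volume.
    c_base: (B, C, D, H, W); dv_t: (B, D, H, W)."""
    return c_base * (dv_t[:, None] + time_mlp(t))

# During training, t is sampled uniformly at random for each batch:
# t = torch.randint(0, num_steps, (batch_size,))
```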

5. Cost Aggregation and Disparity Regression

The filtered cost volume is fed into the cost aggregation module, a stack of 3D hourglass networks that aggregates information across disparity levels. Finally, a probability volume is produced via 3D convolutions and a Softmax over the disparity dimension, and a disparity-weighted summation yields the final disparity map.
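The final regression is the standard soft-argmax used by volume-based methods; the sketch below assumes an aggregated volume already upsampled to full resolution with one channel per disparity candidate.

```python
import torch
import torch.nn.functional as F

def disparity_regression(cost):
    """Soft-argmax disparity regression: a Softmax over the disparity axis
    turns the aggregated cost volume (B, D, H, W) into a probability volume,
    and the prediction is the probability-weighted sum of disparity candidates."""
    prob = F.softmax(cost, dim=1)                                   # (B, D, H, W)
    disps = torch.arange(cost.size(1), device=cost.device, dtype=cost.dtype)
    return (prob * disps.view(1, -1, 1, 1)).sum(dim=1)              # (B, H, W)
```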


b) Key Research Results

1. Effectiveness of Cost Volume Filtering

The study validated the effectiveness of DiffuVolume using information entropy. Experiments show that as the number of filtering iterations increases, the diffusion filter gradually sharpens each pixel's probability vector toward a unimodal distribution, significantly reducing its entropy. This indicates that DiffuVolume effectively removes redundant information while retaining useful geometric information.
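As a concrete version of this measurement, the snippet below computes each pixel's Shannon entropy from the probability volume; the function is our illustration, not the paper's code.

```python
import torch

def pixelwise_entropy(prob, eps=1e-8):
    """Shannon entropy of each pixel's disparity distribution.
    prob: (B, D, H, W) probability volume; a sharply unimodal distribution has
    low entropy, so decreasing entropy across iterations means less redundancy."""
    return -(prob * (prob + eps).log()).sum(dim=1)  # (B, H, W)
```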

2. Performance Improvement

Experiments were conducted on multiple public datasets, including Scene Flow, KITTI2012, KITTI2015, Middlebury, and ETH3D. The results demonstrate that DiffuVolume achieves state-of-the-art performance on all datasets. For example, on the Scene Flow dataset, DiffuVolume’s EPE (End-Point Error) is only 0.46, surpassing ACVNet (0.48); on the KITTI2012 and KITTI2015 datasets, DiffuVolume ranks first and second, respectively.

3. Plug-and-Play Characteristics

DiffuVolume is a lightweight plug-and-play module that can be embedded into any cost volume-based stereo matching network, adding only about 2% more parameters. For instance, when embedded into Fast-ACVNet, DiffuVolume improves performance while keeping inference time nearly unchanged.

4. Zero-Shot Generalization Ability

The study also tested DiffuVolume’s zero-shot generalization ability on unseen scenes. Experimental results show that RAFT-Stereo embedded with DiffuVolume performs exceptionally well on the KITTI, ETH3D, and Middlebury datasets, particularly excelling in edge and detail regions compared to other methods.


Conclusions and Significance

Scientific Value

DiffuVolume is the first to apply diffusion models to stereo matching tasks, proposing a novel task-specific module design approach. By transforming the diffusion target from images to an attention-like diffusion filter, the study successfully addresses the issue of redundant information in the cost volume.

Application Value

DiffuVolume has broad potential application value, especially in real-time stereo matching tasks. Its plug-and-play characteristics and low parameter overhead make it highly suitable for deployment on resource-constrained devices.


Research Highlights

  1. Innovation: DiffuVolume is the first to fully embed diffusion models into stereo matching tasks, avoiding the traditional approach of directly adding noise to images.
  2. Efficiency: Compared to traditional diffusion models, DiffuVolume improves inference speed by 240 times while reducing parameter size by 7 times (from 60M to 7M).
  3. Universality: DiffuVolume can be easily embedded into any cost volume-based stereo matching network, significantly enhancing performance.
  4. Robustness: DiffuVolume performs particularly well in complex scenarios, such as textureless regions and edges.

Other Valuable Information

The paper also explores the potential application value of diffusion models in dense prediction tasks. The study points out that diffusion models can be integrated into various computer vision tasks with low parameter overhead by designing task-specific modules. Additionally, the research emphasizes the importance of iterative optimization ideas, providing new directions for future research.