Adaptively Identify and Refine Ill-Posed Regions for Accurate Stereo Matching

Research Background and Motivation

With the rapid development of computer vision, stereo matching has come to play a crucial role in fields such as robotics, aerospace, autonomous driving, and industrial manufacturing, thanks to its high accuracy, low cost, and non-invasiveness. However, in occluded and blurred regions the consistency constraint between pixel pairs becomes unreliable, making hidden correspondences difficult to recover. As a result, despite rapid advances in Convolutional Neural Networks (CNNs) and Transformer-based methods, most approaches still face a performance bottleneck in ill-posed regions. To address this challenge, the research team introduced an error-region feature refinement mechanism that supplies contextual features, thereby improving stereo matching in ill-posed regions.

[Figure: Architecture of ERCNet]

Source and Introduction of the Study

This paper, titled “Adaptively Identify and Refine Ill-Posed Regions for Accurate Stereo Matching,” is authored by Changlin Liu, Linjun Sun, Xin Ning, and other researchers from the Institute of Semiconductors, Chinese Academy of Sciences, and the School of Semiconductor Science and Technology, South China Normal University. The study was published in the journal Neural Networks in 2024. The manuscript was received on October 31, 2023, revised on April 26, 2024, and accepted on May 15, 2024.

Research Workflow

This study includes several core steps, summarized as follows:

1. Feature Extraction

A ResNet-like backbone extracts multi-scale information from the RGB images. Specifically, each RGB image passes through three convolutional layers with different strides, which downsample the features to 1/4 resolution while expanding the channel dimension. ResNet layers then generate the image features (l1, l2, l3, l4), which are concatenated into a 320-channel feature map for use by the subsequent prediction network and the error-region refinement module.
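As a rough illustration of the stem's downsampling to 1/4 resolution, the output-size arithmetic can be sketched in a few lines. The 3x3 kernel, padding of 1, and the stride pattern (2, 1, 2) are assumptions for illustration, not values taken from the paper:

```python
def conv_out_size(size, kernel=3, stride=1, padding=1):
    # Standard convolution output-size formula.
    return (size + 2 * padding - kernel) // stride + 1

# Hypothetical stem: three 3x3 convs whose combined stride is 4,
# so the output is 1/4 of the input resolution.
H = W = 256
for stride in (2, 1, 2):
    H, W = conv_out_size(H, stride=stride), conv_out_size(W, stride=stride)

print((H, W))  # → (64, 64), i.e. quarter resolution
```

Any stride pattern whose product is 4 would give the same 1/4-resolution output; only the total downsampling factor matters here.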

2. Dual-Constrained Volume (DCV)

To detect and optimize matching features in advance, this study constructed a DCV that combines image and geometric constraints. Specifically, the method includes the following steps:

  • Constraint Selection: The DCV is built from two complementary constraints: a difference cost computed from absolute differences, and a correlation cost computed with normalized cross-correlation (NCC).
  • Multi-Level Matching Cost Calculation: The dot product of features within the matching window is computed, and the pixel matching costs within a nine-point coordinate set serve as weights. The resulting cost volumes are then fused through a 3D convolutional layer.
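The two constraint types above can be sketched on toy feature vectors. The paper's window handling and weighting scheme differ, so treat this as a minimal illustration of the two cost definitions only:

```python
import math

def abs_diff_cost(f_left, f_right):
    # Difference cost: mean absolute difference between feature vectors.
    return sum(abs(a - b) for a, b in zip(f_left, f_right)) / len(f_left)

def ncc_cost(f_left, f_right, eps=1e-8):
    # Correlation cost: normalized cross-correlation of the two
    # mean-centred feature vectors; 1.0 indicates a perfect match.
    mu_l = sum(f_left) / len(f_left)
    mu_r = sum(f_right) / len(f_right)
    l = [a - mu_l for a in f_left]
    r = [b - mu_r for b in f_right]
    num = sum(a * b for a, b in zip(l, r))
    den = math.sqrt(sum(a * a for a in l)) * math.sqrt(sum(b * b for b in r))
    return num / (den + eps)

left = [1.0, 2.0, 3.0, 4.0]
print(abs_diff_cost(left, left))  # → 0.0 for identical patches
print(ncc_cost(left, left))       # ≈ 1.0 for identical patches
```

The two costs are complementary: the absolute difference is sensitive to intensity offsets, while NCC is invariant to them, which is why combining both constrains the matching more tightly than either alone.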

3. Error-Region Feature Refinement (EFR) Mechanism

This is the key innovation of this study. The specific process is as follows:

  • Disparity maps are computed with the hourglass structure from the costs before and after aggregation; regions where the disparity fluctuates significantly between the two maps are identified as potential error regions.
  • A transformer is designed to selectively enhance the features of ill-posed regions, adjusting them with global information while suppressing redundant features.

4. Main Prediction Network

The network integrates the extended cost volumes and computes the final disparity through stacked hourglass structures, each built from four 3D convolutional layers with ReLU and batch normalization plus a small transformer block. A 3D deconvolution layer then produces a probability volume, which is upsampled back to the input resolution to compute the matching disparity.
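The final readout from a probability volume is typically a soft-argmax over disparities. Here is a per-pixel sketch; the softmax-over-negated-costs formulation is the common convention in cost-volume stereo networks, assumed here rather than taken verbatim from the paper:

```python
import math

def soft_argmax_disparity(costs):
    # Turn per-disparity matching costs into a probability
    # distribution (softmax over negated costs) and take its
    # expectation: the usual differentiable disparity readout.
    exps = [math.exp(-c) for c in costs]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sum(d * p for d, p in enumerate(probs))

# A cost curve with a clear minimum at disparity 2:
costs = [5.0, 3.0, 0.0, 3.0, 5.0]
print(round(soft_argmax_disparity(costs), 3))
# → 2.0 (costs are symmetric about disparity 2)
```

Because the expectation is differentiable, gradients flow through the disparity estimate back into the cost volume, which is what lets the whole network train end to end.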

Main Research Results

Experimental Validation

The experimental results on multiple datasets show that ERCNet performs excellently on Scene Flow, KITTI 2012, KITTI 2015, ETH3D, and Middlebury 2014 datasets. The addition of DCV and EFR significantly improves the accuracy and robustness of the network in ill-posed regions and effectively reduces texture overfitting.

  • Scene Flow: ERCNet achieves an end-point error (EPE) of 0.45 px, improving on the 0.47 px of the best recent competing algorithms.
  • KITTI 2012 and 2015: Compared against methods published between 2020 and 2024, ERCNet performs best on most metrics, demonstrating excellent performance in complex scenarios.
  • ETH3D and Middlebury 2014: Experiments demonstrate ERCNet’s high robustness and cross-domain generalization capability.

Solution to Texture Overfitting

In the study, the combination of EFR and DCV effectively mitigates overfitting to strongly textured regions. The model shows a significant advantage when tested on the KITTI 2015 dataset with pre-trained weights and no fine-tuning.

Extraction Performance of Ill-Posed Regions

By extracting ill-posed regions in different scenes, the study demonstrates the model’s superiority in handling repetitive textures, textureless areas, and disparity discontinuities. Especially in real-world scenarios, the extracted ill-posed regions significantly enhance the model’s adaptability to complex scenes.

Conclusions and Future Work

The ERCNet framework proposed in this study effectively improves stereo matching in ill-posed regions through error identification and feature refinement, providing additional constraint cues and robust disparity inference. Across multiple benchmark datasets, it achieves markedly higher accuracy than existing methods, demonstrating its strength in ill-posed regions and its generalization to new scenes. Future work will focus on developing more lightweight stereo matching models to broaden the algorithm's applicability in real environments, and on further optimizing the perturbation applied during the aggregation stage, reducing the dependence on fixed perturbation thresholds to improve robustness and the automation of data annotation in more complex scenarios.