A RAFT-based Network and Synthetic Dataset for Digital Video Stabilization

Report on the Study of Deep Learning-Based Video Stabilization Methods and the SynthStab Synthetic Dataset

Background Introduction

Digital video stabilization, which removes unwanted shake and camera motion artifacts in software, is a critical component of modern video processing, particularly for amateur footage. However, existing deep learning-based Direct Warping Stabilization (DWS) methods, while effective on videos with mild instability, struggle with intense shake and fail to reach the stability levels of traditional methods. This limitation stems from three issues: current datasets lack a clear definition of what a stable video is, model architectures are overly simple, and predictive information from future frames is underused.

To address these challenges, this paper proposes NAFT, a new RAFT-based semi-online direct warping method, and SynthStab, a novel synthetic dataset. Together, these contributions improve the performance of DWS methods on videos with severe instability while significantly reducing model size and parameter count, bringing results closer to those of state-of-the-art methods.

Source and Authors

The paper, titled “NAFT and SynthStab: A RAFT-Based Network and a Synthetic Dataset for Digital Video Stabilization,” was authored by Marcos Roberto e Souza, Helena de Almeida Maia, and Helio Pedrini, from the Institute of Computing, University of Campinas, Brazil. It was published in 2024 in the International Journal of Computer Vision.

Research Workflow

Construction of the SynthStab Synthetic Dataset

The SynthStab dataset comprises two parts: short videos with low-intensity instabilities (SynthStab-SL) and long videos with high-intensity instabilities (SynthStab-LH). The dataset is generated in the following steps:

  1. Stable Trajectory Generation: Camera trajectories with six degrees of freedom (6-DoF) are defined using kinematic models composed of constant-velocity and constant-acceleration segments. Each trajectory is determined by its initial position, velocities, and segment sizes, with random keypoints controlling the trajectory's overall trend (steps 1 and 2 are sketched in the first code example after this list).

  2. Unstable Trajectory Generation: Instabilities are introduced into the stable trajectories through random keypoints and Gaussian filtering, while scene depth variations are taken into account so that the original motion intent is preserved.

  3. Video Rendering: Using Unreal Engine and AirSim, synchronized stable and unstable videos are rendered across diverse environments, including RGB frames, depth maps, and 3D camera position data.

  4. Motion Field Computation: Motion fields between stable and unstable frames are computed from the depth maps and camera motion matrices; these fields supervise the model during training (see the second example after this list).
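
As a concrete illustration of steps 1 and 2, here is a minimal, hypothetical Python sketch: a smooth 6-DoF trajectory built from piecewise constant-velocity/acceleration segments between random keypoints, then perturbed with Gaussian-filtered noise. All numeric parameters (segment counts, noise scales, filter width) are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def stable_trajectory(n_frames, n_segments=4, seed=0):
    """Smooth 6-DoF trajectory (x, y, z, roll, pitch, yaw): piecewise
    constant-acceleration segments integrated from a random initial state."""
    rng = np.random.default_rng(seed)
    bounds = np.linspace(0, n_frames, n_segments + 1).astype(int)
    pose = np.zeros(6)
    vel = rng.normal(0.0, 0.05, size=6)        # initial per-DoF velocity
    traj = np.empty((n_frames, 6))
    for s in range(n_segments):
        acc = rng.normal(0.0, 0.01, size=6)    # constant acceleration per segment
        for t in range(bounds[s], bounds[s + 1]):
            vel = vel + acc
            pose = pose + vel
            traj[t] = pose
    return traj

def unstable_trajectory(stable, sigma=2.0, amplitude=0.1, seed=1):
    """Add temporally smooth jitter: random keypoint noise blurred along
    time with a Gaussian filter, preserving the underlying motion intent."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, amplitude, size=stable.shape)
    return stable + gaussian_filter1d(noise, sigma=sigma, axis=0)

stable = stable_trajectory(120)        # e.g., a 120-frame clip
unstable = unstable_trajectory(stable)
```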
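Step 4 can likewise be sketched from first principles: given a depth map, the camera intrinsics, and the relative pose between the unstable and stable cameras, each pixel is backprojected with its depth, moved into the stable camera, and reprojected; the displacement between the two pixel grids is the motion field. The function below is an illustrative reconstruction, not the paper's code; `K` and `T` are assumed to come from the renderer.

```python
import numpy as np

def motion_field(depth, K, T):
    """depth: (H, W) z-depth; K: (3, 3) intrinsics; T: (4, 4) transform from
    the unstable camera's frame to the stable camera's frame.
    Returns an (H, W, 2) field of per-pixel displacements (dx, dy)."""
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # backproject to normalized rays (z = 1)
    pts = rays * depth[..., None]            # 3-D points in the unstable camera
    pts_h = np.concatenate([pts, np.ones((H, W, 1))], axis=-1)
    pts_stable = pts_h @ T.T                 # express points in the stable camera
    proj = pts_stable[..., :3] @ K.T         # reproject with the same intrinsics
    uv = proj[..., :2] / proj[..., 2:3]      # perspective divide
    return uv - pix[..., :2]                 # displacement per pixel
```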

Design of the NAFT Model

NAFT is based on the RAFT network and incorporates the following core modules:

  - Neighborhood-Aware Update Mechanism (IUNO): Iteratively refines the optical flow predictions using information from neighboring frames, enhancing accuracy.

  - Multi-Task Decoders: Separate the initial optical flow prediction (approximation task) from the neighborhood adjustment (adaptation task), ensuring both video stability and inter-frame consistency.

  - Implicit Stability Learning: Trains on motion fields rather than image textures, avoiding explicit stability assumptions and the biases they would impose on the model.
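
The following PyTorch fragment is a schematic sketch of the two-decoder, neighborhood-aware update idea, not NAFT's actual architecture: the module names, the GRU-style cell, and the way neighbor features are fused are assumptions for illustration; the real network builds on RAFT's correlation volumes and update blocks.

```python
import torch
import torch.nn as nn

class NeighborhoodAwareUpdate(nn.Module):
    """Schematic two-head iterative update: an approximation head proposes a
    stabilizing flow, and an adaptation head refines it using features from
    neighboring frames (hypothetical simplification of IUNO)."""

    def __init__(self, hidden_dim=128):
        super().__init__()
        self.cell = nn.GRUCell(2 * hidden_dim, hidden_dim)
        self.approx_head = nn.Linear(hidden_dim, 2)   # initial flow (approximation)
        self.adapt_head = nn.Linear(hidden_dim, 2)    # neighborhood residual (adaptation)

    def forward(self, hidden, context, neighbor_feats, iters=4):
        # hidden, context, neighbor_feats: (B, hidden_dim) per-pixel features
        flow = self.approx_head(hidden)
        for _ in range(iters):
            inp = torch.cat([context, neighbor_feats], dim=-1)
            hidden = self.cell(inp, hidden)
            flow = flow + self.adapt_head(hidden)     # iterative refinement
        return flow
```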

In the inference phase, NAFT employs a sliding window approach for semi-online stabilization, leveraging anchor and lookahead frames for improved predictions. Additionally, video inpainting is used to achieve full-frame stabilization, avoiding the loss of valid areas caused by cropping.
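
The semi-online scheme can be pictured as a sliding window over the video, as in this hypothetical sketch; `stabilize_frame` stands in for one forward pass of the network plus warping, and the window sizes are assumed values.

```python
from collections import deque

def semi_online_stabilize(frames, stabilize_frame, n_anchor=2, n_lookahead=2):
    """Stabilize each frame using already-stabilized anchor frames behind it
    and raw lookahead frames ahead of it (hypothetical interface)."""
    anchors = deque(maxlen=n_anchor)   # most recent stabilized outputs
    stabilized = []
    for t in range(len(frames)):
        lookahead = frames[t + 1 : t + 1 + n_lookahead]   # future raw frames
        out = stabilize_frame(frames[t], list(anchors), lookahead)
        anchors.append(out)
        stabilized.append(out)
    return stabilized
```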

Experimental Results

Comparison with Existing Methods

Experiments were conducted on the NUS dataset, which includes six categories of videos. NAFT was compared against five existing methods, including both deep learning-based and traditional approaches. The results demonstrate:

  1. Stability: NAFT achieved performance comparable to state-of-the-art methods (e.g., Deep3D) on the low-to-high-frequency ratio metrics computed over homographies (LHR-H) and optical flow (LHR-OF), excelling particularly in severely unstable scenarios such as quick rotations.
  2. Image Quality: NAFT’s smoothness constraints improved optical flow continuity, reducing distortions and achieving higher SSIM values.
  3. Cropping Region: By employing video inpainting, NAFT avoided information loss from cropping, preserving the entire frame area.

Resource Efficiency

Compared to other DWS methods, NAFT significantly reduces model size and parameter count:

  - Model Parameters: Only about 18% of the parameter count of the smallest competing model (StabNet).

  - Frame Rate: Runs at a moderate FPS, balancing efficiency against high-quality stabilization.

Dataset and Training Optimization

The SynthStab dataset provides a controlled experimental environment and supports large-scale data generation. Training strategies, such as progressively introducing smoothness constraints and pretraining on simpler trajectories, proved beneficial for learning complex instability scenarios.
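
The progressive introduction of smoothness constraints could, for instance, take the form of a loss-weight warmup; the linear schedule and weight values below are assumptions, not the paper's training recipe.

```python
def smoothness_weight(epoch, warmup_epochs=10, max_weight=1.0):
    """Linearly ramp the smoothness-loss weight from 0 to max_weight so the
    model first learns the motion fields, then the temporal smoothness."""
    return max_weight * min(1.0, epoch / warmup_epochs)

# Inside a training loop, the total loss would then combine the motion-field
# supervision with the scheduled smoothness term, e.g.:
#   loss = flow_loss + smoothness_weight(epoch) * smoothness_loss
```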

Research Significance

This study holds both theoretical and practical importance:

  1. Theoretical Contribution: The construction of SynthStab and the design of NAFT introduce innovative approaches to video stabilization research, particularly the concept of implicit stability learning.

  2. Practical Application: The new method reduces computational resource requirements while significantly enhancing video stabilization quality, making it suitable for mobile devices and real-time applications.

Highlights and Future Directions

Key highlights of this research include NAFT's neighborhood-aware mechanism, the SynthStab dataset, and the full-frame inpainting strategy. Future work could explore larger neighborhood ranges and adaptation to more complex scenarios, further advancing DWS techniques.