Unsupervised Temporal Correspondence Learning for Unified Video Object Removal

Figure: Neural network architecture designed in this study

Background and Motivation

In video editing and restoration, video object removal is an essential task: the goal is to erase target objects throughout an entire video and fill the resulting gaps with plausible content. Existing solutions typically decompose the task into two sub-tasks: (1) mask tracking and (2) video completion. These sub-tasks are usually treated as separate problems and handled independently. The result is an overly complex system that requires coordinating multiple models, which makes training and deployment harder and practical application less feasible.

The paper highlights that mask tracking and video completion are strongly connected in terms of pixel-level temporal correspondence. Exploiting this connection could reduce algorithmic complexity and ease practical deployment. The authors therefore propose a unified video object removal framework that solves both mask tracking and video completion with a single model.

Source and Authors

This paper was written by Zhongdao Wang, Jinglu Wang, Xiao Li, Ya-li Li, Yan Lu, and Shengjin Wang, some of whom are IEEE members. The study was conducted by researchers from Tsinghua University and Microsoft Research Asia and published in the IEEE Transactions on Image Processing.

Research Process

The research mainly consists of the following parts:

a) Detailed Research Process:

  1. Redefinition of Video Object Removal Task

    • The task setting requires simultaneously solving the two sub-tasks of mask tracking and video completion, integrating them into a single model. The two sub-tasks are linked through temporal correspondence reasoning across multiple frames, specifically via valid-valid (V-V) temporal correspondence for mask tracking and valid-hole (V-H) temporal correspondence for video completion.
  2. Construction of Temporal Correspondence Learning Framework

    • A single network is proposed that relates mask tracking and video completion by inferring temporal correspondence across multiple frames. This network can learn in an end-to-end, completely unsupervised manner without any annotations.
  3. Key Network and Value Network

    • The key network generates the temporal correspondence information, while the value network processes video frames through an encoder and a decoder to support mask tracking and video completion. Masks are tracked and holes are filled in the hidden-layer feature space, and the resulting features are then decoded back into video frames.
  4. Automatic Conditional Propagation and Interactive Conditional Propagation

    • Automatic Conditional Propagation (ACP) and Interactive Conditional Propagation (ICP) mechanisms are proposed to improve the recall rate of mask tracking. ACP selects the most uncertain points as conditional points, while ICP allows users to manually correct the mask during tracking.
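
The correspondence-based propagation described in the steps above can be illustrated with a small sketch. It is not the authors' implementation: the dot-product affinity, the softmax temperature, the binary-entropy uncertainty measure, and all function names (`affinity`, `propagate`, `most_uncertain`) are assumptions made for illustration. The same `propagate` call stands in for both cases: carrying mask labels between valid pixels (V-V) and carrying features from valid pixels to hole pixels (V-H).

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def affinity(query_keys, ref_keys, temperature=0.1):
    """Row i is the soft correspondence of query pixel i over all reference pixels."""
    return [
        softmax([sum(q * r for q, r in zip(qk, rk)) for rk in ref_keys], temperature)
        for qk in query_keys
    ]

def propagate(aff, ref_values):
    """Carry per-pixel reference values (mask probabilities or features) to the query frame."""
    dim = len(ref_values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, ref_values)) for d in range(dim)]
        for row in aff
    ]

def most_uncertain(mask_probs, k):
    """Indices of the k pixels with the highest binary entropy (assumed uncertainty measure)."""
    def entropy(p):
        p = min(max(p, 1e-8), 1 - 1e-8)  # avoid log(0)
        return -(p * math.log(p) + (1 - p) * math.log(1 - p))
    ranked = sorted(range(len(mask_probs)), key=lambda i: entropy(mask_probs[i]), reverse=True)
    return ranked[:k]

# Toy example: two reference pixels (object, background) and three query pixels.
ref_keys = [[1.0, 0.0], [0.0, 1.0]]
query_keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
ref_mask = [[1.0], [0.0]]  # foreground probability of each reference pixel

aff = affinity(query_keys, ref_keys)
tracked = [row[0] for row in propagate(aff, ref_mask)]
candidates = most_uncertain(tracked, k=1)  # the ambiguous third pixel
```

In this toy setup the third query pixel matches both reference pixels equally well, so its propagated mask probability is 0.5 and it is the one picked as a candidate conditional point. A user correction at that point would play the role that ICP plays in the text.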

b) Research Results

The research results are divided into the following parts:

  1. Evaluation of Video Mask Tracking (V-V Correspondence)

    • The accuracy of mask tracking is evaluated on the DAVIS-2017 dataset using the J score (region similarity, measured as Intersection over Union, IoU) and the F score (boundary accuracy). The results show that the proposed method is competitive among unsupervised trackers, comparable to some of the latest correspondence learning methods, and can achieve higher recall rates under certain conditions.
  2. Evaluation of Video Completion (V-H Correspondence)

    • The quality of video completion is evaluated with metrics such as PSNR, SSIM, and MS-SSIM, together with an assessment of temporal consistency. The results show that the proposed method significantly outperforms other unsupervised methods in completion quality and performs well in temporal consistency and visual quality.
  3. Overall Evaluation

    • A comprehensive comparison with existing mask tracking and video completion methods shows that the proposed unified method has significant advantages in overall quality and consistency.
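
Two of the metrics named above can be written down in a few lines. The sketch below is a minimal illustration, not the official DAVIS evaluation code: masks and frames are flat Python lists, the function names are hypothetical, and returning 1.0 when both masks are empty is an assumed convention.

```python
import math

def region_similarity(pred_mask, true_mask):
    """J score: intersection over union of two binary masks (flat lists of 0/1)."""
    inter = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    union = sum(1 for p, t in zip(pred_mask, true_mask) if p or t)
    return 1.0 if union == 0 else inter / union  # assumed convention for empty masks

def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat lists of pixel intensities."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)

# Toy example: one overlapping pixel out of three marked in either mask.
j = region_similarity([1, 1, 0, 0], [1, 0, 1, 0])
# Toy example: a completed frame that is off by 5 intensity levels everywhere.
quality = psnr([100, 105, 110], [105, 110, 115])
```

Higher is better for both metrics. PSNR is infinite for a perfect reconstruction, which is why the zero-error case is handled separately.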

c) Research Conclusion and Significance

This study proposes an unsupervised temporal correspondence learning framework that handles mask tracking and video completion jointly for video object removal. The method removes the need for multiple models during training and deployment, which simplifies the system, and it also makes object removal more effective in practical applications.

Scientific Value: The study identifies the intrinsic link between the mask tracking and video completion tasks and proposes a unified solution that contributes both theoretical and methodological innovations.

Practical Value: This method is expected to be widely used in practical video editing and restoration, reducing the complexity of existing methods and achieving efficient and automated object removal.

d) Research Highlights

  • Innovative Unified Framework: Through unsupervised temporal correspondence learning, a single solution addresses both mask tracking and video completion, simplifying system design.
  • Efficient Unsupervised Learning: The proposed method achieves efficient object removal through end-to-end training without manual annotations.
  • Practical Application Potential: The method has significant academic value and great application potential in practical video editing and restoration.

e) Other Valuable Information

The authors also experimented with different network architectures and learning strategies to further optimize model performance. In addition, the paper provides a detailed network design and specific implementation details, which can serve as a reference for subsequent research.