DVMark: A Deep Multiscale Framework for Video Watermarking

Video watermarking hides data by embedding information into a cover video. DVMark, the model proposed in this paper, is a multi-scale, deep-learning-based video watermarking solution with high robustness and practicality: it resists a wide range of distortions and attacks while preserving video quality.

Background and Motivation

Details of the Video Watermarking Framework

Video watermarking embeds messages into a cover video; the watermark can be either visible or invisible. Invisible watermarks are advantageous because they do not interfere with the original content and are difficult for attackers to detect. Watermarks serve many purposes, such as carrying metadata, timestamps, or creator information for the video. They are also widely used for content monitoring and tracking, since the embedded message can still be recovered even after the video undergoes some degree of distortion and modification during distribution.

The main criteria for evaluating a video watermarking system are invisibility (visual quality), robustness, and payload (the number of message bits). Traditional watermarking methods often rely on hand-crafted features, typically cannot handle multiple types of distortion at once, and deliver less-than-ideal performance. To overcome these limitations, this paper proposes DVMark, an end-to-end trainable video watermarking solution based on deep learning.

Paper Source

This paper was authored by Xiyang Luo, Yinxiao Li, Huiwen Chang, Ce Liu, Peyman Milanfar, and Feng Yang of Google Research, Mountain View, California. It was accepted by IEEE Transactions on Image Processing and published in 2023.

Research Process

The proposed framework consists of four main modules: an encoder, a decoder, a distortion layer, and a video discriminator. Each module and its implementation are introduced below:

1. Encoder

The encoder receives the input video and the binary message to be embedded and outputs the watermarked video. It consists of two parts: a transformation layer and an embedding layer. The transformation layer converts the input video sequence into feature maps, and the embedding layer then outputs a watermark residual r which, scaled by a strength factor α and added to the original video, forms the final watermarked video:

v_w = v_in + α · r

The transformation layer uses four layers of 3D convolution operations, each with 64 output channels, learning the optimal transformation to embed the message into the video features. The embedding layer uses a two-stage multi-scale network to repeatedly fuse the message into the feature maps spatially and temporally, improving robustness.
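The sketch below illustrates this encoder structure. It is a minimal sketch, assuming PyTorch: the layer widths, kernel sizes, message-replication scheme, and the strength factor α are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch of a DVMark-style encoder, assuming PyTorch.
# Layer widths and the message-fusion scheme are illustrative, not the paper's exact architecture.
import torch
import torch.nn as nn

class WatermarkEncoder(nn.Module):
    def __init__(self, msg_bits: int = 96, channels: int = 64, alpha: float = 1.0):
        super().__init__()
        self.alpha = alpha
        # Transformation layer: four 3D convolutions with 64 output channels each.
        self.transform = nn.Sequential(
            nn.Conv3d(3, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Embedding layer: fuses the replicated message with the feature maps
        # and predicts the watermark residual r.
        self.embed = nn.Sequential(
            nn.Conv3d(channels + msg_bits, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(channels, 3, kernel_size=3, padding=1),
        )

    def forward(self, video: torch.Tensor, message: torch.Tensor) -> torch.Tensor:
        # video: (B, 3, T, H, W); message: (B, msg_bits) with values in {0, 1}
        feats = self.transform(video)
        b, _, t, h, w = feats.shape
        # Replicate the message over time and space, then concatenate with the features.
        msg_map = message.view(b, -1, 1, 1, 1).expand(b, message.size(1), t, h, w)
        residual = self.embed(torch.cat([feats, msg_map], dim=1))
        # v_w = v_in + alpha * r
        return video + self.alpha * residual
```

Broadcasting the message over the whole clip embeds the same bits redundantly in space and time, which is what later allows the decoder to survive temporal and spatial distortions.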

2. Decoder

The decoder receives the (potentially distorted) watermarked video and outputs the decoded message. It uses a multi-head design in which a “weightnet” predicts a weight matrix for each input video, yielding a content-adaptive allocation strategy. A detection head further distinguishes watermarked frames from non-watermarked ones, and each per-scale decoding block applies four 3D convolution layers whose output is aggregated by global average pooling.
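As a rough illustration, the sketch below (PyTorch) shows one way to combine several decoding heads with a small “weightnet”. The number of heads, layer widths, and the exact way the predicted weights are applied are assumptions for illustration, not the paper's precise architecture.

```python
# Minimal sketch of a multi-head decoder with a content-adaptive "weightnet", assuming PyTorch.
import torch
import torch.nn as nn

class DecodeHead(nn.Module):
    """One per-scale decoding block: four 3D convolutions followed by global average pooling."""
    def __init__(self, msg_bits: int = 96, channels: int = 64):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv3d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv3d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv3d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv3d(channels, msg_bits, 3, padding=1),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # Global average pooling over time and space -> one logit per message bit.
        return self.convs(video).mean(dim=(2, 3, 4))

class WatermarkDecoder(nn.Module):
    def __init__(self, msg_bits: int = 96, num_heads: int = 2):
        super().__init__()
        self.heads = nn.ModuleList([DecodeHead(msg_bits) for _ in range(num_heads)])
        # "weightnet": predicts how much to trust each head for this particular video.
        self.weightnet = nn.Sequential(
            nn.Conv3d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(16, num_heads), nn.Softmax(dim=1),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, 3, T, H, W), possibly distorted
        logits = torch.stack([head(video) for head in self.heads], dim=1)  # (B, heads, bits)
        weights = self.weightnet(video).unsqueeze(-1)                      # (B, heads, 1)
        combined = (weights * logits).sum(dim=1)                           # (B, bits)
        return (combined > 0).float()  # hard bit decisions
```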

3. Distortion Layer

The framework improves robustness by injecting common distortions during training. The distortion layer covers temporal distortions (e.g., frame dropping), spatial distortions (e.g., Gaussian blur and random cropping), and a differentiable video compression simulator (compressionnet). During training, the layer randomly selects a distortion type and applies it to the watermarked video, so that the encoder and decoder remain robust against all of them.
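A minimal sketch of such a distortion layer is shown below, assuming PyTorch. The distortion parameters are illustrative, and the stand-in for the differentiable compression proxy (“compressionnet”) is just a blur here; the real compressionnet is a trained network.

```python
# Minimal sketch of a training-time distortion layer that randomly applies one distortion per batch.
import random
import torch
import torch.nn.functional as F

def drop_frames(video: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    # Temporal distortion: randomly keep a subset of frames.
    t = video.size(2)
    keep = max(1, int(t * keep_ratio))
    idx = torch.sort(torch.randperm(t)[:keep]).values
    return video[:, :, idx]

def gaussian_noise(video: torch.Tensor, sigma: float = 0.04) -> torch.Tensor:
    # Additive Gaussian noise.
    return video + sigma * torch.randn_like(video)

def random_crop(video: torch.Tensor, crop: float = 0.8) -> torch.Tensor:
    # Spatial distortion: crop a random window and resize back to the original size.
    b, c, t, h, w = video.shape
    ch, cw = int(h * crop), int(w * crop)
    y, x = random.randint(0, h - ch), random.randint(0, w - cw)
    cropped = video[:, :, :, y:y + ch, x:x + cw]
    return F.interpolate(cropped, size=(t, h, w), mode='trilinear', align_corners=False)

def fake_compression(video: torch.Tensor) -> torch.Tensor:
    # Placeholder for a differentiable compression proxy ("compressionnet");
    # a simple spatial blur stands in for codec artifacts in this sketch.
    return F.avg_pool3d(video, kernel_size=(1, 3, 3), stride=1, padding=(0, 1, 1))

DISTORTIONS = [drop_frames, gaussian_noise, random_crop, fake_compression]

def distortion_layer(watermarked: torch.Tensor) -> torch.Tensor:
    """Randomly apply one distortion so the encoder/decoder stay robust to all of them."""
    return random.choice(DISTORTIONS)(watermarked)
```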

4. Video Discriminator

To improve visual quality and temporal consistency, a multi-scale video discriminator network is employed. It consists of four 3D residual networks that process the video at different temporal and spatial resolutions.
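The sketch below gives a rough idea of such a multi-scale discriminator in PyTorch. The residual block design and the particular choice of four temporal/spatial scales are illustrative assumptions.

```python
# Minimal sketch of a multi-scale 3D video discriminator, assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock3d(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.relu(x + self.conv2(F.relu(self.conv1(x))))

class ScaleDiscriminator(nn.Module):
    """Scores one resized copy of the video as real or watermarked."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, channels, 3, padding=1), nn.ReLU(),
            ResBlock3d(channels), ResBlock3d(channels),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(channels, 1),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        return self.net(video)

class MultiScaleVideoDiscriminator(nn.Module):
    """Runs four discriminators on different temporal/spatial resolutions of the input."""
    def __init__(self, scales=((1.0, 1.0), (1.0, 0.5), (0.5, 1.0), (0.5, 0.5))):
        super().__init__()
        self.scales = scales
        self.discs = nn.ModuleList([ScaleDiscriminator() for _ in scales])

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = video.shape
        scores = []
        for (ts, ss), disc in zip(self.scales, self.discs):
            resized = F.interpolate(
                video,
                size=(max(1, int(t * ts)), max(1, int(h * ss)), max(1, int(w * ss))),
                mode='trilinear', align_corners=False)
            scores.append(disc(resized))
        return torch.cat(scores, dim=1)  # (B, num_scales) real/fake scores
```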

Main Results

The experimental section systematically evaluates the proposed method, comparing it with traditional video watermarking methods and with HiDDeN, a state-of-the-art deep-learning image watermarking method.

1. Robustness Evaluation

Under a variety of common distortions, DVMark achieves significantly higher bit accuracy than the traditional 3D-DWT method and the deep-learning image watermarking method HiDDeN. The tests cover standard video compression, frame dropping, spatial cropping, and Gaussian noise. The experiments show that DVMark performs well under almost all tested distortions.
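Bit accuracy here simply means the fraction of message bits recovered correctly after distortion; a minimal sketch of the metric (assuming binary 0/1 tensors in PyTorch) is:

```python
import torch

def bit_accuracy(decoded: torch.Tensor, original: torch.Tensor) -> float:
    # decoded, original: (B, msg_bits) tensors of 0/1 values
    return (decoded == original).float().mean().item()
```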

2. Visual Quality Evaluation

The quality of the watermarked videos is evaluated with visual quality metrics such as PSNR, MS-SSIM, LPIPS, and tLP, as well as with user ratings. The results indicate that DVMark outperforms the comparison methods on all quality metrics.
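Of these metrics, PSNR is the simplest to state; a minimal sketch (assuming pixel values in [0, 1]) is shown below, while MS-SSIM, LPIPS, and tLP would typically come from dedicated libraries.

```python
import torch

def psnr(original: torch.Tensor, watermarked: torch.Tensor, max_val: float = 1.0) -> float:
    # Peak signal-to-noise ratio in dB between the cover and watermarked video.
    mse = torch.mean((original - watermarked) ** 2)
    return (10.0 * torch.log10(max_val ** 2 / mse)).item()
```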

3. Overall Performance Evaluation

The trade-offs between robustness, quality, and payload are explored in depth. With a fixed payload or quality, the DVMark model shows better robustness compared to traditional and deep learning image watermarking methods.

4. Performance on Larger Videos

To verify the model’s practicality, experiments test DVMark on videos of different resolutions and lengths. The results show that, although training uses only short fixed-length clips, DVMark’s performance does not decline significantly on larger videos.

Conclusion

Through its multi-scale design and end-to-end optimization, DVMark provides a robust video watermarking framework, and rigorous evaluation demonstrates its practicality in real-world applications. Future research directions include more accurate differentiable proxies for video compression and training models targeted at specific distortions. The work represents significant progress for the field of video watermarking, showing how deep learning can achieve both higher robustness and better visual quality under diverse distortion conditions.