MoonShot: Towards Controllable Video Generation and Editing with Motion-Aware Multimodal Conditions

Research Background and Problem Statement

In recent years, text-to-video diffusion models (VDMs) have made significant progress, enabling the generation of high-quality, visually appealing videos. However, most existing VDMs rely solely on text conditions to control generation, and text alone is limited in how precisely it can describe visual content. As a result, these methods often struggle to finely control the appearance and geometric structure of generated videos, leaving the outcome heavily dependent on chance.

To address this issue, researchers have attempted personalized generation by fine-tuning diffusion models (e.g., DreamBooth). However, this approach requires separate training for every input image, making it inefficient and hard to scale to broader applications. Meanwhile, in the image domain, IP-Adapter achieves joint conditioning on images and text through dual cross-attention layers, but applying it directly to video generation repeats the same text condition identically across frames and therefore fails to capture the motion described in the prompt.

Against this backdrop, the authors propose the MoonShot model, aiming to solve the above problems by introducing motion-aware multimodal conditioning. This model not only supports joint conditioning on images and text but also incorporates new modules to enhance motion modeling capabilities. It can also leverage pre-trained image ControlNet for geometric conditioning without additional video training.

Source of the Paper

This paper was co-authored by David Junhao Zhang, Dongxu Li, Hung Le, Mike Zheng Shou, Caiming Xiong, and Doyen Sahoo, who are affiliated with SHOW Lab at the National University of Singapore and Salesforce Research in California, USA. The paper was accepted on January 6, 2025, and published in International Journal of Computer Vision, with the DOI 10.1007/s11263-025-02346-1.


Research Details

a) Research Workflow

1. Model Architecture Design

The core component of MoonShot is the Multimodal Video Block (MVB), which combines the following key designs (a rough code sketch of the first two follows this list):

- Motion-aware dual cross-attention layer: a motion-aware module assigns learnable temporal weights to each frame, avoiding the repeated application of an identical text condition across frames. Concretely, the module concatenates mean-pooled text embeddings with the latent features, passes the result through a stack of temporal convolutional layers (with ReLU and Sigmoid activations), and outputs per-frame motion-aware weights.
- Spatiotemporal attention layer: unlike conventional temporal attention layers, which attend only across the same spatial position, the spatiotemporal attention layer lets each patch interact with all other patches, better capturing global changes.
- Image ControlNet integration: temporal modules are added after all spatial modules, so the functionality of the pre-trained image ControlNet is preserved.
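As a rough illustration of the first two designs, the PyTorch-style sketch below shows how a motion-aware dual cross-attention layer and a spatiotemporal attention layer could be put together. All class, function, and parameter names (MotionAwareWeights, MotionAwareDualCrossAttention, hidden_dim, and so on), as well as the exact layer sizes and fusion details, are assumptions made for illustration and do not correspond to the authors' released implementation.

```python
import torch
import torch.nn as nn


class MotionAwareWeights(nn.Module):
    """Per-frame weights from mean-pooled text embeddings and latent features.
    Layer sizes and the fusion scheme here are illustrative assumptions."""

    def __init__(self, latent_dim, text_dim, hidden_dim=128):
        super().__init__()
        # temporal (frame-axis) 1D convolutions, ReLU in between, Sigmoid at the end
        self.net = nn.Sequential(
            nn.Conv1d(latent_dim + text_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, frame_latents, text_emb):
        # frame_latents: (B, T, C) spatially pooled latents; text_emb: (B, L, D)
        text_pooled = text_emb.mean(dim=1, keepdim=True).expand(-1, frame_latents.size(1), -1)
        fused = torch.cat([frame_latents, text_pooled], dim=-1)       # (B, T, C + D)
        return self.net(fused.transpose(1, 2)).transpose(1, 2)        # (B, T, 1), values in (0, 1)


class MotionAwareDualCrossAttention(nn.Module):
    """Dual cross-attention over text and image conditions; the text branch is
    modulated frame-wise by the motion-aware weights."""

    def __init__(self, dim, text_dim, image_dim, heads=8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, kdim=text_dim, vdim=text_dim, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, heads, kdim=image_dim, vdim=image_dim, batch_first=True)
        self.motion_weights = MotionAwareWeights(dim, text_dim)

    def forward(self, x, text_emb, image_emb):
        # x: (B, T, N, C) video latents; text_emb: (B, L, D_t); image_emb: (B, M, D_i)
        B, T, N, C = x.shape
        w = self.motion_weights(x.mean(dim=2), text_emb)              # (B, T, 1)
        x_flat = x.reshape(B * T, N, C)
        txt = text_emb.repeat_interleave(T, dim=0)
        img = image_emb.repeat_interleave(T, dim=0)
        out_txt, _ = self.text_attn(x_flat, txt, txt)
        out_img, _ = self.image_attn(x_flat, img, img)
        out_txt = out_txt.reshape(B, T, N, C) * w.unsqueeze(-1)       # frame-wise modulation
        return x + out_txt + out_img.reshape(B, T, N, C)


def spatiotemporal_attention(x, attn: nn.MultiheadAttention):
    """Spatiotemporal self-attention: every patch attends to all patches in all
    frames, instead of only to the same spatial position across frames."""
    B, T, N, C = x.shape
    x_st = x.reshape(B, T * N, C)                                     # one long token sequence
    out, _ = attn(x_st, x_st, x_st)
    return x + out.reshape(B, T, N, C)
```

In this sketch, the Sigmoid output gives each frame a weight in (0, 1) that scales the text cross-attention branch, so different frames can follow the prompt's motion description to different degrees, while the image branch supplies a stable appearance signal.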

2. Datasets and Training Process

The study used multiple public datasets for training and evaluation:

- LAION-5B: used to initialize the spatial weights.
- WebVid-10M: 10 million video clips; 24 frames are sampled per video at a resolution of 512×320 for the main training stage.
- InternVideo: 1,000 high-quality videos, used for watermark removal and further refinement of model performance.

During training, the spatial weights were frozen, and only the temporal modules and the motion-aware module were trained. The research team trained on 16 A100 40GB GPUs.
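A minimal sketch of this training recipe, under the assumption that the temporal and motion-aware parameters can be identified by name; the keyword matching and function name below are placeholders for illustration, not the authors' actual code.

```python
import torch


def make_optimizer(unet, lr=1e-4, trainable_keywords=("temporal", "motion_aware")):
    """Freeze all UNet parameters, then re-enable only those whose names contain
    one of the given keywords (a stand-in for the temporal / motion-aware modules)."""
    for name, param in unet.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)
    trainable = [p for p in unet.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```

Because the spatial weights are never updated, a pre-trained image ControlNet can later be attached for geometric conditioning without any additional video training, as noted above.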

3. Experimental Setup

Experiments were divided into multiple tasks, including personalized video generation, image animation, video editing, and text-to-video generation. Each task was evaluated using a combination of quantitative and qualitative analyses. For example, in the personalized video generation task, the DreamBooth dataset (containing 30 subjects, each with 4–7 text prompts) was used; for image animation, the I2V-Bench dataset (containing 2,950 high-resolution YouTube videos) was employed.


b) Key Research Findings

1. Effectiveness of the Motion-Aware Module

Table 6 shows the impact of the motion-aware module and the spatiotemporal attention layer on video quality and motion:

- Adding the motion-aware module reduced FVD (Fréchet Video Distance) from the 517 baseline to 498, with a significant improvement in motion realism (71% vs. 29%).
- Adding the spatiotemporal attention layer further raised the dynamic degree (91.2% vs. 60.3%) while keeping temporal consistency essentially unchanged (98.84% vs. 98.90%).

2. Advantages of Multimodal Conditioning

Table 7 compares using only text conditions with jointly conditioning on image and text:

- Joint conditioning significantly improved temporal consistency and subject consistency (94.3% vs. 84.5%).
- Image quality also improved (63.46% vs. 60.48%), with no adverse effect on the dynamic degree (91.2% vs. 91.4%).

3. Video Editing Capabilities

Table 3 presents MoonShot’s performance in video editing tasks. Compared to methods like FateZero and Pix2Video, MoonShot excelled in temporal consistency (98.6% vs. 96.5%) and user preference rates (72.4% vs. 18.2%).

4. Text-to-Video Generation

Evaluation results on the MSR-VTT dataset (Table 5) show that MoonShot outperformed existing methods in FID-VID, FVD, and CLIP-T metrics, demonstrating higher visual quality and semantic consistency in generated videos.
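For context, CLIP-T is commonly computed as the average CLIP similarity between the text prompt and each generated frame. The snippet below is a hedged sketch of such a metric using the Hugging Face transformers CLIP implementation; it is not the paper's evaluation code, and the checkpoint choice is an arbitrary assumption.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_t_score(frames, prompt):
    """Mean cosine similarity between the prompt and each frame in CLIP space.
    frames: list of PIL.Image frames from one generated video."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).mean().item()
```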


c) Research Conclusions and Significance

The MoonShot model significantly improves video generation quality and controllability by introducing motion-aware multimodal conditioning and spatiotemporal attention layers. Its main contributions are:

1. A motion-aware dual cross-attention layer that enables generated videos to precisely follow the motion described in prompts.
2. Image conditioning during video training, which provides ample visual signal so that the temporal modules can focus on temporal consistency and motion modeling.
3. Spatiotemporal attention layers in place of conventional temporal attention layers, enhancing large-scale motion dynamics.

This research not only provides a foundational tool for controllable video generation but also demonstrates broad application potential in personalized video generation, image animation, and video editing.


d) Research Highlights

  1. Innovative Approach: First to propose a motion-aware module and spatiotemporal attention layer, addressing shortcomings in motion modeling and temporal consistency in traditional methods.
  2. Efficiency: By fixing spatial weights, pre-trained Image ControlNet can be reused directly without additional video training.
  3. Versatility: Applicable to various generative tasks, including personalized video generation, image animation, and video editing.

e) Other Valuable Information

The research team also open-sourced the code and model weights, facilitating further exploration and application by academia and industry. In addition, MoonShot stands out in the dynamic degree and temporal consistency of its generated videos, offering a useful reference point for future video generation research.


Research Value and Significance

The introduction of the MoonShot model marks a significant breakthrough in the field of controllable video generation. Its innovative design philosophy and efficient implementation not only advance video generation technology but also provide strong technical support for practical applications such as film production, virtual reality, and advertising design. By combining image and text conditions, MoonShot achieves precise control over video appearance and geometric structure, laying a solid foundation for future multimodal generation research.