LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models

Academic Background

Diffusion Models (DMs) have driven breakthrough progress in image generation in recent years, and Text-to-Image (T2I) generation has achieved significant success as a result. Extending this technology to Text-to-Video (T2V) generation remains challenging. A video model must generate visually realistic frames, keep them temporally coherent, and preserve the creative generation capabilities of the pre-trained T2I model. Existing T2V methods typically train the entire system from scratch, which requires extensive computational resources and makes it difficult to balance video quality, training cost, and model composability.

To address these issues, this paper proposes LaVie, an integrated video generation framework based on Cascaded Video Latent Diffusion Models. LaVie effectively captures temporal correlations in video data by introducing simple temporal self-attention mechanisms and Rotary Positional Encoding (RoPE). Additionally, this paper introduces a diverse dataset, Vimeo25M, containing 25 million high-quality text-video pairs to further enhance the model’s generation performance.
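As a rough illustration of why rotary encoding suits temporal attention, the sketch below applies the standard RoPE rotation to a query and a key taken from two frames. The function names, the feature values, and the `base` constant are illustrative assumptions, not details from the paper.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by position-dependent angles."""
    assert len(vec) % 2 == 0, "feature dimension must be even"
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each pair has its own frequency; earlier pairs rotate faster.
        angle = position * base ** (-i / dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query/key features taken from two frames of a clip.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Both pairs of frames are 3 steps apart, so the attention scores match.
score_early = dot(rope(q, 2), rope(k, 5))
score_late = dot(rope(q, 10), rope(k, 13))
```

Because the score depends only on the offset between frame indices, the attention layer sees relative rather than absolute frame positions.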

Source of the Paper

This paper is a collaborative effort by researchers from Shanghai Artificial Intelligence Laboratory, Nanyang Technological University, The Chinese University of Hong Kong, and Monash University. The main authors include Yaohui Wang, Xinyuan Chen, Xin Ma, and others. The paper was published on October 28, 2024, in the International Journal of Computer Vision.

Research Process and Experimental Design

1. Research Process

The LaVie framework consists of three main modules: the Base T2V Model, the Temporal Interpolation Model, and the Video Super-Resolution Model. Each module is trained with text as conditioning information, and the modules are applied in cascade to generate high-resolution, temporally coherent videos.
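The cascade can be pictured as three functions applied in sequence. The sketch below tracks only the shape of the video through each stage, using the frame counts and final resolution reported in the module descriptions that follow. The function names, the 320×512 base resolution, and the 4× upscaling factor are assumptions chosen so the numbers line up; none of this is the authors' code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VideoShape:
    frames: int
    height: int
    width: int

def base_t2v(prompt: str) -> VideoShape:
    # Assumed base resolution; the 16 keyframes come from the article.
    return VideoShape(frames=16, height=320, width=512)

def temporal_interpolation(video: VideoShape, prompt: str) -> VideoShape:
    # Assumes three new frames between each adjacent pair of keyframes.
    return VideoShape(4 * (video.frames - 1) + 1, video.height, video.width)

def super_resolution(video: VideoShape, prompt: str, scale: int = 4) -> VideoShape:
    # Assumed 4x spatial upscaling; the frame count is unchanged.
    return VideoShape(video.frames, video.height * scale, video.width * scale)

def lavie_cascade(prompt: str) -> VideoShape:
    # Every stage receives the same text prompt as conditioning.
    video = base_t2v(prompt)
    video = temporal_interpolation(video, prompt)
    return super_resolution(video, prompt)

final = lavie_cascade("a panda playing guitar")
```

Under these assumptions `final` describes a 61-frame clip at 1280×2048.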

a) Base T2V Model

The Base T2V Model is responsible for generating low-resolution video keyframes. This model is based on the pre-trained Stable Diffusion model, extending the original 2D UNet architecture by introducing temporal convolutional layers and spatio-temporal Transformer modules. Specifically, the model expands 2D convolutional kernels into pseudo-3D convolutional kernels and adds temporal attention layers after each spatial attention layer. This allows the model to capture spatio-temporal correlations in videos.
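To make the role of the temporal attention layer concrete, here is a minimal sketch in which attention runs along the time axis only, separately for each spatial location. It uses a single head with identity query, key, and value projections, which is a simplification for illustration; the real layers have learned projections, and all names here are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(sequence):
    """Single-head self-attention with identity Q/K/V projections."""
    dim = len(sequence[0])
    out = []
    for query in sequence:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in sequence]
        weights = softmax(scores)
        out.append([sum(w * value[c] for w, value in zip(weights, sequence))
                    for c in range(dim)])
    return out

def temporal_attention(video):
    """video[t][h][w] is a feature vector; features mix along t only."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    result = [[[None] * W for _ in range(H)] for _ in range(T)]
    for h in range(H):
        for w in range(W):
            # Gather the same pixel across all frames into one sequence.
            mixed = attend([video[t][h][w] for t in range(T)])
            for t in range(T):
                result[t][h][w] = mixed[t]
    return result

# Two frames, one row, two pixels, two channels. The second pixel is static.
video = [
    [[[1.0, 0.0], [0.0, 2.0]]],
    [[[0.0, 1.0], [0.0, 2.0]]],
]
out = temporal_attention(video)
```

The static pixel passes through unchanged, while the changing pixel becomes a weighted blend of its own values across frames. Spatial attention, by contrast, would mix pixels within a single frame.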

To further improve generation quality, the paper employs a joint image-video fine-tuning strategy. Specifically, the model processes both image and video data during training, jointly optimizing the loss functions for images and videos, thereby avoiding the “catastrophic forgetting” issue that arises when fine-tuning solely on video data. Experiments show that this joint fine-tuning strategy significantly enhances the quality and diversity of video generation.
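One way to read "jointly optimizing the loss functions" is as a weighted sum of two noise-prediction losses, one computed on a video batch and one on an image batch. The sketch below shows that reading; the weight `alpha` and the function names are assumptions for illustration, and the exact objective used in the paper may differ.

```python
def mse(pred, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def joint_loss(video_pred, video_noise, image_pred, image_noise, alpha=1.0):
    """Video denoising loss plus a weighted image denoising loss.

    `alpha` is a hypothetical balancing weight.
    """
    return mse(video_pred, video_noise) + alpha * mse(image_pred, image_noise)

# Toy flattened noise predictions and targets.
loss = joint_loss(
    video_pred=[0.0, 1.0], video_noise=[0.0, 0.0],   # video MSE = 0.5
    image_pred=[2.0, 0.0], image_noise=[0.0, 0.0],   # image MSE = 2.0
    alpha=0.5,
)
```

Keeping the image term in the objective keeps gradients flowing from image data, which is the mechanism the article credits for avoiding catastrophic forgetting.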

b) Temporal Interpolation Model

The Temporal Interpolation Model aims to increase the frame rate of generated videos and add temporal details. This model is based on a diffusion UNet architecture, taking 16-frame base videos as input and producing 61-frame high-frame-rate videos. During training, the model learns the denoising process by concatenating base video frames with noisy frames, thereby generating interpolated frames. Unlike traditional video interpolation methods, LaVie’s interpolation model generates entirely new frames rather than simply interpolating input frames.
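The 16-to-61 frame counts are consistent with three new frames between each adjacent pair of base frames, since 16 + 15 × 3 = 61. The sketch below shows where the base frames would sit in the longer clip and one plausible way to concatenate them with the noisy frames. Concatenating along the channel axis and zero-filling the missing positions are assumptions for illustration, as are the function names.

```python
def keyframe_positions(num_keyframes, new_between=3):
    """Indices that the base frames occupy in the interpolated clip."""
    stride = new_between + 1
    return [i * stride for i in range(num_keyframes)]

def build_denoiser_input(base_frames, noisy_frames, new_between=3):
    """Pair every noisy frame with a conditioning frame along the channel axis.

    Frames are flat lists of channel values. Positions without a base frame
    get zero-filled conditioning, which is an illustrative choice.
    """
    positions = keyframe_positions(len(base_frames), new_between)
    assert len(noisy_frames) == positions[-1] + 1
    channels = len(base_frames[0])
    condition = [[0.0] * channels for _ in noisy_frames]
    for pos, frame in zip(positions, base_frames):
        condition[pos] = list(frame)
    # Channel-wise concatenation: conditioning channels first, then noise.
    return [cond + list(noisy) for cond, noisy in zip(condition, noisy_frames)]

base = [[float(i)] for i in range(16)]      # 16 one-channel base frames
noisy = [[-1.0] for _ in range(61)]         # 61 one-channel noisy frames
denoiser_input = build_denoiser_input(base, noisy)
```

Each of the 61 inputs then carries two channels, and the base frames land at indices 0, 4, 8, and so on up to 60.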

c) Video Super-Resolution Model

The Video Super-Resolution Model is used to enhance the spatial resolution of generated videos. This model is based on a pre-trained diffusion image upscaler, extending the original 2D architecture by introducing temporal convolutional layers and attention layers. During training, the model uses low-resolution videos as strong conditioning inputs to generate high-resolution video frames. Ultimately, LaVie can generate high-definition videos at 1280×2048 resolution.

2. Experimental Results

a) Qualitative Evaluation

LaVie excels in generating diverse video content. Experimental results show that the model can generate videos containing animals, movie characters, and complex scenes while maintaining high temporal and spatial coherence. Compared to existing T2V generation methods, LaVie demonstrates significant advantages in visual quality and creativity.

b) Quantitative Evaluation

In zero-shot evaluations on the UCF101 and MSR-VTT datasets, LaVie outperforms existing T2V generation methods on both Fréchet Video Distance (FVD, lower is better) and CLIP Similarity (higher is better). On UCF101 in particular, LaVie’s FVD score is significantly lower than that of other methods, indicating higher video generation quality.
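For readers unfamiliar with the two metrics, the following sketch shows the underlying computations on toy feature vectors. It is a simplification: the actual FVD uses features from a pre-trained video network and full covariance matrices, whereas this version assumes diagonal covariances, and the embeddings stand in for real CLIP outputs. All function names are hypothetical.

```python
import math
from statistics import fmean, pstdev

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def clip_similarity(text_embedding, frame_embeddings):
    """Average text-frame cosine similarity over the frames of one video."""
    return fmean([cosine(text_embedding, f) for f in frame_embeddings])

def frechet_distance_diagonal(real_features, generated_features):
    """Fréchet distance between two Gaussians with diagonal covariances.

    Per feature dimension: (mu_r - mu_g)^2 + (sigma_r - sigma_g)^2.
    """
    total = 0.0
    for real_dim, gen_dim in zip(zip(*real_features), zip(*generated_features)):
        total += (fmean(real_dim) - fmean(gen_dim)) ** 2
        total += (pstdev(real_dim) - pstdev(gen_dim)) ** 2
    return total

real = [[0.0, 0.0], [2.0, 2.0]]        # toy features of real videos
generated = [[1.0, 0.0], [3.0, 2.0]]   # toy features of generated videos
distance = frechet_distance_diagonal(real, generated)
similarity = clip_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

A distance of zero means the two feature distributions have matching means and variances, which is why lower FVD indicates generated videos that are statistically closer to real ones.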

c) Human Evaluation

In a large-scale human evaluation, LaVie received high scores for video quality, motion smoothness, and subject consistency. Compared to existing T2V generation methods, LaVie performs exceptionally well across multiple evaluation dimensions, especially in generating high-quality facial and hand details.

3. Conclusion

The LaVie framework proposed in this paper successfully achieves high-quality, temporally coherent video generation through cascaded video latent diffusion models. By introducing simple temporal self-attention mechanisms and joint image-video fine-tuning strategies, LaVie has made significant progress in video generation quality and diversity. Additionally, the Vimeo25M dataset introduced in this paper provides high-quality training data for T2V generation tasks, further enhancing the model’s performance.

LaVie not only excels in zero-shot T2V generation tasks but also demonstrates its flexibility and effectiveness in downstream tasks such as long video generation and personalized video generation. In the future, LaVie is expected to play an important role in fields such as filmmaking, video games, and artistic creation.

Research Highlights

  1. High-Quality Video Generation: LaVie generates visually realistic, temporally coherent, high-resolution videos through cascaded diffusion models.
  2. Joint Image-Video Fine-Tuning: By jointly optimizing image and video loss functions, LaVie avoids catastrophic forgetting, significantly improving generation quality.
  3. Vimeo25M Dataset: The high-quality dataset proposed in this paper provides rich training data for T2V generation tasks, further enhancing model performance.
  4. Broad Applications: LaVie not only excels in T2V generation tasks but also demonstrates its potential in long video generation and personalized video generation tasks.

Research Significance

The LaVie work contributes new ideas and methods to the T2V generation field. Its simple temporal self-attention mechanisms and joint image-video fine-tuning strategy improve both the quality and the diversity of generated videos, and the Vimeo25M dataset offers a valuable data resource for future T2V research. Beyond advancing video generation technology, LaVie opens up new possibilities for fields such as filmmaking, video games, and artistic creation.