AniClipart: Clipart Animation with Text-to-Video Priors

Academic Background and Problem Statement

Clipart, a pre-made graphic art form, is widely used in documents, presentations, and websites to quickly enhance visual content. Traditional workflows for converting static clipart into motion sequences, however, are laborious and time-consuming, often involving intricate steps such as rigging, keyframing, and inbetweening. Recent advances in text-to-video generation hold great potential for addressing this issue. Yet existing text-to-video models, when applied directly, often fail to retain the visual identity of the clipart or to generate cartoon-style motion, resulting in unsatisfactory animations.

This paper introduces AniClipart, a system designed to transform static clipart into high-quality motion sequences guided by text-to-video priors. By defining Bézier curves over keypoints as motion trajectories and optimizing the Video Score Distillation Sampling (VSDS) loss, AniClipart extracts natural motion knowledge from pre-trained text-to-video diffusion models to generate smooth, cartoon-style animations. Additionally, AniClipart incorporates a differentiable As-Rigid-As-Possible (ARAP) shape deformation algorithm to maintain the rigidity of the clipart during animation.

Paper Source and Author Information

This paper is co-authored by Ronghuan Wu, Wanchao Su, Kede Ma, and Jing Liao, affiliated with City University of Hong Kong and Monash University. The paper was initially submitted on March 31, 2024, and accepted by the International Journal of Computer Vision on November 18, 2024.

Research Process and Methodology

1. Clipart Preprocessing

Before generating animations, the clipart undergoes preprocessing, similar to the rigging process in traditional animation production. This step includes the following:

  • Keypoint Detection: The UniPose algorithm is used to detect keypoints on the clipart and to construct the skeleton that connects them. UniPose is an end-to-end, prompt-driven keypoint detection framework capable of identifying keypoints across a wide range of objects, including articulated, rigid, and soft objects.

  • Skeleton Generation: For broader categories such as sea animals and plants, a three-step approach is used to generate skeletons: first, the colorful clipart is converted into a binary image, and boundary points are detected; then, straight skeletons are generated by propagating edges inward; finally, the skeletons are pruned and simplified to remove unnecessary details.

  • Triangular Mesh Construction: A triangulation algorithm is used to construct a triangular mesh over the clipart, facilitating shape deformation in subsequent steps.
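
As an illustration of the first skeleton-generation step (binarization and boundary detection), the following is a minimal sketch rather than the authors' implementation. It assumes the clipart is available as a grid of RGBA pixels and treats any pixel with non-zero alpha as foreground; the function names `binarize` and `boundary_points`, the RGBA layout, and the 4-neighbour boundary rule are all assumptions made for this example.

```python
from typing import List, Tuple

# Assumed pixel layout for this sketch: (R, G, B, A) with values in 0..255.
Pixel = Tuple[int, int, int, int]


def binarize(image: List[List[Pixel]], alpha_threshold: int = 0) -> List[List[int]]:
    """Mark a pixel as foreground (1) when its alpha exceeds the threshold."""
    return [[1 if px[3] > alpha_threshold else 0 for px in row] for row in image]


def boundary_points(mask: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (row, col) of foreground pixels that touch background or the image edge."""
    height, width = len(mask), len(mask[0])
    points = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c]:
                continue  # background pixels are never boundary points
            # Check the four direct neighbours (up, down, left, right).
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                outside = not (0 <= nr < height and 0 <= nc < width)
                if outside or not mask[nr][nc]:
                    points.append((r, c))
                    break
    return points
```

The boundary points found this way would serve as the starting contour for the inward edge propagation that produces the straight skeleton; that step and the pruning step are not shown here.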

2. Bézier-Driven Animation

To generate smooth animations, AniClipart defines Bézier curves as motion trajectories for each keypoint. The specific steps are as follows:

  • Bézier Curve Initialization: A cubic Bézier curve is assigned to each keypoint, with the curve’s starting point aligned precisely with the keypoint’s initial position. The remaining three control points are randomly initialized to ensure moderate initial motion.

  • Keyframe Generation: At each timestep, points are sampled along the Bézier curves to determine new positions for the keypoints. The ARAP algorithm is then used to adjust the entire clipart based on these new positions, generating new frames.

  • Video Generation: A differentiable renderer converts the deformed clipart into pixel images, which are then temporally stacked to create the final animation video.
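
The initialization and sampling steps above can be sketched in a few lines. This is an illustrative example under stated assumptions, not the paper's code: the helper names (`init_curve`, `bezier_point`, `sample_trajectory`), the `max_offset` parameter that bounds the random initialization, and the evenly spaced sampling of the curve parameter are all hypothetical choices. The ARAP deformation and differentiable rendering that follow each sampled position are omitted.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


def init_curve(keypoint: Point, max_offset: float = 5.0, seed: int = 0) -> List[Point]:
    """Build four control points: the keypoint itself, then three small random offsets."""
    rng = random.Random(seed)
    x, y = keypoint
    others = [
        (x + rng.uniform(-max_offset, max_offset),
         y + rng.uniform(-max_offset, max_offset))
        for _ in range(3)
    ]
    return [keypoint] + others


def bezier_point(ctrl: List[Point], t: float) -> Point:
    """Evaluate a cubic Bezier curve at t in [0, 1] using the Bernstein form."""
    s = 1.0 - t
    weights = (s ** 3, 3 * s ** 2 * t, 3 * s * t ** 2, t ** 3)
    return (
        sum(w * p[0] for w, p in zip(weights, ctrl)),
        sum(w * p[1] for w, p in zip(weights, ctrl)),
    )


def sample_trajectory(ctrl: List[Point], num_frames: int) -> List[Point]:
    """Sample one keypoint position per frame (requires num_frames >= 2)."""
    return [bezier_point(ctrl, i / (num_frames - 1)) for i in range(num_frames)]
```

Because the first control point coincides with the keypoint, the first sampled frame reproduces the original pose; keeping `max_offset` small corresponds to the "moderate initial motion" described above.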

3. Loss Functions

To ensure that the generated animations align with the text prompts and preserve the visual identity of the clipart, AniClipart introduces two loss functions:

  • Video Score Distillation Sampling Loss (VSDS Loss): The generated video is perturbed with random noise and passed to a pre-trained text-to-video diffusion model conditioned on the text prompt, and the discrepancy between the noise the model predicts and the noise actually added is computed. This loss optimizes the parameters of the Bézier curves, ensuring that the animation aligns with the text description.

  • Skeleton Loss: To maintain the structural integrity of the clipart, the skeleton loss constrains the length variation of the skeletal components, ensuring minimal deviation from their original measurements.

The final loss function is a weighted sum of the VSDS loss and the skeleton loss, optimized using the Adam optimizer.
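
To make the skeleton term and the weighted sum concrete, here is a small plain-Python sketch. It rests on several assumptions: the source does not give the exact form of the skeleton loss, so a mean squared change in segment length is used here; the names `bone_lengths`, `skeleton_loss`, `total_loss`, and `skeleton_weight` are hypothetical; and the VSDS value is passed in as a plain number because computing it requires a diffusion model. A real implementation would express the same arithmetic in an automatic-differentiation framework so that Adam can update the Bézier control points.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Bone = Tuple[int, int]  # indices of the two keypoints that a skeletal segment connects


def bone_lengths(keypoints: Sequence[Point], bones: Sequence[Bone]) -> List[float]:
    """Euclidean length of every skeletal segment for one set of keypoints."""
    return [math.dist(keypoints[i], keypoints[j]) for i, j in bones]


def skeleton_loss(rest: Sequence[Point],
                  frames: Sequence[Sequence[Point]],
                  bones: Sequence[Bone]) -> float:
    """Mean squared change in segment length across all frames and segments (assumed form)."""
    rest_len = bone_lengths(rest, bones)
    total = 0.0
    for frame in frames:
        for current, reference in zip(bone_lengths(frame, bones), rest_len):
            total += (current - reference) ** 2
    return total / (len(frames) * len(bones))


def total_loss(vsds_value: float, skeleton_value: float,
               skeleton_weight: float = 1.0) -> float:
    """Weighted sum of the two terms; the weight is a tunable hyperparameter."""
    return vsds_value + skeleton_weight * skeleton_value
```

A frame that only translates the skeleton leaves every segment length unchanged and incurs zero penalty, whereas stretching a segment is penalized in proportion to the squared change in its length.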

Experimental Results and Conclusions

1. Experimental Results

Compared with existing image-to-video generation models, AniClipart achieves better text-video alignment, visual identity preservation, and motion consistency. It also shows flexibility in handling more complex animation formats, such as layered animation.

2. Conclusions

AniClipart successfully achieves the goal of generating high-quality clipart animations from text descriptions by defining Bézier curves as motion trajectories for keypoints, combined with VSDS loss and skeleton loss. The system extracts motion priors from pre-trained text-to-video diffusion models without requiring additional training datasets and maintains the rigidity of the clipart through the ARAP deformation algorithm. Experimental results confirm that AniClipart outperforms existing methods in terms of animation quality and flexibility.

3. Research Highlights

  • Automatic Animation Generation: AniClipart automates the generation of clipart animations based on text descriptions, significantly reducing the workload of traditional animation production.

  • Motion Trajectory Optimization: Through Bézier curves and VSDS loss, AniClipart generates semantically meaningful motions while maintaining cartoon-style clipart motion patterns.

  • Shape Preservation: By integrating the ARAP deformation algorithm and skeleton loss, AniClipart effectively preserves the visual identity of the clipart during animation.

Future Work and Limitations

Despite its impressive performance, AniClipart has some limitations. For example, its performance degrades when processing complex scenes or clipart with multiple objects. Future research plans include further automating the keypoint detection and layered animation generation processes, as well as exploring methods to better handle multi-object animations in complex scenes.

Summary

AniClipart provides an efficient and flexible solution for generating clipart animations, capable of producing high-quality motion sequences based on text descriptions. By combining Bézier curves, VSDS loss, and the ARAP deformation algorithm, AniClipart addresses the laborious aspects of traditional animation production and offers a new direction for future research in automatic animation generation.