AutoStory: Generating Diverse Storytelling Images with Minimal Human Efforts
Academic Background and Problem Statement
Story visualization aims to generate a series of visually consistent images from a story described in text. The generated images must be of high quality, aligned with the text description, and consistent in character identity across images. Despite its wide range of applications in art creation, children's education, and cultural heritage, the task is complex, and existing methods simplify it significantly, for example by considering only specific characters and scenarios or by requiring users to provide per-image control conditions (e.g., sketches). These simplifications render existing methods inadequate for real-world applications.
To address these issues, this paper proposes an automated story visualization system capable of generating diverse, high-quality, and consistent story images with minimal human interaction. Specifically, the authors leverage the comprehension and planning capabilities of large language models (LLMs) for layout planning and then utilize large-scale text-to-image models to generate complex story images based on the layout. This approach not only improves the quality of image generation but also allows users to adjust the results through simple interactions.
Paper Source and Author Information
This paper is co-authored by Wen Wang, Canyu Zhao, Hao Chen, Zhekai Chen, Kecheng Zheng, and Chunhua Shen, affiliated with Zhejiang University and its State Key Lab of CAD&CG. The paper was published in the International Journal of Computer Vision (Springer) on November 18, 2024.
Research Process and Experimental Design
1. Layout Generation Stage
In the layout generation stage, the authors first use large language models (LLMs) to convert user-input text stories into image layouts. The specific steps are as follows:
- Story Preprocessing: The user input can be either a complete story or a simple description. If the input is a simple description, the authors use an LLM to generate the specific story content.
- Story Segmentation: The generated story is divided into multiple panels, each corresponding to a story image.
- Layout Generation: The LLM extracts the scene layout from each panel description, producing a global prompt for the whole image and a set of local prompts, each paired with a bounding box (a minimal sketch of this step follows the list).
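As a rough illustration of the structured output the layout stage relies on, the sketch below defines a layout data structure and a prompt template. The JSON schema, prompt wording, class names, and example values are assumptions for illustration, not the paper's actual prompts.

```python
# Illustrative only: the prompt wording and JSON schema are assumptions,
# not the paper's exact prompts. Boxes are (x0, y0, x1, y1) in [0, 1].
import json
from dataclasses import dataclass

@dataclass
class PanelLayout:
    global_prompt: str                               # describes the whole image
    local_prompts: list[str]                         # one per subject
    boxes: list[tuple[float, float, float, float]]   # one box per local prompt

LAYOUT_INSTRUCTION = (
    "For the story panel below, return JSON with keys 'global_prompt', "
    "'local_prompts', and 'boxes' (normalized [x0, y0, x1, y1], one box per "
    "local prompt).\nPanel: {panel_text}"
)

def parse_layout(llm_json: str) -> PanelLayout:
    """Parse the LLM's JSON reply into a layout object."""
    data = json.loads(llm_json)
    return PanelLayout(
        global_prompt=data["global_prompt"],
        local_prompts=data["local_prompts"],
        boxes=[tuple(b) for b in data["boxes"]],
    )

# Example reply an LLM might return for one panel (hand-written here).
example_reply = json.dumps({
    "global_prompt": "a boy and his dog walking in a sunny park",
    "local_prompts": ["a boy in a red jacket", "a small white dog"],
    "boxes": [[0.10, 0.25, 0.45, 0.95], [0.55, 0.55, 0.85, 0.95]],
})
print(parse_layout(example_reply))
```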
2. Dense Condition Generation Stage
In the dense condition generation stage, the authors propose a method to transform sparse bounding box layouts into dense control conditions (e.g., sketches or keypoints) to improve the quality of image generation. The specific steps are as follows:
- Single Object Generation: Individual objects are generated based on the local prompts.
- Extracting Dense Conditions: Open-vocabulary object detection methods (e.g., Grounding-DINO) are used to localize objects, and SAM (Segment Anything Model) is used to obtain object segmentation masks. Then, PIDINet is used to extract object edges as sketch control conditions, or HRNet is used to obtain human pose keypoints.
- Composing Dense Conditions: The per-object dense conditions are pasted into their corresponding bounding-box regions in the layout, yielding a dense condition map for the entire image (see the compositing sketch after this list).
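The compositing step is simple enough to sketch directly. The snippet below assumes the per-object dense conditions (e.g., edge maps) have already been extracted by the detection, segmentation, and edge tools named above, and only shows how they are pasted into their layout boxes to form a full-image condition map. The function name and the overlap rule (keep the stronger response) are illustrative choices, not the paper's exact implementation.

```python
# Minimal compositing sketch: paste per-object condition maps into their boxes.
import numpy as np
from PIL import Image

def compose_dense_condition(object_maps, boxes, canvas_hw=(512, 512)):
    """Paste per-object condition maps into their layout boxes.

    object_maps: list of 2D uint8 arrays (one single-object sketch each)
    boxes:       list of (x0, y0, x1, y1) in normalized [0, 1] coordinates
    Returns a single 2D uint8 condition map for the whole image.
    """
    H, W = canvas_hw
    canvas = np.zeros((H, W), dtype=np.uint8)
    for cond, (x0, y0, x1, y1) in zip(object_maps, boxes):
        # Convert the normalized box to pixel coordinates.
        px0, py0 = int(x0 * W), int(y0 * H)
        px1, py1 = int(x1 * W), int(y1 * H)
        if px1 <= px0 or py1 <= py0:
            continue
        # Resize the object's condition map to its box and paste it in,
        # keeping the stronger response where objects overlap.
        resized = np.array(
            Image.fromarray(cond).resize((px1 - px0, py1 - py0), Image.BILINEAR)
        )
        canvas[py0:py1, px0:px1] = np.maximum(canvas[py0:py1, px0:px1], resized)
    return canvas

# Toy usage with random arrays standing in for real extracted edge maps.
maps = [np.random.randint(0, 255, (128, 128), dtype=np.uint8) for _ in range(2)]
boxes = [(0.10, 0.25, 0.45, 0.95), (0.55, 0.55, 0.85, 0.95)]
full_condition = compose_dense_condition(maps, boxes)
```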
3. Conditional Image Generation Stage
In the conditional image generation stage, the authors generate the final story images based on the layout and dense control conditions. The specific steps are as follows:
- Sparse Layout Control: The bounding-box layout produced by the LLM constrains where each subject appears, so the generated images conform to the planned layout.
- Dense Control: T2I-Adapter is used to inject dense control conditions into the image generation process, further improving image quality.
- Identity Consistency Preservation: The Mix-of-Show method keeps character identities consistent across the generated images (a hedged generation sketch follows this list).
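As a hedged sketch of the dense-control step only, the snippet below uses the Hugging Face diffusers T2I-Adapter pipeline to condition generation on a sketch map. The base model and adapter checkpoints named here are illustrative, and the paper's sparse-layout control and Mix-of-Show identity weights are not reproduced.

```python
# Hedged sketch: sketch-conditioned generation with a T2I-Adapter via diffusers.
# Checkpoints are illustrative; layout control and Mix-of-Show are omitted.
import torch
from PIL import Image
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter

adapter = T2IAdapter.from_pretrained(
    "TencentARC/t2iadapter_sketch_sd15v2", torch_dtype=torch.float16
)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", adapter=adapter, torch_dtype=torch.float16
).to("cuda")

# "dense_condition.png" stands in for the composed condition map produced
# in the previous stage.
sketch = Image.open("dense_condition.png").convert("L")

image = pipe(
    prompt="a boy in a red jacket and a small white dog in a sunny park",
    image=sketch,
    num_inference_steps=30,
    adapter_conditioning_scale=0.8,  # how strongly the sketch guides sampling
).images[0]
image.save("panel_01.png")
```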
4. Character Data Generation
To eliminate the burden of users collecting character data, the authors propose a training-free consistency modeling method by treating multi-view images as a video and jointly generating textures to ensure identity consistency. Additionally, the authors leverage 3D priors to generate diverse character images, ensuring that the generated data is both consistent and diverse.
Main Results and Conclusions
1. Main Results
Through experiments, the authors demonstrate the superiority of their method in generating high-quality, text-aligned, and identity-consistent story images. Whether users provide reference character images or only text, the method produces satisfactory results, outperforming existing methods on both text-to-image similarity and image-to-image similarity.
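Such similarity metrics are commonly computed from CLIP embeddings; the sketch below shows one way to do so under that assumption, using the Hugging Face transformers CLIP model. It is not the authors' evaluation code.

```python
# Hedged sketch of CLIP-based similarity metrics (assumed, not the paper's code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_text_image_score(text: str, image: Image.Image) -> float:
    """Cosine similarity between a panel caption and a generated panel."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        txt = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(txt, img).item()

def clip_image_image_score(image_a: Image.Image, image_b: Image.Image) -> float:
    """Cosine similarity between two panels, a proxy for identity consistency."""
    inputs = processor(images=[image_a, image_b], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(feats[0:1], feats[1:2]).item()
```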
2. Conclusions
The proposed AutoStory system combines large language models and large-scale text-to-image models to achieve high-quality, diverse, and consistent story image generation. The method not only reduces user workload but also removes the need for users to collect character data by generating it automatically. Experiments show that it outperforms existing methods in generation quality and character consistency, and it generalizes readily to different characters, scenes, and styles without time-consuming large-scale training.
Highlights and Innovations of the Research
- Fully Automated Story Visualization Pipeline: AutoStory can generate diverse, high-quality, and consistent story images with minimal user input.
- Combination of Sparse and Dense Control Conditions: Sparse control signals are used for layout generation, while dense control signals are used for high-quality image generation. A simple yet effective dense condition generation module is proposed.
- Multi-View Consistent Character Generation: A method is proposed to generate multi-view consistent character images without requiring users to draw or collect character images.
- Flexible User Interaction: Users can adjust the generated results through simple interactions, such as providing character images, adjusting layouts, or sketching.
Significance and Value of the Research
This research holds significant scientific and practical value for story visualization. By combining large language models and large-scale text-to-image models, AutoStory improves the quality and consistency of image generation while greatly reducing user workload. The method has broad application prospects in art creation, children's education, and cultural heritage, providing users with rich tools for visual expression.
Other Valuable Information
The paper also demonstrates the adaptability of AutoStory in various scenarios, such as generating characters with specific appearances, side views, zoomed-out views, environment-focused images, and characters with different emotions. Additionally, the authors explore the challenges of generating multi-character story images and propose future improvements, such as handling multi-character scenes by generating individual high-quality character images and stitching them into story images.
Through innovative methods and techniques, this paper brings new breakthroughs to the field of story visualization, showcasing the great potential of automated high-quality story image generation.