Smaller but Better: Unifying Layout Generation with Smaller Large Language Models

A New Advance in Unified Layout Generation Research: Smaller but Stronger Large Language Models

Research Background and Problem Statement

Layout generation is an important research direction in computer vision and human-computer interaction, aiming to automatically produce graphical interfaces or layout designs that satisfy given requirements. For example, scientific articles, application interfaces (App UI), magazine pages, and slides all call for efficient and flexible layout generation. However, traditional methods are typically optimized for a single task or a single domain and lack cross-task, cross-domain versatility. With the development of deep learning, Transformer-based methods have gradually become mainstream, but they still suffer from high model complexity and large computational cost.

In recent years, large language models (LLMs) have made remarkable progress in natural language processing (NLP), and their strong reasoning capabilities open new possibilities for complex tasks. However, applying LLMs to unified layout generation is still in its infancy. Existing methods suffer from three limitations: 1) enormous model sizes (e.g., 175B parameters), which lead to high training and deployment costs; 2) reliance on verbose formats such as HTML as input templates, which introduces unnecessary symbol noise; and 3) restriction to specific tasks or domains, which prevents true versatility.

To address these issues, a research team from South China University of Technology proposed LGGPT, a unified layout generation framework built on a smaller LLM, which aims to substantially reduce computational overhead while preserving performance through an innovative input-output template and a quantization-based encoding strategy.

Source of the Paper

This paper was co-authored by Peirong Zhang, Jiaxin Zhang, Jiahuan Cao, Hongliang Li, and Lianwen Jin from the School of Electronic and Information Engineering at South China University of Technology. It was published in January 2025 in the International Journal of Computer Vision. The title of the paper is “Smaller but Better: Unifying Layout Generation with Smaller Large Language Models.”


Research Content and Methods

a) Research Process

1. Data Preprocessing

The research team integrated five publicly available datasets from four domains, including scientific articles (PubLayNet), App UI (RICO), magazines (Magazine), and slides (Slide). The datasets were standardized: all layout element labels were converted to lowercase, and layouts were rescaled proportionally so that the longer side does not exceed 1024 pixels. The data were also filtered and split to keep the training and test set ratios consistent for fair comparison.
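The described normalization can be captured in a few lines. The sketch below assumes a generic (x, y, w, h) bounding-box annotation format; the field names are illustrative, not the authors' actual data schema.

```python
# Minimal preprocessing sketch, assuming a generic bounding-box annotation format;
# field names ("label", "bbox") are illustrative, not the paper's data schema.
def preprocess_layout(elements, canvas_w, canvas_h, long_side=1024):
    """Lowercase labels and rescale boxes so the longer canvas side becomes `long_side`."""
    scale = long_side / max(canvas_w, canvas_h)
    processed = []
    for el in elements:
        x, y, w, h = el["bbox"]  # assumed (x, y, width, height) in pixels
        processed.append({
            "label": el["label"].lower(),                      # unify label casing
            "bbox": [round(v * scale) for v in (x, y, w, h)],  # proportional rescale
        })
    return processed, (round(canvas_w * scale), round(canvas_h * scale))

# Example: a 2550x3300 PubLayNet-style page is scaled so its long side is 1024.
elements = [{"label": "Title", "bbox": [300, 200, 1900, 120]}]
boxes, canvas = preprocess_layout(elements, canvas_w=2550, canvas_h=3300)
print(boxes, canvas)
```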

2. Model Design

The core of LGGPT is a small LLM (GPT2-XL) with 1.5B parameters, augmented with two key techniques (a toy sketch of both follows below):
- Arbitrary Layout Instruction (ALI): a unified input template that supports arbitrary combinations of layout conditions. It consists of a prefix prompt and a body prompt, describing the layout type, number of elements, column count, and element-level attribute conditions.
- Interval Quantization Encoding (IQE): instead of traditional placeholder tokens, IQE attaches an independent interval value to each geometric attribute, which compresses the input sequence and increases information density.
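As a rough illustration of these two ideas, the sketch below builds an ALI-style prompt (prefix plus body) and quantizes each geometric attribute into an interval index instead of padding with placeholders. The token vocabulary, template wording, and bin count are assumptions for illustration, not the paper's exact specification.

```python
# Illustrative ALI-style prompt with IQE-style coordinates.
# Template wording, separators, and NUM_BINS are assumptions, not the paper's spec.

NUM_BINS = 128  # assumed quantization granularity

def iqe_token(value, max_value, num_bins=NUM_BINS):
    """Quantize a geometric attribute into an interval index token (no placeholder needed)."""
    idx = min(int(value / max_value * num_bins), num_bins - 1)
    return str(idx)

def build_prompt(domain, task, elements, canvas=(1024, 1024)):
    # Prefix prompt: global layout conditions (domain, task, element count, ...).
    prefix = f"domain:{domain} task:{task} elements:{len(elements)}"
    # Body prompt: one compact segment per element; unknown attributes are simply
    # omitted rather than padded with placeholder tokens.
    body = []
    for el in elements:
        attrs = [el["label"]]
        if "bbox" in el:
            x, y, w, h = el["bbox"]
            attrs += [iqe_token(x, canvas[0]), iqe_token(y, canvas[1]),
                      iqe_token(w, canvas[0]), iqe_token(h, canvas[1])]
        body.append(" ".join(attrs))
    return prefix + " | " + " ; ".join(body)

print(build_prompt("publaynet", "gen-ts",
                   [{"label": "title", "bbox": [96, 62, 590, 37]},
                    {"label": "text"}]))
```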

3. Model Training

LGGPT is trained with teacher forcing: the ground-truth layout output is appended to the input prompt to form the complete training sequence, and the optimization objective is to minimize the negative log-likelihood of the predicted layout tokens. During training, the team used a mixed sampling strategy that combines multi-condition generation tasks (such as completion and relation-constrained generation) with single-type generation tasks (such as unconditional generation).
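A minimal teacher-forcing step of this kind can be sketched with the Hugging Face Transformers library, using the small "gpt2" checkpoint as a stand-in for GPT2-XL. The prompt and target strings are placeholders; masking the prompt tokens with -100 restricts the negative log-likelihood loss to the layout tokens, mirroring the objective described above (this is a generic sketch, not the authors' training code).

```python
# Teacher-forcing sketch with Hugging Face GPT-2 (swap "gpt2" for "gpt2-xl" to match
# the 1.5B model in the paper). Prompt/target strings are illustrative placeholders.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "domain:publaynet task:completion elements:2 | title 12 7 74 4 ;"  # ALI-style input (illustrative)
target = " text 12 14 74 40"                                                # ground-truth layout (illustrative)

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
target_ids = tokenizer(target, return_tensors="pt").input_ids

# Concatenate prompt + ground truth; ignore prompt positions in the loss with -100,
# so the NLL is computed only over the predicted layout tokens.
full_ids = torch.cat([prompt_ids, target_ids], dim=1)
labels = torch.cat([torch.full_like(prompt_ids, -100), target_ids], dim=1)

outputs = model(input_ids=full_ids, labels=labels)  # cross-entropy = negative log-likelihood
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(outputs.loss))
```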

4. Decoding Scheme

During inference, LGGPT uses greedy search as the default decoding strategy, supplemented by Top-K sampling (K=50); for the refinement (denoising) task, multinomial sampling is used instead.
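The three decoding strategies map directly onto standard `generate` settings in Hugging Face Transformers, as in the sketch below. The model checkpoint and prompt are placeholders; only the greedy / Top-K (K=50) / multinomial choices from the paper are reproduced here.

```python
# Decoding sketch mirroring the strategies described above; model and prompt are placeholders.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
inputs = tokenizer("domain:rico task:gen-u |", return_tensors="pt")

# Greedy search (default decoding).
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=64,
                        pad_token_id=tokenizer.eos_token_id)

# Top-K sampling with K=50.
top_k = model.generate(**inputs, do_sample=True, top_k=50, max_new_tokens=64,
                       pad_token_id=tokenizer.eos_token_id)

# Multinomial sampling over the full distribution (used for the refinement/denoising task).
multinomial = model.generate(**inputs, do_sample=True, top_k=0, max_new_tokens=64,
                             pad_token_id=tokenizer.eos_token_id)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
```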


b) Main Results

1. Single Task Evaluation

The research team evaluated LGGPT on six individual tasks: layout completion (Completion), category-conditioned generation (Gen-T), category- and size-conditioned generation (Gen-TS), relation-constrained generation (Relation), layout refinement/denoising (Refinement), and unconditional generation (Gen-U/Gen-UP). Experimental results show that LGGPT achieved top-tier performance on most tasks, particularly on the FID (Fréchet Inception Distance) and Max IoU (maximum Intersection over Union) metrics. For example, on the completion task of the PubLayNet dataset, LGGPT reached an FID of 2.08, far lower than the baseline's 27.87.
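For readers unfamiliar with the Max IoU metric family, a common way to score layout similarity is to optimally match generated boxes to reference boxes and average the resulting IoUs, as in the simplified sketch below. This illustrates the general idea only (it ignores category-aware matching) and is not necessarily the authors' exact evaluation protocol.

```python
# Simplified Max IoU-style comparison between a generated and a reference layout:
# optimally match boxes (Hungarian algorithm) and average their IoUs.
# Illustrative only; real protocols typically match within element categories.
import numpy as np
from scipy.optimize import linear_sum_assignment

def box_iou(a, b):
    """IoU of two boxes given as (x, y, w, h)."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def max_iou(gen_boxes, ref_boxes):
    """Average IoU under the optimal one-to-one matching of generated and reference boxes."""
    cost = np.array([[-box_iou(g, r) for r in ref_boxes] for g in gen_boxes])
    rows, cols = linear_sum_assignment(cost)
    return float(-cost[rows, cols].mean())

gen = [(10, 10, 100, 40), (10, 60, 100, 200)]
ref = [(12, 12, 100, 40), (10, 65, 100, 195)]
print(max_iou(gen, ref))
```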

2. Hybrid Task Evaluation

The research team also designed four hybrid tasks (e.g., completion plus refinement and arbitrary-condition generation) to simulate more complex real-world scenarios. LGGPT again performed strongly on these tasks, surpassing the existing LDGM model. For instance, on the arbitrary-condition generation task (Gen-Arb-Refine), LGGPT's FID was only 5.83, versus 29.21 for LDGM.

3. Comparative Analysis

To verify the effectiveness of ALI and IQE, the research team conducted ablation experiments. The results showed that compared to the traditional HTML format, ALI significantly reduced input length (from 76 tokens to 54 tokens) and shortened inference time from 3.08 seconds to 1.83 seconds. Additionally, the IQE strategy reduced FID by about 60% on average, further enhancing model performance.


c) Conclusions and Significance

The success of LGGPT demonstrates the potential of small LLMs for unified layout generation. The main contributions of this research include:
  1. Proposing ALI and ULR (Universal Layout Response) as unified input and output templates, achieving cross-task and cross-domain versatility;
  2. Developing the IQE strategy, which effectively compresses the input sequence and increases information density;
  3. Verifying that an LLM with 1.5B parameters strikes the best balance between performance and efficiency.

This research not only advances layout generation technology but also provides a useful reference for other multimodal generation tasks. In the future, the team plans to further improve domain versatility and to apply LGGPT to more real-world scenarios.


d) Research Highlights

  1. Cross-task and Cross-domain Versatility: LGGPT achieved task-generic and domain-generic layout generation for the first time, covering 11 common tasks and four different domains.
  2. Efficiency and Compactness: Through ALI and IQE, LGGPT significantly reduces computational costs while maintaining high performance.
  3. Application Potential of Small LLMs: The study shows that an LLM with 1.5B parameters is sufficient to handle complex unified generation tasks, providing new ideas for applications in resource-constrained environments.

e) Other Valuable Information

The research team open-sourced the code and dataset (GitHub link), facilitating subsequent research. Additionally, the paper discusses possible future research directions in detail, such as combining data from similar domains for joint training to further enhance domain versatility.