TryOn-Adapter: Efficient Fine-Grained Clothing Identity Adaptation for High-Fidelity Virtual Try-On

Research Background and Problem

Virtual try-on technology has attracted widespread attention in recent years. Its core objective is to seamlessly fit a given garment onto a specific person without distorting the garment's patterns and textures. However, existing diffusion-based methods have significant limitations in maintaining clothing identity consistency, struggling even under full-parameter fine-tuning. They also typically incur high training costs, which limits their broader adoption.

To address these issues, this study proposes a novel framework, TryOn-Adapter, which aims to achieve efficient clothing identity adaptation while reducing training resource consumption. Specifically, the researchers decouple clothing identity into three fine-grained factors: style, texture, and structure, and achieve precise identity control through customized lightweight modules and fine-tuning mechanisms. The study also introduces a training-free technique, T-Repaint, to further strengthen clothing identity preservation while keeping the generated images high-fidelity.

Paper Source

This paper was jointly completed by research teams from Zhejiang University, Alibaba Group, Baidu, and other institutions; its lead authors include Jiazheng Xing, Chao Xu, and Yijie Qian. The paper was published in the International Journal of Computer Vision in January 2025 (DOI: 10.1007/s11263-025-02352-3).


Research Details and Workflow

a) Research Process and Experimental Design

1. Data Preprocessing

The study utilized two widely used datasets: VITON-HD and DressCode. VITON-HD contains 13,679 image pairs, each consisting of a front-view female upper-body image and an upper garment image; DressCode includes 53,792 full-body person and garment image pairs covering categories such as tops, bottoms, and dresses. The researchers divided the datasets into training and testing sets for model training and performance evaluation, respectively.

2. Model Architecture

TryOn-Adapter is built on a pre-trained Stable Diffusion model and mainly comprises the following five components:

  1. Pre-trained Stable Diffusion Model: All parameters are frozen except the attention layers, which are fine-tuned.
  2. Style Preserving Module: Extracts the garment's global style information, including color and category.
  3. Texture Highlighting Module: Refines the garment's complex textures using high-frequency feature maps.
  4. Structure Adapting Module: Corrects unnatural areas caused by clothing changes using segmentation maps.
  5. Enhanced Latent Blending Module (ELBM): Performs image reconstruction in latent space to keep the visual quality of the generated images consistent.
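To make the division of labor concrete, here is a minimal, hypothetical sketch of how these components could be wired together. All class names, tensor shapes, and the UNet call below are illustrative assumptions, not the authors' code.

```python
# Hypothetical wiring of the TryOn-Adapter components (illustrative only;
# names, shapes, and the UNet interface are assumptions, not the paper's code).
import torch
import torch.nn as nn

class StylePreserving(nn.Module):
    """Projects CLIP garment tokens (class + patch) into conditioning tokens."""
    def __init__(self, clip_dim=768, cond_dim=768):
        super().__init__()
        self.proj = nn.Linear(clip_dim, cond_dim)

    def forward(self, clip_tokens):            # (B, N, clip_dim)
        return self.proj(clip_tokens)

class SpatialEncoder(nn.Module):
    """Shared shape for the texture (high-frequency map) and structure
    (segmentation map) branches: encode an image-space map to latent size."""
    def __init__(self, in_ch, latent_ch=4):
        super().__init__()
        self.enc = nn.Conv2d(in_ch, latent_ch, kernel_size=3, padding=1)

    def forward(self, x):                       # (B, in_ch, h, w)
        return self.enc(x)

def try_on_step(unet, z_t, t, style_tokens, tex_feat, struct_feat):
    """One denoising step: spatial cues are added to the noisy latent, while
    style tokens condition the fine-tuned attention layers of the frozen UNet."""
    cond_latent = z_t + tex_feat + struct_feat
    return unet(cond_latent, t, encoder_hidden_states=style_tokens)
```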

3. Experimental Design

  • Style Preserving Module: Extracts class tokens and patch tokens via the CLIP visual encoder, and a dedicated style adapter combines them with VAE embedding features to enhance color perception.
  • Texture Highlighting Module: Uses the Sobel operator to extract high-frequency feature maps that highlight the garment's complex textures and patterns (see the first sketch after this list).
  • Structure Adapting Module: Employs a rule-based, training-free segmentation map generation method that combines human pose information to explicitly mark clothing and body regions.
  • T-Repaint Technique: Applies the Repaint idea only during the early denoising steps of inference, balancing clothing identity preservation against realistic try-on effects (see the second sketch after this list).
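The texture-highlighting input can be reproduced with a standard Sobel filter. Below is a minimal sketch in PyTorch; the authors' exact normalization, kernel size, and color handling may differ.

```python
# Sobel high-frequency map extraction (a minimal sketch; the paper's exact
# preprocessing may differ).
import torch
import torch.nn.functional as F

def sobel_high_freq(gray: torch.Tensor) -> torch.Tensor:
    """gray: (B, 1, H, W) in [0, 1] -> gradient-magnitude map in [0, 1]."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    ky = kx.t()                                  # Sobel kernel for y-gradients
    kx = kx.view(1, 1, 3, 3).to(gray)
    ky = ky.view(1, 1, 3, 3).to(gray)
    gx = F.conv2d(gray, kx, padding=1)           # horizontal gradients
    gy = F.conv2d(gray, ky, padding=1)           # vertical gradients
    mag = torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)   # gradient magnitude
    return mag / mag.amax(dim=(-2, -1), keepdim=True).clamp(min=1e-8)

# Usage: garment = torch.rand(1, 1, 512, 384); hf = sobel_high_freq(garment)
```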
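T-Repaint can be pictured as standard latent-space repainting that is switched off after the earliest, noisiest steps. The sketch below assumes a generic denoiser and scheduler interface (`denoise` and `add_noise` are placeholders), not the authors' implementation.

```python
# T-Repaint sketch: blend known-region latents back in only during the early
# denoising steps. All names and interfaces here are assumptions.
import torch

def t_repaint(z_T, z0_known, mask, timesteps, denoise, add_noise,
              early_frac=0.3):
    """
    z_T:       (B, C, h, w) initial noise latent
    z0_known:  (B, C, h, w) clean latent of the region to keep (background)
    mask:      (B, 1, h, w) 1 = synthesize, 0 = keep from z0_known
    timesteps: descending list of diffusion timesteps
    denoise:   callable (z_t, t) -> z at the next (less noisy) level
    add_noise: callable (z0, t)  -> z0 renoised to level t
    """
    z = z_T
    n_early = int(len(timesteps) * early_frac)   # repaint only in this window
    for i, t in enumerate(timesteps):
        z = denoise(z, t)                        # z is now at level t-1
        if i < n_early and i + 1 < len(timesteps):
            t_prev = timesteps[i + 1]
            z_known = add_noise(z0_known, t_prev)  # renoise known region
            z = mask * z + (1 - mask) * z_known    # paste background back
    return z
```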

4. Novel Methods and Algorithms

The researchers proposed several innovative methods:

  • Style Adapter: Fuses CLIP patch embeddings with VAE visual embeddings to enhance color perception.
  • Position Attention Module (PAM): Enhances local spatial representation, helping the model better interpret high-frequency information (a sketch follows below).
  • ELBM Module: Reduces disconnection at the foreground-background boundary through deep fusion operations.
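The summary does not spell out PAM's internals, but position attention is a well-known construct, e.g. in DANet (Fu et al., 2019). Here is a minimal sketch under the assumption that PAM resembles that design; the paper's exact variant may differ.

```python
# Position Attention Module sketch, assuming a DANet-style design.
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.query = nn.Conv2d(ch, ch // 8, 1)
        self.key = nn.Conv2d(ch, ch // 8, 1)
        self.value = nn.Conv2d(ch, ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C//8)
        k = self.key(x).flatten(2)                     # (B, C//8, HW)
        attn = self.softmax(q @ k)                     # (B, HW, HW) affinities
        v = self.value(x).flatten(2)                   # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x    # residual: starts as identity mapping
```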


b) Main Results

1. Quantitative Evaluation

The study conducted quantitative evaluations on the VITON-HD and DressCode datasets, using SSIM (Structural Similarity), LPIPS (Learned Perceptual Image Patch Similarity), FID (Fréchet Inception Distance), and KID (Kernel Inception Distance) to measure model performance. The results showed:

  • At 512×384 resolution, TryOn-Adapter outperformed existing methods across all metrics, with SSIM reaching 0.897 and LPIPS dropping to 0.069.
  • At 1024×768 resolution, TryOn-Adapter also performed excellently, demonstrating its robustness at higher resolutions.
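For reference, all four metrics have off-the-shelf implementations in `torchmetrics`; below is a minimal sketch with random stand-in tensors. The paper's exact evaluation protocol (image ranges, feature extractors, subset sizes) may differ.

```python
# Computing SSIM/LPIPS/FID/KID with torchmetrics (a sketch, not the paper's
# evaluation code; random tensors stand in for real images).
import torch
from torchmetrics.image import (FrechetInceptionDistance,
                                KernelInceptionDistance,
                                LearnedPerceptualImagePatchSimilarity,
                                StructuralSimilarityIndexMeasure)

fake = torch.rand(8, 3, 512, 384)   # generated try-on images in [0, 1]
real = torch.rand(8, 3, 512, 384)   # ground-truth images in [0, 1]

ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)
fid = FrechetInceptionDistance(feature=2048, normalize=True)
kid = KernelInceptionDistance(subset_size=4, normalize=True)

fid.update(real, real=True); fid.update(fake, real=False)
kid.update(real, real=True); kid.update(fake, real=False)

print("SSIM :", ssim(fake, real).item())
print("LPIPS:", lpips(fake, real).item())
print("FID  :", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID  :", kid_mean.item())
```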

2. Qualitative Evaluation

Qualitative evaluations showed that TryOn-Adapter excelled in the following aspects:

  • Style Preservation: The colors and category information of generated garments were highly consistent with the target garments.
  • Texture Highlighting: Complex textures such as patterns, logos, and text remained clearly visible.
  • Structure Adaptation: Transitions between long and short sleeves were handled naturally, and unnatural body structures were corrected.

3. User Study

The researchers also conducted a user study in which 28 non-experts rated the generated results. TryOn-Adapter received over 45% of the votes for both "most photorealistic image" and "best preservation of target garment details", significantly outperforming the other methods.


c) Conclusions and Significance

Scientific Value

TryOn-Adapter addresses the shortcomings of existing methods in clothing identity control and training efficiency by decoupling clothing identity into three fine-grained factors: style, texture, and structure. Its proposed lightweight modules and training-free techniques provide new research directions for the virtual try-on field.

Application Value

This research holds significant potential applications in online shopping, virtual reality, and other fields. For example, users can more intuitively experience the try-on effects of different garments through virtual try-on technology, thereby enhancing the shopping experience.


d) Research Highlights

  1. Fine-Grained Identity Control: Decouples clothing identity, for the first time, into three factors (style, texture, and structure), significantly improving clothing identity preservation.
  2. Efficient Training Mechanism: Achieves the best performance with only about half of the trainable parameters through Parameter-Efficient Fine-Tuning (PEFT) of the attention layers (see the sketch after this list).
  3. Innovative Module Design: The style adapter, texture highlighting module, and structure adapting module provide new solutions for virtual try-on tasks.
  4. Training-Free Segmentation Map Generation: A rule-based segmentation map generation method that avoids training a redundant network.
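In practice, attention-only fine-tuning reduces to a few lines. The sketch below assumes attention parameter names contain "attn" (true for diffusers' UNet2DConditionModel, but an assumption worth verifying for any other model), not the authors' exact training code.

```python
# PEFT sketch: freeze everything except attention layers.
# Assumption: attention parameter names contain "attn".
import torch.nn as nn

def freeze_except_attention(model: nn.Module) -> None:
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        param.requires_grad = "attn" in name     # train attention only
        total += param.numel()
        trainable += param.numel() if param.requires_grad else 0
    print(f"trainable: {trainable / total:.1%} of {total:,} parameters")
```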

e) Other Valuable Information

The researchers plan to further explore Reference-Net methods in future work and develop refined evaluation metrics specifically for virtual try-on tasks to drive further advancements in this field.


Summary

TryOn-Adapter is a groundbreaking study that successfully addresses key issues in the virtual try-on field through innovative module design and efficient training mechanisms. Its scientific value and application potential make it an important milestone in the field, laying a solid foundation for future related research.