TryOn-Adapter: Efficient Fine-Grained Clothing Identity Adaptation for High-Fidelity Virtual Try-On

Research Background and Problem

Virtual try-on technology has attracted widespread attention in recent years. Its core objective is to seamlessly fit a given garment onto a specific person without distorting the garment's patterns and textures. However, existing diffusion-based methods have significant limitations in maintaining clothing identity consistency, struggling even under full-parameter fine-tuning. They also typically incur high training costs, which limits their broader adoption.

To address these issues, this study proposes a novel framework, TryOn-Adapter, which aims to achieve efficient clothing identity adaptation while reducing training resource consumption. Specifically, the researchers decouple clothing identity into three fine-grained factors: style, texture, and structure, and achieve precise identity control through customized lightweight modules and fine-tuning mechanisms. The study also introduces a training-free technique, T-Repaint, to further strengthen clothing identity preservation while keeping the generated images high-fidelity.

Paper Source

This paper was jointly completed by research teams from Zhejiang University, Alibaba Group, Baidu, and other institutions; its lead authors include Jiazheng Xing, Chao Xu, and Yijie Qian. The paper was published in the International Journal of Computer Vision in January 2025 (DOI: 10.1007/s11263-025-02352-3).


Research Details and Workflow

a) Research Process and Experimental Design

1. Data Preprocessing

The study utilized two widely used datasets: VITON-HD and DressCode. VITON-HD contains 13,679 image pairs, each consisting of a front-view female upper-body image and an upper garment image; DressCode includes 53,792 full-body person and garment image pairs covering categories such as tops, bottoms, and dresses. The researchers divided the datasets into training and testing sets for model training and performance evaluation, respectively.

2. Model Architecture

TryOn-Adapter is built on a pre-trained Stable Diffusion model and mainly comprises the following five components:

  1. Pre-trained Stable Diffusion Model: All parameters are frozen except the attention layers, which are fine-tuned.
  2. Style Preserving Module: Extracts the garment's global style information, including color and category.
  3. Texture Highlighting Module: Refines the garment's complex textures using high-frequency feature maps.
  4. Structure Adapting Module: Corrects unnatural areas caused by clothing changes using segmentation maps.
  5. Enhanced Latent Blending Module (ELBM): Performs image reconstruction in latent space to keep the visual quality of the generated images consistent.
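To make the division of labor concrete, here is a minimal, hypothetical sketch of how these components could be wired together. All class names, tensor shapes, and the UNet call below are illustrative assumptions, not the authors' code.

```python
# Hypothetical wiring of the TryOn-Adapter components (illustrative only;
# names, shapes, and the UNet interface are assumptions, not the paper's code).
import torch
import torch.nn as nn

class StylePreserving(nn.Module):
    """Projects CLIP garment tokens (class + patch) into conditioning tokens."""
    def __init__(self, clip_dim=768, cond_dim=768):
        super().__init__()
        self.proj = nn.Linear(clip_dim, cond_dim)

    def forward(self, clip_tokens):            # (B, N, clip_dim)
        return self.proj(clip_tokens)

class SpatialEncoder(nn.Module):
    """Shared shape for the texture (high-frequency map) and structure
    (segmentation map) branches: encode an image-space map to latent size."""
    def __init__(self, in_ch, latent_ch=4):
        super().__init__()
        self.enc = nn.Conv2d(in_ch, latent_ch, kernel_size=3, padding=1)

    def forward(self, x):                       # (B, in_ch, h, w)
        return self.enc(x)

def try_on_step(unet, z_t, t, style_tokens, tex_feat, struct_feat):
    """One denoising step: spatial cues are added to the noisy latent, while
    style tokens condition the fine-tuned attention layers of the frozen UNet."""
    cond_latent = z_t + tex_feat + struct_feat
    return unet(cond_latent, t, encoder_hidden_states=style_tokens)
```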

3. Experimental Design

  • Style Preserving Module: Extracts class tokens and patch tokens via the CLIP visual encoder, and a dedicated style adapter combines them with VAE embedding features to enhance color perception.
  • Texture Highlighting Module: Uses the Sobel operator to extract high-frequency feature maps that highlight the garment's complex textures and patterns (see the first sketch after this list).
  • Structure Adapting Module: Employs a rule-based, training-free segmentation map generation method that combines human pose information to explicitly mark clothing and body regions.
  • T-Repaint Technique: Applies the Repaint idea only during the early denoising steps of inference, balancing clothing identity preservation against realistic try-on effects (see the second sketch after this list).
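The texture-highlighting input can be reproduced with a standard Sobel filter. Below is a minimal sketch in PyTorch; the authors' exact normalization, kernel size, and color handling may differ.

```python
# Sobel high-frequency map extraction (a minimal sketch; the paper's exact
# preprocessing may differ).
import torch
import torch.nn.functional as F

def sobel_high_freq(gray: torch.Tensor) -> torch.Tensor:
    """gray: (B, 1, H, W) in [0, 1] -> gradient-magnitude map in [0, 1]."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    ky = kx.t()                                  # Sobel kernel for y-gradients
    kx = kx.view(1, 1, 3, 3).to(gray)
    ky = ky.view(1, 1, 3, 3).to(gray)
    gx = F.conv2d(gray, kx, padding=1)           # horizontal gradients
    gy = F.conv2d(gray, ky, padding=1)           # vertical gradients
    mag = torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)   # gradient magnitude
    return mag / mag.amax(dim=(-2, -1), keepdim=True).clamp(min=1e-8)

# Usage: garment = torch.rand(1, 1, 512, 384); hf = sobel_high_freq(garment)
```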
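T-Repaint can be pictured as standard latent-space repainting that is switched off after the earliest, noisiest steps. The sketch below assumes a generic denoiser and scheduler interface (`denoise` and `add_noise` are placeholders), not the authors' implementation.

```python
# T-Repaint sketch: blend known-region latents back in only during the early
# denoising steps. All names and interfaces here are assumptions.
import torch

def t_repaint(z_T, z0_known, mask, timesteps, denoise, add_noise,
              early_frac=0.3):
    """
    z_T:       (B, C, h, w) initial noise latent
    z0_known:  (B, C, h, w) clean latent of the region to keep (background)
    mask:      (B, 1, h, w) 1 = synthesize, 0 = keep from z0_known
    timesteps: descending list of diffusion timesteps
    denoise:   callable (z_t, t) -> z at the next (less noisy) level
    add_noise: callable (z0, t)  -> z0 renoised to level t
    """
    z = z_T
    n_early = int(len(timesteps) * early_frac)   # repaint only in this window
    for i, t in enumerate(timesteps):
        z = denoise(z, t)                        # z is now at level t-1
        if i < n_early and i + 1 < len(timesteps):
            t_prev = timesteps[i + 1]
            z_known = add_noise(z0_known, t_prev)  # renoise known region
            z = mask * z + (1 - mask) * z_known    # paste background back
    return z
```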

4. Novel Methods and Algorithms

The researchers proposed several innovative methods:

  • Style Adapter: Fuses CLIP patch embeddings with VAE visual embeddings to enhance color perception.
  • Position Attention Module (PAM): Enhances local spatial representation, helping the model better interpret high-frequency information (a sketch follows below).
  • ELBM Module: Reduces disconnection at the foreground-background boundary through deep fusion operations.
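The summary does not spell out PAM's internals, but position attention is a well-known construct, e.g. in DANet (Fu et al., 2019). Here is a minimal sketch under the assumption that PAM resembles that design; the paper's exact variant may differ.

```python
# Position Attention Module sketch, assuming a DANet-style design.
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.query = nn.Conv2d(ch, ch // 8, 1)
        self.key = nn.Conv2d(ch, ch // 8, 1)
        self.value = nn.Conv2d(ch, ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C//8)
        k = self.key(x).flatten(2)                     # (B, C//8, HW)
        attn = self.softmax(q @ k)                     # (B, HW, HW) affinities
        v = self.value(x).flatten(2)                   # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x    # residual: starts as identity mapping
```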


b) Main Results

1. Quantitative Evaluation

The study conducted quantitative evaluations on the VITON-HD and DressCode datasets, using SSIM (Structural Similarity), LPIPS (Learned Perceptual Image Patch Similarity), FID (Fréchet Inception Distance), and KID (Kernel Inception Distance) to measure model performance. The results showed:

  • At 512×384 resolution, TryOn-Adapter outperformed existing methods across all metrics, with SSIM reaching 0.897 and LPIPS dropping to 0.069.
  • At 1024×768 resolution, TryOn-Adapter also performed excellently, demonstrating its robustness at higher resolutions.
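For reference, all four metrics have off-the-shelf implementations in `torchmetrics`; below is a minimal sketch with random stand-in tensors. The paper's exact evaluation protocol (image ranges, feature extractors, subset sizes) may differ.

```python
# Computing SSIM/LPIPS/FID/KID with torchmetrics (a sketch, not the paper's
# evaluation code; random tensors stand in for real images).
import torch
from torchmetrics.image import (FrechetInceptionDistance,
                                KernelInceptionDistance,
                                LearnedPerceptualImagePatchSimilarity,
                                StructuralSimilarityIndexMeasure)

fake = torch.rand(8, 3, 512, 384)   # generated try-on images in [0, 1]
real = torch.rand(8, 3, 512, 384)   # ground-truth images in [0, 1]

ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)
fid = FrechetInceptionDistance(feature=2048, normalize=True)
kid = KernelInceptionDistance(subset_size=4, normalize=True)

fid.update(real, real=True); fid.update(fake, real=False)
kid.update(real, real=True); kid.update(fake, real=False)

print("SSIM :", ssim(fake, real).item())
print("LPIPS:", lpips(fake, real).item())
print("FID  :", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID  :", kid_mean.item())
```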

2. Qualitative Evaluation

Qualitative evaluations showed that TryOn-Adapter excelled in the following aspects:

  • Style Preservation: The colors and category information of generated garments were highly consistent with the target garments.
  • Texture Highlighting: Complex textures such as patterns, logos, and text remained clearly visible.
  • Structure Adaptation: Transitions between long and short sleeves were handled naturally, and unnatural body structures were corrected.

3. User Study

The researchers also conducted a user study in which 28 non-experts rated the generated results. TryOn-Adapter received over 45% of the votes for both "most photorealistic image" and "best preservation of target garment details", significantly outperforming the other methods.


c) Conclusions and Significance

Scientific Value

TryOn-Adapter addresses the shortcomings of existing methods in clothing identity control and training efficiency by decoupling clothing identity into three fine-grained factors: style, texture, and structure. Its proposed lightweight modules and training-free techniques provide new research directions for the virtual try-on field.

Application Value

This research holds significant potential applications in online shopping, virtual reality, and other fields. For example, users can more intuitively experience the try-on effects of different garments through virtual try-on technology, thereby enhancing the shopping experience.


d) Research Highlights

  1. Fine-Grained Identity Control: Decouples clothing identity, for the first time, into three factors (style, texture, and structure), significantly improving clothing identity preservation.
  2. Efficient Training Mechanism: Achieves the best performance with only about half of the trainable parameters through Parameter-Efficient Fine-Tuning (PEFT) of the attention layers (see the sketch after this list).
  3. Innovative Module Design: The style adapter, texture highlighting module, and structure adapting module provide new solutions for virtual try-on tasks.
  4. Training-Free Segmentation Map Generation: A rule-based segmentation map generation method that avoids training a redundant network.
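In practice, attention-only fine-tuning reduces to a few lines. The sketch below assumes attention parameter names contain "attn" (true for diffusers' UNet2DConditionModel, but an assumption worth verifying for any other model), not the authors' exact training code.

```python
# PEFT sketch: freeze everything except attention layers.
# Assumption: attention parameter names contain "attn".
import torch.nn as nn

def freeze_except_attention(model: nn.Module) -> None:
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        param.requires_grad = "attn" in name     # train attention only
        total += param.numel()
        trainable += param.numel() if param.requires_grad else 0
    print(f"trainable: {trainable / total:.1%} of {total:,} parameters")
```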

e) Other Valuable Information

The researchers plan to further explore Reference-Net methods in future work and develop refined evaluation metrics specifically for virtual try-on tasks to drive further advancements in this field.


Summary

TryOn-Adapter is a groundbreaking study that successfully addresses key issues in the virtual try-on field through innovative module design and efficient training mechanisms. Its scientific value and application potential make it an important milestone in the field, laying a solid foundation for future related research.