SLIDE: A Unified Mesh and Texture Generation Framework with Enhanced Geometric Control and Multi-View Consistency

Academic Background

With the increasing demand for high-quality 3D content across industries such as gaming, architecture, and social media, the manual creation of 3D assets has become time-consuming, technically demanding, and costly. In the gaming industry, the aesthetic quality of assets like characters and furniture significantly impacts the immersiveness of the gaming environment. In architecture, precise and detailed models of buildings are essential for visualization, simulation, and planning. Social media platforms are increasingly leveraging 3D content for augmented reality (AR) and virtual reality (VR) experiences. However, the realism of these 3D models hinges on detailed mesh representations, including vertices, edges, faces, and textures, which makes automating the production of controllable, high-quality textured meshes increasingly urgent.

Existing generative models, such as GET3D and 3DGen, can generate geometry and textures simultaneously, but they often struggle to balance geometric accuracy and texture detail, producing 3D shapes whose compromised geometry is masked by the texture. To address this, the paper proposes a framework that separates geometry generation from texture generation, using a Sparse Latent Point Diffusion Model (SLIDE) for precise geometric control and multi-view priors to resolve multi-view texture inconsistencies.

Paper Source

This paper is co-authored by Jinyi Wang, Zhaoyang Lyu, Ben Fei, and others, affiliated with institutions such as Shanghai Jiao Tong University, The Chinese University of Hong Kong, and Nanyang Technological University. The paper was published on December 1, 2024, in the International Journal of Computer Vision.

Research Process and Results

1. Geometry Generation

1.1 Point Cloud Encoding and Decoding

The paper first adopts point clouds as an intermediate representation, encoding dense point clouds into sparse latent points with semantic features to achieve precise geometric control. Specifically, the point cloud encoder downsamples a 2048-point cloud to 16 sparse latent points through four hierarchical Set Abstraction (SA) modules and aggregates the final feature representation at these latent points via a Feature Transfer (FT) module. The point cloud decoder then upsamples the sparse latent points back into a dense 2048-point cloud through three Point Upsampling (PU) modules and additionally predicts per-point normals.
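
To make this pipeline concrete, the following is a minimal, self-contained sketch rather than the paper's implementation: the farthest-point sampling, nearest-neighbor feature pooling, channel sizes, and per-latent offset decoding are assumptions that merely stand in for the SA, FT, and PU modules described above.

```python
# Toy sketch of the encode/decode pipeline (assumed architecture, not SLIDE's):
# a dense 2048-point cloud -> 16 latent points with features -> 2048 points.
import torch
import torch.nn as nn

def farthest_point_sample(xyz: torch.Tensor, m: int) -> torch.Tensor:
    """Greedy farthest point sampling. xyz: (B, N, 3) -> indices (B, m)."""
    B, N, _ = xyz.shape
    idx = torch.zeros(B, m, dtype=torch.long, device=xyz.device)
    dist = torch.full((B, N), float("inf"), device=xyz.device)
    farthest = torch.zeros(B, dtype=torch.long, device=xyz.device)
    for i in range(m):
        idx[:, i] = farthest
        centroid = xyz[torch.arange(B), farthest].unsqueeze(1)       # (B, 1, 3)
        dist = torch.minimum(dist, ((xyz - centroid) ** 2).sum(-1))  # (B, N)
        farthest = dist.argmax(dim=1)
    return idx

class ToyEncoder(nn.Module):
    """Stand-in for the hierarchical SA + FT stages: sample 16 latent points
    and max-pool per-point features from their nearest neighbors."""
    def __init__(self, feat_dim: int = 128, k: int = 64):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(3, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, xyz):                          # xyz: (B, 2048, 3)
        idx = farthest_point_sample(xyz, 16)         # (B, 16)
        latent_xyz = torch.gather(xyz, 1, idx.unsqueeze(-1).expand(-1, -1, 3))
        d = torch.cdist(latent_xyz, xyz)             # (B, 16, 2048)
        knn = d.topk(self.k, largest=False).indices  # (B, 16, k) nearest neighbors
        neigh = torch.gather(xyz.unsqueeze(1).expand(-1, 16, -1, -1), 2,
                             knn.unsqueeze(-1).expand(-1, -1, -1, 3))
        feats = self.mlp(neigh - latent_xyz.unsqueeze(2)).max(dim=2).values
        return latent_xyz, feats                     # (B, 16, 3), (B, 16, feat_dim)

class ToyDecoder(nn.Module):
    """Stand-in for the PU stages: each latent point emits 128 offsets,
    reconstructing 16 * 128 = 2048 points."""
    def __init__(self, feat_dim: int = 128, points_per_latent: int = 128):
        super().__init__()
        self.p = points_per_latent
        self.mlp = nn.Sequential(nn.Linear(feat_dim + 3, 256), nn.ReLU(),
                                 nn.Linear(256, points_per_latent * 3))

    def forward(self, latent_xyz, feats):
        offsets = self.mlp(torch.cat([latent_xyz, feats], dim=-1))
        offsets = offsets.view(*latent_xyz.shape[:2], self.p, 3)
        dense = latent_xyz.unsqueeze(2) + offsets          # (B, 16, p, 3)
        return dense.reshape(latent_xyz.shape[0], -1, 3)   # (B, 2048, 3)

cloud = torch.randn(2, 2048, 3)                    # a batch of dense point clouds
encoder, decoder = ToyEncoder(), ToyDecoder()
latent_xyz, latent_feat = encoder(cloud)           # (2, 16, 3), (2, 16, 128)
reconstruction = decoder(latent_xyz, latent_feat)  # (2, 2048, 3)
```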

1.2 Sparse Latent Point Diffusion Model

After training the point cloud autoencoder, the paper trains two Denoising Diffusion Probabilistic Models (DDPMs) in its latent space. The first DDPM models the distribution of sparse latent point positions, while the second models the distribution of features conditioned on those positions. Together, the two DDPMs support both unconditional and controllable generation of geometric shapes: controllable generation is achieved by adjusting the positions of the sparse latent points, sampling the corresponding features, and decoding the result into a point cloud.
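
The two-stage sampling can likewise be illustrated with toy components. The noise schedule, denoiser architectures, and conditioning scheme below are assumptions; only the overall structure, sampling latent point positions first and then features conditioned on them, follows the description above.

```python
# Illustrative two-stage sampling in the latent space (toy denoisers and a
# plain DDPM ancestral sampler; not the paper's networks or schedule).
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def ddpm_sample(denoiser, shape, cond=None):
    """Ancestral DDPM sampling: x_T ~ N(0, I), iteratively denoise to x_0."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps = denoiser(x, torch.full((shape[0],), t), cond)
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

class ToyDenoiser(nn.Module):
    """Predicts the noise added to a (B, 16, dim) latent tensor."""
    def __init__(self, dim, cond_dim=0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1 + cond_dim, 256), nn.ReLU(),
                                 nn.Linear(256, dim))
    def forward(self, x, t, cond=None):
        t_emb = (t.float() / T).view(-1, 1, 1).expand(-1, x.shape[1], 1)
        h = torch.cat([x, t_emb] + ([cond] if cond is not None else []), dim=-1)
        return self.net(h)

# Stage 1: sample the positions of the 16 sparse latent points.
pos_ddpm = ToyDenoiser(dim=3)
latent_xyz = ddpm_sample(pos_ddpm, shape=(4, 16, 3))
# Stage 2: sample per-point features conditioned on those positions.
feat_ddpm = ToyDenoiser(dim=128, cond_dim=3)
latent_feat = ddpm_sample(feat_ddpm, shape=(4, 16, 128), cond=latent_xyz)
# latent_xyz and latent_feat would then be passed to the point cloud decoder.
```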

1.3 Results

Experimental results show that the proposed method excels in geometry generation, producing meshes with smooth surfaces and sharp details. By controlling sparse latent points, the method can flexibly adjust the overall shape and local details of the generated meshes without requiring part annotations from the dataset. Additionally, the paper demonstrates capabilities in shape interpolation and shape combination, further proving the method’s diversity and flexibility.
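
As a rough illustration of how such edits could be expressed on the sparse latent representation, the sketch below blends latent point positions and features directly; this is an assumption, since the paper may instead re-sample the features from the feature DDPM after the positions are edited.

```python
# Sketch of latent-space shape editing (an assumed workflow, not the paper's
# exact procedure): interpolate or mix the 16 sparse latent points of two
# shapes, then decode the edited latents back into a dense point cloud.
import torch

def interpolate_latents(xyz_a, feat_a, xyz_b, feat_b, alpha):
    """Linearly blend two shapes' latent point positions and features."""
    return ((1 - alpha) * xyz_a + alpha * xyz_b,
            (1 - alpha) * feat_a + alpha * feat_b)

def combine_latents(xyz_a, feat_a, xyz_b, feat_b, take_from_a):
    """Take selected latent points (with features) from shape A, the rest from B."""
    m = take_from_a.view(1, -1, 1).float()
    return (m * xyz_a + (1 - m) * xyz_b,
            m * feat_a + (1 - m) * feat_b)

# Toy usage with random stand-ins for two encoded shapes:
xyz_a, feat_a = torch.randn(1, 16, 3), torch.randn(1, 16, 128)
xyz_b, feat_b = torch.randn(1, 16, 3), torch.randn(1, 16, 128)
xyz_mid, feat_mid = interpolate_latents(xyz_a, feat_a, xyz_b, feat_b, alpha=0.5)
mask = torch.tensor([1] * 8 + [0] * 8)            # first 8 latent points from A
xyz_mix, feat_mix = combine_latents(xyz_a, feat_a, xyz_b, feat_b, mask)
# Either edited latent set would be decoded into a point cloud and then meshed.
```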

2. Texture Generation

2.1 Coarse Texture Generation

After geometry generation, the paper employs a multi-view diffusion model to generate coarse textures. Specifically, textures are first generated from four views (front, left, back, and right), and a depth-conditioned diffusion model is combined with a multi-view diffusion model to ensure texture consistency across different views.
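
A heavily simplified sketch of this fusion is shown below. The two predictors are random stand-ins for the pretrained multi-view and depth-conditioned diffusion models, and the blending weight, latent resolution, and denoising loop are assumptions intended only to convey how the two noise estimates could be mixed per view.

```python
# Toy sketch of coarse texture generation from the four canonical views
# (front, left, back, right); all components are placeholders.
import torch

def multiview_eps(latents, t):
    """Stand-in: a multi-view diffusion prior that denoises all views jointly."""
    return torch.randn_like(latents)

def depth_conditioned_eps(latents, depths, t):
    """Stand-in: a depth-conditioned image diffusion model applied per view."""
    return torch.randn_like(latents)

@torch.no_grad()
def generate_coarse_views(depths, steps=50, w=0.5):
    """Denoise four view latents, mixing the two noise estimates at each step."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(4, 4, 64, 64)                  # one latent image per view
    for t in reversed(range(steps)):
        eps = w * multiview_eps(x, t) + (1.0 - w) * depth_conditioned_eps(x, depths, t)
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

depths = torch.rand(4, 1, 64, 64)                  # depth maps rendered per view
coarse_views = generate_coarse_views(depths)
# The four denoised views would be decoded and projected onto the mesh's UV map.
```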

2.2 Fine Texture Refinement

Following coarse texture generation, the paper further enhances texture resolution and coverage through a refinement phase. The texture map is segmented into “refinement regions” and “generation regions,” where inpainting, denoising, and projection techniques are applied to produce high-resolution, multi-view consistent textures.
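
The region split can be sketched schematically as follows; the coverage map, threshold, and placeholder refine/inpaint callables are assumptions standing in for the paper's diffusion-based inpainting, denoising, and projection steps.

```python
# Toy sketch of partitioning the UV texture map into "refinement" and
# "generation" regions (assumed coverage-based criterion): texels already
# colored by the coarse views are refined, unseen texels are generated.
import torch

def split_regions(coverage: torch.Tensor, thresh: float = 0.5):
    """coverage: (H, W) fraction of coarse views that observed each texel."""
    refine_mask = coverage >= thresh        # texels with an existing coarse color
    generate_mask = ~refine_mask            # texels never seen from the 4 views
    return refine_mask, generate_mask

def refine_texture(coarse_uv, coverage, refine_fn, inpaint_fn):
    """Refine covered texels, inpaint the rest, and blend by the masks."""
    refine_mask, generate_mask = split_regions(coverage)
    refined = refine_fn(coarse_uv)          # e.g., per-view denoise + re-project
    generated = inpaint_fn(coarse_uv, generate_mask)
    return torch.where(refine_mask.unsqueeze(0), refined, generated)  # (3, H, W)

# Usage with placeholder operations standing in for the diffusion-based steps:
H = W = 256
coarse_uv = torch.rand(3, H, W)
coverage = torch.rand(H, W)
texture = refine_texture(coarse_uv, coverage,
                         refine_fn=lambda uv: uv,                       # identity placeholder
                         inpaint_fn=lambda uv, m: torch.rand_like(uv))  # random placeholder
```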

2.3 Results

Experimental results demonstrate that the proposed method significantly outperforms existing methods in texture generation, producing textures with higher realism and consistency. User studies also indicate that the generated textures are superior in overall quality, alignment with prompts, and texture consistency compared to baseline methods.

Conclusion and Significance

This paper proposes a unified framework for mesh and texture generation, enhancing geometric control through a sparse latent point diffusion model and resolving multi-view texture inconsistencies through multi-view priors. Experimental results show that the proposed method outperforms existing approaches in geometric quality, control capability, and texture consistency, significantly improving the generation of complex textured 3D content. This research provides new insights and methods for the fields of computer graphics and virtual content creation, offering significant scientific and practical value.

Research Highlights

  1. Separation of Geometry and Texture Generation: For the first time, the paper separates geometry generation from texture generation, achieving precise geometric control through a sparse latent point diffusion model and resolving texture inconsistencies through multi-view priors.
  2. Sparse Latent Point Diffusion Model: The proposed sparse latent point diffusion model significantly reduces the complexity of geometry generation and enhances control over mesh structures.
  3. Multi-View Consistent Texture Generation: By combining multi-view diffusion models with depth-conditioned diffusion models, the paper achieves multi-view consistent texture generation, significantly improving texture realism and consistency.
  4. Efficient Generation: The proposed method significantly outperforms existing methods in generation efficiency, producing high-quality geometry and textures in a short time.

Other Valuable Information

Beyond the main benchmarks, the paper demonstrates shape interpolation and shape combination, underscoring the method's diversity and flexibility. User studies further confirm that the generated textures outperform those of the baselines in overall quality, alignment with prompts, and texture consistency.