ARM: Appearance Reconstruction Model for Relightable 3D Generation

1University of Utah, 2Zhejiang University, 3UCLA, 4Amazon
* Equal contribution.

Abstract

Recent image-to-3D reconstruction models have greatly advanced geometry generation, but they still struggle to faithfully generate realistic appearance. To address this, we introduce ARM, a novel method that reconstructs high-quality 3D meshes and realistic appearance from sparse-view images. The core of ARM lies in decoupling geometry from appearance, processing appearance within the UV texture space. Unlike previous methods, ARM improves texture quality by explicitly back-projecting measurements onto the texture map and processing them in a UV space module with a global receptive field. To resolve ambiguities between material and illumination in input images, ARM introduces a material prior that encodes semantic appearance information, enhancing the robustness of appearance decomposition. Trained on just 8 H100 GPUs, ARM outperforms existing methods both quantitatively and qualitatively.

Overview of our pipeline

(left) Starting from sparse-view input images generated by a diffusion model, ARM separates shape and appearance generation into two stages. In the geometry stage, ARM uses GeoRM to predict a 3D shape from the input images. In the appearance stage, ARM employs InstantAlbedo and GlossyRM to reconstruct PBR maps, enabling realistic relighting under varied lighting conditions. (right) GeoRM and GlossyRM share the same architecture, consisting of a triplane synthesizer and a decoding MLP. GeoRM is trained to predict density, and an iso-surface is extracted from the density grid with DiffMC; GlossyRM is trained to predict roughness and metalness. GlossyRM is trained after GeoRM and is initialized with GeoRM's weights at the start of training.
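To make the shared GeoRM/GlossyRM backbone concrete, the sketch below shows a triplane-sampled MLP decoder and the warm-start of GlossyRM from GeoRM's weights. This is a minimal PyTorch sketch under our own assumptions; the class name, feature dimensions, and the omitted triplane synthesizer are illustrative, not ARM's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneReconstructionModel(nn.Module):
    """Illustrative shared backbone for GeoRM / GlossyRM: features sampled from
    three axis-aligned planes are concatenated and decoded by a small MLP.
    The triplane synthesizer that produces `planes` from the input views is
    omitted here."""

    def __init__(self, out_channels: int, feat_dim: int = 40, hidden: int = 64):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(3 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_channels),
        )

    def sample_triplane(self, planes, pts):
        # planes: (3, C, H, W) feature planes; pts: (N, 3) query points in [-1, 1].
        feats = []
        for i, dims in enumerate([[0, 1], [0, 2], [1, 2]]):
            uv = pts[:, dims].view(1, -1, 1, 2)                        # (1, N, 1, 2)
            f = F.grid_sample(planes[i:i + 1], uv, align_corners=False)
            feats.append(f.squeeze(0).squeeze(-1).t())                 # (N, C)
        return torch.cat(feats, dim=-1)                                # (N, 3C)

    def forward(self, planes, pts):
        return self.decoder(self.sample_triplane(planes, pts))

# GeoRM decodes one density value per query point; GlossyRM decodes roughness
# and metalness, and is warm-started from GeoRM's weights.
geo_rm = TriplaneReconstructionModel(out_channels=1)
glossy_rm = TriplaneReconstructionModel(out_channels=2)
compatible = {k: v for k, v in geo_rm.state_dict().items()
              if v.shape == glossy_rm.state_dict()[k].shape}
glossy_rm.load_state_dict(compatible, strict=False)
```

Because GeoRM's one-channel density head and GlossyRM's two-channel roughness/metalness head differ in shape, only the shape-compatible weights are copied in this sketch.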

Overview of InstantAlbedo

InstantAlbedo operates entirely in UV texture space. Given the unwrapped mesh from GeoRM, we back-project the input images, material encodings, and auxiliary data into UV texture space, yielding six sets of texture-space inputs, one for each of the six input views. InstantAlbedo then processes these maps with a U-Net and an inpainting-specific FFC-Net to predict both the lighting-baked color and the decomposed diffuse albedo UV textures.
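As a concrete illustration of the back-projection step, the sketch below projects the texel positions of the unwrapped mesh into one camera and samples the corresponding input view, producing one partial UV texture per view. The function and its signature are hypothetical, visibility/occlusion handling is omitted, and the camera convention is assumed rather than taken from ARM's code.

```python
import torch
import torch.nn.functional as F

def backproject_view_to_uv(image, K, w2c, uv_pos, uv_mask):
    """Illustrative back-projection of one input view into UV texture space.

    image   : (3, H, W)   input view (or per-pixel material encoding) to sample
    K       : (3, 3)      camera intrinsics
    w2c     : (4, 4)      world-to-camera extrinsics
    uv_pos  : (T, T, 3)   world-space surface position rasterized per texel
    uv_mask : (T, T)      1 where the texel is covered by the unwrapped mesh
    returns : (3, T, T)   partial texture, and a (T, T) validity mask
    """
    T = uv_pos.shape[0]
    pts = torch.cat([uv_pos.reshape(-1, 3), uv_pos.new_ones(T * T, 1)], dim=-1)
    cam = (w2c @ pts.t()).t()[:, :3]                   # texel positions in camera space
    pix = (K @ cam.t()).t()                            # perspective projection
    pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)     # pixel coordinates

    H, W = image.shape[1:]
    grid = torch.stack([pix[:, 0] / (W - 1), pix[:, 1] / (H - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(image[None], grid.view(1, T, T, 2),
                            align_corners=True).squeeze(0)      # (3, T, T)

    in_front = (cam[:, 2] > 0).reshape(T, T)           # assumes camera looks down +z
    in_frame = (grid.abs() <= 1).all(dim=-1).reshape(T, T)
    valid = uv_mask.bool() & in_front & in_frame       # occlusion test omitted
    return sampled * valid, valid
```

Running this for all six views yields the six sets of texture-space inputs described above, which the U-Net and FFC-Net then fuse and inpaint.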

Single image to 3D

We provide qualitative examples to visually demonstrate ARM's superior performance over existing methods. The textures reconstructed by ARM contain significantly richer details, owing to our UV texture-space design. While other methods suffer from blurriness, ARM accurately reconstructs complex, sharp patterns. Some methods, such as SF3D, struggle to generate plausible shapes and textures in unseen regions because they are trained on single-view inputs.

Appearance decomposition

We compare our reconstructed PBR maps and their relit renderings under novel lighting conditions to those produced by SF3D, which also reconstructs PBR maps from single-view input. Our method outperforms SF3D in two key areas. First, when multiple materials are present in the input image, our method reconstructs spatially-varying roughness and metalness, whereas SF3D predicts only constant values, resulting in a homogeneous appearance. Second, SF3D struggles to separate illumination from material properties in the input, leading to baked-in lighting effects. In the cup-and-ball example, lighting artifacts are embedded in SF3D's reconstructed diffuse albedo, producing inaccurate relighting under novel conditions. In contrast, our method successfully decomposes illumination and material, yielding realistic results.
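To illustrate what relighting with the decomposed maps involves, the sketch below shades per-pixel diffuse albedo, roughness, and metalness with a standard Cook-Torrance GGX model under a single directional light. This is a generic PBR shading example, not ARM's renderer; environment-map lighting, shadows, and tone mapping are left out.

```python
import torch
import torch.nn.functional as F

def ggx_shade(albedo, roughness, metalness, n, v, l, light_color):
    """Minimal Cook-Torrance / GGX relighting of decomposed PBR maps under one
    directional light. All inputs are per-pixel tensors with unit n, v, l.

    albedo    : (..., 3)  diffuse albedo
    roughness : (..., 1)
    metalness : (..., 1)
    n, v, l   : (..., 3)  normal, view, and light directions
    """
    h = F.normalize(v + l, dim=-1)
    n_dot_l = (n * l).sum(-1, keepdim=True).clamp(min=0)
    n_dot_v = (n * v).sum(-1, keepdim=True).clamp(min=1e-4)
    n_dot_h = (n * h).sum(-1, keepdim=True).clamp(min=0)
    v_dot_h = (v * h).sum(-1, keepdim=True).clamp(min=0)

    a2 = (roughness ** 2) ** 2
    d = a2 / (torch.pi * ((n_dot_h ** 2) * (a2 - 1) + 1) ** 2)   # GGX normal distribution
    k = (roughness + 1) ** 2 / 8
    g = (n_dot_l / (n_dot_l * (1 - k) + k)) * (n_dot_v / (n_dot_v * (1 - k) + k))
    f0 = 0.04 * (1 - metalness) + albedo * metalness             # Fresnel at normal incidence
    f = f0 + (1 - f0) * (1 - v_dot_h) ** 5                       # Schlick approximation

    specular = d * g * f / (4 * n_dot_v * n_dot_l + 1e-4)
    diffuse = (1 - metalness) * albedo / torch.pi
    return (diffuse + specular) * light_color * n_dot_l
```

Because the diffuse and specular terms depend only on the decomposed maps and the new light, any illumination left baked into the albedo (as in the SF3D failure case above) is carried directly into the relit image.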

BibTeX

@article{feng2024arm,
  title={ARM: Appearance Reconstruction Model for Relightable 3D Generation},
  author={Xiang Feng and Chang Yu and Zoubin Bi and Yintong Shang and Feng Gao and Hongzhi Wu and Kun Zhou and Chenfanfu Jiang and Yin Yang},
  journal={arXiv preprint arXiv:2411.10825},
  year={2024}
}