Pix2Shape – Towards Unsupervised Learning of 3D Scenes from Images using a View-based Representation

by Sai Rajeswar, et al.

We infer and generate three-dimensional (3D) scene information from a single input image, without supervision. This problem is under-explored, with most prior work relying on supervision from, e.g., 3D ground truth, multiple images of a scene, image silhouettes, or key-points. We propose Pix2Shape, an approach to solve this problem with four components: (i) an encoder that infers the latent 3D representation from an image, (ii) a decoder that generates an explicit 2.5D surfel-based reconstruction of a scene from the latent code, (iii) a differentiable renderer that synthesizes a 2D image from the surfel representation, and (iv) a critic network trained to discriminate between images generated by the decoder-renderer and those from a training distribution. Pix2Shape can generate complex 3D scenes that scale with the view-dependent on-screen resolution, unlike representations that capture world-space resolution, e.g., voxels or meshes. We show that Pix2Shape learns a consistent scene representation in its encoded latent space and that the decoder can then be applied to this latent representation to synthesize the scene from a novel viewpoint. We evaluate Pix2Shape with experiments on the ShapeNet dataset as well as on a novel benchmark we developed, called 3D-IQTT, which evaluates models based on their ability to enable 3D spatial reasoning. Qualitative and quantitative evaluations demonstrate Pix2Shape's ability to solve scene reconstruction, generation, and understanding tasks.
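To make the view-based 2.5D surfel idea more concrete, the following is a minimal sketch, not the authors' implementation. It assumes one surfel per pixel, each storing a depth and a surface normal, and a simple pinhole camera looking along +z. The names `Surfel`, `unproject_depth_map`, and `fov_y_deg` are hypothetical and chosen only for illustration. The sketch shows why such a representation scales with on-screen resolution: the number of surfels equals the number of pixels in the view.

```python
import math
from dataclasses import dataclass


@dataclass
class Surfel:
    position: tuple  # (x, y, z) in camera space
    normal: tuple    # unit-length surface normal in camera space


def _normalize(v):
    # Assumes a non-zero vector; a zero normal would divide by zero.
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def unproject_depth_map(depth, normals, fov_y_deg=60.0):
    """Build one camera-space surfel per pixel from depth and normal maps.

    Assumptions (not taken from the paper): a pinhole camera at the origin
    looking along +z, square pixels, and depth stored as the z coordinate
    of the visible surface point.
    """
    height, width = len(depth), len(depth[0])
    # Focal length in pixels, derived from the vertical field of view.
    focal = (height / 2.0) / math.tan(math.radians(fov_y_deg) / 2.0)
    surfels = []
    for v in range(height):
        for u in range(width):
            z = depth[v][u]
            # Back-project the pixel centre through the pinhole model.
            x = (u + 0.5 - width / 2.0) * z / focal
            y = (v + 0.5 - height / 2.0) * z / focal
            surfels.append(Surfel((x, y, z), _normalize(normals[v][u])))
    return surfels


if __name__ == "__main__":
    depth = [[2.0, 2.0], [2.0, 2.0]]
    normals = [[(0.0, 0.0, -1.0)] * 2 for _ in range(2)]
    surfels = unproject_depth_map(depth, normals, fov_y_deg=90.0)
    print(len(surfels))  # 4: one surfel per pixel
```

In the setting the abstract describes, the depth and normal maps would come from the decoder and the resulting surfels would be passed to a differentiable renderer; here they are plain nested lists so the geometry can be checked by hand.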


SynSin: End-to-end View Synthesis from a Single Image

Single image view synthesis allows for the generation of new views of a ...

Unsupervised Continuous Object Representation Networks for Novel View Synthesis

Novel View Synthesis (NVS) is concerned with the generation of novel vie...

RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation

Diffusion models currently achieve state-of-the-art performance for both...

Validation of Modulation Transfer Functions and Noise Power Spectra from Natural Scenes

The Modulation Transfer Function (MTF) and the Noise Power Spectrum (NPS...

Learning to Generate and Reconstruct 3D Meshes with only 2D Supervision

We present a unified framework tackling two problems: class-specific 3D ...

Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models

We present Text2Room, a method for generating room-scale textured 3D mes...

Structural Autoencoders Improve Representations for Generation and Transfer

We study the problem of structuring a learned representation to signific...
