Compositional Human-Scene Interaction Synthesis with Semantic Control

by Kaifeng Zhao, et al.
ETH Zurich

Synthesizing natural interactions between virtual humans and their 3D environments is critical for numerous applications, such as computer games and AR/VR experiences. Our goal is to synthesize humans interacting with a given 3D scene, controlled by high-level semantic specifications given as pairs of action categories and object instances, e.g., "sit on the chair". The key challenge of incorporating interaction semantics into the generation framework is to learn a joint representation that effectively captures heterogeneous information, including human body articulation, 3D object geometry, and the intent of the interaction. To address this challenge, we design a novel transformer-based generative model, in which the articulated 3D human body surface points and 3D objects are jointly encoded in a unified latent space, and the semantics of the interaction between the human and objects are embedded via positional encoding. Furthermore, inspired by the compositional nature of interactions, in which humans can simultaneously interact with multiple objects, we define interaction semantics as the composition of varying numbers of atomic action-object pairs. Our proposed generative model can naturally incorporate varying numbers of atomic interactions, which enables synthesizing compositional human-scene interactions without requiring composite interaction data. We extend the PROX dataset with interaction semantic labels and scene instance segmentation to evaluate our method and demonstrate that it can generate realistic human-scene interactions with semantic control. Our perceptual study shows that our synthesized virtual humans can naturally interact with 3D scenes, considerably outperforming existing methods. We name our method COINS, for COmpositional INteraction Synthesis with Semantic Control. Code and data are available at
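To make the notion of compositional interaction semantics concrete, the following is a minimal Python sketch of how a specification might be represented as an unordered set of atomic action-object pairs. It is an illustration only, not the COINS implementation: the names `AtomicInteraction`, `compose`, and `describe`, as well as the integer `object_id` field, are hypothetical and do not come from the paper or its code release.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class AtomicInteraction:
    """One atomic action-object pair (hypothetical representation)."""
    action: str           # action category, e.g. "sit on"
    object_category: str  # semantic class of the object, e.g. "chair"
    object_id: int        # assumed index of the object instance in the scene


def compose(*atoms: AtomicInteraction) -> FrozenSet[AtomicInteraction]:
    """Combine any number of atomic interactions into one specification.

    A frozenset is used so that the composition is order-independent and
    can hold a varying number of atomic pairs.
    """
    return frozenset(atoms)


def describe(interaction: FrozenSet[AtomicInteraction]) -> str:
    """Render a specification as readable text, sorted for determinism."""
    parts = sorted(f"{a.action} {a.object_category}" for a in interaction)
    return " + ".join(parts)


sit = AtomicInteraction("sit on", "chair", object_id=3)
touch = AtomicInteraction("touch", "table", object_id=7)

single = compose(sit)            # one atomic interaction
composite = compose(sit, touch)  # two atomic interactions at once
print(describe(composite))
```

In this sketch a single interaction and a composite one share the same type, which mirrors the idea that a model accepting varying numbers of atomic pairs can handle both without a separate representation for composite interactions.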



Generating Person-Scene Interactions in 3D Scenes

High fidelity digital 3D environments have been proposed in recent years...

Compositional 3D Human-Object Neural Animation

Human-object interactions (HOIs) are crucial for human-centric scene und...

COUCH: Towards Controllable Human-Chair Interactions

Humans interact with an object in many different ways by making contact ...

Narrator: Towards Natural Control of Human-Scene Interaction Generation via Relationship Reasoning

Naturally controllable human-scene interaction (HSI) generation has an i...

NCHO: Unsupervised Learning for Neural 3D Composition of Humans and Objects

Deep generative models have been recently extended to synthesizing 3D di...

Unified Human-Scene Interaction via Prompted Chain-of-Contacts

Human-Scene Interaction (HSI) is a vital component of fields like embodi...

Self-Supervised Learning of Action Affordances as Interaction Modes

When humans perform a task with an articulated object, they interact wit...
