Egocentric scene context for human-centric environment understanding from video

07/22/2022
by Tushar Nagarajan, et al.

First-person video highlights a camera-wearer's activities in the context of their persistent environment. However, current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space and only capture what is directly seen. We present an approach that links egocentric video and camera pose over time by learning representations that are predictive of the camera-wearer's (potentially unseen) local surroundings to facilitate human-centric environment understanding. We train such models using videos from agents in simulated 3D environments where the environment is fully observable, and test them on real-world videos of house tours from unseen environments. We show that by grounding videos in their physical environment, our models surpass traditional scene classification models at predicting which room a camera-wearer is in (where frame-level information is insufficient), and can leverage this grounding to localize video moments corresponding to environment-centric queries, outperforming prior methods. Project page: http://vision.cs.utexas.edu/projects/ego-scene-context/
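As a rough, self-contained sketch of the idea described in the abstract (this is not the authors' code: the architecture, feature sizes, and the top-down semantic-grid parameterization of "local surroundings" are all illustrative assumptions), the snippet below shows how per-frame visual features and camera poses could be fused over time and trained, with full supervision from a simulator, to predict a map of the camera-wearer's partially unseen surroundings:

```python
# Minimal sketch (not the authors' implementation) of pose-grounded video
# representations trained to predict the camera-wearer's local surroundings.
# All module names, dimensions, and the semantic-grid output are assumptions.

import torch
import torch.nn as nn


class EgoSceneContextModel(nn.Module):
    def __init__(self, feat_dim=512, pose_dim=4, hidden_dim=256,
                 map_size=9, num_classes=12):
        super().__init__()
        # Project visual features and (x, z, sin yaw, cos yaw) poses into a
        # shared space before temporal aggregation.
        self.embed = nn.Linear(feat_dim + pose_dim, hidden_dim)
        # Aggregate the pose-grounded frame features over time.
        self.temporal = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Decode a local semantic grid around the camera wearer; cells behind
        # or beside the camera are never directly observed, so the model must
        # learn to anticipate unseen surroundings.
        self.map_head = nn.Linear(hidden_dim, map_size * map_size * num_classes)
        self.map_size, self.num_classes = map_size, num_classes

    def forward(self, frame_feats, poses):
        # frame_feats: (B, T, feat_dim), poses: (B, T, pose_dim)
        x = self.embed(torch.cat([frame_feats, poses], dim=-1))
        h, _ = self.temporal(x)                      # (B, T, hidden_dim)
        logits = self.map_head(h)                    # (B, T, S*S*C)
        B, T, _ = logits.shape
        return logits.view(B, T, self.num_classes, self.map_size, self.map_size)


# Training-loop sketch: a simulator exposes ground-truth local maps, so the
# prediction task reduces to per-cell classification.
model = EgoSceneContextModel()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

frame_feats = torch.randn(2, 16, 512)          # e.g. frozen CNN features
poses = torch.randn(2, 16, 4)                  # simulator camera poses
gt_maps = torch.randint(0, 12, (2, 16, 9, 9))  # ground-truth semantic grid

optim.zero_grad()
pred = model(frame_feats, poses)               # (B, T, C, S, S)
loss = loss_fn(pred.flatten(0, 1), gt_maps.flatten(0, 1))
loss.backward()
optim.step()
```

Because the simulator makes the environment fully observable, grid cells outside the camera's field of view still receive supervision, which is what would force the representation to anticipate unseen scene context rather than merely encode what is directly visible.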


Related research

01/14/2020 · EGO-TOPO: Environment Affordances from Egocentric Video
First-person video naturally brings the use of a physical environment to...

11/26/2020 · 4D Human Body Capture from Egocentric Video via 3D Scene Grounding
To understand human daily social interaction from egocentric perspective...

06/03/2019 · Grounded Human-Object Interaction Hotspots from Video (Extended Abstract)
Learning how to interact with objects is an important step towards embod...

12/08/2022 · VideoDex: Learning Dexterity from Internet Videos
To build general robotic agents that can operate in many environments, i...

05/05/2022 · Visually plausible human-object interaction capture from wearable sensors
In everyday lives, humans naturally modify the surrounding environment t...

01/04/2023 · Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations
Can conversational videos captured from multiple egocentric viewpoints r...

09/19/2022 · T3VIP: Transformation-based 3D Video Prediction
For autonomous skill acquisition, robots have to learn about the physica...
