We propose a new task of space-time semantic correspondence prediction i...
Many top-down architectures for instance segmentation achieve significan...
Video understanding tasks take many forms, from action detection to visu...
Open-world instance segmentation is the task of grouping pixels into obj...
Video transformers have recently emerged as a competitive alternative to...
Current state-of-the-art object detection and segmentation methods work ...
A majority of approaches solve the problem of video frame interpolation ...
The visual and audio modalities are highly correlated yet they contain
d...
Video classification methods often divide the video into short clips, do...
Although a video is effectively a sequence of images, visual perception
...
Motion is a salient cue to recognize actions in video. Modern action
rec...
Modern approaches for multi-person pose estimation in video require larg...
Consider end-to-end training of a multi-modal vs. a single-modal network...
Current fully-supervised video datasets consist of only a few hundred
th...
While many action recognition datasets consist of collections of brief,
...
Group convolution has been shown to offer great computational savings in...
Video recognition models have progressed significantly over the past few...
Despite huge success in the image domain, modern detection models such a...
There is a natural correlation between the visual and auditive elements ...
This paper addresses the problem of estimating and tracking human body
k...
In this paper we discuss several forms of spatiotemporal convolutions fo...
In this work we propose a simple unsupervised approach for next frame
pr...
While there is overall agreement that future technology for organizing,
...
We propose a simple, yet effective approach for spatiotemporal feature
l...
This paper introduces EXMOVES, learned exemplar-based features for effic...