Existing approaches to unsupervised video instance segmentation typicall...
We present ImageBind, an approach to learn a joint embedding across six...
This paper revisits the standard pretrain-then-finetune paradigm used in...
Recipe personalization through ingredient substitution has the potential...
We propose Cut-and-LEaRn (CutLER), a simple approach for training unsupe...
Video-language embeddings are a promising avenue for injecting semantics...
Narrated "how-to" videos have emerged as a promising data source for a w...
Transformer-based architectures have become competitive across a variety...
Prior work has studied different visual modalities in isolation and deve...
Current object detectors are limited in vocabulary size due to the small...
We find Mask2Former also achieves state-of-the-art performance on video...
Image segmentation is about grouping pixels with different semantics, e....
We propose 3DETR, an end-to-end Transformer-based object detection model...
We propose Anticipative Video Transformer (AVT), an end-to-end attention...
We introduce WyPR, a Weakly-supervised framework for Point cloud Recogni...
A common approach to solving physical-reasoning tasks is to train a valu...
Pretraining on large labeled datasets is a prerequisite to achieve good...
Physical reasoning requires forward prediction: the ability to forecast ...
With the advent of large-scale multimodal video datasets, especially seq...
Joint vision and language tasks like visual question answering are fasci...
Computer vision has undergone a dramatic revolution in performance, driv...
We address the task of unsupervised retargeting of human actions from on...
Video recognition models have progressed significantly over the past few...
We introduce the Action Transformer model for recognizing and localizing...
We introduce a simple baseline for action localization on the AVA datase...
In recent years, there has been a renewed interest in jointly modeling p...
This paper addresses the problem of estimating and tracking human body k...
We introduce a simple yet surprisingly powerful model to incorporate att...
In this work, we introduce a new video representation for action classif...
What is a good vector representation of an object? We believe that it sh...