Rohit Girdhar

research

∙ 08/28/2023

VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

Existing approaches to unsupervised video instance segmentation typicall...

0 Xudong Wang, et al. ∙

research

∙ 05/09/2023

ImageBind: One Embedding Space To Bind Them All

We present ImageBind, an approach to learn a joint embedding across six ...

0 Rohit Girdhar, et al. ∙

research

∙ 03/23/2023

The effectiveness of MAE pre-pretraining for billion-scale pretraining

This paper revisits the standard pretrain-then-finetune paradigm used in...

0 Mannat Singh, et al. ∙

research

∙ 02/15/2023

Learning to Substitute Ingredients in Recipes

Recipe personalization through ingredient substitution has the potential...

0 Bahare Fatemi, et al. ∙

research

∙ 01/26/2023

Cut and Learn for Unsupervised Object Detection and Instance Segmentation

We propose Cut-and-LEaRn (CutLER), a simple approach for training unsupe...

0 Xudong Wang, et al. ∙

research

∙ 01/05/2023

HierVL: Learning Hierarchical Video-Language Embeddings

Video-language embeddings are a promising avenue for injecting semantics...

0 Kumar Ashutosh, et al. ∙

research

∙ 01/05/2023

What You Say Is What You Show: Visual Narration Detection in Instructional Videos

Narrated "how-to" videos have emerged as a promising data source for a w...

0 Kumar Ashutosh, et al. ∙

research

∙ 06/16/2022

OmniMAE: Single Model Masked Pretraining on Images and Videos

Transformer-based architectures have become competitive across a variety...

11 Rohit Girdhar, et al. ∙

research

∙ 01/20/2022

Omnivore: A Single Model for Many Visual Modalities

Prior work has studied different visual modalities in isolation and deve...

7 Rohit Girdhar, et al. ∙

research

∙ 01/07/2022

Detecting Twenty-thousand Classes using Image-level Supervision

Current object detectors are limited in vocabulary size due to the small...

11 Xingyi Zhou, et al. ∙

research

∙ 12/20/2021

Mask2Former for Video Instance Segmentation

We find Mask2Former also achieves state-of-the-art performance on video ...

11 Bowen Cheng, et al. ∙

research

∙ 12/02/2021

Masked-attention Mask Transformer for Universal Image Segmentation

Image segmentation is about grouping pixels with different semantics, e....

6 Bowen Cheng, et al. ∙

research

∙ 09/16/2021

An End-to-End Transformer Model for 3D Object Detection

We propose 3DETR, an end-to-end Transformer based object detection model...

0 Ishan Misra, et al. ∙

research

∙ 06/03/2021

Anticipative Video Transformer

We propose Anticipative Video Transformer (AVT), an end-to-end attention...

0 Rohit Girdhar, et al. ∙

research

∙ 05/13/2021

3D Spatial Recognition without Spatially Labeled 3D

We introduce WyPR, a Weakly-supervised framework for Point cloud Recogni...

0 Zhongzheng Ren, et al. ∙

research

∙ 02/20/2021

Physical Reasoning Using Dynamics-Aware Models

A common approach to solving physical-reasoning tasks is to train a valu...

0 Eltayeb Ahmed, et al. ∙

research

∙ 01/07/2021

Self-Supervised Pretraining of 3D Features on any Point-Cloud

Pretraining on large labeled datasets is a prerequisite to achieve good ...

14 Zaiwei Zhang, et al. ∙

research

∙ 06/18/2020

Forward Prediction for Physical Reasoning

Physical reasoning requires forward prediction: the ability to forecast ...

0 Rohit Girdhar, et al. ∙

research

∙ 06/12/2020

Video Understanding as Machine Translation

With the advent of large-scale multimodal video datasets, especially seq...

11 Bruno Korbar, et al. ∙

research

∙ 11/08/2019

Are we asking the right questions in MovieQA?

Joint vision and language tasks like visual question answering are fasci...

0 Bhavan Jasani, et al. ∙

research

∙ 10/10/2019

CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning

Computer vision has undergone a dramatic revolution in performance, driv...

7 Rohit Girdhar, et al. ∙

research

∙ 10/10/2019

MetaPix: Few-Shot Video Retargeting

We address the task of unsupervised retargeting of human actions from on...

8 Jessica Lee, et al. ∙

research

∙ 01/26/2019

DistInit: Learning Video Representations without a Single Labeled Video

Video recognition models have progressed significantly over the past few...

0 Rohit Girdhar, et al. ∙

research

∙ 12/06/2018

Video Action Transformer Network

We introduce the Action Transformer model for recognizing and localizing...

20 Rohit Girdhar, et al. ∙

research

∙ 07/26/2018

A Better Baseline for AVA

We introduce a simple baseline for action localization on the AVA datase...

0 Rohit Girdhar, et al. ∙

research

∙ 04/09/2018

Binge Watching: Scaling Affordance Learning from Sitcoms

In recent years, there has been a renewed interest in jointly modeling p...

2 Xiaolong Wang, et al. ∙

research

∙ 12/26/2017

Detect-and-Track: Efficient Pose Estimation in Videos

This paper addresses the problem of estimating and tracking human body k...

0 Rohit Girdhar, et al. ∙

research

∙ 11/04/2017

Attentional Pooling for Action Recognition

We introduce a simple yet surprisingly powerful model to incorporate att...

0 Rohit Girdhar, et al. ∙

research

∙ 04/10/2017

ActionVLAD: Learning spatio-temporal aggregation for action classification

In this work, we introduce a new video representation for action classif...

0 Rohit Girdhar, et al. ∙

research

∙ 03/29/2016

Learning a Predictable and Generative Vector Representation for Objects

What is a good vector representation of an object? We believe that it sh...

0 Rohit Girdhar, et al. ∙

