We propose Unified-IO, a model that performs a large variety of AI tasks...
Embodied AI has seen steady progress across a diverse set of independent...
The visual world naturally exhibits a long-tailed distribution of open
c...
Convolutional neural networks (CNNs) are ubiquitous in computer vision, ...
The problem of knowledge-based visual question answering involves answer...
Mirroring the success of masked language models, vision-and-language
cou...
Can we develop visually grounded dialog agents that can efficiently adap...
Textual cues are essential for everyday tasks like buying groceries and ...
Much of vision-and-language research focuses on a small but diverse set ...
We present ViLBERT (short for Vision-and-Language BERT), a model for lea...
Consider a collaborative task that requires communication. Two agents ar...
The Vision-and-Language Navigation (VLN) task entails an agent following...
In an open-world setting, it is inevitable that an intelligent agent (e....
We propose a novel scene graph generation model called Graph R-CNN, that...
We introduce a novel framework for image captioning that can produce nat...
We present a novel training framework for neural sequence models,
partic...
We introduce ParlAI (pronounced "par-lay"), an open-source software plat...
Attention-based neural encoder-decoder frameworks have been widely adopt...
A number of recent works have proposed attention models for Visual Quest...
We propose the task of free-form and open-ended Visual Question Answerin...