We propose attribute-aware multimodal entity linking, where the input is...
The abundance of instructional videos and their narrations over the Inte...
Given the enormous number of instructional videos available online, lear...
In the fashion domain, there exists a variety of vision-and-language (V+...
Semi-supervised learning aims to train a model using limited labels.
Sta...
Embedding-based Retrieval (EBR) in e-commerce search is a powerful searc...
Large vision-language models are generally applicable to many downstream...
Generating a video given the first several static frames is challenging ...
Multimodal tasks in the fashion domain have significant potential for
e-...
Large-scale Vision-and-Language (V+L) pre-training for representation
le...
Cognitive science has shown that humans perceive videos in terms of even...
Dual encoders and cross encoders have been widely used for image-text
re...
Vision-and-Language (V+L) pre-training models have achieved tremendous
s...
We introduce CommerceMM - a multimodal model capable of providing a dive...
Most existing video-and-language (VidL) research focuses on a single dat...
We introduce a unified framework to jointly model images, text, and huma...
Given a video with aligned dialogue, people can often infer what is more...
Recent Transformer-based large-scale pre-trained models have revolutioni...
We present HERO, a Hierarchical EncodeR for Omni-representation learning...
We propose a new task towards more practical application for image gener...
We introduce a new task, Video-and-Language Inference, for joint multimo...
We introduce a new multimodal retrieval task - TV show Retrieval (TVR), ...
Joint image-text embedding is the bedrock for most Vision-and-Language (...
We present the task of Spatio-Temporal Video Question Answering, which
r...
Embodied Question Answering (EQA) is a relatively new task where an agen...
A grand goal in AI is to build a robot that can accurately navigate base...
Recent years have witnessed an increasing interest in image-based
questi...
In this paper, we address referring expression comprehension: localizing...
We address the problem of end-to-end visual storytelling. Given a photo
...
Referring expressions are natural language constructions used to identif...
Most recent garment capturing techniques rely on acquiring multiple view...
Humans refer to objects in their environments all the time, especially i...
In this paper, we introduce a new dataset consisting of 360,001 focused
...