Licheng Yu

research

∙ 05/24/2023

AMELI: Enhancing Multimodal Entity Linking with Fine-Grained Attributes

We propose attribute-aware multimodal entity linking, where the input is...

0 Barry Menglong Yao, et al. ∙

research

∙ 03/31/2023

Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations

The abundance of instructional videos and their narrations over the Inte...

0 Yiwu Zhong, et al. ∙

research

∙ 03/23/2023

Learning and Verification of Task Structure in Instructional Videos

Given the enormous number of instructional videos available online, lear...

0 Medhini Narasimhan, et al. ∙

research

∙ 03/04/2023

FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks

In the fashion domain, there exists a variety of vision-and-language (V+...

0 Xiao Han, et al. ∙

research

∙ 02/28/2023

RoPAWS: Robust Semi-supervised Representation Learning from Uncurated Data

Semi-supervised learning aims to train a model using limited labels. Sta...

0 Sangwoo Mo, et al. ∙

research

∙ 02/21/2023

Que2Engage: Embedding-based Retrieval for Relevant and Engaging Products at Facebook Marketplace

Embedding-based Retrieval (EBR) in e-commerce search is a powerful searc...

0 Yunzhong He, et al. ∙

research

∙ 01/05/2023

CiT: Curation in Training for Effective Vision-Language Data

Large vision-language models are generally applicable to many downstream...

0 Hu Xu, et al. ∙

research

∙ 11/23/2022

Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

Generating a video given the first several static frames is challenging ...

0 Tsu-Jui Fu, et al. ∙

research

∙ 10/26/2022

FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning

Multimodal tasks in the fashion domain have significant potential for e-...

0 Suvir Mirchandani, et al. ∙

research

∙ 07/17/2022

FashionViL: Fashion-Focused Vision-and-Language Representation Learning

Large-scale Vision-and-Language (V+L) pre-training for representation le...

10 Xiao Han, et al. ∙

research

∙ 04/01/2022

Generic Event Boundary Captioning: A Benchmark for Status Changes Understanding

Cognitive science has shown that humans perceive videos in terms of even...

0 Yuxuan Wang, et al. ∙

research

∙ 03/10/2022

LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval

Dual encoders and cross encoders have been widely used for image-text re...

3 Jie Lei, et al. ∙

research

∙ 03/01/2022

Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment

Vision-and-Language (V+L) pre-training models have achieved tremendous s...

3 Mingyang Zhou, et al. ∙

research

∙ 02/15/2022

CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval

We introduce CommerceMM - a multimodal model capable of providing a dive...

0 Licheng Yu, et al. ∙

research

∙ 06/08/2021

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

Most existing video-and-language (VidL) research focuses on a single dat...

3 Linjie Li, et al. ∙

research

∙ 05/12/2021

Connecting What to Say With Where to Look by Modeling Human Attention Traces

We introduce a unified framework to jointly model images, text, and huma...

9 Zihang Meng, et al. ∙

research

∙ 10/15/2020

What is More Likely to Happen Next? Video-and-Language Future Event Prediction

Given a video with aligned dialogue, people can often infer what is more...

3 Jie Lei, et al. ∙

research

∙ 05/15/2020

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

Recent Transformer-based large-scale pre-trained models have revolutioni...

0 Jize Cao, et al. ∙

research

∙ 05/01/2020

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

We present HERO, a Hierarchical EncodeR for Omni-representation learning...

3 Linjie Li, et al. ∙

research

∙ 03/26/2020

BachGAN: High-Resolution Image Synthesis from Salient Object Layout

We propose a new task towards more practical application for image gener...

7 Yandong Li, et al. ∙

research

∙ 03/25/2020

VIOLIN: A Large-Scale Dataset for Video-and-Language Inference

We introduce a new task, Video-and-Language Inference, for joint multimo...

17 Jingzhou Liu, et al. ∙

research

∙ 01/24/2020

TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval

We introduce a new multimodal retrieval task - TV show Retrieval (TVR), ...

8 Jie Lei, et al. ∙

research

∙ 09/25/2019

UNITER: Learning UNiversal Image-TExt Representations

Joint image-text embedding is the bedrock for most Vision-and-Language (...

0 Yen-Chun Chen, et al. ∙

research

∙ 04/25/2019

TVQA+: Spatio-Temporal Grounding for Video Question Answering

We present the task of Spatio-Temporal Video Question Answering, which r...

0 Jie Lei, et al. ∙

research

∙ 04/09/2019

Multi-Target Embodied Question Answering

Embodied Question Answering (EQA) is a relatively new task where an agen...

6 Licheng Yu, et al. ∙

research

∙ 04/08/2019

Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout

A grand goal in AI is to build a robot that can accurately navigate base...

0 Hao Tan, et al. ∙

research

∙ 09/05/2018

TVQA: Localized, Compositional Video Question Answering

Recent years have witnessed an increasing interest in image-based questi...

0 Jie Lei, et al. ∙

research

∙ 01/24/2018

MAttNet: Modular Attention Network for Referring Expression Comprehension

In this paper, we address referring expression comprehension: localizing...

0 Licheng Yu, et al. ∙

research

∙ 08/09/2017

Hierarchically-Attentive RNN for Album Summarization and Storytelling

We address the problem of end-to-end visual storytelling. Given a photo ...

0 Licheng Yu, et al. ∙

research

∙ 12/30/2016

A Joint Speaker-Listener-Reinforcer Model for Referring Expressions

Referring expressions are natural language constructions used to identif...

0 Licheng Yu, et al. ∙

research

∙ 08/03/2016

Detailed Garment Recovery from a Single-View Image

Most recent garment capturing techniques rely on acquiring multiple view...

0 Shan Yang, et al. ∙

research

∙ 07/31/2016

Modeling Context in Referring Expressions

Humans refer to objects in their environments all the time, especially i...

0 Licheng Yu, et al. ∙

research

∙ 05/31/2015

Visual Madlibs: Fill in the blank Image Generation and Question Answering

In this paper, we introduce a new dataset consisting of 360,001 focused ...

0 Licheng Yu, et al. ∙

