In this paper, we introduce CheXOFA, a new pre-trained vision-language m...
In this report, we present our champion solution for Ego4D Natural Langu...
Recent research on Large Language Models (LLMs) has led to remarkable ad...
Artificial Intelligence (AI) has made incredible progress recently. On t...
To build Video Question Answering (VideoQA) systems capable of assisting...
This technical report describes the CONE approach for Ego4D Natural Lang...
Video temporal grounding (VTG) aims to localize temporal moments in a...
The rapid development of 5G communication technology has given birth to ...
Fusion techniques are a key research topic in multimodal sentiment analysi...
This paper presents a unified multimodal pre-trained model called NÜWA t...
The task of video-based commonsense captioning aims to generate event-wi...
In this paper, we present GEM as a General Evaluation benchmark for Mult...
Generating videos from text is a challenging task due to its high comput...
Video-text retrieval plays an essential role in multi-modal research and...
In this paper, we focus on the imbalance issue, which is rarely studied ...
Question Aware Open Information Extraction (Question aware Open IE) take...
Procedural knowledge, which we define as concrete information about the ...
While many BERT-based cross-modal pre-trained models produce excellent r...
We propose UniViLM: a Unified Video and Language pre-training Model for ...
Recently, Visual Question Answering (VQA) has emerged as one of the most...