What Goes beyond Multi-modal Fusion in One-stage Referring Expression Comprehension: An Empirical Study

by Gen Luo, et al.

Most existing work in one-stage referring expression comprehension (REC) focuses on multi-modal fusion and reasoning, while the influence of other factors in this task lacks in-depth exploration. To fill this gap, we conduct an empirical study in this paper. Concretely, we first build a very simple REC network called SimREC, and ablate 42 candidate designs/settings, which cover the entire process of one-stage REC from network design to model training. Afterwards, we conduct over 100 experimental trials on three benchmark datasets of REC. The extensive experimental results not only show the key factors that affect REC performance in addition to multi-modal fusion, e.g., multi-scale features and data augmentation, but also yield some findings that run counter to conventional understanding. For example, as a vision and language (V&L) task, REC is less impacted by language prior. In addition, with a proper combination of these findings, we can improve the performance of SimREC by a large margin, e.g., +27.12% on RefCOCO+, which outperforms all existing REC methods. But the most encouraging finding is that with far less training overhead and far fewer parameters, SimREC can still achieve better performance than a set of large-scale pre-trained models, e.g., UNITER and VILLA, portraying the special role of REC in existing V&L research.




