Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation

11/28/2022
by Jiangyong Huang, et al.

Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding. Existing efforts to create a general vision model are limited in the scope of assessed tasks and offer no overarching framework to perform them holistically. We present a new comprehensive benchmark, General-purpose Visual Understanding Evaluation (G-VUE), covering the full spectrum of visual cognitive abilities with four functional domains: Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, from 3D reconstruction to visual reasoning and manipulation. Along with the benchmark, we provide a general encoder-decoder framework that allows an arbitrary visual representation to be evaluated on all 11 tasks. We evaluate various pre-trained visual representations with our framework and observe that (1) Transformer-based visual backbones generally outperform CNN-based backbones on G-VUE, and (2) visual representations from vision-language pre-training are superior to those from vision-only pre-training across visual tasks. With G-VUE, we provide a holistic evaluation standard to motivate research toward building general-purpose visual systems by obtaining more general-purpose visual representations.
