How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges

07/27/2023
by Haotong Qin, et al.

Google's Bard has emerged as a formidable competitor to OpenAI's ChatGPT in the field of conversational AI. Notably, Bard has recently been updated to handle visual inputs alongside text prompts during conversations. Given Bard's impressive track record with textual inputs, we explore its capabilities in understanding and interpreting visual data (images) conditioned on text questions. This exploration holds the potential to unveil new insights and challenges for Bard and other forthcoming multi-modal generative models, especially in addressing complex computer vision problems that demand accurate visual and language understanding. Specifically, in this study, we focus on 15 diverse task scenarios encompassing regular, camouflaged, medical, underwater, and remote sensing data to comprehensively evaluate Bard's performance. Our primary finding indicates that Bard still struggles in these vision scenarios, highlighting the significant gap in vision-based understanding that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, leading to enhanced capabilities in comprehending and interpreting fine-grained visual data. Our project is released at https://github.com/htqin/GoogleBard-VisUnderstand


