Waffling around for Performance: Visual Classification with Random Words and Broad Concepts

06/12/2023
by Karsten Roth, et al.

The visual classification performance of vision-language models such as CLIP can benefit from additional semantic knowledge, e.g. from large language models (LLMs) such as GPT-3. Extending class names with LLM-generated class descriptors, e.g. “waffle, which has a round shape”, or averaging retrieval scores over multiple such descriptors, has been shown to improve generalization performance. In this work, we study this behavior in detail and propose WaffleCLIP, a framework for zero-shot visual classification that simply replaces LLM-generated descriptors with random character and word descriptors. Without querying external models, it achieves similar performance gains on a large number of visual classification tasks. We extend these results with an extensive experimental study on the impact and shortcomings of the additional semantics introduced via LLM-generated descriptors, and show how semantic context is better leveraged by automatically querying LLMs for high-level concepts, while jointly resolving potential class name ambiguities. Link to the codebase: https://github.com/ExplainableML/WaffleCLIP.
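
The core idea in the abstract, averaging a class score over several prompts whose descriptors are random words or random character strings, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked codebase for that). The prompt template is modeled on the abstract's “waffle, which has a round shape” example, and `toy_embed` is a hypothetical hash-based stand-in for the CLIP image and text encoders so that the example runs without any model. The helper names, the vocabulary, and the descriptor lengths are all assumptions made for illustration.

```python
import hashlib
import math
import random
import string


def random_descriptor(rng, vocab, kind):
    """Return two random words, or two random 5-character strings (assumed lengths)."""
    if kind == "word":
        return " ".join(rng.choice(vocab) for _ in range(2))
    return " ".join(
        "".join(rng.choice(string.ascii_lowercase) for _ in range(5))
        for _ in range(2)
    )


def build_prompts(classname, rng, vocab, n_pairs=3):
    """Build n_pairs word-based and n_pairs character-based prompts for one class."""
    prompts = []
    for _ in range(n_pairs):
        for kind in ("word", "char"):
            descriptor = random_descriptor(rng, vocab, kind)
            # Assumed template, modeled on the abstract's example.
            prompts.append(f"A photo of a {classname}, which has {descriptor}.")
    return prompts


def toy_embed(text, dim=16):
    """Hypothetical stand-in encoder: a deterministic unit vector from a SHA-256 hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [b - 127.5 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def class_score(image_vec, prompts):
    """Average the cosine similarity between the image and every prompt of a class."""
    sims = [
        sum(a * b for a, b in zip(image_vec, toy_embed(p)))
        for p in prompts
    ]
    return sum(sims) / len(sims)


def classify(image_vec, classnames, seed=0):
    """Pick the class whose descriptor-averaged score is highest."""
    rng = random.Random(seed)
    vocab = ["river", "blue", "chair", "quick", "stone", "paper"]  # illustrative only
    scores = {
        name: class_score(image_vec, build_prompts(name, rng, vocab))
        for name in classnames
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    image_vec = toy_embed("placeholder for an encoded image")
    label, scores = classify(image_vec, ["waffle", "pancake"])
    print(label, scores)
```

With a real vision-language model, `toy_embed` would be replaced by the model's normalized image and text embeddings; the averaging step over prompts would stay the same. Because the stand-in encoder carries no semantics, the predicted label here is meaningless and only the control flow is informative.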
