Waffling around for Performance: Visual Classification with Random Words and Broad Concepts
The visual classification performance of vision-language models such as CLIP can benefit from additional semantic knowledge, e.g. from large language models (LLMs) such as GPT-3. Extending classnames with LLM-generated class descriptors, e.g. "waffle, which has a round shape", and averaging retrieval scores over multiple such descriptors, has been shown to improve generalization performance. In this work, we study this behavior in detail and propose WaffleCLIP, a framework for zero-shot visual classification which achieves similar performance gains on a large number of visual classification tasks by simply replacing LLM-generated descriptors with random character and word descriptors, without querying external models. We extend these results with an extensive experimental study on the impact and shortcomings of additional semantics introduced via LLM-generated descriptors, and showcase how semantic context is better leveraged by automatically querying LLMs for high-level concepts, while jointly resolving potential class name ambiguities. Link to the codebase: https://github.com/ExplainableML/WaffleCLIP.
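To make the core idea concrete, the following is a minimal sketch, assuming OpenAI's clip package and PyTorch are installed; it is not the paper's implementation (see the linked repository for that), and helper names such as random_word and class_text_features are illustrative. Each class prompt is padded with a random character sequence, and text embeddings are averaged over several such randomized descriptors before zero-shot retrieval.

import random
import string

import torch
import clip

def random_word(length=5):
    # Illustrative helper: a random lowercase character sequence,
    # standing in for an LLM-generated descriptor.
    return "".join(random.choices(string.ascii_lowercase, k=length))

def class_text_features(model, classname, num_descriptors=8, device="cpu"):
    # Average CLIP text embeddings over several randomized descriptors
    # for a single class, then re-normalize the ensemble embedding.
    prompts = [
        f"a photo of a {classname}, which has {random_word()}."
        for _ in range(num_descriptors)
    ]
    tokens = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    mean_feat = feats.mean(dim=0)
    return mean_feat / mean_feat.norm()

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classnames = ["waffle", "pancake", "toast"]
class_feats = torch.stack(
    [class_text_features(model, c, device=device) for c in classnames]
)
# Given a preprocessed image tensor (1 x 3 x 224 x 224), classification is
# cosine similarity against the averaged class embeddings:
#   image_feat = model.encode_image(image)
#   scores = image_feat @ class_feats.T

The averaging step mirrors the descriptor-ensembling setup the abstract refers to; the only change in this sketch is that the descriptor text is random rather than LLM-generated.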