EventCLIP: Adapting CLIP for Event-based Object Recognition

06/10/2023
by Ziyi Wu, et al.

Recent advances in 2D zero-shot and few-shot recognition often leverage large pre-trained vision-language models (VLMs) such as CLIP. Due to a shortage of suitable datasets, it is currently infeasible to train such models on event camera data. Thus, adapting existing models across modalities is an important research challenge. In this work, we propose EventCLIP, a new method that utilizes CLIP for zero-shot and few-shot recognition on event camera data. First, we demonstrate the suitability of CLIP's image embeddings for zero-shot event classification by converting raw events into 2D grid-based representations. Second, we propose a feature adapter that aggregates temporal information over event frames and refines text embeddings to better align them with the visual inputs. We evaluate EventCLIP on the N-Caltech, N-Cars, and N-ImageNet datasets under the few-shot learning setting, where it achieves state-of-the-art performance. Finally, we show that the robustness of existing event-based classifiers against data variations can be further boosted by ensembling them with EventCLIP.
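To make the zero-shot pipeline concrete, here is a minimal sketch of the idea described above: raw events are accumulated into a 2D grid (an event frame), tiled to three channels, and scored against CLIP text embeddings. This is an illustration under stated assumptions, not the paper's implementation; the event file, sensor resolution, prompt template, and histogram representation are hypothetical, and it uses the open-source `clip` package from OpenAI.

```python
import numpy as np
import torch
import clip                      # https://github.com/openai/CLIP
from PIL import Image

def events_to_frame(events, height, width):
    """Accumulate per-pixel event counts into a normalized grayscale frame.

    `events` is assumed to be an (N, 4) array of (x, y, timestamp, polarity);
    the actual grid-based representation used in the paper may differ.
    """
    frame = np.zeros((height, width), dtype=np.float32)
    np.add.at(frame, (events[:, 1].astype(int), events[:, 0].astype(int)), 1.0)
    frame /= max(frame.max(), 1e-6)
    return (frame * 255).astype(np.uint8)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["car", "background"]   # e.g. the two N-Cars categories
text_tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

events = np.load("sample_events.npy")            # hypothetical event recording
gray = events_to_frame(events, height=100, width=120)
image = preprocess(Image.fromarray(gray).convert("RGB")).unsqueeze(0).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text_tokens)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * img_feat @ txt_feat.T       # scaled cosine similarities
print("predicted class:", class_names[logits.argmax(dim=-1).item()])
```

The few-shot feature adapter can likewise be sketched as a small residual module that pools per-frame CLIP features over time before matching them against (refined) text embeddings. The layer sizes and mean pooling below are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn

class EventFeatureAdapter(nn.Module):
    """Hypothetical adapter: aggregates CLIP features across T event frames
    and refines them with a residual MLP (sizes are illustrative)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, dim)
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T, dim) per-frame CLIP image features
        pooled = frame_feats.mean(dim=0)       # temporal aggregation
        adapted = pooled + self.mlp(pooled)    # residual refinement
        return adapted / adapted.norm(dim=-1, keepdim=True)
```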
