Zelda: Video Analytics using Vision-Language Models

05/05/2023
by Francisco Romero, et al.

Advances in ML have motivated the design of video analytics systems that allow structured queries over video datasets. However, existing systems limit query expressivity, require users to specify an ML model per predicate, rely on complex optimizations that trade off accuracy for performance, and return large amounts of redundant, low-quality results. This paper focuses on the recently developed Vision-Language Models (VLMs), which allow users to query images using natural language such as "cars during daytime at traffic intersections." Through an in-depth analysis, we show that VLMs address three limitations of current video analytics systems: they offer general expressivity, a single general-purpose model can serve many predicates, and they are both simple and fast. However, VLMs still return large numbers of redundant and low-quality results, which can overwhelm and burden users. We present Zelda: a video analytics system that uses VLMs to return both relevant and semantically diverse results for top-K queries on large video datasets. Zelda prompts the VLM with the user's query in natural language plus additional terms that improve accuracy and identify low-quality frames. Zelda improves result diversity by leveraging the rich semantic information encoded in VLM embeddings. We evaluate Zelda across five datasets and 19 queries and quantitatively show it achieves higher mean average precision (up to 1.15×) and improves average pairwise similarity (up to 1.16×) compared to using VLMs out-of-the-box. We also compare Zelda to a state-of-the-art video analytics engine and show that Zelda retrieves results 7.5× (up to 10.4×) faster for the same accuracy and frame diversity.
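The abstract describes balancing relevance (similarity between the query embedding and frame embeddings) against redundancy among the returned frames. As a rough illustration only, and not Zelda's actual algorithm, the following sketch applies a greedy maximal-marginal-relevance (MMR) selection over toy embedding vectors; in practice the embeddings would come from a VLM such as CLIP, and the function names and the MMR formulation here are assumptions for exposition.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_diverse(query_emb, frame_embs, k, lam=0.5):
    """Greedy maximal-marginal-relevance selection (illustrative sketch):
    at each step, pick the frame that best trades off relevance to the
    query against similarity to frames already selected.
    lam=1.0 is pure relevance; smaller lam favors diversity."""
    selected = []
    candidates = list(range(len(frame_embs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_emb, frame_embs[i])
            redundancy = max(
                (cosine(frame_embs[i], frame_embs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy example: frames 0 and 1 are near-duplicates relevant to the query,
# frame 2 is unrelated. A diversity-favoring lam drops the duplicate.
query = np.array([1.0, 0.0])
frames = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])]
print(top_k_diverse(query, frames, k=2, lam=0.3))  # → [0, 2]
print(top_k_diverse(query, frames, k=2, lam=0.9))  # → [0, 1]
```

With a diversity-leaning weight the near-duplicate frame is skipped in favor of a semantically different one, which mirrors the redundancy problem the abstract says Zelda addresses via VLM embeddings.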


