Task-agnostic Indexes for Deep Learning-based Queries over Unstructured Data

09/09/2020
by   Daniel Kang, et al.

Unstructured data is now commonly queried by using target deep neural networks (DNNs) to produce structured information, e.g., object types and positions in video. As these target DNNs can be computationally expensive, recent work uses proxy models to produce query-specific proxy scores. These proxy scores are then used in downstream query processing algorithms to speed up query execution. Unfortunately, proxy models are often trained per query, require large amounts of training data from the target DNN, and need new training methods for each query type. In this work, we develop an index construction method (task-agnostic semantic trainable index, TASTI) that produces reusable embeddings that can be used to generate proxy scores for a wide range of queries, removing the need for query-specific proxies. We observe that many queries over the same dataset only require access to the schema induced by the target DNN. For example, an aggregation query counting the number of cars and a selection query selecting frames of cars both require only the object types per frame of video. To leverage this opportunity, TASTI produces embeddings per record that have the key property that close embeddings have similar extracted attributes under the induced schema. Given this property, we show that clustering by embeddings can be used to answer downstream queries efficiently. We theoretically analyze TASTI and show that low training error guarantees downstream query accuracy for a natural class of queries. We evaluate TASTI on four video and text datasets and three query types. We show that TASTI can be 10x less expensive to construct than proxy models and can outperform them by up to 24x at query time.
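The idea that close embeddings share extracted attributes can be illustrated with a small sketch. This is not the paper's implementation: the abstract only states that clustering by embeddings is used, so the greedy farthest-point selection of representatives, the nearest-representative propagation, the toy 2-D embeddings, and the names `pick_representatives` and `proxy_scores` are all assumptions made for illustration. The array `true_counts` stands in for the output of an expensive target DNN (cars per frame).

```python
import numpy as np

def pick_representatives(embeddings, k):
    """Greedy farthest-point selection (an assumed clustering choice).

    Starts from record 0, then repeatedly adds the record that is
    farthest from every representative chosen so far.
    """
    chosen = [0]
    dist = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        new_dist = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        dist = np.minimum(dist, new_dist)
    return np.array(chosen)

def proxy_scores(embeddings, rep_idx, rep_labels):
    """Assign each record the label of its nearest representative."""
    diffs = embeddings[:, None, :] - embeddings[rep_idx][None, :, :]
    nearest = np.argmin(np.linalg.norm(diffs, axis=2), axis=1)
    return rep_labels[nearest]

# Toy embeddings: two tight groups of frames (hypothetical data).
embeddings = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])
# Stand-in for the target DNN's per-frame car counts.
true_counts = np.array([0, 0, 0, 2, 2, 2])

rep_idx = pick_representatives(embeddings, k=2)
# Only the representatives would be passed through the target DNN.
rep_labels = true_counts[rep_idx]
scores = proxy_scores(embeddings, rep_idx, rep_labels)

# The same scores serve two query types without retraining anything.
estimated_total_cars = int(scores.sum())      # aggregation query
frames_with_cars = np.flatnonzero(scores > 0)  # selection query
```

In this toy setup the target DNN is consulted for two of six frames, and both the aggregation and the selection query reuse the same propagated scores. A real system would pair such scores with a downstream query processing algorithm rather than reading them off directly.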
