ProS: Data Series Progressive k-NN Similarity Search and Classification with Probabilistic Quality Guarantees

by Karima Echihabi, et al.

Existing systems dealing with the increasing volume of data series cannot guarantee interactive response times, even for fundamental tasks such as similarity search. Therefore, it is necessary to develop analytic approaches that support exploration and decision making by providing progressive results before the final, exact ones have been computed. Prior works lack both efficiency and accuracy when applied to large-scale data series collections. We present and experimentally evaluate ProS, a new probabilistic learning-based method that provides quality guarantees for progressive Nearest Neighbor (NN) query answering. We develop our method for k-NN queries and demonstrate how it can be applied with the two most popular distance measures, namely, Euclidean and Dynamic Time Warping (DTW). We provide both initial and progressive estimates of the final answer that improve as the similarity search proceeds, as well as suitable stopping criteria for the progressive queries. Moreover, we describe how this method can be used to develop a progressive algorithm for data series classification (based on a k-NN classifier), and we additionally propose a method designed specifically for the classification task. Experiments with several diverse synthetic and real datasets demonstrate that our prediction methods constitute the first practical solutions to the problem, significantly outperforming competing approaches. This paper was published in the VLDB Journal (2022).
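
To make the idea of progressive answers concrete, the following is a minimal sketch of a progressive k-NN scan under Euclidean distance. It is not the ProS method: it performs a plain sequential scan with no index, and it omits the probabilistic quality estimates and stopping criteria that the abstract describes. The names `progressive_knn`, `euclidean`, and `batch_size` are hypothetical and chosen only for illustration.

```python
import heapq
import math
from typing import Iterator, List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def progressive_knn(
    query: Sequence[float],
    series: Sequence[Sequence[float]],
    k: int,
    batch_size: int,
) -> Iterator[Tuple[int, List[Tuple[float, int]]]]:
    """Yield (series_visited, best_so_far) after every batch of candidates.

    best_so_far is a list of (distance, index) pairs sorted by distance.
    Illustrative assumption: a sequential scan stands in for an index.
    """
    # Max-heap of the current k best answers, stored as (-distance, index)
    # so that the worst of the current answers sits at the root.
    heap: List[Tuple[float, int]] = []
    for i, candidate in enumerate(series):
        d = euclidean(query, candidate)
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif d < -heap[0][0]:
            # The candidate beats the current k-th best answer.
            heapq.heapreplace(heap, (-d, i))
        visited = i + 1
        if visited % batch_size == 0 or visited == len(series):
            # Report the intermediate answer; the last report is exact.
            yield visited, sorted((-neg_d, idx) for neg_d, idx in heap)


if __name__ == "__main__":
    data = [[0, 0], [3, 4], [1, 0], [0, 2], [0.5, 0]]
    for visited, answer in progressive_knn([0, 0], data, k=2, batch_size=2):
        print(visited, answer)
```

A caller can stop consuming the generator as soon as an intermediate answer is judged good enough. Deciding when that is the case, with a probabilistic guarantee, is the part that ProS contributes and that this sketch leaves out.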




