Learning from Data Streams: An Overview and Update

by Jesse Read, et al.

The literature on machine learning in the context of data streams is vast and growing. However, many of the defining assumptions regarding data-stream learning tasks are too strong to hold in practice, or are even contradictory such that they cannot be met in the context of supervised learning. Algorithms are chosen and designed based on criteria which are often not clearly stated, for problem settings not clearly defined, tested in unrealistic settings, and/or in isolation from related approaches in the wider literature. This calls into question the potential for real-world impact of many approaches conceived in such contexts, and risks propagating a misguided research focus. We propose to tackle these issues by reformulating the fundamental definitions and settings of supervised data-stream learning with regard to contemporary considerations of concept drift and temporal dependence; we take a fresh look at what constitutes a supervised data-stream learning task, and reconsider the algorithms that may be applied to tackle such tasks. Drawing on this formulation and overview, and helped by an informal survey of industrial players dealing with real-world data streams, we provide recommendations. Our main emphasis is that learning from data streams does not impose a single-pass or online-learning approach, or any particular learning regime; and any constraints on memory and time are not specific to streaming. Meanwhile, established techniques for dealing with temporal dependence and concept drift already exist in other areas of the literature. For the data streams community, we thus encourage a shift in research focus, from dealing with often-artificial constraints and assumptions on the learning mode, to issues such as robustness, privacy, and interpretability, which are increasingly relevant to learning in data streams in academic and industrial settings.




Concept-drifting Data Streams are Time Series; The Case for Continuous Adaptation

Learning from data streams is an increasingly important topic in data mi...

SMOClust: Synthetic Minority Oversampling based on Stream Clustering for Evolving Data Streams

Many real-world data stream applications not only suffer from concept dr...

Mondrian Forest for Data Stream Classification Under Memory Constraints

Supervised learning algorithms generally assume the availability of enou...

Learning from Ontology Streams with Semantic Concept Drift

Data stream learning has been largely studied for extracting knowledge s...

Mining Drifting Data Streams on a Budget: Combining Active Learning with Self-Labeling

Mining data streams poses a number of challenges, including the continuo...

On the challenges to learn from Natural Data Streams

In real-world contexts, sometimes data are available in form of Natural ...

Data Stream Classification using Random Feature Functions and Novel Method Combinations

Big Data streams are being generated in a faster, bigger, and more commo...
