Dynamical Linear Bandits

by Marco Mussi, et al.

In many real-world sequential decision-making problems, the effect of an action is not immediately reflected in the feedback but spreads over a long time frame. For instance, in online advertising, investing in a platform produces an increase in awareness, but the actual reward, i.e., a conversion, might occur far in the future. Furthermore, whether a conversion takes place depends on how fast the awareness grows, how quickly it vanishes, and the synergy or interference with other advertising platforms. Previous work has investigated the Multi-Armed Bandit framework with delayed and aggregated feedback, but without imposing a particular structure on how an action propagates into the future, thus disregarding possible dynamical effects. In this paper, we introduce a novel setting, Dynamical Linear Bandits (DLB), an extension of linear bandits characterized by a hidden state. When an action is performed, the learner observes a noisy reward whose mean is a linear function of the hidden state and of the action. Then, the hidden state evolves according to linear dynamics that are also affected by the performed action. We start by introducing the setting, discussing the notion of optimal policy, and deriving an expected regret lower bound. Then, we provide an any-time optimistic regret minimization algorithm, Dynamical Linear Upper Confidence Bound (DynLin-UCB), that suffers an expected regret of order O(c d sqrt(T)), where c is a constant that depends on the properties of the linear dynamical evolution, d is the dimension of the action vector, and T is the number of rounds. Finally, we conduct a numerical validation on a synthetic environment and on real-world data to show the effectiveness of DynLin-UCB in comparison with several baselines.
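To make the interaction protocol described above concrete, the following is a minimal simulation sketch of one possible DLB-style environment. It is not the authors' code. The names `A`, `B`, `omega`, `theta`, and `simulate_dlb` are assumptions introduced here for illustration, the hidden state is assumed to start at zero, and noise is added only to the reward (the setting in the paper may also include noise on the state).

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def simulate_dlb(A, B, omega, theta, actions, noise_std=0.0, seed=0):
    """Play a fixed sequence of actions and return the observed rewards.

    Assumed (hypothetical) model, following the abstract's description:
      reward_t   = <omega, x_t> + <theta, u_t> + noise
      x_{t+1}    = A x_t + B u_t
    where x_t is the hidden state and u_t is the action at round t.
    """
    rng = random.Random(seed)
    x = [0.0] * len(A)  # hidden state, assumed to start at zero
    rewards = []
    for u in actions:
        # The mean reward is linear in the hidden state and in the action.
        mean = dot(omega, x) + dot(theta, u)
        rewards.append(mean + rng.gauss(0.0, noise_std))
        # The hidden state then evolves linearly, driven by the action.
        x = [ax + bu for ax, bu in zip(matvec(A, x), matvec(B, u))]
    return rewards


# One-dimensional example: the action has a small immediate effect (theta)
# and a larger effect that accumulates through the hidden state (omega).
rewards = simulate_dlb(
    A=[[0.5]], B=[[1.0]], omega=[1.0], theta=[0.2],
    actions=[[1.0]] * 4,
)
print(rewards)
```

With the noise switched off, repeating the same action yields rewards of approximately 0.2, 1.2, 1.7, and 1.95: the reward keeps growing after the first round because the action's effect accumulates in the hidden state, which is the delayed behavior that a standard linear bandit model would not capture.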




