Actively Tracking the Optimal Arm in Non-Stationary Environments with Mandatory Probing
We study a novel multi-armed bandit (MAB) setting that requires the agent to probe all arms periodically in a non-stationary environment. In particular, we develop an algorithm that balances the regret guarantees of classical Thompson sampling (TS) with broadcast probing (BP) of all arms simultaneously in order to actively detect a change in the reward distributions. Once a system-level change is detected, the changed arm is identified by an optional subroutine called group exploration (GE), which scales as log_2(K) for a K-armed bandit setting. We characterize the probability of missed detection and the probability of false alarm in terms of the environment parameters. The latency of change detection is upper bounded by √(T), and within a period of √(T) all arms are probed at least once. We highlight the conditions under which the regret guarantee of the proposed algorithm outperforms that of state-of-the-art algorithms. Furthermore, unlike existing bandit algorithms, the proposed algorithm can be deployed for applications such as timely status updates, critical control, and wireless energy transfer, which are essential features of next-generation wireless communication networks. We demonstrate its efficacy by employing it in an industrial internet-of-things (IIoT) network designed for simultaneous wireless information and power transfer (SWIPT).
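To illustrate why identifying the changed arm can scale as log_2(K), the following is a minimal sketch and not the paper's GE subroutine. It assumes that exactly one arm has changed and that a group probe reports, without noise, whether the changed arm lies in the probed group. The function name `identify_changed_arm` and the oracle `group_has_change` are hypothetical names introduced only for this example.

```python
def identify_changed_arm(num_arms, group_has_change):
    """Locate a single changed arm with about log2(K) group probes.

    Assumptions (not taken from the paper): exactly one arm changed, and
    group_has_change(group) is a noiseless, hypothetical oracle returning
    True when the changed arm belongs to `group`.
    """
    # Number of bits needed to index num_arms arms, i.e. ceil(log2(K)).
    num_probes = max(1, (num_arms - 1).bit_length())
    changed = 0
    for bit in range(num_probes):
        # Probe every arm whose index has this bit set.
        group = [arm for arm in range(num_arms) if (arm >> bit) & 1]
        if group_has_change(group):
            changed |= 1 << bit  # the changed arm's index has this bit set
    return changed, num_probes


if __name__ == "__main__":
    # Example: 8 arms, arm 5 changed; three group probes recover its index.
    arm, probes = identify_changed_arm(8, lambda group: 5 in group)
    print(arm, probes)
```

Each probe fixes one bit of the changed arm's index, so the number of probes grows with the number of bits in K rather than with K itself. With noisy rewards, each group probe would itself be a statistical test, which this sketch deliberately omits.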