DTR Bandit: Learning to Make Response-Adaptive Decisions With Low Regret
Dynamic treatment regimes (DTRs) are personalized, sequential treatment plans that adapt treatment decisions to an individual's time-varying features and intermediate outcomes at each treatment stage. While the existing literature mostly focuses on learning the optimal DTR from sequentially randomized data, we study the problem of developing the optimal DTR in an online manner, where decisions in each round affect both our cumulative reward and our data collection for future learning; we term this the DTR bandit problem. We propose a novel algorithm that, by carefully balancing exploration and exploitation, achieves rate-optimal regret when the transition and reward models are linear. We demonstrate the empirical success of our algorithm both on synthetic data and on data from a real-world randomized trial for major depressive disorder.
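To make the exploration/exploitation trade-off under a linear reward model concrete, the sketch below runs a two-armed linear contextual bandit with epsilon-greedy exploration and per-arm ridge regression. This is a hypothetical baseline for illustration only, not the paper's algorithm; the function name, parameters, and single-stage setup are all assumptions.

```python
import numpy as np

# Hypothetical illustration (NOT the paper's method): epsilon-greedy
# exploration with per-arm online ridge regression, assuming a linear
# reward model r = <theta_a, x> + noise for each treatment arm a.
def run_epsilon_greedy(T=2000, d=3, eps=0.1, lam=1.0, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = [rng.normal(size=d), rng.normal(size=d)]  # true arm weights
    # Per-arm ridge statistics: A = lam*I + sum x x^T, b = sum r*x
    A = [lam * np.eye(d) for _ in range(2)]
    b = [np.zeros(d) for _ in range(2)]
    total_reward = 0.0
    for _ in range(T):
        x = rng.normal(size=d)  # observed context for this round
        est = [np.linalg.solve(A[a], b[a]) for a in range(2)]
        if rng.random() < eps:
            a = int(rng.integers(2))          # explore: random treatment
        else:
            a = int(np.argmax([est[k] @ x for k in range(2)]))  # exploit
        r = theta[a] @ x + noise * rng.normal()  # linear reward + noise
        A[a] += np.outer(x, x)                   # update ridge statistics
        b[a] += r * x
        total_reward += r
    est = [np.linalg.solve(A[a], b[a]) for a in range(2)]
    return theta, est, total_reward
```

With enough rounds, the ridge estimates for both arms concentrate around the true weights, which is the mechanism a regret analysis for linear models would build on.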