Regret Analysis of Learning-Based MPC with Partially-Unknown Cost Function
The exploration/exploitation trade-off is an inherent challenge in data-driven and adaptive control. Although this trade-off has been studied for multi-armed bandits, reinforcement learning (RL) for finite Markov chains, and RL for linear control systems, it is less well-studied for learning-based control of nonlinear systems. A significant theoretical challenge in the nonlinear setting is that, unlike the linear case, there is no explicit characterization of an optimal controller for a given set of cost and system parameters. In this paper, we propose using a finite-horizon oracle controller with perfect knowledge of all system parameters as a reference for optimal control actions. First, this oracle allows us to define a new notion of regret with respect to the oracle finite-horizon controller. Second, it allows us to develop a learning-based policy that we prove achieves low regret with respect to this oracle (i.e., square-root regret up to a log-squared factor). This policy is developed in the context of learning-based model predictive control (LBMPC). We conduct a statistical analysis to prove finite-sample concentration bounds for the estimation step of our policy, and then perform a control-theoretic analysis, using techniques from MPC and optimization theory, to show that the policy ensures closed-loop stability and achieves low regret. We conclude with numerical experiments on a model of heating, ventilation, and air-conditioning (HVAC) systems, which show the low regret of our policy in a setting where the cost function is partially unknown to the controller.
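The abstract leaves the regret notion implicit. As a rough illustration only, the sketch below computes cumulative regret against an oracle finite-horizon controller on an assumed scalar linear system with a quadratic stage cost whose state weight is unknown to the learner; the dynamics, cost values, least-squares estimator, and certainty-equivalence update are all illustrative assumptions, standing in for (not reproducing) the paper's LBMPC policy and analysis.

```python
# A minimal sketch (NOT the paper's algorithm) of regret measured against an
# oracle finite-horizon controller. Assumed setting: known scalar dynamics
# x+ = a*x + b*u + w and stage cost c(x,u) = q*x^2 + r*u^2 with q unknown
# to the learning-based controller.
import numpy as np

rng = np.random.default_rng(0)

a, b = 0.9, 0.5                 # known dynamics parameters (illustrative)
q_true, r = 4.0, 1.0            # true cost weights; q_true hidden from learner
T, N, sigma = 500, 10, 0.05     # run length, MPC horizon, noise scale
w = sigma * rng.normal(size=T)  # same disturbances for both rollouts, so the
                                # cost gap isolates the effect of learning

def mpc_gain(q):
    """First-stage gain of the horizon-N LQ problem via backward Riccati
    recursion (scalar case); the receding-horizon input is u = -k*x."""
    p = q
    for _ in range(N):
        k = a * b * p / (r + b ** 2 * p)
        p = q + a ** 2 * p - a * b * p * k
    return k

def rollout(policy):
    """Simulate the closed loop and return the incurred stage costs."""
    x, costs = 1.0, []
    for t in range(T):
        u = policy(t, x)
        costs.append(q_true * x ** 2 + r * u ** 2)
        x = a * x + b * u + w[t]
    return np.asarray(costs)

# Oracle finite-horizon controller: perfect knowledge of the cost.
k_star = mpc_gain(q_true)
oracle_costs = rollout(lambda t, x: -k_star * x)

class CertaintyEquivalentLearner:
    """Crude certainty-equivalence stand-in for the paper's LBMPC policy:
    estimate q by least squares from noisily observed stage costs, then
    apply the finite-horizon gain computed from the current estimate."""
    def __init__(self):
        self.q_hat, self.sxy, self.sx4, self.prev = 1.0, 0.0, 0.0, None
    def __call__(self, t, x):
        if self.prev is not None:
            xp, up = self.prev
            # Environment reveals the previous stage cost with noise.
            c_obs = q_true * xp ** 2 + r * up ** 2 + 0.01 * rng.normal()
            # Least squares on (c - r*u^2) = q * x^2 + noise.
            self.sxy += xp ** 2 * (c_obs - r * up ** 2)
            self.sx4 += xp ** 4
            self.q_hat = self.sxy / self.sx4
        u = -mpc_gain(self.q_hat) * x
        self.prev = (x, u)
        return u

learned_costs = rollout(CertaintyEquivalentLearner())
regret = np.cumsum(learned_costs - oracle_costs)  # regret w.r.t. the oracle
print(f"cumulative regret after {T} steps: {regret[-1]:.3f}")
```

Because both rollouts see the identical disturbance sequence, the cumulative cost difference reflects only the learner's imperfect knowledge of the cost; under the paper's result one would expect this quantity to grow roughly as a square root (up to logarithmic factors) rather than linearly.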