Queueing Network Controls via Deep Reinforcement Learning

by J. G. Dai, et al.

Novel advanced policy gradient (APG) methods with conservative policy iterations, such as trust region policy optimization (TRPO) and proximal policy optimization (PPO), have become dominant reinforcement learning algorithms because of their ease of implementation and good practical performance. A conventional setup for queueing network control problems is a Markov decision problem (MDP) with three features: an infinite state space, unbounded costs, and a long-run average-cost objective. We extend the theoretical justification for the use of APG methods to MDPs with these three features. We show that, in each iteration, the control policy parameters should be optimized within a trust region that prevents improper policy updates leading to system instability and that guarantees monotonic improvement. A critical challenge in queueing control optimization is the large number of samples typically required to estimate the relative value function. We discount future costs and use a discounted relative value function as an approximation of the relative value function. We show that this discounted relative value function can be estimated via regenerative simulation. In addition, assuming full knowledge of the transition probabilities, we incorporate the approximating martingale-process (AMP) method into the regenerative estimator. We provide numerical results for a parallel-server network and for large multiclass queueing networks operating in heavy-traffic regimes, learning policies that minimize the average number of jobs in the system. The experiments demonstrate that control policies produced by the proposed PPO algorithm outperform other heuristics and are near-optimal when the optimal policy can be computed.
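To illustrate the regenerative estimation idea mentioned above, the following is a minimal sketch, not the paper's implementation: a uniformized single-server birth-death chain stands in for a queueing network, the empty state serves as the regeneration state, the per-step cost is the number of jobs in the system, and all parameter values (arrival/service probabilities, the discount factor, cycle counts) are illustrative assumptions. Each regenerative cycle yields a discounted cost and a discount factor at the regeneration time; combining them gives an estimate of the discounted relative value function h_gamma(x) = V_gamma(x) - V_gamma(0).

```python
import random


def simulate_cycle(x0, p_arr=0.3, p_srv=0.5, gamma=0.99, cap=10_000):
    """Run one regenerative cycle of a uniformized single-server queue.

    Starting from x0 jobs, simulate until the chain first hits the
    regeneration state x* = 0 (for x0 = 0, until the first return).
    Returns (sum_{t < tau} gamma^t * X_t, gamma^tau).
    """
    x, t, disc, cost = x0, 0, 1.0, 0.0
    while t < cap:
        if x == 0 and t > 0:  # reached the regeneration state
            break
        cost += disc * x      # per-step cost = number of jobs in system
        u = random.random()
        if u < p_arr:
            x += 1
        elif u < p_arr + p_srv and x > 0:
            x -= 1            # otherwise: uniformization self-loop
        t += 1
        disc *= gamma
    return cost, disc


def discounted_relative_value(x, n_cycles=4000, **kw):
    """Monte-Carlo regenerative estimate of h_gamma(x) = V_gamma(x) - V_gamma(0).

    Uses the cycle identity V(x) = A(x) + B(x) * V(0), where A(x) is the
    expected discounted cost until regeneration and B(x) = E[gamma^tau];
    V(0) solves its own cycle equation V(0) = A(0) / (1 - B(0)).
    """
    costs0, discs0 = zip(*(simulate_cycle(0, **kw) for _ in range(n_cycles)))
    a0, b0 = sum(costs0) / n_cycles, sum(discs0) / n_cycles
    v0 = a0 / (1.0 - b0)
    costs, discs = zip(*(simulate_cycle(x, **kw) for _ in range(n_cycles)))
    a, b = sum(costs) / n_cycles, sum(discs) / n_cycles
    return a + b * v0 - v0
```

For a stable chain (arrival probability below service probability), the estimate is positive and increasing in the starting queue length, as a relative value function should be. The paper's AMP refinement would further reduce the variance of this estimator by exploiting known transition probabilities; that step is omitted here.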


