Reinforcement Learning Based Algorithms For Average Cost Markov Decision Processes
Mohammed Shahid Abdulla and Shalabh Bhatnagar

IISc-CSA-TR-2005-6
(May 2005)


Filed on May 25, 2005
Updated on May 25, 2005


This article proposes several two-timescale simulation-based actor-critic algorithms for the solution of infinite-horizon Markov Decision Processes with a finite state space under the average cost criterion. Two of the algorithms are for the compact (non-discrete) action setting, while the rest are for finite action spaces. On the slower timescale, all the algorithms perform a gradient search over the corresponding policy spaces using two different Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimates. On the faster timescale, the differential cost function corresponding to a given stationary policy is updated and further averaged for enhanced performance. A proof of convergence to a locally optimal policy is presented. Next, a memory-efficient implementation that uses a feature-vector representation of the state space and TD(0) learning along the faster timescale is discussed. The TD(0) algorithm does not follow an on-line sampling of states, but is observed to perform well in our setting. Numerical experiments on rate-based flow control at a bottleneck node, using a continuous-time queueing model, are presented for the proposed algorithms. We show performance comparisons of our algorithms with the two-timescale actor-critic algorithms of Konda and Borkar (1999); our algorithms exhibit more than an order of magnitude better performance.
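
As a rough illustration of the ingredients described above, the following Python sketch combines an SPSA-based actor update on a slower timescale with a TD(0) critic, a linear feature-vector representation, and a running average-cost estimate on a faster timescale. This is only a minimal sketch, not the algorithms of the report: the toy MDP, the softmax policy parameterization, the step-size schedules, and the perturbation size are all illustrative assumptions.

import numpy as np

# Minimal sketch (not the authors' exact algorithm) of a two-timescale
# actor-critic for an average-cost MDP with a finite state space.
# Fast timescale: TD(0) with a linear (here one-hot) critic and a running
# average-cost estimate. Slow timescale: two-sided SPSA gradient estimate
# over a randomized-policy parameterization. MDP, features, step sizes and
# perturbation size are illustrative assumptions.

rng = np.random.default_rng(0)

n_states, n_actions = 5, 2
# Hypothetical random MDP: transition kernel P[s, a] and one-stage cost c[s, a].
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
c = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

# Linear features for the differential cost function (one-hot states).
phi = np.eye(n_states)

def policy_probs(theta):
    """Softmax (Boltzmann) randomized policy over actions for each state."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def simulate_critic(theta, v, rho, steps, b):
    """Run average-cost TD(0) along the faster timescale under policy theta."""
    s = rng.integers(n_states)
    for _ in range(steps):
        a = rng.choice(n_actions, p=policy_probs(theta)[s])
        s_next = rng.choice(n_states, p=P[s, a])
        cost = c[s, a]
        # Average-cost temporal difference: c(s,a) - rho + V(s') - V(s).
        delta = cost - rho + phi[s_next] @ v - phi[s] @ v
        v = v + b * delta * phi[s]
        rho = rho + b * (cost - rho)   # running estimate of the average cost
        s = s_next
    return v, rho

theta = np.zeros((n_states, n_actions))   # actor parameters
v = np.zeros(n_states)                    # critic weights
rho = 0.0                                 # average-cost estimate
delta_spsa = 0.1                          # SPSA perturbation size (assumed)

for k in range(200):
    a_k = 1.0 / (k + 1)          # slow (actor) step size
    b_k = 1.0 / (k + 1) ** 0.6   # fast (critic) step size
    # Simultaneous +/-1 perturbation of all actor parameters.
    Delta = rng.choice([-1.0, 1.0], size=theta.shape)
    _, rho_plus = simulate_critic(theta + delta_spsa * Delta, v.copy(), rho, 50, b_k)
    _, rho_minus = simulate_critic(theta - delta_spsa * Delta, v.copy(), rho, 50, b_k)
    # Two-sided SPSA estimate of the gradient of the average cost.
    grad = (rho_plus - rho_minus) / (2.0 * delta_spsa * Delta)
    theta = theta - a_k * grad            # gradient descent on the average cost
    v, rho = simulate_critic(theta, v, rho, 50, b_k)

print("estimated average cost:", rho)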


Please bookmark this technical report as http://aditya.csa.iisc.ernet.in/TR/2005/6/.
