Seminars


Algorithms for various cost criteria in Reinforcement Learning

Series: Ph.D. Thesis Defense

Speaker: Soumyajit Guin, Ph.D. (Engg.) student, Dept. of CSA

Date/Time: Mar 20 11:30:00

Location: CSA Auditorium (Room No. 104, Ground Floor)

Faculty Advisor: Prof. Shalabh Bhatnagar

Abstract:
In this thesis, we study Reinforcement Learning algorithms for various cost criteria or reward objectives, namely the Finite Horizon, Discounted Cost, and Risk-Sensitive Cost criteria. For the Finite Horizon and Risk-Sensitive Cost criteria we derive the policy gradient, and for the Discounted Cost criterion we propose a new algorithm called Critic-Actor. We prove the convergence of all the proposed algorithms and analyze their empirical performance through numerical experiments.

Infinite Horizon RL has been explored very well in terms of applications. A robot learning to walk can be considered a continuing task and can be modelled as an average reward or a discounted reward problem. Another example is games, which have a winner and a loser when the game ends and hence can be modelled as Episodic RL or as a Stochastic Shortest Path problem. Learning to play such games is often associated with real intelligence. Now consider a setting where the amount of time the agent has is fixed. Such settings arise in Network Flow Control, Healthcare, Smart Grids, etc. In this case, the best the agent can do is use time-dependent or non-stationary policies, which by nature are more powerful than stationary policies.
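As a simple illustration of why a fixed horizon calls for time-dependent policies (a minimal sketch, not part of the thesis, using a made-up 2-state, 2-action MDP), backward induction over the horizon yields a policy that changes with the time step:

    import numpy as np

    # Hypothetical finite-horizon MDP: 2 states, 2 actions, horizon T.
    # P[a][s, s'] = transition probability, C[s, a] = one-step cost.
    T = 5
    P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay in the current state
                  [[0.0, 1.0], [0.0, 1.0]]])  # action 1: move to (or stay in) state 1
    C = np.array([[1.0, 1.5],                 # costs in state 0 for actions 0, 1
                  [0.0, 5.0]])                # costs in state 1 for actions 0, 1

    # Backward induction (finite-horizon dynamic programming).
    V = np.zeros(2)                        # terminal cost taken to be zero
    policy = np.zeros((T, 2), dtype=int)   # policy[t, s] = optimal action at time t
    for t in reversed(range(T)):
        Q = C + P.transpose(1, 0, 2) @ V   # Q[s, a] = c(s, a) + E[V(next state)]
        policy[t] = Q.argmin(axis=1)
        V = Q.min(axis=1)

    print(policy)
    # In state 0 the optimal action is 1 (pay a bit more to reach the cheap state)
    # at every step except the last, where acting myopically (action 0) is optimal:
    # the optimal policy is genuinely time-dependent.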

In our first work, we consider this setup and additionally have constraints that the agent must satisfy. The motivation for such constraints is their application to safe RL, which has become popular recently; safe RL can be applied to the applications above. Our approach is based on the policy gradient, where the policy is approximated by some function, and we use the Lagrangian approach to handle the constraints. Our policy gradient can also be used in the unconstrained setup simply by setting the Lagrange parameter to 0.
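A minimal sketch of the Lagrangian relaxation idea, with hypothetical gradient estimators and illustrative step sizes (not the exact updates analyzed in the thesis):

    def constrained_policy_gradient(grad_cost, grad_constraint, constraint_value,
                                    bound, theta, lam=0.0, n_iters=10_000):
        """Minimize the cost objective subject to constraint_value(theta) <= bound,
        via the Lagrangian L(theta, lam) = J_cost(theta) + lam * (J_constraint(theta) - bound):
        gradient descent in theta (faster timescale), projected ascent in lam (slower)."""
        for n in range(1, n_iters + 1):
            a_n = n ** -0.6   # policy step size (faster timescale)
            b_n = n ** -1.0   # multiplier step size (slower timescale)
            theta = theta - a_n * (grad_cost(theta) + lam * grad_constraint(theta))
            lam = max(0.0, lam + b_n * (constraint_value(theta) - bound))
        return theta, lam

    # With lam fixed to 0 (and its update skipped), the same recursion reduces to
    # the unconstrained policy-gradient scheme mentioned above.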

Discounted cost is the most widely used cost criterion for RL problems, mainly due to its simplicity. Practitioners use the discounted setting even in episodic problems, as it makes learning the value function stable. A popular way to solve these problems is the class of Actor-Critic algorithms, which have an actor (or policy) that is updated on a slow timescale and a critic (or value function) that is updated on a fast timescale. In our second work, we consider the discounted cost problem and analyse its behaviour when the timescales of the critic and the actor are reversed. We observe that updating the critic on a slow timescale and the actor on a faster timescale is a legitimate scheme, which we call the Critic-Actor algorithm.
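The difference between the two schemes lies purely in the relative speeds of the two updates; a sketch with illustrative (assumed, not the thesis's) step-size choices:

    # Illustrative step-size schedules; only the relative decay rates matter.
    def actor_critic_steps(n):
        critic_step = n ** -0.6   # critic on the faster timescale
        actor_step = n ** -1.0    # actor on the slower timescale
        return critic_step, actor_step

    def critic_actor_steps(n):
        critic_step = n ** -1.0   # critic now on the slower timescale
        actor_step = n ** -0.6    # actor now on the faster timescale
        return critic_step, actor_step
    # In both cases the ratio (slower step / faster step) tends to zero, which is
    # the separation that a two-timescale convergence analysis relies on.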

Actor-Critic algorithms and their convergence are well studied in both the tabular and function-approximation settings. However, practitioners often do not follow any particular rule when choosing the algorithm's step sizes: they pick whatever step sizes work best for their problem and still get good results. This raises the question: do timescales really matter in actor-critic algorithms? We theoretically prove that if we update the policy on a fast timescale and the value function on a slow timescale, the scheme emulates the value iteration algorithm and converges to the same optimal point as the actor-critic algorithm. We show that this holds in the tabular case.
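A simplified, deterministic sketch of the tabular intuition (hypothetical code, not the stochastic-approximation scheme actually analyzed): when the actor is the fast component, by the time the critic moves, the actor is essentially greedy with respect to the current critic, so the slow critic recursion tracks value iteration.

    import numpy as np

    def tabular_critic_actor_intuition(C, P, gamma, n_iters=200):
        """C: (S, A) one-step costs, P: (A, S, S) transition matrices.
        The slow critic update against a fast (near-)greedy actor reduces to
        V <- min_a [ c(s, a) + gamma * sum_s' P(s' | s, a) * V(s') ]."""
        S = C.shape[0]
        V = np.zeros(S)
        for _ in range(n_iters):
            Q = C + gamma * (P.transpose(1, 0, 2) @ V)  # Q[s, a]
            greedy = Q.argmin(axis=1)                   # fast actor: greedy w.r.t. current critic
            V = Q[np.arange(S), greedy]                 # slow critic evaluates the greedy actor
        return V, greedy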

Traditionally, RL has been used to optimize the expected sum of costs, with various cost criteria such as discounted cost or average cost chosen to suit the problem. However, these cost criteria can give policies with high variance on standard RL testbeds. Variance measures have been derived for these cost criteria in the literature, and corresponding optimization algorithms have been proposed; optimizing for variance can considerably reduce the variance of the resulting policies. But variance alone cannot capture the risk involved in the agent's decisions: the more risk-averse one wants to be, the higher the moments one has to take into account. The risk-sensitive cost criterion gives the user the flexibility to make risk-averse, risk-neutral, or risk-taking decisions by tuning the risk-sensitivity parameter. In our third work, we develop a policy gradient algorithm with function approximation for the risk-sensitive cost criterion.
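For concreteness, the exponential risk-sensitive criterion referred to here is commonly written as (1/beta) * log E[exp(beta * cumulative cost)]; a small Monte-Carlo sketch (with made-up episode costs, not data from the thesis) shows how the risk-sensitivity parameter beta changes the ranking of policies:

    import numpy as np

    def risk_sensitive_cost(episode_costs, beta):
        """Exponential risk-sensitive criterion estimated from sampled episode
        costs: (1/beta) * log E[exp(beta * cost)]. beta > 0 is risk-averse,
        beta -> 0 recovers the usual expected cost, beta < 0 is risk-taking."""
        x = np.asarray(episode_costs, dtype=float)
        if abs(beta) < 1e-12:
            return x.mean()
        y = beta * x
        m = y.max()
        return (m + np.log(np.exp(y - m).mean())) / beta   # stable log-mean-exp

    # Two made-up policies with the same mean episode cost but different spread.
    rng = np.random.default_rng(0)
    low_var = rng.normal(10.0, 0.5, size=10_000)
    high_var = rng.normal(10.0, 5.0, size=10_000)
    for beta in (0.5, 0.0, -0.5):
        print(beta, risk_sensitive_cost(low_var, beta), risk_sensitive_cost(high_var, beta))
    # beta > 0 penalizes the high-variance policy, beta < 0 favours it, and
    # beta = 0 cannot tell them apart.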


Microsoft Teams link:
Link