Stochastic Approximation in Optimization, Estimation and Reinforcement Learning

Series: Ph.D. Colloquium

Speaker: Chandramouli Kamanchi

Date/Time: Feb 24 16:00:00

Location: CSA Seminar Hall (Room No. 254, First Floor)

Faculty Advisor: Prof. Shalabh Bhatnagar

Stochastic approximation deals with the problem of finding zeros of a function that is expressed as an expectation of a random variable. In this thesis, we study and analyze the convergence of stochastic approximation algorithms in the context of optimization under uncertainty, statistical estimation, and learning, in particular reinforcement learning. We also explore second order methods in the context of reinforcement learning algorithms.

Stochastic optimization (SO) considers the problem of optimizing an objective function in the presence of noise. Most solution techniques in SO estimate gradients from noise-corrupted observations of the objective and adjust the parameters of the objective along the direction of the estimated gradients to obtain locally optimal solutions. Two prominent algorithms in SO, namely Random Direction Kiefer-Wolfowitz (RDKW) and Simultaneous Perturbation Stochastic Approximation (SPSA), obtain a noisy gradient estimate by randomly perturbing all the parameters simultaneously. This forces the search direction to be random and causes these algorithms to suffer additional noise on top of the noise incurred from the samples of the objective. Owing to this additional noise, the idea of using deterministic perturbations instead of random perturbations for gradient estimation has also been studied. Two specific constructions of the deterministic perturbation sequence, using lexicographical ordering and Hadamard matrices, have been explored, and encouraging results have been reported in the literature. In this thesis, we characterise the class of deterministic perturbation sequences that can be utilised in the RDKW algorithm. This class expands the set of known deterministic perturbation sequences available in the literature. Using our characterization, we propose a construction of a deterministic perturbation sequence that has the least possible cycle length among all deterministic perturbations.
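
As a rough illustration of the random-perturbation gradient estimate described above, the following minimal Python sketch perturbs all parameters at once with a random +/-1 vector and forms a two-sided difference. It is not the construction from the thesis: the objective `noisy_objective`, the helper names, the step-size exponents (0.602 and 0.101, values commonly used in SPSA practice) and all constants are assumptions chosen for illustration only.

```python
import random


def noisy_objective(x, rng, target=(1.0, -2.0, 0.5), noise_std=0.01):
    """Hypothetical objective: a quadratic bowl observed with Gaussian noise."""
    clean = sum((xi - ti) ** 2 for xi, ti in zip(x, target))
    return clean + rng.gauss(0.0, noise_std)


def perturbation_gradient(f, x, c, rng):
    """Two-sided gradient estimate from one random +/-1 perturbation.

    All coordinates are perturbed simultaneously, so only two noisy
    evaluations of f are needed regardless of the dimension of x.
    """
    d = [rng.choice((-1.0, 1.0)) for _ in x]
    x_plus = [xi + c * di for xi, di in zip(x, d)]
    x_minus = [xi - c * di for xi, di in zip(x, d)]
    slope = (f(x_plus, rng) - f(x_minus, rng)) / (2.0 * c)
    # For +/-1 entries, multiplying by d_i equals dividing by d_i.
    return [slope * di for di in d]


def minimize(f, x0, iterations=2000, a=0.1, c=0.1, seed=0):
    """Stochastic approximation loop with decaying step and perturbation sizes."""
    rng = random.Random(seed)
    x = list(x0)
    for k in range(iterations):
        step = a / (k + 1) ** 0.602    # assumed step-size schedule
        width = c / (k + 1) ** 0.101   # assumed perturbation-size schedule
        grad = perturbation_gradient(f, x, width, rng)
        x = [xi - step * gi for xi, gi in zip(x, grad)]
    return x
```

Calling `minimize(noisy_objective, [0.0, 0.0, 0.0])` should drive the iterate close to the assumed minimizer `(1.0, -2.0, 0.5)`. Replacing the random draw of `d` with a fixed cyclic sequence is where a deterministic perturbation scheme would plug in.
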
Through simulations, we illustrate the performance gain of the proposed deterministic perturbation sequence in the RDKW algorithm over its Hadamard and random perturbation counterparts. We establish the convergence of the RDKW algorithm for the generalized class of deterministic perturbations using stochastic approximation techniques.

In statistical estimation, the mode is a popular measure of central tendency that can provide a better representation of, and more interesting insights into, the data than other measures such as the mean and the median. If the analytical form of the density function is known, the mode is an argument of the maximum of the density function, and one can apply optimization techniques to find it. In many practical applications, however, the analytical form of the density is not known and only samples from the distribution are available. Most techniques proposed in the literature for estimating the mode from samples assume that all the samples are available beforehand. Moreover, some of these techniques employ computationally expensive operations such as sorting. In this thesis, we provide a computationally efficient, online iterative algorithm that estimates the mode of a unimodal smooth density given only samples generated from the density. We establish the asymptotic convergence of the proposed algorithm using stochastic approximation techniques. We also prove the stability of the mode estimates by utilizing the concept of regularisation. Experimental results further demonstrate the effectiveness of the proposed algorithm.

In a discounted reward Markov Decision Process (MDP), the objective is to find the optimal value function, i.e., the value function corresponding to an optimal policy. This problem reduces to solving a functional equation known as the Bellman equation, and a fixed point iteration scheme known as value iteration is used to obtain the solution.
In the literature, a successive over-relaxation based value iteration scheme has been proposed to speed up the computation of the optimal value function. The speed-up is achieved by constructing a modified Bellman equation that ensures faster convergence to the optimal value function. However, in many practical applications, the model information is not known and we resort to Reinforcement Learning (RL) algorithms to obtain an optimal policy and value function. One such popular algorithm is Q-learning. In this work, we propose Successive Over-Relaxation (SOR) Q-learning. We first derive a modified fixed point iteration for SOR Q-values and use stochastic approximation to derive a learning algorithm that computes the optimal value function and an optimal policy. We then prove the almost sure convergence of SOR Q-learning to the SOR Q-values. Finally, through numerical experiments, we show that SOR Q-learning is faster than the standard Q-learning algorithm.

Value iteration is a fixed point iteration technique used to obtain the optimal value function and policy in a discounted reward MDP. Here, a contraction operator is constructed and applied repeatedly to arrive at the optimal solution. Value iteration is a first order method and may therefore take a large number of iterations to converge to the optimal solution. Successive relaxation is a popular technique that can be applied to solve a fixed point equation. It has been shown in the literature that, under a special structure of the MDP, the successive over-relaxation technique computes the optimal value function faster than standard value iteration. In this work, we propose a second order value iteration procedure that is obtained by applying the Newton-Raphson method to the successive relaxation value iteration scheme. We prove that our algorithm converges globally and asymptotically to the optimal solution, and show that the convergence is of second order.
Through experiments, we demonstrate the effectiveness of our proposed approach.
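
To make the comparison between standard and successive over-relaxation (SOR) value iteration concrete, here is a minimal Python sketch on a toy MDP. It is an illustration under stated assumptions, not the thesis's algorithm: the relaxed update `max_a [w * (r + gamma * P V) + (1 - w) * V(s)]` and the bound on `w` are one common form from the SOR value iteration literature, and the function names and the two-state MDP are hypothetical.

```python
def bellman_backup(P, r, gamma, V, s, a):
    """One-step lookahead: r(s, a) + gamma * sum_j P(j | s, a) * V(j)."""
    return r[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))


def value_iteration(P, r, gamma, w=1.0, tol=1e-10, max_iter=100000):
    """Relaxed value iteration; w = 1.0 recovers standard value iteration.

    Returns the value estimate and the number of sweeps performed.
    """
    n = len(P)
    V = [0.0] * n
    for k in range(1, max_iter + 1):
        V_new = [
            max(w * bellman_backup(P, r, gamma, V, s, a) + (1.0 - w) * V[s]
                for a in range(len(P[s])))
            for s in range(n)
        ]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new, k
        V = V_new
    return V, max_iter


def max_relaxation(P, gamma):
    """Assumed upper bound on w: min over (s, a) of 1 / (1 - gamma * P(s | s, a)).

    It exceeds 1 only when every state-action pair has a self-loop,
    which is the kind of special MDP structure SOR relies on.
    """
    return min(1.0 / (1.0 - gamma * P[s][a][s])
               for s in range(len(P)) for a in range(len(P[s])))


# Hypothetical two-state, two-action MDP with self-loops everywhere.
# P[s][a][j] is the probability of moving from s to j under action a.
TOY_P = [[[0.9, 0.1], [0.5, 0.5]],
         [[0.2, 0.8], [0.4, 0.6]]]
TOY_R = [[1.0, 0.0],
         [0.0, 2.0]]
TOY_GAMMA = 0.9
```

Running `value_iteration(TOY_P, TOY_R, TOY_GAMMA)` and `value_iteration(TOY_P, TOY_R, TOY_GAMMA, w=max_relaxation(TOY_P, TOY_GAMMA))` should give the same value function, with the relaxed version needing fewer sweeps on this example. With `w > 1` the contraction factor becomes `1 - w * (1 - gamma)`, which is smaller than `gamma`.
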

Speaker Bio:
Chandramouli Kamanchi is a senior Ph.D. student in the Department of Computer Science and Automation, IISc.

Host Faculty: