Seminars


Towards Practical Reinforcement Learning: Monotonic Improvement, Quasi-Hyperbolic Discounting, and Sample Efficient Learning

Series: Ph.D. Colloquium

Speaker: Eshwar S R, Ph.D. (Engg.) student, Dept. of CSA

Date/Time: Apr 24 10:30:00

Location: CSA Auditorium (Room No. 104, Ground Floor)

Faculty Advisor: Prof. Gugan C. M. Thoppe

Abstract:
In this thesis, we investigate fundamental challenges in reinforcement learning (RL) that arise when moving beyond idealized settings toward realistic, large-scale, and human-centric decision-making problems. Classical RL theory relies on assumptions such as tabular representations, exponential discounting, and abundant data collection, which are often violated in practice. This thesis addresses these limitations through a unified perspective spanning four key directions: reliable learning under function approximation, modeling human-centric time preferences, improving sample efficiency in derivative-free RL, and learning in resource allocation systems.

In the first part of the thesis, we study the loss of reliability in policy iteration under function approximation. While classical policy iteration guarantees monotonic improvement and convergence in tabular settings, these guarantees break down when value functions are approximated. We identify that minimizing value approximation error alone is insufficient to ensure policy improvement. To address this, we propose Reliable Policy Iteration (RPI), which replaces Bellman equation-based evaluation with a one-sided Bellman inequality formulation. This ensures that value estimates remain lower bounds on true values and increase monotonically across iterations, restoring performance guarantees. We further introduce Conservative Reliable Policy Iteration (CRPI), which provides explicit per-iteration improvement guarantees and convergence rate bounds.
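The lower-bound mechanism can be sketched in standard MDP notation (an illustrative formulation only; RPI's actual constraints under function approximation may differ):

```latex
% One-sided Bellman inequality for a policy \pi (illustrative):
V(s) \;\le\; r(s,\pi(s)) \;+\; \gamma \sum_{s'} P(s' \mid s,\pi(s))\, V(s')
\quad \text{for all } s,
% i.e. V \le T_\pi V. Since the Bellman operator T_\pi is monotone,
% iterating gives V \le T_\pi V \le T_\pi^2 V \le \cdots \to V^{\pi},
% so any V satisfying the inequality is a certified lower bound on the
% true value V^{\pi}.
```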

In the second part, we study reinforcement learning under quasi-hyperbolic discounting, a model motivated by human decision-making behavior. Unlike standard models where preferences over future rewards remain consistent over time, humans often change their preferences; for example, preferring a smaller-sooner reward today but a larger-later reward when both are in the distant future. This mismatch between present and future preferences breaks a key assumption underlying classical RL methods, which rely on decisions being evaluated consistently over time. We develop the first provably convergent algorithms for this setting by considering two distinct solution concepts: self-consistent policies modeled as Markov Perfect Equilibria, and precommitment-based policies with a structured non-stationary form.
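The quasi-hyperbolic model referenced here is commonly written in the textbook beta-gamma form (the thesis's exact notation may differ):

```latex
% Quasi-hyperbolic (beta-gamma) discounted return from time t:
U_t \;=\; r_t \;+\; \beta \sum_{k=1}^{\infty} \gamma^{k}\, r_{t+k},
\qquad \beta \in (0,1],\ \gamma \in (0,1).
% beta = 1 recovers standard exponential discounting. For beta < 1,
% the weight ratio between today and tomorrow is beta*gamma, but only
% gamma between any two later consecutive periods -- the source of the
% present bias and time inconsistency described above.
```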

In the third part, we address the sample inefficiency of evolutionary reinforcement learning methods. Evolution Strategies (ES) methods such as Augmented Random Search (ARS) and Trust Region Evolutionary Strategies (TRES) have shown strong performance in practice. ES optimizes policies by generating many candidate solutions, evaluating each through interactions with the environment, and then updating the policy using only the top-performing candidates. This requires a large amount of interaction data, and because data from the non-top-performing candidates is discarded, a significant portion of what is collected is wasted. To address this, we propose a novel off-policy ranking framework that reuses data collected from a single policy to compare multiple candidate policies, a task known as off-policy evaluation. Most existing off-policy evaluation methods are designed for stochastic policies, whereas ES methods such as ARS and TRES use deterministic policies, making these approaches inapplicable. To overcome this limitation, we introduce kernel smoothing to make off-policy evaluation feasible for deterministic policies. Our approach significantly reduces the number of required environment interactions while preserving the simplicity of ES methods. Empirical results on MuJoCo continuous control tasks demonstrate substantial gains in sample efficiency.
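The generate-evaluate-keep-the-best loop described above can be sketched as a single ARS-style update. All names, constants, and the toy quadratic "return" below are our own illustrative assumptions, not the speaker's implementation; a real ES method would evaluate each candidate by rolling it out in the environment.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])  # optimum of the toy objective

def rollout_return(theta):
    """Stand-in for an environment rollout; higher is better."""
    return -float(np.sum((theta - target) ** 2))

def es_step(theta, n_dirs=8, top_b=4, sigma=0.05, alpha=0.02):
    # Sample random perturbation directions and evaluate both signs.
    deltas = rng.standard_normal((n_dirs, theta.size))
    r_plus = np.array([rollout_return(theta + sigma * d) for d in deltas])
    r_minus = np.array([rollout_return(theta - sigma * d) for d in deltas])
    # Keep only the top-b directions; data from the remaining rollouts is
    # discarded -- the waste the off-policy ranking framework targets.
    top = np.argsort(np.maximum(r_plus, r_minus))[-top_b:]
    sigma_r = np.concatenate([r_plus[top], r_minus[top]]).std() + 1e-8
    step = ((r_plus[top, None] - r_minus[top, None]) * deltas[top]).mean(axis=0)
    return theta + (alpha / sigma_r) * step

theta = np.zeros(3)
for _ in range(500):
    theta = es_step(theta)
print(rollout_return(theta))  # improves on the starting return
```

Note that every call to `rollout_return` counts as environment interaction, and only `2 * top_b` of the `2 * n_dirs` evaluations per step influence the update; reusing the discarded rollouts for ranking is precisely the efficiency gain the talk describes.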

In the final part, we tackle an important problem arising in queueing systems and cloud applications, namely auto-scaling and load balancing. We formulate this problem as a weakly coupled Markov decision process. The resulting model captures interactions across multiple subsystems through shared resource constraints. The exact problem is computationally intractable due to the large state space. To address this, we introduce a relaxation based on occupancy measures, which leads to a tractable formulation. Building on this, we develop a model-free learning algorithm using a stochastic approximation-based primal-dual approach. Our method enables data-driven control without requiring knowledge of system dynamics.
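For a single discounted MDP, the occupancy-measure relaxation takes the standard linear-programming form below; the weakly coupled version gives each subsystem its own measure and ties them together with shared-resource constraints (a textbook sketch, so the exact constraints used in the thesis may differ):

```latex
% Occupancy-measure LP for a discounted MDP with initial distribution \rho:
\max_{\mu \ge 0} \;\; \sum_{s,a} \mu(s,a)\, r(s,a)
\quad \text{s.t.} \quad
\sum_{a} \mu(s',a) \;=\; (1-\gamma)\,\rho(s')
  \;+\; \gamma \sum_{s,a} P(s' \mid s,a)\, \mu(s,a) \;\; \forall s'.
% In the weakly coupled setting, each subsystem n has its own measure
% \mu_n, and the shared resources appear as linear coupling constraints,
% e.g. \sum_n \sum_{s,a} \mu_n(s,a)\, c_n(s,a) \le B, whose Lagrange
% multipliers drive a primal-dual update.
```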