Seminars


Sequential Decision Making with Risk, Offline Data and External Influence: Bandits and Reinforcement Learning

Series: Ph.D. Thesis Defense

Speaker: Ranga Shaarad Ayyagari, Ph.D. (Engg.) student, Dept. of CSA

Date/Time: Sep 12, 15:00

Location: CSA Auditorium (Room No. 104, Ground Floor)

Faculty Advisor: Prof. Ambedkar Dukkipati

Abstract:
Reinforcement Learning (RL) serves as a foundational framework for addressing sequential decision-making problems under uncertainty. In recent years, extensive research in this domain has led to significant advancements across various fields, including healthcare, recommendation systems, networks, robotics, finance, and navigation. To develop learning algorithms for these problems, RL builds upon the mathematical frameworks of Multi-Armed Bandits (MAB) and Markov Decision Processes (MDP). Unlike supervised and unsupervised learning, sequential decision-making problems involve optimizing an objective over a time horizon and often require explicit or implicit planning to make optimal decisions at each step. Reinforcement Learning is thus a core component for building Artificial Intelligence (AI) agents that seamlessly take in information from the outside world, make decisions, and perform actions to accomplish a given task.

A major portion of the literature on reinforcement learning focuses on finding optimal algorithms to maximize the expected return achieved by an agent that can continuously interact with an environment to gain relevant information. However, this scenario is a very special case, and real-world problems involve more complex objectives, as well as additional conditions and constraints imposed on the agent. The agent may be forced to operate in environments with limited data or changing dynamics, or it may be expected to follow a policy that is risk-averse or capable of planning over complex tasks. In this thesis, we study sequential decision-making problems involving (i) risk, (ii) lack of online access, and (iii) the influence of an exogenous temporal process, and provide theoretical and practical solutions. While the problem of risk is studied in the framework of bandits, the remaining problems are studied in the framework of MDPs.

First, we consider the problem of minimizing risk in a combinatorial semi-bandit, in which an agent must choose, at each time step, a subset of the possible actions that satisfies given constraints. We take the optimization objective to be CVaR (Conditional Value-at-Risk), a risk measure that captures the worst-case returns obtained by the agent, parameterized by a risk level $\alpha$. We propose algorithms to tackle this problem for the cases of (i) Gaussian rewards and (ii) bounded rewards. We theoretically analyze these algorithms, discuss some practical considerations, and corroborate our findings with numerical experiments in a simulated environment.
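
For reference, a standard way to write CVaR at risk level $\alpha \in (0, 1]$ for a reward $X$ is the Rockafellar-Uryasev variational form (the thesis may use an equivalent but differently stated convention):

$$\mathrm{CVaR}_\alpha(X) \;=\; \sup_{\tau \in \mathbb{R}} \Big\{ \tau - \tfrac{1}{\alpha}\, \mathbb{E}\big[(\tau - X)^{+}\big] \Big\},$$

which, for continuous $X$, equals $\mathbb{E}[X \mid X \le v_\alpha(X)]$ with $v_\alpha(X)$ the $\alpha$-quantile of $X$; smaller $\alpha$ places more weight on the worst-case tail of the return distribution.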

Next, we consider the problem of learning optimal RL policies when there is little or no online access to the environment. First, we consider the problem of learning a hierarchy of policies solely from an offline dataset of experiences without any access to the environment. Many sequential tasks require a degree of temporal abstraction and planning to achieve a long-term goal. To incorporate these elements, a hierarchy of policies is learned that operates at different time scales. While most works train these policies by exploring the environment, we attempt to learn such a hierarchy from a given offline dataset collected by an unknown behavior agent. We propose a model-based algorithm that uses a Conditional Variational Auto-Encoder (CVAE) along with an ensemble-based uncertainty metric to apply existing online RL algorithms on offline data. On various continuous control and robotic manipulation tasks, we show that this method reliably preserves the benefits of a hierarchical agent even without online exploration.
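
As a rough illustration of the ensemble-based uncertainty idea (a minimal sketch under assumed architecture and hyperparameters, not the thesis implementation; it omits the CVAE and the hierarchical policies entirely), one can penalize model-generated rewards by how much an ensemble of learned dynamics models disagrees:

# Illustrative sketch only: an ensemble of dynamics models whose disagreement
# is used as an uncertainty penalty on model-generated transitions.
# Layer sizes, ensemble_size, and penalty_coef are assumed placeholders.
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts the next state and reward from (state, action)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim + 1),  # next state + reward
        )

    def forward(self, state, action):
        out = self.net(torch.cat([state, action], dim=-1))
        return out[..., :-1], out[..., -1]     # next_state, reward

class EnsembleDynamics:
    def __init__(self, state_dim, action_dim, ensemble_size=5):
        self.members = [DynamicsModel(state_dim, action_dim)
                        for _ in range(ensemble_size)]

    def predict(self, state, action, penalty_coef=1.0):
        """Mean ensemble prediction with a disagreement-based reward penalty."""
        preds = [m(state, action) for m in self.members]
        next_states = torch.stack([p[0] for p in preds])    # (E, B, state_dim)
        rewards = torch.stack([p[1] for p in preds])         # (E, B)
        # Uncertainty: how much ensemble members disagree on the next state.
        uncertainty = next_states.std(dim=0).norm(dim=-1)    # (B,)
        penalized_reward = rewards.mean(dim=0) - penalty_coef * uncertainty
        return next_states.mean(dim=0), penalized_reward

Transitions generated in regions where the ensemble members disagree receive penalized rewards, which discourages the learned policies from exploiting parts of the state space that the offline dataset does not cover.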

In the offline setting, we also address the problem of active learning in RL, where the agent is given an offline dataset along with limited access to the environment. We tackle the problem of optimally augmenting the given offline dataset with minimal extra exploration. The performance of a policy learned from an offline dataset is limited by the quality of the data it contains. In many circumstances, an agent has the opportunity to flexibly collect additional trajectories in the environment, although the number of samples that can be collected is limited by cost and similar practical constraints. To this end, we propose an active reinforcement learning framework that uses a representation-based uncertainty model to optimally augment the existing dataset, minimizing overlap with the available data and maximizing the information gained from new samples. We demonstrate the proposed method on various continuous control environments, including Gym-MuJoCo locomotion tasks as well as Maze2d, AntMaze, CARLA, and IsaacSimGo1.
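
To illustrate what a representation-based uncertainty score for data collection can look like (a minimal sketch under assumptions; the k-nearest-neighbour score, the random stand-in representations, and all names here are placeholders, not the thesis method), one can rank candidate start states by how far their representations lie from the existing dataset:

# Illustrative sketch only: score candidate states for extra data collection
# by their distance to the offline dataset in a learned representation space,
# so new trajectories overlap as little as possible with existing data.
import numpy as np

def knn_uncertainty(candidate_repr, dataset_repr, k=10):
    """Mean distance to the k nearest dataset points in representation space."""
    # candidate_repr: (C, d), dataset_repr: (N, d)
    dists = np.linalg.norm(
        candidate_repr[:, None, :] - dataset_repr[None, :, :], axis=-1)  # (C, N)
    knn = np.sort(dists, axis=1)[:, :k]
    return knn.mean(axis=1)          # high value = poorly covered region

def select_collection_states(candidate_repr, dataset_repr, budget):
    """Pick the `budget` candidates that the dataset covers worst."""
    scores = knn_uncertainty(candidate_repr, dataset_repr)
    return np.argsort(scores)[::-1][:budget]

# Toy usage with random vectors standing in for a learned encoder's outputs.
rng = np.random.default_rng(0)
dataset_repr = rng.normal(size=(1000, 16))
candidate_repr = rng.normal(size=(64, 16))
chosen = select_collection_states(candidate_repr, dataset_repr, budget=8)

States with high scores lie in regions the offline data covers poorly, so spending the limited collection budget on them reduces overlap with what is already available.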

Finally, we consider the problem of dealing with an environment that changes due to exogenous events. We introduce a notion of non-stationarity wherein the dynamics of a Markov Decision Process are affected by an exogenous temporal event process. We show that under suitable conditions on the decay of this external process's influence, the sequential decision problem can be solved within $\epsilon$-suboptimality by a tractable policy. More specifically, when the perturbations caused by events older than $t$ time steps on the MDP transition dynamics and on the event process itself are bounded in total variation by $M_t$ and $N_t$ respectively, where $\sum_t M_t$ and $\sum_t N_t$ are convergent series, we show that the problem admits tractable sliding-window solutions whose approximation error $\epsilon$ depends on $M_t$ and $N_t$. We propose a policy iteration algorithm and analyze its behavior as a function of the properties of the external events. Further, we analyze the sample complexity of an extension of the Least-Squares Policy Evaluation algorithm applied to this setting and demonstrate the results experimentally.
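
In schematic notation (the symbol $e_{1:T}$ for the event history and the exact form of the bound are illustrative here; the precise definitions and constants are those of the thesis), the assumption says that truncating events older than $t$ steps perturbs the transition kernel by at most $M_t$ in total variation, and the event process itself by at most $N_t$:

$$\big\| P(\cdot \mid s, a, e_{1:T}) - P(\cdot \mid s, a, e_{T-t+1:T}) \big\|_{\mathrm{TV}} \;\le\; M_t, \qquad \sum_{t} M_t < \infty, \quad \sum_{t} N_t < \infty.$$

A sliding-window policy that conditions only on the most recent $w$ events then incurs an approximation error $\epsilon$ controlled by the tail sums $\sum_{t > w} M_t$ and $\sum_{t > w} N_t$.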

