Seminars


Exploring Fairness and Causality in Online Decision-Making

Series: Ph.D. Colloquium

Speaker: Vishakha Patil, Ph.D. (Engg.) student, Dept. of CSA

Date/Time: Nov 17, 16:00

Location: CSA Seminar Hall (Room No. 254, First Floor)

Faculty Advisors: Prof. Arindam Khan and Prof. Y. Narahari

Abstract:
Online decision-making under uncertainty is a fundamental aspect of numerous real-world problems across domains such as online resource allocation, crowd-sourcing, and online advertising. Multi-Armed Bandits (MABs) and Markov Decision Processes (MDPs) are two popular frameworks for modeling decision-making under uncertainty. However, the applications these frameworks capture often require considerations and adaptations beyond the standard formulations to effectively address real-world challenges. In this thesis, our primary emphasis is on two such factors: the incorporation of fairness considerations into the model, and the exploration of causal relations among the variables of the model.

The thesis comprises three contributions. We commence with an exploration of fairness in temporally extended decision-making scenarios, specifically those modeled as MDPs. Our novel fairness notion aims to guarantee that each state's long-term visitation frequency surpasses a predefined fraction, a natural extension of quota-based fairness from the MAB literature. We propose an algorithm with a dual guarantee: it simultaneously satisfies fairness and maximizes the total reward. Subsequently, we shift our focus to a variant of the MAB model that accounts for the dynamic nature of the environment. In this model, an arm's reward increases with each pull; it is a versatile abstraction for real-world scenarios, particularly in the education and employment domains, where the allocation of opportunities affects the future capabilities of communities. We present an algorithm that maximizes the total reward while ensuring that the arms, which may correspond to communities, attain their fullest potential. Lastly, we study the problem of learning good interventions in causal graphs by modeling it as an MAB problem. This problem, called the Causal Multi-Armed Bandit (Causal MAB) problem, captures dependencies between arms through a causal graph. We study the problem of identifying the best intervention in the Causal MAB and provide algorithms for three variants of the problem.

Achieving Long-Term Resource Allocation Fairness in MDPs:
We study fairness in temporally extended decision-making settings, specifically those formulated as MDPs. Our proposed fairness notion aims to ensure that the long-term visitation frequency of each state exceeds a predetermined fraction. This notion of fairness is an intuitive extension of the quota-based fairness concept popular in the MAB literature. It finds natural applicability in various resource allocation settings where the allocation dynamics of a shared resource are governed by an MDP, and the distribution of the shared resource is characterized by its state-visitation frequencies. For average-reward Markov Decision Processes (AMDPs), we formulate the problem as a bilinear saddle point program and, assuming access to a generative model, we employ a Stochastic Mirror Descent (SMD) based algorithm to solve it. Our proposed solution ensures a simultaneous approximation of the expected average reward and the fairness requirement.
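For intuition, the fairness requirement can be written over the long-run state-action occupancy measure of the policy (the notation below is ours and the exact program in the thesis may differ). With r the reward function, \mu the occupancy measure, and \tau_s the predefined fraction for state s, the requirement is

    \sum_{a} \mu(s, a) \ge \tau_s \quad \text{for every state } s,

and attaching multipliers \lambda_s \ge 0 to these constraints yields a Lagrangian that is bilinear in (\mu, \lambda):

    \max_{\mu \in \mathcal{M}} \; \min_{\lambda \ge 0} \; \sum_{s,a} \mu(s,a)\, r(s,a) \;+\; \sum_{s} \lambda_s \Big( \sum_{a} \mu(s,a) - \tau_s \Big),

where \mathcal{M} denotes the set of valid stationary occupancy measures of the AMDP. Saddle points of programs of this general form are what a primal-dual SMD-style method can approximate from samples drawn via a generative model.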

Balancing between Reward and Disparity in Improving Bandits:
We study the Improving Multi-Armed Bandit (IMAB) problem, where the reward obtained from an arm increases with the number of pulls it receives. This model serves as an elegant abstraction for numerous real-world use cases found in domains like education and employment. In these domains, choices concerning the allocation of opportunities can significantly impact the future capabilities of communities and the disparities among them. Decision-makers in such settings must not only aim to maximize their immediate cumulative rewards but also consider the consequences of their decisions on future rewards. Our focus lies in understanding the delicate balance between two seemingly conflicting objectives in the horizon-unaware setting: (a) maximizing the cumulative reward at any given time based on the current arm rewards, and (b) ensuring that arms with higher long-term rewards receive adequate opportunities, even if they start with low initial rewards. Surprisingly, we discover that these two objectives align with each other in this context. Our main contribution is an anytime algorithm for the IMAB problem that achieves the best possible cumulative reward while ensuring that the arms reach their true potential given sufficient time. Let T denote the time horizon and let k be the number of arms. Then, we prove the optimality of our algorithm by showing that (a) any algorithm for the IMAB problem, no matter how utilitarian, must suffer Ω(T) policy regret and Ω(k) competitive ratio with respect to the optimal offline policy, and (b) the competitive ratio of our algorithm is O(k).
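To make the model concrete, below is a minimal, hypothetical Python sketch of an IMAB instance; the class name ImprovingBandit, the reward curves, and the greedy baseline are our illustrative choices, not the algorithm analysed in the thesis. It shows why myopic behaviour is problematic: a purely greedy policy keeps pulling the arm that currently looks best and never lets a slow-starting arm with a higher long-term reward reach its potential.

    # Illustrative sketch of the IMAB model: each arm's reward is a
    # non-decreasing function of how many times it has been pulled.
    # We assume (for simplicity) that the current reward of every arm
    # is observable to the decision-maker at each step.

    class ImprovingBandit:
        def __init__(self, reward_curves):
            # reward_curves[i](n) = reward of arm i on its (n + 1)-th pull
            self.reward_curves = reward_curves
            self.pulls = [0] * len(reward_curves)

        def pull(self, i):
            r = self.reward_curves[i](self.pulls[i])
            self.pulls[i] += 1
            return r

    def greedy(bandit, horizon):
        """Myopic baseline: always pull the arm whose current reward is highest."""
        total = 0.0
        current = [curve(0) for curve in bandit.reward_curves]
        for _ in range(horizon):
            i = max(range(len(current)), key=lambda j: current[j])
            total += bandit.pull(i)
            current[i] = bandit.reward_curves[i](bandit.pulls[i])
        return total

    if __name__ == "__main__":
        # Arm 0 starts high but plateaus; arm 1 starts at zero but improves steeply.
        curves = [lambda n: min(0.5, 0.4 + 0.001 * n),
                  lambda n: min(1.0, 0.01 * n)]
        env = ImprovingBandit(curves)
        print("greedy total reward over T = 1000:", greedy(env, 1000))

In this toy instance the greedy baseline never touches arm 1, whereas committing to arm 1 after a short warm-up roughly doubles the total reward over T = 1000 rounds; an anytime IMAB algorithm has to achieve such gains without knowing T in advance.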

Learning Good Interventions via Causal Bandits:
Learning good interventions in a causal graph can be modeled as a stochastic multi-armed bandit problem with side-information. First, we study this problem when interventions are more expensive than observations and a budget is specified. If there are no backdoor paths from an intervenable node to the reward node, we propose an algorithm that minimizes simple regret by optimally trading off observations and interventions based on the cost of intervention. We also propose an algorithm that accounts for the cost of interventions, utilizes causal side-information, and minimizes the expected cumulative regret without exceeding the budget. Our cumulative-regret minimization algorithm performs better than standard algorithms that do not take side-information into account. Finally, we study the problem of learning the best intervention without a budget constraint in general graphs and give an algorithm that achieves constant expected cumulative regret in terms of the instance parameters when the parent distribution of the reward variable for each intervention is known.
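To fix notation (ours, possibly simplified relative to the talk): an arm a corresponds to an intervention do(X = x) on a subset of nodes of the causal graph with reward node Y, and its value is

    \mu_a = \mathbb{E}\big[\, Y \mid \mathrm{do}(X = x) \,\big].

Writing a^* = \arg\max_a \mu_a, the simple regret of a recommended intervention \hat{a} is \mu_{a^*} - \mathbb{E}[\mu_{\hat{a}}], and the expected cumulative regret over T rounds is \sum_{t=1}^{T} \big( \mu_{a^*} - \mathbb{E}[\mu_{a_t}] \big). Roughly speaking, cheap observations can stand in for costly interventions because, when no backdoor path runs from an intervenable node to the reward node, \mathbb{E}[Y \mid \mathrm{do}(X = x)] coincides with the observational conditional \mathbb{E}[Y \mid X = x], so observational samples already inform estimates of the interventional values.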