Seminars

Towards Robustness in LLM Alignment via Reinforcement Learning and Causality

Series: Ph.D. Thesis Defense

Speaker: Rahul Madhavan, Ph.D. (Engg.) student, Dept. of CSA

Date/Time: Feb 10, 11:00 AM

Location: CSA Auditorium (Room No. 104, Ground Floor)

Faculty Advisor: Prof. Siddharth Barman

Abstract:
The post-training paradigm in current state-of-the-art language models relies broadly on algorithms such as PPO, GRPO, and DPO. Fundamentally, these alignment techniques minimize the divergence between the LLM's output distribution and human preferences. While effective, each algorithm exhibits its own failure modes. Policy-gradient methods like PPO and GRPO are often destabilized by imperfect reward modeling and optimization variance, whereas pairwise approaches like DPO converge to suboptimal solutions by failing to account for the rich, set-level signals inherent in multi-response data.
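For context, the alignment objective referred to above and the pairwise DPO loss it induces can be written in their standard forms from the literature (the notation is not reproduced from the thesis):

\[
\max_{\pi_\theta}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big),
\]
\[
\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
\]

where r is a learned reward model, \pi_{\mathrm{ref}} is the reference policy, \sigma is the logistic function, and (y_w, y_l) are a preferred and a dispreferred response. The pairwise structure of the second objective is what the set-level methods below generalize.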

Furthermore, models frequently succumb to reward hacking by exploiting superficial heuristics, suffer from distribution shifts, or internalize spurious correlations present in the training data. Centrally, this thesis investigates the question: How can we design post-training frameworks that are mathematically robust to spurious correlations and computationally efficient for on-policy learning? We answer this by unifying principles from reinforcement learning with the formalisms of causal inference to improve both the policy optimization algorithms and the fidelity of the reward signals they maximize.

We first advance the state of the art in preference optimization by moving beyond the limitations of pairwise comparisons. We introduce Multi-Preference Optimization (MPO), a framework that generalizes the standard Bradley-Terry model to handle set-level contrasts. By optimizing against entire sets of preferred and dispreferred responses simultaneously, MPO extracts a denser supervisory signal, achieving significant gains over Direct Preference Optimization (DPO). To address the computational bottlenecks of on-policy generation, where models produce vast numbers of candidate responses, we propose Active Multi-Preference Optimization (AMPO). This method formulates data selection as a weighted K-medoids problem under Lipschitz constraints, theoretically guaranteeing maximization of expected reward while training on only a representative subset of the data. Furthermore, we identify a critical failure mode in reference-free alignment, which we term the URSLA shortcut, where models minimize loss by prematurely truncating responses to reduce per-token uncertainty. We address this via Reference-Free Alignment (REFA), which uses a fine-grained End-of-Sequence regularizer to enforce precise control over response length without relying on a reference policy.
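As a rough illustration of the set-level idea, the sketch below contrasts an entire preferred set against the full candidate set in a single softmax-style term. This is a generic listwise Bradley-Terry-style loss written purely for illustration; the exact MPO objective, weighting, and hyperparameters in the thesis may differ.

```python
import torch

def set_level_preference_loss(preferred_logratios: torch.Tensor,
                              dispreferred_logratios: torch.Tensor,
                              beta: float = 0.1) -> torch.Tensor:
    """Listwise Bradley-Terry-style contrast over response sets.

    Both inputs are 1-D tensors of reference-adjusted sequence log-probabilities
    (log pi_theta(y|x) - log pi_ref(y|x)) for the preferred and dispreferred
    responses to a single prompt.
    """
    all_logratios = torch.cat([preferred_logratios, dispreferred_logratios])
    # Numerator: soft mass assigned to the preferred set.
    log_num = torch.logsumexp(beta * preferred_logratios, dim=0)
    # Denominator: mass over every candidate response in the group.
    log_den = torch.logsumexp(beta * all_logratios, dim=0)
    # Negative log-probability that a preferred response wins the contrast.
    return log_den - log_num

# Toy usage: random reference-adjusted log-ratios for 3 preferred
# and 5 dispreferred responses to one prompt.
torch.manual_seed(0)
print(set_level_preference_loss(torch.randn(3), torch.randn(5)).item())
```

Relative to a pairwise DPO term, a single gradient step here touches every response in the candidate set, which is one way to realize the denser supervisory signal described above.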

Complementing these algorithmic contributions, we propose methods to enhance the causal validity of the reward signal itself. Optimization is only as effective as the objective function it maximizes; if the reward model relies on non-causal features, the policy will fail to generalize. We introduce Context-Aware Reward Modeling (CARMO), which mitigates overfitting to static rubrics by dynamically generating query-specific evaluation criteria at inference time. We further propose Time-Reversed Language Models (TRLM) to generate unsupervised reward signals: by pre-training models to predict the previous token, we model the reverse conditional probability P(Query|Response), enforcing a bidirectional consistency check that we show mitigates hallucination. Finally, we apply these causal frameworks to the problem of safety and fairness. We develop Causally Fair Language Models (CFL) by constructing reward signals based on the Average Treatment Effect (ATE). By isolating the causal contribution of tokens via counterfactual analysis, we demonstrate the ability to penalize toxic content while mathematically bounding unintended bias against protected groups. We hope this work lays the groundwork for a theoretically grounded methodology for building language models that are robust, controllable, and aligned with latent user intent.
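For context, the average treatment effect underlying ATE-based rewards of this kind can be written generically as follows; the notation is illustrative rather than the thesis's own:

\[
\mathrm{ATE}(w) \;=\; \mathbb{E}\big[\,T(S) \mid \mathrm{do}(w \in S)\,\big] \;-\; \mathbb{E}\big[\,T(S) \mid \mathrm{do}(w \notin S)\,\big],
\]

where S is a generated response, T(S) a toxicity (or harm) score, and w a candidate token. Tokens whose counterfactual inclusion raises expected toxicity can then be penalized in the reward, in line with the counterfactual analysis described above.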