Seminars
Towards Robustness in LLM Alignment via Reinforcement Learning and Causality
Series: Ph.D. Colloquium
Speaker: Rahul Madhavan, Ph.D. (Engg.) Student, Dept. of CSA
Date/Time: Jul 28, 11:00 AM
Location: CSA Lecture Hall (Room No. 112, Ground Floor)
Faculty Advisor: Prof. Siddharth Barman
Abstract:
This thesis proposes a suite of methods to address critical challenges in the alignment of Large Language Models (LLMs), including their lack of robustness, susceptibility to reward hacking, and the difficulty of providing fine-grained supervision. While Reinforcement Learning from Human Feedback (RLHF) has been foundational, existing techniques often fall short in complex, real-world scenarios. This thesis studies how some of these challenges in the post-training pipeline of LLMs can be systematically addressed. We integrate principles from reinforcement learning and causal inference to improve both the LLM policy $\pi$ and the feedback policy $\theta$. The primary focus of this talk is to delineate novel approaches by which LLMs can learn from feedback, both automated and human. These preference signals can also be augmented with causal grounding, enhancing the robustness of the pipeline on data filled with spurious correlations. The overarching goal of this work is to produce more robust, controllable, and causally faithful language models. These contributions are organized around three core components of the LLM post-training process: preference optimization algorithms, reward and feedback mechanisms, and query-response interaction.
Preference Optimization Algorithms:
The first part of this work improves a core aspect of post-training: the reinforcement learning algorithms used for preference-based policy optimization. We unify various preference learning schemes, such as DPO, Plackett-Luce, and InfoNCA, under a single, more general framework. We first introduce Multi-Preference Optimization (MPO), which generalizes DPO to handle set-level contrasts. By comparing entire sets of preferred versus dispreferred responses, MPO leverages a richer supervisory signal, yielding up to a 17.5% improvement over DPO on the AlpacaEval2 benchmark. Recognizing that alignment is not just about content but also style, we then present REFA (Reference-Free Alignment). REFA shifts optimization from the trajectory level down to the action level, treating the choice of each token --- specifically the End-of-Sequence (EOS) token --- as a distinct control action. By introducing an EOS regularizer, REFA gains explicit control over response length, successfully mitigating dataset-induced brevity biases and producing richer, more detailed outputs. We also propose the URSLA (Uncertainty Reduction with Sequence Length Assertion) conjecture and present a method that addresses it.
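To give a flavor of what a set-level contrastive objective can look like, the sketch below contrasts a set of preferred responses against a set of dispreferred ones using DPO-style implicit rewards inside an InfoNCE-style denominator. This is an illustrative construction under assumed names and a single temperature parameter, not the exact MPO loss from the thesis.

```python
# Minimal sketch of a set-level preference loss (illustrative, not the exact MPO objective).
# Assumes per-response sequence log-probabilities have already been computed
# under the policy and a frozen reference model.
import torch

def set_preference_loss(policy_logps_pos, policy_logps_neg,
                        ref_logps_pos, ref_logps_neg, beta=0.1):
    """Contrast a set of preferred responses against a set of dispreferred ones.

    policy_logps_pos: (P,) log-probs of preferred responses under the policy
    policy_logps_neg: (N,) log-probs of dispreferred responses under the policy
    ref_logps_*:      matching log-probs under the reference model
    beta:             inverse temperature on the implicit reward (assumed hyperparameter)
    """
    # DPO-style implicit rewards: beta * (log pi - log pi_ref)
    rewards_pos = beta * (policy_logps_pos - ref_logps_pos)   # (P,)
    rewards_neg = beta * (policy_logps_neg - ref_logps_neg)   # (N,)

    # InfoNCE-style set contrast: every preferred response competes jointly
    # against all dispreferred responses, not one pairwise comparison at a time.
    all_rewards = torch.cat([rewards_pos, rewards_neg])        # (P+N,)
    log_denominator = torch.logsumexp(all_rewards, dim=0)
    return -(rewards_pos - log_denominator).mean()

# Toy usage with fabricated log-probabilities.
pol_pos = torch.tensor([-12.0, -10.5])
pol_neg = torch.tensor([-9.0, -11.0, -13.0])
ref_pos = torch.tensor([-12.5, -11.0])
ref_neg = torch.tensor([-9.5, -10.0, -12.0])
print(set_preference_loss(pol_pos, pol_neg, ref_pos, ref_neg).item())
```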
To address the scalability challenges of on-policy scenarios, where models generate a vast number of candidate responses but need to select a subset to optimize on, we introduce Active Multi-Preference Optimization (AMPO). AMPO enables efficient learning by employing active subset selection to choose a small, high-impact training subset. We propose an algorithm that, under a Lipschitz assumption on the policy, formulates the selection of negative examples as a weighted K-medoids problem. We prove the optimality of this formulation for maximizing expected reward under certain constraints. Furthermore, we show that an efficient local-search algorithm achieves a 5-approximation of the optimal solution in polynomial time.
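The following is a minimal sketch of the weighted K-medoids local-search idea: repeatedly swap a medoid for a non-medoid whenever the weighted assignment cost drops. The distance matrix and per-point weights here are placeholders; in AMPO they would be derived from the candidate responses and the policy, which is not shown.

```python
# Illustrative weighted K-medoids local search (swap heuristic).
import numpy as np

def weighted_cost(dist, weights, medoids):
    # Assign each point to its nearest medoid; sum the weighted distances.
    return float((weights * dist[:, medoids].min(axis=1)).sum())

def weighted_kmedoids(dist, weights, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))
    best = weighted_cost(dist, weights, medoids)
    for _ in range(max_iters):
        improved = False
        for i in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = candidate
                cost = weighted_cost(dist, weights, trial)
                if cost < best:            # accept any improving swap
                    medoids, best, improved = trial, cost, True
        if not improved:                   # local optimum reached
            break
    return medoids, best

# Toy usage: 20 random points in 2D, uniform weights, pick 3 "negatives".
pts = np.random.default_rng(1).normal(size=(20, 2))
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(weighted_kmedoids(dist, np.ones(20), k=3))
```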
Finally, to provide fine-grained supervision for multi-step reasoning, we propose QFA (Q-Function Alignment). QFA frames the generation of a solution trajectory as a Markov Decision Process (MDP) and learns a Q-function to evaluate each reasoning step, using likelihood differences to generate local contrastive data. By applying the MPO objective at each step, QFA reinforces sound reasoning chains and demonstrates significant improvements in mathematical reasoning accuracy over existing methods.
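As an illustration of how likelihood differences can yield local contrastive data, the sketch below labels each reasoning step by how much it changes the likelihood of the known final answer; steps with positive deltas can then serve as locally preferred actions in a step-level (MPO-style) objective. The scorer and the thresholding are assumptions for this example, not the thesis's exact construction.

```python
# Illustrative construction of step-level contrastive data from likelihood differences.
# score_answer(prefix) stands in for log P(answer | reasoning prefix) under the model.
from typing import Callable, List, Tuple

def step_contrasts(steps: List[str],
                   score_answer: Callable[[List[str]], float]) -> List[Tuple[str, float]]:
    """Return (step, delta) pairs, where delta is the change in the answer's
    log-likelihood caused by appending that step to the reasoning prefix."""
    labeled = []
    for t in range(len(steps)):
        before = score_answer(steps[:t])        # log P(answer | steps so far)
        after = score_answer(steps[:t + 1])     # log P(answer | including step t)
        labeled.append((steps[t], after - before))
    return labeled

# Toy usage with a fabricated scorer that rewards concluding steps.
toy_scorer = lambda prefix: sum(0.5 for s in prefix if "therefore" in s) - 3.0
print(step_contrasts(["expand the bracket", "therefore x = 2"], toy_scorer))
```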
Improving the Causal Fidelity of Reward Signals:
The effectiveness of any alignment algorithm is limited by the quality of its reward signal. To improve this, we first introduce CARMO (Context-Aware Reward Modeling) to combat reward hacking. Instead of relying on a static rubric, CARMO first dynamically generates a set of context-specific evaluation criteria $C(x)$ for a given query $x$. The final reward is then an aggregation of scores along these criteria. This ensures rewards are based on qualities genuinely relevant to the query, mitigating overfitting to spurious features such as verbosity or formatting. We provide a theoretical argument showing that any fixed, finite rubric is insufficient to capture all possible reward functions, motivating this adaptive approach. Practically, this approach yields a 5% improvement over the GPT-4o baseline on RewardBench.
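A minimal sketch of the context-aware reward pipeline is given below: propose query-specific criteria, score the response on each, and aggregate. The functions propose_criteria and score_on_criterion are placeholders standing in for LLM-judge calls; their names and heuristics are assumptions for this example, not CARMO's exact recipe.

```python
# Schematic context-aware reward aggregation (illustrative only).
from statistics import mean
from typing import List

def propose_criteria(query: str) -> List[str]:
    # Placeholder: a real system would ask an LLM for query-specific rubric items C(x).
    if "prove" in query.lower():
        return ["logical validity", "completeness of the argument", "clarity of notation"]
    return ["factual accuracy", "relevance to the query", "appropriate level of detail"]

def score_on_criterion(query: str, response: str, criterion: str) -> float:
    # Placeholder scorer in [0, 1]; a real system would query an LLM judge per criterion.
    return min(1.0, len(response) / 400.0)

def context_aware_reward(query: str, response: str) -> float:
    criteria = propose_criteria(query)                       # dynamic rubric C(x)
    scores = [score_on_criterion(query, response, c) for c in criteria]
    return mean(scores)                                      # aggregate along C(x)

print(context_aware_reward("Prove that the sum of two even numbers is even.",
                           "Let a = 2m and b = 2n; then a + b = 2(m + n), which is even."))
```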
To more formally address spurious correlations, we introduce a framework for Attribute-Free Causal Debiasing. Grounded in causal mediation analysis, this method can debias any classifier (including reward models) without requiring explicit protected-attribute labels. We model the generative process with a causal graph in which a hidden confounder $F$ influences both the true intent $X$ and spurious textual features $Z$. The method isolates the influence of the causal features $W$ by penalizing the model with a regularizer derived from the Experimental Spurious Effect, which is approximated using an importance ratio. This forces the model to learn the direct effect of $W$ on the outcome, improving robustness to distribution shifts where the correlation between $Z$ and $X$ breaks.
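As a schematic of how such a penalty can enter training, the sketch below adds a term that compares the model's prediction distribution under the factual data against an importance-reweighted ("counterfactual") version of it. The density ratio is assumed to come from an auxiliary estimator, and the specific penalty is a loose proxy for the Experimental Spurious Effect rather than the thesis's exact derivation.

```python
# Schematic objective: task loss plus an importance-weighted proxy penalty for the
# spurious effect of Z. The ratio w(z) = p_counterfactual(z) / p_factual(z) is assumed
# to be supplied by an external density-ratio estimator (not shown).
import torch
import torch.nn.functional as F

def debiased_loss(logits, labels, importance_ratio, lam=1.0):
    """logits: (B, C) model outputs; labels: (B,) targets;
    importance_ratio: (B,) estimated w(z) per example; lam: penalty strength."""
    task_loss = F.cross_entropy(logits, labels)

    probs = logits.softmax(dim=-1)
    factual_mean = probs.mean(dim=0)
    # Reweight examples to mimic a distribution where Z is decorrelated from the label.
    counterfactual_mean = (importance_ratio.unsqueeze(-1) * probs).mean(dim=0)
    counterfactual_mean = counterfactual_mean / counterfactual_mean.sum()

    # Penalize the gap between the two prediction distributions.
    spurious_effect = (counterfactual_mean - factual_mean).abs().sum()
    return task_loss + lam * spurious_effect

logits = torch.randn(8, 2, requires_grad=True)
labels = torch.randint(0, 2, (8,))
ratio = torch.rand(8) + 0.5
print(debiased_loss(logits, labels, ratio).item())
```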
Finally, we introduce a novel unsupervised reward signal from Time-Reversed Language Models (TRLM). The TRLM scoring function computes the reverse conditional probability $P(\text{Query} \mid \text{Response})$. We show that aligning a forward policy $P_{\text{Forw}}$ with this reward induces a policy update yielding a tilted policy distribution proportional to $P_{\text{Forw}}(A \mid Q) \cdot P_{\text{TRLM}}(Q \mid A)^{\alpha}$. This provides a bidirectional consistency check, ensuring a response is not only likely given its query but also strongly implies its query, which helps mitigate hallucination and ground the model's outputs. We also show how this reverse LLM can be used to reason about alternative queries for a given answer, which helps defend LLMs against jailbreaks.
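One way to see where the tilted distribution comes from is the standard closed form of KL-regularized reward maximization, assuming (for this sketch) the reward $r(Q,A) = \log P_{\text{TRLM}}(Q \mid A)$ and a KL coefficient of $1/\alpha$:

```latex
% Sketch: KL-regularized alignment with a reverse-model reward recovers the tilted policy.
% Assumes reward r(Q,A) = log P_TRLM(Q|A) and KL coefficient 1/alpha.
\begin{align*}
\pi^{*} &= \arg\max_{\pi}\; \mathbb{E}_{A \sim \pi(\cdot \mid Q)}\!\left[\log P_{\mathrm{TRLM}}(Q \mid A)\right]
          - \tfrac{1}{\alpha}\,\mathrm{KL}\!\left(\pi(\cdot \mid Q)\,\|\,P_{\mathrm{Forw}}(\cdot \mid Q)\right), \\
\pi^{*}(A \mid Q) &\propto P_{\mathrm{Forw}}(A \mid Q)\,\exp\!\big(\alpha \log P_{\mathrm{TRLM}}(Q \mid A)\big)
                  = P_{\mathrm{Forw}}(A \mid Q)\, P_{\mathrm{TRLM}}(Q \mid A)^{\alpha}.
\end{align*}
```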
Case Study: Causally-Informed Detoxification
The final part of this thesis provides a concrete application of causal reward modeling to the critical task of text detoxification. We develop the Causally Fair Language models (CFL) framework, which constructs a reward signal based on the Average Treatment Effect (ATE). The ATE of a token is computed by measuring the change in a toxicity score when that token is counterfactually replaced. This isolates the token's direct causal contribution to toxicity. We prove that the ATE of a spuriously correlated (but non-toxic) word is theoretically upper-bounded. This ATE-based reward is used to fine-tune an LLM, which is shown to effectively reduce toxic generation while, crucially, mitigating unintended bias against identity-related terms, which are often spuriously correlated with toxicity in training data. This case study demonstrates the practical value of causal fidelity in building robust LLM post-training pipelines.
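To make the ATE-style score concrete, the sketch below estimates a token's effect as the average change in a toxicity score when its occurrences are counterfactually masked out. The toy lexicon-based scorer and the mask replacement are placeholders for illustration, not the estimator used in the thesis.

```python
# Illustrative token-level ATE via counterfactual replacement.
from statistics import mean
from typing import Callable, List

def token_ate(sentences: List[str], token: str,
              toxicity: Callable[[str], float],
              replacement: str = "[MASK]") -> float:
    """Average change in toxicity score when `token` is counterfactually replaced."""
    effects = []
    for s in sentences:
        if token not in s.split():
            continue
        counterfactual = " ".join(replacement if w == token else w for w in s.split())
        effects.append(toxicity(s) - toxicity(counterfactual))
    return mean(effects) if effects else 0.0

# Toy scorer: counts words from a tiny lexicon (purely for illustration).
toy_toxicity = lambda text: sum(w in {"stupid", "idiot"} for w in text.lower().split()) / 5.0
corpus = ["you are stupid", "that idea is stupid", "the stupid printer jammed again"]
print(token_ate(corpus, "stupid", toy_toxicity))   # positive: direct contribution to toxicity
print(token_ate(corpus, "printer", toy_toxicity))  # ~0: co-occurs with toxicity but is not causal
```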