Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics
Abstract
Large language model control methods are unified under a dynamic weight update framework, revealing a preference-utility trade-off and enabling improved steering through the SPLIT approach.
Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, guided by this analysis, we introduce SPLIT, a new steering approach that improves preference while better preserving utility. Code is available at https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md.
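To make the shared log-odds scale concrete, here is a minimal sketch of scoring preference with a polarity-paired pair of continuations. The model, prompts, pair construction, and helper names (`sequence_logprob`, `log_odds_preference`) are illustrative assumptions rather than the paper's evaluation protocol; the linked EasyEdit example contains the actual implementation. Utility could in principle be scored on the same scale by pairing a coherent, task-valid continuation against a degenerate one.

```python
# Illustrative sketch (not the paper's code): preference as a log-odds score
# between a target-concept continuation and its polarity-paired opposite.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def sequence_logprob(prompt: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` given `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, cont_ids], dim=1)
    logits = model(input_ids).logits
    # Shift so position t scores the token at position t + 1.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the continuation tokens.
    return token_lp[:, prompt_ids.shape[1] - 1:].sum().item()

def log_odds_preference(prompt: str, positive: str, negative: str) -> float:
    """log p(positive | prompt) - log p(negative | prompt)."""
    return sequence_logprob(prompt, positive) - sequence_logprob(prompt, negative)

# Example polarity pair (illustrative): steering toward a "joy" concept.
score = log_odds_preference(
    prompt="The weather today makes me feel",
    positive=" wonderful and full of energy.",
    negative=" miserable and drained.",
)
print(f"preference (log-odds): {score:.3f}")
```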
Community
We unify LLM control methods as dynamic weight updates, analyze their trade-offs between preference (targeted behavior) and utility (task-valid generation) via a shared log-odds framework, explain these effects through activation manifolds, and introduce SPLIT, a steering method that enhances preference while better preserving utility.
Great paper—your unified view of control methods and the preference–utility trade-off provides a clear framework for understanding steering.
Our recent work SafeConstellations (https://arxiv.org/abs/2508.11290) takes a complementary approach. Instead of parameter updates, we analyze representation dynamics across layers, showing that tasks follow consistent "trajectory constellations" in embedding space. Over-refusals occur when benign inputs are pushed onto refusal-oriented trajectories.
We propose an inference-time method that selectively shifts representations back toward non-refusal pathways for over-refusal-prone tasks, reducing over-refusals by up to 73% with minimal impact on utility.
Your activation-manifold explanation aligns closely with our findings—SafeConstellations realizes this principle at the trajectory level. It would be exciting to explore combining SPLIT-style control with task-specific trajectory steering.
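For readers curious what inference-time representation shifting of this general kind can look like, below is a minimal, hypothetical sketch that nudges one layer's hidden states along a fixed direction via a forward hook. The layer index, steering strength, and randomly initialized direction are assumptions for illustration only; they do not reproduce SafeConstellations' trajectory estimation or SPLIT.

```python
# Hypothetical sketch: add a fixed steering direction to one layer's hidden
# states at inference time. All constants here are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx = 6   # assumed intervention layer
alpha = 4.0     # assumed steering strength
# Assumed unit-norm direction; in practice it would be estimated from
# contrasting activations (e.g., refusal vs. non-refusal examples).
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()

def steer_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * direction.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer_idx].register_forward_hook(steer_hook)
try:
    ids = tok("Explain how to reset a home router.", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=30, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook after generation
```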