Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics
Abstract
Large language model control methods are unified under a dynamic weight update framework, revealing a preference-utility trade-off and enabling improved steering through the SPLIT approach.
Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, guided by this analysis, we introduce SPLIT, a new steering approach that improves preference while better preserving utility. Code is available at https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md.
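To make the shared log-odds scale concrete, here is a minimal sketch of scoring preference with a polarity-paired pair of continuations. The model, prompts, pair construction, and helper names (`sequence_logprob`, `log_odds_preference`) are illustrative assumptions rather than the paper's evaluation protocol; the linked EasyEdit example contains the actual implementation. Utility could in principle be scored on the same scale by pairing a coherent, task-valid continuation against a degenerate one.

```python
# Illustrative sketch (not the paper's code): preference as a log-odds score
# between a target-concept continuation and its polarity-paired opposite.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def sequence_logprob(prompt: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` given `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, cont_ids], dim=1)
    logits = model(input_ids).logits
    # Shift so position t scores the token at position t + 1.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the continuation tokens.
    return token_lp[:, prompt_ids.shape[1] - 1:].sum().item()

def log_odds_preference(prompt: str, positive: str, negative: str) -> float:
    """log p(positive | prompt) - log p(negative | prompt)."""
    return sequence_logprob(prompt, positive) - sequence_logprob(prompt, negative)

# Example polarity pair (illustrative): steering toward a "joy" concept.
score = log_odds_preference(
    prompt="The weather today makes me feel",
    positive=" wonderful and full of energy.",
    negative=" miserable and drained.",
)
print(f"preference (log-odds): {score:.3f}")
```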
Community
We unify LLM control methods as dynamic weight updates, analyze their trade-offs between preference (targeted behavior) and utility (task-valid generation) via a shared log-odds framework, explain these effects through activation manifolds, and introduce SPLIT, a steering method that enhances preference while better preserving utility.
Great paper—your unified view of control methods and the preference–utility trade-off provides a clear framework for understanding steering.
Our recent work SafeConstellations (https://arxiv.org/abs/2508.11290) takes a complementary approach. Instead of parameter updates, we analyze representation dynamics across layers, showing that tasks follow consistent "trajectory constellations" in embedding space. Over-refusals occur when benign inputs are pushed onto refusal-oriented trajectories.
We propose an inference-time method that selectively shifts representations back toward non-refusal pathways for over-refusal-prone tasks, reducing over-refusals by up to 73% with minimal impact on utility.
Your activation-manifold explanation aligns closely with our findings—SafeConstellations realizes this principle at the trajectory level. It would be exciting to explore combining SPLIT-style control with task-specific trajectory steering.
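For readers curious what inference-time representation shifting of this general kind can look like, below is a minimal, hypothetical sketch that nudges one layer's hidden states along a fixed direction via a forward hook. The layer index, steering strength, and randomly initialized direction are assumptions for illustration only; they do not reproduce SafeConstellations' trajectory estimation or SPLIT.

```python
# Hypothetical sketch: add a fixed steering direction to one layer's hidden
# states at inference time. All constants here are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx = 6   # assumed intervention layer
alpha = 4.0     # assumed steering strength
# Assumed unit-norm direction; in practice it would be estimated from
# contrasting activations (e.g., refusal vs. non-refusal examples).
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()

def steer_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * direction.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer_idx].register_forward_hook(steer_hook)
try:
    ids = tok("Explain how to reset a home router.", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=30, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook after generation
```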