Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification
Abstract
The LENS framework improves reinforcement learning with verifiable rewards by identifying and removing interference tokens from prompts, enhancing exploration efficiency and training stability.
Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but it remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success and unstable training on complex tasks. We find that many exploration failures arise not from problem difficulty but from a small number of prompt tokens that introduce interference. Building on this insight, we propose the Less Noise Sampling Framework (LENS), which first purifies prompts by identifying and removing interference tokens, then transfers successful rollouts from the purification process to supervise policy optimization on the original noisy prompts, enabling the model to learn to ignore interference in real-world, noisy prompting settings. Experimental results show that LENS significantly outperforms GRPO, delivering higher performance and faster convergence, with a 3.88% average gain and an over 1.6× speedup. Our work highlights the critical role of pruning interference tokens in improving rollout efficiency, offering a new perspective for RLVR research.
Community
Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong promise for improving LLM reasoning, but in practice it often fails silently:
for many hard prompts, all rollouts receive zero reward, causing training to stall or collapse.
🔍 Key observation
Through token-level analysis, we find that many failed rollouts are not due to problem difficulty. Instead, failures are often caused by a very small number of “interference tokens” (<5%) that derail the entire reasoning trajectory.
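As one concrete way to picture this token-level analysis, the hypothetical Python sketch below scores each prompt token by how much removing it changes rollout success (a leave-one-out ablation). The paper's exact attribution procedure may differ; `sample_rollouts` and `is_correct` are assumed stand-ins for your generation and verifiable-reward code, not the authors' API.

```python
from typing import Callable, List

def interference_scores(
    prompt_tokens: List[str],
    sample_rollouts: Callable[[str, int], List[str]],  # prompt -> n sampled completions
    is_correct: Callable[[str], bool],                  # verifiable reward check
    n_rollouts: int = 8,
) -> List[float]:
    """Leave-one-out score per token: how much does dropping it raise rollout success?"""
    def success_rate(tokens: List[str]) -> float:
        rollouts = sample_rollouts(" ".join(tokens), n_rollouts)
        return sum(is_correct(r) for r in rollouts) / n_rollouts

    base = success_rate(prompt_tokens)
    scores = []
    for i in range(len(prompt_tokens)):
        ablated = prompt_tokens[:i] + prompt_tokens[i + 1:]
        # Positive score = removing this token helps, i.e. it likely interferes.
        scores.append(success_rate(ablated) - base)
    return scores
```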
✂️ Interference Token Purification
Simply removing these high-interference tokens can:
turn failed rollouts into successful ones
improve rollout accuracy by 20%+ on previously zero-reward prompts (see the sketch below)
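Building on the scoring sketch above, a purification step could then drop the few highest-scoring tokens (the analysis above suggests fewer than 5% of tokens are involved). The fraction, threshold, and whitespace re-join below are illustrative assumptions, not the authors' exact recipe.

```python
from typing import List

def purify_prompt(prompt_tokens: List[str], scores: List[float],
                  max_frac: float = 0.05, min_score: float = 0.0) -> str:
    """Drop up to max_frac of tokens whose removal improves rollout success."""
    k = max(1, int(max_frac * len(prompt_tokens)))
    worst = sorted(range(len(prompt_tokens)), key=lambda i: scores[i], reverse=True)[:k]
    keep = [t for i, t in enumerate(prompt_tokens)
            if not (i in worst and scores[i] > min_score)]
    return " ".join(keep)
```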
🚀 Method: Less Noise Sampling (LENS)
We introduce LENS, an online selective rollout framework for RLVR:
Identify & remove interference tokens in low-success prompts to unlock successful rollouts
Transfer what was learned back to the original noisy prompts, using the denoised rollouts as high-reward supervision
→ the model learns to ignore noise, not just solve cleaner prompts
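Putting the pieces together, here is a minimal, hypothetical sketch of one LENS-style training step, reusing the `interference_scores` and `purify_prompt` helpers above. `sample_rollouts`, `is_correct`, and `grpo_update` are assumed stand-ins for the rollout, verifier, and GRPO-style policy-update code; the paper's exact selection and transfer rules may differ.

```python
def lens_step(policy, prompt: str, sample_rollouts, is_correct, grpo_update,
              n_rollouts: int = 8, low_success: float = 0.125):
    """One step: purify a low-success prompt, then supervise on the original prompt."""
    rollouts = sample_rollouts(policy, prompt, n_rollouts)
    rewards = [float(is_correct(r)) for r in rollouts]

    if sum(rewards) / n_rollouts >= low_success:
        # Enough successful rollouts: ordinary GRPO-style update on the original prompt.
        return grpo_update(policy, prompt, rollouts, rewards)

    # Low-success prompt: identify and remove interference tokens, then re-sample.
    tokens = prompt.split()
    scores = interference_scores(
        tokens, lambda p, n: sample_rollouts(policy, p, n), is_correct, n_rollouts)
    clean_prompt = purify_prompt(tokens, scores)
    clean_rollouts = sample_rollouts(policy, clean_prompt, n_rollouts)
    successes = [r for r in clean_rollouts if is_correct(r)]

    # Transfer: treat successful denoised rollouts as high-reward trajectories
    # conditioned on the ORIGINAL noisy prompt, so the policy learns to ignore
    # interference tokens rather than relying on a cleaner prompt.
    if successes:
        return grpo_update(policy, prompt, successes, [1.0] * len(successes))
    return policy
```

The key design choice in this sketch is that the update is always applied against the original noisy prompt, so the improvement carries over to inference-time settings where the noise cannot be removed.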
📊 Results
Pareto improvement over GRPO in performance–efficiency trade-offs
+3.88% average accuracy gain across 7 math reasoning benchmarks
1.6× faster convergence with less compute
Outperforms both rollout scaling and prompt filtering baselines
💡 Takeaway
Low-success prompts are not useless—they contain valuable signals hidden behind a few noisy tokens.
Pruning interference tokens offers a new perspective on improving exploration efficiency in RLVR.
🔗 Links
Paper: Less Noise Sampling Framework for RLVR
Keywords: RLVR, GRPO, reasoning, rollout efficiency, token-level analysis