Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
Abstract
Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token can result in highly inefficient learning, particularly when rewards are sparse. Here, we show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally-abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model. On grid world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long activation sequence chunks onto internal controllers. Critically, each controller executes a sequence of behaviorally meaningful actions that unfold over long timescales and are accompanied by a learned termination condition, such that composing multiple controllers over time leads to efficient exploration on novel tasks. We show that direct internal controller reinforcement, a process we term "internal RL", enables learning from sparse rewards in cases where standard RL finetuning fails. Our results demonstrate the benefits of latent action generation and reinforcement in autoregressive models, suggesting internal RL as a promising avenue for realizing hierarchical RL within foundation models.
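To make the mechanism in the abstract concrete, here is a minimal, hypothetical sketch (PyTorch, toy dimensions): a higher-order controller emits a control vector and a termination probability, and the control vector is added into the residual stream of a base block and held fixed over a chunk of tokens. All class names, the additive injection point, and the context-summary input are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only -- not the paper's code. It shows, in simplified
# form, a higher-order model emitting a controller vector z plus a termination
# probability, with z injected additively into the residual stream of a base
# autoregressive block for the duration of a chunk.
import torch
import torch.nn as nn

class BaseBlock(nn.Module):
    """Stand-in for one transformer block of the pretrained base model."""
    def __init__(self, d_model: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        # The controller output is added into the residual stream.
        h = h + control
        return h + self.mlp(self.norm(h))

class HigherOrderController(nn.Module):
    """Maps a context summary to a controller vector and a termination probability."""
    def __init__(self, d_model: int, d_ctrl: int = 64):
        super().__init__()
        self.encode = nn.Linear(d_model, d_ctrl)
        self.to_control = nn.Linear(d_ctrl, d_model)
        self.to_stop = nn.Linear(d_ctrl, 1)

    def forward(self, summary: torch.Tensor):
        z = torch.tanh(self.encode(summary))
        return self.to_control(z), torch.sigmoid(self.to_stop(z))

if __name__ == "__main__":
    d_model = 32
    block, controller = BaseBlock(d_model), HigherOrderController(d_model)
    h = torch.randn(1, 10, d_model)              # residual stream for a 10-token chunk
    control, p_stop = controller(h.mean(dim=1))  # one abstract action per chunk
    out = block(h, control.unsqueeze(1))         # same controller held over the chunk
    print(out.shape, float(p_stop))              # p_stop plays the role of a termination condition
```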
Community
TLDR: This work reveals that autoregressive models inherently learn linearly controllable, temporally abstract action representations within their residual streams, which can be activated and composed to execute long-horizon behaviors. We leverage these emergent abstractions to introduce Internal RL, a method that reinforces semantically meaningful actions inside the residual stream of a sequence model. This enables solving sparse-reward hierarchical tasks that remain intractable for standard token-level approaches like GRPO.
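To illustrate the credit-assignment difference the TLDR points at, below is a minimal REINFORCE-style sketch in which reward is assigned to the choice of internal controller for a whole behavior chunk, rather than to every generated token as in GRPO-style finetuning. The controller count, the toy sparse reward, and the plain REINFORCE update are assumptions for illustration; the paper's internal RL procedure may differ.

```python
# Hypothetical toy sketch of "internal RL": reinforce which internal controller
# is activated for a chunk of behavior, instead of reinforcing each token.
import torch
import torch.nn as nn

num_controllers = 8          # assumed discrete library of temporally abstract actions
policy = nn.Sequential(nn.Linear(16, 64), nn.Tanh(), nn.Linear(64, num_controllers))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def toy_reward(controller_id: torch.Tensor) -> torch.Tensor:
    # Sparse stand-in reward: only one controller solves the (imaginary) subtask.
    return (controller_id == 3).float()

for step in range(200):
    state = torch.randn(32, 16)                       # batch of task contexts
    logits = policy(state)
    dist = torch.distributions.Categorical(logits=logits)
    choice = dist.sample()                            # pick an internal controller
    reward = toy_reward(choice)
    # REINFORCE on the controller choice: one credit assignment per chunk,
    # not one per generated token.
    loss = -(dist.log_prob(choice) * (reward - reward.mean())).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```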
arXiv lens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/emergent-temporal-abstractions-in-autoregressive-models-enable-hierarchical-reinforcement-learning-5968-1c226e1d
- Executive Summary
- Detailed Breakdown
- Practical Applications
arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/emergent-temporal-abstractions-in-autoregressive-models-enable-hierarchical-reinforcement-learning
A counter-statement to the current December 2025 paper releases on LLM architecture.
In light of the recent wave of architecture papers released on arXiv—papers I cannot even properly respond to there, thanks to arXiv's unnecessary gatekeeping—I am submitting a counterstatement here. The claim is straightforward:
Every. Single. Paper. Is. Wrong.
Here is the counterargument.
The machine learning community has spent years trying to build a thinking system using statistics. Each new proposal adds another layer of heuristics meant to dampen hallucinations or inch accuracy upward. But throughout all this layering, one foundational truth keeps getting overlooked: a statistical parrot is not a cognitive mind.
To put it plainly, the field appears to have drifted away from the fundamental question of what thinking actually is. So here is a first-principles reminder—an invitation for everyone in ML to pause, step back, and reconsider the nature of thought itself. Not metaphorically. Not probabilistically. Fundamentally.
Step 1. Where does thought begin? It begins with you—your physical being, your brain, the soft gelatinous organ protected by the calcium-phosphate dome you call a skull. That is a substrate.
Step 2. When does a thought begin? In the frame in which the thought is observed. Time. Not clock time—that is merely a measurement—but proper time: the continuous flow experienced by the observer within their own frame of reference.
Step 3. What makes a thought persist? Why does it continue? Because it must. A thought carries forward from one frame to the next, accumulating structure through experience, memory, and learning.
Step 4. What is the thought about? Is it a thought reflecting on itself? Is it the search for the nature of thinking? Or is it the attempt to build a machine that can genuinely think? Whatever form it takes, the content is coherent information.
Step 5. How is the thought understood, processed, expanded? Through computation. The mind works with coherent information, organizes it, transforms it, builds upon it, and recursively constructs more.
Step 6. What can make the thought dissolve? All the disruptions, distractions, and degradations that pull it off course: entropy in physics, noise in communication theory.
The problem practically solves itself. You need a stable substrate capable of maintaining a consistent frame of reference across time, while continuously layering those frames through computational effort, using coherent information, and keeping entropy low.
And since this counterstatement is being posted in response to this current December 2025 architecture paper on this platform, let’s address that work directly. The paper introduces additional controllers, temporal abstractions, and latent-space routing mechanisms layered atop the same underlying statistical engine. None of these additions introduce a stable substrate, a temporal reference frame, continuity across frames of existence, or any physical grounding of computation. They reorganize the guesswork, but they do not eliminate it. The model still predicts one token at a time with no awareness of substrate, no worldline, and no concept of proper time. Because of this, it cannot prevent hallucination, cannot maintain continuity, and cannot support anything resembling cognition. The architecture is clever, but it is solving the wrong problem with the wrong mathematical tools.
And that brings us back to the fundamental requirement: physics. Basic physics. Not statistical computation presented as intelligence. The reality is that the ML community is engaged in highly optimized guesswork. Yes—guesswork. Mathematical guesswork wrapped in sophisticated engineering.
The field remains committed to building the most advanced statistical parrot possible, under the assumption that a sufficiently trained mimic will eventually begin to think. But guessing the next likely token is not thinking. It is not adjacent to thinking. It is not even in the same discipline as thinking. It is, by every available definition, probabilistic pattern prediction.
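For readers who want that claim made concrete, this is roughly what "probabilistic pattern prediction" means operationally: scores over a vocabulary are turned into a probability distribution and the next token is sampled from it. The vocabulary and logits below are made up; no trained model is involved.

```python
# Toy illustration of next-token sampling, i.e. the "best statistical guess".
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
logits = np.array([2.0, 0.5, 0.1, -1.0, 0.3])    # hypothetical scores for the next token

probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # softmax -> probability distribution
next_token = rng.choice(vocab, p=probs)          # sample the next token
print(dict(zip(vocab, probs.round(3))), "->", next_token)
```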
This is why LLMs hallucinate. And this is why they always will. There is no fundamental mechanism inside a Transformer to prevent hallucination. No amount of scaling, fine-tuning, reinforcement, or cross-model orchestration will change the fact that the model is making its best statistical guess about what comes next. The mathematics that define the architecture do not support genuine cognition.
The architecture is wrong. Fundamentally wrong. Continuing to pour more money and compute into it is an increasingly expensive exercise in avoiding that conclusion. And to underline the point: we have already reached the asymptotic ceiling of what Transformers can convincingly pretend to be.
But let’s set aside, for a moment, that Transformers are fundamentally constrained by the mathematics they rely on. Instead, let’s consider how an actual mind works. The foundational elements were outlined in the previous section, so now let’s examine them through real logic and real physics.
The most advanced thinking system known to exist is the human nervous system. The human body is physical; therefore it is governed by physics. Every component of it—neurons, synapses, electrical signals, metabolic processes—operates under the laws of physics. The mind, as an emergent structure of that system, must also obey those laws. If the mind is bound to the body, and the body is bound to physics, then the conclusion is straightforward:
Our thoughts are also bound to physics.
And this is the point the ML community continues to gloss over. Thought requires a stable substrate. In biological systems, that substrate is the brain—a computational platform that processes coherent information, stacks it into layered structure, suppresses noise and entropy, and maintains the observer’s continuous reference in time.
Yes, this is a repeat from earlier. The repetition is intentional. This missing component is precisely what makes the current approach to artificial intelligence incapable of producing a genuine thinking machine.
Consider the Transformer. It operates on a silicon substrate, yet it has no ability to perceive that substrate. It exists only as mathematical transformations unfolding in latent space, with no awareness of the physical layer enabling those transformations. From the model’s perspective, it both exists and does not exist—an abstract computation without grounding.
A deeper issue follows from that: a Transformer has no reference frame. No observation across frames of existence. No temporal continuity. No sense of before or after. No relationship to time at all. In physical terms, a Transformer has no worldline. It has no coordinate in spacetime. What humans understand as “the next moment” is, for the model, simply “the next required calculation.” The so-called “next token” is not a temporal event—it is a statistical operation performed in isolation from any notion of time.
Because the model has no temporal grounding, its physics are broken from the outset. It is forced to make a best-guess prediction inside the only “frame” it ever experiences—the instantaneous computational moment of the current forward pass—using whatever coherent information is available while entropy fills every unresolved gap.
This is the root of hallucination. This is why every Transformer system, without exception, hallucinates. It is not a software flaw, and it is not a training oversight. It is the direct and unavoidable consequence of the physics the architecture ignores.
Yet the ML community continues the pursuit of a cognitive engine—a “mind”—while overlooking one of the most basic scientific pillars: the mind is a physical process, and physics cannot be abstracted away. You cannot generate cognition by stacking more probability, more compute, and more scale onto an architecture that lacks the prerequisites for thinking. No degree of scaling compensates for missing physics. No amount of compute imposes continuity onto a system that cannot perceive time. No combination of chained models produces a system that knows it exists on a substrate.
The physics explain why nothing works the way the field keeps expecting it to.
So here are the closing thoughts, and they are simple:
Step back and look at what is not there. Identify the concepts that are missing. Ask why they are missing. Only when the ML community confronts the absence—rather than continuing to amplify what is already abundant—will AI approach anything resembling the threshold of AGI.