
Continuous Thought Machines by hardmaru
Interactive demonstration
(An in-browser CTM maze-solving demo appears here on the original page; it requires JavaScript. Click to move the Start/End points, toggled with 'move'.)
Introduction
Neural networks (NNs) were originally inspired by biological brains, yet they remain significantly distinct from their biological counterparts. Brains demonstrate complex neural dynamics that evolve over time, but modern NNs intentionally abstract away such temporal dynamics in order to facilitate large-scale deep learning. For instance, the activation functions of standard NNs can be seen as an intentional abstraction of a neuron's firing rate, replacing the temporal dynamics of biological processes with a single, static value. Such simplifications, though enabling significant advancements in large-scale machine learning, have widened the divide between artificial networks and the temporal computation of their biological counterparts.
Over hundreds of millions of years, evolution has endowed biological brains with rich neural dynamics, including spike-timing-dependent plasticity (STDP) and the synchronization of neural activity over time; these temporal mechanisms appear central to how brains process information.
Why do this research?
Indeed, the notably high performance of modern AI across many fields might suggest that emulating neural dynamics is unwarranted. However, the gap between the highly flexible and general nature of human cognition and the current state of modern AI points to missing components in our current models.
For these reasons, we argue that time should be a central component of artificial intelligence in order for it to eventually achieve levels of competency that rival or surpass human brains. Our main contributions are:
- We introduce a decoupled internal dimension, a novel approach to modeling the temporal evolution of neural activity. We view this dimension as that over which thought can unfold in an artificial neural system, hence the choice of nomenclature.
- We provide a mid-level abstraction for neurons, which we call neuron-level models (NLMs), where every neuron has its own internal weights that process a history of incoming signals (i.e., pre-activations) to activate (as opposed to a static ReLU, for example).
- We use neural synchronization directly as the latent representation with which the CTM observes (e.g., through an attention query) and predicts (e.g., via a projection to logits). This biologically-inspired design choice puts forward neural activity as the crucial element for any manifestation of intelligence the CTM might demonstrate.
Reasoning models and recurrence
The frontier of artificial intelligence faces a critical juncture: moving beyond simple input-output mappings towards genuine reasoning capabilities. While scaling existing models has yielded remarkable advancements, the associated computational cost and data demands are unsustainable and raise questions about the long-term viability of this approach. For sequential data, longstanding recurrent architectures such as RNNs and LSTMs already model computation that unfolds over time, and recent reasoning models revisit this idea by spending additional computation at inference time before committing to an answer.
Method
The Continuous Thought Machine (CTM) is a neural network architecture that enables a novel approach to thinking about data. It departs from conventional feed-forward models by explicitly incorporating the concept of neural dynamics as the central component of its functionality. The video above gives a pictorial overview of the internal workings of the CTM. We give all technical details, including additional figures and verbose explanations, in our Technical Report. A GitHub repository is also available. We will provide links to relevant parts of the repository as we explain the model below.

| Variable | Description |
|---|---|
| $\mathbf{z}^t$ | Post-activations at internal tick $t$, after the neuron-level models have been applied. |
| $\theta_{\text{syn}}$ | Recurrent (synapse) model weights; a U-Net-like architecture that connects neurons at a given internal tick, $t$. |
| $\mathbf{a}^t$ | Pre-activations at internal tick $t$. |
| $\mathbf{A}^t$ | History of the most recent pre-activations, kept as a FIFO list so that it is always length $M$; input to the neuron-level models. |
| $\theta_d$ | Weights of a single neuron-level model, $d$ of $D$; MLP architecture, with unique weights per neuron. |
| $\mathbf{Z}^t$ | History of all post-activations up to this internal tick (variable length); used as input for the synchronization dot products. |
| $\mathbf{S}^t$ | Synchronization matrix at internal tick $t$. In practice we use far fewer neurons than $D$ for the separate $\mathbf{S}^t_{\text{out}}$ and $\mathbf{S}^t_{\text{action}}$ synchronization representations. |
| $\mathbf{W}_{\text{out}}$, $\mathbf{W}_{\text{in}}$ | Linear weight matrices that project $\mathbf{S}^t_{\text{out}}$ and $\mathbf{S}^t_{\text{action}}$ to predictions and attention queries, respectively. |
| $\mathbf{o}^t$ | Cross-attention output. |
The CTM consists of three main ideas:
- The use of internal recurrence, enabling a dimension over which a concept analogous to thought can occur. The entire process visualised in the video above is a single tick; the interactive maze demo at the top of the page uses 75 ticks. This recurrence is completely decoupled from any data dimensions.
- Neuron-level models, that compute post-activations by applying private (i.e., on a per-neuron basis) MLP models to a history of incoming pre-activations.
- Synchronization as a representation, where the neural activity over time is tracked and used to compute how pairs of neurons synchronize with one another over time. This measure of synchronization is the representation with which the CTM takes action and makes predictions. Listing 3 in the Technical Report shows the logic for this, and Appendix K details how we use a recursive computation for efficiency.
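To make the three ideas above concrete, here is a minimal sketch of a single internal tick in PyTorch. This is not the authors' implementation (see the GitHub repository for that): the module sizes, the single-layer stand-in for the per-neuron models, and the diagonal-only "synchronization" slice are simplifications for illustration.

```python
import torch
import torch.nn as nn

D, M, T = 64, 8, 10            # neurons, pre-activation history length, internal ticks (illustrative)
d_feat, n_classes = 32, 10     # feature width and number of classes (illustrative)

synapse = nn.Sequential(nn.Linear(D + d_feat, D), nn.GELU(), nn.Linear(D, D))  # stand-in for the U-Net-like synapse MLP
nlm_w = nn.Parameter(torch.randn(D, M) * 0.1)  # one private weight vector per neuron (simplified neuron-level models)
W_in = nn.Linear(D, d_feat)                    # synchronization -> attention query
W_out = nn.Linear(D, n_classes)                # synchronization -> logits

z = torch.zeros(1, D)                          # post-activations z^t
A = torch.zeros(1, D, M)                       # FIFO history of pre-activations A^t
Z = []                                         # post-activation history for synchronization
o = torch.zeros(1, d_feat)                     # attention output o^t
features = torch.randn(1, 16, d_feat)          # stand-in for FeatureExtractor(data)

for t in range(T):
    a = synapse(torch.cat([z, o], dim=-1))                   # pre-activations from (z^t, o^t)
    A = torch.cat([A[:, :, 1:], a.unsqueeze(-1)], dim=-1)    # roll the length-M history
    z = torch.tanh((A * nlm_w).sum(-1))                      # each neuron reads its own history
    Z.append(z)
    sync = torch.stack(Z, dim=-1).pow(2).mean(-1)            # diagonal (self-pair) synchronization only, for brevity
    q = W_in(sync)                                           # query from synchronization
    attn = torch.softmax(features @ q.unsqueeze(-1) / d_feat ** 0.5, dim=1)
    o = (attn * features).sum(1)                             # cross-attention output
    logits = W_out(sync)                                     # per-tick prediction
```

The real CTM samples pairs of distinct neurons for its synchronization representation and keeps separate pair sets for actions versus outputs, as described in the following sections.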
But what about data?
While data is undoubtedly crucial for any modeling, the CTM is designed around the idea of internal recurrence and synchronization, where the role of data is somewhat secondary to the internal process itself.
Input data is attended to and ingested at each internal tick based on the current synchronisation, and predictions are produced from synchronisation in the same way.
Internal ticks: the ‘thought’ dimension
We start by introducing the continuous internal dimension, indexed by internal ticks $t = 1, \dots, T$. Unlike conventional sequential models (such as RNNs or Transformers) that process inputs step-by-step according to the sequence inherent in the data (e.g., words in a sentence or frames in a video), the CTM operates along a self-generated timeline of internal thought steps. This internal unfolding allows the model to iteratively build and refine its representations, even when processing static or non-sequential data such as images or mazes. To conform with existing nomenclature used in related works, we refer to steps along this dimension as internal ticks.
A dimension over which thought can unfold.
The CTM’s internal dimension is that over which the dynamics of neural activity can unfold. We believe that such dynamics are likely a cornerstone of intelligent thought.
Recurrent weights: synapses
A recurrent multi-layer perceptron (MLP), structured in a U-Net fashion, acts as the CTM's synapse model, $f_{\theta_{\text{syn}}}$. At each internal tick it produces pre-activations from the current post-activations and attention output, $\mathbf{a}^t = f_{\theta_{\text{syn}}}(\text{concat}(\mathbf{z}^t, \mathbf{o}^t))$, where $\mathbf{o}^t$ is derived from input data. The $M$ most recent pre-activations are then collected into a pre-activation 'history', $\mathbf{A}^t = [\mathbf{a}^{t-M+1}, \dots, \mathbf{a}^t]$.
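A minimal PyTorch sketch of this step, under the assumption that a plain MLP stands in for the U-Net-like synapse model and that all sizes are illustrative:

```python
import torch
import torch.nn as nn

D, d_feat, M = 64, 32, 8                        # neurons, attention-output width, history length (illustrative)
f_syn = nn.Sequential(                          # stand-in for the U-Net-like synapse MLP
    nn.Linear(D + d_feat, 2 * D), nn.GELU(), nn.Linear(2 * D, D)
)

z_t = torch.zeros(1, D)       # current post-activations z^t
o_t = torch.zeros(1, d_feat)  # current attention output o^t
A_t = torch.zeros(1, D, M)    # FIFO history of the last M pre-activations

a_t = f_syn(torch.cat([z_t, o_t], dim=-1))                   # a^t = f_syn(concat(z^t, o^t))
A_t = torch.cat([A_t[:, :, 1:], a_t.unsqueeze(-1)], dim=-1)  # drop the oldest, append the newest
```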
Neuron-level models
$M$ effectively defines the length of the history of pre-activations that each neuron-level model works with. Each neuron, $d$, is then given its own privately parameterized MLP, $g_{\theta_d}$, that produces what we consider post-activations: $z_d^{t+1} = g_{\theta_d}(\mathbf{A}_d^t)$,
where $\theta_d$ are the unique parameters for neuron $d$, and $z_d^{t+1}$ is a single unit in the vector $\mathbf{z}^{t+1}$ that contains all post-activations. $\mathbf{A}_d^t$ is an $M$-dimensional vector (a time series of pre-activations). The full set of neuron post-activations is then concatenated with the attention output and fed recurrently into $f_{\theta_{\text{syn}}}$ to produce pre-activations for the next step, $t+1$, in the unfolding thought process.
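Per-neuron MLPs can be batched with grouped matrix multiplies, so every neuron applies its own weights to its own length-$M$ history. A hedged sketch (the two-layer shape and the hidden width are assumptions, not the authors' exact configuration):

```python
import torch
import torch.nn as nn

class NeuronLevelModels(nn.Module):
    """Each neuron d has a private 2-layer MLP over its length-M pre-activation history."""

    def __init__(self, D, M, H):
        super().__init__()
        self.W1 = nn.Parameter(torch.randn(D, M, H) * M ** -0.5)
        self.b1 = nn.Parameter(torch.zeros(D, H))
        self.W2 = nn.Parameter(torch.randn(D, H, 1) * H ** -0.5)
        self.b2 = nn.Parameter(torch.zeros(D, 1))

    def forward(self, A):                                    # A: (batch, D, M) histories
        h = torch.relu(torch.einsum('bdm,dmh->bdh', A, self.W1) + self.b1)
        z = torch.einsum('bdh,dho->bdo', h, self.W2) + self.b2
        return z.squeeze(-1)                                 # (batch, D) post-activations

nlm = NeuronLevelModels(D=64, M=8, H=4)                      # illustrative sizes
z_next = nlm(torch.randn(2, 64, 8))                          # one post-activation per neuron
```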
Synchronization as a representation: modulating data
How should the CTM interact with the outside world? Specifically, how should the CTM consume inputs and produce outputs? We introduced a timing dimension over which something akin to thought can unfold. We also want the CTM's relationship with data (its interaction, so to speak) to depend not on a snapshot of the state of neurons (at some single internal tick $t$), but rather on the ongoing temporal dynamics of neuron activities. By way of solution, we turn again to natural brains for inspiration and find the concept of neural synchronization.
The length of $\mathbf{Z}^t$ is equal to the current internal tick, meaning that this dimension is not fixed and can be arbitrarily large. We define neural synchronization as the matrix yielded by the inner dot product between post-activation histories: $\mathbf{S}^t = \mathbf{Z}^t \cdot (\mathbf{Z}^t)^{\top}$.
Since this matrix scales as $D \times D$, it makes practical sense to subsample row-column pairs $(i, j)$, which capture the synchronization between neurons $i$ and $j$. To do so we randomly select $D_{\text{out}}$ and $D_{\text{action}}$ pairs from $\mathbf{S}^t$, thus collecting two synchronization representations, $\mathbf{S}^t_{\text{out}}$ and $\mathbf{S}^t_{\text{action}}$. $\mathbf{S}^t_{\text{out}}$ can then be projected to an output space as $\mathbf{y}^t = \mathbf{W}_{\text{out}} \cdot \mathbf{S}^t_{\text{out}}$.
Synchronization enables a very large representation.
As the model width, $D$, grows, the synchronization representation grows as $\frac{D \times (D+1)}{2}$, offering opportunities for improved expressiveness without needing additional parameters to project a latent space to this size.
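A sketch of how the sampled synchronization representations could be computed, with random $(i, j)$ pairs fixed at initialization. This omits details from the Technical Report (e.g., the learnable temporal scaling and the recursive update of Appendix K), but the pair-wise dot products and the subsampling follow the description above:

```python
import torch

D, ticks = 64, 5                                   # model width and internal ticks elapsed (illustrative)
D_out, D_action = 16, 8                            # number of sampled neuron pairs (illustrative)

Z = torch.randn(1, D, ticks)                       # post-activation history Z^t: (batch, D, ticks)

# Full synchronization matrix: dot products between neuron time series (shown for clarity; O(D^2)).
S = torch.einsum('bit,bjt->bij', Z, Z)

# Row-column pairs chosen randomly once and kept fixed.
idx_out = torch.randint(0, D, (2, D_out))
idx_act = torch.randint(0, D, (2, D_action))
S_out = S[:, idx_out[0], idx_out[1]]               # (batch, D_out)
S_action = S[:, idx_act[0], idx_act[1]]            # (batch, D_action)

# Equivalent pair-only computation that never builds the full D x D matrix and can be
# maintained recursively by adding z_i^t * z_j^t at every new tick.
S_out_direct = (Z[:, idx_out[0]] * Z[:, idx_out[1]]).sum(-1)
assert torch.allclose(S_out, S_out_direct, atol=1e-5)
```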
Modulating input data
$\mathbf{S}^t_{\text{action}}$ can be used to take actions in the world (e.g., via attention, as in our setup): $\mathbf{q}^t = \mathbf{W}_{\text{in}} \cdot \mathbf{S}^t_{\text{action}}$,
where $\mathbf{W}_{\text{in}}$ and $\mathbf{W}_{\text{out}}$ are learned weight matrices that project synchronization into vectors for observation (e.g., attention queries, $\mathbf{q}^t$) or outputs (e.g., logits, $\mathbf{y}^t$). Even though there are $\frac{D \times (D+1)}{2}$ unique pairings in $\mathbf{S}^t$, $D_{\text{out}}$ and $D_{\text{action}}$ can be orders of magnitude smaller than this. That said, the full synchronization matrix is a large representation with high future potential.
In most of our experiments we used standard cross attention to ingest data: the synchronization-derived query $\mathbf{q}^t$ attends over keys and values produced by a 'FeatureExtractor' model, e.g., a ResNet, applied to the input data, yielding the attention output $\mathbf{o}^t$ that is fed back into the synapse model.
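A minimal sketch of this ingestion step (the small convolutional feature extractor stands in for a ResNet backbone; all sizes are illustrative):

```python
import torch
import torch.nn as nn

d_model, D_action, n_heads = 64, 8, 4                        # illustrative sizes

feature_extractor = nn.Sequential(                           # stand-in for a ResNet backbone
    nn.Conv2d(3, d_model, kernel_size=7, stride=4, padding=3), nn.GELU()
)
W_in = nn.Linear(D_action, d_model)                          # S_action -> attention query
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

image = torch.randn(1, 3, 64, 64)
S_action = torch.randn(1, D_action)                          # sampled synchronization (from the step above)

kv = feature_extractor(image).flatten(2).transpose(1, 2)     # (batch, H*W, d_model) keys/values
q = W_in(S_action).unsqueeze(1)                              # (batch, 1, d_model) query
o_t, _ = cross_attn(q, kv, kv)                               # cross-attention output o^t
o_t = o_t.squeeze(1)                                         # (batch, d_model), fed back into the synapse model
```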
Loss function: optimizing across internal ticks
The CTM produces outputs at each internal tick, $t = 1, \dots, T$. A key question arises: how do we optimize the model across this internal temporal dimension? Let $\mathbf{y}^t \in \mathbb{R}^C$ be the prediction vector (e.g., probabilities of classes) at internal tick $t$, where $C$ is the number of classes. Let $y_{\text{true}}$ be the ground-truth target. We can compute a loss at each internal tick using a standard loss function, such as cross-entropy: $\mathcal{L}^t = \text{CrossEntropy}(\mathbf{y}^t, y_{\text{true}})$,
and a corresponding certainty measure, $\mathcal{C}^t$. We compute certainty simply as 1 minus the normalised entropy of $\mathbf{y}^t$. We compute $\mathcal{L}^t$ and $\mathcal{C}^t$ for all $t \in \{1, \dots, T\}$, yielding losses and certainties per internal tick, $\{\mathcal{L}^t\}_{t=1}^{T}$ and $\{\mathcal{C}^t\}_{t=1}^{T}$.
A natural question arises: how should we reduce $\{\mathcal{L}^t\}_{t=1}^{T}$ into a scalar loss for learning? Our loss function is designed to optimize CTM performance across the internal thought dimension. Instead of relying on a single step (e.g., the last step), which can incentivize the model to only output at that specific step, we dynamically aggregate information from two internal ticks: the point of minimum loss and the point of maximum certainty:
- the point of minimum loss: $t_1 = \arg\min_t \mathcal{L}^t$; and
- the point of maximum certainty: $t_2 = \arg\max_t \mathcal{C}^t$.
This approach is advantageous because it means that the CTM can perform meaningful computations across multiple internal ticks, naturally facilitates a curriculum effect, and enables the CTM to tailor computation based on problem difficulty. The final loss is computed as the average of the losses at these two ticks: $\mathcal{L} = \frac{\mathcal{L}^{t_1} + \mathcal{L}^{t_2}}{2}$.
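A hedged sketch of this aggregation in PyTorch, assuming the per-tick logits have been stacked along the internal-tick dimension; the certainty and the min-loss / max-certainty averaging follow the description above:

```python
import torch
import torch.nn.functional as F

T, C = 10, 5                                    # internal ticks and classes (illustrative)
logits = torch.randn(2, T, C)                   # per-tick predictions y^t for a batch of 2
target = torch.tensor([1, 3])                   # ground-truth classes

# Per-tick cross-entropy losses: (batch, T)
losses = torch.stack(
    [F.cross_entropy(logits[:, t], target, reduction='none') for t in range(T)], dim=1
)

# Certainty = 1 - normalised entropy of each per-tick prediction: (batch, T)
probs = logits.softmax(-1)
entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
certainty = 1.0 - entropy / torch.log(torch.tensor(float(C)))

t1 = losses.argmin(dim=1)                       # tick of minimum loss
t2 = certainty.argmax(dim=1)                    # tick of maximum certainty
rows = torch.arange(losses.shape[0])
loss = 0.5 * (losses[rows, t1] + losses[rows, t2]).mean()   # aggregate over the two ticks, then the batch
```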
More information in our Technical Report.
Please take a look at the Technical Report for further details, including how we enable the CTM to learn short versus long time dependencies.
Experiment: ImageNet
Demonstrations
Results
This is a subset of results from our ImageNet experiments (see the Technical Report for more). Crucially, the CTM enables adaptive compute, where the number of internal ticks (how much thought the CTM puts into the problem) can be cut short. These figures show what can be expected in terms of accuracy when thinking is cut short: only marginal gains are had past a certain point, but gains nonetheless.
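Because the CTM emits a prediction and a certainty at every internal tick, inference can simply stop once certainty crosses a threshold. The sketch below illustrates such an early-exit loop; the `ctm_tick` interface and the threshold value are hypothetical, not the released API:

```python
import torch

def run_with_early_exit(ctm_tick, state, max_ticks=50, threshold=0.9):
    """Advance internal ticks until certainty crosses a threshold (hypothetical interface).

    `ctm_tick(state)` is assumed to perform one internal tick and return
    (new_state, logits, certainty), mirroring the per-tick outputs described above.
    """
    logits = None
    for t in range(max_ticks):
        state, logits, certainty = ctm_tick(state)
        if certainty.item() >= threshold:
            break                                # cut thinking short: adaptive compute
    return logits, t + 1                         # prediction and number of ticks actually used

# Toy stand-in whose certainty grows with each tick, just to exercise the loop.
def dummy_tick(state):
    state = state + 1
    return state, torch.randn(10), torch.tensor(min(1.0, 0.2 * state))

prediction, ticks_used = run_with_early_exit(dummy_tick, state=0)
```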
Fig. 4 shows where the CTM looks as it reasons about the data. We show the attention weights for all 16 heads and mark where the model is looking for each head (and, at the top, on average). The predictions are shown on the bottom left and certainty over time on the bottom right. Fig. 6 shows a visualization of neural activity as the CTM thinks about a single image: note the multi-scale structure and how activity seems to 'flow'.
Discussion
We never set out to train a model that achieved some remarkable new state-of-the-art performance on ImageNet. AI researchers already expect high performance on ImageNet after over a decade of research that uses it. Instead, we wanted to show just how different and interesting the CTM’s interaction with data can be. The videos on the left/above demonstrate the thought process the CTM undertakes and the figures show its benefits.
Let’s contextualize just what’s going on here: the CTM is looking around these images, all the while building up its prediction, all by using the synchronization of neural activity directly as a representation. The neural dynamics we showed earlier are actually examples of dynamics from a CTM observing ImageNet! The paths output by the CTM in the maze demo are akin to the class predictions made here.
The missing ingredient: TIME
Biological intelligence is still superior to AI in many cases, and we believe that the rich temporal dynamics underlying it are a key missing ingredient in modern models.
The details on model hyper-parameters can be found in the Technical Report.
Experiment: Solving 2D Mazes – doing it the hard way
The why and the how
Solving m
11 Comments
robwwilliams
Great to refocus on this important topic. So cool to see this bridge being built across fields.
In wet-ware it is hard not to think of "time" as linear Newtonian time driven by a clock. But in the context of brain-and-body, what really is critical is generating well-ordered sequences of acts and operations that are embedded in a thicker or thinner slice of "now" that can range from 300 msec of the "specious present" to 50 microseconds in cells that evaluate the sources of sound (the medial superior olivary nucleus).
For more context on contingent temporality see interview with RW Williams in this recent publication in The European Journal of Neuroscience by John Bickle:
https://pubmed.ncbi.nlm.nih.gov/40176364/
ttoinou
Ironically this webpage continuously refreshes itself on my firefox iOS :P
rvz
> The Continuous Thought Machine (CTM) is a neural network architecture that enables a novel approach to thinking about data. It departs from conventional feed-forward models by explicitly incorporating the concept of Neural Dynamics as the central component to its functionality.
Still going through the paper, but it is very exciting to actually see the internal recurrence visualised in action when confronting a task (such as the 2D puzzle), making it easier to interpret neural networks over several tasks involving 'time'.
(This internal recurrence may not be new, but applying neural synchronization as described in this paper is).
> Indeed, we observe the emergence of interpretable and intuitive problem-solving strategies, suggesting that leveraging neural timing can lead to more emergent benefits and potentially more effective AI systems
Exactly. Would like to see more applications of this in existing or new architectures that can also give us additional transparency into the thought process on many tasks.
Another great paper from Sakana.
coolcase
I love the ML diagrams that hybridize maths and architecture. It is much less dry than purely formal math.
dcrimp
I'm quite enthusiastic about reading this. Watching the progress by the larger LLM labs, I've noted that they're not making the material changes in model configuration that I think are necessary to move toward more refined and capable intelligence. They're adding tools and widgets to things we know don't think like a biological brain. These are really useful things from a commercial perspective, but I think LLMs won't be an enduring paradigm, at least wrt genuine stabs at artificial intelligence. I've been surprised that there hasn't been more effort put into transformative work like that in the linked article.
The things that hang me up about current progress toward intelligence are that:
– there don't seem to be models which possess continuous thought. Models are alive during a forward pass on their way to produce a token and brain-dead any other time
– there don't seem to be many models that have neural memory
– there doesn't seem to be any form of continuous learning. To be fair, the whole online training thing is pretty uncommon as I understand it.
Reasoning in token space is handy for evals, but is lossy – you throw away all the rest of the info when you sample. I think Meta had a paper on continuous thought in latent space, but I don't think effort in that has continued to anything commercialised.
Somehow, our biological brains are capable of super efficiently doing very intelligent stuff. We have a known-good example, but research toward mimicking that example is weirdly lacking?
All the magic happens in the neural net, right? But we keep wrapping nets with tools we've designed with our own inductive biases, rather than expanding the horizon of what a net can do and empowering it to do that.
Recently I've been looking into SNNs, which feel like a bit of a tech demo, as well as neuromorphic computing, which I think holds some promise for this sort of thing, but doesn't get much press (or, presumably, budget?)
(Apologies for ramble, writing on my phone)
liamwire
Seems really interesting, and the in-browser demo and model were a really great hook to get interest in the rest of the research. I'm only partially through it but the idea itself is compelling.
erewhile
The ideas behind these machines aren't entirely new. There's some research from 2002, where Liquid State Machines (LSMs) are introduced[1]. These are networks that generally rely on continuous inputs into spiking neural networks, which are then read by some dense layer that connects to all the neurons in this network to read what is called the liquid state.
These LSMs have also been used for other tasks, like playing Atari games in a paper from 2019[2], where they show that while these networks can sometimes outperform humans, they don't always, and they tend to fail at the same things that more conventional neural networks failed at at the time. They don't outperform those conventional networks overall, though.
Honestly, I'd be excited to see more research going into continuous processing of inputs (e.g., audio) with continuous outputs, and training full spiking neural networks based on neurons on that idea. We understand some of the ideas of plasticity, and they have been applied in this kind of research, but I'm not aware of anyone creating networks like this with just the kinds of plasticity we see in the brain, with no back propagation or similar algorithms. I've tried this myself, but I think I either have a misunderstanding of how things work in our brains, or we just don't have the full picture yet.
[1] doi.org/10.1162/089976602760407955
[2] doi.org/10.3389/fnins.2019.00883
AIorNot
Can someone explain this paper in the context of LLM architectures? It seems this cannot be combined with LLM deep learning, or can it?
davedx
So this weekend we have:
– Continuous thought machines: temporally encoding neural networks (more like how biological brains work)
– Zero data reasoning: (coding) AI that learns from doing, instead of by being trained on giant data sets
– Intellect-2: a globally distributed RL architecture
I am not an expert in the field but this feels like we just bunny hopped a little closer to the singularity…
iandanforth
This paper is concerning. While divorced from the standard ML literature, there is a lot of work on biologically plausible spiking, timing-dependent artificial neural networks. The nomenclature here doesn't seem to acknowledge that body of work. Instead it appears as a step toward that bulk of research, coming from the ML/LLM field without a clear appreciation of the ground well traveled there.*
In addition some of the terminology is likely to cause confusion. By calling a synaptic integration step "thinking" the authors are going to confuse a lot of people. Instead of the process of forming an idea, evaluating that idea, potentially modifying it and repeating (what a layman would call thinking) they are trying to ascribe "thinking" to single unit processes! That's a pretty radical departure from both ML and ANN literature. Pattern recognition/signal discrimination is well known at the level of synaptic integration and firing, but "thinking?" No, that wording is not helpful.
*I have not reviewed all the citations and am reacting to the plain language of the text as someone familiar with both lines of research.