
Continuous Thought Machines by hardmaru
Interactive demonstration
(An in-browser CTM maze-solving demo appears here on the original page; it requires JavaScript. Click to move the Start/End points, toggled with 'move'.)
Introduction
Neural networks (NNs) were originally inspired by biological brains, yet they remain significantly distinct from their biological counterparts. Brains demonstrate complex neural dynamics that evolve over time, but modern NNs intentionally abstract away such temporal dynamics in order to facilitate large-scale deep learning. For instance, the activation functions of standard NNs can be seen as an intentional abstraction of a neuron's firing rate, replacing the temporal dynamics of biological processes with a single, static value. Such simplifications, though enabling significant advancements in large-scale machine learning, have widened the divide between artificial networks and the temporal computation of their biological counterparts.
Over hundreds of millions of years, evolution has endowed biological brains with rich neural dynamics, including spike-timing-dependent plasticity (STDP) and the synchronization of neural activity over time; these temporal mechanisms appear central to how brains process information.
Why do this research?
Indeed, the notably high performance of modern AI across many fields might suggest that emulating neural dynamics is unwarranted. However, the gap between the highly flexible and general nature of human cognition and the current state of modern AI points to missing components in our current models.
For these reasons, we argue that time should be a central component of artificial intelligence in order for it to eventually achieve levels of competency that rival or surpass human brains. Our main contributions are:
- We introduce a decoupled internal dimension, a novel approach to modeling the temporal evolution of neural activity. We view this dimension as that over which thought can unfold in an artificial neural system, hence the choice of nomenclature.
- We provide a mid-level abstraction for neurons, which we call neuron-level models (NLMs), where every neuron has its own internal weights that process a history of incoming signals (i.e., pre-activations) to activate (as opposed to a static ReLU, for example).
- We use neural synchronization directly as the latent representation with which the CTM observes (e.g., through an attention query) and predicts (e.g., via a projection to logits). This biologically-inspired design choice puts forward neural activity as the crucial element for any manifestation of intelligence the CTM might demonstrate.
Reasoning models and recurrence
The frontier of artificial intelligence faces a critical juncture: moving beyond simple input-output mappings towards genuine reasoning capabilities. While scaling existing models has yielded remarkable advancements, the associated computational cost and data demands are unsustainable and raise questions about the long-term viability of this approach. For sequential data, longstanding recurrent architectures such as RNNs and LSTMs already model computation that unfolds over time, and recent reasoning models revisit this idea by spending additional computation at inference time before committing to an answer.
Method
The Continuous Thought Machine (CTM) is a neural network architecture that enables a novel approach to thinking about data. It departs from conventional feed-forward models by explicitly incorporating the concept of neural dynamics as the central component of its functionality. The video above gives a pictorial overview of the internal workings of the CTM. We give all technical details, including additional figures and verbose explanations, in our Technical Report. A GitHub repository is also available. We will provide links to relevant parts of the repository as we explain the model below.

| Variable | Description |
|---|---|
| $\mathbf{z}^t$ | Post-activations at internal tick $t$, after the neuron-level models have been applied. |
| $\theta_{\text{syn}}$ | Recurrent (synapse) model weights; a U-Net-like architecture that connects neurons at a given internal tick, $t$. |
| $\mathbf{a}^t$ | Pre-activations at internal tick $t$. |
| $\mathbf{A}^t$ | History of the most recent pre-activations, kept as a FIFO list so that it is always length $M$; input to the neuron-level models. |
| $\theta_d$ | Weights of a single neuron-level model, $d$ of $D$; MLP architecture, with unique weights per neuron. |
| $\mathbf{Z}^t$ | History of all post-activations up to this internal tick (variable length); used as input for the synchronization dot products. |
| $\mathbf{S}^t$ | Synchronization matrix at internal tick $t$. In practice we use far fewer neurons than $D$ for the separate $\mathbf{S}^t_{\text{out}}$ and $\mathbf{S}^t_{\text{action}}$ synchronization representations. |
| $\mathbf{W}_{\text{out}}$, $\mathbf{W}_{\text{in}}$ | Linear weight matrices that project $\mathbf{S}^t_{\text{out}}$ and $\mathbf{S}^t_{\text{action}}$ to predictions and attention queries, respectively. |
| $\mathbf{o}^t$ | Cross-attention output. |
The CTM consists of three main ideas:
- The use of internal recurrence, enabling a dimension over which a concept analogous to thought can occur. The entire process visualised in the video above is a single tick; the interactive maze demo at the top of the page uses 75 ticks. This recurrence is completely decoupled from any data dimensions.
- Neuron-level models, that compute post-activations by applying private (i.e., on a per-neuron basis) MLP models to a history of incoming pre-activations.
- Synchronization as a representation, where the neural activity over time is tracked and used to compute how pairs of neurons synchronize with one another over time. This measure of synchronization is the representation with which the CTM takes action and makes predictions. Listing 3 in the Technical Report shows the logic for this, and Appendix K details how we use a recursive computation for efficiency.
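To make the three ideas above concrete, here is a minimal sketch of a single internal tick in PyTorch. This is not the authors' implementation (see the GitHub repository for that): the module sizes, the single-layer stand-in for the per-neuron models, and the diagonal-only "synchronization" slice are simplifications for illustration.

```python
import torch
import torch.nn as nn

D, M, T = 64, 8, 10            # neurons, pre-activation history length, internal ticks (illustrative)
d_feat, n_classes = 32, 10     # feature width and number of classes (illustrative)

synapse = nn.Sequential(nn.Linear(D + d_feat, D), nn.GELU(), nn.Linear(D, D))  # stand-in for the U-Net-like synapse MLP
nlm_w = nn.Parameter(torch.randn(D, M) * 0.1)  # one private weight vector per neuron (simplified neuron-level models)
W_in = nn.Linear(D, d_feat)                    # synchronization -> attention query
W_out = nn.Linear(D, n_classes)                # synchronization -> logits

z = torch.zeros(1, D)                          # post-activations z^t
A = torch.zeros(1, D, M)                       # FIFO history of pre-activations A^t
Z = []                                         # post-activation history for synchronization
o = torch.zeros(1, d_feat)                     # attention output o^t
features = torch.randn(1, 16, d_feat)          # stand-in for FeatureExtractor(data)

for t in range(T):
    a = synapse(torch.cat([z, o], dim=-1))                   # pre-activations from (z^t, o^t)
    A = torch.cat([A[:, :, 1:], a.unsqueeze(-1)], dim=-1)    # roll the length-M history
    z = torch.tanh((A * nlm_w).sum(-1))                      # each neuron reads its own history
    Z.append(z)
    sync = torch.stack(Z, dim=-1).pow(2).mean(-1)            # diagonal (self-pair) synchronization only, for brevity
    q = W_in(sync)                                           # query from synchronization
    attn = torch.softmax(features @ q.unsqueeze(-1) / d_feat ** 0.5, dim=1)
    o = (attn * features).sum(1)                             # cross-attention output
    logits = W_out(sync)                                     # per-tick prediction
```

The real CTM samples pairs of distinct neurons for its synchronization representation and keeps separate pair sets for actions versus outputs, as described in the following sections.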
But what about data?
While data is undoubtedly crucial for any modeling, the CTM is designed around the idea of internal recurrence and synchronization, where the role of data is somewhat secondary to the internal process itself.
Input data is attended to and ingested at each internal tick based on the current synchronisation, and predictions are produced from synchronisation in the same way.
Internal ticks: the ‘thought’ dimension
We start by introducing the continuous internal dimension, indexed by internal ticks $t = 1, \dots, T$. Unlike conventional sequential models (such as RNNs or Transformers) that process inputs step-by-step according to the sequence inherent in the data (e.g., words in a sentence or frames in a video), the CTM operates along a self-generated timeline of internal thought steps. This internal unfolding allows the model to iteratively build and refine its representations, even when processing static or non-sequential data such as images or mazes. To conform with existing nomenclature used in related works, we refer to steps along this dimension as internal ticks.
A dimension over which thought can unfold.
The CTM’s internal dimension is that over which the dynamics of neural activity can unfold. We believe that such dynamics are likely a cornerstone of intelligent thought.
Recurrent weights: synapses
A recurrent multi-layer perceptron (MLP), structured in a U-Net fashion, acts as the CTM's synapse model, $f_{\theta_{\text{syn}}}$. At each internal tick it produces pre-activations from the current post-activations and attention output, $\mathbf{a}^t = f_{\theta_{\text{syn}}}(\text{concat}(\mathbf{z}^t, \mathbf{o}^t))$, where $\mathbf{o}^t$ is derived from input data. The $M$ most recent pre-activations are then collected into a pre-activation 'history', $\mathbf{A}^t = [\mathbf{a}^{t-M+1}, \dots, \mathbf{a}^t]$.
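A minimal PyTorch sketch of this step, under the assumption that a plain MLP stands in for the U-Net-like synapse model and that all sizes are illustrative:

```python
import torch
import torch.nn as nn

D, d_feat, M = 64, 32, 8                        # neurons, attention-output width, history length (illustrative)
f_syn = nn.Sequential(                          # stand-in for the U-Net-like synapse MLP
    nn.Linear(D + d_feat, 2 * D), nn.GELU(), nn.Linear(2 * D, D)
)

z_t = torch.zeros(1, D)       # current post-activations z^t
o_t = torch.zeros(1, d_feat)  # current attention output o^t
A_t = torch.zeros(1, D, M)    # FIFO history of the last M pre-activations

a_t = f_syn(torch.cat([z_t, o_t], dim=-1))                   # a^t = f_syn(concat(z^t, o^t))
A_t = torch.cat([A_t[:, :, 1:], a_t.unsqueeze(-1)], dim=-1)  # drop the oldest, append the newest
```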
Neuron-level models
$M$ effectively defines the length of the history of pre-activations that each neuron-level model works with. Each neuron, $d$, is then given its own privately parameterized MLP, $g_{\theta_d}$, that produces what we consider post-activations: $z_d^{t+1} = g_{\theta_d}(\mathbf{A}_d^t)$,
where $\theta_d$ are the unique parameters for neuron $d$, and $z_d^{t+1}$ is a single unit in the vector $\mathbf{z}^{t+1}$ that contains all post-activations. $\mathbf{A}_d^t$ is an $M$-dimensional vector (a time series of pre-activations). The full set of neuron post-activations is then concatenated with the attention output and fed recurrently into $f_{\theta_{\text{syn}}}$ to produce pre-activations for the next step, $t+1$, in the unfolding thought process.
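Per-neuron MLPs can be batched with grouped matrix multiplies, so every neuron applies its own weights to its own length-$M$ history. A hedged sketch (the two-layer shape and the hidden width are assumptions, not the authors' exact configuration):

```python
import torch
import torch.nn as nn

class NeuronLevelModels(nn.Module):
    """Each neuron d has a private 2-layer MLP over its length-M pre-activation history."""

    def __init__(self, D, M, H):
        super().__init__()
        self.W1 = nn.Parameter(torch.randn(D, M, H) * M ** -0.5)
        self.b1 = nn.Parameter(torch.zeros(D, H))
        self.W2 = nn.Parameter(torch.randn(D, H, 1) * H ** -0.5)
        self.b2 = nn.Parameter(torch.zeros(D, 1))

    def forward(self, A):                                    # A: (batch, D, M) histories
        h = torch.relu(torch.einsum('bdm,dmh->bdh', A, self.W1) + self.b1)
        z = torch.einsum('bdh,dho->bdo', h, self.W2) + self.b2
        return z.squeeze(-1)                                 # (batch, D) post-activations

nlm = NeuronLevelModels(D=64, M=8, H=4)                      # illustrative sizes
z_next = nlm(torch.randn(2, 64, 8))                          # one post-activation per neuron
```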
Synchronization as a representation: modulating data
How should the CTM interact with the outside world? Specifically, how should the CTM consume inputs and produce outputs? We introduced a timing dimension over which something akin to thought can unfold. We also want the CTM's relationship with data (its interaction, so to speak) to depend not on a snapshot of the state of neurons (at some single internal tick $t$), but rather on the ongoing temporal dynamics of neuron activities. By way of solution, we turn again to natural brains for inspiration and find the concept of neural synchronization.
The length of $\mathbf{Z}^t$ is equal to the current internal tick, meaning that this dimension is not fixed and can be arbitrarily large. We define neural synchronization as the matrix yielded by the inner dot product between post-activation histories: $\mathbf{S}^t = \mathbf{Z}^t \cdot (\mathbf{Z}^t)^{\top}$.
Since this matrix scales as $D \times D$, it makes practical sense to subsample row-column pairs $(i, j)$, which capture the synchronization between neurons $i$ and $j$. To do so we randomly select $D_{\text{out}}$ and $D_{\text{action}}$ pairs from $\mathbf{S}^t$, thus collecting two synchronization representations, $\mathbf{S}^t_{\text{out}}$ and $\mathbf{S}^t_{\text{action}}$. $\mathbf{S}^t_{\text{out}}$ can then be projected to an output space as $\mathbf{y}^t = \mathbf{W}_{\text{out}} \cdot \mathbf{S}^t_{\text{out}}$.
Synchronization enables a very large representation.
As the model width, $D$, grows, the synchronization representation grows as $\frac{D \times (D+1)}{2}$, offering opportunities for improved expressiveness without needing additional parameters to project a latent space to this size.
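A sketch of how the sampled synchronization representations could be computed, with random $(i, j)$ pairs fixed at initialization. This omits details from the Technical Report (e.g., the learnable temporal scaling and the recursive update of Appendix K), but the pair-wise dot products and the subsampling follow the description above:

```python
import torch

D, ticks = 64, 5                                   # model width and internal ticks elapsed (illustrative)
D_out, D_action = 16, 8                            # number of sampled neuron pairs (illustrative)

Z = torch.randn(1, D, ticks)                       # post-activation history Z^t: (batch, D, ticks)

# Full synchronization matrix: dot products between neuron time series (shown for clarity; O(D^2)).
S = torch.einsum('bit,bjt->bij', Z, Z)

# Row-column pairs chosen randomly once and kept fixed.
idx_out = torch.randint(0, D, (2, D_out))
idx_act = torch.randint(0, D, (2, D_action))
S_out = S[:, idx_out[0], idx_out[1]]               # (batch, D_out)
S_action = S[:, idx_act[0], idx_act[1]]            # (batch, D_action)

# Equivalent pair-only computation that never builds the full D x D matrix and can be
# maintained recursively by adding z_i^t * z_j^t at every new tick.
S_out_direct = (Z[:, idx_out[0]] * Z[:, idx_out[1]]).sum(-1)
assert torch.allclose(S_out, S_out_direct, atol=1e-5)
```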
Modulating input data
$\mathbf{S}^t_{\text{action}}$ can be used to take actions in the world (e.g., via attention, as in our setup): $\mathbf{q}^t = \mathbf{W}_{\text{in}} \cdot \mathbf{S}^t_{\text{action}}$,
where $\mathbf{W}_{\text{in}}$ and $\mathbf{W}_{\text{out}}$ are learned weight matrices that project synchronization into vectors for observation (e.g., attention queries, $\mathbf{q}^t$) or outputs (e.g., logits, $\mathbf{y}^t$). Even though there are $\frac{D \times (D+1)}{2}$ unique pairings in $\mathbf{S}^t$, $D_{\text{out}}$ and $D_{\text{action}}$ can be orders of magnitude smaller than this. That said, the full synchronization matrix is a large representation with high future potential.
In most of our experiments we used standard cross attention to ingest data: the synchronization-derived query $\mathbf{q}^t$ attends over keys and values produced by a 'FeatureExtractor' model, e.g., a ResNet, applied to the input data, yielding the attention output $\mathbf{o}^t$ that is fed back into the synapse model.
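A minimal sketch of this ingestion step (the small convolutional feature extractor stands in for a ResNet backbone; all sizes are illustrative):

```python
import torch
import torch.nn as nn

d_model, D_action, n_heads = 64, 8, 4                        # illustrative sizes

feature_extractor = nn.Sequential(                           # stand-in for a ResNet backbone
    nn.Conv2d(3, d_model, kernel_size=7, stride=4, padding=3), nn.GELU()
)
W_in = nn.Linear(D_action, d_model)                          # S_action -> attention query
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

image = torch.randn(1, 3, 64, 64)
S_action = torch.randn(1, D_action)                          # sampled synchronization (from the step above)

kv = feature_extractor(image).flatten(2).transpose(1, 2)     # (batch, H*W, d_model) keys/values
q = W_in(S_action).unsqueeze(1)                              # (batch, 1, d_model) query
o_t, _ = cross_attn(q, kv, kv)                               # cross-attention output o^t
o_t = o_t.squeeze(1)                                         # (batch, d_model), fed back into the synapse model
```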
Loss function: optimizing across internal ticks
The CTM produces outputs at each internal tick, $t = 1, \dots, T$. A key question arises: how do we optimize the model across this internal temporal dimension? Let $\mathbf{y}^t \in \mathbb{R}^C$ be the prediction vector (e.g., probabilities of classes) at internal tick $t$, where $C$ is the number of classes. Let $y_{\text{true}}$ be the ground-truth target. We can compute a loss at each internal tick using a standard loss function, such as cross-entropy: $\mathcal{L}^t = \text{CrossEntropy}(\mathbf{y}^t, y_{\text{true}})$,
and a corresponding certainty measure, $\mathcal{C}^t$. We compute certainty simply as 1 minus the normalised entropy of $\mathbf{y}^t$. We compute $\mathcal{L}^t$ and $\mathcal{C}^t$ for all $t \in \{1, \dots, T\}$, yielding losses and certainties per internal tick, $\{\mathcal{L}^t\}_{t=1}^{T}$ and $\{\mathcal{C}^t\}_{t=1}^{T}$.
A natural question arises: how should we reduce $\{\mathcal{L}^t\}_{t=1}^{T}$ into a scalar loss for learning? Our loss function is designed to optimize CTM performance across the internal thought dimension. Instead of relying on a single step (e.g., the last step), which can incentivize the model to only output at that specific step, we dynamically aggregate information from two internal ticks: the point of minimum loss and the point of maximum certainty:
- the point of minimum loss: $t_1 = \arg\min_t \mathcal{L}^t$; and
- the point of maximum certainty: $t_2 = \arg\max_t \mathcal{C}^t$.
This approach is advantageous because it means that the CTM can perform meaningful computations across multiple internal ticks, naturally facilitates a curriculum effect, and enables the CTM to tailor computation based on problem difficulty. The final loss is computed as the average of the losses at these two ticks: $\mathcal{L} = \frac{\mathcal{L}^{t_1} + \mathcal{L}^{t_2}}{2}$.
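A hedged sketch of this aggregation in PyTorch, assuming the per-tick logits have been stacked along the internal-tick dimension; the certainty and the min-loss / max-certainty averaging follow the description above:

```python
import torch
import torch.nn.functional as F

T, C = 10, 5                                    # internal ticks and classes (illustrative)
logits = torch.randn(2, T, C)                   # per-tick predictions y^t for a batch of 2
target = torch.tensor([1, 3])                   # ground-truth classes

# Per-tick cross-entropy losses: (batch, T)
losses = torch.stack(
    [F.cross_entropy(logits[:, t], target, reduction='none') for t in range(T)], dim=1
)

# Certainty = 1 - normalised entropy of each per-tick prediction: (batch, T)
probs = logits.softmax(-1)
entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
certainty = 1.0 - entropy / torch.log(torch.tensor(float(C)))

t1 = losses.argmin(dim=1)                       # tick of minimum loss
t2 = certainty.argmax(dim=1)                    # tick of maximum certainty
rows = torch.arange(losses.shape[0])
loss = 0.5 * (losses[rows, t1] + losses[rows, t2]).mean()   # aggregate over the two ticks, then the batch
```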
More information in our Technical Report.
Please take a look at the Technical Report for further details, including how we enable the CTM to learn short versus long time dependencies.
Experiment: ImageNet
Demonstrations
Results
This is a subset of results from our ImageNet experiments (see the Technical Report for more). Crucially, the CTM enables adaptive compute, where the number of internal ticks (how much thought the CTM puts into the problem) can be cut short. These figures show what can be expected in terms of accuracy when thinking is cut short: only marginal gains are had past a certain point, but gains nonetheless.
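Because the CTM emits a prediction and a certainty at every internal tick, inference can simply stop once certainty crosses a threshold. The sketch below illustrates such an early-exit loop; the `ctm_tick` interface and the threshold value are hypothetical, not the released API:

```python
import torch

def run_with_early_exit(ctm_tick, state, max_ticks=50, threshold=0.9):
    """Advance internal ticks until certainty crosses a threshold (hypothetical interface).

    `ctm_tick(state)` is assumed to perform one internal tick and return
    (new_state, logits, certainty), mirroring the per-tick outputs described above.
    """
    logits = None
    for t in range(max_ticks):
        state, logits, certainty = ctm_tick(state)
        if certainty.item() >= threshold:
            break                                # cut thinking short: adaptive compute
    return logits, t + 1                         # prediction and number of ticks actually used

# Toy stand-in whose certainty grows with each tick, just to exercise the loop.
def dummy_tick(state):
    state = state + 1
    return state, torch.randn(10), torch.tensor(min(1.0, 0.2 * state))

prediction, ticks_used = run_with_early_exit(dummy_tick, state=0)
```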
Fig. 4 shows where the CTM looks as it reasons about the data. We show the attention weights for all 16 heads and mark where the model is looking for each head (and, at the top, on average). The predictions are shown on the bottom left and certainty over time on the bottom right. Fig. 6 shows a visualization of neural activity as the CTM thinks about a single image: note the multi-scale structure and how activity seems to 'flow'.
Discussion
We never set out to train a model that achieved some remarkable new state-of-the-art performance on ImageNet. AI researchers already expect high performance on ImageNet after over a decade of research that uses it. Instead, we wanted to show just how different and interesting the CTM’s interaction with data can be. The videos on the left/above demonstrate the thought process the CTM undertakes and the figures show its benefits.
Let’s contextualize just what’s going on here: the CTM is looking around these images, all the while building up its prediction, all by using the synchronization of neural activity directly as a representation. The neural dynamics we showed earlier are actually examples of dynamics from a CTM observing ImageNet! The paths output by the CTM in the maze demo are akin to the class predictions made here.
The missing ingredient: TIME
Biological intelligence is still superior to AI in many cases, and we believe that the rich temporal dynamics underlying it are a key missing ingredient in modern models.
The details on model hyper-parameters can be found in the Technical Report.
Experiment: Solving 2D Mazes – doing it the hard way
The why and the how
Solving m
11 Comments
robwwilliams
Great to refocus on this important topic. So cool to see this bridge being built across fields.
In wet-ware it is hard not to think of "time" as linear Newtonian time driven by a clock. But in the context of brain-and-body, what really is critical is generating well-ordered sequences of acts and operations that are embedded in a thicker or thinner slice of "now" that can range from 300 msec of the "specious present" to 50 microseconds in cells that evaluate the sources of sound (the medial superior olivary nucleus).
For more context on contingent temporality see interview with RW Williams in this recent publication in The European Journal of Neuroscience by John Bickle:
https://pubmed.ncbi.nlm.nih.gov/40176364/
ttoinou
Ironically this webpage continuously refreshes itself on my firefox iOS :P
rvz
> The Continuous Thought Machine (CTM) is a neural network architecture that enables a novel approach to thinking about data. It departs from conventional feed-forward models by explicitly incorporating the concept of Neural Dynamics as the central component to its functionality.
Still going through the paper, but it is very exciting to actually see the internal recurrence visualised in action when confronting a task (such as the 2D puzzle), making it easier to interpret neural networks over several tasks involving 'time'.
(This internal recurrence may not be new, but applying neural synchronization as described in this paper is).
> Indeed, we observe the emergence of interpretable and intuitive problem-solving strategies, suggesting that leveraging neural timing can lead to more emergent benefits and potentially more effective AI systems
Exactly. Would like to see more applications of this in existing or new architectures that can also give us additional transparency into the thought process on many tasks.
Another great paper from Sakana.
coolcase
I love the ML diagrams that hybridize maths and architecture. It is much less dry than purely formal math.
dcrimp
I'm quite enthusiastic about reading this. Watching the progress by the larger LLM labs, I've noted that they're not making the material changes in model configuration that I think are necessary to move toward more refined and capable intelligence. They're adding tools and widgets to things we know don't think like a biological brain. These are really useful things from a commercial perspective, but I think LLMs won't be an enduring paradigm, at least wrt genuine stabs at artificial intelligence. I've been surprised that there hasn't been more effort put into transformative work like that in the linked article.
The things that hang me up about current progress toward intelligence are that:
– there don't seem to be models which possess continuous thought. Models are alive during a forward pass on their way to produce a token and brain-dead any other time
– there don't seem to be many models that have neural memory
– there doesn't seem to be any form of continuous learning. To be fair, the whole online training thing is pretty uncommon as I understand it.
Reasoning in token space is handy for evals, but is lossy – you throw away all the rest of the info when you sample. I think Meta had a paper on continuous thought in latent space, but I don't think effort in that has continued to anything commercialised.
Somehow, our biological brains are capable of super efficiently doing very intelligent stuff. We have a known-good example, but research toward mimicking that example is weirdly lacking?
All the magic happens in the neural net, right? But we keep wrapping nets with tools we've designed with our own inductive biases, rather than expanding the horizon of what a net can do and empowering it to do that.
Recently I've been looking into SNNs, which feel like a bit of a tech demo, as well as neuromorphic computing, which I think holds some promise for this sort of thing, but doesn't get much press (or, presumably, budget?)
(Apologies for ramble, writing on my phone)
liamwire
Seems really interesting, and the in-browser demo and model were a really great hook to get interest in the rest of the research. I'm only partially through it but the idea itself is compelling.
erewhile
The ideas behind these machines aren't entirely new. There's some research from 2002, where Liquid State Machines (LSMs) are introduced[1]. These are networks that generally rely on continuous inputs into spiking neural networks, which are then read by some dense layer that connects to all the neurons in this network to read what is called the liquid state.
These LSMs have also been used for other tasks, like playing Atari games in a paper from 2019[2], where they show that while these networks can sometimes outperform humans, they don't always, and they tend to fail at the same things that more conventional neural networks failed at at the time. They don't outperform those conventional networks overall, though.
Honestly, I'd be excited to see more research going into continuous processing of inputs (e.g., audio) with continuous outputs, and training full spiking neural networks based on neurons on that idea. We understand some of the ideas of plasticity, and they have been applied in this kind of research, but I'm not aware of anyone creating networks like this with just the kinds of plasticity we see in the brain, with no back propagation or similar algorithms. I've tried this myself, but I think I either have a misunderstanding of how things work in our brains, or we just don't have the full picture yet.
[1] doi.org/10.1162/089976602760407955
[2] doi.org/10.3389/fnins.2019.00883
AIorNot
Can someone explain this paper in the context of LLM architectures? It seems this cannot be combined with LLM deep learning, or can it?
davedx
So this weekend we have:
– Continuous thought machines: temporally encoding neural networks (more like how biological brains work)
– Zero data reasoning: (coding) AI that learns from doing, instead of by being trained on giant data sets
– Intellect-2: a globally distributed RL architecture
I am not an expert in the field but this feels like we just bunny hopped a little closer to the singularity…
iandanforth
This paper is concerning. While divorced from the standard ML literature, there is a lot of work on biologically plausible spiking, timing-dependent artificial neural networks. The nomenclature here doesn't seem to acknowledge that body of work. Instead it appears as a step toward that bulk of research, coming from the ML/LLM field without a clear appreciation of the ground well traveled there.*
In addition some of the terminology is likely to cause confusion. By calling a synaptic integration step "thinking" the authors are going to confuse a lot of people. Instead of the process of forming an idea, evaluating that idea, potentially modifying it and repeating (what a layman would call thinking) they are trying to ascribe "thinking" to single unit processes! That's a pretty radical departure from both ML and ANN literature. Pattern recognition/signal discrimination is well known at the level of synaptic integration and firing, but "thinking?" No, that wording is not helpful.
*I have not reviewed all the citations and am reacting to the plain language of the text as someone familiar with both lines of research.