ARC Prize remains undefeated.
New ideas still needed.
R1-Zero is more important than R1
Special thanks to Tuhin and Abu from Baseten and Yuchen from Hyperbolic Labs for hosting r1-zero for us. Hardly any providers are hosting this model variant, and its availability is important for research purposes.
ARC Prize Foundation’s goal is to define, measure, and inspire new ideas towards AGI. To this end, we strive to create the strongest global innovation environment possible.
We do not have AGI yet and are still innovation-constrained: scaling up pure LLM pretraining is not the path, despite this being the dominant AI industry narrative and mainstream public view as of last summer.
Narratives matter because they end up driving economic activity: investment, research focus, funding, geopolitics, trade, and so on. For example, in 2023-24 there was ~$20B invested into new LLM startups, compared to only ~$200M into new AGI startups.
We launched ARC Prize 2024 last June to grow awareness of the limits of scaling LLMs and to promote a useful benchmark, ARC-AGI-1, pointing toward a new direction: AI systems that adapt to novel, unseen problems rather than relying strictly on memorization.

Last week, DeepSeek published their new R1-Zero and R1 “reasoner” systems, which are competitive with OpenAI’s o1 system on ARC-AGI-1. R1-Zero, R1, and o1 (low compute) all score around 15-20%, in contrast to GPT-4o’s 5%, the pinnacle of years of pure LLM scaling. Based on this week’s US market reaction, the public is starting to understand the limits of scaling pure LLMs too. However, there is still broad public ignorance about impending inference demand.
In December 2024, OpenAI announced a new breakthrough o3 system, which we verified. It scored 76% in a low-compute mode and 88% in a high-compute mode. The o3 system demonstrates the first practical, general implementation of a computer adapting to novel, unseen problems.
Despite being huge tech news, o3 beating ARC-AGI-1 went largely unnoticed and unreported by mainstream press.
This is an incredibly important moment for the field of AI and for computer science and these systems demand study. But due to the closed nature of o1/o3, we’re forced to rely on speculation. Thanks to ARC-AGI-1 and now (nearly) open source R1-Zero and R1, we can add to our understanding. In particular, R1-Zero is significantly more important than R1.
“Nearly” because DeepSeek did not publish a reproducible way to generate their model weights from scratch.
R1-Zero removes the human bottleneck
In our o1 and o3 analysis, we speculated about how these reasoning systems work. The key ideas:

1. Generate chains-of-thought (CoT) for a problem domain.
2. Label the intermediate CoT steps using a combination of human experts (“supervised fine-tuning” or SFT) and automated machines (“reinforcement learning” or RL).
3. Train the base model using the labeled data from (2).
4. At test time, iteratively sample from the process model (a rough sketch of this loop follows below).
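To make the recipe concrete, here is a minimal, self-contained Python sketch of the four steps. Everything in it (the `StubModel` class, the `label_step` scorer, the best-of-N sampling loop) is an illustrative assumption for exposition, not DeepSeek’s or OpenAI’s actual code or API.

```python
import random
from typing import List, Tuple


class StubModel:
    """Placeholder standing in for an LLM; returns canned chains-of-thought."""

    def generate_cot(self, problem: str) -> str:
        # Step 1: generate a chain-of-thought (CoT) for the problem domain.
        return f"thoughts-about({problem})-v{random.randint(0, 999)}"

    def finetune(self, labeled_cots: List[Tuple[str, float]]) -> "StubModel":
        # Step 3: train the base model on labeled CoTs (a no-op for this stub).
        return self


def label_step(cot: str) -> float:
    """Step 2: label a CoT, e.g. via human experts (SFT) or an automated verifier (RL).
    A random score stands in for a real reward signal here."""
    return random.random()


def solve(process_model: StubModel, problem: str, num_samples: int = 8) -> str:
    """Step 4: at test time, iteratively sample from the process model and keep the best-scoring CoT."""
    best_cot, best_score = "", float("-inf")
    for _ in range(num_samples):
        cot = process_model.generate_cot(problem)
        score = label_step(cot)
        if score > best_score:
            best_cot, best_score = cot, score
    return best_cot


if __name__ == "__main__":
    base = StubModel()
    # Steps 1-2: collect and label CoTs; Step 3: fine-tune the base model on them.
    labeled = [(base.generate_cot("training task"), label_step("training task")) for _ in range(4)]
    process_model = base.finetune(labeled)
    # Step 4: test-time iterative sampling on a new, unseen problem.
    print(solve(process_model, "novel ARC-style task"))
```

The essential point of the loop is that answer quality is bought with test-time compute (more samples), not with additional pretraining.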
Techniques used to iteratively sample, along with ARC-AGI-1 scores, are reviewed below:
| System | ARC-AGI-1 | Method | Avg Tokens | Avg Cost |
|---|---|---|---|---|
| r1-zero | 14% | No SFT / no search | 11K | $.11 |
| r1 | 15.8% | SFT / no search | 6K | $.06 |
| o1 (low) | 20.5% | SFT / no search | | |