Published 24 Mar 2025
Level Up to Reach AGI
Good AGI benchmarks act as useful progress indicators. Better AGI benchmarks clearly discern capabilities. The best AGI benchmarks do all this and actively inspire research and guide innovation.
At ARC Prize, our mission is to serve as a North Star towards AGI through enduring benchmarks, directing efforts towards systems capable of general intelligence and significantly compressing the timeline for scientific breakthroughs.
ARC-AGI-1 has measured progress towards AGI since 2019 and was the only benchmark to pinpoint the exact moment in late 2024 when AI moved beyond pure memorization. OpenAI used ARC-AGI-1 to demonstrate this progress with their o3 system which combines deep learning-based LLMs with reasoning synthesis engines.
ARC Prize 2024 inspired thousands of independent students and researchers to work alongside frontier labs on new test-time adaptation ideas.
But there is more work to do to reach AGI. AGI still needs new ideas.
We can characterize systems like o3 as going from “zero to one” on the fluid intelligence spectrum. But these systems are highly inefficient and currently require significant human supervision during the training process to adapt to new domains.
Announcements
Today we’re excited to launch ARC-AGI-2 to challenge the new frontier. ARC-AGI-2 is even harder for AI (in particular, AI reasoning systems), while maintaining the same relative ease for humans. Pure LLMs score 0% on ARC-AGI-2, and public AI reasoning systems achieve only single-digit percentage scores. In contrast, every task in ARC-AGI-2 has been solved by at least 2 humans within 2 attempts.
Alongside it, today we’re announcing ARC Prize 2025 (going live on Kaggle this week), designed to drive open-source progress on highly efficient, general systems capable of beating ARC-AGI-2.
Easy for Humans, Hard for AI
All other AI benchmarks focus on superhuman capabilities or specialized knowledge by testing “PhD++” skills. ARC-AGI is the only benchmark that takes the opposite design choice – by focusing on tasks that are relatively easy for humans, yet hard, or impossible, for AI, we shine a spotlight on capability gaps that do not spontaneously emerge from “scaling up”.
The ARC Prize Foundation adapts this into our definition for measuring AGI: the set of tasks that are easy for humans yet hard for AI. When this gap is zero, when there are no remaining tasks we can find that humans solve easily but that still challenge AI, we will have achieved AGI.
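As a toy illustration of that definition (a sketch only; the task names below are made up and not real ARC-AGI task identifiers), the gap can be written as a simple set difference:

```python
# Toy illustration of the capability-gap definition above.
# Task names are hypothetical placeholders, not real ARC-AGI tasks.
easy_for_humans = {"task_a", "task_b", "task_c"}
solvable_by_ai = {"task_a"}

# Tasks that are easy for humans yet remain unsolved by AI.
capability_gap = easy_for_humans - solvable_by_ai
print(sorted(capability_gap))  # ['task_b', 'task_c'] -> the gap is not yet empty
```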
Addressing these capability gaps requires novel insight and new ideas. Importantly, ARC-AGI does not exist purely to measure AGI progress. It also exists to inspire researchers to work on new ideas.
Intelligence requires the ability to generalize from limited experience and apply knowledge in new, unexpected situations. AI systems are already superhuman in many specific domains (e.g., playing Go and image recognition). However, these are narrow, specialized capabilities. The “human-AI gap” reveals what’s missing for general intelligence: the ability to acquire new skills with high efficiency.
Introducing ARC-AGI-2
The ARC-AGI-2 benchmark launches today. This second edition in the ARC-AGI series raises the bar for difficulty for AI while maintaining the same relative ease for humans.
Every ARC-AGI-2 task was solved by at least 2 humans in 2 attempts or less in a controlled study with hundreds of human participants. This matches the rules we hold for AI, which gets two attempts per task.
ARC-AGI-1, introduced in 2019, was designed to challenge deep learning. Specifically, it was designed to resist the ability to simply “memorize” the training dataset. ARC-AGI comprises a training dataset and several evaluation sets, including a private eval set used for the ARC Prize 2024 contest. The training set is intended to teach the Core Knowledge Priors required to solve tasks in the evaluation sets. To solve the evaluation tasks, AI systems must demonstrate basic fluid intelligence or the ability to adapt to novel, never-before-seen tasks.
As an analogy, think of the training set as a way to learn grade school math symbols, while the evaluation set requires you to solve algebra equations using your knowledge of those symbols. You cannot simply memorize your way to the answer; you must apply existing knowledge to new problems.
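To make that concrete, here is a minimal sketch of how an ARC task is represented and scored, assuming the public ARC-AGI JSON layout of “train” and “test” grid pairs; the solver hook is hypothetical, standing in for whatever system attempts the task:

```python
import json

def load_task(path):
    """Load one ARC task: demonstration pairs plus held-out test pairs.
    Grids are lists of lists of integers 0-9 (colors)."""
    with open(path) as f:
        task = json.load(f)
    return task["train"], task["test"]

def score_task(test_pairs, solver, attempts=2):
    """Return 1 only if every test output is reproduced exactly within the
    allowed number of attempts (ARC rules give two tries per task)."""
    for pair in test_pairs:
        # `solver` is a hypothetical hook returning candidate output grids.
        guesses = solver(pair["input"])
        if not any(g == pair["output"] for g in guesses[:attempts]):
            return 0
    return 1
```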
Any AI system capable of beating ARC-AGI-1 demonstrates a binary level of fluid intelligence. In contrast, ARC-AGI-2 significantly raises the bar for AI. To beat it, you must demonstrate both a high level of adaptability and high efficiency.
While designing ARC-AGI-2, we studied these properties of frontier AI reasoning systems. Below are example tasks to illustrate some of what we discovered. All of the following tasks are part of ARC-AGI-2 and were (1) solved by at least 2 humans within 2 attempts and (2) unsolved by any frontier AI reasoning system.
15 Comments
gkamradt
Hey HN, Greg from ARC Prize Foundation here.
Alongside Mike Knoop and François Chollet, we’re launching ARC-AGI-2, a frontier AI benchmark that measures a model’s ability to generalize on tasks it hasn’t seen before, and the ARC Prize 2025 competition to beat it.
In Dec ‘24, ARC-AGI-1 (2019) pinpointed the moment AI moved beyond pure memorization as seen by OpenAI's o3.
ARC-AGI-2 targets test-time reasoning.
My view is that good AI benchmarks don't just measure progress, they inspire it. Our mission is to guide research towards general systems.
Base LLMs (no reasoning) are currently scoring 0% on ARC-AGI-2. Specialized AI reasoning systems (like R1 or o3-mini) are <4%.
Every ARC-AGI-2 task (100% of them), however, has been solved by at least two humans, quickly and easily. We know this because we tested 400 people live.
Our belief is that once we can no longer come up with quantifiable problems that are "feasible for humans and hard for AI" then we effectively have AGI. ARC-AGI-2 proves that we do not have AGI.
Change log from ARC-AGI-1 to ARC-AGI-2:
* The two main evaluation sets (semi-private, private eval) have increased to 120 tasks
* Solving tasks requires more reasoning vs pure intuition
* Each task has been confirmed to have been solved by at least 2 people (many more) out of an average of 7 test takers in 2 attempts or less
* Non-training task sets are now difficulty-calibrated
The 2025 Prize ($1M, open-source required) is designed to drive progress on this specific gap. Last year's competition (also launched on HN) had 1.5K teams participate and produced 40+ research papers.
The Kaggle competition goes live later this week and you can sign up here: https://arcprize.org/competition
We're in an idea-constrained environment. The next AGI breakthrough might come from you, not a giant lab.
Happy to answer questions.
artificialprint
Oh boy! Some of these tasks are not hard, but require full attention and a lot of counting just to get things right! ARC3 will go 3D perhaps? JK
Congrats on the launch, let's see how long it'll take to get saturated.
FergusArgyll
I'd love to hear from the ARC guys:
These benchmarks, and specifically the constraints placed on solving them (compute, etc.), seem to me to incentivize the opposite of "general intelligence".
Have any of the technical contributions used to win the past competition been used to advance general AI in any way?
We have transformer-based systems constantly gaining capabilities. On the other hand, have any of the Kaggle submissions actually advanced the field in any way outside of the ARC Challenge?
To me (a complete outsider, admittedly) the ARC prize seems like an operationalization of the bitter lesson
Nesco
At the very first glance, it's like ARC 1 with some structures serving as contextual data, and more complicated symmetries / topological transformations.
Now, I wonder what surprises are to be found in the full dataset.
The focus on cost-efficiently solving discrete tasks might actually lead us towards deep learning systems that could be used reliably in production, and not just give a whoa effect or need to be constantly supervised.
lawrenceyan
Concrete benchmarks like these are very valuable.
Defining the reward function, which is basically what ARC is doing, is 50% of the problem-solving process.
ipunchghosts
The computer vision community needs a dataset like this for evaluation: train in one domain and test on another. The best we have now are the ImageNet-R and ImageNet-C datasets. Humans have no issues with domain adaptation in vision, but computer vision models still struggle in many ways, including with out-of-domain images.
momojo
Have you had any neurologists utilize your dataset? My own reaction after solving a few of the puzzles was "Why is this so intuitive for me, but not for an LLM?".
Our human ability to abstract things is underrated.
danpalmer
> and was the only benchmark to pinpoint the exact moment in late 2024 when AI moved beyond pure memorization
This is self-referential: the benchmark pinpointed the time when AI went from memorization to problem solving because the benchmark requires problem solving to complete. How do we know it requires problem-solving skills? Because memorization-only LLMs can't do it but humans can.
I think ARC are producing some great benchmarks, and I think they probably are pushing forward the state of the art. However, I don't think they identified anything particular with o3; at least, they don't seem to have proven a step change.
iandanforth
I'd very much like to see VLAs get in the game with ARC. When I solve these puzzles I'm imagining myself move blocks around. Much of the time I'm treating these as physics simulations with custom physics per puzzle. VLAs are particularly well suited to the kind of training and planning which might unlock solutions here.
neom
Maybe this is a really stupid question but I've been curious… are LLMs based on… "Neuronormativity"? Like, what neurology is an LLM based on? Would we get any benefit from looking at neurodiverse processing styles?
jwpapi
Did we run out of textual tasks that are easy for humans but hard for AI, or why are the examples all graphics?
falcor84
I spent half an hour playing with these now at https://arcprize.org/play and it's fun, but I must say that they are not "easy". So far I eventually solved all of the ones I've gone through, but several took me significantly more than the 2 tries allotted.
I wonder if this can be shown to be a valid IQ test, and if so, what IQ would a person need to solve e.g. 90% of them in 1 or 2 tries.
Davidzheng
Probably OpenAI will be >60% within three months, if not immediately, with this $1000/question level of compute (which is the way, tbh: we should throw compute at problems whenever possible, that's the main advantage of silicon intelligence).
ttol
Had to give https://reasoner.com a try on ARC-AGI-2.
Reasoner passed on first try.
“Correct!”
(See screenshot that shows one rated “hard” — https://www.linkedin.com/posts/waynechang_tried-reasoner-on-…)
nneonneo
Nitpick: “Public” is misspelled as “pubic” in several of the captions on that page.