ARC-AGI-2 and ARC Prize 2025 by gkamradt


15 Comments

  • gkamradt
    Posted March 24, 2025 at 8:37 pm

    Hey HN, Greg from ARC Prize Foundation here.

    Alongside Mike Knoop and François Chollet, we’re launching ARC-AGI-2, a frontier AI benchmark that measures a model’s ability to generalize on tasks it hasn’t seen before, and the ARC Prize 2025 competition to beat it.

    In Dec ‘24, ARC-AGI-1 (launched in 2019) pinpointed the moment AI moved beyond pure memorization, as demonstrated by OpenAI's o3.

    ARC-AGI-2 targets test-time reasoning.

    My view is that good AI benchmarks don't just measure progress, they inspire it. Our mission is to guide research towards general systems.

    Base LLMs (no reasoning) are currently scoring 0% on ARC-AGI-2. Specialized AI reasoning systems (like R1 or o3-mini) are <4%.

    Every ARC-AGI-2 task (100%), however, has been solved by at least two humans, quickly and easily. We know this because we tested 400 people live.

    Our belief is that once we can no longer come up with quantifiable problems that are "feasible for humans and hard for AI" then we effectively have AGI. ARC-AGI-2 proves that we do not have AGI.

    Change log from ARC-AGI-1 to ARC-AGI-2:
    * The two main evaluation sets (semi-private, private eval) have increased to 120 tasks
    * Solving tasks requires more reasoning vs pure intuition
    * Each task has been confirmed to have been solved by at least 2 people (often many more) out of an average of 7 test takers, in 2 attempts or fewer
    * Non-training task sets are now difficulty-calibrated
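
    For anyone who wants to poke at the tasks programmatically: each task is a JSON file of integer grids, in the same format as ARC-AGI-1's public repo. Here's a minimal scoring sketch (the `solve` function is a hypothetical stand-in for your solver; a test input counts as solved only if one of at most two attempts reproduces the expected output grid exactly):

        import json

        def score_task(path, solve):
            """Score one ARC task file against a candidate solver.

            `solve` (hypothetical) takes the training pairs and a test input
            grid and returns a list of candidate output grids."""
            with open(path) as f:
                task = json.load(f)  # {"train": [{"input": ..., "output": ...}], "test": [...]}
            solved = 0
            for pair in task["test"]:
                attempts = solve(task["train"], pair["input"])[:2]  # 2 attempts max
                # All-or-nothing: grids must match exactly, dimensions included.
                if any(attempt == pair["output"] for attempt in attempts):
                    solved += 1
            return solved, len(task["test"])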

    The 2025 Prize ($1M, open-source required) is designed to drive progress on this specific gap. Last year's competition (also launched on HN) drew 1.5K participating teams and produced 40+ published research papers.

    The Kaggle competition goes live later this week and you can sign up here: https://arcprize.org/competition

    We're in an idea-constrained environment. The next AGI breakthrough might come from you, not a giant lab.

    Happy to answer questions.

  • artificialprint
    Posted March 24, 2025 at 10:00 pm

    Oh boy! Some of these tasks are not hard, but require full attention and a lot of counting just to get things right! ARC3 will go 3D perhaps? JK

    Congrats on the launch, let's see how long it'll take to get saturated

  • FergusArgyll
    Posted March 24, 2025 at 11:15 pm

    I'd love to hear from the ARC guys:

    These benchmarks, and specifically the constraints placed on solving them (compute etc.), seem to me to incentivize the opposite of "general intelligence"

    Have any of the technical contributions used to win the past competition been used to advance general AI in any way?

    We have transformer-based systems constantly gaining capabilities. On the other hand, have any of the Kaggle submissions actually advanced the field in any way outside of the ARC Challenge?

    To me (a complete outsider, admittedly) the ARC prize seems like an operationalization of the bitter lesson

  • Nesco
    Posted March 24, 2025 at 11:38 pm

    At the very first glance, it's like ARC 1 with some structures serving as contextual data, and more complicated symmetries / topological transformations.

    Now, I wonder what surprises are to be found in the full dataset.

    The focus on solving discrete tasks cost-efficiently might actually lead us toward deep learning systems that could be used reliably in production, instead of just giving a wow effect or needing constant supervision

  • lawrenceyan
    Posted March 24, 2025 at 11:48 pm

    Concrete benchmarks like these are very valuable.

    Defining the reward function, which is basically what ARC is doing, is 50% of the problem solving process.
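
    And in ARC's case that reward function is about as crisp as rewards get; a minimal sketch, with grids as plain lists of lists of ints:

        def reward(predicted, target):
            # ARC-style reward: all-or-nothing exact match on the output grid,
            # dimensions included. No partial credit for "almost right" grids.
            return 1.0 if predicted == target else 0.0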

  • ipunchghosts
    Posted March 24, 2025 at 11:58 pm

    The computer vision community needs a dataset like this for evaluation: train in one domain and test on another. The best we have now are the ImageNet-R and ImageNet-C datasets. Humans have no issues with domain adaptation in vision, but computer vision models still struggle in many ways, including with out-of-domain images.
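
    Concretely, the evaluation I mean is: take a model trained on domain A and measure accuracy on domain B with no fine-tuning. A rough PyTorch sketch, assuming the shifted-domain test set is laid out in ImageFolder format with the same class indexing as training (ImageNet-R in particular covers only 200 of the 1000 ImageNet classes, so it would need a class remap first):

        import torch
        from torch.utils.data import DataLoader
        from torchvision import datasets, models, transforms

        # Standard ImageNet preprocessing for the shifted-domain test set.
        tfm = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])

        # Model trained on domain A (here: an ImageNet-pretrained ResNet-50).
        model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()
        # Domain B: path is a placeholder for your out-of-domain test set.
        loader = DataLoader(datasets.ImageFolder("path/to/domain-b/", tfm), batch_size=64)

        correct = total = 0
        with torch.no_grad():
            for images, labels in loader:
                preds = model(images).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.numel()
        print(f"out-of-domain top-1 accuracy: {correct / total:.3f}")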

  • momojo
    Posted March 25, 2025 at 12:13 am

    Have you had any neurologists utilize your dataset? My own reaction after solving a few of the puzzles was "Why is this so intuitive for me, but not for an LLM?".

    Our human ability to abstract things is underrated.

  • danpalmer
    Posted March 25, 2025 at 12:27 am

    > and was the only benchmark to pinpoint the exact moment in late 2024 when AI moved beyond pure memorization

    This is self-referential: the benchmark pinpointed the time when AI went from memorization to problem solving, because the benchmark requires problem solving to complete. How do we know it requires problem-solving skills? Because memorization-only LLMs can't do it but humans can.

    I think ARC are producing some great benchmarks, and I think they probably are pushing forward the state of the art. However, I don't think they identified anything in particular with o3; at least, they don't seem to have proven a step change.

  • iandanforth
    Posted March 25, 2025 at 12:32 am

    I'd very much like to see VLAs get in the game with ARC. When I solve these puzzles I'm imagining myself move blocks around. Much of the time I'm treating these as physics simulations with custom physics per puzzle. VLAs are particularly well suited to the kind of training and planning which might unlock solutions here.

  • neom
    Posted March 25, 2025 at 12:44 am

    Maybe this is a really stupid question but I've been curious… are LLMs based on… "Neuronormativity"? Like, what neurology is an LLM based on? Would we get any benefit from looking at neurodiverse processing styles?

  • jwpapi
    Posted March 25, 2025 at 12:55 am

    Did we run out of textual tasks that are easy for humans but hard for AI, or why are the examples all graphics?

  • falcor84
    Posted March 25, 2025 at 1:39 am

    I just spent half an hour playing with these at https://arcprize.org/play and it's fun, but I must say that they are not "easy". So far I've solved all of the ones I've gone through, though several took me significantly more than the 2 tries allotted.

    I wonder if this can be shown to be a valid IQ test, and if so, what IQ would a person need to solve e.g. 90% of them in 1 or 2 tries.

  • Davidzheng
    Posted March 25, 2025 at 1:49 am

    OpenAI will probably be >60% in three months, if not immediately, with $1000/question levels of compute (which is the way, tbh; we should throw compute at problems whenever possible, since that's the main advantage of silicon intelligence)

  • ttol
    Posted March 25, 2025 at 2:23 am

    Had to give https://reasoner.com a try on ARC-AGI-2.

    Reasoner passed on first try.

    “Correct!”

    (See screenshot that shows one rated “hard” — https://www.linkedin.com/posts/waynechang_tried-reasoner-on-…)

  • nneonneo
    Posted March 25, 2025 at 5:42 am

    Nitpick: “Public” is misspelled as “pubic” in several of the captions on that page.
