A memorial to the benchmarks that defined—and were defeated by—AI progress
ARC-AGI (2019 – 2024)
Reasoning
Killed 1 month ago. An abstract reasoning challenge consisting of visual pattern completion tasks. Each task presents a sequence of abstract visual patterns and requires selecting the correct completion. Created by François Chollet as part of a broader investigation into measuring intelligence. It was 5 years and 1 month old.
Original Score
Human Baseline: ~80%
MATH (2021 – 2024)
Mathematics
Killed 4 months ago. A dataset of 12K challenging competition mathematics problems from AMC, AIME, and other math competitions. Problems range from pre-algebra to olympiad level and require complex multi-step reasoning. Each problem comes with a detailed step-by-step solution. It was 3 years and 6 months old.
Original Score
Average CS PhD: ~40%
BIG-Bench-Hard (2022 – 2024)
Multi-task
Killed 7 months ago. A curated suite of 23 challenging tasks from BIG-Bench on which language models initially performed below the average human level. Selected to measure progress on particularly difficult capabilities. It was 1 year and 8 months old.
Original Score
Average Human: 67.7%
HumanEval (2021 – 2024)
Coding
Killed 8 months ago. A collection of 164 Python programming problems designed to test language models' coding abilities. Each problem includes a function signature, docstring, and unit tests. Models must generate complete, correct function implementations that pass all test cases. It was 2 years and 10 months old.
Original Score
Unspecified
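To make the HumanEval format concrete, here is a minimal sketch of what such a problem looks like. The task, the function name `count_vowels`, and the `check` helper are hypothetical illustrations and are not taken from the dataset. The model would be shown only the signature and docstring and asked to produce the body.

```python
from typing import Callable


def count_vowels(text: str) -> int:
    """Return the number of vowels (a, e, i, o, u) in text, ignoring case.

    >>> count_vowels("Benchmark")
    2
    """
    # Everything below the docstring is what a model would have to generate.
    return sum(1 for ch in text.lower() if ch in "aeiou")


def check(candidate: Callable[[str], int]) -> bool:
    """Hypothetical unit tests: a completion passes only if every case matches."""
    cases = [("", 0), ("xyz", 0), ("Benchmark", 2), ("AEIOU", 5)]
    return all(candidate(text) == expected for text, expected in cases)


if __name__ == "__main__":
    print(check(count_vowels))
```

A completion counts as correct only when `check` succeeds on every case, which is why a benchmark of this shape can be scored automatically.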
IFEval (2023 – 2024)
Instruction Following
Killed 10 months ago. A comprehensive evaluation suite testing instruction-following capabilities across coding, math, roleplay, and other tasks. Measures the ability to handle complex multi-step instructions and constraints. It was 4 months old.
Defeated by:
Llama 3.3 70B
Original Score
Unspecified
GSM8K (2021 – 2023)
Mathematics
Killed 1 year ago. A collection of 8.5K grade school math word problems requiring step-by-step solutions. Problems test both numerical computation and natural language understanding through multi-step mathematical reasoning. It was 2 years and 1 month old.
Original Score
Unspecified
Turing Test (1950 – 2023)
Conversation
Killed 1 year ago. The original AI benchmark, proposed by Alan Turing in 1950. In this 'imitation game', a computer must convince human judges that it is human through natural conversation. The test sparked decades of debate about machine intelligence and consciousness. It was 73 years and 5 months old.
Original Score
Interrogator >50%
ARC (AI2) (2018 – 2023)
Reasoning
Killed 1 year ago. AI2 Reasoning Challenge (ARC): a collection of grade-school level multiple-choice science questions, built to require reasoning rather than simple retrieval. It was 5 years old.