Killed by LLM by yz-exodao


A memorial to the benchmarks that defined—and were defeated by—AI progress

ARC-AGI (2019 – 2024)

Reasoning

Killed 1 month ago. An abstract reasoning challenge consisting of visual pattern completion tasks. Each task presents a sequence of abstract visual patterns and requires selecting the correct completion. Created by François Chollet as part of a broader investigation into measuring intelligence. It was 5 years and 1 month old.

Original Score

Human Baseline: ~80%

MATH (2021 – 2024)

Mathematics

Killed 4 months ago. A dataset of 12K challenging competition mathematics problems drawn from AMC, AIME, and other math competitions. Problems range from pre-algebra to olympiad level and require complex multi-step reasoning. Each problem comes with a detailed solution, so the dataset tests mathematical reasoning capabilities. It was 3 years and 6 months old.

Original Score

Average CS PhD: ~40%

BIG-Bench-Hard (2022 – 2024)

Multi-task

Killed 7 months ago. A curated suite of 23 challenging tasks from BIG-Bench on which language models initially performed below the average human level. The tasks were selected to measure progress on particularly difficult capabilities. It was 1 year and 8 months old.

Original Score

Average Human: 67.7%

HumanEval (2021 – 2024)

Coding

Killed 8 months ago. A collection of 164 Python programming problems designed to test language models’ coding abilities. Each problem includes a function signature, docstring, and unit tests. Models must generate complete, correct function implementations that pass all test cases. It was 2 years and 10 months old.

Original Score

Unspecified
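
To make that format concrete, here is a minimal sketch of what a HumanEval-style problem might look like. The task, the function name `running_max`, and the tests in `check` are all invented for illustration; none of them is taken from the actual dataset. The model would see only the signature and docstring, and its generated body counts as correct only if every assertion passes.

```python
def running_max(numbers: list) -> list:
    """Return a list whose i-th element is the maximum of numbers[0..i].

    >>> running_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
    # Everything below the docstring is the part a model would have to generate.
    result = []
    current = None
    for n in numbers:
        # Track the largest value seen so far.
        current = n if current is None else max(current, n)
        result.append(current)
    return result


def check(candidate) -> bool:
    """Hypothetical unit tests: the completion passes only if all assertions hold."""
    assert candidate([]) == []
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([7, 7, 1]) == [7, 7, 7]
    return True


passed = check(running_max)
```

Scoring in this style is all-or-nothing per problem: a completion that fails a single assertion counts as a failure, however close it came.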

IFEval (2023 – 2024)

Instruction Following

Killed 10 months ago. A comprehensive evaluation suite testing instruction-following capabilities across coding, math, roleplay, and other tasks. It measures the ability to handle complex multi-step instructions and constraints. It was 4 months old.

Defeated by:

Llama 3.3 70B

Original Score

Unspecified

GSM8K (2021 – 2023)

Mathematics

Killed 1 year ago. A collection of 8.5K grade school math word problems requiring step-by-step solutions. Problems test both numerical computation and natural language understanding through multi-step mathematical reasoning. It was 2 years and 1 month old.

Original Score

Unspecified

Turing Test (1950 – 2023)

Conversation

Killed 1 year ago. The original AI benchmark, proposed by Alan Turing in 1950. In this ‘imitation game’, a computer must convince human judges that it is human through natural conversation. The test sparked decades of debate about machine intelligence and consciousness. It was 73 years and 5 months old.

Original Score

Interrogator >50%

ARC (AI2) (2018 – 2023)

Reasoning

Killed 1 year ago. The AI2 Reasoning Challenge (ARC) is a collection of grade-school level multiple-choice reasoning tasks testing logical deduction, spatial reasoning, and temporal reasoning. Each task requires applying abstract reasoning skills to solve multi-step problems. It was 5 years old.
