The Benchmark Graveyard
A museum of dead measures · est. MMXXVI
Vol. I · No. 7 Snark on top — a citable dataset underneath Updated 11 June 2026
An Obituary Column for Artificial Intelligence

The benchmarks we built to measure intelligence keep dying.

Saturation, contamination, gaming, construct drift, quiet abandonment — five ways a test stops meaning what it once did. We hold the autopsies: affectionate roast on top, a rigorous and citable dataset underneath.

Three flagship tests, racing one ceiling they were never meant to reach.

Frontier score over time. The plateau against the human baseline — the dashed mortality line — is the death.
MMLU HumanEval HellaSwag
100 75 50 25 2019 2020 2021 2022 2023 2024 Human baseline · the mortality line HellaSwag 95.4 · ½-life 0.6y HumanEval 92.0 · ½-life 1.8y MMLU 88.7 · ½-life 1.5y
5
Flagship benchmarks
laid to rest
0.6y
Median half-life
to saturation
T1–T3
Evidence tier
on every claim
100%
Authors offered
a right of reply
§ 01 — The Index of the Departed

Each one mattered. Each one ran out of headroom.

Ordered by half-life — years from public release to the midpoint of the saturation curve (ADR-0002). The shorter the span, the faster a hard problem became a solved formality.

2016 → 2019buried
0.3y to ½

SQuAD

Rajpurkar · Zhang · Lopyrev · Liang  ·  arXiv:1606.05250

"Passed human F1 so fast its own authors shipped a sequel just to add the unanswerable questions."

Half-life~0.3 yr
Saturation T2 Construct drift T1 Abandonment T3
2018 → 2019buried
0.4y to ½

GLUE

Wang · Singh · Michael · Hill · Levy · Bowman  ·  arXiv:1804.07461

"Declared dead by its own creators, who chiseled the headstone themselves: 'GLUE is solved.'"

Half-life~0.4 yr
Saturation T2 Construct drift T1 Abandonment T3
2019 → 2024buried
0.6y to ½

HellaSwag

Zellers · Holtzman · Bisk · Farhadi · Choi  ·  arXiv:1905.07830

"Engineered to be adversarially impossible for machines. The machines did not get the memo."

Half-life~0.6 yr
Saturation T2 Contamination T1 Construct drift T3
2020 → 2024buried
1.5y to ½

MMLU

Hendrycks · Burns · Basart · Zou · Mazeika · Song · Steinhardt  ·  arXiv:2009.03300

"Asked fifty-seven subjects' worth of questions, until the models finally answered: 'all of them.'"

Half-life~1.5 yr
Saturation T2 Contamination T1 Construct drift T3
2021 → 2024buried
1.8y to ½

HumanEval

Chen · Tworek · Jun · Yuan · Zaremba et al. (OpenAI)  ·  arXiv:2107.03374

"One hundred sixty-four little functions — outlived by the test suite that proved its solutions didn't run."

Half-life~1.8 yr
Saturation T2 Contamination T1 Construct drift T3
The Autopsy · A Representative Obituary

MMLU

"It asked fifty-seven subjects' worth of questions — until the models answered all of them."

Born 2020-09-07  ·  Declared dead 2024-09-01  ·  Score at death 88.7% vs. expert baseline ~89.8%  ·  Survived by MMLU-Pro, MMLU-Redux, GPQA

What it measured — and what it actually measured

MMLU was built to measure multitask knowledge: fifty-seven subjects, from elementary mathematics to professional law, scored as four-way multiple choice. The construct was breadth — does this model know a little about almost everything a well-read human does?

What it actually measured, by 2024, was closer to four-way multiple-choice test-taking under mild label noise. When the headroom between frontier and ceiling shrinks to a couple of points — and an estimated few percent of ground-truth labels are themselves wrong or ambiguous (the very motivation for MMLU-Redux) — the remaining signal is no longer "knowledge." It is tolerance for the benchmark's own bugs.

The construct quietly drifted out from under the metric — and nobody held a funeral.

Life & career

MMLU mattered. Before it, "general knowledge" in language models was argued by anecdote. Hendrycks et al. handed the field one table everyone could point at — and that table did real work. It made GPT-3's 43.9% legible as far below an expert human, and the leap to GPT-4's 86.4% feel like the era-defining moment it was. For three years it was the headline number on essentially every frontier model card. It earned that spot.

The evidence on file

T2
Saturation — primary cause

Running-maximum series plateaus at ~89% across the last three model generations (GPT-4 → GPT-4o → Claude 3.5 → Llama-3.1-405B, all within ~2 points). Logistic fit + bootstrap CI committed to saturation.csv, regenerated by make curves.

T1
Contamination — reported, not originated

Published analyses report MMLU items inside pretraining corpora (Deng et al., arXiv:2311.09783; MMLU-Pro, arXiv:2406.01574). Per editorial policy we report these findings; materiality versus saturation is debated, so saturation is ranked first.

T3
Protocol divergence

Gemini 1.0 Ultra's headline 90.0% used CoT@32 voting while others reported 5-shot — late-life MMLU became as much about eval protocol as capability. Reported, not adjudicated.

⟶ Standing right of reply

Authors invited June 2026; none received to date. Any response will be published here unedited (modulo length and legal). These benchmarks were load-bearing — we roast with affection, never contempt.

Saturation curve · saturation.csv
100 80 60 40 2020 2021 2022 2023 2024 Human expert · ~89.8 ½-life ~1.5y GPT-3 · 43.9 GPT-4 · 86.4 plateau 88–90
The plateau is the diagnosis. GPT-4 (86.4), GPT-4o (88.7), Claude 3.5 Sonnet (88.7) and Llama-3.1-405B (88.6) land within ~2 points across generations — the saturation criterion met. Gold points are T3-sourced model-card scores.
§ 02 — The Coroner's Manual

Five ways a benchmark dies.

Operational criteria, not vibes (ADR-0002). A burial requires the criteria be met; benchmarks showing early-warning signs go to the clearly-labeled hospice ward. When in doubt, hospice.

Cause of deathOperational criterionEvidence required
i.
Saturation
Frontier scores sit within measurement noise of the ceiling for ≥ 2 model generations. T2 · curve fit
+ bootstrap CI
ii.
Contamination
Documented training-data leakage materially inflating reported scores. T1 or T2 ·
cited & dated
iii.
Gaming
Score gains without the underlying capability gains they claim to track. T1 · A/B
under protocol
iv.
Construct drift
The metric stopped measuring the thing we actually care about. T1 · argued
+ adjudicated
v.
Abandonment
Quietly vanished from the model cards; no living maintainer, no funeral. T3 · last
sighting logged
The Snark Only Works If The Rigor Holds

A falsifiable curve for every benchmark we miss.

Every obituary is backed by a validated record.yaml, a committed saturation.csv, an evidence tier on every claim, a public corrections log, and a standing right of reply for authors. Tagged releases receive a Zenodo DOI, so you can cite a death the way you'd cite anything else.