An Obituary Column for Artificial Intelligence

The benchmarks we built to measure intelligence keep dying.

Saturation, contamination, gaming, construct drift, quiet abandonment — five ways a test stops meaning what it once did. We hold the autopsies: affectionate roast on top, a rigorous and citable dataset underneath.

The Editorial Standard · Evidence-tiered receipts (T1 / T2 / T3) · Reproducible saturation curves · A standing right of reply for every author · We report contamination — we don't originate it

Three flagship tests, racing one ceiling they were never meant to reach.

Frontier score over time. The plateau against the human baseline — the dashed mortality line — is the death.

MMLU HumanEval HellaSwag

Flagship benchmarks
laid to rest

0.6y

Median half-life
to saturation

T1–T3

Evidence tier
on every claim

100%

Authors offered
a right of reply

§ 01 — The Index of the Departed

Each one mattered. Each one ran out of headroom.

Ordered by half-life — years from public release to the midpoint of the saturation curve (ADR-0002). The shorter the span, the faster a hard problem became a solved formality.

2016 → 2019buried
0.3y to ½

SQuAD

Rajpurkar · Zhang · Lopyrev · Liang · arXiv:1606.05250

"Passed human F1 so fast its own authors shipped a sequel just to add the unanswerable questions."

Half-life~0.3 yr

Saturation T2 Construct drift T1 Abandonment T3

2018 → 2019buried
0.4y to ½

GLUE

Wang · Singh · Michael · Hill · Levy · Bowman · arXiv:1804.07461

"Declared dead by its own creators, who chiseled the headstone themselves: 'GLUE is solved.'"

Half-life~0.4 yr

Saturation T2 Construct drift T1 Abandonment T3

2019 → 2024buried
0.6y to ½

HellaSwag

Zellers · Holtzman · Bisk · Farhadi · Choi · arXiv:1905.07830

"Engineered to be adversarially impossible for machines. The machines did not get the memo."

Half-life~0.6 yr

Saturation T2 Contamination T1 Construct drift T3

2020 → 2024buried
1.5y to ½

MMLU

Hendrycks · Burns · Basart · Zou · Mazeika · Song · Steinhardt · arXiv:2009.03300

"Asked fifty-seven subjects' worth of questions, until the models finally answered: 'all of them.'"

Half-life~1.5 yr

Saturation T2 Contamination T1 Construct drift T3

2021 → 2024buried
1.8y to ½

HumanEval

Chen · Tworek · Jun · Yuan · Zaremba et al. (OpenAI) · arXiv:2107.03374

"One hundred sixty-four little functions — outlived by the test suite that proved its solutions didn't run."

Half-life~1.8 yr

Saturation T2 Contamination T1 Construct drift T3

The Autopsy · A Representative Obituary

MMLU

"It asked fifty-seven subjects' worth of questions — until the models answered all of them."

Born 2020-09-07 · Declared dead 2024-09-01 · Score at death 88.7% vs. expert baseline ~89.8% · Survived by MMLU-Pro, MMLU-Redux, GPQA

What it measured — and what it actually measured

MMLU was built to measure multitask knowledge: fifty-seven subjects, from elementary mathematics to professional law, scored as four-way multiple choice. The construct was breadth — does this model know a little about almost everything a well-read human does?

What it actually measured, by 2024, was closer to four-way multiple-choice test-taking under mild label noise. When the headroom between frontier and ceiling shrinks to a couple of points — and an estimated few percent of ground-truth labels are themselves wrong or ambiguous (the very motivation for MMLU-Redux) — the remaining signal is no longer "knowledge." It is tolerance for the benchmark's own bugs.

The construct quietly drifted out from under the metric — and nobody held a funeral.

Life & career

MMLU mattered. Before it, "general knowledge" in language models was argued by anecdote. Hendrycks et al. handed the field one table everyone could point at — and that table did real work. It made GPT-3's 43.9% legible as far below an expert human, and the leap to GPT-4's 86.4% feel like the era-defining moment it was. For three years it was the headline number on essentially every frontier model card. It earned that spot.

The evidence on file

Saturation — primary cause

Running-maximum series plateaus at ~89% across the last three model generations (GPT-4 → GPT-4o → Claude 3.5 → Llama-3.1-405B, all within ~2 points). Logistic fit + bootstrap CI committed to saturation.csv, regenerated by make curves.

Contamination — reported, not originated

Published analyses report MMLU items inside pretraining corpora (Deng et al., arXiv:2311.09783; MMLU-Pro, arXiv:2406.01574). Per editorial policy we report these findings; materiality versus saturation is debated, so saturation is ranked first.

Protocol divergence

Gemini 1.0 Ultra's headline 90.0% used CoT@32 voting while others reported 5-shot — late-life MMLU became as much about eval protocol as capability. Reported, not adjudicated.

⟶ Standing right of reply

Authors invited June 2026; none received to date. Any response will be published here unedited (modulo length and legal). These benchmarks were load-bearing — we roast with affection, never contempt.

Saturation curve · saturation.csv

The plateau is the diagnosis. GPT-4 (86.4), GPT-4o (88.7), Claude 3.5 Sonnet (88.7) and Llama-3.1-405B (88.6) land within ~2 points across generations — the saturation criterion met. Gold points are T3-sourced model-card scores.

§ 02 — The Coroner's Manual

Five ways a benchmark dies.

Operational criteria, not vibes (ADR-0002). A burial requires the criteria be met; benchmarks showing early-warning signs go to the clearly-labeled hospice ward. When in doubt, hospice.

№	Cause of death	Operational criterion	Evidence required
i.	Saturation	Frontier scores sit within measurement noise of the ceiling for ≥ 2 model generations.	T2 · curve fit + bootstrap CI
ii.	Contamination	Documented training-data leakage materially inflating reported scores.	T1 or T2 · cited & dated
iii.	Gaming	Score gains without the underlying capability gains they claim to track.	T1 · A/B under protocol
iv.	Construct drift	The metric stopped measuring the thing we actually care about.	T1 · argued + adjudicated
v.	Abandonment	Quietly vanished from the model cards; no living maintainer, no funeral.	T3 · last sighting logged

The Snark Only Works If The Rigor Holds

A falsifiable curve for every benchmark we miss.

Every obituary is backed by a validated record.yaml, a committed saturation.csv, an evidence tier on every claim, a public corrections log, and a standing right of reply for authors. Tagged releases receive a Zenodo DOI, so you can cite a death the way you'd cite anything else.

Browse the full dataset Submit an obituary

The benchmarks we built to measure intelligence keep dying.

Three flagship tests, racing one ceiling they were never meant to reach.

Each one mattered. Each one ran out of headroom.

SQuAD

GLUE

HellaSwag

MMLU

HumanEval

MMLU

What it measured — and what it actually measured

Life & career

The evidence on file

Five ways a benchmark dies.

Saturation

Contamination

Gaming

Construct drift

Abandonment