0.3y to ½
SQuAD
"Passed human F1 so fast its own authors shipped a sequel just to add the unanswerable questions."
Saturation, contamination, gaming, construct drift, quiet abandonment — five ways a test stops meaning what it once did. We hold the autopsies: affectionate roast on top, a rigorous and citable dataset underneath.
Ordered by half-life — years from public release to the midpoint of the saturation curve (ADR-0002). The shorter the span, the faster a hard problem became a solved formality.
"Passed human F1 so fast its own authors shipped a sequel just to add the unanswerable questions."
"Declared dead by its own creators, who chiseled the headstone themselves: 'GLUE is solved.'"
"Engineered to be adversarially impossible for machines. The machines did not get the memo."
"Asked fifty-seven subjects' worth of questions, until the models finally answered: 'all of them.'"
"One hundred sixty-four little functions — outlived by the test suite that proved its solutions didn't run."
"It asked fifty-seven subjects' worth of questions — until the models answered all of them."
MMLU was built to measure multitask knowledge: fifty-seven subjects, from elementary mathematics to professional law, scored as four-way multiple choice. The construct was breadth — does this model know a little about almost everything a well-read human does?
What it actually measured, by 2024, was closer to four-way multiple-choice test-taking under mild label noise. When the headroom between frontier and ceiling shrinks to a couple of points — and an estimated few percent of ground-truth labels are themselves wrong or ambiguous (the very motivation for MMLU-Redux) — the remaining signal is no longer "knowledge." It is tolerance for the benchmark's own bugs.
The construct quietly drifted out from under the metric — and nobody held a funeral.
MMLU mattered. Before it, "general knowledge" in language models was argued by anecdote. Hendrycks et al. handed the field one table everyone could point at — and that table did real work. It made GPT-3's 43.9% legible as far below an expert human, and the leap to GPT-4's 86.4% feel like the era-defining moment it was. For three years it was the headline number on essentially every frontier model card. It earned that spot.
Running-maximum series plateaus at ~89% across the last three model generations (GPT-4 → GPT-4o → Claude 3.5 → Llama-3.1-405B, all within ~2 points). Logistic fit + bootstrap CI committed to saturation.csv, regenerated by make curves.
Published analyses report MMLU items inside pretraining corpora (Deng et al., arXiv:2311.09783; MMLU-Pro, arXiv:2406.01574). Per editorial policy we report these findings; materiality versus saturation is debated, so saturation is ranked first.
Gemini 1.0 Ultra's headline 90.0% used CoT@32 voting while others reported 5-shot — late-life MMLU became as much about eval protocol as capability. Reported, not adjudicated.
Authors invited June 2026; none received to date. Any response will be published here unedited (modulo length and legal). These benchmarks were load-bearing — we roast with affection, never contempt.
Operational criteria, not vibes (ADR-0002). A burial requires the criteria be met; benchmarks showing early-warning signs go to the clearly-labeled hospice ward. When in doubt, hospice.
| № | Cause of death | Operational criterion | Evidence required |
|---|---|---|---|
| i. | Saturation |
Frontier scores sit within measurement noise of the ceiling for ≥ 2 model generations. | T2 · curve fit + bootstrap CI |
| ii. | Contamination |
Documented training-data leakage materially inflating reported scores. | T1 or T2 · cited & dated |
| iii. | Gaming |
Score gains without the underlying capability gains they claim to track. | T1 · A/B under protocol |
| iv. | Construct drift |
The metric stopped measuring the thing we actually care about. | T1 · argued + adjudicated |
| v. | Abandonment |
Quietly vanished from the model cards; no living maintainer, no funeral. | T3 · last sighting logged |
A falsifiable curve for every benchmark we miss.
Every obituary is backed by a validated record.yaml, a committed saturation.csv, an evidence tier on every claim, a public corrections log, and a standing right of reply for authors. Tagged releases receive a Zenodo DOI, so you can cite a death the way you'd cite anything else.