The Datasets · CC-BY-4.0 public splits
Evals & Golden Datasets
The datasets behind every score. Public splits are downloadable and CC-BY-4.0; holdouts never leave. Each row links to its design doc and JSON schema.
§ 01
Datasets
| Dataset | items | public / holdout | truth source | refresh | download |
|---|---|---|---|---|---|
| web_search · queryset | 250 | 175 / 75 | verified golden URLs | quarterly +25% rotation | jsonl |
| web_extraction · urlset | 300 | 210 / 90 | hand-verified fields | weekly drift checks | json ↓ |
| planted · sentinel pages | 48 | holdout until snapshot | we publish them | monthly, 24-cell matrix | protocol only |
| kyb · core cohort | 1000 | 700 / 300 | SoS · EDGAR · Companies House | registry deltas, daily | first run July |
§ 02
Row Schema — web_search example
{"id": "ws-0002",
"query": "california contractor license lookup official",
"stratum": "navigational_docs",
"golden_urls": ["https://www.cslb.ca.gov/onlineservices/checklicenseII/checklicense.aspx"],
"split": "public",
"verified_at": "2026-05-02T00:00:00Z"}
Anchor-phrase rules for extraction rows: 3–6 phrases per page, each under 15 words, drawn from distinct sections, never from boilerplate. Full schemas live in the repo with the scoring docstrings published verbatim on the methodology page.
§ 03
Overfit Gap — current snapshot
public-split accuracy − holdout accuracy · pp
Exa+0.9 pp
Tavily+0.6 pp
Serper+1.5 pp
Brave+0.4 pp
Firecrawl+1.1 pp
All gaps currently within noise (±2 pp at these cohort sizes). The column exists so that the day a vendor tunes to the public set, it shows.