arlen/bench · Open benchmarks for agentic consumers
SNAPSHOT web_search-2026-q3
11 JUN 2026 · BERKELEY, CA
The Datasets · CC-BY-4.0 public splits

Evals & Golden Datasets

The datasets behind every score. Public splits are downloadable and CC-BY-4.0; holdouts never leave. Each row links to its design doc and JSON schema.

§ 01

Datasets

| Dataset | Items | Public / holdout | Truth source | Refresh | Download |
| --- | --- | --- | --- | --- | --- |
| web_search · queryset | 250 | 175 / 75 | verified golden URLs | quarterly, +25% rotation | jsonl |
| web_extraction · urlset | 300 | 210 / 90 | hand-verified fields | weekly drift checks | json ↓ |
| planted · sentinel pages | 48 | holdout until snapshot | we publish them | monthly, 24-cell matrix | protocol only |
| kyb · core cohort | 1000 | 700 / 300 | SoS · EDGAR · Companies House | registry deltas, daily | first run July |
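The split counts above can be sanity-checked with a few lines of arithmetic. The sketch below is only an illustration: the dictionary and function names are made up here, and the counts are copied from the table. For the three datasets that list both splits, public plus holdout equals the item count and the public share works out to 70%.

```python
# (items, public, holdout) counts copied from the datasets table above.
# "planted" is omitted because its rows are holdout until snapshot.
DATASET_COUNTS = {
    "web_search": (250, 175, 75),
    "web_extraction": (300, 210, 90),
    "kyb": (1000, 700, 300),
}


def public_share(items: int, public: int, holdout: int) -> float:
    """Return the public fraction, after checking the two splits add up."""
    if public + holdout != items:
        raise ValueError(f"splits sum to {public + holdout}, expected {items}")
    return public / items


# Hypothetical helper mapping: dataset name -> public fraction.
SHARES = {name: public_share(*counts) for name, counts in DATASET_COUNTS.items()}

if __name__ == "__main__":
    for name, share in SHARES.items():
        print(f"{name}: {share:.0%} public / {1 - share:.0%} holdout")
```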
§ 02

Row Schema — web_search example

{"id": "ws-0002",
 "query": "california contractor license lookup official",
 "stratum": "navigational_docs",
 "golden_urls": ["https://www.cslb.ca.gov/onlineservices/checklicenseII/checklicense.aspx"],
 "split": "public",
 "verified_at": "2026-05-02T00:00:00Z"}
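A consumer of the public jsonl might want to check rows before scoring against them. The following is a minimal sketch, not the published JSON schema: the field names come from the example row above, while the allowed split values, the type checks, the URL prefix check, and the function name `validate_row` are assumptions made for illustration.

```python
import json
from datetime import datetime

# The example row from above, as it might appear on one line of the jsonl file.
EXAMPLE_LINE = (
    '{"id": "ws-0002", '
    '"query": "california contractor license lookup official", '
    '"stratum": "navigational_docs", '
    '"golden_urls": ["https://www.cslb.ca.gov/onlineservices/checklicenseII/checklicense.aspx"], '
    '"split": "public", '
    '"verified_at": "2026-05-02T00:00:00Z"}'
)

# Field names are taken from the example row; the expected types are assumed.
REQUIRED_FIELDS = {
    "id": str,
    "query": str,
    "stratum": str,
    "golden_urls": list,
    "split": str,
    "verified_at": str,
}
ASSUMED_SPLITS = {"public", "holdout"}  # assumption, not from the schema


def validate_row(row: dict) -> list[str]:
    """Return a list of problems found in one web_search row (empty if none)."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected_type):
            problems.append(f"wrong type for {name}")
    if problems:
        return problems  # later checks rely on the fields being present

    if not row["golden_urls"]:
        problems.append("golden_urls is empty")
    for url in row["golden_urls"]:
        if not (isinstance(url, str) and url.startswith("http")):
            problems.append(f"golden url does not look like a URL: {url!r}")
    if row["split"] not in ASSUMED_SPLITS:
        problems.append(f"unexpected split: {row['split']!r}")
    try:
        datetime.strptime(row["verified_at"], "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        problems.append("verified_at is not a timestamp like 2026-05-02T00:00:00Z")
    return problems


if __name__ == "__main__":
    print(validate_row(json.loads(EXAMPLE_LINE)))
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a file in a single pass.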

Anchor-phrase rules for extraction rows: 3–6 phrases per page, each under 15 words, drawn from distinct sections, never from boilerplate. Full schemas live in the repo, and the scoring docstrings are published verbatim on the methodology page.
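The countable parts of the anchor-phrase rules can be expressed as a small check. This is a sketch under assumptions: the `text` and `section` keys, the function name, and the sample phrases are invented for illustration and are not taken from the extraction schema. "Distinct sections" is read here as no two phrases sharing a section. The boilerplate rule is not checked, since it needs a judgment about the page itself.

```python
MIN_PHRASES = 3
MAX_PHRASES = 6
MAX_WORDS_EXCLUSIVE = 15  # "each under 15 words"


def check_anchor_phrases(phrases: list[dict]) -> list[str]:
    """Return rule violations for one page's anchor phrases (empty if none).

    Each phrase is assumed to be a dict with hypothetical keys
    "text" (the phrase) and "section" (a label for where it was drawn from).
    """
    problems = []
    if not MIN_PHRASES <= len(phrases) <= MAX_PHRASES:
        problems.append(f"expected 3-6 phrases, got {len(phrases)}")
    for phrase in phrases:
        word_count = len(phrase["text"].split())
        if word_count >= MAX_WORDS_EXCLUSIVE:
            problems.append(f"phrase has {word_count} words: {phrase['text']!r}")
    sections = [phrase["section"] for phrase in phrases]
    if len(set(sections)) != len(sections):
        problems.append("two or more phrases share a section")
    return problems


# Invented sample data, for illustration only.
SAMPLE_PHRASES = [
    {"text": "License status is current and active", "section": "status"},
    {"text": "Bond amount on file with the board", "section": "bonding"},
    {"text": "Workers compensation exemption certificate", "section": "insurance"},
]

if __name__ == "__main__":
    print(check_anchor_phrases(SAMPLE_PHRASES))
```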

§ 03

Overfit Gap — current snapshot

Public-split accuracy − holdout accuracy, in percentage points (pp):

| Vendor | Overfit gap |
| --- | --- |
| Exa | +0.9 pp |
| Tavily | +0.6 pp |
| Serper | +1.5 pp |
| Brave | +0.4 pp |
| Firecrawl | +1.1 pp |

All gaps are currently within noise (±2 pp at these cohort sizes). The column exists so that the day a vendor tunes to the public set, it shows.
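The gap and the noise check can be written down directly. The sketch below assumes accuracies are stored as fractions between 0 and 1 and treats the ±2 pp band as a simple absolute threshold. The function and constant names are hypothetical, and how the band was derived from the cohort sizes is not reproduced here. The gap values are the ones listed above.

```python
NOISE_BAND_PP = 2.0  # from the text: ±2 pp at these cohort sizes

# Gaps for the current snapshot, copied from the list above (in pp).
PUBLISHED_GAPS_PP = {
    "Exa": 0.9,
    "Tavily": 0.6,
    "Serper": 1.5,
    "Brave": 0.4,
    "Firecrawl": 1.1,
}


def overfit_gap_pp(public_accuracy: float, holdout_accuracy: float) -> float:
    """Public-split accuracy minus holdout accuracy, in percentage points.

    Both inputs are assumed to be fractions in [0, 1].
    """
    return (public_accuracy - holdout_accuracy) * 100.0


def within_noise(gap_pp: float, band_pp: float = NOISE_BAND_PP) -> bool:
    """True when the gap's magnitude does not exceed the noise band."""
    return abs(gap_pp) <= band_pp


# Vendors whose gap falls outside the band; empty for the current snapshot.
FLAGGED = [name for name, gap in PUBLISHED_GAPS_PP.items() if not within_noise(gap)]

if __name__ == "__main__":
    print("flagged vendors:", FLAGGED)
    # Invented accuracies, to show what a tuned-to-public result would look like.
    print(within_noise(overfit_gap_pp(0.80, 0.75)))
```

With invented accuracies of 0.80 on the public split and 0.75 on the holdout, the gap is about 5 pp, which falls outside the band and would be flagged.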