arlen/benchOpen benchmarks for agentic consumers
SNAPSHOT web_search-2026-q3
11 JUN 2026 · BERKELEY, CA
The Contract · Reproducible from a published pipeline

Methodology

Every score on this site is reproducible from a published pipeline. This page is the contract: how golden truth is built, how vendors are queried, how cells are scored, and what the bench deliberately does not claim.

§ 01

The Pipeline

Golden truth
Registry-sourced, hand-verified, or planted pages we control. Verified row-by-row before any vendor call.
70 / 30 split
Public split ships in the repo; holdout never leaves. Holdout queries batched with decoys.
Vendor adapters
One per vendor, transport-only. Endpoints, plan, and cost table pinned per snapshot.
Deterministic scoring
Exact / tolerance / hit@k. No LLM judge in current snapshots.
Immutable snapshot
Quarterly ID, raw responses archived, aggregates published on every surface.
§ 02

Metric Definitions

hit@k
Share of golden queries where a normalized golden URL appears in the vendor's top k results. URL normalization strips www., utm_* parameters, fragments, and trailing slashes.
extraction fidelity (0–1)
Weighted field score per page: title 0.15, anchor phrases 0.35, table cells 0.25, numeric facts 0.15, excluded noise 0.10. Weights renormalize over the fields a golden row defines. Correct ≥ 0.90; partial 0.50–0.90.
time-to-retrievability
Hours from a sentinel page's authoritative publish timestamp to its first appearance in vendor search results. Kaplan–Meier median over probes every 6h, right-censored at 30 days.
cost per correct
Total billed spend on the disclosed pinned plan ÷ verified-correct answers. Undefined (rendered ) when a vendor scores zero correct — never 0.
overfit gap
Public-split accuracy minus holdout accuracy, per vendor. A persistent positive gap indicates tuning to the published set.
agent-ready
An autonomous agent obtained a working API key in live trials via otp-email or device-code auth, with no human in the loop.
§ 03

Integrity

Holdout. 30% of every cohort is private. Holdout queries are interleaved with decoys so vendors cannot pattern-match the bench's traffic, and the public-vs-holdout gap is published per vendor.

Rotation. 25% of each cohort rotates every quarter, biased toward strata where vendor scores saturate. Dead golden rows are replaced from the same stratum within 7 days.

Plans and spend. We purchase the cheapest published plan that exposes the capability, archive the pricing page at review time, and disclose exactly which plan every number was billed on.

Vendor review. Every vendor receives its rows 10 business days before publication. Factual corrections trigger re-runs logged in a public errata file; methodology disputes get a standing right of reply, published verbatim and linked from the vendor's row.

Exclusions. Vendors without a programmatic endpoint, or whose terms prohibit automated evaluation, are listed with the reason rather than silently omitted — and are never assigned scores.

§ 04

Known Limitations

Counts and medians are comparable within a snapshot, not across vendors' differing internal windows. Strata are opinionated; edge cases are decided once and applied uniformly. Sentinel pages measure index freshness for pages we control — they do not measure ranking quality on competitive queries. Harness results depend on pinned framework versions and can shift when frameworks update; versions are disclosed per snapshot. Illustrative prototype data is labeled as such until a snapshot's first full run completes.