arlen/bench · Open benchmarks for agentic consumers
INDEPENDENT · CC-BY-4.0
UPDATED 12 JUN 2026 · BERKELEY, CA
Revision history · Immutable snapshots

Changelog & Snapshot History

Every arlen/bench snapshot revision is listed here with its date, so you can confirm that numbers were never quietly edited. Snapshots are immutable once published; a correction appears as a new dated row, never as a silent edit.

§ 01

Snapshot Revision History

Date · Snapshot · Status · What changed
2026-06-12 · web_search-2026-q3 · verified (reproducible, pre-review) · First verified scored run: n=62 public queries, hit@1/hit@5 measured (cost/freshness/latency pending). Exa 79.0% and SerpAPI 71.0% tied on hit@5 (McNemar p=0.40). Replaces the prior illustrative placeholder. Golden cohort expanded to 435 public for the next run.
(prior) · web_search · illustrative · Placeholder figures only; every value was marked status=illustrative and "must not be cited." No vendor run had been scored.
2026-06-10 · web_extraction-2026-q2 · verified (audited, pre-review) · Real scored run over 150 WCXB pages; independent 20% human audit, 30/30 agreed. Vendor right-of-reply pre-review (rows not yet sent).
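
The 2026-06-12 row calls the two vendors "tied" on hit@5 on the strength of a McNemar test. As a rough illustration of why a 79.0% vs 71.0% gap on 62 paired queries can fail to reach significance, here is a minimal sketch of an exact two-sided McNemar test. The changelog does not say which variant of the test was used, and the discordant counts below are hypothetical, chosen only to be consistent with a five-query gap; the real per-query outcomes are in the published rows JSON.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from discordant pair counts.

    b: queries where only system A scored a hit
    c: queries where only system B scored a hit
    Under the null hypothesis each discordant pair is a fair coin flip,
    so the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# HYPOTHETICAL discordant counts, not taken from the published run.
only_a_hit, only_b_hit = 14, 9
p_value = mcnemar_exact(only_a_hit, only_b_hit)
print(f"exact McNemar p = {p_value:.2f}")
```

Only the discordant pairs enter the test; queries where both systems hit or both missed carry no information about which one is better, which is why a visible gap in headline accuracy can still be statistically indistinguishable at this sample size.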

Per-run data is published (rows/results JSON) so any revision is externally diffable.
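
One way a reader might check this independently is to fingerprint each downloaded rows file and compare two revisions row by row. The sketch below assumes each row is a JSON object with a unique identifier; the `query_id` and `hit_at_5` field names and the sample rows are hypothetical, since the actual schema is not described on this page.

```python
import hashlib
import json


def canonical_digest(rows: list) -> str:
    """SHA-256 of a key-sorted, whitespace-free JSON serialization."""
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":"),
                         ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_rows(old_rows: list, new_rows: list, key: str = "query_id") -> dict:
    """Report row ids that were added, removed, or changed between revisions."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys()
                          if old[k] != new[k]),
    }


# HYPOTHETICAL rows standing in for two downloaded revisions.
revision_a = [
    {"query_id": "q001", "hit_at_5": True},
    {"query_id": "q002", "hit_at_5": False},
]
revision_b = [
    {"query_id": "q001", "hit_at_5": True},
    {"query_id": "q002", "hit_at_5": True},
    {"query_id": "q003", "hit_at_5": False},
]

if canonical_digest(revision_a) != canonical_digest(revision_b):
    print(diff_rows(revision_a, revision_b))
```

Sorting keys before hashing keeps the digest stable when only field order differs, so a digest mismatch points to a change in content rather than in serialization.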