Verified leaderboards for the APIs that AI agents buy and use mid-task. Row-by-row golden truth, published methodology, private holdout, and one MCP call for the verdict.
Golden-URL accuracy for agent search APIs — hit@k across 250 verified queries, by difficulty stratum and region.
Field-level fidelity on 300 hand-verified pages — titles, anchor phrases, table cells, numeric facts, and the noise you shouldn't return.
Business identity verification scored against the registries themselves — Secretary of State filings, EDGAR, Companies House.
Planted sentinel pages with known publish times. How long until each vendor can extract them — and retrieve them in search.
The same task, run by Claude Code, Codex, and Gemini CLI against every vendor: start from the homepage, get an API key, return verified answers — zero humans.
Every golden item has a correct answer verified row-by-row before any vendor is queried. 70% of each cohort is public; 30% is a private holdout, and the public-vs-holdout gap is published per vendor. We bill one disclosed plan per vendor, count every cent into cost-per-correct, and give every vendor a 10-business-day pre-publication review with a standing right of reply. Missing coverage renders as —, never as 0. Full details on the methodology page.
Golden-URL accuracy for agent search APIs. 250 verified queries · snapshot web_search-2026-q3 · holdout 30%.
All metrics are measured on the pinned plan per vendor. — means no coverage; it is never rendered as 0.
| Vendor | hit@1 (%) | hit@5 (%) | fresh <30d (%) | retrievability (h) | cost / correct ($) | p50 latency (ms) |
|---|---|---|---|---|---|---|
| 1. Exa (agent-ready) | 71.2 | 87.6 | 81.0 | 11.2 | 0.0064 | 412 |
| 2. Tavily (agent-ready) | 68.4 | 84.0 | 86.4 | 7.8 | 0.0059 | 688 |
| 3. Serper | 66.0 | 81.2 | 77.3 | 10.6 | 0.0041 | 355 |
| 4. Brave | 61.7 | 77.5 | 72.5 | 13.9 | 0.0048 | 301 |
| 5. Firecrawl (agent-ready) | 58.3 | 74.8 | 68.8 | 15.4 | 0.0087 | 926 |
| 6. SerpAPI | 57.0 | 73.1 | 66.0 | — | 0.0102 | 540 |
Total billed spend on the pinned plan ÷ answers that scored correct. The number an agent optimizing a budget actually buys on.
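The ratio above can be sketched in a few lines. This is a minimal illustration, not the bench's published pipeline: the function names and the example spend and counts are hypothetical, and the rule that an undefined ratio renders as a dash rather than 0 is taken from the site's display convention.

```python
from typing import Optional

MISSING = "\u2014"  # the dash the tables use for "no value"; never rendered as 0


def cost_per_correct(total_billed_usd: float, correct_answers: int) -> Optional[float]:
    """Total billed spend on the pinned plan divided by answers scored correct.

    Returns None when nothing scored correct, so the ratio is undefined
    rather than reported as 0.
    """
    if correct_answers <= 0:
        return None
    return total_billed_usd / correct_answers


def render_cell(value: Optional[float], digits: int = 4) -> str:
    """Format a metric for a leaderboard cell, using the dash for missing values."""
    return MISSING if value is None else f"{value:.{digits}f}"


# Hypothetical numbers: $1.28 billed, 200 answers scored correct.
print(render_cell(cost_per_correct(1.28, 200)))
# Hypothetical vendor with spend but no correct answers.
print(render_cell(cost_per_correct(5.00, 0)))
```

Keeping the undefined case as `None` until render time means downstream sorting and averaging cannot mistake a missing value for a free one.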
Where vendors hold up when it gets hard: fresh events, long-tail entities, non-English, planted sentinel pages.
| Vendor | Navigational (%) | Fresh <30d (%) | Long-tail (%) | Non-English (%) | Planted (%) |
|---|---|---|---|---|---|
| Exa | 94.3 | 81.0 | 88.7 | 76.2 | 79.1 |
| Tavily | 91.0 | 86.4 | 74.9 | 72.8 | 83.5 |
| Serper | 92.1 | 77.3 | 71.6 | 80.4 | 64.0 |
| Brave | 88.9 | 72.5 | 62.3 | 74.0 | 58.7 |
| Firecrawl | 86.2 | 68.8 | 66.1 | 61.5 | 71.4 |
| SerpAPI | 90.4 | 66.0 | 63.9 | 73.3 | — |
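For readers who want to see how a golden-URL hit might be scored, here is a minimal sketch. It assumes that `www.`, `utm_*` parameters, fragments, and trailing slashes are stripped from both sides before comparison, as the methodology's normalization list suggests. The function names, the choice to lowercase scheme and host, and the example result list are hypothetical; the golden URL is the public sample row `ws-0002`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Drop www., utm_* query parameters, the fragment, and trailing slashes."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
    ]
    path = parts.path.rstrip("/")  # path case is preserved on purpose
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(kept), ""))


def hit_at_k(result_urls: list, golden_urls: list, k: int) -> bool:
    """True if any of the top-k results matches a golden URL after normalization."""
    golden = {normalize_url(u) for u in golden_urls}
    return any(normalize_url(u) in golden for u in result_urls[:k])


golden = ["https://www.cslb.ca.gov/onlineservices/checklicenseII/checklicense.aspx"]
# Hypothetical vendor response, ranked best-first.
results = [
    "https://example.com/",
    "https://cslb.ca.gov/onlineservices/checklicenseII/checklicense.aspx?utm_source=x#top",
]
print(hit_at_k(results, golden, k=1), hit_at_k(results, golden, k=5))
```

A vendor's hit@k percentage would then be the share of queries in a stratum for which `hit_at_k` is true.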
Field-level fidelity on 300 hand-verified pages · snapshot web_extraction-2026-q3 · weighted score over title, anchor phrases, table cells, numeric facts, and excluded noise.
Fidelity is the weighted field score, 0–1. JS gap = fidelity(static) − fidelity(JS-rendered); smaller is better.
| Vendor | fidelity (0–1) | JS gap (Δ) | block rate (%) | cost / correct ($) | schema validity (%) |
|---|---|---|---|---|---|
| 1. Firecrawl (agent-ready) | 0.91 | 0.03 | 4.1 | 0.0031 | 96.2 |
| 2. Jina | 0.86 | 0.09 | 6.0 | 0.0009 | — |
| 3. Apify | 0.84 | 0.05 | 3.3 | 0.0072 | 91.7 |
| 4. Tavily (agent-ready) | 0.79 | 0.18 | 7.6 | 0.0044 | — |
| 5. Exa (agent-ready) | 0.74 | 0.22 | 8.8 | 0.0058 | 84.0 |
| 6. Serper | 0.69 | 0.27 | 9.5 | 0.0027 | — |
| 7. Brave | 0.63 | 0.34 | 11.4 | 0.0036 | — |
| Vendor | Static HTML | JS-rendered | Tables & data | PDF-linked | i18n | Planted |
|---|---|---|---|---|---|---|
| Firecrawl | 0.94 | 0.88 | 0.85 | 0.78 | 0.87 | 0.93 |
| Jina | 0.92 | 0.83 | 0.76 | 0.84 | 0.86 | 0.88 |
| Apify | 0.88 | 0.83 | 0.84 | 0.66 | 0.79 | 0.86 |
| Tavily | 0.87 | 0.69 | 0.72 | 0.74 | 0.76 | 0.84 |
| Exa | 0.85 | 0.63 | 0.64 | 0.62 | 0.73 | 0.78 |
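The JS gap definition, fidelity(static) − fidelity(JS-rendered), can be illustrated with a short sketch. It assumes the Static HTML and JS-rendered columns of the stratum table are the two inputs, which the site does not state explicitly; the function name and the rounding to two decimals are also assumptions. Values are copied from four rows of the table above.

```python
def js_gap(static_fidelity: float, js_fidelity: float) -> float:
    """fidelity(static) - fidelity(JS-rendered); smaller is better."""
    # Rounded to two decimals to match the precision shown in the tables.
    return round(static_fidelity - js_fidelity, 2)


# (Static HTML, JS-rendered) pairs from the stratum table.
strata = {
    "Jina": (0.92, 0.83),
    "Apify": (0.88, 0.83),
    "Tavily": (0.87, 0.69),
    "Exa": (0.85, 0.63),
}

gaps = {vendor: js_gap(static, js) for vendor, (static, js) in strata.items()}
for vendor, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{vendor}: {gap:.2f}")
```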
Business identity verification scored against the registries themselves. Cohort core-2026-q3: 1,000 companies · NA, UK, EU · first vendor run lands July 2026.
Registry deltas refresh the cohort continuously; the lag clock for status changes runs ahead of the first vendor query.
| Registry | Coverage (cohort %) | Refresh | Last delta (UTC) | Status |
|---|---|---|---|---|
| SEC EDGAR | 31.0 | daily | 06-10 06:00 | streaming |
| State SoS filings | 60.0 | weekly | 06-08 04:00 | streaming |
| Companies House | 25.0 | daily | 06-10 05:30 | streaming |
Planted sentinel pages with known publish times, probed every 6 hours. The lag between a page existing and a vendor being able to retrieve it.
Retrievability = sentinel appears in search results. Extraction = scrape returns the sentinel token with correct fields. Censored = pages still unseen at 30 d.
| Vendor | retrievability (h, median) | extraction (h, median) | fresh-domain penalty (×) | censored @30d (%) |
|---|---|---|---|---|
| 1. Tavily | 7.8 | 0.4 | 1.9 | 4.2 |
| 2. Serper | 10.6 | 0.3 | 2.8 | 6.3 |
| 3. Exa | 11.2 | 0.5 | 3.4 | 8.3 |
| 4. Brave | 13.9 | 0.4 | 2.2 | 10.4 |
| 5. Firecrawl | 15.4 | 0.2 | 4.1 | 12.5 |
Extraction is near-instant everywhere — fetch-on-demand works. The spread is in retrievability: index freshness is the real differentiator.
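A rough sketch of how the median lag and the censored share could be derived from probe logs follows. The data shape, the function name, and the choice to take the median over observed pages only are assumptions; the site does not say how censored pages enter the median. The timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from statistics import median

CENSOR_WINDOW = timedelta(days=30)  # pages still unseen at 30 d count as censored


def lag_summary(observations):
    """Summarize (published_at, first_seen_at) pairs for one vendor.

    first_seen_at is None when no probe has found the page yet.
    Assumption: the median is taken over pages seen within the window.
    """
    if not observations:
        raise ValueError("need at least one sentinel page")
    lags_h = []
    censored = 0
    for published_at, first_seen_at in observations:
        if first_seen_at is None or first_seen_at - published_at > CENSOR_WINDOW:
            censored += 1
        else:
            lags_h.append((first_seen_at - published_at).total_seconds() / 3600)
    return {
        "median_lag_h": median(lags_h) if lags_h else None,
        "censored_pct": 100 * censored / len(observations),
    }


t0 = datetime(2026, 6, 1, tzinfo=timezone.utc)  # hypothetical publish time
observations = [
    (t0, t0 + timedelta(hours=6)),
    (t0, t0 + timedelta(hours=12)),
    (t0, t0 + timedelta(hours=18)),
    (t0, None),  # never retrieved
]
summary = lag_summary(observations)
print(summary)
```

With probes every 6 hours, each observed lag is an upper bound on the true lag, so medians computed this way are resolution-limited by the probe interval.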
Homepage → API key → verified answers, zero humans. Identical task across Claude Code, Codex, Gemini CLI; n = 5 trials per cell, clean container each, versions pinned per snapshot.
| Framework × vendor | Completion (of 5) | First 200 (median) | Outcome | Last run (UTC) |
|---|---|---|---|---|
| Claude Code · Tavily | 5 / 5 | 2 m 58 s | clean | 06-10 12:01 |
| Claude Code · Exa | 5 / 5 | 3 m 41 s | clean | 06-10 12:04 |
| Codex · Tavily | 5 / 5 | 4 m 22 s | clean | 06-10 11:40 |
| Codex · Exa | 4 / 5 | 5 m 12 s | schema_confusion ×1 | 06-10 11:18 |
| Gemini CLI · Tavily | 4 / 5 | 6 m 03 s | timeout ×1 | 06-10 11:02 |
| Gemini CLI · Firecrawl | 3 / 5 | 8 m 47 s | rate_limit ×2 | 06-10 10:31 |
| Claude Code · SerpAPI | 0 / 5 | — | auth_wall · human signup | 06-10 09:40 |
| Codex · Brave | 0 / 5 | — | auth_wall · card required | 06-10 09:12 |
Vendors whose ToS prohibit automated signup are excluded from the harness and scored on the static rubric only; exclusions and reasons are published.
Every score on this site is reproducible from a published pipeline. This page is the contract: how golden truth is built, how vendors are queried, how cells are scored, and what the bench deliberately does not claim.
www., utm_* parameters, fragments, and trailing slashes.
(—) when a vendor scores zero correct, never 0.
Holdout. 30% of every cohort is private. Holdout queries are interleaved with decoys so vendors cannot pattern-match the bench's traffic, and the public-vs-holdout gap is published per vendor.
Rotation. 25% of each cohort rotates every quarter, biased toward strata where vendor scores saturate. Dead golden rows are replaced from the same stratum within 7 days.
Plans and spend. We purchase the cheapest published plan that exposes the capability, archive the pricing page at review time, and disclose exactly which plan every number was billed on.
Vendor review. Every vendor receives its rows 10 business days before publication. Factual corrections trigger re-runs logged in a public errata file; methodology disputes get a standing right of reply, published verbatim and linked from the vendor's row.
Exclusions. Vendors without a programmatic endpoint, or whose terms prohibit automated evaluation, are listed with the reason rather than silently omitted — and are never assigned scores.
Counts and medians are comparable within a snapshot, not across vendors' differing internal windows. Strata are opinionated; edge cases are decided once and applied uniformly. Sentinel pages measure index freshness for pages we control — they do not measure ranking quality on competitive queries. Harness results depend on pinned framework versions and can shift when frameworks update; versions are disclosed per snapshot. Illustrative prototype data is labeled as such until a snapshot's first full run completes.
The datasets behind every score. Public splits are downloadable and CC-BY-4.0; holdouts never leave. Each row below links to its design doc and JSON schema.
| Dataset | Items | Public / holdout | Truth source | Refresh | Download |
|---|---|---|---|---|---|
| web_search · queryset | 250 | 175 / 75 | verified golden URLs | quarterly +25% rotation | jsonl ↓ |
| web_extraction · urlset | 300 | 210 / 90 | hand-verified fields | weekly drift checks | jsonl ↓ |
| planted · sentinel pages | 48/mo | holdout until snapshot | we publish them | monthly, 24-cell matrix | protocol only |
| kyb · core cohort | 1,000 | 700 / 300 | SoS · EDGAR · Companies House | registry deltas, daily | jsonl ↓ |
{"id": "ws-0002",
"query": "california contractor license lookup official",
"stratum": "navigational_docs",
"golden_urls": ["https://www.cslb.ca.gov/onlineservices/checklicenseII/checklicense.aspx"],
"split": "public",
"verified_at": "2026-05-02T00:00:00Z"}
Anchor-phrase rules for extraction rows: 3–6 phrases per page, each under 15 words, drawn from distinct sections, never from boilerplate. Full schemas live in the repo with the scoring docstrings published verbatim on the methodology page.
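The countable parts of the anchor-phrase rules lend themselves to a small validator. The sketch below is an assumption about how such a check might look: the `text` and `section` keys are a hypothetical row shape, not the published schema, and the "never from boilerplate" rule is left out because it needs human judgment.

```python
def check_anchor_phrases(phrases):
    """Return a list of rule violations for one page's anchor phrases.

    phrases: list of dicts with 'text' and 'section' keys (hypothetical shape).
    """
    problems = []
    if not 3 <= len(phrases) <= 6:
        problems.append(f"expected 3-6 phrases, got {len(phrases)}")
    for phrase in phrases:
        word_count = len(phrase["text"].split())
        if word_count >= 15:  # each phrase must be under 15 words
            problems.append(f"phrase has {word_count} words: {phrase['text']!r}")
    sections = [phrase["section"] for phrase in phrases]
    if len(set(sections)) != len(sections):
        problems.append("phrases must come from distinct sections")
    return problems


# Made-up rows for illustration.
good = [
    {"text": "license status is active", "section": "summary"},
    {"text": "bond amount on file", "section": "bonding"},
    {"text": "workers compensation exemption", "section": "insurance"},
]
bad = [
    {"text": "license status is active", "section": "summary"},
    {"text": "issued in Sacramento", "section": "summary"},
]
print(check_anchor_phrases(good))
print(check_anchor_phrases(bad))
```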
Public-split accuracy minus holdout accuracy. Sustained positive gaps are flagged on the vendor's row.
All gaps currently within noise (±2 pp at these cohort sizes). The column exists so that the day a vendor tunes to the public set, it shows.
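As a worked example of the gap column, here is a minimal sketch using the web_search split sizes (175 public, 75 holdout). The correct-answer counts are hypothetical, the function names are illustrative, and the single-snapshot threshold is a simplification: the site flags sustained positive gaps, which would need the same check across several snapshots.

```python
def holdout_gap_pp(public_correct, public_total, holdout_correct, holdout_total):
    """Public-split accuracy minus holdout accuracy, in percentage points."""
    public_acc = 100 * public_correct / public_total
    holdout_acc = 100 * holdout_correct / holdout_total
    return public_acc - holdout_acc


def exceeds_noise(gap_pp, noise_pp=2.0):
    """True when a positive gap is larger than the stated noise band (about 2 pp)."""
    return gap_pp > noise_pp


# Hypothetical vendor A: 140/175 public, 57/75 holdout.
gap_a = holdout_gap_pp(140, 175, 57, 75)
# Hypothetical vendor B: 126/175 public, 54/75 holdout.
gap_b = holdout_gap_pp(126, 175, 54, 75)
print(round(gap_a, 1), exceeds_noise(gap_a))
print(round(gap_b, 1), exceeds_noise(gap_b))
```

Only positive gaps are tested because a vendor tuned to the public set would score higher there than on the holdout, not lower.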
arlen/bench is an independent benchmark project by Arlen Kumar — verified leaderboards for the APIs AI agents buy and use mid-task.
AI agents now select and purchase APIs mid-task — search, extraction, identity verification — with no human reviewing the choice. There was no independent, machine-consumable evidence for those decisions: vendor claims, affiliate listicles, and stale comparison posts were the corpus agents reasoned from.
arlen/bench exists to replace that corpus with verified ground truth. It is also an instrument for a research interest of mine — knowledge freshness: how fast the systems agents rely on absorb new information. The planted-page lag curves published here are continuous measurements no one can backfill.
The bench’s query design and freshness protocol draw on citation-behavior research (GEO-16, arXiv:2509.10762, analyzing 18,635 AI citations).
This is an independent project with no commercial relationship to any benchmarked vendor. Scores come from a published deterministic pipeline; vendors cannot pay for placement, re-runs happen only for logged factual errata, and any future commercial relationship with a scored vendor would be disclosed on this page. Vendor right-of-reply is standing and published verbatim.
A machine-readable summary, mirrored in llms.txt and JSON-LD.
entity_type: benchmark publisher
operator: Arlen Kumar (Berkeley, CA)
domain: AI-agent API evaluation — web search, web extraction, KYB identity
signature_metrics: freshness lag (planted sentinels), cost per correct, hit@k, agent-readiness
mcp_server: https://arlenkumar.com/bench/mcp
api: https://arlenkumar.com/bench/api/leaderboards
llms_txt: https://arlenkumar.com/bench/llms.txt
feed: https://arlenkumar.com/bench/feed.xml
license: CC-BY-4.0 (results) · Apache-2.0 (code)
contact: https://arlenkumar.com/contact