arlen/bench
2026-q3 · updated 06-10 14:00 UTC

Open benchmarks for agentic consumers

Verified leaderboards for the APIs that AI agents buy and use mid-task. Row-by-row golden truth, published methodology, private holdout, and one MCP call for the verdict.

Golden items
1,550
Vendors scored
17
Sentinel pages live
48
recommend() calls, 7 d
1,552
License
CC-BY-4.0
01

Web Search

live

Golden-URL accuracy for agent search APIs — hit@k across 250 verified queries, by difficulty stratum and region.

Leader · hit@5
Exa · 87.6%
Best cost/correct
$0.0041
Vendors
6
open →
02

Web Extraction

live

Field-level fidelity on 300 hand-verified pages — titles, anchor phrases, table cells, numeric facts, and the noise you shouldn't return.

Leader · fidelity
Firecrawl · 0.91
Worst JS gap
−0.34
Vendors
7
open →
03

KYB Identity

first run · July

Business identity verification scored against the registries themselves — Secretary of State filings, EDGAR, Companies House.

Golden cohort
1,000
Regions
NA · UK · EU
Vendors queued
4
open →
04

Freshness Lag

live

Planted sentinel pages with known publish times. How long until each vendor can extract them — and retrieve them in search.

Median retrievability
9.4 h
Fastest vendor
Tavily · 7.8 h
Pages / month
48
open →
05

Agent Harness

live

The same task, run by Claude Code, Codex, and Gemini CLI against every vendor: start from the homepage, get an API key, return verified answers — zero humans.

Best framework × vendor
Claude Code · Tavily · 2 m 58 s
Agent-ready vendors
3 / 6
Trials this snapshot
150
open →

How the bench works

Every golden item has a correct answer verified row-by-row before any vendor is queried. 70% of each cohort is public; 30% is a private holdout, and the public-vs-holdout gap is published per vendor. We bill one disclosed plan per vendor, count every cent into cost-per-correct, and give every vendor a 10-day pre-publication review with a standing right of reply. Missing coverage renders as , never as 0. Full details on the methodology page.

bench › leaderboards › web-extraction

Web Extraction

Field-level fidelity on 300 hand-verified pages · snapshot web_extraction-2026-q3 · weighted score over title, anchor phrases, table cells, numeric facts, and excluded noise.

Best fidelity — Firecrawl
0.91
+0.02vs Q2
Best on JS-rendered
0.88
Firecrawl · n = 38
Best cost / correct — Jina
$0.0009
pinned plan: reader
Highest block rate — Brave
11.4%
+2.1 ppvs Q2
A

Leaderboard

Fidelity is the weighted field score, 0–1. JS gap = fidelity(static) − fidelity(JS-rendered); smaller is better.

Vendor fidelity0–1 JS gapΔ block rate% cost / correct$ schema validity%
1Firecrawlagent-ready0.910.034.10.003196.2
2Jina0.860.096.00.0009
3Apify0.840.053.30.007291.7
4Tavilyagent-ready0.790.187.60.0044
5Exaagent-ready0.740.228.80.005884.0
6Serper0.690.279.50.0027
7Brave0.630.3411.40.0036
B

Fidelity by stratum

VendorStatic HTMLJS-renderedTables & dataPDF-linkedi18nPlanted
Firecrawl0.940.880.850.780.870.93
Jina0.920.830.760.840.860.88
Apify0.880.830.840.660.790.86
Tavily0.870.690.720.740.760.84
Exa0.850.630.640.620.730.78
cell = mean fidelity in stratum≥ 0.880.80–0.880.70–0.80< 0.70
bench › leaderboards › kyb-identity

KYB Identity

Business identity verification scored against the registries themselves. Cohort core-2026-q3: 1,000 companies · NA, UK, EU · first vendor run lands July 2026.

Golden cohort
1,000
600 NA · 250 UK · 150 EU
Registry sources
3
SoS · EDGAR · Companies House
Fields scored
6
exact match per field
Vendors in ToS review
4
Middesk · Persona · Baselayer · Alloy
A

Golden truth status

Registry deltas refresh the cohort continuously; the lag clock for status changes runs ahead of the first vendor query.

RegistryCoveragecohort %RefreshLast deltaUTCStatus
SEC EDGAR31.0daily06-10 06:00streaming
State SoS filings60.0weekly06-08 04:00streaming
Companies House25.0daily06-10 05:30streaming
bench › instruments › freshness-lag

Freshness Lag

Planted sentinel pages with known publish times, probed every 6 hours. The lag between a page existing and a vendor being able to retrieve it.

9.4 hmedian time-to-retrievability, all vendors−12% vs prior 7 d
48 sentinel pages · Kaplan–Meier, right-censored 30 d · 24-cell matrix: domain age × rendering × discoverability × region
Tavily Serper Exa Brave
24h16h 8h0h 7.8 10.6
Jun 03Jun 04Jun 05Jun 06Jun 07Jun 08Jun 09
A

Per vendor

Retrievability = sentinel appears in search results. Extraction = scrape returns the sentinel token with correct fields. Censored = pages still unseen at 30 d.

Vendor retrievabilityh, med extractionh, med fresh-domain penalty× censored @30d%
1Tavily7.80.41.94.2
2Serper10.60.32.86.3
3Exa11.20.53.48.3
4Brave13.90.42.210.4
5Firecrawl15.40.24.112.5

Extraction is near-instant everywhere — fetch-on-demand works. The spread is in retrievability: index freshness is the real differentiator.

bench › instruments › agent-harness

Agent Harness

Homepage → API key → verified answers, zero humans. Identical task across Claude Code, Codex, Gemini CLI; n = 5 trials per cell, clean container each, versions pinned per snapshot.

Trials this snapshot
150
3 frameworks × 6 vendors × 5
Agent-ready vendors
3 / 6
otp-email or device-code auth
Fastest onboarding
2m 58s
Claude Code · Tavily
Top failure mode
auth_wall
42% of all failures
A

Completion by framework × vendor

Framework × vendorCompletionof 5First 200medianOutcomeLast runUTC
Claude Code · Tavily5 / 52 m 58 sclean06-10 12:01
Claude Code · Exa5 / 53 m 41 sclean06-10 12:04
Codex · Tavily5 / 54 m 22 sclean06-10 11:40
Codex · Exa4 / 55 m 12 sschema_confusion ×106-10 11:18
Gemini CLI · Tavily4 / 56 m 03 stimeout ×106-10 11:02
Gemini CLI · Firecrawl3 / 58 m 47 srate_limit ×206-10 10:31
Claude Code · SerpAPI0 / 5auth_wall · human signup06-10 09:40
Codex · Brave0 / 5auth_wall · card required06-10 09:12

Vendors whose ToS prohibit automated signup are excluded from the harness and scored on the static rubric only; exclusions and reasons are published.

benchmethodology

Methodology

Every score on this site is reproducible from a published pipeline. This page is the contract: how golden truth is built, how vendors are queried, how cells are scored, and what the bench deliberately does not claim.

01

The pipeline

Golden truth
Registry-sourced, hand-verified, or planted pages we control. Verified row-by-row before any vendor call.
70 / 30 split
Public split ships in the repo; holdout never leaves. Holdout queries batched with decoys.
Vendor adapters
One per vendor, transport-only. Endpoints, plan, and cost table pinned per snapshot.
Deterministic scoring
Exact / tolerance / hit@k. No LLM judge in current snapshots.
Immutable snapshot
Quarterly ID, raw responses archived, aggregates published on every surface.
02

Metric definitions

hit@k
Share of golden queries where a normalized golden URL appears in the vendor's top k results. URL normalization strips www., utm_* parameters, fragments, and trailing slashes.
extraction fidelity (0–1)
Weighted field score per page: title 0.15, anchor phrases 0.35, table cells 0.25, numeric facts 0.15, excluded noise 0.10. Weights renormalize over the fields a golden row defines. Correct ≥ 0.90; partial 0.50–0.90.
time-to-retrievability
Hours from a sentinel page's authoritative publish timestamp to its first appearance in vendor search results. Kaplan–Meier median over probes every 6 h, right-censored at 30 days.
cost per correct
Total billed spend on the disclosed pinned plan ÷ verified-correct answers. Undefined (rendered ) when a vendor scores zero correct — never 0.
overfit gap
Public-split accuracy minus holdout accuracy, per vendor. A persistent positive gap indicates tuning to the published set.
agent-ready
An autonomous agent obtained a working API key in live trials via otp-email or device-code auth, with no human in the loop.
03

Integrity

Holdout. 30% of every cohort is private. Holdout queries are interleaved with decoys so vendors cannot pattern-match the bench's traffic, and the public-vs-holdout gap is published per vendor.

Rotation. 25% of each cohort rotates every quarter, biased toward strata where vendor scores saturate. Dead golden rows are replaced from the same stratum within 7 days.

Plans and spend. We purchase the cheapest published plan that exposes the capability, archive the pricing page at review time, and disclose exactly which plan every number was billed on.

Vendor review. Every vendor receives its rows 10 business days before publication. Factual corrections trigger re-runs logged in a public errata file; methodology disputes get a standing right of reply, published verbatim and linked from the vendor's row.

Exclusions. Vendors without a programmatic endpoint, or whose terms prohibit automated evaluation, are listed with the reason rather than silently omitted — and are never assigned scores.

04

Known limitations

Counts and medians are comparable within a snapshot, not across vendors' differing internal windows. Strata are opinionated; edge cases are decided once and applied uniformly. Sentinel pages measure index freshness for pages we control — they do not measure ranking quality on competitive queries. Harness results depend on pinned framework versions and can shift when frameworks update; versions are disclosed per snapshot. Illustrative prototype data is labeled as such until a snapshot's first full run completes.

benchevals

Evals & golden datasets

The datasets behind every score. Public splits are downloadable and CC-BY-4.0; holdouts never leave. Each row below links to its design doc and JSON schema.

01

Datasets

DatasetItemsPublic / holdoutTruth sourceRefreshDownload
web_search · queryset250175 / 75verified golden URLsquarterly +25% rotationjsonl ↓
web_extraction · urlset300210 / 90hand-verified fieldsweekly drift checksjsonl ↓
planted · sentinel pages48/moholdout until snapshotwe publish themmonthly, 24-cell matrixprotocol only
kyb · core cohort1,000700 / 300SoS · EDGAR · Companies Houseregistry deltas, dailyjsonl ↓
02

Row schema — web_search example

{"id": "ws-0002",
 "query": "california contractor license lookup official",
 "stratum": "navigational_docs",
 "golden_urls": ["https://www.cslb.ca.gov/onlineservices/checklicenseII/checklicense.aspx"],
 "split": "public",
 "verified_at": "2026-05-02T00:00:00Z"}

Anchor-phrase rules for extraction rows: 3–6 phrases per page, each under 15 words, drawn from distinct sections, never from boilerplate. Full schemas live in the repo with the scoring docstrings published verbatim on the methodology page.

03

Overfit gap, current snapshot

Public-split accuracy minus holdout accuracy. Sustained positive gaps are flagged on the vendor's row.

Exa+0.9 pp
Tavily+0.6 pp
Serper+1.5 pp
Brave+0.4 pp
Firecrawl+1.1 pp

All gaps currently within noise (±2 pp at these cohort sizes). The column exists so that the day a vendor tunes to the public set, it shows.

benchabout

About

arlen/bench is an independent benchmark project by Arlen Kumar — verified leaderboards for the APIs AI agents buy and use mid-task.

01

Why we run this

AI agents now select and purchase APIs mid-task — search, extraction, identity verification — with no human reviewing the choice. There was no independent, machine-consumable evidence for those decisions: vendor claims, affiliate listicles, and stale comparison posts were the corpus agents reasoned from.

arlen/bench exists to replace that corpus with verified ground truth. It is also an instrument for a research interest of mine — knowledge freshness: how fast the systems agents rely on absorb new information. The planted-page lag curves published here are continuous measurements no one can backfill.

The bench’s query design and freshness protocol draw on citation-behavior research (GEO-16, arXiv:2509.10762, analyzing 18,635 AI citations).

02

Independence

This is an independent project with no commercial relationship to any benchmarked vendor. Scores come from a published deterministic pipeline; vendors cannot pay for placement, re-runs happen only for logged factual errata, and any future commercial relationship with a scored vendor would be disclosed on this page. Vendor right-of-reply is standing and published verbatim.

03

Entity record

A machine-readable summary, mirrored in llms.txt and JSON-LD.

entity_type: benchmark publisher
operator: Arlen Kumar (Berkeley, CA)
domain: AI-agent API evaluation — web search, web extraction, KYB identity
signature_metrics: freshness lag (planted sentinels), cost per correct, hit@k, agent-readiness
mcp_server: https://arlenkumar.com/bench/mcp
api: https://arlenkumar.com/bench/api/leaderboards
llms_txt: https://arlenkumar.com/bench/llms.txt
feed: https://arlenkumar.com/bench/feed.xml
license: CC-BY-4.0 (results) · Apache-2.0 (code)
contact: https://arlenkumar.com/contact