What is the most accurate web search API for AI agents?

On the verified web_search-2026-q3 snapshot (n=299 public queries, descriptive form), Exa leads hit@5 at 80.9%, ahead of SerpAPI at 69.2%. Their 95% Wilson confidence intervals are disjoint and McNemar p≈0.001, so Exa is statistically separable as the leader. Both are ahead of Brave (58.9%) and Tavily (49.4%). Only hit@1 and hit@5 were measured in this run, so no cost, freshness or latency leader is claimed.
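To make the "disjoint Wilson intervals" statement concrete, here is a minimal sketch of the Wilson score interval applied to the two leading hit@5 rates. The hit counts (242 and 207 out of 299) are assumptions back-computed from the rounded percentages, not published figures, and the McNemar test is not reproduced because it needs per-query paired outcomes that are not shown here.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Assumed counts: 242/299 is about 80.9% (Exa), 207/299 is about 69.2% (SerpAPI).
exa_lo, exa_hi = wilson_interval(242, 299)
serp_lo, serp_hi = wilson_interval(207, 299)

# The intervals are disjoint when the runner-up's upper bound
# falls below the leader's lower bound.
intervals_disjoint = serp_hi < exa_lo

print(f"Exa     hit@5 95% CI: {exa_lo:.3f} to {exa_hi:.3f}")
print(f"SerpAPI hit@5 95% CI: {serp_lo:.3f} to {serp_hi:.3f}")
print("disjoint:", intervals_disjoint)
```

Non-overlapping intervals are a conservative check; the paired McNemar test is the sharper comparison when both vendors answer the same queries.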

How does arlen/bench measure freshness?

Sentinel pages carrying a unique token are deployed to sentinel.arlenkumar.com and probed for first appearance in each vendor's results, recording a not_indexed vs ranked_below_k discriminator. Freshness lag will be reported as the Kaplan-Meier median time from publication to first retrievability, right-censored at 30 days. For web_search-2026-q3 this curve is pending the first crawl, so the freshness numbers are not yet published.
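As an illustration of the planned freshness metric, the sketch below computes a Kaplan-Meier median with right-censoring at 30 days. The lag values are hypothetical (no freshness data has been published for this snapshot), and the function name `km_median` is an illustrative choice, not part of the benchmark's runner.

```python
from typing import Optional


def km_median(durations: list[float], observed: list[bool]) -> Optional[float]:
    """Kaplan-Meier median: first time at which estimated survival is <= 0.5.

    durations[i] is days from publication to first retrieval, or the
    censoring time (30) if the page never appeared; observed[i] is True
    when the retrieval event was actually seen.
    """
    at_risk = len(durations)
    survival = 1.0
    for t in sorted(set(durations)):
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        censored = sum(1 for d, o in zip(durations, observed) if d == t and not o)
        if events:
            survival *= 1 - events / at_risk
            if survival <= 0.5:
                return t
        at_risk -= events + censored
    return None  # median not reached within the observation window


# Hypothetical sentinel lags in days; two pages never appeared by day 30.
lags = [2, 3, 3, 5, 8, 30, 30]
seen = [True, True, True, True, True, False, False]
median_lag = km_median(lags, seen)
print("KM median lag (days):", median_lag)
```

Returning `None` when the curve never drops to 0.5 keeps an unreached median distinct from a lag of zero, in line with the pending-versus-zero convention used on this page.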

Were cost, freshness or latency measured for web search in this snapshot?

No. The web_search-2026-q3 run measured only hit@1 and hit@5 over 299 public queries. Cost-per-correct and latency metering and the sentinel freshness instrument have not yet been run, so those columns render as an em-dash (pending), never zero, and no cost, freshness or latency leader is claimed. Serper and Firecrawl were not scored on web search this snapshot and are excluded; agent onboarding was not tested in this run.

§ 02 · Leaderboard · Verified (audited, pre-review)

Web Extraction API Benchmark: Main-Content Fidelity

Main-content extraction fidelity on the WCXB cohort, for AI agents. verified · audited (20% human, 30/30) · vendor right-of-reply pre-review (notifications not yet sent) · cost provisional · weighted over title, required phrases, boilerplate exclusion.

Safe to cite: Exa leads main-content fidelity at 0.74 (95% CI 0.71–0.78, n=150); the lead over Firecrawl is statistically significant. Verified: audited (20% human re-verification, 30/30), pre-review. Cost-per-correct is provisional. See claims.json.

Answer first · web_extraction-2026-q2

Exa leads main-content fidelity at 0.74 on the audited WCXB cohort, though it is index-first (it serves from its index and live-fetches on cache miss), as disclosed below. Jina's phrase recall (0.59) is second only to Exa's, but it keeps the most boilerplate (exclusion just 0.26), so it ranks last on fidelity. Cost-per-correct is provisional: of the two vendors with a USD figure, Exa ($0.0047) is lower than Firecrawl ($0.0294); Tavily and Jina are not yet priced (rendered —, never 0).

Best fidelity

0.74

Exa · index-first

Best boilerplate excl.

0.76

Exa

Lowest priced/correct

$0.0047

Exa · provisional · 2 of 4 priced

Audit (20% sample)

30/30

of 150 pages · 100% agreed

§ 02.1

Full Leaderboard

Rank	Vendor	Fidelity (0–1)	95% CI	Phrase recall	Boilerplate excl.	Cost/correct ($)	Coverage (%)
1	Exa (index-first)	0.74	0.71–0.78	0.67	0.76	$0.0047	98.7
2	Firecrawl	0.66	0.63–0.70	0.58	0.64	$0.0294	97.3
3	Tavily	0.62	0.58–0.66	0.54	0.68	—	98.7
4	Jina	0.54	0.51–0.57	0.59	0.26	—	99.3

Fidelity is the weighted main-content score (title .20, required phrases .50, boilerplate exclusion .30) against WCXB gold; a page counts as correct at fidelity ≥ 0.90. Exa is index-first: it serves from its index and live-fetches on cache miss, so its latency is not a live-scrape time. A — in the cost column means the vendor plan rate is not yet archived (Tavily/Jina); the value is undefined, never 0. Cohort: 150 WCXB pages scored; 30 (20%) independently re-audited, 100% agreement. Exa's 95% CI (0.71–0.78) does not overlap Firecrawl's (0.63–0.70), so the #1 vs #2 gap is statistically significant.

Machine surface: /bench/api/web_extraction-2026-q2.json — the audited snapshot aggregate; vendor review pending (CC-BY-4.0). Corpus: WCXB (Foley, CC-BY-4.0).

Reproduce & cite — web_extraction-2026-q2 (verified)

Independent 20% human audit: 30/30 rows agreed (100%). Metric: fidelity = .20·title + .50·phrase-recall + .30·boilerplate-exclusion; correct ≥ 0.90. cost_per_correct provisional until per-vendor pricing PDFs are archived. Per-row raw responses available on request pending public mirror.
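The metric stated above can be sketched in a few lines. The weights and the 0.90 threshold come from this page; the component values in the example (title matched, 4 of 5 required phrases found, 3 of 4 boilerplate blocks excluded) are hypothetical, and how each component is actually scored against gold is an assumption.

```python
import math

# Weights and threshold as stated on this page.
WEIGHTS = {"title": 0.20, "phrase_recall": 0.50, "boilerplate_exclusion": 0.30}
CORRECT_THRESHOLD = 0.90


def fidelity(title: float, phrase_recall: float, boilerplate_exclusion: float) -> float:
    """Weighted main-content fidelity; each component is expected in [0, 1]."""
    return (
        WEIGHTS["title"] * title
        + WEIGHTS["phrase_recall"] * phrase_recall
        + WEIGHTS["boilerplate_exclusion"] * boilerplate_exclusion
    )


def is_correct(score: float) -> bool:
    """A page counts as correct when fidelity reaches the threshold."""
    return score >= CORRECT_THRESHOLD


# Hypothetical page: title matched, 4/5 phrases recalled, 3/4 boilerplate excluded.
example_score = fidelity(title=1.0, phrase_recall=4 / 5, boilerplate_exclusion=3 / 4)
print(f"fidelity={example_score:.3f} correct={is_correct(example_score)}")
```

Because phrase recall carries half the weight, a page that misses two of five required phrases cannot reach the 0.90 threshold even with a perfect title and full boilerplate exclusion.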

§ 02.2

Fidelity by Page Type

mean fidelity · WCXB strata

Page type	Exa	Firecrawl	Tavily	Jina
Article	0.84	0.74	0.69	0.59
Documentation	0.78	0.63	0.64	0.52
Forum	0.68	0.67	0.61	0.55
Listing	0.56	0.59	0.48	0.40
Product	0.70	0.59	0.51	0.48
Collection	0.74	0.65	0.58	0.53
Service	0.70	0.67	0.67	0.60

Exa leads every page type except listings, where Firecrawl (0.59) edges Exa (0.56). Choose by your dominant content type: docs-heavy RAG favors Exa (0.78 vs 0.64 for the runner-up), while forum (0.68 vs 0.67), service (0.70 vs 0.67) and listing extraction are much closer.
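For readers choosing by content type, the following sketch derives the leader and its margin over the runner-up for each stratum. The scores are copied from the table above; the helper name `leader_and_margin` is illustrative only, and the margins are differences of point estimates with no per-stratum significance test implied.

```python
VENDORS = ("Exa", "Firecrawl", "Tavily", "Jina")

# Mean fidelity per WCXB page type, in VENDORS order, from the table above.
BY_PAGE_TYPE = {
    "Article": (0.84, 0.74, 0.69, 0.59),
    "Documentation": (0.78, 0.63, 0.64, 0.52),
    "Forum": (0.68, 0.67, 0.61, 0.55),
    "Listing": (0.56, 0.59, 0.48, 0.40),
    "Product": (0.70, 0.59, 0.51, 0.48),
    "Collection": (0.74, 0.65, 0.58, 0.53),
    "Service": (0.70, 0.67, 0.67, 0.60),
}


def leader_and_margin(scores: tuple[float, ...]) -> tuple[str, float]:
    """Return the top vendor and its gap over the second-best score."""
    ranked = sorted(zip(scores, VENDORS), reverse=True)
    (best, name), (second, _) = ranked[0], ranked[1]
    return name, round(best - second, 2)


leaders = {page_type: leader_and_margin(s) for page_type, s in BY_PAGE_TYPE.items()}
for page_type, (name, margin) in leaders.items():
    print(f"{page_type:<14} {name:<10} +{margin:.2f}")
```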

§ 02.3

Example Audited Rows

public WCXB pages · per-vendor fidelity

Type	Page (gold main content)	req. phrases	boilerplate	Exa	Firecrawl	Tavily	Jina
article	gcccd.edu — Tips for Online Success	5	4	1.00	1.00	0.45	0.70
article	conductor.com — 2025 AI Search Trends: The Future of SEO & Content Marketing	5	4	0.90	0.90	0.90	0.60
documentation	doc.rust-lang.org — What is Ownership?	5	3	0.80	0.70	0.70	0.70
documentation	developer.mozilla.org — Using the Fetch API	5	4	0.62	0.53	0.53	0.45
forum	cooking.stackexchange.com — Catering event for 1st time. How should I prepare?	4	3	1.00	0.88	0.78	0.78
forum	forum.mssociety.org.uk — Struggling with getting Stronger and what my body is doing	5	4	0.42	0.35	0.65	0.57
listing	interiordesign.net — 2025 Best of Year Award Winners	5	4	0.50	0.72	0.80	0.57
listing	hospitalitynet.org — All Latest News	5	4	0.23	0.15	0.23	0.07
product	allbirds.com — Men's Wool Runner	5	4	0.60	0.30	0.57	0.30
product	mvmt.com — Moon Silver	5	3	0.90	0.90	1.00	0.90

Public WCXB pages (CC-BY, doi:10.5281/zenodo.19316874) — anyone can re-fetch and reproduce.
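The snapshot's confidence intervals are described as bootstrap CIs. As a rough illustration of that technique, the sketch below runs a percentile bootstrap over the ten example Exa scores listed above. This is not the full 150-page cohort, so the resulting interval will not match the published one, and the resample count and seed are arbitrary assumptions.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-page scores."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(scores)
    # Resample pages with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Exa's per-page fidelity on the ten example rows above.
exa_example_scores = [1.00, 0.90, 0.80, 0.62, 1.00, 0.42, 0.50, 0.23, 0.60, 0.90]
sample_mean = mean(exa_example_scores)
ci_lo, ci_hi = bootstrap_ci(exa_example_scores)
print(f"mean={sample_mean:.3f}  95% CI approx {ci_lo:.3f} to {ci_hi:.3f}")
```

With only ten rows the interval is wide; the published intervals are much narrower because they are computed over all 150 pages.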

§ 02.4

How to Read This · Who It's For

For RAG ingestion, weight fidelity + boilerplate exclusion (Exa leads, significantly). For cost-sensitive high volume, weigh cost-per-correct — provisional until pricing PDFs are archived. For current page state, note Exa is index-served (cached, not a live fetch); prefer a live fetcher when freshness matters. Use coverage when failed fetches are costly.

Independence & integrity. No commercial relationship with any benchmarked vendor; vendors cannot pay for placement (CC-BY-4.0). Independent 20% audit: 30/30 agreed (100%). Corrections log: none filed. Last audited: 2026-06-12. Vendor right-of-reply is open — pre-publication notifications not yet sent; any dispute will be published verbatim and linked here.

§ 02.5

Plan & Config Assumptions

exact request per vendor

Vendor	Endpoint	Params / mode	Billing	Cost/correct	Pricing as-of
Exa	`POST api.exa.ai/contents`	text=true · index-served (live-fetch on cache miss)	usage, USD	$0.0047	native USD; pricing PDF not yet archived
Firecrawl	`POST api.firecrawl.dev/v1/scrape`	formats=[markdown], onlyMainContent=true	credits (9.2/correct)	$0.0294 (provisional)	credit→USD estimate; PDF pending
Tavily	`POST api.tavily.com/extract`	urls=[url] (defaults)	credits (1.33/correct)	—	pricing PDF pending
Jina	`GET r.jina.ai/{url}`	X-Return-Format=markdown	tokens (1.95M/correct)	—	pricing PDF pending

These are the exact endpoints and parameters used in the scored run. Cost-per-correct is provisional until each vendor's pricing page is archived as a date-stamped PDF; until then, credit- and token-billed vendors without a USD estimate (Tavily, Jina) render —.
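To show how the table maps onto HTTP calls, here is a sketch that builds (but does not send) one request per vendor. The hosts, paths and listed parameters come from the table; the `https://` scheme, the authentication header names, the `urls`/`url` body field names and the `keys` dictionary are assumptions to check against each vendor's current documentation.

```python
import json
from urllib.request import Request


def _post(url: str, body: dict, headers: dict) -> Request:
    """Build a JSON POST request without sending it."""
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json", **headers},
    )


def build_requests(page_url: str, keys: dict) -> dict:
    """One prepared request per vendor, mirroring the config table."""
    return {
        # Header name and "urls" field are assumptions, not taken from this page.
        "Exa": _post("https://api.exa.ai/contents",
                     {"urls": [page_url], "text": True},
                     {"x-api-key": keys["exa"]}),
        "Firecrawl": _post("https://api.firecrawl.dev/v1/scrape",
                           {"url": page_url, "formats": ["markdown"],
                            "onlyMainContent": True},
                           {"Authorization": f"Bearer {keys['firecrawl']}"}),
        "Tavily": _post("https://api.tavily.com/extract",
                        {"urls": [page_url]},
                        {"Authorization": f"Bearer {keys['tavily']}"}),
        # Jina's reader takes the target URL in the path.
        "Jina": Request(f"https://r.jina.ai/{page_url}", method="GET",
                        headers={"X-Return-Format": "markdown"}),
    }


# Placeholder keys; nothing is sent over the network here.
prepared = build_requests(
    "https://example.com/page",
    {"exa": "EXA_KEY", "firecrawl": "FC_KEY", "tavily": "TV_KEY"},
)
for vendor, req in prepared.items():
    print(vendor, req.get_method(), req.full_url)
```

Passing a prepared request to `urllib.request.urlopen` would execute it; keeping construction separate makes the exact configuration easy to log alongside each scored row.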

§ 02.6

Vendor Response Status

right of reply

Vendor	Rows sent	Review status	Response
Exa	not yet	pending	—
Firecrawl	not yet	pending	—
Tavily	not yet	pending	—
Jina	not yet	pending	—

Pre-publication notifications have not yet been sent; right of reply is standing. When a vendor receives its rows and replies, this table updates and the response is published verbatim. No vendor can pay for placement.

§ 02.7

Data Access

what's published · what's on request

run date: Scored 2026-06-10; published 2026-06-12. Snapshot web_extraction-2026-q2 is immutable (content-hashed).
aggregates + provenance: Public machine surface: web_extraction-2026-q2.json (per-vendor fidelity, bootstrap CIs, by-page-type, exact request config, audit hash).
atomic claims: Status-tagged, caveat-bearing one-line claims: claims.json.
golden corpus: WCXB pages are public (CC-BY-4.0, doi:10.5281/zenodo.19316874) — anyone can re-fetch and reproduce against the gold.
per-row raw responses: Each vendor's raw extracted output per page is available on request pending a public mirror; the reproducibility bundle (rows, scored cells, runner) is being published at FindingsWebExtract.
pricing: Cost-per-correct is provisional until each vendor's pricing page is archived as a date-stamped PDF (Exa & Firecrawl native USD; Tavily/Jina pending).
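The "pending, never zero" rule for cost can be sketched as follows. The units-per-correct figures (9.2 credits for Firecrawl, 1.33 credits for Tavily) come from the config table; the Firecrawl per-credit rate is back-computed from the provisional $0.0294 figure and is an assumption, not a published price.

```python
from typing import Optional


def cost_per_correct(units_per_correct: float,
                     usd_per_unit: Optional[float]) -> Optional[float]:
    """USD spent per correct extraction, or None when the rate is unknown."""
    if usd_per_unit is None:
        return None  # undefined, deliberately not 0
    return units_per_correct * usd_per_unit


def render_cost(value: Optional[float]) -> str:
    """Render a cost cell: an em-dash for pending, otherwise four decimals."""
    return "—" if value is None else f"${value:.4f}"


# Assumed rate implied by the provisional figure: $0.0294 / 9.2 credits.
firecrawl_rate = 0.0294 / 9.2
firecrawl_cell = render_cost(cost_per_correct(9.2, firecrawl_rate))

# Tavily's credit price is not yet archived, so its cell stays pending.
tavily_cell = render_cost(cost_per_correct(1.33, None))

print("Firecrawl:", firecrawl_cell)
print("Tavily:   ", tavily_cell)
```

Using `None` rather than `0.0` for an unpriced vendor prevents a pending cell from sorting as the cheapest option.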