What is the most accurate web search API for AI agents?

On the verified web_search-2026-q3 snapshot (n=299 public queries, descriptive form), Exa leads hit@5 at 80.9%, ahead of SerpAPI at 69.2%. Their 95% Wilson confidence intervals are disjoint and McNemar p≈0.001, so the two are statistically separable at n=299. Both are ahead of Brave (58.9%) and Tavily (49.4%). Only hit@1 and hit@5 were measured in this run, so no cost, freshness or latency leader is claimed.
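The disjoint-interval claim can be checked by hand with a minimal sketch like the one below. It assumes the hit counts can be back-derived from the rounded percentages and n=299 (242 and 207 are inferred here, not published figures), and it uses the textbook Wilson score formula rather than whatever code the benchmark itself runs.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# ASSUMPTION: counts are back-derived from the rounded percentages, not published.
N = 299
exa_hits = round(0.809 * N)      # 242
serpapi_hits = round(0.692 * N)  # 207

exa_lo, exa_hi = wilson_interval(exa_hits, N)
serp_lo, serp_hi = wilson_interval(serpapi_hits, N)

# The intervals are disjoint when SerpAPI's upper bound sits below Exa's lower bound.
disjoint = serp_hi < exa_lo
print(f"Exa:     [{exa_lo:.3f}, {exa_hi:.3f}]")
print(f"SerpAPI: [{serp_lo:.3f}, {serp_hi:.3f}]")
print("disjoint:", disjoint)
```

Under those assumed counts the two intervals do not overlap, which is consistent with the statement above.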

How does arlen/bench measure freshness?

Sentinel pages carrying a unique token are deployed to sentinel.arlenkumar.com and probed for first appearance in each vendor's results, recording a not_indexed vs ranked_below_k discriminator. Freshness lag will be reported as the Kaplan-Meier median time from publication to first retrievability, right-censored at 30 days. For web_search-2026-q3 this curve is pending the first crawl, so the freshness numbers are not yet published.
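As an illustration of the planned statistic, here is a minimal hand-rolled Kaplan-Meier median with right-censoring at 30 days. The function names and the sample lags are hypothetical; no freshness data has been published, so the numbers below are made up purely to exercise the estimator.

```python
from fractions import Fraction
from typing import Optional, Sequence

HORIZON_DAYS = 30  # right-censoring horizon stated in the text

def censor(lag_days: Optional[float], horizon: float = HORIZON_DAYS) -> tuple[float, bool]:
    """Map a raw lag to (duration, observed). None means the page was never retrieved."""
    if lag_days is None or lag_days > horizon:
        return horizon, False
    return lag_days, True

def km_median(samples: Sequence[tuple[float, bool]]) -> Optional[float]:
    """Smallest time t where the Kaplan-Meier survival estimate S(t) <= 0.5.

    Returns None when the curve never reaches 0.5 (median not estimable).
    """
    ordered = sorted(samples)
    at_risk = len(ordered)
    survival = Fraction(1)  # exact arithmetic avoids float noise at the 0.5 boundary
    i = 0
    while i < len(ordered):
        t = ordered[i][0]
        events = leaving = 0
        while i < len(ordered) and ordered[i][0] == t:
            events += ordered[i][1]   # True counts as one observed retrieval
            leaving += 1
            i += 1
        if events:
            survival *= Fraction(at_risk - events, at_risk)
            if survival <= Fraction(1, 2):
                return t
        at_risk -= leaving
    return None

# HYPOTHETICAL lags in days for seven sentinel pages; None = never seen.
raw_lags = [2, 5, 9, 12, 20, None, 41]
median_lag = km_median([censor(x) for x in raw_lags])
print(median_lag)
```

Pages that never appear, or appear after day 30, still count in the at-risk denominator until the horizon, which is what keeps the median from being biased toward fast-indexed pages.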

Were cost, freshness or latency measured for web search in this snapshot?

No. The web_search-2026-q3 run measured only hit@1 and hit@5 over 299 public queries. Cost-per-correct metering, latency metering and the sentinel freshness instrument have not yet been run, so those columns render as an em-dash (pending), never zero, and no cost, freshness or latency leader is claimed. Serper and Firecrawl were not scored on web search in this snapshot and are excluded; agent onboarding was not tested in this run.
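The "pending, never zero" rule is easy to get wrong in rendering code, so a small sketch may help. The function name, row shape and format string are assumptions for illustration; only the 80.9% figure comes from the snapshot.

```python
from typing import Optional

PENDING = "\u2014"  # em-dash placeholder meaning "not yet measured"

def render_metric(value: Optional[float], fmt: str = "{:.1%}") -> str:
    """Render a metric cell. None means the instrument has not run.

    A missing measurement is never coerced to 0, because 0 would read
    as a measured (and terrible) result.
    """
    return PENDING if value is None else fmt.format(value)

# Hypothetical row: only hit@5 is measured in this snapshot.
row = {"hit_at_5": 0.809, "cost_per_correct": None, "latency": None}
cells = {name: render_metric(v) for name, v in row.items()}
print(cells)
```

Keeping `None` distinct from `0.0` all the way to the renderer means a genuine zero still prints as `0.0%`, while an unrun instrument prints the placeholder.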

Independent System of Record · CC-BY-4.0

Web Search & Extraction API Benchmarks for AI Agents

Open benchmarks for agentic consumers — the APIs AI agents buy and use mid-task.

Safe to cite today: two leaderboards are verified (pre-review). On web search, Exa leads hit@5 at 80.9% (n=299); on web extraction, Exa leads fidelity at 0.74 (n=150, independently audited). Cost, freshness, KYB and the agent harness are pending or prototype, so don't cite them. See the claim table below and claims.json.

The Verdict · snapshot web_search-2026-q3 · 299 verified queries · 4 vendors

Answer: for golden-URL search accuracy, Exa leads. Caveat first: only web search and web extraction are verified; cost, freshness, KYB and the agent harness are pending or prototype. On snapshot web_search-2026-q3 (299 public verified queries, 4 vendors; hit@1/hit@5 only this run), Exa (80.9% hit@5) is the clear leader over SerpAPI (69.2%). Their 95% confidence intervals are disjoint and McNemar p≈0.001, so the two are statistically separable at n=299. Both are ahead of Brave (58.9%) and Tavily (49.4%). Cost, freshness and latency: pending, because the sentinel and cost instruments have not yet been run.
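For readers unfamiliar with McNemar's test: it compares two vendors on the same queries and looks only at the discordant pairs, the queries one vendor hit and the other missed. The sketch below is an exact (binomial) version. The discordant counts are hypothetical, chosen only so that their difference matches a 35-query gap inferred from the rounded percentages; the snapshot's real per-query outcomes are not reproduced here.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b = queries vendor A hit and vendor B missed
    c = queries vendor B hit and vendor A missed
    Under the null hypothesis each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# HYPOTHETICAL discordant counts, not taken from the benchmark data.
exa_only, serpapi_only = 74, 39
p_value = mcnemar_exact(exa_only, serpapi_only)
print(f"p = {p_value:.4f}")
```

Queries that both vendors hit, or both missed, carry no information about which is better and do not enter the test, which is why a paired test can separate vendors that an unpaired comparison might not.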

449

Rows scored & verified (150 ext + 299 search)

Vendors scored

Verified boards

1,550

Corpus curated (not all scored)

CC-BY 4.0

License

§ 00

What's verified — and safe to cite

one row per claim · status-tagged

Claim	Snapshot	n	Vendors	Status	Safe to cite?
Web search — Exa leads hit@5 (80.9%), ahead of SerpAPI (69.2%)	web_search-2026-q3	299	4	verified · reproducible, pre-review	✓ yes
Web extraction — Exa leads main-content fidelity (0.74)	web_extraction-2026-q2	150	4	verified · audited, pre-review	✓ yes
Web extraction — cost-per-correct (Exa lower of the 2 priced)	web_extraction-2026-q2	150	2 priced	provisional	✗ provisional
KYB / identity — vendor accuracy	—	—	—	first run July 2026	✗ not yet
Freshness lag — time-to-retrievability	—	—	—	prototype (illustrative)	✗ no
Agent harness — autonomous onboarding	—	—	—	prototype (illustrative)	✗ no

Cite only rows marked ✓, with their snapshot id and caveat. The machine-readable equivalent (with confidence intervals + per-claim caveats) is claims.json. "Verified" snapshots are published pre-review — vendor right-of-reply rows have not yet been sent. "Corpus curated" (1,550) counts hand-verified golden items built across all primitives; only 449 are scored on a verified leaderboard today.

§ 01

The Benchmarks — Web Search, Extraction & Identity APIs

5 primitives · 2 verified · 2 prototype · 1 first run July

Web Search

Verified

Golden-URL accuracy for agent search APIs — hit@k across 299 public verified queries (descriptive form, 30% holdout excluded).
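A minimal sketch of how hit@k might be computed against a golden URL. The URL-matching rule (host plus path, ignoring scheme, "www.", query string and trailing slash) and the row shape are assumptions; the published methodology may match URLs differently.

```python
from urllib.parse import urlsplit

def normalize(url: str) -> str:
    """ASSUMPTION: compare on host + path only; host is case-insensitive."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower().removeprefix("www.")
    return host + parts.path.rstrip("/")

def hit_at_k(ranked_urls: list[str], golden_url: str, k: int) -> bool:
    """True when the golden URL appears among the top k results."""
    target = normalize(golden_url)
    return any(normalize(u) == target for u in ranked_urls[:k])

def hit_rate(rows: list[dict], k: int) -> float:
    """Fraction of queries whose golden URL is in the top k."""
    return sum(hit_at_k(r["results"], r["golden"], k) for r in rows) / len(rows)

# Hypothetical two-query cohort.
rows = [
    {"golden": "https://example.com/docs/install",
     "results": ["https://example.com/", "https://www.example.com/docs/install/"]},
    {"golden": "https://example.org/a",
     "results": ["https://example.org/b", "https://example.org/c"]},
]
print(hit_rate(rows, 1), hit_rate(rows, 5))
```

In this toy cohort the first query is a miss at k=1 and a hit at k=5, which shows why hit@1 and hit@5 are reported separately.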

Top hit@5: Exa 80.9% · SerpAPI 69.2% · n = 299 (public) · Vendors: 4

Web Extraction

Verified

Main-content extraction fidelity on the WCXB cohort — title, required phrases retained, boilerplate excluded.
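To make the three ingredients concrete, here is a purely illustrative scoring sketch. The equal weighting, the substring matching and the function signature are all assumptions; the benchmark's actual fidelity formula is not given in this text and may differ substantially.

```python
def fidelity(extracted: str, title: str, required: list[str], boilerplate: list[str]) -> float:
    """ILLUSTRATIVE score in [0, 1]: the mean of three components.

    - title present in the extracted text
    - share of required phrases retained
    - share of known boilerplate strings excluded
    """
    text = extracted.lower()
    title_ok = 1.0 if title.lower() in text else 0.0
    retained = (sum(p.lower() in text for p in required) / len(required)) if required else 1.0
    excluded = (sum(b.lower() not in text for b in boilerplate) / len(boilerplate)) if boilerplate else 1.0
    return (title_ok + retained + excluded) / 3

# Hypothetical page: both required phrases survive, one boilerplate string leaks through.
extracted = "Install Guide\nRun pip install foo. Then import foo.\nSubscribe to our newsletter"
score = fidelity(
    extracted,
    title="Install Guide",
    required=["pip install foo", "import foo"],
    boilerplate=["Accept cookies", "Subscribe to our newsletter"],
)
print(round(score, 3))
```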

Leader fidelity: Exa 0.74 · Cohort: 150 WCXB pages · Vendors: 4

KYB Identity

First run · July

Business identity verification scored against the registries themselves — SoS, EDGAR, Companies House.

Golden cohort: 1,000 · Regions: NA·UK·EU · Vendors queued: 4

Freshness Lag

Illustrative

Planted sentinel pages with a unique token, probed for first appearance — index freshness as a survival curve.

Lag curve: pending first crawl · Discriminator: not_indexed vs ranked_below_k · Host: sentinel.arlenkumar.com

Agent Harness

Illustrative

Same task run by Claude Code, Codex, Gemini CLI against every vendor: homepage → API key → verified answers, zero humans.

Status: illustrative · First run: pending · Not yet measured

§ 01b

Choose an API by use case, or compare vendors head-to-head

demand-matched guides

Best web search API for AI agents

Which search API to call from an autonomous LLM agent — ranked on hit@5 over verified golden truth.

Best API for RAG ingestion

Extraction and scraping fidelity for RAG pipelines — main content in, boilerplate out.

Scraping & extraction for SEO/content

Web scraping and extraction APIs compared on content fidelity for SEO and content workflows.

Cost-sensitive at scale

Cheapest per verified-correct answer for high-volume agent workloads (cost benchmarks rolling out).

Compare head-to-head: Exa vs Tavily · Exa vs SerpAPI · Brave vs Tavily. By vendor: Exa · SerpAPI · Brave · Tavily · Firecrawl · Jina.

§ 02

Which API should you use?

use-case guides from the verified data

RAG Ingestion

Fidelity + boilerplate removal so retrieved chunks aren't polluted by nav, ads, or banners.

AI Agents

Reliable structured main content with high coverage for autonomous mid-task fetches.

SEO & Content Extraction

Retaining required on-page phrases across products, listings, docs — not just articles.

Cost-Sensitive at Scale

Lowest cost per verified-correct page while keeping fidelity acceptable at volume.

§ 03

Methodology & Independence

golden truth · private holdout · published

Every leaderboard is scored against row-by-row golden truth with a private 30% holdout to deter overfitting. Snapshots are versioned and dated; claims cite the snapshot ID. No vendor can pay for placement, and vendors are entitled to a documented right of reply — corrections are published, not negotiated.
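One common way to carve out a stable private holdout is a salted hash of each query ID, sketched below. This is an assumption about how such a split could be done, not a description of the benchmark's actual mechanism; the salt and ID format are hypothetical.

```python
import hashlib

def is_holdout(query_id: str, salt: str, fraction: float = 0.30) -> bool:
    """Deterministically assign a query to the private holdout.

    The salt stays private, so outsiders cannot predict which queries are
    held out, while the assignment stays stable across snapshots.
    """
    digest = hashlib.sha256(f"{salt}:{query_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction

# Hypothetical IDs and salt, used only to show the split is stable and near 30%.
ids = [f"q-{i:05d}" for i in range(10_000)]
held_out = [q for q in ids if is_holdout(q, salt="example-private-salt")]
share = len(held_out) / len(ids)
print(f"{share:.3f}")
```

Hashing rather than random sampling means adding new queries later never reshuffles which existing queries are public.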

Read the full methodology →