Were cost, freshness or latency measured for web search in this snapshot?

No. The web_search-2026-q3 run measured only hit@1 and hit@5 over 299 public queries. Cost-per-correct and latency metering and the sentinel freshness instrument have not yet been run, so those columns render as an em-dash (pending), never zero, and no cost, freshness or latency leader is claimed. Serper and Firecrawl were not scored on web search this snapshot and are excluded; agent onboarding was not tested in this run.

📌 Immutable snapshot — web_search-2026-q3. Stable, citable URL; the rows, results and content_hash for this snapshot are frozen. The live leaderboard tracks the latest snapshot.

§ 01 · Leaderboard · Verified (scored run)

Web Search API Benchmark for AI Agents — Exa vs SerpAPI vs Brave vs Tavily

Golden-URL accuracy for agent search APIs. 299 verified queries (public split, descriptive form) · snapshot web_search-2026-q3 · holdout 30% excluded.

✓ Verified scored run. Real deterministic run from the GoldenEvalWebSearch engine (results.jsonl, n=299 public, 3 reps/query). Only hit@1 and hit@5 are measured this snapshot — freshness, cost-per-correct and latency are pending (sentinel & cost instruments not yet run) and render as —, never 0.

Safe to cite: "Exa leads hit@5 in the public split, statistically separable from SerpAPI (Exa 80.9%, SerpAPI 69.2%, McNemar p≈0.001, n=299)." Status: verified, reproducible, pre-review. Only hit@1 and hit@5 are measured; cost, freshness and latency are pending. Claims file: claims.json.

Answer first · web_search-2026-q3 · changelog

Exa (80.9% hit@5) is the clear leader over SerpAPI (69.2%): Exa's 95% Wilson CI [76.1, 85.0] is disjoint from SerpAPI's [63.8, 74.2] and McNemar p≈0.001, so the two are statistically separable at n=299. Both are ahead of Brave (58.9%) and Tavily (49.4%). Cost, freshness & latency: pending — sentinel & cost instruments not yet run.

Best hit@5: 80.9% · Exa · leads SerpAPI 69.2%

n (public split): 299 · descriptive form · holdout excluded

Cost / freshness / latency: — · pending: instruments not yet run

Vendors: Serper, Firecrawl not yet scored

§ 01.1

Full Leaderboard

Rank	Vendor	hit@1 %	hit@5 %	hit@5 95% CI	fresh<30d %	retrievability h	cost/correct $	p50 latency ms
1	Exa	62.2	80.9	76.1–85.0	—	—	—	—
2	SerpAPI	56.9	69.2	63.8–74.2	—	—	—	—
3	Brave	48.8	58.9	53.2–64.3	—	—	—	—
4	Tavily	31.4	49.4	43.8–55.0	—	—	—	—

A "—" cell denotes a metric not measured in this snapshot. fresh<30d, retrievability, cost/correct and p50 latency have no scored run yet (sentinel and cost instruments pending), so they render as an em-dash, never as 0, and are excluded from ranking. hit@5 carries a 95% Wilson CI. Exa (80.9%) and SerpAPI (69.2%) have disjoint CIs and McNemar p≈0.001, so the two are separable at n=299 and Exa is the clear #1. Serper and Firecrawl are not yet scored on web_search and are excluded from this snapshot.
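
For readers who want to check the hit@5 intervals, the following is a minimal sketch of the Wilson score interval using only the Python standard library. The function name `wilson_interval` is hypothetical, and the success counts 242/299 and 207/299 are assumptions back-calculated from the published 80.9% and 69.2%; the snapshot itself publishes percentages, not counts.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion, returned as fractions in [0, 1]."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# ASSUMPTION: counts inferred from the published percentages at n=299.
exa_lo, exa_hi = wilson_interval(242, 299)
serp_lo, serp_hi = wilson_interval(207, 299)

print(f"Exa     95% CI: {100 * exa_lo:.1f} to {100 * exa_hi:.1f}")
print(f"SerpAPI 95% CI: {100 * serp_lo:.1f} to {100 * serp_hi:.1f}")
# The intervals are disjoint when Exa's lower bound exceeds SerpAPI's upper bound.
print("disjoint:", exa_lo > serp_hi)
```

Under those assumed counts the sketch reproduces the intervals in the table (76.1 to 85.0 for Exa, 63.8 to 74.2 for SerpAPI). Disjoint intervals are a conservative signal on their own; the paired McNemar test is the more direct comparison because both vendors answered the same queries.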

Reproduce & cite — web_search-2026-q3 (verified: hit@1/hit@5)

Cite this snapshot: Arlen Kumar, Web Search API Benchmark for AI Agents, snapshot web_search-2026-q3, public split n=299, updated 2026-06-14. Measured: hit@1, hit@5 (deterministic set-membership scoring, no LLM judge; 30% private holdout excluded). Not measured this snapshot: cost, freshness, latency. Safe claim: "Exa leads hit@5 in the public split, statistically separable from SerpAPI (Exa 80.9%, SerpAPI 69.2%, McNemar p≈0.001, n=299)." Permanent citation URL: /bench/snapshots/web_search-2026-q3.
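
The "deterministic set-membership scoring" mentioned above can be illustrated with a short sketch. This is not the GoldenEvalWebSearch implementation: the function names are hypothetical, URLs are compared as exact strings with no normalization, and how the 3 reps per query are aggregated is not specified in this snapshot, so the sketch scores one result list per query.

```python
from typing import Iterable, Sequence


def hit_at_k(ranked_urls: Sequence[str], golden_urls: Iterable[str], k: int) -> bool:
    """True if any golden URL appears among the top-k ranked results."""
    golden = set(golden_urls)
    return any(url in golden for url in ranked_urls[:k])


def hit_rate(runs: Sequence[tuple[Sequence[str], Iterable[str]]], k: int) -> float:
    """Percentage of queries with at least one golden URL in the top k."""
    if not runs:
        raise ValueError("runs must not be empty")
    hits = sum(hit_at_k(ranked, golden, k) for ranked, golden in runs)
    return 100 * hits / len(runs)


# Toy data (hypothetical URLs, not benchmark queries).
toy_runs = [
    (["https://a.example/1", "https://b.example/2"], ["https://a.example/1"]),  # hit at rank 1
    (["https://c.example/3", "https://d.example/4"], ["https://d.example/4"]),  # hit at rank 2
    (["https://e.example/5"], ["https://f.example/6"]),                         # miss
]
print(hit_rate(toy_runs, k=1))
print(hit_rate(toy_runs, k=5))
```

Because the check is plain set membership, the same inputs always give the same score, which is what makes the run reproducible without an LLM judge.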

§ 01.2

Cost, Freshness & Latency

pending — not yet measured

cost-per-correct, freshness lag & latency · web_search-2026-q3

Cost, freshness & latency: pending — the sentinel-page freshness instrument and the cost/latency metering have not been run for this snapshot. No cost, freshness or latency leader is claimed for web_search; these columns render as — until their first scored run lands, never as 0.
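
One way to keep an unmeasured metric from being mistaken for a zero is to carry it as a missing value all the way to rendering. The sketch below is an assumption about how that could look, not the site's code; `render_metric`, `rank_by` and the row layout are hypothetical.

```python
from typing import Optional

PENDING = "\u2014"  # em-dash shown for metrics with no scored run


def render_metric(value: Optional[float], digits: int = 1) -> str:
    """Format a measured value; show the pending marker for a missing one."""
    if value is None:
        return PENDING
    return f"{value:.{digits}f}"


def rank_by(rows: list[dict], metric: str) -> list[dict]:
    """Rank rows by a metric (higher is better), excluding rows where it is unmeasured."""
    scored = [row for row in rows if row.get(metric) is not None]
    return sorted(scored, key=lambda row: row[metric], reverse=True)


rows = [
    {"vendor": "Exa", "hit_at_5": 80.9, "p50_latency_ms": None},
    {"vendor": "SerpAPI", "hit_at_5": 69.2, "p50_latency_ms": None},
]
print([row["vendor"] for row in rank_by(rows, "hit_at_5")])
print(rank_by(rows, "p50_latency_ms"))        # empty: nothing to rank yet
print(render_metric(rows[0]["p50_latency_ms"]))
```

A genuinely measured 0.0 still renders as a number, so a pending cell and a zero result stay distinguishable.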

§ 01.3

Questions

methodology & results

Q1 What is the most accurate web search API for AI agents?

On the verified web_search-2026-q3 snapshot (n=299 public queries, descriptive form), Exa leads hit@5 at 80.9%, ahead of SerpAPI at 69.2%. Their 95% Wilson confidence intervals are disjoint and McNemar p≈0.001, so Exa is statistically separable as the clear leader; both are ahead of Brave (58.9%) and Tavily (49.4%). Only hit@1 and hit@5 are measured this run. Cost, freshness and latency are not yet scored, so no cost, freshness or latency leader is claimed.
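
McNemar's test compares two vendors on the same queries by looking only at the discordant pairs, that is, queries where one vendor hit and the other missed. The sketch below uses the exact (binomial) form; the snapshot does not say which variant was used, and it does not publish the discordant counts, so the function names and the toy data are hypothetical.

```python
from math import comb
from typing import Sequence


def discordant_counts(hits_a: Sequence[bool], hits_b: Sequence[bool]) -> tuple[int, int]:
    """Count queries where only A hit (b) and where only B hit (c)."""
    if len(hits_a) != len(hits_b):
        raise ValueError("both vendors must be scored on the same queries")
    b = sum(1 for a, x in zip(hits_a, hits_b) if a and not x)
    c = sum(1 for a, x in zip(hits_a, hits_b) if x and not a)
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Under H0 each discordant pair favours either vendor with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy per-query hit@5 outcomes (hypothetical, not benchmark data).
vendor_a = [True] * 9 + [False] * 1 + [True] * 5
vendor_b = [False] * 9 + [True] * 1 + [True] * 5
b, c = discordant_counts(vendor_a, vendor_b)
print(b, c, mcnemar_exact(b, c))
```

Queries where both vendors hit or both missed do not affect the p-value, which is why the test suits a paired design like this one.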

Q2 How does arlen/bench measure freshness?

Sentinel pages carrying a unique token are deployed to sentinel.arlenkumar.com, then probed for first appearance in each vendor's results, with a not_indexed vs ranked_below_k discriminator. Freshness lag will be reported as the Kaplan–Meier median time from publication to first retrievability, right-censored at 30 days. For web_search-2026-q3 this curve is pending the first crawl, so freshness figures are shown as —, never 0.
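
The Kaplan-Meier median described above can be sketched in a few lines. This is an illustration under stated assumptions, not the sentinel instrument itself: `km_median` is a hypothetical name, durations are in days, and a sentinel page that has not appeared by day 30 is treated as right-censored at 30. The durations are toy values, since no freshness data exists for this snapshot.

```python
from typing import Optional, Sequence


def km_median(durations: Sequence[float], observed: Sequence[bool]) -> Optional[float]:
    """Kaplan-Meier median time to first retrievability.

    durations[i] is the time to first appearance, or the censoring time.
    observed[i] is True if the page appeared, False if it was censored.
    Returns the smallest event time at which the estimated probability of
    still being unretrieved drops to 0.5 or below, or None if it never does.
    """
    if len(durations) != len(observed):
        raise ValueError("durations and observed must have the same length")
    event_times = sorted({t for t, seen in zip(durations, observed) if seen})
    survival = 1.0
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, seen in zip(durations, observed) if seen and d == t)
        survival *= 1 - events / at_risk
        if survival <= 0.5:
            return t
    return None


# Toy data: three pages appeared on days 2, 5 and 8; two were censored at day 30.
days = [2, 5, 8, 30, 30]
appeared = [True, True, True, False, False]
print(km_median(days, appeared))
```

Returning None when the curve never reaches 0.5 matches the pending convention used on this page: a vendor that indexes fewer than half the sentinels within the window has no defined median, which is different from a median of zero.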

Q3 Were cost, freshness or latency measured in this snapshot?

No. This web_search-2026-q3 run measured only hit@1 and hit@5 over 299 public queries. Cost-per-correct and latency metering and the sentinel freshness instrument have not yet been run, so those columns render as — (pending), never 0, and no cost/freshness/latency leader is claimed. Serper and Firecrawl were not scored on web_search this snapshot and are excluded.