📌 Immutable snapshot — web_search-2026-q3. Stable, citable URL; the rows, results and content_hash for this snapshot are frozen. The live leaderboard tracks the latest snapshot.
Web Search API Benchmark for AI Agents — Exa vs SerpAPI vs Brave vs Tavily
Golden-URL accuracy for agent search APIs. 62 verified queries (public split, descriptive form) · snapshot web_search-2026-q3 · holdout 30% excluded.
Exa (79.0% hit@5) and SerpAPI (71.0%) are statistically tied at the top: Exa's 95% Wilson CI [67.3, 87.3] overlaps SerpAPI's [58.7, 80.8], and McNemar's test gives p=0.40, so the two are not separable at n=62. Exa is the nominal #1, and both are ahead of Brave (54.8%) and Tavily (43.5%). Cost, freshness and latency are pending: the sentinel and cost instruments have not yet been run.
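As a rough check on the intervals quoted above, the Wilson score interval can be sketched in a few lines. The hit counts used here (49 and 44 out of 62) are an assumption inferred from the published percentages, not values taken from the snapshot JSON, and `wilson_interval` is a hypothetical helper name. With z = 1.96 the results land within about 0.1 percentage point of the published bounds.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval (lower, upper) for hits/n as fractions."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Assumed hit counts: inferred from 79.0% and 71.0% of 62 queries.
    for vendor, hits in {"Exa": 49, "SerpAPI": 44}.items():
        lo, hi = wilson_interval(hits, 62)
        print(f"{vendor}: {100 * hits / 62:.1f}% [{100 * lo:.1f}, {100 * hi:.1f}]")
```

Overlapping intervals alone do not settle a paired comparison, which is why the summary also reports a McNemar p-value computed on the same 62 queries.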
Full Leaderboard
| Rank | Vendor | hit@1 % | hit@5 % | hit@5 95% CI | fresh<30d % | retrievability (h) | cost/correct ($) | p50 latency (ms) |
|---|---|---|---|---|---|---|---|---|
| 1 | Exa | 61.3 | 79.0 | 67.3–87.3 | — | — | — | — |
| 2 | SerpAPI | 58.1 | 71.0 | 58.7–80.8 | — | — | — | — |
| 3 | Brave | 45.2 | 54.8 | 42.5–66.5 | — | — | — | — |
| 4 | Tavily | 29.0 | 43.5 | 31.9–55.9 | — | — | — | — |
— denotes a metric not measured in this snapshot. fresh<30d, retrievability, cost/correct and p50 latency have no scored run yet (the sentinel and cost instruments are pending), so they render as an em-dash and are excluded from ranking; they are never shown as 0. hit@5 carries a 95% Wilson CI. The top is a tie: the intervals for Exa (79.0%) and SerpAPI (71.0%) overlap and McNemar p=0.40, so the two are not separable at n=62; the required n for an 80%-power Exa/SerpAPI separation is ≈338. Serper and Firecrawl are not yet scored on web_search and are excluded from this snapshot.
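For readers unfamiliar with the paired test cited in the note above, here is a minimal sketch of an exact (binomial) McNemar test. The note does not say which McNemar variant the benchmark uses, so the exact form is an assumption. The discordant counts below (14 queries where only one vendor hits, 9 where only the other does) are hypothetical, chosen only to be consistent with a net gap of 5 queries; they are not taken from the raw results.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: queries where vendor A hits and vendor B misses.
    c: queries where vendor B hits and vendor A misses.
    Concordant queries (both hit or both miss) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Under H0 each discordant pair favors either vendor with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical discordant counts, for illustration only.
    print(f"p = {mcnemar_exact(14, 9):.2f}")
```

With these illustrative counts the p-value comes out near 0.40, which shows how a gap of a few queries at n=62 can fall well short of conventional significance.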
- Golden rows (62 public queries + accepted golden-URL equivalence classes + truth tokens)
- Raw results (248 per-vendor hit@k outcomes, with miss reasons)
- Snapshot JSON (aggregates · Wilson CIs · per-metric status · content_hash · run timestamp)
- Run engine + scoring script (GoldenEvalWebSearch)
- Scoring methodology · atomic claims (status-tagged)
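To make the golden-URL scoring concrete, the sketch below shows one way a hit@k outcome could be computed against an equivalence class of accepted URLs. It is not the GoldenEvalWebSearch implementation: `normalize_url`, `hit_at_k` and the normalization rules (ignore scheme, `www.`, query string, fragment and trailing slash) are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Hypothetical normalization: lowercase host, drop scheme, 'www.',
    query, fragment and trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def hit_at_k(results: list[str], golden_class: set[str], k: int) -> bool:
    """True if any of the top-k result URLs matches an accepted golden URL."""
    accepted = {normalize_url(u) for u in golden_class}
    return any(normalize_url(u) in accepted for u in results[:k])


def hit_rate(outcomes: list[bool]) -> float:
    """Percentage of queries scored as hits."""
    return 100.0 * sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    golden = {"https://example.com/docs"}
    ranked = ["https://other.org/", "http://www.example.com/docs/"]
    print(hit_at_k(ranked, golden, 1), hit_at_k(ranked, golden, 5))
```

Scoring each of the 62 public queries once per vendor in this way yields the 62 × 4 = 248 per-vendor outcomes listed under raw results.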
Cost, Freshness & Latency
Pending. The sentinel-page freshness instrument and the cost/latency metering have not been run for this snapshot, so no cost, freshness or latency leader is claimed for web_search. These columns render as — until their first scored run lands, never as 0.
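The "never as 0" rule is easy to get wrong in a rendering or ranking layer, so here is a small sketch of the idea: an unmeasured metric is kept as `None`, rendered as an em-dash, and skipped when ranking. The function names and the dictionary shape are hypothetical and do not reflect the snapshot JSON schema.

```python
from typing import Optional

EM_DASH = "\u2014"  # placeholder shown for unmeasured metrics


def render_metric(value: Optional[float], digits: int = 1) -> str:
    """Render a metric cell; None means 'not measured', which is distinct from 0."""
    return EM_DASH if value is None else f"{value:.{digits}f}"


def rank_by(values: dict[str, Optional[float]], higher_is_better: bool = True) -> list[str]:
    """Rank vendors on one metric, excluding vendors with no measurement."""
    scored = {vendor: v for vendor, v in values.items() if v is not None}
    return sorted(scored, key=scored.get, reverse=higher_is_better)


if __name__ == "__main__":
    latency_ms = {"Exa": None, "SerpAPI": None, "Brave": None, "Tavily": None}
    print([render_metric(v) for v in latency_ms.values()])
    print(rank_by(latency_ms, higher_is_better=False))  # empty: no leader claimed
```

Keeping `None` separate from `0.0` means a genuinely measured zero would still render as a number, while a missing run can never be mistaken for a best or worst score.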