What is the most accurate web search API for AI agents?

On the verified web_search-2026-q3 snapshot (n=299 public queries, descriptive form), Exa leads hit@5 at 80.9%, ahead of SerpAPI at 69.2%. Their 95% Wilson confidence intervals are disjoint and McNemar p≈0.001, so Exa is statistically separable as the clear leader. Both are ahead of Brave (58.9%) and Tavily (49.4%). Only hit@1 and hit@5 are measured in this run, so no cost, freshness or latency leader is claimed.

How does arlen/bench measure freshness?

Sentinel pages carrying a unique token are deployed to sentinel.arlenkumar.com and probed for first appearance in each vendor's results, recording a not_indexed vs ranked_below_k discriminator. Freshness lag will be reported as the Kaplan-Meier median time from publication to first retrievability, right-censored at 30 days. For web_search-2026-q3 this curve is pending the first crawl, so the freshness numbers are not yet published.
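As an illustration of the planned freshness statistic, here is a minimal sketch of a Kaplan-Meier median with right-censoring. The function name and input shape are assumptions made for this example, not the bench's published code, and no real freshness data exists yet for this snapshot. Each observation is a pair of hours since publication and a flag saying whether the page had appeared; a page that has still not appeared at 720 hours (30 days) is passed in as censored.

```python
from typing import List, Optional, Tuple

CENSOR_HOURS = 720.0  # 30 days, the right-censoring horizon stated above


def km_median_hours(observations: List[Tuple[float, bool]]) -> Optional[float]:
    """Kaplan-Meier median time-to-first-appearance, in hours.

    observations: (hours, appeared) pairs. appeared=False means the sentinel
    page was still not retrievable when observation stopped (right-censored).
    Returns None when the survival curve never drops to 0.5, i.e. the median
    is not reached within the observed window.
    """
    at_risk = len(observations)
    survival = 1.0
    for t in sorted({hours for hours, _ in observations}):
        events = sum(1 for h, seen in observations if h == t and seen)
        censored = sum(1 for h, seen in observations if h == t and not seen)
        if events:
            # Product-limit step: shrink survival by the share of at-risk
            # pages that first appeared at time t.
            survival *= 1 - events / at_risk
            if survival <= 0.5:
                return t
        # Events and censored pages both leave the risk set after time t.
        at_risk -= events + censored
    return None
```

With five hypothetical sentinel pages, three first seen at 12, 18 and 36 hours and two censored at 720 hours, the sketch reports a median of 36 hours. If most pages are still censored at the horizon it returns None, which is why an unreached median should be shown as pending rather than as a number.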

Were cost, freshness or latency measured for web search in this snapshot?

No. The web_search-2026-q3 run measured only hit@1 and hit@5 over 299 public queries. Cost-per-correct metering, latency metering, and the sentinel freshness instrument have not yet been run, so those columns render as an em-dash (pending), never zero, and no cost, freshness or latency leader is claimed. Serper and Firecrawl were not scored on web search in this snapshot and are excluded; agent onboarding was not tested in this run.

The Contract · Reproducible from a published pipeline

arlen/bench Methodology: Golden Truth, Holdouts & Deterministic Scoring

Every score on this site is reproducible from a published pipeline. This page is the contract: how golden truth is built, how vendors are queried, how cells are scored, and what the bench deliberately does not claim.

How to cite this benchmark. Cite the snapshot id with its caveat, e.g. "web_extraction-2026-q2 (verified · audited, pre-review)". The machine-readable, status-tagged claims live at claims.json. Only figures whose status reads verified are citable.

§ 01

The Pipeline

Golden truth

Registry-sourced, hand-verified, or planted pages we control. Verified row-by-row before any vendor call.

70 / 30 split

Public split ships in the repo; holdout never leaves. Holdout queries batched with decoys.

Vendor adapters

One per vendor, transport-only. Endpoints, plan, and cost table pinned per snapshot.

Deterministic scoring

Exact / tolerance / hit@k. No LLM judge in current snapshots.

Immutable snapshot

Quarterly ID, raw responses archived, aggregates published on every surface.

§ 02

Metric Definitions

hit@k: Share of golden queries where a normalized golden URL appears in the vendor's top k results. URL normalization strips www., utm_* parameters, fragments, and trailing slashes.
extraction fidelity (0–1): Weighted main-content score per page: title 0.20, required anchor phrases 0.50, boilerplate exclusion 0.30 (vs WCXB-reviewed gold). Weights renormalize over the components a golden row defines. Correct: score ≥ 0.90; partial: 0.50 ≤ score < 0.90.
time-to-retrievability: Hours from a sentinel page's authoritative publish timestamp to its first appearance in vendor search results. Kaplan–Meier median over probes every 6h, right-censored at 30 days.
cost per correct: Total billed spend on the disclosed pinned plan ÷ verified-correct answers. Undefined (rendered —) when a vendor scores zero correct — never 0.
overfit gap: Public-split accuracy minus holdout accuracy, per vendor. A persistent positive gap indicates tuning to the published set.
agent-ready: An autonomous agent obtained a working API key in live trials via otp-email or device-code auth, with no human in the loop.
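
The hit@k definition above can be sketched in a few lines. This is a minimal illustration, not the bench's own normalizer: it applies only the four normalization steps listed (www., utm_* parameters, fragments, trailing slashes), and the function names are assumptions for this example. Anything the definition does not specify, such as host case folding or scheme differences, is left untouched here.

```python
from typing import Iterable, List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Apply the four normalization steps named in the hit@k definition."""
    parts = urlsplit(url.strip())
    host = parts.netloc
    if host.startswith("www."):
        host = host[len("www."):]
    # Drop utm_* tracking parameters, keep every other query parameter.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
    ]
    path = parts.path.rstrip("/")
    # An empty last element drops the fragment.
    return urlunsplit((parts.scheme, host, path, urlencode(kept), ""))


def hit_at_k(results: List[str], golden_urls: Iterable[str], k: int) -> bool:
    """True when any of the top-k result URLs matches a golden URL."""
    golden = {normalize_url(u) for u in golden_urls}
    return any(normalize_url(u) in golden for u in results[:k])


def hit_rate(hits: List[bool]) -> float:
    """Share of golden queries that scored a hit."""
    return sum(hits) / len(hits)
```

For example, a result URL of https://www.example.gov/doc/123/?utm_source=feed#sec normalizes to https://example.gov/doc/123, so it counts as a hit against that golden URL once it appears within the top k.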

§ 03

Integrity

Holdout. 30% of every cohort is private. Holdout queries are interleaved with decoys so vendors cannot pattern-match the bench's traffic, and the public-vs-holdout gap is published per vendor.

Rotation. 25% of each cohort rotates every quarter, biased toward strata where vendor scores saturate. Dead golden rows are replaced from the same stratum within 7 days.

Plans and spend. We purchase the cheapest published plan that exposes the capability, archive the pricing page at review time, and disclose exactly which plan every number was billed on.

Vendor review (standing policy). The standing policy is that every vendor receives its rows at least 10 business days before a snapshot is marked FINAL. Current snapshots are published as audited/reproducible, pre-review: vendor right-of-reply rows have not yet been sent, and each snapshot's status reflects that (web_extraction-2026-q2 = verified, audited, pre-review; web_search-2026-q3 = verified, reproducible, pre-review). Factual corrections trigger re-runs logged in a public errata file; methodology disputes get a standing right of reply, published verbatim and linked from the vendor's row.

Exclusions. Vendors without a programmatic endpoint, or whose terms prohibit automated evaluation, are listed with the reason rather than silently omitted — and are never assigned scores.

§ 04

Known Limitations

Counts and medians are comparable within a snapshot, not across vendors' differing internal windows. Strata are opinionated; edge cases are decided once and applied uniformly. Sentinel pages measure index freshness for pages we control — they do not measure ranking quality on competitive queries. Harness results depend on pinned framework versions and can shift when frameworks update; versions are disclosed per snapshot. Illustrative prototype data is labeled as such until a snapshot's first full run completes.

§ 05

Sources & Attribution

web_extraction's page corpus is imported from WCXB — the Web Content Extraction Benchmark (Murrough Foley, 2026; CC‑BY‑4.0; doi:10.5281/zenodo.19316874): 2,008 human-reviewed pages across seven page types — article, documentation, service, product, collection, forum, listing — and 1,613 domains. We import a stratified sample of its dev split, mapping WCXB's reviewed gold (title, required anchor phrases, boilerplate exclusions, hash-pinned main content) into our golden-row schema; native fields WCXB does not label (table cells, numeric facts) are recorded as not assessed — never as empty-correct. The WCXB test split (511 pages) is never ingested; it stays untouched as an external reference. SPA / JS-only pages are excluded — our live-fetch vendors render JS, so WCXB's static-empty gold would wrongly penalize them, and that stratum is hand-built instead. Every imported row and HTML snapshot is SHA-256-pinned in a public integrity manifest committed before any vendor is scored, and a deterministic 20% sample is re-verified by hand first. WCXB is credited here per its CC-BY-4.0 license.
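
The "not assessed, never empty-correct" rule interacts with the extraction-fidelity weights from § 02: a component the golden row does not define is dropped and the remaining weights are renormalized. The sketch below shows one way to do that. The component keys, the function names, and the "incorrect" label for scores under 0.50 are assumptions for illustration, and the per-component scores are taken as given inputs because this page does not specify how each is computed.

```python
from typing import Dict, Optional

# Weights as stated in § 02; the dictionary keys are illustrative names.
WEIGHTS = {"title": 0.20, "anchor_phrases": 0.50, "boilerplate_exclusion": 0.30}


def fidelity(components: Dict[str, Optional[float]]) -> Optional[float]:
    """Weighted score over the components the golden row defines.

    A component value of None means "not assessed": it is excluded and the
    remaining weights are renormalized, so it never counts as correct.
    """
    defined = {
        name: score
        for name, score in components.items()
        if name in WEIGHTS and score is not None
    }
    if not defined:
        return None  # nothing assessable on this row
    total_weight = sum(WEIGHTS[name] for name in defined)
    return sum(WEIGHTS[name] * score for name, score in defined.items()) / total_weight


def grade(score: Optional[float]) -> str:
    """Map a fidelity score to a label (boundary handling is an assumption)."""
    if score is None:
        return "not assessed"
    if score >= 0.90:
        return "correct"
    if score >= 0.50:
        return "partial"
    return "incorrect"
```

A row with no title gold, anchor phrases scored 0.8 and boilerplate exclusion scored 1.0 gets (0.50 × 0.8 + 0.30 × 1.0) / 0.80 = 0.875, which grades as partial.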

§ 06

What changed — June 2026

new cases

Web-search is now a real scored run. The web_search-2026-q3 snapshot is deterministic, produced by the GoldenEvalWebSearch engine over 299 public verified queries (descriptive form, 3 reps/query; the 30% private holdout is excluded). Only hit@1 and hit@5 are measured in this snapshot.

A separable leader. Exa (80.9% hit@5) is statistically separable from SerpAPI (69.2%) at n=299 (McNemar p≈0.001, with disjoint 95% Wilson intervals), so Exa is reported as the clear leader, not a tie. The cohort was expanded from the prior 62-row probe to 299 public rows for this re-probe. That is still below the pre-registered n≈338 target for 80% power, but close enough that the primary pair now separates.
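
For readers who want to check the interval claim, here is a minimal sketch of the two statistics named above. It assumes each hit@5 rate is a proportion over n=299 queries; how the 3 reps per query are aggregated is not stated on this page, so that denominator is an assumption. The McNemar function is the exact (binomial) form and takes the two discordant-pair counts, which are not given in this text and would have to be derived from the published per-query results.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: queries vendor A hit and vendor B missed.
    c: queries vendor B hit and vendor A missed.
    Under the null hypothesis the discordant pairs split 50/50.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


exa_low, exa_high = wilson_interval(0.809, 299)
serp_low, serp_high = wilson_interval(0.692, 299)
intervals_disjoint = serp_high < exa_low
```

Under that assumption the intervals come out at roughly 0.76 to 0.85 for Exa and 0.64 to 0.74 for SerpAPI, so they do not overlap, consistent with the statement above.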

Sentinel freshness instrument (built, awaiting first crawl). Bench-published pages carry a unique token (sentinel.arlenkumar.com). A new discriminator splits a search miss into not_indexed (even the unique-token query misses → the vendor has not crawled the page) vs ranked_below_k (the token query finds it → indexed, just not ranked in top-k). This is the prerequisite that makes the freshness-lag curve honest; numbers land after the first crawl.
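
The discriminator reduces to a small decision rule. The sketch below is illustrative only: the function name, the boolean inputs, and the "hit" label for the non-miss case are assumptions, while not_indexed and ranked_below_k are the labels described above.

```python
def classify_probe(descriptive_hit_in_top_k: bool, token_query_found: bool) -> str:
    """Classify one sentinel probe for one vendor.

    descriptive_hit_in_top_k: the normal descriptive query returned the
        sentinel page within the top k results.
    token_query_found: a query for the page's unique token returned it.
    """
    if descriptive_hit_in_top_k:
        return "hit"
    if token_query_found:
        # The vendor has the page in its index but did not rank it in top k.
        return "ranked_below_k"
    # Even the unique-token query misses, so the page has not been crawled.
    return "not_indexed"
```

Only the not_indexed outcome says anything about crawl lag, which is why a freshness curve built without this split would mix indexing delay with ranking behaviour.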

Mirror auto-promotion. A returned URL that holds the truth token is promoted into the equivalence class (logged to adjudication), so an unanticipated mirror is never scored as a false miss.

§ 07

Replicate this

web_search-2026-q3

Download the public data. Two engine-exported, salt-free files ship in the API: web_search-2026-q3-rows.json (the 299 queries plus accepted golden-URL equivalence_members and truth tokens) and web_search-2026-q3-results.json (the per-(vendor,rep) hit@k outcomes).

Method. Re-issue each row's descriptive query to a vendor, normalize URLs, and check set-membership against equivalence_members; or audit the recorded results directly. Scoring is deterministic set membership — there is no LLM judge.

Pipeline (GoldenEvalWebSearch engine). pump → verify → split → reslice → probe → report.

Why the private 30% stays uncomputable. Holdout rows are withheld and the HMAC split salt is never published.
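
To see why a withheld salt makes the split uncomputable from outside, consider this minimal sketch of a keyed, deterministic assignment. It is an assumption about the general technique, not the engine's actual code: the row id format, the use of SHA-256, and the way the digest is turned into a fraction are all hypothetical choices for this example.

```python
import hashlib
import hmac


def assign_split(row_id: str, salt: bytes, holdout_share: float = 0.30) -> str:
    """Deterministically assign a row to "public" or "holdout".

    The HMAC output looks random to anyone without the salt, so membership
    cannot be recomputed from the public row ids alone.
    """
    digest = hmac.new(salt, row_id.encode("utf-8"), hashlib.sha256).digest()
    # Map the first 8 bytes of the digest to a fraction in [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2 ** 64
    return "holdout" if fraction < holdout_share else "public"
```

With the same salt the assignment never changes between runs, and over many rows about 30% land in the holdout. A split built this way is only approximately 70 / 30 on any given cohort; whether the engine enforces exact proportions is not stated here.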

§ 08

Independence & funding

arlen/bench is an independent, self-funded project by Arlen Kumar. No vendor pays for placement, ranking, or review, and no vendor funds the project; the only money involved is out-of-pocket vendor API spend for the runs. Independence here is not just asserted — it is checkable: the full scoring harness is open source (GoldenEvalWebSearch on GitHub) and the public-split rows and per-vendor results ship as JSON, so anyone can re-run the benchmark and confirm the numbers independently. Vendors are entitled to a documented right of reply — corrections are published, not negotiated. There is no third-party audit of web_search-2026-q3 yet (web_extraction-2026-q2 carries an independent 20% human audit); until there is, treat web_search as a reproducible result rather than an audited one.

§ 09

How the golden set is selected (and why it resists selection bias)

The golden set is not a list of hand-picked queries; it comes from a registry-delta pump. Rows are real, recently published documents drawn deterministically from authoritative registries (US Federal Register, SEC EDGAR, NVD CVE, GitHub Releases). For each row, the correct answer is the canonical registry URL (plus verified mirrors), the truth token is the document's own identifier (FR doc number, SEC accession, CVE ID), and the query is a descriptive, token-stripped ask. Because rows come from a registry delta rather than an author choosing queries, the cohort cannot be quietly curated toward any vendor's strengths. The web_search-2026-q3 public split is n=299 (expanded from the prior 62-row probe; the cohort continues to grow toward 435 public rows). Disclosed skew: the current cohort over-represents government / financial / security / software-release documents and English text. Broadening the source mix is an open work item, so readers should interpret the numbers as 'accuracy on authoritative-registry lookups,' not 'all web search.' The 30% private holdout (HMAC-split, salt withheld) guards against overfitting; the public/holdout overfit gap is published per snapshot.

§ 10

Cost & latency measurement conditions

Cost-per-correct and latency are not measured in web_search-2026-q3 (they render as em-dash, pending). When they are measured, the snapshot will disclose the exact conditions that move them: vendor plan tier, region, concurrency, and the date pricing was captured (pricing is the most perishable figure on the page). No cost or latency leader is claimed until those conditions are pinned and published.
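
The "pending or undefined, never zero" rule can be made concrete with a short sketch. The function names, the dollar formatting, and the use of None for an unmeasured value are assumptions for illustration; the page does not state a currency or a display precision.

```python
from typing import Optional

EM_DASH = "\u2014"  # the glyph the page renders for pending or undefined cells


def cost_per_correct(total_billed: Optional[float], correct: Optional[int]) -> Optional[float]:
    """Billed spend divided by verified-correct answers.

    Returns None when the metric is pending (inputs not yet measured) or
    undefined (zero correct answers). It never returns 0 for either case.
    """
    if total_billed is None or correct is None:
        return None  # not measured in this snapshot
    if correct == 0:
        return None  # undefined: no correct answers to divide by
    return total_billed / correct


def render_cell(value: Optional[float]) -> str:
    """Render a cost cell, using the dash for pending or undefined values."""
    return EM_DASH if value is None else f"${value:.4f}"
```

Keeping the missing case as None rather than 0 matters for ranking: a zero would sort an unmeasured vendor as the cheapest, while a dash keeps it out of any cost comparison.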