Question 1

What is the most accurate web search API for AI agents?

Accepted Answer

On the verified web_search-2026-q3 snapshot (n=299 public queries, descriptive form), Exa leads hit@5 at 80.9%, ahead of SerpAPI at 69.2%. Their 95% Wilson confidence intervals are disjoint and McNemar p≈0.001, so Exa is statistically separable from SerpAPI as the leader on this metric. Both are ahead of Brave (58.9%) and Tavily (49.4%). Only hit@1 and hit@5 were measured in this run, so no cost, freshness or latency leader is claimed.
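The separability check described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own code: the function names are made up, the hit counts (242 and 207 of 299) are back-computed from the published percentages and are an assumption, and the discordant-pair counts passed to the McNemar function are hypothetical because per-query outcomes are not given here.

```python
import math

def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b = queries only system A hit, c = queries only system B hit.
    Concordant pairs (both hit or both miss) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Assumed counts: 80.9% and 69.2% of 299, rounded to whole queries.
exa_lo, exa_hi = wilson_interval(242, 299)
serp_lo, serp_hi = wilson_interval(207, 299)
disjoint = exa_lo > serp_hi  # intervals do not overlap

# Hypothetical discordant counts, for illustration only.
p_value = mcnemar_exact(b=12, c=3)
```

McNemar's test is the paired counterpart to comparing two proportions: both vendors answer the same queries, so only the queries where exactly one of them hits carry information about which is better.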

Question 2

How does arlen/bench measure freshness?

Accepted Answer

Sentinel pages carrying a unique token are deployed to sentinel.arlenkumar.com and probed for first appearance in each vendor's results. Each miss is recorded with a discriminator, not_indexed or ranked_below_k, to distinguish a page the vendor has not indexed from one it has indexed but ranked below the cutoff k. Freshness lag will be reported as the Kaplan-Meier median time from publication to first retrievability, right-censored at 30 days. For web_search-2026-q3 this curve is pending the first crawl, so no freshness numbers are published yet.
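A Kaplan-Meier median with right-censoring at 30 days could be computed roughly as follows. This is a sketch under stated assumptions: the helper names are hypothetical, the lag values are invented sample data (no freshness results exist yet), and a sentinel that never appears within the horizon is treated as censored at 30 days.

```python
from typing import Optional

HORIZON_DAYS = 30.0  # right-censoring horizon stated in the text

def censor(lag_days: Optional[float]) -> tuple[float, bool]:
    """Map a raw lag to (duration, observed). None means never retrieved."""
    if lag_days is None or lag_days > HORIZON_DAYS:
        return HORIZON_DAYS, False
    return lag_days, True

def km_median(samples: list[tuple[float, bool]]) -> Optional[float]:
    """Kaplan-Meier median: first time the survival estimate is 0.5 or below.

    Returns None when the curve never reaches 0.5 (median not reached).
    """
    at_risk = len(samples)
    survival = 1.0
    for t in sorted({d for d, _ in samples}):
        events = sum(1 for d, seen in samples if d == t and seen)
        leaving = sum(1 for d, _ in samples if d == t)
        if events:
            survival *= 1 - events / at_risk
            if survival <= 0.5:
                return t
        at_risk -= leaving  # events and censored pages both leave the risk set
    return None

# Hypothetical lags in days; None = sentinel never appeared in results.
lags = [1.0, 2.0, 3.0, None, 45.0]
median_lag = km_median([censor(x) for x in lags])
```

Censoring matters here because pages that are still unretrievable at day 30 carry information (the lag is at least 30 days); dropping them would bias the median toward faster vendors.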

Question 3

Were cost, freshness or latency measured for web search in this snapshot?

Accepted Answer

No. The web_search-2026-q3 run measured only hit@1 and hit@5 over 299 public queries. Cost-per-correct metering, latency metering and the sentinel freshness instrument have not yet been run, so those columns render as an em-dash (pending), never zero, and no cost, freshness or latency leader is claimed. Serper and Firecrawl were not scored on web search in this snapshot and are excluded. Agent onboarding was also not tested in this run.
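To make the distinction between a measured value and a pending one concrete, here is a small sketch of how hit@k might be scored and how an unmeasured column could be rendered. The function names, the exact-string URL match and the sample URLs are all assumptions for illustration; the benchmark's actual matching rules are not described here.

```python
from typing import Optional, Sequence

PENDING = "\u2014"  # em-dash shown for metrics that were not run

def hit_at_k(ranked_urls: Sequence[str], golden_urls: set[str], k: int) -> bool:
    """True if any golden URL appears in the top k results (exact string match)."""
    return any(url in golden_urls for url in ranked_urls[:k])

def render_cell(value: Optional[float], suffix: str = "%") -> str:
    """Format a measured value; None means pending and must not render as 0."""
    if value is None:
        return PENDING
    return f"{value:.1f}{suffix}"

# Hypothetical single query with one golden URL ranked second.
results = ["https://a.example/x", "https://b.example/y", "https://c.example/z"]
golden = {"https://b.example/y"}
row = {
    "hit@1": hit_at_k(results, golden, 1),
    "hit@5": hit_at_k(results, golden, 5),
    "hit@5 rate": render_cell(80.9),
    "latency": render_cell(None),  # not measured in this run
}
```

Using None rather than 0.0 for unmeasured metrics keeps a pending column from being read as a genuine zero score.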

Dataset	items	public / holdout	truth source	refresh	download
web_search · queryset	62	public (30% holdout excluded)	verified golden URLs	quarterly +25% rotation	jsonl
web_extraction · WCXB	150	public	WCXB dev (Foley) · 20% audited	quarterly resample	json ↓
planted · sentinel pages	—	holdout until snapshot	we publish them (unique token)	probing · curve pending first crawl	protocol only
kyb · core cohort	1000	700 / 300	SoS · EDGAR · Companies House	registry deltas, daily	first run July

Evals & Golden Datasets

Datasets

Row Schema — web_search example

Overfit Gap — current snapshot