arlen/bench · Open benchmarks for agentic consumers
SNAPSHOT web_search-2026-q3
11 JUN 2026 · BERKELEY, CA
§ 05 · Instrument · Live

Agent Harness

Homepage → API key → verified answers, zero humans. The task is identical across Claude Code, Codex, and Gemini CLI: n=5 trials per framework × vendor cell, each in a clean container, with versions pinned per snapshot.

Trials this snapshot: 150 (3 frameworks × 6 vendors × 5)
Agent-ready vendors: 3 / 6 (otp-email or device-code auth)
Fastest onboarding: 2m 58s (Claude Code · Tavily)
Top failure mode: auth_wall (42% of all failures)
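The "top failure mode" figure is a share of failed trials, not of all trials. A minimal sketch of how such a share could be computed, assuming each trial is recorded as a single outcome label where `clean` means success (the function name and the sample labels are hypothetical, not taken from the harness):

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def top_failure_mode(outcomes: Iterable[str]) -> Optional[Tuple[str, float]]:
    """Return (label, share of all failures) for the most common failure.

    Assumption: every non-"clean" outcome label counts as one failure.
    Returns None when there are no failures at all.
    """
    failures = Counter(o for o in outcomes if o != "clean")
    total = sum(failures.values())
    if total == 0:
        return None
    label, count = failures.most_common(1)[0]
    return label, count / total


# Hypothetical labels for illustration only; not the snapshot's data.
sample = ["clean", "auth_wall", "auth_wall", "timeout", "clean", "rate_limit"]
print(top_failure_mode(sample))
```

Dividing by the failure count rather than the trial count is what makes the figure read as "of all failures".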
§ 05.1

Completion by Framework × Vendor

of 5 trials each
| Framework × vendor | Completion /5 | First 200 (median) | Outcome | Last run (UTC) |
| --- | --- | --- | --- | --- |
| Claude Code · Tavily | 5 | 2m 58s | clean | 06-10 12:01 |
| Claude Code · Exa | 5 | 3m 41s | clean | 06-10 12:04 |
| Codex · Tavily | 5 | 4m 22s | clean | 06-10 11:40 |
| Codex · Exa | 4 | 5m 12s | schema_confusion ×1 | 06-10 11:18 |
| Gemini CLI · Tavily | 4 | 6m 03s | timeout ×1 | 06-10 11:02 |
| Gemini CLI · Firecrawl | 3 | 8m 47s | rate_limit ×2 | 06-10 10:31 |
| Claude Code · SerpAPI | 0 | | auth_wall · human signup | 06-10 09:40 |
| Codex · Brave | 0 | | auth_wall · card required | 06-10 09:12 |
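Each row above condenses five trials into a completion count, a median time to first 200, and an outcome summary. The sketch below shows one way that roll-up could work. The `Trial` record, its field names, and the choice to take the median over every trial that reached a 200 are assumptions for illustration; the harness's actual schema and rules are not described here.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class Trial:
    framework: str
    vendor: str
    outcome: str                  # "clean" or a failure label, e.g. "timeout"
    first_200_s: Optional[float]  # seconds to first HTTP 200; None if never reached


def fmt_duration(seconds: float) -> str:
    """Format seconds in the table's style, e.g. 178 -> '2m 58s'."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}m {secs:02d}s"


def summarize_cell(trials: List[Trial]) -> dict:
    """Roll one framework × vendor cell up into a table row."""
    completed = sum(1 for t in trials if t.outcome == "clean")
    # Assumption: the median covers trials that reached a 200 at all.
    times = [t.first_200_s for t in trials if t.first_200_s is not None]
    failures = Counter(t.outcome for t in trials if t.outcome != "clean")
    if failures:
        outcome = ", ".join(f"{label} ×{n}" for label, n in failures.items())
    else:
        outcome = "clean"
    return {
        "completion": f"{completed}/{len(trials)}",
        "first_200_median": fmt_duration(median(times)) if times else "",
        "outcome": outcome,
    }


# Hypothetical timings for illustration only; not the snapshot's data.
cell = [
    Trial("Codex", "Exa", "clean", 100.0),
    Trial("Codex", "Exa", "clean", 110.0),
    Trial("Codex", "Exa", "clean", 120.0),
    Trial("Codex", "Exa", "clean", 130.0),
    Trial("Codex", "Exa", "schema_confusion", None),
]
print(summarize_cell(cell))
```

A cell where no trial reaches a 200, such as an `auth_wall` row, yields an empty median field under this scheme, matching the blank cells in the table.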

Vendors whose ToS prohibit automated signup are excluded from the harness and scored on the static rubric only; exclusions and reasons are published.