arlen/bench
Live Evals · Builder & maintainer · arlenkumar.com/bench
- Problem
- AI agents now buy web-search, extraction, and identity APIs on a user's behalf, but vendor marketing isn't a benchmark. Which APIs actually answer correctly, return fresh data, and do so at the lowest cost per correct answer?
- Approach
- A vendor-neutral leaderboard over hand-verified golden sets (web search, web extraction, KYB identity, freshness lag, agent-onboarding harness), with deterministic scoring and a published methodology. Missing coverage renders as —, never 0.
- Architecture
- Stage-gated pipeline (primitive spec → golden dataset with public/holdout split → vendor adapters → deterministic scoring → snapshot) surfaced as a static leaderboard plus machine surfaces: llms.txt, Atom feed, and an MCP server with recommend()/query() tools.
- Metrics
- Reports hit@k, field-level fidelity, freshness lag (Kaplan–Meier time-to-retrievability), cost-per-correct, and a public/holdout overfit gap.
- Status
- Prototype data live now; vendor numbers land with the 2026-Q3 snapshots after ToS gates clear (labeled honestly on the page).
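
To make the metrics listed above concrete, here is a minimal Python sketch of how they could be computed. It is an illustration under stated assumptions, not the project's actual scoring code: every function name and signature is hypothetical, the overfit gap is assumed to be public score minus holdout score, and the Kaplan–Meier curve is assumed to track the fraction of items not yet retrievable after a given lag, with items still unretrievable at the end of observation treated as censored.

```python
from typing import List, Optional, Sequence, Set, Tuple


def hit_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """1.0 if any of the top-k results is in the gold set, else 0.0."""
    return 1.0 if any(r in gold for r in ranked[:k]) else 0.0


def cost_per_correct(total_cost_usd: float, n_correct: int) -> Optional[float]:
    """Spend divided by correct answers; None (not 0) when nothing was correct."""
    return total_cost_usd / n_correct if n_correct > 0 else None


def overfit_gap(public: Optional[float], holdout: Optional[float]) -> Optional[float]:
    """Assumed definition: public-split score minus holdout-split score."""
    if public is None or holdout is None:
        return None
    return public - holdout


def kaplan_meier(durations: Sequence[float],
                 observed: Sequence[bool]) -> List[Tuple[float, float]]:
    """Product-limit estimate: [(t, S(t))] at each time with an observed event.

    observed[i] is False when item i was censored (never seen retrievable).
    """
    pairs = sorted(zip(durations, observed))
    at_risk = len(pairs)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        ties = sum(1 for d, _ in pairs if d == t)          # items leaving the risk set at t
        events = sum(1 for d, o in pairs if d == t and o)  # of which actually became retrievable
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= ties
        i += ties
    return curve


def render(value: Optional[float], digits: int = 2) -> str:
    """Missing coverage renders as the dash glyph (U+2014), never as 0."""
    return "\u2014" if value is None else f"{value:.{digits}f}"
```

Returning None for an undefined metric, rather than 0, keeps "no coverage" distinguishable from "scored zero" through to the rendering step, which is the distinction the Approach entry calls out.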