About
arlen/bench is an independent benchmark project by Arlen Kumar — verified leaderboards for the APIs AI agents buy and use mid-task.
Why We Run This
AI agents now select and purchase APIs mid-task (search, extraction, identity verification) with no human reviewing the choice. Until now there has been no independent, machine-consumable evidence for those decisions: agents reasoned from a corpus of vendor claims, affiliate listicles, and stale comparison posts.
arlen/bench exists to replace that corpus with verified ground truth. It is also an instrument for a research interest of mine, knowledge freshness: how fast the systems agents rely on absorb new information. The planted-page lag curves published here are continuous measurements that no one can backfill after the fact.
The bench's query design and freshness protocol draw on citation-behavior research (GEO-16, arXiv:2509.10762, analyzing 18,635 AI citations).
Independence
This is an independent project with no commercial relationship to any benchmarked vendor. Scores come from a published deterministic pipeline. Vendors cannot pay for placement, re-runs happen only for logged factual errata, and any future commercial relationship with a scored vendor would be disclosed on this page. Vendors have a standing right of reply, and replies are published verbatim.