Methodology
Every score on this site is reproducible from a published pipeline. This page is the contract: how golden truth is built, how vendors are queried, how cells are scored, and what the bench deliberately does not claim.
The Pipeline
Metric Definitions
- hit@k
- Share of golden queries where a normalized golden URL appears in the vendor's top k results. URL normalization strips
www.,utm_*parameters, fragments, and trailing slashes. - extraction fidelity (0–1)
- Weighted field score per page: title 0.15, anchor phrases 0.35, table cells 0.25, numeric facts 0.15, excluded noise 0.10. Weights renormalize over the fields a golden row defines. Correct ≥ 0.90; partial 0.50–0.90.
- time-to-retrievability
- Hours from a sentinel page's authoritative publish timestamp to its first appearance in vendor search results. Kaplan–Meier median over probes every 6h, right-censored at 30 days.
- cost per correct
- Total billed spend on the disclosed pinned plan ÷ verified-correct answers. Undefined (rendered
—) when a vendor scores zero correct — never 0. - overfit gap
- Public-split accuracy minus holdout accuracy, per vendor. A persistent positive gap indicates tuning to the published set.
- agent-ready
- An autonomous agent obtained a working API key in live trials via otp-email or device-code auth, with no human in the loop.
Integrity
Holdout. 30% of every cohort is private. Holdout queries are interleaved with decoys so vendors cannot pattern-match the bench's traffic, and the public-vs-holdout gap is published per vendor.
Rotation. 25% of each cohort rotates every quarter, biased toward strata where vendor scores saturate. Dead golden rows are replaced from the same stratum within 7 days.
Plans and spend. We purchase the cheapest published plan that exposes the capability, archive the pricing page at review time, and disclose exactly which plan every number was billed on.
Vendor review. Every vendor receives its rows 10 business days before publication. Factual corrections trigger re-runs logged in a public errata file; methodology disputes get a standing right of reply, published verbatim and linked from the vendor's row.
Exclusions. Vendors without a programmatic endpoint, or whose terms prohibit automated evaluation, are listed with the reason rather than silently omitted — and are never assigned scores.
Known Limitations
Counts and medians are comparable within a snapshot, not across vendors' differing internal windows. Strata are opinionated; edge cases are decided once and applied uniformly. Sentinel pages measure index freshness for pages we control — they do not measure ranking quality on competitive queries. Harness results depend on pinned framework versions and can shift when frameworks update; versions are disclosed per snapshot. Illustrative prototype data is labeled as such until a snapshot's first full run completes.