Projects & Publications

Things I've shipped.

A two-time founder building measurement and freshness infrastructure for the AI era. Each project below is framed the way an engineer reads it: the problem, the architecture, the trade-offs, and the numbers.

Systems

arlen/bench

Live Evals
Builder & maintainer · arlenkumar.com/bench
Problem
AI agents now buy web-search, extraction, and identity APIs on a user's behalf, but vendor marketing isn't a benchmark. Which vendors actually return correct, fresh answers, and at what cost per correct answer?
Approach
A vendor-neutral leaderboard over hand-verified golden sets (web search, web extraction, KYB identity, freshness lag, agent-onboarding harness), with deterministic scoring and a published methodology. Missing coverage renders as an explicit missing-value marker, never as 0.
Architecture
Stage-gated pipeline (primitive spec → golden dataset with public/holdout split → vendor adapters → deterministic scoring → snapshot) surfaced as a static leaderboard plus machine surfaces: llms.txt, Atom feed, and an MCP server with recommend() / query() tools.
Metrics
Reports hit@k, field-level fidelity, freshness lag (Kaplan–Meier time-to-retrievability), cost-per-correct, and a public/holdout overfit gap.
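A minimal sketch of how three of these metrics could be computed, assuming each vendor run is a ranked list of result IDs compared against a golden set. The function names and the choice of `None` for missing coverage are illustrative assumptions, not the arlen/bench code.

```python
from typing import Optional, Sequence

def hit_at_k(ranked_ids: Sequence[str], gold_ids: set[str], k: int) -> float:
    """1.0 if any golden result appears in the top-k, else 0.0."""
    return 1.0 if any(doc in gold_ids for doc in ranked_ids[:k]) else 0.0

def mean_hit_at_k(runs, k: int) -> Optional[float]:
    """Average hit@k over (ranked_ids, gold_ids) pairs.

    Returns None (not 0.0) when there are no runs, so missing coverage
    stays distinguishable from a genuinely bad score.
    """
    if not runs:
        return None
    return sum(hit_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)

def cost_per_correct(total_cost_usd: float, n_correct: int) -> Optional[float]:
    """Dollars spent per correct answer; undefined (None) with zero correct."""
    if n_correct == 0:
        return None
    return total_cost_usd / n_correct

def overfit_gap(public_score: Optional[float],
                holdout_score: Optional[float]) -> Optional[float]:
    """Public-split score minus holdout-split score; a large positive gap
    suggests a vendor is tuned to the public golden set."""
    if public_score is None or holdout_score is None:
        return None
    return public_score - holdout_score
```

Returning `None` instead of a number is one way to keep "not measured" separate from "measured and zero" all the way to the rendering layer.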
Status
Prototype data live now; vendor numbers land with the 2026-Q3 snapshots after ToS gates clear (labeled honestly on the page).
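The freshness-lag metric above uses a Kaplan–Meier estimator. The sketch below is the textbook form, under the assumption that each tracked page has a duration (time until it became retrievable, or until observation ended) and a flag saying whether retrieval was actually observed. Pages never retrieved during the window are treated as censored. The input layout is hypothetical.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct event time.

    durations: time until the page became retrievable, or until
               observation stopped (censored).
    observed:  True if retrieval was seen, False if censored.
    S(t) is the estimated probability a page is still NOT retrievable at t.
    """
    events = Counter(t for t, seen in zip(durations, observed) if seen)
    exits = Counter(durations)   # every page leaves the risk set at its duration
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Standard product-limit update: S(t) *= 1 - d_t / n_t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

def median_lag(curve):
    """First time at which S(t) drops to 0.5 or below, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None
```

Censoring matters here: dropping pages that were never retrieved would bias the lag estimate downward, which is the usual reason to reach for Kaplan–Meier rather than a plain mean.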

GEO-16

Research arXiv:2509.10762
First author · UC Berkeley Hearst Lab
Problem
Generative search engines cite sources instead of ranking links. What page properties actually predict citation?
Approach
A 16-pillar rubric scoring a page's citation-worthiness, validated against 18,635 audited AI citations from production engines.
Finding
Metadata & freshness, semantic HTML, and structured data had the strongest association with citation; overall page quality was itself a strong predictor.
Metrics
Combined G ∈ [0,1] citation-worthiness score; per-pillar associations reported in the paper.

Knowledge-freshness RAG infrastructure

AI Systems
Co-founder & CTO · Wrodium
Problem
AI assistants cite stale content, and brands have no way to detect or fix it. Freshness — not just relevance — drives citation.
Approach
A recrawl-and-change-detection pipeline: content delta tracking, staleness scoring, sitemap/llms.txt orchestration, and citation-drift monitoring across ChatGPT / Gemini / Perplexity / Claude.
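As a rough illustration of content delta tracking between two crawls of the same page, the sketch below fingerprints normalized text and, when the fingerprints differ, measures how much changed at the token level. The normalization rule and function names are assumptions for illustration, not the Wrodium pipeline.

```python
import difflib
import hashlib

def _tokens(text: str) -> list[str]:
    """Lowercase and split on whitespace so formatting-only edits are ignored."""
    return text.lower().split()

def fingerprint(text: str) -> str:
    """Stable hash of the normalized content, cheap to store per crawl."""
    return hashlib.sha256(" ".join(_tokens(text)).encode("utf-8")).hexdigest()

def content_delta(old: str, new: str) -> float:
    """Fraction of content that changed: 0.0 identical, 1.0 nothing shared."""
    if fingerprint(old) == fingerprint(new):
        return 0.0   # fast path: skip the diff when nothing changed
    matcher = difflib.SequenceMatcher(None, _tokens(old), _tokens(new))
    return 1.0 - matcher.ratio()
```

The hash comparison keeps the common no-change case cheap; the token diff only runs on pages that actually moved.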
Architecture
Production RAG with hybrid retrieval (dense + BM25, reciprocal rank fusion), cross-encoder re-ranking, pgvector/HNSW indexes, AI-bot crawl analytics (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) via Cloudflare, and grounded citation enforcement.
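Reciprocal rank fusion, named above as the way dense and BM25 results are merged, can be sketched in a few lines. This is the standard formulation (each document scores the sum of 1 / (k + rank) across rankings, with the commonly used constant k = 60); the document IDs and tie-breaking rule are illustrative assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse several ranked lists of doc IDs into one list.

    rankings: iterable of lists, each ordered best-first.
    k:        smoothing constant; 60 is the conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_hits = ["a", "b", "c"]   # hypothetical dense-retriever output
bm25_hits = ["b", "c", "d"]    # hypothetical BM25 output
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Because RRF uses only ranks, it needs no score calibration between the dense and lexical retrievers, which is the usual reason to pick it over weighted score sums.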
Metrics
Retrieval evaluated with nDCG, MRR, recall@k; freshness tracked as citation-drift over time.
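For reference, the three retrieval metrics named above can be sketched as follows. This assumes binary relevance sets for recall@k and reciprocal rank, and graded gains with the linear-gain form of DCG for nDCG; MRR is the mean of the per-query reciprocal ranks. The names are illustrative.

```python
import math

def recall_at_k(ranked, relevant: set, k: int) -> float:
    """Share of relevant docs that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked, relevant: set) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries) -> float:
    """MRR over (ranked, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

def ndcg_at_k(ranked, gains: dict, k: int) -> float:
    """nDCG@k with linear gains: sum(gain_i / log2(i + 1)) over the ideal DCG."""
    dcg = sum(gains.get(doc, 0.0) / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```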

WebMCP

Agent infra
CTO · Wrodium
Problem
Every storefront and schema on the web was built for human habits — but AI agents are becoming the buyers. They need machine-native surfaces to act on a brand's behalf.
Approach
MCP-native infrastructure that exposes brand surfaces to LLM agents as callable tools — structured-output extraction, grounding, and discovery via .well-known manifests.
Proof
The arlen/bench MCP server is a live, public instance of the pattern: a fresh agent session discovers the manifest and calls recommend() end-to-end.

Air Quake Simulations

Acquired 0→1 hardware
Founder · exited before 21
Problem
VR flight simulation hardware was priced for institutions, not enthusiasts.
Approach
Designed and shipped VR flight-simulator cockpit hardware end-to-end — physical build plus Python/PyTorch and Unity3D on the software side.
Outcome
Shipped hundreds of units at roughly one-tenth of incumbent pricing; the company was acquired before the founder turned 21.
Signal
End-to-end ownership under ambiguity, unit economics, and a real exit — the founding-engineer profile.

Publications

GEO-16: A Benchmark for Generative Engine Optimization
Kumar & Palkhouski · arXiv:2509.10762 · 2025 · CC BY 4.0
First-author benchmark for AI citation behavior. Explainer · arXiv
CHASE
Hearst Lab · submitted to COLM 2026 (decisions July 8, 2026)
Work on LLM citation and knowledge-freshness behavior. Preprint link goes live when decisions land.
Lean 4 formalization of GEO-16
UC Berkeley CS 294-268 · formal methods
Machine-checked formalization work — repo links here once published.
Talk — "The Ad-ification of AI: When Chatbots Become Salespeople"
CITRIS, UC Berkeley
On incentives and attribution as AI search becomes the interface to information.

Code and dataset repos for GEO-16 and the Lean 4 work are being prepared for open-source release; this page links them as they publish.