Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

The paper correctly prioritizes out-of-distribution rank transfer, but its proposed score conflates predictive validity with risk-adjusted utility.

June 21, 2026 · Sai Boorlagadda