Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents
The paper correctly prioritizes out-of-distribution rank transfer, but its proposed score conflates predictive validity with risk-adjusted utility.
The paper offers a useful template/realized-graph/trace distinction and reporting protocol, but lacks a reproducible survey methodology.