Evaluation

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

The paper correctly prioritizes out-of-distribution rank transfer, but its proposed score conflates predictive validity with risk-adjusted utility.
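To make the conflation concrete, the sketch below computes the two quantities separately on made-up numbers. Everything in it is an assumption for illustration and is not taken from the paper: the agent names, the scores, the use of Spearman correlation as the rank-transfer measure, and the mean-minus-deviation form of risk-adjusted utility. It suggests that a benchmark can transfer rankings perfectly while a risk-averse user would still prefer a different agent, which is why a single blended score is hard to interpret.

```python
from statistics import mean, pstdev


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def risk_adjusted_utility(outcomes, risk_aversion=1.0):
    """Assumed utility form: mean outcome minus a penalty on its spread."""
    return mean(outcomes) - risk_aversion * pstdev(outcomes)


# Hypothetical data: leaderboard scores and per-episode OOD outcomes.
agents = ["A", "B", "C"]
in_distribution = {"A": 0.9, "B": 0.8, "C": 0.7}
ood_outcomes = {
    "A": [1.0, 0.0, 1.0, 0.0],      # highest mean, very erratic
    "B": [0.45, 0.45, 0.45, 0.45],  # slightly lower mean, stable
    "C": [0.4, 0.4, 0.4, 0.4],
}

# Predictive validity: does the leaderboard order predict the OOD order?
rho = spearman(
    [in_distribution[a] for a in agents],
    [mean(ood_outcomes[a]) for a in agents],
)

# Risk-adjusted utility: which agent would a risk-averse user deploy?
utility = {a: risk_adjusted_utility(ood_outcomes[a]) for a in agents}
preferred = max(utility, key=utility.get)

print(f"rank transfer (Spearman): {rho:.2f}")
print(f"preferred under risk aversion: {preferred}")
```

On these numbers the ranking transfers perfectly, yet the top-ranked agent is not the one preferred once variance is penalized. Reporting the two quantities side by side keeps that distinction visible.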

The paper offers a useful distinction between template, realized graph, and trace, along with a reporting protocol, but it lacks a reproducible survey methodology.