Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

Paper: Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents
Authors: Dhaval C. Patel et al.
Submitted: June 18, 2026
arXiv: 2606.19704

One-sentence summary

This position paper argues that agent leaderboards should be judged by whether their rankings transfer out of distribution, not merely by in-sample mean performance, and proposes a twelve-tier measurement framework for exposing deployment-relevant differences among agent configurations.

What kind of paper this is

This is a position and synthesis paper, not a completed empirical validation. It combines:

  • seven prior agent benchmarks or deployed-system papers;
  • fourteen implementation studies extending AssetOpsBench;
  • retrospective public-versus-hidden leaderboard results; and
  • a proposed future experiment for testing rank stability.

The paper explicitly states that it runs no new controlled experiments. Its empirical base includes roughly 6,000 judged trajectories, but the fourteen extension studies are unpublished implementation reports from a shared institutional and benchmark context.

Central argument

Aggregate scores collapse agent configurations that may have similar average success but materially different:

  • latency and cost;
  • tool-call hygiene;
  • orchestration;
  • retrieval behavior;
  • reasoning-mode sensitivity;
  • artifact reuse across turns;
  • failure modes;
  • infrastructure overhead;
  • robustness to query phrasing or domain changes; and
  • dependence on an LLM judge.

The paper therefore proposes evaluating a leaderboard by its predictive validity: whether the ordering it produces in one setting predicts the ordering observed under held-out or shifted conditions.

The motivating evidence includes:

  • a public-to-hidden Spearman correlation of \(\rho=-0.13\) for an execution track with \(n=13\), although its confidence interval is wide and includes substantial positive correlations;
  • \(\rho=0.69\) for a planning track with \(n=20\);
  • cross-benchmark rank correlations reported elsewhere ranging from 0.32 to 0.85; and
  • implementation studies in which similar aggregate scores conceal large differences in latency, retrieval cost, per-rubric quality, or tool behavior.

This supports the narrower claim that leaderboard rank transfer cannot safely be assumed. It does not yet establish the paper’s proposed scoring system.
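
To see why the execution-track interval is wide, the sketch below computes an approximate 95% interval for a Spearman correlation using the Fisher z-transform. The standard error sqrt(1.06 / (n - 3)) is a common large-sample approximation and is rough at these sample sizes; the paper's own interval method is not stated here, so this is an illustration rather than a reproduction. The function name is an assumption.

```python
import math


def spearman_fisher_ci(rho: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for a Spearman correlation.

    Assumption: Fisher z-transform with standard error sqrt(1.06 / (n - 3)).
    This is an approximation and is only rough for small n.
    """
    if n <= 3:
        raise ValueError("need n > 3")
    z = math.atanh(rho)                 # Fisher transform of the point estimate
    se = math.sqrt(1.06 / (n - 3))      # approximate standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Point estimates and sample sizes quoted in the list above.
execution_ci = spearman_fisher_ci(-0.13, 13)
planning_ci = spearman_fisher_ci(0.69, 20)

print("execution track: [%.2f, %.2f]" % execution_ci)
print("planning track:  [%.2f, %.2f]" % planning_ci)
```

Under this approximation the execution-track interval spans both clearly negative and clearly positive values, which is why the point estimate alone cannot establish that rankings fail to transfer.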

Three proposed tests of rank transfer

The paper defines three levels of out-of-distribution evaluation:

  1. Held-out scenarios: A stratified random split within the benchmark. This tests ordinary sample stability rather than strong distribution shift.
  2. Cross-subset transfer: Rank systems on all but one domain or asset subset, then test on the omitted subset.
  3. Adversarial perturbation: Re-evaluate semantically equivalent tasks after paraphrasing, identifier renaming, time-window changes, or distractor injection.

These tests are useful and increasingly demanding. The cross-subset and perturbation tests are more deployment-relevant than a random holdout, which measures only sample stability within the benchmark’s own distribution.
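
A minimal sketch of the cross-subset test: rank configurations on all subsets but one, then compare that ordering with the ordering on the omitted subset. The subset names, configuration names, and scores are invented for illustration, and the helper functions are assumptions, not the paper's code.

```python
from statistics import mean


def ranks(values):
    """Average ranks (1 = smallest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rho computed as the Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def leave_one_subset_out(scores):
    """scores: {config: {subset: mean success}} -> {held_out_subset: rho}."""
    configs = sorted(scores)
    subsets = sorted(next(iter(scores.values())))
    transfer = {}
    for held_out in subsets:
        in_sample = [mean(scores[c][s] for s in subsets if s != held_out) for c in configs]
        out_of_sample = [scores[c][held_out] for c in configs]
        transfer[held_out] = spearman(in_sample, out_of_sample)
    return transfer


# Hypothetical per-subset success rates for four configurations.
SCORES = {
    "config_a": {"chiller": 0.82, "pump": 0.80, "turbine": 0.62},
    "config_b": {"chiller": 0.75, "pump": 0.74, "turbine": 0.66},
    "config_c": {"chiller": 0.70, "pump": 0.69, "turbine": 0.68},
    "config_d": {"chiller": 0.60, "pump": 0.58, "turbine": 0.50},
}

transfer = leave_one_subset_out(SCORES)
for subset, rho in transfer.items():
    print(f"held out {subset}: rho = {rho:.2f}")
```

In this toy data the ordering transfers cleanly to two subsets and poorly to the third, the pattern a random holdout would tend to average away.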

Twelve-tier measurement framework

The proposed measurement apparatus includes:

  1. Success
  2. Tool-call hygiene
  3. Planning quality
  4. Capability axes
  5. Cost and efficiency
  6. Failure modes
  7. Integrity and reproducibility
  8. Deployment infrastructure
  9. Multi-turn dialog
  10. Reasoning mode
  11. Knowledge augmentation
  12. Evidence grounding and verification

The paper does not claim that every leaderboard should collapse all twelve into one number. It recommends a layered presentation:

  • headline rank;
  • cost-quality Pareto view;
  • per-tier drill-downs; and
  • uncertainty and significance information.

It also recommends declaring the complete configuration rather than only the model: architecture, reasoning mode, retrieval strategy, prompt constraints, and verifier type.
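
One way to picture the configuration declaration and the cost-quality Pareto view together is sketched below. The field names follow the list in the preceding sentence; the class names, the cost unit, and every value are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    """Full declared configuration, not just the model name."""
    model: str
    architecture: str
    reasoning_mode: str
    retrieval_strategy: str
    prompt_constraints: str
    verifier_type: str


@dataclass(frozen=True)
class Result:
    label: str
    config: AgentConfig
    quality: float    # higher is better (hypothetical judged score)
    cost_usd: float   # lower is better (hypothetical cost per task)


def pareto_front(results):
    """Keep results that no other result beats on both quality and cost."""
    front = []
    for r in results:
        dominated = any(
            o.quality >= r.quality
            and o.cost_usd <= r.cost_usd
            and (o.quality > r.quality or o.cost_usd < r.cost_usd)
            for o in results
        )
        if not dominated:
            front.append(r)
    return sorted(front, key=lambda r: r.cost_usd)


def make(label, reasoning, retrieval, quality, cost):
    cfg = AgentConfig("model-x", "single-agent", reasoning, retrieval,
                      "max 10 tool calls", "rule-based")
    return Result(label, cfg, quality, cost)


# Hypothetical results for three configurations of the same model.
RESULTS = [
    make("single-pass", "standard", "single-pass RAG", 0.78, 0.02),
    make("extended", "extended", "single-pass RAG", 0.77, 0.05),
    make("multi-hop", "standard", "multi-hop retrieval", 0.90, 0.15),
]

front_labels = [r.label for r in pareto_front(RESULTS)]
print(front_labels)
```

A headline rank by quality alone would list all three; the Pareto view drops the configuration that costs more without scoring higher.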

Evidence synthesized from the implementation studies

The strongest examples do not prove rank instability, but they clearly show variables that an aggregate score omits:

  • Extended reasoning increased planning latency by 41.9% while improving one clarity rubric by 31 percentage points and leaving some other dimensions unchanged.
  • Multi-hop knowledge retrieval approached 90% accuracy but used 4.5–10 times more tokens and substantially more latency than single-pass RAG.
  • A supervisor-specialist architecture reused artifacts across turns, making later turns 4.2 times faster, while a parallel variant increased token use and tail latency.
  • Confidence-gated routing reduced unnecessary expensive-tool calls and sharply improved sequence correctness.
  • MCP subprocess and transport overhead appeared as a major latency floor in several studies.
  • A temporal cache produced large speedups but only 0.64 F1 for safe hit decisions, showing that cache performance and cache trustworthiness are different metrics.
  • A multimodal serving optimization improved both speed and judged quality, while an aggressive FP8 KV-cache configuration caused complete response collapse on one vision-language model.
  • Judge-independent rule or DAG verifiers produced useful anchors against LLM-judge drift.

These cases make a convincing argument for reporting configuration, cost, infrastructure, and failure-mode data alongside task success.
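
The temporal-cache example can be made concrete with a toy calculation that reports speedup and the F1 of safe-hit decisions as separate numbers. The decision vectors and latencies below are invented, and the paper's exact definition of a safe hit is not given here, so this is only a sketch of the distinction.

```python
def precision_recall_f1(predicted, actual):
    """Precision, recall, and F1 for boolean decisions against boolean ground truth."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical per-request data: did the cache answer, and was answering from cache safe?
served_from_cache = [True, True, True, False, False, True, False, True]
safe_to_serve = [True, False, True, True, False, False, False, True]

UNCACHED_S, CACHED_S = 2.0, 0.1  # hypothetical latencies in seconds

mean_latency = sum(CACHED_S if hit else UNCACHED_S for hit in served_from_cache) / len(served_from_cache)
speedup = UNCACHED_S / mean_latency
precision, recall, f1 = precision_recall_f1(served_from_cache, safe_to_serve)

print(f"speedup: {speedup:.2f}x, safe-hit F1: {f1:.2f}")
```

The speedup depends only on how often the cache answers, while the F1 depends on whether it should have answered, so one can be high while the other is poor.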

The proposed predictive-validity score

The paper suggests:

\[ \mathrm{PV}(c) = \alpha\,\bar{Y}_c - \beta\,\sigma_{Y_c,\mathrm{OOD}} - \gamma\,\operatorname{IQR}(Y_c) \]

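where the three terms are, in order, the in-sample mean score of configuration \(c\), the variation of its rank under OOD conditions, and the interquartile range of its per-scenario scores. The weights \(\alpha\), \(\beta\), and \(\gamma\) are left for future fitting.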

This formula is the weakest part of the proposal.

First, predictive validity is normally a property of a measurement or ranking procedure: the association between a predictor measured now and a criterion observed later. The proposed formula instead assigns each configuration a performance-minus-instability score. That is a risk-adjusted utility or robustness score, not predictive validity itself.

Second, the formula requires OOD outcomes to compute the OOD-rank variation term. Once those outcomes are known, the score is summarizing observed robustness rather than predicting it.

Third, the proposed weight-fitting text says the weights are selected to maximize correlation with Criterion-B/C ranks. Without a nested development/evaluation split, this tunes directly on the OOD results used to claim validity.

Fourth, the terms mix score units, rank variation, and per-scenario dispersion. Convex weights do not make those quantities commensurable; they require normalization and sensitivity analysis.
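
The commensurability problem can be illustrated with a small sketch in which the winner under the formula changes when the OOD-rank term is expressed in rank positions versus as a fraction of leaderboard size. The two configurations, their values, the equal weights, and the leaderboard size are all hypothetical.

```python
def pv_score(mean_score, ood_rank_sd, iqr, alpha=1 / 3, beta=1 / 3, gamma=1 / 3):
    """Weighted mean minus two dispersion penalties, following the formula quoted above."""
    return alpha * mean_score - beta * ood_rank_sd - gamma * iqr


# Hypothetical inputs: (mean success in [0, 1], SD of OOD rank in rank positions,
# IQR of per-scenario success in [0, 1]).
CONFIGS = {
    "config_a": (0.85, 2.0, 0.10),
    "config_b": (0.70, 1.0, 0.10),
}
LEADERBOARD_SIZE = 20  # hypothetical number of ranked configurations


def winner(rank_scale):
    """Return the top configuration after dividing the rank SD by rank_scale."""
    scores = {
        name: pv_score(mean_score, rank_sd / rank_scale, iqr)
        for name, (mean_score, rank_sd, iqr) in CONFIGS.items()
    }
    return max(scores, key=scores.get)


raw_winner = winner(rank_scale=1.0)                      # rank SD in positions
normalized_winner = winner(rank_scale=LEADERBOARD_SIZE)  # rank SD as a fraction

print(raw_winner, normalized_winner)
```

With identical weights, the choice of unit alone flips the ordering, which is why normalization and a sensitivity analysis are needed before any weights are fitted.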

A cleaner methodology would separate:

  • configuration utility: a declared deployment-specific function over quality, cost, latency, and risk;
  • benchmark predictive validity: rank correlation and calibration against untouched target environments; and
  • uncertainty: bootstrap intervals, top-\(k\) stability, and probability of rank reversal.
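
As a sketch of the uncertainty component, the probability of rank reversal between two configurations can be estimated with a paired bootstrap over scenarios. The per-scenario outcomes below are invented, and the function name and resampling settings are assumptions.

```python
import random


def rank_reversal_probability(scores_a, scores_b, n_boot=2000, seed=0):
    """Paired bootstrap over scenarios.

    Returns the fraction of resamples in which B's total strictly exceeds A's,
    i.e. how often the observed ordering "A ahead of B" would flip.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired per scenario")
    rng = random.Random(seed)
    n = len(scores_a)
    reversals = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample scenarios with replacement
        if sum(scores_b[i] for i in idx) > sum(scores_a[i] for i in idx):
            reversals += 1
    return reversals / n_boot


# Hypothetical binary success per scenario (20 scenarios each).
A = [1] * 14 + [0] * 6                       # 14 of 20
B = [1] * 9 + [0] * 5 + [1] * 4 + [0] * 2    # 13 of 20, different failures
C = [1] * 4 + [0] * 16                       # 4 of 20, never succeeds where A fails

p_ab = rank_reversal_probability(A, B)
p_ac = rank_reversal_probability(A, C)
print(f"P(B overtakes A) = {p_ab:.2f}, P(C overtakes A) = {p_ac:.2f}")
```

A one-point gap in success counts leaves a substantial chance of reversal, whereas the A-versus-C ordering never flips in this data; a leaderboard that shows only means hides that difference.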

Strengths

  1. It asks the correct deployment question. A benchmark matters only insofar as its conclusions transfer to the target environment.
  2. The paper is unusually falsifiable for a position paper. It gives explicit OOD tests, thresholds, and a planned pilot.
  3. It treats the evaluated object as a full configuration. Model-only leaderboards are inadequate for agent systems whose behavior depends on orchestration, tools, prompts, and infrastructure.
  4. The twelve tiers form a useful reporting checklist. Even if they are neither exhaustive nor orthogonal, they expose many hidden confounders.
  5. It emphasizes judge-independent verification. Deterministic or rule-based anchors are important for detecting judge drift and evaluator gaming.
  6. It reports inconvenient findings. Several optimizations trade accuracy against cost, or speed against reliability, rather than producing simple wins.
  7. The limitations section is candid. The authors clearly label the central empirical validation, tier independence, and real-deployment linkage as unfinished.

Limitations and concerns

  1. The central empirical claim remains untested by the paper. Existing rank-correlation evidence is sparse, heterogeneous, and based on small samples.
  2. The score is mislabeled. The proposed configuration-level PV score is not the same construct as predictive validity.
  3. Potential evaluation leakage is built into the fitting proposal. Using Criterion-B/C results to choose weights and then evaluating correlation on those criteria would overstate generalization.
  4. The twelve tiers are neither shown to be independent nor derived through a reproducible measurement-development procedure. Some overlap substantially: tool hygiene, planning quality, failure modes, evidence grounding, and integrity can describe the same event.
  5. Thresholds are insufficiently justified. Values such as \(\rho<0.85\), 10% top-rank displacement, and Pearson correlation above 0.2 are declared rather than derived from deployment loss or decision consequences.
  6. Some falsification conditions are not logically necessary for the central position. Rank instability does not require higher-mean systems to have greater OOD variance, so failure of that condition should not refute the leaderboard-transfer concern.
  7. The evidence is concentrated around one benchmark family. AssetOpsBench, its competition, and its extensions dominate the argument, limiting external validity.
  8. The implementation studies are not independent replications. They share a benchmark and institutional context and are unpublished. The paper acknowledges this and appropriately calls the result “architectural sensitivity.”
  9. Incommensurable effect sizes are plotted together. Speedup, token reduction, accuracy ratios, and judge-score improvements appear on one logarithmic chart. This is illustrative, not a valid quantitative synthesis.
  10. Real operational validity is absent. No result connects benchmark rank to operator overrides, incidents, maintenance outcomes, false alarms, or business value.
  11. Predictive validity alone does not determine deployment choice. A perfectly stable ranking may optimize the wrong objective, and different operators have different utility functions and risk constraints.
  12. Benchmark adaptation can make validity decay. Once participants optimize against perturbations and hidden subsets, those tests become part of the effective training distribution and require renewal.

Better experimental design

A rigorous follow-up should:

  1. Freeze an untouched target environment before model or score development.
  2. Define several complete agent configurations, not merely models.
  3. Pre-register benchmark metrics and any utility function.
  4. Use nested splits:
    • development environments for selecting metrics or score weights;
    • validation environments for freezing the ranking rule;
    • untouched target environments for estimating predictive validity.
  5. Report:
    • Spearman and Kendall rank correlation;
    • bootstrap confidence intervals;
    • top-\(k\) retention;
    • pairwise rank-reversal probability;
    • regret from selecting the benchmark winner;
    • calibration against deployment utility; and
    • sensitivity to task mix and operator preferences.
  6. Repeat across unrelated domains such as software engineering, customer support, and scientific agents.
  7. Keep benchmark validity separate from configuration utility instead of merging them into a single PV score.
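
Several of the quantities in step 5 are simple to compute once benchmark and target scores exist for the same configurations. The sketch below shows Kendall's tau, top-k retention, and selection regret; the scores are invented, and the tau variant shown assumes no ties.

```python
def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs. Assumes no ties."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


def top_k(scores, k):
    """Names of the k highest-scoring configurations."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def top_k_retention(benchmark, target, k):
    """Share of the benchmark's top-k that is also in the target's top-k."""
    return len(top_k(benchmark, k) & top_k(target, k)) / k


def selection_regret(benchmark, target):
    """Target score lost by deploying the benchmark winner instead of the target's best."""
    chosen = max(benchmark, key=benchmark.get)
    return max(target.values()) - target[chosen]


# Hypothetical scores: benchmark ranking versus an untouched target environment.
BENCHMARK = {"config_a": 0.81, "config_b": 0.745, "config_c": 0.695, "config_d": 0.59}
TARGET = {"config_a": 0.62, "config_b": 0.66, "config_c": 0.68, "config_d": 0.50}

names = sorted(BENCHMARK)
tau = kendall_tau([BENCHMARK[n] for n in names], [TARGET[n] for n in names])
retention = top_k_retention(BENCHMARK, TARGET, k=2)
regret = selection_regret(BENCHMARK, TARGET)
print(f"tau = {tau:.2f}, top-2 retention = {retention:.2f}, regret = {regret:.2f}")
```

Regret is the most decision-relevant of the three, because it states the loss in the target environment's own units rather than as a correlation.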

Practical implications

For agent evaluation, the paper’s reporting recommendations are immediately useful:

  • Version and disclose the full agent configuration.
  • Preserve raw traces and deterministic verifier outputs.
  • Separate task quality from token, tool, latency, and infrastructure cost.
  • Include repeated runs and rank uncertainty.
  • Measure transfer across task subsets, not only random examples.
  • Add metamorphic tests such as paraphrases, identifier changes, and irrelevant context.
  • Report failure categories and termination causes.
  • Evaluate cache safety, routing quality, and tool-call necessity directly.
  • Keep LLM judges paired with non-LLM anchors wherever possible.
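
The metamorphic-test recommendation can be sketched as a small invariance check: perturb a task in ways that should not change the correct behavior, then compare the agent's output with its output on the original task. The task text, identifiers, and the deliberately brittle stand-in agent are all hypothetical.

```python
import re


def rename_identifiers(task, mapping):
    """Replace each identifier in `mapping` with its substitute, consistently."""
    pattern = re.compile("|".join(re.escape(old) for old in mapping))
    return pattern.sub(lambda m: mapping[m.group(0)], task)


def inject_distractor(task, distractor):
    """Append an irrelevant sentence that should not change the answer."""
    return f"{task} {distractor}"


def brittle_agent(task):
    """Toy stand-in for an agent that keys on a memorized identifier prefix."""
    return "lookup" if "CH-" in task else "refuse"


def invariance_report(agent, task, variants):
    """For each named variant, report whether the agent's output matches the baseline."""
    baseline = agent(task)
    return {name: agent(variant) == baseline for name, variant in variants.items()}


TASK = "Report the maximum supply temperature for CH-101 over the last 7 days."
VARIANTS = {
    "rename": rename_identifiers(TASK, {"CH-101": "UNIT-A"}),
    "distractor": inject_distractor(TASK, "The site cafeteria closes at 3 pm."),
}

report = invariance_report(brittle_agent, TASK, VARIANTS)
print(report)
```

A real harness would compare verified task outcomes rather than raw strings, and would map renamed identifiers back before comparing answers that mention them.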

Bottom line

The paper makes a strong case that static mean-score leaderboards are inadequate for agents and offers a practical multidimensional reporting checklist. Its core recommendation, to measure whether rankings transfer, is correct. However, the proposed PV formula should not yet be adopted: it conflates predictive validity with risk-adjusted utility, depends on observed OOD performance, and lacks a leakage-safe validation design. The paper is best treated as a valuable research agenda and evaluation checklist, not as a validated leaderboard methodology.