Daily AI Digest — 2026-06-19

Published

June 19, 2026

arXiv Highlights

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

This is a position paper consolidating fourteen parallel implementation studies of an MCP-based industrial-agent benchmark (AssetOpsBench) with seven prior agent benchmarks. The thesis is structural: aggregate-score leaderboards (HELM-style, Pass@1 means, even multi-metric dashboards) systematically underspecify the evaluation surface that deployment exposes, and the correct ranking criterion is not the in-sample mean but predictive validity, defined as the rank correlation between in-sample and out-of-sample evaluations of the same systems.

The argument against aggregate scores

The authors give three concrete cases where aggregation hides qualitatively distinct configurations that score identically on Pass@1. First, toggling extended thinking on a Gemma-4-26B planner over 40 AssetOpsBench scenarios leaves overall rubric mean roughly flat but shifts clarity by 31 percentage points (61% → 92%) and hallucination by 7 pp (12% → 5%), while data-retrieval and agent-sequence correctness are unchanged. Latency rises 21.5% end-to-end (15.08 s → 18.32 s) and planning latency by 41.9%. A single mean obscures both the localized quality gain and its cost. Second, Plan-Execute vs. Supervisor-Specialist architectures match on single-turn Pass@1 but differ by 4.2× on turn-2-to-5 latency due to cross-turn artifact reuse, a dimension single-turn benchmarks cannot see. Third, single-pass RAG hits 50–68% accuracy while agentic multi-hop retrieval reaches ~90%, but at 4.5×–10× token inflation; neither dominates without deployment constraints.
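The third case is a Pareto trade-off, and a small dominance check makes it concrete. This is an illustrative sketch only: the `Config` class, the `dominates` and `pick` helpers, and the normalization of single-pass token cost to 1.0 are assumptions, and the two data points take the upper end of the quoted accuracy range and the lower end of the quoted token inflation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Config:
    name: str
    accuracy: float    # higher is better
    token_cost: float  # relative token usage, lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.token_cost <= b.token_cost
    strictly_better = a.accuracy > b.accuracy or a.token_cost < b.token_cost
    return no_worse and strictly_better


def pick(configs: List[Config], max_token_cost: float) -> Optional[Config]:
    """Most accurate configuration that fits a (hypothetical) token budget."""
    feasible = [c for c in configs if c.token_cost <= max_token_cost]
    return max(feasible, key=lambda c: c.accuracy) if feasible else None


# Values drawn from the ranges quoted in the digest; 1.0 is an assumed baseline cost.
single_pass = Config("single-pass RAG", accuracy=0.68, token_cost=1.0)
agentic = Config("agentic multi-hop", accuracy=0.90, token_cost=4.5)
```

Neither configuration dominates the other under this definition, so the preferred one changes with the budget passed to `pick`, which is the sense in which a deployment constraint is needed to break the tie.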

The deeper claim is that scalar leaderboards collapse orthogonal axes (reasoning mode, retrieval strategy, orchestration, transport) and therefore misattribute wins. The SmartGridBench experiment (2,420 trajectories) isolates this: holding the agent fixed and varying transport (direct vs. MCP-stdio) and orchestration (Plan-Execute, Verified PE, Self-Ask) independently shows that MCP standardization adds latency with no quality gain, while orchestration alone moves pass rate from 43.2% to 55.5%. A leaderboard that does not separate transport from orchestration attributes the orchestration win to whichever axis it happens to vary.

The twelve-tier apparatus

The synthesis section organizes measurement into twelve tiers consolidated from prior benchmarks (SWE-Bench, τ-Bench, TaskBench, MCP-Bench, MCP-Universe, ARE, AssetOpsBench) and the fourteen extension studies. T1–T7 are core capability tiers: success rates, tool-call hygiene, planning quality, capability axes, cost/efficiency Pareto, failure-mode taxonomies, and reproducibility/integrity. T8–T12 are deployment-extension tiers absent from nearly all current leaderboards: deployment infrastructure, multi-turn dialog, reasoning-mode adaptivity, knowledge augmentation, and evidence grounding with judge-independent verification. The empirical claim attached to the tier diagram is that no prior single benchmark reports more than four or five tiers as first-class metrics.

Predictive validity as the ranking criterion

The operational core is replacing the in-sample mean with the rank correlation ρ(rank_in, rank_out) across three OOD criteria. Criterion A (mild shift) is a stratified random split preserving the joint distribution of subset and category: a weak test where failures are damning but passes are uninformative. Criteria B and C escalate to held-out scenario classes and adversarial perturbations (the paper frames these as falsifiable thresholds for the position itself). The rationale is that recent public-to-hidden competition retrospectives already show direct rank instability: rankings obtained on public splits do not predict hidden-split rankings, so the in-sample mean is, on this evidence, the wrong objective for a deployment-advising artifact.
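One way to compute such a rank correlation is Spearman's coefficient, sketched below in plain Python. The digest says only "rank correlation", so the choice of Spearman, the average-rank handling of ties, and the function names (`average_ranks`, `predictive_validity`) are all assumptions for illustration, not the paper's protocol.

```python
from typing import Dict, List


def average_ranks(scores: List[float]) -> List[float]:
    """Rank scores descending (1 = best); tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: List[float], y: List[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined when all scores are tied")
    return cov / (vx * vy) ** 0.5


def predictive_validity(in_sample: Dict[str, float],
                        out_sample: Dict[str, float]) -> float:
    """Spearman correlation between in-sample and out-of-sample system rankings."""
    if set(in_sample) != set(out_sample):
        raise ValueError("both evaluations must cover the same systems")
    if len(in_sample) < 2:
        raise ValueError("need at least two systems to correlate rankings")
    systems = sorted(in_sample)
    r_in = average_ranks([in_sample[s] for s in systems])
    r_out = average_ranks([out_sample[s] for s in systems])
    return pearson(r_in, r_out)
```

A value near 1 means the in-sample ordering carries over to the held-out split; a value near 0 or below means the public ranking says little about hidden-split performance.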

Concrete leaderboard proposals

Three implementation moves follow. (1) Declared configuration columns: every submission must report Architecture, Reasoning Mode, Retrieval Strategy, Prompt-Constraint Level, and Verifier Type, since each is an axis that changes how a result should be attributed. (2) Rank by predictive-validity score on at least one OOD criterion; treat in-sample mean as one column. (3) Require a judge-independent anchor: at least one trajectory-level deterministic verifier (rule pipeline, DAG oracle) so LLM-judge drift is detectable. A fourth field-level recommendation, surfaced because three of fourteen studies independently identified it, is to abandon stdio-based MCP for benchmark infrastructure: protocol overhead currently dominates the latency floor and is conflated with reasoning ability in any cost metric.
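The three moves could be enforced by a submission schema along the following lines. This is a hypothetical sketch: the `Submission` record, its field names, and the `problems` and `leaderboard` helpers are invented for illustration, and how a per-submission predictive-validity score would be produced is not specified in the digest.

```python
from dataclasses import dataclass
from typing import List, Optional

# The five declared configuration columns named in proposal (1).
CONFIG_COLUMNS = (
    "architecture",
    "reasoning_mode",
    "retrieval_strategy",
    "prompt_constraint_level",
    "verifier_type",
)


@dataclass(frozen=True)
class Submission:
    system: str
    architecture: str
    reasoning_mode: str
    retrieval_strategy: str
    prompt_constraint_level: str
    verifier_type: str
    in_sample_mean: float                        # reported, but not the sort key
    predictive_validity: Optional[float] = None  # score on an OOD criterion
    has_deterministic_verifier: bool = False     # judge-independent anchor


def problems(sub: Submission) -> List[str]:
    """List the reasons a submission would be rejected (empty list = accepted)."""
    issues = [f"missing {c}" for c in CONFIG_COLUMNS if not getattr(sub, c).strip()]
    if sub.predictive_validity is None:
        issues.append("no predictive-validity score on any OOD criterion")
    if not sub.has_deterministic_verifier:
        issues.append("no judge-independent deterministic verifier")
    return issues


def leaderboard(subs: List[Submission]) -> List[Submission]:
    """Accepted submissions, ordered by predictive validity (highest first)."""
    accepted = [s for s in subs if not problems(s)]
    return sorted(accepted, key=lambda s: s.predictive_validity, reverse=True)
```

Under this sketch a submission with a higher in-sample mean can still rank below one with a lower mean, because `in_sample_mean` is carried as a column and never used for ordering.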

Limitations and open questions

The orthogonality claim for the twelve tiers is explicitly a working hypothesis; the paper does not establish empirical independence (e.g., factor analysis on tier scores) and defers this to future work. The fourteen studies share institutional context, so the authors correctly call their convergence “architectural sensitivity” rather than independent replication — generalization to non-AssetOpsBench domains is asserted, not shown. The predictive-validity proposal also concentrates evaluation cost: maintaining hidden splits, adversarial suites, and deterministic verifiers is expensive and risks centralizing evaluation in well-resourced institutions, a concern the authors flag without resolving. Finally, the three OOD criteria are described with thresholds in the abstract but the paper stops short of running the full predictive-validity protocol on the consolidated 6,000-trajectory corpus; that is the obvious next experiment.

Why this matters

Agent leaderboards are increasingly used as deployment-decision artifacts, but the rank-instability evidence from public-to-hidden retrospectives shows scalar means do not transfer. Replacing in-sample mean with predictive validity, plus declaring configuration axes and trajectory-level verifiers, is a low-cost structural fix that aligns leaderboard incentives with what deployed agents are actually judged on.

Source: https://arxiv.org/abs/2606.19704

ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

Problem

Real-world dexterous manipulation progress is bottlenecked by human-in-the-loop algorithm engineering: tuning controllers, designing rewards, debugging perception, resetting scenes, and iterating on training code. Coding agents (Codex, Claude Code, Kimi Code) have automated substantial portions of digital research, but their feedback loops are confined to deterministic, side-effect-free environments. Physical autoresearch demands repeatable interaction with a non-deterministic world: scenes must reset, outcomes must be verified, and safety must be enforced before any agent-driven optimization loop can run. ENPIRE proposes a harness that supplies exactly that abstraction so that a coding agent can hill-climb a real-world success rate without human babysitting.

Method

ENPIRE decomposes physical autoresearch into two stages, mirroring its name: EN (environment construction from human feedback) and PIRE (policy improvement, rollout, evolution).

Stage 1 — Environment construction. A coding agent procedurally synthesizes environment APIs that wrap the robot stack with: (i) hard kinematic and configuration-space safety constraints whose violation triggers truncation and an automatic reset; (ii) an automated verifier that produces the per-episode reward; and (iii) an automatic reset mechanism. Humans assess the resulting APIs once and the agent refines them; this cost is amortized across all subsequent autoresearch on every robot.
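The shape of such an environment wrapper can be sketched as follows. Everything here is a simplified assumption rather than ENPIRE's actual API: the `SafeEnv` class, the per-joint box limits standing in for kinematic and configuration-space constraints, the zero reward on truncation, and the verifier passed in as a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class EpisodeResult:
    reward: float
    truncated: bool


class SafeEnv:
    """Toy wrapper: checks joint limits, truncates and resets on violation."""

    def __init__(self, joint_limits: Sequence[Tuple[float, float]]):
        self.joint_limits = list(joint_limits)
        self.reset_count = 0

    def reset(self) -> None:
        # A real system would restore the physical scene here.
        self.reset_count += 1

    def is_safe(self, joints: Sequence[float]) -> bool:
        if len(joints) != len(self.joint_limits):
            return False
        return all(lo <= q <= hi for q, (lo, hi) in zip(joints, self.joint_limits))

    def run_episode(
        self,
        commands: List[Sequence[float]],
        verifier: Callable[[List[List[float]]], bool],
    ) -> EpisodeResult:
        self.reset()
        executed: List[List[float]] = []
        for joints in commands:
            if not self.is_safe(joints):
                self.reset()  # automatic reset after a safety violation
                return EpisodeResult(reward=0.0, truncated=True)  # assumed reward
            executed.append(list(joints))
        return EpisodeResult(reward=float(verifier(executed)), truncated=False)
```

The point of the structure is that the coding agent only ever sees `run_episode`: safety checking, truncation, reset, and reward all happen behind that call.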

The verifier is task-specific and built from perception primitives. For zip-tie cutting, for example, two camera views are cropped and segmented to test whether the strap still passes through the head, with redundancy across views to suppress false positives.

Figure 4: Reward for zip-tie insertion. Cropping and image segmentation test whether the zip-tie strap passes through the zip-tie head. Two camera views are considered to prevent false positives.

Stage 2 — PIRE loop. Once the environment exposes a reset → execute → verify API, the agent enters a closed loop:

Policy Improvement (PI): agent edits training infrastructure, hyperparameters, or policy code.
Rollout (R): one or many physical robots execute the candidate policy in parallel; each rollout returns a verifier reward and trajectory logs.
Evolution (E): the agent ingests logs, consults literature via tool calls, diagnoses failure modes, and emits the next code revision.
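The loop above can be sketched in a few lines. This is a toy stand-in, not the paper's implementation: the `Environment` protocol, the `ToyEnv` simulator, the scalar `param` standing in for policy code, and the `propose` callable standing in for the coding agent are all assumptions, and the accept-if-better rule is a simple hill-climb chosen for illustration.

```python
import random
from typing import Callable, List, Protocol, Tuple

History = List[Tuple[float, float]]  # (policy parameter, measured success rate)


class Environment(Protocol):
    def reset(self) -> None: ...
    def execute(self, param: float) -> float: ...
    def verify(self, outcome: float) -> float: ...


class ToyEnv:
    """Stand-in for a robot cell: noisy outcome, binary verifier."""

    def __init__(self, target: float, tol: float, noise: float, seed: int = 0):
        self.target, self.tol, self.noise = target, tol, noise
        self.rng = random.Random(seed)
        self.resets = 0

    def reset(self) -> None:
        self.resets += 1

    def execute(self, param: float) -> float:
        return param + self.rng.gauss(0.0, self.noise)

    def verify(self, outcome: float) -> float:
        return 1.0 if abs(outcome - self.target) <= self.tol else 0.0


def rollout(env: Environment, param: float, episodes: int) -> float:
    """R: run reset -> execute -> verify for each episode, return mean reward."""
    rewards = []
    for _ in range(episodes):
        env.reset()
        rewards.append(env.verify(env.execute(param)))
    return sum(rewards) / len(rewards)


def pire_loop(env: Environment, propose: Callable[[float, History], float],
              initial: float, iterations: int, episodes: int):
    best, best_rate = initial, rollout(env, initial, episodes)
    history: History = [(best, best_rate)]
    for _ in range(iterations):
        candidate = propose(best, history)       # PI: next revision from the agent
        rate = rollout(env, candidate, episodes)  # R: measure it on the environment
        history.append((candidate, rate))         # E: logs feed the next proposal
        if rate > best_rate:
            best, best_rate = candidate, rate
    return best, best_rate, history
```

In a real setting `propose` would be the coding agent reading logs and emitting a code revision, and each rollout would cost physical robot time rather than a function call.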

Figure 2: Overview of the ENPIRE physical autoresearch framework.

The agent is not restricted to a single learning paradigm. It can synthesize heuristic policies from perception/control tool calls, train behavior-cloning networks, run real-world RL, or compose these (e.g., a heuristic skeleton with a learned residual). Success is scored as completion within 8 sequential attempts per trial (indexed by k in the formula that follows). Each retry observes its predecessor’s failure, so the metric rewards in-context recovery, not just i.i.d. best-of-N precision:

Success = Pr[∃ k ≤ 8 : trial_k succeeds | trial_{<k} observed].
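Computed from logs, the metric reduces to "did any of the first 8 attempts succeed". The sketch below assumes each trial is logged as a sequence of per-attempt booleans, which is a hypothetical format; the helper names are also invented. The `iid_best_of_n` function is included only as a reference point for what the score would be if attempts were independent.

```python
from typing import List, Sequence

MAX_ATTEMPTS = 8  # attempt budget per trial, following k <= 8 in the formula


def trial_succeeded(attempts: Sequence[bool], max_attempts: int = MAX_ATTEMPTS) -> bool:
    """A trial counts as a success if any attempt within the budget succeeds."""
    return any(attempts[:max_attempts])


def success_rate(trials: List[Sequence[bool]], max_attempts: int = MAX_ATTEMPTS) -> float:
    """Fraction of trials that succeed within the attempt budget."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(trial_succeeded(t, max_attempts) for t in trials) / len(trials)


def iid_best_of_n(p: float, n: int = MAX_ATTEMPTS) -> float:
    """Reference value if attempts were independent with per-attempt success p."""
    return 1.0 - (1.0 - p) ** n
```

Comparing a measured `success_rate` against `iid_best_of_n` at the observed first-attempt rate gives a rough indication of whether later attempts benefit from seeing earlier failures.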

The hardware platform is a bimanual 6-DoF YAM robot. Tasks: Push-T (non-prehensile alignment), pin insertion into 4 mm holes, GPU-chip socket insertion, and zip-tie cutting with scissors.

Results

On simulated Gym-PushT, all three agents converge: Claude Code and Codex hit 95% success in ~2 hours of wall-clock autoresearch; Kimi Code reaches the same level in roughly twice the time.

The gap between simulation and reality is the headline finding. On the real Push-T setup, two of the three agents fail outright — non-deterministic contact friction, robot dynamics, and object micro-motion violate the low-variance hypothesis-testing regime that simulators provide. This argues that heuristic-only policy synthesis is brittle in the real world and motivates mixing gradient-based learning into the agent’s toolkit.

Figure 3: Benchmarking coding agents for physical autoresearch on Push-T and Pin Insertion. Adding robot workers reduces wall-clock time to a fixed success rate.

The second claim is resource scaling: when the rollout module dispatches across multiple physical robots in parallel, wall-clock time to reach a target success rate drops monotonically with worker count, while token spend on the agent side governs how quickly the search itself converges. The framework therefore exposes two orthogonal compute axes, physical throughput (robots) and reasoning throughput (tokens), both of which materially affect time-to-policy on Push-T and pin insertion.
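The robot-side axis can be illustrated with an idealized batching model. This is an assumption-laden sketch, not a result from the paper: it supposes a fixed number of rollouts is needed, that every rollout takes the same time, and that workers run in lockstep with no dispatch overhead.

```python
import math


def wall_clock_seconds(rollouts_needed: int, workers: int,
                       seconds_per_rollout: float) -> float:
    """Idealized time to finish a fixed rollout budget on parallel robots."""
    if rollouts_needed < 0 or workers < 1 or seconds_per_rollout < 0:
        raise ValueError("invalid rollout budget, worker count, or duration")
    batches = math.ceil(rollouts_needed / workers)  # rounds of parallel rollouts
    return batches * seconds_per_rollout
```

Under these assumptions the time never increases as workers are added, but the gain flattens once the worker count approaches the number of rollouts per iteration; real deployments would also pay reset and coordination costs that this model ignores.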