Daily AI Digest — 2026-05-27

Published

May 27, 2026

arXiv Highlights

SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

Spatial foundation models (SFMs) — DUSt3R, MASt3R, VGGT, \pi^3, MapAnything, and the Depth-Anything family — are typically reported on narrow per-domain benchmarks (ScanNet for indoor, KITTI for driving, Sintel for flow), often with non-deterministic frame sampling. This makes cross-paper comparisons unreliable and obscures how well a single model generalizes across viewpoints (egocentric, wrist), dynamics (static vs. moving scenes), input density (1 to hundreds of frames), and hardware budgets. SpatialBench addresses this by unifying 19 datasets and 546 scenes into a single deterministic protocol and evaluating 41 models across 6 paradigms on 5 task suites under 4 input-density regimes.

Benchmark construction

All raw data is normalized into a per-scene tuple of RGB, metric depth D, camera-to-world poses T_{cw}, and intrinsics K. The key design choice is that each (scene, view-density) pair is committed to a JSON record specifying exact frame indices. This decouples ingestion from evaluation: every method consumes identical inputs, so AbsRel, AUC@30, ATE, and F-score are directly comparable across runs and papers.

Overview of SpatialBench scene categories and per-scene frame counts.

The 19 sources span a four-axis taxonomy: environment (indoor/outdoor), dynamics (static/dynamic), viewpoint (normal/egocentric/wrist), and origin (real/synthetic). Static-real comes from 7-Scenes, DTU, NRGBD, ScanNet++, Tanks & Temples, and ETH3D; dynamic-real from TUM-Dynamic, DROID, Xperience, Waymo, KITTI-Odometry; dynamic-synthetic from ADT, RLBench/Colosseum, RoboTwin, Robolab, Virtual KITTI 2, and OmniWorld-Game. Four density regimes — single-frame, sparse, medium, dense — are reported, with the dense regime triggering OOM (>140 GB) or timeout (>4 h/scene) on most large transformers.

DROID, a wrist-view robot dataset lacking clean ground-truth geometry, is rebuilt with a dedicated pipeline: stereo depth from S^2M^2 with confidence filtering, initial pose from MapAnything, gripper/contact masks via SAM3, and bundle adjustment refining poses against the masked RGBs.

DROID curation: stereo depth, initial pose, SAM3 masks, then BA refinement.

Depth-Anything-Next and DA-Next-5M

The benchmark exposes a clear gap: existing SFMs perform poorly on egocentric and wrist views, which dominate embodied applications and feature ultra-close-range capture, heavy occlusion, and aggressive ego-motion. To plug this, the authors release DA-Next-5M — 5.5M frames over 22K scenes, mostly egocentric/wrist — with metric depth, intrinsics, and extrinsics. Simulation portions use domain randomization over background, object scale, color, and wrist camera placement.

Sample assets and episodes from DA-Next-5M.

Depth-Anything-Next (DA-Next) is then trained on this corpus (architecture details deferred to the paper) and integrated into the same evaluation harness.

Results

Table 1 reports AbsRel (depth), AUC@30 (pose), ATE (trajectory), and F-score (geometry) across the four density regimes. Numbers worth fixing in mind:

Single-frame AbsRel: DA-Next reaches 0.166, a 54.9% reduction over the next end-to-end feed-forward entry. VGGT sits at 0.184, FastVGGT at 0.183, OmniVGGT at 0.188; the DA3 family ranges 0.333–0.385; optimization-based DUSt3R/MASt3R are 0.385/0.456.
Sparse AbsRel: DA-Next 0.050 (−47.4% relative), VGGT-Omega 0.077, \pi^3-X 0.084, AMB3R 0.088, \pi^3 0.092, DA3-Giant 0.095. AUC@30 in sparse: DA-Next 0.809, VGGT-Omega 0.803, DA3-Giant 0.785, DA3-Nested 0.779, \pi^3 0.742.
Medium AbsRel: DA-Next 0.035 (−59.3%), VGGT-Omega 0.067, \pi^3-X 0.078, \pi^3 0.082, AMB3R 0.085. DA-Next ATE 1.442 is +24.2% worse than the best (\pi^3-X at 0.369), revealing a depth-vs-pose tradeoff: the model is strongly tuned for metric depth but its camera-pose head trails geometry-centric architectures.
Dense regime: nearly every >1 GB transformer (VGGT, MapAnything, OmniVGGT, \pi^3-X, AMB3R, DA3-Large/Giant/Nested, DA-Next) hits OOM at >140 GB. Only Fast3R, FastVGGT, \pi^3, DA3-Small/Base, and the streaming/online methods (Spann3r, CUT3R, Point3R, Stream3R, StreamVGGT) survive. Among survivors at dense, FastVGGT achieves AbsRel 0.130 / AUC@30 0.627 / ATE 9.984 / F 0.527; \pi^3 posts 0.190 / 0.672 / 8.478 / 0.491.
Latency (sparse, per-sequence): MapAnything and OmniVGGT 0.22 s, \pi^3 0.20 s, FastVGGT 0.24 s, VGGT 0.40 s, DA-Next 0.50 s, MonST3R 20.81 s.

Online methods (Spann3r, CUT3R, MonST3R, Point3R, Stream3R, StreamVGGT) are the only ones that scale to dense inputs cleanly but pay a steep accuracy cost: e.g. CUT3R averages AbsRel 0.223 vs. DA-Next sparse 0.050.

Limitations and open questions

The benchmark equates “all-round” with strong performance under shifting density and domain, but the dense regime is currently a memory contest more than a methodological one — most SOTA models simply cannot run, so the dense leaderboard rewards architectural compactness rather than reasoning quality. DA-Next’s pose/ATE regression vs. \pi^3-X suggests its training mixture biases representations toward depth at the cost of multi-view geometric consistency. SpatialBench also relies on S^2M^2 + BA pseudo-GT for DROID, which puts an upper bound on wrist-view depth evaluation fidelity. Finally, although 41 models are evaluated, the protocol only covers reconstruction and pose; downstream tasks such as planning, manipulation success, or novel-view synthesis quality are out of scope.

Why this matters

Deterministic, cross-paradigm evaluation exposes that no current SFM is uniformly best: small models scale to dense inputs but trail on accuracy, large transformers OOM past medium density, and depth-strong models can be pose-weak. SpatialBench plus DA-Next-5M give the field a fixed yardstick and a missing data regime (egocentric/wrist) for embodied 3D perception.

Source: https://arxiv.org/abs/2605.27367

The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence

MiniMax-M2 is a sparse MoE family explicitly designed for long-horizon agentic deployment rather than static QA. The flagship has 229.9B total parameters with 9.8B activated per token, and the design thesis is that a small activation footprint, paired with agent-native data pipelines and an RL system tuned for tool-using trajectories, can match dense frontier models on the workloads that actually matter in production: coding agents, deep search, and office task automation.

Performance of MiniMax-M2.7 versus closed-weight frontier baselines.

Architecture

M2 is a 62-layer decoder-only Transformer with hidden dimension 3,072 and a 200,064-token vocabulary. Attention uses GQA with 48 query heads and 8 KV heads, full attention at every layer, and RoPE; the team explicitly abandons the hybrid (linear/softmax) attention of MiniMax-Text-01 in favor of full attention at scale. The MoE FFN has 256 fine-grained experts with 8 active per token, sigmoid gating with learnable expert biases, and an auxiliary-loss-free load-balancing scheme following Wang et al. 2024. A Multi-Token Prediction head is trained alongside next-token prediction and later expanded by weight copying to support multi-step speculative decoding.

Pre-training uses 29.2T tokens with up-sampled code, math, and STEM. Context is extended in stages 8K \to 32K \to 192K, with a 9.3T-token decay phase mixing short-text decay data, naturally long PDFs, concatenated code, and thematically packed documents.

Agentic data pipelines

The post-training corpus is built around verifiable agentic trajectories in three coding regimes (SWE, AppDev, terminal) plus cowork tasks. The SWE pipeline mines permissively-licensed GitHub PRs, then runs an agent-driven loop to synthesize per-PR Docker environments — particularly important for compiled languages (Java/Go/Rust/C++) where toolchain coordination is brittle. PRs are tagged by intent (bug fix, feature, refactor, perf, test) so that distinct reward functions can be applied.

Agentic coding data pipelines for SWE and AppDev tasks.

For bug fixes, validity requires a golden patch to satisfy both Fail-to-Pass (F2P) and Pass-to-Pass (P2P) tests; the model agent must then reproduce that, with P2P guarding against regressions. For feature additions, where new tests reference new code, the pipeline shifts away from F2P/P2P toward executable artifact equivalence.

Interleaved thinking and RL

Trajectories are modeled as interleaved sequences

\tau = (r_1, a_1, o_1, r_2, a_2, o_2, \ldots, r_T, a_T, o_T)

where each reasoning block r_t is conditioned on the full prior history. Crucially, the assistant message kept in history at turn t+1 retains the thinking block:

\mathcal{H}_{t+1} = \mathcal{H}_t \oplus [\mathrm{assistant}(r_t, a_t)] \oplus [\mathrm{tool}(o_t)],

versus the dropped variant \mathcal{H}_{t+1}^{(\text{drop})} = \mathcal{H}_t \oplus [\mathrm{assistant}(a_t)] \oplus [\mathrm{tool}(o_t)] that forces re-derivation of state every turn. Ablations show persistence yields the largest gains exactly where it should — multi-step deep search and SWE — consistent with the Plan–Act–Reflect framing.

The RL system, Forge, treats the LLM as the policy and externalizes context management, memory, and state transitions into the environment. Production engineering items called out: windowed-FIFO scheduling for variable-length agent rollouts, prefix-tree merging to share KV across branched trajectories, and a strict decoupling of training, inference, and agent processes so the same trainer can drive white-box (logit-level) and black-box (API-only) agents. SFT data is built by large-scale rejection sampling against domain rewards, producing interleaved-thinking traces in chat, reasoning, code, and cowork.

Evaluation

M2.7 is benchmarked against Claude Opus 4.6, Claude Sonnet 4.6, GPT 5.4, and Gemini 3.1 Pro, all in maximal reasoning configurations. The evaluation suite is deliberately skewed toward environment-grounded benchmarks: SWE-bench Pro, SWE-bench Multilingual, Multi-SWE-bench, NL2Repo, Terminal-Bench 2.0, MLE Bench Lite for coding; VIBE-Pro and HyperTask for app dev; BrowseComp, Wide Search, and RISE for deep research; GDPval-AA, Toolathlon, MEWC v2, and Finance Modeling Pro for office work; and AIME 2026, GPQA-Diamond, SciCode, IFBench, AA-LCR, HLE, and MMLU-Pro for reasoning and knowledge. Coding agents share a Claude Code scaffold (CodeX for GPT 5.4), with 4-trial averaging; Terminal-Bench runs in an 8 vCPU / 16 GB sandbox with a 2 hour wall-clock cap under Terminus-2. With ~10B activated parameters, M2.7 tracks frontier closed-weight systems across these blocks (Figure 1), and the within-series gap from M2.5 to M2.7 is reported to isolate the contribution of the latest data and RL iterations.

Limitations and open questions

The paper foregrounds several internal benchmarks (NL2Repo, VIBE-Pro, HyperTask, MM Claw, MEWC v2, Finance Modeling Pro, RISE), which limits external reproducibility of the cowork claims. Self-evolution at M2.7 — autonomously debugging training runs and editing its own scaffold — is presented as an early step rather than a measured capability with isolated ablations. The decision to drop hybrid attention in favor of full attention at 192K is justified empirically but not analyzed against linear-attention baselines at matched compute. Finally, the activation count headline (9.8B) understates serving cost: 229.9B total parameters still dominate memory footprint and routing overhead.

Why this matters

If the numbers hold under independent replication, M2 demonstrates that agent-grade frontier capability is reachable at ~10B activated parameters when the full stack — data pipelines, interleaved-thinking SFT, and a trajectory-aware RL system — is co-designed for tool use. That reframes the cost frontier for deployable agents away from raw parameter count toward environment and rollout infrastructure.

Source: https://arxiv.org/abs/2605.26494

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

Problem

Frame-sampling video MLLMs allocate observation budget by elapsed wall-clock time. This is wasteful when content is mostly static and catastrophic when discriminative evidence lives at sub-second event boundaries — temporal grounding, fine-grained motion classification, repeated-cycle counting. Uniform GoP-style sampling treats all temporal slots as equally informative, while in practice the compressed bit-stream already encodes a strong prior over where novelty (and therefore semantic event content) lives. LLaVA-OneVision-2 (LLaVA-OV-2) reorganizes the video tokenization step around this signal.

Roadmap from token compression to codec-aligned tokenization.

Method

The model is an 8B-class VLM built on the OneVision-Encoder backbone with windowed attention for native-resolution processing, a lightweight VL connector, and an autoregressive LM decoder. Three input modes — codec-stream video, uniformly sampled video, and static images — are mapped into a single visual-token interface. A shared 3D RoPE places I/P canvases, sampled frames, and image tokens in one spatiotemporal coordinate system, and group-visible attention masks define which tokens see each other (fixed 4-slot groups for sampled-frame/IPPP, single-temporal group for images, bit-cost-adaptive GoP ids for codec streams).

LLaVA-OneVision-2 architecture: codec, sampled-frame, and image inputs share the OneVision-Encoder under a unified visual-token interface.

The central contribution is codec-stream tokenization. Rather than decoding to RGB and sampling fixed-rate frames, the model consumes the compressed bitstream directly:

Adaptive GoPs from bit-cost. P/B packet bit-cost is used to partition the stream into variable-length GoPs. High-bit-cost intervals — those the codec found expensive to predict, i.e. high-novelty — receive their own GoP boundaries, concentrating tokens on event-bearing content.
Motion-residual spatial saliency. Within each GoP, motion vectors and residual energy jointly score 2{\times}2 patch blocks. High-score blocks are selected and packed into compact visual canvases: one anchor I-canvas per GoP plus several P-canvases carrying motion-residual evidence.
Group-aligned tokens. Each canvas’s tokens inherit a GoP id; group-visible masks let P-canvases attend to their anchor I-canvas, mirroring codec dependency structure.

Codec-stream tokenization: bit-cost gives adaptive GoPs; motion+residual scores select 2x2 blocks packed into I/P canvases.

Training uses a four-stage progressive recipe initialized from LLaVA-OneVision-1.5-8B, with each batch interleaving ~50% codec-patchified video, ~37.5% uniform chunk-wise video, and ~12.5% images. The corpora include the inherited 85M image–text mid-training set and 22M instruction set, FineVision (~24M), a newly released 30s-Video-Caption-4.2M, an 8M-clip / 104.1B-token re-captioned video pretraining mixture, and a 4M-sample spatial corpus covering 3D scenes, pointing, and referring expressions.