Daily AI Digest — 2026-05-07

Published

May 7, 2026

arXiv Highlights

Co-Evolving Policy Distillation

Consolidating multiple reasoning capabilities (text, image, video) into a single policy under reinforcement learning with verifiable rewards (RLVR) exposes a structural tension: joint training induces gradient conflict, while sequential expert-then-distill pipelines hit a behavioral-pattern mismatch between teacher and student. This paper formalizes both failure modes and proposes a co-evolutionary alternative in which experts train in parallel and distill into each other on the fly.

Unified accounting of capability loss

The authors model any consolidation paradigm \mathcal{P} by

U_{\mathcal{P}} \approx a_{\mathcal{P}} \cdot X(D_1, D_2) + b_{\mathcal{P}},

where X is the total optimization signal across capability datasets, a_{\mathcal{P}} \in [0,1] measures conversion efficiency, and b_{\mathcal{P}} \le 0 captures additional loss. Mixed-data RLVR sets a_{\text{mix}} = 1 but pays a divergence cost: per-step gradients from D_1 and D_2 disagree on capability-specific dimensions, yielding

U_{\text{mix}} \approx X(D_1,D_2) - \Phi(D_1,D_2),\qquad \Phi>0.

This is the familiar seesaw effect: gains on one capability are partially canceled by interference, regardless of mixing ratio. Static on-policy distillation (OPD), in which a student is distilled from already-trained experts, avoids \Phi but suffers from a_{\mathcal{P}} < 1: once experts have diverged, the student’s on-policy rollouts no longer share the teacher’s behavioral support, so token-level supervision is poorly absorbed.
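To make the accounting concrete, the following minimal sketch plugs numbers into U ≈ a·X + b for mixed RLVR, static OPD, and the ideal case. The values of X, Φ, and the static-OPD efficiency are made up for illustration and do not come from the paper. The function name `utility` is hypothetical, and setting b = 0 for static OPD is a simplifying assumption.

```python
def utility(x: float, a: float = 1.0, b: float = 0.0) -> float:
    """Evaluate U_P ≈ a_P * X + b_P with a_P in [0, 1] and b_P <= 0."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    if b > 0.0:
        raise ValueError("b must be non-positive")
    return a * x + b


# Illustrative values only (assumptions, not results from the paper).
X = 10.0         # total optimization signal X(D_1, D_2)
PHI = 3.0        # divergence cost Phi paid by mixed-data RLVR
A_STATIC = 0.6   # conversion efficiency of static OPD (a < 1)

u_mix = utility(X, a=1.0, b=-PHI)          # full conversion, pays Phi
u_static = utility(X, a=A_STATIC, b=0.0)   # no Phi, but lossy conversion
u_ideal = utility(X, a=1.0, b=0.0)         # upper bound: a = 1 and b = 0
```

With these numbers both baselines fall short of the upper bound for different reasons, which is the gap a method with high a and small |b| would need to close.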

Figure 1: CoPD addresses the limitations of mixed-data RLVR (a) and static OPD (b) by letting two branches co-evolve as teachers and students across domains (c), achieving the best overall performance.

Top-k overlap as a behavioral-similarity proxy

The pilot study operationalizes “behavioral pattern gap” as the top-k token overlap between teacher and student distributions on shared rollouts. Two empirical facts emerge: (i) the gain from a fixed OPD step grows roughly linearly with top-k overlap to the teacher, and (ii) standard branch-specific RLVR drives this overlap down over time. The implication is that delaying distillation until experts are mature is precisely the wrong schedule — by then the very property that makes OPD effective has been eroded.

Figure 2: post-OPD gain grows with teacher-student top-k overlap, while standard RLVR training pushes overlap in the opposite direction.
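One plausible way to compute such an overlap score is sketched below with NumPy. The paper's exact definition is not given in the excerpt, so the per-position set intersection, the averaging over positions, and the function name `topk_overlap` are all assumptions.

```python
import numpy as np


def topk_overlap(teacher_logits: np.ndarray, student_logits: np.ndarray, k: int) -> float:
    """Mean fraction of shared top-k token ids per position.

    Both inputs have shape (seq_len, vocab_size) and are scored on the
    same rollout. Returns a value in [0, 1].
    """
    if teacher_logits.shape != student_logits.shape:
        raise ValueError("teacher and student logits must have the same shape")
    if not 1 <= k <= teacher_logits.shape[-1]:
        raise ValueError("k must be between 1 and vocab_size")
    # argpartition puts the indices of the k largest entries in the last k slots.
    top_t = np.argpartition(teacher_logits, -k, axis=-1)[:, -k:]
    top_s = np.argpartition(student_logits, -k, axis=-1)[:, -k:]
    shared = [len(set(t.tolist()) & set(s.tolist())) for t, s in zip(top_t, top_s)]
    return float(np.mean(shared)) / k


# Toy example: two positions, vocabulary of four tokens.
teacher = np.array([[0.0, 1.0, 2.0, 3.0], [3.0, 2.0, 1.0, 0.0]])
student = np.array([[0.0, 1.0, 3.0, 2.0], [0.0, 1.0, 2.0, 3.0]])
overlap = topk_overlap(teacher, student, k=2)
```

In the toy example the top-2 sets agree fully at the first position and not at all at the second, so the score is one half.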

CoPD: alternating RLVR and bidirectional OPD

CoPD initializes K branches from the same \pi_0 and alternates two phases.

Branch-specific RLVR. Each branch k runs GRPO on its own dataset \mathcal{D}_k with verifiable reward r_k:

\mathcal{L}_{\text{RLVR}}^{(k)}(\theta_k) = \mathbb{E}_{x\sim \mathcal{D}_k}\!\left[\tfrac{1}{G}\sum_i \tfrac{1}{|y_i|}\sum_t \min\!\big(\rho_{i,t}^{(k)} \hat A_i^{\text{RL}}, \text{clip}(\rho_{i,t}^{(k)},1\!-\!\epsilon,1\!+\!\epsilon)\hat A_i^{\text{RL}}\big)\right].

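Here \rho_{i,t}^{(k)} is the token-level importance ratio between the current and rollout policies of branch k, \hat A_i^{\text{RL}} is the advantage of response i, G is the number of responses sampled per prompt, and \epsilon is the clipping range. Because each branch optimizes only on its own data, this phase opens a knowledge gap between branches.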
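The clipped objective above can be sketched in a few lines of NumPy for a single prompt with G sampled responses. The advantage here is the reward standardized within the group, which is the usual GRPO choice but an assumption about this paper. The function name `grpo_surrogate` and the default ε = 0.2 are also hypothetical.

```python
import numpy as np


def grpo_surrogate(logp_new, logp_old, rewards, eps=0.2):
    """Clipped surrogate (to be maximized) for one prompt.

    logp_new, logp_old: lists of 1-D arrays with per-token log-probs of
        response i under the current and rollout policies.
    rewards: length-G sequence of verifiable rewards, one per response.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Group-relative advantage: standardize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    per_response = []
    for lp_new, lp_old, a in zip(logp_new, logp_old, adv):
        rho = np.exp(np.asarray(lp_new) - np.asarray(lp_old))  # token-level ratio
        unclipped = rho * a
        clipped = np.clip(rho, 1.0 - eps, 1.0 + eps) * a
        # Inner average over tokens corresponds to (1/|y_i|) * sum_t.
        per_response.append(np.minimum(unclipped, clipped).mean())
    # Outer average over responses corresponds to (1/G) * sum_i.
    return float(np.mean(per_response))


# Two responses of two tokens each; the first one was rewarded.
old = [np.log([0.5, 0.5]), np.log([0.5, 0.5])]
new = [np.log([1.0, 1.0]), np.log([0.5, 0.5])]  # ratio 2 on the rewarded response
value_on_policy = grpo_surrogate(old, old, rewards=[1.0, 0.0])
value_clipped = grpo_surrogate(new, old, rewards=[1.0, 0.0])
```

When the current and rollout policies coincide, every ratio is 1 and the surrogate reduces to the mean advantage, which is zero. In the second call the ratio of 2 on the rewarded response is clipped to 1 + ε before it contributes.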

Mutual on-policy distillation. Each branch then generates on-policy rollouts and receives token-level supervision from the other branch — distillation is bidirectional, and student samples come from the student itself, so the high-overlap regime in which OPD is effective is preserved. Because RLVR and OPD are interleaved at short intervals, branches never drift far enough behaviorally for transfer to break down.

Figure 3: An overview of CoPD with two co-evolving branches alternating RLVR and mutual OPD.
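The alternation can be sketched as a plain control loop. This is a schematic under stated assumptions, not the authors' implementation: `copd_schedule`, the two callbacks, and the step counts per cycle are hypothetical, and the excerpt does not say whether teachers are snapshotted before the OPD phase (here they are used live).

```python
from typing import Callable, List, Sequence


def copd_schedule(
    branches: List[str],
    rlvr_step: Callable[[str], None],
    opd_step: Callable[[str, Sequence[str]], None],
    num_cycles: int,
    rlvr_steps_per_cycle: int,
    opd_steps_per_cycle: int,
) -> None:
    """Alternate branch-specific RLVR with mutual on-policy distillation."""
    for _ in range(num_cycles):
        # Phase 1: every branch trains only on its own dataset and reward.
        for branch in branches:
            for _ in range(rlvr_steps_per_cycle):
                rlvr_step(branch)
        # Phase 2: every branch is a student of all the other branches.
        for student in branches:
            teachers = [b for b in branches if b != student]
            for _ in range(opd_steps_per_cycle):
                opd_step(student, teachers)


# Stub callbacks that only record the order of operations.
log = []
copd_schedule(
    branches=["text", "image"],
    rlvr_step=lambda b: log.append(("rlvr", b)),
    opd_step=lambda s, ts: log.append(("opd", s, tuple(ts))),
    num_cycles=2,
    rlvr_steps_per_cycle=2,
    opd_steps_per_cycle=1,
)
```

Keeping `rlvr_steps_per_cycle` small is what the text above describes as interleaving at short intervals. In a real system the callbacks would wrap a GRPO update and a token-level distillation update on the student's own rollouts.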

Results

Experiments use Qwen3-VL-4B-Instruct, with text data from Polaris-53K, image data from MMFineReason-123K, and (for the three-branch run) 40K filtered video samples. Image reasoning is evaluated on seven benchmarks (MMMU, MMMU-Pro, MathVista, MathVision, ZeroBench, WeMath, MathVerse); text on AIME24/25, HMMT25, MATH-500, Minerva; video on Video-Holmes, MVBench, MMVU, VideoMathQA.

The paper reports that CoPD outperforms the domain-specific Text-Expert and Image-Expert, mixed RLVR, and both directions of static OPD (V \to T and T \to V). In the three-branch setting CoPD beats MOPD (multi-teacher distillation into a single student). Notably the unified CoPD model surpasses the single-domain experts on their own benchmarks, which is evidence that co-evolution provides positive transfer rather than merely minimizing interference. (The abstract emphasizes “significantly outperforming” baselines; the experimental section enumerates the benchmarks but the numerical tables themselves were not included in the excerpt.)

Limitations and open questions

Several issues remain. First, CoPD’s compute scales roughly linearly with the number of branches during the RLVR phase plus an OPD cost; the paper does not report a tightly controlled FLOPs-matched comparison against mixed RLVR. Second, the top-k overlap indicator is an empirical proxy and the authors do not provide a tight theoretical link between overlap and OPD gain beyond a linear fit. Third, the synchronization schedule (how frequently to interleave RLVR and OPD) is a hyperparameter whose sensitivity is not characterized in the excerpt. Fourth, scalability beyond three branches, and to capabilities with very asymmetric data quality or reward sparsity, is untested. Finally, the framework assumes verifiable rewards on each capability — extending to mixed verifiable/preference settings is open.

Why this matters

CoPD reframes multi-capability post-training from a static “merge or distill” problem into a dynamic co-training problem, with a concrete behavioral-similarity diagnostic (top-k overlap) that explains when distillation works. The parallel-branch pattern is a plausible template for scaling RLVR to many domains without paying the gradient-conflict tax of joint optimization.

Source: https://arxiv.org/abs/2604.27083

Efficient Training on Multiple Consumer GPUs with RoundPipe

Problem

Fine-tuning LLMs on consumer GPU servers (e.g., 8×RTX 4090, 24 GB each, PCIe 4.0 at 32 GB/s, no NVLink) is attractive economically but mechanically painful: VRAM is tight and inter-GPU bandwidth is roughly six times lower than datacenter NVLink (200 GB/s). The standard recipe is pipeline parallelism (PP) with CPU offloading of weights, gradients, optimizer states, and activations, since PP keeps inter-GPU traffic to small activation tensors at stage boundaries.

The catch is what the authors call the weight binding issue. Existing PP schedules (1F1B, ZB-H1, Looped BFS, etc.) require that the model be split into S = vN stages on N GPUs, i.e., v stages per GPU, and that each stage’s weights live permanently on one GPU. Real LLMs are not uniform: the embedding/LM head stage is much heavier than a single transformer block. Forcing S to be a multiple of N produces imbalance bubbles, because some stages are then necessarily heavier than others; allowing S to be arbitrary (e.g., 13 stages on 4 GPUs) instead produces structural bubbles because GPUs hosting fewer stages must idle waiting on the heavy GPU. Either way, throughput is bounded by the slowest device.

Figure: Looped BFS vs. RoundPipe schedule for a 12-layer model + LM head on 4 GPUs.
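A back-of-the-envelope calculation illustrates both kinds of bubble for the 12-layer model plus LM head on 4 GPUs. It is a steady-state approximation that only compares total work per GPU and ignores pipeline fill and drain. The per-block cost of 1.0, the LM head cost of 2.0, and both function names are assumptions for illustration.

```python
def bound_loads(stage_times, num_gpus):
    """Per-GPU work per microbatch when stage i is pinned to GPU i % num_gpus."""
    loads = [0.0] * num_gpus
    for i, t in enumerate(stage_times):
        loads[i % num_gpus] += t
    return loads


def idle_fraction(loads):
    """Share of device time wasted if every GPU waits for the busiest one."""
    return 1.0 - sum(loads) / (len(loads) * max(loads))


BLOCK, HEAD = 1.0, 2.0  # illustrative relative costs, not measurements

# S = 4 (a multiple of N): three blocks per stage, LM head folded into the last.
imbalanced = bound_loads([3 * BLOCK, 3 * BLOCK, 3 * BLOCK, 3 * BLOCK + HEAD], 4)

# S = 13 (arbitrary): one block per stage plus the LM head as its own stage.
structural = bound_loads([BLOCK] * 12 + [HEAD], 4)

idle_imbalanced = idle_fraction(imbalanced)   # loads [3, 3, 3, 5]
idle_structural = idle_fraction(structural)   # loads [5, 3, 3, 3]
```

Under these assumed costs both partitions leave the same GPU-time idle, because in each case one device carries 5 units of work while the others carry 3.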

Method

The key insight is that with full CPU offloading, weights are not pinned to any GPU — they are streamed in per microbatch anyway. RoundPipe therefore treats GPUs as a stateless pool of execution workers and dispatches stages round-robin across devices. A stage’s forward and backward can land on different GPUs in different rounds; the only state a GPU holds long-term is the activations/gradients it produced in flight.

Concretely, with S stages and N GPUs (no requirement that N \mid S), RoundPipe runs the pipeline in \lceil S/N \rceil rounds per microbatch wave. Asymmetric splitting lets stage sizes track real layer cost (the LM head can be its own stage), so per-stage execution time is balanced even when the model is not. Round-robin dispatch ensures every GPU sees roughly the same total work over a wave, eliminating the structural bubble that flexible partitioning would otherwise create.

Figure: Bubble ratio of Looped schedules under ideal vs. real-world partitions on 8 GPUs.
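The balancing effect of round-robin dispatch can be illustrated with a toy work counter. The sketch assigns (microbatch, stage) tasks to GPUs in arrival order, which is one simple reading of "round-robin" and not necessarily RoundPipe's actual scheduler. It ignores dependencies, transfers, and the backward pass, and the stage costs are illustrative assumptions.

```python
from math import ceil


def round_robin_loads(stage_times, num_gpus, num_microbatches):
    """Dispatch (microbatch, stage) tasks to GPUs in order; return loads and placement."""
    loads = [0.0] * num_gpus
    placement = {}
    task = 0
    for mb in range(num_microbatches):
        for stage, t in enumerate(stage_times):
            gpu = task % num_gpus  # next worker in the stateless pool
            placement[(mb, stage)] = gpu
            loads[gpu] += t
            task += 1
    return loads, placement


# 12 transformer blocks (cost 1.0) plus an LM head stage (assumed cost 2.0).
stage_times = [1.0] * 12 + [2.0]
loads, placement = round_robin_loads(stage_times, num_gpus=4, num_microbatches=4)
rounds_per_microbatch = ceil(len(stage_times) / 4)  # ceil(S / N)
head_gpus = [placement[(mb, 12)] for mb in range(4)]
```

Over a wave of four microbatches the heavy LM head stage visits each GPU once, so all four devices accumulate the same total work even though 13 is not a multiple of 4.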

The system is built as a single-controller framework in the style of Ray and veRL/HybridFlow. The user writes ordinary sequential code calling forward_backward() and an optimizer step; the controller (running in the user’s thread) constructs microbatches, assigns stages to GPU workers in round-robin order, and tracks dependencies. GPU workers execute asynchronously, and a separate optimizer worker performs asynchronous parameter updates on the host. Four subsystems make this work in practice:

Priority-aware transfer scheduling. PCIe is the bottleneck. Each microbatch needs (a) weights pulled from host to GPU and (b) activations either recomputed or reloaded. RoundPipe orders these transfers by criticality so that the next stage’s weights arrive before its compute starts, overlapping H2D copies with compute. The recompute-vs-reload tradeoff per layer is decided from the model in Figure 2.

Figure: Theoretical time of recomputing vs. reloading activations of a transformer layer.

Fine-grained event-based synchronization. Because a stage’s forward and backward can run on different GPUs across rounds, naive barrier sync would erase the gain. RoundPipe uses distributed CUDA events keyed per (microbatch, stage, direction) so that producers and consumers synchronize only on the specific tensors they share.

Parameter consistency under async optimizer updates. Since the optimizer runs concurrently on the host, RoundPipe must guarantee that all microbatches in a step see the same weights and that gradient accumulation is correct. This is handled by versioning weight buffers and gating the optimizer step on completion of the relevant gradient writes.

Automated layer partitioning. A profiler measures per-layer forward/backward time and memory, then solves for an asymmetric partition into S stages that minimizes the maximum stage time subject to VRAM constraints. S is chosen freely (not constrained to vN), which is what makes the LM head expressible as its own stage.
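The partitioning problem described above can be sketched as a classic min-max contiguous partition solved by dynamic programming. The excerpt does not say which solver RoundPipe uses, so this is a stand-in technique under stated assumptions: layers stay in order, a single per-stage memory cap stands in for the VRAM constraint, and the function name and profile numbers are hypothetical.

```python
from typing import List, Optional, Sequence


def partition_layers(
    times: Sequence[float],
    mems: Sequence[float],
    num_stages: int,
    mem_cap: float,
) -> Optional[List[int]]:
    """Split layers into contiguous stages, minimizing the slowest stage.

    Returns the exclusive end index of each stage, or None if no
    partition satisfies the per-stage memory cap.
    """
    n = len(times)
    inf = float("inf")
    # Prefix sums give the time and memory of any contiguous range in O(1).
    pt, pm = [0.0], [0.0]
    for t, m in zip(times, mems):
        pt.append(pt[-1] + t)
        pm.append(pm[-1] + m)
    # best[s][i]: smallest achievable max stage time for the first i layers in s stages.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[-1] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0
    for s in range(1, num_stages + 1):
        for i in range(s, n + 1):
            for j in range(s - 1, i):  # last stage covers layers j .. i-1
                if pm[i] - pm[j] > mem_cap:
                    continue
                cand = max(best[s - 1][j], pt[i] - pt[j])
                if cand < best[s][i]:
                    best[s][i], cut[s][i] = cand, j
    if best[num_stages][n] == inf:
        return None
    # Walk the recorded cut points backwards to recover stage boundaries.
    bounds, i = [], n
    for s in range(num_stages, 0, -1):
        bounds.append(i)
        i = cut[s][i]
    return bounds[::-1]


# Hypothetical profile: 12 blocks of cost 1.0 and an LM head of cost 3.0.
times = [1.0] * 12 + [3.0]
mems = [1.0] * 13
stages = partition_layers(times, mems, num_stages=5, mem_cap=4.0)
```

With this profile the solver returns five stages of equal cost, with the LM head alone in the last one. That asymmetric split is only expressible because the number of stages is not tied to a multiple of the GPU count.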