Daily AI Digest — 2026-05-08

Published: May 8, 2026

arXiv Highlights

Lightning Unified Video Editing via In-Context Sparse Attention

In-context learning (ICL) has become a unifying framework for video editing: the source video and a “context” (e.g., an edited reference frame or an instruction-conditioned exemplar) are concatenated and fed jointly through a video diffusion transformer. The price is quadratic attention over a doubled token budget. This paper introduces In-context Sparse Attention (ISA), a structured sparsification scheme that exploits the asymmetry between source and context tokens, and packages it into LIVEditor, a video editor post-trained from Wan 2.2.

Two observations driving the design

The authors decompose the ICL attention matrix into four blocks: Q^{\text{src}}(K^{\text{src}})^\top, Q^{\text{src}}(K^{\text{ctx}})^\top, Q^{\text{ctx}}(K^{\text{src}})^\top, Q^{\text{ctx}}(K^{\text{ctx}})^\top.
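A minimal NumPy sketch of how one might measure this four-block split on a single attention head. It assumes source tokens come first in the concatenated sequence and uses random tensors as stand-ins for real activations. The function name block_attention_mass and the choice of post-softmax mass as the statistic are illustrative assumptions, not the paper's measurement protocol.

```python
import numpy as np

def block_attention_mass(q, k, n_src):
    """Average softmax mass that source/context queries place on
    source/context keys. q, k: (N, d), with the n_src source tokens first."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)

    src, ctx = slice(0, n_src), slice(n_src, None)
    return {
        "src->src": probs[src, src].sum(axis=-1).mean(),
        "src->ctx": probs[src, ctx].sum(axis=-1).mean(),
        "ctx->src": probs[ctx, src].sum(axis=-1).mean(),
        "ctx->ctx": probs[ctx, ctx].sum(axis=-1).mean(),
    }

# Random stand-ins for one head's queries and keys (hypothetical sizes).
rng = np.random.default_rng(0)
n_src, n_ctx, d = 8, 8, 16
q = rng.standard_normal((n_src + n_ctx, d))
k = rng.standard_normal((n_src + n_ctx, d))
mass = block_attention_mass(q, k, n_src)
```

With random inputs the four masses come out roughly balanced. The asymmetry the authors report would only show up on activations from a trained ICL editor.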

Figure 4: distinct distributions of the four attention sub-blocks in ICL.

The source–source block dominates, and the gap to source–context grows with depth. This motivates aggressive pruning of context Keys/Values rather than uniform sparsification. The second observation is theoretical: the error of a 0-th order Taylor expansion of softmax attention correlates with the sharpness of the Query distribution. Sharp queries (peaked attention) tolerate sparse approximation poorly; flat queries can be approximated cheaply.

Method

ISA operates on tokens partitioned into query blocks of size L_Q and key/value blocks of size L_K, with compressed representations Q^c, K^c, V^c obtained by pooling each block along the sequence axis. The pipeline has three stages.
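The compression step can be sketched as follows. Mean pooling is an assumption here (the summary only says "sequence-axis pooling"), and the helper names pool_blocks and coarse_scores are hypothetical. The result has the shape B x H x N_Q x N_K used for the coarse score matrix in the stages below, where N_Q and N_K count blocks rather than tokens.

```python
import numpy as np

def pool_blocks(x, block):
    """Mean-pool along the sequence axis.
    x: (B, H, N, d) with N divisible by block -> (B, H, N // block, d)."""
    B, H, N, d = x.shape
    assert N % block == 0, "sequence length must be a multiple of the block size"
    return x.reshape(B, H, N // block, block, d).mean(axis=3)

def coarse_scores(q, k, l_q, l_k):
    """Block-level attention logits computed from pooled queries and keys."""
    q_c = pool_blocks(q, l_q)                        # (B, H, N_Q, d)
    k_c = pool_blocks(k, l_k)                        # (B, H, N_K, d)
    d = q.shape[-1]
    return q_c @ k_c.swapaxes(-1, -2) / np.sqrt(d)   # (B, H, N_Q, N_K)

# Hypothetical sizes: 32 query tokens, 64 key tokens, L_Q = 4, L_K = 8.
rng = np.random.default_rng(0)
q = rng.standard_normal((1, 2, 32, 8))
k = rng.standard_normal((1, 2, 64, 8))
s_coarse = coarse_scores(q, k, l_q=4, l_k=8)
```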

Figure 3: ISA workflow — context pre-selection, sharpness-based query grouping, dual-kernel execution.

1. Context pre-selection. From the coarse score matrix S_{\text{coarse}}\in\mathbb{R}^{B\times H\times N_Q\times N_K}, the slice corresponding to source-query × context-key entries, S^{\text{ctx}}_{\text{coarse}}, is averaged over the source-query axis and Top-k selected:

I_{\text{topk}} = \text{TopK}(\text{Mean}(S^{\text{ctx}}_{\text{coarse}}, \text{axis}=2), \text{axis}=2).

The retained context KV blocks are gathered and concatenated with the full source KV:

K_{\text{new}} = [K^{\text{src}}; \text{Gather}(K^{\text{ctx}}, I_{\text{topk}})], \quad V_{\text{new}} = [V^{\text{src}}; \text{Gather}(V^{\text{ctx}}, I_{\text{topk}})].

A hyperparameter \alpha_s (default 0.125) controls the fraction of context blocks retained — i.e., 87.5% of context KV is dropped before any attention is computed.
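A rough NumPy sketch of this pre-selection stage, under stated assumptions: keys and values are already stored per block with shape (B, H, num_blocks, L_K, d), the number of retained blocks is round(alpha_s * N_K_ctx), and the function name preselect_context is hypothetical. It mirrors the Mean, TopK, Gather, and concatenate steps above but is not the authors' kernel.

```python
import numpy as np

def preselect_context(s_ctx, k_src, v_src, k_ctx, v_ctx, alpha_s=0.125):
    """s_ctx: (B, H, N_Q_src, N_K_ctx) coarse scores, source-query x context-key.
    k_*, v_*: (B, H, num_blocks, L_K, d) block-partitioned keys and values."""
    n_ctx = s_ctx.shape[-1]
    keep = max(1, int(round(alpha_s * n_ctx)))        # assumed rounding rule
    saliency = s_ctx.mean(axis=2)                     # average over source queries
    idx = np.argsort(-saliency, axis=2)[..., :keep]   # Top-k context block indices
    gather = idx[..., None, None]                     # broadcast over (L_K, d)
    k_sel = np.take_along_axis(k_ctx, gather, axis=2)
    v_sel = np.take_along_axis(v_ctx, gather, axis=2)
    k_new = np.concatenate([k_src, k_sel], axis=2)    # full source KV + kept context
    v_new = np.concatenate([v_src, v_sel], axis=2)
    return idx, k_new, v_new

# Toy example: 16 source and 16 context KV blocks; blocks 3 and 7 are most salient.
B, H, n_src, n_ctx, L_K, d = 1, 2, 16, 16, 4, 8
rng = np.random.default_rng(0)
k_src, v_src = rng.standard_normal((2, B, H, n_src, L_K, d))
k_ctx, v_ctx = rng.standard_normal((2, B, H, n_ctx, L_K, d))
s_ctx = np.zeros((B, H, 6, n_ctx))
s_ctx[..., 3] = 5.0
s_ctx[..., 7] = 3.0
idx, k_new, v_new = preselect_context(s_ctx, k_src, v_src, k_ctx, v_ctx)
```

With alpha_s = 0.125 and 16 context blocks, only 2 context blocks survive, so the KV length seen by attention shrinks from 32 blocks to 18.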

2. Block-wise 0-th order Taylor sparse attention. For low-sharpness query blocks, ISA replaces softmax attention with a 0-th order Taylor approximation around a block-mean reference, which reduces to a cheap linear aggregation that the authors implement as a dedicated sparse kernel. The approximation error is bounded by the Query-block sharpness; the proof is in the appendix but the take-away is that flat queries are essentially free.
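The summary does not spell out the expansion, so the sketch below is one plausible reading, labeled as an assumption: inside each key block, every score q·k_j is replaced by the score at the block-mean key, which turns attention into a softmax over the pooled K^c followed by a weighted sum of the pooled V^c. The function taylor0_attention and the toy sizes are hypothetical. The sketch only illustrates the qualitative claim that flat queries lose little under such an approximation while sharp queries lose more.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def taylor0_attention(q, k, v, l_k):
    """Assumed 0-th order form: scores are constant within each key block
    (evaluated at the block-mean key), so attention runs over pooled K, V."""
    n, d = k.shape
    k_c = k.reshape(n // l_k, l_k, d).mean(axis=1)
    v_c = v.reshape(n // l_k, l_k, v.shape[-1]).mean(axis=1)
    return softmax(q @ k_c.T / np.sqrt(d)) @ v_c

rng = np.random.default_rng(0)
k = rng.standard_normal((64, 16))
v = rng.standard_normal((64, 16))
q_dir = rng.standard_normal((4, 16))
q_flat = 0.05 * q_dir    # small logits -> near-uniform (flat) attention
q_sharp = 4.0 * q_dir    # large logits -> peaked (sharp) attention

def rel_err(q, l_k=8):
    ref = full_attention(q, k, v)
    return np.linalg.norm(taylor0_attention(q, k, v, l_k) - ref) / np.linalg.norm(ref)

err_flat, err_sharp = rel_err(q_flat), rel_err(q_sharp)
print(f"flat queries:  relative error {err_flat:.3f}")
print(f"sharp queries: relative error {err_sharp:.3f}")
```

With a block size of 1 the approximation is exact, which gives a quick sanity check on the sketch.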

3. Dynamic query grouping. Per query block, sharpness is computed from the coarse scores; high-sharpness blocks (fraction \alpha_f=0.5) are routed to full FlashAttention-2, while the remainder go to the 0-th order Taylor kernel. A second ratio \alpha_{ns}=0.0625 further controls KV sparsity for the non-sharp group. This produces two execution paths whose total cost is dominated by the sparse kernel, with negligible overhead from selection and gather operations.
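The fixed-ratio routing can be sketched as below. The sharpness proxy (peak softmax probability of each query block's coarse score row) is an assumption, since the summary does not give the paper's exact definition, and group_query_blocks is a hypothetical helper. The \alpha_{ns} KV-sparsity ratio for the non-sharp group is left out because its exact selection rule is not described here.

```python
import numpy as np

def group_query_blocks(s_coarse, alpha_f=0.5):
    """s_coarse: (N_Q, N_K) coarse scores for one head.
    Returns (sharp_idx, flat_idx): query blocks for the full-attention path
    and for the 0-th order Taylor path, respectively."""
    z = s_coarse - s_coarse.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p = p / p.sum(axis=-1, keepdims=True)
    sharpness = p.max(axis=-1)                 # assumed proxy: peak probability
    n_sharp = int(round(alpha_f * s_coarse.shape[0]))
    order = np.argsort(-sharpness)             # sharpest query blocks first
    return np.sort(order[:n_sharp]), np.sort(order[n_sharp:])

# Toy scores: query blocks 0 and 2 are peaked, blocks 1 and 3 are flat.
s = np.zeros((4, 8))
s[0, 5] = 6.0
s[2, 1] = 4.0
sharp_idx, flat_idx = group_query_blocks(s, alpha_f=0.5)
```

Because the split is a fixed fraction rather than a threshold, exactly half of the query blocks go to the dense path here regardless of how sharp they are in absolute terms.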

Figure 2: ISA vs SDPA and FA2 — speedup grows with sequence length; the sparse and flat kernels dominate runtime.

LIVEditor and data

LIVEditor is built by post-training the high-noise branch of Wan 2.2 on a curated 1.7M-sample dataset. The pipeline uses Gemini 2.5 Flash for caption and instruction synthesis, Gemini 2.5 Image Preview to render edited initial frames, and propagates edits temporally via pose-guided TI2V for humans and attention injection for non-human subjects. Public sources (Ditto, LoVoRA, ReCo) supplement the non-human portion. Training uses two stages: 1.7M samples at lr 1e-5, then 0.089M high-quality samples at lr 1e-6, both with batch 16 under ZeRO-3 Offload. A consistent role assignment (synthetic frames as context, real frames as source) mitigates artifact leakage from synthetic data.

Results

LIVEditor surpasses prior state-of-the-art on EditVerseBench, IVE-Bench, and VIE-Bench. The headline efficiency claim is roughly 60% reduction in attention-module latency relative to dense attention, with Figure 2 showing the speedup over both SDPA and FA2 widening monotonically with sequence length — the regime that matters for ICL editing where token counts double. Ablations on EditVerseBench and FiVE-Bench show ISA matches or exceeds full attention quality at the chosen (\alpha_s, \alpha_{ns}, \alpha_f) = (0.125, 0.0625, 0.5), contradicting the usual quality–sparsity tradeoff and supporting the authors’ claim that context tokens are largely redundant.

Limitations and open questions

The pre-selection assumes context tokens are systematically lower-saliency than source tokens; this is validated for ICL editing but unlikely to hold for tasks where the context carries fine-grained spatial information that source frames lack (e.g., long-range identity reference, multi-shot consistency). The sharpness-based router uses fixed ratios rather than thresholds, so adaptation to varying scene complexity is coarse. The 0-th order Taylor approximation error bound is sharpness-dependent but not data-distribution-dependent — empirical equivalence to full attention may degrade outside the training regime. Finally, gains are reported on attention-module latency; end-to-end diffusion sampling latency reductions depend on how attention dominates the schedule at the chosen resolution.
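To make the last point concrete, a back-of-the-envelope Amdahl's-law estimate converts the roughly 60% attention-latency reduction into an end-to-end speedup. The attention shares used below are hypothetical illustration values, not measurements from the paper.

```python
def end_to_end_speedup(attn_fraction, attn_latency_reduction=0.60):
    """Amdahl's-law estimate of the overall speedup when only the attention
    module gets faster. attn_fraction is the share of a sampling step spent
    in attention before optimization (a hypothetical input here)."""
    remaining = (1.0 - attn_fraction) + attn_fraction * (1.0 - attn_latency_reduction)
    return 1.0 / remaining

for share in (0.3, 0.5, 0.8, 1.0):
    print(f"attention share {share:.0%}: {end_to_end_speedup(share):.2f}x end-to-end")
```

A 60% latency reduction corresponds to a 2.5x speedup of the attention module itself, and the end-to-end gain approaches that ceiling only as attention's share of the step approaches 100%.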

Why this matters

ICL is becoming the default interface for controllable video editing, but it doubles attention cost on top of an already expensive video DiT. ISA shows that the source/context asymmetry can be exploited structurally, not just observed empirically, and that combining KV-side pre-selection with query-side sharpness routing cuts attention-module latency by roughly 60% at near-lossless quality, which is the right shape of optimization for this regime.

Source: https://arxiv.org/abs/2605.04569

Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

Problem

Distribution Matching Distillation (DMD) is the workhorse for compressing autoregressive streaming video diffusion models into few-step students, but the standard objective treats every rollout, every frame, and every pixel as equally informative supervision. The authors argue this uniform weighting conflates two distinct decisions: whether a given student rollout deserves to be learned from at all, and where within that rollout the gradient should concentrate. Both axes carry real variance — some student samples are simply unreliable (the teacher-student score gap is noisy or misleading), and within any reliable sample, perplexity is concentrated in specific regions and frames (motion boundaries, late frames in autoregressive rollouts, etc.). For long-horizon streaming generation where errors compound, ignoring this structure caps the achievable quality of the distilled student.

Motivation: Inter-Reliability across rollouts and Intra-Perplexity across spatiotemporal regions.