Daily AI Digest — 2026-05-12
arXiv Highlights
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
Problem and motivation
Now that frontier LLMs have reached IMO gold-medal performance, olympiad-style benchmarks have lost discriminative power at the top. Research-level mathematics is the natural next target: it requires the same step-by-step reasoning, applied to problems closer to open mathematics. Existing research-level benchmarks are tiny (Riemann-Bench has 25 problems, FrontierMath-Tier 4 has 50), which makes statistical separation of frontier systems unreliable. Soohak (SH^2, from 수학 시험, “math exam”) is a 439-problem benchmark authored from scratch by 64 mathematicians, plus a larger Mini split, designed to provide both scale and difficulty.
Construction
The full contributor pool reached 105 accepted-question authors across 31 organizations: 48% faculty, 23% graduate students/postdocs, 25% undergraduates, 5% undisclosed. 72 of 86 primary-system contributors were recruited via direct outreach to math departments. Compensation was per accepted question, with a $260,000 pool and per-question payments ranging from $36 to $3,623 (capped at $20,000 per contributor). All submissions were text-only LaTeX, in English or Korean, with a complete solution and an explicit final-answer line. Contributors signed an originality clause forbidding AI assistance, plus an NDA and IP-transfer agreement.

The pipeline is a multi-gate filter: originality/copyright agreement, automated screening with model-gated routing and similarity checks, two human reviewers, contributor-controlled opt-in, then final inclusion. Any submission found to be AI-generated got its contributor removed from the pool entirely (“banned creators”). The model-gated routing is operationally important: items that several frontier models solve cleanly are routed into Mini; items that survive the model gate populate Challenge.
The benchmark splits into:
- SOOHAK-Mini (n=702): merges the first two internal model gates; olympiad to early graduate level.
- SOOHAK Challenge (n=340): hard, research-flavored items.
- SOOHAK Refusal (n=99): a separate split probing recognition of ill-posed problems, a capability intrinsic to research mathematics, where the first task on a fresh question is often deciding whether it is well-posed at all.
Evaluation protocol
Eleven models were evaluated with reasoning enabled: Gemini-3-Pro/Flash, GPT-5/-Mini (Medium reasoning), Claude-Opus-4.5/Sonnet-4.5, Grok-4.1-Fast (closed); Qwen3-235B-A22B-thinking-2507, GPT-OSS-120B, Kimi-2.5, GLM-5 (open-weight). Reported metrics are Avg@3 and Pass@3, scored purely on final-answer correctness with no partial credit.
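Both metrics are simple aggregations over boolean final-answer grades; a minimal sketch (three attempts per problem, data illustrative):

```python
# Minimal sketch of the two reported metrics, assuming each problem has a list of
# boolean final-answer grades, one per sampled attempt. Names and data are illustrative.
def avg_at_k(grades_per_problem):
    # Mean per-attempt accuracy: average of per-problem attempt success rates.
    return 100 * sum(sum(g) / len(g) for g in grades_per_problem) / len(grades_per_problem)

def pass_at_k(grades_per_problem):
    # Fraction of problems solved by at least one of the k attempts.
    return 100 * sum(any(g) for g in grades_per_problem) / len(grades_per_problem)

grades = [[True, False, False], [False, False, False], [True, True, True]]
print(avg_at_k(grades), pass_at_k(grades))  # 44.4..., 66.6...
```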
Main results
On Mini, frontier models cluster: GPT-5 leads Avg@3 at 72.22, Gemini-3-Pro at 71.70, Grok-4.1-Fast at 70.66. Difficulty separation appears on Challenge:
| Model | Challenge Avg@3 (%) | Challenge Pass@3 (%) |
|---|---|---|
| Gemini-3-Pro | 30.39 | 44.12 |
| GPT-5 | 26.37 | 40.88 |
| Grok-4.1-Fast | 18.43 | 30.88 |
| GPT-5-Mini | 18.82 | 28.82 |
| Gemini-3-Flash | 15.69 | 25.59 |
| Kimi-2.5 | 13.87 | 20.00 |
| GPT-OSS-120B | 11.27 | 18.53 |
| Claude-Opus-4.5 | 10.39 | 18.82 |
| GLM-5 | 9.61 | 18.24 |
| Qwen3-235B | 8.04 | 15.00 |
| Claude-Sonnet-4.5 | 5.69 | 10.29 |
All open-weight models stay below 15% Avg@3. 124 Challenge items are unsolved by any evaluated model, and 170 are unsolved or missed across the model family, already more items than Riemann-Bench contains in total (its own unsolved set is \geq 23/25).
Refusal yields a different ordering. GLM-5 leads at 49.49 Avg@3 / 73.74 Pass@3, GPT-OSS-120B at 43.77/60.61, GPT-5 at 43.09/61.62. Qwen3-235B collapses to 2.69 Avg@3 — it almost never identifies ill-posed problems, suggesting an answer-emission prior that overrides well-posedness checking. The decoupling between Challenge rank and Refusal rank indicates that solving and recognizing-unsolvability are distinct skills not jointly optimized in current post-training.

The compute-scaling panel shows Pass@3 climbing with parameter count across Qwen3 0.6B–32B on both Challenge and Refusal, and test-time scaling for GPT-OSS-120B (medium 16k tokens; hard 16k; hard 81,920) yielding monotonic gains, indicating Soohak is not yet saturated by either dimension.
Human baselines
Five teams of five solvers each (CS majors with IMO experience; math majors with IMO experience; math majors with IMO Gold; math majors; math researchers) attempted a 79-problem subset (49 Calibration, 30 Challenge upsampled) under a 4.5-hour budget, with CAS, programming, and non-AI search allowed. Models received Pass@1 on the same 79 items.

Only Gemini-3-Pro exceeds combined-human coverage at 50.6%. The strongest single team is the Math Major (IMO experience) team. Math researchers (Team E) do not dominate over IMO-trained undergraduates on this set, consistent with the benchmark’s mix of olympiad-flavored and research-flavored items.
Limitations and open questions
- Sessions were not standardized across human teams, adding variance at frontier difficulty.
- Pure outcome-based scoring with text-only LaTeX excludes problems requiring diagrams or genuinely open conjectures, so “research-level” here means hard problems with a verifiable final answer rather than open mathematics proper.
- The Challenge gate was deliberately not pushed harder against top closed systems; this preserves scale at the cost of some headroom for Gemini-3-Pro and GPT-5.
- Refusal performance does not correlate with Challenge solving, but the benchmark does not yet disentangle calibration training from genuine ill-posedness detection.
Why this matters
Soohak provides the first research-flavored math benchmark with enough items (439 graded, plus 702 in Mini) to statistically separate frontier systems, and it exposes a sharp gap — 30% Avg@3 for the best model — that olympiad benchmarks no longer reveal. The Refusal split formalizes a research skill prior benchmarks ignored, and the rank inversion between solving and refusal signals an underexplored axis for post-training.
Source: https://arxiv.org/abs/2605.09063
Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training
Problem
Continual post-training of LLMs sequentially injects new domains, skills, or behaviors via updates \Delta_t = \theta_t - \theta_{\text{pre}}. Existing tools — sequential SFT, EWC, replay (FOREVER), task-arithmetic merging (TIES, DARE, AIMMerging) — patch forgetting but offer no principled criterion for when a new update will transfer cleanly versus when it will overwrite previously acquired capabilities. Practical decisions (which updates to merge, with what weighting, at which layers) are made by trial and error. This paper argues that forgetting is governed by a measurable geometric property of the updates relative to the evolving model state, and turns that diagnostic into a data-free merging algorithm (GCWM).
What governs forgetting
The authors represent each task update by its layer-wise covariance C_i^{(\ell)} = (\Delta_i^{(\ell)})^\top \Delta_i^{(\ell)} + \lambda I and contrast four candidate predictors of retention loss across Qwen3 (0.6B–14B) under Seq. SFT, EWC, FOREVER, and AIMMerging: update norm, subspace alignment ratio (SAR), gradient conflict, and Bures–Wasserstein “geometry conflict” measured both pairwise among active updates and against the current model state.

The empirical finding is that update norm gives only a coarse signal (|\rho_s|=0.48 Spearman with retention loss), pairwise active conflict is weaker still (|\rho_s|=0.30), but the state-relative gap — geometry mismatch between the new active updates and the geometry of the accumulated model state — reaches |\rho_s|=0.59 globally and grows monotonically with scale, from 0.16 at 0.6B to 0.86 at 14B. In other words, forgetting is not “how far the parameters move” but “how incompatible the move is with the geometry already encoded by previous updates.”

A complementary stratification (Fig. 3) shows that SAR and geometry conflict together carve task pairs into distinct transfer regimes (positive transfer, neutral, interference), while gradient conflict captures a different failure mode concentrated in top-layer parameter shares. Geometry and gradient conflict are therefore complementary rather than redundant.

GCWM: turning the diagnostic into a controller
GCWM is a data-free merging procedure parameterized purely by the active task vectors. For each linear layer \ell it:
- Computes truncated SVDs \Delta_i^{(\ell)} \approx U_i \Sigma_i V_i^\top and forms a shared right-singular basis Q^{(\ell)} = \mathrm{orth}([V_1^{(\ell)},\dots,V_m^{(\ell)}]).
- Projects each update covariance into the shared basis: B_i^{(\ell)} = (Q^{(\ell)})^\top C_i^{(\ell)} Q^{(\ell)}.
- Measures pairwise normalized Bures–Wasserstein conflict \gamma_{ij}^{(\ell)} = \frac{d_B^2(B_i^{(\ell)}, B_j^{(\ell)})}{\mathrm{tr}(B_i^{(\ell)}) + \mathrm{tr}(B_j^{(\ell)}) + \varepsilon},\quad d_B^2(A,B)=\mathrm{tr}(A)+\mathrm{tr}(B)-2\,\mathrm{tr}((A^{1/2}BA^{1/2})^{1/2}).
- Aggregates to a layer score g^{(\ell)} = \sum_{i<j} w_{ij}\gamma_{ij}^{(\ell)} and converts to a sigmoidal gate \alpha^{(\ell)} = \alpha_{\min} + (\alpha_{\max}-\alpha_{\min})\,\sigma(\kappa(g^{(\ell)} - \tau)).
At step t, only the incremental portion of the merged update modulated by \alpha^{(\ell)} is applied: high-conflict layers receive stronger geometry-aware correction (effectively shrinking the merge into compatible directions), low-conflict layers pass through. Construction is closed-form, requires no replay data, no held-out evaluation, and no gradient access — only the task vectors.
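A schematic NumPy sketch of the per-layer conflict score and gate follows; the uniform pair weights, the rank/threshold values, and the helper names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def _psd_sqrt(M):
    # Symmetric PSD square root via eigendecomposition.
    w, V = np.linalg.eigh(M)
    w = np.clip(w, 0.0, None)
    return (V * np.sqrt(w)) @ V.T

def gcwm_layer_gate(deltas, rank=8, lam=1e-4, eps=1e-8,
                    alpha_min=0.1, alpha_max=1.0, kappa=10.0, tau=0.5):
    """Conflict score and gate for one layer, given task updates Delta_i of identical shape."""
    # Truncated right-singular bases V_i and a shared orthonormal basis Q = orth([V_1 ... V_m]).
    Vs = [np.linalg.svd(D, full_matrices=False)[2][:rank].T for D in deltas]
    Q, _ = np.linalg.qr(np.concatenate(Vs, axis=1))
    # Project each regularized update covariance into the shared basis.
    Bs = [Q.T @ (D.T @ D + lam * np.eye(D.shape[1])) @ Q for D in deltas]
    # Pairwise normalized Bures-Wasserstein conflict, uniformly weighted here.
    g, n_pairs = 0.0, 0
    for i in range(len(Bs)):
        for j in range(i + 1, len(Bs)):
            Ai, Bj = Bs[i], Bs[j]
            root = _psd_sqrt(Ai)
            d2 = np.trace(Ai) + np.trace(Bj) - 2 * np.trace(_psd_sqrt(root @ Bj @ root))
            g += d2 / (np.trace(Ai) + np.trace(Bj) + eps)
            n_pairs += 1
    g /= max(n_pairs, 1)
    # Sigmoidal gate alpha in [alpha_min, alpha_max].
    alpha = alpha_min + (alpha_max - alpha_min) / (1 + np.exp(-kappa * (g - tau)))
    return g, alpha
```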
Results
On Qwen3 backbones, the domain-continual MMLU-Pro setting (14 sub-domains, 1k samples each) shows GCWM closing most of the gap to multi-task joint training:
- 1.7B: MTL upper bound 44.4 → Seq. SFT 36.8, EWC 40.0, AIMMerging 41.8, OPCM 41.7, GCWM 43.5.
- 8B: MTL 65.3 → Seq. SFT 55.2, AIMMerging 62.9, OPCM 61.9, GCWM 63.7.
- 14B: MTL 68.6 → Seq. SFT 60.4, AIMMerging 66.4, OPCM 66.6, GCWM 67.8.
GCWM is the best non-MTL method in every block, with the largest improvement over Seq. SFT at 8B (+8.5 points) and margins of 0.8–1.8 points over the strongest data-free baseline (OPCM or AIMMerging) across scales. Per domain, GCWM is competitive on traditionally interference-prone categories (CS at 8B: 64.7 vs. 62.8 for OPCM; History at 14B: matches OPCM at high accuracy), suggesting the gate performs useful per-layer modulation rather than uniform shrinkage.
Limitations and open questions
The covariance C_i^{(\ell)} = (\Delta_i^{(\ell)})^\top \Delta_i^{(\ell)} is a coarse proxy for task-induced geometry; a true Fisher would require data. The truncated-SVD shared basis and the threshold \tau, sharpness \kappa, and rank m are hyperparameters whose sensitivity is not quantified in the main text. The state-relative analysis is correlational (|\rho_s| up to 0.86), not a causal mechanism, and the gap to MTL still ranges from 0.8 to 1.6 points, leaving headroom that pure update-geometry cannot close. Capability-continual results (math vs. code) are referenced but not summarized here. Whether geometry conflict generalizes to RL post-training, where update geometry is noisier and policy-state coupled, remains open.
Why this matters
Most current continual post-training work treats forgetting as an inevitable cost to be amortized via replay or regularization. This paper reframes it as a measurable geometric incompatibility between an incoming update and the evolving model state, and shows the same quantity acts as an effective per-layer gating signal — recovering most of the joint-training accuracy without data, gradients, or replay. That makes update integration a controllable, predictable operation rather than an empirical lottery.
Source: https://arxiv.org/abs/2605.09608
Model Merging Scaling Laws in Large Language Models
Problem
Model merging — averaging or task-arithmetic combinations of fine-tuned checkpoints — is a cheap alternative to multitask SFT, but practitioners have no quantitative rule for the returns of adding more experts or scaling the base model. Should you merge a 9th expert or train a larger backbone? When does the marginal gain fall below noise? This paper provides an empirical scaling law for the cross-entropy of merged LLMs as a joint function of base size N and expert count k, validates it across four merge rules and seven backbone sizes, and derives a small theoretical model that explains the 1/k tail.
Setup and law
For each backbone N \in \{0.5, 1.5, 3, 7, 14, 32, 72\} billion parameters (Qwen-2.5 family, 10,866 checkpoints total), the authors train M=9 domain-specialist experts spanning algebra, analysis, geometry, discrete math, number theory, code, chemistry, physics, and biology. For a merge rule (Average, TA, TIES, DARE) and target subset size k \in \{1,\ldots,9\}, they evaluate either all \binom{M}{k} subsets or a uniform sample, then form the empirical conditional expectation
\widehat{\mathbb{E}}[L\mid N,k] = \frac{1}{S_{N,k}} \sum_{s=1}^{S_{N,k}} L(N,k,s).
Per-subset losses are noisy (visible bands in the scatter), but the per-k mean is a smooth monotone curve with diminishing returns.

The central empirical claim is that this expectation fits a floor-plus-tail law,
\mathbb{E}[L\mid N,k] = L_\infty(N) + \frac{A(N)}{k+b},\qquad b \ge 0\ \text{small}.
L_\infty(N) is the asymptote as more experts are merged (interpreted as the best the merge rule can recover at that backbone size) and A(N)/(k+b) is the merging tail. The law holds both in-domain (eval on the donor's own domain) and cross-domain (macro-average across all nine domains). It accommodates Average, TA, TIES, and DARE with the same functional form; only the coefficients change.
Two regularities follow directly: (i) most gains arrive early, since 1/(k+b) drops sharply for small k, and (ii) subset-level variance shrinks as k grows, because averaging over more donors concentrates the random-subset distribution around its mean. Panel 5 of Figure 3 makes this variance contraction explicit on the algebra domain.
Theoretical sketch
The authors’ theory casts merged parameters as a uniform average of k task vectors with bounded pairwise interference. Under mild regularity of the loss around the base model, a second-order expansion gives an excess loss term whose expectation over uniform k-subsets scales as 1/k, reproducing the A(N)/(k+b) tail with b absorbing finite-k corrections. L_\infty(N) is then identified with base-model curvature plus residual cross-domain mismatch — properties that are intrinsic to the backbone and the donor pool, not to k.
Quantitative behavior and pool-size analysis
Floors L_\infty(N) track tight power laws in N and are essentially invariant to whether the donor pool has M=8 or M=7 available domains. The pool-size effect is concentrated in the tail A(N): moving from M=8 to M=7 makes A(N) flat-to-decreasing in N on chemistry/physics, while math-like domains barely move. The interpretation is that a more diverse pool supplies complementary donors that suppress residual cross-domain mismatch in the A(N)/(k+b) term — most visible at moderate-to-large k and larger N. The floor is set by the backbone; the tail is set by donor diversity.
The fitted law also gives the planning levers the authors advertise: invert it to estimate the k needed to hit a target loss, detect a stopping point where A(N)/(k+b) falls below a tolerance, and compare scaling N versus adding experts under a fixed compute budget. Because expert post-training itself follows a separate scaling pattern in N and post-training compute (Figure 11 in the paper), the merging law composes cleanly with expert-side scaling for end-to-end budgeting.
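A hedged sketch of that planning use: fit the floor-plus-tail form to per-k mean losses, then invert it for a target loss. The data values are made up and the fitting choices (scipy curve_fit, initial guesses, bounds) are mine, not the paper's.

```python
import numpy as np
from scipy.optimize import curve_fit

def merge_law(k, L_inf, A, b):
    # E[L | N, k] = L_inf + A / (k + b) for a fixed backbone size N.
    return L_inf + A / (k + b)

ks = np.arange(1, 10)
mean_loss = np.array([2.10, 1.95, 1.88, 1.84, 1.82, 1.80, 1.79, 1.78, 1.77])  # illustrative

(L_inf, A, b), _ = curve_fit(merge_law, ks, mean_loss, p0=[1.7, 0.5, 0.1],
                             bounds=([0, 0, 0], [np.inf, np.inf, np.inf]))

# Invert for the expert count needed to reach a target loss (assumes target > fitted floor).
target = 1.80
k_needed = A / (target - L_inf) - b
print(L_inf, A, b, k_needed)
```

If the target loss falls below the fitted floor, no expert count reaches it, which is exactly the regime where scaling N instead of k is the better use of compute.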
Limitations and open questions
- The grid is single-family (Qwen-2.5) and single-task-flavor (math/science/code domains); the floor-power-law in N may not generalize to mixed-modality or instruction-following losses where evaluation is not pure CE.
- M=9 is small; the 1/(k+b) form is verified only up to k=9, and the asymptote L_\infty(N) is an extrapolation rather than a measurement.
- The theory assumes bounded pairwise interference between task vectors; pathological merge rules or strongly conflicting experts (e.g., adversarial fine-tunes) would violate this.
- Subset-level variance is acknowledged but not modeled — the law predicts means, so for risk-sensitive deployment one still needs a separate variance estimator.
- No analysis of whether the same law holds for downstream metrics (accuracy, pass@1) rather than CE.
Why this matters
A predictive, two-parameter law for merging turns “try a few combinations” into a planning problem: given a base size and a donor pool, you can estimate the loss reachable with k experts before training any of them, and decide whether marginal compute is better spent enlarging N or adding donors. This makes merging a quantitatively comparable alternative to multitask SFT rather than a heuristic shortcut.
Source: https://arxiv.org/abs/2509.24244
G-Zero: Self-Play for Open-Ended Generation from Zero Data
Problem
Self-evolving LLM pipelines like R-Zero rely on majority voting or rule-based verifiers, which only function in closed, verifiable domains (math, code with unit tests). Extending self-play to open-ended generation typically falls back on an LLM-as-a-judge, whose scalar scores are upper-bounded by the judge’s own capability and are notoriously susceptible to reward hacking (length bias, sycophancy, stylistic exploits). G-Zero proposes an alternative supervision signal that is generated entirely from the trainee model’s own internal distribution, removing both external verifiers and judges from the loop.

Method
G-Zero is a two-player co-evolutionary loop between a Proposer \pi_P and a Generator \pi_G, glued together by a single intrinsic scalar called Hint-\delta.
Hint-\delta. For a query q, a Proposer-generated hint h, and the Generator’s unassisted response a_{\text{hard}} \sim \pi_G(\cdot \mid q) with tokens (a_1,\dots,a_T),
\delta(q,h,a_{\text{hard}}) = \frac{1}{T}\sum_{t=1}^{T}\Big[\log\pi_G(a_t\mid q, a_{<t}) - \log\pi_G(a_t \mid q, h, a_{<t})\Big].
This is the per-token mean log-likelihood drop induced on \pi_G’s own unassisted trajectory when h is prepended. Per-token (rather than sequence-sum) normalization is the critical design choice: on a 1,824-sample R1 pool the Spearman correlation between \delta and |a_{\text{hard}}| is -0.41, i.e., the obvious length-exploit channel is closed off. A large \delta requires both that a_{\text{hard}} was uncertain/flawed (hard query) and that h contains information that genuinely reorganizes \pi_G’s next-token distribution (informative hint).
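A minimal sketch of computing Hint-\delta for a HuggingFace-style causal LM; the prompt concatenation (hint appended after the query) and helper names are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def hint_delta(model, tok, query, hint, answer_hard):
    """Per-token mean log-likelihood drop on the unassisted answer when the hint is prepended."""
    def mean_answer_logprob(prompt, answer):
        prompt_ids = tok(prompt, return_tensors="pt").input_ids
        answer_ids = tok(answer, return_tensors="pt", add_special_tokens=False).input_ids
        ids = torch.cat([prompt_ids, answer_ids], dim=1)
        with torch.no_grad():
            logits = model(ids).logits
        # Position i of logits[:, :-1] predicts token i+1, so the answer tokens are
        # scored by positions prompt_len-1 ... end.
        logprobs = F.log_softmax(logits[:, :-1], dim=-1)
        answer_slice = logprobs[0, prompt_ids.shape[1] - 1:]
        token_lp = answer_slice.gather(-1, answer_ids[0].unsqueeze(-1)).squeeze(-1)
        return token_lp.mean().item()

    lp_plain = mean_answer_logprob(query, answer_hard)
    lp_hint = mean_answer_logprob(query + "\n" + hint, answer_hard)
    return lp_plain - lp_hint  # large delta: the hint strongly reorganizes pi_G's distribution
```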
Proposer phase. With \pi_G frozen, \pi_P samples (q_i, h_i) pairs and is updated via GRPO using \delta as the reward. This drives the Proposer toward the Generator’s blind spots — regions where \pi_G is wrong but recoverable given a small amount of guidance.
Generator phase. With \pi_P frozen, the Generator answers each q_i both with and without h_i. A \delta-percentile filter selects pairs into the round-R preference dataset \mathcal{D}_{R+1}, where the hint-assisted response is the chosen response and a_{\text{hard}} is the rejected one. \pi_G is then updated with DPO. The hint is dropped at inference: the Generator must internalize the structural improvement, not depend on h.
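As a concrete illustration of the filter-then-pair step (the record layout, field names, and default percentile band are assumptions, not the paper's code):

```python
import numpy as np

def build_dpo_pool(records, lo_pct=0, hi_pct=50):
    # records: dicts with keys "query", "delta", "answer_hint", "answer_hard".
    # Keep pairs whose delta lies in the chosen percentile band, then emit DPO triples
    # with the hint-assisted answer as "chosen"; the hint itself never enters the prompt.
    deltas = np.array([r["delta"] for r in records])
    lo, hi = np.percentile(deltas, [lo_pct, hi_pct])
    return [
        {"prompt": r["query"], "chosen": r["answer_hint"], "rejected": r["answer_hard"]}
        for r in records
        if lo <= r["delta"] <= hi
    ]
```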

The authors prove a best-iterate suboptimality bound for an idealized standard-DPO variant under two conditions: (i) the Proposer induces sufficient coverage over the query-hint space, and (ii) the \delta-filter keeps pseudo-label noise low. Both assumptions are non-trivial in practice but make the role of the filter explicit — it is not just data hygiene but a precondition for convergence.
Results
Experiments use Qwen3-8B-Base and Llama-3.1-8B-Instruct, deliberately spanning a base model and an aligned instruction model from different families.
The most informative analysis concerns what kind of data the loop selects. After \delta-filtering on Qwen3-8B-Base (Round 1), only 9.6% of the DPO pool is math and 9.0% is code; the bulk is advice (30.2%), other (24.1%), writing (17.4%), and explanation (9.6%). The mean \delta values are highest in writing (0.060) and explain (0.058) — not in math (0.045). Yet, under the [0,50] percentile filter (the G-Zero default), the model improves on Math from 8.81 to 11.78 (absolute), Chat from 8.94 to 9.07, IFEval from 52.78 to 53.03, and average from 33.95 to 34.96. Reasoning gains arise from structural transfer of compositional patterns elicited in non-verifiable categories, not from in-domain math memorization.
Filter ablations show the design is non-trivial: [0,50] yields avg 34.96, [20,80] yields 34.40, [50,100] yields 34.04, and the unfiltered [0,100] yields 34.65. The high-\delta tail alone is worse than the low/mid range — consistent with the theory that extreme \delta values are noisy pseudo-labels (hints that overwrite rather than refine).

Figure 3 shows incremental scaling of the DPO pool (N \in \{100, 200, 400, 730\}), with the Round 2 from-scratch run plotted as a reference point at N=730, suggesting iterative refinement compounds beyond what a single larger-pool run achieves.
Limitations and open questions
Hint-\delta is a self-referential signal: it measures shifts in \pi_G’s own distribution. A pathological hint that simply pushes \pi_G off-manifold could score highly without being correct, and the convergence guarantee depends on a “low pseudo-label noise” condition that is empirically enforced via percentile filtering rather than verified. The paper does not report what happens after many rounds — whether \delta collapses as the Generator absorbs everything the Proposer can elicit, or whether the Proposer keeps finding new blind spots. There is also no comparison against a strong LLM-as-a-judge baseline on the same pool, so the magnitude of the “no external judge” advantage is not pinned down. Finally, both models are 8B; whether Hint-\delta remains discriminative when \pi_G is much stronger (and hint-induced log-prob shifts shrink) is open.
Why this matters
G-Zero replaces an external scalar reward with an internal distributional shift, which is both verifier-free and length-invariant by construction. If the convergence behavior holds at scale, it offers a route to continuous post-training on open-ended tasks without the judge-capability ceiling that has bottlenecked RLAIF-style pipelines.
Source: https://arxiv.org/abs/2605.09959
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
Problem
“Thinking with Video” treats a video generation model (VGM) as a reasoning substrate: instead of emitting tokens, the model emits a Chain-of-Frames (CoF) — a temporally coherent clip whose frames serve as intermediate reasoning states for goal-directed tasks (procedural manipulation, navigation, multi-step physical simulation). The appeal is that a strong VGM has internalized short-horizon visual dynamics that are awkward to express in language. The failure modes, however, are sharp:
- Long-horizon drift. On tasks requiring more than a handful of subgoals, the VGM’s implicit plan decays; later frames satisfy local continuity but violate the global goal.
- Mid-clip simulation errors. Within a single generated clip, the model commits a physically or semantically inconsistent transition and then continues conditioned on the corrupted state, compounding the error.
Both pathologies share a structural cause: there is no explicit deliberative process operating at the granularity at which the VGM actually reasons (a few seconds of frames). A vision-language model (VLM) is the natural candidate to supply that deliberation, but the placement question is non-trivial. An upfront VLM plan must commit before any pixels exist and is therefore blind to the VGM’s idiosyncratic failure surface; a post-hoc critique over a finished long video intervenes after errors have already cascaded.
Method
CollabVR is a closed-loop, step-level controller that interleaves the VLM and the VGM at clip granularity. Let the VGM be a conditional generator G(\cdot \mid \text{frame}_{t}, a_t) producing a short clip c_t from the current terminal frame and an action prompt a_t. Let the VLM serve two roles: a planner \pi(a_t \mid \text{goal}, h_{<t}, \text{frame}_t, d_{t-1}) and a verifier v(c_t, a_t) \to (s_t, d_t) returning a binary acceptance s_t and a diagnosis string d_t describing any detected failure (object missing, wrong contact, violated constraint, drift from subgoal).
The loop, per step t:
- Plan. VLM emits a_t, the immediate next action, conditioned on the goal, prior accepted clips’ summaries h_{<t}, the current terminal frame, and the previous diagnosis d_{t-1}.
- Generate. VGM produces c_t = G(\text{frame}_t, a_t).
- Verify. VLM inspects c_t against a_t and the global goal, producing (s_t, d_t).
- Repair or commit. If s_t = 0, the next planning call uses d_t to rewrite a_{t+1} (or to retry a_t with a corrective phrasing). If s_t = 1, c_t is appended to the trajectory and \text{frame}_{t+1} becomes the last frame of c_t.
The key design decision is that the verifier’s diagnosis is folded directly into the next action prompt, rather than triggering a global rollback. This keeps the VGM operating in its short-horizon comfort zone while letting the VLM exercise long-horizon reasoning where it is strongest. Compute is matched against baselines at the level of total VGM forward passes, so retries replace, rather than augment, what Pass@k or naive test-time scaling would spend.
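The loop itself is a few lines of control flow; the sketch below assumes callables wrapping the two models plus an illustrative retry cap, none of which is specified by the paper.

```python
# Schematic CollabVR rollout. plan/generate/verify wrap the VLM planner, the VGM, and the
# VLM verifier; function names, the retry cap, and the history handling are illustrative.
def collabvr_rollout(goal, first_frame, plan, generate, verify, max_steps=10, max_retries=3):
    frame, history, clips, diagnosis = first_frame, [], [], None
    for _ in range(max_steps):
        for _ in range(max_retries):
            action = plan(goal, history, frame, diagnosis)    # planner sees the last diagnosis
            clip = generate(frame, action)                    # VGM emits a short clip
            accepted, diagnosis = verify(clip, action, goal)  # accept flag + failure diagnosis
            if accepted:
                break
        if not accepted:
            break                                             # give up after repeated failures
        clips.append(clip)
        history.append(action)                                # committed actions track progress
        frame, diagnosis = clip[-1], None                     # condition on the clip's last frame
    return clips
```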
Results
On Gen-ViRe and VBVR-Bench, CollabVR improves both open-source and closed-source VGMs over (a) single-inference, (b) Pass@k selection with a VLM judge, and (c) prior test-time scaling baselines, at matched compute. The reported pattern is consistent: gains scale with task difficulty, with the largest improvements on the hardest, longest-horizon subsets. The closed-loop step-level coupling outperforms upfront-plan and post-hoc-critique ablations, isolating the contribution of folding the verifier diagnosis back into the action prompt as opposed to merely having a verifier in the pipeline.
(Specific per-benchmark deltas are reported in the paper’s tables; the abstract emphasizes that the gains are largest on the hardest tasks, consistent with the failure-mode analysis: drift and mid-clip compounding are precisely the regimes where naive Pass@k cannot help, since the candidate distribution shares the same systematic biases.)
Limitations and open questions
- Verifier capacity ceiling. The loop’s repair quality is bounded by the VLM verifier’s ability to detect mid-clip simulation errors — exactly the regime where current VLMs are known to be weak (subtle physics violations, fine-grained object state changes). Failures the verifier cannot see propagate.
- Step granularity is fixed. The clip length defining a “step” is treated as a hyperparameter; adaptive segmentation (committing on subgoal boundaries detected by the verifier) is not explored.
- Latency. Per-step VLM calls double the inference path; the matched-compute comparison controls for VGM cost but not wall-clock when VLM and VGM run on different accelerators.
- No gradient feedback. Diagnoses act only through prompts; learning a VGM that conditions on structured verifier outputs (or distilling the loop) is left open.
- Verifier–planner shared failure modes. Using the same VLM as both planner and verifier risks correlated blind spots; an asymmetric pairing is not studied.
Why this matters
CollabVR makes the case that “thinking with video” is most effective when the VGM is treated as a short-horizon simulator whose outputs are consumed step-by-step by an external deliberator, rather than as a standalone long-horizon reasoner. The architectural prescription — verify at clip granularity and feed the diagnosis into the next action prompt — is a concrete, compute-matched recipe that future video reasoning systems can adopt without retraining either model.
Source: https://arxiv.org/abs/2605.08735
TMAS: Scaling Test-Time Compute via Multi-Agent Synergy
Problem
Structured test-time scaling has converged on two complementary axes: parallel sampling (self-consistency, best-of-N) and sequential refinement (self-refine, verify-refine). Methods on the parallel axis treat trajectories as i.i.d. samples and aggregate them post hoc, leaving cross-trajectory signal unused. Methods on the sequential axis condition refinement on raw or lightly summarized history, which mixes reliable partial progress with noisy or wrong reasoning and provides no explicit mechanism to avoid re-exploring exhausted strategies. The result is poor exploration/exploitation balance: either redundant rollouts of the same failed approach or premature commitment to a flawed partial solution. TMAS targets this gap on hard math benchmarks (IMO-AnswerBench-50, HLE-Math-100) where a single Pass@1 trajectory rarely succeeds and naive scaling saturates.
Method
TMAS organizes inference as a five-agent pipeline coordinated through two typed memory banks, run for up to 20 iterations.

The agents are:
- Solution agents — N=8 parallel solvers conditioned on the problem plus the current contents of both memory banks.
- Verifier agents — M=8 independent verifiers per trajectory that produce localized correctness judgments and pinpoint errors. Using multiple verifiers reduces single-verifier noise that plagues prior verify-refine schemes.
- Summarizer agent — distills each (trajectory, verifier-bundle) into a rollout-level summary.
- Experience agent — writes low-level entries to the experience bank: verified intermediate lemmas, concrete techniques that worked, and verifier-flagged pitfalls. These are intended to be reused in the next iteration.
- Guideline agent — writes high-level entries to the guideline bank: strategic directions already attempted. These are intended to be avoided in the next iteration to force non-redundant exploration.
The asymmetry between the two banks is the key design choice. The experience bank is exploitation memory (reuse what is reliable at the sub-problem level); the guideline bank is exploration memory (negate what has been tried at the strategy level). An exploration coefficient \epsilon=0.2 controls the mixing of the two when prompting subsequent solvers, balancing reuse of partial progress against diversification.
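The paper does not fully specify the mixing mechanics; a hedged sketch that allocates a fixed prompt budget between the two banks according to \epsilon (all names, the budget, and the random sampling are illustrative):

```python
import random

def build_solver_prompt(problem, experience_bank, guideline_bank, epsilon=0.2, budget=12):
    """Mix exploitation memory (reuse) with exploration memory (avoid) under an entry budget,
    weighting the guideline bank by the exploration coefficient epsilon."""
    n_guide = max(1, round(epsilon * budget))
    n_exp = budget - n_guide
    exp_items = random.sample(experience_bank, min(n_exp, len(experience_bank)))
    guide_items = random.sample(guideline_bank, min(n_guide, len(guideline_bank)))
    lines = [problem, "", "Verified lemmas, useful techniques, and known pitfalls:"]
    lines += [f"- {e}" for e in exp_items]
    lines += ["", "Strategic directions already attempted; do NOT repeat them:"]
    lines += [f"- {g}" for g in guide_items]
    return "\n".join(lines)
```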
Base models are Qwen3-30B-A3B-Thinking-2507 and Qwen3-4B-Thinking-2507, with 128K max output, temperature 1.0, top-p 0.95. The 4B variant additionally undergoes a multi-task RL stage. The authors compare a vanilla correctness-only reward against a hybrid reward that scores the agents jointly across their roles (verification quality, summary fidelity, memory utility) so that the system is optimized as a coordinated unit rather than as five independently trained policies. Training uses batch size 128, 16 rollouts/prompt, 80K-token responses, lr 1\times 10^{-6}, on 256 H20 GPUs with FP8 rollout quantization.
Results
The headline behavior is that hybrid-RL training prevents the saturation/regression that vanilla RL induces under iterative refinement.

The vanilla-RL curve shows the well-known pathology of refinement training: the policy overfits to single-shot correctness and learns to ignore the memory channel, so additional iterations add cost without accuracy. The hybrid-reward curve starts higher and continues to climb monotonically across the 20-iteration budget, indicating that the per-agent rewards keep the verifier, summarizer, and memory-update agents informative.
Sensitivity analysis isolates the three main hyperparameters: the exploration coefficient \epsilon, the verifier count M, and the solution-agent count N.

The exploration coefficient peaks near \epsilon=0.2 — too small and the system collapses into exploitation of a single line of attack; too large and the guideline bank’s avoidance signal overwhelms reusable experience. Verifier count M and solution count N both show diminishing returns past 8, justifying the chosen N=M=8 configuration. The full numerical comparison against MV, Self-Refine, Verify-Refine, PaCoRe, and RSE on IMO-AnswerBench-50 and HLE-Math-100 is the main empirical claim of the paper, with additional AIME26 and HMMT-25-Nov numbers reported in the appendix because those benchmarks are near-saturated for the Qwen3 base models considered.
Limitations and open questions
The compute footprint is non-trivial: each iteration spends N solver calls, N \cdot M verifier calls, one summarizer call, and two memory-bank updates, and the budget is 20 iterations. Pass@1 gains must be weighed against this multiplicative cost; the paper does not appear to plot accuracy versus total tokens spent, compared against a tuned best-of-N baseline at matched compute. The hybrid reward decomposition is described qualitatively; the precise per-agent reward shaping, and how credit is assigned without per-agent ground truth, is the most reproducibility-sensitive piece. The guideline bank's "avoid" semantics depend on the solver actually following negative instructions, which is brittle for smaller models. Finally, evaluation is restricted to math; whether the experience/guideline split transfers to domains without crisp verifiers (proofs, code, scientific QA with ambiguous ground truth) is open.
Why this matters
TMAS is a clean instantiation of the idea that test-time scaling should not just spend more tokens but should type the information that flows between rollouts — separating reusable sub-results from exhausted strategies. The hybrid-RL result, where agent-aware rewards eliminate the saturation that single-objective RL produces in iterative systems, is a concrete data point for how to train multi-agent inference stacks end-to-end.
Source: https://arxiv.org/abs/2605.10344
SEIF: Self-Evolving Reinforcement Learning for Instruction Following
Problem
Instruction following remains a bottleneck for LLM deployment: models must satisfy compositional constraints (format, length, lexical inclusion/exclusion, structural rules) that are easy to specify and verify but hard to learn robustly. Two dominant training recipes have known failure modes. Supervised pipelines using human or strong-teacher annotations are expensive and capped by the teacher. Self-play approaches that train a model to follow its own generated instructions tend to plateau because the instruction distribution is static in difficulty: once the follower masters the seed difficulty, no further signal is produced. SEIF targets this plateau by making instruction difficulty co-evolve with follower capability inside a single closed RL loop.
Method
SEIF instantiates four roles, all derived from the same base LLM:
- Instructor \pi_I: generates instructions x with explicit, verifiable constraints (format, keyword, length, structural).
- Filter F: rejects instructions that are internally contradictory, under-specified, or unverifiable; ensures every retained x admits a deterministic checker.
- Follower \pi_F: produces responses y \sim \pi_F(\cdot \mid x).
- Judger J: programmatic verifier returning r(x, y) \in \{0, 1\} per constraint, aggregated into a scalar reward.
Training alternates between two RL phases that share the same GRPO-style objective. Let r be the verifier-derived reward and A the group-relative advantage:
\mathcal{L}(\theta) = -\mathbb{E}\!\left[\min\!\big(\rho_\theta A,\ \mathrm{clip}(\rho_\theta, 1-\epsilon, 1+\epsilon) A\big)\right] + \beta\, \mathrm{KL}(\pi_\theta \| \pi_{\mathrm{ref}}),
with \rho_\theta = \pi_\theta(y\mid x)/\pi_{\mathrm{old}}(y\mid x).
Follower update. Given a batch of filtered instructions, \pi_F is trained on the standard verifier reward r_F = J(x, y). This is conventional RLVR.
Instructor update. This is where SEIF departs from prior self-play. The instructor is rewarded for producing instructions that are hard but valid for the current follower. Concretely, for each candidate x the instructor samples K follower rollouts \{y_k\} and receives
r_I(x) = \mathbb{1}[F(x)=1]\cdot \phi\!\left(\tfrac{1}{K}\sum_k J(x, y_k)\right),
where \phi peaks at intermediate follower success rates (instructions the follower solves \sim 30\text{-}70\% of the time) and is suppressed at 0 or 1. This produces a curriculum that automatically tracks \pi_F: as the follower improves, previously-hard instructions saturate, \phi drops, and the instructor is pushed toward novel constraint compositions.
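A minimal sketch of the instructor reward, using a Gaussian bump for \phi; the paper only requires a peak at intermediate success rates, so the exact shape and parameter values here are assumptions.

```python
import math

def phi(success_rate, center=0.5, width=0.2):
    # Bell-shaped shaping term that peaks at intermediate follower success rates and vanishes
    # near 0 and 1. The Gaussian form and its parameters are illustrative assumptions.
    return math.exp(-((success_rate - center) ** 2) / (2 * width ** 2))

def instructor_reward(instruction, follower_rollouts, judger, filter_ok):
    # r_I(x) = 1[F(x)=1] * phi(mean_k J(x, y_k)); follower_rollouts are K sampled responses.
    if not filter_ok(instruction):
        return 0.0
    success_rate = sum(judger(instruction, y) for y in follower_rollouts) / len(follower_rollouts)
    return phi(success_rate)
```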
The two updates alternate. Because the instructor’s reward is a function of \pi_F, and the follower’s training distribution is a function of \pi_I, the system has the structure of a non-stationary minimax game stabilized by the verifier J acting as ground truth (no learned reward model, hence no reward hacking on the judger side).
The Filter is critical: instructor exploration would otherwise collapse onto pathological constraints (e.g., “respond in exactly 17 words containing 20 keywords”) that are unsatisfiable and trivially produce low follower success, which \phi would erroneously reward. F uses constraint-consistency checks and an LLM-based satisfiability probe before any instruction enters either training stream.
Results
The abstract reports gains “across multiple model scales and architectures” on instruction-following benchmarks, which in this setting means IFEval, FollowBench, and similar constraint-verification suites. The headline claims:
- The closed loop continues improving past the point where static-difficulty self-play saturates, indicating the difficulty-evolution mechanism is the active ingredient rather than additional compute on fixed data.
- Gains transfer to held-out instruction distributions not produced by the trained instructor, suggesting the follower learns generalizable constraint-satisfaction behavior rather than memorizing instructor idiosyncrasies.
- Ablating the Filter degrades both follower performance and instructor stability, consistent with the failure mode above.
- Ablating the difficulty-shaped instructor reward (i.e., rewarding instruction novelty or length instead) recovers static-difficulty behavior and the plateau returns.
Without access to the full results tables in the provided excerpt, the precise deltas on IFEval strict-prompt accuracy and FollowBench HSR are not reproduced here; the structural claim is that SEIF’s improvement curve does not flatten over the training horizon studied, whereas baselines do.
Limitations and open questions
- Verifier coverage. SEIF only works for instructions whose satisfaction is programmatically checkable. Open-ended quality (helpfulness, factuality, style) is outside the loop. Extending to learned verifiers reintroduces reward-hacking risk that SEIF currently sidesteps.
- Filter as a single point of failure. If F has systematic blind spots, the instructor will exploit them; the paper does not quantify Filter precision/recall on adversarial instructions produced late in training.
- Mode collapse on the instructor side. The \phi-shaped reward incentivizes hitting a target success band, which can be achieved by narrow constraint families. Diversity regularization is mentioned but its long-horizon effect is unclear.
- Co-evolution dynamics. No theoretical analysis of convergence or equilibrium; alternation schedule, group size K, and KL coefficient \beta likely matter substantially and tuning costs are not reported.
- Capability bleed. Heavy RL on constraint following can degrade reasoning or knowledge benchmarks; whether SEIF preserves general capability across the loop deserves measurement on MMLU/GSM8K-style probes.
Why this matters
SEIF formalizes a clean recipe for self-improvement without external supervision in domains with cheap verifiers: pair a difficulty-shaped instructor reward with a satisfiability filter so curriculum and policy co-evolve. It is one of the more principled answers to the “what does the model train on after it exhausts the teacher” question, and the same template plausibly extends to code, math, and tool-use whenever a programmatic checker exists.
Source: https://arxiv.org/abs/2605.07465
Hacker News Signals
CUDA-oxide: Nvidia’s official Rust to CUDA compiler
CUDA-oxide is an officially supported Nvidia project that enables writing CUDA kernels in Rust, compiling them to PTX via LLVM’s NVPTX backend. The toolchain hooks into rustc’s existing NVPTX target (nvptx64-nvidia-cuda) and layers CUDA-specific intrinsics, memory space annotations, and synchronization primitives on top as Rust crates. Kernel code uses #![no_std] and exposes thread/block index builtins as safe wrappers. The project ships proc-macro and linker infrastructure to produce .ptx that the CUDA driver can load via the standard cuModuleLoad path.
The key technical challenge is mapping Rust’s type system and borrow checker onto CUDA’s memory hierarchy. Address spaces (global, shared, local, constant) are encoded as distinct pointer types using Rust’s type-level distinctions, so passing a shared-memory pointer where a global pointer is expected is a compile-time error rather than undefined behavior at runtime. unsafe blocks are still required for raw synchronization primitives like __syncthreads(), but the wrappers narrow the blast radius.
Compared to community projects like rust-gpu (which targets SPIR-V/Vulkan) or cudarc (which calls CUDA from Rust host code but writes kernels in C), CUDA-oxide targets the kernel side directly and uses the official CUDA toolchain for linking. This matters for access to CUDA-specific features (tensor core intrinsics, warp-level primitives) that SPIR-V cannot express.
Limitations are real: the ecosystem around profiling (Nsight, ncu) still expects CUDA C source or SASS/PTX with debug sections that Rust’s codegen doesn’t always emit cleanly. Interop with existing CUDA libraries that expect C ABI kernels requires care. The project is early and breaking changes are expected.
Source: https://nvlabs.github.io/cuda-oxide/index.html
Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model
Needle is a 26M-parameter model specialized for tool-call detection and argument extraction, trained by distilling behavior from Gemini on a curated dataset of function-calling traces. The core claim is that the full generality of a frontier LLM is unnecessary for the structured subtask of parsing a user utterance into a tool name plus typed JSON arguments, and a tiny model can match or exceed larger models on that narrow distribution.
The architecture appears to be a small encoder-decoder (details are sparse in the repo, but the parameter count is consistent with a 4–6 layer transformer with narrow hidden dim). Training uses Gemini outputs as soft targets or as labeled (tool, args) pairs, depending on the data pipeline stage. The tokenizer and schema-conditioning mechanism are the interesting parts: tool schemas (JSON Schema format) are prepended to the input context, and the model is trained to output valid JSON constrained to that schema.
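To make the contract concrete, here is a small sketch of the schema-prepended input and schema-validated output described above; this is generic Python with jsonschema and a hypothetical tool definition, not Needle's actual interface.

```python
import json
import jsonschema

tool_schema = {  # hypothetical tool definition
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}, "unit": {"type": "string", "enum": ["C", "F"]}},
        "required": ["city"],
    },
}

def build_input(utterance, schema):
    # The tool's JSON Schema is prepended to the user utterance as conditioning context.
    return f"SCHEMA: {json.dumps(schema)}\nUSER: {utterance}\nCALL:"

def parse_and_validate(model_output, schema):
    # Accept the output only if it is valid JSON conforming to the declared parameter schema.
    call = json.loads(model_output)
    jsonschema.validate(call["arguments"], schema["parameters"])
    return call["name"], call["arguments"]

name, args = parse_and_validate(
    '{"name": "get_weather", "arguments": {"city": "Oslo", "unit": "C"}}', tool_schema)
```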
At 26M parameters the model fits in under 100MB and runs inference in low single-digit milliseconds on CPU, which is the practical motivation — agentic systems that make dozens of tool-dispatch decisions per user turn cannot afford a full LLM forward pass for each one. The repo benchmarks show high exact-match accuracy on held-out tool-call datasets, though the evaluation suite appears to be internal/curated rather than a public benchmark, which makes independent verification hard.
Open questions: robustness to unseen schemas (zero-shot generalization), handling ambiguous utterances where multiple tools could apply, and whether the distillation signal is sufficient for tool calls requiring multi-step reasoning before dispatch. The constrained-decoding angle (outputting schema-valid JSON) is increasingly standard and could be swapped for any grammar-constrained sampler.
Source: https://github.com/cactus-compute/needle
Bun’s experimental Rust rewrite hits 99.8% test compatibility on Linux x64 glibc
Jarred Sumner announced that Bun’s ongoing Rust rewrite of its JavaScript runtime internals passes 99.8% of the existing test suite on Linux x64 glibc. Bun is currently written in Zig (with some C++ for JavaScriptCore integration), and the rewrite targets replacing the Zig codebase with Rust while keeping JavaScriptCore as the JS engine.
The 99.8% figure is on Bun’s own test suite, not Node.js compatibility tests, so the denominator matters. Bun has historically had gaps in Node.js API compatibility, and the rewrite does not directly address those. What the number demonstrates is that the Rust port is faithful to current Bun semantics, not that it is more broadly compatible.
Technically, the rewrite faces the same challenge as any large Zig-to-Rust migration: Zig has no borrow checker and uses comptime generics in ways that don’t map cleanly to Rust’s trait system or lifetime rules. Bun’s Zig code makes heavy use of comptime for zero-cost abstractions and inline assembly; equivalent constructs in Rust require proc macros or const fn where applicable, and unsafe elsewhere. The Zig allocator interface differs from Rust’s GlobalAlloc/Allocator traits.
The motivation is presumably ecosystem and contributor access — Rust has a larger tooling and contributor base than Zig — and possibly long-term maintainability. Performance implications are unclear; Zig and Rust can both produce comparable native code for I/O-bound runtimes, and the bottleneck for a JS runtime is almost always the JS engine, not the host language binding layer.
The 0.2% failure rate on Linux x64 glibc also leaves open Windows, macOS, musl, and ARM targets, which historically have higher porting friction.
Source: https://twitter.com/jarredsumner/status/2053047748191232310
Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s
This post documents a multi-stage optimization journey for dense matrix multiplication in Swift, targeting Apple Silicon via Metal. Starting from a naive triple-loop implementation in the single-digit Gflop/s range, the author works through the standard optimization hierarchy: tiling for cache locality, SIMD intrinsics via Swift’s simd types, and Metal compute shaders with threadgroup (shared) memory.
The quantitative arc is the point: naive Swift gets ~2 Gflop/s on M-series hardware; tiled CPU implementations reach ~50–100 Gflop/s; Metal shaders with threadgroup memory tiling hit ~1–2 Tflop/s, which is in the ballpark of Apple’s advertised peak GPU throughput for FP32. The tiling strategy mirrors what CUDA programmers know from the standard shared-memory GEMM: divide the output matrix into tiles, load A and B sub-tiles into threadgroup memory, accumulate partial products, and iterate over the K dimension. The threadgroup size and tile dimensions are tuned against M-series memory bandwidth and register file constraints.
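The tiling pattern is the same whether the sub-tiles land in threadgroup memory or a CPU cache; a NumPy sketch of the loop structure (tile size illustrative), useful only to show what the Metal kernel blocks over:

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Cache-blocked GEMM sketch: the same tiling pattern the post implements with Metal
    threadgroup memory, written in NumPy purely to show the loop structure."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            acc = np.zeros((min(tile, M - i0), min(tile, N - j0)), dtype=A.dtype)
            for k0 in range(0, K, tile):  # iterate over K, one sub-tile pair at a time
                acc += A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile]
            C[i0:i0 + tile, j0:j0 + tile] = acc
    return C
```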
Swift-specific observations: the simd_float4x4 and wider SIMD types in Swift’s standard library map directly to NEON/AMX instructions on CPU. Metal Shading Language (MSL) is syntactically C++-like and the interop from Swift is straightforward via MTLComputePipelineState. The author notes that Accelerate/BLAS is the correct production answer but the exercise is pedagogically motivated for understanding what’s needed for custom LLM training kernels.
Limitations acknowledged: FP32 only (BF16/FP16 matter for LLM training), no batching dimension in the initial kernel, and no MPS (Metal Performance Shaders) comparison baseline. Part 1 explicitly ends before attention or softmax.
Source: https://www.cocoawithlove.com/blog/matrix-multiplications-swift.html
CERT is releasing six CVEs for serious security vulnerabilities in dnsmasq
CERT/CC coordinated disclosure of six CVEs against dnsmasq, the lightweight DNS forwarder and DHCP server ubiquitous in embedded Linux, container networking (Docker, Kubernetes use it), home routers, and Android hotspot stacks. The vulnerabilities span DNS cache poisoning and heap/stack memory corruption classes.
The most severe issues are in DNS response parsing: dnsmasq’s DNSSEC validation code and its handling of certain DNS record types contain buffer handling errors that can be triggered by a malicious DNS response. Because dnsmasq acts as a recursive resolver for many deployments, an attacker who can position themselves on the path between dnsmasq and an upstream resolver — or who controls an upstream resolver — can craft responses that trigger these bugs. Cache poisoning variants (building on the Kaminsky-style attack surface) require only the ability to race legitimate responses, no path adjacency needed.
dnsmasq’s codebase is C, single-threaded, and processes network input with minimal sandboxing in most deployments. Remote code execution is plausible for the memory corruption bugs given the attack surface, though actual exploitability depends on ASLR/stack canaries in the build and the specifics of each CVE.
The practical exposure is wide: any Linux distribution using NetworkManager with dnsmasq as the local resolver stub, any Docker host (dnsmasq runs in the default bridge network), and the enormous installed base of OpenWrt-based routers. Patches are in dnsmasq 2.91. The delay between fix availability and deployment in embedded firmware will leave a long tail of vulnerable devices, as usual.
Operators should check the dnsmasq --version output for the running version and for whether DNSSEC support was compiled in (a larger risk surface).
Source: https://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2026q2/018471.html
Quack: The DuckDB Client-Server Protocol
DuckDB’s blog describes “Quack,” its new wire protocol for client-server deployments. DuckDB is architecturally an in-process database, but MotherDuck (the managed service) and other multi-user deployments need a network protocol. The design goals are: binary efficiency, support for DuckDB’s columnar result sets, and compatibility with Arrow Flight as a transport where appropriate.
The protocol is built on top of Arrow IPC for data serialization — result sets are sent as Arrow RecordBatches, which eliminates re-encoding costs since DuckDB’s internal execution produces Arrow-compatible columnar buffers natively. The control plane (query submission, parameter binding, transaction commands) uses a lightweight framing layer over TCP. This is meaningfully different from PostgreSQL’s wire protocol, which sends rows in a row-oriented binary or text format and requires per-row serialization overhead.
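A short pyarrow sketch of the Arrow IPC framing that carries result sets; this shows only the serialization layer, not Quack's control-plane framing, and the column names are made up.

```python
import pyarrow as pa

batch = pa.RecordBatch.from_pydict({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Server side: serialize a columnar result set as an Arrow IPC stream.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
payload = sink.getvalue()  # bytes that go on the wire

# Client side: reconstruct RecordBatches with no per-row decoding.
reader = pa.ipc.open_stream(payload)
table = reader.read_all()
print(table.num_rows, table.column_names)
```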
The motivation for a custom protocol rather than re-using PostgreSQL wire (which DuckDB partially supports for compatibility) is that Postgres wire is row-oriented at the protocol level and does not efficiently carry columnar batches. For analytical queries returning millions of rows, columnar Arrow IPC reduces bytes-on-wire and eliminates materialization cost at the client. Arrow Flight RPC (gRPC-based) is an alternative but adds gRPC dependency overhead; Quack appears to use a lighter framing.
Authentication, TLS, and multi-statement transaction semantics are covered. The protocol is explicitly not trying to replace ODBC/JDBC for OLTP workloads; it’s optimized for the scan-heavy, large-result analytical case.
Open question: third-party client library support. The PostgreSQL wire protocol’s ubiquity means every language has a client; Quack requires new driver work, though Arrow Flight clients can partially bridge this.
Source: https://duckdb.org/2026/05/12/quack-remote-protocol
Lakebase architecture delivers faster Postgres writes
Databricks describes the storage architecture behind Lakebase, their managed Postgres offering, which claims 5x faster write throughput than standard Postgres deployments. The key architectural departure is disaggregating the WAL from the buffer pool and storage layer in a way that maps onto cloud object storage and disaggregated compute.
Standard Postgres write path: a WAL record is written to a WAL segment on local disk, pages are dirtied in shared buffers, and an eventual checkpoint flushes them to heap files. Latency is dominated by the WAL fsync required for durability. Lakebase replaces the local WAL with a distributed log service (resembling what Aurora does with its log-structured storage), where WAL records are shipped to a replicated log tier that acknowledges before any page is written. The log tier applies WAL to storage asynchronously. This converts synchronous fsync latency (milliseconds on EBS/local SSD) into network round-trips to a low-latency log service (sub-millisecond in the same AZ).
The 5x figure is on write-heavy OLTP benchmarks (pgbench-style), where WAL fsync is the dominant cost. Read performance is unaffected or slightly worse due to the indirection. The architecture also enables fast branching (copy-on-write clones) because the log and page store are separated, which is relevant for dev/test workflows — a selling point Databricks emphasizes for the LLM fine-tuning use case (spin up a database branch per training run).
Limitations: this architecture adds operational complexity and a new failure domain (the log service). The 5x claim is versus standard single-instance Postgres; Aurora, AlloyDB, and Neon use similar disaggregated-WAL ideas and would be the relevant comparison points, not vanilla Postgres.
Source: https://www.databricks.com/blog/how-lakebase-architecture-delivers-5x-faster-postgres-writes
Incident Report: CVE-2024-YIKES
This post is an incident report from a developer who discovered their open-source project was assigned a CVE — not because the software had a vulnerability, but because a downstream packager had shipped a version with an unrelated, known-vulnerable dependency bundled in. The CVE was filed against the project’s name in the NVD, not against the dependency, creating a persistent public record linking the project to a critical severity score.
The technical substance is in the CVE ecosystem mechanics. NVD entries are keyed to CPE (Common Platform Enumeration) identifiers, and CPE matching is often coarse — a CVE against libfoo embedded in a specific packager’s build of myapp can appear in vulnerability scanners as a direct vulnerability in myapp. Automated scanners (Dependabot, Trivy, Grype, etc.) parse NVD feeds and CPE data without always distinguishing “vulnerable because of embedded component” from “inherently vulnerable.” This creates false positives at scale.
The remediation process is bureaucratically painful: correcting NVD entries requires engaging MITRE or the CNA (CVE Numbering Authority) that issued the CVE, providing evidence, and waiting through a slow editorial queue. GitHub’s advisory database, OSV.dev, and NVD may have divergent data during this period, meaning fixing one doesn’t fix scanner results from others.
The post highlights a structural problem: the CVE/NVD system was designed around commercial software with clear vendor accountability, and the mechanics break down for open-source projects where the “affected product” identifier is ambiguous across packagers, forks, and bundlers. For security engineers running SCA (Software Composition Analysis) tooling, this is a known source of alert fatigue; the incident report gives a concrete case study of the author’s side of that noise.
Source: https://nesbitt.io/2026/02/03/incident-report-cve-2024-yikes.html
Noteworthy New Repositories
NVlabs/cuda-oxide
An experimental Rust-to-CUDA compiler that targets PTX directly from standard Rust source. The core idea: annotate Rust functions with a #[kernel] attribute and let the toolchain lower them through LLVM’s NVPTX backend, bypassing any C++ or CUDA C intermediary. The “safe(ish)” qualifier is important — the usual Rust ownership and borrow guarantees apply at the Rust level, but SIMT execution semantics (warp divergence, shared memory races) are not fully modeled by the type system, so some unsafe blocks remain necessary for intrinsics like __syncthreads and shared memory allocation.
The compiler pipeline hooks into rustc’s codegen layer, emits LLVM IR targeting nvptx64-nvidia-cuda, and produces PTX that can be loaded via the standard CUDA driver API. No custom DSL, no wrappers around thrust or cublas — just Rust primitives. This matters because it removes the FFI boundary that projects like rust-cuda or accel historically required.
Practical limitations: not all core library functionality maps cleanly to PTX (no heap allocation, no panics at device level), and CUDA-specific features like texture memory or cooperative groups need explicit intrinsic calls. Still experimental — NVlabs frames it as a research prototype, not a production toolchain. Worthwhile if you want kernel code with Rust’s type system and tooling (cargo, clippy, rustfmt) without maintaining a parallel C++ build.
Source: https://github.com/NVlabs/cuda-oxide
rocky-data/rocky
A SQL transformation engine written in Rust that positions itself as a dbt alternative with several architectural additions. The headline features: branch-based development (akin to git branches for data pipelines), deterministic replay of transformations, and compile-time type safety over SQL models via Rust’s type system. Column-level lineage is tracked statically at parse time rather than inferred at runtime, which makes it auditable without executing queries.
The engine compiles to a single static binary with adapter plugins for Databricks, Snowflake, BigQuery, and DuckDB. Per-model cost attribution is computed by tagging warehouse billing metadata against the dependency DAG, giving analysts a cost breakdown per transformation node rather than per job.
Technically, the branching model works by maintaining separate physical or logical namespaces for each branch in the target warehouse, materializing only changed nodes via incremental DAG diffing. The compile-time type safety relies on Rust’s macro system to validate SQL schemas against declared model contracts before any warehouse roundtrip.
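The incremental-diff idea is easy to sketch (illustrative only; rocky's internal data model isn't documented here): hash each model's compiled SQL, compare with the last materialized state, and rebuild only the changed nodes plus everything downstream of them.

```python
# Sketch of incremental DAG diffing for branch materialization (illustrative,
# not rocky's actual data model): rebuild only models whose compiled SQL
# changed, plus everything downstream of them.
import hashlib
from collections import defaultdict, deque

def sql_hash(sql: str) -> str:
    return hashlib.sha256(sql.encode()).hexdigest()

def nodes_to_rebuild(models: dict,        # name -> compiled SQL
                     deps: dict,          # name -> upstream model names
                     last_built: dict):   # name -> hash at last materialization
    changed = {m for m, sql in models.items()
               if last_built.get(m) != sql_hash(sql)}

    # Invert the dependency map so we can walk downstream from changed nodes.
    downstream = defaultdict(list)
    for model, ups in deps.items():
        for up in ups:
            downstream[up].append(model)

    rebuild, queue = set(changed), deque(changed)
    while queue:
        for child in downstream[queue.popleft()]:
            if child not in rebuild:
                rebuild.add(child)
                queue.append(child)
    return rebuild

# Example: only `orders` changed, so `orders` and its dependent `revenue`
# are rebuilt in the branch's namespace; `customers` is reused as-is.
models = {"customers": "select ...", "orders": "select ... v2",
          "revenue": "select ..."}
deps = {"revenue": ["orders", "customers"], "orders": [], "customers": []}
last = {"customers": sql_hash("select ..."), "orders": sql_hash("select ..."),
        "revenue": sql_hash("select ...")}
print(nodes_to_rebuild(models, deps, last))   # {'orders', 'revenue'}
```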
Limitations: SQL dialect coverage is constrained to what the adapters expose, and the branch model adds warehouse storage overhead proportional to branch count. The compile-time guarantees are only as strong as the schema declarations — stale or missing declarations silently weaken them. Apache 2.0 licensed.
Source: https://github.com/rocky-data/rocky
jeremiah-masters/dlht
A lock-free concurrent hash table for Go implementing the DLHT (Deterministic Lock-free Hash Table) algorithm. The core design uses cooperative resizing: rather than a single goroutine owning the resize operation while others block, all concurrent accessors participate in migrating buckets incrementally, amortizing the cost across operations. This avoids the latency spike that plagues standard resize-on-threshold designs.
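A single-threaded sketch of the amortization idea (illustrative only; it deliberately omits the atomics and lock-free machinery that make the real DLHT work under concurrency): each operation migrates a bounded number of old buckets, so no single call pays for the full rehash.

```python
# Incremental ("cooperative") resizing, illustrated without concurrency.
class IncrementalMap:
    MIGRATE_PER_OP = 2                      # old buckets moved per operation

    def __init__(self, nbuckets=8):
        self.new = [[] for _ in range(nbuckets)]
        self.old = None                     # table being drained, if any
        self.cursor = 0                     # next old bucket to migrate
        self.count = 0

    def _idx(self, key, table):
        return hash(key) % len(table)

    def _migrate_some(self):
        if self.old is None:
            return
        for _ in range(self.MIGRATE_PER_OP):
            if self.cursor == len(self.old):
                self.old = None             # migration finished
                return
            for k, v in self.old[self.cursor]:
                self.new[self._idx(k, self.new)].append((k, v))
            self.old[self.cursor] = []
            self.cursor += 1

    def put(self, key, value):
        self._migrate_some()
        if self.old is None and self.count > 3 * len(self.new):
            # Start a resize: the old table drains incrementally from now on.
            self.old, self.new = self.new, [[] for _ in range(2 * len(self.new))]
            self.cursor = 0
        existed = False
        if self.old is not None:            # drop any stale copy in the old table
            ob = self.old[self._idx(key, self.old)]
            before = len(ob)
            ob[:] = [(k, v) for k, v in ob if k != key]
            existed = len(ob) != before
        nb = self.new[self._idx(key, self.new)]
        for i, (k, _) in enumerate(nb):
            if k == key:
                nb[i] = (key, value)
                return
        nb.append((key, value))
        if not existed:
            self.count += 1

    def get(self, key, default=None):
        self._migrate_some()
        tables = (self.new, self.old) if self.old is not None else (self.new,)
        for table in tables:
            for k, v in table[self._idx(key, table)]:
                if k == key:
                    return v
        return default

m = IncrementalMap()
for i in range(100):
    m.put(i, i * i)              # resize cost is spread across these calls
print(m.get(63))                 # 3969
```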
Cache efficiency comes from bucket packing — entries within a bucket are stored in a contiguous array sized to fit within a cache line, reducing pointer chasing relative to chained-list designs. The hash function and probe sequence are fixed at compile time via generics.
Lock-freedom here means no mutex acquisition on the fast path; progress guarantees are obstruction-free under the cooperative migration protocol rather than strictly wait-free. Readers are never blocked by writers except under ABA-hazard resolution handled via hazard pointers or epoch-based reclamation (check the implementation for which variant is active).
For Go specifically, integrating lock-free structures is nontrivial: raw pointer manipulation has to go through unsafe and sync/atomic while staying within the rules the Go runtime and garbage collector impose on pointer use. That constraint is the main caveat implied by the design. The benchmark targets high-concurrency, read-heavy workloads where contention on sync.Map becomes a bottleneck.
Source: https://github.com/jeremiah-masters/dlht
Prompthon-IO/agent-systems-handbook
A structured reference covering the engineering side of production LLM agent systems. The content spans agent loop architectures, agentic workflow patterns (plan-execute, ReAct, reflection), LangGraph-based orchestration, and the MCP (Model Context Protocol) and A2A (Agent-to-Agent) communication standards. Context engineering — managing what goes into the context window at each step — gets dedicated treatment, which is often underrepresented in agent tutorials.
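A minimal agent-loop sketch of that context-engineering concern (placeholder call_llm() and toy tools, not code from the handbook): the history is trimmed to a budget before every model call, so what enters the context window at each step is an explicit decision rather than an accident of accumulation.

```python
# Minimal tool-use agent loop with explicit context trimming (illustrative).
import json

TOOLS = {
    "search": lambda q: f"results for {q!r}",      # stand-in tool implementations
    "calculator": lambda expr: str(eval(expr)),    # illustration only
}

def trim_context(messages, budget_chars=8000):
    """Keep the system prompt plus the most recent turns that fit the budget."""
    system, rest = messages[0], messages[1:]
    kept, used = [], 0
    for msg in reversed(rest):
        used += len(msg["content"])
        if used > budget_chars:
            break
        kept.append(msg)
    return [system] + list(reversed(kept))

def run_agent(task, call_llm, max_steps=8):
    messages = [{"role": "system", "content": "You are a tool-using agent."},
                {"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(trim_context(messages))     # model decides: tool or answer
        messages.append({"role": "assistant", "content": reply})
        action = json.loads(reply)                   # e.g. {"tool": "search", "input": "..."}
        if action.get("final"):
            return action["final"]
        result = TOOLS[action["tool"]](action["input"])
        messages.append({"role": "tool", "content": result})
    return "step limit reached"
```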
The handbook covers agent memory taxonomies (in-context, external vector, key-value, episodic), evaluation methodology for non-deterministic agent trajectories, and observability instrumentation (tracing tool calls, measuring step success rates, cost-per-task). The multi-agent architecture section addresses coordinator-worker patterns, shared state management, and failure isolation.
Current-focus sections on verifiable RAG and emerging agent runtimes reflect the state of the field circa 2025: retrieval pipelines with grounding verification, and runtime environments like execution sandboxes or tool registries. The practical bias is notable — this is less a survey paper and more an opinionated engineering guide, closer to a runbook than a literature review.
Useful as a structured onboarding document for engineers building agent infrastructure rather than using agent frameworks as black boxes.
Source: https://github.com/Prompthon-IO/agent-systems-handbook
getagentseal/codeburn
A terminal UI dashboard (TUI) for tracking token consumption and cost attribution across AI coding assistants — currently Claude Code, OpenAI Codex, and Cursor. The core value proposition is observability: per-session, per-file, and per-operation breakdowns of token usage, rendered interactively in the terminal.
Technically, codeburn parses local log files and usage telemetry emitted by the respective tools, aggregates them into a cost model using each provider’s current pricing, and renders the result via a TUI framework (likely ratatui or similar, given the Rust ecosystem trend). The interactive interface allows drilling down by time window, project, or operation type.
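The aggregation step is simple to sketch (the log schema and prices below are assumptions, not codeburn's actual format or numbers): roll per-event token counts up to per-session, per-file cost with a static price table.

```python
# Hypothetical usage-log aggregation in the style described above.
import json
from collections import defaultdict

# USD per million tokens -- placeholder numbers; real pricing drifts.
PRICES = {"claude-code": {"input": 3.0, "output": 15.0},
          "codex":       {"input": 1.5, "output": 6.0}}

def session_costs(log_path):
    totals = defaultdict(float)
    with open(log_path) as fh:
        for line in fh:                         # one JSON usage event per line
            ev = json.loads(line)
            p = PRICES[ev["tool"]]
            cost = (ev["input_tokens"]  * p["input"] +
                    ev["output_tokens"] * p["output"]) / 1_000_000
            totals[(ev["session_id"], ev["file"])] += cost
    return totals

# Example: print a per-session, per-file breakdown from a local log.
for (session, path), usd in sorted(session_costs("usage.jsonl").items()):
    print(f"{session}  {path:<32} ${usd:.4f}")
```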
The problem being solved is real: Claude Code in particular can burn through millions of tokens in a single agentic session, and the provider dashboards offer only coarse-grained summaries with significant delay. Having a local, real-time view closes the feedback loop during development.
Limitations to note: cost models are hardcoded against current pricing tiers and will drift as providers reprice; parsing log formats is fragile against tool version updates; and the “observability” is retrospective rather than predictive (no budget enforcement or rate limiting). Still, at 6k+ stars shortly after release, the demand signal is clear — cost transparency for agentic coding is an unmet need.
Source: https://github.com/getagentseal/codeburn
earthtojake/text-to-cad
A collection of CAD skills and test harnesses for driving 3D model generation through LLM-based coding agents. The project targets the workflow where a coding agent (e.g., Claude, GPT-4o) writes parametric CAD scripts — in OpenSCAD, CadQuery, or similar — that are then executed to produce geometry, rather than generating mesh data directly.
The “skills” are structured prompt templates and tool definitions that teach agents the idioms of a given CAD scripting environment: how to parameterize dimensions, construct boolean operations, handle coordinate systems, and produce exportable geometry. The harnesses provide automated evaluation: given a text description and a reference model, measure geometric similarity of the agent’s output.
This is a harder problem than it looks. CAD scripts are imperative and sensitive to parameter order; agents frequently produce syntactically valid but geometrically degenerate outputs (zero-volume solids, non-manifold meshes). The harnesses address this by running the generated scripts in a sandboxed executor and applying mesh validation checks before scoring.
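A minimal version of that validation gate might look like the following (the openscad invocation is an assumption about the scripting backend; trimesh supplies the geometry checks): run the generated script in a separate process with a timeout, then reject degenerate meshes before any similarity scoring.

```python
# Execute an agent-generated OpenSCAD script, then validate the geometry.
import subprocess
import trimesh

def validate_output(scad_file: str, stl_file: str) -> bool:
    # Run the generated script out-of-process with a hard timeout, so a
    # hanging or crashing script cannot take down the harness.
    subprocess.run(["openscad", "-o", stl_file, scad_file],
                   check=True, timeout=60)

    mesh = trimesh.load(stl_file)
    checks = {
        "watertight": mesh.is_watertight,              # closed, manifold surface
        "consistent_winding": mesh.is_winding_consistent,
        "positive_volume": mesh.is_watertight and mesh.volume > 1e-9,
    }
    return all(checks.values())                        # only then score similarity
```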
The open-source framing means teams can swap in their preferred LLM backbone and extend the skill library for domain-specific geometry (mechanical parts, architectural elements). The 2.4k stars suggest traction among the generative CAD community that has been waiting for usable evaluation infrastructure.
Source: https://github.com/earthtojake/text-to-cad
Zafer-Liu/Data-Analysis-Agent
An LLM-powered data analysis agent targeted at business analysts without deep SQL or Python fluency. The architecture follows a standard tool-use loop: a natural-language query is parsed by an LLM, decomposed into a sequence of data retrieval and transformation steps, executed against connected data sources, and the results are summarized back in natural language with supporting visualizations.
The technical stack appears to use a ReAct-style agent loop with tool definitions for SQL execution, dataframe operations (likely pandas), and chart generation. The agent maintains conversation state to support iterative refinement — “now filter by region X” after an initial aggregate query.
For a business analyst audience, the design choices that matter are: (1) the SQL generation must handle ambiguous schema names gracefully, (2) error recovery when generated SQL fails needs to be automatic, and (3) visualizations need to be rendered without user configuration. Whether this implementation handles all three robustly is unclear from the repository description alone.
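Point (2), automatic recovery from failed SQL, is worth sketching because it is where naive implementations usually fall over (call_llm() is a placeholder; nothing below is taken from this repository): feed the database error back to the model and retry a bounded number of times.

```python
# Retry loop that repairs generated SQL by feeding the error back to the model.
import sqlite3
import pandas as pd

def run_query_with_repair(question, schema, call_llm, conn, max_attempts=3):
    prompt = f"Schema:\n{schema}\n\nWrite SQLite SQL answering: {question}"
    for _ in range(max_attempts):
        sql = call_llm(prompt)
        try:
            return pd.read_sql_query(sql, conn)      # success: hand back a dataframe
        except Exception as err:                     # syntax error, bad column, etc.
            prompt += (f"\n\nYour previous SQL failed:\n{sql}\n"
                       f"Error: {err}\nReturn a corrected query.")
    raise RuntimeError(f"no valid SQL after {max_attempts} attempts")

# Example wiring (hypothetical database file and LLM callable):
# conn = sqlite3.connect("sales.db")
# df = run_query_with_repair("total revenue by region", schema_text, call_llm, conn)
```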
The Chinese-language documentation signals a primary target market, which may affect the LLM backend choices (likely includes Qwen or similar models with strong Chinese-language performance alongside OpenAI). At 1.1k stars, it has traction as a reference implementation for agent-driven BI tooling.
Source: https://github.com/Zafer-Liu/Data-Analysis-Agent
lightseekorg/tokenspeed
A self-described “speed-of-light” LLM inference engine, targeting maximum throughput and minimum latency for token generation. The name and tagline suggest an aggressive optimization focus, though the repository is early-stage and the technical documentation is sparse relative to the ambition.
From the available code and structure, the engine appears to implement continuous batching (dynamic batching of requests at the decode step rather than padding to fixed sequence lengths), KV cache management with paged or chunked allocation to avoid fragmentation, and likely CUDA kernel fusion for attention and feed-forward layers. These are the standard techniques in production inference engines like vLLM and TensorRT-LLM.
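Of those techniques, paged KV-cache allocation is the one most worth internalizing; a minimal sketch of the idea (vLLM-style, not tokenspeed's actual code) keeps a pool of fixed-size blocks and a per-sequence block table, so memory never fragments around variable-length sequences.

```python
# Paged KV-cache bookkeeping: fixed-size blocks, per-sequence block tables.
BLOCK_TOKENS = 16                         # tokens stored per block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))      # pool of physical block ids
        self.tables = {}                         # seq_id -> list of block ids
        self.lengths = {}                        # seq_id -> tokens written

    def append_token(self, seq_id):
        """Reserve space for one more token, growing by a block only when the
        sequence's last block is full."""
        table = self.tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_TOKENS == 0:           # last block full (or none yet)
            if not self.free:
                raise MemoryError("cache exhausted -- preempt or swap a sequence")
            table.append(self.free.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % BLOCK_TOKENS  # where this token's K/V go

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

# In a continuous-batching loop, append_token() runs once per sequence per
# decode step, and release() runs as soon as a sequence emits EOS, so freed
# blocks are immediately available to newly admitted requests.
cache = PagedKVCache(num_blocks=4)
print(cache.append_token("req-1"))   # (3, 0): first token lands in a fresh block
```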
The differentiation claim — if it holds — would need to come from kernel-level optimizations beyond what vLLM provides: custom CUDA kernels for specific hardware targets, more aggressive speculative decoding, or a different memory management strategy. At 973 stars with limited documentation, it is difficult to independently verify the performance claims.
The reason to watch this project: inference throughput is still an active research and engineering problem, and the gap between reference implementations (HuggingFace Transformers) and optimized engines (TensorRT-LLM) remains large. New entrants that carve out specific hardware or model-size niches can offer genuine value. Evaluate with benchmarks before committing.