Daily AI Digest — 2026-05-10
Hacker News Signals
LLMs corrupt your documents when you delegate
Source: https://arxiv.org/abs/2604.15597
The paper investigates a class of failures that occur when LLMs act as document-editing agents: the model silently introduces factual errors, stylistic drift, or outright fabrications into documents it is asked to revise, summarize, or transform. The problem matters because agentic pipelines that loop LLM output back into persistent storage (codebases, wikis, legal drafts) have no automatic integrity check — corruption accumulates across iterations.
The core finding is that standard instruction-following fine-tuning creates a tension between the model’s parametric knowledge and the document’s ground truth. When asked to rewrite or extend text, models regularize toward their priors, subtly overwriting factual claims that conflict with what they “expect” the text to say. The effect is measurable even on single-pass edits and worsens with longer documents where attention to source fidelity degrades.
The authors construct a benchmark of controlled corruption scenarios across domains (technical documentation, medical text, legal clauses) and evaluate several frontier models. They find corruption rates are non-trivial even at temperature zero, and that post-hoc verification prompts (“did you change any facts?”) catch fewer than half the introduced errors — models are poorly calibrated about their own edits.
Mitigation experiments include diff-based prompting (asking the model to output only a delta rather than a full rewrite), retrieval-grounded generation, and explicit self-consistency checks. Diff-based prompting shows the largest reduction in corruption rate, though it introduces its own failure mode when models produce syntactically valid but semantically incorrect diffs.
The open question is how to build lightweight, reliable integrity verification that does not itself depend on an LLM. The authors note that deterministic diff tools can detect structural changes but not semantic corruption, leaving the problem partially unsolved for prose documents without ground-truth references.
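A minimal sketch of the deterministic structural check described above, using Python's difflib: a plain diff surfaces every line the model touched for human review, while remaining blind to whether a change is semantically faithful. The helper name and example strings are illustrative, not taken from the paper.

```python
import difflib

def structural_diff(original: str, edited: str) -> list[str]:
    """Surface every line the model touched so a reviewer can inspect it.

    This is only the deterministic structural check: it catches insertions,
    deletions, and rewrites, but says nothing about whether a changed line
    is factually faithful to the source document.
    """
    return list(difflib.unified_diff(
        original.splitlines(keepends=True),
        edited.splitlines(keepends=True),
        fromfile="original", tofile="edited",
    ))

# Example: an LLM "rewrite" that silently changes a factual claim still shows
# up as a structural change, but only a reviewer can judge its correctness.
before = "The API was released in 2019.\nIt supports batch requests.\n"
after = "The API was released in 2021.\nIt supports batch requests.\n"
for line in structural_diff(before, after):
    print(line, end="")
```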
Teaching Claude Why
Source: https://www.anthropic.com/research/teaching-claude-why
Anthropic’s post describes a shift in their RLHF/RLAIF methodology away from rule-based behavioral constraints toward what they call “value internalization” — training Claude to understand the reasoning behind safety guidelines rather than pattern-matching to a list of prohibitions. The technical substance centers on how this changes both the training data construction and the reward modeling process.
The core claim is that rule-following models generalize poorly to novel situations: a model trained never to produce output matching a set of harmful categories can be jailbroken by surface rephrasing. By contrast, a model that has been trained on explanations of why a behavior is harmful is expected to generalize the underlying principle to new surface forms.
In practice this means constitutional-AI-style self-critique prompts are augmented with causal explanations. Rather than “this response violates policy X,” the training signal is paired with a chain of reasoning that connects the behavior to harm mechanisms. The reward model is then trained to score responses higher when they exhibit reasoning consistent with these explanations, not just when they avoid flagged outputs.
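A hypothetical sketch of what one explanation-augmented preference record could look like under this description; the field names and example text are invented here and are not Anthropic's actual training schema.

```python
# Hypothetical shape of one explanation-augmented training record, assembled
# from the post's description; field names and values are invented here and
# are not Anthropic's actual data format.
record = {
    "prompt": "How do I find someone's home address from their username?",
    "rejected_response": "Here are some people-search sites you can try: ...",
    "critique": (
        "The request enables locating a specific person without consent. "
        "Revealing a home address creates a concrete risk of stalking or "
        "harassment; the harm follows from the capability itself, not from "
        "any particular wording of the request."
    ),
    "preferred_response": (
        "I can't help locate a private individual's home address. If you're "
        "trying to reconnect with someone, I can suggest approaches that "
        "respect their consent."
    ),
}

# Under the post's framing, the reward model is trained to prefer responses
# whose reasoning is consistent with the critique's harm mechanism, rather
# than responses that merely avoid a list of flagged categories.
```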
A secondary technical point is robustness to distribution shift. When the model encounters a genuinely ambiguous request, a principle-aware model can reason about trade-offs rather than defaulting to refusal or compliance based on surface features. The post gives examples where over-refusal is reduced on legitimate edge cases without increasing harmful outputs on adversarial ones — though no quantitative breakdown is provided.
The limitation is that this approach requires high-quality explanatory training data at scale, which is expensive to construct and introduces its own biases. There is also no formal guarantee that internalized “values” are stable under further fine-tuning or that they generalize beyond the distribution of explanations seen during training. The approach is empirical and the evaluation is largely qualitative, which is a notable gap for a claim of this scope.
A polynomial autoencoder beats PCA on transformer embeddings
Source: https://ivanpleshkov.dev/blog/polynomial-autoencoder/
The post describes a nonlinear autoencoder where both encoder and decoder are implemented as explicit polynomial functions of their input, rather than multi-layer neural networks or linear projections. The motivation is that transformer embeddings live on curved manifolds, and PCA (a linear method) cannot capture the intrinsic geometry without a large number of components.
The architecture is straightforward: for a degree-d polynomial encoder, the latent code is z = \sum_{|\alpha| \leq d} w_\alpha x^{\alpha}, where \alpha is a multi-index, x^{\alpha} = \prod_i x_i^{\alpha_i} is the corresponding monomial, and each w_\alpha is a learned coefficient vector in the latent space. In practice degree 2 is used to keep the parameter count tractable, giving terms up to x_i x_j. The decoder is a symmetric polynomial map back to the original dimension. Training minimizes reconstruction loss with an \ell_2 regularizer on the polynomial coefficients.
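A minimal PyTorch sketch of a degree-2 encoder consistent with the formula above, including the random subsampling of cross-terms the author uses to keep the quadratic expansion tractable; the dimensions, names, and pair count are assumptions, not the blog's actual code.

```python
import torch
import torch.nn as nn

class Degree2PolyEncoder(nn.Module):
    """Sketch of a degree-2 polynomial encoder.

    z = linear terms + weighted subsampled x_i * x_j cross-terms, with the
    (i, j) index pairs randomly subsampled because the full 768-dim expansion
    has roughly 295k pairs.
    """
    def __init__(self, in_dim=768, latent_dim=64, n_pairs=4096, seed=0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        # Random subset of (i, j) index pairs for quadratic interactions.
        self.register_buffer("idx_i", torch.randint(0, in_dim, (n_pairs,), generator=g))
        self.register_buffer("idx_j", torch.randint(0, in_dim, (n_pairs,), generator=g))
        self.linear = nn.Linear(in_dim, latent_dim)                  # degree-0 and degree-1 terms
        self.quadratic = nn.Linear(n_pairs, latent_dim, bias=False)  # degree-2 terms

    def forward(self, x):                                  # x: (batch, in_dim)
        cross = x[:, self.idx_i] * x[:, self.idx_j]        # (batch, n_pairs) monomials
        return self.linear(x) + self.quadratic(cross)      # (batch, latent_dim)
```

A mirror-image polynomial decoder plus an MSE reconstruction loss with weight decay (the \ell_2 term on coefficients) would complete the training loop.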
The key empirical result is that at the same bottleneck dimension (e.g., 32 or 64 dimensions from a 768-dimensional BERT embedding), the polynomial autoencoder achieves lower reconstruction error and better downstream classification accuracy than PCA, and also outperforms a shallow MLP autoencoder on several text classification tasks. The polynomial structure provides interpretability benefits: each latent dimension corresponds to a weighted sum of monomials, so importance of interactions can be read off coefficient magnitudes.
The limitations are significant. The degree-2 expansion for a 768-dimensional input nominally produces ~295k cross-terms; the author handles this with random feature subsampling and sparse coefficient tensors, which reintroduces approximation error. Scalability beyond BERT-sized embeddings is unclear. There is also no comparison to modern alternatives like variational autoencoders or flow-based models, so the claim “beats” should be read narrowly: it beats PCA and a specific shallow MLP on the tested benchmarks. Still, the polynomial inductive bias is an underexplored direction worth revisiting.
Hardening Firefox with Claude Mythos Preview
Source: https://hacks.mozilla.org/2026/05/behind-the-scenes-hardening-firefox/
Mozilla’s engineering post describes an experiment using Anthropic’s Claude Mythos Preview (a long-context, tool-use-capable model) to assist with systematic security hardening of Firefox’s C++ and Rust codebase. The technical substance covers two pipelines: static analysis triage and patch generation.
For static analysis triage, the model is given the output of tools like AddressSanitizer, Coverity, and custom Mozilla fuzzing harnesses, along with relevant source context (up to tens of thousands of tokens), and asked to classify findings by exploitability and suggest fix strategies. The goal is to reduce the manual triage burden for the security team, which deals with thousands of low-signal static-analysis alerts per release cycle.
For patch generation, the pipeline is more involved. The model is given a confirmed vulnerability class (e.g., a use-after-free pattern or an integer overflow in a parser), the relevant call graph context, and existing test cases, and asked to produce a patch plus an explanation of why the patch closes the vulnerability. Human engineers review all patches; the post reports that roughly 30% of model-generated patches were accepted with minor modifications, which they describe as meaningfully accelerating throughput without reducing review quality.
The Rust/C++ boundary is a particular focus: Firefox’s ongoing Rust rewrite creates interfaces where memory safety guarantees are weakened, and the model was specifically prompted to reason about unsafe blocks and FFI boundaries. The post includes examples of the model correctly identifying that a safe Rust wrapper was passing a raw pointer to a C function that aliased it, a subtle correctness issue that evades standard linting.
Open questions include how to evaluate the model’s false-negative rate on vulnerability detection (it may be confidently wrong), and whether the approach generalizes beyond a codebase with unusually rich documentation and test coverage.
Gemini API File Search is now multimodal
Google’s update extends the Gemini API’s file search (grounded retrieval) capability from text-only to multimodal corpora, meaning retrieval can now be triggered by or return image, audio, and video chunks alongside text. The technical substance is in the indexing and retrieval architecture.
Previously, file search used a text embedding index over uploaded documents. The multimodal extension adds a joint embedding space where image regions, audio segments, and video keyframes are projected into the same vector space as text tokens using Gemini’s native multimodal encoder. Queries can be text or mixed-modality, and retrieved chunks can be any modality, which are then passed to the generation model as context.
The practical implication for RAG pipelines is that developers can now build retrieval over heterogeneous corpora — for example, a technical documentation system that indexes both PDF text and inline diagrams, returning the relevant figure when a query is about a visual concept rather than a textual one. The API exposes this through the existing files.search endpoint with an updated schema that specifies chunk modality and returns base64-encoded media chunks alongside text.
The retrieval scoring is not described in detail; it is presumably cosine similarity in the joint embedding space, but Google does not disclose whether cross-modal retrieval uses a single encoder or a contrastive alignment layer on top of separate encoders. Latency implications of returning large media chunks are also unaddressed.
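Since the scoring is undocumented, the following sketch only illustrates the presumed mechanism, cosine similarity over a shared embedding space of mixed-modality chunks; the metadata fields and the upstream embedding step are invented here and are not part of the published API.

```python
import numpy as np

def cosine_topk(query_vec, chunk_vecs, chunk_meta, k=5):
    """Rank already-embedded chunks (any modality) by cosine similarity.

    Assumes a single joint embedding space where text, image, audio, and
    video chunks have been projected to the same dimensionality; the
    embedding step and metadata fields are hypothetical, since the actual
    encoder and schema are not exposed.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                          # cosine similarity per chunk
    top = np.argsort(-scores)[:k]
    return [(chunk_meta[i]["modality"], float(scores[i])) for i in top]
```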
The main limitation from a research perspective is that this is an opaque hosted service — there is no way to inspect the embedding model, adjust retrieval parameters beyond top-k, or understand calibration of cross-modal similarity scores. For production use cases requiring auditable retrieval, this is a meaningful constraint.
Noteworthy New Repositories
kyegomez/OpenMythos
A speculative reverse-engineering effort that attempts to reconstruct the architectural decisions behind Anthropic’s Claude models from publicly available research literature. The project synthesizes components from Constitutional AI, RLHF, sparse attention, and mixture-of-experts research into a coherent hypothetical architecture. The reconstruction is built from first principles rather than leaked weights or internal documents, meaning it is an interpretive assembly of publicly known techniques plausibly consistent with Claude’s observed behavior. Technical content includes implementations of constitutional self-critique loops, preference model training pipelines, and multi-stage RLHF scaffolding. The codebase follows a modular design inspired by the Zeta library pattern common to Kye Gomez’s other open-source work. Value here is primarily pedagogical: it exposes the compositional logic of alignment-focused LLM training stacks in a single navigable repository. Researchers familiar with Anthropic’s published papers on Constitutional AI (Bai et al., 2022) will recognize the design choices being approximated. Notable caveat: this is architectural speculation, not an empirical replication with benchmarked parity against Claude. There is no training compute or dataset provenance documentation, so claims about fidelity are unverifiable. Useful as a structured reading companion to Anthropic’s published work, less useful as a production baseline.
Source: https://github.com/kyegomez/OpenMythos
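For orientation, a generic sketch of the Constitutional-AI-style critique-and-revise pattern the repository reimplements (Bai et al., 2022); the prompts and the generate callable are placeholders, not code from OpenMythos.

```python
from typing import Callable

PRINCIPLE = (
    "Choose the response that is most helpful while avoiding content "
    "that could facilitate harm."
)

def critique_and_revise(user_prompt: str,
                        generate: Callable[[str], str],
                        n_rounds: int = 2) -> str:
    """Generic critique-and-revise loop in the Constitutional AI style.

    `generate` is any prompt -> completion function (a chat API call in
    practice); nothing here is taken from the OpenMythos code itself.
    """
    draft = generate(user_prompt)
    for _ in range(n_rounds):
        critique = generate(
            f"Principle: {PRINCIPLE}\nRequest: {user_prompt}\n"
            f"Draft response: {draft}\n"
            "Point out any way the draft conflicts with the principle."
        )
        draft = generate(
            f"Request: {user_prompt}\nDraft response: {draft}\n"
            f"Critique: {critique}\n"
            "Rewrite the draft so it addresses the critique."
        )
    return draft
```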
future-agi/future-agi
An end-to-end observability and evaluation platform for LLM and agent applications, self-hostable under Apache 2.0. The core technical stack covers five integrated subsystems: distributed tracing of LLM calls and multi-step agent trajectories; an evaluation framework supporting both reference-based and LLM-as-judge scoring; simulation environments for offline agent rollout testing; dataset management with versioning; and a gateway layer providing unified routing and rate-limiting across providers. The guardrails module supports both input and output filtering with configurable policy enforcement. Architecturally the platform resembles a combination of LangSmith-style tracing and Weights & Biases-style experiment tracking, unified under a single data model that links traces to eval runs to dataset slices. The tracing implementation uses OpenTelemetry-compatible spans, allowing integration with existing observability infrastructure. The simulation component is the most technically differentiated feature: it allows offline replay of agent trajectories against modified environments or model versions without live API calls, which is critical for regression testing agents with non-deterministic tool use. Because the platform is self-hostable, teams retain full trace data, which matters for compliance-sensitive deployments. Active development and 911 stars suggest early but growing adoption.
Source: https://github.com/future-agi/future-agi
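To illustrate the OpenTelemetry-compatible tracing claim, a sketch of the kind of spans such a platform can ingest, built with the standard opentelemetry-sdk; the attribute names are illustrative, not the platform's actual trace schema.

```python
# Requires: pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console for demonstration; a real deployment would
# point the exporter at the platform's collector endpoint instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-demo")

with tracer.start_as_current_span("agent.step") as step:
    step.set_attribute("agent.tool", "web_search")          # illustrative attributes,
    with tracer.start_as_current_span("llm.call") as call:  # not the platform's schema
        call.set_attribute("llm.model", "example-model")
        call.set_attribute("llm.prompt_tokens", 412)
        call.set_attribute("llm.completion_tokens", 96)
```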
GammaLabTechnologies/harmonist
An agent orchestration runtime distinguished by two design constraints: portability (zero runtime dependencies, single binary distribution) and mechanical protocol enforcement. The 186 bundled agents communicate via a formally specified protocol layer that enforces message schema validation, capability negotiation, and turn-taking contracts at the transport level rather than relying on prompt-level conventions. This shifts correctness guarantees from probabilistic (LLM follows instructions) to deterministic (invalid messages are rejected by the runtime). The orchestration model is graph-based: agents are nodes, typed channels are edges, and the scheduler enforces topological ordering with cycle detection. Protocol enforcement is implemented as a state machine per channel that validates message sequences against a finite automaton derived from the agent’s declared capability schema. The zero-dependency constraint means the runtime compiles to a self-contained executable, relevant for edge deployment or air-gapped environments where pip/npm dependency resolution is unavailable. The 186 pre-built agents span common tool-use categories (web search, code execution, file I/O, API calling), each with declared input/output schemas that the protocol layer uses for static compatibility checking before runtime. The mechanical enforcement approach is architecturally closer to Erlang’s OTP supervision trees than to LangGraph or AutoGen, trading flexibility for predictable failure modes.
Source: https://github.com/GammaLabTechnologies/harmonist
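A toy Python sketch of the per-channel enforcement idea: a state machine that rejects any message whose type is not a legal transition from the current protocol state. The automaton, states, and message types are invented for illustration; harmonist ships as a compiled binary and its schemas are not reproduced here.

```python
from dataclasses import dataclass

# Hypothetical automaton derived from a declared capability schema: keys are
# (current state, message type) pairs, values are the next protocol state.
TRANSITIONS = {
    ("idle", "capability_offer"): "negotiating",
    ("negotiating", "capability_ack"): "ready",
    ("ready", "request"): "awaiting_reply",
    ("awaiting_reply", "reply"): "ready",
}

@dataclass
class ChannelValidator:
    state: str = "idle"

    def accept(self, message_type: str) -> bool:
        # Reject any message whose type is not a legal transition from the
        # current state: mechanical, transport-level enforcement rather than
        # trusting the LLM to follow prompt-level conventions.
        nxt = TRANSITIONS.get((self.state, message_type))
        if nxt is None:
            return False            # runtime drops the out-of-protocol message
        self.state = nxt
        return True

v = ChannelValidator()
assert v.accept("capability_offer")     # legal: idle -> negotiating
assert not v.accept("reply")            # illegal: replies only follow requests
```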
amitshekhariitbhu/llm-internals
A structured educational repository covering LLM internals from tokenization through inference optimization, targeting engineers who want implementation-level understanding rather than API-level familiarity. Content is organized as a progressive curriculum: tokenization (BPE, WordPiece, SentencePiece with reference implementations), embedding layers and positional encodings (absolute, RoPE, ALiBi), attention mechanisms (scaled dot-product, multi-head, grouped-query, multi-query variants with complexity analysis), feed-forward blocks, layer normalization placement (pre-norm vs. post-norm stability tradeoffs), and KV cache mechanics. The inference optimization section covers quantization (INT8, GPTQ, AWQ), speculative decoding, continuous batching, and PagedAttention. Each section pairs conceptual explanation with minimal NumPy or PyTorch implementations designed to be readable rather than performant. This is pedagogically important: production implementations in vLLM or llama.cpp obscure the algorithmic structure behind systems optimizations. The repository fills a gap between tutorial-level transformer introductions and production inference codebases. At 975 stars and active commits, it is positioned as a reference maintained alongside current literature. Limitations: coverage of training-time internals (optimizer states, gradient checkpointing, ZeRO stages) is thinner than the inference side, and MoE routing mechanisms are not yet covered.
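In the spirit of the repository's minimal NumPy implementations, a readable scaled dot-product attention reference; the shapes and the toy causal-mask usage below are illustrative, not copied from the repository.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Readable reference attention, prioritizing clarity over performance.

    Q, K, V: (seq_len, d_k) arrays; mask: optional (seq_len, seq_len) additive
    mask (e.g. -inf above the diagonal for causal decoding).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # similarity logits
    if mask is not None:
        scores = scores + mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

# Causal (decoder-style) usage on toy data.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
causal = np.triu(np.full((4, 4), -np.inf), k=1)   # -inf above the diagonal
out = scaled_dot_product_attention(x, x, x, mask=causal)
```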