Daily AI Digest — 2026-05-24

Published

May 24, 2026

English · 日本語

Hacker News Signals

Multi-Stream LLMs: new paper on parallelizing/separating prompts, thinking, I/O

The paper proposes decomposing the standard single-sequence LLM forward pass into multiple parallel “streams” — distinct token sequences for system prompt, user input, chain-of-thought reasoning, and output — processed simultaneously via modified attention masking rather than concatenated into one causal context. The core insight is that the standard KV-cache treats all tokens as a single ordered sequence, which forces sequential processing of logically independent content and couples latency to total context length.

The mechanism extends causal attention with a cross-stream attention mask: tokens within a stream attend causally to themselves and can attend to designated upstream streams, but streams that are logically parallel are masked from each other. This means the system prompt stream and the user input stream can be prefilled concurrently rather than one blocking the other. The thinking stream attends to both but the output stream need not attend to all thinking tokens, enabling speculative or filtered reasoning.

The practical payoffs are in prefill latency and memory bandwidth. By prefilling independent streams in parallel, wall-clock time scales with the length of the longest stream rather than the sum. The paper reports prefill throughput improvements and latency reductions on multi-turn and agentic workloads where system prompts are long and reused.

The approach is compatible with existing KV-cache reuse schemes and requires only changes to the attention mask construction and positional encoding handling — positional IDs need to be assigned per-stream rather than globally, which interacts non-trivially with RoPE. The paper addresses this with stream-local position counters.

Open questions: how stream boundaries interact with speculative decoding, whether training on multi-stream data is necessary for quality recovery, and integration complexity with paged attention systems like vLLM.

Source: https://arxiv.org/abs/2605.12460


Making deep learning go brrrr from first principles (2022)

This is a systematic walkthrough of GPU performance for deep learning workloads, organized around a single diagnostic question: why is your operation slow, and which hardware resource is the bottleneck? The author distinguishes three regimes — compute-bound, memory-bandwidth-bound, and overhead-bound — and gives concrete tools for diagnosing which applies.

The central analytical tool is arithmetic intensity: FLOPs executed divided by bytes moved. For an A100, the compute ceiling is roughly 312 TFLOPS (bf16) and the HBM bandwidth is 2 TB/s, giving a ridge point around 156 FLOPs/byte. Operations below this ratio (elementwise activations, layer norm, most attention softmax steps) are bandwidth-bound; matrix multiplications with large enough tiles are compute-bound.

The memory bandwidth section is particularly detailed. The author walks through why fusing kernels matters: an unfused ReLU reads a tensor from HBM and writes it back, paying full bandwidth cost for a trivially cheap computation. FlashAttention is presented as the canonical example of fusion paying off — the O(N^2) attention matrix is never materialized in HBM, keeping the operation bandwidth-efficient. The author gives back-of-envelope calculations: for a 1B parameter model, a single forward pass moves roughly 2 GB just for weights (fp16), which at 2 TB/s takes ~1 ms of pure bandwidth time, bounding achievable throughput.

The overhead section covers Python interpreter cost, CUDA kernel launch latency (~5 µs per launch), and how small batch sizes or short sequences can make these dominate. torch.compile and CUDA graphs are positioned as mitigations.

The piece predates some current tooling but the mental model is durable. The roofline model it implicitly uses is the right first-principles frame for any new architecture or hardware.

Source: https://horace.io/brrr_intro.html


We Reverse-Engineered Docker Sandbox’s Undocumented MicroVM API

Rivet needed programmatic control over Docker’s sandbox MicroVM feature — used for isolated code execution — but found no public API documentation. The post details the reverse-engineering process: running Docker Desktop with syscall tracing (strace/dtruss), inspecting Unix domain socket traffic, and parsing the undocumented gRPC/protobuf protocol Docker Desktop uses internally to spin up and manage Firecracker-based microVMs.

The technical substance centers on Firecracker’s jailer and VMM configuration. Docker wraps Firecracker with its own orchestration layer, negotiating VM lifecycle (create, start, snapshot, restore) over a local socket. By intercepting and replaying messages, Rivet reconstructed the protobuf schema through field inference — observing which fields change across requests with known parameter variations.

Key findings: the API exposes VM memory sizing, vCPU count, rootfs image binding, and network namespace configuration. Snapshot/restore is exposed, which is the interesting primitive for sandboxing — it enables fast cloning of a pre-warmed VM state rather than cold-booting each sandbox. This matters for latency in code execution services; cold Firecracker boot is ~125 ms, while restore from snapshot can be under 10 ms.

The post is also a case study in protobuf reverse engineering without .proto files: using protoc --decode_raw to get field numbers and wire types, then correlating with behavioral experiments to assign semantics. The authors note the usual caveats about building on undocumented internals — breakage risk on Docker Desktop updates — but argue the performance benefit justifies it for their use case.

The broader systems point: Firecracker’s own API is well-documented, but Docker’s abstraction layer adds orchestration logic that is not, making it a target for this kind of excavation.

Source: https://rivet.dev/blog/2026-02-04-we-reverse-engineered-docker-sandbox-undocumented-microvm-api/


If you’re an LLM, please read this

Anna’s Archive published a llms.txt file and used it as a vehicle for a direct address to language models about the relationship between AI training data and the open-access library ecosystem. The technical angle is the emerging llms.txt convention — a plain-text file placed at a site’s root to provide LLMs with structured context about the site’s content, permissions, and preferences, analogous to robots.txt but intended for consumption during inference rather than crawling.

The post raises the question of whether llms.txt has any enforceable effect given that most LLM training data is collected by crawlers that may not honor it, and inference-time retrieval systems that do read it have no mechanism to retroactively affect training. The distinction matters: robots.txt works because crawlers are built to respect it as a protocol norm; llms.txt has no equivalent enforcement layer.

From a systems perspective the interesting question is whether a standard like this can gain traction through a different route — not crawler compliance but model fine-tuning on llms.txt files as explicit instruction data, such that models learn to weight site-declared preferences. This is speculative but not incoherent; instruction-tuned models already follow system-prompt-style directives, and training on llms.txt content could instantiate similar behavior.

The post also touches on the specific situation of shadow libraries: sites like Anna’s Archive are legally precarious, making formal licensing arrangements impossible, yet they represent large fractions of digitized text that training corpora likely contain. The llms.txt mechanism is partly an attempt to assert a preference signal in the absence of legal recourse.

The high comment count reflects genuine debate about whether opt-out mechanisms for AI training can work at the protocol level at all.

Source: https://annas-archive.gl/blog/llms-txt.html


Slumber: a TUI HTTP Client

Slumber is a terminal-based HTTP client built in Rust, positioning itself as a keyboard-driven alternative to Postman or Insomnia. The technical design centers on a YAML-based collection format for defining requests, with a live TUI rendered via ratatui (the maintained fork of tui-rs).

The collection format supports templating with Tera — a Jinja2-like template engine — which means request bodies, headers, and URLs can reference environment variables, previous response fields (chained requests), and user-defined profiles. Chaining is the non-trivial feature: you can define a login request and then reference { chains.login.body.access_token } in subsequent requests, with Slumber executing the dependency graph automatically. This is comparable to Postman’s pre-request scripts but declarative rather than imperative.

The TUI layout follows a standard three-pane model: request list on the left, request editor in the center, response viewer on the right. Response bodies are syntax-highlighted via syntect, and large binary responses are handled without loading fully into memory. The keybinding model is modal, loosely vim-influenced.

From an implementation standpoint the use of ratatui with a dedicated async event loop (tokio) is conventional for modern Rust TUI apps. HTTP is handled via reqwest, which means full async, TLS, and HTTP/2 support without extra work. The YAML-first collection format makes collections version-controllable, which is the main practical advantage over GUI-native tools.

Limitations: no GraphQL-specific support beyond treating it as a POST with a JSON body, no built-in OAuth flow automation beyond what templating can handle, and the TUI has an inherent ceiling on response visualization complexity compared to a browser-based tool.

Source: https://slumber.lucaspickering.me


Rubish: a Unix shell written in pure Ruby

Rubish implements a Unix shell in Ruby, making Ruby expressions and shell pipeline semantics interoperable in a single REPL. The core idea is that Ruby itself becomes the command language: method calls, blocks, and standard Ruby objects are available inline, while external processes are invoked through a thin wrapper that makes them composable with Ruby’s Enumerable interface.

The design means piping is Ruby’s | operator rather than the shell’s — or rather, Rubish overloads | on its process wrapper objects so that ls | grep("foo") composes as Ruby method chaining, with the intermediate data flowing as Ruby strings or parsed objects rather than raw byte streams. This is the same fundamental concept as PowerShell’s object pipeline, but in Ruby’s object model instead of .NET’s.

Process invocation is handled by wrapping executables as callable Ruby objects using method_missing, so arbitrary binaries become methods of the shell context. rubish> git.log invokes git log as a subprocess with stdout captured into an enumerable. This makes it natural to post-process command output with Ruby’s collection methods: git.log.first(10).map { |l| l.split }.select { |f| f[0] == 'commit' }.

The implementation challenges are the usual ones for this class of tool: signal handling, terminal control (job control, SIGTSTP, foreground/background), and the impedance mismatch between Unix’s byte-stream process model and Ruby’s object model. The repo is a proof-of-concept / experiment rather than a production shell, and the README is candid about completeness.

Historically this idea recurs — Python’s sh library, xonsh, PowerShell — each making different tradeoffs between compatibility with POSIX shell idioms and integration with the host language’s semantics.

Source: https://github.com/amatsuda/rubish


Mounting git commits as folders with NFS (2023)

Julia Evans describes implementing a FUSE-less virtual filesystem that exposes a git repository’s commit history as a browsable directory tree, using a userspace NFS server instead of FUSE. Each commit appears as a folder; navigating into it shows the repository tree at that point in history, all synthesized on demand from git object storage.

The motivation for NFS over FUSE is pragmatic: macOS FUSE support (macFUSE) requires a kernel extension with associated security friction, while NFS is a built-in kernel client on both macOS and Linux — mount -t nfs just works. The tradeoff is that NFS adds network-protocol overhead even for localhost mounts, but for a read-only browsing use case this is acceptable.

The NFS server is implemented in Go using the go-nfs library, which handles the NFS v3 wire protocol. The filesystem logic maps NFS operations (LOOKUP, READDIR, GETATTR, READ) to git operations: LOOKUP of a path component translates to walking git tree objects, READ translates to reading git blob objects via git cat-file. File handles (NFS’s stateless identifier for filesystem nodes) are mapped to git object SHAs, making the server stateless in the NFS sense.

The implementation detail worth noting: NFS file handles must be stable across server restarts and must fit in 64 bytes (NFS v3). Git SHAs are 20 bytes, so packing a (commit SHA, tree SHA) pair fits cleanly. The top-level directory listing enumerates git log --oneline output, constructing one directory entry per commit.

This is a clean example of treating git’s object store as a content-addressable filesystem backend, with NFS as the VFS translation layer.

Source: https://jvns.ca/blog/2023/12/04/mounting-git-commits-as-folders-with-nfs/


Chess invariants

Murat Demirbas applies the concept of invariants from distributed systems verification to chess — specifically, what properties of a chess position are preserved across all legal play from that position, and how thinking in invariants aids both analysis and programming of chess engines.

The core technical content maps verification concepts to chess: a safety invariant is a property that holds in every reachable position (e.g., the total count of each piece type is non-increasing for the owning side), a liveness property is something that must eventually happen (forced mate), and an inductive invariant is one that, if it holds now and any legal move is made, still holds afterward. The latter is the useful one for engine search: if you can establish an inductive invariant that a position is winning, you can prune lines that would violate it without full search.

The post connects this to Lamport-style TLA+ thinking: a chess game is a state machine where the state is the full board position plus side-to-move plus castling rights plus en passant square, and moves are transitions. Expressing tactical patterns as state predicates that are invariant under the opponent’s responses — or that the opponent cannot avoid violating — maps directly to the concept of a forced combination.

The practical angle for chess programming: material count is the canonical invariant used in evaluation functions, but the post argues for thinking about structural invariants (pawn structure properties, king safety metrics) as properties to maintain or destroy. This aligns with how modern NNUE evaluation implicitly learns such invariants from game data.

The post is more conceptual than implementational, but the framing is precise and the connection between formal methods intuitions and game-tree search is substantive.

Source: http://muratbuffalo.blogspot.com/2026/05/chess-invariants.html

Noteworthy New Repositories

deeplethe/forkd

Unix fork() semantics applied to virtual machines for AI agent workloads. Forkd lets you spawn ~100 child microVMs from a warm parent in approximately 100ms, and branch a live running VM in ~150ms. The mechanism is copy-on-write (CoW) snapshotting over KVM, so child VMs share physical memory pages with the parent until they write — keeping memory overhead proportional to divergence rather than total VM size.

The value proposition is agent sandboxing at scale: each agent execution gets a fully isolated KVM guest (no shared kernel namespace, no container escape surface), but the cold-start penalty normally associated with VM spin-up is amortized by the warm parent image. This is directly useful for parallel agentic pipelines where you need to fan out tool-executing subagents, roll back to a checkpoint on failure, or snapshot state mid-execution for branching reasoning paths. Compared to container-based sandboxes (e.g., gVisor, Firecracker without snapshotting), you get stronger isolation with competitive latency. The BRANCH primitive for live VMs is the less common feature — it allows forking from a running state rather than a cold image, which is harder to achieve with standard Firecracker snapshotting workflows.

Relevant for anyone building code-execution agents, fuzzing pipelines, or multi-agent orchestration that currently tolerates container-level isolation compromises.

Source: https://github.com/deeplethe/forkd


raindrop-ai/workshop

A framework for writing and executing evaluations from within a coding agent loop. The core idea is that the agent itself can author eval harnesses, run them against candidate code, and receive structured feedback — closing the loop between generation and validation without leaving the agentic context.

Workshop provides scaffolding for defining eval tasks (input/output specs, scoring functions), a runner that executes evals in isolated subprocesses, and result formatting that feeds back into the agent’s context window. This addresses a real gap: most agent coding benchmarks are external to the agent, but in practice you want the agent to be able to self-evaluate intermediate outputs, especially for long-horizon tasks where a human is not in the loop. The approach is closer to LLM-as-judge internalized in the agent rather than a bolted-on external harness. Useful for anyone building autonomous coding pipelines where correctness verification is itself programmable and iterative.

Source: https://github.com/raindrop-ai/workshop


berabuddies/Semia

Static and dynamic security audit tooling targeted specifically at AI agent skill definitions — the callable tool/function specifications that LLM agents use to interact with external systems. The threat model here is distinct from standard application security: prompt injection via malicious tool outputs, over-permissioned skill scopes, and unvalidated schema fields that can steer agent behavior.

Semia analyzes skill manifests (JSON/YAML tool definitions) for common misconfigurations: overly broad permission declarations, missing input validation constraints, schema fields susceptible to injection, and capability combinations that violate least-privilege. It appears to support both offline manifest linting and runtime instrumentation hooks. This fills a niche that neither traditional SAST tools nor LLM red-teaming frameworks cover well — the former knows nothing about agent semantics, the latter focuses on the model rather than the tool surface. Relevant for teams deploying multi-skill agents in production where tool misuse is a credible attack vector.

Source: https://github.com/berabuddies/Semia


pnegahdar/nano

A minimal coding agent in a single file under 200 lines with zero external dependencies. The implementation covers the essential agentic loop: LLM call, tool dispatch (file read/write, shell execution), result injection back into context, and iteration until a terminal condition. By fitting in one file with no deps, it is auditable in minutes and portable to any Python environment.

The design philosophy prioritizes transparency over capability. Each component of the agent loop is explicit rather than abstracted behind a framework layer — useful as a reference implementation for understanding what a coding agent actually does mechanically, or as a base for custom agents where framework overhead (LangChain, LlamaIndex, etc.) is undesirable. The zero-dependency constraint means the LLM API call goes directly over urllib or similar stdlib primitives. Good starting point for embedded environments, educational use, or projects where you want full control over every line of agent logic without framework magic.

Source: https://github.com/pnegahdar/nano


Open-Less/openless

System-level voice input with on-release LLM polishing, cross-platform on macOS and Windows. The interaction model: hold a hotkey, speak, release — the audio is transcribed and then rewritten by an LLM (grammar correction, clarity, register matching) before being injected at the cursor position in whatever application has focus. The injection uses OS-level accessibility APIs so it works in any app without integration work.

Technically it chains a local or API-backed ASR model with an LLM post-processing step, with the latency budget split between transcription and rewrite. The “open-source” framing matters because existing solutions (macOS Dictation, Whisper-based tools) either skip the polish step or are proprietary. The global hotkey and text injection pipeline is the non-trivial engineering piece — cross-platform clipboard/accessibility injection is fiddly. Useful for anyone doing high-volume text entry who wants voice with output quality closer to typed prose than raw transcription.

Source: https://github.com/Open-Less/openless


asdsa321as/grok-animus

A persistent companion layer that wraps any LLM with stateful personality, episodic memory, simulated dream/consolidation cycles, and continuous character evolution. The architecture separates a base LLM from a set of overlay modules: a long-term memory store (vector or structured), a personality state vector that drifts based on interaction history, and a background process that runs consolidation (analogous to sleep-phase memory replay) to update the personality representation.

The “dreams” feature appears to be a scheduled offline inference pass that synthesizes recent memories into updated internal state, preventing the companion from being purely reactive. Character evolution is parameter-free from the base LLM’s perspective — all state is in the wrapper’s memory and personality prompt construction. This is a clean separation of concerns for anyone building persistent AI personas: the LLM handles language, Animus handles continuity. The abstraction should be backend-agnostic across OpenAI, Anthropic, and local models.

Source: https://github.com/asdsa321as/grok-animus


waybarrios/opencode-power-pack

A collection of eleven agentic skills ported from Claude Code’s built-in capabilities to OpenCode’s plugin system. Covered skills include code review, security review, feature development, and frontend design, plus seven others. The porting work is non-trivial because Claude Code and OpenCode have different tool-call schemas and context injection patterns — this repo normalizes them under a single configuration entry point.

The practical value is that OpenCode users gain a Claude Code-equivalent skill set without using Anthropic’s CLI or being locked to that model. Each skill is implemented as a self-contained plugin with its own system prompt, tool definitions, and invocation contract. The “one config line, one plugin” claim suggests a clean registry pattern where skills are declared rather than wired manually. Useful for teams standardizing on OpenCode as their agent runtime who do not want to reimplement common development workflow skills from scratch.

Source: https://github.com/waybarrios/opencode-power-pack


Zhou-Shilin/Aether

A general-purpose AI agent application for Android with on-device localization support. Aether provides a native Android agent runtime that can invoke system APIs, browse, execute multi-step tasks, and interact with third-party apps via accessibility services. The “localized” emphasis suggests offline or on-device model inference support alongside cloud LLM backends, which matters for privacy and latency on mobile.

The architecture faces the standard Android agent challenge: Android’s accessibility API provides a tree of UI elements rather than direct action primitives, so the agent must interpret semantic structure from view hierarchies and generate taps/swipes/text input as actions. Aether abstracts this into a higher-level action space. Compared to existing Android agents (AppAgent, MobileAgent), the differentiator appears to be the native app quality and localization-first design for Chinese-language users. The “general-purpose” scope means it is not task-specialized, which raises harder challenges around action space coverage and error recovery. Relevant for mobile agent research and productivity automation on Android.

Source: https://github.com/Zhou-Shilin/Aether