Daily AI Digest — 2026-05-30

Published

May 30, 2026

Hacker News Signals

Real-time LLM Inference on Standard GPUs: 3k tokens/s per request

Source: https://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/

The post claims 3,000 tokens/s per-request throughput on commodity GPUs (H100s, not exotic hardware) for medium-sized LLMs, which is roughly 10-20x what naive vLLM deployments achieve at low batch sizes. The key technical levers are aggressive speculative decoding with a well-matched draft model, continuous batching tuned to minimize KV-cache fragmentation, and kernel-level fusion that keeps memory bandwidth saturated rather than stalling on GEMM launches.

The central insight is that per-request latency at low concurrency is almost entirely memory-bandwidth-bound, not compute-bound. A single H100 SXM5 has ~3.35 TB/s HBM bandwidth. Loading a 7B fp16 model weight once per token costs ~14 GB per forward pass; at 3k tokens/s that is ~42 TB/s of effective bandwidth demand, which only makes sense if weights are quantized (INT4/INT8) and the speculative draft acceptance rate is high enough that many tokens are confirmed per draft call. The post is somewhat light on exact acceptance-rate numbers and quantization specifics, which are the crux of reproducibility.
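The arithmetic above can be checked with a short sketch. It assumes decoding is purely weight-bandwidth-bound, and it ignores KV-cache reads and the draft model's own cost. The helper name `tokens_per_pass_needed` is hypothetical; the 7B size, the 3k tokens/s target, and the ~3.35 TB/s figure come from the text.

```python
H100_HBM_BANDWIDTH_TBPS = 3.35  # approximate H100 SXM5 HBM bandwidth, from the text


def tokens_per_pass_needed(params_billion, bytes_per_param, target_tokens_per_s,
                           bandwidth_tbps=H100_HBM_BANDWIDTH_TBPS):
    """Tokens each full weight read must yield to reach the target rate."""
    weight_gb = params_billion * bytes_per_param            # GB read per forward pass
    demand_tbps = weight_gb * target_tokens_per_s / 1000    # TB/s at one token per pass
    return demand_tbps / bandwidth_tbps


for label, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    needed = tokens_per_pass_needed(7, bytes_per_param, 3000)
    print(f"{label}: ~{needed:.1f} tokens per forward pass")
```

Under these assumptions, fp16 would need about 12.5 tokens confirmed per weight pass, INT8 about 6.3, and INT4 about 3.1, which is why quantization and a high draft acceptance rate both have to contribute.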

Continuous batching is not novel (Orca introduced iteration-level scheduling, and vLLM popularized it alongside PagedAttention), but the claimed gains here come from tighter integration between the scheduler and the speculative decode loop, so that the batch is never stalled waiting for KV-cache allocation. The post also mentions custom CUDA kernels for the attention step rather than relying on FlashAttention as a black box, allowing fused gather operations over non-contiguous KV pages.

Limitations: the 3k number is for small context windows (< 2k tokens in KV) and single-user scenarios. Under high concurrency the speculative draft bottleneck shifts, and the gains likely flatten. No open-source release of the kernels is mentioned, so independent reproduction is not currently possible.

Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA

Source: https://github.com/jmaczan/tiny-vllm

Tiny-vLLM is an educational reimplementation of the core vLLM inference stack in C++ and CUDA, targeting developers who want to understand the internals without parsing thousands of lines of Python/Triton. The repository is small enough that the entire paged attention mechanism and continuous batching scheduler fit in a few hundred lines of readable CUDA.

The architecture mirrors vLLM’s design: a block manager maintains a pool of fixed-size KV-cache pages and a logical-to-physical page table per sequence. When a new request arrives, the scheduler allocates physical blocks on demand and frees them on completion, avoiding the fragmentation of pre-allocated per-sequence KV buffers. The CUDA kernel for attention iterates over logical pages, loads the corresponding physical KV blocks from HBM, and accumulates the softmax-weighted value sum — essentially a blocked FlashAttention variant where the block boundaries correspond to paged memory rather than SRAM tile size.
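A minimal Python sketch of the block-manager bookkeeping described above. This is not Tiny-vLLM's code (which is C++/CUDA); the class and method names are hypothetical, and it models only the logical-to-physical page table, not the attention kernel.

```python
class BlockManager:
    """Toy paged KV-cache allocator: fixed-size blocks, one page table per sequence."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.page_tables = {}                       # seq_id -> list of physical block ids
        self.seq_lens = {}                          # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.page_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:           # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real scheduler would preempt")
            table.append(self.free_blocks.pop())    # allocate on demand
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.page_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are allocated only when the previous one fills, a short sequence never reserves memory for a maximum-length context, which is the fragmentation saving the paragraph describes.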

The C++ scheduler is single-threaded and priority-queue-based: it preempts lower-priority sequences by swapping their KV pages to CPU memory when HBM is exhausted, a feature the author notes is incomplete in the current version. There is no tensor parallelism or pipeline parallelism yet; everything runs on a single GPU, which caps practical model size at roughly 13B fp16 parameters (about 26 GB of weights) on a single A100.

What makes this useful is not performance — it will not beat vLLM or TensorRT-LLM — but legibility. The paged attention CUDA kernel is annotated and avoids the macro-heavy abstraction layers in production code. For someone trying to understand why KV-cache fragmentation is expensive or how continuous batching interleaves prefill and decode phases, this is a cleaner entry point than reading vLLM’s Python scheduler and then cross-referencing Triton kernels. The main gap is lack of quantization support and grouped-query attention, both essential for modern model families.

Investigating how prompt politeness affects LLM accuracy (2025)

Source: https://arxiv.org/abs/2510.04950

This paper runs a controlled study asking whether adding polite phrasing (“please”, “thank you”, “could you kindly”) to prompts systematically shifts accuracy on standard benchmarks. The finding is that it does, but not uniformly and not always in a beneficial direction.

The experimental setup covers several frontier models (GPT-4o, Claude 3, Llama-3 variants) across MMLU, GSM8K, HumanEval, and a set of factual QA tasks. Prompts are rewritten in four registers: neutral (baseline), polite, rude, and excessively sycophantic. Each variant is evaluated at temperature 0 with five independent runs to control for stochasticity.

Key quantitative result: on GSM8K, polite phrasing produces a +1.2 to +2.8 point accuracy improvement over neutral across most models, but excessively sycophantic phrasing (“I humbly beseech your vast knowledge…”) degrades accuracy by 1-4 points relative to neutral, likely because it shifts the model toward a deferential response style that hedges answers. On HumanEval, effects are smaller and inconsistent across models, suggesting code generation is less sensitive to register.

The mechanism is not established, only speculated: RLHF training data probably over-represents polite human-assistant interactions, so polite prompts activate higher-confidence response patterns. Rude prompts in some models triggered refusals or shortened responses, directly tanking accuracy.

Limitations are significant. Effect sizes are small (roughly 1-4 points), within the variance of prompt phrasing more generally. The study does not control for prompt length, which co-varies with politeness additions. There is no causal analysis of attention patterns or logit-level effects, so the mechanistic claims remain speculative. The practical implication is marginal: spend time on task specification, not pleasantries.

SQLite is all you need for durable workflows

Source: https://obeli.sk/blog/sqlite-is-all-you-need-for-durable-workflows/

The post argues that SQLite’s WAL mode, combined with careful use of transactions and a small event-sourcing schema, is sufficient infrastructure for durable workflow execution — the class of problem Temporal, AWS Step Functions, and similar systems address. The claim is architectural, not a benchmark.

The core idea is that a workflow executor can represent all state as an append-only event log in a SQLite table. Each workflow step writes its result as a row within a transaction before returning. On crash recovery, the executor replays the log from the start, returning recorded results for already-completed steps and resuming execution at the first step with no committed row. This is essentially Temporal’s event-history replay model, but with an embedded SQLite file as the journal instead of a separate persistence cluster such as Cassandra + Elasticsearch.

The schema is minimal: an events table with (workflow_id, sequence_number, event_type, payload BLOB, created_at) and a workflows table tracking status. The executor holds a file lock (SQLite’s default exclusive write lock) and processes one workflow at a time per process, relying on WAL mode to allow concurrent reads from separate reader processes for observability.

The durability guarantee is SQLite’s PRAGMA synchronous = FULL plus WAL checkpointing: every committed transaction is fsync’d to the WAL before the write returns, so a process crash cannot lose a committed step result. This is the same guarantee Postgres provides, just without network round-trips.
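The log-and-replay scheme can be sketched with Python's built-in `sqlite3` module. This is an illustration under assumptions, not the post's implementation: the column names follow the schema listed above, while the `step_completed` event type, the JSON payload encoding, and the function names are hypothetical.

```python
import json
import sqlite3


def open_journal(path):
    """Open the journal file in WAL mode with full fsync on commit."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode = WAL")
    conn.execute("PRAGMA synchronous = FULL")
    conn.execute(
        """CREATE TABLE IF NOT EXISTS events (
               workflow_id TEXT NOT NULL,
               sequence_number INTEGER NOT NULL,
               event_type TEXT NOT NULL,
               payload BLOB,
               created_at TEXT DEFAULT CURRENT_TIMESTAMP,
               PRIMARY KEY (workflow_id, sequence_number))"""
    )
    return conn


def run_workflow(conn, workflow_id, steps):
    """Run steps in order, reusing any results already committed to the log."""
    rows = conn.execute(
        "SELECT sequence_number, payload FROM events WHERE workflow_id = ?",
        (workflow_id,),
    ).fetchall()
    done = {seq: json.loads(payload) for seq, payload in rows}
    results = []
    for seq, step in enumerate(steps):
        if seq not in done:
            result = step(results)          # each step sees earlier results
            with conn:                      # commit the row, or roll back on error
                conn.execute(
                    "INSERT INTO events (workflow_id, sequence_number, event_type, payload)"
                    " VALUES (?, ?, 'step_completed', ?)",
                    (workflow_id, seq, json.dumps(result)),
                )
            done[seq] = result
        results.append(done[seq])
    return results
```

Calling `run_workflow` a second time with the same `workflow_id` (for example after a restart) returns the recorded results without re-running the step functions. Note that WAL mode needs a file-backed database; an in-memory database will not switch to it.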

Scaling limits are explicitly acknowledged: this works for single-node deployments handling thousands, not millions, of concurrent workflows. Multi-node coordination requires distributed consensus, which is exactly what SQLite does not provide. The post is honest that this is a local-process durable scheduler, not a Temporal replacement at scale. For small services or embedded workflow engines inside a single binary, the simplicity argument is strong.

Various LLM Smells

Source: https://shvbsle.in/various-llm-smells/

Borrowing the “code smell” taxonomy from software engineering, this post catalogs patterns in LLM outputs and system designs that signal deeper problems without being outright failures. It is opinionated and practitioner-focused.

Several of the catalogued smells are technically substantive. The “confident hallucination cluster” smell describes outputs where the model produces a chain of plausible-sounding but fabricated citations, each reinforcing the previous — a consequence of autoregressive generation where early false tokens condition later ones toward a coherent-but-wrong narrative. The fix suggested is retrieval augmentation with citation grounding, not just prompting the model to be uncertain.

The “prompt spaghetti” smell targets systems where a single prompt has accumulated months of conditional instructions, negations, and exceptions (“do X, but not if Y, unless Z”). The author argues this creates a hidden state machine inside the context window where instruction interactions are opaque and untestable. The engineering recommendation is to decompose into separate specialized prompts routed by a classifier, which is essentially the mixture-of-experts principle applied at the prompt level.

“Temperature instability” describes outputs that are highly sensitive to temperature setting in ways that are not obvious during development: a model that works at temperature 0.7 for creative tasks silently degrades on structured extraction tasks where lower temperature would concentrate probability on the correct format token. This is a calibration mismatch between task type and sampling regime.
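To make the sampling effect concrete, here is a small sketch of temperature-scaled softmax over three made-up logits (the values are illustrative, not from the post). It shows how the probability of the top token changes as temperature varies.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by temperature (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical: the first entry is the "correct format" token
for t in (0.2, 0.7, 1.0):
    top = softmax_with_temperature(logits, t)[0]
    print(f"temperature={t}: P(top token)={top:.3f}")
```

The top-token probability rises as temperature falls, so a setting tuned for creative variety leaves noticeably more mass on off-format tokens during structured extraction.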

The “eval-less deployment” smell is the sharpest: shipping a prompt change without automated evaluation on a regression suite. The post insists this is as bad as deploying code without tests, and the HN comments extensively debate what a minimal useful LLM eval harness looks like (most converge on: a fixed golden dataset, a judge model, and a threshold on score delta).
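A minimal sketch of the harness shape the comments converge on: a fixed golden dataset, a judge, and a threshold on score delta. All names here are hypothetical, and the judge is any callable returning a score in [0, 1]; in practice it might wrap a judge model, while an exact-match function is enough for illustration.

```python
def mean_score(model, golden, judge):
    """Average judge score of a model over (prompt, expected) pairs."""
    scores = [judge(model(prompt), expected) for prompt, expected in golden]
    return sum(scores) / len(scores)


def passes_regression_gate(baseline, candidate, golden, judge, max_drop=0.02):
    """Return (ok, delta); ok is False when the candidate drops more than max_drop."""
    delta = mean_score(candidate, golden, judge) - mean_score(baseline, golden, judge)
    return delta >= -max_drop, delta


def exact_match(output, expected):
    """Stand-in judge: 1.0 on exact match after trimming whitespace, else 0.0."""
    return 1.0 if output.strip() == expected else 0.0
```

A CI job could call `passes_regression_gate` with the currently deployed prompt as `baseline` and the proposed prompt as `candidate`, and fail the build when `ok` is False.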

Building durable workflows on Postgres

Source: https://www.dbos.dev/blog/postgres-is-all-you-need-for-durable-execution

DBOS makes the same structural argument as the SQLite post above but for Postgres, and with more implementation detail about their production system. The post describes using Postgres as the sole durability and coordination layer for a workflow engine, replacing Temporal’s multi-service architecture.

The implementation stores workflow function inputs, outputs, and execution status in Postgres tables. Each function invocation is wrapped in a transaction that reads the workflow’s recorded output if it exists (idempotent replay) or executes the function and writes the result if not. This is the “execution journal” pattern: the database is the source of truth for what has happened, and the process is stateless.

The interesting technical detail is how they handle the “exactly-once” semantic for non-idempotent side effects (HTTP calls, email sends). The solution is a workflow_events table that records the intent to perform an action before performing it, and a separate completed_events table written after. On replay, if completed_events has a matching entry, the action is skipped; if only workflow_events has an entry, the action is retried. This is closer to a write-ahead intent log at the application layer than to true two-phase commit, with Postgres row locks preventing concurrent execution of the same step. A crash after the action runs but before completed_events is written leads to a retry, so the external call itself still needs to tolerate duplicates (for example via an idempotency key).
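The intent-then-completion ordering can be sketched as follows. This is an in-memory stand-in, not DBOS code: the two attributes mirror the table names in the text, a real system would persist them as Postgres rows under a row lock, and the function and class names are hypothetical.

```python
class StepJournal:
    """In-memory stand-in for the workflow_events and completed_events tables."""

    def __init__(self):
        self.workflow_events = set()   # step ids whose intent has been recorded
        self.completed_events = {}     # step id -> recorded result


def run_step(journal, step_id, action):
    """Execute a side-effecting step at most once per recorded completion."""
    if step_id in journal.completed_events:
        return journal.completed_events[step_id]   # replay: skip the side effect
    journal.workflow_events.add(step_id)           # 1. record intent first
    result = action()                              # 2. perform the side effect
    journal.completed_events[step_id] = result     # 3. record completion
    return result
```

If `action` raises (standing in for a crash) the intent remains without a completion, so the next call runs the action again; once the completion is recorded, later calls return the stored result without touching the outside world.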

Postgres’s LISTEN/NOTIFY mechanism is used for waking up sleeping workflow workers without polling, keeping latency low without a message broker. The post benchmarks this at roughly 10,000 workflow steps per second on a single Postgres instance, which they argue covers the majority of real-world workflow workloads.

The key limitation vs. Temporal: no built-in sharding, no multi-region active-active. All workflows must funnel through one writable Postgres primary. For globally distributed systems this is a real constraint; for single-region services it is a significant simplification.

Liquid AI reveals 8B-A1B MoE trained on 38T tokens

Source: https://www.liquid.ai/blog/lfm2-5-8b-a1b

Liquid AI’s LFM-2.5 8B-A1B is a Mixture-of-Experts model with 8B total parameters and 1B active parameters per token, trained on 38 trillion tokens. The naming convention (A1B = 1B active) mirrors the one used for Qwen’s MoE releases (e.g., Qwen3-30B-A3B) and signals the same architectural motivation: reduce per-token FLOPs while maintaining capacity.

The architecture is not a standard transformer MoE. Liquid’s LFM series uses structured state-space layers (their “liquid” layers, derived from LTC networks and related to Mamba/S4) interleaved with attention layers, rather than pure transformer blocks. The MoE routing applies to the feedforward-equivalent components of these liquid layers, not to attention. This is architecturally distinct from Mixtral or DeepSeek-V2 where attention is dense and only FFN layers are sparsified.

The 38T token training corpus is notably large for an 8B-class model: Llama-3 8B used 15T, and Mistral 7B is commonly reported at ~8T, though Mistral never published an official figure. Chinchilla-optimal for 8B parameters is roughly 160B tokens (about 20 tokens per parameter), so this is heavily over-trained by the compute-optimal standard, following the same philosophy as Llama-3 of training past compute-optimality to reduce inference cost.
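The over-training comparison can be made explicit with a few lines of arithmetic. This sketch assumes the common rule of thumb of about 20 training tokens per parameter and applies it to the 8B total parameter count; how the Chinchilla ratio should be applied to a sparse MoE is itself an open assumption. Function names are hypothetical.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule-of-thumb approximation, not an exact law


def chinchilla_optimal_tokens(params):
    """Approximate compute-optimal token budget for a dense model."""
    return CHINCHILLA_TOKENS_PER_PARAM * params


def overtraining_factor(train_tokens, params):
    """How many times past the compute-optimal budget the model was trained."""
    return train_tokens / chinchilla_optimal_tokens(params)


for name, tokens in [("LFM 8B-A1B (total params)", 38e12), ("Llama-3 8B", 15e12)]:
    print(f"{name}: {overtraining_factor(tokens, 8e9):.1f}x compute-optimal")
```

By this estimate the 38T corpus is about 237.5 times the compute-optimal budget for 8B parameters, compared with roughly 94 times for Llama-3 8B at 15T.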

Reported benchmarks show LFM-2.5 8B-A1B competitive with Llama-3.1 8B and Gemma-3 9B on MMLU, MATH, and coding benchmarks, while using ~12.5% of the active parameters of a dense 8B model. Inference efficiency is the headline claim: at batch size 1, the 1B active parameter count means memory bandwidth requirements are roughly those of a 1B dense model.

Open questions: the structured SSM layers are less hardware-optimized than GEMM-dominated transformer layers on current GPU hardware, so real-world throughput gains depend heavily on kernel quality. No open weights are released as of the post.

Orchestrating AI code review at scale

Source: https://blog.cloudflare.com/ai-code-review/

Cloudflare’s post describes their internal system for running LLM-based code review across a monorepo with thousands of daily pull requests. The engineering interest is in the orchestration and cost control, not the prompting.

The system architecture uses a tiered triggering model: a lightweight classifier (a fine-tuned small model) first scores each diff for “review value” — whether the change is complex enough that LLM review adds signal over static analysis. Only diffs above a threshold are sent to the expensive frontier model. This alone reportedly cut API costs by ~60% by filtering trivial reformats, dependency bumps, and single-line changes.
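The tiered trigger can be sketched as a cheap scoring function in front of an expensive call. Everything below is hypothetical: Cloudflare's classifier is a fine-tuned model, whereas `review_value` here is a hand-written heuristic standing in for it, and the threshold and field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Diff:
    files_changed: int
    lines_changed: int
    is_dependency_bump: bool = False


def review_value(diff):
    """Stand-in for the classifier: a score in [0, 1] for expected review value."""
    if diff.is_dependency_bump or diff.lines_changed <= 1:
        return 0.0                      # trivial changes score zero
    return min(1.0, diff.lines_changed / 200 + diff.files_changed / 20)


def route(diff, threshold=0.3):
    """Send only high-value diffs to the expensive model."""
    if review_value(diff) >= threshold:
        return "frontier_model"
    return "static_analysis_only"
```

For example, `route(Diff(files_changed=5, lines_changed=120))` goes to the frontier model, while a one-line change or a dependency bump stays on the cheap path.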

Diff preprocessing is non-trivial. Large PRs are split into semantically coherent chunks using a combination of file-type heuristics and AST-based diff parsing (tree-sitter for supported languages). Each chunk is reviewed independently with shared context injected (relevant type definitions, recent commit messages for the file). Results are then merged with a deduplication pass to avoid posting duplicate comments when two chunks raise the same issue.
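The merge-and-deduplicate pass might look like the sketch below. The comment shape (`file`, `line`, `message` keys) and the normalization rule are assumptions for illustration; the post does not specify how duplicates are detected.

```python
def merge_comments(chunk_results):
    """Flatten per-chunk comment lists, dropping repeats of the same finding."""
    seen = set()
    merged = []
    for comments in chunk_results:
        for comment in comments:
            # Normalize case and whitespace so trivially different wordings collide.
            normalized = " ".join(comment["message"].lower().split())
            key = (comment["file"], comment["line"], normalized)
            if key not in seen:
                seen.add(key)
                merged.append(comment)
    return merged
```

Exact-key matching only catches near-verbatim repeats; catching paraphrased duplicates would need something fuzzier, such as embedding similarity.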

The feedback loop is operationally important: engineers can mark LLM comments as “useful” or “noise” directly in the PR interface. These labels feed back into the classifier’s training data, gradually improving the triaging model’s precision on Cloudflare’s specific codebase. This is essentially online learning on human preference labels, applied to the meta-task of deciding when to call the expensive model.

Latency is managed by parallelizing chunk reviews and posting comments asynchronously — the reviewer does not block on LLM completion. The system targets posting within 5 minutes of PR creation to fit into the human review cycle before the author context-switches away.

Noteworthy New Repositories

lightseekorg/tokenspeed

TokenSpeed bills itself as a speed-of-light LLM inference engine, targeting the latency and throughput bottlenecks that dominate production serving. The project focuses on maximizing tokens-per-second by combining low-level kernel optimizations with efficient memory management for KV-cache handling. The architecture appears to pursue continuous batching and memory-efficient attention similar in spirit to vLLM’s paged attention, but with a heavier emphasis on raw decode speed than on scheduling flexibility. It is written to integrate directly with quantized model formats (GGUF-style or similar), which cuts weight-loading overhead and eases memory bandwidth pressure during inference. The value proposition over alternatives like llama.cpp or vLLM is primarily in the tight optimization loop: fewer abstraction layers between the attention kernel and the hardware. Engineers choosing this over llama.cpp would be trading ecosystem breadth for raw throughput on constrained hardware. The relatively fast star accumulation (1.3k in early days) suggests it is addressing a real gap for users who find llama.cpp’s speed ceiling frustrating but do not want the operational complexity of TensorRT-LLM or TGI. Practical use cases include local inference pipelines and edge deployment where every millisecond of TTFT and every tok/s of decode throughput matters. The typed API surface makes it easier to embed in larger systems than shelling out to a CLI binary.

Source: https://github.com/lightseekorg/tokenspeed

jmerelnyc/Photo-agents

Photo-agents implements vision-grounded autonomous agents that control a desktop OS: taking screenshots, parsing visual state, writing new skill code, and storing that code in a layered memory system for reuse. The core loop is: observe (screenshot → VLM), plan (LLM), act (synthesize or recall a Python skill), execute (OS-level input simulation), then store successful skills persistently. The layered memory distinguishes between episodic memory (task history), semantic memory (factual grounding from screenshots), and procedural memory (the library of self-written skill scripts). Self-evolution here means the agent literally appends to its own skill library at runtime: new helper functions are written, tested by execution feedback, and indexed for future retrieval via embedding similarity. This is architecturally close to Voyager (the Minecraft agent with self-written code), adapted to GUI automation. The risk surface is significant: unrestricted code execution with OS-level access and no sandboxing visible in the description. Practical use is primarily research into agentic capability generalization, i.e., how well a skill library bootstrapped on one task transfers to another. It targets the same problem space as OpenAdapt and UFO but with more emphasis on the accumulating skill library as the differentiating primitive. Requires a capable VLM (GPT-4V-class) for reliable visual parsing.

Source: https://github.com/jmerelnyc/Photo-agents

Purewhiter/mobilegym

MobileGym is a browser-hosted Android simulation platform designed specifically for training and evaluating mobile GUI agents at scale. The core technical contribution is running Android emulation inside a browser context, which removes the need for per-researcher hardware setup and makes parallel environment instantiation trivial, which is critical for RL-style agent training that requires hundreds of concurrent environment rollouts. The platform exposes a verifiable reward interface: actions (tap, swipe, type) are logged against ground-truth task specifications, allowing automated success/failure scoring without human annotation per episode. This closes the loop for policy gradient or PPO-style training directly on GUI tasks. Compared to the Android in the Wild (AITW) dataset, which is offline and static, MobileGym provides an online, interactive environment. Compared to running AVD (Android Virtual Device) locally, the browser-native approach reduces infrastructure overhead and enables cloud-hosted parallelism. The simulation fidelity is necessarily limited relative to a full QEMU-backed AVD, but for agent research where the bottleneck is sample efficiency rather than pixel-perfect rendering, this tradeoff is sensible. The platform is relevant to researchers working on VLM-based agents (e.g., fine-tuning Qwen-VL or InternVL on GUI tasks) who need a scalable closed-loop training environment rather than offline imitation learning from static datasets.

Source: https://github.com/Purewhiter/mobilegym

shenli/distributed-system-testing

This repository packages AI-agent skills specifically for testing distributed systems — a domain where exhaustive manual test design is infeasible because the fault space (network partitions, clock skew, message reordering, node crashes) is combinatorially large. The skills are callable units that an LLM agent can invoke to inject specific failure modes, monitor system state, and assert correctness properties (linearizability, eventual consistency, etc.). The agent layer provides the reasoning needed to select which fault scenarios to exercise given a system description, effectively turning distributed systems testing from a manually scripted process into a goal-directed exploration. This sits conceptually between Jepsen (manual fault injection framework) and automated fuzzing: it uses LLM reasoning to prioritize fault sequences likely to expose bugs rather than exploring uniformly. The skill abstraction keeps individual primitives auditable and composable without requiring the LLM to generate raw infrastructure code. For engineers working on consensus protocols, distributed databases, or microservice meshes, the value is in reducing the expertise barrier to writing meaningful chaos tests. The repository is early-stage but addresses a genuine gap: most LLM coding agents lack domain-specific primitives for distributed fault injection, defaulting to generic HTTP testing that misses timing- and ordering-sensitive bugs entirely.

Source: https://github.com/shenli/distributed-system-testing

PorunC/CodeWiki

CodeWiki automates the generation of grounded developer documentation by parsing a codebase into AST graphs, building a GraphRAG index over those graphs, and then using an LLM to produce wiki pages whose claims are traceable back to specific AST nodes and source locations. The pipeline runs as follows: repository ingestion parses source files into language-specific ASTs; nodes and edges (function calls, class hierarchies, module imports) are stored in a graph database; GraphRAG constructs retrieval contexts that respect code structure rather than treating source as flat text; LiteLLM provides a vendor-agnostic LLM interface for the generation step; FastAPI serves the backend; React provides the frontend wiki viewer. The graph-structured retrieval is the key technical differentiator over naive RAG on code: it can answer questions like “what calls this function?” or “what does this class inherit from?” by traversing edges rather than relying on embedding proximity alone. Compared to tools like Mintlify or Swimm, CodeWiki generates documentation from code structure rather than requiring manual annotation. The main limitation is AST fidelity across languages — the quality of the graph depends heavily on parser coverage. This is a practical tool for teams inheriting large undocumented codebases where the cost of manual wiki authorship is prohibitive.

Source: https://github.com/PorunC/CodeWiki

Helvesec/rmux

rmux is a Rust-native terminal multiplexer abstraction that exposes a typed SDK for programmatically driving any CLI or TUI application, not just spawning processes but reading and writing their terminal state as structured data. The core value is treating a PTY session as a typed interface: you send commands, receive parsed output, and can drive interactive TUIs (htop, vim, ncurses apps) that normally resist automation because they render with terminal escape sequences instead of emitting line-oriented stdout. The SDK provides synchronization primitives so calling code can wait for specific terminal states before proceeding, solving the classic race condition in expect-style automation. Being implemented in Rust gives it memory safety and low overhead for long-running multiplexed sessions. Cross-platform support (Linux, macOS, Windows via ConPTY) is non-trivial and is a significant advantage over Python-based expect libraries, which have notoriously poor Windows support. Practical use cases include: integration testing CLI tools, building agents that need to interact with legacy interactive programs, automating devops workflows that involve interactive prompts, and embedding terminal sessions in larger orchestration systems. For teams already using Rust in their toolchain, this eliminates the need to shell out to expect or pexpect and deal with subprocess encoding issues. The 1.3k stars suggest strong interest from the automation and developer tooling community.

Source: https://github.com/Helvesec/rmux

Kaelio/ktx-ai-data-agents-mcp-context-skills

ktx is a context layer for data and analytics agents that exposes queryable data sources to LLM agents through the Model Context Protocol (MCP), augmented with a skills library, persistent memory, and a semantic layer for metric definitions. The semantic layer is the critical component: rather than letting the agent generate arbitrary SQL that may be semantically incorrect (wrong grain, wrong join keys, undocumented business logic), ktx enforces a predefined metric catalog so queries are grounded in verified definitions. Skills are reusable, parameterized query templates that agents invoke by name rather than synthesizing SQL from scratch each time, reducing hallucination surface. Memory allows the agent to recall previous query results and user preferences within and across sessions. The MCP interface means any MCP-compatible agent (Claude Code, OpenAI Codex, or custom) can connect without bespoke integration code. This positions ktx as infrastructure rather than an end-user product — it sits between the agent and the data warehouse, acting as a governed translation layer. Compared to text-to-SQL approaches (DAIL-SQL, DIN-SQL), ktx trades flexibility for correctness: the agent cannot generate semantically invalid queries because the vocabulary is constrained to verified skills. Relevant for data engineering teams who want to expose internal data to AI agents without risking metric inconsistency or uncontrolled query costs.

Source: https://github.com/Kaelio/ktx-ai-data-agents-mcp-context-skills
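The constrained-vocabulary idea in the ktx entry above can be illustrated with a tiny catalog lookup. This is not ktx's API: the catalog contents, the metric name, and `resolve_skill` are all hypothetical, and the sketch only shows how invoking skills by name keeps the agent from emitting free-form SQL.

```python
METRIC_CATALOG = {
    # skill name -> (SQL template, required parameter names); illustrative only
    "weekly_active_users": (
        "SELECT COUNT(DISTINCT user_id) FROM events WHERE ts >= :start AND ts < :end",
        {"start", "end"},
    ),
}


def resolve_skill(name, params):
    """Return (sql, params) for a catalogued skill; reject anything out of vocabulary."""
    if name not in METRIC_CATALOG:
        raise KeyError(f"unknown skill: {name}")
    sql, required = METRIC_CATALOG[name]
    if set(params) != required:
        raise ValueError(f"{name} expects parameters {sorted(required)}")
    return sql, dict(params)
```

The agent supplies only a skill name and parameter values; the SQL text comes from the reviewed catalog, and values are passed as bound parameters instead of being spliced into the query string.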

awizemann/harness

Harness is a Swift 6 / macOS 14+ tool that performs LLM-driven user testing against iOS Simulator, macOS native apps, and web apps. The operator provides a plain-language goal (e.g., “complete a checkout with a new account”), and an LLM agent drives the UI — reading accessibility trees or screenshots to parse state, selecting actions, executing them via XCTest/Accessibility APIs or WebDriver, and producing a structured report of friction points, failures, and unexpected states encountered along the path to the goal. The key technical distinction from conventional UI testing (XCUITest, Playwright) is that Harness does not require a predefined action script; it reasons about UI state at each step, making it robust to layout changes that would break selector-based tests. The friction reporting is what makes it a testing tool rather than just an automation tool: the agent is prompted to identify and document points where the UI was ambiguous, slow, or required unexpected steps. This maps directly to UX audit use cases. The Swift 6 implementation with strict concurrency enables safe parallel test runs across multiple simulator instances. The macOS 14+ requirement reflects dependence on newer Accessibility and ScreenCaptureKit APIs. The limitation is cost and latency: each action step involves an LLM call, so test cycles are slower and more expensive than deterministic UI tests. Best suited for exploratory testing and regression detection rather than high-frequency CI checks.

Source: https://github.com/awizemann/harness