Posts

Hexo Labs Open-Sources SIA: A Self-Improving Agent That Updates Both the Harness and the Model Weights

Image
Most AI agents stop improving once a human stops tuning them. The model is fixed. The scaffold around it is fixed. Hexo Labs wants to move both at once. It released SIA (Self-Improving AI) this week as an open-source framework under an MIT license. The core claim of this research is narrow but concrete. SIA edits both the agent’s scaffold and the model’s weights inside one self-improving loop. What is SIA (Self-Improving AI) SIA splits a task-specific agent into two parts. The first is the harness, also called the scaffold. That covers the system prompt, tool-dispatch logic, retry policy, and answer-extraction code. The second part is the model weights themselves. Three LLM components drive the loop. A Meta-Agent writes the initial scaffold from a task specification and any reference code. A Task-Specific Agent runs the task and logs every step. A Feedback-Agent then reads that full trajectory and decides what to change. That decision is the key idea. A...

Liquid AI Releases LFM2.5-8B-A1B: An On-Device MoE Model With 8.3B Total and 1.5B Active Parameters

Image
Liquid AI just shipped LFM2.5-8B-A1B . It is an on-device Mixture-of-Experts (MoE) model built for tool calling. The model holds 8.3B total parameters but activates only 1.5B per token. That sparsity is what lets it run on consumer hardware. The release follows LFM2-8B-A1B, which Liquid AI team published earlier. LFM2.5 is a new family of hybrid models for on-device deployment. This version adds a 128K context window, reasoning, and scaled-up training. What is LFM2.5-8B-A1B The model uses a sparse MoE design. It activates 1.5B of 8.3B total parameters per forward pass. That keeps each generated token cheap to compute. The architecture has 24 layers. Eighteen are double-gated LIV convolution blocks; six are GQA layers. It combines MoE, GQA, and gated short convolution blocks. The context length is 131,072 tokens. The model covers nine languages, including Arabic, Chinese, and Japanese. Liquid AI team recommends a temperature of 0.2, top_k of 80, and repetition_penal...

Anthropic Ships Claude Opus 4.8 Alongside Dynamic Workflows and Cheaper Fast Mode, With Workflows Capped at 1,000 Subagents

Anthropic just launched Claude Opus 4.8. Also, there two Claude Code updates shipped with it. Dynamic workflows run many subagents in parallel. Fast mode now supports Opus 4.8 at a lower price. Both are research previews. What Dynamic Workflows Actually Are A dynamic workflow is a JavaScript script that orchestrates subagents at scale. Claude writes the script for the task you describe. A runtime then executes it in the background. Your session stays responsive while agents work. The plan moves into code, not Claude’s context window. Intermediate results live in script variables instead. So Claude’s context holds only the final answer. That is the core difference from subagents and skills. The feature requires Claude Code v2.1.154 or later. It runs in the CLI, Desktop, and VS Code extension. It is available on Max, Team, and Enterprise plans. It is on by default on Max and Team. On Enterprise it is off until an admin enables it. It also runs on the Claude AP...

Perplexity AI Open-Sources Unigram Tokenizer That Achieves 5x Lower p50 Latency Than Hugging Face tokenizers Crate

Perplexity AI’s research team reimplemented their Unigram tokenizer from scratch in Rust and open-sourced the code in pplx-garden , their inference technology repository. At production input lengths, the new encoder cuts p50 latency by roughly 5x versus the Hugging Face tokenizers crate, ~2x versus SentencePiece (C++), and ~1.5x versus IREE’s tokenizer (C), with zero steady-state heap allocations. In production, it reduced CPU utilization in Perplexity’s inference stack by 5-6x and shaved double-digit milliseconds off reranker latency. Why Tokenization Became a Bottleneck LLM inference cost is typically framed around GPU work: KV caches, attention kernels, expert routing. But smaller models, such as embedding models, classifiers, and rerankers, tell a different story. These models are two to three orders of magnitude smaller than frontier transformers. A reranker scoring hundreds of candidate documents per request is a clear example. With a small mode...