Posts

Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference

Speculative decoding is a technique for speeding up large language model inference. A small, fast draft model proposes several tokens, and the large target model verifies them in parallel. Accepted tokens speed up inference; when a token is rejected, the system falls back gracefully to the target model's own output. The EAGLE series (EAGLE 1, EAGLE 2, and EAGLE 3), from the EAGLE Team, vLLM Team, and TorchSpec Team, has become one of the most widely adopted and practically deployed families of speculative decoding algorithms across both research and production systems. Today, that family gets a targeted reliability upgrade with the introduction of EAGLE 3.1.

What Was Going Wrong

While speculative decoding performs well in controlled settings, performance often degrades under different chat templates, long-context inputs, or out-of-distribution system prompts. The EAGLE team traced this fragility to a phenomenon called attention drift: as speculation depth increases, the drafter gradually shifts attention away fro...
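
The draft-then-verify loop described in this excerpt can be sketched in a few lines. This is a minimal illustration of greedy speculative decoding in general, not EAGLE's drafter or its acceptance rule. The names `speculative_step`, `draft_next`, and `target_next` are hypothetical, the two "models" are toy functions over integers, and the verification loop stands in for what a real system would do in one batched forward pass of the target model.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, depth: int) -> List[Token]:
    """Run one draft-then-verify round and return the newly emitted tokens."""
    # 1. The draft model proposes `depth` tokens autoregressively.
    proposed: List[Token] = []
    ctx = list(prefix)
    for _ in range(depth):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposal (sequential here for clarity;
    #    assumed to be a single parallel pass in a real implementation).
    emitted: List[Token] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            emitted.append(expected)
            return emitted
        emitted.append(tok)
        ctx.append(tok)

    # 3. Every proposal was accepted, so the target adds one extra token.
    emitted.append(target_next(ctx))
    return emitted


# Toy stand-ins (assumptions): the target counts upward modulo 10,
# and the draft agrees except right after a 3.
def target_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 3 else target_next(ctx)


print(speculative_step([0], draft_next, target_next, depth=4))  # [1, 2, 3, 4]
```

In this toy run, three drafted tokens are accepted and the fourth is replaced by the target's choice, so the output matches what the target alone would have produced. The deeper the drafter can stay in agreement with the target, the more tokens each round yields, which is why drafter quality at larger speculation depths matters.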

MEMO: A Modular Framework for Training a Dedicated Memory Model on New Knowledge Without Modifying LLM Parameters

Large language models become static after pretraining. Their knowledge does not update as the world changes. Retraining a full LLM is too expensive at modern scales. Fine-tuning risks degrading previously learned knowledge. Retrieval-augmented generation (RAG) struggles when answers require reasoning across many documents. A team of researchers from the National University of Singapore, MIT CSAIL, A*STAR, and the Singapore-MIT Alliance for Research and Technology (SMART) proposes a new approach called MEMO (Memory as a Model).

What Problem Does MEMO Solve?

Existing methods for integrating new knowledge into LLMs fall into three categories. Non-parametric methods like RAG retrieve documents at inference time. They are sensitive to retrieval noise and struggle with cross-document reasoning. Parametric methods such as continual pretraining or supervised fine-tuning internalize knowledge into model weights. They are computationally expensive and cause catastrophic forgetting ...

Design a High-Precision Retrieve-and-Rerank Pipeline with ZeroEntropy Zerank-2 Reranker

In this tutorial, we use zeroentropy/zerank-2-reranker, a 4B Qwen3-based cross-encoder reranker, to improve retrieval quality. We start by setting up the runtime, loading the reranker, and understanding how it scores query-document pairs. Then, we move from simple pairwise scoring to a practical two-stage retrieve-and-rerank pipeline, where a fast bi-encoder first retrieves candidates and zerank-2 reranks them for better precision. We also evaluate the impact using NDCG@10 and test the reranker across finance, legal, and code examples to assess its performance in real-world search and ranking tasks.

```python
!pip -q install -U "sentence-transformers>=3.0" "transformers>=4.51.0" accelerate

import os, time, numpy as np, torch
from sentence_transformers import CrossEncoder, SentenceTransformer, util

os.environ["TOKENIZERS_PARALLELISM"] = "false"

if torch.cuda.is_available():
    device = "cuda"
...
```
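
Since the excerpt evaluates reranking with NDCG@10, a small reference computation may help make the metric concrete. The sketch below is standalone Python that does not call the reranker. The function names and the relevance labels are hypothetical, and it assumes the linear-gain form of DCG (relevance divided by log2 of rank + 1); the tutorial itself may use a different variant or a library implementation.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # enumerate() is 0-based, so the result at 1-based rank r is discounted by log2(r + 1).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Made-up graded relevance labels, listed in the order each stage returned the documents.
first_stage = [0, 1, 3, 2]   # bi-encoder order: best document sits at rank 3
reranked = [3, 2, 1, 0]      # cross-encoder order: already ideal

print(round(ndcg_at_k(first_stage), 3))
print(ndcg_at_k(reranked))   # 1.0, because the order matches the ideal ranking
```

Comparing the two scores on the same candidate set isolates the effect of the second stage: the reranker cannot add documents the retriever missed, it can only reorder them.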

Stability AI Releases Stable Audio 3: A Family of Fast Latent Diffusion Models for Audio Generation and Editing

Stability AI has released open weights for Stable Audio 3 along with a technical research paper. Stable Audio 3 is a family of latent diffusion models that generate stereo audio at 44.1 kHz. The models support variable-length outputs, inpainting-based editing, and fast inference.

What Is Stable Audio 3?

Stable Audio 3 is a family of three model scales: small, medium, and large. A latent diffusion model generates audio by learning to progressively remove noise from a compressed representation of audio, called a latent. The model learns a mapping from noise to data by training on many (noisy latent, audio) pairs. The three model scales differ in capacity and maximum generation length. All parameter counts below are for the diffusion transformer component only. Each model also includes a SAME autoencoder (108M parameters for SAME-S, 852M for SAME-L).

small-music: 459M diffusion transformer parameters, up to 2 minutes, music only.
small-sfx: 459M diffusion transf...
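
To see why diffusion is run on a compressed latent rather than on raw audio, it can help to count the raw samples involved. The arithmetic below uses only figures from the excerpt (stereo, 44.1 kHz, up to 2 minutes for the small models). The function name is hypothetical, and the autoencoder's compression ratio is not stated here, so it is deliberately left out.

```python
SAMPLE_RATE_HZ = 44_100   # 44.1 kHz, from the excerpt
CHANNELS = 2              # stereo
MAX_SECONDS = 2 * 60      # "up to 2 minutes" for the small models


def raw_sample_count(seconds: int, sample_rate: int = SAMPLE_RATE_HZ,
                     channels: int = CHANNELS) -> int:
    """Number of individual waveform values in an uncompressed clip."""
    return seconds * sample_rate * channels


total = raw_sample_count(MAX_SECONDS)
print(total)  # 10584000
```

A two-minute stereo clip is therefore over ten million waveform values, which is the sequence the autoencoder shortens before the diffusion transformer sees it.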