Posts

The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache

Long-context large language models (LLMs) face a memory bottleneck that has nothing to do with model weights. During decoding, transformers cache the key and value (KV) vectors for every token at every layer so they don’t have to recompute them at each step. This cache grows linearly with sequence length and batch size, and at long context lengths with high concurrency it can dwarf the model’s own footprint. Consider Llama-3.1-70B in BF16. Its KV cache costs about 0.31 MB per token (80 layers × 8 KV heads × 128 head-dim × 2 tensors × 2 bytes). At 128K tokens that is ~40 GB; at 1M tokens it exceeds 300 GB, more than the 140 GB of weights themselves. Worse, every newly decoded token has to stream the entire cache out of high-bandwidth memory (HBM), which makes decoding memory-bandwidth-bound rather than compute-bound. Shrinking the KV cache is therefore the most direct lever for cutting both cost and decode latency. Current approaches fall into roughly five families: token eviction ...
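The per-token figure quoted for Llama-3.1-70B can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function names are made up for this example, "128K" and "1M" are assumed to mean 128 × 1024 and 1024 × 1024 tokens, and sizes are reported in binary units (MiB, GiB), which is how the rounded figures in the text work out.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache added per token: K and V (2 tensors) at every layer."""
    return layers * kv_heads * head_dim * 2 * bytes_per_elem


def kv_cache_bytes(tokens, batch=1, **cfg):
    """Total KV cache size; grows linearly with sequence length and batch size."""
    return kv_bytes_per_token(**cfg) * tokens * batch


# Configuration taken from the figures in the text (BF16 = 2 bytes per element).
LLAMA_70B = dict(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2)

per_token = kv_bytes_per_token(**LLAMA_70B)
print(f"per token: {per_token / 2**20:.4f} MiB")  # per token: 0.3125 MiB

for label, tokens in (("128K", 128 * 1024), ("1M", 1024 * 1024)):
    gib = kv_cache_bytes(tokens, **LLAMA_70B) / 2**30
    print(f"{label} tokens: {gib:.0f} GiB")  # 40 GiB, then 320 GiB
```

Passing `batch=` greater than 1 shows how quickly concurrency multiplies the footprint for the same context length.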

OpenAI Releases LifeSciBench, a 750-Task Benchmark Grading AI Models on Real Life-Science Research With Expert-Written Rubric

Most biology benchmarks ask narrow, fact-based questions with clean answers. Scientists weigh imperfect evidence and make decisions. OpenAI released LifeSciBench and it targets that gap directly. Even the strongest model passes roughly one task in three. The benchmark is far from saturated.

What is LifeSciBench

LifeSciBench contains 750 expert-authored tasks. They span seven workflows and seven biological domains. Each task pairs a prompt, supporting artifacts, and a grading rubric. The seven workflows cover evidence handling and analysis. They also include design and optimization, scientific reasoning, validation and operations, translation, and scientific communication. The seven domains run from genomics and medicinal chemistry to clinical and translational science. Tasks are written as a scientist would brief a colleague. They are free-response, not multiple-choice. Around 79% require multiple reasoning or decision-making steps, averaging four steps each. ...

NVIDIA SkillSpector Guide: Scanning AI Skills for Security Risks with Static Analysis and SARIF Reports

In this tutorial, we explore how NVIDIA SkillSpector helps us evaluate AI skills for security risks before they are used in real-world workflows. We build a controlled corpus containing both benign and deliberately vulnerable skills, scan them through SkillSpector’s programmatic LangGraph workflow, and organize the resulting risk scores and findings with pandas. We then visualize severity and category distributions, export results in SARIF format, extend the framework with a custom analyzer, and optionally apply LLM-based semantic analysis for deeper validation.

Installing NVIDIA SkillSpector and Building a Skill Corpus

    import os
    import sys
    import json
    import shutil
    import textwrap
    import subprocess
    from pathlib import Path

    print("Python:", sys.version.split()[0])
    if sys.version_info < (3, 12):
        print(" SkillSpector requires Python 3.12+. On Colab pick a 3.12+ runtime.")

    def _pip(*args):
        subprocess.run([s...
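The introduction mentions exporting results in SARIF format. As a rough illustration of what such a report contains, the sketch below builds a minimal SARIF 2.1.0 log from a couple of made-up findings. It does not use SkillSpector's exporter, whose API is not shown in this excerpt; the `findings` records, the `to_sarif` helper, the rule IDs, and the tool name are all hypothetical.

```python
import json

# Hypothetical findings; these field names are illustrative, not SkillSpector's schema.
findings = [
    {"rule_id": "EXAMPLE001", "severity": "high",
     "message": "Skill builds a shell command from untrusted input.",
     "path": "skills/vulnerable/run.py", "line": 12},
    {"rule_id": "EXAMPLE002", "severity": "low",
     "message": "Skill description is missing.",
     "path": "skills/benign/SKILL.md", "line": 1},
]

# SARIF result levels are "none", "note", "warning", and "error".
LEVELS = {"high": "error", "medium": "warning", "low": "note"}


def to_sarif(findings, tool_name="example-scanner"):
    """Wrap findings in the minimal SARIF 2.1.0 structure: one run, one tool driver."""
    rule_ids = sorted({f["rule_id"] for f in findings})
    results = [
        {
            "ruleId": f["rule_id"],
            "level": LEVELS.get(f["severity"], "warning"),
            "message": {"text": f["message"]},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": f["path"]},
                    "region": {"startLine": f["line"]},
                }
            }],
        }
        for f in findings
    ]
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {"name": tool_name,
                                "rules": [{"id": r} for r in rule_ids]}},
            "results": results,
        }],
    }


report = to_sarif(findings)
print(json.dumps(report, indent=2))
```

Because SARIF is a plain JSON document, a report shaped like this can be written to a `.sarif` file and opened by tools that understand the format.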

Vercel Releases Eve: An Open-Source AI Agent Framework Where Each Agent is a Directory of Files Mapped to Capabilities

Vercel has released eve, an open-source framework for building, running, and scaling agents. The project is published as the npm package eve, licensed under Apache-2.0. Building an agent should mean defining what it does. It should not mean assembling all the plumbing that an agent needs to run in production. eve is the framework Vercel builds and runs its own agents on. According to Vercel's post, it runs more than a hundred agents in production today.

What is eve?

eve is a filesystem-first framework for durable backend agents. You create an agent as a directory on disk. The directory is the contract. Each file describes one component of the agent. At a glance, the tree shows what an agent is and does. It also shows where it lives and when it acts on its own. The smallest agent that runs is two files. One sets the model. The other sets the instructions.

    // agent/agent.ts
    import { defineAgent } from "eve";
    ...