Posts

A Coding Implementation on Qwen 3.6-35B-A3B Covering Multimodal Inference, Thinking Control, Tool Calling, MoE Routing, RAG, and Session Persistence

In this tutorial, we build an end-to-end implementation around Qwen 3.6-35B-A3B and explore how a modern multimodal MoE model can be used in practical workflows. We begin by setting up the environment, loading the model adaptively based on available GPU memory, and creating a reusable chat framework that supports both standard responses and explicit thinking traces. From there, we work through important capabilities such as thinking-budget control, streamed generation with separated reasoning and answers, vision input handling, tool calling, structured JSON generation, MoE routing inspection, benchmarking, retrieval-augmented generation, and session persistence. Through this process, we run the model for inference and also examine how to design a robust application layer on top of Qwen 3.6 for real experimentation and advanced prototyping. Copy Code Copied Use a different Browser import subprocess, sys def _pip(*a): subprocess.check_call([sys.executable, "-m", ...

Moonshot AI Releases Kimi K2.6 with Long-Horizon Coding, Agent Swarm Scaling to 300 Sub-Agents and 4,000 Coordinated Steps

Image
Moonshot AI, the Chinese AI lab behind the Kimi assistant, today open-sourced Kimi K2.6 — a native multimodal agentic model that pushes the boundaries of what an AI system can do when left to run autonomously on hard software engineering problems. The release targets practical deployment scenarios: long-running coding agents, front-end generation from natural language, massively parallel agent swarms coordinating hundreds of specialized sub-agents simultaneously, and a new open ecosystem where humans and agents from any device collaborate on the same task. The model is available now on Kimi.com, the Kimi App, the API, and Kimi Code CLI. Weights are published on Hugging Face under a Modified MIT License. What Kind of Model is This, Technically? Kimi K2.6 is a Mixture-of-Experts (MoE) model — an architecture that’s become increasingly dominant at frontier scale. Instead of activating all of a model’s parameters for every token it processes, a MoE model routes each token to a small su...

A Coding Implementation on Microsoft’s Phi-4-Mini for Quantized Inference Reasoning Tool Use RAG and LoRA Fine-Tuning

Image
In this tutorial, we build a pipeline on Phi-4-mini to explore how a compact yet highly capable language model can handle a full range of modern LLM workflows within a single notebook. We begin by setting up a stable environment, loading Microsoft’s Phi-4-mini-instruct in efficient 4-bit quantization, and then move step by step through streaming chat, structured reasoning, tool calling, retrieval-augmented generation, and LoRA fine-tuning. Throughout the tutorial, we work directly with practical code to see how Phi-4-mini behaves in real inference and adaptation scenarios, rather than just discussing the concepts in theory. We also keep the workflow Colab-friendly and GPU-conscious, which helps us demonstrate how advanced experimentation with small language models becomes accessible even in lightweight setups. Copy Code Copied Use a different Browser import subprocess, sys, os, shutil, glob def pip_install(args): subprocess.run([sys.executable, "-m", "pi...

OpenAI Scales Trusted Access for Cyber Defense With GPT-5.4-Cyber: a Fine-Tuned Model Built for Verified Security Defenders

Cybersecurity has always had a dual-use problem: the same technical knowledge that helps defenders find vulnerabilities can also help attackers exploit them. For AI systems, that tension is sharper than ever. Restrictions intended to prevent harm have historically created friction for good-faith security work, and it can be genuinely difficult to tell whether any particular cyber action is intended for defensive usage or to cause harm. OpenAI is now proposing a concrete structural solution to that problem: verified identity, tiered access, and a purpose-built model for defenders. OpenAI team announced that it is scaling up its Trusted Access for Cyber (TAC) program to thousands of verified individual defenders and hundreds of teams responsible for defending critical software. The main focus of this expansion is the introduction of GPT-5.4-Cyber , a variant of GPT-5.4 fine-tuned specifically for defensive cybersecurity use cases. What Is GPT-5.4-Cyber and How Does It Differ From Stan...

Moonshot AI and Tsinghua Researchers Propose PrfaaS: A Cross-Datacenter KVCache Architecture that Rethinks How LLMs are Served at Scale

Image
For years, the way large language models handle inference has been stuck inside a box — literally. The high-bandwidth RDMA networks that make modern LLM serving work have confined both prefill and decode to the same datacenter, sometimes even the same rack. A team of researchers at Moonshot AI and Tsinghua University is making the case that this constraint is about to break down — and that the right architecture can already exploit that shift. The research team introduces Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture that selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the resulting KVCache over commodity Ethernet to local PD clusters for decode. The result, in a case study using an internal 1T-parameter hybrid model, is 54% higher serving throughput than a homogeneous PD baseline and 32% higher than a naive heterogeneous setup — while consuming only a fraction of available cross-datacenter bandwidth. The r...

A Coding Implementation to Build an AI-Powered File Type Detection and Security Analysis Pipeline with Magika and OpenAI

Image
In this tutorial, we build a workflow that combines Magika’s deep-learning-based file type detection with OpenAI’s language intelligence to create a practical and insightful analysis pipeline. We begin by setting up the required libraries, securely connecting to the OpenAI API, and initializing Magika to classify files directly from raw bytes rather than relying on filenames or extensions. As we move through the tutorial, we explore batch scanning, confidence modes, spoofed-file detection, forensic-style analysis, upload-pipeline risk scoring, and structured JSON reporting. At each stage, we use GPT to translate technical scan outputs into clear explanations, security insights, and executive-level summaries, allowing us to connect low-level byte detection with meaningful real-world interpretation. Copy Code Copied Use a different Browser !pip install magika openai -q import os, io, json, zipfile, textwrap, hashlib, tempfile, getpass from pathlib import Path from collectio...