Posts

NVIDIA AI Just Released cuda-oxide: An Experimental Rust-to-CUDA Compiler Backend that Compiles SIMT GPU Kernels Directly to PTX

NVIDIA AI researchers recently released cuda-oxide, an experimental compiler that allows developers to write CUDA SIMT (Single Instruction, Multiple Threads) GPU kernels in standard Rust code. The project compiles Rust directly to PTX (Parallel Thread Execution) — the assembly-like intermediate representation that CUDA uses to target NVIDIA GPUs — without requiring domain-specific languages, foreign function interface bindings, or C/C++ code.

How This Makes a Change

Writing GPU kernels today typically means writing C++ and using the CUDA programming model directly, or relying on Python-level abstractions like Triton that generate CUDA under the hood. The Rust GPU ecosystem has had projects attempting to bridge this gap — Rust-GPU targets SPIR-V for Vulkan/graphics compute, rust-cuda uses a rustc codegen backend targeting NVVM IR, CubeCL uses an embedded DSL with a JIT runtime that cross-compiles to CUDA/ROCm/WGPU, and std::offload uses LLVM’s implicit offload path. ...
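The SIMT model that PTX backends like cuda-oxide target gives every thread a block index and a thread index, from which it derives the one element it is responsible for. A minimal CPU-side sketch in plain Python (no GPU; the kernel and launcher names here are illustrative, not cuda-oxide's actual API) shows the indexing convention:

```python
# CPU-side emulation of CUDA's SIMT indexing. Illustrative only:
# these names are not cuda-oxide's API.
def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    # Each SIMT thread computes one global index and handles one element.
    i = block_idx * block_dim + thread_idx
    if i < len(a):  # guard: trailing threads may fall past the array end
        out[i] = a[i] + b[i]

def launch(kernel, grid_dim, block_dim, *args):
    # Sequentially emulate a grid of blocks, each with block_dim threads.
    for b in range(grid_dim):
        for t in range(block_dim):
            kernel(b, t, block_dim, *args)

a = list(range(10))
b = [x * 2 for x in a]
out = [0] * 10
launch(vector_add_kernel, 3, 4, a, b, out)  # 3 blocks x 4 threads = 12 threads
print(out)
```

A real backend compiles the kernel body once and runs it across thousands of hardware threads in parallel; the bounds guard is needed because the grid is usually rounded up past the data size.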

A Coding Implementation to Recover Hidden Malware IOCs with FLARE-FLOSS Beyond Classic Strings Analysis

In this tutorial, we explore how FLARE-FLOSS helps us recover hidden and obfuscated strings from a Windows PE file. We begin by setting up FLOSS and the MinGW-w64 cross-compiler. We synthesize a small malware-like executable that hides strings using multiple techniques, including static strings, stack-built strings, tight strings, and XOR-decoded strings. After that, we compare the limitations of the traditional strings utility with FLOSS’s deeper static analysis and emulation-based string recovery. Through this process, we learn how analysts can uncover URLs, registry paths, suspicious APIs, and other indicators of compromise that plain string extraction often misses.

import subprocess, os, sys, json, re, time
from pathlib import Path

def banner(t):
    print("\n" + "═"*72 + f"\n {t}\n" + "═"*72)

def sh(cmd, quiet=False, check=False):
    r = subprocess.run(cmd, shell=True, capture_output=True, text=Tr...
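One of the hiding techniques named above, XOR-decoded strings, can be sketched in a few lines. The sample stores an indicator of compromise XOR-encoded, so its literal bytes never appear in the file and plain string extraction cannot match them; the decode loop mirrors what the binary does at runtime, which is the behavior FLOSS reproduces through emulation (the key and URL below are made-up placeholders, not from the tutorial's sample):

```python
# Single-byte XOR string obfuscation, as recovered by FLOSS's emulation.
KEY = 0x5A
ioc = "http://example.invalid/beacon"  # placeholder IOC, not a real indicator
encoded = bytes(b ^ KEY for b in ioc.encode())

# The plaintext URL never occurs literally in the binary image, so a
# classic `strings` pass over the file cannot surface it.
assert ioc.encode() not in encoded

# At runtime (or under FLOSS's emulation) the decode loop runs and the
# hidden indicator of compromise becomes visible.
decoded = bytes(b ^ KEY for b in encoded).decode()
print("recovered IOC:", decoded)
```

FLOSS finds such strings by emulating the decoding routines it identifies in the binary, rather than by scanning for printable runs.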

NVIDIA AI Releases Star Elastic: One Checkpoint that Contains 30B, 23B, and 12B Reasoning Models with Zero-Shot Slicing

Training a family of large language models (LLMs) has always come with a painful multiplier: every model variant in the family—whether 8B, 30B, or 70B—typically requires its own full training run, its own storage, and its own deployment stack. For a dev team running inference at scale, this means multiplying compute costs by the number of model sizes they want to support. NVIDIA researchers are now proposing a different approach called Star Elastic. Star Elastic is a post-training method that embeds multiple nested submodels—at different parameter budgets—inside a single parent reasoning model, using a single training run. Applied to Nemotron Nano v3 (a hybrid Mamba–Transformer–MoE model with 30B total parameters and 3.6B active parameters), Star Elastic produces 23B (2.8B active) and 12B (2.0B active) nested variants trained with approximately 160B tokens. All three variants live in one checkpoint and can be extracted without any additional fine-tuning. What ...
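The idea of nested submodels sharing one checkpoint can be illustrated with a toy weight-slicing sketch: the smaller variants live inside the parent's parameter tensors, so extracting one is a zero-shot slice rather than a new training run. The layout, shapes, and layer names below are made up for illustration and are not Star Elastic's actual mechanism:

```python
# Toy "nested submodels in one checkpoint" sketch. Not Star Elastic's
# real layout; shapes and names are illustrative.
def make_linear(rows, cols):
    # Deterministic dummy weights standing in for a trained layer.
    return [[(r * cols + c) * 0.01 for c in range(cols)] for r in range(rows)]

parent = {"ffn.weight": make_linear(8, 8)}  # stands in for the parent model

def slice_checkpoint(ckpt, keep_rows, keep_cols):
    # Zero-shot extraction: keep the leading block of every weight matrix.
    return {name: [row[:keep_cols] for row in w[:keep_rows]]
            for name, w in ckpt.items()}

child = slice_checkpoint(parent, keep_rows=6, keep_cols=6)  # smaller variant
print(len(child["ffn.weight"]), len(child["ffn.weight"][0]))
```

The key property the sketch captures is that the child's weights are exactly a sub-block of the parent's, which is why one checkpoint can serve every variant with no additional fine-tuning.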