NVIDIA AI Just Released cuda-oxide: An Experimental Rust-to-CUDA Compiler Backend that Compiles SIMT GPU Kernels Directly to PTX

NVIDIA AI researchers recently released cuda-oxide, an experimental compiler that allows developers to write CUDA SIMT (Single Instruction, Multiple Threads) GPU kernels in standard Rust code. The project compiles Rust directly to PTX (Parallel Thread Execution) — the assembly-like intermediate representation that CUDA uses to target NVIDIA GPUs — without requiring domain-specific languages, foreign function interface bindings, or C/C++ code.

Why This Matters

Writing GPU kernels today typically means writing C++ and using the CUDA programming model directly, or relying on Python-level abstractions like Triton that generate CUDA under the hood. The Rust GPU ecosystem has had projects attempting to bridge this gap — Rust-GPU targets SPIR-V for Vulkan/graphics compute, rust-cuda uses a rustc codegen backend targeting NVVM IR, CubeCL uses an embedded DSL with a JIT runtime that cross-compiles to CUDA/ROCm/WGPU, and std::offload uses LLVM’s implicit offload path.

cuda-oxide occupies a specific position in this space. Its stated design center is “bringing CUDA into Rust” — kernel authoring, device intrinsics, the SIMT execution model, and the CUDA programming model expressed natively in safe Rust — closer in spirit to writing a __global__ function in C++ than to writing a generic Rust function that happens to run on a GPU. By contrast, the closest neighbor, rust-cuda, focuses on “bringing Rust to NVIDIA GPUs”: Rust ergonomics like async/.await, parts of the standard library running on-device, and a Rust-first programming model that abstracts over CUDA concepts. The NVlabs team notes it has been coordinating with rust-cuda maintainers and considers the two projects complementary.

The Compilation Pipeline

At the core of cuda-oxide is a custom rustc codegen backend — the layer in the Rust compiler responsible for generating machine code. Instead of emitting native CPU code, the rustc-codegen-cuda crate intercepts the compiler at the CodegenBackend::codegen_crate() entry point and runs a separate pipeline for device code:

Rust Source → rustc frontend → rustc_public (Stable MIR) → dialect-mir → mem2reg → dialect-llvm → LLVM IR (.ll) → PTX (.ptx)

Several elements of this pipeline deserve explanation:

Why rustc_public? The raw internal MIR representation in rustc changes between nightly versions with no stability guarantees. cuda-oxide uses rustc_public — also known as Stable MIR — which is Rust’s official versioned, stable API over the compiler’s internals. This lets the backend read MIR without breaking on every nightly update.

What is Pliron? The middle stages use Pliron, a Rust-native MLIR-like IR framework written entirely in Rust. Choosing Pliron instead of upstream MLIR means the entire compiler builds with cargo — no C++ toolchain, no CMake, no tablegen. cuda-oxide defines three custom Pliron dialects: dialect-mir (modeling Rust MIR semantics — places, projections, rvalues, terminators), dialect-llvm (modeling LLVM IR with textual .ll export), and dialect-nvvm (NVIDIA GPU intrinsics like thread indexing, barriers, and TMA).

What does llc do? After the dialect-llvm printer serializes the IR into a textual .ll file, the external llc binary (the LLVM static compiler with NVPTX backend) compiles it to PTX assembly. This is the one stage outside pure Rust. The resulting .ptx file is written next to the host binary — for example, target/debug/vecadd.ptx — and loaded by the CUDA driver at runtime.

You as a developer can observe each stage with:

cargo oxide pipeline vecadd

This prints the full trace from Rust MIR through each dialect down to PTX output.

Single-Source Compilation and the Host/Device Split

Host and device code live in the same .rs source file. cargo oxide sets -Z codegen-backend=librustc_codegen_cuda.so, which routes code generation through cuda-oxide’s backend. The backend then scans compiled code for monomorphized functions whose names carry the reserved cuda_oxide_kernel_<hash>_<name> prefix — the namespace that the #[kernel] proc macro creates. Functions matching that prefix go through the cuda-oxide pipeline to produce PTX; all other host code is delegated to rustc’s standard LLVM backend. The result of a single cargo oxide build is a host binary plus a .ptx file.
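
A hypothetical sketch of that routing decision follows. The reserved prefix format comes from the description above; the helper function and its exact parsing are illustrative, not the backend's real code:

```rust
// Illustrative sketch: route compiled functions to the device or host
// pipeline based on the reserved `cuda_oxide_kernel_<hash>_<name>` prefix.
// The hash segment is opaque, so we only check the prefix and separators.
const KERNEL_PREFIX: &str = "cuda_oxide_kernel_";

/// Returns Some(kernel_name) if the symbol names a device kernel, else None.
fn device_kernel_name(symbol: &str) -> Option<&str> {
    let rest = symbol.strip_prefix(KERNEL_PREFIX)?;
    // Expect "<hash>_<name>": split off the hash segment.
    let (_hash, name) = rest.split_once('_')?;
    if name.is_empty() { None } else { Some(name) }
}

fn main() {
    for sym in ["cuda_oxide_kernel_ab12_vecadd", "plain_host_fn"] {
        match device_kernel_name(sym) {
            Some(name) => println!("{sym} -> device pipeline (kernel `{name}`)"),
            None => println!("{sym} -> host LLVM backend"),
        }
    }
}
```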

cargo oxide run vecadd
cargo oxide debug vecadd --tui    # debug with cuda-gdb

Device code from library dependencies is compiled lazily: the backend reads their Stable MIR from .rlib metadata on demand, only compiling functions a kernel actually calls.
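
The lazy strategy amounts to a reachability walk over the call graph. A minimal sketch, with a toy map standing in for the edges that cuda-oxide actually reads from Stable MIR in .rlib metadata:

```rust
use std::collections::{HashMap, HashSet};

// Toy call graph: function name -> callees. In cuda-oxide the edges come
// from Stable MIR read out of .rlib metadata; this map is a stand-in.
fn reachable_from(roots: &[&str], calls: &HashMap<&str, Vec<&str>>) -> HashSet<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut work: Vec<&str> = roots.to_vec();
    while let Some(f) = work.pop() {
        if seen.insert(f.to_string()) {
            if let Some(callees) = calls.get(f) {
                work.extend(callees.iter().copied());
            }
        }
    }
    seen
}

fn main() {
    let calls = HashMap::from([
        ("vecadd", vec!["thread_index_1d"]),
        ("unused_helper", vec!["also_unused"]),
    ]);
    // Only the kernel root and what it transitively calls get compiled.
    let compiled = reachable_from(&["vecadd"], &calls);
    println!("compiling {} device functions", compiled.len());
}
```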

What You Can Write in a Kernel

cuda-oxide supports a meaningful subset of Rust in GPU kernel functions, marked with the #[kernel] attribute macro. This includes:

  • Generic functions with monomorphization — fn scale<T: Copy>(...) is compiled to a concrete PTX kernel per type used at the call site.
  • Closures with captures — closures passed from the host are scalarized and passed as PTX kernel parameters automatically.
  • User-defined structs and enums — standard Rust data structures work inside kernels.
  • Pattern matching — match, if let, and related constructs work in device code.
  • Full GPU intrinsics — the cuda-device crate provides wrappers for thread indexing, warp operations (shfl_sync, ballot_sync, etc.), shared memory, barriers, TMA (Tensor Memory Accelerator), Thread Block Clusters, and scoped atomics (6 types × 3 scopes × 5 orderings).
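
Monomorphization works the same way it does in host Rust, which a plain CPU sketch makes concrete; under cuda-oxide each instantiation of a #[kernel] generic would likewise become its own PTX kernel:

```rust
use std::ops::Mul;

// A generic function: rustc monomorphizes one concrete instance per type
// used at a call site (here scale::<f32> and scale::<i32>).
fn scale<T: Mul<Output = T> + Copy>(input: &[T], factor: T) -> Vec<T> {
    input.iter().map(|&x| x * factor).collect()
}

fn main() {
    // Two call sites with different types: two monomorphized instances.
    let a = scale(&[1.0f32, 2.0], 2.5);
    let b = scale(&[1i32, 2], 3);
    println!("{a:?} {b:?}");
}
```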

One important GPU-specific compiler detail: rustc’s JumpThreading MIR optimization — which duplicates function calls into both branches of an if-statement — is disabled for device code in cuda-oxide. On CPUs this is a safe optimization, but on GPUs it breaks barrier semantics: all threads in a block must converge at the same bar.sync instruction, and duplicating it across branches violates that requirement. Additionally, sync primitives are marked convergent in the emitted LLVM IR so that LLVM’s optimization passes cannot move or duplicate them across control flow.
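
A CPU analogue using std::sync::Barrier illustrates the invariant being protected; the mapping to bar.sync is ours and only illustrative, but the shape is the point — one barrier call site after the branch, never one per branch:

```rust
use std::sync::{Arc, Barrier};
use std::thread;

// CPU analogue of block-wide convergence: every thread must reach the
// SAME barrier call site. Keeping `wait()` outside the branch is the
// shape cuda-oxide preserves by disabling JumpThreading for device code.
fn run_block(n: usize) -> usize {
    let barrier = Arc::new(Barrier::new(n));
    let handles: Vec<_> = (0..n)
        .map(|tid| {
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                // Divergent work per thread is fine...
                let value = if tid % 2 == 0 { tid * 10 } else { tid + 100 };
                // ...as long as all threads converge here. Duplicating this
                // wait() into both branches above is exactly what would
                // break bar.sync semantics on a GPU.
                barrier.wait();
                value
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    println!("total after convergence = {}", run_block(4));
}
```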

How to Use cuda-oxide

NVlabs cuda-oxide — Step-by-Step Guide
Rust → Stable MIR → Pliron IR → LLVM IR → PTX  |  v0.1.0
Step 01 of 09  ·  Prerequisites

What You Need Before You Start

cuda-oxide has specific version requirements for each dependency. Before installing anything, verify your system meets all of these. The project is currently Linux-only (tested on Ubuntu 24.04).

  • Linux (Ubuntu 24.04)
  • Rust nightly
  • CUDA Toolkit 12.x+
  • LLVM 21+
  • Clang 21 / libclang-common-21-dev
  • Git
ⓘ Why LLVM 21? Simple kernels may work on LLVM 20, but anything targeting Hopper or Blackwell — TMA, tcgen05, WGMMA — requires llc from LLVM 21 or later. This is a hard requirement, not a recommendation.

Check your current CUDA version to confirm compatibility:

nvcc --version
Step 02 of 09  ·  Install Rust Nightly

Set Up the Rust Nightly Toolchain

cuda-oxide requires Rust nightly with two additional components: rust-src and rustc-dev. The toolchain is pinned to nightly-2026-04-03 via rust-toolchain.toml in the repository — it will be installed automatically when you first run a build inside the repo.

If you need to install it manually:

# Install the pinned nightly toolchain
rustup toolchain install nightly-2026-04-03

# Add required components
rustup component add rust-src rustc-dev \
  --toolchain nightly-2026-04-03

# Confirm the toolchain is active
rustup show
ⓘ Why these components? rustc-dev exposes the internal compiler APIs that the custom codegen backend hooks into. rust-src is needed so the compiler can find and compile its own standard library sources for the device target.
Step 03 of 09  ·  Install LLVM 21

Install LLVM 21 with the NVPTX Backend

The cuda-oxide pipeline emits textual LLVM IR (.ll files) and hands them to the external llc binary to produce PTX. You need LLVM 21 or later with the NVPTX backend enabled.

# Ubuntu/Debian
sudo apt install llvm-21

# Verify the NVPTX backend is present
llc-21 --version | grep nvptx

The pipeline auto-discovers llc-22 and llc-21 on your PATH in that order. To pin a specific binary, set the environment variable:

# Pin to a specific llc binary
export CUDA_OXIDE_LLC=/usr/bin/llc-21
⚠ Common Failure If NVPTX does not appear in the output of llc-21 --version, your LLVM build was compiled without the NVPTX target. Install from the official LLVM apt repository rather than your distro’s default packages, which may omit GPU backends.
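
The documented discovery order can be sketched as a small pure function. This is an assumption-laden illustration, not the real implementation: `pick_llc` is our name, and the `available` slice stands in for an actual PATH scan:

```rust
// Sketch of the documented order: an explicit CUDA_OXIDE_LLC override
// wins; otherwise prefer llc-22, then llc-21, from what is on PATH.
fn pick_llc(env_override: Option<&str>, available: &[&str]) -> Option<String> {
    if let Some(path) = env_override {
        return Some(path.to_string());
    }
    for candidate in ["llc-22", "llc-21"] {
        if available.contains(&candidate) {
            return Some(candidate.to_string());
        }
    }
    None
}

fn main() {
    // Only llc-21 installed, no override: llc-21 is chosen.
    println!("{:?}", pick_llc(None, &["llc-21"]));
    // An explicit override always wins.
    println!("{:?}", pick_llc(Some("/usr/bin/llc-21"), &["llc-22"]));
}
```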
Step 04 of 09  ·  Install Clang

Install Clang 21 for the cuda-bindings Crate

The cuda-bindings crate uses bindgen to generate FFI bindings to cuda.h at build time. bindgen needs libclang — and specifically, it needs Clang’s own resource directory (which includes stddef.h). A bare libclang1-* runtime package is not enough.

# Install the full clang-21 package (includes resource headers)
sudo apt install clang-21

# Alternatively, the -dev header package also works
sudo apt install libclang-common-21-dev
⚠ Symptom of Missing Clang If you only install the runtime but not the headers, the host build will fail with a cryptic 'stddef.h' file not found error during bindgen. Run cargo oxide doctor in the next step to catch this before attempting a build.
Step 05 of 09  ·  Install cargo-oxide

Clone the Repo and Install cargo-oxide

cargo-oxide is a Cargo subcommand that drives the entire build pipeline — running cargo oxide build, cargo oxide run, cargo oxide debug, and cargo oxide pipeline.

Inside the repo (for trying examples):

git clone https://github.com/NVlabs/cuda-oxide.git
cd cuda-oxide

# cargo oxide works out of the box via a workspace alias
cargo oxide run vecadd

Outside the repo (for your own projects):

# Install globally from the git source
cargo install \
  --git https://github.com/NVlabs/cuda-oxide.git \
  cargo-oxide

# On first run, cargo-oxide fetches and builds the codegen backend

Then verify all prerequisites are in place with the built-in health check:

cargo oxide doctor
ⓘ What doctor checks It validates your Rust toolchain (nightly, rust-src, rustc-dev), CUDA Toolkit, LLVM version and NVPTX support, Clang/libclang headers, and the codegen backend binary. Fix any red items before proceeding.
Step 06 of 09  ·  Run Your First Kernel

Build and Run the vecadd Example

The canonical first example is vecadd — a vector addition kernel that adds two arrays of 1,024 f32 values on the GPU and verifies the result on the host.

# Build and run end-to-end
cargo oxide run vecadd

If everything is configured correctly, you will see:

✓ SUCCESS: All 1024 elements correct!

To see the full compilation pipeline — from Rust MIR through each Pliron dialect down to PTX — run:

# Print the full Rust MIR → dialect-mir → mem2reg → dialect-llvm → LLVM IR → PTX trace
cargo oxide pipeline vecadd

To debug with cuda-gdb:

cargo oxide debug vecadd --tui
ⓘ Output artifacts A successful build produces two files: target/debug/vecadd (the host binary) and target/debug/vecadd.ptx (the device code). The host binary loads the PTX file via the CUDA driver at runtime.
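
For reference, the host-side check that produces the SUCCESS line amounts to an elementwise comparison. A CPU-only sketch (function names are ours; the GPU computes `c`, here we compute it directly to show the verification logic):

```rust
// CPU reference for the vecadd check: c[i] must equal a[i] + b[i]
// for all 1024 elements, within floating-point tolerance.
fn verify_vecadd(a: &[f32], b: &[f32], c: &[f32]) -> bool {
    a.len() == b.len()
        && b.len() == c.len()
        && a.iter().zip(b).zip(c).all(|((x, y), z)| (x + y - z).abs() < 1e-6)
}

fn main() {
    let a: Vec<f32> = (0..1024).map(|i| i as f32).collect();
    let b: Vec<f32> = (0..1024).map(|i| (i * 2) as f32).collect();
    // On the GPU this vector comes back from the kernel; here we compute it.
    let c: Vec<f32> = a.iter().zip(&b).map(|(x, y)| x + y).collect();
    if verify_vecadd(&a, &b, &c) {
        println!("SUCCESS: All {} elements correct!", c.len());
    }
}
```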
Step 07 of 09  ·  Write a Kernel

Writing Your Own #[kernel] Function

A kernel function is annotated with #[kernel]. Use DisjointSlice<T> for mutable outputs and &[T] for read-only inputs. Access the thread’s unique hardware index with thread::index_1d().

use cuda_device::{kernel, thread, DisjointSlice};

// Tier 1 safety: race-free by construction, no `unsafe` needed.
// DisjointSlice::get_mut() only accepts a ThreadIndex —
// a hardware-derived opaque type guaranteeing unique writes per thread.
#[kernel]
pub fn scale(input: &[f32], factor: f32, mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(elem) = out.get_mut(idx) {
        *elem = input[idx.get()] * factor;
    }
}
ⓘ Tier 1 Safety — how it works ThreadIndex is an opaque newtype around usize that can only be created from hardware built-in registers (threadIdx, blockIdx, blockDim). Since each thread gets a unique value, and DisjointSlice::get_mut() only accepts a ThreadIndex, writes are race-free by construction — no unsafe anywhere in the kernel.
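
The same pattern can be demonstrated on the CPU in safe Rust. This sketch mimics the API shape (the type names match the docs, but the implementation is ours), with a sequential loop standing in for the SIMT launch that would mint one unique index per hardware thread:

```rust
// CPU sketch of the Tier 1 pattern. On the GPU a ThreadIndex can only be
// minted from hardware registers; here the launch loop plays that role by
// handing each simulated "thread" exactly one unique index.
#[derive(Clone, Copy)]
struct ThreadIndex(usize); // opaque newtype: the field stays private

impl ThreadIndex {
    fn get(&self) -> usize { self.0 }
}

struct DisjointSlice<'a, T>(&'a mut [T]);

impl<'a, T> DisjointSlice<'a, T> {
    // Only a ThreadIndex unlocks mutable access, and each thread holds a
    // unique one, so no two writers can alias the same element.
    fn get_mut(&mut self, idx: ThreadIndex) -> Option<&mut T> {
        self.0.get_mut(idx.0)
    }
}

fn scale_sim(input: &[f32], factor: f32) -> Vec<f32> {
    let mut buf = vec![0.0f32; input.len()];
    let mut out = DisjointSlice(&mut buf[..]);
    // Simulate one SIMT launch: each index runs the kernel body once.
    for i in 0..input.len() {
        let idx = ThreadIndex(i); // on GPU: derived from threadIdx/blockIdx
        if let Some(elem) = out.get_mut(idx) {
            *elem = input[idx.get()] * factor;
        }
    }
    buf
}

fn main() {
    println!("{:?}", scale_sim(&[1.0, 2.0, 3.0], 2.5));
}
```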
Step 08 of 09  ·  Launch from Host

Launching the Kernel from Host Code

Host and device code live in the same .rs file. The host side uses CudaContext, DeviceBuffer, and the cuda_launch! macro to manage GPU memory and dispatch.

use cuda_core::{CudaContext, DeviceBuffer, LaunchConfig};
use cuda_host::{cuda_launch, load_kernel_module};

fn main() {
    // Initialize GPU context on device 0
    let ctx    = CudaContext::new(0).unwrap();
    let stream = ctx.default_stream();
    let module = load_kernel_module(&ctx, "scale_example").unwrap();

    // Upload input data to GPU memory
    let data: Vec<f32> = (0..1024).map(|i| i as f32).collect();
    let input  = DeviceBuffer::from_host(&stream, &data).unwrap();
    let mut output = DeviceBuffer::<f32>::zeroed(&stream, 1024).unwrap();

    // Dispatch the kernel — LaunchConfig auto-sizes blocks/grids
    cuda_launch! {
        kernel: scale,
        stream: stream,
        module: module,
        config: LaunchConfig::for_num_elems(1024),
        args: [slice(input), 2.5f32, slice_mut(output)]
    }.unwrap();

    // Download result back to host
    let result = output.to_host_vec(&stream).unwrap();
    assert!((result[1] - 2.5).abs() < 1e-5);
    println!("✓ Kernel ran successfully!");
}
ⓘ What cuda_launch! does It scalarizes the argument list — flattening slices, scalars, and captured closures — into PTX kernel parameters and dispatches the kernel on the given stream. No manual argument marshalling is required.
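
A loose sketch of the scalarization idea, under heavy assumptions: the enum names and parameter encoding below are invented for illustration, while the real macro works over actual device pointers, closure captures, and the driver's launch ABI:

```rust
// Illustrative only: flatten a heterogeneous launch-argument list into
// primitive PTX kernel parameters (pointers become addresses, scalars
// pass through). None of these types are the cuda-oxide API.
#[derive(Debug, PartialEq)]
enum PtxParam {
    Ptr(u64), // device address of a buffer
    Len(u64), // slice length travels as its own parameter
    F32(f32),
}

enum LaunchArg {
    Slice { addr: u64, len: u64 },
    Scalar(f32),
}

fn scalarize(args: &[LaunchArg]) -> Vec<PtxParam> {
    let mut out = Vec::new();
    for arg in args {
        match arg {
            // A Rust slice is a (pointer, length) pair: two PTX params.
            LaunchArg::Slice { addr, len } => {
                out.push(PtxParam::Ptr(*addr));
                out.push(PtxParam::Len(*len));
            }
            LaunchArg::Scalar(x) => out.push(PtxParam::F32(*x)),
        }
    }
    out
}

fn main() {
    let params = scalarize(&[
        LaunchArg::Slice { addr: 0x7f00_0000, len: 1024 },
        LaunchArg::Scalar(2.5),
    ]);
    println!("{params:?}"); // 3 flat kernel parameters
}
```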
Step 09 of 09  ·  Next Steps

What to Explore Next

You have a working cuda-oxide setup. Here are the high-value paths forward, ordered by complexity:

  • Generic kernels with monomorphization — try the generic example (cargo oxide run generic) to see how fn scale<T: Copy> compiles to separate PTX kernels per type.
  • Closures with captures — the host_closure example shows how a move |x: f32| x * factor closure is scalarized and passed as PTX kernel parameters automatically.
  • Async GPU execution — cuda_launch_async! returns a lazy DeviceOperation that executes on .sync() or .await. See the async_mlp and async_vecadd examples.
  • Shared memory and warp intrinsics — these require scoped unsafe blocks with documented safety contracts. See Tier 2 in the safety model documentation.
  • GEMM at Speed-of-Light — the gemm_sol example achieves 868 TFLOPS on B200 (58% of cuBLAS SoL) using cta_group::2, CLC, and a 4-stage pipeline.
  • Blackwell tensor cores — the tcgen05 example targets sm_100a with TMEM, MMA, and cta_group::2. Requires LLVM 21+.
ⓘ Known Limitation in v0.1.0 index_2d(stride) is documented as currently unsound — if threads in the same kernel use different stride values, two threads can get &mut T to the same element with no unsafe in sight. Until the fix lands (lifting stride into a type parameter), bind stride to a single let binding and reuse it at every call site.
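
The hazard can be seen with plain arithmetic, assuming index_2d flattens a 2-D coordinate as y * stride + x (a simplification of the real mapping):

```rust
// Why differing strides are unsound: two threads using DIFFERENT stride
// values can map to the same flat element index, which would hand out two
// &mut T to one location with no `unsafe` anywhere.
fn flat_index(x: usize, y: usize, stride: usize) -> usize {
    y * stride + x
}

fn main() {
    // Thread A at (4, 1) with stride 8, thread B at (0, 2) with stride 6:
    let a = flat_index(4, 1, 8); // 1 * 8 + 4 = 12
    let b = flat_index(0, 2, 6); // 2 * 6 + 0 = 12 — same element!
    assert_eq!(a, b);
    println!("collision at element {a}: two mutable writers, zero `unsafe`");
}
```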

Full documentation: nvlabs.github.io/cuda-oxide  ·  Source: github.com/NVlabs/cuda-oxide

Document Created by Marktechpost.com

Key Takeaways

  • cuda-oxide is a custom rustc codegen backend from NVlabs that compiles #[kernel]-annotated Rust functions to PTX through a Rust → rustc_public Stable MIR → Pliron IR → LLVM IR → PTX pipeline, all buildable with cargo.
  • Host and device code coexist in a single .rs file, compiled with one cargo oxide build command; the output is a host binary plus a .ptx file placed next to it.
  • The safety model has three documented tiers: Tier 1 (race-free by construction via DisjointSlice<T> + ThreadIndex), Tier 2 (scoped unsafe for shared memory, warp intrinsics, atomics), and Tier 3 (raw hardware intrinsics for TMA, WGMMA, tcgen05). index_2d(stride) is documented as currently unsound in the 0.x release.
  • The gemm_sol example hits 868 TFLOPS on the B200 (58% of cuBLAS SoL) using a multi-phase GEMM pipeline with CLC and cta_group::2.



