{"id":89398,"date":"2026-05-10T16:17:19","date_gmt":"2026-05-10T16:17:19","guid":{"rendered":"https:\/\/youzum.net\/nvidia-ai-just-released-cuda-oxide-an-experimental-rust-to-cuda-compiler-backend-that-compiles-simt-gpu-kernels-directly-to-ptx\/"},"modified":"2026-05-10T16:17:19","modified_gmt":"2026-05-10T16:17:19","slug":"nvidia-ai-just-released-cuda-oxide-an-experimental-rust-to-cuda-compiler-backend-that-compiles-simt-gpu-kernels-directly-to-ptx","status":"publish","type":"post","link":"https:\/\/youzum.net\/es\/nvidia-ai-just-released-cuda-oxide-an-experimental-rust-to-cuda-compiler-backend-that-compiles-simt-gpu-kernels-directly-to-ptx\/","title":{"rendered":"NVIDIA AI Just Released cuda-oxide: An Experimental Rust-to-CUDA Compiler Backend that Compiles SIMT GPU Kernels Directly to PTX"},"content":{"rendered":"<p>NVIDIA AI researchers recently released <strong><a href=\"https:\/\/github.com\/NVlabs\/cuda-oxide\" target=\"_blank\" rel=\"noreferrer noopener\">cuda-oxide<\/a><\/strong>, an experimental compiler that allows developers to write CUDA SIMT (Single Instruction, Multiple Threads) GPU kernels in standard Rust code. The project compiles Rust directly to PTX (Parallel Thread Execution) \u2014 the assembly-like intermediate representation that CUDA uses to target NVIDIA GPUs \u2014 without requiring domain-specific languages, foreign function interface bindings, or C\/C++ code.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Why This Matters<\/strong><\/h3>\n<p>Writing GPU kernels today typically means writing C++ and using the CUDA programming model directly, or relying on Python-level abstractions like Triton that generate CUDA under the hood. 
The Rust GPU ecosystem has had projects attempting to bridge this gap \u2014 Rust-GPU targets SPIR-V for Vulkan\/graphics compute, rust-cuda uses a rustc codegen backend targeting NVVM IR, CubeCL uses an embedded DSL with a JIT runtime that cross-compiles to CUDA\/ROCm\/WGPU, and <code>std::offload<\/code> uses LLVM\u2019s implicit offload path.<\/p>\n<p>cuda-oxide occupies a specific position in this space. Its stated design center is \u201cbringing CUDA into Rust\u201d \u2014 kernel authoring, device intrinsics, the SIMT execution model, and the CUDA programming model expressed natively in safe Rust \u2014 closer in spirit to writing a <code>__global__<\/code> function in C++ than to writing a generic Rust function that happens to run on a GPU. By contrast, the closest neighbor, rust-cuda, focuses on \u201cbringing Rust to NVIDIA GPUs\u201d: Rust ergonomics like <code>async<\/code>\/<code>.await<\/code>, parts of the standard library running on-device, and a Rust-first programming model that abstracts over CUDA concepts. The NVlabs team notes it has been coordinating with rust-cuda maintainers and considers the two projects complementary.<\/p>\n<h3 class=\"wp-block-heading\"><strong>The Compilation Pipeline<\/strong><\/h3>\n<p>At the core of cuda-oxide is a custom <code>rustc<\/code> codegen backend \u2014 the layer in the Rust compiler responsible for generating machine code. 
Instead of emitting native CPU code, the <code>rustc-codegen-cuda<\/code> crate intercepts the compiler at the <code>CodegenBackend::codegen_crate()<\/code> entry point and runs a separate pipeline for device code:<\/p>\n<p><strong>Rust Source \u2192 rustc frontend \u2192 <code>rustc_public<\/code> (Stable MIR) \u2192 <code>dialect-mir<\/code> \u2192 <code>mem2reg<\/code> \u2192 <code>dialect-llvm<\/code> \u2192 LLVM IR (.ll) \u2192 PTX (.ptx)<\/strong><\/p>\n<p><strong>Here are some important elements:<\/strong><\/p>\n<p><strong>Why <code>rustc_public<\/code>?<\/strong> The raw internal MIR representation in <code>rustc<\/code> changes between nightly versions with no stability guarantees. cuda-oxide uses <code>rustc_public<\/code> \u2014 also known as Stable MIR \u2014 which is Rust\u2019s official versioned, stable API over the compiler\u2019s internals. This lets the backend read MIR without breaking on every nightly update.<\/p>\n<p><strong>What is Pliron?<\/strong> The middle stages use <a href=\"https:\/\/github.com\/vaivaswatha\/pliron\">Pliron<\/a>, a Rust-native MLIR-like IR framework written entirely in Rust. Choosing Pliron instead of upstream MLIR means the entire compiler builds with <code>cargo<\/code> \u2014 no C++ toolchain, no CMake, no tablegen. cuda-oxide defines three custom Pliron dialects: <code>dialect-mir<\/code> (modeling Rust MIR semantics \u2014 places, projections, rvalues, terminators), <code>dialect-llvm<\/code> (modeling LLVM IR with textual <code>.ll<\/code> export), and <code>dialect-nvvm<\/code> (NVIDIA GPU intrinsics like thread indexing, barriers, and TMA).<\/p>\n<p><strong>What does <code>llc<\/code> do?<\/strong> After the <code>dialect-llvm<\/code> printer serializes the IR into a textual <code>.ll<\/code> file, the external <code>llc<\/code> binary (the LLVM static compiler with NVPTX backend) compiles it to PTX assembly. This is the one stage outside pure Rust. 
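<\/p>
<p>The <code>llc<\/code> hand-off can be sketched in ordinary host-side Rust. Everything below is an illustrative assumption: the flag set, the <code>sm_90a<\/code> target, and the file paths are not taken from the cuda-oxide source, and the command is built but never spawned, so the sketch runs without LLVM installed.<\/p>

```rust
// Hypothetical sketch of the llc hand-off: build the PTX-producing command.
// Flags, target (sm_90a), and paths are illustrative assumptions, not taken
// from the cuda-oxide source. The command is constructed but never spawned.
use std::process::Command;

fn llc_command(ll_path: &str, ptx_path: &str) -> Command {
    let mut cmd = Command::new("llc-21");
    cmd.args([
        "-march=nvptx64", // select the NVPTX backend
        "-mcpu=sm_90a",   // GPU architecture (assumed Hopper here)
        "-o", ptx_path,
        ll_path,
    ]);
    cmd
}

fn main() {
    let cmd = llc_command("target/debug/vecadd.ll", "target/debug/vecadd.ptx");
    let args: Vec<String> = cmd
        .get_args()
        .map(|a| a.to_string_lossy().into_owned())
        .collect();
    assert_eq!(cmd.get_program(), "llc-21");
    assert!(args.contains(&"-march=nvptx64".to_string()));
    println!("would run: llc-21 {}", args.join(" "));
}
```

<p>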
The resulting <code>.ptx<\/code> file is written next to the host binary \u2014 for example, <code>target\/debug\/vecadd.ptx<\/code> \u2014 and loaded by the CUDA driver at runtime.<\/p>\n<p><strong>You as a developer can observe each stage with:<\/strong><\/p>\n<pre class=\"wp-block-code\"><code>cargo oxide pipeline vecadd<\/code><\/pre>\n<p>This prints the full trace from Rust MIR through each dialect down to PTX output.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Single-Source Compilation and the Host\/Device Split<\/strong><\/h3>\n<p>Host and device code live in the same <code>.rs<\/code> source file. <code>cargo oxide<\/code> sets <code>-Z codegen-backend=librustc_codegen_cuda.so<\/code>, which routes code generation through cuda-oxide\u2019s backend. The backend then scans compiled code for monomorphized functions whose names carry the reserved <code>cuda_oxide_kernel_&lt;hash&gt;_&lt;name&gt;<\/code> prefix \u2014 the namespace that the <code>#[kernel]<\/code> proc macro creates. Functions matching that prefix go through the cuda-oxide pipeline to produce PTX; all other host code is delegated to rustc\u2019s standard LLVM backend. The result of a single <code>cargo oxide build<\/code> is a host binary plus a <code>.ptx<\/code> file.<\/p>\n<pre class=\"wp-block-code\"><code>cargo oxide run vecadd\ncargo oxide debug vecadd --tui    # debug with cuda-gdb<\/code><\/pre>\n<p>Device code from library dependencies is compiled lazily: the backend reads their Stable MIR from <code>.rlib<\/code> metadata on demand, only compiling functions a kernel actually calls.<\/p>\n<h3 class=\"wp-block-heading\"><strong>What You Can Write in a Kernel<\/strong><\/h3>\n<p>cuda-oxide supports a meaningful subset of Rust in GPU kernel functions, marked with the <code>#[kernel]<\/code> attribute macro. 
<strong>This includes:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Generic functions with monomorphization<\/strong> \u2014 <code>fn scale&lt;T: Copy&gt;(...)<\/code> is compiled to a concrete PTX kernel per type used at the call site.<\/li>\n<li><strong>Closures with captures<\/strong> \u2014 closures passed from the host are scalarized and passed as PTX kernel parameters automatically.<\/li>\n<li><strong>User-defined structs and enums<\/strong> \u2014 standard Rust data structures work inside kernels.<\/li>\n<li><strong>Pattern matching<\/strong> \u2014 <code>match<\/code>, <code>if let<\/code>, and related constructs work in device code.<\/li>\n<li><strong>Full GPU intrinsics<\/strong> \u2014 the <code>cuda-device<\/code> crate provides wrappers for thread indexing, warp operations (<code>shfl_sync<\/code>, <code>ballot_sync<\/code>, etc.), shared memory, barriers, TMA (Tensor Memory Accelerator), Thread Block Clusters, and scoped atomics (6 types \u00d7 3 scopes \u00d7 5 orderings).<\/li>\n<\/ul>\n<p>One important GPU-specific compiler detail: rustc\u2019s <code>JumpThreading<\/code> MIR optimization \u2014 which duplicates function calls into both branches of an if-statement \u2014 is <strong>disabled for device code<\/strong> in cuda-oxide. On CPUs this is a safe optimization, but on GPUs it breaks barrier semantics: all threads in a block must converge at the same <code>bar.sync<\/code> instruction, and duplicating it across branches violates that requirement. 
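<\/p>
<p>The convergence hazard can be modeled on the CPU. In the toy sketch below, every name is invented for illustration: a barrier is identified by its call site, and cloning it into both branches of a conditional means threads on different paths arrive at two distinct sites instead of one shared site.<\/p>

```rust
// Toy CPU model of barrier convergence; all names here are invented for
// illustration. A barrier is identified by its call site: with one site
// after the branch every thread converges there, but a cloned barrier
// gives threads on different paths two distinct sites (a GPU deadlock).
use std::collections::HashSet;

fn barrier_site(duplicated: bool, took_then_branch: bool) -> u32 {
    if duplicated {
        // bar.sync cloned into each branch: two distinct call sites
        if took_then_branch { 1 } else { 2 }
    } else {
        // single bar.sync after the if: one site for every thread
        0
    }
}

fn main() {
    // 8 "threads", alternating which branch they take
    let single: HashSet<u32> = (0..8u32).map(|t| barrier_site(false, t % 2 == 0)).collect();
    let cloned: HashSet<u32> = (0..8u32).map(|t| barrier_site(true, t % 2 == 0)).collect();
    assert_eq!(single.len(), 1); // all threads meet at the same barrier
    assert_eq!(cloned.len(), 2); // block split across two barriers: broken
    println!("distinct sites without duplication: {}, with duplication: {}", single.len(), cloned.len());
}
```

<p>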
Additionally, sync primitives are marked <code>convergent<\/code> in the emitted LLVM IR so that LLVM\u2019s optimization passes cannot move or duplicate them across control flow.<\/p>\n<h3 class=\"wp-block-heading\"><strong>How to Use cuda-oxide: A Step-by-Step Guide<\/strong><\/h3>\n<div>\n<p>  <!-- Header --><\/p>\n<div class=\"cog-header\">\n<div>\n<div>\n        <span class=\"cog-badge\">NVlabs<\/span><br \/>\n        <span class=\"cog-title\">cuda-oxide \u2014 Step-by-Step Guide<\/span>\n      <\/div>\n<div class=\"cog-subtitle\">Rust \u2192 Stable MIR \u2192 Pliron IR \u2192 LLVM IR \u2192 PTX \u00a0|\u00a0 v0.1.0<\/div>\n<\/div>\n<\/div>\n<p>  <!-- Slides --><\/p>\n<div class=\"cog-slides\">\n<p>    <!-- Step 1 --><\/p>\n<div class=\"cog-slide active\" data-step=\"1\">\n      <span class=\"cog-step-label\">Step 01 of 09 \u00a0\u00b7\u00a0 Prerequisites<\/span>\n<h2>What You Need Before You Start<\/h2>\n<p>cuda-oxide has specific version requirements for each dependency. Before installing anything, verify your system meets all of these. 
The project is currently <strong>Linux-only<\/strong> (tested on Ubuntu 24.04).<\/p>\n<div class=\"cog-pills\">\n        <span class=\"cog-pill\"><span class=\"dot\"><\/span>Linux (Ubuntu 24.04)<\/span><br \/>\n        <span class=\"cog-pill\"><span class=\"dot\"><\/span>Rust nightly<\/span><br \/>\n        <span class=\"cog-pill\"><span class=\"dot\"><\/span>CUDA Toolkit 12.x+<\/span><br \/>\n        <span class=\"cog-pill\"><span class=\"dot\"><\/span>LLVM 21+<\/span><br \/>\n        <span class=\"cog-pill\"><span class=\"dot\"><\/span>Clang 21 \/ libclang-common-21-dev<\/span><br \/>\n        <span class=\"cog-pill\"><span class=\"dot\"><\/span>Git<\/span>\n      <\/div>\n<div class=\"cog-note\">\n        <strong>\u24d8 Why LLVM 21?<\/strong><br \/>\n        Simple kernels may work on LLVM 20, but anything targeting Hopper or Blackwell \u2014 TMA, tcgen05, WGMMA \u2014 requires <code>llc<\/code> from LLVM 21 or later. This is a hard requirement, not a recommendation.\n      <\/div>\n<p>Check your current CUDA version to confirm compatibility:<\/p>\n<pre><code>nvcc --version<\/code><\/pre>\n<\/div>\n<p>    <!-- Step 2 --><\/p>\n<div class=\"cog-slide\" data-step=\"2\">\n      <span class=\"cog-step-label\">Step 02 of 09 \u00a0\u00b7\u00a0 Install Rust Nightly<\/span>\n<h2>Set Up the Rust Nightly Toolchain<\/h2>\n<p>cuda-oxide requires Rust <strong>nightly<\/strong> with two additional components: <code>rust-src<\/code> and <code>rustc-dev<\/code>. 
The toolchain is pinned to <code>nightly-2026-04-03<\/code> via <code>rust-toolchain.toml<\/code> in the repository \u2014 it will be installed automatically when you first run a build inside the repo.<\/p>\n<p>If you need to install it manually:<\/p>\n<pre><code><span class=\"cm\"># Install the pinned nightly toolchain<\/span>\nrustup toolchain install nightly-2026-04-03\n\n<span class=\"cm\"># Add required components<\/span>\nrustup component add rust-src rustc-dev \n  --toolchain nightly-2026-04-03\n\n<span class=\"cm\"># Confirm the toolchain is active<\/span>\nrustup show<\/code><\/pre>\n<div class=\"cog-note\">\n        <strong>\u24d8 Why these components?<\/strong><br \/>\n        <code>rustc-dev<\/code> exposes the internal compiler APIs that the custom codegen backend hooks into. <code>rust-src<\/code> is needed so the compiler can find and compile its own standard library sources for the device target.\n      <\/div>\n<\/div>\n<p>    <!-- Step 3 --><\/p>\n<div class=\"cog-slide\" data-step=\"3\">\n      <span class=\"cog-step-label\">Step 03 of 09 \u00a0\u00b7\u00a0 Install LLVM 21<\/span>\n<h2>Install LLVM 21 with the NVPTX Backend<\/h2>\n<p>The cuda-oxide pipeline emits textual LLVM IR (<code>.ll<\/code> files) and hands them to the external <code>llc<\/code> binary to produce PTX. You need LLVM 21 or later with the NVPTX backend enabled.<\/p>\n<pre><code><span class=\"cm\"># Ubuntu\/Debian<\/span>\nsudo apt install llvm-21\n\n<span class=\"cm\"># Verify the NVPTX backend is present<\/span>\nllc-21 --version | grep nvptx<\/code><\/pre>\n<p>The pipeline auto-discovers <code>llc-22<\/code> and <code>llc-21<\/code> on your <code>PATH<\/code> in that order. 
To pin a specific binary, set the environment variable:<\/p>\n<pre><code><span class=\"cm\"># Pin to a specific llc binary<\/span>\n<span class=\"kw\">export<\/span> CUDA_OXIDE_LLC=\/usr\/bin\/llc-21<\/code><\/pre>\n<div class=\"cog-warn\">\n        <strong>\u26a0 Common Failure<\/strong><br \/>\n        If NVPTX does not appear in the output of <code>llc-21 --version<\/code>, your LLVM build was compiled without the NVPTX target. Install from the official LLVM apt repository rather than your distro\u2019s default packages, which may omit GPU backends.\n      <\/div>\n<\/div>\n<p>    <!-- Step 4 --><\/p>\n<div class=\"cog-slide\" data-step=\"4\">\n      <span class=\"cog-step-label\">Step 04 of 09 \u00a0\u00b7\u00a0 Install Clang<\/span>\n<h2>Install Clang 21 for the cuda-bindings Crate<\/h2>\n<p>The <code>cuda-bindings<\/code> crate uses <code>bindgen<\/code> to generate FFI bindings to <code>cuda.h<\/code> at build time. <code>bindgen<\/code> needs <code>libclang<\/code> \u2014 and specifically, it needs Clang\u2019s own resource directory (which includes <code>stddef.h<\/code>). A bare <code>libclang1-*<\/code> runtime package is <strong>not enough<\/strong>.<\/p>\n<pre><code><span class=\"cm\"># Install the full clang-21 package (includes resource headers)<\/span>\nsudo apt install clang-21\n\n<span class=\"cm\"># Alternatively, the -dev header package also works<\/span>\nsudo apt install libclang-common-21-dev<\/code><\/pre>\n<div class=\"cog-warn\">\n        <strong>\u26a0 Symptom of Missing Clang<\/strong><br \/>\n        If you only install the runtime but not the headers, the host build will fail with a cryptic <code>'stddef.h' file not found<\/code> error during bindgen. 
Run <code>cargo oxide doctor<\/code> in the next step to catch this before attempting a build.\n      <\/div>\n<\/div>\n<p>    <!-- Step 5 --><\/p>\n<div class=\"cog-slide\" data-step=\"5\">\n      <span class=\"cog-step-label\">Step 05 of 09 \u00a0\u00b7\u00a0 Install cargo-oxide<\/span>\n<h2>Clone the Repo and Install cargo-oxide<\/h2>\n<p><code>cargo-oxide<\/code> is a Cargo subcommand that drives the entire build pipeline \u2014 running <code>cargo oxide build<\/code>, <code>cargo oxide run<\/code>, <code>cargo oxide debug<\/code>, and <code>cargo oxide pipeline<\/code>.<\/p>\n<p><strong>Inside the repo<\/strong> (for trying examples):<\/p>\n<pre><code>git clone https:\/\/github.com\/NVlabs\/cuda-oxide.git\ncd cuda-oxide\n\n<span class=\"cm\"># cargo oxide works out of the box via a workspace alias<\/span>\ncargo oxide run vecadd<\/code><\/pre>\n<p><strong>Outside the repo<\/strong> (for your own projects):<\/p>\n<pre><code><span class=\"cm\"># Install globally from the git source<\/span>\ncargo install \n  --git https:\/\/github.com\/NVlabs\/cuda-oxide.git \n  cargo-oxide\n\n<span class=\"cm\"># On first run, cargo-oxide fetches and builds the codegen backend<\/span><\/code><\/pre>\n<p>Then verify all prerequisites are in place with the built-in health check:<\/p>\n<pre><code>cargo oxide doctor<\/code><\/pre>\n<div class=\"cog-note\">\n        <strong>\u24d8 What doctor checks<\/strong><br \/>\n        It validates your Rust toolchain (nightly, rust-src, rustc-dev), CUDA Toolkit, LLVM version and NVPTX support, Clang\/libclang headers, and the codegen backend binary. 
Fix any red items before proceeding.\n      <\/div>\n<\/div>\n<p>    <!-- Step 6 --><\/p>\n<div class=\"cog-slide\" data-step=\"6\">\n      <span class=\"cog-step-label\">Step 06 of 09 \u00a0\u00b7\u00a0 Run Your First Kernel<\/span>\n<h2>Build and Run the vecadd Example<\/h2>\n<p>The canonical first example is <code>vecadd<\/code> \u2014 a vector addition kernel that adds two arrays of 1,024 <code>f32<\/code> values on the GPU and verifies the result on the host.<\/p>\n<pre><code><span class=\"cm\"># Build and run end-to-end<\/span>\ncargo oxide run vecadd<\/code><\/pre>\n<p>If everything is configured correctly, you will see:<\/p>\n<pre><code>\u2713 SUCCESS: All 1024 elements correct!<\/code><\/pre>\n<p>To see the full compilation pipeline \u2014 from Rust MIR through each Pliron dialect down to PTX \u2014 run:<\/p>\n<pre><code><span class=\"cm\"># Print the full Rust MIR \u2014 dialect-mir \u2014 mem2reg \u2014 dialect-llvm \u2014 LLVM IR \u2014 PTX trace<\/span>\ncargo oxide pipeline vecadd<\/code><\/pre>\n<p>To debug with <code>cuda-gdb<\/code>:<\/p>\n<pre><code>cargo oxide debug vecadd --tui<\/code><\/pre>\n<div class=\"cog-note\">\n        <strong>\u24d8 Output artifacts<\/strong><br \/>\n        A successful build produces two files: <code>target\/debug\/vecadd<\/code> (the host binary) and <code>target\/debug\/vecadd.ptx<\/code> (the device code). The host binary loads the PTX file via the CUDA driver at runtime.\n      <\/div>\n<\/div>\n<p>    <!-- Step 7 --><\/p>\n<div class=\"cog-slide\" data-step=\"7\">\n      <span class=\"cog-step-label\">Step 07 of 09 \u00a0\u00b7\u00a0 Write a Kernel<\/span>\n<h2>Writing Your Own #[kernel] Function<\/h2>\n<p>A kernel function is annotated with <code>#[kernel]<\/code>. Use <code>DisjointSlice&lt;T&gt;<\/code> for mutable outputs and <code>&amp;[T]<\/code> for read-only inputs. 
Access the thread\u2019s unique hardware index with <code>thread::index_1d()<\/code>.<\/p>\n<pre><code><span class=\"kw\">use<\/span> <span class=\"ty\">cuda_device<\/span>::{<span class=\"at\">kernel<\/span>, thread, DisjointSlice};\n\n<span class=\"cm\">\/\/ Tier 1 safety: race-free by construction, no `unsafe` needed.\n\/\/ DisjointSlice::get_mut() only accepts a ThreadIndex \u2014\n\/\/ a hardware-derived opaque type guaranteeing unique writes per thread.<\/span>\n<span class=\"at\">#[kernel]<\/span>\n<span class=\"kw\">pub fn<\/span> <span class=\"fn\">scale<\/span>(input: <span class=\"kw\">&amp;<\/span>[<span class=\"ty\">f32<\/span>], factor: <span class=\"ty\">f32<\/span>, <span class=\"kw\">mut<\/span> out: <span class=\"ty\">DisjointSlice<\/span>&lt;<span class=\"ty\">f32<\/span>&gt;) {\n    <span class=\"kw\">let<\/span> idx = thread::<span class=\"fn\">index_1d<\/span>();\n    <span class=\"kw\">if let<\/span> <span class=\"ty\">Some<\/span>(elem) = out.<span class=\"fn\">get_mut<\/span>(idx) {\n        *elem = input[idx.<span class=\"fn\">get<\/span>()] * factor;\n    }\n}<\/code><\/pre>\n<div class=\"cog-note\">\n        <strong>\u24d8 Tier 1 Safety \u2014 how it works<\/strong><br \/>\n        <code>ThreadIndex<\/code> is an opaque newtype around <code>usize<\/code> that can only be created from hardware built-in registers (<code>threadIdx<\/code>, <code>blockIdx<\/code>, <code>blockDim<\/code>). Since each thread gets a unique value, and <code>DisjointSlice::get_mut()<\/code> only accepts a <code>ThreadIndex<\/code>, writes are race-free by construction \u2014 no <code>unsafe<\/code> anywhere in the kernel.\n      <\/div>\n<\/div>\n<p>    <!-- Step 8 --><\/p>\n<div class=\"cog-slide\" data-step=\"8\">\n      <span class=\"cog-step-label\">Step 08 of 09 \u00a0\u00b7\u00a0 Launch from Host<\/span>\n<h2>Launching the Kernel from Host Code<\/h2>\n<p>Host and device code live in the same <code>.rs<\/code> file. 
The host side uses <code>CudaContext<\/code>, <code>DeviceBuffer<\/code>, and the <code>cuda_launch!<\/code> macro to manage GPU memory and dispatch.<\/p>\n<pre><code><span class=\"kw\">use<\/span> <span class=\"ty\">cuda_core<\/span>::{CudaContext, DeviceBuffer, LaunchConfig};\n<span class=\"kw\">use<\/span> <span class=\"ty\">cuda_host<\/span>::{cuda_launch, load_kernel_module};\n\n<span class=\"kw\">fn<\/span> <span class=\"fn\">main<\/span>() {\n    <span class=\"cm\">\/\/ Initialize GPU context on device 0<\/span>\n    <span class=\"kw\">let<\/span> ctx    = <span class=\"ty\">CudaContext<\/span>::<span class=\"fn\">new<\/span>(<span class=\"num\">0<\/span>).<span class=\"fn\">unwrap<\/span>();\n    <span class=\"kw\">let<\/span> stream = ctx.<span class=\"fn\">default_stream<\/span>();\n    <span class=\"kw\">let<\/span> module = <span class=\"fn\">load_kernel_module<\/span>(&amp;ctx, <span class=\"str\">\"scale_example\"<\/span>).<span class=\"fn\">unwrap<\/span>();\n\n    <span class=\"cm\">\/\/ Upload input data to GPU memory<\/span>\n    <span class=\"kw\">let<\/span> data: <span class=\"ty\">Vec<\/span>&lt;<span class=\"ty\">f32<\/span>&gt; = (<span class=\"num\">0<\/span>..<span class=\"num\">1024<\/span>).<span class=\"fn\">map<\/span>(|i| i <span class=\"kw\">as<\/span> <span class=\"ty\">f32<\/span>).<span class=\"fn\">collect<\/span>();\n    <span class=\"kw\">let<\/span> input  = <span class=\"ty\">DeviceBuffer<\/span>::<span class=\"fn\">from_host<\/span>(&amp;stream, &amp;data).<span class=\"fn\">unwrap<\/span>();\n    <span class=\"kw\">let mut<\/span> output = <span class=\"ty\">DeviceBuffer<\/span>::&lt;<span class=\"ty\">f32<\/span>&gt;::<span class=\"fn\">zeroed<\/span>(&amp;stream, <span class=\"num\">1024<\/span>).<span class=\"fn\">unwrap<\/span>();\n\n    <span class=\"cm\">\/\/ Dispatch the kernel \u2014 LaunchConfig auto-sizes blocks\/grids<\/span>\n    <span class=\"mac\">cuda_launch!<\/span> {\n        kernel: scale,\n        stream: 
stream,\n        module: module,\n        config: <span class=\"ty\">LaunchConfig<\/span>::<span class=\"fn\">for_num_elems<\/span>(<span class=\"num\">1024<\/span>),\n        args: [<span class=\"fn\">slice<\/span>(input), <span class=\"num\">2.5<\/span><span class=\"ty\">f32<\/span>, <span class=\"fn\">slice_mut<\/span>(output)]\n    }.<span class=\"fn\">unwrap<\/span>();\n\n    <span class=\"cm\">\/\/ Download result back to host<\/span>\n    <span class=\"kw\">let<\/span> result = output.<span class=\"fn\">to_host_vec<\/span>(&amp;stream).<span class=\"fn\">unwrap<\/span>();\n    assert!((result[<span class=\"num\">1<\/span>] - <span class=\"num\">2.5<\/span>).<span class=\"fn\">abs<\/span>() &lt; <span class=\"num\">1e-5<\/span>);\n    println!(<span class=\"str\">\"\u2713 Kernel ran successfully!\"<\/span>);\n}<\/code><\/pre>\n<div class=\"cog-note\">\n        <strong>\u24d8 What cuda_launch! does<\/strong><br \/>\n        It scalarizes the argument list \u2014 flattening slices, scalars, and captured closures \u2014 into PTX kernel parameters and dispatches the kernel on the given stream. No manual argument marshalling is required.\n      <\/div>\n<\/div>\n<p>    <!-- Step 9 --><\/p>\n<div class=\"cog-slide\" data-step=\"9\">\n      <span class=\"cog-step-label\">Step 09 of 09 \u00a0\u00b7\u00a0 Next Steps<\/span>\n<h2>What to Explore Next<\/h2>\n<p>You have a working cuda-oxide setup. 
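<\/p>
<p>Before exploring further, the Tier 1 pattern from Step 7 is worth internalizing. The sketch below is a CPU-only mock of the idea, not the real <code>cuda-device<\/code> API: an index type that only the launcher can mint, plus a slice wrapper whose <code>get_mut<\/code> accepts only that type, so each simulated thread writes a distinct element with no <code>unsafe<\/code>.<\/p>

```rust
// CPU-only mock of the Tier 1 pattern; NOT the real cuda-device API.
// ThreadIndex is minted once per simulated thread, and get_mut only
// accepts a ThreadIndex, so each thread writes a distinct element.
#[derive(Clone, Copy)]
pub struct ThreadIndex(usize); // private field: callers cannot forge one

impl ThreadIndex {
    pub fn get(self) -> usize {
        self.0
    }
}

pub struct DisjointSlice<'a, T>(&'a mut [T]);

impl<'a, T> DisjointSlice<'a, T> {
    pub fn get_mut(&mut self, idx: ThreadIndex) -> Option<&mut T> {
        self.0.get_mut(idx.0)
    }
}

fn main() {
    let input = [1.0f32, 2.0, 3.0, 4.0];
    let mut out = [0.0f32; 4];
    let mut view = DisjointSlice(&mut out);
    // Stand-in for the GPU launch: each "thread" receives a unique index.
    for t in 0..input.len() {
        let idx = ThreadIndex(t);
        if let Some(elem) = view.get_mut(idx) {
            *elem = input[idx.get()] * 2.5;
        }
    }
    assert_eq!(out, [2.5, 5.0, 7.5, 10.0]);
    println!("scaled: {:?}", out);
}
```

<p>In the real crate the privacy boundary sits at the crate level, and the index comes from hardware registers rather than a loop; the mock only shows why the type discipline rules out aliased writes.<\/p>
<p>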
Here are the high-value paths forward, ordered by complexity:<\/p>\n<ul>\n<li><strong>Generic kernels with monomorphization<\/strong> \u2014 try the <code>generic<\/code> example (<code>cargo oxide run generic<\/code>) to see how <code>fn scale&lt;T: Copy&gt;<\/code> compiles to separate PTX kernels per type.<\/li>\n<li><strong>Closures with captures<\/strong> \u2014 the <code>host_closure<\/code> example shows how a <code>move |x: f32| x * factor<\/code> closure is scalarized and passed as PTX kernel parameters automatically.<\/li>\n<li><strong>Async GPU execution<\/strong> \u2014 <code>cuda_launch_async!<\/code> returns a lazy <code>DeviceOperation<\/code> that executes on <code>.sync()<\/code> or <code>.await<\/code>. See the <code>async_mlp<\/code> and <code>async_vecadd<\/code> examples.<\/li>\n<li><strong>Shared memory and warp intrinsics<\/strong> \u2014 these require scoped <code>unsafe<\/code> blocks with documented safety contracts. See Tier 2 in the safety model documentation.<\/li>\n<li><strong>GEMM at Speed-of-Light<\/strong> \u2014 the <code>gemm_sol<\/code> example achieves 868 TFLOPS on B200 (58% of cuBLAS SoL) using <code>cta_group::2<\/code>, CLC, and a 4-stage pipeline.<\/li>\n<li><strong>Blackwell tensor cores<\/strong> \u2014 the <code>tcgen05<\/code> example targets sm_100a with TMEM, MMA, and <code>cta_group::2<\/code>. Requires LLVM 21+.<\/li>\n<\/ul>\n<div class=\"cog-note\">\n        <strong>\u24d8 Known Limitation in v0.1.0<\/strong><br \/>\n        <code>index_2d(stride)<\/code> is documented as currently unsound \u2014 if threads in the same kernel use different stride values, two threads can get <code>&amp;mut T<\/code> to the same element with no <code>unsafe<\/code> in sight. 
Until the fix lands (lifting stride into a type parameter), bind stride to a single <code>let<\/code> binding and reuse it at every call site.\n      <\/div>\n<p>Full documentation: <strong>nvlabs.github.io\/cuda-oxide<\/strong> \u00a0\u00b7\u00a0 Source: <strong>github.com\/NVlabs\/cuda-oxide<\/strong><\/p>\n<\/div>\n<\/div>\n<p><!-- \/cog-slides --><\/p>\n<p>  <!-- Attribution --><\/p>\n<div class=\"cog-attribution\">\n    <em>Document Created by Marktechpost.com<\/em>\n  <\/div>\n<\/div>\n<p><!-- \/#coxide-guide --><\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li>cuda-oxide is a custom <code>rustc<\/code> codegen backend from NVlabs that compiles <code>#[kernel]<\/code>-annotated Rust functions to PTX through a Rust \u2192 <code>rustc_public<\/code> Stable MIR \u2192 Pliron IR \u2192 LLVM IR \u2192 PTX pipeline, all buildable with <code>cargo<\/code>.<\/li>\n<li>Host and device code coexist in a single <code>.rs<\/code> file, compiled with one <code>cargo oxide build<\/code> command; the output is a host binary plus a <code>.ptx<\/code> file placed next to it.<\/li>\n<li>The safety model has three documented tiers: Tier 1 (race-free by construction via <code>DisjointSlice&lt;T&gt;<\/code> + <code>ThreadIndex<\/code>), Tier 2 (scoped <code>unsafe<\/code> for shared memory, warp intrinsics, atomics), and Tier 3 (raw hardware intrinsics for TMA, WGMMA, tcgen05). 
<code>index_2d(stride)<\/code> is documented as currently unsound in the 0.x release.<\/li>\n<li>The <code>gemm_sol<\/code> example hits 868 TFLOPS on the B200 (58% of cuBLAS SoL) using a multi-phase GEMM pipeline with CLC and <code>cta_group::2<\/code>.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/github.com\/NVlabs\/cuda-oxide\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub Repo<\/a>.\u00a0<\/strong>Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">150k+ ML SubReddit<\/a><\/strong>\u00a0and Subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! 
Are you on Telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">You can now join us on Telegram as well.<\/a><\/strong><\/p>\n<p>Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.?\u00a0<strong><a href=\"https:\/\/forms.gle\/MTNLpmJtsFA3VRVd9\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Connect with us<\/mark><\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/09\/nvidia-ai-just-released-cuda-oxide-an-experimental-rust-to-cuda-compiler-backend-that-compiles-simt-gpu-kernels-directly-to-ptx\/\">NVIDIA AI Just Released cuda-oxide: An Experimental Rust-to-CUDA Compiler Backend that Compiles SIMT GPU Kernels Directly to PTX<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>NVIDIA AI researchers recently released cuda-oxide, an experimental compiler that allows developers to write CUDA SIMT (Single Instruction, Multiple Threads) GPU kernels in standard Rust code. The project compiles Rust directly to PTX (Parallel Thread Execution) \u2014 the assembly-like intermediate representation that CUDA uses to target NVIDIA GPUs \u2014 without requiring domain-specific languages, foreign function interface bindings, or C\/C++ code. How This Makes a Change Writing GPU kernels today typically means writing C++ and using the CUDA programming model directly, or relying on Python-level abstractions like Triton that generate CUDA under the hood. The Rust GPU ecosystem has had projects attempting to bridge this gap \u2014 Rust-GPU targets SPIR-V for Vulkan\/graphics compute, rust-cuda uses a rustc codegen backend targeting NVVM IR, CubeCL uses an embedded DSL with a JIT runtime that cross-compiles to CUDA\/ROCm\/WGPU, and std::offload uses LLVM\u2019s implicit offload path. cuda-oxide occupies a specific position in this space. 
Its stated design center is \u201cbringing CUDA into Rust\u201d \u2014 kernel authoring, device intrinsics, the SIMT execution model, and the CUDA programming model expressed natively in safe Rust \u2014 closer in spirit to writing a __global__ function in C++ than to writing a generic Rust function that happens to run on a GPU. By contrast, the closest neighbor, rust-cuda, focuses on \u201cbringing Rust to NVIDIA GPUs\u201d: Rust ergonomics like async\/.await, parts of the standard library running on-device, and a Rust-first programming model that abstracts over CUDA concepts. The NVlabs team notes it has been coordinating with rust-cuda maintainers and considers the two projects complementary. The Compilation Pipeline At the core of cuda-oxide is a custom rustc codegen backend \u2014 the layer in the Rust compiler responsible for generating machine code. Instead of emitting native CPU code, the rustc-codegen-cuda crate intercepts the compiler at the CodegenBackend::codegen_crate() entry point and runs a separate pipeline for device code: Rust Source \u2192 rustc frontend \u2192 rustc_public (Stable MIR) \u2192 dialect-mir \u2192 mem2reg \u2192 dialect-llvm \u2192 LLVM IR (.ll) \u2192 PTX (.ptx) Here are some important elements: Why rustc_public? The raw internal MIR representation in rustc changes between nightly versions with no stability guarantees. cuda-oxide uses rustc_public \u2014 also known as Stable MIR \u2014 which is Rust\u2019s official versioned, stable API over the compiler\u2019s internals. This lets the backend read MIR without breaking on every nightly update. What is Pliron? The middle stages use Pliron, a Rust-native MLIR-like IR framework written entirely in Rust. Choosing Pliron instead of upstream MLIR means the entire compiler builds with cargo \u2014 no C++ toolchain, no CMake, no tablegen. 
cuda-oxide defines three custom Pliron dialects: dialect-mir (modeling Rust MIR semantics \u2014 places, projections, rvalues, terminators), dialect-llvm (modeling LLVM IR with textual .ll export), and dialect-nvvm (NVIDIA GPU intrinsics like thread indexing, barriers, and TMA). What does llc do? After the dialect-llvm printer serializes the IR into a textual .ll file, the external llc binary (the LLVM static compiler with NVPTX backend) compiles it to PTX assembly. This is the one stage outside pure Rust. The resulting .ptx file is written next to the host binary \u2014 for example, target\/debug\/vecadd.ptx \u2014 and loaded by the CUDA driver at runtime. You as a developer can observe each stage with: cargo oxide pipeline vecadd This prints the full trace from Rust MIR through each dialect down to PTX output. Single-Source Compilation and the Host\/Device Split Host and device code live in the same .rs source file. cargo oxide sets -Z codegen-backend=librustc_codegen_cuda.so, which routes code generation through cuda-oxide\u2019s backend. The backend then scans compiled code for monomorphized functions whose names carry the reserved cuda_oxide_kernel_&lt;hash&gt;_&lt;name&gt; prefix \u2014 the namespace that the #[kernel] proc macro creates. Functions matching that prefix go through the cuda-oxide pipeline to produce PTX; all other host code is delegated to rustc\u2019s standard LLVM backend. The result of a single cargo oxide build is a host binary plus a .ptx file. cargo oxide run vecadd cargo oxide debug vecadd &#8211;tui # debug with cuda-gdb Device code from library dependencies is compiled lazily: the backend reads their Stable MIR from .rlib metadata on demand, only compiling functions a kernel actually calls. What You Can Write in a Kernel cuda-oxide supports a meaningful subset of Rust in GPU kernel functions, marked with the #[kernel] attribute macro. 
<p>This includes:<\/p>
<ul class=\"wp-block-list\">
<li><strong>Generic functions with monomorphization<\/strong> \u2014 <code>fn scale&lt;T: Copy&gt;(&#8230;)<\/code> is compiled to a concrete PTX kernel per type used at the call site.<\/li>
<li><strong>Closures with captures<\/strong> \u2014 closures passed from the host are scalarized and passed as PTX kernel parameters automatically.<\/li>
<li><strong>User-defined structs and enums<\/strong> \u2014 standard Rust data structures work inside kernels.<\/li>
<li><strong>Pattern matching<\/strong> \u2014 <code>match<\/code>, <code>if let<\/code>, and related constructs work in device code.<\/li>
<li><strong>Full GPU intrinsics<\/strong> \u2014 the <code>cuda-device<\/code> crate provides wrappers for thread indexing, warp operations (<code>shfl_sync<\/code>, <code>ballot_sync<\/code>, etc.), shared memory, barriers, TMA (Tensor Memory Accelerator), Thread Block Clusters, and scoped atomics (6 types \u00d7 3 scopes \u00d7 5 orderings).<\/li>
<\/ul>
<p>One important GPU-specific compiler detail: rustc\u2019s JumpThreading MIR optimization \u2014 which duplicates function calls into both branches of an if-statement \u2014 is disabled for device code in cuda-oxide. On CPUs this is a safe optimization, but on GPUs it breaks barrier semantics: all threads in a block must converge at the same <code>bar.sync<\/code> instruction, and duplicating it across branches violates that requirement. Additionally, sync primitives are marked <code>convergent<\/code> in the emitted LLVM IR so that LLVM\u2019s optimization passes cannot move or duplicate them across control flow.<\/p>
<h3 class=\"wp-block-heading\"><strong>How to Use: Prerequisites<\/strong><\/h3>
<p>cuda-oxide has specific version requirements for each dependency. Before installing anything, verify your system meets all of these. The project is currently Linux-only (tested on Ubuntu 24.04).<\/p>
<ul class=\"wp-block-list\">
<li>Linux (Ubuntu 24.04)<\/li>
<li>Rust nightly<\/li>
<li>CUDA Toolkit 12.x+<\/li>
<li>LLVM 21+<\/li>
<li>Clang 21 \/ libclang-common-21-dev<\/li>
<li>Git<\/li>
<\/ul>
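A preflight check against the version requirements above might look like the following. The version strings here are hard-coded stand-ins; in a real check they would come from running `llc --version`, `nvcc --version`, and so on, and the helper names are hypothetical.

```rust
// Hypothetical preflight sketch for the prerequisites listed above.
// Extract the leading major version from a string like "21.1.0".
fn major(version: &str) -> Option<u32> {
    version
        .split(|c: char| !c.is_ascii_digit())
        .next()?
        .parse()
        .ok()
}

// Print and return whether a tool meets its minimum major version.
fn check(tool: &str, version: &str, min_major: u32) -> bool {
    let ok = major(version).map_or(false, |m| m >= min_major);
    println!("{tool} {version}: {}", if ok { "ok" } else { "too old / missing" });
    ok
}

fn main() {
    // Stand-in version strings; replace with real tool output in practice.
    let all_ok = check("LLVM llc", "21.1.0", 21)
        & check("CUDA Toolkit", "12.4", 12)
        & check("clang", "21.0.0", 21);
    assert!(all_ok, "one or more prerequisites not satisfied");
}
```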
<p><strong>Why LLVM 21?<\/strong> Simple kernels may work on LLVM 20, but anything targeting Hopper or Blackwell \u2014 TMA, tcgen05, WGMMA \u2014 requires <code>llc<\/code> from LLVM 21.<\/p>