{"id":99148,"date":"2026-06-22T18:15:51","date_gmt":"2026-06-22T18:15:51","guid":{"rendered":"https:\/\/youzum.net\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/"},"modified":"2026-06-22T18:15:51","modified_gmt":"2026-06-22T18:15:51","slug":"moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode","status":"publish","type":"post","link":"https:\/\/youzum.net\/es\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/","title":{"rendered":"MoonMath AI Open-Sources a HIP Attention Kernel for AMD MI300X That Beats AITER v3 on Every Shape and Rounding Mode"},"content":{"rendered":"<p class=\"wp-block-paragraph\">The MoonMath AI team has released a bf16 forward attention kernel for AMD\u2019s MI300X GPU. It is written in HIP, not hand-written assembly. The code is open-source under the MIT license. The team reports that it beats AITER v3, AMD\u2019s own optimized kernel, on every tested shape. Bare-metal access came from HotAisle, an AMD cloud provider.<\/p>\n<p class=\"wp-block-paragraph\">Attention is the fused <code>softmax(QK\u1d40\/\u221ad)\u00b7V<\/code> operation inside every transformer. The MI300X is AMD\u2019s CDNA3 data-center GPU, identified by the ISA target gfx942. 
This kernel runs on that hardware only.<\/p>\n<h2 class=\"wp-block-heading\"><strong>TL;DR<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>MoonMath.ai open-sources a bf16 forward attention kernel for AMD MI300X, written in HIP, not assembly (MIT).<\/li>\n<li>It beats AMD\u2019s AITER v3 on every tested shape and rounding mode: geomean 1.18\u00d7\/1.15\u00d7\/1.08\u00d7, up to 1.26\u00d7.<\/li>\n<li>The core trick: one-instruction asm wrappers let you pick the opcode while the compiler allocates registers.<\/li>\n<li>Most of the speedup is memory placement: K in LDS, V hot in L1, Q and accumulators in registers.<\/li>\n<li>An SGLang PR used it to speed up Wan2.1 video diffusion by 1.23\u00d7, with no visible quality regression.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\"><strong>Understanding the Kernel<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">A kernel is a small program that runs directly on the GPU\u2019s many cores to perform one specific computation (here, the attention math) as fast as the hardware allows. This kernel computes forward attention in bf16 on MI300X only. It accepts inputs in either BSHD or BHSD layout without a transpose. Head dimension is fixed at 128. It supports any sequence length, including cross-attention.<\/p>\n<p class=\"wp-block-paragraph\">There are real limits. There is no causal mask, no GQA, and no varlen batching. Outputs are bf16, and it runs on gfx942 hardware exclusively.<\/p>\n<p class=\"wp-block-paragraph\">Numerics are tightly controlled. All three rounding modes match AITER\u2019s per-mode rounding rule. Every finite output sits within 1 bf16 ULP of AITER. NaN and Inf handling is bit-identical, and results are deterministic.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The Core Trick: One-Instruction asm Wrappers<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The core technique avoids a familiar dilemma. Compiler intrinsics keep code tidy but let the compiler reorder or rename operands. 
Raw inline assembly gives control but forces manual register and address management.<\/p>\n<p class=\"wp-block-paragraph\">MoonMath wraps exactly one instruction in a <code>__device__ __forceinline__<\/code> function. Extended asm constraints describe the operands. The research team picks the opcode. The compiler still allocates registers and tracks data flow.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">\/\/ in\/out tied to the SAME VGPR \u2192 no accumulator rename, no v_mov copy.\n__device__ __forceinline__ void asm_mfma(bf16x4_t a, bf16x4_t b, fp32x4_t&amp; c) {\n    asm volatile(\"v_mfma_f32_16x16x16_bf16 %0, %1, %2, %0\"\n                 : \"+v\"(c) : \"v\"(a), \"v\"(b));\n}<\/code><\/pre>\n<\/div>\n<\/div>\n<p class=\"wp-block-paragraph\">The <code>\"+v\"(c)<\/code> constraint ties the accumulator input and output to the same VGPR. No copy instruction is emitted. This keeps the kernel close to ordinary HIP. It still steers the machine one instruction at a time.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The Architecture: Eight Waves, Two Groups, Two Barriers<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">A CDNA3 compute unit has four SIMD units. The textbook block is four waves. 
MoonMath instead runs eight waves per block, in two groups of four.<\/p>\n<p class=\"wp-block-paragraph\">The two groups run the same <code>Q*K<\/code>, softmax, <code>O += P*V<\/code> sequence. They are offset by a phase. While one group saturates the matrix core, the other runs softmax and issues loads. Then they swap, so the matrix core never idles.<\/p>\n<p class=\"wp-block-paragraph\">There are two <code>s_barrier<\/code>s per iteration. One sits at the phase handoff. One sits at the iteration boundary. Per-counter waits handle the rest of the synchronization.<\/p>\n<p class=\"wp-block-paragraph\">This echoes FlashAttention-3\u2019s matmul and softmax alternation. It does not copy FA3\u2019s producer and consumer warp split. On CDNA3, every memory move is already asynchronous, so a dedicated producer wave is unnecessary.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Where Data Lives, and Why 16\u00d716\u00d716<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Most of the speedup comes from memory placement. <code>K<\/code> streams from HBM into LDS, double-buffered, shared by all eight waves. <code>V<\/code> stays hot in L1, read on every PV matmul. <code>Q<\/code> and accumulators live in registers.<\/p>\n<p class=\"wp-block-paragraph\">The research team picked the 16\u00d716\u00d716 MFMA over 32\u00d732\u00d78. Both shapes have identical throughput. The smaller tile accumulates into 4 fp32 elements per lane, against 16. 
Lower accumulator pressure leaves room for deeper prefetch and a third <code>Q<\/code> tile.<\/p>\n<figure class=\"wp-block-table is-style-stripes\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Decision<\/th>\n<th>Choice<\/th>\n<th>Reason<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Waves per block<\/td>\n<td>8 (two groups of 4)<\/td>\n<td>Plan the pipeline directly; share one K copy<\/td>\n<\/tr>\n<tr>\n<td>MFMA shape<\/td>\n<td>16\u00d716\u00d716 bf16<\/td>\n<td>Same throughput, lower VGPR pressure, better power efficiency<\/td>\n<\/tr>\n<tr>\n<td>K placement<\/td>\n<td>LDS, double-buffered, 32 KiB<\/td>\n<td>Shared by all 8 waves, swapped per iteration<\/td>\n<\/tr>\n<tr>\n<td>V placement<\/td>\n<td>L1, resident, prefetched<\/td>\n<td>Reread across PV, kept hot deliberately<\/td>\n<\/tr>\n<tr>\n<td>Q + accumulators<\/td>\n<td>VGPRs<\/td>\n<td>Read every iteration, never reloaded<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">Two later wins close the gap. A third <code>Q<\/code> tile (3Q) raises data reuse per loaded <code>K<\/code> and <code>V<\/code> tile. A Flash-Decoding-style tail KV split rescues the stranded fractional round across MI300X\u2019s 304 CUs. These wins cascade. Moving <code>V<\/code> to L1 freed the LDS that the third <code>Q<\/code> tile then fills.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Benchmark<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Tests ran on MI300X in bf16, head dimension 128. Each shape was measured at three rounding modes. RTNE rounds to nearest even. RTNA rounds to nearest, ties away from zero. 
RTZ truncates toward zero.<\/p>\n<figure class=\"wp-block-table is-style-stripes\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Shape (B, H, S, D)<\/th>\n<th>Round<\/th>\n<th>Ours (ms)<\/th>\n<th>AITER v3 (ms)<\/th>\n<th>vs AITER<\/th>\n<th>vs MAX<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>(2, 24, 8192, 128)<\/td>\n<td>RTNE<\/td>\n<td>3.083<\/td>\n<td>3.792<\/td>\n<td>1.23\u00d7<\/td>\n<td>1.37\u00d7<\/td>\n<\/tr>\n<tr>\n<td>(2, 24, 16384, 128)<\/td>\n<td>RTNE<\/td>\n<td>11.670<\/td>\n<td>14.691<\/td>\n<td>1.26\u00d7<\/td>\n<td>1.54\u00d7<\/td>\n<\/tr>\n<tr>\n<td>(4, 16, 16384, 128)<\/td>\n<td>RTZ<\/td>\n<td>15.055<\/td>\n<td>16.183<\/td>\n<td>1.07\u00d7<\/td>\n<td>1.47\u00d7<\/td>\n<\/tr>\n<tr>\n<td>(2, 24, 32768, 128)<\/td>\n<td>RTNA<\/td>\n<td>44.440<\/td>\n<td>52.363<\/td>\n<td>1.18\u00d7<\/td>\n<td>1.57\u00d7<\/td>\n<\/tr>\n<tr>\n<td>(1, 16, 131072, 128)<\/td>\n<td>RTNE<\/td>\n<td>232.517<\/td>\n<td>269.278<\/td>\n<td>1.16\u00d7<\/td>\n<td>1.46\u00d7<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">Geomeans across the sweep favor MoonMath. Versus AITER, it scores 1.18\u00d7 (RTNE), 1.15\u00d7 (RTNA), and 1.08\u00d7 (RTZ). Versus Modular MAX, geomeans run 1.44\u00d7 to 1.49\u00d7, and per-shape speedups reach 1.59\u00d7.<\/p>\n<p class=\"wp-block-paragraph\">RTZ is AITER\u2019s own fastest mode and the tightest race. The (4, 16, 16384) RTZ shape moved from 0.95\u00d7 to 1.07\u00d7. The tail KV split is what closed that final gap.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Use Cases<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The kernel installs with pip and exposes a small API. 
It launches on the caller\u2019s stream, so it overlaps inside larger pipelines.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">import torch\nimport moonmath_attention as ma\n\n# PyTorch's ROCm build uses the \"cuda\" device string on AMD GPUs\nq = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device=\"cuda\")\nk = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device=\"cuda\")\nv = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device=\"cuda\")\n\nout     = ma.forward(q, k, v, layout=\"bshd\")\nout_rtz = ma.forward(q, k, v, layout=\"bshd\", round_mode=\"rtz\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p class=\"wp-block-paragraph\">One concrete use case is video diffusion. The team added LiteAttention support and sent a PR to SGLang diffusion. On Wan2.1-T2V-1.3B-Diffusers, they switched attention from AITER to <code>liteattention_rocm<\/code>. End-to-end generation improved by 1.23\u00d7 on MI300X, with no visible quality regression.<\/p>\n<p class=\"wp-block-paragraph\">The BSHD layout suits diffusion tensors directly. 
Cross-attention works with any KV length and no padding.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>The kernel is bf16 forward attention for MI300X, written in HIP under MIT.<\/li>\n<li>It beats AITER v3 on every shape and rounding mode, geomean 1.18\u00d7\/1.15\u00d7\/1.08\u00d7.<\/li>\n<li>One-instruction asm wrappers give opcode control while the compiler allocates registers.<\/li>\n<li>Memory placement drove most of the gain: K in LDS, V hot in L1, Q in registers.<\/li>\n<li>A real SGLang PR sped up Wan2.1 video diffusion by 1.23\u00d7 with no quality regression.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p class=\"wp-block-paragraph\">Check out the <strong><a href=\"https:\/\/moonmath.ai\/cdna3attention\/\" target=\"_blank\" rel=\"noreferrer noopener\">Technical details<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/06\/22\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/\">MoonMath AI Open-Sources a HIP Attention Kernel for AMD MI300X That Beats AITER v3 on Every Shape and Rounding Mode<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>MoonMath AI team has released a bf16 forward attention kernel for AMD\u2019s MI300X GPU. It is written in HIP, not hand-written assembly. The code is open-source under the MIT license. The MoonMath.ai team reports it beats AITER v3, AMD\u2019s own optimized kernel, on every tested shape. Bare-metal access came from HotAisle, an AMD cloud provider. Attention is the fused softmax(QK\u1d40\/\u221ad)\u00b7V operation inside every transformer. The MI300X is AMD\u2019s CDNA3 data-center GPU, with the ISA target (gfx942). This kernel runs on that hardware only. TL;DR MoonMath.ai open-sources a bf16 forward attention kernel for AMD MI300X, written in HIP, not assembly (MIT). It beats AMD\u2019s AITER v3 on every shape and rounding mode \u2014 geomean 1.18\u00d7\/1.15\u00d7\/1.08\u00d7, up to 1.26\u00d7. The core trick: one-instruction asm wrappers let you pick the opcode while the compiler allocates registers. Most of the speedup is memory placement \u2014 K in LDS, V hot in L1, Q and accumulators in registers. 
A real SGLang PR used it to speed up Wan2.1 video diffusion by 1.23\u00d7, with no quality regression. Understanding Kernel A kernel is a small program that runs directly on the GPU\u2019s many cores to perform one specific computation\u2014here, the attention math\u2014as fast as the hardware allows. The kernel computes forward attention in bf16 on MI300X only. It takes inputs in either BSHD or BHSD layout, with no transpose. Head dimension is fixed at 128. It supports any sequence length, including cross-attention. There are real limits. There is no causal mask, no GQA, and no varlen batching. Outputs are bf16, and it runs on gfx942 hardware exclusively. Numerics are tightly controlled. All three rounding modes match AITER\u2019s per-mode rounding rule. Every finite output sits within 1 bf16 ULP of AITER. NaN and Inf handling is bit-identical, and results are deterministic. The Core Trick: One-Instruction asm Wrappers The core technique avoids a familiar dilemma. Compiler intrinsics keep code tidy but let the compiler reorder or rename operands. Raw inline assembly gives control but forces manual register and address management. MoonMath wraps exactly one instruction in a __device__ __forceinline__ function. Extended asm constraints describe the operands. The research team picks the opcode. The compiler still allocates registers and tracks data flow. \/\/ in\/out tied to the SAME VGPR \u2192 no accumulator rename, no v_mov copy. __device__ __forceinline__ void asm_mfma(bf16x4_t a, bf16x4_t b, fp32x4_t&amp; c) { asm volatile(&quot;v_mfma_f32_16x16x16_bf16 %0, %1, %2, %0&quot; : &quot;+v&quot;(c) : &quot;v&quot;(a), &quot;v&quot;(b)); } The &quot;+v&quot;(c) constraint ties the accumulator input and output to the same VGPR. No copy instruction is emitted. This keeps the kernel close to ordinary HIP. It still steers the machine one instruction at a time. 
The Architecture: Eight Waves, Two Groups, Two Barriers A CDNA3 compute unit has four SIMD units. The textbook block is four waves. MoonMath instead runs eight waves per block, in two groups of four. The two groups run the same Q*K, softmax, O += P*V sequence. They are offset by a phase. While one group saturates the matrix core, the other runs softmax and issues loads. Then they swap, so the matrix core never idles. There are two s_barriers per iteration. One sits at the phase handoff. One sits at the iteration boundary. Per-counter waits handle the rest of the synchronization. This echoes FlashAttention-3\u2019s matmul and softmax alternation. It does not copy FA3\u2019s producer and consumer warp split. On CDNA3, every memory move is already asynchronous, so a dedicated producer wave is unnecessary. Where Data Lives, and Why 16\u00d716\u00d716 Most of the speedup comes from memory placement. K streams from HBM into LDS, double-buffered, shared by all eight waves. V stays hot in L1, read on every PV matmul. Q and accumulators live in registers. The research team picked the 16\u00d716\u00d716 MFMA over 32\u00d732\u00d78. Both shapes have identical throughput. The smaller tile accumulates into 4 fp32 elements per lane, against 16. Lower accumulator pressure leaves room for deeper prefetch and a third Q tile. Decision Choice Reason Waves per block 8 (two groups of 4) Plan the pipeline directly; share one K copy MFMA shape 16\u00d716\u00d716 bf16 Same throughput, lower VGPR pressure, better power efficiency K placement LDS, double-buffered, 32 KiB Shared by all 8 waves, swapped per iteration V placement L1, resident, prefetched Reread across PV, kept hot deliberately Q + accumulators VGPRs Read every iteration, never reloaded Two later wins close the gap. A third Q tile (3Q) raises data reuse per loaded K and V tile. A Flash-Decoding-style tail KV split rescues the stranded fractional round across MI300X\u2019s 304 CUs. These wins cascade. 
Moving V to L1 freed the LDS that the third Q tile then fills. Benchmark Tests ran on MI300X in bf16, head dimension 128. Each shape was measured at three rounding modes. RTNE rounds to nearest even. RTNA rounds to nearest, ties away from zero. RTZ truncates toward zero. Shape (B, H, S, D) Round Ours (ms) AITER v3 (ms) vs AITER vs MAX (2, 24, 8192, 128) RTNE 3.083 3.792 1.23\u00d7 1.37\u00d7 (2, 24, 16384, 128) RTNE 11.670 14.691 1.26\u00d7 1.54\u00d7 (4, 16, 16384, 128) RTZ 15.055 16.183 1.07\u00d7 1.47\u00d7 (2, 24, 32768, 128) RTNA 44.440 52.363 1.18\u00d7 1.57\u00d7 (1, 16, 131072, 128) RTNE 232.517 269.278 1.16\u00d7 1.46\u00d7 Geomeans across the sweep favor MoonMath. Versus AITER, it scores 1.18\u00d7 (RTNE), 1.15\u00d7 (RTNA), and 1.08\u00d7 (RTZ). Versus Modular MAX, geomeans run 1.44\u00d7 to 1.49\u00d7, and per-shape speedups reach 1.59\u00d7. RTZ is AITER\u2019s own fastest mode and the tightest race. The (4, 16, 16384) RTZ shape moved from 0.95\u00d7 to 1.07\u00d7. The tail KV split is what closed that final gap. Interactive Explainer Use Cases The kernel installs with pip and exposes a small API. It launches on the caller\u2019s stream, so it overlaps inside larger pipelines. 
Copy CodeCopiedUse a different Browser import torch import moonmath_attention as ma # PyTorch&#8217;s ROCm build uses the &#8220;cuda&#8221; device string on AMD GPUs q = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device=&#8221;cuda&#8221;) k = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device=&#8221;cuda&#8221;) v = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device=&#8221;cuda&#8221;) out<\/p>","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-99148","post","type-post","status-publish","format-standard","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>MoonMath AI Open-Sources a HIP Attention Kernel for AMD MI300X That 
Beats AITER v3 on Every Shape and Rounding Mode - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/es\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/\" \/>\n<meta property=\"og:locale\" content=\"es_ES\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"MoonMath AI Open-Sources a HIP Attention Kernel for AMD MI300X That Beats AITER v3 on Every Shape and Rounding Mode - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/es\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-06-22T18:15:51+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Escrito por\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Tiempo de lectura\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutos\" \/>\n<script type=\"application\/ld+json\" 
class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"MoonMath AI Open-Sources a HIP Attention Kernel for AMD MI300X That Beats AITER v3 on Every Shape and Rounding Mode\",\"datePublished\":\"2026-06-22T18:15:51+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/\"},\"wordCount\":1098,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"es\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/\",\"url\":\"https:\/\/youzum.net\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/\",\"name\":\"MoonMath AI Open-Sources a HIP Attention Kernel for AMD MI300X That Beats AITER v3 on Every Shape and Rounding Mode - 
YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"datePublished\":\"2026-06-22T18:15:51+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/#breadcrumb\"},\"inLanguage\":\"es\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"MoonMath AI Open-Sources a HIP Attention Kernel for AMD MI300X That Beats AITER v3 on Every Shape and Rounding Mode\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"es\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association 
Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"es\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"es\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/es\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"MoonMath AI Open-Sources a HIP Attention Kernel for AMD MI300X That Beats AITER v3 on Every Shape and Rounding Mode - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/es\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/","og_locale":"es_ES","og_type":"article","og_title":"MoonMath AI Open-Sources a HIP Attention Kernel for AMD MI300X That Beats AITER v3 on Every Shape and Rounding Mode - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/es\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-06-22T18:15:51+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"Escrito por":"admin NU","Tiempo de lectura":"6 minutos"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"MoonMath AI Open-Sources a HIP Attention Kernel for AMD MI300X That Beats AITER v3 on Every Shape and 
Rounding Mode","datePublished":"2026-06-22T18:15:51+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/"},"wordCount":1098,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"es","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/","url":"https:\/\/youzum.net\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/","name":"MoonMath AI Open-Sources a HIP Attention Kernel for AMD MI300X That Beats AITER v3 on Every Shape and Rounding Mode - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"datePublished":"2026-06-22T18:15:51+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/#breadcrumb"},"inLanguage":"es","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/moonmath-ai-open-sources-a-hip-attention-kernel-for-amd-mi300x-that-beats-aiter-v3-on-every-shape-and-rounding-mode\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"MoonMath AI Open-Sources a HIP Attention Kernel for AMD MI300X That Beats AITER v3 on Every Shape and Rounding Mode"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"es"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association 
Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"es","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"es","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin NU"},"url":"https:\/\/youzum.net\/es\/members\/adminnu\/"}]}},"rttpg_featured_image_url":null,"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/es\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/es\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/es\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/es\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/es\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"MoonMath AI team has released a bf16 forward attention kernel for AMD\u2019s MI300X GPU. It is written in HIP, not hand-written assembly. The code is open-source under the MIT license. The MoonMath.ai team reports it beats AITER v3, AMD\u2019s own optimized kernel, on every tested shape. 
Bare-metal access came from HotAisle, an AMD cloud provider.&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/posts\/99148","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/comments?post=99148"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/posts\/99148\/revisions"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/media?parent=99148"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/categories?post=99148"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/tags?post=99148"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}