{"id":92594,"date":"2026-05-24T16:59:04","date_gmt":"2026-05-24T16:59:04","guid":{"rendered":"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/"},"modified":"2026-05-24T16:59:04","modified_gmt":"2026-05-24T16:59:04","slug":"nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule","status":"publish","type":"post","link":"https:\/\/youzum.net\/th\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/","title":{"rendered":"NVIDIA AI Releases Gated DeltaNet-2: A Linear Attention Layer That Decouples Erase and Write in the Delta Rule"},"content":{"rendered":"<p class=\"wp-block-paragraph\">Linear attention replaces the unbounded KV cache of softmax attention with a fixed-size recurrent state. This cuts sequence mixing to linear time and decoding to constant memory. The hard part is not what to forget. It is how to edit a compressed memory without scrambling existing associations.<\/p>\n<p class=\"wp-block-paragraph\">NVIDIA has released <strong>Gated DeltaNet-2<\/strong>, a linear attention layer that targets that bottleneck. The model decouples the active memory edit into two channel-wise gates. It is trained at 1.3B parameters on 100B FineWeb-Edu tokens. It outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 across the researchers\u2019 benchmark suite.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The scalar gate problem in delta-rule models<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">A recurrent linear attention layer stores a matrix state <code>S<sub>t<\/sub><\/code> and reads it with the query. DeltaNet adds an active edit by subtracting the value currently associated with the current key. It uses a scalar step size <code>\u03b2<sub>t<\/sub><\/code> to control how much to overwrite. 
Mamba-2 adds a data-dependent scalar decay <code>\u03b1<sub>t<\/sub><\/code> for global forgetting. Gated DeltaNet combined both operations, but both gates remained scalar per head.<\/p>\n<p class=\"wp-block-paragraph\">Kimi Delta Attention (KDA) refines the decay side. It replaces the scalar <code>\u03b1<sub>t<\/sub><\/code> with a channel-wise vector. KDA still keeps a single scalar <code>\u03b2<sub>t<\/sub><\/code> for the active edit. That scalar controls two different things at once. It decides how much old content to erase on the key side. It also decides how much new content to commit on the value side. These two decisions act on different axes of the state. Tying them together is a modeling restriction, not a property of the delta rule.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1456\" height=\"842\" data-attachment-id=\"80073\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/24\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/screenshot-2026-05-24-at-12-26-27-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1.png\" data-orig-size=\"1456,842\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\",\"alt\":\"\"}' data-image-title=\"Screenshot 2026-05-24 at 12.26.27\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-1024x592.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1.png\" alt=\"\" class=\"wp-image-80073\" 
\/><figcaption class=\"wp-element-caption\">https:\/\/github.com\/NVlabs\/GatedDeltaNet-2\/blob\/main\/paper\/GDN2_paper.pdf<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Gated Delta Rule-2: two gates instead of one<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Gated DeltaNet-2 separates the two decisions through Gated Delta Rule-2. It introduces a channel-wise erase gate <code>b<sub>t<\/sub> \u2208 [0,1]<sup>d<sub>k<\/sub><\/sup><\/code> on the key axis. It also introduces a channel-wise write gate <code>w<sub>t<\/sub> \u2208 [0,1]<sup>d<sub>v<\/sub><\/sup><\/code> on the value axis. Both gates are produced by sigmoid projections of the token representation. The update applies decay before the active edit.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Written compactly, the recurrence is:<\/strong><\/p>\n<p class=\"wp-block-paragraph\"><code><strong>S<sub>t<\/sub> = (I \u2212 k<sub>t<\/sub> (b<sub>t<\/sub> \u2299 k<sub>t<\/sub>)<sup>\u22a4<\/sup>) D<sub>t<\/sub> S<sub>t\u22121<\/sub> + k<sub>t<\/sub> (w<sub>t<\/sub> \u2299 v<sub>t<\/sub>)<sup>\u22a4<\/sup><\/strong><\/code><\/p>\n<p class=\"wp-block-paragraph\">Here <code>D<sub>t<\/sub> = Diag(\u03b1<sub>t<\/sub>)<\/code> is the channel-wise decay carried over from KDA. The left factor of the erase matrix stays <code>k<sub>t<\/sub><\/code>, preserving the delta-rule write direction. The right factor becomes <code>b<sub>t<\/sub> \u2299 k<sub>t<\/sub><\/code>, making the read direction channel-selective. The write term <code>k<sub>t<\/sub> z<sub>t<\/sub><sup>\u22a4<\/sup><\/code> uses <code>z<sub>t<\/sub> = w<sub>t<\/sub> \u2299 v<sub>t<\/sub><\/code>, making the value update channel-selective.<\/p>\n<p class=\"wp-block-paragraph\">When both gates collapse to the same scalar <code>\u03b2<sub>t<\/sub><\/code>, the update recovers KDA exactly. When the decay <code>\u03b1<sub>t<\/sub><\/code> also collapses to a scalar, it recovers Gated DeltaNet. 
Both prior models are preserved as tied subspaces of the new update.<\/p>\n<p class=\"wp-block-paragraph\">In the fast-weight view, Gated Delta Rule-2 is one online gradient step on a local regression loss. The decayed state stays close to memory, while the residual edit uses gated read and gated write targets.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Chunkwise training and gate-aware backward<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The recurrence admits a chunkwise WY form that matches the structure used by KDA. Cumulative channel-wise decay is absorbed into the two factors of each rank-one erase. The per-chunk update becomes a product of asymmetric matrices of the form <code>I \u2212 k\u0304<sub>r<\/sub> \u0113<sub>r<\/sub><sup>\u22a4<\/sup><\/code>. The implementation uses chunk size <code>C = 64<\/code> with fused Triton kernels.<\/p>\n<p class=\"wp-block-paragraph\">For the backward pass, the scalar shortcut used by KDA no longer applies. The write side contains a different diagonal gate over value channels. The erase side contains a different diagonal gate over key channels. So the gate factors must appear inside the dot products that accumulate gradients. The paper derives this gate-aware vector-Jacobian product explicitly. On Hopper GPUs, the fused WY backward kernel is restricted to two and four warps to avoid a Triton WGMMA layout assertion.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Block design and hybrid model<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Gated DeltaNet-2 is used as the recurrent token mixer in a standard Transformer-style block. Query and key paths use linear projection, short causal convolution, SiLU, and L2 normalization. The value path uses linear projection, short convolution, and SiLU. The decay <code>\u03b1<sub>t<\/sub><\/code>, erase gate <code>b<sub>t<\/sub><\/code>, and write gate <code>w<sub>t<\/sub><\/code> come from separate linear branches. 
The recurrent output is RMS-normalized, multiplied by a SiLU output gate, and projected back.<\/p>\n<p class=\"wp-block-paragraph\">A hybrid variant inserts Sliding-Window Attention (SWA) after the recurrent mixer. A repeated cell contains Gated DeltaNet-2, an MLP, SWA, and another MLP. SWA handles exact local interactions, while the recurrent mixer compresses long histories. The hybrid retains linear sequence scaling with a bounded attention cache.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Results at 1.3B parameters<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">All models are 1.3B parameters trained on 100B FineWeb-Edu tokens. Parameter count and recurrent state size are matched across models. The recurrent state holds 262,144 floats per layer per batch element. Training length is 4K tokens, and hybrid models use a 2K SWA window. The Mamba-3 MIMO baseline uses rank <code>R = 4<\/code>.<\/p>\n<p class=\"wp-block-paragraph\">On language modeling and commonsense reasoning, Gated DeltaNet-2 has the best average in both settings. The recurrent model averages 53.11 across LAMBADA and the reasoning suite. That sits above Mamba-3 MIMO at 52.39 and KDA at 52.28. In the hybrid setting, Gated DeltaNet-2 averages 53.97 against Mamba-3 MIMO at 52.72. Since recurrent state size is matched, the gain points to the update rule, not more memory.<\/p>\n<p class=\"wp-block-paragraph\">The clearest gains appear on RULER long-context retrieval. In the recurrent setting, S-NIAH-2 at 4K rises from 89.0 (KDA) to 93.0. S-NIAH-3 at 2K jumps from 63.2 (KDA) to 89.8. MK-NIAH-1 at 4K climbs from 28.0 (KDA) to 37.8. <\/p>\n<p class=\"wp-block-paragraph\">On real-world retrieval (SWDE, SQuAD, FDA, TriviaQA, NQ, DROP), Gated DeltaNet-2 also leads both settings. 
The recurrent average is 29.88 and the hybrid average is 42.28.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<div>\n<div class=\"gdn2-header\">\n<div class=\"gdn2-brand\">\n      <span class=\"gdn2-brand-mark\"><\/span><br \/>\n      <span>Gated DeltaNet-2 \u00b7 Quickstart<\/span>\n    <\/div>\n<div class=\"gdn2-counter\">\n      <span class=\"gdn2-current\">01<\/span> \/ <span class=\"gdn2-total\">08<\/span>\n    <\/div>\n<\/div>\n<div class=\"gdn2-stage\">\n<div class=\"gdn2-track\">\n<p>      <!-- Slide 1: Cover --><\/p>\n<section class=\"gdn2-slide gdn2-cover\">\n        <span class=\"gdn2-step-label\">NVIDIA \u00b7 2026<\/span>\n<h2 class=\"gdn2-title\">Gated DeltaNet-2<\/h2>\n<p class=\"gdn2-subtitle\">Decoupling Erase and Write in Linear Attention. A delta-rule recurrent attention layer with channel-wise erase and write gates.<\/p>\n<div class=\"gdn2-badges\">\n          <span class=\"gdn2-badge\">PyTorch<\/span><br \/>\n          <span class=\"gdn2-badge\">Triton kernels<\/span><br \/>\n          <span class=\"gdn2-badge\">1.3B params<\/span><br \/>\n          <span class=\"gdn2-badge\">100B FineWeb-Edu tokens<\/span>\n        <\/div>\n<div class=\"gdn2-cover-meta\">\n<div>\n            <strong>Authors<\/strong><br \/>\n            Ali Hatamizadeh, Yejin Choi, Jan Kautz\n          <\/div>\n<div>\n            <strong>Repo<\/strong><br \/>\n            github.com\/NVlabs\/GatedDeltaNet-2\n          <\/div>\n<div>\n            <strong>License<\/strong><br \/>\n            NVIDIA Source Code License-NC\n          <\/div>\n<\/div>\n<\/section>\n<p>      <!-- Slide 2: The Idea --><\/p>\n<section class=\"gdn2-slide\">\n        <span class=\"gdn2-step-label\">Step 01 \u00b7 The Idea<\/span>\n<h2 class=\"gdn2-title\">Two gates instead of one scalar<\/h2>\n<p>Linear attention compresses an unbounded KV cache into a fixed-size recurrent state. 
Editing this memory without scrambling existing associations is the hard part.<\/p>\n<div class=\"gdn2-grid-2\">\n<div class=\"gdn2-card\">\n<h4>The Problem<\/h4>\n<p>Prior delta-rule models (Gated DeltaNet, KDA) tie <em>erasing old content<\/em> and <em>writing new content<\/em> to one scalar gate <code>\u03b2_t<\/code>.<\/p>\n<\/div>\n<div class=\"gdn2-card\">\n<h4>The Fix<\/h4>\n<p>Split it: a channel-wise erase gate <code>b_t<\/code> on the key axis, and a channel-wise write gate <code>w_t<\/code> on the value axis.<\/p>\n<\/div>\n<\/div>\n<ul class=\"gdn2-bullets\">\n<li><strong>Erase gate<\/strong> picks which key-side coordinates of the decayed state are read and removed.<\/li>\n<li><strong>Write gate<\/strong> picks which value-side coordinates of the new content are committed.<\/li>\n<li><strong>Channel-wise decay<\/strong> is inherited from KDA for fine-grained global forgetting.<\/li>\n<\/ul>\n<\/section>\n<p>      <!-- Slide 3: The Gated Delta Rule-2 --><\/p>\n<section class=\"gdn2-slide\">\n        <span class=\"gdn2-step-label\">Step 02 \u00b7 The Update Rule<\/span>\n<h2 class=\"gdn2-title\">The Gated Delta Rule-2<\/h2>\n<p>With erase gate <code>b_t \u2208 [0,1]^{d_k}<\/code>, write gate <code>w_t \u2208 [0,1]^{d_v}<\/code>, and channel-wise decay <code>D_t = Diag(\u03b1_t)<\/code>, the recurrent state evolves as:<\/p>\n<div class=\"gdn2-eq\">S_t = (I \u2212 k_t (b_t \u2299 k_t)<sup>\u22a4<\/sup>) D_t S_{t\u22121} + k_t (w_t \u2299 v_t)<sup>\u22a4<\/sup><\/div>\n<ul class=\"gdn2-bullets\">\n<li>Recovers <strong>KDA<\/strong> exactly when both gates collapse to the same scalar.<\/li>\n<li>Recovers <strong>Gated DeltaNet<\/strong> when the decay also collapses to a scalar.<\/li>\n<li>Trains efficiently via a <strong>chunkwise WY<\/strong> form with channel-wise decay absorbed into asymmetric erase factors.<\/li>\n<\/ul>\n<\/section>\n<p>      <!-- Slide 4: Get the Code --><\/p>\n<section class=\"gdn2-slide\">\n        <span 
class=\"gdn2-step-label\">Step 03 \u00b7 Get the Code<\/span>\n<h2 class=\"gdn2-title\">Clone the repo and build the environment<\/h2>\n<p>The official PyTorch implementation ships with a Dockerfile, training scripts, and the <code>lit_gpt<\/code> model definitions.<\/p>\n<pre><code>git clone https:\/\/github.com\/NVlabs\/GatedDeltaNet-2.git\ncd GatedDeltaNet-2\n\n# build the environment from the provided Dockerfile\ndocker build -t gdn2 .\ndocker run --gpus all -it --ipc=host -v $PWD:\/workspace gdn2<\/code><\/pre>\n<div class=\"gdn2-callout\">\n          <span class=\"gdn2-callout-tag\">Repo layout<\/span>\n<div class=\"gdn2-callout-text\">\n            <code>lit_gpt\/<\/code> model code \u00b7 <code>scripts\/<\/code> launchers \u00b7 <code>pretrain.py<\/code> training entry \u00b7 <code>data.py<\/code>, <code>cache.py<\/code> data &amp; KV cache \u00b7 <code>paper\/<\/code> arXiv PDF\n          <\/div>\n<\/div>\n<\/section>\n<p>      <!-- Slide 5: Training Command --><\/p>\n<section class=\"gdn2-slide\">\n        <span class=\"gdn2-step-label\">Step 04 \u00b7 Launch Training<\/span>\n<h2 class=\"gdn2-title\">Run <code>pretrain.py<\/code><\/h2>\n<p>The streamlined command from the official README. 
Replace placeholders with your dataset paths and config name.<\/p>\n<pre><code>python ..\/pretrain.py \\\n  --train_data_dir ${TRAIN_DATA} \\\n  --val_data_dir ${VALIDATION_DATA} \\\n  --output_root ${SAVE_DIR} \\\n  --exp_name ${NAME} \\\n  --model_name ${MODEL} \\\n  --train_config ${CONFIG} \\\n  --eval_iters ${EVAL_ITERS} \\\n  --learning_rate ${LR} \\\n  --micro_batch_size ${MICRO_BATCH_SIZE}<\/code><\/pre>\n<div class=\"gdn2-callout\">\n          <span class=\"gdn2-callout-tag\">Pro tip<\/span>\n<div class=\"gdn2-callout-text\">Add <code>--interactive_job --debug<\/code> for an interactive debugging session.<\/div>\n<\/div>\n<\/section>\n<p>      <!-- Slide 6: Default Recipe --><\/p>\n<section class=\"gdn2-slide\">\n        <span class=\"gdn2-step-label\">Step 05 \u00b7 Default Recipe<\/span>\n<h2 class=\"gdn2-title\">The 1.3B \/ 100B FineWeb-Edu setup<\/h2>\n<p>Matched against Mamba-2, Gated DeltaNet, KDA, and Mamba-3 baselines under identical optimizer settings and recurrent state size.<\/p>\n<div class=\"gdn2-grid-2\">\n<div class=\"gdn2-card\">\n<h4>Optimizer<\/h4>\n<p>AdamW \u00b7 peak LR <code>4e-4<\/code> \u00b7 weight decay <code>0.1<\/code> \u00b7 gradient clip <code>1.0<\/code> \u00b7 cosine schedule \u00b7 <code>1B<\/code>-token warmup.<\/p>\n<\/div>\n<div class=\"gdn2-card\">\n<h4>Batch &amp; Sequence<\/h4>\n<p>Global batch <code>0.5M<\/code> tokens \u00b7 sequence length <code>4K<\/code> \u00b7 hybrid models use a <code>2K<\/code> sliding-window attention size.<\/p>\n<\/div>\n<div class=\"gdn2-card\">\n<h4>Model Shape<\/h4>\n<p><code>16<\/code> heads \u00b7 <code>d_k = d_v = 128<\/code> \u00b7 per-layer recurrent state <code>262,144<\/code> floats, matched against Mamba-2\/3.<\/p>\n<\/div>\n<div class=\"gdn2-card\">\n<h4>Hybrid Block<\/h4>\n<p>Repeated cell: Gated DeltaNet-2 \u2192 MLP \u2192 SWA \u2192 MLP. 
The recurrent mixer compresses long histories; SWA handles local interactions.<\/p>\n<\/div>\n<\/div>\n<\/section>\n<p>      <!-- Slide 7: Key Results --><\/p>\n<section class=\"gdn2-slide\">\n        <span class=\"gdn2-step-label\">Step 06 \u00b7 Results<\/span>\n<h2 class=\"gdn2-title\">Numbers worth pasting into a comparison<\/h2>\n<p>Best average across language modeling and commonsense reasoning, with the largest gains on long-context retrieval.<\/p>\n<table class=\"gdn2-table\">\n<thead>\n<tr>\n<th>Setting \u00b7 Metric<\/th>\n<th class=\"gdn2-num\">KDA<\/th>\n<th class=\"gdn2-num\">Mamba-3 MIMO<\/th>\n<th class=\"gdn2-num\">GDN-2<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Recurrent avg. (LMB + reasoning)<\/td>\n<td class=\"gdn2-num\">52.28<\/td>\n<td class=\"gdn2-num\">52.39<\/td>\n<td class=\"gdn2-num\"><strong>53.11<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Hybrid avg. (LMB + reasoning)<\/td>\n<td class=\"gdn2-num\">52.68<\/td>\n<td class=\"gdn2-num\">52.72<\/td>\n<td class=\"gdn2-num\"><strong>53.97<\/strong><\/td>\n<\/tr>\n<tr class=\"gdn2-row-hi\">\n<td>S-NIAH-3 @2K (recurrent)<\/td>\n<td class=\"gdn2-num\">63.2<\/td>\n<td class=\"gdn2-num\">72.4<\/td>\n<td class=\"gdn2-num\"><strong>89.8<\/strong><\/td>\n<\/tr>\n<tr class=\"gdn2-row-hi\">\n<td>MK-NIAH-1 @4K (recurrent)<\/td>\n<td class=\"gdn2-num\">28.0<\/td>\n<td class=\"gdn2-num\">18.0<\/td>\n<td class=\"gdn2-num\"><strong>37.8<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Real-world recall, recurrent avg.<\/td>\n<td class=\"gdn2-num\">28.67<\/td>\n<td class=\"gdn2-num\">28.35<\/td>\n<td class=\"gdn2-num\"><strong>29.88<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Real-world recall, hybrid avg.<\/td>\n<td class=\"gdn2-num\">40.14<\/td>\n<td class=\"gdn2-num\">40.11<\/td>\n<td class=\"gdn2-num\"><strong>42.28<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/section>\n<p>      <!-- Slide 8: Resources --><\/p>\n<section class=\"gdn2-slide\">\n        <span class=\"gdn2-step-label\">Step 07 \u00b7 Resources<\/span>\n<h2 
class=\"gdn2-title\">Paper, code, and citation<\/h2>\n<p>Everything you need to read, run, and cite Gated DeltaNet-2 in one place.<\/p>\n<div class=\"gdn2-links\">\n          <a class=\"gdn2-link\" href=\"https:\/\/github.com\/NVlabs\/GatedDeltaNet-2\" target=\"_blank\" rel=\"noopener\"><br \/>\n            <span>GitHub \u00b7 NVlabs\/GatedDeltaNet-2<\/span><br \/>\n            <span>\u2192<\/span><br \/>\n          <\/a><br \/>\n          <a class=\"gdn2-link\" href=\"https:\/\/github.com\/NVlabs\/GatedDeltaNet-2\/tree\/main\/paper\" target=\"_blank\" rel=\"noopener\"><br \/>\n            <span>Paper PDF (in repo)<\/span><br \/>\n            <span>\u2192<\/span><br \/>\n          <\/a><br \/>\n          <a class=\"gdn2-link\" href=\"https:\/\/github.com\/NVlabs\/GatedDeltaNet-2\/blob\/main\/LICENSE\" target=\"_blank\" rel=\"noopener\"><br \/>\n            <span>License (NVIDIA SCL-NC)<\/span><br \/>\n            <span>\u2192<\/span><br \/>\n          <\/a><br \/>\n          <a class=\"gdn2-link\" href=\"https:\/\/github.com\/NVlabs\/GatedDeltaNet\" target=\"_blank\" rel=\"noopener\"><br \/>\n            <span>Predecessor \u00b7 Gated DeltaNet<\/span><br \/>\n            <span>\u2192<\/span><br \/>\n          <\/a>\n        <\/div>\n<pre><code>@article{hatamizadeh2026gdn2,\n  title   = {Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention},\n  author  = {Hatamizadeh, Ali and Choi, Yejin and Kautz, Jan},\n  journal = {arXiv preprint},\n  year    = {2026}\n}<\/code><\/pre>\n<\/section><\/div>\n<\/div>\n<div class=\"gdn2-controls\">\n<div class=\"gdn2-dots\" role=\"tablist\" aria-label=\"Slide navigation\">\n      <button class=\"gdn2-dot gdn2-active\" data-go=\"0\" aria-label=\"Slide 1\"><\/button><br \/>\n      <button class=\"gdn2-dot\" data-go=\"1\" aria-label=\"Slide 2\"><\/button><br \/>\n      <button class=\"gdn2-dot\" data-go=\"2\" aria-label=\"Slide 3\"><\/button><br \/>\n      <button class=\"gdn2-dot\" data-go=\"3\" aria-label=\"Slide 
4\"><\/button><br \/>\n      <button class=\"gdn2-dot\" data-go=\"4\" aria-label=\"Slide 5\"><\/button><br \/>\n      <button class=\"gdn2-dot\" data-go=\"5\" aria-label=\"Slide 6\"><\/button><br \/>\n      <button class=\"gdn2-dot\" data-go=\"6\" aria-label=\"Slide 7\"><\/button><br \/>\n      <button class=\"gdn2-dot\" data-go=\"7\" aria-label=\"Slide 8\"><\/button>\n    <\/div>\n<div class=\"gdn2-buttons\">\n      <button class=\"gdn2-btn gdn2-prev\" disabled>\u2190 Prev<\/button><br \/>\n      <button class=\"gdn2-btn gdn2-next\">Next \u2192<\/button>\n    <\/div>\n<\/div>\n<div class=\"gdn2-tagline\">\n    <strong>MARKTECHPOST<\/strong> \u00a0\u00b7\u00a0 The hub for AI research, dev tools, and model launches\n  <\/div>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>Gated DeltaNet-2 splits the scalar \u03b2<sub>t<\/sub> into a channel-wise erase gate <code>b<sub>t<\/sub><\/code> (key axis) and a channel-wise write gate <code>w<sub>t<\/sub><\/code> (value axis).<\/li>\n<li>The update recovers KDA when both gates collapse to one scalar, and Gated DeltaNet when the decay collapses too.<\/li>\n<li>Training stays parallel via a chunkwise WY form, with channel-wise decay absorbed into asymmetric erase factors and a gate-aware backward fused in Triton.<\/li>\n<li>At 1.3B params on 100B FineWeb-Edu with matched state size, it has the best average over Mamba-2, Gated DeltaNet, KDA, and Mamba-3 in both recurrent and hybrid settings.<\/li>\n<li>Largest gains come on RULER long-context retrieval \u2014 S-NIAH-3 at 2K rises 63.2 \u2192 89.8 and MK-NIAH-1 at 4K rises 28.0 \u2192 37.8 over KDA (recurrent).<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<\/p><p class=\"wp-block-paragraph\">\n<\/p><p class=\"wp-block-paragraph\">Check out\u00a0the\u00a0<strong><a 
href=\"https:\/\/github.com\/NVlabs\/GatedDeltaNet-2\/blob\/main\/paper\/GDN2_paper.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a><\/strong> and <strong><a href=\"https:\/\/github.com\/NVlabs\/GatedDeltaNet-2\" target=\"_blank\" rel=\"noreferrer noopener\">Repo<\/a>.<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/24\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/\">NVIDIA AI Releases Gated DeltaNet-2: A Linear Attention Layer That Decouples Erase and Write in the Delta Rule<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Linear attention replaces the unbounded KV cache of softmax attention with a fixed-size recurrent state. This cuts sequence mixing to linear time and decoding to constant memory. 
The hard part is not what to forget. It is how to edit a compressed memory without scrambling existing associations. NVIDIA has released Gated DeltaNet-2, a linear attention layer that targets that bottleneck. The model decouples the active memory edit into two channel-wise gates. It is trained at 1.3B parameters on 100B FineWeb-Edu tokens. It outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 across the researchers\u2019 benchmark suite. The scalar gate problem in delta-rule models A recurrent linear attention layer stores a matrix state St and reads it with the query. DeltaNet adds an active edit by subtracting the value currently associated with the current key. It uses a scalar step size \u03b2t to control how much to overwrite. Mamba-2 adds a data-dependent scalar decay \u03b1t for global forgetting. Gated DeltaNet combined both operations, but both gates remained scalar per head. Kimi Delta Attention (KDA) refines the decay side. It replaces the scalar \u03b1t with a channel-wise vector. KDA still keeps a single scalar \u03b2t for the active edit. That scalar controls two different things at once. It decides how much old content to erase on the key side. It also decides how much new content to commit on the value side. These two decisions act on different axes of the state. Tying them together is a modeling restriction, not a property of the delta rule. https:\/\/github.com\/NVlabs\/GatedDeltaNet-2\/blob\/main\/paper\/GDN2_paper.pdf Gated Delta Rule-2: two gates instead of one Gated DeltaNet-2 separates the two decisions through Gated Delta Rule-2. It introduces a channel-wise erase gate bt \u2208 [0,1]dk on the key axis. It also introduces a channel-wise write gate wt \u2208 [0,1]dv on the value axis. Both gates are produced by sigmoid projections of the token representation. The update applies decay before the active edit. 
Written compactly, the recurrence is: St = (I \u2212 kt (bt \u2299 kt)\u22a4) Dt St\u22121 + kt (wt \u2299 vt)\u22a4 Here Dt = Diag(\u03b1t) is the channel-wise decay carried over from KDA. The left factor of the erase matrix stays kt, preserving the delta-rule write direction. The right factor becomes bt \u2299 kt, making the read direction channel-selective. The write term kt zt\u22a4 uses zt = wt \u2299 vt, making the value update channel-selective. When both gates collapse to the same scalar \u03b2t, the update recovers KDA exactly. When the decay \u03b1t also collapses to a scalar, it recovers Gated DeltaNet. Both prior models are preserved as tied subspaces of the new update. In the fast-weight view, Gated Delta Rule-2 is one online gradient step on a local regression loss. The decayed state stays close to memory, while the residual edit uses gated read and gated write targets. Chunkwise training and gate-aware backward The recurrence admits a chunkwise WY form that matches the structure used by KDA. Cumulative channel-wise decay is absorbed into the two factors of each rank-one erase. The per-chunk update becomes a product of asymmetric matrices of the form I \u2212 k\u0304r \u0113r\u22a4. The implementation uses chunk size C = 64 with fused Triton kernels. For the backward pass, the scalar shortcut used by KDA no longer applies. The write side contains a different diagonal gate over value channels. The erase side contains a different diagonal gate over key channels. So the gate factors must appear inside the dot products that accumulate gradients. The paper derives this gate-aware vector-Jacobian product explicitly. On Hopper GPUs, the fused WY backward kernel is restricted to two and four warps to avoid a Triton WGMMA layout assertion. Block design and hybrid model Gated DeltaNet-2 is used as the recurrent token mixer in a standard Transformer-style block. Query and key paths use linear projection, short causal convolution, SiLU, and L2 normalization. 
The value path uses linear projection, short convolution, and SiLU. The decay \u03b1t, erase gate bt, and write gate wt come from separate linear branches. The recurrent output is RMS-normalized, multiplied by a SiLU output gate, and projected back. A hybrid variant inserts Sliding-Window Attention (SWA) after the recurrent mixer. A repeated cell contains Gated DeltaNet-2, an MLP, SWA, and another MLP. SWA handles exact local interactions, while the recurrent mixer compresses long histories. The hybrid retains linear sequence scaling with a bounded attention cache. Results at 1.3B parameters All models are 1.3B parameters trained on 100B FineWeb-Edu tokens. Parameter count and recurrent state size are matched across models. The recurrent state holds 262,144 floats per layer per batch element. Training length is 4K tokens, and hybrid models use a 2K SWA window. The Mamba-3 MIMO baseline uses rank R = 4. On language modeling and commonsense reasoning, Gated DeltaNet-2 has the best average in both settings. The recurrent model averages 53.11 across LAMBADA and the reasoning suite. That sits above Mamba-3 MIMO at 52.39 and KDA at 52.28. In the hybrid setting, Gated DeltaNet-2 averages 53.97 against Mamba-3 MIMO at 52.72. Since recurrent state size is matched, the gain points to the update rule, not more memory. The clearest gains appear on RULER long-context retrieval. In the recurrent setting, S-NIAH-2 at 4K rises from 89.0 (KDA) to 93.0. S-NIAH-3 at 2K jumps from 63.2 (KDA) to 89.8. MK-NIAH-1 at 4K climbs from 28.0 (KDA) to 37.8. On real-world retrieval (SWDE, SQuAD, FDA, TriviaQA, NQ, DROP), Gated DeltaNet-2 also leads both settings. The recurrent average is 29.88 and the hybrid average is 42.28. Marktechpost\u2019s Visual Explainer Gated DeltaNet-2 \u00b7 Quickstart 01 \/ 08 NVIDIA \u00b7 2026 Gated DeltaNet-2 Decoupling Erase and Write in Linear Attention. A delta-rule recurrent attention layer with channel-wise erase and write gates. 
PyTorch Triton kernels 1.3B params 100B FineWeb-Edu tokens Authors Ali Hatamizadeh, Yejin Choi, Jan Kautz Repo github.com\/NVlabs\/GatedDeltaNet-2 License NVIDIA Source Code License-NC Step 01 \u00b7 The Idea Two gates instead of one scalar Linear attention compresses an unbounded KV cache into a fixed-size recurrent state. Editing this memory without scrambling existing associations is the hard part. The<\/p>","protected":false},"author":2,"featured_media":92595,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-92594","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>NVIDIA AI Releases Gated DeltaNet-2: A Linear 
Attention Layer That Decouples Erase and Write in the Delta Rule - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/th\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/\" \/>\n<meta property=\"og:locale\" content=\"th_TH\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"NVIDIA AI Releases Gated DeltaNet-2: A Linear Attention Layer That Decouples Erase and Write in the Delta Rule - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/th\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-24T16:59:04+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 \u0e19\u0e32\u0e17\u0e35\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"NVIDIA AI Releases Gated DeltaNet-2: A Linear Attention Layer That Decouples Erase and Write in the Delta Rule\",\"datePublished\":\"2026-05-24T16:59:04+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/\"},\"wordCount\":1555,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-CL0tro.png\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"th\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/\",\"url\":\"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-at
tention-layer-that-decouples-erase-and-write-in-the-delta-rule\/\",\"name\":\"NVIDIA AI Releases Gated DeltaNet-2: A Linear Attention Layer That Decouples Erase and Write in the Delta Rule - YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-CL0tro.png\",\"datePublished\":\"2026-05-24T16:59:04+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/#breadcrumb\"},\"inLanguage\":\"th\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"th\",\"@id\":\"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-CL0tro.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-CL0tro.png\",\"width\":1456,\"height\":842},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/#breadcrumb\",\"itemL
istElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"NVIDIA AI Releases Gated DeltaNet-2: A Linear Attention Layer That Decouples Erase and Write in the Delta Rule\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"th\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"th\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"th\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin 
NU\"},\"url\":\"https:\/\/youzum.net\/th\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"NVIDIA AI Releases Gated DeltaNet-2: A Linear Attention Layer That Decouples Erase and Write in the Delta Rule - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/th\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/","og_locale":"th_TH","og_type":"article","og_title":"NVIDIA AI Releases Gated DeltaNet-2: A Linear Attention Layer That Decouples Erase and Write in the Delta Rule - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/th\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-05-24T16:59:04+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"Written by":"admin NU","Est. 
reading time":"9 \u0e19\u0e32\u0e17\u0e35"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"NVIDIA AI Releases Gated DeltaNet-2: A Linear Attention Layer That Decouples Erase and Write in the Delta Rule","datePublished":"2026-05-24T16:59:04+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/"},"wordCount":1555,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-CL0tro.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"th","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/","url":"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/","name":"NVIDIA AI Releases Gated DeltaNet-2: A Linear Attention Layer That Decouples Erase and Write in the Delta Rule - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-CL0tro.png","datePublished":"2026-05-24T16:59:04+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/#breadcrumb"},"inLanguage":"th","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/"]}]},{"@type":"ImageObject","inLanguage":"th","@id":"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/#primaryimage","url":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-CL0tro.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-CL0tro.png","width":1456,"height":842},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/nvidia-ai-releases-gated-deltanet-2-a-linear-attention-layer-that-decouples-erase-and-write-in-the-delta-rule\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"NVIDIA AI Releases Gated DeltaNet-2: A Linear Attention Layer That Decouples Erase and Write in the Delta 
Rule"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"th"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"th","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"th","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin 
NU"},"url":"https:\/\/youzum.net\/th\/members\/adminnu\/"}]}},"rttpg_featured_image_url":{"full":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-CL0tro.png",1456,842,false],"landscape":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-CL0tro.png",1456,842,false],"portraits":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-CL0tro.png",1456,842,false],"thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-CL0tro-150x150.png",150,150,true],"medium":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-CL0tro-300x173.png",300,173,true],"large":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-CL0tro-1024x592.png",1024,592,true],"1536x1536":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-CL0tro.png",1456,842,false],"2048x2048":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-CL0tro.png",1456,842,false],"trp-custom-language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-CL0tro-18x10.png",18,10,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-CL0tro-300x300.png",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-CL0tro-600x347.png",600,347,true],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-12.26.27-AM-1-CL0tro-100x100.png",100,100,true]},"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/th\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/th\/category\/ai-club\/\" rel=\"category 
tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/th\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/th\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/th\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"Linear attention replaces the unbounded KV cache of softmax attention with a fixed-size recurrent state. This cuts sequence mixing to linear time and decoding to constant memory. The hard part is not what to forget. It is how to edit a compressed memory without scrambling existing associations. NVIDIA has released Gated DeltaNet-2, a linear attention&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/posts\/92594","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/comments?post=92594"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/posts\/92594\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/media\/92595"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/media?parent=92594"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/categories?post=92594"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/th\/wp-json\/wp\/v2\/tags?post=92594"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}