{"id":91688,"date":"2026-05-20T16:48:48","date_gmt":"2026-05-20T16:48:48","guid":{"rendered":"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/"},"modified":"2026-05-20T16:48:48","modified_gmt":"2026-05-20T16:48:48","slug":"nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b","status":"publish","type":"post","link":"https:\/\/youzum.net\/fr\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/","title":{"rendered":"NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6\u00d7 Tokens Per Forward Over Qwen3-8B"},"content":{"rendered":"<p>NVIDIA researchers have released Nemotron-Labs-Diffusion, a language model family that unifies three decoding modes in one architecture. The model supports autoregressive (AR) decoding, diffusion-based parallel decoding, and self-speculation decoding. It is available in 3B, 8B, and 14B parameter sizes. The family includes base, instruct, and vision-language variants.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Sequential Decoding Limits Throughput<\/strong><\/h2>\n<p>Standard autoregressive (AR) language models generate text one token at a time, left to right. Each token depends on all previous tokens. This sequential dependency limits GPU parallelism per generation step. The result is low hardware utilization at low batch sizes \u2014 the typical setting for single-user or edge deployment.<\/p>\n<p>Diffusion language models (LMs) offer a different approach. Instead of generating tokens sequentially, they denoise multiple tokens in parallel per forward pass. This enables higher throughput. The tradeoff has been accuracy: diffusion LMs have consistently lagged behind AR models on benchmarks, requiring substantially more data to reach comparable performance. 
A key reason is that diffusion training treats all token permutations uniformly, rather than leveraging the strong left-to-right prior inherent in natural language.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1338\" height=\"568\" data-attachment-id=\"79984\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/20\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/screenshot-2026-05-20-at-3-28-04-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1.png\" data-orig-size=\"1338,568\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-05-20 at 3.28.04\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-1024x435.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1.png\" alt=\"\" class=\"wp-image-79984\" \/><figcaption class=\"wp-element-caption\">https:\/\/d1qx31qr3h6wln.cloudfront.net\/publications\/Nemotron_Diffusion_Tech_Report_v1.pdf?VersionId=db8_EMO8B.vmU26.jr7Le9pN3MqcUDNL<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>What Is a Tri-Mode Language Model?<\/strong><\/h2>\n<p>Nemotron-Labs-Diffusion is trained on a joint AR-diffusion objective. At inference time, it operates in three modes depending on the deployment context. 
There are no mode-specific architectural modifications \u2014 the same weights serve all three modes.<\/p>\n<p><strong>AR mode<\/strong> is standard left-to-right autoregressive decoding using causal attention. This mode is best suited for high-concurrency cloud serving.<\/p>\n<p><strong>Diffusion mode<\/strong> denoises multiple tokens in parallel within a fixed-length block. The sequence is partitioned into contiguous blocks. Within each block, tokens attend bidirectionally. Across blocks, attention remains causal, so prior blocks can reuse their KV cache. A lightweight trained sampler predicts, per masked position, whether the model\u2019s top-1 prediction at the current denoising step is correct. Positions predicted as correct are committed in that step. This allows the model to commit multiple tokens per forward pass.<\/p>\n<p><strong>Self-speculation mode<\/strong> uses the diffusion pathway to draft candidate tokens and the AR pathway to verify them, within the same single model. No auxiliary draft model or separate prediction head is required. The diffusion pathway generates a block of k candidate tokens in parallel. The AR pathway then runs a second forward pass over those candidates using causal attention, verifying the longest contiguous prefix that matches AR predictions. Each cycle produces between 1 and k+1 verified tokens. This contrasts with Multi-Token Prediction (MTP) methods such as Eagle3, which use small auxiliary draft heads attached to an AR backbone.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Training<\/strong><\/h2>\n<p><strong>The joint training objective combines an AR next-token prediction loss and a block-wise diffusion denoising loss:<\/strong><\/p>\n<p><strong>\u2112(\u03b8) = \u2112_AR(\u03b8) + \u03b1 \u00b7 \u2112_diff(\u03b8)<\/strong><\/p>\n<p>The coefficient \u03b1 is set to 0.3 across all training stages. 
Ablation experiments varying \u03b1 from 0.1 to 1.0 show that both AR-mode and diffusion-mode accuracy peak at \u03b1 = 0.3. No value in the range [0.1, 0.5] improves one mode at the expense of the other \u2014 the two objectives rise and fall together.<\/p>\n<p><strong>Two-stage training<\/strong> first trains the model purely on the AR objective for 1 trillion tokens, building strong left-to-right linguistic priors. Stage 2 then introduces the joint objective for 300 billion additional tokens. In ablations, two-stage training contributed +5.74% average accuracy. Adding the AR loss contributed the single largest gain at +7.48%. Global loss averaging \u2014 treating all tokens across a batch equally rather than averaging per-sequence first \u2014 contributed +2.12% by reducing gradient variance from variable diffusion masking ratios. Cumulatively, the full training pipeline improved the baseline by 16.05% average accuracy.<\/p>\n<p>All models are initialized from pretrained Ministral3 base models, not trained from scratch. Training was performed on 256 NVIDIA H100 GPUs. Instruct models are trained via supervised fine-tuning (SFT) on 45 billion tokens on top of the base models, using the same joint AR-diffusion objective with \u03b1 = 0.3. The training and inference pipeline is released through Megatron Bridge.<\/p>\n<h2 class=\"wp-block-heading\"><strong>LoRA-Enhanced Linear Self-Speculation<\/strong><\/h2>\n<p>The base diffusion-to-AR alignment in self-speculation can be improved with a LoRA adapter. This adapter is fine-tuned on the diffusion draft pathway to better align its output with the AR verifier. It targets only the o_proj layer of the attention module (rank 128, \u03b1 = 512, approximately 36M trainable parameters, 0.4% of the backbone). 
LoRA tuning improves tokens per forward (TPF) by 14.4%, 32.5%, and 27.6% at the 3B, 8B, and 14B scales respectively, with negligible accuracy change.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Speed-of-Light Analysis<\/strong><\/h2>\n<p>The research team reports a speed-of-light (SOL) analysis: a theoretical upper bound on tokens per forward pass achievable by the diffusion mode, assuming an oracle sampler that correctly identifies all positions that can be safely committed in parallel.<\/p>\n<p>At block length 32, the SOL acceptance rate reaches 7.60\u00d7 on average, exceeding 10\u00d7 on coding and multilingual tasks. Current confidence-based sampling achieves approximately 3\u00d7 TPF at comparable accuracy, leaving a large gap to the SOL ceiling.<\/p>\n<p>Linear self-speculation comes close to SOL on acceptance rate (6.82\u00d7 vs. 7.60\u00d7). The gap in real tokens per forward pass (TPF) is much larger, however: 6.02\u00d7 for SOL versus 3.41\u00d7 for linear self-speculation, which puts SOL about 76.5% higher. Linear self-speculation requires two forward passes per cycle (one diffusion draft, one AR verify) and accepts only a contiguous prefix. 
These two constraints cap its real TPF well below SOL, even when drafter and verifier are well aligned.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img decoding=\"async\" width=\"1708\" height=\"646\" data-attachment-id=\"79986\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/20\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/screenshot-2026-05-20-at-3-28-40-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.40-AM-1.png\" data-orig-size=\"1708,646\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-05-20 at 3.28.40\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.40-AM-1-1024x387.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.40-AM-1.png\" alt=\"NVIDIA introduces Nemotron-Labs-Diffusion, a 3B\/8B\/14B model family achieving 5.99\u00d7 tokens per forward over Qwen3-8B using self-speculation decoding.\" class=\"wp-image-79986\" \/><figcaption class=\"wp-element-caption\">https:\/\/d1qx31qr3h6wln.cloudfront.net\/publications\/Nemotron_Diffusion_Tech_Report_v1.pdf?VersionId=db8_EMO8B.vmU26.jr7Le9pN3MqcUDNL<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Benchmark Results<\/strong><\/h2>\n<p><strong>On the 10-task instruct evaluation (HumanEval, MBPP, LiveCodeBench-CPP, GSM8K, Math500, AIME24, AIME25, GPQA, IFEval, MMLU):<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>NLD-8B AR mode<\/strong>: 63.61% average accuracy, versus 
62.75% for Qwen3-8B and 58.02% for Ministral3-8B-Instruct.<\/li>\n<li><strong>NLD-8B diffusion mode<\/strong>: 63.18% average accuracy with 2.57\u00d7 TPF.<\/li>\n<li><strong>NLD-8B LoRA-tuned linear self-speculation<\/strong>: 62.81% average accuracy with 5.99\u00d7 TPF.<\/li>\n<li><strong>NLD-8B quadratic self-speculation<\/strong>: 64.04% average accuracy with 6.38\u00d7 TPF.<\/li>\n<\/ul>\n<p>On SPEED-Bench with SGLang on an NVIDIA GB200 GPU, linear self-speculation achieves 4\u00d7 higher throughput than Qwen3-8B and 3.3\u00d7 speedup over the NLD-8B AR mode at concurrency 1 (3.97\u00d7 with an optimized CUDA kernel). Compared to Qwen3-8B-Eagle3, linear self-speculation delivers a 2.4\u00d7, 2.3\u00d7, and 1.8\u00d7 speedup at batch size 1 on GB200, RTX Pro 6000, and DGX Spark respectively.<\/p>\n<p><strong>Acceptance length<\/strong> is the underlying reason for this advantage. Across SPEED-Bench categories, NLD achieves average acceptance lengths of 5.46 (native) and 6.82 (with LoRA) tokens per draft step. Eagle3 averages 2.75 and Qwen3-9B-MTP averages 4.24. On the four diffusion-friendly categories \u2014 coding, math, reasoning, and multilingual \u2014 the gap widens further: 8.69 for NLD-LoRA versus 2.81 for Eagle3.<\/p>\n<p>At 14B scale with LoRA-tuned linear self-speculation, NLD-14B achieves 66.36% average accuracy at 5.96\u00d7 TPF, outperforming Qwen3-14B at 65.17% accuracy in AR mode.<\/p>\n<p>The vision-language model, Nemotron-Labs-Diffusion-VLM-8B, extends the same framework to multimodal tasks. 
In linear self-speculation mode, it achieves 3.63\u00d7 to 7.45\u00d7 TPF \u2014 the higher end for responses over 200 tokens \u2014 with a 0.1% average accuracy drop versus AR mode.<\/p>\n<h2 class=\"wp-block-heading\"><strong><\/strong><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<div>\n<div class=\"nld-header\">\n<div class=\"nld-header-left\">\n      <span class=\"nld-nvidia-badge\">NVIDIA<\/span><br \/>\n      <span class=\"nld-header-title\">Nemotron-Labs-Diffusion \u2014 Usage Guide<\/span>\n    <\/div>\n<p>    <span class=\"nld-header-count\">01 \/ 07<\/span>\n  <\/p><\/div>\n<div class=\"nld-progress\">\n<div class=\"nld-progress-fill\"><\/div>\n<\/div>\n<div class=\"nld-tabs\">\n    <button class=\"nld-tab active\">Overview<\/button><br \/>\n    <button class=\"nld-tab\">Three Modes<\/button><br \/>\n    <button class=\"nld-tab\">Install<\/button><br \/>\n    <button class=\"nld-tab\">Basic Usage<\/button><br \/>\n    <button class=\"nld-tab\">Self-Speculation<\/button><br \/>\n    <button class=\"nld-tab\">Production Serving<\/button><br \/>\n    <button class=\"nld-tab\">When to Use<\/button>\n  <\/div>\n<div class=\"nld-slides\">\n<p>    <!-- 0 Overview --><\/p>\n<div class=\"nld-slide active\">\n<h2 class=\"nld-slide-h\">What is Nemotron-Labs-Diffusion?<\/h2>\n<div class=\"nld-slide-sub\">A single model checkpoint. Three decoding modes. No architecture changes.<\/div>\n<p class=\"nld-p\">Nemotron-Labs-Diffusion is a language model family from NVIDIA that combines autoregressive (AR) decoding, diffusion-based parallel decoding, and self-speculation decoding in one set of weights. 
You switch modes at inference time by changing the attention pattern \u2014 no separate model files needed.<\/p>\n<div class=\"nld-badge-row\">\n<div class=\"nld-badge\">Sizes: <b>3B \u00a0\u00b7\u00a0 8B \u00a0\u00b7\u00a0 14B<\/b><\/div>\n<div class=\"nld-badge\">Variants: <b>Base \u00a0\u00b7\u00a0 Instruct \u00a0\u00b7\u00a0 VLM<\/b><\/div>\n<div class=\"nld-badge\">Requires: <b>transformers \u2265 5.0.0<\/b><\/div>\n<div class=\"nld-badge\">License: <b>NVIDIA Nemotron Open Model<\/b><\/div>\n<\/div>\n<div class=\"nld-stat-row\">\n<div class=\"nld-stat\">\n<div class=\"s-num\">5.99\u00d7<\/div>\n<div class=\"s-label\">Tokens per forward vs Qwen3-8B (Linear Self-Speculation, 8B)<\/div>\n<\/div>\n<div class=\"nld-stat\">\n<div class=\"s-num\">3.3\u00d7<\/div>\n<div class=\"s-label\">Throughput over AR mode at concurrency 1 (GB200)<\/div>\n<\/div>\n<div class=\"nld-stat\">\n<div class=\"s-num\">2.4\u00d7<\/div>\n<div class=\"s-label\">Faster than Qwen3-8B-Eagle3 at batch size 1 (GB200)<\/div>\n<\/div>\n<div class=\"nld-stat\">\n<div class=\"s-num\">63.61%<\/div>\n<div class=\"s-label\">Avg accuracy, 8B AR mode vs 62.75% Qwen3-8B<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>    <!-- 1 Modes --><\/p>\n<div class=\"nld-slide\">\n<h2 class=\"nld-slide-h\">The Three Decoding Modes<\/h2>\n<div class=\"nld-slide-sub\">Same weights. Different attention pattern. Pick based on your deployment.<\/div>\n<div class=\"nld-mode-grid\">\n<div class=\"nld-mode-card\">\n<div class=\"mc-tag\">Mode 1<\/div>\n<div class=\"mc-title\">AR Decoding<\/div>\n<div class=\"mc-desc\">Standard left-to-right generation using causal attention. One token per forward pass. 
Compatible with all existing AR serving infrastructure.<\/div>\n<div class=\"mc-use\">Best for: high-concurrency cloud serving where GPU compute is fully saturated by batching.<\/div>\n<\/div>\n<div class=\"nld-mode-card\">\n<div class=\"mc-tag\">Mode 2<\/div>\n<div class=\"mc-title\">Diffusion Decoding<\/div>\n<div class=\"mc-desc\">Denoises multiple tokens per block in parallel. Adjust the <code>threshold<\/code> value to trade accuracy for higher throughput. 2.57\u00d7 TPF at threshold 0.9.<\/div>\n<div class=\"mc-use\">Best for: flexible accuracy\u2013throughput tradeoff from one model.<\/div>\n<\/div>\n<div class=\"nld-mode-card\">\n<div class=\"mc-tag\">Mode 3<\/div>\n<div class=\"mc-title\">Self-Speculation<\/div>\n<div class=\"mc-desc\">Diffusion drafts k tokens in parallel. AR verifies them in a second pass. Accepts the longest matching prefix. No auxiliary model or extra heads needed.<\/div>\n<div class=\"mc-use\">Best for: low-concurrency or single-user inference where per-user speed matters most.<\/div>\n<\/div>\n<\/div>\n<div class=\"nld-tip\">\n        <strong>How mode switching works:<\/strong> You call a different method on the same model object \u2014 <code>ar_generate()<\/code>, <code>generate()<\/code>, or <code>linear_spec_generate()<\/code>. The model weights do not change.\n      <\/div>\n<\/div>\n<p>    <!-- 2 Install --><\/p>\n<div class=\"nld-slide\">\n<h2 class=\"nld-slide-h\">Installation<\/h2>\n<div class=\"nld-slide-sub\">Two pip installs. CUDA-capable GPU required.<\/div>\n<p class=\"nld-p\">The model uses <code>trust_remote_code=True<\/code> because custom modeling code is bundled with the checkpoint on Hugging Face. 
Install <code>peft<\/code> only if you plan to use the LoRA-enhanced self-speculation mode.<\/p>\n<div class=\"nld-code-wrap\">\n<div class=\"nld-code-label\">Step 1 \u2014 core dependencies<\/div>\n<p>        <button class=\"nld-copy\">Copy<\/button><\/p>\n<pre><span class=\"kw\">pip install<\/span> <span class=\"st\">\"transformers&gt;=5.0.0\"<\/span> torch accelerate<\/pre>\n<\/div>\n<div class=\"nld-code-wrap\">\n<div class=\"nld-code-label\">Step 2 \u2014 optional: LoRA-enhanced self-speculation<\/div>\n<p>        <button class=\"nld-copy\">Copy<\/button><\/p>\n<pre><span class=\"kw\">pip install<\/span> peft<\/pre>\n<\/div>\n<div class=\"nld-code-wrap\">\n<div class=\"nld-code-label\">Step 3 \u2014 load model (swap model ID for 3B or 14B)<\/div>\n<p>        <button class=\"nld-copy\">Copy<\/button><\/p>\n<pre><span class=\"kw\">from<\/span> transformers <span class=\"kw\">import<\/span> AutoModel, AutoTokenizer\n<span class=\"kw\">import<\/span> torch\n\n<span class=\"cm\"># Available: nvidia\/Nemotron-Labs-Diffusion-3B<\/span>\n<span class=\"cm\">#            nvidia\/Nemotron-Labs-Diffusion-8B<\/span>\n<span class=\"cm\">#            nvidia\/Nemotron-Labs-Diffusion-14B<\/span>\nrepo = <span class=\"st\">\"nvidia\/Nemotron-Labs-Diffusion-8B\"<\/span>\n\ntokenizer = AutoTokenizer.<span class=\"fn\">from_pretrained<\/span>(repo, trust_remote_code=<span class=\"kw\">True<\/span>)\nmodel     = AutoModel.<span class=\"fn\">from_pretrained<\/span>(repo, trust_remote_code=<span class=\"kw\">True<\/span>)\nmodel     = model.<span class=\"fn\">cuda<\/span>().<span class=\"fn\">to<\/span>(torch.bfloat16)<\/pre>\n<\/div>\n<\/div>\n<p>    <!-- 3 Basic Usage --><\/p>\n<div class=\"nld-slide\">\n<h2 class=\"nld-slide-h\">Basic Usage \u2014 All Three Modes<\/h2>\n<div class=\"nld-slide-sub\">Prepare the prompt once. Choose a generate call.<\/div>\n<p class=\"nld-p\">All three modes share the same tokenization step. 
The variable <code>nfe<\/code> (num function evals) returned alongside output IDs lets you measure how many forward passes were used to produce the output.<\/p>\n<div class=\"nld-code-wrap\">\n<div class=\"nld-code-label\">Shared \u2014 build prompt_ids<\/div>\n<p>        <button class=\"nld-copy\">Copy<\/button><\/p>\n<pre>history = [{\"role\": <span class=\"st\">\"user\"<\/span>, \"content\": <span class=\"st\">\"Explain gradient descent.\"<\/span>}]\nprompt     = tokenizer.<span class=\"fn\">apply_chat_template<\/span>(history, tokenize=<span class=\"kw\">False<\/span>,\n                                              add_generation_prompt=<span class=\"kw\">True<\/span>)\nprompt_ids = tokenizer(prompt, return_tensors=<span class=\"st\">\"pt\"<\/span>).input_ids.<span class=\"fn\">to<\/span>(<span class=\"st\">\"cuda\"<\/span>)<\/pre>\n<\/div>\n<div class=\"nld-code-wrap\">\n<div class=\"nld-code-label\">AR Mode \u2014 standard autoregressive<\/div>\n<p>        <button class=\"nld-copy\">Copy<\/button><\/p>\n<pre>out_ids, nfe = model.<span class=\"fn\">ar_generate<\/span>(prompt_ids, max_new_tokens=<span class=\"num\">512<\/span>)<\/pre>\n<\/div>\n<div class=\"nld-code-wrap\">\n<div class=\"nld-code-label\">Diffusion Mode \u2014 parallel decoding (threshold adjusts speed vs accuracy)<\/div>\n<p>        <button class=\"nld-copy\">Copy<\/button><\/p>\n<pre>out_ids, nfe = model.<span class=\"fn\">generate<\/span>(\n    prompt_ids,\n    max_new_tokens=<span class=\"num\">512<\/span>,\n    block_length=<span class=\"num\">32<\/span>,\n    threshold=<span class=\"num\">0.9<\/span>,\n    eos_token_id=tokenizer.eos_token_id\n)<\/pre>\n<\/div>\n<div class=\"nld-code-wrap\">\n<div class=\"nld-code-label\">Decode output \u2014 same for all modes<\/div>\n<p>        <button class=\"nld-copy\">Copy<\/button><\/p>\n<pre>text = tokenizer.<span class=\"fn\">batch_decode<\/span>(\n    out_ids[:, prompt_ids.shape[<span class=\"num\">1<\/span>]:], skip_special_tokens=<span 
class=\"kw\">True<\/span>\n)[<span class=\"num\">0<\/span>]\n<span class=\"fn\">print<\/span>(<span class=\"fn\">f<\/span><span class=\"st\">\"Output: {text}\\nNFE: {nfe}\"<\/span>)<\/pre>\n<\/div>\n<\/div>\n<p>    <!-- 4 Self-Speculation --><\/p>\n<div class=\"nld-slide\">\n<h2 class=\"nld-slide-h\">Self-Speculation + LoRA Drafter<\/h2>\n<div class=\"nld-slide-sub\">Highest per-user throughput. Optional LoRA for higher acceptance length.<\/div>\n<p class=\"nld-p\">Without LoRA, average acceptance length is 5.46 tokens per draft step. With LoRA it rises to 6.82, versus 2.75 for Eagle3 and 4.24 for Qwen3-9B-MTP. The LoRA adapter is stored inside the same Hugging Face repo under <code>linear_spec_lora\/<\/code>.<\/p>\n<div class=\"nld-code-wrap\">\n<div class=\"nld-code-label\">Linear self-speculation \u2014 without LoRA<\/div>\n<p>        <button class=\"nld-copy\">Copy<\/button><\/p>\n<pre>out_ids, nfe = model.<span class=\"fn\">linear_spec_generate<\/span>(\n    prompt_ids,\n    max_new_tokens=<span class=\"num\">512<\/span>,\n    block_length=<span class=\"num\">32<\/span>,\n    eos_token_id=tokenizer.eos_token_id\n)<\/pre>\n<\/div>\n<div class=\"nld-code-wrap\">\n<div class=\"nld-code-label\">Linear self-speculation \u2014 with LoRA drafter (recommended)<\/div>\n<p>        <button class=\"nld-copy\">Copy<\/button><\/p>\n<pre><span class=\"kw\">from<\/span> peft <span class=\"kw\">import<\/span> PeftModel\n\nrepo  = <span class=\"st\">\"nvidia\/Nemotron-Labs-Diffusion-8B\"<\/span>\nmodel = AutoModel.<span class=\"fn\">from_pretrained<\/span>(repo, trust_remote_code=<span class=\"kw\">True<\/span>)\nmodel = model.<span class=\"fn\">cuda<\/span>().<span class=\"fn\">to<\/span>(torch.bfloat16)\n\n<span class=\"cm\"># Attach the LoRA adapter from the same repo<\/span>\nmodel = PeftModel.<span class=\"fn\">from_pretrained<\/span>(\n    model, repo, subfolder=<span class=\"st\">\"linear_spec_lora\"<\/span>\n).<span class=\"fn\">eval<\/span>()\n\n<span class=\"cm\"># 
Unwrap to call linear_spec_generate directly<\/span>\nbase = model.model\n\nout_ids, nfe = base.<span class=\"fn\">linear_spec_generate<\/span>(\n    prompt_ids,\n    max_new_tokens=<span class=\"num\">512<\/span>,\n    block_length=<span class=\"num\">32<\/span>,\n    eos_token_id=tokenizer.eos_token_id\n)\n<span class=\"fn\">print<\/span>(tokenizer.<span class=\"fn\">decode<\/span>(\n    out_ids[<span class=\"num\">0<\/span>, prompt_ids.shape[<span class=\"num\">1<\/span>]:], skip_special_tokens=<span class=\"kw\">True<\/span>\n))\n<span class=\"fn\">print<\/span>(<span class=\"fn\">f<\/span><span class=\"st\">\"NFE: {nfe}\"<\/span>)<\/pre>\n<\/div>\n<\/div>\n<p>    <!-- 5 Production Serving --><\/p>\n<div class=\"nld-slide\">\n<h2 class=\"nld-slide-h\">Production Serving: vLLM &amp; SGLang<\/h2>\n<div class=\"nld-slide-sub\">OpenAI-compatible API. Standard curl calls work out of the box.<\/div>\n<p class=\"nld-p\">SGLang was used for all SPEED-Bench measurements in the paper and is the recommended serving framework for self-speculation mode. 
Both frameworks expose an OpenAI-compatible <code>\/v1\/chat\/completions<\/code> endpoint.<\/p>\n<div class=\"nld-code-wrap\">\n<div class=\"nld-code-label\">vLLM \u2014 install and serve<\/div>\n<p>        <button class=\"nld-copy\">Copy<\/button><\/p>\n<pre><span class=\"kw\">pip install<\/span> vllm\nvllm serve <span class=\"st\">\"nvidia\/Nemotron-Labs-Diffusion-8B\"<\/span><\/pre>\n<\/div>\n<div class=\"nld-code-wrap\">\n<div class=\"nld-code-label\">SGLang \u2014 install and serve<\/div>\n<p>        <button class=\"nld-copy\">Copy<\/button><\/p>\n<pre><span class=\"kw\">pip install<\/span> sglang\npython3 -m sglang.launch_server \\\n    --model-path <span class=\"st\">\"nvidia\/Nemotron-Labs-Diffusion-8B\"<\/span> \\\n    --host 0.0.0.0 --port <span class=\"num\">30000<\/span><\/pre>\n<\/div>\n<div class=\"nld-code-wrap\">\n<div class=\"nld-code-label\">Call the server \u2014 OpenAI-compatible (port 30000 matches the SGLang commands; vLLM defaults to 8000)<\/div>\n<p>        <button class=\"nld-copy\">Copy<\/button><\/p>\n<pre>curl -X POST <span class=\"st\">\"http:\/\/localhost:30000\/v1\/chat\/completions\"<\/span> \\\n  -H <span class=\"st\">\"Content-Type: application\/json\"<\/span> \\\n  --data <span class=\"st\">'{\n    \"model\": \"nvidia\/Nemotron-Labs-Diffusion-8B\",\n    \"messages\": [{ \"role\": \"user\", \"content\": \"Your prompt here.\" }]\n  }'<\/span><\/pre>\n<\/div>\n<div class=\"nld-code-wrap\">\n<div class=\"nld-code-label\">SGLang with Docker<\/div>\n<p>        <button class=\"nld-copy\">Copy<\/button><\/p>\n<pre>docker run --gpus all --shm-size 32g -p <span class=\"num\">30000<\/span>:<span class=\"num\">30000<\/span> \\\n  -v ~\/.cache\/huggingface:\/root\/.cache\/huggingface \\\n  --env <span class=\"st\">\"HF_TOKEN=&lt;your_token&gt;\"<\/span> --ipc=host \\\n  lmsysorg\/sglang:latest \\\n  python3 -m sglang.launch_server \\\n    --model-path <span class=\"st\">\"nvidia\/Nemotron-Labs-Diffusion-8B\"<\/span> \\\n    --host 0.0.0.0 --port <span class=\"num\">30000<\/span><\/pre>\n<\/div>\n<\/div>\n<p>    <!-- 6 
When to Use --><\/p>\n<div class=\"nld-slide\">\n<h2 class=\"nld-slide-h\">When to Use Each Mode<\/h2>\n<div class=\"nld-slide-sub\">Match the mode to your deployment context.<\/div>\n<table class=\"nld-table\">\n<thead>\n<tr>\n<th>Scenario<\/th>\n<th>Mode<\/th>\n<th>Reason<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>High-concurrency API (many users)<\/td>\n<td><span class=\"t-mode\">ar_generate()<\/span><\/td>\n<td class=\"t-dim\">GPU is fully saturated by batching. Sequential decoding is not the bottleneck.<\/td>\n<\/tr>\n<tr>\n<td>Single-user or edge inference<\/td>\n<td><span class=\"t-mode\">linear_spec_generate() + LoRA<\/span><\/td>\n<td class=\"t-dim\">3.3\u00d7 over AR on GB200. 2.4\u00d7 over Eagle3 at batch size 1.<\/td>\n<\/tr>\n<tr>\n<td>Adjustable speed vs accuracy<\/td>\n<td><span class=\"t-mode\">generate() \u2014 diffusion<\/span><\/td>\n<td class=\"t-dim\">Tune <code>threshold<\/code> between 0 and 1. Lower threshold = more tokens per pass = lower accuracy.<\/td>\n<\/tr>\n<tr>\n<td>Existing AR serving stack<\/td>\n<td><span class=\"t-mode\">ar_generate()<\/span><\/td>\n<td class=\"t-dim\">Drop-in replacement. No infrastructure changes needed.<\/td>\n<\/tr>\n<tr>\n<td>Coding, math, multilingual tasks<\/td>\n<td><span class=\"t-mode\">linear_spec_generate() + LoRA<\/span><\/td>\n<td class=\"t-dim\">Acceptance length peaks on structured content: 8.57\u00d7 coding, 8.14\u00d7 math.<\/td>\n<\/tr>\n<tr>\n<td>Vision-language, long responses<\/td>\n<td><span class=\"t-mode\">VLM \u2014 linear_spec_generate()<\/span><\/td>\n<td class=\"t-dim\">Up to 7.45\u00d7 TPF on responses over 200 tokens. 
0.1% accuracy drop vs AR.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div class=\"nld-tip\">\n        <strong>Model collection on Hugging Face:<\/strong> huggingface.co\/collections\/nvidia\/nemotron-labs-diffusion \u2014 includes 3B, 8B, 14B base, instruct, and VLM checkpoints.\n      <\/div>\n<\/div>\n<\/div>\n<div class=\"nld-nav\">\n    <button class=\"nld-btn\" disabled>\u2190 Prev<\/button>\n<div class=\"nld-dots\"><\/div>\n<p>    <button class=\"nld-btn\">Next \u2192<\/button>\n  <\/p><\/div>\n<div>\n    <span><\/span><br \/>\n    <span>Published by <a href=\"https:\/\/www.marktechpost.com\/\" target=\"_blank\">MarkTechPost<\/a> \u2014 AI &amp; ML Research, Simplified for Developers.<\/span>\n  <\/div>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>Nemotron-Labs-Diffusion unifies AR, diffusion, and self-speculation decoding in one model, with no mode-specific architectural changes.<\/li>\n<li>Joint AR-diffusion training is not a tradeoff \u2014 both objectives peak at \u03b1=0.3 and improve together.<\/li>\n<li>Self-speculation mode achieves 5.99\u00d7 TPF on the 8B model, with 2.4\u00d7 higher throughput than Qwen3-8B-Eagle3 at batch size 1 on GB200.<\/li>\n<li>Higher acceptance length is the key differentiator: NLD-LoRA averages 6.82 tokens per draft step versus 2.75 for Eagle3 and 4.24 for MTP.<\/li>\n<li>Speed-of-light analysis shows the diffusion mode has a theoretical ceiling of 7.60\u00d7 TPF \u2014 current confidence-based sampling realizes only ~3\u00d7, leaving significant room for sampler improvements.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/d1qx31qr3h6wln.cloudfront.net\/publications\/Nemotron_Diffusion_Tech_Report_v1.pdf?VersionId=db8_EMO8B.vmU26.jr7Le9pN3MqcUDNL\" target=\"_blank\" rel=\"noreferrer noopener\">Paper,<\/a><\/strong> <strong><a 
href=\"https:\/\/huggingface.co\/collections\/nvidia\/nemotron-labs-diffusion\" target=\"_blank\" rel=\"noreferrer noopener\">Model Weights<\/a><\/strong> and <strong><a href=\"https:\/\/research.nvidia.com\/publication\/2026-05_nemotron-labs-diffusion-tri-mode-language-model-unifying-autoregressive\" target=\"_blank\" rel=\"noreferrer noopener\">Technical details<\/a>.<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/20\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/\">NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6\u00d7 Tokens Per Forward Over Qwen3-8B<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>NVIDIA researchers have released Nemotron-Labs-Diffusion, a language model family that unifies three decoding modes in one architecture. 
The model supports autoregressive (AR) decoding, diffusion-based parallel decoding, and self-speculation decoding. It is available in 3B, 8B, and 14B parameter sizes. The family includes base, instruct, and vision-language variants. Sequential Decoding Limits Throughput Standard autoregressive (AR) language models generate text one token at a time, left to right. Each token depends on all previous tokens. This sequential dependency limits GPU parallelism per generation step. The result is low hardware utilization at low batch sizes \u2014 the typical setting for single-user or edge deployment. Diffusion language models (LMs) offer a different approach. Instead of generating tokens sequentially, they denoise multiple tokens in parallel per forward pass. This enables higher throughput. The tradeoff has been accuracy: diffusion LMs have consistently lagged behind AR models on benchmarks, requiring substantially more data to reach comparable performance. A key reason is that diffusion training treats all token permutations uniformly, rather than leveraging the strong left-to-right prior inherent in natural language. https:\/\/d1qx31qr3h6wln.cloudfront.net\/publications\/Nemotron_Diffusion_Tech_Report_v1.pdf?VersionId=db8_EMO8B.vmU26.jr7Le9pN3MqcUDNL What Is a Tri-Mode Language Model? Nemotron-Labs-Diffusion is trained on a joint AR-diffusion objective. At inference time, it operates in three modes depending on the deployment context. There are no mode-specific architectural modifications \u2014 the same weights serve all three modes. AR mode is standard left-to-right autoregressive decoding using causal attention. This mode is best suited for high-concurrency cloud serving. Diffusion mode denoises multiple tokens in parallel within a fixed-length block. The sequence is partitioned into contiguous blocks. Within each block, tokens attend bidirectionally. Across blocks, attention remains causal, so prior blocks can reuse their KV cache. 
A lightweight trained sampler predicts, per masked position, whether the model\u2019s top-1 prediction at the current denoising step is correct. Positions predicted as correct are committed in that step. This allows the model to commit multiple tokens per forward pass. Self-speculation mode uses the diffusion pathway to draft candidate tokens and the AR pathway to verify them, all within a single model. No auxiliary draft model or separate prediction head is required. The diffusion pathway generates a block of k candidate tokens in parallel. The AR pathway then runs a second forward pass over those candidates using causal attention, verifying the longest contiguous prefix that matches AR predictions. Each cycle produces between 1 and k+1 verified tokens. This contrasts with Multi-Token Prediction (MTP) methods such as Eagle3, which use small auxiliary draft heads attached to an AR backbone. Training: The joint training objective combines an AR next-token prediction loss and a block-wise diffusion denoising loss: \u2112(\u03b8) = \u2112_AR(\u03b8) + \u03b1 \u00b7 \u2112_diff(\u03b8). The coefficient \u03b1 is set to 0.3 across all training stages. Ablation experiments varying \u03b1 from 0.1 to 1.0 show that both AR-mode and diffusion-mode accuracy peak at \u03b1 = 0.3. No value in the range [0.1, 0.5] improves one mode at the expense of the other; the two objectives rise and fall together. Two-stage training first trains the model purely on the AR objective for 1 trillion tokens, building strong left-to-right linguistic priors. Stage 2 then introduces the joint objective for 300 billion additional tokens. In ablations, two-stage training contributed +5.74% average accuracy. Adding the AR loss contributed the single largest gain at +7.48%. Global loss averaging (treating all tokens across a batch equally rather than averaging per-sequence first) contributed +2.12% by reducing gradient variance from variable diffusion masking ratios. 
Cumulatively, the full training pipeline improved the baseline by 16.05% average accuracy. All models are initialized from pretrained Ministral3 base models, not trained from scratch. Training was performed on 256 NVIDIA H100 GPUs. Instruct models are trained via supervised fine-tuning (SFT) on 45 billion tokens on top of the base models, using the same joint AR-diffusion objective with \u03b1 = 0.3. The training and inference pipeline is released through Megatron Bridge. LoRA-Enhanced Linear Self-Speculation: The base diffusion-to-AR alignment in self-speculation can be improved with a LoRA adapter. This adapter is fine-tuned on the diffusion draft pathway to better align its output with the AR verifier. It targets only the o_proj layer of the attention module (rank 128, LoRA \u03b1 = 512, approximately 36M trainable parameters, 0.4% of the backbone). LoRA tuning improves tokens per forward (TPF) by 14.4%, 32.5%, and 27.6% at the 3B, 8B, and 14B scales respectively, with negligible accuracy change. Speed-of-Light Analysis: The research team reports a speed-of-light (SOL) analysis, a theoretical upper bound on tokens per forward pass achievable by the diffusion mode, assuming an oracle sampler that correctly identifies all positions that can be safely committed in parallel. At block length 32, the SOL acceptance rate reaches 7.60\u00d7 on average, exceeding 10\u00d7 on coding and multilingual tasks. Current confidence-based sampling achieves approximately 3\u00d7 TPF at comparable accuracy, leaving a large gap to the SOL ceiling. Linear self-speculation and the SOL bound reach similar acceptance rates (6.82\u00d7 for linear self-speculation vs. 7.60\u00d7 for SOL). However, the gap in real tokens per forward pass (TPF) is much larger: 6.02\u00d7 for SOL versus 3.41\u00d7 for linear self-speculation, a 76.5% difference. Linear self-speculation requires two forward passes per cycle (one diffusion draft, one AR verify) and accepts only a contiguous prefix. 
These two constraints cap its real TPF well below SOL, even when drafter and verifier are well aligned. (Figure source: https:\/\/d1qx31qr3h6wln.cloudfront.net\/publications\/Nemotron_Diffusion_Tech_Report_v1.pdf?VersionId=db8_EMO8B.vmU26.jr7Le9pN3MqcUDNL) Benchmark Results: The 10-task instruct evaluation (HumanEval, MBPP, LiveCodeBench-CPP, GSM8K, Math500, AIME24, AIME25, GPQA, IFEval, MMLU) gives the following. NLD-8B AR mode: 63.61% average accuracy, versus 62.75% for Qwen3-8B and 58.02% for Ministral3-8B-Instruct. NLD-8B diffusion mode: 63.18% average accuracy with 2.57\u00d7 TPF. NLD-8B LoRA-tuned linear self-speculation: 62.81% average accuracy with 5.99\u00d7 TPF. NLD-8B quadratic self-speculation: 64.04% average accuracy with 6.38\u00d7 TPF. On SPEED-Bench with SGLang on an NVIDIA GB200 GPU, linear self-speculation achieves 4\u00d7 higher throughput than Qwen3-8B and 3.3\u00d7 speedup over the NLD-8B AR mode at concurrency 1 (3.97\u00d7 with an optimized CUDA kernel). Compared to Qwen3-8B-Eagle3, linear self-speculation delivers a 2.4\u00d7, 2.3\u00d7, and 1.8\u00d7 speedup at batch size 1 on GB200, RTX Pro 6000, and DGX Spark respectively. 
Acceptance length is the underlying<\/p>","protected":false},"author":2,"featured_media":91689,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-91688","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6\u00d7 Tokens Per Forward Over Qwen3-8B - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, 
max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/fr\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/\" \/>\n<meta property=\"og:locale\" content=\"fr_FR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6\u00d7 Tokens Per Forward Over Qwen3-8B - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/fr\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-20T16:48:48+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u00c9crit par\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Dur\u00e9e de lecture estim\u00e9e\" \/>\n\t<meta name=\"twitter:data2\" content=\"11 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/\"},\"author\":{\"name\":\"admin 
NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6\u00d7 Tokens Per Forward Over Qwen3-8B\",\"datePublished\":\"2026-05-20T16:48:48+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/\"},\"wordCount\":1942,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-5e7R8K.png\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"fr-FR\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/\",\"url\":\"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/\",\"name\":\"NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6\u00d7 Tokens Per Forward Over Qwen3-8B - 
YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-5e7R8K.png\",\"datePublished\":\"2026-05-20T16:48:48+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/#breadcrumb\"},\"inLanguage\":\"fr-FR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-5e7R8K.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-5e7R8K.png\",\"width\":1338,\"height\":568},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"NVIDIA AI Releases 
Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6\u00d7 Tokens Per Forward Over Qwen3-8B\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"fr-FR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/fr\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6\u00d7 Tokens Per Forward Over Qwen3-8B - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/fr\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/","og_locale":"fr_FR","og_type":"article","og_title":"NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6\u00d7 Tokens Per Forward Over Qwen3-8B - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/fr\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-05-20T16:48:48+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"\u00c9crit par":"admin NU","Dur\u00e9e de lecture estim\u00e9e":"11 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6\u00d7 Tokens Per Forward Over 
Qwen3-8B","datePublished":"2026-05-20T16:48:48+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/"},"wordCount":1942,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-5e7R8K.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"fr-FR","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/","url":"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/","name":"NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6\u00d7 Tokens Per Forward Over Qwen3-8B - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-5e7R8K.png","datePublished":"2026-05-20T16:48:48+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/#breadcrumb"},"inLanguage":"fr-FR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/"]}]},{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/#primaryimage","url":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-5e7R8K.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-5e7R8K.png","width":1338,"height":568},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/nvidia-ai-releases-nemotron-labs-diffusion-a-tri-mode-language-model-with-6x-tokens-per-forward-over-qwen3-8b\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6\u00d7 Tokens Per Forward Over 
Qwen3-8B"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"fr-FR"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin 
NU"},"url":"https:\/\/youzum.net\/fr\/members\/adminnu\/"}]}},"rttpg_featured_image_url":{"full":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-5e7R8K.png",1338,568,false],"landscape":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-5e7R8K.png",1338,568,false],"portraits":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-5e7R8K.png",1338,568,false],"thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-5e7R8K-150x150.png",150,150,true],"medium":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-5e7R8K-300x127.png",300,127,true],"large":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-5e7R8K-1024x435.png",1024,435,true],"1536x1536":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-5e7R8K.png",1338,568,false],"2048x2048":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-5e7R8K.png",1338,568,false],"trp-custom-language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-5e7R8K-18x8.png",18,8,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-5e7R8K-300x300.png",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-5e7R8K-600x255.png",600,255,true],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-20-at-3.28.04-AM-1-5e7R8K-100x100.png",100,100,true]},"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/fr\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/fr\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a 
href=\"https:\/\/youzum.net\/fr\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/fr\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/fr\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"NVIDIA researchers have released Nemotron-Labs-Diffusion, a language model family that unifies three decoding modes in one architecture. The model supports autoregressive (AR) decoding, diffusion-based parallel decoding, and self-speculation decoding. It is available in 3B, 8B, and 14B parameter sizes. The family includes base, instruct, and vision-language variants. Sequential Decoding Limits Throughput Standard autoregressive (AR) language\u2026","_links":{"self":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/posts\/91688","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/comments?post=91688"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/posts\/91688\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/media\/91689"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/media?parent=91688"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/categories?post=91688"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/tags?post=91688"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}