{"id":91207,"date":"2026-05-18T16:41:54","date_gmt":"2026-05-18T16:41:54","guid":{"rendered":"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/"},"modified":"2026-05-18T16:41:54","modified_gmt":"2026-05-18T16:41:54","slug":"nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon","status":"publish","type":"post","link":"https:\/\/youzum.net\/ja\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/","title":{"rendered":"NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon"},"content":{"rendered":"<p>Pretraining frontier-scale LLMs in FP8 is now standard practice, but moving to 4-bit floating point has remained an open research problem because narrower formats compress dynamic range and amplify quantization error at long token horizons. New research from NVIDIA describes a pretraining methodology built around <strong>NVFP4<\/strong>, a 4-bit microscaling format supported natively by Blackwell Tensor Cores, and validates it by pretraining a 12-billion-parameter hybrid Mamba-Transformer on <strong>10 trillion tokens<\/strong>. The research team states that this is the longest publicly documented training run in 4-bit precision to date. The resulting model attains 62.58% on MMLU-Pro 5-shot versus 62.62% for the FP8 baseline, and NVFP4 training is supported in NVIDIA\u2019s Transformer Engine.<\/p>\n<h2 class=\"wp-block-heading\"><strong>What NVFP4 Actually Is<\/strong><\/h2>\n<p>To understand why NVFP4 is important, it helps to revisit how microscaling formats work. In a microscaling (MX) format, a contiguous block of low-precision elements shares a single scale factor, which is used to map the block back into a wider numerical range during the matrix multiply. 
MXFP4 uses 32-element blocks where each element is stored as E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit), encoding only the values \u00b10, \u00b10.5, \u00b11, \u00b11.5, \u00b12, \u00b13, \u00b14, and \u00b16. Block scale factors are stored in UE8M0, which restricts them to powers of two.<\/p>\n<p>NVFP4 changes three things. First, the block size drops from 32 to 16 elements, narrowing the dynamic range each scale has to cover. Second, block scale factors are stored in E4M3 rather than UE8M0, trading exponent range for mantissa precision so the per-block amax (absolute maximum) can be mapped much closer to the maximum representable FP4 value. Third, NVFP4 adds a second scaling level: an FP32 per-tensor scale that remaps values so the E4M3 block scales themselves stay in range. The result is that at least 6.25% of values in each block (one in 16, the per-block amax) are represented at near-FP8 precision, while the remainder sit in FP4.<\/p>\n<p>On NVIDIA Blackwell, FP4 GEMMs run at 4\u00d7 BF16 throughput on GB200 and 6\u00d7 on GB300, which translates to roughly 2\u00d7 and 3\u00d7 speedups over FP8. 
Operand memory footprint is approximately halved compared to FP8.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1430\" height=\"924\" data-attachment-id=\"79936\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/18\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/screenshot-2026-05-18-at-1-35-42-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1.png\" data-orig-size=\"1430,924\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-05-18 at 1.35.42\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-1024x662.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1.png\" alt=\"\" class=\"wp-image-79936\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2509.25149<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>What\u2019s Quantized \u2014 and What Isn\u2019t<\/strong><\/h2>\n<p>Only the three GEMMs inside linear (fully-connected) layers run in NVFP4: the forward pass (Fprop), the activation gradient (Dgrad), and the weight gradient (Wgrad). Embeddings, the output projection head, normalization layers, non-linearities, and all attention components (softmax and the query-key and attention score-value batched GEMMs) stay in BF16 or FP32. Model weights, weight gradients used for accumulation across microbatches and data-parallel replicas, and optimizer states are kept in FP32. 
Tensor-parallel reductions run in BF16.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The Four-Part Training Methodology<\/strong><\/h2>\n<p>Training diverges early when every linear-layer GEMM is quantized to NVFP4 with default settings (1\u00d716 block scaling everywhere, round-to-nearest-even on every tensor, no transforms). NVIDIA\u2019s approach stabilizes it with four components, and ablation studies on the 12B model show each is necessary.<\/p>\n<p><strong>Selective high precision:<\/strong> Linear layers in the first two and the final eight of the 62 blocks (about 16% of all linear layers) are kept in BF16. Ablations indicated that the final blocks are the sensitive ones because they require more dynamic range than FP4 provides; keeping only the final four blocks in BF16 was also enough for stable convergence.<\/p>\n<p><strong>Random Hadamard Transforms (RHT):<\/strong> Outliers are spread into an approximately Gaussian distribution by multiplying the GEMM input tiles with a 16\u00d716 Hadamard matrix combined with a random \u00b11 sign vector. Because the transform is orthogonal and is applied to both operands, it cancels inside the dot product, so no correction is needed in the GEMM. The d=16 size was chosen empirically: d=4 hurt convergence, and d=128 gave similar results. RHT is applied only to the inputs of the weight-gradient (Wgrad) GEMM, and a single random sign vector is shared across all linear layers. Randomization itself made no difference at the 1.2B scale but measurably improved the 12B run.<\/p>\n<p><strong>Two-dimensional (2D) block scaling for weights:<\/strong> Standard NVFP4 scales 1\u00d716 blocks along the dot-product dimension. Because the backward pass transposes the weight tensor, the forward and backward passes end up with different quantized weights, so the backward pass no longer differentiates the function the forward pass computed, which violates the chain rule. NVIDIA\u2019s fix is to scale weights in 16\u00d716 blocks so the same quantized representation is used in both passes. 
Activations and gradients keep 1\u00d716 scaling, since they are less sensitive to this inconsistency.<\/p>\n<p><strong>Stochastic rounding on gradients:<\/strong> Round-to-nearest-even introduces systematic bias when applied to gradient tensors. Stochastic rounding instead rounds up or down with a probability that depends on how close the value lies to each of the two nearest representable values, which removes that bias in expectation. The research team explicitly notes in the <a href=\"https:\/\/arxiv.org\/pdf\/2509.25149\" target=\"_blank\" rel=\"noreferrer noopener\">paper<\/a> that stochastic rounding is <strong>detrimental<\/strong> when applied to forward-pass tensors, so it is restricted to gradients.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Results on the 12B Hybrid Mamba-Transformer<\/strong><\/h2>\n<p>The 12B model uses the Nemotron-Nano-12B-v2-Base architecture \u2014 62 blocks (6 Self-Attention, 28 FFN, 28 Mamba-2), hidden dimension 5120, FFN dimension 20480 \u2014 trained with a Warmup-Stable-Decay schedule (constant LR through 80% of training, decay over the final 20%), batch size 736, sequence length 8192. The FP8 reference baseline follows the DeepSeek-V3 methodology: E4M3 elements, 128\u00d7128 weight blocks, 1\u00d7128 activation and gradient blocks, with the first block and last two blocks kept in BF16.<\/p>\n<p>The relative gap between NVFP4 and FP8 validation loss stays below 1% during the stable phase and widens to slightly above 1.5% during decay. Downstream accuracy is comparable across most benchmarks (NVFP4 vs FP8): MMLU 76.57% vs 77.36%, GSM8K CoT 92.27% vs 89.08%, MATH 81.48% vs 83.32%, AGIEval English CoT 70.31% vs 67.01%. Coding shows the largest gap \u2014 HumanEval+ 57.43% vs 59.93%, MBPP+ 55.91% vs 59.11% \u2014 which the research team attributes partly to noisy final-checkpoint evaluation. 
The research team also documents a precision-switching technique: transitioning the forward pass from NVFP4 to BF16 starting at 8.2T tokens (the final 18% of the 10T-token schedule) reduced relative loss error from 1.5% to 0.5%.<\/p>\n<h2 class=\"wp-block-heading\"><strong>NVFP4 vs MXFP4<\/strong><\/h2>\n<p>On a separate 8B hybrid Mamba-Transformer trained on 1T tokens, NVFP4 reached a relative loss error of about 1.5% versus BF16, while MXFP4 stayed near 2.5%. MXFP4 needed 1.36T tokens to match the loss NVFP4 reached at 1T tokens, a 36% token overhead. The research team attributes the difference to NVFP4\u2019s smaller block size and E4M3 scales, which preserve more of the FP4 dynamic range than MXFP4\u2019s power-of-two UE8M0 scales (which, in the worst case, can waste up to one binade of dynamic range and lose the \u00b14 and \u00b16 samples).<\/p>\n<h2 class=\"wp-block-heading\"><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<div>\n<p>  <!-- TOP BAR --><\/p>\n<div class=\"nv-topbar\">\n<div class=\"nv-topbar-left\">\n<div class=\"nv-dot-row\"><span><\/span><span><\/span><span><\/span><\/div>\n<p>      <span class=\"nv-topbar-title\">NVFP4 Pretraining Guide<\/span>\n    <\/p><\/div>\n<p>    <span class=\"nv-topbar-counter\">01 \/ 10<\/span>\n  <\/p><\/div>\n<p>  <!-- VIEWPORT --><\/p>\n<div class=\"nv-viewport\">\n<div class=\"nv-track\">\n<p>      <!-- SLIDE 1 \u2014 COVER --><\/p>\n<section class=\"nv-slide\">\n<div class=\"nv-cover\">\n          <span class=\"nv-cover-badge\">\u25cf NVIDIA Technical Report<\/span>\n<h1>Pretraining Large Language Models with NVFP4<\/h1>\n<p>A 4-bit floating-point training recipe validated on a 12-billion-parameter hybrid Mamba-Transformer trained on 10 trillion tokens \u2014 the longest publicly documented 4-bit pretraining run to date.<\/p>\n<div class=\"nv-cover-stats\">\n<div class=\"nv-stat\">\n<div class=\"nv-stat-value\">12B<\/div>\n<div class=\"nv-stat-label\">Parameters<\/div>\n<\/div>\n<div class=\"nv-stat\">\n<div 
class=\"nv-stat-value\">10T<\/div>\n<div class=\"nv-stat-label\">Training Tokens<\/div>\n<\/div>\n<div class=\"nv-stat\">\n<div class=\"nv-stat-value\">62.58%<\/div>\n<div class=\"nv-stat-label\">MMLU-Pro (vs 62.62 FP8)<\/div>\n<\/div>\n<\/div>\n<div class=\"nv-cover-meta\">SOURCE \u2014 <span>arXiv:2509.25149v2<\/span> \u00b7 NVIDIA \u00b7 Available in Transformer Engine<\/div>\n<\/div>\n<\/section>\n<p>      <!-- SLIDE 2 \u2014 WHY 4-BIT --><\/p>\n<section class=\"nv-slide\">\n<div class=\"nv-kicker\">01 \u2014 Context<\/div>\n<h2>Why move from FP8 to 4-bit pretraining<\/h2>\n<p>FP8 training is now standard for frontier LLM pretraining. Moving to FP4 promises a <strong>2\u00d7 to 3\u00d7 boost in arithmetic throughput<\/strong> over FP8 and approximately half the operand memory \u2014 but narrower formats compress dynamic range and amplify quantization error at long token horizons.<\/p>\n<p>The challenge is to preserve training stability and downstream accuracy across multi-trillion-token runs. 
This report presents a recipe that does both, using <strong class=\"nv-accent\">NVFP4<\/strong>, a 4-bit microscaling format with native support on NVIDIA Blackwell Tensor Cores.<\/p>\n<div class=\"nv-2col\">\n<div class=\"nv-card\">\n<div class=\"nv-card-title\">GB200 Throughput<\/div>\n<ul>\n<li>BF16 baseline <span>1\u00d7<\/span><\/li>\n<li>FP8 <span>2\u00d7<\/span><\/li>\n<li>FP4 (NVFP4) <span>4\u00d7<\/span><\/li>\n<\/ul><\/div>\n<div class=\"nv-card\">\n<div class=\"nv-card-title\">GB300 Throughput<\/div>\n<ul>\n<li>BF16 baseline <span>1\u00d7<\/span><\/li>\n<li>FP8 <span>2\u00d7<\/span><\/li>\n<li>FP4 (NVFP4) <span>6\u00d7<\/span><\/li>\n<\/ul><\/div>\n<\/div>\n<\/section>\n<p>      <!-- SLIDE 3 \u2014 NVFP4 FORMAT --><\/p>\n<section class=\"nv-slide\">\n<div class=\"nv-kicker\">02 \u2014 The Format<\/div>\n<h2>What NVFP4 actually stores<\/h2>\n<p>Each element is encoded as <strong>E2M1<\/strong> \u2014 1 sign, 2 exponent, 1 mantissa bit \u2014 representing one of: \u00b10, \u00b10.5, \u00b11, \u00b11.5, \u00b12, \u00b13, \u00b14, \u00b16.<\/p>\n<p>Every block of <strong>16 contiguous elements<\/strong> shares a single <strong>E4M3<\/strong> scale factor. A second <strong>FP32 per-tensor scale<\/strong> sits on top to keep the E4M3 block scales in range. 
The result: at least 6.25% of values in each block (the per-block amax) sit at near-FP8 precision.<\/p>\n<div class=\"nv-block-diagram\">\n<div class=\"nv-block-row\">\n<div class=\"nv-block-scale\">FP8 scale<\/div>\n<div class=\"nv-block-cell nv-amax\">6<\/div>\n<div class=\"nv-block-cell\">0.5<\/div>\n<div class=\"nv-block-cell\">-2<\/div>\n<div class=\"nv-block-cell\">-4<\/div>\n<div class=\"nv-block-cell\">1<\/div>\n<div class=\"nv-block-cell\">0<\/div>\n<div class=\"nv-block-cell\">3<\/div>\n<div class=\"nv-block-cell\">-1<\/div>\n<div class=\"nv-block-cell\">2<\/div>\n<div class=\"nv-block-cell\">4<\/div>\n<div class=\"nv-block-cell\">-3<\/div>\n<div class=\"nv-block-cell\">0.5<\/div>\n<div class=\"nv-block-cell\">-1<\/div>\n<div class=\"nv-block-cell\">2<\/div>\n<div class=\"nv-block-cell\">0<\/div>\n<div class=\"nv-block-cell\">4<\/div>\n<\/div>\n<div class=\"nv-block-legend\">\n            <span><i class=\"nv-legend-dot\"><\/i> E4M3 block scale<\/span><br \/>\n            <span><i class=\"nv-legend-dot\"><\/i> Block amax (mapped to FP4 max)<\/span><br \/>\n            <span><i class=\"nv-legend-dot\"><\/i> 16 FP4 elements<\/span>\n          <\/div>\n<\/div>\n<\/section>\n<p>      <!-- SLIDE 4 \u2014 NVFP4 vs MXFP4 FORMAT --><\/p>\n<section class=\"nv-slide\">\n<div class=\"nv-kicker\">03 \u2014 Format Comparison<\/div>\n<h2>How NVFP4 differs from MXFP4<\/h2>\n<p>NVFP4 makes three design changes to the microscaling approach that meaningfully improve representation fidelity at 4 bits.<\/p>\n<div class=\"nv-2col\">\n<div class=\"nv-card\">\n<div class=\"nv-card-title\">MXFP4<\/div>\n<ul>\n<li>Block size <span>32<\/span><\/li>\n<li>Element <span>E2M1<\/span><\/li>\n<li>Block scale <span>UE8M0<\/span><\/li>\n<li>Scale type <span>Power of 2<\/span><\/li>\n<li>Tensor scale <span>None<\/span><\/li>\n<\/ul><\/div>\n<div class=\"nv-card\">\n<div class=\"nv-card-title\">NVFP4<\/div>\n<ul>\n<li>Block size <span>16<\/span><\/li>\n<li>Element 
<span>E2M1<\/span><\/li>\n<li>Block scale <span>E4M3<\/span><\/li>\n<li>Scale type <span>Fractional<\/span><\/li>\n<li>Tensor scale <span>FP32<\/span><\/li>\n<\/ul><\/div>\n<\/div>\n<div class=\"nv-callout\">\n<p>MXFP4\u2019s power-of-two UE8M0 scales can waste up to one binade of dynamic range and lose the <strong>\u00b14 and \u00b16<\/strong> FP4 samples after scale rounding. NVFP4\u2019s E4M3 scales map the block amax much closer to the FP4 maximum.<\/p>\n<\/div>\n<\/section>\n<p>      <!-- SLIDE 5 \u2014 WHAT'S QUANTIZED --><\/p>\n<section class=\"nv-slide\">\n<div class=\"nv-kicker\">04 \u2014 Scope<\/div>\n<h2>What runs in NVFP4 \u2014 and what doesn\u2019t<\/h2>\n<p>Only the <strong>three GEMMs inside linear layers<\/strong> \u2014 Fprop, Dgrad, and Wgrad \u2014 actually run in NVFP4. Everything else stays in higher precision.<\/p>\n<div class=\"nv-2col\">\n<div class=\"nv-card\">\n<div class=\"nv-card-title\">In NVFP4<\/div>\n<ul>\n<li>Linear Fprop GEMM<\/li>\n<li>Linear Dgrad GEMM<\/li>\n<li>Linear Wgrad GEMM<\/li>\n<\/ul><\/div>\n<div class=\"nv-card\">\n<div class=\"nv-card-title\">In BF16 \/ FP32<\/div>\n<ul>\n<li>Embeddings \u00b7 Output head<\/li>\n<li>Normalization layers<\/li>\n<li>Non-linearities<\/li>\n<li>Attention (softmax, QK, score-V)<\/li>\n<li>Master weights \u00b7 Optimizer states<\/li>\n<li>TP reductions (BF16)<\/li>\n<\/ul><\/div>\n<\/div>\n<div class=\"nv-callout\">\n<p>The \u201cFP4 training\u201d label applies to the most compute-heavy GEMMs, not to the full forward and backward graph.<\/p>\n<\/div>\n<\/section>\n<p>      <!-- SLIDE 6 \u2014 THE 4-PART RECIPE --><\/p>\n<section class=\"nv-slide\">\n<div class=\"nv-kicker\">05 \u2014 The Recipe<\/div>\n<h2>Four techniques required for convergence<\/h2>\n<p>Quantizing every linear-layer GEMM to NVFP4 with default settings \u2014 1\u00d716 block scaling everywhere, round-to-nearest-even, no transforms \u2014 <strong>diverges early in training<\/strong>. 
The recipe stabilizes it with four components. Ablations show each is necessary.<\/p>\n<div class=\"nv-recipe\">\n<div class=\"nv-recipe-item\">\n<div class=\"nv-recipe-num\">1<\/div>\n<div>\n<div class=\"nv-recipe-title\">Selective High Precision<\/div>\n<div class=\"nv-recipe-body\">Keep ~16% of linear layers in BF16, concentrated in the final blocks. For the 12B model: first 2 + final 8 of 62 blocks.<\/div>\n<\/div>\n<\/div>\n<div class=\"nv-recipe-item\">\n<div class=\"nv-recipe-num\">2<\/div>\n<div>\n<div class=\"nv-recipe-title\">Random Hadamard Transforms (RHT)<\/div>\n<div class=\"nv-recipe-body\">16\u00d716 Hadamard matrix + random \u00b11 sign vector, applied only to Wgrad inputs. d=4 was worse; d=128 was similar to d=16.<\/div>\n<\/div>\n<\/div>\n<div class=\"nv-recipe-item\">\n<div class=\"nv-recipe-num\">3<\/div>\n<div>\n<div class=\"nv-recipe-title\">2D Block Scaling for Weights<\/div>\n<div class=\"nv-recipe-body\">16\u00d716 block scales for weights so forward and backward see the same quantized representation. Activations and gradients keep 1\u00d716 scaling.<\/div>\n<\/div>\n<\/div>\n<div class=\"nv-recipe-item\">\n<div class=\"nv-recipe-num\">4<\/div>\n<div>\n<div class=\"nv-recipe-title\">Stochastic Rounding on Gradients<\/div>\n<div class=\"nv-recipe-body\">Probabilistic rounding removes systematic gradient bias. 
<strong>Detrimental<\/strong> on forward-pass tensors \u2014 restrict to gradients only.<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/section>\n<p>      <!-- SLIDE 7 \u2014 MODEL &amp; TRAINING SETUP --><\/p>\n<section class=\"nv-slide\">\n<div class=\"nv-kicker\">06 \u2014 Training Setup<\/div>\n<h2>The 12B hybrid Mamba-Transformer<\/h2>\n<p>The model uses the <strong>Nemotron-Nano-12B-v2-Base architecture<\/strong>: 62 blocks consisting of 6 Self-Attention, 28 FFN, and 28 Mamba-2 blocks.<\/p>\n<div class=\"nv-2col\">\n<div class=\"nv-card\">\n<div class=\"nv-card-title\">Architecture<\/div>\n<ul>\n<li>Blocks <span>62<\/span><\/li>\n<li>Hidden dim <span>5120<\/span><\/li>\n<li>FFN dim <span>20480<\/span><\/li>\n<li>Q heads <span>40<\/span><\/li>\n<li>KV heads <span>8<\/span><\/li>\n<li>Mamba state dim <span>128<\/span><\/li>\n<\/ul><\/div>\n<div class=\"nv-card\">\n<div class=\"nv-card-title\">Training<\/div>\n<ul>\n<li>Tokens <span>10T<\/span><\/li>\n<li>Batch size <span>736<\/span><\/li>\n<li>Sequence length <span>8192<\/span><\/li>\n<li>Schedule <span>WSD 80\/20<\/span><\/li>\n<li>Peak LR <span>4.5e-4<\/span><\/li>\n<li>Weight decay <span>0.1<\/span><\/li>\n<\/ul><\/div>\n<\/div>\n<div class=\"nv-callout\">\n<p>FP8 reference baseline follows DeepSeek-V3: E4M3 elements, 128\u00d7128 weight blocks, 1\u00d7128 activation\/gradient blocks, with the first block and last two in BF16.<\/p>\n<\/div>\n<\/section>\n<p>      <!-- SLIDE 8 \u2014 RESULTS --><\/p>\n<section class=\"nv-slide\">\n<div class=\"nv-kicker\">07 \u2014 Downstream Results<\/div>\n<h2>NVFP4 matches FP8 across most benchmarks<\/h2>\n<p>Validation loss stays within 1% of FP8 during the stable phase, widening to slightly above 1.5% during decay. 
Downstream accuracies tracked below.<\/p>\n<table class=\"nv-table\">\n<thead>\n<tr>\n<th>Benchmark<\/th>\n<th>FP8<\/th>\n<th>NVFP4<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>MMLU-Pro 5-shot<\/td>\n<td class=\"nv-num\">62.62<\/td>\n<td class=\"nv-num\">62.58<\/td>\n<\/tr>\n<tr>\n<td>MMLU<\/td>\n<td class=\"nv-num\">77.36<\/td>\n<td class=\"nv-num\">76.57<\/td>\n<\/tr>\n<tr>\n<td>AGIEval English CoT<\/td>\n<td class=\"nv-num\">67.01<\/td>\n<td class=\"nv-num nv-best\">70.31<\/td>\n<\/tr>\n<tr>\n<td>GSM8K CoT<\/td>\n<td class=\"nv-num\">89.08<\/td>\n<td class=\"nv-num nv-best\">92.27<\/td>\n<\/tr>\n<tr>\n<td>MATH<\/td>\n<td class=\"nv-num\">83.32<\/td>\n<td class=\"nv-num\">81.48<\/td>\n<\/tr>\n<tr>\n<td>MGSM<\/td>\n<td class=\"nv-num\">81.87<\/td>\n<td class=\"nv-num nv-best\">85.53<\/td>\n<\/tr>\n<tr>\n<td>HumanEval+<\/td>\n<td class=\"nv-num\">59.93<\/td>\n<td class=\"nv-num\">57.43<\/td>\n<\/tr>\n<tr>\n<td>MBPP+<\/td>\n<td class=\"nv-num\">59.11<\/td>\n<td class=\"nv-num\">55.91<\/td>\n<\/tr>\n<tr>\n<td>ARC Challenge<\/td>\n<td class=\"nv-num\">91.81<\/td>\n<td class=\"nv-num\">91.81<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div class=\"nv-callout\">\n<p>Coding shows the widest gap. 
Switching the forward pass to BF16 at 8.2T tokens (last 18%) reduces relative loss error from 1.5% to 0.5%.<\/p>\n<\/div>\n<\/section>\n<p>      <!-- SLIDE 9 \u2014 NVFP4 vs MXFP4 EFFICIENCY --><\/p>\n<section class=\"nv-slide\">\n<div class=\"nv-kicker\">08 \u2014 Format Efficiency<\/div>\n<h2>NVFP4 vs MXFP4 on the same 8B model<\/h2>\n<p>On an 8B hybrid Mamba-Transformer trained on the same data, NVFP4 converged to a meaningfully better loss than MXFP4 in the same token budget.<\/p>\n<div class=\"nv-2col\">\n<div class=\"nv-card\">\n<div class=\"nv-card-title\">Loss vs BF16 @ 1T tokens<\/div>\n<ul>\n<li>NVFP4 <span>~1.5% gap<\/span><\/li>\n<li>MXFP4 <span>~2.5% gap<\/span><\/li>\n<\/ul><\/div>\n<div class=\"nv-card\">\n<div class=\"nv-card-title\">Tokens to match NVFP4 loss<\/div>\n<ul>\n<li>NVFP4 <span>1.00T<\/span><\/li>\n<li>MXFP4 <span>1.36T (+36%)<\/span><\/li>\n<\/ul><\/div>\n<\/div>\n<div class=\"nv-callout\">\n<p>The 36% token overhead translates directly into longer training time. Smaller block size and E4M3 scales preserve more of the FP4 dynamic range than MXFP4\u2019s UE8M0 design.<\/p>\n<\/div>\n<\/section>\n<p>      <!-- SLIDE 10 \u2014 TAKEAWAYS --><\/p>\n<section class=\"nv-slide\">\n<div class=\"nv-kicker\">09 \u2014 Practitioner Takeaways<\/div>\n<h2>What this unlocks for AI engineers<\/h2>\n<p>4-bit pretraining at multi-trillion-token scale is now reproducible with a known recipe, on Blackwell hardware, via Transformer Engine.<\/p>\n<div class=\"nv-recipe\">\n<div class=\"nv-recipe-item\">\n<div class=\"nv-recipe-num\">\u2713<\/div>\n<div>\n<div class=\"nv-recipe-title\">Throughput &amp; memory<\/div>\n<div class=\"nv-recipe-body\">FP4 GEMMs run 2\u00d7 faster than FP8 on GB200 and 3\u00d7 on GB300. 
Operand memory roughly halved.<\/div>\n<\/div>\n<\/div>\n<div class=\"nv-recipe-item\">\n<div class=\"nv-recipe-num\">\u2713<\/div>\n<div>\n<div class=\"nv-recipe-title\">Reproducible recipe<\/div>\n<div class=\"nv-recipe-body\">Selective BF16 layers + 16\u00d716 RHT on Wgrad + 2D weight scaling + stochastic rounding on gradients.<\/div>\n<\/div>\n<\/div>\n<div class=\"nv-recipe-item\">\n<div class=\"nv-recipe-num\">\u2192<\/div>\n<div>\n<div class=\"nv-recipe-title\">Open questions<\/div>\n<div class=\"nv-recipe-body\">Quantizing all linear layers, extending NVFP4 to attention and communication paths, scaling laws for FP4 across parameter counts and horizons.<\/div>\n<\/div>\n<\/div>\n<div class=\"nv-recipe-item\">\n<div class=\"nv-recipe-num\">\u2318<\/div>\n<div>\n<div class=\"nv-recipe-title\">Availability<\/div>\n<div class=\"nv-recipe-body\">NVFP4 training is supported in NVIDIA Transformer Engine. Source: arXiv:2509.25149v2.<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/section><\/div>\n<\/div>\n<p>  <!-- CONTROLS --><\/p>\n<div class=\"nv-controls\">\n    <button class=\"nv-btn\">\u2190 Prev<\/button>\n<div class=\"nv-dots\"><\/div>\n<div class=\"nv-progress\">\n<div class=\"nv-progress-fill\"><\/div>\n<\/div>\n<p>    <button class=\"nv-btn\">Next \u2192<\/button>\n  <\/p><\/div>\n<p>  <!-- TAGLINE --><\/p>\n<div class=\"nv-tagline\">\n    <strong>MARKTECHPOST<\/strong> \u00a0\u00b7\u00a0 AI research, deeply explained.\n  <\/div>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>NVIDIA&#8217;s research team pretrained a 12B hybrid Mamba-Transformer on 10T tokens in NVFP4 \u2014 the longest publicly documented 4-bit training run \u2014 matching FP8 on MMLU-Pro at 62.58% vs 62.62%.<\/li>\n<li>NVFP4 uses 16-element blocks with E4M3 scales plus an FP32 per-tensor scale, preserving the \u00b14 and \u00b16 samples that MXFP4&#8217;s 32-element UE8M0 design can lose to power-of-two rounding.<\/li>\n<li>Four 
techniques are required for convergence \u2014 none are optional: ~16% of linear layers in BF16, 16\u00d716 Random Hadamard Transforms on Wgrad inputs, 2D 16\u00d716 weight scaling, and stochastic rounding on gradients only.<\/li>\n<li>Only linear-layer GEMMs run in NVFP4 \u2014 attention, embeddings, normalization, non-linearities, master weights, gradients, and optimizer states all stay in BF16 or FP32.<\/li>\n<li>On an 8B model, MXFP4 needed 1.36T tokens (36% more) to match NVFP4&#8217;s loss at 1T tokens, while FP4 GEMMs deliver 2\u00d7 FP8 throughput on GB200 and 3\u00d7 on GB300.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2509.25149\" target=\"_blank\" rel=\"noreferrer noopener\">Paper here<\/a>.<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/18\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/\">NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Pretraining frontier-scale LLMs in FP8 is now standard practice, but moving to 4-bit floating point has remained an open research problem because narrower formats compress dynamic range and amplify quantization error at long token horizons. New research from NVIDIA describes a pretraining methodology built around NVFP4, a 4-bit microscaling format supported natively by Blackwell Tensor Cores, and validates it by pretraining a 12-billion-parameter hybrid Mamba-Transformer on 10 trillion tokens. The research team states that this is the longest publicly documented training run in 4-bit precision to date. The resulting model attains 62.58% on MMLU-Pro 5-shot versus 62.62% for the FP8 baseline, and NVFP4 training is supported in NVIDIA\u2019s Transformer Engine. What NVFP4 Actually Is To understand why NVFP4 is important, it helps to revisit how microscaling formats work. In a microscaling (MX) format, a contiguous block of low-precision elements shares a single scale factor, which is used to map the block back into a wider numerical range during the matrix multiply. 
MXFP4 uses 32-element blocks where each element is stored as E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit), encoding only the values \u00b10, \u00b10.5, \u00b11, \u00b11.5, \u00b12, \u00b13, \u00b14, and \u00b16. Block scale factors are stored in UE8M0, which restricts them to powers of two. NVFP4 changes three things. First, the block size drops from 32 to 16 elements, narrowing the dynamic range each scale has to cover. Second, block scale factors are stored in E4M3 rather than UE8M0, trading exponent range for mantissa precision so the per-block amax (absolute maximum) can be mapped much closer to the maximum representable FP4 value. Third, NVFP4 adds a second scaling level: an FP32 per-tensor scale that remaps values so the E4M3 block scales themselves stay in range. The result is that at least 6.25% of values in each block (one in 16, the per-block amax) are represented at near-FP8 precision, while the remainder sit in FP4. On NVIDIA Blackwell, FP4 GEMMs run at 4\u00d7 BF16 throughput on GB200 and 6\u00d7 on GB300, which translates to roughly 2\u00d7 and 3\u00d7 speedups over FP8. Operand memory footprint is approximately halved compared to FP8. https:\/\/arxiv.org\/pdf\/2509.25149 What\u2019s Quantized \u2014 and What Isn\u2019t Only the three GEMMs inside linear (fully-connected) layers run in NVFP4: the forward pass (Fprop), the activation gradient (Dgrad), and the weight gradient (Wgrad). Embeddings, the output projection head, normalization layers, non-linearities, and all attention components (softmax and the query-key and attention score-value batched GEMMs) stay in BF16 or FP32. Model weights, weight gradients used for accumulation across microbatches and data-parallel replicas, and optimizer states are kept in FP32. Tensor-parallel reductions run in BF16. The Four-Part Training Methodology Training diverges early when every linear-layer GEMM is quantized to NVFP4 with default settings (1\u00d716 block scaling everywhere, round-to-nearest-even on every tensor, no transforms). 
NVIDIA\u2019s approach stabilizes it with four components, and ablation studies on the 12B model show each is necessary. Selective high precision: Linear layers in the first two and the final eight of the 62 blocks (about 16% of all linear layers) are kept in BF16. Ablations indicated that the final blocks are the sensitive ones because they require more dynamic range than FP4 provides; keeping only the final four blocks in BF16 was also enough for stable convergence. Random Hadamard Transforms (RHT): Outliers in weight gradients are spread into an approximately Gaussian distribution by multiplying the input tiles with a 16\u00d716 Hadamard matrix combined with a random \u00b11 sign vector. Because the orthogonal transforms cancel inside the dot-product, no math correction is needed in the GEMM. The d=16 size was chosen empirically: d=4 hurt convergence, d=128 gave similar results. RHT is applied only to the inputs of the weight-gradient (Wgrad) GEMM, and a single random sign vector is shared across all linear layers. Randomization itself was a no-op at the 1.2B scale but measurably improved the 12B run. Two-dimensional (2D) block scaling for weights: Standard NVFP4 scales 1\u00d716 blocks along the dot-product dimension. Because the backward pass transposes the weight tensor, the forward and backward passes end up with different quantized weights, breaking the chain rule. NVIDIA\u2019s fix is to scale weights in 16\u00d716 blocks so the same quantized representation is used in both passes. Activations and gradients keep 1\u00d716 scaling, since they are less sensitive to this inconsistency. Stochastic rounding on gradients: Round-to-nearest-even introduces systematic bias when applied to gradient tensors. Stochastic rounding rounds probabilistically based on distance to the two nearest representable values, removing that bias. 
The research team explicitly notes in the paper that stochastic rounding is detrimental when applied to forward-pass tensors, so it is restricted to gradients.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Results on the 12B Hybrid Mamba-Transformer<\/strong><\/h2>\n<p>The 12B model uses the Nemotron-Nano-12B-v2-Base architecture: 62 blocks (6 Self-Attention, 28 FFN, 28 Mamba-2), hidden dimension 5120, and FFN dimension 20480. It was trained with a Warmup-Stable-Decay schedule (constant LR through 80% of training, decay over the final 20%), batch size 736, and sequence length 8192. The FP8 reference baseline follows the DeepSeek-V3 methodology: E4M3 elements, 128\u00d7128 weight blocks, 1\u00d7128 activation and gradient blocks, with the first block and last two blocks kept in BF16.<\/p>\n<p>NVFP4 validation loss stays within 1% of the FP8 baseline during the stable phase, and the gap widens to slightly above 1.5% during decay. Downstream accuracy (NVFP4 vs FP8) is comparable across most benchmarks: MMLU 76.57% vs 77.36%, GSM8K CoT 92.27% vs 89.08%, MATH 81.48% vs 83.32%, AGIEval English CoT 70.31% vs 67.01%. Coding shows the largest gap (HumanEval+ 57.43% vs 59.93%, MBPP+ 55.91% vs 59.11%), which the research team attributes partly to noisy final-checkpoint evaluation. The research team also documents a precision-switching technique: transitioning the forward pass from NVFP4 to BF16 starting at 8.2T tokens (roughly the final 18% of the schedule) reduced relative loss error from 1.5% to 0.5%.<\/p>\n<h2 class=\"wp-block-heading\"><strong>NVFP4 vs MXFP4<\/strong><\/h2>\n<p>On a separate 8B hybrid Mamba-Transformer trained on 1T tokens, NVFP4 reached a relative loss error of about 1.5% versus BF16, while MXFP4 stayed near 2.5%. To close the gap, MXFP4 required 1.36T tokens to match the NVFP4 1T-token loss, a 36% token overhead. 
The research team attributes the difference to NVFP4\u2019s smaller block size and E4M3<\/p>","protected":false},"author":2,"featured_media":91205,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-91207","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" 
content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/ja\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/\" \/>\n<meta property=\"og:locale\" content=\"ja_JP\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/ja\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-18T16:41:54+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u57f7\u7b46\u8005\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"\u63a8\u5b9a\u8aad\u307f\u53d6\u308a\u6642\u9593\" \/>\n\t<meta name=\"twitter:data2\" content=\"10\u5206\" \/>\n<script type=\"application\/ld+json\" 
class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon\",\"datePublished\":\"2026-05-18T16:41:54+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/\"},\"wordCount\":2075,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-xP6VJP.png\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"ja\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/\",\"url\":\"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-
validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/\",\"name\":\"NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon - YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-xP6VJP.png\",\"datePublished\":\"2026-05-18T16:41:54+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/#breadcrumb\"},\"inLanguage\":\"ja\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"ja\",\"@id\":\"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-xP6VJP.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-xP6VJP.png\",\"width\":1430,\"height\":924},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretr
aining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"ja\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ja\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin 
NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ja\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/ja\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/ja\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/","og_locale":"ja_JP","og_type":"article","og_title":"NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/ja\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-05-18T16:41:54+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"\u57f7\u7b46\u8005":"admin 
NU","\u63a8\u5b9a\u8aad\u307f\u53d6\u308a\u6642\u9593":"10\u5206"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon","datePublished":"2026-05-18T16:41:54+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/"},"wordCount":2075,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-xP6VJP.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"ja","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/","url":"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token
-horizon\/","name":"NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon - YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-xP6VJP.png","datePublished":"2026-05-18T16:41:54+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/#breadcrumb"},"inLanguage":"ja","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/"]}]},{"@type":"ImageObject","inLanguage":"ja","@id":"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/#primaryimage","url":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-xP6VJP.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-xP6VJP.png","width":1430,"height":924},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/nvidia-introduces-a-4-bit-pretraining-methodology-using-nvfp4-validated-on-a-12b-hybrid-mamba-transformer-at-10t-token-horizon\/#breadcrumb","itemListElement":[{"@type":"Lis
tItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"ja"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"ja","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"ja","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin 
NU"},"url":"https:\/\/youzum.net\/ja\/members\/adminnu\/"}]}},"rttpg_featured_image_url":{"full":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-xP6VJP.png",1430,924,false],"landscape":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-xP6VJP.png",1430,924,false],"portraits":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-xP6VJP.png",1430,924,false],"thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-xP6VJP-150x150.png",150,150,true],"medium":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-xP6VJP-300x194.png",300,194,true],"large":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-xP6VJP-1024x662.png",1024,662,true],"1536x1536":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-xP6VJP.png",1430,924,false],"2048x2048":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-xP6VJP.png",1430,924,false],"trp-custom-language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-xP6VJP-18x12.png",18,12,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-xP6VJP-300x300.png",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-xP6VJP-600x388.png",600,388,true],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-18-at-1.35.42-AM-1-xP6VJP-100x100.png",100,100,true]},"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/ja\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/ja\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> 
<a href=\"https:\/\/youzum.net\/ja\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/ja\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/ja\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"Pretraining frontier-scale LLMs in FP8 is now standard practice, but moving to 4-bit floating point has remained an open research problem because narrower formats compress dynamic range and amplify quantization error at long token horizons. A new research from NVIDIA describes a pretraining methodology built around NVFP4, a 4-bit microscaling format supported natively by Blackwell&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/posts\/91207","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/comments?post=91207"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/posts\/91207\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/media\/91205"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/media?parent=91207"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/categories?post=91207"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/tags?post=91207"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}