{"id":93970,"date":"2026-05-30T17:19:00","date_gmt":"2026-05-30T17:19:00","guid":{"rendered":"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/"},"modified":"2026-05-30T17:19:00","modified_gmt":"2026-05-30T17:19:00","slug":"nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b","status":"publish","type":"post","link":"https:\/\/youzum.net\/es\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/","title":{"rendered":"NVIDIA Introduces X-Token: Projection-Guided Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B"},"content":{"rendered":"<p class=\"wp-block-paragraph\">Knowledge distillation (KD) transfers \u201cdark knowledge\u201d from a large teacher model to a smaller student. The student learns from the teacher\u2019s full output probability distribution over tokens, not just correct answers. This is done via per-position Kullback\u2013Leibler (KL) divergence over next-token probability distributions.<\/p>\n<p class=\"wp-block-paragraph\">This formulation requires a shared tokenizer. A practitioner committed to Llama-3.2-1B cannot leverage stronger teachers with incompatible tokenizers \u2014 such as Phi-4-mini or Qwen3-4B \u2014 because token positions do not correspond across vocabularies. This also prevents multi-teacher distillation across tokenizer families.<\/p>\n<p class=\"wp-block-paragraph\">NVIDIA researchers introduced <strong>X-Token<\/strong>, a logit-distribution-based method for cross-tokenizer KD (Knowledge distillation). 
It operates as a drop-in replacement for the standard KD loss, requiring no auxiliary trainable components and no architectural changes.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The Problem X-Token is Solving<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Two prior approaches dominate cross-tokenizer KD. <strong>ULD (Universal Logit Distillation)<\/strong> sidesteps vocabulary alignment by rank-sorting both distributions and minimizing L1 distance. It discards token identity entirely. <strong>GOLD<\/strong> adds span alignment and a hybrid loss. It partitions tokens into a 1-to-1 string-matched common subset, trained with KL divergence, and an uncommon remainder, trained with ULD-style rank matching. GOLD is the prior state of the art that X-Token is measured against.<\/p>\n<p class=\"wp-block-paragraph\"><strong>The research team identifies two structural failures in GOLD\u2019s design:<\/strong><\/p>\n<p class=\"wp-block-paragraph\"><strong>Failure 1: Uncommon-token failure.<\/strong> When tokenizers fragment text differently, critical tokens fall into the unmatched uncommon subset. Llama-3 encodes two- and three-digit numbers as single tokens: \u201c201\u201d is one token. Qwen3 splits them digit by digit: \u201c2\u201d, \u201c0\u201d, \u201c1\u201d. Under GOLD, all 1,100 of Llama\u2019s two- and three-digit numerals (100 two-digit, 1,000 three-digit) fall into the uncommon set when Qwen3-4B is the teacher. Those tokens receive two types of harmful signal: identity-agnostic noise from rank-based ULD matching, and suppressive gradients from the common-KL term acting through the full-vocabulary softmax. The result: GSM8k accuracy drops to 2.56 under GOLD with Qwen3-4B, compared to 12.89 for same-tokenizer KD from a weaker Llama-3.2-3B teacher.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Failure 2: Over-conservative matching.<\/strong> GOLD uses strict string equality to define the common subset. 
A student token <code>Hundreds<\/code> corresponds to teacher tokens <code>Hund<\/code> followed by <code>reds<\/code> under teacher-side re-tokenization, but strict matching discards this pair. Useful alignment signal is lost even when the correspondence is well-formed.<\/p>\n<p class=\"wp-block-paragraph\">These two failures require opposite remedies: eliminate the partition when critical tokens are misaligned, and relax it when alignment is structurally sound.<\/p>\n<h2 class=\"wp-block-heading\"><strong>How X-Token Works<\/strong><\/h2>\n<p class=\"wp-block-paragraph\"><strong>X-Token has three components:<\/strong> span alignment, a projection matrix W, and two complementary loss formulations \u2014 P-KL and H-KL.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Span Alignment<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">Teacher and student tokenizers produce sequences of different lengths for the same text. X-Token uses dynamic-programming (DP) span alignment, grouping tokens into chunks where each chunk-pair decodes to the same underlying text substring. A chain-rule merge then combines per-token probabilities within each chunk into a single chunk-level distribution for use in the distillation loss. The alignment is cached per sequence and adds no per-step training overhead.<\/p>\n<p class=\"wp-block-paragraph\">The research team also identifies a failure in TRL\u2019s surface-substring alignment, which is used in TRL\u2019s GOLD trainer. TRL accumulates per-side decoded buffers and flushes only when both buffers match as equal raw strings. A byte-level disagreement \u2014 such as Llama-3 auto-prepending <code>&lt;bos&gt;<\/code> while Qwen-3 does not \u2014 prevents future flushes and forces all remaining tokens into one mis-grouped super-group at end of sequence. 
The DP approach handles this with a single gap move, regardless of sequence length.<\/p>\n<h3 class=\"wp-block-heading\"><strong>The Projection Matrix W<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">After alignment, teacher and student distributions still operate over different vocabularies. The projection matrix W \u2208 \u211d<sup>|V<sub>S<\/sub>|\u00d7|V<sub>T<\/sub>|<\/sup> maps each student token to a weighted combination of teacher tokens, bridging the vocabulary mismatch.<\/p>\n<p class=\"wp-block-paragraph\"><strong>W is constructed deterministically in two passes:<\/strong><\/p>\n<p class=\"wp-block-paragraph\"><strong>Pass 1 (exact-match):<\/strong> For every (student token, teacher token) pair whose decoded strings match after canonicalization, set W[s, t] = 1. Canonicalization unifies space prefixes (\u0120, _, \u2423), newlines, byte-fallback tokens of the form <code>&lt;0xHH&gt;<\/code>, and model-specific special tokens across tokenizer families.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Pass 2 (multi-token rule):<\/strong> For each student token without an exact match, re-tokenize its decoded text under the teacher tokenizer. If the resulting sequence has length \u2264 4, assign exponentially decayed weights: W[s, \u03c4\u1d62] = \u03b2\u00b7\u03b3\u2071 with (\u03b2, \u03b3) = (0.9, 0.1). A length-2 span receives normalized weights (0.909, 0.091). A length-3 span receives (0.9009, 0.0901, 0.0090). A length-4 span receives (0.9000, 0.0900, 0.0090, 0.0009). The leading sub-token receives the highest weight because it typically carries the most informative probability mass, for example \u201c_inter\u201d in [\u201c_inter\u201d, \u201cnational\u201d] or \u201c_20\u201d in [\u201c_20\u201d, \u201c24\u201d].<\/p>\n<p class=\"wp-block-paragraph\">Each row is truncated to its top-4 entries and row-normalized. 
Because each row of W is non-negative and sums to 1, left-multiplication by W<sup>\u22a4<\/sup> is probability-preserving: if p<sub>S<\/sub> is a probability vector, W<sup>\u22a4<\/sup>p<sub>S<\/sub> is also a valid probability vector over V<sub>T<\/sub>. W is constructed once before training and can optionally be jointly refined with the student under P-KL.<\/p>\n<h3 class=\"wp-block-heading\"><strong>P-KL: Addressing Erroneous and Suppressive Gradients<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">P-KL removes the partition entirely. It projects the student distribution p\u0302<sub>S<\/sub><sup>(k)<\/sup> into teacher vocabulary space via W:<\/p>\n<p class=\"wp-block-paragraph\">$$\\tilde{p}_S^{(k)}[t] = \\sum_{s \\in \\mathcal{V}_S} W[s, t] \\cdot \\hat{p}_S^{(k)}[s]$$<\/p>\n<p class=\"wp-block-paragraph\">It then computes the KL divergence directly between the teacher distribution and the projected student distribution p\u0303<sub>S<\/sub><sup>(k)<\/sup>, which are now defined over the same teacher vocabulary.<\/p>\n<p class=\"wp-block-paragraph\">There is no uncommon set, so rank-based ULD noise is eliminated. The suppressive gradient problem is also eliminated: the projection routes the student\u2019s probability mass for \u201c201\u201d directly onto {2, 0, 1} in the teacher vocabulary via W.<\/p>\n<p class=\"wp-block-paragraph\">The research team formally proves (Proposition 1) that GOLD\u2019s common-KL term induces non-negative gradients on every uncommon student logit. The gradient on an uncommon student logit j is: \u2202\u2112<sub>common<\/sub>\/\u2202z<sub>j<\/sub> = p<sub>S<\/sub>[j] \u00b7 M<sub>C<\/sub>(T), where M<sub>C<\/sub>(T) is the teacher probability mass on the common subset. 
Under gradient descent, this always drives z<sub>j<\/sub> downward \u2014 suppressing every uncommon token\u2019s probability regardless of the ground-truth token.<\/p>\n<h3 class=\"wp-block-heading\"><strong>H-KL: Relaxing the 1-to-1 Matching<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">H-KL applies when the partition is structurally sound \u2014 that is, when critical tokens land in the common subset. In that case, GOLD\u2019s direct KL on identity-aligned pairs delivers sharper per-pair supervision than P-KL\u2019s projection, which blends student probability mass across multiple teacher tokens. The opportunity is to make the partition less wasteful by relaxing the strict string-equality criterion.<\/p>\n<p class=\"wp-block-paragraph\">H-KL retains GOLD\u2019s hybrid loss structure but expands the common set C using W. For each student token s, it selects the top-ranked teacher token t* = argmax<sub>t\u2032\u2208V<sub>T<\/sub><\/sub> W[s, t\u2032], and adds (s, t*) to C. Exact matches are preserved since they receive weight 1 in W, the highest possible. Near-equivalent pairs like (Hundreds, Hund) \u2014 excluded by GOLD \u2014 are now admitted. The expanded C feeds the same hybrid loss: direct KL on common pairs, ULD on the remainder.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Selecting Between P-KL and H-KL<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">The selection uses a coverage audit over token categories in the student vocabulary. For math tasks, multi-digit numerals are the critical category. Table 8 in the research paper shows: under Qwen3-4B, 0 out of 100 two-digit Llama numerals and 0 out of 1,000 three-digit Llama numerals appear in C. Under Phi-4-mini-Instruct, all 100 two-digit and all 1,000 three-digit numerals appear in C. 
ASCII punctuation and single-digit numerals are fully covered in both cases.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1124\" height=\"616\" data-attachment-id=\"80195\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/29\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/screenshot-2026-05-29-at-4-09-12-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1.png\" data-orig-size=\"1124,616\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\",\"alt\":\"\"}' data-image-title=\"Screenshot 2026-05-29 at 4.09.12\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-1024x561.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1.png\" alt=\"\" class=\"wp-image-80195\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2605.21699<\/figcaption><\/figure>\n<\/div>\n<p class=\"wp-block-paragraph\">The rule: use P-KL when critical tokens fall outside C (Qwen3-4B), and H-KL when the partition is sound (Phi-4-mini-Instruct). Table 2 in the research paper shows the mode reversal is sharp: P-KL outperforms H-KL by +3.55 avg. on Qwen3-4B, while H-KL outperforms P-KL by +1.68 avg. 
on Phi-4-mini.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img decoding=\"async\" width=\"766\" height=\"294\" data-attachment-id=\"80192\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/29\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/screenshot-2026-05-29-at-4-05-17-pm\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.05.17-PM.png\" data-orig-size=\"766,294\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\",\"alt\":\"\"}' data-image-title=\"Screenshot 2026-05-29 at 4.05.17\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.05.17-PM.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.05.17-PM.png\" alt=\"\" class=\"wp-image-80192\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2605.21699<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Multi-Teacher Distillation<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">X-Token extends to multiple teachers. Each teacher has its own projection matrix W_m and loss selection. For same-tokenizer teachers, standard token-level KL is used. 
<strong>The multi-teacher loss aggregates per-teacher losses with weights \u03b1<sub>m<\/sub>:<\/strong><\/p>\n<p class=\"wp-block-paragraph\">$$\\mathcal{L}_{KD,multi} = \\sum_{m=1}^{M} \\alpha_{m} \\frac{1}{|\\mathcal{K}_{m}|} \\sum_{k \\in \\mathcal{K}_{m}} \\mathcal{L}_{*,m}^{(k)}$$<\/p>\n<p class=\"wp-block-paragraph\">The research team evaluates static and confidence-adaptive weighting schemes. Confidence-adaptive variants compute \u03b1<sub>m<\/sub> from cross-entropy, Shannon entropy, or maximum predicted probability of the teacher\u2019s distribution. Static weighting outperforms adaptive schemes in both multi-teacher setups evaluated.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img decoding=\"async\" width=\"1342\" height=\"438\" data-attachment-id=\"80191\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/29\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/screenshot-2026-05-29-at-3-41-08-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-3.41.08-PM-1.png\" data-orig-size=\"1342,438\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\",\"alt\":\"\"}' data-image-title=\"Screenshot 2026-05-29 at 3.41.08\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-3.41.08-PM-1-1024x334.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-3.41.08-PM-1.png\" alt=\"\" class=\"wp-image-80191\" \/><figcaption 
class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2605.21699<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Dynamic KD\/CE Scaling<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">Training combines the distillation loss \u2112<sub>KD<\/sub> with next-token cross-entropy \u2112<sub>CE<\/sub>. Because these terms differ in magnitude and shift during training, X-Token rescales the KD term at each step to match the scale of \u2112<sub>CE<\/sub>:<\/p>\n<p class=\"wp-block-paragraph\">$$\\mathcal{L} = \\text{sg}(\\mathcal{L}_{CE} \/ \\mathcal{L}_{KD}) \\cdot \\mathcal{L}_{KD} + \\mathcal{L}_{CE}$$<\/p>\n<p class=\"wp-block-paragraph\">where sg(\u00b7) is stop-gradient, so the ratio \u2112<sub>CE<\/sub>\/\u2112<sub>KD<\/sub> acts as a constant scale factor at each step and receives no gradient itself. Table 4 in the paper shows dynamic scaling outperforms three fixed-weight settings (KD-heavy, balanced, CE-heavy) on the Qwen3-4B (P-KL) pair.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1256\" height=\"596\" data-attachment-id=\"80197\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/29\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/screenshot-2026-05-29-at-4-10-01-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.10.01-PM-1.png\" data-orig-size=\"1256,596\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\",\"alt\":\"\"}' data-image-title=\"Screenshot 2026-05-29 at 4.10.01\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.10.01-PM-1-1024x486.png\" 
src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.10.01-PM-1.png\" alt=\"\" class=\"wp-image-80197\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2605.21699<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Experiments and Results<\/strong><\/h2>\n<p class=\"wp-block-paragraph\"><strong>Student:<\/strong> Llama-3.2-1B. <strong>Teachers:<\/strong> Llama-3.2-3B (same tokenizer), Qwen3-4B, and Phi-4-mini-Instruct. <strong>Training data:<\/strong> NemotronClimbMix dataset, 30,000 steps, batch size 768, context length 4096. <strong>Optimizer:<\/strong> AdamW, learning rate 5\u00d710\u207b\u2075, 5% warmup with cosine decay, weight decay 0.1, gradient clipping 1.0. Each experiment is feasible on a single NVIDIA H100 GPU; the research team used 128 H100s to speed up iteration.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Evaluation:<\/strong> 3-shot accuracy on MMLU, GSM8k, MATH-Hendrycks, Winogrande, and HellaSwag.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Key results:<\/strong><\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Setting<\/th>\n<th>Method<\/th>\n<th>Avg.<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>No distillation<\/td>\n<td>Llama-1B (base)<\/td>\n<td>33.96<\/td>\n<\/tr>\n<tr>\n<td>No distillation<\/td>\n<td>Continued pre-training<\/td>\n<td>36.63<\/td>\n<\/tr>\n<tr>\n<td>Same tokenizer<\/td>\n<td>Llama-3B \u2192 1B (KL)<\/td>\n<td>38.40<\/td>\n<\/tr>\n<tr>\n<td>Cross-tokenizer<\/td>\n<td>Qwen-4B, ULD<\/td>\n<td>36.77<\/td>\n<\/tr>\n<tr>\n<td>Cross-tokenizer<\/td>\n<td>Qwen-4B, GOLD<\/td>\n<td>35.03<\/td>\n<\/tr>\n<tr>\n<td>Cross-tokenizer<\/td>\n<td><strong>Qwen-4B, X-Token (P-KL)<\/strong><\/td>\n<td><strong>38.85<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Cross-tokenizer<\/td>\n<td>Phi-mini, ULD<\/td>\n<td>38.31<\/td>\n<\/tr>\n<tr>\n<td>Cross-tokenizer<\/td>\n<td>Phi-mini, 
GOLD<\/td>\n<td>38.66<\/td>\n<\/tr>\n<tr>\n<td>Cross-tokenizer<\/td>\n<td><strong>Phi-mini, X-Token (H-KL)<\/strong><\/td>\n<td><strong>39.18<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Multi-teacher<\/td>\n<td><strong>Phi-mini + Llama-3B (X-Token)<\/strong><\/td>\n<td><strong>40.48<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\"><strong>On Qwen-4B (P-KL regime):<\/strong> GOLD reaches 35.03 avg., below even continued pre-training without a teacher (36.63). This confirms the partition is actively harmful when critical tokens are misaligned. Pure ULD (36.77) already improves over GOLD, indicating the partition is the primary failure source. P-KL further improves to 38.85 avg. (+3.82 over GOLD). GSM8k alone moves from 2.56 to 15.54, surpassing same-tokenizer KD from Llama-3.2-3B (12.89) on that benchmark.<\/p>\n<p class=\"wp-block-paragraph\"><strong>On Phi-mini (H-KL regime):<\/strong> GOLD reaches 38.66 avg. \u2014 a reasonable baseline where the partition is structurally sound. H-KL improves to 39.18 avg. (+0.52 over GOLD). P-KL applied to Phi-mini drops to 37.50 avg., confirming that the wrong loss mode hurts even when W is available.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Multi-teacher:<\/strong> Phi-mini (H-KL, \u03b1=0.8) + Llama-3B (standard KL, \u03b1=0.2) under static weighting reaches 40.48 avg. This is +2.08 over same-family KD from Llama-3B alone, and +1.30 over the best single cross-tokenizer result (39.18). Combining Phi-mini + Qwen-4B \u2014 two teachers with overlapping reasoning strengths \u2014 scores only 38.49, below the best single teacher. Adding Qwen-4B as a third teacher yields 40.15, with math\/reasoning degrading (GSM8k 20.39 \u2192 19.18) while commonsense improves slightly. 
Teacher complementarity, not teacher count, drives gains.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Strengths and What to Watch<\/strong><\/h2>\n<h4 class=\"wp-block-heading\"><strong>Strengths:<\/strong><\/h4>\n<ul class=\"wp-block-list\">\n<li>The suppressive gradient problem in GOLD\u2019s hybrid loss is formally proved (Proposition 1), not just observed empirically<\/li>\n<li>W is constructed by rule from tokenizer strings alone; no training data or learned parameters are needed at initialization<\/li>\n<li>Dynamic KD\/CE scaling removes the need to tune fixed loss weights; it outperforms three fixed-weight baselines in ablations<\/li>\n<li>Multi-teacher extension adds no architectural changes; each teacher uses its own W<sub>m<\/sub> and appropriate loss<\/li>\n<li>The coverage audit for P-KL vs H-KL selection is a defined, reproducible criterion based on per-category token retention in C<\/li>\n<\/ul>\n<h4 class=\"wp-block-heading\"><strong>What to Watch:<\/strong><\/h4>\n<ul class=\"wp-block-list\">\n<li>Experiments use only Llama-3.2-1B as the student under continued pre-training; larger students and instruction-tuned settings are not evaluated<\/li>\n<li>Only three teacher pairs are tested; low-overlap tokenizer families (SentencePiece, byte-level BPE) are left for future work<\/li>\n<li>Static weighting outperforms confidence-adaptive weighting in all tested multi-teacher setups, and the reason remains an open question<\/li>\n<li>The multi-token rule in Pass 2 skips student tokens whose decoded text re-tokenizes to sequences longer than 4 under the teacher; those rows remain zero in W<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\"><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<div>\n<p>  <!-- Header --><\/p>\n<div class=\"xt-header\">\n    <span class=\"xt-logo\">\u25a0 X-Token \u2014 NVIDIA Research<\/span><br \/>\n    <span class=\"xt-counter\">1 \/ 8<\/span>\n  <\/div>\n<p>  <!-- SLIDE 1: What is Knowledge Distillation 
--><\/p>\n<div class=\"xt-slide active\" data-slide=\"0\">\n    <span class=\"xt-label\">01 \u2014 Background<\/span>\n<div class=\"xt-title\">What is Knowledge Distillation?<\/div>\n<div class=\"xt-body\">\n<p>Knowledge distillation (KD) transfers <strong>\u201cdark knowledge\u201d<\/strong> from a large teacher model to a smaller student model. The student learns from the teacher\u2019s full next-token probability distribution, not just the correct answer.<\/p>\n<p>This is done via <strong>per-position KL divergence<\/strong> over the teacher\u2019s output distribution at every token position in the sequence.<\/p>\n<p><strong>The constraint:<\/strong> standard KD requires a shared tokenizer. If Llama-3.2-1B is the student, it cannot learn from Qwen3-4B or Phi-4-mini \u2014 their token vocabularies do not align. Token positions have no correspondence across different tokenizer families.<\/p>\n<\/div>\n<div class=\"xt-stats\">\n<div class=\"xt-stat\">\n        <span class=\"xt-stat-val\">Llama<\/span><br \/>\n        <span class=\"xt-stat-lbl\">Student tokenizer<\/span>\n      <\/div>\n<div class=\"xt-stat\">\n        <span class=\"xt-stat-val\">Qwen \/ Phi<\/span><br \/>\n        <span class=\"xt-stat-lbl\">Incompatible teachers<\/span>\n      <\/div>\n<div class=\"xt-stat\">\n        <span class=\"xt-stat-val\">\u2260 Match<\/span><br \/>\n        <span class=\"xt-stat-lbl\">Vocab mismatch<\/span>\n      <\/div>\n<\/div>\n<\/div>\n<p>  <!-- SLIDE 2: Two Failures in GOLD --><\/p>\n<div class=\"xt-slide\" data-slide=\"1\">\n    <span class=\"xt-label\">02 \u2014 The Problem<\/span>\n<div class=\"xt-title\">Two Structural Failures in GOLD<\/div>\n<div class=\"xt-body\">\n<p><strong>GOLD<\/strong> is the prior state-of-the-art cross-tokenizer KD method. 
It partitions tokens into a string-matched <em>common subset<\/em> (trained with KL) and an <em>uncommon remainder<\/em> (trained with ULD rank-matching).<\/p>\n<p>NVIDIA researchers identified two distinct failures:<\/p>\n<\/div>\n<div class=\"xt-steps\">\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">1<\/div>\n<div class=\"xt-step-txt\"><strong>Uncommon-token failure:<\/strong> Critical tokens fall into the unmatched subset. Llama packs \u201c201\u201d as one token. Qwen splits it into \u201c2\u201d, \u201c0\u201d, \u201c1\u201d. All 1,100 multi-digit Llama numerals fall into the uncommon set under Qwen3-4B. They receive identity-agnostic noise and suppressive gradients \u2014 GSM8k drops to <strong>2.56<\/strong>.<\/div>\n<\/div>\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">2<\/div>\n<div class=\"xt-step-txt\"><strong>Over-conservative matching:<\/strong> Strict string equality discards well-formed pairs. Student token <span class=\"xt-code\">Hundreds<\/span> maps to teacher tokens <span class=\"xt-code\">Hund<\/span> + <span class=\"xt-code\">reds<\/span>, but GOLD drops this alignment entirely.<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>  <!-- SLIDE 3: X-Token Overview --><\/p>\n<div class=\"xt-slide\" data-slide=\"2\">\n    <span class=\"xt-label\">03 \u2014 Solution<\/span>\n<div class=\"xt-title\">X-Token: Three Core Components<\/div>\n<div class=\"xt-body\">\n<p>X-Token is a <strong>logit-distribution-based<\/strong> cross-tokenizer KD method. It requires no auxiliary trainable components and no architectural changes \u2014 it is a drop-in replacement for the standard KD loss.<\/p>\n<\/div>\n<div class=\"xt-steps\">\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">1<\/div>\n<div class=\"xt-step-txt\"><strong>Span Alignment:<\/strong> DP-based alignment groups tokens into chunks that decode to the same text substring. 
Cached per sequence \u2014 zero per-step overhead.<\/div>\n<\/div>\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">2<\/div>\n<div class=\"xt-step-txt\"><strong>Projection Matrix W:<\/strong> A sparse matrix W \u2208 \u211d<sup>|V<sub>S<\/sub>|\u00d7|V<sub>T<\/sub>|<\/sup> maps each student token to a weighted combination of teacher tokens, bridging the vocabulary gap.<\/div>\n<\/div>\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">3<\/div>\n<div class=\"xt-step-txt\"><strong>Two Loss Modes:<\/strong> <em>P-KL<\/em> removes the partition entirely. <em>H-KL<\/em> retains the partition but relaxes matching via top-1 mappings under W. Each targets a different failure mode.<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>  <!-- SLIDE 4: Projection Matrix W --><\/p>\n<div class=\"xt-slide\" data-slide=\"3\">\n    <span class=\"xt-label\">04 \u2014 Projection Matrix W<\/span>\n<div class=\"xt-title\">How W is Constructed<\/div>\n<div class=\"xt-body\">\n<p>W is built <strong>deterministically before training<\/strong> in two passes. No training data or learned parameters are required at initialization.<\/p>\n<\/div>\n<div class=\"xt-steps\">\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">1<\/div>\n<div class=\"xt-step-txt\"><strong>Exact-match pass:<\/strong> For every (student, teacher) token pair whose decoded strings match after canonicalization, set W[s,t] = 1. Canonicalization unifies space prefixes, newlines, byte-fallback tokens, and special tokens across families.<\/div>\n<\/div>\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">2<\/div>\n<div class=\"xt-step-txt\"><strong>Multi-token rule pass:<\/strong> For unmatched student tokens, re-tokenize their decoded text under the teacher. Assign decayed weights W[s,\u03c4\u1d62] = \u03b2\u00b7\u03b3\u2071 with (\u03b2,\u03b3) = (0.9, 0.1). A 2-token span gets (0.909, 0.091). 
Each row is truncated to top-4 entries and row-normalized.<\/div>\n<\/div>\n<\/div>\n<div class=\"xt-body\">\n<p>Because each row sums to 1, W\u1d40 is <strong>probability-preserving<\/strong>: W\u1d40p_S is a valid probability vector over V_T without additional normalization.<\/p>\n<\/div>\n<\/div>\n<p>  <!-- SLIDE 5: P-KL vs H-KL --><\/p>\n<div class=\"xt-slide\" data-slide=\"4\">\n    <span class=\"xt-label\">05 \u2014 Loss Formulations<\/span>\n<div class=\"xt-title\">P-KL vs H-KL: When to Use Each<\/div>\n<div class=\"xt-body\">\n<p>Selection is based on a <strong>coverage audit<\/strong>: measure what fraction of critical token categories (e.g. multi-digit numerals) appear in the common set C.<\/p>\n<\/div>\n<table class=\"xt-table\">\n<tr>\n<th>Property<\/th>\n<th>P-KL<\/th>\n<th>H-KL<\/th>\n<\/tr>\n<tr>\n<td>Partition<\/td>\n<td>Removed entirely<\/td>\n<td>Retained, relaxed<\/td>\n<\/tr>\n<tr>\n<td>Matching<\/td>\n<td>Full vocab via W<\/td>\n<td>Top-1 under W<\/td>\n<\/tr>\n<tr>\n<td>Use when<\/td>\n<td>Critical tokens fall outside C<\/td>\n<td>Partition is sound<\/td>\n<\/tr>\n<tr>\n<td>Teacher example<\/td>\n<td>Qwen3-4B<\/td>\n<td>Phi-4-mini-Instruct<\/td>\n<\/tr>\n<tr>\n<td>Avg. gain vs GOLD<\/td>\n<td class=\"xt-best\">+3.82<\/td>\n<td class=\"xt-best\">+0.52<\/td>\n<\/tr>\n<\/table>\n<div class=\"xt-body\">\n<p>Applying the wrong mode reverses results: P-KL on Phi-mini drops to <strong>37.50<\/strong> avg. 
vs H-KL\u2019s 39.18.<\/p>\n<\/div>\n<\/div>\n<p>  <!-- SLIDE 6: Results --><\/p>\n<div class=\"xt-slide\" data-slide=\"5\">\n    <span class=\"xt-label\">06 \u2014 Results<\/span>\n<div class=\"xt-title\">Benchmark Results on Llama-3.2-1B (3-shot)<\/div>\n<div class=\"xt-body\">\n<p>Student: <strong>Llama-3.2-1B<\/strong> \u2014 trained on NemotronClimbMix, 30K steps, batch 768, context 4096.<\/p>\n<\/div>\n<table class=\"xt-table\">\n<tr>\n<th>Method<\/th>\n<th>GSM8k<\/th>\n<th>Avg.<\/th>\n<\/tr>\n<tr>\n<td>Llama-1B (base)<\/td>\n<td>5.69<\/td>\n<td>33.96<\/td>\n<\/tr>\n<tr>\n<td>Continued pre-training<\/td>\n<td>10.25<\/td>\n<td>36.63<\/td>\n<\/tr>\n<tr>\n<td>Same-tokenizer KD (Llama-3B)<\/td>\n<td>12.89<\/td>\n<td>38.40<\/td>\n<\/tr>\n<tr>\n<td class=\"xt-bad\">Qwen-4B, GOLD<\/td>\n<td class=\"xt-bad\">2.56<\/td>\n<td class=\"xt-bad\">35.03<\/td>\n<\/tr>\n<tr>\n<td class=\"xt-best\">Qwen-4B, X-Token (P-KL)<\/td>\n<td class=\"xt-best\">15.54<\/td>\n<td class=\"xt-best\">38.85<\/td>\n<\/tr>\n<tr>\n<td>Phi-mini, GOLD<\/td>\n<td>16.50<\/td>\n<td>38.66<\/td>\n<\/tr>\n<tr>\n<td class=\"xt-best\">Phi-mini, X-Token (H-KL)<\/td>\n<td class=\"xt-best\">19.11<\/td>\n<td class=\"xt-best\">39.18<\/td>\n<\/tr>\n<tr>\n<td class=\"xt-best\">Phi-mini + Llama-3B (Multi)<\/td>\n<td class=\"xt-best\">20.39<\/td>\n<td class=\"xt-best\">40.48<\/td>\n<\/tr>\n<\/table><\/div>\n<p>  <!-- SLIDE 7: Multi-Teacher --><\/p>\n<div class=\"xt-slide\" data-slide=\"6\">\n    <span class=\"xt-label\">07 \u2014 Multi-Teacher Distillation<\/span>\n<div class=\"xt-title\">Teacher Complementarity Drives Gains<\/div>\n<div class=\"xt-body\">\n<p>X-Token extends to multiple teachers. Each gets its own projection matrix W_m and loss mode. The aggregated loss uses per-teacher weights \u03b1_m.<\/p>\n<p>Key finding: <strong>static weighting outperforms confidence-adaptive weighting<\/strong> in all tested setups. 
Phi-mini (\u03b1=0.8) + Llama-3B (\u03b1=0.2) achieves the best result.<\/p>\n<\/div>\n<table class=\"xt-table\">\n<tr>\n<th>Teacher Combination<\/th>\n<th>Avg.<\/th>\n<th>Note<\/th>\n<\/tr>\n<tr>\n<td>Phi-mini only (H-KL)<\/td>\n<td>39.18<\/td>\n<td>Best single<\/td>\n<\/tr>\n<tr>\n<td class=\"xt-best\">Phi-mini + Llama-3B<\/td>\n<td class=\"xt-best\">40.48<\/td>\n<td class=\"xt-best\">Complementary<\/td>\n<\/tr>\n<tr>\n<td>Phi-mini + Qwen-4B<\/td>\n<td>38.49<\/td>\n<td>Overlapping<\/td>\n<\/tr>\n<tr>\n<td>Phi-mini + Qwen-4B + Llama-3B<\/td>\n<td>40.15<\/td>\n<td>3rd teacher hurts math<\/td>\n<\/tr>\n<\/table>\n<div class=\"xt-body\">\n<p>Combining two reasoning-heavy teachers (Phi-mini + Qwen-4B) scores <strong>below<\/strong> the best single teacher. Teacher diversity matters more than teacher count.<\/p>\n<\/div>\n<\/div>\n<p>  <!-- SLIDE 8: Key Takeaways --><\/p>\n<div class=\"xt-slide\" data-slide=\"7\">\n    <span class=\"xt-label\">08 \u2014 Key Takeaways<\/span>\n<div class=\"xt-title\">What to Remember About X-Token<\/div>\n<div class=\"xt-steps\">\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">1<\/div>\n<div class=\"xt-step-txt\">GOLD\u2019s partition actively harms training when critical tokens (e.g., multi-digit numerals) fall into the uncommon set \u2014 <strong>P-KL eliminates the partition entirely<\/strong> using projection matrix W.<\/div>\n<\/div>\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">2<\/div>\n<div class=\"xt-step-txt\"><strong>H-KL<\/strong> retains the partition but relaxes matching to top-1 mappings under W \u2014 best when the partition is structurally sound.<\/div>\n<\/div>\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">3<\/div>\n<div class=\"xt-step-txt\">The projection matrix W is <strong>built rule-based before training<\/strong> from tokenizer strings alone; no learned parameters required at init.<\/div>\n<\/div>\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">4<\/div>\n<div 
class=\"xt-step-txt\">Multi-teacher gains (+1.3 over single-teacher) come from <strong>teacher complementarity<\/strong>, not from adding more teachers with overlapping strengths.<\/div>\n<\/div>\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">5<\/div>\n<div class=\"xt-step-txt\">GSM8k recovers from <strong>2.56<\/strong> (GOLD) to <strong>15.54<\/strong> (P-KL) \u2014 a 6\u00d7 gain that exceeds the 12.89 reached by same-tokenizer KD from the weaker Llama-3.2-3B teacher.<\/div>\n<\/div>\n<\/div>\n<div class=\"xt-body\">\n<p><strong>arXiv:<\/strong> <em>2605.21699<\/em> \u00a0\u2014\u00a0 <strong>Institution:<\/strong> <em>NVIDIA<\/em><\/p>\n<\/div>\n<\/div>\n<p>  <!-- Navigation --><\/p>\n<div class=\"xt-nav\">\n    <button class=\"xt-btn\" disabled>\u2190 Prev<\/button>\n<div class=\"xt-dots\"><\/div>\n<p>    <button class=\"xt-btn\">Next \u2192<\/button>\n  <\/p><\/div>\n<p>  <!-- Footer --><\/p>\n<div class=\"xt-footer\">NVIDIA Research \u2014 X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation \u2014 arXiv:2605.21699 \u2014 marktechpost.com<\/div>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>X-Token identifies two distinct, opposite failure modes in GOLD: uncommon-token suppression (fix: remove the partition with P-KL) and over-conservative matching (fix: relax it with H-KL).<\/li>\n<li>The projection matrix W is built rule-based from tokenizer strings before training; it can optionally be jointly refined with the student for additional gains.<\/li>\n<li>P-KL on Qwen3-4B improves over GOLD by +3.82 avg.
and recovers GSM8k from 2.56 to 15.54.<\/li>\n<li>Multi-teacher distillation gains (+1.3 over single-teacher) come from teacher complementarity, not just from adding more teachers.<\/li>\n<li>Loss mode selection (P-KL vs H-KL) is determined by a coverage audit on token categories; applying the wrong mode reverses the ranking.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<\/p><p class=\"wp-block-paragraph\">Check out the <strong><a href=\"https:\/\/arxiv.org\/pdf\/2605.21699\" target=\"_blank\" rel=\"noreferrer noopener\">Research Paper<\/a>.<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/29\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/\">NVIDIA Introduces X-Token: Projection-Guided Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Knowledge distillation (KD) transfers \u201cdark knowledge\u201d from a large teacher model to a smaller student. The student learns from the teacher\u2019s full output probability distribution over tokens, not just correct answers. This is done via per-position Kullback\u2013Leibler (KL) divergence over next-token probability distributions. This formulation requires a shared tokenizer.
A practitioner committed to Llama-3.2-1B cannot leverage stronger teachers with incompatible tokenizers \u2014 such as Phi-4-mini or Qwen3-4B \u2014 because token positions do not correspond across vocabularies. This also prevents multi-teacher distillation across tokenizer families. NVIDIA researchers introduced X-Token, a logit-distribution-based method for cross-tokenizer KD (Knowledge distillation). It operates as a drop-in replacement for the standard KD loss, requiring no auxiliary trainable components and no architectural changes. The Problem X-Token is Solving Two prior approaches dominate cross-tokenizer KD.
ULD (Universal Logit Distillation) sidesteps vocabulary alignment by rank-sorting both distributions and minimizing L1 distance. It discards token identity entirely. GOLD adds span alignment and a hybrid loss. It partitions tokens into a 1-to-1 string-matched common subset, trained with KL divergence, and an uncommon remainder, trained with ULD-style rank matching. GOLD is the current state of the art. The research team identifies two structural failures in GOLD\u2019s design: Failure 1: Uncommon-token failure\u2013 When tokenizers fragment text differently, critical tokens fall into the unmatched uncommon subset. Llama-3 packs multi-digit numbers as single tokens \u2014 \u201c201\u201d is one token. Qwen3 splits them digit by digit: \u201c2\u201d, \u201c0\u201d, \u201c1\u201d. Under GOLD, all 1,100 of Llama\u2019s two- and three-digit numerals (100 two-digit, 1,000 three-digit) fall into the uncommon set when Qwen3-4B is the teacher. Those tokens receive two types of harmful signal: identity-agnostic noise from rank-based ULD matching, and suppressive gradients from the common-KL term acting through the full-vocabulary softmax. The result: GSM8k accuracy drops to 2.56 under GOLD with Qwen3-4B, compared to 12.89 for same-tokenizer KD from a weaker Llama-3.2-3B teacher. Failure 2: Over-conservative matching\u2013 GOLD uses strict string equality to define the common subset. A student token Hundreds corresponds to teacher tokens Hund followed by reds under teacher-side re-tokenization, but strict matching discards this pair. Useful alignment signal is lost even when the correspondence is well-formed. These two failures require opposite remedies: eliminate the partition when critical tokens are misaligned, and relax it when alignment is structurally sound. How X-Token Works X-Token has three components: span alignment, a projection matrix W, and two complementary loss formulations \u2014 P-KL and H-KL. 
Span Alignment Teacher and student tokenizers produce sequences of different lengths for the same text. X-Token uses dynamic-programming (DP) span alignment, grouping tokens into chunks where each chunk-pair decodes to the same underlying text substring. A chain-rule merge then combines per-token probabilities within each chunk into a single chunk-level distribution for use in the distillation loss. The alignment is cached per sequence and adds no per-step training overhead. The research team also identifies a failure in TRL\u2019s surface-substring alignment, which is used in TRL\u2019s GOLD trainer. TRL accumulates per-side decoded buffers and flushes only when both buffers match as equal raw strings. A byte-level disagreement \u2014 such as Llama-3 auto-prepending &lt;bos&gt; while Qwen3 does not \u2014 prevents future flushes and forces all remaining tokens into one mis-grouped super-group at the end of the sequence. The DP approach handles this with a single gap move, regardless of sequence length. The Projection Matrix W After alignment, teacher and student distributions still operate over different vocabularies. The projection matrix W \u2208 \u211d^(|V_S|\u00d7|V_T|) maps each student token to a weighted combination of teacher tokens, bridging the vocabulary mismatch. W is constructed deterministically in two passes: Pass 1 (exact-match): For every (student token, teacher token) pair whose decoded strings match after canonicalization, set W[s, t] = 1. Canonicalization unifies space prefixes (\u0120, _, \u2423), newlines, byte-fallback tokens of the form &lt;0xHH&gt;, and model-specific special tokens across tokenizer families. Pass 2 (multi-token rule): For each student token without an exact match, re-tokenize its decoded text under the teacher tokenizer. If the resulting sequence has length \u2264 4, assign exponentially decayed weights: W[s, \u03c4\u1d62] = \u03b2\u00b7\u03b3\u2071 with (\u03b2, \u03b3) = (0.9, 0.1).
A length-2 span receives normalized weights (0.909, 0.091). A length-3 span receives (0.9009, 0.0901, 0.0090). A length-4 span receives (0.9000, 0.0900, 0.0090, 0.0009). The leading sub-token receives the highest weight because it typically carries the most informative probability mass \u2014 for example, \u201c_inter\u201d in [\u201c_inter\u201d, \u201cnational\u201d] or \u201c_20\u201d in [\u201c_20\u201d, \u201c24\u201d]. Each row is truncated to its top-4 entries and row-normalized. Because each row of W is non-negative and sums to 1, left-multiplication by W\u22a4 is probability-preserving: if p_S is a probability vector, W\u22a4p_S is also a valid probability vector over V_T. W is constructed once before training and can optionally be jointly refined with the student under P-KL. P-KL: Addressing Erroneous and Suppressive Gradients P-KL removes the partition entirely. It projects the student distribution p\u0302_S^(k) into teacher vocabulary space via W: p\u0303_S^(k)[t] = \u2211_{s \u2208 V_S} W[s, t] \u00b7 p\u0302_S^(k)[s]. It then computes the KL divergence directly between the teacher distribution and the projected student distribution p\u0303_S^(k). There is no uncommon set, so rank-based ULD noise is eliminated. The suppressive gradient problem is also eliminated: the projection routes the student\u2019s probability mass for \u201c201\u201d directly onto {2, 0, 1} in the teacher vocabulary via W. The research team formally proves (Proposition 1) that GOLD\u2019s common-KL term induces non-negative gradients on every uncommon student logit. The gradient on an uncommon student logit j is \u2202L_common\/\u2202z_j = p_S[j] \u00b7 M_C(T), where M_C(T) is the teacher probability mass on the common subset.
Under gradient descent, this always drives zj downward \u2014 suppressing every uncommon token\u2019s probability regardless of the ground-truth token. H-KL: Relaxing the 1-to-1 Matching H-KL applies when the partition is structurally sound \u2014 that is, when critical tokens land in the common subset. In that case, GOLD\u2019s direct KL on identity-aligned pairs delivers sharper per-pair supervision than P-KL\u2019s projection, which blends student probability mass across multiple teacher tokens. The opportunity is to make the partition less wasteful by relaxing the strict string-equality criterion. H-KL retains GOLD\u2019s hybrid loss structure but expands the common set<\/p>","protected":false},"author":2,"featured_media":93971,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-93970","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>NVIDIA Introduces X-Token: Projection-Guided Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/es\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/\" \/>\n<meta property=\"og:locale\" content=\"es_ES\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"NVIDIA Introduces X-Token: Projection-Guided Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/es\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" 
content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-30T17:19:00+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Escrito por\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Tiempo de lectura\" \/>\n\t<meta name=\"twitter:data2\" content=\"14 minutos\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"NVIDIA Introduces X-Token: Projection-Guided Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on 
Llama-3.2-1B\",\"datePublished\":\"2026-05-30T17:19:00+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/\"},\"wordCount\":2862,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-4zYgnX.png\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"es\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/\",\"url\":\"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/\",\"name\":\"NVIDIA Introduces X-Token: Projection-Guided Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B - 
YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-4zYgnX.png\",\"datePublished\":\"2026-05-30T17:19:00+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/#breadcrumb\"},\"inLanguage\":\"es\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"es\",\"@id\":\"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-4zYgnX.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-4zYgnX.png\",\"width\":1124,\"height\":616},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\
"@type\":\"ListItem\",\"position\":2,\"name\":\"NVIDIA Introduces X-Token: Projection-Guided Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"es\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"es\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"es\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/es\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"NVIDIA Introduces X-Token: Projection-Guided Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/es\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/","og_locale":"es_ES","og_type":"article","og_title":"NVIDIA Introduces X-Token: Projection-Guided Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/es\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-05-30T17:19:00+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"Escrito por":"admin NU","Tiempo de lectura":"14 minutos"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"NVIDIA Introduces X-Token: Projection-Guided 
Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B","datePublished":"2026-05-30T17:19:00+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/"},"wordCount":2862,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-4zYgnX.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"es","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/","url":"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/","name":"NVIDIA Introduces X-Token: Projection-Guided Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-4zYgnX.png","datePublished":"2026-05-30T17:19:00+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/#breadcrumb"},"inLanguage":"es","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/"]}]},{"@type":"ImageObject","inLanguage":"es","@id":"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/#primaryimage","url":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-4zYgnX.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-4zYgnX.png","width":1124,"height":616},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"NVIDIA Introduces X-Token: Projection-Guided 
Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"es"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"es","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"es","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin 
NU"},"url":"https:\/\/youzum.net\/es\/members\/adminnu\/"}]}},"rttpg_featured_image_url":{"full":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-4zYgnX.png",1124,616,false],"landscape":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-4zYgnX.png",1124,616,false],"portraits":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-4zYgnX.png",1124,616,false],"thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-4zYgnX-150x150.png",150,150,true],"medium":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-4zYgnX-300x164.png",300,164,true],"large":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-4zYgnX-1024x561.png",1024,561,true],"1536x1536":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-4zYgnX.png",1124,616,false],"2048x2048":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-4zYgnX.png",1124,616,false],"trp-custom-language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-4zYgnX-18x10.png",18,10,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-4zYgnX-300x300.png",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-4zYgnX-600x329.png",600,329,true],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-4zYgnX-100x100.png",100,100,true]},"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/es\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/es\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> 
<a href=\"https:\/\/youzum.net\/es\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/es\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/es\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"Knowledge distillation (KD) transfers \u201cdark knowledge\u201d from a large teacher model to a smaller student. The student learns from the teacher\u2019s full output probability distribution over tokens, not just correct answers. This is done via per-position Kullback\u2013Leibler (KL) divergence over next-token probability distributions. This formulation requires a shared tokenizer. A practitioner committed to Llama-3.2-1B cannot&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/posts\/93970","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/comments?post=93970"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/posts\/93970\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/media\/93971"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/media?parent=93970"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/categories?post=93970"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/tags?post=93970"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}