{"id":93270,"date":"2026-05-27T17:12:21","date_gmt":"2026-05-27T17:12:21","guid":{"rendered":"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/"},"modified":"2026-05-27T17:12:21","modified_gmt":"2026-05-27T17:12:21","slug":"meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference","status":"publish","type":"post","link":"https:\/\/youzum.net\/zh\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/","title":{"rendered":"Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference"},"content":{"rendered":"<p class=\"wp-block-paragraph\">Speculative decoding is a technique for speeding up large language model inference. A small, fast draft model proposes several tokens, and the large target model verifies them in parallel. Accepted tokens are kept, so a single target pass can yield several tokens. Rejected tokens are discarded, and decoding falls back to the target model's own prediction.<\/p>\n<p class=\"wp-block-paragraph\">The EAGLE series, which includes EAGLE 1, EAGLE 2, and EAGLE 3, has become one of the most widely adopted and practically deployed families of speculative decoding algorithms across both research and production systems. Today, the EAGLE Team, vLLM Team, and TorchSpec Team are giving that family a targeted reliability upgrade with the introduction of <a href=\"https:\/\/vllm.ai\/blog\/2026-05-26-eagle-3-1\" target=\"_blank\" rel=\"noreferrer noopener\">EAGLE 3.1<\/a>.<\/p>\n<h2 class=\"wp-block-heading\"><strong>What Was Going Wrong<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">While speculative decoding performs well in controlled settings, performance often degrades under different chat templates, long-context inputs, or out-of-distribution system prompts. 
<\/p>\n<p class=\"wp-block-paragraph\">The EAGLE team traced this fragility to a phenomenon called <strong><a href=\"https:\/\/arxiv.org\/pdf\/2605.09992\" target=\"_blank\" rel=\"noreferrer noopener\">attention drift<\/a><\/strong>: as speculation depth increases, the drafter gradually shifts attention away from sink tokens and toward its own generated tokens.<\/p>\n<p class=\"wp-block-paragraph\">In simpler terms: the drafter is a small model that predicts future tokens. As speculation gets deeper, it starts attending to its own prior outputs instead of the original context. This degrades output stability and acceptance length (the number of drafted tokens the target model accepts per verification step).<\/p>\n<p class=\"wp-block-paragraph\">Two underlying issues were identified. First, the fused input representation becomes increasingly imbalanced as higher-layer hidden states dominate the drafter input. Second, hidden-state magnitude grows across speculation steps due to the unnormalized residual path. Together, these effects make the drafter progressively less stable at deeper speculation depths.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Two Architectural Fixes in EAGLE 3.1<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">To address attention drift, EAGLE 3.1 introduces two architectural changes: FC normalization, applied to each target hidden state before the fully connected (FC) layer, and feeding post-norm hidden states into the next decoding step.<\/p>\n<p class=\"wp-block-paragraph\">FC normalization stabilizes the hidden states that the drafter receives from the target model. Without it, hidden-state magnitude grows across steps, making the drafter increasingly unreliable. Applying normalization at each step keeps the inputs bounded.<\/p>\n<p class=\"wp-block-paragraph\">The post-norm design makes the method behave more like recursively invoking the drafter across decoding steps, rather than simply appending additional layers to the target model. 
<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1706\" height=\"664\" data-attachment-id=\"80133\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/27\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/screenshot-2026-05-27-at-12-17-50-am\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM.png\" data-orig-size=\"1706,664\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\",\"alt\":\"\"}' data-image-title=\"https:\/\/vllm.ai\/blog\/2026-05-26-eagle-3-1\" data-image-description=\"&lt;p&gt;https:\/\/vllm.ai\/blog\/2026-05-26-eagle-3-1&lt;\/p&gt;\" data-image-caption=\"&lt;p&gt;https:\/\/vllm.ai\/blog\/2026-05-26-eagle-3-1&lt;\/p&gt;\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-1024x399.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM.png\" alt=\"https:\/\/vllm.ai\/blog\/2026-05-26-eagle-3-1\" class=\"wp-image-80133\" \/><figcaption class=\"wp-element-caption\">https:\/\/vllm.ai\/blog\/2026-05-26-eagle-3-1<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>What These Fixes Deliver<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Compared with EAGLE 3, EAGLE 3.1 demonstrates: better training-time to inference-time extrapolation, stronger long-context robustness, higher resilience to chat template and system prompt variation, and more stable acceptance length across diverse serving environments. 
<\/p>\n<p class=\"wp-block-paragraph\">In long-context workloads, EAGLE 3.1 achieves up to 2\u00d7 longer acceptance length compared with EAGLE 3.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Training Infrastructure: TorchSpec<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">TorchSpec now provides efficient training support for EAGLE 3.1 and future speculative decoding algorithms. By lowering training overhead and simplifying experimentation workflows, TorchSpec helps accelerate iteration and exploration for next-generation speculative decoding research and deployment.<\/p>\n<p class=\"wp-block-paragraph\">Using TorchSpec and vLLM, the research team also trained and open-sourced an EAGLE 3.1 draft model for Kimi K2.6, available on <a href=\"https:\/\/huggingface.co\/lightseekorg\/kimi-k2.6-eagle3-mla\" target=\"_blank\" rel=\"noreferrer noopener\">Hugging Face<\/a>. The model serves as an example of deploying EAGLE 3.1 with TorchSpec training and vLLM serving support on a real-world serving model.<\/p>\n<h2 class=\"wp-block-heading\"><strong>vLLM Integration: Config-Driven and Backward-Compatible<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">EAGLE 3.1 lands in vLLM as a config-driven extension of the existing EAGLE 3 implementation. The integration includes FC normalization support, post-norm hidden-state feedback, and removal of hardcoded assumptions around target hidden states.<\/p>\n<p class=\"wp-block-paragraph\">Backward compatibility with existing EAGLE 3 checkpoints is fully preserved. EAGLE 3.1 draft models can be plugged in directly through the same speculative-decoding code path. 
<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">vllm serve nvidia\/Kimi-K2.6-NVFP4 \\\n  --trust-remote-code \\\n  --tensor-parallel-size 4 \\\n  --tool-call-parser kimi_k2 \\\n  --enable-auto-tool-choice \\\n  --reasoning-parser kimi_k2 \\\n  --attention-backend tokenspeed_mla \\\n  --speculative-config '{\"model\":\"lightseekorg\/kimi-k2.6-eagle3.1-mla\",\"method\":\"eagle3\",\"num_speculative_tokens\":3}' \\\n  --language-model-only<\/code><\/pre>\n<\/div>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Benchmark Results on Kimi K2.6<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The research team benchmarked the Kimi K2.6 EAGLE 3.1 draft model on Kimi-K2.6-NVFP4 with vLLM (tensor parallelism 4, GB200, non-disaggregated serving) on the SPEED-Bench coding dataset. Relative to the no-speculation baseline, EAGLE 3.1 delivers 2.03\u00d7 higher per-user output throughput at concurrency 1. The speedup stays meaningful as concurrency scales: 1.71\u00d7 at concurrency 4 and 1.66\u00d7 at concurrency 16. 
<\/p>\n<h2 class=\"wp-block-heading\"><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<div>\n<div class=\"mtp-progress\"><\/div>\n<div class=\"mtp-track\">\n<p><!-- 1 --><\/p>\n<div class=\"mtp-slide\">\n<span class=\"mtp-snum\">01 \/ 07<\/span>\n<div class=\"mtp-inner\">\n<span class=\"mtp-tag\">vLLM \u00b7 May 26, 2026<\/span>\n<h1 class=\"mtp-h1\">Meet EAGLE 3.1<\/h1>\n<p><span class=\"mtp-divider\"><\/span><br \/>\n<span class=\"mtp-sub\">The EAGLE team, vLLM team, and TorchSpec team jointly released EAGLE 3.1 \u2014 a targeted fix for speculative decoding instability in production LLM serving.<\/span><\/p>\n<div>\n<span class=\"mtp-badge\">#speculative-decoding<\/span><br \/>\n<span class=\"mtp-badge\">#vLLM<\/span><br \/>\n<span class=\"mtp-badge\">#LLM inference<\/span><br \/>\n<span class=\"mtp-badge\">#performance<\/span>\n<\/div>\n<\/div>\n<\/div>\n<p><!-- 2 --><\/p>\n<div class=\"mtp-slide\">\n<span class=\"mtp-snum\">02 \/ 07<\/span>\n<div class=\"mtp-inner\">\n<span class=\"mtp-tag\">Background<\/span>\n<h2 class=\"mtp-h2\">What is Speculative Decoding?<\/h2>\n<p><span class=\"mtp-divider\"><\/span><br \/>\n<span class=\"mtp-sub\">A technique for speeding up LLM inference using two models working together.<\/span><\/p>\n<ul class=\"mtp-list\">\n<li>A small, fast <span class=\"mtp-hi\">draft model<\/span> proposes several tokens ahead<\/li>\n<li>The large <span class=\"mtp-hi\">target model<\/span> verifies all proposed tokens in one pass<\/li>\n<li>Accepted tokens are kept \u2014 rejected tokens fall back gracefully<\/li>\n<li>Result: higher output throughput with no change in output quality<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<p><!-- 3 --><\/p>\n<div class=\"mtp-slide\">\n<span class=\"mtp-snum\">03 \/ 07<\/span>\n<div class=\"mtp-inner\">\n<span class=\"mtp-tag\">The Problem<\/span>\n<h2 class=\"mtp-h2\">Attention Drift in EAGLE 3<\/h2>\n<p><span class=\"mtp-divider\"><\/span><br \/>\n<span class=\"mtp-sub\">EAGLE 3 performance degraded in 
real-world deployments under three conditions:<\/span><\/p>\n<ul class=\"mtp-list\">\n<li>Different <span class=\"mtp-hi\">chat templates<\/span><\/li>\n<li><span class=\"mtp-hi\">Long-context<\/span> inputs<\/li>\n<li>Out-of-distribution <span class=\"mtp-hi\">system prompts<\/span><\/li>\n<\/ul>\n<p><span class=\"mtp-sub\">Root cause: <span class=\"mtp-hi\">attention drift<\/span> \u2014 as speculation depth increases, the drafter shifts attention away from sink tokens toward its own generated tokens.<\/span>\n<\/p><\/div>\n<\/div>\n<p><!-- 4 --><\/p>\n<div class=\"mtp-slide\">\n<span class=\"mtp-snum\">04 \/ 07<\/span>\n<div class=\"mtp-inner\">\n<span class=\"mtp-tag\">Root Cause<\/span>\n<h2 class=\"mtp-h2\">Two Underlying Issues<\/h2>\n<p><span class=\"mtp-divider\"><\/span><\/p>\n<ul class=\"mtp-list\">\n<li>The <span class=\"mtp-hi\">fused input representation<\/span> becomes increasingly imbalanced \u2014 higher-layer hidden states dominate the drafter input<\/li>\n<li><span class=\"mtp-hi\">Hidden-state magnitude<\/span> grows across speculation steps due to the unnormalized residual path<\/li>\n<li>Together, these make the drafter <span class=\"mtp-hi\">progressively less stable<\/span> at deeper speculation depths<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<p><!-- 5 --><\/p>\n<div class=\"mtp-slide\">\n<span class=\"mtp-snum\">05 \/ 07<\/span>\n<div class=\"mtp-inner\">\n<span class=\"mtp-tag\">Architecture<\/span>\n<h2 class=\"mtp-h2\">Two Architectural Fixes<\/h2>\n<p><span class=\"mtp-divider\"><\/span><\/p>\n<div class=\"mtp-arch\">\n<div class=\"mtp-abox\">\n<span class=\"mtp-atitle\">Fix 1<\/span><br \/>\n<span class=\"mtp-atext\"><span class=\"mtp-hi\">FC normalization<\/span> applied after each target hidden state and before the FC layer. 
Keeps hidden-state magnitude bounded across decoding steps.<\/span>\n<\/div>\n<div class=\"mtp-abox\">\n<span class=\"mtp-atitle\">Fix 2<\/span><br \/>\n<span class=\"mtp-atext\"><span class=\"mtp-hi\">Post-norm hidden-state feedback<\/span> \u2014 normalized hidden states fed into the next decoding step, making the drafter behave like recursive invocation rather than appended layers.<\/span>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p><!-- 6 --><\/p>\n<div class=\"mtp-slide\">\n<span class=\"mtp-snum\">06 \/ 07<\/span>\n<div class=\"mtp-inner\">\n<span class=\"mtp-tag\">Benchmarks \u00b7 SPEED-Bench Coding \u00b7 GB200 TP=4<\/span>\n<h2 class=\"mtp-h2\">Per-User Throughput vs. No-Spec Baseline<\/h2>\n<p><span class=\"mtp-divider\"><\/span><\/p>\n<div class=\"mtp-metric\">\n<div class=\"mtp-card\"><span class=\"mtp-num\">2.03\u00d7<\/span><span class=\"mtp-label\">Concurrency 1<\/span><\/div>\n<div class=\"mtp-card\"><span class=\"mtp-num\">1.71\u00d7<\/span><span class=\"mtp-label\">Concurrency 4<\/span><\/div>\n<div class=\"mtp-card\"><span class=\"mtp-num\">1.66\u00d7<\/span><span class=\"mtp-label\">Concurrency 16<\/span><\/div>\n<\/div>\n<p><span class=\"mtp-sub\">In long-context workloads, EAGLE 3.1 achieves up to <span class=\"mtp-hi\">2\u00d7 longer acceptance length<\/span> compared with EAGLE 3. Tested on Kimi-K2.6-NVFP4 with vLLM.<\/span>\n<\/p><\/div>\n<\/div>\n<p><!-- 7 --><\/p>\n<div class=\"mtp-slide\">\n<span class=\"mtp-snum\">07 \/ 07<\/span>\n<div class=\"mtp-inner\">\n<span class=\"mtp-tag\">Deployment \u00b7 vLLM v0.22.0<\/span>\n<h2 class=\"mtp-h2\">How to Deploy EAGLE 3.1<\/h2>\n<p><span class=\"mtp-divider\"><\/span><br \/>\n<span class=\"mtp-sub\">Backward-compatible with EAGLE 3 checkpoints. Already merged in vLLM main. 
Stable release: <span class=\"mtp-hi\">v0.22.0<\/span>.<\/span><\/p>\n<div class=\"mtp-code\">\n<pre>vllm serve nvidia\/Kimi-K2.6-NVFP4 \\\n  --trust-remote-code \\\n  --tensor-parallel-size 4 \\\n  --tool-call-parser kimi_k2 \\\n  --enable-auto-tool-choice \\\n  --reasoning-parser kimi_k2 \\\n  --attention-backend tokenspeed_mla \\\n  --speculative-config \\\n    '{\"model\":\"lightseekorg\/kimi-k2.6-eagle3.1-mla\",\n      \"method\":\"eagle3\",\n      \"num_speculative_tokens\":3}' \\\n  --language-model-only<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>EAGLE 3.1 fixes <strong>attention drift<\/strong>, a newly identified instability where the drafter loses focus on sink tokens at deeper speculation depths.<\/li>\n<li>Two architectural changes, <strong>FC normalization<\/strong> and <strong>post-norm hidden-state feedback<\/strong>, stabilize the drafter across speculation steps.<\/li>\n<li>In long-context workloads, EAGLE 3.1 delivers <strong>up to 2\u00d7 longer acceptance length<\/strong> compared with EAGLE 3.<\/li>\n<li>Benchmarks on Kimi-K2.6-NVFP4 show <strong>2.03\u00d7 higher per-user output throughput<\/strong> than the no-speculation baseline at concurrency 1, tapering to 1.66\u00d7 at concurrency 16.<\/li>\n<li>EAGLE 3.1 is <strong>backward-compatible with EAGLE 3 checkpoints<\/strong> and is already merged into vLLM main, shipping in v0.22.0.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<\/p><p class=\"wp-block-paragraph\">\n<\/p><p 
class=\"wp-block-paragraph\">Check out the <strong><a href=\"https:\/\/vllm.ai\/blog\/2026-05-26-eagle-3-1\" target=\"_blank\" rel=\"noreferrer noopener\">technical details<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/27\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/\">Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Speculative decoding is a technique for speeding up large language model inference. A small, fast draft model proposes several tokens. The large target model verifies them in parallel. If accepted, inference is faster. If rejected, the system falls back gracefully. 
Two architectural changes \u2014 FC normalization and post-norm hidden-state feedback \u2014 stabilize the<\/p>","protected":false},"author":2,"featured_media":93271,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-93270","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, 
max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/zh\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/\" \/>\n<meta property=\"og:locale\" content=\"zh_CN\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/zh\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-27T17:12:21+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u4f5c\u8005\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"\u9884\u8ba1\u9605\u8bfb\u65f6\u95f4\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 \u5206\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"Meet EAGLE 
3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference\",\"datePublished\":\"2026-05-27T17:12:21+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/\"},\"wordCount\":1042,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-t6Vy1W.png\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"zh-Hans\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/\",\"url\":\"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/\",\"name\":\"Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference - 
YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-t6Vy1W.png\",\"datePublished\":\"2026-05-27T17:12:21+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/#breadcrumb\"},\"inLanguage\":\"zh-Hans\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-t6Vy1W.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-t6Vy1W.png\",\"width\":1706,\"height\":664},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM 
Inference\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"zh-Hans\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/zh\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/zh\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/","og_locale":"zh_CN","og_type":"article","og_title":"Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/zh\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-05-27T17:12:21+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"\u4f5c\u8005":"admin NU","\u9884\u8ba1\u9605\u8bfb\u65f6\u95f4":"5 \u5206"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM 
Inference","datePublished":"2026-05-27T17:12:21+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/"},"wordCount":1042,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-t6Vy1W.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"zh-Hans","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/","url":"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/","name":"Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-t6Vy1W.png","datePublished":"2026-05-27T17:12:21+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/#breadcrumb"},"inLanguage":"zh-Hans","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/"]}]},{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/#primaryimage","url":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-t6Vy1W.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-t6Vy1W.png","width":1706,"height":664},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM 
Inference"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"zh-Hans"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin 
NU"},"url":"https:\/\/youzum.net\/zh\/members\/adminnu\/"}]}},"rttpg_featured_image_url":{"full":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-t6Vy1W.png",1706,664,false],"landscape":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-t6Vy1W.png",1706,664,false],"portraits":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-t6Vy1W.png",1706,664,false],"thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-t6Vy1W-150x150.png",150,150,true],"medium":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-t6Vy1W-300x117.png",300,117,true],"large":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-t6Vy1W-1024x399.png",1024,399,true],"1536x1536":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-t6Vy1W-1536x598.png",1536,598,true],"2048x2048":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-t6Vy1W.png",1706,664,false],"trp-custom-language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-t6Vy1W-18x7.png",18,7,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-t6Vy1W-300x300.png",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-t6Vy1W-600x234.png",600,234,true],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-t6Vy1W-100x100.png",100,100,true]},"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/zh\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/zh\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a 
href=\"https:\/\/youzum.net\/zh\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/zh\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/zh\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"Speculative decoding is a technique for speeding up large language model inference. A small, fast draft model proposes several tokens. The large target model verifies them in parallel. If accepted, inference is faster. If rejected, the system falls back gracefully. The EAGLE Team, vLLM Team, and TorchSpec Team have launched the EAGLE series, including EAGLE 1,&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/posts\/93270","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/comments?post=93270"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/posts\/93270\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/media\/93271"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/media?parent=93270"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/categories?post=93270"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/tags?post=93270"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}