{"id":95987,"date":"2026-06-08T17:39:55","date_gmt":"2026-06-08T17:39:55","guid":{"rendered":"https:\/\/youzum.net\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/"},"modified":"2026-06-08T17:39:55","modified_gmt":"2026-06-08T17:39:55","slug":"xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus","status":"publish","type":"post","link":"https:\/\/youzum.net\/de\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/","title":{"rendered":"Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs"},"content":{"rendered":"<p class=\"wp-block-paragraph\">Inference speed is becoming a competitive metric for large language models. Xiaomi\u2019s MiMo team just released MiMo-V2.5-Pro-UltraSpeed, built in collaboration with the TileRT systems group. It decodes faster than 1000 tokens per second on a 1-trillion-parameter model. The Xiaomi team describes this as a first at trillion-parameter scale. Demos show generation peaking near 1200 tokens per second. The notable part is the hardware: it runs on commodity GPUs, not custom silicon.<\/p>\n<h2 class=\"wp-block-heading\"><strong>What Is MiMo-V2.5-Pro-UltraSpeed?<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">UltraSpeed is a high-speed serving mode for the existing MiMo-V2.5-Pro model. The base model uses a Mixture-of-Experts (MoE) architecture at trillion-parameter scale. UltraSpeed targets generation speed rather than model capability. It changes how fast the model produces output tokens. The speedup comes from three coordinated techniques across the model and the serving system. Xiaomi calls this approach extreme model-system codesign. 
Crucially, the entire stack runs on a single standard 8-GPU commodity node.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The Speed Case: Three Layers Working Together<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The first layer is FP4 quantization. At trillion scale, FP8 or FP16 weights create heavy memory and bandwidth pressure. Lower bit-width weights move through memory faster, which directly lifts decode speed. Xiaomi uses the MXFP4 format, applied selectively to the MoE Experts only. Other modules keep higher precision, reported as FP8 by TileRT. Experts hold most parameters and tolerate quantization best, so the tradeoff is favorable. Quantization-Aware Training (QAT) keeps benchmark quality essentially on par with the original.<\/p>\n<p class=\"wp-block-paragraph\">The second layer is DFlash speculative decoding, covered in detail below. The third layer is TileRT, the system that executes everything on the GPU. Each technique alone is not enough. The 1000 TPS result needs all three aligned tightly.<\/p>\n<h2 class=\"wp-block-heading\"><strong>DFlash: Parallel Drafting Without a Serial Bottleneck<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Standard speculative decoding uses a small draft model to guess upcoming tokens. The large model then verifies those guesses in parallel. Rejection sampling keeps output identical to normal decoding, so quality is lossless. The problem is that the draft model still generates tokens one at a time. DFlash, a method from the research community, removes that constraint. It uses block-level masked parallel prediction. The draft model fills a whole block of masked positions in one forward pass.<\/p>\n<p class=\"wp-block-paragraph\">Xiaomi tuned DFlash with the Muon second-order optimizer and model self-distillation. The draft model uses Sliding Window Attention (SWA) only, matching the MiMo-V2 design. This makes per-prediction compute constant rather than growing with context length. 
Block size is capped at 8 to limit verification cost and raise concurrency.<\/p>\n<p class=\"wp-block-paragraph\">Acceptance length measures how many draft tokens survive verification each round.<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Scenario<\/th>\n<th>Acceptance Length<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Coding<\/td>\n<td>6.30<\/td>\n<\/tr>\n<tr>\n<td>Math \/ Reasoning<\/td>\n<td>5.56<\/td>\n<\/tr>\n<tr>\n<td>Agent<\/td>\n<td>4.29<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">In coding, six to seven of eight draft tokens are accepted per round. Some samples reach a maximum of 7.14. <\/p>\n<h2 class=\"wp-block-heading\"><strong>TileRT: Squeezing the Microseconds<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">At 1000 TPS, each operator runs for only microseconds. Traditional systems launch operators one by one, and each launch costs time. Those gaps fracture the execution stream and become the real bottleneck. TileRT replaces this with a Persistent Engine Kernel that stays resident on the GPU. It uses Warp Specialization to split data movement, compute, and communication into coordinated roles. Small operations like RMSNorm, RoPE, and KV cache writes turn into bottlenecks at this scale. 
The system was co-designed with the FP4 and DFlash choices, not added afterward.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Use Cases<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The release targets latency-sensitive work where waiting breaks the loop:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Parallel reasoning:<\/strong> run many Best-of-N or tree-search paths within the same wall-clock time.<\/li>\n<li><strong>Coding agents:<\/strong> faster code generation cuts the wait between agent steps.<\/li>\n<li><strong>Real-time decision loops:<\/strong> trading signal generation, fraud interception, and live dialogue.<\/li>\n<li><strong>Interactive prototyping:<\/strong> demos show a Snake game in about 10 seconds and a macOS interface in about one minute.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">In all of these workloads, raw token speed is the binding constraint.<\/p>\n<h2 class=\"wp-block-heading\"><strong>How It Compares<\/strong><\/h2>\n<p class=\"wp-block-paragraph\"><strong>The first table contrasts the two routes to extreme decode speed: custom silicon and commodity GPUs.<\/strong><\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Approach<\/th>\n<th>Hardware<\/th>\n<th>How speed is achieved<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Cerebras<\/td>\n<td>Wafer-scale integration (custom)<\/td>\n<td>Scale on a single custom wafer<\/td>\n<\/tr>\n<tr>\n<td>Groq<\/td>\n<td>Custom architecture<\/td>\n<td>Pure on-chip SRAM<\/td>\n<\/tr>\n<tr>\n<td>MiMo \u00d7 TileRT<\/td>\n<td>Commodity GPUs (8-GPU node)<\/td>\n<td>Model-system codesign: FP4 + DFlash + TileRT<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\"><strong>The second table compares the standard model with the UltraSpeed mode.<\/strong><\/p>\n<figure class=\"wp-block-table\">\n<table 
class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Dimension<\/th>\n<th>MiMo-V2.5-Pro<\/th>\n<th>MiMo-V2.5-Pro-UltraSpeed<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Decode speed<\/td>\n<td>Baseline<\/td>\n<td>~10\u00d7 faster (1000+ TPS)<\/td>\n<\/tr>\n<tr>\n<td>Price<\/td>\n<td>1\u00d7<\/td>\n<td>3\u00d7<\/td>\n<\/tr>\n<tr>\n<td>Weight precision<\/td>\n<td>Standard<\/td>\n<td>FP4 MoE Experts via QAT<\/td>\n<\/tr>\n<tr>\n<td>Decoding<\/td>\n<td>Standard autoregressive<\/td>\n<td>DFlash speculative decoding<\/td>\n<\/tr>\n<tr>\n<td>Access<\/td>\n<td>Standard model plans<\/td>\n<td>API only, application-based trial<\/td>\n<\/tr>\n<tr>\n<td>Token Plan<\/td>\n<td>Supported<\/td>\n<td>Not supported<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<h2 class=\"wp-block-heading\"><strong>Access, Pricing, and Open Source<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">UltraSpeed ships through a limited, application-based window. The API trial runs June 9 to June 23, 2026. Pricing is 3\u00d7 the standard MiMo-V2.5-Pro rate, for roughly 10\u00d7 the speed. It is API only, and the Token Plan is not supported. Approved users also receive free Chat access during the trial. Chat limits apply: 10 queue entries daily, 30-minute sessions, and 5-minute idle release. Xiaomi open-sourced the MiMo-V2.5-Pro-FP4-DFlash checkpoint on <a href=\"https:\/\/huggingface.co\/XiaomiMiMo\/MiMo-V2.5-Pro-FP4-DFlash\" target=\"_blank\" rel=\"noreferrer noopener\">Hugging Face<\/a>. TileRT has open-sourced select modules on GitHub. 
<\/p>\n<h2 class=\"wp-block-heading\"><strong>Strengths and Limitations<\/strong><\/h2>\n<h4 class=\"wp-block-heading\"><strong>Strengths<\/strong><\/h4>\n<ul class=\"wp-block-list\">\n<li>1000+ TPS on a 1T model without custom silicon.<\/li>\n<li>Lossless decoding through rejection sampling in DFlash.<\/li>\n<li>FP4 applied only where tolerance is highest, preserving quality.<\/li>\n<li>An open checkpoint lets the community test the claims.<\/li>\n<\/ul>\n<h4 class=\"wp-block-heading\"><strong>Limitations<\/strong><\/h4>\n<ul class=\"wp-block-list\">\n<li>Access is gated, short, and approval-based at launch.<\/li>\n<li>Pricing triples per token versus the standard model.<\/li>\n<li>Acceptance length drops in open-ended conversation.<\/li>\n<li>Independent third-party speed verification is not yet public.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>Xiaomi MiMo and TileRT decode a 1-trillion-parameter model past 1000 tokens per second on commodity GPUs.<\/li>\n<li>The speedup comes from three layers: FP4 quantization, DFlash speculative decoding, and the TileRT runtime.<\/li>\n<li>FP4 (MXFP4) is applied only to MoE Experts; QAT keeps capability essentially on par.<\/li>\n<li>DFlash predicts a whole masked block per forward pass, hitting 6.30 average acceptance length in coding.<\/li>\n<li>UltraSpeed runs on a single 8-GPU node via an application-based API trial, June 9\u201323, 2026.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\"><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<div>\n<div class=\"mtp-mus-header\">\n<div class=\"mtp-mus-eyebrow\">GUIDE \u2022 INFERENCE SYSTEMS<\/div>\n<h2 class=\"mtp-mus-title\">MiMo-V2.5-Pro-UltraSpeed: 1000+ Tokens Per Second on a 1T Model<\/h2>\n<p class=\"mtp-mus-sub\">Xiaomi MiMo &amp; TileRT \u2014 FP4 quantization, DFlash speculative decoding, and a microsecond-scale runtime.<\/p>\n<\/div>\n<div class=\"mtp-mus-viewport\">\n<div 
class=\"mtp-mus-track\">\n<p>      <!-- Slide 1 --><\/p>\n<section class=\"mtp-mus-slide\">\n<div class=\"mtp-mus-step\">01 \/ 08<\/div>\n<h3 class=\"mtp-mus-h\">What It Is<\/h3>\n<ul class=\"mtp-mus-list\">\n<li>Xiaomi\u2019s MiMo team built it with the TileRT systems group.<\/li>\n<li>It decodes over 1000 tokens\/s on a 1-trillion-parameter model.<\/li>\n<li>Demos show generation peaks near 1200 tokens\/s.<\/li>\n<li>It runs on commodity GPUs, a single standard 8-GPU node.<\/li>\n<li>Released June 8, 2026.<\/li>\n<\/ul>\n<div class=\"mtp-mus-stats\">\n<div class=\"mtp-mus-stat\"><span class=\"mtp-mus-num\">1000+<\/span><span class=\"mtp-mus-lbl\">tokens \/ second<\/span><\/div>\n<div class=\"mtp-mus-stat\"><span class=\"mtp-mus-num\">1T<\/span><span class=\"mtp-mus-lbl\">parameters (MoE)<\/span><\/div>\n<div class=\"mtp-mus-stat\"><span class=\"mtp-mus-num\">8<\/span><span class=\"mtp-mus-lbl\">commodity GPUs<\/span><\/div>\n<\/div>\n<\/section>\n<p>      <!-- Slide 2 --><\/p>\n<section class=\"mtp-mus-slide\">\n<div class=\"mtp-mus-step\">02 \/ 08<\/div>\n<h3 class=\"mtp-mus-h\">Three Layers Working Together<\/h3>\n<ul class=\"mtp-mus-list\">\n<li><b>FP4 quantization<\/b> shrinks weights and eases bandwidth pressure.<\/li>\n<li><b>DFlash<\/b> speculative decoding predicts many tokens in parallel.<\/li>\n<li><b>TileRT<\/b> executes the whole pipeline at microsecond scale.<\/li>\n<li>Xiaomi calls this approach extreme model-system codesign.<\/li>\n<li>No single technique is enough; all three must align.<\/li>\n<\/ul>\n<\/section>\n<p>      <!-- Slide 3 --><\/p>\n<section class=\"mtp-mus-slide\">\n<div class=\"mtp-mus-step\">03 \/ 08<\/div>\n<h3 class=\"mtp-mus-h\">Layer 1 \u2014 FP4 Quantization<\/h3>\n<ul class=\"mtp-mus-list\">\n<li>Uses the MXFP4 format to lower memory and bandwidth cost.<\/li>\n<li>Applied selectively to the MoE Experts only.<\/li>\n<li>Other modules keep higher precision (FP8, per TileRT).<\/li>\n<li>Experts hold most parameters and tolerate 
quantization best.<\/li>\n<li>QAT keeps capability essentially on par with the original.<\/li>\n<\/ul>\n<\/section>\n<p>      <!-- Slide 4 --><\/p>\n<section class=\"mtp-mus-slide\">\n<div class=\"mtp-mus-step\">04 \/ 08<\/div>\n<h3 class=\"mtp-mus-h\">Layer 2 \u2014 DFlash Speculative Decoding<\/h3>\n<ul class=\"mtp-mus-list\">\n<li>A research-community method using block-level masked parallel prediction.<\/li>\n<li>The draft model fills a whole block in one forward pass.<\/li>\n<li>It uses Sliding Window Attention; block size capped at 8.<\/li>\n<li>Rejection sampling keeps the output lossless.<\/li>\n<\/ul>\n<div class=\"mtp-mus-tablewrap\">\n<table class=\"mtp-mus-table\">\n<thead>\n<tr>\n<th>Scenario<\/th>\n<th>Acceptance Length<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Coding<\/td>\n<td>6.30<\/td>\n<\/tr>\n<tr>\n<td>Math \/ Reasoning<\/td>\n<td>5.56<\/td>\n<\/tr>\n<tr>\n<td>Agent<\/td>\n<td>4.29<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<\/section>\n<p>      <!-- Slide 5 --><\/p>\n<section class=\"mtp-mus-slide\">\n<div class=\"mtp-mus-step\">05 \/ 08<\/div>\n<h3 class=\"mtp-mus-h\">Layer 3 \u2014 TileRT Runtime<\/h3>\n<ul class=\"mtp-mus-list\">\n<li>At 1000 TPS, each operator runs for only microseconds.<\/li>\n<li>A Persistent Engine Kernel stays resident on the GPU.<\/li>\n<li>Warp Specialization splits data movement, compute, and communication.<\/li>\n<li>Small ops like RMSNorm and RoPE become bottlenecks here.<\/li>\n<li>The runtime was co-designed with the FP4 and DFlash choices.<\/li>\n<\/ul>\n<\/section>\n<p>      <!-- Slide 6 --><\/p>\n<section class=\"mtp-mus-slide\">\n<div class=\"mtp-mus-step\">06 \/ 08<\/div>\n<h3 class=\"mtp-mus-h\">Where It Fits<\/h3>\n<ul class=\"mtp-mus-list\">\n<li><b>Parallel reasoning:<\/b> many Best-of-N or tree-search paths at once.<\/li>\n<li><b>Coding agents:<\/b> less wait between agent steps.<\/li>\n<li><b>Real-time loops:<\/b> trading signals, fraud interception, live dialogue.<\/li>\n<li><b>Interactive 
prototyping:<\/b> a Snake game in about 10 seconds.<\/li>\n<\/ul>\n<\/section>\n<p>      <!-- Slide 7 --><\/p>\n<section class=\"mtp-mus-slide\">\n<div class=\"mtp-mus-step\">07 \/ 08<\/div>\n<h3 class=\"mtp-mus-h\">Standard vs UltraSpeed<\/h3>\n<div class=\"mtp-mus-tablewrap\">\n<table class=\"mtp-mus-table\">\n<thead>\n<tr>\n<th>Dimension<\/th>\n<th>MiMo-V2.5-Pro<\/th>\n<th>UltraSpeed<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Decode speed<\/td>\n<td>Baseline<\/td>\n<td>~10\u00d7 (1000+ TPS)<\/td>\n<\/tr>\n<tr>\n<td>Price<\/td>\n<td>1\u00d7<\/td>\n<td>3\u00d7<\/td>\n<\/tr>\n<tr>\n<td>Weights<\/td>\n<td>Standard<\/td>\n<td>FP4 MoE Experts (QAT)<\/td>\n<\/tr>\n<tr>\n<td>Decoding<\/td>\n<td>Autoregressive<\/td>\n<td>DFlash speculative<\/td>\n<\/tr>\n<tr>\n<td>Access<\/td>\n<td>Standard plans<\/td>\n<td>API only, by application<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<\/section>\n<p>      <!-- Slide 8 --><\/p>\n<section class=\"mtp-mus-slide\">\n<div class=\"mtp-mus-step\">08 \/ 08<\/div>\n<h3 class=\"mtp-mus-h\">Access, Pricing &amp; Open Source<\/h3>\n<ul class=\"mtp-mus-list\">\n<li>API trial runs June 9 to June 23, 2026 (Beijing time).<\/li>\n<li>Pricing is 3\u00d7 the standard rate for roughly 10\u00d7 speed.<\/li>\n<li>API only; the Token Plan is not supported.<\/li>\n<li>Checkpoint open-sourced: MiMo-V2.5-Pro-FP4-DFlash on Hugging Face.<\/li>\n<li>TileRT has open-sourced select modules on GitHub.<\/li>\n<\/ul>\n<\/section><\/div>\n<\/div>\n<div class=\"mtp-mus-nav\">\n    <button type=\"button\" class=\"mtp-mus-btn\" data-mus=\"prev\" aria-label=\"Previous slide\">\u2190 Prev<\/button>\n<div class=\"mtp-mus-dots\" data-mus=\"dots\"><\/div>\n<p>    <button type=\"button\" class=\"mtp-mus-btn\" data-mus=\"next\" aria-label=\"Next slide\">Next \u2192<\/button>\n  <\/p><\/div>\n<div class=\"mtp-mus-tagline\">\n    <span class=\"mtp-mus-brand\">Marktechpost<\/span><br \/>\n    <span class=\"mtp-mus-taglinetxt\">AI research, models, and developer tools \u2014 
explained for engineers.<\/span>\n  <\/div>\n<\/div>\n<p class=\"wp-block-paragraph\">\n<\/p><p class=\"wp-block-paragraph\">\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<\/p><p class=\"wp-block-paragraph\">\n<\/p><p class=\"wp-block-paragraph\">Check out\u00a0the\u00a0<strong><a href=\"https:\/\/huggingface.co\/XiaomiMiMo\/MiMo-V2.5-Pro-FP4-DFlash\" target=\"_blank\" rel=\"noreferrer noopener\">Model weights<\/a><\/strong>\u00a0and<strong>\u00a0<a href=\"https:\/\/mimo.xiaomi.com\/blog\/mimo-tilert-1000tps\" target=\"_blank\" rel=\"noreferrer noopener\">Technical details<\/a>.\u00a0<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/06\/08\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/\">Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Inference speed is becoming a competitive metric for large language models. Xiaomi\u2019s MiMo team just released MiMo-V2.5-Pro-UltraSpeed, built in collaboration with the TileRT systems group. It decodes faster than 1000 tokens per second on a 1-trillion-parameter model. Xiaomi team describes this as a first at trillion-parameter scale. Demos show generation peaks near 1200 tokens per second. The notable part is the hardware: it runs on commodity GPUs, not custom silicon. What is MiMo-V2.5-Pro-UltraSpeed UltraSpeed is a high-speed serving mode for the existing MiMo-V2.5-Pro model. The base model uses a Mixture-of-Experts (MoE) architecture at trillion-parameter scale. UltraSpeed targets generation speed rather than model capability. It changes how fast the model produces output tokens. The speedup comes from three coordinated techniques across the model and the serving system. Xiaomi calls this approach extreme model-system codesign. Crucially, the entire stack runs on a single standard 8-GPU commodity node. 
The Speed Case: Three Layers Working Together The first layer is FP4 quantization. At trillion scale, FP8 or FP16 weights create heavy memory and bandwidth pressure. Lower bit-width weights move through memory faster, which directly lifts decode speed. Xiaomi uses the MXFP4 format, applied selectively to the MoE Experts only. Other modules keep higher precision, reported as FP8 by TileRT. Experts hold most parameters and tolerate quantization best, so the tradeoff is favorable. Quantization-Aware Training (QAT) keeps benchmark quality essentially on par with the original. The second layer is DFlash speculative decoding, covered in detail below. The third layer is TileRT, the system that executes everything on the GPU. Each technique alone is not enough. The 1000 TPS result needs all three aligned tightly. DFlash: Parallel Drafting Without a Serial Bottleneck Standard speculative decoding uses a small draft model to guess upcoming tokens. The large model then verifies those guesses in parallel. Rejection sampling keeps output identical to normal decoding, so quality is lossless. The problem is that the draft model still generates tokens one at a time. DFlash, a method from the research community, removes that constraint. It uses block-level masked parallel prediction. The draft model fills a whole block of masked positions in one forward pass. Xiaomi tuned DFlash with the Muon second-order optimizer and model self-distillation. The draft model uses Sliding Window Attention (SWA) only, matching the MiMo-V2 design. This makes per-prediction compute constant rather than growing with context length. Block size is capped at 8 to limit verification cost and raise concurrency. Acceptance length measures how many draft tokens survive verification each round. Scenario Acceptance Length Coding 6.30 Math \/ Reasoning 5.56 Agent 4.29 In coding, six to seven of eight draft tokens are accepted per round. Some samples reach a maximum of 7.14. 
TileRT: Squeezing the Microseconds At 1000 TPS, each operator runs for only microseconds. Traditional systems launch operators one by one, and each launch costs time. Those gaps fracture the execution stream and become the real bottleneck. TileRT replaces this with a Persistent Engine Kernel that stays resident on the GPU. It uses Warp Specialization to split data movement, compute, and communication into coordinated roles. Small operations like RMSNorm, RoPE, and KV cache writes turn into bottlenecks at this scale. The system was co-designed with the FP4 and DFlash choices, not added afterward. Use Cases The release targets latency-sensitive work where waiting breaks the loop: Parallel reasoning: run many Best-of-N or tree-search paths within the same wall-clock time. Coding agents: faster code generation cuts the wait between agent steps. Real-time decision loops: trading signal generation, fraud interception, and live dialogue. Interactive prototyping: demos show a Snake game in about 10 seconds and a macOS interface in about one minute. These are throughput-bound workloads where raw token speed is the binding constraint. How It Compares The first table contrasts the two routes to extreme decode speed. Approach Hardware How speed is achieved Cerebras Wafer-Scale integration (custom) Scale on a single custom wafer Groq Custom architecture Pure on-chip SRAM MiMo \u00d7 TileRT Commodity GPUs (8-GPU node) Model-system codesign: FP4 + DFlash + TileRT The second table compares the standard model with the UltraSpeed mode. Dimension MiMo-V2.5-Pro MiMo-V2.5-Pro-UltraSpeed Decode speed Baseline ~10\u00d7 faster (1000+ TPS) Price 1\u00d7 3\u00d7 Weight precision Standard FP4 MoE Experts via QAT Decoding Standard autoregressive DFlash speculative decoding Access Standard model plans API only, application-based trial Token Plan Supported Not supported Access, Pricing, and Open Source UltraSpeed ships through a limited, application-based window. 
The API trial runs June 9 to June 23, 2026. Pricing is 3\u00d7 the standard MiMo-V2.5-Pro rate, for roughly 10\u00d7 the speed. It is API only, and the Token Plan is not supported. Approved users also receive free Chat access during the trial. Chat limits apply: 10 queue entries daily, 30-minute sessions, and 5-minute idle release. Xiaomi open-sourced the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face. TileRT has open-sourced select modules on GitHub. Strengths and Limitations Strengths 1000+ TPS on a 1T model without custom silicon. Lossless decoding through rejection sampling in DFlash. FP4 applied only where tolerance is highest, preserving quality. An open checkpoint lets the community test the claims. Limitations Access is gated, short, and approval-based at launch. Pricing triples per token versus the standard model. Acceptance length drops in open-ended conversation. Independent third-party speed verification is not yet public. Key Takeaways Xiaomi MiMo and TileRT decode a 1-trillion-parameter model past 1000 tokens per second on commodity GPUs. The speedup comes from three layers: FP4 quantization, DFlash speculative decoding, and the TileRT runtime. FP4 (MXFP4) is applied only to MoE Experts; QAT keeps capability essentially on par. DFlash predicts a whole masked block per forward pass, hitting 6.30 average acceptance length in coding. UltraSpeed runs on a single 8-GPU node via an application-based API trial, June 9\u201323, 2026. Marktechpost\u2019s Visual Explainer GUIDE \u2022 INFERENCE SYSTEMS MiMo-V2.5-Pro-UltraSpeed: 1000+ Tokens Per Second on a 1T Model Xiaomi MiMo &amp; TileRT \u2014 FP4 quantization, DFlash speculative decoding, and a microsecond-scale runtime. 01 \/ 08 What It Is Xiaomi\u2019s MiMo team built it with the TileRT systems group. 
It decodes over 1000 tokens\/s on a 1-trillion-parameter model.<\/p>","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-95987","post","type-post","status-publish","format-standard","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, 
max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/de\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/de\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-06-08T17:39:55+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Verfasst von\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Gesch\u00e4tzte Lesezeit\" \/>\n\t<meta name=\"twitter:data2\" content=\"7\u00a0Minuten\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/\"},\"author\":{\"name\":\"admin 
NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs\",\"datePublished\":\"2026-06-08T17:39:55+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/\"},\"wordCount\":1407,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/\",\"url\":\"https:\/\/youzum.net\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/\",\"name\":\"Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs - 
YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"datePublished\":\"2026-06-08T17:39:55+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association 
Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/de\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/de\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/","og_locale":"de_DE","og_type":"article","og_title":"Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/de\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-06-08T17:39:55+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"Verfasst von":"admin NU","Gesch\u00e4tzte Lesezeit":"7\u00a0Minuten"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity 
GPUs","datePublished":"2026-06-08T17:39:55+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/"},"wordCount":1407,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"de","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/","url":"https:\/\/youzum.net\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/","name":"Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs - YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"datePublished":"2026-06-08T17:39:55+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/#breadcrumb"},"inLanguage":"de","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/xiaomi-mimo-and-tilert-push-a-1-trillion-parameter-model-past-1000-tokens-per-second-on-commodity-gpus\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per 
Second on Commodity GPUs"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"de"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin NU"},"url":"https:\/\/youzum.net\/de\/members\/adminnu\/"}]}},"rttpg_featured_image_url":null,"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/de\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/de\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/de\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a 
href=\"https:\/\/youzum.net\/de\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/de\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"Inference speed is becoming a competitive metric for large language models. Xiaomi\u2019s MiMo team just released MiMo-V2.5-Pro-UltraSpeed, built in collaboration with the TileRT systems group. It decodes faster than 1000 tokens per second on a 1-trillion-parameter model. The Xiaomi team describes this as a first at trillion-parameter scale. Demos show generation peaks near 1200 tokens per&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/posts\/95987","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/comments?post=95987"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/posts\/95987\/revisions"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/media?parent=95987"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/categories?post=95987"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/de\/wp-json\/wp\/v2\/tags?post=95987"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}