<p>Xiaomi’s MiMo team released MiMo-Audio, a 7-billion-parameter audio-language model that trains a single next-token objective over interleaved text and discretized speech, scaling pretraining beyond 100 million hours of audio.</p>
<h3 class="wp-block-heading"><strong>What’s actually new?</strong></h3>
<p>Instead of relying on task-specific heads or lossy acoustic tokens, MiMo-Audio uses a bespoke residual vector quantization (RVQ) tokenizer that targets both semantic fidelity and high-quality reconstruction. The tokenizer runs at 25 Hz and outputs 8 RVQ layers (≈200 tokens/s), giving the LM access to “lossless” speech features it can model autoregressively alongside text.</p>
<h3 class="wp-block-heading"><strong>Architecture: patch encoder → 7B LLM → patch decoder</strong></h3>
<p>To handle the audio/text rate mismatch, the system packs four timesteps into one patch for LM consumption (downsampling 25 Hz → 6.25 Hz), then reconstructs the full-rate RVQ streams with a causal patch decoder. A delayed multi-layer RVQ generation scheme staggers predictions per codebook to stabilize synthesis and respect inter-layer dependencies.
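</p>
<p>These rates pin down sequence lengths. A minimal sketch of the frame/token/patch arithmetic (the constants come from the report; the helper itself is our illustration, not part of the release):</p>

```python
# Token-budget arithmetic for MiMo-Audio-style RVQ speech tokens.
# Constants are from the report; token_budget is an illustrative helper.
TOKENIZER_HZ = 25    # RVQ frames per second
NUM_CODEBOOKS = 8    # active RVQ layers per frame (8 x 25 Hz = ~200 tokens/s)
PATCH_SIZE = 4       # timesteps grouped into one LM patch (25 Hz -> 6.25 Hz)

def token_budget(seconds: float) -> dict:
    """Frame, RVQ-token, and LM-patch counts for a clip of this length."""
    frames = int(seconds * TOKENIZER_HZ)           # 25 Hz frame stream
    return {
        "frames": frames,
        "rvq_tokens": frames * NUM_CODEBOOKS,      # what the tokenizer emits
        "lm_patches": frames // PATCH_SIZE,        # positions the 7B LM sees
    }

print(token_budget(60.0))  # one minute of audio
```

<p>One minute of audio costs the LM only 375 patch positions even though the tokenizer emits 12,000 RVQ tokens, which is why patchification keeps long speech tractable.</p>
<p>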
All three parts (patch encoder, MiMo-7B backbone, and patch decoder) are trained under a single next-token objective.</p>
<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img fetchpriority="high" decoding="async" width="1024" height="509" src="https://www.marktechpost.com/wp-content/uploads/2025/09/Screenshot-2025-09-20-at-1.13.57-AM-1-1024x509.png" alt="" class="wp-image-74699" /><figcaption class="wp-element-caption">https://xiaomimimo.github.io/MiMo-Audio-Demo/</figcaption></figure>
</div>
<h3 class="wp-block-heading"><strong>Scale is the algorithm</strong></h3>
<p>Training proceeds in two phases: (1) an “understanding” stage that optimizes text-token loss over interleaved speech-text corpora, and (2) a joint “understanding + generation” stage that turns on audio losses for speech continuation, S2T/T2S tasks, and instruction-style
data. The report emphasizes a compute/data threshold at which few-shot behavior appears to “switch on,” echoing the emergence curves seen in large text-only LMs.</p>
<h3 class="wp-block-heading"><strong>Benchmarks: speech intelligence and general audio</strong></h3>
<p>MiMo-Audio is evaluated on speech-reasoning suites (e.g., SpeechMMLU) and broad audio-understanding benchmarks (e.g., MMAU), reporting strong scores across speech, sound, and music, and a reduced “modality gap” between text-only and speech-in/speech-out settings. Xiaomi also releases <strong>MiMo-Audio-Eval</strong>, a public toolkit for reproducing these results. Listen-and-respond demos (speech continuation, voice/emotion conversion, denoising, and speech translation) are available online.</p>
<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img decoding="async" width="1024" height="479" src="https://www.marktechpost.com/wp-content/uploads/2025/09/Screenshot-2025-09-20-at-1.15.31-AM-1-1024x479.png" alt="" class="wp-image-74703" /><figcaption class="wp-element-caption">https://xiaomimimo.github.io/MiMo-Audio-Demo/</figcaption></figure>
</div>
<h3 class="wp-block-heading"><strong>Why does this matter?</strong></h3>
<p>The approach is intentionally simple: no multi-head task tower, no bespoke ASR/TTS objectives at pretraining time, just GPT-style next-token prediction over <em>lossless</em> audio tokens plus text. The key engineering ideas are (i) a tokenizer the LM can actually use without throwing away prosody and speaker identity; (ii) patchification to keep sequence lengths manageable; and (iii) delayed RVQ decoding to preserve quality at generation time. For teams building spoken agents, these design choices translate into few-shot speech-to-speech editing and robust speech continuation with minimal task-specific finetuning.</p>
<h3 class="wp-block-heading"><strong>6 Technical Takeaways</strong></h3>
<ol class="wp-block-list">
<li><strong>High-Fidelity Tokenization</strong><br />MiMo-Audio uses a custom RVQ tokenizer operating at 25 Hz with 8 active codebooks, so speech tokens preserve prosody, timbre, and speaker identity while remaining LM-friendly.</li>
<li><strong>Patchified Sequence Modeling</strong><br />The model reduces sequence length by grouping 4 timesteps into one patch (25 Hz → 6.25 Hz), letting the 7B LLM handle long speech efficiently without discarding detail.</li>
<li><strong>Unified Next-Token Objective</strong><br />Rather than separate heads for ASR, TTS, or dialogue, MiMo-Audio trains under a single next-token prediction loss across interleaved text and audio, simplifying the architecture while supporting multi-task generalization.</li>
<li><strong>Emergent Few-Shot Abilities</strong><br />Few-shot behaviors such as speech continuation, voice
conversion, emotion transfer, and speech translation emerge once training surpasses a large-scale data threshold (~100M hours, trillions of tokens).</li>
<li><strong>Benchmark Leadership</strong><br />MiMo-Audio sets state-of-the-art scores on SpeechMMLU (S2S 69.1, T2S 71.5) and MMAU (66.0 overall), while shrinking the text-to-speech modality gap to just 3.4 points.</li>
<li><strong>Open Ecosystem Release</strong><br />Xiaomi provides the tokenizer, 7B checkpoints (base and instruct), the MiMo-Audio-Eval toolkit, and public demos, enabling researchers and developers to test and extend speech-to-speech intelligence in open-source pipelines.</li>
</ol>
<h3 class="wp-block-heading"><strong>Summary</strong></h3>
<p>MiMo-Audio demonstrates that high-fidelity, RVQ-based “lossless” tokenization, combined with patchified next-token pretraining at scale, is sufficient to unlock few-shot speech intelligence without task-specific heads. The 7B stack (tokenizer → patch encoder → LLM → patch decoder) bridges the audio/text rate gap (25 Hz → 6.25 Hz) and preserves prosody and speaker identity via delayed multi-layer RVQ decoding.
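</p>
<p>The delayed multi-layer decoding can be pictured as a staggered emission order. A hedged sketch follows: the offset-by-one schedule is our illustration, since the report only states that per-codebook predictions are staggered to respect inter-layer dependencies.</p>

```python
# Illustrative delayed multi-codebook schedule: codebook k of frame t is
# emitted at step t + k, so deeper residual layers are generated only after
# the shallower layers of the same frame are fixed. Names are ours.
NUM_CODEBOOKS = 8

def delayed_schedule(num_frames: int) -> list:
    """Return (step, frame, codebook) triples in emission order."""
    order = []
    for step in range(num_frames + NUM_CODEBOOKS - 1):
        for k in range(NUM_CODEBOOKS):
            t = step - k                      # frame whose codebook k is due now
            if 0 <= t < num_frames:
                order.append((step, t, k))
    return order

sched = delayed_schedule(3)
```

<p>At step 0 only codebook 0 of frame 0 is emitted; the deepest codebook of frame 0 does not appear until step 7, by which point every shallower layer of that frame is already decided.</p>
<p>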
Empirically, the model narrows the text↔speech modality gap, generalizes across speech/sound/music benchmarks, and supports in-context S2S editing and continuation.</p>
<hr class="wp-block-separator has-alpha-channel-opacity" />
<p>Check out the <strong><a href="https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/MiMo-Audio-Technical-Report.pdf" target="_blank" rel="noreferrer noopener">Paper</a>, <a href="https://xiaomimimo.github.io/MiMo-Audio-Demo/" target="_blank" rel="noreferrer noopener">technical details</a></strong>, and the <strong><a href="https://github.com/XiaomiMiMo/MiMo-Audio" target="_blank" rel="noreferrer noopener">GitHub page</a></strong>.</p>
<p>The post <a href="https://www.marktechpost.com/2025/09/20/xiaomi-released-mimo-audio-a-7b-speech-language-model-trained-on-100m-hours-with-high-fidelity-discrete-tokens/">Xiaomi Released MiMo-Audio, a 7B Speech Language Model Trained on 100M+ Hours with High-Fidelity Discrete Tokens</a>
appeared first on <a href="https://www.marktechpost.com/">MarkTechPost</a>.</p>
NU","author_link":"https:\/\/youzum.net\/it\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/it\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/it\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/it\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/it\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"Xiaomi\u2019s MiMo team released MiMo-Audio, a 7-billion-parameter audio-language model that runs a single next-token objective over interleaved text and discretized speech, scaling pretraining beyond 100 million hours of audio. What\u2019s actually new? Instead of relying on task-specific heads or lossy acoustic tokens, MiMo-Audio uses a bespoke RVQ (residual vector quantization) tokenizer that targets both semantic&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/it\/wp-json\/wp\/v2\/posts\/39492","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/it\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/it\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/it\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/it\/wp-json\/wp\/v2\/comments?post=39492"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/it\/wp-json\/wp\/v2\/posts\/39492\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/it\/wp-json\/wp\/v2\/media\/39493"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/it\/wp-json\/wp\/v2\/media?parent=39492"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/it\/wp-json\/wp\/v2\/categories?post=39492"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/it\/wp-json\/wp\/v2\/tags?post=39492"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated
":true}]}}