{"id":76803,"date":"2026-03-11T12:22:54","date_gmt":"2026-03-11T12:22:54","guid":{"rendered":"https:\/\/youzum.net\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/"},"modified":"2026-03-11T12:22:54","modified_gmt":"2026-03-11T12:22:54","slug":"fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion","status":"publish","type":"post","link":"https:\/\/youzum.net\/es\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/","title":{"rendered":"Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion"},"content":{"rendered":"<p>The landscape of Text-to-Speech (TTS) is moving away from modular pipelines toward integrated Large Audio Models (LAMs). Fish Audio\u2019s release of S2-Pro, the flagship model within the Fish Speech ecosystem, represents a shift toward open architectures capable of high-fidelity, multi-speaker synthesis with sub-150ms latency. The release provides a framework for zero-shot voice cloning and granular emotional control using a Dual-Auto-Regressive (AR) approach.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Architecture: The Dual-AR Framework and RVQ<\/strong><\/h3>\n<p>The fundamental technical distinction in Fish Audio S2-Pro is its hierarchical Dual-AR architecture. Traditional TTS models often struggle with the trade-off between sequence length and acoustic detail. S2-Pro addresses this by bifurcating the generation process into two specialized stages: a \u2018Slow AR\u2019 model and a \u2018Fast AR\u2019 model.<\/p>\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>The Slow AR Model (4B Parameters):<\/strong> This component operates on the time-axis. It is responsible for processing linguistic input and generating semantic tokens. 
By utilizing a larger parameter count (approximately 4 billion), the Slow AR model captures long-range dependencies, prosody, and the structural nuances of speech.<\/li>\n<li><strong>The Fast AR Model (400M Parameters):<\/strong> This component processes the acoustic dimension. It predicts the residual codebooks for each semantic token. This smaller, faster model ensures that the high-frequency details of the audio\u2014timbre, breathiness, and texture\u2014are generated with high efficiency.<\/li>\n<\/ol>\n<p>This system relies on <strong>Residual Vector Quantization (RVQ)<\/strong>. In this setup, raw audio is compressed into discrete tokens across multiple layers (codebooks). The first layer captures the primary acoustic features, while subsequent layers capture the \u2018residuals\u2019 or the remaining errors from the previous layer. This allows the model to reconstruct high-fidelity 44.1kHz audio while maintaining a manageable token count for the Transformer architecture.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Emotional Control via In-Context Learning and Inline Tags<\/strong><\/h3>\n<p>Fish Audio S2-Pro achieves what the developers describe as \u2018absurdly controllable emotion\u2019 through two primary mechanisms: zero-shot in-context learning and natural language inline control.<\/p>\n<p><strong>In-Context Learning (ICL):<\/strong><\/p>\n<p>Unlike older generations of TTS that required explicit fine-tuning to mimic a specific voice, S2-Pro utilizes the Transformer\u2019s ability to perform in-context learning. By providing a reference audio clip\u2014ideally between 10 and 30 seconds\u2014the model extracts the speaker\u2019s identity and emotional state. The model treats this reference as a prefix in its context window, allowing it to continue the \u201csequence\u201d in the same voice and style.<\/p>\n<p><strong>Inline Control Tags:<\/strong><\/p>\n<p>The model supports dynamic emotional transitions within a single generation pass. 
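<\/p>
<p>As an illustration, a tagged prompt can be thought of as decomposing into (tag, text) segments before synthesis. The following minimal Python sketch is purely illustrative: it is not part of the Fish Speech codebase, and the tag vocabulary and parsing step are assumptions for demonstration.<\/p>

```python
# Illustrative only: decompose a prompt carrying inline control tags such as
# [whisper] or [laugh] into (tag, text) segments. The tag names and this
# parsing step are hypothetical, not Fish Speech internals.
def split_inline_tags(prompt):
    segments, tag, buf, i = [], None, [], 0
    while i < len(prompt):
        if prompt[i] == '[':
            j = prompt.find(']', i)
            if j == -1:  # unmatched bracket: treat as literal text
                buf.append(prompt[i])
                i += 1
                continue
            text = ''.join(buf).strip()
            if text:
                segments.append((tag, text))
            tag, buf, i = prompt[i + 1:j], [], j + 1
        else:
            buf.append(prompt[i])
            i += 1
    tail = ''.join(buf).strip()
    if tail:
        segments.append((tag, tail))
    return segments

print(split_inline_tags('[whisper] I have a secret [laugh] that I cannot tell you.'))
# [('whisper', 'I have a secret'), ('laugh', 'that I cannot tell you.')]
```

<p>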
Because the model was trained on data containing descriptive linguistic markers, developers can insert natural language tags directly into the text prompt. For example:<\/p>\n<p><code>[whisper] I have a secret [laugh] that I cannot tell you.<\/code><\/p>\n<p>The model interprets these tags as instructions to modify the acoustic tokens in real time, adjusting pitch, intensity, and rhythm without requiring a separate emotional embedding or external control vector.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Performance Benchmarks and SGLang Integration<\/strong><\/h3>\n<p>When integrating TTS into real-time applications, the primary constraint is \u2018Time to First Audio\u2019 (TTFA). Fish Audio S2-Pro is optimized for sub-150ms latency, with benchmarks on NVIDIA H200 hardware reaching approximately 100ms.<\/p>\n<p><strong>Several technical optimizations contribute to this performance:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>SGLang and RadixAttention:<\/strong> S2-Pro is designed to work with SGLang, a high-performance serving framework. It utilizes <strong>RadixAttention<\/strong>, which allows for efficient Key-Value (KV) cache management. In a production environment where the same \u201cmaster\u201d voice prompt (reference clip) is used repeatedly, RadixAttention caches the prefix\u2019s KV states. This eliminates the need to re-compute the reference audio for every request, significantly reducing the prefill time.<\/li>\n<li><strong>Multi-Speaker Single-Pass Generation:<\/strong> The architecture allows for multiple speaker identities to be present within the same context window. 
This permits the generation of complex dialogues or multi-character narrations in a single inference call, avoiding the latency overhead of switching models or reloading weights for different speakers.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><strong>Technical Implementation and Data Scaling<\/strong><\/h3>\n<p>The Fish Speech repository provides a Python-based implementation utilizing PyTorch. The model was trained on a diverse dataset comprising over 300,000 hours of multi-lingual audio. This scale is what enables the model\u2019s robust performance across different languages and its ability to handle \u2018non-verbal\u2019 vocalizations like sighs or hesitations.<\/p>\n<p>The training pipeline involves:<\/p>\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>VQ-GAN Training:<\/strong> Training the quantizer to map audio into a discrete latent space.<\/li>\n<li><strong>LLM Training:<\/strong> Training the Dual-AR transformers to predict those latent tokens based on text and acoustic prefixes.<\/li>\n<\/ol>\n<p>The VQ-GAN used in S2-Pro is specifically tuned to minimize artifacts during the decoding process, ensuring that even at high compression ratios, the reconstructed audio remains \u2018transparent\u2019 (indistinguishable from the source to the human ear).<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>Dual-AR Architecture (Slow\/Fast):<\/strong> Unlike single-stage models, S2-Pro splits tasks between a <strong>4B parameter \u2018Slow AR\u2019 model<\/strong> (for linguistic and prosodic structure) and a <strong>400M parameter \u2018Fast AR\u2019 model<\/strong> (for acoustic refinement), optimizing both detail and speed.<\/li>\n<li><strong>Sub-150ms Latency:<\/strong> Engineered for real-time conversational AI, the model achieves a <strong>Time-to-First-Audio (TTFA) of ~100ms<\/strong> on high-end hardware, making it suitable for live agents and interactive 
applications.<\/li>\n<li><strong>Hierarchical RVQ Encoding:<\/strong> By using <strong>Residual Vector Quantization<\/strong>, the system compresses 44.1kHz audio into discrete tokens across multiple layers. This allows the model to reconstruct complex vocal textures\u2014including breaths and sighs\u2014without the computational bloat of raw waveforms.<\/li>\n<li><strong>Zero-Shot In-Context Learning:<\/strong> Developers can clone a voice and its emotional state by providing a <strong>10\u201330 second reference clip<\/strong>. The model treats this as a prefix, adopting the speaker\u2019s timbre and prosody without requiring additional fine-tuning.<\/li>\n<li><strong>RadixAttention &amp; SGLang Integration:<\/strong> Optimized for production, S2-Pro leverages <strong>RadixAttention<\/strong> to cache KV states of voice prompts. This allows for nearly instant generation when using the same speaker repeatedly, drastically reducing prefill overhead.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0<strong><a href=\"https:\/\/huggingface.co\/fishaudio\/s2-pro\" target=\"_blank\" rel=\"noreferrer noopener\">Model Card<\/a> <\/strong>and<strong> <a href=\"https:\/\/github.com\/fishaudio\/fish-speech\" target=\"_blank\" rel=\"noreferrer noopener\">Repo<\/a>.\u00a0<\/strong>Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">120k+ ML SubReddit<\/a><\/strong>\u00a0and Subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! 
Are you on Telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">You can now join us on Telegram as well.<\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/03\/10\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/\">Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>The landscape of Text-to-Speech (TTS) is moving away from modular pipelines toward integrated Large Audio Models (LAMs). Fish Audio\u2019s release of S2-Pro, the flagship model within the Fish Speech ecosystem, represents a shift toward open architectures capable of high-fidelity, multi-speaker synthesis with sub-150ms latency. The release provides a framework for zero-shot voice cloning and granular emotional control using a Dual-Auto-Regressive (AR) approach. 
The post Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion appeared first on MarkTechPost.<\/p>","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-76803","post","type-post","status-publish","format-standard","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, 
max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/es\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/\" \/>\n<meta property=\"og:locale\" content=\"es_ES\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/es\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-03-11T12:22:54+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Escrito por\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Tiempo de lectura\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutos\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/\"},\"author\":{\"name\":\"admin 
NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion\",\"datePublished\":\"2026-03-11T12:22:54+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/\"},\"wordCount\":983,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"es\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/\",\"url\":\"https:\/\/youzum.net\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/\",\"name\":\"Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion - 
YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"datePublished\":\"2026-03-11T12:22:54+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/#breadcrumb\"},\"inLanguage\":\"es\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"es\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association 
Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"es\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"es\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/es\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/es\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/","og_locale":"es_ES","og_type":"article","og_title":"Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/es\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-03-11T12:22:54+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"Escrito por":"admin NU","Tiempo de lectura":"5 minutos"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech 
(TTS) with Absurdly Controllable Emotion","datePublished":"2026-03-11T12:22:54+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/"},"wordCount":983,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"es","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/","url":"https:\/\/youzum.net\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/","name":"Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"datePublished":"2026-03-11T12:22:54+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/#breadcrumb"},"inLanguage":"es","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"es"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association 
Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"es","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"es","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin NU"},"url":"https:\/\/youzum.net\/es\/members\/adminnu\/"}]}},"rttpg_featured_image_url":null,"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/es\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/es\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/es\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/es\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/es\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"The landscape of Text-to-Speech (TTS) is moving away from modular pipelines toward integrated Large Audio Models (LAMs). Fish Audio\u2019s release of S2-Pro, the flagship model within the Fish Speech ecosystem, represents a shift toward open architectures capable of high-fidelity, multi-speaker synthesis with sub-150ms latency. 
The release provides a framework for zero-shot voice cloning and granular&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/posts\/76803","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/comments?post=76803"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/posts\/76803\/revisions"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/media?parent=76803"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/categories?post=76803"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/tags?post=76803"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}