{"id":95559,"date":"2026-06-06T17:38:16","date_gmt":"2026-06-06T17:38:16","guid":{"rendered":"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/"},"modified":"2026-06-06T17:38:16","modified_gmt":"2026-06-06T17:38:16","slug":"nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time","status":"publish","type":"post","link":"https:\/\/youzum.net\/zh\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/","title":{"rendered":"NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Aware Streaming Model Transcribing 40 Language-Locales in Real Time"},"content":{"rendered":"<p class=\"wp-block-paragraph\">NVIDIA\u2019s Nemotron Speech team has released <a href=\"https:\/\/huggingface.co\/nvidia\/nemotron-3.5-asr-streaming-0.6b\" target=\"_blank\" rel=\"noreferrer noopener\">Nemotron 3.5 ASR<\/a>. It is a 600M-parameter streaming Automatic Speech Recognition (ASR) model. A single checkpoint transcribes 40 language-locales in real time. Punctuation and capitalization are built in natively. The model ships as open weights on Hugging Face. The license is OpenMDW-1.1. The architecture is a Cache-Aware FastConformer-RNNT.<\/p>\n<h2 class=\"wp-block-heading\"><strong>What is Nemotron 3.5 ASR<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Nemotron 3.5 ASR extends <code>nvidia\/nemotron-speech-streaming-en-0.6b<\/code> to many languages. It adds prompt-based language-ID conditioning to the base model. That lets one 600M-parameter checkpoint cover 40 language-locales. No per-language model or model-swapping is required.<\/p>\n<p class=\"wp-block-paragraph\">The model targets two workloads. The first is low-latency streaming for live audio. The second is high-throughput batch transcription. 
Output is production-ready text with proper casing and punctuation. No separate punctuation-restoration step is needed.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img fetchpriority=\"high\" decoding=\"async\" width=\"2560\" height=\"1150\" data-attachment-id=\"80346\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/06\/06\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/kl1-3\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/06\/kl1-1-scaled.png\" data-orig-size=\"2560,1150\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\",\"alt\":\"\"}' data-image-title=\"kl1\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/06\/kl1-1-1024x460.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/06\/kl1-1-scaled.png\" alt=\"\" class=\"wp-image-80346\" \/><figcaption class=\"wp-element-caption\">Image source: https:\/\/huggingface.co\/nvidia\/nemotron-3.5-asr-streaming-0.6b<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>How Cache-Aware FastConformer-RNNT Works<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The model has two main pieces. The first is a Cache-Aware FastConformer encoder with 24 layers. FastConformer is an efficient evolution of the Conformer architecture. It uses linearly scalable attention. The second piece is an RNNT (Recurrent Neural Network Transducer) decoder. RNNT emits text frame by frame as audio streams in.<\/p>\n<p class=\"wp-block-paragraph\">The \u201ccache-aware\u201d design is the efficiency lever. 
Buffered streaming re-processes overlapping audio windows at every step. That repeats the same work and adds delay. This model caches encoder self-attention and convolution activations instead. It reuses those cached states as new audio arrives. So each audio frame is processed exactly once, with no overlap. Compute and end-to-end latency both drop, without an accuracy penalty.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The Latency Knob: <code>att_context_size<\/code><\/strong><\/h2>\n<p class=\"wp-block-paragraph\">One inference setting controls the latency-accuracy tradeoff. It is the attention context size, <code>att_context_size<\/code>. Smaller context emits text sooner but sees less future audio. Larger context raises accuracy at higher latency.<\/p>\n<p class=\"wp-block-paragraph\">The same checkpoint covers the full range. Settings map to chunk sizes of 80ms, 160ms, 320ms, 560ms, and 1.12s. For example, <code>[56,0]<\/code> gives an 80ms ultra-low-latency mode. The <code>[56,13]<\/code> setting gives 1.12s for highest accuracy. Teams pick the operating point at inference time, with no retraining.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Language Detection and Coverage<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The 40 language-locales include English, Spanish, German, and French variants. They also cover Arabic, Japanese, Korean, Mandarin, Hindi, and Thai. Several other European and Nordic languages are included too.<\/p>\n<p class=\"wp-block-paragraph\">Language conditioning works two ways. Setting <code>target_lang<\/code> to a known locale usually gives the best accuracy. Setting <code>target_lang=auto<\/code> lets the model detect the language itself. In auto mode, it emits a language tag after terminal punctuation. One deployment can then transcribe mixed-language traffic. 
No separate language-ID component is required.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Comparison<\/strong><\/h2>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Product<\/th>\n<th>Company<\/th>\n<th>Access<\/th>\n<th>Native streaming<\/th>\n<th>Language coverage<\/th>\n<th>Reported latency<\/th>\n<th>Pricing model<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Nemotron 3.5 ASR<\/strong><\/td>\n<td>NVIDIA<\/td>\n<td>Open weights (OpenMDW-1.1), self-host; hosted on DeepInfra<\/td>\n<td>Yes \u2014 cache-aware FastConformer-RNNT<\/td>\n<td>40 language-locales<\/td>\n<td>80ms\u20131.12s, configurable at inference<\/td>\n<td>Free to self-host; usage-based via host<\/td>\n<\/tr>\n<tr>\n<td>Whisper large-v3<\/td>\n<td>OpenAI<\/td>\n<td>Open weights (MIT), self-host; API<\/td>\n<td>No \u2014 offline\/batch<\/td>\n<td>~99 languages <\/td>\n<td>Not streaming-native<\/td>\n<td>Self-host free; API ~$0.006\/min (batch) <\/td>\n<\/tr>\n<tr>\n<td>Nova-3<\/td>\n<td>Deepgram<\/td>\n<td>Closed API; on-premise\/self-host (enterprise)<\/td>\n<td>Yes \u2014 streaming + batch<\/td>\n<td>Multilingual; +10 monolingual languages added Jan 2026 <\/td>\n<td>Low-latency streaming (reported sub-300ms)<\/td>\n<td>~$0.0077\/min (Nova-3 Monolingual, PAYG) <\/td>\n<\/tr>\n<tr>\n<td>Universal-3 Pro Streaming<\/td>\n<td>AssemblyAI<\/td>\n<td>Closed API (EU endpoint available)<\/td>\n<td>Yes<\/td>\n<td>6 languages: English, Spanish, French, German, Italian, Portuguese <\/td>\n<td>Sub-300ms (official); first partial ~750ms <\/td>\n<td>Usage-based (PAYG)<\/td>\n<\/tr>\n<tr>\n<td>Scribe v2 Realtime<\/td>\n<td>ElevenLabs<\/td>\n<td>Closed API<\/td>\n<td>Yes<\/td>\n<td>90+ languages (99 per ElevenLabs) <\/td>\n<td>~150ms (p50) <\/td>\n<td>~$0.28\/hour <\/td>\n<\/tr>\n<tr>\n<td>Ursa \/ streaming<\/td>\n<td>Speechmatics<\/td>\n<td>API + on-premise + edge<\/td>\n<td>Yes \u2014 streaming + batch<\/td>\n<td>50+ languages with automatic identification 
<\/td>\n<td>Ultra-low latency (positioned)<\/td>\n<td>Enterprise\/usage<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<h2 class=\"wp-block-heading\"><strong>Fine-Tuning Results<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Because the weights are open, teams can fine-tune for a language, domain, or accent. NVIDIA published a worked example on Greek and Bulgarian. It fine-tuned the base checkpoint with the same Cache-Aware FastConformer-RNNT recipe. Each clip carried a <code>target_lang<\/code> tag for language conditioning. Training data came from public corpora, including Granary, Common Voice, and FLEURS.<\/p>\n<p class=\"wp-block-paragraph\">Results were measured as word error rate (WER) on held-out FLEURS, at the 80ms setting. Greek WER fell from 35 to 24, a 32% relative improvement. Bulgarian fell from 22 to 15, a 31% relative improvement. These are raw WER percentages at the lowest-latency streaming mode. NVIDIA notes that evaluating at deployment latency, on held-out data, gives honest numbers.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Strengths and Considerations<\/strong><\/h2>\n<h4 class=\"wp-block-heading\"><strong>Strengths:<\/strong><\/h4>\n<ul class=\"wp-block-list\">\n<li>One 600M-parameter checkpoint covers 40 language-locales, cutting deployment sprawl.<\/li>\n<li>Cache-aware streaming processes each frame once, with a reported 17x the concurrent streams of buffered streaming on an H100.<\/li>\n<li><code>att_context_size<\/code> tunes latency from 80ms to 1.12s at inference, with no retraining.<\/li>\n<li>Punctuation, capitalization, and <code>auto<\/code> language tagging are built in.<\/li>\n<li>Open weights enabled a 31\u201332% relative WER drop on Greek and Bulgarian after fine-tuning.<\/li>\n<\/ul>\n<h4 class=\"wp-block-heading\"><strong>Considerations:<\/strong><\/h4>\n<ul class=\"wp-block-list\">\n<li>The model handles English, but NVIDIA recommends its dedicated English model for English-only use.<\/li>\n<li>The 
80ms mode trades some accuracy for the lowest latency.<\/li>\n<li>Japanese and Korean use CER, so cross-language error comparisons need care.<\/li>\n<li>Throughput figures are measured on H100, so results on other GPUs will differ.<\/li>\n<li>The production NIM with gRPC streaming is announced, but not yet released.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>NVIDIA\u2019s Nemotron 3.5 ASR is an open-weights (OpenMDW-1.1), 600M-parameter streaming model transcribing 40 language-locales from one checkpoint.<\/li>\n<li>Its Cache-Aware FastConformer-RNNT design processes each audio frame once, reported at 17x the concurrent streams of buffered approaches on an H100.<\/li>\n<li>Latency is configurable from 80ms to 1.12s at inference via <code>att_context_size<\/code>, with no retraining.<\/li>\n<li>A short fine-tune cut FLEURS WER 32% on Greek (35\u219224) and 31% on Bulgarian (22\u219215), at the 80ms setting.<\/li>\n<li>It is self-hostable and streaming-native, unlike closed APIs (Deepgram, AssemblyAI, ElevenLabs) or offline Whisper.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\"><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<p><!-- ============================================================\n     Marktechpost \u2014 Nemotron 3.5 ASR Slider Guide\n     WordPress-embeddable snippet. Paste into a Custom HTML block.\n     Scoped to #mtp-nemo-asr. 
NVIDIA dark theme.\n     ============================================================ --><\/p>\n<div>\n<div class=\"mtp-na-frame\">\n<p>    <!-- Top bar --><\/p>\n<div class=\"mtp-na-topbar\">\n      <span class=\"mtp-na-dot\"><\/span><br \/>\n      <span class=\"mtp-na-brand\">NEMOTRON\u00a03.5\u00a0ASR<\/span><br \/>\n      <span class=\"mtp-na-counter\"><span class=\"mtp-na-cur\">1<\/span> \/ <span class=\"mtp-na-tot\">10<\/span><\/span>\n    <\/div>\n<div class=\"mtp-na-progress\"><span class=\"mtp-na-progbar\"><\/span><\/div>\n<p>    <!-- Viewport --><\/p>\n<div class=\"mtp-na-viewport\">\n<div class=\"mtp-na-track\">\n<p>        <!-- Slide 1: Cover --><\/p>\n<section class=\"mtp-na-slide\">\n<div class=\"mtp-na-kicker\">NVIDIA \u00b7 STREAMING SPEECH AI \u00b7 OPEN WEIGHTS<\/div>\n<h2 class=\"mtp-na-title-xl\">Nemotron 3.5 ASR<\/h2>\n<p class=\"mtp-na-lead\">A 600M-parameter cache-aware streaming model that transcribes 40 language-locales in real time, from a single checkpoint.<\/p>\n<div class=\"mtp-na-chips\">\n            <span class=\"mtp-na-chip\">600M parameters<\/span><br \/>\n            <span class=\"mtp-na-chip\">40 language-locales<\/span><br \/>\n            <span class=\"mtp-na-chip\">80ms\u20131.12s latency<\/span><br \/>\n            <span class=\"mtp-na-chip\">OpenMDW-1.1<\/span>\n          <\/div>\n<\/section>\n<p>        <!-- Slide 2: What it is --><\/p>\n<section class=\"mtp-na-slide\">\n<div class=\"mtp-na-kicker\">01 \u2014 WHAT IT IS<\/div>\n<h2 class=\"mtp-na-title\">One model, 40 language-locales<\/h2>\n<ul class=\"mtp-na-list\">\n<li>Extends <code>nvidia\/nemotron-speech-streaming-en-0.6b<\/code> with prompt-based language-ID conditioning.<\/li>\n<li>A single 600M-parameter checkpoint covers 40 language-locales. No model-swapping.<\/li>\n<li>Punctuation and capitalization are built in. 
No separate post-processing step.<\/li>\n<li>Targets two workloads: low-latency streaming and high-throughput batch.<\/li>\n<li>NVIDIA still recommends its English-only model for English-only use.<\/li>\n<\/ul>\n<\/section>\n<p>        <!-- Slide 3: Architecture --><\/p>\n<section class=\"mtp-na-slide\">\n<div class=\"mtp-na-kicker\">02 \u2014 ARCHITECTURE<\/div>\n<h2 class=\"mtp-na-title\">Cache-Aware FastConformer-RNNT<\/h2>\n<ul class=\"mtp-na-list\">\n<li>A 24-layer FastConformer encoder paired with an RNNT decoder.<\/li>\n<li>Buffered streaming re-processes overlapping audio windows at every step.<\/li>\n<li>This model caches encoder self-attention and convolution states, then reuses them.<\/li>\n<li>Each audio frame is processed exactly once, with no overlap.<\/li>\n<li>Compute and end-to-end latency drop, with no accuracy penalty.<\/li>\n<\/ul>\n<\/section>\n<p>        <!-- Slide 4: Latency knob --><\/p>\n<section class=\"mtp-na-slide\">\n<div class=\"mtp-na-kicker\">03 \u2014 THE LATENCY KNOB<\/div>\n<h2 class=\"mtp-na-title\">One setting tunes latency vs. accuracy<\/h2>\n<div class=\"mtp-na-tablewrap\">\n<table class=\"mtp-na-table\">\n<thead>\n<tr>\n<th>att_context_size<\/th>\n<th>Chunk (latency)<\/th>\n<th>Use case<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>[56,0]<\/code><\/td>\n<td>80ms (Ultra-Low)<\/td>\n<td>Ultra low latency voice agents<\/td>\n<\/tr>\n<tr>\n<td><code>[56,1]<\/code><\/td>\n<td>160ms (Low)<\/td>\n<td>Interactive voice agents<\/td>\n<\/tr>\n<tr>\n<td><code>[56,3]<\/code><\/td>\n<td>320ms (Balanced)<\/td>\n<td>Conversational AI, live caption<\/td>\n<\/tr>\n<tr>\n<td><code>[56,6]<\/code><\/td>\n<td>560ms (Medium)<\/td>\n<td>Higher accuracy, reasonable latency<\/td>\n<\/tr>\n<tr>\n<td><code>[56,13]<\/code><\/td>\n<td>1.12s (High)<\/td>\n<td>Highest accuracy<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<p class=\"mtp-na-note\">Same checkpoint, chosen at inference time. 
No retraining required.<\/p>\n<\/section>\n<p>        <!-- Slide 5: Languages &amp; detection --><\/p>\n<section class=\"mtp-na-slide\">\n<div class=\"mtp-na-kicker\">04 \u2014 LANGUAGES &amp; DETECTION<\/div>\n<h2 class=\"mtp-na-title\">Coverage and automatic language ID<\/h2>\n<ul class=\"mtp-na-list\">\n<li>40 language-locales, including English, Spanish, German, and French variants.<\/li>\n<li>Also covers Arabic, Japanese, Korean, Mandarin, Hindi, and Thai.<\/li>\n<li>Set <code>target_lang<\/code> to a known locale for the best accuracy.<\/li>\n<li>Set <code>target_lang=auto<\/code> to let the model detect the language.<\/li>\n<li>In auto mode, it emits a language tag after terminal punctuation.<\/li>\n<li>One deployment handles mixed-language traffic, with no separate language-ID component.<\/li>\n<\/ul>\n<\/section>\n<p>        <!-- Slide 6: Throughput --><\/p>\n<section class=\"mtp-na-slide\">\n<div class=\"mtp-na-kicker\">05 \u2014 THROUGHPUT<\/div>\n<h2 class=\"mtp-na-title\">Half the size, more concurrent streams<\/h2>\n<ul class=\"mtp-na-list\">\n<li>NVIDIA compares it against Parakeet RNNT 1.1B multilingual, which uses buffered streaming.<\/li>\n<li>Nemotron 3.5 ASR is roughly half the size: 0.6B versus 1.1B.<\/li>\n<li>The team reports 17x the concurrent streams of buffered approaches, on the same H100.<\/li>\n<li>Avoiding redundant recomputation lowers the cost per stream in production.<\/li>\n<\/ul>\n<p class=\"mtp-na-note\">The 17x figure is from the release announcement; the model card states the qualitative claim directly.<\/p>\n<\/section>\n<p>        <!-- Slide 7: Fine-tuning --><\/p>\n<section class=\"mtp-na-slide\">\n<div class=\"mtp-na-kicker\">06 \u2014 FINE-TUNING RESULTS<\/div>\n<h2 class=\"mtp-na-title\">A short fine-tune lifts weaker languages<\/h2>\n<div class=\"mtp-na-tablewrap\">\n<table class=\"mtp-na-table\">\n<thead>\n<tr>\n<th>Language<\/th>\n<th>Base 
WER<\/th>\n<th>Fine-tuned<\/th>\n<th>Relative<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Greek<\/td>\n<td>35<\/td>\n<td>24<\/td>\n<td class=\"mtp-na-pos\">32%<\/td>\n<\/tr>\n<tr>\n<td>Bulgarian<\/td>\n<td>22<\/td>\n<td>15<\/td>\n<td class=\"mtp-na-pos\">31%<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<p class=\"mtp-na-note\">Raw WER (%) on held-out FLEURS at the 80ms setting. Data: Granary, Common Voice, FLEURS.<\/p>\n<\/section>\n<p>        <!-- Slide 8: Availability --><\/p>\n<section class=\"mtp-na-slide\">\n<div class=\"mtp-na-kicker\">07 \u2014 AVAILABILITY &amp; ACCESS<\/div>\n<h2 class=\"mtp-na-title\">Open weights, plus a hosted path<\/h2>\n<ul class=\"mtp-na-list\">\n<li>Weights on Hugging Face under the OpenMDW-1.1 license.<\/li>\n<li>Runtime is NeMo 26.06 or newer. Input must be mono-channel.<\/li>\n<li>Hosted on DeepInfra, which adds word boosting for domain vocabulary.<\/li>\n<li>NVIDIA says a NIM release is planned for later in the month, with gRPC streaming.<\/li>\n<li>Stated GPU support: Ampere, Hopper, Blackwell, Lovelace, Turing, Volta, and Jetson.<\/li>\n<\/ul>\n<\/section>\n<p>        <!-- Slide 9: How it compares --><\/p>\n<section class=\"mtp-na-slide\">\n<div class=\"mtp-na-kicker\">08 \u2014 HOW IT COMPARES<\/div>\n<h2 class=\"mtp-na-title\">Where it sits in the landscape<\/h2>\n<div class=\"mtp-na-tablewrap\">\n<table class=\"mtp-na-table\">\n<thead>\n<tr>\n<th>Product<\/th>\n<th>Access<\/th>\n<th>Streaming<\/th>\n<th>Languages<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Nemotron 3.5 ASR<\/td>\n<td>Open weights<\/td>\n<td>Native<\/td>\n<td>40 locales<\/td>\n<\/tr>\n<tr>\n<td>Whisper large-v3<\/td>\n<td>Open weights<\/td>\n<td>No (batch)<\/td>\n<td>~99<\/td>\n<\/tr>\n<tr>\n<td>Deepgram Nova-3<\/td>\n<td>API \/ on-prem<\/td>\n<td>Native<\/td>\n<td>Multilingual<\/td>\n<\/tr>\n<tr>\n<td>AssemblyAI U-3 Pro<\/td>\n<td>API<\/td>\n<td>Native<\/td>\n<td>6<\/td>\n<\/tr>\n<tr>\n<td>ElevenLabs Scribe 
v2<\/td>\n<td>API<\/td>\n<td>Native<\/td>\n<td>90+<\/td>\n<\/tr>\n<tr>\n<td>Google Chirp \/ Azure<\/td>\n<td>API<\/td>\n<td>Native<\/td>\n<td>100+ \/ 140+<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<p class=\"mtp-na-note\">Latency and WER are not directly comparable across vendors; this compares structure, not a ranking.<\/p>\n<\/section>\n<p>        <!-- Slide 10: Key takeaways --><\/p>\n<section class=\"mtp-na-slide\">\n<div class=\"mtp-na-kicker\">09 \u2014 KEY TAKEAWAYS<\/div>\n<h2 class=\"mtp-na-title\">The short version<\/h2>\n<ul class=\"mtp-na-list mtp-na-list-tight\">\n<li>An open-weights 600M streaming model transcribing 40 language-locales from one checkpoint.<\/li>\n<li>Cache-aware design processes each frame once; reported 17x buffered concurrency on an H100.<\/li>\n<li>Latency configurable from 80ms to 1.12s at inference, with no retraining.<\/li>\n<li>A short fine-tune cut FLEURS WER 32% (Greek) and 31% (Bulgarian).<\/li>\n<li>Self-hostable and streaming-native, unlike closed APIs or offline Whisper.<\/li>\n<\/ul>\n<\/section><\/div>\n<\/div>\n<p>    <!-- Controls --><\/p>\n<div class=\"mtp-na-controls\">\n      <button class=\"mtp-na-btn mtp-na-prev\" type=\"button\" aria-label=\"Previous slide\">\u2190 Prev<\/button>\n<div class=\"mtp-na-dots\"><\/div>\n<p>      <button class=\"mtp-na-btn mtp-na-next\" type=\"button\" aria-label=\"Next slide\">Next \u2192<\/button>\n    <\/p><\/div>\n<p>    <!-- Marktechpost tagline --><\/p>\n<div class=\"mtp-na-tagline\">\n      <span class=\"mtp-na-tagdot\"><\/span><br \/>\n      Curated for AI engineers by <a href=\"https:\/\/www.marktechpost.com\/\" target=\"_blank\" rel=\"noopener\"><strong>Marktechpost<\/strong><\/a> \u2014 practitioner-first coverage of AI &amp; ML.\n    <\/div>\n<\/div>\n<\/div>\n<p class=\"wp-block-paragraph\">\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<\/p><p class=\"wp-block-paragraph\">\n<\/p><p class=\"wp-block-paragraph\">Check out\u00a0the\u00a0<strong><a 
href=\"https:\/\/huggingface.co\/nvidia\/nemotron-3.5-asr-streaming-0.6b\" target=\"_blank\" rel=\"noreferrer noopener\">Model weights<\/a>.<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/06\/06\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/\">NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Aware Streaming Model Transcribing 40 Language-Locales in Real Time<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>NVIDIA\u2019s Nemotron Speech team has released Nemotron 3.5 ASR. It is a 600M-parameter streaming Automatic Speech Recognition (ASR) model. A single checkpoint transcribes 40 language-locales in real time. Punctuation and capitalization are built in natively. The model ships as open weights on Hugging Face. 
The license is OpenMDW-1.1. The architecture is a Cache-Aware FastConformer-RNNT. What is Nemotron 3.5 ASR Nemotron 3.5 ASR extends nvidia\/nemotron-speech-streaming-en-0.6b to many languages. It adds prompt-based language-ID conditioning to the base model. That lets one 600M-parameter checkpoint cover 40 language-locales. No per-language model or model-swapping is required. The model targets two workloads. The first is low-latency streaming for live audio. The second is high-throughput batch transcription. Output is production-ready text with proper casing and punctuation. No separate punctuation-restoration step is needed. Image source: https:\/\/huggingface.co\/nvidia\/nemotron-3.5-asr-streaming-0.6b How Cache-Aware FastConformer-RNNT Works The model has two main pieces. The first is a Cache-Aware FastConformer encoder with 24 layers. FastConformer is an efficient evolution of the Conformer architecture. It uses linearly scalable attention. The second piece is an RNNT (Recurrent Neural Network Transducer) decoder. RNNT emits text frame by frame as audio streams in. The \u201ccache-aware\u201d design is the efficiency lever. Buffered streaming re-processes overlapping audio windows at every step. That repeats the same work and adds delay. This model caches encoder self-attention and convolution activations instead. It reuses those cached states as new audio arrives. So each audio frame is processed exactly once, with no overlap. Compute and end-to-end latency both drop, without an accuracy penalty. The Latency Knob: att_context_size One inference setting controls the latency-accuracy tradeoff. It is the attention context size, att_context_size. Smaller context emits text sooner but sees less future audio. Larger context raises accuracy at higher latency. The same checkpoint covers the full range. Settings map to chunk sizes of 80ms, 160ms, 320ms, 560ms, and 1.12s. For example, [56,0] gives an 80ms ultra-low-latency mode. 
The [56,13] setting gives 1.12s for highest accuracy. Teams pick the operating point at inference time, with no retraining. Language Detection and Coverage The 40 language-locales include English, Spanish, German, and French variants. They also cover Arabic, Japanese, Korean, Mandarin, Hindi, and Thai. Several other European and Nordic languages are included too. Language conditioning works two ways. Setting target_lang to a known locale usually gives the best accuracy. Setting target_lang=auto lets the model detect the language itself. In auto mode, it emits a language tag after terminal punctuation. One deployment can then transcribe mixed-language traffic. No separate language-ID component is required. Comparison Product Company Access Native streaming Language coverage Reported latency Pricing model Nemotron 3.5 ASR NVIDIA Open weights (OpenMDW-1.1), self-host; hosted on DeepInfra Yes \u2014 cache-aware FastConformer-RNNT 40 language-locales 80ms\u20131.12s, configurable at inference Free to self-host; usage-based via host Whisper large-v3 OpenAI Open weights (MIT), self-host; API No \u2014 offline\/batch ~99 languages Not streaming-native Self-host free; API ~$0.006\/min (batch) Nova-3 Deepgram Closed API; on-premise\/self-host (enterprise) Yes \u2014 streaming + batch Multilingual; +10 monolingual languages added Jan 2026 Low-latency streaming (reported sub-300ms) ~$0.0077\/min (Nova-3 Monolingual, PAYG) Universal-3 Pro Streaming AssemblyAI Closed API (EU endpoint available) Yes 6 languages: English, Spanish, French, German, Italian, Portuguese Sub-300ms (official); first partial ~750ms Usage-based (PAYG) Scribe v2 Realtime ElevenLabs Closed API Yes 90+ languages (99 per ElevenLabs) ~150ms (p50) ~$0.28\/hour Ursa \/ streaming Speechmatics API + on-premise + edge Yes \u2014 streaming + batch 50+ languages with automatic identification Ultra-low latency (positioned) Enterprise\/usage Fine-Tuning Results Because the weights are open, teams can fine-tune for a 
language, domain, or accent. NVIDIA published a worked example on Greek and Bulgarian. It fine-tuned the base checkpoint with the same Cache-Aware FastConformer-RNNT recipe. Each clip carried a target_lang tag for language conditioning. Training data came from public corpora, including Granary, Common Voice, and FLEURS. Results were measured as WER on held-out FLEURS, at the 80ms setting. Greek WER fell from 35 to 24, a 32% relative improvement. Bulgarian fell from 22 to 15, a 31% relative improvement. These are raw WER percentages at the lowest-latency streaming mode. NVIDIA notes that evaluating at deployment latency, on held-out data, gives honest numbers. Strengths and Considerations Strengths: One 600M-parameter checkpoint covers 40 language-locales, cutting deployment sprawl. Cache-aware streaming processes each frame once, reported at 17x buffered concurrency on an H100. att_context_size tunes latency from 80ms to 1.12s at inference, with no retraining. Punctuation, capitalization, and auto language tagging are built in. Open weights enabled a 31\u201332% relative WER drop on Greek and Bulgarian after fine-tuning. Considerations: The model handles English, but NVIDIA recommends its dedicated English model for English-only use. The 80ms mode trades some accuracy for the lowest latency. Japanese and Korean use CER, so cross-language error comparisons need care. Throughput figures are measured on H100, so results on other GPUs will differ. The production NIM with gRPC streaming is announced, but not yet released. Key Takeaways NVIDIA\u2019s Nemotron 3.5 ASR is an open-weights (OpenMDW-1.1), 600M-parameter streaming model transcribing 40 language-locales from one checkpoint. Its Cache-Aware FastConformer-RNNT design processes each audio frame once, reported at 17x the concurrent streams of buffered approaches on an H100. Latency is configurable from 80ms to 1.12s at inference via att_context_size, with no retraining. 
A short fine-tune cut FLEURS WER 32% on Greek (35\u219224) and 31% on Bulgarian (22\u219215), at the 80ms setting. It is self-hostable and streaming-native, unlike closed APIs (Deepgram, AssemblyAI, ElevenLabs) or offline Whisper. Marktechpost\u2019s Visual Explainer NEMOTRON\u00a03.5\u00a0ASR 1 \/ 10 NVIDIA \u00b7 STREAMING SPEECH AI \u00b7 OPEN WEIGHTS Nemotron 3.5 ASR A 600M-parameter cache-aware streaming model that transcribes 40 language-locales in real time, from a single checkpoint. 600M parameters 40 language-locales 80ms\u20131.12s latency OpenMDW-1.1 01 \u2014 WHAT IT IS One model, 40 language-locales Extends nvidia\/nemotron-speech-streaming-en-0.6b with prompt-based language-ID conditioning. A single 600M-parameter checkpoint covers 40 language-locales. No model-swapping. Punctuation and capitalization are built in. No separate post-processing step. Targets two workloads: low-latency streaming and high-throughput batch. NVIDIA still recommends its English-only model for English-only use. 02 \u2014 ARCHITECTURE Cache-Aware FastConformer-RNNT A 24-layer FastConformer encoder paired with an RNNT decoder. Buffered streaming re-processes overlapping audio windows at every step. 
This model caches encoder<\/p>","protected":false},"author":2,"featured_media":95560,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-95559","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Aware Streaming Model Transcribing 40 Language-Locales in Real Time - YouZum<\/title>\n<meta name=\"description\" content=\"Activities about drones\" \/>\n<meta name=\"robots\" content=\"index, 
follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/zh\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/\" \/>\n<meta property=\"og:locale\" content=\"zh_CN\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Aware Streaming Model Transcribing 40 Language-Locales in Real Time - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/zh\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-06-06T17:38:16+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u4f5c\u8005\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"\u9884\u8ba1\u9605\u8bfb\u65f6\u95f4\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 \u5206\" \/>\n<script type=\"application\/ld+json\" 
class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Aware Streaming Model Transcribing 40 Language-Locales in Real Time\",\"datePublished\":\"2026-06-06T17:38:16+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/\"},\"wordCount\":1484,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/kl1-1-scaled-yj1mlo.png\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"zh-Hans\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/\",\"url\":\"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-
locales-in-real-time\/\",\"name\":\"NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Aware Streaming Model Transcribing 40 Language-Locales in Real Time - YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/kl1-1-scaled-yj1mlo.png\",\"datePublished\":\"2026-06-06T17:38:16+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/#breadcrumb\"},\"inLanguage\":\"zh-Hans\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/kl1-1-scaled-yj1mlo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/kl1-1-scaled-yj1mlo.png\",\"width\":2560,\"height\":1150},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/#breadcrumb\",\"itemListElement\
":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Aware Streaming Model Transcribing 40 Language-Locales in Real Time\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"zh-Hans\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin 
NU\"},\"url\":\"https:\/\/youzum.net\/zh\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Aware Streaming Model Transcribing 40 Language-Locales in Real Time - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/zh\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/","og_locale":"zh_CN","og_type":"article","og_title":"NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Aware Streaming Model Transcribing 40 Language-Locales in Real Time - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/zh\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-06-06T17:38:16+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"\u4f5c\u8005":"admin NU","\u9884\u8ba1\u9605\u8bfb\u65f6\u95f4":"8 \u5206"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/"},"author":{"name":"admin 
NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Aware Streaming Model Transcribing 40 Language-Locales in Real Time","datePublished":"2026-06-06T17:38:16+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/"},"wordCount":1484,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/kl1-1-scaled-yj1mlo.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"zh-Hans","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/","url":"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/","name":"NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Aware Streaming Model Transcribing 40 Language-Locales in Real Time - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/kl1-1-scaled-yj1mlo.png","datePublished":"2026-06-06T17:38:16+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/#breadcrumb"},"inLanguage":"zh-Hans","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/"]}]},{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/#primaryimage","url":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/kl1-1-scaled-yj1mlo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/kl1-1-scaled-yj1mlo.png","width":2560,"height":1150},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/nvidia-releases-nemotron-3-5-asr-a-600m-parameter-cache-aware-streaming-model-transcribing-40-language-locales-in-real-time\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Aware Streaming Model Transcribing 40 Language-Locales in Real 
Time"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"zh-Hans"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin 
NU"},"url":"https:\/\/youzum.net\/zh\/members\/adminnu\/"}]}},"rttpg_featured_image_url":{"full":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/kl1-1-scaled-yj1mlo.png",2560,1150,false],"landscape":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/kl1-1-scaled-yj1mlo.png",2560,1150,false],"portraits":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/kl1-1-scaled-yj1mlo.png",2560,1150,false],"thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/kl1-1-scaled-yj1mlo-150x150.png",150,150,true],"medium":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/kl1-1-scaled-yj1mlo-300x135.png",300,135,true],"large":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/kl1-1-scaled-yj1mlo-1024x460.png",1024,460,true],"1536x1536":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/kl1-1-scaled-yj1mlo-1536x690.png",1536,690,true],"2048x2048":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/kl1-1-scaled-yj1mlo-2048x920.png",2048,920,true],"trp-custom-language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/kl1-1-scaled-yj1mlo-18x8.png",18,8,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/kl1-1-scaled-yj1mlo-300x300.png",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/kl1-1-scaled-yj1mlo-600x270.png",600,270,true],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/06\/kl1-1-scaled-yj1mlo-100x100.png",100,100,true]},"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/zh\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/zh\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/zh\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/zh\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/zh\/category\/uncategorized\/\" rel=\"category 
tag\">Uncategorized<\/a>","rttpg_excerpt":"NVIDIA\u2019s Nemotron Speech team has released Nemotron 3.5 ASR. It is a 600M-parameter streaming Automatic Speech Recognition (ASR) model. A single checkpoint transcribes 40 language-locales in real time. Punctuation and capitalization are built in natively. The model ships as open weights on Hugging Face. The license is OpenMDW-1.1. The architecture is a Cache-Aware FastConformer-RNNT. What&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/posts\/95559","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/comments?post=95559"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/posts\/95559\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/media\/95560"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/media?parent=95559"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/categories?post=95559"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/tags?post=95559"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}