{"id":80282,"date":"2026-03-31T14:47:41","date_gmt":"2026-03-31T14:47:41","guid":{"rendered":"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/"},"modified":"2026-03-31T14:47:41","modified_gmt":"2026-03-31T14:47:41","slug":"alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction","status":"publish","type":"post","link":"https:\/\/youzum.net\/fr\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/","title":{"rendered":"Alibaba Qwen Team Releases Qwen3.5 Omni: A Native Multimodal Model for Text, Audio, Video, and Realtime Interaction"},"content":{"rendered":"<p>The landscape of multimodal large language models (MLLMs) has shifted from experimental \u2018wrappers\u2019\u2014where separate vision or audio encoders are stitched onto a text-based backbone\u2014to native, end-to-end \u2018omnimodal\u2019 architectures. The Alibaba Qwen team\u2019s latest release, <strong>Qwen3.5-Omni<\/strong>, represents a significant milestone in this evolution. Designed as a direct competitor to flagship models like Gemini 3.1 Pro, the Qwen3.5-Omni series introduces a unified framework capable of processing text, images, audio, and video simultaneously within a single computational pipeline. <\/p>\n<p>The technical significance of Qwen3.5-Omni lies in its <strong>Thinker-Talker<\/strong> architecture and its use of <strong>Hybrid-Attention Mixture of Experts (MoE)<\/strong> across all modalities. 
This approach enables the model to handle massive context windows and real-time interaction without the traditional latency penalties associated with cascaded systems.<\/p>\n<h4 class=\"wp-block-heading\"><strong>Model Tiers<\/strong><\/h4>\n<p>The series is offered in three sizes to balance performance and cost:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Plus:<\/strong> High-complexity reasoning and maximum accuracy.<\/li>\n<li><strong>Flash:<\/strong> Optimized for high throughput and low-latency interaction.<\/li>\n<li><strong>Light:<\/strong> A smaller variant for efficiency-focused tasks.<\/li>\n<\/ul>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1694\" height=\"1230\" data-attachment-id=\"78718\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/30\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/screenshot-2026-03-30-at-10-06-06-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1.png\" data-orig-size=\"1694,1230\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-03-30 at 10.06.06\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-300x218.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-1024x744.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1.png\" alt=\"\" class=\"wp-image-78718\" 
\/><figcaption class=\"wp-element-caption\">https:\/\/qwen.ai\/blog?id=qwen3.5-omni<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>The Thinker-Talker Architecture: A Unified MoE Framework<\/strong><\/h3>\n<p>At the core of Qwen3.5-Omni is a bifurcated yet tightly integrated architecture consisting of two main components: the <strong>Thinker<\/strong> and the <strong>Talker<\/strong>.<\/p>\n<p>In previous iterations, multimodal models often relied on external pre-trained encoders (such as Whisper for audio). Qwen3.5-Omni moves beyond this by utilizing a native <strong>Audio Transformer (AuT)<\/strong> encoder. This encoder was pre-trained on more than <strong>100 million hours<\/strong> of audio-visual data, giving the model a grounded understanding of temporal and acoustic nuances that traditional text-first models lack.<\/p>\n<h4 class=\"wp-block-heading\"><strong>Hybrid-Attention Mixture of Experts (MoE)<\/strong><\/h4>\n<p>Both the Thinker and the Talker leverage <strong>Hybrid-Attention MoE<\/strong>. In a standard MoE setup, only a subset of parameters (the \u2018experts\u2019) is activated for any given token, which allows for a high total parameter count at a lower active computational cost. 
By applying this to a hybrid-attention mechanism, Qwen3.5-Omni can effectively weigh the importance of different modalities (e.g., focusing more on visual tokens during a video analysis task) while maintaining the throughput required for streaming services.<\/p>\n<p><strong>This architecture supports a 256k long-context input, enabling the model to ingest and reason over:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>Over <strong>10 hours of continuous audio<\/strong>.<\/li>\n<li>Over <strong>400 seconds of 720p audio-visual content<\/strong> (sampled at 1 FPS).<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><strong>Benchmarking Performance: The \u2018215 SOTA\u2019 Milestone<\/strong><\/h3>\n<p>The headline technical claim for the flagship <strong>Qwen3.5-Omni-Plus<\/strong> model is its leaderboard performance: the model achieved <strong>State-of-the-Art (SOTA) results on 215 audio and audio-visual understanding, reasoning, and interaction subtasks<\/strong>.<\/p>\n<p><strong>These 215 SOTA results are not a single aggregate figure; they break down across specific technical benchmarks (3 + 5 + 8 + 156 + 43 = 215):<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>3 audio-visual benchmarks<\/strong> and <strong>5 general audio benchmarks<\/strong>.<\/li>\n<li><strong>8 ASR (Automatic Speech Recognition) benchmarks<\/strong>.<\/li>\n<li><strong>156 language-specific Speech-to-Text Translation (S2TT) tasks<\/strong>.<\/li>\n<li><strong>43 language-specific ASR tasks<\/strong>.<\/li>\n<\/ul>\n<p>According to the official <a href=\"https:\/\/qwen.ai\/blog?id=qwen3.5-omni\" target=\"_blank\" rel=\"noreferrer noopener\">technical reports<\/a>, Qwen3.5-Omni-Plus surpasses <strong>Gemini 3.1 Pro<\/strong> in general audio understanding, reasoning, recognition, and translation. 
In audio-visual understanding, it achieves parity with Google\u2019s flagship, while maintaining the core text and visual performance of the standard Qwen3.5 series.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img decoding=\"async\" width=\"2242\" height=\"1218\" data-attachment-id=\"78716\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/30\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/screenshot-2026-03-30-at-9-58-24-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-9.58.24-PM-1.png\" data-orig-size=\"2242,1218\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-03-30 at 9.58.24\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-9.58.24-PM-1-300x163.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-9.58.24-PM-1-1024x556.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-9.58.24-PM-1.png\" alt=\"\" class=\"wp-image-78716\" \/><figcaption class=\"wp-element-caption\">https:\/\/qwen.ai\/blog?id=qwen3.5-omni<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Technical Solutions for Real-Time Interaction<\/strong><\/h3>\n<p>Building a model that can \u2018talk\u2019 and \u2018hear\u2019 in real time requires solving specific engineering challenges related to streaming stability and conversational flow.<\/p>\n<h4 class=\"wp-block-heading\"><strong>ARIA: Adaptive Rate Interleave 
Alignment<\/strong><\/h4>\n<p>A common failure mode in streaming voice interaction is \u2018speech instability.\u2019 Because text tokens and speech tokens have different encoding efficiencies, a model may misread numbers or stutter when attempting to synchronize its text reasoning with its audio output.<\/p>\n<p>To address this, the Alibaba Qwen team developed <strong>ARIA (Adaptive Rate Interleave Alignment)<\/strong>. This technique dynamically aligns text and speech units during generation. By adjusting the interleave rate based on the density of the information being processed, ARIA improves the naturalness and robustness of speech synthesis without increasing latency.<\/p>\n<h4 class=\"wp-block-heading\"><strong>Semantic Interruption and Turn-Taking<\/strong><\/h4>\n<p>For AI developers building voice assistants, handling interruptions is notoriously difficult. Qwen3.5-Omni introduces native <strong>turn-taking intent recognition<\/strong>. This allows the model to distinguish between \u2018backchanneling\u2019 (non-meaningful background noise or listener feedback like \u2018uh-huh\u2019) and an actual semantic interruption where the user intends to take the floor. This capability is baked directly into the model\u2019s API, enabling more human-like, full-duplex conversations.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Emergent Capability: Audio-Visual Vibe Coding<\/strong><\/h3>\n<p>Perhaps the most distinctive capability identified during the native multimodal scaling of Qwen3.5-Omni is <strong>Audio-Visual Vibe Coding<\/strong>. 
Unlike traditional code generation that relies on text prompts, Qwen3.5-Omni can perform coding tasks based directly on audio-visual instructions.<\/p>\n<p>For instance, a developer could record a video of a software UI, verbally describe a bug while pointing at specific elements, and have the model generate the fix directly. This emergent behavior suggests that the model has developed a cross-modal mapping between visual UI hierarchies, verbal intent, and symbolic code logic.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li>Qwen3.5-Omni uses a native <strong>Thinker-Talker<\/strong> multimodal architecture for unified text, audio, and video processing.<\/li>\n<li>The model supports <strong>256k context<\/strong>, <strong>10+ hours of audio<\/strong>, and <strong>400+ seconds of 720p video<\/strong> at 1 FPS.<\/li>\n<li>Alibaba reports <strong>speech recognition in 113 languages\/dialects<\/strong> and <strong>speech generation in 36 languages\/dialects<\/strong>.<\/li>\n<li>Key system features include <strong>semantic interruption<\/strong>, <strong>turn-taking intent recognition<\/strong>, <strong>TMRoPE<\/strong>, and <strong>ARIA<\/strong> for realtime interaction.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/qwen.ai\/blog?id=qwen3.5-omni\" target=\"_blank\" rel=\"noreferrer noopener\">Technical details<\/a>, <a href=\"https:\/\/chat.qwen.ai\/\" target=\"_blank\" rel=\"noreferrer noopener\">Qwen Chat<\/a>, <a href=\"https:\/\/huggingface.co\/spaces\/Qwen\/Qwen3.5-Omni-Online-Demo\" target=\"_blank\" rel=\"noreferrer noopener\">Online demo on HF<\/a> <\/strong>and<strong> <a href=\"https:\/\/huggingface.co\/spaces\/Qwen\/Qwen3.5-Omni-Offline-Demo\" target=\"_blank\" rel=\"noreferrer noopener\">Offline demo on HF<\/a>.\u00a0<\/strong>Also,\u00a0feel free to follow us on\u00a0<strong><a 
href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\">Twitter<\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">120k+ ML SubReddit<\/a><\/strong>\u00a0and subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Are you on Telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">You can now join us on Telegram as well.<\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/03\/30\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/\">Alibaba Qwen Team Releases Qwen3.5 Omni: A Native Multimodal Model for Text, Audio, Video, and Realtime Interaction<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>The landscape of multimodal large language models (MLLMs) has shifted from experimental \u2018wrappers\u2019\u2014where separate vision or audio encoders are stitched onto a text-based backbone\u2014to native, end-to-end \u2018omnimodal\u2019 architectures. The Alibaba Qwen team\u2019s latest release, Qwen3.5-Omni, represents a significant milestone in this evolution. Designed as a direct competitor to flagship models like Gemini 3.1 Pro, the Qwen3.5-Omni series introduces a unified framework capable of processing text, images, audio, and video simultaneously within a single computational pipeline. The technical significance of Qwen3.5-Omni lies in its Thinker-Talker architecture and its use of Hybrid-Attention Mixture of Experts (MoE) across all modalities. 
The post Alibaba Qwen Team Releases Qwen3.5 Omni: A Native Multimodal Model for Text, Audio, Video, and Realtime Interaction appeared first on MarkTechPost.<\/p>","protected":false},"author":2,"featured_media":80283,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-80282","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Alibaba Qwen Team Releases Qwen3.5 Omni: A Native Multimodal Model for Text, Audio, Video, and Realtime Interaction - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, 
max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/fr\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/\" \/>\n<meta property=\"og:locale\" content=\"fr_FR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Alibaba Qwen Team Releases Qwen3.5 Omni: A Native Multimodal Model for Text, Audio, Video, and Realtime Interaction - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/fr\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-03-31T14:47:41+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u00c9crit par\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Dur\u00e9e de lecture estim\u00e9e\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/\"},\"author\":{\"name\":\"admin 
NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"Alibaba Qwen Team Releases Qwen3.5 Omni: A Native Multimodal Model for Text, Audio, Video, and Realtime Interaction\",\"datePublished\":\"2026-03-31T14:47:41+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/\"},\"wordCount\":918,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-deBKue.webp\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"fr-FR\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/\",\"url\":\"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/\",\"name\":\"Alibaba Qwen Team Releases Qwen3.5 Omni: A Native Multimodal Model for Text, Audio, Video, and Realtime Interaction - 
YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-deBKue.webp\",\"datePublished\":\"2026-03-31T14:47:41+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/#breadcrumb\"},\"inLanguage\":\"fr-FR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-deBKue.webp\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-deBKue.webp\",\"width\":1694,\"height\":1230},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Alibaba 
Qwen Team Releases Qwen3.5 Omni: A Native Multimodal Model for Text, Audio, Video, and Realtime Interaction\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"fr-FR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/fr\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Alibaba Qwen Team Releases Qwen3.5 Omni: A Native Multimodal Model for Text, Audio, Video, and Realtime Interaction - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/fr\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/","og_locale":"fr_FR","og_type":"article","og_title":"Alibaba Qwen Team Releases Qwen3.5 Omni: A Native Multimodal Model for Text, Audio, Video, and Realtime Interaction - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/fr\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-03-31T14:47:41+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"\u00c9crit par":"admin NU","Dur\u00e9e de lecture estim\u00e9e":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"Alibaba Qwen Team Releases Qwen3.5 Omni: A Native Multimodal Model for Text, Audio, Video, and Realtime 
Interaction","datePublished":"2026-03-31T14:47:41+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/"},"wordCount":918,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-deBKue.webp","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"fr-FR","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/","url":"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/","name":"Alibaba Qwen Team Releases Qwen3.5 Omni: A Native Multimodal Model for Text, Audio, Video, and Realtime Interaction - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-deBKue.webp","datePublished":"2026-03-31T14:47:41+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/#breadcrumb"},"inLanguage":"fr-FR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/"]}]},{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/#primaryimage","url":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-deBKue.webp","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-deBKue.webp","width":1694,"height":1230},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/alibaba-qwen-team-releases-qwen3-5-omni-a-native-multimodal-model-for-text-audio-video-and-realtime-interaction\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"Alibaba Qwen Team Releases Qwen3.5 Omni: A Native Multimodal Model for Text, Audio, Video, and Realtime 
Interaction"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"fr-FR"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin 
NU"},"url":"https:\/\/youzum.net\/fr\/members\/adminnu\/"}]}},"rttpg_featured_image_url":{"full":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-deBKue.webp",1694,1230,false],"landscape":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-deBKue.webp",1694,1230,false],"portraits":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-deBKue.webp",1694,1230,false],"thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-deBKue-150x150.webp",150,150,true],"medium":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-deBKue-300x218.webp",300,218,true],"large":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-deBKue-1024x744.webp",1024,744,true],"1536x1536":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-deBKue-1536x1115.webp",1536,1115,true],"2048x2048":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-deBKue.webp",1694,1230,false],"trp-custom-language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-deBKue-18x12.webp",18,12,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-deBKue-300x300.webp",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-deBKue-600x436.webp",600,436,true],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-30-at-10.06.06-PM-1-deBKue-100x100.webp",100,100,true]},"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/fr\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a 
href=\"https:\/\/youzum.net\/fr\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/fr\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/fr\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/fr\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"The landscape of multimodal large language models (MLLMs) has shifted from experimental \u2018wrappers\u2019\u2014where separate vision or audio encoders are stitched onto a text-based backbone\u2014to native, end-to-end \u2018omnimodal\u2019 architectures. Alibaba Qwen team latest release, Qwen3.5-Omni, represents a significant milestone in this evolution. Designed as a direct competitor to flagship models like Gemini 3.1 Pro, the Qwen3.5-Omni\u2026","_links":{"self":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/posts\/80282","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/comments?post=80282"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/posts\/80282\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/media\/80283"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/media?parent=80282"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/categories?post=80282"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/tags?post=80282"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}