{"id":90115,"date":"2026-05-13T16:32:27","date_gmt":"2026-05-13T16:32:27","guid":{"rendered":"https:\/\/youzum.net\/mira-muratis-thinking-machines-lab-introduces-interaction-models-a-native-multimodal-architecture-for-real-time-human-ai-collaboration\/"},"modified":"2026-05-13T16:32:27","modified_gmt":"2026-05-13T16:32:27","slug":"mira-muratis-thinking-machines-lab-introduces-interaction-models-a-native-multimodal-architecture-for-real-time-human-ai-collaboration","status":"publish","type":"post","link":"https:\/\/youzum.net\/de\/mira-muratis-thinking-machines-lab-introduces-interaction-models-a-native-multimodal-architecture-for-real-time-human-ai-collaboration\/","title":{"rendered":"Mira Murati\u2019s Thinking Machines Lab Introduces Interaction Models: A Native Multimodal Architecture for Real-Time Human-AI Collaboration"},"content":{"rendered":"<p>Most AI systems today work in turns. You type or speak, the model waits, processes your input, and then responds. That\u2019s the entire interaction loop. Thinking Machines Lab, an AI research lab, is arguing that this model of interaction is a fundamental bottleneck. Thinking Machines Lab team introduced a research preview of a new class of system they call <strong>interaction models<\/strong> to address it. The main idea for their research is interactivity should be native to the model itself, not bolted on as an afterthought.<\/p>\n<h2 class=\"wp-block-heading\"><strong>What\u2019s Wrong with Turn-Based AI<\/strong><\/h2>\n<p>If you\u2019ve built anything with a language model or voice API, you\u2019ve worked around the limitations of turn-based interaction. The model has no awareness of what\u2019s happening while you\u2019re still typing or speaking. It can\u2019t see you pause mid-sentence, notice your camera feed, or react to something visual in real time. While the model is generating, it\u2019s equally blind \u2014 perception freezes until it finishes or gets interrupted.<\/p>\n<p>This creates a narrow channel for human-AI collaboration that limits how much of a person\u2019s knowledge, intent, and judgment can reach the model, and how much of the model\u2019s work can be understood.<\/p>\n<p>To work around this, most real-time AI systems use a <strong>harness<\/strong> \u2014 a collection of separate components stitched together to simulate responsiveness. A common example is <strong>voice-activity detection (VAD)<\/strong>, which predicts when a user has finished speaking so a turn-based model knows when to start generating. This harness is made out of components that are meaningfully less intelligent than the model itself, and it precludes capabilities like proactive visual reactions, speaking while listening, or responding to cues that are never explicitly stated aloud.<\/p>\n<p>Thinking Machines Lab\u2019s argument is a version of the \u2018bitter lesson\u2019 in machine learning: hand-crafted systems will eventually be outpaced by scaling general capabilities. For interactivity to scale with intelligence, it must be part of the model itself. 
*[Figure: https://thinkingmachines.ai/blog/interaction-models/]*

## The Architecture: Multi-Stream, Micro-Turn Design

The system has two components working in parallel: an **interaction model** that maintains a constant real-time exchange with the user, and a **background model** that handles deeper reasoning tasks asynchronously.

The interaction model is always on, continuously taking in audio, video, and text and producing responses in real time. When a task requires sustained reasoning (tool use, web search, longer-horizon planning), it delegates to the background model by sending a **rich context package containing the full conversation**, not a standalone query. Results stream back as the background model produces them, and the interaction model weaves those updates into the conversation at a moment appropriate to what the user is currently doing, rather than as an abrupt context switch. Both models share their context throughout.

Think of it as one person who keeps you engaged in conversation while a colleague in the background looks something up and passes notes forward in real time.

The key architectural decision enabling this is **time-aligned micro-turns**. Rather than consuming a complete user turn and generating a complete response, the interaction model treats both input and output as streams, continuously interleaving the processing of 200ms of input with the generation of 200ms of output. This is what allows the model to speak while listening, react to visual cues without being prompted verbally, handle true simultaneous speech, and make tool calls and browse the web while the conversation is still in progress, weaving results back in as they arrive.
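As a rough illustration of that cadence (not the actual implementation), here is a runnable toy loop in which perception and generation alternate every 200ms. `EchoModel` and its `ingest`/`emit` methods are hypothetical stand-ins:

```python
import asyncio
from typing import AsyncIterator, Optional

CHUNK_MS = 200  # micro-turn size described in the post

class EchoModel:
    """Stand-in for the interaction model: ingests a chunk, may emit one."""
    def __init__(self) -> None:
        self.last: Optional[bytes] = None
    def ingest(self, chunk: bytes) -> None:   # perception half of a micro-turn
        self.last = chunk
    def emit(self) -> Optional[bytes]:        # generation half of a micro-turn
        return self.last                      # echo: trivially "speaks while listening"

async def micro_turn_loop(model: EchoModel, chunks: AsyncIterator[bytes]) -> None:
    async for chunk in chunks:     # one 200ms slice of audio/video/text
        model.ingest(chunk)        # input is never blocked by generation...
        out = model.emit()         # ...and output gets a chance every micro-turn
        if out is not None:
            print(f"emit {len(out)} bytes")

async def mic(n: int = 3) -> AsyncIterator[bytes]:
    for _ in range(n):
        await asyncio.sleep(CHUNK_MS / 1000)
        yield bytes(32)            # pretend 200ms of captured audio

asyncio.run(micro_turn_loop(EchoModel(), mic()))
```

Compare this with the VAD harness above: there is no end-of-turn trigger at all, only a steady tick in which input and output are interleaved.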
**Encoder-free early fusion** is the design choice that makes multimodal processing work at this cadence. Rather than routing audio and video through large, separately pretrained encoders (like a Whisper-style ASR model or a standalone TTS decoder), the architecture uses minimal pre-processing. Audio signals are ingested as **dMel** features and transformed by a lightweight embedding layer. Video frames are split into 40×40 patches encoded by an **hMLP**. Audio output is decoded by a **flow head**. All components are co-trained from scratch together with the transformer; there is no separately pretrained encoder or decoder at any stage.
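A minimal sketch of what such early fusion could look like, assuming PyTorch, an 80-bin mel-like dMel frame, and a plain two-layer MLP standing in for the hMLP patch encoder (all dimensions are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

D_MODEL = 1024   # transformer width (illustrative)
N_MEL = 80       # mel bins per dMel frame (illustrative)
PATCH = 40       # 40x40 video patches, as in the post

audio_embed = nn.Linear(N_MEL, D_MODEL)   # lightweight dMel embedding layer
patch_embed = nn.Sequential(              # simple stand-in for the hMLP encoder
    nn.Linear(PATCH * PATCH * 3, D_MODEL),
    nn.GELU(),
    nn.Linear(D_MODEL, D_MODEL),
)

def fuse(audio_frames: torch.Tensor, video_frame: torch.Tensor) -> torch.Tensor:
    """Embed one micro-turn of audio plus one video frame into a single token stream."""
    a = audio_embed(audio_frames)                              # (T_audio, D)
    patches = (video_frame                                     # (3, H, W)
               .unfold(1, PATCH, PATCH).unfold(2, PATCH, PATCH)
               .permute(1, 2, 0, 3, 4).reshape(-1, 3 * PATCH * PATCH))
    v = patch_embed(patches)                                   # (N_patches, D)
    return torch.cat([a, v], dim=0)  # early fusion: one sequence for the transformer

tokens = fuse(torch.randn(20, N_MEL), torch.randn(3, 360, 640))
print(tokens.shape)  # (20 + 144, 1024): audio frames and patches in one stream
```

The point of the design is visible in the last line: the transformer sees one fused token sequence per micro-turn, with no separately pretrained encoder in the path.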
On the inference side, the 200ms chunk design creates engineering challenges. Existing LLM inference libraries aren't optimized for frequent small prefills and carry significant per-turn overhead. Thinking Machines implemented **streaming sessions**, where the client sends each 200ms chunk as a separate request and the inference server appends chunks to a persistent sequence in GPU memory, avoiding repeated memory reallocations and metadata computations. A version of this has been upstreamed to SGLang, the open-source inference framework.
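A client-side sketch of the streaming-session pattern, under stated assumptions: the `/session/{id}/append` route, the field names, and the port are hypothetical illustrations, not SGLang's actual API:

```python
import requests

SERVER = "http://localhost:30000"  # assumption: a local inference server

def stream_session(session_id: str, chunks):
    """Send each 200ms chunk as its own request; the server appends it to a
    persistent sequence in GPU memory instead of re-prefilling the whole context."""
    for i, chunk in enumerate(chunks):
        r = requests.post(
            f"{SERVER}/session/{session_id}/append",   # illustrative route
            json={"chunk_index": i, "tokens": chunk},  # illustrative payload
        )
        r.raise_for_status()
        yield r.json().get("output_tokens", [])  # 200ms of generated output, if any
```

The design choice being sketched is the persistence: per-request state lives on the server across the whole session, so each tiny prefill costs only its own tokens.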
Additionally, they use a **gather+gemv strategy for MoE kernels** instead of a standard grouped GEMM, following prior work from PyTorch and Cursor, to optimize for the latency-sensitive shapes required by bidirectional serving.
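The idea behind gather+gemv, sketched in plain PyTorch: at latency-critical decode shapes each expert sees only a handful of tokens, so gathering the selected experts' weights and running per-token matrix-vector products can beat a padded grouped GEMM. This illustrates the technique, not the actual fused kernel:

```python
import torch

def moe_gather_gemv(x: torch.Tensor, expert_w: torch.Tensor,
                    topk_idx: torch.Tensor, topk_gate: torch.Tensor) -> torch.Tensor:
    """Latency-oriented MoE forward for tiny batches.

    x:         (T, D)        tokens
    expert_w:  (E, D, F)     expert weight matrices
    topk_idx:  (T, K)        selected expert ids per token
    topk_gate: (T, K)        router weights per token
    """
    w = expert_w[topk_idx]                    # gather: (T, K, D, F) selected weights
    y = torch.einsum("td,tkdf->tkf", x, w)    # K gemvs per token, no padding
    return torch.einsum("tk,tkf->tf", topk_gate, y)  # gate-weighted combine

T, K, E, D, F = 2, 2, 8, 64, 128              # micro-batch decode shapes (illustrative)
out = moe_gather_gemv(torch.randn(T, D), torch.randn(E, D, F),
                      torch.randint(0, E, (T, K)),
                      torch.softmax(torch.randn(T, K), dim=-1))
print(out.shape)  # (2, 128)
```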
*[Figure: https://thinkingmachines.ai/blog/interaction-models/]*

## Benchmarks: Where It Stands

The model, named `TML-Interaction-Small`, is a **276B-parameter Mixture-of-Experts (MoE)** with **12B active parameters**.

The benchmark table distinguishes between **Instant** models (no extended reasoning) and **Thinking** models (with reasoning). `TML-Interaction-Small` is an Instant model. Among the Instant models in the comparison, it achieves the highest score on **Audio MultiChallenge APR** at 43.4%, above GPT-realtime-2.0 (minimal) at 37.6%, GPT-realtime-1.5 at 34.7%, and Gemini-3.1-flash-live-preview (minimal) at 26.8%. The Thinking models, GPT-realtime-2.0 (xhigh) at 48.5% and Gemini-3.1-flash-live (high) at 36.1%, use extended reasoning to reach their scores.

On **FD-bench v1.5**, which measures interaction quality across user-interruption, backchanneling, talking-to-others, and background-speech scenarios, `TML-Interaction-Small` scores 77.8 average quality, compared to 54.3 for Gemini-3.1-flash-live (minimal), 48.3 for GPT-realtime-1.5, and 47.8 for GPT-realtime-2.0 (xhigh).

On **FD-bench v1** turn-taking latency, the model responds in 0.40 seconds, versus 0.57s for Gemini, 0.59s for GPT-realtime-1.5, and 1.18s for GPT-realtime-2.0 (minimal).

On **FD-bench v3**, which evaluates response quality and tool use (audio + tools combined), `TML-Interaction-Small` (with the background agent enabled) scores 82.8% Response Quality and 68.0% Pass@1, the highest in the comparison table.

*[Figure: https://thinkingmachines.ai/blog/interaction-models/]*

The Thinking Machines research team also introduced new internal benchmarks targeting capabilities that no existing model handles:

- **TimeSpeak**: tests whether the model initiates speech at user-specified times with the correct content. TML: 64.7 macro-accuracy vs. 4.3 for GPT-realtime-2.0 (minimal).
- **CueSpeak**: tests whether the model responds to verbal cues at the correct moment. TML: 81.7 vs. 2.9.
- **RepCount-A** (adapted from an existing repetition-counting dataset): tests visual counting of repeated physical actions in a streaming setting. TML: 35.4 off-by-one accuracy vs. 1.3.
- **ProactiveVideoQA** (adapted benchmark): tests whether the model answers a question at the exact moment the answer becomes visually available in a streamed video. TML: 33.5 PAUC@ω=0.5 vs. 25.0 (the no-response baseline).
- **Charades** (adapted for temporal action localization; the mIoU metric is sketched after this list): the model is asked to say "start" and "stop" as an action begins and ends in a streamed video. TML: 32.4 mIoU vs. 0 for GPT-realtime-2.0 (minimal), a clean zero.

So far, no existing model can meaningfully perform any of these tasks.
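For the Charades adaptation, mIoU is presumably the standard temporal intersection-over-union between the model's spoken ("start", "stop") interval and the ground-truth action span, averaged over clips; a minimal sketch:

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU between a predicted (start, stop) interval and ground truth, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Model said "start" at 3.1s and "stop" at 9.0s; the action truly spans 2.8s-8.5s.
print(round(temporal_iou((3.1, 9.0), (2.8, 8.5)), 3))  # 0.871
```

A turn-based model that never speaks unprompted produces no interval at all, which is how a 0 mIoU arises.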
## What You Can Actually Do

Because interactivity is native to the model, the following are built-in behaviors rather than harness features (the time-awareness item is illustrated with a sketch after this list):

- **Simultaneous speech**: speak and listen at the same time (e.g., live translation from Spanish to English as you talk).
- **Verbal interjections**: the model jumps in mid-sentence based on context, not just when you stop talking.
- **Visual proactivity**: the model reacts to what it sees on camera without you saying anything (e.g., counting pushups, flagging a code bug it sees).
- **Time-awareness**: the model tracks elapsed time and can initiate speech at user-specified moments.
- **Concurrent tool use**: searches the web, calls tools, and generates UI while the conversation is still in progress.
- **Seamless dialog management**: tracks pauses, self-corrections, and yield signals without a separate VAD component.
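In the real system, time-awareness is a learned behavior of the model itself, not client-side scripting; the hypothetical sketch below only illustrates why a loop that ticks every 200ms, even during silence, makes proactive timed speech possible at all:

```python
import time

CHUNK_MS = 200

def timed_speech_loop(say, reminders: dict[float, str]) -> None:
    """Because the loop ticks every 200ms even while the user is silent,
    there is an opportunity to initiate speech at any requested moment."""
    start = time.monotonic()
    pending = dict(reminders)            # {seconds_from_now: utterance}
    while pending:
        time.sleep(CHUNK_MS / 1000)      # one micro-turn elapses
        elapsed = time.monotonic() - start
        for t in sorted(pending):
            if elapsed >= t:
                say(pending.pop(t))      # proactive, un-prompted output
                break

timed_speech_loop(print, {0.4: "Ten seconds are up! Time to flip the pancake."})
```

A turn-based API has no equivalent hook: with no pending user turn, there is simply no moment at which it can decide to speak.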
## Getting Access

As of May 2026, Thinking Machines Lab is opening a **limited research preview** to collect feedback, with a wider release planned later in 2026. Note that this is a research preview, not a production API; access is gated and limited during this phase.

- **Apply for early access**: contact the team via thinkingmachines.ai (email link on the blog post).
- **Research grant program**: a grant is available for work on interaction-model benchmarks, evaluation frameworks, and human-AI collaboration research.
- **Follow Thinking Machines Lab**: updates and wider-release announcements at thinkingmachines.ai.
- **Contribute benchmarks**: the lab explicitly invites the community to develop new frameworks for measuring interactivity quality, an area it considers underserved.

## Limitations

Thinking Machines Lab is transparent about where the current system falls short:

- **Long sessions**: continuous audio and video accumulate context fast; very long sessions still require careful context management, an active area of work.
- **Network dependency**: streaming 200ms chunks requires reliable connectivity, and poor connections significantly degrade the experience.
- **Model size**: larger pretrained models exist but are currently too slow to serve in real time; larger variants are planned for later in 2026.
- **Safety and alignment**: real-time interaction opens new alignment research questions, and feedback collection is active. HarmBench refusal rate: 99.0%.

*Source: Thinking Machines Lab, "Interaction Models: A Scalable Approach to Human-AI Collaboration," May 2026 (thinkingmachines.ai/blog/interaction-models).*

## Key Takeaways

- Thinking Machines Lab's interaction model handles real-time audio, video, and text natively: no VAD harness, no turn boundaries, no stitched-together components.
- The architecture splits into two models: an interaction model that stays live with the user and a background model that handles reasoning and tool use asynchronously, with full conversation context shared throughout.
- 200ms micro-turns replace the standard request-response loop, enabling simultaneous speech, visual proactivity, and live tool calls without waiting for a user turn to end.
- On FD-bench v1.5 (interaction quality), TML-Interaction-Small scores 77.8, versus 54.3 for Gemini and 47.8 for GPT-realtime-2.0 (xhigh), while also leading all Instant models on the Audio MultiChallenge intelligence benchmark.
- Existing real-time APIs score near zero on time-awareness and visual-proactivity benchmarks (TimeSpeak, CueSpeak, Charades, RepCount-A); TML-Interaction-Small is the only model that can meaningfully perform these tasks today.

---

Check out the [technical details](https://thinkingmachines.ai/blog/interaction-models/). Feel free to follow us on [Twitter](https://x.com/intent/follow?screen_name=marktechpost), join our [150k+ ML SubReddit](https://www.reddit.com/r/machinelearningnews/), and subscribe to [our Newsletter](https://www.aidevsignals.com/).
You can also [join us on Telegram](https://t.me/machinelearningresearchnews).

Need to partner with us to promote your GitHub repo, Hugging Face page, product release, or webinar? [Connect with us](https://forms.gle/MTNLpmJtsFA3VRVd9).

The post [Mira Murati's Thinking Machines Lab Introduces Interaction Models: A Native Multimodal Architecture for Real-Time Human-AI Collaboration](https://www.marktechpost.com/2026/05/13/mira-muratis-thinking-machines-lab-introduces-interaction-models-a-native-multimodal-architecture-for-real-time-human-ai-collaboration/) appeared first on [MarkTechPost](https://www.marktechpost.com/).