{"id":15251,"date":"2025-05-28T02:15:00","date_gmt":"2025-05-28T02:15:00","guid":{"rendered":"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/"},"modified":"2025-05-28T02:15:00","modified_gmt":"2025-05-28T02:15:00","slug":"meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models","status":"publish","type":"post","link":"https:\/\/youzum.net\/fr\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/","title":{"rendered":"Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language Models"},"content":{"rendered":"<p>Multi-modal large language models (MLLMs) have shown great progress as versatile AI assistants capable of handling diverse visual tasks. However, their deployment as isolated digital entities limits their potential impact. The growing demand to integrate MLLMs into real-world applications like robotics and autonomous vehicles requires complex spatial understanding. Current MLLMs show fundamental spatial reasoning deficiencies, often failing at basic tasks such as distinguishing left from right. While previous research attributes these limitations to insufficient specialized training data and solves them through spatial data incorporation during training, these approaches focus on single-image scenarios, thus restricting the model\u2019s perception to static field-of-view analysis without dynamic information.<\/p>\n<p>Several research methods have tried to address spatial understanding limitations in MLLMs. MLLMs incorporate image encoders that convert visual inputs into tokens processed alongside text in the language model\u2019s latent space. Previous research has focused on single-image spatial understanding, evaluating inter-object spatial relations, or spatial recognition. Some benchmarks like BLINK, UniQA-3D, and VSIBench extend beyond single images. Existing improvements of MLLMs for spatial understanding include SpatialVLM, which fine-tunes models on curated spatial datasets, SpatialRGPT, which incorporates mask-based references and depth images, and SpatialPIN, which utilizes specialized perception models without fine-tuning.<\/p>\n<p>Researchers from FAIR Meta and the Chinese University of Hong Kong have proposed a framework to enhance MLLMs with robust multi-frame spatial understanding. This integrates three components: depth perception, visual correspondence, and dynamic perception to overcome the limitations of static single-image analysis. Researchers develop MultiSPA, a novel large-scale dataset containing over 27 million samples spanning diverse 3D and 4D scenes. The resulting Multi-SpatialMLLM model achieves significant improvements over baselines and proprietary systems, with scalable and generalizable multi-frame reasoning. Further, five tasks are introduced to generate training data: depth perception, visual correspondence, camera movement perception, object movement perception, and object size perception.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter is-resized\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A?key=lfb9T0KaVHlTlLZwNKpKVA\" alt=\"\" \/><\/figure>\n<\/div>\n<p>The Multi-SpatialMLLM centers around the MultiSPA data generation pipeline and comprehensive benchmark system. The data format follows standard MLLM fine-tuning strategies, which have the format of QA pairs: User: &lt;image&gt;\u2026&lt;image&gt;{description}{question} and Assistant: {answer}. Researchers used the GPT-4o to generate diverse templates for task descriptions, questions, and answers. Further, high-quality annotated scene datasets are used, including 4D datasets Aria Digital Twin and Panoptic Studio, along with 3D tracking annotations from TAPVid3D for object movement perception and ScanNet for other spatial tasks. The MultiSPA generates over 27M QA samples from 1.1M unique images, with 300 samples held out for each subtask evaluation, totaling 7,800 benchmark samples.<\/p>\n<p>On the MultiSPA benchmark, the Multi-SpatialMLLM achieves an average 36% gain over base models, reaching 80-90% accuracy on qualitative tasks compared to 50% for baseline models while outperforming all proprietary systems. Even on challenging tasks like predicting camera movement vectors, it attains 18% accuracy versus near-zero performance from other baselines. On the BLINK benchmark, Multi-SpatialMLLM achieves nearly 90% accuracy with an average 26.4% improvement over base models, surpassing several proprietary systems and showing transferable multi-frame spatial understanding. Standard VQA benchmark evaluations show rough parity with original performance, indicating the model maintains general-purpose MLLM proficiency without overfitting to spatial reasoning tasks.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter is-resized\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXeiOS5xSS6pGtnq1FbK24XnTtqrHsczV8YQsEhNnuiYHh9iXSMn0dbbPPgplkXlg16ov3RCXrXC3Hr7XVZzUa_PX7yC4Xv2_W8Uk5AB7zRfsclXmaQsZl6Tg5pJMgjwoSCVo-ab_g?key=lfb9T0KaVHlTlLZwNKpKVA\" alt=\"\" \/><\/figure>\n<\/div>\n<p>In this paper, researchers extend MLLMs\u2019 spatial understanding to multi-frame scenarios, addressing a critical gap overlooked in previous investigations. They introduced MultiSPA, the first large-scale dataset and benchmark for multi-frame spatial reasoning tasks. Experimental validation shows the effectiveness, scalability, and strong generalization capabilities of the proposed Multi-SpatialMLLM across diverse spatial understanding challenges. The research reveals significant insights, including multi-task learning benefits and emergent behaviors in complex spatial reasoning. The model establishes new applications, including acting as a multi-frame reward annotator.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p><strong>Check out the <a href=\"https:\/\/arxiv.org\/abs\/2505.17015\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a>, <a href=\"https:\/\/runsenxu.com\/projects\/Multi-SpatialMLLM\/\" target=\"_blank\" rel=\"noreferrer noopener\">Project Page<\/a> and <a href=\"https:\/\/github.com\/facebookresearch\/Multi-SpatialMLLM\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub Page<\/a><em>.<\/em><\/strong>\u00a0All credit for this research goes to the researchers of this project. Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">95k+ ML SubReddit<\/a><\/strong> and Subscribe to <strong><a href=\"https:\/\/www.airesearchinsights.com\/subscribe\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2025\/05\/27\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/\">Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language Models<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Multi-modal large language models (MLLMs) have shown great progress as versatile AI assistants capable of handling diverse visual tasks. However, their deployment as isolated digital entities limits their potential impact. The growing demand to integrate MLLMs into real-world applications like robotics and autonomous vehicles requires complex spatial understanding. Current MLLMs show fundamental spatial reasoning deficiencies, often failing at basic tasks such as distinguishing left from right. While previous research attributes these limitations to insufficient specialized training data and solves them through spatial data incorporation during training, these approaches focus on single-image scenarios, thus restricting the model\u2019s perception to static field-of-view analysis without dynamic information. Several research methods have tried to address spatial understanding limitations in MLLMs. MLLMs incorporate image encoders that convert visual inputs into tokens processed alongside text in the language model\u2019s latent space. Previous research has focused on single-image spatial understanding, evaluating inter-object spatial relations, or spatial recognition. Some benchmarks like BLINK, UniQA-3D, and VSIBench extend beyond single images. Existing improvements of MLLMs for spatial understanding include SpatialVLM, which fine-tunes models on curated spatial datasets, SpatialRGPT, which incorporates mask-based references and depth images, and SpatialPIN, which utilizes specialized perception models without fine-tuning. Researchers from FAIR Meta and the Chinese University of Hong Kong have proposed a framework to enhance MLLMs with robust multi-frame spatial understanding. This integrates three components: depth perception, visual correspondence, and dynamic perception to overcome the limitations of static single-image analysis. Researchers develop MultiSPA, a novel large-scale dataset containing over 27 million samples spanning diverse 3D and 4D scenes. The resulting Multi-SpatialMLLM model achieves significant improvements over baselines and proprietary systems, with scalable and generalizable multi-frame reasoning. Further, five tasks are introduced to generate training data: depth perception, visual correspondence, camera movement perception, object movement perception, and object size perception. The Multi-SpatialMLLM centers around the MultiSPA data generation pipeline and comprehensive benchmark system. The data format follows standard MLLM fine-tuning strategies, which have the format of QA pairs: User: &lt;image&gt;\u2026&lt;image&gt;{description}{question} and Assistant: {answer}. Researchers used the GPT-4o to generate diverse templates for task descriptions, questions, and answers. Further, high-quality annotated scene datasets are used, including 4D datasets Aria Digital Twin and Panoptic Studio, along with 3D tracking annotations from TAPVid3D for object movement perception and ScanNet for other spatial tasks. The MultiSPA generates over 27M QA samples from 1.1M unique images, with 300 samples held out for each subtask evaluation, totaling 7,800 benchmark samples. On the MultiSPA benchmark, the Multi-SpatialMLLM achieves an average 36% gain over base models, reaching 80-90% accuracy on qualitative tasks compared to 50% for baseline models while outperforming all proprietary systems. Even on challenging tasks like predicting camera movement vectors, it attains 18% accuracy versus near-zero performance from other baselines. On the BLINK benchmark, Multi-SpatialMLLM achieves nearly 90% accuracy with an average 26.4% improvement over base models, surpassing several proprietary systems and showing transferable multi-frame spatial understanding. Standard VQA benchmark evaluations show rough parity with original performance, indicating the model maintains general-purpose MLLM proficiency without overfitting to spatial reasoning tasks. In this paper, researchers extend MLLMs\u2019 spatial understanding to multi-frame scenarios, addressing a critical gap overlooked in previous investigations. They introduced MultiSPA, the first large-scale dataset and benchmark for multi-frame spatial reasoning tasks. Experimental validation shows the effectiveness, scalability, and strong generalization capabilities of the proposed Multi-SpatialMLLM across diverse spatial understanding challenges. The research reveals significant insights, including multi-task learning benefits and emergent behaviors in complex spatial reasoning. The model establishes new applications, including acting as a multi-frame reward annotator. Check out the Paper, Project Page and GitHub Page.\u00a0All credit for this research goes to the researchers of this project. Also,\u00a0feel free to follow us on\u00a0Twitter\u00a0and don\u2019t forget to join our\u00a095k+ ML SubReddit and Subscribe to our Newsletter. The post Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language Models appeared first on MarkTechPost.<\/p>","protected":false},"author":2,"featured_media":15252,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-15251","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language Models - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/fr\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/\" \/>\n<meta property=\"og:locale\" content=\"fr_FR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language Models - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/fr\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2025-05-28T02:15:00+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1600\" \/>\n\t<meta property=\"og:image:height\" content=\"701\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u00c9crit par\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Dur\u00e9e de lecture estim\u00e9e\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language Models\",\"datePublished\":\"2025-05-28T02:15:00+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/\"},\"wordCount\":659,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH.png\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"fr-FR\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/\",\"url\":\"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/\",\"name\":\"Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language Models - YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH.png\",\"datePublished\":\"2025-05-28T02:15:00+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/#breadcrumb\"},\"inLanguage\":\"fr-FR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH.png\",\"width\":1600,\"height\":701},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language Models\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"fr-FR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/fr\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language Models - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/fr\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/","og_locale":"fr_FR","og_type":"article","og_title":"Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language Models - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/fr\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2025-05-28T02:15:00+00:00","og_image":[{"width":1600,"height":701,"url":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH.png","type":"image\/png"}],"author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"\u00c9crit par":"admin NU","Dur\u00e9e de lecture estim\u00e9e":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language Models","datePublished":"2025-05-28T02:15:00+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/"},"wordCount":659,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"fr-FR","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/","url":"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/","name":"Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language Models - YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH.png","datePublished":"2025-05-28T02:15:00+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/#breadcrumb"},"inLanguage":"fr-FR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/"]}]},{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/#primaryimage","url":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH.png","width":1600,"height":701},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/meta-ai-introduces-multi-spatialmllm-a-multi-frame-spatial-understanding-with-multi-modal-large-language-models\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language Models"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"fr-FR"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin NU"},"url":"https:\/\/youzum.net\/fr\/members\/adminnu\/"}]}},"rttpg_featured_image_url":{"full":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH.png",1600,701,false],"landscape":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH.png",1600,701,false],"portraits":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH.png",1600,701,false],"thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH-150x150.png",150,150,true],"medium":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH-300x131.png",300,131,true],"large":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH-1024x449.png",1024,449,true],"1536x1536":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH-1536x673.png",1536,673,true],"2048x2048":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH.png",1600,701,false],"trp-custom-language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH-18x8.png",18,8,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH-300x300.png",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH-600x263.png",600,263,true],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcHsTEvFTdgfGvJ52HG92YUbcvnGSu1LUD29Gm1kYbnIV4NdrtMt5nAmU48rqXd7A3zSnj_qCTka9nl_Qm-97m1RDhJBb4ZFPHJ4EH2vC8yh-ZTJQLhKX2ZcOait7zPgu3ZEv1K8A-jGxNzH-100x100.png",100,100,true]},"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/fr\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/fr\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/fr\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/fr\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/fr\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"Multi-modal large language models (MLLMs) have shown great progress as versatile AI assistants capable of handling diverse visual tasks. However, their deployment as isolated digital entities limits their potential impact. The growing demand to integrate MLLMs into real-world applications like robotics and autonomous vehicles requires complex spatial understanding. Current MLLMs show fundamental spatial reasoning deficiencies,\u2026","_links":{"self":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/posts\/15251","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/comments?post=15251"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/posts\/15251\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/media\/15252"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/media?parent=15251"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/categories?post=15251"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/tags?post=15251"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}