{"id":17794,"date":"2025-06-10T03:58:40","date_gmt":"2025-06-10T03:58:40","guid":{"rendered":"https:\/\/youzum.net\/vebrain-a-unified-multimodal-ai-framework-for-visual-reasoning-and-real-world-robotic-control\/"},"modified":"2025-06-10T03:58:40","modified_gmt":"2025-06-10T03:58:40","slug":"vebrain-a-unified-multimodal-ai-framework-for-visual-reasoning-and-real-world-robotic-control","status":"publish","type":"post","link":"https:\/\/youzum.net\/de\/vebrain-a-unified-multimodal-ai-framework-for-visual-reasoning-and-real-world-robotic-control\/","title":{"rendered":"VeBrain: A Unified Multimodal AI Framework for Visual Reasoning and Real-World Robotic Control"},"content":{"rendered":"<h3 class=\"wp-block-heading\"><strong>Bridging Perception and Action in Robotics<\/strong><\/h3>\n<p>Multimodal Large Language Models (MLLMs) hold promise for enabling machines, such as robotic arms and legged robots, to perceive their surroundings, interpret scenarios, and take meaningful actions. The integration of such intelligence into physical systems is advancing the field of robotics, pushing it toward autonomous machines that don\u2019t just see and describe but also plan and move within their environments based on contextual understanding.<\/p>\n<p>Despite the growing power of MLLMs, one persistent issue is their inability to combine vision, reasoning, and physical interaction into one cohesive system. Typically, models trained to understand images or text fall short when asked to control robots in real-world spaces. The core problem is that understanding a scene is fundamentally different from acting within it. Multimodal understanding focuses on perception and analysis, while physical control needs precise, real-time decision-making based on that perception. This disconnect creates bottlenecks when attempting to build agents that must simultaneously observe, reason, and act in varied environments.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter is-resized\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXfP4MqBAsghsnFcXOOWbIZGyVpknWccz8QqfFrDV4ylrTW5yQOjMYn4jJOBiv77qIp7mfLRkDD7zM180FRGDo6BH1_prIilvDdK1efgsEZbV2hFKdyZHfcI5zhIa7JZuk9zJ7jQfw?key=XOkG0tdrp96PXbSpwJpJFQ\" alt=\"\"\/><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Limitations of Prior VLA Models<\/strong><\/h3>\n<p>Previous tools designed for robot control rely heavily on vision-language-action (VLA) models. These models train on extensive robotic datasets to convert visual observations into control signals. While some solutions try to preserve the reasoning capability of MLLMs by translating commands into text-based actions, they face difficulty in maintaining accuracy and adaptability during control tasks. For instance, VLAs often degrade in performance when applied to diverse or long-horizon robotic operations. Furthermore, due to the gap between image-based understanding and motion control, these tools usually fail to generalize across different environments or robot types.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Introducing VeBrain: A Unified Multimodal Framework<\/strong><\/h3>\n<p>Researchers from Shanghai AI Laboratory, Tsinghua University, and SenseTime Research have introduced a unified framework called Visual Embodied Brain (VeBrain) in collaboration with multiple other institutes. VeBrain reformulates robot control as text-based tasks within a 2D visual space, aligning it more closely with how MLLMs function. 
<h3><strong>Technical Components: Architecture and Robotic Adapter</strong></h3>
<p>VeBrain builds on a Qwen2.5-VL architecture, augmented with components that enable real-world control. The robotic adapter contains four key modules. The point tracker updates 2D keypoints as the robot's view changes, ensuring accurate targeting. The movement controller converts 2D keypoints into 3D movements by combining image data with depth maps. The skill executor maps predicted actions, such as "turn" or "grasp," to pre-trained robotic skills. Finally, the dynamic takeover module monitors for failures and anomalies, handing control back to the MLLM when needed. Together, these modules form a closed-loop system that decides, acts, and self-corrects, allowing robots to operate effectively in diverse situations.</p>
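<p>The sketch below shows how such an adapter could fit together. The pinhole back-projection in <code>lift_to_3d</code> is the standard way to combine a 2D keypoint with an aligned depth map; the surrounding loop and the <code>tracker</code>, <code>executor</code>, and <code>camera</code> interfaces are hypothetical stand-ins for the modules described above, not VeBrain's actual API.</p>
<pre><code class="language-python">import numpy as np

def lift_to_3d(u, v, depth, fx, fy, cx, cy):
    """Movement-controller step: back-project pixel (u, v) into a 3D point
    in the camera frame using an aligned depth map and pinhole intrinsics."""
    z = float(depth[v, u])      # depth in meters at the keypoint
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

def run_adapter(actions, tracker, executor, camera):
    """Illustrative closed loop: track each keypoint, lift it to 3D, dispatch
    a pre-trained skill, and signal a takeover to the MLLM on failure.
    All three component interfaces are hypothetical stand-ins."""
    for act in actions:
        frame, depth = camera.capture()              # RGB image + aligned depth map
        target = None
        if act.point is not None:
            u, v = tracker.update(act.point, frame)  # point tracker follows the target
            target = lift_to_3d(u, v, depth, *camera.intrinsics)
        if not executor.run(act.skill, target):      # skill executor, e.g. "grasp"
            return "takeover"                        # dynamic takeover: re-plan with the MLLM
    return "done"

# Runnable check of the geometry alone: a flat 0.8 m depth map and
# made-up intrinsics for a 640x480 camera.
depth = np.full((480, 640), 0.8)
print(lift_to_3d(412, 305, depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0))
# -> approximately [0.123, 0.087, 0.800]
</code></pre>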
<h3><strong>Performance Evaluation Across Multimodal and Robotic Benchmarks</strong></h3>
<p>VeBrain was evaluated across 13 multimodal and 5 spatial benchmarks. On MMVet, it achieved a 5.6% improvement over Qwen2.5-VL; it scored 101.5 CIDEr on ScanQA and 83.7 on MMBench. On the VSI benchmark it averaged 39.9, ahead of Qwen2.5-VL's 35.9. In robotic evaluations, VeBrain reached 86.4% success across seven legged-robot tasks, far surpassing models such as VLA and π0, which scored 32.1% and 31.4%, respectively. On robotic-arm tasks it achieved a 74.3% success rate, outperforming the baselines by up to 80%. These results demonstrate VeBrain's ability to handle long-horizon, spatially complex control challenges with high reliability.</p>
<figure><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXftWWSghBE31h9vmDlWbRI0zXldW5CxmBC65FNi6R28Srzfjjv40QT4WQDS0WKy23OxRo5E91i-mAZcpFOVKHLN4s1jC4zHPFfDUZuoVl7GJS4w8SzKLLRzN-B4nAt7hRkY9IB_?key=XOkG0tdrp96PXbSpwJpJFQ" alt=""/></figure>

<h3><strong>Conclusion</strong></h3>
<p>The research presents a compelling direction for embodied AI. By redefining robot control as a language task, the researchers allow high-level reasoning and low-level action to coexist in a single model, bridging the gap between image understanding and robot execution in a way that is both functional and scalable. With a robust design and strong performance, VeBrain signals a shift toward more unified, intelligent robotic systems capable of operating autonomously across diverse tasks and environments.</p>
<hr/>
<p><strong>Check out the <a href="https://arxiv.org/abs/2506.00123">Paper</a> and the <a href="https://github.com/OpenGVLab/VeBrain">GitHub page</a>.</strong> All credit for this research goes to the researchers of this project.</p>