{"id":14833,"date":"2025-05-26T02:03:20","date_gmt":"2025-05-26T02:03:20","guid":{"rendered":"https:\/\/youzum.net\/this-ai-paper-introduces-grit-a-method-for-teaching-mllms-to-reason-with-images-by-interleaving-text-and-visual-grounding\/"},"modified":"2025-05-26T02:03:20","modified_gmt":"2025-05-26T02:03:20","slug":"this-ai-paper-introduces-grit-a-method-for-teaching-mllms-to-reason-with-images-by-interleaving-text-and-visual-grounding","status":"publish","type":"post","link":"https:\/\/youzum.net\/es\/this-ai-paper-introduces-grit-a-method-for-teaching-mllms-to-reason-with-images-by-interleaving-text-and-visual-grounding\/","title":{"rendered":"This AI Paper Introduces GRIT: A Method for Teaching MLLMs to Reason with Images by Interleaving Text and Visual Grounding"},"content":{"rendered":"<p>The core idea of Multimodal Large Language Models (MLLMs) is to create models that can combine the richness of visual content with the logic of language. However, despite advances in this field, many models struggle to connect the two domains effectively, leading to limited performance in complex reasoning tasks that involve visual components.<\/p>\n<p>A major challenge in building such models is their limited ability to combine visual understanding with logical thinking. Current systems often produce textual outputs that explain reasoning but fail to reference the specific parts of an image they rely on. This creates a gap where models may arrive at an answer without clearly showing how the visual evidence contributed to their decision. It\u2019s also difficult to ensure that models generate visual reasoning steps directly connecting to their answers. 
<p>The fundamental problem is how to train models to naturally interleave text and image reasoning without large datasets annotated with visual references, which are scarce and expensive to produce.</p>
<div class="wp-block-image">
<figure class="aligncenter is-resized"><img decoding="async" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXezsDL73CqlK5IKxUVWWpNYVWDfR3ce4LkqFT-n-I8Jbg8wh3H_7Q7lrxRhJJjIRC45EXtf4gp1wsderQetAWj94VlyjJoiv3gT6VGnGESlnYlLVo01vCgYPB7jegPgx2fp_RbtSA?key=rw8Ifcq-EgXTOnHt1IJ0zQ" alt=""/></figure>
</div>
<p>Existing methods try to address this with reinforcement learning or prompting strategies. Some systems output bounding-box coordinates as answers, while others produce step-by-step textual reasoning chains. Both approaches have limitations: models that only produce bounding boxes offer no explanation, while those generating only text risk ignoring the visual evidence. Because previous methods often separate visual grounding from reasoning, it is hard for models to explain why a particular visual element leads to a certain conclusion. Approaches that rely on dense supervision data or additional tools require heavy annotation and do not scale well. This makes it difficult to build models that explain their reasoning transparently and handle varied visual tasks with minimal data.</p>
<p>Researchers from UC Santa Cruz and eBay introduced Grounded Reasoning with Images and Text (GRIT), a method that allows MLLMs such as Qwen 2.5-VL and InternVL 3 to generate reasoning chains that mix natural language with explicit bounding-box coordinates pointing to relevant image regions. This unified approach lets models reason about and visually ground their answers without requiring dense annotations or labeled reasoning chains.</p>
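To make the interleaved format concrete, here is a minimal sketch of how bounding boxes might be pulled out of a GRIT-style reasoning chain. The example string, the `[x1, y1, x2, y2]` serialization, and the `extract_boxes` helper are illustrative assumptions; the paper's exact token format may differ.

```python
import re

# Hypothetical GRIT-style output: free-form reasoning interleaved with
# bounding-box coordinates (the exact serialization is an assumption here).
reasoning = (
    "<think>The question asks about the mug. "
    "The mug [112, 84, 198, 167] sits left of the laptop [240, 60, 512, 300], "
    "so the answer is left.</think> left"
)

BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")

def extract_boxes(text: str) -> list[tuple[int, int, int, int]]:
    """Pull every [x1, y1, x2, y2] box out of an interleaved reasoning chain."""
    return [tuple(map(int, m.groups())) for m in BOX_PATTERN.finditer(text)]

boxes = extract_boxes(reasoning)
print(boxes)  # [(112, 84, 198, 167), (240, 60, 512, 300)]
```

Because the boxes live inside the text stream rather than in a separate output head, the same parse can recover both the answer and the regions the model claims to have used.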
<p>GRIT also uses a lightweight reinforcement learning algorithm called GRPO-GR, which optimizes both the accuracy of the final answer and the structure of the reasoning, encouraging models to emit specific tokens such as &lt;think&gt; and &lt;rethink&gt; as well as properly formatted bounding boxes. This design removes the need for costly annotated data while ensuring that models learn to reference visual content meaningfully within their logical steps.</p>
<div class="wp-block-image">
<figure class="aligncenter is-resized"><img decoding="async" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcuKcwE1D2Jl6yrYbFiDr8FZXxmWWwIvaThhPqF6Nh4-52ntvG0HXoihu_U1G0v8b3w-ZcMxrw9eldTaMSBvrAf0mB-uvwLQkyvydwTzxfevM6J-bS7at-x6IyJ2FM4AT5A01exqw?key=rw8Ifcq-EgXTOnHt1IJ0zQ" alt=""/></figure>
</div>
<p>The GRIT methodology focuses on generating outputs that combine textual reasoning and visual grounding seamlessly. Instead of requiring models to process cropped images or additional visual data after generating bounding boxes, GRIT teaches them to use their internal understanding of the image: bounding boxes are generated during the reasoning process, and models learn to reflect on those coordinates within their logical steps. The reinforcement learning framework rewards the correct use of bounding-box formats and reasoning structure, guiding models toward coherent, grounded reasoning chains. GRIT is remarkably data-efficient, using only 20 image-question-answer triplets drawn from the Visual Spatial Reasoning and TallyQA datasets. Training was conducted on NVIDIA A100 GPUs with the AdamW optimizer and a cosine learning-rate scheduler over 200 steps, underscoring how far the method stretches a very small training set.</p>
<p>Performance evaluations revealed that GRIT-trained models outperform several baselines in reasoning and grounding accuracy.</p>
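The structure-plus-accuracy reward described above can be sketched as follows. This is a simplified illustration under stated assumptions: the equal 0.5 weights, the exact-match accuracy check, and the two format checks are placeholders, not GRPO-GR's actual reward definition.

```python
import re

# Illustrative format checks: a closed <think> span and at least one
# [x1, y1, x2, y2] bounding box. GRPO-GR's real reward terms differ.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
BOX_RE = re.compile(r"\[\s*\d+\s*,\s*\d+\s*,\s*\d+\s*,\s*\d+\s*\]")

def format_reward(output: str) -> float:
    """Reward well-formed reasoning structure (hypothetical equal weights)."""
    has_think = THINK_RE.search(output) is not None
    has_box = BOX_RE.search(output) is not None
    return 0.5 * has_think + 0.5 * has_box

def total_reward(output: str, predicted: str, gold: str) -> float:
    """Combine answer correctness with the format reward (illustrative)."""
    accuracy = float(predicted.strip().lower() == gold.strip().lower())
    return accuracy + format_reward(output)

out = "<think>The cat [10, 20, 50, 80] is on the sofa.</think> yes"
print(total_reward(out, "yes", "yes"))  # 2.0
```

The key property this sketch shares with GRPO-GR is that neither term requires annotated reasoning chains or ground-truth boxes: correctness and output structure alone supply the training signal.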
<p>For example, Qwen 2.5-VL trained with GRIT achieved 72.9% answer accuracy on Visual Spatial Reasoning, 47.8% on TallyQA, and 62.8% on GQA. It also reached a grounding IoU of 0.325 on VSR and 0.447 on TallyQA. In contrast, baselines such as Direct Query and Chain-of-Thought scored significantly lower, showing a limited ability to unify reasoning with visual grounding. GRIT models exhibited a strong correlation between visual regions and textual reasoning, producing outputs that reflect a meaningful connection between image evidence and logical thought. GRIT also improved on out-of-domain benchmarks, though gains were more pronounced on in-domain data, highlighting the importance of training-data diversity for broader generalization.</p>
<div class="wp-block-image">
<figure class="aligncenter is-resized"><img decoding="async" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeuTjltnJQViOjY8k3mQtoSsGHYTCgCQZ6K8jeQKHqbsIqUk0F0rLjRRi6MVl-0skyT9LBSYFeTGE6Dit-5SJtJAsL1JmIPoR7Uhr6vXqHUNl1PnM1mP93HUe3OiW7bm8kLemsi2Q?key=rw8Ifcq-EgXTOnHt1IJ0zQ" alt=""/></figure>
</div>
<p>In conclusion, the research addresses the problem of disconnected reasoning and visual grounding in MLLMs by introducing GRIT, which lets models reason with images through a simple, efficient approach that requires minimal data.</p>
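For reference, the grounding IoU metric behind the scores reported above is the standard intersection-over-union between two boxes; how the paper aggregates it across multiple predicted boxes is not reproduced here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 0.14285714285714285
```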
<p>GRIT successfully teaches MLLMs to combine visual evidence with logical reasoning in a unified output, achieving strong performance across multiple benchmarks and marking a promising step toward more interpretable AI systems.</p>
<hr class="wp-block-separator has-alpha-channel-opacity"/>
<p><strong>Check out the <a href="https://arxiv.org/abs/2505.15879" target="_blank" rel="noreferrer noopener">Paper</a>, <a href="https://grounded-reasoning.github.io/" target="_blank" rel="noreferrer noopener">Project</a>, and <a href="https://github.com/eric-ai-lab/GRIT" target="_blank" rel="noreferrer noopener">GitHub Page</a>.</strong> All credit for this research goes to the researchers of this project.</p>
<p>The post <a href="https://www.marktechpost.com/2025/05/24/this-ai-paper-introduces-grit-a-method-for-teaching-mllms-to-reason-with-images-by-interleaving-text-and-visual-grounding/">This AI Paper Introduces GRIT: A Method for Teaching MLLMs to Reason with Images by Interleaving Text and Visual Grounding</a> appeared first on <a href="https://www.marktechpost.com/">MarkTechPost</a>.</p>
],"trp-custom-language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXezsDL73CqlK5IKxUVWWpNYVWDfR3ce4LkqFT-n-I8Jbg8wh3H_7Q7lrxRhJJjIRC45EXtf4gp1wsderQetAWj94VlyjJoiv3gT6VGnGESlnYlLVo01vCgYPB7jegPgx2fp_RbtSA-74Iwpl-18x9.png",18,9,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXezsDL73CqlK5IKxUVWWpNYVWDfR3ce4LkqFT-n-I8Jbg8wh3H_7Q7lrxRhJJjIRC45EXtf4gp1wsderQetAWj94VlyjJoiv3gT6VGnGESlnYlLVo01vCgYPB7jegPgx2fp_RbtSA-74Iwpl-300x300.png",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXezsDL73CqlK5IKxUVWWpNYVWDfR3ce4LkqFT-n-I8Jbg8wh3H_7Q7lrxRhJJjIRC45EXtf4gp1wsderQetAWj94VlyjJoiv3gT6VGnGESlnYlLVo01vCgYPB7jegPgx2fp_RbtSA-74Iwpl-600x296.png",600,296,true],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXezsDL73CqlK5IKxUVWWpNYVWDfR3ce4LkqFT-n-I8Jbg8wh3H_7Q7lrxRhJJjIRC45EXtf4gp1wsderQetAWj94VlyjJoiv3gT6VGnGESlnYlLVo01vCgYPB7jegPgx2fp_RbtSA-74Iwpl-100x100.png",100,100,true]},"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/es\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/es\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/es\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/es\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/es\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"The core idea of Multimodal Large Language Models (MLLMs) is to create models that can combine the richness of visual content with the logic of language. However, despite advances in this field, many models struggle to connect the two domains effectively, leading to limited performance in complex reasoning tasks that involve visual components. 
A major&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/posts\/14833","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/comments?post=14833"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/posts\/14833\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/media\/14834"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/media?parent=14833"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/categories?post=14833"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/es\/wp-json\/wp\/v2\/tags?post=14833"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}