{"id":27067,"date":"2025-07-24T06:24:34","date_gmt":"2025-07-24T06:24:34","guid":{"rendered":"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/"},"modified":"2025-07-24T06:24:34","modified_gmt":"2025-07-24T06:24:34","slug":"gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks","status":"publish","type":"post","link":"https:\/\/youzum.net\/zh\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/","title":{"rendered":"GPT-4o Understands Text, But Does It See Clearly? A Benchmarking Study of MFMs on Vision Tasks"},"content":{"rendered":"<p>Multimodal foundation models (MFMs) like GPT-4o, Gemini, and Claude have shown rapid progress recently, especially in public demos. While their language skills are well studied, their true ability to understand visual information remains unclear. Most benchmarks used today focus heavily on text-based tasks, such as VQA or classification, which often reflect language strengths more than visual capabilities. These tests also require text outputs, making it difficult to assess visual skills fairly or to compare MFMs with vision-specific models. Moreover, critical aspects such as 3D perception, segmentation, and grouping, which are core to visual understanding, are still largely overlooked in current evaluations.\u00a0<\/p>\n<p>MFMs have demonstrated strong performance in tasks that combine visual and language understanding, such as captioning and visual question answering. However, their effectiveness in tasks that require detailed visual comprehension remains unclear. Most current benchmarks rely on text-based outputs, making it difficult to compare MFMs with vision-only models fairly. Some studies attempt to adapt vision datasets for MFMs by converting annotations into text, but this conversion still restricts evaluation to language outputs. 
Prompting strategies have also been explored to help MFMs tackle visual tasks by breaking them into manageable subtasks, though reproducibility remains a challenge in some cases.\u00a0<\/p>\n<p>Researchers at EPFL evaluated several popular multimodal foundation models, such as GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet, on core computer vision tasks, including segmentation, object detection, and depth prediction, using datasets like COCO and ImageNet. Since most MFMs are designed to output text and are only accessible via APIs, they developed a prompt-chaining framework to translate these visual tasks into text-compatible formats. Their findings show that while MFMs are competent generalists, they fall short of specialized vision models, especially in geometric tasks. GPT-4o stood out, performing best in 4 out of 6 tasks. The evaluation toolkit will be open-sourced.\u00a0<\/p>\n<p>To evaluate MFMs on vision tasks, the study designed a prompt-chaining strategy, breaking complex tasks into simpler, language-friendly subtasks. For example, instead of predicting bounding boxes directly, the model first identifies the objects present, then locates them through recursive image cropping. For segmentation and grouping, images are divided into superpixels, which are easier to label and compare. Depth and surface normals are estimated using pairwise rankings of superpixel regions. This modular design leverages MFMs\u2019 strength in classification and similarity, while calibration controls ensure fair comparisons. The method is flexible, and performance improves with finer-grained prompting.\u00a0<\/p>\n<p>The study evaluates various MFMs, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet, across multiple tasks, such as image classification, object detection, and segmentation. 
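The pairwise-ranking step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `ask_mfm_which_is_closer` helper is a hypothetical stand-in for a real MFM API query ("which highlighted region is closer to the camera?"), faked here with known depths so the snippet runs.

```python
from functools import cmp_to_key

def ask_mfm_which_is_closer(region_a, region_b):
    """Hypothetical stand-in for an MFM query; a real chain would send two
    highlighted superpixel regions to a vision API. Faked with known depths
    here so the sketch is runnable."""
    return -1 if region_a["depth"] < region_b["depth"] else 1

def rank_regions_by_depth(regions):
    """Recover a global near-to-far ordering of superpixel regions using
    only pairwise 'which is closer?' answers."""
    return sorted(regions, key=cmp_to_key(ask_mfm_which_is_closer))

regions = [
    {"id": "sky", "depth": 50.0},
    {"id": "car", "depth": 8.0},
    {"id": "person", "depth": 3.0},
]
ordering = [r["id"] for r in rank_regions_by_depth(regions)]
print(ordering)  # nearest region first
```

In practice the chain would batch queries, tolerate inconsistent answers, and apply the calibration controls the study mentions; the sketch only shows why pairwise classification-style questions suffice to recover a dense ordinal signal like depth.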
Using datasets like ImageNet, COCO, and Hypersim, the results show GPT-4o reaching 77.2% classification accuracy on ImageNet and 60.62 AP50 for object detection on COCO, outperformed by specialist models such as ViT-G (90.94% on ImageNet) and Co-DETR (91.30 AP50 on COCO). Semantic segmentation results show GPT-4o at 44.89 mIoU, while OneFormer leads with 65.52. MFMs handle distribution shifts reasonably well but lag on precise visual reasoning. The study also introduces prompt chaining and oracle baselines to evaluate upper-bound performance.\u00a0<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g?key=YCjiNTpDdykkGC_IFZvFPw\" alt=\"\" \/><\/figure>\n<\/div>\n<p>In conclusion, the study introduces a benchmarking framework to assess the visual capabilities of MFMs, such as GPT-4o, Gemini, and Claude, by converting standard vision tasks into prompt-based formats. Findings show MFMs perform better on semantic tasks than geometric ones, with GPT-4o leading overall. However, all MFMs lag significantly behind task-specific vision models. Despite being generalists trained primarily on image-text data, they show promising progress, with newer reasoning models such as o3 improving on 3D tasks. Limitations include high inference cost and prompt sensitivity. 
Still, this framework provides a unified approach to evaluating MFMs\u2019 visual understanding, laying the groundwork for future advancements.\u00a0<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the<strong>\u00a0<mark><a href=\"https:\/\/arxiv.org\/pdf\/2507.01955\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a>, <a href=\"https:\/\/github.com\/EPFL-VILAB\/fm-vision-evals\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub Page<\/a> and <a href=\"https:\/\/fm-vision-evals.epfl.ch\/\" target=\"_blank\" rel=\"noreferrer noopener\">Project<\/a><\/mark>.<\/strong>\u00a0All credit for this research goes to the researchers of this project.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2025\/07\/23\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/\">GPT-4o Understands Text, But Does It See Clearly? A Benchmarking Study of MFMs on Vision Tasks<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Multimodal foundation models (MFMs) like GPT-4o, Gemini, and Claude have shown rapid progress recently, especially in public demos. While their language skills are well studied, their true ability to understand visual information remains unclear. Most benchmarks used today focus heavily on text-based tasks, such as VQA or classification, which often reflect language strengths more than visual capabilities. These tests also require text outputs, making it difficult to fairly assess visual skills or compare MFMs with vision-specific models. 
Moreover, critical aspects such as 3D perception, segmentation, and grouping, which are core to visual understanding, are still largely overlooked in current evaluations.&hellip;\u00a0 
Meet the AI Dev Newsletter read by 40k+ Devs and Researchers from NVIDIA, OpenAI, DeepMind, Meta, Microsoft, JP Morgan Chase, Amgen, Aflac, Wells Fargo and 100s more\u00a0[SUBSCRIBE NOW] The post GPT-4o Understands Text, But Does It See Clearly? A Benchmarking Study of MFMs on Vision Tasks appeared first on MarkTechPost.<\/p>","protected":false},"author":2,"featured_media":27068,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-27067","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>GPT-4o Understands Text, But Does It See Clearly? 
A Benchmarking Study of MFMs on Vision Tasks - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/zh\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/\" \/>\n<meta property=\"og:locale\" content=\"zh_CN\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"GPT-4o Understands Text, But Does It See Clearly? A Benchmarking Study of MFMs on Vision Tasks - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/zh\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2025-07-24T06:24:34+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g?key=YCjiNTpDdykkGC_IFZvFPw\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u4f5c\u8005\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"\u9884\u8ba1\u9605\u8bfb\u65f6\u95f4\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 \u5206\" \/>\n<script 
type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"GPT-4o Understands Text, But Does It See Clearly? A Benchmarking Study of MFMs on Vision Tasks\",\"datePublished\":\"2025-07-24T06:24:34+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/\"},\"wordCount\":670,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g-eZhyf4.png\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"zh-Hans\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/\",\"url\":\"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/\",\"name\":\"GPT-4o Understands Text, But Does It See Clearly? 
A Benchmarking Study of MFMs on Vision Tasks - YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g-eZhyf4.png\",\"datePublished\":\"2025-07-24T06:24:34+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/#breadcrumb\"},\"inLanguage\":\"zh-Hans\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g-eZhyf4.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g-eZhyf4.png\",\"width\":441,\"height\":660},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-
clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"GPT-4o Understands Text, But Does It See Clearly? A Benchmarking Study of MFMs on Vision Tasks\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"zh-Hans\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin 
NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/zh\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"GPT-4o Understands Text, But Does It See Clearly? A Benchmarking Study of MFMs on Vision Tasks - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/zh\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/","og_locale":"zh_CN","og_type":"article","og_title":"GPT-4o Understands Text, But Does It See Clearly? 
A Benchmarking Study of MFMs on Vision Tasks - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/zh\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2025-07-24T06:24:34+00:00","og_image":[{"url":"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g?key=YCjiNTpDdykkGC_IFZvFPw","type":"","width":"","height":""}],"author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"\u4f5c\u8005":"admin NU","\u9884\u8ba1\u9605\u8bfb\u65f6\u95f4":"3 \u5206"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"GPT-4o Understands Text, But Does It See Clearly? 
A Benchmarking Study of MFMs on Vision Tasks","datePublished":"2025-07-24T06:24:34+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/"},"wordCount":670,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g-eZhyf4.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"zh-Hans","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/","url":"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/","name":"GPT-4o Understands Text, But Does It See Clearly? 
A Benchmarking Study of MFMs on Vision Tasks - YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g-eZhyf4.png","datePublished":"2025-07-24T06:24:34+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/#breadcrumb"},"inLanguage":"zh-Hans","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/"]}]},{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/#primaryimage","url":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g-eZhyf4.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g-eZhyf4.png","width":441,"height":660},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/gpt-4o-understands-text-but-does-it-see-clearly-a-benchmarking-study-of-mfms-on-vision-tasks\/#breadcrumb","itemListElemen
t":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"GPT-4o Understands Text, But Does It See Clearly? A Benchmarking Study of MFMs on Vision Tasks"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"zh-Hans"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin 
NU"},"url":"https:\/\/youzum.net\/zh\/members\/adminnu\/"}]}},"rttpg_featured_image_url":{"full":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g-eZhyf4.png",441,660,false],"landscape":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g-eZhyf4.png",441,660,false],"portraits":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g-eZhyf4.png",441,660,false],"thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g-eZhyf4-150x150.png",150,150,true],"medium":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g-eZhyf4-200x300.png",200,300,true],"large":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g-eZhyf4.png",441,660,false],"1536x1536":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g-eZhyf4.png",441,660,false],"2048x2048":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g-eZhyf4.png",441,660,false],"trp-custom-
language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g-eZhyf4-8x12.png",8,12,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g-eZhyf4-300x300.png",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g-eZhyf4.png",441,660,false],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/07\/AD_4nXcIr4r-44VfXJbkwrjCOcApnkj9qc6PZdOk-wCtMRKwkF6yKx73YKp7_kG7gY19duuJvUK2aZyATPhoZyOkEkAmkRXwRrIJ_pNUAQwnEEW5Mhnlg0zvunGkJjg9cICS9MadF45j5g-eZhyf4-100x100.png",100,100,true]},"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/zh\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/zh\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/zh\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/zh\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/zh\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"Multimodal foundation models (MFMs) like GPT-4o, Gemini, and Claude have shown rapid progress recently, especially in public demos. While their language skills are well studied, their true ability to understand visual information remains unclear. 
Most benchmarks used today focus heavily on text-based tasks, such as VQA or classification, which often reflect language strengths more than&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/posts\/27067","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/comments?post=27067"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/posts\/27067\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/media\/27068"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/media?parent=27067"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/categories?post=27067"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/zh\/wp-json\/wp\/v2\/tags?post=27067"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}