<h1>Multimodal LLMs Without Compromise: Researchers from UCLA, UW–Madison, and Adobe Introduce X-Fusion to Add Vision to Frozen Language Models Without Losing Language Capabilities</h1>
<p>LLMs have made significant strides in language tasks such as conversational AI, reasoning, and code generation. Human communication, however, extends beyond text and often relies on visual elements to aid understanding. A truly versatile AI therefore needs to process and generate text and visual information together. Training such unified vision-language models from scratch, whether by autoregressive token prediction or by a hybrid of diffusion and language losses, yields strong performance but requires vast computational resources and retraining for each new modality.
An alternative approach adds vision capabilities to pretrained LLMs, which is more efficient but often compromises the language model's original performance.</p>
<p>Current research follows three main strategies: merging LLMs with standalone image generation models, training large multimodal models end-to-end, or combining diffusion and autoregressive losses. While these methods achieve state-of-the-art results, they either require retraining large models or degrade the LLM's core capabilities. Despite these challenges, pretrained LLMs with added vision components have shown significant potential, particularly for image understanding and generation, though existing methods remain limited in efficiency and flexibility.</p>
<p>Researchers from UCLA, the University of Wisconsin–Madison, and Adobe Research propose X-Fusion, which adapts pretrained LLMs to multimodal tasks while preserving their language capabilities. X-Fusion uses a dual-tower architecture: the LLM's language weights are frozen, and a vision-specific tower is added to process visual information. The approach aligns text and vision features at multiple levels, improving performance on both image-to-text and text-to-image tasks. Through ablation studies, the researchers show that clean image data is important for training and that aligning vision features with pretrained representations accelerates convergence, especially for smaller models.</p>
<p>X-Fusion is a unified framework that adapts pretrained LLMs to vision tasks while retaining their language capabilities. It uses a dual-tower design: the LLM's text weights are frozen, while a separate vision tower processes visual information. Images are tokenized with a pretrained encoder, and image and text tokens are optimized jointly.
An optional X-Fuse operation merges features from the two towers for additional performance. X-Fusion is trained with both autoregressive and image-denoising losses, and it is evaluated on image generation (text-to-image) and image understanding (image-to-text) tasks.</p>
<p>The study compares the Dual Tower architecture against alternative transformer variants for multimodal integration: Single Tower, Gated Tower, and Dual Projection. The Dual Tower is the most flexible for image and text tasks and performs best on both image generation and understanding, improving FID by 23% over the other designs without adding training parameters. The study also examines how noise and data ratios affect performance, finding that clean images improve both understanding and generation, and that aligning vision features with a pretrained encoder such as CLIP boosts performance, especially for smaller models.</p>
<div class="wp-block-image">
<figure class="aligncenter is-resized"><img decoding="async" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcohDn7EOdyxxfzfWSnNGUDBpfYYKoiPpOiOJBJCeactdaqCrq2TeFjTDD5KCboMNNUUuVAtJDlR1Yzo2_m7zt7UWleq7bPw7R-emgzo9p7aPCAeUXbeQ-VvutMKCn1tPkTX3srnw?key=R_I3jWUGgVPq1oHfkdnACw" alt="" /></figure>
</div>
<p>In conclusion, X-Fusion is a framework that adapts pretrained LLMs to multimodal tasks such as image understanding and generation while preserving their language capabilities. Its Dual Tower architecture keeps the language weights fixed while a separate, trainable vision tower processes visual features. Experiments show that X-Fusion outperforms the alternative designs on image-to-text and text-to-image tasks. Key findings include the benefits of understanding-focused training data, of reducing noise in image data, and of feature alignment, especially for smaller models.
The research contributes valuable insights into building efficient multimodal models.</p>
<hr class="wp-block-separator has-alpha-channel-opacity" />
<p>Check out the <strong><a href="https://arxiv.org/abs/2504.20996" target="_blank" rel="noreferrer noopener">Paper</a></strong>. Also, don't forget to follow us on <strong><a href="https://x.com/intent/follow?screen_name=marktechpost" target="_blank" rel="noreferrer noopener">Twitter</a></strong>.</p>
<p><strong>Here's a brief overview of what we're building at Marktechpost:</strong></p>
<ul class="wp-block-list">
<li><strong>Newsletter – <a href="https://minicon.marktechpost.com/" target="_blank" rel="noreferrer noopener">airesearchinsights.com/</a> (30k+ subscribers)</strong></li>
<li><strong>miniCON AI Events – <a href="https://minicon.marktechpost.com/" target="_blank" rel="noreferrer noopener">minicon.marktechpost.com</a></strong></li>
<li><strong>AI Reports &amp; Magazines – <a href="https://magazine.marktechpost.com/" target="_blank" rel="noreferrer noopener">magazine.marktechpost.com</a></strong></li>
<li><strong>AI Dev &amp; Research News – <a href="https://marktechpost.com/" target="_blank" rel="noreferrer noopener">marktechpost.com</a> (1M+ monthly readers)</strong></li>
<li><strong>ML News Community – <a href="https://www.reddit.com/r/machinelearningnews/" target="_blank" rel="noreferrer noopener">r/machinelearningnews</a> (92k+ members)</strong></li>
</ul>
<p>The post <a href="https://www.marktechpost.com/2025/05/08/multimodal-llms-without-compromise-researchers-from-ucla-uw-madison-and-adobe-introduce-x-fusion-to-add-vision-to-frozen-language-models-without-losing-language-capabilities/">Multimodal LLMs Without Compromise: Researchers from UCLA, UW–Madison, and Adobe Introduce
X-Fusion to Add Vision to Frozen Language Models Without Losing Language Capabilities</a> appeared first on <a href="https://www.marktechpost.com/">MarkTechPost</a>.</p>
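<p>The dual-tower routing described in the article can be sketched in a few lines of numpy. This is a hypothetical illustration under stated assumptions, not the paper's implementation: the layer class, dimensions, and the plain averaging used for the X-Fuse-style merge are all invented for clarity. In X-Fusion itself each layer is a full transformer block and only the vision tower receives gradient updates.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # hidden size (illustrative only)

class DualTowerLayer:
    """Toy stand-in for one dual-tower layer: a frozen text projection
    plus a trainable vision projection; tokens are routed by modality."""

    def __init__(self, dim):
        self.w_text = rng.standard_normal((dim, dim)) / np.sqrt(dim)    # frozen LLM weights
        self.w_vision = rng.standard_normal((dim, dim)) / np.sqrt(dim)  # trainable vision tower

    def forward(self, x, is_image, x_fuse=False):
        h_text = x @ self.w_text      # every token sees the frozen text tower
        h_vision = x @ self.w_vision  # and the trainable vision tower
        if x_fuse:
            # Optional X-Fuse-style merge: blend features from both towers
            # (a plain average here; the real operation is learned).
            return 0.5 * (h_text + h_vision)
        # Default routing: image tokens take the vision path, text tokens
        # take the frozen text path, so language behavior is untouched.
        return np.where(is_image[:, None], h_vision, h_text)

# Joint sequence: 3 text tokens followed by 2 image tokens.
layer = DualTowerLayer(DIM)
x = rng.standard_normal((5, DIM))
is_image = np.array([False, False, False, True, True])
out = layer.forward(x, is_image)
# Text-token outputs match the frozen-LLM computation exactly.
assert np.allclose(out[~is_image], (x @ layer.w_text)[~is_image])
```

<p>Because routed text tokens never touch the vision weights, freezing <code>w_text</code> leaves the original language behavior intact; training would update only <code>w_vision</code> against the combined autoregressive and denoising losses.</p>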
o9p7aPCAeUXbeQ-VvutMKCn1tPkTX3srnw-khpxA6.png",819,676,false],"2048x2048":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcohDn7EOdyxxfzfWSnNGUDBpfYYKoiPpOiOJBJCeactdaqCrq2TeFjTDD5KCboMNNUUuVAtJDlR1Yzo2_m7zt7UWleq7bPw7R-emgzo9p7aPCAeUXbeQ-VvutMKCn1tPkTX3srnw-khpxA6.png",819,676,false],"trp-custom-language-flag":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcohDn7EOdyxxfzfWSnNGUDBpfYYKoiPpOiOJBJCeactdaqCrq2TeFjTDD5KCboMNNUUuVAtJDlR1Yzo2_m7zt7UWleq7bPw7R-emgzo9p7aPCAeUXbeQ-VvutMKCn1tPkTX3srnw-khpxA6-15x12.png",15,12,true],"woocommerce_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcohDn7EOdyxxfzfWSnNGUDBpfYYKoiPpOiOJBJCeactdaqCrq2TeFjTDD5KCboMNNUUuVAtJDlR1Yzo2_m7zt7UWleq7bPw7R-emgzo9p7aPCAeUXbeQ-VvutMKCn1tPkTX3srnw-khpxA6-300x300.png",300,300,true],"woocommerce_single":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcohDn7EOdyxxfzfWSnNGUDBpfYYKoiPpOiOJBJCeactdaqCrq2TeFjTDD5KCboMNNUUuVAtJDlR1Yzo2_m7zt7UWleq7bPw7R-emgzo9p7aPCAeUXbeQ-VvutMKCn1tPkTX3srnw-khpxA6-600x495.png",600,495,true],"woocommerce_gallery_thumbnail":["https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcohDn7EOdyxxfzfWSnNGUDBpfYYKoiPpOiOJBJCeactdaqCrq2TeFjTDD5KCboMNNUUuVAtJDlR1Yzo2_m7zt7UWleq7bPw7R-emgzo9p7aPCAeUXbeQ-VvutMKCn1tPkTX3srnw-khpxA6-100x100.png",100,100,true]},"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/ja\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/ja\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/ja\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/ja\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/ja\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"LLMs have made significant strides in language-related tasks such as conversational AI, reasoning, and code generation. 
However, human communication extends beyond text, often incorporating visual elements to enhance understanding. To create a truly versatile AI, models need the ability to process and generate text and visual information simultaneously. Training such unified vision-language models from scratch&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/posts\/11444","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/comments?post=11444"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/posts\/11444\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/media\/11445"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/media?parent=11444"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/categories?post=11444"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/tags?post=11444"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}