{"id":13985,"date":"2025-05-17T04:11:41","date_gmt":"2025-05-17T04:11:41","guid":{"rendered":"https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/"},"modified":"2025-05-17T04:11:41","modified_gmt":"2025-05-17T04:11:41","slug":"salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation","status":"publish","type":"post","link":"https:\/\/youzum.net\/zh\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/","title":{"rendered":"Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and Generation"},"content":{"rendered":"<p>Multimodal modeling focuses on building systems that understand and generate content across visual and textual formats. These models are designed to interpret visual scenes and produce new images from natural language prompts. With growing interest in bridging vision and language, researchers are working toward integrating image recognition and image generation capabilities into a unified system. This approach eliminates the need for separate pipelines and opens the path to more coherent and intelligent interactions across modalities.<\/p>\n<p>A key challenge in this field is to develop architectures that handle both understanding and generation without compromising the quality of either. Models need to grasp complex visual concepts and produce high-quality images that match user prompts. The difficulty lies in identifying suitable image representations and training procedures that support both tasks. 
This problem becomes more evident when the same model is expected to interpret detailed text descriptions and generate visually accurate outputs based on them. It requires alignment of semantic understanding and pixel-level synthesis.<\/p>\n<p>Previous approaches have generally used Variational Autoencoders (VAEs) or CLIP-based encoders to represent images. VAEs are efficient for reconstruction but encode lower-level features, often leading to less informative representations. CLIP-based encoders provide high-level semantic embeddings by learning from large-scale image-text pairs. However, CLIP was not built for image reconstruction, making it challenging to use for generation unless paired with models like diffusion decoders. In terms of training, Mean Squared Error (MSE) is widely used for simplicity but tends to produce deterministic outputs. To improve generation diversity and quality, researchers have turned to Flow Matching, which introduces controlled stochasticity and better models the continuous nature of image features.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter is-resized\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXcacU8IOCeNFkKl67EpUulzl733JFYI5g0sl5SzT9z1xQUnasrdQqHEH5Zy3rCol4QXHgOzc_nzb30xgs2Ituq6gsFzR8UdKKKfU7qstkBkU6f2AUuWIjr5TZKS_NxilWjpiRnPQA?key=PL5mr655cLpuQIRgQ7iGjQ\" alt=\"\"\/><\/figure>\n<\/div>\n<p>Researchers from Salesforce Research, in collaboration with the University of Maryland and several academic institutions, introduced BLIP3-o, a family of unified multimodal models. The model adopts a dual-stage training strategy where image understanding is learned first, followed by image generation. The proposed system leverages CLIP embeddings to represent images and integrates them with a diffusion transformer to synthesize new visual outputs. Unlike previous joint training methods, the sequential approach maintains the strength of each task independently. 
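To make the Flow Matching objective described above concrete, here is a minimal, self-contained sketch (illustrative only, not the BLIP3-o training code; the function names and toy vectors are assumptions): instead of regressing a target feature directly with MSE, the model learns to predict the velocity that carries a noise sample along a straight-line path toward the target feature vector.

```python
def flow_matching_pair(x0, x1, t):
    """Linear (rectified-flow) path from noise x0 to target feature x1:
    x_t = (1 - t) * x0 + t * x1, with target velocity v* = x1 - x0."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_star = [b - a for a, b in zip(x0, x1)]
    return xt, v_star

def velocity_mse(v_pred, v_star):
    """Training loss: mean squared error between predicted and target
    velocity. In a sequential setup like BLIP3-o's, only the diffusion
    head would be updated with this loss; the language backbone stays frozen."""
    return sum((p - s) ** 2 for p, s in zip(v_pred, v_star)) / len(v_star)

# Toy check: a point halfway along the path between noise and target.
x0 = [0.0, 0.0, 0.0, 0.0]    # "noise" sample (illustrative)
x1 = [0.5, -1.0, 0.25, 2.0]  # "CLIP feature" target (illustrative)
xt, v_star = flow_matching_pair(x0, x1, 0.5)
```

Because the regression target depends on the sampled timestep and noise, the objective avoids the deterministic averaging behavior of plain MSE regression; at inference, integrating the predicted velocity from t = 0 to t = 1 maps noise into the CLIP feature space.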
The diffusion module is trained while keeping the autoregressive backbone frozen, avoiding task interference. To improve alignment and visual fidelity, the team also curated BLIP3o-60k, a high-quality instruction-tuning dataset created by prompting GPT-4o across varied visual categories, including scenes, objects, gestures, and text. They developed two model versions: an 8-billion parameter model trained with proprietary and public data, and a 4-billion version using only open-source data.<\/p>\n<p>The image generation pipeline of BLIP3-o is built on Qwen2.5-VL large language models. Prompts are processed to produce visual features refined through a Flow Matching diffusion transformer. This transformer is based on the Lumina-Next architecture, optimized for speed and quality with 3D rotary position embedding and grouped-query attention. The model encodes each image into 64 fixed-length semantic vectors, regardless of resolution, which supports compact storage and efficient decoding. The research team used a large-scale dataset of 25 million images from sources like CC12M, SA-1B, and JourneyDB to train the models. They extended it with 30 million proprietary samples for the 8B model. They also included 60k instruction-tuning samples covering challenging prompts such as complex gestures and landmarks, generated via GPT-4o.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter is-resized\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXc6RGjevg1Ev1Wrvb3hZMvPLBrROUb-EQfnKtcOKnAoUmT2rp7j2sHX0yGnwh95Q7X0TAbKHY0C6ewXxUtcQLm-EdvuRdAKtwSRYFkiwD03vNYuLelFSw0rDz3P63nJoyGLKOXg?key=PL5mr655cLpuQIRgQ7iGjQ\" alt=\"\"\/><\/figure>\n<\/div>\n<p>In terms of performance, BLIP3-o demonstrated top scores across multiple benchmarks. The 8B model achieved a GenEval score of 0.84 for image generation alignment and a WISE score of 0.62 for reasoning ability. 
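The compact visual representation behind these numbers (64 fixed-length semantic vectors per image, regardless of resolution) can be illustrated with a toy pooling routine. This is a sketch under stated assumptions: the real encoder would use learned queries and attention rather than the naive chunk-averaging below, and every name here is made up for illustration.

```python
def pool_to_fixed_tokens(patch_features, num_tokens=64):
    """Summarize a variable number of patch features into a fixed number of
    semantic vectors. A stand-in for learned-query attention pooling; here
    each output token simply averages one contiguous chunk of patches."""
    n = len(patch_features)
    tokens = []
    for i in range(num_tokens):
        start = i * n // num_tokens
        end = max(start + 1, (i + 1) * n // num_tokens)
        chunk = patch_features[start:end]
        # Column-wise mean over the chunk of patch vectors.
        tokens.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return tokens

# 256 patches (a 16x16 grid) and 1024 patches (a 32x32 grid)
# both map to exactly 64 semantic vectors.
small = pool_to_fixed_tokens([[1.0, 2.0]] * 256)
large = pool_to_fixed_tokens([[1.0, 2.0]] * 1024)
```

The fixed output length is what makes the representation cheap to store and fast to decode: downstream modules always see the same number of tokens, whatever the input resolution.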
On image understanding benchmarks, the model scored 1682.6 on MME-Perception, 647.1 on MME-Cognition, 50.6 on MMMU, and 83.1 on both the VQAv2 and TextVQA datasets. A human evaluation comparing BLIP3-o 8B with Janus Pro 7B showed that BLIP3-o was preferred 50.4% of the time for visual quality and 51.5% for prompt alignment. Both margins are small but statistically significant (p = 5.05e-06 and 1.16e-05, respectively), indicating a consistent human preference for BLIP3-o in subjective quality assessments.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter is-resized\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXci1r1xzKv2VlWXw8r5P2mtc2RmhHwCHjtEXS7P4f0QrR3b4GJgztCNaZseDspgEwQ586J--3v3bgFjCDtcBVNga_5yTJcXnPxhvN7KcVmkpWEfaBHJYgGya7rFtS3wtg2wc6w47Q?key=PL5mr655cLpuQIRgQ7iGjQ\" alt=\"\"\/><\/figure>\n<\/div>\n<p>This research outlines a clear solution to the dual challenge of image understanding and generation. Its combination of CLIP embeddings, Flow Matching, and a sequential training strategy shows how the problem can be approached methodically. The BLIP3-o model delivers state-of-the-art results and introduces an efficient and open approach to unified multimodal modeling.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<p><strong>Check out the <a href=\"https:\/\/arxiv.org\/abs\/2505.09568\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a>, <a href=\"https:\/\/github.com\/JiuhaiChen\/BLIP3o\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub Page<\/a> and <a href=\"https:\/\/huggingface.co\/BLIP3o\/BLIP3o-Model\" target=\"_blank\" rel=\"noreferrer noopener\">Model on Hugging Face<\/a><em>.<\/em><\/strong>\u00a0All credit for this research goes to the researchers of this project. 
Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">90k+ ML SubReddit<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2025\/05\/16\/salesforce-ai-releases-blip3-o-a-fully-open-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/\">Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and Generation<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Multimodal modeling focuses on building systems to understand and generate content across visual and textual formats. These models are designed to interpret visual scenes and produce new images using natural language prompts. With growing interest in bridging vision and language, researchers are working toward integrating image recognition and image generation capabilities into a unified system. This approach eliminates the need for separate pipelines and opens the path to more coherent and intelligent interactions across modalities. A key challenge in this field is to develop architectures that handle both understanding and generation without compromising the quality of either. Models need to grasp complex visual concepts and produce high-quality images matching user prompts. The difficulty lies in identifying suitable picture representations and training procedures that support both tasks. This problem becomes more evident when the same model is expected to interpret detailed text descriptions and generate visually accurate outputs based on them. 
Check out the Paper, GitHub Page and Model on Hugging Face.\u00a0All credit for this research goes to the researchers of this project. Also,\u00a0feel free to follow us on\u00a0Twitter\u00a0and don\u2019t forget to join our\u00a090k+ ML SubReddit. The post Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and Generation appeared first on MarkTechPost.<\/p>","protected":false},"author":2,"featured_media":13986,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-13985","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Salesforce AI Releases BLIP3-o: A Fully 
Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and Generation - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/zh\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/\" \/>\n<meta property=\"og:locale\" content=\"zh_CN\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and Generation - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/zh\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2025-05-17T04:11:41+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcacU8IOCeNFkKl67EpUulzl733JFYI5g0sl5SzT9z1xQUnasrdQqHEH5Zy3rCol4QXHgOzc_nzb30xgs2Ituq6gsFzR8UdKKKfU7qstkBkU6f2AUuWIjr5TZKS_NxilWjpiRnPQA-CvF0K3.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1440\" \/>\n\t<meta property=\"og:image:height\" content=\"836\" \/>\n\t<meta 
property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u4f5c\u8005\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"\u9884\u8ba1\u9605\u8bfb\u65f6\u95f4\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 \u5206\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and 
Generation\",\"datePublished\":\"2025-05-17T04:11:41+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/\"},\"wordCount\":773,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcacU8IOCeNFkKl67EpUulzl733JFYI5g0sl5SzT9z1xQUnasrdQqHEH5Zy3rCol4QXHgOzc_nzb30xgs2Ituq6gsFzR8UdKKKfU7qstkBkU6f2AUuWIjr5TZKS_NxilWjpiRnPQA-CvF0K3.png\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"zh-Hans\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/\",\"url\":\"https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/\",\"name\":\"Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and Generation - 
YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcacU8IOCeNFkKl67EpUulzl733JFYI5g0sl5SzT9z1xQUnasrdQqHEH5Zy3rCol4QXHgOzc_nzb30xgs2Ituq6gsFzR8UdKKKfU7qstkBkU6f2AUuWIjr5TZKS_NxilWjpiRnPQA-CvF0K3.png\",\"datePublished\":\"2025-05-17T04:11:41+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/#breadcrumb\"},\"inLanguage\":\"zh-Hans\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcacU8IOCeNFkKl67EpUulzl733JFYI5g0sl5SzT9z1xQUnasrdQqHEH5Zy3rCol4QXHgOzc_nzb30xgs2Ituq6gsFzR8UdKKKfU7qstkBkU6f2AUuWIjr5TZKS_NxilWjpiRnPQA-CvF0K3.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/
AD_4nXcacU8IOCeNFkKl67EpUulzl733JFYI5g0sl5SzT9z1xQUnasrdQqHEH5Zy3rCol4QXHgOzc_nzb30xgs2Ituq6gsFzR8UdKKKfU7qstkBkU6f2AUuWIjr5TZKS_NxilWjpiRnPQA-CvF0K3.png\",\"width\":1440,\"height\":836},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and Generation\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"zh-Hans\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association 
Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/zh\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and Generation - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/zh\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/","og_locale":"zh_CN","og_type":"article","og_title":"Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and Generation - 
YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/zh\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2025-05-17T04:11:41+00:00","og_image":[{"width":1440,"height":836,"url":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcacU8IOCeNFkKl67EpUulzl733JFYI5g0sl5SzT9z1xQUnasrdQqHEH5Zy3rCol4QXHgOzc_nzb30xgs2Ituq6gsFzR8UdKKKfU7qstkBkU6f2AUuWIjr5TZKS_NxilWjpiRnPQA-CvF0K3.png","type":"image\/png"}],"author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"\u4f5c\u8005":"admin NU","\u9884\u8ba1\u9605\u8bfb\u65f6\u95f4":"4 \u5206"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and 
Generation","datePublished":"2025-05-17T04:11:41+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/"},"wordCount":773,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/05\/AD_4nXcacU8IOCeNFkKl67EpUulzl733JFYI5g0sl5SzT9z1xQUnasrdQqHEH5Zy3rCol4QXHgOzc_nzb30xgs2Ituq6gsFzR8UdKKKfU7qstkBkU6f2AUuWIjr5TZKS_NxilWjpiRnPQA-CvF0K3.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"zh-Hans","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/","url":"https:\/\/youzum.net\/salesforce-ai-releases-blip3-o-a-fully-open-source-unified-multimodal-model-built-with-clip-embeddings-and-flow-matching-for-image-understanding-and-generation\/","name":"Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and Generation - 