{"id":16233,"date":"2025-06-03T03:47:34","date_gmt":"2025-06-03T03:47:34","guid":{"rendered":"https:\/\/youzum.net\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/"},"modified":"2025-06-03T03:47:34","modified_gmt":"2025-06-03T03:47:34","slug":"this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning","status":"publish","type":"post","link":"https:\/\/youzum.net\/fr\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/","title":{"rendered":"This AI Paper Introduces LLaDA-V: A Purely Diffusion-Based Multimodal Large Language Model for Visual Instruction Tuning and Multimodal Reasoning"},"content":{"rendered":"<p>Multimodal large language models (MLLMs) are designed to process and generate content across various modalities, including text, images, audio, and video. These models aim to understand and integrate information from different sources, enabling applications such as visual question answering, image captioning, and multimodal dialogue systems. The development of MLLMs represents a significant step toward creating AI systems that can interpret and interact with the world in a more human-like manner.<\/p>\n<p>A primary challenge in developing effective MLLMs lies in integrating diverse input types, particularly visual data, into language models while maintaining high performance across tasks. Existing models often struggle with balancing strong language understanding and effective visual reasoning, especially when scaling to complex data. Further, many models require large datasets to perform well, making it difficult to adapt to specific tasks or domains. 
These challenges highlight the need for more efficient and scalable approaches to multimodal learning.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter is-resized\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXdllHremyd-CiRI_8vT6s3RdHcleqQO19J3BosyTWCDH7vVXQ72BfTYIzLBpjbTY0pCIl8dWJBi3pYLL-elg1Mu21uegURBk0BVd4ovi_Puz83Vm3hRrwu1PF7VoiPxX1Zfm48Cjw?key=-5G84BcVLv9yF7CmogKAvw\" alt=\"\" \/><\/figure>\n<\/div>\n<p>Current MLLMs predominantly utilize autoregressive methods, predicting one token at a time in a left-to-right manner. While effective, this approach has limitations in handling complex multimodal contexts. Alternative methods, such as diffusion models, have been explored; however, they often exhibit weaker language understanding due to their restricted architectures or inadequate training strategies. These limitations suggest a gap where a purely diffusion-based model could offer competitive multimodal reasoning capabilities if designed effectively.<\/p>\n<p>Researchers from Renmin University of China and Ant Group introduced LLaDA-V, a purely diffusion-based multimodal large language model (MLLM) that integrates visual instruction tuning with masked diffusion models. Built upon LLaDA, a large language diffusion model, LLaDA-V incorporates a vision encoder and an MLP connector to project visual features into the language embedding space, enabling effective multimodal alignment. 
This design represents a departure from the autoregressive paradigms dominant in current multimodal approaches, aiming to overcome existing limitations while maintaining data efficiency and scalability.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter is-resized\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXeKtkGX_cpanP0pMPsevMcyOjVOsDoYVsatpRf-db_wVVCOFg2zKLU4uHgyso50WH3t2IQyxFW63_xFHJaUCUTdobQ2E543ZLhzIxwFg4gM4zKojeNRaHECUWccwXdERXX98rXtOg?key=-5G84BcVLv9yF7CmogKAvw\" alt=\"\" \/><\/figure>\n<\/div>\n<p>LLaDA-V employs a masked diffusion process where text responses are gradually refined through iterative prediction of masked tokens. Unlike autoregressive models that predict tokens sequentially, LLaDA-V generates outputs by reversing the masked diffusion process. The model is trained in three stages: the first stage aligns vision and language embeddings by mapping visual features from SigLIP2 into LLaDA\u2019s language space. The second stage fine-tunes the model using 10 million single-image samples and 2 million multimodal samples from MAmmoTH-VL. The third stage focuses on reasoning, using 900K QA pairs from VisualWebInstruct and a mixed dataset strategy. Bidirectional attention improves context comprehension, enabling robust multimodal understanding.<\/p>\n<p>In evaluations across 18 multimodal tasks, LLaDA-V demonstrated superior performance compared to hybrid autoregressive-diffusion and purely diffusion-based models. It outperformed LLaMA3-V on most multidisciplinary knowledge and mathematical reasoning tasks like MMMU, MMMU-Pro, and MMStar, achieving a score of 60.1 on MMStar, close to Qwen2-VL\u2019s 60.7, despite LLaDA-V using the weaker LLaDA-8B language tower. LLaDA-V also excelled in data efficiency, outperforming LLaMA3-V on MMMU-Pro with 1M samples against LLaMA3-V\u2019s 9M. 
Although it lagged in chart and document understanding benchmarks, such as AI2D, and in real-world scene tasks, like RealWorldQA, LLaDA-V\u2019s results highlight its promise for multimodal tasks.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter is-resized\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXc4Zuv9R4VtAc6N8Wi_e7S-OmOzMiJoR8LsLXrryhVrA2B4BWSRpGHfaqj9g0CQtf63ZtZW0cpcnZfGNR7VXpAOMicEh2ZbavwM5nR3S6_nLwxNjTzhqgJqpPtQQuSHdudl5gZX?key=-5G84BcVLv9yF7CmogKAvw\" alt=\"\" \/><\/figure>\n<\/div>\n<p>In summary, LLaDA-V addresses the challenges of building effective multimodal models by introducing a purely diffusion-based architecture that combines visual instruction tuning with masked diffusion. The approach offers strong multimodal reasoning capabilities while maintaining data efficiency. This work demonstrates the potential of diffusion models in multimodal AI, paving the way for further exploration of probabilistic approaches to complex AI tasks.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p><strong>Check out the <a href=\"https:\/\/arxiv.org\/abs\/2505.16933\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a> and <a href=\"https:\/\/github.com\/ML-GSAI\/LLaDA-V\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub Page<\/a>.<\/strong>\u00a0All credit for this research goes to the researchers of this project. 
Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">95k+ ML SubReddit<\/a><\/strong> and Subscribe to <strong><a href=\"https:\/\/www.airesearchinsights.com\/subscribe\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2025\/06\/02\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/\">This AI Paper Introduces LLaDA-V: A Purely Diffusion-Based Multimodal Large Language Model for Visual Instruction Tuning and Multimodal Reasoning<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Multimodal large language models (MLLMs) are designed to process and generate content across various modalities, including text, images, audio, and video. These models aim to understand and integrate information from different sources, enabling applications such as visual question answering, image captioning, and multimodal dialogue systems. The development of MLLMs represents a significant step toward creating AI systems that can interpret and interact with the world in a more human-like manner. A primary challenge in developing effective MLLMs lies in integrating diverse input types, particularly visual data, into language models while maintaining high performance across tasks. Existing models often struggle with balancing strong language understanding and effective visual reasoning, especially when scaling to complex data. 
Further, many models require large datasets to perform well, making it difficult to adapt to specific tasks or domains. These challenges highlight the need for more efficient and scalable approaches to multimodal learning. Current MLLMs predominantly utilize autoregressive methods, predicting one token at a time in a left-to-right manner. While effective, this approach has limitations in handling complex multimodal contexts. Alternative methods, such as diffusion models, have been explored; however, they often exhibit weaker language understanding due to their restricted architectures or inadequate training strategies. These limitations suggest a gap where a purely diffusion-based model could offer competitive multimodal reasoning capabilities if designed effectively. Researchers from Renmin University of China and Ant Group introduced LLaDA-V, a purely diffusion-based multimodal large language model (MLLM) that integrates visual instruction tuning with masked diffusion models. Built upon LLaDA, a large language diffusion model, LLaDA-V incorporates a vision encoder and an MLP connector to project visual features into the language embedding space, enabling effective multimodal alignment. This design represents a departure from the autoregressive paradigms dominant in current multimodal approaches, aiming to overcome existing limitations while maintaining data efficiency and scalability. LLaDA-V employs a masked diffusion process where text responses are gradually refined through iterative prediction of masked tokens. Unlike autoregressive models that predict tokens sequentially, LLaDA-V generates outputs by reversing the masked diffusion process. The model is trained in three stages: the first stage aligns vision and language embeddings by mapping visual features from SigLIP2 into LLaDA\u2019s language space. The second stage fine-tunes the model using 10 million single-image samples and 2 million multimodal samples from MAmmoTH-VL. 
The third stage focuses on reasoning, using 900K QA pairs from VisualWebInstruct and a mixed dataset strategy. Bidirectional attention improves context comprehension, enabling robust multimodal understanding. In evaluations across 18 multimodal tasks, LLaDA-V demonstrated superior performance compared to hybrid autoregressive-diffusion and purely diffusion-based models. It outperformed LLaMA3-V on most multidisciplinary knowledge and mathematical reasoning tasks like MMMU, MMMU-Pro, and MMStar, achieving a score of 60.1 on MMStar, close to Qwen2-VL\u2019s 60.7, despite LLaDA-V using the weaker LLaDA-8B language tower. LLaDA-V also excelled in data efficiency, outperforming LLaMA3-V on MMMU-Pro with 1M samples against LLaMA3-V\u2019s 9M. Although it lagged in chart and document understanding benchmarks, such as AI2D, and in real-world scene tasks, like RealWorldQA, LLaDA-V\u2019s results highlight its promise for multimodal tasks. In summary, LLaDA-V addresses the challenges of building effective multimodal models by introducing a purely diffusion-based architecture that combines visual instruction tuning with masked diffusion. The approach offers strong multimodal reasoning capabilities while maintaining data efficiency. This work demonstrates the potential of diffusion models in multimodal AI, paving the way for further exploration of probabilistic approaches to complex AI tasks. Check out the Paper and GitHub Page.\u00a0All credit for this research goes to the researchers of this project. Also,\u00a0feel free to follow us on\u00a0Twitter\u00a0and don\u2019t forget to join our\u00a095k+ ML SubReddit and Subscribe to our Newsletter. 
The post This AI Paper Introduces LLaDA-V: A Purely Diffusion-Based Multimodal Large Language Model for Visual Instruction Tuning and Multimodal Reasoning appeared first on MarkTechPost.<\/p>","protected":false},"author":2,"featured_media":16234,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-16233","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>This AI Paper Introduces LLaDA-V: A Purely Diffusion-Based Multimodal Large Language Model for Visual Instruction Tuning and Multimodal Reasoning - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta 
name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/fr\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/\" \/>\n<meta property=\"og:locale\" content=\"fr_FR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"This AI Paper Introduces LLaDA-V: A Purely Diffusion-Based Multimodal Large Language Model for Visual Instruction Tuning and Multimodal Reasoning - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/fr\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2025-06-03T03:47:34+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdllHremyd-CiRI_8vT6s3RdHcleqQO19J3BosyTWCDH7vVXQ72BfTYIzLBpjbTY0pCIl8dWJBi3pYLL-elg1Mu21uegURBk0BVd4ovi_Puz83Vm3hRrwu1PF7VoiPxX1Zfm48Cjw-fM2AJW.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1496\" \/>\n\t<meta property=\"og:image:height\" content=\"788\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u00c9crit par\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Dur\u00e9e de lecture estim\u00e9e\" 
\/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"This AI Paper Introduces LLaDA-V: A Purely Diffusion-Based Multimodal Large Language Model for Visual Instruction Tuning and Multimodal Reasoning\",\"datePublished\":\"2025-06-03T03:47:34+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/\"},\"wordCount\":652,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdllHremyd-CiRI_8vT6s3RdHcleqQO19J3BosyTWCDH7vVXQ72BfTYIzLBpjbTY0pCIl8dWJBi3pYLL-elg1Mu21uegURBk0BVd4ovi_Puz83Vm3hRrwu1PF7VoiPxX1Zfm48Cjw-fM2AJW.png\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"fr-FR\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-m
ultimodal-reasoning\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/\",\"url\":\"https:\/\/youzum.net\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/\",\"name\":\"This AI Paper Introduces LLaDA-V: A Purely Diffusion-Based Multimodal Large Language Model for Visual Instruction Tuning and Multimodal Reasoning - YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdllHremyd-CiRI_8vT6s3RdHcleqQO19J3BosyTWCDH7vVXQ72BfTYIzLBpjbTY0pCIl8dWJBi3pYLL-elg1Mu21uegURBk0BVd4ovi_Puz83Vm3hRrwu1PF7VoiPxX1Zfm48Cjw-fM2AJW.png\",\"datePublished\":\"2025-06-03T03:47:34+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/#breadcrumb\"},\"inLanguage\":\"fr-FR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/\"]}]},{\"@type\":\"ImageObject\",\"inLan
guage\":\"fr-FR\",\"@id\":\"https:\/\/youzum.net\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/#primaryimage\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdllHremyd-CiRI_8vT6s3RdHcleqQO19J3BosyTWCDH7vVXQ72BfTYIzLBpjbTY0pCIl8dWJBi3pYLL-elg1Mu21uegURBk0BVd4ovi_Puz83Vm3hRrwu1PF7VoiPxX1Zfm48Cjw-fM2AJW.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdllHremyd-CiRI_8vT6s3RdHcleqQO19J3BosyTWCDH7vVXQ72BfTYIzLBpjbTY0pCIl8dWJBi3pYLL-elg1Mu21uegURBk0BVd4ovi_Puz83Vm3hRrwu1PF7VoiPxX1Zfm48Cjw-fM2AJW.png\",\"width\":1496,\"height\":788},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"This AI Paper Introduces LLaDA-V: A Purely Diffusion-Based Multimodal Large Language Model for Visual Instruction Tuning and Multimodal Reasoning\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"fr-FR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association 
Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/fr\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"This AI Paper Introduces LLaDA-V: A Purely Diffusion-Based Multimodal Large Language Model for Visual Instruction Tuning and Multimodal Reasoning - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/fr\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/","og_locale":"fr_FR","og_type":"article","og_title":"This AI Paper Introduces LLaDA-V: A Purely Diffusion-Based Multimodal Large Language Model for Visual Instruction Tuning and Multimodal Reasoning - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/fr\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2025-06-03T03:47:34+00:00","og_image":[{"width":1496,"height":788,"url":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdllHremyd-CiRI_8vT6s3RdHcleqQO19J3BosyTWCDH7vVXQ72BfTYIzLBpjbTY0pCIl8dWJBi3pYLL-elg1Mu21uegURBk0BVd4ovi_Puz83Vm3hRrwu1PF7VoiPxX1Zfm48Cjw-fM2AJW.png","type":"image\/png"}],"author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"\u00c9crit par":"admin NU","Dur\u00e9e de lecture estim\u00e9e":"3 
minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"This AI Paper Introduces LLaDA-V: A Purely Diffusion-Based Multimodal Large Language Model for Visual Instruction Tuning and Multimodal Reasoning","datePublished":"2025-06-03T03:47:34+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/"},"wordCount":652,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/#primaryimage"},"thumbnailUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2025\/06\/AD_4nXdllHremyd-CiRI_8vT6s3RdHcleqQO19J3BosyTWCDH7vVXQ72BfTYIzLBpjbTY0pCIl8dWJBi3pYLL-elg1Mu21uegURBk0BVd4ovi_Puz83Vm3hRrwu1PF7VoiPxX1Zfm48Cjw-fM2AJW.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"fr-FR","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning-and-multimodal-reasoning\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/this-ai-paper-introduces-llada-v-a-purely-diffusion-based-multimodal-large-language-model-for-visual-instruction-tuning