National University of Singapore Researchers Introduce Dimple: A Discrete Diffusion Multimodal Language Model for Efficient and Controllable Text Generation

In recent months, there has been growing interest in applying diffusion models, originally designed for continuous data such as images, to natural language processing tasks. This has led to the development of Discrete Diffusion Language Models (DLMs), which treat text generation as a denoising process. Unlike traditional autoregressive models, DLMs enable parallel decoding and provide better control over structure, offering advantages such as flexible initialization of entire sequences, explicit control over output format, and improved infilling through bidirectional attention. Their non-sequential nature also opens the door to faster generation. Despite these benefits, most current multimodal large language models (MLLMs), such as LLaMA, Qwen-VL, and InternVL, still rely solely on autoregressive decoding.

Work on diffusion-based language models has explored both continuous and discrete diffusion spaces. Continuous approaches such as DiffuSeq and SED use embedding or relaxed categorical spaces for smoother generation, whereas discrete models like SDDM and RDM tailor the diffusion process to linguistic structure. Training techniques vary but commonly use masked language modeling losses or entropy-based score matching. Hybrid models such as AR-Diffusion and SSD-LM combine autoregressive and diffusion strategies to leverage the strengths of both. Meanwhile, open-source MLLMs such as LLaVA and InternVL have advanced through visual instruction tuning and joint pretraining, yet still follow an autoregressive generation scheme.

Researchers at the National University of Singapore present Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM), which integrates a vision encoder with a discrete diffusion-based language model. To overcome the instability and performance issues of purely diffusion-based training, they introduce a two-phase training method, Autoregressive-then-Diffusion, which combines initial autoregressive alignment with subsequent diffusion-based masked language modeling. The resulting Dimple-7B surpasses LLaVA-NEXT by 3.9% on benchmarks. The team also introduces Confident Decoding for dynamic token generation and explores structure priors for precise control over the output. Together, these techniques improve inference efficiency, generation flexibility, and structural controllability without sacrificing performance.

To address inefficiencies in diffusion training, such as sparse supervision and limited generation coverage, the model is trained in two phases: first with autoregressive training using a causal attention mask for vision-language alignment, then with diffusion training to restore parallel generation capabilities. During inference, the dynamic Confident Decoding strategy adapts how many tokens are updated at each step based on prediction confidence. Despite using significantly fewer training samples, Dimple is competitive on multiple benchmarks, outperforming similar-scale autoregressive models, although it trails larger-scale state-of-the-art systems.

The experiments evaluate Dimple against autoregressive models on instruction-following tasks. Trained with the hybrid autoregressive-plus-diffusion recipe, Dimple surpasses models trained on comparable amounts of data on most benchmarks. Although it lags behind models trained on much larger datasets, it benefits from a stronger base language model.
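
The hybrid recipe can be summarized with a short sketch. The toy model, the 50% mask ratio, and the loss functions below are illustrative assumptions rather than details from the paper; in Dimple itself, phase one applies a causal attention mask over the vision-language sequence and phase two uses the diffusion-style masked objective to restore parallel generation.

```python
# Hypothetical sketch of an autoregressive-then-diffusion training schedule:
# phase 1 uses next-token cross-entropy, phase 2 replaces random positions
# with a [MASK] id and trains the model to recover them. The toy model and
# hyperparameters are stand-ins, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, D = 1000, 0, 64

class ToyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D)
        self.proj = nn.Linear(D, VOCAB)
    def forward(self, ids):                      # (B, T) -> (B, T, VOCAB)
        return self.proj(self.emb(ids))

def ar_loss(model, ids):
    # Phase 1: autoregressive alignment, predict token t+1 from tokens <= t
    # (a real implementation would also use a causal attention mask).
    logits = model(ids[:, :-1])
    return F.cross_entropy(logits.reshape(-1, VOCAB), ids[:, 1:].reshape(-1))

def diffusion_loss(model, ids, mask_ratio=0.5):
    # Phase 2: masked "denoising" objective, recover randomly masked tokens.
    noisy = ids.clone()
    mask = torch.rand_like(ids, dtype=torch.float) < mask_ratio
    noisy[mask] = MASK_ID
    logits = model(noisy)
    return F.cross_entropy(logits[mask], ids[mask])

model = ToyLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
batch = torch.randint(1, VOCAB, (4, 32))         # stand-in for image+text tokens
# Each phase would normally run over many batches; one step each is shown here.
for phase, loss_fn in [("autoregressive", ar_loss), ("diffusion", diffusion_loss)]:
    loss = loss_fn(model, batch)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(phase, float(loss))
```
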
Ablation studies show that combining autoregressive and diffusion tuning mitigates issues such as length bias and improves consistency. Prefilling further boosts inference speed significantly, with only minor performance drops, making the model both efficient and competitive in multimodal understanding tasks.

In conclusion, Dimple, the first DMLLM, is designed to overcome the limitations of purely discrete diffusion training, such as instability and length bias. Its hybrid training approach, autoregressive learning followed by diffusion tuning, yields the Dimple-7B model, which outperforms LLaVA-NEXT by 3.9%. The Confident Decoding strategy significantly reduces the number of inference steps, while prefilling improves speed with minimal performance trade-offs. Dimple also enables structured and controllable outputs through structure priors, offering fine-grained control over format and length that autoregressive models struggle to provide.
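
To make the decoding side concrete, here is a minimal sketch of confidence-driven parallel decoding with an optional structure prior: positions already filled in a template stay fixed, masked positions are predicted in parallel, and only predictions whose confidence clears a threshold are committed at each step. The threshold, template, and model interface are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch of confidence-gated parallel decoding over a partially fixed template.
import torch

MASK_ID = 0

@torch.no_grad()
def confident_decode(model, template, threshold=0.9, max_steps=32):
    ids = template.clone()                        # (T,) with MASK_ID at free slots
    for _ in range(max_steps):
        masked = ids == MASK_ID
        if not masked.any():
            break
        probs = torch.softmax(model(ids.unsqueeze(0))[0], dim=-1)   # (T, VOCAB)
        conf, pred = probs.max(dim=-1)
        # Commit confident predictions; always commit at least the single most
        # confident masked token so decoding cannot stall.
        commit = masked & (conf >= threshold)
        if not commit.any():
            best = torch.argmax(conf.masked_fill(~masked, -1.0))
            commit[best] = True
        ids[commit] = pred[commit]
    return ids

# Toy usage: a stand-in "model" that returns random logits over 1000 tokens,
# and a structure prior that fixes two output positions in advance.
toy = lambda x: torch.randn(x.shape[0], x.shape[1], 1000)
prior = torch.tensor([5, MASK_ID, MASK_ID, 7, MASK_ID])
print(confident_decode(toy, prior))
```

In a scheme like this, the confidence threshold trades decoding steps against caution: a lower threshold commits more tokens per step and finishes sooner, which matches the article's point that confident decoding reduces the number of inference steps.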


This AI Paper Introduces WEB-SHEPHERD: A Process Reward Model for Web Agents with 40K Dataset and 10× Cost Efficiency

Web navigation focuses on teaching machines to interact with websites to perform tasks such as searching for information, shopping, or booking services. Building a capable web navigation agent is complex because it requires understanding the structure of websites, interpreting user goals, and making a series of decisions across multiple steps. These tasks are further complicated by the need for agents to adapt to dynamic web environments, where content can change frequently and where multimodal information, such as text and images, must be understood together.

A key problem in web navigation is the absence of reliable, detailed reward models that can guide agents in real time. Existing methods primarily rely on multimodal large language models (MLLMs) like GPT-4o and GPT-4o-mini as evaluators, which are expensive, slow, and often inaccurate, especially on long sequences of actions in multi-step tasks. These models use prompting-based evaluation or binary success/failure feedback but fail to provide step-level guidance, often leading to errors such as repeated actions or missing critical steps like clicking specific buttons or filling form fields. This limitation reduces the practicality of deploying web agents in real-world scenarios, where efficiency, accuracy, and cost-effectiveness are crucial.

A research team from Yonsei University and Carnegie Mellon University introduced WEB-SHEPHERD, a process reward model (PRM) designed specifically for web navigation. WEB-SHEPHERD is the first model to evaluate web navigation agents at the step level, using structured checklists to guide its assessments. The researchers also developed the WEBPRM COLLECTION, a dataset of 40,000 step-level annotated web navigation tasks, and the WEBREWARDBENCH benchmark for evaluating PRMs. These resources enable WEB-SHEPHERD to provide detailed feedback by breaking complex tasks down into smaller, measurable subgoals.

WEB-SHEPHERD generates a checklist for each task from the user's instruction, with subgoals such as "Search for product" or "Click on product page," and evaluates the agent's progress against these subgoals. The model uses next-token prediction to generate feedback and assigns rewards based on checklist completion. It estimates the reward for each step by combining the probabilities of the "Yes," "No," and "In Progress" judgment tokens and averaging these scores across the checklist. This fine-grained scoring gives agents targeted feedback on their progress, improving their ability to navigate complex websites.

The researchers demonstrated that WEB-SHEPHERD significantly outperforms existing models. On WEBREWARDBENCH, it achieved a Mean Reciprocal Rank (MRR) of 87.6% and a trajectory accuracy of 55% in the text-only setting, compared with GPT-4o-mini's 47.5% MRR and 0% trajectory accuracy without checklists. When tested on WebArena-lite with GPT-4o-mini as the policy model, WEB-SHEPHERD achieved a 34.55% success rate, 10.9 points higher than using GPT-4o-mini as the evaluator, while being roughly ten times more cost-efficient. In ablation studies, performance dropped significantly when checklists or feedback were removed, underscoring their importance for accurate reward assignment.
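
A simplified sketch of the checklist-based step reward described above is shown below. The partial credit of 0.5 for "In Progress" and the probability values are illustrative assumptions, not numbers from the paper.

```python
# Hypothetical checklist-based step reward: each item is scored from the
# judgment-token probabilities, and the step reward is the average over items.
from typing import Dict, List

def item_score(p: Dict[str, float]) -> float:
    # Full credit for "Yes", partial credit for "In Progress", none for "No".
    return p.get("Yes", 0.0) + 0.5 * p.get("In Progress", 0.0)

def step_reward(checklist_probs: List[Dict[str, float]]) -> float:
    return sum(item_score(p) for p in checklist_probs) / len(checklist_probs)

# Example: a three-item checklist for a shopping task after one agent action.
checklist_probs = [
    {"Yes": 0.92, "In Progress": 0.05, "No": 0.03},   # "Search for product"
    {"Yes": 0.10, "In Progress": 0.75, "No": 0.15},   # "Open product page"
    {"Yes": 0.02, "In Progress": 0.08, "No": 0.90},   # "Add product to cart"
]
print(round(step_reward(checklist_probs), 3))          # approx. 0.49
```
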
The researchers also found that multimodal input, surprisingly, did not always improve performance and sometimes introduced noise. This research highlights the critical role of detailed process-level rewards in building reliable web agents. The work addresses the core challenge of web navigation, evaluating complex multi-step actions, and offers a solution that is both scalable and cost-effective. With WEB-SHEPHERD, agents can receive accurate feedback during navigation, enabling them to make better decisions and complete tasks more effectively.
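
As a generic illustration of how a process reward model like WEB-SHEPHERD can guide an agent at decision time, here is a best-of-N reranking sketch. The `policy` and `prm` callables are hypothetical stand-ins, and this reranking scheme is a common pattern rather than necessarily the exact integration used in the paper.

```python
# Sample several candidate actions from the policy, score each with the PRM,
# and take the highest-scoring one.
from typing import Callable, List, Tuple

def select_action(policy: Callable[[str], List[str]],
                  prm: Callable[[str, str], float],
                  observation: str) -> Tuple[str, float]:
    candidates = policy(observation)                    # N candidate actions
    scored = [(prm(observation, action), action) for action in candidates]
    best_score, best_action = max(scored)
    return best_action, best_score

# Toy usage with stand-in functions.
toy_policy = lambda obs: ["click('Search')", "type('query', 'usb-c cable')"]
toy_prm = lambda obs, act: 0.8 if "type" in act else 0.3
print(select_action(toy_policy, toy_prm, "<html>...</html>"))
```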


Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language Models

Multi-modal large language models (MLLMs) have made great progress as versatile AI assistants capable of handling diverse visual tasks. However, their deployment as isolated digital entities limits their potential impact. The growing demand to integrate MLLMs into real-world applications such as robotics and autonomous vehicles requires complex spatial understanding, yet current MLLMs show fundamental spatial reasoning deficits, often failing at basic tasks such as distinguishing left from right. Previous research attributes these limitations to a lack of specialized training data and addresses them by incorporating spatial data during training, but these approaches focus on single-image scenarios, restricting the model's perception to static, single-view analysis without dynamic information.

Several lines of work have tried to address spatial understanding limitations in MLLMs. MLLMs incorporate image encoders that convert visual inputs into tokens processed alongside text in the language model's latent space. Prior research has focused on single-image spatial understanding, evaluating inter-object spatial relations or spatial recognition, while some benchmarks such as BLINK, UniQA-3D, and VSIBench extend beyond single images. Existing efforts to improve spatial understanding include SpatialVLM, which fine-tunes models on curated spatial datasets; SpatialRGPT, which incorporates mask-based references and depth images; and SpatialPIN, which utilizes specialized perception models without fine-tuning.

Researchers from FAIR at Meta and the Chinese University of Hong Kong have proposed a framework to equip MLLMs with robust multi-frame spatial understanding. It integrates three components, depth perception, visual correspondence, and dynamic perception, to overcome the limitations of static single-image analysis. The researchers developed MultiSPA, a novel large-scale dataset containing over 27 million samples spanning diverse 3D and 4D scenes. The resulting Multi-SpatialMLLM model achieves significant improvements over baselines and proprietary systems, with scalable and generalizable multi-frame reasoning. Five tasks are used to generate training data: depth perception, visual correspondence, camera movement perception, object movement perception, and object size perception.

Multi-SpatialMLLM centers on the MultiSPA data generation pipeline and a comprehensive benchmark. The data follows the standard MLLM fine-tuning format of QA pairs: User: <image>…<image>{description}{question} and Assistant: {answer}. The researchers used GPT-4o to generate diverse templates for task descriptions, questions, and answers, and drew on high-quality annotated scene datasets, including the 4D datasets Aria Digital Twin and Panoptic Studio, 3D tracking annotations from TAPVid3D for object movement perception, and ScanNet for the other spatial tasks. MultiSPA yields over 27M QA samples from 1.1M unique images, with 300 samples held out for evaluating each subtask, for a total of 7,800 benchmark samples.

On the MultiSPA benchmark, Multi-SpatialMLLM achieves an average 36% gain over base models, reaching 80-90% accuracy on qualitative tasks compared with around 50% for baselines, while outperforming all proprietary systems. Even on challenging tasks such as predicting camera movement vectors, it attains 18% accuracy versus near-zero performance from other baselines.
On the BLINK benchmark, Multi-SpatialMLLM achieves nearly 90% accuracy with an average 26.4% improvement over base models, surpassing several proprietary systems and demonstrating transferable multi-frame spatial understanding. Standard VQA benchmark evaluations show rough parity with the original models, indicating that general-purpose MLLM proficiency is maintained without overfitting to spatial reasoning tasks.

In this work, the researchers extend MLLMs' spatial understanding to multi-frame scenarios, addressing a critical gap overlooked in previous investigations. They introduce MultiSPA, the first large-scale dataset and benchmark for multi-frame spatial reasoning tasks. Experimental validation shows the effectiveness, scalability, and strong generalization of Multi-SpatialMLLM across diverse spatial understanding challenges. The study also surfaces notable insights, including multi-task learning benefits and emergent behaviors in complex spatial reasoning, and establishes new applications such as acting as a multi-frame reward annotator.
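
To make the MultiSPA fine-tuning format described above concrete, here is a minimal sketch of how a single QA sample could be assembled. The field contents are illustrative stand-ins; in the actual pipeline, the descriptions, questions, and answers come from GPT-4o-generated templates filled with annotated scene data.

```python
# Assemble one multi-frame QA sample in the stated format:
# User: <image>...<image>{description}{question}   Assistant: {answer}
def build_sample(num_frames: int, description: str, question: str, answer: str) -> dict:
    image_tokens = "<image>" * num_frames            # one placeholder per frame
    return {
        "user": f"{image_tokens}{description}{question}",
        "assistant": answer,
    }

# Hypothetical camera-movement-perception sample over two frames.
sample = build_sample(
    num_frames=2,
    description="You are given two frames of the same scene taken from different viewpoints. ",
    question="Did the camera move left or right between the two frames?",
    answer="The camera moved to the right.",
)
print(sample["user"])
print(sample["assistant"])
```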
