Multimodal Foundation Models Fall Short on Physical Reasoning: PHYX Benchmark Highlights Key Limitations in Visual and Symbolic Integration

State-of-the-art models show human-competitive accuracy on AIME, GPQA, MATH-500, and OlympiadBench, solving Olympiad-level problems. Recent multimodal foundation models have advanced benchmarks for disciplinary knowledge and mathematical reasoning. However, these evaluations miss a crucial aspect of machine intelligence: physical reasoning, which requires integrating disciplinary knowledge, symbolic operations, and real-world constraints. Physical problem-solving differs fundamentally from pure mathematical reasoning: it demands that models decode implicit conditions in questions (for example, interpreting “smooth surface” as a zero friction coefficient) and maintain physical consistency across reasoning chains, because physical laws remain constant regardless of reasoning trajectory.

MLLMs show excellent visual understanding by integrating visual and textual data across various tasks, motivating exploration of their reasoning abilities. However, it remains uncertain whether these models possess genuinely advanced reasoning capabilities for visual tasks, particularly in physical domains closer to real-world scenarios. Several LLM benchmarks have emerged to evaluate reasoning abilities, with PHYBench being the most relevant for physics reasoning. MLLM scientific benchmarks such as PhysReason and EMMA contain multimodal physics problems with figures; however, they include only small physics subsets, which inadequately evaluate MLLMs’ capabilities for reasoning about and solving advanced physics problems.

Researchers from the University of Hong Kong, the University of Michigan, the University of Toronto, the University of Waterloo, and the Ohio State University have proposed PHYX, a novel benchmark to evaluate the physical reasoning capabilities of foundation models. It comprises 3,000 visually grounded physics questions, precisely curated across six distinct physics domains: Mechanics, Electromagnetism, Thermodynamics, Wave/Acoustics, Optics, and Modern Physics.
PHYX evaluates physics-based reasoning via multimodal problem-solving with three core innovations: (a) 3,000 newly collected questions with realistic physical scenarios requiring integrated visual analysis and causal reasoning, (b) expert-validated data design covering six fundamental physics domains, and (c) a strict, unified three-step evaluation protocol.

The researchers designed a four-stage data-collection process to ensure high-quality data. The process begins with an in-depth survey of core physics disciplines to determine coverage across diverse domains and subfields, followed by the recruitment of STEM graduate students as expert annotators. Annotators comply with copyright restrictions and avoid data contamination by selecting questions whose answers are not immediately available. Quality control then involves a three-stage cleaning process: duplicate detection through lexical-overlap analysis with manual review by physics Ph.D. students, followed by filtering out the shortest 10% of questions by textual length, yielding 3,000 high-quality questions from an initial collection of 3,300.

PHYX presents significant challenges for current models: even the worst-performing human experts achieve 75.6% accuracy, outperforming all evaluated models and exposing a gap between human expertise and current model capabilities. The benchmark reveals that multiple-choice formats narrow performance gaps by allowing weaker models to rely on surface-level cues, whereas open-ended questions demand genuine reasoning and precise answer generation. Comparing GPT-4o’s performance on PHYX to previously reported results on MathVista and MATH-V (both 63.8%), the lower accuracy on physical reasoning tasks emphasizes that physical reasoning requires deeper integration of abstract concepts and real-world knowledge, presenting greater challenges than purely mathematical contexts.
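The cleaning steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the Jaccard word-overlap measure, the 0.8 duplicate threshold, and the helper names are assumptions standing in for the paper's lexical-overlap analysis and length filter.

```python
def lexical_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets, a simple stand-in for the
    paper's lexical-overlap duplicate check."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def clean_questions(questions, dup_threshold=0.8, drop_frac=0.10):
    """Flag near-duplicates for manual review and drop the shortest
    `drop_frac` of the remaining questions by text length.
    The threshold and fraction here are illustrative assumptions."""
    kept, flagged = [], []
    for q in questions:
        if any(lexical_overlap(q, k) >= dup_threshold for k in kept):
            flagged.append(q)  # would go to Ph.D. reviewers in the paper
        else:
            kept.append(q)
    kept.sort(key=len)                    # shortest first
    n_drop = int(len(kept) * drop_frac)   # remove the shortest slice
    return kept[n_drop:], flagged
```

In the paper the flagged duplicates receive manual review rather than automatic deletion; the sketch only shows where that review step would slot in.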
In conclusion, the researchers introduced PHYX, the first large-scale benchmark for evaluating physical reasoning in multimodal, visually grounded scenarios. Rigorous evaluation reveals that state-of-the-art models show limitations in physical reasoning, relying predominantly on memorized knowledge, mathematical formulas, and superficial visual patterns rather than genuine understanding of physical principles. The benchmark focuses exclusively on English-language prompts and annotations, limiting assessment of multilingual reasoning abilities. Also, while the images depict physically realistic scenarios, they are often schematic or textbook-style rather than real-world photographs, which may not fully capture the complexity of perception in natural environments.

Check out the Paper, Code and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. The post Multimodal Foundation Models Fall Short on Physical Reasoning: PHYX Benchmark Highlights Key Limitations in Visual and Symbolic Integration appeared first on MarkTechPost.


Apple and Duke Researchers Present a Reinforcement Learning Approach That Enables LLMs to Provide Intermediate Answers, Enhancing Speed and Accuracy

Long chain-of-thought (CoT) reasoning improves large language models’ performance on complex tasks but comes with drawbacks. The typical “think-then-answer” method slows down response times, disrupting real-time interactions like those in chatbots. It also risks inaccuracies, as errors in earlier reasoning steps can lead to a misleading final answer. Unlike humans, who often share partial thoughts or conclusions during conversations, LLMs delay responses until all reasoning is complete. While RL is commonly used to train reasoning models, it mainly rewards final answers, overlooking useful intermediate insights. There is growing interest in teaching models to alternate between thinking and answering, but this remains a challenge.

RL has become a popular method to enhance reasoning in LLMs, building on its success in aligning models with human preferences. Two common reward types guide RL: outcome-based rewards (ORM), which focus on the final answer, and process-based rewards (PRM), which provide feedback on intermediate reasoning steps. While PRMs offer more detailed supervision, they often rely on human annotation and additional models, making them complex and prone to issues like reward hacking. Separately, efforts to improve LLM reasoning have explored prompting strategies, structured reasoning, tool integration, and methods to reduce latency and improve efficiency.

Researchers from Apple and Duke University introduce Interleaved Reasoning, a new RL approach that enables language models to alternate between thinking and answering when solving complex, multi-step questions. Instead of waiting until the end to respond, models provide informative intermediate answers, which improves feedback for users and helps guide the models’ own reasoning. Using a straightforward rule-based reward, the model is trained to produce helpful reasoning steps, leading to over 80% faster responses and up to 19.3% better accuracy.
Trained only on QA and logic datasets, the method demonstrates strong generalization to more challenging benchmarks, such as MATH, GPQA, and MMLU.

The study proposes a reinforcement learning framework to train LLMs for interleaved reasoning, where models alternate between internal thinking and user-facing intermediate answers. Each intermediate step, or “sub-answer,” is shared once the model reaches a meaningful milestone in reasoning. A specialized training template with <think> and <answer> tags is used. The approach relies on rule-based rewards (specifically format, final accuracy, and conditional intermediate accuracy) to guide learning. Notably, intermediate rewards are applied only when specific criteria are met, ensuring the model prioritizes overall correctness. The researchers also test different reward schemes, such as all-or-none, partial credit, and time-discounted rewards, to optimize the quality of reasoning.

The interleaved reasoning approach was evaluated on both familiar and unfamiliar datasets using Qwen2.5 models (1.5B and 7B). Unlike traditional methods that separate thinking and answering, the interleaved method provides answers incrementally, improving both speed and usefulness. When combined with intermediate rewards, it significantly enhances model performance while reducing response delays by over 80%. Even without exposure to new domains during training, the model adapts well, showing strong generalization. These results highlight the value of interleaved reasoning in making AI systems more responsive and effective in real-world, multi-step reasoning tasks.

In conclusion, the study explores how interleaved reasoning, where models alternate between reasoning and generating intermediate answers, can significantly improve performance and responsiveness. Using the Qwen2.5-1.5B model, the authors show that providing timely intermediate feedback during training boosts accuracy and accelerates response generation.
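The rule-based reward described above can be sketched as follows. This is a hedged illustration, not the paper's implementation: the weights, the exact-match check, and the gating of intermediate credit on a correct final answer are assumptions consistent with the description of format, final-accuracy, and conditional intermediate-accuracy rewards.

```python
import re

def rule_based_reward(response, gold_final, gold_intermediate,
                      w_format=0.2, w_final=1.0, w_inter=0.5):
    """Score an interleaved <think>/<answer> response with rule-based
    rewards. Weights and the gating condition are illustrative."""
    answers = re.findall(r"<answer>(.*?)</answer>", response, re.DOTALL)
    thinks = re.findall(r"<think>(.*?)</think>", response, re.DOTALL)
    format_ok = bool(answers) and bool(thinks)
    reward = w_format if format_ok else 0.0
    if not format_ok:
        return reward
    final_correct = answers[-1].strip() == gold_final
    reward += w_final * final_correct
    # Conditional intermediate reward: credited only when the format is
    # valid and the final answer is correct, so correctness dominates.
    if final_correct and gold_intermediate:
        hits = sum(a.strip() == g
                   for a, g in zip(answers[:-1], gold_intermediate))
        reward += w_inter * hits / len(gold_intermediate)
    return reward
```

A time-discounted variant, one of the schemes the paper compares, would scale each intermediate credit by how early in the trajectory it was earned.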
Different RL strategies were tested, with PPO showing stable results and conditional, time-discounted rewards proving the most effective. The method scales well to complex tasks and outperforms traditional think-then-answer baselines. Unlike token-level reward models, this approach employs simple rule-based rewards after completing full reasoning steps, thereby avoiding reward hacking. Ultimately, interleaved reasoning enhances reasoning quality and efficiency without relying on external tools.

Check out the Paper. All credit for this research goes to the researchers of this project. The post Apple and Duke Researchers Present a Reinforcement Learning Approach That Enables LLMs to Provide Intermediate Answers, Enhancing Speed and Accuracy appeared first on MarkTechPost.


Samsung Researchers Introduced ANSE (Active Noise Selection for Generation): A Model-Aware Framework for Improving Text-to-Video Diffusion Models through Attention-Based Uncertainty Estimation

Video generation models have become a core technology for creating dynamic content by transforming text prompts into high-quality video sequences. Diffusion models, in particular, have established themselves as a leading approach for this task. These models work by starting from random noise and iteratively refining it into realistic video frames. Text-to-video (T2V) models extend this capability by incorporating temporal elements and aligning generated content with textual prompts, producing videos that are both visually compelling and semantically accurate.

Despite advancements in architecture design, such as latent diffusion models and motion-aware attention modules, a significant challenge remains: ensuring consistent, high-quality video generation across different runs, particularly when the only change is the initial random noise seed. This challenge has highlighted the need for smarter, model-aware noise selection strategies that avoid unpredictable outputs and wasted computational resources.

The core problem lies in how diffusion models initialize their generation process from Gaussian noise. The specific noise seed used can drastically impact the final video quality, temporal coherence, and prompt fidelity; the same text prompt might generate entirely different videos depending on the random noise seed. Current approaches often attempt to address this problem with handcrafted noise priors or frequency-based adjustments. Methods like FreeInit and FreqPrior apply external filtering techniques, while others like PYoCo introduce structured noise patterns. However, these methods rely on assumptions that may not hold across different datasets or models, require multiple full sampling passes (resulting in high computational costs), and fail to leverage the model’s internal attention signals, which could indicate which seeds are most promising for generation.
As a result, there is a need for a more principled, model-aware method that can guide noise selection without incurring heavy computational penalties or relying on handcrafted priors.

The research team from Samsung Research introduced ANSE (Active Noise Selection for Generation), an active noise selection framework for video diffusion models. ANSE addresses the noise selection problem by using internal model signals, specifically attention-based uncertainty estimates, to guide noise seed selection. At the core of ANSE is BANSA (Bayesian Active Noise Selection via Attention), a novel acquisition function that quantifies the consistency and confidence of the model’s attention maps under stochastic perturbations. The research team designed BANSA to operate efficiently during inference by approximating its calculation through Bernoulli-masked attention sampling, which introduces randomness directly into the attention computation without requiring multiple full forward passes. This stochastic method enables the model to estimate the stability of its attention behavior across different noise seeds and to select those that promote more confident and coherent attention patterns, which are empirically linked to improved video quality.

BANSA works by evaluating entropy in the attention maps generated at specific layers during the early denoising steps. The researchers identified that layer 14 for the CogVideoX-2B model and layer 19 for the CogVideoX-5B model provided sufficient correlation (above a 0.7 threshold) with the full-layer uncertainty estimate, significantly reducing computational overhead. The BANSA score is computed by comparing the average entropy of individual attention maps to the entropy of their mean, where a lower BANSA score indicates higher confidence and consistency in attention patterns. This score is used to rank candidate noise seeds from a pool of 10 (M = 10), each evaluated using 10 stochastic forward passes (K = 10).
The noise seed with the lowest BANSA score is then used to generate the final video, achieving improved quality without requiring model retraining or external priors.

On the CogVideoX-2B model, the total VBench score improved from 81.03 to 81.66 (+0.63), with a +0.48 gain in quality score and a +1.23 gain in semantic alignment. On the larger CogVideoX-5B model, ANSE increased the total VBench score from 81.52 to 81.71 (+0.25), with a +0.17 gain in quality and a +0.60 gain in semantic alignment. Notably, these improvements came with only an 8.68% increase in inference time for CogVideoX-2B and 13.78% for CogVideoX-5B. In contrast, prior methods such as FreeInit and FreqPrior required a 200% increase in inference time, making ANSE significantly more efficient.

Qualitative evaluations further highlighted the benefits, showing that ANSE improved visual clarity, semantic consistency, and motion portrayal. For example, videos of “a koala playing the piano” and “a zebra running” showed more natural, anatomically correct motion under ANSE, while for prompts like “exploding,” ANSE-generated videos captured dynamic transitions more effectively.

The research also explored different acquisition functions, comparing BANSA against random noise selection and entropy-based methods. BANSA using Bernoulli-masked attention achieved the highest total score (81.66 for CogVideoX-2B), outperforming both random (81.03) and entropy-based (81.13) selection. The study also found that increasing the number of stochastic forward passes (K) improved performance up to K = 10, beyond which the gains plateaued; similarly, performance saturated at a noise pool size (M) of 10. A control experiment in which the model intentionally selected seeds with the highest BANSA scores resulted in degraded video quality, confirming that lower BANSA scores correlate with better generation outcomes.
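The BANSA scoring and seed ranking described above can be sketched as follows. This is an illustrative reading, not the authors' code: it interprets "comparing the average entropy of individual attention maps to the entropy of their mean" as their difference (a BALD-style disagreement term), and the array shapes and function names are assumptions.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of probability distributions along the last axis."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def bansa_score(attn_maps):
    """attn_maps: (K, N) array of K stochastic (Bernoulli-masked)
    attention distributions over N positions for one noise seed.
    Lower score = more consistent, more confident attention."""
    mean_map = attn_maps.mean(axis=0)        # (N,) mean attention map
    h_of_mean = entropy(mean_map)            # entropy of the mean map
    mean_of_h = entropy(attn_maps).mean()    # mean entropy of each map
    return h_of_mean - mean_of_h             # disagreement across samples

def select_seed(candidate_maps):
    """candidate_maps: dict seed -> (K, N) attention samples.
    Returns the seed with the lowest BANSA score, as in ANSE's ranking."""
    return min(candidate_maps, key=lambda s: bansa_score(candidate_maps[s]))
```

With M = 10 candidate seeds and K = 10 stochastic passes each, this ranking replaces the full sampling passes that methods like FreeInit require.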
While ANSE improves noise selection, it does not modify the generation process itself, meaning that some low-BANSA seeds can still result in suboptimal videos. The team acknowledged this limitation and suggested that BANSA is best viewed as a practical surrogate for more computationally intensive methods, such as per-seed sampling with post-hoc filtering. They also proposed that future work could integrate information-theoretic refinements or active learning strategies to further enhance generation quality.

Several key takeaways from the research include:

- ANSE improves total VBench scores for video generation: from 81.03 to 81.66 on CogVideoX-2B and from 81.52 to 81.71 on CogVideoX-5B.
- Quality and semantic alignment gains are +0.48 and +1.23 for CogVideoX-2B, and +0.17 and +0.60 for CogVideoX-5B, respectively.
- Inference time increases are modest: +8.68% for CogVideoX-2B and +13.78% for CogVideoX-5B.
- BANSA scores derived from Bernoulli-masked attention outperform random and entropy-based methods for noise selection.
- The layer selection strategy reduces computational load by computing uncertainty at layers 14 and 19 for CogVideoX-2B and CogVideoX-5B, respectively.
- ANSE achieves efficiency by avoiding multiple full sampling passes, in contrast to methods like FreeInit, which require a 200% increase in inference time.


National University of Singapore Researchers Introduce Dimple: A Discrete Diffusion Multimodal Language Model for Efficient and Controllable Text Generation

In recent months, there has been growing interest in applying diffusion models, originally designed for continuous data such as images, to natural language processing tasks. This has led to the development of discrete diffusion language models (DLMs), which treat text generation as a denoising process. Unlike traditional autoregressive models, DLMs enable parallel decoding and provide better control over structure, offering advantages such as flexible initialization of entire sequences, explicit control over output format, and improved infilling through bidirectional attention. Furthermore, their non-sequential nature opens the door to faster generation. Despite these benefits, most current multimodal large language models (MLLMs), such as LLaMA, Qwen-VL, and InternVL, still rely solely on autoregressive methods.

Work on diffusion-based language models has explored both continuous and discrete diffusion spaces. Continuous approaches, such as DiffuSeq and SED, use embedding or relaxed categorical spaces for smoother generation. In contrast, discrete models like SDDM and RDM tailor the diffusion process to linguistic structures. Training techniques vary but commonly use masked language modeling losses or entropy-based score matching. Some hybrid models, such as AR-Diffusion and SSD-LM, combine autoregressive and diffusion strategies to leverage the strengths of both approaches. Meanwhile, open-source MLLMs such as LLaVA and InternVL have advanced through visual instruction tuning and joint pretraining, yet still follow an autoregressive generation scheme.

Researchers at the National University of Singapore present Dimple, the first discrete diffusion multimodal large language model (DMLLM), which integrates a vision encoder with a discrete diffusion-based language model.
To overcome the instability and performance issues of purely diffusion-based training, they introduce a two-phase training method, autoregressive-then-diffusion, that combines initial autoregressive alignment with subsequent diffusion-based masked language modeling. Dimple-7B surpasses LLaVA-NEXT by 3.9% on benchmarks. The team also introduces confident decoding for dynamic token generation and explores structure priors for precise control over output. These innovations significantly improve inference efficiency, generation flexibility, and structural controllability without sacrificing performance.

Dimple is a discrete diffusion multimodal LLM that integrates a vision encoder with a diffusion-based language model. To address inefficiencies in diffusion training, such as sparse supervision and limited generation coverage, the model is trained in two phases: first with autoregressive training using a causal attention mask for vision-language alignment, then with diffusion training to restore generation capabilities. During inference, a dynamic “confident decoding” strategy adapts token updates based on prediction confidence. Despite using significantly fewer training samples, Dimple exhibits competitive performance on multiple benchmarks, outperforming similar-scale autoregressive models, although it trails behind larger-scale state-of-the-art systems.

The experiments evaluate Dimple against autoregressive models on instruction-following tasks. Dimple, trained with the hybrid strategy combining autoregressive and diffusion tuning, exhibits strong performance, surpassing models with similar training data on most benchmarks. Although it lags behind models trained on much larger datasets, Dimple benefits from a stronger base language model. Ablation studies reveal that combining autoregressive and diffusion tuning mitigates issues like length bias and improves consistency.
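Confidence-based parallel decoding of the kind described can be sketched as follows. This is a hedged illustration, not Dimple's implementation: the threshold, the commit-the-argmax rule, and the fallback of committing the single most confident masked position when nothing clears the threshold are all assumptions.

```python
import numpy as np

def confident_decoding_step(probs, tokens, mask, tau=0.9):
    """One illustrative step of confidence-based parallel decoding.
    probs: (L, V) per-position token distributions from the model;
    tokens: (L,) current sequence; mask: (L,) bool, True = still masked.
    Masked positions whose top probability exceeds tau are committed."""
    conf = probs.max(axis=-1)       # top probability per position
    choice = probs.argmax(axis=-1)  # candidate token per position
    commit = mask & (conf >= tau)
    if mask.any() and not commit.any():
        # Fallback: always commit at least the most confident masked
        # position, so decoding is guaranteed to make progress.
        i = np.where(mask)[0][conf[mask].argmax()]
        commit = np.zeros_like(mask)
        commit[i] = True
    tokens = np.where(commit, choice, tokens)
    return tokens, mask & ~commit
```

Because the number of positions committed per step varies with model confidence, the number of diffusion steps adapts to the input, which is the efficiency property the article attributes to confident decoding.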
Prefilling further boosts inference speed significantly, with only minor performance drops, making the model both efficient and competitive in multimodal understanding tasks.

In conclusion, Dimple, the first DMLLM, is designed to overcome the limitations of purely discrete diffusion training, such as instability and length bias. Dimple employs a hybrid training approach that starts with autoregressive learning, followed by diffusion tuning, yielding the Dimple-7B model, which outperforms LLaVA-NEXT by 3.9%. The confident-decoding strategy significantly reduces inference steps, while prefilling improves speed with minimal performance trade-offs. Dimple also enables structured and controllable outputs through structure priors, offering fine-grained control over format and length, capabilities that autoregressive models struggle to provide.

Check out the Paper, Model on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project. The post National University of Singapore Researchers Introduce Dimple: A Discrete Diffusion Multimodal Language Model for Efficient and Controllable Text Generation appeared first on MarkTechPost.


This AI Paper Introduces WEB-SHEPHERD: A Process Reward Model for Web Agents with 40K Dataset and 10× Cost Efficiency

Web navigation focuses on teaching machines how to interact with websites to perform tasks such as searching for information, shopping, or booking services. Building a capable web navigation agent is complex because it requires understanding the structure of websites, interpreting user goals, and making a series of decisions across multiple steps. These tasks are further complicated by the need for agents to adapt to dynamic web environments, where content can change frequently and where multimodal information, such as text and images, must be understood together.

A key problem in web navigation is the absence of reliable, detailed reward models that can guide agents in real time. Existing methods primarily rely on multimodal large language models (MLLMs) like GPT-4o and GPT-4o-mini as evaluators, which are expensive, slow, and often inaccurate, especially when handling long sequences of actions in multi-step tasks. These models use prompting-based evaluation or binary success/failure feedback but fail to provide step-level guidance, often leading to errors such as repeated actions or missed critical steps like clicking specific buttons or filling form fields. This limitation reduces the practicality of deploying web agents in real-world scenarios, where efficiency, accuracy, and cost-effectiveness are crucial.

The research team from Yonsei University and Carnegie Mellon University introduced WEB-SHEPHERD, a process reward model (PRM) specifically designed for web navigation tasks. WEB-SHEPHERD is the first model to evaluate web navigation agents at the step level, using structured checklists to guide assessment. The researchers also developed the WEBPRM COLLECTION, a dataset of 40,000 step-level annotated web navigation tasks, and the WEBREWARDBENCH benchmark for evaluating PRMs. These resources enable WEB-SHEPHERD to provide detailed feedback by breaking complex tasks down into smaller, measurable subgoals.
WEB-SHEPHERD works by generating a checklist for each task based on the user’s instruction, such as “Search for product” or “Click on product page,” and evaluating the agent’s progress against these subgoals. The model uses next-token prediction to generate feedback and assigns rewards based on checklist completion, which enables it to assess the correctness of each step with fine-grained judgment. The reward for each step is estimated by combining the probabilities of the “Yes,” “No,” and “In Progress” tokens and averaging these across the checklist. This detailed scoring system gives agents targeted feedback on their progress, enhancing their ability to navigate complex websites.

The researchers demonstrated that WEB-SHEPHERD significantly outperforms existing models. On the WEBREWARDBENCH benchmark, WEB-SHEPHERD achieved a Mean Reciprocal Rank (MRR) of 87.6% and a trajectory accuracy of 55% in the text-only setting, compared to GPT-4o-mini’s 47.5% MRR and 0% trajectory accuracy without checklists. When tested in WebArena-lite using GPT-4o-mini as the policy model, WEB-SHEPHERD achieved a 34.55% success rate, 10.9 points higher than using GPT-4o-mini as the evaluator, while also being ten times more cost-efficient. In ablation studies, WEB-SHEPHERD’s performance dropped significantly when checklists or feedback were removed, proving their importance for accurate reward assignment. The researchers also found that multimodal input, surprisingly, did not always improve performance and sometimes introduced noise.

This research highlights the critical role of detailed process-level rewards in building reliable web agents. The team’s work addresses the core challenge of web navigation, evaluating complex multi-step actions, and offers a solution that is both scalable and cost-effective.
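The checklist-based reward can be sketched as follows. This is an illustrative sketch, not WEB-SHEPHERD's code: the per-judgment weights and the renormalization over the three judgment tokens are assumptions; the source says only that the token probabilities are combined and averaged across the checklist.

```python
def checklist_reward(item_probs, w_yes=1.0, w_prog=0.5, w_no=0.0):
    """item_probs: one dict per checklist item with the model's
    probabilities for the 'Yes', 'No', and 'In Progress' judgment
    tokens. Returns the checklist-averaged reward in [0, 1].
    The weights here are illustrative assumptions."""
    scores = []
    for p in item_probs:
        # Renormalize over the three judgment tokens, then weight
        # full credit for 'Yes' and partial credit for 'In Progress'.
        z = p["yes"] + p["no"] + p["in_progress"]
        scores.append((w_yes * p["yes"]
                       + w_prog * p["in_progress"]
                       + w_no * p["no"]) / z)
    return sum(scores) / len(scores)  # average across the checklist
```

Reading the reward off judgment-token probabilities is what lets a single next-token-prediction pass produce a dense, step-level score, rather than the binary success/failure signal the prompting-based evaluators provide.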
With WEB-SHEPHERD, agents can now receive accurate feedback during navigation, enabling them to make better decisions and complete tasks more effectively.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. The post This AI Paper Introduces WEB-SHEPHERD: A Process Reward Model for Web Agents with 40K Dataset and 10× Cost Efficiency appeared first on MarkTechPost.
