YouZum

Meta AI Introduces Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-modal Large Language Models

Multi-modal large language models (MLLMs) have shown great progress as versatile AI assistants capable of handling diverse visual tasks. However, their deployment as isolated digital entities limits their potential impact. The growing demand to integrate MLLMs into real-world applications such as robotics and autonomous vehicles requires complex spatial understanding. Current MLLMs show fundamental spatial reasoning deficiencies, often failing at basic tasks such as distinguishing left from right. While previous research attributes these limitations to insufficient specialized training data and addresses them by incorporating spatial data during training, these approaches focus on single-image scenarios, restricting the model's perception to static field-of-view analysis without dynamic information.

Several research efforts have tried to address spatial understanding limitations in MLLMs. MLLMs incorporate image encoders that convert visual inputs into tokens processed alongside text in the language model's latent space. Previous research has focused on single-image spatial understanding, evaluating inter-object spatial relations or spatial recognition. Some benchmarks, such as BLINK, UniQA-3D, and VSIBench, extend beyond single images. Existing improvements to MLLMs for spatial understanding include SpatialVLM, which fine-tunes models on curated spatial datasets; SpatialRGPT, which incorporates mask-based references and depth images; and SpatialPIN, which utilizes specialized perception models without fine-tuning.

Researchers from FAIR Meta and the Chinese University of Hong Kong have proposed a framework to enhance MLLMs with robust multi-frame spatial understanding. The framework integrates three components, depth perception, visual correspondence, and dynamic perception, to overcome the limitations of static single-image analysis. The researchers developed MultiSPA, a novel large-scale dataset containing over 27 million samples spanning diverse 3D and 4D scenes. The resulting Multi-SpatialMLLM model achieves significant improvements over baselines and proprietary systems, with scalable and generalizable multi-frame reasoning. Five tasks are introduced to generate training data: depth perception, visual correspondence, camera movement perception, object movement perception, and object size perception.

Multi-SpatialMLLM centers on the MultiSPA data generation pipeline and a comprehensive benchmark system. The data format follows standard MLLM fine-tuning practice, using QA pairs of the form User: <image>…<image>{description}{question} and Assistant: {answer} (see the sketch below). The researchers used GPT-4o to generate diverse templates for task descriptions, questions, and answers. High-quality annotated scene datasets supply the raw data, including the 4D datasets Aria Digital Twin and Panoptic Studio, along with 3D tracking annotations from TAPVid3D for object movement perception and ScanNet for the other spatial tasks. MultiSPA comprises over 27M QA samples generated from 1.1M unique images, with 300 samples held out per subtask for evaluation, totaling 7,800 benchmark samples.

On the MultiSPA benchmark, Multi-SpatialMLLM achieves an average 36% gain over base models, reaching 80-90% accuracy on qualitative tasks compared with 50% for baseline models, while outperforming all proprietary systems. Even on challenging tasks such as predicting camera movement vectors, it attains 18% accuracy versus near-zero performance from other baselines.
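To make the QA format above concrete, the following minimal sketch shows how one multi-frame training sample might be assembled; the field names, file names, and template text are illustrative assumptions rather than the paper's exact schema.

    def build_qa_sample(image_paths, description, question, answer):
        # Assemble one multi-frame QA pair in the
        # "<image>...<image>{description}{question}" format.
        image_tokens = "".join("<image>" for _ in image_paths)
        return {
            "images": image_paths,
            "user": f"{image_tokens}{description}{question}",
            "assistant": answer,
        }

    sample = build_qa_sample(
        image_paths=["frame_000.jpg", "frame_015.jpg"],   # hypothetical files
        description="You are given two frames of the same scene. ",
        question="Did the camera move left or right between the frames?",
        answer="The camera moved to the right.",
    )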
On the BLINK benchmark, Multi-SpatialMLLM achieves nearly 90% accuracy with an average 26.4% improvement over base models, surpassing several proprietary systems and showing transferable multi-frame spatial understanding. Standard VQA benchmark evaluations show rough parity with the original performance, indicating that the model maintains general-purpose MLLM proficiency without overfitting to spatial reasoning tasks.

In summary, the researchers extend MLLMs' spatial understanding to multi-frame scenarios, addressing a critical gap overlooked in previous investigations. They introduce MultiSPA, the first large-scale dataset and benchmark for multi-frame spatial reasoning tasks. Experimental validation shows the effectiveness, scalability, and strong generalization of the proposed Multi-SpatialMLLM across diverse spatial understanding challenges. The research also reveals significant insights, including multi-task learning benefits and emergent behaviors in complex spatial reasoning, and the model enables new applications, such as acting as a multi-frame reward annotator.

Check out the Paper, Project Page and GitHub Page. All credit for this research goes to the researchers of this project.


Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better Alignment

Reinforcement learning (RL) has emerged as a fundamental approach in LLM post-training, using supervision signals from human feedback (RLHF) or verifiable rewards (RLVR). While RLVR shows promise in mathematical reasoning, it faces significant constraints because it depends on training queries with verifiable answers. This requirement limits applications to large-scale training on general-domain queries where verification proves intractable. Further, current reward models, categorized into scalar and generative types, cannot effectively scale test-time compute for reward estimation: existing approaches apply uniform computational resources across all inputs, with no way to allocate additional resources to challenging queries that require nuanced analysis.

Reward models are characterized by their formulation strategies and scoring schemes. Numeric approaches assign scalar scores to query-response pairs, while generative methods produce natural language feedback. Scoring follows either absolute evaluation of individual pairs or discriminative comparison of candidate responses. Generative reward models, aligned with the LLM-as-a-Judge paradigm, offer interpretable feedback but face reliability concerns due to biased judgments. Inference-time scaling methods dynamically adjust computational resources, including parallel strategies such as multi-sampling and horizon-based scaling for extended reasoning traces. However, they lack systematic adaptation to input complexity, limiting their effectiveness across diverse query types.

Researchers from Microsoft Research, Tsinghua University, and Peking University have proposed Reward Reasoning Models (RRMs), which perform explicit reasoning before producing final rewards. This reasoning phase allows RRMs to adaptively allocate additional compute when evaluating responses to complex tasks. RRMs add a new dimension to reward modeling by scaling test-time compute while maintaining general applicability across diverse evaluation scenarios. Through chain-of-thought reasoning, RRMs spend additional test-time compute on complex queries where appropriate rewards are not immediately apparent, and they self-evolve reward reasoning capabilities without explicit reasoning traces as training data.

RRMs use the Qwen2 model with a Transformer-decoder backbone, formulating reward modeling as text completion: the model autoregressively generates a thinking process followed by a final judgment. Each input contains a query and two responses, and the model must determine a preference with no ties allowed. The researchers use the RewardBench repository to guide systematic analysis across evaluation criteria, including instruction fidelity, helpfulness, accuracy, harmlessness, and detail level. RRMs support multi-response evaluation through ELO rating systems and knockout tournaments, both combinable with majority voting for enhanced test-time compute utilization: the RRM is sampled multiple times for each pairwise comparison, and majority voting yields a robust comparison result (see the sketch below).

Evaluation results show that RRMs achieve competitive performance against strong baselines on the RewardBench and PandaLM Test benchmarks, with RRM-32B attaining 98.6% accuracy in reasoning categories. Comparison with DirectJudge models trained on identical data reveals substantial performance gaps, indicating that RRMs effectively use test-time compute for complex queries.
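The tournament-and-voting scheme lends itself to a short sketch. In the code below, judge is a stub standing in for an actual RRM pairwise call, and the pairing logic is one plausible implementation, not the paper's exact procedure.

    import random

    def judge(query, resp_a, resp_b):
        # Stub for one RRM pairwise comparison. A real implementation
        # would prompt the reward reasoning model, let it generate a
        # chain of thought, and parse its final verdict; here we pick
        # a winner at random so the sketch stays self-contained.
        return random.choice([resp_a, resp_b])

    def majority_vote(query, resp_a, resp_b, n_samples=5):
        # Sample the judge several times and keep the majority winner.
        wins_a = sum(judge(query, resp_a, resp_b) == resp_a
                     for _ in range(n_samples))
        return resp_a if wins_a * 2 > n_samples else resp_b

    def knockout_tournament(query, responses, n_samples=5):
        # Reduce a candidate pool to a single winner by pairwise rounds.
        pool = list(responses)
        while len(pool) > 1:
            next_round = [majority_vote(query, pool[i], pool[i + 1], n_samples)
                          for i in range(0, len(pool) - 1, 2)]
            if len(pool) % 2:          # odd candidate gets a bye
                next_round.append(pool[-1])
            pool = next_round
        return pool[0]

    best = knockout_tournament("Which reply is better?",
                               ["reply A", "reply B", "reply C", "reply D"])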
In reward-guided best-of-N inference, RRMs surpass all baseline models without additional test-time compute, and majority voting provides substantial improvements across the evaluated subsets. Post-training experiments show steady downstream performance improvements on MMLU-Pro and GPQA. Scaling experiments across 7B, 14B, and 32B models confirm that longer thinking horizons consistently improve accuracy.

In conclusion, the researchers introduced RRMs, which perform explicit reasoning before reward assignment to address the computational inflexibility of existing reward modeling approaches. Rule-based reward RL enables RRMs to develop complex reasoning capabilities without requiring explicit reasoning traces as supervision, and RRMs efficiently utilize test-time compute through parallel and sequential scaling. Their effectiveness in practical applications, including reward-guided best-of-N inference and post-training feedback, demonstrates their potential as strong alternatives to traditional scalar reward models in alignment techniques.

Check out the Paper and Models on Hugging Face. All credit for this research goes to the researchers of this project.


This AI Paper Introduces Differentiable MCMC Layers: A New AI Framework for Learning with Inexact Combinatorial Solvers in Neural Networks

Neural networks have long been powerful tools for handling complex data-driven tasks. Still, they often struggle to make discrete decisions under strict constraints, such as routing vehicles or scheduling jobs. These discrete decision problems, common in operations research, are computationally intensive and difficult to integrate into the smooth, continuous frameworks of neural networks. This limits the ability to combine learning-based models with combinatorial reasoning, creating a bottleneck in applications that demand both.

A major issue arises when integrating discrete combinatorial solvers with gradient-based learning systems. Many combinatorial problems are NP-hard, so no known algorithm finds exact solutions for large instances within a reasonable time. Existing strategies often depend on exact solvers or introduce continuous relaxations, which may not produce solutions that respect the hard constraints of the original problem. These approaches typically carry heavy computational costs, and when exact oracles are unavailable, they fail to deliver consistent gradients for learning. This creates a gap: neural networks can learn representations but cannot reliably make complex, structured decisions in a way that scales.

Commonly used methods rely on exact solvers for structured inference tasks, such as MAP solvers in graphical models or linear programming relaxations. These methods often require repeated oracle calls during each training iteration and depend on specific problem formulations. Techniques like Fenchel-Young losses or perturbation-based methods allow approximate learning, but their guarantees break down when used with inexact solvers such as local search heuristics. This reliance on exact solutions hinders practical use in large-scale, real-world combinatorial tasks, such as vehicle routing with dynamic requests and time windows.

Researchers from Google DeepMind and ENPC propose a novel solution: turning local search heuristics into differentiable combinatorial layers through the lens of Markov Chain Monte Carlo (MCMC) methods. They create MCMC layers that operate on discrete combinatorial spaces by mapping problem-specific neighborhood systems into proposal distributions. This design lets neural networks integrate local search heuristics, such as simulated annealing or Metropolis-Hastings, into the learning pipeline without access to exact solvers. The approach enables gradient-based learning over discrete solutions by using acceptance rules that correct for the bias introduced by approximate solvers, ensuring theoretical soundness while reducing the computational burden.

In more detail, the researchers construct a framework in which local search heuristics propose neighbor solutions based on the problem structure, and MCMC acceptance rules turn these moves into a valid sampling process over the solution space. The resulting MCMC layer approximates the target distribution over feasible solutions and provides unbiased gradients under a target-dependent Fenchel-Young loss, even with a single MCMC iteration per forward pass, while maintaining theoretical convergence properties. By embedding this layer in a neural network, they can train models that predict parameters of combinatorial problems and improve solution quality over time.
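A minimal sketch of the core sampling step may help. The code below implements a Metropolis-style acceptance rule over a problem-specific neighborhood; the neighbors and energy callables, the symmetric-neighborhood assumption, and the toy usage are illustrative assumptions rather than the authors' implementation.

    import math
    import random

    def mcmc_layer_sample(theta, y_init, neighbors, energy, n_steps=1):
        # neighbors(y) returns the local-search move set of solution y and
        # energy(theta, y) is the network-predicted cost of y; both are
        # problem-specific assumptions. With symmetric neighborhoods the
        # Metropolis rule below targets p(y | theta) proportional to
        # exp(-energy(theta, y)), correcting the heuristic's bias.
        y = y_init
        for _ in range(n_steps):
            y_prop = random.choice(neighbors(y))      # heuristic proposal
            log_alpha = energy(theta, y) - energy(theta, y_prop)
            if random.random() < math.exp(min(0.0, log_alpha)):
                y = y_prop                            # accept the move
        return y  # sample used for Fenchel-Young-style gradient estimates

    # Toy usage: integer "solutions" with a quadratic energy around theta.
    sample = mcmc_layer_sample(
        theta=3.0,
        y_init=0,
        neighbors=lambda y: [y - 1, y + 1],
        energy=lambda theta, y: (y - theta) ** 2,
        n_steps=100,
    )
    print(sample)  # concentrates near 3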
The research team evaluated the method on a large-scale dynamic vehicle routing problem with time windows, a complex, real-world combinatorial optimization task. Their approach handles large instances efficiently and significantly outperforms perturbation-based methods under limited time budgets. For example, the MCMC layer achieved a test relative cost of 5.9% compared to anticipative baselines when using a heuristic-based initialization, whereas the perturbation-based method achieved 6.3% under the same conditions. Even at extremely low time budgets, such as a 1 ms time limit, the method outperformed perturbation approaches by a large margin, achieving 7.8% relative cost versus 65.2%. The researchers also showed that initializing the MCMC chain with ground-truth solutions or heuristic-enhanced states improved learning efficiency and solution quality, especially when using a small number of MCMC iterations.

This research demonstrates a principled way to integrate NP-hard combinatorial problems into neural networks without relying on exact solvers. The problem of combining learning with discrete decision-making is addressed by MCMC layers constructed from local search heuristics, enabling theoretically sound, efficient training. The proposed method bridges the gap between deep learning and combinatorial optimization, providing a scalable and practical solution for complex tasks like vehicle routing.

Check out the Paper. All credit for this research goes to the researchers of this project.


Researchers at UT Austin Introduce Panda: A Foundation Model for Nonlinear Dynamics Pretrained on 20,000 Chaotic ODEs Discovered via Evolutionary Search

Chaotic systems, such as fluid dynamics or brain activity, are highly sensitive to initial conditions, making long-term prediction difficult. Even minor modeling errors grow rapidly, which limits the effectiveness of many scientific machine learning (SciML) approaches. Traditional forecasting methods rely on models trained on specific time series or on broad datasets lacking true dynamical structure. However, recent work has demonstrated that local forecasting models can predict chaotic systems more accurately over longer horizons by learning the numerical rules governing these systems. The real challenge is out-of-domain generalization: creating models that can adapt to and forecast new, previously unseen dynamical systems. This requires integrating prior knowledge with the ability to adapt locally. Still, current methods are constrained by the need for task-specific data and often overlook key dynamical properties such as ergodicity, channel coupling, and conserved quantities.

Machine learning for dynamical systems (MLDS) uses the unique properties of such systems as inductive biases, including fixed relationships among system variables and invariant statistical measures such as strange attractors or conserved quantities. MLDS models exploit these properties to build more accurate and generalizable models, sometimes incorporating probabilistic or latent-variable techniques. While datasets of dynamical systems have been curated, and new systems are often generated by tweaking parameters or using symbolic methods, these approaches typically do not ensure diverse or stable dynamics. Structural stability is a challenge: small parameter changes may not yield new behaviors, while large ones can collapse the dynamics to trivial behavior. Foundation models aim to address this through transfer learning and zero-shot inference, yet most current models perform comparably to standard time series models or generate limited dynamical variety. Some progress has come from techniques like embedding spaces or symbolic discovery, but a richer, more diverse sampling of dynamical behaviors remains an open challenge.

Researchers at the Oden Institute, UT Austin, introduce Panda (Patched Attention for Nonlinear Dynamics), a model pretrained solely on synthetic data from 20,000 algorithmically generated chaotic systems, created using an evolutionary algorithm seeded with known chaotic ODEs. Despite training only on low-dimensional ODEs, Panda shows strong zero-shot forecasting on real-world nonlinear systems, including fluid dynamics and electrophysiology, and unexpectedly generalizes to PDEs. The model incorporates innovations such as masked pretraining, channel attention, and kernelized patching to capture dynamical structure. A neural scaling law also emerges, linking Panda's forecasting performance to the diversity of its training systems.

The researchers generated the 20,000 new chaotic systems with a genetic algorithm that evolves a curated set of 135 known chaotic ODEs. Systems are mutated and recombined using a skew-product approach, and only genuinely chaotic behaviors are retained through rigorous tests. Augmentations such as time-delay embeddings and affine transformations expand the dataset while preserving its dynamics. A separate set of 9,300 unseen systems is held out for zero-shot testing.
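Of these augmentations, time-delay embedding is easy to illustrate. The sketch below shows the standard construction; the specific delay and dimension values are illustrative, as the summary does not state which settings Panda uses.

    import numpy as np

    def time_delay_embedding(x, delay=1, dim=3):
        # Lift a 1-D time series into delay coordinates: each row is
        # [x(t), x(t + delay), ..., x(t + (dim - 1) * delay)], a standard
        # construction that preserves attractor structure (Takens' theorem).
        n = len(x) - (dim - 1) * delay
        return np.stack(
            [x[i * delay : i * delay + n] for i in range(dim)], axis=-1)

    # Example: embed one coordinate of a sampled trajectory.
    t = np.linspace(0.0, 10.0, 1000)
    x = np.sin(t) + 0.5 * np.sin(3.1 * t)   # stand-in trajectory
    print(time_delay_embedding(x, delay=5, dim=3).shape)  # (990, 3)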
The model, Panda, is built on PatchTST and enhanced with channel attention, temporal-channel attention layers, and dynamics embeddings using polynomial and Fourier features, inspired by Koopman operator theory.

Panda demonstrates strong zero-shot forecasting on unseen nonlinear dynamical systems, outperforming models such as Chronos-SFT across various metrics and prediction horizons. Trained solely on 3D systems, it generalizes to higher-dimensional ones thanks to channel attention. Despite never encountering PDEs during training, Panda also succeeds on real-world experimental data and on chaotic PDEs such as the Kuramoto-Sivashinsky equation and the von Kármán vortex street. Architectural ablations confirm the importance of channel attention and the dynamics embeddings. The model exhibits neural scaling with increased dynamical-system diversity and forms interpretable attention patterns, suggesting resonance and attractor-sensitive structure. This indicates broad generalization across complex dynamical behaviors.

In conclusion, Panda is a pretrained model designed to uncover generalizable patterns in dynamical systems. Trained on a large, diverse set of synthetic chaotic systems, it demonstrates strong zero-shot forecasting on unseen real-world data and even on partial differential equations, despite being trained only on low-dimensional ODEs. Its performance improves with system diversity, revealing a neural scaling law, and the model shows emergent nonlinear resonance in its attention patterns. While the work focuses on low-dimensional dynamics, the approach may extend to higher-dimensional systems by exploiting sparse interactions. Future directions include alternative pretraining strategies to improve rollout performance when forecasting chaotic behaviors.

Check out the Paper. All credit for this research goes to the researchers of this project.


This AI Paper Introduces GRIT: A Method for Teaching MLLMs to Reason with Images by Interleaving Text and Visual Grounding

The core idea of multimodal large language models (MLLMs) is to create models that combine the richness of visual content with the logic of language. However, despite advances in this field, many models struggle to connect the two domains effectively, leading to limited performance on complex reasoning tasks that involve visual components.

A major challenge in building such models is their limited ability to combine visual understanding with logical thinking. Current systems often produce textual outputs that explain reasoning but fail to reference the specific parts of an image they rely on. This creates a gap where models may arrive at an answer without clearly showing how the visual evidence contributed to the decision. It is also difficult to ensure that models generate visual reasoning steps directly connected to their answers. The fundamental problem is how to naturally train models to interleave text and image reasoning without large datasets annotated with visual references, which are scarce and expensive to produce.

Existing methods try to address this with reinforcement learning or prompting strategies. Some systems generate bounding box coordinates as answers, while others produce step-by-step textual reasoning chains. However, both have limitations: models that only produce bounding boxes lack explanation, while those generating only text risk ignoring visual evidence. Because previous methods often separate visual grounding from reasoning, it is hard for models to explain why a particular visual element leads to a certain conclusion. While some models use dense supervision data or additional tools, they generally require heavy annotation and do not scale well. This makes it difficult to build models that explain their reasoning transparently and handle varied visual tasks with minimal data.

Researchers from UC Santa Cruz and eBay introduced a new method called Grounded Reasoning with Images and Text (GRIT), which lets MLLMs such as Qwen 2.5-VL and InternVL 3 generate reasoning chains that mix natural language with explicit bounding box coordinates pointing to relevant image regions. This unified approach enables models to reason about and visually ground their answers without dense annotations or labeled reasoning chains. GRIT uses a lightweight reinforcement learning algorithm called GRPO-GR, which optimizes both the accuracy of the final answer and the structure of the reasoning, encouraging models to include specific tokens such as <think> and <rethink>, as well as well-formed bounding boxes (see the sketch below). This design eliminates the need for costly annotated data while ensuring that models learn to reference visual content meaningfully within their logical steps.

The methodology focuses on generating outputs that combine textual reasoning and visual grounding seamlessly. Instead of requiring models to process cropped images or additional visual data after generating bounding boxes, GRIT teaches models to use their internal understanding of the image: bounding boxes are produced during the reasoning process, and models learn to reflect on those coordinates within their logical reasoning. The reinforcement learning framework rewards the correct use of bounding box formats and reasoning structure, guiding models toward coherent, grounded reasoning chains. GRIT is remarkably data-efficient, using only 20 image-question-answer triplets sourced from the Visual Spatial Reasoning and TallyQA datasets.
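As referenced above, the reward on reasoning structure can be sketched as a simple format checker. The token set, bounding box syntax, and reward weights below are assumptions for illustration, not the GRPO-GR specification.

    import re

    # Matches a [x1, y1, x2, y2] integer bounding box; the syntax is an
    # assumed stand-in for GRIT's actual box format.
    BOX_PATTERN = re.compile(r"\[\s*\d+\s*,\s*\d+\s*,\s*\d+\s*,\s*\d+\s*\]")

    def format_reward(output: str) -> float:
        # Reward pieces of the output structure; weights are illustrative.
        reward = 0.0
        if "<think>" in output and "</think>" in output:
            reward += 0.5    # reasoning block present
        if "<rethink>" in output:
            reward += 0.25   # reflection token present
        if BOX_PATTERN.search(output):
            reward += 0.25   # at least one grounded bounding box
        return reward

    print(format_reward(
        "<think>The cat at [34, 50, 120, 180] sits left of the dog at "
        "[200, 60, 310, 190].</think><rethink>Boxes verified.</rethink> "
        "Answer: the cat is to the left."))  # 1.0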
Training was conducted on NVIDIA A100 GPUs, with optimization techniques such as AdamW and a cosine scheduler applied over 200 training steps, demonstrating that the method trains with modest compute despite the limited data.

Performance evaluations showed that GRIT-trained models outperform several baselines in reasoning and grounding accuracy. For example, Qwen 2.5-VL trained with GRIT achieved 72.9% answer accuracy on Visual Spatial Reasoning, 47.8% on TallyQA, and 62.8% on GQA. It also reached a grounding IoU of 0.325 on VSR and 0.447 on TallyQA. In contrast, baselines such as Direct Query or Chain-of-Thought often performed significantly lower, showing a limited ability to unify reasoning with visual grounding. GRIT models demonstrated a strong correlation between visual regions and textual reasoning, producing outputs that reflect a meaningful connection between image evidence and logical thought. GRIT also improved on out-of-domain benchmarks, though gains were more pronounced on in-domain data, highlighting the importance of training-data diversity for broader generalization.

In conclusion, the research addresses the problem of disconnected reasoning and visual grounding in MLLMs by introducing GRIT. The method allows models to reason with images through a simple, efficient approach that requires minimal data. GRIT teaches MLLMs to combine visual evidence with logical reasoning in a unified output, achieving strong performance across multiple benchmarks and marking a promising step toward more interpretable AI systems.

Check out the Paper, Project, and GitHub Page. All credit for this research goes to the researchers of this project.
