YouZum

Committee

AI, Committee, ข่าว, Uncategorized

Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language Models

Multi-modal large language models (MLLMs) have shown great progress as versatile AI assistants capable of handling diverse visual tasks. However, their deployment as isolated digital entities limits their potential impact. The growing demand to integrate MLLMs into real-world applications like robotics and autonomous vehicles requires complex spatial understanding. Current MLLMs show fundamental spatial reasoning deficiencies, often failing at basic tasks such as distinguishing left from right. While previous research attributes these limitations to insufficient specialized training data and solves them through spatial data incorporation during training, these approaches focus on single-image scenarios, thus restricting the model’s perception to static field-of-view analysis without dynamic information. Several research methods have tried to address spatial understanding limitations in MLLMs. MLLMs incorporate image encoders that convert visual inputs into tokens processed alongside text in the language model’s latent space. Previous research has focused on single-image spatial understanding, evaluating inter-object spatial relations, or spatial recognition. Some benchmarks like BLINK, UniQA-3D, and VSIBench extend beyond single images. Existing improvements of MLLMs for spatial understanding include SpatialVLM, which fine-tunes models on curated spatial datasets, SpatialRGPT, which incorporates mask-based references and depth images, and SpatialPIN, which utilizes specialized perception models without fine-tuning. Researchers from FAIR Meta and the Chinese University of Hong Kong have proposed a framework to enhance MLLMs with robust multi-frame spatial understanding. This integrates three components: depth perception, visual correspondence, and dynamic perception to overcome the limitations of static single-image analysis. Researchers develop MultiSPA, a novel large-scale dataset containing over 27 million samples spanning diverse 3D and 4D scenes. The resulting Multi-SpatialMLLM model achieves significant improvements over baselines and proprietary systems, with scalable and generalizable multi-frame reasoning. Further, five tasks are introduced to generate training data: depth perception, visual correspondence, camera movement perception, object movement perception, and object size perception. The Multi-SpatialMLLM centers around the MultiSPA data generation pipeline and comprehensive benchmark system. The data format follows standard MLLM fine-tuning strategies, which have the format of QA pairs: User: <image>…<image>{description}{question} and Assistant: {answer}. Researchers used the GPT-4o to generate diverse templates for task descriptions, questions, and answers. Further, high-quality annotated scene datasets are used, including 4D datasets Aria Digital Twin and Panoptic Studio, along with 3D tracking annotations from TAPVid3D for object movement perception and ScanNet for other spatial tasks. The MultiSPA generates over 27M QA samples from 1.1M unique images, with 300 samples held out for each subtask evaluation, totaling 7,800 benchmark samples. On the MultiSPA benchmark, the Multi-SpatialMLLM achieves an average 36% gain over base models, reaching 80-90% accuracy on qualitative tasks compared to 50% for baseline models while outperforming all proprietary systems. Even on challenging tasks like predicting camera movement vectors, it attains 18% accuracy versus near-zero performance from other baselines. On the BLINK benchmark, Multi-SpatialMLLM achieves nearly 90% accuracy with an average 26.4% improvement over base models, surpassing several proprietary systems and showing transferable multi-frame spatial understanding. Standard VQA benchmark evaluations show rough parity with original performance, indicating the model maintains general-purpose MLLM proficiency without overfitting to spatial reasoning tasks. In this paper, researchers extend MLLMs’ spatial understanding to multi-frame scenarios, addressing a critical gap overlooked in previous investigations. They introduced MultiSPA, the first large-scale dataset and benchmark for multi-frame spatial reasoning tasks. Experimental validation shows the effectiveness, scalability, and strong generalization capabilities of the proposed Multi-SpatialMLLM across diverse spatial understanding challenges. The research reveals significant insights, including multi-task learning benefits and emergent behaviors in complex spatial reasoning. The model establishes new applications, including acting as a multi-frame reward annotator. Check out the Paper, Project Page and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. The post Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language Models appeared first on MarkTechPost.

Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language Models Read Post »

AI, Committee, ข่าว, Uncategorized

Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better Alignment

Reinforcement learning (RL) has emerged as a fundamental approach in LLM post-training, utilizing supervision signals from human feedback (RLHF) or verifiable rewards (RLVR). While RLVR shows promise in mathematical reasoning, it faces significant constraints due to dependence on training queries with verifiable answers. This requirement limits applications to large-scale training on general-domain queries where verification proves intractable. Further, current reward models, categorized into scalar and generative types, cannot effectively scale test-time compute for reward estimation. Existing approaches apply uniform computational resources across all inputs, lacking adaptability to allocate additional resources to challenging queries requiring nuanced analysis. Formulation strategies and scoring schemes characterize reward models. Numeric approaches assign scalar scores to query-response pairs, while generative methods produce natural language feedback. Scoring follows absolute evaluation of individual pairs or discriminative comparison of candidate responses. Generative reward models, aligned with the LLM-as-a-Judge paradigm, offer interpretable feedback but face reliability concerns due to biased judgments. Inference-time scaling methods dynamically adjust computational resources, including parallel strategies like multi-sampling and horizon-based scaling for extended reasoning traces. However, they lack systematic adaptation to input complexity, limiting their effectiveness across diverse query types. Researchers from Microsoft Research, Tsinghua University, and Peking University have proposed Reward Reasoning Models (RRMs), which perform explicit reasoning before producing final rewards. This reasoning phase allows RRMs to adaptively allocate additional computational resources when evaluating responses to complex tasks. RRMs introduce a dimension for enhancing reward modeling by scaling test-time compute while maintaining general applicability across diverse evaluation scenarios. Through chain-of-thought reasoning, RRMs utilize additional test-time compute for complex queries where appropriate rewards are not immediately apparent. This encourages RRMs to self-evolve reward reasoning capabilities without explicit reasoning traces as training data. RRMs utilize the Qwen2 model with a Transformer-decoder backbone, formulating reward modeling as text completion where RRMs autoregressively generate thinking processes followed by final judgments. Each input contains a query and two responses to determine preference without allowing ties. Researchers use the RewardBench repository to guide systematic analysis across evaluation criteria, including instruction fidelity, helpfulness, accuracy, harmlessness, and detail level. RRMs support multi-response evaluation through ELO rating systems and knockout tournaments, both combinable with majority voting for enhanced test-time compute utilization. This samples RRMs multiple times for pairwise comparisons, performing majority voting to obtain robust comparison results. Evaluation results show that RRMs achieve competitive performance against strong baselines on RewardBench and PandaLM Test benchmarks, with RRM-32B attaining 98.6% accuracy in reasoning categories. Comparing with DirectJudge models trained on identical data reveals substantial performance gaps, indicating RRMs effectively use test-time compute for complex queries. In reward-guided best-of-N inference, RRMs surpass all baseline models without additional test-time compute, with majority voting providing substantial improvements across evaluated subsets. Post-training experiments show steady downstream performance improvements on MMLU-Pro and GPQA. Scaling experiments across 7B, 14B, and 32B models confirm that longer thinking horizons consistently improve accuracy. In conclusion, researchers introduced RRMs to perform explicit reasoning processes before reward assignment to address computational inflexibility in existing reward modeling approaches. Rule-based-reward RL enables RRMs to develop complex reasoning capabilities without requiring explicit reasoning traces as supervision. RRMs efficiently utilize test-time compute through parallel and sequential scaling approaches. The effectiveness of RRMs in practical applications, including reward-guided best-of-N inference and post-training feedback, demonstrates their potential as strong alternatives to traditional scalar reward models in alignment techniques. Check out the Paper and Models on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. The post Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better Alignment appeared first on MarkTechPost.

Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better Alignment Read Post »

AI, Committee, ข่าว, Uncategorized

This AI Paper Introduces Differentiable MCMC Layers: A New AI Framework for Learning with Inexact Combinatorial Solvers in Neural Networks

Neural networks have long been powerful tools for handling complex data-driven tasks. Still, they often struggle to make discrete decisions under strict constraints, like routing vehicles or scheduling jobs. These discrete decision problems, commonly found in operations research, are computationally intensive and difficult to integrate into the smooth, continuous frameworks of neural networks. Such challenges limit the ability to combine learning-based models with combinatorial reasoning, creating a bottleneck in applications that demand both. A major issue arises when integrating discrete combinatorial solvers with gradient-based learning systems. Many combinatorial problems are NP-hard, meaning it’s impossible to find exact solutions within a reasonable time for large instances. Existing strategies often depend on exact solvers or introduce continuous relaxations, which may not provide solutions that respect the hard constraints of the original problem. These approaches typically involve heavy computational costs, and when exact oracles are unavailable, the methods fail to deliver consistent gradients for learning. This creates a gap where neural networks can learn representations but cannot reliably make complex, structured decisions in a way that scales. Commonly used methods rely on exact solvers for structured inference tasks, such as MAP solvers in graphical models or linear programming relaxations. These methods often require repeated oracle calls during each training iteration and depend on specific problem formulations. Techniques like Fenchel-Young losses or perturbation-based methods allow approximate learning, but their guarantees break down when used with inexact solvers like local search heuristics. This reliance on exact solutions hinders their practical use in large-scale, real-world combinatorial tasks, such as vehicle routing with dynamic requests and time windows. Researchers from Google DeepMind and ENPC propose a novel solution by transforming local search heuristics into differentiable combinatorial layers through the lens of Markov Chain Monte Carlo (MCMC) methods. The researchers create MCMC layers that operate on discrete combinatorial spaces by mapping problem-specific neighborhood systems into proposal distributions. This design allows neural networks to integrate local search heuristics, like simulated annealing or Metropolis-Hastings, as part of the learning pipeline without access to exact solvers. Their approach enables gradient-based learning over discrete solutions by using acceptance rules that correct for the bias introduced by approximate solvers, ensuring theoretical soundness while reducing the computational burden. In more detail, the researchers construct a framework where local search heuristics propose neighbor solutions based on the problem structure, and the acceptance rules from MCMC methods ensure these moves result in a valid sampling process over the solution space. The resulting MCMC layer approximates the target distribution of feasible solutions and provides unbiased gradients for a single iteration under a target-dependent Fenchel-Young loss. This makes it possible to perform learning even with minimal MCMC iterations, such as using a single sample per forward pass while maintaining theoretical convergence properties. By embedding this layer in a neural network, they can train models that predict parameters for combinatorial problems and improve solution quality over time. The research team evaluated this method on a large-scale dynamic vehicle routing problem with time windows, a complex, real-world combinatorial optimization task. They showed their approach could handle large instances efficiently, significantly outperforming perturbation-based methods under limited time budgets. For example, their MCMC layer achieved a test relative cost of 5.9% compared to anticipative baselines when using a heuristic-based initialization. In comparison, the perturbation-based method achieved 6.3% under the same conditions. Even at extremely low time budgets, such as a 1 ms time limit, their method outperformed perturbation methods by a large margin—achieving 7.8% relative cost versus 65.2% for perturbation-based approaches. They also demonstrated that initializing the MCMC chain with ground-truth solutions or heuristic-enhanced states improved learning efficiency and solution quality, especially when using a small number of MCMC iterations. This research demonstrates a principled way to integrate NP-hard combinatorial problems into neural networks without relying on exact solvers. The problem of combining learning with discrete decision-making is addressed by using MCMC layers constructed from local search heuristics, enabling theoretically sound, efficient training. The proposed method bridges the gap between deep learning and combinatorial optimization, providing a scalable and practical solution for complex tasks like vehicle routing. Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. The post This AI Paper Introduces Differentiable MCMC Layers: A New AI Framework for Learning with Inexact Combinatorial Solvers in Neural Networks appeared first on MarkTechPost.

This AI Paper Introduces Differentiable MCMC Layers: A New AI Framework for Learning with Inexact Combinatorial Solvers in Neural Networks Read Post »

AI, Committee, ข่าว, Uncategorized

Researchers at UT Austin Introduce Panda: A Foundation Model for Nonlinear Dynamics Pretrained on 20,000 Chaotic ODE Discovered via Evolutionary Search

Chaotic systems, such as fluid dynamics or brain activity, are highly sensitive to initial conditions, making long-term predictions difficult. Even minor errors in modeling these systems can rapidly grow, which limits the effectiveness of many scientific machine learning (SciML) approaches. Traditional forecasting methods rely on models trained on specific time series or broad datasets lacking true dynamical structure. However, recent work has demonstrated the potential for local forecasting models to predict chaotic systems more accurately over longer timeframes by learning the numerical rules governing these systems. The real challenge is achieving out-of-domain generalization—creating models that can adapt and forecast new, previously unseen dynamical systems. This would require integrating prior knowledge with the ability to adapt locally. Still, the need for task-specific data constrains current methods and often overlooks key dynamical system properties such as ergodicity, channel coupling, and conserved quantities. Machine learning for dynamical systems (MLDS) utilizes the unique properties of such systems as inductive biases. These include fixed relationships among system variables and invariant statistical measures, like strange attractors or conserved quantities. MLDS models use these properties to build more accurate and generalizable models, sometimes incorporating probabilistic or latent variable techniques. While datasets of dynamical systems have been curated and new systems are often generated by tweaking parameters or using symbolic methods, these approaches typically don’t ensure diverse or stable dynamics. Structural stability is a challenge—small changes may not yield new behaviors, while large ones can cause trivial dynamics. Foundation models aim to address this by enabling transfer learning and zero-shot inference. Still, most current models perform comparably to standard time series models or are limited in generating meaningful, dynamic variety. Some progress has been made through techniques like embedding spaces or symbolic discovery, but a richer, more diverse sampling of dynamical behaviors remains an open challenge.  Researchers at the Oden Institute, UT Austin, introduce Panda (Patched Attention for Nonlinear Dynamics), a pretrained model trained solely on synthetic data from 20,000 algorithmically-generated chaotic systems. These systems were created using an evolutionary algorithm based on known chaotic ODEs. Despite training only on low-dimensional ODEs, Panda shows strong zero-shot forecasting on real-world nonlinear systems—including fluid dynamics and electrophysiology—and unexpectedly generalizes to PDEs. The model incorporates innovations like masked pretraining, channel attention, and kernelized patching to capture dynamical structure. A neural scaling law also emerges, linking Panda’s forecasting performance to the diversity of training systems.  The researchers generated 20,000 new chaotic systems using a genetic algorithm that evolves from a curated set of 135 known chaotic ODEs. These systems are mutated and recombined using a skew product approach, with only truly chaotic behaviors retained through rigorous tests. Augmentations like time-delay embeddings and affine transformations expand the dataset while preserving its dynamics. A separate set of 9,300 unseen systems is held out for zero-shot testing. The model, Panda, is built on PatchTST and enhanced with features like channel attention, temporal-channel attention layers, and dynamic embeddings using polynomial and Fourier features, inspired by Koopman operator theory.  Panda demonstrates strong zero-shot forecasting capabilities on unseen nonlinear dynamical systems, outperforming models like Chronos-SFT across various metrics and prediction horizons. Trained solely on 3D systems, it generalizes to higher-dimensional ones due to channel attention. Despite never encountering PDEs during training, Panda also succeeds on real-world experimental data and chaotic PDEs, such as the Kuramoto-Sivashinsky and von Kármán vortex street. Architectural ablations confirm the importance of channel attention and dynamics embeddings. The model exhibits neural scaling with increased dynamical system diversity and forms interpretable attention patterns, suggesting resonance and attractor-sensitive structure. This indicates Panda’s broad generalization across complex dynamical behaviors.  In conclusion, Panda is a pretrained model designed to uncover generalizable patterns in dynamical systems. Trained on a large, diverse set of synthetic chaotic systems, Panda demonstrates strong zero-shot forecasting on unseen real-world data and even partial differential equations, despite only being trained on low-dimensional ODEs. Its performance improves with system diversity, revealing a neural scaling law. The model also shows emergent nonlinear resonance in attention patterns. While focused on low-dimensional dynamics, the approach may extend to higher-dimensional systems by leveraging sparse interactions. Future directions include alternative pretraining strategies to improve rollout performance forecasting chaotic behaviors.  Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. The post Researchers at UT Austin Introduce Panda: A Foundation Model for Nonlinear Dynamics Pretrained on 20,000 Chaotic ODE Discovered via Evolutionary Search appeared first on MarkTechPost.

Researchers at UT Austin Introduce Panda: A Foundation Model for Nonlinear Dynamics Pretrained on 20,000 Chaotic ODE Discovered via Evolutionary Search Read Post »

AI, Committee, ข่าว, Uncategorized

Microsoft Releases NLWeb: An Open Project that Allows Developers to Easily Turn Any Website into an AI-Powered App with Natural Language Interfaces

Many websites lack accessible and cost-effective ways to integrate natural language interfaces, making it difficult for users to interact with site content through conversational AI. Existing solutions often depend on centralized, proprietary services or require significant technical expertise, limiting scalability and adaptability. This creates a barrier for developers who want to implement intelligent agents capable of answering questions or assisting users using the site’s data. As a result, there is a need for an open, standardized approach that allows websites to expose structured information and support natural language interactions without relying heavily on external infrastructure or high-cost models.  Building conversational interfaces for websites remains a complex challenge, often requiring custom solutions and deep technical expertise. NLWeb, developed by Microsoft researchers, aims to simplify this process by enabling sites to support natural language interactions easily. By natively integrating with the Machine Communication Protocol (MCP), NLWeb allows the same language interfaces to be used by both human users and AI agents. It builds on existing web standards like Schema.org and RSS—already used by millions of websites—to provide a semantic foundation that can be easily leveraged for natural language capabilities. NLWeb is not a single tool or product but a suite of open protocols and open-source reference implementations designed to lay the groundwork for an AI-enabled web. Like HTML once did for document sharing, NLWeb envisions a shared infrastructure for integrating conversational AI into web content. Its sample code is a practical starting point rather than a final solution, encouraging community innovation and diverse implementations. This open, collaborative model draws inspiration from the early days of the internet, where shared standards and grassroots efforts drove rapid progress. NLWeb aims to do the same for AI-driven web experiences by enabling human-friendly interfaces and agent-to-agent communication through common protocols.  NLWeb consists of two main parts: a simple protocol for natural language interaction with websites, and a JSON-based response format that uses Schema.org. It includes an implementation that works well for sites structured as item lists—like products or reviews—and offers UI widgets to enable conversational access to such content. NLWeb also acts as an MCP (Model Context Protocol) server, letting AI models ask questions via a standardized “ask” method. Responses combine existing site data with insights from large language models, enhancing user interaction. NLWeb is open, cross-platform, and compatible with various AI models and vector databases, offering flexible integration options.  NLWeb offers web publishers a simple way to add conversational AI to their sites with minimal coding and without needing to build chatbots from scratch. It uses existing site data, ensuring accurate, real-time responses while keeping costs low. Publishers can choose which AI models to use and maintain control over their data. The system improves user engagement by enabling natural interactions, personalizing content, and enhancing support. Its open-source nature allows customization, and it positions websites for a future where AI agents browse and interact with the web.  In conclusion, NLWeb represents a foundational step toward a more interactive and intelligent web, where users can engage with websites through natural language rather than rigid interfaces. By combining structured data formats like Schema.org with the power of AI models, NLWeb simplifies the creation of conversational experiences. It empowers publishers to enhance their sites with minimal effort, offering benefits like improved user engagement, faster support, and personalized content delivery. As the web evolves into an ecosystem where AI agents play a growing role, NLWeb ensures that websites are not only accessible to humans but also seamlessly integrable with the agent-driven digital future.  Check out the GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. The post Microsoft Releases NLWeb: An Open Project that Allows Developers to Easily Turn Any Website into an AI-Powered App with Natural Language Interfaces appeared first on MarkTechPost.

Microsoft Releases NLWeb: An Open Project that Allows Developers to Easily Turn Any Website into an AI-Powered App with Natural Language Interfaces Read Post »

th