
AI, Committee, News, Uncategorized

VeBrain: A Unified Multimodal AI Framework for Visual Reasoning and Real-World Robotic Control

Bridging Perception and Action in Robotics

Multimodal Large Language Models (MLLMs) hold promise for enabling machines, such as robotic arms and legged robots, to perceive their surroundings, interpret scenarios, and take meaningful actions. The integration of such intelligence into physical systems is advancing the field of robotics, pushing it toward autonomous machines that don’t just see and describe but also plan and move within their environments based on contextual understanding.

Despite the growing power of MLLMs, one persistent issue is their inability to combine vision, reasoning, and physical interaction into one cohesive system. Models trained to understand images or text typically fall short when asked to control robots in real-world spaces. The core problem is that understanding a scene is fundamentally different from acting within it: multimodal understanding focuses on perception and analysis, while physical control requires precise, real-time decision-making based on that perception. This disconnect creates bottlenecks when building agents that must simultaneously observe, reason, and act in varied environments.

Limitations of Prior VLA Models

Previous tools for robot control rely heavily on vision-language-action (VLA) models, which train on extensive robotic datasets to convert visual observations into control signals. While some solutions try to preserve the reasoning capability of MLLMs by translating commands into text-based actions, they struggle to maintain accuracy and adaptability during control tasks. VLAs often degrade in performance when applied to diverse or long-horizon robotic operations, and because of the gap between image-based understanding and motion control, they usually fail to generalize across different environments or robot types.

Introducing VeBrain: A Unified Multimodal Framework

Researchers from Shanghai AI Laboratory, Tsinghua University, and SenseTime Research, in collaboration with several other institutes, have introduced a unified framework called Visual Embodied Brain (VeBrain). VeBrain reformulates robot control as text-based tasks within a 2D visual space, aligning it more closely with how MLLMs operate. The framework integrates multimodal understanding, spatial reasoning, and robotic control into one structure. A specially designed robotic adapter converts the MLLM’s output into executable movement policies, enabling a single model to manage perception, reasoning, and control. VeBrain is also supported by VeBrain-600k, a high-quality instruction dataset of over 600,000 samples spanning multimodal tasks, robot motion, and reasoning steps.

Technical Components: Architecture and Robotic Adapter

VeBrain builds on a Qwen2.5-VL architecture, augmented with components that enable real-world control. The robotic adapter contains four key modules. The point tracker updates 2D keypoints as the robot’s view changes, ensuring accurate targeting. The movement controller transforms 2D keypoints into 3D movements by combining image data with depth maps. The skill executor maps predicted actions, such as “turn” or “grasp,” to pre-trained robotic skills. Lastly, the dynamic takeover module monitors for failures or anomalies, handing control back to the MLLM when needed. Together, these modules form a closed loop that decides, acts, and self-corrects, allowing robots to operate effectively in diverse situations.
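The closed loop described above can be made concrete with a small sketch. The Python below is a hypothetical rendering of the four adapter modules under simple assumptions (a pinhole camera model, a dictionary of pre-trained skills, and flow-based point tracking); none of the class or method names come from VeBrain’s actual code.

```python
import numpy as np

class RoboticAdapter:
    """Toy closed-loop adapter in the spirit of VeBrain's four modules.
    All names and geometry here are illustrative assumptions."""

    def __init__(self, skills):
        self.skills = skills  # skill name -> callable(goal_xyz) -> bool

    def track_point(self, keypoint_2d, flow):
        # Point tracker: shift a 2D keypoint by the estimated image motion.
        return keypoint_2d + flow

    def to_3d(self, keypoint_2d, depth_map, intrinsics):
        # Movement controller: back-project pixel (u, v) with depth into 3D
        # using the standard pinhole model.
        u, v = keypoint_2d.astype(int)
        z = depth_map[v, u]
        fx, fy, cx, cy = intrinsics
        return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

    def step(self, mllm_action, depth_map, intrinsics, flow):
        # One perceive-act cycle driven by an MLLM-predicted action, e.g.
        # {"skill": "grasp", "point": [120.0, 84.0]}.
        kp = self.track_point(np.asarray(mllm_action["point"], float), flow)
        goal = self.to_3d(kp, depth_map, intrinsics)
        ok = self.skills[mllm_action["skill"]](goal)  # skill executor
        # Dynamic takeover: on failure, hand control back to the MLLM.
        return "done" if ok else "replan_with_mllm"
```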
Performance Evaluation Across Multimodal and Robotic Benchmarks

VeBrain was evaluated across 13 multimodal and 5 spatial benchmarks. On MMVet, it achieved a 5.6% improvement over Qwen2.5-VL. It scored 101.5 on the CIDEr metric for ScanQA and 83.7 on MMBench. On the VSI benchmark, it averaged 39.9, outperforming Qwen2.5-VL’s 35.9. In robotic evaluations, VeBrain achieved an 86.4% success rate across seven legged-robot tasks, significantly surpassing VLA and π0, which scored 32.1% and 31.4%, respectively. On robotic-arm tasks, it achieved a success rate of 74.3%, outperforming alternatives by up to 80%. These results demonstrate VeBrain’s ability to handle long-horizon and spatially complex control challenges with high reliability.

Conclusion

The research presents a compelling direction for embodied AI. By redefining robot control as a language task, the researchers enable high-level reasoning and low-level action to coexist, bridging the gap between image understanding and robot execution in a way that is both functional and scalable. With a robust design and strong performance, VeBrain signals a shift toward more unified, intelligent robotic systems capable of operating autonomously across diverse tasks and environments.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. The post VeBrain: A Unified Multimodal AI Framework for Visual Reasoning and Real-World Robotic Control appeared first on MarkTechPost.

VeBrain: A Unified Multimodal AI Framework for Visual Reasoning and Real-World Robotic Control Read Post »

AI, Committee, News, Uncategorized

LLMs Can Also Do Well! Breaking Barriers in Semantic Role Labeling via Large Language Models

arXiv:2506.05385v1 Announce Type: new Abstract: Semantic role labeling (SRL) is a crucial task in natural language processing (NLP). Although generative decoder-based large language models (LLMs) have achieved remarkable success across various NLP tasks, they still lag behind state-of-the-art encoder-decoder (BERT-like) models in SRL. In this work, we seek to bridge this gap by equipping LLMs for SRL with two mechanisms: (a) retrieval-augmented generation and (b) self-correction. The first mechanism enables LLMs to leverage external linguistic knowledge such as predicate and argument structure descriptions, while the second allows LLMs to identify and correct inconsistent SRL outputs. We conduct extensive experiments on three widely used SRL benchmarks (CPB1.0, CoNLL-2009, and CoNLL-2012). The results demonstrate that our method achieves state-of-the-art performance in both Chinese and English, marking the first successful application of LLMs to surpass encoder-decoder approaches in SRL.
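As a rough picture of how the two mechanisms could fit together, here is a minimal Python sketch. The retrieve_frame and llm callables, the prompt wording, and the consistency check are all assumptions for illustration, not the paper’s implementation.

```python
def label_roles(sentence, predicate, llm, retrieve_frame, max_rounds=2):
    # (a) Retrieval augmentation: pull the predicate's frame description
    # (expected argument structure) from an external lexicon.
    frame = retrieve_frame(predicate)
    prompt = (f"Sentence: {sentence}\nPredicate: {predicate}\n"
              f"Frame: {frame}\nLabel the semantic roles:")
    output = llm(prompt)

    # (b) Self-correction: ask the model to check its own labels for
    # inconsistencies (duplicate core roles, overlapping spans, ...)
    # and revise until it reports consistency or rounds run out.
    for _ in range(max_rounds):
        critique = llm(f"Check these SRL labels for inconsistencies:\n{output}")
        if "consistent" in critique.lower():
            break
        output = llm(f"Revise the labels given this critique:\n{critique}\n{output}")
    return output
```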

LLMs Can Also Do Well! Breaking Barriers in Semantic Role Labeling via Large Language Models Read Post »

AI, Committee, News, Uncategorized

Can Vision Language Models Infer Human Gaze Direction? A Controlled Study

arXiv:2506.05412v1 Announce Type: cross Abstract: Gaze-referential inference, the ability to infer what others are looking at, is a critical component of a theory of mind that underpins natural human-AI interaction. In a controlled study, we evaluated this skill across 111 Vision Language Models (VLMs) using photos with systematically manipulated difficulty and variability, compared their performance with that of human participants (N = 65), and analyzed behaviors with mixed-effects models. We found that 94 of the 111 VLMs failed to do better than random guessing, while humans achieved near-ceiling accuracy. Strikingly, the failing VLMs selected each answer choice at almost equal frequency. Are they randomly guessing? Although most VLMs struggle, when we zoom in on the five top-tier VLMs with above-chance performance, we find that their performance declines with increasing task difficulty but varies only slightly across prompts and scene objects. These behavioral features cannot be explained by treating them as random guessers. Instead, they likely use a combination of heuristics and guessing, such that their performance is sensitive to task difficulty but robust to perceptual variations. This suggests that VLMs, lacking gaze inference capability, have yet to become technologies that can naturally interact with humans, though the potential remains.
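For readers unfamiliar with the analysis style the abstract mentions, here is a hedged Python sketch of a mixed-effects regression of accuracy on difficulty with a random intercept per model. The column names, toy data, and linear-probability simplification are assumptions; the paper’s actual specification may differ (a logistic mixed model would suit binary outcomes better).

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy trial-level data; a real analysis would use the full results.
df = pd.DataFrame({
    "correct":    [1, 0, 1, 1, 0, 0, 1, 0],
    "difficulty": [1, 3, 1, 2, 3, 3, 2, 3],
    "model_id":   ["A", "A", "B", "B", "C", "C", "D", "D"],
})

# Random intercept per model; accuracy regressed on task difficulty.
fit = smf.mixedlm("correct ~ difficulty", df, groups=df["model_id"]).fit()
print(fit.summary())
```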

Can Vision Language Models Infer Human Gaze Direction? A Controlled Study Read Post »

AI, Committee, News, Uncategorized

A Unified Representation for Continuity and Discontinuity: Syntactic and Computational Motivations

arXiv:2506.05686v1 Announce Type: new Abstract: This paper advances a unified representation of linguistic structure for three grammar formalisms, namely Phrase Structure Grammar (PSG), Dependency Grammar (DG), and Categorial Grammar (CG), from the perspective of syntactic and computational complexity considerations. A correspondence principle is proposed to enable a unified representation of the representational principles of PSG, DG, and CG. To that end, the paper first illustrates the series of steps needed to achieve a unified representation for a discontinuous subordinate clause from Turkish as an illustrative case. This affords a new way of approaching discontinuity in natural language from a theoretical point of view that unites and integrates the basic tenets of PSG, DG, and CG, with significant consequences for syntactic analysis. The paper then demonstrates that a unified representation can simplify computational complexity with regard to the neurocognitive representation and processing of both continuous and discontinuous sentences vis-à-vis the basic principles of PSG, DG, and CG.
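One direction of PSG-DG correspondence can be illustrated with a toy sketch: reading dependency arcs off a phrase-structure tree via head percolation. The tree, head rules, and head-child convention below are invented for this sketch and are not taken from the paper.

```python
def heads(tree, head_child):
    """tree: (label, [children]) or a terminal string.
    head_child: label -> index of the head daughter.
    Returns (head_word, list of (head, dependent) arcs)."""
    if isinstance(tree, str):
        return tree, []
    label, kids = tree
    arcs, kid_heads = [], []
    for kid in kids:
        h, a = heads(kid, head_child)
        kid_heads.append(h)
        arcs.extend(a)
    h = kid_heads[head_child[label]]               # percolate the head up
    arcs.extend((h, d) for d in kid_heads if d != h)  # siblings depend on it
    return h, arcs

# "the cat slept": S -> NP VP, with VP heading S and nouns heading NPs.
tree = ("S", [("NP", ["the", "cat"]), ("VP", ["slept"])])
rules = {"S": 1, "NP": 1, "VP": 0}
print(heads(tree, rules))  # ('slept', [('cat', 'the'), ('slept', 'cat')])
```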

A Unified Representation for Continuity and Discontinuity: Syntactic and Computational Motivations Read Post »

AI, Committee, News, Uncategorized

IntentionESC: An Intention-Centered Framework for Enhancing Emotional Support in Dialogue Systems

arXiv:2506.05947v1 Announce Type: new Abstract: In emotional support conversations, unclear intentions can lead supporters to employ inappropriate strategies, inadvertently imposing their expectations or solutions on the seeker. Clearly defined intentions are essential for guiding both the supporter’s motivations and the overall emotional support process. In this paper, we propose the Intention-centered Emotional Support Conversation (IntentionESC) framework, which defines the possible intentions of supporters in emotional support conversations, identifies key emotional state aspects for inferring these intentions, and maps them to appropriate support strategies. While Large Language Models (LLMs) excel at text generation, they fundamentally operate as probabilistic models trained on extensive datasets, lacking a true understanding of human thought processes and intentions. To address this limitation, we introduce the Intention-Centric Chain-of-Thought (ICECoT) mechanism. ICECoT enables LLMs to mimic human reasoning by analyzing emotional states, inferring intentions, and selecting suitable support strategies, thereby generating more effective emotional support responses. To train the model with ICECoT and integrate expert knowledge, we design an automated annotation pipeline that produces high-quality training data. Furthermore, we develop a comprehensive evaluation scheme to assess emotional support efficacy and conduct extensive experiments to validate our framework. Our data and code are available at https://github.com/43zxj/IntentionESC_ICECoT.
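The staged reasoning the abstract describes (emotional state, then intention, then strategy, then response) could be prototyped as a simple prompt chain. The sketch below is a hedged illustration; the prompt wording, stage names, and llm() interface are assumptions, not the released code.

```python
# Each stage feeds its output into the next stage's prompt.
STAGES = [
    ("state",     "Analyze the seeker's emotional state:\n{dialogue}"),
    ("intention", "Given this emotional state, infer the supporter's "
                  "intention:\n{state}"),
    ("strategy",  "Map the intention to a support strategy (e.g., "
                  "reflection, validation, suggestion):\n{intention}"),
    ("response",  "Write a supportive reply using strategy '{strategy}' "
                  "for:\n{dialogue}"),
]

def icecot_reply(dialogue, llm):
    ctx = {"dialogue": dialogue}
    for name, template in STAGES:
        ctx[name] = llm(template.format(**ctx))
    return ctx["response"]
```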

IntentionESC: An Intention-Centered Framework for Enhancing Emotional Support in Dialogue Systems Read Post »

AI, Committee, News, Uncategorized

Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency

arXiv:2411.16525v2 Announce Type: replace-cross Abstract: We investigate the statistical and computational limits of prompt tuning for transformer-based foundation models. Our key contributions are that prompt tuning on single-head transformers with only a single self-attention layer: (i) is universal, and (ii) supports efficient (even almost-linear time) algorithms under the Strong Exponential Time Hypothesis (SETH). Statistically, we prove that prompt tuning on such simplest-possible transformers yields universal approximators for sequence-to-sequence Lipschitz functions. In addition, we provide an exponential-in-$dL$ and -in-$(1/\epsilon)$ lower bound on the soft-prompt tokens required for prompt tuning to memorize any dataset with 1-layer, 1-head transformers. Computationally, we identify a phase transition in the efficiency of prompt tuning, determined by the norm of the soft-prompt-induced keys and queries, and provide an upper-bound criterion. Beyond this criterion, no sub-quadratic (efficient) algorithm for prompt tuning exists under SETH. Within this criterion, we showcase our theory by proving the existence of almost-linear-time prompt-tuning inference algorithms. These fundamental limits provide important necessary conditions for designing expressive and efficient prompt tuning methods for practitioners.
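For context, the setting the paper analyzes, tuning only a soft prompt in front of a frozen 1-layer, 1-head transformer, looks roughly like the following PyTorch sketch. The dimensions, loss, and frozen-backbone setup are illustrative assumptions, not the paper’s construction.

```python
import torch
import torch.nn as nn

d, n_prompt, seq_len, batch = 32, 8, 16, 4

# A single self-attention layer with a single head, kept frozen.
backbone = nn.TransformerEncoderLayer(
    d_model=d, nhead=1, dim_feedforward=4 * d, batch_first=True)
for p in backbone.parameters():
    p.requires_grad_(False)

# Only the soft prompt receives gradients.
soft_prompt = nn.Parameter(torch.randn(1, n_prompt, d) * 0.02)
opt = torch.optim.Adam([soft_prompt], lr=1e-3)

x = torch.randn(batch, seq_len, d)        # dummy inputs
target = torch.randn(batch, seq_len, d)   # dummy seq-to-seq targets

for _ in range(3):
    inp = torch.cat([soft_prompt.expand(batch, -1, -1), x], dim=1)
    out = backbone(inp)[:, n_prompt:]     # drop the prompt positions
    loss = ((out - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```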

Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency Read Post »

AI, Committee, News, Uncategorized

Meet BioReason: The World’s First Reasoning Model in Biology that Enables AI to Reason about Genomics like a Biology Expert

A major hurdle in using AI for genomics is the lack of interpretable, step-by-step reasoning from complex DNA data. While DNA foundation models excel at learning rich sequence patterns for tasks such as variant prediction and gene regulation, they often operate as black boxes, offering limited insight into the underlying biological mechanisms. Meanwhile, large language models demonstrate impressive reasoning skills across various domains, but they are not designed to handle raw genomic sequences. This gap between strong DNA representation and deep biological reasoning prevents AI from reaching expert-level understanding and limits its potential to drive scientific discovery through meaningful, hypothesis-driven explanations.

DNA foundation models have made significant progress by learning rich representations directly from genomic sequences, showing strong performance across a range of biological tasks. Models like Evo2, with its long-range capabilities, highlight this potential, but their lack of interpretability limits deeper biological insight. Large language models, in turn, excel at reasoning over biomedical texts but rarely engage directly with raw genomic data. Systems such as GeneGPT and TxGemma represent early efforts to bridge this gap, and current genomic benchmarks assess task performance but fall short in evaluating reasoning and hypothesis generation.

Researchers from the University of Toronto, Vector Institute, University Health Network (UHN), Arc Institute, Cohere, University of California, San Francisco, and Google DeepMind have introduced BIOREASON, a pioneering AI system that unites a DNA foundation model with an LLM. This integration allows BIOREASON to analyze raw genomic sequences while applying LLM-based reasoning to generate clear, biologically grounded insights. Trained through supervised fine-tuning and reinforcement learning, it achieves a performance gain of 15% or more over traditional models, reaching up to 97% accuracy in KEGG-based disease pathway prediction. The approach yields interpretable, step-by-step outputs that advance biological understanding and facilitate hypothesis generation.

The BIOREASON model is a multimodal framework designed to support deep, interpretable biological reasoning by combining genomic sequences with natural language queries. It uses a DNA foundation model to extract rich, contextual embeddings from raw DNA inputs and integrates these with tokenized textual queries to form a unified input for an LLM, specifically Qwen3. The system is trained to generate step-by-step explanations of biological processes. DNA embeddings are projected into the LLM’s space through a learnable layer, and the combined input is enriched with positional encoding. Reinforcement learning via Group Relative Policy Optimization further refines its reasoning capabilities.

The researchers evaluated BIOREASON on three datasets focused on DNA variant interpretation and biological reasoning. It outperformed both DNA-only and LLM-only models in predicting disease outcomes from genomic variants. The best-performing version, which combined Evo2 and Qwen3-4B, achieved high accuracy and F1-scores across all tasks. A notable case study involved a PFN1 mutation linked to ALS, where BIOREASON accurately predicted the disease and generated a 10-step explanation tracing the variant’s impact on actin dynamics and motor neuron degeneration. This shows its strength not just in accurate predictions but also in providing transparent, biologically grounded reasoning paths.
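The fusion step described above, projecting DNA-encoder embeddings into the LLM’s embedding space and prepending them to the text tokens, can be sketched in a few lines. All dimensions and module names below are assumptions for illustration; they do not mirror BioReason’s actual code.

```python
import torch
import torch.nn as nn

d_dna, d_llm = 512, 1024
project = nn.Linear(d_dna, d_llm)      # learnable projection layer

dna_emb = torch.randn(1, 128, d_dna)   # from a DNA foundation model
text_emb = torch.randn(1, 64, d_llm)   # LLM embeddings of the query

# Concatenate projected DNA embeddings with text embeddings; positional
# encoding would be added downstream, as the article notes.
fused = torch.cat([project(dna_emb), text_emb], dim=1)  # (1, 192, d_llm)
```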
In conclusion, BIOREASON combines DNA encoders with large language models to enable detailed, interpretable reasoning over genomic data. Unlike traditional models, it not only makes accurate predictions but also explains the biological logic behind them in step-by-step outputs, helping scientists better understand disease mechanisms and generate new research questions. BIOREASON still faces challenges, such as high computational costs and limited uncertainty quantification. Future work aims to address these issues by improving scalability, incorporating additional biological data such as RNA and proteins, and applying the system to broader tasks, including GWAS. Overall, BIOREASON shows promise for advancing precision medicine and genomic research.

Check out the Paper, GitHub Page and Project Page. All credit for this research goes to the researchers of this project. The post Meet BioReason: The World’s First Reasoning Model in Biology that Enables AI to Reason about Genomics like a Biology Expert appeared first on MarkTechPost.

Meet BioReason: The World’s First Reasoning Model in Biology that Enables AI to Reason about Genomics like a Biology Expert Read Post »

AI, Committee, News, Uncategorized

Google Introduces Open-Source Full-Stack AI Agent Stack Using Gemini 2.5 and LangGraph for Multi-Step Web Search, Reflection, and Synthesis

Introduction: The Need for Dynamic AI Research Assistants

Conversational AI has rapidly evolved beyond basic chatbot frameworks, yet most large language models (LLMs) still suffer from a critical limitation: they generate responses based only on static training data, lacking the ability to identify their own knowledge gaps or perform real-time information synthesis. As a result, these models often deliver incomplete or outdated answers, particularly on evolving or niche topics. To overcome these issues, AI agents must go beyond passive querying. They need to recognize informational gaps, perform autonomous web searches, validate results, and refine responses, effectively mimicking a human research assistant.

Google’s Full-Stack Research Agent: Gemini 2.5 + LangGraph

Google, in collaboration with contributors from Hugging Face and other open-source communities, has developed a full-stack research-agent stack designed to solve this problem. Built with a React frontend and a FastAPI + LangGraph backend, the system combines language generation with intelligent control flow and dynamic web search. The agent uses the Gemini 2.5 API to process user queries and generate structured search terms, then performs recursive search-and-reflection cycles with the Google Search API, verifying whether each result sufficiently answers the original query. This iterative process continues until the agent produces a validated, well-cited response.

Architecture Overview: Developer-Friendly and Extensible

- Frontend: Built with Vite + React, offering hot reloading and clean module separation.
- Backend: Powered by Python (3.8+), FastAPI, and LangGraph, enabling decision control, evaluation loops, and autonomous query refinement.
- Key directories: The agent logic resides in backend/src/agent/graph.py, while UI components are structured under frontend/.
- Local setup: Requires Node.js, Python, and a Gemini API key. Run with make dev, or launch the frontend and backend separately.
- Endpoints: Backend API at http://127.0.0.1:2024; frontend UI at http://localhost:5173.

This separation of concerns lets developers easily modify the agent’s behavior or UI presentation, making the project suitable for research teams and individual developers alike.

Technical Highlights and Performance

- Reflective looping: The LangGraph agent evaluates search results, identifies coverage gaps, and autonomously refines queries without human intervention (see the sketch after this section).
- Delayed response synthesis: The agent waits until it has gathered sufficient information before generating an answer.
- Source citations: Answers include embedded hyperlinks to the original sources, improving trust and traceability.
- Use cases: Academic research, enterprise knowledge bases, technical support bots, and consulting tools where accuracy and validation matter.

Why It Matters: A Step Towards Autonomous Web Research

This system illustrates how autonomous reasoning and search synthesis can be integrated directly into LLM workflows. The agent doesn’t just respond; it investigates, verifies, and adapts. This reflects a broader shift in AI development, from stateless Q&A bots to real-time reasoning agents. Because it builds on widely accessible tools such as FastAPI, React, and the Gemini APIs, the project lets developers, researchers, and enterprises anywhere deploy AI research assistants with minimal setup.
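The reflect-and-search loop described above maps naturally onto LangGraph’s StateGraph API. The sketch below is a hedged, self-contained illustration: the node logic, state fields, and stopping rule are placeholders, while the real agent lives in backend/src/agent/graph.py of the project.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    question: str
    findings: list[str]
    rounds: int

def search(state: State) -> dict:
    # Placeholder for Gemini-generated queries + a Google Search API call.
    return {"findings": state["findings"] + [f"result {state['rounds']}"],
            "rounds": state["rounds"] + 1}

def decide(state: State) -> str:
    # Placeholder reflection: stop after 3 rounds instead of asking the
    # model whether the findings answer the question.
    return "answer" if state["rounds"] >= 3 else "search"

def answer(state: State) -> dict:
    return {"findings": state["findings"] + ["final cited answer"]}

g = StateGraph(State)
g.add_node("search", search)
g.add_node("answer", answer)
g.set_entry_point("search")
g.add_conditional_edges("search", decide, {"search": "search", "answer": "answer"})
g.add_edge("answer", END)

app = g.compile()
print(app.invoke({"question": "q", "findings": [], "rounds": 0}))
```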
Key Takeaways

- Agent design: A modular React + LangGraph system supports autonomous query generation and reflection.
- Iterative reasoning: The agent refines search queries until its confidence criteria are met.
- Built-in citations: Outputs include direct links to web sources for transparency.
- Developer-ready: Local setup requires Node.js, Python 3.8+, and a Gemini API key.
- Open source: Publicly available for community contribution and extension.

Conclusion

By combining Google’s Gemini 2.5 with LangGraph’s logic orchestration, this project delivers a notable advance in autonomous AI reasoning. It shows how research workflows can be automated without compromising accuracy or traceability. As conversational agents evolve, systems like this one set a standard for intelligent, trustworthy, and developer-friendly AI research tools.

Check out the GitHub Page. All credit for this research goes to the researchers of this project. The post Google Introduces Open-Source Full-Stack AI Agent Stack Using Gemini 2.5 and LangGraph for Multi-Step Web Search, Reflection, and Synthesis appeared first on MarkTechPost.

Google Introduces Open-Source Full-Stack AI Agent Stack Using Gemini 2.5 and LangGraph for Multi-Step Web Search, Reflection, and Synthesis Read Post »

AI, Committee, News, Uncategorized

High-Entropy Token Selection in Reinforcement Learning with Verifiable Rewards (RLVR) Improves Accuracy and Reduces Training Cost for LLMs

Large Language Models (LLMs) generate step-by-step responses known as Chain-of-Thoughts (CoTs), where each token contributes to a coherent and logical narrative. To improve the quality of reasoning, various reinforcement learning techniques have been employed that let the model learn from feedback by aligning generated outputs with correctness criteria. As LLMs grow in complexity and capacity, researchers have begun probing the internal structure of token generation for patterns that enhance or limit performance. One area gaining attention is the token entropy distribution, a measure of uncertainty in token prediction, which is now being linked to the model’s ability to make meaningful logical decisions during reasoning.

A core issue in training reasoning models with reinforcement learning is that all output tokens are treated equally. When models are optimized using reinforcement learning with verifiable rewards (RLVR), the update process traditionally includes every token in the generated sequence, regardless of its functional role. This uniform treatment fails to distinguish tokens that drive significant reasoning shifts from those that merely extend existing linguistic structures. As a result, a large portion of training resources may be directed at tokens that contribute little to the model’s reasoning capabilities. Without prioritizing the few tokens that play decisive roles in navigating different logic paths, these methods miss opportunities for focused and effective optimization.

Most RLVR frameworks, including Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Dynamic sAmpling Policy Optimization (DAPO), evaluate entire sequences of token outputs against reward functions that assess correctness. PPO stabilizes policy updates through a clipped objective function. GRPO improves upon this by estimating advantage values from grouped responses rather than a separate value network. DAPO adds further enhancements, such as the clip-higher mechanism and overlong reward shaping. None of these methods, however, factor in token-level entropy or distinguish the importance of individual tokens in the reasoning chain; they apply uniform gradient updates across the board.

To refine how RLVR training shapes LLM reasoning, researchers from Alibaba Inc. and Tsinghua University presented a new methodology focused on token entropy patterns. They observed that in CoT sequences generated by Qwen3 models, a small subset of tokens, roughly 20%, display significantly higher entropy. These tokens, labeled “forking tokens,” often correspond to moments where the model must decide between multiple reasoning paths, while the remaining 80% typically exhibit low entropy and act as extensions of prior statements. By limiting policy gradient updates solely to the high-entropy tokens, the research team was able not only to maintain but, in many cases, improve performance on challenging reasoning benchmarks.

To quantify token entropy, the researchers computed, at each generation step, the entropy of the model’s predictive distribution over possible token choices, H_t = -sum_j p_{t,j} log p_{t,j}. They found that over half of all generated tokens had entropy values below 0.01, indicating near-deterministic behavior, while only 20% exceeded an entropy of 0.672, marking those tokens as the decision-making hubs within CoTs.
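The selection rule described above can be sketched in a few lines of PyTorch: compute per-token entropy from the policy’s logits and keep gradients only for the top 20% highest-entropy positions. Tensor shapes and the masked loss below are illustrative assumptions; the paper’s actual RLVR pipeline is more involved.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 10, 50, requires_grad=True)   # (batch, seq, vocab)
advantages = torch.randn(2, 10)                        # per-token advantages

probs = F.softmax(logits, dim=-1)
entropy = -(probs * torch.log(probs + 1e-9)).sum(-1)   # H_t per position

# Forking-token mask: top 20% entropy positions within each sequence.
k = max(1, int(0.2 * entropy.shape[1]))
thresh = entropy.topk(k, dim=1).values[:, -1:]          # per-sequence cutoff
mask = (entropy >= thresh).float()

# Policy-gradient-style loss restricted to high-entropy tokens.
# (The argmax log-prob stands in for the sampled token's log-prob.)
logp = torch.log(probs + 1e-9).max(-1).values
loss = -(mask * advantages * logp).sum() / mask.sum()
loss.backward()
```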
High-entropy tokens often include logical operators and connective words such as “assume,” “since,” or “thus,” which introduce new conditions or logical transitions, whereas low-entropy tokens include predictable symbols, suffixes, or code fragments. Controlled experiments made clear that manipulating the entropy of the forking tokens directly influenced reasoning performance, while altering low-entropy tokens had little effect.

The research team conducted extensive experiments across three model sizes: Qwen3-8B, Qwen3-14B, and Qwen3-32B. When trained on only the top 20% high-entropy tokens, Qwen3-32B achieved 63.5 on AIME’24 and 56.7 on AIME’25, both new performance benchmarks for models under 600B parameters, and increasing the maximum response length from 20k to 29k raised the AIME’24 score to 68.1. In comparison, training on the bottom 80% low-entropy tokens caused performance to drop significantly. Qwen3-14B showed gains of +4.79 on AIME’25 and +5.21 on AIME’24, while Qwen3-8B maintained results competitive with full-token training. An ablation study confirmed the importance of the 20% threshold: decreasing the fraction to 10% omitted essential decision points, while increasing it to 50% or 100% diluted the effect by including too many low-entropy tokens, reducing entropy diversity and hindering exploration.

In essence, the research offers a new direction for enhancing the reasoning abilities of language models by identifying and selectively training on the minority of tokens that disproportionately contribute to reasoning success. It avoids inefficient training and instead proposes a scalable approach that aligns reinforcement learning objectives with the actual decision-making moments in token sequences, using entropy as the guide that separates useful tokens from filler.

Key takeaways from the research include:

- Around 20% of tokens exhibit high entropy and serve as forking points that direct reasoning paths.
- Training only on these high-entropy tokens delivers performance equal to or better than training on the full token set.
- Qwen3-32B achieved scores of 63.5 on AIME’24 and 56.7 on AIME’25, outperforming larger models trained traditionally.
- Extending response length from 20k to 29k further pushed the AIME’24 score to 68.1.
- Training on the remaining 80% of low-entropy tokens led to sharp performance degradation.
- Retaining the 20% threshold for high-entropy tokens optimally balances exploration and performance.
- Larger models gain more from this strategy due to their capacity to benefit from enhanced exploration.
- The strategy scales well and could guide more efficient training of next-generation reasoning models.

In conclusion, this research rethinks how reinforcement learning is applied to language models by introducing a focus on token-level entropy. By optimizing only the minority of tokens that influence reasoning paths, the method enhances performance while reducing computational overhead, providing a practical roadmap for improving reasoning in LLMs without unnecessary complexity.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post High-Entropy Token Selection in Reinforcement Learning with Verifiable Rewards (RLVR) Improves Accuracy and Reduces Training Cost for LLMs appeared first on MarkTechPost.

High-Entropy Token Selection in Reinforcement Learning with Verifiable Rewards (RLVR) Improves Accuracy and Reduces Training Cost for LLMs Read Post »
