MSCoRe: A Benchmark for Multi-Stage Collaborative Reasoning in LLM Agents

arXiv:2509.17628v1 Announce Type: new Abstract: Large Language Models (LLMs) have excelled in question-answering (QA) tasks within single domains. However, their reasoning and coordination capabilities in complex, multi-stage scenarios remain underexplored. Existing benchmarks typically focus on isolated tasks or narrow domains, overlooking models’ abilities for multi-stage collaboration and optimization without explicit external guidance. To bridge this gap, we propose MSCoRe, a novel benchmark comprising 126,696 domain-specific QA instances spanning scenarios in automotive, pharmaceutical, electronics, and energy sectors. The dataset is created using a structured three-phase pipeline: dynamic sampling, iterative question-answer generation, and a multi-level quality assessment to ensure data quality. Tasks are further categorized into three difficulty levels according to stage coverage and complexity. With MSCoRe, we have conducted a comprehensive evaluation of various state-of-the-art LLM agents. The commercial models performed best across all tasks and scenarios, but a notable gap in ROUGE scores remains between simple and complex tasks. We also tested the models’ robustness and found that their performance is negatively affected by noisy data. MSCoRe provides a valuable new resource for the community to evaluate and improve multi-stage reasoning in LLM agents. The code and data are available at https://github.com/D3E0-source/MSCoRE.
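
Since MSCoRe reports ROUGE as its headline metric, the snippet below illustrates how a generated answer might be scored against a reference with the widely used rouge-score package. This is a minimal sketch, not MSCoRe's official evaluation harness, and the example strings are invented rather than drawn from the dataset.

```python
# Minimal ROUGE scoring sketch using the `rouge-score` package;
# the reference/prediction strings are hypothetical, not from MSCoRe.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

reference = "Replace the faulty pressure sensor, then recalibrate the assembly line."  # hypothetical gold answer
prediction = "Swap out the faulty pressure sensor and recalibrate the line."          # hypothetical model output

for name, s in scorer.score(reference, prediction).items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
```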

From Roots to Rewards: Dynamic Tree Reasoning with Reinforcement Learning

arXiv:2507.13142v3 Announce Type: replace-cross Abstract: Modern language models address complex questions through chain-of-thought (CoT) reasoning (Wei et al., 2023) and retrieval augmentation (Lewis et al., 2021), yet struggle with error propagation and knowledge integration. Tree-structured reasoning methods, particularly the Probabilistic Tree-of-Thought (ProbTree) (Cao et al., 2023) framework, mitigate these issues by decomposing questions into hierarchical structures and selecting answers through confidence-weighted aggregation of parametric and retrieved knowledge (Yao et al., 2023). However, ProbTree’s static implementation introduces two key limitations: (1) the reasoning tree is fixed during the initial construction phase, preventing dynamic adaptation to intermediate results, and (2) each node requires exhaustive evaluation of all possible solution strategies, creating computational inefficiency. We present a dynamic reinforcement learning (Sutton and Barto, 2018) framework that transforms tree-based reasoning into an adaptive process. Our approach incrementally constructs the reasoning tree based on real-time confidence estimates, while learning optimal policies for action selection (decomposition, retrieval, or aggregation). This maintains ProbTree’s probabilistic rigor while improving both solution quality and computational efficiency through selective expansion and focused resource allocation. The work establishes a new paradigm for tree-structured reasoning that balances the reliability of probabilistic frameworks with the flexibility required for real-world question answering systems. Code available at: https://github.com/ahmedehabb/From-Roots-to-Rewards-Dynamic-Tree-Reasoning-with-RL
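
To make the abstract's core move concrete, here is a minimal sketch of a linear softmax policy that picks one action per tree node (decompose, retrieve, or aggregate) and is updated with REINFORCE. The node features, reward, and learning rate are hypothetical stand-ins, not the paper's implementation.

```python
# Sketch of per-node action selection with a linear softmax policy + REINFORCE;
# features, reward, and hyperparameters are illustrative assumptions.
import numpy as np

ACTIONS = ["decompose", "retrieve", "aggregate"]

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def select_action(features, weights, rng):
    """Sample one of the three per-node actions from a linear softmax policy."""
    probs = softmax(weights @ features)
    return int(rng.choice(len(ACTIONS), p=probs))

def reinforce_update(weights, features, action, reward, lr=0.1):
    """REINFORCE step: d log pi(a|x) / dW_k = (1[k == a] - p_k) * x."""
    probs = softmax(weights @ features)
    grad = -np.outer(probs, features)
    grad[action] += features
    return weights + lr * reward * grad

rng = np.random.default_rng(0)
weights = np.zeros((len(ACTIONS), 2))   # 3 actions x 2 node features
features = np.array([0.35, 2.0])        # hypothetical features: [confidence, depth]
a = select_action(features, weights, rng)
weights = reinforce_update(weights, features, a, reward=1.0)  # +1 if the subtree answered correctly
print(ACTIONS[a], weights)
```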

ChartHal: A Fine-grained Framework Evaluating Hallucination of Large Vision Language Models in Chart Understanding

arXiv:2509.17481v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) have recently demonstrated remarkable progress, yet hallucination remains a critical barrier, particularly in chart understanding, which requires sophisticated perceptual and cognitive abilities as well as rigorous factual accuracy. While prior work has investigated hallucinations and chart comprehension independently, their intersection remains largely unexplored. To address this gap, we present ChartHal, a benchmark that features a fine-grained taxonomy of hallucination scenarios in chart understanding, along with a human-validated dataset of 1,062 samples. Our evaluation shows that state-of-the-art LVLMs suffer from severe hallucinations on ChartHal, including proprietary models such as GPT-5 and o4-mini, which achieve only 34.46% and 22.79% accuracy, respectively. Further analysis reveals that questions involving information absent from or contradictory to charts are especially likely to trigger hallucinations, underscoring the urgent need for more robust mitigation strategies. Code and data are available at https://github.com/ymcui/ChartHal.
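
A taxonomy-based benchmark like this implies per-category scoring. The sketch below shows one plausible way to break accuracy down by hallucination category; the record schema and category names are illustrative guesses, not ChartHal's actual format.

```python
# Illustrative per-category accuracy breakdown; fields and categories
# are hypothetical, not ChartHal's real schema.
from collections import defaultdict

results = [
    {"category": "absent-from-chart", "correct": False},
    {"category": "contradicts-chart", "correct": False},
    {"category": "consistent-with-chart", "correct": True},
]

totals, hits = defaultdict(int), defaultdict(int)
for r in results:
    totals[r["category"]] += 1
    hits[r["category"]] += int(r["correct"])

for cat, n in totals.items():
    print(f"{cat}: {hits[cat] / n:.2%} ({hits[cat]}/{n})")
```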

GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning

arXiv:2509.17437v1 Announce Type: new Abstract: Recent advancements in reinforcement learning (RL) have enhanced the reasoning abilities of large language models (LLMs), yet the impact on multimodal LLMs (MLLMs) is limited. Particularly in vision-intensive tasks like geometric reasoning, MLLMs hallucinate frequently, leading to inaccurate reasoning. We attribute this to the perceptual bottleneck in MLLMs, which caps the benefits of reasoning training. To quantify this, we design a Geo-Perception Question-Answering (GeoPQA) benchmark, targeting basic geometric concepts and spatial relationships. Experiments on GeoPQA reveal significant shortcomings of MLLMs in visual perception, which constrain RL reward signals for effective training. To address this bottleneck, we propose a two-stage RL training framework by first enhancing the visual perception of geometric structures, then fostering reasoning capabilities. Applied to Qwen2.5-VL-3B-Instruct, our two-stage training improves geometric reasoning by 9.7% and geometric problem solving by 9.1%, compared to the direct reasoning training approach. Our method also generalizes to other vision-intensive domains like figure understanding, highlighting the importance of perceptual grounding in effective MLLM reasoning.
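
As a rough illustration of the two-stage idea, the sketch below routes RL steps to a perception reward first and a reasoning reward afterwards. The exact-match reward definitions and the stage boundary are hypothetical, not the paper's training recipe.

```python
# Schematic two-stage reward schedule, assuming exact-match rewards;
# the paper's actual rewards and stage lengths may differ.
def perception_reward(pred: str, gold: str) -> float:
    """Stage 1: reward correctly naming geometric primitives and relations."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def reasoning_reward(pred: str, gold: str) -> float:
    """Stage 2: reward the final answer of the full reasoning chain."""
    return 1.0 if pred.strip() == gold.strip() else 0.0

def reward_for_step(step: int, stage1_steps: int = 2000):
    """Route each RL update to the stage-appropriate reward function."""
    return perception_reward if step < stage1_steps else reasoning_reward

print(reward_for_step(100)("two parallel lines", "two parallel lines"))  # stage 1 -> 1.0
print(reward_for_step(5000)("x = 42", "x = 41"))                         # stage 2 -> 0.0
```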

Audio Contrastive-based Fine-tuning: Decoupling Representation Learning and Classification

arXiv:2309.11895v4 Announce Type: replace-cross Abstract: Standard fine-tuning of pre-trained audio models couples representation learning with classifier training, which can obscure the true quality of the learned representations. In this work, we advocate for a disentangled two-stage framework that separates representation refinement from downstream evaluation. First, we employ a “contrastive-tuning” stage to explicitly improve the geometric structure of the model’s embedding space. Subsequently, we introduce a dual-probe evaluation protocol to assess the quality of these refined representations from a geometric perspective. This protocol uses a linear probe to measure global linear separability and a k-Nearest Neighbours probe to investigate the local structure of class clusters. Our experiments on a diverse set of audio classification tasks show that our framework provides a better foundation for classification, leading to improved accuracy. Our newly proposed dual-probing framework acts as a powerful analytical lens, demonstrating why contrastive learning is more effective by revealing a superior embedding space. It significantly outperforms vanilla fine-tuning, particularly on single-label datasets with a large number of classes, and also surpasses strong baselines on multi-label tasks using a Jaccard-weighted loss. Our findings demonstrate that decoupling representation refinement from classifier training is a broadly effective strategy for unlocking the full potential of pre-trained audio models. Our code will be publicly available.
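
The dual-probe protocol is easy to picture in code: fit a linear probe and a k-NN probe on the same frozen embeddings and compare their accuracies. The sketch below uses random vectors in place of real audio embeddings, so it shows the mechanics only, not the paper's pipeline.

```python
# Mechanics-only sketch: random vectors stand in for frozen audio embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 128))      # stand-in for pooled audio embeddings
y = rng.integers(0, 10, size=600)    # stand-in class labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

linear_probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # global linear separability
knn_probe = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)   # local cluster structure

print("linear probe accuracy:", linear_probe.score(X_te, y_te))
print("k-NN probe accuracy:  ", knn_probe.score(X_te, y_te))
```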

A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages

arXiv:2506.12158v3 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly used to generate synthetic textual data for training smaller specialized models. However, a comparison of various generation strategies for low-resource language settings is lacking. While various prompting strategies have been proposed, such as demonstrations, label-based summaries, and self-revision, their comparative effectiveness remains unclear, especially for low-resource languages. In this paper, we systematically evaluate the performance of these generation strategies and their combinations across 11 typologically diverse languages, including several extremely low-resource ones. Using three NLP tasks and four open-source LLMs, we assess downstream model performance on generated versus gold-standard data. Our results show that strategic combinations of generation methods, particularly target-language demonstrations with LLM-based revisions, yield strong performance, narrowing the gap with real data to as little as 5% in some settings. We also find that smart prompting techniques can reduce the advantage of larger LLMs, highlighting efficient generation strategies for synthetic data generation in low-resource scenarios with smaller models.
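
The winning combination, target-language demonstrations plus an LLM revision pass, can be sketched as two chained prompts. The prompt wording below is invented for illustration, and `generate` stands in for any text-generation call; none of this is the paper's template.

```python
# Invented prompt templates for illustration; `generate` is any LLM text call.
def build_generation_prompt(task: str, demos: list[str], language: str) -> str:
    demo_block = "\n".join(f"- {d}" for d in demos)
    return (
        f"Write one new {language} example for the task: {task}.\n"
        f"Match the style of these {language} demonstrations:\n{demo_block}"
    )

def build_revision_prompt(draft: str, language: str) -> str:
    return (
        f"Revise the following {language} text for fluency and label consistency, "
        f"keeping its meaning unchanged:\n{draft}"
    )

def generate_with_revision(generate, task, demos, language):
    """Demonstration-conditioned generation followed by an LLM revision pass."""
    draft = generate(build_generation_prompt(task, demos, language))
    return generate(build_revision_prompt(draft, language))
```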

MIT Researchers Enhanced Artificial Intelligence (AI) 64x Better at Planning, Achieving 94% Accuracy

Can an 8B-parameter language model produce provably valid multi-step plans instead of plausible guesses? MIT CSAIL researchers introduce PDDL-INSTRUCT, an instruction-tuning framework that couples logical chain-of-thought with external plan validation (VAL) to lift the symbolic planning performance of LLMs. On PlanBench, a tuned Llama-3-8B reaches 94% valid plans on Blocksworld, with large jumps on Mystery Blocksworld and Logistics; across domains the authors report up to a 66% absolute improvement over baselines. Paper: https://arxiv.org/pdf/2509.13351

What's new? The research team tackles a well-known failure mode: LLMs often generate plausible-sounding but logically invalid multi-step plans. PDDL-INSTRUCT couples explicit state/action semantics with ground-truth checking:

- Error education: models are trained to explain why candidate plans fail (unsatisfied preconditions, wrong effects, frame violations, or an unreached goal).
- Logical chain-of-thought (CoT): prompts require step-by-step inference over preconditions and add/delete effects, yielding state→action→state traces ⟨sᵢ, aᵢ₊₁, sᵢ₊₁⟩.
- External verification (VAL): every step is validated with the classic VAL plan validator (see the sketch after this post); feedback can be binary (valid/invalid) or detailed (which precondition or effect failed). Detailed feedback yielded the strongest gains.
- Two-stage optimization: stage 1 optimizes the reasoning chains (penalizing state-transition errors); stage 2 optimizes end-task planning accuracy.

How good is it? Evaluation follows PlanBench: Blocksworld, Mystery Blocksworld (predicate names obfuscated to break pattern matching), and Logistics, established stress tests where generic LLMs historically underperform on plan generation. The authors highlight that Mystery Blocksworld is particularly challenging; prior studies often report under 5% validity without tool support.

- Blocksworld: up to 94% valid plans with Llama-3-8B under PDDL-INSTRUCT.
- Mystery Blocksworld: dramatic improvement versus a near-zero baseline (reported as orders of magnitude, e.g., 64× in the paper's summary figures and tables).
- Logistics: substantial increases in valid plans.

Across domains, detailed validator feedback outperforms binary signals, and longer feedback budgets help further.

Summary: PDDL-INSTRUCT shows that coupling logical chain-of-thought with external plan validation can materially improve LLM planning. Its current scope is classical PDDL domains (Blocksworld, Mystery Blocksworld, Logistics), and it relies on VAL as an external oracle. The reported gains demonstrate a viable path for neuro-symbolic training in which reasoning steps are grounded in formal semantics and checked automatically, with immediate utility for agent pipelines that can tolerate a verifier in the loop; longer-horizon, temporal/numeric, and cost-sensitive planning remain open extensions.

The post MIT Researchers Enhanced Artificial Intelligence (AI) 64x Better at Planning, Achieving 94% Accuracy appeared first on MarkTechPost.
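
The validator-in-the-loop pattern at the heart of PDDL-INSTRUCT can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the `Validate` binary name and the "Plan valid" success string follow common VAL builds and may need adjusting for your install, and `propose_plan` is a placeholder for whatever LLM call produces a candidate plan.

```python
# Illustrative verify-and-refine loop around the VAL plan validator;
# binary name, success string, and the LLM call are assumptions.
import subprocess
import tempfile

def run_val(domain: str, problem: str, plan_text: str) -> tuple[bool, str]:
    """Check a candidate plan with VAL; returns (is_valid, validator report)."""
    with tempfile.NamedTemporaryFile("w", suffix=".plan", delete=False) as f:
        f.write(plan_text)
        plan_path = f.name
    proc = subprocess.run(["Validate", "-v", domain, problem, plan_path],
                          capture_output=True, text=True)
    report = proc.stdout + proc.stderr
    return proc.returncode == 0 and "Plan valid" in report, report

def plan_with_feedback(propose_plan, domain, problem, max_rounds=5):
    """propose_plan(domain, problem, feedback) is a placeholder LLM call."""
    feedback = ""
    for _ in range(max_rounds):
        plan = propose_plan(domain, problem, feedback)
        ok, report = run_val(domain, problem, plan)
        if ok:
            return plan
        feedback = report  # detailed feedback beat binary signals in the paper
    return None
```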

Emergent Abilities of Large Language Models under Continued Pretraining for Language Adaptation

arXiv:2506.00288v3 Announce Type: replace Abstract: Continued pretraining (CPT) is a popular approach to adapt existing large language models (LLMs) to new languages. When doing so, it is common practice to include a portion of English data in the mixture, but its role has not been carefully studied to date. In this work, we show that including English does not impact validation perplexity, yet it is critical for the emergence of downstream capabilities in the target language. We introduce a language-agnostic benchmark for in-context learning (ICL), which reveals catastrophic forgetting early in CPT when English is not included. This in turn damages the ability of the model to generalize to downstream prompts in the target language as measured by perplexity, even if it does not manifest in terms of accuracy until later in training, and can be tied to a large shift in the model parameters. Based on these insights, we introduce curriculum learning and an exponential moving average (EMA) of weights as effective alternatives to mitigate the need for English. All in all, our work sheds light on the dynamics by which emergent abilities arise when doing CPT for language adaptation, and can serve as a foundation for designing more effective methods in the future.
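
One of the proposed mitigations, an exponential moving average of weights, is a few lines in PyTorch. The sketch below is a generic EMA update under an illustrative decay value, not the paper's exact setting.

```python
# Generic EMA-of-weights sketch; the decay value is an illustrative assumption.
import copy
import torch

@torch.no_grad()
def ema_update(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.999):
    """ema <- decay * ema + (1 - decay) * current weights."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

model = torch.nn.Linear(8, 8)
ema_model = copy.deepcopy(model)   # frozen shadow copy, updated every step
# ... inside the CPT loop, after each optimizer.step():
ema_update(ema_model, model)
```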

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

arXiv:2509.16197v1 Announce Type: cross Abstract: Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.
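
The hybrid-tokenizer layout is easiest to see as a module in code: one shared encoder, a continuous adapter for understanding, and a discrete adapter whose outputs are snapped to a codebook for generation. The PyTorch sketch below is a schematic reading of the abstract, with made-up dimensions, not Manzano's implementation.

```python
# Schematic reading of the hybrid tokenizer (made-up sizes), not Apple's code.
import torch
import torch.nn as nn

class HybridVisionTokenizer(nn.Module):
    def __init__(self, enc_dim=768, llm_dim=1024, codebook_size=8192):
        super().__init__()
        layer = nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # stand-in shared vision encoder
        self.cont_adapter = nn.Linear(enc_dim, llm_dim)  # continuous path: image-to-text understanding
        self.disc_adapter = nn.Linear(enc_dim, llm_dim)  # discrete path: text-to-image generation
        self.codebook = nn.Embedding(codebook_size, llm_dim)

    def forward(self, patches: torch.Tensor):
        h = self.encoder(patches)                          # (B, N, enc_dim), shared features
        continuous = self.cont_adapter(h)                  # embeddings fed to the LLM directly
        z = self.disc_adapter(h)                           # project, then snap to nearest code
        codes = self.codebook.weight.expand(z.size(0), -1, -1)
        token_ids = torch.cdist(z, codes).argmin(dim=-1)   # (B, N) discrete image tokens
        return continuous, token_ids

tok = HybridVisionTokenizer()
cont, ids = tok(torch.randn(2, 16, 768))   # 2 images, 16 patch embeddings each
print(cont.shape, ids.shape)               # torch.Size([2, 16, 1024]) torch.Size([2, 16])
```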

Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations

arXiv:2509.15655v1 Announce Type: new Abstract: Transformer-based speech language models (SLMs) have significantly improved neural speech recognition and understanding. While existing research has examined how well SLMs encode shallow acoustic and phonetic features, the extent to which SLMs encode nuanced syntactic and conceptual features remains unclear. By drawing parallels with linguistic competence assessments for large language models, this study is the first to systematically evaluate the presence of contextual syntactic and semantic features across SLMs for self-supervised learning (S3M), automatic speech recognition (ASR), speech compression (codec), and as the encoder for auditory large language models (AudioLLMs). Through minimal pair designs and diagnostic feature analysis across 71 tasks spanning diverse linguistic levels, our layer-wise and time-resolved analysis uncovers that all speech models encode grammatical features more robustly than conceptual ones.
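
Layer-wise minimal-pair probing can be pictured as fitting one small probe per layer on paired grammatical/ungrammatical representations. The sketch below uses random arrays in place of real speech-model hidden states, so the printed accuracies will hover around chance; it illustrates the procedure only, not the paper's setup.

```python
# Random arrays stand in for per-layer speech-model hidden states;
# on this synthetic data the probe accuracies will sit near chance (0.5).
import numpy as np
from sklearn.linear_model import LogisticRegression

n_layers, n_pairs, dim = 12, 200, 256
rng = np.random.default_rng(0)
# reps[layer, pair, 0] = grammatical member, reps[layer, pair, 1] = ungrammatical
reps = rng.normal(size=(n_layers, n_pairs, 2, dim))

for layer in range(n_layers):
    X = reps[layer].reshape(-1, dim)   # rows: pair0-gram, pair0-ungram, pair1-gram, ...
    y = np.tile([1, 0], n_pairs)       # 1 = grammatical
    split = n_pairs                    # first 100 pairs train, last 100 test
    probe = LogisticRegression(max_iter=500).fit(X[:split], y[:split])
    print(f"layer {layer:2d}: minimal-pair probe accuracy = {probe.score(X[split:], y[split:]):.2f}")
```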