YouZum

Committee

AI, Committee, News, Uncategorized

ChartHal: A Fine-grained Framework Evaluating Hallucination of Large Vision Language Models in Chart Understanding

arXiv:2509.17481v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) have recently demonstrated remarkable progress, yet hallucination remains a critical barrier, particularly in chart understanding, which requires sophisticated perceptual and cognitive abilities as well as rigorous factual accuracy. While prior work has investigated hallucinations and chart comprehension independently, their intersection remains largely unexplored. To address this gap, we present ChartHal, a benchmark that features a fine-grained taxonomy of hallucination scenarios in chart understanding, along with a human-validated dataset of 1,062 samples. Our evaluation shows that state-of-the-art LVLMs suffer from severe hallucinations on ChartHal, including proprietary models such as GPT-5 and o4-mini, which achieve only 34.46% and 22.79% accuracy, respectively. Further analysis reveals that questions involving information absent from or contradictory to charts are especially likely to trigger hallucinations, underscoring the urgent need for more robust mitigation strategies. Code and data are available at https://github.com/ymcui/ChartHal.

GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning

arXiv:2509.17437v1 Announce Type: new Abstract: Recent advancements in reinforcement learning (RL) have enhanced the reasoning abilities of large language models (LLMs), yet the impact on multimodal LLMs (MLLMs) is limited. Particularly in vision-intensive tasks like geometric reasoning, MLLMs hallucinate frequently, leading to inaccurate reasoning. We attribute this to the perceptual bottleneck in MLLMs, which caps the benefits of reasoning training. To quantify this, we design a Geo-Perception Question-Answering (GeoPQA) benchmark, targeting basic geometric concepts and spatial relationships. Experiments on GeoPQA reveal significant shortcomings of MLLMs in visual perception, which constrain RL reward signals for effective training. To address this bottleneck, we propose a two-stage RL training framework by first enhancing the visual perception of geometric structures, then fostering reasoning capabilities. Applied to Qwen2.5-VL-3B-Instruct, our two-stage training improves geometric reasoning by 9.7% and geometric problem solving by 9.1%, compared to the direct reasoning training approach. Our method also generalizes to other vision-intensive domains like figure understanding, highlighting the importance of perceptual grounding in effective MLLM reasoning.

MSCoRe: A Benchmark for Multi-Stage Collaborative Reasoning in LLM Agents

arXiv:2509.17628v1 Announce Type: new Abstract: Large Language Models (LLMs) have excelled in question-answering (QA) tasks within single domains. However, their reasoning and coordination capabilities in complex, multi-stage scenarios remain underexplored. Existing benchmarks typically focus on isolated tasks or narrow domains, overlooking models’ abilities for multi-stage collaboration and optimization without explicit external guidance. To bridge this gap, we propose MSCoRe, a novel benchmark comprising 126,696 domain-specific QA instances spanning scenarios in automotive, pharmaceutical, electronics, and energy sectors. The dataset is created using a structured three-phase pipeline: dynamic sampling, iterative question-answer generation, and a multi-level quality assessment to ensure data quality. Tasks are further categorized into three difficulty levels according to stage coverage and complexity. With MSCoRe, we have conducted a comprehensive evaluation of various state-of-the-art LLM agents. The commercial models performed best across all tasks and scenarios, but a notable gap in ROUGE scores remains between simple and complex tasks. We also tested the models’ robustness and found that their performance is negatively affected by noisy data. MSCoRe provides a valuable new resource for the community to evaluate and improve multi-stage reasoning in LLM agents. The code and data are available at https://github.com/D3E0-source/MSCoRE.

From Roots to Rewards: Dynamic Tree Reasoning with Reinforcement Learning

arXiv:2507.13142v3 Announce Type: replace-cross Abstract: Modern language models address complex questions through chain-of-thought (CoT) reasoning (Wei et al., 2023) and retrieval augmentation (Lewis et al., 2021), yet struggle with error propagation and knowledge integration. Tree-structured reasoning methods, particularly the Probabilistic Tree-of-Thought (ProbTree) framework (Cao et al., 2023), mitigate these issues by decomposing questions into hierarchical structures and selecting answers through confidence-weighted aggregation of parametric and retrieved knowledge (Yao et al., 2023). However, ProbTree’s static implementation introduces two key limitations: (1) the reasoning tree is fixed during the initial construction phase, preventing dynamic adaptation to intermediate results, and (2) each node requires exhaustive evaluation of all possible solution strategies, creating computational inefficiency. We present a dynamic reinforcement learning (Sutton and Barto, 2018) framework that transforms tree-based reasoning into an adaptive process. Our approach incrementally constructs the reasoning tree based on real-time confidence estimates, while learning optimal policies for action selection (decomposition, retrieval, or aggregation). This maintains ProbTree’s probabilistic rigor while improving both solution quality and computational efficiency through selective expansion and focused resource allocation. The work establishes a new paradigm for tree-structured reasoning that balances the reliability of probabilistic frameworks with the flexibility required for real-world question answering systems. Code available at: https://github.com/ahmedehabb/From-Roots-to-Rewards-Dynamic-Tree-Reasoning-with-RL

Audio Contrastive-based Fine-tuning: Decoupling Representation Learning and Classification

arXiv:2309.11895v4 Announce Type: replace-cross Abstract: Standard fine-tuning of pre-trained audio models couples representation learning with classifier training, which can obscure the true quality of the learned representations. In this work, we advocate for a disentangled two-stage framework that separates representation refinement from downstream evaluation. First, we employ a “contrastive-tuning” stage to explicitly improve the geometric structure of the model’s embedding space. Subsequently, we introduce a dual-probe evaluation protocol to assess the quality of these refined representations from a geometric perspective. This protocol uses a linear probe to measure global linear separability and a k-Nearest Neighbours probe to investigate the local structure of class clusters. Our experiments on a diverse set of audio classification tasks show that our framework provides a better foundation for classification, leading to improved accuracy. Our newly proposed dual-probing framework acts as a powerful analytical lens, demonstrating why contrastive learning is more effective by revealing a superior embedding space. It significantly outperforms vanilla fine-tuning, particularly on single-label datasets with a large number of classes, and also surpasses strong baselines on multi-label tasks using a Jaccard-weighted loss. Our findings demonstrate that decoupling representation refinement from classifier training is a broadly effective strategy for unlocking the full potential of pre-trained audio models. Our code will be publicly available.
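
The k-Nearest Neighbours half of the dual-probe protocol described above can be illustrated with a small sketch. It is not the authors' code: the leave-one-out scheme, the Euclidean distance, the value of k, and the toy 2-D "embeddings" are all assumptions chosen to keep the example self-contained. The linear probe is not covered here.

```python
import math
from collections import Counter

def knn_probe_accuracy(embeddings, labels, k=3):
    """Leave-one-out k-NN accuracy: each point is classified by a
    majority vote over its k nearest *other* points (Euclidean distance)."""
    correct = 0
    for i, query in enumerate(embeddings):
        # Distances from the query to every other embedding, nearest first.
        neighbours = sorted(
            (math.dist(query, other), labels[j])
            for j, other in enumerate(embeddings)
            if j != i
        )[:k]
        predicted = Counter(label for _, label in neighbours).most_common(1)[0][0]
        correct += predicted == labels[i]
    return correct / len(embeddings)

# Hypothetical embeddings: two tight, well-separated class clusters.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
              (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = ["speech", "speech", "speech", "music", "music", "music"]

print(knn_probe_accuracy(embeddings, labels, k=3))  # 1.0
```

A high score on a probe like this suggests that classes form compact local clusters in the embedding space, which is the property the abstract says the k-NN probe is meant to examine.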

A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages

arXiv:2506.12158v3 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly used to generate synthetic textual data for training smaller specialized models. However, a comparison of various generation strategies for low-resource language settings is lacking. While various prompting strategies have been proposed, such as demonstrations, label-based summaries, and self-revision, their comparative effectiveness remains unclear, especially for low-resource languages. In this paper, we systematically evaluate the performance of these generation strategies and their combinations across 11 typologically diverse languages, including several extremely low-resource ones. Using three NLP tasks and four open-source LLMs, we assess downstream model performance on generated versus gold-standard data. Our results show that strategic combinations of generation methods, particularly target-language demonstrations with LLM-based revisions, yield strong performance, narrowing the gap with real data to as little as 5% in some settings. We also find that smart prompting techniques can reduce the advantage of larger LLMs, highlighting efficient generation strategies for synthetic data generation in low-resource scenarios with smaller models.

MIT Researchers Enhanced Artificial Intelligence (AI) 64x Better at Planning, Achieving 94% Accuracy

Can an 8B-parameter language model produce provably valid multi-step plans instead of plausible guesses? MIT CSAIL researchers introduce PDDL-INSTRUCT, an instruction-tuning framework that couples logical chain-of-thought with external plan validation (VAL) to lift the symbolic planning performance of LLMs. On PlanBench, a tuned Llama-3-8B reaches 94% valid plans on Blocksworld, with large jumps on Mystery Blocksworld and Logistics; across domains, the authors report up to a 66% absolute improvement over baselines.

Paper: https://arxiv.org/pdf/2509.13351

What’s new?

The research team tackles a well-known failure mode: LLMs often generate “plausible-sounding” but logically invalid multi-step plans. PDDL-INSTRUCT couples explicit state/action semantics with ground-truth checking:

  • Error education: Models are trained to explain why candidate plans fail (unsatisfied preconditions, wrong effects, frame violations, or goal not reached).
  • Logical chain-of-thought (CoT): Prompts require step-by-step inference over preconditions and add/del effects, yielding state→action→state traces ⟨sᵢ, aᵢ₊₁, sᵢ₊₁⟩.
  • External verification (VAL): Every step is validated with the classic VAL plan validator; feedback can be binary (valid/invalid) or detailed (which precondition/effect failed). Detailed feedback yielded the strongest gains.
  • Two-stage optimization: Stage 1 optimizes the reasoning chains (penalizing state-transition errors); Stage 2 optimizes end-task planning accuracy.

How good is it?

Benchmarks

Evaluation follows PlanBench (Blocksworld, Mystery Blocksworld, and Logistics), a set of established stress tests where generic LLMs have historically underperformed on plan generation. In Mystery Blocksworld, predicate names are obfuscated to break pattern-matching. The authors highlight that this domain is particularly challenging; prior studies often report <5% validity without tool support.

  • Blocksworld: up to 94% valid plans with Llama-3-8B under PDDL-INSTRUCT.
  • Mystery Blocksworld: large relative gains; the paper reports dramatic improvement versus a near-zero baseline (reported as orders of magnitude, e.g., 64× in their summary figures/tables).
  • Logistics: substantial increases in valid plans.

Across domains, the research team reports up to a 66% absolute improvement over untuned baselines. Detailed validator feedback outperforms binary signals, and longer feedback budgets further help.

Summary

PDDL-INSTRUCT shows that coupling logical chain-of-thought with external plan validation can materially improve LLM planning. Its current scope is limited to classical PDDL domains (Blocksworld, Mystery Blocksworld, Logistics), and it relies on VAL as an external oracle. The reported gains with Llama-3-8B, such as 94% valid plans on Blocksworld and large relative improvements on Mystery Blocksworld, demonstrate a viable path for neuro-symbolic training in which reasoning steps are grounded in formal semantics and checked automatically. This suggests immediate utility for agent pipelines that can tolerate a verifier in the loop, while longer-horizon, temporal/numeric, and cost-sensitive planning remain open extensions.

The post MIT Researchers Enhanced Artificial Intelligence (AI) 64x Better at Planning, Achieving 94% Accuracy appeared first on MarkTechPost.
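
To make the state→action→state traces and the "detailed feedback" idea more concrete, here is a minimal sketch of a STRIPS-style step checker. It is not VAL and not the PDDL-INSTRUCT code: the `Action` class, the function names, and the two Blocksworld actions are hypothetical, and the checker only reports unsatisfied preconditions and unreached goals (it computes effects itself, so it cannot flag wrong effects or frame violations in a model's own trace).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset

def apply_step(state, action):
    """Return (next_state, error). error is None when all preconditions hold."""
    missing = action.preconditions - state
    if missing:
        return state, f"{action.name}: unsatisfied preconditions {sorted(missing)}"
    # STRIPS semantics: remove delete effects, then add add effects.
    return (state - action.del_effects) | action.add_effects, None

def validate_plan(initial_state, plan, goal):
    """Walk the plan, collecting (s_i, a_{i+1}, s_{i+1}) triples.
    Stops at the first failing step and returns a detailed message."""
    state, trace = frozenset(initial_state), []
    for action in plan:
        next_state, error = apply_step(state, action)
        if error:
            return False, error, trace
        trace.append((state, action.name, next_state))
        state = next_state
    if not goal <= state:
        return False, f"goal not reached, missing {sorted(goal - state)}", trace
    return True, "valid", trace

# Hypothetical two-block Blocksworld instance.
initial = {"clear(A)", "clear(B)", "ontable(A)", "ontable(B)", "handempty"}
pickup_a = Action(
    "pickup(A)",
    frozenset({"clear(A)", "ontable(A)", "handempty"}),
    frozenset({"holding(A)"}),
    frozenset({"clear(A)", "ontable(A)", "handempty"}),
)
stack_a_b = Action(
    "stack(A,B)",
    frozenset({"holding(A)", "clear(B)"}),
    frozenset({"on(A,B)", "clear(A)", "handempty"}),
    frozenset({"holding(A)", "clear(B)"}),
)
goal = frozenset({"on(A,B)"})

print(validate_plan(initial, [pickup_a, stack_a_b], goal)[:2])
# (True, 'valid')
print(validate_plan(initial, [stack_a_b], goal)[:2])
# (False, "stack(A,B): unsatisfied preconditions ['holding(A)']")
```

The second call shows the difference between binary and detailed feedback: a binary signal would only say the plan is invalid, whereas the message here names the failing step and the missing precondition.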

Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations

arXiv:2509.15655v1 Announce Type: new Abstract: Transformer-based speech language models (SLMs) have significantly improved neural speech recognition and understanding. While existing research has examined how well SLMs encode shallow acoustic and phonetic features, the extent to which SLMs encode nuanced syntactic and conceptual features remains unclear. By drawing parallels with linguistic competence assessments for large language models, this study is the first to systematically evaluate the presence of contextual syntactic and semantic features across SLMs for self-supervised learning (S3M), automatic speech recognition (ASR), speech compression (codec), and as the encoder for auditory large language models (AudioLLMs). Through minimal pair designs and diagnostic feature analysis across 71 tasks spanning diverse linguistic levels, our layer-wise and time-resolved analysis uncovers that all speech models encode grammatical features more robustly than conceptual ones.

Localmax dynamics for attention in transformers and its asymptotic behavior

arXiv:2509.15958v1 Announce Type: new Abstract: We introduce a new discrete-time attention model, termed the localmax dynamics, which interpolates between the classic softmax dynamics and the hardmax dynamics, where only the tokens that maximize the influence toward a given token have a positive weight. As in hardmax, uniform weights are determined by a parameter controlling neighbor influence, but the key extension lies in relaxing neighborhood interactions through an alignment-sensitivity parameter, which allows controlled deviations from pure hardmax behavior. As we prove, while the convex hull of the token states still converges to a convex polytope, its structure can no longer be fully described by a maximal alignment set, prompting the introduction of quiescent sets to capture the invariant behavior of tokens near vertices. We show that these sets play a key role in understanding the asymptotic behavior of the system, even under time-varying alignment sensitivity parameters. We further show that localmax dynamics does not exhibit finite-time convergence and provide results for vanishing, nonzero, time-varying alignment-sensitivity parameters, recovering the limiting behavior of hardmax as a by-product. Finally, we adapt Lyapunov-based methods from classical opinion dynamics, highlighting their limitations in the asymmetric setting of localmax interactions and outlining directions for future research.
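
The two endpoints that the abstract says localmax interpolates between can be contrasted in a few lines. This sketch does not implement localmax itself, whose update rule is not given in the abstract; it only shows softmax weights (every token gets positive weight) next to hardmax weights (uniform weight on the maximizing tokens, zero elsewhere). The function names, the tolerance `tol`, and the score values are assumptions for illustration.

```python
import math

def softmax_weights(scores):
    """Standard softmax: every token receives a strictly positive weight."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hardmax_weights(scores, tol=1e-12):
    """Hardmax: uniform weight over the tokens attaining the maximum score,
    zero weight for all others."""
    m = max(scores)
    winners = [abs(s - m) <= tol for s in scores]
    n = sum(winners)
    return [1.0 / n if w else 0.0 for w in winners]

# Hypothetical alignment scores of three tokens toward a given token.
scores = [2.0, 2.0, 0.5]

print(hardmax_weights(scores))  # [0.5, 0.5, 0.0]
print(softmax_weights(scores))  # all positive, summing to 1
```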

Tag&Tab: Pretraining Data Detection in Large Language Models Using Keyword-Based Membership Inference Attack

arXiv:2501.08454v2 Announce Type: replace-cross Abstract: Large language models (LLMs) have become essential tools for digital task assistance. Their training relies heavily on the collection of vast amounts of data, which may include copyright-protected or sensitive information. Recent studies on detecting pretraining data in LLMs have primarily focused on sentence- or paragraph-level membership inference attacks (MIAs), usually involving probability analysis of the target model’s predicted tokens. However, these methods often exhibit poor accuracy, failing to account for the semantic importance of textual content and word significance. To address these shortcomings, we propose Tag&Tab, a novel approach for detecting data used in LLM pretraining. Our method leverages established natural language processing (NLP) techniques to tag keywords in the input text, a process we term Tagging. Then, the LLM is used to obtain probabilities for these keywords and calculate their average log-likelihood to determine input text membership, a process we refer to as Tabbing. Our experiments on four benchmark datasets (BookMIA, MIMIR, PatentMIA, and the Pile) and several open-source LLMs of varying sizes demonstrate an average increase in AUC scores ranging from 5.3% to 17.6% over state-of-the-art methods. Tag&Tab not only sets a new standard for data leakage detection in LLMs, but its outstanding performance is a testament to the importance of words in MIAs on LLMs.
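
The "Tabbing" step described above, averaging the log-likelihood of the tagged keywords, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the per-token log-probabilities are hard-coded rather than obtained from an LLM, the keyword list stands in for the output of the tagging step, word-level tokens are assumed, and the decision threshold is an arbitrary placeholder.

```python
from statistics import mean

def keyword_avg_log_likelihood(tokens, log_probs, keywords):
    """Mean log-probability over the tokens that were tagged as keywords."""
    keyword_set = {k.lower() for k in keywords}
    selected = [lp for tok, lp in zip(tokens, log_probs)
                if tok.lower() in keyword_set]
    if not selected:
        raise ValueError("no keyword tokens found in the input")
    return mean(selected)

def is_member(score, threshold):
    """Flag the text as likely pretraining data when the keyword score is high."""
    return score > threshold

# Hypothetical inputs: tokens, model log-probs per token, and tagged keywords.
tokens = ["the", "whale", "surfaced", "near", "the", "harpoon"]
log_probs = [-0.5, -1.0, -4.0, -0.7, -0.4, -2.0]
keywords = ["whale", "harpoon"]

score = keyword_avg_log_likelihood(tokens, log_probs, keywords)
print(score)                    # -1.5
print(is_member(score, -2.0))   # True
```

Only the keyword tokens contribute to the score, so the low-probability token "surfaced" and the high-probability function words are ignored; this reflects the abstract's point that word significance matters for membership inference.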
