YouZum

Committee

AI, Committee, Nachrichten, Uncategorized

When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models

arXiv:2509.18762v1 Announce Type: new Abstract: Large language models (LLMs) have achieved impressive performance across natural language processing (NLP) tasks. As real-world applications increasingly demand longer context windows, continued pretraining and supervised fine-tuning (SFT) on long-context data has become a common approach. While the effects of data length in continued pretraining have been extensively studied, their implications for SFT remain unclear. In this work, we systematically investigate how SFT data length influences LLM behavior on short-context tasks. Counterintuitively, we find that long-context SFT improves short-context performance, contrary to the commonly observed degradation from long-context pretraining. To uncover the underlying mechanisms of this phenomenon, we first decouple and analyze two key components, Multi-Head Attention (MHA) and Feed-Forward Network (FFN), and show that both independently benefit from long-context SFT. We further study their interaction and reveal a knowledge preference bias: long-context SFT promotes contextual knowledge, while short-context SFT favors parametric knowledge, making exclusive reliance on long-context SFT suboptimal. Finally, we demonstrate that hybrid training mitigates this bias, offering explainable guidance for fine-tuning LLMs.

When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models Beitrag lesen »

AI, Committee, Nachrichten, Uncategorized

Human-Annotated NER Dataset for the Kyrgyz Language

arXiv:2509.19109v1 Announce Type: new Abstract: We introduce KyrgyzNER, the first manually annotated named entity recognition dataset for the Kyrgyz language. Comprising 1,499 news articles from the 24.KG news portal, the dataset contains 10,900 sentences and 39,075 entity mentions across 27 named entity classes. We show our annotation scheme, discuss the challenges encountered in the annotation process, and present the descriptive statistics. We also evaluate several named entity recognition models, including traditional sequence labeling approaches based on conditional random fields and state-of-the-art multilingual transformer-based models fine-tuned on our dataset. While all models show difficulties with rare entity categories, models such as the multilingual RoBERTa variant pretrained on a large corpus across many languages achieve a promising balance between precision and recall. These findings emphasize both the challenges and opportunities of using multilingual pretrained models for processing languages with limited resources. Although the multilingual RoBERTa model performed best, other multilingual models yielded comparable results. This suggests that future work exploring more granular annotation schemes may offer deeper insights for Kyrgyz language processing pipelines evaluation.

Human-Annotated NER Dataset for the Kyrgyz Language Beitrag lesen »

AI, Committee, Nachrichten, Uncategorized

Finding My Voice: Generative Reconstruction of Disordered Speech for Automated Clinical Evaluation

arXiv:2509.19231v1 Announce Type: cross Abstract: We present ChiReSSD, a speech reconstruction framework that preserves children speaker’s identity while suppressing mispronunciations. Unlike prior approaches trained on healthy adult speech, ChiReSSD adapts to the voices of children with speech sound disorders (SSD), with particular emphasis on pitch and prosody. We evaluate our method on the STAR dataset and report substantial improvements in lexical accuracy and speaker identity preservation. Furthermore, we automatically predict the phonetic content in the original and reconstructed pairs, where the proportion of corrected consonants is comparable to the percentage of correct consonants (PCC), a clinical speech assessment metric. Our experiments show Pearson correlation of 0.63 between automatic and human expert annotations, highlighting the potential to reduce the manual transcription burden. In addition, experiments on the TORGO dataset demonstrate effective generalization for reconstructing adult dysarthric speech. Our results indicate that disentangled, style-based TTS reconstruction can provide identity-preserving speech across diverse clinical populations.

Finding My Voice: Generative Reconstruction of Disordered Speech for Automated Clinical Evaluation Beitrag lesen »

AI, Committee, Nachrichten, Uncategorized

When and How Long Did Therapy Happen? Soft-Supervising Temporal Localization Using Audio-Language Models

arXiv:2506.09707v3 Announce Type: replace-cross Abstract: Prolonged Exposure (PE) therapy is an effective treatment for post-traumatic stress disorder (PTSD), but evaluating therapist fidelity remains labor-intensive due to the need for manual review of session recordings. We present a method for the automatic temporal localization of key PE fidelity elements, identifying their start and stop times, directly from session audio and transcripts. Our approach fine-tunes a large pre-trained audio-language model, Qwen2-Audio, using Low-Rank Adaptation (LoRA) to process focused 30-second windows of audio-transcript input. Fidelity labels for three core protocol phases, therapist orientation (P1), imaginal exposure (P2), and post-imaginal processing (P3), are generated via LLM-based prompting and verified by trained raters. The model is trained to predict normalized boundary offsets using soft supervision guided by task-specific prompts. On a dataset of 308 real PE sessions, our best configuration (LoRA rank 8, 30s windows) achieves a mean absolute error (MAE) of 5.3s across tasks, within typical rater tolerance for timestamp review, enabling practical fidelity QC. We further analyze the effects of window size and LoRA rank, highlighting the importance of context granularity and model adaptation. This work introduces a privacy-preserving, scalable framework for fidelity tracking in PE therapy, with potential to support clinician training, supervision, and quality assurance.

When and How Long Did Therapy Happen? Soft-Supervising Temporal Localization Using Audio-Language Models Beitrag lesen »

AI, Committee, Nachrichten, Uncategorized

Automated Generation of Research Workflows from Academic Papers: A Full-text Mining Framework

arXiv:2509.12955v2 Announce Type: replace Abstract: The automated generation of research workflows is essential for improving the reproducibility of research and accelerating the paradigm of “AI for Science”. However, existing methods typically extract merely fragmented procedural components and thus fail to capture complete research workflows. To address this gap, we propose an end-to-end framework that generates comprehensive, structured research workflows by mining full-text academic papers. As a case study in the Natural Language Processing (NLP) domain, our paragraph-centric approach first employs Positive-Unlabeled (PU) Learning with SciBERT to identify workflow-descriptive paragraphs, achieving an F1-score of 0.9772. Subsequently, we utilize Flan-T5 with prompt learning to generate workflow phrases from these paragraphs, yielding ROUGE-1, ROUGE-2, and ROUGE-L scores of 0.4543, 0.2877, and 0.4427, respectively. These phrases are then systematically categorized into data preparation, data processing, and data analysis stages using ChatGPT with few-shot learning, achieving a classification precision of 0.958. By mapping categorized phrases to their document locations in the documents, we finally generate readable visual flowcharts of the entire research workflows. This approach facilitates the analysis of workflows derived from an NLP corpus and reveals key methodological shifts over the past two decades, including the increasing emphasis on data analysis and the transition from feature engineering to ablation studies. Our work offers a validated technical framework for automated workflow generation, along with a novel, process-oriented perspective for the empirical investigation of evolving scientific paradigms. Source code and data are available at: https://github.com/ZH-heng/research_workflow.

Automated Generation of Research Workflows from Academic Papers: A Full-text Mining Framework Beitrag lesen »

AI, Committee, Nachrichten, Uncategorized

ChartHal: A Fine-grained Framework Evaluating Hallucination of Large Vision Language Models in Chart Understanding

arXiv:2509.17481v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) have recently demonstrated remarkable progress, yet hallucination remains a critical barrier, particularly in chart understanding, which requires sophisticated perceptual and cognitive abilities as well as rigorous factual accuracy. While prior work has investigated hallucinations and chart comprehension independently, their intersection remains largely unexplored. To address this gap, we present ChartHal, a benchmark that features a fine-grained taxonomy of hallucination scenarios in chart understanding, along with a human-validated dataset of 1,062 samples. Our evaluation shows that state-of-the-art LVLMs suffer from severe hallucinations on ChartHal, including proprietary models such as GPT-5 and o4-mini, which achieve only 34.46% and 22.79% accuracy, respectively. Further analysis reveals that questions involving information absent from or contradictory to charts are especially likely to trigger hallucinations, underscoring the urgent need for more robust mitigation strategies. Code and data are available at https://github.com/ymcui/ChartHal .

ChartHal: A Fine-grained Framework Evaluating Hallucination of Large Vision Language Models in Chart Understanding Beitrag lesen »

AI, Committee, Nachrichten, Uncategorized

GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning

arXiv:2509.17437v1 Announce Type: new Abstract: Recent advancements in reinforcement learning (RL) have enhanced the reasoning abilities of large language models (LLMs), yet the impact on multimodal LLMs (MLLMs) is limited. Particularly in vision-intensive tasks like geometric reasoning, MLLMs hallucinate frequently, leading to inaccurate reasoning. We attribute this to the perceptual bottleneck in MLLMs, which caps the benefits of reasoning training. To quantify this, we design a Geo-Perception Question-Answering (GeoPQA) benchmark, targeting basic geometric concepts and spatial relationships. Experiments on GeoPQA reveal significant shortcomings of MLLMs in visual perception, which constrain RL reward signals for effective training. To address this bottleneck, we propose a two-stage RL training framework by first enhancing the visual perception of geometric structures, then fostering reasoning capabilities. Applied to Qwen2.5-VL-3B-Instruct, our two-stage training improves geometric reasoning by 9.7% and geometric problem solving by 9.1%, compared to the direct reasoning training approach. Our method also generalizes to other vision-intensive domains like figure understanding, highlighting the importance of perceptual grounding in effective MLLM reasoning.

GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning Beitrag lesen »

AI, Committee, Nachrichten, Uncategorized

MSCoRe: A Benchmark for Multi-Stage Collaborative Reasoning in LLM Agents

arXiv:2509.17628v1 Announce Type: new Abstract: Large Language Models (LLMs) have excelled in question-answering (QA) tasks within single domains. However, their reasoning and coordination capabilities in complex, multi-stage scenarios remain underexplored. Existing benchmarks typically focus on isolated tasks or narrow domains, overlooking models’ abilities for multi-stage collaboration and optimization without explicit external guidance. To bridge this gap, we propose textbf{MSCoRe}, a novel benchmark comprising 126696 domain-specific QA instances spanning scenarios in automotive, pharmaceutical, electronics, and energy sectors. The dataset is created using a structured three-phase pipeline: dynamic sampling, iterative question-answer generation, and a multi-level quality assessment to ensure data quality. Tasks are further categorized into three difficulty levels according to stage coverage and complexity. With MSCoRe, we have conducted a comprehensive evaluation of various state-of-the-art LLM agents. The commercial models performed best across all tasks and scenarios, but a notable gap in ROUGE scores remains between simple and complex tasks. We also tested the models’ robustness and found that their performance is negatively affected by noisy data. MSCoRe provides a valuable new resource for the community to evaluate and improve multi-stage reasoning in LLM agents. The code and data are available at https://github.com/D3E0-source/MSCoRE.

MSCoRe: A Benchmark for Multi-Stage Collaborative Reasoning in LLM Agents Beitrag lesen »

AI, Committee, Nachrichten, Uncategorized

From Roots to Rewards: Dynamic Tree Reasoning with Reinforcement Learning

arXiv:2507.13142v3 Announce Type: replace-cross Abstract: Modern language models address complex questions through chain-of-thought (CoT) reasoning (Wei et al., 2023) and retrieval augmentation (Lewis et al., 2021), yet struggle with error propagation and knowledge integration. Tree-structured reasoning methods, particularly the Probabilistic Tree-of-Thought (ProbTree)(Cao et al., 2023) framework, mitigate these issues by decomposing questions into hierarchical structures and selecting answers through confidence-weighted aggregation of parametric and retrieved knowledge (Yao et al., 2023). However, ProbTree’s static implementation introduces two key limitations: (1) the reasoning tree is fixed during the initial construction phase, preventing dynamic adaptation to intermediate results, and (2) each node requires exhaustive evaluation of all possible solution strategies, creating computational inefficiency. We present a dynamic reinforcement learning (Sutton and Barto, 2018) framework that transforms tree-based reasoning into an adaptive process. Our approach incrementally constructs the reasoning tree based on real-time confidence estimates, while learning optimal policies for action selection (decomposition, retrieval, or aggregation). This maintains ProbTree’s probabilistic rigor while improving both solution quality and computational efficiency through selective expansion and focused resource allocation. The work establishes a new paradigm for treestructured reasoning that balances the reliability of probabilistic frameworks with the flexibility required for real-world question answering systems. Code available at: https://github.com/ahmedehabb/From-Roots-to-Rewards-Dynamic-Tree-Reasoning-with-RL

From Roots to Rewards: Dynamic Tree Reasoning with Reinforcement Learning Beitrag lesen »

AI, Committee, Nachrichten, Uncategorized

Audio Contrastive-based Fine-tuning: Decoupling Representation Learning and Classification

arXiv:2309.11895v4 Announce Type: replace-cross Abstract: Standard fine-tuning of pre-trained audio models couples representation learning with classifier training, which can obscure the true quality of the learned representations. In this work, we advocate for a disentangled two-stage framework that separates representation refinement from downstream evaluation. First, we employ a “contrastive-tuning” stage to explicitly improve the geometric structure of the model’s embedding space. Subsequently, we introduce a dual-probe evaluation protocol to assess the quality of these refined representations from a geometric perspective. This protocol uses a linear probe to measure global linear separability and a k-Nearest Neighbours probe to investigate the local structure of class clusters. Our experiments on a diverse set of audio classification tasks show that our framework provides a better foundation for classification, leading to improved accuracy. Our newly proposed dual-probing framework acts as a powerful analytical lens, demonstrating why contrastive learning is more effective by revealing a superior embedding space. It significantly outperforms vanilla fine-tuning, particularly on single-label datasets with a large number of classes, and also surpasses strong baselines on multi-label tasks using a Jaccard-weighted loss. Our findings demonstrate that decoupling representation refinement from classifier training is a broadly effective strategy for unlocking the full potential of pre-trained audio models. Our code will be publicly available.

Audio Contrastive-based Fine-tuning: Decoupling Representation Learning and Classification Beitrag lesen »

We use cookies to improve your experience and performance on our website. You can learn more at Datenschutzrichtlinie and manage your privacy settings by clicking Settings.

Privacy Preferences

You can choose your cookie settings by turning on/off each type of cookie as you wish, except for essential cookies.

Allow All
Manage Consent Preferences
  • Always Active

Save
de_DE