YouZum

AI

AI, Committee, News, Uncategorized

Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning in Large Language Models

arXiv:2506.14625v2 Announce Type: replace Abstract: Large Language Models (LLMs) have shown impressive moral reasoning abilities. Yet they often diverge when confronted with complex, multi-factor moral dilemmas. To address these discrepancies, we propose a framework that synthesizes multiple LLMs’ moral judgments into a collectively formulated moral judgment, realigning models that deviate significantly from this consensus. Our aggregation mechanism fuses continuous moral acceptability scores (beyond binary labels) into a collective probability, weighting contributions by model reliability. For misaligned models, a targeted embedding-optimization procedure fine-tunes token embeddings for moral philosophical theories, minimizing JS divergence to the consensus while preserving semantic integrity. Experiments on a large-scale social moral dilemma dataset show our approach builds robust consensus and improves individual model fidelity. These findings highlight the value of data-driven moral alignment across multiple models and its potential for safer, more consistent AI systems.
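The aggregation and divergence steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each model emits a single acceptability probability in [0, 1], that reliability is a non-negative weight per model, and that deviation is measured as the Jensen-Shannon divergence between two Bernoulli distributions. The function names and the threshold value are hypothetical.

```python
import math

def aggregate(scores, reliabilities):
    """Reliability-weighted mean of per-model acceptability scores in [0, 1]."""
    total = sum(reliabilities)
    return sum(s * r for s, r in zip(scores, reliabilities)) / total

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between Bernoulli(p) and Bernoulli(q)."""
    def kl(a, b):
        # KL divergence between Bernoulli(a) and Bernoulli(b); 0 * log(0) counts as 0.
        terms = 0.0
        for x, y in ((a, b), (1 - a, 1 - b)):
            if x > 0:
                terms += x * math.log2(x / y)
        return terms
    m = 0.5 * (p + q)  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def misaligned(scores, reliabilities, threshold=0.1):
    """Indices of models whose score diverges from the consensus (threshold is hypothetical)."""
    consensus = aggregate(scores, reliabilities)
    return [i for i, s in enumerate(scores) if js_divergence(s, consensus) > threshold]
```

In this sketch, a model flagged by `misaligned` would be the candidate for the embedding-optimization step the abstract describes; that step itself is not shown here.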

AI, Committee, News, Uncategorized

Dynamic Acoustic Model Architecture Optimization in Training for ASR

arXiv:2506.13180v2 Announce Type: replace Abstract: Architecture design is inherently complex. Existing approaches rely on either handcrafted rules, which demand extensive empirical expertise, or automated methods like neural architecture search, which are computationally intensive. In this paper, we introduce DMAO, an architecture optimization framework that employs a grow-and-drop strategy to automatically reallocate parameters during training. This reallocation shifts resources from less-utilized areas to the parts of the model where they are most beneficial. Notably, DMAO introduces only negligible training overhead at a given model complexity. We evaluate DMAO through experiments with CTC on the LibriSpeech, TED-LIUM-v2 and Switchboard datasets. The results show that, using the same amount of training resources, DMAO consistently reduces WER by up to 6% relative across various architectures, model sizes, and datasets. Furthermore, we analyze the pattern of parameter redistribution and uncover insightful findings.

AI, Committee, News, Uncategorized

MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs

arXiv:2506.15215v1 Announce Type: new Abstract: Open-ended question answering (QA) is a key task for evaluating the capabilities of large language models (LLMs). Compared to closed-ended QA, it demands longer answer statements, more nuanced reasoning processes, and diverse expressions, making refined and interpretable automatic evaluation both crucial and challenging. Traditional metrics like ROUGE and BERTScore struggle to capture semantic similarities due to different patterns between model responses and reference answers. Current LLM-based evaluation approaches, such as pairwise or listwise comparisons of candidate answers, lack intuitive interpretability. While pointwise scoring of each response provides some descriptions, it fails to adapt across different question contents. Most notably, existing methods overlook the distinction between factoid and non-factoid questions. To address these challenges, we propose MinosEval, a novel evaluation method that first distinguishes open-ended questions and then ranks candidate answers using different evaluation strategies. For factoid questions, it applies an adaptive key-point scoring strategy, while for non-factoid questions, it uses an instance-aware listwise ranking strategy. Experiments on multiple open-ended QA datasets, including self-built ones with more candidate responses to complement community resources, show that MinosEval better aligns with human annotations and offers more interpretable results.

AI, Committee, News, Uncategorized

HtFLlib: A Unified Benchmarking Library for Evaluating Heterogeneous Federated Learning Methods Across Modalities

AI institutions develop heterogeneous models for specific tasks but face data scarcity during training. Traditional Federated Learning (FL) supports only homogeneous model collaboration, which requires identical architectures across all clients. In practice, however, clients design model architectures for their own requirements. Moreover, sharing locally trained models that took significant effort to build exposes intellectual property and reduces participants’ interest in collaborating. Heterogeneous Federated Learning (HtFL) addresses these limitations, but the literature lacks a unified benchmark for evaluating HtFL across various domains and aspects.

Background and Categories of HtFL Methods

Existing FL benchmarks focus on data heterogeneity using homogeneous client models but neglect real scenarios that involve model heterogeneity. Representative HtFL methods fall into three main categories:

- Partial parameter sharing methods, such as LG-FedAvg, FedGen, and FedGH, maintain heterogeneous feature extractors while assuming homogeneous classifier heads for knowledge transfer.
- Mutual distillation methods, such as FML, FedKD, and FedMRL, train and share small auxiliary models through distillation techniques.
- Prototype sharing methods transfer lightweight class-wise prototypes as global knowledge, collecting local prototypes from clients and aggregating them on servers to guide local training.

However, it remains unclear whether existing HtFL methods perform consistently across diverse scenarios.

Introducing HtFLlib: A Unified Benchmark

Researchers from Shanghai Jiao Tong University, Beihang University, Chongqing University, Tongji University, Hong Kong Polytechnic University, and The Queen’s University of Belfast have proposed the first Heterogeneous Federated Learning Library (HtFLlib), an easy-to-use and extensible library for integrating multiple datasets and model heterogeneity scenarios. It integrates:

- 12 datasets across various domains, modalities, and data heterogeneity scenarios.
- 40 model architectures ranging from small to large, across three modalities.
- A modularized and easy-to-extend HtFL codebase with implementations of 10 representative HtFL methods.
- Systematic evaluations covering accuracy, convergence, computation costs, and communication costs.

Datasets and Modalities in HtFLlib

HtFLlib divides its data heterogeneity scenarios into three settings: Label Skew (with Pathological and Dirichlet as subsettings), Feature Shift, and Real-World. It integrates 12 datasets: Cifar10, Cifar100, Flowers102, Tiny-ImageNet, KVASIR, COVIDx, DomainNet, Camelyon17, AG News, Shakespeare, HAR, and PAMAP2. These datasets vary significantly in domain, data volume, and number of classes, which demonstrates HtFLlib’s breadth. The researchers focus mainly on image data, especially the label skew setting, because image tasks are the most commonly used across fields. The HtFL methods are evaluated on image, text, and sensor signal tasks to assess their respective strengths and weaknesses.

Performance Analysis: Image Modality

For image data, most HtFL methods show decreased accuracy as model heterogeneity increases. FedMRL performs strongly thanks to its combination of auxiliary global and local models. When heterogeneous classifiers are introduced, which makes partial parameter sharing methods inapplicable, FedTGP maintains its lead across diverse settings due to its adaptive prototype refinement ability. Medical dataset experiments with black-boxed pre-trained heterogeneous models demonstrate that HtFL enhances model quality compared to pre-trained models and achieves greater improvements than auxiliary-model methods such as FML.

For text data, FedMRL’s advantages in label skew settings diminish in real-world settings, while FedProto and FedTGP perform relatively poorly compared to image tasks.

Conclusion

In conclusion, the researchers introduced HtFLlib, a framework that addresses the critical gap in HtFL benchmarking by providing unified evaluation standards across diverse domains and scenarios. HtFLlib’s modular design and extensible architecture provide a detailed benchmark for both research and practical applications in HtFL. Moreover, its ability to support heterogeneous models in collaborative learning opens the way for future research into utilizing complex pre-trained large models, black-box systems, and varied architectures across different tasks and modalities.

All credit for this research goes to the researchers of this project. The post HtFLlib: A Unified Benchmarking Library for Evaluating Heterogeneous Federated Learning Methods Across Modalities appeared first on MarkTechPost.
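The prototype sharing category described in the article above can be sketched in a few lines. This is an illustration only, not HtFLlib's code or any specific method's algorithm: it assumes a prototype is the mean feature vector of a class on one client, and that the server combines client prototypes with a sample-count-weighted average. All function names are hypothetical.

```python
from collections import defaultdict

def local_prototypes(features, labels):
    """Per-class mean feature vector on one client, with the sample count per class."""
    sums, counts = {}, defaultdict(int)
    for vec, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(vec)
        sums[y] = [s + v for s, v in zip(sums[y], vec)]
        counts[y] += 1
    return {y: ([s / counts[y] for s in sums[y]], counts[y]) for y in sums}

def aggregate_prototypes(client_protos):
    """Server side: sample-weighted average of client prototypes for each class."""
    sums, counts = {}, defaultdict(int)
    for protos in client_protos:
        for y, (vec, n) in protos.items():
            if y not in sums:
                sums[y] = [0.0] * len(vec)
            sums[y] = [s + n * v for s, v in zip(sums[y], vec)]
            counts[y] += n
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

# Two clients with different local data; only prototypes leave each client.
client_a = local_prototypes([[1.0, 1.0], [3.0, 3.0]], [0, 0])
client_b = local_prototypes([[5.0, 5.0], [0.0, 2.0]], [0, 1])
global_protos = aggregate_prototypes([client_a, client_b])
```

Because only class-wise vectors are exchanged, clients in this sketch can keep entirely different feature extractors as long as their feature dimensions match.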

AI, Committee, News, Uncategorized

AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science

arXiv:2506.13992v1 Announce Type: cross Abstract: Large language models (LLMs) have advanced the automation of data science workflows. Yet it remains unclear whether they can critically leverage external domain knowledge as human data scientists do in practice. To answer this question, we introduce AssistedDS (Assisted Data Science), a benchmark designed to systematically evaluate how LLMs handle domain knowledge in tabular prediction tasks. AssistedDS features both synthetic datasets with explicitly known generative mechanisms and real-world Kaggle competitions, each accompanied by curated bundles of helpful and adversarial documents. These documents provide domain-specific insights into data cleaning, feature engineering, and model selection. We assess state-of-the-art LLMs on their ability to discern and apply beneficial versus harmful domain knowledge, evaluating submission validity, information recall, and predictive performance. Our results demonstrate three key findings: (1) LLMs frequently exhibit an uncritical adoption of provided information, significantly impairing their predictive performance when adversarial content is introduced, (2) helpful guidance is often insufficient to counteract the negative influence of adversarial information, and (3) in Kaggle datasets, LLMs often make errors in handling time-series data, applying consistent feature engineering across different folds, and interpreting categorical variables correctly. These findings highlight a substantial gap in current models’ ability to critically evaluate and leverage expert knowledge, underscoring an essential research direction for developing more robust, knowledge-aware automated data science systems.

AI, Committee, News, Uncategorized

Optimizing Length Compression in Large Reasoning Models

arXiv:2506.14755v1 Announce Type: cross Abstract: Large Reasoning Models (LRMs) have achieved remarkable success, yet they often suffer from producing unnecessary and verbose reasoning chains. We identify a core aspect of this issue as “invalid thinking” — models tend to repeatedly double-check their work after having derived the correct answer. To address this specific inefficiency, we move beyond the general principles of Efficacy and Efficiency to propose two new, fine-grained principles: Brevity, which advocates for eliminating redundancy, and Sufficiency, which ensures critical reasoning steps are preserved. Guided by these principles, we introduce LC-R1, a post-training method based on Group Relative Policy Optimization (GRPO). LC-R1 employs a novel combination of a Length Reward for overall conciseness and a Compress Reward that is specifically designed to remove the invalid portion of the thinking process. Extensive experiments on multiple reasoning benchmarks demonstrate that LC-R1 achieves a significant reduction in sequence length (~50%) with only a marginal (~2%) drop in accuracy, achieving a favorable trade-off point on the Pareto frontier that prioritizes high compression. Our analysis further validates the robustness of LC-R1 and provides valuable insights for developing more powerful yet computationally efficient LRMs. Our code is released at https://github.com/zxiangx/LC-R1.
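The notion of "invalid thinking" in the abstract can be illustrated with a rough measurement sketch. This is not LC-R1's reward definition or the code in the linked repository: it assumes a reasoning chain is already split into steps, treats everything after the first step containing the correct answer string as invalid, and counts length in whitespace-separated words. The function names are hypothetical.

```python
def split_invalid_thinking(steps, answer):
    """Split a chain into (valid prefix, invalid tail) at the first step containing the answer."""
    for i, step in enumerate(steps):
        if answer in step:
            return steps[: i + 1], steps[i + 1 :]
    return steps, []  # answer never derived: nothing is counted as invalid

def invalid_fraction(steps, answer):
    """Share of words spent after the correct answer first appears."""
    _, invalid = split_invalid_thinking(steps, answer)
    total = sum(len(s.split()) for s in steps)
    wasted = sum(len(s.split()) for s in invalid)
    return wasted / total if total else 0.0

chain = [
    "compute 6 * 7",
    "that gives 42",
    "wait, let me verify 42 again",
    "yes, 42",
]
valid, invalid = split_invalid_thinking(chain, "42")
```

Substring matching is a crude stand-in for answer detection; a real pipeline would need a more careful check of when the answer has actually been derived.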
