YouZum

Committee

AI, Committee, 新闻, Uncategorized

ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs

arXiv:2506.15211v1 Announce Type: new Abstract: Recent advances in Large Reasoning Models (LRMs) trained with Long Chain-of-Thought (Long CoT) reasoning have demonstrated remarkable cross-domain generalization capabilities. However, the underlying mechanisms supporting such transfer remain poorly understood. We hypothesize that cross-domain generalization arises from shared abstract reasoning prototypes — fundamental reasoning patterns that capture the essence of problems across domains. These prototypes minimize the nuances of the representation, revealing that seemingly diverse tasks are grounded in shared reasoning structures.Based on this hypothesis, we propose ProtoReasoning, a framework that enhances the reasoning ability of LLMs by leveraging scalable and verifiable prototypical representations (Prolog for logical reasoning, PDDL for planning).ProtoReasoning features: (1) an automated prototype construction pipeline that transforms problems into corresponding prototype representations; (2) a comprehensive verification system providing reliable feedback through Prolog/PDDL interpreters; (3) the scalability to synthesize problems arbitrarily within prototype space while ensuring correctness. Extensive experiments show that ProtoReasoning achieves 4.7% improvement over baseline models on logical reasoning (Enigmata-Eval), 6.3% improvement on planning tasks, 4.0% improvement on general reasoning (MMLU) and 1.0% on mathematics (AIME24). Significantly, our ablation studies confirm that learning in prototype space also demonstrates enhanced generalization to structurally similar problems compared to training solely on natural language representations, validating our hypothesis that reasoning prototypes serve as the foundation for generalizable reasoning in large language models.

ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs Read Post »

AI, Committee, 新闻, Uncategorized

HtFLlib: A Unified Benchmarking Library for Evaluating Heterogeneous Federated Learning Methods Across Modalities

AI institutions develop heterogeneous models for specific tasks but face data scarcity challenges during training. Traditional Federated Learning (FL) supports only homogeneous model collaboration, which needs identical architectures across all clients. However, clients develop model architectures for their unique requirements. Moreover, sharing effort-intensive locally trained models contains intellectual property and reduces participants’ interest in engaging in collaborations. Heterogeneous Federated Learning (HtFL) addresses these limitations, but the literature lacks a unified benchmark for evaluating HtFL across various domains and aspects. Background and Categories of HtFL Methods Existing FL benchmarks focus on data heterogeneity using homogeneous client models but neglect real scenarios that involve model heterogeneity. Representative HtFL methods fall into three main categories addressing these limitations. Partial parameter sharing methods such as LG-FedAvg, FedGen, and FedGH maintain heterogeneous feature extractors while assuming homogeneous classifier heads for knowledge transfer. Mutual distillation, such as FML, FedKD, and FedMRL, trains and shares small auxiliary models through distillation techniques. Prototype sharing methods transfer lightweight class-wise prototypes as global knowledge, collecting local prototypes from clients, and collecting them on servers to guide local training. However, it remains unclear whether existing HtFL methods perform consistently across diverse scenarios. Introducing HtFLlib: A Unified Benchmark Researchers from Shanghai Jiao Tong University, Beihang University, Chongqing University, Tongji University, Hong Kong Polytechnic University, and The Queen’s University of Belfast have proposed the first Heterogeneous Federated Learning Library (HtFLlib), an easy and extensible method for integrating multiple datasets and model heterogeneity scenarios. This method integrates: 12 datasets across various domains, modalities, and data heterogeneity scenarios 40 model architectures ranging from small to large, across three modalities.  A modularized and easy-to-extend HtFL codebase with implementations of 10 representative HtFL methods. Systematic evaluations covering accuracy, convergence, computation costs, and communication costs.  Datasets and Modalities in HtFLlib HtFLlib contains detailed data heterogeneity scenarios divided into three settings: Label Skew with Pathological and Dirichlet as subsettings, Feature Shift, and Real-World. It integrates 12 datasets, including Cifar10, Cifar100, Flowers102, Tiny-ImageNet, KVASIR, COVIDx, DomainNet, Camelyon17, AG News, Shakespeare, HAR, and PAMAP2. These datasets vary significantly in domain, data volume, and class numbers, demonstrating HtFLlib’s comprehensive and versatile nature. Moreover, researchers’ main focus is on image data, especially the label skew setting, as image tasks are the most commonly used tasks across various fields. The HtFL methods are evaluated across image, text, and sensor signal tasks to evaluate their respective strengths and weaknesses. Performance Analysis: Image Modality For image data, most HtFL methods show decreased accuracy as model heterogeneity increases. The FedMRL shows superior strength through its combination of auxiliary global and local models. When introducing heterogeneous classifiers that make partial parameter sharing methods inapplicable, FedTGP maintains superiority across diverse settings due to its adaptive prototype refinement ability. Medical dataset experiments with black-boxed pre-trained heterogeneous models demonstrate that HtFL enhances model quality compared to pre-trained models and achieves greater improvements than auxiliary models, such as FML. For text data, FedMRL’s advantages in label skew settings diminish in real-world settings, while FedProto and FedTGP perform relatively poorly compared to image tasks. Conclusion In conclusion, researchers introduced HtFLlib, a framework that addresses the critical gap in HtFL benchmarking by providing unified evaluation standards across diverse domains and scenarios. HtFLlib’s modular design and extensible architecture provide a detailed benchmark for both research and practical applications in HtFL. Moreover, its ability to support heterogeneous models in collaborative learning opens the way for future research into utilizing complex pre-trained large models, black-box systems, and varied architectures across different tasks and modalities. Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. The post HtFLlib: A Unified Benchmarking Library for Evaluating Heterogeneous Federated Learning Methods Across Modalities appeared first on MarkTechPost.

HtFLlib: A Unified Benchmarking Library for Evaluating Heterogeneous Federated Learning Methods Across Modalities Read Post »

AI, Committee, 新闻, Uncategorized

AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science

arXiv:2506.13992v1 Announce Type: cross Abstract: Large language models (LLMs) have advanced the automation of data science workflows. Yet it remains unclear whether they can critically leverage external domain knowledge as human data scientists do in practice. To answer this question, we introduce AssistedDS (Assisted Data Science), a benchmark designed to systematically evaluate how LLMs handle domain knowledge in tabular prediction tasks. AssistedDS features both synthetic datasets with explicitly known generative mechanisms and real-world Kaggle competitions, each accompanied by curated bundles of helpful and adversarial documents. These documents provide domain-specific insights into data cleaning, feature engineering, and model selection. We assess state-of-the-art LLMs on their ability to discern and apply beneficial versus harmful domain knowledge, evaluating submission validity, information recall, and predictive performance. Our results demonstrate three key findings: (1) LLMs frequently exhibit an uncritical adoption of provided information, significantly impairing their predictive performance when adversarial content is introduced, (2) helpful guidance is often insufficient to counteract the negative influence of adversarial information, and (3) in Kaggle datasets, LLMs often make errors in handling time-series data, applying consistent feature engineering across different folds, and interpreting categorical variables correctly. These findings highlight a substantial gap in current models’ ability to critically evaluate and leverage expert knowledge, underscoring an essential research direction for developing more robust, knowledge-aware automated data science systems.

AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science Read Post »

AI, Committee, 新闻, Uncategorized

Optimizing Length Compression in Large Reasoning Models

arXiv:2506.14755v1 Announce Type: cross Abstract: Large Reasoning Models (LRMs) have achieved remarkable success, yet they often suffer from producing unnecessary and verbose reasoning chains. We identify a core aspect of this issue as “invalid thinking” — models tend to repeatedly double-check their work after having derived the correct answer. To address this specific inefficiency, we move beyond the general principles of Efficacy and Efficiency to propose two new, fine-grained principles: Brevity, which advocates for eliminating redundancy, and Sufficiency, which ensures critical reasoning steps are preserved. Guided by these principles, we introduce LC-R1, a post-training method based on Group Relative Policy Optimization (GRPO). LC-R1 employs a novel combination of a Length Reward for overall conciseness and a Compress Reward that is specifically designed to remove the invalid portion of the thinking process. Extensive experiments on multiple reasoning benchmarks demonstrate that LC-R1 achieves a significant reduction in sequence length (~50%) with only a marginal (~2%) drop in accuracy, achieving a favorable trade-off point on the Pareto frontier that prioritizes high compression. Our analysis further validates the robustness of LC-R1 and provides valuable insights for developing more powerful yet computationally efficient LRMs. Our code is released at https://github.com/zxiangx/LC-R1.

Optimizing Length Compression in Large Reasoning Models Read Post »

AI, Committee, 新闻, Uncategorized

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

arXiv:2505.13227v2 Announce Type: replace-cross Abstract: Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release the largest computer use grounding dataset Jedi, which contains 4 million examples through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances agentic capabilities of general foundation models on complex computer tasks, improving from 5% to 27% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. All benchmark, data, checkpoints, and code are open-sourced and available at https://osworld-grounding.github.io.

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis Read Post »

AI, Committee, 新闻, Uncategorized

AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR

arXiv:2506.14190v1 Announce Type: new Abstract: Developing code-switched ASR systems is challenging due to language ambiguity and limited exposure to multilingual, code-switched data, while collecting such speech is costly. Prior work generates synthetic audio from text, but these methods are computationally intensive and hard to scale. We introduce AsyncSwitch, a novel asynchronous adaptation framework that leverages large-scale, text-rich web data to pre-expose ASR models to diverse code-switched domains before fine-tuning on paired speech-text corpora. Our three-stage process (1) trains decoder self-attention and feedforward layers on code-switched text, (2) aligns decoder and encoder via cross-attention using limited speech-text data, and (3) fully fine-tunes the entire model. Experiments with Whisper on Malay-English code-switching demonstrate a 9.02% relative WER reduction, while improving monolingual performance in Singlish, Malay, and other English variants.

AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR Read Post »

zh_CN