High Accuracy, Less Talk (HALT): Reliable LLMs through Capability-Aligned Finetuning

arXiv:2506.04051v1 Announce Type: new Abstract: Large Language Models (LLMs) currently respond to every prompt. However, they can produce incorrect answers when they lack knowledge or capability — a problem known as hallucination. We instead propose post-training an LLM to generate content only when confident in its correctness and to otherwise (partially) abstain. Specifically, our method, HALT, produces capability-aligned post-training data that encodes what the model can and cannot reliably generate. We generate this data by splitting responses of the pretrained LLM into factual fragments (atomic statements or reasoning steps), and use ground truth information to identify incorrect fragments. We achieve capability-aligned finetuning responses by either removing incorrect fragments or replacing them with “Unsure from Here” — according to a tunable threshold that allows practitioners to trade off response completeness and mean correctness of the response’s fragments. We finetune four open-source models for biography writing, mathematics, coding, and medicine with HALT for three different trade-off thresholds. HALT effectively trades off response completeness for correctness, increasing the mean correctness of response fragments by 15% on average, while resulting in a 4% improvement in the F1 score (mean of completeness and correctness of the response) compared to the relevant baselines. By tuning HALT for highest correctness, we train a single reliable Llama3-70B model with correctness increased from 51% to 87% across all four domains while maintaining 53% of the response completeness achieved with standard finetuning.
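To make the data-construction step concrete, here is a minimal Python sketch of how a capability-aligned finetuning target could be built from ground-truth-checked fragments. The fragment splitting and correctness checking are assumed to be given, and the running-mean truncation rule is an illustrative assumption, not the authors' implementation.

# Hypothetical sketch of HALT-style target construction, based only on the
# abstract above; fragment splitting, correctness checking, and the exact
# truncation rule are assumptions for illustration.

def build_halt_target(fragments, is_correct, threshold=1.0, abstain=True):
    """Turn a pretrained model's response into a capability-aligned target.

    fragments:  atomic statements / reasoning steps (list of strings)
    is_correct: parallel list of bools from ground-truth checking
    threshold:  required running mean correctness of kept fragments;
                lowering it trades correctness for completeness
    abstain:    replace the incorrect tail with a marker instead of removing it
    """
    kept, n_correct = [], 0
    for frag, ok in zip(fragments, is_correct):
        # stop once keeping this fragment would push mean correctness below threshold
        if (n_correct + ok) / (len(kept) + 1) < threshold:
            if abstain:
                kept.append("Unsure from Here")  # partial abstention marker
            break
        kept.append(frag)
        n_correct += ok
    return " ".join(kept)

# Example: the third fragment is wrong, so the target abstains from that point.
frags = ["Marie Curie was born in 1867.",
         "She won two Nobel Prizes.",
         "She was born in Vienna."]
print(build_halt_target(frags, [True, True, False]))
# -> "Marie Curie was born in 1867. She won two Nobel Prizes. Unsure from Here"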


DiaBlo: Diagonal Blocks Are Sufficient For Finetuning

arXiv:2506.03230v1 Announce Type: cross Abstract: Finetuning is a critical step for adapting large language models (LLMs) to domain-specific downstream tasks. To mitigate the substantial computational and memory costs of full-model finetuning, Parameter-Efficient Finetuning (PEFT) methods have been proposed to update only a small subset of model parameters. However, performance gaps between PEFT approaches and full-model finetuning still exist. In this work, we present DiaBlo, a simple yet effective PEFT approach that updates only the diagonal blocks of selected model weight matrices. Unlike Low-Rank Adaptation (LoRA) and its variants, DiaBlo eliminates the need for low-rank matrix products, thereby avoiding the reliance on auxiliary initialization schemes or customized optimization strategies to improve convergence. This design leads to stable and robust convergence while maintaining memory efficiency and training speed comparable to LoRA. We conduct extensive experiments across a range of tasks, including commonsense reasoning, arithmetic reasoning, code generation, and safety alignment, to evaluate the effectiveness and efficiency of DiaBlo. Across these benchmarks, DiaBlo demonstrates strong and consistent performance while maintaining high memory efficiency and fast finetuning speed. Code is available at https://github.com/ziyangjoy/DiaBlo.
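The core idea lends itself to a short sketch: freeze the pretrained weight and learn only a block-diagonal additive update. The PyTorch module below is a minimal illustration under assumed square weights and a divisible block size; it is not the released implementation (see the repository above for that), and the block size and zero initialization are assumptions.

import torch
import torch.nn as nn

class DiaBloLinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable block-diagonal update."""

    def __init__(self, base: nn.Linear, block_size: int = 64):
        super().__init__()
        d_out, d_in = base.weight.shape
        assert d_out == d_in and d_in % block_size == 0, \
            "sketch assumes a square weight with dimensions divisible by block_size"
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # pretrained weight stays frozen
        self.block_size = block_size
        n_blocks = d_in // block_size
        # one trainable block per diagonal position, zero-initialized so the
        # adapted layer starts out identical to the pretrained one
        self.blocks = nn.Parameter(torch.zeros(n_blocks, block_size, block_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = self.block_size
        chunks = x.unflatten(-1, (-1, b))     # (..., n_blocks, b)
        # each diagonal block acts only on its own slice of the input features
        delta = torch.einsum("...nb,ncb->...nc", chunks, self.blocks)
        return self.base(x) + delta.flatten(-2)

layer = DiaBloLinear(nn.Linear(256, 256), block_size=64)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 16384

Because no low-rank product is involved, the update needs no special initialization to start at zero effect: the zero-initialized diagonal blocks already leave the pretrained layer unchanged at step one.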


A Survey on (M)LLM-Based GUI Agents

arXiv:2504.13865v2 Announce Type: replace-cross Abstract: Graphical User Interface (GUI) Agents have emerged as a transformative paradigm in human-computer interaction, evolving from rule-based automation scripts to sophisticated AI-driven systems capable of understanding and executing complex interface operations. This survey provides a comprehensive examination of the rapidly advancing field of LLM-based GUI Agents, systematically analyzing their architectural foundations, technical components, and evaluation methodologies. We identify and analyze four fundamental components that constitute modern GUI Agents: (1) perception systems that integrate text-based parsing with multimodal understanding for comprehensive interface comprehension; (2) exploration mechanisms that construct and maintain knowledge bases through internal modeling, historical experience, and external information retrieval; (3) planning frameworks that leverage advanced reasoning methodologies for task decomposition and execution; and (4) interaction systems that manage action generation with robust safety controls. Through rigorous analysis of these components, we reveal how recent advances in large language models and multimodal learning have revolutionized GUI automation across desktop, mobile, and web platforms. We critically examine current evaluation frameworks, highlighting methodological limitations in existing benchmarks while proposing directions for standardization. This survey also identifies key technical challenges, including accurate element localization, effective knowledge retrieval, long-horizon planning, and safety-aware execution control, while outlining promising research directions for enhancing GUI Agents’ capabilities. Our systematic review provides researchers and practitioners with a thorough understanding of the field’s current state and offers insights into future developments in intelligent interface automation.
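As a rough illustration of how the four components the survey identifies fit together, here is a hypothetical agent skeleton; every class and method name below is invented for exposition and does not come from any system covered by the survey.

from dataclasses import dataclass, field

@dataclass
class ToyGUIAgent:
    memory: list = field(default_factory=list)        # exploration: stored experience

    def perceive(self, screen: dict) -> dict:
        # perception: fuse text-based parsing with multimodal understanding
        return {"elements": screen.get("elements", []), "text": screen.get("text", "")}

    def plan(self, goal: str, state: dict) -> list:
        # planning: decompose the goal into steps (an LLM call in real systems)
        return [f"locate element for: {goal}", "click element", "verify outcome"]

    def act(self, step: str) -> str:
        # interaction: generate an action behind a (toy) safety control
        if "delete" in step:
            return f"blocked (needs confirmation): {step}"
        return f"executed: {step}"

    def run(self, goal: str, screen: dict) -> list:
        state = self.perceive(screen)
        results = [self.act(step) for step in self.plan(goal, state)]
        self.memory.append((goal, results))           # update the experience base
        return results

print(ToyGUIAgent().run("open settings", {"elements": ["Settings"], "text": ""}))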


Mistral AI Introduces Mistral Code: A Customizable AI Coding Assistant for Enterprise Workflows

Mistral AI announced the release of Mistral Code, an AI-powered coding assistant tailored for enterprise software development environments. The release signals Mistral's move toward addressing long-standing requirements in professional development pipelines: control, security, and model adaptability.

Addressing Enterprise-Grade Requirements

Mistral Code targets several key limitations observed in traditional AI coding tools:

- Data Sovereignty and Control: Organizations maintain full control over their code and infrastructure. Mistral Code offers options for on-premises deployment, enabling compliance with internal data governance policies.
- Customizability: Unlike off-the-shelf assistants, Mistral Code can be tuned to an enterprise's internal codebase, allowing the assistant to reflect project-specific conventions and logic structures.
- Beyond Completion: The tool supports end-to-end workflows including debugging, test generation, and code transformation, moving beyond standard autocomplete functionality.
- Unified Vendor Management: Mistral provides a single-vendor solution with full visibility across the development stack, simplifying integration and support.

Initial deployments have been conducted with partners such as Capgemini, Abanca, and SNCF, suggesting the assistant's applicability across both regulated and large-scale environments.

System Architecture and Capabilities

Mistral Code integrates four foundational models, each designed for a distinct set of development tasks:

- Codestral: Specializes in code completion and in-filling, optimized for latency and multi-language support.
- Codestral Embed: Powers semantic search and code retrieval through dense vector embeddings.
- Devstral: Designed for longer-horizon tasks such as multi-step problem solving and refactoring.
- Mistral Medium: Enables conversational interactions and contextual Q&A inside the IDE.

The assistant supports over 80 programming languages and interfaces with development artifacts such as file structures, Git diffs, and terminal outputs. Developers can use natural language to initiate refactors, generate unit tests, or receive in-line explanations, all within their IDE.

Deployment Models

Mistral Code offers flexible deployment modes to meet diverse IT policies and performance needs:

- Cloud: For teams working in managed cloud environments.
- Reserved Cloud Capacity: Dedicated infrastructure to meet latency, throughput, or compliance requirements.
- On-Premises: For enterprises with strict infrastructure control needs, especially in regulated sectors.

The assistant is currently in private beta for JetBrains IDEs and Visual Studio Code, with broader IDE support expected as adoption grows.

Administrative Features for IT Oversight

To align with enterprise security and operational practices, Mistral Code includes a comprehensive management layer:

- Role-Based Access Control (RBAC): Configurable access policies to manage user permissions at scale.
- Audit Logs: Full traceability of actions and interactions with the assistant for compliance auditing.
- Usage Analytics: Detailed reporting dashboards to monitor adoption, performance, and optimization opportunities.

These features support internal security reviews, cost accountability, and usage governance.

Conclusion

Mistral Code introduces a modular and enterprise-aligned approach to AI-assisted development. By prioritizing adaptability, transparency, and data integrity, Mistral AI offers an alternative to generalized coding assistants that often fall short in production-grade environments. The tool's architecture and deployment flexibility position it as a viable option for organizations seeking to integrate AI without compromising internal controls or development rigor.


M³FinMeeting: A Multilingual, Multi-Sector, and Multi-Task Financial Meeting Understanding Evaluation Dataset

arXiv:2506.02510v1 Announce Type: new Abstract: Recent breakthroughs in large language models (LLMs) have led to the development of new benchmarks for evaluating their performance in the financial domain. However, current financial benchmarks often rely on news articles, earnings reports, or announcements, making it challenging to capture the real-world dynamics of financial meetings. To address this gap, we propose a novel benchmark called M³FinMeeting, which is a multilingual, multi-sector, and multi-task dataset designed for financial meeting understanding. First, M³FinMeeting supports English, Chinese, and Japanese, enhancing comprehension of financial discussions in diverse linguistic contexts. Second, it encompasses various industry sectors defined by the Global Industry Classification Standard (GICS), ensuring that the benchmark spans a broad range of financial activities. Finally, M³FinMeeting includes three tasks: summarization, question-answer (QA) pair extraction, and question answering, facilitating a more realistic and comprehensive evaluation of understanding. Experimental results with seven popular LLMs reveal that even the most advanced long-context models have significant room for improvement, demonstrating the effectiveness of M³FinMeeting as a benchmark for assessing LLMs' financial meeting comprehension skills.
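For a sense of what evaluating against such a multi-task benchmark involves, here is a hypothetical harness over the three tasks named above; the record schema, task keys, and the trivial exact-match scorers are placeholders, since the abstract does not describe the released data format or metrics.

from typing import Callable, Dict, List

# Placeholder per-task scorers; a real harness would use, e.g., ROUGE for
# summaries and answer matching for the two QA-style tasks.
SCORERS: Dict[str, Callable[[str, str], float]] = {
    "summarization": lambda pred, gold: float(pred.strip() == gold.strip()),
    "qa_pair_extraction": lambda pred, gold: float(gold in pred),
    "question_answering": lambda pred, gold: float(pred.strip() == gold.strip()),
}

def evaluate(records: List[dict], model: Callable[[str, str], str]) -> Dict[str, float]:
    """Average a per-task score over meeting records.

    Each record is assumed to look like:
    {"task": <one of SCORERS>, "language": "en" | "zh" | "ja",
     "transcript": <meeting text>, "reference": <gold output>}
    """
    per_task: Dict[str, List[float]] = {task: [] for task in SCORERS}
    for rec in records:
        prediction = model(rec["task"], rec["transcript"])
        per_task[rec["task"]].append(SCORERS[rec["task"]](prediction, rec["reference"]))
    return {task: sum(s) / len(s) for task, s in per_task.items() if s}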


Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective

arXiv:2506.02553v1 Announce Type: cross Abstract: We study a common challenge in reinforcement learning for large language models (LLMs): the Zero-Reward Assumption, where non-terminal actions (i.e., intermediate token generations) receive zero task-specific immediate reward, while only the final token receives a reward for the entire response. This assumption arises frequently in practice, as precise token-level rewards are often difficult or infeasible to obtain in LLM applications. In this work, we provide a unifying theoretical perspective. We introduce the Trajectory Policy Gradient Theorem, which shows that the policy gradient based on true, unknown token-level rewards can be unbiasedly estimated using only a response-level reward model, regardless of whether the Zero-Reward Assumption holds or not, for algorithms in the REINFORCE and Actor-Critic families. This result reveals that widely used methods such as PPO, GRPO, ReMax, and RLOO inherently possess the capacity to model token-level reward signals, offering a theoretical justification for response-level reward approaches. Our findings pave the way for more practical, efficient LLM fine-tuning, allowing developers to treat training algorithms as black boxes and focus on improving the response-level reward model with auxiliary sub-models. We also offer a detailed analysis of popular RL and non-RL methods, comparing their theoretical foundations and practical advantages across common LLM tasks. Finally, we propose a new algorithm: Token-Reinforced Policy Optimization (TRePO), a theoretically grounded method that is simpler than PPO, matches GRPO in memory efficiency, and holds promise for broad applicability.
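For reference, the response-level estimator the abstract is concerned with takes the textbook REINFORCE form below, with prompt x, sampled response y = (y_1, ..., y_T), a single response-level reward R(x, y), and an optional baseline b(x); this is the standard identity, not a restatement of the paper's Trajectory Policy Gradient Theorem.

% Response-level REINFORCE gradient estimator (textbook form):
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ \bigl( R(x, y) - b(x) \bigr)
           \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\bigl(y_t \mid x, y_{<t}\bigr) \right]

Per the abstract, the Trajectory Policy Gradient Theorem shows that estimators of this form remain unbiased for the policy gradient even when the true, unobserved rewards are token-level, which is what licenses treating PPO-family algorithms as black boxes driven by a response-level reward model.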
