YouZum

News

AI, Committee, News, Uncategorized

Salesforce AI Introduces CRMArena-Pro: The First Multi-Turn and Enterprise-Grade Benchmark for LLM Agents

AI agents powered by LLMs show great promise for handling complex business tasks, especially in Customer Relationship Management (CRM). Evaluating their real-world effectiveness is difficult, however, because realistic business data is rarely publicly available. Existing benchmarks often focus on simple, single-turn interactions or narrow applications such as customer service, and miss broader domains including sales, Configure, Price, Quote (CPQ) processes, and B2B operations. They also fail to test how well agents manage sensitive information. These limitations make it hard to understand how LLM agents perform across the diverse range of real-world business scenarios and communication styles.

Previous benchmarks have largely focused on customer service tasks in B2C scenarios, overlooking key business operations such as sales and CPQ processes, as well as the unique challenges of B2B interactions, including longer sales cycles. Many also lack realism, ignoring multi-turn dialogue or skipping expert validation of tasks and environments. Another critical gap is the absence of confidentiality evaluation, which is vital in workplace settings where AI agents routinely handle sensitive business and customer data. Without assessing data awareness, these benchmarks fail to address serious practical concerns such as privacy, legal risk, and trust.

Researchers from Salesforce AI Research have introduced CRMArena-Pro, a benchmark designed to realistically evaluate LLM agents like Gemini 2.5 Pro in professional business environments. It features expert-validated tasks across customer service, sales, and CPQ, spanning both B2B and B2C contexts. The benchmark tests multi-turn conversations and assesses confidentiality awareness. Findings show that even top-performing models such as Gemini 2.5 Pro achieve only around 58% accuracy in single-turn tasks, with performance dropping to 35% in multi-turn settings. Workflow Execution is an exception, where Gemini 2.5 Pro exceeds 83%, but confidentiality handling remains a major challenge across all evaluated models.

CRMArena-Pro is built to rigorously test LLM agents in realistic business settings, including customer service, sales, and CPQ scenarios. It uses synthetic yet structurally accurate enterprise data, generated with GPT-4 and based on Salesforce schemas, and simulates business environments through sandboxed Salesforce Organizations. It features 19 tasks grouped under four key skills: database querying, textual reasoning, workflow execution, and policy compliance. CRMArena-Pro also includes multi-turn conversations with simulated users and tests confidentiality awareness. Expert evaluations confirmed the realism of the data and environment, supporting its use as a reliable testbed for LLM agent performance.

The evaluation compared top LLM agents across the 19 business tasks, focusing on task completion and awareness of confidentiality. Metrics varied by task type: exact match was used for structured outputs, and F1 score for generative responses. A GPT-4o-based LLM judge assessed whether models appropriately refused to share sensitive information. Models with advanced reasoning, such as Gemini-2.5-Pro and o1, clearly outperformed lighter or non-reasoning versions, especially in complex tasks. Performance was similar across B2B and B2C settings, although nuanced trends emerged depending on model strength. Confidentiality-aware prompts improved refusal rates but sometimes reduced task accuracy, highlighting a trade-off between privacy and performance.

In conclusion, CRMArena-Pro is a new benchmark designed to test how well LLM agents handle real-world business tasks in customer relationship management. It includes 19 expert-reviewed tasks across both B2B and B2C scenarios, covering sales, service, and pricing operations. Top agents reached about 58% success in single-turn tasks, but their performance dropped sharply to around 35% in multi-turn conversations. Workflow execution was the easiest area, while most other skills proved challenging. Confidentiality awareness was low, and improving it through prompting often reduced task accuracy. These findings reveal a clear gap between the capabilities of LLMs and the needs of enterprises.

All credit for this research goes to the researchers of this project. The post Salesforce AI Introduces CRMArena-Pro: The First Multi-Turn and Enterprise-Grade Benchmark for LLM Agents appeared first on MarkTechPost.
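
The two task metrics mentioned above, exact match for structured outputs and F1 for generative responses, can be sketched as follows. This is only an illustration, not the benchmark's own scoring code: the function names, the whitespace and case normalisation, and the choice of token-level F1 are all assumptions.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    """Exact match after trimming whitespace and lowercasing (assumed normalisation)."""
    return prediction.strip().lower() == reference.strip().lower()


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of token precision and recall (assumed variant)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers agree fully; one empty answer scores zero.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact match suits answers such as record identifiers, where a partially right answer is still wrong, while F1 gives partial credit to free-text answers that overlap with the reference.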

From Clicking to Reasoning: WebChoreArena Benchmark Challenges Agents with Memory-Heavy and Multi-Page Tasks

Web automation agents have become a growing focus in artificial intelligence, particularly due to their ability to execute human-like actions in digital environments. These agents interact with websites via Graphical User Interfaces (GUIs), mimicking human behaviors such as clicking, typing, and navigating across web pages. This approach bypasses the need for dedicated Application Programming Interfaces (APIs), which are often unavailable or limited in many web applications. Instead, these agents can operate universally across web domains, making them flexible tools for a broad range of tasks. The evolution of large language models (LLMs) has enabled these agents to not only interpret web content but also reason, plan, and act with increasing sophistication. As their abilities grow, so does the need to evaluate them on more than simple browsing tasks. Benchmarks that sufficed for early models can no longer measure the full extent of modern agents' capabilities.

As these web agents progress, a pressing issue arises: their competence in handling mundane, memory-intensive, and multi-step digital chores remains insufficiently measured. Many tasks that humans perform on websites, such as retrieving data from different pages, performing calculations based on previous inputs, or applying complex rules, require significant cognitive effort. These are not merely navigation challenges; they test memory, logic, and long-term planning. Yet most benchmarks focus on simplified scenarios and fail to reflect the types of digital chores people often prefer to avoid. The limitations of these benchmarks also become more apparent as agents improve. Ambiguities in task instructions or inconsistencies in expected outputs begin to skew evaluations. When agents generate reasonable but slightly divergent answers, they are penalized incorrectly because of vague task definitions. Such flaws make it difficult to distinguish true model limitations from benchmark shortcomings.

Previous efforts to evaluate web agents have focused on benchmarks such as WebArena. WebArena gained widespread adoption due to its reproducibility and its ability to simulate real-world websites, including Reddit, GitLab, and e-commerce platforms. It offered over 800 tasks designed to test an agent's ability to complete web-based goals within these environments. However, these tasks mostly focused on general browsing and did not adequately challenge more advanced agents. Other benchmarks, such as Mind2Web, GAIA, and MMInA, contributed by exploring real web tasks or platform-specific environments like ServiceNow, but each came with trade-offs. Some lacked interactivity, others did not support reproducibility, and some were too narrowly scoped. These limitations left a gap in measuring agent progress in areas that require complex decision-making, long-term memory, and accurate data processing across multiple webpages.

Researchers from the University of Tokyo introduced WebChoreArena. This expanded framework builds on the structure of WebArena but significantly increases task difficulty and complexity. WebChoreArena features a total of 532 newly curated tasks, distributed across the same four simulated websites. These tasks are designed to be more demanding, reflecting scenarios where agents must perform data aggregation, memory recall, and multi-step reasoning. Importantly, the benchmark was constructed to ensure full reproducibility and standardization, enabling fair comparisons between agents and avoiding the ambiguities found in earlier tools. The inclusion of diverse task types and input modalities helps simulate realistic web usage and evaluates agents on a more practical and challenging scale.

WebChoreArena categorizes its tasks into four main types. Massive Memory covers 117 tasks that require agents to extract and remember large volumes of information, such as compiling all customer names linked to high-value transactions. Calculation covers 132 tasks that involve arithmetic operations, like identifying the highest-spending months based on multiple data points. Long-Term Memory covers 127 tasks that test the agent's ability to connect information across various pages, such as retrieving pricing rules from one site and applying them on another. An additional 65 tasks are categorized as 'Others', including operations such as assigning labels in GitLab that do not fit traditional task formats. Each task specifies its input modality, with 451 tasks solvable with any observation type, 69 requiring only textual input, and 12 dependent exclusively on image inputs.

In evaluating the benchmark, the researchers used three prominent large language models: GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro. These were tested in conjunction with two advanced web agents, AgentOccam and BrowserGym. The results highlighted the increased difficulty of WebChoreArena compared to previous benchmarks. GPT-4o, which had achieved 42.8% accuracy on WebArena, managed only 6.8% on WebChoreArena. Claude 3.7 Sonnet and Gemini 2.5 Pro performed better, with Gemini reaching a peak accuracy of 44.9%. Even for the top performer, this result still reflected significant gaps in capability on the more complex tasks of WebChoreArena. The benchmark also proved more sensitive in detecting performance differences between models, making it a valuable tool for tracking ongoing advances in web agent technologies.

Several key takeaways from the research include:

- WebChoreArena includes 532 tasks: 117 Massive Memory, 132 Calculation, 127 Long-Term Memory, and 65 Others.
- Tasks are distributed across Shopping (117), Shopping Admin (132), Reddit (91), GitLab (127), and 65 Cross-site scenarios.
- Input types: 451 tasks are solvable with any input, 69 require textual input, and 12 need image input.
- GPT-4o scored only 6.8% on WebChoreArena compared to 42.8% on WebArena.
- Gemini 2.5 Pro achieved the highest score at 44.9%, indicating current limitations in handling complex tasks.
- WebChoreArena provides a clearer performance gradient between models than WebArena, enhancing benchmarking value.
- A total of 117 task templates were used to ensure diversity and reproducibility, with roughly 4.5 instances per template.
- The benchmark demanded over 300 hours of annotation and refinement, reflecting its rigorous construction.
- Evaluations use string matching, URL matching, and HTML structure comparisons to assess accuracy.

In conclusion, this research highlights the disparity between general browsing proficiency and the higher-order cognitive abilities necessary for web-based tasks. The newly introduced WebChoreArena stands as a robust and detailed benchmark designed specifically to push web agents into territories where they must rely on reasoning, memory, and logic. It replaces ambiguity with standardization, and its tasks mimic the digital drudgery that agents must learn to handle if they
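
The task counts quoted in this post can be cross-checked with a short tally. The sketch below uses only the figures stated above; the variable and function names are illustrative.

```python
# All figures are copied from the post; nothing here comes from the benchmark itself.
TOTAL_TASKS = 532
TEMPLATES = 117

by_site = {"Shopping": 117, "Shopping Admin": 132, "Reddit": 91,
           "GitLab": 127, "Cross-site": 65}
by_modality = {"any": 451, "text only": 69, "image only": 12}
by_type = {"Massive Memory": 117, "Calculation": 132,
           "Long-Term Memory": 127, "Others": 65}


def unaccounted(breakdown: dict, total: int = TOTAL_TASKS) -> int:
    """Number of tasks that a breakdown leaves out of the stated total."""
    return total - sum(breakdown.values())


site_gap = unaccounted(by_site)
modality_gap = unaccounted(by_modality)
type_gap = unaccounted(by_type)
instances_per_template = TOTAL_TASKS / TEMPLATES

print(site_gap, modality_gap, type_gap, round(instances_per_template, 2))
```

The per-site and per-modality breakdowns each account for all 532 tasks, and 532 tasks over 117 templates is consistent with "roughly 4.5 instances per template". The per-type figures as quoted add up to 441, which is 91 short of the total, so they are best read with some caution.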

Comparison of different Unique hard attention transformer models by the formal languages they can recognize

arXiv:2506.03370v1 (announce type: cross). Abstract: This note is a survey of various results on the capabilities of unique hard attention transformer encoders (UHATs) to recognize formal languages. We distinguish between masked vs. non-masked, finite vs. infinite image, and general vs. bilinear attention score functions. We recall some relations between these models, as well as a lower bound in terms of first-order logic and an upper bound in terms of circuit complexity.
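
For readers unfamiliar with the term, unique hard attention means that each query attends to exactly one position, the one with the highest score, with ties broken by position. The sketch below illustrates that selection rule only. The function name, the `leftmost` flag, and the `visible` parameter (a simple form of masking that hides later positions) are assumptions made for illustration, not definitions taken from the survey.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def unique_hard_attention(
    scores: Sequence[float],
    values: Sequence[T],
    leftmost: bool = True,
    visible: Optional[int] = None,
) -> Optional[T]:
    """Return the value at the single highest-scoring position.

    Ties go to the leftmost (or rightmost) maximal position. If `visible` is
    given, only the first `visible` positions are considered (assumed masking).
    """
    limit = len(scores) if visible is None else min(visible, len(scores))
    if limit <= 0:
        return None  # nothing to attend to
    best = max(scores[:limit])
    winners = [i for i in range(limit) if scores[i] == best]
    chosen = winners[0] if leftmost else winners[-1]
    return values[chosen]
```

Unlike softmax attention, the output here is one of the input values rather than a weighted average of them.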

PromptCanvas: Composable Prompting Workspaces Using Dynamic Widgets for Exploration and Iteration in Creative Writing

arXiv:2506.03741v1 (announce type: cross). Abstract: We introduce PromptCanvas, a concept that transforms prompting into a composable, widget-based experience on an infinite canvas. Users can generate, customize, and arrange interactive widgets representing various facets of their text, offering greater control over AI-generated content. PromptCanvas allows widget creation through system suggestions, user prompts, or manual input, providing a flexible environment tailored to individual needs. This enables deeper engagement with the creative process. In a lab study with 18 participants, PromptCanvas outperformed a traditional conversational UI on the Creativity Support Index. Participants found that it reduced cognitive load, with lower mental demand and frustration. Qualitative feedback revealed that the visual organization of thoughts and easy iteration encouraged new perspectives and ideas. A follow-up field study (N=10) confirmed these results, showcasing the potential of dynamic, customizable interfaces in improving collaborative writing with AI.

High Accuracy, Less Talk (HALT): Reliable LLMs through Capability-Aligned Finetuning

arXiv:2506.04051v1 (announce type: new). Abstract: Large Language Models (LLMs) currently respond to every prompt. However, they can produce incorrect answers when they lack knowledge or capability, a problem known as hallucination. We instead propose post-training an LLM to generate content only when confident in its correctness and to otherwise (partially) abstain. Specifically, our method, HALT, produces capability-aligned post-training data that encodes what the model can and cannot reliably generate. We generate this data by splitting responses of the pretrained LLM into factual fragments (atomic statements or reasoning steps), and use ground truth information to identify incorrect fragments. We achieve capability-aligned finetuning responses by either removing incorrect fragments or replacing them with “Unsure from Here”, according to a tunable threshold that allows practitioners to trade off response completeness and mean correctness of the response’s fragments. We finetune four open-source models for biography writing, mathematics, coding, and medicine with HALT for three different trade-off thresholds. HALT effectively trades off response completeness for correctness, increasing the mean correctness of response fragments by 15% on average, while resulting in a 4% improvement in the F1 score (mean of completeness and correctness of the response) compared to the relevant baselines. By tuning HALT for highest correctness, we train a single reliable Llama3-70B model with correctness increased from 51% to 87% across all four domains while maintaining 53% of the response completeness achieved with standard finetuning.
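
The fragment-editing step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Fragment` class, the per-fragment `score` in [0, 1], the way the threshold is applied, and the `sequential` flag that chooses between dropping a fragment and cutting the response off are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

UNSURE_MARKER = "Unsure from Here"  # marker text quoted in the abstract


@dataclass
class Fragment:
    text: str
    score: float  # hypothetical correctness score from a ground-truth check


def align_response(fragments: List[Fragment], threshold: float,
                   sequential: bool) -> List[str]:
    """Build a finetuning target that keeps only fragments judged reliable.

    Assumed rule: a fragment is kept when score >= threshold. For independent
    fragments (sequential=False) an unreliable fragment is simply dropped. For
    dependent reasoning steps (sequential=True) the response stops at the first
    unreliable step and the marker is appended.
    """
    kept: List[str] = []
    for fragment in fragments:
        if fragment.score >= threshold:
            kept.append(fragment.text)
        elif sequential:
            kept.append(UNSURE_MARKER)
            break
    return kept


# Usage: the same three fragments under both assumed modes.
steps = [Fragment("A", 1.0), Fragment("B", 0.2), Fragment("C", 0.9)]
print(align_response(steps, threshold=0.5, sequential=False))
print(align_response(steps, threshold=0.5, sequential=True))
```

In this sketch, raising the threshold keeps fewer fragments, which is one way to picture the completeness-versus-correctness trade-off the abstract describes.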

DiaBlo: Diagonal Blocks Are Sufficient For Finetuning

arXiv:2506.03230v1 (announce type: cross). Abstract: Finetuning is a critical step for adapting large language models (LLMs) to domain-specific downstream tasks. To mitigate the substantial computational and memory costs of full-model finetuning, Parameter-Efficient Finetuning (PEFT) methods have been proposed to update only a small subset of model parameters. However, performance gaps between PEFT approaches and full-model finetuning still exist. In this work, we present DiaBlo, a simple yet effective PEFT approach that updates only the diagonal blocks of selected model weight matrices. Unlike Low-Rank Adaptation (LoRA) and its variants, DiaBlo eliminates the need for low-rank matrix products, thereby avoiding the reliance on auxiliary initialization schemes or customized optimization strategies to improve convergence. This design leads to stable and robust convergence while maintaining comparable memory efficiency and training speed to LoRA. We conduct extensive experiments across a range of tasks, including commonsense reasoning, arithmetic reasoning, code generation, and safety alignment, to evaluate the effectiveness and efficiency of DiaBlo. Across these benchmarks, DiaBlo demonstrates strong and consistent performance while maintaining high memory efficiency and fast finetuning speed. Code is available at https://github.com/ziyangjoy/DiaBlo.
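
The idea of training only diagonal blocks can be illustrated with a small pure-Python sketch. It assumes the update is added to a frozen weight matrix as a block-diagonal term and that the blocks are equal in size; the function names and the parameter-count comparison with LoRA are illustrative and are not taken from the linked repository.

```python
from typing import List

Matrix = List[List[float]]


def block_diag_apply(x: List[float], blocks: List[Matrix]) -> List[float]:
    """Compute x @ blockdiag(blocks) without building the full matrix."""
    out: List[float] = []
    offset = 0
    for block in blocks:
        rows, cols = len(block), len(block[0])
        chunk = x[offset:offset + rows]  # the slice of x this block acts on
        for j in range(cols):
            out.append(sum(chunk[i] * block[i][j] for i in range(rows)))
        offset += rows
    return out


def adapted_forward(x: List[float], frozen_w: Matrix,
                    blocks: List[Matrix]) -> List[float]:
    """y = x @ W + x @ blockdiag(blocks); only `blocks` would be trained."""
    base = [sum(x[i] * frozen_w[i][j] for i in range(len(x)))
            for j in range(len(frozen_w[0]))]
    delta = block_diag_apply(x, blocks)
    return [b + d for b, d in zip(base, delta)]


def block_diag_param_count(d_in: int, d_out: int, num_blocks: int) -> int:
    """Trainable parameters for equal-sized diagonal blocks (assumed layout)."""
    return num_blocks * (d_in // num_blocks) * (d_out // num_blocks)


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for a rank-r LoRA pair of shapes (d_in, r) and (r, d_out)."""
    return rank * (d_in + d_out)
```

Under these assumptions, a 4096 by 4096 matrix split into 64 diagonal blocks has the same number of trainable parameters (262,144) as a rank-32 LoRA adapter on the same matrix, which gives a feel for the "comparable memory efficiency" the abstract mentions. If the blocks start at zero, the adapted layer initially reproduces the frozen layer exactly.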

A Survey on (M)LLM-Based GUI Agents

arXiv:2504.13865v2 (announce type: replace-cross). Abstract: Graphical User Interface (GUI) Agents have emerged as a transformative paradigm in human-computer interaction, evolving from rule-based automation scripts to sophisticated AI-driven systems capable of understanding and executing complex interface operations. This survey provides a comprehensive examination of the rapidly advancing field of LLM-based GUI Agents, systematically analyzing their architectural foundations, technical components, and evaluation methodologies. We identify and analyze four fundamental components that constitute modern GUI Agents: (1) perception systems that integrate text-based parsing with multimodal understanding for comprehensive interface comprehension; (2) exploration mechanisms that construct and maintain knowledge bases through internal modeling, historical experience, and external information retrieval; (3) planning frameworks that leverage advanced reasoning methodologies for task decomposition and execution; and (4) interaction systems that manage action generation with robust safety controls. Through rigorous analysis of these components, we reveal how recent advances in large language models and multimodal learning have revolutionized GUI automation across desktop, mobile, and web platforms. We critically examine current evaluation frameworks, highlighting methodological limitations in existing benchmarks while proposing directions for standardization. This survey also identifies key technical challenges, including accurate element localization, effective knowledge retrieval, long-horizon planning, and safety-aware execution control, while outlining promising research directions for enhancing GUI Agents’ capabilities. Our systematic review provides researchers and practitioners with a thorough understanding of the field’s current state and offers insights into future developments in intelligent interface automation.
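
The four components listed in the abstract can be pictured as stages of a single control loop. The sketch below is one possible arrangement, offered as an assumption rather than as the survey's architecture: the function name, the callable signatures, and the choice to stop when an action fails the safety check are all hypothetical.

```python
from typing import Callable, List, Optional


def run_gui_agent(
    goal: str,
    perceive: Callable[[], str],                     # (1) perception
    recall: Callable[[str, List[str]], str],         # (2) exploration / knowledge
    plan: Callable[[str, str, str], Optional[str]],  # (3) planning
    is_safe: Callable[[str], bool],                  # (4) interaction: safety control
    execute: Callable[[str], None],                  # (4) interaction: action execution
    max_steps: int = 10,
) -> List[str]:
    """Run a perceive, recall, plan, act loop and return the executed actions."""
    history: List[str] = []
    for _ in range(max_steps):
        screen = perceive()
        knowledge = recall(goal, history)
        action = plan(goal, screen, knowledge)
        if action is None:       # planner signals that the goal is reached
            break
        if not is_safe(action):  # assumed policy: stop rather than act unsafely
            break
        execute(action)
        history.append(action)
    return history
```

Passing the components in as plain callables keeps the loop independent of any particular model or platform, so the same skeleton could sit on top of a desktop, mobile, or web backend.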
