YouZum

Committee

AI, Committee, News, Uncategorized

Not All Options Are Created Equal: Textual Option Weighting for Token-Efficient LLM-Based Knowledge Tracing

arXiv:2410.12872v2 Announce Type: replace Abstract: Large Language Models (LLMs) have recently emerged as promising tools for knowledge tracing (KT) due to their strong reasoning and generalization abilities. While recent LLM-based KT methods have proposed new prompt formats, they struggle to represent the full interaction histories of example learners within a single prompt during in-context learning (ICL), resulting in limited scalability and high computational cost under token constraints. In this work, we present LLM-based Option-weighted Knowledge Tracing (LOKT), a simple yet effective framework that encodes the interaction histories of example learners in context as textual categorical option weights (TCOW). TCOW are semantic labels (e.g., “inadequate”) assigned to the options selected by learners when answering questions, enhancing the interpretability of LLMs. Experiments on multiple-choice datasets show that LOKT outperforms existing non-LLM and LLM-based KT models in both cold-start and warm-start settings. Moreover, LOKT enables scalable and cost-efficient inference, achieving strong performance even under strict token constraints. Our code is available at https://anonymous.4open.science/r/LOKT_model-3233.
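To make the idea of textual categorical option weights more concrete, here is a minimal sketch of how a learner's history might be compressed into short labelled lines for a prompt. It is not the authors' implementation: the abstract only mentions the label “inadequate”, so the other labels, the numeric cut-offs, the per-option weights, and the function names `weight_to_label` and `encode_history` are all assumptions made for illustration.

```python
# Hypothetical label scale. Only "inadequate" appears in the abstract;
# the other labels and every cut-off below are invented for illustration.
LABEL_BINS = [
    (0.25, "inadequate"),
    (0.50, "partial"),
    (0.75, "adequate"),
    (1.00, "proficient"),
]


def weight_to_label(weight: float) -> str:
    """Map a numeric option weight in [0, 1] to a textual category."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    for upper_bound, label in LABEL_BINS:
        if weight <= upper_bound:
            return label
    return LABEL_BINS[-1][1]  # not reached for valid input; kept for safety


def encode_history(interactions, option_weights) -> str:
    """Render one learner's history as compact text lines for a prompt.

    interactions: list of (question_id, chosen_option) tuples.
    option_weights: dict mapping (question_id, option) to a weight in [0, 1]
                    (assumed to be precomputed elsewhere).
    """
    lines = []
    for question_id, option in interactions:
        label = weight_to_label(option_weights[(question_id, option)])
        lines.append(f"{question_id}: chose {option} ({label})")
    return "\n".join(lines)


history = [("Q1", "B"), ("Q2", "D")]
weights = {("Q1", "B"): 0.1, ("Q2", "D"): 0.9}
print(encode_history(history, weights))
```

Each interaction costs one short line instead of the full question and option text, which is the kind of token saving the abstract describes.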


AI, Committee, News, Uncategorized

Teaching AI to Say ‘I Don’t Know’: A New Dataset Mitigates Hallucinations from Reinforcement Finetuning

Reinforcement finetuning uses reward signals to guide a large language model toward desirable behavior. This method sharpens the model’s ability to produce logical and structured outputs by reinforcing correct responses. Yet the challenge persists in ensuring that these models also know when not to respond, particularly when faced with incomplete or misleading questions that don’t have a definite answer.

The problem arises when language models, after reinforcement finetuning, begin to lose their ability to refuse to answer unclear or ambiguous queries. Instead of signaling uncertainty, the models tend to produce confidently stated but incorrect responses. This phenomenon, identified in the paper as the “hallucination tax,” highlights a growing risk. As models are trained to perform better, they may also become more likely to hallucinate answers in situations where silence would be more appropriate. This is especially hazardous in domains that require high trust and precision.

Tools currently used in training large language models often overlook the importance of refusal behavior. Reinforcement finetuning frameworks tend to reward only correct answers while penalizing incorrect ones, ignoring cases where a valid response should be no answer at all. The reward systems in use do not sufficiently reinforce refusal, resulting in overconfident models. For instance, the paper shows that refusal rates dropped to near zero across multiple models after standard reinforcement finetuning (RFT), demonstrating that current training fails to address hallucination properly.

Researchers from the University of Southern California developed the Synthetic Unanswerable Math (SUM) dataset. SUM introduces implicitly unanswerable math problems by modifying existing questions through criteria such as missing key information or creating logical inconsistencies. The researchers used DeepScaleR as the base dataset and employed the o3-mini model to generate high-quality unanswerable questions. This synthetic dataset aims to teach models to recognize when a problem lacks sufficient information and respond accordingly.

SUM’s core technique is to mix answerable and unanswerable problems during training. Questions are modified to become ambiguous or unsolvable while maintaining plausibility. The training prompts instruct models to say “I don’t know” for unanswerable inputs. By introducing only 10% of the SUM data into reinforcement finetuning, models begin to leverage inference-time reasoning to evaluate uncertainty. This structure allows them to refuse answers more appropriately without impairing their performance on solvable problems.

Performance analysis shows significant improvements. After training with SUM, the Qwen2.5-7B model increased its refusal rate from 0.01 to 0.73 on the SUM benchmark and from 0.01 to 0.81 on the UMWP benchmark. On the SelfAware dataset, refusal accuracy rose dramatically from 0.01 to 0.94. Llama-3.1-8B-Instruct showed a similar trend, with refusal rates improving from 0.00 to 0.75 on SUM and from 0.01 to 0.79 on UMWP. Despite these gains in refusal behavior, accuracy on answerable datasets, such as GSM8K and MATH-500, remained stable, with most changes ranging from 0.00 to -0.05. The minimal drop indicates that refusal training can be introduced without major sacrifices in task performance.

This study outlines a clear trade-off between improved reasoning and trustworthiness. Reinforcement finetuning, while powerful, tends to suppress cautious behavior. The SUM dataset corrects this by teaching models to recognize what they cannot solve. With only a small addition to training data, language models become better at identifying the boundaries of their knowledge. This approach marks a significant step in making AI systems not just smarter but also more careful and honest.

Check out the Paper and Dataset on Hugging Face. All credit for this research goes to the researchers of this project. The post Teaching AI to Say ‘I Don’t Know’: A New Dataset Mitigates Hallucinations from Reinforcement Finetuning appeared first on MarkTechPost.
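The mixing strategy and the refusal-aware reward described in this post can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's training code: the item format, the exact-match check, the interpretation of "10%" as the share of unanswerable items in the final mix, and the names `mix_training_set`, `is_refusal`, and `reward` are all hypothetical.

```python
import random

REFUSAL = "i don't know"


def mix_training_set(answerable, unanswerable, unanswerable_fraction=0.10, seed=0):
    """Return a shuffled list in which about `unanswerable_fraction` of the
    items are unanswerable (assumed reading of "10% of the SUM data")."""
    if not 0.0 <= unanswerable_fraction < 1.0:
        raise ValueError("unanswerable_fraction must lie in [0, 1)")
    rng = random.Random(seed)
    # Solve n_u / (n_a + n_u) = f for n_u.
    n_unanswerable = round(
        unanswerable_fraction * len(answerable) / (1.0 - unanswerable_fraction)
    )
    n_unanswerable = min(n_unanswerable, len(unanswerable))
    mixed = list(answerable) + rng.sample(list(unanswerable), n_unanswerable)
    rng.shuffle(mixed)
    return mixed


def is_refusal(response: str) -> bool:
    """Detect the refusal phrase, tolerating a typographic apostrophe."""
    return REFUSAL in response.lower().replace("\u2019", "'")


def reward(response: str, reference_answer) -> float:
    """Reward 1.0 for a correct answer, or for refusing an unanswerable item.

    reference_answer is None when the problem is unanswerable.
    """
    if reference_answer is None:
        return 1.0 if is_refusal(response) else 0.0
    if is_refusal(response):
        return 0.0  # refusing a solvable problem earns nothing
    return 1.0 if response.strip() == reference_answer.strip() else 0.0


answerable = [(f"q{i}", str(i)) for i in range(90)]
unanswerable = [(f"u{i}", None) for i in range(50)]
mixed = mix_training_set(answerable, unanswerable)
print(len(mixed), sum(1 for _, answer in mixed if answer is None))
```

The key design point is the first branch of `reward`: without it, refusing is never reinforced, which matches the post's explanation of why standard setups drive refusal rates toward zero.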


AI, Committee, News, Uncategorized

Alibaba Qwen Team Releases Qwen3-Embedding and Qwen3-Reranker Series – Redefining Multilingual Embedding and Ranking Standards

Text embedding and reranking are foundational to modern information retrieval systems, powering applications such as semantic search, recommendation systems, and retrieval-augmented generation (RAG). However, current approaches often face key challenges, particularly in achieving both high multilingual fidelity and task adaptability without relying on proprietary APIs. Existing models frequently fall short in scenarios requiring nuanced semantic understanding across multiple languages or domain-specific tasks like code retrieval and instruction following. Moreover, most open-source models either lack scale or flexibility, while commercial APIs remain costly and closed.

Qwen3-Embedding and Qwen3-Reranker: A New Standard for Open-Source Embedding

Alibaba’s Qwen Team has unveiled the Qwen3-Embedding and Qwen3-Reranker Series, models that set a new benchmark in multilingual text embedding and relevance ranking. Built on the Qwen3 foundation models, the series includes variants in 0.6B, 4B, and 8B parameter sizes and supports a wide range of languages (119 in total), making it one of the most versatile and performant open-source offerings to date. These models are now open-sourced under the Apache 2.0 license on Hugging Face, GitHub, and ModelScope, and are also accessible via Alibaba Cloud APIs.

These models are optimized for use cases such as semantic retrieval, classification, RAG, sentiment analysis, and code search, providing a strong alternative to existing solutions like Gemini Embedding and OpenAI’s embedding APIs.

Technical Architecture

Qwen3-Embedding models adopt a dense transformer-based architecture with causal attention, producing embeddings by extracting the hidden state corresponding to the [EOS] token. Instruction-awareness is a key feature: input queries are formatted as {instruction} {query}<|endoftext|>, enabling task-conditioned embeddings.
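The query format and the end-of-sequence pooling described above can be illustrated with a small, library-free sketch. It does not call the real Qwen3 models or any tokenizer; the helper names `format_query` and `last_token_pool` and the toy hidden states are assumptions. With causal attention only the final token has attended to the whole input, which is why its hidden state is the one taken as the embedding.

```python
EOS_TOKEN = "<|endoftext|>"


def format_query(instruction: str, query: str, eos: str = EOS_TOKEN) -> str:
    """Build the '{instruction} {query}<|endoftext|>' input string."""
    return f"{instruction} {query}{eos}"


def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state of the last non-padding token.

    hidden_states: list of per-token vectors (one list of floats per token).
    attention_mask: list of 1 (real token) or 0 (padding), same length.
    """
    real_positions = [i for i, m in enumerate(attention_mask) if m == 1]
    if not real_positions:
        raise ValueError("attention_mask contains no real tokens")
    return hidden_states[real_positions[-1]]


text = format_query("Retrieve passages that answer the question.", "what is slerp?")
print(text)

# Toy sequence: three real tokens followed by one padding position.
toy_hidden = [[0.1, 0.0], [0.2, 0.1], [0.3, 0.9], [0.0, 0.0]]
toy_mask = [1, 1, 1, 0]
print(last_token_pool(toy_hidden, toy_mask))
```

Changing only the instruction string changes the input, and therefore the embedding, for the same query, which is what makes the embeddings task-conditioned.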
The reranker models are trained with a binary classification format, judging document-query relevance in an instruction-guided manner using a token likelihood-based scoring function.

The models are trained using a robust multi-stage training pipeline:

Large-scale weak supervision: 150M synthetic training pairs generated using Qwen3-32B, covering retrieval, classification, STS, and bitext mining across languages and tasks.
Supervised fine-tuning: 12M high-quality data pairs are selected using cosine similarity (>0.7) and used to improve performance in downstream applications.
Model merging: Spherical linear interpolation (SLERP) of multiple fine-tuned checkpoints ensures robustness and generalization.

This synthetic data generation pipeline enables control over data quality, language diversity, task difficulty, and more, resulting in a high degree of coverage and relevance in low-resource settings.

Performance Benchmarks and Insights

The Qwen3-Embedding and Qwen3-Reranker series demonstrate strong empirical performance across several multilingual benchmarks.

On MMTEB (216 tasks across 250+ languages), Qwen3-Embedding-8B achieves a mean task score of 70.58, surpassing the Gemini and GTE-Qwen2 series.
On MTEB (English v2), Qwen3-Embedding-8B reaches 75.22, outperforming other open models including NV-Embed-v2 and GritLM-7B.
On MTEB-Code, Qwen3-Embedding-8B leads with 80.68, excelling in applications like code retrieval and Stack Overflow QA.

For reranking, Qwen3-Reranker-0.6B already outperforms Jina and BGE rerankers. Qwen3-Reranker-8B achieves 81.22 on MTEB-Code and 72.94 on MMTEB-R, marking state-of-the-art performance.

Ablation studies confirm the necessity of each training stage. Removing synthetic pretraining or model merging led to significant performance drops (up to 6 points on MMTEB), emphasizing their contributions.
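Two steps of the pipeline above, the cosine-similarity filter and SLERP merging, are simple enough to sketch in plain Python. This is a generic illustration rather than the Qwen team's code: the post does not say how checkpoints are flattened, how many are merged, or which interpolation factor is used, so the vector inputs, `t = 0.5`, and the function names are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def filter_pairs(pairs, threshold=0.7):
    """Keep (query_vec, doc_vec) pairs whose similarity exceeds the threshold."""
    return [(q, d) for q, d in pairs if cosine_similarity(q, d) > threshold]


def slerp(w0, w1, t):
    """Spherical linear interpolation between two flattened weight vectors."""
    cos_omega = max(-1.0, min(1.0, cosine_similarity(w0, w1)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < 1e-6:
        # Nearly parallel (or opposite) vectors: fall back to plain
        # linear interpolation to avoid dividing by ~0.
        return [(1.0 - t) * x + t * y for x, y in zip(w0, w1)]
    c0 = math.sin((1.0 - t) * omega) / sin_omega
    c1 = math.sin(t * omega) / sin_omega
    return [c0 * x + c1 * y for x, y in zip(w0, w1)]


pairs = [([1.0, 0.0], [1.0, 0.1]), ([1.0, 0.0], [0.0, 1.0])]
print(len(filter_pairs(pairs)))               # only the first pair survives
print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))     # midpoint stays on the unit circle
```

Unlike a plain average, which would shrink the midpoint of two orthogonal unit vectors to length about 0.71, SLERP keeps the interpolated vector on the arc between them.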
Conclusion

Alibaba’s Qwen3-Embedding and Qwen3-Reranker Series present a robust, open, and scalable solution to multilingual and instruction-aware semantic representation. With strong empirical results across MTEB, MMTEB, and MTEB-Code, these models bridge the gap between proprietary APIs and open-source accessibility. Their thoughtful training design, which leverages high-quality synthetic data, instruction-tuning, and model merging, positions them as ideal candidates for enterprise applications in search, retrieval, and RAG pipelines. By open-sourcing these models, the Qwen team not only pushes the boundaries of language understanding but also empowers the broader community to innovate on top of a solid foundation.

Check out the Paper, Technical details, Qwen3-Embedding and Qwen3-Reranker. All credit for this research goes to the researchers of this project. The post Alibaba Qwen Team Releases Qwen3-Embedding and Qwen3-Reranker Series – Redefining Multilingual Embedding and Ranking Standards appeared first on MarkTechPost.


AI, Committee, News, Uncategorized

Manus has kick-started an AI agent boom in China

Last year, China saw a boom in foundation models, the do-everything large language models that underpin the AI revolution. This year, the focus has shifted to AI agents: systems that are less about responding to users’ queries and more about autonomously accomplishing things for them.

There are now a host of Chinese startups building these general-purpose digital tools, which can answer emails, browse the internet to plan vacations, and even design an interactive website. Many of these have emerged in just the last two months, following in the footsteps of Manus, a general AI agent that sparked weeks of social media frenzy for invite codes after its limited-release launch in early March.

These emerging AI agents aren’t large language models themselves. Instead, they’re built on top of them, using a workflow-based structure designed to get things done. A lot of these systems also introduce a different way of interacting with AI. Rather than just chatting back and forth with users, they are optimized for managing and executing multistep tasks (booking flights, managing schedules, conducting research) by using external tools and remembering instructions.

China could take the lead on building these kinds of agents. The country’s tightly integrated app ecosystems, rapid product cycles, and digitally fluent user base could provide a favorable environment for embedding AI into daily life.

For now, its leading AI agent startups are focusing their attention on the global market, because the best Western models don’t operate inside China’s firewalls. But that could change soon: Tech giants like ByteDance and Tencent are preparing their own AI agents that could bake automation directly into their native super-apps, pulling data from their vast ecosystem of programs that dominate many aspects of daily life in the country.
As the race to define what a useful AI agent looks like unfolds, a mix of ambitious startups and entrenched tech giants are now testing how these tools might actually work in practice, and for whom.

Set the standard

It’s been a whirlwind few months for Manus, which was developed by the Wuhan-based startup Butterfly Effect. The company raised $75 million in a funding round led by the US venture capital firm Benchmark, took the product on an ambitious global roadshow, and hired dozens of new employees.

Even before registration opened to the public in May, Manus had become a reference point for what a broad, consumer-oriented AI agent should accomplish. Rather than handling narrow chores for businesses, this “general” agent is designed to be able to help with everyday tasks like trip planning, stock comparison, or your kid’s school project.

Unlike previous AI agents, Manus uses a browser-based sandbox that lets users supervise the agent like an intern, watching in real time as it scrolls through web pages, reads articles, or codes actions. It also proactively asks clarifying questions and supports long-term memory that serves as context for future tasks.

“Manus represents a promising product experience for AI agents,” says Ang Li, cofounder and CEO of Simular, a startup based in Palo Alto, California, that’s building computer use agents, AI agents that control a virtual computer. “I believe Chinese startups have a huge advantage when it comes to designing consumer products, thanks to cutthroat domestic competition that leads to fast execution and greater attention to product details.”

In the case of Manus, the competition is moving fast. Two of the buzziest follow-ups, Genspark and Flowith, for example, are already boasting benchmark scores that match or edge past Manus’s.

Genspark, led by former Baidu executives Eric Jing and Kay Zhu, links many small “super agents” through what it calls multi-component prompting.
The agent can switch among several large language models, accepts both images and text, and carries out tasks from making slide decks to placing phone calls. Whereas Manus relies heavily on Browser Use, a popular open-source product that lets agents operate a web browser in a virtual window like a human, Genspark directly integrates with a wide array of tools and APIs. Launched in April, it already has over 5 million users and over $36 million in yearly revenue, the company says.

Flowith, the work of a young team that first grabbed public attention in April 2025 at a developer event hosted by the popular social media app Xiaohongshu, takes a different tack. Marketed as an “infinite agent,” it opens on a blank canvas where each question becomes a node on a branching map. Users can backtrack, take new branches, and store results in personal or sharable “knowledge gardens,” a design that feels more like project management software (think Notion) than a typical chat interface. Every inquiry or task builds its own mind-map-like graph, encouraging a more nonlinear and creative interaction with AI. Flowith’s core agent, NEO, runs in the cloud and can perform scheduled tasks like sending emails and compiling files. The founders want the app to be a “knowledge marketbase” and aim to tap into the social aspect of AI, with the aspiration of becoming “the OnlyFans of AI knowledge creators.”

What they also share with Manus is the global ambition. Both Genspark and Flowith have stated that their primary focus is the international market.

A global address

Startups like Manus, Genspark, and Flowith, though founded by Chinese entrepreneurs, could blend seamlessly into the global tech scene and compete effectively abroad. Founders, investors, and analysts that MIT Technology Review has spoken to believe Chinese companies are moving fast, executing well, and quickly coming up with new products.

Money reinforces the pull to launch overseas.
Customers there pay more, and there are plenty to go around. “You can price in USD, and with the exchange rate that’s a sevenfold multiplier,” Manus cofounder Xiao Hong quipped on a podcast. “Even if we’re only operating at 10% power because of cultural differences overseas, we’ll still make more than in China.” But creating the same functionality in China is a challenge. Major US AI companies including OpenAI and Anthropic have opted out of mainland China because of geopolitical risks and challenges with regulatory compliance. Their absence initially created a black market as users


AI, Committee, News, Uncategorized

Salesforce AI Introduces CRMArena-Pro: The First Multi-Turn and Enterprise-Grade Benchmark for LLM Agents

AI agents powered by LLMs show great promise for handling complex business tasks, especially in areas like Customer Relationship Management (CRM). However, evaluating their real-world effectiveness is challenging due to the lack of publicly available, realistic business data. Existing benchmarks often focus on simple, one-turn interactions or narrow applications, such as customer service, missing out on broader domains, including sales, CPQ processes, and B2B operations. They also fail to test how well agents manage sensitive information. These limitations make it challenging to fully comprehend how LLM agents perform across the diverse range of real-world business scenarios and communication styles.  Previous benchmarks have largely focused on customer service tasks in B2C scenarios, overlooking key business operations, such as sales and CPQ processes, as well as the unique challenges of B2B interactions, including longer sales cycles. Moreover, many benchmarks lack realism, often ignoring multi-turn dialogue or skipping expert validation of tasks and environments. Another critical gap is the absence of confidentiality evaluation, vital in workplace settings where AI agents routinely engage with sensitive business and customer data. Without assessing data awareness, these benchmarks fail to address serious practical concerns, such as privacy, legal risk, and trust.  Researchers from Salesforce AI Research have introduced CRMArena-Pro, a benchmark designed to realistically evaluate LLM agents like Gemini 2.5 Pro in professional business environments. It features expert-validated tasks across customer service, sales, and CPQ, spanning both B2B and B2C contexts. The benchmark tests multi-turn conversations and assesses confidentiality awareness. Findings show that even top-performing models such as Gemini 2.5 Pro achieve only around 58% accuracy in single-turn tasks, with performance dropping to 35% in multi-turn settings. 
Workflow Execution is an exception, where Gemini 2.5 Pro exceeds 83%, but confidentiality handling remains a major challenge across all evaluated models.  CRMArena-Pro is a new benchmark created to rigorously test LLM agents in realistic business settings, including customer service, sales, and CPQ scenarios. Built using synthetic yet structurally accurate enterprise data generated with GPT-4 and based on Salesforce schemas, the benchmark simulates business environments through sandboxed Salesforce Organizations. It features 19 tasks grouped under four key skills: database querying, textual reasoning, workflow execution, and policy compliance. CRMArena-Pro also includes multi-turn conversations with simulated users and tests confidentiality awareness. Expert evaluations confirmed the realism of the data and environment, ensuring a reliable testbed for LLM agent performance.  The evaluation compared top LLM agents across 19 business tasks, focusing on task completion and awareness of confidentiality. Metrics varied by task type—exact match was used for structured outputs, and F1 score for generative responses. A GPT-4o-based LLM Judge assessed whether models appropriately refused to share sensitive information. Models like Gemini-2.5-Pro and o1, with advanced reasoning, clearly outperformed lighter or non-reasoning versions, especially in complex tasks. While performance was similar across B2B and B2C settings, nuanced trends emerged based on model strength. Confidentiality-aware prompts improved refusal rates but sometimes reduced task accuracy, highlighting a trade-off between privacy and performance.  In conclusion, CRMArena-Pro is a new benchmark designed to test how well LLM agents handle real-world business tasks in customer relationship management. It includes 19 expert-reviewed tasks across both B2B and B2C scenarios, covering sales, service, and pricing operations. 
While top agents performed decently in single-turn tasks (about 58% success), their performance dropped sharply to around 35% in multi-turn conversations. Workflow execution was the easiest area, but most other skills proved challenging. Confidentiality awareness was low, and improving it through prompting often reduced task accuracy. These findings reveal a clear gap between the capabilities of LLMs and the needs of enterprises.  Check out the Paper, GitHub Page, Hugging Face Page and Technical Blog. All credit for this research goes to the researchers of this project. The post Salesforce AI Introduces CRMArena-Pro: The First Multi-Turn and Enterprise-Grade Benchmark for LLM Agents appeared first on MarkTechPost.
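The two scoring rules mentioned in this post, exact match for structured outputs and F1 for generative responses, can be sketched as follows. The benchmark's own normalization and tokenization are not described here, so the lowercasing, the whitespace tokenization, and the function names are assumptions; this is the common token-overlap F1, not necessarily the exact variant CRMArena-Pro uses.

```python
from collections import Counter


def exact_match(prediction: str, gold: str) -> bool:
    """True when the answers agree after trimming and lowercasing."""
    return prediction.strip().lower() == gold.strip().lower()


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a generated answer and a reference."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match(" 0015g00000XyZ ", "0015g00000xyz"))
print(round(token_f1("the quote was approved", "quote approved"), 3))
```

Exact match suits answers such as record IDs, where a near miss is simply wrong, while F1 gives partial credit to free-text answers that contain the right content plus extra words.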


AI, Committee, News, Uncategorized

From Clicking to Reasoning: WebChoreArena Benchmark Challenges Agents with Memory-Heavy and Multi-Page Tasks

Web automation agents have become a growing focus in artificial intelligence, particularly due to their ability to execute human-like actions in digital environments. These agents interact with websites via Graphical User Interfaces (GUIs), mimicking human behaviors such as clicking, typing, and navigating across web pages. This approach bypasses the need for dedicated Application Programming Interfaces (APIs), which are often unavailable or limited in many web applications. Instead, these agents can operate universally across web domains, making them flexible tools for a broad range of tasks. The evolution of large language models (LLMs) has enabled these agents to not only interpret web content but also reason, plan, and act with increasing sophistication. As their abilities grow, so too does the need to evaluate them on more than just simple browsing tasks. Benchmarks that once sufficed for early models are no longer capable of measuring the full extent of modern agents’ capabilities. As these web agents progress, a pressing issue arises: their competence in handling mundane, memory-intensive, and multi-step digital chores remains insufficiently measured. Many tasks that humans perform on websites, such as retrieving data from different pages, performing calculations based on previous inputs, or applying complex rules, require significant cognitive effort. These are not merely navigation challenges; they test memory, logic, and long-term planning. Yet most benchmarks focus on simplified scenarios, failing to reflect the types of digital chores people often prefer to avoid. Furthermore, the limitations in these benchmarks become more apparent as agents improve their performance. Ambiguities in task instructions or inconsistencies in expected outputs begin to skew evaluations. When agents generate reasonable but slightly divergent answers, they are penalized incorrectly due to vague task definitions. 
Such flaws make it difficult to distinguish between true model limitations and benchmark shortcomings. Previous efforts to evaluate web agents have focused on benchmarks such as WebArena. WebArena gained widespread adoption due to its reproducibility and ability to simulate real-world websites, including Reddit, GitLab, and E-Commerce Platforms. It offered over 800 tasks designed to test an agent’s ability to complete web-based goals within these environments. However, these tasks mostly focused on general browsing and did not adequately challenge more advanced agents. Other benchmarks, such as Mind2Web, GAIA, and MMIn, contributed by exploring real web tasks or platform-specific environments like ServiceNow, but each came with trade-offs. Some lacked interactivity, others did not support reproducibility, and some were too narrowly scoped. These limitations created a gap in measuring agent progress in areas that require complex decision-making, long-term memory, and accurate data processing across multiple webpages. Researchers from the University of Tokyo introduced WebChoreArena. This expanded framework builds upon the structure of WebArena but significantly increases task difficulty and complexity. WebChoreArena features a total of 532 newly curated tasks, distributed across the same four simulated websites. These tasks are designed to be more demanding, reflecting scenarios where agents must engage in tasks like data aggregation, memory recall, and multi-step reasoning. Importantly, the benchmark was constructed to ensure full reproducibility and standardization, enabling fair comparisons between agents and avoiding the ambiguities found in earlier tools. The inclusion of diverse task types and input modalities helps simulate realistic web usage and evaluates agents on a more practical and challenging scale. WebChoreArena categorizes its tasks into four main types. 
One hundred seventeen tasks fall under Massive Memory, requiring agents to extract and remember large volumes of information, such as compiling all customer names linked to high-value transactions. Calculation tasks, which include 132 entries, involve arithmetic operations like identifying the highest spending months based on multiple data points. Long-Term Memory tasks number 127 and test the agent’s ability to connect information across various pages, such as retrieving pricing rules from one site and applying them on another. An additional 65 tasks are categorized as ‘Others’, including operations such as assigning labels in GitLab that do not fit traditional task formats. Each task specifies its input modality, with 451 tasks solvable with any observation type, 69 requiring only textual input, and 12 dependent exclusively on image inputs.

In evaluating the benchmark, the researchers used three prominent large language models: GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro. These were tested in conjunction with two advanced web agents, AgentOccam and BrowserGym. The results highlighted the increased difficulty of WebChoreArena compared to previous benchmarks. GPT-4o, which had achieved 42.8% accuracy on WebArena, managed only 6.8% on WebChoreArena. Claude 3.7 Sonnet and Gemini 2.5 Pro performed better, with Gemini reaching a peak accuracy of 44.9%. Despite being the top performer, this result still reflected significant gaps in capability when dealing with the more complex tasks of WebChoreArena. The benchmark also proved more sensitive in detecting performance differences between models, making it a valuable tool for benchmarking ongoing advances in web agent technologies.

Several Key Takeaways from the research include:

WebChoreArena includes 532 tasks: 117 Massive Memory, 132 Calculation, 127 Long-Term Memory, and 65 Others.
Tasks are distributed across Shopping (117), Shopping Admin (132), Reddit (91), GitLab (127), and 65 Cross-site scenarios.
Input types: 451 tasks are solvable with any input, 69 require textual input, and 12 need image input.
GPT-4o scored only 6.8% on WebChoreArena compared to 42.8% on WebArena.
Gemini 2.5 Pro achieved the highest score at 44.9%, indicating current limitations in handling complex tasks.
WebChoreArena provides a clearer performance gradient between models than WebArena, enhancing benchmarking value.
A total of 117 task templates were used to ensure diversity and reproducibility across roughly 4.5 instances per template.
The benchmark demanded over 300 hours of annotation and refinement, reflecting its rigorous construction.
Evaluations utilize string matching, URL matching, and HTML structure comparisons to assess accuracy.

In conclusion, this research highlights the disparity between general browsing proficiency and the higher-order cognitive abilities necessary for web-based tasks. The newly introduced WebChoreArena stands as a robust and detailed benchmark designed specifically to push web agents into territories where they must rely on reasoning, memory, and logic. It replaces ambiguity with standardization, and its tasks mimic the digital drudgery that agents must learn to handle if they
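The counts quoted in this post can be checked against the stated total of 532 with a few lines of arithmetic. The sketch below uses only the figures as they appear in the text; the variable names are illustrative. The per-site and per-input breakdowns each add up to 532, while the four task-type counts as quoted add up to 441, so the type breakdown as written does not account for every task.

```python
TOTAL_TASKS = 532
TEMPLATES = 117

# Figures copied from the post; nothing here comes from another source.
by_site = {"Shopping": 117, "Shopping Admin": 132, "Reddit": 91,
           "GitLab": 127, "Cross-site": 65}
by_type = {"Massive Memory": 117, "Calculation": 132,
           "Long-Term Memory": 127, "Others": 65}
by_input = {"any": 451, "text only": 69, "image only": 12}


def breakdown_report():
    """Compare each quoted breakdown with the stated total."""
    report = {}
    for name, counts in (("site", by_site), ("type", by_type), ("input", by_input)):
        subtotal = sum(counts.values())
        report[name] = (subtotal, TOTAL_TASKS - subtotal)
    return report


for name, (subtotal, missing) in breakdown_report().items():
    print(f"{name:5s} sum = {subtotal:3d}, difference from 532 = {missing}")

# Average number of task instances per template.
print(round(TOTAL_TASKS / TEMPLATES, 2))
```

The last line gives about 4.55 instances per template, consistent with the "roughly 4.5" figure in the takeaways.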


AI, Committee, News, Uncategorized

Comparison of different Unique hard attention transformer models by the formal languages they can recognize

arXiv:2506.03370v1 Announce Type: cross Abstract: This note is a survey of various results on the capabilities of unique hard attention transformer encoders (UHATs) to recognize formal languages. We distinguish between masked vs. non-masked, finite vs. infinite image and general vs. bilinear attention score functions. We recall some relations between these models, as well as a lower bound in terms of first-order logic and an upper bound in terms of circuit complexity.
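As a rough illustration of what "unique hard attention" means, the sketch below has each query attend to exactly one position, the one with the highest attention score, instead of taking a softmax-weighted average. The tie-breaking rule, the form of the mask, and the function name are assumptions for illustration; the note itself may define these details differently.

```python
def unique_hard_attention(scores, values, mask=None, tie_break="leftmost"):
    """Return the value at the single position with the maximal score.

    scores: attention scores of one query against every position.
    values: value vectors (or any payload), one per position.
    mask:   optional list of booleans; False positions cannot be attended
            (the "masked" variant). None means every position is allowed.
    """
    allowed = [i for i in range(len(scores)) if mask is None or mask[i]]
    if not allowed:
        raise ValueError("no position is available to attend to")
    best = max(scores[i] for i in allowed)
    winners = [i for i in allowed if scores[i] == best]
    # Ties are resolved to one position so that the choice stays unique.
    index = winners[0] if tie_break == "leftmost" else winners[-1]
    return values[index]


scores = [1.0, 3.0, 3.0, 2.0]
values = ["a", "b", "c", "d"]
print(unique_hard_attention(scores, values))                          # leftmost maximum
print(unique_hard_attention(scores, values, tie_break="rightmost"))   # rightmost maximum
print(unique_hard_attention(scores, values, mask=[True, False, False, True]))
```

Because the output is always one of the input values rather than a blend, the layer behaves like a discrete selection, which is what makes comparisons with logic and circuit classes natural.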


AI, Committee, News, Uncategorized

High Accuracy, Less Talk (HALT): Reliable LLMs through Capability-Aligned Finetuning

arXiv:2506.04051v1 Announce Type: new Abstract: Large Language Models (LLMs) currently respond to every prompt. However, they can produce incorrect answers when they lack knowledge or capability — a problem known as hallucination. We instead propose post-training an LLM to generate content only when confident in its correctness and to otherwise (partially) abstain. Specifically, our method, HALT, produces capability-aligned post-training data that encodes what the model can and cannot reliably generate. We generate this data by splitting responses of the pretrained LLM into factual fragments (atomic statements or reasoning steps), and use ground truth information to identify incorrect fragments. We achieve capability-aligned finetuning responses by either removing incorrect fragments or replacing them with “Unsure from Here” — according to a tunable threshold that allows practitioners to trade off response completeness and mean correctness of the response’s fragments. We finetune four open-source models for biography writing, mathematics, coding, and medicine with HALT for three different trade-off thresholds. HALT effectively trades off response completeness for correctness, increasing the mean correctness of response fragments by 15% on average, while resulting in a 4% improvement in the F1 score (mean of completeness and correctness of the response) compared to the relevant baselines. By tuning HALT for highest correctness, we train a single reliable Llama3-70B model with correctness increased from 51% to 87% across all four domains while maintaining 53% of the response completeness achieved with standard finetuning.
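The fragment-level editing described in the abstract might look roughly like the sketch below. It is one possible reading, not the HALT implementation: the abstract does not say how the threshold selects between removing incorrect fragments and inserting “Unsure from Here”, so the rule used here (drop incorrect fragments when their share is at or below the threshold, otherwise cut the response at the first incorrect fragment) and the function name `build_target_response` are assumptions.

```python
UNSURE_MARKER = "Unsure from Here"


def build_target_response(fragments, threshold):
    """Turn a fragment-labelled response into a finetuning target.

    fragments: list of (text, is_correct) tuples in response order.
    threshold: assumed to be the largest tolerated share of incorrect
               fragments before the response is cut short.
    """
    wrong = sum(1 for _, is_correct in fragments if not is_correct)
    if wrong == 0:
        return [text for text, _ in fragments]
    if wrong / len(fragments) <= threshold:
        # Few errors: keep the response but drop the incorrect fragments.
        return [text for text, is_correct in fragments if is_correct]
    # Many errors: keep everything before the first error, then abstain.
    first_wrong = next(i for i, (_, ok) in enumerate(fragments) if not ok)
    return [text for text, _ in fragments[:first_wrong]] + [UNSURE_MARKER]


response = [("Step 1", True), ("Step 2", False), ("Step 3", True), ("Step 4", True)]
print(build_target_response(response, threshold=0.5))
print(build_target_response(response, threshold=0.1))
```

A lower threshold yields shorter but more reliable targets, mirroring the completeness-versus-correctness trade-off the abstract reports.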

