CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

arXiv:2508.03686v1 Announce Type: new Abstract: Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers, but also for serving as the reward model that guides LLM optimization. Most evaluation frameworks rely on regularized matching or employ general LLMs for answer verification, which demands extensive, repetitive customization of regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability across different domains. In this work, we develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types, including multi-subproblem, formula, and sequence answers, while effectively identifying abnormal/invalid responses. We also introduce VerifierBench, a benchmark comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier. We anticipate that CompassVerifier and VerifierBench will facilitate research on answer verification, evaluation protocols, and reinforcement learning. Code and dataset are available at https://github.com/open-compass/CompassVerifier.
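The abstract positions CompassVerifier as a drop-in judge for both evaluation and outcome reward. Below is a minimal sketch of how such a lightweight verifier could be called; the Hugging Face model ID and the prompt format are assumptions for illustration, not taken from the paper (the official repository documents the real ones):

# Minimal sketch: asking a verifier model to judge a candidate answer.
# Assumptions (not from the abstract): the model ID and prompt template
# below are hypothetical; consult the official repo for the released
# checkpoints and the exact prompt format.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "open-compass/CompassVerifier"  # hypothetical model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

prompt = (
    "Question: What is 12 * 9?\n"
    "Gold answer: 108\n"
    "Model response: The product of 12 and 9 is 108.\n"
    "Is the model response correct? Answer Yes or No."
)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=4)
verdict = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True)
print(verdict)  # "Yes"/"No", usable directly as a binary outcome reward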


Google AI Releases LangExtract: An Open Source Python Library that Extracts Structured Data from Unstructured Text Documents

In today's data-driven world, valuable insights are often buried in unstructured text, be it clinical notes, lengthy legal contracts, or customer feedback threads. Extracting meaningful, traceable information from these documents is both a technical and practical challenge. Google AI's new open-source Python library, LangExtract, is designed to address this gap directly, using LLMs like Gemini to deliver powerful, automated extraction with traceability and transparency at its core.

Key Innovations of LangExtract

1. Declarative and Traceable Extraction. LangExtract lets users define custom extraction tasks using natural language instructions and high-quality "few-shot" examples. This empowers developers and analysts to specify exactly which entities, relationships, or facts to extract, and in what structure. Crucially, every extracted piece of information is tied directly back to its source text, enabling validation, auditing, and end-to-end traceability.

2. Domain Versatility. The library works not just in tech demos but in critical real-world domains, including health (clinical notes, medical reports), finance (summaries, risk documents), law (contracts), research literature, and even the arts (analyzing Shakespeare). Original use cases include automatic extraction of medications, dosages, and administration details from clinical documents, as well as relationships and emotions from plays or literature.

3. Schema Enforcement with LLMs. Powered by Gemini and compatible with other LLMs, LangExtract enables enforcement of custom output schemas (like JSON), so results aren't just accurate; they're immediately usable in downstream databases, analytics, or AI pipelines. It addresses traditional LLM weaknesses around hallucination and schema drift by grounding outputs in both the user instructions and the actual source text.

4. Scalability and Visualization. LangExtract efficiently processes long documents by chunking, parallelizing, and aggregating results (see the sketch at the end of this post). Developers can generate interactive HTML reports that show each extracted entity in context by highlighting its location in the original document, making auditing and error analysis seamless. It works in Google Colab, Jupyter, or as standalone HTML files, supporting a rapid feedback loop for developers and researchers.

5. Installation and Usage. Install easily with pip:

pip install langextract

Example workflow (extracting character information from Shakespeare):

import textwrap
import langextract as lx

# 1. Define your prompt
prompt = textwrap.dedent("""
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.
""")

# 2. Give a high-quality example
examples = [
    lx.data.ExampleData(
        text=("ROMEO. But soft! What light through yonder window breaks? "
              "It is the east, and Juliet is the sun."),
        extractions=[
            lx.data.Extraction(extraction_class="character",
                               extraction_text="ROMEO",
                               attributes={"emotional_state": "wonder"}),
            lx.data.Extraction(extraction_class="emotion",
                               extraction_text="But soft!",
                               attributes={"feeling": "gentle awe"}),
            lx.data.Extraction(extraction_class="relationship",
                               extraction_text="Juliet is the sun",
                               attributes={"type": "metaphor"}),
        ],
    )
]

# 3. Extract from new text
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-pro",
)

# 4. Save and visualize results
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl")
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    f.write(html_content)

This produces structured, source-anchored JSON outputs, plus an interactive HTML visualization for easy review and demonstration.

Specialized and Real-World Applications

Medicine: Extracts medications, dosages, and timing, and links them back to source sentences. Powered by insights from research on accelerating medical information extraction, LangExtract's approach is directly applicable to structuring clinical and radiology reports, improving clarity and supporting interoperability.

Finance and Law: Automatically pulls relevant clauses, terms, or risks from dense legal or financial text, ensuring every output can be traced back to its context.

Research and Data Mining: Streamlines high-throughput extraction from thousands of scientific papers.

The team also provides a demonstration called RadExtract for structuring radiology reports, highlighting not just what was extracted but exactly where the information appeared in the original input.

How LangExtract Compares

Feature               | Traditional Approaches   | LangExtract Approach
Schema Consistency    | Often manual/error-prone | Enforced via instructions & few-shot examples
Result Traceability   | Minimal                  | All output linked to input text
Scaling to Long Texts | Windowed, lossy          | Chunked + parallel extraction, then aggregation
Visualization         | Custom, usually absent   | Built-in, interactive HTML reports
Deployment            | Rigid, model-specific    | Gemini-first, open to other LLMs & on-premises

In Summary

LangExtract delivers declarative, explainable extraction; traceable results backed by source context; instant visualization for rapid iteration; and easy integration into any Python workflow. The post Google AI Releases LangExtract: An Open Source Python Library that Extracts Structured Data from Unstructured Text Documents appeared first on MarkTechPost.
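Building on the workflow above, the Scalability point can be made concrete. The sketch below shows long-document extraction with chunking and parallel workers; the parameter names (extraction_passes, max_workers, max_char_buffer) follow the project's README at the time of writing, so verify them against the current documentation:

# Sketch: long-document extraction with chunking and parallel workers.
# Parameter names follow the LangExtract README; check current docs.
import langextract as lx

result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",  # full play
    prompt_description=prompt,   # reuse the prompt and examples defined above
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,     # re-run extraction passes to improve recall
    max_workers=20,          # process chunks in parallel
    max_char_buffer=1000,    # smaller chunks for more precise source grounding
)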


CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions

arXiv:2508.01674v1 Announce Type: new Abstract: Personalization of Large Language Models (LLMs) often assumes that users hold static preferences that apply globally across all tasks. In reality, humans hold dynamic preferences that change depending on the context. As users interact with an LLM in various contexts, they naturally reveal their contextual preferences, which a model must infer and apply in future contexts to ensure alignment. To assess this, we introduce CUPID, a benchmark of 756 human-curated interaction session histories between users and LLM-based chat assistants. In each interaction session, the user provides a request in a specific context and expresses their preference through multi-turn feedback. Given a new user request and prior interaction sessions, our benchmark assesses whether LLMs can infer the preference relevant to this request and generate a response that satisfies it. With CUPID, we evaluated 10 open and proprietary LLMs, revealing that state-of-the-art LLMs struggle to infer preferences from multi-turn interactions and fail to discern what previous context is relevant to a new request, achieving under 50% precision and 65% recall. Our work highlights the need to advance LLM capabilities for more contextually personalized interactions and proposes CUPID as a resource to drive these improvements.
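The reported precision and recall measure whether the model attends to the right prior sessions. A tiny sketch of that retrieval-style scoring, with function and field names invented for illustration rather than taken from CUPID's actual harness:

# Sketch: scoring which prior sessions the model treated as relevant
# against the human annotation. Names here are illustrative only.
def precision_recall(predicted: set[int], gold: set[int]) -> tuple[float, float]:
    tp = len(predicted & gold)                      # correctly identified sessions
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

p, r = precision_recall({0, 2, 5}, {2, 5, 7})
print(f"precision={p:.2f} recall={r:.2f}")          # precision=0.67 recall=0.67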


FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning

arXiv:2506.16123v2 Announce Type: replace Abstract: This paper presents FinCoT, a structured chain-of-thought (CoT) prompting framework that embeds domain-specific expert financial reasoning blueprints to guide large language models' behaviors. We identify three main prompting styles in financial NLP (FinNLP): (1) standard prompting (zero-shot), (2) unstructured CoT (free-form reasoning), and (3) structured CoT (with explicitly structured reasoning steps). Prior work has mainly focused on the first two, while structured CoT remains underexplored and lacks domain expertise incorporation. Therefore, we evaluate all three prompting approaches across ten CFA-style financial domains and introduce FinCoT as the first structured finance-specific prompting approach incorporating blueprints from domain experts. FinCoT improves the accuracy of a general-purpose model, Qwen3-8B-Base, from 63.2% to 80.5%, and boosts Fin-R1 (7B), a finance-specific model, from 65.7% to 75.7%, while reducing output length by up to 8.9x and 1.16x compared to structured CoT methods, respectively. We find that FinCoT proves most effective for models lacking financial post-training. Our findings show that FinCoT not only improves performance and reduces inference costs but also yields more interpretable and expert-aligned reasoning traces.
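To make "blueprint-guided structured CoT" concrete, here is an illustrative sketch of building such a prompt. The blueprint steps below are invented for illustration; the paper derives its blueprints from CFA domain experts:

# Illustrative sketch of a structured, blueprint-guided CoT prompt in the
# spirit of FinCoT. The blueprint steps are invented, not the paper's.
BLUEPRINT = [
    "1. Identify the financial concept being tested.",
    "2. Extract the given quantities and their units.",
    "3. Select the relevant formula and justify the choice.",
    "4. Compute step by step.",
    "5. Sanity-check the magnitude and sign of the result.",
]

def build_fincot_prompt(question: str) -> str:
    steps = "\n".join(BLUEPRINT)
    return (
        "Answer the following CFA-style question. "
        "Reason strictly by following this expert blueprint:\n"
        f"{steps}\n\nQuestion: {question}\nAnswer:"
    )

print(build_fincot_prompt("A bond pays a 5% annual coupon on a $1,000 face value..."))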


One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models

arXiv:2505.07167v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have been extensively used across diverse domains, including virtual assistants, automated code generation, and scientific research. However, they remain vulnerable to jailbreak attacks, which manipulate the models into generating harmful responses despite safety alignment. Recent studies have shown that current safety-aligned LLMs often exhibit shallow safety alignment, where the first few tokens largely determine whether the response will be harmful. Through comprehensive observations, we find that safety-aligned LLMs and various defense strategies generate highly similar initial tokens in their refusal responses, which we define as safety trigger tokens. Building on this insight, we propose D-STT, a simple yet effective defense algorithm that identifies and explicitly decodes the safety trigger tokens of a given safety-aligned LLM to trigger the model's learned safety patterns. In this process, the safety trigger is constrained to a single token, which effectively preserves model usability by introducing minimal intervention into the decoding process. Extensive experiments across diverse jailbreak attacks and benign prompts demonstrate that D-STT significantly reduces output harmfulness while preserving model usability and incurring negligible response-time overhead, outperforming ten baseline methods.
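A minimal sketch of the single-trigger-token idea: force the first decoded token to be the model's safety trigger token (for example, the first token of its stock refusal), then let ordinary decoding continue. The model name and the choice of trigger string are placeholders, not the paper's settings:

# Sketch: force-decode one safety trigger token, then generate normally.
# Model name and trigger string are placeholders, not the paper's choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder safety-aligned LLM
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt_ids = tokenizer("[INST] <possibly adversarial prompt> [/INST]",
                       return_tensors="pt").input_ids
# The trigger token would be identified once, offline, from the model's
# refusal responses; "I" (as in "I cannot...") stands in for it here.
trigger_id = tokenizer.encode("I", add_special_tokens=False)[0]
forced = torch.cat([prompt_ids, torch.tensor([[trigger_id]])], dim=-1)

out = model.generate(forced, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))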


AgentArmor: Enforcing Program Analysis on Agent Runtime Trace to Defend Against Prompt Injection

arXiv:2508.01249v1 Announce Type: cross Abstract: Large Language Model (LLM) agents offer a powerful new paradigm for solving various problems by combining natural language reasoning with the execution of external tools. However, their dynamic and non-transparent behavior introduces critical security risks, particularly in the presence of prompt injection attacks. In this work, we propose a novel insight that treats agent runtime traces as structured programs with analyzable semantics. Accordingly, we present AgentArmor, a program analysis framework that converts agent traces into graph-based intermediate program representations (e.g., control flow graphs, data flow graphs, and program dependence graphs) and enforces security policies via a type system. AgentArmor consists of three key components: (1) a graph constructor that reconstructs the agent's working traces as graph-based intermediate representations with control flow and data flow described within; (2) a property registry that attaches security-relevant metadata to the tools and data the agent interacts with; and (3) a type system that performs static inference and checking over the intermediate representation. By representing agent behavior as structured programs, AgentArmor enables program analysis over sensitive data flows, trust boundaries, and policy violations. We evaluate AgentArmor on the AgentDojo benchmark; the results show that AgentArmor achieves a true positive rate of 95.75% with a false positive rate of only 3.66%. Our results demonstrate AgentArmor's ability to detect prompt injection vulnerabilities and enforce fine-grained security constraints.
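A toy sketch of the core idea: rebuild a dependency graph from an agent trace and flag flows from untrusted sources into sensitive sinks. The trace format and policy below are invented for illustration; AgentArmor's real intermediate representations and type system are far richer:

# Toy taint-style check over a reconstructed agent-trace dependency graph.
# The trace schema and the policy are illustrative, not AgentArmor's API.
trace = [
    {"id": 1, "tool": "web.fetch",  "reads": [],  "trust": "untrusted"},
    {"id": 2, "tool": "llm.plan",   "reads": [1], "trust": "derived"},
    {"id": 3, "tool": "shell.exec", "reads": [2], "trust": "sensitive_sink"},
]

def tainted(step_id, steps_by_id):
    step = steps_by_id[step_id]
    if step["trust"] == "untrusted":
        return True
    return any(tainted(dep, steps_by_id) for dep in step["reads"])

steps_by_id = {s["id"]: s for s in trace}
for s in trace:
    if s["trust"] == "sensitive_sink" and any(tainted(d, steps_by_id) for d in s["reads"]):
        print(f"policy violation: untrusted data reaches {s['tool']}")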


Team “better_call_claude”: Style Change Detection using a Sequential Sentence Pair Classifier

arXiv:2508.00675v1 Announce Type: new Abstract: Style change detection, identifying the points in a document where writing style shifts, remains one of the most important and challenging problems in computational authorship analysis. At PAN 2025, the shared task challenges participants to detect style switches at the most fine-grained level: individual sentences. The task spans three datasets, each designed with controlled and increasing thematic variety within documents. We propose to address this problem by modeling the content of each problem instance, that is, a series of sentences, as a whole, using a Sequential Sentence Pair Classifier (SSPC). The architecture leverages a pre-trained language model (PLM) to obtain representations of individual sentences, which are then fed into a bidirectional LSTM (BiLSTM) to contextualize them within the document. The BiLSTM-produced vectors of adjacent sentences are concatenated and passed to a multi-layer perceptron that makes a prediction for each adjacency. Building on the work of previous PAN participants and on classical text segmentation, the approach is relatively conservative and lightweight. Nevertheless, it proves effective in leveraging contextual information and addressing what is arguably the most challenging aspect of this year's shared task: the notorious problem of "stylistically shallow", short sentences that are prevalent in the proposed benchmark data. Evaluated on the official PAN 2025 test datasets, the model achieves strong macro-F1 scores of 0.923, 0.828, and 0.724 on the EASY, MEDIUM, and HARD data, respectively, outperforming not only the official random baselines but also a much more challenging one: claude-3.7-sonnet's zero-shot performance.
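The described pipeline (PLM sentence embeddings, a BiLSTM over the sentence sequence, concatenated adjacent hidden states, an MLP per adjacency) maps directly onto a few lines of PyTorch. Dimensions below are illustrative, not the paper's hyperparameters:

# Sketch of the SSPC architecture as described in the abstract.
import torch
import torch.nn as nn

class SSPC(nn.Module):
    def __init__(self, plm_dim=768, hidden=256):
        super().__init__()
        self.bilstm = nn.LSTM(plm_dim, hidden, batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(4 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, sent_embs):            # (batch, n_sents, plm_dim)
        ctx, _ = self.bilstm(sent_embs)      # (batch, n_sents, 2*hidden)
        pairs = torch.cat([ctx[:, :-1], ctx[:, 1:]], dim=-1)  # adjacent pairs
        return self.mlp(pairs).squeeze(-1)   # (batch, n_sents - 1) boundary logits

logits = SSPC()(torch.randn(2, 10, 768))     # 9 adjacency predictions per document
print(logits.shape)                          # torch.Size([2, 9])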


IFEvalCode: Controlled Code Generation

arXiv:2507.22462v2 Announce Type: replace Abstract: Code large language models (Code LLMs) have made significant progress in code generation by translating natural language descriptions into functional code; however, real-world applications often demand stricter adherence to detailed requirements such as coding style, line count, and structural constraints, beyond mere correctness. To address this, the paper introduces forward and backward constraint generation to improve the instruction-following capabilities of Code LLMs in controlled code generation, ensuring outputs align more closely with human-defined guidelines. The authors further present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages (Python, Java, JavaScript, TypeScript, Shell, C++, and C#), with each sample featuring both Chinese and English queries. Unlike existing benchmarks, IFEvalCode decouples evaluation into two metrics: correctness (Corr.) and instruction-following (Instr.), enabling a more nuanced assessment. Experiments on over 40 LLMs reveal that closed-source models outperform open-source ones in controllable code generation and highlight a significant gap between models' ability to generate correct code versus code that precisely follows instructions.
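The decoupled scoring idea can be illustrated in a few lines: a sample passes Corr. if its tests pass, and Instr. if it obeys the stated constraints (here, a line-count limit). The constraint and unit test below are invented for illustration, not drawn from the benchmark:

# Sketch of decoupled Corr./Instr. scoring; constraint and test are invented.
def check_corr(code: str) -> bool:
    ns = {}
    exec(code, ns)                      # a real harness would sandbox this
    return ns["add"](2, 3) == 5         # unit test for functional correctness

def check_instr(code: str, max_lines: int = 3) -> bool:
    return len(code.strip().splitlines()) <= max_lines  # the stated constraint

sample = "def add(a, b):\n    return a + b"
print(check_corr(sample), check_instr(sample))  # True True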


Loss Landscape Degeneracy and Stagewise Development in Transformers

arXiv:2402.02364v3 Announce Type: replace-cross Abstract: Deep learning involves navigating a high-dimensional loss landscape over the neural network parameter space. Over the course of training, complex computational structures form and re-form inside the neural network, leading to shifts in input/output behavior. It is a priority for the science of deep learning to uncover principles governing the development of neural network structure and behavior. Drawing on the framework of singular learning theory, we propose that model development is deeply linked to degeneracy in the local geometry of the loss landscape. We investigate this link by monitoring loss landscape degeneracy throughout training, as quantified by the local learning coefficient, for a transformer language model and an in-context linear regression transformer. We show that training can be divided into distinct periods of change in loss landscape degeneracy, and that these changes in degeneracy coincide with significant changes in the internal computational structure and the input/output behavior of the transformers. This finding provides suggestive evidence that degeneracy and development are linked in transformers, underscoring the potential of a degeneracy-based perspective for understanding modern deep learning.
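For readers unfamiliar with the quantity being tracked: in the singular learning theory literature, the local learning coefficient (LLC) at a trained parameter $w^*$ is commonly estimated from an SGLD-sampled posterior localized around $w^*$, roughly as

\[ \hat{\lambda}(w^*) \approx n\beta \left( \mathbb{E}_{w \sim p_\beta(\cdot \mid w^*)}[L_n(w)] - L_n(w^*) \right), \]

where $L_n$ is the empirical loss over $n$ samples and $\beta$ is an inverse temperature. This form is a hedged summary of the general LLC estimator, not necessarily the exact definition used in this paper; consult the paper for the precise estimator and hyperparameters.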


Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training

arXiv:2508.00414v1 Announce Type: cross Abstract: General AI Agents are increasingly recognized as foundational frameworks for the next generation of artificial intelligence, enabling complex reasoning, web interaction, coding, and autonomous research capabilities. However, current agent systems are either closed-source or heavily reliant on a variety of paid APIs and proprietary tools, limiting accessibility and reproducibility for the research community. In this work, we present Cognitive Kernel-Pro, a fully open-source and (to the maximum extent) free multi-module agent framework designed to democratize the development and evaluation of advanced AI agents. Within Cognitive Kernel-Pro, we systematically investigate the curation of high-quality training data for Agent Foundation Models, focusing on the construction of queries, trajectories, and verifiable answers across four key domains: web, file, code, and general reasoning. Furthermore, we explore novel strategies for agent test-time reflection and voting to enhance agent robustness and performance. We evaluate Cognitive Kernel-Pro on GAIA, achieving state-of-the-art results among open-source and free agents. Notably, our 8B-parameter open-source model surpasses previous leading systems such as WebDancer and WebSailor, establishing a new performance standard for accessible, high-capability AI agents. Code is available at https://github.com/Tencent/CognitiveKernel-Pro
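Of the test-time strategies mentioned, voting is the simplest to picture: run the agent several times and keep the majority answer. A minimal sketch, with run_agent standing in as a placeholder for whatever produces a final answer from a query:

# Sketch of test-time majority voting over repeated agent runs.
from collections import Counter
import random

def vote(run_agent, query: str, k: int = 5) -> str:
    answers = [run_agent(query) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]  # most frequent answer wins

stub = lambda q: random.choice(["42", "42", "17"])  # stand-in agent
print(vote(stub, "What is the answer?"))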
