YouZum

AI, Committee, News, Uncategorized

Curved Inference: Concern-Sensitive Geometry in Large Language Model Residual Streams

arXiv:2507.21107v1 Announce Type: new Abstract: We propose Curved Inference – a geometric interpretability framework that tracks how the residual stream trajectory of a large language model bends in response to shifts in semantic concern. Across 20 matched prompts spanning emotional, moral, perspective, logical, identity, environmental, and nonsense domains, we analyse Gemma3-1b and LLaMA3.2-3b using five native-space metrics, with a primary focus on curvature (κ_i) and salience (S(t)). These metrics are computed under a pullback semantic metric derived from the unembedding matrix, ensuring that all measurements reflect token-aligned geometry rather than raw coordinate structure. We find that concern-shifted prompts reliably alter internal activation trajectories in both models – with LLaMA exhibiting consistent, statistically significant scaling in both curvature and salience as concern intensity increases. Gemma also responds to concern but shows weaker differentiation between moderate and strong variants. Our results support a two-layer view of LLM geometry – a latent conceptual structure encoded in the embedding space, and a contextual trajectory shaped by prompt-specific inference. Curved Inference reveals how models navigate, reorient, or reinforce semantic meaning over depth, offering a principled method for diagnosing alignment, abstraction, and emergent inference dynamics. These findings offer fresh insight into semantic abstraction and model alignment through the lens of Curved Inference.
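To make the geometry concrete, the following is a minimal sketch, not the authors' released code, of how curvature and salience could be measured for a single token's residual-stream trajectory under a pullback metric derived from the unembedding matrix. The array shapes, the turning-angle definition of κ_i, and the step-norm definition of S(t) are illustrative assumptions.

```python
import numpy as np

def pullback_inner(u, v, W_U):
    """Inner product of u and v under the pullback metric induced by the unembedding W_U."""
    return float(u @ W_U @ (W_U.T @ v))

def salience(hidden, W_U):
    """Step lengths S(t) = ||h_{t+1} - h_t|| measured in the pullback metric."""
    steps = np.diff(hidden, axis=0)
    return np.array([np.sqrt(pullback_inner(s, s, W_U)) for s in steps])

def curvature(hidden, W_U):
    """Discrete curvature kappa_i: turning angle between consecutive layer-to-layer steps."""
    steps = np.diff(hidden, axis=0)
    kappas = []
    for a, b in zip(steps[:-1], steps[1:]):
        na = np.sqrt(pullback_inner(a, a, W_U))
        nb = np.sqrt(pullback_inner(b, b, W_U))
        cos = pullback_inner(a, b, W_U) / (na * nb + 1e-12)
        kappas.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.array(kappas)

# Toy usage with random stand-ins for real residual-stream activations.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(12, 64))   # 12 layer states for one token, hidden size 64
W_U = rng.normal(size=(64, 1000))    # hidden dimension -> vocabulary
print(curvature(hidden, W_U)[:3], salience(hidden, W_U)[:3])
```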

Curved Inference: Concern-Sensitive Geometry in Large Language Model Residual Streams – Read post »

AI, Committee, News, Uncategorized

FrugalRAG: Learning to retrieve and reason for multi-hop QA

arXiv:2507.07634v2 Announce Type: replace Abstract: We consider the problem of answering complex questions, given access to a large unstructured document corpus. The de facto approach to solving the problem is to leverage language models that (iteratively) retrieve and reason through the retrieved documents, until the model has sufficient information to generate an answer. Attempts at improving this approach focus on retrieval-augmented generation (RAG) metrics such as accuracy and recall, and can be categorized into two types: (a) fine-tuning on large question answering (QA) datasets augmented with chain-of-thought traces, and (b) leveraging RL-based fine-tuning techniques that rely on question-document relevance signals. However, efficiency in the number of retrieval searches is an equally important metric, which has received less attention. In this work, we show that: (1) Large-scale fine-tuning is not needed to improve RAG metrics, contrary to popular claims in recent literature. Specifically, a standard ReAct pipeline with improved prompts can outperform state-of-the-art methods on benchmarks such as HotPotQA. (2) Supervised and RL-based fine-tuning can help RAG from the perspective of frugality, i.e., the latency due to the number of searches at inference time. For example, we show that we can achieve competitive RAG metrics at nearly half the cost (in terms of the number of searches) on popular RAG benchmarks, using the same base model, and at a small training cost (1000 examples).
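As an illustration of the frugality metric, here is a minimal sketch of an iterative retrieve-and-reason loop that charges one unit of cost per search and stops at a fixed budget. The `llm` and `search` callables and the SEARCH/ANSWER protocol are hypothetical stand-ins, not part of the paper's released pipeline.

```python
from typing import Callable, List, Tuple

def frugal_react(question: str,
                 llm: Callable[[str], str],
                 search: Callable[[str], List[str]],
                 max_searches: int = 4) -> Tuple[str, int]:
    """Answer a multi-hop question while tracking the number of retrieval searches."""
    context: List[str] = []
    searches_used = 0
    while searches_used < max_searches:
        prompt = (f"Question: {question}\n"
                  "Evidence so far:\n" + "\n".join(context) +
                  "\nReply with either 'SEARCH: <query>' or 'ANSWER: <answer>'.")
        step = llm(prompt).strip()
        if step.startswith("ANSWER:"):
            return step[len("ANSWER:"):].strip(), searches_used
        query = step.removeprefix("SEARCH:").strip()
        context.extend(search(query))      # one more search spent
        searches_used += 1
    # Budget exhausted: force a final answer from the evidence gathered so far.
    final = llm(f"Question: {question}\nEvidence:\n" + "\n".join(context) +
                "\nGive your best final answer.")
    return final.strip(), searches_used
```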

FrugalRAG: Learning to retrieve and reason for multi-hop QA – Read post »

AI, Committee, News, Uncategorized

A Survey of Classification Tasks and Approaches for Legal Contracts

arXiv:2507.21108v1 Announce Type: new Abstract: Given the large size and volume of contracts and their inherent complexity, manual reviews become inefficient and prone to errors, creating a clear need for automation. Automatic Legal Contract Classification (LCC) revolutionizes the way legal contracts are analyzed, offering substantial improvements in speed, accuracy, and accessibility. This survey delves into the challenges of automatic LCC and provides a detailed examination of key tasks, datasets, and methodologies. We identify seven classification tasks within LCC, and review fourteen datasets related to English-language contracts, including public, proprietary, and non-public sources. We also introduce a methodology taxonomy for LCC, categorized into Traditional Machine Learning, Deep Learning, and Transformer-based approaches. Additionally, the survey discusses evaluation techniques and highlights the best-performing results from the reviewed studies. By providing a thorough overview of current methods and their limitations, this survey suggests future research directions to improve the efficiency, accuracy, and scalability of LCC. As the first comprehensive survey on LCC, it aims to support legal NLP researchers and practitioners in improving legal processes, making legal information more accessible, and promoting a more informed and equitable society.

A Survey of Classification Tasks and Approaches for Legal Contracts – Read post »

AI, Committee, News, Uncategorized

MiroMind-M1: Advancing Open-Source Mathematical Reasoning via Context-Aware Multi-Stage Reinforcement Learning

Large language models (LLMs) have recently demonstrated remarkable progress in multi-step reasoning, establishing mathematical problem-solving as a rigorous benchmark for assessing advanced capabilities. While proprietary models like GPT-4o and Claude Sonnet 4 lead in performance, their closed-source nature impedes transparency and reproducibility. Addressing these gaps, MiroMind AI released the MiroMind-M1 series, a fully open-source pipeline (spanning datasets, models, training code, and evaluation scripts) that sets new standards for openness and state-of-the-art mathematical reasoning within the Qwen-2.5 model ecosystem.

Architectural Foundation and Motivation

MiroMind-M1 is built on the robust Qwen-2.5 backbone, with enhancements geared explicitly toward mathematical reasoning. The team adopts a two-stage training protocol:

- Supervised Fine-Tuning (SFT): The model is fine-tuned on 719K carefully curated and verified mathematical problems, equipping it with strong step-by-step reasoning abilities.
- Reinforcement Learning with Verifiable Rewards (RLVR): The model then undergoes RL on 62K challenging and rigorously verifiable math problems, leveraging reward signals from a robust external verifier.

This approach is motivated both by the need for strong mathematical logic and by lessons learned from leading RLMs: imitating chain-of-thought exemplars improves general reasoning, while reinforcement learning, guided by precise rewards, further refines accuracy and efficiency.

Data Transparency and Quality

A hallmark of the MiroMind-M1 project is the full openness and cleanliness of its training data:

- SFT corpus composition: Draws from OpenR1, OpenThoughts, Light-R1, and Synthetic-1, ensuring problems have verified solutions and rich, multi-step reasoning traces.
- Stringent deduplication and decontamination: Employs N-gram overlap filtering to eliminate duplication and data leakage with evaluation sets (e.g., AIME24, AIME25, MATH500).
- Preference for long trajectories: Experiments show that training on samples with longer reasoning traces consistently yields higher benchmark scores, highlighting the importance of deep semantic content in the reasoning signal.

The resulting dataset provides 719K verified training traces, significantly advancing open, reproducible research over prior efforts.

Supervised Fine-Tuning: Empirical Excellence

For SFT, MiroMind-SFT-7B is initialized from Qwen2.5-Math-7B and trained with a large context window (up to 32,768 tokens) and a no-packing strategy to avoid cross-sample attention contamination. Its performance on key math benchmarks outpaces peer open models:

Model                  AIME24   AIME25   MATH500
DeepSeek-R1-Distill      55.5     40.4      92.8
MiMo-7B-SFT              58.7     44.3      93.0
MiroMind-SFT-7B          60.4     45.0      94.6

These results validate the efficacy of the data curation and training design: richer, deeper samples and no-packing lead to consistently superior performance.

CAMPO: Context-Aware Multi-Stage Policy Optimization

A key innovation in MiroMind-M1's RLVR phase is the CAMPO algorithm. CAMPO addresses two critical RL challenges, training instability and token inefficiency, through:

- Multi-stage training with expanding context limits: Training starts with constrained output lengths (e.g., 16K tokens) and gradually increases the limit to allow deeper reasoning, balancing efficiency and thoroughness.
- Dynamic repetition penalty: A dedicated repetition critic penalizes outputs exhibiting early or excessive repetition, preventing utility collapse and enforcing output diversity.
- Accurate external verifier: The reward feedback system is substantially improved to robustly score math answers (including tricky cases with units, π, and percentages), ensuring training signals are tightly aligned with true correctness.

CAMPO not only stabilizes RL dynamics but also yields models that solve problems with fewer, more relevant tokens, accelerating inference and reducing costs without sacrificing accuracy. (A minimal illustrative sketch of these ingredients appears at the end of this post.)

Benchmark Performance: State-of-the-Art Efficiency

MiroMind's open models achieve highly competitive or state-of-the-art results among open Qwen-2.5-based math models (7B/32B parameters):

Model                  AIME24   AIME25   MATH500
DeepSeek-R1-7B           55.5     39.2       –
MiMo-7B-RL               68.2     55.4      95.8
Skywork-OR1-7B           72.2     54.6       –
MiroMind-RL-7B           73.4     57.8      96.7
Skywork-OR1-32B          77.1     68.2      97.5
MiroMind-RL-32B          77.5     65.6      96.4

Notably, the MiroMind-M1-RL models not only match or exceed peer accuracy but do so with greater token efficiency: the 32B model produces shorter, more concise solutions without loss of correctness, thanks to CAMPO's training.

Full Stack and Reproducibility

Every component of the MiroMind-M1 stack is openly released:

- Model weights (SFT and RL checkpoints at both 7B and 32B scales)
- Datasets (the full 719K SFT and 62K RLVR sets)
- Training scripts (supporting multi-node distributed training on Ray)
- Evaluation code (standardized scripts and benchmark configs)

Researchers can replicate, audit, and extend MiroMind-M1 from raw data to trained models, advancing reproducibility and accelerating new open research.

Conclusion

MiroMind-M1 demonstrates that with careful data curation, an innovative RL algorithm (CAMPO), and radical transparency, open-source language models can rival proprietary systems in advanced mathematical reasoning. The project sets a new bar for reproducibility and collaborative advancement in reasoning LLMs, providing both a high-quality resource and a robust platform for future innovation.

Check out the Paper, GitHub Page and Model on Hugging Face. All credit for this research goes to the researchers of this project. The post MiroMind-M1: Advancing Open-Source Mathematical Reasoning via Context-Aware Multi-Stage Reinforcement Learning appeared first on MarkTechPost.
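The following is a hedged sketch of the two CAMPO ingredients described above, a staged output-length limit and a repetition penalty folded into the scalar reward. The stage schedule, n-gram size, and penalty weight are illustrative guesses, not values from the paper.

```python
from collections import Counter

STAGE_LIMITS = [16_384, 32_768]   # assumed max output tokens per training stage

def repetition_fraction(tokens, n=4):
    """Fraction of n-grams that are repeats; 0.0 means no repetition at all."""
    if len(tokens) < n:
        return 0.0
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(grams.values())
    return 1.0 - len(grams) / total

def campo_style_reward(tokens, verifier_correct: bool,
                       stage: int = 0, rep_weight: float = 0.5) -> float:
    """Verifier-based correctness reward minus a dynamic repetition penalty."""
    if len(tokens) > STAGE_LIMITS[stage]:
        return 0.0                                 # exceeds the current context limit
    base = 1.0 if verifier_correct else 0.0
    return base - rep_weight * repetition_fraction(tokens)
```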

MiroMind-M1: Advancing Open-Source Mathematical Reasoning via Context-Aware Multi-Stage Reinforcement Learning – Read post »

AI, Committee, News, Uncategorized

Rubrics as Rewards (RaR): A Reinforcement Learning Framework for Training Language Models with Structured, Multi-Criteria Evaluation Signals

Reinforcement Learning with Verifiable Rewards (RLVR) allows LLMs to perform complex reasoning on tasks with clear, verifiable outcomes, and it has delivered strong performance in mathematics and coding. However, many real-world scenarios lack such explicit verifiable answers, posing a challenge for training models without direct reward signals. Current methods address this gap through RLHF via preference ranking, where human judgments are collected over pairs or lists of model outputs. Preference-based reward models can boost performance in the early stages, but they tend to overfit to superficial artifacts such as response length, formatting quirks, and annotator biases. They also require large volumes of pairwise comparisons, making them brittle and costly.

RLVR methods now extend beyond mathematics and coding, with GENERAL-REASONER demonstrating strong performance in physics, finance, and policy, achieving a ten-point gain on MMLU-Pro through GRPO fine-tuning. Rubric-based evaluation has become a standard for advanced LLMs, with frameworks like HEALTHBENCH pairing clinician-written criteria with automated judges to evaluate factuality, safety, and empathy. However, these rubrics appear only during evaluation rather than training. Process supervision methods try to provide more granular feedback by rewarding intermediate reasoning steps through MCTS-generated labels and generative reward models such as THINKPRM.

Researchers from Scale AI have proposed Rubrics as Rewards (RaR), an on-policy reinforcement learning framework that uses checklist-style rubrics to guide multi-criteria tasks. The method generates prompt-specific rubrics based on carefully designed principles, where each rubric outlines clear standards for high-quality responses and provides human-interpretable supervision signals. The framework is applied to the medicine and science domains, resulting in two specialized training datasets, RaR-Medicine-20k and RaR-Science-20k. By transforming rubrics into structured reward signals, RaR enables smaller judge models to achieve superior alignment with human preferences while maintaining robust performance across different model scales.

The researchers used LLMs as expert proxies to generate these rubrics, ensuring adherence to the following desiderata: grounded in expert guidance, comprehensive coverage, semantic weighting, and self-contained evaluation. For each domain, specialized prompts instruct the LLM to generate 7-20 rubric items based on the complexity of the input question. Each item is assigned a categorical weight, such as Essential Criteria or Important Criteria, to determine its significance for correct answers (a minimal sketch of this rubric-to-reward aggregation appears at the end of this post). Training uses the GRPO algorithm with Qwen2.5-7B as the base policy model, and the training pipeline operates through three core components: Response Generation, Reward Computation, and Policy Update.

The RaR-Implicit method outperforms baseline methods such as Simple-Likert, with the best variant achieving up to 28% relative improvement on HealthBench-1k and 13% on GPQA. It also outperforms both base and instruction-tuned policy models, showing the effectiveness of rubric-guided training for nuanced response evaluation while matching or exceeding Reference-Likert baseline performance. Beyond raw metrics, rubric-guided evaluations provide clearer and more accurate signals across model scales, achieving higher accuracy when preferred responses receive appropriate ratings. Expert guidance also proves essential for synthetic rubric generation: rubrics developed using reference answers achieve higher accuracy than those written without human insights.

In summary, the researchers introduced RaR, which advances post-training of language models by using structured, checklist-style rubrics as reward signals, offering stable training signals while maintaining human interpretability and alignment. However, the research remains limited to the medical and science domains and requires validation across tasks such as open-ended dialogue. The researchers explored only two reward aggregation strategies, implicit and explicit, leaving alternative weighting schemes unexamined. They also did not conduct a controlled analysis of reward-hacking risks, and the reliance on off-the-shelf LLMs as judges suggests future work could benefit from dedicated evaluators with enhanced reasoning capabilities.

Check out the Paper here. All credit for this research goes to the researchers of this project. The post Rubrics as Rewards (RaR): A Reinforcement Learning Framework for Training Language Models with Structured, Multi-Criteria Evaluation Signals appeared first on MarkTechPost.
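The sketch below illustrates how checklist-style rubric judgments with categorical weights could be aggregated into a scalar reward, in the spirit of the explicit aggregation strategy mentioned above. The weight values, the judge interface, and the example criteria are assumptions for illustration, not the paper's exact scheme.

```python
from dataclasses import dataclass
from typing import List

WEIGHTS = {"essential": 1.0, "important": 0.5}   # assumed categorical weights

@dataclass
class RubricItem:
    criterion: str     # e.g. "States the correct diagnosis"
    category: str      # "essential" or "important"
    satisfied: bool    # judgment from an LLM judge or a human grader

def explicit_rubric_reward(items: List[RubricItem]) -> float:
    """Weighted fraction of satisfied criteria, normalized to [0, 1]."""
    total = sum(WEIGHTS[item.category] for item in items)
    earned = sum(WEIGHTS[item.category] for item in items if item.satisfied)
    return earned / total if total > 0 else 0.0

reward = explicit_rubric_reward([
    RubricItem("Mentions the key clinical finding", "essential", True),
    RubricItem("Explains the mechanism briefly", "important", False),
])
print(reward)   # 1.0 / 1.5 = 0.667
```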

Rubrics as Rewards (RaR): A Reinforcement Learning Framework for Training Language Models with Structured, Multi-Criteria Evaluation Signals – Read post »

AI, Committee, News, Uncategorized

FHSTP@EXIST 2025 Benchmark: Sexism Detection with Transparent Speech Concept Bottleneck Models

arXiv:2507.20924v1 Announce Type: new Abstract: Sexism has become widespread on social media and in online conversation. To help address this issue, the fifth Sexism Identification in Social Networks (EXIST) challenge was initiated at CLEF 2025. Among this year's international benchmarks, we concentrate on the first task, which aims to identify and classify sexism in social media textual posts. In this paper, we describe our solutions and report results for three subtasks: Subtask 1.1 – Sexism Identification in Tweets, Subtask 1.2 – Source Intention in Tweets, and Subtask 1.3 – Sexism Categorization in Tweets. We implement three models to address each subtask, which constitute three individual runs: the Speech Concept Bottleneck Model (SCBM), the Speech Concept Bottleneck Model with Transformer (SCBMT), and a fine-tuned XLM-RoBERTa transformer model. SCBM uses descriptive adjectives as human-interpretable bottleneck concepts: it leverages large language models (LLMs) to encode input texts into an adjective-based, human-interpretable representation, which is then used to train a lightweight classifier for downstream tasks. SCBMT extends SCBM by fusing the adjective-based representation with contextual embeddings from transformers to balance interpretability and classification performance. Beyond competitive results, these two models offer fine-grained explanations at both instance (local) and class (global) levels. We also investigate how additional metadata, e.g., annotators' demographic profiles, can be leveraged. For Subtask 1.1, XLM-RoBERTa, fine-tuned on the provided data augmented with prior datasets, ranks 6th for English and Spanish and 4th for English in the Soft-Soft evaluation. Our SCBMT achieves 7th for English and Spanish and 6th for Spanish.
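As a rough sketch of the SCBM idea described above, the snippet below scores a post against a small vocabulary of descriptive adjectives (e.g. via an LLM judge) and trains a lightweight linear classifier on those interpretable scores. The adjective list and the `llm_score` interface are placeholders, not the authors' actual concept set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

ADJECTIVES = ["hostile", "demeaning", "objectifying", "dismissive", "neutral"]

def adjective_scores(text: str, llm_score) -> np.ndarray:
    """One bottleneck feature per adjective, e.g. an LLM-judged degree in [0, 1]."""
    return np.array([llm_score(text, adjective) for adjective in ADJECTIVES])

def train_scbm(texts, labels, llm_score):
    """Texts -> interpretable adjective features -> lightweight classifier."""
    X = np.stack([adjective_scores(t, llm_score) for t in texts])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    return clf   # clf.coef_ yields a per-adjective, human-readable explanation
```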

FHSTP@EXIST 2025 Benchmark: Sexism Detection with Transparent Speech Concept Bottleneck Models – Read post »

AI, Committee, News, Uncategorized

VLQA: The First Comprehensive, Large, and High-Quality Vietnamese Dataset for Legal Question Answering

arXiv:2507.19995v1 Announce Type: new Abstract: The advent of large language models (LLMs) has led to significant achievements in various domains, including legal text processing. Leveraging LLMs for legal tasks is a natural evolution and an increasingly compelling choice. However, their capabilities are often portrayed as greater than they truly are. Despite the progress, we are still far from the ultimate goal of fully automating legal tasks using artificial intelligence (AI) and natural language processing (NLP). Moreover, legal systems are deeply domain-specific and exhibit substantial variation across different countries and languages. The need to build legal text processing applications for different natural languages is therefore large and urgent. However, legal NLP faces a significant challenge in low-resource languages such as Vietnamese due to the scarcity of resources and annotated data, and the need for labeled legal corpora for supervised training, validation, and fine-tuning is critical. In this paper, we introduce the VLQA dataset, a comprehensive and high-quality resource tailored for the Vietnamese legal domain. We also conduct a comprehensive statistical analysis of the dataset and evaluate its effectiveness through experiments with state-of-the-art models on legal information retrieval and question-answering tasks.

VLQA: The First Comprehensive, Large, and High-Quality Vietnamese Dataset for Legal Question Answering – Read post »

AI, Committee, News, Uncategorized

AutoLibra: Agent Metric Induction from Open-Ended Feedback

arXiv:2505.02820v2 Announce Type: replace-cross Abstract: Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose AutoLibra, a framework for agent evaluation that transforms open-ended human feedback, e.g., "If you find that the button is disabled, don't click it again" or "This agent has too much autonomy to decide what to do on its own", into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback in an agent's behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used to prompt LLM-as-a-Judge evaluators. We further propose two meta-metrics to evaluate the alignment of a set of (induced) metrics with open feedback: "coverage" and "redundancy". By optimizing these meta-metrics, we experimentally demonstrate AutoLibra's ability to induce more concrete agent evaluation metrics than those proposed in previous agent evaluation benchmarks and to discover new metrics for analyzing agents. We also present two applications of AutoLibra in agent improvement: first, we show that AutoLibra-induced metrics serve as better prompt-engineering targets than the task success rate on a wide range of text game tasks, improving agent performance over the baseline by a mean of 20%; second, we show that AutoLibra can iteratively select high-quality fine-tuning data for web navigation agents. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.

AutoLibra: Agent Metric Induction from Open-Ended Feedback – Read post »

AI, Committee, News, Uncategorized

Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory

arXiv:2507.19980v1 Announce Type: new Abstract: This study investigates the estimation of reliability for large language models (LLMs) in scoring writing tasks from the AP Chinese Language and Culture Exam. Using generalizability theory, the research evaluates and compares score consistency between human and AI raters across two types of AP Chinese free-response writing tasks: story narration and email response. These essays were independently scored by two trained human raters and seven AI raters. Each essay received four scores: one holistic score and three analytic scores corresponding to the domains of task completion, delivery, and language use. Results indicate that although human raters produced more reliable scores overall, LLMs demonstrated reasonable consistency under certain conditions, particularly for story narration tasks. Composite scoring that incorporates both human and AI raters improved reliability, suggesting that hybrid scoring models may offer benefits for large-scale writing assessments.
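For readers unfamiliar with generalizability theory, the sketch below shows the kind of reliability estimate it yields for a persons-by-raters design: the generalizability (Phi) coefficient grows as rater-related error is averaged over more raters. The variance components are placeholder numbers, not figures from the study.

```python
def g_coefficient(var_person: float, var_rater: float,
                  var_residual: float, n_raters: int) -> float:
    """Absolute (Phi) coefficient for a crossed persons-x-raters design."""
    return var_person / (var_person +
                         var_rater / n_raters +
                         var_residual / n_raters)

# Example: two human raters vs. a hybrid panel of nine (two human + seven AI) raters.
print(round(g_coefficient(0.80, 0.05, 0.30, n_raters=2), 3))   # 0.821
print(round(g_coefficient(0.80, 0.05, 0.30, n_raters=9), 3))   # 0.954
```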

Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory – Read post »

AI, Committee, News, Uncategorized

SGPO: Self-Generated Preference Optimization based on Self-Improver

arXiv:2507.20181v1 Announce Type: new Abstract: Large language models (LLMs), despite their extensive pretraining on diverse datasets, require effective alignment to human preferences for practical and reliable deployment. Conventional alignment methods typically employ off-policy learning and depend on human-annotated datasets, which limits their broad applicability and introduces distribution shift issues during training. To address these challenges, we propose Self-Generated Preference Optimization based on Self-Improver (SGPO), an innovative alignment framework that leverages an on-policy self-improving mechanism. Specifically, the improver refines responses from a policy model to self-generate preference data for direct preference optimization (DPO) of the policy model. Here, the improver and policy are unified into a single model, and in order to generate higher-quality preference data, this self-improver learns to make incremental yet discernible improvements to the current responses by referencing supervised fine-tuning outputs. Experimental results on AlpacaEval 2.0 and Arena-Hard show that the proposed SGPO significantly improves performance over DPO and baseline self-improving methods without using external preference data.
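To make the on-policy self-improvement loop concrete, here is a minimal sketch of how (draft, refined) pairs could be collected as preference data for DPO, as described above. The `policy` and `improver` callables are hypothetical stand-ins; in SGPO both roles are played by a single unified model.

```python
from typing import Callable, Dict, List

def build_self_preference_data(prompts: List[str],
                               policy: Callable[[str], str],
                               improver: Callable[[str, str], str]) -> List[Dict[str, str]]:
    """Collect self-generated preference pairs: the refined response is preferred over the draft."""
    pairs = []
    for prompt in prompts:
        draft = policy(prompt)              # current on-policy response
        refined = improver(prompt, draft)   # incremental, discernible improvement
        pairs.append({"prompt": prompt,
                      "chosen": refined,    # preferred response for DPO
                      "rejected": draft})
    return pairs
```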

SGPO: Self-Generated Preference Optimization based on Self-Improver – Read post »
