Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation

arXiv:2504.11829v3 Announce Type: replace Abstract: Generation capabilities and language coverage of multilingual large language models (mLLMs) are advancing rapidly. However, evaluation practices for generative abilities of mLLMs are still lacking comprehensiveness, scientific rigor, and consistent adoption across research labs, which undermines their potential to meaningfully guide mLLM development. We draw parallels with machine translation (MT) evaluation, a field that faced similar challenges and has, over decades, developed transparent reporting standards and reliable evaluations for multilingual generative models. Through targeted experiments across key stages of the generative evaluation pipeline, we demonstrate how best practices from MT evaluation can deepen the understanding of quality differences between models. Additionally, we identify essential components for robust meta-evaluation of mLLMs, ensuring the evaluation methods themselves are rigorously assessed. We distill these insights into a checklist of actionable recommendations for mLLM research and development.

Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning

arXiv:2502.13820v3 Announce Type: replace-cross Abstract: Synthetic verification techniques such as generating test cases and reward modelling are common ways to enhance the coding capabilities of large language models (LLMs) beyond predefined tests. Additionally, code verification has recently found great success as a critical component in improving the reasoning capabilities of LLMs via reinforcement learning. In this paper, we propose an approach which can transform existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. We also propose multiple metrics to measure different aspects of the synthetic verifiers with the proposed benchmarks. By employing the proposed approach, we release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+) and analyze synthetic verification methods with standard, reasoning-based, and reward-based LLMs. Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases enhances verification accuracy.
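
As a concrete illustration of the scoring-and-ranking idea, the following minimal Python sketch scores candidate solutions by the fraction of synthetic test cases they pass and ranks them accordingly. It is not the paper's code; the test-case format and helper names are illustrative assumptions.

```python
from typing import Callable, List, Tuple

def score_candidate(candidate: Callable, test_cases: List[Tuple[tuple, object]]) -> float:
    """Fraction of (args, expected_output) pairs the candidate passes."""
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # runtime errors count as failures
    return passed / len(test_cases) if test_cases else 0.0

def rank_candidates(candidates, test_cases):
    """Sort candidate solutions from most to fewest test cases passed."""
    return sorted(candidates, key=lambda c: score_candidate(c, test_cases), reverse=True)

# Toy example: two candidate implementations of absolute value.
tests = [((3,), 3), ((-4,), 4), ((0,), 0)]
good, bad = (lambda x: abs(x)), (lambda x: x)
print([round(score_candidate(c, tests), 2) for c in (good, bad)])  # [1.0, 0.67]
```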

Traits Run Deep: Enhancing Personality Assessment via Psychology-Guided LLM Representations and Multimodal Apparent Behaviors

arXiv:2507.22367v1 Announce Type: new Abstract: Accurate and reliable personality assessment plays a vital role in many fields, such as emotional intelligence, mental health diagnostics, and personalized education. Unlike fleeting emotions, personality traits are stable, often subconsciously leaked through language, facial expressions, and body behaviors, with asynchronous patterns across modalities. Traditional superficial features struggle to model personality semantics, and effective cross-modal understanding is difficult to achieve. To address these challenges, we propose a novel personality assessment framework called Traits Run Deep. It employs psychology-informed prompts to elicit high-level personality-relevant semantic representations. In addition, it devises a Text-Centric Trait Fusion Network that anchors rich text semantics to align and integrate asynchronous signals from other modalities. Specifically, the fusion module includes a Chunk-Wise Projector to reduce dimensionality, a Cross-Modal Connector and a Text Feature Enhancer for effective modality fusion, and an ensemble regression head to improve generalization in data-scarce situations. To our knowledge, we are the first to apply personality-specific prompts to guide large language models (LLMs) in extracting personality-aware semantics for improved representation quality. Furthermore, extracting and fusing audio-visual apparent behavior features further improves accuracy. Experimental results on the AVI validation set demonstrate the effectiveness of the proposed components, i.e., approximately a 45% reduction in mean squared error (MSE). Final evaluations on the test set of the AVI Challenge 2025 confirm our method's superiority, ranking first in the Personality Assessment track. The source code will be made available at https://github.com/MSA-LMC/TraitsRunDeep.
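
A hedged PyTorch sketch of the text-centric fusion described above may help fix ideas: a Chunk-Wise Projector reduces feature dimensionality, cross-attention anchors audio-visual signals to text, and an ensemble of regression heads predicts trait scores. All layer sizes, the five-trait output, and the module wiring are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChunkWiseProjector(nn.Module):
    """Reduce dimensionality by projecting fixed-size chunks of the feature dim."""
    def __init__(self, in_dim=4096, chunk=256, out_per_chunk=32):
        super().__init__()
        assert in_dim % chunk == 0
        self.chunk = chunk
        self.proj = nn.Linear(chunk, out_per_chunk)

    def forward(self, x):                      # x: (B, T, in_dim)
        B, T, D = x.shape
        x = x.view(B, T, D // self.chunk, self.chunk)
        return self.proj(x).flatten(2)         # (B, T, (D / chunk) * out_per_chunk)

class TextCentricFusion(nn.Module):
    """Cross-attention with text as the query, plus an ensemble regression head."""
    def __init__(self, dim=512, heads=8, n_traits=5, n_ensemble=3):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.heads = nn.ModuleList(nn.Linear(dim, n_traits) for _ in range(n_ensemble))

    def forward(self, text, av):               # text: (B, Tt, dim), av: (B, Ta, dim)
        fused, _ = self.cross(query=text, key=av, value=av)   # text anchors the fusion
        pooled = fused.mean(dim=1)
        return torch.stack([h(pooled) for h in self.heads]).mean(0)  # (B, n_traits)

proj = ChunkWiseProjector()                    # 4096 -> 512 per time step
fusion = TextCentricFusion()
text = torch.randn(2, 10, 512)                 # LLM-derived text semantics
av = proj(torch.randn(2, 30, 4096))            # asynchronous audio-visual features
print(fusion(text, av).shape)                  # torch.Size([2, 5])
```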

Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection

arXiv:2505.19010v2 Announce Type: replace-cross Abstract: Multi-modal learning has emerged as a crucial research direction, as integrating textual and visual information can substantially enhance performance in tasks such as classification, retrieval, and scene understanding. Despite advances with large pre-trained models, existing approaches often suffer from insufficient cross-modal interactions and rigid fusion strategies, failing to fully harness the complementary strengths of different modalities. To address these limitations, we propose Co-AttenDWG, an architecture that combines co-attention with dimension-wise gating and expert fusion. Our approach first projects textual and visual features into a shared embedding space, where a dedicated co-attention mechanism enables simultaneous, fine-grained interactions between modalities. This is further strengthened by a dimension-wise gating network, which adaptively modulates feature contributions at the channel level to emphasize salient information. In parallel, dual-path encoders independently refine modality-specific representations, while an additional cross-attention layer aligns the modalities further. The resulting features are aggregated via an expert fusion module that integrates learned gating and self-attention, yielding a robust unified representation. Experimental results on the MIMIC and SemEval Memotion 1.0 datasets show that Co-AttenDWG achieves state-of-the-art performance and superior cross-modal alignment, highlighting its effectiveness for diverse multi-modal applications.
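
The channel-level gating idea can be sketched briefly. The snippet below is an illustrative assumption about how a dimension-wise gate might reweight each channel of two modality features; it is not the paper's implementation.

```python
import torch
import torch.nn as nn

class DimensionWiseGate(nn.Module):
    """Learned sigmoid gate that mixes two modalities per channel."""
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text_feat, img_feat):    # both (B, dim)
        g = self.gate(torch.cat([text_feat, img_feat], dim=-1))  # (B, dim) in [0, 1]
        # Per-channel convex combination emphasizes the more salient modality.
        return g * text_feat + (1 - g) * img_feat

gate = DimensionWiseGate(dim=512)
t, v = torch.randn(4, 512), torch.randn(4, 512)
print(gate(t, v).shape)  # torch.Size([4, 512])
```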

QE4PE: Word-level Quality Estimation for Human Post-Editing

arXiv:2503.03044v2 Announce Type: replace Abstract: Word-level quality estimation (QE) methods aim to detect erroneous spans in machine translations, which can direct and facilitate human post-editing. While the accuracy of word-level QE systems has been assessed extensively, their usability and downstream influence on the speed, quality and editing choices of human post-editing remain understudied. In this study, we investigate the impact of word-level QE on machine translation (MT) post-editing in a realistic setting involving 42 professional post-editors across two translation directions. We compare four error-span highlight modalities, including supervised and uncertainty-based word-level QE methods, for identifying potential errors in the outputs of a state-of-the-art neural MT model. Post-editing effort and productivity are estimated from behavioral logs, while quality improvements are assessed by word- and segment-level human annotation. We find that domain, language and editors’ speed are critical factors in determining highlights’ effectiveness, with modest differences between human-made and automated QE highlights underlining a gap between accuracy and usability in professional workflows.
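
For intuition, one simple uncertainty-based highlight modality flags target tokens whose model probability falls below a threshold and presents the resulting spans to the post-editor. The Python sketch below is a hypothetical illustration; the threshold and the probability source are assumptions, not the study's exact method.

```python
def highlight_spans(tokens, token_probs, threshold=0.5):
    """Return (start, end) index spans of consecutive low-confidence tokens."""
    spans, start = [], None
    for i, p in enumerate(token_probs):
        if p < threshold and start is None:
            start = i                           # open a new error span
        elif p >= threshold and start is not None:
            spans.append((start, i))            # close the span
            start = None
    if start is not None:
        spans.append((start, len(tokens)))
    return spans

tokens = ["The", "cat", "sat", "on", "het", "mat"]
probs  = [0.98, 0.95, 0.90, 0.92, 0.21, 0.88]
print(highlight_spans(tokens, probs))  # [(4, 5)] -> "het" is flagged for editing
```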

FrugalRAG: Learning to retrieve and reason for multi-hop QA

arXiv:2507.07634v2 Announce Type: replace Abstract: We consider the problem of answering complex questions, given access to a large unstructured document corpus. The de facto approach to solving the problem is to leverage language models that (iteratively) retrieve and reason through the retrieved documents, until the model has sufficient information to generate an answer. Attempts at improving this approach focus on retrieval-augmented generation (RAG) metrics such as accuracy and recall and can be categorized into two types: (a) fine-tuning on large question answering (QA) datasets augmented with chain-of-thought traces, and (b) leveraging RL-based fine-tuning techniques that rely on question-document relevance signals. However, efficiency in the number of retrieval searches is an equally important metric, which has received less attention. In this work, we show that: (1) Large-scale fine-tuning is not needed to improve RAG metrics, contrary to popular claims in recent literature. Specifically, a standard ReAct pipeline with improved prompts can outperform state-of-the-art methods on benchmarks such as HotPotQA. (2) Supervised and RL-based fine-tuning can help RAG from the perspective of frugality, i.e., the latency due to the number of searches at inference time. For example, we show that we can achieve competitive RAG metrics at nearly half the cost (in terms of number of searches) on popular RAG benchmarks, using the same base model, and at a small training cost (1000 examples).
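
The frugality notion can be made concrete with a short sketch of an iterative retrieve-and-reason loop that tracks its search budget. Here `llm_step` and `search` are hypothetical stand-ins for the policy model and the retriever; this is not the FrugalRAG code.

```python
def frugal_react(question, llm_step, search, max_searches=4):
    """Answer `question`, stopping retrieval as soon as the model is confident."""
    context, num_searches = [], 0
    while num_searches < max_searches:
        # The model either asks for another search or commits to an answer.
        action, payload = llm_step(question, context)   # ("search", q) | ("answer", a)
        if action == "answer":
            return payload, num_searches                # answered under budget
        context.extend(search(payload))                 # append retrieved documents
        num_searches += 1
    # Budget exhausted: force a final answer from the evidence gathered so far.
    return llm_step(question, context, force_answer=True)[1], num_searches
```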

A Survey of Classification Tasks and Approaches for Legal Contracts

arXiv:2507.21108v1 Announce Type: new Abstract: Given the large size and volume of contracts and their inherent complexity, manual reviews become inefficient and prone to errors, creating a clear need for automation. Automatic Legal Contract Classification (LCC) revolutionizes the way legal contracts are analyzed, offering substantial improvements in speed, accuracy, and accessibility. This survey delves into the challenges of automatic LCC and provides a detailed examination of key tasks, datasets, and methodologies. We identify seven classification tasks within LCC, and review fourteen datasets related to English-language contracts, including public, proprietary, and non-public sources. We also introduce a methodology taxonomy for LCC, categorized into Traditional Machine Learning, Deep Learning, and Transformer-based approaches. Additionally, the survey discusses evaluation techniques and highlights the best-performing results from the reviewed studies. By providing a thorough overview of current methods and their limitations, this survey suggests future research directions to improve the efficiency, accuracy, and scalability of LCC. As the first comprehensive survey on LCC, it aims to support legal NLP researchers and practitioners in improving legal processes, making legal information more accessible, and promoting a more informed and equitable society.

Curved Inference: Concern-Sensitive Geometry in Large Language Model Residual Streams

arXiv:2507.21107v1 Announce Type: new Abstract: We propose Curved Inference, a geometric interpretability framework that tracks how the residual stream trajectory of a large language model bends in response to shifts in semantic concern. Across 20 matched prompts spanning emotional, moral, perspective, logical, identity, environmental, and nonsense domains, we analyse Gemma3-1b and LLaMA3.2-3b using five native-space metrics, with a primary focus on curvature (κ_i) and salience (S(t)). These metrics are computed under a pullback semantic metric derived from the unembedding matrix, ensuring that all measurements reflect token-aligned geometry rather than raw coordinate structure. We find that concern-shifted prompts reliably alter internal activation trajectories in both models, with LLaMA exhibiting consistent, statistically significant scaling in both curvature and salience as concern intensity increases. Gemma also responds to concern but shows weaker differentiation between moderate and strong variants. Our results support a two-layer view of LLM geometry: a latent conceptual structure encoded in the embedding space, and a contextual trajectory shaped by prompt-specific inference. Curved Inference reveals how models navigate, reorient, or reinforce semantic meaning over depth, offering a principled method for diagnosing alignment, abstraction, and emergent inference dynamics.
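
The core measurement can be pictured with a small NumPy sketch: per-layer hidden states are pulled through the unembedding matrix so that geometry is token-aligned, and curvature is taken as the turning angle between successive layer-to-layer displacements. The exact metric and normalization in the paper may differ; this is a simplified reconstruction under stated assumptions.

```python
import numpy as np

def curvature_profile(hidden_states, W_U):
    """hidden_states: (L, d) states for one token position; W_U: (d, V) unembedding."""
    traj = hidden_states @ W_U                  # pull back into token-aligned space
    deltas = np.diff(traj, axis=0)              # displacement between adjacent layers
    kappas = []
    for a, b in zip(deltas[:-1], deltas[1:]):
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
        kappas.append(np.arccos(np.clip(cos, -1.0, 1.0)))  # turning angle per layer
    return np.array(kappas)                     # one kappa_i per interior layer

# Toy check: a straight-line trajectory has (near) zero curvature everywhere.
L, d, V = 6, 8, 16
straight = np.outer(np.arange(L, dtype=float), np.ones(d))
print(curvature_profile(straight, np.random.randn(d, V)).round(3))
```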

MiroMind-M1: Advancing Open-Source Mathematical Reasoning via Context-Aware Multi-Stage Reinforcement Learning

Large language models (LLMs) have recently demonstrated remarkable progress in multi-step reasoning, establishing mathematical problem-solving as a rigorous benchmark for assessing advanced capabilities. While proprietary models like GPT-4o and Claude Sonnet 4 lead in performance, their closed-source nature impedes transparency and reproducibility. Addressing these gaps, MiroMind AI released the MiroMind-M1 series, a fully open-source pipeline spanning datasets, models, training code, and evaluation scripts, which sets new standards for openness and state-of-the-art mathematical reasoning within the Qwen-2.5 model ecosystem.

Architectural Foundation and Motivation

MiroMind-M1 is built on the robust Qwen-2.5 backbone, with enhancements geared explicitly toward mathematical reasoning. The team adopts a two-stage training protocol:

1. Supervised Fine-Tuning (SFT): the model is fine-tuned on 719K carefully curated and verified mathematical problems, equipping it with strong step-by-step reasoning abilities.
2. Reinforcement Learning with Verifiable Rewards (RLVR): the model then undergoes RL on 62K challenging and rigorously verifiable math problems, leveraging reward signals from a robust external verifier.

This approach is motivated both by the need for strong mathematical logic and by lessons learned from leading RLMs: imitating chain-of-thought exemplars improves general reasoning, while reinforcement learning, guided by precise rewards, further refines accuracy and efficiency.

Data Transparency and Quality

A hallmark of the MiroMind-M1 project is the full openness and cleanliness of its training data:

- SFT corpus composition: draws from OpenR1, OpenThoughts, Light-R1, and Synthetic-1, ensuring problems have verified solutions and rich, multi-step reasoning traces.
- Stringent deduplication and decontamination: employs N-gram overlap filtering to eliminate duplication and data leakage against evaluation sets (e.g., AIME24, AIME25, MATH500).
- Preference for long trajectories: experiments show that training on samples with longer reasoning traces consistently yields higher benchmark scores, highlighting the importance of deep semantic content in the reasoning signal.

The resulting dataset provides 719K verified training traces, significantly advancing open, reproducible research over prior efforts.

Supervised Fine-Tuning: Empirical Excellence

For SFT, MiroMind-SFT-7B is initialized from Qwen2.5-Math-7B and trained with a large context window (up to 32,768 tokens) and a no-packing strategy to avoid cross-sample attention contamination. Its performance on key math benchmarks outpaces peer open models:

Model                 AIME24  AIME25  MATH500
DeepSeek-R1-Distill    55.5    40.4    92.8
MiMo-7B-SFT            58.7    44.3    93.0
MiroMind-SFT-7B        60.4    45.0    94.6

These results validate the efficacy of the data curation and training design: richer, deeper samples and no-packing lead to consistently superior performance.

CAMPO: Context-Aware Multi-Stage Policy Optimization

A key innovation in MiroMind-M1's RLVR phase is the CAMPO algorithm. CAMPO addresses two critical RL challenges, training instability and token inefficiency, by the following (a reward-shaping sketch follows the list):

- Multi-stage training with expanding context limits: training starts with constrained output lengths (e.g., 16K tokens), then gradually raises the limit to allow deeper reasoning, balancing efficiency and thoroughness.
- Dynamic repetition penalty: a dedicated repetition critic penalizes outputs exhibiting early or excessive repetition, preventing utility collapse and enforcing output diversity.
- Accurate external verifier: the reward feedback system is substantially improved to robustly score math answers (including tricky cases with units, π, and percentages), ensuring training signals are tightly aligned with true correctness.
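
The combination of a verifier signal with a repetition penalty can be pictured with a short Python sketch. This is an illustrative reconstruction in the spirit of the ideas above, not the released CAMPO code: the n-gram window, repeat threshold, and decay rate are guesses.

```python
from collections import Counter

def repetition_penalty(tokens, n=4, max_repeats=3):
    """Scale in [0, 1]; drops once any n-gram repeats more than `max_repeats` times."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    worst = max(grams.values(), default=1)
    return 1.0 if worst <= max_repeats else max(0.0, 1.0 - 0.2 * (worst - max_repeats))

def campo_style_reward(verifier_score, tokens):
    """Verifier correctness (0/1) scaled down by the repetition penalty."""
    return verifier_score * repetition_penalty(tokens)

solution = "the answer is 42 because 6 * 7 = 42".split()
print(campo_style_reward(1.0, solution))  # 1.0: correct and non-repetitive
```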
CAMPO not only stabilizes RL dynamics but also yields models that solve problems with fewer, more relevant tokens, accelerating inference and reducing costs without sacrificing accuracy.

Benchmark Performance: State-of-the-Art Efficiency

MiroMind's open models achieve highly competitive or state-of-the-art results among open Qwen-2.5-based math models (7B/32B parameters):

Model              AIME24  AIME25  MATH500
DeepSeek-R1-7B      55.5    39.2     –
MiMo-7B-RL          68.2    55.4    95.8
Skywork-OR1-7B      72.2    54.6     –
MiroMind-RL-7B      73.4    57.8    96.7
Skywork-OR1-32B     77.1    68.2    97.5
MiroMind-RL-32B     77.5    65.6    96.4

Notably, the MiroMind-M1-RL models not only match or exceed peer accuracy, but do so with greater token efficiency: the 32B model produces shorter, more concise solutions without loss of correctness, thanks to CAMPO's training.

Full Stack and Reproducibility

Every component of the MiroMind-M1 stack is openly released:

- Model weights (SFT and RL checkpoints at both 7B and 32B scales)
- Datasets (the full 719K SFT and 62K RLVR sets)
- Training scripts (supporting multi-node distributed training on Ray)
- Evaluation code (standardized scripts and benchmark configs)

Researchers can replicate, audit, and extend MiroMind-M1 from raw data to trained models, advancing reproducibility and accelerating new open research.

Conclusion

MiroMind-M1 demonstrates that with careful data curation, innovative RL algorithms (CAMPO), and radical transparency, open-source language models can rival proprietary systems in advanced mathematical reasoning. The project sets a new bar for reproducibility and collaborative advancement in reasoning LLMs, providing both a high-quality resource and a robust platform for future innovation.

Check out the Paper, GitHub Page, and Model on Hugging Face. All credit for this research goes to the researchers of this project.

Rubrics as Rewards (RaR): A Reinforcement Learning Framework for Training Language Models with Structured, Multi-Criteria Evaluation Signals

Reinforcement Learning with Verifiable Rewards (RLVR) allows LLMs to perform complex reasoning on tasks with clear, verifiable outcomes, and has shown strong performance in mathematics and coding. However, many real-world scenarios lack such explicit verifiable answers, posing a challenge for training models without direct reward signals. Current methods address this gap through RLHF via preference ranking, where human judgments are collected over pairs or lists of model outputs. Preference-based reward models can boost performance in the early stages, but they tend to overfit to superficial artifacts such as response length, formatting quirks, and annotator biases. They also require large volumes of pairwise comparisons, making them brittle and costly.

RLVR methods now extend beyond mathematics and coding: GENERAL-REASONER demonstrates strong performance in physics, finance, and policy, achieving a ten-point gain on MMLU-Pro through GRPO fine-tuning. Rubric-based evaluation has become a standard for advanced LLMs, with frameworks like HEALTHBENCH pairing clinician-written criteria with automated judges to evaluate factuality, safety, and empathy. However, these rubrics appear only during evaluation rather than training. Process supervision methods try to provide more granular feedback by rewarding intermediate reasoning steps through MCTS-generated labels and generative reward models such as THINKPRM.

Researchers from Scale AI have proposed Rubrics as Rewards (RaR), an on-policy reinforcement learning framework that uses checklist-style rubrics to supervise multi-criteria tasks. The method generates prompt-specific rubrics based on carefully designed principles, where each rubric outlines clear standards for high-quality responses and provides human-interpretable supervision signals. Applied to the medicine and science domains, it yields two specialized training datasets, RaR-Medicine-20k and RaR-Science-20k. By transforming rubrics into structured reward signals, RaR enables smaller judge models to achieve superior alignment with human preferences while maintaining robust performance across model scales.

The researchers used LLMs as expert proxies to generate these rubrics, ensuring adherence to the following desiderata: grounding in expert guidance, comprehensive coverage, semantic weighting, and self-contained evaluation. For each domain, specialized prompts instruct the LLM to generate 7-20 rubric items based on the complexity of the input question. Each item is assigned a categorical weight, such as Essential Criteria or Important Criteria, to determine its significance for a correct answer. Training uses the GRPO algorithm with Qwen2.5-7B as the base policy model, and the pipeline operates through three core components: response generation, reward computation, and policy update.

The RaR-Implicit method outperforms baselines such as Simple-Likert, with the best variant achieving up to a 28% relative improvement on HealthBench-1k and 13% on GPQA. It also outperforms both base and instruction-tuned policy models, showing the effectiveness of rubric-guided training for nuanced response evaluation while matching or exceeding the Reference-Likert baseline. Beyond raw metrics, rubric-guided evaluations provide clearer and more accurate signals across model scales, achieving higher accuracy when preferred responses receive appropriate ratings.
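
To make the reward computation concrete, here is a minimal sketch of explicit weighted aggregation: a judge model marks each rubric criterion as satisfied or not, and categorical weights are summed and normalized into a scalar reward. The category names follow the Essential/Important labels above, but the numeric weights and example rubric are assumptions, not the paper's values.

```python
WEIGHTS = {"Essential Criteria": 3.0, "Important Criteria": 1.0}

def rubric_reward(rubric, judge_verdicts):
    """rubric: list of (category, criterion); judge_verdicts: parallel list of bools."""
    total = sum(WEIGHTS[cat] for cat, _ in rubric)
    earned = sum(WEIGHTS[cat] for (cat, _), ok in zip(rubric, judge_verdicts) if ok)
    return earned / total if total else 0.0

rubric = [("Essential Criteria", "States the correct final diagnosis"),
          ("Important Criteria", "Mentions relevant risk factors")]
print(rubric_reward(rubric, [True, False]))  # 0.75
```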
Moreover, expert guidance proves essential for synthetic rubric generation: rubrics developed with reference answers achieve higher accuracy than those produced without human insight.

In summary, the researchers introduced RaR, which advances post-training of language models by using structured, checklist-style rubrics as reward signals, offering stable training signals while maintaining human interpretability and alignment. The work remains limited to the medicine and science domains, requiring validation on tasks such as open-ended dialogue. The researchers explored only two reward aggregation strategies, implicit and explicit, leaving alternative weighting schemes unexamined. They also did not conduct a controlled analysis of reward-hacking risks, and the reliance on off-the-shelf LLMs as judges suggests future work could benefit from dedicated evaluators with enhanced reasoning capabilities.

Check out the Paper here. All credit for this research goes to the researchers of this project.
