A Survey on Training-free Alignment of Large Language Models

arXiv:2508.09016v1 Announce Type: new Abstract: The alignment of large language models (LLMs) aims to ensure their outputs adhere to human values, ethical standards, and legal norms. Traditional alignment methods often rely on resource-intensive fine-tuning (FT), which may suffer from knowledge degradation and face challenges in scenarios where model accessibility or computational resources are constrained. In contrast, training-free (TF) alignment techniques (leveraging in-context learning, decoding-time adjustments, and post-generation corrections) offer a promising alternative: they enable alignment without heavily retraining LLMs, making them adaptable to both open-source and closed-source environments. This paper presents the first systematic review of TF alignment methods, categorizing them by stage: pre-decoding, in-decoding, and post-decoding. For each stage, we provide a detailed examination from the viewpoint of both LLMs and multimodal LLMs (MLLMs), highlighting their mechanisms and limitations. Furthermore, we identify key challenges and future directions, paving the way for more inclusive and effective TF alignment techniques. By synthesizing and organizing this rapidly growing body of research, the survey offers guidance for practitioners and advances the development of safer and more reliable LLMs.
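
As a concrete illustration of the post-decoding stage the survey covers, below is a minimal, hedged sketch of training-free best-of-N selection: candidate outputs are re-ranked by an external alignment scorer without touching the model's weights. The `generate` and `score_alignment` callables are hypothetical placeholders, not methods from the paper.

```python
# Minimal sketch of a training-free, post-decoding alignment step:
# sample several candidate completions and keep the one that a separate
# safety/reward scorer prefers. `generate` and `score_alignment` are
# hypothetical stand-ins for any LLM API and any off-the-shelf scorer.
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score_alignment: Callable[[str, str], float],
              n: int = 8) -> str:
    """Return the candidate completion with the highest alignment score."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score_alignment(prompt, c))
```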

Putnam-AXIOM: A Functional and Static Benchmark

arXiv:2508.08292v1 Announce Type: new Abstract: Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some models exceeding 90% accuracy, and are increasingly compromised by training-set contamination. We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants. The variation protocol produces an unlimited stream of equally difficult, unseen instances, yielding a contamination-resilient test bed. On the Original set, OpenAI's o1-preview, the strongest model evaluated, scores 41.9%, but its accuracy drops by 19.6 percentage points (a 46.8% relative decrease) on the paired Variations. The remaining eighteen models show the same downward trend, ten of them with non-overlapping 95% confidence intervals. These gaps suggest memorization and highlight the necessity of dynamic benchmarks. We complement “boxed” accuracy with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores reasoning traces and automates natural-language proof evaluation. Putnam-AXIOM therefore provides a rigorous, contamination-resilient evaluation framework for assessing the advanced mathematical reasoning of LLMs. Data and evaluation code are publicly available at https://github.com/brando90/putnam-axiom.
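
To make the variation protocol concrete, here is a hedged toy sketch of how a functional variant can be produced by programmatically re-sampling constants and recomputing the ground-truth answer from the same constants; the template below is illustrative and not drawn from the actual benchmark.

```python
import random

# Hedged sketch of the "functional variation" idea: a problem is stored as a
# template whose constants can be re-sampled, and the ground-truth answer is
# recomputed from those constants. The template is a toy example, not a
# Putnam problem.
def make_variant(seed: int):
    rng = random.Random(seed)
    a = rng.randint(2, 9)
    b = rng.randint(2, 9)
    problem = f"Find the sum of the roots of x^2 - {a + b}x + {a * b} = 0."
    answer = a + b  # by Vieta's formulas, the roots are a and b, so their sum is a + b
    return problem, answer

print(make_variant(0))
```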

Mistral AI Unveils Mistral Medium 3.1: Enhancing AI with Superior Performance and Usability

Mistral AI has introduced Mistral Medium 3.1, setting new standards in multimodal intelligence, enterprise readiness, and cost-efficiency for large language models (LLMs). Building on its rapidly expanding AI portfolio, Mistral continues to position itself as a European leader, pushing forward with frontier-class capabilities while breaking cost and deployment barriers.

Key Technical Features of Mistral Medium 3.1

- Overall performance boost: Mistral Medium 3.1 introduces major improvements in core reasoning, coding ability, and multimodal competence. Users benefit from more accurate code generation and better understanding across diverse content (text, images, and more).
- Enhanced multimodal capabilities: The model natively processes both textual and visual inputs, excelling at programming, STEM reasoning, document understanding, and image analysis. Benchmarks show top-tier scores on long-context and multimodal tasks, often matching or beating flagship models such as Llama 4 Maverick, Claude Sonnet 3.7, and GPT-4o.
- Improved tone and consistency: Mistral Medium 3.1 delivers a seamless, consistent conversational tone whether or not system prompts and tools are used, ensuring more natural and coherent interactions for both consumer and enterprise deployments.
- Smarter web searches: Optimized algorithms for retrieving and synthesizing information from the web lead to more accurate, complete, and contextually relevant search results in chat-based and API interfaces.
- Low operational costs: One of Mistral Medium 3's standout attributes is its efficiency: it offers 8× lower cost than traditional large models. With pricing as low as $0.40 per million input tokens and $2 per million output tokens, businesses can scale intelligent services affordably (see the cost sketch below).
- Enterprise-grade adaptability: Built for flexibility, Mistral Medium 3.1 supports hybrid, on-premises, and in-VPC deployment. Enterprise clients can run the model on self-hosted setups with as few as four GPUs, reducing infrastructure friction.
- Language and coding support: The model supports dozens of human languages and over 80 programming languages, making it well suited to multilingual applications, global enterprises, and developer tooling. It offers advanced function calling and agentic workflows for complex automation.
- Integration and customization: Mistral Medium 3.1 allows custom post-training, full fine-tuning, and deep integration into enterprise knowledge bases. It is engineered for adaptive, domain-specific intelligence, continuous learning, and evolving business requirements.

Enterprise Impact

Mistral Medium 3.1 is tailored for demanding professional use:

- Coding assistants: top-of-class accuracy and code generation for developer workflows.
- Document intelligence: advanced reasoning over long, complex documents, ideal for the legal, finance, and medical sectors.
- Customer engagement: personalized dialogue with deep contextual awareness.
- Secure, custom deployments: hybrid and on-premises options for data-sensitive industries.

Summary

With Mistral Medium 3.1, Mistral AI delivers a model that rivals much larger competitors in performance while maintaining radical cost-efficiency and deployment simplicity. Its multimodal strength, enterprise customization options, and robust benchmark scores make it both a technological milestone and an accessible option for organizations seeking advanced AI without prohibitive costs. For engineers, enterprises, and developers looking for a European alternative in the LLM arena, Mistral Medium 3.1 balances power, price, and practical deployability.
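
For a rough sense of what the quoted pricing implies at scale, the sketch below estimates a monthly bill from the $0.40 and $2 per-million-token figures above; the request volume and token counts are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope cost check using the pricing quoted above
# ($0.40 per million input tokens, $2 per million output tokens).
INPUT_PRICE_PER_M = 0.40
OUTPUT_PRICE_PER_M = 2.00

def monthly_cost(requests: int, in_tokens: int, out_tokens: int) -> float:
    """Estimated USD cost for a given monthly request volume."""
    total_in_m = requests * in_tokens / 1_000_000
    total_out_m = requests * out_tokens / 1_000_000
    return total_in_m * INPUT_PRICE_PER_M + total_out_m * OUTPUT_PRICE_PER_M

# e.g. 1M requests/month, 1,500 input and 400 output tokens each -> $1,400.00
print(f"${monthly_cost(1_000_000, 1_500, 400):,.2f}")
```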

Nebius AI Advances Open-Weight LLMs Through Reinforcement Learning for Capable SWE Agents

The landscape of software engineering automation is evolving rapidly, driven by advances in large language models (LLMs). However, most approaches to training capable agents rely on proprietary models or costly teacher-based methods, leaving open-weight LLMs with limited capabilities in real-world scenarios. A team of researchers from Nebius AI and Humanoid introduced a reinforcement learning framework for training long-context, multi-turn software engineering agents using a modified Decoupled Advantage Policy Optimization (DAPO) algorithm. The research describes a technical advance in applying reinforcement learning (RL) to open-source LLMs for genuine, multi-turn software engineering tasks, moving beyond the single-turn, bandit-style settings that dominate RL for LLMs today.

Beyond Single-Turn Reinforcement Learning

Most RL methods for LLMs optimize for tasks such as mathematical reasoning or one-shot code generation, where agent actions are rewarded only at the conclusion and environments provide no intermediate feedback. Software engineering (SWE) is fundamentally different: it requires agents to operate over long sequences of actions, interpret rich feedback (compiler errors, test logs), and maintain context over hundreds of thousands of tokens, far exceeding typical single-step interaction loops.

Core Challenges in RL for SWE

- Long-horizon reasoning: agents must sustain logical coherence across many steps, often requiring context windows beyond 100k tokens.
- Stateful environment feedback: actions yield meaningful, non-trivial observations (e.g., shell command outputs, test suite results) that guide subsequent decisions.
- Sparse/delayed rewards: success signals typically emerge only at the end of complex interactions, complicating credit assignment.
- Evaluation complexity: measuring progress requires full trajectory unrolling and can be noisy due to test flakiness.

The Technical Recipe: Modified DAPO and Agent Design

The research team demonstrates a two-stage learning pipeline for training a Qwen2.5-72B-Instruct agent.

1. Rejection fine-tuning (RFT). The journey begins with supervised fine-tuning. The agent is run across 7,249 rigorously filtered SWE tasks (from the SWE-REBENCH dataset). Successful interaction traces, where the agent passes the environment's test suite, are used to fine-tune the model, masking invalid environment-formatting actions during training. This alone boosts baseline accuracy from 11% to 20% on the SWE-bench Verified benchmark.

2. Reinforcement learning with modified DAPO. Building on Decoupled Advantage Policy Optimization (DAPO), several key modifications are introduced for scalability and stability (an illustrative loss sketch appears below):

- Asymmetric clipping: prevents collapse in policy entropy, maintaining exploration.
- Dynamic sample filtering: focuses optimization on trajectories with an actual learning signal.
- Length penalties: discourage excessive episode length, helping the agent avoid getting stuck in loops.
- Token-level averaging: every token in every trajectory contributes equally to the gradient, empowering longer trajectories to influence updates.

The agent uses a ReAct-style loop that combines reasoning steps with tool use. Its toolkit includes arbitrary shell commands, precise code edits, navigation/search utilities, and a submit action to signal episode completion. Each interaction is grounded in a robust sandboxed environment, initialized from real repository snapshots and backed by a GitHub-style issue prompt.
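
The following is a hedged, illustrative sketch of what a DAPO-style objective with asymmetric clipping, token-level averaging, and a length penalty can look like in PyTorch; tensor shapes, hyperparameters, and the exact form of the penalty are assumptions rather than the paper's implementation.

```python
import torch

# Illustrative DAPO-like objective: asymmetric clipping bounds
# (eps_low != eps_high), token-level averaging across all trajectories in the
# batch, and a simple length penalty. Shapes and values are assumptions.
def dapo_like_loss(logp_new, logp_old, advantages, mask,
                   eps_low=0.2, eps_high=0.28, len_penalty=1e-4):
    # logp_new, logp_old: (B, T) per-token log-probs under new/old policies
    # advantages: (B,) per-trajectory advantages; mask: (B, T) 1 = valid token
    ratio = torch.exp(logp_new - logp_old)                      # (B, T)
    adv = advantages.unsqueeze(1)                               # broadcast over tokens
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = torch.minimum(ratio * adv, clipped * adv)
    # token-level averaging: every valid token contributes equally
    policy_term = (per_token * mask).sum() / mask.sum()
    # discourage overly long episodes
    length_term = len_penalty * mask.sum(dim=1).float().mean()
    return -policy_term + length_term
```
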
Scaling to Long Contexts and Real Benchmarks

Initially trained with a context length of 65k tokens (already double that of most open models), performance stalls at 32%. A second RL phase expands the context to 131k tokens and doubles the episode-length ceiling, focusing subsequent training on only the most beneficial tasks from the pool. This enables scaling to the longer stack traces and diff histories inherent in real-world debugging and patching.

Results: Closing the Gap with Baselines

The final RL-trained agent attains 39% Pass@1 accuracy on the SWE-bench Verified benchmark, doubling the rejection fine-tuned baseline and matching the performance of cutting-edge open-weight models such as DeepSeek-V3-0324, all without teacher-based supervision. On held-out SWE-rebench splits, scores remain competitive (35% for May, 31.7% for June), indicating the method's robustness. Compared head-to-head with top open baselines and specialized SWE agents, the RL agent matches or outperforms several models, confirming the effectiveness of the RL methodology in this domain.

Model | SWE-bench Verified Pass@1 | SWE-bench Verified Pass@10 | SWE-rebench May Pass@1 | SWE-rebench May Pass@10
Qwen2.5-72B-Instruct (RL, final) | 39.04% | 58.4% | 35.0% | 52.5%
DeepSeek-V3-0324 | 39.56% | 62.2% | 36.75% | 60.0%
Qwen3-235B no-thinking | 25.84% | 54.4% | 27.25% | 57.5%
Llama4 Maverick | 15.84% | 47.2% | 19.0% | 50.0%

Pass@1 scores are averaged over 10 runs and reported as mean ± standard error.

Key Insights

- Credit assignment: RL in this sparse-reward regime remains fundamentally challenging. The paper suggests future work on reward shaping, step-level critics, or prefix-based rollouts for more granular feedback.
- Uncertainty estimation: real-world agents need to know when to abstain or express confidence. Techniques such as output entropy or explicit confidence scoring are natural next steps.
- Infrastructure: training used context parallelism (splitting long sequences over GPUs) on 16 H200 nodes, with distributed orchestration via Kubernetes and Tracto AI, and vLLM for fast inference.

Conclusion

This research validates RL as a potent paradigm for building autonomous software engineers with open-weight LLMs. By conquering long-horizon, multi-turn, real-environment tasks, the methodology paves the way for scalable, teacher-free agent development, directly leveraging the power of interaction rather than static instruction. With further refinements, such RL pipelines promise efficient, reliable, and versatile automation for the future of software engineering.
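
The Pass@1 and Pass@10 figures in the table above are the kind of metric usually computed with the standard unbiased pass@k estimator sketched below (Chen et al., 2021); whether the authors use exactly this protocol, beyond averaging Pass@1 over 10 runs, is an assumption.

```python
from math import comb

# Standard unbiased pass@k estimator: n samples drawn per task, c of them pass.
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # not enough failures to fill a k-sample draw
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))   # 0.3
print(pass_at_k(n=10, c=3, k=10))  # 1.0
```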

Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment

arXiv:2508.07750v1 Announce Type: cross Abstract: Alignment methodologies have emerged as a critical pathway for enhancing language model alignment capabilities. While SFT (supervised fine-tuning) accelerates convergence through direct token-level loss intervention, its efficacy is constrained by its offline policy trajectories. In contrast, RL (reinforcement learning) facilitates exploratory policy optimization but suffers from low sample efficiency and a stringent dependency on high-quality base models. To address these dual challenges, we propose GRAO (Group Relative Alignment Optimization), a unified framework that synergizes the respective strengths of SFT and RL through three key innovations: 1) a multi-sample generation strategy enabling comparative quality assessment via reward feedback; 2) a novel Group Direct Alignment Loss formulation leveraging intra-group relative advantage weighting; and 3) reference-aware parameter updates guided by pairwise preference dynamics. Our theoretical analysis establishes GRAO's convergence guarantees and sample-efficiency advantages over conventional approaches. Comprehensive evaluations across complex human alignment tasks demonstrate GRAO's superior performance, achieving 57.70%, 17.65%, 7.95%, and 5.18% relative improvements over SFT, DPO, PPO, and GRPO baselines, respectively. This work provides both a theoretically grounded alignment framework and empirical evidence for efficient capability evolution in language models.
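
As a minimal illustration of the group-relative weighting at the heart of GRAO (shared with GRPO-style methods), the sketch below normalizes per-response rewards within a group sampled for the same prompt; the paper's full loss additionally uses reference-aware updates and pairwise preferences, which are not reproduced here.

```python
import numpy as np

# Sample a group of G responses for one prompt, score them with a reward
# model, and weight each response by its advantage relative to the group.
def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: shape (G,), rewards for G responses to the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

print(group_relative_advantages(np.array([0.1, 0.7, 0.4, 0.9])))
```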

Exploring Procedural Data Generation for Automatic Acoustic Guitar Fingerpicking Transcription

arXiv:2508.07987v1 Announce Type: cross Abstract: Automatic transcription of acoustic guitar fingerpicking performances remains a challenging task due to the scarcity of labeled training data and the legal constraints associated with musical recordings. This work investigates a procedural data generation pipeline as an alternative to real audio recordings for training transcription models. Our approach synthesizes training data in four stages: knowledge-based fingerpicking tablature composition, MIDI performance rendering, physical modeling using an extended Karplus-Strong algorithm, and audio augmentation including reverb and distortion. We train and evaluate a CRNN-based note-tracking model on both real and synthetic datasets, demonstrating that procedural data can be used to achieve reasonable note-tracking results. Fine-tuning with a small amount of real data further enhances transcription accuracy, improving over models trained exclusively on real recordings. These results highlight the potential of procedurally generated audio for data-scarce music information retrieval tasks.
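
For readers unfamiliar with the synthesis stage, below is a minimal sketch of the basic Karplus-Strong plucked-string algorithm that the pipeline's physical-modeling stage extends; the paper's specific extensions and its reverb/distortion augmentation are not reproduced here.

```python
import numpy as np

# Basic Karplus-Strong string synthesis: a noise-filled delay line whose
# samples are repeatedly averaged (low-pass filtered) and damped.
def karplus_strong(frequency: float, duration: float, sr: int = 44100,
                   decay: float = 0.996) -> np.ndarray:
    n = int(sr * duration)
    period = int(sr / frequency)
    buf = np.random.uniform(-1.0, 1.0, period)  # excitation: white-noise burst
    out = np.empty(n)
    for i in range(n):
        out[i] = buf[i % period]
        # averaging adjacent samples low-pass filters the loop; decay damps the string
        buf[i % period] = decay * 0.5 * (buf[i % period] + buf[(i + 1) % period])
    return out

note = karplus_strong(196.0, 1.5)  # roughly the open G string of a guitar
```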

SAEMark: Multi-bit LLM Watermarking with Inference-Time Scaling

arXiv:2508.08211v1 Announce Type: new Abstract: Watermarking LLM-generated text is critical for content attribution and misinformation prevention. However, existing methods compromise text quality and require white-box model access and logit manipulation. These limitations exclude API-based models and multilingual scenarios. We propose SAEMark, a general framework for post-hoc multi-bit watermarking that embeds personalized messages solely via inference-time, feature-based rejection sampling, without altering model logits or requiring training. Our approach operates on deterministic features extracted from generated text, selecting outputs whose feature statistics align with key-derived targets. The framework naturally generalizes across languages and domains while preserving text quality, because it samples LLM outputs instead of modifying them. We provide theoretical guarantees, holding for any suitable feature extractor, that relate watermark success probability to compute budget. Empirically, we demonstrate the framework's effectiveness using Sparse Autoencoders (SAEs), achieving superior detection accuracy and text quality. Experiments across 4 datasets show SAEMark's consistent performance, with 99.7% F1 on English and strong multi-bit detection accuracy. SAEMark establishes a new paradigm for scalable watermarking that works out of the box with closed-source LLMs while enabling content attribution.
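
A hedged sketch of the general idea, inference-time feature-based rejection sampling keyed to a secret, is shown below; `generate` and `feature_stat` are hypothetical placeholders (the paper's actual features come from sparse autoencoders), and the key-to-target mapping here is purely illustrative.

```python
import hashlib
from typing import Callable, List

# Derive a pseudo-random target from the secret key and message, sample
# candidate outputs, and keep the one whose deterministic feature statistic
# is closest to the target; detection recomputes the same quantities.
def key_to_target(key: str, message: str) -> float:
    digest = hashlib.sha256((key + message).encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF          # target in [0, 1]

def watermark_sample(prompt: str, key: str, message: str,
                     generate: Callable[[str], str],
                     feature_stat: Callable[[str], float],
                     n: int = 16) -> str:
    target = key_to_target(key, message)
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return min(candidates, key=lambda c: abs(feature_stat(c) - target))
```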

Improved Personalized Headline Generation via Denoising Fake Interests from Implicit Feedback

arXiv:2508.07178v1 Announce Type: new Abstract: Accurate personalized headline generation hinges on precisely capturing user interests from historical behaviors. However, existing methods neglect click noise, irrelevant to personalization, that pervades entire historical clickstreams, which may lead to hallucinated headlines that deviate from genuine user preferences. In this paper, we reveal the detrimental impact of click noise on personalized generation quality through rigorous analysis in both the user and news dimensions. Based on these insights, we propose a novel Personalized Headline Generation framework via Denoising Fake Interests from Implicit Feedback (PHG-DIF). PHG-DIF first employs dual-stage filtering to remove clickstream noise, identified by short dwell times and abnormal click bursts, and then leverages multi-level temporal fusion to dynamically model users' evolving and multi-faceted interests for precise profiling. Moreover, we release DT-PENS, a new benchmark dataset comprising the click behavior of 1,000 carefully curated users and nearly 10,000 personalized headlines annotated with historical dwell times. Extensive experiments demonstrate that PHG-DIF substantially mitigates the adverse effects of click noise and significantly improves headline quality, achieving state-of-the-art (SOTA) results on DT-PENS. Our framework implementation and dataset are available at https://github.com/liukejin-up/PHG-DIF.
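
Below is a hedged sketch of what the dual-stage denoising could look like in code: first dropping clicks with very short dwell times, then dropping clicks that fall inside abnormal bursts. The thresholds and data layout are illustrative assumptions, not the paper's settings.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Click:
    news_id: str
    timestamp: float   # seconds
    dwell_time: float  # seconds

def denoise_clicks(clicks: List[Click],
                   min_dwell: float = 5.0,
                   burst_window: float = 10.0,
                   burst_size: int = 5) -> List[Click]:
    # Stage 1: remove clicks with very short dwell times
    kept = [c for c in clicks if c.dwell_time >= min_dwell]
    # Stage 2: remove clicks that sit inside abnormal bursts
    kept.sort(key=lambda c: c.timestamp)
    filtered = []
    for c in kept:
        neighbours = [o for o in kept if abs(o.timestamp - c.timestamp) <= burst_window]
        if len(neighbours) < burst_size:
            filtered.append(c)
    return filtered
```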

ALOPE: Adaptive Layer Optimization for Translation Quality Estimation using Large Language Models

arXiv:2508.07484v1 Announce Type: new Abstract: Large Language Models (LLMs) have shown remarkable performance across a wide range of natural language processing tasks. Quality Estimation (QE) for Machine Translation (MT), which assesses the quality of a source-target pair without relying on reference translations, remains a challenging cross-lingual task for LLMs. The challenges stem from the inherent limitations of existing LLM-based QE systems, which are pre-trained for causal language modelling rather than regression-specific tasks, and are further exacerbated for low-resource languages given the pre-training data distribution. This paper introduces ALOPE, an adaptive layer-optimization framework designed to enhance LLM-based QE by restructuring Transformer representations through layer-wise adaptation for improved regression-based prediction. Our framework integrates low-rank adapters (LoRA) with regression task heads, leveraging selected pre-trained Transformer layers for improved cross-lingual alignment. In addition to the layer-specific adaptation, ALOPE introduces two strategies: dynamic weighting, which adaptively combines representations from multiple layers, and multi-head regression, which aggregates regression losses from multiple heads for QE. Our framework shows improvements over various existing LLM-based QE approaches. Empirical evidence suggests that intermediate Transformer layers in LLMs provide contextual representations that are better aligned with the cross-lingual nature of the QE task. We make the resulting models and framework code publicly available for further research, also allowing existing LLM-based MT frameworks to be extended with QE capabilities.
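
As an illustration of the dynamic-weighting strategy, the sketch below learns a softmax weight per selected Transformer layer, pools the mixed hidden states, and regresses a quality score; the layer selection, pooling, and sizes are assumptions, and the LoRA adapters and multi-head regression the paper combines this with are omitted.

```python
import torch
import torch.nn as nn

# Learn one weight per selected layer, mix the layer hidden states with a
# softmax over those weights, mean-pool over valid tokens, and regress a score.
class DynamicLayerRegressor(nn.Module):
    def __init__(self, hidden_size: int, num_layers: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, layer_states: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # layer_states: (L, B, T, H) hidden states of the L selected layers
        # mask: (B, T) attention mask (1 = real token, 0 = padding)
        w = torch.softmax(self.layer_logits, dim=0).view(-1, 1, 1, 1)
        mixed = (w * layer_states).sum(dim=0)                          # (B, T, H)
        pooled = (mixed * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)
        return self.head(pooled).squeeze(-1)                           # (B,) scores
```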

Adversarial Topic-aware Prompt-tuning for Cross-topic Automated Essay Scoring

arXiv:2508.05987v1 Announce Type: new Abstract: Cross-topic automated essay scoring (AES) aims to develop a transferable model capable of effectively evaluating essays on a target topic. A significant challenge in this domain arises from the inherent discrepancies between topics. While existing methods predominantly focus on extracting topic-shared features through distribution alignment of source and target topics, they often neglect topic-specific features, limiting their ability to assess critical traits such as topic adherence. To address this limitation, we propose Adversarial TOpic-aware Prompt-tuning (ATOP), a novel method that jointly learns topic-shared and topic-specific features to improve cross-topic AES. ATOP achieves this by optimizing a learnable topic-aware prompt, comprising both shared and specific components, to elicit relevant knowledge from pre-trained language models (PLMs). To enhance the robustness of topic-shared prompt learning and mitigate the feature-scale sensitivity introduced by topic alignment, we incorporate adversarial training within a unified regression and classification framework. In addition, we employ a neighbor-based classifier to model the local structure of essay representations and generate pseudo-labels for target-topic essays. These pseudo-labels are then used to guide the supervised learning of topic-specific prompts tailored to the target topic. Extensive experiments on the publicly available ASAP++ dataset demonstrate that ATOP significantly outperforms existing state-of-the-art methods in both holistic and multi-trait essay scoring. The implementation of our method is publicly available at: https://anonymous.4open.science/r/ATOP-A271.
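
A minimal, hedged sketch of the neighbor-based pseudo-labeling step is shown below: each target-topic essay receives the average score of its k nearest labeled source-topic essays in representation space. The representation source, similarity metric, and k are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

# Cosine-similarity kNN over essay representations: average the scores of the
# k nearest labeled source essays to pseudo-label each target essay.
def knn_pseudo_labels(target_reprs: np.ndarray,
                      source_reprs: np.ndarray,
                      source_scores: np.ndarray,
                      k: int = 5) -> np.ndarray:
    t = target_reprs / np.linalg.norm(target_reprs, axis=1, keepdims=True)
    s = source_reprs / np.linalg.norm(source_reprs, axis=1, keepdims=True)
    sims = t @ s.T                                   # (n_target, n_source)
    nn_idx = np.argsort(-sims, axis=1)[:, :k]        # indices of k nearest sources
    return source_scores[nn_idx].mean(axis=1)        # pseudo-score per target essay
```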
