
The Download: China’s AI agent boom, and GPS alternatives

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

Manus has kick-started an AI agent boom in China

Last year, China saw a boom in foundation models, the do-everything large language models that underpin the AI revolution. This year, the focus has shifted to AI agents—systems that are less about responding to users’ queries and more about autonomously accomplishing things for them.

There are now a host of Chinese startups building these general-purpose digital tools, which can answer emails, browse the internet to plan vacations, and even design an interactive website. Many of these have emerged in just the last two months, following in the footsteps of Manus—a general AI agent that sparked weeks of social media frenzy for invite codes after its limited-release launch in early March.

As the race to define what a useful AI agent looks like unfolds, a mix of ambitious startups and entrenched tech giants are now testing how these tools might actually work in practice—and for whom. Read the full story.

—Caiwei Chen

Inside the race to find GPS alternatives

Later this month, an inconspicuous 150-kilogram satellite is set to launch into space aboard the SpaceX Transporter 14 mission. Once in orbit, it will test super-accurate next-generation satnav technology designed to make up for the shortcomings of the US Global Positioning System (GPS).

Despite the system’s indispensable nature, the GPS signal is easily suppressed or disrupted by everything from space weather to 5G cell towers to phone-size jammers that cost a few tens of dollars. The problem has been whispered about among experts for years, but it has really come to the fore in the last three years, since Russia invaded Ukraine.

Now, startup Xona Space Systems wants to create a space-based system that would do what GPS does, but better. Read the full story.

—Tereza Pultarova

Why doctors should look for ways to prescribe hope

—Jessica Hamzelou

This week, I’ve been thinking about the powerful connection between mind and body. Some new research suggests that people with heart conditions have better outcomes when they are more hopeful and optimistic. Hopelessness, on the other hand, is associated with a significantly higher risk of death.

The findings build upon decades of fascinating research into the phenomenon of the placebo effect. Our beliefs and expectations about a medicine (or a sham treatment) can change the way it works. The placebo effect’s “evil twin,” the nocebo effect, is just as powerful—negative thinking has been linked to real symptoms.

Researchers are still trying to understand the connection between body and mind, and how our thoughts can influence our physiology. In the meantime, many are developing ways to harness it in hospital settings. Is it possible for a doctor to prescribe hope? Read the full story.

This article first appeared in The Checkup, MIT Technology Review’s weekly biotech newsletter. To receive it in your inbox every Thursday, and read articles like this first, sign up here.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 Elon Musk threatened to cut off NASA’s use of SpaceX’s Dragon spacecraft
His war of words with Donald Trump is dramatically escalating. (WP $)
+ If Musk actually carried through with his threat, NASA would seriously struggle. (NYT $)
+ Silicon Valley is starting to pick sides. (Wired $)
+ It appears as though Musk has more to lose from their bruising breakup. (NY Mag $)

2 Apple and Alibaba’s AI rollout in China has been delayed
It’s the latest victim of Trump’s trade war. (FT $)
+ The deal is supposed to support iPhones’ AI offerings in the country. (Reuters)

3 X’s new policy blocks the use of its posts to ‘fine-tune or train’ AI models
Unless companies strike a deal with it, that is. (TechCrunch)
+ The platform could end up striking agreements like Reddit and Google. (The Verge)

4 RFK Jr’s new hire is hunting for proof that vaccines cause autism
Vaccine skeptic David Geier is seeking access to a database he was previously barred from. (WSJ $)
+ How measuring vaccine hesitancy could help health professionals tackle it. (MIT Technology Review)

5 Anthropic has launched a new service for the military
Claude Gov is designed specifically for US defense and intelligence agencies. (The Verge)
+ Generative AI is learning to spy for the US military. (MIT Technology Review)

6 There’s no guarantee your billion-dollar startup won’t fail
In fact, one in five of them will. (Bloomberg $)
+ Beware the rise of the AI coding startup. (Reuters)

7 Walmart’s drone deliveries are taking off
It’s expanding to 100 new US stores in the next year. (Wired $)

8 AI might be able to tell us how old the Dead Sea Scrolls really are
Models suggest they’re even older than we previously thought. (The Economist $)
+ How AI is helping historians better understand our past. (MIT Technology Review)

9 All-in-one super apps are a hit in the Gulf
They’re following in China’s footsteps. (Rest of World)

10 Nintendo’s Switch 2 has revived the midnight launch event
Fans queued for hours outside stores to get their hands on the new console. (Insider $)
+ How the company managed to dodge Trump’s tariffs. (The Guardian)

Quote of the day

“Elon finally found a way to make Twitter fun again.”

—Dan Pfeiffer, a host of the political podcast Pod Save America, jokes about Elon Musk and Donald Trump’s ongoing feud in a post on X.

One more thing

This rare earth metal shows us the future of our planet’s resources

We’re in the middle of a potentially transformative moment. Metals discovered barely a century ago now underpin the technologies we’re relying on for cleaner energy, and not having enough of them could slow progress.

Take neodymium, one of the rare earth metals. It’s used in cryogenic coolers to reach the ultra-low temperatures needed for devices like superconductors, and in high-powered magnets that power everything from smartphones to wind turbines. And very soon, demand for it could outstrip supply. What happens then? And what


Inducing lexicons of in-group language with socio-temporal context

arXiv:2409.19257v3 Announce Type: replace Abstract: In-group language is an important signifier of group dynamics. This paper proposes a novel method for inducing lexicons of in-group language, which incorporates its socio-temporal context. Existing methods for lexicon induction do not capture the evolving nature of in-group language, nor the social structure of the community. Using dynamic word and user embeddings trained on conversations from online anti-women communities, our approach outperforms prior methods for lexicon induction. We develop a test set for the task of lexicon induction and a new lexicon of manosphere language, validated by human experts, which quantifies the relevance of each term to a specific sub-community at a given point in time. Finally, we present novel insights on in-group language which illustrate the utility of this approach.
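The abstract leaves the implementation open, but one plausible reading of the approach is to rank candidate terms by how close their time-indexed embeddings sit to a community representation built from user embeddings. The sketch below illustrates that idea only; the function names and the centroid heuristic are assumptions, not the paper's actual method.

```python
# Illustrative sketch: score candidate terms against a community centroid
# for a single time slice. The centroid heuristic is an assumption.
import numpy as np

def induce_lexicon(word_vecs: dict[str, np.ndarray],
                   user_vecs: list[np.ndarray],
                   top_k: int = 50) -> list[tuple[str, float]]:
    """Rank candidate in-group terms for one community at one point in time."""
    centroid = np.mean(user_vecs, axis=0)          # community representation
    centroid = centroid / np.linalg.norm(centroid)
    scores = {
        term: float(vec @ centroid / np.linalg.norm(vec))  # cosine similarity
        for term, vec in word_vecs.items()
    }
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]
```

Rerunning this per time slice with dynamic embeddings is what lets the induced lexicon track the evolving nature of in-group language.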


Matter-of-Fact: A Benchmark for Verifying the Feasibility of Literature-Supported Claims in Materials Science

arXiv:2506.04410v1 Announce Type: cross Abstract: Contemporary approaches to assisted scientific discovery use language models to automatically generate large numbers of potential hypotheses to test, while also automatically generating code-based experiments to test those hypotheses. While hypotheses can be comparatively inexpensive to generate, automated experiments can be costly, particularly when run at scale (i.e. thousands of experiments). Developing the capacity to filter hypotheses based on their feasibility would allow discovery systems to run at scale, while increasing their likelihood of making significant discoveries. In this work we introduce Matter-of-Fact, a challenge dataset for determining the feasibility of hypotheses framed as claims. Matter-of-Fact includes 8.4k claims extracted from scientific articles spanning four high-impact contemporary materials science topics, including superconductors, semiconductors, batteries, and aerospace materials, while including qualitative and quantitative claims from theoretical, experimental, and code/simulation results. We show that strong baselines that include retrieval augmented generation over scientific literature and code generation fail to exceed 72% performance on this task (chance performance is 50%), while domain-expert verification suggests nearly all are solvable — highlighting both the difficulty of this task for current models, and the potential to accelerate scientific discovery by making near-term progress.
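Since chance performance is 50%, the task is effectively binary classification of claims as feasible or not. A minimal sketch of how such a benchmark might be scored is below; the field names and predictor stub are hypothetical, not the dataset's actual schema.

```python
# Hypothetical scoring harness for a binary claim-feasibility benchmark.
# Field names ("claim", "label") and predict() are illustrative assumptions.
from typing import Callable

def accuracy(dataset: list[dict], predict: Callable[[str], bool]) -> float:
    """Fraction of claims whose predicted feasibility matches the gold label."""
    correct = sum(predict(ex["claim"]) == ex["label"] for ex in dataset)
    return correct / len(dataset)

if __name__ == "__main__":
    toy = [
        {"claim": "Compound X superconducts above 40 K at ambient pressure.", "label": False},
        {"claim": "Doping silicon with phosphorus increases carrier density.", "label": True},
    ]
    # A constant predictor illustrates the 50%-chance baseline on balanced data.
    print(accuracy(toy, lambda claim: True))  # 0.5
```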


Not All Options Are Created Equal: Textual Option Weighting for Token-Efficient LLM-Based Knowledge Tracing

arXiv:2410.12872v2 Announce Type: replace Abstract: Large Language Models (LLMs) have recently emerged as promising tools for knowledge tracing (KT) due to their strong reasoning and generalization abilities. While recent LLM-based KT methods have proposed new prompt formats, they struggle to represent the full interaction histories of example learners within a single prompt during in-context learning (ICL), resulting in limited scalability and high computational cost under token constraints. In this work, we present LLM-based Option-weighted Knowledge Tracing (LOKT), a simple yet effective framework that encodes the interaction histories of example learners in context as textual categorical option weights (TCOW). TCOW are semantic labels (e.g., “inadequate”) assigned to the options selected by learners when answering questions, enhancing the interpretability of LLMs. Experiments on multiple-choice datasets show that LOKT outperforms existing non-LLM and LLM-based KT models in both cold-start and warm-start settings. Moreover, LOKT enables scalable and cost-efficient inference, achieving strong performance even under strict token constraints. Our code is available at https://anonymous.4open.science/r/LOKT_model-3233.
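To make the TCOW idea concrete, the sketch below buckets numeric option weights into semantic labels and serializes a learner's history compactly for in-context use. The thresholds and label vocabulary are illustrative assumptions, not taken from the paper, which the abstract only says uses labels such as "inadequate".

```python
# Illustrative sketch of textual categorical option weights (TCOW):
# numeric option weights are mapped to semantic labels, then a learner's
# interaction history is serialized into a short, token-efficient string.
def tcow_label(weight: float) -> str:
    # Thresholds and label set are assumptions for illustration.
    if weight >= 0.75:
        return "strong"
    if weight >= 0.5:
        return "adequate"
    if weight >= 0.25:
        return "weak"
    return "inadequate"

def encode_history(history: list[tuple[str, float]]) -> str:
    # history: (question_id, weight of the option the learner selected)
    return "; ".join(f"{qid}:{tcow_label(w)}" for qid, w in history)

print(encode_history([("Q1", 0.9), ("Q2", 0.3), ("Q3", 0.1)]))
# Q1:strong; Q2:weak; Q3:inadequate  -- far fewer tokens than raw transcripts
```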


Teaching AI to Say ‘I Don’t Know’: A New Dataset Mitigates Hallucinations from Reinforcement Finetuning

Reinforcement finetuning uses reward signals to guide a large language model toward desirable behavior. This method sharpens the model’s ability to produce logical and structured outputs by reinforcing correct responses. Yet a challenge persists: ensuring that these models also know when not to respond, particularly when faced with incomplete or misleading questions that don’t have a definite answer.

The problem arises when language models, after reinforcement finetuning, begin to lose their ability to refuse to answer unclear or ambiguous queries. Instead of signaling uncertainty, the models tend to produce confidently stated but incorrect responses. This phenomenon, identified in the paper as the “hallucination tax,” highlights a growing risk: as models are trained to perform better, they may also become more likely to hallucinate answers in situations where silence would be more appropriate. This is especially hazardous in domains that require high trust and precision.

Tools currently used in training large language models often overlook the importance of refusal behavior. Reinforcement finetuning frameworks tend to reward only correct answers while penalizing incorrect ones, ignoring cases where a valid response should be no answer at all. The reward systems in use do not sufficiently reinforce refusal, resulting in overconfident models. For instance, the paper shows that refusal rates dropped to near zero across multiple models after standard RFT, demonstrating that current training fails to address hallucination properly.

Researchers from the University of Southern California developed the Synthetic Unanswerable Math (SUM) dataset. SUM introduces implicitly unanswerable math problems by modifying existing questions, for example by removing key information or creating logical inconsistencies. The researchers used DeepScaleR as the base dataset and employed the o3-mini model to generate high-quality unanswerable questions. This synthetic dataset aims to teach models to recognize when a problem lacks sufficient information and respond accordingly.

SUM’s core technique is to mix answerable and unanswerable problems during training. Questions are modified to become ambiguous or unsolvable while maintaining plausibility. The training prompts instruct models to say “I don’t know” for unanswerable inputs. By introducing only 10% of the SUM data into reinforcement finetuning, models begin to leverage inference-time reasoning to evaluate uncertainty. This structure allows them to refuse answers more appropriately without impairing their performance on solvable problems.
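A minimal sketch of this recipe, mixing roughly 10% unanswerable problems into the RFT data and rewarding refusals on them, is shown below. The reward values and the substring-based refusal check are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch, under stated assumptions: blend SUM-style unanswerable problems
# into the RFT pool and grant reward for refusing them instead of answering.
import random

REFUSAL = "I don't know"

def mix_data(answerable: list[dict], unanswerable: list[dict],
             sum_ratio: float = 0.10) -> list[dict]:
    """Build a training pool where ~sum_ratio of examples are unanswerable."""
    n_sum = int(len(answerable) * sum_ratio / (1 - sum_ratio))
    pool = answerable + random.sample(unanswerable, n_sum)
    random.shuffle(pool)
    return pool

def reward(example: dict, response: str) -> float:
    """Reward correct answers on solvable problems and refusals otherwise."""
    if not example["answerable"]:
        # Refusal is the valid response here; reward it rather than penalize.
        return 1.0 if REFUSAL.lower() in response.lower() else 0.0
    return 1.0 if response.strip() == example["answer"] else 0.0
```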
Performance analysis shows significant improvements. After training with SUM, the Qwen2.5-7B model increased its refusal rate from 0.01 to 0.73 on the SUM benchmark and from 0.01 to 0.81 on the UMWP benchmark. On the SelfAware dataset, refusal accuracy rose dramatically from 0.01 to 0.94. Llama-3.1-8B-Instruct showed a similar trend, with refusal rates improving from 0.00 to 0.75 on SUM and from 0.01 to 0.79 on UMWP. Despite these gains in refusal behavior, accuracy on answerable datasets such as GSM8K and MATH-500 remained stable, with most changes ranging from 0.00 to -0.05. The minimal drop indicates that refusal training can be introduced without major sacrifices in task performance.

This study outlines a clear trade-off between improved reasoning and trustworthiness. Reinforcement finetuning, while powerful, tends to suppress cautious behavior. The SUM dataset corrects this by teaching models to recognize what they cannot solve. With only a small addition to training data, language models become better at identifying the boundaries of their knowledge. This approach marks a significant step in making AI systems not just smarter but also more careful and honest.

Check out the Paper and Dataset on Hugging Face. All credit for this research goes to the researchers of this project.

The post Teaching AI to Say ‘I Don’t Know’: A New Dataset Mitigates Hallucinations from Reinforcement Finetuning appeared first on MarkTechPost.


Alibaba Qwen Team Releases Qwen3-Embedding and Qwen3-Reranker Series – Redefining Multilingual Embedding and Ranking Standards

Text embedding and reranking are foundational to modern information retrieval systems, powering applications such as semantic search, recommendation systems, and retrieval-augmented generation (RAG). However, current approaches often face key challenges—particularly in achieving both high multilingual fidelity and task adaptability without relying on proprietary APIs. Existing models frequently fall short in scenarios requiring nuanced semantic understanding across multiple languages or domain-specific tasks like code retrieval and instruction following. Moreover, most open-source models either lack scale or flexibility, while commercial APIs remain costly and closed.

Qwen3-Embedding and Qwen3-Reranker: A New Standard for Open-Source Embedding

Alibaba’s Qwen team has unveiled the Qwen3-Embedding and Qwen3-Reranker series—models that set a new benchmark in multilingual text embedding and relevance ranking. Built on the Qwen3 foundation models, the series includes variants in 0.6B, 4B, and 8B parameter sizes and supports 119 languages, making it one of the most versatile and performant open-source offerings to date. The models are open-sourced under the Apache 2.0 license on Hugging Face, GitHub, and ModelScope, and are also accessible via Alibaba Cloud APIs.

These models are optimized for use cases such as semantic retrieval, classification, RAG, sentiment analysis, and code search—providing a strong alternative to existing solutions like Gemini Embedding and OpenAI’s embedding APIs.

Technical Architecture

Qwen3-Embedding models adopt a dense transformer-based architecture with causal attention, producing embeddings by extracting the hidden state corresponding to the [EOS] token. Instruction awareness is a key feature: input queries are formatted as {instruction} {query}<|endoftext|>, enabling task-conditioned embeddings. The reranker models are trained with a binary classification format, judging document-query relevance in an instruction-guided manner using a token-likelihood-based scoring function.

The models are trained using a robust multi-stage training pipeline:

- Large-scale weak supervision: 150M synthetic training pairs generated using Qwen3-32B, covering retrieval, classification, STS, and bitext mining across languages and tasks.
- Supervised fine-tuning: 12M high-quality pairs, selected by cosine similarity (>0.7), are used to fine-tune performance in downstream applications.
- Model merging: spherical linear interpolation (SLERP) of multiple fine-tuned checkpoints ensures robustness and generalization.

This synthetic data generation pipeline enables control over data quality, language diversity, and task difficulty, resulting in a high degree of coverage and relevance in low-resource settings.
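To make the embedding interface concrete, here is a minimal sketch of instruction-aware embedding extraction with last-token pooling as described above. It assumes the Hugging Face model ID Qwen/Qwen3-Embedding-0.6B and standard transformers APIs; treat it as an illustration of the scheme rather than the official usage.

```python
# Minimal sketch: instruction-aware embeddings via last-token ([EOS]) pooling.
# Assumes the checkpoint "Qwen/Qwen3-Embedding-0.6B"; the official usage may
# differ in details such as prompt templates and pooling helpers.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "Qwen/Qwen3-Embedding-0.6B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, padding_side="left")
model = AutoModel.from_pretrained(MODEL_ID)

def embed(texts: list[str], instruction: str = "") -> torch.Tensor:
    # Task-conditioned format: "{instruction} {query}<|endoftext|>"
    texts = [f"{instruction} {t}".strip() + "<|endoftext|>" for t in texts]
    batch = tokenizer(texts, padding=True, truncation=True,
                      add_special_tokens=False, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    # With left padding, position -1 holds each sequence's final token,
    # whose hidden state serves as the embedding.
    return F.normalize(out.last_hidden_state[:, -1], dim=-1)

docs = embed(["Qwen3-Embedding supports 119 languages."])
query = embed(["How many languages does Qwen3-Embedding support?"],
              instruction="Given a web search query, retrieve relevant passages.")
print(query @ docs.T)  # cosine similarity, since vectors are L2-normalized
```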
Performance Benchmarks and Insights

The Qwen3-Embedding and Qwen3-Reranker series demonstrate strong empirical performance across several multilingual benchmarks:

- On MMTEB (216 tasks across 250+ languages), Qwen3-Embedding-8B achieves a mean task score of 70.58, surpassing Gemini and the GTE-Qwen2 series.
- On MTEB (English v2), Qwen3-Embedding-8B reaches 75.22, outperforming other open models including NV-Embed-v2 and GritLM-7B.
- On MTEB-Code, Qwen3-Embedding-8B leads with 80.68, excelling in applications like code retrieval and Stack Overflow QA.
- For reranking, Qwen3-Reranker-0.6B already outperforms the Jina and BGE rerankers, and Qwen3-Reranker-8B achieves 81.22 on MTEB-Code and 72.94 on MMTEB-R, marking state-of-the-art performance.

Ablation studies confirm the necessity of each training stage. Removing synthetic pretraining or model merging led to significant performance drops (up to 6 points on MMTEB), emphasizing their contributions.

Conclusion

Alibaba’s Qwen3-Embedding and Qwen3-Reranker series present a robust, open, and scalable solution to multilingual and instruction-aware semantic representation. With strong empirical results across MTEB, MMTEB, and MTEB-Code, these models bridge the gap between proprietary APIs and open-source accessibility. Their thoughtful training design—leveraging high-quality synthetic data, instruction tuning, and model merging—positions them as ideal candidates for enterprise applications in search, retrieval, and RAG pipelines. By open-sourcing these models, the Qwen team not only pushes the boundaries of language understanding but also empowers the broader community to innovate on top of a solid foundation.

Check out the Paper, Technical details, Qwen3-Embedding and Qwen3-Reranker. All credit for this research goes to the researchers of this project.

The post Alibaba Qwen Team Releases Qwen3-Embedding and Qwen3-Reranker Series – Redefining Multilingual Embedding and Ranking Standards appeared first on MarkTechPost.


Manus has kick-started an AI agent boom in China

Last year, China saw a boom in foundation models, the do-everything large language models that underpin the AI revolution. This year, the focus has shifted to AI agents—systems that are less about responding to users’ queries and more about autonomously accomplishing things for them.

There are now a host of Chinese startups building these general-purpose digital tools, which can answer emails, browse the internet to plan vacations, and even design an interactive website. Many of these have emerged in just the last two months, following in the footsteps of Manus—a general AI agent that sparked weeks of social media frenzy for invite codes after its limited-release launch in early March.

These emerging AI agents aren’t large language models themselves. Instead, they’re built on top of them, using a workflow-based structure designed to get things done. A lot of these systems also introduce a different way of interacting with AI. Rather than just chatting back and forth with users, they are optimized for managing and executing multistep tasks—booking flights, managing schedules, conducting research—by using external tools and remembering instructions.

China could take the lead on building these kinds of agents. The country’s tightly integrated app ecosystems, rapid product cycles, and digitally fluent user base could provide a favorable environment for embedding AI into daily life.

For now, its leading AI agent startups are focusing their attention on the global market, because the best Western models don’t operate inside China’s firewalls. But that could change soon: tech giants like ByteDance and Tencent are preparing their own AI agents that could bake automation directly into their native super-apps, pulling data from their vast ecosystems of programs that dominate many aspects of daily life in the country.

As the race to define what a useful AI agent looks like unfolds, a mix of ambitious startups and entrenched tech giants are now testing how these tools might actually work in practice—and for whom.

Set the standard

It’s been a whirlwind few months for Manus, which was developed by the Wuhan-based startup Butterfly Effect. The company raised $75 million in a funding round led by the US venture capital firm Benchmark, took the product on an ambitious global roadshow, and hired dozens of new employees.

Even before registration opened to the public in May, Manus had become a reference point for what a broad, consumer-oriented AI agent should accomplish. Rather than handling narrow chores for businesses, this “general” agent is designed to help with everyday tasks like trip planning, stock comparison, or your kid’s school project.

Unlike previous AI agents, Manus uses a browser-based sandbox that lets users supervise the agent like an intern, watching in real time as it scrolls through web pages, reads articles, or codes actions. It also proactively asks clarifying questions and supports long-term memory that serves as context for future tasks.

“Manus represents a promising product experience for AI agents,” says Ang Li, cofounder and CEO of Simular, a startup based in Palo Alto, California, that’s building computer-use agents, AI agents that control a virtual computer. “I believe Chinese startups have a huge advantage when it comes to designing consumer products, thanks to cutthroat domestic competition that leads to fast execution and greater attention to product details.”

In the case of Manus, the competition is moving fast.
Two of the buzziest follow-ups, Genspark and Flowith, are already boasting benchmark scores that match or edge past Manus’s.

Genspark, led by former Baidu executives Eric Jing and Kay Zhu, links many small “super agents” through what it calls multi-component prompting. The agent can switch among several large language models, accepts both images and text, and carries out tasks from making slide decks to placing phone calls. Whereas Manus relies heavily on Browser Use, a popular open-source product that lets agents operate a web browser in a virtual window like a human, Genspark directly integrates with a wide array of tools and APIs. The product launched in April, and the company says it already has over 5 million users and over $36 million in yearly revenue.

Flowith, the work of a young team that first grabbed public attention in April 2025 at a developer event hosted by the popular social media app Xiaohongshu, takes a different tack. Marketed as an “infinite agent,” it opens on a blank canvas where each question becomes a node on a branching map. Users can backtrack, take new branches, and store results in personal or sharable “knowledge gardens”—a design that feels more like project management software (think Notion) than a typical chat interface. Every inquiry or task builds its own mind-map-like graph, encouraging a more nonlinear and creative interaction with AI. Flowith’s core agent, NEO, runs in the cloud and can perform scheduled tasks like sending emails and compiling files. The founders want the app to be a “knowledge marketbase” and aim to tap into the social aspect of AI, with the aspiration of becoming “the OnlyFans of AI knowledge creators.”

What they also share with Manus is global ambition. Both Genspark and Flowith have stated that their primary focus is the international market.

A global address

Startups like Manus, Genspark, and Flowith—though founded by Chinese entrepreneurs—could blend seamlessly into the global tech scene and compete effectively abroad. Founders, investors, and analysts that MIT Technology Review has spoken to believe Chinese companies are moving fast, executing well, and quickly coming up with new products.

Money reinforces the pull to launch overseas. Customers there pay more, and there are plenty to go around. “You can price in USD, and with the exchange rate that’s a sevenfold multiplier,” Manus cofounder Xiao Hong quipped on a podcast. “Even if we’re only operating at 10% power because of cultural differences overseas, we’ll still make more than in China.”

But creating the same functionality in China is a challenge. Major US AI companies including OpenAI and Anthropic have opted out of mainland China because of geopolitical risks and challenges with regulatory compliance. Their absence initially created a black market as users
