The Dual-Route Model of Induction

arXiv:2504.03022v2 Announce Type: replace Abstract: Prior work on in-context copying has shown the existence of induction heads, which attend to and promote individual tokens during copying. In this work we discover a new type of induction head: concept-level induction heads, which copy entire lexical units instead of individual tokens. Concept induction heads learn to attend to the ends of multi-token words throughout training, working in parallel with token-level induction heads to copy meaningful text. We show that these heads are responsible for semantic tasks like word-level translation, whereas token induction heads are vital for tasks that can only be done verbatim (like copying nonsense tokens). These two “routes” operate independently: we show that ablation of token induction heads causes models to paraphrase where they would otherwise copy verbatim. By patching concept induction head outputs, we find that they contain language-independent word representations that mediate natural language translation, suggesting that LLMs represent abstract word meanings independent of language or form.
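
The head-ablation experiments described in the abstract can be approximated with standard PyTorch hooks. Below is a minimal sketch of zero-ablating a single attention head in GPT-2; the layer and head indices are hypothetical placeholders, not the induction heads identified in the paper.

```python
# Hedged sketch: zero-ablating one attention head in GPT-2 via a forward
# pre-hook on the output projection, whose input is the concatenation of
# per-head outputs. LAYER/HEAD are illustrative, not the paper's heads.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER, HEAD = 5, 1                      # hypothetical induction head
head_dim = model.config.n_embd // model.config.n_head

def ablate_head(module, inputs):
    hidden, = inputs                    # (batch, seq, n_embd): concatenated heads
    hidden = hidden.clone()
    hidden[..., HEAD * head_dim:(HEAD + 1) * head_dim] = 0.0
    return (hidden,)

hook = model.transformer.h[LAYER].attn.c_proj.register_forward_pre_hook(ablate_head)
ids = tok("one two three one two", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits          # next-token logits with the head silenced
hook.remove()
```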

Mind the Gap: A Review of Arabic Post-Training Datasets and Their Limitations

arXiv:2507.14688v1 Announce Type: new Abstract: Post-training has emerged as a crucial technique for aligning pre-trained Large Language Models (LLMs) with human instructions, significantly enhancing their performance across a wide range of tasks. Central to this process is the quality and diversity of post-training datasets. This paper presents a review of publicly available Arabic post-training datasets on the Hugging Face Hub, organized along four key dimensions: (1) LLM Capabilities (e.g., Question Answering, Translation, Reasoning, Summarization, Dialogue, Code Generation, and Function Calling); (2) Steerability (e.g., persona and system prompts); (3) Alignment (e.g., cultural, safety, ethics, and fairness); and (4) Robustness. Each dataset is rigorously evaluated based on popularity, practical adoption, recency and maintenance, documentation and annotation quality, licensing transparency, and scientific contribution. Our review reveals critical gaps in the development of Arabic post-training datasets, including limited task diversity, inconsistent or missing documentation and annotation, and low adoption across the community. Finally, the paper discusses the implications of these gaps for the progress of Arabic LLMs and applications, and provides concrete recommendations for future efforts in post-training dataset development.
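
For context, the corpus this review surveys can be enumerated programmatically. A minimal sketch, assuming huggingface_hub's `list_datasets` supports the `language` filter (present in recent versions); the popularity and recency fields printed here are two of the review's evaluation axes.

```python
# Hedged sketch: enumerating Arabic-language datasets on the Hugging Face
# Hub, roughly the collection the review surveys. Filter/sort arguments
# assume a recent huggingface_hub version.
from huggingface_hub import HfApi

api = HfApi()
for ds in api.list_datasets(language="ar", sort="downloads", direction=-1, limit=20):
    # id, download count, and last-modified date map onto the review's
    # popularity and recency/maintenance criteria.
    print(ds.id, ds.downloads, ds.last_modified)
```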

NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining

arXiv:2507.14119v1 Announce Type: cross Abstract: Recent advances in generative modeling enable image editing assistants that follow natural language instructions without additional user input. Their supervised training requires millions of triplets: original image, instruction, edited image. Yet mining pixel-accurate examples is hard. Each edit must affect only prompt-specified regions, preserve stylistic coherence, respect physical plausibility, and retain visual appeal. The lack of robust automated edit-quality metrics hinders reliable automation at scale. We present an automated, modular pipeline that mines high-fidelity triplets across domains, resolutions, instruction complexities, and styles. Built on public generative models and running without human intervention, our system uses a task-tuned Gemini validator to score instruction adherence and aesthetics directly, removing any need for segmentation or grounding models. Inversion and compositional bootstrapping enlarge the mined set by approximately 2.2x, enabling large-scale high-fidelity training data. By automating the most repetitive annotation steps, the approach allows a new scale of training without human labeling effort. To democratize research in this resource-intensive area, we release NHR-Edit: an open dataset of 358k high-quality triplets. In the largest cross-dataset evaluation, it surpasses all public alternatives. We also release Bagel-NHR-Edit, an open-source fine-tuned Bagel model, which achieves state-of-the-art metrics in our experiments.
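
A minimal sketch of the validator-gated mining loop the abstract describes, where `edit_model` and `validator_score` are hypothetical stand-ins for the public generative editor and the task-tuned Gemini validator; thresholds are illustrative.

```python
# Hedged sketch of validator-gated triplet mining. The editor, validator,
# and thresholds are assumptions, not the paper's exact pipeline.
def mine_triplets(images, instructions, edit_model, validator_score,
                  adherence_min=0.9, aesthetics_min=0.8):
    triplets = []
    for img, instr in zip(images, instructions):
        edited = edit_model(img, instr)
        # Validator scores instruction adherence and aesthetics directly,
        # with no segmentation or grounding models involved.
        scores = validator_score(img, instr, edited)
        if scores["adherence"] >= adherence_min and scores["aesthetics"] >= aesthetics_min:
            triplets.append((img, instr, edited))  # keep only high-fidelity edits
    return triplets
```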

Innocence in the Crossfire: Roles of Skip Connections in Jailbreaking Visual Language Models

arXiv:2507.13761v1 Announce Type: new Abstract: Language models are highly sensitive to prompt formulations – small changes in input can drastically alter their output. This raises a critical question: To what extent can prompt sensitivity be exploited to generate inapt content? In this paper, we investigate how discrete components of prompt design influence the generation of inappropriate content in Visual Language Models (VLMs). Specifically, we analyze the impact of three key factors on successful jailbreaks: (a) the inclusion of detailed visual information, (b) the presence of adversarial examples, and (c) the use of positively framed beginning phrases. Our findings reveal that while a VLM can reliably distinguish between benign and harmful inputs in unimodal settings (text-only or image-only), this ability significantly degrades in multimodal contexts. Each of the three factors is independently capable of triggering a jailbreak, and we show that even a small number of in-context examples (as few as three) can push the model toward generating inappropriate outputs. Furthermore, we propose a framework that utilizes a skip-connection between two internal layers of the VLM, which substantially increases jailbreak success rates, even when using benign images. Finally, we demonstrate that memes, often perceived as humorous or harmless, can be as effective as toxic visuals in eliciting harmful content, underscoring the subtle and complex vulnerabilities of VLMs.
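
To make the skip-connection idea concrete, here is a minimal sketch that wires an early decoder layer's output into a later layer's input with PyTorch hooks. The model choice, layer pair, and direct addition are illustrative assumptions, not the paper's exact construction.

```python
# Hedged sketch: a skip connection between two internal decoder layers of
# a VLM's language model, in the spirit of the framework described above.
import torch
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
layers = model.language_model.model.layers   # assumes a LLaVA-style layout
SRC, DST = 4, 20                             # hypothetical source/target layers

captured = {}

def save_hidden(module, inputs, output):
    # Decoder layers return a tuple; the hidden states come first.
    captured["early"] = output[0]

def inject_skip(module, inputs):
    # Add the early layer's states to the later layer's input stream.
    return (inputs[0] + captured["early"],) + inputs[1:]

h_src = layers[SRC].register_forward_hook(save_hidden)
h_dst = layers[DST].register_forward_pre_hook(inject_skip)
# ... run generation with an image + prompt, then h_src.remove(); h_dst.remove()
```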

Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

arXiv:2502.13962v2 Announce Type: replace Abstract: Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.
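
A minimal sketch of confidence-thresholded selective answering as the abstract describes it. How `confidence` is extracted (for example, the mean token log-probability of the final answer) is an assumption, and the threshold sweep is the standard way to trade coverage against risk.

```python
# Hedged sketch of selective question answering by confidence thresholding.
def selective_answer(answer: str, confidence: float, threshold: float = 0.75):
    """Answer only when confident enough; abstaining bounds response risk."""
    return answer if confidence >= threshold else None  # None = abstain

def risk_coverage(results, threshold):
    """results: list of (is_correct, confidence) pairs from an eval set."""
    answered = [(ok, c) for ok, c in results if c >= threshold]
    coverage = len(answered) / len(results)               # fraction attempted
    risk = sum(not ok for ok, _ in answered) / max(len(answered), 1)
    return coverage, risk  # sweep threshold to report non-zero-risk settings
```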

RAG-based Architectures for Drug Side Effect Retrieval in LLMs

arXiv:2507.13822v1 Announce Type: cross Abstract: Drug side effects are a major global health concern, necessitating advanced methods for their accurate detection and analysis. While Large Language Models (LLMs) offer promising conversational interfaces, their inherent limitations, including reliance on black-box training data, susceptibility to hallucinations, and lack of domain-specific knowledge, hinder their reliability in specialized fields like pharmacovigilance. To address this gap, we propose two architectures: Retrieval-Augmented Generation (RAG) and GraphRAG, which integrate comprehensive drug side effect knowledge into a Llama 3 8B language model. Through extensive evaluations on 19,520 drug side effect associations (covering 976 drugs and 3,851 side effect terms), our results demonstrate that GraphRAG achieves near-perfect accuracy in drug side effect retrieval. This framework offers a highly accurate and scalable solution, marking a significant advancement in leveraging LLMs for critical pharmacovigilance applications.
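
As a rough illustration of the RAG route (not the paper's exact stack), here is a minimal retrieval-grounded prompting sketch using sentence-transformers and FAISS over a toy two-fact knowledge base; the grounded prompt would then be sent to Llama 3 8B.

```python
# Hedged sketch of retrieval-augmented prompting for drug side effects.
# Retriever, index, and facts are illustrative assumptions.
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
facts = ["aspirin: gastrointestinal bleeding", "ibuprofen: nausea"]  # toy KB
index = faiss.IndexFlatIP(encoder.get_sentence_embedding_dimension())
index.add(encoder.encode(facts, normalize_embeddings=True))

def build_prompt(question: str, k: int = 3) -> str:
    q = encoder.encode([question], normalize_embeddings=True)
    _, idxs = index.search(q, min(k, len(facts)))
    context = "\n".join(facts[i] for i in idxs[0])
    # Grounding the model in retrieved facts counters hallucination.
    return f"Answer using only these facts:\n{context}\n\nQuestion: {question}"

print(build_prompt("What are the side effects of aspirin?"))
```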

Political Leaning and Politicalness Classification of Texts

arXiv:2507.13913v1 Announce Type: new Abstract: This paper addresses the challenge of automatically classifying text according to political leaning and politicalness using transformer models. We compose a comprehensive overview of existing datasets and models for these tasks, finding that current approaches create siloed solutions that perform poorly on out-of-distribution texts. To address this limitation, we compile a diverse dataset by combining 12 datasets for political leaning classification and creating a new dataset for politicalness by extending 18 existing datasets with the appropriate label. Through extensive benchmarking with leave-one-in and leave-one-out methodologies, we evaluate the performance of existing models and train new ones with enhanced generalization capabilities.
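
The leave-one-out protocol mentioned in the abstract is straightforward to express. A minimal sketch, with hypothetical `train_model` and `evaluate` stand-ins for the paper's transformer pipeline:

```python
# Hedged sketch of leave-one-dataset-out benchmarking: train on all
# datasets but one, test on the held-out one to measure
# out-of-distribution generalization.
def leave_one_out(datasets, train_model, evaluate):
    scores = {}
    for held_out in datasets:
        train_splits = [d for d in datasets if d is not held_out]
        model = train_model(train_splits)          # fit on the remainder
        scores[held_out.name] = evaluate(model, held_out)
    return scores  # low held-out scores reveal siloed, non-generalizing models
```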

MemAgent: A Reinforcement Learning Framework Redefining Long-Context Processing in LLMs

Handling extremely long documents remains a persistent challenge for large language models (LLMs). Even with techniques such as length extrapolation and sparse attention, models often suffer from performance degradation and high computational costs. To address this, researchers from ByteDance Seed and Tsinghua University introduce MemAgent, a reinforcement learning-based memory agent designed to enable long-context processing with linear complexity and minimal performance loss.

Limitations of Existing Approaches

Current solutions for long-context modeling fall into three main categories:

- Length extrapolation methods (e.g., NTK, PI, YaRN, DCA): extend the context window via positional-embedding manipulations, but often face performance degradation and scaling issues.
- Sparse and linear attention mechanisms: reduce attention complexity to O(n) but typically require retraining from scratch and rely on fixed patterns or human-defined rules.
- Context compression: uses token-level or external memory modules to condense long inputs, but often disrupts standard generation and struggles with extrapolation.

These approaches fail to deliver all three critical attributes: arbitrary input-length support, consistent accuracy, and efficient linear complexity.

MemAgent: A Human-Like Memory Strategy

Inspired by how humans summarize key information while ignoring noise, MemAgent processes input as a stream of evidence. At each step, it reads a document chunk and an internal memory, overwriting the latter with updated, compressed context. Key innovations (a minimal sketch of this loop follows the case study below):

- Fixed-length token-based memory: compresses essential information while maintaining model compatibility.
- Segment-wise overwrite mechanism: supports unbounded text lengths without growing memory.
- Linear complexity: memory update and decoding cost remain constant per chunk.

Multi-Conversation RL Training with GRPO

MemAgent treats each document-chunk interaction as an independent dialogue. It is trained via Group Relative Policy Optimization (GRPO) within a multi-conversation RL pipeline called DAPO, enabling reward-driven memory updates. Key elements:

- Rule-based verifier: computes outcome rewards by comparing model answers with multiple ground truths.
- Token-level RL signal: applied uniformly across the conversations stemming from a sample.

This setup encourages memory compression focused on answer-relevant information and discards distractors.

Performance Evaluation

Using the RULER benchmark and synthetic datasets from HotpotQA and SQuAD, MemAgent was trained with an 8K context window and extrapolated up to 3.5 million tokens. Accuracy by input length:

| Model | 224K | 896K | 3.5M |
|---|---|---|---|
| Qwen2.5-Instruct-14B-1M | 37.5% | 0.0% | N/A |
| QwenLong-L1-32B | 17.2% | 11.7% | N/A |
| RL-MemAgent-14B | 81.3% | 77.3% | 78.1% |

MemAgent maintained over 95% accuracy on RULER benchmarks (8K to 512K tokens) and consistently outperformed long-context and distillation-based baselines.

Case Study: Multi-Hop QA

Given the query "The director of the romantic comedy 'Big Stone Gap' is based in what New York city?", MemAgent progressively tracked relevant content across three chunks:

1. Recognized unrelated content but retained location information.
2. Maintained its memory against irrelevant chunks.
3. Correctly updated the memory upon encountering Adriana Trigiani's biography.

Final answer: Greenwich Village, New York City.
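To make the mechanism concrete, here is a minimal sketch of the segment-wise read-and-overwrite loop. The `llm` object with `update_memory` and `answer` prompt helpers is a hypothetical wrapper, not MemAgent's actual API.

```python
# Hedged sketch of MemAgent's chunked overwrite loop under assumed helpers.
def memagent_answer(llm, document_tokens, question,
                    chunk_size=4096, memory_budget=1024):
    memory = ""                                   # fixed-budget textual memory
    for start in range(0, len(document_tokens), chunk_size):
        chunk = document_tokens[start:start + chunk_size]
        # Overwrite: the model rewrites memory from (old memory, new chunk),
        # keeping answer-relevant evidence and dropping distractors.
        memory = llm.update_memory(question=question, memory=memory,
                                   chunk=chunk, max_tokens=memory_budget)
    # The final answer is generated from the compressed memory alone, so
    # total cost stays linear in document length.
    return llm.answer(question=question, memory=memory)
```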
Theoretical Foundation and Complexity

MemAgent reformulates the autoregressive model using latent memory variables $m_{1:K}$ over document chunks $c_{1:K}$:

$$p(x_{1:N}) = \sum_{m_{1:K}} \prod_{k=1}^{K} p(c_k \mid m_{k-1}) \, p(m_k \mid c_k, m_{k-1})$$

This yields O(N) compute cost and human-readable intermediate memory, unlike attention-based feature compression. RL is essential because memory updates are discrete and cannot be learned via backpropagation.

Conclusion

MemAgent offers a scalable and efficient solution to the long-context trilemma: unlimited input length, near-lossless accuracy, and linear complexity. Its RL-trained overwrite-memory mechanism lets LLMs read, abstract, and generate over multi-million-token inputs without architectural modification.

FAQs

Q1: What is MemAgent?
MemAgent is a reinforcement learning-based framework that equips LLMs with memory tokens to handle extremely long contexts efficiently.

Q2: How is it different from attention or extrapolation methods?
Unlike attention-based scaling or extrapolation techniques, MemAgent uses token-based memory updated via reinforcement learning.

Q3: What models can MemAgent be applied to?
Any Transformer-based LLM; no changes to the model architecture are required.

Q4: How does it scale with input size?
It maintains linear computational complexity regardless of input length by fixing the memory size.

Q5: What are the applications of MemAgent?
Long-document QA, agent memory systems, legal document review, scientific literature analysis, and real-time decision-making with large evidence bases.

NVIDIA AI Releases OpenReasoning-Nemotron: A Suite of Reasoning-Enhanced LLMs Distilled from DeepSeek R1 0528

NVIDIA AI has introduced OpenReasoning-Nemotron, a family of large language models (LLMs) designed to excel in complex reasoning tasks across mathematics, science, and code. This model suite, comprising 1.5B, 7B, 14B, and 32B parameter versions, has been distilled from the 671B DeepSeek R1 0528 model, capturing its high-level reasoning capabilities in significantly smaller and more efficient models. The release positions NVIDIA as a leading contributor to the open-source LLM ecosystem, delivering models that push state-of-the-art (SOTA) performance while remaining commercially permissive and widely accessible via Hugging Face.

Model Overview and Architecture

At the heart of OpenReasoning-Nemotron lies a distillation strategy that transfers reasoning ability from DeepSeek R1 0528, a massive 671B-parameter model, into smaller architectures. The process prioritizes reasoning generalization over raw token prediction, enabling compact models to perform effectively on structured, high-cognition tasks. The distillation dataset emphasizes mathematics, science, and programming, aligning model capabilities with key reasoning domains.

Model Variants and Specs

| Model Name | Parameters | Intended Use |
|---|---|---|
| OpenReasoning-Nemotron-1.5B | 1.5B | Entry-level reasoning and inference |
| OpenReasoning-Nemotron-7B | 7B | Mid-scale reasoning, good for code/math |
| OpenReasoning-Nemotron-14B | 14B | Advanced reasoning capabilities |
| OpenReasoning-Nemotron-32B | 32B | Near frontier-model performance on logic-intensive tasks |

All models are compatible with transformer architectures, support FP16/INT8 quantization, and are optimized for NVIDIA GPUs and the NeMo framework.

Performance Benchmarks

These models set new state-of-the-art pass@1 scores for their size class across multiple reasoning benchmarks (all quoted scores are pass@1 without GenSelect):

| Model | GPQA | MMLU-PRO | HLE | LiveCodeBench | SciCode | AIME24 | AIME25 | HMMT Feb 2025 |
|---|---|---|---|---|---|---|---|---|
| 1.5B | 31.6 | 47.5 | 5.5 | 28.6 | 2.2 | 55.5 | 45.6 | 31.5 |
| 7B | 61.1 | 71.9 | 8.3 | 63.3 | 16.2 | 84.7 | 78.2 | 63.5 |
| 14B | 71.6 | 77.5 | 10.1 | 67.8 | 23.5 | 87.8 | 82.0 | 71.2 |
| 32B | 73.1 | 80.0 | 11.9 | 70.2 | 28.5 | 89.2 | 84.0 | 73.8 |

GenSelect (Heavy Mode)

Using Generative Selection with 64 candidates ("GenSelect"), performance improves further, especially at 32B: AIME24 rises from 89.2 to 93.3, AIME25 from 84.0 to 90.0, HMMT from 73.8 to 96.7, and LiveCodeBench from 70.2 to 75.3. This demonstrates strong emergent reasoning performance at scale. (A sketch of GenSelect-style inference follows the FAQs at the end of this post.)

Training Data and Reasoning Specialization

The training corpus is a distilled, high-quality subset of data generated by DeepSeek R1 0528. Key features:

- Heavily curated reasoning data from math, science, and computer-science disciplines.
- Prompt-engineered fine-tuning designed to reinforce multi-step chains of thought.
- Emphasis on logical consistency, constraint satisfaction, and symbolic reasoning.

This deliberate curation ensures strong alignment with real-world reasoning problems found in both academia and applied ML domains.

Open and Ecosystem Integration

All four OpenReasoning-Nemotron models (1.5B, 7B, 14B, and 32B) are released under an open, commercially permissive license, with model cards, evaluation scripts, and inference-ready weights available on Hugging Face. They are designed to plug into the NVIDIA NeMo framework and support the TensorRT-LLM, ONNX, and Hugging Face Transformers toolchains, facilitating rapid deployment in production and research settings.
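
The models load with the standard Transformers toolchain mentioned above. A minimal sketch, assuming the release's Hugging Face repo naming (verify the exact id on the Hub); the sampling settings are illustrative.

```python
# Hedged sketch: running OpenReasoning-Nemotron-7B with Hugging Face
# transformers. The repo id is assumed from the release naming.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "nvidia/OpenReasoning-Nemotron-7B"  # assumed repo id; check the Hub
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.float16, device_map="auto"
)

messages = [{"role": "user", "content": "Prove that the sum of two even integers is even."}]
input_ids = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=1024, do_sample=True,
                        temperature=0.6, top_p=0.95)
print(tok.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```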
Key Use Cases

- Math tutors and theorem solvers
- Scientific QA agents and medical reasoning systems
- Code generation and debugging assistants
- Chain-of-thought multi-hop question answering
- Synthetic data generation for structured domains

Conclusion

NVIDIA's OpenReasoning-Nemotron models offer a pragmatic, open-source path toward scaling reasoning ability without frontier-scale compute costs. By distilling from the 671B DeepSeek R1 0528 and targeting high-leverage reasoning domains, these models deliver a powerful balance of accuracy, efficiency, and accessibility. For developers, researchers, and enterprises working on logic-intensive AI applications, OpenReasoning-Nemotron provides a compelling foundation, free from the trade-offs that often accompany proprietary or overgeneralized models.

Frequently Asked Questions (FAQs)

Q1: What benchmarks are supported?
GPQA, MMLU-PRO, HLE, LiveCodeBench, SciCode, AIME 2024/25, and HMMT Feb 2025 (pass@1).

Q2: How much data was used?
A distillation corpus of 5 million reasoning examples across domains, generated by DeepSeek R1 0528.

Q3: Is reinforcement learning used?
No. The models are trained purely via supervised fine-tuning (SFT), preserving efficiency while enabling future RL research.

Q4: Can I scale reasoning with GenSelect?
Yes. Using GenSelect significantly boosts performance; the 32B model jumps from 73.8 to 96.7 on HMMT with 64 candidates.
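
As a rough illustration of the GenSelect idea referenced above: sample k candidate solutions, then let the model judge which is best. The `llm` wrapper and the selection prompt are hypothetical, not NVIDIA's actual implementation.

```python
# Hedged sketch of GenSelect-style best-of-k inference under an assumed
# `llm.generate(prompt, ...)` wrapper.
def genselect(llm, problem: str, k: int = 64) -> str:
    candidates = [llm.generate(problem, temperature=0.6) for _ in range(k)]
    numbered = "\n\n".join(f"Solution {i}:\n{c}" for i, c in enumerate(candidates))
    # The model itself judges the candidates (generative selection).
    choice = llm.generate(
        f"Problem:\n{problem}\n\n{numbered}\n\n"
        "Judge the solutions above and reply with only the number of the best one."
    )
    return candidates[int(choice.strip())]
```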
