YouZum

Graph-R1: An Agentic GraphRAG Framework for Structured, Multi-Turn Reasoning with Reinforcement Learning

Introduction

Large Language Models (LLMs) have set new benchmarks in natural language processing, but their tendency to hallucinate, generating inaccurate outputs, remains a critical issue for knowledge-intensive applications. Retrieval-Augmented Generation (RAG) frameworks attempt to solve this by incorporating external knowledge into language generation. However, traditional RAG approaches rely on chunk-based retrieval, which limits their ability to represent complex semantic relationships. Entity-relation graph-based RAG methods (GraphRAG) address some structural limitations, but they still face high construction cost, inflexible one-shot retrieval, and dependence on long-context reasoning and carefully crafted prompts. Researchers from Nanyang Technological University, National University of Singapore, Beijing Institute of Computer Technology and Application, and Beijing Anzhen Hospital have introduced Graph-R1, an agentic GraphRAG framework powered by end-to-end reinforcement learning.

Image source: https://arxiv.org/pdf/2507.21892v1

Core Innovations of Graph-R1

1. Lightweight Knowledge Hypergraph Construction
Graph-R1 constructs knowledge as a hypergraph, where each knowledge segment is extracted using LLM-driven n-ary relation extraction. This approach encodes richer and more semantically grounded relationships, boosting agentic reasoning capabilities while keeping cost and computational requirements manageable.
- Efficiency: construction takes only 5.69 s and $2.81 per 1,000 tokens (vs. $3.35 for GraphRAG and $4.14 for HyperGraphRAG), while producing semantically rich graphs with 120,499 nodes and 98,073 edges.

2. Multi-Turn Agentic Retrieval Process
Graph-R1 models retrieval as a multi-turn interaction loop ("think-retrieve-rethink-generate"), allowing the agent to adaptively query and refine its knowledge path, unlike previous methods that use one-shot retrieval.
- Dynamic reasoning: the agent decides at each step whether to continue exploring or terminate with an answer.
- Entity-based and direct hyperedge retrieval are fused through reciprocal rank aggregation, improving the chances of retrieving the most relevant knowledge (see the retrieval-loop sketch later in this article).

3. End-to-End Reinforcement Learning Optimization
Graph-R1 uses Group Relative Policy Optimization (GRPO) for end-to-end RL, integrating rewards for format adherence, relevance, and answer correctness. This unified reward guides the agent toward generalizable reasoning strategies that are tightly aligned with both the knowledge structure and output quality.
- Outcome-directed reward mechanism: combines format rewards (structural coherence) and answer rewards (semantic accuracy), rewarding only answers embedded in structurally valid reasoning trajectories.
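The paper gives the exact reward formulation; as an illustration of the outcome-directed idea described above, here is a minimal, hypothetical sketch in which the answer reward only counts when the trajectory passes a structural format check. The tag names, the token-level F1 answer score, and the 0.1/0.9 weighting are assumptions, not the authors' implementation.

```python
from collections import Counter
import re

def format_reward(trajectory: str) -> float:
    """Hypothetical structural check: the trajectory must contain at least one
    <think> block and exactly one final <answer> block."""
    has_think = bool(re.search(r"<think>.*?</think>", trajectory, re.S))
    answers = re.findall(r"<answer>.*?</answer>", trajectory, re.S)
    return 1.0 if has_think and len(answers) == 1 else 0.0

def answer_reward(predicted: str, gold: str) -> float:
    """Token-level F1 between predicted and gold answers (a common choice for
    QA rewards; the paper's exact answer score may differ)."""
    pred, ref = predicted.lower().split(), gold.lower().split()
    if not pred or not ref:
        return 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def outcome_reward(trajectory: str, predicted: str, gold: str) -> float:
    """Outcome-directed reward: answer quality only counts when the reasoning
    trajectory is structurally valid (illustrative weighting)."""
    if format_reward(trajectory) == 0.0:
        return 0.0
    return 0.1 + 0.9 * answer_reward(predicted, gold)
```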
Key Findings

Benchmarking on RAG QA Tasks
Graph-R1 was evaluated across six standard QA datasets (2WikiMultiHopQA, HotpotQA, Musique, Natural Questions, PopQA, TriviaQA).

Method           Avg. F1 (Qwen2.5-7B)
NaiveGeneration  13.87
StandardRAG      15.89
GraphRAG         24.87
HyperGraphRAG    29.40
Search-R1        46.19
R1-Searcher      42.29
Graph-R1         57.82

Graph-R1 achieves up to 57.82 average F1 with Qwen2.5-7B, surpassing all previous baselines by a wide margin. Larger base models amplify its performance gains.

Ablation Analysis
Component ablation shows that removing hypergraph construction, multi-turn reasoning, or RL optimization dramatically reduces performance, validating the necessity of each module within Graph-R1.

Retrieval and Efficiency
Graph-R1's retrieval is more concise and effective. It achieves high F1 scores with moderate average content lengths (roughly 1,200-1,500 tokens per exchange) and supports more interaction turns (2.3-2.5 on average), facilitating stable and accurate knowledge extraction. Generation cost is minimal: despite the richer representation, Graph-R1's response time per query (7.0 s) and per-query cost ($0) outperform graph-based competitors such as HyperGraphRAG (9.6 s, $8.76).

Generation Quality
Graph-R1's generation quality is evaluated across seven dimensions (comprehensiveness, knowledgeability, correctness, relevance, diversity, logical coherence, factuality) and consistently outperforms all RL-based and graph-based baselines, achieving top scores in correctness (86.9), relevance (95.2), and coherence (88.5).

Generalizability
Cross-validation in out-of-distribution (O.O.D.) settings shows that Graph-R1 maintains robust performance across datasets, with O.O.D./I.I.D. ratios often above 85%, demonstrating strong domain generalization.

Theoretical Guarantees
Graph-R1 is supported by information-theoretic analyses:
- Graph-structured knowledge provides higher information density per retrieval and faster convergence to correct answers than chunk-based retrieval.
- Multi-turn interaction enables the agent to achieve higher retrieval efficiency by dynamically focusing on high-impact graph regions.
- End-to-end RL optimization bridges graph-structured evidence and language generation, reducing output entropy and error rates.

Algorithmic Workflow (High-Level)
1. Knowledge hypergraph extraction: an LLM extracts n-ary relations to build entity and hyperedge sets.
2. Multi-turn agentic reasoning: the agent alternates between reflective thinking, querying, hypergraph retrieval (entity and hyperedge dual paths), and synthesis.
3. GRPO optimization: the RL policy is updated using sampled trajectories and reward normalization, enforcing structure and answer correctness.
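To make the workflow concrete, here is a minimal, hypothetical sketch of the think-retrieve-rethink-generate loop with reciprocal rank fusion of the two retrieval paths. The helper interfaces (llm.think, llm.answer, retrieve_by_entities, retrieve_by_hyperedges) and the RRF constant k=60 are illustrative assumptions, not the released implementation.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of hyperedge ids via reciprocal rank aggregation.
    k=60 is the conventional RRF constant, assumed here."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def graph_r1_answer(question, hypergraph, llm, max_turns=5):
    """Hypothetical multi-turn agent loop over a knowledge hypergraph."""
    context = []
    for _ in range(max_turns):
        # Think: decide whether to issue another query or answer now.
        thought = llm.think(question, context)                              # assumed helper
        if thought.ready_to_answer:
            break
        # Retrieve: dual paths over the hypergraph, fused by RRF.
        by_entity = hypergraph.retrieve_by_entities(thought.query)         # assumed helper
        by_hyperedge = hypergraph.retrieve_by_hyperedges(thought.query)    # assumed helper
        fused = reciprocal_rank_fusion([by_entity, by_hyperedge])
        # Rethink: append the top fused evidence and continue the loop.
        context.extend(fused[:5])
    # Generate: final answer conditioned on the accumulated evidence.
    return llm.answer(question, context)                                    # assumed helper
```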
Conclusion

Graph-R1 demonstrates that integrating hypergraph-based knowledge representation, agentic multi-turn reasoning, and end-to-end RL delivers unprecedented gains in factual QA performance, retrieval efficiency, and generation quality, charting a path for next-generation agentic, knowledge-driven LLM systems.

FAQ 1: What is the key innovation of Graph-R1 compared to earlier GraphRAG and RAG systems?
Graph-R1 introduces an agentic framework in which retrieval is modeled as a multi-turn interaction rather than a single one-shot process. Its main innovations are:
- Hypergraph knowledge representation: instead of simple entity-relation graphs or text chunks, Graph-R1 constructs a semantic hypergraph that enables more expressive, n-ary relationships between entities.
- Multi-turn reasoning loop: the agent operates in repeated cycles of "think-retrieve-rethink-generate" over the hypergraph, dynamically focusing queries rather than retrieving everything at once.
- End-to-end reinforcement learning (RL): the agent is trained with a reward function that simultaneously optimizes step-wise logical reasoning and final answer correctness, enabling tighter alignment between structured knowledge and natural language answers.

FAQ 2: How does Graph-R1's retrieval and generation efficiency compare to previous methods?
Graph-R1 is significantly more efficient and effective in both retrieval and answer generation:
- Lower construction and retrieval cost: building the knowledge hypergraph takes only 5.69 seconds and costs $2.81 per 1,000 tokens (on the 2Wiki dataset), outperforming similar graph-based methods.
- Faster and cheaper generation: query response times (about 7 seconds per query) and generation costs ($0 per query) are better than prior graph-RAG systems such as HyperGraphRAG.
- Conciseness and robustness: Graph-R1's answers are both more concise (usually 1,200-1,500 tokens) and more accurate thanks to the multi-turn interaction, with state-of-the-art F1 scores across six QA datasets.

FAQ 3: In which scenarios or domains is the Graph-R1 framework most applicable?
Graph-R1 is ideal for complex knowledge-intensive applications demanding both factual accuracy and reasoning transparency, such as:
- Healthcare and medical AI, where multi-hop reasoning, traceability, and reliability are essential.
- Legal and regulatory domains, which require precisely grounded answers and interpretable multi-step reasoning.
- Enterprise knowledge automation, for tasks needing scalable, dynamic querying and retrieval across large document or data corpora.
The model's architecture also allows easy adaptation to other fields that benefit from agentic, multi-turn knowledge search anchored in structured knowledge.

From terabytes to insights: Real-world AI observability architecture

GUEST: Consider maintaining and developing an e-commerce platform that processes millions of transactions every minute, generating large amounts of telemetry data, including metrics, logs, and traces across multiple microservices. When critical incidents occur, on-call engineers face the daunting task of sifting through an ocean of data to unravel r…

VL-Cogito: Advancing Multimodal Reasoning with Progressive Curriculum Reinforcement Learning

Multimodal reasoning, where models integrate and interpret information from multiple sources such as text, images, and diagrams, is a frontier challenge in AI. VL-Cogito is a state-of-the-art multimodal large language model (MLLM) proposed by DAMO Academy (Alibaba Group) and partners, introducing a robust reinforcement learning pipeline that fundamentally upgrades the reasoning skills of large models across mathematics, science, logic, charts, and general understanding.

Core Innovations

VL-Cogito's approach centers on the Progressive Curriculum Reinforcement Learning (PCuRL) framework, engineered to systematically overcome the instability and domain gaps endemic to multimodal reasoning. The framework includes two key innovations:

- Online Difficulty Soft Weighting (ODSW): this mechanism assigns dynamic weights to training samples according to their difficulty and the model's evolving capabilities. Rather than rigidly filtering out "easy" or "hard" samples, ODSW ensures each prompt contributes appropriately to gradient updates, enabling the model to progress from clear cases to intricate, challenging ones through a continuous curriculum. Three variants tune the focus for the easy, medium, or hard stage using a piecewise function of rollout accuracy, guided by learnability theory and the empirical distribution of task difficulty.
- Dynamic Length Reward (DyLR): traditional length rewards in RL-based reasoning models set a static target, which ignores task complexity and encourages unnecessary verbosity. DyLR instead computes an ideal target length per prompt, estimated from the average length of the correct rollout samples for each question. Short, rapid reasoning is promoted for easy tasks, while complex ones incentivize deeper, multi-step exploration, balancing efficiency and correctness (a toy sketch of this idea follows).
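As an illustration only, here is a minimal sketch of how a per-prompt dynamic length target and reward could be computed from rollout statistics. The exact reward shape and hyperparameters in the paper may differ; the linear falloff and the rollout data structure here are assumptions.

```python
def dynamic_length_target(rollouts):
    """Target length for a prompt = mean token length of its correct rollouts
    (falls back to the mean over all rollouts if none are correct)."""
    correct = [r["length"] for r in rollouts if r["is_correct"]]
    pool = correct if correct else [r["length"] for r in rollouts]
    return sum(pool) / len(pool)

def length_reward(response_length, target_length, tolerance=0.5):
    """Illustrative reward: 1.0 at the target length, decaying linearly to 0
    once the response deviates by `tolerance` * target (assumed shape)."""
    deviation = abs(response_length - target_length) / max(target_length, 1.0)
    return max(0.0, 1.0 - deviation / tolerance)

# Example: a harder prompt whose correct rollouts averaged ~900 tokens.
rollouts = [
    {"length": 850, "is_correct": True},
    {"length": 950, "is_correct": True},
    {"length": 300, "is_correct": False},
]
target = dynamic_length_target(rollouts)  # 900.0
print(length_reward(880, target))         # close to the target -> reward near 1.0
```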
Training Pipeline

VL-Cogito's RL post-training starts directly from the Qwen2.5-VL-Instruct-7B backbone, with no initial supervised fine-tuning (SFT) cold start. The PCuRL process is divided into three sequential RL stages: easy, medium, and hard. In each stage:
- The same dataset is shuffled, exposing the model to varied generalization challenges.
- ODSW's weighting function for that stage biases gradient updates toward the target difficulty.
- In the hard stage, DyLR is activated to encourage adaptive expansion of the reasoning chain.

Technical setup: AdamW optimizer, learning rate 1e-6, DeepSpeed-ZeRO3; rollout batch size 512; global batch size 128; sequence length 4,096; KL divergence loss coefficient 1e-3; 16 response samples per prompt; temperature 1.0. Reward hyperparameters: α=1, β=0.5, γ=1, w=0.25 (penalty for zero-accuracy prompts).

Dataset Curation and RL Data Sampling

A curated training set covers 23 open-source multimodal datasets across six task categories: mathematical reasoning, logical reasoning, counting, science reasoning, chart understanding, and general image understanding. All samples are reformulated into open-ended QA formats to prevent superficial exploitation of multiple-choice cues. For difficulty sampling, Qwen2.5-VL-7B-Instruct is run on each sample; any sample it passes with ≥50% accuracy over 8 runs is dropped, guaranteeing that only genuinely challenging tasks remain.

Evaluation and Benchmark Results

Performance Across Benchmarks
VL-Cogito is benchmarked against both general-purpose and reasoning-oriented MLLMs on a ten-task panel, including Geometry@3K, MathVerse, MathVista, ChartQA, ScienceQA, MMMU, EMMA, and MMStar.
- Absolute accuracy gains over the backbone: +7.6% on Geometry@3K, +5.5% on MathVista, +4.9% on LogicVista, +2.2% on ScienceQA, +4.5% on EMMA, +3.8% on MMStar.
- State-of-the-art results on 6/10 benchmarks: VL-Cogito either leads or matches top results, especially on rigorous math and scientific tasks. Models "cold-started" with SFT or with forced rethinking strategies do not surpass its curriculum-based RL.

Model              Geo3K  MathVerse  MathVista  MathVision  LogicVista  ChartQA  SciQA  MMMU  EMMA  MMStar
VL-Cogito (7B)     68.7   53.3       74.8       30.7        48.9        83.4     87.6   52.6  29.1  66.3
VL-Rethinker (7B)  67.7   54.6       73.7       30.1        45.7        83.5     86.7   52.9  28.6  64.2
MM-Eureka (8B)     67.2   52.3       73.4       29.4        47.1        82.7     86.4   52.3  27.4  64.7
Qwen2.5-VL (7B)    61.6   50.4       69.3       28.7        44.0        82.4     85.4   50.9  24.6  62.5

Component-wise Ablation
- Curriculum RL alone lifts average scores by +0.8% over vanilla GRPO.
- The dynamic length reward further boosts performance, especially in hard math domains.
- ODSW consistently outperforms binary hard-sample filtering, especially when training data is imbalanced or skewed.

Reasoning Efficiency and Training Dynamics
- Dynamic rewards yield higher average accuracy and better token efficiency than fixed-length cosine rewards.
- Adaptive length emerges as longer for math and logic tasks and shorter for science and general understanding, as intended.
- PCuRL's hard stage induces a spike in reasoning length and validation accuracy, surpassing vanilla GRPO, whose accuracy plateaus despite a static output length.

Case Studies
VL-Cogito exhibits detailed, self-reflective, stepwise reasoning. For math, the model decomposes solutions into granular chains and actively corrects missteps, a behavior instilled by RL verification and advantage estimation [1, Figure 5]. On classification-style problems (e.g., identifying decomposers or skyscrapers in images), it methodically considers each option before boxing the answer, demonstrating strong multimodal comprehension and process reliability.

Insights and Impact
VL-Cogito's PCuRL pipeline validates several key insights:
- Learnability matters: prompts of intermediate difficulty drive model progress best.
- Exposure to challenge catalyzes deep reasoning: over-emphasis on easy samples degrades performance, while progressive emphasis on harder samples builds durable analytic depth.
- Reward granularity is crucial: combining correctness, format, and length rewards facilitates nuanced, context-sensitive reasoning outputs.
- No-SFT cold-start RL is feasible and highly effective: with PCuRL, models need not rely on an expensive SFT warm-up.

Conclusion
VL-Cogito's architecture and training innovations set a new standard for multimodal reasoning across diverse benchmarks. The design and empirical validation of progressive curriculum RL with dynamic length rewards point toward a general roadmap for robust reasoning in multimodal models.

Alibaba Qwen Unveils Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507: Refreshing the Importance of Small Language Models

Smaller Models with Smarter Performance and 256K Context Support

Alibaba's Qwen team has introduced two powerful additions to its small language model lineup: Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507. Despite having only 4 billion parameters, these models deliver strong capabilities across general-purpose and expert-level tasks while running efficiently on consumer-grade hardware. Both are designed with native 256K-token context windows, meaning they can process extremely long inputs such as large codebases, multi-document archives, and extended dialogues without external modifications.

Architecture and Core Design

Both models have 4 billion total parameters (3.6B excluding embeddings) across 36 transformer layers. They use Grouped Query Attention (GQA) with 32 query heads and 8 key/value heads, improving efficiency and memory management for very large contexts. They are dense transformer architectures, not mixture-of-experts, which ensures consistent task performance. Long-context support up to 262,144 tokens is built directly into the model architecture, and each model is pretrained extensively before undergoing alignment and safety post-training to ensure responsible, high-quality outputs.

Qwen3-4B-Instruct-2507: A Multilingual, Instruction-Following Generalist

The Qwen3-4B-Instruct-2507 model is optimized for speed, clarity, and user-aligned instruction following. It is designed to deliver direct answers without explicit step-by-step reasoning, making it well suited to scenarios where users want concise responses rather than detailed thought processes. Multilingual coverage spans more than 100 languages, making it highly suitable for global deployments in chatbots, customer support, education, and cross-language search. Its native 256K context support enables tasks like analyzing large legal documents, processing multi-hour transcripts, or summarizing massive datasets without splitting the content.

Performance benchmarks:

Benchmark task                        Score
General Knowledge (MMLU-Pro)          69.6
Reasoning (AIME25)                    47.4
SuperGPQA (QA)                        42.8
Coding (LiveCodeBench)                35.1
Creative Writing                      83.5
Multilingual Comprehension (MultiIF)  69.0

In practice, this means Qwen3-4B-Instruct-2507 can handle everything from language tutoring in multiple languages to generating rich narrative content, while still providing competent performance in reasoning, coding, and domain-specific knowledge.

Qwen3-4B-Thinking-2507: Expert-Level Chain-of-Thought Reasoning

Where the Instruct model focuses on concise responsiveness, the Qwen3-4B-Thinking-2507 model is engineered for deep reasoning and problem-solving. It automatically generates explicit chains of thought in its outputs, making its decision-making process transparent, which is especially beneficial for complex domains like mathematics, science, and programming. This model excels at technical diagnostics, scientific data interpretation, and multi-step logical analysis, and it suits advanced AI agents, research assistants, and coding companions that need to reason through problems before answering.

Performance benchmarks:

Benchmark task          Score
Math (AIME25)           81.3%
Science (HMMT25)        55.5%
General QA (GPQA)       65.8%
Coding (LiveCodeBench)  55.2%
Tool Usage (BFCL)       71.2%
Human Alignment         87.4%

These scores show that Qwen3-4B-Thinking-2507 can match or even surpass much larger models on reasoning-heavy benchmarks, producing more accurate and explainable results for mission-critical use cases.
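The article does not include code; as an unofficial sketch, the following shows how one might load and query the Instruct variant with the Hugging Face transformers library, assuming the public model ID matches the name used above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face model ID, matching the name used in this article.
model_id = "Qwen/Qwen3-4B-Instruct-2507"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Direct, non-"thinking" instruction following.
messages = [{"role": "user", "content": "Summarize the key ideas of grouped query attention."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```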
Across Both Models

Both the Instruct and Thinking variants share key advancements. The native 256K context window allows seamless work on extremely long inputs without external memory hacks. Both feature improved alignment, producing more natural, coherent, and context-aware responses in creative and multi-turn conversations, and both are agent-ready, supporting API calling, multi-step reasoning, and workflow orchestration out of the box. From a deployment perspective they are highly efficient: they can run on mainstream consumer GPUs with quantization for lower memory usage and are compatible with modern inference frameworks, so developers can run them locally or scale them in cloud environments without significant resource investment.

Practical Deployment and Applications

Deployment is straightforward, with broad framework compatibility enabling integration into any modern ML pipeline. The models can be used on edge devices and in enterprise virtual assistants, research institutions, coding environments, and creative studios. Example scenarios include:
- Instruction-following mode: customer support bots, multilingual educational assistants, real-time content generation.
- Thinking mode: scientific research analysis, legal reasoning, advanced coding tools, and agentic automation.

Conclusion

Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507 show that small language models can rival and even outperform larger models in specific domains when engineered thoughtfully. Their blend of long-context handling, strong multilingual capabilities, deep reasoning (in Thinking mode), and alignment improvements makes them powerful tools for both everyday and specialist AI applications. With these releases, Alibaba has set a new benchmark in making 256K-ready, high-performance AI models accessible to developers worldwide.
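As a deployment-oriented complement to the inference sketch earlier in this article, here is a hypothetical 4-bit quantized load using bitsandbytes through transformers; the quantization settings are illustrative choices for consumer-GPU memory budgets, not an official recommendation.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen3-4B-Thinking-2507"  # assumed Hugging Face model ID

# Illustrative 4-bit quantization config to reduce memory usage.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```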

Which Questions Improve Learning the Most? Utility Estimation of Questions with LM-based Simulations

arXiv:2502.17383v2 Announce Type: replace Abstract: Asking good questions is critical for comprehension and learning, yet evaluating and generating such questions remains a challenging problem. Prior work on inquisitive questions focuses on learner-generated, curiosity-driven queries and evaluates them using indirect metrics, such as salience or information gain, that do not directly capture a question’s impact on actual learning outcomes. We introduce QUEST (Question Utility Estimation with Simulated Tests), a framework that uses language models to simulate learners and directly quantify the utility of a question – its contribution to exam performance. QUEST simulates a learner who asks questions and receives answers while studying a textbook chapter, then uses them to take an end-of-chapter exam. Through this simulation, the utility of each question is estimated by its direct effect on exam performance, rather than inferred indirectly based on the underlying content. To support this evaluation, we curate TEXTBOOK-EXAM, a benchmark that aligns textbook sections with end-of-section exam questions across five academic disciplines. Using QUEST, we filter for high-utility questions and fine-tune question generators via rejection sampling. Experiments show that questions generated by QUEST-trained models improve simulated test scores by over 20% compared to strong baselines that are fine-tuned using indirect metrics or leverage prompting methods. Furthermore, utility is only weakly correlated with salience and similarity to exam questions, suggesting that it captures unique signal that benefits downstream performance. QUEST offers a new outcome-driven paradigm for question evaluation and generation – one that moves beyond question-answer content toward measurable improvements in learning outcomes.
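The abstract defines utility as a question's direct effect on simulated exam performance. A minimal, hypothetical sketch of that estimate follows; simulate_exam_score and the LM-backed learner and answerer are assumed interfaces, not the authors' released code.

```python
def question_utility(question, chapter, exam, learner_lm, answerer_lm):
    """Estimate the utility of a question as the change in simulated exam score
    when the learner studies the chapter plus the question-answer pair,
    relative to studying the chapter alone (hypothetical interfaces)."""
    answer = answerer_lm.answer(question, chapter)                             # assumed helper
    baseline = simulate_exam_score(learner_lm, chapter, [], exam)              # assumed helper
    with_question = simulate_exam_score(learner_lm, chapter, [(question, answer)], exam)
    return with_question - baseline

def filter_high_utility(questions, chapter, exam, learner_lm, answerer_lm, threshold=0.0):
    """Keep only questions whose estimated utility exceeds a threshold, e.g. to
    build a rejection-sampling set for fine-tuning a question generator."""
    return [q for q in questions
            if question_utility(q, chapter, exam, learner_lm, answerer_lm) > threshold]
```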

McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models

arXiv:2507.02088v2 Announce Type: replace Abstract: As large language models (LLMs) are increasingly applied to various NLP tasks, their inherent biases are gradually being disclosed. Measuring bias in LLMs is therefore crucial to mitigating their ethical risks. However, most existing bias evaluation datasets focus on English and North American culture, and their bias categories are not fully applicable to other cultures. Datasets grounded in the Chinese language and culture are scarce. More importantly, these datasets usually support only a single evaluation task and cannot evaluate bias in LLMs from multiple aspects. To address these issues, we present McBE, a Multi-task Chinese Bias Evaluation Benchmark that includes 4,077 bias evaluation instances covering 12 bias categories and 82 subcategories, and introduces 5 evaluation tasks, providing extensive category coverage, content diversity, and comprehensive measurement. Additionally, we evaluate several popular LLMs from different series and with different parameter sizes. In general, all of these LLMs demonstrate varying degrees of bias. We conduct an in-depth analysis of the results, offering novel insights into bias in LLMs.

Attention Basin: Why Contextual Position Matters in Large Language Models

arXiv:2508.05128v1 Announce Type: new Abstract: The performance of Large Language Models (LLMs) is significantly sensitive to the contextual position of information in the input. To investigate the mechanism behind this positional bias, our extensive experiments reveal a consistent phenomenon we term the attention basin: when presented with a sequence of structured items (e.g., retrieved documents or few-shot examples), models systematically assign higher attention to the items at the beginning and end of the sequence, while neglecting those in the middle. Crucially, our analysis further reveals that allocating higher attention to critical information is key to enhancing model performance. Based on these insights, we introduce Attention-Driven Reranking (AttnRank), a two-stage framework that (i) estimates a model’s intrinsic positional attention preferences using a small calibration set, and (ii) reorders retrieved documents or few-shot examples to align the most salient content with these high-attention positions. AttnRank is a model-agnostic, training-free, and plug-and-play method with minimal computational overhead. Experiments on multi-hop QA and few-shot in-context learning tasks demonstrate that AttnRank achieves substantial improvements across 10 large language models of varying architectures and scales, without modifying model parameters or training procedures.
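To illustrate the second stage described above, here is a small, hypothetical reordering routine: given a per-position attention preference (estimated on a calibration set) and a salience score per document, it places the most salient documents at the positions the model attends to most. The scoring inputs are assumptions; the paper's exact procedure may differ.

```python
def attnrank_reorder(documents, salience_scores, position_attention):
    """Reorder `documents` so the most salient content lands at the positions
    with the highest (calibrated) attention.

    documents:          retrieved documents or few-shot examples
    salience_scores:    salience of each document w.r.t. the query (assumed given)
    position_attention: calibrated attention weight per slot, typically high at the
                        beginning and end and low in the middle ("attention basin")
    """
    # Document indices from most to least salient.
    docs_by_salience = sorted(range(len(documents)),
                              key=lambda i: salience_scores[i], reverse=True)
    # Slot indices from most to least attended.
    slots_by_attention = sorted(range(len(documents)),
                                key=lambda p: position_attention[p], reverse=True)
    reordered = [None] * len(documents)
    for doc_idx, slot in zip(docs_by_salience, slots_by_attention):
        reordered[slot] = documents[doc_idx]
    return reordered

# Toy example with a U-shaped attention profile over 4 slots.
docs = ["d1", "d2", "d3", "d4"]
salience = [0.2, 0.9, 0.5, 0.1]
attention = [0.40, 0.10, 0.15, 0.35]
print(attnrank_reorder(docs, salience, attention))
# ['d2', 'd4', 'd1', 'd3']: the most salient documents occupy the most-attended slots.
```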
