Committee Archives - Page 43 of 101

Can a Crow Hatch a Falcon? Lineage Matters in Predicting Large Language Model Performance

admin NU / August 11, 2025

arXiv:2504.19811v2 Announce Type: replace Abstract: Accurately forecasting the performance of Large Language Models (LLMs) before extensive fine-tuning or merging can substantially reduce both computational expense and development time. Although prior approaches like scaling laws account for global factors such as parameter size or training tokens, they often overlook explicit lineage relationships, i.e., which models are derived or merged from which parents. In this work, we propose a novel Lineage-Regularized Matrix Factorization (LRMF) framework that encodes ancestral ties among LLMs via a graph Laplacian regularizer. By leveraging multi-hop parent-child connections, LRMF consistently outperforms conventional matrix factorization and collaborative filtering methods in both instance-level and benchmark-level performance prediction. Our large-scale study includes 2,934 publicly available Hugging Face models and 21,000+ instances across 6 major benchmarks, showing that the introduction of lineage constraints yields up to 0.15-0.30 higher Pearson correlation coefficients with actual performance compared to baseline methods. Moreover, LRMF effectively addresses the cold-start problem, providing accurate estimates for newly derived or merged models even with minimal data. This lineage-guided strategy thus offers a resource-efficient way to inform hyperparameter tuning, data selection, and model combination in modern LLM development.
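
The abstract does not spell out the LRMF objective, so the following is only a minimal sketch of what a graph-Laplacian-regularized factorization loss could look like. The function names, the single-hop symmetric adjacency matrix, and the weight `lam` are illustrative assumptions, not the paper's formulation.

```python
def laplacian(adj):
    """Graph Laplacian L = D - A for a symmetric adjacency matrix (list of lists)."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lineage_penalty(U, adj):
    """tr(U^T L U): small when lineage-linked models have similar embeddings."""
    L = laplacian(adj)
    n = len(U)
    return sum(L[i][j] * dot(U[i], U[j]) for i in range(n) for j in range(n))


def lrmf_loss(Y, observed, U, V, adj, lam=0.1):
    """Squared error on observed (model, item) scores plus the lineage penalty.

    Y: score matrix, U: model embeddings, V: item embeddings,
    observed: list of (model_index, item_index) pairs with known scores,
    lam: hypothetical regularization weight (an assumption of this sketch).
    """
    err = sum((Y[i][j] - dot(U[i], V[j])) ** 2 for i, j in observed)
    return err + lam * lineage_penalty(U, adj)


# Toy lineage: model 1 is derived from model 0; model 2 is unrelated.
adj = [[0, 1, 0],
       [1, 0, 0],
       [0, 0, 0]]
U_close = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # child embedding equals parent
U_far = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]    # child embedding far from parent
```

In this toy setup model 1 has no observed scores at all, yet the penalty still prefers `U_close` over `U_far`, which is the intuition behind using lineage for the cold-start case.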

AI, Committee, News, Uncategorized

Architectural Fusion Through Contextual Partitioning in Large Language Models: A Novel Approach to Parameterized Knowledge Integration

admin NU / August 11, 2025

arXiv:2501.12901v2 Announce Type: replace Abstract: Contextual Partitioning introduces an innovative approach to enhancing the architectural design of large-scale computational models through the dynamic segmentation of parameters into context-aware regions. This methodology emphasizes the importance of task-specific specialization, achieved through adaptive parameter allocation mechanisms that align with the linguistic features of input data. Experimental evaluations demonstrated substantial improvements in accuracy, perplexity, and contextual coherence across a variety of linguistic tasks, highlighting the adaptability and scalability of the proposed framework. By reducing redundancy and enhancing computational efficiency, Contextual Partitioning not only streamlines model operations but also expands the scope of applications for advanced language processing systems. The approach operates autonomously, requiring no external fine-tuning, thereby addressing a significant limitation in conventional parameter optimization techniques. Empirical results demonstrate the effectiveness of gradient-driven segmentation, enabling models to dynamically recalibrate and specialize in response to task-specific demands. Furthermore, resource utilization metrics reveal notable reductions in memory usage and training times, confirming the efficiency of the approach. Observations from qualitative analyses illustrate improved contextual coherence and logical flow in generated outputs, reinforcing the practical value of this technique. The findings collectively demonstrate the potential for Contextual Partitioning to redefine the scalability and adaptability of computational language architectures in diverse and complex domains.

AI, Committee, News, Uncategorized

Are Your LLMs Capable of Stable Reasoning?

admin NU / August 11, 2025

arXiv:2412.13147v5 Announce Type: replace-cross Abstract: The rapid advancement of large language models (LLMs) has shown remarkable progress in complex reasoning tasks. However, a significant disparity exists between benchmark performances and real-world applications. We attribute this gap primarily to current evaluation protocols and metrics, which inadequately capture the full spectrum of LLM capabilities, especially in complex reasoning tasks where both accuracy and consistency are essential. In this paper, we introduce G-Pass@$k$, a novel evaluation metric that continuously assesses model performance across multiple sampling attempts, quantifying both the model’s performance potential and its stability. Through extensive experiments on various public and newly constructed benchmarks, we employ G-Pass@$k$ in conjunction with state-of-the-art large language models to provide comprehensive insights into their potential capabilities and operational consistency. Our findings reveal a significant opportunity to enhance the realistic reasoning abilities of LLMs, underscoring the necessity for more robust evaluation metrics.
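
The abstract does not give the formula for G-Pass@$k$. One plausible formalization, assumed here purely for illustration, is a hypergeometric tail: given n generations for a question of which c are correct, it is the probability that at least ⌈τ·k⌉ of k samples drawn without replacement are correct. The threshold parameter `tau` and the function name are assumptions of this sketch.

```python
from math import ceil, comb


def g_pass_at_k(n, c, k, tau):
    """Probability that at least ceil(tau * k) of k draws (without replacement)
    from n generations, c of them correct, are correct.

    Assumed formalization; with a very small tau it reduces to the usual Pass@k.
    """
    need = max(1, ceil(tau * k))
    total = comb(n, k)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible splits contribute nothing to the sum.
    hits = sum(comb(c, j) * comb(n - c, k - j) for j in range(need, min(c, k) + 1))
    return hits / total
```

Under this reading, a larger `tau` demands more consistent success across the k samples, so the score measures stability rather than best-case potential.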

AI, Committee, News, Uncategorized

Human-like fleeting memory improves language learning but impairs reading time prediction in transformer language models

admin NU / August 11, 2025

arXiv:2508.05803v1 Announce Type: new Abstract: Human memory is fleeting. As words are processed, the exact wordforms that make up incoming sentences are rapidly lost. Cognitive scientists have long believed that this limitation of memory may, paradoxically, help in learning language – an idea supported by classic connectionist modelling work. The rise of Transformers appears to challenge this idea, as these models can learn language effectively, despite lacking memory limitations or other architectural recency biases. Here, we investigate the hypothesized benefit of fleeting memory for language learning in tightly controlled experiments on transformer language models. Training transformers with and without fleeting memory on a developmentally realistic training set, we find that fleeting memory consistently improves language learning (as quantified by both overall language modelling performance and targeted syntactic evaluation) but, unexpectedly, impairs surprisal-based prediction of human reading times. Interestingly, follow-up analyses revealed that this discrepancy – better language modeling, yet worse reading time prediction – could not be accounted for by prior explanations of why better language models sometimes fit human reading time worse. Together, these results support a benefit of memory limitations on neural network language learning – but not on predicting behavior.

AI, Committee, News, Uncategorized

Graph-R1: An Agentic GraphRAG Framework for Structured, Multi-Turn Reasoning with Reinforcement Learning

admin NU / August 10, 2025

Introduction

Large Language Models (LLMs) have set new benchmarks in natural language processing, but their tendency to hallucinate (generate inaccurate outputs) remains a critical issue for knowledge-intensive applications. Retrieval-Augmented Generation (RAG) frameworks attempt to solve this by incorporating external knowledge into language generation. However, traditional RAG approaches rely on chunk-based retrieval, which limits their ability to represent complex semantic relationships. Entity-relation graph-based RAG methods (GraphRAG) address some structural limitations, but still face high construction cost, one-shot retrieval inflexibility, and dependence on long-context reasoning and carefully crafted prompts.

Researchers from Nanyang Technological University, National University of Singapore, Beijing Institute of Computer Technology and Application, and Beijing Anzhen Hospital have introduced Graph-R1, an agentic GraphRAG framework powered by end-to-end reinforcement learning.

Image source: https://arxiv.org/pdf/2507.21892v1

Core Innovations of Graph-R1

1. Lightweight Knowledge Hypergraph Construction

Graph-R1 constructs knowledge as a hypergraph, where each knowledge segment is extracted using LLM-driven n-ary relation extraction. This approach encodes richer and more semantically grounded relationships, boosting agentic reasoning capabilities while maintaining manageable cost and computational requirements.

Efficiency: Only 5.69s and $2.81 per 1,000 tokens for construction (vs. $3.35 for GraphRAG and $4.14 for HyperGraphRAG), while generating semantically rich graphs with 120,499 nodes and 98,073 edges.

2. Multi-Turn Agentic Retrieval Process

Graph-R1 models retrieval as a multi-turn interaction loop (“think-retrieve-rethink-generate”), allowing the agent to adaptively query and refine its knowledge path, unlike previous methods that use one-shot retrieval.

Dynamic Reasoning: The agent decides at each step whether to continue exploring or terminate with an answer. Entity-based and direct hyperedge retrieval are fused through reciprocal rank aggregation, improving the chances of retrieving the most relevant knowledge.

3. End-to-End Reinforcement Learning Optimization

Graph-R1 uses Group Relative Policy Optimization (GRPO) for end-to-end RL, integrating rewards for format adherence, relevance, and answer correctness. This unified reward guides agents to develop generalizable reasoning strategies tightly aligned with both the knowledge structure and output quality.

Outcome-directed reward mechanism: Combines format rewards (structural coherence) and answer rewards (semantic accuracy) for effective optimization, only rewarding answers embedded in structurally valid reasoning trajectories.

Key Findings

Benchmarking on RAG QA Tasks

Graph-R1 was evaluated across six standard QA datasets (2WikiMultiHopQA, HotpotQA, Musique, Natural Questions, PopQA, TriviaQA).

Method            Avg. F1 (Qwen2.5-7B)
NaiveGeneration   13.87
StandardRAG       15.89
GraphRAG          24.87
HyperGraphRAG     29.40
Search-R1         46.19
R1-Searcher       42.29
Graph-R1          57.82

Graph-R1 achieves an average F1 of 57.82 with Qwen2.5-7B, surpassing all previous baselines by a wide margin. Larger base models amplify its performance gains.

Ablation Analysis

Component ablation demonstrates that removing hypergraph construction, multi-turn reasoning, or RL optimization dramatically reduces performance, validating the necessity of each module within Graph-R1.

Retrieval and Efficiency

Graph-R1 retrieval is more concise and effective. It achieves high F1 scores with moderate average content lengths (~1200-1500 tokens per exchange), and supports more interaction turns (average 2.3-2.5), facilitating stable and accurate knowledge extraction.

Generation cost is minimal: Despite richer representation, Graph-R1’s response time per query (7.0s) and per-query cost ($0) outperform graph-based competitors like HyperGraphRAG (9.6s, $8.76).

Generation Quality

Graph-R1’s generation quality is evaluated across seven dimensions (comprehensiveness, knowledgeability, correctness, relevance, diversity, logical coherence, factuality) and consistently outperforms all RL-based and graph-based baselines, achieving top scores in correctness (86.9), relevance (95.2), and coherence (88.5).

Generalizability

Cross-validation on out-of-distribution (O.O.D.) settings reveals that Graph-R1 maintains robust performance across datasets, with O.O.D./I.I.D. ratios often above 85%, demonstrating strong domain generalization properties.

Theoretical Guarantees

Graph-R1 is supported by information-theoretic analyses:

Graph-structured knowledge provides higher information density per retrieval and faster convergence to correct answers compared to chunk-based retrieval.

Multi-turn interaction enables the agent to achieve higher retrieval efficiency by dynamically focusing on high-impact graph regions.

End-to-end RL optimization bridges graph-structured evidence and language generation, reducing output entropy and error rates.

Algorithmic Workflow (High-Level)

Knowledge Hypergraph Extraction: LLM extracts n-ary relations to build entity and hyperedge sets.

Multi-turn Agentic Reasoning: The agent alternates between reflective thinking, querying, hypergraph retrieval (entity and hyperedge dual paths), and synthesis.

GRPO Optimization: RL policy is updated using sampled trajectories and reward normalization, enforcing structure and answer correctness.
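
Two steps of this workflow can be sketched in a few lines: fusing the entity and hyperedge retrieval paths by reciprocal rank, and normalizing rewards within a group of sampled trajectories. This is a rough illustration only. The smoothing constant `k=60`, the reward scale in `outcome_reward`, and all function names are assumptions, not details taken from the Graph-R1 paper.

```python
from statistics import mean, pstdev


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each item scores sum(1 / (k + rank)).

    k=60 is a common default for reciprocal rank fusion, assumed here.
    """
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by item name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


def outcome_reward(format_valid, answer_score):
    """Hypothetical outcome-directed reward: the answer score only counts
    when the reasoning trajectory is structurally valid."""
    return 1.0 + answer_score if format_valid else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical retrieval results from the two paths.
entity_path = ["e1", "e2", "e3"]
hyperedge_path = ["e2", "e4", "e1"]
fused = reciprocal_rank_fusion([entity_path, hyperedge_path])
```

Items that rank well on both paths rise to the top of `fused`, and a trajectory's advantage is positive only when its reward beats the group average.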
Conclusion

Graph-R1 demonstrates that integrating hypergraph-based knowledge representation, agentic multi-turn reasoning, and end-to-end RL delivers large gains in factual QA performance, retrieval efficiency, and generation quality, charting the path for next-generation agentic and knowledge-driven LLM systems.

FAQ 1: What is the key innovation of Graph-R1 compared to earlier GraphRAG and RAG systems?

Graph-R1 introduces an agentic framework where retrieval is modeled as a multi-turn interaction rather than a single one-shot process. Its main innovations are:

Hypergraph Knowledge Representation: Instead of simple entity-relation graphs or text chunks, Graph-R1 constructs a semantic hypergraph that enables more expressive, n-ary relationships between entities.

Multi-Turn Reasoning Loop: The agent operates in repeated cycles of “think–retrieve–rethink–generate” over the hypergraph, dynamically focusing queries rather than retrieving everything at once.

End-to-End Reinforcement Learning (RL): The agent is trained with a reward function that simultaneously optimizes for step-wise logical reasoning and final answer correctness, enabling tighter alignment between structured knowledge and natural language answers.

FAQ 2: How does Graph-R1’s retrieval and generation efficiency compare to previous methods?

Graph-R1 is significantly more efficient and effective in both retrieval and answer generation:

Lower Construction & Retrieval Cost: For building the knowledge hypergraph, Graph-R1 takes only 5.69 seconds and costs $2.81 per 1,000 tokens (on the 2Wiki dataset), outperforming similar graph-based methods.

Faster and Cheaper Generation: Query response times (average 7 seconds per query) and generation costs ($0 per query) are better than prior graph-RAG systems, such as HyperGraphRAG.

Conciseness & Robustness: Graph-R1 answers are both more concise (usually 1,200–1,500 tokens) and more accurate due to the multi-turn interaction, with state-of-the-art F1 scores across six QA datasets.

FAQ 3: In which scenarios or domains is the Graph-R1 framework most applicable?

Graph-R1 is ideal for complex knowledge-intensive applications demanding both factual accuracy and reasoning transparency, such as:

Healthcare and Medical AI: Where multi-hop reasoning, traceability, and reliability are essential.

Legal and Regulatory Domains: That require precise grounded answers and interpretable multi-step reasoning.

Enterprise Knowledge Automation: For tasks needing scalable, dynamic querying and retrieval across large document or data corpora.

The model’s architecture also allows for easy adaptation to other fields that benefit from agentic, multi-turn knowledge search anchored in

AI, Committee, News, Uncategorized

From terabytes to insights: Real-world AI observability architecture

admin NU / August 10, 2025

GUEST: Consider maintaining and developing an e-commerce platform that processes millions of transactions every minute, generating large amounts of telemetry data, including metrics, logs and traces across multiple microservices. When critical incidents occur, on-call engineers face the daunting task of sifting through an ocean of data to unravel r…

AI, Committee, News, Uncategorized

OpenAI returns old models to ChatGPT as Sam Altman admits ‘bumpy’ GPT-5 rollout

admin NU / August 9, 2025

The pressure is on for OpenAI to prove that GPT-5 isn’t just an incremental update, but a true step forward.

AI, Committee, News, Uncategorized

OpenAI’s GPT-5 rollout is not going smoothly

admin NU / August 9, 2025

It also failed on a simple algebra problem that elementary schoolers could probably nail: 5.9 = x + 5.11.
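
For reference, the equation can be solved by subtracting 5.11 from both sides. The short check below uses exact decimal arithmetic so that binary floating-point rounding does not blur the result.

```python
from decimal import Decimal

# Solve 5.9 = x + 5.11 by subtracting 5.11 from both sides.
x = Decimal("5.9") - Decimal("5.11")

# Substitute back to confirm the solution satisfies the original equation.
check = x + Decimal("5.11")
```

Here `x` is 0.79, and substituting it back gives 5.90, which equals 5.9.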

AI, Committee, News, Uncategorized

VL-Cogito: Advancing Multimodal Reasoning with Progressive Curriculum Reinforcement Learning

admin NU / August 9, 2025

Multimodal reasoning, where models integrate and interpret information from multiple sources such as text, images, and diagrams, is a frontier challenge in AI. VL-Cogito is a state-of-the-art Multimodal Large Language Model (MLLM) proposed by DAMO Academy (Alibaba Group) and partners, introducing a robust reinforcement learning pipeline that fundamentally upgrades the reasoning skills of large models across mathematics, science, logic, charts, and general understanding.

Core Innovations

VL-Cogito’s unique approach centers around the Progressive Curriculum Reinforcement Learning (PCuRL) framework, engineered to systematically overcome the instability and domain gaps endemic to multimodal reasoning. The framework includes two breakthrough innovations:

Online Difficulty Soft Weighting (ODSW): This mechanism assigns dynamic weights to training samples according to their difficulty and the model’s evolving capabilities. Rather than rigidly filtering out “easy” or “hard” samples, ODSW ensures each prompt contributes appropriately to gradient updates, enabling the model to progress from clear cases to intricate, challenging ones through a continuous curriculum. Three variants tune the focus for easy, medium, or hard stages using a piecewise function based on rollout accuracy, guided by learnability theory and empirical distribution of task difficulty.

Dynamic Length Reward (DyLR): Traditional length rewards in RL-based reasoning models set a static target, which fails to consider task complexity and encourages unnecessary verbosity. DyLR solves this by calculating an ideal target length per prompt, estimated via the average length of correct rollout samples for each question. Short, rapid reasoning is promoted for easy tasks, while complex ones incentivize deeper, multi-step exploration, perfectly balancing efficiency and correctness.
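
The article does not give the exact piecewise functions or the shape of the length reward, so the sketch below is purely illustrative. The three linear weight ramps, the linear length penalty, and every function name are hypothetical stand-ins chosen to show the idea: weights depend on rollout accuracy, and the target length is the average length of the correct rollouts.

```python
def odsw_weight(rollout_accuracy, stage):
    """Hypothetical soft weight for a prompt given its rollout accuracy in [0, 1].

    easy:   full weight for accuracy >= 0.5, fading to 0 as accuracy -> 0
    hard:   full weight for accuracy <= 0.5, fading to 0 as accuracy -> 1
    medium: peaks at accuracy 0.5
    """
    easy = min(1.0, 2.0 * rollout_accuracy)
    hard = min(1.0, 2.0 * (1.0 - rollout_accuracy))
    if stage == "easy":
        return easy
    if stage == "hard":
        return hard
    if stage == "medium":
        return min(easy, hard)
    raise ValueError(f"unknown stage: {stage}")


def target_length(rollouts):
    """Average length of the correct rollouts for one prompt.

    rollouts: list of (length_in_tokens, is_correct) pairs.
    Returns None when no rollout is correct (handling of that case is
    an assumption of this sketch).
    """
    correct = [length for length, ok in rollouts if ok]
    return sum(correct) / len(correct) if correct else None


def length_reward(length, target):
    """Hypothetical reward: 1 at the target length, falling linearly to 0."""
    if target is None:
        return 0.0
    return max(0.0, 1.0 - abs(length - target) / target)
```

For example, a prompt whose correct rollouts have 100 and 300 tokens gets a target of 200, so a 300-token response earns a reward of 0.5 under this assumed linear shape.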
Training Pipeline

VL-Cogito’s RL post-training starts directly from the Qwen2.5-VL-7B-Instruct backbone, with no initial supervised fine-tuning (SFT) cold start required. The PCuRL process is explicitly divided into three sequential RL stages: easy, medium, and hard. In each stage:

The same dataset is shuffled, exposing the model to various generalization challenges.

ODSW’s weighting function for that stage biases gradient updates towards the target difficulty.

In the hard stage, DyLR is triggered to encourage adaptive reasoning chain expansion.

Technical setup details:

AdamW optimizer, LR=1e-6, DeepSpeed-ZeRO3.

Rollout batch size: 512; global batch size: 128; sequence length: 4,096; KL divergence loss: 1e-3; 16 response samples per prompt; temperature: 1.0.

Reward hyperparameters: α=1, β=0.5, γ=1, w=0.25 (penalty for zero-accuracy prompts).

Dataset Curation and RL Data Sampling

A meticulously curated training set covers 23 open-source multimodal datasets across six task categories: Mathematical Reasoning, Logical Reasoning, Counting, Science Reasoning, Chart Understanding, and General Image Understanding. All samples are reformulated to open-ended QA formats to prevent superficial multiple-choice cue exploitation.

Difficulty sampling: Each sample is attempted 8 times with Qwen2.5-VL-7B-Instruct; any sample it answers with ≥50% accuracy is dropped, so that only genuinely challenging tasks remain.

Evaluation and Benchmark Results

Performance Across Benchmarks

VL-Cogito is benchmarked against both general-purpose and reasoning-oriented MLLMs on a ten-task panel, including datasets like Geometry3K, MathVerse, MathVista, ChartQA, ScienceQA, MMMU, EMMA, and MMStar.

Absolute accuracy gains over the backbone: +7.1% on Geometry3K, +5.5% on MathVista, +4.9% on LogicVista, +2.2% on ScienceQA, +4.5% on EMMA, +3.8% on MMStar.

State-of-the-art results on 6/10 benchmarks: VL-Cogito either leads or matches top results, especially on rigorous math and scientific tasks. Models “cold-started” with SFT or forced rethinking strategies do not surpass its robust, curriculum-based RL.

Model               Geo3K  MathVerse  MathVista  MathVision  LogicVista  ChartQA  SciQA  MMMU  EMMA  MMStar
VL-Cogito (7B)      68.7   53.3       74.8       30.7        48.9        83.4     87.6   52.6  29.1  66.3
VL-Rethinker (7B)   67.7   54.6       73.7       30.1        45.7        83.5     86.7   52.9  28.6  64.2
MM-Eureka (8B)      67.2   52.3       73.4       29.4        47.1        82.7     86.4   52.3  27.4  64.7
Qwen2.5-VL (7B)     61.6   50.4       69.3       28.7        44.0        82.4     85.4   50.9  24.6  62.5

Component-wise Ablation

Curriculum RL alone lifts average scores by +0.8% over vanilla GRPO.

Dynamic length reward further boosts performance, especially in hard math domains.

ODSW consistently outperforms binary hard sample filtering, especially when training data is imbalanced or skewed.

Reasoning Efficiency and Training Dynamics

Dynamic rewards yield higher average accuracy and better token efficiency than fixed-length cosine rewards.

Adaptive length emerges as longer for math and logic tasks, shorter for science and general understanding, precisely as intended.

PCuRL’s hard stage induces a spike in reasoning length and validation accuracy, surpassing vanilla GRPO, whose accuracy plateaus despite static output length.

Case Studies

VL-Cogito exhibits detailed, self-reflective, stepwise reasoning. For math, the model decomposes solutions into granular chains and actively corrects missteps, a behavior instilled by RL verification and advantage estimation [1, Figure 5]. On classification-style problems (e.g., identifying decomposers or skyscrapers in images), it methodically considers each option before boxing the answer, demonstrating strong multimodal comprehension and process reliability.

Insights and Impact

VL-Cogito’s systematic PCuRL pipeline validates several key insights:

Learnability matters: Prompts with intermediate difficulty optimize model progress best.

Exposure to challenge catalyzes deep reasoning: Over-emphasis on easy samples degrades performance; progressive emphasis on harder samples builds durable analytic depth.

Reward granularity is crucial: Combining correctness, format, and length facilitates nuanced, context-sensitive reasoning outputs.

No-SFT cold-start RL is feasible and highly effective: With PCuRL, models need not rely on expensive SFT warm-up.

Conclusion

VL-Cogito’s architecture and training innovations set a new standard for multimodal reasoning across diverse benchmarks. The design and empirical validation of progressive curriculum RL with dynamic length rewards point toward a general roadmap for robust reasoning in multimodal models.

The post VL-Cogito: Advancing Multimodal Reasoning with Progressive Curriculum Reinforcement Learning appeared first on MarkTechPost.

AI, Committee, News, Uncategorized

Anthropic revenue tied to two customers as AI pricing war threatens margins

admin NU / August 9, 2025

Anthropic’s $5B run rate leans heavily on Cursor and GitHub Copilot while OpenAI’s cheaper GPT-5 undercuts Claude, spotlighting customer concentration risk and enterprise AI cost pressure.
