Committee Archives - Page 44 of 101

Alibaba Qwen Unveils Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507: Refreshing the Importance of Small Language Models

admin NU / August 9, 2025

Smaller Models with Smarter Performance and 256K Context Support

Alibaba’s Qwen team has introduced two additions to its small language model lineup: Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507. Despite having only 4 billion parameters, these models handle both general-purpose and expert-level tasks while running efficiently on consumer-grade hardware. Both have native 256K-token context windows, so they can process very long inputs such as large codebases, multi-document archives, and extended dialogues without external modifications.

Architecture and Core Design

Both models have 4 billion total parameters (3.6B excluding embeddings) across 36 transformer layers. They use Grouped Query Attention (GQA) with 32 query heads and 8 key/value heads, which shrinks the key/value cache and keeps memory use manageable at very long contexts. They are dense transformers, not mixture-of-experts models, so every parameter is active for every token. Long-context support up to 262,144 tokens is built into the model architecture, and each model is pretrained extensively before undergoing alignment and safety post-training.

Qwen3-4B-Instruct-2507: A Multilingual, Instruction-Following Generalist

The Qwen3-4B-Instruct-2507 model is optimized for speed, clarity, and user-aligned instruction following. It gives direct answers without explicit step-by-step reasoning, which suits scenarios where users want concise responses rather than detailed thought processes.

Multilingual coverage spans over 100 languages, making it suitable for global deployments in chatbots, customer support, education, and cross-language search. Its native 256K context support lets it handle tasks such as analyzing large legal documents, processing multi-hour transcripts, or summarizing massive datasets without splitting the content.
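To make the GQA figures from the architecture description concrete, the sketch below estimates the key/value cache size at the full 262,144-token context. The layer count, head counts, and context length come from the article; the head dimension of 128 and the 2-byte (16-bit) cache values are assumptions that the article does not state, so treat the result as a rough illustration rather than a measured number.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


LAYERS = 36         # from the article
QUERY_HEADS = 32    # from the article
KV_HEADS = 8        # from the article
CONTEXT = 262_144   # from the article
HEAD_DIM = 128      # assumption: not stated in the article

GIB = 1024 ** 3

# GQA caches only the 8 key/value heads.
gqa_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)
# Hypothetical baseline: one key/value head per query head (no grouping).
mha_bytes = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, CONTEXT)

print(f"GQA cache at full context: {gqa_bytes / GIB:.0f} GiB")
print(f"Ungrouped cache at full context: {mha_bytes / GIB:.0f} GiB")
print(f"Reduction factor: {mha_bytes // gqa_bytes}x")
```

Under these assumptions, sharing each key/value head across four query heads cuts the cache by a factor of four, which is the memory benefit the article attributes to GQA.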
Performance Benchmarks:

Benchmark Task | Score
General Knowledge (MMLU-Pro) | 69.6
Reasoning (AIME25) | 47.4
SuperGPQA (QA) | 42.8
Coding (LiveCodeBench) | 35.1
Creative Writing | 83.5
Multilingual Comprehension (MultiIF) | 69.0

In practice, this means Qwen3-4B-Instruct-2507 can handle everything from language tutoring in multiple languages to generating rich narrative content, while still providing competent performance in reasoning, coding, and domain-specific knowledge.

Qwen3-4B-Thinking-2507: Expert-Level Chain-of-Thought Reasoning

Where the Instruct model focuses on concise responsiveness, the Qwen3-4B-Thinking-2507 model is engineered for deep reasoning and problem-solving. It automatically generates explicit chains of thought in its outputs, making its decision-making process transparent, which is especially useful in complex domains like mathematics, science, and programming.

This model excels at technical diagnostics, scientific data interpretation, and multi-step logical analysis. It’s suited for advanced AI agents, research assistants, and coding companions that need to reason through problems before answering.

Performance Benchmarks:

Benchmark Task | Score
Math (AIME25) | 81.3%
Math (HMMT25) | 55.5%
General QA (GPQA) | 65.8%
Coding (LiveCodeBench) | 55.2%
Tool Usage (BFCL) | 71.2%
Human Alignment | 87.4%

These scores suggest that Qwen3-4B-Thinking-2507 can match or even surpass much larger models on reasoning-heavy benchmarks, giving more accurate and explainable results for mission-critical use cases.

Across Both Models

Both the Instruct and Thinking variants share key advancements. The 256K native context window allows work on extremely long inputs without external memory workarounds. They also feature improved alignment, producing more natural, coherent, and context-aware responses in creative and multi-turn conversations. Furthermore, both are agent-ready, supporting API calling, multi-step reasoning, and workflow orchestration out of the box.
From a deployment perspective, they are efficient: with quantization for lower memory usage they can run on mainstream consumer GPUs, and they are compatible with modern inference frameworks. Developers can therefore run them locally or scale them in cloud environments without significant resource investment.

Practical Deployment and Applications

Deployment is straightforward, with broad framework compatibility enabling integration into most modern ML pipelines. The models can be used on edge devices and in enterprise virtual assistants, research institutions, coding environments, and creative studios. Example scenarios include:

- Instruction-Following Mode: Customer support bots, multilingual educational assistants, real-time content generation.
- Thinking Mode: Scientific research analysis, legal reasoning, advanced coding tools, and agentic automation.

Conclusion

The Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507 show that small language models can rival and even outperform larger models in specific domains when engineered thoughtfully. Their blend of long-context handling, strong multilingual capabilities, deep reasoning (in Thinking mode), and alignment improvements makes them useful tools for both everyday and specialist AI applications. With these releases, Alibaba has set a new benchmark in making 256K-ready, high-performance AI models accessible to developers worldwide.

The post Alibaba Qwen Unveils Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507: Refreshing the Importance of Small Language Models appeared first on MarkTechPost.


AI, Committee, News, Uncategorized

Which Questions Improve Learning the Most? Utility Estimation of Questions with LM-based Simulations

admin NU / August 8, 2025

arXiv:2502.17383v2 Announce Type: replace Abstract: Asking good questions is critical for comprehension and learning, yet evaluating and generating such questions remains a challenging problem. Prior work on inquisitive questions focuses on learner-generated, curiosity-driven queries and evaluates them using indirect metrics, such as salience or information gain, that do not directly capture a question’s impact on actual learning outcomes. We introduce QUEST (Question Utility Estimation with Simulated Tests), a framework that uses language models to simulate learners and directly quantify the utility of a question – its contribution to exam performance. QUEST simulates a learner who asks questions and receives answers while studying a textbook chapter, then uses them to take an end-of-chapter exam. Through this simulation, the utility of each question is estimated by its direct effect on exam performance, rather than inferred indirectly based on the underlying content. To support this evaluation, we curate TEXTBOOK-EXAM, a benchmark that aligns textbook sections with end-of-section exam questions across five academic disciplines. Using QUEST, we filter for high-utility questions and fine-tune question generators via rejection sampling. Experiments show that questions generated by QUEST-trained models improve simulated test scores by over 20% compared to strong baselines that are fine-tuned using indirect metrics or leverage prompting methods. Furthermore, utility is only weakly correlated with salience and similarity to exam questions, suggesting that it captures unique signal that benefits downstream performance. QUEST offers a new outcome-driven paradigm for question evaluation and generation – one that moves beyond question-answer content toward measurable improvements in learning outcomes.
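The utility estimate and filtering step described in the abstract can be sketched roughly as follows. This is an illustration only: the abstract does not give the exact formula, so utility is assumed here to be the exam-score gain from adding one question-answer pair, and `toy_exam` is a hypothetical keyword-matching stand-in for the LM-simulated learner. All function and variable names are invented for this example.

```python
from typing import Callable, List, Tuple

QA = Tuple[str, str]
# Hypothetical interface: maps the QA pairs a learner has studied to an
# exam score in [0, 1]. QUEST uses an LM-simulated learner in this role.
ExamFn = Callable[[List[QA]], float]


def question_utility(qa: QA, context: List[QA], take_exam: ExamFn) -> float:
    """Assumed utility: exam-score gain from adding one QA pair."""
    baseline = take_exam(list(context))
    with_question = take_exam(list(context) + [qa])
    return with_question - baseline


def filter_high_utility(candidates: List[QA], context: List[QA],
                        take_exam: ExamFn, threshold: float) -> List[QA]:
    """Rejection-sampling style filter: keep pairs whose utility clears the threshold."""
    return [qa for qa in candidates
            if question_utility(qa, context, take_exam) >= threshold]


# Toy stand-in for the simulated learner: the score is the share of exam
# topics mentioned in any studied answer.
EXAM_TOPICS = {"mitosis", "osmosis", "enzymes", "dna"}


def toy_exam(studied: List[QA]) -> float:
    covered = {t for t in EXAM_TOPICS for _, answer in studied if t in answer.lower()}
    return len(covered) / len(EXAM_TOPICS)


candidates = [
    ("What drives water across a membrane?",
     "Osmosis moves water toward higher solute concentration."),
    ("Who wrote the textbook?",
     "The author list is on the title page."),
    ("How do cells divide?",
     "Mitosis splits one nucleus into two identical nuclei."),
]

kept = filter_high_utility(candidates, context=[], take_exam=toy_exam, threshold=0.25)
for question, _ in kept:
    print(question)
```

In this toy run the question about the textbook's author is rejected because it does not change the simulated exam score, which mirrors the abstract's point that utility is measured by outcome rather than by surface properties of the question.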


AI, Committee, News, Uncategorized

McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models

admin NU / August 8, 2025

arXiv:2507.02088v2 Announce Type: replace Abstract: As large language models (LLMs) are increasingly applied to various NLP tasks, their inherent biases are gradually being revealed. Therefore, measuring biases in LLMs is crucial to mitigating their ethical risks. However, most existing bias evaluation datasets focus on English and North American culture, and their bias categories are not fully applicable to other cultures. Datasets grounded in the Chinese language and culture are scarce. More importantly, these datasets usually support only a single evaluation task and cannot evaluate bias in LLMs from multiple aspects. To address these issues, we present a Multi-task Chinese Bias Evaluation Benchmark (McBE) that includes 4,077 bias evaluation instances, covering 12 single bias categories and 82 subcategories and introducing 5 evaluation tasks, providing extensive category coverage, content diversity, and comprehensive measurement. Additionally, we evaluate several popular LLMs from different series and with different parameter sizes. In general, all these LLMs demonstrated varying degrees of bias. We conduct an in-depth analysis of results, offering novel insights into bias in LLMs.


AI, Committee, News, Uncategorized

Attention Basin: Why Contextual Position Matters in Large Language Models

admin NU / August 8, 2025

arXiv:2508.05128v1 Announce Type: new Abstract: The performance of Large Language Models (LLMs) is significantly sensitive to the contextual position of information in the input. To investigate the mechanism behind this positional bias, our extensive experiments reveal a consistent phenomenon we term the attention basin: when presented with a sequence of structured items (e.g., retrieved documents or few-shot examples), models systematically assign higher attention to the items at the beginning and end of the sequence, while neglecting those in the middle. Crucially, our analysis further reveals that allocating higher attention to critical information is key to enhancing model performance. Based on these insights, we introduce Attention-Driven Reranking (AttnRank), a two-stage framework that (i) estimates a model’s intrinsic positional attention preferences using a small calibration set, and (ii) reorders retrieved documents or few-shot examples to align the most salient content with these high-attention positions. AttnRank is a model-agnostic, training-free, and plug-and-play method with minimal computational overhead. Experiments on multi-hop QA and few-shot in-context learning tasks demonstrate that AttnRank achieves substantial improvements across 10 large language models of varying architectures and scales, without modifying model parameters or training procedures.
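The two stages described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the per-position attention values in `calibration_runs` are made-up numbers standing in for measurements on a calibration set, the relevance scores are hypothetical, and the function names are invented for this example. The sketch only shows the reordering idea, which places the most relevant items in the positions the model attends to most.

```python
def average_profile(calibration_runs):
    """Stage (i): average per-position attention over calibration samples."""
    n_positions = len(calibration_runs[0])
    n_runs = len(calibration_runs)
    return [sum(run[i] for run in calibration_runs) / n_runs
            for i in range(n_positions)]


def attn_rerank(docs, relevance, profile):
    """Stage (ii): put the most relevant documents in the most attended slots."""
    if not (len(docs) == len(relevance) == len(profile)):
        raise ValueError("docs, relevance, and profile must have equal length")
    # Positions ordered from most to least attended.
    slots = sorted(range(len(profile)), key=lambda i: profile[i], reverse=True)
    # Document indices ordered from most to least relevant.
    ranked = sorted(range(len(docs)), key=lambda i: relevance[i], reverse=True)
    ordered = [None] * len(docs)
    for slot, doc_idx in zip(slots, ranked):
        ordered[slot] = docs[doc_idx]
    return ordered


# Made-up calibration data with the basin shape from the abstract:
# high attention at both ends of the sequence, low in the middle.
calibration_runs = [
    [0.40, 0.10, 0.05, 0.15, 0.30],
    [0.36, 0.12, 0.07, 0.15, 0.30],
]
profile = average_profile(calibration_runs)

docs = ["A", "B", "C", "D", "E"]
relevance = [0.2, 0.9, 0.1, 0.7, 0.5]  # hypothetical retriever scores

reordered = attn_rerank(docs, relevance, profile)
print(reordered)  # ['B', 'A', 'C', 'E', 'D']
```

Because the reordering needs only a per-position profile and relevance scores, it can run before the prompt is built without touching model weights, consistent with the training-free framing in the abstract.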


AI, Committee, News, Uncategorized

The World According to LLMs: How Geographic Origin Influences LLMs’ Entity Deduction Capabilities

admin NU / August 8, 2025

arXiv:2508.05525v1 Announce Type: new Abstract: Large Language Models (LLMs) have been extensively tuned to mitigate explicit biases, yet they often exhibit subtle implicit biases rooted in their pre-training data. Rather than directly probing LLMs with human-crafted questions that may trigger guardrails, we propose studying how models behave when they proactively ask questions themselves. The 20 Questions game, a multi-turn deduction task, serves as an ideal testbed for this purpose. We systematically evaluate geographic performance disparities in entity deduction using a new dataset, Geo20Q+, consisting of both notable people and culturally significant objects (e.g., foods, landmarks, animals) from diverse regions. We test popular LLMs across two gameplay configurations (canonical 20-question and unlimited turns) and in seven languages (English, Hindi, Mandarin, Japanese, French, Spanish, and Turkish). Our results reveal geographic disparities: LLMs are substantially more successful at deducing entities from the Global North than the Global South, and the Global West than the Global East. While Wikipedia pageviews and pre-training corpus frequency correlate mildly with performance, they fail to fully explain these disparities. Notably, the language in which the game is played has minimal impact on performance gaps. These findings demonstrate the value of creative, free-form evaluation frameworks for uncovering subtle biases in LLMs that remain hidden in standard prompting setups. By analyzing how models initiate and pursue reasoning goals over multiple turns, we find geographic and cultural disparities embedded in their reasoning processes. We release the dataset (Geo20Q+) and code at https://sites.google.com/view/llmbias20q/home.


AI, Committee, News, Uncategorized

Data Processing for the OpenGPT-X Model Family

admin NU / August 8, 2025

arXiv:2410.08800v4 Announce Type: replace Abstract: This paper presents a comprehensive overview of the data preparation pipeline developed for the OpenGPT-X project, a large-scale initiative aimed at creating open and high-performance multilingual large language models (LLMs). The project goal is to deliver models that cover all major European languages, with a particular focus on real-world applications within the European Union. We explain all data processing steps, from data selection and requirement definition to the preparation of the final filtered data. We distinguish between curated data and web data, as each of these categories is handled by a distinct pipeline, with curated data undergoing minimal filtering and web data requiring extensive filtering and deduplication. This distinction guided the development of specialized algorithmic solutions for both pipelines. In addition to describing the processing methodologies, we provide an in-depth analysis of the datasets, increasing transparency and alignment with European data regulations. Finally, we share key insights and challenges faced during the project, offering recommendations for future endeavors in large-scale multilingual data preparation for LLMs.


AI, Committee, News, Uncategorized

HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization

admin NU / August 7, 2025

arXiv:2508.04010v1 Announce Type: new Abstract: Large language models enable agents to autonomously perform tasks in open web environments. However, as hidden threats within the web evolve, web agents face the challenge of balancing task performance with emerging risks during long-sequence operations. Although this challenge is critical, current research remains limited to single-objective optimization or single-turn scenarios, lacking the capability for collaborative optimization of both safety and utility in web environments. To address this gap, we propose HarmonyGuard, a multi-agent collaborative framework that leverages policy enhancement and objective optimization to jointly improve both utility and safety. HarmonyGuard features a multi-agent architecture characterized by two fundamental capabilities: (1) Adaptive Policy Enhancement: We introduce the Policy Agent within HarmonyGuard, which automatically extracts and maintains structured security policies from unstructured external documents, while continuously updating policies in response to evolving threats. (2) Dual-Objective Optimization: Based on the dual objectives of safety and utility, the Utility Agent integrated within HarmonyGuard performs Markovian real-time reasoning to evaluate the objectives and uses metacognitive capabilities to optimize them. Extensive evaluations on multiple benchmarks show that HarmonyGuard improves policy compliance by up to 38% and task completion by up to 20% over existing baselines, while achieving over 90% policy compliance across all tasks. Our project is available here: https://github.com/YurunChen/HarmonyGuard.


AI, Committee, News, Uncategorized

Transferring Expert Cognitive Models to Social Robots via Agentic Concept Bottleneck Models

admin NU / August 7, 2025

arXiv:2508.03998v1 Announce Type: new Abstract: Successful group meetings, such as those implemented in group behavioral-change programs, work meetings, and other social contexts, must promote individual goal setting and execution while strengthening the social relationships within the group. Consequently, an ideal facilitator must be sensitive to the subtle dynamics of disengagement, difficulties with individual goal setting and execution, and interpersonal difficulties that signal a need for intervention. The challenges and cognitive load experienced by facilitators create a critical gap for an embodied technology that can interpret social exchanges while remaining aware of the needs of the individuals in the group and providing transparent recommendations that go beyond powerful but “black box” foundation models (FMs) that identify social cues. We address this important demand with a social robot co-facilitator that analyzes multimodal meeting data and provides discreet cues to the facilitator. The robot’s reasoning is powered by an agentic concept bottleneck model (CBM), which makes decisions based on human-interpretable concepts like participant engagement and sentiments, ensuring transparency and trustworthiness. Our core contribution is a transfer learning framework that distills the broad social understanding of an FM into our specialized and transparent CBM. This concept-driven system significantly outperforms direct zero-shot FMs in predicting the need for intervention and enables real-time human correction of its reasoning. Critically, we demonstrate robust knowledge transfer: the model generalizes across different groups and successfully transfers the expertise of senior human facilitators to improve the performance of novices. By transferring an expert’s cognitive model into an interpretable robotic partner, our work provides a powerful blueprint for augmenting human capabilities in complex social domains.


AI, Committee, News, Uncategorized

IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards

admin NU / August 7, 2025

arXiv:2508.04632v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction following capabilities of large language models (LLMs), but suffers from training inefficiency due to inadequate difficulty assessment. Moreover, RLVR is prone to over-optimization, where LLMs exploit verification shortcuts without aligning to the actual intent of user instructions. We introduce Instruction Following Decorator (IFDecorator), a framework that wraps RLVR training into a robust and sample-efficient pipeline. It consists of three components: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, generating progressively more challenging instruction-verification pairs; (2) IntentCheck, a bypass module enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that detects reward hacking via trap instructions, which trigger and capture shortcut exploitation behaviors. Our Qwen2.5-32B-Instruct-IFDecorator achieves 87.43% accuracy on IFEval, outperforming larger proprietary models such as GPT-4o. Additionally, we demonstrate substantial improvements on FollowBench while preserving general capabilities. Our trip wires show significant reductions in reward hacking rates. We will release models, code, and data for future research.


AI, Committee, News, Uncategorized

MoE Architecture Comparison: Qwen3 30B-A3B vs. GPT-OSS 20B

admin NU / August 7, 2025

This article provides a technical comparison between two recently released Mixture-of-Experts (MoE) transformer models: Alibaba’s Qwen3 30B-A3B (released April 2025) and OpenAI’s GPT-OSS 20B (released August 2025). Both models represent distinct approaches to MoE architecture design, balancing computational efficiency with performance across different deployment scenarios.

Model Overview

Feature | Qwen3 30B-A3B | GPT-OSS 20B
Total Parameters | 30.5B | 21B
Active Parameters | 3.3B | 3.6B
Number of Layers | 48 | 24
MoE Experts | 128 (8 active) | 32 (4 active)
Attention Architecture | Grouped Query Attention | Grouped Multi-Query Attention
Query/Key-Value Heads | 32Q / 4KV | 64Q / 8KV
Context Window | 32,768 (ext. 262,144) | 128,000
Vocabulary Size | 151,936 | o200k_harmony (~200k)
Quantization | Standard precision | Native MXFP4
Release Date | April 2025 | August 2025

Sources: Qwen3 Official Documentation, OpenAI GPT-OSS Documentation

Qwen3 30B-A3B Technical Specifications

Architecture Details

Qwen3 30B-A3B employs a deep transformer architecture with 48 layers, each containing a Mixture-of-Experts configuration with 128 experts per layer. The model activates 8 experts per token during inference, achieving a balance between specialization and computational efficiency.

Attention Mechanism

The model utilizes Grouped Query Attention (GQA) with 32 query heads and 4 key-value heads³. This design optimizes memory usage while maintaining attention quality, particularly beneficial for long-context processing.

Context and Multilingual Support

- Native context length: 32,768 tokens
- Extended context: Up to 262,144 tokens (latest variants)
- Multilingual support: 119 languages and dialects
- Vocabulary: 151,936 tokens using BPE tokenization

Unique Features

Qwen3 incorporates a hybrid reasoning system supporting both “thinking” and “non-thinking” modes, allowing users to control computational overhead based on task complexity.
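The headline figures in the Model Overview table can be turned into sparsity ratios, which makes the two designs easier to compare. The sketch below uses only numbers from that table; the dictionary layout and function name are invented for illustration.

```python
# Figures copied from the Model Overview table (parameters in billions).
MODELS = {
    "Qwen3 30B-A3B": {"total_b": 30.5, "active_b": 3.3, "experts": 128, "experts_active": 8},
    "GPT-OSS 20B": {"total_b": 21.0, "active_b": 3.6, "experts": 32, "experts_active": 4},
}


def sparsity_summary(spec):
    """Share of parameters and of experts used for each token."""
    return {
        "active_param_share": spec["active_b"] / spec["total_b"],
        "active_expert_share": spec["experts_active"] / spec["experts"],
    }


for name, spec in MODELS.items():
    summary = sparsity_summary(spec)
    print(
        f"{name}: {summary['active_param_share']:.1%} of parameters and "
        f"{summary['active_expert_share']:.2%} of experts active per token"
    )
```

By these ratios Qwen3 30B-A3B is the sparser of the two: it activates about 10.8% of its parameters and 6.25% of its experts per token, against roughly 17.1% and 12.5% for GPT-OSS 20B.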
GPT-OSS 20B Technical Specifications

Architecture Details

GPT-OSS 20B features a 24-layer transformer with 32 MoE experts per layer⁸. The model activates 4 experts per token, emphasizing wider expert capacity over fine-grained specialization.

Attention Mechanism

The model implements Grouped Multi-Query Attention with 64 query heads and 8 key-value heads arranged in groups of 8¹⁰. This configuration supports efficient inference while maintaining attention quality across the wider architecture.

Context and Optimization

- Native context length: 128,000 tokens
- Quantization: Native MXFP4 (4.25-bit precision) for MoE weights
- Memory efficiency: Runs on 16GB memory with quantization
- Tokenizer: o200k_harmony (superset of GPT-4o tokenizer)

Performance Characteristics

GPT-OSS 20B uses alternating dense and locally banded sparse attention patterns similar to GPT-3, with Rotary Positional Embedding (RoPE) for positional encoding¹⁵.

Architectural Philosophy Comparison

Depth vs. Width Strategy

Qwen3 30B-A3B emphasizes depth and expert diversity:

- 48 layers enable multi-stage reasoning and hierarchical abstraction
- 128 experts per layer provide fine-grained specialization
- Suitable for complex reasoning tasks requiring deep processing

GPT-OSS 20B prioritizes width and computational density:

- 24 layers with larger experts maximize per-layer representational capacity
- Fewer but more powerful experts (32 vs 128) increase individual expert capability
- Optimized for efficient single-pass inference

MoE Routing Strategies

- Qwen3: Routes tokens through 8 of 128 experts, encouraging diverse, context-sensitive processing paths and modular decision-making.
- GPT-OSS: Routes tokens through 4 of 32 experts, maximizing per-expert computational power and delivering concentrated processing per inference step.
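The routing configurations above can be illustrated with a generic top-k router. This is a minimal sketch, not either model's actual routing code: the random router scores are stand-ins, and applying softmax only over the selected experts is one common convention that is assumed here, since the article does not describe how either model normalizes its gate weights.

```python
import math
import random


def top_k_route(logits, k):
    """Pick the k highest-scoring experts and softmax their logits into weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


rng = random.Random(0)  # fixed seed so the illustration is repeatable
for name, n_experts, k in [("Qwen3 30B-A3B", 128, 8), ("GPT-OSS 20B", 32, 4)]:
    # Stand-in router scores for a single token; a real router computes
    # these from the token's hidden state.
    logits = [rng.gauss(0.0, 1.0) for _ in range(n_experts)]
    routed = top_k_route(logits, k)
    chosen = [expert for expert, _ in routed]
    # Number of distinct expert subsets a token could be sent to in one layer.
    subsets = math.comb(n_experts, k)
    print(f"{name}: experts {chosen}, {subsets:,} possible subsets per layer")
```

The subset count gives a rough sense of what "fine-grained specialization" means in practice: choosing 8 of 128 experts allows far more distinct expert combinations per layer than choosing 4 of 32.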
Memory and Deployment Considerations

Qwen3 30B-A3B

- Memory requirements: Variable based on precision and context length
- Deployment: Optimized for cloud and edge deployment with flexible context extension
- Quantization: Supports various quantization schemes post-training

GPT-OSS 20B

- Memory requirements: 16GB with native MXFP4 quantization, ~48GB in bfloat16
- Deployment: Designed for consumer hardware compatibility
- Quantization: Native MXFP4 training enables efficient inference without quality degradation

Performance Characteristics

Qwen3 30B-A3B

- Excels in mathematical reasoning, coding, and complex logical tasks
- Strong performance in multilingual scenarios across 119 languages
- Thinking mode provides enhanced reasoning capabilities for complex problems

GPT-OSS 20B

- Achieves performance comparable to OpenAI o3-mini on standard benchmarks
- Optimized for tool use, web browsing, and function calling
- Strong chain-of-thought reasoning with adjustable reasoning effort levels

Use Case Recommendations

Choose Qwen3 30B-A3B for:

- Complex reasoning tasks requiring multi-stage processing
- Multilingual applications across diverse languages
- Scenarios requiring flexible context length extension
- Applications where thinking/reasoning transparency is valued

Choose GPT-OSS 20B for:

- Resource-constrained deployments requiring efficiency
- Tool-calling and agentic applications
- Rapid inference with consistent performance
- Edge deployment scenarios with limited memory

Conclusion

Qwen3 30B-A3B and GPT-OSS 20B represent complementary approaches to MoE architecture design. Qwen3 emphasizes depth, expert diversity, and multilingual capability, making it suitable for complex reasoning applications. GPT-OSS 20B prioritizes efficiency, tool integration, and deployment flexibility, positioning it for practical production environments with resource constraints.
Both models demonstrate the evolution of MoE architectures beyond simple parameter scaling, incorporating design choices that align the architecture with intended use cases and deployment scenarios.

Note: This article is inspired by the Reddit post and diagram shared by Sebastian Raschka.

Sources

1. Qwen3 30B-A3B Model Card – Hugging Face
2. Qwen3 Technical Blog
3. Qwen3 30B-A3B Base Specifications
4. Qwen3 30B-A3B Instruct 2507
5. Qwen3 Official Documentation
6. Qwen Tokenizer Documentation
7. Qwen3 Model Features
8. OpenAI GPT-OSS Introduction
9. GPT-OSS GitHub Repository
10. GPT-OSS 20B – Groq Documentation
11. OpenAI GPT-OSS Technical Details
12. Hugging Face GPT-OSS Blog
13. OpenAI GPT-OSS 20B Model Card
14. OpenAI GPT-OSS Introduction
15. NVIDIA GPT-OSS Technical Blog
16. Hugging Face GPT-OSS Blog
17. Qwen3 Performance Analysis
18. OpenAI GPT-OSS Model Card
19. GPT-OSS 20B Capabilities

The post MoE Architecture Comparison: Qwen3 30B-A3B vs. GPT-OSS 20B appeared first on MarkTechPost.

