YouZum

Uncategorized

AI, Committee, ข่าว, Uncategorized

Alibaba Qwen Unveils Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507: Refreshing the Importance of Small Language Models

Smaller Models with Smarter Performance and 256K Context Support Alibaba’s Qwen team has introduced two powerful additions to its small language model lineup: Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507. Despite having only 4 billion parameters, these models deliver exceptional capabilities across general-purpose and expert-level tasks while running efficiently on consumer-grade hardware. Both are designed with native 256K token context windows, meaning they can process extremely long inputs such as large codebases, multi-document archives, and extended dialogues without external modifications. Architecture and Core Design Both models feature 4 billion total parameters (3.6B excluding embeddings) built across 36 transformer layers. They use Grouped Query Attention (GQA) with 32 query heads and 8 key/value heads, enhancing efficiency and memory management for very large contexts. They are dense transformer architectures—not mixture-of-experts—which ensures consistent task performance. Long-context support up to 262,144 tokens is baked directly into the model architecture, and each model is pretrained extensively before undergoing alignment and safety post-training to ensure responsible, high-quality outputs. Qwen3-4B-Instruct-2507 — A Multilingual, Instruction-Following Generalist The Qwen3-4B-Instruct-2507 model is optimized for speed, clarity, and user-aligned instruction following. It is designed to deliver direct answers without explicit step-by-step reasoning, making it perfect for scenarios where users want concise responses rather than detailed thought processes. Multilingual coverage spans over 100 languages, making it highly suitable for global deployments in chatbots, customer support, education, and cross-language search. Its native 256K context support enables it to handle tasks like analyzing large legal documents, processing multi-hour transcripts, or summarizing massive datasets without splitting the content. Performance Benchmarks: Benchmark Task Score General Knowledge (MMLU-Pro) 69.6 Reasoning (AIME25) 47.4 SuperGPQA (QA) 42.8 Coding (LiveCodeBench) 35.1 Creative Writing 83.5 Multilingual Comprehension (MultiIF) 69.0 In practice, this means Qwen3-4B-Instruct-2507 can handle everything from language tutoring in multiple languages to generating rich narrative content, while still providing competent performance in reasoning, coding, and domain-specific knowledge. Qwen3-4B-Thinking-2507 — Expert-Level Chain-of-Thought Reasoning Where the Instruct model focuses on concise responsiveness, the Qwen3-4B-Thinking-2507 model is engineered for deep reasoning and problem-solving. It automatically generates explicit chains of thought in its outputs, making its decision-making process transparent—especially beneficial for complex domains like mathematics, science, and programming. This model excels at technical diagnostics, scientific data interpretation, and multi-step logical analysis. It’s suited for advanced AI agents, research assistants, and coding companions that need to reason through problems before answering. Performance Benchmarks: Benchmark Task Score Math (AIME25) 81.3% Science (HMMT25) 55.5% General QA (GPQA) 65.8% Coding (LiveCodeBench) 55.2% Tool Usage (BFCL) 71.2% Human Alignment 87.4% These scores demonstrate that Qwen3-4B-Thinking-2507 can match or even surpass much larger models in reasoning-heavy benchmarks, allowing more accurate and explainable results for mission-critical use cases. Across Both Models Both the Instruct and Thinking variants share key advancements. The 256K native context window allows for seamless work on extremely long inputs without external memory hacks. They also feature improved alignment, producing more natural, coherent, and context-aware responses in creative and multi-turn conversations. Furthermore, both are agent-ready, supporting API calling, multi-step reasoning, and workflow orchestration out-of-the-box. From a deployment perspective, they are highly efficient—capable of running on mainstream consumer GPUs with quantization for lower memory usage, and fully compatible with modern inference frameworks. This means developers can run them locally or scale them in cloud environments without significant resource investment. Practical Deployment and Applications Deployment is straightforward, with broad framework compatibility enabling integration into any modern ML pipeline. They can be used in edge devices, enterprise virtual assistants, research institutions, coding environments, and creative studios. Example scenarios include: Instruction-Following Mode: Customer support bots, multilingual educational assistants, real-time content generation. Thinking Mode: Scientific research analysis, legal reasoning, advanced coding tools, and agentic automation. Conclusion The Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507 prove that small language models can rival and even outperform larger models in specific domains when engineered thoughtfully. Their blend of long-context handling, strong multilingual capabilities, deep reasoning (in Thinking mode), and alignment improvements makes them powerful tools for both everyday and specialist AI applications. With these releases, Alibaba has set a new benchmark in making 256K-ready, high-performance AI models accessible to developers worldwide. Check out the Qwen3-4B-Instruct-2507 Model and Qwen3-4B-Thinking-2507 Model. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to Subscribe to our Newsletter. Discuss on Hacker News Join our ML Subreddit Sponsor us The post Alibaba Qwen Unveils Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507: Refreshing the Importance of Small Language Models appeared first on MarkTechPost.

Alibaba Qwen Unveils Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507: Refreshing the Importance of Small Language Models Read Post »

AI, Committee, ข่าว, Uncategorized

Which Questions Improve Learning the Most? Utility Estimation of Questions with LM-based Simulations

arXiv:2502.17383v2 Announce Type: replace Abstract: Asking good questions is critical for comprehension and learning, yet evaluating and generating such questions remains a challenging problem. Prior work on inquisitive questions focuses on learner-generated, curiosity-driven queries and evaluates them using indirect metrics, such as salience or information gain, that do not directly capture a question’s impact on actual learning outcomes. We introduce QUEST (Question Utility Estimation with Simulated Tests), a framework that uses language models to simulate learners and directly quantify the utility of a question – its contribution to exam performance. QUEST simulates a learner who asks questions and receives answers while studying a textbook chapter, then uses them to take an end-of-chapter exam. Through this simulation, the utility of each question is estimated by its direct effect on exam performance, rather than inferred indirectly based on the underlying content. To support this evaluation, we curate TEXTBOOK-EXAM, a benchmark that aligns textbook sections with end-of-section exam questions across five academic disciplines. Using QUEST, we filter for high-utility questions and fine-tune question generators via rejection sampling. Experiments show that questions generated by QUEST-trained models improve simulated test scores by over 20% compared to strong baselines that are fine-tuned using indirect metrics or leverage prompting methods. Furthermore, utility is only weakly correlated with salience and similarity to exam questions, suggesting that it captures unique signal that benefits downstream performance. QUEST offers a new outcome-driven paradigm for question evaluation and generation – one that moves beyond question-answer content toward measurable improvements in learning outcomes.

Which Questions Improve Learning the Most? Utility Estimation of Questions with LM-based Simulations Read Post »

AI, Committee, ข่าว, Uncategorized

McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models

arXiv:2507.02088v2 Announce Type: replace Abstract: As large language models (LLMs) are increasingly applied to various NLP tasks, their inherent biases are gradually disclosed. Therefore, measuring biases in LLMs is crucial to mitigate its ethical risks. However, most existing bias evaluation datasets focus on English and North American culture, and their bias categories are not fully applicable to other cultures. The datasets grounded in the Chinese language and culture are scarce. More importantly, these datasets usually only support single evaluation tasks and cannot evaluate the bias from multiple aspects in LLMs. To address these issues, we present a Multi-task Chinese Bias Evaluation Benchmark (McBE) that includes 4,077 bias evaluation instances, covering 12 single bias categories, 82 subcategories and introducing 5 evaluation tasks, providing extensive category coverage, content diversity, and measuring comprehensiveness. Additionally, we evaluate several popular LLMs from different series and with parameter sizes. In general, all these LLMs demonstrated varying degrees of bias. We conduct an in-depth analysis of results, offering novel insights into bias in LLMs.

McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models Read Post »

AI, Committee, ข่าว, Uncategorized

Attention Basin: Why Contextual Position Matters in Large Language Models

arXiv:2508.05128v1 Announce Type: new Abstract: The performance of Large Language Models (LLMs) is significantly sensitive to the contextual position of information in the input. To investigate the mechanism behind this positional bias, our extensive experiments reveal a consistent phenomenon we term the attention basin: when presented with a sequence of structured items (e.g., retrieved documents or few-shot examples), models systematically assign higher attention to the items at the beginning and end of the sequence, while neglecting those in the middle. Crucially, our analysis further reveals that allocating higher attention to critical information is key to enhancing model performance. Based on these insights, we introduce Attention-Driven Reranking (AttnRank), a two-stage framework that (i) estimates a model’s intrinsic positional attention preferences using a small calibration set, and (ii) reorders retrieved documents or few-shot examples to align the most salient content with these high-attention positions. AttnRank is a model-agnostic, training-free, and plug-and-play method with minimal computational overhead. Experiments on multi-hop QA and few-shot in-context learning tasks demonstrate that AttnRank achieves substantial improvements across 10 large language models of varying architectures and scales, without modifying model parameters or training procedures.

Attention Basin: Why Contextual Position Matters in Large Language Models Read Post »

AI, Committee, ข่าว, Uncategorized

The World According to LLMs: How Geographic Origin Influences LLMs’ Entity Deduction Capabilities

arXiv:2508.05525v1 Announce Type: new Abstract: Large Language Models (LLMs) have been extensively tuned to mitigate explicit biases, yet they often exhibit subtle implicit biases rooted in their pre-training data. Rather than directly probing LLMs with human-crafted questions that may trigger guardrails, we propose studying how models behave when they proactively ask questions themselves. The 20 Questions game, a multi-turn deduction task, serves as an ideal testbed for this purpose. We systematically evaluate geographic performance disparities in entity deduction using a new dataset, Geo20Q+, consisting of both notable people and culturally significant objects (e.g., foods, landmarks, animals) from diverse regions. We test popular LLMs across two gameplay configurations (canonical 20-question and unlimited turns) and in seven languages (English, Hindi, Mandarin, Japanese, French, Spanish, and Turkish). Our results reveal geographic disparities: LLMs are substantially more successful at deducing entities from the Global North than the Global South, and the Global West than the Global East. While Wikipedia pageviews and pre-training corpus frequency correlate mildly with performance, they fail to fully explain these disparities. Notably, the language in which the game is played has minimal impact on performance gaps. These findings demonstrate the value of creative, free-form evaluation frameworks for uncovering subtle biases in LLMs that remain hidden in standard prompting setups. By analyzing how models initiate and pursue reasoning goals over multiple turns, we find geographic and cultural disparities embedded in their reasoning processes. We release the dataset (Geo20Q+) and code at https://sites.google.com/view/llmbias20q/home.

The World According to LLMs: How Geographic Origin Influences LLMs’ Entity Deduction Capabilities Read Post »

AI, Committee, ข่าว, Uncategorized

Data Processing for the OpenGPT-X Model Family

arXiv:2410.08800v4 Announce Type: replace Abstract: This paper presents a comprehensive overview of the data preparation pipeline developed for the OpenGPT-X project, a large-scale initiative aimed at creating open and high-performance multilingual large language models (LLMs). The project goal is to deliver models that cover all major European languages, with a particular focus on real-world applications within the European Union. We explain all data processing steps, starting with the data selection and requirement definition to the preparation of the final filtered data. We distinguish between curated data and web data, as each of these categories is handled by distinct pipelines, with curated data undergoing minimal filtering and web data requiring extensive filtering and deduplication. This distinction guided the development of specialized algorithmic solutions for both pipelines. In addition to describing the processing methodologies, we provide an in-depth analysis of the datasets, increasing transparency and alignment with European data regulations. Finally, we share key insights and challenges faced during the project, offering recommendations for future endeavors in large-scale multilingual data preparation for LLMs.

Data Processing for the OpenGPT-X Model Family Read Post »

AI, Committee, ข่าว, Uncategorized

HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization

arXiv:2508.04010v1 Announce Type: new Abstract: Large language models enable agents to autonomously perform tasks in open web environments. However, as hidden threats within the web evolve, web agents face the challenge of balancing task performance with emerging risks during long-sequence operations. Although this challenge is critical, current research remains limited to single-objective optimization or single-turn scenarios, lacking the capability for collaborative optimization of both safety and utility in web environments. To address this gap, we propose HarmonyGuard, a multi-agent collaborative framework that leverages policy enhancement and objective optimization to jointly improve both utility and safety. HarmonyGuard features a multi-agent architecture characterized by two fundamental capabilities: (1) Adaptive Policy Enhancement: We introduce the Policy Agent within HarmonyGuard, which automatically extracts and maintains structured security policies from unstructured external documents, while continuously updating policies in response to evolving threats. (2) Dual-Objective Optimization: Based on the dual objectives of safety and utility, the Utility Agent integrated within HarmonyGuard performs the Markovian real-time reasoning to evaluate the objectives and utilizes metacognitive capabilities for their optimization. Extensive evaluations on multiple benchmarks show that HarmonyGuard improves policy compliance by up to 38% and task completion by up to 20% over existing baselines, while achieving over 90% policy compliance across all tasks. Our project is available here: https://github.com/YurunChen/HarmonyGuard.

HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization Read Post »

AI, Committee, ข่าว, Uncategorized

Transferring Expert Cognitive Models to Social Robots via Agentic Concept Bottleneck Models

arXiv:2508.03998v1 Announce Type: new Abstract: Successful group meetings, such as those implemented in group behavioral-change programs, work meetings, and other social contexts, must promote individual goal setting and execution while strengthening the social relationships within the group. Consequently, an ideal facilitator must be sensitive to the subtle dynamics of disengagement, difficulties with individual goal setting and execution, and interpersonal difficulties that signal a need for intervention. The challenges and cognitive load experienced by facilitators create a critical gap for an embodied technology that can interpret social exchanges while remaining aware of the needs of the individuals in the group and providing transparent recommendations that go beyond powerful but “black box” foundation models (FMs) that identify social cues. We address this important demand with a social robot co-facilitator that analyzes multimodal meeting data and provides discreet cues to the facilitator. The robot’s reasoning is powered by an agentic concept bottleneck model (CBM), which makes decisions based on human-interpretable concepts like participant engagement and sentiments, ensuring transparency and trustworthiness. Our core contribution is a transfer learning framework that distills the broad social understanding of an FM into our specialized and transparent CBM. This concept-driven system significantly outperforms direct zero-shot FMs in predicting the need for intervention and enables real-time human correction of its reasoning. Critically, we demonstrate robust knowledge transfer: the model generalizes across different groups and successfully transfers the expertise of senior human facilitators to improve the performance of novices. By transferring an expert’s cognitive model into an interpretable robotic partner, our work provides a powerful blueprint for augmenting human capabilities in complex social domains.

Transferring Expert Cognitive Models to Social Robots via Agentic Concept Bottleneck Models Read Post »

AI, Committee, ข่าว, Uncategorized

IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards

arXiv:2508.04632v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction following capabilities of large language models (LLMs), but suffers from training inefficiency due to inadequate difficulty assessment. Moreover, RLVR is prone to over-optimization, where LLMs exploit verification shortcuts without aligning to the actual intent of user instructions. We introduce Instruction Following Decorator (IFDecorator}, a framework that wraps RLVR training into a robust and sample-efficient pipeline. It consists of three components: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, generating progressively more challenging instruction-verification pairs; (2) IntentCheck, a bypass module enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that detects reward hacking via trap instructions, which trigger and capture shortcut exploitation behaviors. Our Qwen2.5-32B-Instruct-IFDecorator achieves 87.43% accuracy on IFEval, outperforming larger proprietary models such as GPT-4o. Additionally, we demonstrate substantial improvements on FollowBench while preserving general capabilities. Our trip wires show significant reductions in reward hacking rates. We will release models, code, and data for future research.

IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards Read Post »

AI, Committee, ข่าว, Uncategorized

Google DeepMind Introduces Genie 3: A General Purpose World Model that can Generate an Unprecedented Diversity of Interactive Environments

Google DeepMind has announced Genie 3, a revolutionary AI system capable of generating interactive, physically consistent virtual worlds from simple text prompts. This marks a substantial leap in the field of world models—a class of AI designed to understand and simulate environments, not merely render them, but produce dynamic spaces you can move through and interact with like a game engine in real-time. Technical Overview World Model Fundamentals: A world model, in this context, refers to a deep neural network trained to generate and simulate visually rich, interactive virtual environments. Genie 3 leverages advances in generative modeling and large-scale multimodal AI to produce entire worlds at 720p resolution and 24 frames per second that are truly navigable and reactive to user input. Natural Language Prompting: With Genie 3, users provide a plain English description (such as “a beach at sunset, with interactive sandcastles”) and the model synthesizes an environment fitting that description. Unlike traditional generative video or image models, Genie 3’s outputs are not just visual—they’re interactive. Users can walk, jump, or even paint within the environment, and those actions persist and remain consistent even as you explore other regions.youtube World Consistency and Memory: A key innovation is “world memory.” Genie 3’s generated environments retain changes introduced by the user. For example, if you alter an object or leave a mark, returning to that area shows the environment unchanged since your last interaction. This temporal and spatial persistence is crucial for use in training AI agents and robots, and for creating immersive, interactive scenarios that feel stable and real. Performance and Capabilities Smooth real-time interaction: Genie 3 runs at 24fps and 720p, allowing seamless navigation through the generated world. Extensible interaction: While not full-featured like established game engines, it supports fundamental inputs (walking, looking, jumping, painting) and can incorporate dynamic events on the fly (like altering weather, adding characters, etc.). High diversity: Genie 3 can render environments ranging from realistic city streets and schools to entirely fantastical realms, all via simple prompts. Longer horizons: Environments are physically consistent for several minutes—significantly longer than previous models, enabling more sustained play and interaction. Impact and Applications Game Design and Prototyping Genie 3 offers tremendous utility as a tool for ideation and rapid prototyping. Designers can test new mechanics, environments, or artistic ideas in seconds, accelerating creative iteration. It opens up the potential for on-the-fly generation of game scenarios that, while rough, could inspire new genres or gameplay experiences. Robotics and Embodied AI World models like Genie 3 are critical for training robots and embodied AI agents, allowing for extensive simulation-based learning before deployment in the real world. The ability to continuously generate interactive, diverse, and physically plausible environments provides virtually unlimited data for agent training and curriculum development. Beyond Gaming: XR, Education, and Simulation The text-to-world paradigm democratizes the creation of immersive XR experiences, letting smaller teams or even individuals generate new simulations rapidly for education, training, or research. It also paves the way for participatory simulations, digital twins, and agent-based decision-making in areas like urban planning, crisis management, and beyond. Genie 3 and the Future In my opinion, Genie 3 does not aim to replace traditional game engines yet, as it lacks their predictability, precision tools, and collaborative workflows. However, it represents a bridge: future pipelines may involve bouncing between neural world models and conventional engines, using each for what they do best—rapid creative synthesis and fine-grained polish, respectively. World models like Genie 3 are a significant milestone toward Artificial General Intelligence (AGI); they enable richer agent simulation, broader transfer learning, and a step closer to AI systems that understand and reason about the world at a foundational level. Genie 3’s emergence signals an exciting new chapter for AI, simulation, game design, and robotics. Its further development and integration could drastically change both how we build digital experiences and how intelligent agents learn, plan, and interact within complex environments. Check out the Technical Blog. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. The post Google DeepMind Introduces Genie 3: A General Purpose World Model that can Generate an Unprecedented Diversity of Interactive Environments appeared first on MarkTechPost.

Google DeepMind Introduces Genie 3: A General Purpose World Model that can Generate an Unprecedented Diversity of Interactive Environments Read Post »

We use cookies to improve your experience and performance on our website. You can learn more at นโยบายความเป็นส่วนตัว and manage your privacy settings by clicking Settings.

ตั้งค่าความเป็นส่วนตัว

You can choose your cookie settings by turning on/off each type of cookie as you wish, except for essential cookies.

ยอมรับทั้งหมด
จัดการความเป็นส่วนตัว
  • เปิดใช้งานตลอด

บันทึกการตั้งค่า
th