YouZum

News

AI, Committee, News, Uncategorized

Data Processing for the OpenGPT-X Model Family

arXiv:2410.08800v4 Announce Type: replace Abstract: This paper presents a comprehensive overview of the data preparation pipeline developed for the OpenGPT-X project, a large-scale initiative aimed at creating open and high-performance multilingual large language models (LLMs). The project goal is to deliver models that cover all major European languages, with a particular focus on real-world applications within the European Union. We explain all data processing steps, starting with the data selection and requirement definition to the preparation of the final filtered data. We distinguish between curated data and web data, as each of these categories is handled by distinct pipelines, with curated data undergoing minimal filtering and web data requiring extensive filtering and deduplication. This distinction guided the development of specialized algorithmic solutions for both pipelines. In addition to describing the processing methodologies, we provide an in-depth analysis of the datasets, increasing transparency and alignment with European data regulations. Finally, we share key insights and challenges faced during the project, offering recommendations for future endeavors in large-scale multilingual data preparation for LLMs.

HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization

arXiv:2508.04010v1 Announce Type: new Abstract: Large language models enable agents to autonomously perform tasks in open web environments. However, as hidden threats within the web evolve, web agents face the challenge of balancing task performance with emerging risks during long-sequence operations. Although this challenge is critical, current research remains limited to single-objective optimization or single-turn scenarios, lacking the capability for collaborative optimization of both safety and utility in web environments. To address this gap, we propose HarmonyGuard, a multi-agent collaborative framework that leverages policy enhancement and objective optimization to jointly improve both utility and safety. HarmonyGuard features a multi-agent architecture characterized by two fundamental capabilities: (1) Adaptive Policy Enhancement: We introduce the Policy Agent within HarmonyGuard, which automatically extracts and maintains structured security policies from unstructured external documents, while continuously updating policies in response to evolving threats. (2) Dual-Objective Optimization: Based on the dual objectives of safety and utility, the Utility Agent integrated within HarmonyGuard performs the Markovian real-time reasoning to evaluate the objectives and utilizes metacognitive capabilities for their optimization. Extensive evaluations on multiple benchmarks show that HarmonyGuard improves policy compliance by up to 38% and task completion by up to 20% over existing baselines, while achieving over 90% policy compliance across all tasks. Our project is available here: https://github.com/YurunChen/HarmonyGuard.

Transferring Expert Cognitive Models to Social Robots via Agentic Concept Bottleneck Models

arXiv:2508.03998v1 Announce Type: new Abstract: Successful group meetings, such as those implemented in group behavioral-change programs, work meetings, and other social contexts, must promote individual goal setting and execution while strengthening the social relationships within the group. Consequently, an ideal facilitator must be sensitive to the subtle dynamics of disengagement, difficulties with individual goal setting and execution, and interpersonal difficulties that signal a need for intervention. The challenges and cognitive load experienced by facilitators create a critical gap for an embodied technology that can interpret social exchanges while remaining aware of the needs of the individuals in the group and providing transparent recommendations that go beyond powerful but “black box” foundation models (FMs) that identify social cues. We address this important demand with a social robot co-facilitator that analyzes multimodal meeting data and provides discreet cues to the facilitator. The robot’s reasoning is powered by an agentic concept bottleneck model (CBM), which makes decisions based on human-interpretable concepts like participant engagement and sentiments, ensuring transparency and trustworthiness. Our core contribution is a transfer learning framework that distills the broad social understanding of an FM into our specialized and transparent CBM. This concept-driven system significantly outperforms direct zero-shot FMs in predicting the need for intervention and enables real-time human correction of its reasoning. Critically, we demonstrate robust knowledge transfer: the model generalizes across different groups and successfully transfers the expertise of senior human facilitators to improve the performance of novices. By transferring an expert’s cognitive model into an interpretable robotic partner, our work provides a powerful blueprint for augmenting human capabilities in complex social domains.

IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards

arXiv:2508.04632v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction following capabilities of large language models (LLMs), but suffers from training inefficiency due to inadequate difficulty assessment. Moreover, RLVR is prone to over-optimization, where LLMs exploit verification shortcuts without aligning to the actual intent of user instructions. We introduce Instruction Following Decorator (IFDecorator), a framework that wraps RLVR training into a robust and sample-efficient pipeline. It consists of three components: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, generating progressively more challenging instruction-verification pairs; (2) IntentCheck, a bypass module enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that detects reward hacking via trap instructions, which trigger and capture shortcut exploitation behaviors. Our Qwen2.5-32B-Instruct-IFDecorator achieves 87.43% accuracy on IFEval, outperforming larger proprietary models such as GPT-4o. Additionally, we demonstrate substantial improvements on FollowBench while preserving general capabilities. Our trip wires show significant reductions in reward hacking rates. We will release models, code, and data for future research.

MoE Architecture Comparison: Qwen3 30B-A3B vs. GPT-OSS 20B

This article provides a technical comparison between two recently released Mixture-of-Experts (MoE) transformer models: Alibaba’s Qwen3 30B-A3B (released April 2025) and OpenAI’s GPT-OSS 20B (released August 2025). Both models represent distinct approaches to MoE architecture design, balancing computational efficiency with performance across different deployment scenarios.

Model Overview

| Feature | Qwen3 30B-A3B | GPT-OSS 20B |
|---|---|---|
| Total Parameters | 30.5B | 21B |
| Active Parameters | 3.3B | 3.6B |
| Number of Layers | 48 | 24 |
| MoE Experts | 128 (8 active) | 32 (4 active) |
| Attention Architecture | Grouped Query Attention | Grouped Multi-Query Attention |
| Query/Key-Value Heads | 32Q / 4KV | 64Q / 8KV |
| Context Window | 32,768 (ext. 262,144) | 128,000 |
| Vocabulary Size | 151,936 | o200k_harmony (~200k) |
| Quantization | Standard precision | Native MXFP4 |
| Release Date | April 2025 | August 2025 |

Sources: Qwen3 Official Documentation, OpenAI GPT-OSS Documentation

Qwen3 30B-A3B Technical Specifications

Architecture Details

Qwen3 30B-A3B employs a deep transformer architecture with 48 layers, each containing a Mixture-of-Experts configuration with 128 experts per layer. The model activates 8 experts per token during inference, achieving a balance between specialization and computational efficiency.

Attention Mechanism

The model utilizes Grouped Query Attention (GQA) with 32 query heads and 4 key-value heads³. This design optimizes memory usage while maintaining attention quality, particularly beneficial for long-context processing.

Context and Multilingual Support

• Native context length: 32,768 tokens
• Extended context: Up to 262,144 tokens (latest variants)
• Multilingual support: 119 languages and dialects
• Vocabulary: 151,936 tokens using BPE tokenization

Unique Features

Qwen3 incorporates a hybrid reasoning system supporting both “thinking” and “non-thinking” modes, allowing users to control computational overhead based on task complexity.
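To make the memory benefit of grouped-query attention concrete, here is a rough back-of-the-envelope sketch. The layer count, head counts, and context length are the figures quoted above for Qwen3 30B-A3B; the head dimension of 128 and the 2-byte (bfloat16) cache entries are assumptions not stated in the article, and the function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV-cache size for one sequence across all layers."""
    # The leading factor 2 covers storing both a key and a value vector per token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


HEAD_DIM = 128     # assumption: not given in the article
SEQ_LEN = 32_768   # native context length quoted for Qwen3 30B-A3B

# Grouped-query attention: only 4 key-value heads are cached per layer.
gqa = kv_cache_bytes(num_layers=48, num_kv_heads=4, head_dim=HEAD_DIM, seq_len=SEQ_LEN)
# Hypothetical baseline: one key-value head per query head (32 in total).
mha = kv_cache_bytes(num_layers=48, num_kv_heads=32, head_dim=HEAD_DIM, seq_len=SEQ_LEN)

print(f"GQA, 4 KV heads:        {gqa / 2**30:.1f} GiB")
print(f"Baseline, 32 KV heads:  {mha / 2**30:.1f} GiB")
print(f"Reduction factor:       {mha // gqa}x")
```

Under these assumptions, sharing each key-value head across eight query heads shrinks the cache by the same factor of eight, which is why the saving matters most at long context lengths.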
GPT-OSS 20B Technical Specifications

Architecture Details

GPT-OSS 20B features a 24-layer transformer with 32 MoE experts per layer⁸. The model activates 4 experts per token, emphasizing wider expert capacity over fine-grained specialization.

Attention Mechanism

The model implements Grouped Multi-Query Attention with 64 query heads and 8 key-value heads arranged in groups of 8¹⁰. This configuration supports efficient inference while maintaining attention quality across the wider architecture.

Context and Optimization

• Native context length: 128,000 tokens
• Quantization: Native MXFP4 (4.25-bit precision) for MoE weights
• Memory efficiency: Runs on 16GB memory with quantization
• Tokenizer: o200k_harmony (superset of GPT-4o tokenizer)

Performance Characteristics

GPT-OSS 20B uses alternating dense and locally banded sparse attention patterns similar to GPT-3, with Rotary Positional Embedding (RoPE) for positional encoding¹⁵.

Architectural Philosophy Comparison

Depth vs. Width Strategy

Qwen3 30B-A3B emphasizes depth and expert diversity:

• 48 layers enable multi-stage reasoning and hierarchical abstraction
• 128 experts per layer provide fine-grained specialization
• Suitable for complex reasoning tasks requiring deep processing

GPT-OSS 20B prioritizes width and computational density:

• 24 layers with larger experts maximize per-layer representational capacity
• Fewer but more powerful experts (32 vs 128) increase individual expert capability
• Optimized for efficient single-pass inference

MoE Routing Strategies

• Qwen3: Routes tokens through 8 of 128 experts, encouraging diverse, context-sensitive processing paths and modular decision-making.
• GPT-OSS: Routes tokens through 4 of 32 experts, maximizing per-expert computational power and delivering concentrated processing per inference step.
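The two routing configurations above can be illustrated with a generic top-k router. This is a minimal sketch of the general technique, not either model's actual implementation: the function name `top_k_route` is hypothetical, the router logits are random placeholders, and normalizing with a softmax over only the selected experts is an assumption (real models may normalize differently).

```python
import math
import random


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


rng = random.Random(0)  # fixed seed so the placeholder logits are reproducible
for name, num_experts, k in [("Qwen3 30B-A3B", 128, 8), ("GPT-OSS 20B", 32, 4)]:
    logits = [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
    routed = top_k_route(logits, k)
    share = 100 * k / num_experts
    print(f"{name}: {len(routed)} of {num_experts} experts per token "
          f"({share:.2f}% of the experts in a layer)")
```

With the expert counts from the article, a token touches 6.25% of the experts in a Qwen3 layer and 12.5% in a GPT-OSS layer, which is one way to read the "many small experts" versus "fewer, larger experts" contrast.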
Memory and Deployment Considerations

Qwen3 30B-A3B

• Memory requirements: Variable based on precision and context length
• Deployment: Optimized for cloud and edge deployment with flexible context extension
• Quantization: Supports various quantization schemes post-training

GPT-OSS 20B

• Memory requirements: 16GB with native MXFP4 quantization, ~48GB in bfloat16
• Deployment: Designed for consumer hardware compatibility
• Quantization: Native MXFP4 training enables efficient inference without quality degradation

Performance Characteristics

Qwen3 30B-A3B

• Excels in mathematical reasoning, coding, and complex logical tasks
• Strong performance in multilingual scenarios across 119 languages
• Thinking mode provides enhanced reasoning capabilities for complex problems

GPT-OSS 20B

• Achieves performance comparable to OpenAI o3-mini on standard benchmarks
• Optimized for tool use, web browsing, and function calling
• Strong chain-of-thought reasoning with adjustable reasoning effort levels

Use Case Recommendations

Choose Qwen3 30B-A3B for:

• Complex reasoning tasks requiring multi-stage processing
• Multilingual applications across diverse languages
• Scenarios requiring flexible context length extension
• Applications where thinking/reasoning transparency is valued

Choose GPT-OSS 20B for:

• Resource-constrained deployments requiring efficiency
• Tool-calling and agentic applications
• Rapid inference with consistent performance
• Edge deployment scenarios with limited memory

Conclusion

Qwen3 30B-A3B and GPT-OSS 20B represent complementary approaches to MoE architecture design. Qwen3 emphasizes depth, expert diversity, and multilingual capability, making it suitable for complex reasoning applications. GPT-OSS 20B prioritizes efficiency, tool integration, and deployment flexibility, positioning it for practical production environments with resource constraints.
Both models demonstrate the evolution of MoE architectures beyond simple parameter scaling, incorporating sophisticated design choices that align architectural decisions with intended use cases and deployment scenarios.

Note: This article is inspired by the Reddit post and diagram shared by Sebastian Raschka.

Sources

1. Qwen3 30B-A3B Model Card – Hugging Face
2. Qwen3 Technical Blog
3. Qwen3 30B-A3B Base Specifications
4. Qwen3 30B-A3B Instruct 2507
5. Qwen3 Official Documentation
6. Qwen Tokenizer Documentation
7. Qwen3 Model Features
8. OpenAI GPT-OSS Introduction
9. GPT-OSS GitHub Repository
10. GPT-OSS 20B – Groq Documentation
11. OpenAI GPT-OSS Technical Details
12. Hugging Face GPT-OSS Blog
13. OpenAI GPT-OSS 20B Model Card
14. OpenAI GPT-OSS Introduction
15. NVIDIA GPT-OSS Technical Blog
16. Hugging Face GPT-OSS Blog
17. Qwen3 Performance Analysis
18. OpenAI GPT-OSS Model Card
19. GPT-OSS 20B Capabilities

The post MoE Architecture Comparison: Qwen3 30B-A3B vs. GPT-OSS 20B appeared first on MarkTechPost.

Google DeepMind Introduces Genie 3: A General Purpose World Model that can Generate an Unprecedented Diversity of Interactive Environments

Google DeepMind has announced Genie 3, a revolutionary AI system capable of generating interactive, physically consistent virtual worlds from simple text prompts. This marks a substantial leap in the field of world models, a class of AI designed to understand and simulate environments: rather than merely rendering them, such models produce dynamic spaces you can move through and interact with in real time, much like a game engine.

Technical Overview

World Model Fundamentals: A world model, in this context, refers to a deep neural network trained to generate and simulate visually rich, interactive virtual environments. Genie 3 leverages advances in generative modeling and large-scale multimodal AI to produce entire worlds at 720p resolution and 24 frames per second that are truly navigable and reactive to user input.

Natural Language Prompting: With Genie 3, users provide a plain English description (such as “a beach at sunset, with interactive sandcastles”) and the model synthesizes an environment fitting that description. Unlike traditional generative video or image models, Genie 3’s outputs are not just visual; they are interactive. Users can walk, jump, or even paint within the environment, and those actions persist and remain consistent even as you explore other regions.

World Consistency and Memory: A key innovation is “world memory.” Genie 3’s generated environments retain changes introduced by the user. For example, if you alter an object or leave a mark, returning to that area shows the environment unchanged since your last interaction. This temporal and spatial persistence is crucial for use in training AI agents and robots, and for creating immersive, interactive scenarios that feel stable and real.

Performance and Capabilities

• Smooth real-time interaction: Genie 3 runs at 24fps and 720p, allowing seamless navigation through the generated world.
• Extensible interaction: While not full-featured like established game engines, it supports fundamental inputs (walking, looking, jumping, painting) and can incorporate dynamic events on the fly (like altering weather, adding characters, etc.).
• High diversity: Genie 3 can render environments ranging from realistic city streets and schools to entirely fantastical realms, all via simple prompts.
• Longer horizons: Environments are physically consistent for several minutes, significantly longer than previous models, enabling more sustained play and interaction.

Impact and Applications

Game Design and Prototyping

Genie 3 offers tremendous utility as a tool for ideation and rapid prototyping. Designers can test new mechanics, environments, or artistic ideas in seconds, accelerating creative iteration. It opens up the potential for on-the-fly generation of game scenarios that, while rough, could inspire new genres or gameplay experiences.

Robotics and Embodied AI

World models like Genie 3 are critical for training robots and embodied AI agents, allowing for extensive simulation-based learning before deployment in the real world. The ability to continuously generate interactive, diverse, and physically plausible environments provides virtually unlimited data for agent training and curriculum development.

Beyond Gaming: XR, Education, and Simulation

The text-to-world paradigm democratizes the creation of immersive XR experiences, letting smaller teams or even individuals generate new simulations rapidly for education, training, or research. It also paves the way for participatory simulations, digital twins, and agent-based decision-making in areas like urban planning, crisis management, and beyond.

Genie 3 and the Future

In my opinion, Genie 3 does not aim to replace traditional game engines yet, as it lacks their predictability, precision tools, and collaborative workflows. However, it represents a bridge: future pipelines may involve bouncing between neural world models and conventional engines, using each for what it does best: rapid creative synthesis and fine-grained polish, respectively.

World models like Genie 3 are a significant milestone toward Artificial General Intelligence (AGI); they enable richer agent simulation, broader transfer learning, and a step closer to AI systems that understand and reason about the world at a foundational level.

Genie 3’s emergence signals an exciting new chapter for AI, simulation, game design, and robotics. Its further development and integration could drastically change both how we build digital experiences and how intelligent agents learn, plan, and interact within complex environments.

The post Google DeepMind Introduces Genie 3: A General Purpose World Model that can Generate an Unprecedented Diversity of Interactive Environments appeared first on MarkTechPost.

PyLate: Flexible Training and Retrieval for Late Interaction Models

arXiv:2508.03555v1 Announce Type: cross Abstract: Neural ranking has become a cornerstone of modern information retrieval. While single vector search remains the dominant paradigm, it suffers from the shortcoming of compressing all the information into a single vector. This compression leads to notable performance degradation in out-of-domain, long-context, and reasoning-intensive retrieval tasks. Multi-vector approaches pioneered by ColBERT aim to address these limitations by preserving individual token embeddings and computing similarity via the MaxSim operator. This architecture has demonstrated superior empirical advantages, including enhanced out-of-domain generalization, long-context handling, and performance in complex retrieval scenarios. Despite these compelling empirical results and clear theoretical advantages, the practical adoption and public availability of late interaction models remain low compared to their single-vector counterparts, primarily due to a lack of accessible and modular tools for training and experimenting with such models. To bridge this gap, we introduce PyLate, a streamlined library built on top of Sentence Transformers to support multi-vector architectures natively, inheriting its efficient training, advanced logging, and automated model card generation while requiring minimal code changes to code templates users are already familiar with. By offering multi-vector-specific features such as efficient indexes, PyLate aims to accelerate research and real-world application of late interaction models, thereby unlocking their full potential in modern IR systems. Finally, PyLate has already enabled the development of state-of-the-art models, including GTE-ModernColBERT and Reason-ModernColBERT, demonstrating its practical utility for both research and production environments.
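The MaxSim operator mentioned in the abstract can be sketched in a few lines of plain Python. This is a generic illustration of late-interaction scoring, not PyLate's API: the function names and the tiny two-dimensional embeddings are made up for the example, and real models use learned, typically normalized, token embeddings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_embeddings, doc_embeddings):
    """Late-interaction score: for each query token, take its best-matching
    document token (maximum dot product), then sum over query tokens."""
    return sum(max(dot(q, d) for d in doc_embeddings) for q in query_embeddings)


# Toy token embeddings (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # covers both query tokens
doc_b = [[1.0, 0.0], [0.7, 0.1]]              # covers only the first one

print("doc_a:", maxsim_score(query, doc_a))
print("doc_b:", maxsim_score(query, doc_b))
```

Because every query token keeps its own embedding, a document is rewarded only when it contains a good match for each of them, which is the property a single pooled vector cannot express.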

Evaluating LLMs on Real-World Forecasting Against Expert Forecasters

arXiv:2507.04562v3 Announce Type: replace-cross Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their ability to forecast future events remains understudied. A year ago, large language models struggled to come close to the accuracy of a human crowd. I evaluate state-of-the-art LLMs on 464 forecasting questions from Metaculus, comparing their performance against top forecasters. Frontier models achieve Brier scores that ostensibly surpass the human crowd but still significantly underperform a group of experts.
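For readers unfamiliar with the metric, the Brier score for binary questions is the mean squared difference between the forecast probability and the outcome (1 if the event happened, 0 otherwise), so lower is better. The sketch below is a minimal illustration; the function name is hypothetical and the forecasts are invented for the example, not taken from the paper.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("forecasts and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Invented example: three questions, of which only the first resolved "yes".
outcomes = [1, 0, 0]
confident = [0.9, 0.2, 0.6]   # mostly right, but overconfident on the third
uninformed = [0.5, 0.5, 0.5]  # always answers 50%

print(f"confident forecaster:  {brier_score(confident, outcomes):.4f}")
print(f"uninformed forecaster: {brier_score(uninformed, outcomes):.4f}")
```

An always-50% forecaster scores 0.25 on any set of binary questions, which makes that value a convenient reference point when reading reported scores.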

Pre-trained Transformer-Based Approach for Arabic Question Answering : A Comparative Study

arXiv:2111.05671v2 Announce Type: replace Abstract: Question answering (QA) is one of the most challenging yet widely investigated problems in Natural Language Processing (NLP). Question-answering (QA) systems try to produce answers for given questions. These answers can be generated from unstructured or structured text. Hence, QA is considered an important research area that can be used in evaluating text understanding systems. A large volume of QA studies was devoted to the English language, investigating the most advanced techniques and achieving state-of-the-art results. However, research in Arabic question answering progresses at a considerably slower pace due to the scarcity of research efforts in Arabic QA and the lack of large benchmark datasets. Recently, many pre-trained language models have provided high performance in many Arabic NLP problems. In this work, we evaluate state-of-the-art pre-trained transformer models for Arabic QA using four reading comprehension datasets: Arabic-SQuAD, ARCD, AQAD, and TyDiQA-GoldP. We fine-tuned and compared the performance of the AraBERTv2-base model, AraBERTv0.2-large model, and AraELECTRA model. Finally, we provide an analysis to understand and interpret the low-performance results obtained by some models.

CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

arXiv:2508.03686v1 Announce Type: new Abstract: Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers, but also serves as the reward model to guide LLM optimization. Most evaluation frameworks rely on regularized matching or employ general LLMs for answer verification, which demands extensive, repetitive customization for regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability across different domains. In this work, we develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types, including multi-subproblems, formulas, and sequence answers, while effectively identifying abnormal/invalid responses. We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier. We anticipate that CompassVerifier and VerifierBench will facilitate answer verification, evaluation protocols, and reinforcement learning research. Code and dataset are available at https://github.com/open-compass/CompassVerifier.
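As an illustration of the rule-based matching the abstract contrasts with a learned verifier, the sketch below extracts a final answer with a regular expression and compares it to the reference as a string. It is a hypothetical baseline written for this example, not code from CompassVerifier; the pattern and the function name are assumptions.

```python
import re

# Hypothetical rule: the answer follows "final answer is" or "final answer:".
ANSWER_PATTERN = re.compile(r"final answer\s*(?:is|:)\s*(.+?)\s*\.?\s*$", re.IGNORECASE)


def rule_based_verify(model_output, reference):
    """Return True only if the extracted answer string equals the reference."""
    match = ANSWER_PATTERN.search(model_output.strip())
    if match is None:
        return False  # the output did not follow the expected phrasing
    return match.group(1).strip().lower() == reference.strip().lower()


print(rule_based_verify("The final answer is 42.", "42"))   # phrasing and value match
print(rule_based_verify("Final answer: 0.5", "1/2"))        # equivalent value, different form
print(rule_based_verify("So x = 42", "42"))                 # correct value, unexpected phrasing
```

Only the first call is accepted. The other two are rejected even though the answers are correct, which is the kind of brittleness that forces repeated per-task customization of such rules.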
