YouZum

Committee

AI, Committee, Noticias, Uncategorized

Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning

arXiv:2508.21589v1 Announce Type: new Abstract: Supervised Fine-Tuning (SFT) of Large Language Models (LLMs) fundamentally relies on high-quality training data. While data selection and data synthesis are two common strategies to improve data quality, existing approaches often rely on static dataset curation that fails to adapt to evolving model capabilities. In this paper, we introduce Middo, a self-evolving Model-informed dynamic data optimization framework that uses model-aware data selection and context-preserving data refinement. Unlike conventional one-off filtering/synthesis methods, our framework establishes a closed-loop optimization system: (1) A self-referential diagnostic module proactively identifies suboptimal samples through tri-axial model signals: loss patterns (complexity), embedding cluster dynamics (diversity), and self-alignment scores (quality); (2) An adaptive optimization engine then transforms suboptimal samples into pedagogically valuable training points while preserving semantic integrity; (3) This optimization process continuously evolves with model capability through dynamic learning principles. Experiments on multiple benchmarks demonstrate that our method consistently enhances the quality of seed data and boosts LLM performance, improving accuracy by 7.15% on average while maintaining the original dataset scale. This work establishes a new paradigm for sustainable LLM training through dynamic human-AI co-evolution of data and models. Our datasets, models, and code are coming soon.
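The abstract describes the closed loop only at a high level. As a rough sketch of one optimization round (my own construction, not the authors' code; loss_fn, diversity_fn, quality_fn, and rewrite_fn are hypothetical callables standing in for the loss, embedding-cluster, and self-alignment signals and the refiner), the weakest samples are flagged by combining the three signals and only those are rewritten, so the dataset size stays fixed:

from typing import Callable, Dict, List

# Toy sketch of one Middo-style round; all function names are assumptions.
def optimize_round(
    data: List[Dict[str, str]],
    loss_fn: Callable[[Dict], float],
    diversity_fn: Callable[[Dict], float],
    quality_fn: Callable[[Dict], float],
    rewrite_fn: Callable[[Dict], Dict],
    frac: float = 0.2,
) -> List[Dict[str, str]]:
    # Higher score = more suboptimal: hard (high loss), redundant (low diversity), low quality.
    scores = [loss_fn(s) - diversity_fn(s) - quality_fn(s) for s in data]
    k = max(1, int(frac * len(data)))
    flagged = set(sorted(range(len(data)), key=scores.__getitem__, reverse=True)[:k])
    # Rewrite only the flagged samples so the dataset scale stays unchanged.
    return [rewrite_fn(s) if i in flagged else s for i, s in enumerate(data)]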

Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning Read more »

AI, Committee, Noticias, Uncategorized

Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval

arXiv:2508.21788v1 Announce Type: new Abstract: Large language models (LLMs) rely heavily on web-scale datasets like Common Crawl, which provides over 80% of training data for some modern models. However, the indiscriminate nature of web crawling raises challenges in data quality, safety, and ethics. Despite the critical importance of training data quality, prior research on harmful content has been limited to small samples due to computational constraints. This project presents a framework for indexing and analyzing LLM training datasets using an ElasticSearch-based pipeline. We apply it to SwissAI's FineWeb-2 corpus (1.5TB, four languages), achieving fast query performance: most searches complete in milliseconds, and all in under 2 seconds. Our work demonstrates real-time dataset analysis, offering practical tools for safer, more accountable AI systems.
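As a hedged sketch of the general recipe rather than the authors' pipeline (the index name, document fields, and query below are illustrative assumptions, and the snippet assumes the 8.x Python client), bulk-indexing text records into Elasticsearch and running a phrase search looks roughly like this:

from elasticsearch import Elasticsearch, helpers

# Illustrative only: index name, fields, and query are assumptions, not the paper's schema.
es = Elasticsearch("http://localhost:9200")

docs = [
    {"_index": "fineweb2", "_source": {"lang": "de", "text": "...document text..."}},
    {"_index": "fineweb2", "_source": {"lang": "fr", "text": "...document text..."}},
]
helpers.bulk(es, docs)

resp = es.search(
    index="fineweb2",
    query={"match_phrase": {"text": "example problematic phrase"}},
    size=10,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["lang"], hit["_score"])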

Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval Read more »

AI, Committee, Noticias, Uncategorized

Transforming Wearable Data into Personal Health Insights using Large Language Model Agents

arXiv:2406.06464v3 Announce Type: replace-cross Abstract: Deriving personalized insights from popular wearable trackers requires complex numerical reasoning that challenges standard LLMs, necessitating tool-based approaches like code generation. Large language model (LLM) agents present a promising yet largely untapped solution for this analysis at scale. We introduce the Personal Health Insights Agent (PHIA), a system leveraging multistep reasoning with code generation and information retrieval to analyze and interpret behavioral health data. To test its capabilities, we create and share two benchmark datasets with over 4000 health insights questions. A 650-hour human expert evaluation shows that PHIA significantly outperforms a strong code generation baseline, achieving 84% accuracy on objective, numerical questions and, for open-ended ones, earning 83% favorable ratings while being twice as likely to achieve the highest quality rating. This work can advance behavioral health by empowering individuals to understand their data, enabling a new era of accessible, personalized, and data-driven wellness for the wider population.
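For a flavor of the numerical reasoning involved (an illustrative toy with made-up data and column names, not code generated by PHIA), an analysis step might compare weekday and weekend activity from a step-count log:

import pandas as pd

# Toy wearable log; values and column names are invented for illustration.
steps = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=14, freq="D"),
    "step_count": [4200, 8100, 9500, 3000, 7600, 11000, 12500,
                   5200, 8900, 9900, 4100, 7300, 10400, 11800],
})
steps["day_type"] = steps["date"].dt.dayofweek.map(lambda d: "weekend" if d >= 5 else "weekday")
summary = steps.groupby("day_type")["step_count"].mean().round()
print(summary.to_dict())  # e.g. {'weekday': ..., 'weekend': ...}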

Transforming Wearable Data into Personal Health Insights using Large Language Model Agents Read more »

AI, Committee, Noticias, Uncategorized

Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization

arXiv:2507.05137v2 Announce Type: replace Abstract: Learning Japanese vocabulary is a challenge for learners from Roman alphabet backgrounds due to script differences. Japanese combines syllabaries like hiragana with kanji, which are logographic characters of Chinese origin. Kanji pose an additional challenge because of their visual complexity and sheer number. Keyword mnemonics are a common strategy to aid memorization, often using the compositional structure of kanji to form vivid associations. Despite recent efforts to use large language models (LLMs) to assist learners, existing methods for LLM-based keyword mnemonic generation function as a black box, offering limited interpretability. We propose a generative framework that explicitly models the mnemonic construction process as driven by a set of common rules, and learns these rules using a novel Expectation-Maximization-type algorithm. Trained on learner-authored mnemonics from an online platform, our method learns latent structures and compositional rules, enabling interpretable and systematic mnemonic generation. Experiments show that our method performs well in the cold-start setting for new learners while providing insight into the mechanisms behind effective mnemonic creation.
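The paper's algorithm is not reproduced here, but the flavor of an Expectation-Maximization loop over latent "rules" can be shown with a toy Bernoulli-mixture model (the binary feature encoding, number of rules, and all variable names are my own assumptions, not the paper's model):

import numpy as np

# Toy EM for a mixture of Bernoullis over invented binary mnemonic features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6)).astype(float)  # 200 mnemonics, 6 binary features
K = 3                                                 # assumed number of latent rules

pi = np.full(K, 1.0 / K)                  # rule priors
theta = rng.uniform(0.25, 0.75, (K, 6))   # per-rule feature probabilities

for _ in range(50):
    # E-step: responsibility of each rule for each mnemonic.
    lik = np.prod(theta[None] ** X[:, None] * (1 - theta[None]) ** (1 - X[:, None]), axis=2)
    resp = lik * pi
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: update priors and per-rule feature probabilities.
    pi = resp.mean(axis=0)
    theta = (resp.T @ X) / resp.sum(axis=0)[:, None]

print(pi.round(2))  # learned mixing weights over the latent rules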

Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization Read more »

AI, Committee, Noticias, Uncategorized

Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach

arXiv:2508.21206v1 Announce Type: new Abstract: Autoregressive language models are vulnerable to orthographic attacks, where input text is perturbed with characters from multilingual alphabets, leading to substantial performance degradation. This vulnerability primarily stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. To address this limitation, we propose a pixel-based generative language model that replaces text-based embeddings with pixel-based representations by rendering words as individual images. This design provides stronger robustness to noisy inputs while extending compatibility to multilingual text across diverse writing systems. We evaluate the proposed method on the multilingual LAMBADA dataset, the WMT24 dataset, and the SST-2 benchmark, demonstrating both its resilience to orthographic noise and its effectiveness in multilingual settings.

Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach Read more »

AI, Committee, Noticias, Uncategorized

A Coding Guide to Building a Brain-Inspired Hierarchical Reasoning AI Agent with Hugging Face Models

In this tutorial, we set out to recreate the spirit of the Hierarchical Reasoning Model (HRM) using a free Hugging Face model that runs locally. We walk through the design of a lightweight yet structured reasoning agent, where we act as both architects and experimenters. By breaking problems into subgoals, solving them with Python, critiquing the outcomes, and synthesizing a final answer, we can experience how hierarchical planning and execution can enhance reasoning performance. This process enables us to see, in real time, how a brain-inspired workflow can be implemented without requiring massive model sizes or expensive APIs. Check out the Paper and FULL CODES.

!pip -q install -U transformers accelerate bitsandbytes rich

import os, re, json, textwrap, traceback
from typing import Dict, Any, List
from rich import print as rprint
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
DTYPE = torch.bfloat16 if torch.cuda.is_available() else torch.float32

We begin by installing the required libraries and loading the Qwen2.5-1.5B-Instruct model from Hugging Face. We set the data type based on GPU availability to ensure efficient model execution in Colab.

tok = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    torch_dtype=DTYPE,
    load_in_4bit=True
)
gen = pipeline(
    "text-generation",
    model=model,
    tokenizer=tok,
    return_full_text=False
)

We load the tokenizer and model, configure it to run in 4-bit for efficiency, and wrap everything in a text-generation pipeline so we can interact with the model easily in Colab.

def chat(prompt: str, system: str = "", max_new_tokens: int = 512, temperature: float = 0.3) -> str:
    msgs = []
    if system:
        msgs.append({"role": "system", "content": system})
    msgs.append({"role": "user", "content": prompt})
    inputs = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    out = gen(inputs, max_new_tokens=max_new_tokens, do_sample=(temperature > 0),
              temperature=temperature, top_p=0.9)
    return out[0]["generated_text"].strip()

def extract_json(txt: str) -> Dict[str, Any]:
    m = re.search(r"\{[\s\S]*\}$", txt.strip())
    if not m:
        m = re.search(r"\{[\s\S]*?\}", txt)
    try:
        return json.loads(m.group(0)) if m else {}
    except Exception:
        # fallback: strip code fences
        s = re.sub(r"^```.*?\n|\n```$", "", txt, flags=re.S)
        try:
            return json.loads(s)
        except Exception:
            return {}

We define helper functions: the chat function allows us to send prompts to the model with optional system instructions and sampling controls, while extract_json helps us parse structured JSON outputs from the model reliably, even if the response includes code fences or additional text.

def extract_code(txt: str) -> str:
    m = re.search(r"```(?:python)?\s*([\s\S]*?)```", txt, flags=re.I)
    return (m.group(1) if m else txt).strip()

def run_python(code: str, env: Dict[str, Any] | None = None) -> Dict[str, Any]:
    import io, contextlib
    g = {"__name__": "__main__"}
    l = {}
    if env:
        g.update(env)
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, g, l)
        out = l.get("RESULT", g.get("RESULT"))
        return {"ok": True, "result": out, "stdout": buf.getvalue()}
    except Exception as e:
        return {"ok": False, "error": str(e), "trace": traceback.format_exc(), "stdout": buf.getvalue()}

PLANNER_SYS = """You are the HRM Planner.
Decompose the TASK into 2–4 atomic, code-solvable subgoals.
Return compact JSON only: {"subgoals":[...], "final_format":"<one-line answer format>"}."""

SOLVER_SYS = """You are the HRM Solver.
Given SUBGOAL and CONTEXT vars, output a single Python snippet.
Rules:
- Compute deterministically.
- Set a variable RESULT to the answer.
- Keep code short; stdlib only.
Return only a Python code block."""

CRITIC_SYS = """You are the HRM Critic.
Given TASK and LOGS (subgoal results), decide if final answer is ready.
Return JSON only: {"action":"submit"|"revise","critique":"...","fix_hint":"<if revise>"}."""

SYNTH_SYS = """You are the HRM Synthesizer.
Given TASK, LOGS, and final_format, output only the final answer (no steps).
Follow final_format exactly."""

We add two important pieces: utility functions and system prompts. The extract_code function pulls Python snippets from the model's output, while run_python safely executes those snippets and captures their results. Alongside, we define four role prompts, Planner, Solver, Critic, and Synthesizer, which guide the model to break tasks into subgoals, solve them with code, verify correctness, and finally produce a clean answer.

def plan(task: str) -> Dict[str, Any]:
    p = f"TASK:\n{task}\nReturn JSON only."
    return extract_json(chat(p, PLANNER_SYS, temperature=0.2, max_new_tokens=300))

def solve_subgoal(subgoal: str, context: Dict[str, Any]) -> Dict[str, Any]:
    prompt = f"SUBGOAL:\n{subgoal}\nCONTEXT vars: {list(context.keys())}\nReturn Python code only."
    code = extract_code(chat(prompt, SOLVER_SYS, temperature=0.2, max_new_tokens=400))
    res = run_python(code, env=context)
    return {"subgoal": subgoal, "code": code, "run": res}

def critic(task: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
    pl = [{"subgoal": L["subgoal"], "result": L["run"].get("result"), "ok": L["run"]["ok"]} for L in logs]
    out = chat("TASK:\n" + task + "\nLOGS:\n" + json.dumps(pl, ensure_ascii=False, indent=2) +
               "\nReturn JSON only.", CRITIC_SYS, temperature=0.1, max_new_tokens=250)
    return extract_json(out)

def refine(task: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
    sys = "Refine subgoals minimally to fix issues. Return same JSON schema as planner."
    out = chat("TASK:\n" + task + "\nLOGS:\n" + json.dumps(logs, ensure_ascii=False) +
               "\nReturn JSON only.", sys, temperature=0.2, max_new_tokens=250)
    j = extract_json(out)
    return j if j.get("subgoals") else {}

def synthesize(task: str, logs: List[Dict[str, Any]], final_format: str) -> str:
    packed = [{"subgoal": L["subgoal"], "result": L["run"].get("result")} for L in logs]
    return chat("TASK:\n" + task + "\nLOGS:\n" + json.dumps(packed, ensure_ascii=False) +
                f"\nfinal_format: {final_format}\nOnly the final answer.",
                SYNTH_SYS, temperature=0.0, max_new_tokens=120).strip()

def hrm_agent(task: str, context: Dict[str, Any] | None = None, budget: int = 2) -> Dict[str, Any]:
    ctx = dict(context or {})
    trace, plan_json = [], plan(task)
    for round_id in range(1, budget + 1):
        logs = [solve_subgoal(sg, ctx) for sg in plan_json.get("subgoals", [])]
        for L in logs:
            ctx_key = f"g{len(trace)}_{abs(hash(L['subgoal'])) % 9999}"
            ctx[ctx_key] = L["run"].get("result")
        verdict = critic(task, logs)
        trace.append({"round": round_id, "plan": plan_json, "logs": logs, "verdict": verdict})
        if verdict.get("action") == "submit":
            break
        plan_json = refine(task, logs) or plan_json
    final = synthesize(task, trace[-1]["logs"], plan_json.get("final_format", "Answer: <value>"))
    return {"final": final, "trace": trace}

We implement the full HRM loop: we plan subgoals, solve each by generating and running Python (capturing RESULT), then we critique, optionally refine the plan, and synthesize a clean final answer. We orchestrate these rounds in hrm_agent, carrying forward intermediate results as context so we iteratively improve and stop once the critic says "submit."

ARC_TASK = textwrap.dedent("""
Infer the transformation rule from train examples and apply to test.
Return exactly: "Answer: <grid>", where <grid> is a Python list of lists of ints.
""").strip()

ARC_DATA = {
    "train": [
        {"inp": [[0,0],[1,0]], "out": [[1,1],[0,1]]},
        {"inp": [[0,1],[0,0]], "out": [[1,0],[1,1]]}
    ],
    "test": [[0,0],[0,1]]
}

res1 = hrm_agent(ARC_TASK, context={"TRAIN": ARC_DATA["train"], "TEST": ARC_DATA["test"]}, budget=2)
rprint("\n[bold]Demo 1 —
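The demo's final print call is truncated above. As an assumption-labeled sketch rather than the tutorial's original output code, one way to inspect the returned dictionary is:

# Assumed continuation (illustrative, not the original): show the agent's answer and rounds used.
rprint(res1["final"])
rprint(f"rounds used: {len(res1['trace'])}")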

A Coding Guide to Building a Brain-Inspired Hierarchical Reasoning AI Agent with Hugging Face Models Read more »

AI, Committee, Noticias, Uncategorized

Chunking vs. Tokenization: Key Differences in AI Text Processing

Introduction

When you're working with AI and natural language processing, you'll quickly encounter two fundamental concepts that often get confused: tokenization and chunking. While both involve breaking down text into smaller pieces, they serve completely different purposes and work at different scales. If you're building AI applications, understanding these differences isn't just academic; it's crucial for creating systems that actually work well.

Think of it this way: if you're making a sandwich, tokenization is like cutting your ingredients into bite-sized pieces, while chunking is like organizing those pieces into logical groups that make sense to eat together. Both are necessary, but they solve different problems.

Source: marktechpost.com

What is Tokenization?

Tokenization is the process of breaking text into the smallest meaningful units that AI models can understand. These units, called tokens, are the basic building blocks that language models work with. You can think of tokens as the "words" in an AI's vocabulary, though they're often smaller than actual words.

There are several ways to create tokens:

Word-level tokenization splits text at spaces and punctuation. It's straightforward but creates problems with rare words that the model has never seen before.

Subword tokenization is more sophisticated and widely used today. Methods like Byte Pair Encoding (BPE), WordPiece, and SentencePiece break words into smaller chunks based on how frequently character combinations appear in training data. This approach handles new or rare words much better.

Character-level tokenization treats each letter as a token. It's simple but creates very long sequences that are harder for models to process efficiently.

Here's a practical example:

Original text: "AI models process text efficiently."
Word tokens: ["AI", "models", "process", "text", "efficiently"]
Subword tokens: ["AI", "model", "s", "process", "text", "efficient", "ly"]

Notice how subword tokenization splits "models" into "model" and "s" because this pattern appears frequently in training data. This helps the model understand related words like "modeling" or "modeled" even if it hasn't seen them before.

What is Chunking?

Chunking takes a completely different approach. Instead of breaking text into tiny pieces, it groups text into larger, coherent segments that preserve meaning and context. When you're building applications like chatbots or search systems, you need these larger chunks to maintain the flow of ideas.

Think about reading a research paper. You wouldn't want each sentence scattered randomly; you'd want related sentences grouped together so the ideas make sense. That's exactly what chunking does for AI systems.

Here's how it works in practice:

Original text: "AI models process text efficiently. They rely on tokens to capture meaning and context. Chunking allows better retrieval."
Chunk 1: "AI models process text efficiently."
Chunk 2: "They rely on tokens to capture meaning and context."
Chunk 3: "Chunking allows better retrieval."

Modern chunking strategies have become quite sophisticated:

Fixed-length chunking creates chunks of a specific size (like 500 words or 1,000 characters). It's predictable but sometimes breaks up related ideas awkwardly.

Semantic chunking is smarter: it looks for natural breakpoints where topics change, using AI to understand when ideas shift from one concept to another.

Recursive chunking works hierarchically, first trying to split at paragraph breaks, then sentences, then smaller units if needed.

Sliding window chunking creates overlapping chunks to ensure important context isn't lost at boundaries.

The Key Differences That Matter

Understanding when to use each approach makes all the difference in your AI applications:

Size: tokenization produces tiny pieces (words, parts of words); chunking produces bigger pieces (sentences, paragraphs).
Goal: tokenization makes text digestible for AI models; chunking keeps meaning intact for humans and AI.
When you use it: tokenization for training models and processing input; chunking for search systems and question answering.
What you optimize for: tokenization for processing speed and vocabulary size; chunking for context preservation and retrieval accuracy.

Why This Matters for Real Applications

For AI Model Performance

When you're working with language models, tokenization directly affects how much you pay and how fast your system runs. Models like GPT-4 charge by the token, so efficient tokenization saves money. Current models have different limits:

GPT-4: around 128,000 tokens
Claude 3.5: up to 200,000 tokens
Gemini 2.0 Pro: up to 2 million tokens

Recent research shows that larger models actually work better with bigger vocabularies. For example, while LLaMA-2 70B uses about 32,000 different tokens, it would probably perform better with around 216,000. This matters because the right vocabulary size affects both performance and efficiency.

For Search and Question-Answering Systems

Chunking strategy can make or break your RAG (Retrieval-Augmented Generation) system. If your chunks are too small, you lose context. Too big, and you overwhelm the model with irrelevant information. Get it right, and your system provides accurate, helpful answers. Get it wrong, and you get hallucinations and poor results. Companies building enterprise AI systems have found that smart chunking strategies significantly reduce those frustrating cases where AI makes up facts or gives nonsensical answers.

Where You'll Use Each Approach

Tokenization is Essential For:

Training new models: you can't train a language model without first tokenizing your training data. The tokenization strategy affects everything about how well the model learns.
Fine-tuning existing models: when you adapt a pre-trained model for your specific domain (like medical or legal text), you need to carefully consider whether the existing tokenization works for your specialized vocabulary.
Cross-language applications: subword tokenization is particularly helpful when working with languages that have complex word structures or when building multilingual systems.

Chunking is Critical For:

Building company knowledge bases: when you want employees to ask questions and get accurate answers from your internal documents, proper chunking ensures the AI retrieves relevant, complete information.
Document analysis at scale: whether you're processing legal contracts, research papers, or customer feedback, chunking helps maintain document structure and meaning.
Search systems: modern search goes beyond keyword matching. Semantic chunking helps systems understand what users really want and retrieve the most relevant information.

Current Best Practices (What Actually Works)
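To make the tokenization versus chunking contrast above concrete, here is a small sketch (the gpt2 tokenizer, window size, and stride are arbitrary illustrative choices, not recommendations): the same short passage is split into subword tokens for a model, then grouped into overlapping sentence windows for retrieval.

from transformers import AutoTokenizer

text = ("AI models process text efficiently. They rely on tokens to capture "
        "meaning and context. Chunking allows better retrieval.")

# Tokenization: tiny subword units for the model (gpt2 is just an example tokenizer).
tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize(text)[:8])  # first few subword pieces

# Chunking: overlapping windows of whole sentences for retrieval.
sentences = [s.strip().rstrip(".") + "." for s in text.split(".") if s.strip()]
window, stride = 2, 1
chunks = [" ".join(sentences[i:i + window]) for i in range(0, len(sentences), stride)]
print(chunks)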

Chunking vs. Tokenization: Key Differences in AI Text Processing Read more »

AI, Committee, Noticias, Uncategorized

Accenture Research Introduce MCP-Bench: A Large-Scale Benchmark that Evaluates LLM Agents in Complex Real-World Tasks via MCP Servers

Modern large language models (LLMs) have moved far beyond simple text generation. Many of the most promising real-world applications now require these models to use external tools, such as APIs, databases, and software libraries, to solve complex tasks. But how do we truly know if an AI agent can plan, reason, and coordinate across tools the way a human assistant would? This is the question MCP-Bench sets out to answer.

The Problem with Existing Benchmarks

Most previous benchmarks for tool-using LLMs focused on one-off API calls or narrow, artificially stitched workflows. Even the more advanced evaluations rarely tested how well agents could discover and chain the right tools from fuzzy, real-world instructions, let alone whether they could coordinate across multiple domains and ground their answers in actual evidence. In practice, this means that many models perform well on artificial tasks but struggle with the complexity and ambiguity of real-world scenarios.

https://arxiv.org/abs/2508.20453

What Makes MCP-Bench Different

A team of researchers from Accenture introduces MCP-Bench, a Model Context Protocol (MCP) based benchmark for LLM agents that directly connects them to 28 real-world servers, each offering a set of tools across various domains, such as finance, scientific computing, healthcare, travel, and academic research. In total, the benchmark covers 250 tools, arranged so that realistic workflows require both sequential and parallel tool use, sometimes across multiple servers.

https://arxiv.org/abs/2508.20453

Key features:

Authentic tasks: Tasks are designed to reflect real user needs, such as planning a multi-stop camping trip (involving geospatial, weather, and park information), conducting biomedical research, or converting units in scientific calculations.
Fuzzy instructions: Rather than specifying tools or steps, tasks are described in natural, sometimes vague language, requiring the agent to infer what to do, much like a human assistant would.
Tool diversity: The benchmark includes everything from medical calculators and scientific computing libraries to financial analytics, icon collections, and even niche tools like I Ching divination services.
Quality control: Tasks are automatically generated, then filtered for solvability and real-world relevance. Each task also comes in two forms: a precise technical description (used for evaluation) and a conversational, fuzzy version (what the agent sees).
Multi-layered evaluation: Both automated metrics (like "did the agent use the correct tool and provide the right parameters?") and LLM-based judges (to assess planning, grounding, and reasoning) are used.

How Agents Are Tested

An agent running MCP-Bench receives a task (e.g., "Plan a camping trip to Yosemite with detailed logistics and weather forecasts") and must decide, step by step, which tools to call, in what order, and how to use their outputs. These workflows can span multiple rounds of interaction, with the agent synthesizing results into a coherent, evidence-backed answer. Each agent is evaluated on several dimensions, including:

Tool selection: Did it choose the right tools for each part of the task?
Parameter accuracy: Did it provide complete and correct inputs to each tool?
Planning and coordination: Did it handle dependencies and parallel steps properly?
Evidence grounding: Does its final answer directly reference the outputs from tools, avoiding unsupported claims?

What the Results Show

The researchers tested 20 state-of-the-art LLMs across 104 tasks. The main findings:

Basic tool use is solid: Most models could correctly call tools and handle parameter schemas, even for complex or domain-specific tools.
Planning is still hard: Even the best models struggled with long, multi-step workflows that required not just selecting tools, but also understanding when to move to the next step, which parts can run in parallel, and how to handle unexpected results.
Smaller models fall behind: As tasks became more complex, especially those spanning multiple servers, smaller models were more likely to make mistakes, repeat steps, or miss subtasks.
Efficiency varies widely: Some models needed many more tool calls and rounds of interaction to achieve the same results, suggesting inefficiencies in planning and execution.
Humans are still needed for nuance: While the benchmark is automated, human checks ensure tasks are realistic and solvable, a reminder that truly robust evaluation still benefits from human expertise.

https://arxiv.org/abs/2508.20453

Why This Research Matters

MCP-Bench provides a practical way to assess how well AI agents can act as "digital assistants" in real-world settings: situations where users aren't always precise and the right answer depends on weaving together information from many sources. The benchmark exposes gaps in current LLM capabilities, especially around complex planning, cross-domain reasoning, and evidence-based synthesis, areas crucial for deploying AI agents in business, research, and specialized fields.

Summary

MCP-Bench is a serious, large-scale test for AI agents using real tools and real tasks, with no shortcuts or artificial setups. It shows what current models do well and where they still fall short. For anyone building or evaluating AI assistants, these results, and the benchmark itself, are likely to be a useful reality check.

Check out the Paper and GitHub Page.
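As a rough, hypothetical illustration of the automated part of the evaluation dimensions described above (the record format and field names are my assumptions, not MCP-Bench's actual schema or metric definitions), per-task tool-selection and parameter accuracy could be computed from logged agent calls against a reference trajectory:

from typing import Dict, List

# Hypothetical record format; MCP-Bench's real schema and metrics differ in detail.
def score_task(agent_calls: List[Dict], reference_calls: List[Dict]) -> Dict[str, float]:
    ref_tools = {c["tool"] for c in reference_calls}
    used_tools = {c["tool"] for c in agent_calls}
    tool_selection = len(ref_tools & used_tools) / len(ref_tools) if ref_tools else 1.0

    # Parameter accuracy: fraction of matched calls whose arguments equal the reference.
    ref_by_tool = {c["tool"]: c["args"] for c in reference_calls}
    matched = [c for c in agent_calls if c["tool"] in ref_by_tool]
    param_hits = sum(c["args"] == ref_by_tool[c["tool"]] for c in matched)
    parameter_accuracy = param_hits / len(matched) if matched else 0.0

    return {"tool_selection": tool_selection, "parameter_accuracy": parameter_accuracy}

print(score_task(
    agent_calls=[{"tool": "weather.forecast", "args": {"park": "Yosemite"}}],
    reference_calls=[{"tool": "weather.forecast", "args": {"park": "Yosemite"}},
                     {"tool": "parks.lookup", "args": {"name": "Yosemite"}}],
))
# {'tool_selection': 0.5, 'parameter_accuracy': 1.0}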

Accenture Research Introduce MCP-Bench: A Large-Scale Benchmark that Evaluates LLM Agents in Complex Real-World Tasks via MCP Servers Read more »

AI, Committee, Noticias, Uncategorized

Microsoft AI Introduces rStar2-Agent: A 14B Math Reasoning Model Trained with Agentic Reinforcement Learning to Achieve Frontier-Level Performance

The Problem with "Thinking Longer"

Large language models have made impressive strides in mathematical reasoning by extending their Chain-of-Thought (CoT) processes, essentially "thinking longer" through more detailed reasoning steps. However, this approach has fundamental limitations. When models encounter subtle errors in their reasoning chains, they often compound these mistakes rather than detecting and correcting them. Internal self-reflection frequently fails, especially when the initial reasoning approach is fundamentally flawed. Microsoft's new research report introduces rStar2-Agent, which takes a different approach: instead of just thinking longer, it teaches models to think smarter by actively using coding tools to verify, explore, and refine their reasoning process.

https://arxiv.org/abs/2508.20722

The Agentic Approach

rStar2-Agent represents a shift toward agentic reinforcement learning, where a 14B-parameter model interacts with a Python execution environment throughout its reasoning process. Rather than relying solely on internal reflection, the model can write code, execute it, analyze the results, and adjust its approach based on concrete feedback.

This creates a dynamic problem-solving process. When the model encounters a complex mathematical problem, it might generate initial reasoning, write Python code to test hypotheses, analyze execution results, and iterate toward a solution. The approach mirrors how human mathematicians often work, using computational tools to verify intuitions and explore different solution paths.

Infrastructure Challenges and Solutions

Scaling agentic RL presents significant technical hurdles. During training, a single batch can generate tens of thousands of concurrent code execution requests, creating bottlenecks that can stall GPU utilization. The researchers addressed this with two key infrastructure innovations.

First, they built a distributed code execution service capable of handling 45,000 concurrent tool calls with sub-second latency. The system isolates code execution from the main training process while maintaining high throughput through careful load balancing across CPU workers.

Second, they developed a dynamic rollout scheduler that allocates computational work based on real-time GPU cache availability rather than static assignment. This prevents GPU idle time caused by uneven workload distribution, a common problem when some reasoning traces require significantly more computation than others.

These infrastructure improvements enabled the entire training process to complete in just one week using 64 AMD MI300X GPUs, demonstrating that frontier-level reasoning capabilities don't require massive computational resources when efficiently orchestrated.

GRPO-RoC: Learning from High-Quality Examples

The core algorithmic innovation is Group Relative Policy Optimization with Resampling on Correct (GRPO-RoC). Traditional reinforcement learning in this context faces a quality problem: models receive positive rewards for correct final answers even when their reasoning process includes multiple code errors or inefficient tool usage. GRPO-RoC addresses this by implementing an asymmetric sampling strategy. During training, the algorithm:

Oversamples initial rollouts to create a larger pool of reasoning traces
Preserves diversity in failed attempts to maintain learning from various error modes
Filters positive examples to emphasize traces with minimal tool errors and cleaner formatting

This approach ensures the model learns from high-quality successful reasoning while still being exposed to diverse failure patterns. The result is more efficient tool usage and shorter, more focused reasoning traces (a toy sketch of this selection step appears at the end of this article).

https://arxiv.org/abs/2508.20722

Training Strategy: From Simple to Complex

The training process unfolds in three carefully designed stages, starting with non-reasoning supervised fine-tuning that focuses purely on instruction following and tool formatting, deliberately avoiding complex reasoning examples that might create early biases.

Stage 1 constrains responses to 8,000 tokens, forcing the model to develop concise reasoning strategies. Despite this limitation, performance jumps dramatically, from near zero to over 70% on challenging benchmarks.

Stage 2 extends the token limit to 12,000, allowing for more complex reasoning while maintaining the efficiency gains from the first stage.

Stage 3 shifts focus to the most difficult problems by filtering out those the model has already mastered, ensuring continued learning from challenging cases.

This progression from concise to extended reasoning, combined with increasing problem difficulty, maximizes learning efficiency while minimizing computational overhead.

Breakthrough Results

The results are striking. rStar2-Agent-14B achieves 80.6% accuracy on AIME24 and 69.8% on AIME25, surpassing much larger models, including the 671B-parameter DeepSeek-R1. Perhaps more importantly, it accomplishes this with significantly shorter reasoning traces, averaging around 10,000 tokens compared to over 17,000 for comparable models.

The efficiency gains extend beyond mathematics. Despite training exclusively on math problems, the model demonstrates strong transfer learning, outperforming specialized models on scientific reasoning benchmarks and maintaining competitive performance on general alignment tasks.

https://arxiv.org/abs/2508.20722

Understanding the Mechanisms

Analysis of the trained model reveals fascinating behavioral patterns. High-entropy tokens in reasoning traces fall into two categories: traditional "forking tokens" that trigger self-reflection and exploration, and a new category of "reflection tokens" that emerge specifically in response to tool feedback. These reflection tokens represent a form of environment-driven reasoning where the model carefully analyzes code execution results, diagnoses errors, and adjusts its approach accordingly. This creates more sophisticated problem-solving behavior than pure CoT reasoning can achieve.

Summary

rStar2-Agent demonstrates that moderate-sized models can achieve frontier-level reasoning through sophisticated training rather than brute-force scaling. The approach suggests a more sustainable path toward advanced AI capabilities, one that emphasizes efficiency, tool integration, and smart training strategies over raw computational power. The success of this agentic approach also points toward future AI systems that can seamlessly integrate multiple tools and environments, moving beyond static text generation toward dynamic, interactive problem-solving capabilities.

Check out the Paper and GitHub Page.
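A toy sketch of the resampling-on-correct idea described above (my own simplification with invented fields and ratios, not Microsoft's implementation): oversample rollouts, keep failed traces as they are for diversity, and prefer the cleanest successful traces.

import random
from typing import Dict, List

# Toy simplification of GRPO-RoC's selection step; fields and the 50/50 split are assumptions.
def resample_on_correct(rollouts: List[Dict], keep: int) -> List[Dict]:
    positives = [r for r in rollouts if r["correct"]]
    negatives = [r for r in rollouts if not r["correct"]]
    # Prefer correct traces with the fewest tool errors (cleaner reasoning).
    positives.sort(key=lambda r: r["tool_errors"])
    n_pos = min(len(positives), keep // 2)
    return positives[:n_pos] + random.sample(negatives, min(len(negatives), keep - n_pos))

rollouts = [{"correct": random.random() < 0.4, "tool_errors": random.randint(0, 3)} for _ in range(32)]
batch = resample_on_correct(rollouts, keep=8)
print(sum(r["correct"] for r in batch), "correct traces kept out of", len(batch))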

Microsoft AI Introduces rStar2-Agent: A 14B Math Reasoning Model Trained with Agentic Reinforcement Learning to Achieve Frontier-Level Performance Read more »
