YouZum



A Coding Guide to Building a Brain-Inspired Hierarchical Reasoning AI Agent with Hugging Face Models

In this tutorial, we set out to recreate the spirit of the Hierarchical Reasoning Model (HRM) using a free Hugging Face model that runs locally. We walk through the design of a lightweight yet structured reasoning agent, where we act as both architects and experimenters. By breaking problems into subgoals, solving them with Python, critiquing the outcomes, and synthesizing a final answer, we can experience how hierarchical planning and execution can enhance reasoning performance. This process lets us see, in real time, how a brain-inspired workflow can be implemented without massive model sizes or expensive APIs. Check out the Paper and FULL CODES.

```python
!pip -q install -U transformers accelerate bitsandbytes rich

import os, re, json, textwrap, traceback
from typing import Dict, Any, List
from rich import print as rprint
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
DTYPE = torch.bfloat16 if torch.cuda.is_available() else torch.float32
```

We begin by installing the required libraries and loading the Qwen2.5-1.5B-Instruct model from Hugging Face. We set the data type based on GPU availability to ensure efficient model execution in Colab.

```python
tok = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    torch_dtype=DTYPE,
    load_in_4bit=True,
)

gen = pipeline(
    "text-generation",
    model=model,
    tokenizer=tok,
    return_full_text=False,
)
```

We load the tokenizer and model, configure the model to run in 4-bit for efficiency, and wrap everything in a text-generation pipeline so we can interact with the model easily in Colab.

````python
def chat(prompt: str, system: str = "", max_new_tokens: int = 512, temperature: float = 0.3) -> str:
    msgs = []
    if system:
        msgs.append({"role": "system", "content": system})
    msgs.append({"role": "user", "content": prompt})
    inputs = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    out = gen(inputs, max_new_tokens=max_new_tokens, do_sample=(temperature > 0),
              temperature=temperature, top_p=0.9)
    return out[0]["generated_text"].strip()


def extract_json(txt: str) -> Dict[str, Any]:
    m = re.search(r"\{[\s\S]*\}$", txt.strip())
    if not m:
        m = re.search(r"\{[\s\S]*?\}", txt)
    try:
        return json.loads(m.group(0)) if m else {}
    except Exception:
        # fallback: strip code fences and try again
        s = re.sub(r"^```.*?\n|\n```$", "", txt, flags=re.S)
        try:
            return json.loads(s)
        except Exception:
            return {}
````

We define helper functions: the chat function lets us send prompts to the model with optional system instructions and sampling controls, while extract_json parses structured JSON outputs from the model reliably, even if the response includes code fences or additional text.

````python
def extract_code(txt: str) -> str:
    m = re.search(r"```(?:python)?\s*([\s\S]*?)```", txt, flags=re.I)
    return (m.group(1) if m else txt).strip()


def run_python(code: str, env: Dict[str, Any] | None = None) -> Dict[str, Any]:
    import io, contextlib
    g = {"__name__": "__main__"}
    l = {}
    if env:
        g.update(env)
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, g, l)
        out = l.get("RESULT", g.get("RESULT"))
        return {"ok": True, "result": out, "stdout": buf.getvalue()}
    except Exception as e:
        return {"ok": False, "error": str(e), "trace": traceback.format_exc(), "stdout": buf.getvalue()}


PLANNER_SYS = """You are the HRM Planner.
Decompose the TASK into 2-4 atomic, code-solvable subgoals.
Return compact JSON only: {"subgoals":[...], "final_format":"<one-line answer format>"}."""

SOLVER_SYS = """You are the HRM Solver.
Given SUBGOAL and CONTEXT vars, output a single Python snippet.
Rules:
- Compute deterministically.
- Set a variable RESULT to the answer.
- Keep code short; stdlib only.
Return only a Python code block."""

CRITIC_SYS = """You are the HRM Critic.
Given TASK and LOGS (subgoal results), decide if final answer is ready.
Return JSON only: {"action":"submit"|"revise","critique":"...","fix_hint":"<if revise>"}."""

SYNTH_SYS = """You are the HRM Synthesizer.
Given TASK, LOGS, and final_format, output only the final answer (no steps).
Follow final_format exactly."""
````

We add two important pieces: utility functions and system prompts. The extract_code function pulls Python snippets from the model's output, while run_python executes those snippets and captures their results or errors. Alongside, we define four role prompts (Planner, Solver, Critic, and Synthesizer) that guide the model to break tasks into subgoals, solve them with code, verify correctness, and finally produce a clean answer.

```python
def plan(task: str) -> Dict[str, Any]:
    p = f"TASK:\n{task}\nReturn JSON only."
    return extract_json(chat(p, PLANNER_SYS, temperature=0.2, max_new_tokens=300))


def solve_subgoal(subgoal: str, context: Dict[str, Any]) -> Dict[str, Any]:
    prompt = f"SUBGOAL:\n{subgoal}\nCONTEXT vars: {list(context.keys())}\nReturn Python code only."
    code = extract_code(chat(prompt, SOLVER_SYS, temperature=0.2, max_new_tokens=400))
    res = run_python(code, env=context)
    return {"subgoal": subgoal, "code": code, "run": res}


def critic(task: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
    pl = [{"subgoal": L["subgoal"], "result": L["run"].get("result"), "ok": L["run"]["ok"]} for L in logs]
    out = chat("TASK:\n" + task + "\nLOGS:\n" + json.dumps(pl, ensure_ascii=False, indent=2) +
               "\nReturn JSON only.",
               CRITIC_SYS, temperature=0.1, max_new_tokens=250)
    return extract_json(out)


def refine(task: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
    sys = "Refine subgoals minimally to fix issues. Return same JSON schema as planner."
    out = chat("TASK:\n" + task + "\nLOGS:\n" + json.dumps(logs, ensure_ascii=False) +
               "\nReturn JSON only.",
               sys, temperature=0.2, max_new_tokens=250)
    j = extract_json(out)
    return j if j.get("subgoals") else {}


def synthesize(task: str, logs: List[Dict[str, Any]], final_format: str) -> str:
    packed = [{"subgoal": L["subgoal"], "result": L["run"].get("result")} for L in logs]
    return chat("TASK:\n" + task + "\nLOGS:\n" + json.dumps(packed, ensure_ascii=False) +
                f"\nfinal_format: {final_format}\nOnly the final answer.",
                SYNTH_SYS, temperature=0.0, max_new_tokens=120).strip()


def hrm_agent(task: str, context: Dict[str, Any] | None = None, budget: int = 2) -> Dict[str, Any]:
    ctx = dict(context or {})
    trace, plan_json = [], plan(task)
    for round_id in range(1, budget + 1):
        logs = [solve_subgoal(sg, ctx) for sg in plan_json.get("subgoals", [])]
        for L in logs:
            ctx_key = f"g{len(trace)}_{abs(hash(L['subgoal'])) % 9999}"
            ctx[ctx_key] = L["run"].get("result")
        verdict = critic(task, logs)
        trace.append({"round": round_id, "plan": plan_json, "logs": logs, "verdict": verdict})
        if verdict.get("action") == "submit":
            break
        plan_json = refine(task, logs) or plan_json
    final = synthesize(task, trace[-1]["logs"], plan_json.get("final_format", "Answer: <value>"))
    return {"final": final, "trace": trace}
```

We implement the full HRM loop: we plan subgoals, solve each by generating and running Python (capturing RESULT), then critique, optionally refine the plan, and synthesize a clean final answer. We orchestrate these rounds in hrm_agent, carrying forward intermediate results as context so we iteratively improve and stop once the critic says "submit."

```python
ARC_TASK = textwrap.dedent("""
Infer the transformation rule from train examples and apply to test.
Return exactly: "Answer: <grid>", where <grid> is a Python list of lists of ints.
""").strip()

ARC_DATA = {
    "train": [
        {"inp": [[0, 0], [1, 0]], "out": [[1, 1], [0, 1]]},
        {"inp": [[0, 1], [0, 0]], "out": [[1, 0], [1, 1]]},
    ],
    "test": [[0, 0], [0, 1]],
}

res1 = hrm_agent(ARC_TASK, context={"TRAIN": ARC_DATA["train"], "TEST": ARC_DATA["test"]}, budget=2)
rprint("\n[bold]Demo 1 —
```
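For a quick look at what the agent actually did, here is a small, hedged usage sketch; it assumes only the res1 dictionary returned by hrm_agent above (a "final" string plus a per-round "trace" list) and prints the final answer alongside each round's critic verdict.

```python
# A minimal inspection sketch, assuming `res1` from the demo above and the
# {"final": ..., "trace": [...]} structure returned by hrm_agent.
rprint("[bold]Final:[/bold]", res1["final"])

for rnd in res1["trace"]:
    verdict = rnd.get("verdict", {})
    rprint(f"Round {rnd['round']}: action={verdict.get('action')!r}, "
           f"subgoals={len(rnd['plan'].get('subgoals', []))}")
    for log in rnd["logs"]:
        status = "ok" if log["run"]["ok"] else "error"
        rprint(f"  - {log['subgoal'][:60]} [{status}]")
```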



Chunking vs. Tokenization: Key Differences in AI Text Processing

Introduction

When you're working with AI and natural language processing, you'll quickly encounter two fundamental concepts that often get confused: tokenization and chunking. While both involve breaking down text into smaller pieces, they serve completely different purposes and work at different scales. If you're building AI applications, understanding these differences isn't just academic; it's crucial for creating systems that actually work well.

Think of it this way: if you're making a sandwich, tokenization is like cutting your ingredients into bite-sized pieces, while chunking is like organizing those pieces into logical groups that make sense to eat together. Both are necessary, but they solve different problems.

Source: marktechpost.com

What is Tokenization?

Tokenization is the process of breaking text into the smallest meaningful units that AI models can understand. These units, called tokens, are the basic building blocks that language models work with. You can think of tokens as the "words" in an AI's vocabulary, though they're often smaller than actual words.

There are several ways to create tokens:

Word-level tokenization splits text at spaces and punctuation. It's straightforward but creates problems with rare words that the model has never seen before.

Subword tokenization is more sophisticated and widely used today. Methods like Byte Pair Encoding (BPE), WordPiece, and SentencePiece break words into smaller chunks based on how frequently character combinations appear in training data. This approach handles new or rare words much better.

Character-level tokenization treats each letter as a token. It's simple but creates very long sequences that are harder for models to process efficiently.

Here's a practical example:

Original text: "AI models process text efficiently."
Word tokens: ["AI", "models", "process", "text", "efficiently"]
Subword tokens: ["AI", "model", "s", "process", "text", "efficient", "ly"]

Notice how subword tokenization splits "models" into "model" and "s" because this pattern appears frequently in training data. This helps the model understand related words like "modeling" or "modeled" even if it hasn't seen them before.
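To see subword tokenization in action, here is a brief, hedged sketch using the Hugging Face transformers library; "bert-base-uncased" is just an assumed example model, and the exact splits will differ from the hand-written tokens above depending on which tokenizer you load.

```python
# A minimal subword-tokenization sketch; "bert-base-uncased" (a WordPiece
# tokenizer) is an illustrative choice, not a recommendation.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "AI models process text efficiently."
tokens = tokenizer.tokenize(text)   # subword strings
ids = tokenizer.encode(text)        # integer IDs, including special tokens

print(tokens)
print(ids)
print(f"{len(ids)} token IDs including special tokens")
```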
What is Chunking?

Chunking takes a completely different approach. Instead of breaking text into tiny pieces, it groups text into larger, coherent segments that preserve meaning and context. When you're building applications like chatbots or search systems, you need these larger chunks to maintain the flow of ideas.

Think about reading a research paper. You wouldn't want each sentence scattered randomly; you'd want related sentences grouped together so the ideas make sense. That's exactly what chunking does for AI systems.

Here's how it works in practice:

Original text: "AI models process text efficiently. They rely on tokens to capture meaning and context. Chunking allows better retrieval."
Chunk 1: "AI models process text efficiently."
Chunk 2: "They rely on tokens to capture meaning and context."
Chunk 3: "Chunking allows better retrieval."

Modern chunking strategies have become quite sophisticated:

Fixed-length chunking creates chunks of a specific size (like 500 words or 1000 characters). It's predictable but sometimes breaks up related ideas awkwardly.

Semantic chunking is smarter: it looks for natural breakpoints where topics change, using AI to understand when ideas shift from one concept to another.

Recursive chunking works hierarchically, first trying to split at paragraph breaks, then sentences, then smaller units if needed.

Sliding window chunking creates overlapping chunks to ensure important context isn't lost at boundaries.
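Here is a rough, hedged sketch of the simplest of these strategies, fixed-length chunking with a sliding-window overlap; the chunk_text helper and its parameters are illustrative rather than a standard API, and production systems typically split on sentence or semantic boundaries instead of raw character counts.

```python
# A minimal fixed-length chunking sketch with overlap; character-based for
# simplicity, whereas real pipelines usually count tokens or sentences.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap   # slide the window forward
    return chunks

doc = ("AI models process text efficiently. They rely on tokens to capture "
       "meaning and context. Chunking allows better retrieval.")
for i, chunk in enumerate(chunk_text(doc, chunk_size=60, overlap=15)):
    print(i, repr(chunk))
```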

The Key Differences That Matter

Understanding when to use each approach makes all the difference in your AI applications:

| What You're Doing | Tokenization | Chunking |
| --- | --- | --- |
| Size | Tiny pieces (words, parts of words) | Bigger pieces (sentences, paragraphs) |
| Goal | Make text digestible for AI models | Keep meaning intact for humans and AI |
| When You Use It | Training models, processing input | Search systems, question answering |
| What You Optimize For | Processing speed, vocabulary size | Context preservation, retrieval accuracy |

Why This Matters for Real Applications

For AI model performance: tokenization directly affects how much you pay and how fast your system runs. Models like GPT-4 charge by the token, so efficient tokenization saves money. Current models have different limits:
- GPT-4: around 128,000 tokens
- Claude 3.5: up to 200,000 tokens
- Gemini 2.0 Pro: up to 2 million tokens

Recent research shows that larger models actually work better with bigger vocabularies. For example, while LLaMA-2 70B uses about 32,000 different tokens, it would probably perform better with around 216,000. This matters because the right vocabulary size affects both performance and efficiency.

For search and question-answering systems: chunking strategy can make or break your RAG (Retrieval-Augmented Generation) system. If your chunks are too small, you lose context; too big, and you overwhelm the model with irrelevant information. Get it right, and your system provides accurate, helpful answers; get it wrong, and you get hallucinations and poor results. Companies building enterprise AI systems have found that smart chunking strategies significantly reduce the frustrating cases where AI makes up facts or gives nonsensical answers.

Where You'll Use Each Approach

Tokenization is essential for:
- Training new models: you can't train a language model without first tokenizing your training data, and the tokenization strategy affects everything about how well the model learns.
- Fine-tuning existing models: when you adapt a pre-trained model for a specific domain (like medical or legal text), you need to carefully consider whether the existing tokenization works for your specialized vocabulary.
- Cross-language applications: subword tokenization is particularly helpful for languages with complex word structures and for multilingual systems.

Chunking is critical for:
- Building company knowledge bases: when you want employees to ask questions and get accurate answers from your internal documents, proper chunking ensures the AI retrieves relevant, complete information.
- Document analysis at scale: whether you're processing legal contracts, research papers, or customer feedback, chunking helps maintain document structure and meaning.
- Search systems: modern search goes beyond keyword matching, and semantic chunking helps systems understand what users really want and retrieve the most relevant information.

Current Best Practices (What Actually Works)


Accenture Research Introduces MCP-Bench: A Large-Scale Benchmark that Evaluates LLM Agents in Complex Real-World Tasks via MCP Servers

Modern large language models (LLMs) have moved far beyond simple text generation. Many of the most promising real-world applications now require these models to use external tools, such as APIs, databases, and software libraries, to solve complex tasks. But how do we truly know whether an AI agent can plan, reason, and coordinate across tools the way a human assistant would? This is the question MCP-Bench sets out to answer.

The Problem with Existing Benchmarks

Most previous benchmarks for tool-using LLMs focused on one-off API calls or narrow, artificially stitched workflows. Even the more advanced evaluations rarely tested how well agents could discover and chain the right tools from fuzzy, real-world instructions, let alone whether they could coordinate across multiple domains and ground their answers in actual evidence. In practice, this means that many models perform well on artificial tasks but struggle with the complexity and ambiguity of real-world scenarios.

https://arxiv.org/abs/2508.20453

What Makes MCP-Bench Different

A team of researchers from Accenture introduces MCP-Bench, a Model Context Protocol (MCP) based benchmark for LLM agents that directly connects them to 28 real-world servers, each offering a set of tools across various domains, such as finance, scientific computing, healthcare, travel, and academic research. In total, the benchmark covers 250 tools, arranged so that realistic workflows require both sequential and parallel tool use, sometimes across multiple servers.

https://arxiv.org/abs/2508.20453

Key features:
- Authentic tasks: tasks are designed to reflect real user needs, such as planning a multi-stop camping trip (involving geospatial, weather, and park information), conducting biomedical research, or converting units in scientific calculations.
- Fuzzy instructions: rather than specifying tools or steps, tasks are described in natural, sometimes vague language, requiring the agent to infer what to do, much like a human assistant would.
- Tool diversity: the benchmark includes everything from medical calculators and scientific computing libraries to financial analytics, icon collections, and even niche tools like I Ching divination services.
- Quality control: tasks are automatically generated, then filtered for solvability and real-world relevance. Each task also comes in two forms: a precise technical description (used for evaluation) and a conversational, fuzzy version (what the agent sees).
- Multi-layered evaluation: both automated metrics (for example, "did the agent use the correct tool and provide the right parameters?") and LLM-based judges (to assess planning, grounding, and reasoning) are used.

How Agents Are Tested

An agent running MCP-Bench receives a task (e.g., "Plan a camping trip to Yosemite with detailed logistics and weather forecasts") and must decide, step by step, which tools to call, in what order, and how to use their outputs. These workflows can span multiple rounds of interaction, with the agent synthesizing results into a coherent, evidence-backed answer. Each agent is evaluated on several dimensions, including:
- Tool selection: did it choose the right tools for each part of the task?
- Parameter accuracy: did it provide complete and correct inputs to each tool?
- Planning and coordination: did it handle dependencies and parallel steps properly?
- Evidence grounding: does its final answer directly reference the outputs from tools, avoiding unsupported claims?
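To make the task format and scoring dimensions concrete, here is a purely illustrative sketch; the field names and values are assumptions for exposition, not the actual MCP-Bench schema.

```python
# Hypothetical structures shaped like the setup described above; not the
# real MCP-Bench data format.
from dataclasses import dataclass

@dataclass
class BenchTask:
    fuzzy_instruction: str      # conversational version shown to the agent
    precise_description: str    # technical spec used for evaluation
    servers: list[str]          # MCP servers the workflow is expected to span
    reference_tools: list[str]  # tools a correct solution would likely call

@dataclass
class AgentScore:
    tool_selection: float       # right tools for each subtask?
    parameter_accuracy: float   # complete, correct tool inputs?
    planning: float             # dependencies and parallel steps handled?
    evidence_grounding: float   # final answer backed by tool outputs?

task = BenchTask(
    fuzzy_instruction="Plan a camping trip to Yosemite with detailed logistics and weather forecasts.",
    precise_description="Retrieve park data and a multi-day forecast, then assemble an itinerary.",
    servers=["national-parks", "weather"],
    reference_tools=["find_park", "get_forecast"],
)
print(task.servers, AgentScore(1.0, 0.8, 0.7, 0.9))
```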
What the Results Show

The researchers tested 20 state-of-the-art LLMs across 104 tasks. The main findings:
- Basic tool use is solid: most models could correctly call tools and handle parameter schemas, even for complex or domain-specific tools.
- Planning is still hard: even the best models struggled with long, multi-step workflows that required not just selecting tools, but also understanding when to move to the next step, which parts can run in parallel, and how to handle unexpected results.
- Smaller models fall behind: as tasks became more complex, especially those spanning multiple servers, smaller models were more likely to make mistakes, repeat steps, or miss subtasks.
- Efficiency varies widely: some models needed many more tool calls and rounds of interaction to achieve the same results, suggesting inefficiencies in planning and execution.
- Humans are still needed for nuance: while the benchmark is automated, human checks ensure tasks are realistic and solvable, a reminder that truly robust evaluation still benefits from human expertise.

https://arxiv.org/abs/2508.20453

Why This Research Matters

MCP-Bench provides a practical way to assess how well AI agents can act as "digital assistants" in real-world settings, where users aren't always precise and the right answer depends on weaving together information from many sources. The benchmark exposes gaps in current LLM capabilities, especially around complex planning, cross-domain reasoning, and evidence-based synthesis, areas crucial for deploying AI agents in business, research, and specialized fields.

Summary

MCP-Bench is a serious, large-scale test for AI agents using real tools and real tasks, with no shortcuts or artificial setups. It shows what current models do well and where they still fall short. For anyone building or evaluating AI assistants, these results, and the benchmark itself, are likely to be a useful reality check.

Check out the Paper and GitHub Page. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. The post Accenture Research Introduces MCP-Bench: A Large-Scale Benchmark that Evaluates LLM Agents in Complex Real-World Tasks via MCP Servers appeared first on MarkTechPost.



Microsoft AI Introduces rStar2-Agent: A 14B Math Reasoning Model Trained with Agentic Reinforcement Learning to Achieve Frontier-Level Performance

The Problem with "Thinking Longer"

Large language models have made impressive strides in mathematical reasoning by extending their Chain-of-Thought (CoT) processes, essentially "thinking longer" through more detailed reasoning steps. However, this approach has fundamental limitations. When models encounter subtle errors in their reasoning chains, they often compound these mistakes rather than detecting and correcting them. Internal self-reflection frequently fails, especially when the initial reasoning approach is fundamentally flawed. Microsoft's new research report introduces rStar2-Agent, which takes a different approach: instead of just thinking longer, it teaches models to think smarter by actively using coding tools to verify, explore, and refine their reasoning process.

https://arxiv.org/abs/2508.20722

The Agentic Approach

rStar2-Agent represents a shift toward agentic reinforcement learning, where a 14B-parameter model interacts with a Python execution environment throughout its reasoning process. Rather than relying solely on internal reflection, the model can write code, execute it, analyze the results, and adjust its approach based on concrete feedback. This creates a dynamic problem-solving process. When the model encounters a complex mathematical problem, it might generate initial reasoning, write Python code to test hypotheses, analyze execution results, and iterate toward a solution. The approach mirrors how human mathematicians often work, using computational tools to verify intuitions and explore different solution paths.

Infrastructure Challenges and Solutions

Scaling agentic RL presents significant technical hurdles. During training, a single batch can generate tens of thousands of concurrent code execution requests, creating bottlenecks that can stall GPU utilization. The researchers addressed this with two key infrastructure innovations. First, they built a distributed code execution service capable of handling 45,000 concurrent tool calls with sub-second latency. The system isolates code execution from the main training process while maintaining high throughput through careful load balancing across CPU workers. Second, they developed a dynamic rollout scheduler that allocates computational work based on real-time GPU cache availability rather than static assignment. This prevents GPU idle time caused by uneven workload distribution, a common problem when some reasoning traces require significantly more computation than others. These infrastructure improvements enabled the entire training process to complete in just one week using 64 AMD MI300X GPUs, demonstrating that frontier-level reasoning capabilities don't require massive computational resources when efficiently orchestrated.

GRPO-RoC: Learning from High-Quality Examples

The core algorithmic innovation is Group Relative Policy Optimization with Resampling on Correct (GRPO-RoC). Traditional reinforcement learning in this context faces a quality problem: models receive positive rewards for correct final answers even when their reasoning process includes multiple code errors or inefficient tool usage. GRPO-RoC addresses this by implementing an asymmetric sampling strategy. During training, the algorithm:
- oversamples initial rollouts to create a larger pool of reasoning traces;
- preserves diversity in failed attempts to maintain learning from various error modes;
- filters positive examples to emphasize traces with minimal tool errors and cleaner formatting.
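As a rough, hedged sketch of what "resampling on correct" could look like in code (the rollout fields, scoring, and split are assumptions for illustration, not the paper's implementation), consider:

```python
# A minimal sketch of GRPO-RoC-style "resampling on correct", under assumptions:
# each rollout records whether the final answer was correct and how many tool
# errors occurred. Illustrative only, not the paper's actual algorithm.
import random
from dataclasses import dataclass

@dataclass
class Rollout:
    trace: str
    correct: bool
    tool_errors: int

def resample_on_correct(rollouts: list[Rollout], group_size: int) -> list[Rollout]:
    """Downselect an oversampled rollout pool to `group_size` for the RL update."""
    positives = [r for r in rollouts if r.correct]
    negatives = [r for r in rollouts if not r.correct]

    # Keep the positives with the fewest tool errors (cleanest traces).
    positives.sort(key=lambda r: r.tool_errors)
    n_pos = min(len(positives), group_size // 2)
    kept = positives[:n_pos]

    # Preserve diversity among failures by sampling them uniformly.
    n_neg = min(len(negatives), group_size - n_pos)
    kept += random.sample(negatives, n_neg)
    return kept

# Example: oversample 16 rollouts, keep a group of 8 for the policy update.
pool = [Rollout(f"trace-{i}", correct=random.random() < 0.4,
                tool_errors=random.randint(0, 3)) for i in range(16)]
group = resample_on_correct(pool, group_size=8)
print(len(group), "rollouts kept for the update")
```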
This approach ensures the model learns from high-quality successful reasoning while still being exposed to diverse failure patterns. The result is more efficient tool usage and shorter, more focused reasoning traces.

https://arxiv.org/abs/2508.20722

Training Strategy: From Simple to Complex

The training process unfolds in three carefully designed stages, starting with non-reasoning supervised fine-tuning that focuses purely on instruction following and tool formatting, deliberately avoiding complex reasoning examples that might create early biases. Stage 1 constrains responses to 8,000 tokens, forcing the model to develop concise reasoning strategies. Despite this limitation, performance jumps dramatically, from near zero to over 70% on challenging benchmarks. Stage 2 extends the token limit to 12,000, allowing for more complex reasoning while maintaining the efficiency gains from the first stage. Stage 3 shifts focus to the most difficult problems by filtering out those the model has already mastered, ensuring continued learning from challenging cases. This progression from concise to extended reasoning, combined with increasing problem difficulty, maximizes learning efficiency while minimizing computational overhead.

Breakthrough Results

The results are striking. rStar2-Agent-14B achieves 80.6% accuracy on AIME24 and 69.8% on AIME25, surpassing much larger models including the 671B-parameter DeepSeek-R1. Perhaps more importantly, it accomplishes this with significantly shorter reasoning traces, averaging around 10,000 tokens compared to over 17,000 for comparable models. The efficiency gains extend beyond mathematics. Despite training exclusively on math problems, the model demonstrates strong transfer learning, outperforming specialized models on scientific reasoning benchmarks and maintaining competitive performance on general alignment tasks.

https://arxiv.org/abs/2508.20722

Understanding the Mechanisms

Analysis of the trained model reveals fascinating behavioral patterns. High-entropy tokens in reasoning traces fall into two categories: traditional "forking tokens" that trigger self-reflection and exploration, and a new category of "reflection tokens" that emerge specifically in response to tool feedback. These reflection tokens represent a form of environment-driven reasoning in which the model carefully analyzes code execution results, diagnoses errors, and adjusts its approach accordingly. This creates more sophisticated problem-solving behavior than pure CoT reasoning can achieve.

Summary

rStar2-Agent demonstrates that moderate-sized models can achieve frontier-level reasoning through sophisticated training rather than brute-force scaling. The approach suggests a more sustainable path toward advanced AI capabilities, one that emphasizes efficiency, tool integration, and smart training strategies over raw computational power. The success of this agentic approach also points toward future AI systems that can seamlessly integrate multiple tools and environments, moving beyond static text generation toward dynamic, interactive problem-solving capabilities.

Check out the Paper and GitHub Page. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. The post Microsoft AI Introduces rStar2-Agent: A 14B Math Reasoning Model Trained with Agentic Reinforcement Learning to Achieve Frontier-Level Performance appeared first on MarkTechPost.



Microsoft AI Lab Unveils MAI-Voice-1 and MAI-1-Preview: New In-House Models for Voice AI

Microsoft AI lab officially launched MAI-Voice-1 and MAI-1-preview, marking a new phase for the company's artificial intelligence research and development efforts. The announcement signals that Microsoft AI is now building core models in-house rather than relying on third parties, and the two models serve distinct but complementary roles in speech synthesis and general-purpose language understanding.

MAI-Voice-1: Technical Details and Capabilities

MAI-Voice-1 is a speech generation model that produces high-fidelity audio. It generates one minute of natural-sounding audio in under one second on a single GPU, supporting applications such as interactive assistants and podcast narration with low latency and modest hardware requirements. The model uses a transformer-based architecture trained on a diverse multilingual speech dataset. It handles single-speaker and multi-speaker scenarios, providing expressive and context-appropriate voice outputs. MAI-Voice-1 is integrated into Microsoft products like Copilot Daily for voice updates and news summaries, and it is available for testing in Copilot Labs, where users can create audio stories or guided narratives from text prompts. Technically, the model focuses on quality, versatility, and speed. Its single-GPU operation differs from systems requiring multiple GPUs, enabling integration in consumer devices and cloud applications beyond research settings.

MAI-1-Preview: Foundation Model Architecture and Performance

MAI-1-preview is Microsoft's first end-to-end, in-house foundation language model. Unlike previous models that Microsoft integrated or licensed from outside, MAI-1-preview was trained entirely on Microsoft's own infrastructure, using a mixture-of-experts architecture and approximately 15,000 NVIDIA H100 GPUs. The Microsoft AI team has made MAI-1-preview available on the LMArena platform, placing it alongside several other models. MAI-1-preview is optimized for instruction following and everyday conversational tasks, making it suitable for consumer-focused applications rather than enterprise or highly specialized use cases. Microsoft has begun rolling out access to the model for select text-based scenarios within Copilot, with a gradual expansion planned as feedback is collected and the system is refined.

Model Development and Training Infrastructure

The development of MAI-Voice-1 and MAI-1-preview was supported by Microsoft's next-generation GB200 GPU cluster, a custom-built infrastructure specifically optimized for training large generative models. In addition to hardware, Microsoft has invested heavily in talent, assembling a team with deep expertise in generative AI, speech synthesis, and large-scale systems engineering. The company's approach to model development emphasizes a balance between fundamental research and practical deployment, aiming to create systems that are not just theoretically impressive but also reliable and useful in everyday scenarios.

Applications

MAI-Voice-1 can be used for real-time voice assistance, audio content creation in media and education, or accessibility features. Its ability to simulate multiple speakers supports interactive scenarios such as storytelling, language learning, or simulated conversations, and its efficiency also allows deployment on consumer hardware. MAI-1-preview is focused on general language understanding and generation, assisting with tasks like drafting emails, answering questions, summarizing text, or helping with schoolwork in a conversational format.

Conclusion

Microsoft's release of MAI-Voice-1 and MAI-1-preview shows the company can now develop core generative AI models internally, backed by substantial investment in training infrastructure and technical talent. Both models are intended for practical, real-world use and are being refined with user feedback. This development adds to the diversity of model architectures and training methods in the field, with a focus on systems that are efficient, reliable, and suitable for integration into everyday applications. Microsoft's approach, using large-scale resources, gradual deployment, and direct engagement with users, offers one example of how organizations can advance AI capabilities while emphasizing practical, incremental improvement.

Check out the technical details here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. The post Microsoft AI Lab Unveils MAI-Voice-1 and MAI-1-Preview: New In-House Models for Voice AI appeared first on MarkTechPost.



Top 20 Voice AI Blogs and News Websites 2025: The Ultimate Resource Guide

Voice AI technology has experienced unprecedented growth in 2025, with revolutionary breakthroughs in real-time conversational AI, emotional intelligence, and voice synthesis. As enterprises increasingly adopt voice agents and consumers embrace next-generation AI assistants, staying informed about the latest developments has become crucial for professionals across industries. The global Voice AI market reached $5.4 billion in 2024, a 25% increase over the previous year, with voice AI solutions attracting $2.1 billion in equity funding.

Top 20 Voice AI Blogs and Websites

1. OpenAI Blog – Voice AI Research & Development
OpenAI leads the voice AI revolution with groundbreaking models like the GPT-4o Realtime API and advanced text-to-speech systems. Their blog provides insider insights into cutting-edge research, model releases, and real-world applications. OpenAI's recent announcement of gpt-realtime and Realtime API updates for production voice agents represents a major breakthrough in conversational AI.
Key focus areas: real-time speech-to-speech models; voice synthesis and emotional expression; safety and responsible AI deployment; developer tools and APIs.

2. MarkTechPost – Voice AI News & Analysis
MarkTechPost has established itself as a go-to source for comprehensive AI news coverage, with exceptional depth in voice AI reporting. Their expert analysis of emerging technologies and market trends makes complex developments accessible to both technical and business audiences. Their recent coverage of Microsoft's MAI-Voice-1 launch and comprehensive analysis of the voice AI landscape demonstrate a commitment to timely, authoritative reporting.
Key focus areas: voice AI market analysis and trends; technical breakthroughs in speech synthesis; enterprise voice agent implementations; industry funding and acquisitions.

3. Google AI Blog – Multimodal & Speech Research
Google's research team consistently pushes the boundaries of conversational AI, with innovations like real-time voice agent architecture and advanced speech recognition systems. Their recent work on building real-time voice agents with Gemini demonstrates practical applications of their research.
Key contributions: multimodal AI integration; real-time voice agent architecture; speech understanding and generation; privacy-preserving voice technologies.

4. Microsoft Azure AI Blog – Enterprise Voice Solutions
Microsoft's Azure AI Speech services power millions of enterprise applications. Their blog provides practical insights into implementing voice AI at scale, including personal voice creation, enterprise speech-to-text solutions, and multilingual voice support.
Focus areas: personal voice creation and customization; enterprise speech-to-text solutions; multilingual voice support; Azure cognitive services integration.

5. ElevenLabs Blog – Voice Synthesis Innovation
ElevenLabs has revolutionized voice cloning and synthesis, setting new standards for natural-sounding AI voices. The company secured $180 million in Series C funding in January 2025, reaching a valuation of $3.3 billion and demonstrating strong investor confidence in its technology.
Specializations: voice cloning technology; multilingual speech synthesis; creative applications in media; API development for voice integration.

6. Deepgram Blog – Speech Recognition Excellence
Deepgram's State of Voice AI 2025 report provides authoritative market analysis, identifying 2025 as "the year of human-like voice AI agents". Their technical content explores the latest in speech recognition and real-time transcription.
Key insights: voice AI market trends and predictions; technical deep-dives into speech recognition; developer tutorials and best practices; industry adoption case studies.

7. Anthropic Research – Conversational AI Ethics & Voice Mode
Anthropic's work on Claude focuses on safe, beneficial AI development with an emphasis on alignment and responsible deployment. In May 2025, Anthropic launched voice mode for Claude, powered by Claude Sonnet 4, enabling complete spoken conversations with five distinct voice options.
Focus areas: AI safety in conversational systems; ethical voice AI development; human-AI interaction research; voice mode implementation using ElevenLabs technology.

8. Stanford HAI Blog – Academic Voice AI Research
Stanford's Human-Centered AI Institute produces cutting-edge research on voice interaction and turn-taking in conversations. Their recent work on teaching voice assistants when to speak represents breakthrough research in conversational AI, moving beyond simple silence detection to analyze voice intonation patterns.
Research highlights: conversational AI turn-taking and interruption handling; World Wide Voice Web (WWvW) development; silent speech recognition advances; open-source virtual assistant development.

9. Hume AI Blog – Emotionally Intelligent Voice
Hume AI specializes in emotionally intelligent voice interactions, combining speech technology with empathic understanding. Their Empathic Voice Interface (EVI 3) represents a breakthrough in conversational AI, capable of understanding and responding with natural, emotionally intelligent voice interactions.
Innovations: emotional intelligence in voice AI; empathic voice interfaces; voice control and customization; human wellbeing optimization through AI.

10. MIT Technology Review – Voice AI Analysis
MIT Technology Review provides in-depth analysis of voice AI trends, societal implications, and breakthrough research with rigorous journalistic standards. Their coverage includes voice AI diversity initiatives, synthetic voice technology implications, and ethical considerations in voice technology deployment.
Coverage areas: voice AI diversity and inclusion; audio deepfake detection and prevention; industry analysis and market trends; ethical considerations in voice tech.

11. Resemble AI Blog – Voice Cloning & Security
Resemble AI leads in voice cloning technology while addressing security concerns like deepfake detection. They specialize in advanced voice cloning techniques, enterprise voice solutions, and voice security authentication.
Expertise: advanced voice cloning techniques; deepfake detection and prevention; enterprise voice solutions; voice security and authentication.

12. TechCrunch – Voice AI Industry News
TechCrunch provides comprehensive coverage of voice AI startups, funding rounds, and industry developments. They extensively covered Anthropic's voice mode launch and provide regular updates on industry partnerships and product launches.
Coverage focus: startup funding and acquisitions; industry partnerships and deals; product launches and demos; market analysis and predictions.

13. VentureBeat AI – Voice Technology Trends
VentureBeat offers detailed coverage of voice AI business applications and enterprise adoption trends. They specialize in enterprise AI adoption analysis, voice technology market research, and developer tools coverage.
Specializations: enterprise AI adoption; voice technology market analysis; product reviews and comparisons; developer tools and platforms.

14. Towards Data Science – Technical Voice AI Content
This Medium publication features hands-on tutorials, technical deep-dives, and practical implementations of voice AI technologies. Content includes privacy-preserving voice AI implementations, voice assistant tuning, and AI-powered language learning applications.
Content types: technical tutorials and guides; voice AI implementation case studies; Python and machine learning applications; data science approaches to speech.

15. Amazon Alexa Blog – Voice Assistant Innovation
Amazon's Alexa team shares



Multi-Lingual Implicit Discourse Relation Recognition with Multi-Label Hierarchical Learning

arXiv:2508.20712v1 Announce Type: new

Abstract: This paper introduces the first multi-lingual and multi-label classification model for implicit discourse relation recognition (IDRR). Our model, HArch, is evaluated on the recently released DiscoGeM 2.0 corpus and leverages hierarchical dependencies between discourse senses to predict probability distributions across all three sense levels in the PDTB 3.0 framework. We compare several pre-trained encoder backbones and find that RoBERTa-HArch achieves the best performance in English, while XLM-RoBERTa-HArch performs best in the multi-lingual setting. In addition, we compare our fine-tuned models against GPT-4o and Llama-4-Maverick using few-shot prompting across all language configurations. Our results show that our fine-tuned models consistently outperform these LLMs, highlighting the advantages of task-specific fine-tuning over prompting in IDRR. Finally, we report SOTA results on the DiscoGeM 1.0 corpus, further validating the effectiveness of our hierarchical approach.
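As a rough, hedged sketch of the general architecture family the abstract describes (an encoder with hierarchically conditioned heads over the three PDTB 3.0 sense levels), the snippet below is an illustration under stated assumptions, not the paper's actual HArch implementation; the label counts and conditioning scheme are placeholders.

```python
# Illustrative hierarchical multi-label head over a pre-trained encoder.
# Layer sizes and the conditioning scheme are assumptions, not HArch itself.
import torch
import torch.nn as nn
from transformers import AutoModel

class HierarchicalHead(nn.Module):
    def __init__(self, encoder_name="roberta-base", n_l1=4, n_l2=17, n_l3=28):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.l1 = nn.Linear(hidden, n_l1)            # level-1 senses
        self.l2 = nn.Linear(hidden + n_l1, n_l2)     # conditioned on level-1 scores
        self.l3 = nn.Linear(hidden + n_l2, n_l3)     # conditioned on level-2 scores

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state[:, 0]
        p1 = self.l1(h).softmax(-1)
        p2 = self.l2(torch.cat([h, p1], dim=-1)).softmax(-1)
        p3 = self.l3(torch.cat([h, p2], dim=-1)).softmax(-1)
        return p1, p2, p3   # probability distributions over all three sense levels
```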

