YouZum


AI, Committee, News, Uncategorized

Meet Elysia: A New Open-Source Python Framework Redefining Agentic RAG Systems with Decision Trees and Smarter Data Handling

If you’ve ever tried to build an agentic RAG system that actually works well, you know the pain. You feed it some documents, cross your fingers, and hope it doesn’t hallucinate when someone asks it a simple question. Most of the time, you get back irrelevant chunks of text that barely answer what was asked. Elysia is trying to fix this mess, and honestly, their approach is quite creative. Built by the folks at Weaviate, this open-source Python framework doesn’t just throw more AI at the problem – it completely rethinks how AI agents should work with your data. (Note: Python 3.12 is required.)

What’s Actually Wrong with Most RAG Systems

Here’s the thing that drives everyone crazy: traditional RAG systems are basically blind. They take your question, convert it to vectors, find some “similar” text, and hope for the best. It’s like asking someone to find you a good restaurant while they’re wearing a blindfold – they might get lucky, but probably not. Most systems also dump every possible tool on the AI at once, which is like giving a toddler access to your entire toolbox and expecting them to build a bookshelf.

Elysia’s Three Pillars

1) Decision Trees

Instead of giving AI agents every tool at once, Elysia guides them through a structured tree of decision nodes. Think of it like a flowchart that actually makes sense. Each step has context about what happened before and what options come next. The really cool part? The system shows you exactly which path the agent took and why, so when something goes wrong, you can actually debug it instead of just shrugging and trying again. When the AI realizes it can’t do something (like searching for car prices in a makeup database), it doesn’t just keep trying forever. It sets an “impossible flag” and moves on, which sounds obvious but apparently needed to be invented.

2) Smart Data Source Display

Remember when every AI just spat out paragraphs of text? Elysia actually looks at your data and figures out how to show it properly. Got e-commerce products? You get product cards. GitHub issues? You get ticket layouts. Spreadsheet data? You get actual tables. The system examines your data structure first – the fields, the types, the relationships – then picks the one of its seven display formats that makes sense.

3) Data Expertise

This might be the biggest difference. Before Elysia searches anything, it analyzes your database to understand what’s actually in there. It can summarize collections, generate metadata, and choose display types. It looks at:

What kinds of fields you have
What the data ranges look like
How different pieces relate to each other
What would make sense to search for

How Does It Work?

Learning from Feedback

Elysia remembers when users say “yes, this was helpful” and uses those examples to improve future responses. But it does this smartly – your feedback doesn’t affect other people’s results, and it helps the system get better at answering your specific types of questions. This means you can use smaller, cheaper models that still give good results because they learn from actual success cases.

Chunking That Makes Sense

Most RAG systems chunk all your documents upfront, which uses tons of storage and often creates awkward breaks. Elysia chunks documents only when needed. It searches full documents first; if a document looks relevant but is too long, it breaks it down on the fly. This saves storage space and actually works better because the chunking decisions are informed by what the user is actually looking for.

Model Routing

Different tasks need different models. Simple questions don’t need GPT-4, and complex analysis doesn’t work well with tiny models. Elysia automatically routes tasks to the right model based on complexity, which saves money and improves speed.

Source: https://weaviate.io/blog/elysia-agentic-rag

Getting Started

The setup is quite simple:

pip install elysia-ai
elysia start

That’s it. You get both a web interface and the Python framework. For developers who want to customize things:

from elysia import tool, Tree

tree = Tree()

@tool(tree=tree)
async def add(x: int, y: int) -> int:
    return x + y

tree("What is the sum of 9009 and 6006?")

If you have Weaviate data, it’s even simpler:

import elysia

tree = elysia.Tree()
response, objects = tree(
    "What are the 10 most expensive items in the Ecommerce collection?",
    collection_names=["Ecommerce"],
)

Real-World Example: Glowe’s Chatbot

The Glowe skincare chatbot platform uses Elysia to handle complex product recommendations. Users can ask things like “What products work well with retinol but won’t irritate sensitive skin?” and get intelligent responses that consider ingredient interactions, user preferences, and product availability. This isn’t just keyword matching – it’s understanding the context and relationships between ingredients, user history, and product characteristics in ways that would be really hard to code manually.

Summary

Elysia represents Weaviate’s attempt to move beyond the traditional ask-retrieve-generate RAG pattern by combining decision-tree agents, adaptive data presentation, and learning from user feedback. Rather than just generating text responses, it analyzes data structure beforehand and selects appropriate display formats while maintaining transparency in its decision-making process. As Weaviate’s planned replacement for their Verba RAG system, it offers a foundation for building more sophisticated AI applications that understand both what users are asking and how to present answers effectively, though whether this translates to meaningfully better real-world performance remains to be seen, since it is still in beta.

Check out the TECHNICAL DETAILS and GITHUB PAGE. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post Meet Elysia: A New Open-Source Python Framework Redefining Agentic RAG Systems with Decision Trees and Smarter Data Handling appeared first on MarkTechPost.
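To make the “chunk only when needed” idea above more concrete, here is a minimal, framework-agnostic sketch. It is not Elysia’s actual API; all function names and the toy relevance score are assumptions, purely for illustration.

# Illustrative sketch of on-demand chunking, not Elysia's actual API.
def score(query: str, text: str) -> float:
    # Toy relevance score: fraction of query words present in the text.
    words = set(query.lower().split())
    return sum(w in text.lower() for w in words) / max(len(words), 1)

def chunk(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    # Simple sliding-window chunker, applied lazily to long documents only.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def retrieve(query: str, documents: list[str], max_len: int = 800, top_k: int = 3) -> list[str]:
    # 1) Rank whole documents first.
    ranked = sorted(documents, key=lambda d: score(query, d), reverse=True)[:top_k]
    results = []
    for doc in ranked:
        if len(doc) <= max_len:
            results.append(doc)  # short and relevant: use as-is
        else:
            # 2) Only now split the long document and keep its best chunk.
            best = max(chunk(doc), key=lambda c: score(query, c))
            results.append(best)
    return results

docs = ["Elysia is an agentic RAG framework built on Weaviate. " * 40,
        "A short note about decision trees in agents."]
print(retrieve("How does Elysia use decision trees?", docs))

Because chunk boundaries are computed per query, the same long document can be split differently for different questions, which is the core of the storage and relevance argument made above.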

Meet Elysia: A New Open-Source Python Framework Redefining Agentic RAG Systems with Decision Trees and Smarter Data Handling Read the article »

AI, Committee, News, Uncategorized

StepFun AI Releases Step-Audio 2 Mini: An Open-Source 8B Speech-to-Speech AI Model that Surpasses GPT-4o-Audio

The StepFun AI team has released Step-Audio 2 Mini, an 8B-parameter speech-to-speech large audio language model (LALM) that delivers expressive, grounded, and real-time audio interaction. Released under the Apache 2.0 license, this open-source model achieves state-of-the-art performance across speech recognition, audio understanding, and speech conversation benchmarks—surpassing commercial systems such as GPT-4o-Audio.

Source: https://huggingface.co/stepfun-ai/Step-Audio-2-mini

Key Features

1. Unified Audio–Text Tokenization

Unlike cascaded ASR+LLM+TTS pipelines, Step-Audio 2 integrates Multimodal Discrete Token Modeling, where text and audio tokens share a single modeling stream. This enables:

Seamless reasoning across text and audio.
On-the-fly voice style switching during inference.
Consistency in semantic, prosodic, and emotional outputs.

2. Expressive and Emotion-Aware Generation

The model doesn’t just transcribe speech—it interprets paralinguistic features like pitch, rhythm, emotion, timbre, and style. This allows conversations with realistic emotional tones such as whispering, sadness, or excitement. Benchmarks on StepEval-Audio-Paralinguistic show Step-Audio 2 achieving 83.1% accuracy, far beyond GPT-4o Audio (43.5%) and Qwen-Omni (44.2%).

3. Retrieval-Augmented Speech Generation

Step-Audio 2 incorporates multimodal RAG (Retrieval-Augmented Generation):

Web search integration for factual grounding.
Audio search—a novel capability that retrieves real voices from a large library and fuses them into responses, enabling voice timbre/style imitation at inference time.

4. Tool Calling and Multimodal Reasoning

The system extends beyond speech synthesis by supporting tool invocation. Benchmarks show that Step-Audio 2 matches textual LLMs in tool selection and parameter accuracy, while uniquely excelling at audio search tool calls—a capability unavailable in text-only LLMs.

Training and Data Scale

Text + audio corpus: 1.356T tokens
Audio hours: 8M+ real and synthetic hours
Speaker diversity: ~50K voices across languages and dialects
Pretraining pipeline: a multi-stage curriculum covering ASR, TTS, speech-to-speech translation, and emotion-labeled conversational synthesis.

This large-scale training allows Step-Audio 2 Mini to retain strong text reasoning (via its Qwen2-Audio and CosyVoice foundation) while mastering fine-grained audio modeling.

Performance Benchmarks

Sources: https://huggingface.co/stepfun-ai/Step-Audio-2-mini and https://arxiv.org/abs/2507.16632

Automatic Speech Recognition (ASR)
English: average WER 3.14% (beats GPT-4o Transcribe at an average 4.5%).
Chinese: average CER 3.08% (significantly lower than GPT-4o and Qwen-Omni).
Robust across dialects and accents.

Audio Understanding (MMAU Benchmark)
Step-Audio 2: 78.0 average, outperforming Omni-R1 (77.0) and Audio Flamingo 3 (73.1).
Strongest in sound and speech reasoning tasks.

Speech Translation
CoVoST 2 (S2TT): BLEU 39.26 (highest among open and closed models).
CVSS (S2ST): BLEU 30.87, ahead of GPT-4o (23.68).

Conversational Benchmarks (URO-Bench)
Chinese conversations: best overall at 83.3 (basic) and 68.2 (pro).
English conversations: competitive with GPT-4o (83.9 vs. 84.5), far ahead of other open models.

Source: Marktechpost.com

Conclusion

Step-Audio 2 Mini makes advanced, multimodal speech intelligence accessible to developers and the research community. By combining Qwen2-Audio’s reasoning capacity with CosyVoice’s tokenization pipeline, and augmenting both with retrieval-based grounding, StepFun has delivered one of the most capable open audio LLMs.

Check out the PAPER and MODEL on HUGGING FACE. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post StepFun AI Releases Step-Audio 2 Mini: An Open-Source 8B Speech-to-Speech AI Model that Surpasses GPT-4o-Audio appeared first on MarkTechPost.
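For readers unfamiliar with the WER and CER numbers quoted above, here is a small sketch of how such ASR scores are typically computed using the jiwer library. The transcripts below are invented and have nothing to do with Step-Audio 2’s actual test sets.

# Toy illustration of WER/CER computation; transcripts are made up.
import jiwer  # pip install jiwer

references = [
    "the quick brown fox jumps over the lazy dog",
    "speech to speech models are evaluated on word error rate",
]
hypotheses = [
    "the quick brown fox jumped over a lazy dog",
    "speech to speech models are evaluated on word error rates",
]

wer = jiwer.wer(references, hypotheses)  # word error rate over the whole set
cer = jiwer.cer(references, hypotheses)  # character error rate (used for Chinese)
print(f"WER: {wer:.2%}  CER: {cer:.2%}")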

StepFun AI Releases Step-Audio 2 Mini: An Open-Source 8B Speech-to-Speech AI Model that Surpasses GPT-4o-Audio Read the article »

AI, Committee, News, Uncategorized

Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning

arXiv:2508.21589v1 Announce Type: new Abstract: Supervised Fine-Tuning (SFT) of Large Language Models (LLMs) fundamentally relies on high-quality training data. While data selection and data synthesis are two common strategies to improve data quality, existing approaches often face limitations in static dataset curation that fail to adapt to evolving model capabilities. In this paper, we introduce Middo, a self-evolving Model-informed dynamic data optimization framework that uses model-aware data selection and context-preserving data refinement. Unlike conventional one-off filtering/synthesis methods, our framework establishes a closed-loop optimization system: (1) A self-referential diagnostic module proactively identifies suboptimal samples through tri-axial model signals – loss patterns (complexity), embedding cluster dynamics (diversity), and self-alignment scores (quality); (2) An adaptive optimization engine then transforms suboptimal samples into pedagogically valuable training points while preserving semantic integrity; (3) This optimization process continuously evolves with model capability through dynamic learning principles. Experiments on multiple benchmarks demonstrate that our method consistently enhances the quality of seed data and boosts LLM performance, improving accuracy by 7.15% on average while maintaining the original dataset scale. This work establishes a new paradigm for sustainable LLM training through dynamic human-AI co-evolution of data and models. Our datasets, models, and code are coming soon.
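As a deliberately simplified illustration of the diagnostic step described in the abstract, the sketch below flags training samples whose per-sample loss is unusually high, one rough proxy for the “complexity” signal. This is not Middo’s actual code; the paper combines this with embedding-cluster and self-alignment signals, and the data here is synthetic.

# Flag high-loss SFT samples for refinement (toy stand-in for one Middo signal).
import numpy as np

def flag_for_refinement(per_sample_loss: np.ndarray, percentile: float = 90.0) -> np.ndarray:
    """Return indices of samples whose loss exceeds the given percentile."""
    threshold = np.percentile(per_sample_loss, percentile)
    return np.where(per_sample_loss > threshold)[0]

rng = np.random.default_rng(0)
losses = rng.gamma(2.0, 1.0, size=1000)  # stand-in for real per-sample SFT losses
to_refine = flag_for_refinement(losses, percentile=90.0)
print(f"{len(to_refine)} of {len(losses)} samples flagged for refinement")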

Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning Read the article »

AI, Committee, News, Uncategorized

Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval

arXiv:2508.21788v1 Announce Type: new Abstract: Large language models (LLMs) rely heavily on web-scale datasets like Common Crawl, which provides over 80% of training data for some modern models. However, the indiscriminate nature of web crawling raises challenges in data quality, safety, and ethics. Despite the critical importance of training data quality, prior research on harmful content has been limited to small samples due to computational constraints. This project presents a framework for indexing and analyzing LLM training datasets using an ElasticSearch-based pipeline. We apply it to SwissAI’s FineWeb-2 corpus (1.5TB, four languages), achieving fast query performance–most searches in milliseconds, all under 2 seconds. Our work demonstrates real-time dataset analysis, offering practical tools for safer, more accountable AI systems.
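To give a flavor of what an Elasticsearch-based indexing and search pipeline looks like in practice, here is a minimal sketch using the official Python client. The index name, document fields, and query are hypothetical and not taken from the paper’s implementation.

# Minimal Elasticsearch indexing/search sketch (hypothetical index and fields).
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a few documents (in practice this would be a bulk ingest of corpus shards).
docs = [
    {"doc_id": "fw2-000001", "language": "de", "text": "Beispieltext aus dem Web."},
    {"doc_id": "fw2-000002", "language": "fr", "text": "Un autre document du corpus."},
]
for doc in docs:
    es.index(index="fineweb2", document=doc)

es.indices.refresh(index="fineweb2")

# Full-text query, e.g. when scanning for potentially problematic content.
resp = es.search(index="fineweb2", query={"match": {"text": "document"}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["doc_id"], hit["_score"])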

Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval Read the article »

AI, Committee, News, Uncategorized

Transforming Wearable Data into Personal Health Insights using Large Language Model Agents

arXiv:2406.06464v3 Announce Type: replace-cross Abstract: Deriving personalized insights from popular wearable trackers requires complex numerical reasoning that challenges standard LLMs, necessitating tool-based approaches like code generation. Large language model (LLM) agents present a promising yet largely untapped solution for this analysis at scale. We introduce the Personal Health Insights Agent (PHIA), a system leveraging multistep reasoning with code generation and information retrieval to analyze and interpret behavioral health data. To test its capabilities, we create and share two benchmark datasets with over 4000 health insights questions. A 650-hour human expert evaluation shows that PHIA significantly outperforms a strong code generation baseline, achieving 84% accuracy on objective, numerical questions and, for open-ended ones, earning 83% favorable ratings while being twice as likely to achieve the highest quality rating. This work can advance behavioral health by empowering individuals to understand their data, enabling a new era of accessible, personalized, and data-driven wellness for the wider population.
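The abstract describes an agent that answers wearable-data questions by generating analysis code. The sketch below shows the kind of code such an agent might produce for a question like “Do I sleep more on days I walk over 8,000 steps?”; the dataframe, column names, and threshold are invented for illustration and are not from the paper.

# Example of PHIA-style generated analysis over hypothetical wearable data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = pd.date_range("2024-01-01", periods=60, freq="D")
df = pd.DataFrame({
    "date": days,
    "steps": rng.integers(2000, 15000, size=60),
    "sleep_hours": rng.normal(7.0, 1.0, size=60).round(1),
})

df["active_day"] = df["steps"] > 8000
summary = df.groupby("active_day")["sleep_hours"].mean().round(2)
print(summary)  # mean sleep on active vs. less-active days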

Transforming Wearable Data into Personal Health Insights using Large Language Model Agents Read the article »

AI, Committee, News, Uncategorized

Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization

arXiv:2507.05137v2 Announce Type: replace Abstract: Learning Japanese vocabulary is a challenge for learners from Roman alphabet backgrounds due to script differences. Japanese combines syllabaries like hiragana with kanji, which are logographic characters of Chinese origin. Kanji are also challenging due to their structural complexity and sheer number. Keyword mnemonics are a common strategy to aid memorization, often using the compositional structure of kanji to form vivid associations. Despite recent efforts to use large language models (LLMs) to assist learners, existing methods for LLM-based keyword mnemonic generation function as a black box, offering limited interpretability. We propose a generative framework that explicitly models the mnemonic construction process as driven by a set of common rules, and learn them using a novel Expectation-Maximization-type algorithm. Trained on learner-authored mnemonics from an online platform, our method learns latent structures and compositional rules, enabling interpretable and systematic mnemonic generation. Experiments show that our method performs well in the cold-start setting for new learners while providing insight into the mechanisms behind effective mnemonic creation.
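For readers unfamiliar with Expectation-Maximization, here is a generic textbook EM loop on a toy one-dimensional two-component Gaussian mixture. It only illustrates the E-step/M-step alternation that the paper’s algorithm builds on; it is unrelated to the actual latent mnemonic-rule model and data.

# Generic EM on a toy 2-component Gaussian mixture (illustrative only).
import numpy as np

rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 200)])

pi, mu, sigma = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])  # initial guesses

def gaussian(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: responsibility of component 0 for each point
    p0 = pi * gaussian(data, mu[0], sigma[0])
    p1 = (1 - pi) * gaussian(data, mu[1], sigma[1])
    r0 = p0 / (p0 + p1)
    r1 = 1 - r0
    # M-step: re-estimate mixture weight, means, and standard deviations
    pi = r0.mean()
    mu = np.array([np.sum(r0 * data) / r0.sum(), np.sum(r1 * data) / r1.sum()])
    sigma = np.array([
        np.sqrt(np.sum(r0 * (data - mu[0]) ** 2) / r0.sum()),
        np.sqrt(np.sum(r1 * (data - mu[1]) ** 2) / r1.sum()),
    ])

print("weights:", round(pi, 2), round(1 - pi, 2), "means:", mu.round(2), "stds:", sigma.round(2))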

Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization Read the article »

AI, Committee, News, Uncategorized

Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach

arXiv:2508.21206v1 Announce Type: new Abstract: Autoregressive language models are vulnerable to orthographic attacks, where input text is perturbed with characters from multilingual alphabets, leading to substantial performance degradation. This vulnerability primarily stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. To address this limitation, we propose a pixel-based generative language model that replaces text-based embeddings with pixel-based representations by rendering words as individual images. This design provides stronger robustness to noisy inputs while extending compatibility to multilingual text across diverse writing systems. We evaluate the proposed method on the multilingual LAMBADA dataset, the WMT24 dataset, and the SST-2 benchmark, demonstrating both its resilience to orthographic noise and its effectiveness in multilingual settings.
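To illustrate the core idea of replacing subword embeddings with pixel-based inputs, here is a toy sketch that renders a word as a small grayscale image with Pillow and uses the pixel array as its representation. The image size, font, and comparison below are assumptions for illustration, not the paper’s actual rendering pipeline.

# Toy pixel rendering of words (sketch of the idea, not the paper's pipeline).
import numpy as np
from PIL import Image, ImageDraw  # pip install pillow

def render_word(word: str, width: int = 64, height: int = 16) -> np.ndarray:
    img = Image.new("L", (width, height), color=255)   # white canvas
    ImageDraw.Draw(img).text((2, 2), word, fill=0)     # black text, default font
    return np.asarray(img, dtype=np.float32) / 255.0   # normalized pixel grid

# Visually similar strings map to nearly identical pixel grids, which is what
# gives pixel-based models robustness to orthographic perturbations.
a, b = render_word("attack"), render_word("att4ck")
print(a.shape, float(np.abs(a - b).mean()))            # small pixel-level difference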

Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach Read the article »

AI, Committee, News, Uncategorized

A Coding Guide to Building a Brain-Inspired Hierarchical Reasoning AI Agent with Hugging Face Models

In this tutorial, we set out to recreate the spirit of the Hierarchical Reasoning Model (HRM) using a free Hugging Face model that runs locally. We walk through the design of a lightweight yet structured reasoning agent, where we act as both architects and experimenters. By breaking problems into subgoals, solving them with Python, critiquing the outcomes, and synthesizing a final answer, we can experience how hierarchical planning and execution can enhance reasoning performance. This process enables us to see, in real time, how a brain-inspired workflow can be implemented without requiring massive model sizes or expensive APIs. Check out the Paper and FULL CODES.

!pip -q install -U transformers accelerate bitsandbytes rich

import os, re, json, textwrap, traceback
from typing import Dict, Any, List
from rich import print as rprint

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
DTYPE = torch.bfloat16 if torch.cuda.is_available() else torch.float32

We begin by installing the required libraries and loading the Qwen2.5-1.5B-Instruct model from Hugging Face. We set the data type based on GPU availability to ensure efficient model execution in Colab.

tok = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, device_map="auto", torch_dtype=DTYPE, load_in_4bit=True
)
gen = pipeline(
    "text-generation", model=model, tokenizer=tok, return_full_text=False
)

We load the tokenizer and model, configure it to run in 4-bit for efficiency, and wrap everything in a text-generation pipeline so we can interact with the model easily in Colab. Check out the Paper and FULL CODES.

def chat(prompt: str, system: str = "", max_new_tokens: int = 512, temperature: float = 0.3) -> str:
    msgs = []
    if system:
        msgs.append({"role": "system", "content": system})
    msgs.append({"role": "user", "content": prompt})
    inputs = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    out = gen(inputs, max_new_tokens=max_new_tokens, do_sample=(temperature > 0),
              temperature=temperature, top_p=0.9)
    return out[0]["generated_text"].strip()

def extract_json(txt: str) -> Dict[str, Any]:
    m = re.search(r"\{[\s\S]*\}$", txt.strip())
    if not m:
        m = re.search(r"\{[\s\S]*?\}", txt)
    try:
        return json.loads(m.group(0)) if m else {}
    except Exception:
        # fallback: strip code fences
        s = re.sub(r"^```.*?\n|\n```$", "", txt, flags=re.S)
        try:
            return json.loads(s)
        except Exception:
            return {}

We define helper functions: the chat function allows us to send prompts to the model with optional system instructions and sampling controls, while extract_json helps us parse structured JSON outputs from the model reliably, even if the response includes code fences or additional text. Check out the Paper and FULL CODES.

def extract_code(txt: str) -> str:
    m = re.search(r"```(?:python)?\s*([\s\S]*?)```", txt, flags=re.I)
    return (m.group(1) if m else txt).strip()

def run_python(code: str, env: Dict[str, Any] | None = None) -> Dict[str, Any]:
    import io, contextlib
    g = {"__name__": "__main__"}; l = {}
    if env:
        g.update(env)
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, g, l)
        out = l.get("RESULT", g.get("RESULT"))
        return {"ok": True, "result": out, "stdout": buf.getvalue()}
    except Exception as e:
        return {"ok": False, "error": str(e), "trace": traceback.format_exc(), "stdout": buf.getvalue()}

PLANNER_SYS = """You are the HRM Planner.
Decompose the TASK into 2-4 atomic, code-solvable subgoals.
Return compact JSON only: {"subgoals":[...], "final_format":"<one-line answer format>"}."""

SOLVER_SYS = """You are the HRM Solver.
Given SUBGOAL and CONTEXT vars, output a single Python snippet.
Rules:
- Compute deterministically.
- Set a variable RESULT to the answer.
- Keep code short; stdlib only.
Return only a Python code block."""

CRITIC_SYS = """You are the HRM Critic.
Given TASK and LOGS (subgoal results), decide if final answer is ready.
Return JSON only: {"action":"submit"|"revise","critique":"...","fix_hint":"<if revise>"}."""

SYNTH_SYS = """You are the HRM Synthesizer.
Given TASK, LOGS, and final_format, output only the final answer (no steps).
Follow final_format exactly."""

We add two important pieces: utility functions and system prompts. The extract_code function pulls Python snippets from the model’s output, while run_python safely executes those snippets and captures their results. Alongside, we define four role prompts, Planner, Solver, Critic, and Synthesizer, which guide the model to break tasks into subgoals, solve them with code, verify correctness, and finally produce a clean answer. Check out the Paper and FULL CODES.

def plan(task: str) -> Dict[str, Any]:
    p = f"TASK:\n{task}\nReturn JSON only."
    return extract_json(chat(p, PLANNER_SYS, temperature=0.2, max_new_tokens=300))

def solve_subgoal(subgoal: str, context: Dict[str, Any]) -> Dict[str, Any]:
    prompt = f"SUBGOAL:\n{subgoal}\nCONTEXT vars: {list(context.keys())}\nReturn Python code only."
    code = extract_code(chat(prompt, SOLVER_SYS, temperature=0.2, max_new_tokens=400))
    res = run_python(code, env=context)
    return {"subgoal": subgoal, "code": code, "run": res}

def critic(task: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
    pl = [{"subgoal": L["subgoal"], "result": L["run"].get("result"), "ok": L["run"]["ok"]} for L in logs]
    out = chat("TASK:\n" + task + "\nLOGS:\n" + json.dumps(pl, ensure_ascii=False, indent=2) + "\nReturn JSON only.",
               CRITIC_SYS, temperature=0.1, max_new_tokens=250)
    return extract_json(out)

def refine(task: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
    sys = "Refine subgoals minimally to fix issues. Return same JSON schema as planner."
    out = chat("TASK:\n" + task + "\nLOGS:\n" + json.dumps(logs, ensure_ascii=False) + "\nReturn JSON only.",
               sys, temperature=0.2, max_new_tokens=250)
    j = extract_json(out)
    return j if j.get("subgoals") else {}

def synthesize(task: str, logs: List[Dict[str, Any]], final_format: str) -> str:
    packed = [{"subgoal": L["subgoal"], "result": L["run"].get("result")} for L in logs]
    return chat("TASK:\n" + task + "\nLOGS:\n" + json.dumps(packed, ensure_ascii=False) +
                f"\nfinal_format: {final_format}\nOnly the final answer.",
                SYNTH_SYS, temperature=0.0, max_new_tokens=120).strip()

def hrm_agent(task: str, context: Dict[str, Any] | None = None, budget: int = 2) -> Dict[str, Any]:
    ctx = dict(context or {})
    trace, plan_json = [], plan(task)
    for round_id in range(1, budget + 1):
        logs = [solve_subgoal(sg, ctx) for sg in plan_json.get("subgoals", [])]
        for L in logs:
            ctx_key = f"g{len(trace)}_{abs(hash(L['subgoal'])) % 9999}"
            ctx[ctx_key] = L["run"].get("result")
        verdict = critic(task, logs)
        trace.append({"round": round_id, "plan": plan_json, "logs": logs, "verdict": verdict})
        if verdict.get("action") == "submit":
            break
        plan_json = refine(task, logs) or plan_json
    final = synthesize(task, trace[-1]["logs"], plan_json.get("final_format", "Answer: <value>"))
    return {"final": final, "trace": trace}

We implement the full HRM loop: we plan subgoals, solve each by generating and running Python (capturing RESULT), then we critique, optionally refine the plan, and synthesize a clean final answer. We orchestrate these rounds in hrm_agent, carrying forward intermediate results as context so we iteratively improve and stop once the critic says “submit.” Check out the Paper and FULL CODES.

ARC_TASK = textwrap.dedent("""
Infer the transformation rule from train examples and apply to test.
Return exactly: "Answer: <grid>", where <grid> is a Python list of lists of ints.
""").strip()

ARC_DATA = {
    "train": [
        {"inp": [[0,0],[1,0]], "out": [[1,1],[0,1]]},
        {"inp": [[0,1],[0,0]], "out": [[1,0],[1,1]]},
    ],
    "test": [[0,0],[0,1]],
}

res1 = hrm_agent(ARC_TASK, context={"TRAIN": ARC_DATA["train"], "TEST": ARC_DATA["test"]}, budget=2)
rprint("\n[bold]Demo 1 —

A Coding Guide to Building a Brain-Inspired Hierarchical Reasoning AI Agent with Hugging Face Models Read the article »

AI, Committee, News, Uncategorized

Chunking vs. Tokenization: Key Differences in AI Text Processing

Table of contents

Introduction
What is Tokenization?
What is Chunking?
The Key Differences That Matter
Why This Matters for Real Applications
Where You’ll Use Each Approach
Current Best Practices (What Actually Works)
Summary

Introduction

When you’re working with AI and natural language processing, you’ll quickly encounter two fundamental concepts that often get confused: tokenization and chunking. While both involve breaking down text into smaller pieces, they serve completely different purposes and work at different scales. If you’re building AI applications, understanding these differences isn’t just academic—it’s crucial for creating systems that actually work well.

Think of it this way: if you’re making a sandwich, tokenization is like cutting your ingredients into bite-sized pieces, while chunking is like organizing those pieces into logical groups that make sense to eat together. Both are necessary, but they solve different problems.

Source: marktechpost.com

What is Tokenization?

Tokenization is the process of breaking text into the smallest meaningful units that AI models can understand. These units, called tokens, are the basic building blocks that language models work with. You can think of tokens as the “words” in an AI’s vocabulary, though they’re often smaller than actual words.

There are several ways to create tokens:

Word-level tokenization splits text at spaces and punctuation. It’s straightforward but creates problems with rare words that the model has never seen before.
Subword tokenization is more sophisticated and widely used today. Methods like Byte Pair Encoding (BPE), WordPiece, and SentencePiece break words into smaller chunks based on how frequently character combinations appear in training data. This approach handles new or rare words much better.
Character-level tokenization treats each letter as a token. It’s simple but creates very long sequences that are harder for models to process efficiently.

Here’s a practical example:

Original text: “AI models process text efficiently.”
Word tokens: [“AI”, “models”, “process”, “text”, “efficiently”]
Subword tokens: [“AI”, “model”, “s”, “process”, “text”, “efficient”, “ly”]

Notice how subword tokenization splits “models” into “model” and “s” because this pattern appears frequently in training data. This helps the model understand related words like “modeling” or “modeled” even if it hasn’t seen them before.

What is Chunking?

Chunking takes a completely different approach. Instead of breaking text into tiny pieces, it groups text into larger, coherent segments that preserve meaning and context. When you’re building applications like chatbots or search systems, you need these larger chunks to maintain the flow of ideas.

Think about reading a research paper. You wouldn’t want each sentence scattered randomly—you’d want related sentences grouped together so the ideas make sense. That’s exactly what chunking does for AI systems.

Here’s how it works in practice:

Original text: “AI models process text efficiently. They rely on tokens to capture meaning and context. Chunking allows better retrieval.”
Chunk 1: “AI models process text efficiently.”
Chunk 2: “They rely on tokens to capture meaning and context.”
Chunk 3: “Chunking allows better retrieval.”

Modern chunking strategies have become quite sophisticated:

Fixed-length chunking creates chunks of a specific size (like 500 words or 1000 characters). It’s predictable but sometimes breaks up related ideas awkwardly.
Semantic chunking is smarter—it looks for natural breakpoints where topics change, using AI to understand when ideas shift from one concept to another.
Recursive chunking works hierarchically, first trying to split at paragraph breaks, then sentences, then smaller units if needed.
Sliding window chunking creates overlapping chunks to ensure important context isn’t lost at boundaries.

The Key Differences That Matter

Understanding when to use each approach makes all the difference in your AI applications:

Size: Tokenization produces tiny pieces (words, parts of words); chunking produces bigger pieces (sentences, paragraphs).
Goal: Tokenization makes text digestible for AI models; chunking keeps meaning intact for humans and AI.
When you use it: Tokenization for training models and processing input; chunking for search systems and question answering.
What you optimize for: Tokenization for processing speed and vocabulary size; chunking for context preservation and retrieval accuracy.

Why This Matters for Real Applications

For AI Model Performance

When you’re working with language models, tokenization directly affects how much you pay and how fast your system runs. Models like GPT-4 charge by the token, so efficient tokenization saves money. Current models have different limits:

GPT-4: Around 128,000 tokens
Claude 3.5: Up to 200,000 tokens
Gemini 2.0 Pro: Up to 2 million tokens

Recent research shows that larger models actually work better with bigger vocabularies. For example, while LLaMA-2 70B uses about 32,000 different tokens, it would probably perform better with around 216,000. This matters because the right vocabulary size affects both performance and efficiency.

For Search and Question-Answering Systems

Chunking strategy can make or break your RAG (Retrieval-Augmented Generation) system. If your chunks are too small, you lose context. Too big, and you overwhelm the model with irrelevant information. Get it right, and your system provides accurate, helpful answers. Get it wrong, and you get hallucinations and poor results. Companies building enterprise AI systems have found that smart chunking strategies significantly reduce those frustrating cases where AI makes up facts or gives nonsensical answers.

Where You’ll Use Each Approach

Tokenization is Essential For:

Training new models – You can’t train a language model without first tokenizing your training data. The tokenization strategy affects everything about how well the model learns.
Fine-tuning existing models – When you adapt a pre-trained model for your specific domain (like medical or legal text), you need to carefully consider whether the existing tokenization works for your specialized vocabulary.
Cross-language applications – Subword tokenization is particularly helpful when working with languages that have complex word structures or when building multilingual systems.

Chunking is Critical For:

Building company knowledge bases – When you want employees to ask questions and get accurate answers from your internal documents, proper chunking ensures the AI retrieves relevant, complete information.
Document analysis at scale – Whether you’re processing legal contracts, research papers, or customer feedback, chunking helps maintain document structure and meaning.
Search systems – Modern search goes beyond keyword matching. Semantic chunking helps systems understand what users really want and retrieve the most relevant information.

Current Best Practices (What Actually Works)
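As a hands-on companion to the comparison above, here is a compact sketch contrasting the two operations: subword tokenization with a Hugging Face tokenizer versus a naive fixed-size, overlapping word chunker. The choice of “bert-base-uncased” and the window sizes are illustrative assumptions, not recommendations.

# Tokenization (tiny model-facing units) vs. chunking (larger retrieval units).
from transformers import AutoTokenizer

text = ("AI models process text efficiently. They rely on tokens to capture meaning "
        "and context. Chunking allows better retrieval.")

# Tokenization: subword units the model actually consumes.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize(text)[:12])

# Chunking: overlapping word windows that preserve meaning for retrieval.
def chunk_words(text: str, size: int = 12, overlap: int = 4) -> list[str]:
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

for c in chunk_words(text):
    print("-", c)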

Chunking vs. Tokenization: Key Differences in AI Text Processing Read the article »

AI, Committee, News, Uncategorized

Accenture Research Introduce MCP-Bench: A Large-Scale Benchmark that Evaluates LLM Agents in Complex Real-World Tasks via MCP Servers

Modern large language models (LLMs) have moved far beyond simple text generation. Many of the most promising real-world applications now require these models to use external tools—like APIs, databases, and software libraries—to solve complex tasks. But how do we truly know if an AI agent can plan, reason, and coordinate across tools the way a human assistant would? This is the question MCP-Bench sets out to answer.

The Problem with Existing Benchmarks

Most previous benchmarks for tool-using LLMs focused on one-off API calls or narrow, artificially stitched workflows. Even the more advanced evaluations rarely tested how well agents could discover and chain the right tools from fuzzy, real-world instructions—let alone whether they could coordinate across multiple domains and ground their answers in actual evidence. In practice, this means that many models perform well on artificial tasks but struggle with the complexity and ambiguity of real-world scenarios.

Source: https://arxiv.org/abs/2508.20453

What Makes MCP-Bench Different

A team of researchers from Accenture introduces MCP-Bench, a Model Context Protocol (MCP)-based benchmark for LLM agents that directly connects them to 28 real-world servers, each offering a set of tools across various domains—such as finance, scientific computing, healthcare, travel, and academic research. In total, the benchmark covers 250 tools, arranged so that realistic workflows require both sequential and parallel tool use, sometimes across multiple servers.

Key features:

Authentic tasks: Tasks are designed to reflect real user needs, such as planning a multi-stop camping trip (involving geospatial, weather, and park information), conducting biomedical research, or converting units in scientific calculations.
Fuzzy instructions: Rather than specifying tools or steps, tasks are described in natural, sometimes vague language—requiring the agent to infer what to do, much like a human assistant would.
Tool diversity: The benchmark includes everything from medical calculators and scientific computing libraries to financial analytics, icon collections, and even niche tools like I Ching divination services.
Quality control: Tasks are automatically generated, then filtered for solvability and real-world relevance. Each task also comes in two forms: a precise technical description (used for evaluation) and a conversational, fuzzy version (what the agent sees).
Multi-layered evaluation: Both automated metrics (like “did the agent use the correct tool and provide the right parameters?”) and LLM-based judges (to assess planning, grounding, and reasoning) are used.

How Agents Are Tested

An agent running MCP-Bench receives a task (e.g., “Plan a camping trip to Yosemite with detailed logistics and weather forecasts”) and must decide, step by step, which tools to call, in what order, and how to use their outputs. These workflows can span multiple rounds of interaction, with the agent synthesizing results into a coherent, evidence-backed answer.

Each agent is evaluated on several dimensions, including:

Tool selection: Did it choose the right tools for each part of the task?
Parameter accuracy: Did it provide complete and correct inputs to each tool?
Planning and coordination: Did it handle dependencies and parallel steps properly?
Evidence grounding: Does its final answer directly reference the outputs from tools, avoiding unsupported claims?

What the Results Show

The researchers tested 20 state-of-the-art LLMs across 104 tasks. The main findings:

Basic tool use is solid: Most models could correctly call tools and handle parameter schemas, even for complex or domain-specific tools.
Planning is still hard: Even the best models struggled with long, multi-step workflows that required not just selecting tools, but also understanding when to move to the next step, which parts can run in parallel, and how to handle unexpected results.
Smaller models fall behind: As tasks became more complex, especially those spanning multiple servers, smaller models were more likely to make mistakes, repeat steps, or miss subtasks.
Efficiency varies widely: Some models needed many more tool calls and rounds of interaction to achieve the same results, suggesting inefficiencies in planning and execution.
Humans are still needed for nuance: While the benchmark is automated, human checks ensure tasks are realistic and solvable—a reminder that truly robust evaluation still benefits from human expertise.

Source: https://arxiv.org/abs/2508.20453

Why This Research Matters

MCP-Bench provides a practical way to assess how well AI agents can act as “digital assistants” in real-world settings—situations where users aren’t always precise and the right answer depends on weaving together information from many sources. The benchmark exposes gaps in current LLM capabilities, especially around complex planning, cross-domain reasoning, and evidence-based synthesis—areas crucial for deploying AI agents in business, research, and specialized fields.

Summary

MCP-Bench is a serious, large-scale test for AI agents using real tools and real tasks, with no shortcuts or artificial setups. It shows what current models do well and where they still fall short. For anyone building or evaluating AI assistants, these results—and the benchmark itself—are likely to be a useful reality check.

Check out the Paper and GitHub Page. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post Accenture Research Introduce MCP-Bench: A Large-Scale Benchmark that Evaluates LLM Agents in Complex Real-World Tasks via MCP Servers appeared first on MarkTechPost.
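To make two of the automated checks described above more tangible, here is a simplified sketch of scoring tool selection and parameter accuracy against gold annotations. The data schema, tool names, and metrics are hypothetical and are not MCP-Bench’s actual format or code.

# Toy scoring of tool selection and parameter accuracy (hypothetical schema).
gold = {
    "tools": {"search_parks", "get_forecast"},
    "params": {"get_forecast": {"location": "Yosemite Valley", "days": 3}},
}
agent_trace = [
    {"tool": "search_parks", "params": {"query": "Yosemite"}},
    {"tool": "get_forecast", "params": {"location": "Yosemite Valley", "days": 3}},
]

used = {step["tool"] for step in agent_trace}
tool_recall = len(used & gold["tools"]) / len(gold["tools"])

def param_accuracy(trace, gold_params):
    # Fraction of expected (tool, parameter, value) triples matched by some call.
    hits, total = 0, 0
    for tool, expected in gold_params.items():
        calls = [s["params"] for s in trace if s["tool"] == tool]
        for key, value in expected.items():
            total += 1
            hits += any(call.get(key) == value for call in calls)
    return hits / total if total else 1.0

print(f"tool recall: {tool_recall:.0%}, parameter accuracy: {param_accuracy(agent_trace, gold['params']):.0%}")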

Accenture Research Introduce MCP-Bench: A Large-Scale Benchmark that Evaluates LLM Agents in Complex Real-World Tasks via MCP Servers Read the article »
