YouZum


AI, Committee, News, Uncategorized

Rebuilding the data stack for AI

Artificial intelligence may be dominating boardroom agendas, but many enterprises are discovering that the biggest obstacle to meaningful adoption is the state of their data. While consumer-facing AI tools have dazzled users with speed and ease, enterprise leaders are discovering that deploying AI at scale requires something far less glamorous but far more consequential: data infrastructure that is unified, governed, and fit for purpose. That gap between AI ambition and enterprise readiness is becoming one of the defining challenges of this next phase of digital transformation. As Bavesh Patel, senior vice president of Databricks, puts it, “the quality of that AI and how effective that AI is, is really dependent on information in your organization.” Yet in many companies, that information remains fragmented across legacy systems, siloed applications, and disconnected formats, making it nearly impossible for AI systems to generate trustworthy, context-rich outputs. “Really, the big competitive differentiator for most organizations is their own data and then their third-party data that they can add to it,” says Patel. For enterprise AI to deliver value, data must be consolidated into open formats, governed with precision, and made accessible across functions. Without that foundation, businesses risk “terrible AI,” as Patel bluntly describes it. That means moving beyond siloed SaaS platforms and disconnected dashboards toward a unified, open data architecture capable of combining structured and unstructured data, preserving real-time context, and enforcing rigorous access controls. When the groundwork is laid correctly, organizations can move toward measurable outcomes, unlocking efficiencies, automating complex workflows, and even launching entirely new lines of business. That value focus is critical, says Rajan Padmanabhan, unit technology officer at Infosys, especially as enterprises seek precision in the outputs driving business decisions. Rather than treating AI initiatives as isolated innovation projects, leading companies are tying AI deployment directly to business metrics, using governance frameworks to determine what delivers results and what should be abandoned quickly. “We see this big opportunity just with AI literacy with business users, where they’re very eager to understand how they should be thinking about AI,” adds Patel. “What does AI mean when you peel the covers? What are the pieces and the building blocks that you need to put in place, both from a technology and a training and an enablement standpoint?” The possibilities ahead are substantial. As AI agents evolve from copilots into autonomous operators capable of managing workflows and transactions, the organizations that win will be those that build the right foundation now. “What we are seeing as a new way of thinking is moving from a system of execution or a system of engagement to a system of action,” notes  Padmanabhan. “That is the new way we see the road ahead.” The future of AI in the enterprise will be determined by whether businesses can turn fragmented information into a strategic asset capable of powering both smarter decisions and entirely new ways of operating. This episode of Business Lab is produced in partnership with Infosys Topaz. Full Transcript: Megan Tatum: From MIT Technology Review, I’m Megan Tatum, and this is Business Lab, the show that helps business leaders make sense of new technologies coming out of the lab and into the marketplace. This episode is produced in partnership with Infosys Topaz. 
Now, recent advancements in AI may have unlocked some compelling new industrial applications, but a reliance on inadequate data models means that many enterprises are hitting a brick wall. AI and agentic AI in particular place a whole new set of demands on data. The technology requires greater access, context, and guardrails to operate effectively. Existing data models often fall short. They’re too fragmented or siloed. Data itself often lacks quality. To bridge the gap, enterprises require an AI-ready upgrade. Two words for you: data reconfigured. My guests today are Bavesh Patel, senior vice president for Go-to-Market at Databricks, and Rajan Padmanabhan, unit technology officer for data analytics and AI at Infosys. Welcome, Bavesh and Rajan. Rajan Padmanabhan: Thank you. Thanks for having us. Bavesh Patel: Thanks for having us. Megan: Fantastic. Thank you both so much for joining us today. Bavesh, if I could come to you first, when we talk about AI-ready data, what exactly do we mean? What new demands does AI place on data, and how does this impact the way it needs to be structured and used? Bavesh: Yeah. Great question. Appreciate you hosting us today. I think that obviously the whole world is enamored with AI because of all of the power that we can all see as users. AI is now democratized across hundreds of millions of users. And when we think about enterprises and businesses using AI, the quality of that AI and how effective that AI is, is really dependent on information in your organization, and that’s data. And what we found is that most enterprises, their data is kind of locked away in these different applications and different systems. And it’s very difficult to get a good view of, what is all my data? How trustworthy is it? How recent and fresh is it? And all of that is being injected into the AI. Unless you have a proper understanding of your data, the ability to ensure that it’s data that’s accurate and that can be used so that the AI can take advantage of it, you’re actually going to end up having terrible AI. We see a lot of customers spend time on cleansing their data, organizing their data, making sure it’s access controlled correctly, and that tends to be the fuel of good AI. Megan: Yeah. It’s such a foundational thing, isn’t it? But it can be missed, I think, quite easily. Rajan, what difference can having AI-ready data really make for enterprises as they unlock that full potential of AI and its applications? Rajan: First and foremost, thanks for having us. It’s a pleasure. I think in continuation of what Bavesh talked about, see, data and AI is pretty synonymous. And similarly, the consumer AI and enterprise AI and

Rebuilding the data stack for AI Read the article »

AI, Committee, News, Uncategorized

A Coding Implementation on kvcached for Elastic KV Cache Memory, Bursty LLM Serving, and Multi-Model GPU Sharing

In this tutorial, we explore kvcached, a dynamic KV-cache implementation on top of vLLM, to understand how dynamic KV-cache allocation transforms GPU memory usage for large language models. We begin by setting up the environment and deploying lightweight Qwen2.5 models through an OpenAI-compatible API, ensuring a realistic inference workflow. We then design controlled experiments where we simulate bursty workloads to observe how memory behaves under both elastic and static allocation strategies. Through systematic measurement and visualization, we directly compare VRAM utilization and latency, and extend the setup to a multi-model scenario where we observe how memory flexibly shifts across active workloads in real time. Copy CodeCopiedUse a different Browser import os, sys, time, json, subprocess, threading, signal, shutil from pathlib import Path def sh(cmd, check=True): return subprocess.run(cmd, check=check, shell=isinstance(cmd, str)) try: import torch except ImportError: sh([sys.executable, “-m”, “pip”, “install”, “-q”, “torch”]) import torch assert torch.cuda.is_available(), “No GPU detected. In Colab: Runtime > Change runtime type > GPU.” props = torch.cuda.get_device_properties(0) print(f”[GPU] {torch.cuda.get_device_name(0)} ” f”({props.total_memory / 1e9:.1f} GB, ” f”compute capability {props.major}.{props.minor})”) def pip_install(*pkgs, extra=()): subprocess.run([sys.executable, “-m”, “pip”, “install”, “-q”, *pkgs, *extra], check=True) print(“[install] vLLM …”) pip_install(“vllm==0.10.2”) print(“[install] kvcached (compiles a small CUDA extension) …”) pip_install(“kvcached”, extra=[“–no-build-isolation”]) print(“[install] misc (matplotlib, requests, pynvml) …”) pip_install(“matplotlib”, “requests”, “pynvml”, “numpy”) MODEL_A = “Qwen/Qwen2.5-0.5B-Instruct” MODEL_B = “Qwen/Qwen2.5-1.5B-Instruct” PORT_A, PORT_B = 8001, 8002 MAX_MODEL_LEN = 2048 We start by setting up the environment and verifying that a GPU is available for our experiments. We install all required dependencies including vLLM and kvcached along with supporting libraries. We then define our model configurations and ports to prepare for launching the inference servers. Copy CodeCopiedUse a different Browser def launch_vllm(model, port, kvcached=True, gpu_mem_util=0.55, log_path=None): “””Start a vLLM OpenAI-compatible server as a subprocess. 
With kvcached=True the autopatch hooks replace vLLM’s KV-cache allocator with the elastic one.””” env = os.environ.copy() env[“VLLM_USE_V1”] = “1” if kvcached: env[“ENABLE_KVCACHED”] = “true” env[“KVCACHED_AUTOPATCH”] = “1” env[“KVCACHED_IPC_NAME”] = f”kvc_{port}” cmd = [ sys.executable, “-m”, “vllm.entrypoints.openai.api_server”, “–model”, model, “–port”, str(port), “–max-model-len”, str(MAX_MODEL_LEN), “–disable-log-requests”, “–no-enable-prefix-caching”, “–enforce-eager”, ] if not kvcached: cmd += [“–gpu-memory-utilization”, str(gpu_mem_util)] log = open(log_path or os.devnull, “w”) proc = subprocess.Popen(cmd, env=env, stdout=log, stderr=subprocess.STDOUT, preexec_fn=os.setsid) return proc, log def wait_ready(port, timeout=420): import requests url = f”http://localhost:{port}/v1/models” t0 = time.time() while time.time() – t0 < timeout: try: if requests.get(url, timeout=2).status_code == 200: return True except Exception: pass time.sleep(3) raise TimeoutError(f”vLLM on port {port} didn’t come up within {timeout}s”) def shutdown(proc, log): if proc and proc.poll() is None: try: os.killpg(os.getpgid(proc.pid), signal.SIGTERM) proc.wait(timeout=45) except Exception: os.killpg(os.getpgid(proc.pid), signal.SIGKILL) if log and not log.closed: log.close() time.sleep(3) We implement helper functions to launch and manage the vLLM server with and without kvcached enabled. We configure environment variables to activate dynamic KV-cache behavior and ensure proper server initialization. We also define utilities to wait for server readiness and safely shut down processes after execution. Copy CodeCopiedUse a different Browser import pynvml pynvml.nvmlInit() NV_HANDLE = pynvml.nvmlDeviceGetHandleByIndex(0) def vram_used_mb(): info = pynvml.nvmlDeviceGetMemoryInfo(NV_HANDLE) return info.used / (1024 ** 2) class MemorySampler(threading.Thread): def __init__(self, interval=0.2): super().__init__(daemon=True) self.interval = interval self.samples = [] self._stop = threading.Event() def run(self): t0 = time.time() while not self._stop.is_set(): self.samples.append((time.time() – t0, vram_used_mb())) time.sleep(self.interval) def stop(self): self._stop.set(); self.join() import requests from concurrent.futures import ThreadPoolExecutor PROMPTS = [ “Explain quantum entanglement to a curious 10-year-old.”, “Write a Python function that detects cycles in a linked list.”, “Summarize the plot of Hamlet in one paragraph.”, “List 5 surprising household uses for baking soda with explanations.”, “Compose a vivid haiku about rainy Monday mornings.”, “Describe the Fermi paradox and three plausible resolutions.”, “Translate ‘knowledge is power’ into French, German, and Japanese.”, “Explain the difference between TCP and UDP with real examples.”, ] def bursty_workload(port, model, n_bursts=3, burst_size=6, pause=6.0, max_tokens=180): “””Fire n_bursts waves of burst_size concurrent requests with an idle gap between waves. 
The idle gap is where kvcached releases physical VRAM — a static-allocation engine simply cannot.””” url = f”http://localhost:{port}/v1/chat/completions” def one(i): body = { “model”: model, “messages”: [{“role”: “user”, “content”: PROMPTS[i % len(PROMPTS)]}], “max_tokens”: max_tokens, “temperature”: 0.7, } t0 = time.time() r = requests.post(url, json=body, timeout=180) r.raise_for_status() return time.time() – t0 latencies = [] with ThreadPoolExecutor(max_workers=burst_size) as ex: for b in range(n_bursts): print(f” burst {b+1}/{n_bursts} ({burst_size} concurrent)”) latencies += list(ex.map(one, range(burst_size))) if b < n_bursts – 1: time.sleep(pause) return latencies We initialize GPU memory tracking using pynvml to monitor VRAM usage in real time. We create a background sampling thread that continuously records memory consumption during experiments. We then define a bursty workload generator that sends concurrent requests to simulate realistic LLM usage patterns. Copy CodeCopiedUse a different Browser print(“n=== Experiment 1: vLLM + kvcached ===”) proc, log = launch_vllm(MODEL_A, PORT_A, kvcached=True, log_path=”/tmp/vllm_kvc.log”) try: wait_ready(PORT_A) idle_kvc = vram_used_mb() print(f” Idle VRAM after load (weights only): {idle_kvc:.0f} MB”) sampler = MemorySampler(); sampler.start() lat_kvc = bursty_workload(PORT_A, MODEL_A) time.sleep(6) sampler.stop() mem_kvc = sampler.samples finally: shutdown(proc, log) print(“n=== Experiment 2: vLLM baseline (static KV allocation) ===”) proc, log = launch_vllm(MODEL_A, PORT_A, kvcached=False, log_path=”/tmp/vllm_base.log”) try: wait_ready(PORT_A) idle_base = vram_used_mb() print(f” Idle VRAM (weights + pre-reserved KV pool): {idle_base:.0f} MB”) sampler = MemorySampler(); sampler.start() lat_base = bursty_workload(PORT_A, MODEL_A) time.sleep(6) sampler.stop() mem_base = sampler.samples finally: shutdown(proc, log) We run the first experiment with kvcached enabled and capture both memory usage and latency metrics. We then execute the same workload under a baseline static allocation setup for comparison. We collect and store all results to enable a clear side-by-side evaluation of both approaches. Copy CodeCopiedUse a different Browser import numpy as np import matplotlib.pyplot as plt fig, axes = plt.subplots(1, 2, figsize=(14, 4.5)) tk, mk = zip(*mem_kvc); tb, mb = zip(*mem_base) axes[0].plot(tk, mk, label=”with kvcached”, linewidth=2, color=”#1f77b4″) axes[0].plot(tb, mb, label=”baseline (static)”, linewidth=2, linestyle=”–“, color=”#d62728″) axes[0].axhline(idle_kvc, color=”#1f77b4″, alpha=.3, linestyle=”:”) axes[0].axhline(idle_base, color=”#d62728″, alpha=.3, linestyle=”:”) axes[0].set_xlabel(“time (s)”); axes[0].set_ylabel(“GPU memory used (MB)”) axes[0].set_title(“VRAM under a bursty workloadn(dotted = idle-baseline VRAM)”) axes[0].grid(alpha=.3); axes[0].legend() axes[1].boxplot([lat_kvc, lat_base], labels=[“kvcached”, “baseline”]) axes[1].set_ylabel(“request latency (s)”) axes[1].set_title(f”Latency across {len(lat_kvc)} requests”) axes[1].grid(alpha=.3) plt.tight_layout() plt.savefig(“/content/kvcached_single_model.png”, dpi=120, bbox_inches=”tight”) plt.show() print(“n— Single-model summary ——————————————–“) print(f” Idle VRAM kvcached: {idle_kvc:>6.0f} MB ” f”baseline: {idle_base:>6.0f} MB ” f”(savings: {idle_base – idle_kvc:>5.0f} MB)”) print(f” Peak VRAM kvcached: {max(mk):>6.0f} MB ” f”baseline: {max(mb):>6.0f} MB”) print(f” Median lat. kvcached: {np.median(lat_kvc):>6.2f} s
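The excerpt above ends with the single-model comparison, while the introduction also promises a multi-model scenario in which memory shifts between active workloads. The following is a minimal sketch of that extension, reusing the helpers defined earlier (launch_vllm, wait_ready, MemorySampler, bursty_workload, shutdown); the burst sizes, pauses, and log paths are illustrative assumptions rather than values from the article.

# Sketch of the multi-model experiment: two kvcached-enabled servers share one GPU,
# and physical VRAM should shift toward whichever model is currently handling a burst.
print("\n=== Experiment 3: two kvcached models sharing one GPU ===")
proc_a, log_a = launch_vllm(MODEL_A, PORT_A, kvcached=True, log_path="/tmp/vllm_multi_a.log")
proc_b, log_b = launch_vllm(MODEL_B, PORT_B, kvcached=True, log_path="/tmp/vllm_multi_b.log")
try:
    wait_ready(PORT_A); wait_ready(PORT_B)
    sampler = MemorySampler(); sampler.start()
    # Drive the models alternately so their memory demand peaks at different times.
    lat_a = bursty_workload(PORT_A, MODEL_A, n_bursts=2, burst_size=4, pause=8.0)
    lat_b = bursty_workload(PORT_B, MODEL_B, n_bursts=2, burst_size=4, pause=8.0)
    time.sleep(6)
    sampler.stop()
    mem_multi = sampler.samples
finally:
    shutdown(proc_a, log_a)
    shutdown(proc_b, log_b)
print(f"Peak shared VRAM: {max(m for _, m in mem_multi):.0f} MB | "
      f"median latency A: {np.median(lat_a):.2f}s, B: {np.median(lat_b):.2f}s")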

A Coding Implementation on kvcached for Elastic KV Cache Memory, Bursty LLM Serving, and Multi-Model GPU Sharing Read the article »

AI, Committee, News, Uncategorized

xAI Launches grok-voice-think-fast-1.0: Topping τ-voice Bench at 67.3%, Outperforming Gemini, GPT Realtime, and More

Building a production-grade voice AI agent is one of the hardest engineering challenges in applied machine learning today. It is not just about transcription accuracy. You need a system that can hold context across a five-minute conversation, invoke external APIs mid-call without an awkward pause, gracefully recover when a caller corrects themselves, and do all of this reliably when the audio is degraded by background noise, a heavy accent, or a dropped word. Most current systems handle one or two of those requirements. xAI’s newly released grok-voice-think-fast-1.0 is making a serious claim to handle all of them — and the benchmark numbers back it up. Available via the xAI API, grok-voice-think-fast-1.0 is the xAI’s new flagship voice model. It is purpose-built for complex, ambiguous, multi-step workflows across customer support, sales, and enterprise applications, and it is already deployed at scale powering Starlink’s live phone operations. What Makes a Voice Agent Full-Duplex? Before unpacking the benchmark results, it is worth understanding what kind of model grok-voice-think-fast-1.0 is. It is evaluated on the (Tau) τ-voice Bench as a full-duplex voice agent. The system processes incoming speech and generates responses simultaneously, rather than waiting for the speaker to stop before it begins thinking. This is how humans communicate in real conversations. It is also why handling interruptions is a genuinely hard technical problem: the model must decide in real time whether a mid-sentence utterance is a correction, a clarification, or just a filler word, and adjust its behavior accordingly. The τ-voice Bench evaluates agents specifically under these realistic conditions: noise, accents, interruptions, and natural turn-taking, making it a more relevant measure for production deployments than traditional clean-audio ASR benchmarks. https://x.ai/news/grok-voice-think-fast-1 The Numbers: A Significant Lead The benchmark results xAI published are striking in how large the gaps are. On the τ-voice Bench overall leaderboard, grok-voice-think-fast-1.0 scores 67.3%, compared to 43.8% for Gemini 3.1 Flash Live, 38.3% for Grok Voice Fast 1.0 (xAI’s own previous model), and 35.3% for GPT Realtime 1.5. Breaking that down by vertical tells an even clearer story: In Retail — covering order handling, returns, and promotions in noisy environments — grok-voice-think-fast-1.0 scores 62.3%, followed by Grok Voice Fast 1.0 at 45.6%, Gemini 3.1 Flash Live at 44.7%, and GPT Realtime 1.5 at 38.6%. In Airline — booking changes, delays, and complex itineraries — the scores are 66% for Grok Voice Think Fast 1.0, 64% for Grok Voice Fast 1.0, 40% for Gemini 3.1 Flash Live, and 36% for GPT Realtime 1.5. The most dramatic gap appears in Telecom: plan changes, billing disputes, and technical troubleshooting — where grok-voice-think-fast-1.0 achieves 73.7%, while Grok Voice Fast 1.0 scores 40.4%, Gemini 3.1 Flash Live 21.9%, and GPT Realtime 1.5 21.1%. A 33-percentage-point lead over the next competitor in a single vertical is not a marginal improvement. That is an architectural advantage. Real-Time Reasoning With Zero Added Latency One of the most technically significant design decisions in this model is how reasoning is handled. grok-voice-think-fast-1.0 performs reasoning in the background, thinking through challenging queries and workflows in real time with no impact on response latency. 
For AI teams, this is the difficult part to build: reasoning models traditionally increase response time because they generate intermediate ‘thinking’ tokens before producing an answer. Hiding that computation from the conversational latency budget, while still benefiting from it, requires careful architecture work. The practical payoff is accuracy without sluggishness. The xAI team demonstrates this with a representative edge case: when asked “Which months of the year are spelled with the letter X?”, grok-voice-think-fast-1.0 correctly responds that no month contains the letter X. By contrast, the competing models confidently and incorrectly answered “February.” This class of error, where a model produces a plausible-sounding but wrong answer with high confidence, is particularly damaging in voice interfaces because users have no text output to cross-check. Precise Data Entry and Read-Back A core workflow capability of grok-voice-think-fast-1.0 is structured data capture and read-back. The model can seamlessly collect email addresses, physical street addresses, phone numbers, full names, account numbers, and other structured data, even when information is spoken quickly or with a strong accent. It gracefully handles speech disfluencies and accepts natural corrections as a human would, then reads back the confirmed data to the user. xAI illustrates this with a concrete example. A caller says: “Yep, it’s 1410, uh wait, 1450 Page Mill Street. Actually no sorry, that’s Page Mill Road.” The model processes the spoken corrections in real time, invokes a search_address tool with the corrected parameter “1450 Page Mill Rd”, and reads back the normalized address for user confirmation. For data teams that have spent time building post-call cleanup pipelines to extract structured fields from messy transcripts, this native capture-and-read-back capability represents a meaningful reduction in downstream processing complexity. The model has been battle-tested in the toughest real-world conditions: telephony audio, background noise, heavy accents, and frequent interruptions. It natively supports 25+ languages, making it ideal for global deployments across use cases including customer support, phone sales, appointment booking, and restaurant reservations. The Starlink Deployment: Production at Scale The most compelling validation of grok-voice-think-fast-1.0 is not the benchmark alone but its live deployment. Grok Voice powers the full phone sales and customer support operation for Starlink at +1 (888) GO STARLINK. The numbers xAI discloses from this deployment are operationally significant: a 20% sales conversion rate (meaning one in five callers making a sales inquiry purchases Starlink service while on the phone with Grok), a 70% autonomous resolution rate for customer support inquiries with no human in the loop, and a single agent operating across 28 distinct tools spanning hundreds of support and sales workflows. Key Takeaways grok-voice-think-fast-1.0 leads the τ-voice Bench with a 67.3% score, outperforming Gemini 3.1 Flash Live (43.8%), Grok Voice Fast 1.0 (38.3%), and GPT Realtime 1.5 (35.3%). The model performs background reasoning with zero added latency, allowing it to think through complex, multi-step workflows in real time without slowing down conversational responses. Precise data entry and read-back is a native capability, enabling

xAI Launches grok-voice-think-fast-1.0: Topping τ-voice Bench at 67.3%, Outperforming Gemini, GPT Realtime, and More Read the article »

AI, Committee, News, Uncategorized

A Coding Tutorial on Datashader for Rendering Massive Datasets with High-Performance Python Visual Analytics

In this tutorial, we explore Datashader, a powerful, high-performance visualization library for rendering massive datasets that quickly overwhelm traditional plotting tools. We work through its full rendering pipeline in Google Colab, starting from dense point clouds and reduction-based aggregations to categorical rendering, line visualizations, raster data, quadmesh grids, compositing, and dashboard-style analytical views. As we move through each section, we focus on how Datashader transforms raw large-scale data into meaningful visual structure with speed, flexibility, and visual clarity, while keeping Matplotlib as the final presentation layer. Copy CodeCopiedUse a different Browser import subprocess, sys subprocess.check_call([sys.executable, “-m”, “pip”, “install”, “-q”, “datashader”, “colorcet”, “numba”, “scipy”]) import numpy as np import pandas as pd import datashader as ds import datashader.transfer_functions as tf from datashader import reductions as rd import colorcet as cc import matplotlib.pyplot as plt import matplotlib.colors as mcolors from matplotlib.gridspec import GridSpec from scipy.stats import multivariate_normal import time, warnings warnings.filterwarnings(“ignore”) print(“Datashader version:”, ds.__version__) def show(img, title=””, ax=None, figsize=(6, 5)): standalone = ax is None if standalone: fig, ax = plt.subplots(figsize=figsize) rgba = img.to_pil() ax.imshow(rgba, origin=”upper”, aspect=”auto”) ax.set_title(title, fontsize=11, fontweight=”bold”) ax.axis(“off”) if standalone: plt.tight_layout() plt.show() print(“n=== SECTION 1: Core Pipeline ===”) rng = np.random.default_rng(42) N = 2_000_000 x = np.concatenate([rng.normal(-1, 0.5, N//3), rng.normal( 1, 0.5, N//3), rng.normal( 0, 1.5, N//3)]) y = np.concatenate([rng.normal(-1, 0.5, N//3), rng.normal( 1, 0.5, N//3), rng.normal( 0, 0.5, N//3)]) df_base = pd.DataFrame({“x”: x, “y”: y}) canvas = ds.Canvas(plot_width=600, plot_height=500, x_range=(-4, 4), y_range=(-4, 4)) agg = canvas.points(df_base, “x”, “y”, agg=rd.count()) fig, axes = plt.subplots(1, 3, figsize=(15, 4)) combos = [ (“Linear / blues”, tf.shade(agg, cmap=cc.blues, how=”linear”)), (“Log / fire”, tf.shade(agg, cmap=cc.fire, how=”log” )), (“Eq-hist / bmy”, tf.shade(agg, cmap=cc.bmy, how=”eq_hist”)), ] for ax, (title, img) in zip(axes, combos): show(img, title, ax=ax) plt.suptitle(“Section 1 – 2 M points: Linear vs Log vs Eq-Hist normalisation”, fontsize=13, fontweight=”bold”) plt.tight_layout() plt.show() print(“n=== SECTION 2: Reduction Types ===”) n_actual = len(df_base) df_base[“value”] = rng.exponential(scale=2, size=n_actual) df_base[“label”] = pd.Categorical( rng.choice([“A”, “B”, “C”], size=n_actual), categories=[“A”, “B”, “C”] ) canvas2 = ds.Canvas(plot_width=400, plot_height=350, x_range=(-4, 4), y_range=(-4, 4)) reductions_cfg = [ (“count()”, rd.count(), cc.kbc), (“sum(value)”, rd.sum(“value”), cc.CET_L3), (“mean(value)”, rd.mean(“value”), cc.CET_D4), (“std(value)”, rd.std(“value”), cc.CET_L16), (“min(value)”, rd.min(“value”), cc.CET_L17), (“max(value)”, rd.max(“value”), cc.bgyw), (“var(value)”, rd.var(“value”), cc.CET_L18), (“count_cat(label)”, rd.count_cat(“label”), None), ] fig, axes = plt.subplots(2, 4, figsize=(18, 9)) axes = axes.flat for ax, (name, agg_fn, cmap) in zip(axes, reductions_cfg): agg_r = canvas2.points(df_base, “x”, “y”, agg=agg_fn) if cmap is None: img = tf.shade(agg_r, color_key={“A”:”#e41a1c”,”B”:”#377eb8″,”C”:”#4daf4a”}) else: img = tf.shade(agg_r, cmap=cmap, how=”eq_hist”) show(img, name, ax=ax) plt.suptitle(“Section 
2 – All Reduction Types on 2 M points”, fontsize=14, fontweight=”bold”) plt.tight_layout() plt.show() print(“n=== SECTION 3: Categorical Visualisation ===”) N_cat = 500_000 categories = [“Cluster A”, “Cluster B”, “Cluster C”, “Cluster D”] centers = [(-2, -2), (-2, 2), (2, -2), (2, 2)] colors = {“Cluster A”:”#e41a1c”,”Cluster B”:”#377eb8″, “Cluster C”:”#4daf4a”,”Cluster D”:”#ff7f00″} frames = [] for cat, (cx, cy) in zip(categories, centers): n = N_cat // len(categories) frames.append(pd.DataFrame({ “x”: rng.normal(cx, 0.8, n), “y”: rng.normal(cy, 0.8, n), “cat”: pd.Categorical([cat]*n, categories=categories), })) df_cat = pd.concat(frames, ignore_index=True) canvas3 = ds.Canvas(plot_width=500, plot_height=500, x_range=(-5, 5), y_range=(-5, 5)) agg_cat = canvas3.points(df_cat, “x”, “y”, agg=rd.count_cat(“cat”)) fig, axes = plt.subplots(1, 3, figsize=(16, 5)) img_raw = tf.shade(agg_cat, color_key=colors) show(img_raw, “Raw (no spread)”, ax=axes[0]) img_sp1 = tf.spread(tf.shade(agg_cat, color_key=colors), px=1) show(img_sp1, “Spread px=1″, ax=axes[1]) img_bg = tf.set_background(tf.shade(agg_cat, color_key=colors), color=”black”) show(img_bg, “Black background”, ax=axes[2]) for cat, col in colors.items(): axes[2].plot([], [], “o”, color=col, label=cat, markersize=8) axes[2].legend(loc=”lower right”, fontsize=8, framealpha=0.6) plt.suptitle(“Section 3 – Categorical Rendering (500 k points)”, fontsize=13, fontweight=”bold”) plt.tight_layout() plt.show() We install the required libraries and import everything needed to build a complete Datashader workflow in Google Colab. We define a helper function to display Datashader images with Matplotlib, which keeps the rendering pipeline simple and visually consistent. We then begin with the core Datashader pipeline, explore multiple reduction types, and show how categorical data can be rendered clearly using color keys, spreading, and background adjustments. 
Copy CodeCopiedUse a different Browser print(“n=== SECTION 4: Line Rendering ===”) n_series, n_steps = 5_000, 500 t = np.linspace(0, 1, n_steps) xs = np.tile(t, n_series) walks = np.cumsum(rng.normal(0, 0.05, (n_series, n_steps)), axis=1) ys = walks.ravel() series_id = np.repeat(np.arange(n_series), n_steps) df_lines = pd.DataFrame({“x”: xs, “y”: ys, “id”: series_id}) canvas4 = ds.Canvas(plot_width=700, plot_height=450, x_range=(0, 1), y_range=(-6, 6)) agg_lines = canvas4.line(df_lines, “x”, “y”, agg=rd.count(), line_width=1) fig, axes = plt.subplots(1, 2, figsize=(14, 5)) show(tf.shade(agg_lines, cmap=cc.fire, how=”eq_hist”), “5 000 random walks – eq_hist / fire”, ax=axes[0]) show(tf.shade(agg_lines, cmap=cc.blues, how=”log”), “5 000 random walks – log / blues”, ax=axes[1]) plt.suptitle(“Section 4 – Line / Time-Series Rendering”, fontsize=13, fontweight=”bold”) plt.tight_layout() plt.show() print(“n=== SECTION 5: Raster / Grid Data ===”) import xarray as xr res = 1000 lon = np.linspace(-180, 180, res) lat = np.linspace(-90, 90, res) LON, LAT = np.meshgrid(lon, lat) z = ( multivariate_normal.pdf(np.stack([LON, LAT], -1), mean=[30, 30], cov=[[800,0],[0,500]]) + multivariate_normal.pdf(np.stack([LON, LAT], -1), mean=[-60, -20], cov=[[600,0],[0,400]]) + 0.02 * rng.standard_normal((res, res))) da = xr.DataArray(z, dims=[“y”, “x”], coords={“x”: lon, “y”: lat}) canvas5 = ds.Canvas(plot_width=700, plot_height=400, x_range=(-180, 180), y_range=(-90, 90)) agg_raster = canvas5.raster(da) fig, axes = plt.subplots(1, 2, figsize=(14, 4)) show(tf.shade(agg_raster, cmap=cc.CET_L18, how=”eq_hist”), “Synthetic elevation – eq_hist”, ax=axes[0]) show(tf.shade(agg_raster, cmap=cc.rainbow, how=”linear”), “Synthetic elevation – linear”, ax=axes[1]) plt.suptitle(“Section 5 – Raster / Grid (xarray DataArray)”, fontsize=13, fontweight=”bold”) plt.tight_layout() plt.show() print(“n=== SECTION 6: QuadMesh / 2-D Grid Glyph ===”) lon6 = np.concatenate([np.linspace(-180, -60, 80), np.linspace(-60, 60, 30), np.linspace( 60, 180, 80)]) lat6 = np.concatenate([np.linspace(-90, -30, 40), np.linspace(-30, 30, 20), np.linspace( 30, 90, 40)]) LON6, LAT6 = np.meshgrid(lon6, lat6) def vortex(lon0, lat0, amp=1.0): return amp * np.exp(-((LON6-lon0)**2/1200 + (LAT6-lat0)**2/600)) field6 = vortex(-40, 30, 1.2) + vortex(120, -20, 0.9) + 0.05 * rng.standard_normal(LON6.shape) da6 = xr.DataArray(field6.astype(np.float32), dims=[“y”, “x”], coords={“x”: lon6, “y”: lat6}, name=”intensity”) canvas6 = ds.Canvas(plot_width=700, plot_height=380, x_range=(-180, 180), y_range=(-90, 90)) agg6 = canvas6.quadmesh(da6) canvas6z = ds.Canvas(plot_width=500, plot_height=400, x_range=(-80, 0), y_range=(0, 60)) agg6z = canvas6z.quadmesh(da6) field6_smooth = vortex(-40, 30, 1.0) + vortex(120, -20, 0.8) da6_diff = xr.DataArray((field6 – field6_smooth).astype(np.float32), dims=[“y”,”x”], coords={“x”: lon6, “y”: lat6}, name=”anomaly”) agg6d = canvas6.quadmesh(da6_diff) fig, axes = plt.subplots(1, 3, figsize=(18, 5)) show(tf.shade(agg6, cmap=cc.fire, how=”eq_hist”), “Global field – eq_hist”, ax=axes[0]) show(tf.shade(agg6z, cmap=cc.CET_L3, how=”linear”), “N. Atlantic zoom – linear”, ax=axes[1]) show(tf.shade(agg6d, cmap=cc.CET_D4, how=”eq_hist”), “Residual (anomaly) – eq_hist”,ax=axes[2]) plt.suptitle(“Section 6 – canvas.quadmesh(): non-uniform 2-D grids”, fontsize=13, fontweight=”bold”) plt.tight_layout() plt.show() We move beyond point clouds and use Datashader to
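The excerpt is cut off before the compositing and dashboard sections promised in the introduction. As a rough illustration of the compositing idea, here is a minimal sketch (not the tutorial's own section) that overlays the categorical clusters from Section 3 on a density layer of df_base using tf.stack; the canvas range and colormap choices are assumptions.

# Compositing sketch: shade two aggregations separately, then stack the images.
# tf.stack composites in order (later images on top); set_background fills transparency.
canvas7 = ds.Canvas(plot_width=500, plot_height=500, x_range=(-5, 5), y_range=(-5, 5))
img_density = tf.shade(canvas7.points(df_base, "x", "y", agg=rd.count()),
                       cmap=cc.kbc, how="eq_hist")
img_clusters = tf.shade(canvas7.points(df_cat, "x", "y", agg=rd.count_cat("cat")),
                        color_key=colors)
composite = tf.set_background(tf.stack(img_density, img_clusters), "black")
show(composite, "Compositing sketch: density underlay + categorical overlay")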

A Coding Tutorial on Datashader for Rendering Massive Datasets with High-Performance Python Visual Analytics Read the article »

AI, Committee, News, Uncategorized

RAG Without Vectors: How PageIndex Retrieves by Reasoning

Retrieval is where most RAG systems quietly break. Traditional pipelines rely on vector similarity—embedding queries and document chunks into the same space and fetching the “closest” matches. But similarity is a weak proxy for what we actually need: relevance grounded in reasoning. In long, professional documents—like financial reports, research papers, or legal texts—the right answer often isn’t in the most semantically similar paragraph. It requires navigating structure, understanding context, and performing multi-step reasoning across sections. This is exactly where vector-based RAG starts to fall apart. PageIndex is designed to solve this gap by rethinking retrieval from first principles. Instead of chunking documents and searching via embeddings, it builds a hierarchical table-of-contents-style tree index and uses LLMs to reason over that structure—much like a human expert scanning sections, drilling down, and connecting ideas. This enables a vectorless, reasoning-driven retrieval process that is more interpretable, traceable, and aligned with how knowledge is actually extracted from complex documents. By replacing similarity search with structured exploration and tree-based reasoning, PageIndex delivers significantly higher retrieval accuracy—demonstrated by its strong performance on benchmarks like FinanceBench—making it particularly effective for domains that demand precision and deep understanding. In this article, we’ll use PageIndex to index the seminal Transformer paper — “Attention Is All You Need” — and run two cross-cutting queries against it without a single vector or embedding. Instead of chunking the PDF and retrieving by similarity, PageIndex builds a hierarchical tree of the document’s sections, then uses GPT-5.4 to reason over node summaries and identify exactly which sections contain the answer — before reading a single word of full text. Setting up the dependencies For this tutorial, you would require PageIndex & OpenAI API keys. You can get the same from https://dash.pageindex.ai/api-keys and https://platform.openai.com/api-keys respectively. Copy CodeCopiedUse a different Browser pip install pageindex openai requests Copy CodeCopiedUse a different Browser from pageindex import PageIndexClient import pageindex.utils as utils import os from getpass import getpass PAGEINDEX_API_KEY = getpass(‘Enter PageIndex API Key: ‘) pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY) We import the OpenAI client and configure it with an API key to enable access to LLMs. Then, we define an asynchronous helper function that sends prompts to the model and returns the generated response. Copy CodeCopiedUse a different Browser import openai OPENAI_API_KEY = getpass(‘Enter OpenAI API Key: ‘) async def call_llm(prompt, model=”gpt-5.4″, temperature=0): client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY) response = await client.chat.completions.create( model=model, messages=[{“role”: “user”, “content”: prompt}], temperature=temperature ) return response.choices[0].message.content.strip() Building the PageIndex Tree In this chunk, we download the Transformer paper directly from arXiv and submit it to PageIndex, which processes the PDF and builds a hierarchical tree of its sections — each node storing a title, a summary, and the full section text. Once the tree is ready, we print it out to inspect the structure PageIndex has inferred: every chapter, subsection, and nested heading becomes a node in the tree, preserving the document’s natural organization exactly as the authors intended it. 
Copy CodeCopiedUse a different Browser # ───────────────────────────────────────────── # Step 1: Build the PageIndex Tree # ───────────────────────────────────────────── # 1.1 Download the Transformer paper and submit it import os, requests pdf_url = “https://arxiv.org/pdf/1706.03762.pdf” pdf_path = os.path.join(“data”, pdf_url.split(“/”)[-1]) os.makedirs(“data”, exist_ok=True) print(“Downloading ‘Attention Is All You Need’…”) response = requests.get(pdf_url) with open(pdf_path, “wb”) as f: f.write(response.content) print(f” Saved to {pdf_path}”) doc_id = pi_client.submit_document(pdf_path)[“doc_id”] print(f” Document submitted. doc_id: {doc_id}”) # 1.2 Retrieve the tree (poll until ready) import time print(“nWaiting for PageIndex tree to be ready”, end=””) while not pi_client.is_retrieval_ready(doc_id): print(“.”, end=””, flush=True) time.sleep(5) tree = pi_client.get_tree(doc_id, node_summary=True)[“result”] print(“nn Document Tree Structure:”) utils.print_tree(tree) Reasoning-Based Retrieval With the tree built, we now run a query that is intentionally cross-cutting — one that can’t be answered by a single section of the paper. We strip the full text from each node, leaving only titles and summaries, and pass the entire tree structure to GPT-5.4. The model then reasons over these summaries to identify every node likely to contain a relevant answer, returning both its step-by-step thinking and a list of matched node IDs. This is the core of what makes PageIndex different: the LLM decides where to look before any full text is loaded. Copy CodeCopiedUse a different Browser # ───────────────────────────────────────────── # Step 2: Reasoning-Based Retrieval # ───────────────────────────────────────────── # 2.1 Define a query that requires navigating across sections import json # This query is intentionally cross-cutting — it can’t be answered # by a single section, which is where tree search shines over top-k. query = “Why did the authors choose self-attention over recurrence, and what are the complexity trade-offs they compared?” tree_without_text = utils.remove_fields(tree.copy(), fields=[“text”]) search_prompt = f””” You are given a question and a hierarchical tree structure of a research paper. Each node has a node_id, title, and a summary of its content. Your task: identify ALL nodes that are likely to contain information relevant to answering the question. Think carefully — the answer may be spread across multiple sections. Question: {query} Document tree: {json.dumps(tree_without_text, indent=2)} Reply ONLY in this JSON format, no preamble: {{ “thinking”: “<step-by-step reasoning about which nodes are relevant and why>”, “node_list”: [“node_id_1”, “node_id_2”, …] }} “”” print(f’ Query: “{query}”n’) print(“Running tree search with GPT-5.4…”) tree_search_result = await call_llm(search_prompt) # 2.2 Inspect the retrieval reasoning and matched nodes node_map = utils.create_node_mapping(tree) result_json = json.loads(tree_search_result) print(“n LLM Reasoning:”) utils.print_wrapped(result_json[“thinking”]) print(“n Retrieved Nodes:”) for node_id in result_json[“node_list”]: node = node_map[node_id] print(f” • [{node[‘node_id’]}] Page {node[‘page_index’]:>2} — {node[‘title’]}”) Answer Generation Once the relevant nodes are identified, we pull their full text and stitch it together into a single context block — each section clearly labeled so the model knows where each piece of information comes from. 
That combined context is then handed to GPT-5.4 with a structured prompt that asks for the core motivation, the specific complexity numbers, and any caveats the authors acknowledged. The model answers using only what was retrieved, grounding every claim directly in the paper’s text. Copy CodeCopiedUse a different Browser # ───────────────────────────────────────────── # Step 3: Answer Generation # ───────────────────────────────────────────── # 3.1 Stitch together context from all retrieved nodes node_list = result_json[“node_list”] relevant_content = “nn—nn”.join( f”[Section:
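The Step 3 code is truncated in this excerpt. A minimal sketch of how it could be completed is shown below; it assumes each entry in node_map still carries its 'title' and 'text' fields (only the copy sent to the search prompt had text removed) and reuses the call_llm helper defined earlier. It is not the article's exact code.

# Sketch: stitch the full text of each retrieved node into one labelled context block,
# then ask the model to answer strictly from that context.
relevant_content = "\n\n---\n\n".join(
    f"[Section: {node_map[nid]['title']}]\n{node_map[nid].get('text', '')}"
    for nid in node_list
)

answer_prompt = f"""
Answer the question using ONLY the context below.
Cover: (1) the core motivation for choosing self-attention over recurrence,
(2) the specific complexity comparisons the authors make, and (3) any caveats they acknowledge.

Question: {query}

Context:
{relevant_content}
"""

answer = await call_llm(answer_prompt)
print("\n Final Answer:\n")
utils.print_wrapped(answer)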

RAG Without Vectors: How PageIndex Retrieves by Reasoning Read the article »

AI, Committee, News, Uncategorized

Top 7 Benchmarks That Actually Matter for Agentic Reasoning in Large Language Models

As AI agents move from research demos to production deployments, one question has become impossible to ignore: how do you actually know if an agent is good? Perplexity scores and MMLU leaderboard numbers tell you very little about whether a model can navigate a real website, resolve a GitHub issue, or reliably handle a customer service workflow across hundreds of interactions. The field has responded with a wave of agentic benchmarks — but not all of them are equally meaningful. One important caveat before diving in: agent benchmark scores are highly scaffold-dependent. The model, prompt design, tool access, retry budget, execution environment, and evaluator version can all materially change reported scores. No number should be read in isolation, context about how it was produced matters as much as the number itself. With that in mind, here are seven benchmarks that have emerged as genuine signals of agentic capability, explaining what each one tests, why it matters, and where notable results currently stand. 1. SWE-bench Verified Leaderboard & details: swebench.com What it tests: Real-world software engineering. SWE-bench evaluates LLMs and AI agents on their ability to resolve real-world software engineering issues, drawing from 2,294 problems sourced from GitHub issues across 12 popular Python repositories. The agent must produce a working patch — not a description of a fix, but actual code that passes unit tests. The Verified subset is a human-validated collection of 500 high-quality samples developed in collaboration with OpenAI and professional software engineers, and is the version most commonly cited in frontier model evaluations today. Why it matters: The benchmark’s trajectory makes it one of the most reliable long-run progress trackers in the field. When it launched in 2023, Claude 2 could resolve only 1.96% of issues. In vendor-reported late-2025 and early-2026 results, top frontier models crossed the 80% range on SWE-bench Verified — though exact scores vary meaningfully by scaffold, effort setting, tool setup, and evaluator protocol, and should not be compared directly across vendors without accounting for those differences. A consistent pattern has emerged: closed-source models tend to outperform open-source ones, and performance is heavily shaped by the agent harness as much as the underlying model. One caveat worth flagging: high SWE-bench scores do not guarantee a general-purpose agent. They indicate strength in software repair tasks specifically — not universal autonomy — which is precisely why it must be used alongside the other benchmarks in this list. 2. GAIA Leaderboard & details: huggingface.co/spaces/gaia-benchmark/leaderboard What it tests: General-purpose assistant capabilities that require multi-step reasoning, web browsing, tool use, and basic multimodal understanding. GAIA tasks are deceptively simple in phrasing but require a chain of non-trivial operations to complete correctly — the kind of compound task a real assistant would face in the wild. Why it matters: GAIA is widely referenced in agent evaluation research and maintains an active Hugging Face leaderboard where teams across the community submit results. Its design resists shortcut-taking: an agent cannot guess its way through. It has become one of the standard suites for exposing tool-use brittleness and reproducibility gaps in real agent evaluations — surfacing failure modes that narrower benchmarks miss entirely. 
For teams evaluating general-purpose assistants rather than task-specific agents, GAIA remains one of the most honest signal generators available. 3. WebArena Leaderboard & details: webarena.dev What it tests: Autonomous web navigation in realistic, functional environments. WebArena creates websites across four domains — e-commerce, social forums, collaborative software development, and content management — with real functionality and data that mirrors their real-world equivalents. Agents must interpret high-level natural language commands and execute them entirely through a live browser interface. The benchmark consists of 812 long-horizon tasks, and the original paper’s best GPT-4-based agent achieved only 14.41% end-to-end task success, against a human baseline of 78.24%. Why it matters: Progress on WebArena has been substantial. By early 2025, specialized systems were reporting single-agent task completion rates above 60% — IBM’s CUGA system reached 61.7% on the full benchmark (February 2025), and OpenAI’s Computer-Using Agent achieved 58.1% in its January 2025 technical report. These gains reflect a broader pattern in stronger web agents: explicit planning, specialized action execution, memory or state tracking, reflection, and task-specific training or evaluation loops. The remaining gap to human performance — 78.24% per the original paper — reflects harder unsolved problems like deep visual understanding and common-sense reasoning. WebArena is one of the most widely used benchmarks for testing true web autonomy, not scripted automation. 4. τ-bench (Tau-bench) Leaderboard & code: github.com/sierra-research/tau-bench What it tests: Tool-agent-user interaction under real-world policy constraints. τ-bench emulates dynamic, multi-turn conversations between a simulated user and a language agent equipped with domain-specific API tools and policy guidelines. The benchmark covers two domains — τ-retail and τ-airline — and simultaneously evaluates three things: whether the agent can gather required information from a user across multiple exchanges, whether it correctly follows domain-specific policy rules (e.g., rejecting non-refundable ticket changes), and whether it behaves consistently at scale via the pass^k reliability metric. Why it matters: τ-bench exposes a reliability crisis that most one-shot benchmarks are completely blind to. Even state-of-the-art function calling agents like GPT-4o succeed on fewer than 50% of tasks, and their consistency is far worse — pass^8 falls below 25% in the retail domain. That means an agent that can handle a task in one trial cannot reliably handle the same task eight times in a row. For any real deployment handling millions of interactions, that inconsistency is disqualifying. By combining reasoning, tool-use, policy adherence, and repeatability into a single evaluation framework, τ-bench fills a gap that outcome-only benchmarks leave wide open. 5. ARC-AGI-2 Leaderboard & competition: arcprize.org/leaderboard What it tests: Fluid intelligence — the ability to generalize to genuinely novel visual reasoning puzzles that resist memorization or pattern-matching from training data. Each task presents the agent with a small number of input-output grid examples and asks it to infer the underlying abstract rule, then apply it to a new input. Created by François Chollet, the benchmark is the centerpiece of the ARC
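The pass^k reliability metric mentioned under τ-bench is easy to misread, so here is a small illustrative sketch of one common way to estimate it from repeated trials: for a task with n trials and c successes, the probability that k randomly drawn trials all succeed is C(c, k)/C(n, k), averaged over tasks (an analogue of the pass@k estimator that requires every drawn trial to succeed). The trial results below are made up purely for illustration.

# Illustrative pass^k estimator (not the official tau-bench harness): for a task with
# n trials and c successes, the chance that k randomly chosen trials ALL succeed is
# C(c, k) / C(n, k); averaging over tasks gives pass^k.
from math import comb

def pass_hat_k(results_per_task, k):
    vals = []
    for trials in results_per_task:            # trials: list of booleans, one per attempt
        n, c = len(trials), sum(trials)
        vals.append(comb(c, k) / comb(n, k) if c >= k else 0.0)
    return sum(vals) / len(vals)

# Made-up example: an agent that succeeds on most single attempts still collapses at k=8.
results = [
    [True, True, False, True, True, False, True, True],
    [True, False, True, True, False, True, True, False],
    [True, True, True, False, True, True, False, True],
]
for k in (1, 2, 4, 8):
    print(f"pass^{k} = {pass_hat_k(results, k):.2f}")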

Top 7 Benchmarks That Actually Matter for Agentic Reasoning in Large Language Models Read the article »

AI, Committee, News, Uncategorized

A Coding Implementation on Deepgram Python SDK for Transcription, Text-to-Speech, Async Audio Processing, and Text Intelligence

In this tutorial, we build an advanced hands-on workflow with the Deepgram Python SDK and explore how modern voice AI capabilities come together in a single Python environment. We set up authentication, connect both synchronous and asynchronous Deepgram clients, and work directly with real audio data to understand how the SDK handles transcription, speech generation, and text analysis in practice. We transcribe audio from both a URL and a local file, inspect confidence scores, word-level timestamps, speaker diarization, paragraph formatting, and AI-generated summaries, and then extend the pipeline to async processing for faster, more scalable execution. We also generate speech with multiple TTS voices, analyze text for sentiment, topics, and intents, and examine advanced transcription controls such as keyword search, replacement, boosting, raw response access, and structured error handling. Through this process, we create a practical, end-to-end Deepgram voice AI workflow that is both technically detailed and easy to adapt for real-world applications. Copy CodeCopiedUse a different Browser !pip install deepgram-sdk httpx –quiet import os, asyncio, textwrap, urllib.request from getpass import getpass from deepgram import DeepgramClient, AsyncDeepgramClient from deepgram.core.api_error import ApiError from IPython.display import Audio, display DEEPGRAM_API_KEY = getpass(” Enter your Deepgram API key: “) os.environ[“DEEPGRAM_API_KEY”] = DEEPGRAM_API_KEY client = DeepgramClient(api_key=DEEPGRAM_API_KEY) async_client = AsyncDeepgramClient(api_key=DEEPGRAM_API_KEY) AUDIO_URL = “https://dpgr.am/spacewalk.wav” AUDIO_PATH = “/tmp/sample.wav” urllib.request.urlretrieve(AUDIO_URL, AUDIO_PATH) def read_audio(path=AUDIO_PATH): with open(path, “rb”) as f: return f.read() def _get(obj, key, default=None): “””Get a field from either a dict or an object — v6 returns both.””” if isinstance(obj, dict): return obj.get(key, default) return getattr(obj, key, default) def get_model_name(meta): mi = _get(meta, “model_info”) if mi is None: return “n/a” return _get(mi, “name”, “n/a”) def tts_to_bytes(response) -> bytes: “””v6 generate() returns a generator of chunks or an object with .stream.””” if hasattr(response, “stream”): return response.stream.getvalue() return b””.join(chunk for chunk in response if isinstance(chunk, bytes)) def save_tts(response, path: str) -> str: with open(path, “wb”) as f: f.write(tts_to_bytes(response)) return path print(” Deepgram client ready | sample audio downloaded”) print(“n” + “=”*60) print(” SECTION 2: Pre-Recorded Transcription from URL”) print(“=”*60) response = client.listen.v1.media.transcribe_url( url=AUDIO_URL, model=”nova-3″, smart_format=True, diarize=True, language=”en”, utterances=True, filler_words=True, ) transcript = response.results.channels[0].alternatives[0].transcript print(f”n Full Transcript:n{textwrap.fill(transcript, 80)}”) confidence = response.results.channels[0].alternatives[0].confidence print(f”n Confidence: {confidence:.2%}”) words = response.results.channels[0].alternatives[0].words print(f”n First 5 words with timing:”) for w in words[:5]: print(f” ‘{w.word}’ start={w.start:.2f}s end={w.end:.2f}s conf={w.confidence:.2f}”) print(f”n Speaker Diarization (first 5 words):”) for w in words[:5]: speaker = getattr(w, “speaker”, None) if speaker is not None: print(f” Speaker {int(speaker)}: ‘{w.word}'”) meta = response.metadata print(f”n Metadata: duration={meta.duration:.2f}s channels={int(meta.channels)} model={get_model_name(meta)}”) We install the 
Deepgram SDK and its dependencies, then securely set up authentication using our API key. We initialize both synchronous and asynchronous Deepgram clients, download a sample audio file, and define helper functions to make it easier to work with mixed response objects, audio bytes, model metadata, and streamed TTS outputs. We then run our first pre-recorded transcription from a URL and inspect the transcript, confidence score, word-level timestamps, speaker diarization, and metadata to understand the structure and richness of the response. Copy CodeCopiedUse a different Browser print(“n” + “=”*60) print(” SECTION 3: Pre-Recorded Transcription from File”) print(“=”*60) file_response = client.listen.v1.media.transcribe_file( request=read_audio(), model=”nova-3″, smart_format=True, diarize=True, paragraphs=True, summarize=”v2”, ) alt = file_response.results.channels[0].alternatives[0] paragraphs = getattr(alt, “paragraphs”, None) if paragraphs and _get(paragraphs, “paragraphs”): print(“n Paragraph-Formatted Transcript:”) for para in _get(paragraphs, “paragraphs”)[:2]: sentences = ” “.join(_get(s, “text”, “”) for s in (_get(para, “sentences”) or [])) print(f” [Speaker {int(_get(para,’speaker’,0))}, ” f”{_get(para,’start’,0):.1f}s–{_get(para,’end’,0):.1f}s] {sentences[:120]}…”) else: print(f”n Transcript: {alt.transcript[:200]}…”) if getattr(file_response.results, “summary”, None): short = _get(file_response.results.summary, “short”, “”) if short: print(f”n AI Summary: {short}”) print(f”n Confidence: {alt.confidence:.2%}”) print(f” Word count : {len(alt.words)}”) print(“n” + “=”*60) print(” SECTION 4: Async Parallel Transcription”) print(“=”*60) async def transcribe_async(): audio_bytes = read_audio() async def from_url(label): r = await async_client.listen.v1.media.transcribe_url( url=AUDIO_URL, model=”nova-3″, smart_format=True, ) print(f” [{label}] {r.results.channels[0].alternatives[0].transcript[:100]}…”) async def from_file(label): r = await async_client.listen.v1.media.transcribe_file( request=audio_bytes, model=”nova-3″, smart_format=True, ) print(f” [{label}] {r.results.channels[0].alternatives[0].transcript[:100]}…”) await asyncio.gather(from_url(“From URL”), from_file(“From File”)) await transcribe_async() We move from URL-based to file-based transcription by sending raw audio bytes directly to the Deepgram API, enabling richer options such as paragraphs and summarization. We inspect the returned paragraph structure, speaker segmentation, summary output, confidence score, and word count to see how the SDK supports more readable and analysis-friendly transcription results. We also introduce asynchronous processing and run URL-based and file-based transcription in parallel, helping us understand how to build faster, more scalable voice AI pipelines. Copy CodeCopiedUse a different Browser print(“n” + “=”*60) print(” SECTION 5: Text-to-Speech”) print(“=”*60) sample_text = ( “Welcome to the Deepgram advanced tutorial. 
” “This SDK lets you transcribe audio, generate speech, ” “and analyse text — all with a simple Python interface.” ) tts_path = save_tts( client.speak.v1.audio.generate(text=sample_text, model=”aura-2-asteria-en”), “/tmp/tts_output.mp3″, ) size_kb = os.path.getsize(tts_path) / 1024 print(f” TTS audio saved → {tts_path} ({size_kb:.1f} KB)”) display(Audio(tts_path)) print(“n” + “=”*60) print(” SECTION 6: Multiple TTS Voices Comparison”) print(“=”*60) voices = { “aura-2-asteria-en”: “Asteria (female, warm)”, “aura-2-orion-en”: “Orion (male, deep)”, “aura-2-luna-en”: “Luna (female, bright)”, } for model_id, label in voices.items(): try: path = save_tts( client.speak.v1.audio.generate(text=”Hello! I am a Deepgram voice model.”, model=model_id), f”/tmp/tts_{model_id}.mp3″, ) print(f” {label}”) display(Audio(path)) except Exception as e: print(f” {label} — {e}”) print(“n” + “=”*60) print(” SECTION 7: Text Intelligence — Sentiment, Topics, Intents”) print(“=”*60) review_text = ( “I absolutely love this product! It arrived quickly, the quality is ” “outstanding, and customer support was incredibly helpful when I had ” “a question. I would definitely recommend it to anyone looking for ” “a reliable solution. Five stars!” ) read_response = client.read.v1.text.analyze( request={“text”: review_text}, language=”en”, sentiment=True, topics=True, intents=True, summarize=True, ) results = read_response.results We focus on speech generation by converting text to audio using Deepgram’s text-to-speech API and saving the resulting audio as an MP3 file. We then compare multiple TTS voices to hear how different voice models behave and how easily we can switch between them while keeping the same code pattern. After that, we begin working with the Read API by passing the review text into Deepgram’s text intelligence system to analyze language beyond simple transcription. Copy CodeCopiedUse a different Browser if getattr(results, “sentiments”, None): overall = results.sentiments.average print(f” Sentiment: {_get(overall,’sentiment’,’?’).upper()} ” f”(score={_get(overall,’sentiment_score’,0):.3f})”) for seg in (_get(results.sentiments, “segments”) or [])[:2]: print(f” •
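The excerpt stops partway through printing the sentiment segments. A hedged sketch of inspecting the topics and intents portions of the same Read response follows; the field names (topics, intents, confidence_score) are assumptions inferred from the shape of the sentiment output, so the defensive _get helper from earlier is used throughout.

# Hedged sketch (field names assumed): walk topic and intent segments the same way
# the sentiment segments were walked, tolerating dict- or object-style responses via _get.
if getattr(results, "topics", None):
    print("\n Detected Topics:")
    for seg in (_get(results.topics, "segments") or [])[:3]:
        for t in (_get(seg, "topics") or []):
            print(f"   • {_get(t, 'topic', '?')} (confidence={_get(t, 'confidence_score', 0):.2f})")

if getattr(results, "intents", None):
    print("\n Detected Intents:")
    for seg in (_get(results.intents, "segments") or [])[:3]:
        for i in (_get(seg, "intents") or []):
            print(f"   • {_get(i, 'intent', '?')} (confidence={_get(i, 'confidence_score', 0):.2f})")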

A Coding Implementation on Deepgram Python SDK for Transcription, Text-to-Speech, Async Audio Processing, and Text Intelligence Read the article »

AI, Committee, News, Uncategorized

A Coding Implementation on Microsoft’s OpenMementos with Trace Structure Analysis, Context Compression, and Fine-Tuning Data Preparation

In this tutorial, we work with Microsoft’s OpenMementos dataset and explore how reasoning traces are structured through blocks and mementos in a practical, Colab-ready workflow. We stream the dataset efficiently, parse its special-token format, inspect how reasoning and summaries are organized, and measure the compression provided by the memento representation across different domains. As we move through the analysis, we also visualize dataset patterns, align the streamed format with the richer full subset, simulate inference-time compression, and prepare the data for supervised fine-tuning. In this way, we build both an intuitive and technical understanding of how OpenMementos captures long-form reasoning while preserving compact summaries that can support efficient training and inference. Copy CodeCopiedUse a different Browser !pip install -q -U datasets transformers matplotlib pandas import re, itertools, textwrap from collections import Counter from typing import Dict import pandas as pd import matplotlib.pyplot as plt from datasets import load_dataset DATASET = “microsoft/OpenMementos” ds_stream = load_dataset(DATASET, split=”train”, streaming=True) first_row = next(iter(ds_stream)) print(“Columns :”, list(first_row.keys())) print(“Domain :”, first_row[“domain”], “| Source:”, first_row[“source”]) print(“Problem head:”, first_row[“problem”][:160].replace(“n”, ” “), “…”) We install the required libraries and import the core tools needed for dataset streaming, parsing, analysis, and visualization. We then connect to the Microsoft OpenMementos dataset in streaming mode to inspect it without downloading the entire dataset locally. By reading the first example, we begin understanding the dataset schema, the problem format, and the domain and source metadata attached to each reasoning trace. Copy CodeCopiedUse a different Browser BLOCK_RE = re.compile(r”<|block_start|>(.*?)<|block_end|>”, re.DOTALL) SUMMARY_RE = re.compile(r”<|summary_start|>(.*?)<|summary_end|>”, re.DOTALL) THINK_RE = re.compile(r”<think>(.*?)</think>”, re.DOTALL) def parse_memento(response: str) -> Dict: blocks = [m.strip() for m in BLOCK_RE.findall(response)] summaries = [m.strip() for m in SUMMARY_RE.findall(response)] think_m = THINK_RE.search(response) final_ans = response.split(“</think>”)[-1].strip() if “</think>” in response else “” return {“blocks”: blocks, “summaries”: summaries, “reasoning”: (think_m.group(1) if think_m else “”), “final_answer”: final_ans} parsed = parse_memento(first_row[“response”]) print(f”n→ {len(parsed[‘blocks’])} blocks, {len(parsed[‘summaries’])} mementos parsed”) print(“First block :”, parsed[“blocks”][0][:140].replace(“n”, ” “), “…”) print(“First memento :”, parsed[“summaries”][0][:140].replace(“n”, ” “), “…”) N_SAMPLES = 500 rows = [] for i, ex in enumerate(itertools.islice( load_dataset(DATASET, split=”train”, streaming=True), N_SAMPLES)): p = parse_memento(ex[“response”]) if not p[“blocks”] or len(p[“blocks”]) != len(p[“summaries”]): continue blk_c = sum(len(b) for b in p[“blocks”]) sum_c = sum(len(s) for s in p[“summaries”]) blk_w = sum(len(b.split()) for b in p[“blocks”]) sum_w = sum(len(s.split()) for s in p[“summaries”]) rows.append(dict(domain=ex[“domain”], source=ex[“source”], n_blocks=len(p[“blocks”]), block_chars=blk_c, summ_chars=sum_c, block_words=blk_w, summ_words=sum_w, compress_char=sum_c / max(blk_c, 1), compress_word=sum_w / max(blk_w, 1))) if (i + 1) % 100 == 0: print(f” processed {i+1}/{N_SAMPLES}”) df = pd.DataFrame(rows) print(f”nAnalyzed {len(df)} rows. 
We visualize the dataset’s structural patterns by plotting block counts, compression ratios, and the relationship between block size and memento size. We compare these distributions across domains to see how reasoning organization differs between math, code, and science examples. We also stream one example from the full subset and inspect its additional sentence-level and block-alignment fields, which helps us understand the richer internal annotation pipeline behind the dataset.
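As a minimal sketch of the plots described above, and assuming only the df DataFrame built in the streaming-analysis step (columns n_blocks, compress_word, block_words, summ_words, and domain), we can draw the block-count distribution, the per-domain compression ratios, and the block-size versus memento-size relationship:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Distribution of block counts per trace
df["n_blocks"].plot(kind="hist", bins=20, ax=axes[0], title="Blocks per trace")

# Word-level compression ratio (mementos / blocks), split by domain
df.boxplot(column="compress_word", by="domain", ax=axes[1])
axes[1].set_title("Word compression ratio by domain")

# Relationship between block size and memento size, one color per domain
for dom, grp in df.groupby("domain"):
    axes[2].scatter(grp["block_words"], grp["summ_words"], s=8, alpha=0.5, label=dom)
axes[2].set_xlabel("block words")
axes[2].set_ylabel("memento words")
axes[2].set_title("Block size vs. memento size")
axes[2].legend()

plt.suptitle("")  # clear the automatic suptitle added by the grouped boxplot
plt.tight_layout()
plt.show()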
def compress_trace(response: str, keep_last_k: int = 1) -> str:
    blocks, summaries = BLOCK_RE.findall(response), SUMMARY_RE.findall(response)
    if not blocks or len(blocks) != len(summaries):
        return response
    out, n = ["<think>"], len(blocks)
    for i, (b, s) in enumerate(zip(blocks, summaries)):
        if i >= n - keep_last_k:
            out.append(f"<|block_start|>{b}<|block_end|>")
            out.append(f"<|summary_start|>{s}<|summary_end|>")
        else:
            out.append(f"<|summary_start|>{s}<|summary_end|>")
    out.append("</think>")
    out.append(response.split("</think>")[-1])
    return "\n".join(out)

orig, comp = first_row["response"], compress_trace(first_row["response"], 1)
print(f"\nOriginal : {len(orig):>8,} chars")
print(f"Compressed : {len(comp):>8,} chars ({len(comp)/len(orig)*100:.1f}% of original)")

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
MEM_TOKENS = ["<|block_start|>", "<|block_end|>",
              "<|summary_start|>", "<|summary_end|>",
              "<think>", "</think>"]
tok.add_special_tokens({"additional_special_tokens": MEM_TOKENS})

def tlen(s):
    return len(tok(s, add_special_tokens=False).input_ids)

blk_tok = sum(tlen(b) for b in parsed["blocks"])
sum_tok = sum(tlen(s) for s in parsed["summaries"])
print(f"\nTrace-level token compression for this example:")
print(f" block tokens = {blk_tok}")
print(f" memento tokens = {sum_tok}")
print(f" compression = {blk_tok / max(sum_tok, 1):.2f}× (paper reports ~6×)")

def to_chat(ex):
    return {"messages": [
        {"role": "user", "content": ex["problem"]},
        {"role": "assistant", "content": ex["response"]},
    ]}

chat_stream = load_dataset(DATASET, split="train", streaming=True).map(to_chat)
chat_ex = next(iter(chat_stream))
print("\nSFT chat example (truncated):")
for m in chat_ex["messages"]:
    print(f" [{m['role']:9s}] {m['content'][:130].replace(chr(10), ' ')}...")

We simulate inference-time compression by rewriting a reasoning trace so that older blocks are replaced by their mementos while the latest blocks remain intact. We then compare the original and compressed trace lengths to see how much context can be reduced in practice. After that, we integrate a tokenizer, add special memento tokens, measure token-level compression, and convert the dataset to an SFT-style chat format suitable for training workflows.

def render_trace(response: str, width: int = 220) -> None:
    p = parse_memento(response)
    print("=" * 72)
    print(f"{len(p['blocks'])} blocks ·

A Coding Implementation on Microsoft’s OpenMementos with Trace Structure Analysis, Context Compression, and Fine-Tuning Data Preparation Read the article »


Three reasons why DeepSeek’s new model matters

On Friday, Chinese AI firm DeepSeek released a preview of V4, its long-awaited new flagship model. Notably, the model can process much longer prompts than its last generation, thanks to a new design that helps it handle large amounts of text more efficiently. Like DeepSeek’s previous models, V4 is open source, meaning it is available for anyone to download, use, and modify.

V4 marks DeepSeek’s most significant release since R1, the reasoning model it launched in January 2025. R1, which was trained on limited computing resources, stunned the global AI industry with its strong performance and efficiency, turning DeepSeek from a little-known research team into China’s best-known AI company almost overnight. It also helped set off a wave of open-weight model releases from other Chinese AI firms.

DeepSeek has kept a relatively low profile since then—but earlier this month, it effectively teased V4’s release when it added “expert” and “flash” modes to the online version of its model, prompting speculation that the updates were tied to a bigger upcoming release. While the company has become a powerful symbol of China’s AI ambitions, its big return to cutting-edge frontier models comes after months of turbulence—including major personnel departures, delays to previous model launches, and growing scrutiny from both the US and Chinese governments.

So, will V4 shake the AI field the way R1 did? Almost certainly not, but here are three big reasons why this release matters.

1. It breaks new ground for an open-source model.

As with R1 before it, DeepSeek claims that V4’s performance rivals the best models available at a fraction of the price. This is great news for developers and for companies using the tech, because it means they can access frontier AI capabilities on their own terms, and without worrying about skyrocketing costs.

The new model comes in two versions, both of which are available on DeepSeek’s website and in its app, with API access also open to developers. V4-Pro is a larger model built for coding and complex agent tasks, and V4-Flash is a smaller version designed to be faster and cheaper to run. Both versions offer reasoning modes, in which the model can carefully parse a user’s prompt and show each step as it works through the problem.

For V4-Pro, DeepSeek charges $1.74 per million input tokens and $3.48 per million output tokens, a fraction of the cost of comparable models from OpenAI and Anthropic. V4-Flash is even cheaper, at about $0.14 per million input tokens and about $0.28 per million output tokens, making it one of the cheapest top-tier models available. That pricing makes it a very appealing model to build applications on.

In terms of performance, V4 is, perhaps unsurprisingly, a huge jump from R1—and it seems to be a strong alternative to just about all the latest big AI models. On the major benchmarks, according to results shared by the company, DeepSeek V4-Pro competes with leading closed-source models, matching the performance of Anthropic’s Claude-Opus-4.6, OpenAI’s GPT-5.4, and Google’s Gemini-3.1. And compared to other open-source models, such as Alibaba’s Qwen-3.5 or Z.ai’s GLM-5.1, DeepSeek V4 exceeds them all on coding, math, and STEM problems, making it one of the strongest open-source models ever released.

DeepSeek also says that V4-Pro now ranks among the strongest open-source models on benchmarks for agentic coding tasks and performs well on other tests that measure the ability to work through multistep problems. Its writing ability and world knowledge also lead the field, according to benchmarking results shared by the company. In a technical report released alongside the model, DeepSeek shared results from an internal survey of 85 experienced developers: more than 90% included V4-Pro among their top model choices for coding tasks. DeepSeek says it has specifically optimized V4 for popular agent frameworks such as Claude Code, OpenClaw, and CodeBuddy.

2. It delivers on a new approach to memory efficiency.

One of the key innovations of V4 is its long context window—the amount of text the model can process at once. Both versions can handle 1 million tokens, which is large enough to fit all three volumes of The Lord of the Rings and The Hobbit combined. The company says this context window size is now the default across all DeepSeek services, and it matches what is offered by cutting-edge versions of models like Gemini and Claude.

But it’s important to know not just that DeepSeek has made this leap, but how it did so. V4 makes significant architectural changes compared with the company’s previous models—especially in the attention mechanism, the feature of AI models that helps them understand each part of a prompt in relation to the rest. As the prompt text gets longer, these comparisons become much more costly, making attention one of the main bottlenecks for long-context models.

DeepSeek’s innovation was to make the model more selective about what it pays attention to. Instead of treating all earlier text as equally important, V4 compresses older information and focuses on the parts most likely to matter in the present moment, while still keeping nearby text in full so it does not miss important details.

DeepSeek says this sharply reduces the cost of using long context. In a 1-million-token context, V4-Pro uses only 27% of the computing power required by its previous model, V3.2, while cutting memory use to 10%. The reduction in V4-Flash is even larger, using just 10% of the computing power and 7% of the memory. In practice, this could make it cheaper to build tools that need to work across huge amounts of material, such as an AI coding assistant that can read an entire codebase or a research agent that can analyze a long archive of documents without constantly forgetting what came before.

DeepSeek’s interest in long context windows didn’t start with V4. Over the past year and a half, the company has quietly published a series of papers on how AI models “remember” information, experimenting with compression and mathematical techniques to extend what AI models could realistically handle.

3.

Three reasons why DeepSeek’s new model matters Read the article »
