Uncategorized Archives - Page 4 of 180

RAG Without Vectors: How PageIndex Retrieves by Reasoning

admin NU / เมษายน 26, 2026

Retrieval is where most RAG systems quietly break. Traditional pipelines rely on vector similarity—embedding queries and document chunks into the same space and fetching the “closest” matches. But similarity is a weak proxy for what we actually need: relevance grounded in reasoning. In long, professional documents—like financial reports, research papers, or legal texts—the right answer often isn’t in the most semantically similar paragraph. It requires navigating structure, understanding context, and performing multi-step reasoning across sections. This is exactly where vector-based RAG starts to fall apart. PageIndex is designed to solve this gap by rethinking retrieval from first principles. Instead of chunking documents and searching via embeddings, it builds a hierarchical table-of-contents-style tree index and uses LLMs to reason over that structure—much like a human expert scanning sections, drilling down, and connecting ideas. This enables a vectorless, reasoning-driven retrieval process that is more interpretable, traceable, and aligned with how knowledge is actually extracted from complex documents. By replacing similarity search with structured exploration and tree-based reasoning, PageIndex delivers significantly higher retrieval accuracy—demonstrated by its strong performance on benchmarks like FinanceBench—making it particularly effective for domains that demand precision and deep understanding. In this article, we’ll use PageIndex to index the seminal Transformer paper — “Attention Is All You Need” — and run two cross-cutting queries against it without a single vector or embedding. Instead of chunking the PDF and retrieving by similarity, PageIndex builds a hierarchical tree of the document’s sections, then uses GPT-5.4 to reason over node summaries and identify exactly which sections contain the answer — before reading a single word of full text. Setting up the dependencies For this tutorial, you would require PageIndex & OpenAI API keys. You can get the same from https://dash.pageindex.ai/api-keys and https://platform.openai.com/api-keys respectively. Copy CodeCopiedUse a different Browser pip install pageindex openai requests Copy CodeCopiedUse a different Browser from pageindex import PageIndexClient import pageindex.utils as utils import os from getpass import getpass PAGEINDEX_API_KEY = getpass(‘Enter PageIndex API Key: ‘) pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY) We import the OpenAI client and configure it with an API key to enable access to LLMs. Then, we define an asynchronous helper function that sends prompts to the model and returns the generated response. Copy CodeCopiedUse a different Browser import openai OPENAI_API_KEY = getpass(‘Enter OpenAI API Key: ‘) async def call_llm(prompt, model=”gpt-5.4″, temperature=0): client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY) response = await client.chat.completions.create( model=model, messages=[{“role”: “user”, “content”: prompt}], temperature=temperature ) return response.choices[0].message.content.strip() Building the PageIndex Tree In this chunk, we download the Transformer paper directly from arXiv and submit it to PageIndex, which processes the PDF and builds a hierarchical tree of its sections — each node storing a title, a summary, and the full section text. Once the tree is ready, we print it out to inspect the structure PageIndex has inferred: every chapter, subsection, and nested heading becomes a node in the tree, preserving the document’s natural organization exactly as the authors intended it. Copy CodeCopiedUse a different Browser # ───────────────────────────────────────────── # Step 1: Build the PageIndex Tree # ───────────────────────────────────────────── # 1.1 Download the Transformer paper and submit it import os, requests pdf_url = “https://arxiv.org/pdf/1706.03762.pdf” pdf_path = os.path.join(“data”, pdf_url.split(“/”)[-1]) os.makedirs(“data”, exist_ok=True) print(“Downloading ‘Attention Is All You Need’…”) response = requests.get(pdf_url) with open(pdf_path, “wb”) as f: f.write(response.content) print(f” Saved to {pdf_path}”) doc_id = pi_client.submit_document(pdf_path)[“doc_id”] print(f” Document submitted. doc_id: {doc_id}”) # 1.2 Retrieve the tree (poll until ready) import time print(“nWaiting for PageIndex tree to be ready”, end=””) while not pi_client.is_retrieval_ready(doc_id): print(“.”, end=””, flush=True) time.sleep(5) tree = pi_client.get_tree(doc_id, node_summary=True)[“result”] print(“nn Document Tree Structure:”) utils.print_tree(tree) Reasoning-Based Retrieval With the tree built, we now run a query that is intentionally cross-cutting — one that can’t be answered by a single section of the paper. We strip the full text from each node, leaving only titles and summaries, and pass the entire tree structure to GPT-5.4. The model then reasons over these summaries to identify every node likely to contain a relevant answer, returning both its step-by-step thinking and a list of matched node IDs. This is the core of what makes PageIndex different: the LLM decides where to look before any full text is loaded. Copy CodeCopiedUse a different Browser # ───────────────────────────────────────────── # Step 2: Reasoning-Based Retrieval # ───────────────────────────────────────────── # 2.1 Define a query that requires navigating across sections import json # This query is intentionally cross-cutting — it can’t be answered # by a single section, which is where tree search shines over top-k. query = “Why did the authors choose self-attention over recurrence, and what are the complexity trade-offs they compared?” tree_without_text = utils.remove_fields(tree.copy(), fields=[“text”]) search_prompt = f””” You are given a question and a hierarchical tree structure of a research paper. Each node has a node_id, title, and a summary of its content. Your task: identify ALL nodes that are likely to contain information relevant to answering the question. Think carefully — the answer may be spread across multiple sections. Question: {query} Document tree: {json.dumps(tree_without_text, indent=2)} Reply ONLY in this JSON format, no preamble: {{ “thinking”: “<step-by-step reasoning about which nodes are relevant and why>”, “node_list”: [“node_id_1”, “node_id_2”, …] }} “”” print(f’ Query: “{query}”n’) print(“Running tree search with GPT-5.4…”) tree_search_result = await call_llm(search_prompt) # 2.2 Inspect the retrieval reasoning and matched nodes node_map = utils.create_node_mapping(tree) result_json = json.loads(tree_search_result) print(“n LLM Reasoning:”) utils.print_wrapped(result_json[“thinking”]) print(“n Retrieved Nodes:”) for node_id in result_json[“node_list”]: node = node_map[node_id] print(f” • [{node[‘node_id’]}] Page {node[‘page_index’]:>2} — {node[‘title’]}”) Answer Generation Once the relevant nodes are identified, we pull their full text and stitch it together into a single context block — each section clearly labeled so the model knows where each piece of information comes from. That combined context is then handed to GPT-5.4 with a structured prompt that asks for the core motivation, the specific complexity numbers, and any caveats the authors acknowledged. The model answers using only what was retrieved, grounding every claim directly in the paper’s text. Copy CodeCopiedUse a different Browser # ───────────────────────────────────────────── # Step 3: Answer Generation # ───────────────────────────────────────────── # 3.1 Stitch together context from all retrieved nodes node_list = result_json[“node_list”] relevant_content = “nn—nn”.join( f”[Section:

RAG Without Vectors: How PageIndex Retrieves by Reasoning Read Post »

AI, Committee, ข่าว, Uncategorized

Top 7 Benchmarks That Actually Matter for Agentic Reasoning in Large Language Models

admin NU / เมษายน 26, 2026

As AI agents move from research demos to production deployments, one question has become impossible to ignore: how do you actually know if an agent is good? Perplexity scores and MMLU leaderboard numbers tell you very little about whether a model can navigate a real website, resolve a GitHub issue, or reliably handle a customer service workflow across hundreds of interactions. The field has responded with a wave of agentic benchmarks — but not all of them are equally meaningful. One important caveat before diving in: agent benchmark scores are highly scaffold-dependent. The model, prompt design, tool access, retry budget, execution environment, and evaluator version can all materially change reported scores. No number should be read in isolation, context about how it was produced matters as much as the number itself. With that in mind, here are seven benchmarks that have emerged as genuine signals of agentic capability, explaining what each one tests, why it matters, and where notable results currently stand. 1. SWE-bench Verified Leaderboard & details: swebench.com What it tests: Real-world software engineering. SWE-bench evaluates LLMs and AI agents on their ability to resolve real-world software engineering issues, drawing from 2,294 problems sourced from GitHub issues across 12 popular Python repositories. The agent must produce a working patch — not a description of a fix, but actual code that passes unit tests. The Verified subset is a human-validated collection of 500 high-quality samples developed in collaboration with OpenAI and professional software engineers, and is the version most commonly cited in frontier model evaluations today. Why it matters: The benchmark’s trajectory makes it one of the most reliable long-run progress trackers in the field. When it launched in 2023, Claude 2 could resolve only 1.96% of issues. In vendor-reported late-2025 and early-2026 results, top frontier models crossed the 80% range on SWE-bench Verified — though exact scores vary meaningfully by scaffold, effort setting, tool setup, and evaluator protocol, and should not be compared directly across vendors without accounting for those differences. A consistent pattern has emerged: closed-source models tend to outperform open-source ones, and performance is heavily shaped by the agent harness as much as the underlying model. One caveat worth flagging: high SWE-bench scores do not guarantee a general-purpose agent. They indicate strength in software repair tasks specifically — not universal autonomy — which is precisely why it must be used alongside the other benchmarks in this list. 2. GAIA Leaderboard & details: huggingface.co/spaces/gaia-benchmark/leaderboard What it tests: General-purpose assistant capabilities that require multi-step reasoning, web browsing, tool use, and basic multimodal understanding. GAIA tasks are deceptively simple in phrasing but require a chain of non-trivial operations to complete correctly — the kind of compound task a real assistant would face in the wild. Why it matters: GAIA is widely referenced in agent evaluation research and maintains an active Hugging Face leaderboard where teams across the community submit results. Its design resists shortcut-taking: an agent cannot guess its way through. It has become one of the standard suites for exposing tool-use brittleness and reproducibility gaps in real agent evaluations — surfacing failure modes that narrower benchmarks miss entirely. For teams evaluating general-purpose assistants rather than task-specific agents, GAIA remains one of the most honest signal generators available. 3. WebArena Leaderboard & details: webarena.dev What it tests: Autonomous web navigation in realistic, functional environments. WebArena creates websites across four domains — e-commerce, social forums, collaborative software development, and content management — with real functionality and data that mirrors their real-world equivalents. Agents must interpret high-level natural language commands and execute them entirely through a live browser interface. The benchmark consists of 812 long-horizon tasks, and the original paper’s best GPT-4-based agent achieved only 14.41% end-to-end task success, against a human baseline of 78.24%. Why it matters: Progress on WebArena has been substantial. By early 2025, specialized systems were reporting single-agent task completion rates above 60% — IBM’s CUGA system reached 61.7% on the full benchmark (February 2025), and OpenAI’s Computer-Using Agent achieved 58.1% in its January 2025 technical report. These gains reflect a broader pattern in stronger web agents: explicit planning, specialized action execution, memory or state tracking, reflection, and task-specific training or evaluation loops. The remaining gap to human performance — 78.24% per the original paper — reflects harder unsolved problems like deep visual understanding and common-sense reasoning. WebArena is one of the most widely used benchmarks for testing true web autonomy, not scripted automation. 4. τ-bench (Tau-bench) Leaderboard & code: github.com/sierra-research/tau-bench What it tests: Tool-agent-user interaction under real-world policy constraints. τ-bench emulates dynamic, multi-turn conversations between a simulated user and a language agent equipped with domain-specific API tools and policy guidelines. The benchmark covers two domains — τ-retail and τ-airline — and simultaneously evaluates three things: whether the agent can gather required information from a user across multiple exchanges, whether it correctly follows domain-specific policy rules (e.g., rejecting non-refundable ticket changes), and whether it behaves consistently at scale via the pass^k reliability metric. Why it matters: τ-bench exposes a reliability crisis that most one-shot benchmarks are completely blind to. Even state-of-the-art function calling agents like GPT-4o succeed on fewer than 50% of tasks, and their consistency is far worse — pass^8 falls below 25% in the retail domain. That means an agent that can handle a task in one trial cannot reliably handle the same task eight times in a row. For any real deployment handling millions of interactions, that inconsistency is disqualifying. By combining reasoning, tool-use, policy adherence, and repeatability into a single evaluation framework, τ-bench fills a gap that outcome-only benchmarks leave wide open. 5. ARC-AGI-2 Leaderboard & competition: arcprize.org/leaderboard What it tests: Fluid intelligence — the ability to generalize to genuinely novel visual reasoning puzzles that resist memorization or pattern-matching from training data. Each task presents the agent with a small number of input-output grid examples and asks it to infer the underlying abstract rule, then apply it to a new input. Created by François Chollet, the benchmark is the centerpiece of the ARC

Top 7 Benchmarks That Actually Matter for Agentic Reasoning in Large Language Models Read Post »

AI, Committee, ข่าว, Uncategorized

A Coding Implementation on Microsoft’s OpenMementos with Trace Structure Analysis, Context Compression, and Fine-Tuning Data Preparation

admin NU / เมษายน 25, 2026

In this tutorial, we work with Microsoft’s OpenMementos dataset and explore how reasoning traces are structured through blocks and mementos in a practical, Colab-ready workflow. We stream the dataset efficiently, parse its special-token format, inspect how reasoning and summaries are organized, and measure the compression provided by the memento representation across different domains. As we move through the analysis, we also visualize dataset patterns, align the streamed format with the richer full subset, simulate inference-time compression, and prepare the data for supervised fine-tuning. In this way, we build both an intuitive and technical understanding of how OpenMementos captures long-form reasoning while preserving compact summaries that can support efficient training and inference. Copy CodeCopiedUse a different Browser !pip install -q -U datasets transformers matplotlib pandas import re, itertools, textwrap from collections import Counter from typing import Dict import pandas as pd import matplotlib.pyplot as plt from datasets import load_dataset DATASET = “microsoft/OpenMementos” ds_stream = load_dataset(DATASET, split=”train”, streaming=True) first_row = next(iter(ds_stream)) print(“Columns :”, list(first_row.keys())) print(“Domain :”, first_row[“domain”], “| Source:”, first_row[“source”]) print(“Problem head:”, first_row[“problem”][:160].replace(“n”, ” “), “…”) We install the required libraries and import the core tools needed for dataset streaming, parsing, analysis, and visualization. We then connect to the Microsoft OpenMementos dataset in streaming mode to inspect it without downloading the entire dataset locally. By reading the first example, we begin understanding the dataset schema, the problem format, and the domain and source metadata attached to each reasoning trace. Copy CodeCopiedUse a different Browser BLOCK_RE = re.compile(r”<|block_start|>(.*?)<|block_end|>”, re.DOTALL) SUMMARY_RE = re.compile(r”<|summary_start|>(.*?)<|summary_end|>”, re.DOTALL) THINK_RE = re.compile(r”<think>(.*?)</think>”, re.DOTALL) def parse_memento(response: str) -> Dict: blocks = [m.strip() for m in BLOCK_RE.findall(response)] summaries = [m.strip() for m in SUMMARY_RE.findall(response)] think_m = THINK_RE.search(response) final_ans = response.split(“</think>”)[-1].strip() if “</think>” in response else “” return {“blocks”: blocks, “summaries”: summaries, “reasoning”: (think_m.group(1) if think_m else “”), “final_answer”: final_ans} parsed = parse_memento(first_row[“response”]) print(f”n→ {len(parsed[‘blocks’])} blocks, {len(parsed[‘summaries’])} mementos parsed”) print(“First block :”, parsed[“blocks”][0][:140].replace(“n”, ” “), “…”) print(“First memento :”, parsed[“summaries”][0][:140].replace(“n”, ” “), “…”) N_SAMPLES = 500 rows = [] for i, ex in enumerate(itertools.islice( load_dataset(DATASET, split=”train”, streaming=True), N_SAMPLES)): p = parse_memento(ex[“response”]) if not p[“blocks”] or len(p[“blocks”]) != len(p[“summaries”]): continue blk_c = sum(len(b) for b in p[“blocks”]) sum_c = sum(len(s) for s in p[“summaries”]) blk_w = sum(len(b.split()) for b in p[“blocks”]) sum_w = sum(len(s.split()) for s in p[“summaries”]) rows.append(dict(domain=ex[“domain”], source=ex[“source”], n_blocks=len(p[“blocks”]), block_chars=blk_c, summ_chars=sum_c, block_words=blk_w, summ_words=sum_w, compress_char=sum_c / max(blk_c, 1), compress_word=sum_w / max(blk_w, 1))) if (i + 1) % 100 == 0: print(f” processed {i+1}/{N_SAMPLES}”) df = pd.DataFrame(rows) print(f”nAnalyzed {len(df)} rows. Domain counts:”) print(df[“domain”].value_counts().to_string()) per_dom = df.groupby(“domain”).agg( n=(“domain”, “count”), median_blocks=(“n_blocks”, “median”), median_block_words=(“block_words”, “median”), median_summ_words=(“summ_words”, “median”), median_char_ratio=(“compress_char”, “median”), median_word_ratio=(“compress_word”, “median”), ).round(3) print(“nPer-domain medians (ratio = mementos / blocks):”) print(per_dom.to_string()) We define the regex-based parser that extracts reasoning blocks, memento summaries, the main thinking section, and the final answer from each response. We test the parser on the first streamed example and confirm that the block-summary structure is being captured correctly. We then run a streaming analysis over multiple samples to compute block counts, word counts, character counts, and compression ratios, which helps us study how the dataset behaves across examples and domains. Copy CodeCopiedUse a different Browser def compress_trace(response: str, keep_last_k: int = 1) -> str: blocks, summaries = BLOCK_RE.findall(response), SUMMARY_RE.findall(response) if not blocks or len(blocks) != len(summaries): return response out, n = [“<think>”], len(blocks) for i, (b, s) in enumerate(zip(blocks, summaries)): if i >= n – keep_last_k: out.append(f”<|block_start|>{b}<|block_end|>”) out.append(f”<|summary_start|>{s}<|summary_end|>”) else: out.append(f”<|summary_start|>{s}<|summary_end|>”) out.append(“</think>”) out.append(response.split(“</think>”)[-1]) return “n”.join(out) orig, comp = first_row[“response”], compress_trace(first_row[“response”], 1) print(f”nOriginal : {len(orig):>8,} chars”) print(f”Compressed : {len(comp):>8,} chars ({len(comp)/len(orig)*100:.1f}% of original)”) from transformers import AutoTokenizer tok = AutoTokenizer.from_pretrained(“gpt2”) MEM_TOKENS = [“<|block_start|>”, “<|block_end|>”, “<|summary_start|>”, “<|summary_end|>”, “<think>”, “</think>”] tok.add_special_tokens({“additional_special_tokens”: MEM_TOKENS}) def tlen(s): return len(tok(s, add_special_tokens=False).input_ids) blk_tok = sum(tlen(b) for b in parsed[“blocks”]) sum_tok = sum(tlen(s) for s in parsed[“summaries”]) print(f”nTrace-level token compression for this example:”) print(f” block tokens = {blk_tok}”) print(f” memento tokens = {sum_tok}”) print(f” compression = {blk_tok / max(sum_tok,1):.2f}× (paper reports ~6×)”) def to_chat(ex): return {“messages”: [ {“role”: “user”, “content”: ex[“problem”]}, {“role”: “assistant”, “content”: ex[“response”]}, ]} chat_stream = load_dataset(DATASET, split=”train”, streaming=True).map(to_chat) chat_ex = next(iter(chat_stream)) print(“nSFT chat example (truncated):”) for m in chat_ex[“messages”]: print(f” [{m[‘role’]:9s}] {m[‘content’][:130].replace(chr(10),’ ‘)}…”) We visualize the dataset’s structural patterns by plotting block counts, compression ratios, and the relationship between block size and memento size. We compare these distributions across domains to see how reasoning organization differs between math, code, and science examples. We also stream one example from the full subset and inspect its additional sentence-level and block-alignment fields, which helps us understand the richer internal annotation pipeline behind the dataset. Copy CodeCopiedUse a different Browser def compress_trace(response: str, keep_last_k: int = 1) -> str: blocks, summaries = BLOCK_RE.findall(response), SUMMARY_RE.findall(response) if not blocks or len(blocks) != len(summaries): return response out, n = [“<think>”], len(blocks) for i, (b, s) in enumerate(zip(blocks, summaries)): if i >= n – keep_last_k: out.append(f”<|block_start|>{b}<|block_end|>”) out.append(f”<|summary_start|>{s}<|summary_end|>”) else: out.append(f”<|summary_start|>{s}<|summary_end|>”) out.append(“</think>”) out.append(response.split(“</think>”)[-1]) return “n”.join(out) orig, comp = first_row[“response”], compress_trace(first_row[“response”], 1) print(f”nOriginal : {len(orig):>8,} chars”) print(f”Compressed : {len(comp):>8,} chars ({len(comp)/len(orig)*100:.1f}% of original)”) from transformers import AutoTokenizer tok = AutoTokenizer.from_pretrained(“gpt2”) MEM_TOKENS = [“<|block_start|>”, “<|block_end|>”, “<|summary_start|>”, “<|summary_end|>”, “<think>”, “</think>”] tok.add_special_tokens({“additional_special_tokens”: MEM_TOKENS}) def tlen(s): return len(tok(s, add_special_tokens=False).input_ids) blk_tok = sum(tlen(b) for b in parsed[“blocks”]) sum_tok = sum(tlen(s) for s in parsed[“summaries”]) print(f”nTrace-level token compression for this example:”) print(f” block tokens = {blk_tok}”) print(f” memento tokens = {sum_tok}”) print(f” compression = {blk_tok / max(sum_tok,1):.2f}× (paper reports ~6×)”) def to_chat(ex): return {“messages”: [ {“role”: “user”, “content”: ex[“problem”]}, {“role”: “assistant”, “content”: ex[“response”]}, ]} chat_stream = load_dataset(DATASET, split=”train”, streaming=True).map(to_chat) chat_ex = next(iter(chat_stream)) print(“nSFT chat example (truncated):”) for m in chat_ex[“messages”]: print(f” [{m[‘role’]:9s}] {m[‘content’][:130].replace(chr(10),’ ‘)}…”) We simulate inference-time compression by rewriting a reasoning trace so that older blocks are replaced by their mementos while the latest blocks remain intact. We then compare the original and compressed trace lengths to see how much context can be reduced in practice. After that, we integrate a tokenizer, add special memento tokens, measure token-level compression, and convert the dataset to an SFT-style chat format suitable for training workflows. Copy CodeCopiedUse a different Browser def render_trace(response: str, width: int = 220) -> None: p = parse_memento(response) print(“=” * 72) print(f”{len(p[‘blocks’])} blocks ·

A Coding Implementation on Microsoft’s OpenMementos with Trace Structure Analysis, Context Compression, and Fine-Tuning Data Preparation Read Post »

AI, Committee, ข่าว, Uncategorized

Three reasons why DeepSeek’s new model matters

admin NU / เมษายน 25, 2026

On Friday, Chinese AI firm DeepSeek released a preview of V4, its long-awaited new flagship model. Notably, the model can process much longer prompts than its last generation, thanks to a new design that helps it handle large amounts of text more efficiently. Like DeepSeek’s previous models, V4 is open source, meaning it is available for anyone to download, use, and modify. V4 marks DeepSeek’s most significant release since R1, the reasoning model it launched in January 2025. R1, which was trained on limited computing resources, stunned the global AI industry with its strong performance and efficiency, turning DeepSeek from a little-known research team into China’s best-known AI company almost overnight. It also helped set off a wave of open-weight model releases from other Chinese AI firms. DeepSeek has kept a relatively low profile since then—but earlier this month, it effectively teased V4’s release when it added “expert” and “flash” modes to the online version of its model, prompting speculation that the updates were tied to a bigger upcoming release. While the company has become a powerful symbol of China’s AI ambitions, its big return to cutting-edge frontier models comes after months of scrutiny—including major personnel departures, delays to previous model launches, and growing scrutiny from both the US and Chinese governments. So, will V4 shake the AI field the way R1 did? Almost certainly not, but here are three big reasons why this release matters. 1. It breaks new ground for an open-source model. As with R1 before it, DeepSeek claims that V4’s performance rivals the best models available at a fraction of the price. This is great news for developers and for companies using the tech, because it means they can access frontier AI capabilities on their own terms, and without worrying about skyrocketing costs. The new model comes in two versions, both of which are available on DeepSeek’s website and in its app, with API access also open to developers. V4-Pro is a larger model built for coding and complex agent tasks, and V4-Flash is a smaller version designed to be faster and cheaper to run. Both versions offer reasoning modes, in which the model can carefully parse a user’s prompt and show each step as it works through the problem. For V4-Pro, DeepSeek charges $1.74 per million input tokens and $3.48 per million output tokens, a fraction of the cost of comparable models from OpenAI and Anthropic. V4-Flash is even cheaper, at about $0.14 per million input tokens and about $0.28 per million output tokens, making it one of the cheapest top-tier models available. This would make it a very appealing model to build applications on. In terms of performance, V4 is, perhaps unsurprisingly, a huge jump from R1—and it seems to be a strong alternative to just about all the latest big AI models. On the major benchmarks, according to results shared by the company, DeepSeek V4-Pro competes with leading closed-source models, matching the performance of Anthropic’s Claude-Opus-4.6, OpenAI’s GPT-5.4, and Google’s Gemini-3.1. And compared to other open-source models, such as Alibaba’s Qwen-3.5 or Z.ai’s GLM-5.1, DeepSeek V4 exceeds them all on coding, math, and STEM problems, making it one of the strongest open-source models ever released. DeepSeek also says that V4-Pro now ranks among the strongest open-source models on benchmarks for agentic coding tasks and performs well on other tests that measure ability to carry out multistep problems. Its writing ability and world knowledge also lead the field, according to benchmarking results shared by the company. In a technical report released alongside the model, DeepSeek shared results from an internal survey of 85 experienced developers: More than 90% included V4-Pro among their top model choices for coding tasks. DeepSeek says it has specifically optimized V4 for popular agent frameworks such as Claude Code, OpenClaw, and CodeBuddy. 2. It delivers on a new approach to memory efficiency. One of the key innovations of V4 is its long context window—the amount of text the model can process at once. Both versions can handle 1 million tokens, which is large enough to fit all three volumes of The Lord of the Rings and The Hobbit combined. The company says this context window size is now the default across all DeepSeek services and it matches what is offered by cutting-edge versions of models like Gemini and Claude. But it’s important to know not just that DeepSeek has made this leap, but how it did so. V4 makes significant architectural changes to the company’s former models—especially in the attention mechanism, which is the feature of AI models that helps them understand each part of a prompt in relation to the rest. As the prompt text gets longer, these comparisons become much more costly, making attention one of the main bottlenecks for long-context models. DeepSeek’s innovation was to make the model more selective about what it pays attention to. Instead of treating all earlier text as equally important, V4 compresses older information and focuses on the parts most likely to matter in the present moment, while still keeping nearby text in full so it does not miss important details. DeepSeek says this sharply reduces the cost of using long context. In a 1-million-token context, V4-Pro uses only 27% of the computing power required by its previous model, V3.2, while cutting memory use to 10%. The reduction in V4-Flash is even larger, using just 10% of the computing power and 7% of the memory. In practice, this could make it cheaper to build tools that need to work across huge amounts of material, such as an AI coding assistant that can read an entire codebase or a research agent that can analyze a long archive of documents without constantly forgetting what came before. DeepSeek’s interest in long context windows didn’t start with V4. Over the past year and a half, the company has quietly published a series of papers on how AI models “remember” information, experimenting with compression and mathematical techniques to extend what AI models could realistically handle. 3.

Three reasons why DeepSeek’s new model matters Read Post »

AI, Committee, ข่าว, Uncategorized

A Coding Implementation on Deepgram Python SDK for Transcription, Text-to-Speech, Async Audio Processing, and Text Intelligence

admin NU / เมษายน 25, 2026

In this tutorial, we build an advanced hands-on workflow with the Deepgram Python SDK and explore how modern voice AI capabilities come together in a single Python environment. We set up authentication, connect both synchronous and asynchronous Deepgram clients, and work directly with real audio data to understand how the SDK handles transcription, speech generation, and text analysis in practice. We transcribe audio from both a URL and a local file, inspect confidence scores, word-level timestamps, speaker diarization, paragraph formatting, and AI-generated summaries, and then extend the pipeline to async processing for faster, more scalable execution. We also generate speech with multiple TTS voices, analyze text for sentiment, topics, and intents, and examine advanced transcription controls such as keyword search, replacement, boosting, raw response access, and structured error handling. Through this process, we create a practical, end-to-end Deepgram voice AI workflow that is both technically detailed and easy to adapt for real-world applications. Copy CodeCopiedUse a different Browser !pip install deepgram-sdk httpx –quiet import os, asyncio, textwrap, urllib.request from getpass import getpass from deepgram import DeepgramClient, AsyncDeepgramClient from deepgram.core.api_error import ApiError from IPython.display import Audio, display DEEPGRAM_API_KEY = getpass(” Enter your Deepgram API key: “) os.environ[“DEEPGRAM_API_KEY”] = DEEPGRAM_API_KEY client = DeepgramClient(api_key=DEEPGRAM_API_KEY) async_client = AsyncDeepgramClient(api_key=DEEPGRAM_API_KEY) AUDIO_URL = “https://dpgr.am/spacewalk.wav” AUDIO_PATH = “/tmp/sample.wav” urllib.request.urlretrieve(AUDIO_URL, AUDIO_PATH) def read_audio(path=AUDIO_PATH): with open(path, “rb”) as f: return f.read() def _get(obj, key, default=None): “””Get a field from either a dict or an object — v6 returns both.””” if isinstance(obj, dict): return obj.get(key, default) return getattr(obj, key, default) def get_model_name(meta): mi = _get(meta, “model_info”) if mi is None: return “n/a” return _get(mi, “name”, “n/a”) def tts_to_bytes(response) -> bytes: “””v6 generate() returns a generator of chunks or an object with .stream.””” if hasattr(response, “stream”): return response.stream.getvalue() return b””.join(chunk for chunk in response if isinstance(chunk, bytes)) def save_tts(response, path: str) -> str: with open(path, “wb”) as f: f.write(tts_to_bytes(response)) return path print(” Deepgram client ready | sample audio downloaded”) print(“n” + “=”*60) print(” SECTION 2: Pre-Recorded Transcription from URL”) print(“=”*60) response = client.listen.v1.media.transcribe_url( url=AUDIO_URL, model=”nova-3″, smart_format=True, diarize=True, language=”en”, utterances=True, filler_words=True, ) transcript = response.results.channels[0].alternatives[0].transcript print(f”n Full Transcript:n{textwrap.fill(transcript, 80)}”) confidence = response.results.channels[0].alternatives[0].confidence print(f”n Confidence: {confidence:.2%}”) words = response.results.channels[0].alternatives[0].words print(f”n First 5 words with timing:”) for w in words[:5]: print(f” ‘{w.word}’ start={w.start:.2f}s end={w.end:.2f}s conf={w.confidence:.2f}”) print(f”n Speaker Diarization (first 5 words):”) for w in words[:5]: speaker = getattr(w, “speaker”, None) if speaker is not None: print(f” Speaker {int(speaker)}: ‘{w.word}'”) meta = response.metadata print(f”n Metadata: duration={meta.duration:.2f}s channels={int(meta.channels)} model={get_model_name(meta)}”) We install the Deepgram SDK and its dependencies, then securely set up authentication using our API key. We initialize both synchronous and asynchronous Deepgram clients, download a sample audio file, and define helper functions to make it easier to work with mixed response objects, audio bytes, model metadata, and streamed TTS outputs. We then run our first pre-recorded transcription from a URL and inspect the transcript, confidence score, word-level timestamps, speaker diarization, and metadata to understand the structure and richness of the response. Copy CodeCopiedUse a different Browser print(“n” + “=”*60) print(” SECTION 3: Pre-Recorded Transcription from File”) print(“=”*60) file_response = client.listen.v1.media.transcribe_file( request=read_audio(), model=”nova-3″, smart_format=True, diarize=True, paragraphs=True, summarize=”v2”, ) alt = file_response.results.channels[0].alternatives[0] paragraphs = getattr(alt, “paragraphs”, None) if paragraphs and _get(paragraphs, “paragraphs”): print(“n Paragraph-Formatted Transcript:”) for para in _get(paragraphs, “paragraphs”)[:2]: sentences = ” “.join(_get(s, “text”, “”) for s in (_get(para, “sentences”) or [])) print(f” [Speaker {int(_get(para,’speaker’,0))}, ” f”{_get(para,’start’,0):.1f}s–{_get(para,’end’,0):.1f}s] {sentences[:120]}…”) else: print(f”n Transcript: {alt.transcript[:200]}…”) if getattr(file_response.results, “summary”, None): short = _get(file_response.results.summary, “short”, “”) if short: print(f”n AI Summary: {short}”) print(f”n Confidence: {alt.confidence:.2%}”) print(f” Word count : {len(alt.words)}”) print(“n” + “=”*60) print(” SECTION 4: Async Parallel Transcription”) print(“=”*60) async def transcribe_async(): audio_bytes = read_audio() async def from_url(label): r = await async_client.listen.v1.media.transcribe_url( url=AUDIO_URL, model=”nova-3″, smart_format=True, ) print(f” [{label}] {r.results.channels[0].alternatives[0].transcript[:100]}…”) async def from_file(label): r = await async_client.listen.v1.media.transcribe_file( request=audio_bytes, model=”nova-3″, smart_format=True, ) print(f” [{label}] {r.results.channels[0].alternatives[0].transcript[:100]}…”) await asyncio.gather(from_url(“From URL”), from_file(“From File”)) await transcribe_async() We move from URL-based to file-based transcription by sending raw audio bytes directly to the Deepgram API, enabling richer options such as paragraphs and summarization. We inspect the returned paragraph structure, speaker segmentation, summary output, confidence score, and word count to see how the SDK supports more readable and analysis-friendly transcription results. We also introduce asynchronous processing and run URL-based and file-based transcription in parallel, helping us understand how to build faster, more scalable voice AI pipelines. Copy CodeCopiedUse a different Browser print(“n” + “=”*60) print(” SECTION 5: Text-to-Speech”) print(“=”*60) sample_text = ( “Welcome to the Deepgram advanced tutorial. ” “This SDK lets you transcribe audio, generate speech, ” “and analyse text — all with a simple Python interface.” ) tts_path = save_tts( client.speak.v1.audio.generate(text=sample_text, model=”aura-2-asteria-en”), “/tmp/tts_output.mp3″, ) size_kb = os.path.getsize(tts_path) / 1024 print(f” TTS audio saved → {tts_path} ({size_kb:.1f} KB)”) display(Audio(tts_path)) print(“n” + “=”*60) print(” SECTION 6: Multiple TTS Voices Comparison”) print(“=”*60) voices = { “aura-2-asteria-en”: “Asteria (female, warm)”, “aura-2-orion-en”: “Orion (male, deep)”, “aura-2-luna-en”: “Luna (female, bright)”, } for model_id, label in voices.items(): try: path = save_tts( client.speak.v1.audio.generate(text=”Hello! I am a Deepgram voice model.”, model=model_id), f”/tmp/tts_{model_id}.mp3″, ) print(f” {label}”) display(Audio(path)) except Exception as e: print(f” {label} — {e}”) print(“n” + “=”*60) print(” SECTION 7: Text Intelligence — Sentiment, Topics, Intents”) print(“=”*60) review_text = ( “I absolutely love this product! It arrived quickly, the quality is ” “outstanding, and customer support was incredibly helpful when I had ” “a question. I would definitely recommend it to anyone looking for ” “a reliable solution. Five stars!” ) read_response = client.read.v1.text.analyze( request={“text”: review_text}, language=”en”, sentiment=True, topics=True, intents=True, summarize=True, ) results = read_response.results We focus on speech generation by converting text to audio using Deepgram’s text-to-speech API and saving the resulting audio as an MP3 file. We then compare multiple TTS voices to hear how different voice models behave and how easily we can switch between them while keeping the same code pattern. After that, we begin working with the Read API by passing the review text into Deepgram’s text intelligence system to analyze language beyond simple transcription. Copy CodeCopiedUse a different Browser if getattr(results, “sentiments”, None): overall = results.sentiments.average print(f” Sentiment: {_get(overall,’sentiment’,’?’).upper()} ” f”(score={_get(overall,’sentiment_score’,0):.3f})”) for seg in (_get(results.sentiments, “segments”) or [])[:2]: print(f” •

A Coding Implementation on Deepgram Python SDK for Transcription, Text-to-Speech, Async Audio Processing, and Text Intelligence Read Post »

AI, Committee, ข่าว, Uncategorized

Meet GitNexus: An Open-Source MCP-Native Knowledge Graph Engine That Gives Claude Code and Cursor Full Codebase Structural Awareness

admin NU / เมษายน 25, 2026

There is a quiet failure mode that lives at the center of every AI-assisted coding workflow. You ask Claude Code, Cursor, or Windsurf to modify a function. The agent does it confidently, cleanly, and incorrectly — because it had no idea that 47 other functions depended on the return type it just changed. Breaking changes ship. The test suite screams. And you spend the next two hours untangling what the model should have known before it touched a single line. An Indian Computer Science student built GitNexus to fix that. The open-source project, now sitting at 28,000+ stars and 3,000+ forks on GitHub with 45 contributors, describes itself as ‘the nervous system for agent context.’ That description undersells what it actually does. What Actually is GitNexus? GitNexus is a code intelligence layer, not a documentation tool. It indexes an entire repository into a structured knowledge graph — mapping every function call, import, class inheritance, interface implementation, and execution flow — and then exposes that graph to AI agents through a Model Context Protocol (MCP) server. The agents stop guessing. They query. To understand why this is significant, you need to understand what AI coding agents currently operate on. Most tools like Cursor, Claude Code, and Windsurf rely on either file-based context windows (they read the files nearby and hope for the best) or traditional Graph RAG approaches (they query a graph with a series of prompts, hoping to discover what matters). Neither approach gives an agent a structural map of the repository before it acts. GitNexus pre-computes the entire dependency structure at index time. When an agent asks ‘what depends on this function?’, it gets a complete, confidence-scored answer in one query, instead of chaining 10 successive queries that each risk missing something. The Indexing Pipeline Running npx gitnexus analyze from the root of a repository kicks off a multi-phase indexing pipeline that does the following: First, it walks the file tree and maps folder and file relationships (the Structure phase). Then it parses every function, class, method, and interface using Tree-sitter ASTs (Abstract Syntax Trees). Tree-sitter is a high-performance, incremental parser originally developed at GitHub that produces concrete syntax trees for any supported language. GitNexus uses it to extract symbols with precision that regex or simple text search cannot match. After parsing, GitNexus performs cross-file resolution: it resolves imports, function calls, class heritage, constructor inference, and self/this receiver types across the whole codebase. This is the step where it learns that UserController in src/controllers/user.ts calls into UserService, which authRouter imports, which handleLogin depends on. Next comes clustering — GitNexus groups related symbols into functional communities using Leiden community detection on the call graph, assigning each cluster a cohesion score. Then it traces execution flows from entry points through full call chains to build what it calls ‘processes.’ Finally it indexes everything for hybrid search using BM25 (a keyword ranking algorithm), semantic vector embeddings, and RRF (Reciprocal Rank Fusion) to merge results. The graph is stored in LadybugDB, an embedded graph database with native vector support formerly known as KuzuDB. This entire pipeline runs locally — no code leaves your machine. A particularly useful flag for teams: gitnexus analyze –skills takes the Leiden community detection one step further. Instead of only grouping symbols internally, it generates a custom SKILL.md file for each detected functional area of your codebase under .claude/skills/generated/. Each skill file describes that module’s key files, entry points, execution flows, and cross-area connections — so an AI agent working in the authentication module gets targeted architectural context for that specific area, not a generic overview of the entire repo. Skills are regenerated on each –skills run to stay current. https://github.com/abhigyanpatwari/GitNexus Seven Tools and Two Prompts Your Agent Gets Once indexed, GitNexus registers an MCP server that exposes seven tools and two guided prompts to your AI agent. impact runs blast radius analysis. Given a target symbol, it returns every upstream caller grouped by depth with confidence scores — handleLogin [CALLS 90%], UserController [CALLS 85%] — so the agent knows what it risks breaking before it touches anything. context gives a 360-degree view of any symbol: its callers, its callees, every process it participates in, and which step of each process it occupies. query runs process-grouped hybrid search across the codebase, returning matching symbols alongside the execution flows they belong to. detect_changes performs git-diff impact analysis — it maps changed lines to affected processes and assigns a risk level before you commit. rename executes coordinated multi-file symbol renames using the graph for high-confidence edits and text search for the rest, with a dry-run mode to preview changes before applying them. cypher exposes raw Cypher graph queries for engineers who want to write custom traversals against the knowledge graph directly. list_repos handles the multi-repo case — GitNexus uses a global registry at ~/.gitnexus/ so one MCP server can serve multiple indexed repositories simultaneously. Beyond the tools, GitNexus also exposes two MCP prompts for guided workflows. detect_impact runs a pre-commit change analysis that surfaces scope, affected processes, and an overall risk level — think of it as a structured checklist before any significant edit. generate_map produces architecture documentation directly from the knowledge graph, complete with Mermaid diagrams, making it useful for onboarding engineers or documenting a codebase that has grown faster than its docs. Editor Support and Deepest Integration with Claude Code GitNexus supports Claude Code, Cursor, Codex, OpenCode, and Windsurf. Editor support varies by tier. Windsurf gets MCP only. Cursor, Codex, and OpenCode get MCP plus agent skills. Claude Code gets the full stack: MCP tools, agent skills (Exploring, Debugging, Impact Analysis, Refactoring), PreToolUse hooks that enrich every search with graph context before Claude acts, and PostToolUse hooks that auto-reindex after commits. For Claude Code users, GitNexus installs itself completely — hooks, skills, and an AGENTS.md / CLAUDE.md context file — in a single npx gitnexus analyze command. The Model Democratization Angle One of the less obvious implications of this architecture is what it does for smaller models. Because

Meet GitNexus: An Open-Source MCP-Native Knowledge Graph Engine That Gives Claude Code and Cursor Full Codebase Structural Awareness Read Post »

AI, Committee, ข่าว, Uncategorized

Google DeepMind Introduces Vision Banana: An Instruction-Tuned Image Generator That Beats SAM 3 on Segmentation and Depth Anything V3 on Metric Depth Estimation

admin NU / เมษายน 25, 2026

For years, the computer vision community has operated on two separate tracks: generative models (which produce images) and discriminative models (which understand them). The assumption was straightforward — models good at making pictures aren’t necessarily good at reading them. A new paper from Google, titled “Image Generators are Generalist Vision Learners” (arXiv:2604.20329), published April 22, 2026, blows that assumption apart. A team of Google DeepMind researchers introduced Vision Banana, a single unified model that surpasses or matches state-of-the-art specialist systems across a wide range of visual understanding tasks — including semantic segmentation, instance segmentation, monocular metric depth estimation, and surface normal estimation — while simultaneously retaining the original image generation capabilities of its base model. https://arxiv.org/pdf/2604.20329 The LLM Analogy That Changes Everything If you’ve worked with large language models, you already understand the two-phase playbook: first, pretrain a base model on massive text data using a generative objective, then apply instruction-tuning to align it for downstream tasks. The pretraining phase is where the model develops a rich internal representation of language that can be repurposed for almost anything. The Google team’s core claim is that image generation training plays the exact same foundational role for vision. Their base model, Nano Banana Pro (NBP), is Google’s state-of-the-art image generator. By performing a lightweight instruction-tuning pass — mixing a small proportion of computer vision task data at a very low ratio into NBP’s original training mixture — they created Vision Banana. The key insight: generating photorealistic images implicitly requires a model to understand geometry, semantics, depth, and object relationships. Vision Banana learns to express that latent knowledge in measurable, decodable formats. Critically, no training data from any of the evaluation benchmarks is included in the instruction-tuning mixture — ensuring that all results reflect true generalist capability rather than in-domain memorization. How It Works: Perception as Image Generation Rather than adding specialized decoder heads or regression modules for each task, all vision task outputs are parameterized as RGB images. The model is instruction-tuned to produce visualizations that follow precise, invertible color schemes — meaning the generated images can be decoded back into quantitative outputs for benchmark evaluation. The research team identified three key advantages of this strategy. First, it supports a wide variety of tasks with a single unified model — after instruction-tuning, only the prompt changes, not the weights. Second, it requires relatively little new training data, since instruction-tuning is solely teaching the model how to format computer vision outputs as RGB. Third, it helps the model retain its original image generation capabilities, since the outputs are simply new RGB images. For semantic segmentation, the model is prompted with instructions such as: “Generate a segmentation visualization of this image, using the color mapping: {‘cat’: ‘red’, ‘background’: ‘yellow’}.” Each pixel is colored by its predicted class, and because color assignments are specified in the prompt, no fixed label vocabulary is needed. For instance segmentation, since the number of instances is unknown in advance, Vision Banana uses a per-class inference strategy — running a separate pass per class and dynamically assigning unique colors to each instance. Masks are recovered by clustering pixels with similar colors using a threshold. Metric depth estimation uses a bijective mapping between unbounded metric depth values in [0, ∞) and bounded RGB values in [0, 1]³. A power transform (shape parameter λ = −3, scale parameter c = 10/3) first “curves” metric depth values, which are then encoded as a false-color visualization that traverses the edges of the RGB cube, following the structure of a 3D Hilbert curve. This transform is strictly invertible, so the generated depth image decodes cleanly back to physical metric distances. Crucially, no camera parameters — neither intrinsics nor extrinsics — are required at training or inference time. The model infers absolute scale purely from visual cues and world knowledge embedded during pretraining. The depth training data is also entirely synthetic, generated from simulation rendering engines, with zero real-world depth data used. For surface normal estimation, the mapping is more direct: surface normals are unit vectors (x, y, z) ranging from −1.0 to 1.0, which map naturally to RGB channels. Facing-left normals encode as pinkish-red; facing-up normals encode as light green; normals pointing toward the camera encode as light blue/purple. The Numbers: Beating Specialists at Their Own Game Vision Banana’s results across benchmarks — all in zero-shot transfer settings, where the model has never seen any training data from the evaluated datasets — are significant: Semantic segmentation on Cityscapes val: mIoU of 0.699, compared to SAM 3’s 0.652 — a 4.7-point gain. Referring expression segmentation on RefCOCOg UMD val: cIoU of 0.738, edging out SAM 3 Agent’s 0.734. Reasoning segmentation on ReasonSeg val: gIoU of 0.793, beating SAM 3 Agent’s 0.770 — and notably surpassing even non-zero-shot methods trained on in-domain data, including X-SAM. Instance segmentation on SA-Co/Gold: pmF1 of 0.540, on par with DINO-X (0.552), and ahead of Gemini 2.5 (0.461), APE-D (0.369), and OWLv2 (0.420) under zero-shot transfer. Metric depth estimation: average δ1 of 0.882 across six major benchmarks; on the four datasets where Depth Anything V3 was evaluated (NYU, ETH3D, DIODE-Indoor, KITTI), Vision Banana scores 0.929 versus Depth Anything V3’s 0.918 — while using zero real-world training data and no camera parameters. Surface normal estimation: average mean angle error of 18.928° across four datasets, compared to Lotus-2’s 19.642°. On indoor datasets specifically, Vision Banana achieves the lowest mean angle error (15.549°) and lowest median angle error (9.300°) among all compared methods. On generative benchmarks, Vision Banana holds its own against its base model: it achieves a 53.5% win rate against Nano Banana Pro on GenAI-Bench (text-to-image), and a 47.8% win rate on ImgEdit (image editing), where Nano Banana Pro scores 52.2%. Overall, the results confirm that lightweight instruction-tuning does not degrade the model’s generative capabilities. Key Takeaways Image generation pretraining is a generalist vision learner: Just as LLM pretraining unlocks emergent language understanding, Google’s research shows that training on image generation naturally develops powerful internal visual representations that transfer to perception tasks

Google DeepMind Introduces Vision Banana: An Instruction-Tuned Image Generator That Beats SAM 3 on Segmentation and Depth Anything V3 on Metric Depth Estimation Read Post »

AI, Committee, ข่าว, Uncategorized

Cross-Domain Data Selection and Augmentation for Automatic Compliance Detection

admin NU / เมษายน 24, 2026

arXiv:2604.21469v1 Announce Type: new Abstract: Automating the detection of regulatory compliance remains a challenging task due to the complexity and variability of legal texts. Models trained on one regulation often fail to generalise to others. This limitation underscores the need for principled methods to improve cross-domain transfer. We study data selection as a strategy to mitigate negative transfer in compliance detection framed as a natural language inference (NLI) task. Specifically, we evaluate four approaches for selecting augmentation data from a larger source domain: random sampling, Moore-Lewis’s cross-entropy difference, importance weighting, and embedding-based retrieval. We systematically vary the proportion of selected data to analyse its effect on cross-domain adaptation. Our findings demonstrate that targeted data selection substantially reduces negative transfer, offering a practical path toward scalable and reliable compliance automation across heterogeneous regulations.

Cross-Domain Data Selection and Augmentation for Automatic Compliance Detection Read Post »

AI, Committee, ข่าว, Uncategorized

Learning Dynamic Representations and Policies from Multimodal Clinical Time-Series with Informative Missingness

admin NU / เมษายน 24, 2026

arXiv:2604.21235v1 Announce Type: cross Abstract: Multimodal clinical records contain structured measurements and clinical notes recorded over time, offering rich temporal information about the evolution of patient health. Yet these observations are sparse, and whether they are recorded depends on the patient’s latent condition. Observation patterns also differ across modalities, as structured measurements and clinical notes arise under distinct recording processes. While prior work has developed methods that accommodate missingness in clinical time series, how to extract and use the information carried by the observation process itself remains underexplored. We therefore propose a patient representation learning framework for multimodal clinical time series that explicitly leverages informative missingness. The framework combines (1) a multimodal encoder that captures signals from structured and textual data together with their observation patterns, (2) a Bayesian filtering module that updates a latent patient state over time from observed multimodal signals, and (3) downstream modules for offline treatment policy learning and patient outcome prediction based on the learned patient state. We evaluate the framework on ICU sepsis cohorts from MIMIC-III, MIMIC-IV, and eICU. It improves both offline treatment policy learning and adverse outcome prediction, achieving FQE 0.679 versus 0.528 for clinician behavior and AUROC 0.886 for post-72-hour mortality prediction on MIMIC-III.

Learning Dynamic Representations and Policies from Multimodal Clinical Time-Series with Informative Missingness Read Post »

AI, Committee, ข่าว, Uncategorized

Health-care AI is here. We don’t know if it actually helps patients.

admin NU / เมษายน 24, 2026

I don’t need to tell you that AI is everywhere. Or that it is being used, increasingly, in hospitals. Doctors are using AI to help them with notetaking. AI-based tools are trawling through patient records, flagging people who may require certain support or treatments. They are also used to interpret medical exam results and X-rays. A growing number of studies suggest that many of these tools can deliver accurate results. But there’s a bigger question here: Does using them actually translate into better health outcomes for patients? We don’t yet have a good answer. That’s what Jenna Wiens, a computer scientist at the University of Michigan, and Anna Goldenberg of the University of Toronto, argue in a paper published in the journal Nature Medicine this week. Wiens tells me she has spent years investigating how AI might benefit health care. For the first decade of her career she tried to pitch the technology to clinicians. Over the last few years, she says, it’s as though “a switch flipped.” Health-care providers not only appear much more interested in the promise of these technologies, they have also begun rapidly deploying them. The problem is that many providers aren’t rigorously assessing how well they actually work. Take “ambient AI” tools, for example. Also known as AI scribes, they “listen” to conversations between doctors and patients, then transcribe and summarize them. Multiple tools are available, and they are already being widely adopted by health-care providers. A few months ago, a staffer at a major New York medical center who develops AI tools for doctors told me that, anecdotally, medics are “overjoyed” by the technology—it allows them to focus all their attention on their patients during appointments, and it saves them from a lot of time-consuming paperwork. Early studies support these anecdotes and suggest that the tools can reduce clinician burnout. That’s all well and good. But what about patient health outcomes? “[Researchers] have evaluated provider or clinician and patient satisfaction, but not really how these tools are affecting clinical decision-making,” says Wiens. “We just don’t know.” The same holds true for other AI-based technologies used in health-care settings. Some are used to predict patients’ health trajectories, others to recommend treatments. They are designed to make health care more effective and efficient. But even a tool that is “accurate” won’t necessarily improve health outcomes. AI might speed up the interpretation of a chest X-ray, for example. But how much will a doctor rely on its analysis? How will that tool affect the way a doctor interacts with patients or recommends treatment? And ultimately: What will this mean for those patients? The answers to those questions might vary between hospitals or departments and could depend on clinical workflows, says Wiens. They might also differ between doctors at various stages of their careers. Take the AI scribes, as another example. Some research on AI use in education suggests that such tools can impact the way people cognitively process information. Could they affect the way a doctor processes a patient’s information? Will the tools affect the way medical students think about patient data in a way that impacts care? These questions need to be explored, says Wiens. “We like things that save us time, but we have to think about the unintended consequences of this,” she says. In a study published in January 2025, Paige Nong at the University of Minnesota and her colleagues found that around 65% of US hospitals used AI-assisted predictive tools. Only two-thirds of those hospitals evaluated their accuracy. Even fewer assessed them for bias. The number of hospitals using these tools has probably increased since then, says Wiens. Those hospitals, or entities other than the companies developing the tools, need to evaluate how much they help in specific settings. There’s a possibility that they could leave patients worse off, although it’s more likely that AI tools just aren’t as beneficial as health-care providers might assume they are, says Wiens. “I do believe in the potential of AI to really improve clinical care,” says Wiens, who stresses that she doesn’t want to stop the adoption of AI tools in health care. She just wants more information about how they are affecting people. “I have to believe that in the future it’s not all AI or no AI,” she says. “It’s somewhere in between.” This article first appeared in The Checkup, MIT Technology Review’s weekly biotech newsletter. To receive it in your inbox every Thursday, and read articles like this first, sign up here.

Health-care AI is here. We don’t know if it actually helps patients. Read Post »

Uncategorized

RAG Without Vectors: How PageIndex Retrieves by Reasoning

Top 7 Benchmarks That Actually Matter for Agentic Reasoning in Large Language Models

A Coding Implementation on Microsoft’s OpenMementos with Trace Structure Analysis, Context Compression, and Fine-Tuning Data Preparation

Three reasons why DeepSeek’s new model matters

A Coding Implementation on Deepgram Python SDK for Transcription, Text-to-Speech, Async Audio Processing, and Text Intelligence

Meet GitNexus: An Open-Source MCP-Native Knowledge Graph Engine That Gives Claude Code and Cursor Full Codebase Structural Awareness

Google DeepMind Introduces Vision Banana: An Instruction-Tuned Image Generator That Beats SAM 3 on Segmentation and Depth Anything V3 on Metric Depth Estimation

Cross-Domain Data Selection and Augmentation for Automatic Compliance Detection

Learning Dynamic Representations and Policies from Multimodal Clinical Time-Series with Informative Missingness

Health-care AI is here. We don’t know if it actually helps patients.

บริการของเรา

หน้าแรก

วิธีการทำงาน

ข่าว

แพ็กเกจราคา

ฝ่ายสนับสนุน

ศูนย์ช่วยเหลือ

รายงานปัญหา

ให้ความคิดเห็น

นโยบายความเป็นส่วนตัว

บัญชีผู้ใช้

ติดตามเรา