YouZum

AI

AI, Committee, ข่าว, Uncategorized

Anthropic Releases Claude Opus 4.7: A Major Upgrade for Agentic Coding, High-Resolution Vision, and Long-Horizon Autonomous Tasks

Anthropic has launched Claude Opus 4.7, it’s latest frontier model and a direct successor to Claude Opus 4.6. The release is positioned as a focused improvement rather than a full generational leap, but the gains it delivers are substantial in the areas that matter most to developers building real-world AI-powered applications: agentic software engineering, multimodal reasoning, and long-running autonomous task execution. https://www.anthropic.com/news/claude-opus-4-7 What Exactly is Claude Opus 4.7? Anthropic maintains a model family with tiers — Haiku (fast and lightweight), Sonnet (balanced), and Opus (highest capability). Opus 4.7 sits at the top of this stack, below only the newly previewed Claude Mythos, which Anthropic has kept in a restricted release. Opus 4.7 represents a notable improvement on Opus 4.6 in advanced software engineering, with particular gains on the most difficult tasks. Crucially, users report being able to hand off their hardest coding work — the kind that previously needed close supervision — to Opus 4.7 with confidence, as it handles complex, long-running tasks with rigor and consistency, pays precise attention to instructions, and devises ways to verify its own outputs before reporting back. The model verifying its own outputs is a meaningful behavioral shift. Earlier models often produced results without internal sanity checks; Opus 4.7 appears to close that loop autonomously, which has significant implications for CI/CD pipelines and multi-step agentic workflows. Stronger Coding Benchmarks Early testers have put some sharp numbers on the coding improvements. On a 93-task coding benchmark, Opus 4.7 lifted resolution by 13% over Opus 4.6, including four tasks that neither Opus 4.6 nor Sonnet 4.6 could solve. On CursorBench — a widely-used developer evaluation harness — Opus 4.7 cleared 70% versus Opus 4.6 at 58%. And for complex multi-step workflows, one tester observed a 14% gain over Opus 4.6 at fewer tokens and a third of the tool errors — and notably, Opus 4.7 was the first model to pass their implicit-need tests, continuing to execute through tool failures that used to stop Opus cold. Improved Vision: 3× the Resolution of Prior Models One of the most technically concrete upgrades in Opus 4.7 is its multimodal capability. Opus 4.7 can now accept images up to 2,576 pixels on the long edge (~3.75 megapixels), more than three times as many pixels as prior Claude models. Many real-world applications — from computer-use agents reading dense UI screenshots to data extraction from complex engineering diagrams — fail not because the model lacks reasoning ability, but because it can’t resolve fine visual detail. This opens up a wealth of multimodal uses that depend on fine visual detail: computer-use agents reading dense screenshots, data extractions from complex diagrams, and work that needs pixel-perfect references. The impact in production has already been dramatic. One tester working on computer-use workflows reported that Opus 4.7 scored 98.5% on their visual-acuity benchmark versus 54.5% for Opus 4.6 — effectively eliminating their single biggest Opus pain point. This is a model-level change rather than an API parameter, so images users send to Claude will simply be processed at higher fidelity — though because higher-resolution images consume more tokens, users who don’t require the extra detail can downsample images before sending them to the model. https://www.anthropic.com/news/claude-opus-4-7 A New Effort Level: xhigh, Plus Task Budgets Developers working with the Claude API will notice two new levers for controlling compute spend. First, Opus 4.7 introduces a new xhigh (‘extra high’) effort level between high and max, giving users finer control over the tradeoff between reasoning and latency on hard problems. In Claude Code, Anthropic team has raised the default effort level to xhigh for all plans. When testing Opus 4.7 for coding and agentic use cases, Anthropic recommends starting with high or xhigh effort. Second, task budgets are now launching in public beta on the Claude Platform API, giving developers a way to guide Claude’s token spend so it can prioritize work across longer runs. Together, these two controls give developer teams meaningful production levers — especially relevant when running parallelized agent pipelines where per-call cost and latency must be managed carefully. New in Claude Code: /ultrareview and Auto Mode for Max Users Two new Claude Code features ship alongside Opus 4.7 that are worth flagging for devs who use it as part of their development workflow. The new /ultrareview slash command produces a dedicated review session that reads through changes and flags bugs and design issues that a careful reviewer would catch. Anthropic is giving Pro and Max Claude Code users three free ultrareviews to try it out. Think of it as a senior engineer review pass on demand — useful before merging complex PRs or shipping to production. Additionally, auto mode has been extended to Max users. Auto mode is a new permissions option where Claude makes decisions on your behalf, meaning that you can run longer tasks with fewer interruptions — and with less risk than if you had chosen to skip all permissions. This is particularly valuable for agents executing multi-step tasks overnight or across large codebases. File System-Based Memory for Long Multi-Session Work A less-discussed but operationally significant improvement is how Opus 4.7 handles memory. Opus 4.7 is better at using file system-based memory — it remembers important notes across long, multi-session work and uses them to move on to new tasks that, as a result, need less up-front context. On third-party benchmarks, the model also achieved state-of-the-art results on GDPval-AA, a third-party evaluation of economically valuable knowledge work across finance, legal, and other domains. Key Takeaways Claude Opus 4.7 is Anthropic’s strongest coding model to date, handling complex, long-running agentic tasks with far less supervision than Opus 4.6 — and uniquely verifies its own outputs before reporting back. Vision capability has tripled, with support for images up to ~3.75 megapixels, making it significantly more reliable for computer-use agents, diagram parsing, and any workflow that depends on fine visual detail. A new xhigh effort level and task budgets give developers precise control over the reasoning-vs-latency tradeoff and token spend — critical levers

Anthropic Releases Claude Opus 4.7: A Major Upgrade for Agentic Coding, High-Resolution Vision, and Long-Horizon Autonomous Tasks Read Post »

AI, Committee, ข่าว, Uncategorized

A Coding Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and RAG

In this tutorial, we implement how to run the Bonsai 1-bit large language model efficiently using GPU acceleration and PrismML’s optimized GGUF deployment stack. We set up the environment, install the required dependencies, and download the prebuilt llama.cpp binaries, and load the Bonsai-1.7B model for fast inference on CUDA. As we progress, we examine how 1-bit quantization works under the hood, why the Q1_0_g128 format is so memory-efficient, and how this makes Bonsai practical for lightweight yet capable language model deployment. We also test core inference, benchmarking, multi-turn chat, structured JSON generation, code generation, OpenAI-compatible server mode, and a small retrieval-augmented generation workflow, giving us a complete, hands-on view of how Bonsai operates in real-world use. Copy CodeCopiedUse a different Browser import os, sys, subprocess, time, json, urllib.request, tarfile, textwrap try: import google.colab IN_COLAB = True except ImportError: IN_COLAB = False def section(title): bar = “═” * 60 print(f”n{bar}n {title}n{bar}”) section(“1 · Environment & GPU Check”) def run(cmd, capture=False, check=True, **kw): return subprocess.run( cmd, shell=True, capture_output=capture, text=True, check=check, **kw ) gpu_info = run(“nvidia-smi –query-gpu=name,memory.total,driver_version –format=csv,noheader”, capture=True, check=False) if gpu_info.returncode == 0: print(” GPU detected:”, gpu_info.stdout.strip()) else: print(” No GPU found — inference will run on CPU (much slower).”) cuda_check = run(“nvcc –version”, capture=True, check=False) if cuda_check.returncode == 0: for line in cuda_check.stdout.splitlines(): if “release” in line: print(” CUDA:”, line.strip()) break print(f” Python {sys.version.split()[0]} | Platform: Linux (Colab)”) section(“2 · Installing Python Dependencies”) run(“pip install -q huggingface_hub requests tqdm openai”) print(” huggingface_hub, requests, tqdm, openai installed”) from huggingface_hub import hf_hub_download We begin by importing the core Python modules that we need for system operations, downloads, timing, and JSON handling. We check whether we are running inside Google Colab, define a reusable section printer, and create a helper function to run shell commands cleanly from Python. We then verify the GPU and CUDA environment, print the Python runtime details, install the required Python dependencies, and prepare the Hugging Face download utility for the next stages. Copy CodeCopiedUse a different Browser section(“3 · Downloading PrismML llama.cpp Prebuilt Binaries”) RELEASE_TAG = “prism-b8194-1179bfc” BASE_URL = f”https://github.com/PrismML-Eng/llama.cpp/releases/download/{RELEASE_TAG}” BIN_DIR = “/content/bonsai_bin” os.makedirs(BIN_DIR, exist_ok=True) def detect_cuda_build(): r = run(“nvcc –version”, capture=True, check=False) for line in r.stdout.splitlines(): if “release” in line: try: ver = float(line.split(“release”)[-1].strip().split(“,”)[0].strip()) if ver >= 13.0: return “13.1” if ver >= 12.6: return “12.8” return “12.4” except ValueError: pass return “12.4” cuda_build = detect_cuda_build() print(f” Detected CUDA build slot: {cuda_build}”) TAR_NAME = f”llama-{RELEASE_TAG}-bin-linux-cuda-{cuda_build}-x64.tar.gz” TAR_URL = f”{BASE_URL}/{TAR_NAME}” tar_path = f”/tmp/{TAR_NAME}” if not os.path.exists(f”{BIN_DIR}/llama-cli”): print(f” Downloading: {TAR_URL}”) urllib.request.urlretrieve(TAR_URL, tar_path) print(” Extracting …”) with tarfile.open(tar_path, “r:gz”) as t: t.extractall(BIN_DIR) for fname in os.listdir(BIN_DIR): fp = os.path.join(BIN_DIR, fname) if os.path.isfile(fp): os.chmod(fp, 0o755) print(f” Binaries extracted to {BIN_DIR}”) bins = sorted(f for f in os.listdir(BIN_DIR) if os.path.isfile(os.path.join(BIN_DIR, f))) print(” Available:”, “, “.join(bins)) else: print(f” Binaries already present at {BIN_DIR}”) LLAMA_CLI = f”{BIN_DIR}/llama-cli” LLAMA_SERVER = f”{BIN_DIR}/llama-server” test = run(f”{LLAMA_CLI} –version”, capture=True, check=False) if test.returncode == 0: print(f” llama-cli version: {test.stdout.strip()[:80]}”) else: print(f” llama-cli test failed: {test.stderr.strip()[:200]}”) section(“4 · Downloading Bonsai-1.7B GGUF Model”) MODEL_REPO = “prism-ml/Bonsai-1.7B-gguf” MODEL_DIR = “/content/bonsai_models” GGUF_FILENAME = “Bonsai-1.7B.gguf” os.makedirs(MODEL_DIR, exist_ok=True) MODEL_PATH = os.path.join(MODEL_DIR, GGUF_FILENAME) if not os.path.exists(MODEL_PATH): print(f” Downloading {GGUF_FILENAME} (~248 MB) from HuggingFace …”) MODEL_PATH = hf_hub_download( repo_id=MODEL_REPO, filename=GGUF_FILENAME, local_dir=MODEL_DIR, ) print(f” Model saved to: {MODEL_PATH}”) else: print(f” Model already cached: {MODEL_PATH}”) size_mb = os.path.getsize(MODEL_PATH) / 1e6 print(f” File size on disk: {size_mb:.1f} MB”) section(“5 · Core Inference Helpers”) DEFAULT_GEN_ARGS = dict( temp=0.5, top_p=0.85, top_k=20, repeat_penalty=1.0, n_predict=256, n_gpu_layers=99, ctx_size=4096, ) def build_llama_cmd(prompt, system_prompt=”You are a helpful assistant.”, **overrides): args = {**DEFAULT_GEN_ARGS, **overrides} formatted = ( f”<|im_start|>systemn{system_prompt}<|im_end|>n” f”<|im_start|>usern{prompt}<|im_end|>n” f”<|im_start|>assistantn” ) safe_prompt = formatted.replace(‘”‘, ‘\”‘) return ( f'{LLAMA_CLI} -m “{MODEL_PATH}”‘ f’ -p “{safe_prompt}”‘ f’ -n {args[“n_predict”]}’ f’ –temp {args[“temp”]}’ f’ –top-p {args[“top_p”]}’ f’ –top-k {args[“top_k”]}’ f’ –repeat-penalty {args[“repeat_penalty”]}’ f’ -ngl {args[“n_gpu_layers”]}’ f’ -c {args[“ctx_size”]}’ f’ –no-display-prompt’ f’ -e’ ) def infer(prompt, system_prompt=”You are a helpful assistant.”, verbose=True, **overrides): cmd = build_llama_cmd(prompt, system_prompt, **overrides) t0 = time.time() result = run(cmd, capture=True, check=False) elapsed = time.time() – t0 output = result.stdout.strip() if verbose: print(f”n{‘─’*50}”) print(f”Prompt : {prompt[:100]}{‘…’ if len(prompt) > 100 else ”}”) print(f”{‘─’*50}”) print(output) print(f”{‘─’*50}”) print(f” {elapsed:.2f}s | ~{len(output.split())} words”) return output, elapsed print(” Inference helpers ready.”) section(“6 · Basic Inference — Hello, Bonsai!”) infer(“What makes 1-bit language models special compared to standard models?”) We download and prepare the PrismML prebuilt llama.cpp CUDA binaries that power local inference for the Bonsai model. We detect the available CUDA version, choose the matching binary build, extract the downloaded archive, make the files executable, and verify that the llama-cli binary works correctly. After that, we download the Bonsai-1.7B GGUF model from Hugging Face, set up the model path, define the default generation settings, and build the core helper functions that format prompts and run inference. Copy CodeCopiedUse a different Browser section(“7 · Q1_0_g128 Quantization — What’s Happening Under the Hood”) print(textwrap.dedent(“”” ╔══════════════════════════════════════════════════════════════╗ ║ Bonsai Q1_0_g128 Weight Representation ║ ╠══════════════════════════════════════════════════════════════╣ ║ Each weight = 1 bit: 0 → −scale ║ ║ 1 → +scale ║ ║ Every 128 weights share one FP16 scale factor. ║ ║ ║ ║ Effective bits per weight: ║ ║ 1 bit (sign) + 16/128 bits (shared scale) = 1.125 bpw ║ ║ ║ ║ Memory comparison for Bonsai-1.7B: ║ ║ FP16: 3.44 GB (1.0× baseline) ║ ║ Q1_0_g128: 0.24 GB (14.2× smaller!) ║ ║ MLX 1-bit g128: 0.27 GB (12.8× smaller) ║ ╚══════════════════════════════════════════════════════════════╝ “””)) print(” Python demo of Q1_0_g128 quantization logic:n”) import random random.seed(42) GROUP_SIZE = 128 weights_fp16 = [random.gauss(0, 0.1) for _ in range(GROUP_SIZE)] scale = max(abs(w) for w in weights_fp16) quantized = [1 if w >= 0 else 0 for w in weights_fp16] dequantized = [scale if b == 1 else -scale for b in quantized] mse = sum((a – b) ** 2 for a, b in zip(weights_fp16, dequantized)) / GROUP_SIZE print(f” FP16 weights (first 8): {[f'{w:.4f}’ for w in weights_fp16[:8]]}”) print(f” 1-bit repr (first 8): {quantized[:8]}”) print(f” Shared scale: {scale:.4f}”) print(f” Dequantized (first 8): {[f'{w:.4f}’ for w in dequantized[:8]]}”) print(f” MSE of reconstruction: {mse:.6f}”) memory_fp16 = GROUP_SIZE * 2 memory_1bit = GROUP_SIZE / 8 + 2 print(f”n Memory: FP16={memory_fp16}B vs Q1_0_g128={memory_1bit:.1f}B ” f”({memory_fp16/memory_1bit:.1f}× reduction)”) section(“8 ·

A Coding Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and RAG Read Post »

AI, Committee, ข่าว, Uncategorized

A Coding Guide for Property-Based Testing Using Hypothesis with Stateful, Differential, and Metamorphic Test Design

In this tutorial, we explore property-based testing using Hypothesis and build a rigorous testing pipeline that goes far beyond traditional unit testing. We implement invariants, differential testing, metamorphic testing, targeted exploration, and stateful testing to validate both functional correctness and behavioral guarantees of our systems. Instead of manually crafting edge cases, we let Hypothesis generate structured inputs, shrink failures to minimal counterexamples, and systematically uncover hidden bugs. Also, we demonstrate how modern testing practices can be integrated directly into experimental and research-driven workflows. Copy CodeCopiedUse a different Browser import sys, textwrap, subprocess, os, re, math !{sys.executable} -m pip -q install hypothesis pytest test_code = r”’ import re, math import pytest from hypothesis import ( given, assume, example, settings, note, target, HealthCheck, Phase ) from hypothesis import strategies as st from hypothesis.stateful import RuleBasedStateMachine, rule, invariant, initialize, precondition def clamp(x: int, lo: int, hi: int) -> int: if x < lo: return lo if x > hi: return hi return x def normalize_whitespace(s: str) -> str: return ” “.join(s.split()) def is_sorted_non_decreasing(xs): return all(xs[i] <= xs[i+1] for i in range(len(xs)-1)) def merge_sorted(a, b): i = j = 0 out = [] while i < len(a) and j < len(b): if a[i] <= b[j]: out.append(a[i]); i += 1 else: out.append(b[j]); j += 1 out.extend(a[i:]) out.extend(b[j:]) return out def merge_sorted_reference(a, b): return sorted(list(a) + list(b)) We set up the environment by installing Hypothesis and pytest and importing all required modules. We begin constructing the full test suite by defining core utility functions such as clamp, normalize_whitespace, and merge_sorted. We establish the functional foundation that our property-based tests will rigorously validate in later snippets. Copy CodeCopiedUse a different Browser def safe_parse_int(s: str): t = s.strip() if re.fullmatch(r”[+-]?d+”, t) is None: return (False, “not_an_int”) if len(t.lstrip(“+-“)) > 2000: return (False, “too_big”) try: return (True, int(t)) except Exception: return (False, “parse_error”) def safe_parse_int_alt(s: str): t = s.strip() if not t: return (False, “not_an_int”) sign = 1 if t[0] == “+”: t = t[1:] elif t[0] == “-“: sign = -1 t = t[1:] if not t or any(ch < “0” or ch > “9” for ch in t): return (False, “not_an_int”) if len(t) > 2000: return (False, “too_big”) val = 0 for ch in t: val = val * 10 + (ord(ch) – 48) return (True, sign * val) bounds = st.tuples(st.integers(-10_000, 10_000), st.integers(-10_000, 10_000)).map( lambda t: (t[0], t[1]) if t[0] <= t[1] else (t[1], t[0]) ) @st.composite def int_like_strings(draw): sign = draw(st.sampled_from([“”, “+”, “-“])) digits = draw(st.text(alphabet=st.characters(min_codepoint=48, max_codepoint=57), min_size=1, max_size=300)) left_ws = draw(st.text(alphabet=[” “, “t”, “n”], min_size=0, max_size=5)) right_ws = draw(st.text(alphabet=[” “, “t”, “n”], min_size=0, max_size=5)) return f”{left_ws}{sign}{digits}{right_ws}” sorted_lists = st.lists(st.integers(-10_000, 10_000), min_size=0, max_size=200).map(sorted) We implement parsing logic and define structured strategies that generate constrained, meaningful test inputs. We create composite strategies such as int_like_strings to precisely control the input space for property validation. We prepare sorted list generators and bounds strategies that enable differential and invariant-based testing. Copy CodeCopiedUse a different Browser @settings(max_examples=300, suppress_health_check=[HealthCheck.too_slow]) @given(x=st.integers(-50_000, 50_000), b=bounds) def test_clamp_within_bounds(x, b): lo, hi = b y = clamp(x, lo, hi) assert lo <= y <= hi @settings(max_examples=300, suppress_health_check=[HealthCheck.too_slow]) @given(x=st.integers(-50_000, 50_000), b=bounds) def test_clamp_idempotent(x, b): lo, hi = b y = clamp(x, lo, hi) assert clamp(y, lo, hi) == y @settings(max_examples=250) @given(s=st.text()) @example(” attb n c “) def test_normalize_whitespace_is_idempotent(s): t = normalize_whitespace(s) assert normalize_whitespace(t) == t assert normalize_whitespace(” nt ” + s + ” t”) == normalize_whitespace(s) @settings(max_examples=250, suppress_health_check=[HealthCheck.too_slow]) @given(a=sorted_lists, b=sorted_lists) def test_merge_sorted_matches_reference(a, b): out = merge_sorted(a, b) ref = merge_sorted_reference(a, b) assert out == ref assert is_sorted_non_decreasing(out) We define core property tests that validate correctness and idempotence across multiple functions. We use Hypothesis decorators to automatically explore edge cases and verify behavioral guarantees such as boundary constraints and deterministic normalization. We also implement differential testing to ensure our merge implementation matches a trusted reference. Copy CodeCopiedUse a different Browser @settings(max_examples=250, deadline=200, suppress_health_check=[HealthCheck.too_slow]) @given(s=int_like_strings()) def test_two_parsers_agree_on_int_like_strings(s): ok1, v1 = safe_parse_int(s) ok2, v2 = safe_parse_int_alt(s) assert ok1 and ok2 assert v1 == v2 @settings(max_examples=250) @given(s=st.text(min_size=0, max_size=200)) def test_safe_parse_int_rejects_non_ints(s): t = s.strip() m = re.fullmatch(r”[+-]?d+”, t) ok, val = safe_parse_int(s) if m is None: assert ok is False else: if len(t.lstrip(“+-“)) > 2000: assert ok is False and val == “too_big” else: assert ok is True and isinstance(val, int) def variance(xs): if len(xs) < 2: return 0.0 mu = sum(xs) / len(xs) return sum((x – mu) ** 2 for x in xs) / (len(xs) – 1) @settings(max_examples=250, phases=[Phase.generate, Phase.shrink]) @given(xs=st.lists(st.integers(-1000, 1000), min_size=0, max_size=80)) def test_statistics_sanity(xs): target(variance(xs)) if len(xs) == 0: assert variance(xs) == 0.0 elif len(xs) == 1: assert variance(xs) == 0.0 else: v = variance(xs) assert v >= 0.0 k = 7 assert math.isclose(variance([x + k for x in xs]), v, rel_tol=1e-12, abs_tol=1e-12) We extend our validation to parsing robustness and statistical correctness using targeted exploration. We verify that two independent integer parsers agree on structured inputs and enforce rejection rules on invalid strings. We further implement metamorphic testing by validating invariants of variance under transformation. Copy CodeCopiedUse a different Browser class Bank: def __init__(self): self.balance = 0 self.ledger = [] def deposit(self, amt: int): if amt <= 0: raise ValueError(“deposit must be positive”) self.balance += amt self.ledger.append((“dep”, amt)) def withdraw(self, amt: int): if amt <= 0: raise ValueError(“withdraw must be positive”) if amt > self.balance: raise ValueError(“insufficient funds”) self.balance -= amt self.ledger.append((“wd”, amt)) def replay_balance(self): bal = 0 for typ, amt in self.ledger: bal += amt if typ == “dep” else -amt return bal class BankMachine(RuleBasedStateMachine): def __init__(self): super().__init__() self.bank = Bank() @initialize() def init(self): assert self.bank.balance == 0 assert self.bank.replay_balance() == 0 @rule(amt=st.integers(min_value=1, max_value=10_000)) def deposit(self, amt): self.bank.deposit(amt) @precondition(lambda self: self.bank.balance > 0) @rule(amt=st.integers(min_value=1, max_value=10_000)) def withdraw(self, amt): assume(amt <= self.bank.balance) self.bank.withdraw(amt) @invariant() def balance_never_negative(self): assert self.bank.balance >= 0 @invariant() def ledger_replay_matches_balance(self): assert self.bank.replay_balance() == self.bank.balance TestBankMachine = BankMachine.TestCase ”’ path = “/tmp/test_hypothesis_advanced.py” with open(path, “w”, encoding=”utf-8″) as f: f.write(test_code) print(“Hypothesis version:”, __import__(“hypothesis”).__version__) print(“nRunning pytest on:”, path, “n”) res = subprocess.run([sys.executable, “-m”, “pytest”, “-q”, path], capture_output=True, text=True) print(res.stdout) if res.returncode != 0: print(res.stderr) if res.returncode == 0: print(“nAll Hypothesis tests passed.”) elif res.returncode == 5:

A Coding Guide for Property-Based Testing Using Hypothesis with Stateful, Differential, and Metamorphic Test Design Read Post »

AI, Committee, ข่าว, Uncategorized

NVIDIA Releases Ising: the First Open Quantum AI Model Family for Hybrid Quantum-Classical Systems

Quantum computing has spent years living in the future tense. Hardware has improved, research has compounded, and venture dollars have followed — but the gap between a quantum processor running in a lab and one running a real-world application remains stubbornly wide. NVIDIA moved to close that gap with the launch of NVIDIA Ising, the world’s first family of open quantum AI models specifically designed to help researchers and enterprises build quantum processors capable of running useful applications. Here’s the core problem Ising is designed to solve: quantum computers are extraordinarily sensitive. Their fundamental unit of computation, the qubit, is so easily disturbed by environmental noise that errors accumulate rapidly during computation. Before you can run anything meaningful on a quantum processor, two things have to work well — calibration (making sure the hardware is tuned and operating correctly) and error correction (detecting and fixing errors as they occur in real time). Both of these have historically been manual, slow, and difficult to scale. NVIDIA is betting that AI can automate both. What the Ising Model Family Actually Includes NVIDIA Ising includes two distinct components: Ising Calibration and Ising Decoding. Ising Calibration is a vision language model — a model architecture familiar to anyone who has worked with multimodal AI — that is designed to rapidly interpret and react to measurements from quantum processors. Think of it as an AI agent that continuously watches diagnostic readouts from quantum hardware and autonomously adjusts the system to keep it running optimally. This enables AI agents to automate continuous calibration, reducing the time needed from days to hours. That’s not a minor speedup — in quantum hardware development, days of calibration time between experiments is a major bottleneck. Ising Decoding comes in two variants of a 3D convolutional neural network (3D CNN) model, each optimized for different trade-offs: one tuned for speed and the other tuned for accuracy. These models perform real-time decoding for quantum error correction. If you’ve worked with signal processing or sequence modeling, error correction decoding is conceptually similar — you’re trying to infer what the ‘correct’ state of the system should be, given noisy observations. Ising Decoding models are up to 2.5x faster and 3x more accurate than pyMatching, the current open-source industry standard. The Ecosystem Is Already Moving Ising Calibration is already in use by Atom Computing, Academia Sinica, EeroQ, Conductor Quantum, Fermi National Accelerator Laboratory, Harvard John A. Paulson School of Engineering and Applied Sciences, Infleqtion, IonQ, IQM Quantum Computers, Lawrence Berkeley National Laboratory’s Advanced Quantum Testbed, Q-CTRL, and the U.K. National Physical Laboratory. Ising Decoding is being deployed by Cornell University, EdenCode, Infleqtion, IQM Quantum Computers, Quantum Elements, Sandia National Laboratories, SEEQC, University of California San Diego, UC Santa Barbara, University of Chicago, University of Southern California, and Yonsei University. That’s a remarkably broad day-one adoption spanning national labs, Ivy League institutions, and commercial quantum hardware companies across multiple qubit modalities. How It Fits Into NVIDIA’s Quantum Stack NVIDIA Ising complements the NVIDIA CUDA-Q software platform for hybrid quantum-classical computing and integrates with the NVIDIA NVQLink QPU-GPU hardware interconnect for real-time control and quantum error correction. CUDA-Q is NVIDIA’s broader programming model for hybrid quantum-classical workflows — if you’ve written CUDA kernels for GPU acceleration, CUDA-Q follows a similar philosophy of tightly coupling classical and accelerated compute. NVQLink is the hardware bridge that lets GPUs communicate with quantum processing units (QPUs) at the latency required for real-time error correction. Key Takeaways NVIDIA Ising is the world’s first family of open quantum AI models, purpose-built to solve the two hardest engineering problems blocking practical quantum computing — calibration and error correction — using AI instead of slow, manual processes. Ising Calibration uses a vision language model to autonomously tune quantum processors, reducing the time required for continuous calibration from days to hours by enabling AI agents to interpret and react to hardware measurements in real time. Ising Decoding uses a 3D convolutional neural network (3D CNN) to perform real-time quantum error correction, delivering up to 2.5x faster performance and 3x higher accuracy compared to pyMatching. Adoption is already broad and diverse on day one, with leading institutions including Fermi National Accelerator Laboratory, Harvard, Lawrence Berkeley National Laboratory’s Advanced Quantum Testbed, IQM Quantum Computers, Sandia National Laboratories, and over a dozen universities and enterprises deploying Ising Calibration and Ising Decoding across multiple qubit modalities. Ising integrates directly into NVIDIA’s full quantum-classical software and hardware stack, complementing the NVIDIA CUDA-Q platform for hybrid quantum-classical computing and the NVIDIA NVQLink QPU-GPU hardware interconnect, with models available on GitHub, Hugging Face, and build.nvidia.com and fine-tunable via NVIDIA NIM microservices. Check out the Technical details and Product Page here. Also, feel free to follow us on Twitter and don’t forget to join our 130k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well. Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us The post NVIDIA Releases Ising: the First Open Quantum AI Model Family for Hybrid Quantum-Classical Systems appeared first on MarkTechPost.

NVIDIA Releases Ising: the First Open Quantum AI Model Family for Hybrid Quantum-Classical Systems Read Post »

AI, Committee, ข่าว, Uncategorized

xAI Launches Standalone Grok Speech-to-Text and Text-to-Speech APIs, Targeting Enterprise Voice Developers

Elon Musk’s AI company xAI has launched two standalone audio APIs — a Speech-to-Text (STT) API and a Text-to-Speech (TTS) API — both built on the same infrastructure that powers Grok Voice on mobile apps, Tesla vehicles, and Starlink customer support. The release moves xAI squarely into the competitive speech API market currently occupied by ElevenLabs, Deepgram, and AssemblyAI. What Is the Grok Speech-to-Text API? Speech-to-Text is the technology that converts spoken audio into written text. For developers building meeting transcription tools, voice agents, call center analytics, or accessibility features, an STT API is a core building block. Rather than developing this from scratch, developers call an endpoint, send audio, and receive a structured transcript in return. The Grok STT API is now generally available, offering transcription across 25 languages with both batch and streaming modes. The batch mode is designed for processing pre-recorded audio files, while streaming enables real-time transcription as audio is captured. Pricing is kept straightforward: Speech-to-Text is $0.10 per hour for batch and $0.20 per hour for streaming. The API includes word-level timestamps, speaker diarization, and multichannel support, along with intelligent Inverse Text Normalization that correctly handles numbers, dates, currencies, and more. It also accepts 12 audio formats — nine container formats (WAV, MP3, OGG, Opus, FLAC, AAC, MP4, M4A, MKV) and three raw formats (PCM, µ-law, A-law), with a maximum file size of 500 MB per request. Speaker diarization is the process of separating audio by individual speakers — answering the question ‘who said what.’ This is critical for multi-speaker recordings like meetings, interviews, or customer calls. Word-level timestamps assign precise start and end times to each word in the transcript, enabling use cases like subtitle generation, searchable recordings, and legal documentation. Inverse Text Normalization converts spoken forms like ‘one hundred sixty-seven thousand nine hundred eighty-three dollars and fifteen cents’ into readable structured output: “$167,983.15.”. Benchmark Performance xAI research team is making strong claims on accuracy. On phone call entity recognition — names, account numbers, dates — Grok STT claims a 5.0% error rate versus ElevenLabs at 12.0%, Deepgram at 13.5%, and AssemblyAI at 21.3%. That is a substantial margin if it holds in production. For video and podcast transcription, Grok and ElevenLabs tied at a 2.4% error rate, with Deepgram and AssemblyAI trailing at 3.0% and 3.2% respectively. xAI team also reports a 6.9% word error rate on general audio benchmarks. https://x.ai/news/grok-stt-and-tts-apis https://x.ai/news/grok-stt-and-tts-apis What is the Grok Text-to-Speech API? Text-to-Speech converts written text into spoken audio. Developers use TTS APIs to power voice assistants, read-aloud features, podcast generation, IVR (interactive voice response) systems, and accessibility tools. The Grok TTS API delivers fast, natural speech synthesis with detailed control via speech tags, and is priced at $4.20 per 1 million characters. The API accepts up to 15,000 characters per REST request; for longer content, a WebSocket streaming endpoint is available that has no text length limit and begins returning audio before the full input is processed. The API supports 20 languages and five distinct voices: Ara, Eve, Leo, Rex, and Sal — with Eve set as the default. Beyond voice selection, developers can inject inline and wrapping speech tags to control delivery. These include inline tags like [laugh], [sigh], and [breath], and wrapping tags like <whisper>text</whisper> and <emphasis>text</emphasis>, letting developers create engaging, lifelike delivery without complex markup. This expressiveness addresses one of the core limitations of traditional TTS systems, which often produce technically correct but emotionally flat output. Key Takeaways xAI has launched two standalone audio APIs — Grok Speech-to-Text (STT) and Text-to-Speech (TTS) — built on the same production stack already serving millions of users across Grok mobile apps, Tesla vehicles, and Starlink customer support. The Grok STT API offers real-time and batch transcription across 25 languages with speaker diarization, word-level timestamps, Inverse Text Normalization, and support for 12 audio formats — priced at $0.10/hour for batch and $0.20/hour for streaming. On phone call entity recognition benchmarks, Grok STT reports a 5.0% error rate, significantly outperforming ElevenLabs (12.0%), Deepgram (13.5%), and AssemblyAI (21.3%), with particularly strong performance in medical, legal, and financial use cases. The Grok TTS API supports five expressive voices (Ara, Eve, Leo, Rex, Sal) across 20 languages, with inline and wrapping speech tags like [laugh], [sigh], and <whisper> giving developers fine-grained control over vocal delivery — priced at $4.20 per 1 million characters. Check out the Technical details here. Also, feel free to follow us on Twitter and don’t forget to join our 130k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well. Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us The post xAI Launches Standalone Grok Speech-to-Text and Text-to-Speech APIs, Targeting Enterprise Voice Developers appeared first on MarkTechPost.

xAI Launches Standalone Grok Speech-to-Text and Text-to-Speech APIs, Targeting Enterprise Voice Developers Read Post »

AI, Committee, ข่าว, Uncategorized

A End-to-End Coding Guide to Running OpenAI GPT-OSS Open-Weight Models with Advanced Inference Workflows

In this tutorial, we explore how to run OpenAI’s open-weight GPT-OSS models in Google Colab with a strong focus on their technical behavior, deployment requirements, and practical inference workflows. We begin by setting up the exact dependencies needed for Transformers-based execution, verifying GPU availability, and loading openai/gpt-oss-20b with the correct configuration using native MXFP4 quantization, torch.bfloat16 activations. As we move through the tutorial, we work directly with core capabilities such as structured generation, streaming, multi-turn dialogue handling, tool execution patterns, and batch inference, while keeping in mind how open-weight models differ from closed-hosted APIs in terms of transparency, controllability, memory constraints, and local execution trade-offs. Also, we treat GPT-OSS not just as a chatbot, but as a technically inspectable open-weight LLM stack that we can configure, prompt, and extend inside a reproducible workflow. Copy CodeCopiedUse a different Browser print(” Step 1: Installing required packages…”) print(“=” * 70) !pip install -q –upgrade pip !pip install -q transformers>=4.51.0 accelerate sentencepiece protobuf !pip install -q huggingface_hub gradio ipywidgets !pip install -q openai-harmony import transformers print(f” Transformers version: {transformers.__version__}”) import torch print(f”n System Information:”) print(f” PyTorch version: {torch.__version__}”) print(f” CUDA available: {torch.cuda.is_available()}”) if torch.cuda.is_available(): gpu_name = torch.cuda.get_device_name(0) gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9 print(f” GPU: {gpu_name}”) print(f” GPU Memory: {gpu_memory:.2f} GB”) if gpu_memory < 15: print(f”n WARNING: gpt-oss-20b requires ~16GB VRAM.”) print(f” Your GPU has {gpu_memory:.1f}GB. Consider using Colab Pro for T4/A100.”) else: print(f”n GPU memory sufficient for gpt-oss-20b”) else: print(“n No GPU detected!”) print(” Go to: Runtime → Change runtime type → Select ‘T4 GPU'”) raise RuntimeError(“GPU required for this tutorial”) print(“n” + “=” * 70) print(” PART 2: Loading GPT-OSS Model (Correct Method)”) print(“=” * 70) from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline import torch MODEL_ID = “openai/gpt-oss-20b” print(f”n Loading model: {MODEL_ID}”) print(” This may take several minutes on first run…”) print(” (Model size: ~40GB download, uses native MXFP4 quantization)”) tokenizer = AutoTokenizer.from_pretrained( MODEL_ID, trust_remote_code=True ) model = AutoModelForCausalLM.from_pretrained( MODEL_ID, torch_dtype=torch.bfloat16, device_map=”auto”, trust_remote_code=True, ) pipe = pipeline( “text-generation”, model=model, tokenizer=tokenizer, ) print(” Model loaded successfully!”) print(f” Model dtype: {model.dtype}”) print(f” Device: {model.device}”) if torch.cuda.is_available(): allocated = torch.cuda.memory_allocated() / 1e9 reserved = torch.cuda.memory_reserved() / 1e9 print(f” GPU Memory Allocated: {allocated:.2f} GB”) print(f” GPU Memory Reserved: {reserved:.2f} GB”) print(“n” + “=” * 70) print(” PART 3: Basic Inference Examples”) print(“=” * 70) def generate_response(messages, max_new_tokens=256, temperature=0.8, top_p=1.0): “”” Generate a response using gpt-oss with recommended parameters. OpenAI recommends: temperature=1.0, top_p=1.0 for gpt-oss “”” output = pipe( messages, max_new_tokens=max_new_tokens, do_sample=True, temperature=temperature, top_p=top_p, pad_token_id=tokenizer.eos_token_id, ) return output[0][“generated_text”][-1][“content”] print(“n Example 1: Simple Question Answering”) print(“-” * 50) messages = [ {“role”: “user”, “content”: “What is the Pythagorean theorem? Explain briefly.”} ] response = generate_response(messages, max_new_tokens=150) print(f”User: {messages[0][‘content’]}”) print(f”nAssistant: {response}”) print(“nn Example 2: Code Generation”) print(“-” * 50) messages = [ ] response = generate_response(messages, max_new_tokens=300) print(f”User: {messages[0][‘content’]}”) print(f”nAssistant: {response}”) print(“nn Example 3: Creative Writing”) print(“-” * 50) messages = [ {“role”: “user”, “content”: “Write a haiku about artificial intelligence.”} ] response = generate_response(messages, max_new_tokens=100, temperature=1.0) print(f”User: {messages[0][‘content’]}”) print(f”nAssistant: {response}”) We set up the full Colab environment required to run GPT-OSS properly and verify that the system has a compatible GPU with enough VRAM. We install the core libraries, check the PyTorch and Transformers versions, and confirm that the runtime is suitable for loading an open-weight model like gpt-oss-20b. We then load the tokenizer, initialize the model with the correct technical configuration, and run a few basic inference examples to confirm that the open-weight pipeline is working end to end. Copy CodeCopiedUse a different Browser print(“n” + “=” * 70) print(” PART 4: Configurable Reasoning Effort”) print(“=” * 70) print(“”” GPT-OSS supports different reasoning effort levels: • LOW – Quick, concise answers (fewer tokens, faster) • MEDIUM – Balanced reasoning and response • HIGH – Deep thinking with full chain-of-thought The reasoning effort is controlled through system prompts and generation parameters. “””) class ReasoningEffortController: “”” Controls reasoning effort levels for gpt-oss generations. “”” EFFORT_CONFIGS = { “low”: { “system_prompt”: “You are a helpful assistant. Be concise and direct.”, “max_tokens”: 200, “temperature”: 0.7, “description”: “Quick, concise answers” }, “medium”: { “system_prompt”: “You are a helpful assistant. Think through problems step by step and provide clear, well-reasoned answers.”, “max_tokens”: 400, “temperature”: 0.8, “description”: “Balanced reasoning” }, “high”: { “system_prompt”: “””You are a helpful assistant with advanced reasoning capabilities. For complex problems: 1. First, analyze the problem thoroughly 2. Consider multiple approaches 3. Show your complete chain of thought 4. Provide a comprehensive, well-reasoned answer Take your time to think deeply before responding.”””, “max_tokens”: 800, “temperature”: 1.0, “description”: “Deep chain-of-thought reasoning” } } def __init__(self, pipeline, tokenizer): self.pipe = pipeline self.tokenizer = tokenizer def generate(self, user_message: str, effort: str = “medium”) -> dict: “””Generate response with specified reasoning effort.””” if effort not in self.EFFORT_CONFIGS: raise ValueError(f”Effort must be one of: {list(self.EFFORT_CONFIGS.keys())}”) config = self.EFFORT_CONFIGS[effort] messages = [ {“role”: “system”, “content”: config[“system_prompt”]}, {“role”: “user”, “content”: user_message} ] output = self.pipe( messages, max_new_tokens=config[“max_tokens”], do_sample=True, temperature=config[“temperature”], top_p=1.0, pad_token_id=self.tokenizer.eos_token_id, ) return { “effort”: effort, “description”: config[“description”], “response”: output[0][“generated_text”][-1][“content”], “max_tokens_used”: config[“max_tokens”] } reasoning_controller = ReasoningEffortController(pipe, tokenizer) print(f”n Logic Puzzle: {test_question}n”) for effort in [“low”, “medium”, “high”]: result = reasoning_controller.generate(test_question, effort) print(f”━━━ {effort.upper()} ({result[‘description’]}) ━━━”) print(f”{result[‘response’][:500]}…”) print() print(“n” + “=” * 70) print(” PART 5: Structured Output Generation (JSON Mode)”) print(“=” * 70) import json import re class StructuredOutputGenerator: “”” Generate structured JSON outputs with schema validation. “”” def __init__(self, pipeline, tokenizer): self.pipe = pipeline self.tokenizer = tokenizer def generate_json(self, prompt: str, schema: dict, max_retries: int = 2) -> dict: “”” Generate JSON output in accordance with a specified schema. Args: prompt: The user’s request schema: JSON schema description max_retries: Number of retries on parse failure “”” schema_str = json.dumps(schema, indent=2) system_prompt = f”””You are a helpful assistant that ONLY outputs valid JSON. Your response must exactly match this JSON schema: {schema_str} RULES: – Output ONLY the JSON object, nothing else – No markdown code blocks (no “`) – No explanations before or after – Ensure all required fields are present – Use correct data types as specified””” messages = [ {“role”: “system”, “content”: system_prompt}, {“role”: “user”, “content”:

A End-to-End Coding Guide to Running OpenAI GPT-OSS Open-Weight Models with Advanced Inference Workflows Read Post »

AI, Committee, ข่าว, Uncategorized

Top 19 AI Red Teaming Tools (2026): Secure Your ML Models

Table of contents What Is AI Red Teaming? Top 19 AI Red Teaming Tools (2026) Conclusion What Is AI Red Teaming? AI Red Teaming is the process of systematically testing artificial intelligence systems—especially generative AI and machine learning models—against adversarial attacks and security stress scenarios. Red teaming goes beyond classic penetration testing; while penetration testing targets known software flaws, red teaming probes for unknown AI-specific vulnerabilities, unforeseen risks, and emergent behaviors. The process adopts the mindset of a malicious adversary, simulating attacks such as prompt injection, data poisoning, jailbreaking, model evasion, bias exploitation, and data leakage. This ensures AI models are not only robust against traditional threats, but also resilient to novel misuse scenarios unique to current AI systems. Key Features & Benefits Threat Modeling: Identify and simulate all potential attack scenarios—from prompt injection to adversarial manipulation and data exfiltration. Realistic Adversarial Behavior: Emulates actual attacker techniques using both manual and automated tools, beyond what is covered in penetration testing. Vulnerability Discovery: Uncovers risks such as bias, fairness gaps, privacy exposure, and reliability failures that may not emerge in pre-release testing. Regulatory Compliance: Supports compliance requirements (EU AI Act, NIST RMF, US Executive Orders) increasingly mandating red teaming for high-risk AI deployments. Continuous Security Validation: Integrates into CI/CD pipelines, enabling ongoing risk assessment and resilience improvement. Red teaming can be carried out by internal security teams, specialized third parties, or platforms built solely for adversarial testing of AI systems. Top 19 AI Red Teaming Tools (2026) Below is a rigorously researched list of the latest and most reputable AI red teaming tools, frameworks, and platforms—spanning open-source, commercial, and industry-leading solutions for both generic and AI-specific attacks: Mindgard – Automated AI red teaming and model vulnerability assessment. MIND.io – Data security platform providing autonomous DLP and data detection and response (DDR) for Agentic AI. Garak – Open-source LLM adversarial testing toolkit. HiddenLayer– A comprehensive AI security platform that provides automated model scanning and red teaming. AIF360 (IBM) – AI Fairness 360 toolkit for bias and fairness assessment. Foolbox – Library for adversarial attacks on AI models. Penligent– An AI-powered penetration testing tool that requires no expert knowledge Giskard– Comprehensive testing for traditional Machine Learning models and Agentic AI Adversarial Robustness Toolbox (ART) – IBM’s open-source toolkit for ML model security. FuzzyAI– A powerful tool for automated LLM fuzzing DeepTeam– An AI framework to red team LLMs and LLM systems SPLX– A unified platform to test, protect & govern AI at scale Pentera– A Platform that executes AI-driven adversarial testing in production to validate exploitability, prioritize remediation. Dreadnode – ML/AI vulnerability detection and red team toolkit. Galah – AI honeypot framework supporting LLM use cases. Meerkat – Data visualization and adversarial testing for ML. Ghidra/GPT-WPRE – Code reverse engineering platform with LLM analysis plugins. Guardrails – Application security for LLMs, prompt injection defense. Snyk – Developer-focused LLM red teaming tool simulating prompt injection and adversarial attacks. Conclusion In the era of generative AI and Large Language Models, AI Red Teaming has become foundational to responsible and resilient AI deployment. Organizations must embrace adversarial testing to uncover hidden vulnerabilities and adapt their defenses to new threat vectors—including attacks driven by prompt engineering, data leakage, bias exploitation, and emergent model behaviors. The best practice is to combine manual expertise with automated platforms utilizing the top red teaming tools listed above for a comprehensive, proactive security posture in AI systems. Check out our Twitter page and don’t forget to join our 130k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well. Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us The post Top 19 AI Red Teaming Tools (2026): Secure Your ML Models appeared first on MarkTechPost.

Top 19 AI Red Teaming Tools (2026): Secure Your ML Models Read Post »

AI, Committee, ข่าว, Uncategorized

A Coding Guide to Build a Production-Grade Background Task Processing System Using Huey with SQLite, Scheduling, Retries, Pipelines, and Concurrency Control

In this tutorial, we explore how to build a fully functional background task processing system using Huey directly, without relying on Redis. We configure a SQLite-backed Huey instance, start a real consumer in the notebook, and implement advanced task patterns, including retries, priorities, scheduling, pipelines, locking, and monitoring via signals. As we move step by step, we demonstrate how we can simulate production-grade asynchronous job handling while keeping everything self-contained and easy to run in a cloud notebook environment. Copy CodeCopiedUse a different Browser !pip -q install -U huey import os import time import json import random import threading from datetime import datetime from huey import SqliteHuey, crontab from huey.constants import WORKER_THREAD DB_PATH = “/content/huey_demo.db” if os.path.exists(DB_PATH): os.remove(DB_PATH) huey = SqliteHuey( name=”colab-huey”, filename=DB_PATH, results=True, store_none=False, utc=True, ) print(“Huey backend:”, type(huey).__name__) print(“SQLite DB at:”, DB_PATH) We install Huey and configure a SQLite-backed instance. We initialize the database file and ensure a clean environment before starting execution. By doing this, we establish a lightweight yet production-style task queue setup without external dependencies. Copy CodeCopiedUse a different Browser EVENT_LOG = [] @huey.signal() def _log_all_signals(signal, task, exc=None): EVENT_LOG.append({ “ts”: datetime.utcnow().isoformat() + “Z”, “signal”: str(signal), “task”: getattr(task, “name”, None), “id”: getattr(task, “id”, None), “args”: getattr(task, “args”, None), “kwargs”: getattr(task, “kwargs”, None), “exc”: repr(exc) if exc else None, }) def print_latest_events(n=10): print(“n— Latest Huey events —“) for row in EVENT_LOG[-n:]: print(json.dumps(row, indent=2)) We implement a signal handler to capture and store task lifecycle events in a structured log. We track execution details, including task IDs, arguments, and exceptions, to improve observability. Through this mechanism, we build real-time monitoring into our asynchronous system. Copy CodeCopiedUse a different Browser @huey.task(priority=50) def quick_add(a, b): return a + b @huey.task(priority=10) def slow_io(seconds=1.0): time.sleep(seconds) return f”slept={seconds}” @huey.task(retries=3, retry_delay=1, priority=100) def flaky_network_call(p_fail=0.6): if random.random() < p_fail: raise RuntimeError(“Transient failure (simulated)”) return “OK” @huey.task(context=True, priority=60) def cpu_pi_estimate(samples=200_000, task=None): inside = 0 rnd = random.random for _ in range(samples): x, y = rnd(), rnd() if x*x + y*y <= 1.0: inside += 1 est = 4.0 * inside / samples return {“task_id”: task.id if task else None, “pi_estimate”: est, “samples”: samples} We define multiple tasks with priorities, retry configurations, and contextual awareness. We simulate different workloads, including simple arithmetic, I/O delay, transient failures, and CPU-bound computation. By doing this, we demonstrate how Huey handles reliability, execution order, and task metadata. Copy CodeCopiedUse a different Browser @huey.lock_task(“demo:daily-sync”) @huey.task() def locked_sync_job(tag=”sync”): time.sleep(1.0) return f”locked-job-done:{tag}:{datetime.utcnow().isoformat()}Z” @huey.task() def fetch_number(seed=7): random.seed(seed) return random.randint(1, 100) @huey.task() def transform_number(x, scale=3): return x * scale @huey.task() def store_result(x): return {“stored_value”: x, “stored_at”: datetime.utcnow().isoformat() + “Z”} We introduce locking to prevent concurrent execution of critical jobs. We also define tasks that will later be chained together using pipelines to form structured workflows. Through this design, we model realistic background processing patterns that require sequencing and concurrency control. Copy CodeCopiedUse a different Browser TICK = {“count”: 0} @huey.task() def heartbeat(): TICK[“count”] += 1 print(f”[heartbeat] tick={TICK[‘count’]} utc={datetime.utcnow().isoformat()}Z”) @huey.periodic_task(crontab(minute=”*”)) def heartbeat_minutely(): heartbeat() _TIMER_STATE = {“running”: False, “timer”: None} def start_seconds_heartbeat(interval_sec=15): _TIMER_STATE[“running”] = True def _tick(): if not _TIMER_STATE[“running”]: return huey.enqueue(heartbeat.s()) t = threading.Timer(interval_sec, _tick) _TIMER_STATE[“timer”] = t t.start() _tick() def stop_seconds_heartbeat(): _TIMER_STATE[“running”] = False t = _TIMER_STATE.get(“timer”) if t is not None: try: t.cancel() except Exception: pass _TIMER_STATE[“timer”] = None We define heartbeat behavior and configure minute-level periodic execution using Huey’s crontab scheduling. We also implement a timer-based mechanism to simulate sub-minute execution intervals for demonstration purposes. With this setup, we create visible recurring background activity within the notebook. Copy CodeCopiedUse a different Browser consumer = huey.create_consumer( workers=4, worker_type=WORKER_THREAD, periodic=True, initial_delay=0.1, backoff=1.15, max_delay=2.0, scheduler_interval=1, check_worker_health=True, health_check_interval=10, flush_locks=False, ) consumer_thread = threading.Thread(target=consumer.run, daemon=True) consumer_thread.start() print(“Consumer started (threaded).”) print(“nEnqueue basics…”) r1 = quick_add(10, 32) r2 = slow_io(0.75) print(“quick_add result:”, r1(blocking=True, timeout=5)) print(“slow_io result:”, r2(blocking=True, timeout=5)) print(“nRetries + priority demo (flaky task)…”) rf = flaky_network_call(p_fail=0.7) try: print(“flaky_network_call result:”, rf(blocking=True, timeout=10)) except Exception as e: print(“flaky_network_call failed even after retries:”, repr(e)) print(“nContext task (task id inside payload)…”) rp = cpu_pi_estimate(samples=150_000) print(“pi payload:”, rp(blocking=True, timeout=20)) print(“nLocks demo: enqueue multiple locked jobs quickly (should serialize)…”) locked_results = [locked_sync_job(tag=f”run{i}”) for i in range(3)] print([res(blocking=True, timeout=10) for res in locked_results]) print(“nScheduling demo: run slow_io in ~3 seconds…”) rs = slow_io.schedule(args=(0.25,), delay=3) print(“scheduled handle:”, rs) print(“scheduled slow_io result:”, rs(blocking=True, timeout=10)) print(“nRevoke demo: schedule a task in 5s then revoke before it runs…”) rv = slow_io.schedule(args=(0.1,), delay=5) rv.revoke() time.sleep(6) try: out = rv(blocking=False) print(“revoked task output:”, out) except Exception as e: print(“revoked task did not produce result (expected):”, type(e).__name__, str(e)[:120]) print(“nPipeline demo…”) pipeline = ( fetch_number.s(123) .then(transform_number, 5) .then(store_result) ) pipe_res = huey.enqueue(pipeline) print(“pipeline final result:”, pipe_res(blocking=True, timeout=10)) print(“nStarting 15-second heartbeat demo for ~40 seconds…”) start_seconds_heartbeat(interval_sec=15) time.sleep(40) stop_seconds_heartbeat() print(“Stopped 15-second heartbeat demo.”) print_latest_events(12) print(“nStopping consumer gracefully…”) consumer.stop(graceful=True) consumer_thread.join(timeout=5) print(“Consumer stopped.”) We start a threaded consumer inside the notebook to process tasks asynchronously. We enqueue tasks, test retries, demonstrate scheduling and revocation, execute pipelines, and observe logged signals. Finally, we gracefully shut down the consumer to ensure clean resource management and controlled system termination. In conclusion, we designed and executed an advanced asynchronous task system using Huey with a SQLite backend and an in-notebook consumer. We implemented retries, task prioritization, future scheduling, revocation, locking mechanisms, task chaining through pipelines, and periodic behavior simulation, all within a Colab-friendly setup. Through this approach, we gained a clear understanding of how to use Huey to manage background workloads efficiently and extend this architecture to real-world production deployments. Check out the Full Coding Notebook/Implementation here. Also, feel free to follow us on Twitter and don’t forget to join our 130k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well. Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us The post A Coding Guide to Build a Production-Grade Background Task Processing System Using Huey with SQLite, Scheduling, Retries, Pipelines, and Concurrency Control appeared first on MarkTechPost.

A Coding Guide to Build a Production-Grade Background Task Processing System Using Huey with SQLite, Scheduling, Retries, Pipelines, and Concurrency Control Read Post »

AI, Committee, ข่าว, Uncategorized

Google AI Releases Auto-Diagnose: An Large Language Model LLM-Based System to Diagnose Integration Test Failures at Scale

If you have ever stared at thousands of lines of integration test logs wondering which of the sixteen log files actually contains your bug, you are not alone — and Google now has data to prove it. A team of Google researchers introduced Auto-Diagnose, an LLM-powered tool that automatically reads the failure logs from a broken integration test, finds the root cause, and posts a concise diagnosis directly into the code review where the failure showed up. On a manual evaluation of 71 real-world failures spanning 39 distinct teams, the tool correctly identified the root cause 90.14% of the time. It has run on 52,635 distinct failing tests across 224,782 executions on 91,130 code changes authored by 22,962 distinct developers, with a ‘Not helpful’ rate of just 5.8% on the feedback received. https://arxiv.org/pdf/2604.12108 The problem: integration tests are a debugging tax Integration tests verify that multiple components of a distributed system actually communicate to each other correctly. The tests Auto-Diagnose targets are hermetic functional integration tests: tests where an entire system under test (SUT) — typically a graph of communicating servers — is brought up inside an isolated environment by a test driver, and exercised against business logic. A separate Google survey of 239 respondents found that 78% of integration tests at Google are functional, which is what motivated the scope. Diagnosing integration test failures showed up as one of the top five complaints in EngSat, a Google-wide survey of 6,059 developers. A follow-up survey of 116 developers found that 38.4% of integration test failures take more than an hour to diagnose, and 8.9% take more than a day — versus 2.7% and 0% for unit tests. The root cause is structural. Test driver logs usually surface only a generic symptom (a timeout, an assertion). The actual error lives somewhere inside one of the SUT component logs, often buried under recoverable warnings and ERROR-level lines that are not actually the cause. https://arxiv.org/pdf/2604.12108 How Auto-Diagnose works When an integration test fails, a pub/sub event triggers Auto-Diagnose. The system collects all test driver and SUT component logs at level INFO and above — across data centers, processes, and threads — then joins and sorts them by timestamp into a single log stream. That stream is dropped into a prompt template along with component metadata. The model is Gemini 2.5 Flash, called with temperature = 0.1 (for near-deterministic, debuggable outputs) and topp = 0.8. Gemini was not fine-tuned on Google’s integration test data; this is pure prompt engineering on a general-purpose model. The prompt itself is the most instructive part of this research. It walks the model through an explicit step-by-step protocol: scan log sections, read component context, locate the failure, summarize errors, and only then attempt a conclusion. Critically, it includes hard negative constraints — for example: if the logs do not contain lines from the component that failed, do not draw any conclusion. The model’s response is post-processed into a markdown finding with ==Conclusion==, ==Investigation Steps==, and ==Most Relevant Log Lines== sections, then posted as a comment in Critique, Google’s internal code review system. Each cited log line is rendered as a clickable link. Numbers from production Auto-Diagnose averages 110,617 input tokens and 5,962 output tokens per execution, and posts findings with a p50 latency of 56 seconds and p90 of 346 seconds — fast enough that developers see the diagnosis before they have switched contexts. Critique exposes three feedback buttons on a finding: Please fix (used by reviewers), Helpful, and Not helpful (both used by authors). Across 517 total feedback reports from 437 distinct developers, 436 (84.3%) were “Please fix” from 370 reviewers — by far the dominant interaction, and a sign that reviewers are actively asking authors to act on the diagnoses. Among dev-side feedback, the helpfulness ratio (H / (H + N)) is 62.96%, and the “Not helpful” rate (N / (PF + H + N)) is 5.8% — well under Google’s 10% threshold for keeping a tool live. Across 370 tools that post findings to Critique, Auto-Diagnose ranks #14 in helpfulness, putting it in the top 3.78%. The manual evaluation also surfaced a useful side effect. Of the seven cases where Auto-Diagnose failed, four were because test driver logs were not properly saved on crash, and three were because SUT component logs were not saved when the component crashed — both real infrastructure bugs, reported back to the relevant teams. In production, around 20 ‘more information is needed‘ diagnoses have similarly helped surface infrastructure issues. Key Takeaways Auto-Diagnose hit 90.14% root-cause accuracy on a manual evaluation of 71 real-world integration test failures spanning 39 teams at Google, addressing a problem 6,059 developers ranked among their top five complaints in the EngSat survey. The system runs on Gemini 2.5 Flash with no fine-tuning — just prompt engineering. A pub/sub trigger collects logs across data centers and processes, joins them by timestamp, and sends them to the model at temperature 0.1 and topp 0.8. The prompt is engineered to refuse rather than guess. Hard negative constraints force the model to respond with “more information is needed” when evidence is missing — a deliberate trade-off that prevents hallucinated root causes and even helped surface real infrastructure bugs in Google’s logging pipeline. In production since May 2025, Auto-Diagnose has run on 52,635 distinct failing tests across 224,782 executions on 91,130 code changes from 22,962 developers, posting findings in a p50 of 56 seconds — fast enough that engineers see the diagnosis before switching contexts. Check out the Pre-Print Paper here. Also, feel free to follow us on Twitter and don’t forget to join our 130k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well. Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us The post Google AI Releases Auto-Diagnose: An Large Language Model LLM-Based System to Diagnose Integration Test Failures at Scale appeared first on MarkTechPost.

Google AI Releases Auto-Diagnose: An Large Language Model LLM-Based System to Diagnose Integration Test Failures at Scale Read Post »

AI, Committee, ข่าว, Uncategorized

The case for fixing everything

The handsome new book Maintenance: Of Everything, Part One, by the tech industry legend Stewart Brand, promises to be the first in a series offering “a comprehensive overview of the civilizational importance of maintenance.” One of Brand’s several biographers described him as a mainstay of both counterculture and cyberculture, and with Maintenance, Brand wants us to understand that the upkeep and repair of tools and systems has profound impact on daily life. As he puts it, “Taking responsibility for maintaining something—whether a motorcycle, a monument, or our planet—can be a radical act.” Radical how? This volume doesn’t say. In an outline for the overall work, Brand says his goal is to “end with the nature of maintainers and the honor owed them.” The idea that maintainers are owed anything, much less honor, might surprise some readers. Actually, maintenance and repair have been hot topics in academia since the mid-2010s. I played some role in that movement as a cofounder of the Maintainers, a global, interdisciplinary network dedicated to the study of maintenance, repair, care, and all the work that goes into keeping the world going. Brand is right, too, that maintainers haven’t gotten the laurels they deserve. Over the past few decades, scholars have shown that work from oiling tools to replacing worn parts to updating code bases all tends to be lower in status than “innovation.” Maintenance gets neglected in many organizational and social settings. (Just look at some American infrastructure!) And as the right-to-­repair movement has shown, companies in pursuit of greater profits have frequently locked us out of being able to do repairs or greatly reduced the maintainable life of their products. It’s hard to think of any other reason to put a computer in the door of a refrigerator. Some of Brand’s earlier work helped inspire those insights. But his new book makes me think he doesn’t see things that way. For Brand, maintenance seems to be a solitary act, profound but more about personal success and fulfillment than tending to a shared world or making it better. Born in 1938, Brand is 87 years old. A sense hangs over the book—with its battles against corrosion, rust, and decay, with its attempts to keep things going even as they inevitably falter—of someone looking over life and pondering its end. Maintenance: Of Everything connects to every stage of Brand’s life. It’s worth reviewing where it falls in that arc. Brand has always been interested in tools and fixing things, but rarely has he focused on the systems that need the most care.  More than a half-century ago, Brand was a member of the Merry Pranksters, a countercultural, LSD-centered hippie collective famously led by Ken Kesey, the author of One Flew Over the Cuckoo’s Nest. In 1966, Brand co-produced the Trips Festival, where bands like the Grateful Dead and Big Brother and the Holding Company performed for thousands amid psychedelic light shows. Brand’s Whole Earth Catalog had a vision that might feel progressive, but its libertarian, rugged-individualist philosophy of remaking civilization alone stood in contrast to more collective social change movements. In some ways, the Trips Festival set a paradigm for the rest of his life’s work. Brand’s biographers have described him as a network celebrity—someone who got ahead by bringing people together, building coalitions of influential figures who could boost his signal. As Kesey put it in 1980, “Stewart recognizes power. And cleaves to it.”  Brand applied this network logic to the undertaking he will always be best remembered for: the Whole Earth Catalog. First published in 1968 and aimed at hippies and members of the nascent back-to-the-land movement, the publication had the motto “Access to tools.” Its pages were full of Quonset huts, geodesic domes, solar panels, well pumps, water filters, and other technologies for life off the grid. It was a vision that might feel progressive or left-leaning, but the libertarian, rugged-individualist philosophy of eschewing corrupt systems and remaking civilization alone stood in contrast to the more collective movements pushing for deep social change at the time—like civil rights, feminism, and environmentalism. That vision also led straight to the empowerment that came with new digital tools, and to Silicon Valley. In 1985, Brand published the Whole Earth Software Catalog, the last of the series, and also cofounded the WELL—the Whole Earth ’Lectronic Link, a pioneering online community famous for, among other things, facilitating the trade of Grateful Dead bootlegs. He also wrote a hagiographic book about the MIT Media Lab, known for its corporate-sponsored research into new communications tech. “The Lab would cure the pathologies of technology not with economics or politics but with technology,” Brand wrote. Again, not collective action, not policymaking: tools. And Brand then cofounded the Global Business Network, a group of pricey consulting futurists that further connected him to MIT, Stanford, and the Valley. Brand had literally helped bring about the modern digital revolution. His attention then turned toward its upkeep. Brand’s 1994 book, How Buildings Learn: What Happens After They’re Built, argued against high-modernist architectural ideas. Nearly all buildings eventually get remade, he argued, but he especially favored cheap, simple structures that inhabitants could easily retool to suit changing needs. In some ways, Brand was recapitulating the liberated—or libertarian—philosophy of the Whole Earth Catalog: People can remake their world, if they have access to tools. In a chapter titled “The Romance of Maintenance,” he asked readers to see the beauty, value, and occasional pleasures of fixer-uppers of all kinds. This chapter was a touchstone for many of us in the academic subfield of maintenance studies. Researchers in disciplines like history, sociology, and anthropology, as well as artists and practitioners in fields like libraries, IT, and engineering, all started trying to understand the realities and, yes, romance of maintenance and repair. Brand joined and contributed to Listservs, attended conferences, chatted with intellectual leaders. So it’s a bit uncharitable when he writes that his new book is “the first to look at maintenance in general.” He knows better. The real question,

The case for fixing everything Read Post »

We use cookies to improve your experience and performance on our website. You can learn more at นโยบายความเป็นส่วนตัว and manage your privacy settings by clicking Settings.

ตั้งค่าความเป็นส่วนตัว

You can choose your cookie settings by turning on/off each type of cookie as you wish, except for essential cookies.

ยอมรับทั้งหมด
จัดการความเป็นส่วนตัว
  • เปิดใช้งานตลอด

บันทึกการตั้งค่า
th