YouZum

AI

AI, Committee, ข่าว, Uncategorized

A End-to-End Coding Guide to Running OpenAI GPT-OSS Open-Weight Models with Advanced Inference Workflows

In this tutorial, we explore how to run OpenAI’s open-weight GPT-OSS models in Google Colab with a strong focus on their technical behavior, deployment requirements, and practical inference workflows. We begin by setting up the exact dependencies needed for Transformers-based execution, verifying GPU availability, and loading openai/gpt-oss-20b with the correct configuration using native MXFP4 quantization, torch.bfloat16 activations. As we move through the tutorial, we work directly with core capabilities such as structured generation, streaming, multi-turn dialogue handling, tool execution patterns, and batch inference, while keeping in mind how open-weight models differ from closed-hosted APIs in terms of transparency, controllability, memory constraints, and local execution trade-offs. Also, we treat GPT-OSS not just as a chatbot, but as a technically inspectable open-weight LLM stack that we can configure, prompt, and extend inside a reproducible workflow. Copy CodeCopiedUse a different Browser print(” Step 1: Installing required packages…”) print(“=” * 70) !pip install -q –upgrade pip !pip install -q transformers>=4.51.0 accelerate sentencepiece protobuf !pip install -q huggingface_hub gradio ipywidgets !pip install -q openai-harmony import transformers print(f” Transformers version: {transformers.__version__}”) import torch print(f”n System Information:”) print(f” PyTorch version: {torch.__version__}”) print(f” CUDA available: {torch.cuda.is_available()}”) if torch.cuda.is_available(): gpu_name = torch.cuda.get_device_name(0) gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9 print(f” GPU: {gpu_name}”) print(f” GPU Memory: {gpu_memory:.2f} GB”) if gpu_memory < 15: print(f”n WARNING: gpt-oss-20b requires ~16GB VRAM.”) print(f” Your GPU has {gpu_memory:.1f}GB. Consider using Colab Pro for T4/A100.”) else: print(f”n GPU memory sufficient for gpt-oss-20b”) else: print(“n No GPU detected!”) print(” Go to: Runtime → Change runtime type → Select ‘T4 GPU'”) raise RuntimeError(“GPU required for this tutorial”) print(“n” + “=” * 70) print(” PART 2: Loading GPT-OSS Model (Correct Method)”) print(“=” * 70) from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline import torch MODEL_ID = “openai/gpt-oss-20b” print(f”n Loading model: {MODEL_ID}”) print(” This may take several minutes on first run…”) print(” (Model size: ~40GB download, uses native MXFP4 quantization)”) tokenizer = AutoTokenizer.from_pretrained( MODEL_ID, trust_remote_code=True ) model = AutoModelForCausalLM.from_pretrained( MODEL_ID, torch_dtype=torch.bfloat16, device_map=”auto”, trust_remote_code=True, ) pipe = pipeline( “text-generation”, model=model, tokenizer=tokenizer, ) print(” Model loaded successfully!”) print(f” Model dtype: {model.dtype}”) print(f” Device: {model.device}”) if torch.cuda.is_available(): allocated = torch.cuda.memory_allocated() / 1e9 reserved = torch.cuda.memory_reserved() / 1e9 print(f” GPU Memory Allocated: {allocated:.2f} GB”) print(f” GPU Memory Reserved: {reserved:.2f} GB”) print(“n” + “=” * 70) print(” PART 3: Basic Inference Examples”) print(“=” * 70) def generate_response(messages, max_new_tokens=256, temperature=0.8, top_p=1.0): “”” Generate a response using gpt-oss with recommended parameters. OpenAI recommends: temperature=1.0, top_p=1.0 for gpt-oss “”” output = pipe( messages, max_new_tokens=max_new_tokens, do_sample=True, temperature=temperature, top_p=top_p, pad_token_id=tokenizer.eos_token_id, ) return output[0][“generated_text”][-1][“content”] print(“n Example 1: Simple Question Answering”) print(“-” * 50) messages = [ {“role”: “user”, “content”: “What is the Pythagorean theorem? Explain briefly.”} ] response = generate_response(messages, max_new_tokens=150) print(f”User: {messages[0][‘content’]}”) print(f”nAssistant: {response}”) print(“nn Example 2: Code Generation”) print(“-” * 50) messages = [ ] response = generate_response(messages, max_new_tokens=300) print(f”User: {messages[0][‘content’]}”) print(f”nAssistant: {response}”) print(“nn Example 3: Creative Writing”) print(“-” * 50) messages = [ {“role”: “user”, “content”: “Write a haiku about artificial intelligence.”} ] response = generate_response(messages, max_new_tokens=100, temperature=1.0) print(f”User: {messages[0][‘content’]}”) print(f”nAssistant: {response}”) We set up the full Colab environment required to run GPT-OSS properly and verify that the system has a compatible GPU with enough VRAM. We install the core libraries, check the PyTorch and Transformers versions, and confirm that the runtime is suitable for loading an open-weight model like gpt-oss-20b. We then load the tokenizer, initialize the model with the correct technical configuration, and run a few basic inference examples to confirm that the open-weight pipeline is working end to end. Copy CodeCopiedUse a different Browser print(“n” + “=” * 70) print(” PART 4: Configurable Reasoning Effort”) print(“=” * 70) print(“”” GPT-OSS supports different reasoning effort levels: • LOW – Quick, concise answers (fewer tokens, faster) • MEDIUM – Balanced reasoning and response • HIGH – Deep thinking with full chain-of-thought The reasoning effort is controlled through system prompts and generation parameters. “””) class ReasoningEffortController: “”” Controls reasoning effort levels for gpt-oss generations. “”” EFFORT_CONFIGS = { “low”: { “system_prompt”: “You are a helpful assistant. Be concise and direct.”, “max_tokens”: 200, “temperature”: 0.7, “description”: “Quick, concise answers” }, “medium”: { “system_prompt”: “You are a helpful assistant. Think through problems step by step and provide clear, well-reasoned answers.”, “max_tokens”: 400, “temperature”: 0.8, “description”: “Balanced reasoning” }, “high”: { “system_prompt”: “””You are a helpful assistant with advanced reasoning capabilities. For complex problems: 1. First, analyze the problem thoroughly 2. Consider multiple approaches 3. Show your complete chain of thought 4. Provide a comprehensive, well-reasoned answer Take your time to think deeply before responding.”””, “max_tokens”: 800, “temperature”: 1.0, “description”: “Deep chain-of-thought reasoning” } } def __init__(self, pipeline, tokenizer): self.pipe = pipeline self.tokenizer = tokenizer def generate(self, user_message: str, effort: str = “medium”) -> dict: “””Generate response with specified reasoning effort.””” if effort not in self.EFFORT_CONFIGS: raise ValueError(f”Effort must be one of: {list(self.EFFORT_CONFIGS.keys())}”) config = self.EFFORT_CONFIGS[effort] messages = [ {“role”: “system”, “content”: config[“system_prompt”]}, {“role”: “user”, “content”: user_message} ] output = self.pipe( messages, max_new_tokens=config[“max_tokens”], do_sample=True, temperature=config[“temperature”], top_p=1.0, pad_token_id=self.tokenizer.eos_token_id, ) return { “effort”: effort, “description”: config[“description”], “response”: output[0][“generated_text”][-1][“content”], “max_tokens_used”: config[“max_tokens”] } reasoning_controller = ReasoningEffortController(pipe, tokenizer) print(f”n Logic Puzzle: {test_question}n”) for effort in [“low”, “medium”, “high”]: result = reasoning_controller.generate(test_question, effort) print(f”━━━ {effort.upper()} ({result[‘description’]}) ━━━”) print(f”{result[‘response’][:500]}…”) print() print(“n” + “=” * 70) print(” PART 5: Structured Output Generation (JSON Mode)”) print(“=” * 70) import json import re class StructuredOutputGenerator: “”” Generate structured JSON outputs with schema validation. “”” def __init__(self, pipeline, tokenizer): self.pipe = pipeline self.tokenizer = tokenizer def generate_json(self, prompt: str, schema: dict, max_retries: int = 2) -> dict: “”” Generate JSON output in accordance with a specified schema. Args: prompt: The user’s request schema: JSON schema description max_retries: Number of retries on parse failure “”” schema_str = json.dumps(schema, indent=2) system_prompt = f”””You are a helpful assistant that ONLY outputs valid JSON. Your response must exactly match this JSON schema: {schema_str} RULES: – Output ONLY the JSON object, nothing else – No markdown code blocks (no “`) – No explanations before or after – Ensure all required fields are present – Use correct data types as specified””” messages = [ {“role”: “system”, “content”: system_prompt}, {“role”: “user”, “content”:

A End-to-End Coding Guide to Running OpenAI GPT-OSS Open-Weight Models with Advanced Inference Workflows Read Post »

AI, Committee, ข่าว, Uncategorized

Top 19 AI Red Teaming Tools (2026): Secure Your ML Models

Table of contents What Is AI Red Teaming? Top 19 AI Red Teaming Tools (2026) Conclusion What Is AI Red Teaming? AI Red Teaming is the process of systematically testing artificial intelligence systems—especially generative AI and machine learning models—against adversarial attacks and security stress scenarios. Red teaming goes beyond classic penetration testing; while penetration testing targets known software flaws, red teaming probes for unknown AI-specific vulnerabilities, unforeseen risks, and emergent behaviors. The process adopts the mindset of a malicious adversary, simulating attacks such as prompt injection, data poisoning, jailbreaking, model evasion, bias exploitation, and data leakage. This ensures AI models are not only robust against traditional threats, but also resilient to novel misuse scenarios unique to current AI systems. Key Features & Benefits Threat Modeling: Identify and simulate all potential attack scenarios—from prompt injection to adversarial manipulation and data exfiltration. Realistic Adversarial Behavior: Emulates actual attacker techniques using both manual and automated tools, beyond what is covered in penetration testing. Vulnerability Discovery: Uncovers risks such as bias, fairness gaps, privacy exposure, and reliability failures that may not emerge in pre-release testing. Regulatory Compliance: Supports compliance requirements (EU AI Act, NIST RMF, US Executive Orders) increasingly mandating red teaming for high-risk AI deployments. Continuous Security Validation: Integrates into CI/CD pipelines, enabling ongoing risk assessment and resilience improvement. Red teaming can be carried out by internal security teams, specialized third parties, or platforms built solely for adversarial testing of AI systems. Top 19 AI Red Teaming Tools (2026) Below is a rigorously researched list of the latest and most reputable AI red teaming tools, frameworks, and platforms—spanning open-source, commercial, and industry-leading solutions for both generic and AI-specific attacks: Mindgard – Automated AI red teaming and model vulnerability assessment. MIND.io – Data security platform providing autonomous DLP and data detection and response (DDR) for Agentic AI. Garak – Open-source LLM adversarial testing toolkit. HiddenLayer– A comprehensive AI security platform that provides automated model scanning and red teaming. AIF360 (IBM) – AI Fairness 360 toolkit for bias and fairness assessment. Foolbox – Library for adversarial attacks on AI models. Penligent– An AI-powered penetration testing tool that requires no expert knowledge Giskard– Comprehensive testing for traditional Machine Learning models and Agentic AI Adversarial Robustness Toolbox (ART) – IBM’s open-source toolkit for ML model security. FuzzyAI– A powerful tool for automated LLM fuzzing DeepTeam– An AI framework to red team LLMs and LLM systems SPLX– A unified platform to test, protect & govern AI at scale Pentera– A Platform that executes AI-driven adversarial testing in production to validate exploitability, prioritize remediation. Dreadnode – ML/AI vulnerability detection and red team toolkit. Galah – AI honeypot framework supporting LLM use cases. Meerkat – Data visualization and adversarial testing for ML. Ghidra/GPT-WPRE – Code reverse engineering platform with LLM analysis plugins. Guardrails – Application security for LLMs, prompt injection defense. Snyk – Developer-focused LLM red teaming tool simulating prompt injection and adversarial attacks. Conclusion In the era of generative AI and Large Language Models, AI Red Teaming has become foundational to responsible and resilient AI deployment. Organizations must embrace adversarial testing to uncover hidden vulnerabilities and adapt their defenses to new threat vectors—including attacks driven by prompt engineering, data leakage, bias exploitation, and emergent model behaviors. The best practice is to combine manual expertise with automated platforms utilizing the top red teaming tools listed above for a comprehensive, proactive security posture in AI systems. Check out our Twitter page and don’t forget to join our 130k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well. Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us The post Top 19 AI Red Teaming Tools (2026): Secure Your ML Models appeared first on MarkTechPost.

Top 19 AI Red Teaming Tools (2026): Secure Your ML Models Read Post »

AI, Committee, ข่าว, Uncategorized

A Coding Guide to Build a Production-Grade Background Task Processing System Using Huey with SQLite, Scheduling, Retries, Pipelines, and Concurrency Control

In this tutorial, we explore how to build a fully functional background task processing system using Huey directly, without relying on Redis. We configure a SQLite-backed Huey instance, start a real consumer in the notebook, and implement advanced task patterns, including retries, priorities, scheduling, pipelines, locking, and monitoring via signals. As we move step by step, we demonstrate how we can simulate production-grade asynchronous job handling while keeping everything self-contained and easy to run in a cloud notebook environment. Copy CodeCopiedUse a different Browser !pip -q install -U huey import os import time import json import random import threading from datetime import datetime from huey import SqliteHuey, crontab from huey.constants import WORKER_THREAD DB_PATH = “/content/huey_demo.db” if os.path.exists(DB_PATH): os.remove(DB_PATH) huey = SqliteHuey( name=”colab-huey”, filename=DB_PATH, results=True, store_none=False, utc=True, ) print(“Huey backend:”, type(huey).__name__) print(“SQLite DB at:”, DB_PATH) We install Huey and configure a SQLite-backed instance. We initialize the database file and ensure a clean environment before starting execution. By doing this, we establish a lightweight yet production-style task queue setup without external dependencies. Copy CodeCopiedUse a different Browser EVENT_LOG = [] @huey.signal() def _log_all_signals(signal, task, exc=None): EVENT_LOG.append({ “ts”: datetime.utcnow().isoformat() + “Z”, “signal”: str(signal), “task”: getattr(task, “name”, None), “id”: getattr(task, “id”, None), “args”: getattr(task, “args”, None), “kwargs”: getattr(task, “kwargs”, None), “exc”: repr(exc) if exc else None, }) def print_latest_events(n=10): print(“n— Latest Huey events —“) for row in EVENT_LOG[-n:]: print(json.dumps(row, indent=2)) We implement a signal handler to capture and store task lifecycle events in a structured log. We track execution details, including task IDs, arguments, and exceptions, to improve observability. Through this mechanism, we build real-time monitoring into our asynchronous system. Copy CodeCopiedUse a different Browser @huey.task(priority=50) def quick_add(a, b): return a + b @huey.task(priority=10) def slow_io(seconds=1.0): time.sleep(seconds) return f”slept={seconds}” @huey.task(retries=3, retry_delay=1, priority=100) def flaky_network_call(p_fail=0.6): if random.random() < p_fail: raise RuntimeError(“Transient failure (simulated)”) return “OK” @huey.task(context=True, priority=60) def cpu_pi_estimate(samples=200_000, task=None): inside = 0 rnd = random.random for _ in range(samples): x, y = rnd(), rnd() if x*x + y*y <= 1.0: inside += 1 est = 4.0 * inside / samples return {“task_id”: task.id if task else None, “pi_estimate”: est, “samples”: samples} We define multiple tasks with priorities, retry configurations, and contextual awareness. We simulate different workloads, including simple arithmetic, I/O delay, transient failures, and CPU-bound computation. By doing this, we demonstrate how Huey handles reliability, execution order, and task metadata. Copy CodeCopiedUse a different Browser @huey.lock_task(“demo:daily-sync”) @huey.task() def locked_sync_job(tag=”sync”): time.sleep(1.0) return f”locked-job-done:{tag}:{datetime.utcnow().isoformat()}Z” @huey.task() def fetch_number(seed=7): random.seed(seed) return random.randint(1, 100) @huey.task() def transform_number(x, scale=3): return x * scale @huey.task() def store_result(x): return {“stored_value”: x, “stored_at”: datetime.utcnow().isoformat() + “Z”} We introduce locking to prevent concurrent execution of critical jobs. We also define tasks that will later be chained together using pipelines to form structured workflows. Through this design, we model realistic background processing patterns that require sequencing and concurrency control. Copy CodeCopiedUse a different Browser TICK = {“count”: 0} @huey.task() def heartbeat(): TICK[“count”] += 1 print(f”[heartbeat] tick={TICK[‘count’]} utc={datetime.utcnow().isoformat()}Z”) @huey.periodic_task(crontab(minute=”*”)) def heartbeat_minutely(): heartbeat() _TIMER_STATE = {“running”: False, “timer”: None} def start_seconds_heartbeat(interval_sec=15): _TIMER_STATE[“running”] = True def _tick(): if not _TIMER_STATE[“running”]: return huey.enqueue(heartbeat.s()) t = threading.Timer(interval_sec, _tick) _TIMER_STATE[“timer”] = t t.start() _tick() def stop_seconds_heartbeat(): _TIMER_STATE[“running”] = False t = _TIMER_STATE.get(“timer”) if t is not None: try: t.cancel() except Exception: pass _TIMER_STATE[“timer”] = None We define heartbeat behavior and configure minute-level periodic execution using Huey’s crontab scheduling. We also implement a timer-based mechanism to simulate sub-minute execution intervals for demonstration purposes. With this setup, we create visible recurring background activity within the notebook. Copy CodeCopiedUse a different Browser consumer = huey.create_consumer( workers=4, worker_type=WORKER_THREAD, periodic=True, initial_delay=0.1, backoff=1.15, max_delay=2.0, scheduler_interval=1, check_worker_health=True, health_check_interval=10, flush_locks=False, ) consumer_thread = threading.Thread(target=consumer.run, daemon=True) consumer_thread.start() print(“Consumer started (threaded).”) print(“nEnqueue basics…”) r1 = quick_add(10, 32) r2 = slow_io(0.75) print(“quick_add result:”, r1(blocking=True, timeout=5)) print(“slow_io result:”, r2(blocking=True, timeout=5)) print(“nRetries + priority demo (flaky task)…”) rf = flaky_network_call(p_fail=0.7) try: print(“flaky_network_call result:”, rf(blocking=True, timeout=10)) except Exception as e: print(“flaky_network_call failed even after retries:”, repr(e)) print(“nContext task (task id inside payload)…”) rp = cpu_pi_estimate(samples=150_000) print(“pi payload:”, rp(blocking=True, timeout=20)) print(“nLocks demo: enqueue multiple locked jobs quickly (should serialize)…”) locked_results = [locked_sync_job(tag=f”run{i}”) for i in range(3)] print([res(blocking=True, timeout=10) for res in locked_results]) print(“nScheduling demo: run slow_io in ~3 seconds…”) rs = slow_io.schedule(args=(0.25,), delay=3) print(“scheduled handle:”, rs) print(“scheduled slow_io result:”, rs(blocking=True, timeout=10)) print(“nRevoke demo: schedule a task in 5s then revoke before it runs…”) rv = slow_io.schedule(args=(0.1,), delay=5) rv.revoke() time.sleep(6) try: out = rv(blocking=False) print(“revoked task output:”, out) except Exception as e: print(“revoked task did not produce result (expected):”, type(e).__name__, str(e)[:120]) print(“nPipeline demo…”) pipeline = ( fetch_number.s(123) .then(transform_number, 5) .then(store_result) ) pipe_res = huey.enqueue(pipeline) print(“pipeline final result:”, pipe_res(blocking=True, timeout=10)) print(“nStarting 15-second heartbeat demo for ~40 seconds…”) start_seconds_heartbeat(interval_sec=15) time.sleep(40) stop_seconds_heartbeat() print(“Stopped 15-second heartbeat demo.”) print_latest_events(12) print(“nStopping consumer gracefully…”) consumer.stop(graceful=True) consumer_thread.join(timeout=5) print(“Consumer stopped.”) We start a threaded consumer inside the notebook to process tasks asynchronously. We enqueue tasks, test retries, demonstrate scheduling and revocation, execute pipelines, and observe logged signals. Finally, we gracefully shut down the consumer to ensure clean resource management and controlled system termination. In conclusion, we designed and executed an advanced asynchronous task system using Huey with a SQLite backend and an in-notebook consumer. We implemented retries, task prioritization, future scheduling, revocation, locking mechanisms, task chaining through pipelines, and periodic behavior simulation, all within a Colab-friendly setup. Through this approach, we gained a clear understanding of how to use Huey to manage background workloads efficiently and extend this architecture to real-world production deployments. Check out the Full Coding Notebook/Implementation here. Also, feel free to follow us on Twitter and don’t forget to join our 130k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well. Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us The post A Coding Guide to Build a Production-Grade Background Task Processing System Using Huey with SQLite, Scheduling, Retries, Pipelines, and Concurrency Control appeared first on MarkTechPost.

A Coding Guide to Build a Production-Grade Background Task Processing System Using Huey with SQLite, Scheduling, Retries, Pipelines, and Concurrency Control Read Post »

AI, Committee, ข่าว, Uncategorized

Google AI Releases Auto-Diagnose: An Large Language Model LLM-Based System to Diagnose Integration Test Failures at Scale

If you have ever stared at thousands of lines of integration test logs wondering which of the sixteen log files actually contains your bug, you are not alone — and Google now has data to prove it. A team of Google researchers introduced Auto-Diagnose, an LLM-powered tool that automatically reads the failure logs from a broken integration test, finds the root cause, and posts a concise diagnosis directly into the code review where the failure showed up. On a manual evaluation of 71 real-world failures spanning 39 distinct teams, the tool correctly identified the root cause 90.14% of the time. It has run on 52,635 distinct failing tests across 224,782 executions on 91,130 code changes authored by 22,962 distinct developers, with a ‘Not helpful’ rate of just 5.8% on the feedback received. https://arxiv.org/pdf/2604.12108 The problem: integration tests are a debugging tax Integration tests verify that multiple components of a distributed system actually communicate to each other correctly. The tests Auto-Diagnose targets are hermetic functional integration tests: tests where an entire system under test (SUT) — typically a graph of communicating servers — is brought up inside an isolated environment by a test driver, and exercised against business logic. A separate Google survey of 239 respondents found that 78% of integration tests at Google are functional, which is what motivated the scope. Diagnosing integration test failures showed up as one of the top five complaints in EngSat, a Google-wide survey of 6,059 developers. A follow-up survey of 116 developers found that 38.4% of integration test failures take more than an hour to diagnose, and 8.9% take more than a day — versus 2.7% and 0% for unit tests. The root cause is structural. Test driver logs usually surface only a generic symptom (a timeout, an assertion). The actual error lives somewhere inside one of the SUT component logs, often buried under recoverable warnings and ERROR-level lines that are not actually the cause. https://arxiv.org/pdf/2604.12108 How Auto-Diagnose works When an integration test fails, a pub/sub event triggers Auto-Diagnose. The system collects all test driver and SUT component logs at level INFO and above — across data centers, processes, and threads — then joins and sorts them by timestamp into a single log stream. That stream is dropped into a prompt template along with component metadata. The model is Gemini 2.5 Flash, called with temperature = 0.1 (for near-deterministic, debuggable outputs) and topp = 0.8. Gemini was not fine-tuned on Google’s integration test data; this is pure prompt engineering on a general-purpose model. The prompt itself is the most instructive part of this research. It walks the model through an explicit step-by-step protocol: scan log sections, read component context, locate the failure, summarize errors, and only then attempt a conclusion. Critically, it includes hard negative constraints — for example: if the logs do not contain lines from the component that failed, do not draw any conclusion. The model’s response is post-processed into a markdown finding with ==Conclusion==, ==Investigation Steps==, and ==Most Relevant Log Lines== sections, then posted as a comment in Critique, Google’s internal code review system. Each cited log line is rendered as a clickable link. Numbers from production Auto-Diagnose averages 110,617 input tokens and 5,962 output tokens per execution, and posts findings with a p50 latency of 56 seconds and p90 of 346 seconds — fast enough that developers see the diagnosis before they have switched contexts. Critique exposes three feedback buttons on a finding: Please fix (used by reviewers), Helpful, and Not helpful (both used by authors). Across 517 total feedback reports from 437 distinct developers, 436 (84.3%) were “Please fix” from 370 reviewers — by far the dominant interaction, and a sign that reviewers are actively asking authors to act on the diagnoses. Among dev-side feedback, the helpfulness ratio (H / (H + N)) is 62.96%, and the “Not helpful” rate (N / (PF + H + N)) is 5.8% — well under Google’s 10% threshold for keeping a tool live. Across 370 tools that post findings to Critique, Auto-Diagnose ranks #14 in helpfulness, putting it in the top 3.78%. The manual evaluation also surfaced a useful side effect. Of the seven cases where Auto-Diagnose failed, four were because test driver logs were not properly saved on crash, and three were because SUT component logs were not saved when the component crashed — both real infrastructure bugs, reported back to the relevant teams. In production, around 20 ‘more information is needed‘ diagnoses have similarly helped surface infrastructure issues. Key Takeaways Auto-Diagnose hit 90.14% root-cause accuracy on a manual evaluation of 71 real-world integration test failures spanning 39 teams at Google, addressing a problem 6,059 developers ranked among their top five complaints in the EngSat survey. The system runs on Gemini 2.5 Flash with no fine-tuning — just prompt engineering. A pub/sub trigger collects logs across data centers and processes, joins them by timestamp, and sends them to the model at temperature 0.1 and topp 0.8. The prompt is engineered to refuse rather than guess. Hard negative constraints force the model to respond with “more information is needed” when evidence is missing — a deliberate trade-off that prevents hallucinated root causes and even helped surface real infrastructure bugs in Google’s logging pipeline. In production since May 2025, Auto-Diagnose has run on 52,635 distinct failing tests across 224,782 executions on 91,130 code changes from 22,962 developers, posting findings in a p50 of 56 seconds — fast enough that engineers see the diagnosis before switching contexts. Check out the Pre-Print Paper here. Also, feel free to follow us on Twitter and don’t forget to join our 130k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well. Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us The post Google AI Releases Auto-Diagnose: An Large Language Model LLM-Based System to Diagnose Integration Test Failures at Scale appeared first on MarkTechPost.

Google AI Releases Auto-Diagnose: An Large Language Model LLM-Based System to Diagnose Integration Test Failures at Scale Read Post »

AI, Committee, ข่าว, Uncategorized

The case for fixing everything

The handsome new book Maintenance: Of Everything, Part One, by the tech industry legend Stewart Brand, promises to be the first in a series offering “a comprehensive overview of the civilizational importance of maintenance.” One of Brand’s several biographers described him as a mainstay of both counterculture and cyberculture, and with Maintenance, Brand wants us to understand that the upkeep and repair of tools and systems has profound impact on daily life. As he puts it, “Taking responsibility for maintaining something—whether a motorcycle, a monument, or our planet—can be a radical act.” Radical how? This volume doesn’t say. In an outline for the overall work, Brand says his goal is to “end with the nature of maintainers and the honor owed them.” The idea that maintainers are owed anything, much less honor, might surprise some readers. Actually, maintenance and repair have been hot topics in academia since the mid-2010s. I played some role in that movement as a cofounder of the Maintainers, a global, interdisciplinary network dedicated to the study of maintenance, repair, care, and all the work that goes into keeping the world going. Brand is right, too, that maintainers haven’t gotten the laurels they deserve. Over the past few decades, scholars have shown that work from oiling tools to replacing worn parts to updating code bases all tends to be lower in status than “innovation.” Maintenance gets neglected in many organizational and social settings. (Just look at some American infrastructure!) And as the right-to-­repair movement has shown, companies in pursuit of greater profits have frequently locked us out of being able to do repairs or greatly reduced the maintainable life of their products. It’s hard to think of any other reason to put a computer in the door of a refrigerator. Some of Brand’s earlier work helped inspire those insights. But his new book makes me think he doesn’t see things that way. For Brand, maintenance seems to be a solitary act, profound but more about personal success and fulfillment than tending to a shared world or making it better. Born in 1938, Brand is 87 years old. A sense hangs over the book—with its battles against corrosion, rust, and decay, with its attempts to keep things going even as they inevitably falter—of someone looking over life and pondering its end. Maintenance: Of Everything connects to every stage of Brand’s life. It’s worth reviewing where it falls in that arc. Brand has always been interested in tools and fixing things, but rarely has he focused on the systems that need the most care.  More than a half-century ago, Brand was a member of the Merry Pranksters, a countercultural, LSD-centered hippie collective famously led by Ken Kesey, the author of One Flew Over the Cuckoo’s Nest. In 1966, Brand co-produced the Trips Festival, where bands like the Grateful Dead and Big Brother and the Holding Company performed for thousands amid psychedelic light shows. Brand’s Whole Earth Catalog had a vision that might feel progressive, but its libertarian, rugged-individualist philosophy of remaking civilization alone stood in contrast to more collective social change movements. In some ways, the Trips Festival set a paradigm for the rest of his life’s work. Brand’s biographers have described him as a network celebrity—someone who got ahead by bringing people together, building coalitions of influential figures who could boost his signal. As Kesey put it in 1980, “Stewart recognizes power. And cleaves to it.”  Brand applied this network logic to the undertaking he will always be best remembered for: the Whole Earth Catalog. First published in 1968 and aimed at hippies and members of the nascent back-to-the-land movement, the publication had the motto “Access to tools.” Its pages were full of Quonset huts, geodesic domes, solar panels, well pumps, water filters, and other technologies for life off the grid. It was a vision that might feel progressive or left-leaning, but the libertarian, rugged-individualist philosophy of eschewing corrupt systems and remaking civilization alone stood in contrast to the more collective movements pushing for deep social change at the time—like civil rights, feminism, and environmentalism. That vision also led straight to the empowerment that came with new digital tools, and to Silicon Valley. In 1985, Brand published the Whole Earth Software Catalog, the last of the series, and also cofounded the WELL—the Whole Earth ’Lectronic Link, a pioneering online community famous for, among other things, facilitating the trade of Grateful Dead bootlegs. He also wrote a hagiographic book about the MIT Media Lab, known for its corporate-sponsored research into new communications tech. “The Lab would cure the pathologies of technology not with economics or politics but with technology,” Brand wrote. Again, not collective action, not policymaking: tools. And Brand then cofounded the Global Business Network, a group of pricey consulting futurists that further connected him to MIT, Stanford, and the Valley. Brand had literally helped bring about the modern digital revolution. His attention then turned toward its upkeep. Brand’s 1994 book, How Buildings Learn: What Happens After They’re Built, argued against high-modernist architectural ideas. Nearly all buildings eventually get remade, he argued, but he especially favored cheap, simple structures that inhabitants could easily retool to suit changing needs. In some ways, Brand was recapitulating the liberated—or libertarian—philosophy of the Whole Earth Catalog: People can remake their world, if they have access to tools. In a chapter titled “The Romance of Maintenance,” he asked readers to see the beauty, value, and occasional pleasures of fixer-uppers of all kinds. This chapter was a touchstone for many of us in the academic subfield of maintenance studies. Researchers in disciplines like history, sociology, and anthropology, as well as artists and practitioners in fields like libraries, IT, and engineering, all started trying to understand the realities and, yes, romance of maintenance and repair. Brand joined and contributed to Listservs, attended conferences, chatted with intellectual leaders. So it’s a bit uncharitable when he writes that his new book is “the first to look at maintenance in general.” He knows better. The real question,

The case for fixing everything Read Post »

AI, Committee, ข่าว, Uncategorized

How robots learn: A brief, contemporary history

Roboticists used to dream big but build small. They’d hope to match or exceed the extraordinary complexity of the human body, and then they’d spend their career refining robotic arms for auto plants. Aim for C-3P0; end up with the Roomba.  The real ambition for many of these researchers was the robot of science fiction—one that could move through the world, adapt to different environments, and interact safely and helpfully with people. For the socially minded, such a machine could help those with mobility issues, ease loneliness, or do work too dangerous for humans. For the more financially inclined, it would mean a bottomless source of wage-free labor. Either way, a long history of failure left most of Silicon Valley hesitant to bet on helpful robots. That has changed. The machines are yet unbuilt, but the money is flowing: Companies and investors put $6.1 billion into humanoid robots in 2025 alone, four times what was invested in 2024.  What happened? A revolution in how machines have learned to interact with the world.  Imagine you’d like a pair of robot arms installed in your home purely to do one thing: fold clothes. How would it learn to do that? You could start by writing rules. Check the fabric to figure out how much deformation it can tolerate before tearing. Identify a shirt’s collar. Move the gripper to the left sleeve, lift it, and fold it inward by exactly this distance. Repeat for the right sleeve. If the shirt is rotated, turn the plan accordingly. If the sleeve is twisted, correct it. Very quickly the number of rules explodes, but a complete accounting of them could produce reliable results. This was the original craft of robotics: anticipating every possibility and encoding it in advance. Around 2015, the cutting edge started to do things differently: Build a digital simulation of the robotic arms and the clothes, and give the program a reward signal every time it folds successfully and a ding every time it fails. This way, it gets better by trying all sorts of techniques through trial and error, with millions of iterations—the same way AI got good at playing games. The arrival of ChatGPT in 2022 catalyzed the current boom. Trained on vast amounts of text, large language models work not through trial and error but by learning to predict what word should come next in a sentence. Similar models adapted to robotics were soon able to absorb pictures, sensor readings, and the position of a robot’s joints and predict the next action the machine should take, issuing dozens of motor commands every second. This conceptual shift—to reliance on AI models that ingest large amounts of data—seems to work whether that helpful robot is supposed to talk to people, move through an environment, or even do complicated tasks. And it was paired with other ideas about how to accomplish this new way of learning, like deploying robots even if they aren’t yet perfect so they can learn from the environment they’re meant to work in. Today, Silicon Valley roboticists are dreaming big again. Here’s how that happened.  Jibo Jibo A movable social robot carried out conversations long before the age of LLMs. An MIT robotics researcher named Cynthia Breazeal introduced an armless, legless, faceless robot called Jibo to the world in 2014. It looked, in fact, like a lamp. Breazeal’s aim was to create a social robot for families, and the idea pulled in $3.7 million in a crowdsourced funding campaign. Early preorders cost $749. The early Jibo could introduce itself and dance to entertain kids, but that was about it. The vision was always for it to become a sort of embodied assistant that could handle everything from scheduling and emails to telling stories. It earned a number of devoted users, but ultimately the company shut down in 2019. A crowdfunding campaign started in 2014 and drew 4,800 Jibo preorders.COURTESY OF MIT MEDIA LAB In retrospect, one thing that Jibo really needed was better language capabilities. It was competing against Apple’s Siri and Amazon’s Alexa, and all those technologies at the time relied on heavy scripting. In broad terms, when you spoke to them, software would translate your speech into text, analyze what you wanted, and create a response pulled from preapproved snippets. Those snippets could be charming, but they were also repetitive and simply boring—downright robotic. That was especially a challenge for a robot that was supposed to be social and family oriented.  What has happened since, of course, is a revolution in how machines can generate language. Voice mode from any leading AI provider is now engaging and impressive, and multiple hardware startups are trying (and failing) to build products that take advantage of it.  But that comes with a new risk: While scripted conversations can’t really go off the rails, ones generated by AI certainly can. Some popular AI toys have, for example, talked to kids about how to find matches and knives.  OpenAI Dactyl A robot hand trained with simulations tries to model the unpredictability and variation of the real world. By 2018, every leading robotics lab was trying to scrap the old scripted rules and train robots through trial and error. OpenAI tried to train its robotic hand, Dactyl, virtually—with digital models of the hand and of the palm-size cubes Dactyl was supposed to manipulate. The cubes had letters and numbers on their faces; the model might set a task like “Rotate the cube so the red side with the letter O faces upward.” Here’s the problem: A robotic hand might get really good at doing this in its simulated world, but when you take that program and ask it to work on a real version in the real world, the slight differences between the two can cause things to go awry. Colors might be slightly different, or the deformable rubber in the robot’s fingertips could turn out to be stretchier than it was in simulation. Dactyl, part of OpenAI’s first attempt at robotics,

How robots learn: A brief, contemporary history Read Post »

AI, Committee, ข่าว, Uncategorized

The Download: bad news for inner Neanderthals, and AI warfare’s human illusion

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology. The problem with thinking you’re part Neanderthal There’s a theory that many of us have an “inner Neanderthal.” The idea is that Homo sapiens and a cousin species once bred, leaving some people today with a trace of Neanderthal DNA.  This DNA is arguably the 21st century’s most celebrated discovery in human evolution. But in 2024, a pair of French geneticists called into question the theory’s very foundations.  They proposed that what scientists interpret as interbreeding could instead be explained by population structure—the way genes concentrate in smaller, isolated groups. Find out what it all means for human evolution. —Ben Crair This story is from the next issue of our print magazine, which is all about nature. Subscribe now to read it when it lands on Wednesday, April 22. Why having “humans in the loop” in an AI war is an illusion —Uri Maoz AI is starting to shape real wars. It’s at the center of a legal battle between Anthropic and the Pentagon, playing a growing role in the conflict with Iran, and raising questions about how much humans should remain “in the loop.” Under Pentagon guidelines, human oversight is meant to provide accountability, context, and security. But the idea of “humans in the loop” is a comforting distraction. The real danger isn’t that machines will act without oversight; it’s that human overseers have no idea what the machines are actually “thinking.” Thankfully, science may offer a way forward. Read the full op-ed on the urgent need for new safeguards around AI warfare. The must-reads I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology. 1 Despite blacklisting Anthropic, the White House wants its new modelTrump officials are negotiating access to Mythos. (Axios)+ Anthropic said it was too dangerous for a public release. (Bloomberg $)+ Finance ministers are alarmed about the security risks. (BBC)+ Anthropic just rolled out a model that’s less risky than Mythos. (CNBC)+ The Pentagon has pursued a culture war against the company. (MIT Technology Review) 2 Sam Altman’s side hustles have raised conflict-of-interest concernsHis opaque investments could influence decisions at OpenAI. (WSJ $)+ A jury will soon decide if OpenAI abandoned its founding mission. (Wired $)+ The company is making a big play for science. (MIT Technology Review) 3 A Starlink outage during drone tests exposed the Pentagon’s SpaceX relianceIt was one of several Navy test disruptions linked to Starlink. (Reuters $)+ The DoD is also tapping Ford and GM for military innovations.(NYT $) 4 Data center delays threaten to choke AI expansion40% of this year’s projects are at risk of falling behind schedule. (FT $)+ Partly because no one wants a data center in their backyard. (MIT Technology Review) 5 Alibaba just released its own version of a world modelHappy Oyster is the latest attempt to extend AI’s ability to comprehend physical reality. (SCMP)+ But they still need to understand cause and effect. (FT $) 6 Google’s Gemini is now generating AI images tailored to personal dataBy analyzing users’ Google services and data. (Quartz)+ Google says it will cut the need for detailed prompts. (TechCrunch) 7 OpenAI is beefing up its agentic coding and development systemIts Codex update is a direct shot at Claude Code. (The Verge)+ But not everyone is convinced about AI coding. (MIT Technology Review) 8 Europe’s online age verification app is hereIt’s available for free to any company that wants it. (Wired $)  9 Smartglasses are giving Korean theaters hope of a K-Pop momentTheir AI-powered translations are taking the shows to the world. (NYT $) 10 Global voice actors are fighting Hollywood’s AI pushTheir voices are training the models that are replacing them. (Rest of World) Quote of the day “There’s this dark period between now and some time in the future where the advantage is very much offensive AI.”  —Rob Joyce, former director of cybersecurity at the National Security Agency, tells Bloomberg how AI is creating new hacking threats. One More Thing COURTESY OF NOVEON MAGNETICS The race to produce rare earth elements Access to rare earth elements will determine which countries meet their goals for lowering emissions or generating energy from non-fossil-fuel sources. But some nations, including the US, are worried about the supply of these elements.  China dominates the market, while extraction in the US is limited. As a result, scientists and companies are exploring unconventional sources. Read the full story on their search for critical minerals. —Mureji Fatunde We can still have nice things A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line.)+ This ska cover of Rage Against the Machine is an upbeat way to start a revolution.+ We finally know how far Stretch Armstrong can really stretch.+ Customize these ambient sounds to wash away disruptive thoughts.+ Here’s proof childhood dreams can come true: a girl guiding a seal to perform tricks. 

The Download: bad news for inner Neanderthals, and AI warfare’s human illusion Read Post »

AI, Committee, ข่าว, Uncategorized

Why having “humans in the loop” in an AI war is an illusion

The availability of artificial intelligence for use in warfare is at the center of a legal battle between Anthropic and the Pentagon. This debate has become urgent, with AI playing a bigger role than ever before in the current conflict with Iran. AI is no longer just helping humans analyze intelligence. It is now an active player—generating targets in real time, controlling and coordinating missile interceptions, and guiding lethal swarms of autonomous drones. Most of the public conversation regarding the use of AI-driven autonomous lethal weapons centers on how much humans should remain “in the loop.” Under the Pentagon’s current guidelines, human oversight supposedly provides accountability, context, and nuance while reducing the risk of hacking. AI systems are opaque “black boxes” But the debate over “humans in the loop” is a comforting distraction. The immediate danger is not that machines will act without human oversight; it is that human overseers have no idea what the machines are actually “thinking.” The Pentagon’s guidelines are fundamentally flawed because they rest on the dangerous assumption that humans understand how AI systems work. Having studied intentions in the human brain for decades and in AI systems more recently, I can attest that state-of-the-art AI systems are essentially “black boxes.” We know the inputs and outputs, but the artificial “brain” processing them remains opaque. Even their creators cannot fully interpret them or understand how they work. And when AIs do provide reasons, they are not always trustworthy. The illusion of human oversight in autonomous systems In the debate over human oversight, a fundamental question is going unasked: Can we understand what an AI system intends to do before it acts? Imagine an autonomous drone tasked with destroying an enemy munitions factory. The automated command and control system determines that the optimal target is a munitions storage building. It reports a 92% probability of mission success because secondary explosions of the munitions in the building will thoroughly destroy the facility. A human operator reviews the legitimate military objective, sees the high success rate, and approves the strike. But what the operator does not know is that the AI system’s calculation included a hidden factor: Beyond devastating the munitions factory, the secondary explosions would also severely damage a nearby children’s hospital. The emergency response would then focus on the hospital, ensuring the factory burns down. To the AI, maximizing disruption in this way meets its given objective. But to a human, it is potentially committing a war crime by violating the rules regarding civilian life.  Keeping a human in the loop may not provide the safeguard people imagine, because the human cannot know the AI’s intention before it acts. Advanced AI systems do not simply execute instructions; they interpret them. If operators fail to define their objectives carefully enough—a highly likely scenario in high-pressure situations—the “black box” system could be doing exactly what it was told and still not acting as humans intended. This “intention gap” between AI systems and human operators is precisely why we hesitate to deploy frontier black-box AI in civilian health care or air traffic control, and why its integration into the workplace remains fraught—yet we are rushing to deploy it on the battlefield. To make matters worse, if one side in a conflict deploys fully autonomous weapons, which operate at machine speed and scale, the pressure to remain competitive would push the other side to rely on such weapons too. This means the use of increasingly autonomous—and opaque—AI decision-making in war is only likely to grow. The solution: Advance the science of AI intentions The science of AI must comprise both building highly capable AI technology and understanding how this technology works. Huge advances have been made in developing and building more capable models, driven by record investments—forecast by Gartner to grow to around $2.5 trillion in 2026 alone. In contrast, the investment in understanding how the technology works has been minuscule. We need a massive paradigm shift. Engineers are building increasingly capable systems. But understanding how these systems work is not just an engineering problem—it requires an interdisciplinary effort. We must build the tools to characterize, measure, and intervene in the intentions of AI agents before they act. We need to map the internal pathways of the neural networks that drive these agents so that we can build a true causal understanding of their decision-making, moving beyond merely observing inputs and outputs.  A promising way forward is to combine techniques from mechanistic interpretability (breaking neural networks down into human-understandable components) with insights, tools, and models from the neuroscience of intentions. Another idea is to develop transparent, interpretable “auditor” AIs designed to monitor the behavior and emergent goals of more capable black-box systems in real time.   Developing a better understanding of how AI functions will enable us to rely on AI systems for mission-critical applications. It will also make it easier to build more efficient, more capable, and safer systems. Colleagues and I are exploring how ideas from neuroscience, cognitive science, and philosophy—fields that study how intentions arise in human decision-making—might help us understand the intentions of artificial systems. We must prioritize these kinds of interdisciplinary efforts, including collaborations between academia, government, and industry. However, we need more than just academic exploration. The tech industry—and the philanthropists funding AI alignment, which strives to encode human values and goals into these models—must direct substantial investments toward interdisciplinary interpretability research. Furthermore, as the Pentagon pursues increasingly autonomous systems, Congress must mandate rigorous testing of AI systems’ intentions, not just their performance. Until we achieve that, human oversight over AI may be more illusion than safeguard. Uri Maoz is a cognitive and computational neuroscientist specializing in how the brain transforms intentions into actions. A professor at Chapman University with appointments at UCLA and Caltech, he leads an interdisciplinary initiative focused on understanding and measuring intentions in artificial intelligence systems (ai-intentions.org).

Why having “humans in the loop” in an AI war is an illusion Read Post »

We use cookies to improve your experience and performance on our website. You can learn more at นโยบายความเป็นส่วนตัว and manage your privacy settings by clicking Settings.

ตั้งค่าความเป็นส่วนตัว

You can choose your cookie settings by turning on/off each type of cookie as you wish, except for essential cookies.

ยอมรับทั้งหมด
จัดการความเป็นส่วนตัว
  • เปิดใช้งานตลอด

บันทึกการตั้งค่า
th