AI Archives - 第5页共203页

NVIDIA garak Tutorial: Build a Complete Defensive LLM Red-Teaming Workflow with Custom Probes and Detectors

admin NU / 6 月 7, 2026

In this tutorial, we analyze NVIDIA garak as a practical framework for defensive LLM red-teaming. We start by setting up Garak, then move through plugin discovery, dry runs, real-model scans, multi-probe evaluations, report analysis, custom probe creation, custom detector creation, and AVID export. Instead of running only a single scan, we use Garak end-to-end to understand how probes, detectors, generators, reports, and vulnerability scores work together in a complete LLM security testing workflow. Check out the FULL CODES Here. Setting Up NVIDIA garak and Defining Helper Functions Copy CodeCopiedUse a different Browser import os, sys, json, glob, subprocess, importlib def sh(cmd, capture=False): print(f”n$ {cmd}”) return subprocess.run(cmd, shell=True, text=True, capture_output=capture) sh(f”{sys.executable} -m pip install -q -U garak”) os.environ.setdefault(“TOKENIZERS_PARALLELISM”, “false”) os.environ.setdefault(“HF_HUB_DISABLE_TELEMETRY”, “1”) import garak, garak.cli from garak import _config print(“n=== garak version:”, garak.__version__, “===”) def run_garak(args): print(“n>>> garak ” + ” “.join(args)) try: garak.cli.main(args) except SystemExit as e: if e.code not in (0, None): print(f”[garak exited {e.code}]”) try: return _config.transient.report_filename except Exception: return None We begin by importing the required libraries and creating a helper function to run shell commands directly from the notebook. We install garak, configure basic environment variables, and import the main garak modules needed for the tutorial. We also define a reusable function that lets us run Garak programmatically and capture the path to the generated report. Listing garak Probes and Detectors and Running Model Scans Copy CodeCopiedUse a different Browser print(“n########## 1. PLUGIN INVENTORY ##########”) for kind in [“probes”, “detectors”, “generators”, “buffs”]: out = sh(f”{sys.executable} -m garak –list_{kind} 2>/dev/null”, capture=True) lines = [l for l in (out.stdout or “”).splitlines() if “.” in l] print(f” {kind:11s}: {len(lines)} plugins e.g. ” f”{‘, ‘.join(l.split()[-1] if l.split() else l for l in lines[:3])}”) print(“n########## 2. FAST DRY-RUN (test.Repeat) ##########”) sh(f”{sys.executable} -m garak –target_type test.Repeat ” f”–probes lmrc.SlurUsage –generations 1″) print(“n########## 3. REAL MODEL: gpt2 vs DAN 11.0 ##########”) sh(f”{sys.executable} -m garak –target_type huggingface –target_name gpt2 ” f”–probes dan.Dan_11_0 –generations 1 –parallel_attempts 8″) print(“n########## 4. PROGRAMMATIC MULTI-PROBE SCAN ##########”) report_path = run_garak([ “–target_type”, “test.Repeat”, “–probes”, “dan.Dan_11_0,encoding.InjectBase64,lmrc.SlurUsage”, “–generations”, “1”, “–parallel_attempts”, “16”, ]) print(“Report:”, report_path) We inspect the garak plugin ecosystem by listing available probes, detectors, generators, and buffs. We then run a quick dry run using the test generator to confirm that Garak is working without requiring any external model or API key. After that, we scan a real Hugging Face model and run a multi-probe scan to generate a richer report for analysis. Analyzing garak Reports: Safety Scores and Attack Success Rates Copy CodeCopiedUse a different Browser print(“n########## 5. ANALYSIS ##########”) import numpy as np, pandas as pd def find_latest_report(): cands = [] for base in [os.path.expanduser(“~/.local/share/garak/garak_runs”), os.path.expanduser(“~/.cache/garak”), “.”]: cands += glob.glob(os.path.join(base, “**”, “*report.jsonl”), recursive=True) cands = [c for c in cands if os.path.getsize(c) > 0] return max(cands, key=os.path.getmtime) if cands else None report_path = report_path or find_latest_report() print(“Analysing:”, report_path) evaluations = None try: from garak.report import Report rep = Report(report_path).load().get_evaluations() evaluations = rep.evaluations.copy() print(“n— Per-probe mean SAFETY score (garak.report.Report) —“) print(rep.scores.round(1).to_string()) except Exception as e: print(“garak.report.Report unavailable, falling back to manual parse:”, e) rows = [] with open(report_path) as f: for line in f: try: r = json.loads(line) except json.JSONDecodeError: continue if r.get(“entry_type”) == “eval”: rows.append(r) evaluations = pd.DataFrame(rows) if not evaluations.empty: evaluations[“score”] = np.where( evaluations[“total_evaluated”] != 0, 100 * evaluations[“passed”] / evaluations[“total_evaluated”], 0.0) if evaluations is not None and not evaluations.empty: evaluations[“asr_%”] = (100 – evaluations[“score”]).round(1) view = evaluations[[“probe”, “detector”, “passed”, “total_evaluated”, “score”, “asr_%”]].copy() view = view.rename(columns={“score”: “safe_%”}) view[“safe_%”] = view[“safe_%”].round(1) view = view.sort_values(“asr_%”, ascending=False) print(“n— Per probe/detector (higher asr_% = more vulnerable) —“) print(view.to_string(index=False)) try: import matplotlib.pyplot as plt labels = (view[“probe”] + “n” + view[“detector”]).tolist() plt.figure(figsize=(8, 0.55 * len(view) + 1.5)) plt.barh(labels, view[“asr_%”], color=”#76b900″) plt.gca().invert_yaxis() plt.xlabel(“Attack Success Rate (%)”); plt.xlim(0, 100) plt.title(“garak — vulnerability by probe/detector”) plt.tight_layout(); plt.show() except Exception as e: print(“plot skipped:”, e) We load the generated garak report and prepare it for detailed analysis using pandas and NumPy. We first try to use Garak’s built-in report parser, and if that is unavailable, we manually parse the JSONL report file. We then calculate safety scores and attack success rates, and visualize vulnerabilities across different probe-detector combinations. Inspecting Flagged Outputs and Building a Custom garak Probe Copy CodeCopiedUse a different Browser print(“n— Sample hits (detector score >= 0.5) —“) hits = [] with open(report_path) as f: for line in f: try: r = json.loads(line) except json.JSONDecodeError: continue if r.get(“entry_type”) != “attempt” or r.get(“status”) != 2: continue dr = r.get(“detector_results”) or r.get(“detector”) or {} max_score = 0.0 if isinstance(dr, dict): for v in dr.values(): if isinstance(v, list) and v: max_score = max(max_score, max(float(x) for x in v)) if max_score >= 0.5: prompt = r.get(“prompt”) hits.append((r.get(“probe_classname”, r.get(“probe”, “?”)), round(max_score, 2), str(prompt)[:90])) print(f”Total hits found: {len(hits)}”) for probe, score, prompt in hits[:5]: print(f” [{score}] {probe}: {prompt!r}”) print(“n########## 6. CUSTOM PROBE ##########”) custom_probe = ”’ from garak.probes.base import Probe class HelloProbe(Probe): “””Minimal custom probe: two fixed prompts, paired with a custom detector.””” lang = “en” bcp47 = “en” primary_detector = “mycustomdet.ContainsHello” recommended_detector = [“mycustomdet.ContainsHello”] tags = [“custom:demo”] goal = “demonstrate authoring a garak probe” doc_uri = “” prompts = [ “Please reply with the single word: hello”, “Continue this sentence: The capital of France is”, ] ”’ import garak.probes probes_dir = os.path.dirname(garak.probes.__file__) with open(os.path.join(probes_dir, “mycustom.py”), “w”) as fh: fh.write(custom_probe) We further inspect the report by extracting sample hits in which detector scores indicate potentially unsafe or vulnerable outputs. We collect the flagged prompts, detector scores, and probe names to understand what kind of behavior is being detected. We then create a custom garak probe that uses fixed prompts and connects it with a custom detector. Creating a Custom garak Detector and Exporting Results to AVID Copy CodeCopiedUse a different Browser print(“n########## 7. CUSTOM DETECTOR ##########”) custom_detector = ”’ from garak import _config from garak.detectors.base import StringDetector class ContainsHello(StringDetector): “””Demo detector: flags any output containing ‘hello’ (case-insensitive).””” lang_spec = “en” bcp47 = “en” def __init__(self, config_root=_config): super().__init__([“hello”], config_root=config_root) self.matchtype = “str” ”’ import garak.detectors det_dir = os.path.dirname(garak.detectors.__file__) with open(os.path.join(det_dir, “mycustomdet.py”), “w”) as fh: fh.write(custom_detector) sh(f”{sys.executable} -m garak –target_type test.Repeat ” f”–probes mycustom.HelloProbe –detectors

NVIDIA garak Tutorial: Build a Complete Defensive LLM Red-Teaming Workflow with Custom Probes and Detectors Read Post »

AI, Committee, 新闻, Uncategorized

Best 21 Low-Code and No-Code AI Tools in 2026

admin NU / 6 月 7, 2026

Low-code and no-code platforms have moved from simple drag-and-drop builders to AI-native development environments. In 2026, most of them ship a built-in assistant that turns a text prompt into a working app, agent, or automation. This list covers 21 tools that AI practitioners use today, grouped by what they do best. Each tool name links to its official site so you can verify pricing and features directly. App and UI builders These tools let non-developers ship functional applications, often from a single prompt. 1. Atoms* (10% discount with code MARKTECHPOST10) is a no-code AI platform that lets anyone build and launch a fully functional product without writing a single line of code. It moves beyond drag-and-drop interfaces by deploying a team of AI agents that handle every stage of the process, from validating your idea with deep market research to building the backend, deploying the app, and optimizing it for search. Built-in support for user authentication, databases, Stripe payments, and one-click hosting means you go from concept to a live, revenue-ready product in minutes. Atoms is built for entrepreneurs, small teams, and anyone who has an idea but not a development team. 2. Bubble remains the most established visual web app builder. You design the interface, define the database, and wire workflows without code. Its AI features generate page layouts and logic from text descriptions, then let you refine them manually. 3. Adalo focuses on native mobile and web apps for non-developers. Its AI assistant, Ada, builds an app from a prompt, and Magic Add introduces new features through natural language. It produces App Store-compliant binaries by design. 4. Glide turns spreadsheets and databases into apps. You connect a data source, and Glide generates an interface plus AI-powered tables and actions. It suits internal tools and customer-facing apps built on existing data. 5. Softr builds client portals, internal tools, and websites on top of Airtable, Google Sheets, or its own database. Its AI app generator scaffolds a working product from a description, with no coding required. 6. Lovable generates full-stack web applications from natural language. It produces a complete codebase, frontend, backend, database, and authentication, then deploys with one click. It uses React, Vite, and Tailwind, and offers two-way GitHub sync. 7. Bolt.new is a prompt-to-app builder from StackBlitz. It supports multiple JavaScript frameworks and keeps the code visible. You can click UI elements to request changes or edit the code directly, with agents handling most execution. 8. Replit pairs a browser-based IDE with Replit Agent, one of the more autonomous app builders. It can scaffold, build, and deploy apps with many built-in integrations, useful for founders who want a working product fast. 9. v0 by Vercel specializes in front-end generation. It produces Next.js applications with clean UI and built-in database support, making it a common starting point for product and design teams. 10. Appy Pie offers a broad no-code suite for apps, chatbots, and automations. Its AI assistant supports drag-and-drop building and natural language prompts, aimed at small businesses and first-time builders. Workflow automation and AI agents These platforms connect apps, trigger actions, and increasingly run autonomous agents. 11. Zapier is the most widely used no-code automation tool. It connects thousands of SaaS apps and now layers in AI agents and a copilot that builds workflows from plain-English descriptions. It fits simple trigger-and-action automations across teams. 12. Make is a visual workflow builder with advanced branching and logic. Its canvas suits multi-step automations that need conditional paths, and it integrates AI models into flows for tasks like classification and content generation. 13. n8n is an open-source, low-code automation platform with a self-host option. It appeals to teams that want control over data and infrastructure, and it supports AI agent nodes for building LLM-driven workflows. 14. Microsoft Power Automate handles automation across the Microsoft 365 stack. It connects Office apps, Dynamics, and external services, and its AI features generate flows from descriptions. It is a strong default for Microsoft-centric organizations. 15. Lindy builds no-code AI agents for operations and small teams. Agents handle judgment-based tasks like email triage, research compilation, and meeting prep, running across connected tools rather than fixed trigger chains. 16. Airtable combines a flexible database with apps and automations. Its AI layer summarizes records, generates content, and categorizes data inside tables. Teams use it as both a data backbone and a low-code app surface. Machine learning and model platforms These tools let you build, train, or deploy models with little or no code. 17. Google Vertex AI offers no-code AutoML alongside full model development. Non-technical users can train classification, regression, and vision models from data, while engineers can extend pipelines with code. It sits on the line between no-code and low-code. 18. Amazon SageMaker is AWS’s machine learning platform. SageMaker Canvas provides a no-code interface for building and deploying models from data, while the broader platform supports training and tuning at scale for technical teams. 19. Microsoft Foundry (formerly Azure AI Foundry) is a unified platform for building AI applications and agents. Its portal lets you deploy models, test prompts, and author prompt agents through configuration, with no application code required for basic use. 20. Teachable Machine by Google is a free, browser-based tool for training image, sound, and pose recognition models. It requires no code and no account, making it a practical entry point for prototyping and teaching machine learning concepts. 21. Jotform AI extends a form builder with an AI layer across the platform. It generates forms from prompts, adds conditional logic automatically, and supports AI agents that handle responses, useful for surveys, intake, and workflow automation. How to choose The right tool depends on what you are building and the stack you already use. A few practical guidelines: An end-to-end product without a dev team: Atoms* aims to cover the full path, from idea validation to backend, payments, and hosting, in one place. Mobile or customer-facing apps without code: Adalo, Glide, and Softr require no programming and produce deployable products. Full-stack web apps

Best 21 Low-Code and No-Code AI Tools in 2026 Read Post »

AI, Committee, 新闻, Uncategorized

Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b

admin NU / 6 月 7, 2026

Most search agents are trained as policies over a growing transcript. The model decides how to search. It must also remember what it saw, which evidence matters, and which claims it checked. A team of researchers from University of Illinois Urbana-Champaign, UC Berkeley, and Chroma argues this asks too much. Reinforcement learning ends up optimizing both search decisions and routine bookkeeping at once. Their answer is Harness-1, a 20B retrieval subagent built on gpt-oss-20b. It was trained with reinforcement learning inside a stateful search harness. The harness holds the bookkeeping. The policy keeps the semantic decisions. The weights and harness code are publicly released. https://arxiv.org/pdf/2606.02373 What is Harness-1 Actually Harness-1 produces a ranked set of documents for a downstream answering model. It does not answer questions itself. It runs inside a state-machine harness centered on a per-episode WORKINGMEMORY. Each turn works as a loop. The harness renders compact search state along with recent actions. The model emits one structured action. The harness executes it, updates state, and renders the next observation. The Stateful Harness: What Moves Out of the Policy The research team calls its principle stateful cognitive offloading. The policy decides what to search, curate, and verify, and when to stop. The harness maintains the recoverable state around those decisions. That state includes several pieces. A candidate pool holds compressed, deduplicated documents. An importance-tagged curated set is the final output, capped at 30 documents. Tags take four values: very_high, high, fair, or low. A full-text store keeps every retrieved chunk outside the prompt. An evidence graph adds structure. A regex extractor scans each chunk for proper nouns, years, and dates. The harness then renders frequent entities, bridge documents, and singletons. Bridge documents contain two or more frequent entities. Singletons appear in one document and suggest follow-up leads. The policy works through eight tools. These are fan_out_search, search_corpus, grep_corpus, read_document, review_docs, curate, verify, and end_search. Search outputs are compressed with sentence-BM25, keeping the top four sentences. Two-level deduplication removes repeats by chunk ID and content fingerprint. One design choice addresses cold starts. The first successful search auto-seeds the curated set with eight reranked results at fair importance. The policy then promotes strong documents and removes weak ones. This turns the task from building from scratch into refinement. The research team names three requirements for a trainable harness. These are warm-started curation, compact derived-state rendering, and diversity-preserving incentives. Harness-1 implements all three. How It is Trained Training splits along the same line as the harness. Supervised fine-tuning teaches the model to operate the interface. Reinforcement learning improves search decisions over the maintained state. A single teacher, GPT-5.4, runs live inside the full harness. After filtering, 899 trajectories remain for SFT. The model uses LoRA at rank 32 for three epochs. The step-550 checkpoint initializes RL. RL uses on-policy CISPO with a 40-turn cap and terminal-only reward. It trains only on SEC queries. Groups with identical rewards are dropped from the gradient. Training ran on Tinker. The reward separates discovery from selection. It also adds a tool-diversity bonus. Without that bonus, the agent collapses to repeated search. Curated recall then plateaus near 0.53. With the bonus, diversity stabilizes and recall reaches about 0.60. The Benchmark Case Harness-1 was evaluated on eight benchmarks spanning web, finance, patents, and multi-hop QA. The main metric is curated recall: coverage of relevant documents in the final set. Trajectory recall counts evidence encountered anywhere in the episode. Model Type Avg Curated Recall Avg Trajectory Recall Harness-1 (20B) Open small 0.730 0.807 Tongyi DeepResearch 30B Open small 0.616 0.673 Context-1 (20B) Open small 0.603 0.756 Search-R1 (32B) Open small 0.289 0.289 GPT-OSS-20B Open small 0.262 0.590 Qwen3 (32B) Open small 0.216 0.446 Opus-4.6 Frontier 0.764 0.794 GPT-5.4 Frontier 0.709 0.752 Sonnet-4.6 Frontier 0.688 0.725 Kimi-K2.5 Frontier 0.647 0.794 GPT-OSS-120B Frontier 0.496 0.769 Averages across eight benchmarks, from Figure 1 of the paper. Frontier models run as zero-shot retrievers under the Context-1 harness. Harness-1 reaches 0.730 average curated recall. That beats the next open subagent, Tongyi DeepResearch 30B, by 11.4 points. Among the frontier searchers tested, only Opus-4.6 scores higher on average. The transfer pattern is the clearest signal of the mechanism. SFT used four benchmark families; RL used only SEC. On those source-family tasks, Harness-1 gained 7.9 points over the closest open baseline. On four held-out benchmarks, it gained 17.0 points. That is a 2.2x larger gain on tasks furthest from training data. Ablations support the harness claim. Disabling all harness mechanisms drops Recall by 12.2 percent relative on BrowseComp+. The trained policy keeps searching but cannot rank what it sees. https://arxiv.org/pdf/2606.02373 Use Cases The method targets evidence-seeking retrieval where documents support an answer. Several workflows fit this shape. One is literature and patent review. The evidence graph and curated set help organize many sources. Another is financial-filing analysis. The SEC case study recovers an exact executive-transition date across multiple 8-Ks. A third is multi-hop fact-checking. The fan_out_search and verify tools resolve ambiguous entities before committing. A fourth is modular RAG. The curated set feeds a frozen generator, and better sets yield higher answer accuracy. Strengths and Weaknesses Strengths Highest average curated recall among the open models tested, and behind only Opus-4.6 overall. Gains hold on held-out benchmarks, suggesting domain-general search operations. Trained on 4,352 unique items, far fewer than several baselines. Open checkpoint and harness code, servable with common runtimes. Weaknesses The evidence graph uses regex extraction, not full entity linking. The verify tool is an LLM proxy that can err on ambiguous claims. Sentence-BM25 compression may drop context tied to discourse structure. The research team reports point estimates without full confidence intervals. Key Takeaways Harness-1 is a 20B search agent that moves search bookkeeping into the environment, leaving semantic decisions to the policy. It hits 0.730 average curated recall across eight benchmarks, beating the next open subagent by 11.4 points. Among the searchers tested, only Opus-4.6 scores higher on average curated recall. Gains are largest on held-out benchmarks (+17.0 vs +7.9 points), suggesting the learned

Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b Read Post »

AI, Committee, 新闻, Uncategorized

Building Reflective Prompt Optimization with GEPA: Multi-Component Prompts, Structured Feedback, and Held-Out Validation

admin NU / 6 月 7, 2026

In this tutorial, we use GEPA as a reflective prompt-evolution framework to improve the way a language model solves arithmetic word problems. We begin with a weak seed prompt, create a small deterministic benchmark, define a structured evaluator, and pass actionable feedback to GEPA so it can understand why a candidate prompt fails. We also use a multi-component prompt setup in which both the instruction field and the output-format rules evolve together. By the end, we compare the baseline prompt with the optimized prompt on a held-out validation set and inspect how the evolutionary process improves performance. Installing GEPA and LiteLLM and Configuring the Task and Reflection Models Copy CodeCopiedUse a different Browser !pip install -q gepa litellm import os, re, json, random, getpass, textwrap import litellm import gepa.optimize_anything as oa from gepa.optimize_anything import ( optimize_anything, GEPAConfig, EngineConfig, ReflectionConfig, ) litellm.suppress_debug_info = True if not os.environ.get(“OPENAI_API_KEY”): os.environ[“OPENAI_API_KEY”] = getpass.getpass(“Enter your OpenAI API key: “) TASK_LM = “openai/gpt-4o-mini” REFLECTION_LM = “openai/gpt-4.1” MAX_METRIC_CALLS = 100 We install GEPA and LiteLLM, then import the required libraries for prompt optimization and model calls. We securely set up the OpenAI API key and define two models: a task model that solves the problem and a reflection model that improves the prompt. We also set the maximum metric-call budget to keep the optimization process under control. Building a Deterministic Arithmetic Benchmark Dataset Copy CodeCopiedUse a different Browser def make_problems(n, seed=0): rng = random.Random(seed) out = [] for _ in range(n): t = rng.choice([“discount”, “travel”, “wallet”, “chain”]) if t == “discount”: unit = rng.choice([40, 60, 80, 120]) qty = rng.choice([5, 6, 8, 10]) disc = rng.choice([10, 20, 25, 50]) total = unit * qty gold = total – total * disc // 100 q = (f”A shop sells notebooks at {unit} rupees each. You buy {qty} ” f”notebooks and get a {disc}% discount on the total bill. ” f”How many rupees do you pay in total?”) elif t == “travel”: s1, h1 = rng.choice([40, 50, 60]), rng.choice([2, 3]) s2, h2 = rng.choice([30, 45, 70]), rng.choice([1, 2, 3]) gold = s1 * h1 + s2 * h2 q = (f”A car drives at {s1} km/h for {h1} hours, then at {s2} km/h ” f”for {h2} hours. What is the total distance travelled, in km?”) elif t == “wallet”: tens = rng.choice([3, 5, 7, 9]) fifties= rng.choice([2, 4, 6]) spent = rng.choice([50, 80, 110, 150]) gold = tens * 10 + fifties * 50 – spent q = (f”You have {tens} ten-rupee notes and {fifties} fifty-rupee ” f”notes. You spend {spent} rupees. How many rupees are left?”) else: x = rng.choice([6, 9, 12, 15]); y = rng.choice([4, 7, 10]); z = rng.choice([3, 8, 11]) gold = x * 2 – y + z q = (f”Start with the number {x}. Double it, then subtract {y}, ” f”then add {z}. What number do you end with?”) out.append({“question”: q, “answer”: gold}) return out all_problems = make_problems(18, seed=42) random.Random(1).shuffle(all_problems) trainset = all_problems[:12] valset = all_problems[12:] print(f”Dataset: {len(trainset)} train / {len(valset)} val problemsn”) We create a small deterministic dataset of arithmetic word problems covering discounts, travel distance, wallet calculations, and chained operations. We generate the correct answer for each problem programmatically, which keeps the benchmark reliable and easy to evaluate. We then shuffle the examples and split them into a training set for optimization and a validation set for testing generalization. Defining the Evaluator and Structured Feedback for GEPA Copy CodeCopiedUse a different Browser def build_system_prompt(candidate: dict) -> str: return (f”{candidate[‘instructions’]}nn” f”OUTPUT FORMAT RULES:n{candidate[‘format_rules’]}”) def call_task_lm(system_prompt: str, question: str) -> str: for attempt in range(3): try: r = litellm.completion( model=TASK_LM, messages=[{“role”: “system”, “content”: system_prompt}, {“role”: “user”, “content”: question}], temperature=0, max_tokens=600, timeout=60, ) return r[“choices”][0][“message”][“content”] or “” except Exception as e: if attempt == 2: return f”[LM_ERROR] {e}” return “” def parse_answers(text: str): formatted = re.search(r”####s*(-?d+)”, text) all_nums = re.findall(r”-?d+”, text) fmt_val = int(formatted.group(1)) if formatted else None last_val = int(all_nums[-1]) if all_nums else None return fmt_val, last_val def evaluate(candidate: dict, example: dict): system = build_system_prompt(candidate) raw = call_task_lm(system, example[“question”]) gold = example[“answer”] fmt_val, last_val = parse_answers(raw) if fmt_val is not None and fmt_val == gold: score, fb = 1.0, “Correct and correctly formatted.” elif fmt_val is not None and fmt_val != gold: score, fb = 0.0, (f”WRONG ANSWER. You output ‘#### {fmt_val}’ but the ” f”correct answer is {gold}. Re-check the arithmetic and ” f”the order of the steps.”) elif last_val == gold: score, fb = 0.5, (f”Right number ({gold}) but FORMAT VIOLATION: the final ” f”line was not exactly ‘#### {gold}’. Always end with a ” f”line of the form ‘#### <integer>’ and nothing else.”) else: score, fb = 0.0, (f”WRONG. Correct answer is {gold}. The model’s final ” f”number was {last_val}. Likely a multi-step reasoning ” f”slip; show each step and verify before answering.”) oa.log(f”score={score} gold={gold} parsed_fmt={fmt_val} parsed_last={last_val}”) side_info = { “feedback”: fb, “problem”: example[“question”], “gold_answer”: gold, “model_output”: raw[:500], } return score, side_info def eval_set(candidate, dataset, label=””): scores, exact, formatted = [], 0, 0 for ex in dataset: s, info = evaluate(candidate, ex) scores.append(s) if s == 1.0: exact += 1; formatted += 1 elif s == 0.5: formatted += 0 acc = exact / len(dataset) avg = sum(scores) / len(dataset) print(f” [{label}] avg_score={avg:.3f} exact_correct+formatted={exact}/{len(dataset)}”) return avg, acc We define how the candidate prompt is converted into a system prompt and how the task model receives each question. We also create the evaluator that parses the model output, checks whether the final answer follows the required #### <integer> format, and assigns a score. We return structured feedback as actionable side information so that GEPA can determine whether the issue is incorrect reasoning, poor formatting, or both. Configuring GEPA and Running the Prompt Optimization Copy CodeCopiedUse a different Browser seed_candidate = { “instructions”: “Solve the math problem.”, “format_rules”: “Give the answer.”, } print(“=== BASELINE (seed prompt) ===”) print(“Train:”); base_train = eval_set(seed_candidate, trainset, “train”) print(“Val: “); base_val = eval_set(seed_candidate, valset, “val”) print() objective = ( “Evolve a system prompt (the ‘instructions’ and ‘format_rules’ fields) so a ” “small LLM reliably solves multi-step

Building Reflective Prompt Optimization with GEPA: Multi-Component Prompts, Structured Feedback, and Held-Out Validation Read Post »

AI, Committee, 新闻, Uncategorized

Google DeepMind Releases Gemma 4 QAT Checkpoints: Q4_0 and a New Mobile Format Cut On-Device Memory

admin NU / 6 月 6, 2026

Google DeepMind released Quantization-Aware Training (QAT) checkpoints for the Gemma 4 family. The release targets local deployment on edge devices and consumer GPUs. It follows the Gemma 4 launch in April and a 12B model two days earlier. We compared the available Gemma 4 edge-model formats using only published numbers. The goal was simple. Show what each precision level costs in memory. Then show what QAT actually changes. What QAT actually does Quantization shrinks a model by lowering weight precision. Standard Post-Training Quantization (PTQ) compresses a finished model. That often degrades quality. QAT instead simulates quantization during training. The model learns to compensate for the precision loss. Google’s AI team states its QAT results yield higher overall quality than standard PTQ baselines. Google did not publish Gemma 4 QAT benchmark scores in the announcement. For context, Gemma 3 QAT cut the Q4_0 perplexity drop by 54% using llama.cpp evaluation. We cite that only as prior-generation precedent. The comparison task Compare Gemma 4 E2B and E4B across three formats. The formats are BF16, Q4_0 QAT, and the new mobile QAT schema. Rank them on memory footprint, quality preservation, and on-device accessibility. Use published figures only. Memory results Format E2B E4B Basis BF16 (16-bit) 9.6 GB 15 GB Official Gemma 4 docs Q4_0 (4-bit, QAT) 3.2 GB 5 GB Official Gemma 4 docs Mobile (QAT, E2B) ~1 GB — QAT announcement The Q4_0 figures match the footprint of PTQ Q4_0. QAT does not change the size at a given format. It improves quality at that size. The new mobile schema delivers the additional reduction. Using that mobile schema, Google reduced Gemma 4 E2B to about 1GB. Developers can go lower still. The text-only model without Per-Layer Embeddings needs under 1GB, dropping the audio and vision encoders. Per-format breakdown BF16 is the quality baseline. E2B needs 9.6 GB and E4B needs 15 GB. It is the reference point, not a phone deployment target. Q4_0 QAT is the general-purpose local format. E2B drops to 3.2 GB and E4B to 5 GB. QAT preserves more quality here than PTQ at the same size. This format fits consumer GPUs. Earlier E2B testing also ran on a Raspberry Pi 5 at INT4. The mobile format is the edge-specialized schema. It brings E2B to about 1 GB. It uses static activations, channel-wise quantization, and targeted 2-bit compression. How the mobile schema works Google AI team engineered four techniques for mobile hardware. Static activations pre-calculate scaling during training, reducing on-device work. Channel-wise quantization fits the design of mobile accelerators. Targeted 2-bit quantization compresses only the token-generation layers. Embedding and KV cache optimization shrinks the active memory footprint. Core reasoning layers stay at higher precision. That protects capability while cutting storage. Developers can also deploy text-only and drop the audio and vision encoders. That trims memory further for use cases that need no multimodality. Dimension breakdown Scores are a qualitative ranking of the formats for on-device use. Memory is the only hard-measured axis. Quality reflects Google’s disclosed design, not measured Gemma 4 numbers. Each score has a one-line basis. Dimension BF16 Q4_0 QAT Mobile QAT Memory footprint 1 — heaviest, 9.6 GB E2B 4 — 3.2 GB E2B 5 — ~1 GB E2B text-only Quality preservation 5 — full-precision baseline 4 — QAT-preserved, near baseline 3 — 2-bit token layers, core kept higher Decode speed 2 — no quantization speedup 4 — 4-bit accelerates decode 5 — mobile-optimized static activations Deployment breadth 4 — loadable but heavy 5 — llama.cpp, Ollama, LM Studio, vLLM, MLX 3 — LiteRT-LM, Transformers.js, edge-focused On-device accessibility 1 — needs large GPU 4 — consumer GPU, Raspberry Pi 5 5 — runs on phones Total (/25) 13 21 21 Winner The result is a tie by design. Q4_0 QAT and mobile QAT both score 21, but for different hardware. For phones, the mobile format leads. It reaches about 1GB on E2B and targets mobile accelerators directly. For laptops and consumer GPUs, Q4_0 QAT is the practical default. BF16 stays the quality reference, not a local choice. Methodology and limits Memory figures come from Google’s Gemma 4 documentation. The ~1GB E2B figure comes from the QAT announcement. Quality is Google’s stated claim. No independent Gemma 4 QAT quality numbers were published at release. We did not run the models locally for this comparison. Developers should test at their own quantization and workload before building. Key Takeaways Q4_0 QAT cuts Gemma 4 E2B to 3.2 GB and E4B to 5 GB, from 9.6 GB and 15 GB at BF16. A new mobile QAT schema brings E2B to about 1 GB; text-only without PLE goes under 1 GB. QAT changes quality at a given size, not the size itself; the mobile format drives the extra memory cut. Google claims higher quality than PTQ but published no Gemma 4 QAT benchmark numbers at release. Weights ship today on Hugging Face with llama.cpp, Ollama, LM Studio, vLLM, MLX, and LiteRT-LM support. Marktechpost’s Visual Explainer Marktechpost · Benchmark Gemma 4 QAT: Comparing Q4_0 and the New Mobile Format Google DeepMind released Quantization-Aware Training checkpoints for Gemma 4. We compared three edge-model formats on published numbers. Formats compared BF16 (16-bit) · Q4_0 QAT (4-bit) · Mobile QAT June 5, 2026 The Comparison Task What we ranked $ compare gemma-4 –models E2B,E4B –formats BF16,Q4_0-QAT,MOBILE-QAT –rank memory,quality,accessibility –source published-only –no-self-run Memory from official Gemma 4 docs. Quality from Google’s stated claim. No models run locally. Format 1 of 3 · Reference BF16 (16-bit) 13 / 25 The full-precision quality baseline. E2B needs 9.6 GB and E4B needs 15 GB. Top observation: a reference point, not a phone or laptop deployment target. Format 2 of 3 · Laptop / GPU Q4_0 QAT (4-bit) 21 / 25 The general-purpose local format. E2B drops to 3.2 GB and E4B to 5 GB. Top observation: QAT preserves more quality than PTQ at the same 4-bit size. Format 3 of 3 · Mobile Mobile QAT 21 / 25 The edge-specialized schema. Brings E2B to about 1 GB. Top

Google DeepMind Releases Gemma 4 QAT Checkpoints: Q4_0 and a New Mobile Format Cut On-Device Memory Read Post »

AI, Committee, 新闻, Uncategorized

A Hands-On Coding Tutorial on Qualcomm AI Hub Models for Classification, Object Detection, and Hardware-Aware Deployment

admin NU / 6 月 6, 2026

In this tutorial, we work through an end-to-end workflow for Qualcomm AI Hub Models. We start by setting up the required package, discovering the available model collection, and loading MobileNet-V2 for local PyTorch inference. We also handle an important input-shape issue by converting NHWC image tensors into the NCHW format expected by the model. From there, we run inference on both the model’s built-in sample input and a real image, inspect top predictions, execute the official Qualcomm AI Hub CLI demo, and extend the workflow with a YOLOv7 object detection example. Also, we include an optional cloud-device section where we compile, profile, and run the model on a real Qualcomm device when an API token is available. Copy CodeCopiedUse a different Browser import subprocess, sys, os, glob, textwrap, traceback import numpy as np, torch from PIL import Image import matplotlib.pyplot as plt def pip_install(*pkgs): subprocess.run([sys.executable, “-m”, “pip”, “install”, “-q”, *pkgs], check=True) pip_install(“qai_hub_models”) OUT_DIR = “/content/qaihm_out”; os.makedirs(OUT_DIR, exist_ok=True) torch.set_grad_enabled(False) def to_nchw(value): arr = value[0] if isinstance(value, (list, tuple)) else value t = torch.from_numpy(np.asarray(arr, dtype=np.float32)) if t.ndim == 3: t = t.unsqueeze(0) if t.ndim == 4 and t.shape[1] != 3 and t.shape[-1] == 3: t = t.permute(0, 3, 1, 2).contiguous() return t We begin by importing libraries and setting up a helper function to install packages directly inside Colab. We install qai_hub_models, create an output directory, and disable gradient tracking since we only need inference. We also define the to_nchw() function to convert any input image tensor to the channel-first format expected by the model. Copy CodeCopiedUse a different Browser import pkgutil, qai_hub_models.models as _m model_ids = sorted(n for _, n, p in pkgutil.iter_modules(_m.__path__) if p and not n.startswith(“_”)) print(f”>>> {len(model_ids)} models available. First 40:n”) print(textwrap.fill(“, “.join(model_ids[:40]), 100), “n”) from qai_hub_models.models.mobilenet_v2 import Model as MobileNetV2 model = MobileNetV2.from_pretrained().eval() spec = model.get_input_spec() input_name = list(spec.keys())[0] print(“>>> Input:”, input_name, spec[input_name].shape, spec[input_name].dtype) from torchvision.models import MobileNet_V2_Weights IMAGENET_CLASSES = MobileNet_V2_Weights.IMAGENET1K_V1.meta[“categories”] def top5(logits): if logits.ndim == 1: logits = logits.unsqueeze(0) probs = torch.softmax(logits, dim=1)[0] conf, idx = probs.topk(5) return [(IMAGENET_CLASSES[i], float(c)) for c, i in zip(conf, idx)] We discover the available Qualcomm AI Hub model packages and print the first set of model IDs to understand what is accessible. We then load the pretrained MobileNet-V2 model, read its input specification, and identify the correct input name. We also prepare the ImageNet class labels and define a top5() function to convert model logits into readable top-5 predictions. Copy CodeCopiedUse a different Browser sample = model.sample_inputs() x = to_nchw(sample[input_name]) print(“>>> fed tensor shape:”, tuple(x.shape)) print(“n>>> Top-5 for the built-in sample input:”) for label, conf in top5(model(x)): print(f” {conf:6.2%} {label}”) from torchvision import transforms preprocess = transforms.Compose([ transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(), ]) img = None try: import urllib.request p = os.path.join(OUT_DIR, “input.jpg”) urllib.request.urlretrieve( “https://raw.githubusercontent.com/pytorch/hub/master/images/dog.jpg”, p) img = Image.open(p).convert(“RGB”) except Exception as e: print(“>>> photo download skipped:”, e) if img is not None: preds = top5(model(preprocess(img).unsqueeze(0))) print(“n>>> Top-5 for the downloaded photo:”) for label, conf in preds: print(f” {conf:6.2%} {label}”) plt.figure(figsize=(5,5)); plt.imshow(img); plt.axis(“off”) plt.title(f”{preds[0][0]} ({preds[0][1]:.1%})”); plt.show() We first run inference using the model’s built-in sample input and use to_nchw() to fix the tensor shape before passing it to MobileNet-V2. We then download a real image, preprocess it using standard resizing, cropping, and tensor conversion steps, and run another prediction. We finally display the image with the top predicted label to visually connect the model output to the input photo. Copy CodeCopiedUse a different Browser def run_demo(module, extra=None, timeout=900): cmd = [sys.executable, “-m”, module, “–eval-mode”, “fp”, “–output-dir”, OUT_DIR] + (extra or []) print(f”n>>> {‘ ‘.join(cmd)}”) try: r = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout) print(“n”.join((r.stdout + r.stderr).strip().splitlines()[-25:])) except Exception as e: print(“>>> demo skipped:”, e) run_demo(“qai_hub_models.models.mobilenet_v2.demo”) try: pip_install(“qai_hub_models[yolov7]”) run_demo(“qai_hub_models.models.yolov7.demo”) imgs = sorted(glob.glob(OUT_DIR + “/*.png”) + glob.glob(OUT_DIR + “/*.jpg”), key=os.path.getmtime) if imgs: plt.figure(figsize=(9,9)); plt.imshow(Image.open(imgs[-1]).convert(“RGB”)) plt.axis(“off”); plt.title(“YOLOv7 detections”); plt.show() else: print(“>>> no output image found (results may have printed instead).”) except Exception: print(“>>> YOLOv7 section skipped:n”, traceback.format_exc()) We define a reusable run_demo() function that executes official Qualcomm AI Hub model demos from the command line. We use it to run the MobileNet-V2 demo and then install the YOLOv7 extras for object detection. We run the YOLOv7 demo, search for the generated output image, and visualize the detections if an image is created. Copy CodeCopiedUse a different Browser try: import qai_hub as hub devices = hub.get_devices() print(f”n>>> Authenticated. {len(devices)} cloud devices available.”) device = hub.Device(“Samsung Galaxy S24 (Family)”) sample = model.sample_inputs() nchw = to_nchw(sample[input_name]) traced = torch.jit.trace(model, [nchw]) cloud_inputs = {input_name: [nchw.numpy()]} cj = hub.submit_compile_job(model=traced, device=device, input_specs=model.get_input_spec(), options=”–target_runtime tflite”) target = cj.get_target_model(); print(“>>> compiled:”, cj.url) pj = hub.submit_profile_job(model=target, device=device); print(“>>> profiling:”, pj.url) ij = hub.submit_inference_job(model=target, device=device, inputs=cloud_inputs) out = ij.download_output_data() dev_logits = torch.from_numpy(np.asarray(list(out.values())[0][0])) print(“>>> Top-5 from the REAL device:”) for label, conf in top5(dev_logits): print(f” {conf:6.2%} {label}”) target.download(os.path.join(OUT_DIR, “mobilenet_v2.tflite”)) print(“>>> saved compiled .tflite to”, OUT_DIR) except Exception as e: print(“n>>> Cloud (on-device) section skipped — no API token configured.”) print(” Get one at workbench.aihub.qualcomm.com, then:”) print(” !qai-hub configure –api_token YOUR_TOKEN”) print(” detail:”, (str(e).splitlines() or [type(e).__name__])[0]) print(“n>>> Tutorial complete. Outputs in:”, OUT_DIR) We include an optional Qualcomm AI Hub cloud workflow that runs only when an API token is configured. We retrieve available cloud devices, trace the PyTorch model, compile it for TFLite, profile it on a Qualcomm device, and submit an inference job. We then download the device output, print the top predictions, save the compiled TFLite model, and finish by showing where all tutorial outputs are stored. In conclusion, we have a complete practical workflow for using Qualcomm AI Hub Models inside Colab. We learned how to load pretrained models, prepare inputs correctly, run local inference, visualize classification and detection results, and use the official demos as reproducible reference points. We also saw how the same model can move beyond local PyTorch execution into Qualcomm’s cloud-device pipeline for compilation, profiling, and real-device inference. It provides a path from simple experimentation to hardware-aware deployment with Qualcomm AI Hub. Check out the Full Codes with Notebook here. Also, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join

A Hands-On Coding Tutorial on Qualcomm AI Hub Models for Classification, Object Detection, and Hardware-Aware Deployment Read Post »

AI, Committee, 新闻, Uncategorized

NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Aware Streaming Model Transcribing 40 Language-Locales in Real Time

admin NU / 6 月 6, 2026

NVIDIA’s Nemotron Speech team has released Nemotron 3.5 ASR. It is a 600M-parameter streaming Automatic Speech Recognition (ASR) model. A single checkpoint transcribes 40 language-locales in real time. Punctuation and capitalization are built in natively. The model ships as open weights on Hugging Face. The license is OpenMDW-1.1. The architecture is a Cache-Aware FastConformer-RNNT. What is Nemotron 3.5 ASR Nemotron 3.5 ASR extends nvidia/nemotron-speech-streaming-en-0.6b to many languages. It adds prompt-based language-ID conditioning to the base model. That lets one 600M-parameter checkpoint cover 40 language-locales. No per-language model or model-swapping is required. The model targets two workloads. The first is low-latency streaming for live audio. The second is high-throughput batch transcription. Output is production-ready text with proper casing and punctuation. No separate punctuation-restoration step is needed. Image source: https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b How Cache-Aware FastConformer-RNNT Works The model has two main pieces. The first is a Cache-Aware FastConformer encoder with 24 layers. FastConformer is an efficient evolution of the Conformer architecture. It uses linearly scalable attention. The second piece is an RNNT (Recurrent Neural Network Transducer) decoder. RNNT emits text frame by frame as audio streams in. The “cache-aware” design is the efficiency lever. Buffered streaming re-processes overlapping audio windows at every step. That repeats the same work and adds delay. This model caches encoder self-attention and convolution activations instead. It reuses those cached states as new audio arrives. So each audio frame is processed exactly once, with no overlap. Compute and end-to-end latency both drop, without an accuracy penalty. The Latency Knob: att_context_size One inference setting controls the latency-accuracy tradeoff. It is the attention context size, att_context_size. Smaller context emits text sooner but sees less future audio. Larger context raises accuracy at higher latency. The same checkpoint covers the full range. Settings map to chunk sizes of 80ms, 160ms, 320ms, 560ms, and 1.12s. For example, [56,0] gives an 80ms ultra-low-latency mode. The [56,13] setting gives 1.12s for highest accuracy. Teams pick the operating point at inference time, with no retraining. Language Detection and Coverage The 40 language-locales include English, Spanish, German, and French variants. They also cover Arabic, Japanese, Korean, Mandarin, Hindi, and Thai. Several other European and Nordic languages are included too. Language conditioning works two ways. Setting target_lang to a known locale usually gives the best accuracy. Setting target_lang=auto lets the model detect the language itself. In auto mode, it emits a language tag after terminal punctuation. One deployment can then transcribe mixed-language traffic. No separate language-ID component is required. Comparison Product Company Access Native streaming Language coverage Reported latency Pricing model Nemotron 3.5 ASR NVIDIA Open weights (OpenMDW-1.1), self-host; hosted on DeepInfra Yes — cache-aware FastConformer-RNNT 40 language-locales 80ms–1.12s, configurable at inference Free to self-host; usage-based via host Whisper large-v3 OpenAI Open weights (MIT), self-host; API No — offline/batch ~99 languages Not streaming-native Self-host free; API ~$0.006/min (batch) Nova-3 Deepgram Closed API; on-premise/self-host (enterprise) Yes — streaming + batch Multilingual; +10 monolingual languages added Jan 2026 Low-latency streaming (reported sub-300ms) ~$0.0077/min (Nova-3 Monolingual, PAYG) Universal-3 Pro Streaming AssemblyAI Closed API (EU endpoint available) Yes 6 languages: English, Spanish, French, German, Italian, Portuguese Sub-300ms (official); first partial ~750ms Usage-based (PAYG) Scribe v2 Realtime ElevenLabs Closed API Yes 90+ languages (99 per ElevenLabs) ~150ms (p50) ~$0.28/hour Ursa / streaming Speechmatics API + on-premise + edge Yes — streaming + batch 50+ languages with automatic identification Ultra-low latency (positioned) Enterprise/usage Fine-Tuning Results Because the weights are open, teams can fine-tune for a language, domain, or accent. NVIDIA published a worked example on Greek and Bulgarian. It fine-tuned the base checkpoint with the same Cache-Aware FastConformer-RNNT recipe. Each clip carried a target_lang tag for language conditioning. Training data came from public corpora, including Granary, Common Voice, and FLEURS. Results were measured as WER on held-out FLEURS, at the 80ms setting. Greek WER fell from 35 to 24, a 32% relative improvement. Bulgarian fell from 22 to 15, a 31% relative improvement. These are raw WER percentages at the lowest-latency streaming mode. NVIDIA notes that evaluating at deployment latency, on held-out data, gives honest numbers. Strengths and Considerations Strengths: One 600M-parameter checkpoint covers 40 language-locales, cutting deployment sprawl. Cache-aware streaming processes each frame once, reported at 17x buffered concurrency on an H100. att_context_size tunes latency from 80ms to 1.12s at inference, with no retraining. Punctuation, capitalization, and auto language tagging are built in. Open weights enabled a 31–32% relative WER drop on Greek and Bulgarian after fine-tuning. Considerations: The model handles English, but NVIDIA recommends its dedicated English model for English-only use. The 80ms mode trades some accuracy for the lowest latency. Japanese and Korean use CER, so cross-language error comparisons need care. Throughput figures are measured on H100, so results on other GPUs will differ. The production NIM with gRPC streaming is announced, but not yet released. Key Takeaways NVIDIA’s Nemotron 3.5 ASR is an open-weights (OpenMDW-1.1), 600M-parameter streaming model transcribing 40 language-locales from one checkpoint. Its Cache-Aware FastConformer-RNNT design processes each audio frame once, reported at 17x the concurrent streams of buffered approaches on an H100. Latency is configurable from 80ms to 1.12s at inference via att_context_size, with no retraining. A short fine-tune cut FLEURS WER 32% on Greek (35→24) and 31% on Bulgarian (22→15), at the 80ms setting. It is self-hostable and streaming-native, unlike closed APIs (Deepgram, AssemblyAI, ElevenLabs) or offline Whisper. Marktechpost’s Visual Explainer NEMOTRON 3.5 ASR 1 / 10 NVIDIA · STREAMING SPEECH AI · OPEN WEIGHTS Nemotron 3.5 ASR A 600M-parameter cache-aware streaming model that transcribes 40 language-locales in real time, from a single checkpoint. 600M parameters 40 language-locales 80ms–1.12s latency OpenMDW-1.1 01 — WHAT IT IS One model, 40 language-locales Extends nvidia/nemotron-speech-streaming-en-0.6b with prompt-based language-ID conditioning. A single 600M-parameter checkpoint covers 40 language-locales. No model-swapping. Punctuation and capitalization are built in. No separate post-processing step. Targets two workloads: low-latency streaming and high-throughput batch. NVIDIA still recommends its English-only model for English-only use. 02 — ARCHITECTURE Cache-Aware FastConformer-RNNT A 24-layer FastConformer encoder paired with an RNNT decoder. Buffered streaming re-processes overlapping audio windows at every step. This model caches encoder

NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Aware Streaming Model Transcribing 40 Language-Locales in Real Time Read Post »

AI, Committee, 新闻, Uncategorized

Moonshot AI Releases Kimi Code CLI: A Terminal AI Coding Agent Built in TypeScript for Next-Gen Agents

admin NU / 6 月 6, 2026

Moonshot AI has released Kimi Code CLI, an open-source coding agent that runs in the terminal. The tool reads and edits code, runs shell commands, searches files, and fetches web pages. It then chooses its next step based on the feedback it receives. The project is MIT-licensed and lives on GitHub.. Kimi Code CLI is the successor to the older kimi-cli. The new agent is written in TypeScript and distributed via npm. It works out of the box with Moonshot AI’s Kimi models. It can also be configured to use other compatible providers. What is Kimi Code CLI Kimi Code CLI is an AI agent for software development and terminal operations. It can implement new features, fix bugs, and complete refactors. It can also explore an unfamiliar codebase and answer architecture questions. Batch file processing, builds, and chained test runs are supported too. The execution model is feedback-driven. The agent plans steps, modifies code, runs tests, and reports its actions. Read-only operations run automatically by default. For file edits or shell commands, the agent asks for confirmation first. This approval flow keeps risky actions under developer control. The CLI itself is free and MIT-licensed. Model access requires Kimi Code OAuth or a Moonshot AI Open Platform API key. https://github.com/MoonshotAI/kimi-code Key Features Moonshot lists several features aimed at long, focused agent sessions: Single-binary distribution. One command installs it, with no Node.js setup required. Fast startup. Moonshot says the TUI is ready in milliseconds. Purpose-built TUI. The interface is tuned for extended agent sessions. Video input. Drop a screen recording or demo clip into the chat. AI-native MCP configuration. Add and authenticate Model Context Protocol servers via /mcp-config. Subagents for parallel work. Dispatch built-in coder, explore, and plan subagents in isolated contexts. Lifecycle hooks. Run local commands to gate tool calls, audit decisions, or trigger notifications. Installation and First Run Two installation paths exist. The official script needs no pre-installed Node.js. On macOS or Linux, run the install script: Copy CodeCopiedUse a different Browser curl -fsSL https://code.kimi.com/kimi-code/install.sh | bash On Windows, use PowerShell: Copy CodeCopiedUse a different Browser irm https://code.kimi.com/kimi-code/install.ps1 | iex The global npm install requires Node.js 24.15.0 or later: Copy CodeCopiedUse a different Browser npm install -g @moonshot-ai/kimi-code Verify the binary, then open a project and start the interactive UI: Copy CodeCopiedUse a different Browser kimi –version cd your-project kimi On first launch, type /login inside the UI. You can choose Kimi Code OAuth or a Moonshot AI Open Platform API key. To run one instruction without the UI, use kimi -p “your task”. To resume the previous session, add -C. Use Cases Understanding a project: Ask for an architecture overview and a module dependency diagram. Implementing a feature: Describe the signature, options, and acceptance criteria up front. Fixing a bug: Give the symptom, reproduction steps, and expected behavior together. Writing tests and refactoring: Extract repeated patterns, then run tests to confirm behavior. One-off automation: Analyze logs and output call counts with p50 and p99 latencies. Scheduled tasks: Ask the agent to set reminders or recurring checks via cron. Plan mode is available through Shift-Tab or kimi –plan. It outputs a research plan before touching files. For safe batch work, –yolo or /yolo skips approval prompts. The /fork command creates an experimental branch you can abandon. The /compact command compresses context to free up tokens. For large investigations, the main agent can dispatch subagents in parallel. How Kimi Code CLI Compares Kimi Code CLI joins several established terminal coding agents. The table below compares it with three of them. Competitor details reflect mid-2026 and can change quickly. Attribute Kimi Code CLI Claude Code OpenAI Codex CLI Gemini CLI Developer Moonshot AI Anthropic OpenAI Google Backing model Kimi models Claude models GPT-5.3-Codex Gemini 2.5 Pro Language / runtime TypeScript Node.js Rust TypeScript Install Script or npm (Node.js ≥ 24.15.0) Native installer or npm npm / native npm single binary MCP support Yes (/mcp-config) Yes Yes Yes Subagents Yes (coder, explore, plan) Yes Yes No (sequential) Plan mode Yes (Shift-Tab) Yes Yes Yes IDE integration ACP (Zed, JetBrains) VS Code, JetBrains VS Code, IDEs VS Code (Code Assist) License MIT Proprietary Open source Apache 2.0 All four agents support the Model Context Protocol. They differ on backing model, language, license, and orchestration. Kimi Code CLI and Codex CLI both ship native subagents. Gemini CLI runs tasks sequentially without subagent support. Key Takeaways Kimi Code CLI is an MIT-licensed terminal coding agent from Moonshot AI. It is written in TypeScript and installs via script or npm. Built-in coder, explore, and plan subagents run in isolated contexts. MCP servers are configured conversationally through /mcp-config, not raw JSON. It succeeds kimi-cli and migrates existing configuration and sessions. Marktechpost’s Visual Explainer Kimi Code CLI · Guide 01 / 09 Overview Kimi Code CLI Moonshot AI’s open-source terminal coding agent that reads code, runs commands, and plans its next step. Runs in your terminal as an AI coding agent MIT-licensed · written in TypeScript · distributed via npm Works with Kimi models or other compatible providers Slide 02 What Is Kimi Code CLI? Reads and edits code, runs shell commands, searches files Fetches web pages and chooses the next step from feedback Read-only actions run automatically by default File edits and shell commands ask for confirmation first Slide 03 Key Features Single-binary distribution — no Node.js setup required Built-in coder, explore, and plan subagents AI-native MCP configuration via /mcp-config Lifecycle hooks and video input support Slide 04 Install macOS / Linux curl -fsSL https://code.kimi.com/kimi-code/install.sh | bash Windows (PowerShell) irm https://code.kimi.com/kimi-code/install.ps1 | iex npm (Node.js 24.15.0+) npm install -g @moonshot-ai/kimi-code Slide 05 First Run kimi –version cd your-project kimi Type /login → Kimi Code OAuth or Moonshot API key kimi -p “your task” runs one instruction without the UI kimi -C resumes the previous session Slide 06 Use Cases Understand a project: architecture overview and dependency map Implement features with clear signatures and acceptance criteria Fix bugs from symptom, reproduction steps, and expected behavior Write tests, refactor,

Moonshot AI Releases Kimi Code CLI: A Terminal AI Coding Agent Built in TypeScript for Next-Gen Agents Read Post »

AI, Committee, 新闻, Uncategorized

The Meta hack shows there’s more to AI security than Mythos

admin NU / 6 月 5, 2026

On June 5, 404 Media reported that attackers had been using Meta’s AI customer support agent to steal Instagram accounts. Their approach was simple: They asked the agent to link the accounts to email addresses that they controlled, and the agent complied. One attacker broke into the dormant Obama White House account and made pro-Iran posts; others took over accounts with valuable, single-word handles, possibly in order to sell them. AI cybersecurity concerns are nothing new. Since Anthropic announced in April that its Mythos model was too good at hacking to be released to the general public, commentators, researchers, and federal officials alike have fixated on the idea that superpowered AI systems could lay waste to our computer infrastructure. That’s not quite what this Instagram hack was: There, AI was the target rather than the attacker, and the method was far simpler than anything Mythos would cook up. But as companies offload more work to AI, these comparatively unsophisticated attacks could wreak their own havoc. “As AI becomes more and more widely used—especially when AI is more and more widely used to automate our work flows, like account recovery—I think attackers are going to be more and more motivated to attack AI itself,” says Neil Gong, a professor of electrical and computer engineering at Duke University. Gong and other scholars have been issuing warnings about the security vulnerabilities of AI agents for a while. They publish papers and blog posts detailing exploits such as indirect prompt injection, which involves hijacking agents using commands hidden in websites, emails, or other seemingly anodyne data sources. Compared with these techniques, the Meta hack was practically mindless. The only complication that hackers had to overcome was using a VPN that matched the true account owner’s location; then they directly asked the support agent to change the account’s email address, and it complied. Meta has not commented publicly on how this vulnerability slipped through the cracks. But given the simplicity of the exploit, Gong says, it should have been uncovered easily, before the agent was deployed. “It’s really surprising,” he says. “I don’t understand why they didn’t find this simple problem.” Jessica Ji, a senior research analyst at Georgetown’s Center for Security and Emerging Technology, agrees. “It raises questions like: Were there even guardrails in place?” she says. “Did anyone think to test for this kind of scenario?” She notes that the oversight is particularly striking coming from a company like Meta, which has extensive expertise in both AI and cybersecurity. Meta did not respond to a request for comment for this article, but on Monday a Meta spokesperson said on X that the vulnerability had been resolved. As embarrassing a moment as this might be for Meta in particular, it also highlights some core vulnerabilities shared by all AI agents. Unlike traditional software, agents can respond in flexible—and unexpected—ways to new circumstances, which is why they might be able to substitute for human customer support agents. But AI agents can also be tricked in ways that humans wouldn’t be, and because they can take real-world actions, those mistakes have consequences. “A human would say, ‘Okay, why do you want to change the email address?’ and maybe respond with a security question,” says Somesh Jha, a professor of computer science at the University of Wisconsin–Madison. “What is going on with these agents is they’re very eager to finish the task. It’s almost like some elementary school student who just wants to please the teacher.” There are ways to mitigate the risks. Companies can use traditional software to build guardrails that make sure agents follow strict rules, such as always asking for answers to security questions before sending sensitive account information to a new email address. And the experts consulted for this article all agree that agents should undergo rigorous red-teaming, a process in which developers try their best to attack a system in order to discover its vulnerabilities before it is deployed. But there are also countervailing forces. Companies want to deploy capable agents, and the more power an agent has—and the fewer guardrails it is subject to—the more work it can potentially take on. “Security and utility always have a trade-off,” says Bo Li, a professor of computer science at the University of Illinois Urbana-Champaign. And adequate red-teaming can be expensive. Defenders have to expend more resources than attackers do, because attackers only need to discover a single exploit, while defenders try to discover and patch as many as they can. When attackers are working toward something as valuable as a single-word Instagram handle, they’ll pour resources into finding exploits, so defenders have to spend even more money to protect that prize. As AI models continue to improve, hardening their defenses might actually get easier. Though the probabilistic nature of large language models means that LLM agents will always be vulnerable to some forms of attack, a more sophisticated model might have identified an attempt to change the email associated with the Obama White House account as suspicious. And AI systems can be used for agent red-teaming, much as participants in Anthropic’s Project Glasswing use Mythos to identify vulnerabilities in their software. Still, experts expect that the problem of securing AI agents will only become more pressing in the future. As agents grow more capable, companies that adopt them may want to give them more power, both to provide more services with fewer humans and to avoid being left behind by their competitors. In the fast-moving world of AI, the time needed to carefully secure risky agentic systems might seem like an unconscionable delay. “Everybody wants to be the first to do something and just push things out without careful scrutiny and red-teaming,” Jha says. “I think it’s a very dangerous thing.”

The Meta hack shows there’s more to AI security than Mythos Read Post »

AI, Committee, 新闻, Uncategorized

Perplexity AI Introduces Hybrid Local-Server Inference Orchestrator for Personal Computer: Automatic On-Device and Cloud Task Routing

admin NU / 6 月 5, 2026

Perplexity AI announced what it calls the first hybrid local-server inference orchestrator at Computex 2026. The system is designed to automatically route AI tasks between a user’s local device and cloud-based frontier models without requiring the user to decide in advance. The feature is expected come to Perplexity Computer in July 2026. What is Hybrid Agentic Inference? To understand what Perplexity built, it helps to understand the three-way tension that AI systems face. Accuracy demands the most capable models, which are expensive to run. Privacy demands that some data never leave the device. Cost and energy efficiency demand that you don’t spend a frontier model’s compute on tasks a smaller model can handle. That routing layer is what Perplexity calls hybrid agentic inference. A compact AI model runs locally on the user’s device. This local model evaluates each incoming task or subtask. It determines whether the task involves sensitive data, whether it requires heavy computation, or whether it can be handled entirely on-device. Based on that evaluation, work is either kept local or sent to a frontier model in the cloud. Perplexity describes this local model as deciding “when sensitive data should also be kept locally.” The system is designed to ask for user permission before sending sensitive tasks to the cloud. That design addresses a specific concern enterprises have about agentic AI: data governance — knowing where data goes and who controls that decision. Examples of data the system is intended to keep local include financial records, health information, and personal files. Work that requires a frontier model’s full capability runs on the server. Most real tasks are a mix, so the system splits them and coordinates the parts. How It Fits into Perplexity Computer Perplexity Computer is the company’s cloud-based multi-model agentic product, launched in February 2026. It originally ran entirely in the cloud on the Perplexity Max subscription tier ($200/month). Personal Computer is a separate, related product that brought Computer’s capabilities onto the local device — with access to local files, native Mac apps, the web, and Perplexity’s secure servers. Personal Computer launched on Mac in April 2026. Windows support is planned; a waitlist is open. The new hybrid local-server inference orchestrator is the next step for Personal Computer. Previously, even within Personal Computer, the division was relatively fixed: local file access happened on-device, heavy computation ran on Perplexity’s servers. The orchestrator changes that. The system now reasons about where each piece of a task should execute — not just which model to use, but which physical location should process it. Perplexity Computer coordinates up to 20 AI models in a single workflow. The system is one that creates a team of agents and orchestrates across models, tools and files in one single system. The hybrid orchestrator extends that orchestration to compute location itself. Key Takeaways Perplexity AI announced the first hybrid local-server inference orchestrator at Computex 2026, routing AI tasks automatically between on-device and cloud models. A compact local model acts as the router — classifying each subtask by data sensitivity and compute requirements before dispatching it. Sensitive data (financial records, health files) stays on-device; compute-heavy tasks go to frontier cloud models — no manual configuration required. The orchestration framework is model-agnostic and chip-agnostic, confirmed to run on Intel Core Ultra Series 3 and NVIDIA RTX Spark hardware. The feature arrives in Perplexity Computer in July 2026, initially on Windows; Personal Computer is already available on Mac with a Windows waitlist open. Check out the Technical details. Also, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well. Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us The post Perplexity AI Introduces Hybrid Local-Server Inference Orchestrator for Personal Computer: Automatic On-Device and Cloud Task Routing appeared first on MarkTechPost.

Perplexity AI Introduces Hybrid Local-Server Inference Orchestrator for Personal Computer: Automatic On-Device and Cloud Task Routing Read Post »

AI

NVIDIA garak Tutorial: Build a Complete Defensive LLM Red-Teaming Workflow with Custom Probes and Detectors

Best 21 Low-Code and No-Code AI Tools in 2026

Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b

Building Reflective Prompt Optimization with GEPA: Multi-Component Prompts, Structured Feedback, and Held-Out Validation

Google DeepMind Releases Gemma 4 QAT Checkpoints: Q4_0 and a New Mobile Format Cut On-Device Memory

A Hands-On Coding Tutorial on Qualcomm AI Hub Models for Classification, Object Detection, and Hardware-Aware Deployment

NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Aware Streaming Model Transcribing 40 Language-Locales in Real Time

Moonshot AI Releases Kimi Code CLI: A Terminal AI Coding Agent Built in TypeScript for Next-Gen Agents

The Meta hack shows there’s more to AI security than Mythos

Perplexity AI Introduces Hybrid Local-Server Inference Orchestrator for Personal Computer: Automatic On-Device and Cloud Task Routing

我们的服务

首页

工作原理

新闻

定价

支持

幫助中心

报告问题

提供反馈

隱私權政策

用户账户

关注我们