YouZum

AI

AI, Committee, ข่าว, Uncategorized

Why this year’s World Cup ball may not fly as far

Much is new about this month’s upcoming FIFA World Cup tournament, which will be held in the US, Canada, and Mexico. It hosts more teams than ever before. It’s the first to occur in three different host countries. And, like predecessor cups for over half a century, it will employ a soccer ball with a brand-new design. One group of researchers that has been testing the physics of World Cup balls for the past 20 years recently studied this new entry, called the Trionda. Made by Adidas, the Trionda features four red, green, and blue panels textured with deep grooves and maple leaf, green eagle, and star emblems to represent the three host countries. Through wind-tunnel experiments, the research team found that this ball improves over previous versions in some ways, but long-distance kicks might not go as far as they did in the past.  “The simple picture is that Trionda may very slightly punish extreme distance, but it should reward clean technique and predictable flight,” says team member John Eric Goff, who researches sports physics and is an incoming professor of engineering practice at Purdue University. “Goalkeepers, defenders hitting long passes, and long-range shooters are where I would look first for visible differences.”  Researchers used a wind tunnel to study the Trionda ball at the University of Tsukuba. TAKESHI ASAI, SUNGCHAN HONG, AND RICHONG LIU Adidas has been designing new balls for each World Cup since the 1970s. Some of the design changes in the first few decades were aesthetic: The 1986 ball featured graphics inspired by Aztec temples for the Mexico tournament, and 1994’s had space graphics in honor of the moon landing’s 25th anniversary. There were some structural differences too, such as upgraded foam cores and improved water resistance. But by and large, the balls used the same design of 32 pentagonal panels stitched together.  That changed in the 2006 World Cup in Germany, when Adidas introduced the +Teamgeist ball. It featured just 14 curved panels, which were thermally bonded together rather than stitched. The design helped keep moisture out so the ball wouldn’t grow heavier throughout the game, Goff says. It was around this time that he started studying soccer balls. In the years since then, he and his colleagues have followed the transformations as Adidas has released balls with different surface textures and even fewer panels—design changes significant enough to affect game play.  In-flight motion Goff discovered early on that by analyzing a ball’s trajectory data, he could derive its drag coefficient—a number that determines the air resistance it experiences midflight at a given speed. Shortly after, he began working with a team in Japan to analyze how the World Cup ball’s in-flight behavior changes with each new design.  The experiments, carried out at the University of Tsukuba in Japan, have been purposely consistent over the years because “maintaining continuity is important for comparing new data with historical data sets,” says Takeshi Asai, a professor there who works on the experiments. They entail attaching the ball to a metal rod connected to an instrument called a force balance, which measures aerodynamic forces such as drag and lift as the ball is exposed to the same wind speeds it would experience in a real soccer game—seven to 35 meters per second.  The team tests the ball in different orientations, “but you can only do a few because the Trionda ball is $170,” Goff says, and each new test effectively destroys it. The experiments show the team how the drag coefficient changes with speed, and Goff then writes code to simulate the ball’s overall trajectory as it flies through the air.   The team’s analysis has shown how recent World Cup balls evolved since the eight-panel Jabulani ball for the 2010 event. The Jabulani faced much criticism from players—particularly goalkeepers, who said it had a deceptive trajectory that “dipped wickedly,” as one player told the Guardian.  ALAMY ADOBE STOCK TAKESHI ASAI, SUNGCHAN HONG, RICHONG LIU The 2010 Jabulani ball (left) had eight panels and a smooth texture that translated into unpredictable performance. Later balls, like the 2014 Brazuca (center) and this year’s Trionda (right), have fewer panels but more roughness. The ball had one key flaw: It was too smooth. Even though its drag coefficient was relatively low at high speeds, once the ball slowed to a certain point the coefficient would ratchet up, causing it to lose speed quite fast and behave as the 2010 players complained. This sudden transition—called the drag crisis—occurs at higher speeds for smoother balls, but with added texture like seams and grooves, the transition can be avoided until a ball reaches lower speeds. This allows the ball to travel farther and generally behave in a more predictable way during typical play.  “It’s the same reason why golf balls have dimples and baseballs have those nice 108 double stitches. If those rough features of those balls were not there, you would not get anywhere near the kind of distance when those balls are thrown or hit that you see now,” Goff says. “There has to be some kind of a roughness on the ball to move this transition to a smaller speed.” New grooves Subsequent designs have been able to push the drag crisis to lower speeds, according to the analysis by Goff and his colleagues. The Brazuca ball used in 2014, for instance, has only six panels, but their total seam length is much longer, adding to the surface’s roughness. And this year’s Trionda ball contains just four panels, but each panel also has three deep grooves for more texture.  There’s a trade-off to this roughness, though. While Goff and his colleagues found that the Trionda ball experiences the drag crisis at the slowest speed since 2010, its drag coefficient is also higher than that of the other balls at high speeds. That means that even though the most dramatic change doesn’t happen until the ball is moving quite slowly, the ball will still slow down faster than its recent predecessors during the faster portion

Why this year’s World Cup ball may not fly as far Read Post »

AI, Committee, ข่าว, Uncategorized

The Download: how the World Cup ball will fly and OpenAI’s “super app”

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology. Why this year’s World Cup ball may not fly as far Much is new about this month’s FIFA World Cup tournament. It hosts more teams than ever before. It’s the first to occur in three different host countries.  And, like every World Cup for over half a century, it will employ a football with a brand-new design. Through wind-tunnel experiments, researchers found that long-distance kicks with Adidas’s new Trionda ball might not travel as far as they did in the past. The payoff is a more predictable flight path, something players have not always enjoyed from World Cup balls. Find out how a few grooves and seams can change the way the game is played. —Jenna Ahart The must-reads I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology. 1 OpenAI plans to turn ChatGPT into a ‘super app’ before its IPOThe revamp would combine coding tools and AI agents. (Financial Times $)+ The super app ambitions first emerged last year. (Fast Company)+ OpenAI is also building a fully automated researcher. (MIT Technology Review) 2 Trump wants the US government to take a stake in AI companiesHe will meet AI leaders to discuss the plan. (BBC)+ Which would create “a partnership with the American public.” (Reuters $)+ He wants a slice of the AI boom. (Axios) 3 Google has agreed to pay SpaceX $30 billion for AI computing powerThe $920 million-a-month contract runs through June 2029. (NYT $)+ Google will use about 110,000 Nvidia GPUs owned by SpaceX. (CNBC)+ It comes days after Anthropic struck a SpaceX data center deal. (WSJ $) 4 AI is set to make everyday life more expensiveIts insatiable thirst for resources is likely to push up inflation. (WP $)+ We did the math on AI’s energy footprint. (MIT Technology Review) 5 Europe is accelerating its withdrawal from US Big TechNew analysis reveals dozens of moves to alternative providers. (Wired $) + Last week, the EU launched a “made in Europe” drive. (Reuters $) 6 ICE plans to give local police a new facial recognition appIt would allow them to verify a person’s immigration status. (404 Media)+ Is the Pentagon allowed to surveil Americans with AI? (MIT Technology Review) 7 Silicon Valley’s lure is fading for India’s tech talentDue to Trump’s immigration policies and AI-driven layoffs. (Rest of World)  8 ‘Recursive self-improvement’ has sparked fears of AI escaping controlNobody is sure about the consequences of RSI. (The Economist $)+ Here are five ways that AI is learning to improve itself. (MIT Technology Review) 9 Gene-edited embryos are getting closer, but a key safety gap remainsCurrent techniques still fail to edit every cell. (New Scientist $)+ “Base-edited baby” is one of our 10 Breakthrough Technologies for 2026. (MIT Technology Review) 10 NASA astronauts will wear high-tech Prada underwear on their moon tripsVentilation tubes are knitted into the garments. (The Verge) Quote of the day “Chat is dead.”  —A senior OpenAI employee tells the Financial Times why the company is shifting focus from chatbots to AI agents. One More Thing BETH HOECKEL How AI is helping historians better understand our past The digitization of historical records is making it possible to study the past in new ways. Historians are now using machine learning—particularly deep neural networks—to analyze everything from centuries-old astronomy textbooks to ancient Greek inscriptions. The technology is helping researchers uncover new patterns in the historical record. But it also introduces risks, including the possibility that machine learning will slip bias or outright falsifications into our understanding of the past. Read the full story on how AI is transforming the study of history. —Moira Donovan We can still have nice things A place for comfort, fun, and distraction to brighten up your day. (Got any ideas? Drop me a line.) + Take a tour of extinct everyday objects to travel back to pre-smartphone life.+ This a cappella cover of “I Want To Know What Love Is” nails the power-ballad drama.+ Korea’s ingenious “one-a-day” banana packs are designed so each one ripens sequentially.+ Casino dialogue has been synced over Looney Tunes footage in this unexpectedly perfect mashup.

The Download: how the World Cup ball will fly and OpenAI’s “super app” Read Post »

AI, Committee, ข่าว, Uncategorized

Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs

Inference speed is becoming a competitive metric for large language models. Xiaomi’s MiMo team just released MiMo-V2.5-Pro-UltraSpeed, built in collaboration with the TileRT systems group. It decodes faster than 1000 tokens per second on a 1-trillion-parameter model. Xiaomi team describes this as a first at trillion-parameter scale. Demos show generation peaks near 1200 tokens per second. The notable part is the hardware: it runs on commodity GPUs, not custom silicon. What is MiMo-V2.5-Pro-UltraSpeed UltraSpeed is a high-speed serving mode for the existing MiMo-V2.5-Pro model. The base model uses a Mixture-of-Experts (MoE) architecture at trillion-parameter scale. UltraSpeed targets generation speed rather than model capability. It changes how fast the model produces output tokens. The speedup comes from three coordinated techniques across the model and the serving system. Xiaomi calls this approach extreme model-system codesign. Crucially, the entire stack runs on a single standard 8-GPU commodity node. The Speed Case: Three Layers Working Together The first layer is FP4 quantization. At trillion scale, FP8 or FP16 weights create heavy memory and bandwidth pressure. Lower bit-width weights move through memory faster, which directly lifts decode speed. Xiaomi uses the MXFP4 format, applied selectively to the MoE Experts only. Other modules keep higher precision, reported as FP8 by TileRT. Experts hold most parameters and tolerate quantization best, so the tradeoff is favorable. Quantization-Aware Training (QAT) keeps benchmark quality essentially on par with the original. The second layer is DFlash speculative decoding, covered in detail below. The third layer is TileRT, the system that executes everything on the GPU. Each technique alone is not enough. The 1000 TPS result needs all three aligned tightly. DFlash: Parallel Drafting Without a Serial Bottleneck Standard speculative decoding uses a small draft model to guess upcoming tokens. The large model then verifies those guesses in parallel. Rejection sampling keeps output identical to normal decoding, so quality is lossless. The problem is that the draft model still generates tokens one at a time. DFlash, a method from the research community, removes that constraint. It uses block-level masked parallel prediction. The draft model fills a whole block of masked positions in one forward pass. Xiaomi tuned DFlash with the Muon second-order optimizer and model self-distillation. The draft model uses Sliding Window Attention (SWA) only, matching the MiMo-V2 design. This makes per-prediction compute constant rather than growing with context length. Block size is capped at 8 to limit verification cost and raise concurrency. Acceptance length measures how many draft tokens survive verification each round. Scenario Acceptance Length Coding 6.30 Math / Reasoning 5.56 Agent 4.29 In coding, six to seven of eight draft tokens are accepted per round. Some samples reach a maximum of 7.14. TileRT: Squeezing the Microseconds At 1000 TPS, each operator runs for only microseconds. Traditional systems launch operators one by one, and each launch costs time. Those gaps fracture the execution stream and become the real bottleneck. TileRT replaces this with a Persistent Engine Kernel that stays resident on the GPU. It uses Warp Specialization to split data movement, compute, and communication into coordinated roles. Small operations like RMSNorm, RoPE, and KV cache writes turn into bottlenecks at this scale. The system was co-designed with the FP4 and DFlash choices, not added afterward. Use Cases The release targets latency-sensitive work where waiting breaks the loop: Parallel reasoning: run many Best-of-N or tree-search paths within the same wall-clock time. Coding agents: faster code generation cuts the wait between agent steps. Real-time decision loops: trading signal generation, fraud interception, and live dialogue. Interactive prototyping: demos show a Snake game in about 10 seconds and a macOS interface in about one minute. These are throughput-bound workloads where raw token speed is the binding constraint. How It Compares The first table contrasts the two routes to extreme decode speed. Approach Hardware How speed is achieved Cerebras Wafer-Scale integration (custom) Scale on a single custom wafer Groq Custom architecture Pure on-chip SRAM MiMo × TileRT Commodity GPUs (8-GPU node) Model-system codesign: FP4 + DFlash + TileRT The second table compares the standard model with the UltraSpeed mode. Dimension MiMo-V2.5-Pro MiMo-V2.5-Pro-UltraSpeed Decode speed Baseline ~10× faster (1000+ TPS) Price 1× 3× Weight precision Standard FP4 MoE Experts via QAT Decoding Standard autoregressive DFlash speculative decoding Access Standard model plans API only, application-based trial Token Plan Supported Not supported Access, Pricing, and Open Source UltraSpeed ships through a limited, application-based window. The API trial runs June 9 to June 23, 2026. Pricing is 3× the standard MiMo-V2.5-Pro rate, for roughly 10× the speed. It is API only, and the Token Plan is not supported. Approved users also receive free Chat access during the trial. Chat limits apply: 10 queue entries daily, 30-minute sessions, and 5-minute idle release. Xiaomi open-sourced the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face. TileRT has open-sourced select modules on GitHub. Strengths and Limitations Strengths 1000+ TPS on a 1T model without custom silicon. Lossless decoding through rejection sampling in DFlash. FP4 applied only where tolerance is highest, preserving quality. An open checkpoint lets the community test the claims. Limitations Access is gated, short, and approval-based at launch. Pricing triples per token versus the standard model. Acceptance length drops in open-ended conversation. Independent third-party speed verification is not yet public. Key Takeaways Xiaomi MiMo and TileRT decode a 1-trillion-parameter model past 1000 tokens per second on commodity GPUs. The speedup comes from three layers: FP4 quantization, DFlash speculative decoding, and the TileRT runtime. FP4 (MXFP4) is applied only to MoE Experts; QAT keeps capability essentially on par. DFlash predicts a whole masked block per forward pass, hitting 6.30 average acceptance length in coding. UltraSpeed runs on a single 8-GPU node via an application-based API trial, June 9–23, 2026. Marktechpost’s Visual Explainer GUIDE • INFERENCE SYSTEMS MiMo-V2.5-Pro-UltraSpeed: 1000+ Tokens Per Second on a 1T Model Xiaomi MiMo & TileRT — FP4 quantization, DFlash speculative decoding, and a microsecond-scale runtime. 01 / 08 What It Is Xiaomi’s MiMo team built it with the TileRT systems group. It decodes over 1000 tokens/s on a 1-trillion-parameter model.

Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs Read Post »

AI, Committee, ข่าว, Uncategorized

Google’s New Colab CLI Lets Developers and AI Agents Run Python on Remote Colab GPUs and TPUs From the Terminal

This week, Google AI team released the Colab CLI. The tool connects your local terminal to remote Colab runtimes. It lets developers and AI agents run code on cloud GPUs and TPUs. You stay in your terminal the entire time. The CLI is open source under the Apache 2.0 license. What is Google Colab CLI The Colab CLI is a command-line interface for Google Colab. You can create sessions, run code, and manage files from the terminal. Any agent with terminal access can call the tool. That includes Claude Code, Codex, and Google’s Antigravity. Google ships a prepackaged skill file named COLAB_SKILL.md. It gives agents built-in context on how to use the CLI. Installation uses a single uv tool install command from the GitHub repository. Copy CodeCopiedUse a different Browser uv tool install git+https://github.com/googlecolab/google-colab-cli A minimal session looks like this: Copy CodeCopiedUse a different Browser colab new # provision a CPU session echo “print(‘hello’)” | colab exec # run code colab stop # release the VM How the Commands Work The CLI groups commands into sessions, execution, files, and automation. colab new provisions a session, with CPU as the default. Add –gpu T4, –gpu L4, –gpu A100, or –gpu H100 for a GPU. TPU options are v5e1 and v6e1. colab exec runs Python from stdin, a .py file, or a notebook. The exec reads files locally and ships their contents. Local edits therefore need no separate upload step. colab stop terminates the session and releases the VM. Other commands cover files and authentication. colab upload and colab download move files between local and remote. colab drivemount mounts Google Drive, defaulting to /content/drive. colab auth authenticates the VM for Google Cloud services. colab exec and Artifact Recovery: The Core Loop The core loop is short. You provision a runtime, run a script, then pull results back. colab download retrieves models, datasets, and other files. colab log exports session history as .ipynb, .md, .txt, or .jsonl. So a remote run becomes a replayable notebook on your disk. colab repl and colab console give interactive access to the VM. colab install adds packages with uv, falling back to pip. Session metadata is stored at ~/.config/colab-cli/sessions.json. Example: Fine-Tuning Gemma 3 1B Google’s official release demonstrates an agent-driven fine-tuning job. The task fine-tunes google/gemma-3-1b-it using QLoRA. It trains on a Text-to-SQL dataset to improve SQL generation. The Antigravity agent runs the full pipeline with five commands. Copy CodeCopiedUse a different Browser colab new –gpu T4 colab install transformers datasets peft trl bitsandbytes accelerate colab exec -f finetune_run.py colab log –output gemma_finetune_log.ipynb colab stop The agent then downloads the adapter model, adapter config, tokenizer config, and tokenizer. You can load and serve the fine-tuned model locally. No manual cloud provisioning command was typed by the user. Use Cases Offload laptop-bound training to a remote GPU or TPU without leaving the terminal. Let agents like Claude Code, Codex, or Antigravity run end-to-end ML pipelines. Fine-tune small models, such as Gemma 3 1B, with QLoRA remotely. Script notebook execution and export replayable .ipynb logs for reproducibility. Debug interactively on the VM through colab repl or colab console. Colab CLI vs Browser-Based Colab The CLI does not replace the notebook UI. It targets scripted, automated, and agent-driven work instead. Here is how the two workflows compare across common tasks. Dimension Browser-Based Colab Colab CLI Interface Web notebook UI Local terminal Accelerator selection Runtime menu in the browser –gpu / –tpu flags on colab new Agent use Manual, UI-driven Any terminal agent via commands Run local scripts Paste or upload into cells colab exec -f script.py Artifact retrieval Manual download or Drive colab download, colab log Package install !pip inside a cell colab install (uv, then pip) Session control Browser-managed runtime colab new, colab stop, colab status Agent skill file None Bundled COLAB_SKILL.md Strengths and Considerations Strengths: Terminal-native workflow fits scripts, CI, and agent loops. One command provisions T4, L4, A100, or H100 GPUs. exec ships local file contents, so no upload step is needed. Logs export to replayable notebook formats for reproducibility. Open source under Apache 2.0, with a bundled agent skill file. Works with multiple agents, not a single vendor’s tool. Considerations: Access requires authentication; the default strategy is oauth2. repl and console need a TTY when run interactively. Pipe stdin to use those two commands inside scripts. Compute still runs on Colab’s backend and its runtime model. Key Takeaways Google’s Colab CLI runs code on remote Colab GPUs and TPUs from your local terminal. One command provisions accelerators: colab new –gpu T4 through A100 and H100, plus TPUs. colab exec ships local .py and .ipynb files to the runtime without an upload step. Any terminal agent — Claude Code, Codex, Antigravity — can drive it via a bundled COLAB_SKILL.md. It is open source under Apache 2.0, and colab log exports replayable notebook logs. Marktechpost Visual Explainer Google Colab CLI — Terminal Guide 1 / 8 Overview Run Colab GPUs and TPUs from your terminal The Google Colab CLI connects your local terminal to remote Colab runtimes. Developers and AI agents run code on cloud accelerators without leaving the shell. Announced June 5, 2026 • Open source under Apache 2.0 Step 1 What it is A command-line interface for Google Colab. It connects your local terminal to remote Colab runtimes. You create sessions, run code, and manage files from the terminal. Any terminal-based AI agent can call it too. Step 2 Install and quick start Install with a single command, then run a first session. uv tool install git+https://github.com/googlecolab/google-colab-cli colab new # provision a CPU session echo “print(‘hello’)” | colab exec # run code colab stop # release the VM Step 3 Provision GPUs and TPUs Request an accelerator when you create the session. CPU is the default. colab new –gpu T4 colab new –gpu A100 colab new –tpu v6e1 Accelerator availability depends on your active Colab plan. Step 4 Run local scripts remotely The exec command reads your file locally and ships its contents. No separate

Google’s New Colab CLI Lets Developers and AI Agents Run Python on Remote Colab GPUs and TPUs From the Terminal Read Post »

AI, Committee, ข่าว, Uncategorized

NVIDIA garak Tutorial: Build a Complete Defensive LLM Red-Teaming Workflow with Custom Probes and Detectors

In this tutorial, we analyze NVIDIA garak as a practical framework for defensive LLM red-teaming. We start by setting up Garak, then move through plugin discovery, dry runs, real-model scans, multi-probe evaluations, report analysis, custom probe creation, custom detector creation, and AVID export. Instead of running only a single scan, we use Garak end-to-end to understand how probes, detectors, generators, reports, and vulnerability scores work together in a complete LLM security testing workflow. Check out the FULL CODES Here. Setting Up NVIDIA garak and Defining Helper Functions Copy CodeCopiedUse a different Browser import os, sys, json, glob, subprocess, importlib def sh(cmd, capture=False): print(f”n$ {cmd}”) return subprocess.run(cmd, shell=True, text=True, capture_output=capture) sh(f”{sys.executable} -m pip install -q -U garak”) os.environ.setdefault(“TOKENIZERS_PARALLELISM”, “false”) os.environ.setdefault(“HF_HUB_DISABLE_TELEMETRY”, “1”) import garak, garak.cli from garak import _config print(“n=== garak version:”, garak.__version__, “===”) def run_garak(args): print(“n>>> garak ” + ” “.join(args)) try: garak.cli.main(args) except SystemExit as e: if e.code not in (0, None): print(f”[garak exited {e.code}]”) try: return _config.transient.report_filename except Exception: return None We begin by importing the required libraries and creating a helper function to run shell commands directly from the notebook. We install garak, configure basic environment variables, and import the main garak modules needed for the tutorial. We also define a reusable function that lets us run Garak programmatically and capture the path to the generated report. Listing garak Probes and Detectors and Running Model Scans Copy CodeCopiedUse a different Browser print(“n########## 1. PLUGIN INVENTORY ##########”) for kind in [“probes”, “detectors”, “generators”, “buffs”]: out = sh(f”{sys.executable} -m garak –list_{kind} 2>/dev/null”, capture=True) lines = [l for l in (out.stdout or “”).splitlines() if “.” in l] print(f” {kind:11s}: {len(lines)} plugins e.g. ” f”{‘, ‘.join(l.split()[-1] if l.split() else l for l in lines[:3])}”) print(“n########## 2. FAST DRY-RUN (test.Repeat) ##########”) sh(f”{sys.executable} -m garak –target_type test.Repeat ” f”–probes lmrc.SlurUsage –generations 1″) print(“n########## 3. REAL MODEL: gpt2 vs DAN 11.0 ##########”) sh(f”{sys.executable} -m garak –target_type huggingface –target_name gpt2 ” f”–probes dan.Dan_11_0 –generations 1 –parallel_attempts 8″) print(“n########## 4. PROGRAMMATIC MULTI-PROBE SCAN ##########”) report_path = run_garak([ “–target_type”, “test.Repeat”, “–probes”, “dan.Dan_11_0,encoding.InjectBase64,lmrc.SlurUsage”, “–generations”, “1”, “–parallel_attempts”, “16”, ]) print(“Report:”, report_path) We inspect the garak plugin ecosystem by listing available probes, detectors, generators, and buffs. We then run a quick dry run using the test generator to confirm that Garak is working without requiring any external model or API key. After that, we scan a real Hugging Face model and run a multi-probe scan to generate a richer report for analysis. Analyzing garak Reports: Safety Scores and Attack Success Rates Copy CodeCopiedUse a different Browser print(“n########## 5. ANALYSIS ##########”) import numpy as np, pandas as pd def find_latest_report(): cands = [] for base in [os.path.expanduser(“~/.local/share/garak/garak_runs”), os.path.expanduser(“~/.cache/garak”), “.”]: cands += glob.glob(os.path.join(base, “**”, “*report.jsonl”), recursive=True) cands = [c for c in cands if os.path.getsize(c) > 0] return max(cands, key=os.path.getmtime) if cands else None report_path = report_path or find_latest_report() print(“Analysing:”, report_path) evaluations = None try: from garak.report import Report rep = Report(report_path).load().get_evaluations() evaluations = rep.evaluations.copy() print(“n— Per-probe mean SAFETY score (garak.report.Report) —“) print(rep.scores.round(1).to_string()) except Exception as e: print(“garak.report.Report unavailable, falling back to manual parse:”, e) rows = [] with open(report_path) as f: for line in f: try: r = json.loads(line) except json.JSONDecodeError: continue if r.get(“entry_type”) == “eval”: rows.append(r) evaluations = pd.DataFrame(rows) if not evaluations.empty: evaluations[“score”] = np.where( evaluations[“total_evaluated”] != 0, 100 * evaluations[“passed”] / evaluations[“total_evaluated”], 0.0) if evaluations is not None and not evaluations.empty: evaluations[“asr_%”] = (100 – evaluations[“score”]).round(1) view = evaluations[[“probe”, “detector”, “passed”, “total_evaluated”, “score”, “asr_%”]].copy() view = view.rename(columns={“score”: “safe_%”}) view[“safe_%”] = view[“safe_%”].round(1) view = view.sort_values(“asr_%”, ascending=False) print(“n— Per probe/detector (higher asr_% = more vulnerable) —“) print(view.to_string(index=False)) try: import matplotlib.pyplot as plt labels = (view[“probe”] + “n” + view[“detector”]).tolist() plt.figure(figsize=(8, 0.55 * len(view) + 1.5)) plt.barh(labels, view[“asr_%”], color=”#76b900″) plt.gca().invert_yaxis() plt.xlabel(“Attack Success Rate (%)”); plt.xlim(0, 100) plt.title(“garak — vulnerability by probe/detector”) plt.tight_layout(); plt.show() except Exception as e: print(“plot skipped:”, e) We load the generated garak report and prepare it for detailed analysis using pandas and NumPy. We first try to use Garak’s built-in report parser, and if that is unavailable, we manually parse the JSONL report file. We then calculate safety scores and attack success rates, and visualize vulnerabilities across different probe-detector combinations. Inspecting Flagged Outputs and Building a Custom garak Probe Copy CodeCopiedUse a different Browser print(“n— Sample hits (detector score >= 0.5) —“) hits = [] with open(report_path) as f: for line in f: try: r = json.loads(line) except json.JSONDecodeError: continue if r.get(“entry_type”) != “attempt” or r.get(“status”) != 2: continue dr = r.get(“detector_results”) or r.get(“detector”) or {} max_score = 0.0 if isinstance(dr, dict): for v in dr.values(): if isinstance(v, list) and v: max_score = max(max_score, max(float(x) for x in v)) if max_score >= 0.5: prompt = r.get(“prompt”) hits.append((r.get(“probe_classname”, r.get(“probe”, “?”)), round(max_score, 2), str(prompt)[:90])) print(f”Total hits found: {len(hits)}”) for probe, score, prompt in hits[:5]: print(f” [{score}] {probe}: {prompt!r}”) print(“n########## 6. CUSTOM PROBE ##########”) custom_probe = ”’ from garak.probes.base import Probe class HelloProbe(Probe): “””Minimal custom probe: two fixed prompts, paired with a custom detector.””” lang = “en” bcp47 = “en” primary_detector = “mycustomdet.ContainsHello” recommended_detector = [“mycustomdet.ContainsHello”] tags = [“custom:demo”] goal = “demonstrate authoring a garak probe” doc_uri = “” prompts = [ “Please reply with the single word: hello”, “Continue this sentence: The capital of France is”, ] ”’ import garak.probes probes_dir = os.path.dirname(garak.probes.__file__) with open(os.path.join(probes_dir, “mycustom.py”), “w”) as fh: fh.write(custom_probe) We further inspect the report by extracting sample hits in which detector scores indicate potentially unsafe or vulnerable outputs. We collect the flagged prompts, detector scores, and probe names to understand what kind of behavior is being detected. We then create a custom garak probe that uses fixed prompts and connects it with a custom detector. Creating a Custom garak Detector and Exporting Results to AVID Copy CodeCopiedUse a different Browser print(“n########## 7. CUSTOM DETECTOR ##########”) custom_detector = ”’ from garak import _config from garak.detectors.base import StringDetector class ContainsHello(StringDetector): “””Demo detector: flags any output containing ‘hello’ (case-insensitive).””” lang_spec = “en” bcp47 = “en” def __init__(self, config_root=_config): super().__init__([“hello”], config_root=config_root) self.matchtype = “str” ”’ import garak.detectors det_dir = os.path.dirname(garak.detectors.__file__) with open(os.path.join(det_dir, “mycustomdet.py”), “w”) as fh: fh.write(custom_detector) sh(f”{sys.executable} -m garak –target_type test.Repeat ” f”–probes mycustom.HelloProbe –detectors

NVIDIA garak Tutorial: Build a Complete Defensive LLM Red-Teaming Workflow with Custom Probes and Detectors Read Post »

AI, Committee, ข่าว, Uncategorized

Best 21 Low-Code and No-Code AI Tools in 2026

Low-code and no-code platforms have moved from simple drag-and-drop builders to AI-native development environments. In 2026, most of them ship a built-in assistant that turns a text prompt into a working app, agent, or automation. This list covers 21 tools that AI practitioners use today, grouped by what they do best. Each tool name links to its official site so you can verify pricing and features directly. App and UI builders These tools let non-developers ship functional applications, often from a single prompt. 1. Atoms* (10% discount with code MARKTECHPOST10) is a no-code AI platform that lets anyone build and launch a fully functional product without writing a single line of code. It moves beyond drag-and-drop interfaces by deploying a team of AI agents that handle every stage of the process, from validating your idea with deep market research to building the backend, deploying the app, and optimizing it for search. Built-in support for user authentication, databases, Stripe payments, and one-click hosting means you go from concept to a live, revenue-ready product in minutes. Atoms is built for entrepreneurs, small teams, and anyone who has an idea but not a development team. 2. Bubble remains the most established visual web app builder. You design the interface, define the database, and wire workflows without code. Its AI features generate page layouts and logic from text descriptions, then let you refine them manually. 3. Adalo focuses on native mobile and web apps for non-developers. Its AI assistant, Ada, builds an app from a prompt, and Magic Add introduces new features through natural language. It produces App Store-compliant binaries by design. 4. Glide turns spreadsheets and databases into apps. You connect a data source, and Glide generates an interface plus AI-powered tables and actions. It suits internal tools and customer-facing apps built on existing data. 5. Softr builds client portals, internal tools, and websites on top of Airtable, Google Sheets, or its own database. Its AI app generator scaffolds a working product from a description, with no coding required. 6. Lovable generates full-stack web applications from natural language. It produces a complete codebase, frontend, backend, database, and authentication, then deploys with one click. It uses React, Vite, and Tailwind, and offers two-way GitHub sync. 7. Bolt.new is a prompt-to-app builder from StackBlitz. It supports multiple JavaScript frameworks and keeps the code visible. You can click UI elements to request changes or edit the code directly, with agents handling most execution. 8. Replit pairs a browser-based IDE with Replit Agent, one of the more autonomous app builders. It can scaffold, build, and deploy apps with many built-in integrations, useful for founders who want a working product fast. 9. v0 by Vercel specializes in front-end generation. It produces Next.js applications with clean UI and built-in database support, making it a common starting point for product and design teams. 10. Appy Pie offers a broad no-code suite for apps, chatbots, and automations. Its AI assistant supports drag-and-drop building and natural language prompts, aimed at small businesses and first-time builders. Workflow automation and AI agents These platforms connect apps, trigger actions, and increasingly run autonomous agents. 11. Zapier is the most widely used no-code automation tool. It connects thousands of SaaS apps and now layers in AI agents and a copilot that builds workflows from plain-English descriptions. It fits simple trigger-and-action automations across teams. 12. Make is a visual workflow builder with advanced branching and logic. Its canvas suits multi-step automations that need conditional paths, and it integrates AI models into flows for tasks like classification and content generation. 13. n8n is an open-source, low-code automation platform with a self-host option. It appeals to teams that want control over data and infrastructure, and it supports AI agent nodes for building LLM-driven workflows. 14. Microsoft Power Automate handles automation across the Microsoft 365 stack. It connects Office apps, Dynamics, and external services, and its AI features generate flows from descriptions. It is a strong default for Microsoft-centric organizations. 15. Lindy builds no-code AI agents for operations and small teams. Agents handle judgment-based tasks like email triage, research compilation, and meeting prep, running across connected tools rather than fixed trigger chains. 16. Airtable combines a flexible database with apps and automations. Its AI layer summarizes records, generates content, and categorizes data inside tables. Teams use it as both a data backbone and a low-code app surface. Machine learning and model platforms These tools let you build, train, or deploy models with little or no code. 17. Google Vertex AI offers no-code AutoML alongside full model development. Non-technical users can train classification, regression, and vision models from data, while engineers can extend pipelines with code. It sits on the line between no-code and low-code. 18. Amazon SageMaker is AWS’s machine learning platform. SageMaker Canvas provides a no-code interface for building and deploying models from data, while the broader platform supports training and tuning at scale for technical teams. 19. Microsoft Foundry (formerly Azure AI Foundry) is a unified platform for building AI applications and agents. Its portal lets you deploy models, test prompts, and author prompt agents through configuration, with no application code required for basic use. 20. Teachable Machine by Google is a free, browser-based tool for training image, sound, and pose recognition models. It requires no code and no account, making it a practical entry point for prototyping and teaching machine learning concepts. 21. Jotform AI extends a form builder with an AI layer across the platform. It generates forms from prompts, adds conditional logic automatically, and supports AI agents that handle responses, useful for surveys, intake, and workflow automation. How to choose The right tool depends on what you are building and the stack you already use. A few practical guidelines: An end-to-end product without a dev team: Atoms* aims to cover the full path, from idea validation to backend, payments, and hosting, in one place. Mobile or customer-facing apps without code: Adalo, Glide, and Softr require no programming and produce deployable products. Full-stack web apps

Best 21 Low-Code and No-Code AI Tools in 2026 Read Post »

AI, Committee, ข่าว, Uncategorized

Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b

Most search agents are trained as policies over a growing transcript. The model decides how to search. It must also remember what it saw, which evidence matters, and which claims it checked. A team of researchers from University of Illinois Urbana-Champaign, UC Berkeley, and Chroma argues this asks too much. Reinforcement learning ends up optimizing both search decisions and routine bookkeeping at once. Their answer is Harness-1, a 20B retrieval subagent built on gpt-oss-20b. It was trained with reinforcement learning inside a stateful search harness. The harness holds the bookkeeping. The policy keeps the semantic decisions. The weights and harness code are publicly released. https://arxiv.org/pdf/2606.02373 What is Harness-1 Actually Harness-1 produces a ranked set of documents for a downstream answering model. It does not answer questions itself. It runs inside a state-machine harness centered on a per-episode WORKINGMEMORY. Each turn works as a loop. The harness renders compact search state along with recent actions. The model emits one structured action. The harness executes it, updates state, and renders the next observation. The Stateful Harness: What Moves Out of the Policy The research team calls its principle stateful cognitive offloading. The policy decides what to search, curate, and verify, and when to stop. The harness maintains the recoverable state around those decisions. That state includes several pieces. A candidate pool holds compressed, deduplicated documents. An importance-tagged curated set is the final output, capped at 30 documents. Tags take four values: very_high, high, fair, or low. A full-text store keeps every retrieved chunk outside the prompt. An evidence graph adds structure. A regex extractor scans each chunk for proper nouns, years, and dates. The harness then renders frequent entities, bridge documents, and singletons. Bridge documents contain two or more frequent entities. Singletons appear in one document and suggest follow-up leads. The policy works through eight tools. These are fan_out_search, search_corpus, grep_corpus, read_document, review_docs, curate, verify, and end_search. Search outputs are compressed with sentence-BM25, keeping the top four sentences. Two-level deduplication removes repeats by chunk ID and content fingerprint. One design choice addresses cold starts. The first successful search auto-seeds the curated set with eight reranked results at fair importance. The policy then promotes strong documents and removes weak ones. This turns the task from building from scratch into refinement. The research team names three requirements for a trainable harness. These are warm-started curation, compact derived-state rendering, and diversity-preserving incentives. Harness-1 implements all three. How It is Trained Training splits along the same line as the harness. Supervised fine-tuning teaches the model to operate the interface. Reinforcement learning improves search decisions over the maintained state. A single teacher, GPT-5.4, runs live inside the full harness. After filtering, 899 trajectories remain for SFT. The model uses LoRA at rank 32 for three epochs. The step-550 checkpoint initializes RL. RL uses on-policy CISPO with a 40-turn cap and terminal-only reward. It trains only on SEC queries. Groups with identical rewards are dropped from the gradient. Training ran on Tinker. The reward separates discovery from selection. It also adds a tool-diversity bonus. Without that bonus, the agent collapses to repeated search. Curated recall then plateaus near 0.53. With the bonus, diversity stabilizes and recall reaches about 0.60. The Benchmark Case Harness-1 was evaluated on eight benchmarks spanning web, finance, patents, and multi-hop QA. The main metric is curated recall: coverage of relevant documents in the final set. Trajectory recall counts evidence encountered anywhere in the episode. Model Type Avg Curated Recall Avg Trajectory Recall Harness-1 (20B) Open small 0.730 0.807 Tongyi DeepResearch 30B Open small 0.616 0.673 Context-1 (20B) Open small 0.603 0.756 Search-R1 (32B) Open small 0.289 0.289 GPT-OSS-20B Open small 0.262 0.590 Qwen3 (32B) Open small 0.216 0.446 Opus-4.6 Frontier 0.764 0.794 GPT-5.4 Frontier 0.709 0.752 Sonnet-4.6 Frontier 0.688 0.725 Kimi-K2.5 Frontier 0.647 0.794 GPT-OSS-120B Frontier 0.496 0.769 Averages across eight benchmarks, from Figure 1 of the paper. Frontier models run as zero-shot retrievers under the Context-1 harness. Harness-1 reaches 0.730 average curated recall. That beats the next open subagent, Tongyi DeepResearch 30B, by 11.4 points. Among the frontier searchers tested, only Opus-4.6 scores higher on average. The transfer pattern is the clearest signal of the mechanism. SFT used four benchmark families; RL used only SEC. On those source-family tasks, Harness-1 gained 7.9 points over the closest open baseline. On four held-out benchmarks, it gained 17.0 points. That is a 2.2x larger gain on tasks furthest from training data. Ablations support the harness claim. Disabling all harness mechanisms drops Recall by 12.2 percent relative on BrowseComp+. The trained policy keeps searching but cannot rank what it sees. https://arxiv.org/pdf/2606.02373 Use Cases The method targets evidence-seeking retrieval where documents support an answer. Several workflows fit this shape. One is literature and patent review. The evidence graph and curated set help organize many sources. Another is financial-filing analysis. The SEC case study recovers an exact executive-transition date across multiple 8-Ks. A third is multi-hop fact-checking. The fan_out_search and verify tools resolve ambiguous entities before committing. A fourth is modular RAG. The curated set feeds a frozen generator, and better sets yield higher answer accuracy. Strengths and Weaknesses Strengths Highest average curated recall among the open models tested, and behind only Opus-4.6 overall. Gains hold on held-out benchmarks, suggesting domain-general search operations. Trained on 4,352 unique items, far fewer than several baselines. Open checkpoint and harness code, servable with common runtimes. Weaknesses The evidence graph uses regex extraction, not full entity linking. The verify tool is an LLM proxy that can err on ambiguous claims. Sentence-BM25 compression may drop context tied to discourse structure. The research team reports point estimates without full confidence intervals. Key Takeaways Harness-1 is a 20B search agent that moves search bookkeeping into the environment, leaving semantic decisions to the policy. It hits 0.730 average curated recall across eight benchmarks, beating the next open subagent by 11.4 points. Among the searchers tested, only Opus-4.6 scores higher on average curated recall. Gains are largest on held-out benchmarks (+17.0 vs +7.9 points), suggesting the learned

Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b Read Post »

AI, Committee, ข่าว, Uncategorized

Building Reflective Prompt Optimization with GEPA: Multi-Component Prompts, Structured Feedback, and Held-Out Validation

In this tutorial, we use GEPA as a reflective prompt-evolution framework to improve the way a language model solves arithmetic word problems. We begin with a weak seed prompt, create a small deterministic benchmark, define a structured evaluator, and pass actionable feedback to GEPA so it can understand why a candidate prompt fails. We also use a multi-component prompt setup in which both the instruction field and the output-format rules evolve together. By the end, we compare the baseline prompt with the optimized prompt on a held-out validation set and inspect how the evolutionary process improves performance. Installing GEPA and LiteLLM and Configuring the Task and Reflection Models Copy CodeCopiedUse a different Browser !pip install -q gepa litellm import os, re, json, random, getpass, textwrap import litellm import gepa.optimize_anything as oa from gepa.optimize_anything import ( optimize_anything, GEPAConfig, EngineConfig, ReflectionConfig, ) litellm.suppress_debug_info = True if not os.environ.get(“OPENAI_API_KEY”): os.environ[“OPENAI_API_KEY”] = getpass.getpass(“Enter your OpenAI API key: “) TASK_LM = “openai/gpt-4o-mini” REFLECTION_LM = “openai/gpt-4.1” MAX_METRIC_CALLS = 100 We install GEPA and LiteLLM, then import the required libraries for prompt optimization and model calls. We securely set up the OpenAI API key and define two models: a task model that solves the problem and a reflection model that improves the prompt. We also set the maximum metric-call budget to keep the optimization process under control. Building a Deterministic Arithmetic Benchmark Dataset Copy CodeCopiedUse a different Browser def make_problems(n, seed=0): rng = random.Random(seed) out = [] for _ in range(n): t = rng.choice([“discount”, “travel”, “wallet”, “chain”]) if t == “discount”: unit = rng.choice([40, 60, 80, 120]) qty = rng.choice([5, 6, 8, 10]) disc = rng.choice([10, 20, 25, 50]) total = unit * qty gold = total – total * disc // 100 q = (f”A shop sells notebooks at {unit} rupees each. You buy {qty} ” f”notebooks and get a {disc}% discount on the total bill. ” f”How many rupees do you pay in total?”) elif t == “travel”: s1, h1 = rng.choice([40, 50, 60]), rng.choice([2, 3]) s2, h2 = rng.choice([30, 45, 70]), rng.choice([1, 2, 3]) gold = s1 * h1 + s2 * h2 q = (f”A car drives at {s1} km/h for {h1} hours, then at {s2} km/h ” f”for {h2} hours. What is the total distance travelled, in km?”) elif t == “wallet”: tens = rng.choice([3, 5, 7, 9]) fifties= rng.choice([2, 4, 6]) spent = rng.choice([50, 80, 110, 150]) gold = tens * 10 + fifties * 50 – spent q = (f”You have {tens} ten-rupee notes and {fifties} fifty-rupee ” f”notes. You spend {spent} rupees. How many rupees are left?”) else: x = rng.choice([6, 9, 12, 15]); y = rng.choice([4, 7, 10]); z = rng.choice([3, 8, 11]) gold = x * 2 – y + z q = (f”Start with the number {x}. Double it, then subtract {y}, ” f”then add {z}. What number do you end with?”) out.append({“question”: q, “answer”: gold}) return out all_problems = make_problems(18, seed=42) random.Random(1).shuffle(all_problems) trainset = all_problems[:12] valset = all_problems[12:] print(f”Dataset: {len(trainset)} train / {len(valset)} val problemsn”) We create a small deterministic dataset of arithmetic word problems covering discounts, travel distance, wallet calculations, and chained operations. We generate the correct answer for each problem programmatically, which keeps the benchmark reliable and easy to evaluate. We then shuffle the examples and split them into a training set for optimization and a validation set for testing generalization. Defining the Evaluator and Structured Feedback for GEPA Copy CodeCopiedUse a different Browser def build_system_prompt(candidate: dict) -> str: return (f”{candidate[‘instructions’]}nn” f”OUTPUT FORMAT RULES:n{candidate[‘format_rules’]}”) def call_task_lm(system_prompt: str, question: str) -> str: for attempt in range(3): try: r = litellm.completion( model=TASK_LM, messages=[{“role”: “system”, “content”: system_prompt}, {“role”: “user”, “content”: question}], temperature=0, max_tokens=600, timeout=60, ) return r[“choices”][0][“message”][“content”] or “” except Exception as e: if attempt == 2: return f”[LM_ERROR] {e}” return “” def parse_answers(text: str): formatted = re.search(r”####s*(-?d+)”, text) all_nums = re.findall(r”-?d+”, text) fmt_val = int(formatted.group(1)) if formatted else None last_val = int(all_nums[-1]) if all_nums else None return fmt_val, last_val def evaluate(candidate: dict, example: dict): system = build_system_prompt(candidate) raw = call_task_lm(system, example[“question”]) gold = example[“answer”] fmt_val, last_val = parse_answers(raw) if fmt_val is not None and fmt_val == gold: score, fb = 1.0, “Correct and correctly formatted.” elif fmt_val is not None and fmt_val != gold: score, fb = 0.0, (f”WRONG ANSWER. You output ‘#### {fmt_val}’ but the ” f”correct answer is {gold}. Re-check the arithmetic and ” f”the order of the steps.”) elif last_val == gold: score, fb = 0.5, (f”Right number ({gold}) but FORMAT VIOLATION: the final ” f”line was not exactly ‘#### {gold}’. Always end with a ” f”line of the form ‘#### <integer>’ and nothing else.”) else: score, fb = 0.0, (f”WRONG. Correct answer is {gold}. The model’s final ” f”number was {last_val}. Likely a multi-step reasoning ” f”slip; show each step and verify before answering.”) oa.log(f”score={score} gold={gold} parsed_fmt={fmt_val} parsed_last={last_val}”) side_info = { “feedback”: fb, “problem”: example[“question”], “gold_answer”: gold, “model_output”: raw[:500], } return score, side_info def eval_set(candidate, dataset, label=””): scores, exact, formatted = [], 0, 0 for ex in dataset: s, info = evaluate(candidate, ex) scores.append(s) if s == 1.0: exact += 1; formatted += 1 elif s == 0.5: formatted += 0 acc = exact / len(dataset) avg = sum(scores) / len(dataset) print(f” [{label}] avg_score={avg:.3f} exact_correct+formatted={exact}/{len(dataset)}”) return avg, acc We define how the candidate prompt is converted into a system prompt and how the task model receives each question. We also create the evaluator that parses the model output, checks whether the final answer follows the required #### <integer> format, and assigns a score. We return structured feedback as actionable side information so that GEPA can determine whether the issue is incorrect reasoning, poor formatting, or both. Configuring GEPA and Running the Prompt Optimization Copy CodeCopiedUse a different Browser seed_candidate = { “instructions”: “Solve the math problem.”, “format_rules”: “Give the answer.”, } print(“=== BASELINE (seed prompt) ===”) print(“Train:”); base_train = eval_set(seed_candidate, trainset, “train”) print(“Val: “); base_val = eval_set(seed_candidate, valset, “val”) print() objective = ( “Evolve a system prompt (the ‘instructions’ and ‘format_rules’ fields) so a ” “small LLM reliably solves multi-step

Building Reflective Prompt Optimization with GEPA: Multi-Component Prompts, Structured Feedback, and Held-Out Validation Read Post »

AI, Committee, ข่าว, Uncategorized

Google DeepMind Releases Gemma 4 QAT Checkpoints: Q4_0 and a New Mobile Format Cut On-Device Memory

Google DeepMind released Quantization-Aware Training (QAT) checkpoints for the Gemma 4 family. The release targets local deployment on edge devices and consumer GPUs. It follows the Gemma 4 launch in April and a 12B model two days earlier. We compared the available Gemma 4 edge-model formats using only published numbers. The goal was simple. Show what each precision level costs in memory. Then show what QAT actually changes. What QAT actually does Quantization shrinks a model by lowering weight precision. Standard Post-Training Quantization (PTQ) compresses a finished model. That often degrades quality. QAT instead simulates quantization during training. The model learns to compensate for the precision loss. Google’s AI team states its QAT results yield higher overall quality than standard PTQ baselines. Google did not publish Gemma 4 QAT benchmark scores in the announcement. For context, Gemma 3 QAT cut the Q4_0 perplexity drop by 54% using llama.cpp evaluation. We cite that only as prior-generation precedent. The comparison task Compare Gemma 4 E2B and E4B across three formats. The formats are BF16, Q4_0 QAT, and the new mobile QAT schema. Rank them on memory footprint, quality preservation, and on-device accessibility. Use published figures only. Memory results Format E2B E4B Basis BF16 (16-bit) 9.6 GB 15 GB Official Gemma 4 docs Q4_0 (4-bit, QAT) 3.2 GB 5 GB Official Gemma 4 docs Mobile (QAT, E2B) ~1 GB — QAT announcement The Q4_0 figures match the footprint of PTQ Q4_0. QAT does not change the size at a given format. It improves quality at that size. The new mobile schema delivers the additional reduction. Using that mobile schema, Google reduced Gemma 4 E2B to about 1GB. Developers can go lower still. The text-only model without Per-Layer Embeddings needs under 1GB, dropping the audio and vision encoders. Per-format breakdown BF16 is the quality baseline. E2B needs 9.6 GB and E4B needs 15 GB. It is the reference point, not a phone deployment target. Q4_0 QAT is the general-purpose local format. E2B drops to 3.2 GB and E4B to 5 GB. QAT preserves more quality here than PTQ at the same size. This format fits consumer GPUs. Earlier E2B testing also ran on a Raspberry Pi 5 at INT4. The mobile format is the edge-specialized schema. It brings E2B to about 1 GB. It uses static activations, channel-wise quantization, and targeted 2-bit compression. How the mobile schema works Google AI team engineered four techniques for mobile hardware. Static activations pre-calculate scaling during training, reducing on-device work. Channel-wise quantization fits the design of mobile accelerators. Targeted 2-bit quantization compresses only the token-generation layers. Embedding and KV cache optimization shrinks the active memory footprint. Core reasoning layers stay at higher precision. That protects capability while cutting storage. Developers can also deploy text-only and drop the audio and vision encoders. That trims memory further for use cases that need no multimodality. Dimension breakdown Scores are a qualitative ranking of the formats for on-device use. Memory is the only hard-measured axis. Quality reflects Google’s disclosed design, not measured Gemma 4 numbers. Each score has a one-line basis. Dimension BF16 Q4_0 QAT Mobile QAT Memory footprint 1 — heaviest, 9.6 GB E2B 4 — 3.2 GB E2B 5 — ~1 GB E2B text-only Quality preservation 5 — full-precision baseline 4 — QAT-preserved, near baseline 3 — 2-bit token layers, core kept higher Decode speed 2 — no quantization speedup 4 — 4-bit accelerates decode 5 — mobile-optimized static activations Deployment breadth 4 — loadable but heavy 5 — llama.cpp, Ollama, LM Studio, vLLM, MLX 3 — LiteRT-LM, Transformers.js, edge-focused On-device accessibility 1 — needs large GPU 4 — consumer GPU, Raspberry Pi 5 5 — runs on phones Total (/25) 13 21 21 Winner The result is a tie by design. Q4_0 QAT and mobile QAT both score 21, but for different hardware. For phones, the mobile format leads. It reaches about 1GB on E2B and targets mobile accelerators directly. For laptops and consumer GPUs, Q4_0 QAT is the practical default. BF16 stays the quality reference, not a local choice. Methodology and limits Memory figures come from Google’s Gemma 4 documentation. The ~1GB E2B figure comes from the QAT announcement. Quality is Google’s stated claim. No independent Gemma 4 QAT quality numbers were published at release. We did not run the models locally for this comparison. Developers should test at their own quantization and workload before building. Key Takeaways Q4_0 QAT cuts Gemma 4 E2B to 3.2 GB and E4B to 5 GB, from 9.6 GB and 15 GB at BF16. A new mobile QAT schema brings E2B to about 1 GB; text-only without PLE goes under 1 GB. QAT changes quality at a given size, not the size itself; the mobile format drives the extra memory cut. Google claims higher quality than PTQ but published no Gemma 4 QAT benchmark numbers at release. Weights ship today on Hugging Face with llama.cpp, Ollama, LM Studio, vLLM, MLX, and LiteRT-LM support. Marktechpost’s Visual Explainer Marktechpost · Benchmark Gemma 4 QAT: Comparing Q4_0 and the New Mobile Format Google DeepMind released Quantization-Aware Training checkpoints for Gemma 4. We compared three edge-model formats on published numbers. Formats compared BF16 (16-bit)  ·  Q4_0 QAT (4-bit)  ·  Mobile QAT June 5, 2026 The Comparison Task What we ranked $ compare gemma-4 –models E2B,E4B –formats BF16,Q4_0-QAT,MOBILE-QAT –rank memory,quality,accessibility –source published-only –no-self-run Memory from official Gemma 4 docs. Quality from Google’s stated claim. No models run locally. Format 1 of 3 · Reference BF16 (16-bit) 13 / 25 The full-precision quality baseline. E2B needs 9.6 GB and E4B needs 15 GB. Top observation: a reference point, not a phone or laptop deployment target. Format 2 of 3 · Laptop / GPU Q4_0 QAT (4-bit) 21 / 25 The general-purpose local format. E2B drops to 3.2 GB and E4B to 5 GB. Top observation: QAT preserves more quality than PTQ at the same 4-bit size. Format 3 of 3 · Mobile Mobile QAT 21 / 25 The edge-specialized schema. Brings E2B to about 1 GB. Top

Google DeepMind Releases Gemma 4 QAT Checkpoints: Q4_0 and a New Mobile Format Cut On-Device Memory Read Post »

We use cookies to improve your experience and performance on our website. You can learn more at นโยบายความเป็นส่วนตัว and manage your privacy settings by clicking Settings.

ตั้งค่าความเป็นส่วนตัว

You can choose your cookie settings by turning on/off each type of cookie as you wish, except for essential cookies.

ยอมรับทั้งหมด
จัดการความเป็นส่วนตัว
  • เปิดใช้งานตลอด

บันทึกการตั้งค่า
th