YouZum

AI

AI, Committee, News, Uncategorized

A Hands-On Coding Tutorial on Qualcomm AI Hub Models for Classification, Object Detection, and Hardware-Aware Deployment

In this tutorial, we work through an end-to-end workflow for Qualcomm AI Hub Models. We start by installing the required package, discovering the available model collection, and loading MobileNet-V2 for local PyTorch inference. We also handle an input-shape issue by converting NHWC image tensors into the NCHW format expected by the model. From there, we run inference on both the model’s built-in sample input and a real image, inspect the top predictions, execute the official Qualcomm AI Hub CLI demo, and extend the workflow with a YOLOv7 object detection example. We also include an optional cloud-device section that compiles, profiles, and runs the model on a real Qualcomm device when an API token is available.

    import subprocess, sys, os, glob, textwrap, traceback
    import numpy as np, torch
    from PIL import Image
    import matplotlib.pyplot as plt

    def pip_install(*pkgs):
        # Install packages into the current interpreter (works inside Colab).
        subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)

    pip_install("qai_hub_models")

    OUT_DIR = "/content/qaihm_out"
    os.makedirs(OUT_DIR, exist_ok=True)
    torch.set_grad_enabled(False)  # inference only

    def to_nchw(value):
        # Accept a list/tuple of arrays or a single array and return an NCHW float tensor.
        arr = value[0] if isinstance(value, (list, tuple)) else value
        t = torch.from_numpy(np.asarray(arr, dtype=np.float32))
        if t.ndim == 3:
            t = t.unsqueeze(0)  # add the batch dimension
        if t.ndim == 4 and t.shape[1] != 3 and t.shape[-1] == 3:
            t = t.permute(0, 3, 1, 2).contiguous()  # NHWC -> NCHW
        return t

We begin by importing libraries and setting up a helper function to install packages directly inside Colab. We install qai_hub_models, create an output directory, and disable gradient tracking since we only need inference. We also define the to_nchw() function, which converts a channels-last (NHWC) image tensor to the channel-first (NCHW) format expected by the model.

    import pkgutil, qai_hub_models.models as _m

    # Every non-private subpackage of qai_hub_models.models is one model ID.
    model_ids = sorted(n for _, n, p in pkgutil.iter_modules(_m.__path__)
                       if p and not n.startswith("_"))
    print(f">>> {len(model_ids)} models available. First 40:\n")
    print(textwrap.fill(", ".join(model_ids[:40]), 100), "\n")

    from qai_hub_models.models.mobilenet_v2 import Model as MobileNetV2
    model = MobileNetV2.from_pretrained().eval()
    spec = model.get_input_spec()
    input_name = list(spec.keys())[0]
    print(">>> Input:", input_name, spec[input_name].shape, spec[input_name].dtype)

    from torchvision.models import MobileNet_V2_Weights
    IMAGENET_CLASSES = MobileNet_V2_Weights.IMAGENET1K_V1.meta["categories"]

    def top5(logits):
        # Convert raw logits into (label, probability) pairs for the five best classes.
        if logits.ndim == 1:
            logits = logits.unsqueeze(0)
        probs = torch.softmax(logits, dim=1)[0]
        conf, idx = probs.topk(5)
        return [(IMAGENET_CLASSES[i], float(c)) for c, i in zip(conf, idx)]

We discover the available Qualcomm AI Hub model packages and print the first 40 model IDs to see what is accessible. We then load the pretrained MobileNet-V2 model, read its input specification, and identify the correct input name. We also prepare the ImageNet class labels and define a top5() function to convert model logits into readable top-5 predictions.
    sample = model.sample_inputs()
    x = to_nchw(sample[input_name])
    print(">>> fed tensor shape:", tuple(x.shape))
    print("\n>>> Top-5 for the built-in sample input:")
    for label, conf in top5(model(x)):
        print(f" {conf:6.2%} {label}")

    from torchvision import transforms
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])

    img = None
    try:
        import urllib.request
        p = os.path.join(OUT_DIR, "input.jpg")
        urllib.request.urlretrieve(
            "https://raw.githubusercontent.com/pytorch/hub/master/images/dog.jpg", p)
        img = Image.open(p).convert("RGB")
    except Exception as e:
        print(">>> photo download skipped:", e)

    if img is not None:
        preds = top5(model(preprocess(img).unsqueeze(0)))
        print("\n>>> Top-5 for the downloaded photo:")
        for label, conf in preds:
            print(f" {conf:6.2%} {label}")
        plt.figure(figsize=(5,5)); plt.imshow(img); plt.axis("off")
        plt.title(f"{preds[0][0]} ({preds[0][1]:.1%})"); plt.show()

We first run inference using the model’s built-in sample input and use to_nchw() to fix the tensor shape before passing it to MobileNet-V2. We then download a real image, preprocess it using standard resizing, cropping, and tensor conversion steps, and run another prediction. We finally display the image with the top predicted label to visually connect the model output to the input photo.
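The layout fix that to_nchw() applies can be illustrated without PyTorch. The following is a minimal sketch on nested Python lists; the helper name nhwc_to_nchw is hypothetical and is not part of qai_hub_models. It shows what the permute(0, 3, 1, 2) call does to the axis order: the per-pixel channel values are regrouped into one full plane per channel.

```python
def nhwc_to_nchw(batch):
    """Reorder a nested-list batch from [N][H][W][C] to [N][C][H][W]."""
    out = []
    for image in batch:
        channels = len(image[0][0])  # number of values stored per pixel
        out.append([
            [[pixel[c] for pixel in row] for row in image]
            for c in range(channels)
        ])
    return out


# One 2x2 RGB image; each innermost list is a pixel [R, G, B].
nhwc = [[
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]]
nchw = nhwc_to_nchw(nhwc)
print("channels:", len(nchw[0]))
print("first channel plane:", nchw[0][0])
```

In the tutorial itself the same reordering is done by the tensor permute, which avoids Python-level loops.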
    def run_demo(module, extra=None, timeout=900):
        # Run an official demo module as a subprocess and print the last 25 lines of output.
        cmd = [sys.executable, "-m", module, "--eval-mode", "fp", "--output-dir", OUT_DIR] + (extra or [])
        print(f"\n>>> {' '.join(cmd)}")
        try:
            r = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
            print("\n".join((r.stdout + r.stderr).strip().splitlines()[-25:]))
        except Exception as e:
            print(">>> demo skipped:", e)

    run_demo("qai_hub_models.models.mobilenet_v2.demo")

    try:
        pip_install("qai_hub_models[yolov7]")
        run_demo("qai_hub_models.models.yolov7.demo")
        # Show the most recently written image in the output directory, if any.
        imgs = sorted(glob.glob(OUT_DIR + "/*.png") + glob.glob(OUT_DIR + "/*.jpg"),
                      key=os.path.getmtime)
        if imgs:
            plt.figure(figsize=(9,9)); plt.imshow(Image.open(imgs[-1]).convert("RGB"))
            plt.axis("off"); plt.title("YOLOv7 detections"); plt.show()
        else:
            print(">>> no output image found (results may have printed instead).")
    except Exception:
        print(">>> YOLOv7 section skipped:\n", traceback.format_exc())

We define a reusable run_demo() function that executes official Qualcomm AI Hub model demos from the command line. We use it to run the MobileNet-V2 demo and then install the YOLOv7 extras for object detection. We run the YOLOv7 demo, search for the generated output image, and visualize the detections if an image is created.

    try:
        import qai_hub as hub
        devices = hub.get_devices()
        print(f"\n>>> Authenticated. {len(devices)} cloud devices available.")
        device = hub.Device("Samsung Galaxy S24 (Family)")

        sample = model.sample_inputs()
        nchw = to_nchw(sample[input_name])
        traced = torch.jit.trace(model, [nchw])
        cloud_inputs = {input_name: [nchw.numpy()]}

        # Compile for TFLite, profile on the device, then run inference on it.
        cj = hub.submit_compile_job(model=traced, device=device,
                                    input_specs=model.get_input_spec(),
                                    options="--target_runtime tflite")
        target = cj.get_target_model(); print(">>> compiled:", cj.url)
        pj = hub.submit_profile_job(model=target, device=device); print(">>> profiling:", pj.url)
        ij = hub.submit_inference_job(model=target, device=device, inputs=cloud_inputs)
        out = ij.download_output_data()
        dev_logits = torch.from_numpy(np.asarray(list(out.values())[0][0]))
        print(">>> Top-5 from the REAL device:")
        for label, conf in top5(dev_logits):
            print(f" {conf:6.2%} {label}")
        target.download(os.path.join(OUT_DIR, "mobilenet_v2.tflite"))
        print(">>> saved compiled .tflite to", OUT_DIR)
    except Exception as e:
        # Any failure lands here; a missing API token is the most likely cause.
        print("\n>>> Cloud (on-device) section skipped: no API token configured, or a cloud job failed.")
        print(" Get one at workbench.aihub.qualcomm.com, then:")
        print(" !qai-hub configure --api_token YOUR_TOKEN")
        print(" detail:", (str(e).splitlines() or [type(e).__name__])[0])

    print("\n>>> Tutorial complete. Outputs in:", OUT_DIR)

We include an optional Qualcomm AI Hub cloud workflow that runs only when an API token is configured. We retrieve available cloud devices, trace the PyTorch model, compile it for TFLite, profile it on a Qualcomm device, and submit an inference job. We then download the device output, print the top predictions, save the compiled TFLite model, and finish by showing where all tutorial outputs are stored.

In conclusion, we have a complete practical workflow for using Qualcomm AI Hub Models inside Colab. We learned how to load pretrained models, prepare inputs correctly, run local inference, visualize classification and detection results, and use the official demos as reproducible reference points.
We also saw how the same model can move beyond local PyTorch execution into Qualcomm’s cloud-device pipeline for compilation, profiling, and real-device inference. It provides a path from simple experimentation to hardware-aware deployment with Qualcomm AI Hub.


AI, Committee, News, Uncategorized

NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Aware Streaming Model Transcribing 40 Language-Locales in Real Time

NVIDIA’s Nemotron Speech team has released Nemotron 3.5 ASR. It is a 600M-parameter streaming Automatic Speech Recognition (ASR) model. A single checkpoint transcribes 40 language-locales in real time. Punctuation and capitalization are built in natively. The model ships as open weights on Hugging Face. The license is OpenMDW-1.1. The architecture is a Cache-Aware FastConformer-RNNT.

What is Nemotron 3.5 ASR

Nemotron 3.5 ASR extends nvidia/nemotron-speech-streaming-en-0.6b to many languages. It adds prompt-based language-ID conditioning to the base model. That lets one 600M-parameter checkpoint cover 40 language-locales. No per-language model or model-swapping is required.

The model targets two workloads. The first is low-latency streaming for live audio. The second is high-throughput batch transcription. Output is production-ready text with proper casing and punctuation. No separate punctuation-restoration step is needed.

Image source: https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b

How Cache-Aware FastConformer-RNNT Works

The model has two main pieces. The first is a Cache-Aware FastConformer encoder with 24 layers. FastConformer is an efficient evolution of the Conformer architecture. It uses linearly scalable attention. The second piece is an RNNT (Recurrent Neural Network Transducer) decoder. RNNT emits text frame by frame as audio streams in.

The “cache-aware” design is the efficiency lever. Buffered streaming re-processes overlapping audio windows at every step. That repeats the same work and adds delay. This model caches encoder self-attention and convolution activations instead. It reuses those cached states as new audio arrives. So each audio frame is processed exactly once, with no overlap. Compute and end-to-end latency both drop, without an accuracy penalty.

The Latency Knob: att_context_size

One inference setting controls the latency-accuracy tradeoff. It is the attention context size, att_context_size.
Smaller context emits text sooner but sees less future audio. Larger context raises accuracy at higher latency. The same checkpoint covers the full range. Settings map to chunk sizes of 80ms, 160ms, 320ms, 560ms, and 1.12s. For example, [56,0] gives an 80ms ultra-low-latency mode. The [56,13] setting gives 1.12s for highest accuracy. Teams pick the operating point at inference time, with no retraining.

Language Detection and Coverage

The 40 language-locales include English, Spanish, German, and French variants. They also cover Arabic, Japanese, Korean, Mandarin, Hindi, and Thai. Several other European and Nordic languages are included too.

Language conditioning works two ways. Setting target_lang to a known locale usually gives the best accuracy. Setting target_lang=auto lets the model detect the language itself. In auto mode, it emits a language tag after terminal punctuation. One deployment can then transcribe mixed-language traffic. No separate language-ID component is required.

Comparison

Product | Company | Access | Native streaming | Language coverage | Reported latency | Pricing model
Nemotron 3.5 ASR | NVIDIA | Open weights (OpenMDW-1.1), self-host; hosted on DeepInfra | Yes (cache-aware FastConformer-RNNT) | 40 language-locales | 80ms to 1.12s, configurable at inference | Free to self-host; usage-based via host
Whisper large-v3 | OpenAI | Open weights (MIT), self-host; API | No (offline/batch) | ~99 languages | Not streaming-native | Self-host free; API ~$0.006/min (batch)
Nova-3 | Deepgram | Closed API; on-premise/self-host (enterprise) | Yes (streaming + batch) | Multilingual; +10 monolingual languages added Jan 2026 | Low-latency streaming (reported sub-300ms) | ~$0.0077/min (Nova-3 Monolingual, PAYG)
Universal-3 Pro Streaming | AssemblyAI | Closed API (EU endpoint available) | Yes | 6 languages: English, Spanish, French, German, Italian, Portuguese | Sub-300ms (official); first partial ~750ms | Usage-based (PAYG)
Scribe v2 Realtime | ElevenLabs | Closed API | Yes | 90+ languages (99 per ElevenLabs) | ~150ms (p50) | ~$0.28/hour
Ursa / streaming | Speechmatics | API + on-premise + edge | Yes (streaming + batch) | 50+ languages with automatic identification | Ultra-low latency (positioned) | Enterprise/usage

Fine-Tuning Results

Because the weights are open, teams can fine-tune for a language, domain, or accent. NVIDIA published a worked example on Greek and Bulgarian. It fine-tuned the base checkpoint with the same Cache-Aware FastConformer-RNNT recipe. Each clip carried a target_lang tag for language conditioning. Training data came from public corpora, including Granary, Common Voice, and FLEURS.

Results were measured as WER on held-out FLEURS, at the 80ms setting. Greek WER fell from 35 to 24, a 32% relative improvement. Bulgarian fell from 22 to 15, a 31% relative improvement. These are raw WER percentages at the lowest-latency streaming mode. NVIDIA notes that evaluating at deployment latency, on held-out data, gives honest numbers.

Strengths and Considerations

Strengths:
- One 600M-parameter checkpoint covers 40 language-locales, cutting deployment sprawl.
- Cache-aware streaming processes each frame once, reported at 17x buffered concurrency on an H100.
- att_context_size tunes latency from 80ms to 1.12s at inference, with no retraining.
- Punctuation, capitalization, and auto language tagging are built in.
- Open weights enabled a 31–32% relative WER drop on Greek and Bulgarian after fine-tuning.

Considerations:
- The model handles English, but NVIDIA recommends its dedicated English model for English-only use.
- The 80ms mode trades some accuracy for the lowest latency.
- Japanese and Korean use CER, so cross-language error comparisons need care.
- Throughput figures are measured on H100, so results on other GPUs will differ.
- The production NIM with gRPC streaming is announced, but not yet released.

Key Takeaways

- NVIDIA’s Nemotron 3.5 ASR is an open-weights (OpenMDW-1.1), 600M-parameter streaming model transcribing 40 language-locales from one checkpoint.
- Its Cache-Aware FastConformer-RNNT design processes each audio frame once, reported at 17x the concurrent streams of buffered approaches on an H100.
- Latency is configurable from 80ms to 1.12s at inference via att_context_size, with no retraining.
- A short fine-tune cut FLEURS WER 32% on Greek (35→24) and 31% on Bulgarian (22→15), at the 80ms setting.
- It is self-hostable and streaming-native, unlike closed APIs (Deepgram, AssemblyAI, ElevenLabs) or offline Whisper.
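The relationship between att_context_size and chunk latency can be sketched with a little arithmetic. This is an illustration under stated assumptions, not NVIDIA's code: it assumes each encoder output frame covers 80 ms and that one chunk spans the current frame plus the right-context frames. That assumption reproduces the two settings quoted in the article ([56,0] and [56,13]); the right-context values printed for the intermediate chunk sizes are inferred from it and are not quoted by the source. The function names are hypothetical.

```python
FRAME_MS = 80  # assumption: duration covered by one encoder output frame


def chunk_latency_ms(att_context_size):
    """Chunk duration implied by an [left, right] attention context setting."""
    _left, right = att_context_size
    return (right + 1) * FRAME_MS  # current frame plus look-ahead frames


def right_context_for(chunk_ms):
    """Inverse mapping: look-ahead frames needed for a given chunk duration."""
    frames, remainder = divmod(chunk_ms, FRAME_MS)
    if remainder or frames < 1:
        raise ValueError("chunk size must be a positive multiple of the frame duration")
    return frames - 1


# The two settings quoted in the article.
for setting in ([56, 0], [56, 13]):
    print(setting, "->", chunk_latency_ms(setting), "ms")

# Inferred look-ahead for each chunk size the article lists.
for chunk in (80, 160, 320, 560, 1120):
    print(chunk, "ms -> right context", right_context_for(chunk))
```

The left value (56 in both quoted settings) is history the model attends to; under this reading it does not add latency, which is why only the right value enters the calculation.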


AI, Committee, News, Uncategorized

Moonshot AI Releases Kimi Code CLI: A Terminal AI Coding Agent Built in TypeScript for Next-Gen Agents

Moonshot AI has released Kimi Code CLI, an open-source coding agent that runs in the terminal. The tool reads and edits code, runs shell commands, searches files, and fetches web pages. It then chooses its next step based on the feedback it receives. The project is MIT-licensed and lives on GitHub.

Kimi Code CLI is the successor to the older kimi-cli. The new agent is written in TypeScript and distributed via npm. It works out of the box with Moonshot AI’s Kimi models. It can also be configured to use other compatible providers.

What is Kimi Code CLI

Kimi Code CLI is an AI agent for software development and terminal operations. It can implement new features, fix bugs, and complete refactors. It can also explore an unfamiliar codebase and answer architecture questions. Batch file processing, builds, and chained test runs are supported too.

The execution model is feedback-driven. The agent plans steps, modifies code, runs tests, and reports its actions. Read-only operations run automatically by default. For file edits or shell commands, the agent asks for confirmation first. This approval flow keeps risky actions under developer control.

The CLI itself is free and MIT-licensed. Model access requires Kimi Code OAuth or a Moonshot AI Open Platform API key.

https://github.com/MoonshotAI/kimi-code

Key Features

Moonshot lists several features aimed at long, focused agent sessions:
- Single-binary distribution. One command installs it, with no Node.js setup required.
- Fast startup. Moonshot says the TUI is ready in milliseconds.
- Purpose-built TUI. The interface is tuned for extended agent sessions.
- Video input. Drop a screen recording or demo clip into the chat.
- AI-native MCP configuration. Add and authenticate Model Context Protocol servers via /mcp-config.
- Subagents for parallel work. Dispatch built-in coder, explore, and plan subagents in isolated contexts.
- Lifecycle hooks. Run local commands to gate tool calls, audit decisions, or trigger notifications.
Installation and First Run

Two installation paths exist. The official script needs no pre-installed Node.js. On macOS or Linux, run the install script:

    curl -fsSL https://code.kimi.com/kimi-code/install.sh | bash

On Windows, use PowerShell:

    irm https://code.kimi.com/kimi-code/install.ps1 | iex

The global npm install requires Node.js 24.15.0 or later:

    npm install -g @moonshot-ai/kimi-code

Verify the binary, then open a project and start the interactive UI:

    kimi --version
    cd your-project
    kimi

On first launch, type /login inside the UI. You can choose Kimi Code OAuth or a Moonshot AI Open Platform API key. To run one instruction without the UI, use kimi -p "your task". To resume the previous session, add -C.

Use Cases

- Understanding a project: Ask for an architecture overview and a module dependency diagram.
- Implementing a feature: Describe the signature, options, and acceptance criteria up front.
- Fixing a bug: Give the symptom, reproduction steps, and expected behavior together.
- Writing tests and refactoring: Extract repeated patterns, then run tests to confirm behavior.
- One-off automation: Analyze logs and output call counts with p50 and p99 latencies.
- Scheduled tasks: Ask the agent to set reminders or recurring checks via cron.

Plan mode is available through Shift-Tab or kimi --plan. It outputs a research plan before touching files. For safe batch work, --yolo or /yolo skips approval prompts. The /fork command creates an experimental branch you can abandon. The /compact command compresses context to free up tokens. For large investigations, the main agent can dispatch subagents in parallel.

How Kimi Code CLI Compares

Kimi Code CLI joins several established terminal coding agents. The table below compares it with three of them. Competitor details reflect mid-2026 and can change quickly.
Attribute | Kimi Code CLI | Claude Code | OpenAI Codex CLI | Gemini CLI
Developer | Moonshot AI | Anthropic | OpenAI | Google
Backing model | Kimi models | Claude models | GPT-5.3-Codex | Gemini 2.5 Pro
Language / runtime | TypeScript | Node.js | Rust | TypeScript
Install | Script or npm (Node.js ≥ 24.15.0) | Native installer or npm | npm / native | npm single binary
MCP support | Yes (/mcp-config) | Yes | Yes | Yes
Subagents | Yes (coder, explore, plan) | Yes | Yes | No (sequential)
Plan mode | Yes (Shift-Tab) | Yes | Yes | Yes
IDE integration | ACP (Zed, JetBrains) | VS Code, JetBrains | VS Code, IDEs | VS Code (Code Assist)
License | MIT | Proprietary | Open source | Apache 2.0

All four agents support the Model Context Protocol. They differ on backing model, language, license, and orchestration. Kimi Code CLI and Codex CLI both ship native subagents. Gemini CLI runs tasks sequentially without subagent support.

Key Takeaways

- Kimi Code CLI is an MIT-licensed terminal coding agent from Moonshot AI.
- It is written in TypeScript and installs via script or npm.
- Built-in coder, explore, and plan subagents run in isolated contexts.
- MCP servers are configured conversationally through /mcp-config, not raw JSON.
- It succeeds kimi-cli and migrates existing configuration and sessions.
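For scripted use, the non-interactive mode described above (kimi -p "your task") can be wrapped from Python. The following is a minimal sketch, not an official integration. It relies only on the -p and -C flags named in the article; combining the two in one invocation, and the assumption that the CLI prints its result to standard output and then exits, are untested guesses. The function names are hypothetical.

```python
import shutil
import subprocess


def build_kimi_command(task, resume=False):
    """Build the argv list for a one-shot Kimi Code CLI run."""
    cmd = ["kimi", "-p", task]  # -p: run one instruction without the UI
    if resume:
        cmd.append("-C")  # assumption: -C may be combined with -p to resume
    return cmd


def run_kimi(task, resume=False, timeout=600):
    """Run the task if the kimi binary is installed; return its stdout or None."""
    if shutil.which("kimi") is None:
        return None  # CLI not installed on this machine
    result = subprocess.run(
        build_kimi_command(task, resume),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout


print(build_kimi_command("summarize the module layout"))
```

Passing the command as a list rather than a shell string keeps the task text from being interpreted by the shell, which matters when the prompt contains quotes or special characters.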


AI, Committee, News, Uncategorized

The Meta hack shows there’s more to AI security than Mythos

On June 5, 404 Media reported that attackers had been using Meta’s AI customer support agent to steal Instagram accounts. Their approach was simple: They asked the agent to link the accounts to email addresses that they controlled, and the agent complied. One attacker broke into the dormant Obama White House account and made pro-Iran posts; others took over accounts with valuable, single-word handles, possibly in order to sell them. AI cybersecurity concerns are nothing new. Since Anthropic announced in April that its Mythos model was too good at hacking to be released to the general public, commentators, researchers, and federal officials alike have fixated on the idea that superpowered AI systems could lay waste to our computer infrastructure. That’s not quite what this Instagram hack was: There, AI was the target rather than the attacker, and the method was far simpler than anything Mythos would cook up. But as companies offload more work to AI, these comparatively unsophisticated attacks could wreak their own havoc. “As AI becomes more and more widely used—especially when AI is more and more widely used to automate our work flows, like account recovery—I think attackers are going to be more and more motivated to attack AI itself,” says Neil Gong, a professor of electrical and computer engineering at Duke University. Gong and other scholars have been issuing warnings about the security vulnerabilities of AI agents for a while. They publish papers and blog posts detailing exploits such as indirect prompt injection, which involves hijacking agents using commands hidden in websites, emails, or other seemingly anodyne data sources. Compared with these techniques, the Meta hack was practically mindless. The only complication that hackers had to overcome was using a VPN that matched the true account owner’s location; then they directly asked the support agent to change the account’s email address, and it complied. 
Meta has not commented publicly on how this vulnerability slipped through the cracks. But given the simplicity of the exploit, Gong says, it should have been uncovered easily, before the agent was deployed. “It’s really surprising,” he says. “I don’t understand why they didn’t find this simple problem.” Jessica Ji, a senior research analyst at Georgetown’s Center for Security and Emerging Technology, agrees. “It raises questions like: Were there even guardrails in place?” she says. “Did anyone think to test for this kind of scenario?” She notes that the oversight is particularly striking coming from a company like Meta, which has extensive expertise in both AI and cybersecurity. Meta did not respond to a request for comment for this article, but on Monday a Meta spokesperson said on X that the vulnerability had been resolved. As embarrassing a moment as this might be for Meta in particular, it also highlights some core vulnerabilities shared by all AI agents. Unlike traditional software, agents can respond in flexible—and unexpected—ways to new circumstances, which is why they might be able to substitute for human customer support agents. But AI agents can also be tricked in ways that humans wouldn’t be, and because they can take real-world actions, those mistakes have consequences. “A human would say, ‘Okay, why do you want to change the email address?’ and maybe respond with a security question,” says Somesh Jha, a professor of computer science at the University of Wisconsin–Madison. “What is going on with these agents is they’re very eager to finish the task. It’s almost like some elementary school student who just wants to please the teacher.” There are ways to mitigate the risks. Companies can use traditional software to build guardrails that make sure agents follow strict rules, such as always asking for answers to security questions before sending sensitive account information to a new email address. 
And the experts consulted for this article all agree that agents should undergo rigorous red-teaming, a process in which developers try their best to attack a system in order to discover its vulnerabilities before it is deployed. But there are also countervailing forces. Companies want to deploy capable agents, and the more power an agent has—and the fewer guardrails it is subject to—the more work it can potentially take on. “Security and utility always have a trade-off,” says Bo Li, a professor of computer science at the  University of Illinois Urbana-Champaign. And adequate red-teaming can be expensive. Defenders have to expend more resources than attackers do, because attackers only need to discover a single exploit, while defenders try to discover and patch as many as they can. When attackers are working toward something as valuable as a single-word Instagram handle, they’ll pour resources into finding exploits, so defenders have to spend even more money to protect that prize.  As AI models continue to improve, hardening their defenses might actually get easier. Though the probabilistic nature of large language models means that LLM agents will always be vulnerable to some forms of attack, a more sophisticated model might have identified an attempt to change the email associated with the Obama White House account as suspicious. And AI systems can be used for agent red-teaming, much as participants in Anthropic’s Project Glasswing use Mythos to identify vulnerabilities in their software.  Still, experts expect that the problem of securing AI agents will only become more pressing in the future. As agents grow more capable, companies that adopt them may want to give them more power, both to provide more services with fewer humans and to avoid being left behind by their competitors. In the fast-moving world of AI, the time needed to carefully secure risky agentic systems might seem like an unconscionable delay. 
“Everybody wants to be the first to do something and just push things out without careful scrutiny and red-teaming,” Jha says. “I think it’s a very dangerous thing.”
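The guardrail idea mentioned above, in which traditional software enforces a strict rule no matter what the agent decides, can be sketched as a wrapper around the sensitive action. This is an illustrative sketch only and says nothing about how Meta's system is built. All names (EmailChangeRequest, guarded_change_email, the in-memory stores) are hypothetical, and a production check would use stronger verification than a single stored answer.

```python
from dataclasses import dataclass
from typing import Optional


class GuardrailError(Exception):
    """Raised when a sensitive action is attempted without verification."""


@dataclass
class EmailChangeRequest:
    account_id: str
    new_email: str
    security_answer: Optional[str] = None


def guarded_change_email(request, verify_answer, apply_change):
    """Apply the change only after a deterministic identity check passes."""
    if request.security_answer is None:
        raise GuardrailError("security answer required before changing email")
    if not verify_answer(request.account_id, request.security_answer):
        raise GuardrailError("security answer did not match")
    apply_change(request.account_id, request.new_email)


# Toy in-memory stores standing in for real account systems.
answers = {"acct-1": "blue"}
emails = {"acct-1": "owner@example.com"}


def verify(account_id, answer):
    return answers.get(account_id) == answer


def apply(account_id, new_email):
    emails[account_id] = new_email


try:
    # An agent that simply "complies" with the request is stopped here.
    guarded_change_email(EmailChangeRequest("acct-1", "attacker@example.com"), verify, apply)
except GuardrailError as err:
    print("blocked:", err)

guarded_change_email(EmailChangeRequest("acct-1", "new@example.com", "blue"), verify, apply)
print(emails["acct-1"])
```

The point of the design is that the check lives outside the model: the agent can only reach apply_change through code it cannot talk its way around.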


AI, Committee, News, Uncategorized

Perplexity AI Introduces Hybrid Local-Server Inference Orchestrator for Personal Computer: Automatic On-Device and Cloud Task Routing

Perplexity AI announced what it calls the first hybrid local-server inference orchestrator at Computex 2026. The system is designed to automatically route AI tasks between a user’s local device and cloud-based frontier models without requiring the user to decide in advance. The feature is expected to come to Perplexity Computer in July 2026.

What is Hybrid Agentic Inference?

To understand what Perplexity built, it helps to understand the three-way tension that AI systems face. Accuracy demands the most capable models, which are expensive to run. Privacy demands that some data never leave the device. Cost and energy efficiency demand that you don’t spend a frontier model’s compute on tasks a smaller model can handle. A routing layer that balances these three demands is what Perplexity calls hybrid agentic inference.

A compact AI model runs locally on the user’s device. This local model evaluates each incoming task or subtask. It determines whether the task involves sensitive data, whether it requires heavy computation, or whether it can be handled entirely on-device. Based on that evaluation, work is either kept local or sent to a frontier model in the cloud. Perplexity describes this local model as deciding “when sensitive data should also be kept locally.”

The system is designed to ask for user permission before sending sensitive tasks to the cloud. That design addresses a specific concern enterprises have about agentic AI: data governance, meaning knowing where data goes and who controls that decision. Examples of data the system is intended to keep local include financial records, health information, and personal files. Work that requires a frontier model’s full capability runs on the server. Most real tasks are a mix, so the system splits them and coordinates the parts.

How It Fits into Perplexity Computer

Perplexity Computer is the company’s cloud-based multi-model agentic product, launched in February 2026. It originally ran entirely in the cloud on the Perplexity Max subscription tier ($200/month).
Personal Computer is a separate, related product that brought Computer’s capabilities onto the local device — with access to local files, native Mac apps, the web, and Perplexity’s secure servers. Personal Computer launched on Mac in April 2026. Windows support is planned; a waitlist is open. The new hybrid local-server inference orchestrator is the next step for Personal Computer. Previously, even within Personal Computer, the division was relatively fixed: local file access happened on-device, heavy computation ran on Perplexity’s servers. The orchestrator changes that. The system now reasons about where each piece of a task should execute — not just which model to use, but which physical location should process it. Perplexity Computer coordinates up to 20 AI models in a single workflow. The system is one that creates a team of agents and orchestrates across models, tools and files in one single system. The hybrid orchestrator extends that orchestration to compute location itself. Key Takeaways Perplexity AI announced the first hybrid local-server inference orchestrator at Computex 2026, routing AI tasks automatically between on-device and cloud models. A compact local model acts as the router — classifying each subtask by data sensitivity and compute requirements before dispatching it. Sensitive data (financial records, health files) stays on-device; compute-heavy tasks go to frontier cloud models — no manual configuration required. The orchestration framework is model-agnostic and chip-agnostic, confirmed to run on Intel Core Ultra Series 3 and NVIDIA RTX Spark hardware. The feature arrives in Perplexity Computer in July 2026, initially on Windows; Personal Computer is already available on Mac with a Windows waitlist open. Check out the Technical details. Also, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well. 
The post Perplexity AI Introduces Hybrid Local-Server Inference Orchestrator for Personal Computer: Automatic On-Device and Cloud Task Routing appeared first on MarkTechPost.



NVIDIA AI Releases Dynamo Snapshot: A CRIU-Based Fast Startup System for AI Inference on Kubernetes

In production inference deployments, demand fluctuates over time, requiring inference replicas to scale elastically. Cold-starting inference workloads on Kubernetes can take several minutes. During that time, GPUs are allocated but idle, generating no tokens and serving no requests. ‘Cold start’ means the full sequence a model server must complete before serving any request: pulling the container image, loading model weights into GPU memory, warming up CUDA kernels, compiling or capturing CUDA graphs, and registering with the service discovery layer. This delay increases the risk of SLA violations during traffic spikes, as the system cannot scale quickly enough to absorb sudden increases in demand.

The cold-start latency for a single-GPU vLLM (v0.20.0) workload breaks into three segments: container/image pull, engine initialization (weight loading, kernel warmup, graph compilation), and distributed runtime startup.

To address this, NVIDIA’s AI research team has introduced NVIDIA Dynamo Snapshot: a checkpoint/restore approach for AI inference workloads on Kubernetes.

https://developer.nvidia.com/blog/nvidia-dynamo-snapshot-fast-startup-for-inference-workloads-on-kubernetes/?linkId=100000423964029

What is CRIU and cuda-checkpoint?

A running inference worker’s checkpointable state has two components.

Device state (GPU-side) includes CUDA contexts, streams, device memory, and virtual address mappings; this is not visible to the host. To serialize it, cuda-checkpoint uses the checkpointing capability of the CUDA driver to dump the device state to CPU memory of the process owning each CUDA context.

Host state (CPU-side) includes CPU memory, threads, file descriptors, and namespaces. CRIU (Checkpoint/Restore in Userspace) walks the Linux kernel’s bookkeeping and serializes the process tree’s state to disk.
When checkpointing, the two tools run in order: cuda-checkpoint dumps all device state into CPU memory first, then CRIU dumps all host-side process tree state to a folder in storage. When restoring on the same or a different node, the order is reversed: CRIU first restores the process tree from distributed storage such as NFS or SMB, then cuda-checkpoint restores the GPU state from what is now in CPU memory onto the new GPUs.

CRIU is fundamentally a freeze-and-thaw mechanism. When a process is restored, execution resumes at the exact instruction where it was checkpointed, completely unaware that checkpointing or restoration occurred. Because of this, any coordination required before checkpointing (such as quiescing the workload) or after restoration (such as re-establishing external state) must be handled externally through an orchestrator or workload-specific hooks.

How Dynamo Snapshot Works on Kubernetes

In Kubernetes, workloads run inside containers inside pods. Because CRIU checkpoints contain references to the container’s writable filesystem layer, checkpointing is done at the container level so the process tree state and filesystem travel together.

NVIDIA provides a privileged DaemonSet, snapshot-agent, installable through a Helm chart. An agent runs on every node and handles checkpoint and restore for runc-managed containers without requiring modifications to runc itself. On checkpoint, the agent waits for the workload’s readiness probe, invokes cuda-checkpoint and CRIU from the host side, and writes the artifact to shared storage. The workload may have created or deleted files local to the container (the overlay filesystem), which the agent also checkpoints after the CRIU stage. On restore, the agent launches a lightweight placeholder pod, restores the overlay filesystem, and restores the CRIU/CUDA checkpoint into its namespaces. Each agent operates independently on its local node, allowing checkpoints and restores to parallelize naturally across the cluster.
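The ordering of the two tools is the part that is easy to get backwards, so here is a minimal sketch that only builds the command lines in the order the article describes. The flags are assumptions taken from the public cuda-checkpoint and CRIU command-line tools (`--toggle`/`--pid`, `dump --tree --images-dir`, `restore --images-dir --restore-detached`); the article does not say how snapshot-agent invokes either tool, and the function names are hypothetical.

```python
from typing import List


def checkpoint_commands(pid: int, images_dir: str) -> List[List[str]]:
    """Checkpoint order: GPU state into host memory first, then the process tree to disk."""
    return [
        # Assumed flag set: toggles the process's CUDA state between running and checkpointed.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # CRIU then serializes the host-side process tree into images_dir.
        ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir],
    ]


def restore_commands(pid: int, images_dir: str) -> List[List[str]]:
    """Restore order is the reverse: process tree first, then GPU state onto the new GPUs."""
    return [
        ["criu", "restore", "--images-dir", images_dir, "--restore-detached"],
        # Assumes the restored process keeps the same PID inside its PID namespace.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


if __name__ == "__main__":
    for cmd in checkpoint_commands(1234, "/mnt/snapshots/worker-0"):
        print(" ".join(cmd))
    for cmd in restore_commands(1234, "/mnt/snapshots/worker-0"):
        print(" ".join(cmd))
```

The sketch deliberately does not execute anything: running these tools needs root privileges, a CUDA process, and a driver that supports checkpointing.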
This DaemonSet approach was chosen over Kubernetes-native checkpoint/restore support in runc for three reasons: it is fully portable without depending on cloud-provider feature gates, it gives tighter control over CRIU for performance tuning, and it allows checkpoint artifacts to live in flexible storage backends rather than being embedded into OCI images.

Quiesce/resume hooks

A Dynamo inference worker initializes in two ordered phases. First, engine initialization: communicators are initialized, weights are loaded, kernels are warmed up, and CUDA graphs are compiled. The worker is fully warm at this point but not yet discoverable outside its pod. Second, distributed runtime startup: the worker connects to the Dynamo control plane and registers with the discovery backend. Open TCP connections to the control plane exist from this point onward.

If the checkpoint were taken after distributed runtime startup, there would be active TCP connections that CRIU cannot capture. The solution is quiesce/resume hooks: the worker writes a ‘ready for checkpoint’ signal file after engine initialization but before distributed runtime startup. The worker then enters a polling loop waiting for a ‘restore complete’ signal file while the snapshot agent checkpoints it externally. Because CRIU restores execution at the exact instruction where checkpointing occurred, the worker resumes directly inside the polling loop, detects the signal file, and proceeds with distributed runtime initialization without requiring additional synchronization.

The quiesce/resume pattern is also important for multi-GPU and multi-node checkpoints (planned for a future release): outbound TCP connections used for RPC cannot be checkpointed in an established state because the pod IP changes between checkpoint and restore, and RDMA registrations and NIC state need to be recreated post-restore.
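The worker side of the quiesce/resume handshake can be sketched in a few lines. This is an illustration of the pattern only: the file names, the `quiesce_and_wait` function, and the optional timeout are hypothetical and are not taken from Dynamo's code.

```python
import os
import time
from typing import Optional

READY_FILE = "ready_for_checkpoint"   # hypothetical signal file names
RESTORED_FILE = "restore_complete"


def quiesce_and_wait(signal_dir: str, poll_s: float = 0.05,
                     timeout_s: Optional[float] = None) -> bool:
    """Call after engine initialization and before distributed runtime startup.

    Writes the 'ready' signal, then polls for the 'restored' signal. A process
    restored by CRIU resumes inside this loop, sees the file, and returns True.
    """
    with open(os.path.join(signal_dir, READY_FILE), "w") as fh:
        fh.write("ready\n")

    restored_path = os.path.join(signal_dir, RESTORED_FILE)
    waited = 0.0
    while not os.path.exists(restored_path):
        # The timeout exists only so this sketch cannot hang; it counts sleep
        # intervals rather than reading a clock that may jump across a restore.
        if timeout_s is not None and waited >= timeout_s:
            return False
        time.sleep(poll_s)
        waited += poll_s
    return True
```

In this pattern the worker would call `quiesce_and_wait(...)` once its engine is warm and only open connections to the control plane after it returns, so no established TCP connection exists at the moment the checkpoint is taken.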
Optimization 1: KV Cache Unmap and Release

After measuring peak GPU memory usage while weights, CUDA graphs, and other buffers are allocated, inference engines allocate the remaining GPU memory as a large KV cache buffer. Since the checkpoint is taken before the replica has served any requests, this KV cache buffer does not need to be checkpointed at all. However, its virtual address must remain stable because it is baked into the CUDA graph.

The solution is to allocate the KV cache via the CUDA Virtual Memory Management API (cuMemCreate and cuMemMap), then free the underlying physical allocation with cuMemUnmap and cuMemRelease, but not cuMemAddressFree. This keeps the virtual address range intact while releasing the physical memory. This functionality is natively available in vLLM via sleep() and wake_up() and in SGLang via torch_memory_saver.

For Qwen3-0.6B on a B200, this reduces the total artifact size from ~190 GiB to ~6 GiB. The wins are most pronounced for large KV cache sizes, that is, smaller model weights relative to GPU size.

Optimization 2: Speeding Up CRIU Memory Restore

Even after the artifact is smaller, upstream CRIU restore time remains a bottleneck. For larger models, restore time actually exceeds



The Download: AI hacking beyond Mythos, and chatbots’ impact on our brains

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

The Meta hack shows there’s more to AI security than Mythos

On Monday, reports emerged that attackers had used Meta’s AI customer support agent to steal Instagram accounts. Their approach was simple: they asked the agent to link the accounts to email addresses they controlled, and it complied.

Since Anthropic announced that its Mythos model was too good at hacking for a general release, cybersecurity concerns have focused on the risk of superpowered AI systems overwhelming computer infrastructure. But the Instagram hack shows that far simpler exploits can still cause damage. As companies offload more work to AI, these comparatively unsophisticated attacks are becoming harder to ignore.

Read the full story to understand why.

By Grace Huckins

Are AI chatbots making us lose control of our brains?

Gloria Mark, a psychologist at the University of California, Irvine, fears that digital technologies are weakening our cognitive abilities. Her research suggests attention spans have fallen sharply over time, leading to higher stress and lower performance. She now believes AI tools like ChatGPT and Claude may accelerate this shift.

“You’re deferring your cognitive work to AI,” she said. “And it’s not good for us.” Mark argues this could weaken critical thinking and emotional intelligence. Luckily, she thinks we can course-correct by changing our relationship with these technologies.

Find out how AI could reshape attention and thinking.

By Jessica Hamzelou

This story is from The Checkup, our weekly newsletter giving you the inside track on all things biotech. Sign up to receive it in your inbox every Thursday.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.
1 Anthropic has called for a global slowdown in AI development
It flagged the risk of models “self-improving.” (WSJ $)
+ And wants a coordinated plan to stop them. (Reuters $)
+ Skeptics note that the timing is awfully convenient. (The Register)

2 In a first, scientists have precisely edited human embryo genes
They relied on a newer gene-editing technique. (NYT $)
+ Genetically-modified babies could be on their way. (Guardian)
+ Companies have big plans for the technology. (MIT Technology Review)

3 US officials have discussed taking financial stakes in the AI firms
They’ve held talks about the government acquiring shares. (Reuters $)
+ Sam Altman pitched the idea to the White House last year. (WSJ $)

4 Bot web traffic has overtaken human web traffic
Cloudflare said 57.4% of traffic now comes from bots. (NBC News)
+ Its CEO expected the milestone at the end of 2027. (CNET)

5 The White House plans to bring AI doctors into American medicine
It wants chatbots to diagnose illness and prescribe medicine. (WSJ $)
+ But we don’t even know if healthcare AI actually helps patients. (MIT Technology Review)

6 Meta quietly added facial recognition code for smart glasses to its app
The exploratory feature would identify people via biometric data. (Wired $)
+ Smart glasses are also entering warfare. (MIT Technology Review)

7 South Korea’s labour minister wants tech firms to share AI profits
Kim Young wants staff and suppliers to get a share. (Reuters $)
+ He helped avert a huge strike over AI profit-sharing at Samsung. (NYT $)

8 Canada’s highly-anticipated AI strategy has launched
It promises over $2 billion in funding and aims to create 250,000 jobs. (BBC)
+ AI could strengthen democracy. (MIT Technology Review)

9 Investment in agricultural tech is booming
That’s good news at a time when we’re facing unprecedented levels of food market volatility. (The Economist $)

10 Bumblebees can use tools to solve problems, new research shows
Not just busy: they’re clever too! (Guardian)

Quote of the day

“Welp, that happened faster than I predicted.”

Matthew Prince, co-founder and CEO of Cloudflare, one of the largest internet hosting services, reacts on X to reports that bots have overtaken humans in driving web traffic.

One More Thing

Inside the machine that saved Moore’s Law

In a Connecticut clean room, the Dutch company ASML is developing the world’s most advanced machine for extreme ultraviolet (EUV) lithography, a crucial process for manufacturing microchips.

The system has become vital to Moore’s Law, the observation that the number of transistors on a chip roughly doubles every two years as components shrink, driving gains in performance and efficiency.

“Without this machine, it’s gone,” says Wayne Lam, a director of research at CCS Insight. “You can’t really make any leading-edge processors without EUV.”

Discover how ASML’s EUV technology saved Moore’s Law.

By Clive Thompson

We can still have nice things

A place for comfort, fun, and distraction to brighten up your day. (Got any ideas? Drop me a line.)

+ Tech bosses love Tolkien. Here’s what the writer might think of them.
+ Rare footage captures an underwater volcano erupting beneath the Pacific Ocean.
+ Watch a tiny rescued cub grow into adulthood in this heartwarming tiger compilation.
+ This medieval version of “Take On Me” is like stepping into a tavern of synth-pop bards.



Meet OpenJarvis: A Local-First Framework for On-Device Personal AI Agents with Tools, Memory, and Learning

Researchers at Stanford University and Lambda Labs have published the research paper for OpenJarvis, an open-source framework that runs inference, agents, memory, and learning entirely on-device. The open-weight models configured through OpenJarvis land within 3.2 percentage points of the best cloud model on average, at roughly 800× lower marginal API cost per query and roughly 4× lower latency under the paper’s benchmark protocol.

The work builds on the research team’s earlier Intelligence Per Watt study, which reported that local models already handle 88.7% of single-turn chat and reasoning queries at interactive latency, with intelligence efficiency improving 5.3× from 2023 to 2025.

Model Overview & Access

OpenJarvis is not a single model. It is a framework that composes any supported model with a configurable agent stack, evaluated across 11 local models from four families.

License: Apache 2.0
Framework release: March 12, 2026
Paper: arXiv:2605.17172 (posted May 16, 2026)
Repository: github.com/open-jarvis/OpenJarvis
Stars / forks: ~5.4k / ~1.2k (June 2026)
Languages: Python (~83%), Rust (~9%), TypeScript (~7%)
Evaluated models: 11 local models across 4 families: Qwen3.5, Gemma4, Nemotron, Granite
Cloud baselines: Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro
Supported engines: Ollama, vLLM, SGLang, llama.cpp, Apple Foundation Models, Exo (among others)
Context window: Model-dependent
Installation: Single command; ~3 minutes on broadband
Hardware: Tested on 7 platforms, from Mac Mini M4 to NVIDIA DGX Spark

Architecture: Five Primitives and a Spec

OpenJarvis decomposes a personal AI system into five typed primitives, composed through a single declarative configuration object called a spec.

Intelligence: the model, weights, generation parameters, and quantization format.
Engine: the inference runtime (Ollama, vLLM, SGLang, etc.), batching, KV-cache settings, and hardware path.
Agents: the reasoning loop (ReAct or CodeAct), system prompts, tool-use policy, and turn limits.
Tools & Memory: external interfaces, retrieval backends, 25+ data connectors, and 32+ messaging channels, with native MCP support and interchangeable memory backends.
Learning: the optimizer that updates the spec from traces. This slot accepts LoRA, DSPy, GEPA, or LLM-guided spec search.

Each primitive is independently swappable, and a spec serializes all five into a TOML file. Two specs can share the same agent and tool configuration and differ only in model and engine, so the same behavior runs on a Mac Mini and a workstation without rewriting prompts.

LLM-guided spec search is the second contribution. It is a local–cloud collaboration: a frontier cloud model acts as a teacher at search time, reading traces, diagnosing failure clusters, and proposing edits across Intelligence, Engine, Agents, and Tools & Memory. An edit is accepted only if it improves the target failure cluster without causing meaningful regressions elsewhere; the research team calls this the gate (default tolerance 1%).

The optimized spec then runs entirely on-device at inference time, with zero cloud calls. The teacher is used only at search time; at 100 queries per day, the amortized teacher cost falls below $0.001 per query within six months.

Prior work (GEPA, DSPy, LoRA) optimizes one primitive at a time, and prompt optimizers alone recover only about 5 pp of the cloud–local gap. LLM-guided spec search recovers 13–32 pp because it edits across primitives jointly, at 7–11× lower optimization cost than single-primitive baselines. The four-primitive move space contributes 5.5–16.5 pp, and the LLM proposer adds about 10 pp on average over an evolutionary search at the same move space.
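The acceptance rule for a proposed spec edit (the gate) can be written down compactly. The sketch below is an interpretation, not the paper's code: it assumes per-cluster accuracy is tracked as a fraction between 0 and 1, that the 1% tolerance is an absolute drop of 0.01 in any non-target cluster, and the function and cluster names are hypothetical.

```python
from typing import Dict


def passes_gate(before: Dict[str, float], after: Dict[str, float],
                target: str, tolerance: float = 0.01) -> bool:
    """Accept an edit only if the target cluster improves and no other
    cluster regresses by more than `tolerance` (assumed absolute accuracy)."""
    if after[target] <= before[target]:
        return False  # the edit did not help the failure cluster it was aimed at
    for cluster, old_score in before.items():
        if cluster == target:
            continue
        if old_score - after[cluster] > tolerance:
            return False  # meaningful regression elsewhere
    return True


if __name__ == "__main__":
    baseline = {"tool_calling": 0.60, "coding": 0.80, "research": 0.70}
    small_cost = {"tool_calling": 0.72, "coding": 0.795, "research": 0.70}
    big_cost = {"tool_calling": 0.72, "coding": 0.75, "research": 0.70}
    print(passes_gate(baseline, small_cost, target="tool_calling"))
    print(passes_gate(baseline, big_cost, target="tool_calling"))
```

The first candidate improves the target with only a 0.5-point dip in coding, so it is accepted; the second trades a 5-point coding regression for the same gain, so it is rejected.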
https://arxiv.org/pdf/2605.17172v1

Capabilities & Performance

OpenJarvis was evaluated across 8 benchmarks spanning 508 tasks: tool calling (ToolCall-15), agentic workflows (PinchBench), coding (LiveCodeBench), customer service (τ-Bench V2, τ²-Bench Telecom), general assistance (GAIA), and deep research (LiveResearchBench, DeepResearchBench).

The swap test: Replacing the intended cloud model with Qwen3.5-9B in existing frameworks (OpenClaw, Hermes Agent) drops accuracy by 25–39 pp. With the same model under an OpenJarvis spec, the residual drop shrinks to 5.6–16.5 pp, recovering 56–77% of the portability loss.

The accuracy frontier: The best single local model, Qwen3.5-122B, reaches 80.3% average accuracy versus Claude Opus 4.6 at 83.5%, a 3.2 pp gap. Local specs match or exceed cloud on 4 of 8 benchmarks: ToolCall-15, PinchBench, LiveCodeBench, and τ-Bench V2.

Cost and latency: Local configurations form the accuracy–efficiency frontier. Qwen3.5-122B delivers its 80.3% at roughly a thousandth of a cent per query, versus $0.009 per query for Claude Opus 4.6, an approximately 800× marginal API-cost advantage. End-to-end latency drops by roughly 4× on the agentic workloads, though the paper notes single-shot prompts can favor cloud serving.

Search gains: LLM-guided spec search improves the Qwen3.5-9B student to 100% on PinchBench, 83% on LiveCodeBench, and 91% on LiveResearchBench. Across the full eight-benchmark suite, average gains per student model range from 13.1 to 31.5 pp. The authors report that these gains survive their robustness checks (reward-weight variants, search-seed variance, and random restarts).

How to Use it

Installation is one command. On macOS, Linux, or WSL2:

    curl -fsSL https://open-jarvis.github.io/OpenJarvis/install.sh | bash

Windows users run an equivalent PowerShell script (irm … | iex).
The installer provisions uv, a Python virtual environment, Ollama, and a starter model in about three minutes on broadband. A desktop GUI ships as a .dmg, .exe, .deb, .rpm, or .AppImage from the releases page.

After install, jarvis starts a chat session. Starter presets cover common workflows:

    jarvis init --preset morning-digest-mac   # daily briefing with TTS
    jarvis init --preset deep-research        # multi-hop research with citations
    jarvis init --preset code-assistant       # agent with code execution and shell access
    jarvis init --preset scheduled-monitor    # stateful agent on a schedule

The framework ships with eight built-in agents across three execution modes: on-demand, scheduled, and continuous. It connects to 25+ data sources (Gmail, Calendar, iMessage, Notion, Obsidian, Slack, GitHub, and others) and exposes agents over 32+ messaging channels (WhatsApp, Telegram, Discord, iMessage, Signal, and others).

Skills can be imported from external catalogs (about 150 from Hermes Agent and about 13,700 community skills from OpenClaw), all following the agentskills.io specification. A jarvis optimize skills --policy dspy command refines them from local trace history.



Miso Labs Releases MisoTTS: An 8B Emotive Text-to-Speech Model with Open Weights

Miso Labs has released MisoTTS, an open-weights 8-billion-parameter text-to-speech model. It generates expressive speech from both text and audio context. The model uses residual vector quantization (RVQ) to widen its sonic range. This avoids scaling a single flat vocabulary while keeping parameter count fixed.

What is MisoTTS

MisoTTS is an 8B-parameter text-to-dialogue RVQ Transformer. It is inspired by the Sesame CSM architecture. It pairs a Llama 3.2-style backbone with a smaller audio decoder. It generates Mimi audio codes from text and optional audio context.

The model conditions on both text and prior audio. That second input lets it respond to the speaker’s tone. The text vocabulary is 128,256 tokens, and there are 32 audio codebooks. Mimi is the audio tokenizer, and max sequence length is 2,048. Default inference runs in torch.bfloat16.

Miso Labs claims 110ms latency. It lists ElevenLabs at 700ms and Sesame at 300ms.

The Vocabulary Size Problem

Standard transformers generate from a fixed vocabulary of discrete tokens. That works when a small vocabulary covers the target space. Human speech does not fit that assumption. It varies across pitch, rhythm, emphasis, emotion, and accent.

Expanding the audio vocabulary is the obvious fix. But larger vocabularies need more parameters in a standard transformer. Each token must be represented and predicted by the model. Miso Labs calls this the vocabulary size problem.

The second issue is conditioning. Most TTS models condition only on text. They ignore the interlocutor’s tone. Miso Labs argues this contributes to the “uncanny valley” effect.

Residual Vector Quantization: The Core Idea

MisoTTS addresses both problems with residual vector quantization (RVQ). Miso Labs traces RVQ to image-generation research and to Sesame’s CSM for audio. Instead of one token index, the model emits a vector of indices. Each audio token is 32 codebook indices over 2048-way codebooks.
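Two things follow from an audio token being 32 indices into 2048-way codebooks: the number of distinct tokens that can be addressed, and the way a vector of indices maps back to a single embedding. The sketch below works through both. It assumes the usual RVQ convention that one vector is looked up per codebook and the results are summed; the 8-dimensional random codebooks and the `decode_token` helper are placeholders, not MisoTTS weights or code.

```python
import math

import numpy as np

CODEBOOK_SIZE = 2048  # entries per codebook (from the article)
DEPTH = 32            # codebooks per audio token (from the article)
EMBED_DIM = 8         # placeholder width, chosen only to keep the demo small

# Addressable vocabulary: one choice out of 2048 at each of 32 positions.
vocab_size = CODEBOOK_SIZE ** DEPTH                # exact Python integer
log10_vocab = DEPTH * math.log10(CODEBOOK_SIZE)    # base-10 exponent of that count

# Parameters grow with DEPTH * CODEBOOK_SIZE, not with vocab_size.
rng = np.random.default_rng(0)
codebooks = rng.standard_normal((DEPTH, CODEBOOK_SIZE, EMBED_DIM))


def decode_token(indices: np.ndarray, books: np.ndarray) -> np.ndarray:
    """Sum one looked-up vector per codebook position (assumed RVQ decode)."""
    indices = np.asarray(indices)
    return books[np.arange(len(indices)), indices].sum(axis=0)


token = rng.integers(0, CODEBOOK_SIZE, size=DEPTH)  # one audio token = 32 indices
vector = decode_token(token, codebooks)

if __name__ == "__main__":
    print(f"codebook entries stored: {DEPTH * CODEBOOK_SIZE}")
    print(f"addressable tokens: about 10^{log10_vocab:.2f}")
    print(f"decoded embedding shape: {vector.shape}")
```

The stored table holds only 32 × 2048 = 65,536 vectors, while the count of addressable combinations is on the order of 10^105, which is the gap the article's scaling argument relies on.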
The model keeps a separate codebook for each position in the vector. To recover the sound, it sums the looked-up vectors. Each codebook adds another refinement to the signal.

This is what makes the scaling work. Addressable vocabulary equals codebook size raised to the depth. Growing the depth adds no parameters to the model. So MisoTTS reaches about 2048^32, or roughly 10^105, addressable tokens. Miso Labs notes naive scaling would require a far larger network.

https://www.misolabs.ai/blog/miso-tts-8b

The Two-Transformer Architecture

The model splits into a backbone and a decoder. The backbone is a 7.7B-parameter transformer, autoregressive over time. It predicts the first codebook index and a final hidden state.

A 300M-parameter decoder then runs autoregressively over depth. It predicts the remaining codebook indices, one position at a time. Each prediction conditions on the indices already chosen in the frame. The same 300M parameters are reused for every position.

Embeddings follow the same logic. Text tokens use a single lookup. An audio token’s embedding is the sum of per-position codebook lookups. Interleaving text and audio lets the backbone use conversation history. That is how it carries context across turns.

Strengths and Challenges

Strengths:
Open weights on day one, under a modified MIT license.
RVQ scales the sonic range without scaling parameter count.
Conditions on audio context, not text alone.
Local deployment keeps sensitive audio data in-house.
The architecture and math are documented in a public blog post.

Challenges:
Half-duplex only, with no turn-taking yet.
The large model needs a capable CUDA GPU.
API access is announced but not yet available.
Latency and quality claims still need third-party testing.

Marktechpost’s Visual Explainer

Model Brief: MisoTTS
Open-Weights Release · June 3, 2026
An 8B emotive text-to-speech model from Miso Labs, built on residual vector quantization and conditioned on both text and audio.
8B params · RVQ Transformer · Mimi codes · modified MIT

What MisoTTS Is: A text-to-dialogue RVQ Transformer
An 8B-parameter model inspired by the Sesame CSM architecture.
Pairs a Llama 3.2-style backbone with a smaller audio decoder.
Generates Mimi audio codes from text and optional audio context.
Conditions on prior audio, so output responds to speaker tone.

At a Glance: Published specifications
Parameters: 8B (7.7B + 300M)
Architecture: RVQ Transformer
Audio codebooks: 32 (2048-way)
Audio tokenizer: Mimi
Text vocabulary: 128,256
Max sequence length: 2,048
Default precision: torch.bfloat16
License: modified MIT

The Motivation: The vocabulary size problem
Transformers generate from a fixed vocabulary of discrete tokens.
Speech varies in pitch, rhythm, emphasis, emotion, and accent.
A bigger audio vocabulary needs more parameters in a standard transformer.
Most TTS models condition only on text, ignoring tone, which contributes to the “uncanny valley” effect.

The Core Idea: Residual vector quantization
The model emits a vector of indices, not a single token index.
Each token is 32 codebook indices over 2048-way codebooks.
Summing the looked-up vectors reconstructs the sound.
Depth scales addressable vocabulary to ~2048^32 (≈10^105) with no added parameters.

Architecture: Two transformers, one vector token
Backbone (7.7B): autoregressive over time; predicts codebook index k₁ and hidden state h₀.
Decoder (300M): autoregressive over depth; predicts k₂ through k₃₂.
The same 300M parameters are reused for every position.
Interleaved text and audio let the backbone use conversation history.

Run It Locally: Inference in a few lines

    from generator import load_miso_8b
    import torchaudio

    gen = load_miso_8b(device="cuda", model_path_or_repo_id="MisoLabs/MisoTTS")
    audio = gen.generate(
        text="Hello from Miso.",
        speaker=0,
        context=[],
        max_audio_length_ms=10_000)
    torchaudio.save("miso.wav", audio.unsqueeze(0).cpu(), gen.sample_rate)

Setup uses uv with Python 3.10. Weights download from Hugging Face.
Audio is watermarked by default via SilentCipher. One-shot voice cloning works from a ~10-second clip.

Limitations: Where it stops, for now
Handles individual turns only; no turn-taking yet.
Generates half-duplex audio: it cannot speak while the other party speaks.
Miso Labs frames full-duplex and turn-taking as future work.
API access is announced but not yet available.

Key Takeaways: The short version
Open-weights 8B TTS under a modified MIT license.
Conditions on text and audio, so output tracks speaker tone.
RVQ scales vocabulary to ~2048^32 without adding parameters.
7.7B backbone over time, 300M decoder over depth.
Half-duplex and single-turn today; API access pending.

Key Takeaways

Miso Labs open-sourced MisoTTS, an 8B text-to-speech

