
How to Evaluate Voice Agents in 2025: Beyond Automatic Speech Recognition (ASR) and Word Error Rate (WER) to Task Success, Barge-In, and Hallucination-Under-Noise

Table of contents: Why WER Isn't Enough? · What to Measure (and How)? · Benchmark Landscape: What Each Covers · Filling the Gaps: What You Still Need to Add · A Concrete, Reproducible Evaluation Plan · References

Optimizing only for Automatic Speech Recognition (ASR) and Word Error Rate (WER) is insufficient for modern, interactive voice agents. Robust evaluation must measure end-to-end task success, barge-in behavior and latency, and hallucination-under-noise—alongside ASR, safety, and instruction following. VoiceBench offers a multi-facet speech-interaction benchmark across general knowledge, instruction following, safety, and robustness to speaker/environment/content variations, but it does not cover barge-in or real-device task completion. SLUE (and Phase-2) target spoken language understanding (SLU); MASSIVE and Spoken-SQuAD probe multilingual and spoken QA; DSTC tracks add spoken, task-oriented robustness. Combine these with explicit barge-in/endpointing tests, user-centric task-success measurement, and controlled noise-stress protocols to obtain a complete picture.

Why WER Isn't Enough?

WER measures transcription fidelity, not interaction quality. Two agents with similar WER can diverge widely in dialog success because latency, turn-taking, misunderstanding recovery, safety, and robustness to acoustic and content perturbations dominate user experience. Prior work on real systems shows the need to evaluate user satisfaction and task success directly—e.g., Cortana's automatic online evaluation predicted user satisfaction from in-situ interaction signals, not only ASR accuracy.

What to Measure (and How)?

1) End-to-End Task Success

Metric: Task Success Rate (TSR) with strict success criteria per task (goal completion, constraints met), plus Task Completion Time (TCT) and Turns-to-Success.
Why: Real assistants are judged by outcomes. Competitions like the Alexa Prize TaskBot explicitly measured users' ability to finish multi-step tasks (e.g., cooking, DIY) with ratings and completion.
Protocol: Define tasks with verifiable endpoints (e.g., "assemble a shopping list with N items and constraints"). Use blinded human raters and automatic logs to compute TSR/TCT/Turns. For multilingual/SLU coverage, draw task intents/slots from MASSIVE.

2) Barge-In and Turn-Taking

Metrics: Barge-In Detection Latency (ms): time from user onset to TTS suppression. True/False Barge-In Rates: correct interruptions vs. spurious stops. Endpointing Latency (ms): time to ASR finalization after the user stops.
Why: Smooth interruption handling and fast endpointing determine perceived responsiveness. Research formalizes barge-in verification and continuous barge-in processing; endpointing latency continues to be an active area in streaming ASR.
Protocol: Script prompts where the user interrupts TTS at controlled offsets and SNRs. Measure suppression and recognition timings with high-precision logs (frame timestamps). Include noisy/echoic far-field conditions. Classic and modern studies provide recovery and signaling strategies that reduce false barge-ins. A minimal measurement sketch follows this section.
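The protocol above leaves the bookkeeping implicit, so here is a minimal sketch of how the four barge-in/endpointing KPIs could be computed from per-trial timestamps. The BargeInTrial record is an assumed log format for illustration, not a standard schema; a real harness would pull these timestamps from frame-level audio and TTS playback logs.

from dataclasses import dataclass
from statistics import mean
from typing import Optional

@dataclass
class BargeInTrial:
    user_interrupted: bool            # the scripted user actually barged in during TTS
    user_onset: Optional[float]       # interruption onset (s), None if no interruption
    tts_suppressed: Optional[float]   # when TTS playback stopped (s), None if it never stopped
    user_offset: Optional[float]      # when the user finished speaking (s)
    asr_final: Optional[float]        # when the final ASR hypothesis was emitted (s)

def barge_in_report(trials: list[BargeInTrial]) -> dict:
    interrupted = [t for t in trials if t.user_interrupted]
    suppressed = [t for t in interrupted if t.tts_suppressed is not None]
    non_interrupted = [t for t in trials if not t.user_interrupted]
    endpointed = [t for t in trials if t.user_offset is not None and t.asr_final is not None]
    return {
        "barge_in_detection_latency_ms": mean((t.tts_suppressed - t.user_onset) * 1000 for t in suppressed),
        "true_barge_in_rate": len(suppressed) / max(1, len(interrupted)),
        "false_barge_in_rate": sum(t.tts_suppressed is not None for t in non_interrupted) / max(1, len(non_interrupted)),
        "endpointing_latency_ms": mean((t.asr_final - t.user_offset) * 1000 for t in endpointed),
    }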
3) Hallucination-Under-Noise (HUN)

Metric: HUN Rate, the fraction of outputs that are fluent but semantically unrelated to the audio, under controlled noise or non-speech audio.
Why: ASR and audio-LLM stacks can emit "convincing nonsense," especially with non-speech segments or noise overlays. Recent work defines and measures ASR hallucinations; targeted studies show Whisper hallucinations induced by non-speech sounds.
Protocol: Construct audio sets with additive environmental noise (varied SNRs), non-speech distractors, and content disfluencies. Score semantic relatedness (human judgment with adjudication) and compute HUN. Track whether downstream agent actions propagate hallucinations into incorrect task steps. A scoring sketch follows this section.

4) Instruction Following, Safety, and Robustness

Metric families: Instruction-Following Accuracy (format and constraint adherence); Safety Refusal Rate on adversarial spoken prompts; Robustness Deltas across speaker age/accent/pitch, environment (noise, reverb, far-field), and content noise (grammar errors, disfluencies).
Why: VoiceBench explicitly targets these axes with spoken instructions (real and synthetic) spanning general knowledge, instruction following, and safety; it perturbs speaker, environment, and content to probe robustness.
Protocol: Use VoiceBench for breadth on speech-interaction capabilities; report aggregate and per-axis scores. For SLU specifics (NER, dialog acts, QA, summarization), leverage SLUE and Phase-2.

5) Perceptual Speech Quality (for TTS and Enhancement)

Metric: Subjective Mean Opinion Score via ITU-T P.808 (crowdsourced ACR/DCR/CCR).
Why: Interaction quality depends on both recognition and playback quality. P.808 gives a validated crowdsourcing protocol with open-source tooling.
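As a concrete illustration of the HUN computation, here is a small sketch that assumes each evaluated output already carries adjudicated human relatedness judgments; the field names are assumptions for this example, not part of a published protocol.

from dataclasses import dataclass
from typing import Optional

@dataclass
class NoisyEvalItem:
    snr_db: Optional[float]       # SNR of the noise condition, or None for non-speech distractors
    fluent: bool                  # output reads as well-formed language
    semantically_related: bool    # adjudicated human judgment vs. the reference audio
    propagated_to_action: bool    # the agent executed a task step based on this output

def hun_report(items: list[NoisyEvalItem]) -> dict:
    hallucinated = [x for x in items if x.fluent and not x.semantically_related]
    return {
        "hun_rate": len(hallucinated) / max(1, len(items)),
        "action_propagation_rate": sum(x.propagated_to_action for x in hallucinated) / max(1, len(hallucinated)),
    }

# Report per condition, e.g. hun_report([x for x in items if x.snr_db is not None and x.snr_db <= 0]).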
Benchmark Landscape: What Each Covers

VoiceBench (2024)
Scope: Multi-facet voice assistant evaluation with spoken inputs covering general knowledge, instruction following, safety, and robustness across speaker/environment/content variations; uses both real and synthetic speech.
Limitations: Does not benchmark barge-in/endpointing latency or real-world task completion on devices; focuses on response correctness and safety under variations.

SLUE / SLUE Phase-2
Scope: Spoken language understanding tasks: NER, sentiment, dialog acts, named-entity localization, QA, summarization; designed to study end-to-end vs. pipeline sensitivity to ASR errors.
Use: Great for probing SLU robustness and pipeline fragility in spoken settings.

MASSIVE
Scope: >1M virtual-assistant utterances across 51–52 languages with intents/slots; strong fit for multilingual task-oriented evaluation.
Use: Build multilingual task suites and measure TSR/slot F1 under speech conditions (paired with TTS or read speech).

Spoken-SQuAD / HeySQuAD and Related Spoken-QA Sets
Scope: Spoken question answering to test ASR-aware comprehension and multi-accent robustness.
Use: Stress-test comprehension under speech errors; not a full agent task suite.

DSTC (Dialog System Technology Challenge) Tracks
Scope: Robust dialog modeling with spoken, task-oriented data; human ratings alongside automatic metrics; recent tracks emphasize multilinguality, safety, and evaluation dimensionality.
Use: Complementary for dialog quality, DST, and knowledge-grounded responses under speech conditions.

Real-World Task Assistance (Alexa Prize TaskBot)
Scope: Multi-step task assistance with user ratings and success criteria (cooking/DIY).
Use: Gold-standard inspiration for defining TSR and interaction KPIs; the public reports describe evaluation focus and outcomes.

Filling the Gaps: What You Still Need to Add

Barge-In & Endpointing KPIs: Add explicit measurement harnesses. Literature offers barge-in verification and continuous barge-in processing strategies; streaming ASR endpointing latency remains an active research topic. Track barge-in detection latency, suppression correctness, endpointing delay, and false barge-ins.
Hallucination-Under-Noise (HUN) Protocols: Adopt emerging ASR-hallucination definitions and controlled noise/non-speech tests; report HUN rate and its impact on downstream actions.
On-Device Interaction Latency: Correlate user-perceived latency with streaming ASR designs (e.g., transducer variants); measure time-to-first-token, time-to-final, and local processing overhead.
Cross-Axis Robustness Matrices: Combine VoiceBench's speaker/environment/content axes with your task suite (TSR) to expose failure surfaces (e.g., barge-in under far-field echo; task success at low SNR; multilingual slots under accent shift).
Perceptual Quality for Playback: Use ITU-T P.808 (with the open P.808 toolkit) to quantify user-perceived TTS quality in your end-to-end loop, not just ASR.

A Concrete, Reproducible Evaluation Plan

Assemble the Suite
Speech-Interaction Core: VoiceBench for knowledge, instruction following, safety, and robustness axes.
SLU Depth: SLUE/Phase-2 tasks (NER, dialog acts,


Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use Mixture

What if, instead of re-sampling one agent, you could push Gemini-2.5 Pro to 34.1% on HLE by mixing 12–15 tool-using agents that share notes and stop early? Google Cloud AI Research, with collaborators from MIT, Harvard, and Google DeepMind, introduced TUMIX (Tool-Use Mixture), a test-time framework that ensembles heterogeneous agent styles (text-only, code, search, guided variants) and lets them share intermediate answers over a few refinement rounds, then stop early via an LLM-based judge. The result: higher accuracy at lower cost on hard reasoning benchmarks such as HLE, GPQA-Diamond, and AIME (2024/2025).

https://arxiv.org/pdf/2510.01279

So, what exactly is new?

Mixture over modality, not just more samples: TUMIX runs ~15 agent styles spanning Chain-of-Thought (CoT), code execution, web search, dual-tool agents, and guided variants. In each round, every agent sees (a) the original question and (b) the other agents' previous answers, then proposes a refined answer. This message-passing raises average accuracy early while diversity gradually collapses—so stopping matters.

Adaptive early-termination: An LLM-as-Judge halts refinement once answers exhibit strong consensus (with a minimum round threshold). This preserves accuracy at ~49% of the inference cost vs. fixed-round refinement; token cost drops to ~46% because late rounds are token-heavier.

Auto-designed agents: Beyond human-crafted agents, TUMIX prompts the base LLM to generate new agent types; mixing these with the manual set yields an additional ~+1.2% average lift without extra cost. The empirical "sweet spot" is ~12–15 agent styles.

https://arxiv.org/pdf/2510.01279

How does it work?

TUMIX runs a group of heterogeneous agents—text-only Chain-of-Thought, code-executing, web-searching, and guided variants—in parallel, then iterates a small number of refinement rounds where each agent conditions on the original question plus the other agents' prior rationales and answers (structured note-sharing). After each round, an LLM-based judge evaluates consensus/consistency to decide early termination; if confidence is insufficient, another round is triggered, otherwise the system finalizes via simple aggregation (e.g., majority vote or a selector). This mixture-of-tool-use design trades brute-force re-sampling for diverse reasoning paths, improving coverage of correct candidates while controlling token/tool budgets; empirically, benefits saturate around 12–15 agent styles, and stopping early preserves diversity and lowers cost without sacrificing accuracy. A schematic sketch of the loop is shown below.
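To make the loop concrete, here is a minimal sketch of a TUMIX-style refinement cycle. The agent and judge callables are hypothetical placeholders for illustration, not the authors' implementation, and the aggregation shown is a simple majority vote.

from collections import Counter
from typing import Callable

def tumix_loop(
    question: str,
    agents: list[Callable[[str, list[str]], str]],  # each maps (question, shared notes) -> answer
    judge: Callable[[str, list[str]], bool],         # returns True when answers show strong consensus
    max_rounds: int = 3,
    min_rounds: int = 1,
) -> str:
    shared_notes: list[str] = []                     # previous-round answers visible to every agent
    answers: list[str] = []
    for round_idx in range(max_rounds):
        # Run heterogeneous agent styles (in practice, in parallel) with structured note-sharing.
        answers = [agent(question, shared_notes) for agent in agents]
        shared_notes = answers
        # Adaptive early termination via an LLM-based judge, after a minimum number of rounds.
        if round_idx + 1 >= min_rounds and judge(question, answers):
            break
    return Counter(answers).most_common(1)[0][0]     # simple aggregation, e.g. majority vote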
Let's discuss the results

Under comparable inference budgets to strong tool-augmented baselines (Self-MoA, Symbolic-MoE, DEI, SciMaster, GSA), TUMIX yields the best average accuracy; a scaled variant (TUMIX+) pushes further with more compute:

HLE (Humanity's Last Exam): Pro: 21.6% → 34.1% (TUMIX+); Flash: 9.7% → 23.1%. (HLE is a 2,500-question, difficult, multi-domain benchmark finalized in 2025.)
GPQA-Diamond: Pro: up to 88.3%; Flash: up to 82.1%. (GPQA-Diamond is the hardest 198-question subset, authored by domain experts.)
AIME 2024/25: Pro: 96.7%; Flash: 86.7% with TUMIX(+) at test time.

Across tasks, TUMIX averages +3.55% over the best prior tool-augmented test-time scaling baseline at similar cost, and +7.8% / +17.4% over no-scaling for Pro/Flash, respectively.

https://arxiv.org/pdf/2510.01279

Our Comments

TUMIX is a strong approach from Google because it frames test-time scaling as a search problem over heterogeneous tool policies rather than brute-force sampling. The parallel committee (text, code, search) improves candidate coverage, while the LLM judge enables early stopping that preserves diversity and reduces token/tool spend—useful under latency budgets. The HLE gains (34.1% with Gemini-2.5 Pro) align with the benchmark's finalized 2,500-question design, and the ~12–15 agent styles "sweet spot" indicates that selection—not generation—is the limiting factor.

Check out the Paper. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.

The post Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use Mixture appeared first on MarkTechPost.


A Coding Implementation to Build a Transformer-Based Regression Language Model to Predict Continuous Values from Text

We will build a Regression Language Model (RLM), a model that predicts continuous numerical values directly from text sequences, in this coding implementation. Instead of classifying or generating text, we focus on training a transformer-based architecture that learns quantitative relationships hidden within natural language descriptions. We start by generating synthetic text-to-number data, tokenizing it efficiently, and then train a lightweight Transformer encoder to map linguistic cues to real-valued targets. By the end, we not only understand how RLMs can be implemented from scratch but also visualize their learning behavior and test their generalization on unseen examples. Check out the FULL CODES here.

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
from collections import Counter
import re

torch.manual_seed(42)
np.random.seed(42)

# Device used throughout training (referenced later by train_rlm)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Regression Language Model (RLM) Tutorial")
print("=" * 60)

We begin by importing essential libraries, such as PyTorch, NumPy, and Matplotlib, to build and visualize our Regression Language Model. We set random seeds to ensure reproducibility and initialize the environment, thereby guaranteeing consistent results each time the tutorial is run. Check out the FULL CODES here.

def generate_synthetic_data(n_samples=2000):
    """Generate synthetic text-to-number regression data."""
    templates = [
        ("The temperature is {} degrees", lambda x: x),
        ("I rate this {} out of ten", lambda x: x),
        ("The price is {} dollars", lambda x: x),
        ("Confidence level: {}", lambda x: x / 100),
        ("Speed of {} kilometers per hour", lambda x: x / 10),
        ("{} percent complete", lambda x: x / 100),
        ("Scored {} points in the game", lambda x: x / 10),
        ("The distance is {} meters", lambda x: x),
    ]
    data = []
    for _ in range(n_samples):
        template, transform = templates[np.random.randint(len(templates))]
        value = np.random.uniform(0, 100)
        text = template.format(round(value, 1))
        target = transform(value)
        data.append((text, target))
    return data

We create a synthetic dataset that pairs natural language sentences with corresponding numerical values. By using varied templates such as temperatures, ratings, and percentages, we ensure the model learns diverse text–number relationships. This controlled setup helps us simulate realistic regression tasks without relying on external data. Check out the FULL CODES here.

class SimpleTokenizer:
    def __init__(self):
        self.word2idx = {"<PAD>": 0, "<UNK>": 1}
        self.idx2word = {0: "<PAD>", 1: "<UNK>"}
        self.vocab_size = 2

    def fit(self, texts):
        """Build vocabulary from texts."""
        words = []
        for text in texts:
            words.extend(re.findall(r'\w+|[^\w\s]', text.lower()))
        word_counts = Counter(words)
        for word, _ in word_counts.most_common():
            if word not in self.word2idx:
                self.word2idx[word] = self.vocab_size
                self.idx2word[self.vocab_size] = word
                self.vocab_size += 1

    def encode(self, text, max_len=20):
        """Convert text to token indices, padded or truncated to max_len."""
        words = re.findall(r'\w+|[^\w\s]', text.lower())
        indices = [self.word2idx.get(w, 1) for w in words]
        if len(indices) < max_len:
            indices += [0] * (max_len - len(indices))
        else:
            indices = indices[:max_len]
        return indices

We design a simple tokenizer to convert raw text into numerical tokens that the model can process.
It builds a vocabulary from all unique words and maps each to an index, handling unknown words and padding automatically. This step ensures our textual inputs are transformed into consistent, machine-readable sequences for training. Check out the FULL CODES here.

class RLMDataset(Dataset):
    def __init__(self, data, tokenizer, max_len=20):
        self.data = data
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text, target = self.data[idx]
        tokens = self.tokenizer.encode(text, self.max_len)
        return torch.tensor(tokens), torch.tensor([target], dtype=torch.float32)

class RegressionLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_heads=4, num_layers=2, dropout=0.1, max_len=20):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.position_embedding = nn.Embedding(max_len, embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=embed_dim * 4,
            dropout=dropout,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.fc1 = nn.Linear(embed_dim, 64)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(64, 1)
        self.max_len = max_len

    def forward(self, x):
        batch_size, seq_len = x.shape
        positions = torch.arange(0, seq_len, device=x.device).unsqueeze(0).expand(batch_size, -1)
        token_embed = self.token_embedding(x)
        pos_embed = self.position_embedding(positions)
        embeddings = token_embed + pos_embed
        padding_mask = (x == 0)
        encoded = self.transformer(embeddings, src_key_padding_mask=padding_mask)
        # Mean-pool over non-padded positions only
        mask_expanded = (~padding_mask).unsqueeze(-1).float()
        summed = (encoded * mask_expanded).sum(dim=1)
        pooled = summed / mask_expanded.sum(dim=1)
        x = self.fc1(pooled)
        x = self.relu(x)
        x = self.dropout(x)
        output = self.fc2(x)
        return output

We package our text–number pairs into a PyTorch Dataset, where we tokenize each sentence and return tensors ready for batching. We then build a Transformer-based RLM: token and positional embeddings flow through a multi-layer encoder, we mean-pool non-padded tokens, and feed the result to a small MLP head for regression. In effect, we allow the encoder to learn numerical cues from language, while the head maps them to a single continuous value. Check out the FULL CODES here.
def train_rlm(model, train_loader, val_loader, epochs=15, lr=0.001):
    # Ensure the model and the batches share the same device
    model = model.to(device)
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    train_losses, val_losses = [], []
    print(f"\nTraining on {device}")
    print("-" * 60)
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        for tokens, targets in train_loader:
            tokens, targets = tokens.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = model(tokens)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        train_loss /= len(train_loader)
        train_losses.append(train_loss)
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for tokens, targets in val_loader:
                tokens, targets = tokens.to(device), targets.to(device)
                outputs = model(tokens)
                loss = criterion(outputs, targets)
                val_loss += loss.item()
        val_loss /= len(val_loader)
        val_losses.append(val_loss)
        print(f"Epoch {epoch+1:2d}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
    return train_losses, val_losses

We train the model using Adam and MSE loss on a GPU, if available, iterating over mini-batches to backpropagate and update weights. We switch to evaluation mode for validation at the end of each epoch, track training and validation losses, and print progress so we can see the learning dynamics in real time. Check out the FULL CODES here.

print("\nGenerating synthetic data...")
data = generate_synthetic_data(2000)
split_idx = int(0.8 * len(data))
train_data, val_data = data[:split_idx], data[split_idx:]
print(f"Train samples: {len(train_data)}, Val samples: {len(val_data)}")

print("\nBuilding tokenizer...")
tokenizer = SimpleTokenizer()
tokenizer.fit([text for text, _ in train_data])
print(f"Vocabulary size: {tokenizer.vocab_size}")

train_dataset = RLMDataset(train_data, tokenizer)
val_dataset = RLMDataset(val_data, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

print("\nBuilding Regression Language Model...")
model = RegressionLanguageModel(vocab_size=tokenizer.vocab_size)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

train_losses, val_losses = train_rlm(model, train_loader, val_loader)

plt.figure(figsize=(10, 4))
plt.plot(train_losses, label='Train Loss', linewidth=2)
plt.plot(val_losses, label='Val Loss', linewidth=2)
plt.xlabel('Epoch')
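The listing above cuts off during plotting. As a small follow-up, here is a hedged sketch of how the trained model could be used to score unseen sentences; the helper name predict_value and the example sentences are assumptions for illustration, not part of the original tutorial code.

# Hypothetical helper for testing generalization on unseen text (not from the original listing).
def predict_value(model, tokenizer, text, max_len=20):
    model.eval()
    tokens = torch.tensor([tokenizer.encode(text, max_len)]).to(device)
    with torch.no_grad():
        return model(tokens).item()

for sentence in ["The temperature is 73.5 degrees", "42.0 percent complete"]:
    print(f"{sentence!r} -> predicted target: {predict_value(model, tokenizer, sentence):.3f}")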


This AI Paper Proposes a Novel Dual-Branch Encoder-Decoder Architecture for Unsupervised Speech Enhancement (SE)

Can a speech enhancer trained only on real noisy recordings cleanly separate speech and noise—without ever seeing paired data? A team of researchers from Brno University of Technology and Johns Hopkins University proposes Unsupervised Speech Enhancement using Data-defined Priors (USE-DDP), a dual-stream encoder–decoder that separates any noisy input into two waveforms—estimated clean speech and residual noise—and learns both solely from unpaired datasets (a clean-speech corpus and an optional noise corpus). Training enforces that the sum of the two outputs reconstructs the input waveform, avoiding degenerate solutions and aligning the design with neural audio codec objectives.

https://arxiv.org/pdf/2509.22942

Why this matters

Most learning-based speech enhancement pipelines depend on paired clean–noisy recordings, which are expensive or impossible to collect at scale in real-world conditions. Unsupervised routes like MetricGAN-U remove the need for clean data but couple model performance to external, non-intrusive metrics used during training. USE-DDP keeps the training data-only, imposing priors with discriminators over independent clean-speech and noise datasets and using reconstruction consistency to tie estimates back to the observed mixture.

How it works

Generator: A codec-style encoder compresses the input audio into a latent sequence; this is split into two parallel transformer branches (RoFormer) that target clean speech and noise respectively, decoded by a shared decoder back to waveforms. The input is reconstructed as the least-squares combination of the two outputs (scalars α, β compensate for amplitude errors). Reconstruction uses multi-scale mel/STFT and SI-SDR losses, as in neural audio codecs. A small sketch of this reconstruction constraint appears below.

Priors via adversaries: Three discriminator ensembles—clean, noise, and noisy—impose distributional constraints: the clean branch must resemble the clean-speech corpus; the noise branch must resemble a noise corpus; the reconstructed mixture must sound natural. LS-GAN and feature-matching losses are used.

Initialization: Initializing the encoder/decoder from a pretrained Descript Audio Codec improves convergence and final quality vs. training from scratch.
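To make the reconstruction constraint concrete, here is a minimal PyTorch sketch of the per-utterance least-squares combination of the two branch outputs. This is an illustration under stated assumptions, not the authors' code; the actual USE-DDP implementation may solve for the scalars differently.

import torch

def reconstruct_mixture(x, s_hat, n_hat, eps=1e-8):
    # x, s_hat, n_hat: (batch, samples) mixture, clean-speech estimate, noise estimate.
    # Solve, per utterance, min_{alpha, beta} || x - (alpha * s_hat + beta * n_hat) ||^2.
    A = torch.stack([s_hat, n_hat], dim=-1)                         # (batch, samples, 2)
    AtA = A.transpose(1, 2) @ A + eps * torch.eye(2, device=x.device)
    Atx = A.transpose(1, 2) @ x.unsqueeze(-1)
    coeffs = torch.linalg.solve(AtA, Atx)                           # (batch, 2, 1) -> alpha, beta
    alpha, beta = coeffs[:, 0], coeffs[:, 1]
    x_rec = alpha * s_hat + beta * n_hat                            # feeds multi-scale STFT / SI-SDR losses
    return x_rec, alpha.squeeze(-1), beta.squeeze(-1)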
How it compares

On the standard VCTK+DEMAND simulated setup, USE-DDP reports parity with the strongest unsupervised baselines (e.g., unSE/unSE+ based on optimal transport) and competitive DNSMOS vs. MetricGAN-U (which directly optimizes DNSMOS). Example numbers from the paper's Table 1 (noisy input vs. systems): DNSMOS improves from 2.54 (noisy) to ~3.03 (USE-DDP), and PESQ from 1.97 to ~2.47; CBAK trails some baselines due to more aggressive noise attenuation in non-speech segments—consistent with the explicit noise prior.

https://arxiv.org/pdf/2509.22942

Data choice is not a detail—it's the result

A central finding: which clean-speech corpus defines the prior can swing outcomes and even create over-optimistic results on simulated tests.

In-domain prior (VCTK clean) on VCTK+DEMAND → best scores (DNSMOS ≈ 3.03), but this configuration unrealistically "peeks" at the target distribution used to synthesize the mixtures.
Out-of-domain prior → notably lower metrics (e.g., PESQ ~2.04), reflecting distribution mismatch and some noise leakage into the clean branch.
Real-world CHiME-3: using a "close-talk" channel as the in-domain clean prior actually hurts, because that "clean" reference itself contains environment bleed; an out-of-domain, truly clean corpus yields higher DNSMOS/UTMOS on both dev and test, albeit with some intelligibility trade-off under stronger suppression.

This clarifies discrepancies across prior unsupervised results and argues for careful, transparent prior selection when claiming SOTA on simulated benchmarks.

Our Comments

The proposed dual-branch encoder-decoder architecture treats enhancement as explicit two-source estimation with data-defined priors, not metric-chasing. The reconstruction constraint (clean + noise = input) plus adversarial priors over independent clean/noise corpora gives a clear inductive bias, and initializing from a neural audio codec is a pragmatic way to stabilize training. The results look competitive with unsupervised baselines while avoiding DNSMOS-guided objectives; the caveat is that the "clean prior" choice materially affects reported gains, so claims should specify corpus selection.

Check out the PAPER. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.

The post This AI Paper Proposes a Novel Dual-Branch Encoder-Decoder Architecture for Unsupervised Speech Enhancement (SE) appeared first on MarkTechPost.


The Download: using AI to discover “zero day” vulnerabilities, and Apple’s ICE app removal

This is today's edition of The Download, our weekday newsletter that provides a daily dose of what's going on in the world of technology.

Microsoft says AI can create "zero day" threats in biology

A team at Microsoft says it used artificial intelligence to discover a "zero day" vulnerability in the biosecurity systems used to prevent the misuse of DNA. These screening systems are designed to stop people from purchasing genetic sequences that could be used to create deadly toxins or pathogens. But now researchers say they have figured out how to bypass the protections in a way previously unknown to defenders. Read the full story.

—Antonio Regalado

If you're interested in learning more about AI and biology, check out:
+ AI-designed viruses are here and already killing bacteria. Read the full story.
+ OpenAI is making a foray into longevity science with an AI built to help manufacture stem cells.
+ AI is dreaming up drugs that no one has ever seen. Now we've got to see if they work.

The must-reads

I've combed the internet to find you today's most fun/important/scary/fascinating stories about technology.

1 Apple removed an app for reporting ICE officer sightings
The US Attorney General requested it take down ICEBlock—and Apple complied. (Insider $)
+ Apple says the removal was down to the safety risk it posed. (Bloomberg $)
+ The company had a similar explanation for removing a Hong Kong map app back in 2019. (The Verge)

2 OpenAI's parental controls are easily circumvented
Its alerts about teenagers' concerning conversations also took hours to deliver. (WP $)
+ The looming crackdown on AI companionship. (MIT Technology Review)

3 VCs have sunk a record amount into AI startups this year
To the tune of $192.7 billion so far. (Bloomberg $)
+ The AI bubble is looking increasingly precarious, though. (FT $)
+ How to fine-tune AI for prosperity. (MIT Technology Review)

4 The US federal vaccination schedule is still waiting for an update
Officials are yet to sign off on recommendations for this year's updated Covid shots. (Ars Technica)
+ Many people have been left unable to get vaccinated. (NPR)

5 The US Department of Energy has canceled yet more clean energy projects
In mostly blue states. (TechCrunch)
+ More than 300 funding awards have been axed. (CNBC)
+ How to make clean energy progress under Trump in the states. (MIT Technology Review)

6 TikTok recommends pornography to children's accounts
Despite activating its "restricted mode" to prevent sexualized content. (BBC)

7 China has launched a new skilled worker visa program
In the wake of the US H-1B visa clampdown. (Wired $)
+ The initiative hasn't gone down well with locals. (BBC)

8 Flights were grounded in Germany after several drone sightings
NATO members are worried about suspected Russian incursions in their skies. (WSJ $)
+ It's the latest in a string of airspace sightings. (FT $)

9 How YouTube is shaking up Hollywood
Its powerful creators are starting to worry the entertainment establishment—and Netflix. (FT $)

10 Anti-robocall tools are getting better
Call screening features are a useful first line of defense. (NYT $)

Quote of the day

"Capitulating to an authoritarian regime is never the right move."

—Joshua Aaron, the developer of ICEBlock, the app that crowdsources sightings of ICE officials, hits back at Apple's decision to remove it from the App Store, 404 Media reports.

One more thing

How AI can help supercharge creativity

Existing generative tools can automate a striking range of creative tasks and offer near-instant gratification—but at what cost?
Some artists and researchers fear that such technology could turn us into passive consumers of yet more AI slop. And so they are looking for ways to inject human creativity back into the process: working on what's known as co-creativity or more-than-human creativity. The aim is to develop AI tools that augment our creativity rather than strip it from us. Read the full story.

—Will Douglas Heaven

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet 'em at me.)

+ Congratulations to Fizz, the very handsome UK cat of the year!
+ What it took to transform actor Jeremy Allen White into the one and only Boss in his new film, Deliver Me from Nowhere.
+ Divers have salvaged more than 1,000 gold and silver coins from a 1715 shipwreck off the east coast of Florida.
+ The internet is obsessed with crabs. But why?


Can a Small Language Model Predict Kernel Latency, Memory, and Model Accuracy from Code? A New Regression Language Model (RLM) Says Yes

Researchers from Cornell and Google introduce a unified Regression Language Model (RLM) that predicts numeric outcomes directly from code strings—covering GPU kernel latency, program memory usage, and even neural network accuracy and latency—without hand-engineered features. A 300M-parameter encoder–decoder initialized from T5-Gemma achieves strong rank correlations across heterogeneous tasks and languages, using a single text-to-number decoder that emits digits with constrained decoding.

What exactly is new?

Unified code-to-metric regression: One RLM predicts (i) peak memory from high-level code (Python/C/C++ and more), (ii) latency for Triton GPU kernels, and (iii) accuracy and hardware-specific latency from ONNX graphs—by reading raw text representations and decoding numeric outputs. No feature engineering, graph encoders, or zero-cost proxies are required.

Concrete results: Reported correlations include Spearman ρ ≈ 0.93 on APPS LeetCode memory, ρ ≈ 0.52 for Triton kernel latency, ρ > 0.5 on average across 17 CodeNet languages, and Kendall τ ≈ 0.46 across five classic NAS spaces—competitive with, and in some cases surpassing, graph-based predictors.

Multi-objective decoding: Because the decoder is autoregressive, the model conditions later metrics on earlier ones (e.g., accuracy → per-device latencies), capturing realistic trade-offs along Pareto fronts.

https://arxiv.org/abs/2509.26476

Why is this important?

Performance-prediction pipelines in compilers, GPU kernel selection, and NAS typically rely on bespoke features, syntax trees, or GNN encoders that are brittle to new ops/languages. Treating regression as next-token prediction over numbers standardizes the stack: tokenize inputs as plain text (source code, Triton IR, ONNX), then decode calibrated numeric strings digit by digit with constrained sampling. This reduces maintenance cost and improves transfer to new tasks via fine-tuning.

Data and benchmarks

Code-Regression dataset (HF): Curated to support code-to-metric tasks spanning APPS/LeetCode runs, Triton kernel latencies (KernelBook-derived), and CodeNet memory footprints.
NAS/ONNX suite: Architectures from NASBench-101/201, FBNet, Once-for-All (MB/PN/RN), Twopath, Hiaml, Inception, and NDS are exported to ONNX text to predict accuracy and device-specific latency.

How does it work?

Backbone: Encoder–decoder with a T5-Gemma encoder initialization (~300M params). Inputs are raw strings (code or ONNX). Outputs are numbers emitted as sign/exponent/mantissa digit tokens; constrained decoding enforces valid numerals and supports uncertainty via sampling. A toy sketch of this numeric tokenization appears below.

Ablations: (i) Language pretraining accelerates convergence and improves Triton latency prediction; (ii) decoder-only numeric emission outperforms MSE regression heads even with y-normalization; (iii) learned tokenizers specialized for ONNX operators increase effective context; (iv) longer contexts help; (v) scaling to a larger Gemma encoder further improves correlation with adequate tuning.

Training code: The regress-lm library provides text-to-text regression utilities, constrained decoding, and multi-task pretraining/fine-tuning recipes.
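To illustrate the idea of emitting a number as sign/exponent/mantissa tokens, here is a toy round-trip sketch. The token format is a simplified assumption for exposition, not the exact regress-lm vocabulary; in the real system, constrained decoding restricts the decoder to valid token patterns like these.

import math

def encode_number(y: float, mantissa_digits: int = 4) -> list[str]:
    sign = "<+>" if y >= 0 else "<->"
    y = abs(y)
    exponent = 0 if y == 0 else math.floor(math.log10(y))
    mantissa = 0 if y == 0 else y / (10 ** exponent)
    digits = f"{mantissa:.{mantissa_digits - 1}f}".replace(".", "")  # e.g. 3.210 -> "3210"
    return [sign, f"<E{exponent}>"] + [f"<{d}>" for d in digits]

def decode_number(tokens: list[str]) -> float:
    sign = 1.0 if tokens[0] == "<+>" else -1.0
    exponent = int(tokens[1][2:-1])
    digits = "".join(t[1:-1] for t in tokens[2:])
    mantissa = int(digits) / (10 ** (len(digits) - 1))
    return sign * mantissa * (10 ** exponent)

tokens = encode_number(0.00321)        # ['<+>', '<E-3>', '<3>', '<2>', '<1>', '<0>']
print(tokens, decode_number(tokens))   # round-trips to ~0.00321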
Stats that matter

APPS (Python) memory: Spearman ρ > 0.9.
CodeNet (17 languages) memory: average ρ > 0.5; the strongest languages include C/C++ (~0.74–0.75).
Triton kernels (A6000) latency: ρ ≈ 0.52.
NAS ranking: average Kendall τ ≈ 0.46 across NASNet, Amoeba, PNAS, ENAS, and DARTS; competitive with FLAN and GNN baselines.

Key Takeaways

Unified code-to-metric regression works. A single ~300M-parameter T5Gemma-initialized model ("RLM") predicts: (a) memory from high-level code, (b) Triton GPU kernel latency, and (c) model accuracy plus device latency from ONNX—directly from text, with no hand-engineered features. The research shows Spearman ρ > 0.9 on APPS memory, ≈0.52 on Triton latency, >0.5 on average across 17 CodeNet languages, and Kendall τ ≈ 0.46 on five NAS spaces.

Numbers are decoded as text with constraints. Instead of a regression head, the RLM emits numeric tokens with constrained decoding, enabling multi-metric, autoregressive outputs (e.g., accuracy followed by multi-device latencies) and uncertainty via sampling.

The Code-Regression dataset unifies APPS/LeetCode memory, Triton kernel latency, and CodeNet memory; the regress-lm library provides the training/decoding stack.

Our Comments

It is striking how this work reframes performance prediction as text-to-number generation: a compact T5Gemma-initialized RLM reads source (Python/C++), Triton kernels, or ONNX graphs and emits calibrated numerics via constrained decoding. The reported correlations—APPS memory (ρ > 0.9), Triton latency on an RTX A6000 (~0.52), and NAS Kendall τ ≈ 0.46—are strong enough to matter for compiler heuristics, kernel pruning, and multi-objective NAS triage without bespoke features or GNNs. The open dataset and library make replication straightforward and lower the barrier to fine-tuning on new hardware or languages.

Check out the Paper, GitHub Page and Dataset Card. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.

The post Can a Small Language Model Predict Kernel Latency, Memory, and Model Accuracy from Code? A New Regression Language Model (RLM) Says Yes appeared first on MarkTechPost.


How to Build an Advanced Voice AI Pipeline with WhisperX for Transcription, Alignment, Analysis, and Export?

In this tutorial, we walk through an advanced implementation of WhisperX, where we explore transcription, alignment, and word-level timestamps in detail. We set up the environment, load and preprocess the audio, and then run the full pipeline, from transcription to alignment and analysis, while ensuring memory efficiency and supporting batch processing. Along the way, we also visualize results, export them in multiple formats, and even extract keywords to gain deeper insights from the audio content. Check out the FULL CODES here.

!pip install -q git+https://github.com/m-bain/whisperX.git
!pip install -q pandas matplotlib seaborn

import whisperx
import torch
import gc
import os
import json
import pandas as pd
from pathlib import Path
from IPython.display import Audio, display, HTML
import warnings
warnings.filterwarnings('ignore')

CONFIG = {
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "compute_type": "float16" if torch.cuda.is_available() else "int8",
    "batch_size": 16,
    "model_size": "base",
    "language": None,
}

print(f"Running on: {CONFIG['device']}")
print(f"Compute type: {CONFIG['compute_type']}")
print(f"Model: {CONFIG['model_size']}")

We begin by installing WhisperX along with essential libraries and then configure our setup. We detect whether CUDA is available, select the compute type, and set parameters such as batch size, model size, and language to prepare for transcription. Check out the FULL CODES here.

def download_sample_audio():
    """Download a sample audio file for testing."""
    !wget -q -O sample.mp3 https://github.com/mozilla-extensions/speaktome/raw/master/content/cv-valid-dev/sample-000000.mp3
    print("Sample audio downloaded")
    return "sample.mp3"

def load_and_analyze_audio(audio_path):
    """Load audio and display basic info."""
    audio = whisperx.load_audio(audio_path)
    duration = len(audio) / 16000
    print(f"Audio: {Path(audio_path).name}")
    print(f"Duration: {duration:.2f} seconds")
    print(f"Sample rate: 16000 Hz")
    display(Audio(audio_path))
    return audio, duration

def transcribe_audio(audio, model_size=CONFIG["model_size"], language=None):
    """Transcribe audio using WhisperX (batched inference)."""
    print("\nSTEP 1: Transcribing audio...")
    model = whisperx.load_model(
        model_size, CONFIG["device"], compute_type=CONFIG["compute_type"]
    )
    transcribe_kwargs = {
        "batch_size": CONFIG["batch_size"]
    }
    if language:
        transcribe_kwargs["language"] = language
    result = model.transcribe(audio, **transcribe_kwargs)
    total_segments = len(result["segments"])
    total_words = sum(len(seg.get("words", [])) for seg in result["segments"])
    del model
    gc.collect()
    if CONFIG["device"] == "cuda":
        torch.cuda.empty_cache()
    print(f"Transcription complete!")
    print(f"Language: {result['language']}")
    print(f"Segments: {total_segments}")
    print(f"Total text length: {sum(len(seg['text']) for seg in result['segments'])} characters")
    return result

We download a sample audio file, load it for analysis, and then transcribe it using WhisperX. We set up batched inference with our chosen model size and configuration, and we output key details such as language, number of segments, and total text length. Check out the FULL CODES here.
def align_transcription(segments, audio, language_code):
    """Align transcription for accurate word-level timestamps."""
    print("\nSTEP 2: Aligning for word-level timestamps...")
    try:
        model_a, metadata = whisperx.load_align_model(
            language_code=language_code, device=CONFIG["device"]
        )
        result = whisperx.align(
            segments, model_a, metadata, audio, CONFIG["device"],
            return_char_alignments=False
        )
        total_words = sum(len(seg.get("words", [])) for seg in result["segments"])
        del model_a
        gc.collect()
        if CONFIG["device"] == "cuda":
            torch.cuda.empty_cache()
        print(f"Alignment complete!")
        print(f"Aligned words: {total_words}")
        return result
    except Exception as e:
        print(f"Alignment failed: {str(e)}")
        print("Continuing with segment-level timestamps only...")
        return {"segments": segments, "word_segments": []}

We align the transcription to generate precise word-level timestamps. By loading the alignment model and applying it to the audio, we refine timing accuracy, and then report the total aligned words while ensuring memory is cleared for efficient processing. Check out the FULL CODES here.

def analyze_transcription(result):
    """Generate statistics about the transcription."""
    print("\nTRANSCRIPTION STATISTICS")
    print("=" * 70)
    segments = result["segments"]
    total_duration = max(seg["end"] for seg in segments) if segments else 0
    total_words = sum(len(seg.get("words", [])) for seg in segments)
    total_chars = sum(len(seg["text"].strip()) for seg in segments)
    print(f"Total duration: {total_duration:.2f} seconds")
    print(f"Total segments: {len(segments)}")
    print(f"Total words: {total_words}")
    print(f"Total characters: {total_chars}")
    if total_duration > 0:
        print(f"Words per minute: {(total_words / total_duration * 60):.1f}")
    pauses = []
    for i in range(len(segments) - 1):
        pause = segments[i + 1]["start"] - segments[i]["end"]
        if pause > 0:
            pauses.append(pause)
    if pauses:
        print(f"Average pause between segments: {sum(pauses)/len(pauses):.2f}s")
        print(f"Longest pause: {max(pauses):.2f}s")
    word_durations = []
    for seg in segments:
        if "words" in seg:
            for word in seg["words"]:
                duration = word["end"] - word["start"]
                word_durations.append(duration)
    if word_durations:
        print(f"Average word duration: {sum(word_durations)/len(word_durations):.3f}s")
    print("=" * 70)

We analyze the transcription by generating detailed statistics such as total duration, segment count, word count, and character count. We also calculate words per minute, pauses between segments, and average word duration to better understand the pacing and flow of the audio. Check out the FULL CODES here.
def display_results(result, show_words=False, max_rows=50):
    """Display transcription results in a formatted table."""
    data = []
    for seg in result["segments"]:
        text = seg["text"].strip()
        start = f"{seg['start']:.2f}s"
        end = f"{seg['end']:.2f}s"
        duration = f"{seg['end'] - seg['start']:.2f}s"
        if show_words and "words" in seg:
            for word in seg["words"]:
                data.append({
                    "Start": f"{word['start']:.2f}s",
                    "End": f"{word['end']:.2f}s",
                    "Duration": f"{word['end'] - word['start']:.3f}s",
                    "Text": word["word"],
                    "Score": f"{word.get('score', 0):.2f}"
                })
        else:
            data.append({
                "Start": start,
                "End": end,
                "Duration": duration,
                "Text": text
            })
    df = pd.DataFrame(data)
    if len(df) > max_rows:
        print(f"Showing first {max_rows} rows of {len(df)} total...")
        display(HTML(df.head(max_rows).to_html(index=False)))
    else:
        display(HTML(df.to_html(index=False)))
    return df

def export_results(result, output_dir="output", filename="transcript"):
    """Export results in multiple formats (JSON, SRT, VTT, TXT, CSV)."""
    os.makedirs(output_dir, exist_ok=True)
    json_path = f"{output_dir}/{filename}.json"
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(result, f, indent=2, ensure_ascii=False)
    srt_path = f"{output_dir}/{filename}.srt"
    with open(srt_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], 1):
            start = format_timestamp(seg["start"])
            end = format_timestamp(seg["end"])
            f.write(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n\n")
    vtt_path = f"{output_dir}/{filename}.vtt"
    with open(vtt_path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for i, seg in enumerate(result["segments"], 1):
            start = format_timestamp_vtt(seg["start"])
            end = format_timestamp_vtt(seg["end"])
            f.write(f"{start} --> {end}\n{seg['text'].strip()}\n\n")
    txt_path = f"{output_dir}/{filename}.txt"
    with open(txt_path, "w", encoding="utf-8") as f:
        for seg in result["segments"]:
            f.write(f"{seg['text'].strip()}\n")
    csv_path = f"{output_dir}/{filename}.csv"
    df_data = []
    for seg in result["segments"]:
        df_data.append({
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"].strip()
        })
    pd.DataFrame(df_data).to_csv(csv_path, index=False)
    print(f"\nResults exported to '{output_dir}/' directory:")
    print(f"  ✓ {filename}.json (full structured data)")
    print(f"  ✓ {filename}.srt (subtitles)")
    print(f"  ✓ {filename}.vtt (web video subtitles)")
    print(f"  ✓ {filename}.txt (plain text)")
    print(f"  ✓ {filename}.csv (timestamps + text)")

def format_timestamp(seconds):
    """Convert seconds to the SRT timestamp format."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def format_timestamp_vtt(seconds):
    """Convert seconds to the VTT timestamp format."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

def batch_process_files(audio_files, output_dir="batch_output"):
    """Process multiple audio files in batch."""
    print(f"\nBatch processing {len(audio_files)} files...")
    results = {}
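The batch helper is cut off above. As a usage illustration, here is a minimal end-to-end driver that strings together the functions defined in this tutorial; it is a sketch of one reasonable way to call them, not part of the original listing.

# Minimal driver using the functions defined above (sketch, not from the original tutorial).
audio_path = download_sample_audio()
audio, duration = load_and_analyze_audio(audio_path)

result = transcribe_audio(audio)                                               # STEP 1: batched transcription
result = align_transcription(result["segments"], audio, result["language"])   # STEP 2: word-level alignment

analyze_transcription(result)                                                  # statistics
df = display_results(result, show_words=True)                                  # formatted table
export_results(result, output_dir="output", filename="sample")                 # JSON / SRT / VTT / TXT / CSV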


TAG-EQA: Text-And-Graph for Event Question Answering via Structured Prompting Strategies

arXiv:2510.01391v1 Announce Type: new Abstract: Large language models (LLMs) excel at general language tasks but often struggle with event-based questions-especially those requiring causal or temporal reasoning. We introduce TAG-EQA (Text-And-Graph for Event Question Answering), a prompting framework that injects causal event graphs into LLM inputs by converting structured relations into natural-language statements. TAG-EQA spans nine prompting configurations, combining three strategies (zero-shot, few-shot, chain-of-thought) with three input modalities (text-only, graph-only, text+graph), enabling a systematic analysis of when and how structured knowledge aids inference. On the TORQUESTRA benchmark, TAG-EQA improves accuracy by 5% on average over text-only baselines, with gains up to 12% in zero-shot settings and 18% when graph-augmented CoT prompting is effective. While performance varies by model and configuration, our findings show that causal graphs can enhance event reasoning in LLMs without fine-tuning, offering a flexible way to encode structure in prompt-based QA.
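The abstract describes injecting causal event graphs by verbalizing structured relations into natural-language statements across text-only, graph-only, and text+graph prompt modalities. As a rough illustration of that idea (not the authors' code; the relation names, templates, and example content are assumptions), a text+graph prompt could be assembled like this:

causal_graph = [
    ("storm hits the coast", "CAUSES", "power lines fall"),
    ("power lines fall", "BEFORE", "city declares an outage"),
]

TEMPLATES = {
    "CAUSES": "{head} causes {tail}.",
    "BEFORE": "{head} happens before {tail}.",
}

def graph_to_statements(edges):
    # Convert structured relations into natural-language statements.
    return " ".join(TEMPLATES[rel].format(head=h, tail=t) for h, rel, t in edges)

def build_prompt(passage, question, edges=None, chain_of_thought=False):
    parts = [f"Passage: {passage}"]
    if edges:  # graph-only or text+graph input modality
        parts.append(f"Event relations: {graph_to_statements(edges)}")
    parts.append(f"Question: {question}")
    if chain_of_thought:
        parts.append("Think step by step about causal and temporal order before answering.")
    return "\n".join(parts)

print(build_prompt("A storm hit the coast...", "Why did the city declare an outage?",
                   edges=causal_graph, chain_of_thought=True))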


Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models

arXiv:2510.01304v1 Announce Type: cross Abstract: Although current large Vision-Language Models (VLMs) have advanced in multimodal understanding and reasoning, their fundamental perceptual and reasoning abilities remain limited. Specifically, even on simple jigsaw tasks, existing VLMs perform near randomly, revealing deficiencies in core perception and reasoning capabilities. While high-quality vision-language data can enhance these capabilities, its scarcity and limited scalability impose significant constraints. To address this, we propose AGILE, an Agentic jiGsaw Interaction Learning for Enhancing visual perception and reasoning in VLMs. AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. At each step, the model generates executable code to perform an action based on the current state, while the environment provides fine-grained visual feedback to guide task completion. Through this iterative cycle of observation and interaction, the model incrementally improves its perceptual and reasoning capabilities via exploration and feedback. Experimental results show that AGILE not only substantially boosts performance on jigsaw tasks of varying complexity (e.g., increasing accuracy from 9.5% to 82.8% under the 2 $times$ 2 setting) but also demonstrates strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%. These results indicate notable enhancements in both perceptual and reasoning abilities. This work opens a new avenue for advancing reasoning and generalization in multimodal models and provides an efficient, scalable solution to the scarcity of multimodal reinforcement learning data. The code and datasets is available at https://github.com/yuzeng0-0/AGILE .
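The abstract frames jigsaw solving as an interactive loop in which the model emits executable code and the environment returns fine-grained visual feedback. The following is a hypothetical sketch of that loop; the model and environment interfaces are placeholders for illustration, not the released AGILE code.

def agile_episode(model, env, max_steps=20):
    state = env.reset()                      # scrambled jigsaw observation
    for _ in range(max_steps):
        code = model.propose_action(state)   # model writes executable code for its next action
        feedback = env.execute(code)         # environment runs it and returns visual feedback
        state = feedback.observation
        model.observe(feedback)              # feedback drives learning (e.g., an RL reward signal)
        if feedback.solved:
            break
    return state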
