
GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models

arXiv:2510.01252v2 Announce Type: replace Abstract: As large language models (LLMs) are increasingly trained on massive, uncurated corpora, understanding both model representations and the data they internalize has become a major challenge. In this work, we show that pairing LLMs with sparse autoencoders (SAEs) enables interpretation not only of model behavior but also of the deeper structures, themes, and biases embedded in the training data. We train a GPT-style transformer model exclusively on the novels of Jane Austen, a corpus rich in social constructs and narrative patterns. We then apply SAEs to hidden states across multiple layers, uncovering sparse, interpretable features that reflect the key narratives and concepts present in the corpus, including gender, class, and societal duty. Our findings demonstrate that LLMs combined with SAEs can act as scalable probes into complex datasets, offering a new path for corpus exploration, bias discovery, and model interpretability at scale.
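To make the abstract's central tool concrete, the following is a minimal sketch (not the authors' code) of training a sparse autoencoder on hidden states already extracted from a GPT-style model; the dictionary size, L1 coefficient, and step count are illustrative assumptions.

# Minimal sparse autoencoder sketch (illustrative, not the paper's implementation).
# Assumes `hidden_states` is a (num_tokens, d_model) tensor taken from one layer.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_dict):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # overcomplete feature dictionary
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, h):
        z = torch.relu(self.encoder(h))             # sparse, non-negative feature activations
        return self.decoder(z), z

def train_sae(hidden_states, d_dict=4096, l1_coeff=1e-3, steps=1000, lr=1e-3):
    sae = SparseAutoencoder(hidden_states.shape[-1], d_dict)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        recon, z = sae(hidden_states)
        loss = ((recon - hidden_states) ** 2).mean() + l1_coeff * z.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae  # columns of sae.decoder.weight act as candidate interpretable feature directions

Inspecting which tokens most strongly activate each feature is then how narratives such as gender, class, or duty would be surfaced from the corpus.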



Emission-GPT: A domain-specific language model agent for knowledge retrieval, emission inventory and data analysis

arXiv:2510.02359v1 Announce Type: new Abstract: Improving air quality and addressing climate change relies on accurate understanding and analysis of air pollutant and greenhouse gas emissions. However, emission-related knowledge is often fragmented and highly specialized, while existing methods for accessing and compiling emissions data remain inefficient. These issues hinder the ability of non-experts to interpret emissions information, posing challenges to research and management. To address this, we present Emission-GPT, a knowledge-enhanced large language model agent tailored for the atmospheric emissions domain. Built on a curated knowledge base of over 10,000 documents (including standards, reports, guidebooks, and peer-reviewed literature), Emission-GPT integrates prompt engineering and question completion to support accurate domain-specific question answering. Emission-GPT also enables users to interactively analyze emissions data via natural language, such as querying and visualizing inventories, analyzing source contributions, and recommending emission factors for user-defined scenarios. A case study in Guangdong Province demonstrates that Emission-GPT can extract key insights–such as point source distributions and sectoral trends–directly from raw data with simple prompts. Its modular and extensible architecture facilitates automation of traditionally manual workflows, positioning Emission-GPT as a foundational tool for next-generation emission inventory development and scenario-based assessment.
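As a rough illustration of the retrieval-plus-prompting pattern the abstract describes (this is not the Emission-GPT implementation, and the function and variable names are assumptions), a knowledge-enhanced answer can be grounded by retrieving the most relevant passages before querying the model:

# Illustrative sketch only: TF-IDF retrieval over an emissions knowledge base,
# followed by building a context-grounded prompt for a downstream LLM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_grounded_prompt(question, documents, top_k=3):
    vec = TfidfVectorizer(stop_words="english")
    doc_matrix = vec.fit_transform(documents)                      # index the knowledge base
    scores = cosine_similarity(vec.transform([question]), doc_matrix)[0]
    top_docs = [documents[i] for i in scores.argsort()[::-1][:top_k]]
    context = "\n".join(top_docs)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"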



Beyond Imitation: Recovering Dense Rewards from Demonstrations

arXiv:2510.02493v1 Announce Type: cross Abstract: Conventionally, supervised fine-tuning (SFT) is treated as a simple imitation learning process that only trains a policy to imitate expert behavior on demonstration datasets. In this work, we challenge this view by establishing a fundamental equivalence between SFT and Inverse Reinforcement Learning. We prove that the SFT objective is a special case of Inverse Q-Learning, which implies that the SFT process does not just learn a policy, but also an implicit, dense, token-level reward model that explains the expert demonstrations. We then show how to recover this dense reward signal directly from the SFT model by formulating a baseline-relative reward function. The availability of such a dense reward model offers numerous benefits, providing granular credit assignment for each token generated. We demonstrate one key application by using these recovered rewards to further improve the policy with reinforcement learning. Our method, Dense-Path REINFORCE, consistently outperforms the original SFT models on instruction-following benchmarks. This work reframes SFT not merely as policy imitation but as a powerful reward learning mechanism, opening new possibilities for leveraging expert demonstrations.
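One plausible instantiation of the baseline-relative, token-level reward described in the abstract (the exact formulation is defined in the paper; the model names below are placeholders) is the per-token log-probability difference between the SFT policy and a baseline model:

# Hedged sketch: dense, per-token rewards recovered from an SFT model relative to a baseline.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_rewards(text, sft_name="my-sft-model", base_name="my-base-model"):
    # "my-sft-model" and "my-base-model" are hypothetical checkpoint names.
    tok = AutoTokenizer.from_pretrained(sft_name)
    sft = AutoModelForCausalLM.from_pretrained(sft_name).eval()
    base = AutoModelForCausalLM.from_pretrained(base_name).eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        lp_sft = F.log_softmax(sft(ids).logits[:, :-1], dim=-1)
        lp_base = F.log_softmax(base(ids).logits[:, :-1], dim=-1)
    targets = ids[:, 1:].unsqueeze(-1)
    # Reward for each generated token: how much more the SFT policy prefers it than the baseline.
    r = lp_sft.gather(-1, targets) - lp_base.gather(-1, targets)
    return r.squeeze(-1)

Such per-token rewards could then be plugged into a REINFORCE-style update, which is the use case the paper demonstrates.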



L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

arXiv:2503.04697v2 Announce Type: replace Abstract: Reasoning language models have shown an uncanny ability to improve performance at test-time by “thinking longer”-that is, by generating longer chain-of-thought sequences and hence using more compute. However, the length of their chain-of-thought reasoning is not controllable, making it impossible to allocate test-time compute to achieve a desired level of performance. We introduce Length Controlled Policy Optimization (LCPO), a simple reinforcement learning method that optimizes for accuracy and adherence to user-specified length constraints. We use LCPO to train L1, a reasoning language model that produces outputs satisfying a length constraint given in its prompt. L1’s length control allows for smoothly trading off computational cost and accuracy on a wide range of tasks, and outperforms the state-of-the-art S1 method for length control. Furthermore, we uncover an unexpected short chain-of-thought capability in models trained with LCPO. Specifically, using LCPO we derive Short Reasoning Models (SRMs), that exhibit similar reasoning patterns as full-length reasoning models, but can generate CoT lengths comparable to non-reasoning models. They demonstrate significant performance gains, for instance, our 1.5B L1 model surpasses GPT-4o at equal reasoning lengths. Overall, LCPO enables precise control over reasoning length, allowing for fine-grained allocation of test-time compute and accuracy. We release code and models at https://www.cmu-l3.github.io/l1
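In the spirit of LCPO's dual objective (accuracy plus adherence to a prompt-specified length budget), a minimal reward sketch might look like the following; the penalty shape and the coefficient alpha are illustrative assumptions, not the paper's values.

# Hedged sketch of a length-controlled reward: correctness minus a length-deviation penalty.
def lcpo_reward(is_correct: bool, generated_len: int, target_len: int, alpha: float = 0.001) -> float:
    return float(is_correct) - alpha * abs(generated_len - target_len)

# Example: a correct answer that overshoots a 1,000-token budget by 200 tokens.
print(lcpo_reward(True, 1200, 1000))  # roughly 0.8: correct, but penalized for overshooting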



StreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Intermediates Across FPGA Dataflows

Why treat LLM inference as batched kernels to DRAM when a dataflow compiler can pipe tiles through on-chip FIFOs and stream converters?

StreamTensor is a compiler that lowers PyTorch LLM graphs (GPT-2, Llama, Qwen, Gemma) into stream-scheduled dataflow accelerators on AMD's Alveo U55C FPGA. The system introduces an iterative tensor ("itensor") type to encode the tile shape and order of streams, enabling provably correct inter-kernel streaming and automated insertion and sizing of DMA engines, FIFOs, and layout converters. On LLM decoding workloads, the research team reports latency as low as 0.64× that of a GPU baseline and up to 1.99× higher energy efficiency. https://arxiv.org/pdf/2509.13694

What StreamTensor does

StreamTensor compiles PyTorch graphs into a stream-oriented dataflow design in which intermediate tiles largely avoid off-chip DRAM round-trips: DMAs are inserted only when required, and intermediates are otherwise forwarded through on-chip FIFOs to downstream kernels. The compiler's central abstraction, the iterative tensor (itensor), records iteration order, tiling, and layout, which makes inter-kernel stream compatibility explicit and drives converter generation only where needed. The framework also searches hierarchically over tiling, fusion, and resource allocation, and uses a linear program to size FIFOs so as to avoid stalls or deadlock while minimizing on-chip memory.

What's actually new

Hierarchical DSE. The compiler explores three design spaces: (i) tiling, unrolling, vectorization, and permutation at the Linalg level; (ii) fusion under memory and resource constraints; and (iii) resource allocation and stream widths. It optimizes for sustained throughput under bandwidth limits.

End-to-end PyTorch → device flow. Models enter via Torch-MLIR, are transformed to MLIR Linalg, and then into a dataflow IR whose nodes become hardware kernels with explicit streams and host/runtime glue, with no manual RTL assembly.

Iterative tensor (itensor) typing system. A first-class tensor type expresses iteration order, tiling, and affine maps. This makes stream order explicit, allows safe kernel fusion, and lets the compiler synthesize minimal buffer/format converters when producers and consumers disagree.

Formal FIFO sizing. Inter-kernel buffering is solved with a linear-programming formulation that avoids stalls and deadlocks while minimizing on-chip memory usage (BRAM/URAM).

Results

Latency: up to 0.76× vs. prior FPGA LLM accelerators and 0.64× vs. a GPU baseline on GPT-2. Energy efficiency: up to 1.99× vs. A100 on emerging LLMs (model-dependent). Platform context: Alveo U55C (HBM2 16 GB, 460 GB/s, PCIe Gen3×16 or dual Gen4×8, 2×QSFP28).

Our Comments

The useful contribution here is a PyTorch → Torch-MLIR → dataflow compiler that emits stream-scheduled kernels and a host/runtime for AMD's Alveo U55C; the iterative tensor type plus linear-programming-based FIFO sizing enables safe inter-kernel streaming rather than DRAM round-trips. On reported LLM decoding benchmarks across GPT-2, Llama, Qwen, and Gemma, the research team shows geometric-mean latency as low as 0.64× vs. a GPU baseline and energy efficiency up to 1.99×, with scope limited to decoding workloads. The hardware context is clear: Alveo U55C provides 16 GB HBM2 at 460 GB/s with dual QSFP28 and PCIe Gen3×16 or dual Gen4×8, which aligns with the streaming dataflow design.
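To illustrate the itensor idea in miniature (this is a toy sketch in Python, not StreamTensor's actual MLIR type), the key point is that two kernels can be streamed directly only when their tile shape and iteration order agree; otherwise the compiler must insert a converter.

# Illustrative sketch: an "itensor"-like record of tiling and stream order, plus a
# compatibility check that decides whether a layout converter is needed between kernels.
from dataclasses import dataclass

@dataclass(frozen=True)
class ITensor:
    shape: tuple        # logical tensor shape
    tile: tuple         # tile sizes per dimension
    loop_order: tuple   # order in which tiles are produced/consumed

def needs_converter(producer: ITensor, consumer: ITensor) -> bool:
    # Streams are directly compatible only if shape, tiling, and iteration order all match.
    return not (producer.shape == consumer.shape
                and producer.tile == consumer.tile
                and producer.loop_order == consumer.loop_order)

a = ITensor(shape=(1024, 1024), tile=(64, 64), loop_order=("row", "col"))
b = ITensor(shape=(1024, 1024), tile=(64, 64), loop_order=("col", "row"))
print(needs_converter(a, b))  # True: same tiling, different stream order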
Check out the Paper. The post StreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Intermediates Across FPGA Dataflows appeared first on MarkTechPost.



How to Evaluate Voice Agents in 2025: Beyond Automatic Speech Recognition (ASR) and Word Error Rate (WER) to Task Success, Barge-In, and Hallucination-Under-Noise

Table of contents: Why WER Isn't Enough · What to Measure (and How) · Benchmark Landscape: What Each Covers · Filling the Gaps: What You Still Need to Add · A Concrete, Reproducible Evaluation Plan · References

Optimizing only for Automatic Speech Recognition (ASR) and Word Error Rate (WER) is insufficient for modern, interactive voice agents. Robust evaluation must measure end-to-end task success, barge-in behavior and latency, and hallucination-under-noise, alongside ASR, safety, and instruction following. VoiceBench offers a multi-facet speech-interaction benchmark across general knowledge, instruction following, safety, and robustness to speaker/environment/content variations, but it does not cover barge-in or real-device task completion. SLUE (and Phase-2) target spoken language understanding (SLU); MASSIVE and Spoken-SQuAD probe multilingual and spoken QA; DSTC tracks add spoken, task-oriented robustness. Combine these with explicit barge-in/endpointing tests, user-centric task-success measurement, and controlled noise-stress protocols to obtain a complete picture.

Why WER Isn't Enough

WER measures transcription fidelity, not interaction quality. Two agents with similar WER can diverge widely in dialog success because latency, turn-taking, misunderstanding recovery, safety, and robustness to acoustic and content perturbations dominate user experience. Prior work on real systems shows the need to evaluate user satisfaction and task success directly; for example, Cortana's automatic online evaluation predicted user satisfaction from in-situ interaction signals, not only ASR accuracy.

What to Measure (and How)

1) End-to-End Task Success
Metric: Task Success Rate (TSR) with strict success criteria per task (goal completion, constraints met), plus Task Completion Time (TCT) and Turns-to-Success.
Why: Real assistants are judged by outcomes. Competitions like Alexa Prize TaskBot explicitly measured users' ability to finish multi-step tasks (e.g., cooking, DIY) with ratings and completion.
Protocol: Define tasks with verifiable endpoints (e.g., "assemble a shopping list with N items and constraints"). Use blinded human raters and automatic logs to compute TSR/TCT/Turns. For multilingual/SLU coverage, draw task intents and slots from MASSIVE.

2) Barge-In and Turn-Taking
Metrics: Barge-In Detection Latency (ms), the time from user onset to TTS suppression; True/False Barge-In Rates, correct interruptions vs. spurious stops; Endpointing Latency (ms), the time to ASR finalization after the user stops.
Why: Smooth interruption handling and fast endpointing determine perceived responsiveness. Research formalizes barge-in verification and continuous barge-in processing; endpointing latency continues to be an active area in streaming ASR.
Protocol: Script prompts where the user interrupts TTS at controlled offsets and SNRs. Measure suppression and recognition timings with high-precision logs (frame timestamps). Include noisy/echoic far-field conditions. Classic and modern studies provide recovery and signaling strategies that reduce false barge-ins. (A minimal computation sketch for these metrics appears after the benchmark overview below.)

3) Hallucination-Under-Noise (HUN)
Metric: HUN Rate, the fraction of outputs that are fluent but semantically unrelated to the audio, under controlled noise or non-speech audio.
Why: ASR and audio-LLM stacks can emit "convincing nonsense," especially with non-speech segments or noise overlays. Recent work defines and measures ASR hallucinations; targeted studies show Whisper hallucinations induced by non-speech sounds.
Protocol: Construct audio sets with additive environmental noise (varied SNRs), non-speech distractors, and content disfluencies. Score semantic relatedness (human judgment with adjudication) and compute the HUN rate. Track whether downstream agent actions propagate hallucinations into incorrect task steps.

4) Instruction Following, Safety, and Robustness
Metric families: Instruction-Following Accuracy (format and constraint adherence); Safety Refusal Rate on adversarial spoken prompts; Robustness deltas across speaker age/accent/pitch, environment (noise, reverb, far-field), and content noise (grammar errors, disfluencies).
Why: VoiceBench explicitly targets these axes with spoken instructions (real and synthetic) spanning general knowledge, instruction following, and safety; it perturbs speaker, environment, and content to probe robustness.
Protocol: Use VoiceBench for breadth on speech-interaction capabilities; report aggregate and per-axis scores. For SLU specifics (NER, dialog acts, QA, summarization), leverage SLUE and Phase-2.

5) Perceptual Speech Quality (for TTS and Enhancement)
Metric: Subjective Mean Opinion Score via ITU-T P.808 (crowdsourced ACR/DCR/CCR).
Why: Interaction quality depends on both recognition and playback quality. P.808 gives a validated crowdsourcing protocol with open-source tooling.

Benchmark Landscape: What Each Covers

VoiceBench (2024)
Scope: Multi-facet voice assistant evaluation with spoken inputs covering general knowledge, instruction following, safety, and robustness across speaker/environment/content variations; uses both real and synthetic speech.
Limitations: Does not benchmark barge-in/endpointing latency or real-world task completion on devices; focuses on response correctness and safety under variations.

SLUE / SLUE Phase-2
Scope: Spoken language understanding tasks (NER, sentiment, dialog acts, named-entity localization, QA, summarization); designed to study end-to-end vs. pipeline sensitivity to ASR errors.
Use: Great for probing SLU robustness and pipeline fragility in spoken settings.

MASSIVE
Scope: More than 1M virtual-assistant utterances across 51–52 languages with intents and slots; a strong fit for multilingual task-oriented evaluation.
Use: Build multilingual task suites and measure TSR/slot F1 under speech conditions (paired with TTS or read speech).

Spoken-SQuAD / HeySQuAD and Related Spoken-QA Sets
Scope: Spoken question answering to test ASR-aware comprehension and multi-accent robustness.
Use: Stress-test comprehension under speech errors; not a full agent task suite.

DSTC (Dialog System Technology Challenge) Tracks
Scope: Robust dialog modeling with spoken, task-oriented data; human ratings alongside automatic metrics; recent tracks emphasize multilinguality, safety, and evaluation dimensionality.
Use: Complementary for dialog quality, DST, and knowledge-grounded responses under speech conditions.

Real-World Task Assistance (Alexa Prize TaskBot)
Scope: Multi-step task assistance with user ratings and success criteria (cooking/DIY).
Use: Gold-standard inspiration for defining TSR and interaction KPIs; the public reports describe evaluation focus and outcomes.
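Before turning to the gaps, here is a minimal sketch of how two of the metrics defined above (TSR and barge-in detection latency) can be computed from interaction logs; the log field names such as user_onset_ms and tts_stop_ms are illustrative assumptions, not part of any benchmark.

# Hedged sketch: computing Task Success Rate and barge-in statistics from simple logs.
def task_success_rate(sessions):
    # sessions: list of dicts with a boolean "success" judged against strict task criteria
    return sum(s["success"] for s in sessions) / len(sessions)

def barge_in_stats(events):
    # events: list of dicts with the user's speech onset and the TTS suppression timestamp
    latencies = [e["tts_stop_ms"] - e["user_onset_ms"] for e in events if e.get("tts_stop_ms") is not None]
    false_barge_ins = sum(1 for e in events if e.get("spurious", False))
    mean_latency_ms = sum(latencies) / len(latencies) if latencies else float("nan")
    return mean_latency_ms, false_barge_ins

sessions = [{"success": True}, {"success": False}, {"success": True}]
print(task_success_rate(sessions))  # 0.67 (two of three tasks completed)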
Filling the Gaps: What You Still Need to Add

Barge-In and Endpointing KPIs. Add explicit measurement harnesses. The literature offers barge-in verification and continuous processing strategies; streaming ASR endpointing latency remains an active research topic. Track barge-in detection latency, suppression correctness, endpointing delay, and false barge-ins.

Hallucination-Under-Noise (HUN) Protocols. Adopt emerging ASR-hallucination definitions and controlled noise/non-speech tests; report the HUN rate and its impact on downstream actions.

On-Device Interaction Latency. Correlate user-perceived latency with streaming ASR designs (e.g., transducer variants); measure time-to-first-token, time-to-final, and local processing overhead.

Cross-Axis Robustness Matrices. Combine VoiceBench's speaker/environment/content axes with your task suite (TSR) to expose failure surfaces (e.g., barge-in under far-field echo; task success at low SNR; multilingual slots under accent shift).

Perceptual Quality for Playback. Use ITU-T P.808 (with the open P.808 toolkit) to quantify user-perceived TTS quality in your end-to-end loop, not just ASR.

A Concrete, Reproducible Evaluation Plan

Assemble the suite:
Speech-Interaction Core: VoiceBench for knowledge, instruction following, safety, and robustness axes.
SLU Depth: SLUE/Phase-2 tasks (NER, dialog acts,



Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use Mixture

What if, instead of re-sampling one agent, you could push Gemini-2.5 Pro to 34.1% on HLE by mixing 12–15 tool-using agents that share notes and stop early?

Google Cloud AI Research, with collaborators from MIT, Harvard, and Google DeepMind, introduced TUMIX (Tool-Use Mixture), a test-time framework that ensembles heterogeneous agent styles (text-only, code, search, guided variants), lets them share intermediate answers over a few refinement rounds, and then stops early via an LLM-based judge. The result: higher accuracy at lower cost on hard reasoning benchmarks such as HLE, GPQA-Diamond, and AIME (2024/2025). https://arxiv.org/pdf/2510.01279

So, what exactly is new?

Mixture over modality, not just more samples: TUMIX runs ~15 agent styles spanning Chain-of-Thought (CoT), code execution, web search, dual-tool agents, and guided variants. Each round, every agent sees (a) the original question and (b) other agents' previous answers, then proposes a refined answer. This message-passing raises average accuracy early while diversity gradually collapses, so stopping matters.

Adaptive early termination: An LLM-as-judge halts refinement once answers exhibit strong consensus (with a minimum round threshold). This preserves accuracy at ~49% of the inference cost vs. fixed-round refinement; token cost drops to ~46% because late rounds are token-heavier.

Auto-designed agents: Beyond human-crafted agents, TUMIX prompts the base LLM to generate new agent types; mixing these with the manual set yields an additional ~+1.2% average lift without extra cost. The empirical "sweet spot" is ~12–15 agent styles.

How does it work?

TUMIX runs a group of heterogeneous agents (text-only Chain-of-Thought, code-executing, web-searching, and guided variants) in parallel, then iterates a small number of refinement rounds in which each agent conditions on the original question plus the other agents' prior rationales and answers (structured note-sharing). After each round, an LLM-based judge evaluates consensus and consistency to decide on early termination; if confidence is insufficient, another round is triggered, otherwise the system finalizes via simple aggregation (e.g., majority vote or a selector). This mixture-of-tool-use design trades brute-force re-sampling for diverse reasoning paths, improving coverage of correct candidates while controlling token/tool budgets. Empirically, benefits saturate around 12–15 agent styles, and stopping early preserves diversity and lowers cost without sacrificing accuracy.

Results

Under comparable inference budgets to strong tool-augmented baselines (Self-MoA, Symbolic-MoE, DEI, SciMaster, GSA), TUMIX yields the best average accuracy; a scaled variant (TUMIX+) pushes further with more compute:

HLE (Humanity's Last Exam): Pro: 21.6% → 34.1% (TUMIX+); Flash: 9.7% → 23.1%. (HLE is a 2,500-question, difficult, multi-domain benchmark finalized in 2025.)

GPQA-Diamond: Pro: up to 88.3%; Flash: up to 82.1%. (GPQA-Diamond is the hardest 198-question subset, authored by domain experts.)

AIME 2024/25: Pro: 96.7%; Flash: 86.7% with TUMIX(+) at test time.

Across tasks, TUMIX averages +3.55% over the best prior tool-augmented test-time scaling baseline at similar cost, and +7.8% / +17.4% over no-scaling for Pro and Flash, respectively.
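A minimal sketch of the refinement-and-early-stop loop described above (this is an illustrative outline, not the released implementation; `agents` and `judge_is_confident` stand in for the paper's 12–15 agent styles and its LLM-as-judge):

# Hedged sketch of a TUMIX-style test-time loop with note-sharing and early termination.
from collections import Counter

def tumix_answer(question, agents, judge_is_confident, max_rounds=3, min_rounds=1):
    shared_notes = []   # other agents' previous answers, visible in the next round
    answers = []
    for round_idx in range(max_rounds):
        # In practice the agents run in parallel; each sees the question plus shared notes.
        answers = [agent(question, shared_notes) for agent in agents]
        shared_notes = answers
        if round_idx + 1 >= min_rounds and judge_is_confident(question, answers):
            break   # early termination once consensus is strong, saving inference cost
    return Counter(answers).most_common(1)[0][0]   # simple majority-vote aggregation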
Our Comments

TUMIX is a strong approach from Google because it frames test-time scaling as a search problem over heterogeneous tool policies rather than brute-force sampling. The parallel committee (text, code, search) improves candidate coverage, while the LLM judge enables early stopping that preserves diversity and reduces token/tool spend, which is useful under latency budgets. The HLE gains (34.1% with Gemini-2.5 Pro) align with the benchmark's finalized 2,500-question design, and the ~12–15 agent-style "sweet spot" indicates that selection, not generation, is the limiting factor.

Check out the Paper. The post Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use Mixture appeared first on MarkTechPost.



A Coding Implementation to Build a Transformer-Based Regression Language Model to Predict Continuous Values from Text

In this coding implementation, we build a Regression Language Model (RLM), a model that predicts continuous numerical values directly from text sequences. Instead of classifying or generating text, we focus on training a transformer-based architecture that learns quantitative relationships hidden within natural language descriptions. We start by generating synthetic text-to-number data, tokenizing it efficiently, and then training a lightweight Transformer encoder to map linguistic cues to real-valued targets. By the end, we not only understand how RLMs can be implemented from scratch but also visualize their learning behavior and test their generalization on unseen examples. Check out the FULL CODES here.

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
from collections import Counter
import re

torch.manual_seed(42)
np.random.seed(42)

# Select the compute device here; the training loop below references it.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Regression Language Model (RLM) Tutorial")
print("=" * 60)

We begin by importing essential libraries, such as PyTorch, NumPy, and Matplotlib, to build and visualize our Regression Language Model. We set random seeds to ensure reproducibility and initialize the environment, thereby guaranteeing consistent results each time the tutorial is run. Check out the FULL CODES here.

def generate_synthetic_data(n_samples=2000):
    """Generate synthetic text-to-number regression data."""
    templates = [
        ("The temperature is {} degrees", lambda x: x),
        ("I rate this {} out of ten", lambda x: x),
        ("The price is {} dollars", lambda x: x),
        ("Confidence level: {}", lambda x: x / 100),
        ("Speed of {} kilometers per hour", lambda x: x / 10),
        ("{} percent complete", lambda x: x / 100),
        ("Scored {} points in the game", lambda x: x / 10),
        ("The distance is {} meters", lambda x: x),
    ]
    data = []
    for _ in range(n_samples):
        template, transform = templates[np.random.randint(len(templates))]
        value = np.random.uniform(0, 100)
        text = template.format(round(value, 1))
        target = transform(value)
        data.append((text, target))
    return data

We create a synthetic dataset that pairs natural language sentences with corresponding numerical values. By using varied templates such as temperatures, ratings, and percentages, we ensure the model learns diverse text–number relationships. This controlled setup helps us simulate realistic regression tasks without relying on external data. Check out the FULL CODES here.

class SimpleTokenizer:
    def __init__(self):
        self.word2idx = {"<PAD>": 0, "<UNK>": 1}
        self.idx2word = {0: "<PAD>", 1: "<UNK>"}
        self.vocab_size = 2

    def fit(self, texts):
        """Build the vocabulary from texts."""
        words = []
        for text in texts:
            words.extend(re.findall(r'\w+|[^\w\s]', text.lower()))
        word_counts = Counter(words)
        for word, _ in word_counts.most_common():
            if word not in self.word2idx:
                self.word2idx[word] = self.vocab_size
                self.idx2word[self.vocab_size] = word
                self.vocab_size += 1

    def encode(self, text, max_len=20):
        """Convert text to token indices, padded or truncated to max_len."""
        words = re.findall(r'\w+|[^\w\s]', text.lower())
        indices = [self.word2idx.get(w, 1) for w in words]
        if len(indices) < max_len:
            indices += [0] * (max_len - len(indices))
        else:
            indices = indices[:max_len]
        return indices

We design a simple tokenizer to convert raw text into numerical tokens that the model can process.
It builds a vocabulary from all unique words and maps each to an index, handling unknown words and padding automatically. This step ensures our textual inputs are transformed into consistent, machine-readable sequences for training. Check out the FULL CODES here.

class RLMDataset(Dataset):
    def __init__(self, data, tokenizer, max_len=20):
        self.data = data
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text, target = self.data[idx]
        tokens = self.tokenizer.encode(text, self.max_len)
        return torch.tensor(tokens), torch.tensor([target], dtype=torch.float32)


class RegressionLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_heads=4, num_layers=2, dropout=0.1, max_len=20):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.position_embedding = nn.Embedding(max_len, embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=embed_dim * 4,
            dropout=dropout,
            batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.fc1 = nn.Linear(embed_dim, 64)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(64, 1)
        self.max_len = max_len

    def forward(self, x):
        batch_size, seq_len = x.shape
        positions = torch.arange(0, seq_len, device=x.device).unsqueeze(0).expand(batch_size, -1)
        token_embed = self.token_embedding(x)
        pos_embed = self.position_embedding(positions)
        embeddings = token_embed + pos_embed
        padding_mask = (x == 0)
        encoded = self.transformer(embeddings, src_key_padding_mask=padding_mask)
        # Mean-pool over non-padded positions only.
        mask_expanded = (~padding_mask).unsqueeze(-1).float()
        summed = (encoded * mask_expanded).sum(dim=1)
        pooled = summed / mask_expanded.sum(dim=1)
        x = self.fc1(pooled)
        x = self.relu(x)
        x = self.dropout(x)
        output = self.fc2(x)
        return output

We package our text–number pairs into a PyTorch Dataset, where we tokenize each sentence and return tensors ready for batching. We then build a Transformer-based RLM: token and positional embeddings flow through a multi-layer encoder, we mean-pool non-padded tokens, and feed the result to a small MLP head for regression. In effect, we allow the encoder to learn numerical cues from language, while the head maps them to a single continuous value. Check out the FULL CODES here.
def train_rlm(model, train_loader, val_loader, epochs=15, lr=0.001):
    model = model.to(device)  # keep the model on the same device as the batches
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    train_losses, val_losses = [], []
    print(f"\nTraining on {device}")
    print("-" * 60)
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        for tokens, targets in train_loader:
            tokens, targets = tokens.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = model(tokens)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        train_loss /= len(train_loader)
        train_losses.append(train_loss)

        model.eval()
        val_loss = 0
        with torch.no_grad():
            for tokens, targets in val_loader:
                tokens, targets = tokens.to(device), targets.to(device)
                outputs = model(tokens)
                loss = criterion(outputs, targets)
                val_loss += loss.item()
        val_loss /= len(val_loader)
        val_losses.append(val_loss)
        print(f"Epoch {epoch+1:2d}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
    return train_losses, val_losses

We train the model using Adam and MSE loss on a GPU, if available, iterating over mini-batches to backpropagate and update weights. We switch to evaluation mode for validation at the end of each epoch, track training and validation losses, and print progress so we can see the learning dynamics in real time. Check out the FULL CODES here.

print("\nGenerating synthetic data...")
data = generate_synthetic_data(2000)
split_idx = int(0.8 * len(data))
train_data, val_data = data[:split_idx], data[split_idx:]
print(f"Train samples: {len(train_data)}, Val samples: {len(val_data)}")

print("\nBuilding tokenizer...")
tokenizer = SimpleTokenizer()
tokenizer.fit([text for text, _ in train_data])
print(f"Vocabulary size: {tokenizer.vocab_size}")

train_dataset = RLMDataset(train_data, tokenizer)
val_dataset = RLMDataset(val_data, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

print("\nBuilding Regression Language Model...")
model = RegressionLanguageModel(vocab_size=tokenizer.vocab_size)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

train_losses, val_losses = train_rlm(model, train_loader, val_loader)

plt.figure(figsize=(10, 4))
plt.plot(train_losses, label='Train Loss', linewidth=2)
plt.plot(val_losses, label='Val Loss', linewidth=2)
plt.xlabel('Epoch')
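The excerpt ends with the loss-curve plot. As a usage sketch (an assumption on our part, not part of the original tutorial text), the trained model and tokenizer defined above can score an unseen sentence like this:

def predict(model, tokenizer, text, max_len=20):
    # Hedged usage sketch: run one sentence through the trained RLM and return the prediction.
    model.eval()
    dev = next(model.parameters()).device
    tokens = torch.tensor([tokenizer.encode(text, max_len)]).to(dev)
    with torch.no_grad():
        return model(tokens).item()

print(predict(model, tokenizer, "The temperature is 42.0 degrees"))  # expected to be close to 42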



This AI Paper Proposes a Novel Dual-Branch Encoder-Decoder Architecture for Unsupervised Speech Enhancement (SE)

Can a speech enhancer trained only on real noisy recordings cleanly separate speech and noise, without ever seeing paired data?

A team of researchers from Brno University of Technology and Johns Hopkins University proposes Unsupervised Speech Enhancement using Data-defined Priors (USE-DDP), a dual-stream encoder–decoder that separates any noisy input into two waveforms, estimated clean speech and residual noise, and learns both solely from unpaired datasets (a clean-speech corpus and an optional noise corpus). Training enforces that the sum of the two outputs reconstructs the input waveform, avoiding degenerate solutions and aligning the design with neural audio codec objectives. https://arxiv.org/pdf/2509.22942

Why this is important

Most learning-based speech enhancement pipelines depend on paired clean–noisy recordings, which are expensive or impossible to collect at scale in real-world conditions. Unsupervised routes like MetricGAN-U remove the need for clean data but couple model performance to the external, non-intrusive metrics used during training. USE-DDP keeps training data-only, imposing priors with discriminators over independent clean-speech and noise datasets and using reconstruction consistency to tie estimates back to the observed mixture.

How it works

Generator: A codec-style encoder compresses the input audio into a latent sequence; this is split into two parallel transformer branches (RoFormer) that target clean speech and noise respectively, decoded by a shared decoder back to waveforms. The input is reconstructed as the least-squares combination of the two outputs (scalars α and β compensate for amplitude errors). Reconstruction uses multi-scale mel/STFT and SI-SDR losses, as in neural audio codecs.

Priors via adversaries: Three discriminator ensembles (clean, noise, and noisy) impose distributional constraints: the clean branch must resemble the clean-speech corpus, the noise branch must resemble a noise corpus, and the reconstructed mixture must sound natural. LS-GAN and feature-matching losses are used.

Initialization: Initializing the encoder/decoder from a pretrained Descript Audio Codec improves convergence and final quality vs. training from scratch.
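To make the reconstruction step concrete, here is a hedged sketch of the least-squares recombination described above (shapes, the toy branch outputs, and the stand-in loss are illustrative assumptions, not the paper's training code):

# Hedged sketch: recombine estimated speech s and noise n with least-squares scalars
# so that alpha*s + beta*n best matches the input mixture x.
import torch

def recombine(x, s, n):
    # x, s, n: 1-D waveforms of equal length
    A = torch.stack([s, n], dim=1)                                  # (T, 2)
    coeffs = torch.linalg.lstsq(A, x.unsqueeze(1)).solution.squeeze(1)
    alpha, beta = coeffs[0], coeffs[1]
    x_hat = alpha * s + beta * n
    return x_hat, alpha, beta

x = torch.randn(16000)                                              # 1 s of dummy audio at 16 kHz
s = 0.7 * x + 0.05 * torch.randn(16000)                             # toy "clean speech" branch output
n = 0.3 * x                                                         # toy "residual noise" branch output
x_hat, alpha, beta = recombine(x, s, n)
recon_loss = torch.mean((x - x_hat) ** 2)                           # stands in for the multi-scale losses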
How it compares

On the standard VCTK+DEMAND simulated setup, USE-DDP reports parity with the strongest unsupervised baselines (e.g., unSE/unSE+ based on optimal transport) and competitive DNSMOS vs. MetricGAN-U (which directly optimizes DNSMOS). Example numbers from the paper's Table 1 (input vs. systems): DNSMOS improves from 2.54 (noisy) to ~3.03 (USE-DDP), and PESQ from 1.97 to ~2.47; CBAK trails some baselines due to more aggressive noise attenuation in non-speech segments, consistent with the explicit noise prior.

Data choice is not a detail—it's the result

A central finding: which clean-speech corpus defines the prior can swing outcomes and even create over-optimistic results on simulated tests.

In-domain prior (VCTK clean) on VCTK+DEMAND → best scores (DNSMOS ≈ 3.03), but this configuration unrealistically "peeks" at the target distribution used to synthesize the mixtures.

Out-of-domain prior → notably lower metrics (e.g., PESQ ~2.04), reflecting distribution mismatch and some noise leakage into the clean branch.

Real-world CHiME-3: using a "close-talk" channel as the in-domain clean prior actually hurts, because the "clean" reference itself contains environment bleed; an out-of-domain, truly clean corpus yields higher DNSMOS/UTMOS on both dev and test, albeit with some intelligibility trade-off under stronger suppression.

This clarifies discrepancies across prior unsupervised results and argues for careful, transparent prior selection when claiming SOTA on simulated benchmarks.

Our Comments

The proposed dual-branch encoder–decoder architecture treats enhancement as explicit two-source estimation with data-defined priors, not metric-chasing. The reconstruction constraint (clean + noise = input) plus adversarial priors over independent clean and noise corpora gives a clear inductive bias, and initializing from a neural audio codec is a pragmatic way to stabilize training. The results look competitive with unsupervised baselines while avoiding DNSMOS-guided objectives; the caveat is that the choice of clean prior materially affects reported gains, so claims should specify corpus selection.

Check out the Paper. The post This AI Paper Proposes a Novel Dual-Branch Encoder-Decoder Architecture for Unsupervised Speech Enhancement (SE) appeared first on MarkTechPost.



The Download: using AI to discover “zero day” vulnerabilities, and Apple’s ICE app removal

This is today's edition of The Download, our weekday newsletter that provides a daily dose of what's going on in the world of technology.

Microsoft says AI can create "zero day" threats in biology

A team at Microsoft says it used artificial intelligence to discover a "zero day" vulnerability in the biosecurity systems used to prevent the misuse of DNA. These screening systems are designed to stop people from purchasing genetic sequences that could be used to create deadly toxins or pathogens. But now researchers say they have figured out how to bypass the protections in a way previously unknown to defenders. Read the full story.

—Antonio Regalado

If you're interested in learning more about AI and biology, check out:
+ AI-designed viruses are here and already killing bacteria. Read the full story.
+ OpenAI is making a foray into longevity science with an AI built to help manufacture stem cells.
+ AI is dreaming up drugs that no one has ever seen. Now we've got to see if they work.

The must-reads

I've combed the internet to find you today's most fun/important/scary/fascinating stories about technology.

1 Apple removed an app for reporting ICE officer sightings
The US Attorney General requested it take down ICEBlock—and Apple complied. (Insider $)
+ Apple says the removal was down to the safety risk it posed. (Bloomberg $)
+ The company had a similar explanation for removing a Hong Kong map app back in 2019. (The Verge)

2 OpenAI's parental controls are easily circumvented
Its alerts about teenagers' concerning conversations also took hours to deliver. (WP $)
+ The looming crackdown on AI companionship. (MIT Technology Review)

3 VCs have sunk a record amount into AI startups this year
To the tune of $192.7 billion so far. (Bloomberg $)
+ The AI bubble is looking increasingly precarious, though. (FT $)
+ How to fine-tune AI for prosperity. (MIT Technology Review)

4 The US federal vaccination schedule is still waiting for an update
Officials are yet to sign off on recommendations for this year's updated Covid shots. (Ars Technica)
+ Many people have been left unable to get vaccinated. (NPR)

5 The US Department of Energy has canceled yet more clean energy projects
In mostly blue states. (TechCrunch)
+ More than 300 funding awards have been axed. (CNBC)
+ How to make clean energy progress under Trump in the states. (MIT Technology Review)

6 TikTok recommends pornography to children's accounts
Despite activating its "restricted mode" to prevent sexualized content. (BBC)

7 China has launched a new skilled worker visa program
In the wake of the US H-1B visa clampdown. (Wired $)
+ The initiative hasn't gone down well with locals. (BBC)

8 Flights were grounded in Germany after several drone sightings
NATO members are worried about suspected Russian incursions in their skies. (WSJ $)
+ It's the latest in a string of airspace sightings. (FT $)

9 How YouTube is shaking up Hollywood
Its powerful creators are starting to worry the entertainment establishment—and Netflix. (FT $)

10 Anti-robocall tools are getting better
Call screening features are a useful first line of defense. (NYT $)

Quote of the day

"Capitulating to an authoritarian regime is never the right move."

—Joshua Aaron, the developer of ICEBlock, the app that crowdsources sightings of ICE officials, hits back at Apple's decision to remove it from the App Store, 404 Media reports.

One more thing

How AI can help supercharge creativity

Existing generative tools can automate a striking range of creative tasks and offer near-instant gratification—but at what cost?

Some artists and researchers fear that such technology could turn us into passive consumers of yet more AI slop. And so they are looking for ways to inject human creativity back into the process: working on what's known as co-creativity or more-than-human creativity. The aim is to develop AI tools that augment our creativity rather than strip it from us. Read the full story.

—Will Douglas Heaven

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet 'em at me.)

+ Congratulations to Fizz, the very handsome UK cat of the year!
+ What it took to transform actor Jeremy Allen White into the one and only Boss in his new film, Deliver Me from Nowhere.
+ Divers have salvaged more than 1,000 gold and silver coins from a 1715 shipwreck off the east coast of Florida.
+ The internet is obsessed with crabs. But why?

