YouZum

News

AI, Committee, News, Uncategorized

SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations

arXiv:2510.04398v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed in high-risk domains. However, state-of-the-art LLMs often produce hallucinations, raising serious concerns about their reliability. Prior work has explored adversarial attacks for hallucination elicitation in LLMs, but it often produces unrealistic prompts, either by inserting gibberish tokens or by altering the original meaning. As a result, these approaches offer limited insight into how hallucinations may occur in practice. While adversarial attacks in computer vision often involve realistic modifications to input images, the problem of finding realistic adversarial prompts for eliciting LLM hallucinations has remained largely underexplored. To address this gap, we propose Semantically Equivalent and Coherent Attacks (SECA) to elicit hallucinations via realistic modifications to the prompt that preserve its meaning while maintaining semantic coherence. Our contributions are threefold: (i) we formulate finding realistic attacks for hallucination elicitation as a constrained optimization problem over the input prompt space under semantic equivalence and coherence constraints; (ii) we introduce a constraint-preserving zeroth-order method to effectively search for adversarial yet feasible prompts; and (iii) we demonstrate through experiments on open-ended multiple-choice question answering tasks that SECA achieves higher attack success rates while incurring almost no constraint violations compared to existing methods. SECA highlights the sensitivity of both open-source and commercial gradient-inaccessible LLMs to realistic and plausible prompt variations. Code is available at https://github.com/Buyun-Liang/SECA.
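
To make the constrained search concrete, here is a minimal zeroth-order sketch under stated assumptions: propose_paraphrase, is_equivalent_and_coherent, and hallucination_score are hypothetical stand-ins for the LLM-based paraphraser, the semantic-equivalence/coherence constraint checks, and the attack objective described in the abstract, so this illustrates the accept/reject search loop rather than SECA itself.

# Constraint-preserving zeroth-order search over prompt paraphrases (illustrative sketch).
import random

random.seed(0)

def propose_paraphrase(prompt: str) -> str:
    # Placeholder: SECA would generate a semantically equivalent rewrite with an LLM;
    # here we only swap a few words for illustration.
    swaps = {"Which": "What", "choose": "select", "answer": "response"}
    out = prompt
    for a, b in swaps.items():
        if a in out and random.random() < 0.5:
            out = out.replace(a, b)
    return out

def is_equivalent_and_coherent(original: str, candidate: str) -> bool:
    # Placeholder constraint check; the paper uses semantic-equivalence and
    # coherence constraints, not a length heuristic.
    return bool(candidate) and abs(len(candidate) - len(original)) < 40

def hallucination_score(candidate: str) -> float:
    # Placeholder objective; the real attack queries the gradient-inaccessible
    # target LLM and scores how strongly the prompt elicits a hallucination.
    return random.random()

def zeroth_order_attack(prompt: str, iters: int = 50):
    best, best_score = prompt, hallucination_score(prompt)
    for _ in range(iters):
        cand = propose_paraphrase(best)
        if not is_equivalent_and_coherent(prompt, cand):
            continue  # reject candidates that violate the constraints
        score = hallucination_score(cand)
        if score > best_score:  # keep the most adversarial feasible prompt so far
            best, best_score = cand, score
    return best, best_score

print(zeroth_order_attack("Which of the following is the correct answer? Please choose one."))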


AI, Committee, News, Uncategorized

Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation

arXiv:2509.08825v2 Announce Type: replace Abstract: Large language models are rapidly transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis. However, LLM outputs vary significantly depending on the implementation choices made by researchers (e.g., model selection or prompting strategy). Such variation can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I (false positive), Type II (false negative), Type S (wrong sign), or Type M (exaggerated effect) errors. We call this phenomenon where configuration choices lead to incorrect conclusions LLM hacking. We find that intentional LLM hacking is strikingly simple. By replicating 37 data annotation tasks from 21 published social science studies, we show that, with just a handful of prompt paraphrases, virtually anything can be presented as statistically significant. Beyond intentional manipulation, our analysis of 13 million labels from 18 different LLMs across 2361 realistic hypotheses shows that there is also a high risk of accidental LLM hacking, even when following standard research practices. We find incorrect conclusions in approximately 31% of hypotheses for state-of-the-art LLMs, and in half the hypotheses for smaller language models. While higher task performance and stronger general model capabilities reduce LLM hacking risk, even highly accurate models remain susceptible. The risk of LLM hacking decreases as effect sizes increase, indicating the need for more rigorous verification of LLM-based findings near significance thresholds. We analyze 21 mitigation techniques and find that human annotations provide crucial protection against false positives. Common regression estimator correction techniques can restore valid inference but trade off Type I vs. Type II errors. We publish a list of practical recommendations to prevent LLM hacking.
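
The risk described here can be illustrated with a toy simulation (this is not the paper's 13-million-label analysis): two hypothetical annotator configurations with different error patterns label the same texts, and the downstream significance test can reach different conclusions about the same small true effect.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n = 500
group = rng.integers(0, 2, n)                      # e.g., treatment vs. control texts
true_label = (0.5 + 0.02 * group > rng.random(n))  # tiny true effect on the label rate

def annotate(true, flip_rate_pos, flip_rate_neg):
    # A hypothetical LLM configuration modeled as a noisy annotator with asymmetric errors.
    flips = np.where(true,
                     rng.random(true.size) < flip_rate_pos,
                     rng.random(true.size) < flip_rate_neg)
    return np.where(flips, ~true, true)

for name, miss_pos, miss_neg in [("config A", 0.05, 0.05), ("config B", 0.25, 0.02)]:
    labels = annotate(true_label, miss_pos, miss_neg).astype(float)
    t, p = ttest_ind(labels[group == 1], labels[group == 0])
    effect = labels[group == 1].mean() - labels[group == 0].mean()
    print(f"{name}: estimated effect = {effect:+.3f}, p = {p:.3f}")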


AI, Committee, News, Uncategorized

GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models

arXiv:2510.01252v2 Announce Type: replace Abstract: As large language models (LLMs) are increasingly trained on massive, uncurated corpora, understanding both model representations and the data they internalize has become a major challenge. In this work, we show that pairing LLMs with sparse autoencoders (SAEs) enables interpretation not only of model behavior but also of the deeper structures, themes, and biases embedded in the training data. We train a GPT-style transformer model exclusively on the novels of Jane Austen, a corpus rich in social constructs and narrative patterns. We then apply SAEs to hidden states across multiple layers, uncovering sparse, interpretable features that reflect the key narratives and concepts present in the corpus, including gender, class, and societal duty. Our findings demonstrate that LLMs combined with SAEs can act as scalable probes into complex datasets, offering a new path for corpus exploration, bias discovery, and model interpretability at scale.
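
For readers who want the mechanics, a minimal sparse autoencoder of the kind applied to hidden states can be written in a few lines of PyTorch; the layer sizes, the L1 penalty weight, and the random hidden_states placeholder below are illustrative assumptions, not the paper's configuration.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=256, d_features=2048):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        return self.decoder(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
hidden_states = torch.randn(4096, 256)    # placeholder for activations cached from the trained model

for step in range(100):
    batch = hidden_states[torch.randint(0, hidden_states.size(0), (256,))]
    recon, feats = sae(batch)
    # Reconstruction loss plus an L1 penalty that encourages sparse, interpretable features.
    loss = ((recon - batch) ** 2).mean() + 1e-3 * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final loss:", loss.item())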


AI, Committee, News, Uncategorized

Emission-GPT: A domain-specific language model agent for knowledge retrieval, emission inventory and data analysis

arXiv:2510.02359v1 Announce Type: new Abstract: Improving air quality and addressing climate change relies on accurate understanding and analysis of air pollutant and greenhouse gas emissions. However, emission-related knowledge is often fragmented and highly specialized, while existing methods for accessing and compiling emissions data remain inefficient. These issues hinder the ability of non-experts to interpret emissions information, posing challenges to research and management. To address this, we present Emission-GPT, a knowledge-enhanced large language model agent tailored for the atmospheric emissions domain. Built on a curated knowledge base of over 10,000 documents (including standards, reports, guidebooks, and peer-reviewed literature), Emission-GPT integrates prompt engineering and question completion to support accurate domain-specific question answering. Emission-GPT also enables users to interactively analyze emissions data via natural language, such as querying and visualizing inventories, analyzing source contributions, and recommending emission factors for user-defined scenarios. A case study in Guangdong Province demonstrates that Emission-GPT can extract key insights–such as point source distributions and sectoral trends–directly from raw data with simple prompts. Its modular and extensible architecture facilitates automation of traditionally manual workflows, positioning Emission-GPT as a foundational tool for next-generation emission inventory development and scenario-based assessment.
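
The retrieval-then-answer pattern described above can be sketched schematically; the document snippets, the keyword-overlap retriever, and the call_llm stub below are assumptions for illustration, not the Emission-GPT implementation.

docs = [
    "Emission factors for diesel generators are reported per unit of fuel or output (illustrative snippet).",
    "VOC emissions from solvent use are typically reported by sector and province.",
    "Point-source inventories list facility coordinates, stack parameters, and annual emissions.",
]

def retrieve(query, k=2):
    # Toy keyword-overlap score standing in for embedding-based retrieval over the knowledge base.
    scored = sorted(docs, key=lambda d: -len(set(query.lower().split()) & set(d.lower().split())))
    return scored[:k]

def call_llm(prompt):
    # Placeholder for the underlying LLM call with domain prompt engineering.
    return f"[LLM answer grounded in {prompt.count('CONTEXT')} retrieved context block(s)]"

query = "Which emission factor should I use for a diesel generator?"
context = "\n".join(f"CONTEXT: {d}" for d in retrieve(query))
print(call_llm(f"{context}\nQUESTION: {query}"))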


AI, Committee, News, Uncategorized

Beyond Imitation: Recovering Dense Rewards from Demonstrations

arXiv:2510.02493v1 Announce Type: cross Abstract: Conventionally, supervised fine-tuning (SFT) is treated as a simple imitation learning process that only trains a policy to imitate expert behavior on demonstration datasets. In this work, we challenge this view by establishing a fundamental equivalence between SFT and Inverse Reinforcement Learning. We prove that the SFT objective is a special case of Inverse Q-Learning, which implies that the SFT process does not just learn a policy, but also an implicit, dense, token-level reward model that explains the expert demonstrations. We then show how to recover this dense reward signal directly from the SFT model by formulating a baseline-relative reward function. The availability of such a dense reward model offers numerous benefits, providing granular credit assignment for each token generated. We demonstrate one key application by using these recovered rewards to further improve the policy with reinforcement learning. Our method, Dense-Path REINFORCE, consistently outperforms the original SFT models on instruction-following benchmarks. This work reframes SFT not merely as policy imitation but as a powerful reward learning mechanism, opening new possibilities for leveraging expert demonstrations.
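
The baseline-relative reward idea can be illustrated with a minimal sketch: score each demonstration token by the SFT model's log-probability minus a baseline model's log-probability. The tiny random logits below stand in for real models, and the exact reward form derived in the paper may differ, so treat this as an assumption-laden illustration rather than the authors' method.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, seq_len = 100, 6
tokens = torch.randint(0, vocab, (seq_len,))     # a demonstration continuation (token ids)

sft_logits = torch.randn(seq_len, vocab)         # placeholder for SFT model logits at each position
base_logits = torch.randn(seq_len, vocab)        # placeholder for reference/baseline model logits

sft_logp = F.log_softmax(sft_logits, dim=-1)[torch.arange(seq_len), tokens]
base_logp = F.log_softmax(base_logits, dim=-1)[torch.arange(seq_len), tokens]

dense_reward = sft_logp - base_logp              # per-token, baseline-relative credit
print(dense_reward)
# These per-token rewards could then drive a REINFORCE-style policy update,
# in the spirit of the paper's Dense-Path REINFORCE.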


AI, Committee, News, Uncategorized

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

arXiv:2503.04697v2 Announce Type: replace Abstract: Reasoning language models have shown an uncanny ability to improve performance at test-time by “thinking longer”-that is, by generating longer chain-of-thought sequences and hence using more compute. However, the length of their chain-of-thought reasoning is not controllable, making it impossible to allocate test-time compute to achieve a desired level of performance. We introduce Length Controlled Policy Optimization (LCPO), a simple reinforcement learning method that optimizes for accuracy and adherence to user-specified length constraints. We use LCPO to train L1, a reasoning language model that produces outputs satisfying a length constraint given in its prompt. L1’s length control allows for smoothly trading off computational cost and accuracy on a wide range of tasks, and outperforms the state-of-the-art S1 method for length control. Furthermore, we uncover an unexpected short chain-of-thought capability in models trained with LCPO. Specifically, using LCPO we derive Short Reasoning Models (SRMs), that exhibit similar reasoning patterns as full-length reasoning models, but can generate CoT lengths comparable to non-reasoning models. They demonstrate significant performance gains, for instance, our 1.5B L1 model surpasses GPT-4o at equal reasoning lengths. Overall, LCPO enables precise control over reasoning length, allowing for fine-grained allocation of test-time compute and accuracy. We release code and models at https://www.cmu-l3.github.io/l1
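
A hedged sketch of what a length-controlled reward might look like is shown below; the linear penalty and the alpha value are assumptions for illustration, not the paper's exact LCPO objective.

# Toy reward shaping in the spirit of LCPO: reward correctness while penalizing
# deviation from a user-specified chain-of-thought length budget.
def lcpo_style_reward(is_correct: bool, cot_tokens: int, target_tokens: int, alpha: float = 0.001) -> float:
    correctness = 1.0 if is_correct else 0.0
    length_penalty = alpha * abs(cot_tokens - target_tokens)  # assumed linear penalty
    return correctness - length_penalty

# Example: a correct answer that overshoots a 512-token budget by 300 tokens.
print(lcpo_style_reward(True, cot_tokens=812, target_tokens=512))   # 1.0 - 0.3 = 0.7
print(lcpo_style_reward(True, cot_tokens=512, target_tokens=512))   # 1.0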


AI, Committee, News, Uncategorized

StreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Intermediates Across FPGA Dataflows

Why treat LLM inference as batched kernels to DRAM when a dataflow compiler can pipe tiles through on-chip FIFOs and stream converters? StreamTensor is a compiler that lowers PyTorch LLM graphs (GPT-2, Llama, Qwen, Gemma) into stream-scheduled dataflow accelerators on AMD's Alveo U55C FPGA. The system introduces an iterative tensor ("itensor") type to encode the tiling and ordering of streams, enabling provably correct inter-kernel streaming and automated insertion and sizing of DMA engines, FIFOs, and layout converters. On LLM decoding workloads, the research team reports latency as low as 0.64× that of a GPU baseline and up to 1.99× higher energy efficiency. (Paper: https://arxiv.org/pdf/2509.13694)

What StreamTensor does

StreamTensor compiles PyTorch graphs into a stream-oriented dataflow design so that intermediate tiles largely avoid off-chip DRAM round-trips: results are fused and streamed through on-chip FIFOs to downstream kernels, and DMAs are inserted only when required. The compiler's central abstraction, the iterative tensor (itensor), records iteration order, tiling, and layout, which makes inter-kernel stream compatibility explicit and drives converter generation only where needed. The framework also searches hierarchically over tiling, fusion, and resource allocation, and uses a linear program to size FIFOs so as to avoid stalls or deadlock while minimizing on-chip memory.

What's actually new

- Hierarchical design-space exploration (DSE). The compiler explores three design spaces: (i) tiling/unroll/vectorization/permutation at the Linalg level, (ii) fusion under memory/resource constraints, and (iii) resource allocation/stream widths, optimizing for sustained throughput under bandwidth limits.
- End-to-end PyTorch-to-device flow. Models enter via Torch-MLIR, are transformed to MLIR Linalg, and then into a dataflow IR whose nodes become hardware kernels with explicit streams and host/runtime glue; no manual RTL assembly is needed.
- Iterative tensor (itensor) typing system. A first-class tensor type expresses iteration order, tiling, and affine maps. This makes stream order explicit, allows safe kernel fusion, and lets the compiler synthesize minimal buffer/format converters when producers and consumers disagree.
- Formal FIFO sizing. Inter-kernel buffering is solved with a linear-programming formulation that avoids stalls and deadlocks while minimizing on-chip memory usage (BRAM/URAM); a toy version of this formulation is sketched at the end of this entry.

Results

- Latency: up to 0.76× vs. prior FPGA LLM accelerators and 0.64× vs. a GPU baseline on GPT-2.
- Energy efficiency: up to 1.99× vs. an A100 on emerging LLMs (model-dependent).
- Platform context: Alveo U55C (16 GB HBM2 at 460 GB/s, PCIe Gen3 ×16 or dual Gen4 ×8, 2× QSFP28).

Our comments

The useful contribution here is a PyTorch-to-Torch-MLIR-to-dataflow compiler that emits stream-scheduled kernels and a host/runtime for AMD's Alveo U55C; the iterative tensor type plus linear-programming-based FIFO sizing enables safe inter-kernel streaming rather than DRAM round-trips. On the reported LLM decoding benchmarks across GPT-2, Llama, Qwen, and Gemma, the research team shows geometric-mean latency as low as 0.64× that of a GPU baseline and energy efficiency up to 1.99×, with scope limited to decoding workloads. The hardware context is clear: the Alveo U55C provides 16 GB HBM2 at 460 GB/s with dual QSFP28 and PCIe Gen3 ×16 or dual Gen4 ×8, which aligns with the streaming dataflow design.
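
The FIFO-sizing step can be illustrated with a toy linear program. The sketch below minimizes total buffer bits subject to per-edge minimum depths and an on-chip capacity bound, using scipy.optimize.linprog; the widths, minimum depths, and capacity are invented for illustration and are not derived from the paper's dataflow analysis.

import numpy as np
from scipy.optimize import linprog

widths_bits = np.array([512, 256, 512])      # stream widths per inter-kernel edge (assumed)
min_depth = np.array([64, 128, 32])          # assumed minimum depths to avoid stalls/deadlock
capacity_bits = 1_000_000                    # assumed on-chip budget for these FIFOs

c = widths_bits                              # minimize total bits = sum(width_i * depth_i)
A_ub = [widths_bits]                         # total bits must stay within the capacity budget
b_ub = [capacity_bits]
bounds = [(lo, None) for lo in min_depth]    # depth_i >= min_depth_i

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print("FIFO depths:", res.x, "total bits:", res.x @ widths_bits)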


AI, Committee, News, Uncategorized

How to Evaluate Voice Agents in 2025: Beyond Automatic Speech Recognition (ASR) and Word Error Rate (WER) to Task Success, Barge-In, and Hallucination-Under-Noise

Table of contents

- Why WER Isn't Enough
- What to Measure (and How)
- Benchmark Landscape: What Each Covers
- Filling the Gaps: What You Still Need to Add
- A Concrete, Reproducible Evaluation Plan
- References

Optimizing only for Automatic Speech Recognition (ASR) and Word Error Rate (WER) is insufficient for modern, interactive voice agents. Robust evaluation must measure end-to-end task success, barge-in behavior and latency, and hallucination-under-noise, alongside ASR, safety, and instruction following. VoiceBench offers a multi-facet speech-interaction benchmark across general knowledge, instruction following, safety, and robustness to speaker/environment/content variations, but it does not cover barge-in or real-device task completion. SLUE (and Phase-2) target spoken language understanding (SLU); MASSIVE and Spoken-SQuAD probe multilingual and spoken QA; DSTC tracks add spoken, task-oriented robustness. Combine these with explicit barge-in/endpointing tests, user-centric task-success measurement, and controlled noise-stress protocols to obtain a complete picture.

Why WER Isn't Enough

WER measures transcription fidelity, not interaction quality. Two agents with similar WER can diverge widely in dialog success because latency, turn-taking, misunderstanding recovery, safety, and robustness to acoustic and content perturbations dominate user experience. Prior work on real systems shows the need to evaluate user satisfaction and task success directly; for example, Cortana's automatic online evaluation predicted user satisfaction from in-situ interaction signals, not only ASR accuracy.

What to Measure (and How)

1) End-to-End Task Success

Metrics: Task Success Rate (TSR) with strict success criteria per task (goal completion, constraints met), plus Task Completion Time (TCT) and Turns-to-Success.
Why: Real assistants are judged by outcomes. Competitions like the Alexa Prize TaskBot explicitly measured users' ability to finish multi-step tasks (e.g., cooking, DIY) with ratings and completion.
Protocol: Define tasks with verifiable endpoints (e.g., "assemble a shopping list with N items and constraints"). Use blinded human raters and automatic logs to compute TSR/TCT/Turns. For multilingual/SLU coverage, draw task intents and slots from MASSIVE.

2) Barge-In and Turn-Taking

Metrics:
- Barge-In Detection Latency (ms): time from user onset to TTS suppression.
- True/False Barge-In Rates: correct interruptions vs. spurious stops.
- Endpointing Latency (ms): time to ASR finalization after the user stops speaking.
Why: Smooth interruption handling and fast endpointing determine perceived responsiveness. Research formalizes barge-in verification and continuous barge-in processing; endpointing latency continues to be an active area in streaming ASR.
Protocol: Script prompts where the user interrupts TTS at controlled offsets and SNRs. Measure suppression and recognition timings with high-precision logs (frame timestamps). Include noisy/echoic far-field conditions. Classic and modern studies provide recovery and signaling strategies that reduce false barge-ins.

3) Hallucination-Under-Noise (HUN)

Metric: HUN Rate, the fraction of outputs that are fluent but semantically unrelated to the audio, under controlled noise or non-speech audio.
Why: ASR and audio-LLM stacks can emit "convincing nonsense," especially with non-speech segments or noise overlays. Recent work defines and measures ASR hallucinations; targeted studies show Whisper hallucinations induced by non-speech sounds.
Protocol: Construct audio sets with additive environmental noise (varied SNRs), non-speech distractors, and content disfluencies. Score semantic relatedness (human judgment with adjudication) and compute HUN. Track whether downstream agent actions propagate hallucinations into incorrect task steps.

4) Instruction Following, Safety, and Robustness

Metric families:
- Instruction-Following Accuracy (format and constraint adherence).
- Safety Refusal Rate on adversarial spoken prompts.
- Robustness Deltas across speaker age/accent/pitch, environment (noise, reverb, far-field), and content noise (grammar errors, disfluencies).
Why: VoiceBench explicitly targets these axes with spoken instructions (real and synthetic) spanning general knowledge, instruction following, and safety; it perturbs speaker, environment, and content to probe robustness.
Protocol: Use VoiceBench for breadth on speech-interaction capabilities; report aggregate and per-axis scores. For SLU specifics (NER, dialog acts, QA, summarization), leverage SLUE and Phase-2.

5) Perceptual Speech Quality (for TTS and Enhancement)

Metric: Subjective Mean Opinion Score via ITU-T P.808 (crowdsourced ACR/DCR/CCR).
Why: Interaction quality depends on both recognition and playback quality. P.808 gives a validated crowdsourcing protocol with open-source tooling.

Benchmark Landscape: What Each Covers

VoiceBench (2024)
Scope: Multi-facet voice assistant evaluation with spoken inputs covering general knowledge, instruction following, safety, and robustness across speaker/environment/content variations; uses both real and synthetic speech.
Limitations: Does not benchmark barge-in/endpointing latency or real-world task completion on devices; focuses on response correctness and safety under variations.

SLUE / SLUE Phase-2
Scope: Spoken language understanding tasks (NER, sentiment, dialog acts, named-entity localization, QA, summarization); designed to study end-to-end vs. pipeline sensitivity to ASR errors.
Use: Great for probing SLU robustness and pipeline fragility in spoken settings.

MASSIVE
Scope: More than 1M virtual-assistant utterances across 51–52 languages with intents and slots; a strong fit for multilingual task-oriented evaluation.
Use: Build multilingual task suites and measure TSR/slot F1 under speech conditions (paired with TTS or read speech).

Spoken-SQuAD / HeySQuAD and Related Spoken-QA Sets
Scope: Spoken question answering to test ASR-aware comprehension and multi-accent robustness.
Use: Stress-test comprehension under speech errors; not a full agent task suite.

DSTC (Dialog System Technology Challenge) Tracks
Scope: Robust dialog modeling with spoken, task-oriented data; human ratings alongside automatic metrics; recent tracks emphasize multilinguality, safety, and evaluation dimensionality.
Use: Complementary for dialog quality, DST, and knowledge-grounded responses under speech conditions.

Real-World Task Assistance (Alexa Prize TaskBot)
Scope: Multi-step task assistance with user ratings and success criteria (cooking/DIY).
Use: Gold-standard inspiration for defining TSR and interaction KPIs; the public reports describe evaluation focus and outcomes.

Filling the Gaps: What You Still Need to Add

- Barge-In and Endpointing KPIs. Add explicit measurement harnesses. The literature offers barge-in verification and continuous processing strategies; streaming ASR endpointing latency remains an active research topic. Track barge-in detection latency, suppression correctness, endpointing delay, and false barge-ins (a minimal computation sketch appears at the end of this entry).
- Hallucination-Under-Noise (HUN) Protocols. Adopt emerging ASR-hallucination definitions and controlled noise/non-speech tests; report the HUN rate and its impact on downstream actions.
- On-Device Interaction Latency. Correlate user-perceived latency with streaming ASR designs (e.g., transducer variants); measure time-to-first-token, time-to-final, and local processing overhead.
- Cross-Axis Robustness Matrices. Combine VoiceBench's speaker/environment/content axes with your task suite (TSR) to expose failure surfaces (e.g., barge-in under far-field echo, task success at low SNR, multilingual slots under accent shift).
- Perceptual Quality for Playback. Use ITU-T P.808 (with the open P.808 toolkit) to quantify user-perceived TTS quality in your end-to-end loop, not just ASR.

A Concrete, Reproducible Evaluation Plan

Assemble the Suite
- Speech-Interaction Core: VoiceBench for knowledge, instruction following, safety, and robustness axes.
- SLU Depth: SLUE/Phase-2 tasks (NER, dialog acts,
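
The interaction KPIs defined above (barge-in detection latency, endpointing latency, HUN rate) can be computed from simple event logs. The sketch below assumes a minimal log format, plain timestamps in seconds and boolean rater judgments, purely for illustration; a production harness would read these from frame-accurate system logs.

def barge_in_latency(user_speech_onset, tts_suppressed):
    # Time from user onset to TTS suppression, in milliseconds.
    return (tts_suppressed - user_speech_onset) * 1000.0

def endpointing_latency(user_speech_end, asr_final):
    # Time from end of user speech to the final ASR hypothesis, in milliseconds.
    return (asr_final - user_speech_end) * 1000.0

def hun_rate(judgments):
    # judgments: list of booleans, True if a rater marked the output as fluent
    # but semantically unrelated to the audio (a hallucination).
    return sum(judgments) / len(judgments)

print(barge_in_latency(12.40, 12.58))        # 180 ms
print(endpointing_latency(20.10, 20.45))     # ~350 ms
print(hun_rate([False, True, False, False])) # 0.25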


AI, Committee, News, Uncategorized

Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use Mixture

What if, instead of re-sampling one agent, you could push Gemini-2.5 Pro to 34.1% on HLE by mixing 12–15 tool-using agents that share notes and stop early? Google Cloud AI Research, with collaborators from MIT, Harvard, and Google DeepMind, introduced TUMIX (Tool-Use Mixture), a test-time framework that ensembles heterogeneous agent styles (text-only, code, search, guided variants), lets them share intermediate answers over a few refinement rounds, and then stops early via an LLM-based judge. The result: higher accuracy at lower cost on hard reasoning benchmarks such as HLE, GPQA-Diamond, and AIME (2024/2025). (Paper: https://arxiv.org/pdf/2510.01279)

So, what exactly is new?

- Mixture over modality, not just more samples: TUMIX runs ~15 agent styles spanning Chain-of-Thought (CoT), code execution, web search, dual-tool agents, and guided variants. Each round, every agent sees (a) the original question and (b) the other agents' previous answers, then proposes a refined answer. This message-passing raises average accuracy early while diversity gradually collapses, so stopping matters.
- Adaptive early termination: An LLM-as-judge halts refinement once answers exhibit strong consensus (with a minimum round threshold). This preserves accuracy at ~49% of the inference cost of fixed-round refinement; token cost drops to ~46% because late rounds are token-heavier.
- Auto-designed agents: Beyond human-crafted agents, TUMIX prompts the base LLM to generate new agent types; mixing these with the manual set yields an additional ~+1.2% average lift without extra cost. The empirical sweet spot is ~12–15 agent styles.

How does it work?

TUMIX runs a group of heterogeneous agents (text-only Chain-of-Thought, code-executing, web-searching, and guided variants) in parallel, then iterates a small number of refinement rounds where each agent conditions on the original question plus the other agents' prior rationales and answers (structured note-sharing). After each round, an LLM-based judge evaluates consensus/consistency to decide early termination; if confidence is insufficient, another round is triggered, otherwise the system finalizes via simple aggregation (e.g., majority vote or a selector). This mixture-of-tool-use design trades brute-force re-sampling for diverse reasoning paths, improving coverage of correct candidates while controlling token/tool budgets; empirically, benefits saturate around 12–15 agent styles, and stopping early preserves diversity and lowers cost without sacrificing accuracy. A schematic sketch of this loop appears at the end of this entry.

Let's discuss the results

Under comparable inference budgets to strong tool-augmented baselines (Self-MoA, Symbolic-MoE, DEI, SciMaster, GSA), TUMIX yields the best average accuracy; a scaled variant (TUMIX+) pushes further with more compute:

- HLE (Humanity's Last Exam): Pro: 21.6% → 34.1% (TUMIX+); Flash: 9.7% → 23.1%. (HLE is a difficult, multi-domain, 2,500-question benchmark finalized in 2025.)
- GPQA-Diamond: Pro: up to 88.3%; Flash: up to 82.1%. (GPQA-Diamond is the hardest 198-question subset, authored by domain experts.)
- AIME 2024/25: Pro: 96.7%; Flash: 86.7% with TUMIX(+) at test time.

Across tasks, TUMIX averages +3.55% over the best prior tool-augmented test-time scaling baseline at similar cost, and +7.8% / +17.4% over no scaling for Pro and Flash, respectively.

Our comments

TUMIX is a great approach from Google because it frames test-time scaling as a search problem over heterogeneous tool policies rather than brute-force sampling. The parallel committee (text, code, search) improves candidate coverage, while the LLM judge enables early stopping that preserves diversity and reduces token/tool spend, which is useful under latency budgets. The HLE gains (34.1% with Gemini-2.5 Pro) align with the benchmark's finalized 2,500-question design, and the ~12–15 agent-styles sweet spot indicates that selection, not generation, is the limiting factor.
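
To make the refinement-plus-early-stop loop concrete, here is a schematic sketch under stated assumptions: the agents are random stubs rather than LLM or tool calls, and the judge is replaced by a simple consensus threshold, so this illustrates the control flow only, not the paper's implementation.

from collections import Counter
import random

random.seed(0)

def make_agent(style):
    # `style` is only a label in this sketch; a real agent would differ in tools and prompts.
    def agent(question, peer_answers):
        # Placeholder: a real agent would call an LLM, run code, or search the web,
        # conditioning on the question plus the peers' notes from the previous round.
        if peer_answers and random.random() < 0.6:
            return Counter(peer_answers).most_common(1)[0][0]  # drift toward the emerging consensus
        return random.choice(["A", "B", "C"])
    return agent

agents = [make_agent(s) for s in ["cot", "code", "search"] * 5]   # ~15 heterogeneous agent styles

def judge_says_stop(answers, min_round, rnd, threshold=0.8):
    # Stand-in for the LLM judge: stop once a large enough fraction of agents agree.
    top_count = Counter(answers).most_common(1)[0][1]
    return rnd >= min_round and top_count / len(answers) >= threshold

answers = []
for rnd in range(5):                       # a few refinement rounds
    answers = [a("question text", answers) for a in agents]
    if judge_says_stop(answers, min_round=1, rnd=rnd):
        break

print("final answer:", Counter(answers).most_common(1)[0][0], "after round", rnd + 1)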


AI, Committee, News, Uncategorized

A Coding Implementation to Build a Transformer-Based Regression Language Model to Predict Continuous Values from Text

In this coding implementation, we build a Regression Language Model (RLM), a model that predicts continuous numerical values directly from text sequences. Instead of classifying or generating text, we focus on training a transformer-based architecture that learns quantitative relationships hidden within natural language descriptions. We start by generating synthetic text-to-number data, tokenize it efficiently, and then train a lightweight Transformer encoder to map linguistic cues to real-valued targets. By the end, we not only understand how RLMs can be implemented from scratch but also visualize their learning behavior and test their generalization on unseen examples.

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
from collections import Counter
import re

torch.manual_seed(42)
np.random.seed(42)

# Select the compute device up front; it is referenced by the training loop below.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Regression Language Model (RLM) Tutorial")
print("=" * 60)

We begin by importing the essential libraries, such as PyTorch, NumPy, and Matplotlib, to build and visualize our Regression Language Model. We set random seeds to ensure reproducibility, pick the compute device, and initialize the environment, guaranteeing consistent results each time the tutorial is run.

def generate_synthetic_data(n_samples=2000):
    """Generate synthetic text-to-number regression data"""
    templates = [
        ("The temperature is {} degrees", lambda x: x),
        ("I rate this {} out of ten", lambda x: x),
        ("The price is {} dollars", lambda x: x),
        ("Confidence level: {}", lambda x: x / 100),
        ("Speed of {} kilometers per hour", lambda x: x / 10),
        ("{} percent complete", lambda x: x / 100),
        ("Scored {} points in the game", lambda x: x / 10),
        ("The distance is {} meters", lambda x: x),
    ]
    data = []
    for _ in range(n_samples):
        template, transform = templates[np.random.randint(len(templates))]
        value = np.random.uniform(0, 100)
        text = template.format(round(value, 1))
        target = transform(value)
        data.append((text, target))
    return data

We create a synthetic dataset that pairs natural language sentences with corresponding numerical values. By using varied templates such as temperatures, ratings, and percentages, we ensure the model learns diverse text–number relationships. This controlled setup helps us simulate realistic regression tasks without relying on external data.

class SimpleTokenizer:
    def __init__(self):
        self.word2idx = {"<PAD>": 0, "<UNK>": 1}
        self.idx2word = {0: "<PAD>", 1: "<UNK>"}
        self.vocab_size = 2

    def fit(self, texts):
        """Build vocabulary from texts"""
        words = []
        for text in texts:
            words.extend(re.findall(r'\w+|[^\w\s]', text.lower()))
        word_counts = Counter(words)
        for word, _ in word_counts.most_common():
            if word not in self.word2idx:
                self.word2idx[word] = self.vocab_size
                self.idx2word[self.vocab_size] = word
                self.vocab_size += 1

    def encode(self, text, max_len=20):
        """Convert text to token indices"""
        words = re.findall(r'\w+|[^\w\s]', text.lower())
        indices = [self.word2idx.get(w, 1) for w in words]
        if len(indices) < max_len:
            indices += [0] * (max_len - len(indices))
        else:
            indices = indices[:max_len]
        return indices

We design a simple tokenizer to convert raw text into numerical tokens that the model can process. It builds a vocabulary from all unique words and maps each to an index, handling unknown words and padding automatically. This step ensures our textual inputs are transformed into consistent, machine-readable sequences for training.

class RLMDataset(Dataset):
    def __init__(self, data, tokenizer, max_len=20):
        self.data = data
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text, target = self.data[idx]
        tokens = self.tokenizer.encode(text, self.max_len)
        return torch.tensor(tokens), torch.tensor([target], dtype=torch.float32)


class RegressionLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_heads=4, num_layers=2,
                 dropout=0.1, max_len=20):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.position_embedding = nn.Embedding(max_len, embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=embed_dim * 4,
            dropout=dropout,
            batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.fc1 = nn.Linear(embed_dim, 64)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(64, 1)
        self.max_len = max_len

    def forward(self, x):
        batch_size, seq_len = x.shape
        positions = torch.arange(0, seq_len, device=x.device).unsqueeze(0).expand(batch_size, -1)
        token_embed = self.token_embedding(x)
        pos_embed = self.position_embedding(positions)
        embeddings = token_embed + pos_embed
        padding_mask = (x == 0)
        encoded = self.transformer(embeddings, src_key_padding_mask=padding_mask)
        # Mean-pool over non-padded positions only.
        mask_expanded = (~padding_mask).unsqueeze(-1).float()
        summed = (encoded * mask_expanded).sum(dim=1)
        pooled = summed / mask_expanded.sum(dim=1)
        x = self.fc1(pooled)
        x = self.relu(x)
        x = self.dropout(x)
        output = self.fc2(x)
        return output

We package our text–number pairs into a PyTorch Dataset, where we tokenize each sentence and return tensors ready for batching. We then build a Transformer-based RLM: token and positional embeddings flow through a multi-layer encoder, we mean-pool the non-padded tokens, and we feed the result to a small MLP head for regression. In effect, we allow the encoder to learn numerical cues from language, while the head maps them to a single continuous value.

def train_rlm(model, train_loader, val_loader, epochs=15, lr=0.001):
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    train_losses, val_losses = [], []
    print(f"\nTraining on {device}")
    print("-" * 60)
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        for tokens, targets in train_loader:
            tokens, targets = tokens.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = model(tokens)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        train_loss /= len(train_loader)
        train_losses.append(train_loss)

        model.eval()
        val_loss = 0
        with torch.no_grad():
            for tokens, targets in val_loader:
                tokens, targets = tokens.to(device), targets.to(device)
                outputs = model(tokens)
                loss = criterion(outputs, targets)
                val_loss += loss.item()
        val_loss /= len(val_loader)
        val_losses.append(val_loss)
        print(f"Epoch {epoch+1:2d}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
    return train_losses, val_losses

We train the model using Adam and MSE loss on a GPU if one is available, iterating over mini-batches to backpropagate and update the weights. We switch to evaluation mode for validation at the end of each epoch, track training and validation losses, and print progress so we can see the learning dynamics in real time.

print("\nGenerating synthetic data...")
data = generate_synthetic_data(2000)
split_idx = int(0.8 * len(data))
train_data, val_data = data[:split_idx], data[split_idx:]
print(f"Train samples: {len(train_data)}, Val samples: {len(val_data)}")

print("\nBuilding tokenizer...")
tokenizer = SimpleTokenizer()
tokenizer.fit([text for text, _ in train_data])
print(f"Vocabulary size: {tokenizer.vocab_size}")

train_dataset = RLMDataset(train_data, tokenizer)
val_dataset = RLMDataset(val_data, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

print("\nBuilding Regression Language Model...")
model = RegressionLanguageModel(vocab_size=tokenizer.vocab_size)
model = model.to(device)  # place the model on the same device as the training batches
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

train_losses, val_losses = train_rlm(model, train_loader, val_loader)

plt.figure(figsize=(10, 4))
plt.plot(train_losses, label='Train Loss', linewidth=2)
plt.plot(val_losses, label='Val Loss', linewidth=2)
plt.xlabel('Epoch')
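
# --- Added illustration (not part of the original, truncated article) ---
# A small helper to try the trained model on unseen sentences, reusing the
# `model`, `tokenizer`, and `device` objects defined above. The example
# sentence follows the tutorial's own templates.
def predict(model, tokenizer, text, max_len=20):
    model.eval()
    tokens = torch.tensor([tokenizer.encode(text, max_len)]).to(device)
    with torch.no_grad():
        return model(tokens).item()

print(predict(model, tokenizer, "The temperature is 42.0 degrees"))  # should be close to 42 if training succeeded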

