In this advanced DeepSpeed tutorial, we provide a hands-on walkthrough of cutting-edge optimization techniques for training large language models efficiently. By combining ZeRO optimization, mixed-precision training, gradient accumulation, and advanced DeepSpeed configurations, the tutorial demonstrates how to use GPU memory efficiently, reduce training overhead, and scale transformer models in resource-constrained environments such as Colab. Alongside model creation and training, it also covers performance monitoring, inference optimization, checkpointing, and benchmarking different ZeRO stages, providing practitioners with both theoretical insights and practical code to accelerate model development. Check out the FULL CODES here.

import subprocess
import sys
import os
import json
import time
from pathlib import Path

def install_dependencies():
    """Install required packages for DeepSpeed in Colab"""
    print("Installing DeepSpeed and dependencies...")
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "torch", "torchvision", "torchaudio",
        "--index-url", "https://download.pytorch.org/whl/cu118"
    ])
    subprocess.check_call([sys.executable, "-m", "pip", "install", "deepspeed"])
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "transformers", "datasets", "accelerate", "wandb"
    ])
    print("Installation complete!")

install_dependencies()

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import deepspeed
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer
import numpy as np
from typing import Dict, Any
import argparse

We set up our Colab environment by installing PyTorch with CUDA support, DeepSpeed, and essential libraries like Transformers, Datasets, Accelerate, and Weights & Biases. We ensure everything is ready so we can smoothly build and train models with DeepSpeed. Check out the FULL CODES here.

class SyntheticTextDataset(Dataset):
    """Synthetic dataset for demonstration purposes"""
    def __init__(self, size: int = 1000, seq_length: int = 512, vocab_size: int = 50257):
        self.size = size
        self.seq_length = seq_length
        self.vocab_size = vocab_size
        self.data = torch.randint(0, vocab_size, (size, seq_length))

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        return {
            'input_ids': self.data[idx],
            'labels': self.data[idx].clone()
        }

We create a SyntheticTextDataset where we generate random token sequences to mimic real text data. We use these sequences as both inputs and labels, allowing us to quickly test DeepSpeed training without relying on a large external dataset. Check out the FULL CODES here.
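As an optional sanity check (not part of the original tutorial; the small sizes and batch size of 4 below are arbitrary), we can wrap the dataset in a DataLoader and confirm that each batch yields matching input_ids and labels tensors of shape (batch_size, seq_length):

# Optional sanity check (illustrative only): verify batch shapes before training.
sanity_loader = DataLoader(SyntheticTextDataset(size=8, seq_length=32), batch_size=4)
sample = next(iter(sanity_loader))
print(sample['input_ids'].shape)  # torch.Size([4, 32])
print(sample['labels'].shape)     # torch.Size([4, 32]); labels mirror input_ids for LM loss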
class AdvancedDeepSpeedTrainer:
    """Advanced DeepSpeed trainer with multiple optimization techniques"""

    def __init__(self, model_config: Dict[str, Any], ds_config: Dict[str, Any]):
        self.model_config = model_config
        self.ds_config = ds_config
        self.model = None
        self.engine = None
        self.tokenizer = None

    def create_model(self):
        """Create a GPT-2 style model for demonstration"""
        print("Creating model...")
        config = GPT2Config(
            vocab_size=self.model_config['vocab_size'],
            n_positions=self.model_config['seq_length'],
            n_embd=self.model_config['hidden_size'],
            n_layer=self.model_config['num_layers'],
            n_head=self.model_config['num_heads'],
            resid_pdrop=0.1,
            embd_pdrop=0.1,
            attn_pdrop=0.1,
        )
        self.model = GPT2LMHeadModel(config)
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        self.tokenizer.pad_token = self.tokenizer.eos_token
        print(f"Model parameters: {sum(p.numel() for p in self.model.parameters()):,}")
        return self.model

    def create_deepspeed_config(self):
        """Create comprehensive DeepSpeed configuration"""
        return {
            "train_batch_size": self.ds_config['train_batch_size'],
            "train_micro_batch_size_per_gpu": self.ds_config['micro_batch_size'],
            "gradient_accumulation_steps": self.ds_config['gradient_accumulation_steps'],
            "zero_optimization": {
                "stage": self.ds_config['zero_stage'],
                "allgather_partitions": True,
                "allgather_bucket_size": 5e8,
                "overlap_comm": True,
                "reduce_scatter": True,
                "reduce_bucket_size": 5e8,
                "contiguous_gradients": True,
                "cpu_offload": self.ds_config.get('cpu_offload', False)
            },
            "fp16": {
                "enabled": True,
                "loss_scale": 0,
                "loss_scale_window": 1000,
                "initial_scale_power": 16,
                "hysteresis": 2,
                "min_loss_scale": 1
            },
            "optimizer": {
                "type": "AdamW",
                "params": {
                    "lr": self.ds_config['learning_rate'],
                    "betas": [0.9, 0.999],
                    "eps": 1e-8,
                    "weight_decay": 0.01
                }
            },
            "scheduler": {
                "type": "WarmupLR",
                "params": {
                    "warmup_min_lr": 0,
                    "warmup_max_lr": self.ds_config['learning_rate'],
                    "warmup_num_steps": 100
                }
            },
            "gradient_clipping": 1.0,
            "wall_clock_breakdown": True,
            "memory_breakdown": True,
            "tensorboard": {
                "enabled": True,
                "output_path": "./logs/",
                "job_name": "deepspeed_advanced_tutorial"
            }
        }

    def initialize_deepspeed(self):
        """Initialize DeepSpeed engine"""
        print("Initializing DeepSpeed...")
        parser = argparse.ArgumentParser()
        parser.add_argument('--local_rank', type=int, default=0)
        args = parser.parse_args([])
        self.engine, optimizer, _, lr_scheduler = deepspeed.initialize(
            args=args,
            model=self.model,
            config=self.create_deepspeed_config()
        )
        print(f"DeepSpeed engine initialized with ZeRO stage {self.ds_config['zero_stage']}")
        return self.engine

    def train_step(self, batch: Dict[str, torch.Tensor]) -> Dict[str, float]:
        """Perform a single training step with DeepSpeed optimizations"""
        input_ids = batch['input_ids'].to(self.engine.device)
        labels = batch['labels'].to(self.engine.device)
        outputs = self.engine(input_ids=input_ids, labels=labels)
        loss = outputs.loss
        self.engine.backward(loss)
        self.engine.step()
        return {
            'loss': loss.item(),
            'lr': self.engine.lr_scheduler.get_last_lr()[0] if self.engine.lr_scheduler else 0
        }

    def train(self, dataloader: DataLoader, num_epochs: int = 2):
        """Complete training loop with monitoring"""
        print(f"Starting training for {num_epochs} epochs...")
        self.engine.train()
        total_steps = 0
        for epoch in range(num_epochs):
            epoch_loss = 0.0
            epoch_steps = 0
            print(f"\nEpoch {epoch + 1}/{num_epochs}")
            for step, batch in enumerate(dataloader):
                start_time = time.time()
                metrics = self.train_step(batch)
                epoch_loss += metrics['loss']
                epoch_steps += 1
                total_steps += 1
                if step % 10 == 0:
                    step_time = time.time() - start_time
                    print(f"Step {step:4d} | Loss: {metrics['loss']:.4f} | "
                          f"LR: {metrics['lr']:.2e} | Time: {step_time:.3f}s")
                if step % 20 == 0 and hasattr(self.engine, 'monitor'):
                    self.log_memory_stats()
                if step >= 50:
                    break
            avg_loss = epoch_loss / epoch_steps
            print(f"Epoch {epoch + 1} completed | Average Loss: {avg_loss:.4f}")
        print("Training completed!")

    def log_memory_stats(self):
        """Log GPU memory statistics"""
        if torch.cuda.is_available():
            allocated = torch.cuda.memory_allocated() / 1024**3
            reserved = torch.cuda.memory_reserved() / 1024**3
            print(f"GPU Memory - Allocated: {allocated:.2f}GB | Reserved: {reserved:.2f}GB")

    def save_checkpoint(self, path: str):
        """Save model checkpoint using DeepSpeed"""
        print(f"Saving checkpoint to {path}")
        self.engine.save_checkpoint(path)

    def demonstrate_inference(self, text: str = "The future of AI is"):
        """Demonstrate optimized inference with DeepSpeed"""
        print(f"\nRunning inference with prompt: '{text}'")
        inputs = self.tokenizer.encode(text, return_tensors='pt').to(self.engine.device)
        self.engine.eval()
        with torch.no_grad():
            outputs = self.engine.module.generate(
                inputs,
                max_length=inputs.shape[1] + 50,
                num_return_sequences=1,
                temperature=0.8,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id
            )
        generated_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        print(f"Generated text: {generated_text}")
        self.engine.train()

We build an end-to-end trainer that creates a GPT-2 model, sets a DeepSpeed config (ZeRO, FP16, AdamW, warmup scheduler, TensorBoard), and initializes the engine. We then run efficient training steps with logging and memory statistics, save checkpoints, and demonstrate inference to verify optimization and generation in one place. Check out the FULL CODES here.
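The trainer builds its configuration as a Python dict, but the same settings can also be exported to a standalone JSON file, which is the form the deepspeed launcher typically consumes for multi-GPU runs outside a notebook. The following is a minimal sketch, not part of the original tutorial; the file name ds_config.json, the script name train.py, and the inline config values (which mirror the ones used in run_advanced_tutorial below) are placeholders:

# Illustrative sketch (assumed workflow, not from the tutorial): write the generated
# DeepSpeed config to JSON so the same settings can be reused with the launcher.
_trainer = AdvancedDeepSpeedTrainer(
    model_config={'vocab_size': 50257, 'seq_length': 512, 'hidden_size': 768,
                  'num_layers': 6, 'num_heads': 12},
    ds_config={'train_batch_size': 16, 'micro_batch_size': 4,
               'gradient_accumulation_steps': 4, 'zero_stage': 2,
               'learning_rate': 1e-4, 'cpu_offload': False},
)
with open("ds_config.json", "w") as f:
    json.dump(_trainer.create_deepspeed_config(), f, indent=2)
# On a multi-GPU machine the launcher can then pick up the file, e.g.:
#   deepspeed train.py --deepspeed_config ds_config.json
# (train.py would need to parse the launcher's arguments, e.g. via
#  deepspeed.add_config_arguments on its ArgumentParser.)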
def run_advanced_tutorial():
    """Main function to run the advanced DeepSpeed tutorial"""
    print("Advanced DeepSpeed Tutorial Starting...")
    print("=" * 60)
    model_config = {
        'vocab_size': 50257,
        'seq_length': 512,
        'hidden_size': 768,
        'num_layers': 6,
        'num_heads': 12
    }
    ds_config = {
        'train_batch_size': 16,
        'micro_batch_size': 4,
        'gradient_accumulation_steps': 4,
        'zero_stage': 2,
        'learning_rate': 1e-4,
        'cpu_offload': False
    }
    print("Configuration:")
    # Rough parameter estimate: token embeddings plus per-layer hidden-to-hidden weights.
    layer_shapes = ([[model_config['vocab_size'], model_config['hidden_size']]]
                    + [[model_config['hidden_size'], model_config['hidden_size']]]
                    * model_config['num_layers'])
    print(f"  Model size: ~{sum(np.prod(shape) for shape in layer_shapes) / 1e6:.1f}M parameters (rough estimate)")
    print(f"  ZeRO Stage: {ds_config['zero_stage']}")
    print(f"  Batch size: {ds_config['train_batch_size']}")
    trainer = AdvancedDeepSpeedTrainer(model_config, ds_config)
    model = trainer.create_model()
    engine = trainer.initialize_deepspeed()
    print("\nCreating synthetic dataset...")
    dataset = SyntheticTextDataset(
        size=200,
        seq_length=model_config['seq_length'],
        vocab_size=model_config['vocab_size']
    )
    dataloader = DataLoader(
        dataset,
        batch_size=ds_config['micro_batch_size'],
        shuffle=True
    )
    print("\nPre-training memory stats:")
    trainer.log_memory_stats()
    trainer.train(dataloader, num_epochs=2)
    print("\nPost-training memory stats:")
    trainer.log_memory_stats()
    trainer.demonstrate_inference("DeepSpeed enables efficient training of")
    checkpoint_path = "./deepspeed_checkpoint"
    trainer.save_checkpoint(checkpoint_path)
    # These two helpers are defined later in the full script (see the FULL CODES link).
    demonstrate_zero_stages()
    demonstrate_memory_optimization()
    print("\nTutorial completed successfully!")
    print("Key DeepSpeed features demonstrated:")
    print("  ZeRO optimization for memory efficiency")
    print("  Mixed precision training (FP16)")
    print("  Gradient accumulation")
    print("  Learning rate scheduling with warmup")
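# Illustrative addition, not from the original article: DeepSpeed requires that
#   train_batch_size == train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size.
# With the single-GPU Colab settings above, 4 * 4 * 1 == 16 matches
# ds_config['train_batch_size']. A small helper (hypothetical name, operating on
# the tutorial's ds_config dict) can catch mismatched settings early.
def check_effective_batch_size(ds_config: Dict[str, Any], world_size: int = 1) -> None:
    effective = (ds_config['micro_batch_size']
                 * ds_config['gradient_accumulation_steps']
                 * world_size)
    assert effective == ds_config['train_batch_size'], (
        f"Effective batch size {effective} != train_batch_size "
        f"{ds_config['train_batch_size']}"
    )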