Uncategorized Archives - Pagina 2 di 95

Max It or Miss It: Benchmarking LLM On Solving Extremal Problems

admin NU / Ottobre 21, 2025

arXiv:2510.12997v2 Announce Type: replace-cross Abstract: Test-time scaling has enabled Large Language Models (LLMs) with remarkable reasoning capabilities, particularly in mathematical domains, through intermediate chain-of-thought (CoT) reasoning before generating final answers. However, the specific sources and mechanisms underlying these reasoning capabilities remain insufficiently understood. Optimization reasoning, i.e. finding extrema under constraints, represents a fundamental abstraction that underpins critical applications in planning, control, resource allocation, and prompt search. To systematically evaluate this capability, we introduce ExtremBench, a benchmark dataset for solving mathematical extremal problems, curated from inequality exercises used for Chinese Mathematical Olympiad and transformed into $93$ standardized extrema-finding problems. We conduct extensive evaluations across various state-of-the-art open-source model families, including the Qwen3, GPT-OSS, and DeepSeek. Our results reveal that LLMs’ extremal-solving reasoning capabilities do not always align with those of current mathematical benchmarks such as AIME25 and MATH-500, with some models showing strong general mathematical reasoning but poor extremal-solving skills, and vice versa. This discrepancy highlights a critical gap in current evaluation practices and suggests that existing benchmarks may not comprehensively capture the full spectrum of mathematical reasoning abilities.

Max It or Miss It: Benchmarking LLM On Solving Extremal Problems Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

DVAGen: Dynamic Vocabulary Augmented Generation

admin NU / Ottobre 21, 2025

arXiv:2510.17115v1 Announce Type: new Abstract: Language models trained with a fixed vocabulary struggle to generalize to novel or out-of-vocabulary words, limiting their flexibility in handling diverse token combinations. Existing dynamic vocabulary approaches attempt to address this limitation but face challenges such as fragmented codebases, lack of support for modern LLMs, and limited inference scalability. To overcome these issues, we introduce DVAGen, a fully open-source, unified framework designed for training, evaluation, and visualization of dynamic vocabulary-augmented language models. Our framework modularizes the pipeline for ease of customization, integrates seamlessly with open-source LLMs, and is the first to provide both CLI and WebUI tools for real-time result inspection. We validate the effectiveness of dynamic vocabulary methods on modern LLMs and demonstrate support for batch inference, significantly improving inference throughput.

DVAGen: Dynamic Vocabulary Augmented Generation Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction

admin NU / Ottobre 21, 2025

arXiv:2510.09211v2 Announce Type: replace Abstract: When performing reasoning tasks with user-specific requirements, such as strict output formats, large language models (LLMs) often prioritize reasoning over adherence to detailed instructions. Fine-tuning LLMs on supervised datasets to address this is impractical due to high computational costs and limited parameter access. To tackle this, we propose DICE, a lightweight framework that guides small language models (SLMs) to refine LLMs’ outputs through chain-of-thought (CoT) correction. DICE decouples the process by first prompting LLMs to generate natural language responses, then using trained SLMs to analyze and refine these outputs to meet structured output specifications. This framework preserves LLMs’ broad knowledge and reasoning capabilities while ensuring the outputs conform to user demands. Specifically, DICE first constructs structured CoT adaptation datasets via a two-stage method and subsequently applies a dual-tuning strategy to fine-tune SLMs for generating structured outputs in an analyze-then-answer pattern. Experiments demonstrate that DICE improves the average format accuracy and content correctness of LLM outputs by 35.4% and 29.4%, respectively, achieving state-of-the-art (SOTA) performance over other competitive baselines.

DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

admin NU / Ottobre 21, 2025

arXiv:2507.00432v2 Announce Type: replace-cross Abstract: Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

Google AI Research Releases DeepSomatic: A New AI Model that Identifies Cancer Cell Genetic Variants

admin NU / Ottobre 21, 2025

A team of researchers from Google Research and UC Santa Cruz released DeepSomatic, an AI model that identifies cancer cell genetic variants. In research with Children’s Mercy, it found 10 variants in pediatric leukemia cells missed by other tools. DeepSomatic has a somatic small variant caller for cancer genomes that works across Illumina short reads, PacBio HiFi long reads, and Oxford Nanopore long reads. The method extends DeepVariant, detects single nucleotide variants and small insertions and deletions in whole genome and whole exome data, and supports tumor normal and tumor only workflows, including FFPE models. https://research.google/blog/using-ai-to-identify-genetic-variants-in-tumors-with-deepsomatic/?utm_source=twitter&utm_medium=social&utm_campaign=social_post&utm_content=gr-acct How It Works? DeepSomatic converts aligned reads into image like tensors that encode pileups, base qualities, and alignment context. A convolutional neural network classifies candidate sites as somatic or not and the pipeline emits VCF or gVCF. This design is platform agnostic because the tensor summarizes local haplotype and error patterns across technologies. Google researchers describe the approach and its focus on distinguishing inherited and acquired variants including difficult samples such as glioblastoma and pediatric leukemia. Datasets and Benchmarking Training and evaluation use CASTLE, Cancer Standards Long read Evaluation. CASTLE contains 6 matched tumor and normal cell line pairs that were whole genome sequenced on Illumina, PacBio HiFi, and Oxford Nanopore. The research team releases benchmark sets and accessions for reuse. This fills a gap in multi technology somatic training and testing resources. https://research.google/blog/using-ai-to-identify-genetic-variants-in-tumors-with-deepsomatic/?utm_source=twitter&utm_medium=social&utm_campaign=social_post&utm_content=gr-acct Reported Results The research team report consistent gains over widely used methods in both single nucleotide variants and indels. On Illumina indels, the next best method is about 80 percent F1, DeepSomatic is about 90 percent. On PacBio indels, the next best method is under 50 percent, DeepSomatic is above 80 percent. Baselines include SomaticSniper, MuTect2, and Strelka2 for short reads and ClairS for long reads. The study reports 329,011 somatic variants across the reference lines and an additional preserved sample. Google research team reports that DeepSomatic outperforms current methods with particular strength on indels. https://research.google/blog/using-ai-to-identify-genetic-variants-in-tumors-with-deepsomatic/?utm_source=twitter&utm_medium=social&utm_campaign=social_post&utm_content=gr-acct Generalization to Real Samples The research team evaluates transfer to cancers beyond the training set. A glioblastoma sample shows recovery of known drivers. Pediatric leukemia samples test the tumor only mode where a clean normal is not available. The tool recovers known calls and reports additional variants in that cohort. These studies indicate the representation and training scheme generalize to new disease contexts and to settings without matched normals. Key Takeaways DeepSomatic detects somatic SNVs (single nucleotide variants) and indels across Illumina, PacBio HiFi, and Oxford Nanopore, and builds on the DeepVariant methodology. The pipeline supports tumor normal and tumor only workflows, includes FFPE WGS and WES models, and is released on GitHub. It encodes read pileups as image like tensors and uses a convolutional neural network to classify somatic sites and emit VCF or gVCF. Training and evaluation use the CASTLE dataset with 6 matched tumor normal cell line pairs sequenced on three platforms, with benchmarks and accessions provided. Reported results show about 90 percent indel F1 on Illumina and above 80 percent on PacBio, outperforming common baselines, with 329,011 somatic variants identified across reference samples. Editorial Comments DeepSomatic is a pragmatic step for somatic variant calling across sequencing platforms, the model keeps DeepVariant’s image tensor representation and a convolutional neural network, so the same architecture scales from Illumina to PacBio HiFi to Oxford Nanopore with consistent preprocessing and outputs. The CASTLE dataset is the right move, it supplies matched tumor and normal cell lines across 3 technologies, which strengthens training and benchmarking and aids reproducibility. Reported results emphasize indel accuracy, about 90% F1 on Illumina and more than 80% on PacBio against lower baselines, which addresses a long running weakness in indel detection. The pipeline supports WGS and WES, tumor normal and tumor only, and FFPE, which matches real laboratory constraints. Check out the Technical Paper, Technical details, Dataset and GitHub Repo. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well. The post Google AI Research Releases DeepSomatic: A New AI Model that Identifies Cancer Cell Genetic Variants appeared first on MarkTechPost.

Google AI Research Releases DeepSomatic: A New AI Model that Identifies Cancer Cell Genetic Variants Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

Build Your Personalized Research Group: A Multiagent Framework for Continual and Interactive Science Automation

admin NU / Ottobre 20, 2025

arXiv:2510.15624v1 Announce Type: cross Abstract: The automation of scientific discovery represents a critical milestone in Artificial Intelligence (AI) research. However, existing agentic systems for science suffer from two fundamental limitations: rigid, pre-programmed workflows that cannot adapt to intermediate findings, and inadequate context management that hinders long-horizon research. We present texttt{freephdlabor}, an open-source multiagent framework featuring textit{fully dynamic workflows} determined by real-time agent reasoning and a coloremph{textit{modular architecture}} enabling seamless customization — users can modify, add, or remove agents to address domain-specific requirements. The framework provides comprehensive infrastructure including textit{automatic context compaction}, textit{workspace-based communication} to prevent information degradation, textit{memory persistence} across sessions, and textit{non-blocking human intervention} mechanisms. These features collectively transform automated research from isolated, single-run attempts into textit{continual research programs} that build systematically on prior explorations and incorporate human feedback. By providing both the architectural principles and practical implementation for building customizable co-scientist systems, this work aims to facilitate broader adoption of automated research across scientific domains, enabling practitioners to deploy interactive multiagent systems that autonomously conduct end-to-end research — from ideation through experimentation to publication-ready manuscripts.

Build Your Personalized Research Group: A Multiagent Framework for Continual and Interactive Science Automation Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition

admin NU / Ottobre 20, 2025

arXiv:2504.20094v2 Announce Type: replace-cross Abstract: Conversational recommender systems (CRS) have advanced with large language models, showing strong results in domains like movies. These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme. In contrast, games present distinct challenges: fast-evolving catalogs, interaction-driven preferences (e.g., skill level, mechanics, hardware), and increased risk of unsafe responses in open-ended conversation. We propose MATCHA, a multi-agent framework for CRS that assigns specialized agents for intent parsing, tool-augmented retrieval, multi-LLM ranking with reflection, explanation, and risk control which enabling finer personalization, long-tail coverage, and stronger safety. Evaluated on real user request dataset, MATCHA outperforms six baselines across eight metrics, improving Hit@5 by 20%, reducing popularity bias by 24%, and achieving 97.9% adversarial defense. Human and virtual-judge evaluations confirm improved explanation quality and user alignment.

Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics

admin NU / Ottobre 20, 2025

arXiv:2510.15345v1 Announce Type: new Abstract: Automatic readability assessment plays a key role in ensuring effective and accessible written communication. Despite significant progress, the field is hindered by inconsistent definitions of readability and measurements that rely on surface-level text properties. In this work, we investigate the factors shaping human perceptions of readability through the analysis of 897 judgments, finding that, beyond surface-level cues, information content and topic strongly shape text comprehensibility. Furthermore, we evaluate 15 popular readability metrics across five English datasets, contrasting them with six more nuanced, model-based metrics. Our results show that four model-based metrics consistently place among the top four in rank correlations with human judgments, while the best performing traditional metric achieves an average rank of 8.6. These findings highlight a mismatch between current readability metrics and human perceptions, pointing to model-based approaches as a more promising direction.

Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs

admin NU / Ottobre 20, 2025

arXiv:2510.15545v1 Announce Type: new Abstract: Accelerating the inference of large language models (LLMs) has been a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency. However, its utility is limited by a fundamental constraint: the draft and target models must share the same vocabulary, thus limiting the herd of available draft models and often necessitating the training of a new model from scratch. Inspired by Dynamic Time Warping (DTW), a classic algorithm for aligning time series, we propose the algorithm TokenTiming for universal speculative decoding. It operates by re-encoding the draft token sequence to get a new target token sequence, and then uses DTW to build a mapping to transfer the probability distributions for speculative sampling. Benefiting from this, our method accommodates mismatched vocabularies and works with any off-the-shelf models without retraining and modification. We conduct comprehensive experiments on various tasks, demonstrating 1.57x speedup. This work enables a universal approach for draft model selection, making SD a more versatile and practical tool for LLM acceleration.

TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry

admin NU / Ottobre 20, 2025

arXiv:2510.15313v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly applied to creative domains, yet their performance in classical Chinese poetry generation and evaluation remains poorly understood. We propose a three-step evaluation framework that combines computational metrics, LLM-as-a-judge assessment, and human expert validation. Using this framework, we evaluate six state-of-the-art LLMs across multiple dimensions of poetic quality, including themes, emotions, imagery, form, and style. Our analysis reveals systematic generation and evaluation biases: LLMs exhibit “echo chamber” effects when assessing creative quality, often converging on flawed standards that diverge from human judgments. These findings highlight both the potential and limitations of current capabilities of LLMs as proxy for literacy generation and the limited evaluation practices, thereby demonstrating the continued need of hybrid validation from both humans and models in culturally and technically complex creative tasks.

Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry Leggi l'articolo »

Uncategorized

Max It or Miss It: Benchmarking LLM On Solving Extremal Problems

DVAGen: Dynamic Vocabulary Augmented Generation

DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Google AI Research Releases DeepSomatic: A New AI Model that Identifies Cancer Cell Genetic Variants

Build Your Personalized Research Group: A Multiagent Framework for Continual and Interactive Science Automation

Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition

Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics

TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs

Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry

I nostri servizi

Home

Come funziona

Notizie

Prezzi

Supporto

Centro assistenza

Segnala un problema

Dai feedback

Politica sulla privacy

Account utente

Seguici