YouZum

Committee

AI, Committee, Notizie, Uncategorized

Rethinking Verification for LLM Code Generation: From Generation to Testing

arXiv:2507.06920v2 Announce Type: replace Abstract: Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, resulting in subtle faults going undetected. This not only artificially inflates measured performance but also compromises accurate reward estimation in reinforcement learning frameworks utilizing verifiable rewards (RLVR). To address these critical shortcomings, we systematically investigate the test-case generation (TCG) task by proposing multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Furthermore, we introduce a human-LLM collaborative method (SAGA), leveraging human programming expertise with LLM reasoning capability, aimed at significantly enhancing both the coverage and the quality of generated test cases. In addition, we develop a TCGBench to facilitate the study of the TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc) of the code generation evaluation benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6. These results demonstrate the effectiveness of our proposed method. We hope this work contributes to building a scalable foundation for reliable LLM code evaluation, further advancing RLVR in code generation, and paving the way for automated adversarial test synthesis and adaptive benchmark integration.

Rethinking Verification for LLM Code Generation: From Generation to Testing Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

Breaking PEFT Limitations: Leveraging Weak-to-Strong Knowledge Transfer for Backdoor Attacks in LLMs

arXiv:2409.17946v4 Announce Type: replace-cross Abstract: Despite being widely applied due to their exceptional capabilities, Large Language Models (LLMs) have been proven to be vulnerable to backdoor attacks. These attacks introduce targeted vulnerabilities into LLMs by poisoning training samples and full-parameter fine-tuning (FPFT). However, this kind of backdoor attack is limited since they require significant computational resources, especially as the size of LLMs increases. Besides, parameter-efficient fine-tuning (PEFT) offers an alternative but the restricted parameter updating may impede the alignment of triggers with target labels. In this study, we first verify that backdoor attacks with PEFT may encounter challenges in achieving feasible performance. To address these issues and improve the effectiveness of backdoor attacks with PEFT, we propose a novel backdoor attack algorithm from the weak-to-strong based on Feature Alignment-enhanced Knowledge Distillation (FAKD). Specifically, we poison small-scale language models through FPFT to serve as the teacher model. The teacher model then covertly transfers the backdoor to the large-scale student model through FAKD, which employs PEFT. Theoretical analysis reveals that FAKD has the potential to augment the effectiveness of backdoor attacks. We demonstrate the superior performance of FAKD on classification tasks across four language models, four backdoor attack algorithms, and two different architectures of teacher models. Experimental results indicate success rates close to 100% for backdoor attacks targeting PEFT.

Breaking PEFT Limitations: Leveraging Weak-to-Strong Knowledge Transfer for Backdoor Attacks in LLMs Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation

arXiv:2507.06249v1 Announce Type: cross Abstract: Recently, pre-trained models with phonetic supervision have demonstrated their advantages for crosslingual speech recognition in data efficiency and information sharing across languages. However, a limitation is that a pronunciation lexicon is needed for such phoneme-based crosslingual speech recognition. In this study, we aim to eliminate the need for pronunciation lexicons and propose a latent variable model based method, with phonemes being treated as discrete latent variables. The new method consists of a speech-to-phoneme (S2P) model and a phoneme-to-grapheme (P2G) model, and a grapheme-to-phoneme (G2P) model is introduced as an auxiliary inference model. To jointly train the three models, we utilize the joint stochastic approximation (JSA) algorithm, which is a stochastic extension of the EM (expectation-maximization) algorithm and has demonstrated superior performance particularly in estimating discrete latent variable models. Based on the Whistle multilingual pre-trained S2P model, crosslingual experiments are conducted in Polish (130 h) and Indonesian (20 h). With only 10 minutes of phoneme supervision, the new method, JSA-SPG, achieves 5% error rate reductions compared to the best crosslingual fine-tuning approach using subword or full phoneme supervision. Furthermore, it is found that in language domain adaptation (i.e., utilizing cross-domain text-only data), JSA-SPG outperforms the standard practice of language model fusion via the auxiliary support of the G2P model by 9% error rate reductions. To facilitate reproducibility and encourage further exploration in this field, we open-source the JSA-SPG training code and complete pipeline.

Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

Temporal Analysis of Climate Policy Discourse: Insights from Dynamic Embedded Topic Modeling

arXiv:2507.06435v1 Announce Type: new Abstract: Understanding how policy language evolves over time is critical for assessing global responses to complex challenges such as climate change. Temporal analysis helps stakeholders, including policymakers and researchers, to evaluate past priorities, identify emerging themes, design governance strategies, and develop mitigation measures. Traditional approaches, such as manual thematic coding, are time-consuming and limited in capturing the complex, interconnected nature of global policy discourse. With the increasing relevance of unsupervised machine learning, these limitations can be addressed, particularly under high-volume, complex, and high-dimensional data conditions. In this work, we explore a novel approach that applies the dynamic embedded topic model (DETM) to analyze the evolution of global climate policy discourse. A probabilistic model designed to capture the temporal dynamics of topics over time. We collected a corpus of United Nations Framework Convention on Climate Change (UNFCCC) policy decisions from 1995 to 2023, excluding 2020 due to the postponement of COP26 as a result of the COVID-19 pandemic. The model reveals shifts from early emphases on greenhouse gases and international conventions to recent focuses on implementation, technical collaboration, capacity building, finance, and global agreements. Section 3 presents the modeling pipeline, including preprocessing, model training, and visualization of temporal word distributions. Our results show that DETM is a scalable and effective tool for analyzing the evolution of global policy discourse. Section 4 discusses the implications of these findings and we concluded with future directions and refinements to extend this approach to other policy domains.

Temporal Analysis of Climate Policy Discourse: Insights from Dynamic Embedded Topic Modeling Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

The Trilemma of Truth in Large Language Models

arXiv:2506.23921v2 Announce Type: replace Abstract: We often attribute human characteristics to large language models (LLMs) and claim that they “know” certain things. LLMs have an internal probabilistic knowledge that represents information retained during training. How can we assess the veracity of this knowledge? We examine two common methods for probing the veracity of LLMs and discover several assumptions that are flawed. To address these flawed assumptions, we introduce sAwMIL (short for Sparse Aware Multiple-Instance Learning), a probing method that utilizes the internal activations of LLMs to separate statements into true, false, and neither. sAwMIL is based on multiple-instance learning and conformal prediction. We evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including both default and chat-based variants, as well as on 3 new datasets. Among the insights we provide are: (1) the veracity signal is often concentrated in the third quarter of an LLM’s depth; (2) truth and falsehood signals are not always symmetric; (3) linear probes perform better on chat models than on default models; (4) nonlinear probes may be required to capture veracity signals for some LLMs with reinforcement learning from human feedback or knowledge distillation; and (5) LLMs capture a third type of signal that is distinct from true and false and is neither true nor false. These findings provide a reliable method for verifying what LLMs “know” and how certain they are of their probabilistic internal knowledge.

The Trilemma of Truth in Large Language Models Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods

arXiv:2502.16903v2 Announce Type: replace Abstract: Despite the growing interest in jailbreak methods as an effective red-teaming tool for building safe and responsible large language models (LLMs), flawed evaluation system designs have led to significant discrepancies in their effectiveness assessments. We conduct a systematic measurement study based on 37 jailbreak studies since 2022, focusing on both the methods and the evaluation systems they employ. We find that existing evaluation systems lack case-specific criteria, resulting in misleading conclusions about their effectiveness and safety implications. This paper advocates a shift to a more nuanced, case-by-case evaluation paradigm. We introduce GuidedBench, a novel benchmark comprising a curated harmful question dataset, detailed case-by-case evaluation guidelines and an evaluation system integrated with these guidelines — GuidedEval. Experiments demonstrate that GuidedBench offers more accurate measurements of jailbreak performance, enabling meaningful comparisons across methods and uncovering new insights overlooked in previous evaluations. GuidedEval reduces inter-evaluator variance by at least 76.03%. Furthermore, we observe that incorporating guidelines can enhance the effectiveness of jailbreak methods themselves, offering new insights into both attack strategies and evaluation paradigms.

GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

Enhancing Test-Time Scaling of Large Language Models with Hierarchical Retrieval-Augmented MCTS

arXiv:2507.05557v1 Announce Type: new Abstract: Test-time scaling has emerged as a promising paradigm in language modeling, leveraging additional computational resources at inference time to enhance model performance. In this work, we introduce R2-LLMs, a novel and versatile hierarchical retrieval-augmented reasoning framework designed to improve test-time scaling in large language models (LLMs) without requiring distillation from more advanced models to obtain chain-of-thought (CoT) training data. R2-LLMs enhances inference-time generalization by integrating dual-level retrieval-based in-context learning: (1) At the coarse level, our approach extracts abstract templates from complex reasoning problems and retrieves similar problem-answer pairs to facilitate high-level in-context learning; (2) At the fine level, during Monte Carlo Tree Search (MCTS), R2-LLMs efficiently retrieves analogous intermediate solution steps from reference mathematical problem datasets, refining step-wise reasoning with the aid of a process reward model (PRM) for scoring. R2-LLMs is a robust hierarchical reasoning-augmentation method that enhances in-context-level reasoning while seamlessly integrating with step-level tree search methods. Utilizing PRM, it refines both candidate generation and decision-making for improved reasoning accuracy. Empirical evaluations on the MATH500, GSM8K, and OlympiadBench-TO datasets achieve substantial relative improvement with an increase of up to 16% using LLaMA-3.1-8B compared to the baselines, showcasing the effectiveness of our approach in complex reasoning tasks.

Enhancing Test-Time Scaling of Large Language Models with Hierarchical Retrieval-Augmented MCTS Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset

arXiv:2507.03483v2 Announce Type: replace Abstract: In this paper, we introduce BMMR, a large-scale bilingual, multimodal, multi-disciplinary reasoning dataset for the community to develop and evaluate large multimodal models (LMMs). BMMR comprises 110k college-level questions spanning 300 UNESCO-defined subjects, spanning diverse formats-multiple-choice, fill-in-the-blank, and open-ended QA-and sourced from both print and digital media such as books, exams, and quizzes. All data are curated and filtered via a human-in-the-loop and scalable framework, and each instance is paired with a high-quality reasoning path. The dataset is organized into two parts: BMMR-Eval that comprises 20,458 high-quality instances to comprehensively assess LMMs’ knowledge and reasoning across multiple disciplines in both Chinese and English; and BMMR-Train that contains 88,991 instances to support further research and development, extending the current focus on mathematical reasoning to diverse disciplines and domains. In addition, we propose the process-based multi-discipline verifier (i.e., BMMR-Verifier) for accurate and fine-grained evaluation of reasoning paths. Extensive experiments on 24 models reveal that (i) even SOTA models (e.g., o3 and Gemini-2.5-Pro) leave substantial headroom on BMMR-Eval; (ii) reasoning models exhibit discipline bias and outperform LMMs only on specific subjects; (iii) open-source models still trail their proprietary counterparts; and (iv) fine-tuning on BMMR-Train narrows this gap. Additionally, we conduct reasoning-chain analyses using BMMR-Verifier and other in-depth studies, uncovering the challenges LMMs currently face in multidisciplinary reasoning. We will release the data, and we hope our work can offer insights and contributions to the community.

BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

Analytic Subspace Routing: How Recursive Least Squares Works in Continual Learning of Large Language Model

arXiv:2503.13575v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) possess encompassing capabilities that can process diverse language-related tasks. However, finetuning on LLMs will diminish this general skills and continual finetuning will further cause severe degradation on accumulated knowledge. Recently, Continual Learning (CL) in Large Language Models (LLMs) arises which aims to continually adapt the LLMs to new tasks while maintaining previously learned knowledge and inheriting general skills. Existing techniques either leverage previous data to replay, leading to extra computational costs, or utilize a single parameter-efficient module to learn the downstream task, constraining new knowledge absorption with interference between different tasks. Toward these issues, this paper proposes Analytic Subspace Routing(ASR) to address these challenges. For each task, we isolate the learning within a subspace of deep layers’ features via low-rank adaptation, eliminating knowledge interference between different tasks. Additionally, we propose an analytic routing mechanism to properly utilize knowledge learned in different subspaces. Our approach employs Recursive Least Squares to train a multi-task router model, allowing the router to dynamically adapt to incoming data without requiring access to historical data. Also, the router effectively assigns the current task to an appropriate subspace and has a non-forgetting property of previously learned tasks with a solid theoretical guarantee. Experimental results demonstrate that our method achieves near-perfect retention of prior knowledge while seamlessly integrating new information, effectively overcoming the core limitations of existing methods. Our code will be released after acceptance.

Analytic Subspace Routing: How Recursive Least Squares Works in Continual Learning of Large Language Model Leggi l'articolo »

AI, Committee, Notizie, Uncategorized

Remember Past, Anticipate Future: Learning Continual Multimodal Misinformation Detectors

arXiv:2507.05939v1 Announce Type: new Abstract: Nowadays, misinformation articles, especially multimodal ones, are widely spread on social media platforms and cause serious negative effects. To control their propagation, Multimodal Misinformation Detection (MMD) becomes an active topic in the community to automatically identify misinformation. Previous MMD methods focus on supervising detectors by collecting offline data. However, in real-world scenarios, new events always continually emerge, making MMD models trained on offline data consistently outdated and ineffective. To address this issue, training MMD models under online data streams is an alternative, inducing an emerging task named continual MMD. Unfortunately, it is hindered by two major challenges. First, training on new data consistently decreases the detection performance on past data, named past knowledge forgetting. Second, the social environment constantly evolves over time, affecting the generalization on future data. To alleviate these challenges, we propose to remember past knowledge by isolating interference between event-specific parameters with a Dirichlet process-based mixture-of-expert structure, and anticipate future environmental distributions by learning a continuous-time dynamics model. Accordingly, we induce a new continual MMD method DAEDCMD. Extensive experiments demonstrate that DAEDCMD can consistently and significantly outperform the compared methods, including six MMD baselines and three continual learning methods.

Remember Past, Anticipate Future: Learning Continual Multimodal Misinformation Detectors Leggi l'articolo »

it_IT