YouZum


AI, Committee, News, Uncategorized

Detecting Sockpuppetry on Wikipedia Using Meta-Learning

arXiv:2506.10314v1 Announce Type: cross Abstract: Malicious sockpuppet detection on Wikipedia is critical to preserving access to reliable information on the internet and preventing the spread of disinformation. Prior machine learning approaches rely on stylistic and meta-data features, but do not prioritise adaptability to author-specific behaviours. As a result, they struggle to effectively model the behaviour of specific sockpuppet-groups, especially when text data is limited. To address this, we propose the application of meta-learning, a machine learning technique designed to improve performance in data-scarce settings by training models across multiple tasks. Meta-learning optimises a model for rapid adaptation to the writing style of a new sockpuppet-group. Our results show that meta-learning significantly enhances the precision of predictions compared to pre-trained models, marking an advancement in combating sockpuppetry on open editing platforms. We release a new dataset of sockpuppet investigations to foster future research in both sockpuppetry and meta-learning fields.
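The abstract above gives no implementation details, but the core idea of training across many sockpuppet investigations so that a classifier adapts quickly to a new group can be illustrated with a minimal Reptile-style meta-learning loop. Everything in the sketch below (the feature dimension, the tiny classifier, the task sampler, and the hyperparameters) is an illustrative assumption, not the authors' setup.

```python
# Minimal Reptile-style meta-learning loop for adapting a small text classifier
# to a new sockpuppet group. Illustrative only: the feature extractor, model
# size, and hyperparameters are assumptions, not the paper's configuration.
import copy
import torch
import torch.nn as nn

def sample_investigation_task(num_examples=16, feat_dim=64):
    # Placeholder for a real sampler that returns stylometric features and
    # sockpuppet / non-sockpuppet labels for one investigation (one task).
    x = torch.randn(num_examples, feat_dim)
    y = torch.randint(0, 2, (num_examples,))
    return x, y

meta_model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
inner_steps, inner_lr, meta_lr = 5, 1e-2, 1e-1

for meta_iter in range(100):                      # outer loop over investigations
    task_model = copy.deepcopy(meta_model)        # fast weights for this group
    opt = torch.optim.SGD(task_model.parameters(), lr=inner_lr)
    for _ in range(inner_steps):                  # inner loop: adapt to the group's style
        x, y = sample_investigation_task()
        opt.zero_grad()
        loss_fn(task_model(x), y).backward()
        opt.step()
    with torch.no_grad():                         # Reptile meta-update:
        for p_meta, p_task in zip(meta_model.parameters(), task_model.parameters()):
            p_meta += meta_lr * (p_task - p_meta) # move meta-weights toward adapted weights
```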


AI, Committee, News, Uncategorized

PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier

arXiv:2506.10406v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks, yet they still struggle to reliably verify the correctness of their own outputs. Existing solutions to this verification challenge often depend on separate verifier models or require multi-stage self-correction training pipelines, which limit scalability. In this paper, we propose Policy as Generative Verifier (PAG), a simple and effective framework that empowers LLMs to self-correct by alternating between policy and verifier roles within a unified multi-turn reinforcement learning (RL) paradigm. Distinct from prior approaches that always generate a second attempt regardless of model confidence, PAG introduces a selective revision mechanism: the model revises its answer only when its own generative verification step detects an error. This verify-then-revise workflow not only alleviates model collapse but also jointly enhances both reasoning and verification abilities. Extensive experiments across diverse reasoning benchmarks highlight PAG’s dual advancements: as a policy, it enhances direct generation and self-correction accuracy; as a verifier, its self-verification outperforms self-consistency.
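The paper's training recipe is not reproduced here, but the selective verify-then-revise control flow the abstract describes can be sketched in a few lines. The helpers below (llm, generate_answer, generative_verify, revise) are hypothetical placeholders for prompting the same model in its policy, verifier, and revision roles; they are not PAG's actual prompts or code.

```python
# Control-flow sketch of PAG-style selective self-correction. The llm() call
# and the three role prompts are hypothetical placeholders, not the paper's
# actual implementation.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own model client here")

def generate_answer(question: str) -> str:
    return llm(f"Solve the problem step by step, then state the final answer.\n{question}")

def generative_verify(question: str, answer: str) -> bool:
    # Verifier role: the same model critiques its own answer and gives a verdict.
    verdict = llm(
        f"Question:\n{question}\nProposed answer:\n{answer}\n"
        "Check the reasoning and reply CORRECT or WRONG."
    )
    return verdict.strip().upper().startswith("CORRECT")

def revise(question: str, answer: str) -> str:
    return llm(
        f"Question:\n{question}\nYour previous answer was judged wrong:\n{answer}\n"
        "Produce a corrected answer."
    )

def answer_with_selective_revision(question: str, max_rounds: int = 2) -> str:
    answer = generate_answer(question)              # policy role: first attempt
    for _ in range(max_rounds):
        if generative_verify(question, answer):     # verifier role: self-check
            break                                   # verified, so no forced second attempt
        answer = revise(question, answer)           # revise only when an error is flagged
    return answer
```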


AI, Committee, News, Uncategorized

Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering

arXiv:2506.10751v1 Announce Type: cross Abstract: Automated question answering (QA) over electronic health records (EHRs) can bridge critical information gaps for clinicians and patients, yet it demands both precise evidence retrieval and faithful answer generation under limited supervision. In this work, we present Neural, the runner-up in the BioNLP 2025 ArchEHR-QA shared task on evidence-grounded clinical QA. Our proposed method decouples the task into (1) sentence-level evidence identification and (2) answer synthesis with explicit citations. For each stage, we automatically explore the prompt space with DSPy’s MIPROv2 optimizer, jointly tuning instructions and few-shot demonstrations on the development set. A self-consistency voting scheme further improves evidence recall without sacrificing precision. On the hidden test set, our method attains an overall score of 51.5, placing second overall while outperforming standard zero-shot and few-shot prompting by over 20 and 10 points, respectively. These results indicate that data-driven prompt optimization is a cost-effective alternative to model fine-tuning for high-stakes clinical QA, advancing the reliability of AI assistants in healthcare.
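As a rough illustration of the self-consistency voting step mentioned in the abstract, the sketch below majority-votes sentence IDs across several sampled evidence-selection runs. The voting threshold and data format are assumptions for illustration, not the team's actual pipeline.

```python
from collections import Counter

def vote_evidence(runs, min_votes=None):
    """Majority-vote sentence IDs across multiple sampled evidence-selection runs.

    `runs` is a list of sets of sentence IDs, one per sampled LLM output.
    A sentence is kept if it appears in at least `min_votes` runs
    (default: strict majority). Illustrative only; the threshold is an assumption.
    """
    if min_votes is None:
        min_votes = len(runs) // 2 + 1
    counts = Counter(sid for run in runs for sid in set(run))
    return sorted(sid for sid, c in counts.items() if c >= min_votes)

# Example: five sampled runs over a patient note's sentence IDs.
runs = [{1, 4, 7}, {1, 4}, {1, 7, 9}, {1, 4, 7}, {4, 7}]
print(vote_evidence(runs))   # -> [1, 4, 7]
```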


AI, Committee, News, Uncategorized

How Do LLMs Really Reason? A Framework to Separate Logic from Knowledge

Unpacking Reasoning in Modern LLMs: Why Final Answers Aren’t Enough

Recent advancements in reasoning-focused LLMs such as OpenAI’s o1/o3 and DeepSeek-R1 have led to notable improvements on complex tasks. However, the step-by-step reasoning behind these models remains unclear. Most evaluations focus on final-answer accuracy, which hides the reasoning process and does not reveal how models combine knowledge and logic. Some earlier methods attempt to measure reasoning by comparing answers to the original question, but this approach is flawed since models often rely on prior deductions or internal knowledge. Domains such as math and medicine differ in their reasoning needs, highlighting the importance of developing better, domain-aware evaluation methods for building trustworthy AI.

The Shortcomings of Final-Answer Evaluations in Math and Medicine

Recent LLMs have made impressive strides in reasoning tasks, especially in math and medicine, thanks to better training data and reward strategies. However, most of this progress focuses on boosting final-answer accuracy rather than on understanding how the model reasons step by step. Past work has flagged factual errors in reasoning chains or measured the similarity between reasoning steps and the original question, but such similarity guarantees neither logical soundness nor factual correctness, since LLMs often draw on internal knowledge or earlier reasoning.

A New Framework for Separating Knowledge and Logic in LLM Reasoning

Researchers from UC Santa Cruz, Stanford, and Tongji University go beyond final-answer evaluation by breaking down LLM reasoning into two key parts: factual knowledge and logical steps. They introduce a framework built on two metrics: the Knowledge Index (KI) for factual accuracy and Information Gain (InfoGain) for reasoning quality. Their analysis of Qwen models across math and medical tasks reveals that reasoning skills do not easily transfer between domains. While supervised fine-tuning (SFT) improves accuracy, it often harms reasoning depth; reinforcement learning (RL), by contrast, helps refine reasoning by removing irrelevant information. This work highlights the importance of evaluating and training LLMs more thoughtfully.

Assessing Reasoning with Qwen2.5-7B and DeepSeek-R1 Models

The researchers evaluate reasoning in LLMs by analyzing Qwen2.5-7B and its DeepSeek-R1-distilled version, trained with SFT and RL. Using tasks from both the math and medical domains, they decompose responses into logical steps and assess them with the two metrics: Information Gain (how much uncertainty is reduced by each reasoning step) and the Knowledge Index (how factually accurate each step is, verified against expert sources). While InfoGain tracks the informativeness of each step, KI checks whether the knowledge aligns with real-world facts. This approach reveals how models reason and where they may falter in accuracy or logic.

Supervised Fine-Tuning vs. Reinforcement Learning in Domain-Specific Tasks

The study evaluates two variants of Qwen2.5-7B, Qwen-Base and the distilled Qwen-R1, on medical tasks. Results show that Qwen-Base consistently outperforms Qwen-R1 in accuracy, knowledge retention, and reasoning, especially after SFT and RL. The distilled model likely struggles because its prior training focused on math and code, resulting in a domain mismatch. Interestingly, SFT enhances medical knowledge more effectively than RL, although it may slightly compromise reasoning efficiency. RL, applied after SFT, improves both reasoning and knowledge. Medical benchmarks tend to rely more on factual knowledge than on abstract reasoning, unlike math-focused tasks.

Conclusion: Toward More Interpretable and Trustworthy LLMs

The study introduces a framework that separates knowledge from reasoning in order to better evaluate how LLMs think, particularly in high-stakes areas like medicine and math. Using Qwen models trained with SFT and RL, the researchers found that while SFT improves factual accuracy, which is essential in medicine, it often weakens reasoning. RL, in contrast, enhances reasoning by trimming out incorrect information. The framework could be extended to fields such as law or finance, where structured thinking is crucial. Overall, this approach helps clarify how LLMs make decisions and suggests ways to tailor their training for specific domains.

Check out the Paper, Code and Project Page. All credit for this research goes to the researchers of this project. The post How Do LLMs Really Reason? A Framework to Separate Logic from Knowledge appeared first on MarkTechPost.
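The article describes the two metrics only informally. As a rough, assumption-laden illustration of how Information Gain and the Knowledge Index could be computed over a decomposed reasoning chain (not the paper's exact definitions), consider the following sketch, where the per-step answer probabilities and factuality verdicts are assumed to come from the model and an expert-source checker, respectively.

```python
import math

def info_gain(answer_probs):
    """Per-step information gain, measured here (illustratively) as the drop in
    uncertainty -log p(correct answer) after appending each reasoning step.
    answer_probs[i] is the model's probability of the ground-truth answer
    conditioned on the first i reasoning steps (answer_probs[0] uses none)."""
    uncertainty = [-math.log(p) for p in answer_probs]
    return [uncertainty[i] - uncertainty[i + 1] for i in range(len(uncertainty) - 1)]

def knowledge_index(step_verdicts):
    """Fraction of reasoning steps whose factual claims were verified against
    expert sources (True = supported). A rough proxy for the paper's KI metric."""
    return sum(step_verdicts) / len(step_verdicts)

# Toy example: confidence in the right answer rises as steps are added.
print(info_gain([0.25, 0.40, 0.80, 0.95]))          # positive values = informative steps
print(knowledge_index([True, True, False, True]))   # 0.75
```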


AI, Committee, News, Uncategorized

CURE: A Reinforcement Learning Framework for Co-Evolving Code and Unit Test Generation in LLMs

Introduction

Large Language Models (LLMs) have shown substantial improvements in reasoning and precision through reinforcement learning (RL) and test-time scaling techniques. Despite outperforming traditional unit test generation methods, most existing approaches such as O1-Coder and UTGEN require supervision from ground-truth code. This supervision increases data collection costs and limits the scale of usable training data.

Limitations of Existing Approaches

Conventional unit test generation relies either on software analysis methods, which are rule-based and rigid, or on neural machine translation techniques, which often lack semantic alignment. While recent prompt-based and agentic methods improve performance, they still depend heavily on labeled code for fine-tuning. This reliance restricts adaptability and scalability, particularly in real-world, large-scale deployment scenarios.

CURE: A Self-Supervised Co-Evolutionary Approach

Researchers from the University of Chicago, Princeton University, Peking University, and ByteDance Seed introduce CURE, a self-supervised reinforcement learning framework that jointly trains a code generator and a unit test generator without any ground-truth code. CURE operates through a self-play mechanism: the LLM generates both correct and incorrect code, and the unit test generator learns to distinguish failure modes and refines itself accordingly. This bidirectional co-evolution enhances both code generation and verification without external supervision.

Architecture and Methodology

Base Models and Sampling Strategy

CURE is built on the Qwen2.5-7B and 14B Instruct models, with Qwen3-4B used for long-chain-of-thought (CoT) variants. Each training step samples 16 candidate code completions and 16 task-derived unit tests. Sampling is performed using vLLM with temperature 1.0 and top-p 1.0. For long-CoT models, a response-length-aware transformation penalizes lengthy outputs, improving inference-time efficiency.

Reward Function and Optimization

CURE introduces a mathematically grounded reward formulation that maximizes reward precision, defined as the likelihood that correct code scores higher than incorrect code across the generated unit tests, and applies response-based reward adjustments for long responses to reduce latency. Optimization proceeds via policy gradient methods, jointly updating the coder and the unit tester to improve their mutual performance.

Benchmark Datasets and Evaluation Metrics

CURE is evaluated on five standard coding datasets: LiveBench, MBPP, LiveCodeBench, CodeContests, and CodeForces. Performance is measured by unit test accuracy, one-shot code generation accuracy, and Best-of-N (BoN) accuracy using 16 code and test samples.

Performance and Efficiency Gains

The ReasonFlux-Coder models derived via CURE achieve gains of +37.8% in unit test accuracy, +5.3% in one-shot code generation accuracy, and +9.0% in BoN accuracy. Notably, ReasonFlux-Coder-4B achieves a 64.8% reduction in average unit test response length, substantially improving inference speed. Across all benchmarks, these models outperform traditional coding-supervised fine-tuned models (e.g., Qwen2.5-Coder-Instruct).

Application to Commercial LLMs

When ReasonFlux-Coder-4B is paired with GPT-series models, GPT-4o-mini gains +5.5% BoN accuracy and GPT-4.1-mini improves by +1.8%. API costs are reduced while performance is enhanced, indicating a cost-effective solution for production-level inference pipelines.
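As a concrete but non-authoritative sketch of the sampling and Best-of-N selection described above, the snippet below draws 16 code candidates and 16 unit tests with vLLM at temperature 1.0 and top-p 1.0, then keeps the candidate that passes the most generated tests. The checkpoint name, prompts, and the run_tests executor are illustrative assumptions, not CURE's released code.

```python
# Sampling + Best-of-N selection sketch: sample 16 programs and 16 unit tests,
# then keep the program that passes the most tests.
from typing import List
from vllm import LLM, SamplingParams

def run_tests(code: str, tests: List[str]) -> int:
    # Naive, non-sandboxed executor for illustration only: run the candidate
    # code, then each assert-based test, and count how many tests pass.
    passed = 0
    for test in tests:
        env: dict = {}
        try:
            exec(code, env)
            exec(test, env)
            passed += 1
        except Exception:
            pass
    return passed

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")   # assumed base coder checkpoint
params = SamplingParams(n=16, temperature=1.0, top_p=1.0, max_tokens=1024)

task = "Write a Python function add(a, b) that returns a + b."
code_candidates = [o.text for o in llm.generate([task], params)[0].outputs]
test_candidates = [o.text for o in llm.generate(
    ["Write one assert-based unit test for this task: " + task], params)[0].outputs]

# Best-of-N: keep the candidate program that passes the most generated tests.
best = max(code_candidates, key=lambda code: run_tests(code, test_candidates))
```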
Use as Reward Model for Label-Free Fine-Tuning

CURE-trained unit test generators can be repurposed as reward models in RL training. Using ReasonFlux-Coder-4B’s generated unit tests yields improvements comparable to human-labeled test supervision, enabling fully label-free reinforcement learning pipelines.

Broader Applicability and Future Directions

Beyond BoN selection, the ReasonFlux-Coder models integrate seamlessly with agentic coding frameworks such as MPSC (Multi-Perspective Self-Consistency), AlphaCodium, and S*, which benefit from CURE’s ability to refine both code and tests iteratively. CURE also boosts agentic unit test generation accuracy by over 25.1%, reinforcing its versatility.

Conclusion

CURE represents a significant advance in self-supervised learning for code generation and validation, enabling large language models to jointly evolve their coding and unit test generation capabilities without relying on ground-truth code. By leveraging a co-evolutionary reinforcement learning framework, CURE not only enhances core performance metrics such as one-shot accuracy and Best-of-N selection but also improves inference efficiency through response-length-aware optimization. Its compatibility with existing agentic coding pipelines and its ability to function as a label-free reward model make it a scalable and cost-effective solution for both training and deployment.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. The post CURE: A Reinforcement Learning Framework for Co-Evolving Code and Unit Test Generation in LLMs appeared first on MarkTechPost.
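The label-free reward idea above can be sketched in a few lines: score a sampled program by the fraction of model-generated unit tests it passes and feed that scalar to any RL trainer. The helper functions below are hypothetical placeholders standing in for the trained unit-test generator and a sandboxed executor; they are not the ReasonFlux-Coder API.

```python
# Illustrative label-free reward: a rollout's reward is the fraction of
# model-generated unit tests it passes. No human-written tests are needed.
from typing import List

def generate_unit_tests(task_description: str, n: int = 16) -> List[str]:
    raise NotImplementedError   # call the trained unit-test generator here

def run_test(code: str, test: str) -> bool:
    raise NotImplementedError   # execute one test against the code in a sandbox

def unit_test_reward(task_description: str, code: str) -> float:
    tests = generate_unit_tests(task_description)
    passed = sum(run_test(code, t) for t in tests)
    return passed / len(tests)   # scalar reward in [0, 1]
```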


AI, Committee, News, Uncategorized

Exploring the Escalation of Source Bias in User, Data, and Recommender System Feedback Loop

arXiv:2405.17998v2 Announce Type: replace-cross Abstract: Recommender systems are essential for information access, allowing users to present their content for recommendation. With the rise of large language models (LLMs), AI-generated content (AIGC), primarily in the form of text, has become a central part of the content ecosystem. As AIGC becomes increasingly prevalent, it is important to understand how it affects the performance and dynamics of recommender systems. To this end, we construct an environment that incorporates AIGC to explore its short-term impact. The results from popular sequential recommendation models reveal that AIGC is ranked higher in the recommender system, reflecting the phenomenon of source bias. To further explore the long-term impact of AIGC, we introduce a feedback loop with realistic simulators. The results show that the model’s preference for AIGC increases as user clicks on AIGC rise and the model trains on simulated click data. This leads to two issues: In the short term, bias toward AIGC encourages LLM-based content creation, increasing AIGC content, and causing unfair traffic distribution. From a long-term perspective, our experiments also show that when AIGC dominates the content ecosystem after a feedback loop, it can lead to a decline in recommendation performance. To address these issues, we propose a debiasing method based on L1-loss optimization to maintain long-term content ecosystem balance. In a real-world environment with AIGC generated by mainstream LLMs, our method ensures a balance between AIGC and human-generated content in the ecosystem. The code and dataset are available at https://github.com/Yuqi-Zhou/Rec_SourceBias.
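The abstract names only an L1-loss-based debiasing objective. One plausible reading (an assumption on our part, not the paper's formulation) is to add an L1 penalty on the gap between the model's average scores for AI-generated and human-written items, as in the sketch below.

```python
import torch
import torch.nn.functional as F

def debiased_loss(scores, clicks, is_aigc, lam=0.1):
    """Click-prediction loss plus an L1 penalty on the score gap between
    AI-generated and human-written items in the batch. This is one plausible
    reading of an 'L1-loss-based' debiasing term, not the paper's exact
    objective; it assumes both content sources appear in every batch.

    scores:  predicted relevance logits, shape (batch,)
    clicks:  observed clicks in {0, 1} as floats, shape (batch,)
    is_aigc: 1.0 for AI-generated items, 0.0 for human-written, shape (batch,)
    """
    rec_loss = F.binary_cross_entropy_with_logits(scores, clicks)
    gap = scores[is_aigc > 0.5].mean() - scores[is_aigc <= 0.5].mean()
    return rec_loss + lam * gap.abs()   # push the two sources toward score parity
```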


AI, Committee, News, Uncategorized

A Decomposition-Based Approach for Evaluating and Analyzing Inter-Annotator Disagreement

arXiv:2206.05446v2 Announce Type: replace Abstract: We propose a novel method to conceptually decompose an existing annotation into separate levels, allowing inter-annotator disagreement to be analyzed in each level separately. We suggest two distinct strategies to actualize this approach: a theoretically driven one, in which the researcher defines a decomposition based on prior knowledge of the annotation task, and an exploration-based one, in which many possible decompositions are inductively computed and presented to the researcher for interpretation and evaluation. Utilizing a recently constructed dataset for narrative analysis as our use case, we apply each of the two strategies to demonstrate the potential of our approach in testing hypotheses regarding the sources of annotation disagreements, as well as revealing latent structures and relations within the annotation task. We conclude by suggesting how to extend and generalize our approach, as well as use it for other purposes.
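As a toy illustration of the decomposition idea, the snippet below splits a composite annotation label into two levels and computes inter-annotator agreement separately for each. The label scheme and the choice of Cohen's kappa are illustrative assumptions, not the paper's setup.

```python
# Decompose composite labels of the form "<level1>/<level2>" and measure
# agreement per level. Labels and annotators are invented for illustration.
from sklearn.metrics import cohen_kappa_score

ann_a = ["event/positive", "event/negative", "state/positive", "state/positive"]
ann_b = ["event/positive", "state/negative", "state/positive", "state/negative"]

for level in range(2):
    a = [lab.split("/")[level] for lab in ann_a]
    b = [lab.split("/")[level] for lab in ann_b]
    # Chance-corrected agreement computed on this level alone.
    print(f"level {level + 1} kappa: {cohen_kappa_score(a, b):.2f}")
```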

