YouZum


CURE: A Reinforcement Learning Framework for Co-Evolving Code and Unit Test Generation in LLMs

Introduction

Large Language Models (LLMs) have shown substantial improvements in reasoning and precision through reinforcement learning (RL) and test-time scaling techniques. Despite outperforming traditional unit test generation methods, most existing approaches such as O1-Coder and UTGEN require supervision from ground-truth code. This supervision increases data collection costs and limits the scale of usable training data.

Limitations of Existing Approaches

Conventional unit test generation relies on:
- Software analysis methods, which are rule-based and rigid.
- Neural machine translation techniques, which often lack semantic alignment.

While recent prompt-based and agentic methods improve performance, they still depend heavily on labeled code for fine-tuning. This reliance restricts adaptability and scalability, particularly in real-world, large-scale deployment scenarios.

CURE: A Self-Supervised Co-Evolutionary Approach

Researchers from the University of Chicago, Princeton University, Peking University, and ByteDance Seed introduce CURE, a self-supervised reinforcement learning framework that jointly trains a code generator and a unit test generator without any ground-truth code. CURE operates through a self-play mechanism in which:
- The LLM generates both correct and incorrect code.
- The unit test generator learns to distinguish failure modes and refines itself accordingly.

This bidirectional co-evolution enhances both code generation and verification without external supervision.

Architecture and Methodology

Base Models and Sampling Strategy: CURE is built on the Qwen2.5-7B and 14B Instruct models, with Qwen3-4B used for the long-chain-of-thought (CoT) variant. Each training step samples 16 candidate code completions and 16 task-derived unit tests. Sampling is performed with vLLM at temperature 1.0 and top-p 1.0. For long-CoT models, a response-length-aware transformation penalizes lengthy outputs, improving inference-time efficiency.

Reward Function and Optimization: CURE introduces a mathematically grounded reward formulation that maximizes reward precision, defined as the likelihood that correct code scores higher than incorrect code across the generated unit tests, and applies response-based reward adjustments for long responses to reduce latency. Optimization proceeds via policy gradient methods, jointly updating the coder and the unit tester to improve their mutual performance.

Benchmark Datasets and Evaluation Metrics

CURE is evaluated on five standard coding datasets: LiveBench, MBPP, LiveCodeBench, CodeContests, and CodeForces. Performance is measured by unit test accuracy, one-shot code generation accuracy, and Best-of-N (BoN) accuracy using 16 code and 16 unit test samples (a sketch of such a selection procedure appears below).

Performance and Efficiency Gains

The ReasonFlux-Coder models derived via CURE achieve:
- +37.8% in unit test accuracy.
- +5.3% in one-shot code generation accuracy.
- +9.0% in BoN accuracy.

Notably, ReasonFlux-Coder-4B achieves a 64.8% reduction in average unit test response length, substantially improving inference speed. Across all benchmarks, these models outperform traditional coding-supervised fine-tuned models (e.g., Qwen2.5-Coder-Instruct).

Application to Commercial LLMs

When ReasonFlux-Coder-4B is paired with GPT-series models:
- GPT-4o-mini gains +5.5% BoN accuracy.
- GPT-4.1-mini improves by +1.8%.

API costs are reduced while performance improves, indicating a cost-effective option for production-level inference pipelines.
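To make the Best-of-N evaluation described above concrete, the sketch below scores each sampled code completion by how many model-generated unit tests it passes and returns the highest-scoring candidate. It is a minimal illustration under simplifying assumptions (a plain subprocess harness, no sandboxing, hypothetical input lists), not the authors' implementation.

```python
# Illustrative sketch (not the CURE implementation): Best-of-N selection where
# model-generated unit tests score candidate code completions.
# `run_candidate_against_test` and both input lists are hypothetical stand-ins.
import os
import subprocess
import tempfile


def run_candidate_against_test(candidate_code: str, test_code: str, timeout: float = 5.0) -> bool:
    """Run one generated unit test against one candidate solution in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code + "\n")
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.remove(path)


def best_of_n(candidates: list[str], generated_tests: list[str]) -> str:
    """Return the candidate that passes the most generated unit tests."""
    scores = [
        sum(run_candidate_against_test(code, test) for test in generated_tests)
        for code in candidates
    ]
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]
```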
Use as a Reward Model for Label-Free Fine-Tuning

CURE-trained unit test generators can be repurposed as reward models in RL training. Using ReasonFlux-Coder-4B's generated unit tests yields improvements comparable to human-labeled test supervision, enabling fully label-free reinforcement learning pipelines (a sketch of this reward signal appears at the end of this article).

Broader Applicability and Future Directions

Beyond BoN, ReasonFlux-Coder models integrate seamlessly with agentic coding frameworks such as:
- MPSC (Multi-Perspective Self-Consistency)
- AlphaCodium
- S*

These systems benefit from CURE's ability to refine both code and tests iteratively. CURE also improves agentic unit test generation accuracy by over 25.1%, reinforcing its versatility.

Conclusion

CURE represents a significant advance in self-supervised learning for code generation and validation, enabling large language models to jointly evolve their coding and unit test generation capabilities without relying on ground-truth code. By leveraging a co-evolutionary reinforcement learning framework, CURE not only lifts core performance metrics such as one-shot accuracy and Best-of-N selection but also improves inference efficiency through response-length-aware optimization. Its compatibility with existing agentic coding pipelines and its ability to serve as a label-free reward model make it a scalable and cost-effective solution for both training and deployment.

Check out the paper and GitHub page for details. All credit for this research goes to the researchers of this project.
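As a companion to the reward-model use described above, here is a hedged sketch of how pass rates on model-generated unit tests could serve as a label-free reward for policy-gradient fine-tuning. The function names and the policy-update call are hypothetical placeholders, not CURE's actual API.

```python
# Hypothetical sketch: pass rate on model-generated unit tests as a label-free
# reward for policy-gradient fine-tuning. `run_test`, `policy_gradient_step`,
# and the model objects are placeholder callables, not CURE's actual API.
from typing import Callable, Sequence


def unit_test_reward(candidate_code: str,
                     generated_tests: Sequence[str],
                     run_test: Callable[[str, str], bool]) -> float:
    """Reward = fraction of generated unit tests the candidate passes."""
    if not generated_tests:
        return 0.0
    passed = sum(run_test(candidate_code, t) for t in generated_tests)
    return passed / len(generated_tests)


def label_free_rl_step(prompts, coder_model, test_generator, run_test, policy_gradient_step):
    """One label-free update: sample code, score it with generated tests, update the coder."""
    for prompt in prompts:
        candidates = coder_model.sample(prompt, n=16)   # 16 completions, mirroring CURE's sampling
        tests = test_generator.sample(prompt, n=16)     # 16 generated unit tests
        rewards = [unit_test_reward(c, tests, run_test) for c in candidates]
        policy_gradient_step(coder_model, prompt, candidates, rewards)
```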


Exploring the Escalation of Source Bias in User, Data, and Recommender System Feedback Loop

arXiv:2405.17998v2 Announce Type: replace-cross Abstract: Recommender systems are essential for information access, allowing users to present their content for recommendation. With the rise of large language models (LLMs), AI-generated content (AIGC), primarily in the form of text, has become a central part of the content ecosystem. As AIGC becomes increasingly prevalent, it is important to understand how it affects the performance and dynamics of recommender systems. To this end, we construct an environment that incorporates AIGC to explore its short-term impact. The results from popular sequential recommendation models reveal that AIGC is ranked higher in the recommender system, reflecting the phenomenon of source bias. To further explore the long-term impact of AIGC, we introduce a feedback loop with realistic simulators. The results show that the model’s preference for AIGC grows as user clicks on AIGC rise and the model trains on simulated click data. This leads to two issues. In the short term, bias toward AIGC encourages LLM-based content creation, increasing the volume of AIGC and causing unfair traffic distribution. From a long-term perspective, our experiments also show that when AIGC dominates the content ecosystem after a feedback loop, recommendation performance can decline. To address these issues, we propose a debiasing method based on L1-loss optimization to maintain long-term content ecosystem balance. In a real-world environment with AIGC generated by mainstream LLMs, our method ensures a balance between AIGC and human-generated content in the ecosystem. The code and dataset are available at https://github.com/Yuqi-Zhou/Rec_SourceBias.
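The abstract does not spell out the debiasing objective, but one plausible reading of "L1-loss optimization" is an L1 penalty that keeps the model's average scores for AI-generated and human-written items close. The sketch below illustrates that interpretation only; the penalty form, the `is_aigc` mask, and the weight `lambda_debias` are assumptions, not details from the paper.

```python
# Hedged sketch: recommendation loss plus an L1 penalty that discourages a
# systematic score gap between AI-generated (AIGC) and human-written items.
# The paper's exact debiasing objective may differ; this only illustrates the idea.
import torch
import torch.nn.functional as F


def debiased_loss(scores: torch.Tensor,
                  labels: torch.Tensor,
                  is_aigc: torch.Tensor,
                  lambda_debias: float = 0.1) -> torch.Tensor:
    """scores: (batch,) predicted relevance logits; labels: (batch,) click targets;
    is_aigc: (batch,) bool mask marking AI-generated items."""
    rec_loss = F.binary_cross_entropy_with_logits(scores, labels.float())
    aigc_mean = scores[is_aigc].mean() if is_aigc.any() else scores.new_zeros(())
    human_mean = scores[~is_aigc].mean() if (~is_aigc).any() else scores.new_zeros(())
    debias_penalty = torch.abs(aigc_mean - human_mean)  # L1 gap between source groups
    return rec_loss + lambda_debias * debias_penalty
```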


A Decomposition-Based Approach for Evaluating and Analyzing Inter-Annotator Disagreement

arXiv:2206.05446v2 Announce Type: replace Abstract: We propose a novel method to conceptually decompose an existing annotation into separate levels, allowing inter-annotator disagreement to be analyzed at each level separately. We suggest two distinct strategies to realize this approach: a theoretically driven one, in which the researcher defines a decomposition based on prior knowledge of the annotation task, and an exploration-based one, in which many possible decompositions are inductively computed and presented to the researcher for interpretation and evaluation. Using a recently constructed dataset for narrative analysis as our use case, we apply each of the two strategies to demonstrate the potential of our approach for testing hypotheses about the sources of annotation disagreements, as well as for revealing latent structures and relations within the annotation task. We conclude by suggesting how to extend and generalize our approach, as well as use it for other purposes.
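The paper's decompositions are defined over its own annotation scheme, but the general idea (split each label into separate levels and measure disagreement per level) can be sketched generically. The level-extraction functions and the use of Cohen's kappa below are illustrative assumptions, not the authors' exact procedure.

```python
# Illustrative sketch: decompose each annotation into levels and measure
# inter-annotator agreement per level. The decomposition functions and the
# choice of Cohen's kappa are assumptions for illustration only.
from sklearn.metrics import cohen_kappa_score


def agreement_per_level(annotations_a, annotations_b, level_extractors):
    """annotations_a/b: parallel lists of labels from two annotators.
    level_extractors: dict mapping a level name to a function that projects
    a full label onto that level."""
    results = {}
    for level_name, extract in level_extractors.items():
        a = [extract(lab) for lab in annotations_a]
        b = [extract(lab) for lab in annotations_b]
        results[level_name] = cohen_kappa_score(a, b)
    return results


# Example: a hypothetical two-level scheme encoded as "category:subtype"
levels = {
    "category": lambda lab: lab.split(":")[0],
    "subtype": lambda lab: lab.split(":")[1],
}
```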


Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models

arXiv:2503.22879v3 Announce Type: replace-cross Abstract: State Space Models (SSMs) are emerging as a compelling alternative to Transformers because of their consistent memory usage and high performance. Despite this, scaling up SSMs on cloud services or limited-resource devices is challenging due to their storage requirements and computational power. To overcome this, quantizing SSMs with low bit-width data formats can reduce model size and benefit from hardware acceleration. As SSMs are prone to quantization-induced errors, recent efforts have focused on optimizing a particular model or bit-width for efficiency without sacrificing performance. However, distinct bit-width configurations are essential for different scenarios, like W4A8 for boosting large-batch decoding speed, and W4A16 for enhancing generation speed in short prompt applications for a single user. To this end, we present Quamba2, compatible with W8A8, W4A8, and W4A16 for both Mamba1 and Mamba2 backbones, addressing the growing demand for SSM deployment on various platforms. Based on the channel order preserving and activation persistence of SSMs, we propose an offline approach to quantize inputs of a linear recurrence in 8-bit by sorting and clustering for input $x$, combined with a per-state-group quantization for input-dependent parameters $B$ and $C$. To ensure compute-invariance in the SSM output, we rearrange weights offline according to the clustering sequence. The experiments show that Quamba2-8B outperforms two state-of-the-art SSM quantization methods and delivers 1.3$\times$ and 3$\times$ speed-ups in the pre-filling and generation stages, respectively, while offering 4$\times$ memory reduction with only a 1.6% average accuracy drop. The evaluation on MMLU shows the generalizability and robustness of our framework. The code and quantized models will be released at: https://github.com/enyac-group/Quamba.
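The per-state-group quantization mentioned above can be illustrated with a minimal 8-bit symmetric quantizer that assigns one scale per group of channels. The group layout, the symmetric scheme, and the rounding mode are simplifying assumptions, not Quamba2's exact formulation.

```python
# Minimal sketch of per-group symmetric 8-bit quantization, in the spirit of
# the per-state-group quantization described above. Group size, symmetric
# scaling, and rounding mode are simplifying assumptions.
import numpy as np


def quantize_per_group(x: np.ndarray, group_size: int = 64):
    """Quantize a (channels, length) activation tensor to int8 with one scale
    per group of `group_size` channels. Returns (q, scales)."""
    channels = x.shape[0]
    q = np.empty_like(x, dtype=np.int8)
    scales = np.empty(int(np.ceil(channels / group_size)), dtype=np.float32)
    for g, start in enumerate(range(0, channels, group_size)):
        block = x[start:start + group_size]
        scale = np.abs(block).max() / 127.0 + 1e-12          # symmetric scale per group
        scales[g] = scale
        q[start:start + group_size] = np.clip(np.round(block / scale), -128, 127).astype(np.int8)
    return q, scales


def dequantize_per_group(q: np.ndarray, scales: np.ndarray, group_size: int = 64) -> np.ndarray:
    """Reconstruct float activations from int8 values and per-group scales."""
    out = q.astype(np.float32)
    for g, start in enumerate(range(0, q.shape[0], group_size)):
        out[start:start + group_size] *= scales[g]
    return out
```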


AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists

arXiv:2506.08140v1 Announce Type: cross Abstract: Despite long-standing efforts in accelerating scientific discovery with AI, building AI co-scientists remains challenging due to limited high-quality data for training and evaluation. To tackle this data scarcity issue, we present AutoSDT, an automatic pipeline that collects high-quality coding tasks in real-world data-driven discovery workflows. AutoSDT leverages the coding capabilities and parametric knowledge of LLMs to search for diverse sources, select ecologically valid tasks, and synthesize accurate task instructions and code solutions. Using our pipeline, we construct AutoSDT-5K, a dataset of 5,404 coding tasks for data-driven discovery that covers four scientific disciplines and 756 unique Python packages. To the best of our knowledge, AutoSDT-5K is the only automatically collected and the largest open dataset for data-driven scientific discovery. Expert feedback on a subset of 256 tasks shows the effectiveness of AutoSDT: 93% of the collected tasks are ecologically valid, and 92.2% of the synthesized programs are functionally correct. Trained on AutoSDT-5K, the Qwen2.5-Coder-Instruct LLM series, dubbed AutoSDT-Coder, show substantial improvement on two challenging data-driven discovery benchmarks, ScienceAgentBench and DiscoveryBench. Most notably, AutoSDT-Coder-32B reaches the same level of performance as GPT-4o on ScienceAgentBench with a success rate of 7.8%, doubling the performance of its base model. On DiscoveryBench, it lifts the hypothesis matching score to 8.1, bringing a 17.4% relative improvement and closing the gap between open-weight models and GPT-4o.


Olica: Efficient Structured Pruning of Large Language Models without Retraining

arXiv:2506.08436v1 Announce Type: new Abstract: Most existing structured pruning methods for Large Language Models (LLMs) require substantial computational and data resources for retraining to reestablish the corrupted correlations, making them prohibitively expensive. To address this, we propose a pruning framework for LLMs called Orthogonal decomposition and Linear Calibration (Olica), which eliminates the need for retraining. A key observation is that the multi-head attention (MHA) layer depends on two types of matrix products. By treating these matrix products as unified entities and applying principal component analysis (PCA), we extract the most important information to compress LLMs without sacrificing accuracy or disrupting their original structure. Consequently, retraining becomes unnecessary. A fast decomposition method is devised, reducing the complexity of PCA by a factor of the square of the number of attention heads. Additionally, to mitigate the error accumulation problem caused by pruning the feed-forward network (FFN) layer, we introduce a linear calibration method to reconstruct the residual errors of pruned layers using low-rank matrices. By leveraging singular value decomposition (SVD) on the solution of the least-squares problem, these matrices are obtained without requiring retraining. Extensive experiments show that the proposed Olica is efficient in terms of data usage, GPU memory, and running time, while delivering superior performance across multiple benchmarks.
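The linear calibration step can be sketched as a least-squares fit of the pruned layer's residual error followed by an SVD truncation to obtain low-rank correction factors. The calibration data, the rank, and the exact residual definition below are assumptions for illustration, not Olica's precise procedure.

```python
# Hedged sketch of low-rank linear calibration: fit the residual error of a
# pruned layer by least squares, then truncate the solution with SVD so the
# correction stays low-rank. Rank and residual definition are assumptions.
import numpy as np


def lowrank_calibration(x_calib: np.ndarray,
                        y_original: np.ndarray,
                        y_pruned: np.ndarray,
                        rank: int = 16):
    """x_calib: (n, d_in) calibration inputs; y_original/y_pruned: (n, d_out)
    outputs of the unpruned and pruned layer. Returns (A, B) such that
    y_pruned + x_calib @ A @ B approximates y_original."""
    residual = y_original - y_pruned                          # error introduced by pruning
    w, *_ = np.linalg.lstsq(x_calib, residual, rcond=None)    # least-squares solution, (d_in, d_out)
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]                                # (d_in, rank)
    b = vt[:rank]                                             # (rank, d_out)
    return a, b
```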


A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment

arXiv:2504.15585v4 Announce Type: replace-cross Abstract: The remarkable success of Large Language Models (LLMs) has illuminated a promising pathway toward achieving Artificial General Intelligence for both academic and industrial communities, owing to their unprecedented performance across various applications. As LLMs continue to gain prominence in both research and commercial domains, their security and safety implications have become a growing concern, not only for researchers and corporations but also for every nation. Currently, existing surveys on LLM safety primarily focus on specific stages of the LLM lifecycle, e.g., the deployment phase or the fine-tuning phase, lacking a comprehensive understanding of the entire “lifechain” of LLMs. To address this gap, this paper introduces, for the first time, the concept of “full-stack” safety to systematically consider safety issues throughout the entire process of LLM training, deployment, and eventual commercialization. Compared to the off-the-shelf LLM safety surveys, our work demonstrates several distinctive advantages: (I) Comprehensive Perspective. We define the complete LLM lifecycle as encompassing data preparation, pre-training, post-training, deployment, and final commercialization. To our knowledge, this represents the first safety survey to encompass the entire lifecycle of LLMs. (II) Extensive Literature Support. Our research is grounded in an exhaustive review of over 800 papers, ensuring comprehensive coverage and systematic organization of security issues within a more holistic understanding. (III) Unique Insights. Through systematic literature analysis, we have developed reliable roadmaps and perspectives for each chapter. Our work identifies promising research directions, including safety in data generation, alignment techniques, model editing, and LLM-based agent systems. These insights provide valuable guidance for researchers pursuing future work in this field.


Representation Decomposition for Learning Similarity and Contrastness Across Modalities for Affective Computing

arXiv:2506.07086v1 Announce Type: new Abstract: Multi-modal affective computing aims to automatically recognize and interpret human attitudes from diverse data sources such as images and text, thereby enhancing human-computer interaction and emotion understanding. Existing approaches typically rely on unimodal analysis or on straightforward fusion of cross-modal information that fails to capture complex and conflicting evidence presented across different modalities. In this paper, we propose a novel LLM-based approach for affective computing that explicitly deconstructs visual and textual representations into shared (modality-invariant) and modality-specific components. Specifically, our approach first encodes and aligns input modalities using pre-trained multi-modal encoders, then employs a representation decomposition framework to separate common emotional content from unique cues, and finally integrates these decomposed signals via an attention mechanism to form a dynamic soft prompt for a multi-modal LLM. Extensive experiments on three representative tasks for affective computing, namely multi-modal aspect-based sentiment analysis, multi-modal emotion analysis, and hateful meme detection, demonstrate the effectiveness of our approach, which consistently outperforms strong baselines and state-of-the-art models.
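A hedged sketch of the shared/modality-specific decomposition described above: project each modality's embedding into a shared space and a private space, then fuse the pieces with attention to form a soft prompt. The dimensions, the attention formulation, and the module names are illustrative assumptions, not the paper's architecture.

```python
# Illustrative sketch: decompose image/text embeddings into shared and
# modality-specific parts and fuse them into a soft prompt via attention.
# Dimensions and module structure are assumptions for illustration only.
import torch
import torch.nn as nn


class SharedSpecificDecomposer(nn.Module):
    def __init__(self, dim: int = 768, prompt_len: int = 8):
        super().__init__()
        self.shared_proj = nn.Linear(dim, dim)   # modality-invariant component
        self.img_proj = nn.Linear(dim, dim)      # image-specific component
        self.txt_proj = nn.Linear(dim, dim)      # text-specific component
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.prompt_queries = nn.Parameter(torch.randn(prompt_len, dim))

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        """img_emb, txt_emb: (batch, dim) aligned embeddings from pre-trained encoders.
        Returns a (batch, prompt_len, dim) soft prompt for a multi-modal LLM."""
        shared = self.shared_proj((img_emb + txt_emb) / 2)
        parts = torch.stack(
            [shared, self.img_proj(img_emb), self.txt_proj(txt_emb)], dim=1
        )                                         # (batch, 3, dim)
        queries = self.prompt_queries.unsqueeze(0).expand(img_emb.size(0), -1, -1)
        prompt, _ = self.attn(queries, parts, parts)
        return prompt
```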
