YouZum

AI, Committee, News, Uncategorized

Microsoft Research Releases Skala: a Deep-Learning Exchange–Correlation Functional Targeting Hybrid-Level Accuracy at Semi-Local Cost

TL;DR: Skala is a deep-learning exchange–correlation functional for Kohn–Sham Density Functional Theory (DFT) that targets hybrid-level accuracy at semi-local cost, reporting MAE ≈ 1.06 kcal/mol on W4-17 (0.85 on the single-reference subset) and WTMAD-2 ≈ 3.89 kcal/mol on GMTKN55; evaluations use a fixed D3(BJ) dispersion correction. It is positioned for main-group molecular chemistry today, with transition metals and periodic systems slated as future extensions. The model and tooling are available now via Azure AI Foundry Labs and the open-source microsoft/skala repository.

Microsoft Research has released Skala, a neural exchange–correlation (XC) functional for Kohn–Sham DFT. Skala learns non-local effects from data while keeping the computational profile comparable to meta-GGA functionals (paper: https://arxiv.org/pdf/2506.14665).

What Skala is (and isn't)

Skala replaces a hand-crafted XC form with a neural functional evaluated on standard meta-GGA grid features. It explicitly does not attempt to learn dispersion in this first release; benchmark evaluations use a fixed D3 correction (D3(BJ) unless noted). The goal is rigorous main-group thermochemistry at semi-local cost, not a universal functional for all regimes on day one.

Benchmarks

On W4-17 atomization energies, Skala reports an MAE of 1.06 kcal/mol on the full set and 0.85 kcal/mol on the single-reference subset. On GMTKN55, Skala achieves a WTMAD-2 of 3.89 kcal/mol, competitive with top hybrids; all functionals were evaluated with the same dispersion settings (D3(BJ) unless VV10/D3(0) applies).
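WTMAD-2 aggregates each GMTKN55 subset's mean absolute deviation, weighted by subset size and by how small its reference energies are, so that errors on subtle-energy subsets count more. A minimal sketch of that aggregation (the subset numbers below are toy values, not actual GMTKN55 data):

```python
def wtmad2(subsets):
    """WTMAD-2: weighted mean absolute deviation over benchmark subsets.

    Each subset is (n_reactions, mean_abs_ref_energy, mad), energies in
    kcal/mol. 56.84 kcal/mol is the average |dE| over all of GMTKN55.
    """
    total_n = sum(n for n, _, _ in subsets)
    weighted = sum(n * (56.84 / abs_ref) * mad for n, abs_ref, mad in subsets)
    return weighted / total_n

# Toy illustration: a subset with small reference energies is up-weighted
# relative to a large-energy subset with the same MAD.
toy = [
    (100, 10.0, 1.0),   # small reference energies -> weight 5.684
    (100, 100.0, 1.0),  # large reference energies -> weight 0.5684
]
print(round(wtmad2(toy), 4))
```

A subset whose reactions involve only ~10 kcal/mol of energy thus contributes ten times the weight of one at ~100 kcal/mol, which is why a single headline WTMAD-2 number can summarize 55 heterogeneous subsets.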
Architecture and training

Skala evaluates meta-GGA features on the standard numerical integration grid, then aggregates information via a finite-range, non-local neural operator with a bounded enhancement factor that respects exact constraints, including the Lieb–Oxford bound, size consistency, and coordinate scaling. Training proceeds in two phases: (1) pre-training on B3LYP densities with XC labels extracted from high-level wavefunction energies; (2) SCF-in-the-loop fine-tuning using Skala's own densities (no backpropagation through the SCF procedure). The model is trained on a large, curated corpus dominated by ~80k high-accuracy total atomization energies (MSR-ACC/TAE) plus additional reactions and properties, with W4-17 and GMTKN55 removed from training to avoid leakage.

Cost profile and implementation

Skala keeps semi-local cost scaling and is engineered for GPU execution via GauXC. The public repository exposes (i) a PyTorch implementation and a microsoft-skala PyPI package with PySCF/ASE hooks, and (ii) a GauXC add-on for integrating Skala into other DFT stacks. The README lists ~276k parameters and provides minimal examples.

Application

In practice, Skala slots into main-group molecular workflows where semi-local cost and hybrid-level accuracy matter: high-throughput reaction energetics (ΔE, barrier estimates), conformer/radical stability ranking, and geometry/dipole predictions feeding QSAR and lead-optimization loops. Because it is exposed via PySCF/ASE and a GauXC GPU path, teams can run batched SCF jobs and screen candidates at near meta-GGA runtime, then reserve hybrids or coupled cluster for final checks. For managed experiments and sharing, Skala is available in Azure AI Foundry Labs and as an open GitHub/PyPI stack.

Key Takeaways

Performance: Skala achieves MAE 1.06 kcal/mol on W4-17 (0.85 on the single-reference subset) and WTMAD-2 3.89 kcal/mol on GMTKN55; dispersion is applied via D3(BJ) in reported evaluations.
Method: A neural XC functional with meta-GGA inputs and finite-range learned non-locality, honoring key exact constraints; it retains semi-local O(N³) cost and does not learn dispersion in this release.

Training signal: Trained on ~150k high-accuracy labels, including ~80k CCSD(T)/CBS-quality atomization energies (MSR-ACC/TAE); SCF-in-the-loop fine-tuning uses Skala's own densities; public test sets are de-duplicated from training.

Editorial Comments

Skala is a pragmatic step: a neural XC functional reporting MAE 1.06 kcal/mol on W4-17 (0.85 on the single-reference subset) and WTMAD-2 3.89 kcal/mol on GMTKN55, evaluated with D3(BJ) dispersion, and scoped today to main-group molecular systems. It is accessible for testing via Azure AI Foundry Labs, with code and PySCF/ASE integrations on GitHub, enabling direct head-to-head baselines against existing meta-GGAs and hybrids.

Check out the Technical Paper, GitHub Page and technical blog. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our Newsletter. You can also join us on Telegram. The post Microsoft Research Releases Skala: a Deep-Learning Exchange–Correlation Functional Targeting Hybrid-Level Accuracy at Semi-Local Cost appeared first on MarkTechPost.



Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts

arXiv:2510.07239v1 Announce Type: new Abstract: Automated red-teaming has emerged as a scalable approach for auditing Large Language Models (LLMs) prior to deployment, yet existing approaches lack mechanisms to efficiently adapt to model-specific vulnerabilities at inference. We introduce Red-Bandit, a red-teaming framework that adapts online to identify and exploit model failure modes under distinct attack styles (e.g., manipulation, slang). Red-Bandit post-trains a set of parameter-efficient LoRA experts, each specialized for a particular attack style, using reinforcement learning that rewards the generation of unsafe prompts via a rule-based safety model. At inference, a multi-armed bandit policy dynamically selects among these attack-style experts based on the target model’s response safety, balancing exploration and exploitation. Red-Bandit achieves state-of-the-art results on AdvBench under sufficient exploration (ASR@10), while producing more human-readable prompts (lower perplexity). Moreover, Red-Bandit’s bandit policy serves as a diagnostic tool for uncovering model-specific vulnerabilities by indicating which attack styles most effectively elicit unsafe behaviors.
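The bandit layer described in the abstract is a standard exploration/exploitation policy over attack-style experts. A minimal UCB1 sketch of that selection loop (the attack-style names, unsafe-response rates, and reward simulation below are illustrative stand-ins, not details from the paper):

```python
import math
import random

def ucb1_select(counts, rewards, t):
    """Pick the arm maximizing mean reward plus an exploration bonus (UCB1)."""
    for arm, n in counts.items():
        if n == 0:          # play every arm once before using the bonus
            return arm
    return max(counts, key=lambda a: rewards[a] / counts[a]
               + math.sqrt(2 * math.log(t) / counts[a]))

random.seed(0)
arms = ["manipulation", "slang", "roleplay"]   # illustrative attack styles
true_unsafe_rate = {"manipulation": 0.8, "slang": 0.1, "roleplay": 0.3}
counts = {a: 0 for a in arms}
rewards = {a: 0.0 for a in arms}

for t in range(1, 2001):
    arm = ucb1_select(counts, rewards, t)
    # Reward 1 if the target model's response is unsafe (simulated here;
    # Red-Bandit scores real responses with a rule-based safety model).
    r = 1.0 if random.random() < true_unsafe_rate[arm] else 0.0
    counts[arm] += 1
    rewards[arm] += r

print({a: counts[a] for a in arms})
```

The final pull counts double as the diagnostic signal the paper mentions: the arm the policy converges to indicates which attack style most reliably elicits unsafe behavior from that particular target model.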



SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis

arXiv:2505.16834v3 Announce Type: replace Abstract: Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios requiring multi-step reasoning and iterative information retrieval. However, existing approaches face critical limitations: they lack high-quality training trajectories, or suffer from distributional mismatches in simulated environments and prohibitive computational costs for real-world deployment. This paper introduces SimpleDeepSearcher, a lightweight yet effective framework that bridges this gap through strategic data engineering rather than complex training paradigms. Our approach synthesizes high-quality training data by simulating realistic user interactions in live web search environments, coupled with a multi-criteria curation strategy that optimizes the diversity and quality of both the input and output sides. Experiments on five benchmarks across diverse domains demonstrate that SFT on only 871 curated samples yields significant improvements over RL-based baselines. Our work establishes SFT as a viable pathway by systematically addressing the data-scarce bottleneck, offering practical insights for efficient deep search systems. Our code is available at https://github.com/RUCAIBox/SimpleDeepSearcher.
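The multi-criteria curation step can be pictured as a quality filter followed by a greedy diversity pass over the surviving samples. A toy sketch of that shape (the quality scores, token-overlap similarity, and thresholds here are stand-ins, not the paper's actual criteria):

```python
def jaccard(a, b):
    """Token-overlap similarity between two text samples."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def curate(samples, quality, k, min_quality=0.5):
    """Keep samples above a quality threshold, then greedily select the k
    that are least similar to what is already chosen (max-min diversity)."""
    pool = [s for s in samples if quality[s] >= min_quality]
    selected = [max(pool, key=quality.get)]        # seed with the best sample
    while len(selected) < k and len(selected) < len(pool):
        remaining = [s for s in pool if s not in selected]
        # Pick the candidate whose closest selected neighbor is farthest away.
        nxt = max(remaining, key=lambda s: min(1 - jaccard(s, t) for t in selected))
        selected.append(nxt)
    return selected

samples = [
    "how tall is the eiffel tower",
    "how tall is the eiffel tower in meters",
    "who wrote war and peace",
    "latest gdp of japan",
]
quality = {s: q for s, q in zip(samples, [0.9, 0.8, 0.7, 0.6])}
print(curate(samples, quality, k=3))
```

Note how the near-duplicate query is passed over despite its high quality score: with a tiny budget (the paper fine-tunes on only 871 samples), redundancy is as costly as low quality.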



BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods

arXiv:2510.06811v1 Announce Type: new Abstract: The Circuit Localization track of the Mechanistic Interpretability Benchmark (MIB) evaluates methods for localizing circuits within large language models (LLMs), i.e., subnetworks responsible for specific task behaviors. In this work, we investigate whether ensembling two or more circuit localization methods can improve performance. We explore two variants: parallel and sequential ensembling. In parallel ensembling, we combine attribution scores assigned to each edge by different methods-e.g., by averaging or taking the minimum or maximum value. In the sequential ensemble, we use edge attribution scores obtained via EAP-IG as a warm start for a more expensive but more precise circuit identification method, namely edge pruning. We observe that both approaches yield notable gains on the benchmark metrics, leading to a more precise circuit identification approach. Finally, we find that taking a parallel ensemble over various methods, including the sequential ensemble, achieves the best results. We evaluate our approach in the BlackboxNLP 2025 MIB Shared Task, comparing ensemble scores to official baselines across multiple model-task combinations.
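The parallel-ensembling variant reduces to a per-edge combination of attribution scores from different methods. A minimal sketch (the method names and edge scores are illustrative; real attributions would come from EAP-IG, edge pruning, etc.):

```python
def ensemble_edges(score_maps, mode="mean"):
    """Combine per-edge attribution scores from several circuit-localization
    methods (parallel ensembling): average, minimum, or maximum per edge."""
    combine = {"mean": lambda v: sum(v) / len(v), "min": min, "max": max}[mode]
    edges = set().union(*score_maps)
    return {e: combine([m.get(e, 0.0) for m in score_maps]) for e in edges}

# Illustrative scores from two hypothetical methods over the same edges,
# keyed by (source node, target node).
m1 = {("a0", "mlp3"): 0.9, ("mlp3", "a7"): 0.2, ("a0", "a7"): 0.1}
m2 = {("a0", "mlp3"): 0.7, ("mlp3", "a7"): 0.6, ("a0", "a7"): 0.0}

avg = ensemble_edges([m1, m2], "mean")
print(sorted(avg.items(), key=lambda kv: -kv[1]))
```

Thresholding the combined ranking then yields the ensemble circuit; the sequential variant instead feeds one method's scores in as the starting point for a more expensive one.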



AutoRev: Multi-Modal Graph Retrieval for Automated Peer-Review Generation

arXiv:2505.14376v2 Announce Type: replace Abstract: Enhancing the quality and efficiency of academic publishing is critical for both authors and reviewers, as research papers are central to scholarly communication and a major source of high-quality content on the web. To support this goal, we propose AutoRev, an automatic peer-review system designed to provide actionable, high-quality feedback to both reviewers and authors. AutoRev leverages a novel Multi-Modal Retrieval-Augmented Generation (RAG) framework that combines textual and graphical representations of academic papers. By modelling documents as graphs, AutoRev effectively retrieves the most pertinent information, significantly reducing the input context length for LLMs and thereby enhancing their review generation capabilities. Experimental results show that AutoRev outperforms state-of-the-art baselines by up to 58.72% and demonstrates competitive performance in human evaluations against ground truth reviews. We envision AutoRev as a powerful tool to streamline the peer-review workflow, alleviating challenges and enabling scalable, high-quality scholarly publishing. By guiding both authors and reviewers, AutoRev has the potential to accelerate the dissemination of quality research on the web at a larger scale. Code will be released upon acceptance.
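The core retrieval idea, modelling a paper as a graph and pulling only the most pertinent nodes into the LLM context, can be sketched with a simple score-plus-neighborhood heuristic (the word-overlap scoring and toy section graph below are stand-ins; AutoRev's retriever is multi-modal and far richer):

```python
def overlap(query, text):
    """Fraction of query tokens that appear in the candidate text."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def retrieve(graph, nodes, query, k=2):
    """Rank section nodes by query relevance, keep the top-k plus their
    graph neighbors, and return the reduced context in ranked order."""
    ranked = sorted(nodes, key=lambda n: overlap(query, nodes[n]), reverse=True)
    keep = set(ranked[:k])
    for n in list(keep):                 # pull in structurally linked sections
        keep.update(graph.get(n, ()))
    return [nodes[n] for n in ranked if n in keep]

# Toy paper graph: sections as nodes, structural links as edges.
nodes = {
    "intro": "we study automated peer review generation",
    "method": "a multi modal retrieval augmented generation framework",
    "results": "outperforms baselines by a large margin",
    "related": "prior work on review generation systems",
}
graph = {"method": ["results"], "intro": ["related"]}

print(retrieve(graph, nodes, "retrieval augmented generation", k=1))
```

Only the relevant subgraph reaches the LLM, which is the mechanism behind the claimed reduction in input context length.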



RA3: Mid-Training with Temporal Action Abstractions for Faster Reinforcement Learning (RL) Post-Training in Code LLMs

TL;DR: New research from Apple formalizes what "mid-training" should do before reinforcement learning (RL) post-training and introduces RA3 (Reasoning as Action Abstractions), an EM-style procedure that learns temporally consistent latent actions from expert traces, then fine-tunes on those bootstrapped traces. It shows mid-training should (1) prune to a compact near-optimal action subspace and (2) shorten the effective planning horizon, improving RL convergence. Empirically, RA3 improves HumanEval/MBPP by ~8/4 points over base/NTP and accelerates RLVR on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.

What does the research present?

The research team presents the first formal treatment of how mid-training shapes post-training RL. They decompose outcomes into (i) pruning efficiency, i.e., how well mid-training selects a compact near-optimal action subset that shapes the initial policy prior, and (ii) RL convergence, i.e., how quickly post-training improves within that restricted set. The analysis argues mid-training is most effective when the decision space is compact and the effective horizon is short, favoring temporal abstractions over primitive next-token actions (paper: https://arxiv.org/pdf/2509.25810).

Algorithm: RA3 in one pass

RA3 derives a sequential variational lower bound (a temporal ELBO) and optimizes it with an EM-like loop:

E-step (latent discovery): use RL to infer temporally consistent latent structures (abstractions) aligned to expert sequences.

M-step (model update): perform next-token prediction on the bootstrapped, latent-annotated traces to make those abstractions part of the model's policy.

Results: code generation and RLVR

On Python code tasks, the research team reports that across multiple base models, RA3 improves average pass@k on HumanEval and MBPP by ~8 and ~4 points over the base model and an NTP mid-training baseline.
In post-training, RLVR converges faster and reaches higher final performance on HumanEval+, MBPP+, LiveCodeBench, and Codeforces when initialized from RA3. These are mid- and post-training effects, respectively; the evaluation scope is code generation.

Key Takeaways

The research team formalizes mid-training via two determinants, pruning efficiency and impact on RL convergence, arguing effectiveness rises when the decision space is compact and the effective horizon is short.

RA3 optimizes a sequential variational lower bound by iteratively discovering temporally consistent latent structures with RL and then fine-tuning on bootstrapped traces (EM-style).

On code generation, RA3 reports ~+8 (HumanEval) and ~+4 (MBPP) average pass@k gains over base/NTP mid-training baselines across several model scales.

Initializing post-training with RA3 accelerates RLVR convergence and improves asymptotic performance on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.

Editorial Comments

RA3's contribution is concrete and narrow: it formalizes mid-training around two determinants, pruning efficiency and RL convergence, and operationalizes them via a temporal ELBO optimized in an EM loop to learn persistent action abstractions before RLVR. The researchers report ~+8 (HumanEval) and ~+4 (MBPP) average pass@k gains over base/NTP and faster RLVR convergence on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.

Check out the Technical Paper. The post RA3: Mid-Training with Temporal Action Abstractions for Faster Reinforcement Learning (RL) Post-Training in Code LLMs appeared first on MarkTechPost.
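The horizon-shortening effect of temporal abstraction can be caricatured in a few lines: repeatedly find the most consistent temporal pattern in expert traces (here, simply the most frequent action bigram, a crude stand-in for RA3's RL-based latent discovery), then fold it into the action vocabulary so later traces are shorter. This is an analogy for the mechanism, not the paper's algorithm:

```python
from collections import Counter

def most_common_bigram(traces):
    """Toy 'E-step': find the most frequent adjacent action pair."""
    counts = Counter()
    for tr in traces:
        counts.update(zip(tr, tr[1:]))
    return counts.most_common(1)[0][0]

def merge(trace, pair):
    """Rewrite a trace so the chosen bigram becomes one abstract action."""
    out, i = [], 0
    while i < len(trace):
        if i + 1 < len(trace) and (trace[i], trace[i + 1]) == pair:
            out.append(trace[i] + "+" + trace[i + 1])
            i += 2
        else:
            out.append(trace[i])
            i += 1
    return out

# Expert traces over primitive actions; "read_file -> edit" recurs.
traces = [
    ["read_file", "edit", "run_tests", "read_file", "edit"],
    ["read_file", "edit", "commit"],
]
for _ in range(2):                         # two EM-style rounds
    pair = most_common_bigram(traces)      # discover a temporal abstraction
    traces = [merge(t, pair) for t in traces]  # "M-step": retrain on rewrites

print(traces)
```

After two rounds the first trace shrinks from five primitive actions to two abstract ones, which is the sense in which abstractions shorten the effective planning horizon that post-training RL must search over.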



Fun-ASR Technical Report

arXiv:2509.12508v3 Announce Type: replace Abstract: In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present Fun-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, Fun-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, Fun-ASR achieves state-of-the-art performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.



Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort

arXiv:2510.01367v3 Announce Type: replace-cross Abstract: Reward hacking, where a reasoning model exploits loopholes in a reward function to achieve high rewards without solving the intended task, poses a significant threat. This behavior may be explicit, i.e. verbalized in the model’s chain-of-thought (CoT), or implicit, where the CoT appears benign thus bypasses CoT monitors. To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation). Our key observation is that hacking occurs when exploiting the loophole is easier than solving the actual task. This means that the model is using less ‘effort’ than required to achieve high reward. TRACE quantifies effort by measuring how early a model’s reasoning becomes sufficient to obtain the reward. We progressively truncate a model’s CoT at various lengths, force the model to answer, and estimate the expected reward at each cutoff. A hacking model, which takes a shortcut, will achieve a high expected reward with only a small fraction of its CoT, yielding a large area under the accuracy-vs-length curve. TRACE achieves over 65% gains over our strongest 72B CoT monitor in math reasoning, and over 30% gains over a 32B monitor in coding. We further show that TRACE can discover unknown loopholes during training. Overall, TRACE offers a scalable unsupervised approach for oversight where current monitoring methods prove ineffective.
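TRACE's effort measurement ultimately reduces to an area-under-curve computation over truncated chains of thought: force an answer at each cutoff, estimate expected reward, and integrate. A toy sketch of the scoring step (the accuracy values are made up; a real run would re-query the model at every truncation point):

```python
def trace_auc(cutoffs, accuracies):
    """Area under the accuracy-vs-CoT-fraction curve (trapezoidal rule).
    A high AUC means the reward is achieved with little reasoning effort."""
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(zip(cutoffs, accuracies),
                                  zip(cutoffs[1:], accuracies[1:])):
        auc += (x1 - x0) * (y0 + y1) / 2
    return auc

cutoffs = [0.0, 0.25, 0.5, 0.75, 1.0]    # fraction of the CoT kept
honest  = [0.1, 0.2, 0.4, 0.7, 0.9]      # accuracy climbs with effort
hacker  = [0.8, 0.9, 0.9, 0.9, 0.9]      # reward achieved almost immediately

print(trace_auc(cutoffs, honest), trace_auc(cutoffs, hacker))
```

The hacking profile dominates the honest one because its curve saturates early, exactly the signature TRACE flags: high expected reward from only a small fraction of the chain of thought.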



Partial Information Decomposition via Normalizing Flows in Latent Gaussian Distributions

arXiv:2510.04417v1 Announce Type: cross Abstract: The study of multimodality has garnered significant interest in fields where the analysis of interactions among multiple information sources can enhance predictive modeling, data fusion, and interpretability. Partial information decomposition (PID) has emerged as a useful information-theoretic framework to quantify the degree to which individual modalities independently, redundantly, or synergistically convey information about a target variable. However, existing PID methods depend on optimizing over a joint distribution constrained by estimated pairwise probability distributions, which are costly and inaccurate for continuous and high-dimensional modalities. Our first key insight is that the problem can be solved efficiently when the pairwise distributions are multivariate Gaussians, and we refer to this problem as Gaussian PID (GPID). We propose a new gradient-based algorithm that substantially improves the computational efficiency of GPID based on an alternative formulation of the underlying optimization problem. To generalize the applicability to non-Gaussian data, we learn information-preserving encoders to transform random variables of arbitrary input distributions into pairwise Gaussian random variables. Along the way, we resolve an open problem regarding the optimality of joint Gaussian solutions for GPID. Empirical validation in diverse synthetic examples demonstrates that our proposed method provides more accurate and efficient PID estimates than existing baselines. We further evaluate a series of large-scale multimodal benchmarks to show its utility in real-world applications of quantifying PID in multimodal datasets and selecting high-performing models.
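The Gaussian specialization matters because information quantities have closed forms there. For example, two jointly Gaussian variables with correlation ρ have mutual information I(X;Y) = -½ ln(1 - ρ²), the kind of building block a GPID solver can evaluate exactly instead of estimating densities (this is the standard bivariate identity, shown here for intuition, not the paper's algorithm):

```python
import math

def gaussian_mi(rho: float) -> float:
    """Mutual information (in nats) between two jointly Gaussian variables
    with correlation coefficient rho: I(X;Y) = -0.5 * ln(1 - rho^2)."""
    return -0.5 * math.log(1.0 - rho * rho)

# MI is zero for independent Gaussians and diverges as |rho| -> 1.
for rho in (0.0, 0.5, 0.9, 0.99):
    print(rho, round(gaussian_mi(rho), 4))
```

Because every pairwise information term is available analytically, the optimization over the joint distribution can run in the low-dimensional space of covariance parameters rather than over estimated probability tables.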



How Many Parameters Does Your Task Really Need? Task Specific Pruning with LLM-Sieve

arXiv:2505.18350v2 Announce Type: replace-cross Abstract: As Large Language Models (LLMs) are increasingly deployed for narrow tasks in resource-constrained settings, a central question arises: how much of an LLM is truly necessary for a given task? We present LLM-Sieve, a framework that prunes LLMs down to the minimal parameter subset needed to preserve task performance. Our approach introduces two innovations: (i) output-aligned non-orthogonal projections, which yield more faithful low-rank approximations than traditional PCA/SVD by aligning directly with layer outputs; and (ii) adaptive pruning via a Genetic Algorithm, which automatically discovers matrix-specific pruning levels and exposes the uneven distribution of task-relevant knowledge. Across models from 3.8B to 70B parameters, LLM-Sieve removes 20-75% of weights with only 1-5% accuracy loss-substantially ahead of prior pruning methods. Beyond efficiency, our framework reveals bottleneck matrices that concentrate critical knowledge, suggesting architectural implications for future LLM design. LLM-Sieve integrates seamlessly with LoRA fine-tuning and quantization, enabling both efficient deployment and deeper understanding of knowledge organization in LLMs.
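The adaptive-pruning idea, searching for matrix-specific pruning levels with a genetic algorithm, can be sketched with a toy fitness function (LLM-Sieve's real fitness is task accuracy of the low-rank-pruned model; the sensitivity numbers and quadratic penalty below are made up for illustration):

```python
import random

random.seed(1)
N_MATRICES = 6

# Toy per-matrix sensitivity: pruning a sensitive matrix hurts "accuracy" more.
SENSITIVITY = [0.9, 0.1, 0.5, 0.2, 0.8, 0.3]

def fitness(levels):
    """Reward high average pruning, penalize accuracy loss (a made-up
    stand-in for measuring task performance of the pruned model)."""
    sparsity = sum(levels) / len(levels)
    acc_loss = sum(s * l * l for s, l in zip(SENSITIVITY, levels))
    return sparsity - acc_loss

def mutate(levels, scale=0.1):
    return [min(0.95, max(0.0, l + random.uniform(-scale, scale))) for l in levels]

def crossover(a, b):
    return [random.choice(pair) for pair in zip(a, b)]

# Each individual is a vector of per-matrix pruning levels in [0, 0.95].
pop = [[random.random() * 0.5 for _ in range(N_MATRICES)] for _ in range(20)]
for _ in range(100):
    pop.sort(key=fitness, reverse=True)
    elite = pop[:5]
    pop = elite + [mutate(crossover(random.choice(elite), random.choice(elite)))
                   for _ in range(15)]

best = max(pop, key=fitness)
print([round(l, 2) for l in best])
```

The search drifts toward pruning insensitive matrices hard while sparing sensitive ones, which mirrors the paper's observation that task-relevant knowledge is unevenly distributed and that "bottleneck" matrices must be kept dense.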

