YouZum

Committee

AI, Committee, News, Uncategorized

MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding

arXiv:2510.08804v1 Announce Type: new Abstract: We present MOSAIC, a multi-agent Large Language Model (LLM) framework for solving challenging scientific coding tasks. Unlike general-purpose coding, scientific workflows require algorithms that are rigorous, interconnected with deep domain knowledge, and incorporate domain-specific reasoning, as well as algorithm iteration without requiring I/O test cases. Many scientific problems also require a sequence of subproblems to be solved, leading to the final desired result. MOSAIC is designed as a training-free framework with specially designed agents to self-reflect, create the rationale, code, and debug within a student-teacher paradigm to address the challenges of scientific code generation. This design facilitates stepwise problem decomposition, targeted error correction, and, when combined with our Consolidated Context Window (CCW), mitigates LLM hallucinations when solving complex scientific tasks involving chained subproblems. We evaluate MOSAIC on scientific coding benchmarks and demonstrate that our specialized agentic framework outperforms existing approaches in terms of accuracy, robustness, and interpretability.

Accent-Invariant Automatic Speech Recognition via Saliency-Driven Spectrogram Masking

arXiv:2510.09528v1 Announce Type: new Abstract: Pre-trained transformer-based models have significantly advanced automatic speech recognition (ASR), yet they remain sensitive to accent and dialectal variations, resulting in elevated word error rates (WER) in linguistically diverse languages such as English and Persian. To address this challenge, we propose an accent-invariant ASR framework that integrates accent and dialect classification into the recognition pipeline. Our approach involves training a spectrogram-based classifier to capture accent-specific cues, masking the regions most influential to its predictions, and using the masked spectrograms for data augmentation. This enhances the robustness of ASR models against accent variability. We evaluate the method using both English and Persian speech. For Persian, we introduce a newly collected dataset spanning multiple regional accents, establishing the first systematic benchmark for accent variation in Persian ASR that fills a critical gap in multilingual speech research and provides a foundation for future studies on low-resource, linguistically diverse languages. Experimental results with the Whisper model demonstrate that our masking and augmentation strategy yields substantial WER reductions in both English and Persian settings, confirming the effectiveness of the approach. This research advances the development of multilingual ASR systems that are resilient to accent and dialect diversity. Code and dataset are publicly available at: https://github.com/MH-Sameti/Accent_invariant_ASR
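
The masking step described in the abstract (hide the spectrogram regions that most influence the accent classifier, then reuse the masked spectrograms as augmentation data) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `mask_salient_regions`, the `fraction` and `fill_value` parameters, and the assumption that a saliency map of the same shape has already been computed are all hypothetical.

```python
import numpy as np


def mask_salient_regions(spectrogram, saliency, fraction=0.1, fill_value=0.0):
    """Return a copy of `spectrogram` with the most salient bins overwritten.

    `saliency` is assumed to be a precomputed map (for example, gradient
    magnitudes from an accent classifier) with the same shape as the
    spectrogram. `fraction` is the share of bins to mask (hypothetical knob).
    """
    spectrogram = np.asarray(spectrogram, dtype=float)
    saliency = np.asarray(saliency, dtype=float)
    if spectrogram.shape != saliency.shape:
        raise ValueError("spectrogram and saliency must have the same shape")

    masked = spectrogram.copy()
    k = int(round(fraction * saliency.size))
    if k == 0:
        return masked

    # Flat indices of the k largest saliency values.
    top = np.argpartition(saliency.ravel(), -k)[-k:]
    masked.flat[top] = fill_value
    return masked


if __name__ == "__main__":
    spec = np.ones((4, 5))
    sal = np.arange(20).reshape(4, 5)
    print(mask_salient_regions(spec, sal, fraction=0.25))
```

The masked copies would then be added to the ASR training set alongside the originals; how many copies to add and which fill value to use are choices the abstract does not specify.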

On the Reliability of Large Language Models for Causal Discovery

arXiv:2407.19638v2 Announce Type: replace Abstract: This study investigates the efficacy of Large Language Models (LLMs) in causal discovery. Using newly available open-source LLMs, OLMo and BLOOM, which provide access to their pre-training corpora, we investigate how LLMs address causal discovery through three research questions. We examine: (i) the impact of memorization for accurate causal relation prediction, (ii) the influence of incorrect causal relations in pre-training data, and (iii) the contextual nuances that influence LLMs’ understanding of causal relations. Our findings indicate that while LLMs are effective in recognizing causal relations that occur frequently in pre-training data, their ability to generalize to new or rare causal relations is limited. Moreover, the presence of incorrect causal relations significantly undermines the confidence of LLMs in corresponding correct causal relations, and the contextual information critically affects the outcomes of LLMs to discern causal connections between random variables.

SwiReasoning: Entropy-Driven Alternation of Latent and Explicit Chain-of-Thought for Reasoning LLMs

SwiReasoning is a decoding-time framework that lets a reasoning LLM decide when to think in latent space and when to write explicit chain-of-thought, using block-wise confidence estimated from entropy trends in next-token distributions. The method is training-free, model-agnostic, and targets Pareto-superior accuracy/efficiency trade-offs on mathematics and STEM benchmarks. Reported results show +1.5%–2.8% average accuracy improvements with unlimited tokens and +56%–79% average token-efficiency gains under constrained budgets; on AIME’24/’25, it reaches maximum reasoning accuracy earlier than standard CoT.

What does SwiReasoning change at inference time?

The controller monitors the decoder’s next-token entropy to form a block-wise confidence signal. When confidence is low (entropy trending upward), it enters latent reasoning: the model continues to reason without emitting tokens. When confidence recovers (entropy trending down), it switches back to explicit reasoning, emitting CoT tokens to consolidate and commit to a single path. A switch count control limits the maximum number of thinking-block transitions to suppress overthinking before finalizing the answer. This dynamic alternation is the core mechanism behind the reported accuracy-per-token gains.

Results: accuracy and efficiency on standard suites

The paper reports improvements across mathematics and STEM reasoning tasks:

Pass@1 (unlimited budget): accuracy lifts of up to +2.8% (math) and +2.0% (STEM) in Figure 1 and Table 1, with a +2.17% average over baselines (CoT with sampling, CoT greedy, and Soft Thinking).

Token efficiency (limited budgets): average improvements of up to +79% (Figure 2). A comprehensive comparison shows SwiReasoning attains the highest token efficiency in 13/15 evaluations, with an +84% average improvement over CoT across those settings (Figure 4).

Pass@k dynamics: with Qwen3-8B on AIME 2024/2025, maximum reasoning accuracies are achieved +50% earlier than CoT on average (Figure 5), indicating faster convergence to the ceiling with fewer sampled trajectories.

Why does switching help?

Explicit CoT is discrete and readable but locks in a single path prematurely, which can discard useful alternatives. Latent reasoning is continuous and information-dense per step, but purely latent strategies may diffuse probability mass and impede convergence. SwiReasoning adds a confidence-guided alternation: latent phases broaden exploration when the model is uncertain; explicit phases exploit rising confidence to solidify a solution and commit tokens only when beneficial. The switch count control regularizes the process by capping oscillations and limiting prolonged “silent” wandering. This addresses both the accuracy loss from diffusion and the token waste from overthinking that are cited as challenges for training-free latent methods.

Positioning vs. baselines

The project compares against CoT with sampling, CoT greedy, and Soft Thinking, reporting a +2.17% average accuracy lift at unlimited budgets (Table 1) and consistent efficiency-per-token advantages under budget constraints. The visualized Pareto frontier shifts outward (either higher accuracy at the same budget or similar accuracy with fewer tokens) across different model families and scales. On AIME’24/’25, the Pass@k curves show that SwiReasoning reaches the performance ceiling with fewer samples than CoT, reflecting improved convergence behavior rather than only better raw ceilings.

Key Takeaways

Training-free controller: SwiReasoning alternates between latent reasoning and explicit chain-of-thought using block-wise confidence from next-token entropy trends.

Efficiency gains: Reports +56–79% average token-efficiency improvements under constrained budgets versus CoT, with larger gains as budgets tighten.

Accuracy lifts: Achieves +1.5–2.8% average Pass@1 improvements on mathematics/STEM benchmarks at unlimited budgets.

Faster convergence: On AIME 2024/2025, reaches maximum reasoning accuracy earlier than CoT (improved Pass@k dynamics).

Editorial Comments

SwiReasoning is a useful step toward pragmatic “reasoning policy” control at decode time: it’s training-free, slots behind the tokenizer, and exposes measurable gains on math/STEM suites by toggling between latent and explicit CoT using an entropy-trend confidence signal with a capped switch count. The open-source BSD implementation and clear flags (--max_switch_count, --alpha) make replication straightforward and lower the barrier to stacking with orthogonal efficiency layers (e.g., quantization, speculative decoding, KV-cache tricks). The method’s value proposition is “accuracy per token” rather than raw SOTA accuracy, which is operationally important for budgeted inference and batching.

Paper: https://arxiv.org/pdf/2510.05069

The post SwiReasoning: Entropy-Driven Alternation of Latent and Explicit Chain-of-Thought for Reasoning LLMs appeared first on MarkTechPost.
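
The switching behavior described in this post (go latent when entropy trends up, return to explicit CoT when it trends down, and stop switching once a cap is reached) can be illustrated with a small sketch. This is not the released implementation: the rule of comparing each step's entropy against the entropy at the start of the current block is a simplifying assumption, and the names `token_entropy` and `plan_modes` are hypothetical. Only the idea of a maximum switch count comes from the post.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def plan_modes(entropies, max_switch_count=4, start_mode="explicit"):
    """Label each decoding step as 'explicit' or 'latent'.

    Assumed rule: the entropy at the start of the current block is the
    reference. Rising above it means low confidence (switch to latent);
    falling below it means recovered confidence (switch to explicit).
    Once `max_switch_count` switches have happened, the mode is frozen.
    """
    modes = []
    mode = start_mode
    switches = 0
    reference = entropies[0] if entropies else 0.0
    for h in entropies:
        if switches < max_switch_count:
            if mode == "explicit" and h > reference:
                mode, switches, reference = "latent", switches + 1, h
            elif mode == "latent" and h < reference:
                mode, switches, reference = "explicit", switches + 1, h
        modes.append(mode)
    return modes


if __name__ == "__main__":
    trace = [1.0, 1.2, 1.5, 0.9, 0.8, 1.1]
    print(plan_modes(trace, max_switch_count=2))
    print(plan_modes(trace, max_switch_count=3))
```

With the cap set to 2, the final entropy rise in the example trace no longer triggers a return to latent mode, which is the "suppress overthinking" effect the post attributes to the switch count control.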

Search-on-Graph: Iterative Informed Navigation for Large Language Model Reasoning on Knowledge Graphs

arXiv:2510.08825v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated impressive reasoning abilities yet remain unreliable on knowledge-intensive, multi-hop questions: they miss long-tail facts, hallucinate when uncertain, and their internal knowledge lags behind real-world change. Knowledge graphs (KGs) offer a structured source of relational evidence, but existing KGQA methods face fundamental trade-offs: compiling complete SPARQL queries without knowing available relations proves brittle, retrieving large subgraphs introduces noise, and complex agent frameworks with parallel exploration exponentially expand search spaces. To address these limitations, we propose Search-on-Graph (SoG), a simple yet effective framework that enables LLMs to perform iterative informed graph navigation using a single, carefully designed Search function. Rather than pre-planning paths or retrieving large subgraphs, SoG follows an “observe-then-navigate” principle: at each step, the LLM examines actual available relations from the current entity before deciding on the next hop. This approach further adapts seamlessly to different KG schemas and handles high-degree nodes through adaptive filtering. Across six KGQA benchmarks spanning Freebase and Wikidata, SoG achieves state-of-the-art performance without fine-tuning. We demonstrate particularly strong gains on Wikidata benchmarks (+16% improvement over previous best methods) alongside consistent improvements on Freebase benchmarks.
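
The “observe-then-navigate” loop can be sketched on a toy graph. Everything below is illustrative: the abstract does not give the signature of the Search function, so `search`, `navigate`, the `max_relations` cut-off (a crude stand-in for adaptive filtering), and the `choose` callback that plays the role of the LLM are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Toy knowledge graph: entity -> relation -> list of tail entities.
Graph = Dict[str, Dict[str, List[str]]]
Chooser = Callable[[str, Dict[str, List[str]]], Optional[str]]


def search(graph: Graph, entity: str, max_relations: int = 50) -> Dict[str, List[str]]:
    """Observe: return the relations actually available from `entity`."""
    relations = graph.get(entity, {})
    kept = sorted(relations)[:max_relations]  # stand-in for adaptive filtering
    return {name: relations[name] for name in kept}


def navigate(graph: Graph, start: str, choose: Chooser,
             max_hops: int = 3) -> Tuple[str, List[Tuple[str, str]]]:
    """Navigate: at each hop, look first, then let `choose` pick a relation."""
    current = start
    path: List[Tuple[str, str]] = []
    for _ in range(max_hops):
        observed = search(graph, current)
        if not observed:
            break
        relation = choose(current, observed)
        if relation is None or not observed.get(relation):
            break
        current = observed[relation][0]  # follow the first tail for simplicity
        path.append((relation, current))
    return current, path


if __name__ == "__main__":
    toy = {
        "Paris": {"capital_of": ["France"]},
        "France": {"currency": ["Euro"], "continent": ["Europe"]},
    }
    wanted = ["capital_of", "currency"]

    def pick(entity, observed):
        # A real system would prompt an LLM with the observed relations here.
        return next((r for r in wanted if r in observed), None)

    print(navigate(toy, "Paris", pick))
```

The point of the sketch is the ordering: the chooser only ever selects among relations that `search` has just returned, so it cannot plan a hop over a relation that does not exist at the current entity.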

Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language

arXiv:2510.09032v1 Announce Type: new Abstract: As an Indo-Aryan language with limited available data, Chakma remains largely underrepresented in language models. In this work, we introduce a novel corpus of contextually coherent Bangla-transliterated Chakma, curated from Chakma literature, and validated by native speakers. Using this dataset, we fine-tune six encoder-based multilingual and regional transformer models (mBERT, XLM-RoBERTa, DistilBERT, DeBERTaV3, BanglaBERT, and IndicBERT) on masked language modeling (MLM) tasks. Our experiments show that fine-tuned multilingual models outperform their pre-trained counterparts when adapted to Bangla-transliterated Chakma, achieving up to 73.54% token accuracy and a perplexity as low as 2.90. Our analysis further highlights the impact of data quality on model performance and shows the limitations of OCR pipelines for morphologically rich Indic scripts. Our research demonstrates that Bangla-transliterated Chakma can be very effective for transfer learning for the Chakma language, and we release our manually validated monolingual dataset to encourage further research on multilingual language modeling for low-resource languages.
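
For readers unfamiliar with the two reported metrics, the sketch below shows how they are conventionally computed for masked language modeling: token accuracy is the share of masked positions predicted correctly, and perplexity is the exponential of the mean negative log-likelihood over those positions. The function names and the toy inputs are assumptions for illustration; the abstract does not describe the authors' exact evaluation code.

```python
import math


def token_accuracy(predicted, gold):
    """Fraction of masked positions where the predicted token equals the gold token."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


def perplexity(neg_log_likelihoods):
    """exp(mean negative log-likelihood), with likelihoods in natural log."""
    if not neg_log_likelihoods:
        raise ValueError("need at least one masked position")
    mean_nll = sum(neg_log_likelihoods) / len(neg_log_likelihoods)
    return math.exp(mean_nll)


if __name__ == "__main__":
    print(token_accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"]))
    # A mean loss of ln(2.90) per masked token corresponds to perplexity 2.90.
    print(perplexity([math.log(2.90)] * 4))
```

Read this way, a perplexity of 2.90 means the model is, on average, about as uncertain as if it were choosing uniformly among roughly three candidate tokens at each masked position.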

Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective

arXiv:2510.08800v1 Announce Type: new Abstract: While Large Language Models (LLMs) have demonstrated advanced reasoning capabilities, their comprehensive evaluation in general Chinese-language contexts remains understudied. To bridge this gap, we propose Chinese Commonsense Multi-hop Reasoning (CCMOR), a novel benchmark designed to evaluate LLMs’ ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning. Specifically, we first construct a domain-balanced seed set from existing QA datasets, then develop an LLM-powered pipeline to generate multi-hop questions anchored on factual unit chains. To ensure the quality of the resulting dataset, we implement a human-in-the-loop verification system, where domain experts systematically validate and refine the generated questions. Using CCMOR, we evaluate state-of-the-art LLMs, demonstrating persistent limitations in LLMs’ ability to process long-tail knowledge and execute knowledge-intensive reasoning. Notably, retrieval-augmented generation substantially mitigates these knowledge gaps, yielding significant performance gains.

NLP-ADBench: NLP Anomaly Detection Benchmark

arXiv:2412.04784v2 Announce Type: replace Abstract: Anomaly detection (AD) is an important machine learning task with applications in fraud detection, content moderation, and user behavior analysis. However, AD is relatively understudied in a natural language processing (NLP) context, limiting its effectiveness in detecting harmful content, phishing attempts, and spam reviews. We introduce NLP-ADBench, the most comprehensive NLP anomaly detection (NLP-AD) benchmark to date, which includes eight curated datasets and 19 state-of-the-art algorithms. These span 3 end-to-end methods and 16 two-step approaches that adapt classical, non-AD methods to language embeddings from BERT and OpenAI. Our empirical results show that no single model dominates across all datasets, indicating a need for automated model selection. Moreover, two-step methods with transformer-based embeddings consistently outperform specialized end-to-end approaches, with OpenAI embeddings outperforming those of BERT. We release NLP-ADBench at https://github.com/USC-FORTIS/NLP-ADBench, providing a unified framework for NLP-AD and supporting future investigations.
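
The “two-step” recipe mentioned in the abstract (first embed the text, then run a classical detector on the embeddings) can be sketched with a k-nearest-neighbor distance score. This is an illustration only: the embeddings are assumed to be precomputed by some encoder, the kNN detector is just one example of a classical method and is not claimed to be one of the benchmark's 16 approaches, and `knn_anomaly_scores` is a hypothetical name.

```python
import numpy as np


def knn_anomaly_scores(train_embeddings, test_embeddings, k=3):
    """Score each test embedding by its mean distance to its k nearest
    training embeddings; larger scores suggest more anomalous inputs.

    Step 1 (not shown): turn each text into a vector with a language
    model encoder. Step 2 (this function): apply a classical detector.
    """
    train = np.asarray(train_embeddings, dtype=float)
    test = np.asarray(test_embeddings, dtype=float)
    if k < 1 or k > len(train):
        raise ValueError("k must be between 1 and the number of training points")

    # Pairwise Euclidean distances, shape (n_test, n_train).
    diffs = test[:, None, :] - train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)


if __name__ == "__main__":
    normal = [[0, 0], [0, 1], [1, 0], [1, 1]]  # stand-in for "normal" text embeddings
    queries = [[0.5, 0.5], [10, 10]]           # second query is far from the cluster
    print(knn_anomaly_scores(normal, queries, k=3))
```

Because the detector only sees vectors, swapping one embedding model for another requires no change to the second step, which is what makes the comparison between embedding sources in the abstract possible.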

SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

arXiv:2510.09541v1 Announce Type: new Abstract: Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% in GSM8K, 2.6% in MATH500, 18.4% in Countdown and 27.0% in Sudoku.

Is vibe coding ruining a generation of engineers?

AI tools are revolutionizing software development by automating repetitive tasks, refactoring bloated code, and identifying bugs in real-time. Developers can now generate well-structured code from plain language prompts, saving hours of manual effort. These tools learn from vast codebases, offering context-aware recommendations that enhance productivity and reduce errors. Rather than starting from scratch, engineers can prototype quickly, iterate faster and focus on solving increasingly complex problems.

As code generation tools grow in popularity, they raise questions about the future size and structure of engineering teams. Earlier this year, Garry Tan, CEO of startup accelerator Y Combinator, noted that about one-quarter of its current clients use AI to write 95% or more of their software. In an interview with CNBC, Tan said: “What that means for founders is that you don’t need a team of 50 or 100 engineers, you don’t have to raise as much. The capital goes much longer.”

AI-powered coding may offer a fast solution for businesses under budget pressure, but its long-term effects on the field and labor pool cannot be ignored.

As AI-powered coding rises, human expertise may diminish

In the era of AI, the traditional journey to coding expertise that has long supported senior developers may be at risk. Easy access to large language models (LLMs) enables junior coders to quickly identify issues in code. While this speeds up software development, it can distance developers from their own work, delaying the growth of core problem-solving skills. As a result, they may avoid the focused, sometimes uncomfortable hours required to build expertise and progress on the path to becoming successful senior developers.

Consider Anthropic’s Claude Code, a terminal-based assistant built on the Claude 3.7 Sonnet model, which automates bug detection and resolution, test creation and code refactoring. Using natural language commands, it reduces repetitive manual work and boosts productivity.

Microsoft has also released two open-source frameworks, AutoGen and Semantic Kernel, to support the development of agentic AI systems. AutoGen enables asynchronous messaging, modular components, and distributed agent collaboration to build complex workflows with minimal human input. Semantic Kernel is an SDK that integrates LLMs with languages like C#, Python and Java, letting developers build AI agents to automate tasks and manage enterprise applications.

The increasing availability of these tools from Anthropic, Microsoft and others may reduce opportunities for coders to refine and deepen their skills. Rather than “banging their heads against the wall” to debug a few lines or select a library to unlock new features, junior developers may simply turn to AI for an assist. This means senior coders with problem-solving skills honed over decades may become an endangered species.

Overreliance on AI for writing code risks weakening developers’ hands-on experience and understanding of key programming concepts. Without regular practice, they may struggle to independently debug, optimize or design systems. Ultimately, this erosion of skill can undermine critical thinking, creativity and adaptability, qualities that are essential not just for coding, but for assessing the quality and logic of AI-generated solutions.

AI as mentor: Turning code automation into hands-on learning

While concerns about AI diminishing human developer skills are valid, businesses shouldn’t dismiss AI-supported coding. They just need to think carefully about when and how to deploy AI tools in development. These tools can be more than productivity boosters; they can act as interactive mentors, guiding coders in real time with explanations, alternatives and best practices.

When used as a training tool, AI can reinforce learning by showing coders why code is broken and how to fix it, rather than simply applying a solution. For example, a junior developer using Claude Code might receive immediate feedback on inefficient syntax or logic errors, along with suggestions linked to detailed explanations. This enables active learning, not passive correction. It’s a win-win: Accelerating project timelines without doing all the work for junior coders.

Additionally, coding frameworks can support experimentation by letting developers prototype agent workflows or integrate LLMs without needing expert-level knowledge upfront. By observing how AI builds and refines code, junior developers who actively engage with these tools can internalize patterns, architectural decisions and debugging strategies, mirroring the traditional learning process of trial and error, code reviews and mentorship.

However, AI coding assistants shouldn’t replace real mentorship or pair programming. Pull requests and formal code reviews remain essential for guiding newer, less experienced team members. We are nowhere near the point at which AI can single-handedly upskill a junior developer.

Companies and educators can build structured development programs around these tools that emphasize code comprehension to ensure AI is used as a training partner rather than a crutch. This encourages coders to question AI outputs and requires manual refactoring exercises. In this way, AI becomes less of a replacement for human ingenuity and more of a catalyst for accelerated, experiential learning.

Bridging the gap between automation and education

When utilized with intention, AI doesn’t just write code; it teaches coding, blending automation with education to prepare developers for a future where deep understanding and adaptability remain indispensable.

By embracing AI as a mentor, as a programming partner and as a team of developers we can direct to the problem at hand, we can bridge the gap between effective automation and education. We can empower developers to grow alongside the tools they use. We can ensure that, as AI evolves, so too does the human skill set, fostering a generation of coders who are both efficient and deeply knowledgeable.

Richard Sonnenblick is chief data scientist at Planview.
