
ALIGNS: Unlocking nomological networks in psychological measurement through a large language model

arXiv:2509.09723v1 Announce Type: new Abstract: Psychological measurement is critical to many disciplines. Despite advances in measurement, building nomological networks (theoretical maps of how concepts and measures relate, used to establish validity) remains a challenge 70 years after Cronbach and Meehl proposed them as fundamental to validation. This limitation has practical consequences: clinical trials may fail to detect treatment effects, and public policy may target the wrong outcomes. We introduce Analysis of Latent Indicators to Generate Nomological Structures (ALIGNS), a large language model-based system trained with validated questionnaire measures. ALIGNS provides three comprehensive nomological networks containing over 550,000 indicators across psychology, medicine, social policy, and other fields. This represents the first application of large language models to solve a foundational problem in measurement validation. We report the classification accuracy tests used to develop the model, as well as three evaluations. In the first evaluation, the widely used NIH PROMIS anxiety and depression instruments are shown to converge into a single dimension of emotional distress. The second evaluation examines child temperament measures, identifies four potential dimensions not captured by current frameworks, and questions one existing dimension. The third evaluation, an applicability check, engages expert psychometricians who assess the system's importance, accessibility, and suitability. ALIGNS is freely available at nomologicalnetwork.org, complementing traditional validation methods with large-scale nomological analysis.


Faster and Better LLMs via Latency-Aware Test-Time Scaling

arXiv:2505.19634v4 Announce Type: replace Abstract: Test-Time Scaling (TTS) has proven effective in improving the performance of Large Language Models (LLMs) during inference. However, existing research has overlooked the efficiency of TTS from a latency-sensitive perspective. Through a latency-aware evaluation of representative TTS methods, we demonstrate that a compute-optimal TTS configuration does not always yield the lowest latency in scenarios where latency is critical. To address this gap and achieve latency-optimal TTS, we propose two key approaches that optimize concurrency configurations: (1) branch-wise parallelism, which leverages multiple concurrent inference branches, and (2) sequence-wise parallelism, enabled by speculative decoding. By integrating these two approaches and allocating computational resources properly to each, our latency-optimal TTS enables a 32B model to reach 82.3% accuracy on MATH-500 within 1 minute and a smaller 3B model to achieve 72.4% within 10 seconds. Our work emphasizes the importance of latency-aware TTS and demonstrates its ability to deliver both speed and accuracy in latency-sensitive scenarios.
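The branch-wise parallelism idea in this abstract can be sketched in a few lines: instead of generating candidate answers one after another, the client issues several inference branches concurrently and aggregates them, here by majority vote. This is only an illustrative sketch under stated assumptions, not the authors' implementation; the blocking generate(prompt) call is a hypothetical stand-in for a real inference client.

# Illustrative sketch of branch-wise parallel test-time scaling (not the paper's code).
# generate(prompt) is a hypothetical blocking call to an LLM inference backend.
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def generate(prompt: str) -> str:
    """Hypothetical call to an inference server; replace with a real client."""
    raise NotImplementedError

def branch_parallel_answer(prompt: str, n_branches: int = 8) -> str:
    # Launch all branches concurrently so wall-clock latency is roughly one
    # generation, not n_branches sequential generations.
    with ThreadPoolExecutor(max_workers=n_branches) as pool:
        completions = list(pool.map(generate, [prompt] * n_branches))
    # Aggregate by majority vote over the final answer lines (self-consistency style).
    answers = [c.strip().splitlines()[-1] for c in completions]
    return Counter(answers).most_common(1)[0][0]

Because the branches run concurrently, adding more of them improves accuracy at roughly constant latency, which is the latency-optimal trade-off the abstract highlights.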


Meta AI Released MobileLLM-R1: An Edge Reasoning Model with Fewer than 1B Parameters that Achieves a 2x–5x Performance Boost Over Other Fully Open-Source AI Models

Meta has released MobileLLM-R1, a family of lightweight edge reasoning models now available on Hugging Face. The release includes models ranging from 140M to 950M parameters, with a focus on efficient mathematical, coding, and scientific reasoning at sub-billion scale. Unlike general-purpose chat models, MobileLLM-R1 is designed for edge deployment, aiming to deliver state-of-the-art reasoning accuracy while remaining computationally efficient.

What architecture powers MobileLLM-R1?

The largest model, MobileLLM-R1-950M, integrates several architectural optimizations:
- 22 Transformer layers with 24 attention heads and 6 grouped KV heads.
- Embedding dimension: 1536; hidden dimension: 6144.
- Grouped-Query Attention (GQA) reduces compute and memory (see the shape sketch following this article).
- Block-wise weight sharing cuts parameter count without heavy latency penalties.
- SwiGLU activations improve small-model representation.
- Context length: 4K for base models, 32K for post-trained models.
- 128K vocabulary with shared input/output embeddings.

The emphasis is on reducing compute and memory requirements, making the model suitable for deployment on constrained devices.

How efficient is the training?

MobileLLM-R1 is notable for data efficiency:
- Trained on ~4.2T tokens in total. By comparison, Qwen3's 0.6B model was trained on 36T tokens, so MobileLLM-R1 uses only ≈11.7% of the data to reach or surpass Qwen3's accuracy.
- Post-training applies supervised fine-tuning on math, coding, and reasoning datasets.

This efficiency translates directly into lower training costs and resource demands.

How does it perform against other open models?

On benchmarks, MobileLLM-R1-950M shows significant gains:
- MATH (MATH500 dataset): ~5x higher accuracy than OLMo-1.24B and ~2x higher accuracy than SmolLM2-1.7B.
- Reasoning and coding (GSM8K, AIME, LiveCodeBench): matches or surpasses Qwen3-0.6B despite using far fewer training tokens.

The model delivers results typically associated with larger architectures while maintaining a smaller footprint.

Where does MobileLLM-R1 fall short?

The model's narrow focus creates limitations:
- Strong in math, code, and structured reasoning; weaker in general conversation, commonsense, and creative tasks compared to larger LLMs.
- Distributed under a FAIR NC (non-commercial) license, which restricts usage in production settings.
- Longer contexts (32K) raise KV-cache and memory demands at inference.

How does MobileLLM-R1 compare to Qwen3, SmolLM2, and OLMo?

Performance snapshot (post-trained models):

Model                   Params   Train tokens (T)   MATH500   GSM8K   AIME'24   AIME'25   LiveCodeBench
MobileLLM-R1-950M       0.949B   4.2                74.0      67.5    15.5      16.3      19.9
Qwen3-0.6B              0.596B   36.0               73.0      79.2    11.3      17.0      14.9
SmolLM2-1.7B-Instruct   1.71B    ~11.0              19.2      41.8    0.3       0.1       4.4
OLMo-2-1B-Instruct      1.48B    ~3.95              19.2      69.7    0.6       0.1       0.0

Key observations:
- R1-950M matches Qwen3-0.6B in math (74.0 vs 73.0) while requiring ~8.6x fewer training tokens.
- Performance gaps vs SmolLM2 and OLMo are substantial across reasoning tasks.
- Qwen3 maintains an edge in GSM8K, but the difference is small compared to the training-efficiency advantage.

Summary

Meta's MobileLLM-R1 underscores a trend toward smaller, domain-optimized models that deliver competitive reasoning without massive training budgets.
By achieving 2x–5x performance gains over larger open models while training on a fraction of the data, it demonstrates that efficiency, not just scale, will define the next phase of LLM deployment, especially for math, coding, and scientific use cases on edge devices.
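The grouped-query attention layout described in the architecture section above (24 query heads sharing 6 KV heads at an embedding width of 1536) can be illustrated with a small shape check. This is a toy sketch built only from the published configuration numbers, not Meta's implementation.

# Toy shape check for grouped-query attention using MobileLLM-R1-950M's published
# dimensions (24 query heads, 6 KV heads, embedding width 1536). Illustrative only.
import torch

batch, seq_len, d_model = 1, 128, 1536
n_q_heads, n_kv_heads = 24, 6
head_dim = d_model // n_q_heads            # 64
group = n_q_heads // n_kv_heads            # 4 query heads share each KV head

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand K/V so each group of query heads attends over the same KV head.
k = k.repeat_interleave(group, dim=1)      # (1, 24, 128, 64)
v = v.repeat_interleave(group, dim=1)

attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1)
out = attn @ v                             # (1, 24, 128, 64)
print(out.shape)  # the KV cache stores 6 heads instead of 24, cutting memory roughly 4x

The point of the sketch is the memory arithmetic: only the 6 KV heads need to be cached during decoding, which is what makes GQA attractive on constrained edge devices.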


DiTTO-LLM: Framework for Discovering Topic-based Technology Opportunities via Large Language Model

arXiv:2509.09724v1 Announce Type: new Abstract: Technology opportunities are critical information that serves as a foundation for advances in technology, industry, and innovation. This paper proposes a framework based on the temporal relationships between technologies to identify emerging technology opportunities. The proposed framework begins by extracting text from a patent dataset, then maps text-based topics to discover inter-technology relationships. Technology opportunities are then identified by tracking changes in these topics over time. To enhance efficiency, the framework leverages a large language model to extract topics and uses a prompted chat-based language model to support the discovery of technology opportunities. The framework was evaluated on an artificial intelligence patent dataset provided by the United States Patent and Trademark Office. The experimental results suggest that artificial intelligence technology is evolving toward forms that facilitate everyday accessibility. This approach demonstrates the potential of the proposed framework to identify future technology opportunities.
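A minimal sketch of the temporal topic-tracking idea described above: extract a topic label from each patent with an LLM, then compare topic frequencies across time windows to surface fast-growing topics. The call_llm helper, the prompt, and the growth threshold are all hypothetical stand-ins, not the paper's actual pipeline.

# Illustrative sketch of topic-based technology-opportunity discovery (not the paper's code).
# call_llm is a hypothetical helper returning a chat model's text response.
from collections import Counter

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in any chat-based LLM client here

def extract_topic(patent_text: str) -> str:
    return call_llm(f"Name the single main technology topic of this patent:\n{patent_text}")

def emerging_topics(patents_by_year: dict[int, list[str]], growth: float = 2.0) -> list[str]:
    # Count LLM-extracted topics per year, then flag topics whose count grows sharply.
    counts = {y: Counter(extract_topic(p) for p in texts) for y, texts in patents_by_year.items()}
    first, last = min(counts), max(counts)
    return [t for t, c in counts[last].items() if c >= growth * counts[first].get(t, 1)]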


Unsupervised Hallucination Detection by Inspecting Reasoning Processes

arXiv:2509.10004v1 Announce Type: new Abstract: Unsupervised hallucination detection aims to identify hallucinated content generated by large language models (LLMs) without relying on labeled data. While unsupervised methods have gained popularity by eliminating labor-intensive human annotations, they frequently rely on proxy signals unrelated to factual correctness. This misalignment biases detection probes toward superficial or non-truth-related aspects, limiting generalizability across datasets and scenarios. To overcome these limitations, we propose IRIS, an unsupervised hallucination detection framework that leverages internal representations intrinsic to factual correctness. IRIS prompts the LLM to carefully verify the truthfulness of a given statement and uses the resulting contextualized embedding as an informative feature for training. Meanwhile, the uncertainty of each response serves as a soft pseudo-label for truthfulness. Experimental results demonstrate that IRIS consistently outperforms existing unsupervised methods. Our approach is fully unsupervised, computationally inexpensive, and works well even with little training data, making it suitable for real-time detection.
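The recipe in this abstract (prompt the model to verify a statement, take a hidden-state embedding as the feature, and use the model's own uncertainty as a soft pseudo-label for a probe) can be sketched roughly as follows. The sketch assumes a generic Hugging Face causal LM and uses the relative probability of a "True" continuation as the uncertainty signal; the actual IRIS prompt, layer choice, and probe may differ.

# Rough sketch of an IRIS-style pipeline (not the authors' code).
# Assumes any Hugging Face causal LM; prompt, layer, and probe are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()

def verify_features(statement: str):
    prompt = f"Statement: {statement}\nIs this statement true? Answer True or False. Answer:"
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = lm(**ids, output_hidden_states=True)
    # Contextualized embedding of the final token from the last layer -> feature vector.
    feat = out.hidden_states[-1][0, -1]
    # Soft pseudo-label: relative probability of " True" vs " False" as the next token.
    logits = out.logits[0, -1]
    t = tok(" True", add_special_tokens=False).input_ids[0]
    f = tok(" False", add_special_tokens=False).input_ids[0]
    p_true = torch.softmax(logits[[t, f]], dim=-1)[0].item()
    return feat.numpy(), p_true

# Train an unsupervised probe on unlabeled statements using the soft pseudo-labels.
# (With only a handful of statements the pseudo-labels may collapse to one class;
#  in practice the probe is fit on a larger unlabeled pool.)
statements = ["Paris is the capital of France.", "The sun orbits the Earth.",
              "Water boils at 100 degrees Celsius at sea level.", "Spiders are insects."]
feats, soft = zip(*(verify_features(s) for s in statements))
probe = LogisticRegression().fit(list(feats), [int(p > 0.5) for p in soft])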


Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes

arXiv:2507.13335v2 Announce Type: replace Abstract: Humour, as a complex language form, is derived from myriad aspects of life. Whilst existing work on computational humour has focussed almost exclusively on short pun-based jokes, we investigate whether the ability of Large Language Models (LLMs) to explain humour depends on the particular form. We compare models’ joke explanation abilities from simple puns to complex topical humour that requires esoteric knowledge of real-world entities and events. To this end, we curate a dataset of 600 jokes across 4 joke types and manually write high-quality explanations. These jokes include heterographic and homographic puns, contemporary internet humour, and topical jokes. Using this dataset, we compare the zero-shot abilities of a range of LLMs to accurately and comprehensively explain jokes of different types, identifying key research gaps in the task of humour explanation. We find that none of the tested models (including reasoning models) are capable of reliably generating adequate explanations of all joke types, further highlighting the narrow focus of most existing works on overly simple joke forms.


LLM-Based Instance-Driven Heuristic Bias In the Context of a Biased Random Key Genetic Algorithm

arXiv:2509.09707v1 Announce Type: cross Abstract: Integrating Large Language Models (LLMs) within metaheuristics opens a novel path for solving complex combinatorial optimization problems. While most existing approaches leverage LLMs for code generation to create or refine specific heuristics, they often overlook the structural properties of individual problem instances. In this work, we introduce a novel framework that integrates LLMs with a Biased Random-Key Genetic Algorithm (BRKGA) to solve the NP-hard Longest Run Subsequence problem. Our approach extends the instance-driven heuristic bias paradigm by introducing a human-LLM collaborative process to co-design and implement a set of computationally efficient metrics. The LLM analyzes these instance-specific metrics to generate a tailored heuristic bias, which steers the BRKGA toward promising areas of the search space. We conduct a comprehensive experimental evaluation, including rigorous statistical tests, convergence and behavioral analyses, and targeted ablation studies, comparing our method against a standard BRKGA baseline across 1,050 generated instances of varying complexity. Results show that our top-performing hybrid, BRKGA+Llama-4-Maverick, achieves statistically significant improvements over the baseline, particularly on the most complex instances. Our findings confirm that leveraging an LLM to produce an a priori, instance-driven heuristic bias is a valuable approach for enhancing metaheuristics in complex optimization domains.
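As a rough illustration of the instance-driven heuristic bias idea above, the sketch below biases the initial random keys of a BRKGA-style population toward per-position values; in the paper that bias is produced by an LLM from instance-specific metrics, whereas here it is just a stand-in vector. The function name and the linear mixing scheme are illustrative assumptions, not the authors' implementation.

# Illustrative sketch: seeding a BRKGA-style initial population with an
# instance-driven heuristic bias (a stand-in for the LLM-produced bias).
import numpy as np

rng = np.random.default_rng(0)

def biased_initial_population(pop_size: int, n_keys: int, bias: np.ndarray, strength: float = 0.5):
    """Mix uniform random keys with a per-position bias vector in [0, 1]."""
    uniform = rng.random((pop_size, n_keys))
    # strength = 0 -> plain BRKGA initialization; strength = 1 -> fully biased keys.
    return (1 - strength) * uniform + strength * bias

# Hypothetical bias for a 10-element Longest Run Subsequence instance,
# e.g. produced by an LLM after reading the instance metrics.
bias = np.linspace(0.1, 0.9, 10)
population = biased_initial_population(pop_size=100, n_keys=10, bias=bias)

The decoder and evolutionary operators of the BRKGA stay unchanged; only the starting keys (and hence the regions of the search space explored first) are steered by the a priori bias.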


Beyond Token Limits: Assessing Language Model Performance on Long Text Classification

arXiv:2509.10199v1 Announce Type: new Abstract: The large language models most widely used in the social sciences (such as BERT and its derivatives, e.g. RoBERTa) have a limit on the length of the input text they can process to produce predictions. This is a particularly pressing issue for classification tasks where the aim is to handle long input texts. One such area deals with laws and draft laws (bills), which can run to several hundred pages and are therefore not particularly amenable to processing with models that can only handle e.g. 512 tokens. In this paper, we show results from experiments covering 5 languages with XLM-RoBERTa, Longformer, GPT-3.5, and GPT-4 models on the multiclass classification task of the Comparative Agendas Project, which has a codebook of 21 policy topic labels from education to health care. Results show no particular advantage for the Longformer model, which is pre-trained specifically to handle long inputs. The comparison between the GPT variants and the best-performing open model yielded an edge for the latter. An analysis of class-level factors points to the importance of class support and of substantive overlap between specific categories when it comes to performance on long text inputs.
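The 512-token ceiling this abstract refers to is easy to see in practice. The sketch below shows the common chunk-and-average workaround for encoder classifiers such as XLM-RoBERTa; it is given only as a generic illustration of the token-limit problem, not as the paper's method, and the 21-label head is randomly initialized here.

# Generic illustration of the 512-token limit and a chunk-then-average workaround
# for encoder classifiers such as XLM-RoBERTa (not the paper's method).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "xlm-roberta-base"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=21).eval()

def classify_long_text(text: str) -> int:
    # Tokenize into overlapping-free 512-token windows; a multi-hundred-page bill
    # yields many chunks rather than a single over-long sequence.
    enc = tok(text, truncation=True, max_length=512, return_overflowing_tokens=True,
              padding=True, return_tensors="pt")
    enc.pop("overflow_to_sample_mapping", None)
    with torch.no_grad():
        logits = model(**enc).logits          # one row of logits per 512-token chunk
    # Average chunk logits into one document-level prediction over the 21 CAP topics.
    return int(logits.mean(dim=0).argmax())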


Google AI Releases VaultGemma: The Largest and Most Capable Open Model (1B Parameters) Trained from Scratch with Differential Privacy

Google AI Research and DeepMind have released VaultGemma 1B, the largest open-weight large language model trained entirely with differential privacy (DP). This development is a major step toward building AI models that are both powerful and privacy-preserving.

Why Do We Need Differential Privacy in LLMs?

Large language models trained on vast web-scale datasets are prone to memorization attacks, where sensitive or personally identifiable information can be extracted from the model. Studies have shown that verbatim training data can resurface, especially in open-weight releases. Differential privacy offers a mathematical guarantee that prevents any single training example from significantly influencing the model. Unlike approaches that apply DP only during fine-tuning, VaultGemma enforces full private pretraining, ensuring that privacy protection begins at the foundational level.

Technical report: https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf

What Is the Architecture of VaultGemma?

VaultGemma is architecturally similar to earlier Gemma models but optimized for private training:
- Model size: 1B parameters, 26 layers.
- Transformer type: decoder-only.
- Activations: GeGLU with a feedforward dimension of 13,824.
- Attention: Multi-Query Attention (MQA) with a global span of 1024 tokens.
- Normalization: RMSNorm in pre-norm configuration.
- Tokenizer: SentencePiece with a 256K vocabulary.

A notable change is the reduction of the sequence length to 1024 tokens, which lowers compute costs and enables larger batch sizes under DP constraints.

What Data Was Used for Training?

VaultGemma was trained on the same 13 trillion-token dataset as Gemma 2, composed primarily of English text from web documents, code, and scientific articles. The dataset underwent several filtering stages to:
- Remove unsafe or sensitive content.
- Reduce personal information exposure.
- Prevent evaluation data contamination.

This ensures both safety and fairness in benchmarking.

How Was Differential Privacy Applied?

VaultGemma used DP-SGD (Differentially Private Stochastic Gradient Descent) with gradient clipping and Gaussian noise addition. The implementation was built on JAX Privacy and introduced optimizations for scalability:
- Vectorized per-example clipping for parallel efficiency.
- Gradient accumulation to simulate large batches.
- Truncated Poisson subsampling integrated into the data loader for efficient on-the-fly sampling.

The model achieved a formal DP guarantee of (ε ≤ 2.0, δ ≤ 1.1e−10) at the sequence level (1024 tokens). A generic DP-SGD sketch follows this article.

How Do Scaling Laws Work for Private Training?

Training large models under DP constraints requires new scaling strategies. The VaultGemma team developed DP-specific scaling laws with three innovations:
- Optimal learning-rate modeling using quadratic fits across training runs.
- Parametric extrapolation of loss values to reduce reliance on intermediate checkpoints.
- Semi-parametric fits to generalize across model size, training steps, and noise-batch ratios.

This methodology enabled precise prediction of achievable loss and efficient resource use on the TPUv6e training cluster.

What Were the Training Configurations?

VaultGemma was trained on 2048 TPUv6e chips using GSPMD partitioning and MegaScale XLA compilation:
- Batch size: ~518K tokens.
- Training iterations: 100,000.
- Noise multiplier: 0.614.

The achieved loss was within 1% of predictions from the DP scaling law, validating the approach.

How Does VaultGemma Perform Compared to Non-Private Models?
On academic benchmarks, VaultGemma trails its non-private counterparts but shows strong utility:
- ARC-C: 26.45 vs. 38.31 (Gemma-3 1B).
- PIQA: 68.0 vs. 70.51 (GPT-2 1.5B).
- TriviaQA (5-shot): 11.24 vs. 39.75 (Gemma-3 1B).

These results suggest that DP-trained models are currently comparable to non-private models from about five years ago. Importantly, memorization tests confirmed that no training data leakage was detectable in VaultGemma, unlike in non-private Gemma models.

Summary

VaultGemma 1B shows that large-scale language models can be trained with rigorous differential privacy guarantees without making them impractical to use. While a utility gap remains compared to non-private counterparts, the release of both the model and its training methodology gives the community a strong foundation for advancing private AI. This work signals a shift toward building models that are not only capable but also inherently safe, transparent, and privacy-preserving.
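The DP-SGD mechanics summarized in the "How Was Differential Privacy Applied?" section (per-example gradient clipping followed by Gaussian noise scaled by the noise multiplier) can be sketched generically as below. This is a didactic sketch in plain PyTorch, not the JAX Privacy implementation used for VaultGemma; the clipping norm and learning rate are assumed placeholders, while the 0.614 noise multiplier is the value reported above.

# Didactic DP-SGD step: per-example clipping + Gaussian noise (not VaultGemma's JAX code).
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, lr=1e-3, clip_norm=1.0, noise_multiplier=0.614):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(batch_x, batch_y):                  # per-example gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in params]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (norm + 1e-12), max=1.0)   # clip each example
        for s, g in zip(summed, grads):
            s.add_(g * scale)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(s) * noise_multiplier * clip_norm
            p.add_(-(lr / len(batch_x)) * (s + noise))  # noisy averaged update

The per-example loop is what VaultGemma's vectorized clipping and gradient accumulation optimize away at scale; the privacy guarantee itself comes from the clip-then-noise structure shown here together with subsampling and accounting.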


BentoML Released llm-optimizer: An Open-Source AI Tool for Benchmarking and Optimizing LLM Inference

BentoML has released llm-optimizer, an open-source framework designed to streamline the benchmarking and performance tuning of self-hosted large language models (LLMs). The tool addresses a common challenge in LLM deployment: finding optimal configurations for latency, throughput, and cost without relying on manual trial and error.

Why is tuning LLM performance difficult?

Tuning LLM inference is a balancing act across many moving parts: batch size, framework choice (vLLM, SGLang, etc.), tensor parallelism, sequence lengths, and how well the hardware is utilized. Each of these factors can shift performance in different ways, which makes finding the right combination for speed, efficiency, and cost far from straightforward. Most teams still rely on repetitive trial-and-error testing, a process that is slow, inconsistent, and often inconclusive. For self-hosted deployments, the cost of getting it wrong is high: poorly tuned configurations can quickly translate into higher latency and wasted GPU resources.

How is llm-optimizer different?

llm-optimizer provides a structured way to explore the LLM performance landscape. It eliminates repetitive guesswork by enabling systematic benchmarking and automated search across possible configurations. Core capabilities include:
- Running standardized tests across inference frameworks such as vLLM and SGLang.
- Applying constraint-driven tuning, e.g., surfacing only configurations where time-to-first-token is below 200 ms (a generic sketch of this idea appears after this article).
- Automating parameter sweeps to identify optimal settings.
- Visualizing tradeoffs with dashboards for latency, throughput, and GPU utilization.

The framework is open source and available on GitHub.

How can developers explore results without running benchmarks locally?

Alongside the optimizer, BentoML released the LLM Performance Explorer, a browser-based interface powered by llm-optimizer. It provides pre-computed benchmark data for popular open-source models and lets users:
- Compare frameworks and configurations side by side.
- Filter by latency, throughput, or resource thresholds.
- Browse tradeoffs interactively without provisioning hardware.

How does llm-optimizer impact LLM deployment practices?

As the use of LLMs grows, getting the most out of deployments comes down to how well inference parameters are tuned. llm-optimizer lowers the complexity of this process, giving smaller teams access to optimization techniques that once required large-scale infrastructure and deep expertise. By providing standardized benchmarks and reproducible results, the framework adds much-needed transparency to the LLM space. It makes comparisons across models and frameworks more consistent, closing a long-standing gap in the community. Ultimately, BentoML's llm-optimizer brings a constraint-driven, benchmark-focused method to self-hosted LLM optimization, replacing ad-hoc trial and error with a systematic and repeatable workflow.
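Constraint-driven tuning as described above boils down to sweeping configurations, discarding those that violate a latency constraint (for example, time-to-first-token under 200 ms), and ranking the rest by throughput. The sketch below shows only this generic idea; the benchmark function, its dummy cost model, and the parameter grid are hypothetical and do not use llm-optimizer's actual API.

# Generic sketch of a constraint-driven configuration sweep (not llm-optimizer's API).
from itertools import product

def benchmark(config: dict) -> dict:
    """Hypothetical stand-in: run the config and measure TTFT and throughput.
    Returns dummy numbers from a made-up cost model so the sketch runs end to end."""
    return {"ttft_ms": 150.0 + 20 * config["max_batch_size"] / config["tensor_parallel"],
            "tokens_per_s": 50.0 * config["tensor_parallel"] * config["max_batch_size"] ** 0.5}

grid = {
    "framework": ["vllm", "sglang"],
    "tensor_parallel": [1, 2, 4],
    "max_batch_size": [8, 32, 128],
}

candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
results = [(cfg, benchmark(cfg)) for cfg in candidates]
# Keep only configurations that satisfy the latency constraint, then rank by throughput.
feasible = [(cfg, m) for cfg, m in results if m["ttft_ms"] < 200]
best = max(feasible, key=lambda item: item[1]["tokens_per_s"])
print(best)

Swapping the dummy cost model for real measurements against a serving backend turns this loop into exactly the kind of systematic, repeatable sweep the article describes.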
