
Base Models Beat Aligned Models at Randomness and Creativity

arXiv:2505.00047v2 Announce Type: replace Abstract: Alignment has quickly become a default ingredient in LLM development, with techniques such as reinforcement learning from human feedback making models act safely, follow instructions, and perform ever-better on complex tasks. While these techniques are certainly useful, we propose that they should not be universally applied and demonstrate a range of tasks on which base language models consistently outperform their popular aligned forms. Particularly, we study tasks that require unpredictable outputs, such as random number generation, mixed strategy games (rock-paper-scissors and hide-and-seek), and creative writing. In each case, aligned models tend towards narrow behaviors that result in distinct disadvantages, for instance, preferring to generate “7” over other uniformly random numbers, becoming almost fully predictable in some game states, or prioritizing pleasant writing over creative originality. Across models tested, better performance on common benchmarks tends to correlate with worse performance on our tasks, suggesting an effective trade-off in the required capabilities.
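As a concrete illustration of the random-number-generation task, the sketch below measures how far sampled outputs drift from a uniform distribution. It is illustrative only: `sample_number` is a hypothetical stand-in for querying a model and is not code from the paper.

```python
import random
from collections import Counter

def sample_number(rng: random.Random) -> int:
    """Hypothetical stand-in for prompting a model for a 'random' digit in [0, 9]."""
    return rng.randint(0, 9)

def uniformity_gap(samples: list[int], k: int = 10) -> float:
    """Total-variation distance between the empirical distribution and the uniform one."""
    counts = Counter(samples)
    n = len(samples)
    return 0.5 * sum(abs(counts.get(i, 0) / n - 1 / k) for i in range(k))

rng = random.Random(0)
draws = [sample_number(rng) for _ in range(1000)]
print(f"TV distance from uniform: {uniformity_gap(draws):.3f}")
# An aligned model that over-produces "7" would show a noticeably larger gap here.
```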

Is In-Context Learning Learning?

arXiv:2509.10414v2 Announce Type: replace Abstract: In-context learning (ICL) allows some autoregressive models to solve tasks via next-token prediction and without needing further training. This has led to claims about these models’ ability to solve (learn) unseen tasks with only a few shots (exemplars) in the prompt. However, deduction does not always imply learning, as ICL does not explicitly encode a given observation. Instead, the models rely on their prior knowledge and the exemplars given, if any. We argue that, mathematically, ICL does constitute learning, but its full characterisation requires empirical work. We then carry out a large-scale analysis of ICL ablating out or accounting for memorisation, pretraining, distributional shifts, and prompting style and phrasing. We find that ICL is an effective learning paradigm, but limited in its ability to learn and generalise to unseen tasks. We note that, in the limit where exemplars become more numerous, accuracy is insensitive to exemplar distribution, model, prompt style, and the input’s linguistic features. Instead, it deduces patterns from regularities in the prompt, which leads to distributional sensitivity, especially in prompting styles such as chain-of-thought. Given the varied accuracies on formally similar tasks, we conclude that autoregression’s ad-hoc encoding is not a robust mechanism, and suggests limited all-purpose generalisability.
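To ground the terminology, the sketch below shows the kind of exemplar-packed prompt whose count, distribution, and phrasing the paper's ablations vary; it is an illustrative construction, not code from the paper.

```python
from typing import Iterable

def build_icl_prompt(exemplars: Iterable[tuple[str, str]], query: str) -> str:
    """Concatenate labelled (input, label) exemplars followed by the unlabelled query."""
    shots = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in exemplars)
    return f"{shots}\nInput: {query}\nLabel:"

# A 2-shot prompt; the model is expected to continue with the label for the final query.
prompt = build_icl_prompt(
    [("the plot was gripping", "positive"), ("flat characters, dull pacing", "negative")],
    "a pleasant surprise from start to finish",
)
print(prompt)
```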

HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation

arXiv:2505.16281v2 Announce Type: replace Abstract: The advancement of Large Language Models (LLMs) enables flexible and interpretable automatic evaluations. In the field of machine translation evaluation, utilizing LLMs with translation error annotations based on Multidimensional Quality Metrics (MQM) yields more human-aligned judgments. However, current LLM-based evaluation methods still face challenges in accurately identifying error spans and assessing their severity. In this paper, we propose HiMATE, a Hierarchical Multi-Agent Framework for Machine Translation Evaluation. We argue that existing approaches inadequately exploit the fine-grained structural and semantic information within the MQM hierarchy. To address this, we develop a hierarchical multi-agent system grounded in the MQM error typology, enabling granular evaluation of subtype errors. Two key strategies are incorporated to further mitigate systemic hallucinations within the framework: the utilization of the model’s self-reflection capability and the facilitation of agent discussion involving asymmetric information. Empirically, HiMATE outperforms competitive baselines across different datasets in conducting human-aligned evaluations. Further analyses underscore its significant advantage in error span detection and severity assessment, achieving an average F1-score improvement of 89% over the best-performing baseline. We make our code and data publicly available at https://github.com/nlp2ct-shijie/HiMATE.

MALLM: Multi-Agent Large Language Models Framework

arXiv:2509.11656v1 Announce Type: cross Abstract: Multi-agent debate (MAD) has demonstrated the ability to augment collective intelligence by scaling test-time compute and leveraging expertise. Current frameworks for multi-agent debate are often designed towards tool use, lack integrated evaluation, or provide limited configurability of agent personas, response generators, discussion paradigms, and decision protocols. We introduce MALLM (Multi-Agent Large Language Models), an open-source framework that enables systematic analysis of MAD components. MALLM offers more than 144 unique configurations of MAD, including (1) agent personas (e.g., Expert, Personality), (2) response generators (e.g., Critical, Reasoning), (3) discussion paradigms (e.g., Memory, Relay), and (4) decision protocols (e.g., Voting, Consensus). MALLM uses simple configuration files to define a debate. Furthermore, MALLM can load any textual Huggingface dataset (e.g., MMLU-Pro, WinoGrande) and provides an evaluation pipeline for easy comparison of MAD configurations. MALLM is tailored towards researchers and provides a window into the heart of multi-agent debate, facilitating the understanding of its components and their interplay.
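As a rough illustration of the configuration-driven setup described above, the sketch below composes a debate configuration as a plain dictionary and writes it to a file. The field names are hypothetical and do not reflect MALLM's actual configuration schema.

```python
import json

# Hypothetical debate configuration; the keys are illustrative, not MALLM's schema.
debate_config = {
    "agent_personas": ["Expert", "Expert", "Personality"],        # e.g. Expert, Personality
    "response_generators": ["Reasoning", "Critical", "Critical"],  # e.g. Critical, Reasoning
    "discussion_paradigm": "Relay",                                # e.g. Memory, Relay
    "decision_protocol": "Voting",                                 # e.g. Voting, Consensus
    "dataset": "MMLU-Pro",                                         # any textual Hugging Face dataset
    "max_turns": 3,
}

with open("debate_config.json", "w") as f:
    json.dump(debate_config, f, indent=2)
print(json.dumps(debate_config, indent=2))
```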

MoonshotAI Released Checkpoint-Engine: A Simple Middleware to Update Model Weights in LLM Inference Engines, Effective for Reinforcement Learning

MoonshotAI has open-sourced checkpoint-engine, a lightweight middleware aimed at solving one of the key bottlenecks in large language model (LLM) deployment: rapidly updating model weights across thousands of GPUs without disrupting inference. The library is particularly designed for reinforcement learning (RL) and reinforcement learning with human feedback (RLHF), where models are updated frequently and downtime directly impacts system throughput.

Project repository: https://github.com/MoonshotAI/checkpoint-engine

How fast can LLMs be updated?

Checkpoint-engine delivers a significant breakthrough by updating a 1-trillion-parameter model across thousands of GPUs in roughly 20 seconds. Traditional distributed inference pipelines can take several minutes to reload models of this size. By reducing the update time by an order of magnitude, checkpoint-engine directly addresses one of the largest inefficiencies in large-scale serving. The system achieves this through:

- Broadcast updates for static clusters.
- Peer-to-peer (P2P) updates for dynamic clusters.
- Overlapped communication and memory copy for reduced latency.

What does the architecture look like?

Checkpoint-engine sits between training engines and inference clusters. Its design includes:

- A Parameter Server that coordinates updates.
- Worker Extensions that integrate with inference frameworks such as vLLM.

The weight update pipeline runs in three stages:

- Host-to-Device (H2D): parameters are copied into GPU memory.
- Broadcast: weights are distributed across workers using CUDA IPC buffers.
- Reload: each inference shard reloads only the subset of weights it needs.

This staged pipeline is optimized for overlap, ensuring GPUs remain active throughout the update process.

How does it perform in practice?

Benchmarking results confirm checkpoint-engine's scalability:

- GLM-4.5-Air (BF16, 8×H800): 3.94 s (broadcast), 8.83 s (P2P).
- Qwen3-235B-Instruct (BF16, 8×H800): 6.75 s (broadcast), 16.47 s (P2P).
- DeepSeek-V3.1 (FP8, 16×H20): 12.22 s (broadcast), 25.77 s (P2P).
- Kimi-K2-Instruct (FP8, 256×H20): ~21.5 s (broadcast), 34.49 s (P2P).

Even at trillion-parameter scale with 256 GPUs, broadcast updates complete in about 20 seconds, validating its design goal.

What are the trade-offs?

Checkpoint-engine introduces notable advantages but also comes with limitations:

- Memory overhead: overlapped pipelines require additional GPU memory; insufficient memory triggers slower fallback paths.
- P2P latency: peer-to-peer updates support elastic clusters but at a performance cost.
- Compatibility: officially tested with vLLM only; broader engine support requires engineering work.
- Quantization: FP8 support exists but remains experimental.

Where does it fit in deployment scenarios?

Checkpoint-engine is most valuable for:

- Reinforcement learning pipelines where frequent weight updates are required.
- Large inference clusters serving 100B–1T+ parameter models.
- Elastic environments with dynamic scaling, where P2P flexibility offsets latency trade-offs.

Summary

Checkpoint-engine represents a focused solution to one of the hardest problems in large-scale LLM deployment: rapid weight synchronization without halting inference. With demonstrated updates at trillion-parameter scale in around 20 seconds, flexible support for both broadcast and P2P modes, and an optimized communication pipeline, it provides a practical path forward for reinforcement learning pipelines and high-performance inference clusters.
While still limited to vLLM and requiring refinements in quantization and dynamic scaling, it establishes an important foundation for efficient, continuous model updates in production AI systems.
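To make the staged update pipeline concrete, here is a conceptual producer-consumer sketch in which host-to-device copies run ahead of broadcast and reload. It is illustrative only and does not use the checkpoint-engine API; the stage functions are placeholders.

```python
import queue
import threading

def h2d_copy(bucket: str) -> str:
    """Stage 1 placeholder: copy a weight bucket from host memory into GPU memory."""
    return f"{bucket}@gpu"

def broadcast_and_reload(bucket: str) -> str:
    """Stages 2-3 placeholder: broadcast via IPC buffers, then each shard reloads its slice."""
    return f"{bucket}->reloaded"

ready: queue.Queue = queue.Queue(maxsize=2)  # bounded so copies stay just ahead of broadcasts

def copier(buckets):
    for b in buckets:
        ready.put(h2d_copy(b))   # H2D copies run ahead of the broadcaster
    ready.put(None)              # sentinel: no more buckets

buckets = [f"bucket_{i}" for i in range(4)]
threading.Thread(target=copier, args=(buckets,), daemon=True).start()

while (item := ready.get()) is not None:
    print(broadcast_and_reload(item))  # overlaps with the next bucket's H2D copy
```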

ALIGNS: Unlocking nomological networks in psychological measurement through a large language model

arXiv:2509.09723v1 Announce Type: new Abstract: Psychological measurement is critical to many disciplines. Despite advances in measurement, building nomological networks, theoretical maps of how concepts and measures relate to establish validity, remains a challenge 70 years after Cronbach and Meehl proposed them as fundamental to validation. This limitation has practical consequences: clinical trials may fail to detect treatment effects, and public policy may target the wrong outcomes. We introduce Analysis of Latent Indicators to Generate Nomological Structures (ALIGNS), a large language model-based system trained with validated questionnaire measures. ALIGNS provides three comprehensive nomological networks containing over 550,000 indicators across psychology, medicine, social policy, and other fields. This represents the first application of large language models to solve a foundational problem in measurement validation. We report classification accuracy tests used to develop the model, as well as three evaluations. In the first evaluation, the widely used NIH PROMIS anxiety and depression instruments are shown to converge into a single dimension of emotional distress. The second evaluation examines child temperament measures and identifies four potential dimensions not captured by current frameworks, and questions one existing dimension. The third evaluation, an applicability check, engages expert psychometricians who assess the system’s importance, accessibility, and suitability. ALIGNS is freely available at nomologicalnetwork.org, complementing traditional validation methods with large-scale nomological analysis.

Faster and Better LLMs via Latency-Aware Test-Time Scaling

arXiv:2505.19634v4 Announce Type: replace Abstract: Test-Time Scaling (TTS) has proven effective in improving the performance of Large Language Models (LLMs) during inference. However, existing research has overlooked the efficiency of TTS from a latency-sensitive perspective. Through a latency-aware evaluation of representative TTS methods, we demonstrate that a compute-optimal TTS does not always result in the lowest latency in scenarios where latency is critical. To address this gap and achieve latency-optimal TTS, we propose two key approaches by optimizing the concurrency configurations: (1) branch-wise parallelism, which leverages multiple concurrent inference branches, and (2) sequence-wise parallelism, enabled by speculative decoding. By integrating these two approaches and allocating computational resources properly to each, our latency-optimal TTS enables a 32B model to reach 82.3% accuracy on MATH-500 within 1 minute and a smaller 3B model to achieve 72.4% within 10 seconds. Our work emphasizes the importance of latency-aware TTS and demonstrates its ability to deliver both speed and accuracy in latency-sensitive scenarios.
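As a rough sketch of the branch-wise parallelism idea (sequence-wise parallelism via speculative decoding is not shown), the snippet below runs several sampling branches concurrently and takes a majority vote over their answers. `run_branch` is a placeholder for a real LLM call, not the paper's implementation.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def run_branch(seed: int) -> str:
    """Placeholder for one sampled reasoning branch returning a final answer."""
    return random.Random(seed).choice(["42", "42", "41"])

n_branches = 8  # the concurrency knob a latency-optimal configuration would tune
with ThreadPoolExecutor(max_workers=n_branches) as pool:
    answers = list(pool.map(run_branch, range(n_branches)))

final, votes = Counter(answers).most_common(1)[0]
print(f"majority answer: {final} ({votes}/{n_branches} branches agree)")
```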

Meta AI Released MobileLLM-R1: An Edge Reasoning Model with Less Than 1B Parameters That Achieves a 2x–5x Performance Boost Over Other Fully Open-Source AI Models

Meta has released MobileLLM-R1, a family of lightweight edge reasoning models now available on Hugging Face. The release includes models ranging from 140M to 950M parameters, with a focus on efficient mathematical, coding, and scientific reasoning at sub-billion scale. Unlike general-purpose chat models, MobileLLM-R1 is designed for edge deployment, aiming to deliver state-of-the-art reasoning accuracy while remaining computationally efficient.

What architecture powers MobileLLM-R1?

The largest model, MobileLLM-R1-950M, integrates several architectural optimizations:

- 22 Transformer layers with 24 attention heads and 6 grouped KV heads.
- Embedding dimension: 1536; hidden dimension: 6144.
- Grouped-Query Attention (GQA) reduces compute and memory.
- Block-wise weight sharing cuts parameter count without heavy latency penalties.
- SwiGLU activations improve small-model representation.
- Context length: 4K for base models, 32K for post-trained models.
- 128K vocabulary with shared input/output embeddings.

The emphasis is on reducing compute and memory requirements, making the model suitable for deployment on constrained devices.

How efficient is the training?

MobileLLM-R1 is notable for data efficiency:

- Trained on ~4.2T tokens in total.
- By comparison, Qwen3’s 0.6B model was trained on 36T tokens, so MobileLLM-R1 uses only ≈11.7% of that data to reach or surpass Qwen3’s accuracy.
- Post-training applies supervised fine-tuning on math, coding, and reasoning datasets.

This efficiency translates directly into lower training costs and resource demands.

How does it perform against other open models?

On benchmarks, MobileLLM-R1-950M shows significant gains:

- MATH (MATH500 dataset): ~5× higher accuracy than Olmo-1.24B and ~2× higher accuracy than SmolLM2-1.7B.
- Reasoning and coding (GSM8K, AIME, LiveCodeBench): matches or surpasses Qwen3-0.6B, despite using far fewer tokens.

The model delivers results typically associated with larger architectures while maintaining a smaller footprint.

Where does MobileLLM-R1 fall short?

The model’s focus creates limitations:

- Strong in math, code, and structured reasoning.
- Weaker in general conversation, commonsense, and creative tasks compared to larger LLMs.
- Distributed under a FAIR NC (non-commercial) license, which restricts usage in production settings.
- Longer contexts (32K) raise KV-cache and memory demands at inference.

How does MobileLLM-R1 compare to Qwen3, SmolLM2, and OLMo?

Performance snapshot (post-trained models):

Model                   Params   Train tokens (T)   MATH500   GSM8K   AIME’24   AIME’25   LiveCodeBench
MobileLLM-R1-950M       0.949B   4.2                74.0      67.5    15.5      16.3      19.9
Qwen3-0.6B              0.596B   36.0               73.0      79.2    11.3      17.0      14.9
SmolLM2-1.7B-Instruct   1.71B    ~11.0              19.2      41.8    0.3       0.1       4.4
OLMo-2-1B-Instruct      1.48B    ~3.95              19.2      69.7    0.6       0.1       0.0

Key observations:

- R1-950M matches Qwen3-0.6B in math (74.0 vs 73.0) while requiring ~8.6× fewer training tokens.
- Performance gaps vs SmolLM2 and OLMo are substantial across reasoning tasks.
- Qwen3 maintains an edge in GSM8K, but the difference is small compared to the training efficiency advantage.

Summary

Meta’s MobileLLM-R1 underscores a trend toward smaller, domain-optimized models that deliver competitive reasoning without massive training budgets.
By achieving 2×–5× performance gains over larger open models while training on a fraction of the data, it demonstrates that efficiency, not just scale, will define the next phase of LLM deployment, especially for math, coding, and scientific use cases on edge devices. The models are available on Hugging Face.
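A minimal sketch of loading one of the released checkpoints with the Hugging Face transformers API follows; the model id is assumed from the release naming and may differ from the actual repository name.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/MobileLLM-R1-950M"  # assumed id; check the Hugging Face release page
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Compute 12 * 17 and briefly explain each step."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```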

Beyond Token Limits: Assessing Language Model Performance on Long Text Classification

arXiv:2509.10199v1 Announce Type: new Abstract: The most widely used large language models in the social sciences (such as BERT and its derivatives, e.g. RoBERTa) have a limitation on the input text length that they can process to produce predictions. This is a particularly pressing issue for some classification tasks, where the aim is to handle long input texts. One such area deals with laws and draft laws (bills), which can have a length of multiple hundred pages and, therefore, are not particularly amenable to processing with models that can only handle e.g. 512 tokens. In this paper, we show results from experiments covering 5 languages with XLM-RoBERTa, Longformer, GPT-3.5, and GPT-4 models for the multiclass classification task of the Comparative Agendas Project, which has a codebook of 21 policy topic labels from education to health care. Results show no particular advantage for the Longformer model, pre-trained specifically for the purposes of handling long inputs. The comparison between the GPT variants and the best-performing open model yielded an edge for the latter. An analysis of class-level factors points to the importance of support and substance overlaps between specific categories when it comes to performance on long text inputs.
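As a quick illustration of the 512-token constraint discussed above (an illustrative snippet, not from the paper), the sketch below tokenizes an artificially long text with the XLM-RoBERTa tokenizer and shows how little of it survives truncation to a 512-token window.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
bill_text = "Section 1. The provisions of this act apply broadly. " * 2000  # stand-in for a long bill

full = tokenizer(bill_text, truncation=False)["input_ids"]
kept = tokenizer(bill_text, truncation=True, max_length=512)["input_ids"]
print(f"full document: {len(full)} tokens; kept by a 512-token model: {len(kept)} tokens")
```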

DiTTO-LLM: Framework for Discovering Topic-based Technology Opportunities via Large Language Model

arXiv:2509.09724v1 Announce Type: new Abstract: Technology opportunities are critical information that serve as a foundation for advancements in technology, industry, and innovation. This paper proposes a framework based on the temporal relationships between technologies to identify emerging technology opportunities. The proposed framework begins by extracting text from a patent dataset, followed by mapping text-based topics to discover inter-technology relationships. Technology opportunities are then identified by tracking changes in these topics over time. To enhance efficiency, the framework leverages a large language model to extract topics and employs a prompt for a chat-based language model to support the discovery of technology opportunities. The framework was evaluated using an artificial intelligence patent dataset provided by the United States Patent and Trademark Office. The experimental results suggest that artificial intelligence technology is evolving into forms that facilitate everyday accessibility. This approach demonstrates the potential of the proposed framework to identify future technology opportunities.
