
ALIGNS: Unlocking nomological networks in psychological measurement through a large language model

arXiv:2509.09723v1 Announce Type: new Abstract: Psychological measurement is critical to many disciplines. Despite advances in measurement, building nomological networks (theoretical maps of how concepts and measures relate, used to establish validity) remains a challenge 70 years after Cronbach and Meehl proposed them as fundamental to validation. This limitation has practical consequences: clinical trials may fail to detect treatment effects, and public policy may target the wrong outcomes. We introduce Analysis of Latent Indicators to Generate Nomological Structures (ALIGNS), a large language model-based system trained with validated questionnaire measures. ALIGNS provides three comprehensive nomological networks containing over 550,000 indicators across psychology, medicine, social policy, and other fields. This represents the first application of large language models to solve a foundational problem in measurement validation. We report the classification accuracy tests used to develop the model, as well as three evaluations. In the first evaluation, the widely used NIH PROMIS anxiety and depression instruments are shown to converge into a single dimension of emotional distress. The second evaluation examines child temperament measures, identifies four potential dimensions not captured by current frameworks, and questions one existing dimension. The third evaluation, an applicability check, engages expert psychometricians who assess the system's importance, accessibility, and suitability. ALIGNS is freely available at nomologicalnetwork.org, complementing traditional validation methods with large-scale nomological analysis.


Faster and Better LLMs via Latency-Aware Test-Time Scaling

arXiv:2505.19634v4 Announce Type: replace Abstract: Test-Time Scaling (TTS) has proven effective in improving the performance of Large Language Models (LLMs) during inference. However, existing research has overlooked the efficiency of TTS from a latency-sensitive perspective. Through a latency-aware evaluation of representative TTS methods, we demonstrate that a compute-optimal TTS configuration does not always yield the lowest latency in scenarios where latency is critical. To address this gap and achieve latency-optimal TTS, we propose two key approaches that optimize concurrency configurations: (1) branch-wise parallelism, which leverages multiple concurrent inference branches, and (2) sequence-wise parallelism, enabled by speculative decoding. By integrating these two approaches and allocating computational resources properly to each, our latency-optimal TTS enables a 32B model to reach 82.3% accuracy on MATH-500 within 1 minute and a smaller 3B model to achieve 72.4% within 10 seconds. Our work emphasizes the importance of latency-aware TTS and demonstrates its ability to deliver both speed and accuracy in latency-sensitive scenarios.
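The branch-wise parallelism idea in this abstract can be sketched in a few lines: instead of generating candidate answers one after another, the client issues several inference branches concurrently and aggregates them, here by majority vote. This is only an illustrative sketch under stated assumptions, not the authors' implementation; the blocking generate(prompt) call is a hypothetical stand-in for a real inference client.

# Illustrative sketch of branch-wise parallel test-time scaling (not the paper's code).
# generate(prompt) is a hypothetical blocking call to an LLM inference backend.
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def generate(prompt: str) -> str:
    """Hypothetical call to an inference server; replace with a real client."""
    raise NotImplementedError

def branch_parallel_answer(prompt: str, n_branches: int = 8) -> str:
    # Launch all branches concurrently so wall-clock latency is roughly one
    # generation, not n_branches sequential generations.
    with ThreadPoolExecutor(max_workers=n_branches) as pool:
        completions = list(pool.map(generate, [prompt] * n_branches))
    # Aggregate by majority vote over the final answer lines (self-consistency style).
    answers = [c.strip().splitlines()[-1] for c in completions]
    return Counter(answers).most_common(1)[0][0]

Because the branches run concurrently, adding more of them improves accuracy at roughly constant latency, which is the latency-optimal trade-off the abstract highlights.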


Meta AI Released MobileLLM-R1: An Edge Reasoning Model with Fewer than 1B Parameters that Achieves a 2x–5x Performance Boost Over Other Fully Open-Source AI Models

Meta has released MobileLLM-R1, a family of lightweight edge reasoning models now available on Hugging Face. The release includes models ranging from 140M to 950M parameters, with a focus on efficient mathematical, coding, and scientific reasoning at sub-billion scale. Unlike general-purpose chat models, MobileLLM-R1 is designed for edge deployment, aiming to deliver state-of-the-art reasoning accuracy while remaining computationally efficient.

What architecture powers MobileLLM-R1?

The largest model, MobileLLM-R1-950M, integrates several architectural optimizations:
- 22 Transformer layers with 24 attention heads and 6 grouped KV heads.
- Embedding dimension: 1536; hidden dimension: 6144.
- Grouped-Query Attention (GQA) reduces compute and memory (see the shape sketch following this article).
- Block-wise weight sharing cuts parameter count without heavy latency penalties.
- SwiGLU activations improve small-model representation.
- Context length: 4K for base models, 32K for post-trained models.
- 128K vocabulary with shared input/output embeddings.

The emphasis is on reducing compute and memory requirements, making the model suitable for deployment on constrained devices.

How efficient is the training?

MobileLLM-R1 is notable for data efficiency:
- Trained on ~4.2T tokens in total. By comparison, Qwen3's 0.6B model was trained on 36T tokens, so MobileLLM-R1 uses only ≈11.7% of the data to reach or surpass Qwen3's accuracy.
- Post-training applies supervised fine-tuning on math, coding, and reasoning datasets.

This efficiency translates directly into lower training costs and resource demands.

How does it perform against other open models?

On benchmarks, MobileLLM-R1-950M shows significant gains:
- MATH (MATH500 dataset): ~5x higher accuracy than OLMo-1.24B and ~2x higher accuracy than SmolLM2-1.7B.
- Reasoning and coding (GSM8K, AIME, LiveCodeBench): matches or surpasses Qwen3-0.6B despite using far fewer training tokens.

The model delivers results typically associated with larger architectures while maintaining a smaller footprint.

Where does MobileLLM-R1 fall short?

The model's narrow focus creates limitations:
- Strong in math, code, and structured reasoning; weaker in general conversation, commonsense, and creative tasks compared to larger LLMs.
- Distributed under a FAIR NC (non-commercial) license, which restricts usage in production settings.
- Longer contexts (32K) raise KV-cache and memory demands at inference.

How does MobileLLM-R1 compare to Qwen3, SmolLM2, and OLMo?

Performance snapshot (post-trained models):

Model                   Params   Train tokens (T)   MATH500   GSM8K   AIME'24   AIME'25   LiveCodeBench
MobileLLM-R1-950M       0.949B   4.2                74.0      67.5    15.5      16.3      19.9
Qwen3-0.6B              0.596B   36.0               73.0      79.2    11.3      17.0      14.9
SmolLM2-1.7B-Instruct   1.71B    ~11.0              19.2      41.8    0.3       0.1       4.4
OLMo-2-1B-Instruct      1.48B    ~3.95              19.2      69.7    0.6       0.1       0.0

Key observations:
- R1-950M matches Qwen3-0.6B in math (74.0 vs 73.0) while requiring ~8.6x fewer training tokens.
- Performance gaps vs SmolLM2 and OLMo are substantial across reasoning tasks.
- Qwen3 maintains an edge in GSM8K, but the difference is small compared to the training-efficiency advantage.

Summary

Meta's MobileLLM-R1 underscores a trend toward smaller, domain-optimized models that deliver competitive reasoning without massive training budgets.
By achieving 2x–5x performance gains over larger open models while training on a fraction of the data, it demonstrates that efficiency, not just scale, will define the next phase of LLM deployment, especially for math, coding, and scientific use cases on edge devices.
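The grouped-query attention layout described in the architecture section above (24 query heads sharing 6 KV heads at an embedding width of 1536) can be illustrated with a small shape check. This is a toy sketch built only from the published configuration numbers, not Meta's implementation.

# Toy shape check for grouped-query attention using MobileLLM-R1-950M's published
# dimensions (24 query heads, 6 KV heads, embedding width 1536). Illustrative only.
import torch

batch, seq_len, d_model = 1, 128, 1536
n_q_heads, n_kv_heads = 24, 6
head_dim = d_model // n_q_heads            # 64
group = n_q_heads // n_kv_heads            # 4 query heads share each KV head

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand K/V so each group of query heads attends over the same KV head.
k = k.repeat_interleave(group, dim=1)      # (1, 24, 128, 64)
v = v.repeat_interleave(group, dim=1)

attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1)
out = attn @ v                             # (1, 24, 128, 64)
print(out.shape)  # the KV cache stores 6 heads instead of 24, cutting memory roughly 4x

The point of the sketch is the memory arithmetic: only the 6 KV heads need to be cached during decoding, which is what makes GQA attractive on constrained edge devices.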


DiTTO-LLM: Framework for Discovering Topic-based Technology Opportunities via Large Language Model

arXiv:2509.09724v1 Announce Type: new Abstract: Technology opportunities are critical information that serves as a foundation for advances in technology, industry, and innovation. This paper proposes a framework based on the temporal relationships between technologies to identify emerging technology opportunities. The proposed framework begins by extracting text from a patent dataset, then maps text-based topics to discover inter-technology relationships. Technology opportunities are then identified by tracking changes in these topics over time. To enhance efficiency, the framework leverages a large language model to extract topics and uses a prompted chat-based language model to support the discovery of technology opportunities. The framework was evaluated on an artificial intelligence patent dataset provided by the United States Patent and Trademark Office. The experimental results suggest that artificial intelligence technology is evolving toward forms that facilitate everyday accessibility. This approach demonstrates the potential of the proposed framework to identify future technology opportunities.
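A minimal sketch of the temporal topic-tracking idea described above: extract a topic label from each patent with an LLM, then compare topic frequencies across time windows to surface fast-growing topics. The call_llm helper, the prompt, and the growth threshold are all hypothetical stand-ins, not the paper's actual pipeline.

# Illustrative sketch of topic-based technology-opportunity discovery (not the paper's code).
# call_llm is a hypothetical helper returning a chat model's text response.
from collections import Counter

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in any chat-based LLM client here

def extract_topic(patent_text: str) -> str:
    return call_llm(f"Name the single main technology topic of this patent:\n{patent_text}")

def emerging_topics(patents_by_year: dict[int, list[str]], growth: float = 2.0) -> list[str]:
    # Count LLM-extracted topics per year, then flag topics whose count grows sharply.
    counts = {y: Counter(extract_topic(p) for p in texts) for y, texts in patents_by_year.items()}
    first, last = min(counts), max(counts)
    return [t for t, c in counts[last].items() if c >= growth * counts[first].get(t, 1)]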


Unsupervised Hallucination Detection by Inspecting Reasoning Processes

arXiv:2509.10004v1 Announce Type: new Abstract: Unsupervised hallucination detection aims to identify hallucinated content generated by large language models (LLMs) without relying on labeled data. While unsupervised methods have gained popularity by eliminating labor-intensive human annotations, they frequently rely on proxy signals unrelated to factual correctness. This misalignment biases detection probes toward superficial or non-truth-related aspects, limiting generalizability across datasets and scenarios. To overcome these limitations, we propose IRIS, an unsupervised hallucination detection framework that leverages internal representations intrinsic to factual correctness. IRIS prompts the LLM to carefully verify the truthfulness of a given statement and uses the resulting contextualized embedding as an informative feature for training. Meanwhile, the uncertainty of each response serves as a soft pseudo-label for truthfulness. Experimental results demonstrate that IRIS consistently outperforms existing unsupervised methods. Our approach is fully unsupervised, computationally inexpensive, and works well even with little training data, making it suitable for real-time detection.
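The recipe in this abstract (prompt the model to verify a statement, take a hidden-state embedding as the feature, and use the model's own uncertainty as a soft pseudo-label for a probe) can be sketched roughly as follows. The sketch assumes a generic Hugging Face causal LM and uses the relative probability of a "True" continuation as the uncertainty signal; the actual IRIS prompt, layer choice, and probe may differ.

# Rough sketch of an IRIS-style pipeline (not the authors' code).
# Assumes any Hugging Face causal LM; prompt, layer, and probe are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()

def verify_features(statement: str):
    prompt = f"Statement: {statement}\nIs this statement true? Answer True or False. Answer:"
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = lm(**ids, output_hidden_states=True)
    # Contextualized embedding of the final token from the last layer -> feature vector.
    feat = out.hidden_states[-1][0, -1]
    # Soft pseudo-label: relative probability of " True" vs " False" as the next token.
    logits = out.logits[0, -1]
    t = tok(" True", add_special_tokens=False).input_ids[0]
    f = tok(" False", add_special_tokens=False).input_ids[0]
    p_true = torch.softmax(logits[[t, f]], dim=-1)[0].item()
    return feat.numpy(), p_true

# Train an unsupervised probe on unlabeled statements using the soft pseudo-labels.
# (With only a handful of statements the pseudo-labels may collapse to one class;
#  in practice the probe is fit on a larger unlabeled pool.)
statements = ["Paris is the capital of France.", "The sun orbits the Earth.",
              "Water boils at 100 degrees Celsius at sea level.", "Spiders are insects."]
feats, soft = zip(*(verify_features(s) for s in statements))
probe = LogisticRegression().fit(list(feats), [int(p > 0.5) for p in soft])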


Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes

arXiv:2507.13335v2 Announce Type: replace Abstract: Humour, as a complex language form, is derived from myriad aspects of life. Whilst existing work on computational humour has focussed almost exclusively on short pun-based jokes, we investigate whether the ability of Large Language Models (LLMs) to explain humour depends on the particular form. We compare models’ joke explanation abilities from simple puns to complex topical humour that requires esoteric knowledge of real-world entities and events. To this end, we curate a dataset of 600 jokes across 4 joke types and manually write high-quality explanations. These jokes include heterographic and homographic puns, contemporary internet humour, and topical jokes. Using this dataset, we compare the zero-shot abilities of a range of LLMs to accurately and comprehensively explain jokes of different types, identifying key research gaps in the task of humour explanation. We find that none of the tested models (including reasoning models) are capable of reliably generating adequate explanations of all joke types, further highlighting the narrow focus of most existing works on overly simple joke forms.


LLM-Based Instance-Driven Heuristic Bias In the Context of a Biased Random Key Genetic Algorithm

arXiv:2509.09707v1 Announce Type: cross Abstract: Integrating Large Language Models (LLMs) within metaheuristics opens a novel path for solving complex combinatorial optimization problems. While most existing approaches leverage LLMs for code generation to create or refine specific heuristics, they often overlook the structural properties of individual problem instances. In this work, we introduce a novel framework that integrates LLMs with a Biased Random-Key Genetic Algorithm (BRKGA) to solve the NP-hard Longest Run Subsequence problem. Our approach extends the instance-driven heuristic bias paradigm by introducing a human-LLM collaborative process to co-design and implement a set of computationally efficient metrics. The LLM analyzes these instance-specific metrics to generate a tailored heuristic bias, which steers the BRKGA toward promising areas of the search space. We conduct a comprehensive experimental evaluation, including rigorous statistical tests, convergence and behavioral analyses, and targeted ablation studies, comparing our method against a standard BRKGA baseline across 1,050 generated instances of varying complexity. Results show that our top-performing hybrid, BRKGA+Llama-4-Maverick, achieves statistically significant improvements over the baseline, particularly on the most complex instances. Our findings confirm that leveraging an LLM to produce an a priori, instance-driven heuristic bias is a valuable approach for enhancing metaheuristics in complex optimization domains.
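As a rough illustration of the instance-driven heuristic bias idea above, the sketch below biases the initial random keys of a BRKGA-style population toward per-position values; in the paper that bias is produced by an LLM from instance-specific metrics, whereas here it is just a stand-in vector. The function name and the linear mixing scheme are illustrative assumptions, not the authors' implementation.

# Illustrative sketch: seeding a BRKGA-style initial population with an
# instance-driven heuristic bias (a stand-in for the LLM-produced bias).
import numpy as np

rng = np.random.default_rng(0)

def biased_initial_population(pop_size: int, n_keys: int, bias: np.ndarray, strength: float = 0.5):
    """Mix uniform random keys with a per-position bias vector in [0, 1]."""
    uniform = rng.random((pop_size, n_keys))
    # strength = 0 -> plain BRKGA initialization; strength = 1 -> fully biased keys.
    return (1 - strength) * uniform + strength * bias

# Hypothetical bias for a 10-element Longest Run Subsequence instance,
# e.g. produced by an LLM after reading the instance metrics.
bias = np.linspace(0.1, 0.9, 10)
population = biased_initial_population(pop_size=100, n_keys=10, bias=bias)

The decoder and evolutionary operators of the BRKGA stay unchanged; only the starting keys (and hence the regions of the search space explored first) are steered by the a priori bias.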


Beyond Token Limits: Assessing Language Model Performance on Long Text Classification

arXiv:2509.10199v1 Announce Type: new Abstract: The large language models most widely used in the social sciences (such as BERT and its derivatives, e.g. RoBERTa) have a limit on the length of the input text they can process to produce predictions. This is a particularly pressing issue for classification tasks where the aim is to handle long input texts. One such area deals with laws and draft laws (bills), which can run to several hundred pages and are therefore not particularly amenable to processing with models that can only handle e.g. 512 tokens. In this paper, we show results from experiments covering 5 languages with XLM-RoBERTa, Longformer, GPT-3.5, and GPT-4 models on the multiclass classification task of the Comparative Agendas Project, which has a codebook of 21 policy topic labels from education to health care. Results show no particular advantage for the Longformer model, which is pre-trained specifically to handle long inputs. The comparison between the GPT variants and the best-performing open model yielded an edge for the latter. An analysis of class-level factors points to the importance of class support and of substantive overlap between specific categories when it comes to performance on long text inputs.
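The 512-token ceiling this abstract refers to is easy to see in practice. The sketch below shows the common chunk-and-average workaround for encoder classifiers such as XLM-RoBERTa; it is given only as a generic illustration of the token-limit problem, not as the paper's method, and the 21-label head is randomly initialized here.

# Generic illustration of the 512-token limit and a chunk-then-average workaround
# for encoder classifiers such as XLM-RoBERTa (not the paper's method).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "xlm-roberta-base"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=21).eval()

def classify_long_text(text: str) -> int:
    # Tokenize into overlapping-free 512-token windows; a multi-hundred-page bill
    # yields many chunks rather than a single over-long sequence.
    enc = tok(text, truncation=True, max_length=512, return_overflowing_tokens=True,
              padding=True, return_tensors="pt")
    enc.pop("overflow_to_sample_mapping", None)
    with torch.no_grad():
        logits = model(**enc).logits          # one row of logits per 512-token chunk
    # Average chunk logits into one document-level prediction over the 21 CAP topics.
    return int(logits.mean(dim=0).argmax())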


Google AI Releases VaultGemma: The Largest and Most Capable Open Model (1B Parameters) Trained from Scratch with Differential Privacy

Google AI Research and DeepMind have released VaultGemma 1B, the largest open-weight large language model trained entirely with differential privacy (DP). This development is a major step toward building AI models that are both powerful and privacy-preserving.

Why Do We Need Differential Privacy in LLMs?

Large language models trained on vast web-scale datasets are prone to memorization attacks, where sensitive or personally identifiable information can be extracted from the model. Studies have shown that verbatim training data can resurface, especially in open-weight releases. Differential privacy offers a mathematical guarantee that prevents any single training example from significantly influencing the model. Unlike approaches that apply DP only during fine-tuning, VaultGemma enforces full private pretraining, ensuring that privacy protection begins at the foundational level.

Technical report: https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf

What Is the Architecture of VaultGemma?

VaultGemma is architecturally similar to earlier Gemma models but optimized for private training:
- Model size: 1B parameters, 26 layers.
- Transformer type: decoder-only.
- Activations: GeGLU with a feedforward dimension of 13,824.
- Attention: Multi-Query Attention (MQA) with a global span of 1024 tokens.
- Normalization: RMSNorm in pre-norm configuration.
- Tokenizer: SentencePiece with a 256K vocabulary.

A notable change is the reduction of the sequence length to 1024 tokens, which lowers compute costs and enables larger batch sizes under DP constraints.

What Data Was Used for Training?

VaultGemma was trained on the same 13 trillion-token dataset as Gemma 2, composed primarily of English text from web documents, code, and scientific articles. The dataset underwent several filtering stages to:
- Remove unsafe or sensitive content.
- Reduce personal information exposure.
- Prevent evaluation data contamination.

This ensures both safety and fairness in benchmarking.

How Was Differential Privacy Applied?

VaultGemma used DP-SGD (Differentially Private Stochastic Gradient Descent) with gradient clipping and Gaussian noise addition. The implementation was built on JAX Privacy and introduced optimizations for scalability:
- Vectorized per-example clipping for parallel efficiency.
- Gradient accumulation to simulate large batches.
- Truncated Poisson subsampling integrated into the data loader for efficient on-the-fly sampling.

The model achieved a formal DP guarantee of (ε ≤ 2.0, δ ≤ 1.1e−10) at the sequence level (1024 tokens). A generic DP-SGD sketch follows this article.

How Do Scaling Laws Work for Private Training?

Training large models under DP constraints requires new scaling strategies. The VaultGemma team developed DP-specific scaling laws with three innovations:
- Optimal learning-rate modeling using quadratic fits across training runs.
- Parametric extrapolation of loss values to reduce reliance on intermediate checkpoints.
- Semi-parametric fits to generalize across model size, training steps, and noise-batch ratios.

This methodology enabled precise prediction of achievable loss and efficient resource use on the TPUv6e training cluster.

What Were the Training Configurations?

VaultGemma was trained on 2048 TPUv6e chips using GSPMD partitioning and MegaScale XLA compilation:
- Batch size: ~518K tokens.
- Training iterations: 100,000.
- Noise multiplier: 0.614.

The achieved loss was within 1% of predictions from the DP scaling law, validating the approach.

How Does VaultGemma Perform Compared to Non-Private Models?
On academic benchmarks, VaultGemma trails its non-private counterparts but shows strong utility:
- ARC-C: 26.45 vs. 38.31 (Gemma-3 1B).
- PIQA: 68.0 vs. 70.51 (GPT-2 1.5B).
- TriviaQA (5-shot): 11.24 vs. 39.75 (Gemma-3 1B).

These results suggest that DP-trained models are currently comparable to non-private models from about five years ago. Importantly, memorization tests confirmed that no training data leakage was detectable in VaultGemma, unlike in non-private Gemma models.

Summary

VaultGemma 1B shows that large-scale language models can be trained with rigorous differential privacy guarantees without making them impractical to use. While a utility gap remains compared to non-private counterparts, the release of both the model and its training methodology gives the community a strong foundation for advancing private AI. This work signals a shift toward building models that are not only capable but also inherently safe, transparent, and privacy-preserving.
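The DP-SGD mechanics summarized in the "How Was Differential Privacy Applied?" section (per-example gradient clipping followed by Gaussian noise scaled by the noise multiplier) can be sketched generically as below. This is a didactic sketch in plain PyTorch, not the JAX Privacy implementation used for VaultGemma; the clipping norm and learning rate are assumed placeholders, while the 0.614 noise multiplier is the value reported above.

# Didactic DP-SGD step: per-example clipping + Gaussian noise (not VaultGemma's JAX code).
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, lr=1e-3, clip_norm=1.0, noise_multiplier=0.614):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(batch_x, batch_y):                  # per-example gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in params]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (norm + 1e-12), max=1.0)   # clip each example
        for s, g in zip(summed, grads):
            s.add_(g * scale)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(s) * noise_multiplier * clip_norm
            p.add_(-(lr / len(batch_x)) * (s + noise))  # noisy averaged update

The per-example loop is what VaultGemma's vectorized clipping and gradient accumulation optimize away at scale; the privacy guarantee itself comes from the clip-then-noise structure shown here together with subsampling and accounting.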


BentoML Released llm-optimizer: An Open-Source AI Tool for Benchmarking and Optimizing LLM Inference

BentoML has released llm-optimizer, an open-source framework designed to streamline the benchmarking and performance tuning of self-hosted large language models (LLMs). The tool addresses a common challenge in LLM deployment: finding optimal configurations for latency, throughput, and cost without relying on manual trial and error.

Why is tuning LLM performance difficult?

Tuning LLM inference is a balancing act across many moving parts: batch size, framework choice (vLLM, SGLang, etc.), tensor parallelism, sequence lengths, and how well the hardware is utilized. Each of these factors can shift performance in different ways, which makes finding the right combination for speed, efficiency, and cost far from straightforward. Most teams still rely on repetitive trial-and-error testing, a process that is slow, inconsistent, and often inconclusive. For self-hosted deployments, the cost of getting it wrong is high: poorly tuned configurations can quickly translate into higher latency and wasted GPU resources.

How is llm-optimizer different?

llm-optimizer provides a structured way to explore the LLM performance landscape. It eliminates repetitive guesswork by enabling systematic benchmarking and automated search across possible configurations. Core capabilities include:
- Running standardized tests across inference frameworks such as vLLM and SGLang.
- Applying constraint-driven tuning, e.g., surfacing only configurations where time-to-first-token is below 200 ms (a generic sketch of this idea appears after this article).
- Automating parameter sweeps to identify optimal settings.
- Visualizing tradeoffs with dashboards for latency, throughput, and GPU utilization.

The framework is open source and available on GitHub.

How can developers explore results without running benchmarks locally?

Alongside the optimizer, BentoML released the LLM Performance Explorer, a browser-based interface powered by llm-optimizer. It provides pre-computed benchmark data for popular open-source models and lets users:
- Compare frameworks and configurations side by side.
- Filter by latency, throughput, or resource thresholds.
- Browse tradeoffs interactively without provisioning hardware.

How does llm-optimizer impact LLM deployment practices?

As the use of LLMs grows, getting the most out of deployments comes down to how well inference parameters are tuned. llm-optimizer lowers the complexity of this process, giving smaller teams access to optimization techniques that once required large-scale infrastructure and deep expertise. By providing standardized benchmarks and reproducible results, the framework adds much-needed transparency to the LLM space. It makes comparisons across models and frameworks more consistent, closing a long-standing gap in the community. Ultimately, BentoML's llm-optimizer brings a constraint-driven, benchmark-focused method to self-hosted LLM optimization, replacing ad-hoc trial and error with a systematic and repeatable workflow.
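Constraint-driven tuning as described above boils down to sweeping configurations, discarding those that violate a latency constraint (for example, time-to-first-token under 200 ms), and ranking the rest by throughput. The sketch below shows only this generic idea; the benchmark function, its dummy cost model, and the parameter grid are hypothetical and do not use llm-optimizer's actual API.

# Generic sketch of a constraint-driven configuration sweep (not llm-optimizer's API).
from itertools import product

def benchmark(config: dict) -> dict:
    """Hypothetical stand-in: run the config and measure TTFT and throughput.
    Returns dummy numbers from a made-up cost model so the sketch runs end to end."""
    return {"ttft_ms": 150.0 + 20 * config["max_batch_size"] / config["tensor_parallel"],
            "tokens_per_s": 50.0 * config["tensor_parallel"] * config["max_batch_size"] ** 0.5}

grid = {
    "framework": ["vllm", "sglang"],
    "tensor_parallel": [1, 2, 4],
    "max_batch_size": [8, 32, 128],
}

candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
results = [(cfg, benchmark(cfg)) for cfg in candidates]
# Keep only configurations that satisfy the latency constraint, then rank by throughput.
feasible = [(cfg, m) for cfg, m in results if m["ttft_ms"] < 200]
best = max(feasible, key=lambda item: item[1]["tokens_per_s"])
print(best)

Swapping the dummy cost model for real measurements against a serving backend turns this loop into exactly the kind of systematic, repeatable sweep the article describes.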
