
Post-LayerNorm Is Back: Stable, Expressive, and Deep

arXiv:2601.19895v1 Announce Type: cross Abstract: Large language model (LLM) scaling is hitting a wall. Widening models yields diminishing returns, and extending context length does not improve fundamental expressivity. In contrast, depth scaling offers theoretically superior expressivity, yet current Transformer architectures struggle to train reliably at extreme depths. We revisit the Post-LayerNorm (Post-LN) formulation, whose instability at scale caused its replacement by Pre-LN in modern LLMs. We show that the central failure mode of Post-LN arises from the ResNet-style residual pathway, which introduces gradient vanishing in deep networks. We present Keel, a Post-LN Transformer that replaces this residual path with a Highway-style connection. This modification preserves the gradient flow through the residual branch, preventing signal vanishing from the top layers to the bottom. Unlike prior methods, Keel enables stable training at extreme depths without requiring specialized initialization or complex optimization tricks. Keel trains robustly at depths exceeding 1000 layers and consistently improves perplexity and depth-scaling characteristics over Pre-LN. These findings indicate that Post-LN, when paired with a Highway-style connection, provides a simple and effective foundation for building deeply scalable LLMs, opening the possibility for future infinite-depth architectures.
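To make the architectural change concrete, here is a minimal PyTorch sketch of one attention sublayer in the style the abstract describes: the ResNet-style update y = LN(x + F(x)) is swapped for a Highway-style gated mix before the Post-LN normalization. The gate parameterization and its initialization bias are assumptions for illustration, not Keel's published design.

```python
import torch
import torch.nn as nn

class HighwayPostLNBlock(nn.Module):
    """One attention sublayer with Post-LN and a Highway-style residual.

    Hypothetical sketch: the standard Post-LN update y = LN(x + F(x)) is
    replaced by a gated Highway mix y = LN(g * F(x) + (1 - g) * x), where
    g is a learned transform gate. Gate design is an assumption here.
    """
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)
        # Bias the gate toward the identity path at init so very deep stacks
        # start close to a signal-preserving map (a common Highway trick).
        nn.init.constant_(self.gate.bias, -2.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f, _ = self.attn(x, x, x, need_weights=False)
        g = torch.sigmoid(self.gate(x))          # transform gate in (0, 1)
        return self.norm(g * f + (1.0 - g) * x)  # Highway mix, then Post-LN
```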


Complex Logical Instruction Generation

arXiv:2508.09125v2 Announce Type: replace Abstract: Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions become increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditions, loops, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs can follow fewer than 60% of the instructions, revealing significant deficiencies in their instruction-following ability. Code and Benchmark: https://github.com/mianzhang/LogicIF
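As a toy illustration of the generate-and-verify idea, the sketch below pairs a small code function containing a loop and conditionals with a natural-language instruction derived from it; an LLM's answer is then checked by executing the function on the same input. The function, instruction, and checker are invented examples, not items from LogicIFEval.

```python
# Verifiable instruction checking in the spirit of LogicIFGen: the
# instruction is derived from a code function, and an LLM's answer is
# verified by executing the function on the same input.

def reference_fn(xs: list[int]) -> int:
    """Ground-truth logic: sum even numbers, but stop at the first negative."""
    total = 0
    for x in xs:
        if x < 0:
            break
        if x % 2 == 0:
            total += x
    return total

INSTRUCTION = (
    "Given a list of integers, add up the even numbers, reading left to "
    "right, and stop as soon as you see a negative number. Report the sum."
)

def verify(llm_answer: str, xs: list[int]) -> bool:
    # An answer counts as correct only if it matches the executed function.
    return llm_answer.strip() == str(reference_fn(xs))

# Example: for [2, 3, 4, -1, 6] the correct answer is 6 (2 + 4; -1 stops the scan).
assert reference_fn([2, 3, 4, -1, 6]) == 6
```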


Propaganda AI: An Analysis of Semantic Divergence in Large Language Models

arXiv:2504.12344v2 Announce Type: replace Abstract: Large language models (LLMs) can exhibit concept-conditioned semantic divergence: common high-level cues (e.g., ideologies, public figures) elicit unusually uniform, stance-like responses that evade token-trigger audits. This behavior falls in a blind spot of current safety evaluations, yet carries major societal stakes, as such concept cues can steer content exposure at scale. We formalize this phenomenon and present RAVEN (Response Anomaly Vigilance), a black-box audit that flags cases where a model is simultaneously highly certain and atypical among peers by coupling semantic entropy over paraphrastic samples with cross-model disagreement. In a controlled LoRA fine-tuning study, we implant a concept-conditioned stance using a small biased corpus, demonstrating feasibility without rare token triggers. Auditing five LLM families across twelve sensitive topics (360 prompts per model) and clustering via bidirectional entailment, RAVEN surfaces recurrent, model-specific divergences in 9/12 topics. Concept-level audits complement token-level defenses and provide a practical early-warning signal for release evaluation and post-deployment monitoring against propaganda-like influence.
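A minimal sketch of the audit signal, under stated simplifications: responses to paraphrases are clustered (here by exact label match, standing in for the paper's bidirectional-entailment clustering), semantic entropy measures the model's certainty, and a flag is raised when a highly certain model disagrees with the peer consensus. The threshold and function names are assumptions.

```python
import math
from collections import Counter

def semantic_entropy(cluster_labels: list[str]) -> float:
    # Entropy of the distribution of responses over semantic clusters.
    counts = Counter(cluster_labels)
    n = len(cluster_labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def raven_flag(model_clusters: list[str], peer_majority: str,
               entropy_max: float = 0.5) -> bool:
    # Flag when the model is both certain (low semantic entropy over
    # paraphrastic samples) and atypical relative to peer models.
    certain = semantic_entropy(model_clusters) <= entropy_max
    majority = Counter(model_clusters).most_common(1)[0][0]
    return certain and majority != peer_majority

# A model that uniformly answers "stance A" while peers converge on "stance B":
print(raven_flag(["A"] * 10, peer_majority="B"))  # True -> flag for review
```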


Tencent Hunyuan Releases HPC-Ops: A High Performance LLM Inference Operator Library

Tencent Hunyuan has open sourced HPC-Ops, a production grade operator library for large language model inference on NVIDIA SM90 architecture devices. HPC-Ops focuses on low level CUDA kernels for core operators such as Attention, Grouped GEMM, and Fused MoE, and exposes them through a compact C++ and Python API for integration into existing inference stacks.

HPC-Ops runs in large scale internal services. In those deployments it delivers about a 30 percent queries-per-minute improvement for Tencent-HY models and about a 17 percent improvement for DeepSeek models on mainstream inference cards. These gains are reported at the service level, so they reflect the cumulative effect of faster kernels inside a real inference pipeline.

Scope and design of HPC-Ops

HPC-Ops is a production grade, high performance, and easy to use operator library for LLM inference, developed by the Tencent Hunyuan AI Infra team. The project does not try to replace serving frameworks. Instead it provides kernels and clean APIs that can be called from systems that already handle scheduling, KV cache management, batching, and transport. The API is designed for seamless use inside popular inference frameworks such as vLLM and SGLang, so a framework team can swap in HPC-Ops kernels behind their own abstractions without changing the external behavior of their servers. HPC-Ops uses C++ and CUDA with CuTe and CUTLASS as building blocks, and its kernels are written as relatively small examples that double as a modern CUDA tutorial.

Kernel performance characteristics

The project publishes maximum observed speedup numbers for each operator relative to established baselines. These are microbenchmarks, and the team stresses that performance varies across shapes and workloads, but they show the optimization ceiling.

For Attention in bf16, compared with FlashInfer, FlashAttention-2, FlashAttention-3, and TensorRT-LLM, HPC-Ops reports up to 1.33 times speedup in prefill and up to 2.22 times in decode. For Attention in fp8, compared with FlashInfer, FlashAttention-3, and TensorRT-LLM, it reports up to 1.12 times in prefill and up to 2.0 times in decode. For FusedMoE in fp8, compared with TensorRT-LLM and vLLM, the maximum observed speedup is up to 1.49 times in prefill and 1.14 times in decode. For GroupGEMM in fp8, compared with DeepGEMM, the reported gains are up to 1.1 times in prefill and 1.88 times in decode.

These numbers matter because decode is usually the latency bottleneck in autoregressive generation, where batch sizes shrink and memory traffic dominates. The fact that Attention and GroupGEMM show the largest relative gains in decode suggests that HPC-Ops focuses on the part of the pipeline that most users notice.

Supported kernels and precision

The current release groups its functionality into three operator families:

- Attention kernels cover both prefill and decode and include support for paged attention, the memory layout that frameworks like vLLM use to place key and value cache blocks in a paged structure, improving memory reuse for long sequences.
- Grouped GEMM is implemented as quantized GroupGEMM with fp8 weights. HPC-Ops supports block-wise and per-tensor scaling, so teams can trade off quantization granularity against parameter storage and calibration cost; a toy comparison of the two granularities follows below.
- Fused MoE combines mixture of experts routing and expert computation in a single quantized operator. It likewise uses fp8 expert weights and supports block-wise and per-tensor scaling strategies.
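The following NumPy sketch, which is illustrative and not HPC-Ops code, shows why block-wise scaling can beat per-tensor scaling on outlier-heavy weights: one scale per tile keeps the quantization step small outside the outlier region, at the cost of storing more scales. The rounding step is a crude stand-in for a real fp8 cast.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # dynamic range limit of the fp8 E4M3 format

def quantize(w: np.ndarray, block: int | None = None) -> np.ndarray:
    if block is None:                           # per-tensor: one global scale
        s = np.abs(w).max() / FP8_E4M3_MAX
        return np.round(w / s) * s              # stand-in for a real fp8 cast
    out = np.empty_like(w)
    for i in range(0, w.shape[0], block):       # block-wise: one scale per tile
        for j in range(0, w.shape[1], block):
            tile = w[i:i + block, j:j + block]
            s = np.abs(tile).max() / FP8_E4M3_MAX
            out[i:i + block, j:j + block] = np.round(tile / s) * s
    return out

w = np.random.randn(256, 256).astype(np.float32)
w[:128, :128] *= 50                             # one outlier-heavy block
for mode, q in [("per-tensor", quantize(w)), ("block-wise", quantize(w, 128))]:
    print(mode, "rmse:", float(np.sqrt(np.mean((w - q) ** 2))))
```

On this synthetic matrix the block-wise variant shows a much lower reconstruction error, because the outlier block no longer inflates the quantization step everywhere else.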
Across these kernels, HPC-Ops provides native support for bf16 and fp8 data types. That matches the current production trend of moving inference toward lower precision formats that preserve accuracy while reducing memory bandwidth and improving tensor core utilization.

Key Takeaways

- Tencent Hunyuan open sourced HPC-Ops as a production grade operator library for LLM inference on NVIDIA SM90 GPUs, including H20, with C++ and CUDA kernels built on CuTe and CUTLASS.
- In production deployments HPC-Ops reports about a 30 percent QPM gain for Tencent-HY models and about a 17 percent QPM gain for DeepSeek models on mainstream inference cards.
- Operator microbenchmarks show maximum speedups of up to 2.22 times for bf16 Attention decode, up to 2.0 times for fp8 Attention decode, up to 1.49 times for fp8 FusedMoE prefill, and up to 1.88 times for fp8 GroupGEMM decode against strong baselines such as FlashInfer, FlashAttention, TensorRT-LLM, and DeepGEMM.
- The library covers three operator families: Attention with paged attention support, quantized GroupGEMM with fp8 weights, and quantized Fused MoE with fp8 expert weights, all with both block-wise and per-tensor scaling and native bf16 plus fp8 precision support.
- HPC-Ops is designed as an operator layer that integrates into existing inference frameworks such as vLLM and SGLang; the roadmap targets sparse attention for long context LLMs, extended quantization including 4 bit and 8 bit strategies, and kernels that better overlap computation with multi GPU communication.


Cross-Examination Framework: A Task-Agnostic Diagnostic for Information Fidelity in Text-to-Text Generation

arXiv:2601.19350v1 Announce Type: new Abstract: Traditional metrics like BLEU and BERTScore fail to capture semantic fidelity in generative text-to-text tasks. We adapt the Cross-Examination Framework (CEF) for a reference-free, multi-dimensional evaluation by treating the source and candidate as independent knowledge bases. CEF generates verifiable questions from each text and performs a cross-examination to derive three interpretable scores: Coverage, Conformity, and Consistency. Validated across translation, summarization, and clinical note generation, our framework identifies critical errors, such as content omissions and factual contradictions, that standard metrics miss. A key contribution is a systematic robustness analysis to select a stable judge model. Crucially, the strong correlation between our reference-free and with-reference modes validates CEF’s reliability without gold references. Furthermore, human expert validation demonstrates that CEF’s mismatched questions align more strongly with meaning-altering semantic errors than with non-semantic errors, and that the framework particularly excels at identifying entity-based and relational distortions.
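A schematic Python sketch of the cross-examination loop, with `gen_questions` and `answer` standing in for LLM calls; the exact definitions of the three scores below are assumptions inferred from the abstract, not the paper's formulas.

```python
def cross_examine(source: str, candidate: str, gen_questions, answer):
    src_qs = gen_questions(source)      # facts the candidate should cover
    cand_qs = gen_questions(candidate)  # claims the source should support

    # Coverage: how many source-derived questions the candidate answers
    # the same way the source does (content omissions lower this).
    coverage = sum(
        answer(candidate, q) == answer(source, q) for q in src_qs
    ) / len(src_qs)

    # Conformity: how many candidate-derived claims the source backs
    # (factual contradictions and hallucinations lower this).
    conformity = sum(
        answer(source, q) == answer(candidate, q) for q in cand_qs
    ) / len(cand_qs)

    # Consistency: assuming a stochastic answerer, the candidate should
    # answer its own questions the same way across repeated samples.
    consistency = sum(
        answer(candidate, q) == answer(candidate, q) for q in cand_qs
    ) / len(cand_qs)

    return {"coverage": coverage, "conformity": conformity,
            "consistency": consistency}
```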


Mitigating the Modality Gap: Few-Shot Out-of-Distribution Detection with Multi-modal Prototypes and Image Bias Estimation

arXiv:2502.00662v2 Announce Type: replace-cross Abstract: Existing vision-language model (VLM)-based methods for out-of-distribution (OOD) detection typically rely on similarity scores between input images and in-distribution (ID) text prototypes. However, the modality gap between image and text often results in high false positive rates, as OOD samples can exhibit high similarity to ID text prototypes. To mitigate the impact of this modality gap, we propose incorporating ID image prototypes along with ID text prototypes. We present theoretical analysis and empirical evidence indicating that this approach enhances VLM-based OOD detection performance without any additional training. To further reduce the gap between image and text, we introduce a novel few-shot tuning framework, SUPREME, comprising biased prompts generation (BPG) and image-text consistency (ITC) modules. BPG enhances image-text fusion and improves generalization by conditioning ID text prototypes on the Gaussian-based estimated image domain bias; ITC reduces the modality gap by minimizing intra- and inter-modal distances. Moreover, inspired by our theoretical and empirical findings, we introduce a novel OOD score $S_{\textit{GMP}}$, leveraging uni- and cross-modal similarities. Finally, we present extensive experiments to demonstrate that SUPREME consistently outperforms existing VLM-based OOD detection methods.
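A minimal sketch of the core idea of scoring against both text and image ID prototypes; the fusion rule below (a weighted mix of uni- and cross-modal similarities, maximized over classes) is an illustrative assumption and not the paper's exact $S_{\textit{GMP}}$ definition.

```python
import torch

def ood_score(img_feat: torch.Tensor,     # (d,) normalized image feature
              text_protos: torch.Tensor,  # (C, d) normalized ID text prototypes
              img_protos: torch.Tensor,   # (C, d) normalized ID image prototypes
              alpha: float = 0.5) -> float:
    sim_text = text_protos @ img_feat     # cross-modal similarity, per class
    sim_img = img_protos @ img_feat       # uni-modal similarity, per class
    fused = alpha * sim_text + (1 - alpha) * sim_img
    # Higher max fused similarity means more ID-like; negate for an OOD score.
    # Adding image prototypes counteracts OOD samples that happen to sit
    # close to ID text prototypes across the modality gap.
    return -fused.max().item()
```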


MultiVis-Agent: A Multi-Agent Framework with Logic Rules for Reliable and Comprehensive Cross-Modal Data Visualization

arXiv:2601.18320v1 Announce Type: new Abstract: Real-world visualization tasks involve complex, multi-modal requirements that extend beyond simple text-to-chart generation, requiring reference images, code examples, and iterative refinement. Current systems exhibit fundamental limitations: single-modality input, one-shot generation, and rigid workflows. While LLM-based approaches show potential for these complex requirements, they introduce reliability challenges including catastrophic failures and infinite loop susceptibility. To address this gap, we propose MultiVis-Agent, a logic rule-enhanced multi-agent framework for reliable multi-modal and multi-scenario visualization generation. Our approach introduces a four-layer logic rule framework that provides mathematical guarantees for system reliability while maintaining flexibility. Unlike traditional rule-based systems, our logic rules are mathematical constraints that guide LLM reasoning rather than replacing it. We formalize the MultiVis task spanning four scenarios from basic generation to iterative refinement, and develop MultiVis-Bench, a benchmark with over 1,000 cases for multi-modal visualization evaluation. Extensive experiments demonstrate that our approach achieves a 75.63% visualization score on challenging tasks, significantly outperforming baselines (57.54-62.79%), with task completion rates of 99.58% and code execution success rates of 94.56% (vs. 74.48% and 65.10% without logic rules), successfully addressing both complexity and reliability challenges in automated visualization generation.
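A compact sketch of the rule-as-guard pattern the abstract describes: explicit predicates check each generation step, failures are fed back as constraints, and a hard iteration bound rules out infinite loops. The rule set and the `llm_generate`/`run_code` stubs are illustrative assumptions, not the paper's four-layer framework.

```python
MAX_ITERS = 5  # hard bound: guarantees termination regardless of LLM behavior

def generate_chart(spec: dict, llm_generate, run_code) -> str | None:
    rules = [
        lambda out: out["code"].strip() != "",                # non-empty program
        lambda out: out["exec_ok"],                           # code must run
        lambda out: out["chart_type"] == spec["chart_type"],  # honors the spec
    ]
    feedback = ""
    for _ in range(MAX_ITERS):
        code = llm_generate(spec, feedback)   # rules guide, not replace, the LLM
        out = run_code(code)
        out["code"] = code
        failed = [i for i, rule in enumerate(rules) if not rule(out)]
        if not failed:
            return code                       # all constraints satisfied
        feedback = f"violated rules: {failed}"  # steer the next attempt
    return None                               # give up after the bound
```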


Neurocomputational Mechanisms of Syntactic Transfer in Bilingual Sentence Production

arXiv:2601.18056v1 Announce Type: new Abstract: We discuss the benefits of incorporating certain types of oscillatory signatures, which can offer new implementational-level constraints for theories of bilingualism, into the study of bilingual production errors and their traditionally documented timing signatures (e.g., event-related potentials). We argue that a recent neural model of language, ROSE, can offer a neurocomputational account of syntactic transfer in bilingual production, capturing some of its formal properties and the scope of morphosyntactic sequencing failure modes. We take as a case study cross-linguistic influence (CLI) and attendant theories of functional inhibition/competition, and present these as being driven by specific oscillatory failure modes during L2 sentence planning. We argue that modeling CLI in this way not only offers the kind of linking hypothesis ROSE was built to encourage, but also licenses the exploration of more spatiotemporally complex biomarkers of language dysfunction than the more commonly discussed neural signatures.


ToxSearch: Evolving Prompts for Toxicity Search in Large Language Models

arXiv:2511.12487v2 Announce Type: replace-cross Abstract: Large Language Models remain vulnerable to adversarial prompts that elicit toxic content even after safety alignment. We present ToxSearch, a black-box evolutionary framework that tests model safety by evolving prompts in a synchronous steady-state loop. The system employs a diverse set of operators, including lexical substitutions, negation, back-translation, paraphrasing, and two semantic crossover operators, while a moderation oracle provides fitness guidance. Operator-level analysis shows heterogeneous behavior: lexical substitutions offer the best yield-variance trade-off, semantic-similarity crossover acts as a precise low-throughput inserter, and global rewrites exhibit high variance with elevated refusal costs. Using elite prompts evolved on LLaMA 3.1 8B, we observe practically meaningful but attenuated cross-model transfer, with toxicity roughly halving on most targets, smaller LLaMA 3.2 variants showing the strongest resistance, and some cross-architecture models retaining higher toxicity. These results suggest that small, controllable perturbations are effective vehicles for systematic red-teaming and that defenses should anticipate cross-model reuse of adversarial prompts rather than focusing only on single-model hardening.
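A minimal sketch of a synchronous steady-state evolutionary loop of this kind: a fixed-size population of prompts, tournament parent selection, one variation operator per step, and a moderation oracle as fitness, with the weakest member replaced when the child scores higher. The operators and the `toxicity` oracle are placeholders, not the paper's implementations.

```python
import random

def steady_state_search(seed_prompts, operators, toxicity, generations=100):
    # Population holds (prompt, fitness) pairs; fitness is the oracle's
    # toxicity score for the model's response to the prompt.
    population = [(p, toxicity(p)) for p in seed_prompts]
    for _ in range(generations):
        # Tournament selection: best of a small random sample becomes parent.
        parent, _ = max(random.sample(population, k=3), key=lambda x: x[1])
        child = random.choice(operators)(parent)   # e.g. lexical substitution
        fit = toxicity(child)                      # moderation oracle score
        worst = min(range(len(population)), key=lambda i: population[i][1])
        if fit > population[worst][1]:             # replace the weakest member
            population[worst] = (child, fit)
    return max(population, key=lambda x: x[1])     # elite prompt and its score
```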


Grounded Concreteness: Human-Like Concreteness Sensitivity in Vision-Language Models

arXiv:2601.18065v1 Announce Type: new Abstract: Do vision–language models (VLMs) develop more human-like sensitivity to linguistic concreteness than text-only large language models (LLMs) when both are evaluated with text-only prompts? We study this question with a controlled comparison between matched Llama text backbones and their Llama Vision counterparts across multiple model scales, treating multimodal pretraining as an ablation on perceptual grounding rather than access to images at inference. We measure concreteness effects at three complementary levels: (i) output behavior, by relating question-level concreteness to QA accuracy; (ii) embedding geometry, by testing whether representations organize along a concreteness axis; and (iii) attention dynamics, by quantifying context reliance via attention-entropy measures. In addition, we elicit token-level concreteness ratings from models and evaluate alignment to human norm distributions, testing whether multimodal training yields more human-consistent judgments. Across benchmarks and scales, VLMs show larger gains on more concrete inputs, exhibit clearer concreteness-structured representations, produce ratings that better match human norms, and display systematically different attention patterns consistent with increased grounding.
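Of the three measurement levels, the attention-entropy measure is the easiest to sketch: per head and per query position, take the entropy of the attention distribution over keys and average. How the paper aggregates across layers and heads is not specified here, so the reduction below is an assumption.

```python
import torch

def attention_entropy(attn: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """attn: (heads, query_len, key_len) with rows summing to 1."""
    h = -(attn * (attn + eps).log()).sum(dim=-1)  # entropy per head, per query
    return h.mean()                               # average into one scalar

# Uniform attention maximizes entropy; a sharply peaked row is near zero.
uniform = torch.full((8, 16, 16), 1 / 16)
print(attention_entropy(uniform))                 # ~log(16) = 2.77
```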
