Vocabulary Customization for Efficient Domain-Specific LLM Deployment

arXiv:2509.26124v1 Announce Type: new Abstract: When using an LLM to process text outside the training domain(s), an often overlooked factor is vocabulary mismatch, where the general-domain tokenizer fails to capture frequent domain-specific terms, leading to higher token fertility and thus a decrease in processing speed due to suboptimal sub-word splits. We address this limitation by augmenting the pretrained vocabulary with a set of domain-specific tokens. To this end, we design an algorithm that extends an existing tokenizer while guaranteeing it never decreases tokenization efficiency: every input sequence is segmented into at most the same number of tokens as before. Evaluated on real-world e-Commerce use-cases, the augmented tokenizer significantly shortens input sequences by up to 20% and reduces inference latency on downstream tasks while preserving predictive quality. We further analyze secondary effects, such as the impact on forward pass speed and the rate at which the model adopts the newly introduced tokens, to illustrate the broader benefits of vocabulary adaptation.
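
The abstract does not spell out the augmentation algorithm, but the "never more tokens than before" guarantee can be illustrated with a small sketch, assuming a placeholder `base_tokenize` function and a greedy merge pass over added domain tokens (both are illustrative assumptions, not the paper's method):

```python
# Hedged sketch: augmenting a base tokenizer with domain tokens such that the
# token count never increases. `base_tokenize` is a stand-in for any existing
# subword tokenizer; the paper's exact algorithm may differ.

def base_tokenize(text: str) -> list[str]:
    # Placeholder: pretend the general-domain tokenizer splits on whitespace
    # and then cuts long words into crude halves (illustration only).
    pieces = []
    for word in text.split():
        if len(word) > 6:
            pieces.extend([word[: len(word) // 2], word[len(word) // 2 :]])
        else:
            pieces.append(word)
    return pieces

def tokenize_augmented(text: str, domain_tokens: set[str]) -> list[str]:
    """Merge adjacent base tokens whenever their concatenation is a domain token.
    Because we only ever merge, the output is never longer than the base split."""
    base = base_tokenize(text)
    out, i = [], 0
    while i < len(base):
        merged = None
        # Try the longest possible merge starting at position i.
        for j in range(len(base), i + 1, -1):
            candidate = "".join(base[i:j])
            if candidate in domain_tokens:
                merged = (candidate, j)
                break
        if merged:
            out.append(merged[0])
            i = merged[1]
        else:
            out.append(base[i])
            i += 1
    return out

domain_vocab = {"hydrocortisone"}  # hypothetical domain-specific token
text = "apply hydrocortisone cream"
assert len(tokenize_augmented(text, domain_vocab)) <= len(base_tokenize(text))
```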


One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient

arXiv:2509.26313v1 Announce Type: new Abstract: Supervised fine-tuning (SFT) is the predominant method for adapting large language models (LLMs), yet it often struggles with generalization compared to reinforcement learning (RL). In this work, we posit that this performance disparity stems not just from the loss function, but from a more fundamental difference: SFT learns from a fixed, pre-collected dataset, whereas RL utilizes on-policy data sampled from the current policy. Building on this hypothesis, we introduce one-token rollout (OTR), a novel fine-tuning algorithm that guides SFT with the policy gradient method. OTR reframes the autoregressive learning process by treating each token generation as a single-step reinforcement learning trajectory. At each step, it performs a Monte Carlo “rollout” by sampling multiple candidate tokens from the current policy’s distribution. The ground-truth token from the supervised data is then used to provide a reward signal to these samples. Guided by policy gradient, our algorithm repurposes static, off-policy supervised data into a dynamic, on-policy signal at the token level, capturing the generalization benefits of on-policy learning while bypassing the costly overhead of full sentence generation. Through extensive experiments on a diverse suite of challenging benchmarks spanning mathematical reasoning, code generation, and general domain reasoning, we demonstrate that OTR consistently outperforms standard SFT. Our findings establish OTR as a powerful and practical alternative for fine-tuning LLMs and provide compelling evidence that the on-policy nature of data is a critical driver of generalization, offering a promising new direction for fine-tuning LLMs.
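
A minimal PyTorch sketch of the per-token policy-gradient idea as read from the abstract: sample several candidate tokens per position from the current policy and reward those that match the ground-truth token. The reward definition, the mean baseline, and names such as `otr_loss` and `policy_logits` are assumptions, not the authors' exact formulation.

```python
# Hedged sketch of the one-token-rollout idea: at each position, sample K
# candidate tokens from the current policy and reward the ones that match the
# ground-truth token, then apply a REINFORCE-style update.
import torch
import torch.nn.functional as F

def otr_loss(policy_logits: torch.Tensor,   # (seq_len, vocab_size), current policy
             target_ids: torch.Tensor,      # (seq_len,), ground-truth tokens
             num_samples: int = 4) -> torch.Tensor:
    log_probs = F.log_softmax(policy_logits, dim=-1)                   # (T, V)
    probs = log_probs.exp()
    # "Rollout": sample K candidate tokens per position from the current policy.
    samples = torch.multinomial(probs, num_samples, replacement=True)  # (T, K)
    # Reward signal from supervised data: 1 if a sample hits the ground truth.
    rewards = (samples == target_ids.unsqueeze(1)).float()             # (T, K)
    # Mean reward per position as a baseline to reduce variance.
    advantages = rewards - rewards.mean(dim=1, keepdim=True)
    sample_log_probs = log_probs.gather(1, samples)                    # (T, K)
    # Policy-gradient objective: maximize advantage-weighted log-probability.
    return -(advantages.detach() * sample_log_probs).mean()

# Toy usage with random logits (no real model attached).
logits = torch.randn(8, 100, requires_grad=True)
targets = torch.randint(0, 100, (8,))
loss = otr_loss(logits, targets)
loss.backward()
```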


Cross-modal RAG: Sub-dimensional Text-to-Image Retrieval-Augmented Generation

arXiv:2505.21956v3 Announce Type: replace-cross Abstract: Text-to-image generation increasingly demands access to domain-specific, fine-grained, and rapidly evolving knowledge that pretrained models cannot fully capture, necessitating the integration of retrieval methods. Existing Retrieval-Augmented Generation (RAG) methods attempt to address this by retrieving globally relevant images, but they fail when no single image contains all desired elements from a complex user query. We propose Cross-modal RAG, a novel framework that decomposes both queries and images into sub-dimensional components, enabling subquery-aware retrieval and generation. Our method introduces a hybrid retrieval strategy – combining a sub-dimensional sparse retriever with a dense retriever – to identify a Pareto-optimal set of images, each contributing complementary aspects of the query. During generation, a multimodal large language model is guided to selectively condition on relevant visual features aligned to specific subqueries, ensuring subquery-aware image synthesis. Extensive experiments on MS-COCO, Flickr30K, WikiArt, CUB, and ImageNet-LT demonstrate that Cross-modal RAG significantly outperforms existing baselines in retrieval and further contributes to generation quality, while maintaining high efficiency.
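
A hedged sketch of the Pareto-optimal selection step, assuming each candidate image already carries a relevance score per subquery; the scores and the `pareto_optimal` helper are illustrative, and the paper's hybrid sparse+dense retriever is not reproduced here.

```python
# Hedged sketch: given per-subquery relevance scores for each candidate image,
# keep the Pareto-optimal images (no other image is at least as relevant on
# every subquery and strictly better on at least one).

def pareto_optimal(scores: dict[str, list[float]]) -> list[str]:
    def dominates(a: list[float], b: list[float]) -> bool:
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    items = list(scores.items())
    return [
        name for name, vec in items
        if not any(dominates(other_vec, vec)
                   for other_name, other_vec in items if other_name != name)
    ]

# Toy example: 3 images scored against 2 subqueries ("a red bird", "snowy branch").
image_scores = {
    "img_a": [0.9, 0.1],   # strong on subquery 1 only
    "img_b": [0.2, 0.8],   # strong on subquery 2 only
    "img_c": [0.1, 0.1],   # dominated by both others -> dropped
}
print(pareto_optimal(image_scores))  # ['img_a', 'img_b']
```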


ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

arXiv:2509.15235v4 Announce Type: replace-cross Abstract: Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups.


Can Large Language Models Express Uncertainty Like Human?

arXiv:2509.24202v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used in high-stakes settings, where overconfident responses can mislead users. Reliable confidence estimation has been shown to enhance trust and task accuracy. Yet existing methods face practical barriers: logits are often hidden, multi-sampling is computationally expensive, and verbalized numerical uncertainty (e.g., giving a 0-100 score) deviates from natural communication. We revisit linguistic confidence (LC), where models express uncertainty through hedging language (e.g., probably, might), offering a lightweight and human-centered alternative. To advance this direction, we (1) release the first diverse, large-scale dataset of hedging expressions with human-annotated confidence scores, and (2) propose a lightweight mapper that converts hedges into confidence scores at near-zero cost. Building on these resources, we (3) conduct the first systematic study of LC across modern LLMs and QA benchmarks, revealing that while most LLMs underperform in expressing reliable LC, carefully designed prompting achieves competitive calibration and discriminability. Finally, we (4) introduce a fine-tuning framework that further improves LC reliability. Taken together, our work positions linguistic confidence as a scalable, efficient, and human-aligned approach to LLM uncertainty estimation, and calls for deeper exploration of this promising yet underexplored direction.
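
A minimal sketch of what a hedge-to-confidence mapper could look like. The lexicon and numeric scores below are invented placeholders; the paper's mapper is instead built on its released dataset of human-annotated confidence scores.

```python
# Hedged sketch of a lightweight hedge-to-confidence mapper. The hedge lexicon
# and the numeric scores are invented for illustration only.
import re

HEDGE_SCORES = {          # hypothetical hedge -> confidence in [0, 1]
    "definitely": 0.95,
    "probably": 0.75,
    "likely": 0.70,
    "might": 0.45,
    "possibly": 0.40,
    "unlikely": 0.20,
}

def linguistic_confidence(answer: str, default: float = 0.6) -> float:
    """Return the confidence implied by the strongest hedge found in the answer."""
    found = [score for hedge, score in HEDGE_SCORES.items()
             if re.search(rf"\b{hedge}\b", answer.lower())]
    return min(found) if found else default  # be conservative if several hedges appear

print(linguistic_confidence("The capital is probably Canberra."))  # 0.75
```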


Multimodal Large Language Models Meet Multimodal Emotion Recognition and Reasoning: A Survey

arXiv:2509.24322v1 Announce Type: new Abstract: In recent years, large language models (LLMs) have driven major advances in language understanding, marking a significant step toward artificial general intelligence (AGI). With increasing demands for higher-level semantics and cross-modal fusion, multimodal large language models (MLLMs) have emerged, integrating diverse information sources (e.g., text, vision, and audio) to enhance modeling and reasoning in complex scenarios. In AI for Science, multimodal emotion recognition and reasoning has become a rapidly growing frontier. While LLMs and MLLMs have achieved notable progress in this area, the field still lacks a systematic review that consolidates recent developments. To address this gap, this paper provides a comprehensive survey of LLMs and MLLMs for emotion recognition and reasoning, covering model architectures, datasets, and performance benchmarks. We further highlight key challenges and outline future research directions, aiming to offer researchers both an authoritative reference and practical insights for advancing this domain. To the best of our knowledge, this paper is the first attempt to comprehensively survey the intersection of MLLMs with multimodal emotion recognition and reasoning. The summary of existing methods mentioned is in our GitHub: https://github.com/yuntaoshou/Awesome-Emotion-Reasoning.


Latent Visual Reasoning

arXiv:2509.24251v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) have achieved notable gains in various tasks by incorporating Chain-of-Thought (CoT) reasoning in language spaces. Recent work extends this direction by leveraging external tools for visual editing, thereby enhancing the visual signal along the reasoning trajectories. Nevertheless, these approaches remain fundamentally constrained: reasoning is still confined to the language space, with visual information treated as static preconditions. We introduce Latent Visual Reasoning (LVR), a new paradigm that enables autoregressive reasoning directly in the visual embedding space. A visual encoder first projects images into visual tokens within a joint semantic space shared with the language model. The language model is then trained to generate latent states that reconstruct key visual tokens critical for answering the query, constituting the process of latent visual reasoning. By interleaving LVR with standard text generation, our model achieves substantial gains on perception-intensive visual question answering tasks. In addition, we adapt the GRPO algorithm to conduct reinforcement learning on latent reasoning, further balancing LVR and textual generation. We show that LVR substantially improves fine-grained visual understanding and perception, achieving 71.67% on MMVP compared to 66.67% with Qwen2.5-VL. Code base and model weights will be released later.
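
A hedged sketch of the kind of interleaved objective the abstract suggests: standard next-token cross-entropy on text positions plus a reconstruction term pulling latent states toward the key visual tokens they should recover. The loss form, shapes, and weighting are assumptions, not the released implementation.

```python
# Hedged sketch of an LVR-style objective: cross-entropy on text positions plus
# a loss pulling hidden "latent reasoning" states toward the visual tokens they
# should reconstruct. All shapes, masks, and the loss weight are assumptions.
import torch
import torch.nn.functional as F

def lvr_loss(text_logits,        # (T, V) logits at text positions
             text_targets,       # (T,) next-token ids at text positions
             latent_states,      # (L, D) hidden states at latent-reasoning positions
             key_visual_tokens,  # (L, D) visual-encoder embeddings to reconstruct
             recon_weight: float = 1.0):
    ce = F.cross_entropy(text_logits, text_targets)
    # Cosine-based reconstruction: encourage each latent state to match its
    # target visual token (mean squared error would be another reasonable choice).
    recon = 1.0 - F.cosine_similarity(latent_states, key_visual_tokens, dim=-1).mean()
    return ce + recon_weight * recon

# Toy tensors standing in for a real MLLM forward pass.
loss = lvr_loss(torch.randn(16, 1000), torch.randint(0, 1000, (16,)),
                torch.randn(4, 256), torch.randn(4, 256))
```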


MDAR: A Multi-scene Dynamic Audio Reasoning Benchmark

arXiv:2509.22461v1 Announce Type: cross Abstract: The ability to reason from audio, including speech, paralinguistic cues, environmental sounds, and music, is essential for AI agents to interact effectively in real-world scenarios. Existing benchmarks mainly focus on static or single-scene settings and do not fully capture scenarios where multiple speakers, unfolding events, and heterogeneous audio sources interact. To address these challenges, we introduce MDAR, a benchmark for evaluating models on complex, multi-scene, and dynamically evolving audio reasoning tasks. MDAR comprises 3,000 carefully curated question-answer pairs linked to diverse audio clips, covering five categories of complex reasoning and spanning three question types. We benchmark 26 state-of-the-art audio language models on MDAR and observe that they exhibit limitations in complex reasoning tasks. On single-choice questions, Qwen2.5-Omni (open-source) achieves 76.67% accuracy, whereas GPT-4o Audio (closed-source) reaches 68.47%; however, GPT-4o Audio substantially outperforms Qwen2.5-Omni on the more challenging multiple-choice and open-ended tasks. Across all three question types, no model achieves 80% performance. These findings underscore the unique challenges posed by MDAR and its value as a benchmark for advancing audio reasoning research. Code and benchmark can be found at https://github.com/luckyerr/MDAR.
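
A small sketch of how such results might be broken down by question type (single-choice, multiple-choice, open-ended). The record fields are assumptions about the data layout, and open-ended answers would in practice need an LLM or human judge rather than exact string matching.

```python
# Hedged sketch of aggregating benchmark results per question type. Field names
# and exact-match scoring are assumptions, not the MDAR evaluation protocol.
from collections import defaultdict

def accuracy_by_type(records: list[dict]) -> dict[str, float]:
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        qtype = r["question_type"]
        total[qtype] += 1
        correct[qtype] += int(r["prediction"] == r["answer"])
    return {t: correct[t] / total[t] for t in total}

results = [
    {"question_type": "single-choice", "prediction": "B", "answer": "B"},
    {"question_type": "single-choice", "prediction": "C", "answer": "A"},
    {"question_type": "multiple-choice", "prediction": "A,C", "answer": "A,C"},
]
print(accuracy_by_type(results))  # {'single-choice': 0.5, 'multiple-choice': 1.0}
```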


Context Parametrization with Compositional Adapters

arXiv:2509.22158v1 Announce Type: new Abstract: Large language models (LLMs) often seamlessly adapt to new tasks through in-context learning (ICL) or supervised fine-tuning (SFT). However, both of these approaches face key limitations: ICL is inefficient when handling many demonstrations, and SFT incurs training overhead while sacrificing flexibility. Mapping instructions or demonstrations from context directly into adapter parameters offers an appealing alternative. While prior work explored generating adapters based on a single input context, it has overlooked the need to integrate multiple chunks of information. To address this gap, we introduce CompAs, a meta-learning framework that translates context into adapter parameters with a compositional structure. Adapters generated this way can be merged algebraically, enabling instructions, demonstrations, or retrieved passages to be seamlessly combined without reprocessing long prompts. Critically, this approach yields three benefits: lower inference cost, robustness to long-context instability, and a principled solution when the input exceeds the model’s context window. Furthermore, CompAs encodes information into adapter parameters in a reversible manner, enabling recovery of input context through a decoder, facilitating safety and security. Empirical results on diverse multiple-choice and extractive question answering tasks show that CompAs outperforms ICL and prior generator-based methods, especially when scaling to more inputs. Our work establishes composable adapter generation as a practical and efficient alternative for scaling LLM deployment.
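
A hedged sketch of what merging adapters algebraically could look like under a LoRA-style parameterization. The hypernetwork that maps context chunks to adapter weights and the reversible encoding are not shown, and the weighted-sum merge is an assumption rather than the paper's stated rule.

```python
# Hedged sketch of composing per-context adapters algebraically, assuming a
# LoRA-like parameterization (delta_W = B @ A per target layer). Only the merge
# step is illustrated; the context-to-adapter generator is not reproduced.
import torch

def merge_adapters(adapters: list[dict[str, torch.Tensor]],
                   weights: list[float] | None = None) -> dict[str, torch.Tensor]:
    """Combine adapters produced from different context chunks by weighted sum."""
    weights = weights or [1.0 / len(adapters)] * len(adapters)
    merged = {}
    for name in adapters[0]:
        merged[name] = sum(w * a[name] for w, a in zip(weights, adapters))
    return merged

# Two hypothetical adapters: one generated from an instruction chunk, one from a
# demonstration chunk, each holding a low-rank update for the same linear layer.
adapter_instr = {"layer0.lora_A": torch.randn(8, 512), "layer0.lora_B": torch.randn(512, 8)}
adapter_demo  = {"layer0.lora_A": torch.randn(8, 512), "layer0.lora_B": torch.randn(512, 8)}
combined = merge_adapters([adapter_instr, adapter_demo])
```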


Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs

arXiv:2509.22646v1 Announce Type: cross Abstract: Can humans identify AI-generated (fake) videos and provide grounded reasons? While video generation models have advanced rapidly, a critical dimension — whether humans can detect deepfake traces within a generated video, i.e., spatiotemporal grounded visual artifacts that reveal a video as machine generated — has been largely overlooked. We introduce DeeptraceReward, the first fine-grained, spatially- and temporally-aware benchmark that annotates human-perceived fake traces for video generation reward. The dataset comprises 4.3K detailed annotations across 3.3K high-quality generated videos. Each annotation provides a natural-language explanation, pinpoints a bounding-box region containing the perceived trace, and marks precise onset and offset timestamps. We consolidate these annotations into 9 major categories of deepfake traces that lead humans to identify a video as AI-generated, and train multimodal language models (LMs) as reward models to mimic human judgments and localizations. On DeeptraceReward, our 7B reward model outperforms GPT-5 by 34.7% on average across fake clue identification, grounding, and explanation. Interestingly, we observe a consistent difficulty gradient: binary fake vs. real classification is substantially easier than fine-grained deepfake trace detection; within the latter, performance degrades from natural language explanations (easiest), to spatial grounding, to temporal labeling (hardest). By foregrounding human-perceived deepfake traces, DeeptraceReward provides a rigorous testbed and training signal for socially aware and trustworthy video generation.
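
A small sketch of the annotation structure the abstract describes (natural-language explanation, bounding box, onset/offset timestamps, trace category). The field names and example values are assumptions for illustration, not the released schema.

```python
# Hedged sketch of a DeeptraceReward-style annotation record. Field names and
# the example values are assumptions, not the dataset's actual schema.
from dataclasses import dataclass

@dataclass
class DeepfakeTraceAnnotation:
    video_id: str
    explanation: str                         # natural-language reason the clip looks fake
    bbox: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized
    onset_s: float                           # when the trace appears (seconds)
    offset_s: float                          # when it disappears (seconds)
    category: str                            # one of the 9 major trace categories

ann = DeepfakeTraceAnnotation(
    video_id="vid_0001",
    explanation="The hand briefly shows six fingers while waving.",
    bbox=(0.42, 0.55, 0.61, 0.78),
    onset_s=1.8,
    offset_s=2.4,
    category="anatomical implausibility",  # hypothetical category label
)
```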
