
Locate-and-Focus: Enhancing Terminology Translation in Speech Language Models

arXiv:2507.18263v1 Announce Type: new Abstract: Direct speech translation (ST) has garnered increasing attention in recent years, yet accurately translating terminology within utterances remains a significant challenge. Current studies mainly concentrate on incorporating various forms of translation knowledge into ST models. However, these methods often suffer from interference by irrelevant noise and cannot fully utilize the translation knowledge. To address these issues, in this paper we propose a novel Locate-and-Focus method for terminology translation. It first locates the speech clips containing terminologies within the utterance to construct translation knowledge, minimizing irrelevant information for the ST model. It then associates the translation knowledge with the utterance and hypothesis from both the audio and textual modalities, allowing the ST model to better focus on the translation knowledge during translation. Experimental results across various datasets demonstrate that our method effectively locates terminologies within utterances and improves the success rate of terminology translation, while maintaining robust general translation performance.
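
The abstract describes the method only at a high level. As a purely illustrative, hypothetical sketch (not the authors' code), the locate step can be thought of as thresholding per-frame terminology scores into clips, and the focus step as a cross-attention that fuses retrieved translation knowledge into the decoder states:

```python
# Minimal sketch of the Locate-and-Focus idea described in the abstract.
# All names (frame_term_scores, knowledge_states, etc.) are illustrative
# assumptions, not the authors' actual interface.
import torch

def locate_term_clips(frame_term_scores: torch.Tensor, threshold: float = 0.5):
    """Step 1 (Locate): pick contiguous frame spans whose terminology score
    exceeds a threshold, so only relevant clips feed the knowledge store."""
    mask = frame_term_scores > threshold
    spans, start = [], None
    for i, m in enumerate(mask.tolist()):
        if m and start is None:
            start = i
        elif not m and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(mask)))
    return spans

def focus_on_knowledge(hypothesis_states, knowledge_states, scale=1.0):
    """Step 2 (Focus): cross-attention that lets decoder states attend to
    terminology translation knowledge, here fused with a residual add."""
    attn = torch.softmax(hypothesis_states @ knowledge_states.T * scale, dim=-1)
    return hypothesis_states + attn @ knowledge_states

# Toy usage
scores = torch.tensor([0.1, 0.2, 0.8, 0.9, 0.3, 0.7, 0.6, 0.2])
print(locate_term_clips(scores))           # [(2, 4), (5, 7)]
h, k = torch.randn(4, 16), torch.randn(3, 16)
print(focus_on_knowledge(h, k).shape)      # torch.Size([4, 16])
```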

System Report for CCL25-Eval Task 10: SRAG-MAV for Fine-Grained Chinese Hate Speech Recognition

arXiv:2507.18580v1 Announce Type: new Abstract: This paper presents our system for CCL25-Eval Task 10, addressing Fine-Grained Chinese Hate Speech Recognition (FGCHSR). We propose a novel SRAG-MAV framework that synergistically integrates task reformulation (TR), Self-Retrieval-Augmented Generation (SRAG), and Multi-Round Accumulative Voting (MAV). Our method reformulates the quadruplet extraction task into triplet extraction, uses dynamic retrieval from the training set to create contextual prompts, and applies multi-round inference with voting to improve output stability and performance. Our system, based on the Qwen2.5-7B model, achieves a Hard Score of 26.66, a Soft Score of 48.35, and an Average Score of 37.505 on the STATE ToxiCN dataset, significantly outperforming baselines such as GPT-4o (Average Score 15.63) and fine-tuned Qwen2.5-7B (Average Score 35.365). The code is available at https://github.com/king-wang123/CCL25-SRAG-MAV.
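
The two main ingredients, self-retrieval-augmented prompting and multi-round accumulative voting, can be illustrated with a short, hedged sketch; the toy similarity function, prompt format, and generate_fn below are stand-ins rather than the released CCL25-SRAG-MAV implementation:

```python
# Hedged sketch of the SRAG-MAV recipe: retrieval-augmented prompting plus
# multi-round accumulative voting over repeated generations.
from collections import Counter

def similarity(a: str, b: str) -> float:
    """Toy lexical overlap; a real system would use dense embeddings."""
    sa, sb = set(a), set(b)  # character sets are a rough fit for Chinese text
    return len(sa & sb) / max(1, len(sa | sb))

def build_prompt(query: str, train_set: list, k: int = 3) -> str:
    """SRAG step: retrieve the k most similar training items as in-context examples."""
    demos = sorted(train_set, key=lambda ex: -similarity(query, ex["text"]))[:k]
    demo_text = "\n".join(f"Input: {d['text']}\nOutput: {d['triplet']}" for d in demos)
    return f"{demo_text}\nInput: {query}\nOutput:"

def mav_predict(query: str, train_set: list, generate_fn, rounds: int = 5) -> str:
    """MAV step: sample several rounds and return the most frequent output."""
    prompt = build_prompt(query, train_set)
    votes = Counter(generate_fn(prompt) for _ in range(rounds))
    return votes.most_common(1)[0][0]

# Toy usage with a dummy generator standing in for Qwen2.5-7B
train = [{"text": "example A", "triplet": "(target, argument, group)"}]
dummy = lambda prompt: "(target, argument, group)"
print(mav_predict("example B", train, dummy))
```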

GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs

arXiv:2507.18043v1 Announce Type: new Abstract: Inference-time steering methods offer a lightweight alternative to fine-tuning large language models (LLMs) and vision-language models (VLMs) by modifying internal activations at test time without updating model weights. However, most existing approaches rely on fixed, global intervention vectors, overlook the causal influence of individual input tokens, and fail to leverage informative gradients from the model’s logits, particularly in multimodal settings where visual and textual inputs contribute unevenly. To address these limitations, we introduce GrAInS, an inference-time steering approach that operates across both language-only and vision-language models and tasks. GrAInS uses contrastive, gradient-based attribution via Integrated Gradients to identify the top-k most influential tokens, both positively and negatively attributed based on their contribution to preferred versus dispreferred outputs. These tokens are then used to construct directional steering vectors that capture semantic shifts from undesirable to desirable behavior. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. This enables fine-grained, interpretable, and modular control over model behavior, without retraining or auxiliary supervision. Empirically, GrAInS consistently outperforms both fine-tuning and existing steering baselines: it achieves a 13.22% accuracy gain on TruthfulQA using Llama-3.1-8B, reduces hallucination rates on MMHal-Bench from 0.624 to 0.514 with LLaVA-1.6-7B, and improves alignment win rates on SPA-VL by 8.11%, all while preserving the model’s fluency and general capabilities.
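
The pipeline in the abstract (gradient-based attribution, top-k token selection, steering-vector construction, and norm-preserving activation shifts) can be sketched roughly as follows; the toy scoring head, hyperparameters, and vector construction are assumptions for illustration, not the authors' released code, and the single scalar score stands in for the contrastive preferred-versus-dispreferred objective:

```python
# Minimal, assumption-laden sketch of the GrAInS recipe: token-level
# Integrated Gradients -> top-k tokens -> directional steering vector ->
# norm-preserving activation shift.
import torch

torch.manual_seed(0)
d = 16
W = torch.randn(d, 1)                      # stand-in "preference" head

def model_score(h):                        # scalar score per sequence
    return (h.mean(dim=0) @ W).squeeze()

def integrated_gradients(h, steps=32):
    """Attribution of each token's hidden state w.r.t. the score."""
    baseline = torch.zeros_like(h)
    total = torch.zeros_like(h)
    for alpha in torch.linspace(0, 1, steps):
        x = (baseline + alpha * (h - baseline)).requires_grad_(True)
        model_score(x).backward()
        total += x.grad
    return ((h - baseline) * total / steps).sum(dim=-1)   # one score per token

hidden = torch.randn(10, d)                # 10 token hidden states
attr = integrated_gradients(hidden)
k = 3
pos = hidden[attr.topk(k).indices].mean(dim=0)            # positively attributed
neg = hidden[(-attr).topk(k).indices].mean(dim=0)         # negatively attributed
steer = pos - neg                          # "undesirable -> desirable" direction

def steer_layer(h, v, alpha=4.0):
    """Add the steering vector, then rescale to preserve each token's norm."""
    shifted = h + alpha * v
    return shifted * (h.norm(dim=-1, keepdim=True) / shifted.norm(dim=-1, keepdim=True))

print(steer_layer(hidden, steer).shape)    # torch.Size([10, 16])
```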

P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts

arXiv:2406.12548v3 Announce Type: replace Abstract: Personalized large language models (LLMs) have attracted great attention in many applications, such as emotional support and role-playing. However, existing works primarily focus on modeling explicit character profiles, while ignoring the underlying personality traits that truly shape behaviors and decision-making, hampering the development of more anthropomorphic and psychologically grounded AI systems. In this paper, we explore the modeling of the Big Five personality traits, the most widely used trait theory in psychology, and propose P-React, a mixture-of-experts (MoE)-based personalized LLM. In particular, we integrate a Personality Specialization Loss (PSL) to better capture individual trait expressions, providing a more nuanced and psychologically grounded personality simulacrum. To facilitate research in this field, we curate OCEAN-Chat, a high-quality, human-verified dataset designed to train LLMs in expressing personality traits across diverse topics. Extensive experiments demonstrate the effectiveness of P-React in maintaining consistent and realistic personality expression.
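
Since the abstract does not spell out the Personality Specialization Loss or the expert routing, the following is only a rough, assumption-laden sketch of a mixture of LoRA experts with one expert per Big Five trait and an auxiliary loss nudging the router toward the labeled trait; none of it should be read as the paper's actual formulation:

```python
# Hedged sketch of a Big-Five MoE-of-LoRA layer with a specialization loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpert(nn.Module):
    def __init__(self, d, r=8):
        super().__init__()
        self.A = nn.Linear(d, r, bias=False)
        self.B = nn.Linear(r, d, bias=False)
        nn.init.zeros_(self.B.weight)          # standard LoRA init: start as identity
    def forward(self, x):
        return self.B(self.A(x))               # low-rank update only (base layer omitted)

class PersonalityMoE(nn.Module):
    """Five LoRA experts, one per OCEAN trait, mixed by a learned router."""
    def __init__(self, d, num_traits=5):
        super().__init__()
        self.experts = nn.ModuleList(LoRAExpert(d) for _ in range(num_traits))
        self.router = nn.Linear(d, num_traits)
    def forward(self, x, trait_label=None):
        logits = self.router(x.mean(dim=1))                        # (B, 5)
        gate = F.softmax(logits, dim=-1)
        delta = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, T, d, 5)
        out = x + (delta * gate[:, None, None, :]).sum(-1)
        # Assumed specialization loss: push the router toward the target trait's expert.
        psl = F.cross_entropy(logits, trait_label) if trait_label is not None else None
        return out, psl

# Toy usage
moe = PersonalityMoE(d=32)
x = torch.randn(4, 10, 32)                     # batch of 4 sequences, 10 tokens each
out, psl = moe(x, trait_label=torch.tensor([0, 1, 2, 3]))
print(out.shape, psl.item())
```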

DualDistill and Agentic-R1: How AI Combines Natural Language and Tool Use for Superior Math Problem Solving

Existing long-CoT reasoning models have achieved state-of-the-art performance in mathematical reasoning by generating reasoning trajectories with iterative self-verification and refinement. However, open-source long-CoT models rely only on natural language reasoning traces, making them computationally expensive and prone to errors without verification mechanisms. Although tool-aided reasoning provides greater efficiency and reliability for large-scale numerical computations through frameworks like OpenHands that integrate code interpreters, these agentic approaches struggle with abstract or conceptually complex reasoning problems.

DualDistill Framework and Agentic-R1 Model

Researchers from Carnegie Mellon University have proposed DualDistill, a distillation framework that combines trajectories from two complementary teachers to create a unified student model. The framework pairs a reasoning-oriented teacher with a tool-augmented teacher to develop Agentic-R1, a model that learns to dynamically select the most appropriate strategy for each problem type. Agentic-R1 executes code for arithmetic and algorithmic tasks while employing natural language reasoning for abstract problems. DualDistill uses trajectory composition to distill knowledge from both complementary teachers, followed by self-distillation. The researchers used OpenHands as the agentic reasoning teacher and DeepSeek-R1 as the text-based reasoning teacher.

Paper: https://arxiv.org/abs/2507.05707

Evaluation and Benchmarks

The proposed method is evaluated across multiple benchmarks, such as DeepMath-L and Combinatorics300, covering various aspects of mathematical reasoning, and is compared against the baselines DeepSeek-R1-Distill and Qwen-2.5-Instruct. The student model, Agentic-R1, shows substantial performance improvements by drawing on both agentic and reasoning strategies. It outperforms two similarly sized models, each specializing in tool-assisted (Qwen2.5-7B-Instruct) or pure reasoning (DeepSeek-R1-Distill-7B) strategies. Agentic-R1 surpasses tool-based models by invoking reasoning strategies when required, while maintaining greater efficiency than pure reasoning models on standard mathematical tasks.

Qualitative Analysis and Tool Usage Patterns

Qualitative examples show that Agentic-R1 exhibits intelligent tool usage patterns, activating code execution tools in 79.2% of computationally demanding Combinatorics300 problems while reducing activation to 52.0% on the simpler AMC dataset. Agentic-R1 learns to invoke tools appropriately through supervised fine-tuning alone, without explicit instruction, effectively balancing computational efficiency and reasoning accuracy.

Robustness to Imperfect Teachers

The framework remains effective even when guided by imperfect teachers. For instance, the agentic teacher achieves only 48.4% accuracy on Combinatorics300, yet the student model improves from 44.7% to 50.9%, ultimately outperforming the teacher.

Conclusion

In summary, the DualDistill framework combines the strengths of natural language reasoning and tool-assisted problem solving by distilling complementary knowledge from two specialized teacher models into a single versatile student model, Agentic-R1. Through trajectory composition and self-distillation, Agentic-R1 learns to dynamically select the most appropriate strategy for each problem, balancing precision and computational efficiency. Evaluations across diverse mathematical reasoning benchmarks show that Agentic-R1 outperforms both pure reasoning and tool-based models, even when learning from imperfect teachers. This work points to a promising approach for building adaptable AI agents that integrate heterogeneous problem-solving strategies for more robust and efficient reasoning. Check out the Paper and GitHub Page.
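
The article does not include Agentic-R1's inference code; as a hedged illustration of the strategy-switching behavior it describes, the sketch below executes a generated code block when one is present and otherwise returns the natural-language answer. The generate callable and the <python> tag convention are assumptions, not the released implementation:

```python
# Hedged sketch of a hybrid tool-or-text inference loop in the spirit of Agentic-R1.
import re
import subprocess
import sys
import tempfile

CODE_RE = re.compile(r"<python>(.*?)</python>", re.DOTALL)

def run_snippet(code: str, timeout: int = 10) -> str:
    """Execute generated code in a subprocess and capture its output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True,
                          text=True, timeout=timeout)
    return proc.stdout.strip() or proc.stderr.strip()

def solve(problem: str, generate) -> str:
    """Dispatch between tool use and pure text reasoning based on the output."""
    trajectory = generate(problem)
    match = CODE_RE.search(trajectory)
    if match:                                   # agentic path: execute the tool call
        return run_snippet(match.group(1))
    return trajectory                           # reasoning path: the text is the answer

# Toy usage with a dummy generator that always chooses the tool path
dummy = lambda p: "We can just compute this.\n<python>\nprint(2**10)\n</python>"
print(solve("What is 2 to the 10th power?", dummy))   # -> 1024
```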

Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning

arXiv:2507.16802v2 Announce Type: replace Abstract: Large Language Models (LLMs) exhibit considerable promise in financial applications; however, prevailing models frequently demonstrate limitations when confronted with scenarios that necessitate sophisticated reasoning capabilities, stringent trustworthiness criteria, and efficient adaptation to domain-specific requirements. We introduce the Agentar-Fin-R1 series of financial large language models (8B and 32B parameters), specifically engineered on the Qwen3 foundation model to enhance reasoning capabilities, reliability, and domain specialization for financial applications. Our optimization approach integrates a high-quality, systematic financial task label system with a comprehensive multi-layered trustworthiness assurance framework. This framework encompasses high-quality trustworthy knowledge engineering, multi-agent trustworthy data synthesis, and rigorous data validation governance. Through label-guided automated difficulty-aware optimization, a two-stage training pipeline, and dynamic attribution systems, we achieve substantial improvements in training efficiency. Our models undergo comprehensive evaluation on mainstream financial benchmarks including Fineva, FinEval, and FinanceIQ, as well as general reasoning datasets such as MATH-500 and GPQA-diamond. To thoroughly assess real-world deployment capabilities, we innovatively propose the Finova evaluation benchmark, which focuses on agent-level financial reasoning and compliance verification. Experimental results demonstrate that Agentar-Fin-R1 not only achieves state-of-the-art performance on financial tasks but also exhibits exceptional general reasoning capabilities, validating its effectiveness as a trustworthy solution for high-stakes financial applications. The Finova benchmark is available at https://github.com/antgroup/Finova.

GPT-4o Understands Text, But Does It See Clearly? A Benchmarking Study of MFMs on Vision Tasks

Multimodal foundation models (MFMs) like GPT-4o, Gemini, and Claude have shown rapid progress recently, especially in public demos. While their language skills are well studied, their true ability to understand visual information remains unclear. Most benchmarks used today focus heavily on text-based tasks, such as VQA or classification, which often reflect language strengths more than visual capabilities. These tests also require text outputs, making it difficult to fairly assess visual skills or compare MFMs with vision-specific models. Moreover, critical aspects such as 3D perception, segmentation, and grouping, which are core to visual understanding, are still largely overlooked in current evaluations.

MFMs have demonstrated strong performance on tasks that combine visual and language understanding, such as captioning and visual question answering. However, their effectiveness on tasks that require detailed visual comprehension remains unclear. Most current benchmarks rely on text-based outputs, making it difficult to compare MFMs with vision-only models fairly. Some studies attempt to adapt vision datasets for MFMs by converting annotations into text, but this restricts evaluation to language outputs. Prompting strategies have also been explored to help MFMs tackle visual tasks by breaking them into manageable subtasks, though reproducibility remains a challenge in some cases.

Researchers at EPFL evaluated several popular multimodal foundation models, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet, on core computer vision tasks such as segmentation, object detection, and depth prediction, using datasets like COCO and ImageNet. Since most MFMs are designed to output text and are only accessible via APIs, they developed a prompt-chaining framework to translate these visual tasks into text-compatible formats. Their findings show that while MFMs are competent generalists, they fall short of specialized vision models, especially on geometric tasks. GPT-4o stood out, performing best in 4 out of 6 tasks. The evaluation toolkit will be open-sourced.

To evaluate MFMs on vision tasks, the study designed a prompt-chaining strategy that breaks complex tasks into simpler, language-friendly subtasks. For example, instead of predicting bounding boxes directly, the model first identifies the objects present, then locates them through recursive image cropping. For segmentation and grouping, images are divided into superpixels, which are easier to label and compare. Depth and surface normals are estimated using pairwise rankings of superpixel regions. This modular design leverages MFMs' strength in classification and similarity, while calibration controls ensure fair comparisons. The method is flexible, and performance improves with finer-grained prompting.

The study evaluates various MFMs, including GPT-4o, Gemini Flash, and Claude 3.5, across multiple tasks such as image classification, object detection, and segmentation. On datasets including ImageNet, COCO, and Hypersim, GPT-4o reaches 77.2% on ImageNet and 60.62 AP50 for object detection, outperformed by specialist models like ViT-G (90.94%) and Co-DETR (91.30%). For semantic segmentation, GPT-4o reaches 44.89 mIoU, while OneFormer leads with 65.52. MFMs handle distribution shifts reasonably well but lag on precise visual reasoning. The study also introduces prompt chaining and oracle baselines to evaluate upper-bound performance.
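
As a concrete, purely hypothetical illustration of the recursive-cropping prompt chain described above, the sketch below repeatedly asks an MFM which quadrant of the current crop contains the target object and zooms in; ask_mfm is a placeholder for an API call and is not part of the released evaluation toolkit:

```python
# Hedged sketch of recursive-crop localization via prompt chaining.
from PIL import Image

def ask_mfm(image: Image.Image, question: str) -> str:
    """Placeholder for an API call to GPT-4o / Gemini / Claude."""
    raise NotImplementedError("wire this to your MFM provider")

def locate(image: Image.Image, target: str, min_size: int = 64, box=None):
    """Recursively narrow the region that contains `target`; returns (l, t, r, b)."""
    if box is None:
        box = (0, 0, image.width, image.height)
    l, t, r, b = box
    if min(r - l, b - t) <= min_size:
        return box                                      # small enough: stop here
    cx, cy = (l + r) // 2, (t + b) // 2
    quadrants = {"top-left": (l, t, cx, cy), "top-right": (cx, t, r, cy),
                 "bottom-left": (l, cy, cx, b), "bottom-right": (cx, cy, r, b)}
    crop = image.crop(box)
    answer = ask_mfm(crop, f"Which quadrant contains the {target}? "
                           "Answer with one of: top-left, top-right, "
                           "bottom-left, bottom-right, or none.")
    key = answer.strip().lower()
    if key not in quadrants:                            # object spans quadrants: stop
        return box
    return locate(image, target, min_size, quadrants[key])
```

In this scheme the model only ever answers a multiple-choice question, which plays to MFMs' classification strengths while still yielding an approximate bounding box.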
In conclusion, the study introduces a benchmarking framework to assess the visual capabilities of MFMs, such as GPT-4o, Gemini, and Claude, by converting standard vision tasks into prompt-based formats. Findings show that MFMs perform better on semantic tasks than geometric ones, with GPT-4o leading overall, yet all MFMs lag significantly behind task-specific vision models. Despite being generalists trained primarily on image-text data, they show promising progress, with newer reasoning models such as o3 improving on 3D tasks. Limitations include high inference cost and prompt sensitivity. Still, this framework provides a unified approach to evaluating MFMs' visual understanding, laying the groundwork for future advancements. Check out the Paper, GitHub Page and Project.

This AI Paper Introduces PyVision: A Python-Centric Framework Where AI Writes Tools as It Thinks

Visual reasoning tasks challenge artificial intelligence models to interpret and process visual information using both perception and logical reasoning. These tasks span a wide range of applications, including medical diagnostics, visual math, symbolic puzzles, and image-based question answering. Success in this field requires more than object recognition: it demands dynamic adaptation, abstraction, and contextual inference. Models must analyze images, identify relevant features, and often generate explanations or solutions that require a sequence of reasoning steps tied to the visual input.

The limitation becomes evident when models are expected to apply reasoning or modify their strategies for varied visual tasks. Many current models lack flexibility, often defaulting to pattern matching or hardcoded routines. These systems struggle to break down unfamiliar problems or create solutions beyond their preset toolkits. They also fail when tasks involve abstract reasoning or require models to look beyond surface-level features in visual content. The need for a system that can autonomously adapt and construct new tools for reasoning has become a significant bottleneck.

Previous models typically rely on fixed toolsets and rigid single-turn processing. Solutions like Visual ChatGPT, HuggingGPT, or ViperGPT integrate tools such as segmentation or detection models, but they are constrained to predefined workflows. This setup limits creativity and adaptability. These models operate without the ability to modify or expand their toolset during a task, and they process tasks linearly, which limits their usefulness in domains that require iterative reasoning. Multi-turn capabilities are either missing or severely limited, preventing deeper analytical reasoning.

Researchers introduced PyVision to overcome these issues. Developed by teams from Shanghai AI Lab, Rice University, CUHK, NUS, and SII, this framework enables large multimodal language models (MLLMs) to autonomously create and execute Python-based tools tailored to specific visual reasoning problems. Unlike previous approaches, PyVision is not bound by static modules. It uses Python as its primary language and builds tools dynamically in a multi-turn loop. This allows the system to adapt its approach mid-task, enabling the model to make decisions, reflect on results, and refine its code or reasoning across several steps.

In practice, PyVision starts by receiving a user query and the corresponding visual input. The MLLM, such as GPT-4.1 or Claude-4.0-Sonnet, generates Python code based on the prompt, which is executed in an isolated environment. The results (textual, visual, or numerical) are fed back into the model. Using this feedback, the model can revise its plan, generate new code, and iterate until it produces a solution. The system supports cross-turn persistence, meaning variable states are maintained between interactions, allowing sequential reasoning. PyVision includes internal safety features, such as process isolation and structured I/O, ensuring robust performance even under complex reasoning loads. It utilizes Python libraries such as OpenCV, NumPy, and Pillow to perform operations like segmentation, OCR, image enhancement, and statistical analysis.

Quantitative benchmarks validate PyVision's effectiveness. On the visual search benchmark V*, PyVision improved GPT-4.1's performance from 68.1% to 75.9%, a gain of +7.8%. On the symbolic visual reasoning benchmark VLMsAreBlind-mini, Claude-4.0-Sonnet's accuracy increased from 48.1% to 79.2%, a 31.1-point improvement. Additional gains were observed on other tasks: +2.4% on MMMU and +2.5% on VisualPuzzles for GPT-4.1; +4.8% on MathVista and +8.3% on VisualPuzzles for Claude-4.0-Sonnet. The improvements vary with the underlying model's strengths: models that excel at perception benefit more from PyVision on perception-heavy tasks, while reasoning-strong models gain more on abstract challenges. PyVision amplifies the base model's abilities rather than masking or replacing them.

This research highlights a substantial advancement in visual reasoning. PyVision addresses a fundamental limitation by enabling models to create problem-specific tools in real time, transforming static models into agentic systems capable of thoughtful, iterative problem solving. By dynamically linking perception and reasoning, PyVision takes a critical step toward building intelligent, adaptable AI for complex real-world visual challenges. Check out the Paper, GitHub Page and Project.
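
To make the multi-turn loop concrete, here is a minimal, hedged sketch of the generate-execute-feedback cycle with cross-turn persistence; the FINAL/<python> protocol and the in-process exec (a real system would sandbox execution, as PyVision does with process isolation) are simplifying assumptions rather than PyVision's actual interface:

```python
# Hedged sketch of a PyVision-style multi-turn code loop with persistent state.
import contextlib
import io
import re

CODE_RE = re.compile(r"<python>(.*?)</python>", re.DOTALL)

def run_turn(code: str, namespace: dict) -> str:
    """Execute one code block; variables survive across turns via the shared dict."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, namespace)                # NOTE: unsandboxed, for illustration only
    except Exception as e:                        # errors are also useful feedback
        return f"Error: {e}"
    return buf.getvalue().strip()

def pyvision_loop(task: str, generate, max_turns: int = 5) -> str:
    namespace: dict = {}
    transcript = f"Task: {task}"
    for _ in range(max_turns):
        reply = generate(transcript)
        if reply.startswith("FINAL:"):            # model decides it is done
            return reply[len("FINAL:"):].strip()
        match = CODE_RE.search(reply)
        if not match:
            transcript += f"\n{reply}"
            continue
        observation = run_turn(match.group(1), namespace)
        transcript += f"\n{reply}\nObservation: {observation}"
    return "No answer within turn budget."

# Toy usage: a dummy "model" that writes code once, then answers
turns = iter(["Let me count.\n<python>\nx = 40 + 2\nprint(x)\n</python>", "FINAL: 42"])
print(pyvision_loop("add 40 and 2", lambda _: next(turns)))   # -> 42
```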

Impact of Stickers on Multimodal Sentiment and Intent in Social Media: A New Task, Dataset and Baseline

arXiv:2405.08427v2 Announce Type: replace Abstract: Stickers are increasingly used in social media to express sentiment and intent. Despite their significant impact on sentiment analysis and intent recognition, little research has been conducted in this area. To address this gap, we propose a new task: Multimodal chat Sentiment Analysis and Intent Recognition involving Stickers (MSAIRS). Additionally, we introduce a novel multimodal dataset containing Chinese chat records and stickers excerpted from several mainstream social media platforms. Our dataset includes paired data with the same text but different stickers, the same sticker but different contexts, and various stickers consisting of the same images with different texts, allowing us to better understand the impact of stickers on chat sentiment and intent. We also propose an effective multimodal joint model, MMSAIR, featuring differential vector construction and cascaded attention mechanisms for enhanced multimodal fusion. Our experiments demonstrate the necessity and effectiveness of jointly modeling sentiment and intent, as they mutually reinforce each other's recognition accuracy. MMSAIR significantly outperforms traditional models and advanced MLLMs, demonstrating the challenge and uniqueness of sticker interpretation in social media. Our dataset and code are available at https://github.com/FakerBoom/MSAIRS-Dataset.
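
The abstract names differential vector construction and cascaded attention without defining them, so the sketch below is a heavily hedged guess at how such a joint sentiment/intent model might be wired; every layer name, dimension, and the two-stage attention cascade are assumptions for illustration only, not the MMSAIR architecture:

```python
# Assumption-heavy sketch: cascaded attention over sticker image then sticker
# text, a differential vector relative to the chat, and joint prediction heads.
import torch
import torch.nn as nn

class MMSAIRSketch(nn.Module):
    def __init__(self, d=64, n_sent=3, n_intent=8):
        super().__init__()
        self.attn_img = nn.MultiheadAttention(d, 4, batch_first=True)
        self.attn_txt = nn.MultiheadAttention(d, 4, batch_first=True)
        self.sent_head = nn.Linear(2 * d, n_sent)
        self.intent_head = nn.Linear(2 * d, n_intent)

    def forward(self, chat, sticker_img, sticker_txt):
        # Cascade 1: chat text attends to sticker image features
        h1, _ = self.attn_img(chat, sticker_img, sticker_img)
        # Cascade 2: the fused states attend to the sticker's embedded text
        h2, _ = self.attn_txt(h1, sticker_txt, sticker_txt)
        # Differential vector: what the sticker adds beyond the chat itself
        diff = h2.mean(dim=1) - chat.mean(dim=1)
        feat = torch.cat([h2.mean(dim=1), diff], dim=-1)
        return self.sent_head(feat), self.intent_head(feat)   # joint prediction

# Toy usage
m = MMSAIRSketch()
chat, img, txt = torch.randn(2, 12, 64), torch.randn(2, 49, 64), torch.randn(2, 6, 64)
sent_logits, intent_logits = m(chat, img, txt)
print(sent_logits.shape, intent_logits.shape)   # (2, 3) (2, 8)
```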

Large Language Models in Argument Mining: A Survey

arXiv:2506.16383v3 Announce Type: replace Abstract: Argument Mining (AM), a critical subfield of Natural Language Processing (NLP), focuses on extracting argumentative structures from text. The advent of Large Language Models (LLMs) has profoundly transformed AM, enabling advanced in-context learning, prompt-based generation, and robust cross-domain adaptability. This survey systematically synthesizes recent advancements in LLM-driven AM. We provide a concise review of foundational theories and annotation frameworks, alongside a meticulously curated catalog of datasets. A key contribution is our comprehensive taxonomy of AM subtasks, elucidating how contemporary LLM techniques, such as prompting, chain-of-thought reasoning, and retrieval augmentation, have reconfigured their execution. We further detail current LLM architectures and methodologies, critically assess evaluation practices, and delineate pivotal challenges including long-context reasoning, interpretability, and annotation bottlenecks. Finally, we highlight emerging trends and propose a forward-looking research agenda for LLM-based computational argumentation, aiming to strategically guide researchers in this rapidly evolving domain.
