YouZum

Uncategorized

AI, Committee, News, Uncategorized

GPT-4o Understands Text, But Does It See Clearly? A Benchmarking Study of MFMs on Vision Tasks

Multimodal foundation models (MFMs) like GPT-4o, Gemini, and Claude have shown rapid progress recently, especially in public demos. While their language skills are well studied, their true ability to understand visual information remains unclear. Most benchmarks used today focus heavily on text-based tasks, such as VQA or classification, which often reflect language strengths more than visual capabilities. These tests also require text outputs, making it difficult to fairly assess visual skills or compare MFMs with vision-specific models. Moreover, critical aspects of visual understanding such as 3D perception, segmentation, and grouping are still largely overlooked in current evaluations.

MFMs perform strongly on tasks that combine visual and language understanding, such as captioning and visual question answering, yet their effectiveness on tasks requiring detailed visual comprehension is unclear. Some studies attempt to adapt vision datasets for MFMs by converting annotations into text, but this restricts evaluation to language outputs. Prompting strategies have also been explored to help MFMs tackle visual tasks by breaking them into manageable subtasks, though reproducibility remains a challenge in some cases.

Researchers at EPFL evaluated several popular multimodal foundation models, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet, on core computer vision tasks such as segmentation, object detection, and depth prediction, using datasets like COCO and ImageNet. Since most MFMs output only text and are accessible only via APIs, the team developed a prompt-chaining framework to translate these visual tasks into text-compatible formats. Their findings show that while MFMs are competent generalists, they fall short of specialized vision models, especially on geometric tasks. GPT-4o stood out, performing best in four of the six tasks. The evaluation toolkit will be open-sourced.

To evaluate MFMs on vision tasks, the study designed a prompt-chaining strategy that breaks complex tasks into simpler, language-friendly subtasks. For example, instead of predicting bounding boxes directly, the model first identifies the objects present, then locates them through recursive image cropping. For segmentation and grouping, images are divided into superpixels, which are easier to label and compare. Depth and surface normals are estimated using pairwise rankings of superpixel regions. This modular design plays to MFMs' strength in classification and similarity judgments, while calibration controls ensure fair comparisons. The method is flexible, and performance improves with finer-grained prompting.

Across image classification, object detection, semantic segmentation, and other tasks, using datasets such as ImageNet, COCO, and Hypersim, the results show GPT-4o reaching 77.2% on ImageNet and 60.62 AP50 for object detection, outperformed by specialist models such as ViT-G (90.94%) and Co-DETR (91.30%). For semantic segmentation, GPT-4o scores 44.89 mIoU, while OneFormer leads with 65.52. MFMs handle distribution shifts reasonably well but lag on precise visual reasoning. The study also introduces prompt chaining and oracle baselines to estimate upper-bound performance.
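To make the recursive-cropping idea concrete, the following is a minimal sketch of one prompt-chaining step for object localization. It assumes a hypothetical query_mfm(image, prompt) wrapper around the model's API; the quadrant-based crop schedule and prompt wording are illustrative, not the paper's exact protocol.

```python
# Illustrative sketch of recursive-cropping localization via prompt chaining.
# `query_mfm` is a hypothetical wrapper around a text-only MFM API; the crop
# schedule and prompt wording are assumptions, not the authors' implementation.
from PIL import Image


def query_mfm(image: Image.Image, prompt: str) -> str:
    """Placeholder: send the image crop and prompt to an MFM API, return its text reply."""
    raise NotImplementedError


def locate_object(image: Image.Image, label: str, depth: int = 3) -> tuple[int, int, int, int]:
    """Narrow down an object's bounding box by repeatedly asking which quadrant contains it."""
    left, top, right, bottom = 0, 0, image.width, image.height
    for _ in range(depth):
        w, h = (right - left) // 2, (bottom - top) // 2
        quadrants = {
            "top-left": (left, top, left + w, top + h),
            "top-right": (left + w, top, right, top + h),
            "bottom-left": (left, top + h, left + w, bottom),
            "bottom-right": (left + w, top + h, right, bottom),
        }
        answer = query_mfm(
            image.crop((left, top, right, bottom)),
            f"Which quadrant contains the {label}? "
            "Answer with exactly one of: top-left, top-right, bottom-left, bottom-right.",
        )
        # Keep the current box if the reply is not a recognized quadrant name.
        left, top, right, bottom = quadrants.get(answer.strip().lower(), (left, top, right, bottom))
    return left, top, right, bottom
```

Segmentation, grouping, depth, and surface normals follow the same pattern, with superpixels being labeled or rank-compared instead of crops being selected.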
In conclusion, the study introduces a benchmarking framework to assess the visual capabilities of MFMs such as GPT-4o, Gemini, and Claude by converting standard vision tasks into prompt-based formats. The findings show that MFMs perform better on semantic tasks than on geometric ones, with GPT-4o leading overall; however, all MFMs lag significantly behind task-specific vision models. Despite being generalists trained primarily on image-text data, they show promising progress, with newer reasoning models such as o3 improving on 3D tasks. Limitations include high inference cost and prompt sensitivity. Still, this framework provides a unified approach to evaluating MFMs' visual understanding, laying the groundwork for future advancements.

Check out the Paper, GitHub Page and Project. All credit for this research goes to the researchers of this project. The post GPT-4o Understands Text, But Does It See Clearly? A Benchmarking Study of MFMs on Vision Tasks appeared first on MarkTechPost.

GPT-4o Understands Text, But Does It See Clearly? A Benchmarking Study of MFMs on Vision Tasks Read the article »

AI, Committee, News, Uncategorized

This AI Paper Introduces PyVision: A Python-Centric Framework Where AI Writes Tools as It Thinks

Visual reasoning tasks challenge artificial intelligence models to interpret and process visual information using both perception and logical reasoning. These tasks span a wide range of applications, including medical diagnostics, visual math, symbolic puzzles, and image-based question answering. Success in this field requires more than object recognition: it demands dynamic adaptation, abstraction, and contextual inference. Models must analyze images, identify relevant features, and often generate explanations or solutions that require a sequence of reasoning steps tied to the visual input.

The limitation becomes evident when models are expected to apply reasoning or modify their strategies across varied visual tasks. Many current models lack flexibility, often defaulting to pattern matching or hardcoded routines. These systems struggle to break down unfamiliar problems or create solutions beyond their preset toolkits, and they fail when tasks involve abstract reasoning or require looking beyond surface-level features in visual content. The need for a system that can autonomously adapt and construct new tools for reasoning has become a significant bottleneck.

Previous models typically rely on fixed toolsets and rigid single-turn processing. Solutions like Visual ChatGPT, HuggingGPT, or ViperGPT integrate tools such as segmentation or detection models, but they are constrained to predefined workflows. This setup limits creativity and adaptability: these models cannot modify or expand their toolset during a task, and they process tasks linearly, which limits their usefulness in domains that require iterative reasoning. Multi-turn capabilities are either missing or severely limited, preventing deeper analytical reasoning.

Researchers introduced PyVision to overcome these issues. Developed by teams from Shanghai AI Lab, Rice University, CUHK, NUS, and SII, the framework enables large multimodal language models (MLLMs) to autonomously create and execute Python-based tools tailored to specific visual reasoning problems. Unlike previous approaches, PyVision is not bound by static modules: it uses Python as its primary language and builds tools dynamically in a multi-turn loop. This allows the system to adapt its approach mid-task, enabling the model to make decisions, reflect on results, and refine its code or reasoning across several steps.

In practice, PyVision starts from a user query and the corresponding visual input. The MLLM, such as GPT-4.1 or Claude-4.0-Sonnet, generates Python code based on the prompt, which is executed in an isolated environment. The results, whether textual, visual, or numerical, are fed back into the model. Using this feedback, the model can revise its plan, generate new code, and iterate until it produces a solution. The system supports cross-turn persistence, meaning variable states are maintained between interactions, which allows sequential reasoning. PyVision includes internal safety features, such as process isolation and structured I/O, ensuring robust performance even under complex reasoning loads. It utilizes Python libraries such as OpenCV, NumPy, and Pillow to perform operations like segmentation, OCR, image enhancement, and statistical analysis.
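To illustrate this loop, below is a rough, self-contained sketch of a generate-execute-feedback cycle in the spirit of PyVision. The chat function is a hypothetical MLLM API wrapper and the shared namespace dictionary stands in for cross-turn persistence; a real deployment would run the generated code in an isolated sandbox rather than a plain exec, in line with the process isolation described above.

```python
# Rough sketch of a multi-turn "write code, run it, read the result" loop.
# `chat` is a hypothetical MLLM API wrapper; the shared namespace mimics the
# cross-turn persistence described above. Sandboxing is omitted for brevity.
import contextlib
import io


def chat(messages: list[dict]) -> str:
    """Placeholder: call an MLLM (e.g., GPT-4.1 or Claude-4.0-Sonnet) and return its reply."""
    raise NotImplementedError


def run_task(task: str, max_turns: int = 5) -> str:
    namespace: dict = {}  # variables persist across turns
    messages = [{
        "role": "user",
        "content": task + "\nReply with Python code to run, or 'FINAL: <answer>' when done.",
    }]
    for _ in range(max_turns):
        reply = chat(messages)
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):  # capture printed output
                exec(reply, namespace)                # NOTE: isolate/sandbox in practice
            feedback = buffer.getvalue() or "(no output)"
        except Exception as exc:                      # feed errors back for self-correction
            feedback = f"Error: {exc}"
        messages += [
            {"role": "assistant", "content": reply},
            {"role": "user", "content": f"Execution result:\n{feedback}"},
        ]
    return "No final answer produced."
```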
Quantitative benchmarks validate PyVision's effectiveness. On the visual search benchmark V*, PyVision improved GPT-4.1's performance from 68.1% to 75.9%, a gain of 7.8 points. On the symbolic visual reasoning benchmark VLMsAreBlind-mini, Claude-4.0-Sonnet's accuracy increased from 48.1% to 79.2%, a 31.1-point improvement. Additional gains were observed on other tasks: +2.4% on MMMU and +2.5% on VisualPuzzles for GPT-4.1; +4.8% on MathVista and +8.3% on VisualPuzzles for Claude-4.0-Sonnet. The improvements vary with the underlying model's strengths: models that excel in perception benefit more from PyVision on perception-heavy tasks, while reasoning-strong models gain more on abstract challenges. PyVision amplifies the base model's abilities rather than masking or replacing them.

This research highlights a substantial advancement in visual reasoning. PyVision addresses a fundamental limitation by enabling models to create problem-specific tools in real time, transforming static models into agentic systems capable of thoughtful, iterative problem-solving. By dynamically linking perception and reasoning, PyVision takes a critical step toward building intelligent, adaptable AI for complex real-world visual challenges.

Check out the Paper, GitHub Page and Project. All credit for this research goes to the researchers of this project. The post This AI Paper Introduces PyVision: A Python-Centric Framework Where AI Writes Tools as It Thinks appeared first on MarkTechPost.

This AI Paper Introduces PyVision: A Python-Centric Framework Where AI Writes Tools as It Thinks Read the article »

AI, Committee, News, Uncategorized

Impact of Stickers on Multimodal Sentiment and Intent in Social Media: A New Task, Dataset and Baseline

arXiv:2405.08427v2 Announce Type: replace Abstract: Stickers are increasingly used in social media to express sentiment and intent. Despite their significant impact on sentiment analysis and intent recognition, little research has been conducted in this area. To address this gap, we propose a new task: Multimodal chat Sentiment Analysis and Intent Recognition involving Stickers (MSAIRS). Additionally, we introduce a novel multimodal dataset containing Chinese chat records and stickers excerpted from several mainstream social media platforms. Our dataset includes paired data with the same text but different stickers, the same sticker but different contexts, and various stickers consisting of the same images with different texts, allowing us to better understand the impact of stickers on chat sentiment and intent. We also propose an effective multimodal joint model, MMSAIR, featuring differential vector construction and cascaded attention mechanisms for enhanced multimodal fusion. Our experiments demonstrate the necessity and effectiveness of jointly modeling sentiment and intent, as they mutually reinforce each other's recognition accuracy. MMSAIR significantly outperforms traditional models and advanced MLLMs, demonstrating the challenge and uniqueness of sticker interpretation in social media. Our dataset and code are available on https://github.com/FakerBoom/MSAIRS-Dataset.

Impact of Stickers on Multimodal Sentiment and Intent in Social Media: A New Task, Dataset and Baseline Read the article »

AI, Committee, News, Uncategorized

Large Language Models in Argument Mining: A Survey

arXiv:2506.16383v3 Announce Type: replace Abstract: Argument Mining (AM), a critical subfield of Natural Language Processing (NLP), focuses on extracting argumentative structures from text. The advent of Large Language Models (LLMs) has profoundly transformed AM, enabling advanced in-context learning, prompt-based generation, and robust cross-domain adaptability. This survey systematically synthesizes recent advancements in LLM-driven AM. We provide a concise review of foundational theories and annotation frameworks, alongside a meticulously curated catalog of datasets. A key contribution is our comprehensive taxonomy of AM subtasks, elucidating how contemporary LLM techniques — such as prompting, chain-of-thought reasoning, and retrieval augmentation — have reconfigured their execution. We further detail current LLM architectures and methodologies, critically assess evaluation practices, and delineate pivotal challenges including long-context reasoning, interpretability, and annotation bottlenecks. Conclusively, we highlight emerging trends and propose a forward-looking research agenda for LLM-based computational argumentation, aiming to strategically guide researchers in this rapidly evolving domain.

Large Language Models in Argument Mining: A Survey Read the article »

AI, Committee, News, Uncategorized

Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs

arXiv:2502.12988v2 Announce Type: replace Abstract: Previous approaches to persona simulation with large language models (LLMs) have typically relied on learning basic biographical information or on limited role-play dialogue datasets to capture a character's responses. However, a holistic representation of an individual goes beyond surface-level facts or conversations to deeper thoughts and ways of thinking. In this work, we introduce CharacterBot, a model designed to replicate both the linguistic patterns and distinctive thought processes of a character. Using Lu Xun, a renowned Chinese writer, as a case study, we propose four training tasks derived from his 17 essay collections. These include a pre-training task focused on mastering external linguistic structures and knowledge, as well as three fine-tuning tasks: multiple-choice question answering, generative question answering, and style transfer, each aligning the LLM with Lu Xun's internal ideation and writing style. To optimize learning across these tasks, we introduce a CharLoRA parameter updating mechanism, in which a general linguistic style expert collaborates with other task-specific experts to jointly learn both the language style and the understanding of deeper thoughts. We evaluate CharacterBot on three tasks for linguistic accuracy and opinion comprehension, demonstrating that it significantly outperforms the baselines on our adapted metrics. We hope this work inspires future research on deep character persona simulation with LLMs.

Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs Read the article »

AI, Committee, News, Uncategorized

Language Detection by Means of the Minkowski Norm: Identification Through Character Bigrams and Frequency Analysis

arXiv:2507.16284v2 Announce Type: replace Abstract: The debate surrounding language identification has gained renewed attention in recent years, especially with the rapid evolution of AI-powered language models. However, non-AI-based approaches to language identification have been overshadowed. This research explores a mathematical implementation of an algorithm for language determination that leverages monogram and bigram frequency rankings derived from established linguistic research. The datasets used comprise texts varying in length, historical period, and genre, including short stories, fairy tales, and poems. Despite these variations, the method achieves over 80% accuracy on texts shorter than 150 characters and reaches 100% accuracy for longer texts. These results demonstrate that classical frequency-based approaches remain effective and scalable alternatives to AI-driven models for language detection.
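As an illustration of the general approach rather than the paper's exact implementation, a rank-based detector can compare the character-bigram frequency ranking of an input text against per-language reference rankings using a Minkowski distance. The tiny reference tables below are placeholders; real tables would be derived from corpus statistics.

```python
# Hedged sketch of rank-based language detection with a Minkowski norm.
# The reference rank tables are illustrative placeholders, not real corpus data.
from collections import Counter

REFERENCE_RANKS = {
    # language -> {bigram: rank}; in practice these come from large corpora
    "english": {"th": 1, "he": 2, "in": 3, "er": 4, "an": 5},
    "italian": {"re": 1, "er": 2, "on": 3, "an": 4, "el": 5},
}


def bigram_ranks(text: str, top_k: int = 50) -> dict[str, int]:
    """Rank the most frequent character bigrams of the input text."""
    text = "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ")
    counts = Counter(
        text[i:i + 2] for i in range(len(text) - 1) if " " not in text[i:i + 2]
    )
    return {bg: rank for rank, (bg, _) in enumerate(counts.most_common(top_k), start=1)}


def minkowski_distance(observed: dict[str, int], reference: dict[str, int],
                       p: float = 2.0, missing_rank: int = 51) -> float:
    """Minkowski norm over rank differences; bigrams unseen in the text get a penalty rank."""
    return sum(
        abs(observed.get(bg, missing_rank) - rank) ** p for bg, rank in reference.items()
    ) ** (1.0 / p)


def detect_language(text: str) -> str:
    observed = bigram_ranks(text)
    return min(REFERENCE_RANKS, key=lambda lang: minkowski_distance(observed, REFERENCE_RANKS[lang]))
```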

Language Detection by Means of the Minkowski Norm: Identification Through Character Bigrams and Frequency Analysis Read the article »

AI, Committee, News, Uncategorized

Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems

arXiv:2507.16835v1 Announce Type: cross Abstract: Voice-based conversational AI systems increasingly rely on cascaded architectures combining speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS) components. However, systematic evaluation of different component combinations in production settings remains understudied. We present a large-scale empirical comparison of STT x LLM x TTS stacks using data from over 300,000 AI-conducted job interviews. We develop an automated evaluation framework using LLM-as-a-Judge to assess conversational quality, technical accuracy, and skill assessment capabilities. Our analysis of four production configurations reveals that Google STT paired with GPT-4.1 significantly outperforms alternatives in both conversational and technical quality metrics. Surprisingly, we find that objective quality metrics correlate weakly with user satisfaction scores, suggesting that user experience in voice-based AI systems depends on factors beyond technical performance. Our findings provide practical guidance for selecting components in multimodal conversational AI systems and contribute a validated evaluation methodology for voice-based interactions.
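A cascaded stack of this kind can be sketched as three swappable interfaces. The Protocol definitions below are illustrative placeholders and do not correspond to any specific vendor API evaluated in the study.

```python
# Minimal sketch of one turn in a cascaded STT -> LLM -> TTS interview system.
# The interfaces are illustrative placeholders, not the paper's implementation.
from typing import Protocol


class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def respond(self, transcript: str, history: list[str]) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


def interview_turn(audio: bytes, history: list[str], stt: STT, llm: LLM, tts: TTS) -> bytes:
    """One conversational turn: candidate audio in, interviewer audio out."""
    transcript = stt.transcribe(audio)          # speech-to-text
    history.append(f"Candidate: {transcript}")
    reply = llm.respond(transcript, history)    # next interviewer utterance
    history.append(f"Interviewer: {reply}")
    return tts.synthesize(reply)                # text-to-speech
```

Swapping any single component, such as the STT provider or the LLM, leaves the rest of the pipeline untouched, which is what makes this kind of systematic component comparison feasible.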

Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems Read the article »

AI, Committee, News, Uncategorized

Enhancing Hindi NER in Low Context: A Comparative study of Transformer-based models with vs. without Retrieval Augmentation

arXiv:2507.16002v1 Announce Type: new Abstract: One major challenge in natural language processing is named entity recognition (NER), which identifies and categorises named entities in textual input. To improve NER, this study investigates a Hindi NER technique that makes use of Hindi-specific pretrained encoders (MuRIL and XLM-R) and generative models (Llama-2-7B-chat-hf (Llama2-7B), Llama-2-70B-chat-hf (Llama2-70B), Llama-3-70B-Instruct (Llama3-70B) and GPT3.5-turbo), and augments the data with relevant context retrieved from external sources, notably Wikipedia. We have fine-tuned MuRIL, XLM-R and Llama2-7B with and without RA, while Llama2-70B, Llama3-70B and GPT3.5-turbo are used for few-shot NER generation. Our investigation shows that these language models (LMs) with Retrieval Augmentation (RA) outperform baseline methods that do not incorporate RA in most cases. The macro F1 scores for MuRIL and XLM-R are 0.69 and 0.495, respectively, without RA and increase to 0.70 and 0.71, respectively, with RA. Fine-tuned Llama2-7B outperforms its non-fine-tuned counterpart by a significant margin, and the generative models that are not fine-tuned also perform better with augmented data. GPT3.5-turbo adopted RA well; however, Llama2-70B and Llama3-70B did not benefit from RA with our retrieval context. The findings show that RA significantly improves performance, especially for low-context data. This study adds significant knowledge about how best to use data augmentation methods and pretrained models to enhance NER performance, particularly in languages with limited resources.
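For a sense of how retrieval augmentation enters the prompt, here is a hedged sketch of few-shot NER prompt construction with retrieved context; retrieve_context is a placeholder for the Wikipedia retrieval step, and the template and tag set are assumptions rather than the study's exact wording.

```python
# Hedged sketch of retrieval-augmented few-shot NER prompting.
# `retrieve_context` and the prompt template are illustrative assumptions.
def retrieve_context(sentence: str, top_k: int = 2) -> list[str]:
    """Placeholder: return the top-k relevant Wikipedia passages for the sentence."""
    raise NotImplementedError


def build_ner_prompt(sentence: str, few_shot_examples: list[tuple[str, str]]) -> str:
    """Assemble a few-shot NER prompt with retrieved background passages."""
    context = "\n".join(retrieve_context(sentence))
    shots = "\n\n".join(f"Sentence: {s}\nEntities: {e}" for s, e in few_shot_examples)
    return (
        "Label the named entities (e.g., PER, LOC, ORG) in the Hindi sentence.\n\n"
        f"Relevant background:\n{context}\n\n"
        f"{shots}\n\n"
        f"Sentence: {sentence}\nEntities:"
    )
```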

Enhancing Hindi NER in Low Context: A Comparative study of Transformer-based models with vs. without Retrieval Augmentation Read the article »

AI, Committee, News, Uncategorized

Small Edits, Big Consequences: Telling Good from Bad Robustness in Large Language Models

arXiv:2507.15868v1 Announce Type: new Abstract: Large language models (LLMs) now write code in settings where misreading a single word can break safety or cost money, yet we still expect them to overlook stray typos. To probe where useful robustness ends and harmful insensitivity begins, we compile 50 LeetCode problems and craft three minimal prompt perturbations that should vary in importance: (i) progressive underspecification, deleting 10% of words per step; (ii) lexical flip, swapping a pivotal quantifier ("max" to "min"); and (iii) jargon inflation, replacing a common noun with an obscure technical synonym. Six frontier models, including three "reasoning-tuned" versions, solve each mutated prompt, and their Python outputs are checked against the original test suites to reveal whether they reused the baseline solution or adapted. Among 11,853 generations we observe a sharp double asymmetry. Models remain correct in 85% of cases even after 90% of the prompt is missing, showing over-robustness to underspecification, yet only 54% react to a single quantifier flip that reverses the task, with reasoning-tuned variants even less sensitive than their bases. Jargon edits lie in between, passing through 56%. Current LLMs thus blur the line between harmless noise and meaning-changing edits, often treating both as ignorable. Masking salient anchors such as function names can force re-evaluation. We advocate evaluation and training protocols that reward differential sensitivity: stay steady under benign noise but adapt, or refuse, when semantics truly change.
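The three perturbation types are simple enough to sketch directly. The snippet below is an illustrative reimplementation; the deletion policy, quantifier pair, and jargon map are assumptions, not the authors' exact edit lists.

```python
# Illustrative reimplementation of the three prompt perturbations.
import random


def underspecify(prompt: str, fraction: float = 0.1, seed: int = 0) -> str:
    """One progressive-underspecification step: drop roughly `fraction` of the words."""
    words = prompt.split()
    if len(words) < 2:
        return prompt
    rng = random.Random(seed)
    drop = set(rng.sample(range(len(words)), k=max(1, int(len(words) * fraction))))
    return " ".join(w for i, w in enumerate(words) if i not in drop)


def lexical_flip(prompt: str) -> str:
    """Swap a pivotal quantifier, reversing the task's meaning."""
    return prompt.replace("maximum", "minimum").replace("max", "min")


def jargon_inflate(prompt: str, synonyms: dict[str, str] | None = None) -> str:
    """Replace common nouns with obscure technical synonyms (the map is illustrative)."""
    synonyms = synonyms or {"list": "monotone sequence container", "word": "lexeme"}
    for plain, fancy in synonyms.items():
        prompt = prompt.replace(plain, fancy)
    return prompt
```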

Small Edits, Big Consequences: Telling Good from Bad Robustness in Large Language Models Read the article »

AI, Committee, News, Uncategorized

Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems

arXiv:2506.06821v3 Announce Type: replace Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, capable of tackling complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) programs and propose TCGBench, a Benchmark for (LLM generation of) Test Case Generators. This benchmark comprises two tasks, aimed at studying the capabilities of LLMs in (1) generating valid test case generators for a given CP problem, and (2) generating targeted test case generators that expose bugs in human-written code. Experimental results indicate that while state-of-the-art LLMs can generate valid test case generators in most cases, most LLMs struggle to generate targeted test cases that effectively reveal flaws in human code. In particular, even advanced reasoning models (e.g., o3-mini) fall significantly short of human performance on the task of generating targeted generators. Furthermore, we construct a high-quality, manually curated dataset of instructions for generating targeted generators. Analysis demonstrates that the performance of LLMs can be enhanced with this dataset, through both prompting and fine-tuning.

Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems Read the article »
