AI, Committee, News, Uncategorized

JarvisArt: A Human-in-the-Loop Multimodal Agent for Region-Specific and Global Photo Editing

Bridging the Gap Between Artistic Intent and Technical Execution

Photo retouching is a core aspect of digital photography, enabling users to manipulate image elements such as tone, exposure, and contrast to create visually compelling content. Whether for professional purposes or personal expression, users often seek to enhance images in ways that align with specific aesthetic goals. However, the art of photo retouching requires both technical knowledge and creative sensibility, making it difficult to achieve high-quality results without significant effort or expertise.

The key problem arises from the gap between manual editing tools and automated solutions. While professional software like Adobe Lightroom offers extensive retouching options, mastering these tools can be time-consuming and difficult for casual users. Conversely, AI-driven methods tend to oversimplify the editing process, failing to offer the control or precision required for nuanced edits. These automated solutions also struggle to generalize across diverse visual scenes or to support complex user instructions.

Limitations of Current AI-Based Photo Editing Models

Traditional tools have relied on zeroth- and first-order optimization, as well as reinforcement learning, to handle photo retouching tasks; others use diffusion-based methods for image synthesis. These strategies show progress but are generally hampered by their inability to handle fine-grained regional control, maintain high-resolution outputs, or preserve the underlying content of the image. Even more recent large models, such as GPT-4o and Gemini-2-Flash, offer text-driven editing but compromise user control, and their generative processes often overwrite critical content details.

JarvisArt: A Multimodal AI Retoucher Integrating Chain-of-Thought and Lightroom APIs

Researchers from Xiamen University, the Chinese University of Hong Kong, Bytedance, the National University of Singapore, and Tsinghua University introduced JarvisArt, an intelligent retouching agent. The system leverages a multimodal large language model to enable flexible, instruction-guided image editing. JarvisArt is trained to emulate the decision-making process of professional artists, interpreting user intent through both visual and language cues and executing retouching actions across more than 200 tools in Adobe Lightroom via a custom integration protocol.

The methodology integrates three major components. First, the researchers constructed a high-quality dataset, MMArt, which includes 5,000 standard and 50,000 Chain-of-Thought–annotated samples spanning various editing styles and complexities. Second, JarvisArt undergoes a two-stage training process: an initial supervised fine-tuning phase builds reasoning and tool-selection capabilities, followed by Group Relative Policy Optimization for Retouching (GRPO-R), which incorporates customized tool-use rewards—such as retouching accuracy and perceptual quality—to refine the system's ability to generate professional-quality edits. Third, a specialized Agent-to-Lightroom (A2L) protocol ensures the seamless and transparent execution of tools within Lightroom, enabling users to dynamically adjust edits.
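To make the GRPO-R idea concrete, the following minimal Python sketch shows how a group of candidate edits for one instruction could be scored with a composite tool-use reward and converted into group-relative advantages. The L1-based accuracy proxy, the perceptual-quality stand-in, and the weights are illustrative assumptions, not the exact reward formulation used by JarvisArt.

```python
import numpy as np

def composite_reward(pred_edit, reference_edit, perceptual_score,
                     w_acc=0.5, w_perc=0.5):
    """Combine a pixel-level retouching-accuracy proxy with a perceptual-quality
    score. The L1 proxy and the 50/50 weights are assumptions for illustration."""
    accuracy = 1.0 - np.abs(pred_edit - reference_edit).mean()  # higher is better
    return w_acc * accuracy + w_perc * perceptual_score

def group_relative_advantages(rewards):
    """GRPO-style step: normalize each rollout's reward against its group's mean
    and standard deviation, so the policy is nudged toward above-average edits."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Toy group of four candidate edits sampled for the same user instruction.
rng = np.random.default_rng(0)
reference = rng.random((64, 64, 3))                      # stand-in target edit
candidates = [np.clip(reference + rng.normal(0, s, reference.shape), 0.0, 1.0)
              for s in (0.02, 0.05, 0.1, 0.2)]
perceptual = [0.9, 0.8, 0.6, 0.4]                        # stand-in quality scores
rewards = [composite_reward(c, reference, q) for c, q in zip(candidates, perceptual)]
print(group_relative_advantages(rewards))
```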
Benchmarking JarvisArt's Capabilities and Real-World Performance

JarvisArt's ability to interpret complex instructions and apply nuanced edits was evaluated using MMArt-Bench, a benchmark constructed from real user edits. The system delivered a 60% improvement in average pixel-level metrics for content fidelity compared to GPT-4o, while maintaining similar instruction-following capabilities. It also demonstrated versatility in handling both global image edits and localized refinements, and it can manipulate images of arbitrary resolution. For example, it can adjust skin texture, eye brightness, or hair definition based on region-specific instructions. These results were achieved while preserving the aesthetic goals defined by the user, showing a practical blend of control and quality across multiple editing tasks.

Conclusion: A Generative Agent That Fuses Creativity With Technical Precision

The research team tackled a significant challenge—enabling intelligent, high-quality photo retouching that does not require professional expertise. The method they introduced bridges the gap between automation and user control by combining data synthesis, reasoning-driven training, and integration with commercial software. JarvisArt offers a practical and powerful solution for creative users who seek both flexibility and quality in their image editing.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

The post JarvisArt: A Human-in-the-Loop Multimodal Agent for Region-Specific and Global Photo Editing appeared first on MarkTechPost.

JarvisArt: A Human-in-the-Loop Multimodal Agent for Region-Specific and Global Photo Editing Read Post »

AI, Committee, News, Uncategorized

LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating

arXiv:2412.18424v3 Announce Type: replace-cross Abstract: Large vision language models (LVLMs) have remarkably improved document understanding capabilities, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks have been limited to a small number of pages and fail to provide a comprehensive analysis of layout-element locating. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating, and then propose a comprehensive benchmark, LongDocURL, integrating these three primary tasks and comprising 20 sub-tasks categorized by primary task and answer evidence. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs covering more than 33,000 document pages, significantly exceeding the scale of existing benchmarks. Subsequently, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.

LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating Read Post »

AI, Committee, News, Uncategorized

Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers

arXiv:2507.10787v1 Announce Type: new Abstract: This paper introduces MISS-QA, the first benchmark specifically designed to evaluate the ability of models to interpret schematic diagrams within scientific literature. MISS-QA comprises 1,500 expert-annotated examples over 465 scientific papers. In this benchmark, models are tasked with interpreting schematic diagrams that illustrate research overviews and answering corresponding information-seeking questions based on the broader context of the paper. We assess the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL. We reveal a significant performance gap between these models and human experts on MISS-QA. Our analysis of model performance on unanswerable questions and our detailed error analysis further highlight the strengths and limitations of current models, offering key insights to enhance models in comprehending multimodal scientific literature.

Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers Read Post »

AI, Committee, News, Uncategorized

Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching

arXiv:2507.04099v2 Announce Type: replace Abstract: Fine-tuning methods such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) have demonstrated success in training large language models (LLMs) for single-turn tasks. However, these methods fall short in multi-turn applications, such as diagnostic patient interviewing, where understanding how early conversational turns influence downstream completions and outcomes is essential. In medicine, a multi-turn perspective is critical for learning diagnostic schemas and better understanding conversation dynamics. To address this gap, I introduce Savage Conversation Forests (SCF), a reinforcement learning framework that leverages a branched conversation architecture to fine-tune LLMs for multi-turn dialogue. SCF generates multiple possible conversation continuations at each turn, enabling the model to learn how different early responses affect downstream interactions and diagnostic outcomes. In experiments simulating doctor-patient conversations, SCF with branching outperforms linear conversation architectures on diagnostic accuracy. I hypothesize that SCF's improvements stem from its ability to provide richer, interdependent training signals across conversation turns. These results suggest that a branched training architecture is an important strategy for fine-tuning LLMs in complex multi-turn conversational tasks.
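The branched architecture can be pictured as a forest of conversation states expanded turn by turn, so that several candidate replies to the same history can later be compared by their downstream diagnostic outcomes. The sketch below illustrates that structure; the class and the toy sampler are hypothetical stand-ins, not code from the SCF framework.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class TurnNode:
    """One node in a branched doctor-patient conversation."""
    history: List[str]                                        # utterances up to this turn
    children: List["TurnNode"] = field(default_factory=list)
    reward: float = 0.0                                       # e.g. diagnostic accuracy at a leaf

def expand(node: TurnNode, sample_reply: Callable[[List[str]], str], branches: int = 3) -> None:
    """Sample several candidate continuations of the same history, so training can
    compare how different early responses affect later turns."""
    for _ in range(branches):
        reply = sample_reply(node.history)
        node.children.append(TurnNode(history=node.history + [reply]))

# Toy stand-in for the policy LLM's sampler.
def toy_sampler(history: List[str]) -> str:
    return random.choice(["Ask about symptom onset.", "Order an ECG.", "Ask about family history."])

root = TurnNode(history=["Patient: I have had chest pain for two days."])
expand(root, toy_sampler)                # depth 1: three candidate doctor replies
for child in root.children:
    expand(child, toy_sampler)           # depth 2: continuations of each branch
print(sum(len(c.children) for c in root.children), "leaf nodes in the forest")
```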

Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching Read Post »

AI, Committee, News, Uncategorized

Testing Hypotheses from the Social Approval Theory of Online Hate: An Analysis of 110 Million Posts from Parler

arXiv:2507.10810v1 Announce Type: new Abstract: In this paper, we explored how online hate is motivated by receiving social approval from others. We specifically examined two central tenets of Walther's (2024) social approval theory of online hate: (H1a) more signals of social approval on hate messages predict more subsequent hate messages, and (H1b) as social approval increases, hate speech messages become more extreme. Using over 110 million posts from Parler (2018-2021), we observed that the number of upvotes a person received on a hate speech post was unassociated with the amount of hate speech in their next post and posts during the next week, month, three months, and six months. Between-person effects revealed an average negative relationship between social approval and hate speech production at the post level, but this relationship was mixed at other time intervals. Social approval reinforcement mechanisms of online hate may operate differently on niche social media platforms.

Testing Hypotheses from the Social Approval Theory of Online Hate: An Analysis of 110 Million Posts from Parler Read Post »

AI, Committee, News, Uncategorized

NVIDIA Just Released Audio Flamingo 3: An Open-Source Model Advancing Audio General Intelligence

Heard about Artificial General Intelligence (AGI)? Meet its auditory counterpart—Audio General Intelligence. With Audio Flamingo 3 (AF3), NVIDIA introduces a major leap in how machines understand and reason about sound. While past models could transcribe speech or classify audio clips, they lacked the ability to interpret audio in a context-rich, human-like way—across speech, ambient sound, and music, and over extended durations. AF3 changes that.

With Audio Flamingo 3, NVIDIA introduces a fully open-source large audio-language model (LALM) that not only hears but also understands and reasons. Built on a five-stage curriculum and powered by the AF-Whisper encoder, AF3 supports long audio inputs (up to 10 minutes), multi-turn multi-audio chat, on-demand thinking, and even voice-to-voice interactions. This sets a new bar for how AI systems interact with sound, bringing us a step closer to AGI.

The Core Innovations Behind Audio Flamingo 3

AF-Whisper: A Unified Audio Encoder

AF3 uses AF-Whisper, a novel encoder adapted from Whisper-v3. It processes speech, ambient sounds, and music using the same architecture—solving a major limitation of earlier LALMs, which used separate encoders, leading to inconsistencies. AF-Whisper leverages audio-caption datasets, synthesized metadata, and a dense 1280-dimensional embedding space to align with text representations.

Chain-of-Thought for Audio: On-Demand Reasoning

Unlike static QA systems, AF3 is equipped with 'thinking' capabilities. Using the AF-Think dataset (250k examples), the model can perform chain-of-thought reasoning when prompted, enabling it to explain its inference steps before arriving at an answer—a key step toward transparent audio AI.

Multi-Turn, Multi-Audio Conversations

Through the AF-Chat dataset (75k dialogues), AF3 can hold contextual conversations involving multiple audio inputs across turns. This mimics real-world interactions, where humans refer back to previous audio cues. It also introduces voice-to-voice conversations using a streaming text-to-speech module.

Long Audio Reasoning

AF3 is the first fully open model capable of reasoning over audio inputs up to 10 minutes long. Trained with LongAudio-XL (1.25M examples), the model supports tasks like meeting summarization, podcast understanding, sarcasm detection, and temporal grounding.

State-of-the-Art Benchmarks and Real-World Capability

AF3 surpasses both open and closed models on over 20 benchmarks, including:

MMAU (avg): 73.14% (+2.14% over Qwen2.5-O)
LongAudioBench: 68.6 (GPT-4o evaluation), beating Gemini 2.5 Pro
LibriSpeech (ASR): 1.57% WER, outperforming Phi-4-mm
ClothoAQA: 91.1% (vs. 89.2% for Qwen2.5-O)

These improvements aren't just marginal; they redefine what's expected from audio-language systems. AF3 also introduces benchmarking in voice chat and speech generation, achieving 5.94 s generation latency (vs. 14.62 s for Qwen2.5) and better similarity scores.

The Data Pipeline: Datasets That Teach Audio Reasoning

NVIDIA didn't just scale compute; it rethought the data:

AudioSkills-XL: 8M examples combining ambient, music, and speech reasoning.
LongAudio-XL: long-form speech from audiobooks, podcasts, and meetings.
AF-Think: promotes short CoT-style inference.
AF-Chat: designed for multi-turn, multi-audio conversations.

Each dataset is fully open-sourced, along with training code and recipes, enabling reproducibility and future research.
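As a side note on the encoder design described above: AF-Whisper produces 1280-dimensional audio features aligned with text representations, and a common way to hand such features to a language model is a small projection into the LM's embedding space. The PyTorch sketch below illustrates that connector pattern; the single linear layer and the 4096-dimensional LM width are assumptions for illustration, not the published AF3 architecture.

```python
import torch
import torch.nn as nn

class AudioToTextAdapter(nn.Module):
    """Project frame-level audio features (assumed 1280-dim, per the AF-Whisper
    description) into a language model's token-embedding space so audio frames
    can be consumed alongside text tokens. All dimensions here are illustrative."""
    def __init__(self, audio_dim: int = 1280, lm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(audio_dim, lm_dim)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim) -> (batch, frames, lm_dim)
        return self.proj(audio_feats)

adapter = AudioToTextAdapter()
dummy_frames = torch.randn(1, 750, 1280)    # illustrative clip of 750 audio frames
print(adapter(dummy_frames).shape)          # torch.Size([1, 750, 4096])
```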
Open Source

AF3 is not just a model drop. NVIDIA released:

Model weights
Training recipes
Inference code
Four open datasets

This transparency makes AF3 the most accessible state-of-the-art audio-language model. It opens new research directions in auditory reasoning, low-latency audio agents, music comprehension, and multi-modal interaction.

Conclusion: Toward General Audio Intelligence

Audio Flamingo 3 demonstrates that deep audio understanding is not just possible but reproducible and open. By combining scale, novel training strategies, and diverse data, NVIDIA delivers a model that listens, understands, and reasons in ways previous LALMs could not.

Check out the Paper, Codes and Model on Hugging Face. All credit for this research goes to the researchers of this project.

The post NVIDIA Just Released Audio Flamingo 3: An Open-Source Model Advancing Audio General Intelligence appeared first on MarkTechPost.

NVIDIA Just Released Audio Flamingo 3: An Open-Source Model Advancing Audio General Intelligence Read Post »

AI, Committee, News, Uncategorized

EventHunter: Dynamic Clustering and Ranking of Security Events from Hacker Forum Discussions

arXiv:2507.09762v1 Announce Type: cross Abstract: Hacker forums provide critical early warning signals for emerging cybersecurity threats, but extracting actionable intelligence from their unstructured and noisy content remains a significant challenge. This paper presents an unsupervised framework that automatically detects, clusters, and prioritizes security events discussed across hacker forum posts. Our approach leverages Transformer-based embeddings fine-tuned with contrastive learning to group related discussions into distinct security event clusters, identifying incidents like zero-day disclosures or malware releases without relying on predefined keywords. The framework incorporates a daily ranking mechanism that prioritizes identified events using quantifiable metrics reflecting timeliness, source credibility, information completeness, and relevance. Experimental evaluation on real-world hacker forum data demonstrates that our method effectively reduces noise and surfaces high-priority threats, enabling security analysts to mount proactive responses. By transforming disparate hacker forum discussions into structured, actionable intelligence, our work addresses fundamental challenges in automated threat detection and analysis.
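The daily ranking described in the abstract combines timeliness, source credibility, information completeness, and relevance into a single priority score. The snippet below sketches one such combination; the linear form and the specific weights are illustrative assumptions, not the paper's scoring function.

```python
from dataclasses import dataclass

@dataclass
class SecurityEvent:
    timeliness: float          # recency-decayed score in [0, 1]
    source_credibility: float  # forum/author reputation in [0, 1]
    completeness: float        # amount of actionable detail in the cluster
    relevance: float           # similarity to the analyst's watchlist

def priority(event: SecurityEvent, weights=(0.3, 0.25, 0.2, 0.25)) -> float:
    """Weighted combination of the four factors named in the abstract;
    the weights here are placeholders, not values from the paper."""
    w_t, w_c, w_i, w_r = weights
    return (w_t * event.timeliness + w_c * event.source_credibility
            + w_i * event.completeness + w_r * event.relevance)

events = [
    SecurityEvent(0.9, 0.6, 0.7, 0.8),   # fresh zero-day discussion
    SecurityEvent(0.4, 0.9, 0.9, 0.5),   # older but well-documented malware release
]
for e in sorted(events, key=priority, reverse=True):
    print(round(priority(e), 3), e)
```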

EventHunter: Dynamic Clustering and Ranking of Security Events from Hacker Forum Discussions Read Post »

AI, Committee, News, Uncategorized

READoc: A Unified Benchmark for Realistic Document Structured Extraction

arXiv:2409.05137v3 Announce Type: replace Abstract: Document Structured Extraction (DSE) aims to extract structured content from raw documents. Despite the emergence of numerous DSE systems, their unified evaluation remains inadequate, significantly hindering the field’s advancement. This problem is largely attributed to existing benchmark paradigms, which exhibit fragmented and localized characteristics. To address these limitations and offer a thorough evaluation of DSE systems, we introduce a novel benchmark named READoc, which defines DSE as a realistic task of converting unstructured PDFs into semantically rich Markdown. The READoc dataset is derived from 3,576 diverse and real-world documents from arXiv, GitHub, and Zenodo. In addition, we develop a DSE Evaluation S$^3$uite comprising Standardization, Segmentation and Scoring modules, to conduct a unified evaluation of state-of-the-art DSE approaches. By evaluating a range of pipeline tools, expert visual models, and general VLMs, we identify the gap between current work and the unified, realistic DSE objective for the first time. We aspire that READoc will catalyze future research in DSE, fostering more comprehensive and practical solutions.

READoc: A Unified Benchmark for Realistic Document Structured Extraction Read Post »

AI, Committee, News, Uncategorized

Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models

arXiv:2505.17826v2 Announce Type: replace-cross Abstract: Trinity-RFT is a general-purpose, unified and easy-to-use framework designed for reinforcement fine-tuning (RFT) of large language models. It is built with a modular and decoupled design, consisting of (1) an RFT-core that unifies and generalizes synchronous/asynchronous, on-policy/off-policy, and online/offline modes of RFT; (2) seamless integration for agent-environment interaction with high efficiency and robustness; and (3) systematic data pipelines optimized for RFT. Trinity-RFT can be easily adapted for diverse application scenarios, and serves as a unified platform for development and research of advanced reinforcement learning paradigms at both macroscopic and microscopic levels. This technical report outlines the vision, features, design and implementations of Trinity-RFT, accompanied by extensive examples, applications and experiments that demonstrate its functionalities and user-friendliness.

Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models Read Post »

AI, Committee, News, Uncategorized

Single Word Change is All You Need: Using LLMs to Create Synthetic Training Examples for Text Classifiers

arXiv:2401.17196v3 Announce Type: replace Abstract: In text classification, creating an adversarial example means subtly perturbing a few words in a sentence without changing its meaning, causing it to be misclassified by a classifier. A concerning observation is that a significant portion of adversarial examples generated by existing methods change only one word. This single-word perturbation vulnerability represents a significant weakness in classifiers, which malicious users can exploit to efficiently create a multitude of adversarial examples. This paper studies this problem and makes the following key contributions: (1) We introduce a novel metric $\rho$ to quantitatively assess a classifier's robustness against single-word perturbation. (2) We present the SP-Attack, designed to exploit the single-word perturbation vulnerability, achieving a higher attack success rate, better preserving sentence meaning, while reducing computation costs compared to state-of-the-art adversarial methods. (3) We propose SP-Defense, which aims to improve $\rho$ by applying data augmentation in learning. Experimental results on 4 datasets and BERT and distilBERT classifiers show that SP-Defense improves $\rho$ by 14.6% and 13.9% and decreases the attack success rate of SP-Attack by 30.4% and 21.2% on two classifiers respectively, and decreases the attack success rate of existing attack methods that involve multiple-word perturbations.
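One way to read the single-word-perturbation robustness metric is as the fraction of inputs whose prediction survives every one-word substitution from a candidate pool. The brute-force sketch below implements that reading; it is an illustrative interpretation of $\rho$, not the paper's exact definition, and the toy classifier and word pool are hypothetical.

```python
from typing import Callable, List

def single_word_robustness(sentences: List[List[str]],
                           predict: Callable[[List[str]], int],
                           candidate_words: List[str]) -> float:
    """Fraction of sentences whose label survives every one-word substitution.
    Brute force is used purely for clarity; real attacks search far more cheaply."""
    robust = 0
    for tokens in sentences:
        original = predict(tokens)
        flipped = any(
            predict(tokens[:i] + [w] + tokens[i + 1:]) != original
            for i in range(len(tokens))
            for w in candidate_words
            if w != tokens[i]
        )
        robust += not flipped
    return robust / len(sentences)

# Toy classifier: label 1 iff the word "good" appears.
toy_predict = lambda toks: int("good" in toks)
data = [["this", "movie", "is", "good"], ["terrible", "plot", "and", "acting"]]
print(single_word_robustness(data, toy_predict, ["good", "bad"]))   # 0.0 for this toy case
```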

Single Word Change is All You Need: Using LLMs to Create Synthetic Training Examples for Text Classifiers Read Post »
