Partitioner Guided Modal Learning Framework

arXiv:2507.11661v1 Announce Type: new Abstract: Multimodal learning benefits from information across multiple modalities, and each learned modal representation can be divided into uni-modal features, which can be learned from uni-modal training, and paired-modal features, which can be learned from cross-modal interaction. Building on this perspective, we propose a partitioner-guided modal learning framework, PgM, which consists of a modal partitioner, a uni-modal learner, a paired-modal learner, and a uni-paired modal decoder. The modal partitioner segments the learned modal representation into uni-modal and paired-modal features. The modal learner incorporates two dedicated components for uni-modal and paired-modal learning. The uni-paired modal decoder reconstructs the modal representation from its uni-modal and paired-modal features. PgM offers three key benefits: 1) thorough learning of uni-modal and paired-modal features, 2) flexible distribution adjustment for uni-modal and paired-modal representations to suit diverse downstream tasks, and 3) different learning rates across modalities and partitions. Extensive experiments demonstrate the effectiveness of PgM across four multimodal tasks and further highlight its transferability to existing models. Additionally, we visualize the distribution of uni-modal and paired-modal features across modalities and tasks, offering insights into their respective contributions.
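The abstract describes the framework only at a high level. The following minimal sketch illustrates the partition-learn-reconstruct flow under stated assumptions (a soft sigmoid mask as the partitioner, small MLP learners, and an MSE reconstruction loss); none of these design choices are taken from the paper.

```python
# Illustrative sketch of the PgM idea (assumed design, not the authors' code):
# a learned partitioner splits a modal representation into uni-modal and
# paired-modal parts, and a decoder reconstructs the original representation.
import torch
import torch.nn as nn

class PartitionerGuidedModule(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.partition_gate = nn.Linear(dim, dim)   # soft mask over feature dimensions
        self.uni_learner = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.paired_learner = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.decoder = nn.Linear(2 * dim, dim)      # reconstructs the full representation

    def forward(self, h: torch.Tensor):
        mask = torch.sigmoid(self.partition_gate(h))       # values in (0, 1)
        uni = self.uni_learner(mask * h)                    # uni-modal partition
        paired = self.paired_learner((1.0 - mask) * h)      # paired-modal partition
        recon = self.decoder(torch.cat([uni, paired], dim=-1))
        return uni, paired, recon

# A reconstruction loss ties the two partitions back to the full representation.
h = torch.randn(8, 256)                 # a batch of learned modal representations
model = PartitionerGuidedModule()
uni, paired, recon = model(h)
loss = nn.functional.mse_loss(recon, h)
loss.backward()
```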

NeuralOS: A Generative Framework for Simulating Interactive Operating System Interfaces

Transforming Human-Computer Interaction with Generative Interfaces

Recent advances in generative models are transforming the way we interact with computers, making experiences more natural, adaptive, and personalized. Early interfaces, such as command-line tools and static menus, were fixed and required users to adapt to the machine. Now, with the rise of LLMs and multimodal AI, users can engage with systems using everyday language, images, and even video. Newer models are even capable of simulating dynamic environments, such as those found in video games, in real time. These trends point toward a future where computer interfaces aren't just responsive but generative, tailoring themselves to our goals, preferences, and the evolving context around us.

Evolution of Generative Models for Simulating Environments

Recent generative modeling approaches have made significant progress in simulating interactive environments. Early models, such as World Models, utilized latent variables to simulate reinforcement learning tasks, while GameGAN and Genie enabled the imitation of interactive games and the creation of playable 2D worlds. Diffusion-based models have further advanced this field, with tools like GameNGen, MarioVGG, DIAMOND, and GameGen-X simulating iconic and open-world games with remarkable fidelity. Beyond gaming, models such as UniSim simulate real-world scenarios, and Pandora allows video generation controlled by natural language prompts. While these efforts excel at dynamic, visually rich simulations, simulating subtle GUI transitions and precise user input, such as cursor movement, remains a unique and complex challenge.

Introducing NeuralOS: A Diffusion-RNN Based OS Simulator

Researchers from the University of Waterloo and the National Research Council Canada have introduced NeuralOS, a neural framework that simulates operating system interfaces by directly generating screen frames from user inputs such as mouse movements, clicks, and keystrokes. NeuralOS combines a recurrent neural network that tracks system state with a diffusion-based renderer that produces realistic GUI images. Trained on large-scale Ubuntu XFCE interaction data, it accurately models application launches and cursor behavior, although fine-grained keyboard input remains a challenge. NeuralOS marks a step toward adaptive, generative user interfaces that could eventually replace traditional static menus with more intuitive, AI-driven interaction.

Architectural Design and Training Pipeline of NeuralOS

NeuralOS is built on a modular design that mimics the separation of internal logic and GUI rendering found in traditional operating systems. It uses a hierarchical RNN to track user-driven state changes and a latent-space diffusion model to generate screen visuals. User inputs, such as cursor movements and key presses, are encoded and processed by the RNN, which maintains system memory over time. The renderer then uses these outputs, together with spatial cursor maps, to produce realistic frames. Training involves multiple stages, including pretraining the RNN, joint training, scheduled sampling, and context extension, to handle long-term dependencies, reduce errors, and adapt effectively to real user interactions.
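To make this data flow concrete, here is a minimal sketch of an RNN-plus-renderer loop of the kind described above. All module names, shapes, and the linear stand-in for the diffusion renderer are illustrative assumptions; the actual implementation is in the project's GitHub repository.

```python
# Illustrative control loop for an RNN-state + diffusion-renderer simulator.
# Module names and shapes are assumptions for the sketch, not the NeuralOS API.
import torch
import torch.nn as nn

class OSStateRNN(nn.Module):
    def __init__(self, event_dim=64, state_dim=512):
        super().__init__()
        self.rnn = nn.GRUCell(event_dim, state_dim)

    def forward(self, event_vec, state):
        return self.rnn(event_vec, state)        # updated latent OS state

def cursor_map(x, y, h=48, w=64):
    """Spatial map marking the cursor position (one channel, assumed encoding)."""
    m = torch.zeros(1, 1, h, w)
    m[0, 0, min(int(y), h - 1), min(int(x), w - 1)] = 1.0
    return m

class FrameRenderer(nn.Module):
    """Stand-in for the latent diffusion renderer: conditions on state + cursor map."""
    def __init__(self, state_dim=512, h=48, w=64):
        super().__init__()
        self.to_frame = nn.Linear(state_dim + h * w, 3 * h * w)
        self.h, self.w = h, w

    def forward(self, state, cmap):
        cond = torch.cat([state, cmap.flatten(1)], dim=-1)
        return self.to_frame(cond).view(-1, 3, self.h, self.w)

state = torch.zeros(1, 512)
rnn, renderer = OSStateRNN(), FrameRenderer()
for event in [torch.randn(1, 64) for _ in range(3)]:    # encoded mouse/keyboard events
    state = rnn(event, state)
    frame = renderer(state, cursor_map(x=10, y=20))      # one simulated screen frame
```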
Evaluation and Accuracy of Simulated GUI Transitions

Due to the high training costs, the NeuralOS team evaluated smaller variants and ablations on a curated set of 730 examples. To assess how well the model localizes the cursor, they trained a regression model and found that NeuralOS predicts cursor positions accurately, to within approximately 1.5 pixels, far outperforming variants without spatial encoding. For state transitions such as opening applications, NeuralOS achieved 37.7% accuracy across 73 challenging transition types, significantly outperforming the baseline. Ablation studies revealed that removing joint training resulted in blurry outputs and missing cursors, whereas skipping scheduled sampling led to a rapid decline in prediction quality over time.

Conclusion: Toward Fully Generative Operating Systems

In conclusion, NeuralOS is a framework that simulates operating system interfaces using generative models. It blends an RNN that tracks system state with a diffusion model that renders screen images based on user actions. Trained on Ubuntu desktop interactions, NeuralOS can generate realistic screen sequences and predict mouse behavior; however, handling detailed keyboard input remains challenging. While the model shows promise, it is limited by its low resolution, slow speed (1.8 fps), and inability to perform complex OS tasks, such as installing software or accessing the internet. Future work may focus on language-driven controls, better performance, and expanding functionality beyond current OS boundaries. Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. The post NeuralOS: A Generative Framework for Simulating Interactive Operating System Interfaces appeared first on MarkTechPost.

JarvisArt: A Human-in-the-Loop Multimodal Agent for Region-Specific and Global Photo Editing

Bridging the Gap Between Artistic Intent and Technical Execution

Photo retouching is a core aspect of digital photography, enabling users to manipulate image elements such as tone, exposure, and contrast to create visually compelling content. Whether for professional purposes or personal expression, users often seek to enhance images in ways that align with specific aesthetic goals. However, photo retouching requires both technical knowledge and creative sensibility, making it difficult to achieve high-quality results without significant effort or expertise. The key problem arises from the gap between manual editing tools and automated solutions. While professional software such as Adobe Lightroom offers extensive retouching options, mastering these tools can be time-consuming and difficult for casual users. Conversely, AI-driven methods tend to oversimplify the editing process, failing to offer the control or precision required for nuanced edits. These automated solutions also struggle to generalize across diverse visual scenes or to support complex user instructions.

Limitations of Current AI-Based Photo Editing Models

Traditional tools have relied on zeroth- and first-order optimization, as well as reinforcement learning, to handle photo retouching tasks; others use diffusion-based methods for image synthesis. These strategies show progress but are generally hampered by their inability to handle fine-grained regional control, maintain high-resolution outputs, or preserve the underlying content of the image. Even recent large models, such as GPT-4o and Gemini-2-Flash, offer text-driven editing but compromise user control, and their generative processes often overwrite critical content details.

JarvisArt: A Multimodal AI Retoucher Integrating Chain-of-Thought and Lightroom APIs

Researchers from Xiamen University, the Chinese University of Hong Kong, Bytedance, the National University of Singapore, and Tsinghua University introduced JarvisArt, an intelligent retouching agent. This system leverages a multimodal large language model to enable flexible, instruction-guided image editing. JarvisArt is trained to emulate the decision-making process of professional artists, interpreting user intent through both visual and language cues and executing retouching actions across more than 200 tools in Adobe Lightroom via a custom integration protocol. The methodology integrates three major components. First, the researchers constructed a high-quality dataset, MMArt, which includes 5,000 standard and 50,000 Chain-of-Thought-annotated samples spanning various editing styles and complexities. JarvisArt then undergoes a two-stage training process: the initial phase uses supervised fine-tuning to build reasoning and tool-selection capabilities, followed by Group Relative Policy Optimization for Retouching (GRPO-R), which incorporates customized tool-use rewards, such as retouching accuracy and perceptual quality, to refine the system's ability to generate professional-quality edits. A specialized Agent-to-Lightroom (A2L) protocol ensures the seamless and transparent execution of tools within Lightroom, enabling users to dynamically adjust edits.
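The paper's GRPO-R objective is not reproduced here; the sketch below only illustrates the group-relative idea with a composite tool-use reward. Every function, weight, and the scalar "edit" representation in it is an assumption made for illustration.

```python
# Illustrative group-relative reward signal with a composite reward, in the
# spirit of GRPO-R; reward terms, weights, and stubs are assumptions.
import random
from statistics import mean, pstdev

def retouch_accuracy(edit, reference):
    """Placeholder: agreement between predicted and reference tool parameters."""
    return 1.0 - abs(edit - reference)   # assumes a scalar "edit strength" for the sketch

def perceptual_quality(edit):
    """Placeholder for a perceptual-quality score of the rendered result."""
    return random.random()

def composite_reward(edit, reference, w_acc=0.7, w_qual=0.3):
    # Weights are illustrative assumptions, not the paper's values.
    return w_acc * retouch_accuracy(edit, reference) + w_qual * perceptual_quality(edit)

def group_relative_advantages(rewards):
    """GRPO-style: score each sampled candidate relative to its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# One user instruction -> a group of candidate edit plans sampled from the policy;
# above-average candidates get positive advantages and are reinforced.
reference = 0.5
candidates = [random.random() for _ in range(8)]
rewards = [composite_reward(c, reference) for c in candidates]
advantages = group_relative_advantages(rewards)   # weights for the policy update
```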
Benchmarking JarvisArt's Capabilities and Real-World Performance

JarvisArt's ability to interpret complex instructions and apply nuanced edits was evaluated on MMArt-Bench, a benchmark constructed from real user edits. The system delivered a 60% improvement in average pixel-level metrics for content fidelity compared to GPT-4o while maintaining similar instruction-following capabilities. It also demonstrated versatility in handling both global image edits and localized refinements, and it can manipulate images of arbitrary resolution. For example, it can adjust skin texture, eye brightness, or hair definition based on region-specific instructions. These results were achieved while preserving the aesthetic goals defined by the user, showing a practical blend of control and quality across multiple editing tasks.

Conclusion: A Generative Agent That Fuses Creativity With Technical Precision

The research team tackled a significant challenge: enabling intelligent, high-quality photo retouching that does not require professional expertise. The method they introduced bridges the gap between automation and user control by combining data synthesis, reasoning-driven training, and integration with commercial software. JarvisArt offers a practical and powerful solution for creative users who seek both flexibility and quality in their image editing. Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. The post JarvisArt: A Human-in-the-Loop Multimodal Agent for Region-Specific and Global Photo Editing appeared first on MarkTechPost.

LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating

arXiv:2412.18424v3 Announce Type: replace-cross Abstract: Large vision language models (LVLMs) have remarkably improved document understanding capabilities, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks are limited to a small number of pages and fail to provide a comprehensive analysis of layout-element locating. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating, and then propose a comprehensive benchmark, LongDocURL, integrating the above three primary tasks and comprising 20 sub-tasks categorized by primary task and answer evidence. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs covering more than 33,000 pages of documents, significantly surpassing existing benchmarks in scale. Subsequently, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.

Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers

arXiv:2507.10787v1 Announce Type: new Abstract: This paper introduces MISS-QA, the first benchmark specifically designed to evaluate the ability of models to interpret schematic diagrams within scientific literature. MISS-QA comprises 1,500 expert-annotated examples over 465 scientific papers. In this benchmark, models are tasked with interpreting schematic diagrams that illustrate research overviews and answering corresponding information-seeking questions based on the broader context of the paper. We assess the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL. We reveal a significant performance gap between these models and human experts on MISS-QA. Our analysis of model performance on unanswerable questions and our detailed error analysis further highlight the strengths and limitations of current models, offering key insights to enhance models in comprehending multimodal scientific literature.

Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching

arXiv:2507.04099v2 Announce Type: replace Abstract: Fine-tuning methods such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) have demonstrated success in training large language models (LLMs) for single-turn tasks. However, these methods fall short in multi-turn applications, such as diagnostic patient interviewing, where understanding how early conversational turns influence downstream completions and outcomes is essential. In medicine, a multi-turn perspective is critical for learning diagnostic schemas and better understanding conversation dynamics. To address this gap, I introduce Savage Conversation Forests (SCF), a reinforcement learning framework that leverages a branched conversation architecture to fine-tune LLMs for multi-turn dialogue. SCF generates multiple possible conversation continuations at each turn, enabling the model to learn how different early responses affect downstream interactions and diagnostic outcomes. In experiments simulating doctor-patient conversations, SCF with branching outperforms linear conversation architectures on diagnostic accuracy. I hypothesize that SCF's improvements stem from its ability to provide richer, interdependent training signals across conversation turns. These results suggest that a branched training architecture is an important strategy for fine-tuning LLMs in complex multi-turn conversational tasks.
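As a rough illustration of the branched conversation architecture, the sketch below grows a small tree of candidate continuations. The branching factor, depth, and the generate_reply stub are assumptions for illustration rather than details from the paper.

```python
# Illustrative conversation tree for branched multi-turn training (assumed design).
from dataclasses import dataclass, field

@dataclass
class TurnNode:
    text: str
    children: list = field(default_factory=list)

def generate_reply(history, k):
    """Placeholder for sampling the k-th candidate reply from an LLM."""
    return f"candidate reply {k} after {len(history)} turns"

def grow_forest(root_prompt, depth=3, branching=2):
    """Expand each turn into several continuations, yielding a tree of dialogues."""
    root = TurnNode(root_prompt)
    frontier = [(root, [root_prompt])]
    for _ in range(depth):
        next_frontier = []
        for node, history in frontier:
            for k in range(branching):
                child = TurnNode(generate_reply(history, k))
                node.children.append(child)
                next_frontier.append((child, history + [child.text]))
        frontier = next_frontier
    return root

tree = grow_forest("Patient: I have had a cough for two weeks.")
# Each root-to-leaf path is one simulated dialogue; comparing sibling branches
# provides the interdependent training signal described in the abstract.
```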

Testing Hypotheses from the Social Approval Theory of Online Hate: An Analysis of 110 Million Posts from Parler

arXiv:2507.10810v1 Announce Type: new Abstract: In this paper, we explored how online hate is motivated by receiving social approval from others. We specifically examined two central tenets of Walther's (2024) social approval theory of online hate: (H1a) more signals of social approval on hate messages predict more subsequent hate messages, and (H1b) as social approval increases, hate speech messages become more extreme. Using over 110 million posts from Parler (2018-2021), we observed that the number of upvotes a person received on a hate speech post was not associated with the amount of hate speech in their next post or in their posts over the next week, month, three months, and six months. Between-person effects revealed an average negative relationship between social approval and hate speech production at the post level, but this relationship was mixed at other time intervals. Social approval reinforcement mechanisms of online hate may operate differently on niche social media platforms.

NVIDIA Just Released Audio Flamingo 3: An Open-Source Model Advancing Audio General Intelligence

Heard about Artificial General Intelligence (AGI)? Meet its auditory counterpart: Audio General Intelligence. With Audio Flamingo 3 (AF3), NVIDIA introduces a major leap in how machines understand and reason about sound. While past models could transcribe speech or classify audio clips, they lacked the ability to interpret audio in a context-rich, human-like way across speech, ambient sound, and music, and over extended durations. AF3 changes that. With Audio Flamingo 3, NVIDIA introduces a fully open-source large audio-language model (LALM) that not only hears but also understands and reasons. Built on a five-stage curriculum and powered by the AF-Whisper encoder, AF3 supports long audio inputs (up to 10 minutes), multi-turn multi-audio chat, on-demand thinking, and even voice-to-voice interactions. This sets a new bar for how AI systems interact with sound, bringing us a step closer to AGI.

The Core Innovations Behind Audio Flamingo 3

AF-Whisper: A Unified Audio Encoder

AF3 uses AF-Whisper, a novel encoder adapted from Whisper-v3. It processes speech, ambient sounds, and music using the same architecture, solving a major limitation of earlier LALMs, which used separate encoders and consequently suffered from inconsistencies. AF-Whisper leverages audio-caption datasets, synthesized metadata, and a dense 1280-dimensional embedding space to align with text representations.

Chain-of-Thought for Audio: On-Demand Reasoning

Unlike static QA systems, AF3 is equipped with 'thinking' capabilities. Using the AF-Think dataset (250k examples), the model can perform chain-of-thought reasoning when prompted, enabling it to explain its inference steps before arriving at an answer, a key step toward transparent audio AI.

Multi-Turn, Multi-Audio Conversations

Through the AF-Chat dataset (75k dialogues), AF3 can hold contextual conversations involving multiple audio inputs across turns. This mimics real-world interactions, where humans refer back to previous audio cues. It also introduces voice-to-voice conversations using a streaming text-to-speech module.

Long Audio Reasoning

AF3 is the first fully open model capable of reasoning over audio inputs up to 10 minutes long. Trained with LongAudio-XL (1.25M examples), the model supports tasks such as meeting summarization, podcast understanding, sarcasm detection, and temporal grounding.
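As a rough illustration of how a ten-minute input could be processed by a fixed-window encoder, the sketch below chunks a waveform and stacks per-window embeddings. The window length, sampling rate, and encoder stub are assumptions, not AF-Whisper's actual configuration; only the 1280-dimensional embedding size is taken from the article.

```python
# Illustrative long-audio chunking for a fixed-window encoder (assumed settings).
import numpy as np

SAMPLE_RATE = 16_000           # assumed sampling rate
WINDOW_SEC, HOP_SEC = 30, 30   # assumed non-overlapping 30 s windows

def encode_window(window: np.ndarray) -> np.ndarray:
    """Placeholder for an audio encoder returning a 1280-dim embedding."""
    rng = np.random.default_rng(len(window))
    return rng.standard_normal(1280)

def encode_long_audio(waveform: np.ndarray) -> np.ndarray:
    """Split a long waveform into windows and stack per-window embeddings."""
    step = HOP_SEC * SAMPLE_RATE
    windows = [waveform[i:i + WINDOW_SEC * SAMPLE_RATE]
               for i in range(0, len(waveform), step)]
    return np.stack([encode_window(w) for w in windows])  # (num_windows, 1280)

ten_minutes = np.zeros(10 * 60 * SAMPLE_RATE, dtype=np.float32)
embeddings = encode_long_audio(ten_minutes)    # 20 windows x 1280 dims
# These window embeddings would then be passed to the language model for reasoning.
```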
State-of-the-Art Benchmarks and Real-World Capability

AF3 surpasses both open and closed models on over 20 benchmarks, including:

MMAU (avg): 73.14% (+2.14% over Qwen2.5-O)
LongAudioBench: 68.6 (GPT-4o evaluation), beating Gemini 2.5 Pro
LibriSpeech (ASR): 1.57% WER, outperforming Phi-4-mm
ClothoAQA: 91.1% (vs. 89.2% from Qwen2.5-O)

These improvements aren't just marginal; they redefine what's expected from audio-language systems. AF3 also introduces benchmarking in voice chat and speech generation, achieving 5.94s generation latency (vs. 14.62s for Qwen2.5) and better similarity scores.

The Data Pipeline: Datasets That Teach Audio Reasoning

NVIDIA didn't just scale compute; they rethought the data:

AudioSkills-XL: 8M examples combining ambient, music, and speech reasoning.
LongAudio-XL: Covers long-form speech from audiobooks, podcasts, and meetings.
AF-Think: Promotes short CoT-style inference.
AF-Chat: Designed for multi-turn, multi-audio conversations.

Each dataset is fully open-sourced, along with training code and recipes, enabling reproducibility and future research.

Open Source

AF3 is not just a model drop. NVIDIA released:

Model weights
Training recipes
Inference code
Four open datasets

This transparency makes AF3 the most accessible state-of-the-art audio-language model. It opens new research directions in auditory reasoning, low-latency audio agents, music comprehension, and multimodal interaction.

Conclusion: Toward General Audio Intelligence

Audio Flamingo 3 demonstrates that deep audio understanding is not just possible but reproducible and open. By combining scale, novel training strategies, and diverse data, NVIDIA delivers a model that listens, understands, and reasons in ways previous LALMs could not. Check out the Paper, Codes and Model on Hugging Face. All credit for this research goes to the researchers of this project. The post NVIDIA Just Released Audio Flamingo 3: An Open-Source Model Advancing Audio General Intelligence appeared first on MarkTechPost.

Single Word Change is All You Need: Using LLMs to Create Synthetic Training Examples for Text Classifiers

arXiv:2401.17196v3 Announce Type: replace Abstract: In text classification, creating an adversarial example means subtly perturbing a few words in a sentence without changing its meaning, causing it to be misclassified by a classifier. A concerning observation is that a significant portion of adversarial examples generated by existing methods change only one word. This single-word perturbation vulnerability represents a significant weakness in classifiers, which malicious users can exploit to efficiently create a multitude of adversarial examples. This paper studies this problem and makes the following key contributions: (1) We introduce a novel metric $\rho$ to quantitatively assess a classifier's robustness against single-word perturbation. (2) We present SP-Attack, designed to exploit the single-word perturbation vulnerability, achieving a higher attack success rate and better preserving sentence meaning while reducing computation costs compared to state-of-the-art adversarial methods. (3) We propose SP-Defense, which aims to improve $\rho$ by applying data augmentation in learning. Experimental results on 4 datasets and BERT and distilBERT classifiers show that SP-Defense improves $\rho$ by 14.6% and 13.9% and decreases the attack success rate of SP-Attack by 30.4% and 21.2% on the two classifiers, respectively, and also decreases the attack success rate of existing attack methods that involve multiple-word perturbations.
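The abstract does not define $\rho$ precisely; the sketch below only conveys the general idea of measuring robustness to single-word substitutions, with the classifier and the substitution source left as placeholder stubs rather than the paper's metric.

```python
# Illustrative estimate of single-word-perturbation robustness (a stand-in for
# the paper's rho; the classifier and substitution lists are stubs).
def classify(sentence: str) -> int:
    """Placeholder classifier: returns a label id."""
    return int("good" in sentence.lower())

def candidate_substitutions(word: str) -> list[str]:
    """Placeholder for meaning-preserving replacements (e.g., from an LLM)."""
    return {"good": ["great", "fine"], "bad": ["poor", "awful"]}.get(word.lower(), [])

def robust_to_single_word_change(sentence: str) -> bool:
    """True if no single-word substitution flips the classifier's prediction."""
    original = classify(sentence)
    words = sentence.split()
    for i, word in enumerate(words):
        for sub in candidate_substitutions(word):
            perturbed = " ".join(words[:i] + [sub] + words[i + 1:])
            if classify(perturbed) != original:
                return False
    return True

dataset = ["the movie was good", "the movie was bad", "service was good overall"]
robust_fraction = sum(robust_to_single_word_change(s) for s in dataset) / len(dataset)
```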

EventHunter: Dynamic Clustering and Ranking of Security Events from Hacker Forum Discussions

arXiv:2507.09762v1 Announce Type: cross Abstract: Hacker forums provide critical early warning signals for emerging cybersecurity threats, but extracting actionable intelligence from their unstructured and noisy content remains a significant challenge. This paper presents an unsupervised framework that automatically detects, clusters, and prioritizes security events discussed across hacker forum posts. Our approach leverages Transformer-based embeddings fine-tuned with contrastive learning to group related discussions into distinct security event clusters, identifying incidents like zero-day disclosures or malware releases without relying on predefined keywords. The framework incorporates a daily ranking mechanism that prioritizes identified events using quantifiable metrics reflecting timeliness, source credibility, information completeness, and relevance. Experimental evaluation on real-world hacker forum data demonstrates that our method effectively reduces noise and surfaces high-priority threats, enabling security analysts to mount proactive responses. By transforming disparate hacker forum discussions into structured, actionable intelligence, our work addresses fundamental challenges in automated threat detection and analysis.
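As an illustration of the detect-cluster-rank pipeline described above, the sketch below greedily groups post embeddings by cosine similarity and scores events with a weighted sum of the four ranking signals. The similarity threshold, weights, and greedy clustering rule are assumptions for illustration, not the paper's method, which fine-tunes Transformer embeddings with contrastive learning.

```python
# Illustrative grouping of post embeddings into event clusters and ranking of
# events by a weighted score; threshold, weights, and clustering rule are assumed.
import numpy as np

def greedy_cluster(embeddings: np.ndarray, threshold: float = 0.8) -> list[list[int]]:
    """Assign each post to the first cluster whose centroid is similar enough."""
    clusters: list[list[int]] = []
    centroids: list[np.ndarray] = []
    for i, e in enumerate(embeddings):
        e = e / np.linalg.norm(e)
        sims = [float(c @ e) for c in centroids]
        if sims and max(sims) >= threshold:
            j = int(np.argmax(sims))
            clusters[j].append(i)
            centroid = embeddings[clusters[j]].mean(axis=0)
            centroids[j] = centroid / np.linalg.norm(centroid)
        else:
            clusters.append([i])
            centroids.append(e)
    return clusters

def event_priority(timeliness, credibility, completeness, relevance,
                   weights=(0.4, 0.3, 0.15, 0.15)):
    """Weighted daily ranking score; the weights are illustrative, not the paper's."""
    return sum(w * s for w, s in zip(weights, (timeliness, credibility, completeness, relevance)))

posts = np.random.default_rng(0).standard_normal((6, 384))   # stand-in post embeddings
events = greedy_cluster(posts)
score = event_priority(timeliness=0.9, credibility=0.7, completeness=0.5, relevance=0.8)
```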
