
UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning

arXiv:2502.15082v2 Announce Type: replace-cross Abstract: User specifications or legal frameworks often require information to be removed from pretrained models, including large language models (LLMs). This requires deleting or “forgetting” a set of data points from an already-trained model, which typically degrades its performance on other data points. Thus, a balance must be struck between removing information and keeping the model’s other abilities intact, with a failure to balance this trade-off leading to poor deletion or an unusable model. To this end, we propose UPCORE (Utility-Preserving Coreset Selection), a method-agnostic data selection framework for mitigating collateral damage during unlearning. Finding that the model damage is correlated with the variance of the model’s representations on the forget set, we selectively prune the forget set to remove outliers, thereby minimizing model degradation after unlearning. Across three standard unlearning methods, UPCORE consistently achieves a superior balance between the competing objectives of deletion efficacy and model preservation. To better evaluate this trade-off, we introduce a new metric, measuring the area-under-the-curve (AUC) across standard metrics. Our results show that UPCORE improves both standard metrics and AUC, benefiting from positive transfer between the coreset and pruned points while reducing negative transfer from the forget set to points outside of it.
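To make the core selection step concrete, here is a minimal Python sketch of variance-reducing coreset selection: it prunes the forget-set points whose hidden-state representations lie farthest from the set centroid, so the retained coreset has lower representation variance. The distance rule, the `keep_fraction` parameter, and the toy data are illustrative assumptions, not the paper's exact outlier detector.

```python
import numpy as np

def select_coreset(hidden_states: np.ndarray, keep_fraction: float = 0.8) -> np.ndarray:
    """Return indices of forget-set points to keep as the coreset.

    hidden_states: (n_points, d) array of the model's representations of the
    forget-set examples. Points farthest from the centroid are treated as
    outliers and pruned, lowering the representation variance of what remains.
    """
    centroid = hidden_states.mean(axis=0)
    dists = np.linalg.norm(hidden_states - centroid, axis=1)
    n_keep = int(len(dists) * keep_fraction)
    # Keep the n_keep points closest to the centroid.
    return np.argsort(dists)[:n_keep]

# Toy usage: 100 fake 16-dimensional representations.
rng = np.random.default_rng(0)
reps = rng.normal(size=(100, 16))
coreset_idx = select_coreset(reps, keep_fraction=0.8)
print(len(coreset_idx), "of", len(reps), "forget points retained")
```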


Automatically assessing oral narratives of Afrikaans and isiXhosa children

arXiv:2507.13205v1 Announce Type: new Abstract: Developing narrative and comprehension skills in early childhood is critical for later literacy. However, teachers in large preschool classrooms struggle to accurately identify students who require intervention. We present a system for automatically assessing oral narratives of preschool children in Afrikaans and isiXhosa. The system uses automatic speech recognition followed by a machine learning scoring model to predict narrative and comprehension scores. For scoring predicted transcripts, we compare a linear model to a large language model (LLM). The LLM-based system outperforms the linear model in most cases, but the linear system is competitive despite its simplicity. The LLM-based system is comparable to a human expert in flagging children who require intervention. We lay the foundation for automatic oral assessments in classrooms, giving teachers extra capacity to focus on personalised support for children’s learning.
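As an illustration of the simpler of the two scoring approaches, the sketch below wires ASR transcripts into a linear scoring model using TF-IDF features and ridge regression. The feature choice, model, and toy data are assumptions made for illustration; the paper's actual linear system and score definitions may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical ASR transcripts and expert-assigned narrative scores.
transcripts = [
    "the dog ran to the house and then he was happy",
    "cat sat",
    "the girl lost her ball and looked everywhere until she found it",
]
scores = [3.0, 1.0, 4.0]

# Linear scoring model: bag-of-words TF-IDF features + ridge regression.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))
model.fit(transcripts, scores)

# Predict a narrative score for a new (hypothetical) transcript.
print(model.predict(["the boy told a long story about his dog"]))
```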


Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

arXiv:2507.06261v3 Announce Type: replace Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its strong coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and can now process up to 3 hours of video content. Its combination of long-context, multimodal, and reasoning capabilities unlocks new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements, and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs. cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.


AI Wizards at CheckThat! 2025: Enhancing Transformer-Based Embeddings with Sentiment for Subjectivity Detection in News Articles

arXiv:2507.11764v1 Announce Type: new Abstract: This paper presents AI Wizards’ participation in the CLEF 2025 CheckThat! Lab Task 1: Subjectivity Detection in News Articles, classifying sentences as subjective or objective in monolingual, multilingual, and zero-shot settings. Training and development datasets were provided for Arabic, German, English, Italian, and Bulgarian; the final evaluation included additional unseen languages (e.g., Greek, Romanian, Polish, Ukrainian) to assess generalization. Our primary strategy enhanced transformer-based classifiers by integrating sentiment scores, derived from an auxiliary model, with sentence representations, aiming to improve upon standard fine-tuning. We explored this sentiment-augmented architecture with mDeBERTaV3-base, ModernBERT-base (English), and Llama3.2-1B. To address class imbalance, prevalent across languages, we employed decision-threshold calibration optimized on the development set. Our experiments show that sentiment feature integration significantly boosts performance, especially the subjective F1 score. This framework led to high rankings, notably 1st for Greek (Macro F1 = 0.51).
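A minimal sketch of the sentiment-augmented setup and the threshold calibration described above: a scalar sentiment score from an auxiliary model is concatenated onto each sentence embedding before a classifier, and the decision threshold is then tuned on the development set to maximize the subjective-class F1. The embeddings, classifier head, and data here are synthetic placeholders, not the authors' mDeBERTaV3/ModernBERT/Llama3.2 pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Placeholders: sentence embeddings from a transformer encoder and a scalar
# sentiment score from an auxiliary model, for train and dev splits.
emb_train, emb_dev = rng.normal(size=(200, 768)), rng.normal(size=(50, 768))
sent_train, sent_dev = rng.uniform(-1, 1, (200, 1)), rng.uniform(-1, 1, (50, 1))
y_train, y_dev = rng.integers(0, 2, 200), rng.integers(0, 2, 50)

# Sentiment-augmented representation: concatenate the sentiment score onto
# the sentence embedding before the classification head.
X_train = np.hstack([emb_train, sent_train])
X_dev = np.hstack([emb_dev, sent_dev])

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Decision-threshold calibration on the dev set to counter class imbalance:
# pick the probability cutoff that maximizes the subjective-class F1.
probs = clf.predict_proba(X_dev)[:, 1]
thresholds = np.linspace(0.1, 0.9, 81)
best_t = max(thresholds, key=lambda t: f1_score(y_dev, probs >= t))
print("calibrated threshold:", round(best_t, 3))
```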


Multi-domain Multilingual Sentiment Analysis in Industry: Predicting Aspect-based Opinion Quadruples

arXiv:2505.10389v2 Announce Type: replace Abstract: This paper explores the design of an aspect-based sentiment analysis system using large language models (LLMs) for real-world use. We focus on quadruple opinion extraction — identifying aspect categories, sentiment polarity, targets, and opinion expressions from text data across different domains and languages. We investigate whether a single fine-tuned model can effectively handle multiple domain-specific taxonomies simultaneously. We demonstrate that a combined multi-domain model achieves performance comparable to specialized single-domain models while reducing operational complexity. We also share lessons learned for handling non-extractive predictions and evaluating various failure modes when developing LLM-based systems for structured prediction tasks.
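To show what an opinion quadruple looks like in practice, here is a small Python sketch that defines the four fields and parses a hypothetical JSON response from a fine-tuned model. The output schema and example values are invented for illustration; the paper's actual prompt and output format are not reproduced here.

```python
import json
from dataclasses import dataclass

@dataclass
class OpinionQuadruple:
    aspect_category: str   # e.g. "service#general" in a restaurant-style taxonomy
    polarity: str          # "positive" / "negative" / "neutral"
    target: str            # the entity the opinion is about
    expression: str        # the opinion phrase itself

def parse_llm_output(raw: str) -> list[OpinionQuadruple]:
    """Parse a (hypothetical) JSON list emitted by the fine-tuned LLM.

    Non-extractive predictions, i.e. targets or expressions that do not appear
    verbatim in the source text, would be filtered or flagged at this stage.
    """
    return [OpinionQuadruple(**item) for item in json.loads(raw)]

raw = '[{"aspect_category": "service#general", "polarity": "negative", ' \
      '"target": "waiter", "expression": "painfully slow"}]'
for quad in parse_llm_output(raw):
    print(quad)
```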


Partitioner Guided Modal Learning Framework

arXiv:2507.11661v1 Announce Type: new Abstract: Multimodal learning benefits from information across multiple modalities, and each learned modal representation can be divided into uni-modal features, which can be learned from uni-modal training, and paired-modal features, which can be learned from cross-modal interaction. Building on this perspective, we propose a partitioner-guided modal learning framework, PgM, which consists of a modal partitioner, a uni-modal learner, a paired-modal learner, and a uni-paired modal decoder. The modal partitioner segments the learned modal representation into uni-modal and paired-modal features. The modal learner incorporates two dedicated components for uni-modal and paired-modal learning. The uni-paired modal decoder reconstructs the modal representation from the uni-modal and paired-modal features. PgM offers three key benefits: 1) thorough learning of uni-modal and paired-modal features, 2) flexible distribution adjustment for uni-modal and paired-modal representations to suit diverse downstream tasks, and 3) different learning rates across modalities and partitions. Extensive experiments demonstrate the effectiveness of PgM across four multimodal tasks and further highlight its transferability to existing models. Additionally, we visualize the distribution of uni-modal and paired-modal features across modalities and tasks, offering insights into their respective contributions.
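A toy PyTorch sketch of the partitioner-decoder idea: a partitioner splits a modal representation into uni-modal and paired-modal features, and a decoder reconstructs the original representation from the two parts. The linear layers, dimensions, and reconstruction loss are simplifications assumed for illustration, not PgM's actual architecture.

```python
import torch
import torch.nn as nn

class ModalPartitioner(nn.Module):
    """Toy partitioner: projects a modal representation into a uni-modal
    part and a paired-modal part (dimensions are illustrative)."""
    def __init__(self, dim: int, uni_dim: int, pair_dim: int):
        super().__init__()
        self.to_uni = nn.Linear(dim, uni_dim)
        self.to_pair = nn.Linear(dim, pair_dim)

    def forward(self, h: torch.Tensor):
        return self.to_uni(h), self.to_pair(h)

class UniPairedDecoder(nn.Module):
    """Reconstructs the original representation from the two partitions."""
    def __init__(self, dim: int, uni_dim: int, pair_dim: int):
        super().__init__()
        self.decode = nn.Linear(uni_dim + pair_dim, dim)

    def forward(self, uni: torch.Tensor, pair: torch.Tensor):
        return self.decode(torch.cat([uni, pair], dim=-1))

# Toy forward pass: a batch of text-modality representations.
h_text = torch.randn(8, 256)
partitioner = ModalPartitioner(256, uni_dim=128, pair_dim=128)
decoder = UniPairedDecoder(256, uni_dim=128, pair_dim=128)
uni, pair = partitioner(h_text)
recon = decoder(uni, pair)
recon_loss = nn.functional.mse_loss(recon, h_text)  # reconstruction objective
print(recon_loss.item())
```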


NeuralOS: A Generative Framework for Simulating Interactive Operating System Interfaces

Transforming Human-Computer Interaction with Generative Interfaces

Recent advances in generative models are transforming the way we interact with computers, making experiences more natural, adaptive, and personalized. Early interfaces, command-line tools, and static menus were fixed and required users to adapt to the machine. Now, with the rise of LLMs and multimodal AI, users can engage with systems using everyday language, images, and even video. Newer models can even simulate dynamic environments, such as those found in video games, in real time. These trends point toward a future where computer interfaces are not just responsive but generative, tailoring themselves to our goals, preferences, and the evolving context around us.

Evolution of Generative Models for Simulating Environments

Recent generative modeling approaches have made significant progress in simulating interactive environments. Early models, such as World Models, used latent variables to simulate reinforcement learning tasks, while GameGAN and Genie enabled the imitation of interactive games and the creation of playable 2D worlds. Diffusion-based models have advanced the field further, with tools like GameNGen, MarioVGG, DIAMOND, and GameGen-X simulating iconic and open-world games with remarkable fidelity. Beyond gaming, models such as UniSim simulate real-world scenarios, and Pandora allows video generation controlled by natural language prompts. While these efforts excel at dynamic, visually rich simulations, simulating subtle GUI transitions and precise user input, such as cursor movement, remains a distinct and complex challenge.

Introducing NeuralOS: A Diffusion-RNN Based OS Simulator

Researchers from the University of Waterloo and the National Research Council Canada have introduced NeuralOS, a neural framework that simulates operating system interfaces by directly generating screen frames from user inputs such as mouse movements, clicks, and keystrokes. NeuralOS combines a recurrent neural network that tracks system state with a diffusion-based renderer that produces realistic GUI images. Trained on large-scale Ubuntu XFCE interaction data, it accurately models application launches and cursor behavior, although fine-grained keyboard input remains a challenge. NeuralOS marks a step toward adaptive, generative user interfaces that could eventually replace traditional static menus with more intuitive, AI-driven interaction.

Architectural Design and Training Pipeline of NeuralOS

NeuralOS is built on a modular design that mimics the separation of internal logic and GUI rendering found in traditional operating systems. It uses a hierarchical RNN to track user-driven state changes and a latent-space diffusion model to generate screen visuals. User inputs, such as cursor movements and key presses, are encoded and processed by the RNN, which maintains system memory over time. The renderer then uses these outputs, together with spatial cursor maps, to produce realistic frames. Training proceeds in multiple stages, including pretraining the RNN, joint training, scheduled sampling, and context extension, to handle long-term dependencies, reduce errors, and adapt to real user interactions.

Evaluation and Accuracy of Simulated GUI Transitions

Because of the high training costs, the NeuralOS team evaluated smaller variants and ablations on a curated set of 730 examples. To assess how well the model localizes the cursor, they trained a regression model and found that NeuralOS predicted cursor positions to within approximately 1.5 pixels, far outperforming models without spatial encoding. For state transitions such as opening applications, NeuralOS achieved 37.7% accuracy across 73 challenging transition types, significantly outperforming the baseline. Ablation studies showed that removing joint training resulted in blurry outputs and missing cursors, whereas skipping scheduled sampling led to a rapid decline in prediction quality over time.

Conclusion: Toward Fully Generative Operating Systems

NeuralOS is a framework that simulates operating system interfaces using generative models. It blends an RNN that tracks system state with a diffusion model that renders screen images based on user actions. Trained on Ubuntu desktop interactions, NeuralOS can generate realistic screen sequences and predict mouse behavior; however, handling detailed keyboard input remains challenging. While the model shows promise, it is limited by its low resolution, slow speed (1.8 fps), and inability to perform complex OS tasks such as installing software or accessing the internet. Future work may focus on language-driven controls, better performance, and expanding functionality beyond current OS boundaries.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
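The division of labor described above, an RNN that tracks latent OS state and a renderer conditioned on that state plus a spatial cursor map, can be sketched in a few lines of PyTorch. Everything below is a simplified stand-in: the real NeuralOS renderer is a latent diffusion model, and the event encoding, resolution, and module sizes are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

def cursor_map(x: int, y: int, h: int = 48, w: int = 64) -> torch.Tensor:
    """One-channel spatial map marking the cursor position (a stand-in for
    the spatial encoding the paper credits for accurate cursor rendering)."""
    m = torch.zeros(1, h, w)
    m[0, y, x] = 1.0
    return m

class StateTracker(nn.Module):
    """RNN that folds user input events into a latent OS-state vector."""
    def __init__(self, event_dim: int = 16, state_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRUCell(event_dim, state_dim)

    def forward(self, event: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        return self.rnn(event, state)

class FrameRenderer(nn.Module):
    """Placeholder renderer: in NeuralOS this is a latent diffusion model;
    here a single conv over state + cursor map stands in for it."""
    def __init__(self, state_dim: int = 128, h: int = 48, w: int = 64):
        super().__init__()
        self.h, self.w = h, w
        self.to_plane = nn.Linear(state_dim, h * w)
        self.conv = nn.Conv2d(2, 3, kernel_size=3, padding=1)  # -> RGB frame

    def forward(self, state: torch.Tensor, cursor: torch.Tensor) -> torch.Tensor:
        plane = self.to_plane(state).view(-1, 1, self.h, self.w)
        return self.conv(torch.cat([plane, cursor.unsqueeze(0)], dim=1))

tracker, renderer = StateTracker(), FrameRenderer()
state = torch.zeros(1, 128)
event = torch.randn(1, 16)            # encoded click / keypress / mouse move
state = tracker(event, state)         # update latent OS state
frame = renderer(state, cursor_map(x=10, y=20))
print(frame.shape)                    # torch.Size([1, 3, 48, 64])
```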


JarvisArt: A Human-in-the-Loop Multimodal Agent for Region-Specific and Global Photo Editing

Bridging the Gap Between Artistic Intent and Technical Execution

Photo retouching is a core aspect of digital photography, enabling users to manipulate image elements such as tone, exposure, and contrast to create visually compelling content. Whether for professional purposes or personal expression, users often seek to enhance images in ways that align with specific aesthetic goals. However, photo retouching requires both technical knowledge and creative sensibility, making it difficult to achieve high-quality results without significant effort or expertise. The key problem arises from the gap between manual editing tools and automated solutions. While professional software like Adobe Lightroom offers extensive retouching options, mastering these tools can be time-consuming and difficult for casual users. Conversely, AI-driven methods tend to oversimplify the editing process, failing to offer the control or precision required for nuanced edits. These automated solutions also struggle to generalize across diverse visual scenes or to support complex user instructions.

Limitations of Current AI-Based Photo Editing Models

Traditional approaches have relied on zeroth- and first-order optimization, as well as reinforcement learning, to handle photo retouching tasks; others use diffusion-based methods for image synthesis. These strategies show progress but are generally hampered by their inability to handle fine-grained regional control, maintain high-resolution outputs, or preserve the underlying content of the image. Even more recent large models, such as GPT-4o and Gemini-2-Flash, offer text-driven editing but compromise user control, and their generative processes often overwrite critical content details.

JarvisArt: A Multimodal AI Retoucher Integrating Chain-of-Thought and Lightroom APIs

Researchers from Xiamen University, the Chinese University of Hong Kong, Bytedance, the National University of Singapore, and Tsinghua University introduced JarvisArt, an intelligent retouching agent. The system leverages a multimodal large language model to enable flexible, instruction-guided image editing. JarvisArt is trained to emulate the decision-making process of professional artists, interpreting user intent through both visual and language cues and executing retouching actions across more than 200 tools in Adobe Lightroom via a custom integration protocol.

The methodology integrates three major components. First, the researchers constructed a high-quality dataset, MMArt, which includes 5,000 standard and 50,000 Chain-of-Thought-annotated samples spanning various editing styles and complexities. JarvisArt then undergoes a two-stage training process: an initial phase of supervised fine-tuning builds reasoning and tool-selection capabilities, followed by Group Relative Policy Optimization for Retouching (GRPO-R), which incorporates customized tool-use rewards, such as retouching accuracy and perceptual quality, to refine the system's ability to generate professional-quality edits. A specialized Agent-to-Lightroom (A2L) protocol ensures seamless and transparent execution of tools within Lightroom, letting users adjust edits dynamically.

Benchmarking JarvisArt's Capabilities and Real-World Performance

JarvisArt's ability to interpret complex instructions and apply nuanced edits was evaluated on MMArt-Bench, a benchmark constructed from real user edits. The system delivered a 60% improvement in average pixel-level metrics for content fidelity compared to GPT-4o, while maintaining similar instruction-following capabilities. It also demonstrated versatility in handling both global image edits and localized refinements, and it can manipulate images of arbitrary resolution. For example, it can adjust skin texture, eye brightness, or hair definition based on region-specific instructions. These results were achieved while preserving the aesthetic goals defined by the user, showing a practical blend of control and quality across multiple editing tasks.

Conclusion: A Generative Agent That Fuses Creativity With Technical Precision

The research team tackled a significant challenge: enabling intelligent, high-quality photo retouching that does not require professional expertise. The method they introduced bridges the gap between automation and user control by combining data synthesis, reasoning-driven training, and integration with commercial software. JarvisArt offers a practical and powerful solution for creative users who seek both flexibility and quality in their image editing.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
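The agent-to-tool flow described above can be illustrated with a short Python sketch: the agent emits a structured retouching plan with global and region-specific steps, and a dispatcher executes each step against a catalogue of editing operations. The JSON schema, tool names, and mask identifier are hypothetical placeholders; the actual A2L protocol and Lightroom tool set are not reproduced here.

```python
import json

# Hypothetical catalogue of Lightroom-style operations the agent can call.
# The real A2L protocol spans 200+ tools; these names are placeholders.
def adjust_exposure(image, value):
    print(f"[global] exposure {value:+.2f}")

def adjust_contrast(image, value):
    print(f"[global] contrast {value:+.2f}")

def local_adjust(image, mask, tool, value):
    print(f"[regional:{mask}] {tool} {value:+.2f}")

TOOLS = {"exposure": adjust_exposure, "contrast": adjust_contrast}

def execute_plan(image, plan_json: str):
    """Dispatch an agent's retouching plan: global edits first, then
    region-specific edits keyed by a mask identifier."""
    plan = json.loads(plan_json)
    for step in plan["global"]:
        TOOLS[step["tool"]](image, step["value"])
    for step in plan["regional"]:
        local_adjust(image, step["mask"], step["tool"], step["value"])

# Example plan the agent might emit for "brighten her face a bit".
plan = json.dumps({
    "global": [{"tool": "exposure", "value": 0.2}],
    "regional": [{"tool": "exposure", "value": 0.4, "mask": "face_region_01"}],
})
execute_plan(image=None, plan_json=plan)
```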


LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating

arXiv:2412.18424v3 Announce Type: replace-cross Abstract: Large vision-language models (LVLMs) have remarkably improved document understanding capabilities, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks have been limited to a small number of pages and fail to provide a comprehensive analysis of layout-element locating. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating, and then propose a comprehensive benchmark, LongDocURL, integrating the above three primary tasks and comprising 20 sub-tasks categorized by primary task and answer evidence. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs covering more than 33,000 pages of documents, significantly exceeding the scale of existing benchmarks. Subsequently, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.


Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers

arXiv:2507.10787v1 Announce Type: new Abstract: This paper introduces MISS-QA, the first benchmark specifically designed to evaluate the ability of models to interpret schematic diagrams within scientific literature. MISS-QA comprises 1,500 expert-annotated examples over 465 scientific papers. In this benchmark, models are tasked with interpreting schematic diagrams that illustrate research overviews and answering corresponding information-seeking questions based on the broader context of the paper. We assess the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL. We reveal a significant performance gap between these models and human experts on MISS-QA. Our analysis of model performance on unanswerable questions and our detailed error analysis further highlight the strengths and limitations of current models, offering key insights to enhance models in comprehending multimodal scientific literature.
