YouZum

AI

AI, Committee, News, Uncategorized

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

arXiv:2505.13227v2 Announce Type: replace-cross Abstract: Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release the largest computer use grounding dataset Jedi, which contains 4 million examples through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances agentic capabilities of general foundation models on complex computer tasks, improving from 5% to 27% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. All benchmark, data, checkpoints, and code are open-sourced and available at https://osworld-grounding.github.io.
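
Grounding benchmarks of this kind are commonly scored by checking whether a predicted click point lands inside the annotated target element. The abstract above does not spell out its metric, so the following is only a minimal sketch under that assumption; `Box` and `grounding_accuracy` are hypothetical names, not part of the OSWorld-G release.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a target UI element, in pixels."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # A point on the border counts as inside.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(
    predictions: Sequence[Tuple[float, float]],
    targets: Sequence[Box],
) -> float:
    """Fraction of predicted click points that fall inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one sample is required")
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return hits / len(targets)
```

With two samples where only the first click lands inside its box, the function returns 0.5.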

AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR

arXiv:2506.14190v1 Announce Type: new Abstract: Developing code-switched ASR systems is challenging due to language ambiguity and limited exposure to multilingual, code-switched data, while collecting such speech is costly. Prior work generates synthetic audio from text, but these methods are computationally intensive and hard to scale. We introduce AsyncSwitch, a novel asynchronous adaptation framework that leverages large-scale, text-rich web data to pre-expose ASR models to diverse code-switched domains before fine-tuning on paired speech-text corpora. Our three-stage process (1) trains decoder self-attention and feedforward layers on code-switched text, (2) aligns decoder and encoder via cross-attention using limited speech-text data, and (3) fully fine-tunes the entire model. Experiments with Whisper on Malay-English code-switching demonstrate a 9.02% relative WER reduction, while improving monolingual performance in Singlish, Malay, and other English variants.
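
The 9.02% figure above is a relative reduction, not an absolute drop in word error rate. A minimal sketch of the arithmetic follows; the baseline and adapted WER values in the usage note are made-up illustrations, not numbers from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction: the drop in WER as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

For example, a hypothetical baseline WER of 20.0 that falls to 18.196 after adaptation is a relative reduction of about 0.0902 (9.02%), even though the absolute drop is only about 1.8 points.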

Bridging Social Media and Search Engines: Dredge Words and the Detection of Unreliable Domains

arXiv:2406.11423v4 Announce Type: replace-cross Abstract: Proactive content moderation requires platforms to rapidly and continuously evaluate the credibility of websites. Leveraging the direct and indirect paths users follow to unreliable websites, we develop a website credibility classification and discovery system that integrates both webgraph and large-scale social media contexts. We additionally introduce the concept of dredge words, terms or phrases for which unreliable domains rank highly on search engines, and provide the first exploration of their usage on social media. Our graph neural networks that combine webgraph and social media contexts achieve state-of-the-art results in website credibility classification and significantly improve the top-k identification of unreliable domains. Additionally, we release a novel dataset of dredge words, highlighting their strong connections to both social media and online commerce platforms.

CMCTS: A Constrained Monte Carlo Tree Search Framework for Mathematical Reasoning in Large Language Model

arXiv:2502.11169v2 Announce Type: replace Abstract: This paper introduces the Constrained Monte Carlo Tree Search (CMCTS) framework to enhance the mathematical reasoning capabilities of Large Language Models (LLM). By incorporating a constrained action space, Process Reward Model (PRM), and partial order rules, CMCTS effectively addresses the limitations of existing MCTS methods in terms of state space diversity and action selection rationality. Specifically, during the expansion phase, CMCTS restricts action sampling to a predefined constrained action set to increase candidate state diversity. In the simulation phase, it introduces partial order rules and PRM to optimize action selection and prevent unreasonable state transitions. Experimental results show that CMCTS performs outstandingly across multiple mathematical reasoning benchmarks. Under a zero-shot setting, a 7B-parameter model achieves an average accuracy of 83.4%, surpassing the 72B baseline model by 4.8%. Ablation studies demonstrate that each component of the framework is crucial for performance improvement, and their combined use fully leverages their respective strengths. Overall, the CMCTS framework provides an effective approach to enhancing LLM mathematical reasoning capabilities, supported by theoretical analysis, and offers novel insights for future reasoning tasks.
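
The constrained expansion and partial order rules described above can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the action names, the ordering rules, and the `score_fn` standing in for a Process Reward Model are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence, Set, Tuple

# Hypothetical constrained action set (the paper's actual set may differ).
ACTIONS: Tuple[str, ...] = ("understand", "plan", "code", "reflect", "summarize")

# Hypothetical partial order rules: (a, b) means a must appear before b is allowed.
MUST_PRECEDE: Set[Tuple[str, str]] = {
    ("understand", "plan"),
    ("plan", "code"),
    ("plan", "summarize"),
}


def allowed_actions(
    history: Sequence[str],
    actions: Sequence[str] = ACTIONS,
    must_precede: Set[Tuple[str, str]] = MUST_PRECEDE,
) -> List[str]:
    """Return the actions whose prerequisites already appear in the history."""
    done = set(history)
    allowed = []
    for action in actions:
        prerequisites = {a for a, b in must_precede if b == action}
        if prerequisites <= done:
            allowed.append(action)
    return allowed


def select_action(
    history: Sequence[str],
    score_fn: Callable[[Sequence[str], str], float],
) -> str:
    """Pick the highest-scoring allowed action; score_fn stands in for a PRM."""
    candidates = allowed_actions(history)
    if not candidates:
        raise ValueError("no action satisfies the partial order rules")
    return max(candidates, key=lambda action: score_fn(history, action))
```

Restricting candidates before scoring means the reward model never has to rank transitions that the ordering rules already exclude.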

How Much Can We Forget about Data Contamination?

arXiv:2410.03249v4 Announce Type: replace-cross Abstract: The leakage of benchmark data into the training data has emerged as a significant challenge for evaluating the capabilities of large language models (LLMs). In this work, we challenge the common assumption that small-scale contamination renders benchmark evaluations invalid. First, we experimentally quantify the magnitude of benchmark overfitting based on scaling along three dimensions: the number of model parameters (up to 1.6B), the number of times an example is seen (up to 144), and the number of training tokens (up to 40B). If model and data follow the Chinchilla scaling laws, minor contamination indeed leads to overfitting. At the same time, even 144 times of contamination can be forgotten if the training data is scaled beyond five times Chinchilla, a regime characteristic of many modern LLMs. Continual pre-training of OLMo-7B corroborates these results. Next, we study the impact of the weight decay parameter on example forgetting, showing that empirical forgetting occurs faster than the cumulative weight decay. This allows us to gauge the degree of example forgetting in large-scale training runs, indicating that many LLMs, including Llama 3 405B, have forgotten the data seen at the beginning of training.
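
Two quantities in the abstract above lend themselves to a quick back-of-the-envelope calculation. The sketch below rests on two assumptions that the abstract does not state: that "Chinchilla" refers to the widely quoted rule of thumb of roughly 20 training tokens per parameter, and that "cumulative weight decay" is the product of the per-step shrink factors `1 - lr * weight_decay` applied by decoupled weight decay (as in AdamW). Function names are hypothetical.

```python
from typing import Iterable

# Rule-of-thumb assumption: about 20 training tokens per model parameter.
CHINCHILLA_TOKENS_PER_PARAM = 20.0


def chinchilla_multiple(n_params: float, n_tokens: float) -> float:
    """How many times the assumed Chinchilla-optimal token budget was used."""
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return n_tokens / (CHINCHILLA_TOKENS_PER_PARAM * n_params)


def cumulative_weight_decay(learning_rates: Iterable[float], weight_decay: float) -> float:
    """Fraction of a weight update that survives the given steps of decoupled decay.

    Each step multiplies the weights by (1 - lr * weight_decay); the product
    over all later steps bounds how much of an earlier update remains.
    """
    factor = 1.0
    for lr in learning_rates:
        factor *= 1.0 - lr * weight_decay
    return factor
```

Under these assumptions, a 1B-parameter model trained on 100B tokens sits at five times Chinchilla. The abstract's claim is that measured forgetting happens faster than the cumulative decay factor alone would suggest.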

A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment

arXiv:2502.13520v2 Announce Type: replace Abstract: This paper introduces the Balanced Arabic Readability Evaluation Corpus (BAREC), a large-scale, fine-grained dataset for Arabic readability assessment. BAREC consists of 69,441 sentences spanning 1+ million words, carefully curated to cover 19 readability levels, from kindergarten to postgraduate comprehension. The corpus balances genre diversity, topical coverage, and target audiences, offering a comprehensive resource for evaluating Arabic text complexity. The corpus was fully manually annotated by a large team of annotators. The average pairwise inter-annotator agreement, measured by Quadratic Weighted Kappa, is 81.8%, reflecting a high level of substantial agreement. Beyond presenting the corpus, we benchmark automatic readability assessment across different granularity levels, comparing a range of techniques. Our results highlight the challenges and opportunities in Arabic readability modeling, demonstrating competitive performance across various methods. To support research and education, we make BAREC openly available, along with detailed annotation guidelines and benchmark results.
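
Quadratic Weighted Kappa, the agreement measure cited above, penalizes disagreements by the squared distance between the two assigned levels, so near-misses on a 19-level scale cost far less than distant ones. A minimal pure-Python sketch follows, assuming levels are encoded as integers from 0 to `num_levels - 1`; it is not the authors' evaluation script.

```python
from typing import Sequence


def quadratic_weighted_kappa(
    rater_a: Sequence[int],
    rater_b: Sequence[int],
    num_levels: int,
) -> float:
    """Cohen's kappa with quadratic weights for two raters on an ordinal scale."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if num_levels < 2:
        raise ValueError("need at least two levels")

    n = len(rater_a)
    # observed[i][j]: number of items rated i by rater A and j by rater B.
    observed = [[0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters were statistically independent.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0.0:
        # Both raters used a single identical level: treat as perfect agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

Identical ratings give 1.0, while fully reversed ratings on a three-level scale give -1.0.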

EPFL Researchers Introduce MEMOIR: A Scalable Framework for Lifelong Model Editing in LLMs

The Challenge of Updating LLM Knowledge

LLMs have shown outstanding performance on various tasks through extensive pre-training on vast datasets. However, these models frequently generate outdated or inaccurate information and can reflect biases during deployment, so their knowledge needs to be updated continuously. Traditional fine-tuning methods are expensive and susceptible to catastrophic forgetting. This has motivated lifelong model editing, which updates model knowledge efficiently and locally. To generate correct predictions, each edit requires reliability, generalizability, and localization. Non-parametric methods achieve precise localized edits but generalize poorly, while parametric methods offer better generalization but suffer from catastrophic forgetting.

Limitations of Prior Model Editing Techniques

Earlier works have explored sparse neural activations in continual learning, with methods like PackNet and Supermasks-in-Superposition allocating disjoint parameter subsets per task. Gradient-based approaches such as GPM and SPARCL improve efficiency through orthogonal updates but are limited to continual learning contexts. Parametric approaches such as ROME, MEMIT, and WISE modify weights through locate-then-edit strategies or auxiliary modules, but suffer from forgetting over extended edit sequences. Non-parametric methods like GRACE and LOKA store knowledge externally to preserve the original weights, enabling precise local edits. However, these methods rely on exact input matches, which limits their generalization.

Introducing MEMOIR: A Structured Approach to Model Editing

Researchers from EPFL, Lausanne, Switzerland, have proposed MEMOIR (Model Editing with Minimal Overwrite and Informed Retention), which balances reliability, generalization, and locality for large-scale edits. It introduces a memory module, a fully-connected layer within a single transformer block, where all edits occur. MEMOIR addresses catastrophic forgetting by allocating distinct parameter subsets to each edit and retrieving them during inference, so that only the knowledge relevant to a specific prompt is activated. The method also applies structured sparsification with sample-dependent masks during editing, activating only prompt-specific parameter subsets. This distributes new knowledge across the parameter space, reducing overwriting and minimizing catastrophic forgetting.

Evaluation and Experimental Results

MEMOIR operates through a residual memory framework during inference, where the edited output combines the original layer output with the residual memory output. It is evaluated against baselines such as GRACE for external knowledge storage, DEFER for inference-time routing, causal tracing methods like ROME, MEMIT, and ALPHAEDIT, and memory-based methods like WISE. Direct fine-tuning serves as an additional baseline. Experiments are conducted on four autoregressive language models (LLaMA-3-8B-Instruct, Mistral-7B, LLaMA-2-7B, and GPT-J-6B), providing an evaluation across different models and scales to show the effectiveness and generalizability of MEMOIR.

On the ZsRE question-answering dataset, MEMOIR achieves an average metric of 0.95 on LLaMA-3 with 1000 edits, outperforming all prior methods by a margin of 0.16. Similar outcomes are seen with Mistral, where the method again achieves the highest average score. Moreover, MEMOIR maintains balanced performance with increasing edit volumes for hallucination correction using the SelfCheckGPT dataset. It sustains saturated locality scores under the most challenging scenario of 600 edits, while achieving perplexity metrics 57% and 77% lower than WISE, the second-best performing method, on LLaMA-3 and Mistral, respectively.

Conclusion and Future Directions

In conclusion, MEMOIR is a scalable framework for lifelong model editing that balances reliability, generalization, and locality using sparsification techniques. The method retrieves relevant updates by comparing sparse activation patterns, allowing edits to generalize to rephrased queries while maintaining model behavior on unrelated prompts. However, certain limitations exist, such as the modification of only a single linear layer, which may restrict handling of long-horizon edits or knowledge requiring broader model changes. Future directions include extending the approach to multiple layers, hierarchical editing strategies, and application to multi-modal or encoder-decoder models beyond the current decoder-only transformer focus.

The post EPFL Researchers Introduce MEMOIR: A Scalable Framework for Lifelong Model Editing in LLMs appeared first on MarkTechPost.
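
The idea of sample-dependent masks and retrieval by comparing sparse activation patterns, as described in the MEMOIR article above, can be illustrated with a toy sketch. The article does not say how masks are computed or compared, so the choices here (top-k by activation magnitude, overlap ratio, a fixed threshold) are assumptions made purely for illustration, and all names are hypothetical.

```python
from typing import Dict, FrozenSet, Optional, Sequence


def top_k_mask(activations: Sequence[float], k: int) -> FrozenSet[int]:
    """Indices of the k largest-magnitude activations (assumed masking rule)."""
    ranked = sorted(range(len(activations)), key=lambda i: abs(activations[i]), reverse=True)
    return frozenset(ranked[:k])


def mask_overlap(query: FrozenSet[int], stored: FrozenSet[int]) -> float:
    """Fraction of the query's active indices that the stored mask shares."""
    if not query:
        return 0.0
    return len(query & stored) / len(query)


def retrieve_edit(
    query_mask: FrozenSet[int],
    stored_masks: Dict[str, FrozenSet[int]],
    threshold: float = 0.5,
) -> Optional[str]:
    """Return the best-matching edit id, or None if nothing is similar enough.

    Returning None models an unrelated prompt: no edit is activated and the
    original layer output would be used unchanged.
    """
    best_id: Optional[str] = None
    best_score = 0.0
    for edit_id, mask in stored_masks.items():
        score = mask_overlap(query_mask, mask)
        if score > best_score:
            best_id, best_score = edit_id, score
    return best_id if best_score >= threshold else None
```

In this toy setup, a rephrased prompt whose strongest activations match a stored edit retrieves that edit, while a prompt with a disjoint activation pattern retrieves nothing.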

QFFT, Question-Free Fine-Tuning for Adaptive Reasoning

arXiv:2506.12860v1 Announce Type: new Abstract: Recent advancements in Long Chain-of-Thought (CoT) reasoning models have improved performance on complex tasks, but they suffer from overthinking, which generates redundant reasoning steps, especially for simple questions. This paper revisits the reasoning patterns of Long and Short CoT models, observing that the Short CoT patterns offer concise reasoning efficiently, while the Long CoT patterns excel in challenging scenarios where the Short CoT patterns struggle. To enable models to leverage both patterns, we propose Question-Free Fine-Tuning (QFFT), a fine-tuning approach that removes the input question during training and learns exclusively from Long CoT responses. This approach enables the model to adaptively employ both reasoning patterns: it prioritizes the Short CoT patterns and activates the Long CoT patterns only when necessary. Experiments on various mathematical datasets demonstrate that QFFT reduces average response length by more than 50%, while achieving performance comparable to Supervised Fine-Tuning (SFT). Additionally, QFFT exhibits superior performance compared to SFT in noisy, out-of-domain, and low-resource scenarios.
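
The difference between standard SFT and the question-free setup described above comes down to how each training example is assembled. The sketch below is an assumption-laden illustration: it uses character offsets in place of token positions, and the field names `text` and `loss_start` are hypothetical rather than taken from the paper's code.

```python
from typing import Dict, Union

Example = Dict[str, Union[str, int]]


def build_sft_example(question: str, response: str) -> Example:
    """Standard SFT: the question is the context, loss is computed on the response."""
    prefix = question + "\n"
    return {"text": prefix + response, "loss_start": len(prefix)}


def build_qfft_example(question: str, response: str) -> Example:
    """Question-free variant: the question is dropped and only the response is learned."""
    del question  # deliberately unused: the input question is removed during training
    return {"text": response, "loss_start": 0}
```

In both cases the loss covers only the response; the question-free variant simply never shows the model the question it belongs to.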

FinLMM-R1: Enhancing Financial Reasoning in LMM through Scalable Data and Reward Design

arXiv:2506.13066v1 Announce Type: new Abstract: Large Multimodal Models (LMMs) demonstrate significant cross-modal reasoning capabilities. However, financial applications face challenges due to the lack of high-quality multimodal reasoning datasets and the inefficiency of existing training paradigms for reasoning enhancement. To address these issues, we propose an integrated framework, FinLMM-R1, combining an automated and scalable pipeline for data construction with enhanced training strategies to improve the multimodal reasoning of LMM. The Automated and Scalable Pipeline (ASP) resolves textual-visual misalignment in financial reports through a separate paradigm of question-answer generation and image-question alignment, ensuring data integrity and extraction efficiency. Through ASP, we collect 89,378 aligned image-question pairs from 23,397 financial reports, covering tasks such as arithmetic reasoning, statistics reasoning, financial explanation, and financial knowledge. Moreover, we introduce the Thinking with Adversarial Reward in LMM (TAR-LMM), extending the prior two-stage training framework [1] with additional reward mechanisms. In the first stage, we focus on text-only tasks with format and accuracy rewards to guide the model in generating well-structured thinking contents. In the second stage, we construct multi-image contrastive samples with additional reward components including image selection, thinking content length, and adversarial reward to jointly optimize the LMM across visual perception, reasoning efficiency, and logical coherence. Extensive experiments on 7 benchmarks show ASP-derived dataset and training framework significantly improve answer accuracy and reasoning depth over existing reasoning LMMs in both general and financial multimodal contexts.

Surprise Calibration for Better In-Context Learning

arXiv:2506.12796v1 Announce Type: new Abstract: In-context learning (ICL) has emerged as a powerful paradigm for task adaptation in large language models (LLMs), where models infer underlying task structures from a few demonstrations. However, ICL remains susceptible to biases that arise from prior knowledge and contextual demonstrations, which can degrade the performance of LLMs. Existing bias calibration methods typically apply fixed class priors across all inputs, limiting their efficacy in dynamic ICL settings where the context for each query differs. To address these limitations, we adopt implicit sequential Bayesian inference as a framework for interpreting ICL, identify “surprise” as an informative signal for class prior shift, and introduce a novel method, Surprise Calibration (SC). SC leverages the notion of surprise to capture the temporal dynamics of class priors, providing a more adaptive and computationally efficient solution for in-context learning. We empirically demonstrate the superiority of SC over existing bias calibration techniques across a range of benchmark natural language processing tasks.
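
To make the contrast in the abstract above concrete, the sketch below shows the fixed-prior style of calibration that it attributes to existing methods, alongside the standard information-theoretic definition of surprise as negative log-probability. It does not implement Surprise Calibration itself, whose update rule the abstract does not give; function names are hypothetical.

```python
import math
from typing import List, Sequence


def calibrate_with_fixed_prior(probs: Sequence[float], prior: Sequence[float]) -> List[float]:
    """Divide class probabilities by a fixed class prior and renormalize."""
    if len(probs) != len(prior):
        raise ValueError("probs and prior must have the same length")
    if any(q <= 0 for q in prior):
        raise ValueError("prior entries must be positive")
    scores = [p / q for p, q in zip(probs, prior)]
    total = sum(scores)
    if total <= 0:
        raise ValueError("probs must contain a positive entry")
    return [s / total for s in scores]


def surprise(prob_of_observed_label: float) -> float:
    """Surprise of an observed label: -log of the probability the model gave it."""
    if not 0.0 < prob_of_observed_label <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(prob_of_observed_label)
```

A fixed prior applies the same correction to every query, whereas a surprise signal changes with each demonstration, which is the property the abstract builds on.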
