YouZum

ข่าว

AI, Committee, ข่าว, Uncategorized

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

arXiv:2505.13227v2 Announce Type: replace-cross Abstract: Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release the largest computer use grounding dataset Jedi, which contains 4 million examples through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances agentic capabilities of general foundation models on complex computer tasks, improving from 5% to 27% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. All benchmark, data, checkpoints, and code are open-sourced and available at https://osworld-grounding.github.io.

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis Read Post »

AI, Committee, ข่าว, Uncategorized

AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR

arXiv:2506.14190v1 Announce Type: new Abstract: Developing code-switched ASR systems is challenging due to language ambiguity and limited exposure to multilingual, code-switched data, while collecting such speech is costly. Prior work generates synthetic audio from text, but these methods are computationally intensive and hard to scale. We introduce AsyncSwitch, a novel asynchronous adaptation framework that leverages large-scale, text-rich web data to pre-expose ASR models to diverse code-switched domains before fine-tuning on paired speech-text corpora. Our three-stage process (1) trains decoder self-attention and feedforward layers on code-switched text, (2) aligns decoder and encoder via cross-attention using limited speech-text data, and (3) fully fine-tunes the entire model. Experiments with Whisper on Malay-English code-switching demonstrate a 9.02% relative WER reduction, while improving monolingual performance in Singlish, Malay, and other English variants.

AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR Read Post »

AI, Committee, ข่าว, Uncategorized

Bridging Social Media and Search Engines: Dredge Words and the Detection of Unreliable Domains

arXiv:2406.11423v4 Announce Type: replace-cross Abstract: Proactive content moderation requires platforms to rapidly and continuously evaluate the credibility of websites. Leveraging the direct and indirect paths users follow to unreliable websites, we develop a website credibility classification and discovery system that integrates both webgraph and large-scale social media contexts. We additionally introduce the concept of dredge words, terms or phrases for which unreliable domains rank highly on search engines, and provide the first exploration of their usage on social media. Our graph neural networks that combine webgraph and social media contexts generate to state-of-the-art results in website credibility classification and significantly improves the top-k identification of unreliable domains. Additionally, we release a novel dataset of dredge words, highlighting their strong connections to both social media and online commerce platforms.

Bridging Social Media and Search Engines: Dredge Words and the Detection of Unreliable Domains Read Post »

AI, Committee, ข่าว, Uncategorized

A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment

arXiv:2502.13520v2 Announce Type: replace Abstract: This paper introduces the Balanced Arabic Readability Evaluation Corpus (BAREC), a large-scale, fine-grained dataset for Arabic readability assessment. BAREC consists of 69,441 sentences spanning 1+ million words, carefully curated to cover 19 readability levels, from kindergarten to postgraduate comprehension. The corpus balances genre diversity, topical coverage, and target audiences, offering a comprehensive resource for evaluating Arabic text complexity. The corpus was fully manually annotated by a large team of annotators. The average pairwise inter-annotator agreement, measured by Quadratic Weighted Kappa, is 81.8%, reflecting a high level of substantial agreement. Beyond presenting the corpus, we benchmark automatic readability assessment across different granularity levels, comparing a range of techniques. Our results highlight the challenges and opportunities in Arabic readability modeling, demonstrating competitive performance across various methods. To support research and education, we make BAREC openly available, along with detailed annotation guidelines and benchmark results.

A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment Read Post »

AI, Committee, ข่าว, Uncategorized

CMCTS: A Constrained Monte Carlo Tree Search Framework for Mathematical Reasoning in Large Language Model

arXiv:2502.11169v2 Announce Type: replace Abstract: This paper introduces the Constrained Monte Carlo Tree Search (CMCTS) framework to enhance the mathematical reasoning capabilities of Large Language Models (LLM). By incorporating a constrained action space, Process Reward Model (PRM), and partial order rules, CMCTS effectively addresses the limitations of existing MCTS methods in terms of state space diversity and action selection rationality. Specifically, during the expansion phase, CMCTS restricts action sampling to a predefined constrained action set to increase candidate state diversity. In the simulation phase, it introduces partial order rules and PRM to optimize action selection and prevent unreasonable state transitions. Experimental results show that CMCTS performs outstandingly across multiple mathematical reasoning benchmarks. Under a zero-shot setting, a 7B-parameter model achieves an average accuracy of 83.4%, surpassing the 72B baseline model by 4.8%. Ablation studies demonstrate that each component of the framework is crucial for performance improvement, and their combined use fully leverages their respective strengths. Overall, the CMCTS framework provides an effective approach to enhancing LLM mathematical reasoning capabilities, supported by theoretical analysis, and offers novel insights for future reasoning tasks.

CMCTS: A Constrained Monte Carlo Tree Search Framework for Mathematical Reasoning in Large Language Model Read Post »

AI, Committee, ข่าว, Uncategorized

How Much Can We Forget about Data Contamination?

arXiv:2410.03249v4 Announce Type: replace-cross Abstract: The leakage of benchmark data into the training data has emerged as a significant challenge for evaluating the capabilities of large language models (LLMs). In this work, we challenge the common assumption that small-scale contamination renders benchmark evaluations invalid. First, we experimentally quantify the magnitude of benchmark overfitting based on scaling along three dimensions: The number of model parameters (up to 1.6B), the number of times an example is seen (up to 144), and the number of training tokens (up to 40B). If model and data follow the Chinchilla scaling laws, minor contamination indeed leads to overfitting. At the same time, even 144 times of contamination can be forgotten if the training data is scaled beyond five times Chinchilla, a regime characteristic of many modern LLMs. Continual pre-training of OLMo-7B corroborates these results. Next, we study the impact of the weight decay parameter on example forgetting, showing that empirical forgetting occurs faster than the cumulative weight decay. This allows us to gauge the degree of example forgetting in large-scale training runs, indicating that many LLMs, including Lllama 3 405B, have forgotten the data seen at the beginning of training.

How Much Can We Forget about Data Contamination? Read Post »

AI, Committee, ข่าว, Uncategorized

EPFL Researchers Introduce MEMOIR: A Scalable Framework for Lifelong Model Editing in LLMs

The Challenge of Updating LLM Knowledge LLMs have shown outstanding performance for various tasks through extensive pre-training on vast datasets. However, these models frequently generate outdated or inaccurate information and can reflect biases during deployment, so their knowledge needs to be updated continuously. Traditional fine-tuning methods are expensive and susceptible to catastrophic forgetting. This has motivated lifelong model editing, which updates model knowledge efficiently and locally. To generate correct predictions, each edit requires reliability, generalizability, and localization. Methods like non-parametric achieve precise localized edits but poor generalization, while parametric methods offer better generalization but suffer from catastrophic forgetting. Limitations of Prior Model Editing Techniques Earlier works have explored sparse neural activations in continual learning, with methods like PackNet and Supermasks-in-Superposition allocating disjoint parameter subsets per task. Gradient-based approaches such as GPM and SPARCL improve efficiency through orthogonal updates but are limited to continual learning contexts. Parametric approaches such as ROME, MEMIT, and WISE modify weights through locating-then-editing strategies or auxiliary modules, but suffer from forgetting over extended edit sequences. Non-parametric methods like GRACE and LOKA store knowledge externally to preserve original weights, enabling precise local edits. However, these methods rely on exact input matches, limiting their generalization capabilities. Introducing MEMOIR: A Structured Approach to Model Editing Researchers from EPFL, Lausanne, Switzerland, have proposed MEMOIR (Model Editing with Minimal Overwrite and Informed Retention), which achieves an optimal balance between reliability, generalization, and locality for large-scale edits. It introduces a memory module that consists of a fully-connected layer within a single transformer block where all edits occur. MEMOIR solves catastrophic forgetting by allocating distinct parameter subsets to each edit and retrieving them during inference to activate only relevant knowledge for specific prompts. Moreover, the method utilizes structured sparsification with sample-dependent masks during editing, activating only prompt-specific parameter subsets. It distributes new knowledge across the parameter space, reducing overwriting and minimizing catastrophic forgetting. Evaluation and Experimental Results MEMOIR operates through a residual memory framework during inference, where the edited output integrates original layer outputs with residual memory outputs. It is evaluated against baselines such as GRACE for external knowledge storage, DEFER for inference-time routing, causal tracing methods like ROME, MEMIT, and ALPHAEDIT, and memory-based methods like WISE. Direct fine-tuning serves as an additional baseline comparison. Experiments are conducted on four autoregressive language models: LLaMA-3-8B-Instruct, Mistral-7B, LLaMA-2-7B, and GPT-J-6B, providing a comprehensive evaluation across different models and scales to show the effectiveness and generalizability of MOMOIR. On the ZsRE question-answering dataset, MEMOIR achieves an average metric of 0.95 on LLaMA-3 with 1000 edits, outperforming all prior methods by a margin of 0.16. Similar outcomes are seen with Mistral, where this method once again achieves the highest average score, highlighting its robustness and effectiveness across various LLMs. Moreover, MEMOIR maintains optimal balanced performance with increasing edit volumes for hallucination correction using the SelfCheckGPT dataset. MEMOIR sustains saturated locality scores under the most challenging scenario of 600 edits, while achieving perplexity metrics 57% and 77% lower than WISE, the second-best performing method, on LLaMA-3 and Mistral, respectively. Conclusion and Future Directions In conclusion, MEMOIR is a scalable framework for lifelong model editing that effectively balances reliability, generalization, and locality using innovative sparsification techniques. The method retrieves relevant updates through sparse activation pattern comparison, allowing edits to generalize to rephrased queries while maintaining model behavior on unrelated prompts. However, certain limitations exist, like modification of only single linear layers, which may restrict handling of long-horizon edits or knowledge requiring broader model changes. Future directions include extending the approach to multiple layers, hierarchical editing strategies, and application to multi-modal or encoder-decoder models beyond the current decoder-only transformer focus. Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. The post EPFL Researchers Introduce MEMOIR: A Scalable Framework for Lifelong Model Editing in LLMs appeared first on MarkTechPost.

EPFL Researchers Introduce MEMOIR: A Scalable Framework for Lifelong Model Editing in LLMs Read Post »

AI, Committee, ข่าว, Uncategorized

REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities

arXiv:2503.13102v2 Announce Type: replace Abstract: Recent advances in large language models (LLMs) have introduced the novel paradigm of using LLMs as judges, where an LLM evaluates and scores the outputs of another LLM, which often correlates highly with human preferences. However, the use of LLM-as-a-judge has been primarily studied in English. In this paper, we evaluate this framework in Russian by introducing the Russian Error tyPes Annotation dataset (REPA), a dataset of 1k user queries and 2k LLM-generated responses. Human annotators labeled each response pair expressing their preferences across ten specific error types, as well as selecting an overall preference. We rank six generative LLMs across the error types using three rating systems based on human preferences. We also evaluate responses using eight LLM judges in zero-shot and few-shot settings. We describe the results of analyzing the judges and position and length biases. Our findings reveal a notable gap between LLM judge performance in Russian and English. However, rankings based on human and LLM preferences show partial alignment, suggesting that while current LLM judges struggle with fine-grained evaluation in Russian, there is potential for improvement.

REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities Read Post »

AI, Committee, ข่าว, Uncategorized

InfoFlood: Jailbreaking Large Language Models with Information Overload

arXiv:2506.12274v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains. However, their potential to generate harmful responses has raised significant societal and regulatory concerns, especially when manipulated by adversarial techniques known as “jailbreak” attacks. Existing jailbreak methods typically involve appending carefully crafted prefixes or suffixes to malicious prompts in order to bypass the built-in safety mechanisms of these models. In this work, we identify a new vulnerability in which excessive linguistic complexity can disrupt built-in safety mechanisms-without the need for any added prefixes or suffixes-allowing attackers to elicit harmful outputs directly. We refer to this phenomenon as Information Overload. To automatically exploit this vulnerability, we propose InfoFlood, a jailbreak attack that transforms malicious queries into complex, information-overloaded queries capable of bypassing built-in safety mechanisms. Specifically, InfoFlood: (1) uses linguistic transformations to rephrase malicious queries, (2) identifies the root cause of failure when an attempt is unsuccessful, and (3) refines the prompt’s linguistic structure to address the failure while preserving its malicious intent. We empirically validate the effectiveness of InfoFlood on four widely used LLMs-GPT-4o, GPT-3.5-turbo, Gemini 2.0, and LLaMA 3.1-by measuring their jailbreak success rates. InfoFlood consistently outperforms baseline attacks, achieving up to 3 times higher success rates across multiple jailbreak benchmarks. Furthermore, we demonstrate that commonly adopted post-processing defenses, including OpenAI’s Moderation API, Perspective API, and SmoothLLM, fail to mitigate these attacks. This highlights a critical weakness in traditional AI safety guardrails when confronted with information overload-based jailbreaks.

InfoFlood: Jailbreaking Large Language Models with Information Overload Read Post »

AI, Committee, ข่าว, Uncategorized

QFFT, Question-Free Fine-Tuning for Adaptive Reasoning

arXiv:2506.12860v1 Announce Type: new Abstract: Recent advancements in Long Chain-of-Thought (CoT) reasoning models have improved performance on complex tasks, but they suffer from overthinking, which generates redundant reasoning steps, especially for simple questions. This paper revisits the reasoning patterns of Long and Short CoT models, observing that the Short CoT patterns offer concise reasoning efficiently, while the Long CoT patterns excel in challenging scenarios where the Short CoT patterns struggle. To enable models to leverage both patterns, we propose Question-Free Fine-Tuning (QFFT), a fine-tuning approach that removes the input question during training and learns exclusively from Long CoT responses. This approach enables the model to adaptively employ both reasoning patterns: it prioritizes the Short CoT patterns and activates the Long CoT patterns only when necessary. Experiments on various mathematical datasets demonstrate that QFFT reduces average response length by more than 50%, while achieving performance comparable to Supervised Fine-Tuning (SFT). Additionally, QFFT exhibits superior performance compared to SFT in noisy, out-of-domain, and low-resource scenarios.

QFFT, Question-Free Fine-Tuning for Adaptive Reasoning Read Post »

th