YouZum

Non-Collaborative User Simulators for Tool Agents

arXiv:2509.23124v3 Announce Type: replace Abstract: Tool agents interact with users through multi-turn dialogues to accomplish various tasks. Recent studies have adopted user simulation methods to develop these agents in multi-turn settings. However, existing user simulators tend to be agent-friendly, exhibiting only cooperative behaviors, which fails to train and test agents against non-collaborative users in the real world. To address this, we propose a novel user simulator architecture that simulates four categories of non-collaborative behaviors: requesting unavailable services, digressing into tangential conversations, expressing impatience, and providing incomplete utterances. Our user simulator can simulate challenging and natural non-collaborative behaviors while reliably delivering all intents and information necessary to accomplish the task. Our experiments on MultiWOZ and $\tau$-bench reveal significant performance degradation in state-of-the-art tool agents when encountering non-collaborative users. We provide detailed analyses of agents’ weaknesses under each non-collaborative condition, such as escalated hallucinations and dialogue breakdowns. Ultimately, we contribute an easily extensible user simulation framework to help the research community develop tool agents and preemptively diagnose them under challenging real-world conditions within their own services.
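
To illustrate the idea of the four non-collaborative behavior categories, here is a minimal sketch of a simulator wrapper that perturbs a cooperative user turn. This is not the authors' implementation; the behavior descriptions, the injection probability, and the `cooperative_reply` input are assumptions used purely to show the control flow.

```python
# Illustrative sketch (not the paper's code): inject one of four
# non-collaborative behavior categories into a cooperative user turn.
import random

BEHAVIORS = {
    "unavailable_service": "Ask for a service the agent cannot provide before returning to the task.",
    "digression": "Drift into a tangential topic for one turn, then continue the task.",
    "impatience": "Express frustration about slow progress while still answering the question.",
    "incomplete_utterance": "Reply with a fragment that omits part of the requested information.",
}

def simulate_user_turn(cooperative_reply: str, p_noncollab: float = 0.3) -> str:
    """Return the user's next utterance, possibly made non-collaborative."""
    if random.random() >= p_noncollab:
        return cooperative_reply
    behavior = random.choice(list(BEHAVIORS))
    # In the paper the rewrite is produced by an LLM conditioned on the task
    # state so that all required intents and slot values are still delivered
    # eventually; here we only tag the turn to show where that step would go.
    return f"[{behavior}] {cooperative_reply}"
```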


NeuroABench: A Multimodal Evaluation Benchmark for Neurosurgical Anatomy Identification

arXiv:2512.06921v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) have shown significant potential in surgical video understanding. With improved zero-shot performance and more effective human-machine interaction, they provide a strong foundation for advancing surgical education and assistance. However, existing research and datasets primarily focus on understanding surgical procedures and workflows, while paying limited attention to the critical role of anatomical comprehension. In clinical practice, surgeons rely heavily on precise anatomical understanding to interpret, review, and learn from surgical videos. To fill this gap, we introduce the Neurosurgical Anatomy Benchmark (NeuroABench), the first multimodal benchmark explicitly created to evaluate anatomical understanding in the neurosurgical domain. NeuroABench consists of 9 hours of annotated neurosurgical videos covering 89 distinct procedures and is developed using a novel multimodal annotation pipeline with multiple review cycles. The benchmark evaluates the identification of 68 clinical anatomical structures, providing a rigorous and standardized framework for assessing model performance. Experiments on over 10 state-of-the-art MLLMs reveal significant limitations, with the best-performing model achieving only 40.87% accuracy in anatomical identification tasks. To further evaluate the benchmark, we extract a subset of the dataset and conduct an informative test with four neurosurgical trainees. The results show that the best-performing student achieves 56% accuracy, with the lowest score at 28% and an average of 46.5%. While the best MLLM performs comparably to the lowest-scoring student, it still lags significantly behind the group’s average performance. This comparison underscores both the progress of MLLMs in anatomical understanding and the substantial gap that remains in achieving human-level performance.
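
As a toy illustration of how a benchmark of this shape is scored, the sketch below computes exact-match accuracy over items that pair a clip with a gold anatomical label. The item schema and the `predict` callable are assumptions for illustration, not the released NeuroABench format.

```python
# Toy accuracy scorer for an anatomy-identification benchmark (illustrative
# schema only; not the released NeuroABench data format).
from dataclasses import dataclass

@dataclass
class Item:
    clip_path: str
    question: str
    gold_label: str  # one of the 68 anatomical structures

def accuracy(items: list[Item], predict) -> float:
    """`predict` maps (clip_path, question) to a predicted structure name."""
    correct = sum(
        predict(it.clip_path, it.question).strip().lower() == it.gold_label.lower()
        for it in items
    )
    return correct / len(items)
```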


Large Language Models and Forensic Linguistics: Navigating Opportunities and Threats in the Age of Generative AI

arXiv:2512.06922v1 Announce Type: new Abstract: Large language models (LLMs) present a dual challenge for forensic linguistics. They serve as powerful analytical tools enabling scalable corpus analysis and embedding-based authorship attribution, while simultaneously destabilising foundational assumptions about idiolect through style mimicry, authorship obfuscation, and the proliferation of synthetic texts. Recent stylometric research indicates that LLMs can approximate surface stylistic features yet exhibit detectable differences from human writers, a tension with significant forensic implications. However, current AI-text detection techniques, whether classifier-based, stylometric, or watermarking approaches, face substantial limitations: high false positive rates for non-native English writers and vulnerability to adversarial strategies such as homoglyph substitution. These uncertainties raise concerns under legal admissibility standards, particularly the Daubert and Kumho Tire frameworks. The article concludes that forensic linguistics requires methodological reconfiguration to remain scientifically credible and legally admissible. Proposed adaptations include hybrid human-AI workflows, explainable detection paradigms beyond binary classification, and validation regimes measuring error and bias across diverse populations. The discipline’s core insight, i.e., that language reveals information about its producer, remains valid but must accommodate increasingly complex chains of human and machine authorship.


Why They Disagree: Decoding Differences in Opinions about AI Risk on the Lex Fridman Podcast

arXiv:2512.06350v1 Announce Type: cross Abstract: The emergence of transformative technologies often surfaces deep societal divisions, nowhere more evident than in contemporary debates about artificial intelligence (AI). A striking feature of these divisions is that they persist despite shared interests in ensuring that AI benefits humanity and avoiding catastrophic outcomes. This paper analyzes contemporary debates about AI risk, parsing the differences between the “doomer” and “boomer” perspectives into definitional, factual, causal, and moral premises to identify key points of contention. We find that differences in perspectives about existential risk (“X-risk”) arise fundamentally from differences in causal premises about design vs. emergence in complex systems, while differences in perspectives about employment risks (“E-risks”) pertain to different causal premises about the applicability of past theories (evolution) vs. their inapplicability (revolution). Disagreements about these two forms of AI risk appear to share two properties: neither involves significant disagreements on moral values and both can be described in terms of differing views on the extent of boundedness of human rationality. Our approach to analyzing reasoning chains at scale, using an ensemble of LLMs to parse textual data, can be applied to identify key points of contention in debates about risk to the public in any arena.
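
The premise-parsing step described above can be pictured as an ensemble of LLM judges labeling each extracted claim and a majority vote over their labels. The sketch below shows only that aggregation; the `judges` callables stand in for whatever LLM backends are used and are not the authors' pipeline.

```python
# Illustrative aggregation step: label each claim as a definitional, factual,
# causal or moral premise using several LLM "judges" and keep the majority.
from collections import Counter
from typing import Callable, Iterable

PREMISE_TYPES = ("definitional", "factual", "causal", "moral")

def label_claim(claim: str, judges: Iterable[Callable[[str], str]]) -> str:
    votes = [j(claim) for j in judges]
    votes = [v for v in votes if v in PREMISE_TYPES]
    return Counter(votes).most_common(1)[0][0] if votes else "unlabeled"
```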


Zhipu AI Releases GLM-4.6V: A 128K Context Vision Language Model with Native Tool Calling

Zhipu AI has open sourced the GLM-4.6V series as a pair of vision language models that treat images, video and tools as first class inputs for agents, not as afterthoughts bolted on top of text.

Model lineup and context length

The series has two models. GLM-4.6V is a 106B parameter foundation model for cloud and high performance cluster workloads. GLM-4.6V-Flash is a 9B parameter variant tuned for local deployment and low latency use. GLM-4.6V extends the training context window to 128K tokens. In practice this supports roughly 150 pages of dense documents, 200 slide pages or one hour of video in a single pass, because pages are encoded as images and consumed by the visual encoder.

Native multimodal tool use

The main technical change is native multimodal Function Calling. Traditional tool use in LLM systems routes everything through text: images or pages are first turned into descriptions, the model calls tools using text arguments and then reads textual responses, which discards information and adds latency. GLM-4.6V instead passes images, screenshots and document pages directly as tool parameters. Tools can return search result grids, charts, rendered web pages or product images, and the model consumes those visual outputs and fuses them with text in the same reasoning chain. This closes the loop from perception to understanding to execution and is explicitly positioned as the bridge between visual perception and executable action for multimodal agents.

To support this, Zhipu AI extends the Model Context Protocol with URL based multimodal handling. Tools receive and return URLs that identify specific images or frames, which avoids file size limits and allows precise selection inside multi image contexts.

Rich text content, web search and frontend replication

The Zhipu AI research team describes four canonical scenarios.

First, rich text content understanding and creation. GLM-4.6V reads mixed inputs such as papers, reports or slide decks and produces structured image text interleaved outputs. It understands text, charts, figures, tables and formulas in the same document. During generation it can crop relevant visuals or retrieve external images through tools, then run a visual audit step that filters low quality images and composes the final article with inline figures.

Second, visual web search. The model can detect user intent, plan which search tools to call and combine text to image and image to text search. It then aligns retrieved images and text, selects the relevant evidence and outputs a structured answer, for example a visual comparison of products or places.

Third, frontend replication and visual interaction. GLM-4.6V is tuned for design to code workflows. From a UI screenshot, it reconstructs pixel accurate HTML, CSS and JavaScript. Developers can then mark a region on the screenshot and issue natural language instructions, for example move this button left or change this card background. The model maps those instructions back to the code and returns an updated snippet.

Fourth, multimodal document understanding at long context. GLM-4.6V can read multi document inputs up to the 128K token context limit by treating pages as images. The research team reports a case where the model processes financial reports from four public companies, extracts core metrics and builds a comparison table, and a case where it summarises a full football match while keeping the ability to answer questions about specific goals and timestamps.
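To make the tool calling flow above concrete, here is a minimal client side sketch against an OpenAI compatible chat endpoint. The base URL, model identifier, tool schema and image URL are illustrative assumptions, not confirmed details of the GLM-4.6V API.

```python
# Minimal sketch of multimodal function calling via an OpenAI-compatible
# client. Endpoint, model name and tool schema are assumptions for
# illustration, not confirmed GLM-4.6V release details.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "image_search",
        "description": "Search the web and return result images as URLs.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Find two alternatives that look like this chair."},
        {"type": "image_url", "image_url": {"url": "https://example.com/chair.jpg"}},
    ],
}]

response = client.chat.completions.create(
    model="glm-4.6v",  # assumed model identifier
    messages=messages,
    tools=tools,
)

# If the model decides to call the tool, its arguments arrive as JSON text.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```
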
Architecture, data and reinforcement learning

The GLM-4.6V models belong to the GLM-V family and build on the tech reports for GLM-4.5V and GLM-4.1V-Thinking. The research team highlights three main technical ingredients.

First, long sequence modeling. GLM-4.6V extends the training context window to 128K tokens and runs continual pre-training on massive long context image text corpora. It uses compression alignment ideas from Glyph so that visual tokens can carry dense information that is aligned with language tokens.

Second, world knowledge enhancement. The Zhipu AI team adds a billion scale multimodal perception and world knowledge dataset at pre-training time. This covers layered encyclopedic concepts and everyday visual entities. The stated goal is to improve both basic perception and cross modal question answering completeness, not only benchmark scores.

Third, agentic data synthesis and extended MCP. The research team generates large synthetic traces in which the model calls tools, processes visual outputs and iterates on plans. They extend MCP with URL based multimodal handling and an interleaved output mechanism. The generation stack follows a Draft, Image Selection, Final Polish sequence, and the model can autonomously call cropping or search tools between these stages to place images at the right positions in the output. Tool invocation is part of the reinforcement learning objective: GLM-4.6V uses RL to align planning, instruction following and format adherence in complex tool chains.

Performance

Benchmark results are presented in the release post: https://z.ai/blog/glm-4.6v

Key Takeaways

- GLM-4.6V is a 106B multimodal foundation model with a 128K token training context, and GLM-4.6V-Flash is a 9B variant optimized for local and low latency use.
- Both models support native multimodal Function Calling, so tools can consume and return images, video frames and document pages directly, which links visual perception to executable actions for agents.
- GLM-4.6V is trained for long context multimodal understanding and interleaved generation, so it can read large mixed document sets and emit structured text with inline figures and tool selected images in one pass.
- The series achieves state of the art performance on major multimodal benchmarks at similar parameter scales and is released as open source weights under the MIT license on Hugging Face and ModelScope.
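
For readers who want to try the 9B Flash variant locally, here is a hedged loading sketch with Hugging Face transformers. The repository id, the Auto classes and the chat message format are assumptions based on how comparable open VLM releases are typically packaged; the official model card is the source of truth.

```python
# Minimal local-inference sketch for GLM-4.6V-Flash. Repo id, classes and
# message format below are assumptions, not confirmed release details.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "zai-org/GLM-4.6V-Flash"  # assumed Hugging Face repo id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/ui_screenshot.png"},
        {"type": "text", "text": "Reproduce this UI as a single HTML file."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=1024)
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```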


Text Rationalization for Robust Causal Effect Estimation

arXiv:2512.05373v1 Announce Type: cross Abstract: Recent advances in natural language processing have enabled the increasing use of text data in causal inference, particularly for adjusting confounding factors in treatment effect estimation. Although high-dimensional text can encode rich contextual information, it also poses unique challenges for causal identification and estimation. In particular, the positivity assumption, which requires sufficient treatment overlap across confounder values, is often violated at the observational level, when massive text is represented in feature spaces. Redundant or spurious textual features inflate dimensionality, producing extreme propensity scores, unstable weights, and inflated variance in effect estimates. We address these challenges with Confounding-Aware Token Rationalization (CATR), a framework that selects a sparse necessary subset of tokens using a residual-independence diagnostic designed to preserve confounding information sufficient for unconfoundedness. By discarding irrelevant texts while retaining key signals, CATR mitigates observational-level positivity violations and stabilizes downstream causal effect estimators. Experiments on synthetic data and a real-world study using the MIMIC-III database demonstrate that CATR yields more accurate, stable, and interpretable causal effect estimates than existing baselines.
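
The downstream estimation step described above can be sketched as follows: once a sparse token subset has been selected (the CATR selection and its residual-independence diagnostic are not reproduced here), the retained token features feed a propensity model and an inverse-probability-weighted effect estimate. Everything below is an illustration of that pipeline under those assumptions, not the authors' code.

```python
# Illustrative IPW estimation on selected token features; clipping mirrors
# the paper's concern about extreme propensities under positivity violations.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X_tokens, treatment, outcome, clip=(0.05, 0.95)):
    """X_tokens: n x k matrix of selected token features."""
    propensity = LogisticRegression(max_iter=1000).fit(X_tokens, treatment)
    e = np.clip(propensity.predict_proba(X_tokens)[:, 1], *clip)
    t, y = np.asarray(treatment), np.asarray(outcome)
    # Horvitz-Thompson style average treatment effect with clipped scores.
    return np.mean(t * y / e - (1 - t) * y / (1 - e))
```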


Learning from Self Critique and Refinement for Faithful LLM Summarization

arXiv:2512.05387v1 Announce Type: new Abstract: Large Language Models (LLMs) often suffer from hallucinations: output content that is not grounded in the input context, when performing long-form text generation tasks such as summarization. Prior works have shown that hallucinations can be reduced by iteratively critiquing and refining previously generated outputs using either the same model or a more powerful teacher model as the critique. However, these approaches either require additional test-time compute or assume access to more powerful teacher models, making them costly and less practical. In this work, we propose Self Critique and Refinement-based Preference Optimization (SCRPO), which is a self-supervised training framework that first constructs a preference dataset by leveraging the LLM’s own critique and refinement capabilities, and then applies preference learning to improve the same LLM for faithful summarization. Experiments on three summarization benchmarks (XSUM, CNNDM, and SAMSum) demonstrate that our approach outperforms state-of-the-art self-supervised learning methods in terms of faithfulness metrics while either maintaining or improving other metrics that measure the overall quality of the summary. Moreover, compared to test-time refinement, our approach not only improves efficiency but also results in more faithful summaries.
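
The preference-data construction described above can be sketched as: the model drafts a summary, critiques it, refines it, and the (refined, draft) pair becomes (chosen, rejected) for preference learning. The `generate`, `critique` and `refine` callables below are placeholders for the underlying LLM calls; this is an illustration of the data flow, not the authors' implementation.

```python
# Illustrative construction of a preference pair from self critique and
# refinement; a dataset of such dicts can then feed a DPO-style trainer.
from typing import Callable

def build_preference_pair(document: str,
                          generate: Callable[[str], str],
                          critique: Callable[[str, str], str],
                          refine: Callable[[str, str, str], str]) -> dict:
    draft = generate(document)
    feedback = critique(document, draft)          # flags unfaithful spans
    improved = refine(document, draft, feedback)  # rewrite grounded in the source
    return {"prompt": document, "chosen": improved, "rejected": draft}
```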


Faithfulness metric fusion: Improving the evaluation of LLM trustworthiness across domains

arXiv:2512.05700v1 Announce Type: new Abstract: We present a methodology for improving the accuracy of faithfulness evaluation in Large Language Models (LLMs). The proposed methodology is based on combining elementary faithfulness metrics into a single fused metric, for the purpose of improving the evaluation of the faithfulness of LLM outputs. The proposed strategy for metric fusion deploys a tree-based model to identify the importance of each metric, driven by the integration of human judgements evaluating the faithfulness of LLM responses. This fused metric is demonstrated to correlate more strongly with human judgements across all tested domains for faithfulness. Improving the ability to evaluate the faithfulness of LLMs allows greater confidence to be placed in models, enabling their implementation in a greater diversity of scenarios. Additionally, we homogenise a collection of datasets across question answering and dialogue-based domains and incorporate human judgements and LLM responses within this dataset, allowing for the reproduction and trialling of faithfulness evaluation across domains.
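
As a toy picture of metric fusion, the sketch below fits a tree ensemble that maps elementary faithfulness scores to human ratings and exposes per-metric importances. The metric names and values are illustrative placeholders, and a gradient boosted regressor is used here as a generic stand-in for the paper's tree-based model.

```python
# Illustrative metric fusion: elementary scores -> fused score via trees.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

metric_names = ["nli_entailment", "qa_consistency", "token_overlap"]
X = np.array([[0.82, 0.67, 0.54],     # elementary metric scores per response
              [0.31, 0.40, 0.61],     # (toy values)
              [0.90, 0.88, 0.72]])
human = np.array([0.9, 0.3, 1.0])     # human faithfulness ratings

fused = GradientBoostingRegressor().fit(X, human)
print(dict(zip(metric_names, fused.feature_importances_)))  # per-metric importance
print(fused.predict(X[:1]))                                 # fused score for response 0
```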


Reinforce-Ada: An Adaptive Sampling Framework under Non-linear RL Objectives

arXiv:2510.04996v3 Announce Type: replace-cross Abstract: Reinforcement learning (RL) for large language model reasoning is frequently hindered by signal loss, a phenomenon where standard uniform sampling with small group sizes fails to uncover informative learning signals for difficult prompts. We demonstrate that this collapse is a statistical artifact of undersampling rather than an inherent model limitation. To address this systematically, we introduce a theoretical framework based on optimizing a non-linear RL objective (e.g., log-likelihood). We show that this objective naturally induces a weighted gradient estimator that prioritizes difficult prompts, which can be robustly realized through adaptive sampling. Guided by this framework, we propose Reinforce-Ada, a family of algorithms that dynamically allocates inference budgets based on prompt difficulty, effectively scaling up RL compute to where it is needed most. Unlike passive filtering methods that discard low-signal prompts, Reinforce-Ada actively invests compute to recover them. We introduce two efficient realizations: an estimation-based approach and a model-free sequential sampling approach. Extensive experiments across multiple benchmarks show that Reinforce-Ada significantly outperforms uniform baselines like GRPO, recovering lost signals and accelerating convergence by up to $2\times$ while maintaining the same total inference budget. Code is available at https://github.com/RLHFlow/Reinforce-Ada.
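
A minimal sketch of the sequential flavor of adaptive sampling is shown below: keep drawing responses for a prompt until the reward group contains both a success and a failure (so it yields a non-zero learning signal) or a per-prompt cap is hit. The `sample_and_score` callable stands in for rollout plus reward computation; this illustrates only the allocation logic, not the released Reinforce-Ada code.

```python
# Illustrative sequential adaptive sampling: easy prompts stop early,
# hard prompts receive more of the inference budget.
from typing import Callable, List

def adaptive_group(prompt: str,
                   sample_and_score: Callable[[str], float],
                   min_n: int = 4, max_n: int = 32) -> List[float]:
    rewards: List[float] = []
    while len(rewards) < max_n:
        rewards.append(sample_and_score(prompt))
        have_both = min(rewards) < max(rewards)   # at least one success and one failure
        if len(rewards) >= min_n and have_both:
            break                                 # informative group found
    return rewards
```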


Fair Text Classification via Transferable Representations

arXiv:2503.07691v2 Announce Type: replace-cross Abstract: Group fairness is a central research topic in text classification, where reaching fair treatment between sensitive groups (e.g., women and men) remains an open challenge. We propose an approach that extends the use of the Wasserstein Dependency Measure for learning unbiased neural text classifiers. Given the challenge of distinguishing fair from unfair information in a text encoder, we draw inspiration from adversarial training by inducing independence between representations learned for the target label and those for a sensitive attribute. We further show that Domain Adaptation can be efficiently leveraged to remove the need for access to the sensitive attributes in the dataset we curate. We provide both theoretical and empirical evidence that our approach is well-founded.
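
Schematically, the training objective combines a task loss with a penalty on the dependence between the learned representation and the sensitive attribute. In the sketch below the `critic` callable is a generic stand-in for the dependence estimator; the paper uses the Wasserstein Dependency Measure, whose critic and training loop are not reproduced here.

```python
# Illustrative fairness-regularized objective: task loss plus a dependence
# penalty between the representation z and the sensitive attribute s.
import torch
import torch.nn as nn

class FairClassifier(nn.Module):
    def __init__(self, dim_in: int, dim_z: int = 64, n_classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, dim_z), nn.ReLU())
        self.head = nn.Linear(dim_z, n_classes)

    def forward(self, x):
        z = self.encoder(x)
        return self.head(z), z

def training_loss(model, critic, x, y, s, lam: float = 1.0):
    logits, z = model(x)
    task = nn.functional.cross_entropy(logits, y)
    # critic(z, s) scores how dependent z is on s (higher = more dependent);
    # the encoder is penalized for that dependence.
    return task + lam * critic(z, s).mean()
```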
