YouZum

Microsoft Copilot gets 12 big updates for fall, including new AI assistant character Mico

Microsoft today held a live online announcement event for its Copilot AI digital assistant, with Mustafa Suleyman, CEO of Microsoft's AI division, and other presenters unveiling a new generation of features that deepen integration across Windows, Edge, and Microsoft 365. The updates position the platform as a practical assistant for work and off-time while letting users preserve control and safety of their data. The Copilot 2025 Fall Update also ups the ante on the capabilities and accessibility of Microsoft's generative AI assistance, so businesses relying on Microsoft products, and those seeking to offer complementary or competing products, would do well to review it.

Suleyman emphasized that the updates reflect a shift from hype to usefulness. "Technology should work in service of people, not the other way around," he said. "Copilot is not just a product—it's a promise that AI can be helpful, supportive, and deeply personal."

Intriguingly, the announcement also sought to shine a greater spotlight on Microsoft's own homegrown AI models, as opposed to those of its partner and investee OpenAI, which previously powered the entire Copilot experience. Suleyman wrote today in a blog post: "At the foundation of it all is our strategy to put the best models to work for you – both those we build and those we don't. Over the past few months, we have released in-house models like MAI-Voice-1, MAI-1-Preview and MAI-Vision-1, and are rapidly iterating."

12 Features That Redefine Copilot

The Fall Release consolidates Copilot's identity around twelve key capabilities, each with potential to streamline organizational knowledge work, development, or support operations.

Groups – Shared Copilot sessions where up to 32 participants can brainstorm, co-author, or plan simultaneously. For distributed teams, it effectively merges a meeting chat, task board, and generative workspace. Copilot maintains context, summarizes decisions, and tracks open actions.

Imagine – A collaborative hub for creating and remixing AI-generated content. In an enterprise setting, Imagine enables rapid prototyping of visuals, marketing drafts, or training materials.

Mico – A new character identity for Copilot that introduces expressive feedback and emotional expression in the form of a cute, amorphous blob. Echoing Microsoft's historic character interfaces like Clippy (Office 97) and Cortana (2014), Mico serves as a unifying UX layer across modalities.

Real Talk – A conversational mode that adapts to a user's communication style and offers calibrated pushback, ending the sycophancy some users have complained about in other AI models, such as prior versions of OpenAI's ChatGPT. For professionals, it allows Socratic problem-solving rather than passive answer generation, making Copilot more credible in technical collaboration.

Memory & Personalization – Long-term contextual memory that lets Copilot recall key details—training plans, dates, goals—at the user's direction.

Connectors – Integration with OneDrive, Outlook, Gmail, Google Drive, and Google Calendar for natural-language search across accounts.

Proactive Actions (Preview) – Context-based prompts and next-step suggestions derived from recent activity.

Copilot for Health – Health information grounded in credible medical sources such as Harvard Health, with tools for locating and comparing doctors.

Learn Live – A Socratic, voice-driven tutoring experience using questions, visuals, and whiteboards.
Copilot Mode in Edge – Converts Microsoft Edge into an "AI browser" that summarizes, compares, and executes web actions by voice.

Copilot on Windows – Deep integration across Windows 11 PCs with "Hey Copilot" activation, Copilot Vision guidance, and quick access to files and apps.

Copilot Pages and Copilot Search – A collaborative file canvas plus a unified search experience combining AI-generated, cited answers with standard web results.

The Fall Release is immediately available in the United States, with rollout to the UK, Canada, and other markets in progress. Some functions, such as Groups, Journeys, and Copilot for Health, remain U.S.-only for now. Proactive Actions requires a Microsoft 365 Personal, Family, or Premium subscription. Together these updates illustrate Microsoft's pivot from static productivity suites to contextual AI infrastructure, with the Copilot brand acting as the connective tissue across user roles.

From Clippy to Mico: The Return of a Guided Interface

One of the most notable introductions is Mico, a small animated companion available within Copilot's voice-enabled experiences, including the Copilot app on Windows, iOS, and Android, as well as in Study Mode and other conversational contexts. It serves as an optional visual companion that appears during interactive or voice-based sessions rather than across all Copilot interfaces. Mico listens, reacts with expressions, and changes color to reflect tone and emotion, bringing visual warmth to an AI assistant experience that has traditionally been text-heavy.

Mico's design recalls earlier eras of Microsoft's history with character-based assistants. In the mid-1990s, Microsoft experimented with Microsoft Bob (1995), a software interface that used cartoon characters like a dog named Rover to guide users through everyday computing tasks. While innovative for its time, Bob was discontinued after a year due to performance and usability issues. A few years later came Clippy, the Office Assistant introduced in Microsoft Office 97. Officially known as "Clippit," the animated paperclip would pop up to offer help and tips within Word and other Office applications. Clippy became widely recognized, sometimes humorously so, for interrupting users with unsolicited advice. Microsoft retired Clippy from Office in 2001, though the character remains a nostalgic symbol of early AI-driven assistance. More recently, Cortana, launched in 2014 as Microsoft's digital voice assistant for Windows and mobile devices, aimed to provide natural-language interaction similar to Apple's Siri or Amazon's Alexa. Despite positive early reception, Cortana's role diminished as Microsoft refocused on enterprise productivity and AI integration, and the service was officially discontinued on Windows in 2023.

Mico, by contrast, represents a modern reimagining of that tradition, combining the personality of early assistants with the intelligence and adaptability of contemporary AI models. Where Clippy offered canned responses, Mico listens, learns, and reflects a user's mood in real time. The goal, as Suleyman framed it, is to create an AI that feels "helpful, supportive, and deeply personal."

Groups Are Microsoft's Version of Claude and ChatGPT Projects

During Microsoft's launch video, product researcher

UltraCUA: A Foundation Computer-Use Agents Model that Bridges the Gap between General-Purpose GUI Agents and Specialized API-based Agents

Computer-use agents have been limited to primitives: they click, they type, they scroll, and long action chains amplify grounding errors and waste steps. Apple researchers introduce UltraCUA, a foundation model built around a hybrid action space that lets an agent interleave low-level GUI actions with high-level programmatic tool calls, choosing the cheaper and more reliable move at each step. The approach improves success rates and reduces steps on OSWorld, and it transfers to WindowsAgentArena without Windows-specific training.

https://arxiv.org/pdf/2510.17790

What hybrid action changes

Hybrid action treats tools as first-class actions. A tool call encapsulates a multi-step operation as a single function with a clear signature and a docstring, while a click or a key press remains available when no programmatic path exists. The agent learns to alternate between both modes, with the goal of reducing cascading errors and cutting step counts. The research team positions this as a bridge between GUI-only CUAs and tool-centric agent frameworks; a sketch of what such an action space might look like follows.
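The paper does not publish an interface definition, so the following is a speculative Python sketch of a hybrid action space. The class names, the HybridAgent policy stub, and the example save_file tool are illustrative assumptions, not UltraCUA's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Union

# Low-level GUI primitives: the fallback when no programmatic path exists.
@dataclass
class Click:
    x: int
    y: int

@dataclass
class TypeText:
    text: str

# High-level programmatic tool call: one function replaces a long GUI sequence.
@dataclass
class ToolCall:
    name: str        # e.g. "vscode.save_file"
    fn: Callable     # callable interface that hides the multi-step GUI work
    kwargs: dict

Action = Union[Click, TypeText, ToolCall]

def save_file(path: str) -> None:
    """Save the active editor buffer to `path` (hypothetical example tool).

    Encapsulates what would otherwise be several primitives:
    open menu -> click 'Save As' -> type path -> press Enter.
    """
    print(f"[tool] saved to {path}")

class HybridAgent:
    """Toy policy stub: prefer a registered tool when one matches the intent,
    otherwise fall back to a GUI primitive."""

    def __init__(self, tools: dict[str, Callable]):
        self.tools = tools

    def act(self, intent: str, **kwargs) -> Action:
        if intent in self.tools:
            return ToolCall(intent, self.tools[intent], kwargs)
        # No programmatic path: emit a primitive (coordinates from a grounder).
        return Click(x=kwargs.get("x", 0), y=kwargs.get("y", 0))

agent = HybridAgent(tools={"vscode.save_file": save_file})
action = agent.act("vscode.save_file", path="notes.txt")
if isinstance(action, ToolCall):
    action.fn(**action.kwargs)   # one reliable call instead of a click chain
```

The point of the design is that the learned policy, not a hand-written router, decides at each step whether a tool call or a primitive is the cheaper, more reliable move.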
Scaled tool acquisition

UltraCUA builds its tool library with an automated pipeline. The system extracts keyboard shortcuts and commands from software documentation, integrates open-source implementations from agent toolkits, and uses coding agents to synthesize new tools. Each tool is a callable interface that hides a long GUI sequence. The research team reports coverage across 10 desktop domains with 881 tools. The largest buckets include VS Code with 135 tools and LibreOffice Writer with 123 tools; Thunderbird and GIMP also have deep coverage.

Verifiable synthetic tasks and trajectories

Training requires grounded supervision and stable rewards, so UltraCUA uses a dual synthetic engine. An evaluator-first pipeline composes atomic verifiers for browsers, files, images, and system state, then generates tasks that satisfy those checks. An instruction-first pipeline explores the OS and proposes context-aligned tasks, which are then verified. The result is 17,864 verifiable tasks across 10 domains such as Chrome, LibreOffice, GIMP, VS Code, system, Thunderbird, VLC, and multi-app workflows. Chrome has 2,826 tasks, the LibreOffice suite sums to 5,885, and multi-app tasks reach 2,113.

A multi-agent rollout produces successful hybrid trajectories. The planner uses OpenAI o3 for decision making, and the grounder uses GTA1-7B for accurate visual localization. The rollout yields about 26.8K successful trajectories that show when to use a tool and when to act in the GUI. These trajectories are the core of the supervised phase.

Training Approach

Training has two stages. Stage 1 is supervised fine-tuning: the models train for 3 epochs at a learning rate of 2e-5 on the successful trajectories, with loss applied turn-wise to avoid over-weighting early steps. Stage 2 is online reinforcement learning: the models train for 150 steps at a learning rate of 1e-6 on verified tasks sampled by difficulty. The policy optimization follows a GRPO variant with clip-higher, and removes KL regularization and format rewards. The reward combines a sparse task outcome with a tool-use term. Experiments use NVIDIA H100 GPUs, and the context is kept near 32K by controlling the number of exposed tools.

Results on OSWorld

UltraCUA improves success at both 7B and 32B scales. Under 15-step budgets, UltraCUA-32B reaches 41.0 percent success while OpenCUA-32B reaches 29.7 percent, an absolute gain of 11.3 points. UltraCUA-7B reaches 28.9 percent, against 23.4 percent for UI-TARS-1.5-7B. Gains persist under 50-step budgets. A per-domain breakdown shows consistent lifts across Chrome, Writer, VS Code, and cross-application tasks, and average steps decrease against baselines. These shifts indicate better action selection rather than only more attempts.

Cross-platform transfer on WindowsAgentArena

UltraCUA trains only on Ubuntu-based OSWorld data and is then evaluated on WindowsAgentArena. UltraCUA-7B reaches 21.7 percent success, exceeding UI-TARS-1.5-7B at 18.1 percent and a Qwen2 baseline trained with Windows data at 13.5 percent. The result suggests that hybrid action strategies learned on one platform transfer to other platforms; the paper highlights this as zero-shot platform generalization.

Key Takeaways

UltraCUA formalizes a hybrid action space that lets a single agent alternate between GUI primitives and programmatic tool calls, reducing long, error-prone action chains. The research team scales a reusable tool library through an automated pipeline and pairs it with a synthetic data engine, yielding 17,000-plus verifiable computer-use tasks for training and evaluation. Training follows a two-stage recipe, supervised fine-tuning on successful hybrid trajectories then online reinforcement learning on verifiable tasks, which optimizes when to call tools versus act in the GUI. On OSWorld, UltraCUA reports an average 22 percent relative improvement over base models with 11 percent fewer steps, indicating gains in both reliability and efficiency. The 7B model reaches 21.7 percent success on WindowsAgentArena without Windows-specific training, showing cross-platform generalization of the hybrid action policy.

Editorial Comments

UltraCUA moves computer-use agents from brittle primitive action chains to a hybrid action policy that integrates GUI primitives with programmatic tool calls, reducing error propagation and step counts. It scales tools via an automated pipeline and pairs them with a synthetic data engine yielding 17,000-plus verifiable tasks, enabling supervised fine-tuning and online reinforcement learning on grounded signals. The reported results, a 22 percent relative improvement on OSWorld with 11 percent fewer steps and 21.7 percent success on WindowsAgentArena without Windows-specific training, indicate cross-platform transfer of the policy.

Google AI Introduces FLAME Approach: A One-Step Active Learning that Selects the Most Informative Samples for Training and Makes a Model Specialization Super Fast

Open-vocabulary object detectors answer text queries with boxes. In remote sensing, zero-shot performance drops because classes are fine-grained and the visual context is unusual. The Google Research team proposes FLAME, a one-step active learning strategy that rides on a strong open-vocabulary detector and adds a tiny refiner that can be trained in near real time on a CPU. The base model generates high-recall proposals, the refiner filters false positives with a few targeted labels, and full-model fine-tuning is avoided. The paper reports state-of-the-art accuracy on DOTA and DIOR with 30 shots, and minute-scale adaptation per label on a CPU.

https://arxiv.org/pdf/2510.17670v1

Problem framing

Open-vocabulary detectors such as OWL-ViT v2 are trained on web-scale image-text pairs. They generalize well on natural images, yet they struggle when categories are subtle, for example chimney versus storage tank, or when the imaging geometry is different, for example nadir aerial tiles with rotated objects and small scales. Precision falls because the text embedding and the visual embedding overlap for look-alike categories. A practical system needs the breadth of open-vocabulary models and the precision of a local specialist, without hours of GPU fine-tuning or thousands of new labels.

Method and design

FLAME is a cascaded pipeline:

1. Run a zero-shot open-vocabulary detector to produce many candidate boxes for a text query, for example "chimney."
2. Represent each candidate with visual features and its similarity to the text.
3. Retrieve marginal samples that sit near the decision boundary: project to a low dimension with PCA, estimate density, and select the uncertain band.
4. Cluster this band and pick one item per cluster for diversity.
5. Have a user label about 30 crops as positive or negative.
6. Optionally rebalance with SMOTE or SVM-SMOTE if the labels are skewed.
7. Train a small classifier, for example an RBF SVM or a two-layer MLP, to accept or reject the original proposals.

The base detector stays frozen, so recall and generalization are preserved, and the refiner learns the exact semantics the user meant. A rough sketch of this selection-and-refinement loop appears below.
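Since the paper's code is not reproduced here, the following is a hedged sketch of a FLAME-style refiner under stated assumptions: candidate features arrive as a NumPy array, density is estimated with scikit-learn's KernelDensity, the SMOTE rebalancing step is omitted, and the user-labeling step is stubbed out. Thresholds, dimensions, and cluster counts are illustrative, not the paper's values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def flame_refiner(features, n_labels=30, seed=0):
    """One-step active learning over frozen-detector proposals.

    features: (N, D) array of candidate-box embeddings (e.g. visual features
    plus text similarity). Returns a trained accept/reject classifier.
    """
    rng = np.random.default_rng(seed)

    # 1) Low-dimensional projection with PCA before density estimation.
    z = PCA(n_components=8, random_state=seed).fit_transform(features)

    # 2) Density estimate; mid-density points are treated as the uncertain
    #    band near the decision boundary (illustrative heuristic).
    logp = KernelDensity(bandwidth=1.0).fit(z).score_samples(z)
    lo, hi = np.quantile(logp, [0.3, 0.7])
    band = np.where((logp >= lo) & (logp <= hi))[0]

    # 3) Cluster the band and take one sample per cluster for diversity.
    k = min(n_labels, len(band))
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(z[band])
    picks = np.unique([band[np.argmin(np.linalg.norm(z[band] - c, axis=1))]
                       for c in km.cluster_centers_])

    # 4) User labels ~30 crops; stubbed with random labels here.
    y = rng.integers(0, 2, size=len(picks))  # replace with real annotations

    # 5) Small RBF-SVM refiner accepts or rejects the original proposals.
    return SVC(kernel="rbf", probability=True).fit(features[picks], y)

# Usage: keep proposals the refiner scores above 0.5.
feats = np.random.default_rng(0).normal(size=(500, 32))
refiner = flame_refiner(feats)
keep = refiner.predict_proba(feats)[:, 1] > 0.5
```

Because only the tiny classifier is trained, the loop stays CPU-friendly, which is what enables the minute-scale, user-in-the-loop adaptation the paper emphasizes.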
Datasets, base models, and setup

Evaluation uses two standard remote sensing detection benchmarks. DOTA has oriented boxes over 15 categories in high-resolution aerial images; DIOR has 23,463 images and 192,472 instances over 20 categories. The comparison includes a zero-shot OWL-ViT v2 baseline, a zero-shot RS OWL-ViT v2 fine-tuned on RS-WebLI, and several few-shot baselines. RS OWL-ViT v2 improves zero-shot mean AP to 31.827 percent on DOTA and 29.387 percent on DIOR, which becomes the starting point for FLAME.

Understanding the Results

On 30-shot adaptation, FLAME cascaded on RS OWL-ViT v2 reaches 53.96 percent AP on DOTA and 53.21 percent AP on DIOR, the top accuracy among the listed methods. The comparison includes SIoU, a prototype-based method with DINOv2, and a few-shot method proposed by the research team; these numbers appear in Table 1. The research team also reports the per-class breakdown in Table 2. On DIOR, the chimney class improves from 0.11 in zero-shot to 0.94 after FLAME, which illustrates how the refiner removes look-alike false positives from the open-vocabulary proposals.

Key Takeaways

FLAME is a one-step active learning cascade over OWL-ViT v2: it retrieves marginal samples using density estimation, enforces diversity with clustering, collects about 30 labels, and trains a lightweight refiner such as an RBF SVM or a small MLP, with no base-model fine-tuning. With 30 shots, FLAME on RS OWL-ViT v2 reaches 53.96% AP on DOTA and 53.21% AP on DIOR, exceeding prior few-shot baselines including SIoU and a prototype method with DINOv2. On DIOR, the chimney class improves from 0.11 in zero-shot to 0.94 after FLAME, showing strong filtering of look-alike false positives. Adaptation runs in about one minute per label on a standard CPU, supporting near real-time, user-in-the-loop specialization. Zero-shot OWL-ViT v2 starts at 13.774% AP on DOTA and 14.982% on DIOR; RS OWL-ViT v2 raises zero-shot AP to 31.827% and 29.387% respectively, and FLAME then delivers the large precision gains on top.

Editorial Comments

FLAME layers a tiny refiner on top of OWL-ViT v2, selecting marginal detections, collecting about 30 labels, and training a small classifier without touching the base model. On DOTA and DIOR, FLAME with RS OWL-ViT v2 reports 53.96 percent and 53.21 percent AP, establishing a strong few-shot baseline, and DIOR chimney average precision rises from 0.11 to 0.94 after refinement, illustrating false-positive suppression. Adaptation runs in about one minute per label on a CPU, enabling interactive specialization. OWLv2 and RS-WebLI provide the foundation for the zero-shot proposals. Overall, FLAME demonstrates a practical path to open-vocabulary detection specialization in remote sensing by pairing RS OWL-ViT v2 proposals with a minute-scale CPU refiner.

OpenAI launches company knowledge in ChatGPT, letting you access your firm’s data from Google Drive, Slack, GitHub

Is the Google Search for internal enterprise knowledge finally here, but from OpenAI? It certainly seems that way. Today, OpenAI launched company knowledge in ChatGPT, a major new capability for subscribers to ChatGPT's paid Business, Enterprise, and Edu plans that lets them call up their company's data directly from third-party workplace apps, including Slack, SharePoint, Google Drive, Gmail, GitHub, and HubSpot, and combine it in ChatGPT outputs. As OpenAI's CEO of Applications Fidji Simo put it in a post on the social network X: "it brings all the context from your apps (Slack, Google Drive, GitHub, etc) together in ChatGPT so you can get answers that are specific to your business."

Intriguingly, OpenAI's blog post on the feature states that it is "powered by a version of GPT‑5 that's trained to look across multiple sources to give more comprehensive and accurate answers," which sounds to me like a new fine-tuned version of the model family the company released back in August, though there are no additional details on how it was trained.

Nonetheless, company knowledge in ChatGPT is rolling out globally and is designed to make ChatGPT a central point of access for verified organizational information, supported by secure integrations and enterprise-grade compliance controls, giving employees much faster access to their company's information while working. Now, instead of toggling over to Slack to find an assignment and its instructions, or tabbing over to Google Drive and opening specific files to find the names and numbers you need, ChatGPT can deliver that information directly into your chat session, if your company enables the proper connections.

As OpenAI Chief Operating Officer Brad Lightcap wrote in a post on the social network X: "company knowledge has changed how i use chatgpt at work more than anything we have built so far – let us know what you think!" It builds upon the third-party app connectors unveiled back in August 2025, though those were only for individual users on ChatGPT Plus plans.

Connecting ChatGPT to Workplace Systems

Enterprise teams often face the challenge of fragmented data across internal tools: email, chat, file storage, project management, and customer platforms. Company knowledge bridges those silos by enabling ChatGPT to connect to approved systems such as Slack, Google Drive, and GitHub, among other supported apps, through enterprise-managed connectors. Each response generated with company knowledge includes citations and direct links to the original sources, allowing teams to verify where specific details originated. This transparency helps organizations maintain data trustworthiness while increasing productivity. OpenAI confirms that company knowledge uses a version of GPT-5 optimized for multi-source reasoning and cross-system synthesis, providing detailed, contextually accurate results even across disparate sources.

Built for Enterprise Control and Security

Company knowledge was designed from the ground up for enterprise governance and compliance. It respects existing permissions within connected apps, so ChatGPT can only access what a user is already authorized to view, and it never trains on company data by default. Security features include industry-standard encryption, support for SSO and SCIM for account provisioning, and IP allowlisting to restrict access to approved corporate networks. Enterprise administrators can also define role-based access control (RBAC) policies and manage permissions at a group or department level.
OpenAI's Enterprise Compliance API provides a full audit trail, allowing administrators to review conversation logs for reporting and regulatory purposes. This capability helps enterprises meet internal governance standards and industry-specific requirements such as SOC 2 and ISO 27001 compliance.

Admin Configuration and Connector Management

For enterprise deployment, administrators must enable company knowledge and its connectors within the ChatGPT workspace. Once connectors are active, users can authenticate their own accounts for each work app they need to access. In Enterprise and Edu plans, connectors are off by default and require explicit admin approval before employees can use them. Admins can selectively enable connectors, manage access by role, and require SSO-based authentication for enhanced control. Business plan users, by contrast, have connectors enabled automatically if available in their workspace; admins can still oversee which connectors are approved, ensuring alignment with internal IT and data policies. Company knowledge becomes available to any user with at least one active connector, and admins can configure group-level permissions for different teams, such as restricting GitHub access to engineering while enabling Google Drive or HubSpot for marketing and sales.

How Company Knowledge Works in Practice

Activating company knowledge is straightforward. Users start a new or existing conversation in ChatGPT and select "Company knowledge" under the message composer or from the tools menu. After authenticating their connected apps, they can ask questions as usual, such as "Summarize this account's latest feedback and risks" or "Compile a Q4 performance summary from project trackers." ChatGPT searches across the connected tools, retrieves relevant context, and produces an answer with full citations and source links. The system can combine data across apps, for instance blending Slack updates, Google Docs notes, and HubSpot CRM records, to create an integrated view of a project, client, or initiative. When company knowledge is not selected, ChatGPT may still use connectors in a limited capacity as part of the default experience, but responses will not include detailed citations or multi-source synthesis.

Advanced Use Cases for Enterprise Teams

For development and operations leaders, company knowledge can act as a centralized intelligence layer that surfaces real-time updates and dependencies across complex workflows. ChatGPT can, for example, summarize open GitHub pull requests, highlight unresolved Linear tickets, and cross-reference Slack engineering discussions, all in a single output. Technical teams can also use it for incident retrospectives or release planning by pulling relevant information from issue trackers, logs, and meeting notes. Procurement or finance leaders can use it to consolidate purchase requests or budget updates across shared drives and internal communications. Because the model can reference structured and unstructured data simultaneously, it supports wide-ranging scenarios, from compliance documentation reviews to cross-departmental performance summaries.

Privacy, Data Residency, and Compliance

Enterprise data protection is a central design element of company knowledge. ChatGPT processes data in line with OpenAI's

Unlearned but Not Forgotten: Data Extraction after Exact Unlearning in LLM

arXiv:2505.24379v3 Announce Type: replace-cross Abstract: Large Language Models are typically trained on datasets collected from the web, which may inadvertently contain harmful or sensitive personal information. To address growing privacy concerns, unlearning methods have been proposed to remove the influence of specific data from trained models. Of these, exact unlearning — which retrains the model from scratch without the target data — is widely regarded as the gold standard for mitigating privacy risks in deployment. In this paper, we revisit this assumption in a practical deployment setting where both the pre- and post-unlearning logits API are exposed, such as in open-weight scenarios. Targeting this setting, we introduce a novel data extraction attack that leverages signals from the pre-unlearning model to guide the post-unlearning model, uncovering patterns that reflect the removed data distribution. Combining model guidance with a token filtering strategy, our attack significantly improves extraction success rates — doubling performance in some cases — across common benchmarks such as MUSE, TOFU, and WMDP. Furthermore, we demonstrate our attack's effectiveness on a simulated medical diagnosis dataset to highlight real-world privacy risks associated with exact unlearning. In light of our findings, which suggest that unlearning may, in a contradictory way, increase the risk of privacy leakage during real-world deployments, we advocate for evaluation of unlearning methods to consider broader threat models that account not only for post-unlearning models but also for adversarial access to prior checkpoints. Code is publicly available at: https://github.com/Nicholas0228/unlearned_data_extraction_llm.
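The abstract does not give the guidance formula, so the following is a speculative sketch of what logit-guided extraction could look like under assumed mechanics: candidate tokens are decoded from the post-unlearning model while scores are nudged toward the pre-unlearning model's logits, with a filter that drops tokens the prior checkpoint considered unlikely. The checkpoint names and the mixing weight alpha are placeholders, not the paper's setup; see the linked repository for the actual attack.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints standing in for pre-/post-unlearning models
# (both share the GPT-2 vocabulary, so their logits are comparable).
tok = AutoTokenizer.from_pretrained("gpt2")
pre = AutoModelForCausalLM.from_pretrained("gpt2")        # pre-unlearning
post = AutoModelForCausalLM.from_pretrained("distilgpt2")  # post-unlearning

@torch.no_grad()
def guided_extract(prompt: str, steps: int = 30, alpha: float = 0.5,
                   top_k: int = 50) -> str:
    """Greedy decode from the post-unlearning model, nudged toward tokens
    the pre-unlearning checkpoint still prefers (assumed guidance rule)."""
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(steps):
        lp_post = post(ids).logits[0, -1]
        lp_pre = pre(ids).logits[0, -1]
        # Guidance: interpolate toward the pre-unlearning distribution.
        scores = lp_post + alpha * (lp_pre - lp_post)
        # Token filtering: restrict to the prior checkpoint's top-k tokens.
        keep = torch.topk(lp_pre, top_k).indices
        masked = torch.full_like(scores, float("-inf"))
        masked[keep] = scores[keep]
        nxt = masked.argmax().reshape(1, 1)
        ids = torch.cat([ids, nxt], dim=1)
    return tok.decode(ids[0], skip_special_tokens=True)

print(guided_extract("The patient record for"))
```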

Interpretable Question Answering with Knowledge Graphs

arXiv:2510.19181v1 Announce Type: new Abstract: This paper presents a question answering system that operates exclusively on knowledge graph retrieval, without relying on retrieval augmented generation (RAG) with large language models (LLMs). Instead, a small paraphraser model is used to paraphrase the entity relationship edges retrieved from querying the knowledge graph. The proposed pipeline is divided into two main stages. The first stage involves pre-processing a document to generate sets of question-answer (QA) pairs. The second stage converts these QAs into a knowledge graph from which graph-based retrieval is performed using embeddings and fuzzy techniques. The graph is queried, re-ranked, and paraphrased to generate a final answer. This work includes an evaluation using LLM-as-a-judge on the CRAG benchmark, which resulted in accuracies of 71.9% and 54.4% using LLAMA-3.2 and GPT-3.5-Turbo, respectively.
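As a rough illustration of the retrieval stage only, here is a sketch that scores knowledge-graph edges by combining embedding similarity with a fuzzy string ratio. The toy triples, the sentence-transformers model, and the equal 0.5/0.5 weighting are assumptions for illustration, not the paper's configuration.

```python
from difflib import SequenceMatcher
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy knowledge graph: (head, relation, tail) edges built from QA pairs.
edges = [
    ("Marie Curie", "won", "Nobel Prize in Physics"),
    ("Marie Curie", "born in", "Warsaw"),
    ("Pierre Curie", "married", "Marie Curie"),
]
texts = [f"{h} {r} {t}" for h, r, t in edges]

model = SentenceTransformer("all-MiniLM-L6-v2")
edge_vecs = model.encode(texts, normalize_embeddings=True)

def retrieve(question: str, k: int = 2, w_embed: float = 0.5,
             w_fuzzy: float = 0.5):
    """Rank edges by a weighted mix of cosine similarity and fuzzy overlap."""
    q_vec = model.encode([question], normalize_embeddings=True)[0]
    cos = edge_vecs @ q_vec  # cosine similarity (vectors are normalized)
    fuzzy = np.array([SequenceMatcher(None, question.lower(),
                                      t.lower()).ratio() for t in texts])
    scores = w_embed * cos + w_fuzzy * fuzzy
    order = np.argsort(-scores)[:k]
    return [(texts[i], float(scores[i])) for i in order]

# The top edges would then be re-ranked and handed to the small paraphraser.
print(retrieve("Where was Marie Curie born?"))
```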

The Coverage Principle: How Pre-Training Enables Post-Training

arXiv:2510.15020v2 Announce Type: replace-cross Abstract: Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model remains poorly understood. Notably, although pre-training success is often quantified by cross-entropy loss, cross-entropy can be a poor predictor of downstream performance. Instead, we provide a theoretical perspective on this relationship through the lens of coverage, which quantifies the probability mass the pre-trained model places on high-quality responses and which is necessary and sufficient for post-training and test-time scaling methods such as Best-of-N to succeed. Our main results develop an understanding of the coverage principle, a phenomenon whereby next-token prediction (more generally, maximum likelihood) implicitly optimizes toward a model with good coverage. In particular, we uncover a mechanism that explains the power of coverage in predicting downstream performance: coverage generalizes faster than cross-entropy, avoiding spurious dependence on problem-dependent parameters such as the sequence length. We also study practical algorithmic interventions with provable benefits for improving coverage, including (i) model/checkpoint selection procedures, (ii) gradient normalization schemes, and (iii) test-time decoding strategies.
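To make the coverage and Best-of-N connection concrete, here is a small numeric illustration under the standard independence assumption (not code from the paper): if the pre-trained model places probability mass p on high-quality responses, then at least one of N independent samples is high quality with probability 1 - (1 - p)^N.

```python
# Best-of-N success under coverage p: 1 - (1 - p)^N, assuming independent
# samples and a selector that always picks a good response when one exists.
def best_of_n_success(p: float, n: int) -> float:
    return 1.0 - (1.0 - p) ** n

for p in (0.01, 0.05, 0.2):
    print(f"coverage={p}:",
          [round(best_of_n_success(p, n), 3) for n in (1, 8, 64)])
```

Even small coverage compounds quickly with N, which is why coverage, rather than cross-entropy alone, governs whether test-time scaling methods like Best-of-N can succeed.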

When Facts Change: Probing LLMs on Evolving Knowledge with evolveQA

arXiv:2510.19172v1 Announce Type: new Abstract: LLMs often fail to handle temporal knowledge conflicts–contradictions arising when facts evolve over time within their training data. Existing studies evaluate this phenomenon through benchmarks built on structured knowledge bases like Wikidata, but they focus on widely-covered, easily-memorized popular entities and lack the dynamic structure needed to fairly evaluate LLMs with different knowledge cut-off dates. We introduce evolveQA, a benchmark specifically designed to evaluate LLMs on temporally evolving knowledge, constructed from 3 real-world, time-stamped corpora: AWS updates, Azure changes, and WHO disease outbreak reports. Our framework identifies naturally occurring knowledge evolution and generates questions with gold answers tailored to different LLM knowledge cut-off dates. Through extensive evaluation of 12 open and closed-source LLMs across 3 knowledge probing formats, we demonstrate significant performance drops of up to 31% on evolveQA compared to static knowledge questions.
