YouZum


Enhancing multimodal analogical reasoning with Logic Augmented Generation

arXiv:2504.11190v2 Announce Type: replace-cross Abstract: Recent advances in Large Language Models have demonstrated their capabilities across a variety of tasks. However, automatically extracting implicit knowledge from natural language remains a significant challenge, as machines lack active experience with the physical world. Given this scenario, semantic knowledge graphs can serve as conceptual spaces that guide the automated text generation reasoning process to achieve more efficient and explainable results. In this paper, we apply a logic-augmented generation (LAG) framework that leverages the explicit representation of a text through a semantic knowledge graph and applies it in combination with prompt heuristics to elicit implicit analogical connections. This method generates extended knowledge graph triples representing implicit meaning, enabling systems to reason on unlabeled multimodal data regardless of the domain. We validate our work through three metaphor detection and understanding tasks across four datasets, as they require deep analogical reasoning capabilities. The results show that this integrated approach surpasses current baselines, performs better than humans in understanding visual metaphors, and enables more explainable reasoning processes, though it still has inherent limitations in metaphor understanding, especially for domain-specific metaphors. Furthermore, we provide a thorough error analysis, discussing issues with metaphorical annotations and current evaluation methods.
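To make the pipeline concrete, here is a minimal, hedged sketch of the core idea (extend explicit knowledge-graph triples with implicit analogical ones via an LLM); the triple format, prompt wording, and the `complete` helper are illustrative assumptions, not the paper's exact heuristics.

```python
# Minimal sketch of the LAG idea. Assumptions: the triple format and prompt wording are
# illustrative only; `complete` is a placeholder for any text-generation model call.

def complete(prompt: str) -> str:
    """Placeholder for a call to any LLM."""
    raise NotImplementedError

def lag_extend(triples: list[tuple[str, str, str]], text: str) -> str:
    graph = "\n".join(f"({s}, {p}, {o})" for s, p, o in triples)
    prompt = (
        "The following knowledge-graph triples were extracted from a text.\n"
        f"Text: {text}\nTriples:\n{graph}\n"
        "Propose additional triples that make the implicit analogical (metaphorical) "
        "meaning explicit, one per line in (subject, relation, object) form."
    )
    return complete(prompt)

# Example: explicit seed triples for "Her words were daggers."
seed = [("words", "compared_to", "daggers"), ("daggers", "cause", "harm")]
# lag_extend(seed, "Her words were daggers.")  # would return model-proposed implicit triples
```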


Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache

arXiv:2506.11886v1 Announce Type: new Abstract: Large Language Models struggle with memory demands from the growing Key-Value (KV) cache as context lengths increase. Existing compression methods homogenize head dimensions or rely on attention-guided token pruning, often sacrificing accuracy or introducing computational overhead. We propose FourierAttention, a training-free framework that exploits the heterogeneous roles of transformer head dimensions: lower dimensions prioritize local context, while upper ones capture long-range dependencies. By projecting the long-context-insensitive dimensions onto orthogonal Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. Evaluations on LLaMA models show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack (NIAH). In addition, a custom Triton kernel, FlashFourierAttention, is designed to optimize memory via streamlined read-write operations, enabling efficient deployment without performance compromise.
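As a rough numerical illustration of the spectral-compression idea in the abstract (our interpretation, not the paper's implementation), the sketch below approximates how a slowly varying slice of a KV cache evolves over time using a fixed number of Fourier coefficients:

```python
# Toy numpy sketch: keep only k spectral coefficients along the sequence axis for the
# chosen "long-context-insensitive" dimensions, then reconstruct an approximation.
import numpy as np

def compress(kv_slice: np.ndarray, k: int) -> np.ndarray:
    """kv_slice: (seq_len, dim) values for the selected dimensions; keep k coefficients."""
    coeffs = np.fft.rfft(kv_slice, axis=0)           # (seq_len // 2 + 1, dim), complex
    return coeffs[:k]                                # fixed-length spectral representation

def reconstruct(coeffs: np.ndarray, seq_len: int) -> np.ndarray:
    full = np.zeros((seq_len // 2 + 1, coeffs.shape[1]), dtype=complex)
    full[: coeffs.shape[0]] = coeffs                 # discarded frequencies stay zero
    return np.fft.irfft(full, n=seq_len, axis=0)

seq_len, dim, k = 4096, 16, 64                       # 64 coefficients instead of 4096 rows
kv = np.cumsum(np.random.randn(seq_len, dim) * 0.01, axis=0)  # slowly varying toy signal
approx = reconstruct(compress(kv, k), seq_len)
print("relative error:", np.linalg.norm(kv - approx) / np.linalg.norm(kv))
```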


LLM-as-a-Judge for Reference-less Automatic Code Validation and Refinement for Natural Language to Bash in IT Automation

arXiv:2506.11237v1 Announce Type: cross Abstract: To automatically evaluate and select the best model and improve code quality for automatic incident remediation in IT Automation, it is crucial to verify whether the generated remediation code is syntactically and semantically correct and whether it executes as intended. There are three approaches: 1) conventional methods use surface-form similarity metrics (token match, exact match, etc.), which have numerous limitations; 2) execution-based evaluation focuses more on code functionality, based on pass/fail judgments for given test cases; and 3) LLM-as-a-Judge employs LLMs for automated evaluation, judging whether an answer is correct for a given problem based on pre-defined metrics. In this work, we focus on enhancing LLM-as-a-Judge with bidirectional functionality matching and logic representation for reference-less automatic validation and refinement of Bash code generation, in order to select the best model for automatic incident remediation in IT Automation. We use execution-based evaluation as ground truth to evaluate our LLM-as-a-Judge metrics. Results show high accuracy and agreement with execution-based evaluation (up to 8% above the baseline). Finally, we built Reflection code agents that use the judgments and feedback from our evaluation metrics, achieving a significant improvement (up to a 24% increase in accuracy) in automatic code refinement.
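A minimal sketch of what "bidirectional functionality matching" with an LLM judge could look like, based only on the abstract; the prompts and the `judge` helper are placeholders rather than the paper's actual setup:

```python
# Hedged sketch: check forward (does the Bash implement the instruction?) and backward
# (does a description of the Bash match the intended task?). Prompts are illustrative.

def judge(prompt: str) -> str:
    """Placeholder for any LLM call; expected to answer 'yes'/'no' or give a description."""
    raise NotImplementedError

def validate_bash(instruction: str, bash_code: str) -> bool:
    forward = judge(
        f"Instruction: {instruction}\nBash:\n{bash_code}\n"
        "Does this command fully and correctly implement the instruction? Answer yes or no."
    )
    back_description = judge(
        f"Describe in one sentence what this Bash command does:\n{bash_code}"
    )
    backward = judge(
        f"Intended task: {instruction}\nObserved behavior: {back_description}\n"
        "Do these describe the same functionality? Answer yes or no."
    )
    return forward.lower().startswith("yes") and backward.lower().startswith("yes")
```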


Evaluating and Improving Robustness in Large Language Models: A Survey and Future Directions

arXiv:2506.11111v1 Announce Type: new Abstract: Large Language Models (LLMs) have gained enormous attention in recent years due to their capability to understand and generate natural language. With their rapid development and wide range of applications (e.g., agents, embodied intelligence), the robustness of LLMs has received increasing attention. As the core of many AI applications, LLM robustness requires that models not only generate consistent content but also ensure the correctness and stability of that content when dealing with unexpected application scenarios (e.g., toxic prompts, limited noisy domain data, out-of-distribution (OOD) applications, etc.). In this survey paper, we conduct a thorough review of the robustness of LLMs, aiming to provide comprehensive terminology for the concepts and methods in this field and to support the community. Specifically, we first give a formal definition of LLM robustness and present the collection protocol of this survey. Then, based on the types of perturbed inputs, we organize the survey from the following perspectives: 1) Adversarial Robustness: tackling prompts that are manipulated intentionally, such as noisy prompts, long contexts, data attacks, etc.; 2) OOD Robustness: dealing with unexpected real-world application scenarios, such as OOD detection, zero-shot transfer, hallucinations, etc.; 3) Evaluation of Robustness: summarizing new evaluation datasets, metrics, and tools for verifying the robustness of LLMs. After reviewing the representative work from each perspective, we discuss and highlight future opportunities and research directions in this field. We also organize related works and provide an easy-to-search project (https://github.com/zhangkunzk/Awesome-LLM-Robustness-papers) to support the community.
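As a tiny illustration of one evaluation pattern the survey covers (adversarial robustness to noisy prompts), the sketch below perturbs a prompt with character-level noise and measures answer consistency; the noise model and the `ask` callable are simplistic stand-ins, not anything proposed in the paper:

```python
# Toy robustness probe: inject random character noise into a prompt and check how often
# the model's answer stays the same as on the clean prompt.
import random

def perturb(prompt: str, rate: float = 0.05, seed: int = 0) -> str:
    rng = random.Random(seed)
    chars = list(prompt)
    for i in range(len(chars)):
        if chars[i].isalpha() and rng.random() < rate:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

def consistency(ask, prompt: str, n: int = 5) -> float:
    """`ask` is any callable that maps a prompt string to an answer string."""
    reference = ask(prompt)
    return sum(ask(perturb(prompt, seed=i)) == reference for i in range(n)) / n
```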


Sakana AI Introduces Text-to-LoRA (T2L): A Hypernetwork that Generates Task-Specific LLM Adapters (LoRAs) based on a Text Description of the Task

Transformer models have significantly influenced how AI systems approach tasks in natural language understanding, translation, and reasoning. These large-scale models, particularly large language models (LLMs), have grown in size and complexity to the point where they encompass broad capabilities across various domains. However, applying these models to new, specialized tasks remains a complex operation. Each new application typically demands careful dataset selection, hours of fine-tuning, and a high degree of computational power. Although these models offer a strong foundation in knowledge, their rigidity in handling new domains with minimal data remains a core limitation. As researchers aim to bring AI closer to human-like adaptability, the focus has shifted toward more efficient methods that allow such models to modify their behavior without retraining every parameter.

The Challenge of Customizing LLMs for New Tasks

The central difficulty lies in adapting foundation models to unique applications without repeating costly and time-intensive training cycles. Most solutions today rely on creating new adapters for each task, which are separate components trained to steer the model’s behavior. These adapters must be made from scratch for every task, and any benefits learned from one application often cannot be transferred to another. This adaptation process is time-consuming and lacks scalability. Moreover, tuning models on specific datasets usually requires a high level of precision in hyperparameter choices, and failing to find the right configuration can lead to poor results. Even when adaptation is successful, the result is often a large collection of isolated task-specific components that are not easy to integrate or reuse.

In response to these limitations, researchers have adopted Low-Rank Adaptation (LoRA), a technique that modifies only a small set of parameters rather than the entire model. LoRA injects low-rank matrices into specific layers of a frozen LLM, allowing the base weights to remain unchanged while enabling task-specific customization. This method reduces the number of trainable parameters. However, for each task, a new LoRA adapter still needs to be trained from scratch. While more efficient than full fine-tuning, this method does not allow for fast, on-the-fly adaptation. Recent advancements have attempted to compress these adapters further or combine multiple adapters during inference; however, they still rely heavily on prior training and cannot generate new adapters dynamically.

Introducing Text-to-LoRA: Instant Adapter Generation from Task Descriptions

Researchers at Sakana AI introduced Text-to-LoRA (T2L), designed to instantly generate task-specific LoRA adapters from textual descriptions of the target task, instead of creating and training new adapters for each task. T2L functions as a hypernetwork capable of outputting adapter weights in a single forward pass. It learns from a library of pre-existing LoRA adapters covering various domains, including GSM8K, Arc-challenge, BoolQ, and others. Once trained, T2L can interpret a task’s description and generate the required adapter without additional training. This ability not only eliminates the need for manual adapter generation but also enables the system to generalize to tasks it has never encountered before. The T2L architecture uses a combination of module-specific and layer-specific embeddings to guide the generation process.
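A minimal, hedged sketch of that hypernetwork idea follows: map a task embedding together with learned layer and module embeddings to the low-rank matrices A and B of a LoRA adapter. The shapes, layer counts, and concatenation scheme here are illustrative assumptions, not Sakana AI's actual architecture.

```python
# Sketch of a Text-to-LoRA-style hypernetwork: one forward pass per (layer, module) slot
# emits the LoRA matrices A and B; delta_W = B @ A is then added to the frozen weight.
import torch
import torch.nn as nn

class TextToLoRASketch(nn.Module):
    def __init__(self, task_dim=1024, emb_dim=64, hidden=256, d_model=2048, rank=8,
                 n_layers=32, target_modules=("q_proj", "v_proj")):
        super().__init__()
        self.layer_emb = nn.Embedding(n_layers, emb_dim)          # layer-specific embedding
        self.module_emb = nn.Embedding(len(target_modules), emb_dim)  # module-specific embedding
        self.target_modules = list(target_modules)
        self.mlp = nn.Sequential(
            nn.Linear(task_dim + 2 * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, d_model * rank * 2),                # emit A and B together
        )
        self.d_model, self.rank = d_model, rank

    def forward(self, task_embedding, layer_idx, module_name):
        l = torch.tensor([layer_idx])
        m = torch.tensor([self.target_modules.index(module_name)])
        h = torch.cat([task_embedding, self.layer_emb(l), self.module_emb(m)], dim=-1)
        flat = self.mlp(h)
        A, B = flat.split(self.d_model * self.rank, dim=-1)
        return A.view(self.rank, self.d_model), B.view(self.d_model, self.rank)

hyper = TextToLoRASketch()
task_vec = torch.randn(1, 1024)   # stands in for an encoded task description
A, B = hyper(task_vec, layer_idx=0, module_name="q_proj")
```

Iterating this over every targeted layer and module yields a full adapter in one shot, which is what lets a single trained hypernetwork stand in for many individually trained LoRAs.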
Three architectural variants were tested: a large version with 55 million parameters, a medium version with 34 million, and a small version with just 5 million. Despite their differences in size, all models were capable of generating the necessary low-rank matrices for adapter functionality. The training utilized the Super Natural Instructions dataset across 479 tasks, with each task described in natural language and encoded into vector form. By merging these descriptions with learned layer and module embeddings, T2L creates the low-rank A and B matrices needed by the adapter. This allows one model to replace hundreds of hand-crafted LoRAs, producing consistent results with a much smaller computational footprint.

Benchmark Performance and Scalability of T2L

On benchmarks such as Arc-easy and GSM8K, T2L matched or surpassed the performance of task-specific LoRAs. For instance, the accuracy on Arc-easy using T2L was 76.6%, matching the accuracy of the best manually tuned adapter. On BoolQ, it reached 89.9%, slightly outperforming the original adapter. Even on more difficult benchmarks like PIQA and Winogrande, where overfitting typically hurts performance, T2L delivered better results than manually trained adapters. These improvements are believed to stem from the lossy compression inherent in hypernetwork training, which acts as a form of regularization. When the number of training datasets was increased from 16 to 479, zero-shot performance improved substantially, showing T2L’s capability to generalize with broader exposure during training.

Several key takeaways from the research:

- T2L allows instant adaptation of LLMs using only natural language descriptions.
- It supports zero-shot generalization to tasks not seen during training.
- Three architectural variants of T2L were tested, with parameter counts of 55M, 34M, and 5M.
- Benchmarks include ArcE, BoolQ, GSM8K, Hellaswag, PIQA, MBPP, and more.
- T2L achieved benchmark accuracies of 76.6% (ArcE), 89.9% (BoolQ), and 92.6% (Hellaswag).
- It matched or exceeded manually trained LoRAs in performance on multiple tasks.
- T2L was trained using 479 tasks from the Super Natural Instructions dataset.
- T2L uses the gte-large-en-v1.5 model for generating task embeddings.
- LoRA adapters produced by T2L target only the query and value projections in attention blocks, totaling 3.4M parameters.
- Performance remained consistent even with higher reconstruction loss, showing resilience to compression.

In conclusion, this research highlights a major step forward in flexible and efficient model adaptation. Instead of relying on repetitive, resource-heavy procedures, T2L uses natural language itself as a control mechanism, enabling models to specialize using simple task descriptions. This capability dramatically reduces the time and cost required to adapt LLMs to new domains. Moreover, it suggests that, as long as enough prior adapters are available for training, future models could adapt in seconds to any task described in plain English. The use of hypernetworks to dynamically construct adapters also means less storage is needed for model specialization, further increasing the practicality of this method in production environments.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. The post Sakana AI Introduces Text-to-LoRA (T2L): A Hypernetwork that Generates Task-Specific LLM Adapters (LoRAs) based on a Text Description of the Task appeared first on MarkTechPost.


MemOS: A Memory-Centric Operating System for Evolving and Adaptive Large Language Models

LLMs are increasingly seen as key to achieving Artificial General Intelligence (AGI), but they face major limitations in how they handle memory. Most LLMs rely on fixed knowledge stored in their weights and short-lived context during use, making it hard to retain or update information over time. Techniques like RAG attempt to incorporate external knowledge but lack structured memory management. This leads to problems such as forgetting past conversations, poor adaptability, and isolated memory across platforms. Fundamentally, today’s LLMs do not treat memory as a manageable, persistent, or sharable system, limiting their real-world usefulness.

To address the limitations of memory in current LLMs, researchers from MemTensor (Shanghai) Technology Co., Ltd., Shanghai Jiao Tong University, Renmin University of China, and the Research Institute of China Telecom have developed MemOS. This memory operating system makes memory a first-class resource in language models. At its core is MemCube, a unified memory abstraction that manages parametric, activation, and plaintext memory. MemOS enables structured, traceable, and cross-task memory handling, allowing models to adapt continuously, internalize user preferences, and maintain behavioral consistency. This shift transforms LLMs from passive generators into evolving systems capable of long-term learning and cross-platform coordination.

As AI systems grow more complex, handling multiple tasks, roles, and data types, language models must evolve beyond understanding text to also retaining memory and learning continuously. Current LLMs lack structured memory management, which limits their ability to adapt and grow over time. MemOS treats memory as a core, schedulable resource: it enables long-term learning through structured storage, version control, and unified memory access. Unlike traditional training, MemOS supports a continuous “memory training” paradigm that blurs the line between learning and inference. It also emphasizes governance, ensuring traceability, access control, and safe use in evolving AI systems.

MemOS is a memory-centric operating system for language models that treats memory not just as stored data but as an active, evolving component of the model’s cognition. It organizes memory into three distinct types: Parametric Memory (knowledge baked into model weights via pretraining or fine-tuning), Activation Memory (temporary internal states, such as KV caches and attention patterns, used during inference), and Plaintext Memory (editable, retrievable external data, such as documents or prompts). These memory types interact within a unified framework called the MemoryCube (MemCube), which encapsulates both content and metadata, allowing dynamic scheduling, versioning, access control, and transformation across types. This structured system enables LLMs to adapt, recall relevant information, and efficiently evolve their capabilities, transforming them into more than just static generators.

At the core of MemOS is a three-layer architecture: the Interface Layer handles user inputs and parses them into memory-related tasks; the Operation Layer manages the scheduling, organization, and evolution of the different types of memory; and the Infrastructure Layer ensures safe storage, access governance, and cross-agent collaboration. All interactions within the system are mediated through MemCubes, allowing traceable, policy-driven memory operations.
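To ground the MemCube abstraction described above, here is an illustrative sketch of what such a unified memory unit might look like; the field names, access-control check, and update semantics are our assumptions for exposition, not the paper's actual schema.

```python
# Illustrative MemCube-style memory unit: typed content plus metadata for versioning,
# provenance, and a simple access-control list.
from dataclasses import dataclass, field
from enum import Enum
from typing import Any
import time

class MemoryType(Enum):
    PARAMETRIC = "parametric"   # knowledge in model weights
    ACTIVATION = "activation"   # KV caches / runtime states
    PLAINTEXT = "plaintext"     # editable external documents, prompts

@dataclass
class MemCube:
    content: Any
    mem_type: MemoryType
    owner: str
    version: int = 1
    created_at: float = field(default_factory=time.time)
    readers: set = field(default_factory=set)       # agents allowed to read this cube

    def readable_by(self, agent: str) -> bool:
        return agent == self.owner or agent in self.readers

    def update(self, new_content: Any) -> "MemCube":
        # versioned, traceable update rather than in-place mutation
        return MemCube(new_content, self.mem_type, self.owner,
                       self.version + 1, time.time(), set(self.readers))

store = [MemCube("User prefers concise answers.", MemoryType.PLAINTEXT, owner="assistant")]
print(store[0].readable_by("assistant"), store[0].update("User prefers bullet lists.").version)
```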
Through modules like MemScheduler, MemLifecycle, and MemGovernance, MemOS maintains a continuous and adaptive memory loop, from the moment a user sends a prompt, to memory injection during reasoning, to storing useful data for future use. This design not only enhances the model’s responsiveness and personalization but also ensures that memory remains structured, secure, and reusable.

In conclusion, MemOS is a memory operating system designed to make memory a central, manageable component in LLMs. Unlike traditional models that depend mostly on static model weights and short-term runtime states, MemOS introduces a unified framework for handling parametric, activation, and plaintext memory. At its core is MemCube, a standardized memory unit that supports structured storage, lifecycle management, and task-aware memory augmentation. The system enables more coherent reasoning, adaptability, and cross-agent collaboration. Future goals include enabling memory sharing across models, self-evolving memory blocks, and building a decentralized memory marketplace to support continual learning and intelligent evolution.

Check out the Paper. All credit for this research goes to the researchers of this project. The post MemOS: A Memory-Centric Operating System for Evolving and Adaptive Large Language Models appeared first on MarkTechPost.


Internal Coherence Maximization (ICM): A Label-Free, Unsupervised Training Framework for LLMs

Post-training methods for pre-trained language models (LMs) depend on human supervision through demonstrations or preference feedback to specify desired behaviors. However, this approach faces critical limitations as tasks and model behaviors become very complex. Human supervision is unreliable in these scenarios, as LMs learn to mimic mistakes in demonstrations or exploit inherent flaws in feedback systems. The core challenge lies in training LMs for tasks that humans cannot reliably demonstrate or evaluate. Recent research has identified diverse failure modes, including reward-hacking of human-designed supervision signals or of real humans themselves.

Limitations of Human Supervision in LLM Post-Training

Researchers have explored several approaches to scale beyond human supervision. One standard method utilizes high-quality verifiable rewards, such as matching model outputs with ground-truth solutions in mathematical domains. Despite evidence that pre-trained base models have strong latent capabilities for downstream tasks, with post-training adding minimal improvements, effective elicitation remains challenging. The Contrast Consistent Search (CCS) method is an unsupervised elicitation approach that uses logical consistency to find latent knowledge without supervision. However, CCS underperforms supervised approaches and often fails to identify knowledge, because other prominent features also satisfy the consistency properties.

Introducing Internal Coherence Maximization (ICM)

Researchers from Anthropic, Schmidt Sciences, Constellation, New York University, and George Washington University, along with independent researchers, have proposed Internal Coherence Maximization (ICM), which fine-tunes pre-trained models on their own generated labels without using any provided labels. ICM addresses this by searching for label sets that are both logically consistent and mutually predictable according to the pre-trained model. Since identifying the optimal label set is computationally infeasible, ICM uses a simulated-annealing-inspired search algorithm to approximate the maximum of the objective. This method matches the performance of training on golden labels on TruthfulQA and GSM8K, and outperforms training on crowdsourced human labels on Alpaca.

How the ICM Algorithm Works

The ICM algorithm follows an iterative three-step process: (a) the system samples a new unlabeled example from the dataset for potential inclusion, (b) it determines the optimal label for this example while simultaneously resolving any logical inconsistencies, and (c) the algorithm evaluates whether to accept the newly labeled example based on the scoring function. ICM is evaluated across three datasets: TruthfulQA for truthfulness assessment, GSM8K-verification for mathematical correctness, and Alpaca for helpfulness and harmlessness. Researchers used four baselines in their experiments: Zero-shot, Zero-shot (Chat), Golden Label, and Human Label. Experiments used two open-weight models, Llama 3.1 8B and 70B, and two proprietary models, Claude 3 Haiku and Claude 3.5 Haiku.
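The following is a schematic sketch of that search loop under our reading of the description above; the mutual-predictability and consistency scoring functions are placeholders for the model-based terms defined in the paper.

```python
# Schematic ICM-style search: propose labels one example at a time and accept proposals
# with a simulated-annealing rule on a coherence score. Scoring functions are placeholders.
import math, random

def mutual_predictability(labels: dict) -> float:
    """Placeholder: sum of model log-probs of each label given all the other labels."""
    raise NotImplementedError

def inconsistencies(labels: dict) -> int:
    """Placeholder: count of logical contradictions among the current labels."""
    raise NotImplementedError

def icm_search(examples, steps=10_000, t0=2.0, alpha=0.999, seed=0):
    rng = random.Random(seed)
    labels, temp = {}, t0
    score = lambda l: mutual_predictability(l) - 10.0 * inconsistencies(l)
    for _ in range(steps):
        x = rng.choice(examples)                      # (a) sample an example
        proposal = dict(labels)
        proposal[x] = rng.choice([True, False])       # (b) propose a label for it
        delta = score(proposal) - (score(labels) if labels else 0.0)
        if delta > 0 or rng.random() < math.exp(delta / temp):   # (c) annealed acceptance
            labels = proposal
        temp *= alpha                                 # cool the temperature over time
    return labels
```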
Benchmark Performance and Model Comparisons

In superhuman capability elicitation tasks, ICM matches golden supervision accuracy at 80%, outperforming the estimated human accuracy of 60%. Using ICM-generated reward models, researchers successfully trained an assistant chatbot without human supervision. The unsupervised reward model achieves 75.0% accuracy on RewardBench, compared to 72.2% for human-supervised alternatives trained on production data. Moreover, using both the unsupervised and the human-supervised RMs, two policies were trained with RL to create helpful, harmless, and honest assistants. The policy trained with the unsupervised RM achieves a 60% win rate. However, these policies still lag behind the publicly released Claude 3.5 Haiku, which achieves a 92% win rate.

Conclusion and Future Outlook

This paper introduces Internal Coherence Maximization (ICM), an advance in unsupervised elicitation that fine-tunes pre-trained models on self-generated labels. The method consistently matches golden-supervision performance and surpasses crowdsourced human supervision across GSM8K-verification, TruthfulQA, and Alpaca reward modeling tasks. However, ICM’s limitations include dependence on concept salience within pre-trained models and ineffectiveness with long inputs due to context-window constraints. As LMs advance beyond human evaluation capabilities, ICM offers a promising alternative to traditional RLHF, ensuring model alignment with human intent without the limits of human supervision.

Check out the Paper. All credit for this research goes to the researchers of this project. The post Internal Coherence Maximization (ICM): A Label-Free, Unsupervised Training Framework for LLMs appeared first on MarkTechPost.


Highlighted at CVPR 2025: Google DeepMind’s ‘Motion Prompting’ Paper Unlocks Granular Video Control

Key Takeaways

- Researchers from Google DeepMind, the University of Michigan, and Brown University have developed “Motion Prompting,” a new method for controlling video generation using specific motion trajectories.
- The technique uses “motion prompts,” a flexible representation of movement that can be either sparse or dense, to guide a pre-trained video diffusion model.
- A key innovation is “motion prompt expansion,” which translates high-level user requests, like mouse drags, into detailed motion instructions for the model.
- This single, unified model can perform a wide array of tasks, including precise object and camera control, motion transfer from one video to another, and interactive image editing, without needing to be retrained for each specific capability.

As generative AI continues to evolve, gaining precise control over video creation is a critical hurdle for its widespread adoption in markets like advertising, filmmaking, and interactive entertainment. While text prompts have been the primary method of control, they often fall short in specifying the nuanced, dynamic movements that make video compelling. A new paper from Google DeepMind, the University of Michigan, and Brown University, presented and highlighted at CVPR 2025, introduces a groundbreaking solution called “Motion Prompting,” which offers an unprecedented level of control by allowing users to direct the action in a video using motion trajectories.

This new approach moves beyond the limitations of text, which struggles to describe complex movements accurately. For instance, a prompt like “a bear quickly turns its head” is open to countless interpretations. How fast is “quickly”? What is the exact path of the head’s movement? Motion Prompting addresses this by allowing creators to define the motion itself, opening the door for more expressive and intentional video content. Note that the results are not generated in real time; processing takes roughly 10 minutes.

Introducing Motion Prompts

At the core of this research is the concept of a “motion prompt.” The researchers identified that spatio-temporally sparse or dense motion trajectories (essentially, the movement of points tracked over time) are an ideal way to represent any kind of motion. This flexible format can capture anything from the subtle flutter of hair to complex camera movements. To enable this, the team trained a ControlNet adapter on top of a powerful, pre-trained video diffusion model called Lumiere. The ControlNet was trained on a massive internal dataset of 2.2 million videos, each with detailed motion tracks extracted by an algorithm called BootsTAP. This diverse training allows the model to understand and generate a vast range of motions without specialized engineering for each task.

From Simple Clicks to Complex Scenes: Motion Prompt Expansion

While specifying every point of motion for a complex scene would be impractical for a user, the researchers developed a process they call “motion prompt expansion.” This clever system translates simple, high-level user inputs into the detailed, semi-dense motion prompts the model needs.
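To make the idea concrete, here is a toy sketch of what a sparse motion prompt can look like in practice and how a single mouse drag might be expanded into a per-frame point track; this is purely illustrative, and the paper's actual expansion and conditioning format are more sophisticated.

```python
# Toy sketch: a motion prompt as a set of point tracks, plus a naive drag-to-trajectory
# expansion via linear interpolation.
import numpy as np

def drag_to_track(start_xy, end_xy, n_frames: int) -> np.ndarray:
    """Linearly interpolate a drag into an (n_frames, 2) trajectory of pixel coordinates."""
    t = np.linspace(0.0, 1.0, n_frames)[:, None]
    return (1 - t) * np.asarray(start_xy, float) + t * np.asarray(end_xy, float)

# A sparse motion prompt: a few tracked points, each an (n_frames, 2) array that a
# ControlNet-style adapter could condition the video model on.
n_frames = 80
motion_prompt = {
    "parrot_head": drag_to_track((120, 60), (160, 55), n_frames),
    "background_anchor": np.tile(np.array([[320.0, 200.0]]), (n_frames, 1)),  # held still
}
print({name: track.shape for name, track in motion_prompt.items()})
```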
This expansion process allows for a variety of intuitive applications:

- “Interacting” with an Image: A user can simply click and drag their mouse across an object in a still image to make it move. For example, a user could drag a parrot’s head to make it turn, or “play” with a person’s hair, and the model generates a realistic video of that action. Interestingly, this process revealed emergent behaviors, where the model would generate physically plausible motion, like sand realistically scattering when “pushed” by the cursor.
- Object and Camera Control: By interpreting mouse movements as instructions to manipulate a geometric primitive (like an invisible sphere), users can achieve fine-grained control, such as precisely rotating a cat’s head. Similarly, the system can generate sophisticated camera movements, like orbiting a scene, by estimating the scene’s depth from the first frame and projecting a desired camera path onto it. The model can even combine these prompts to control an object and the camera simultaneously.
- Motion Transfer: This technique allows the motion from a source video to be applied to a completely different subject in a static image. For instance, the researchers demonstrated transferring the head movements of a person onto a macaque, effectively “puppeteering” the animal.

Putting it to the Test

The team conducted extensive quantitative evaluations and human studies to validate their approach, comparing it against recent models like Image Conductor and DragAnything. In nearly all metrics, including image quality (PSNR, SSIM) and motion accuracy (EPE), their model outperformed the baselines. A human study further confirmed these results: when asked to choose between videos generated by Motion Prompting and other methods, participants consistently preferred the results from the new model, citing better adherence to the motion commands, more realistic motion, and higher overall visual quality.

Limitations and Future Directions

The researchers are transparent about the system’s current limitations. The model can sometimes produce unnatural results, such as stretching an object when parts of it are mistakenly “locked” to the background. However, they suggest that these very failures can be used as a valuable tool to probe the underlying video model and identify weaknesses in its “understanding” of the physical world.

This research represents a significant step toward creating truly interactive and controllable generative video models. By focusing on the fundamental element of motion, the team has unlocked a versatile and powerful tool that could one day become a standard for professionals and creatives looking to harness the full potential of AI in video production.

Check out the Paper and Project Page. All credit for this research goes to the researchers of this project. The post Highlighted at CVPR 2025: Google DeepMind’s ‘Motion Prompting’ Paper Unlocks Granular Video Control appeared first on MarkTechPost.
