
MemOS: A Memory-Centric Operating System for Evolving and Adaptive Large Language Models

LLMs are increasingly seen as key to achieving Artificial General Intelligence (AGI), but they face major limitations in how they handle memory. Most LLMs rely on fixed knowledge stored in their weights and short-lived context during use, making it hard to retain or update information over time. Techniques like RAG attempt to incorporate external knowledge but lack structured memory management. This leads to problems such as forgetting past conversations, poor adaptability, and isolated memory across platforms. Fundamentally, today's LLMs do not treat memory as a manageable, persistent, or sharable system, which limits their real-world usefulness.

To address these limitations, researchers from MemTensor (Shanghai) Technology Co., Ltd., Shanghai Jiao Tong University, Renmin University of China, and the Research Institute of China Telecom have developed MemOS, a memory operating system that makes memory a first-class resource in language models. At its core is MemCube, a unified memory abstraction that manages parametric, activation, and plaintext memory. MemOS enables structured, traceable, and cross-task memory handling, allowing models to adapt continuously, internalize user preferences, and maintain behavioral consistency. This shift transforms LLMs from passive generators into evolving systems capable of long-term learning and cross-platform coordination.

As AI systems grow more complex, handling multiple tasks, roles, and data types, language models must evolve beyond understanding text to retaining memory and learning continuously. Current LLMs lack structured memory management, which limits their ability to adapt and grow over time. MemOS treats memory as a core, schedulable resource, enabling long-term learning through structured storage, version control, and unified memory access. Unlike traditional training, MemOS supports a continuous "memory training" paradigm that blurs the line between learning and inference. It also emphasizes governance, ensuring traceability, access control, and safe use in evolving AI systems.

MemOS treats memory not just as stored data but as an active, evolving component of the model's cognition. It organizes memory into three distinct types: Parametric Memory (knowledge baked into model weights via pretraining or fine-tuning), Activation Memory (temporary internal states, such as KV caches and attention patterns, used during inference), and Plaintext Memory (editable, retrievable external data, like documents or prompts). These memory types interact within a unified framework called the MemoryCube (MemCube), which encapsulates both content and metadata, allowing dynamic scheduling, versioning, access control, and transformation across types. This structured system enables LLMs to adapt, recall relevant information, and efficiently evolve their capabilities, transforming them into more than static generators.

At the core of MemOS is a three-layer architecture: the Interface Layer handles user inputs and parses them into memory-related tasks; the Operation Layer manages the scheduling, organization, and evolution of the different memory types; and the Infrastructure Layer ensures safe storage, access governance, and cross-agent collaboration. All interactions within the system are mediated through MemCubes, allowing traceable, policy-driven memory operations. A minimal sketch of the abstraction follows below.
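The paper describes MemCube conceptually rather than as published code, so the following is a minimal, hypothetical Python sketch. The class names, fields, and the toy relevance scorer are illustrative assumptions, not the actual MemOS API; the sketch only shows how a MemCube-style unit could bundle content with type, versioning, provenance, and ownership metadata, and how a simple scheduler might select plaintext memories to inject into a prompt.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any

class MemoryType(Enum):
    PARAMETRIC = "parametric"   # knowledge encoded in model weights
    ACTIVATION = "activation"   # transient states such as KV caches
    PLAINTEXT = "plaintext"     # editable external documents and prompts

@dataclass
class MemCube:
    """Hypothetical memory unit: content plus governance metadata."""
    content: Any
    mem_type: MemoryType
    owner: str                                           # access-control subject
    version: int = 1
    provenance: list = field(default_factory=list)       # transformation trace

    def update(self, new_content: Any, note: str) -> "MemCube":
        """Return a new version, keeping the lineage traceable."""
        return MemCube(
            content=new_content,
            mem_type=self.mem_type,
            owner=self.owner,
            version=self.version + 1,
            provenance=self.provenance + [note],
        )

def schedule(cubes: list, query: str, k: int = 3) -> list:
    """Toy MemScheduler stand-in: rank plaintext cubes by word overlap
    with the query and return the top k for injection during reasoning."""
    terms = set(query.lower().split())
    plaintext = [c for c in cubes if c.mem_type is MemoryType.PLAINTEXT]
    return sorted(
        plaintext,
        key=lambda c: len(terms & set(str(c.content).lower().split())),
        reverse=True,
    )[:k]

# Usage: store a preference, revise it, then retrieve memory for a prompt.
pref = MemCube("user prefers concise answers", MemoryType.PLAINTEXT, owner="alice")
pref = pref.update("user prefers concise, bulleted answers", note="session 42")
print([c.version for c in schedule([pref], "how concise should answers be?")])
```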
Through modules like MemScheduler, MemLifecycle, and MemGovernance, MemOS maintains a continuous and adaptive memory loop: from the moment a user sends a prompt, to memory injection during reasoning, to storing useful data for future use. This design not only enhances the model's responsiveness and personalization but also ensures that memory remains structured, secure, and reusable.

In conclusion, MemOS is a memory operating system designed to make memory a central, manageable component in LLMs. Unlike traditional models that depend mostly on static model weights and short-term runtime states, MemOS introduces a unified framework for handling parametric, activation, and plaintext memory. At its core is MemCube, a standardized memory unit that supports structured storage, lifecycle management, and task-aware memory augmentation. The system enables more coherent reasoning, adaptability, and cross-agent collaboration. Future goals include enabling memory sharing across models, self-evolving memory blocks, and building a decentralized memory marketplace to support continual learning and intelligent evolution.

Check out the Paper. All credit for this research goes to the researchers of this project.


Internal Coherence Maximization (ICM): A Label-Free, Unsupervised Training Framework for LLMs

Post-training methods for pre-trained language models (LMs) depend on human supervision through demonstrations or preference feedback to specify desired behaviors. However, this approach faces critical limitations as tasks and model behaviors become very complex. Human supervision is unreliable in these scenarios, as LMs learn to mimic mistakes in demonstrations or exploit inherent flaws in feedback systems. The core challenge lies in training LMs for tasks where humans cannot reliably provide demonstrations or evaluations. Recent research has identified diverse failure modes, including reward-hacking of human-designed supervision signals or of real humans themselves.

Limitations of Human Supervision in LLM Post-Training

Researchers have explored several approaches to scale beyond human supervision. One standard method utilizes high-quality verifiable rewards, such as matching model outputs with ground-truth solutions in mathematical domains. Despite evidence that pre-trained base models have strong latent capabilities for downstream tasks, with post-training adding minimal improvements, effective elicitation remains challenging. The Contrast Consistent Search (CCS) method is an unsupervised elicitation approach that uses logical consistency to find latent knowledge without supervision. However, CCS underperforms supervised approaches and often fails to identify knowledge, because other prominent features can also satisfy its consistency properties.

Introducing Internal Coherence Maximization (ICM)

Researchers from Anthropic, Schmidt Sciences, Independent, Constellation, New York University, and George Washington University have proposed Internal Coherence Maximization (ICM), which fine-tunes pre-trained models on their own generated labels without using any provided labels. ICM searches for label sets that are both logically consistent and mutually predictable according to the pre-trained model. Since identifying the optimal label set is computationally infeasible, ICM uses a simulated-annealing-inspired search algorithm to approximate the objective (a minimal sketch of this loop appears after the benchmark discussion below). The method matches the performance of training on golden labels on TruthfulQA and GSM8K, and outperforms training on crowdsourced human labels on Alpaca.

How the ICM Algorithm Works

The ICM algorithm follows an iterative three-step process: (a) the system samples a new unlabeled example from the dataset for potential inclusion, (b) it determines the optimal label for this example while simultaneously resolving any logical inconsistencies, and (c) the algorithm evaluates whether to accept the new labeled example based on the scoring function. ICM is evaluated across three datasets: TruthfulQA for truthfulness assessment, GSM8K-verification for mathematical correctness, and Alpaca for helpfulness and harmlessness. Researchers used four baselines in their experiments: Zero-shot, Zero-shot (Chat), Golden Label, and Human Label. Experiments used two open-weight models, Llama 3.1 8B and 70B, and two proprietary models, Claude 3 Haiku and Claude 3.5 Haiku.

Benchmark Performance and Model Comparisons

In superhuman capability elicitation tasks, ICM matches golden supervision accuracy at 80%, outperforming the estimated human accuracy of 60%. Using ICM-generated reward models, researchers successfully trained an assistant chatbot without human supervision. The unsupervised reward model achieves 75.0% accuracy on RewardBench, compared to 72.2% for human-supervised alternatives trained on production data.
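To make the search loop concrete, here is a minimal, hypothetical Python sketch of an ICM-style procedure. It is an illustration under stated assumptions, not the authors' implementation: `score_fn` stands in for querying the pre-trained LM for the log-probability of one label given all the others (mutual predictability), `consistency_fn` stands in for the paper's logical-inconsistency resolution, and binary True/False labels are assumed.

```python
import math
import random

def mutual_predictability(labels: dict, score_fn) -> float:
    """Sum over items of log P(label_i | all other labeled examples),
    as judged by the pre-trained model via the user-supplied score_fn."""
    return sum(
        score_fn(i, y, {j: l for j, l in labels.items() if j != i})
        for i, y in labels.items()
    )

def icm_search(n_examples, score_fn, consistency_fn,
               n_steps=1000, t0=2.0, cooling=0.995):
    """Simplified ICM-style search with simulated-annealing acceptance."""
    labels: dict = {}
    score = 0.0
    temperature = t0
    for _ in range(n_steps):
        i = random.randrange(n_examples)               # (a) sample an example
        proposal = dict(labels)
        proposal[i] = random.choice(["True", "False"])
        proposal = consistency_fn(proposal)            # (b) fix inconsistencies
        new_score = mutual_predictability(proposal, score_fn)
        delta = new_score - score
        # (c) accept improvements always; accept regressions with a
        # Boltzmann probability that shrinks as the temperature cools
        if delta >= 0 or random.random() < math.exp(delta / max(temperature, 1e-6)):
            labels, score = proposal, new_score
        temperature *= cooling
    return labels
```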
Moreover, using both the unsupervised and the human-supervised reward model, two policies were trained with RL to create helpful, harmless, and honest assistants. The policy trained with the unsupervised RM achieves a 60% win rate. However, both policies still lag behind the publicly released Claude 3.5 Haiku, which achieves a 92% win rate.

Conclusion and Future Outlook

This paper introduces Internal Coherence Maximization (ICM), an advancement in unsupervised elicitation that fine-tunes pre-trained models on self-generated labels. The method consistently matches golden-supervision performance and surpasses crowdsourced human supervision across GSM8K-verification, TruthfulQA, and Alpaca reward-modeling tasks. However, ICM's limitations include dependency on concept salience within pre-trained models and ineffectiveness with long inputs due to context-window constraints. As LMs advance beyond human evaluation capabilities, ICM offers a promising alternative to traditional RLHF, helping align models with human intent without the limits of human supervision.

Check out the Paper. All credit for this research goes to the researchers of this project.


Highlighted at CVPR 2025: Google DeepMind’s ‘Motion Prompting’ Paper Unlocks Granular Video Control

Key Takeaways:

- Researchers from Google DeepMind, the University of Michigan, and Brown University have developed "Motion Prompting," a new method for controlling video generation using specific motion trajectories.
- The technique uses "motion prompts," a flexible representation of movement that can be either sparse or dense, to guide a pre-trained video diffusion model.
- A key innovation is "motion prompt expansion," which translates high-level user requests, like mouse drags, into detailed motion instructions for the model.
- This single, unified model can perform a wide array of tasks, including precise object and camera control, motion transfer from one video to another, and interactive image editing, without needing to be retrained for each specific capability.

As generative AI continues to evolve, gaining precise control over video creation is a critical hurdle for its widespread adoption in markets like advertising, filmmaking, and interactive entertainment. While text prompts have been the primary method of control, they often fall short in specifying the nuanced, dynamic movements that make video compelling. A new paper from Google DeepMind, the University of Michigan, and Brown University, presented and highlighted at CVPR 2025, introduces a groundbreaking solution called "Motion Prompting," which offers an unprecedented level of control by allowing users to direct the action in a video using motion trajectories.

This approach moves beyond the limitations of text, which struggles to describe complex movements accurately. For instance, a prompt like "a bear quickly turns its head" is open to countless interpretations. How fast is "quickly"? What is the exact path of the head's movement? Motion Prompting addresses this by allowing creators to define the motion itself, opening the door for more expressive and intentional video content. Note that the results are not real-time (around 10 minutes of processing time).

Introducing Motion Prompts

At the core of this research is the concept of a "motion prompt." The researchers identified that spatio-temporally sparse or dense motion trajectories, essentially tracking the movement of points over time, are an ideal way to represent any kind of motion. This flexible format can capture anything from the subtle flutter of hair to complex camera movements. To enable this, the team trained a ControlNet adapter on top of a powerful, pre-trained video diffusion model called Lumiere. The ControlNet was trained on a massive internal dataset of 2.2 million videos, each with detailed motion tracks extracted by an algorithm called BootsTAP. This diverse training allows the model to understand and generate a vast range of motions without specialized engineering for each task.

From Simple Clicks to Complex Scenes: Motion Prompt Expansion

While specifying every point of motion for a complex scene would be impractical for a user, the researchers developed a process they call "motion prompt expansion." This system translates simple, high-level user inputs into the detailed, semi-dense motion prompts the model needs (a simplified sketch follows below). This allows for a variety of intuitive applications:

"Interacting" with an Image: A user can simply click and drag their mouse across an object in a still image to make it move. For example, a user could drag a parrot's head to make it turn, or "play" with a person's hair, and the model generates a realistic video of that action.
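As a rough illustration of the expansion idea, here is a short, hypothetical Python sketch that turns mouse drags into per-frame point trajectories. The real system is more sophisticated (it can route drags through geometric primitives and depth estimates), and the function names and the linear-interpolation scheme here are assumptions for illustration only.

```python
import numpy as np

def expand_drag_to_track(start_xy, end_xy, n_frames=16, jitter=0.0, seed=0):
    """Expand one mouse drag into a trajectory of shape (n_frames, 2):
    one (x, y) point per frame, linearly interpolated from start to end."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_frames)[:, None]        # shape (n_frames, 1)
    track = (1 - t) * np.asarray(start_xy, dtype=float) \
          + t * np.asarray(end_xy, dtype=float)
    track += jitter * rng.standard_normal(track.shape)   # optional noise
    return track

def drags_to_motion_prompt(drags, n_frames=16):
    """Stack drags into a sparse motion prompt of shape
    (n_points, n_frames, 2), the kind of trajectory conditioning a
    point-track-conditioned video model could consume."""
    return np.stack([expand_drag_to_track(s, e, n_frames) for s, e in drags])

# Example: drag a parrot's head from (100, 120) to (160, 90) over 16 frames.
prompt = drags_to_motion_prompt([((100, 120), (160, 90))])
print(prompt.shape)  # (1, 16, 2)
```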
Interestingly, this process revealed emergent behaviors, where the model would generate physically plausible motion, like sand realistically scattering when "pushed" by the cursor.

Object and Camera Control: By interpreting mouse movements as instructions to manipulate a geometric primitive (like an invisible sphere), users can achieve fine-grained control, such as precisely rotating a cat's head. Similarly, the system can generate sophisticated camera movements, like orbiting a scene, by estimating the scene's depth from the first frame and projecting a desired camera path onto it. The model can even combine these prompts to control an object and the camera simultaneously.

Motion Transfer: This technique allows the motion from a source video to be applied to a completely different subject in a static image. For instance, the researchers demonstrated transferring the head movements of a person onto a macaque, effectively "puppeteering" the animal.

Putting It to the Test

The team conducted extensive quantitative evaluations and human studies to validate their approach, comparing it against recent models like Image Conductor and DragAnything. In nearly all metrics, including image quality (PSNR, SSIM) and motion accuracy (EPE), their model outperformed the baselines. A human study further confirmed these results: when asked to choose between videos generated by Motion Prompting and other methods, participants consistently preferred the results from the new model, citing better adherence to the motion commands, more realistic motion, and higher overall visual quality.

Limitations and Future Directions

The researchers are transparent about the system's current limitations. The model can sometimes produce unnatural results, such as stretching an object if parts of it are mistakenly "locked" to the background. However, they suggest that these very failures can be used as a valuable tool to probe the underlying video model and identify weaknesses in its "understanding" of the physical world. This research represents a significant step toward creating truly interactive and controllable generative video models. By focusing on the fundamental element of motion, the team has unlocked a versatile and powerful tool that could one day become a standard for professionals and creatives looking to harness the full potential of AI in video production.

Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.


OpenThoughts: A Scalable Supervised Fine-Tuning (SFT) Data Curation Pipeline for Reasoning Models

The Growing Complexity of Reasoning Data Curation

Recent reasoning models, such as DeepSeek-R1 and o3, have shown outstanding performance in mathematical, coding, and scientific areas, utilizing post-training techniques like supervised fine-tuning (SFT) and reinforcement learning (RL). However, the complete methodologies behind these frontier reasoning models are not public, which makes research on building reasoning models difficult. While SFT data curation has become a powerful approach for developing strong reasoning capabilities, most existing efforts explore only limited design choices, such as relying solely on human-written questions or single teacher models. Moreover, exploring the extensive design space of techniques for generating question-answer pairs incurs high costs for teacher inference and model training.

Reasoning traces provided by models such as Gemini, QwQ, and DeepSeek-R1 have enabled knowledge distillation techniques to train smaller reasoning models. Projects like OpenR1, OpenMathReasoning, and OpenCodeReasoning collect questions from public forums and competition sites, while Natural Reasoning utilizes pre-training corpora as seed data. Some efforts, such as S1 and LIMO, focus on manually curating small, high-quality datasets of challenging prompts. Other methods, such as DeepMath-103K and Nvidia Nemotron, introduce innovations across data sourcing, filtering, and scaling stages. RL methods, including AceReason and Skywork-OR1, have enhanced reasoning capabilities beyond traditional SFT.

OpenThoughts: A Scalable Framework for SFT Dataset Development

Researchers from Stanford University, the University of Washington, BespokeLabs.ai, Toyota Research Institute, UC Berkeley, and 12 additional organizations have proposed OpenThoughts, a new state-of-the-art open reasoning data recipe. OpenThoughts uses a progressive approach across three iterations: OpenThoughts-114K scales the Sky-T1 pipeline with automated verification, OpenThoughts2-1M enhances data scale through augmented question diversity and synthetic generation strategies, and OpenThoughts3-1.2M incorporates findings from over 1,000 ablation experiments to develop a simple, scalable, and high-performing data curation pipeline. Moreover, the resulting model, OpenThinker3-7B, achieves state-of-the-art performance among open-data models at the 7B scale.

OpenThoughts3-1.2M is built by ablating each pipeline component independently while holding the other stages constant, generating 31,600 data points per strategy and fine-tuning Qwen2.5-7B-Instruct on each resulting dataset. The goal during training is to create the best dataset of question-response pairs for SFT reasoning. Evaluation occurs across eight reasoning benchmarks spanning mathematics (AIME24, AMC23, MATH500), coding (CodeElo, CodeForces, LiveCodeBench), and science (GPQA Diamond, JEEBench). The experimental design includes a rigorous decontamination process to remove high-similarity samples (a simplified sketch follows below) and maintains a held-out benchmark set for generalization testing. Evalchemy serves as the primary evaluation tool, ensuring consistent evaluation protocols.
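The paper's exact decontamination procedure is not reproduced here, but the idea of dropping training samples that are too similar to benchmark questions is easy to sketch. The following hypothetical Python snippet uses word n-gram Jaccard overlap as the similarity measure; the threshold and fingerprinting scheme are illustrative assumptions, not the project's actual settings.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Lowercased word n-grams as a cheap fingerprint of a sample."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def decontaminate(train_samples, benchmark_samples, threshold: float = 0.5):
    """Keep only training samples whose n-gram overlap with every
    benchmark question stays below the threshold."""
    bench_fps = [ngrams(b) for b in benchmark_samples]
    return [
        s for s in train_samples
        if all(jaccard(ngrams(s), fp) < threshold for fp in bench_fps)
    ]

# Usage: a near-copy of a benchmark item is removed, a fresh question kept.
bench = ["What is the sum of the first 100 positive integers?"]
train = [
    "What is the sum of the first 100 positive integers? Show your work.",
    "Prove that the square root of 2 is irrational.",
]
print(decontaminate(train, bench))
```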
Evaluation Insights and Benchmark Performance

The OpenThoughts pipeline evaluation reveals key insights across question sourcing, mixing, filtering, answer filtering, and the choice of teacher model. Question-sourcing experiments show that CodeGolf and competitive coding questions achieve the highest performance for code tasks (25.3-27.5 average scores), while LLM-generated and human-written questions excel in mathematics (58.8 and 58.5 scores), and physics StackExchange questions combined with chemistry textbook extractions perform best in science (43.2-45.3 scores). Question-mixing experiments show that combining multiple question sources degrades performance, with the best strategies achieving roughly 5% accuracy improvements over diverse mixing. For the teacher model, QwQ-32B outperforms DeepSeek-R1 in knowledge distillation, yielding an accuracy improvement of 1.9-2.6%.

In conclusion, the researchers present the OpenThoughts project, showing that systematic experimentation can significantly advance SFT data curation for reasoning models. They developed OpenThoughts3-1.2M, a state-of-the-art open-data reasoning dataset spanning science, mathematics, and coding domains. The resulting OpenThinker3-7B model achieves superior performance among open-data reasoning models at its scale. Several directions remain unexplored, including RL approaches, staged fine-tuning, and curriculum-learning strategies. Future research includes investigating cross-domain transfer effects when optimizing individual domains versus overall performance, and understanding the scaling dynamics as student models approach teacher capabilities.

Check out the Paper, Project Page, and GitHub Page. All credit for this research goes to the researchers of this project.


Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models

arXiv:2506.09277v2 Announce Type: replace
Abstract: Large Language Models (LLMs) have demonstrated the capability of generating free-text self-Natural Language Explanations (self-NLE) to justify their answers. Despite their logical appearance, self-NLE do not necessarily reflect the LLM's actual decision-making process, making such explanations unfaithful. While existing methods for measuring self-NLE faithfulness mostly rely on behavioral tests or computational block identification, none of them examines the neural activity underlying the model's reasoning. This work introduces a novel, flexible framework for quantitatively measuring the faithfulness of LLM-generated self-NLE by directly comparing the latter with interpretations of the model's internal hidden states. The proposed framework is versatile and provides deep insights into self-NLE faithfulness by establishing a direct connection between self-NLE and model reasoning. This approach advances the understanding of self-NLE faithfulness and provides building blocks for generating more faithful self-NLE.
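The abstract only outlines the framework, but the core comparison can be sketched in a few lines. The following hypothetical Python snippet assumes some method has already decoded the model's hidden states into textual interpretations (for example, via probes or a logit-lens-style decoder) and that an `embed` function mapping text to a vector is available; it then scores a self-explanation by its best embedding similarity to any hidden-state interpretation. This illustrates the comparison idea only, not the paper's actual metric.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def faithfulness_score(self_nle: str, hidden_interpretations: list, embed) -> float:
    """Best cosine similarity between the self-explanation embedding and
    the embeddings of interpretations decoded from hidden states."""
    e = embed(self_nle)
    return max(cosine(e, embed(h)) for h in hidden_interpretations)
```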
