How to Design a Fully Interactive, Reactive, and Dynamic Terminal-Based Data Dashboard Using Textual?

In this tutorial, we build an advanced interactive dashboard using Textual, and we explore how terminal-first UI frameworks can feel as expressive and dynamic as modern web dashboards. As we write and run each snippet, we actively construct the interface piece by piece: widgets, layouts, reactive state, and event flows, so we can see how Textual behaves like a live UI engine right inside Google Colab. By the end, we notice how naturally we can blend tables, trees, forms, and progress indicators into a cohesive application that feels fast, clean, and responsive. Check out the FULL CODES here.

!pip install textual textual-web nest-asyncio

from textual.app import App, ComposeResult
from textual.containers import Container, Horizontal, Vertical
from textual.widgets import (
    Header, Footer, Button, DataTable, Static, Input,
    Label, ProgressBar, Tree, Select
)
from textual.reactive import reactive
from textual import on
from datetime import datetime
import random


class StatsCard(Static):
    value = reactive(0)

    def __init__(self, title: str, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.title = title

    def compose(self) -> ComposeResult:
        yield Label(self.title)
        yield Label(str(self.value), id="stat-value")

    def watch_value(self, new_value: int) -> None:
        if self.is_mounted:
            try:
                self.query_one("#stat-value", Label).update(str(new_value))
            except Exception:
                pass

We set up the environment and import all the necessary components to build our Textual application. As we define the StatsCard widget, we establish a reusable component that reacts to changes in value and updates itself automatically. We begin to see how Textual’s reactive system lets us create dynamic UI elements with minimal effort. Check out the FULL CODES here.

class DataDashboard(App):
    CSS = """
    Screen { background: $surface; }
    #main-container { height: 100%; padding: 1; }
    #stats-row { height: auto; margin-bottom: 1; }
    StatsCard {
        border: solid $primary;
        height: 5;
        padding: 1;
        margin-right: 1;
        width: 1fr;
    }
    #stat-value {
        text-style: bold;
        color: $accent;
        content-align: center middle;
    }
    #control-panel {
        height: 12;
        border: solid $secondary;
        padding: 1;
        margin-bottom: 1;
    }
    #data-section { height: 1fr; }
    #left-panel {
        width: 30;
        border: solid $secondary;
        padding: 1;
        margin-right: 1;
    }
    DataTable { height: 100%; border: solid $primary; }
    Input { margin: 1 0; }
    Button { margin: 1 1 1 0; }
    ProgressBar { margin: 1 0; }
    """

    BINDINGS = [
        ("d", "toggle_dark", "Toggle Dark Mode"),
        ("q", "quit", "Quit"),
        ("a", "add_row", "Add Row"),
        ("c", "clear_table", "Clear Table"),
    ]

    total_rows = reactive(0)
    total_sales = reactive(0)
    avg_rating = reactive(0.0)

We define the DataDashboard class and configure global styles, key bindings, and reactive attributes. We decide how the app should look and behave right from the top, giving us full control over themes and interactivity. This structure helps us create a polished dashboard without writing any HTML or JS. Check out the FULL CODES here.
    def compose(self) -> ComposeResult:
        yield Header(show_clock=True)
        with Container(id="main-container"):
            with Horizontal(id="stats-row"):
                yield StatsCard("Total Rows", id="card-rows")
                yield StatsCard("Total Sales", id="card-sales")
                yield StatsCard("Avg Rating", id="card-rating")
            with Vertical(id="control-panel"):
                yield Input(placeholder="Product Name", id="input-name")
                yield Select(
                    [("Electronics", "electronics"), ("Books", "books"), ("Clothing", "clothing")],
                    prompt="Select Category",
                    id="select-category"
                )
                with Horizontal():
                    yield Button("Add Row", variant="primary", id="btn-add")
                    yield Button("Clear Table", variant="warning", id="btn-clear")
                    yield Button("Generate Data", variant="success", id="btn-generate")
                yield ProgressBar(total=100, id="progress")
            with Horizontal(id="data-section"):
                with Container(id="left-panel"):
                    yield Label("Navigation")
                    tree = Tree("Dashboard")
                    tree.root.expand()
                    products = tree.root.add("Products", expand=True)
                    products.add_leaf("Electronics")
                    products.add_leaf("Books")
                    products.add_leaf("Clothing")
                    tree.root.add_leaf("Reports")
                    tree.root.add_leaf("Settings")
                    yield tree
                yield DataTable(id="data-table")
        yield Footer()

We compose the entire UI layout, arranging containers, cards, form inputs, buttons, a navigation tree, and a data table. As we structure these components, we watch the interface take shape exactly the way we envision it. This snippet lets us design the visual skeleton of the dashboard in a clean, declarative manner. Check out the FULL CODES here.

    def on_mount(self) -> None:
        table = self.query_one(DataTable)
        table.add_columns("ID", "Product", "Category", "Price", "Sales", "Rating")
        table.cursor_type = "row"
        self.generate_sample_data(5)
        self.set_interval(0.1, self.update_progress)

    def generate_sample_data(self, count: int = 5) -> None:
        table = self.query_one(DataTable)
        categories = ["Electronics", "Books", "Clothing"]
        products = {
            "Electronics": ["Laptop", "Phone", "Tablet", "Headphones"],
            "Books": ["Novel", "Textbook", "Magazine", "Comic"],
            "Clothing": ["Shirt", "Pants", "Jacket", "Shoes"]
        }
        for _ in range(count):
            category = random.choice(categories)
            product = random.choice(products[category])
            row_id = self.total_rows + 1
            price = round(random.uniform(10, 500), 2)
            sales = random.randint(1, 100)
            rating = round(random.uniform(1, 5), 1)
            table.add_row(
                str(row_id), product, category,
                f"${price}", str(sales), str(rating)
            )
            self.total_rows += 1
            self.total_sales += sales
        self.update_stats()

    def update_stats(self) -> None:
        self.query_one("#card-rows", StatsCard).value = self.total_rows
        self.query_one("#card-sales", StatsCard).value = self.total_sales
        if self.total_rows > 0:
            table = self.query_one(DataTable)
            total_rating = sum(float(row[5]) for row in table.rows)
            self.avg_rating = round(total_rating / self.total_rows, 2)
            self.query_one("#card-rating", StatsCard).value = self.avg_rating

    def update_progress(self) -> None:
        progress = self.query_one(ProgressBar)
        progress.advance(1)
        if progress.progress >= 100:
            progress.progress = 0

We implement all the logic for generating data, computing statistics, animating progress, and updating cards. We see how quickly we can bind backend logic to frontend components using Textual’s reactive model. This step makes the dashboard feel alive as numbers update instantly and progress bars animate smoothly. Check out the FULL CODES here.
    @on(Button.Pressed, "#btn-add")
    def handle_add_button(self) -> None:
        name_input = self.query_one("#input-name", Input)
        category = self.query_one("#select-category", Select).value
        if name_input.value and category:
            table = self.query_one(DataTable)
            row_id = self.total_rows + 1
            price = round(random.uniform(10, 500), 2)
            sales = random.randint(1, 100)
            rating = round(random.uniform(1, 5), 1)
            table.add_row(
                str(row_id), name_input.value, str(category),
                f"${price}", str(sales), str(rating)
            )
            self.total_rows += 1
            self.total_sales += sales
            self.update_stats()
            name_input.value = ""

    @on(Button.Pressed, "#btn-clear")
    def handle_clear_button(self) -> None:
        table = self.query_one(DataTable)
        table.clear()
        self.total_rows = 0
        self.total_sales = 0
        self.avg_rating = 0
        self.update_stats()

    @on(Button.Pressed, "#btn-generate")
    def handle_generate_button(self) -> None:
        self.generate_sample_data(10)

    def action_toggle_dark(self) -> None:
        self.dark = not self.dark

    def action_add_row(self) -> None:
        self.handle_add_button()

    def action_clear_table(self) -> None:
        self.handle_clear_button()


if __name__ == "__main__":
    import nest_asyncio
    nest_asyncio.apply()
    app = DataDashboard()
    app.run()

We connect UI events to backend actions using button handlers, keyboard shortcuts, and app-level functions. As we run the app, we interact with a fully functional dashboard that responds instantly to every click and command. This snippet completes the application and demonstrates how easily Textual enables us to build dynamic, state-driven UIs. In conclusion, we see the whole dashboard come together in a fully functional, interactive form that runs directly from a notebook environment. We experience firsthand how Textual lets us design terminal UIs with the structure and


MBZUAI Researchers Introduce PAN: A General World Model For Interactable Long Horizon Simulation

Most text to video models generate a single clip from a prompt and then stop. They do not keep an internal world state that persists as actions arrive over time. PAN, a new model from MBZUAI’s Institute of Foundation Models, is designed to fill that gap by acting as a general world model that predicts future world states as video, conditioned on history and natural language actions.

Paper: https://arxiv.org/pdf/2511.09057

From video generator to interactive world simulator

PAN is defined as a general, interactable, long horizon world model. It maintains an internal latent state that represents the current world, then updates that state when it receives a natural language action such as ‘turn left and speed up’ or ‘move the robot arm to the red block.’ The model then decodes the updated state into a short video segment that shows the consequence of that action. This cycle repeats, so the same world state evolves across many steps.

This design allows PAN to support open domain, action conditioned simulation. It can roll out counterfactual futures for different action sequences. An external agent can query PAN as a simulator, compare predicted futures, and choose actions based on those predictions.

GLP architecture, separating what happens from how it looks

The base of PAN is the Generative Latent Prediction (GLP) architecture. GLP separates world dynamics from visual rendering. First, a vision encoder maps images or video frames into a latent world state. Second, an autoregressive latent dynamics backbone based on a large language model predicts the next latent state, conditioned on history and the current action. Third, a video diffusion decoder reconstructs the corresponding video segment from that latent state.

In PAN, the vision encoder and backbone are built on Qwen2.5-VL-7B-Instruct. The vision tower tokenizes frames into patches and produces structured embeddings. The language backbone runs over a history of world states and actions, plus learned query tokens, and outputs the latent representation of the next world state. These latents live in the shared multimodal space of the VLM, which helps ground the dynamics in both text and vision.

The video diffusion decoder is adapted from Wan2.1-T2V-14B, a diffusion transformer for high fidelity video generation. The research team trains this decoder with a flow matching objective, using one thousand denoising steps and a Rectified Flow formulation. The decoder conditions on both the predicted latent world state and the current natural language action, with a dedicated cross attention stream for the world state and another for the action text.

Causal Swin DPM and sliding window diffusion

Naively chaining single shot video models by conditioning only on the last frame leads to local discontinuities and rapid quality degradation over long rollouts. PAN addresses this with Causal Swin DPM, which augments the Shift Window Denoising Process Model with chunk wise causal attention. The decoder operates on a sliding temporal window that holds two chunks of video frames at different noise levels. During denoising, one chunk moves from high noise to clean frames and then leaves the window. A new noisy chunk enters at the other end. Chunk wise causal attention ensures that the later chunk can only attend to the earlier one, not to unseen future actions. This keeps transitions between chunks smooth and reduces error accumulation over long horizons.
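To summarize the GLP cycle described above, here is a minimal sketch of an action-conditioned rollout loop in the spirit of the paper. The object and method names are placeholders for illustration, not PAN’s actual API, and the sliding-window details of Causal Swin DPM are abstracted into the decoder call.

# Illustrative sketch of a GLP-style rollout loop (encode -> predict latent -> decode video).
# `world_model` and its methods are hypothetical stand-ins for PAN's Qwen2.5-VL-based
# backbone and its Wan2.1-based Causal Swin DPM decoder.

def rollout(world_model, initial_frames, actions, frames_per_chunk=16):
    state = world_model.encode(initial_frames)      # latent world state from the vision encoder
    history = [state]
    video_chunks = []
    for action in actions:                          # natural language actions, one per step
        # The autoregressive backbone predicts the next latent world state from the
        # history of states and actions plus learned query tokens.
        state = world_model.predict_next_state(history, action)
        # The diffusion decoder renders a short video chunk conditioned on the predicted
        # latent state and the action text; chunk-to-chunk consistency is handled by the
        # sliding-window (Causal Swin DPM) denoising inside the decoder.
        chunk = world_model.decode_video(state, action, num_frames=frames_per_chunk)
        video_chunks.append(chunk)
        history.append(state)
    return video_chunks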
PAN also adds controlled noise to the conditioning frame, rather than using a perfectly sharp frame. This suppresses incidental pixel details that do not matter for dynamics and encourages the model to focus on stable structure such as objects and layout.

Training stack and data construction

PAN is trained in two stages. In the first stage, the research team adapts Wan2.1-T2V-14B into the Causal Swin DPM architecture. They train the decoder in BFloat16 with AdamW, a cosine schedule, gradient clipping, FlashAttention3 and FlexAttention kernels, and a hybrid sharded data parallel scheme across 960 NVIDIA H200 GPUs.

In the second stage, they integrate the frozen Qwen2.5-VL-7B-Instruct backbone with the video diffusion decoder under the GLP objective. The vision language model remains frozen. The model learns query embeddings and the decoder so that predicted latents and reconstructed videos stay consistent. This joint training also uses sequence parallelism and Ulysses style attention sharding to handle long context sequences. Early stopping ends training after 1 epoch once validation converges, even though the schedule allows 5 epochs.

Training data comes from widely used publicly accessible video sources that cover everyday activities, human object interactions, natural environments, and multi agent scenarios. Long form videos are segmented into coherent clips using shot boundary detection. A filtering pipeline removes static or overly dynamic clips, low aesthetic quality, heavy text overlays, and screen recordings using rule based metrics, pretrained detectors, and a custom VLM filter. The research team then re-captions clips with dense, temporally grounded descriptions that emphasize motion and causal events.

Benchmarks, action fidelity, long horizon stability, planning

The research team evaluates the model along three axes: action simulation fidelity, long horizon forecast, and simulative reasoning and planning, against both open source and commercial video generators and world models. Baselines include WAN 2.1 and 2.2, Cosmos 1 and 2, V-JEPA 2, and commercial systems such as KLING, MiniMax Hailuo, and Gen 3.

For action simulation fidelity, a VLM based judge scores how well the model executes language specified actions while maintaining a stable background. PAN reaches 70.3% accuracy on agent simulation and 47% on environment simulation, for an overall score of 58.6%. It achieves the highest fidelity among open source models and surpasses most commercial baselines.

For long horizon forecast, the research team measures Transition Smoothness and Simulation Consistency. Transition Smoothness uses optical flow acceleration to quantify how smooth motion is across action boundaries. Simulation Consistency uses metrics inspired by WorldScore to monitor degradation over extended sequences. PAN scores 53.6% on Transition Smoothness and 64.1% on Simulation Consistency and exceeds all baselines, including KLING and MiniMax, on these metrics.

For simulative reasoning and planning, PAN is used as an internal simulator inside an OpenAI-o3 based agent loop. In


Cerebras Releases MiniMax-M2-REAP-162B-A10B: A Memory Efficient Version of MiniMax-M2 for Long Context Coding Agents

Cerebras has released MiniMax-M2-REAP-162B-A10B, a compressed Sparse Mixture-of-Experts (SMoE) causal language model derived from MiniMax-M2 using the new Router weighted Expert Activation Pruning (REAP) method. The model keeps the behavior of the original 230B total, 10B active MiniMax-M2 while pruning experts and reducing memory for deployment focused workloads such as coding agents and tool calling.

Architecture and core specifications

MiniMax-M2-REAP-162B-A10B has these key properties:

- Base model: MiniMax-M2
- Compression method: REAP, Router weighted Expert Activation Pruning
- Total parameters: 162B
- Active parameters per token: 10B
- Layers: 62 transformer blocks
- Attention heads per layer: 48
- Experts: 180 experts, obtained by pruning a 256 expert configuration
- Activated experts per token: 8
- Context length: 196,608 tokens
- License: modified MIT, derived from MiniMaxAI MiniMax-M2

The SMoE design means that the model stores 162B parameters, but each token only routes through a small set of experts, so the effective compute cost per token is similar to a 10B dense model. MiniMax-M2 itself is positioned as an MoE model built for coding and agentic workflows, with 230B total parameters and 10B active, which this checkpoint inherits.

How REAP compresses MiniMax-M2?

MiniMax-M2-REAP-162B-A10B is created by applying REAP uniformly across all MoE blocks of MiniMax-M2, at a 30 percent expert pruning rate. The REAP method defines a saliency score for each expert that combines:

- Router gate values: how often and how strongly the router selects that expert
- Expert activation norms: the magnitude of the expert output when active

Experts that contribute minimally to the layer output, under this combined criterion, are removed. The remaining experts keep their original weights and the router keeps separate gates for each of them. This is one shot compression; there is no extra fine tuning after pruning in the method definition.

A core theoretical result in the REAP research paper is that expert merging with summed gates causes functional subspace collapse. When experts are merged, the router loses its independent, input dependent control over those experts, so a single merged expert must approximate an input dependent mixture that was originally expressed through multiple experts. The research team proves that, whenever the router policy depends on the input and the experts are not identical, this introduces irreducible error. In contrast, pruning removes some experts but preserves independent control of the survivors, so the error scales with the gate weight of the removed experts. Across a set of SMoE models in the 20B to 1T parameter range, REAP consistently outperforms expert merging and other pruning criteria on generative benchmarks such as code generation, mathematical reasoning and tool calling, especially at 50 percent compression.
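To make the pruning criterion concrete, here is a minimal sketch of a REAP-style saliency score and a one-shot pruning pass over calibration data. The array shapes, normalization, and function names are illustrative assumptions, not Cerebras’ exact implementation.

import numpy as np

def reap_saliency(gate_weights: np.ndarray, expert_outputs: np.ndarray) -> np.ndarray:
    """Score each expert by its router-weighted contribution to layer outputs.

    gate_weights:   [num_tokens, num_experts] router gate values (0 when an expert is not routed)
    expert_outputs: [num_tokens, num_experts, d_model] expert outputs (0 when inactive)
    Returns one saliency score per expert. Illustrative approximation of the
    router-weighted expert activation criterion, not the paper's exact formula.
    """
    contribution = gate_weights[..., None] * expert_outputs      # router-weighted outputs
    return np.linalg.norm(contribution, axis=-1).mean(axis=0)    # average over calibration tokens

def prune_experts(saliency: np.ndarray, prune_rate: float = 0.30) -> np.ndarray:
    """Return indices of experts to keep after one-shot pruning (no fine tuning afterwards)."""
    num_keep = int(round(len(saliency) * (1.0 - prune_rate)))
    return np.argsort(saliency)[::-1][:num_keep]

# Example: scoring 256 experts on calibration data and pruning 30 percent keeps roughly 180,
# in line with the 256 -> 180 expert configuration reported for the 162B checkpoint.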
Accuracy under 30 percent expert pruning

The release compares three checkpoints on standard coding, reasoning and agentic benchmarks:

- MiniMax-M2 (230B, base model)
- MiniMax-M2-REAP-172B-A10B, 25 percent pruning
- MiniMax-M2-REAP-162B-A10B, 30 percent pruning

Model card: https://huggingface.co/cerebras/MiniMax-M2-REAP-162B-A10B

On coding benchmarks such as HumanEval, HumanEval Plus, MBPP and MBPP Plus, the 162B REAP model stays very close to the base model. HumanEval sits in the 90% range and MBPP stays in the 80% range, with the 172B and 162B models essentially tracking the original MiniMax-M2 within a few points.

On reasoning benchmarks such as AIME 25 and MATH 500, there are small shifts between the three models, but there is no collapse at 30 percent pruning and the 162B checkpoint remains competitive with the base model. On tool calling and agentic evaluation, represented by τ2 bench in a telecom setting, the 162B REAP model again matches the base model within small variance. The model card explicitly states that this checkpoint keeps almost identical performance while being about 30 percent lighter in parameter count. These results line up with the broader REAP study, which reports near lossless compression for code generation and tool calling on several large SMoE architectures when pruning experts using the REAP criterion.

Deployment, memory usage and observed throughput

Cerebras provides a direct vLLM serve example and positions MiniMax-M2-REAP-162B-A10B as a drop in model for the existing MiniMax-M2 integration.

vllm serve cerebras/MiniMax-M2-REAP-162B-A10B \
  --tensor-parallel-size 8 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --trust-remote-code \
  --enable_expert_parallel \
  --enable-auto-tool-choice

If the run hits memory limits, the card recommends lowering --max-num-seqs, for example to 64, to keep batch size in check on a given GPU.

Key Takeaways

- SMoE architecture with efficient compute: MiniMax-M2-REAP-162B-A10B is a Sparse Mixture of Experts model with 162B total parameters and 10B active parameters per token, so the compute cost per token is close to a 10B dense model while keeping frontier scale capacity.
- REAP expert pruning keeps behavior of MiniMax-M2: The model is produced by applying REAP, Router weighted Expert Activation Pruning, to MiniMax-M2 at roughly 30 percent expert pruning, pruning experts based on router gate values and expert activation norms while leaving surviving experts and router structure intact.
- Near lossless accuracy at 30 percent compression: On coding benchmarks such as HumanEval and MBPP, and on reasoning benchmarks such as AIME 25 and MATH 500, the 162B REAP variant tracks the 230B MiniMax-M2 and a 172B REAP variant within a few points, showing near lossless compression for code, reasoning and tool use.
- Pruning outperforms expert merging for generative SMoE: The REAP study shows that pruning experts using a saliency criterion avoids the functional subspace collapse seen with expert merging in generative tasks, and performs better across large SMoE models in the 22B to about 1T parameter range.

Comparison table: image source Marktechpost.com

Editorial Comments

Cerebras’ release of MiniMax-M2-REAP-162B-A10B is a strong signal that Router weighted Expert Activation Pruning is ready for real workloads, not just as a research curiosity. The checkpoint shows that a 30 percent expert pruning schedule can keep MiniMax-M2 230B-A10B behavior almost intact while cutting memory and preserving long context coding, reasoning and tool calling performance, which is exactly what SMoE researchers need for practical deployment. Overall, Cerebras is quietly turning expert pruning into production infrastructure for frontier class SMoE models. Check out the Model Weights.
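As a usage note for the serve command above, the sketch below queries the resulting endpoint through vLLM’s OpenAI-compatible API. It assumes the default host and port (localhost:8000) and a placeholder API key; the prompt is only an example.

# Minimal client sketch against a locally served MiniMax-M2-REAP-162B-A10B
# via vLLM's OpenAI-compatible server (default address assumed).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="cerebras/MiniMax-M2-REAP-162B-A10B",
    messages=[
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)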


Google’s new AI training method helps small models tackle complex reasoning

Researchers at Google Cloud and UCLA have proposed a new reinforcement learning framework that significantly improves the ability of language models to learn very challenging multi-step reasoning tasks. Supervised Reinforcement Learning (SRL) reformulates problem-solving as a sequence of logical “actions,” providing rich learning signals during the training process.

This approach enables smaller models to learn complex problems that were previously out of reach for other common training techniques. Experiments show that SRL not only excels on math reasoning benchmarks but also generalizes effectively to agentic software engineering tasks. SRL is a versatile training framework that can elevate smaller and less expensive models to higher reasoning abilities.

The limits of current LLM reasoning training

Recent advances in training large language models (LLMs) for reasoning have largely been driven by reinforcement learning with verifiable rewards (RLVR), a method where a model is rewarded based on the correctness of its final answer. By repeatedly trying to solve problems and getting feedback on the final outcome, the model gradually learns effective problem-solving strategies.

However, the success of this outcome-based approach depends on the model’s ability to discover a correct solution within a limited number of attempts, or “rollouts.” Since each rollout is computationally expensive, models can’t try indefinitely. This method hits a wall when problems are so difficult that the model rarely, if ever, finds the right answer within its budget.

This creates a critical learning bottleneck. In many multi-step reasoning problems, a model might correctly solve several steps but get derailed by a single mistake, leading to an incorrect answer. With RLVR, this entire effort receives a negative reward, and the model learns nothing from its partially correct work. It’s an all-or-nothing approach that fails to provide granular feedback and provides sparse rewards.

An alternative method is supervised fine-tuning (SFT), where the model learns from examples containing the full reasoning process laid out by experts. While SFT can instill reasoning abilities, it often leads to overfitting (the model simply learns to imitate the trajectories in the training data instead of learning to generalize to problems beyond the examples it has seen). This issue is made worse by the fact that high-quality, human-created training data is both scarce and expensive to produce. As the paper notes, these limitations leave “a critical gap for training small open-source models to effectively learn difficult problems.”

How supervised reinforcement learning works

SRL introduces a framework that reformulates problem-solving as a “sequential decision-making process,” striking a balance between pure outcome-based RL and pure imitation learning. Instead of optimizing only for the final answer or forcing the model to imitate an expert’s entire thought process, SRL teaches the model to reproduce a sequence of key actions that form the backbone of expert reasoning. This allows the model to learn to take actions similar to an expert while developing its own internal reasoning style.

In the SRL framework, expert demonstrations are broken down into a series of intermediate, concrete actions, each representing a meaningful step. For a math problem, an action might be an algebraic manipulation. For a software engineering agent, it could be a command executed in a code repository.
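As a rough illustration of how such an action sequence can be turned into the dense, step-wise reward described just below, here is a small sketch. The similarity metric and reward shaping used in the paper may differ, and the example actions are made up.

from difflib import SequenceMatcher

def action_similarity(predicted: str, expert: str) -> float:
    """Crude string-level similarity between two actions (a stand-in for the paper's metric)."""
    return SequenceMatcher(None, predicted.strip(), expert.strip()).ratio()

def stepwise_rewards(predicted_actions: list[str], expert_actions: list[str]) -> list[float]:
    """One dense reward per step, instead of a single sparse end-of-episode reward.

    predicted_actions: actions the policy commits to after its <think> monologue
    expert_actions:    the key actions extracted from a teacher or expert trajectory
    """
    return [
        action_similarity(pred, exp)
        for pred, exp in zip(predicted_actions, expert_actions)
    ]

# Example: even if a later step goes wrong, earlier correct steps still earn reward.
rewards = stepwise_rewards(
    ["factor the quadratic", "apply the quadratic formula", "simplify the roots"],
    ["factor the quadratic", "apply the quadratic formula", "check both roots"],
)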
To generate training data, SRL uses a powerful teacher model to create solution trajectories, which are then used to train a smaller model. According to I-Hung Hsu, a research scientist at Google and co-author of the paper, this middle-ground approach is key to its effectiveness in real-world scenarios. “SRL sits in the middle: It captures the structured flexibility of real-world problem solving, where there are multiple valid strategies but also clear notions of what ‘good reasoning’ looks like at each step,” Hsu told VentureBeat. “This makes SRL suitable for domains like data science automation or probably supply chain optimization — tasks that reward sound intermediate reasoning rather than mere final answers.”

During training, the model first generates an “inner monologue” (its internal reasoning process, enclosed in <think> tags) before committing to an action. At each step, SRL provides a reward based on the similarity between the model’s predicted action and the expert’s action. This step-wise reward system provides dense, fine-grained feedback, allowing the model to learn and improve even if its overall solution isn’t perfect. This solves the sparse reward problem RLVR faces.

SRL in action

The researchers’ experiments show that SRL significantly outperforms strong baselines in both challenging mathematical reasoning and agentic software engineering benchmarks. They also observed that SRL encourages more flexible and sophisticated reasoning patterns in models, such as interleaved planning and self-verification, which improve solution quality without just making the outputs longer.

For enterprise leaders, performance gains are only valuable if they don’t come with runaway costs. Hsu clarifies that SRL-trained models are more efficient in their reasoning. “The gains come from better reasoning quality and structure, not from verbosity,” he said. “In terms of efficiency, SRL-trained models are roughly on par with the base model in token usage… while SRL isn’t designed to reduce inference cost, it achieves stronger reasoning performance without increasing it.”

For the math tests, the team fine-tuned Qwen2.5-7B-Instruct on a dataset of 1,000 difficult math questions. They compared its performance against models trained with SFT and RLVR (using the GRPO algorithm common in models like DeepSeek-R1) on four competition-level math benchmarks. The SRL-trained model achieved a substantial 3.0% average performance boost over other methods.

The team extended SRL to agentic software engineering, a domain critical for enterprise automation. They trained a coding-specialized model, Qwen2.5-Coder-7B-Instruct, on 5,000 expert trajectories of agents interacting with a coding environment. The SRL-trained model was benchmarked against the original base model and SWE-Gym-7B, a strong baseline fine-tuned with SFT. SRL achieved a 14.8% task resolve rate, representing a 74% relative improvement over the SFT-based model. This shows SRL’s ability to train more competent AI agents for complex, real-world programming tasks.

A new standard for high-stakes AI?

The paper’s strongest results came from combining methods: First, using SRL to teach foundational reasoning, then using RLVR to refine that skill. In their experiments, when the researchers used SRL


ChatGPT Group Chats are here … but not for everyone (yet)

It was originally found in leaked code and publicized by AI influencers on X, but OpenAI has made it official: ChatGPT now offers Group Chats, allowing multiple users to join the same, single ChatGPT conversation and send messages to each other and the underlying large language model (LLM), online and via its mobile apps.

Imagine adding ChatGPT as another member of your existing group chats, allowing you to text it as you would one of your friends or family members and have them respond as well, and you’ll have an idea of the intriguing power and potential of this feature. However, the feature is only available as a limited pilot for now to ChatGPT users in Japan, New Zealand, South Korea, and Taiwan (all tiers, including free usage).

“Group chats are just the beginning of ChatGPT becoming a shared space to collaborate and interact with others,” OpenAI wrote in its announcement. This development builds on internal experimentation at OpenAI, where technical staffer Keyan Zhang said in a post on X that OpenAI’s team initially considered multiplayer ChatGPT to be “a wild, out-of-distribution idea.” According to Zhang, the model’s performance in those early tests demonstrated far more potential than existing interfaces typically allow.

The move follows OpenAI investor yet competitor Microsoft’s update of its Copilot AI assistant to allow group chats last month, as well as Anthropic’s introduction of shareable context and chat histories from its Claude AI models through its Projects feature introduced summer 2024, though this is not a simultaneous, realtime group chat in the same way.

Collaborative functionality integrated into ChatGPT

Group chats function as shared conversational spaces where users can plan events, brainstorm ideas, or collaborate on projects with the added support of ChatGPT. These conversations are distinct from individual chats and are excluded from ChatGPT’s memory system—meaning no data from these group threads is used to train or personalize future interactions.

Users can initiate a group chat by selecting the people icon in a new or existing conversation. Adding others creates a copy of the original thread, preserving the source dialogue. Participants can join via a shareable link and are prompted to create a profile with a name, username, and photo. The feature supports 1 to 20 participants per group. Each group chat is listed in a new section of the ChatGPT interface, and users can manage settings like naming the group, adding or removing participants, or muting notifications.

Powered by GPT-5.1 with expanded tools

The new group chat feature runs on GPT-5.1 Auto, a backend setting that chooses the optimal model based on the user’s subscription tier and the prompt. Functionality such as search, image generation, file upload, and dictation is available inside group conversations. Importantly, the system applies rate limits only when ChatGPT is producing responses. Direct messages between human users in the group do not count toward any plan’s message cap.

OpenAI has added new social features to ChatGPT in support of this group dynamic. The model can react with emojis, interpret conversational context to decide when to respond, and personalize generated content using members’ profile photos—such as inserting user likenesses into images when asked.

Privacy by default, controls for younger users

OpenAI emphasized that privacy and user control are integral to group chat design.
The feature operates independently of the user’s personalized ChatGPT memory, and no new memories are created from these interactions. Participation requires an invitation link, and members are always able to see who is in a chat or leave at any time.

Users under the age of 18 are automatically shielded from sensitive content in group chats. Parents or guardians can disable group chat access altogether via built-in parental controls. Group creators retain special permissions, including immunity from being removed by others. All other participants can be added or removed by group members.

A testbed for shared AI experiences

OpenAI frames group chats as an early step toward richer, multi-user applications of AI, hinting at broader ambitions for ChatGPT as a shared workspace. The company expects to expand access over time and refine the feature based on how early users engage with it. Keyan Zhang’s post suggests that the underlying model capabilities are far ahead of the interfaces users currently interact with. This pilot, in OpenAI’s view, offers a new “container” where more of the model’s latent capacity can be surfaced. “Our models have a lot more room to shine than today’s experiences show, and the current containers only use a fraction of their capabilities,” Zhang said.

With this initial pilot focused on a limited set of markets, OpenAI is likely monitoring both usage patterns and cultural fit as it plans for broader deployment. For now, the group chat experiment offers a new way for users to interact with ChatGPT—and with each other—in real time, using a conversational interface that blends productivity and personalization.

Developer access: Still unclear

OpenAI has not provided any indication that Group Chats will be accessible via the API or SDK. The current rollout is framed strictly within the ChatGPT product environment, with no mention of tool calls, developer hooks, or integration support for programmatic use. This absence of signaling leaves it unclear whether the company views group interaction as a future developer primitive or as a contained UX feature for end users only.

For enterprise teams exploring how to replicate multi-user collaboration with generative models, any current implementation would require custom orchestration—such as managing multi-party context and prompts across separate API calls, and handling session state and response merging externally. Until OpenAI provides formal support, Group Chats remain a closed interface feature rather than a developer-accessible capability.

Implications for enterprise AI and data leaders

For enterprise teams already leveraging AI platforms—or preparing to—OpenAI’s group chat feature introduces a new layer of multi-user collaboration that could shift how generative models are deployed across workflows. While


The Download: how AI really works, and phasing out animal testing

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

OpenAI’s new LLM exposes the secrets of how AI really works

The news: ChatGPT maker OpenAI has built an experimental large language model that is far easier to understand than typical models.

Why it matters: It’s a big deal, because today’s LLMs are black boxes: Nobody fully understands how they do what they do. Building a model that is more transparent sheds light on how LLMs work in general, helping researchers figure out why models hallucinate, why they go off the rails, and just how far we should trust them with critical tasks. Read the full story.

—Will Douglas Heaven

Google DeepMind is using Gemini to train agents inside Goat Simulator 3

Google DeepMind has built a new video-game-playing agent called SIMA 2 that can navigate and solve problems in 3D virtual worlds. The company claims it’s a big step toward more general-purpose agents and better real-world robots.

The company first demoed SIMA (which stands for “scalable instructable multiworld agent”) last year. But this new version has been built on top of Gemini, the firm’s flagship large language model, which gives the agent a huge boost in capability. Read the full story.

—Will Douglas Heaven

These technologies could help put a stop to animal testing

Earlier this week, the UK’s science minister announced an ambitious plan: to phase out animal testing. Testing potential skin irritants on animals will be stopped by the end of next year. By 2027, researchers are “expected to end” tests of the strength of Botox on mice. And drug tests in dogs and nonhuman primates will be reduced by 2030.

It’s good news for activists and scientists who don’t want to test on animals. And it’s timely too: In recent decades, we’ve seen dramatic advances in technologies that offer new ways to model the human body and test the effects of potential therapies, without experimenting on animals. Read the full story.

—Jessica Hamzelou

This article first appeared in The Checkup, MIT Technology Review’s weekly biotech newsletter. To receive it in your inbox every Thursday, and read articles like this first, sign up here.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 Chinese hackers used Anthropic’s AI to conduct an espionage campaign
It automated a number of attacks on corporations and governments in September. (WSJ $)
+ The AI was able to handle the majority of the hacking workload itself. (NYT $)
+ Cyberattacks by AI agents are coming. (MIT Technology Review)

2 Blue Origin successfully launched and landed its New Glenn rocket
It managed to deploy two NASA satellites into space without a hitch. (CNN)
+ The New Glenn is the company’s largest reusable rocket. (FT $)
+ The launch had been delayed twice before. (WP $)

3 Brace yourself for flu season
It started five weeks earlier than usual in the UK, and the US is next. (Ars Technica)
+ Here’s why we don’t have a cold vaccine. Yet. (MIT Technology Review)

4 Google is hosting a Border Protection facial recognition app
The app alerts officials whether to contact ICE about identified immigrants. (404 Media)
+ Another effort to track ICE raids was just taken offline. (MIT Technology Review)

5 OpenAI is trialling group chats in ChatGPT
It’d essentially make AI a participant in a conversation of up to 20 people. (Engadget)

6 A TikTok stunt sparked debate over how charitable America’s churches really are
Content creator Nikalie Monroe asked churches for help feeding her baby. Very few stepped up. (WP $)

7 Indian startups are attempting to tackle air pollution
But their solutions are far beyond the means of the average Indian household. (NYT $)
+ OpenAI is huge in India. Its models are steeped in caste bias. (MIT Technology Review)

8 An AI tool could help reduce wasted efforts to transplant organs
It predicts how likely the would-be recipient is to die during the brief transplantation window. (The Guardian)
+ Putin says organ transplants could grant immortality. Not quite. (MIT Technology Review)

9 3D-printing isn’t making prosthetics more affordable
It turns out that plastic prostheses are often really uncomfortable. (IEEE Spectrum)
+ These prosthetics break the mold with third thumbs, spikes, and superhero skins. (MIT Technology Review)

10 What happens when relationships with AI fall apart
Can you really file for divorce from an LLM? (Wired $)
+ It’s surprisingly easy to stumble into a relationship with an AI chatbot. (MIT Technology Review)

Quote of the day

“It’s a funky time.”

—Aileen Lee, founder and managing partner of Cowboy Ventures, tells TechCrunch the AI boom has torn up the traditional investment rulebook.

One more thing

Restoring an ancient lake from the rubble of an unfinished airport in Mexico City

Weeks after Mexican President Andrés Manuel López Obrador took office in 2018, he controversially canceled ambitious plans to build an airport on the deserted site of the former Lake Texcoco—despite the fact it was already around a third complete.

Instead, he tasked Iñaki Echeverria, a Mexican architect and landscape designer, with turning it into a vast urban park, an artificial wetland that aims to transform the future of the entire Valley region. But as López Obrador’s presidential team nears its end, the plans for Lake Texcoco’s rebirth could yet vanish. Read the full story.

—Matthew Ponsford

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet ’em at me.)

+ Maybe Gen Z is onto something when it comes to vibe dating.
+ Trust AC/DC to give the fans what they want, performing Jailbreak for the first time since 1991.
+ Nieves González, the artist behind Lily Allen’s new album cover, has an eye for detail.
+ Here’s what AI determines is a catchy tune.


OpenAI Researchers Train Weight Sparse Transformers to Expose Interpretable Circuits

If neural networks are now making decisions everywhere from code editors to safety systems, how can we actually see the specific circuits inside that drive each behavior? OpenAI has introduced a new mechanistic interpretability research study that trains language models to use sparse internal wiring, so that model behavior can be explained using small, explicit circuits.

Paper: https://cdn.openai.com/pdf/41df8f28-d4ef-43e9-aed2-823f9393e470/circuit-sparsity-paper.pdf

Training transformers to be weight sparse

Most transformer language models are dense. Each neuron reads from and writes to many residual channels, and features are often in superposition. This makes circuit level analysis difficult. Previous OpenAI work tried to learn sparse feature bases on top of dense models using sparse autoencoders. The new research work instead changes the base model so that the transformer itself is weight sparse.

The OpenAI team trains decoder only transformers with an architecture similar to GPT-2. After each optimizer step with the AdamW optimizer, they enforce a fixed sparsity level on every weight matrix and bias, including token embeddings. Only the largest magnitude entries in each matrix are kept. The rest are set to zero. Over training, an annealing schedule gradually drives the fraction of non zero parameters down until the model reaches a target sparsity. In the most extreme setting, roughly 1 in 1000 weights is non zero.

Activations are also somewhat sparse. Around 1 in 4 activations are non zero at a typical node location. The effective connectivity graph is therefore very thin even when the model width is large. This encourages disentangled features that map cleanly onto the residual channels the circuit uses.

Measuring interpretability through task specific pruning

To quantify whether these models are easier to understand, the OpenAI team does not rely on qualitative examples alone. The researchers define a suite of simple algorithmic tasks based on Python next token prediction. One example, single_double_quote, requires the model to close a Python string with the right quote character. Another example, set_or_string, requires the model to choose between .add and += based on whether a variable was initialized as a set or a string.

For each task, they search for the smallest subnetwork, called a circuit, that can still perform the task up to a fixed loss threshold. The pruning is node based. A node is an MLP neuron at a specific layer, an attention head, or a residual stream channel at a specific layer. When a node is pruned, its activation is replaced by its mean over the pretraining distribution. This is mean ablation. The search uses continuous mask parameters for each node and a Heaviside style gate, optimized with a straight through estimator like surrogate gradient. The complexity of a circuit is measured as the count of active edges between retained nodes. The main interpretability metric is the geometric mean of edge counts across all tasks.
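Before turning to example circuits, here is a minimal, assumed PyTorch-style sketch of the kind of magnitude-based projection described in the training section above; the paper’s exact annealing schedule, kernels, and per-matrix handling differ.

import torch

def enforce_weight_sparsity(model: torch.nn.Module, keep_fraction: float) -> None:
    """Zero out all but the largest-magnitude entries of every parameter tensor.

    Illustrative sketch only: the paper applies a comparable projection to all
    weights and biases (including embeddings) after each AdamW step, with
    keep_fraction annealed toward roughly 1e-3 over training.
    """
    with torch.no_grad():
        for param in model.parameters():
            k = max(1, int(keep_fraction * param.numel()))
            flat = param.abs().flatten()
            # The k-th largest magnitude acts as the keep threshold.
            threshold = torch.topk(flat, k, largest=True).values.min()
            param.mul_((param.abs() >= threshold).to(param.dtype))

# Usage inside a training loop (sketch):
# loss.backward(); optimizer.step(); optimizer.zero_grad()
# keep_fraction = anneal(step)   # hypothetical schedule decaying from 1.0 toward 0.001
# enforce_weight_sparsity(model, keep_fraction)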
Example circuits in sparse transformers

On the single_double_quote task, the sparse models yield a compact and fully interpretable circuit. In an early MLP layer, one neuron behaves as a quote detector that activates on both single and double quotes. A second neuron behaves as a quote type classifier that distinguishes the two quote types. Later, an attention head uses these signals to attend back to the opening quote position and copy its type to the closing position.

In circuit graph terms, the mechanism uses 5 residual channels, 2 MLP neurons in layer 0, and 1 attention head in a later layer with a single relevant query key channel and a single value channel. If the rest of the model is ablated, this subgraph still solves the task. If these few edges are removed, the model fails on the task. The circuit is therefore both sufficient and necessary in the operational sense defined by the paper.

For more complex behaviors, such as type tracking of a variable named current inside a function body, the recovered circuits are larger and only partially understood. The research team shows an example where one attention operation writes the variable name into the token set() at the definition, and another attention operation later copies the type information from that token back into a later use of current. This still yields a relatively small circuit graph.

Key Takeaways

- Weight-sparse transformers by design: OpenAI trains GPT-2 style decoder only transformers so that almost all weights are zero, around 1 in 1000 weights is non zero, enforcing sparsity across all weights and biases including token embeddings, which yields thin connectivity graphs that are structurally easier to analyze.
- Interpretability is measured as minimal circuit size: The work defines a benchmark of simple Python next token tasks and, for each task, searches for the smallest subnetwork, in terms of active edges between nodes, that still reaches a fixed loss, using node level pruning with mean ablation and a straight through estimator style mask optimization.
- Concrete, fully reverse engineered circuits emerge: On tasks such as predicting matching quote characters, the sparse model yields a compact circuit with a few residual channels, 2 key MLP neurons and 1 attention head that the authors can fully reverse engineer and verify as both sufficient and necessary for the behavior.
- Sparsity delivers much smaller circuits at fixed capability: At matched pre-training loss levels, weight sparse models require circuits that are roughly 16 times smaller than those recovered from dense baselines, defining a capability interpretability frontier where increased sparsity improves interpretability while slightly reducing raw capability.

Editorial Comments

OpenAI’s work on weight sparse transformers is a pragmatic step toward making mechanistic interpretability operational. By enforcing sparsity directly in the base model, the paper turns abstract discussions of circuits into concrete graphs with measurable edge counts, clear necessity and sufficiency tests, and reproducible benchmarks on Python next token tasks. The models are small and inefficient, but the methodology is relevant for future safety audits and debugging workflows. This research treats interpretability as a first class design constraint rather than an after the fact diagnostic. Check out the Paper, GitHub Repo and Technical details.


PustakAI: Curriculum-Aligned and Interactive Textbooks Using Large Language Models

arXiv:2511.10002v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like content. This has revolutionized various sectors such as healthcare, software development, and education. In education, LLMs offer potential for personalized and interactive learning experiences, especially in regions with limited teaching resources. However, adapting these models effectively to curriculum-specific content, such as the National Council of Educational Research and Training (NCERT) syllabus in India, presents unique challenges in terms of accuracy, alignment, and pedagogical relevance. In this paper, we present the framework “PustakAI” (Pustak means ‘book’ in many Indian languages) for the design and evaluation of a novel question-answering dataset “NCERT-QA” aligned with the NCERT curriculum for English and Science subjects of grades 6 to 8. We classify the curated QA pairs as Factoid, Inferential, and Others (evaluative and reasoning). We evaluate the dataset with various prompting techniques, such as meta-prompt, few-shot, and CoT-style prompting, using diverse evaluation metrics to understand which approach aligns more efficiently with the structure and demands of the curriculum. Along with the usability of the dataset, we analyze the strengths and limitations of current open-source LLMs (Gemma3:1b, Llama3.2:3b, and Nemotron-mini:4b) and high-end LLMs (Llama-4-Scout-17B and Deepseek-r1-70B) as AI-based learning tools in formal education systems.


Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning

arXiv:2509.11816v2 Announce Type: replace-cross Abstract: Current unlearning and safety training methods consistently fail to remove dangerous knowledge from language models. We identify the root cause – unlearning targets representations which are too general – and develop a highly selective technique that unlearns robustly while preserving general performance. Our method performs PCA on activations and module-output gradients to identify subspaces containing common representations, then collapses these subspaces before computing unlearning updates, a technique we term Collapse of Irrelevant Representations (CIR). This avoids unlearning general knowledge and targets only representations specific to the facts being unlearned. When unlearning bio- and cyber-hazardous facts from Llama-3.1-8B, we achieve over 30x greater reduction in post-attack accuracy than the best baseline (Circuit Breakers), while disrupting general performance 30x less, and using less than 3 GPU-seconds per fact. Thus, by disentangling harmful and benign capabilities at the level of representations, CIR enables robust and non-disruptive unlearning.
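A loose sketch of the core idea, projecting out common (general) directions found by PCA before an unlearning update, is shown below. It illustrates the abstract only: the actual method applies the collapse to both activations and module-output gradients before computing the unlearning updates, and the helper mentioned in the comments is hypothetical.

# Illustrative sketch of "collapse of irrelevant representations": find directions that are
# common across general data via PCA, then remove them before an unlearning update.
import torch

def common_directions(x: torch.Tensor, k: int) -> torch.Tensor:
    """Top-k principal directions of a [num_samples, d] matrix of activations or gradients."""
    centered = x - x.mean(dim=0, keepdim=True)
    _, _, vt = torch.linalg.svd(centered, full_matrices=False)
    return vt[:k]                                  # [k, d] orthonormal rows

def collapse(x: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Remove the component of x lying in the span of `basis` (the common/general subspace)."""
    return x - (x @ basis.T) @ basis

# Hypothetical usage: collapse shared directions out of the quantities used to build the
# unlearning update for a specific harmful fact, so only fact-specific structure is affected.
# basis = common_directions(general_activations, k=16)
# fact_specific = collapse(fact_activations, basis)   # feed this into the unlearning update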
