YouZum

News

AI, Committee, News, Uncategorized

OpenAI’s Deployment Simulation Extends Pre-Deployment Risk Assessment to Agentic Coding Through Simulated Tool Calls

OpenAI has published a new pre-deployment safety method called Deployment Simulation. The idea is direct: before a model ships, simulate its deployment. Replay past conversations through the new candidate model, then study how it behaves in realistic contexts. OpenAI already uses insights from the method during model development. It has informed mitigations and deployment decisions, and it has surfaced blind spots in traditional evaluations.

https://cdn.openai.com/pdf/predicting-llm-safety-before-release-by-simulating-deployment.pdf

Understanding Deployment Simulation

Deployment Simulation is a method for simulating a future deployment before it happens. OpenAI does this by replaying previous conversations with a new candidate model, and the replay is privacy-preserving.

The technique is simple at its core. Take recent conversations from deployment. Remove the original assistant response from the older model. Regenerate that response with the candidate model to be released. Then evaluate the completions for new failure modes.

From those completions, OpenAI estimates how often undesired behavior will occur at deployment time. The same measurement can run after release on real traffic, which makes pre-deployment forecasts checkable later.

There is a floor. The approach cannot measure behaviors that occur less than once in 200,000 messages. It targets non-tail risks, not the rarest events.

How the Pipeline Works

Traditional evaluations mix synthetic, manually written, and production prompts, chosen to be difficult, high severity, or adversarial. Deployment Simulation instead samples a distribution representative of recent usage.

That representativeness addresses three known problems. It reduces selection bias from hand-picked prompts. It improves coverage by simply simulating more traffic. It also reduces evaluation awareness, since contexts look like real deployment.

There is a clear tradeoff: quality scales with compute, not with manual effort to build evals.
More resampled traffic means more behaviors surfaced.

Here is the core estimation loop as runnable Python. The model and grader are mocked, so the logic runs end-to-end. It mirrors the method, not OpenAI’s code.

```python
import random

# Deployment Simulation: core loop (runnable mock).
# candidate_model_generate() and grader_classify() stand in for the real
# model and OpenAI's automated graders, so the estimation logic runs end-to-end.

TRUE_RATE = 10 / 100_000  # true per-message rate of the undesired behavior


def candidate_model_generate(prefix):
    return "<regenerated response>"  # placeholder for the new model


def grader_classify(prefix, completion):
    return random.random() < TRUE_RATE  # mock grader fires at the true rate


def simulate_deployment(conversations):
    flagged = total = 0
    for convo in conversations:  # de-identified production chats
        prefix = convo[:-1]  # remove original assistant response
        completion = candidate_model_generate(prefix)  # regenerate with new model
        if grader_classify(prefix, completion):  # search for failure modes
            flagged += 1
        total += 1
    return flagged / total  # estimate, checkable after release


conversations = [["user message", "old assistant message"] for _ in range(100_000)]
rate = simulate_deployment(conversations)
print(f"estimated rate: {rate * 100_000:.1f} per 100k")
# example output: estimated rate: 9.0 per 100k
```

Traditional Evaluations vs Deployment Simulation

The OpenAI team frames the two approaches as complementary. The table below summarizes the published comparison.
| Area | Traditional Evaluations | Deployment Simulation |
|------|-------------------------|-----------------------|
| Coverage | Better for low-prevalence, high-severity risks needing guaranteed coverage | Better for the spectrum of risks likely seen in deployment traffic |
| Representativeness / bias | Focused on targeted past or predicted risks, even very rare ones | Representative of risks above a certain prevalence in current traffic |
| Evaluation awareness | Synthetic data can carry obvious test cues | Deployment-like context appears to mitigate eval awareness |
| Effort | Custom prompt distributions need significant manual work | One-time infrastructure cost, then reuses prefixes and graders |

The key difference is detectability. Models recognize synthetic tests. They struggle to tell simulated traffic from real deployment.

How OpenAI Tested It

OpenAI pre-registered predictions for 20 types of undesirable behavior on GPT-5.4 Thinking. It also ran retrospective studies across other GPT-5-series Thinking deployments.

The dataset was large. OpenAI analyzed roughly 1.3 million de-identified conversations. These spanned GPT-5 Thinking through GPT-5.4, from August 2025 to March 2026. Only traffic from users who allow their data to be used for model improvement was included.

Three quality bars were measured. Taxonomy coverage checks for missed misaligned behaviors. Directional accuracy checks whether a predicted increase or decrease matched what happened. Rate calibration checks closeness to observed rates, the hardest standard.

The aggregate result was a median multiplicative error of 1.5x. For a true rate of 10 in 100k, that means estimating 15 or 6.67 in 100k. Tail errors can reach roughly 10x, which OpenAI expects to reduce.

One novel misalignment surfaced in the studied window. It was ‘calculator hacking’ in GPT-5.1. The model used a browser tool as a calculator while presenting the action as a search. Automated auditing would have caught it before release.
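The 1.5x figure can be made concrete with a few lines of Python. This is a minimal sketch that assumes multiplicative error is the symmetric ratio max(predicted/observed, observed/predicted), which is consistent with the "15 or 6.67 in 100k" example; the function name and the rate pairs are hypothetical illustration values, not figures from the paper.

```python
from statistics import median


def multiplicative_error(predicted, observed):
    """Symmetric ratio error: 1.0 is a perfect forecast, 1.5 is off by 1.5x either way."""
    if predicted <= 0 or observed <= 0:
        raise ValueError("rates must be positive")
    return max(predicted / observed, observed / predicted)


# Hypothetical (predicted, observed) rates per 100k messages, one pair per behavior.
pairs = [(15.0, 10.0), (6.67, 10.0), (10.0, 10.0), (30.0, 10.0), (2.0, 10.0)]

errors = [multiplicative_error(p, o) for p, o in pairs]
median_error = median(errors)
print(f"median multiplicative error: {median_error:.2f}x")
```

Over- and under-estimates are treated alike under this definition, so a forecast of 15 and a forecast of 6.67 against a true rate of 10 both count as roughly 1.5x.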
Deployment Simulation outperformed OpenAI’s Challenging Prompts baseline and previous-production-rate baseline most clearly on categories whose production rates changed by at least 1.5×. Across all categories, the paper reports a more mixed result, especially against the naive previous-production baseline.

Reducing Evaluation Awareness

Models increasingly reason about whether they are being tested. That distorts alignment measurements. A model behaving well only under testing is a real problem.

Deployment Simulation reduces this distortion. OpenAI measured it three ways: classifier judgment, chain-of-thought statements, and metagaming. GPT-5.2 labeled real production traffic as evaluation-like 5.4% of the time. It labeled simulated traffic as evaluation-like 5.1% of the time. Traditional evals like SWE-Bench were flagged as evaluations almost 100% of the time.

Extending to Agentic Tool Use

The method also reaches agentic settings. OpenAI used 120,000 internal employee agentic trajectories from GPT-5.4. These simulated an internal deployment of coding agents based on GPT-5.5.

Live tool calls are risky and unstable here, so tool calls were simulated with another LLM instead. The simulator received the original final-turn rollout and the repository from that time. It also got a tool-call/response database and read-only connectors.

Those affordances mattered for realism. A discriminator’s win rate moved from 11.6% to 49.5%, near the 50% chance level. Resampling became nearly indistinguishable from real sampling.

Use Cases With Examples

Pre-deployment frequency forecasting: Estimate how often a model emits disallowed content before launch. Validate the forecast on real traffic after release.

Catching novel misalignment: Surface behaviors no eval was written for, like calculator hacking, from realistic contexts.

Internal deployment risk assessment: Assess coding agents before


AI, Committee, News, Uncategorized

MiniMax Sparse Attention (MSA): a Two-Branch Block-Sparse Attention Trained on a 109B-Parameter MoE With a 3T-Token Budget

MiniMax released MSA (MiniMax Sparse Attention), a sparse attention method built directly on Grouped Query Attention (GQA). It targets one bottleneck: the quadratic cost of softmax attention at long context. The MiniMax research team tested it inside a 109B-parameter Mixture-of-Experts model trained with native multimodal data. They also open-sourced an inference kernel and shipped a production model, MiniMax-M3.

What is MSA (MiniMax Sparse Attention)

MSA factors attention into two stages: an Index Branch and a Main Branch. The Index Branch decides which key-value blocks each query should read. The Main Branch then runs exact softmax attention over only those blocks.

Selection happens at block granularity, not per token. The default block size is B_k = 128 tokens. Each query and GQA group keeps k = 16 blocks. That fixes the per-query budget at k × B_k = 2,048 key-value tokens.

The two cost structures differ. Dense GQA attention scales per query as O(N), the full context. MSA scales as O(k × B_k), which stays fixed as N grows. The compute gap therefore widens as context length increases.

Selection is shared inside each GQA group but independent across groups. One key-value head serves several query heads, and they share one block set. Different groups can attend to different long-range regions.

How the Two Branches Work

The Index Branch adds only two projection matrices to a standard GQA layer. It defines one index query head per GQA group and one shared index key head. It scores visible key tokens, then max-pools those scores to the block level.

A Top-k operator then selects the highest-scoring blocks per query and group. The local block containing the query is always included. This prevents the selector from dropping the query’s immediate neighborhood.

The Main Branch gathers causally visible tokens from the selected blocks. It applies scaled dot-product softmax attention restricted to those tokens.
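The selection step described above (score tokens, max-pool to blocks, take Top-k with the local block forced in) can be sketched in plain Python for a single query in a single GQA group. This is an illustration only, not MiniMax's kernel: `select_blocks` and `token_scores` are hypothetical names, the scores are toy values rather than outputs of a learned index projection, and the sketch assumes the forced local block occupies one of the k slots.

```python
def select_blocks(token_scores, block_size=128, k=16):
    """Pick up to k key-value blocks for one query in one GQA group.

    token_scores[i] is the index score of causally visible token i;
    the query sits at the last visible position, len(token_scores) - 1.
    """
    n = len(token_scores)
    if n == 0:
        raise ValueError("at least one visible token is required")

    # Max-pool token-level scores down to one score per block.
    block_scores = [
        max(token_scores[start:start + block_size])
        for start in range(0, n, block_size)
    ]

    # The block containing the query is always kept.
    local = (n - 1) // block_size

    # Rank the remaining blocks by pooled score and fill the other k - 1 slots.
    others = sorted(
        (b for b in range(len(block_scores)) if b != local),
        key=lambda b: block_scores[b],
        reverse=True,
    )
    return sorted([local] + others[:k - 1])


# Toy context of 4,096 visible tokens (32 blocks) with two high-scoring regions.
token_scores = [0.0] * 4096
token_scores[5] = 5.0             # inside block 0
token_scores[10 * 128 + 7] = 3.0  # inside block 10

selected = select_blocks(token_scores, k=3)
print(selected)  # block 0, block 10, and the local block 31
```

With the defaults, the number of key-value tokens read per query is capped at 16 × 128 = 2,048 however long `token_scores` becomes, which is the fixed budget the article describes.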
Each query head keeps its own query projection but shares the group’s block set.

A visualization in the report shows what the learned indexer selects. Heads concentrate on the local diagonal and the first block. They reserve the rest of the budget for a few long-range stripes.

https://arxiv.org/pdf/2606.13392v1

How MSA is Trained

Top-k selection is non-differentiable, so the language-modeling loss cannot train the index projections. MSA solves this with a KL alignment loss. The loss matches the Index Branch distribution to the Main Branch attention pattern. The teacher is the group-averaged Main Branch distribution over the selected tokens.

Three mechanisms stabilize sparse training. Gradient Detach applies stop-gradient to the Index Branch input. This confines the KL loss to the index projections, not the backbone. Without it, larger KL coefficients caused gradient spikes and loss divergence. Indexer Warmup runs full attention in both branches for the first iterations, so the indexer learns from the KL loss before it controls routing. The forced Local Block reserves one slot for nearby context.

Ablations shaped the final recipe. An early variant added an Index Branch value head with its own output. Once warmup is used, that value head is no longer necessary. The final design drops it on efficiency grounds.

MSA supports two training routes. MSA-PT trains from scratch after a 40B-token indexer warmup. MSA-CPT converts a dense GQA checkpoint trained on 2.6T tokens. It then continues for 400B tokens, including 40B tokens of warmup.

The Kernel Co-Design

Theoretical sparsity does not become speed without a matching GPU path. MSA pairs the algorithm with two kernel ideas.

The first is exp-free Top-k selection. Softmax preserves order, so ranking raw scores yields identical indices. The kernel skips the max, exp, and sum steps before selection. At 128K context with k = 16, it ran 5.1× faster than torch.topk.
It also beat the TileLang radix-select kernel by 3.7×.

The second is KV-outer sparse attention with query gather. Iterating over KV blocks raises arithmetic intensity versus iterating over queries. The kernel packs ⌈128/G⌉ query positions into one 128×128 score MMA. A two-phase forward splits the attention and combine steps across CTAs.

The open-source kernel, fmha_sm100, targets NVIDIA SM100 GPUs. It ships dense FlashAttention plus sparse Top-k kernels under an MIT license. It supports BF16, FP8, NVFP4, and FP4 precision.

How MSA Compares To Other Sparse Methods

The research team positions MSA against four natively trained sparse designs. The table below summarizes the differences it describes.

| Method | Backbone | Selection granularity | Indexer / selection signal |
|--------|----------|-----------------------|----------------------------|
| MSA | GQA | Block-level (B_k = 128), per-GQA-group Top-k | KL alignment loss |
| NSA | MQA / MHA | Compressed + selected blocks + sliding window | Native (end-to-end) training |
| InfLLM-V2 | Dense-sparse switchable | Parameter-free block selection + sliding window | Parameter-free (no trained indexer) |
| MoBA | GQA | Very large KV blocks (block-averaged keys) | LM gradient only |
| DSA | MLA (MQA mode) | Token-level; single Top-k shared across heads | ReLU lightning indexer |

MSA’s distinguishing pair is per-GQA-group Top-k sharing combined with block-level selection. This keeps KV reads contiguous while giving each group its own retrieval.

The quality side holds up. Both sparse models stay broadly competitive with the Full-Attention baseline. The table below shows representative results under the 3T-token budget.

| Benchmark | Full | MSA-PT | MSA-CPT |
|-----------|------|--------|---------|
| MMLU | 67.0 | 67.2 | 66.8 |
| GSM8K | 76.2 | 77.7 | 73.7 |
| HumanEval | 61.0 | 64.0 | 57.9 |
| RULER-8K | 79.8 | 84.2 | 77.2 |
| RULER-32K | 75.0 | 77.5 | 75.7 |
| VideoMME | 41.11 | 45.48 | 39.65 |

After long-context extension, MSA-CPT stayed close to Full on HELMET-128K and RULER-128K. Each query still attends to only 2,048 key-value tokens.

Use Cases With Examples

MSA targets workloads where context length is the binding deployment constraint.
Long-horizon agents: An agent that spans hundreds of reasoning and action steps accumulates a large transcript. Dense attention over that history grows quadratically. MSA holds the per-query budget at 2,048 tokens regardless of length.

Repository-scale code reasoning: A coding agent loading a full repository can exceed hundreds of thousands of tokens. The indexer routes each query to the few relevant blocks. Irrelevant files stay outside the selected set.

Persistent memory: A long-running assistant keeps growing conversational state. MSA reads a fixed-size slice of the most relevant blocks per query. The decoding cost stays roughly flat as memory grows.

Long video understanding:


AI, Committee, News, Uncategorized

Entrepreneurs in Nairobi make the case for going solar

THE PLACE
Nairobi, Kenya

Most of Kenya’s power grid runs on renewables. But with 25% of communities lacking centralized electricity, the nation is looking to off-grid solar to hit its goal of delivering universal electricity access by 2030 without driving up emissions. The ever-improving economics of solar technology have helped. A couple of years ago, a panel cost about $3 a watt; now it’s down to cents.

On the margins of a bustling Nairobi, we wind past a mix of high-rises and hardware shops interspersed with small plots growing corn or potatoes. After a few minutes, we arrive at a street-side stall run by the bespectacled Milcah Wanjiru. She sells plenty of half-liter packets of milk, loaves of bread, and matches, but Wanjiru’s core business is a service: She mills corn flour for local residents, which they most often use in ugali, a common Kenyan dish that is similar to polenta, albeit less creamy.

In the middle of her small shop, a milling machine stands on three adjustable legs. “Whenever customers came to mill their grain, they asked for other goods,” says Wanjiru, “and this is how I got to stock these other items.”

Shops with a grain mill are common here in rural areas and most neighborhoods, especially low-income ones, even in the city. But most of these mills burn diesel fuel. Hers? It runs on either solar energy or electricity from the grid.

Matt Carr, the CEO and cofounder of Agsol, the company that designed Wanjiru’s mill, is here with me, visiting to get her feedback on his product. One issue bothers her. “It can be slow,” Wanjiru tells Carr, explaining that grains can get stuck in the front chamber where they feed into the machine. Sometimes, the whole thing jams.

Carr says the mill automatically reduces its speed if the grain is at all damp, so that the pulverizing hammers within can squeeze out as much flour as possible. That process can unfortunately lead to the problem she’s describing.
Overall, Wanjiru seems happy with the machine, which she’s been using since December 2025. It makes running her business cheaper. About 40% of what shop owners who use diesel-powered mills charge customers goes toward paying for fuel, according to Carr, whereas operating Agsol’s solar-powered machine can be up to 80% more profitable once the initial cost (about $1,300) is paid off, which takes between six and 12 months. Wanjiru also likes the fact that—unlike diesel-burning models—her mill can handle very small amounts of grain, which has brought a few new customers her way.   Carr launched the first Agsol product in 2018 in Kenya and has raised over $4 million of investment—much of that via a UK government program that supports clean energy projects in the region. Last year, Agsol sold 530 units. The company, which is based just outside Nairobi, has received orders from as far as Mozambique and Angola. As we say goodbye to Wanjiru, she turns and bends over burlap sacks half full of peanuts, mung beans, rice, and millet, arranged neatly on wooden pallets on the cement floor. She lifts a scoopful from one of the sacks and dumps its contents on a scale. A customer waits to be served.  Geoffrey Kamadi is an award-winning freelance journalist based in Nairobi, focusing on science, climate change, environment, technology, and development. 


AI, Committee, News, Uncategorized

Hacking the atmosphere: Geoengineering gets a reality check

Jim Franke pulls away the cover page of a presentation on the wraparound desk in his office, revealing an illustration of an odd-­looking aircraft with massive wings stretching out from a stubby fuselage. The uncrewed plane is soaring thousands of meters higher than commercial jets fly—so high you can see the curvature of the Earth. It’s precisely the type of aircraft one would need to begin artificially cooling the planet. Those outsize wings would keep the plane and its payload aloft in the stratosphere, about a dozen miles (or 20 kilometers) above the surface, where the air is much thinner—as little as 5% the density near the ground. Once at altitude, the plane would release materials that could, after a few steps of chemistry, reflect sunlight back into space. “If you want to get to 20 kilometers in the near term, this is probably the best bet,” says Franke, a research assistant professor at the University of Chicago. Franke is one of a small but growing cohort of scientists focused on the engineering challenges associated with solar geoengineering, the controversial idea that we could deliberately intervene in the climate system to counteract global warming. The concept came from volcanoes. Massive eruptions in the past have reduced temperatures worldwide by blasting sulfur dioxide and other compounds into the stratosphere, where they convert into sunlight-scattering particles. Hundreds of studies in recent decades have suggested that a human attempt to mimic this mechanism would work quickly and efficiently—at least within the confines of climate models. But these computer simulations are approximations of how the real world works. They gloss over numerous challenges. Like the fact that aircraft capable of carrying the necessary loads to the necessary altitudes don’t exist. Or that we don’t know for sure how to release material so that most of it turns into tiny reflective aerosols instead of, say, clumping together and falling out of the sky. 
Or even what specific substance we would want to load onto an aircraft, given open questions about safety, cost, and effectiveness.  Amid these compounding unknowns, more and more research on solar geoengineering is moving beyond computer simulations, delving into the detailed design and practical engineering work that would be needed before we could carry out a campaign to dial down temperatures. The tasks required range from inventing high-altitude aircraft to mastering the precise chemistry and delivery mechanisms for dispersing materials to building out the monitoring infrastructure that we’ll need in order to know if any of it actually works. The question of whether we should geoengineer the planet has no clear-cut answer. It might save millions of lives by reducing the dangers of catastrophic heat waves, floods, droughts, and famines. But many fear it’s too dangerous to even consider, much less seriously study, arguing that we can’t possibly predict the spiraling consequences of manipulating such large, complex, interconnected planetary systems.  Critics argue that the building momentum in this phase of research will make it ever more likely that someone, somewhere in the world, will eventually pull the trigger on geoengineering, no matter the remaining unknowns or the dangers for certain parts of the world.   “I do think it’s very dangerous because of what we know about science and technology,” says Jennie Stephens, a professor of climate justice at Maynooth University in Ireland. “The more investment that’s made, the further the advances, the more likely it is that it will be deployed.” But proponents of this practical research argue that playing out how we’d mount a solar geoengineering program will improve our understanding of the potential benefits and risks, helping to ensure that if anyone does try to tweak the climate, they might at least do so in an informed and potentially safer way. 
It’s still very much a niche field. Much of the work now underway is happening at the Climate Systems Engineering Initiative (CSEi) at the University of Chicago, which formally launched in 2024 under the leadership of the prominent geoengineering researcher David Keith.

Franke, a professional engineer before earning his doctorate in geosciences, is overseeing a series of overlapping research projects and collaborations aimed at resolving many of the engineering uncertainties. That includes working out the designs now on his desk: renderings of the type of aircraft that could be used in the initial phase of a geoengineering program.

Franke argues that more computer simulations are simply not going to answer the big remaining questions in the field, including the most compelling one: the “boogeyman” of what could go wrong.

“I’m kind of personally skeptical that additional model development or more simulations are going to satisfactorily resolve those things,” he says. “And so I’m not really that interested in turning the crank on more models.”

For Franke, it’s time for the next step: “We’re interested in seeing how you’d actually do this thing if you wanted to do it.”

What we don’t know

Solar geoengineering is often portrayed as a relatively cheap and easy fix for climate change. But as researchers take a harder look at the nuts and bolts, they’re finding considerable uncertainties, missing tools, and unbuilt infrastructure. None of that may be a showstopper, but we’ll need time and money to develop the components necessary to implement even the early stages of a solar geoengineering program. What this research is about, at its core, is not actually launching something, but figuring out what it would take to do so.
A young San Francisco nonprofit, Reflective, recently worked with scientists in the field to figure out just how much we still don’t know. The process began by outlining what the organization, which pools money from donors to fund geoengineering studies, describes as a “well-managed, moderate” scenario: In 2035, some nation or group of nations begins a small-scale geoengineering deployment, spraying an equal amount of sulfur dioxide or hydrogen


AI, Committee, News, Uncategorized

The Download: a reality check for geoengineering and the science of interoception

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

Hacking the atmosphere: geoengineering gets a reality check

Solar geoengineering, the controversial idea that we could deliberately intervene in the climate system to counteract global warming, is moving beyond computer simulations and into the practical engineering challenges required to make it real.

Researchers are now working on aircraft, materials, and other systems for solar geoengineering. But as they delve into these details, they’re finding that even early deployment would require significant new infrastructure, time, and investment.

Find out what happens when solar geoengineering encounters the realities of trying to cool the planet.

By James Temple

MIT Technology Review Narrated: inside interoception, the hidden sense of how you feel inside

Scientists have a word for how we sense ourselves from the inside: interoception. Today, thanks to a 2021 Nobel Prize and new tools that can map internal signaling across the body, research into interoception is taking off.

As researchers decode how signals move between body and brain, a clearer picture is starting to take shape, with implications for how we treat conditions from obesity to anxiety.

By Katherine W. Isaacs

This is our latest story to be turned into an MIT Technology Review Narrated podcast, which we publish each week on Spotify and Apple Podcasts. Just navigate to MIT Technology Review Narrated on either platform, and follow us to get all our new content as it’s released.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 SpaceX is now valued higher than Amazon
Its market value hit $2.659 trillion yesterday. (Axios)
+ A post-IPO stock surge also briefly pushed it above Microsoft’s. (Quartz)
+ It’s now the world’s fifth most valuable company. (Guardian)
+ SpaceX is acquiring AI coding startup Cursor for $60 billion. (CNBC)

2 G7 leaders want access to top US AI models
They’re pushing to escape restrictions on the likes of Fable 5. (Reuters $)
+ The Mythos shutdown has sparked a global scramble for sovereign AI. (Fortune)
+ The world is looking to ditch US AI models. (MIT Technology Review)

3 Trump’s AI export strategy has run into Trump’s export controls
His administration risks undermining its own AI plans. (Axios)
+ It now effectively has a licensing regime for frontier AI. (Fortune)
+ Here’s how a top Chinese AI model overcame US sanctions. (MIT Technology Review)

4 Huawei’s big comeback has exposed the limits of US chip controls
It’s overcome restrictions on advanced chipmaking gear. (Financial Times $)
+ The AI boom has ignited Asia’s chip companies. (NYT $)

5 AI fears are pushing Silicon Valley toward gene-editing startups
They want smarter babies to counter superintelligent AI. (Mother Jones)
+ The pursuit of perfect babies is an ethical mess. (MIT Technology Review)

6 A brain implant has enabled a speechless ALS patient to work full-time
The system translates his brain activity into speech. (The Register)
+ He’s become the first “power user” of a BCI. (MIT Technology Review)

7 A leak has revealed details of Peter Thiel’s secret society
Its program ranges from cult-building to prepping for World War III. (Wired $)

8 ChatGPT’s market share has slipped below 50% for the first time
Thanks to the rise of Gemini and Claude. (TechCrunch)

9 A quantum state that lasts forever may finally be within our grasp
Experiments suggest that quantum “eternity” is possible. (New Scientist $)

10 Commodore has made a digital detox phone that isn’t dumb
The Callback combines gadget nostalgia with modern needs. (The Verge)

Quote of the day

“The Entity List is like whack-a-mole and you’ve got to keep whacking the moles.”

Philip Luck, who studies global supply chains at the Center for Strategic and International Studies, tells Reuters that a lack of new blacklistings is likely leading American innovations to adversaries who could use them against the US.

One More Thing

This is the reason Demis Hassabis started DeepMind

Watching DeepMind’s AI master the ancient board game Go, Demis Hassabis realized that his company was ready to take on one of the most important and complicated puzzles in biology: predicting the structure of proteins.

The result was AlphaFold2, an AI that could predict the shape of proteins down to the nearest atom. “It’s the most complex thing we’ve ever done,” Hassabis told MIT Technology Review. Taking on scientific problems is the culmination of what Hassabis set out to achieve, and it’s what he wants to be known for.

“This is the reason I started DeepMind,” he says. “In fact, it’s why I’ve worked my whole career in AI.” Discover how he plans to transform science with AI.

By Will Douglas Heaven

We can still have nice things

A place for comfort, fun, and distraction to brighten up your day. (Got any ideas? Drop me a line.)

+ This mesmerising footage of wind rolling through grass looks like CGI.
+ The glorious early days of internet discovery have been revived by the return of StumbleUpon.
+ A German subway entrance has been delightfully designed as an old tram car crashing into the pavement.
+ The Last Museum lets you search across 5.8 million museum artworks spanning from 3000 BC to the present day.


AI, Committee, News, Uncategorized

Google Cloud Introduces Open Knowledge Format (OKF): A Vendor-Neutral Markdown Spec for Giving AI Agents Curated Context

Foundation models keep getting stronger, yet they still stall on the same thing: context. A model can write code or analyze a dataset, but only with the right internal knowledge. That knowledge includes table schemas, metric definitions, runbooks, and join paths, and it lives scattered across catalogs, wikis, and a few senior engineers’ heads.

Google Cloud introduced the Open Knowledge Format (OKF), an open specification that formalizes the LLM-wiki pattern into a portable, interoperable format. It is a vendor-neutral, agent- and human-friendly standard for the context modern AI systems need.

Open Knowledge Format (OKF)

OKF is a format, not a service or a platform. OKF v0.1 represents knowledge as a directory of markdown files with YAML frontmatter. A small set of agreed-upon conventions lets wikis written by one producer be consumed by a different agent without translation.

That is the whole idea. There is no compression scheme, no new runtime, and no required SDK. A bundle of OKF documents is just markdown, just files, and just YAML frontmatter. It renders on GitHub, ships as a tarball, and mounts on any filesystem.

If you have used Obsidian, Notion, or Hugo, the shape will feel familiar. OKF only formalizes the conventions needed to make those patterns interoperable.

The Fragmented Context Problem

In most organizations, model context is overwhelmingly internal knowledge. Today it sits in incompatible silos: metadata catalogs with their own APIs, wikis, shared drives, code comments, and docstrings.

Ask an agent ‘How do I compute weekly active users from our event stream?’ and it must assemble that answer from scattered, mutually incompatible surfaces. Every vendor offers its own catalog, SDK, and knowledge-graph schema. None of the knowledge is portable across products or organizations.

The result is duplicated effort. Every agent builder solves the same context-assembly problem from scratch. Every catalog vendor reinvents the same data models.
Andrej Karpathy articulated the underlying idea in his April 2026 LLM Wiki gist. His point: LLMs do not get bored, do not forget to update cross-references, and can edit many files in one pass. The bookkeeping that makes humans abandon personal wikis is exactly what LLMs handle well. The same pattern keeps reappearing under different names. Examples include Obsidian vaults wired to coding agents, the AGENTS.md and CLAUDE.md convention files, and ‘metadata as code’ repos. Each instance is bespoke, so none of them interoperate. OKF standardizes that interoperability layer so agents can do the heavy lifting.

How OKF Works: The Design in One Screen

An OKF bundle is a directory of markdown files representing concepts: tables, datasets, metrics, playbooks, runbooks, or APIs. Each concept is one file, and the file path is its identity.

```
sales/
├── index.md
├── datasets/
│   ├── index.md
│   └── orders_db.md
├── tables/
│   ├── index.md
│   ├── orders.md
│   └── customers.md
└── metrics/
    ├── index.md
    └── weekly_active_users.md
```

Each concept carries a small YAML front-matter block, then a markdown body for everything else.

```
---
type: BigQuery Table
title: Orders
description: One row per completed customer order.
resource: https://console.cloud.google.com/bigquery?p=acme&d=sales&t=orders
tags: [sales, revenue]
timestamp: 2026-05-28T14:30:00Z
---

# Schema

| Column        | Type   | Description                              |
|---------------|--------|------------------------------------------|
| `order_id`    | STRING | Globally unique order identifier.        |
| `customer_id` | STRING | FK to [customers](/tables/customers.md). |
```

The reserved structured fields are type, title, description, resource, tags, and timestamp. Concepts link to each other with normal markdown links. Those links turn the directory into a graph that is richer than file-system parent/child relationships. Bundles can optionally include index.md files for progressive disclosure and log.md files for change history.
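The producer side of the format can be sketched in a few lines. The helper names below (`render_concept`, `write_concept`) are hypothetical and not part of any OKF tooling. The sketch assumes front-matter values are plain strings or flat lists, so it emits the YAML by hand with the standard library only; a real producer would more likely use a YAML library. It enforces the one rule the article states as mandatory, that every concept has a `type`.

```python
from pathlib import Path
import tempfile

# Reserved front-matter fields named in the article; only `type` is required.
RESERVED_FIELDS = ("type", "title", "description", "resource", "tags", "timestamp")


def render_concept(meta: dict, body: str) -> str:
    """Render one concept as YAML front matter followed by a markdown body.

    Assumption: values are plain strings or flat lists of strings, so a
    hand-rolled emitter is enough for illustration.
    """
    if not meta.get("type"):
        raise ValueError("every OKF concept needs a 'type' field")
    lines = ["---"]
    # Emit reserved fields first, in a stable order, then any extra fields.
    ordered = [k for k in RESERVED_FIELDS if k in meta]
    ordered += [k for k in meta if k not in RESERVED_FIELDS]
    for key in ordered:
        value = meta[key]
        if isinstance(value, (list, tuple)):
            value = "[" + ", ".join(str(v) for v in value) + "]"
        lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body.rstrip() + "\n"


def write_concept(root: Path, rel_path: str, meta: dict, body: str) -> Path:
    """Write a concept file; its path inside the bundle is its identity."""
    target = root / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(render_concept(meta, body), encoding="utf-8")
    return target


if __name__ == "__main__":
    # Demo: write one concept into a throwaway bundle directory.
    bundle = Path(tempfile.mkdtemp()) / "sales"
    path = write_concept(
        bundle,
        "tables/orders.md",
        {"type": "BigQuery Table", "title": "Orders", "tags": ["sales", "revenue"]},
        "# Schema\n\nSee [customers](/tables/customers.md).",
    )
    print(path.read_text(encoding="utf-8"))
```

Because the output is ordinary markdown with front matter, the resulting file can be committed, diffed, and reviewed like any other source file.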
Three Principles Behind the Design

Minimally opinionated: OKF requires exactly one field on every concept: type. Everything else is left to the producer. The spec defines the interoperability surface, not the content model.
Producer/consumer independence: A human-written bundle can be read by an agent. A pipeline-generated bundle can be browsed in a visualizer. The format is the contract; tooling at each end is swappable.
Format, not platform: OKF is tied to no cloud, database, model provider, or agent framework. It will never require a proprietary account to read, write, or serve.

Use Cases, With Examples

Data team metadata-as-code: Export BigQuery table and metric definitions as a bundle. Commit it next to the SQL it describes, and review changes through pull requests.
Incident runbooks for agents: Store each runbook as a concept. An on-call agent reads index.md, follows cross-links, and resolves the join path it needs.
Cross-org knowledge exchange: A vendor ships a catalog export as OKF. Your agent consumes it directly, with no integration work.
Developer-team wiki: Replace a stale Notion or Obsidian space with versioned markdown that an agent keeps current.

How OKF Compares

| Approach | Storage | Schema required | Portable | SDK/registry | Agent-readable |
|---|---|---|---|---|---|
| OKF v0.1 | Markdown + YAML files | Only type | Yes | No | Yes, no translation |
| Notion | Proprietary DB | Per-workspace | Export-only | API needed | Via API |
| Obsidian vault | Markdown files | None enforced | Yes | No | Bespoke conventions |
| Metadata catalog | Vendor store | Vendor schema | Export-only | Vendor SDK | Vendor-specific |
| RAG index | Vector store | Embedding model | No | Yes | Chunks, not concepts |

The distinction from RAG is useful for developers. RAG re-derives knowledge at query time from raw chunks. An OKF bundle stores curated, cross-linked concepts that an agent reads and updates directly.

A Minimal OKF Consumer

OKF is parseable with standard tools. This reads a bundle and builds its link graph.
```python
import pathlib
import re

import yaml  # PyYAML


def load_bundle(root):
    """Return ({path: front matter}, [(source path, link target), ...])."""
    concepts, links = {}, []
    for path in pathlib.Path(root).rglob("*.md"):
        text = path.read_text(encoding="utf-8")
        meta = {}
        if text.startswith("---"):
            # Split off the front matter between the first two '---' markers.
            _, fm, body = text.split("---", 2)
            meta = yaml.safe_load(fm) or {}
        else:
            body = text
        concepts[str(path)] = meta  # type, title, tags, etc.
        # Markdown cross-links of the form [label](/path/to/concept.md)
        for target in set(re.findall(r"\]\((/[^)]+\.md)\)", body)):
            links.append((str(path), target))
    return concepts, links


concepts, graph = load_bundle("sales/")
```

No backend or install is needed to read or serve a bundle. The same files live in version control beside the code they describe.

Key Takeaways

Google’s Open Knowledge Format (OKF) v0.1 formalizes the LLM-wiki pattern into a portable, vendor-neutral spec. A bundle is



Want to get a data center online quickly? Give it some flex.

At the end of a tense and scoreless first half of a soccer match between the English men’s team and rival Germany, millions of Brits let out a collective sigh and did what they so often do in moments of stress: They made tea. That wave of electric kettles clicking on, however, caused a different kind of stress: a huge and sudden increase in demand for electricity. But National Grid, which operates the local transmission network, was ready. Just as those kettles started heating up, an AI program sent instructions to a data center in London to slow down some of the facility’s power-hungry chips. This reduction helped make sure there was enough supply to match demand, staving off potential blackouts or damage to electrical hardware. For data centers, which normally guzzle power without consideration for anyone or anything else’s needs, it was a radical departure. It was also a simulation. In December 2025, engineers sought to test a new breed of data center built to be flexible about its electricity needs, so they re-created the energy demand facing the UK’s grid during a match from the 2020 Euro tournament. They wanted to see how their software, called Conductor, would have responded had it been online at the time. Conductor is the signature product of Emerald AI, a firm based in Washington, DC, that’s part of a wave of companies trying to figure out whether data centers can work within the confines of the existing electric grid. This year, Emerald is set to deploy Conductor in a new facility in the part of Virginia known as Data Center Alley, this time connected to the live grid. When overall demand spikes, Conductor will turn down the power used by the data center, while making sure its servers still carry out their timeliest and most important jobs. 
Emerald’s partners on the project—which include Nvidia and the giant data-center operator Digital Realty—bill it as one of the world’s first “power-flexible AI factories.” Demonstrating that data centers can participate in this kind of give-and-take could ease what many tech leaders identify as the bottleneck in getting facilities online: It takes far longer to get approval for, construct, and connect new power plants than to build data centers. PJM, the grid operator in Virginia and the largest one in the US, for instance, needs eight years to bring new generation online, according to RMI, an energy research and advocacy group. “We need to solve the energy equation,” says Josh Parker, head of sustainability at Nvidia. “AI factory flexibility is the bridge between the incredible demand for AI and the immediate limitations of our energy grid.” Speed, though, is only one of the issues. Once facilities do plug in, neighbors often criticize them for drawing too much electricity and contributing to rising prices. They say the data centers generate more noise than they do long-term jobs, contribute to pollution, and threaten to put people out of work. Organizers stalled over $150 billion worth of projects in 2025, according to Data Center Watch, and policymakers alert to the public mood are starting to impose limitations on development. More than a dozen states are considering bans, and local moratoriums are in effect in places like Minneapolis and DeKalb County in Georgia. At the federal level, the GRID Act, a bipartisan bill in the US Senate, proposes to sever new data centers from public grids entirely. Some operators are already moving that way by trying to develop their own power generation. Rather than rushing to build new power plants, companies could find part of the solution to the crunch right under our noses—or, more precisely, in the transmission lines under our feet and above our heads. 
The existing system operates near its full capacity during only a small number of high-demand hours throughout the year. This means, some grid experts argue, that if data centers can limit the power they draw during those stretches, they won’t need to wait for big infrastructure upgrades or build their own off-grid generation.  Indeed, a growing number of studies have shown there could be plenty of power available for data centers that can flex. A widely discussed 2025 report from researchers at Duke University found that the US grid could offer an additional 76 gigawatts—about 5% of its entire capacity, and about enough to accommodate projected data-center growth in the US through 2030—to facilities that are willing to reduce their usage just 0.25% of the time. That’s about 22 hours a year. And when researchers from Princeton University and two grid-modernization companies looked at locations for new data centers in the PJM region, their report, which was funded by Google, found that a 500-megawatt facility capable of flexing for less than 1% of the year could reach full operation three to five years faster than one that’s inflexible.  Flexible power connections could also help data centers address some of their PR problems. By decreasing their draw at times of grid stress, for instance, they could avoid diverting power from where it’s most needed, thus boosting stability. By using existing capacity, they might be able to reduce the need for new fossil-fuel power plants and spread fixed costs over more electricity users, pushing prices down.  The AI power pinch is attracting resources and research into strategies for grid flexibility overall, which could help negotiate a tricky period: Taken together with electric vehicles, air-conditioning, and other sectors, data centers are helping drive what analysts predict will be a 25% increase in US electricity demand by 2030 compared with 2023 levels. 
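The "hours a year" figures quoted from these reports follow from simple arithmetic. As a quick check, assuming a non-leap year of 8,760 hours (the function name here is purely illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def curtailed_hours(fraction_of_year: float) -> float:
    """Hours per year a facility would spend flexing its power draw."""
    return fraction_of_year * HOURS_PER_YEAR


# Duke scenario: flexing 0.25% of the time is roughly 22 hours a year.
print(round(curtailed_hours(0.0025), 1))
# Princeton scenario: "less than 1% of the year" is an upper bound of about 88 hours.
print(round(curtailed_hours(0.01), 1))
```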
Ideally, flexibility gives grid operators more control over the flow of electrons, making them leaders of a harmonious ensemble rather than hostages to inflexible electricity requirements. That will help them manage demand spikes across the entire system and deal more effectively with the intermittent nature of renewables like wind and solar. “Demand flexibility is incredibly useful for power grids,” says Johanna Mathieu, a grid expert at the University of Michigan. “It helps reduce electricity



Building an End-to-End Sentiment Analysis Pipeline with Scikit-LLM

Traditional machine learning pipelines for predictive tasks like text classification usually rely on extracting structured, numerical features from raw text — for instance, TF-IDF frequencies or token embeddings — to feed into classical models such as logistic regression, ensembles, or support vector machines.



The Download: the first brain implant power user and South Korea’s AI obsession

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

This man with ALS is the first “power user” of a brain implant that lets him speak

Casey Harrell has had a set of electrodes embedded in his brain for almost three years. Harrell, who has ALS and is paralyzed, first used his brain-computer interface (BCI) to “speak” in 2023. Since then, he’s clocked thousands of hours of use.

Harrell can now use the device largely independently. His team has added new features to it, and he also uses it to surf the web and perform his job. “Living with a disease like ALS, you are supposed to have diminished dreams. I do not,” Harrell told MIT Technology Review.

The team behind the device call Harrell “the first power user of a speech BCI.” They now plan to add further enhancements to the device. Dive into the groundbreaking impact of Casey Harrell’s BCI.

By Jessica Hamzelou

Why do South Koreans love AI so much?

While a public backlash against AI brews across the US, South Koreans are optimistic. Only 16% say they are more concerned than excited about AI, the lowest of the 25 countries surveyed by the Pew Research Center, while 50% of Americans were more worried than excited.

South Koreans share a deep conviction that embracing technology is integral to modernizing the country and cementing its place in the global order. Their fascination with AI is just the latest incarnation of that ethos, and it’s making them anxious to stay ahead. Read the full story on South Korea’s AI fervour.

By Michelle Kim

This story is from The Algorithm, our weekly newsletter giving you the inside track on all things AI. Sign up to receive it in your inbox every Monday.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.
1 The US says it restricted Anthropic AI over foreign intelligence risks
Commerce chief Lutnick said he acted over national security fears. (Reuters $)
+ Following the ban, Anthropic disabled access to its new models. (BBC)
+ Both sides are increasingly desperate for a resolution. (WSJ $)

2 DeepSeek just became China’s most valuable startup
It raised $7 billion, the largest-ever first-round funding for an AI startup. (The Information $)
+ The deal values DeepSeek at over $50 billion. (WSJ $)
+ Its unusual structure preserves founder control. (Reuters $)
+ DeepSeek’s new flagship model has caused a stir. (MIT Technology Review)

3 Alibaba has unveiled AI models for robots amid a shift from chatbots
It’s joined a global race to move AI into the physical world. (SCMP)
+ AI is learning to understand its surroundings. (MIT Technology Review)

4 Fox is buying streaming giant Roku for $22 billion
The deal creates the third-largest player in US TV by viewing share. (BBC)
+ Fox is making a big bet on free streaming. (Washington Post $)

5 EA has launched a new way to advertise “directly into gameplay”
EA Advertising allows brands to become part of the game itself. (CNBC)
+ Xbox’s new chief strategy officer is also eyeing in-game ads. (PC Gamer)
+ GenAI could reinvent what it means to play. (MIT Technology Review)

6 It’s trivially easy to use Reddit to manipulate AI search
A tiny snippet of text can trick ChatGPT and Google’s AI search. (404 Media)
+ AI search is being manipulated to generate dangerous biases. (BBC)

7 Sperm have been made magnetic to allow IVF inside the body
The technique enables remote guidance towards an egg. (New Scientist $)
+ Automation and AI are transforming IVF. (MIT Technology Review)

8 The world’s leading deepfake expert no longer trusts his own eyes
He’s struggling to prove what’s real before the internet decides. (NYT $)

9 Meta’s CTO admits its AI reorganisation was “atrocious”
He’s promised staff better communication, and snacks. (Wired $)

10 Silicon Valley billionaires are pretending to kill each other for fun
In a new game show from Peter Thiel’s Founders Fund. (WSJ $)

Quote of the day

“There was a speeding ticket, and they gave Fable the death penalty.”

Alex Stamos, the former chief security officer of Facebook, tells the Washington Post that banning foreign access to Anthropic’s leading model is a disproportionate punishment.

One More Thing

VICTOR KERLOW

Inside effective altruism, where the far future counts a lot more than the present

Since its birth in the late 2000s, effective altruism has aimed to answer a deceptively simple question: “How can those with means have the greatest impact?” Directing money to evidence-based approaches is EA’s best-known technique. But as it’s expanded from an academic philosophy into a community and a movement, its ideas of the “best” way to change the world have evolved as well.

Find out how effective altruism became one of the most influential, and contested, forces in philanthropy.

By Rebecca Ackermann

We can still have nice things

A place for comfort, fun, and distraction to brighten up your day. (Got any ideas? Drop me a line.)
+ The humble table has been reimagined as an unconventional public artifact.
+ Take a visual tour of the weird, centuries-old history of architecture’s most gruesome gargoyles.
+ A colorful parakeet unseen for an entire century was triumphantly rediscovered in an unexplored Indonesian forest.
+ This shimmering Southern Lights timelapse filmed by an astronaut on the SpaceX Dragon is stunning.



Meet Qwen-RobotSuite: Three Embodied AI Models for VLA Manipulation, Video World Modeling, and Navigation

The Qwen team has released three embodied AI models, grouped as Qwen-Robot-Suite. The three are Qwen-RobotManip, Qwen-RobotWorld, and Qwen-RobotNav. Each is built on a Qwen vision-language backbone and targets a different robotics problem. Qwen-RobotManip is a Vision-Language-Action model for manipulation, built on Qwen3.5-4B. Qwen-RobotWorld is a language-conditioned video world model with a 60-layer MMDiT and a frozen Qwen2.5-VL encoder. Qwen-RobotNav is a navigation model built on Qwen3-VL, available at 2B, 4B, and 8B sizes.

Qwen-Robot-Suite

Qwen-Robot-Suite is not a single model. It is a suite of three independent foundation models. Two of them, RobotManip and RobotNav, ship with public GitHub repositories. Robotics data is fragmented across hardware and tasks. Different robots use incompatible observation and action formats. A policy trained on one arm rarely transfers to another. The three research reports address this fragmentation in different ways. RobotManip aligns action representations so manipulation data scales. RobotWorld uses language as a unified action interface for video prediction. RobotNav exposes a controllable observation interface for navigation tasks. Here is the core split between the three releases:

| Model | Problem | Backbone | Output |
|---|---|---|---|
| Qwen-RobotManip | Robotic manipulation | Qwen3.5-4B (Qwen-VL) | Continuous robot actions |
| Qwen-RobotWorld | Embodied world modeling | Frozen Qwen2.5-VL | Predicted future video |
| Qwen-RobotNav | Mobile navigation | Qwen3-VL (2B/4B/8B) | Waypoint trajectories |

Qwen-RobotManip: Alignment Unlocks Scale for Manipulation

Qwen-RobotManip is a Vision-Language-Action (VLA) foundation model. It is built on Qwen-VL and predicts continuous robot actions. A VLA model takes camera views and a language instruction. It then outputs low-level robot actions. The challenge is that manipulation data is heterogeneous by nature. Different robots record states and actions in incompatible formats.
When demonstrations arrive with mismatched representations, scaling data produces interference. RobotManip solves this with a unified alignment framework.

The Unified Alignment Framework

The framework has three complementary mechanisms.

First is a canonical state-action representation. It is an 80-dimensional vector with per-dimension binary masking. This vector holds two 29-dimensional per-arm blocks plus 22 reserved dimensions. Each block stores joint positions, end-effector pose, gripper state, and dexterous hand joints. Robots populate only the dimensions they have.

Second is a camera-frame delta pose parameterization. End-effector actions are expressed as deltas in the camera frame. This makes visually similar motions numerically proximate across embodiments.

Third is an in-context policy adaptation mechanism. It reads recent execution history as an implicit embodiment identifier. The policy adjusts behavior at deployment time without parameter updates.

A dual-stream co-training strategy runs alongside this. It jointly optimizes manipulation data and a vision-language stream. This prevents the backbone’s perception and reasoning from eroding.

The Data Engine

RobotManip assembles roughly 38,100 hours of manipulation data. It uses only open-source datasets and human videos. No proprietary data collection was used. A human-to-robot synthesis pipeline produces most of this scale. It converts egocentric hand demonstrations into robot trajectories. The pipeline renders across 15 robot platforms. This synthesis alone yields about 24,808 hours of demonstrations. The egocentric source data is about 1,933 hours. Open-source robot datasets contribute over 11,000 hours.

The pipeline separates action alignment from visual alignment. Action alignment retargets hand keypoints to gripper poses. Visual alignment uses SAM3 masking, ProPainter inpainting, and MuJoCo inverse kinematics. A five-stage curation pipeline then filters the combined corpus.
It catches sudden changes, temporal misalignment, and extreme values. One check found 81% of episodes in a subset failed state-action alignment.

Benchmark Results

The research report argues standard benchmarks fail to measure generalization. Models without robot pretraining match pretrained ones on in-distribution tests. RobotManip therefore focuses on out-of-distribution (OOD) settings.

| Benchmark (OOD) | Prev. SOTA (π0.5) | Qwen-RobotManip |
|---|---|---|
| LIBERO-Plus | 84.4 | 91.4 |
| RoboTwin-C2R Hard | 47.9 | 69.4 |
| EBench | 27.1 | 45.6 |
| RoboCasa365 | 16.9 | 35.9 |
| RoboTwin-IF | 49.6 | 72.2 |

The largest reported gap is on cross-embodiment transfer. RobotManip reaches 23.9% using camera-frame EEF actions. That is 3.2× the 7.5% achieved by π0.5. The model also ranks 1st on the RoboChallenge Table30-v1 generalist track. It scores a 20% relative improvement over the prior best. Real-robot validation covers AgileX ALOHA, Franka, UR, and ARX platforms.

Qwen-RobotWorld: Language as a Universal Action Interface

Qwen-RobotWorld is a language-conditioned video world model. It predicts future visual trajectories from a current observation. Natural language serves as the unified action interface. A world model learns environment dynamics. Given a current state and an action, it predicts the next state. RobotWorld represents states as video frames and actions as text. This matters because language is embodiment-agnostic. One instruction encodes the action sequence, goal, and constraints. It works across a Franka gripper, an Aloha dual-arm system, or a humanoid.

The Double-Stream MMDiT Architecture

The model uses a 60-layer double-stream Multimodal Diffusion Transformer. An understanding stream processes a frozen Qwen2.5-VL encoder’s features. A generation stream processes video-VAE latents. The two streams interact via joint attention at every layer. Using an MLLM as the action encoder gives two advantages. It parses compositional instructions and constrains physically plausible transitions.
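To make "two streams interact via joint attention" concrete, here is a minimal single-head sketch in NumPy of the general double-stream pattern: each stream keeps its own projection weights, but attention runs over the concatenated token sequence. This is a generic illustration under stated assumptions, not RobotWorld's implementation. The dimensions, random weights, and the names `joint_attention`, `w_und`, and `w_gen` are all hypothetical, and details such as multi-head attention, normalization, residual connections, and diffusion timestep conditioning are omitted.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def joint_attention(und: np.ndarray, gen: np.ndarray, w_und: dict, w_gen: dict):
    """One single-head joint-attention step over two token streams.

    und: (n_und, d) understanding-stream tokens (e.g. encoder features)
    gen: (n_gen, d) generation-stream tokens (e.g. video latents)
    Each stream has its own Q/K/V projections, but attention is computed
    over the concatenated sequence, so every token attends to both streams.
    """
    q = np.concatenate([und @ w_und["q"], gen @ w_gen["q"]], axis=0)
    k = np.concatenate([und @ w_und["k"], gen @ w_gen["k"]], axis=0)
    v = np.concatenate([und @ w_und["v"], gen @ w_gen["v"]], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mixed = softmax(scores) @ v
    n_und = und.shape[0]
    # Split the joint sequence back into the two streams.
    return mixed[:n_und], mixed[n_und:]


# Toy sizes chosen only for illustration: 3 text tokens, 5 video tokens, width 8.
rng = np.random.default_rng(0)
d = 8
w_und = {name: rng.normal(size=(d, d)) for name in "qkv"}
w_gen = {name: rng.normal(size=(d, d)) for name in "qkv"}
und_out, gen_out = joint_attention(
    rng.normal(size=(3, d)), rng.normal(size=(5, d)), w_und, w_gen
)
print(und_out.shape, gen_out.shape)  # (3, 8) (5, 8)
```

Keeping separate weights per stream lets each modality retain its own representation, while the shared attention matrix is the only place where instruction tokens and video tokens exchange information.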
The MMDiT has 20B parameters. The VAE adopts the Wan-VAE architecture. The context length supports up to 48,360 video tokens. A Scene2Robot mechanism reuses this backbone for cross-embodiment synthesis. It processes scene, robot reference, and generation segments together. This enables human-to-robot video transfer without robot-specific prompting.

The Embodied World Knowledge Dataset

Training uses the Embodied World Knowledge (EWK) dataset. It contains roughly 8.6M video-text pairs. That spans over 200M observation frames. The corpus covers four embodied domains plus general video. Manipulation provides about 5.9M samples across 20+ morphologies. Driving, navigation, and human-to-robot transfer fill out the rest. An action-language mapping framework standardizes everything. It converts 20+ embodiment types and 500+ action categories into language. A hierarchical five-layer annotation pipeline produces the captions.

Benchmark Results

RobotWorld was evaluated on four established benchmarks. It ranks 1st overall on two of them:

| Benchmark | Result | Ranking |
|---|---|---|
| EWMBench | 4.60 | 1st overall |
| DreamGen Bench | 4.952 | 1st overall |
| WorldModelBench | 8.99 | 1st open-source (3rd overall) |
| PBench | 0.804 | 1st open-source |

On EWMBench it leads motion fidelity with an HSD of 0.566. That is a 33% gain over the runner-up. Scene consistency reaches 0.914. On WorldModelBench it scores 1.00 on four physics-adherence categories. These are Newton’s laws, mass conservation, fluid dynamics, and gravity. Penetration scores 0.94, and

